Prior to this workshop, you should complete setup and installation.
In this workshop, we introduce you to R and RStudio at the beginner level. In it, we cover:
During the workshop, we will build an R script together. This will be
posted with today’s data under live_notes/
after the
workshop at https://github.com/BIGslu/workshops/tree/main/introR.workshop/live_notes/
We will do all of our work in RStudio. RStudio is an integrated development and analysis environment for R that brings a number of conveniences over using R in a terminal or other editing environments.
When you start RStudio, you will see something like the following window appear:
Notice that the window is divided into three “panes”:
Console (the entire left side): This is your view into the R engine. You can type in R commands here and see the output printed by R. (To make it easier to tell them apart, your input and the resulting output are printed in different colors.) There are several editing conveniences available: use up and down arrow keys to go back to previously entered commands, which can then be edited and re-run; TAB for completing the name before the cursor; see more in online docs.
Environment/History (tabbed in upper right): View current user-defined objects and previously-entered commands, respectively. The instructor may have additional tabs like Connection, Git, etc. These are not relevant to this workshop series.
Files/Plots/Packages/Help (tabbed in lower right): As their names suggest, these are used to view the contents of the current directory, graphics created by the user, install packages, and view the built-in help pages.
To change the look of RStudio, you can go to Tools -> Global Options -> Appearance and select colors, font size, etc. If you plan to be working for longer periods, we suggest choosing a dark background color scheme to save your computer battery and your eyes.
Projects are a great feature of RStudio. When you create a project,
RStudio creates an .Rproj
file that links all of your files
and outputs to the project directory. When you import data, R
automatically looks for the file in the project directory instead of you
having to specify a full file path on your computer like
/Users/username/Desktop/
. R also automatically saves any
output to the project directory. Finally, projects allow you to save
your R environment in .RData
so that when you close RStudio
and then re-open it, you can start right where you left off without
re-importing any data or re-calculating any intermediate steps.
RStudio has a simple interface to create and switch between projects, accessed from the button in the top-right corner of the RStudio window. (Labeled “Project: (None)”, initially.)
Let’s create a project to work in for this workshop. Start by clicking the “Project” button in the upper right or going to the “File” menu. Select “New Project” and the following will appear.
You can either create a project in an existing directory or make a new directory on your computer - just be sure you know where it is.
After your project is created, navigate to its directory using your
Finder/File explorer. You will see the .RProj
file has been
created.
To access this project in the future, simply double-click the
.RProj
file and RStudio will open the project or choose
File > Open Project from within an already open RStudio window.
R script files are the primary way in which R facilitates reproducible research. They contain the code that loads your raw data, cleans it, performs the analyses, and creates and saves visualizations. R scripts maintain a record of everything that is done to the raw data to reach the final result. That way, it is very easy to write up and communicate your methods because you have a document listing the precise steps you used to conduct your analyses. This is one of R’s primary advantages compared to traditional tools like Excel, where it may be unclear how to reproduce the results.
Generally, if you are testing an operation (e.g. what would my data look like if I applied a log-transformation to it?), you should do it in the console (left pane of RStudio). If you are committing a step to your analysis (e.g. I want to apply a log-transformation to my data and then conduct the rest of my analyses on the log-transformed data), you should add it to your R script so that it is saved for future use.
Additionally, you should annotate your R scripts with comments. In
each line of code, any text preceded by the #
symbol will
not execute. Comments can be useful to remind yourself and to tell other
readers what a specific chunk of code does. In my experience, there can
never be too much commenting.
Let’s create an R script (File > New File > R Script) and save
it as introR_live_notes.R
in your main project directory.
If you again look to the project directory on your computer, you will
see introR_live_notes.R
is now saved there.
We will work together to create and populate the
introR_live_notes.R
script throughout this workshop.
R packages are units of shareable code, containing functions that
facilitate and enhance analyses. Packages are typically installed from
CRAN (The Comprehensive R
Archive Network), which is a database containing R itself as well as
many R packages. Any package can be installed from CRAN using the
install.packages
function. You should input installation
commands into your console (as opposed to live_notes.R
)
since once a package is installed on your computer, you won’t need to
re-install it again.
You should have installed the following packages prior to the
workshop. However, if you did not, here is the code again.
tidyverse
is a meta-package containing several packages
useful in data manipulation and plotting.
You DO NOT need to run this again if you did so as part of the setup instructions.
install.packages("tidyverse")
After installing a package, and every time you open a new
RStudio session, the packages you want to use need to be loaded into the
R workspace with the library
function. This tells R to
access the package’s functions and prevents RStudio from lags that would
occur if it automatically loaded every downloaded package every time you
opened it. To put this in perspective, I had 464 packages installed at
the time this document was made.
# Data manipulation and visualization
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Because tidyverse
is a meta-package, it automatically
tells you want packages it is loading and their versions. In addition,
the Conflicts section let’s you know functions in the
tidyverse
that alter exist in your R session. Because you
chose to load the package, calling the function filter
will
use the tidyverse
function not the stats
function (which comes with base R). If you for some reason needed the
stats
version, you can specify it with
package::
like stats::filter
.
Bioconductor is another repository of R packages. It has different requirements for upload and houses many biology-relevant packages. To install from Bioconductor, you first install its installer from CRAN.
install.packages("BiocManager")
Then install your package of choice using the
BiocManager
installer. We install limma
, a
package for analysis of microarray and RNA-seq data. Again, you did this
in the setup instructions but the code is here again for reference.
If prompted, say a
to “Update all/some/none?
[a/s/n]” and no
to “Do you want to install from sources the
packages which need compilation? (Yes/no/cancel)”
BiocManager::install("limma")
Before doing anything in R, it is a good idea to set your random seed. Your analyses may not end up using a seed but by setting it, you ensure that everything is exactly reproducible.
The most common seeds are 0, 1, 123, and 42. You can use any number and change it up between scripts if you’d like.
set.seed(4389)
Move the data/
directory you downloaded inside
your project directory. You can do this in your regular point-and-click
file explorer.
.csv
, .tsv
, etc.One of R’s most essential data structures is the data frame, which is
simply a table of m
columns by n
rows. First,
we will read in the RNA-seq metadata into RStudio using the base R
read.table
function.
Each R function uses the following basic syntax, where
Function
is the name of the function.
Function(argument1=..., argument2=..., ...)
read.table
has many arguments; however, we only need to
specify 3 arguments to correctly read in our data as a data frame. For
our data, we will need to specify:
file
- gives the path to the file that we want to load
from our working directory (current project directory).sep
- tells R that our data are comma-separatedheader
- tells R that the first row in our data
contains the names of the variables (columns).We will store the data as an object named
meta_csv
using the assignment operator <-
,
so that we can re-use it in our analysis.
# read the data and save it as an object
meta_csv <- read.table(file="data/meta.csv", sep=",", header=TRUE)
Now whenever we want to use these data, we simply call
meta_csv
A little background on these data.
Mycobacterium tuberculosis (Mtb) is the causative agent of tuberculosis (TB). TB is among the top infectious killers worldwide… and had been for centuries. Mtb predominantly infects lung macrophages (a type of immune cell in the class monocyte), and exposure to Mtb results in infection that is cleared, contained, or progresses to disease. Mechanisms that distinguish these outcomes are largely unknown.
The data used in this workshop are from a RNA-seq experiment exploring the impacts of Mtb infection on human monocytes from individuals with latent TB infection (LTBI) or with resistance to Mtb infection (RSTR). Here, we have a subset of 10 individuals including:
libID
: Unique library ID for each sequencing library.
Formatted as ptID_conditionptID
: Patient ID. Each patient has 2 libraries which
differ in their condition
condition
: Experimental condition. Media for media
control. Mtb of M. tuberculosis infected.age_dys
: Age in dayssex
: Biological sex, M or FptID_old
: Old patient ID with leading zeroesRNAseq
: If the library has pass-filter RNA-seq
datamethylation
: If the library has pass-filter methylation
datatotal_seq
: Total number of pass-filter sequencesMore information can be found in
Simmons JD, Dill-McFarland KA, et al. 2022. Monocyte transcriptional responses to Mycobacterium tuberculosis associate with resistance to tuberculin skin test and interferon gamma release assay conversion. mSphere 7(3): e0015922. doi: 10.1128/msphere.00159-22 — GitHub
Now back to R. You can read up about the different arguments of a
specific function by typing ?Function
or
help(Function)
in your R console.
?read.table
You will notice that there are multiple functions of the
read.table
help page. This include similar and related
functions with additional options. For example, since our data are in
.csv
format, we could’ve instead read them into R with
read.csv
which assumes the options
sep=",", header=TRUE
by default.
# read the data with different function
meta_csv <- read.csv(file="data/meta.csv")
.RData
You may have data that do not fit nicely into a single table or into
a table at all (like plots). You can save these as .RData
,
which can be loaded directly into R. You can also save multiple tables
and/or other objects in a single .RData
file to make
loading your data quick and easy. Moreover, .RData
are
automatically compressed so they take up less storage space than
multiple tables.
load("data/meta.RData")
Notice that these data appear already named in your R environment as
meta
. Object names are determined when saving so be sure to
create short but descriptive names before saving to
.RData
.
Note that meta
and meta_csv
are the same
data, just loaded from different files. Because the name is slightly
shorter, we will explore meta
for most of this section.
This data frame consists of 20 rows (observations) and 9 columns
(variables). You can see this quickly using the dimension function
dim
dim(meta)
## [1] 20 9
Each column and each row of a data frame are individual R vectors. R
vectors are one-dimensional arrays of data. For example, we can extract
column vectors from data frames using the $
operator.
# Extract patient IDs
meta$ptID
## [1] "pt01" "pt01" "pt02" "pt02" "pt03" "pt03" "pt04" "pt04" "pt05" "pt05"
## [11] "pt06" "pt06" "pt07" "pt07" "pt08" "pt08" "pt09" "pt09" "pt10" "pt10"
R objects have several different classes (types). Our data frame
contains 5 R data types. The base R class
function will
tell you what data type an object is.
We see that our ptID
column is character
,
meaning it is non-numeric letters and characters. condition
is a factor
, meaning it is a character variable with
discrete levels. A character
variable can be anything while
a factor
can only be one of a set of pre-defined
levels.
class(meta)
## [1] "data.frame"
class(meta$ptID)
## [1] "character"
class(meta$condition)
## [1] "factor"
One advantage of .RData
is that is saves formatting not
found in a .csv
. Here, we see that the
condition
variable in meta_csv
contains the
same data as meta
but meta
from the
.RData
has factor
formatting. Thus,
.RData
are a good way to save your data when you have
factors or other within R formatting.
meta$condition
## [1] Media Mtb Media Mtb Media Mtb Media Mtb Media Mtb Media Mtb
## [13] Media Mtb Media Mtb Media Mtb Media Mtb
## Levels: Media Mtb
class(meta$condition)
## [1] "factor"
meta_csv$condition
## [1] "Media" "Mtb" "Media" "Mtb" "Media" "Mtb" "Media" "Mtb" "Media"
## [10] "Mtb" "Media" "Mtb" "Media" "Mtb" "Media" "Mtb" "Media" "Mtb"
## [19] "Media" "Mtb"
class(meta_csv$condition)
## [1] "character"
Continuing on, age_dys
is numeric
, meaning
a number, and total_seq
is integer
, meaning a
whole number with no decimals.
class(meta$age_dys)
## [1] "integer"
class(meta$total_seq)
## [1] "numeric"
Finally, RNAseq
is logical
meaning
TRUE/FALSE - which need to be all caps for R to recognize them as
such.
class(meta$RNAseq)
## [1] "logical"
You can see the entire structure of the data with
str
.
str(meta)
## 'data.frame': 20 obs. of 9 variables:
## $ libID : chr "pt01_Media" "pt01_Mtb" "pt02_Media" "pt02_Mtb" ...
## $ ptID : chr "pt01" "pt01" "pt02" "pt02" ...
## $ condition : Factor w/ 2 levels "Media","Mtb": 1 2 1 2 1 2 1 2 1 2 ...
## $ age_dys : int 12410 12410 12775 12775 11315 11315 8395 8395 7300 7300 ...
## $ sex : chr "M" "M" "M" "M" ...
## $ ptID_old : chr "pt00001" "pt00001" "pt00002" "pt00002" ...
## $ RNAseq : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ methylation: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ total_seq : num 9114402 8918699 9221555 7733260 6231728 ...
S3
, S4
)We will not use this data type today. However, you may encounter it in our other workshops or in your own work.
S3
and S4
formats mean the data have > 2
dimensions. S3
data are a list of multiple data frames,
vectors, plots, etc. S4
data can be lists or other
specialty data types specific to an R package.
For example, RNA-seq data can be contained in an S3
object from the limma
package. This type of data is called
an EList
for “expression list” and contains data frames
with gene metadata (genes
), sample metadata
(targets
), normalized expression values (E
),
and others.
Similar to our meta
data, you access pieces of an
S3
object with $
only now, you can have
multiple levels to go from the list object data
to each
data frame like targets
to each column in that data frame
like ptID
. You should not run this code as you have not
loaded the dat
object in this workshop.
dat$genes
dat$targets$ptID
Working with S4
objects is very similar to
S3
except that they are accessed with @
instead of $
. Again, we will not use these data types here,
and you can explore them further in more advanced workshops.
A large proportion of R functions operate on vectors to perform quick computations over their values. Here are some examples:
# Compute the variance of total number of sequences
var(meta$total_seq)
## [1] 7.843553e+12
# Find whether any samples have greater than 10 million sequences
meta$total_seq > 10E6
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE
## [13] TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
# Find the unique values of metadata's condition
unique(meta$sex)
## [1] "M" "F"
unique(meta$condition)
## [1] Media Mtb
## Levels: Media Mtb
Note the difference between a character (sex
) and a
factor here (condition
).
Functions executed on an object in R may respond exclusively to one or more data types or may respond differently depending on the data type. When you use the incorrect data type, you will get an error or warning message. For example, you cannot take the mean of a factor or character.
# Compute the mean of libID
mean(meta$libID)
## Warning in mean.default(meta$libID): argument is not numeric or logical:
## returning NA
## [1] NA
Since vectors are one dimensional arrays of a defined length, their
individual values can be retrieved using vector indices. R uses 1-based
indexing, meaning the first value in an R vector corresponds to the
index 1. (Importantly, if you use python, that language is 0-based,
meaning the first value is index 0.) Each subsequent element increases
the index by 1. For example, we can extract the value of the 5th element
of the libID
vector using the square bracket operator
[ ]
like so.
meta$libID[5]
## [1] "pt03_Media"
In contrast, data frames are two dimensional arrays so indexing is
done across both dimensions as [rows, columns]
. So, we can
extract the same library ID value directly from the data frame knowing
it is in the 5th row and 1st column.
meta[5, 1]
## [1] "pt03_Media"
The square bracket operator is often used with logical vectors
(TRUE/FALSE) to subset data. For example, we can subset our metadata to
all Media
observations (rows).
# Create logical vector for which condition is Media
logical.vector <- meta$condition == "Media"
#View vector
logical.vector
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [13] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
#Apply vector to data frame to select only observations where the logical vector is TRUE
meta[logical.vector, ]
## libID ptID condition age_dys sex ptID_old RNAseq methylation total_seq
## 1 pt01_Media pt01 Media 12410 M pt00001 TRUE FALSE 9114402
## 3 pt02_Media pt02 Media 12775 M pt00002 TRUE FALSE 9221555
## 5 pt03_Media pt03 Media 11315 M pt00003 TRUE FALSE 6231728
## 7 pt04_Media pt04 Media 8395 M pt00004 TRUE TRUE 10205557
## 9 pt05_Media pt05 Media 7300 M pt00005 TRUE FALSE 15536685
## 11 pt06_Media pt06 Media 6570 F pt00006 TRUE FALSE 7085995
## 13 pt07_Media pt07 Media 7665 F pt00007 TRUE FALSE 10706098
## 15 pt08_Media pt08 Media 8760 M pt00008 TRUE FALSE 9957906
## 17 pt09_Media pt09 Media 6935 M pt00009 TRUE FALSE 13055276
## 19 pt10_Media pt10 Media 8030 F pt00010 TRUE FALSE 8216706
Subsetting is extremely useful when working with large data. Please find a reference of common statements used in subsetting below.
Statement | Meaning |
---|---|
<- |
Assign to object in environment |
== |
Equal to |
!= |
Not equal to |
> |
Greater than |
>= |
Greater than or equal to |
< |
Less than |
<= |
Less than or equal to |
%in% |
In or within |
is.na() |
Is missing, e.g NA |
!is.na() |
Is not missing |
& |
And |
| |
Or |
One of the most important (if not THE most important) steps in data cleaning and analysis is plotting your data. As they say, a picture is worth a 1000 words.
Here, we use base R’s plotting function hist
to make a
histogram of the total number of sequences in our data. We can quickly
determine a number of things like 1) there are no extreme outliers and
2) the data are not normally distributed. This can then inform our
downstream analysis such as linear modeling, etc.
hist(meta$total_seq)
We encourage you to learn to make more beautiful plots in the
tidyverse
in our other workshops!
Using the meta
data frame:
ptID_old
variable is.[]
:
libID
value occurs in the 9th rowlibID
for a library where
age_dys
equals 7300 and condition
equals
MediaptID
equals
“pt05” or “pt08”. Hint: Use a logical vector with the
conditional statements above.total_seq
by condition
groups (i.e. Media vs Mtb). Hint: Checkout the
plot
function. Your resulting plot should look like
below.plot
and customize your
plot. You can change the axes labels, add a title, add color, and much
more!