In this workshop, we introduce you to R and RStudio at the beginner level. In it, we cover:
We will do all of our work in RStudio. RStudio is an integrated development and analysis environment for R that brings a number of conveniences over using R in a terminal or other editing environments.
During the workshop, we will build an R script together, which will be posted as ‘live_notes’ after the workshop at https://github.com/BIGslu/workshops/tree/main/2022.08.15_R.tidyverse.workshop/live_notes
When you start RStudio, you will see something like the following window appear:
Notice that the window is divided into three “panes”:
Console (the entire left side): This is your view into the R engine. You can type in R commands here and see the output printed by R. (To make it easier to tell them apart, your input and the resulting output are printed in different colors.) There are several editing conveniences available: use up and down arrow keys to go back to previously entered commands, which can then be edited and re-run; TAB for completing the name before the cursor; see more in online docs.
Environment/History (tabbed in upper right): View current user-defined objects and previously-entered commands, respectively. The instructor may have additional tabs like Connection, Git, etc. These are not relevant to this workshop series.
Files/Plots/Packages/Help (tabbed in lower right): As their names suggest, these are used to view the contents of the current directory, graphics created by the user, install packages, and view the built-in help pages.
To change the look of RStudio, you can go to Tools -> Global Options -> Appearance and select colors, font size, etc. If you plan to be working for longer periods, we suggest choosing a dark background color scheme to save your computer battery and your eyes.
Projects are a great feature of RStudio. When you create a project,
RStudio creates an .Rproj
file that links all of your files
and outputs to the project directory. When you import data, R
automatically looks for the file in the project directory instead of you
having to specify a full file path on your computer like
/Users/username/Desktop/
. R also automatically saves any
output to the project directory. Finally, projects allow you to save
your R environment in .RData
so that when you close RStudio
and then re-open it, you can start right where you left off without
re-importing any data or re-calculating any intermediate steps.
RStudio has a simple interface to create and switch between projects, accessed from the button in the top-right corner of the RStudio window. (Labeled “Project: (None)”, initially.)
Let’s create a project to work in for this workshop. Start by clicking the “Project” button in the upper right or going to the “File” menu. Select “New Project” and the following will appear.
You can either create a project in an existing directory or make a new directory on your computer - just be sure you know where it is.
After your project is created, navigate to its directory using your
Finder/File explorer. You will see the .RProj
file has been
created.
To access this project in the future, simply double-click the
.RProj
file and RStudio will open the project or choose
File > Open Project from within an already open RStudio window.
R script files are the primary way in which R facilitates reproducible research. They contain the code that loads your raw data, cleans it, performs the analyses, and creates and saves visualizations. R scripts maintain a record of everything that is done to the raw data to reach the final result. That way, it is very easy to write up and communicate your methods because you have a document listing the precise steps you used to conduct your analyses. This is one of R’s primary advantages compared to traditional tools like Excel, where it may be unclear how to reproduce the results.
Generally, if you are testing an operation (e.g. what would my data look like if I applied a log-transformation to it?), you should do it in the console (left pane of RStudio). If you are committing a step to your analysis (e.g. I want to apply a log-transformation to my data and then conduct the rest of my analyses on the log-transformed data), you should add it to your R script so that it is saved for future use.
Additionally, you should annotate your R scripts with comments. In
each line of code, any text preceded by the #
symbol will
not execute. Comments can be useful to remind yourself and to tell other
readers what a specific chunk of code does. In my experience, there can
never be too much commenting.
Let’s create an R script (File > New File > R Script) and save
it as introR_live_notes.R
in your main project directory.
If you again look to the project directory on your computer, you will
see introR_live_notes.R
is now saved there.
We will work together to create and populate the
introR_live_notes.R
script throughout this workshop.
R packages are units of shareable code, containing functions that
facilitate and enhance analyses. Packages are typically installed from
CRAN (The Comprehensive R
Archive Network), which is a database containing R itself as well as
many R packages. Any package can be installed from CRAN using the
install.packages
function. You should input installation
commands into your console (as opposed to live_notes.R
)
since once a package is installed on your computer, you won’t need to
re-install it again.
You should have installed the following packages prior to the
workshop. However, if you did not, here is the code again.
tidyverse
is a meta-package containing several packages
useful in data manipulation and plotting. ggrepel
contains
functions to modify ggplot
plots.
You DO NOT need to run this again if you did so as part of the setup instructions.
install.packages("tidyverse")
install.packages("ggrepel")
After installing a package, and every time you open a new
RStudio session, the packages you want to use need to be loaded into the
R workspace with the library
function. This tells R to
access the package’s functions and prevents RStudio from lags that would
occur if it automatically loaded every downloaded package every time you
opened it. To put this in perspective, I had 464 packages installed at
the time this document was made.
# Data manipulation and visualization
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Because tidyverse
is a meta-package, it automatically
tells you want packages it is loading and their versions. In addition,
the Conflicts section let’s you know functions in the
tidyverse
that alter exist in your R session. Because you
chose to load the package, calling the function filter
will
use the tidyverse
function not the stats
function (which comes with base R). If you for some reason needed the
stats
version, you can specify it with
package::
like stats::filter
.
Bioconductor is another repository of R packages. It has different requirements for upload and houses many biology-relevant packages. To install from Bioconductor, you first install its installer from CRAN.
install.packages("BiocManager")
Then install your package of choice using the
BiocManager
installer. Again, you did this in the setup
instructions but the code is here again for reference. We install
limma
, a package for analysis of microarray and RNA-seq
data.
If prompted, say a
to “Update all/some/none?
[a/s/n]” and no
to “Do you want to install from sources the
packages which need compilation? (Yes/no/cancel)”
BiocManager::install("limma")
Before doing anything in R, it is a good idea to set your random seed. Your analyses may not end up using a seed but by setting it, you ensure that everything is exactly reproducible.
The most common seeds are 0, 1, 123, and 42. You can use any number and change it up between scripts if you’d like.
set.seed(4389)
Create a directory inside your project directory called
data
and move the data files into this directory. You can
do this in your regular point-and-click file explorer.
.csv
, .tsv
, etc.One of R’s most essential data structures is the data frame, which is
simply a table of m
columns by n
rows. First,
we will read in the RNA-seq metadata into RStudio using the base R
read.table
function.
Each R function uses the following basic syntax, where
Function
is the name of the function.
Function(argument1=..., argument2=..., ...)
read.table
has many arguments; however, we only need to
specify 3 arguments to correctly read in our data as a data frame. For
our data, we will need to specify:
file
- gives the path to the file that we want to load
from our working directory (current project directory).sep
- tells R that our data are comma-separatedheader
- tells R that the first row in our data
contains the names of the variables (columns).We will store the data as an object named meta
using the assignment operator <-
, so that we can re-use
it in our analysis.
# read the data and save it as an object
meta <- read.table(file="data/metadata.csv", sep=",", header=TRUE)
Now whenever we want to use these data, we simply call
meta
You can read up about the different arguments of a specific function
by typing ?Function
or help(Function)
in your
R console.
?read.table
You will notice that there are multiple functions of the
read.table
help page. This include similar and related
functions with additional options. For example, since our data are in
.csv
format, we could’ve instead read them into R with
read.csv
which assumes the options
sep=",", header=TRUE
by default.
# read the data with different function
meta <- read.csv(file="data/metadata.csv")
.RData
You may have data that do not fit nicely into a single table or into
a table at all (like plots). You can save these as .RData
,
which can be loaded directly into R. You can also save multiple tables
and/or other objects in a single .RData
file to make
loading your data quick and easy. Moreover, .RData
are
automatically compressed so they take up less storage space than
multiple tables.
load("data/dat_voom.RData")
Notice that these data appear already named in your R environment as
dat
. Object names are determined when saving so be sure to
create short but descriptive names before saving to
.RData
.
Let’s return to the simpler metadata for now. This data frame
consists of 20 rows (observations) and 9 columns (variables). You can
see this quickly using the dimension function dim
dim(meta)
## [1] 20 9
Each column and each row of a data frame are individual R vectors. R
vectors are one-dimensional arrays of data. For example, we can extract
column vectors from data frames using the $
operator.
# Extract patient IDs
meta$ptID
## [1] "pt01" "pt01" "pt02" "pt02" "pt03" "pt03" "pt04" "pt04" "pt05" "pt05"
## [11] "pt06" "pt06" "pt07" "pt07" "pt08" "pt08" "pt09" "pt09" "pt10" "pt10"
R objects have several different classes (types). Our data frame
contains 4 R data types. The base R class
function will
tell you what data type an object is.
class(meta)
## [1] "data.frame"
class(meta$ptID)
## [1] "character"
class(meta$age_dys)
## [1] "integer"
class(meta$total_seq)
## [1] "numeric"
class(meta$RNAseq)
## [1] "logical"
We see that our ptID
column is character
,
meaning it is non-numeric letters and characters. On the other hand,
age_dys
is numeric
, meaning a number, and
total_seq
is integer
, meaning a whole number
with no decimals. Finally, RNAseq
is logical
meaning TRUE/FALSE - which need to be all caps for R to recognize them
as such.
You can see the entire structure of the data with
str
.
str(meta)
## 'data.frame': 20 obs. of 9 variables:
## $ libID : chr "pt01_Media" "pt01_Mtb" "pt02_Media" "pt02_Mtb" ...
## $ ptID : chr "pt01" "pt01" "pt02" "pt02" ...
## $ condition : chr "Media" "Mtb" "Media" "Mtb" ...
## $ age_dys : int 12410 12410 12775 12775 11315 11315 8395 8395 7300 7300 ...
## $ sex : chr "M" "M" "M" "M" ...
## $ ptID_old : chr "pt00001" "pt00001" "pt00002" "pt00002" ...
## $ RNAseq : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ methylation: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ total_seq : num 9114402 8918699 9221555 7733260 6231728 ...
Another, common data type not found in these data is
factor
, non-numeric value with a set number of unique
levels. We will see factor
in the ggplot
section.
S3
, S4
)Now, let’s look at the dat
data. These data are in
S3
format meaning they have 3 dimensions. S3
data are a list of multiple data frames, vectors, plots, etc. In the
case of dat
, there are a specific class from the
limma
package called EList
for expression
list.
class(dat)
## [1] "EList"
## attr(,"package")
## [1] "limma"
If you click on the dat
object in your Environment tab,
you will see multiple pieces.
In these data, pieces are data frames or matrices. You can again see
this with class
, only this time you specify a part of the
dat
object with $
class(dat$genes)
## Loading required package: limma
## [1] "data.frame"
class(dat$E)
## [1] "matrix" "array"
class(dat$targets)
## [1] "data.frame"
class(dat$weights)
## [1] "matrix" "array"
class(dat$design)
## [1] "matrix" "array"
Notice that you get the message
Loading required package: limma
. If you did not have
limma
installed, you could not work with these data because
they are a data type specific to limma
. Hence, why we
installed this package previously.
Or going deeper, you can specify one column in the genes
data frame
class(dat$genes$hgnc_symbol)
## [1] "character"
Of note, working with S4
objects is very similar to
S3
except that they are accessed with @
instead of $
. However, we will not use S4
in
this workshop.
A large proportion of R functions operate on vectors to perform quick computations over their values. Here are some examples:
# Compute the variance of total number of sequences
var(meta$total_seq)
## [1] 7.843553e+12
# Find whether any samples have greater than 10 million sequences
meta$total_seq > 10E6
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE
## [13] TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
# Find the unique values of metadata's condition
unique(meta$condition)
## [1] "Media" "Mtb"
Functions executed on an object in R may respond exclusively to one or more data types or may respond differently depending on the data type. When you use the incorrect data type, you will get an error or warning message. For example, you cannot take the mean of a factor or character.
# Compute the mean of libID
mean(meta$libID)
## Warning in mean.default(meta$libID): argument is not numeric or logical:
## returning NA
## [1] NA
Since vectors are one dimensional arrays of a defined length, their
individual values can be retrieved using vector indices. R uses 1-based
indexing, meaning the first value in an R vector corresponds to the
index 1. (Importantly, if you use python, that language is 0-based,
meaning the first value is index 0.) Each subsequent element increases
the index by 1. For example, we can extract the value of the 5th element
of the libID
vector using the square bracket operator
[ ]
like so.
meta$libID[5]
## [1] "pt03_Media"
In contrast, data frames are two dimensional arrays so indexing is
done across both dimensions as [rows, columns]
. So, we can
extract the same library ID value directly from the data frame knowing
it is in the 5th row and 1st column.
meta[5, 1]
## [1] "pt03_Media"
The square bracket operator is often used with logical vectors
(TRUE/FALSE) to subset data. For example, we can subset our metadata to
all Media
observations (rows).
# Create logical vector for which condition is Media
logical.vector <- meta$condition == "Media"
#View vector
logical.vector
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [13] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
#Apply vector to data frame to select only observations where the logical vector is TRUE
meta[logical.vector, ]
## libID ptID condition age_dys sex ptID_old RNAseq methylation total_seq
## 1 pt01_Media pt01 Media 12410 M pt00001 TRUE FALSE 9114402
## 3 pt02_Media pt02 Media 12775 M pt00002 TRUE FALSE 9221555
## 5 pt03_Media pt03 Media 11315 M pt00003 TRUE FALSE 6231728
## 7 pt04_Media pt04 Media 8395 M pt00004 TRUE TRUE 10205557
## 9 pt05_Media pt05 Media 7300 M pt00005 TRUE FALSE 15536685
## 11 pt06_Media pt06 Media 6570 F pt00006 TRUE FALSE 7085995
## 13 pt07_Media pt07 Media 7665 F pt00007 TRUE FALSE 10706098
## 15 pt08_Media pt08 Media 8760 M pt00008 TRUE FALSE 9957906
## 17 pt09_Media pt09 Media 6935 M pt00009 TRUE FALSE 13055276
## 19 pt10_Media pt10 Media 8030 F pt00010 TRUE FALSE 8216706
Subsetting is extremely useful when working with large data. We will learn more complex subsets on day 2 using the tidyverse. But first…
Statement | Meaning |
---|---|
<- |
Assign to object in environment |
== |
Equal to |
!= |
Not equal to |
> |
Greater than |
>= |
Greater than or equal to |
< |
Less than |
<= |
Less than or equal to |
%in% |
In or within |
is.na() |
Is missing, e.g NA |
!is.na() |
Is not missing |
& |
And |
| |
Or |
Using the meta
data frame:
ptID_old
variable is.[]
:
libID
value occurs in the 9th rowlibID
for a library where
age_dys
equals 7300 and condition
equals
MediaptID
equals
“pt05” or “pt08”. Hint: Use a logical vector with the
conditional statements above.