1. Introduction to R and RStudio

In this workshop, we introduce you to R and RStudio at the beginner level. In it, we cover:

  • R and RStudio including projects, scripts, and packages
  • The help function
  • Reading in data as a data frame and RData
  • Data types

We will do all of our work in RStudio. RStudio is an integrated development and analysis environment for R that brings a number of conveniences over using R in a terminal or other editing environments.

During the workshop, we will build an R script together, which will be posted as ‘live_notes’ after the workshop at https://github.com/BIGslu/workshops/tree/main/2022.08.15_R.tidyverse.workshop/live_notes

A Tour of RStudio

When you start RStudio, you will see something like the following window appear:

Notice that the window is divided into three “panes”:

  • Console (the entire left side): This is your view into the R engine. You can type R commands here and see the output printed by R. (To make it easier to tell them apart, your input and the resulting output are printed in different colors.) Several editing conveniences are available: use the up and down arrow keys to step back through previously entered commands, which can then be edited and re-run, and press TAB to complete the name before the cursor; see the RStudio documentation for more.

  • Environment/History (tabbed in upper right): View current user-defined objects and previously entered commands, respectively. The instructor may have additional tabs such as Connections, Git, etc. These are not relevant to this workshop series.

  • Files/Plots/Packages/Help (tabbed in lower right): As their names suggest, these are used to view the contents of the current directory, view graphics created by the user, install and manage packages, and view the built-in help pages.

To change the look of RStudio, you can go to Tools -> Global Options -> Appearance and select colors, font size, etc. If you plan to be working for longer periods, we suggest choosing a dark background color scheme to save your computer battery and your eyes.

RStudio Projects

Projects are a great feature of RStudio. When you create a project, RStudio creates an .Rproj file that links all of your files and outputs to the project directory. When you import data, R automatically looks for the file in the project directory instead of you having to specify a full file path on your computer like /Users/username/Desktop/. R also automatically saves any output to the project directory. Finally, projects allow you to save your R environment in .RData so that when you close RStudio and then re-open it, you can start right where you left off without re-importing any data or re-calculating any intermediate steps.

RStudio has a simple interface to create and switch between projects, accessed from the button in the top-right corner of the RStudio window. (Labeled “Project: (None)”, initially.)

Create a Project

Let’s create a project to work in for this workshop. Start by clicking the “Project” button in the upper right or going to the “File” menu. Select “New Project” and the following will appear.

You can either create a project in an existing directory or make a new directory on your computer - just be sure you know where it is.

After your project is created, navigate to its directory using your Finder/File explorer. You will see that the .Rproj file has been created.

To access this project in the future, simply double-click the .Rproj file to open the project in RStudio, or choose File > Open Project from within an already open RStudio window.

R Scripts

R script files are the primary way in which R facilitates reproducible research. They contain the code that loads your raw data, cleans it, performs the analyses, and creates and saves visualizations. R scripts maintain a record of everything that is done to the raw data to reach the final result. That way, it is very easy to write up and communicate your methods because you have a document listing the precise steps you used to conduct your analyses. This is one of R’s primary advantages compared to traditional tools like Excel, where it may be unclear how to reproduce the results.

Generally, if you are testing an operation (e.g. what would my data look like if I applied a log-transformation to it?), you should do it in the console (left pane of RStudio). If you are committing a step to your analysis (e.g. I want to apply a log-transformation to my data and then conduct the rest of my analyses on the log-transformed data), you should add it to your R script so that it is saved for future use.

Additionally, you should annotate your R scripts with comments. On any line of code, text following a # symbol is not executed. Comments are useful for reminding yourself, and telling other readers, what a specific chunk of code does. In my experience, there can never be too much commenting.
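
For example, a minimal, made-up snippet showing both comment styles (the object x here is purely for illustration):

# This is a comment. R ignores everything from the # to the end of the line
x <- 1 + 1  # comments can also follow code on the same line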

Let’s create an R script (File > New File > R Script) and save it as introR_live_notes.R in your main project directory. If you again look to the project directory on your computer, you will see introR_live_notes.R is now saved there.

We will work together to create and populate the introR_live_notes.R script throughout this workshop.

R packages

CRAN

R packages are units of shareable code containing functions that facilitate and enhance analyses. Packages are typically installed from CRAN (The Comprehensive R Archive Network), which is a repository containing R itself as well as many R packages. Any package can be installed from CRAN using the install.packages function. You should type installation commands into your console (as opposed to introR_live_notes.R) since once a package is installed on your computer, you won't need to install it again.

You should have installed the following packages prior to the workshop. However, if you did not, here is the code again. tidyverse is a meta-package containing several packages useful in data manipulation and plotting. ggrepel contains functions for adding non-overlapping text labels to ggplot plots.

You DO NOT need to run this again if you did so as part of the setup instructions.

install.packages("tidyverse")
install.packages("ggrepel")

After installing a package, and every time you open a new RStudio session, the packages you want to use need to be loaded into the R workspace with the library function. This tells R to access the package's functions and avoids the lag that would occur if R automatically loaded every installed package every time you opened it. To put this in perspective, I had 464 packages installed at the time this document was made.

# Data manipulation and visualization
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Because tidyverse is a meta-package, it automatically tells you what packages it is loading and their versions. In addition, the Conflicts section lets you know which tidyverse functions mask functions of the same name that already exist in your R session. Because you chose to load the package, calling filter will use the tidyverse function, not the stats function (which comes with base R). If you for some reason need the stats version, you can specify it with package:: like stats::filter.
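
As a quick illustration of the :: syntax (my_data and group below are hypothetical, included only to show the form):

# With the tidyverse loaded, a bare filter() call uses dplyr's version, which subsets rows
# filter(my_data, group == "A")
# Prefix with the package name to call the base R version instead
stats::filter(1:10, rep(1/3, 3))  # a 3-point moving average of the numbers 1 to 10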

Bioconductor

Bioconductor is another repository of R packages. It has different package submission requirements than CRAN and houses many biology-relevant packages. To install from Bioconductor, you first install its installer, BiocManager, from CRAN.

install.packages("BiocManager")

Then install your package of choice using the BiocManager installer. Again, you did this in the setup instructions, but the code is here for reference. We install limma, a package for analysis of microarray and RNA-seq data.

If prompted, say a to “Update all/some/none? [a/s/n]” and no to “Do you want to install from sources the packages which need compilation? (Yes/no/cancel)”

BiocManager::install("limma")

Getting started

Before doing anything in R, it is a good idea to set your random seed. Your analyses may not end up using a seed but by setting it, you ensure that everything is exactly reproducible.

The most common seeds are 0, 1, 123, and 42. You can use any number and change it up between scripts if you’d like.

set.seed(4389)
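
As a quick check of what the seed does (this snippet is illustrative, not part of the workshop script), resetting the seed reproduces the exact same random draws:

# Draw 3 random numbers, reset the seed, then draw again
set.seed(4389)
draw1 <- rnorm(3)
set.seed(4389)
draw2 <- rnorm(3)
# The draws match because the seed was reset in between
identical(draw1, draw2)
## [1] TRUE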

Organize data

Create a directory inside your project directory called data and move the data files into this directory. You can do this in your regular point-and-click file explorer.
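
If you prefer to stay in R, the directory can also be created from the console (optional; the point-and-click approach works just as well):

# Create a data/ directory inside the current project directory
dir.create("data")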

Loading data into R

Data frames from .csv, .tsv, etc.

One of R's most essential data structures is the data frame, which is simply a table with m columns and n rows. First, we will read the RNA-seq metadata into R using the base R read.table function.

Each R function uses the following basic syntax, where Function is the name of the function.

Function(argument1=..., argument2=..., ...)

read.table has many arguments; however, we only need to specify 3 arguments to correctly read in our data as a data frame. For our data, we will need to specify:

  • file - gives the path to the file that we want to load from our working directory (current project directory).
  • sep - tells R that our data are comma-separated.
  • header - tells R that the first row in our data contains the names of the variables (columns).

We will store the data as an object named meta using the assignment operator <-, so that we can re-use it in our analysis.

# read the data and save it as an object
meta <- read.table(file="data/metadata.csv", sep=",", header=TRUE)

Now, whenever we want to use these data, we simply call meta.
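
For example, we can peek at the first few rows or list the column names:

# Preview the first 6 rows of the data frame
head(meta)
# List the column (variable) names
colnames(meta)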

Help function

You can read up about the different arguments of a specific function by typing ?Function or help(Function) in your R console.

?read.table

You will notice that there are multiple functions on the read.table help page. These include similar and related functions with different default options. For example, since our data are in .csv format, we could have instead read them into R with read.csv, which assumes sep=",", header=TRUE by default.

# read the data with different function
meta <- read.csv(file="data/metadata.csv")

Complex data from .RData

You may have data that do not fit nicely into a single table, or into a table at all (like plots). You can save these as .RData files, which can be loaded directly into R. You can also save multiple tables and/or other objects in a single .RData file to make loading your data quick and easy. Moreover, .RData files are automatically compressed, so they take up less storage space than multiple separate files.

load("data/dat_voom.RData")

Notice that these data appear already named in your R environment as dat. Object names are determined when saving so be sure to create short but descriptive names before saving to .RData.
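
For reference, this is roughly how such a file is created in the first place (the file name below is hypothetical); whatever object names you use in save() are the names that re-appear when the file is loaded:

# Save multiple objects to a single, compressed .RData file
save(meta, dat, file = "data/my_objects.RData")
# load("data/my_objects.RData") restores both objects, named meta and dat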

Data types

Simple

Let's return to the simpler metadata for now. This data frame consists of 20 rows (observations) and 9 columns (variables). You can see this quickly using the dimension function dim.

dim(meta)
## [1] 20  9

Each column and each row of a data frame are individual R vectors. R vectors are one-dimensional arrays of data. For example, we can extract column vectors from data frames using the $ operator.

# Extract patient IDs
meta$ptID
##  [1] "pt01" "pt01" "pt02" "pt02" "pt03" "pt03" "pt04" "pt04" "pt05" "pt05"
## [11] "pt06" "pt06" "pt07" "pt07" "pt08" "pt08" "pt09" "pt09" "pt10" "pt10"

R objects have several different classes (types). Our data frame contains 4 R data types. The base R class function will tell you what data type an object is.

class(meta)
## [1] "data.frame"
class(meta$ptID)
## [1] "character"
class(meta$age_dys)
## [1] "integer"
class(meta$total_seq)
## [1] "numeric"
class(meta$RNAseq)
## [1] "logical"

We see that our ptID column is character, meaning it contains non-numeric letters and symbols. On the other hand, age_dys is integer, meaning a whole number with no decimals, and total_seq is numeric, meaning a number that can have decimals. Finally, RNAseq is logical, meaning TRUE/FALSE - which must be in all caps for R to recognize them as such.
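
A quick illustration of the all-caps requirement:

class(TRUE)
## [1] "logical"
# Lowercase true is treated as an (undefined) object name instead
# class(true)   # Error: object 'true' not found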

You can see the entire structure of the data with str.

str(meta)
## 'data.frame':    20 obs. of  9 variables:
##  $ libID      : chr  "pt01_Media" "pt01_Mtb" "pt02_Media" "pt02_Mtb" ...
##  $ ptID       : chr  "pt01" "pt01" "pt02" "pt02" ...
##  $ condition  : chr  "Media" "Mtb" "Media" "Mtb" ...
##  $ age_dys    : int  12410 12410 12775 12775 11315 11315 8395 8395 7300 7300 ...
##  $ sex        : chr  "M" "M" "M" "M" ...
##  $ ptID_old   : chr  "pt00001" "pt00001" "pt00002" "pt00002" ...
##  $ RNAseq     : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ methylation: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ total_seq  : num  9114402 8918699 9221555 7733260 6231728 ...

Another common data type, not found in these data, is factor: a non-numeric value with a set number of unique levels. We will see factors in the ggplot section.
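
As a small preview (not part of the workshop script), converting the condition column to a factor shows its class and levels:

class(factor(meta$condition))
## [1] "factor"
levels(factor(meta$condition))
## [1] "Media" "Mtb"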

Complex (S3, S4)

Now, let's look at the dat data. These data are an S3 object, one of R's systems for storing more complex, structured data. An S3 object like this is essentially a list holding multiple data frames, vectors, plots, etc. In the case of dat, it is a specific class from the limma package called EList, for expression list.

class(dat)
## [1] "EList"
## attr(,"package")
## [1] "limma"

If you click on the dat object in your Environment tab, you will see multiple pieces.

In these data, the pieces are data frames or matrices. You can again see this with class, only this time specifying a piece of the dat object with $.

class(dat$genes)
## Loading required package: limma
## [1] "data.frame"
class(dat$E)
## [1] "matrix" "array"
class(dat$targets)
## [1] "data.frame"
class(dat$weights)
## [1] "matrix" "array"
class(dat$design)
## [1] "matrix" "array"

Notice that you get the message Loading required package: limma. If you did not have limma installed, you could not work with these data because they are a data type specific to limma. This is why we installed the package previously.

Or, going deeper, you can specify one column in the genes data frame.

class(dat$genes$hgnc_symbol)
## [1] "character"

Of note, working with S4 objects is very similar to S3 except that they are accessed with @ instead of $. However, we will not use S4 in this workshop.
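
A schematic comparison of the two access operators (the S4 object below is hypothetical, only to show the syntax):

# S3: pieces are accessed with $
dat$genes
# S4: slots would be accessed with @, e.g.
# some_s4_object@genes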

Working with vectors and data frames

Operating on vectors

A large proportion of R functions operate on vectors to perform quick computations over their values. Here are some examples:

# Compute the variance of total number of sequences
var(meta$total_seq)
## [1] 7.843553e+12
# Find whether any samples have greater than 10 million sequences
meta$total_seq > 10E6
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE
## [13]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
# Find the unique values of metadata's condition
unique(meta$condition)
## [1] "Media" "Mtb"

Using the correct class

R functions may work only on certain data types, or may behave differently depending on the data type. When you use an incorrect data type, you will get an error or warning message. For example, you cannot take the mean of a factor or character.

# Compute the mean of libID
mean(meta$libID)
## Warning in mean.default(meta$libID): argument is not numeric or logical:
## returning NA
## [1] NA

Subsetting vectors and data frames

Since vectors are one-dimensional arrays of a defined length, their individual values can be retrieved using vector indices. R uses 1-based indexing, meaning the first value in an R vector corresponds to index 1. (Importantly, if you use Python, that language is 0-based, meaning the first value is index 0.) Each subsequent element increases the index by 1. For example, we can extract the value of the 5th element of the libID vector using the square bracket operator [ ] like so.

meta$libID[5]
## [1] "pt03_Media"

In contrast, data frames are two-dimensional arrays, so indexing is done across both dimensions as [rows, columns]. So, we can extract the same library ID value directly from the data frame, knowing it is in the 5th row and 1st column.

meta[5, 1]
## [1] "pt03_Media"

The square bracket operator is often used with logical vectors (TRUE/FALSE) to subset data. For example, we can subset our metadata to all Media observations (rows).

# Create logical vector for which condition is Media
logical.vector <- meta$condition == "Media"
#View vector
logical.vector
##  [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
## [13]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
#Apply vector to data frame to select only observations where the logical vector is TRUE
meta[logical.vector, ]
##         libID ptID condition age_dys sex ptID_old RNAseq methylation total_seq
## 1  pt01_Media pt01     Media   12410   M  pt00001   TRUE       FALSE   9114402
## 3  pt02_Media pt02     Media   12775   M  pt00002   TRUE       FALSE   9221555
## 5  pt03_Media pt03     Media   11315   M  pt00003   TRUE       FALSE   6231728
## 7  pt04_Media pt04     Media    8395   M  pt00004   TRUE        TRUE  10205557
## 9  pt05_Media pt05     Media    7300   M  pt00005   TRUE       FALSE  15536685
## 11 pt06_Media pt06     Media    6570   F  pt00006   TRUE       FALSE   7085995
## 13 pt07_Media pt07     Media    7665   F  pt00007   TRUE       FALSE  10706098
## 15 pt08_Media pt08     Media    8760   M  pt00008   TRUE       FALSE   9957906
## 17 pt09_Media pt09     Media    6935   M  pt00009   TRUE       FALSE  13055276
## 19 pt10_Media pt10     Media    8030   F  pt00010   TRUE       FALSE   8216706

Subsetting is extremely useful when working with large data. We will learn more complex subsets on day 2 using the tidyverse. But first…

Quick reference: Conditional statements

Statement  Meaning
<-         Assign to object in environment
==         Equal to
!=         Not equal to
>          Greater than
>=         Greater than or equal to
<          Less than
<=         Less than or equal to
%in%       In or within
is.na()    Is missing, e.g. NA
!is.na()   Is not missing
&          And
|          Or
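
A few of these in action on the meta data frame:

# TRUE where the patient is female
meta$sex == "F"
# Combine conditions with & (and)
meta$condition == "Media" & meta$sex == "F"
# Check whether a value occurs anywhere in a vector
"pt01" %in% meta$ptID
## [1] TRUE
# TRUE where a value is missing
is.na(meta$total_seq)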

Exercises: Intro R

  1. Using help to identify the necessary arguments for the log function, compute the natural logarithm of 4, base 2 logarithm of 4, and base 4 logarithm of 4.

Using the meta data frame:

  1. Using an R function, determine what data type the ptID_old variable is.
  2. Using indexing and the square bracket operator []:
    • determine what libID value occurs in the 9th row
    • return the libID for a library where age_dys equals 7300 and condition equals Media
  3. Subset the data to observations where ptID equals “pt05” or “pt08”. Hint: Use a logical vector with the conditional statements above.