Prior to the workshop, please complete the following:

Download the data

Please download the data in data.zip. Then, move data.zip to your Desktop and unzip it (usually double-clicking it will work).

If the above link does not work, please try downloading the data with Download button at https://github.com/BIGslu/workshops/blob/main/2022.08.15_R.tidyverse.workshop/data/data.zip.

Install R and RStudio

Install R, https://cran.r-project.org/index.html
- If you already have R installed, please upgrade to version 4.1 or newer
Install RStudio (free version), https://www.rstudio.com/products/rstudio/download/

When you open RStudio, it should look like so with multiple panels. If you see only 1 panel, then you’re likely in R, not RStudio.

Install R packages

Install R packages by running the following script in your R console (left panel in the above image).

#CRAN packages
install.packages("tidyverse")
install.packages("ggrepel")
#Bioconductor packages
install.packages("BiocManager")
BiocManager::install("limma")

If prompted, say a to “Update all/some/none? [a/s/n]” and no to “Do you want to install from sources the packages which need compilation? (Yes/no/cancel)”

This can take several minutes.

Check R package install

To make sure packages are correctly installed, load them into R with library( ). If you see any ERROR, please come 15 minutes early to the workshop the day of or contact Kim for assistance.

First, the package(s) that give messages upon loading.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Then, check package(s) that load silently with no messages.

library(limma)
library(ggrepel)

R package versions

For reproducibility, here is the complete list of software used in this workshop.

sessionInfo()

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggrepel_0.9.1   limma_3.52.2    forcats_0.5.2   stringr_1.4.1  
##  [5] dplyr_1.0.9     purrr_0.3.4     readr_2.1.2     tidyr_1.2.0    
##  [9] tibble_3.1.8    ggplot2_3.3.6   tidyverse_1.3.2
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.9          lubridate_1.8.0     assertthat_0.2.1   
##  [4] rprojroot_2.0.3     digest_0.6.29       utf8_1.2.2         
##  [7] R6_2.5.1            cellranger_1.1.0    backports_1.4.1    
## [10] reprex_2.0.2        evaluate_0.16       httr_1.4.4         
## [13] pillar_1.8.1        rlang_1.0.4         googlesheets4_1.0.1
## [16] readxl_1.4.1        rstudioapi_0.14     jquerylib_0.1.4    
## [19] rmarkdown_2.15      googledrive_2.0.0   munsell_0.5.0      
## [22] broom_1.0.0         compiler_4.2.1      modelr_0.1.9       
## [25] xfun_0.32           pkgconfig_2.0.3     htmltools_0.5.3    
## [28] tidyselect_1.1.2    fansi_1.0.3         crayon_1.5.1       
## [31] tzdb_0.3.0          dbplyr_2.2.1        withr_2.5.0        
## [34] grid_4.2.1          jsonlite_1.8.0      gtable_0.3.0       
## [37] lifecycle_1.0.1     DBI_1.1.3           magrittr_2.0.3     
## [40] scales_1.2.1        cli_3.3.0           stringi_1.7.8      
## [43] cachem_1.0.6        fs_1.5.2            xml2_1.3.3         
## [46] bslib_0.4.0         ellipsis_0.3.2      generics_0.1.3     
## [49] vctrs_0.4.1         tools_4.2.1         glue_1.6.2         
## [52] hms_1.1.2           fastmap_1.1.0       yaml_2.3.5         
## [55] colorspace_2.0-3    gargle_1.2.0        rvest_1.0.3        
## [58] knitr_1.39          haven_2.5.0         sass_0.4.2

1. Introduction to R and RStudio

In this workshop, we introduce you to R and RStudio at the beginner level. In it, we cover:

R and RStudio including projects, scripts, and packages
The help function
Reading in data as a data frame and RData
Data types

We will do all of our work in RStudio. RStudio is an integrated development and analysis environment for R that brings a number of conveniences over using R in a terminal or other editing environments.

During the workshop, we will build an R script together, which will be posted as ‘live_notes’ after the workshop at https://github.com/BIGslu/workshops/tree/main/2022.08.15_R.tidyverse.workshop/live_notes

A Tour of RStudio

When you start RStudio, you will see something like the following window appear:

Notice that the window is divided into three “panes”:

Console (the entire left side): This is your view into the R engine. You can type in R commands here and see the output printed by R. (To make it easier to tell them apart, your input and the resulting output are printed in different colors.) There are several editing conveniences available: use up and down arrow keys to go back to previously entered commands, which can then be edited and re-run; TAB for completing the name before the cursor; see more in online docs.
Environment/History (tabbed in upper right): View current user-defined objects and previously-entered commands, respectively. The instructor may have additional tabs like Connection, Git, etc. These are not relevant to this workshop series.
Files/Plots/Packages/Help (tabbed in lower right): As their names suggest, these are used to view the contents of the current directory, graphics created by the user, install packages, and view the built-in help pages.

To change the look of RStudio, you can go to Tools -> Global Options -> Appearance and select colors, font size, etc. If you plan to be working for longer periods, we suggest choosing a dark background color scheme to save your computer battery and your eyes.

RStudio Projects

Projects are a great feature of RStudio. When you create a project, RStudio creates an .Rproj file that links all of your files and outputs to the project directory. When you import data, R automatically looks for the file in the project directory instead of you having to specify a full file path on your computer like /Users/username/Desktop/. R also automatically saves any output to the project directory. Finally, projects allow you to save your R environment in .RData so that when you close RStudio and then re-open it, you can start right where you left off without re-importing any data or re-calculating any intermediate steps.

RStudio has a simple interface to create and switch between projects, accessed from the button in the top-right corner of the RStudio window. (Labeled “Project: (None)”, initially.)

Create a Project

Let’s create a project to work in for this workshop. Start by clicking the “Project” button in the upper right or going to the “File” menu. Select “New Project” and the following will appear.

You can either create a project in an existing directory or make a new directory on your computer - just be sure you know where it is.

After your project is created, navigate to its directory using your Finder/File explorer. You will see the .RProj file has been created.

To access this project in the future, simply double-click the .RProj file and RStudio will open the project or choose File > Open Project from within an already open RStudio window.

R Scripts

R script files are the primary way in which R facilitates reproducible research. They contain the code that loads your raw data, cleans it, performs the analyses, and creates and saves visualizations. R scripts maintain a record of everything that is done to the raw data to reach the final result. That way, it is very easy to write up and communicate your methods because you have a document listing the precise steps you used to conduct your analyses. This is one of R’s primary advantages compared to traditional tools like Excel, where it may be unclear how to reproduce the results.

Generally, if you are testing an operation (e.g. what would my data look like if I applied a log-transformation to it?), you should do it in the console (left pane of RStudio). If you are committing a step to your analysis (e.g. I want to apply a log-transformation to my data and then conduct the rest of my analyses on the log-transformed data), you should add it to your R script so that it is saved for future use.

Additionally, you should annotate your R scripts with comments. In each line of code, any text preceded by the # symbol will not execute. Comments can be useful to remind yourself and to tell other readers what a specific chunk of code does. In my experience, there can never be too much commenting.

Let’s create an R script (File > New File > R Script) and save it as introR_live_notes.R in your main project directory. If you again look to the project directory on your computer, you will see introR_live_notes.R is now saved there.

We will work together to create and populate the introR_live_notes.R script throughout this workshop.

R packages

CRAN

R packages are units of shareable code, containing functions that facilitate and enhance analyses. Packages are typically installed from CRAN (The Comprehensive R Archive Network), which is a database containing R itself as well as many R packages. Any package can be installed from CRAN using the install.packages function. You should input installation commands into your console (as opposed to live_notes.R) since once a package is installed on your computer, you won’t need to re-install it again.

You should have installed the following packages prior to the workshop. However, if you did not, here is the code again. tidyverse is a meta-package containing several packages useful in data manipulation and plotting. ggrepel contains functions to modify ggplot plots.

You DO NOT need to run this again if you did so as part of the setup instructions.

install.packages("tidyverse")
install.packages("ggrepel")

After installing a package, and every time you open a new RStudio session, the packages you want to use need to be loaded into the R workspace with the library function. This tells R to access the package’s functions and prevents RStudio from lags that would occur if it automatically loaded every downloaded package every time you opened it. To put this in perspective, I had 464 packages installed at the time this document was made.

# Data manipulation and visualization
library(tidyverse)

Because tidyverse is a meta-package, it automatically tells you want packages it is loading and their versions. In addition, the Conflicts section let’s you know functions in the tidyverse that alter exist in your R session. Because you chose to load the package, calling the function filter will use the tidyverse function not the stats function (which comes with base R). If you for some reason needed the stats version, you can specify it with package:: like stats::filter.

Bioconductor

Bioconductor is another repository of R packages. It has different requirements for upload and houses many biology-relevant packages. To install from Bioconductor, you first install its installer from CRAN.

install.packages("BiocManager")

Then install your package of choice using the BiocManager installer. Again, you did this in the setup instructions but the code is here again for reference. We install limma, a package for analysis of microarray and RNA-seq data.

If prompted, say a to “Update all/some/none? [a/s/n]” and no to “Do you want to install from sources the packages which need compilation? (Yes/no/cancel)”

BiocManager::install("limma")

Getting started

Before doing anything in R, it is a good idea to set your random seed. Your analyses may not end up using a seed but by setting it, you ensure that everything is exactly reproducible.

The most common seeds are 0, 1, 123, and 42. You can use any number and change it up between scripts if you’d like.

set.seed(4389)

Organize data

Create a directory inside your project directory called data and move the data files into this directory. You can do this in your regular point-and-click file explorer.

Loading data into R

Data frames from `.csv`, `.tsv`, etc.

One of R’s most essential data structures is the data frame, which is simply a table of m columns by n rows. First, we will read in the RNA-seq metadata into RStudio using the base R read.table function.

Each R function uses the following basic syntax, where Function is the name of the function.

Function(argument1=..., argument2=..., ...)

read.table has many arguments; however, we only need to specify 3 arguments to correctly read in our data as a data frame. For our data, we will need to specify:

file - gives the path to the file that we want to load from our working directory (current project directory).
sep - tells R that our data are comma-separated
header - tells R that the first row in our data contains the names of the variables (columns).

We will store the data as an object named meta using the assignment operator <-, so that we can re-use it in our analysis.

# read the data and save it as an object
meta <- read.table(file="data/metadata.csv", sep=",", header=TRUE)

Now whenever we want to use these data, we simply call meta

Help function

You can read up about the different arguments of a specific function by typing ?Function or help(Function) in your R console.

?read.table

You will notice that there are multiple functions of the read.table help page. This include similar and related functions with additional options. For example, since our data are in .csv format, we could’ve instead read them into R with read.csv which assumes the options sep=",", header=TRUE by default.

# read the data with different function
meta <- read.csv(file="data/metadata.csv")

Complex data from `.RData`

You may have data that do not fit nicely into a single table or into a table at all (like plots). You can save these as .RData, which can be loaded directly into R. You can also save multiple tables and/or other objects in a single .RData file to make loading your data quick and easy. Moreover, .RData are automatically compressed so they take up less storage space than multiple tables.

load("data/dat_voom.RData")

Notice that these data appear already named in your R environment as dat. Object names are determined when saving so be sure to create short but descriptive names before saving to .RData.

Data types

Simple

Let’s return to the simpler metadata for now. This data frame consists of 20 rows (observations) and 9 columns (variables). You can see this quickly using the dimension function dim

dim(meta)

## [1] 20  9

Each column and each row of a data frame are individual R vectors. R vectors are one-dimensional arrays of data. For example, we can extract column vectors from data frames using the $ operator.

# Extract patient IDs
meta$ptID

##  [1] "pt01" "pt01" "pt02" "pt02" "pt03" "pt03" "pt04" "pt04" "pt05" "pt05"
## [11] "pt06" "pt06" "pt07" "pt07" "pt08" "pt08" "pt09" "pt09" "pt10" "pt10"

R objects have several different classes (types). Our data frame contains 4 R data types. The base R class function will tell you what data type an object is.

class(meta)

## [1] "data.frame"

class(meta$ptID)

## [1] "character"

class(meta$age_dys)

## [1] "integer"

class(meta$total_seq)

## [1] "numeric"

class(meta$RNAseq)

## [1] "logical"

We see that our ptID column is character, meaning it is non-numeric letters and characters. On the other hand, age_dys is numeric, meaning a number, and total_seq is integer, meaning a whole number with no decimals. Finally, RNAseq is logical meaning TRUE/FALSE - which need to be all caps for R to recognize them as such.

You can see the entire structure of the data with str.

str(meta)

## 'data.frame':    20 obs. of  9 variables:
##  $ libID      : chr  "pt01_Media" "pt01_Mtb" "pt02_Media" "pt02_Mtb" ...
##  $ ptID       : chr  "pt01" "pt01" "pt02" "pt02" ...
##  $ condition  : chr  "Media" "Mtb" "Media" "Mtb" ...
##  $ age_dys    : int  12410 12410 12775 12775 11315 11315 8395 8395 7300 7300 ...
##  $ sex        : chr  "M" "M" "M" "M" ...
##  $ ptID_old   : chr  "pt00001" "pt00001" "pt00002" "pt00002" ...
##  $ RNAseq     : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ methylation: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ total_seq  : num  9114402 8918699 9221555 7733260 6231728 ...

Another, common data type not found in these data is factor, non-numeric value with a set number of unique levels. We will see factor in the ggplot section.

Complex (`S3`, `S4`)

Now, let’s look at the dat data. These data are in S3 format meaning they have 3 dimensions. S3 data are a list of multiple data frames, vectors, plots, etc. In the case of dat, there are a specific class from the limma package called EList for expression list.

class(dat)

## [1] "EList"
## attr(,"package")
## [1] "limma"

If you click on the dat object in your Environment tab, you will see multiple pieces.

In these data, pieces are data frames or matrices. You can again see this with class, only this time you specify a part of the dat object with $

class(dat$genes)

## [1] "data.frame"

class(dat$E)

## [1] "matrix" "array"

class(dat$targets)

## [1] "data.frame"

class(dat$weights)

## [1] "matrix" "array"

class(dat$design)

## [1] "matrix" "array"

Notice that you get the message Loading required package: limma. If you did not have limma installed, you could not work with these data because they are a data type specific to limma. Hence, why we installed this package previously.

Or going deeper, you can specify one column in the genes data frame

class(dat$genes$hgnc_symbol)

## [1] "character"

Of note, working with S4 objects is very similar to S3 except that they are accessed with @ instead of $. However, we will not use S4 in this workshop.

Working with vectors and data frames

Operating on vectors

A large proportion of R functions operate on vectors to perform quick computations over their values. Here are some examples:

# Compute the variance of total number of sequences
var(meta$total_seq)

## [1] 7.843553e+12

# Find whether any samples have greater than 10 million sequences
meta$total_seq > 10E6

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE
## [13]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE

# Find the unique values of metadata's condition
unique(meta$condition)

## [1] "Media" "Mtb"

Using the correct class

Functions executed on an object in R may respond exclusively to one or more data types or may respond differently depending on the data type. When you use the incorrect data type, you will get an error or warning message. For example, you cannot take the mean of a factor or character.

# Compute the mean of libID
mean(meta$libID)

## Warning in mean.default(meta$libID): argument is not numeric or logical:
## returning NA

## [1] NA

Subsetting vectors and data frames

Since vectors are one dimensional arrays of a defined length, their individual values can be retrieved using vector indices. R uses 1-based indexing, meaning the first value in an R vector corresponds to the index 1. (Importantly, if you use python, that language is 0-based, meaning the first value is index 0.) Each subsequent element increases the index by 1. For example, we can extract the value of the 5th element of the libID vector using the square bracket operator [ ] like so.

meta$libID[5]

## [1] "pt03_Media"

In contrast, data frames are two dimensional arrays so indexing is done across both dimensions as [rows, columns]. So, we can extract the same library ID value directly from the data frame knowing it is in the 5th row and 1st column.

meta[5, 1]

## [1] "pt03_Media"

The square bracket operator is often used with logical vectors (TRUE/FALSE) to subset data. For example, we can subset our metadata to all Media observations (rows).

# Create logical vector for which condition is Media
logical.vector <- meta$condition == "Media"
#View vector
logical.vector

##  [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
## [13]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

#Apply vector to data frame to select only observations where the logical vector is TRUE
meta[logical.vector, ]

##         libID ptID condition age_dys sex ptID_old RNAseq methylation total_seq
## 1  pt01_Media pt01     Media   12410   M  pt00001   TRUE       FALSE   9114402
## 3  pt02_Media pt02     Media   12775   M  pt00002   TRUE       FALSE   9221555
## 5  pt03_Media pt03     Media   11315   M  pt00003   TRUE       FALSE   6231728
## 7  pt04_Media pt04     Media    8395   M  pt00004   TRUE        TRUE  10205557
## 9  pt05_Media pt05     Media    7300   M  pt00005   TRUE       FALSE  15536685
## 11 pt06_Media pt06     Media    6570   F  pt00006   TRUE       FALSE   7085995
## 13 pt07_Media pt07     Media    7665   F  pt00007   TRUE       FALSE  10706098
## 15 pt08_Media pt08     Media    8760   M  pt00008   TRUE       FALSE   9957906
## 17 pt09_Media pt09     Media    6935   M  pt00009   TRUE       FALSE  13055276
## 19 pt10_Media pt10     Media    8030   F  pt00010   TRUE       FALSE   8216706

Subsetting is extremely useful when working with large data. We will learn more complex subsets on day 2 using the tidyverse. But first…

Quick reference: Conditional statements

Statement	Meaning
`<-`	Assign to object in environment
`==`	Equal to
`!=`	Not equal to
`>`	Greater than
`>=`	Greater than or equal to
`<`	Less than
`<=`	Less than or equal to
`%in%`	In or within
`is.na()`	Is missing, e.g NA
`!is.na()`	Is not missing
`&`	And
`\|`	Or

Exercises: Intro R

Using help to identify the necessary arguments for the log function, compute the natural logarithm of 4, base 2 logarithm of 4, and base 4 logarithm of 4.

Using the meta data frame:

Using an R function, determine what data type the ptID_old variable is.
Using indexing and the square bracket operator []:
- determine what libID value occurs in the 9th row
- return the libID for a library where age_dys equals 7300 and condition equals Media
Subset the data to observations where ptID equals “pt05” or “pt08”. Hint: Use a logical vector with the conditional statements above.

2. Introduction to the tidyverse

In this workshop, we introduce you to the R tidyverse, a suite of packages for data manipulation and visualization in R. In it, we cover how to:

Subset rows/columns
Create new variables
Calculate summary statistics
Convert wide/long formats
Merge data frames

The tidyverse is not just one package, but a collection of several packages that are designed to work well together for data analysis tasks. Let’s load the tidyverse. For more information on installation, see the setup instructions.

library(tidyverse)

We also load the metadata, this time using the tidyverse function read_csv. This function is very similar to base R read.csv, except that it automatically tells you the structure of the loaded data, performs better when determining different classes of data within a data frame, and save the object as a tibble.

meta <- read_csv("data/metadata.csv")

## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

class(meta)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

A tibble is just a data frame with some extra hidden formatting. You will not need to interface with these hidden features and all the following functions would behave the same if you’d read in a simple data.frame with read.csv.

Take a look at the dataset with View():

View(meta)

Here is a brief description of the variables in these data.

libID (character): Unique library ID for each sequencing library. Formatted as ptID_condition
ptID (character): Patient ID. Each patient has 2 libraries which differ in their condition
condition (character): Experimental condition. Media for media control. Mtb of M. tuberculosis infected.
age_dys (numeric): Age in ages
sex (character): Biological sex, M or F
ptID_old (character): Old patient ID with leading zeroes
RNAseq (logical): If the library has pass-filter RNA-seq data
methylation (logical): If the library has pass-filter methylation data
total_seq (numeric): Total number of pass-filter sequences

Intro dplyr

The first tidyverse package we will use is dplyr, which is for data manipulation. You do not need to load this package with library as it was already loaded as part of the tidyverse.

select

You can use the select function to keep only the columns that you want and remove columns that you don’t need. The first argument to the select function is our data frame, and the second argument is the name of the column we want to keep.

select(meta, ptID)

## # A tibble: 20 × 1
##    ptID 
##    <chr>
##  1 pt01 
##  2 pt01 
##  3 pt02 
##  4 pt02 
##  5 pt03 
##  6 pt03 
##  7 pt04 
##  8 pt04 
##  9 pt05 
## 10 pt05 
## 11 pt06 
## 12 pt06 
## 13 pt07 
## 14 pt07 
## 15 pt08 
## 16 pt08 
## 17 pt09 
## 18 pt09 
## 19 pt10 
## 20 pt10

We can continue to list multiple columns to keep by separating with commas.

select(meta, ptID, total_seq)

## # A tibble: 20 × 2
##    ptID  total_seq
##    <chr>     <dbl>
##  1 pt01   9114402.
##  2 pt01   8918699.
##  3 pt02   9221555.
##  4 pt02   7733260.
##  5 pt03   6231728.
##  6 pt03   7105193.
##  7 pt04  10205557.
##  8 pt04   8413543.
##  9 pt05  15536685.
## 10 pt05  15509446.
## 11 pt06   7085995.
## 12 pt06   6588160.
## 13 pt07  10706098.
## 14 pt07   8576245.
## 15 pt08   9957906.
## 16 pt08   8220348.
## 17 pt09  13055276 
## 18 pt09  13800442.
## 19 pt10   8216706.
## 20 pt10   7599609.

If you want to keep all but one column, put the minus (-) sign in front of it. I like to think of it as “subtracting” the column from the data frame.

select(meta, -ptID_old)

## # A tibble: 20 × 8
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.
##  2 pt01_Mtb   pt01  Mtb         12410 M     TRUE   FALSE        8918699.
##  3 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.
##  4 pt02_Mtb   pt02  Mtb         12775 M     TRUE   FALSE        7733260.
##  5 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.
##  6 pt03_Mtb   pt03  Mtb         11315 M     TRUE   FALSE        7105193.
##  7 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.
##  8 pt04_Mtb   pt04  Mtb          8395 M     TRUE   TRUE         8413543.
##  9 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.
## 10 pt05_Mtb   pt05  Mtb          7300 M     TRUE   FALSE       15509446.
## 11 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.
## 12 pt06_Mtb   pt06  Mtb          6570 F     TRUE   FALSE        6588160.
## 13 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.
## 14 pt07_Mtb   pt07  Mtb          7665 F     TRUE   FALSE        8576245.
## 15 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.
## 16 pt08_Mtb   pt08  Mtb          8760 M     TRUE   FALSE        8220348.
## 17 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276 
## 18 pt09_Mtb   pt09  Mtb          6935 M     TRUE   FALSE       13800442.
## 19 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.
## 20 pt10_Mtb   pt10  Mtb          8030 F     TRUE   FALSE        7599609.

filter

You can use filter to remove/keep rows. Instead of specifying a column name as in select, you provide a statement that gives TRUE/FALSE and filter keeps all the rows for which that statement is TRUE. The conditional operators we saw in Intro R will help with this!

Let’s filter to just the Media samples.

filter(meta, condition == "Media")

## # A tibble: 10 × 9
##    libID      ptID  condition age_dys sex   ptID_old RNAseq methylation total_…¹
##    <chr>      <chr> <chr>       <dbl> <chr> <chr>    <lgl>  <lgl>          <dbl>
##  1 pt01_Media pt01  Media       12410 M     pt00001  TRUE   FALSE         9.11e6
##  2 pt02_Media pt02  Media       12775 M     pt00002  TRUE   FALSE         9.22e6
##  3 pt03_Media pt03  Media       11315 M     pt00003  TRUE   FALSE         6.23e6
##  4 pt04_Media pt04  Media        8395 M     pt00004  TRUE   TRUE          1.02e7
##  5 pt05_Media pt05  Media        7300 M     pt00005  TRUE   FALSE         1.55e7
##  6 pt06_Media pt06  Media        6570 F     pt00006  TRUE   FALSE         7.09e6
##  7 pt07_Media pt07  Media        7665 F     pt00007  TRUE   FALSE         1.07e7
##  8 pt08_Media pt08  Media        8760 M     pt00008  TRUE   FALSE         9.96e6
##  9 pt09_Media pt09  Media        6935 M     pt00009  TRUE   FALSE         1.31e7
## 10 pt10_Media pt10  Media        8030 F     pt00010  TRUE   FALSE         8.22e6
## # … with abbreviated variable name ¹total_seq

This is the same as what we did in Intro R with the following, only it is more readable and will fit into a tidyverse workflow. More on this when we get to pipes!

meta[meta$condition == "Media", ]

## # A tibble: 10 × 9
##    libID      ptID  condition age_dys sex   ptID_old RNAseq methylation total_…¹
##    <chr>      <chr> <chr>       <dbl> <chr> <chr>    <lgl>  <lgl>          <dbl>
##  1 pt01_Media pt01  Media       12410 M     pt00001  TRUE   FALSE         9.11e6
##  2 pt02_Media pt02  Media       12775 M     pt00002  TRUE   FALSE         9.22e6
##  3 pt03_Media pt03  Media       11315 M     pt00003  TRUE   FALSE         6.23e6
##  4 pt04_Media pt04  Media        8395 M     pt00004  TRUE   TRUE          1.02e7
##  5 pt05_Media pt05  Media        7300 M     pt00005  TRUE   FALSE         1.55e7
##  6 pt06_Media pt06  Media        6570 F     pt00006  TRUE   FALSE         7.09e6
##  7 pt07_Media pt07  Media        7665 F     pt00007  TRUE   FALSE         1.07e7
##  8 pt08_Media pt08  Media        8760 M     pt00008  TRUE   FALSE         9.96e6
##  9 pt09_Media pt09  Media        6935 M     pt00009  TRUE   FALSE         1.31e7
## 10 pt10_Media pt10  Media        8030 F     pt00010  TRUE   FALSE         8.22e6
## # … with abbreviated variable name ¹total_seq

You can string together multiple statements and ask if both/all are TRUE with and & or if at least one is TRUE with or |. Notice that the & statement removes all rows because no library has both a Media and Mtb condition. In contrast, the | statement keeps all rows because all libraries are either the Media or Mtb condition.

filter(meta, condition == "Media" & condition == "Mtb")

## # A tibble: 0 × 9
## # … with 9 variables: libID <chr>, ptID <chr>, condition <chr>, age_dys <dbl>,
## #   sex <chr>, ptID_old <chr>, RNAseq <lgl>, methylation <lgl>, total_seq <dbl>

filter(meta, condition == "Media" | condition == "Mtb")

## # A tibble: 20 × 9
##    libID      ptID  condition age_dys sex   ptID_old RNAseq methylation total_…¹
##    <chr>      <chr> <chr>       <dbl> <chr> <chr>    <lgl>  <lgl>          <dbl>
##  1 pt01_Media pt01  Media       12410 M     pt00001  TRUE   FALSE         9.11e6
##  2 pt01_Mtb   pt01  Mtb         12410 M     pt00001  TRUE   FALSE         8.92e6
##  3 pt02_Media pt02  Media       12775 M     pt00002  TRUE   FALSE         9.22e6
##  4 pt02_Mtb   pt02  Mtb         12775 M     pt00002  TRUE   FALSE         7.73e6
##  5 pt03_Media pt03  Media       11315 M     pt00003  TRUE   FALSE         6.23e6
##  6 pt03_Mtb   pt03  Mtb         11315 M     pt00003  TRUE   FALSE         7.11e6
##  7 pt04_Media pt04  Media        8395 M     pt00004  TRUE   TRUE          1.02e7
##  8 pt04_Mtb   pt04  Mtb          8395 M     pt00004  TRUE   TRUE          8.41e6
##  9 pt05_Media pt05  Media        7300 M     pt00005  TRUE   FALSE         1.55e7
## 10 pt05_Mtb   pt05  Mtb          7300 M     pt00005  TRUE   FALSE         1.55e7
## 11 pt06_Media pt06  Media        6570 F     pt00006  TRUE   FALSE         7.09e6
## 12 pt06_Mtb   pt06  Mtb          6570 F     pt00006  TRUE   FALSE         6.59e6
## 13 pt07_Media pt07  Media        7665 F     pt00007  TRUE   FALSE         1.07e7
## 14 pt07_Mtb   pt07  Mtb          7665 F     pt00007  TRUE   FALSE         8.58e6
## 15 pt08_Media pt08  Media        8760 M     pt00008  TRUE   FALSE         9.96e6
## 16 pt08_Mtb   pt08  Mtb          8760 M     pt00008  TRUE   FALSE         8.22e6
## 17 pt09_Media pt09  Media        6935 M     pt00009  TRUE   FALSE         1.31e7
## 18 pt09_Mtb   pt09  Mtb          6935 M     pt00009  TRUE   FALSE         1.38e7
## 19 pt10_Media pt10  Media        8030 F     pt00010  TRUE   FALSE         8.22e6
## 20 pt10_Mtb   pt10  Mtb          8030 F     pt00010  TRUE   FALSE         7.60e6
## # … with abbreviated variable name ¹total_seq

Quotes vs No Quotes

Notice that we put the word “Media” inside quotes. This is because we are not using a column from inside our data frame. When you need to include actual text values in R, they will be placed inside quotes to tell them apart from other object or variable names.

The general rule is that if you want to use an R object (data frame, column, etc), then you supply the name without quotes. If you want to specify a value within the R object, then use quotes.

rename

Sometimes your variable names are not ideal. rename does just want it says; it renames columns in your data. The syntax for rename is to give the new name first and then tell R what that new name should replace, so new_name = old_name. Here, we rename condition to infection as this better represents those data.

rename(meta, infection = condition)

## # A tibble: 20 × 9
##    libID      ptID  infection age_dys sex   ptID_old RNAseq methylation total_…¹
##    <chr>      <chr> <chr>       <dbl> <chr> <chr>    <lgl>  <lgl>          <dbl>
##  1 pt01_Media pt01  Media       12410 M     pt00001  TRUE   FALSE         9.11e6
##  2 pt01_Mtb   pt01  Mtb         12410 M     pt00001  TRUE   FALSE         8.92e6
##  3 pt02_Media pt02  Media       12775 M     pt00002  TRUE   FALSE         9.22e6
##  4 pt02_Mtb   pt02  Mtb         12775 M     pt00002  TRUE   FALSE         7.73e6
##  5 pt03_Media pt03  Media       11315 M     pt00003  TRUE   FALSE         6.23e6
##  6 pt03_Mtb   pt03  Mtb         11315 M     pt00003  TRUE   FALSE         7.11e6
##  7 pt04_Media pt04  Media        8395 M     pt00004  TRUE   TRUE          1.02e7
##  8 pt04_Mtb   pt04  Mtb          8395 M     pt00004  TRUE   TRUE          8.41e6
##  9 pt05_Media pt05  Media        7300 M     pt00005  TRUE   FALSE         1.55e7
## 10 pt05_Mtb   pt05  Mtb          7300 M     pt00005  TRUE   FALSE         1.55e7
## 11 pt06_Media pt06  Media        6570 F     pt00006  TRUE   FALSE         7.09e6
## 12 pt06_Mtb   pt06  Mtb          6570 F     pt00006  TRUE   FALSE         6.59e6
## 13 pt07_Media pt07  Media        7665 F     pt00007  TRUE   FALSE         1.07e7
## 14 pt07_Mtb   pt07  Mtb          7665 F     pt00007  TRUE   FALSE         8.58e6
## 15 pt08_Media pt08  Media        8760 M     pt00008  TRUE   FALSE         9.96e6
## 16 pt08_Mtb   pt08  Mtb          8760 M     pt00008  TRUE   FALSE         8.22e6
## 17 pt09_Media pt09  Media        6935 M     pt00009  TRUE   FALSE         1.31e7
## 18 pt09_Mtb   pt09  Mtb          6935 M     pt00009  TRUE   FALSE         1.38e7
## 19 pt10_Media pt10  Media        8030 F     pt00010  TRUE   FALSE         8.22e6
## 20 pt10_Mtb   pt10  Mtb          8030 F     pt00010  TRUE   FALSE         7.60e6
## # … with abbreviated variable name ¹total_seq

summarize

The function summarize() performs summary statistics on your data. We can use it to find the mean, median, or other statistics about the dataset. Let’s find the average age in days below:

summarize(meta, mean(age_dys))

## # A tibble: 1 × 1
##   `mean(age_dys)`
##             <dbl>
## 1           9016.

Notice that summarize automatically names the column with the summary statistic based on the function used to calculate that statistic. This is not ideal as the formatting has parentheses. So we can specify a name instead.

summarize(meta, mean_age_dys = mean(age_dys))

## # A tibble: 1 × 1
##   mean_age_dys
##          <dbl>
## 1        9016.

pipes

Instead of including the data as an argument, we can use the pipe operator %>% to pass the data value into the summarize function.

meta %>% summarize(mean_age_dys = mean(age_dys))

## # A tibble: 1 × 1
##   mean_age_dys
##          <dbl>
## 1        9016.

This line of code will do the exact same thing as our previous summary command, but the piping function tells R to use the meta data frame as the first argument in the next function. This lets us “chain” together multiple functions into a pipeline. The pipe takes the output from the left side and use it as input to the right side.

We can also add an Enter to make it look nicer:

meta %>%
  summarize(mean_age_days = mean(age_dys))

## # A tibble: 1 × 1
##   mean_age_days
##           <dbl>
## 1         9016.

Using the pipe operator %>% and enter command makes our code more readable. The pipe operator %>% also helps to avoid using nested function like f(g(x)) and minimizes the need for new variables.

Since we use the pipe operator so often, there is a keyboard shortcut for it in RStudio. You can press Ctrl+Shift+M on Windows or Cmd+Shift+M on a Mac.

group_by

The pipe comes in handy with functions like group_by. If we want to calculate the mean age in days for males and females separately, we can use group_by before summarize:

meta %>% 
  group_by(sex) %>% 
  summarize(mean_age_days = mean(age_dys))

## # A tibble: 2 × 2
##   sex   mean_age_days
##   <chr>         <dbl>
## 1 F             7422.
## 2 M             9699.

And we could merge this with a filter to only look at the Media libraries, since each age is repeated for that patient’s Media and Mtb libraries. The results are the same in this case but filtering would be important if any of our patient’s were missing one of the conditions.

meta %>% 
  filter(condition == "Media") %>% 
  group_by(sex) %>% 
  summarize(mean_age_days = mean(age_dys))

## # A tibble: 2 × 2
##   sex   mean_age_days
##   <chr>         <dbl>
## 1 F             7422.
## 2 M             9699.

mutate

Sometimes we would like to create a new column based on the values in an existing one. For example, age is days is a little difficult to understand. Let’s add a column with age in years.

meta %>% 
  mutate(age_yrs = age_dys / 365.25)

## # A tibble: 20 × 10
##    libID      ptID  condi…¹ age_dys sex   ptID_…² RNAseq methy…³ total…⁴ age_yrs
##    <chr>      <chr> <chr>     <dbl> <chr> <chr>   <lgl>  <lgl>     <dbl>   <dbl>
##  1 pt01_Media pt01  Media     12410 M     pt00001 TRUE   FALSE    9.11e6    34.0
##  2 pt01_Mtb   pt01  Mtb       12410 M     pt00001 TRUE   FALSE    8.92e6    34.0
##  3 pt02_Media pt02  Media     12775 M     pt00002 TRUE   FALSE    9.22e6    35.0
##  4 pt02_Mtb   pt02  Mtb       12775 M     pt00002 TRUE   FALSE    7.73e6    35.0
##  5 pt03_Media pt03  Media     11315 M     pt00003 TRUE   FALSE    6.23e6    31.0
##  6 pt03_Mtb   pt03  Mtb       11315 M     pt00003 TRUE   FALSE    7.11e6    31.0
##  7 pt04_Media pt04  Media      8395 M     pt00004 TRUE   TRUE     1.02e7    23.0
##  8 pt04_Mtb   pt04  Mtb        8395 M     pt00004 TRUE   TRUE     8.41e6    23.0
##  9 pt05_Media pt05  Media      7300 M     pt00005 TRUE   FALSE    1.55e7    20.0
## 10 pt05_Mtb   pt05  Mtb        7300 M     pt00005 TRUE   FALSE    1.55e7    20.0
## 11 pt06_Media pt06  Media      6570 F     pt00006 TRUE   FALSE    7.09e6    18.0
## 12 pt06_Mtb   pt06  Mtb        6570 F     pt00006 TRUE   FALSE    6.59e6    18.0
## 13 pt07_Media pt07  Media      7665 F     pt00007 TRUE   FALSE    1.07e7    21.0
## 14 pt07_Mtb   pt07  Mtb        7665 F     pt00007 TRUE   FALSE    8.58e6    21.0
## 15 pt08_Media pt08  Media      8760 M     pt00008 TRUE   FALSE    9.96e6    24.0
## 16 pt08_Mtb   pt08  Mtb        8760 M     pt00008 TRUE   FALSE    8.22e6    24.0
## 17 pt09_Media pt09  Media      6935 M     pt00009 TRUE   FALSE    1.31e7    19.0
## 18 pt09_Mtb   pt09  Mtb        6935 M     pt00009 TRUE   FALSE    1.38e7    19.0
## 19 pt10_Media pt10  Media      8030 F     pt00010 TRUE   FALSE    8.22e6    22.0
## 20 pt10_Mtb   pt10  Mtb        8030 F     pt00010 TRUE   FALSE    7.60e6    22.0
## # … with abbreviated variable names ¹condition, ²ptID_old, ³methylation,
## #   ⁴total_seq

Note that this new column is not saved in meta because we printed the result to the console and did not save it to the environment. We can overwrite the meta data frame so it will now contains the age_yrs column.

meta <- meta %>% 
  mutate(age_yrs = age_dys / 365.25)
meta

## # A tibble: 20 × 10
##    libID      ptID  condi…¹ age_dys sex   ptID_…² RNAseq methy…³ total…⁴ age_yrs
##    <chr>      <chr> <chr>     <dbl> <chr> <chr>   <lgl>  <lgl>     <dbl>   <dbl>
##  1 pt01_Media pt01  Media     12410 M     pt00001 TRUE   FALSE    9.11e6    34.0
##  2 pt01_Mtb   pt01  Mtb       12410 M     pt00001 TRUE   FALSE    8.92e6    34.0
##  3 pt02_Media pt02  Media     12775 M     pt00002 TRUE   FALSE    9.22e6    35.0
##  4 pt02_Mtb   pt02  Mtb       12775 M     pt00002 TRUE   FALSE    7.73e6    35.0
##  5 pt03_Media pt03  Media     11315 M     pt00003 TRUE   FALSE    6.23e6    31.0
##  6 pt03_Mtb   pt03  Mtb       11315 M     pt00003 TRUE   FALSE    7.11e6    31.0
##  7 pt04_Media pt04  Media      8395 M     pt00004 TRUE   TRUE     1.02e7    23.0
##  8 pt04_Mtb   pt04  Mtb        8395 M     pt00004 TRUE   TRUE     8.41e6    23.0
##  9 pt05_Media pt05  Media      7300 M     pt00005 TRUE   FALSE    1.55e7    20.0
## 10 pt05_Mtb   pt05  Mtb        7300 M     pt00005 TRUE   FALSE    1.55e7    20.0
## 11 pt06_Media pt06  Media      6570 F     pt00006 TRUE   FALSE    7.09e6    18.0
## 12 pt06_Mtb   pt06  Mtb        6570 F     pt00006 TRUE   FALSE    6.59e6    18.0
## 13 pt07_Media pt07  Media      7665 F     pt00007 TRUE   FALSE    1.07e7    21.0
## 14 pt07_Mtb   pt07  Mtb        7665 F     pt00007 TRUE   FALSE    8.58e6    21.0
## 15 pt08_Media pt08  Media      8760 M     pt00008 TRUE   FALSE    9.96e6    24.0
## 16 pt08_Mtb   pt08  Mtb        8760 M     pt00008 TRUE   FALSE    8.22e6    24.0
## 17 pt09_Media pt09  Media      6935 M     pt00009 TRUE   FALSE    1.31e7    19.0
## 18 pt09_Mtb   pt09  Mtb        6935 M     pt00009 TRUE   FALSE    1.38e7    19.0
## 19 pt10_Media pt10  Media      8030 F     pt00010 TRUE   FALSE    8.22e6    22.0
## 20 pt10_Mtb   pt10  Mtb        8030 F     pt00010 TRUE   FALSE    7.60e6    22.0
## # … with abbreviated variable names ¹condition, ²ptID_old, ³methylation,
## #   ⁴total_seq

Exercises: dplyr

Create a new column for total sequences in million total_seq_millions based on total_seq.
What is another way to end up with only the Media rows instead of condition == "Media"?
Try calculating the mean total number of sequences for the Media and Mtb conditions.

Intro tidyr

tidyr is another data manipulation package in tidyverse. It contains functions for tidying data and generally acts on the entire data frame, as opposed to dplyr which most often works on a subset of columns or rows.

pivot

Data comes in many shapes and sizes, and one way we classify data is either “wide” or “long.” Data that is “long” has one row per observation. The metadata is in a long format. We have one row for each patient sample and each variable is in a different column. We might also describe these data as “tidy” because it makes it easy to work with ggplot2 and dplyr functions (this is where the “tidy” in “tidyverse” comes from). As tidy as it may be, sometimes we may want our data in a “wide” format. Typically, in “wide” format, each row represents a group of observations and each value is placed in a different column rather than a different row. For example, maybe we want each patient to have 1 row and their Media/Mtb data to be spread across columns.

The tidyr package in the tidyverse contains the functions pivot_wider and pivot_longer that make it easy to switch between the two formats.

pivot_wider

For each patient, we have two samples: Media and Mtb. In the metadata, the only difference in these two conditions is the libID column (which is redundant with the ptID and condition) and the total_seq column. We can take the condition column and create two new columns, one with the total seqs in the Media sample and one with the total seqs in the Mtb sample. This is called “pivoting” the data frame “wider” because we rearrange it by creating an additional column and making it have fewer rows. Let’s try it!

meta %>% 
  select(-libID) %>% 
  pivot_wider(names_from = condition, values_from = total_seq)

## # A tibble: 10 × 9
##    ptID  age_dys sex   ptID_old RNAseq methylation age_yrs     Media       Mtb
##    <chr>   <dbl> <chr> <chr>    <lgl>  <lgl>         <dbl>     <dbl>     <dbl>
##  1 pt01    12410 M     pt00001  TRUE   FALSE          34.0  9114402.  8918699.
##  2 pt02    12775 M     pt00002  TRUE   FALSE          35.0  9221555.  7733260.
##  3 pt03    11315 M     pt00003  TRUE   FALSE          31.0  6231728.  7105193.
##  4 pt04     8395 M     pt00004  TRUE   TRUE           23.0 10205557.  8413543.
##  5 pt05     7300 M     pt00005  TRUE   FALSE          20.0 15536685. 15509446.
##  6 pt06     6570 F     pt00006  TRUE   FALSE          18.0  7085995.  6588160.
##  7 pt07     7665 F     pt00007  TRUE   FALSE          21.0 10706098.  8576245.
##  8 pt08     8760 M     pt00008  TRUE   FALSE          24.0  9957906.  8220348.
##  9 pt09     6935 M     pt00009  TRUE   FALSE          19.0 13055276  13800442.
## 10 pt10     8030 F     pt00010  TRUE   FALSE          22.0  8216706.  7599609.

Notice here that we tell pivot_wider which columns to pull the names we wish our new columns to be named from the condition variable, and the values to populate those columns from the total_seq variable. (Again, neither of which have to be in quotes in the code when there are no special characters or spaces - certainly an incentive not to use special characters or spaces!) We see that the resulting table has new columns by condition, and the values populate it with our remaining variables dictating the rows.

Maybe we should assign those columns more informative names.

meta %>% 
  select(-libID) %>% 
  pivot_wider(names_from = condition, values_from = total_seq,
              names_prefix = "total_seq_")

## # A tibble: 10 × 9
##    ptID  age_dys sex   ptID_old RNAseq methylation age_yrs total_seq_M…¹ total…²
##    <chr>   <dbl> <chr> <chr>    <lgl>  <lgl>         <dbl>         <dbl>   <dbl>
##  1 pt01    12410 M     pt00001  TRUE   FALSE          34.0      9114402.  8.92e6
##  2 pt02    12775 M     pt00002  TRUE   FALSE          35.0      9221555.  7.73e6
##  3 pt03    11315 M     pt00003  TRUE   FALSE          31.0      6231728.  7.11e6
##  4 pt04     8395 M     pt00004  TRUE   TRUE           23.0     10205557.  8.41e6
##  5 pt05     7300 M     pt00005  TRUE   FALSE          20.0     15536685.  1.55e7
##  6 pt06     6570 F     pt00006  TRUE   FALSE          18.0      7085995.  6.59e6
##  7 pt07     7665 F     pt00007  TRUE   FALSE          21.0     10706098.  8.58e6
##  8 pt08     8760 M     pt00008  TRUE   FALSE          24.0      9957906.  8.22e6
##  9 pt09     6935 M     pt00009  TRUE   FALSE          19.0     13055276   1.38e7
## 10 pt10     8030 F     pt00010  TRUE   FALSE          22.0      8216706.  7.60e6
## # … with abbreviated variable names ¹total_seq_Media, ²total_seq_Mtb

And now let’s save the new wider metadata to a new object.

meta_wide <- meta %>% 
  select(-libID) %>% 
  pivot_wider(names_from = condition, values_from = total_seq)

Notice that the number of rows and columns has changed.

dim(meta)

## [1] 20 10

dim(meta_wide)

## [1] 10  9

pivot_longer

Everything we did with pivot_wider, we can reverse with pivot_longer. Maybe you receive a data frame in a wide format, but you need it in a long format. Here’s how we can get back to a long data frame.

meta_wide %>% 
  pivot_longer(c(Media, Mtb), 
               names_to = "condition", values_to = "total_seq")

## # A tibble: 20 × 9
##    ptID  age_dys sex   ptID_old RNAseq methylation age_yrs condition total_seq
##    <chr>   <dbl> <chr> <chr>    <lgl>  <lgl>         <dbl> <chr>         <dbl>
##  1 pt01    12410 M     pt00001  TRUE   FALSE          34.0 Media      9114402.
##  2 pt01    12410 M     pt00001  TRUE   FALSE          34.0 Mtb        8918699.
##  3 pt02    12775 M     pt00002  TRUE   FALSE          35.0 Media      9221555.
##  4 pt02    12775 M     pt00002  TRUE   FALSE          35.0 Mtb        7733260.
##  5 pt03    11315 M     pt00003  TRUE   FALSE          31.0 Media      6231728.
##  6 pt03    11315 M     pt00003  TRUE   FALSE          31.0 Mtb        7105193.
##  7 pt04     8395 M     pt00004  TRUE   TRUE           23.0 Media     10205557.
##  8 pt04     8395 M     pt00004  TRUE   TRUE           23.0 Mtb        8413543.
##  9 pt05     7300 M     pt00005  TRUE   FALSE          20.0 Media     15536685.
## 10 pt05     7300 M     pt00005  TRUE   FALSE          20.0 Mtb       15509446.
## 11 pt06     6570 F     pt00006  TRUE   FALSE          18.0 Media      7085995.
## 12 pt06     6570 F     pt00006  TRUE   FALSE          18.0 Mtb        6588160.
## 13 pt07     7665 F     pt00007  TRUE   FALSE          21.0 Media     10706098.
## 14 pt07     7665 F     pt00007  TRUE   FALSE          21.0 Mtb        8576245.
## 15 pt08     8760 M     pt00008  TRUE   FALSE          24.0 Media      9957906.
## 16 pt08     8760 M     pt00008  TRUE   FALSE          24.0 Mtb        8220348.
## 17 pt09     6935 M     pt00009  TRUE   FALSE          19.0 Media     13055276 
## 18 pt09     6935 M     pt00009  TRUE   FALSE          19.0 Mtb       13800442.
## 19 pt10     8030 F     pt00010  TRUE   FALSE          22.0 Media      8216706.
## 20 pt10     8030 F     pt00010  TRUE   FALSE          22.0 Mtb        7599609.

In the above case, the condition and total_seq names need to be in quotes because they are not yet in our data object. They are instead just characters used to name the new columns. If we forgot the quotes, we’d get an error because R cannot find a condition column in meta_wide as no such column exists.

meta_wide %>% 
  pivot_longer(c(Media, Mtb), 
               names_to = condition, values_to = "total_seq")

## Error in build_longer_spec(data, !!cols, names_to = names_to, values_to = values_to, : object 'condition' not found

colnames(meta_wide)

## [1] "ptID"        "age_dys"     "sex"         "ptID_old"    "RNAseq"     
## [6] "methylation" "age_yrs"     "Media"       "Mtb"

More from dplyr

join

Often not all your data is pre-compiled in one nice, tidy data frame. You may have multiple files, batches, etc to work with. dplyr’s join functions allow us to combine data frames without fear of copy-paste errors or missing anything!

So far, we’ve been working with just the metadata - a single data frame. Let’s adventure into the full RNA-seq dat object so we have multiple data frames to combine.

load("data/dat_voom.RData")

First, we have the normalized log2 counts per million (CPM) in E

dat$E[1:3,1:3]

##       pt01_Media  pt01_Mtb pt02_Media
## A1BG    1.620022  1.717603   1.086946
## A2M     7.259912  6.680758   5.939909
## A2ML1  -4.052404 -3.515057  -4.015712

Second, we have the same metadata as our meta object in the targets data.

dat$targets[1:3,]

##            group lib.size norm.factors      libID ptID condition age_dys sex
## pt01_Media     1  8295928     1.092872 pt01_Media pt01     Media   12410   M
## pt01_Mtb       1  5716203     0.819490   pt01_Mtb pt01       Mtb   12410   M
## pt02_Media     1  8087601     1.119352 pt02_Media pt02     Media   12775   M
##            ptID_old RNAseq methylation total_seq sample.weights
## pt01_Media  pt00001   TRUE       FALSE   9114402      0.8943797
## pt01_Mtb    pt00001   TRUE       FALSE   8918699      0.9850363
## pt02_Media  pt00002   TRUE       FALSE   9221555      1.2582092

Looking at these two data frames, they do not share any columns - thus there is nothing to join them with. However, notice that the column names in the counts data are the libIDs from the metadata! We can use pivot_longer to get the joining column we need.

Since E is a matrix with rownames, we must force it to be a data frame and move the rownames into a data column with some additional functions.

E_long <- as.data.frame(dat$E) %>% 
  rownames_to_column("gene") %>% 
  pivot_longer(-gene, names_to = "libID", values_to = "log2CPM")

E_long

## # A tibble: 268,380 × 3
##    gene  libID      log2CPM
##    <chr> <chr>        <dbl>
##  1 A1BG  pt01_Media  1.62  
##  2 A1BG  pt01_Mtb    1.72  
##  3 A1BG  pt02_Media  1.09  
##  4 A1BG  pt02_Mtb    1.10  
##  5 A1BG  pt03_Media  0.874 
##  6 A1BG  pt03_Mtb    1.71  
##  7 A1BG  pt04_Media  1.57  
##  8 A1BG  pt04_Mtb   -0.264 
##  9 A1BG  pt05_Media  1.63  
## 10 A1BG  pt05_Mtb   -0.0305
## # … with 268,370 more rows

Now both data frames have a libID column!

Let’s combine our sequence counts with our metadata. To combine two data frames, we will use a join function. The dplyr package has a number of tools for joining data frames together depending on what we want to do with the rows of the data of countries that are not represented in both data frames.

?join

The most common joins are:

inner_join: includes all rows present in BOTH data frames
left_join: includes all rows in the first/left data frame. Remove rows ONLY present in the second/right data frame
right_join: includes all rows in the second/right data frame. Remove rows ONLY present in the first/left data frame
full_join: includes all rows present in EITHER data frame. Fills in rows only present in one data frame with NA in the other

Here, we use inner_join to keep only libraries with both expression data and metadata, which is actually all of them.

full_data <- inner_join(meta, E_long, by = 'libID')

I find this page helpful when working with joins. https://stat545.com/join-cheatsheet.html

Exercises: pivot and join

Filter the metadata to just Media samples. Without running it, outline what the data would look like if you inner_join the long expression data E_long with metadata containing only Media samples.
Similarly, what would happen with full_join?

3. Data visualization in ggplot

In this workshop, we introduce you to data visualization in ggplot (docs.ggplot2.org), the tidyverse package for plots. In it, we cover how to create:

Volcano plot
PCA plot
Customization such as color, shape, theme
Facets

We load ggplot within the tidyverse. For more information on installation, see the setup instructions.

library(tidyverse)

Intro to ggplot

Why ggplot?

ggplot2 is an implementation of Grammar of Graphics (Wilkinson 1999) for R

Benefits:

handsome default settings
snap-together building block approach
automatic legends, colors, facets
statistical overlays like regressions lines and smoothers (with confidence intervals)

Drawbacks:

it can be hard to get it to look exactly the way you want
requires having the input data in a certain format

ggplot layers

ggplot functions by adding layers to a plot, somewhat like how dplyr/tidyr sequentially modify data with the pipe %>%. Instead, ggplot uses the + to combine layers such as:

data: 2D table (data.frame) of variables
aesthetics: map variables to visual attributes (position, color, etc)
geoms: graphical representation of data (points, lines, etc)
stats: statistical transformations to get from data to points in the plot (binning, summarizing, smoothing)
scales: control how to map a variable to an aesthetic
facets: juxtapose mini-plots of data subsets, split by variables
guides: axes, legend, etc. that reflect the variables and their values

The idea is to independently specify and combine layers to create the plot you want. There are at least three things we have to specify to create any plot:

data
aesthetic mappings from data variables to visual properties
a geom describing how to draw those properties

Data

A quick reminder about our data. There are bulk RNA-seq contained in the dat object in file “data/dat_voom.RData”. Normalized log2 counts per million (CPM) are contained in dat$E and library metadata are in dat$targets. You will not need the other pieces of dat for this workshop.

load("data/dat_voom.RData")

dat$E[1:3,1:3]

##       pt01_Media  pt01_Mtb pt02_Media
## A1BG    1.620022  1.717603   1.086946
## A2M     7.259912  6.680758   5.939909
## A2ML1  -4.052404 -3.515057  -4.015712

dat$targets[1:3,]

##            group lib.size norm.factors      libID ptID condition age_dys sex
## pt01_Media     1  8295928     1.092872 pt01_Media pt01     Media   12410   M
## pt01_Mtb       1  5716203     0.819490   pt01_Mtb pt01       Mtb   12410   M
## pt02_Media     1  8087601     1.119352 pt02_Media pt02     Media   12775   M
##            ptID_old RNAseq methylation total_seq sample.weights
## pt01_Media  pt00001   TRUE       FALSE   9114402      0.8943797
## pt01_Mtb    pt00001   TRUE       FALSE   8918699      0.9850363
## pt02_Media  pt00002   TRUE       FALSE   9221555      1.2582092

Here is a brief description of the variables in the metadata.

libID (character): Unique library ID for each sequencing library. Formatted as ptID_condition
ptID (character): Patient ID. Each patient has 2 libraries which differ in their condition
condition (character): Experimental condition. Media for media control. Mtb of M. tuberculosis infected.
age_dys (numeric): Age in ages
sex (character): Biological sex, M or F
ptID_old (character): Old patient ID with leading zeroes
RNAseq (logical): If the library has pass-filter RNA-seq data
methylation (logical): If the library has pass-filter methylation data
total_seq (numeric): Total number of pass-filter sequences

Basic plots

Boxplot

We will start with a boxplot of the total sequences per library. First, we call a ggplot object. Since we’ve given it no data or aesthetics, this is a blank plot.

ggplot()

Then, we give the plot data and specify aesthetics for which parts of the data we want to plot. Now, we see that ggplot knows what the scale of the y data will be. Throughout this section, we will highlight lines of code that change with the comment #<--- see change.

ggplot(dat$targets) +
  aes(y = total_seq) #<--- see change

Finally, we add a geom to denote how we want the data displayed, in this case as a boxplot.

ggplot(dat$targets) +
  aes(y = total_seq) +
  geom_boxplot() #<--- see change

Since we did not specify an x-variable, all the data are plotted in one boxplot. We add an aesthetic to specify x and split up the data.

ggplot(dat$targets) +
  aes(y = total_seq, x = condition) + #<--- see change
  geom_boxplot()

Data ordering with factors

Our x-variable happens to be in the correct alphabetical order with the Media controls before Mtb infection. However, alphabetical does not always give you want you want. You can force a variable’s order by converting it to a factor. Let’s switch the condition order.

dat$targets %>% 
  mutate(condition_fct = factor(condition, 
                            levels = c("Mtb", "Media"))) %>% #<--- see change
  
  ggplot() +
  aes(y = total_seq, x = condition_fct) + 
  geom_boxplot()

Notice how we now pipe %>% the data into ggplot so that we can modify it in dplyr first.

Dotplot

These same data could be displayed as a dotplot instead. We return to the original condition order since it was correct.

ggplot(dat$targets) +
  aes(y = total_seq, x = condition) +
  geom_point() #<--- see change

And if we want to make sure we see every point, a jitter works nicely.

ggplot(dat$targets) +
  aes(y = total_seq, x = condition) +
  geom_jitter() #<--- see change

Multiple geoms

The building block nature of ggplot allows us to add multiple geoms on top of each other. So we can combine the two previous plots like so

ggplot(dat$targets) +
  aes(y = total_seq, x = condition) +
  geom_boxplot() + #<--- see change
  geom_jitter() #<--- see change

But you may notice that there are some duplicate points. Looking back at the first boxplots, there are points for data outside the box. These points remain in the above plot plus appear jittered from the second geom. We can fix this by telling the boxplot to not plot the outliers.

ggplot(dat$targets) +
  aes(y = total_seq, x = condition) +
  geom_boxplot(outlier.shape = NA) + #<--- see change
  geom_jitter()

Layer order matters

Because ggplot works in layers, the order matters. If, for example, we put the boxplot after the jitter, the boxes cover up points! We use alpha for transparency so you can see these points. Thus, be careful when considering how you setup your plot!

ggplot(dat$targets) +
  aes(y = total_seq, x = condition) +
  geom_jitter() + #<--- see change
  geom_boxplot(outlier.shape = NA, alpha = 0.8)  #<--- see change

Why are my dots moving?

With jitter, you are adding random error to be able to see all the points. Because this is truly random, every time you run jitter, you will get a slightly different plot. To combat this, set a seed before making the plot.

set.seed(468) #<--- see change
ggplot(dat$targets) +
  aes(y = total_seq, x = condition) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter()

set.seed(468) #<--- see change
ggplot(dat$targets) +
  aes(y = total_seq, x = condition) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter()

RNA-seq plots

PCA

One dotplot of interest in RNA-seq analysis is principle component analysis (PCA). In this plot, each RNA-seq library is a single point. The axes are unitless but in general, points that are closer together represent libraries with more similar gene expression. You can see large scale changes in gene expression in PCA. However, even when you don’t see clear PCA groupings, there can be significant differentially expressed genes. So don’t despair if your data don’t cluster in this plot!

Here, we calculate PCA from the normalized counts. Note that we transpose t() the counts data to get one PCA point per library, instead of one per gene.

dat.pca <- prcomp(t(dat$E), scale.=TRUE, center=TRUE)

dat.pca$x[1:3,1:3]

##                   PC1       PC2      PC3
## pt01_Media -103.91401 -33.55704 22.40058
## pt01_Mtb     52.95006 -74.68146 24.78278
## pt02_Media  -60.61507  27.13460 24.58148

Then we use join to add metadata to the PCA data, so we can annotate the PCA plot with variables of interest.

dat.pca.meta <- as.data.frame(dat.pca$x) %>% 
  rownames_to_column("libID") %>% 
  inner_join(dat$targets)

## Joining, by = "libID"

A nice trick with joins is to not specify what to join by. In this case, the function finds all columns shared between the data frames and uses them to match rows.

With geom_point, we have a PCA!

ggplot(dat.pca.meta) +
  aes(x = PC1, y = PC2) +
  geom_point()

Color and shape

But it’s not very informative, so let’s annotate with our condition groups. In general, you want to use annotations in the following order: space, color, shape, size. These are ordered by how easily the human eye can discern differences.

We already have space as libraries are separated along x and y. Now we add color.

ggplot(dat.pca.meta) +
  aes(x = PC1, y = PC2, color = condition) + #<--- see change
  geom_point()

I like to keep all my aesthetics in one layer, but you could keep adding aes() functions to the plot if you prefer. This gives the same plot.

ggplot(dat.pca.meta) +
  aes(x = PC1, y = PC2) + #<--- see change
  geom_point() +
  aes(color = condition) #<--- see change

We could also use shape for the same data. Which plot to you think is easier to interpret?

ggplot(dat.pca.meta) +
  aes(x = PC1, y = PC2, shape = condition) + #<--- see change
  geom_point()

We’re not showing size or transparency because these aesthetics are difficult to see and should be avoided unless you really need them or the differences between data points are large enough to see in these aesthetics.

Labels

We can also improve our axes and legend labels. First, we see what proportion of variation is explained by each PC.

summary(dat.pca)$importance

##                             PC1      PC2      PC3      PC4      PC5      PC6
## Standard deviation     57.21487 38.99933 35.32881 34.21306 29.76604 27.13699
## Proportion of Variance  0.24395  0.11334  0.09301  0.08723  0.06603  0.05488
## Cumulative Proportion   0.24395  0.35729  0.45030  0.53753  0.60356  0.65844
##                             PC7      PC8      PC9     PC10     PC11     PC12
## Standard deviation     24.48881 23.98672 21.83296 21.23941 19.84912 18.33097
## Proportion of Variance  0.04469  0.04288  0.03552  0.03362  0.02936  0.02504
## Cumulative Proportion   0.70313  0.74601  0.78153  0.81515  0.84451  0.86955
##                            PC13     PC14     PC15     PC16     PC17     PC18
## Standard deviation     18.05337 17.40958 17.13667 15.67883 14.48455 14.12701
## Proportion of Variance  0.02429  0.02259  0.02188  0.01832  0.01563  0.01487
## Cumulative Proportion   0.89384  0.91642  0.93831  0.95663  0.97226  0.98713
##                            PC19         PC20
## Standard deviation     13.14034 1.137607e-13
## Proportion of Variance  0.01287 0.000000e+00
## Cumulative Proportion   1.00000 1.000000e+00

And use that for labels as well as change the legend.

ggplot(dat.pca.meta) +
  aes(x = PC1, y = PC2, color = condition) +
  geom_point() +
  labs(x = "PC1 (24%)", y = "PC2 (11%)", #<--- see change
       color = "Infection")

Coordinates

Since PCA units are relative, it is nice to have the scales of the x and y be the same. That is, 1 inch of space corresponding to the same values for both. This is accomplished with coord_fixed. Note that in this PCA, this function does not change the plot much but it can can a big effect in other PCA! There are also several other coordinate systems you can see with coord_.

ggplot(dat.pca.meta) +
  aes(x = PC1, y = PC2, color = condition) +
  geom_point() +
  labs(x = "PC1 (24%)", y = "PC2 (11%)",
       color = "Infection") +
  coord_fixed() #<--- see change

Themes

ggplot has a number of themes that change the aesthetics of the plot. I like theme_classic for PCA. Type theme_ into the console to see the available themes suggested with auto-complete. You can also checkout more themes in the package ggthemes.

ggplot(dat.pca.meta) +
  aes(x = PC1, y = PC2, color = condition) +
  geom_point() +
  labs(x = "PC1 (24%)", y = "PC2 (11%)",
       color = "Infection") +
  coord_fixed() +
  theme_classic() #<--- see change

volcano plot

Let’s move on to another application of geom_point. A volcano plot shows differentially expressed gene significance and fold change.

Linear modeling

We’ll need to perform linear modeling to determine these results. Below is the workflow for comparing Mtb-infected vs media samples, corrected for sex and blocked by patient. We use our R package kimma. You DO NOT need to run these steps. They are here in case you want to run a similar analysis in the future.

install.packages("devtools")
devtools::install_github("BIGslu/kimma")

library(kimma)

fit <- kmFit(dat, model = "~ condition + sex + (1|ptID)",
             run.lme = TRUE, use.weights = TRUE)

lme/lmerel model: expression~condition+sex+(1|ptID)
Input: 20 libraries from 10 unique patients
Model: 20 libraries
Complete: 13419 genes
Failed: 0 genes

save(fit, file="data/kimma_result.RData")

Instead, load the results from the data downloaded for this workshop. Note that this is an S3 object similar to the dat object. In it, we have the linear mixed effects (lme) model results in fit$lme, which includes our log2 fold change estimate and significance FDR for each gene and each variable.

load("data/kimma_result.RData")

fit$lme

## # A tibble: 40,257 × 7
##    model gene   variable   estimate       pval      FDR gene_biotype  
##    <chr> <chr>  <chr>         <dbl>      <dbl>    <dbl> <chr>         
##  1 lme   A1BG   condition   -0.313  0.264      0.347    protein_coding
##  2 lme   A1BG   sex          0.0796 0.814      0.954    protein_coding
##  3 lme   A1BG   (1 | ptID)   0      1          1        protein_coding
##  4 lme   A2M    condition   -0.0940 0.605      0.682    protein_coding
##  5 lme   A2M    sex          1.40   0.0140     0.158    protein_coding
##  6 lme   A2M    (1 | ptID)   7.79   0.00525    0.0574   protein_coding
##  7 lme   A2ML1  condition    0.791  0.0191     0.0356   protein_coding
##  8 lme   A2ML1  sex         -2.53   0.00000156 0.000388 protein_coding
##  9 lme   A2ML1  (1 | ptID)   1.39   0.239      0.498    protein_coding
## 10 lme   A4GALT condition    2.66   0.0000318  0.000108 protein_coding
## # … with 40,247 more rows

Color

Let’s start with a basic dotplot like before. We first filter the fit data to just the variable condition so that we have 1 point per gene.

fit$lme %>% 
  filter(variable == "condition") %>% 
  
  ggplot() +
  aes(x = estimate, y = -log10(FDR)) +
  geom_point()

Now let’s color the significant points. There is no “significant” variable in the data frame, so we first make one with mutate.

fit$lme %>% 
  filter(variable == "condition") %>% 
  mutate(significance = case_when(FDR < 0.01 ~ "FDR < 0.01", #<--- see change
                                  TRUE ~ "NS")) %>% 
  
  ggplot() +
  aes(x = estimate, y = -log10(FDR), color = significance) + #<--- see change
  geom_point()

Or go further and color up vs down regulated genes differently.

fit$lme %>% 
  filter(variable == "condition") %>% 
  mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down", #<--- see change
                                  FDR < 0.01 & estimate > 0 ~ "up",
                                  TRUE ~ "NS")) %>% 
  
  ggplot() +
  aes(x = estimate, y = -log10(FDR), color = significance) +
  geom_point()

And we can specify the colors we want with another layer.

fit$lme %>% 
  filter(variable == "condition") %>% 
  mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
                                  FDR < 0.01 & estimate > 0 ~ "up",
                                  TRUE ~ "NS")) %>% 
  
  ggplot() +
  aes(x = estimate, y = -log10(FDR), color = significance) +
  geom_point() +
  scale_color_manual(values = c("blue", "grey", "red")) #<--- see change

Lines

Since we’re defining significance at a cutoff, we can plot that cutoff with a vertical line with geom_vline. Other similar geoms are geom_hline for a horizontal line and geom_abline for a line with slope a and intercept b.

fit$lme %>% 
  filter(variable == "condition") %>% 
  mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
                                  FDR < 0.01 & estimate > 0 ~ "up",
                                  TRUE ~ "NS")) %>% 
  
  ggplot() +
  aes(x = estimate, y = -log10(FDR), color = significance) +
  geom_point() +
  scale_color_manual(values = c("blue", "grey", "red")) +
  geom_hline(yintercept = -log10(0.01)) #<--- see change

Scales

You might also wish to center the 0 estimate or otherwise alter the plots limits.

fit$lme %>% 
  filter(variable == "condition") %>% 
  mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
                                  FDR < 0.01 & estimate > 0 ~ "up",
                                  TRUE ~ "NS")) %>% 
  
  ggplot() +
  aes(x = estimate, y = -log10(FDR), color = significance) +
  geom_point() +
  scale_color_manual(values = c("blue", "grey", "red")) +
  geom_hline(yintercept = -log10(0.01)) +
  lims(x = c(-9, 9)) #<--- see change

If you accidentally (or purposefully) remove data point with your limits, R will warn you!

fit$lme %>% 
  filter(variable == "condition") %>% 
  mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
                                  FDR < 0.01 & estimate > 0 ~ "up",
                                  TRUE ~ "NS")) %>% 
  
  ggplot() +
  aes(x = estimate, y = -log10(FDR), color = significance) +
  geom_point() +
  scale_color_manual(values = c("blue", "grey", "red")) +
  geom_hline(yintercept = -log10(0.01)) +
  lims(x = c(-3, 3)) #<--- see change

## Warning: Removed 165 rows containing missing values (geom_point).

Label points

We’re going to use a ggplot extension package to label our most significant genes. Let’s load that package now.

library(ggrepel)

First, we determine our top 2 up and down regulated genes. There are some new dplyr functions in here! See the comments for information.

top_deg <- fit$lme %>% 
  filter(variable == "condition") %>% 
  mutate(direction = case_when(estimate < 0 ~ "down",
                               estimate > 0 ~ "up")) %>% 
  group_by(direction) %>% 
  slice_min(FDR, n=2) %>% # Keeps n number of minimum FDR values
  pull(gene) # Extract the variable and turns the data into a vector

top_deg

## [1] "IFNAR2"  "NAIP"    "TERF2IP" "ZBTB21"

Then we add those labels to our fit data and use that variable in geom_text_repel. In this case, the data removal warning is expected as we’re not labeling the majority of points.

fit$lme %>% 
  filter(variable == "condition") %>% 
  mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
                                  FDR < 0.01 & estimate > 0 ~ "up",
                                  TRUE ~ "NS")) %>% 
  mutate(label = case_when(gene %in% top_deg ~ gene)) %>%  #<--- see change
  
  ggplot() +
  aes(x = estimate, y = -log10(FDR), color = significance) +
  geom_point() +
  scale_color_manual(values = c("blue", "grey", "red")) +
  geom_hline(yintercept = -log10(0.01)) +
  lims(x = c(-9, 9)) +
  geom_text_repel(aes(label = label)) #<--- see change

## Warning: Removed 13415 rows containing missing values (geom_text_repel).

The text legend gets overlayed with the point legend, which looks odd. Let’s remove one of them.

fit$lme %>% 
  filter(variable == "condition") %>% 
  mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
                                  FDR < 0.01 & estimate > 0 ~ "up",
                                  TRUE ~ "NS")) %>% 
  mutate(label = case_when(gene %in% top_deg ~ gene)) %>%
  
  ggplot() +
  aes(x = estimate, y = -log10(FDR), color = significance) +
  geom_point() +
  scale_color_manual(values = c("blue", "grey", "red")) +
  geom_hline(yintercept = -log10(0.01)) +
  lims(x = c(-9, 9)) +
  geom_text_repel(aes(label = label), show.legend = FALSE) #<--- see change

## Warning: Removed 13415 rows containing missing values (geom_text_repel).

And a small edit if you don’t want the text to be colored. When you put an aesthetic in a specific geom, it only applies that change to that one geom.

fit$lme %>% 
  filter(variable == "condition") %>% 
  mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
                                  FDR < 0.01 & estimate > 0 ~ "up",
                                  TRUE ~ "NS")) %>% 
  mutate(label = case_when(gene %in% top_deg ~ gene)) %>% 
  
  ggplot() +
  aes(x = estimate, y = -log10(FDR)) + #<--- see change
  geom_point(aes(color = significance)) + # <--- see change
  scale_color_manual(values = c("blue", "grey", "red")) +
  geom_hline(yintercept = -log10(0.01)) +
  lims(x = c(-9, 9)) +
  geom_text_repel(aes(label = label), show.legend = FALSE)

## Warning: Removed 13415 rows containing missing values (geom_text_repel).

Theme

To round it out, let’s look at theme_bw, my favorite one for volcano plots.

fit$lme %>% 
  filter(variable == "condition") %>% 
  mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
                                  FDR < 0.01 & estimate > 0 ~ "up",
                                  TRUE ~ "NS")) %>% 
  mutate(label = case_when(gene %in% top_deg ~ gene)) %>% 
  
  ggplot() +
  aes(x = estimate, y = -log10(FDR)) +
  geom_point(aes(color = significance)) + 
  scale_color_manual(values = c("blue", "grey", "red")) +
  geom_hline(yintercept = -log10(0.01)) +
  lims(x = c(-9, 9)) +
  geom_text_repel(aes(label = label)) +
  theme_bw() #<--- see change

## Warning: Removed 13415 rows containing missing values (geom_text_repel).

There you have it! Did you think you’d reach a 11 function plot so soon?

Facets

With big data, we often want to plot more than one thing at once. Facets allow you to make similar plots across multiple subsets of data with only 1 additional line of code. Here, we filter for 2 variable and then facet those into 2 plots.

fit$lme %>% 
  filter(variable %in% c("condition", "sex")) %>% #<--- see change
  mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
                                  FDR < 0.01 & estimate > 0 ~ "up",
                                  TRUE ~ "NS")) %>% 
  mutate(label = case_when(gene %in% top_deg ~ gene)) %>% 
  
  ggplot() +
  aes(x = estimate, y = -log10(FDR)) +
  geom_point(aes(color = significance)) + 
  scale_color_manual(values = c("blue", "grey", "red")) +
  geom_hline(yintercept = -log10(0.01)) +
  #<--- see change No lims used
  geom_text_repel(aes(label = label)) +
  theme_bw() +
  facet_wrap(~ variable) #<--- see change

## Warning: Removed 26830 rows containing missing values (geom_text_repel).

Since the a and y scales differ for the two variable, we allow them to differ with scales. This is my we removed the lims function earlier. scales can be set to “free_x”, “free_y”, or “free” for both x and y.

fit$lme %>% 
  filter(variable %in% c("condition", "sex")) %>%
  mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
                                  FDR < 0.01 & estimate > 0 ~ "up",
                                  TRUE ~ "NS")) %>% 
  mutate(label = case_when(gene %in% top_deg ~ gene)) %>% 
  
  ggplot() +
  aes(x = estimate, y = -log10(FDR)) +
  geom_point(aes(color = significance)) + 
  scale_color_manual(values = c("blue", "grey", "red")) +
  geom_hline(yintercept = -log10(0.01)) +
  geom_text_repel(aes(label = label)) +
  theme_bw() +
  facet_wrap(~ variable, scales = "free") #<--- see change

## Warning: Removed 26830 rows containing missing values (geom_text_repel).

Save plots

Once you have your plot as you desire, save it to your computer for use in presentations, publications, Twitter, and more! ggsave allows you to save in a bunch of formats (see the help page with ?ggsave).

Since we’ve only been printing our plots to the plot pane in RStudio, we need to first save a plot to the environment. We’ll use the single condtion variable volcano plot.

p1 <- fit$lme %>% #<--- see change
  filter(variable == "condition") %>%
  mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
                                  FDR < 0.01 & estimate > 0 ~ "up",
                                  TRUE ~ "NS")) %>% 
  mutate(label = case_when(gene %in% top_deg ~ gene)) %>% 
  
  ggplot() +
  aes(x = estimate, y = -log10(FDR)) +
  geom_point(aes(color = significance)) + 
  scale_color_manual(values = c("blue", "grey", "red")) +
  geom_hline(yintercept = -log10(0.01)) +
  geom_text_repel(aes(label = label)) +
  theme_bw() 

p1

## Warning: Removed 13415 rows containing missing values (geom_text_repel).

Now we can save the object. My most commonly used plot formats are .pdf for publication (vectorized PDF that can be edited in Illustrator/Inkscape) and .png for all other uses (not vectorized, smaller files). You specify the file type in the filename and give dimensions in inches (default). Without dimensions, R used whatever the ratio of you plot pane currently is (not reproducible!).

ggsave(filename = "images/volcano_plot.png", plot = p1, width = 3, height = 4)

## Warning: Removed 13415 rows containing missing values (geom_text_repel).

ggsave(filename = "images/volcano_plot.pdf", plot = p1, width = 3, height = 4)

## Warning: Removed 13415 rows containing missing values (geom_text_repel).

Commented plot scripts

Here are each of the final plots fulled comments for your reference.

ggplot(dat.pca.meta) +                   # Make ggplot from data in dat.pca.meta
  aes(x = PC1, y = PC2,                  # Plot PCs 1 and 2 along x and y
      color = condition) +               # Color by condition
  geom_point() +                         # Plot points for each value
  labs(x = "PC1 (24%)", y = "PC2 (11%)", # Relabel axes
       color = "Infection") +            # Relabel legend
  coord_fixed() +                        # Force xy scales to be equal
  theme_classic()                        # Set theme

fit$lme %>% 
  # Keep results for only the condition variable
  filter(variable == "condition") %>% 
  # Create variable for significant up and down genes
  mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
                                  FDR < 0.01 & estimate > 0 ~ "up",
                                  TRUE ~ "NS")) %>% 
  # Create variable for gene name labels
  mutate(label = case_when(gene %in% top_deg ~ gene)) %>% 
  
  ggplot() +                                              # initiate ggplot plot
  aes(x = estimate, y = -log10(FDR)) +                    # plot estimate and FDR
  geom_point(aes(color = significance)) +                 # plot points, color by significance
  scale_color_manual(values = c("blue", "grey", "red")) + # recolor points
  geom_hline(yintercept = -log10(0.01)) +                 # add FDR cuttof line
  lims(x = c(-9, 9)) +                                    # set axis limits
  geom_text_repel(aes(label = label)) +                   # label points
  theme_bw()                                              # set theme

  ... + 
  facet_wrap(~ variable) # facet by variable

Exercises: ggplot

Using the combined expression and metadata you made in the last session, plot the expression of one gene in Media vs Mtb. The plot type is up to you! As a reminder, here is how we made the combined data.

full_data <- as.data.frame(dat$E) %>% 
  rownames_to_column("gene") %>% 
  pivot_longer(-gene, names_to = "libID", values_to = "log2CPM") %>% 
  inner_join(dat$targets)

## Joining, by = "libID"

full_data

## # A tibble: 268,380 × 15
##    gene  libID log2CPM group lib.s…¹ norm.…² ptID  condi…³ age_dys sex   ptID_…⁴
##    <chr> <chr>   <dbl> <fct>   <dbl>   <dbl> <chr> <chr>     <dbl> <chr> <chr>  
##  1 A1BG  pt01…  1.62   1      8.30e6   1.09  pt01  Media     12410 M     pt00001
##  2 A1BG  pt01…  1.72   1      5.72e6   0.819 pt01  Mtb       12410 M     pt00001
##  3 A1BG  pt02…  1.09   1      8.09e6   1.12  pt02  Media     12775 M     pt00002
##  4 A1BG  pt02…  1.10   1      5.28e6   0.912 pt02  Mtb       12775 M     pt00002
##  5 A1BG  pt03…  0.874  1      5.18e6   1.05  pt03  Media     11315 M     pt00003
##  6 A1BG  pt03…  1.71   1      5.23e6   0.958 pt03  Mtb       11315 M     pt00003
##  7 A1BG  pt04…  1.57   1      9.00e6   1.13  pt04  Media      8395 M     pt00004
##  8 A1BG  pt04… -0.264  1      5.40e6   0.855 pt04  Mtb        8395 M     pt00004
##  9 A1BG  pt05…  1.63   1      1.26e7   1.07  pt05  Media      7300 M     pt00005
## 10 A1BG  pt05… -0.0305 1      1.01e7   0.930 pt05  Mtb        7300 M     pt00005
## # … with 268,370 more rows, 4 more variables: RNAseq <lgl>, methylation <lgl>,
## #   total_seq <dbl>, sample.weights <dbl>, and abbreviated variable names
## #   ¹lib.size, ²norm.factors, ³condition, ⁴ptID_old

Color the PCA by the numeric variable total_seq. Explore the help pages for scale_color_continuous, scale_color_gradient, or scale_color_gradient2 and recolor your PCA to more easily see differences. As a reminder, here is the base PCA plot.

ggplot(dat.pca.meta) +
  aes(x = PC1, y = PC2) +
  geom_point() +
  labs(x = "PC1 (24%)", y = "PC2 (11%)",
       color = "Infection") +
  theme_classic()

Using geom_text_repel, label the points in the PCA by their libID

Introduction to R and the tidyverse

RNA-seq edition

Kim Dill-McFarland, U of Washington

version September 08, 2022

Prior to the workshop, please complete the following:

Download the data

Install R and RStudio

Install R packages

Check R package install

R package versions

Navigation

1. Introduction to R and RStudio

A Tour of RStudio

RStudio Projects

Create a Project

R Scripts

R packages

CRAN

Bioconductor

Getting started

Organize data

Loading data into R

Data frames from .csv, .tsv, etc.

Help function

Complex data from .RData

Data types

Simple

Complex (S3, S4)

Working with vectors and data frames

Operating on vectors

Using the correct class

Subsetting vectors and data frames

Quick reference: Conditional statements

Exercises: Intro R

Navigation

2. Introduction to the tidyverse

Intro dplyr

select

filter

Quotes vs No Quotes

rename

summarize

pipes

group_by

mutate

Exercises: dplyr

Intro tidyr

pivot

pivot_wider

pivot_longer

More from dplyr

join

Exercises: pivot and join

Navigation

3. Data visualization in ggplot

Intro to ggplot

Why ggplot?

ggplot layers

Data

Basic plots

Boxplot

Data ordering with factors

Dotplot

Multiple geoms

Layer order matters

Why are my dots moving?

RNA-seq plots

PCA

Color and shape

Labels

Coordinates

Themes

volcano plot

Linear modeling

Color

Lines

Scales

Label points

Theme

Facets

Data frames from `.csv`, `.tsv`, etc.

Complex data from `.RData`

Complex (`S3`, `S4`)