Setup and installation

Install R and RStudio

When you open RStudio, it should look like this, with multiple panels. If you see only one panel, you’re likely in R, not RStudio.

Install R packages

Install R packages by running the following script in your R console (left panel in the above image).

# CRAN packages
install.packages(c("tidyverse", "lme4"))

# Bioconductor packages
install.packages("BiocManager")
BiocManager::install(c("edgeR", "biomaRt", "limma"))

# GitHub packages
install.packages("devtools")
devtools::install_github("BIGslu/RNAetc")
devtools::install_github("BIGslu/kimma")

1. Introduction to R and RStudio

In this workshop, we will analyze an RNAseq dataset. To do this, we’ll need two things: data and a platform to analyze the data.

You already downloaded the data. But what platform will we use to analyze the data? We have many options!

We could use a spreadsheet program like Microsoft Excel or Google Sheets, but these offer limited flexibility and don’t easily allow for things that are critical to “reproducible” research, like sharing the exact steps used to explore and modify the original data.

Instead, we’ll use a programming language to test our hypothesis. Today we will use R, but we could have also used Python for the same reasons we chose R (and we teach workshops for both languages). Both R and Python are freely available, the instructions you use to do the analysis are easily shared, and by using reproducible practices, it’s straightforward to add more data or to change settings like colors or the size of a plotting symbol.

To run R, all you really need is the R program, which is available for computers running the Windows, Mac OS X, or Linux operating systems. You downloaded R while getting set up for this workshop.

To make your life in R easier, there is a great (and free!) program called RStudio that you also downloaded and used during set up. As we work today, we’ll use features that are available in RStudio for writing and running code, managing projects, installing packages, getting help, and much more. It is important to remember that R and RStudio are different, but complementary programs. You need R to use RStudio.

To get started, we’ll spend a little time getting familiar with the RStudio environment and setting it up to suit your tastes. When you start RStudio, you’ll have three panels.

On the left you’ll have a panel with three tabs - Console, Terminal, and Jobs. The Console tab is what running R from the command line looks like. This is where you can enter R code. Try typing in 2+2 at the prompt (>). In the upper right panel are tabs indicating the Environment, History, and a few other things. If you click on the History tab, you’ll see the command you ran at the R prompt.

In the lower right panel are tabs for Files, Plots, Packages, Help, and Viewer. You used the Packages tab to install tidyverse.

We’ll spend more time in each of these tabs as we go through the workshop, so we won’t spend a lot of time discussing them now.

You might want to alter the appearance of your RStudio window. The default appearance has a white background with black text. If you go to the Tools menu at the top of your screen, you’ll see a “Global options” menu at the bottom of the drop down; select that.

From there you will see the ability to alter numerous things about RStudio. Under the Appearance tab you can select the theme you like most. As you can see, there’s a lot in Global Options that you can set to improve your experience in RStudio. Most of these settings are a matter of personal preference.

However, a few settings help ensure the reproducibility of your code. In the General tab, none of the boxes under R Sessions, Workspace, and History should be checked, and the dropdown next to “Save workspace to .RData on exit” should be set to Never. These settings help ensure that things you worked on previously don’t carry over between sessions.

Let’s get going on our analysis!

One of the helpful features in RStudio is the ability to create a project. A project is a special directory that contains all of the code and data that you will need to run an analysis.

At the top of your screen you’ll see the “File” menu. Select that menu and then the menu for “New Project…”.

When the smaller window opens, select “Existing Directory” and then the “Browse” button in the next window.

Navigate to the directory that contains your code and data from the setup instructions and click the “Open” button.

Then click the “Create Project” button.

Did you notice anything change?

In the lower right corner of your RStudio session, you should notice that your Files tab is now your project directory. You’ll also see a file called 2022_ASM_Microbe_RNAseq.Rproj in that directory.

From now on, you should start RStudio by double clicking on that file. This will make sure you are in the correct directory when you run your analysis.

We’d like to create a file where we can keep track of our R code.

Back in the “File” menu, you’ll see the first option is “New File”. Selecting “New File” opens another menu to the right and the first option is “R Script”. Select “R Script”.

Now we have a fourth panel in the upper left corner of RStudio that includes an Editor tab with an untitled R Script. Let’s save this file as intro.R in our project directory.

We will be entering R code into the Editor tab to run in our Console panel.

On line 1 of intro.R, type 2+2.

With your cursor on the line with the 2+2, click the button that says Run. You should be able to see that 2+2 was run in the Console.

As you write more code, you can highlight multiple lines and then click Run to run all of the lines you have selected.

You can use R like a calculator to add (+), multiply (*), divide (/), and more!
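
For example, try running a few of these in the Console:

```r
2 + 2        # addition
## [1] 4
7 * 6        # multiplication
## [1] 42
10 / 4       # division
## [1] 2.5
2^10         # exponentiation
## [1] 1024
10 %% 3      # remainder (modulo)
## [1] 1
```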

Variables and Types

R variables come in several basic types:

  • character
  • logical
  • double
  • integer

name <- "Kelly"
favorite_color <- "green"
height_inches <- 64
likes_cats <- TRUE
num_plants <- 14L

You can think of variables as labelled boxes where you store your belongings.

The <- symbol is the assignment operator. It assigns values generated or typed on the right to objects on the left. An alternative symbol that you might see used as an assignment operator is =, but it is clearer to use <- only for assignment. We use this symbol so often that RStudio has a keyboard shortcut for it: Alt+- on Windows, and Option+- on Mac.

If you’re not sure of the type of a variable, you can find out with the typeof() function.

typeof(name)
## [1] "character"
typeof(favorite_color)
## [1] "character"
typeof(height_inches)
## [1] "double"
typeof(likes_cats)
## [1] "logical"

Comments

Sometimes you may want to write notes in your code to help you remember what it is doing, without R treating those notes as code to evaluate. That’s where comments come in! Anything after a # symbol in your code will be ignored by R. For example, let’s make a note of what each of the following functions does:

Sys.Date()  # outputs the current date
## [1] "2022-06-01"
getwd()     # outputs our current working directory (folder)
## [1] "/Users/kim/Documents/GitHub/2022_ASM_Microbe_RNAseq"
sum(5, 6)   # adds numbers
## [1] 11

Assigning values to objects

Try to assign values to some objects and observe each object after you have assigned a new value. What do you notice?

name <- "Ben"
name
height <- 72
height
name <- "Jerry"
name

Solution

When we assign a value to an object, the object stores that value so we can access it later. However, if we store a new value in an object we have already created, it replaces the old value. The height object does not change, because we never assign it a new value.

Guidelines on naming objects

  • You want your object names to be explicit and not too long.
  • They cannot start with a number (2x is not valid, but x2 is).
  • R is case sensitive, so for example, weight_kg is different from Weight_kg.
  • You cannot use spaces in the name.
  • There are some names that cannot be used because they are the names of fundamental functions in R (e.g., if, else, for; see here for a complete list). If in doubt, check the help to see if the name is already in use (?function_name).
  • It is recommended to use nouns for object names and verbs for function names.
  • Be consistent in the styling of your code, such as where you put spaces, how you name objects, etc. Using a consistent coding style makes your code clearer to read for your future self and your collaborators. One popular style guide can be found through the tidyverse.
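
To illustrate a few of these rules (weight_kg, x2, and Weight_kg are throwaway names for demonstration):

```r
weight_kg <- 55      # valid: short, explicit, lowercase
x2 <- 10             # valid: digits are fine after the first character
# 2x <- 10           # invalid: names cannot start with a number
# my weight <- 55    # invalid: names cannot contain spaces

# R is case sensitive, so these are two different objects:
Weight_kg <- 60
weight_kg == Weight_kg
## [1] FALSE
```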

The power of variables is that you can re-use them later!

cm_per_inch <- 2.54
height_cm <- height_inches * cm_per_inch
height_cm
## [1] 162.56
height_cm / cm_per_inch
## [1] 64

The above variables hold a single value. What if you need multiple values?

dial_numbers <- c(1, 2, 3)
dial_numbers2 <- 1:3
typeof(dial_numbers)
## [1] "double"
fan_settings <- c('low', 'medium', 'high')
typeof(fan_settings)
## [1] "character"
fan_settings <- factor(c('low', 'high', 'medium', 'high', 'medium'), levels = c('low', 'medium', 'high'))
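
Factors store categorical data with a defined set of levels. A couple of quick checks on the fan_settings factor:

```r
fan_settings <- factor(c('low', 'high', 'medium', 'high', 'medium'),
                       levels = c('low', 'medium', 'high'))
levels(fan_settings)   # the allowed categories, in order
## [1] "low"    "medium" "high"
typeof(fan_settings)   # under the hood, factors are stored as integer codes
## [1] "integer"
```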

Error messages

Sometimes we make a mistake when writing code, and the code isn’t able to run. In that case, the code will give us an error message.

An error tells us, “there’s something wrong with your code or your data and R didn’t do what you asked”. A warning tells us, “you might want to know about this issue, but R still did what you asked”. You need to fix any errors that arise in order for your code to work. Warnings are best resolved, or at least understood, so you know why they came up.
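
For example, coercing non-numeric text to a number produces a warning rather than an error: R still returns a value (NA), but flags the issue:

```r
as.numeric("three")
## Warning: NAs introduced by coercion
## [1] NA
```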

How can you fix the error messages in these next lines of code?

heihgt_cm
## Error in eval(expr, envir, enclos): object 'heihgt_cm' not found
2 + "three"
## Error in 2 + "three": non-numeric argument to binary operator

Solution

height_cm
## [1] 162.56
2 + 3
## [1] 5

Functions

Earlier, the typeof() function took the variable we provided and told us the type that the variable belonged to. A function is a way to re-use code, either written by you or by someone else. Functions make it easier to write code because you don’t have to re-invent the wheel for certain tasks.
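
You can also write functions of your own. As a small sketch (inches_to_cm is a name we made up for illustration):

```r
# Define a function that converts inches to centimeters
inches_to_cm <- function(inches) {
  inches * 2.54
}

inches_to_cm(64)
## [1] 162.56
```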

Viewing function documentation

Each function has a help page that documents what arguments the function expects and what value it will return. You can bring up the help page a few different ways. If you have typed the function name in the Editor window, you can put your cursor on the function name and press F1 to open the help page in the Help viewer in the lower right corner of RStudio. You can also type ? followed by the function name in the console.

For example, try running ?typeof. A help page should pop up with information about what the function is used for and how to use it, as well as useful examples of the function in action. As you can see, the first argument of typeof() is the variable whose type you want to know.

Whenever using a function, you type the name of the function followed immediately by parentheses. Any arguments, or inputs, go between the parentheses.

Do all functions need arguments? Let’s test some other functions:

Sys.Date()
## [1] "2022-06-01"
getwd()
## [1] "/Users/kim/Documents/GitHub/2022_ASM_Microbe_RNAseq"

While some functions don’t need any arguments, in other functions we may want to use multiple arguments. When we’re using multiple arguments, we separate the arguments with commas. For example, we can use the sum() function to add numbers together:

sum(5, 6)
## [1] 11

Learning more about functions

Look up the function round. What does it do? What will you get as output for the following lines of code?

round(3.1415)
round(3.1415,3)

Solution

round rounds a number. By default, it rounds to zero decimal places (in our example above, to 3). If you give it a second argument, it rounds to that number of digits (in our example above, to 3.142).

Position of the arguments in functions

Which of the following lines of code will give you an output of 3.14? For the one(s) that don’t give you 3.14, what do they give you?

round(x = 3.1415)
round(x = 3.1415, digits = 2)
round(digits = 2, x = 3.1415)
round(2, 3.1415)

Solution

The 2nd and 3rd lines will give you the right answer because the arguments are named, and when you use names the order doesn’t matter. The 1st line will give you 3 because the default number of digits is 0. The 4th line will give you 2 because, since you didn’t name the arguments, x = 2 and digits = 3.1415.

Sometimes it is helpful - or even necessary - to include the argument names, but often we can skip them if we pass the values in the order the function expects. If all this function stuff sounds confusing, don’t worry! We’ll see a bunch of examples as we go that will make things clearer.

Packages

So far, all of the functions we have used are part of base R; when you install R, these functions are included by default. However, anyone can write code and share it with other people by putting it in packages. A package is a collection of code that is intended to be shared and re-used. Packages are useful when you need to do something that isn’t included in base R, or want it done in a different way than base R implements it.

Let’s load our first package with library(tidyverse).

Go ahead and run that line: with your cursor on it, click the Run button at the top right of the Editor tab. This loads a set of useful functions and sample data that make it easier for us to do complex analyses and create professional visualizations in R.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.8
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

For more information on installation of the tidyverse, see the setup instructions.

What’s with all those messages???

When you loaded the tidyverse package, you probably got a message like the one we got above. Don’t panic! These messages are just giving you more information about what happened when you loaded tidyverse. The tidyverse is actually a collection of several different packages, so the first section of the message tells us which packages were attached when we loaded tidyverse (these include ggplot2, dplyr, and tidyr, which we will use a lot!).

The second section of messages gives a list of “conflicts.” Sometimes, the same function name will be used in two different packages, and R has to decide which function to use. For example, our message says that:

dplyr::filter() masks stats::filter()

This means that two different packages (dplyr from the tidyverse and stats from base R) have a function named filter(). By default, R uses the function that was most recently loaded, so if we try using the filter() function after loading tidyverse, we will be using the filter() function from dplyr. You can use double colons (::) to specify which package’s version of the function you want to use. This syntax follows the pattern package::function().
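
For example, sd() lives in the stats package that comes with base R, so these two calls are equivalent:

```r
sd(c(2, 4, 6))          # uses whichever sd() is currently visible
## [1] 2
stats::sd(c(2, 4, 6))   # explicitly requests the version from the stats package
## [1] 2
```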

The tidyverse vs Base R

If you’ve used R before, you may have learned commands that are different than the ones we will be using during this workshop. We will be focusing on functions from the tidyverse. The “tidyverse” is a collection of R packages that have been designed to work well together and offer many convenient features that do not come with a fresh install of R (aka “base R”). These packages are very popular and have a lot of developer support including many staff members from RStudio. These functions generally help you to write code that is easier to read and maintain. We believe learning these tools will help you become more productive more quickly.


Data frames & reading data from files

The variable types we discussed above are useful for simple values, but what if you want to store attributes about lots of different things, like patient samples? For that we use data frames. Just about anything that you can store as rows and columns, like in an Excel spreadsheet, is a great fit for a data frame. Let’s find out how to read in a dataset as a data frame.

meta <- read_csv("0_data/metadata.csv")
## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

read_csv gave us lots of useful information about the data frame we just loaded!

Rows: 20 Columns: 9
── Column specification ──────────────────────────────────
Delimiter: ","
chr (5): libID, ptID, condition, sex, ptID_old
dbl (2): age_dys, total_seq
lgl (2): RNAseq, methylation

If you want to take a peek at just the first rows of the data frame, you can do so with head() (six rows by default):

head(meta)
## # A tibble: 6 × 9
##   libID      ptID  condition age_dys sex   ptID_old RNAseq methylation total_seq
##   <chr>      <chr> <chr>       <dbl> <chr> <chr>    <lgl>  <lgl>           <dbl>
## 1 pt01_Media pt01  Media       12410 M     pt00001  TRUE   FALSE        9114402.
## 2 pt01_Mtb   pt01  Mtb         12410 M     pt00001  TRUE   FALSE        8918699.
## 3 pt02_Media pt02  Media       12775 M     pt00002  TRUE   FALSE        9221555.
## 4 pt02_Mtb   pt02  Mtb         12775 M     pt00002  TRUE   FALSE        7733260.
## 5 pt03_Media pt03  Media       11315 M     pt00003  TRUE   FALSE        6231728.
## 6 pt03_Mtb   pt03  Mtb         11315 M     pt00003  TRUE   FALSE        7105193.

Or to peek at the last rows, use tail() (also six by default):

tail(meta)
## # A tibble: 6 × 9
##   libID      ptID  condition age_dys sex   ptID_old RNAseq methylation total_seq
##   <chr>      <chr> <chr>       <dbl> <chr> <chr>    <lgl>  <lgl>           <dbl>
## 1 pt08_Media pt08  Media        8760 M     pt00008  TRUE   FALSE        9957906.
## 2 pt08_Mtb   pt08  Mtb          8760 M     pt00008  TRUE   FALSE        8220348.
## 3 pt09_Media pt09  Media        6935 M     pt00009  TRUE   FALSE       13055276 
## 4 pt09_Mtb   pt09  Mtb          6935 M     pt00009  TRUE   FALSE       13800442.
## 5 pt10_Media pt10  Media        8030 F     pt00010  TRUE   FALSE        8216706.
## 6 pt10_Mtb   pt10  Mtb          8030 F     pt00010  TRUE   FALSE        7599609.

We can view the whole data frame interactively with View:

View(meta)

Data objects

There are many different ways to store data in R. Most objects have a table-like structure with rows and columns. We will refer to these objects generally as “data objects”. If you’ve used R before, you may be used to calling them “data.frames”. Functions from the “tidyverse” such as read_csv work with objects called “tibbles”, which are a specialized kind of “data.frame.” Another common way to store data is a “data.table”. All of these types of data objects (tibbles, data.frames, and data.tables) can be used with the commands we will learn in this lesson to make plots. We may sometimes use these terms interchangeably.
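
You can check what kind of data object you are working with using class(). Here is a small data.frame built by hand for illustration; a tibble returned by read_csv would also list "tbl_df" and "tbl" in its class:

```r
# A tiny data frame constructed by hand (values mimic the metadata above)
df <- data.frame(ptID = c("pt01", "pt02"), age_dys = c(12410, 12775))
class(df)
## [1] "data.frame"
```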

Other functions for reading datasets:

  • read.csv
  • data.table::fread
  • readxl::read_excel

You can check the size of a data frame with dim(), nrow(), and ncol():

dim(meta)
## [1] 20  9
nrow(meta)
## [1] 20
ncol(meta)
## [1] 9

Looking at just one column:

meta$ptID
##  [1] "pt01" "pt01" "pt02" "pt02" "pt03" "pt03" "pt04" "pt04" "pt05" "pt05"
## [11] "pt06" "pt06" "pt07" "pt07" "pt08" "pt08" "pt09" "pt09" "pt10" "pt10"
meta['ptID']
## # A tibble: 20 × 1
##    ptID 
##    <chr>
##  1 pt01 
##  2 pt01 
##  3 pt02 
##  4 pt02 
##  5 pt03 
##  6 pt03 
##  7 pt04 
##  8 pt04 
##  9 pt05 
## 10 pt05 
## 11 pt06 
## 12 pt06 
## 13 pt07 
## 14 pt07 
## 15 pt08 
## 16 pt08 
## 17 pt09 
## 18 pt09 
## 19 pt10 
## 20 pt10

Subsetting:

meta[2, 'ptID']
## # A tibble: 1 × 1
##   ptID 
##   <chr>
## 1 pt01
meta[1:4,]
## # A tibble: 4 × 9
##   libID      ptID  condition age_dys sex   ptID_old RNAseq methylation total_seq
##   <chr>      <chr> <chr>       <dbl> <chr> <chr>    <lgl>  <lgl>           <dbl>
## 1 pt01_Media pt01  Media       12410 M     pt00001  TRUE   FALSE        9114402.
## 2 pt01_Mtb   pt01  Mtb         12410 M     pt00001  TRUE   FALSE        8918699.
## 3 pt02_Media pt02  Media       12775 M     pt00002  TRUE   FALSE        9221555.
## 4 pt02_Mtb   pt02  Mtb         12775 M     pt00002  TRUE   FALSE        7733260.
meta[1:4, c('ptID', 'condition')]
## # A tibble: 4 × 2
##   ptID  condition
##   <chr> <chr>    
## 1 pt01  Media    
## 2 pt01  Mtb      
## 3 pt02  Media    
## 4 pt02  Mtb

Subsetting practice

How would you subset rows 10 through 15 and the columns for “libID”, “methylation”, and “total_seq”?

meta[10:15, c(1, 8:9)]
## # A tibble: 6 × 3
##   libID      methylation total_seq
##   <chr>      <lgl>           <dbl>
## 1 pt05_Mtb   FALSE       15509446.
## 2 pt06_Media FALSE        7085995.
## 3 pt06_Mtb   FALSE        6588160.
## 4 pt07_Media FALSE       10706098.
## 5 pt07_Mtb   FALSE        8576245.
## 6 pt08_Media FALSE        9957906.
meta[10:15, c("libID", "methylation", "total_seq")]
## # A tibble: 6 × 3
##   libID      methylation total_seq
##   <chr>      <lgl>           <dbl>
## 1 pt05_Mtb   FALSE       15509446.
## 2 pt06_Media FALSE        7085995.
## 3 pt06_Mtb   FALSE        6588160.
## 4 pt07_Media FALSE       10706098.
## 5 pt07_Mtb   FALSE        8576245.
## 6 pt08_Media FALSE        9957906.

2. Introduction to the tidyverse

The tidyverse is not just one package, but a collection of several packages that are designed to work well together for data analysis tasks. Let’s load the tidyverse and read in the metadata again.

For more information on installation, see the setup instructions.

library(tidyverse)
meta <- read_csv("0_data/metadata.csv")
## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Take a look at the dataset with View():

View(meta)

Intro dplyr

summarize

Let’s start with a useful function called summarize() that will let us summarize variables in our data frame. We can use it to find the mean, median, or other statistics about the dataset. Let’s find the average age in days below:

summarize(meta, mean_age_days = mean(age_dys))
## # A tibble: 1 × 1
##   mean_age_days
##           <dbl>
## 1         9016.

The first argument to the summarize function is our data frame, and the second argument is a new variable that we want summarize to create from the age_dys column.

pipes

Instead of including the data as an argument, we can use the pipe operator %>% to pass the data value into the summarize function.

meta %>% summarize(mean_age_days = mean(age_dys))
## # A tibble: 1 × 1
##   mean_age_days
##           <dbl>
## 1         9016.

This line of code will do the exact same thing as our first summary command, but the piping function tells R to use the meta data frame as the first argument in the next function.

This lets us “chain” together multiple functions, which will be helpful later. Note that the pipe (%>%) is a bit different from using the ggplot plus (+). Pipes take the output from the left side and use it as input to the right side. Plusses layer on additional information (right side) to a preexisting plot (left side).
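
As a small illustration of chaining with base R functions (assuming the tidyverse, which provides %>%, is loaded):

```r
library(magrittr)  # the %>% pipe; loaded automatically with the tidyverse

sum(sqrt(c(1, 4, 9)))            # nested calls read inside-out
## [1] 6
c(1, 4, 9) %>% sqrt() %>% sum()  # piped steps read left to right
## [1] 6
```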

We can also add an Enter to make it look nicer:

meta %>%
  summarize(mean_age_days = mean(age_dys))
## # A tibble: 1 × 1
##   mean_age_days
##           <dbl>
## 1         9016.

Using the pipe operator %>% and Enter makes our code more readable. The pipe operator %>% also helps to avoid nested functions and minimizes the need for intermediate variables.

Since we use the pipe operator so often, there is a keyboard shortcut for it in RStudio. You can press Ctrl+Shift+M on Windows or Cmd+Shift+M on a Mac.

Tip: Saving a new dataframe

Notice that when we run the following code, we are not actually saving a new variable:

meta %>%
  summarize(mean_age_days=mean(age_dys))

This simply outputs what we have created, but does not actually change meta or save a new dataframe. To save a new dataframe, we could run:

meta_summarized <- meta %>%
  summarize(mean_age_days=mean(age_dys))

Or if we wanted to change meta itself:

meta <- meta %>%
  summarize(mean_age_days=mean(age_dys))

IMPORTANT: This would overwrite the existing meta object.

For now, we will not be saving dataframes, since we are just experimenting with dplyr functions, but it will be useful later on in this lesson.

group_by

If we want to calculate the mean age in days for males and females separately, we can use group_by before summarize:

meta %>% 
  group_by(sex) %>% 
  summarize(mean_age_days = mean(age_dys))
## # A tibble: 2 × 2
##   sex   mean_age_days
##   <chr>         <dbl>
## 1 F             7422.
## 2 M             9699.

Exercise: group_by practice

Try calculating the mean total number of sequences for the Media and Mtb conditions. You’ll need to use group_by(), summarize(), and the mean() function.

meta %>% 
  group_by(condition) %>% 
  summarize(mean_total_seq = mean(total_seq))
## # A tibble: 2 × 2
##   condition mean_total_seq
##   <chr>              <dbl>
## 1 Media           9933191.
## 2 Mtb             9246495.

select

You can use the select function to keep only the columns that you want and remove columns that you don’t need.

meta %>% 
  select(ptID, age_dys)
## # A tibble: 20 × 2
##    ptID  age_dys
##    <chr>   <dbl>
##  1 pt01    12410
##  2 pt01    12410
##  3 pt02    12775
##  4 pt02    12775
##  5 pt03    11315
##  6 pt03    11315
##  7 pt04     8395
##  8 pt04     8395
##  9 pt05     7300
## 10 pt05     7300
## 11 pt06     6570
## 12 pt06     6570
## 13 pt07     7665
## 14 pt07     7665
## 15 pt08     8760
## 16 pt08     8760
## 17 pt09     6935
## 18 pt09     6935
## 19 pt10     8030
## 20 pt10     8030

If you want to keep all but one column, put the minus (-) sign in front of it. I like to think of it as “subtracting” the column from the data frame.

meta %>% 
  select(-ptID_old)
## # A tibble: 20 × 8
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.
##  2 pt01_Mtb   pt01  Mtb         12410 M     TRUE   FALSE        8918699.
##  3 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.
##  4 pt02_Mtb   pt02  Mtb         12775 M     TRUE   FALSE        7733260.
##  5 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.
##  6 pt03_Mtb   pt03  Mtb         11315 M     TRUE   FALSE        7105193.
##  7 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.
##  8 pt04_Mtb   pt04  Mtb          8395 M     TRUE   TRUE         8413543.
##  9 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.
## 10 pt05_Mtb   pt05  Mtb          7300 M     TRUE   FALSE       15509446.
## 11 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.
## 12 pt06_Mtb   pt06  Mtb          6570 F     TRUE   FALSE        6588160.
## 13 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.
## 14 pt07_Mtb   pt07  Mtb          7665 F     TRUE   FALSE        8576245.
## 15 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.
## 16 pt08_Mtb   pt08  Mtb          8760 M     TRUE   FALSE        8220348.
## 17 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276 
## 18 pt09_Mtb   pt09  Mtb          6935 M     TRUE   FALSE       13800442.
## 19 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.
## 20 pt10_Mtb   pt10  Mtb          8030 F     TRUE   FALSE        7599609.

Let’s save the data frame without the ptID_old column, since ptID tells us everything we need to know in a more compact way.

meta <- meta %>% 
  select(-ptID_old)

mutate

Sometimes we would like to create a new column based on the values in an existing one.

meta %>% 
  mutate(age_yrs = age_dys / 365.25)
## # A tibble: 20 × 9
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  2 pt01_Mtb   pt01  Mtb         12410 M     TRUE   FALSE        8918699.    34.0
##  3 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.    35.0
##  4 pt02_Mtb   pt02  Mtb         12775 M     TRUE   FALSE        7733260.    35.0
##  5 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.    31.0
##  6 pt03_Mtb   pt03  Mtb         11315 M     TRUE   FALSE        7105193.    31.0
##  7 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.    23.0
##  8 pt04_Mtb   pt04  Mtb          8395 M     TRUE   TRUE         8413543.    23.0
##  9 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.    20.0
## 10 pt05_Mtb   pt05  Mtb          7300 M     TRUE   FALSE       15509446.    20.0
## 11 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.    18.0
## 12 pt06_Mtb   pt06  Mtb          6570 F     TRUE   FALSE        6588160.    18.0
## 13 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.    21.0
## 14 pt07_Mtb   pt07  Mtb          7665 F     TRUE   FALSE        8576245.    21.0
## 15 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.    24.0
## 16 pt08_Mtb   pt08  Mtb          8760 M     TRUE   FALSE        8220348.    24.0
## 17 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276     19.0
## 18 pt09_Mtb   pt09  Mtb          6935 M     TRUE   FALSE       13800442.    19.0
## 19 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.    22.0
## 20 pt10_Mtb   pt10  Mtb          8030 F     TRUE   FALSE        7599609.    22.0

We can overwrite the meta data frame so it will now contain the age_yrs column.

meta <- meta %>% 
  mutate(age_yrs = age_dys / 365.25)
meta
## # A tibble: 20 × 9
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  2 pt01_Mtb   pt01  Mtb         12410 M     TRUE   FALSE        8918699.    34.0
##  3 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.    35.0
##  4 pt02_Mtb   pt02  Mtb         12775 M     TRUE   FALSE        7733260.    35.0
##  5 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.    31.0
##  6 pt03_Mtb   pt03  Mtb         11315 M     TRUE   FALSE        7105193.    31.0
##  7 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.    23.0
##  8 pt04_Mtb   pt04  Mtb          8395 M     TRUE   TRUE         8413543.    23.0
##  9 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.    20.0
## 10 pt05_Mtb   pt05  Mtb          7300 M     TRUE   FALSE       15509446.    20.0
## 11 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.    18.0
## 12 pt06_Mtb   pt06  Mtb          6570 F     TRUE   FALSE        6588160.    18.0
## 13 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.    21.0
## 14 pt07_Mtb   pt07  Mtb          7665 F     TRUE   FALSE        8576245.    21.0
## 15 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.    24.0
## 16 pt08_Mtb   pt08  Mtb          8760 M     TRUE   FALSE        8220348.    24.0
## 17 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276     19.0
## 18 pt09_Mtb   pt09  Mtb          6935 M     TRUE   FALSE       13800442.    19.0
## 19 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.    22.0
## 20 pt10_Mtb   pt10  Mtb          8030 F     TRUE   FALSE        7599609.    22.0

Exercise: practice mutate

Try creating a new column total_seq_millions based on total_seq.

meta %>% mutate(total_seq_millions = total_seq / 1000000)
## # A tibble: 20 × 10
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  2 pt01_Mtb   pt01  Mtb         12410 M     TRUE   FALSE        8918699.    34.0
##  3 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.    35.0
##  4 pt02_Mtb   pt02  Mtb         12775 M     TRUE   FALSE        7733260.    35.0
##  5 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.    31.0
##  6 pt03_Mtb   pt03  Mtb         11315 M     TRUE   FALSE        7105193.    31.0
##  7 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.    23.0
##  8 pt04_Mtb   pt04  Mtb          8395 M     TRUE   TRUE         8413543.    23.0
##  9 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.    20.0
## 10 pt05_Mtb   pt05  Mtb          7300 M     TRUE   FALSE       15509446.    20.0
## 11 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.    18.0
## 12 pt06_Mtb   pt06  Mtb          6570 F     TRUE   FALSE        6588160.    18.0
## 13 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.    21.0
## 14 pt07_Mtb   pt07  Mtb          7665 F     TRUE   FALSE        8576245.    21.0
## 15 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.    24.0
## 16 pt08_Mtb   pt08  Mtb          8760 M     TRUE   FALSE        8220348.    24.0
## 17 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276     19.0
## 18 pt09_Mtb   pt09  Mtb          6935 M     TRUE   FALSE       13800442.    19.0
## 19 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.    22.0
## 20 pt10_Mtb   pt10  Mtb          8030 F     TRUE   FALSE        7599609.    22.0
## # … with 1 more variable: total_seq_millions <dbl>

filter

For each patient, we have both a “Media” sample and an “Mtb” sample. If we want to use only the “Media” samples, we can narrow the data frame down to only the “Media” rows with filter():

meta %>% 
  filter(condition == 'Media')
## # A tibble: 10 × 9
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  2 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.    35.0
##  3 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.    31.0
##  4 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.    23.0
##  5 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.    20.0
##  6 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.    18.0
##  7 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.    21.0
##  8 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.    24.0
##  9 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276     19.0
## 10 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.    22.0

We used the == sign to test which rows in the condition column were equal to the value “Media”. Comparison operators return TRUE or FALSE. There are many useful comparison operators in R:

operator   meaning
==         equal
!=         not equal
<          less than
>          greater than
<=         less than or equal to
>=         greater than or equal to
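Each of these can be tested directly in the R console; every comparison returns a logical value:

```r
5 == 5            # TRUE
5 != 5            # FALSE
3 < 10            # TRUE
"Media" == "Mtb"  # FALSE
```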

Quotes vs No Quotes

Notice that we put the word “Media” inside quotes. This is because we are not using a column from inside our data frame. When you need to include actual text values in R, they will be placed inside quotes to tell them apart from other object or variable names.

The general rule is that if you want to use values from the columns of your data object, then you supply the name of the column without quotes, but if you want to specify a value that does not come from your data, then use quotes.
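As a quick illustration, omitting the quotes makes R look for an object named Media rather than the text value:

```r
meta %>% filter(condition == "Media")  # works: "Media" is a text value
meta %>% filter(condition == Media)    # fails: R looks for an object named Media
```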

Can you think of another way to end up with only the Media rows? Hint: try using the “not equal” operator.

meta %>% 
  filter(condition != 'Mtb')
## # A tibble: 10 × 9
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  2 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.    35.0
##  3 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.    31.0
##  4 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.    23.0
##  5 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.    20.0
##  6 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.    18.0
##  7 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.    21.0
##  8 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.    24.0
##  9 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276     19.0
## 10 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.    22.0

With filter you can also specify multiple comparisons to perform. Maybe we want to filter down to only Media samples with more than 7 million total sequences. We can use the ampersand (&) to mean “and”:

meta %>% 
  filter(condition == 'Media' & total_seq > 7000000)
## # A tibble: 9 × 9
##   libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##   <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
## 1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
## 2 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.    35.0
## 3 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.    23.0
## 4 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.    20.0
## 5 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.    18.0
## 6 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.    21.0
## 7 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.    24.0
## 8 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276     19.0
## 9 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.    22.0

Alternatively, maybe you want to filter down to only samples from patients that are either male or at least 21 years old. You can use the vertical bar | to mean “or”:

meta %>% 
  filter(sex == 'M'  | age_yrs >= 21)
## # A tibble: 16 × 9
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  2 pt01_Mtb   pt01  Mtb         12410 M     TRUE   FALSE        8918699.    34.0
##  3 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.    35.0
##  4 pt02_Mtb   pt02  Mtb         12775 M     TRUE   FALSE        7733260.    35.0
##  5 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.    31.0
##  6 pt03_Mtb   pt03  Mtb         11315 M     TRUE   FALSE        7105193.    31.0
##  7 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.    23.0
##  8 pt04_Mtb   pt04  Mtb          8395 M     TRUE   TRUE         8413543.    23.0
##  9 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.    20.0
## 10 pt05_Mtb   pt05  Mtb          7300 M     TRUE   FALSE       15509446.    20.0
## 11 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.    24.0
## 12 pt08_Mtb   pt08  Mtb          8760 M     TRUE   FALSE        8220348.    24.0
## 13 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276     19.0
## 14 pt09_Mtb   pt09  Mtb          6935 M     TRUE   FALSE       13800442.    19.0
## 15 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.    22.0
## 16 pt10_Mtb   pt10  Mtb          8030 F     TRUE   FALSE        7599609.    22.0

Intro ggplot

We will be using the ggplot2 package today to make our plots. This is a very powerful package that creates professional looking plots and is one of the reasons people like using R so much. All plots made using the ggplot2 package start by calling the ggplot() function.

meta %>% ggplot()

When we run this code, the Plots tab will pop to the front in the lower right corner of the RStudio screen. Right now, we just see a big grey rectangle.

What we’ve done is created a ggplot object and told it we will be using the data from the meta object that we’ve loaded into R. We’ve done this by calling the ggplot() function with meta as the data argument.

So we’ve made a plot object, now we need to start telling it what we actually want to draw in this plot. The elements of a plot have a bunch of properties like an x and y position, a size, a color, etc. These properties are called aesthetics. When creating a data visualization, we map a variable in our dataset to an aesthetic in our plot. In ggplot, we can do this by creating an “aesthetic mapping”, which we do with the aes() function.

To create our plot, we need to map variables from our meta object to ggplot aesthetics using the aes() function. Since we have already told ggplot that we are using the data in the meta object, we can access the columns of meta using the object’s column names. (Remember, R is case-sensitive, so we have to be careful to match the column names exactly!)

We are interested in the total sequences per sample, so let’s start by telling our plot object that we want to assign this variable to the y axis of our plot. We do this by adding (+) information to our plot object. Add this new line to your code and run both lines by highlighting them and pressing Ctrl+Enter on your keyboard:

meta %>% 
ggplot() +
  aes(y = total_seq)

Note that we’ve added this new function call on a second line just to make the code easier to read. When doing this, make sure the + is at the end of the first line; otherwise R will assume your command ends there. The + sign indicates both that we are adding information to the plot and that the command continues on the next line of code.
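For example, if the + starts the second line instead of ending the first, R runs the first line as a complete command and then cannot parse the orphaned +:

```r
# Works: + at the end of the line tells R the command continues
meta %>% 
  ggplot() +
  aes(y = total_seq)

# Does not work: R treats the first line as a complete command,
# then errors on a line beginning with +
# meta %>% ggplot()
# + aes(y = total_seq)
```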

Observe that our Plot window is no longer a grey square. We now see that we’ve mapped the total_seq column to the y axis of our plot. Note that that column name isn’t very pretty as an axis label, so let’s add the labs() function to make a nicer label for the y axis

meta %>% 
ggplot() +
  aes(y = total_seq) +
  labs(y = "Total sequences")

OK. That looks better. Now let’s also map patient age (in days) to the x axis:

meta %>% 
ggplot() +
  aes(x = age_dys, y = total_seq) +
  labs(x = "Age (days)", y = "Total sequences")

Excellent. We’ve now told our plot object where the x and y values are coming from and what they stand for. But we haven’t told our object how we want it to draw the data.

geom_point

There are many different plot types (bar charts, scatter plots, histograms, etc). We tell our plot object what to draw by adding a “geometry” (“geom” for short) to our object. We will talk about many different geometries today, but for our first plot, let’s draw our data using the “points” geometry for each value in the data set. To do this, we add geom_point() to our plot object:

meta %>% 
ggplot() +
  aes(x = age_dys, y = total_seq) +
  labs(x = "Age (days)", y = "Total sequences") +
  geom_point()

Now we’re really getting somewhere. It finally looks like a proper plot! One thing we could do is use a different color for the sample conditions. To map the condition of each point to a color, we will again use the aes() function:

meta %>% 
ggplot() +
  aes(x = age_dys, y = total_seq, color = condition) +
  labs(x = "Age (days)", y = "Total sequences") +
  geom_point()

So far, we’ve been using geom_point() to plot two continuous (numeric) variables. What if we want to plot a categorical variable and a numeric variable?

geom_boxplot

Using the meta data frame, use ggplot to create a box plot with condition on the x axis and total_seq on the y axis.

meta %>% 
 ggplot() +
  aes(x = condition, y = total_seq) +
  geom_boxplot()

This type of visualization makes it easy to compare the range and spread of values across groups. The “middle” 50% of the data is located inside the box and outliers that are far away from the central mass of the data are drawn as points.

geom_col

Another way to plot a categorical and a continuous variable is with a bar plot.

meta %>% 
  ggplot() +
  aes(x = libID, y = total_seq) +
  geom_col()

Those library IDs sure are squished together and hard to read. Maybe we should have total_seq on the x axis and libID on the y axis. In other words, it might be easier if we flip the coordinates. We could rewrite the aes() code, but an easier way to change the coordinates is with coord_flip():

meta %>% 
  ggplot() +
  aes(x = libID, y = total_seq) +
  geom_col() +
  coord_flip()

This is looking pretty good. Now let’s add color using the fill aesthetic!

meta %>% 
  ggplot() +
  aes(x = libID, y = total_seq, fill = condition) +
  geom_col() +
  coord_flip()

Now let’s see if we can plot the total sequences for each patient ID:

meta %>% 
  ggplot() +
  aes(x = ptID, y = total_seq, fill = condition) +
  geom_col() +
  coord_flip()

By default, the bars are stacked. Maybe we would like them to be side-by-side instead. We can tell the geom_col shapes to “dodge” each other:

meta %>% 
  ggplot() +
  aes(x = ptID, y = total_seq, fill = condition) +
  geom_col(position = position_dodge()) +
  coord_flip()

Finally, since ptID is a short name, let’s flip our coordinates back by removing coord_flip():

meta %>% 
  ggplot() +
  aes(x = ptID, y = total_seq, fill = condition) +
  geom_col(position = position_dodge())

We now have our first data cleaning plot, the total number of sequences per library.

Saving plots

If we want to share our plots with other people or put them in manuscripts, we’re going to need to make them a little prettier and save them to our computers. Let’s make it pretty and then save the plot!

Add labels:

meta %>% 
  ggplot() +
  aes(x = ptID, y = total_seq, fill = condition) +
  geom_col(position = position_dodge()) + 
  labs(x = 'patient ID', y = 'Total sequences')

Change the theme:

meta %>% 
  ggplot() +
  aes(x = ptID, y = total_seq, fill = condition) +
  geom_col(position = position_dodge()) + 
  labs(x = 'patient ID', y = 'Total sequences') +
  theme_bw()

Save to a new variable:

seq_per_lib_plot <- meta %>% 
  ggplot() +
  aes(x = ptID, y = total_seq, fill = condition) +
  geom_col(position = position_dodge()) + 
  labs(x = 'patient ID', y = 'Total sequences') +
  theme_bw()

Take a look at it by entering the variable name into the console (seq_per_lib_plot), then save the plot to a file with ggsave():

ggsave("sequences_per_library.png", plot = seq_per_lib_plot)
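By default, ggsave() infers the file format from the extension and uses the current plotting device size. You can also set the dimensions and resolution explicitly:

```r
# Save as a 6 x 4 inch PNG at 300 dpi
ggsave("sequences_per_library.png", plot = seq_per_lib_plot,
       width = 6, height = 4, units = "in", dpi = 300)
```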

Intro tidyr

pivot

Data comes in many shapes and sizes, and one way we classify data is either “wide” or “long.” Data that is “long” has one row per observation. The metadata is in a long format: we have one row for each patient sample, and each variable is in a different column. We might describe this data as “tidy” because it is easy to work with using ggplot2 and dplyr functions (this is where the “tidy” in “tidyverse” comes from). As tidy as it may be, sometimes we may want our data in a “wide” format. Typically in “wide” format each row represents a group of observations and each value is placed in a different column rather than a different row. For example, maybe we want only one row per patient and want to spread the total sequence values into different columns (one for each condition).

The tidyr package contains the functions pivot_wider and pivot_longer that make it easy to switch between the two formats. The tidyr package is included in the tidyverse package so we don’t need to do anything to load it.
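As a minimal sketch with a made-up tibble (not our workshop data), the two functions are inverses of each other:

```r
toy <- tibble(id        = c("pt1", "pt1", "pt2", "pt2"),
              condition = c("Media", "Mtb", "Media", "Mtb"),
              value     = c(1, 2, 3, 4))

# long -> wide: one row per id, one column per condition
toy_wide <- toy %>% 
  pivot_wider(names_from = condition, values_from = value)

# wide -> long: back to one row per id-condition pair
toy_wide %>% 
  pivot_longer(c(Media, Mtb), names_to = "condition", values_to = "value")
```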

pivot_wider

For each patient, we have two samples: Media and Mtb. In the metadata, the only differences between these two conditions are the libID column (which is redundant with the ptID and condition) and the total_seq column. We can take the condition column and create two new columns, one with the total sequences in the Media sample and one with the total sequences in the Mtb sample. This is called “pivoting” the data frame “wider” because we rearrange it to have more columns and fewer rows. Let’s try it!

meta %>% 
  select(-libID) %>% 
  pivot_wider(names_from = condition, values_from = total_seq)
## # A tibble: 10 × 8
##    ptID  age_dys sex   RNAseq methylation age_yrs     Media       Mtb
##    <chr>   <dbl> <chr> <lgl>  <lgl>         <dbl>     <dbl>     <dbl>
##  1 pt01    12410 M     TRUE   FALSE          34.0  9114402.  8918699.
##  2 pt02    12775 M     TRUE   FALSE          35.0  9221555.  7733260.
##  3 pt03    11315 M     TRUE   FALSE          31.0  6231728.  7105193.
##  4 pt04     8395 M     TRUE   TRUE           23.0 10205557.  8413543.
##  5 pt05     7300 M     TRUE   FALSE          20.0 15536685. 15509446.
##  6 pt06     6570 F     TRUE   FALSE          18.0  7085995.  6588160.
##  7 pt07     7665 F     TRUE   FALSE          21.0 10706098.  8576245.
##  8 pt08     8760 M     TRUE   FALSE          24.0  9957906.  8220348.
##  9 pt09     6935 M     TRUE   FALSE          19.0 13055276  13800442.
## 10 pt10     8030 F     TRUE   FALSE          22.0  8216706.  7599609.

Notice here that we tell pivot_wider() to pull the names of our new columns from the condition variable and the values to populate those columns from the total_seq variable. (Again, neither of which have to be in quotes in the code when there are no special characters or spaces - certainly an incentive not to use special characters or spaces!) We see that the resulting table has a new column per condition, with the remaining variables dictating the rows.

Maybe we should assign those columns more informative names:

meta %>% 
  select(-libID) %>% 
  pivot_wider(names_from = condition, values_from = total_seq,
              names_prefix = "total_seq_")
## # A tibble: 10 × 8
##    ptID  age_dys sex   RNAseq methylation age_yrs total_seq_Media total_seq_Mtb
##    <chr>   <dbl> <chr> <lgl>  <lgl>         <dbl>           <dbl>         <dbl>
##  1 pt01    12410 M     TRUE   FALSE          34.0        9114402.      8918699.
##  2 pt02    12775 M     TRUE   FALSE          35.0        9221555.      7733260.
##  3 pt03    11315 M     TRUE   FALSE          31.0        6231728.      7105193.
##  4 pt04     8395 M     TRUE   TRUE           23.0       10205557.      8413543.
##  5 pt05     7300 M     TRUE   FALSE          20.0       15536685.     15509446.
##  6 pt06     6570 F     TRUE   FALSE          18.0        7085995.      6588160.
##  7 pt07     7665 F     TRUE   FALSE          21.0       10706098.      8576245.
##  8 pt08     8760 M     TRUE   FALSE          24.0        9957906.      8220348.
##  9 pt09     6935 M     TRUE   FALSE          19.0       13055276      13800442.
## 10 pt10     8030 F     TRUE   FALSE          22.0        8216706.      7599609.

And now let’s save the new wider metadata to a new variable:

meta_wide <- meta %>% 
  select(-libID) %>% 
  pivot_wider(names_from = condition, values_from = total_seq)

Notice that the number of rows and columns has changed:

nrow(meta)
## [1] 20
ncol(meta)
## [1] 9
nrow(meta_wide)
## [1] 10
ncol(meta_wide)
## [1] 8

pivot_longer

Everything we did with pivot_wider, we can reverse with pivot_longer. Maybe you receive a data frame in a wide format, but you need it in a long format. Here’s how we can get back to a long data frame:

meta_wide %>% 
  pivot_longer(c(Media, Mtb), 
               names_to = 'condition', values_to = 'total_seq')
## # A tibble: 20 × 8
##    ptID  age_dys sex   RNAseq methylation age_yrs condition total_seq
##    <chr>   <dbl> <chr> <lgl>  <lgl>         <dbl> <chr>         <dbl>
##  1 pt01    12410 M     TRUE   FALSE          34.0 Media      9114402.
##  2 pt01    12410 M     TRUE   FALSE          34.0 Mtb        8918699.
##  3 pt02    12775 M     TRUE   FALSE          35.0 Media      9221555.
##  4 pt02    12775 M     TRUE   FALSE          35.0 Mtb        7733260.
##  5 pt03    11315 M     TRUE   FALSE          31.0 Media      6231728.
##  6 pt03    11315 M     TRUE   FALSE          31.0 Mtb        7105193.
##  7 pt04     8395 M     TRUE   TRUE           23.0 Media     10205557.
##  8 pt04     8395 M     TRUE   TRUE           23.0 Mtb        8413543.
##  9 pt05     7300 M     TRUE   FALSE          20.0 Media     15536685.
## 10 pt05     7300 M     TRUE   FALSE          20.0 Mtb       15509446.
## 11 pt06     6570 F     TRUE   FALSE          18.0 Media      7085995.
## 12 pt06     6570 F     TRUE   FALSE          18.0 Mtb        6588160.
## 13 pt07     7665 F     TRUE   FALSE          21.0 Media     10706098.
## 14 pt07     7665 F     TRUE   FALSE          21.0 Mtb        8576245.
## 15 pt08     8760 M     TRUE   FALSE          24.0 Media      9957906.
## 16 pt08     8760 M     TRUE   FALSE          24.0 Mtb        8220348.
## 17 pt09     6935 M     TRUE   FALSE          19.0 Media     13055276 
## 18 pt09     6935 M     TRUE   FALSE          19.0 Mtb       13800442.
## 19 pt10     8030 F     TRUE   FALSE          22.0 Media      8216706.
## 20 pt10     8030 F     TRUE   FALSE          22.0 Mtb        7599609.

sequence data

So far, we’ve been working with just the metadata. But this workshop is about analyzing RNAseq data, so let’s start working with actual sequencing data.

count <- read_csv('0_data/raw_counts.csv')
## Rows: 44786 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): hgnc_symbol
## dbl (20): pt01_Media, pt01_Mtb, pt02_Media, pt02_Mtb, pt03_Media, pt03_Mtb, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Take a look at the counts data with: View(count)

Notice how each column is a libID and each row is an hgnc_symbol. The values are the raw sequence counts for the genes in each library. Soon, we’re going to want to combine the sequence data with our metadata. But first, we need to change this from a wide to a long format.

count %>% 
  pivot_longer(-hgnc_symbol, names_to = 'libID', values_to = 'raw_count')
## # A tibble: 895,720 × 3
##    hgnc_symbol libID      raw_count
##    <chr>       <chr>          <dbl>
##  1 5_8S_rRNA   pt01_Media     20.0 
##  2 5_8S_rRNA   pt01_Mtb       20.0 
##  3 5_8S_rRNA   pt02_Media      7.11
##  4 5_8S_rRNA   pt02_Mtb        6.06
##  5 5_8S_rRNA   pt03_Media     12.2 
##  6 5_8S_rRNA   pt03_Mtb        4.07
##  7 5_8S_rRNA   pt04_Media      4   
##  8 5_8S_rRNA   pt04_Mtb        3   
##  9 5_8S_rRNA   pt05_Media     22.0 
## 10 5_8S_rRNA   pt05_Mtb       55.7 
## # … with 895,710 more rows

Let’s save this to a new data frame so we can re-use it later.

count_long <- count %>% 
  pivot_longer(-hgnc_symbol, names_to = 'libID', values_to = 'raw_count')

join

Now let’s combine our raw sequence counts with our metadata. To combine two data frames, we will use a join function. The dplyr package has a number of tools for joining data frames together, depending on what we want to do with rows that are not present in both data frames.

?join

We have a few options here:

The mutating joins add columns from y to x, matching rows based on the keys:

- inner_join(): includes all rows in x and y.
- left_join(): includes all rows in x.
- right_join(): includes all rows in y.
- full_join(): includes all rows in x or y.

In an “inner join”, the new data frame only has those rows where the same key is found in both data frames. This is a very commonly used join.

inner_join(meta, count_long, by = 'libID')
## # A tibble: 895,720 × 11
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  2 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  3 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  4 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  5 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  6 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  7 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  8 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  9 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
## 10 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
## # … with 895,710 more rows, and 2 more variables: hgnc_symbol <chr>,
## #   raw_count <dbl>

Now let’s save our joined data frame to a new variable:

full_data <- inner_join(meta, count_long, by = 'libID')

Now that we have our metadata joined with our raw count data, we’re ready to clean and analyze it!

Bonus: Other dplyr join functions

Outer joins can be performed using left_join(), right_join(), and full_join(). In a “left join”, if a key is present in the left hand data frame, it will appear in the output, even if it is not found in the right hand data frame. For a right join, the opposite is true. For a full join, all possible keys are included in the output data frame.
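A small made-up example (not our workshop data) shows the difference:

```r
x <- tibble(key = c("a", "b"), x_val = 1:2)
y <- tibble(key = c("b", "c"), y_val = 3:4)

inner_join(x, y, by = "key")  # 1 row: only key "b" is in both
left_join(x, y, by = "key")   # 2 rows: keys "a" and "b"; y_val is NA for "a"
full_join(x, y, by = "key")   # 3 rows: keys "a", "b", and "c"
```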

Using our joined dataset

Let’s find out which genes had the highest raw counts.

top_10_abundant_genes <- full_data %>% 
  filter(condition == 'Media') %>% 
  group_by(hgnc_symbol) %>% 
  summarize(med_raw_count = median(raw_count)) %>% 
  arrange(desc(med_raw_count)) %>% 
  slice_max(order_by = med_raw_count, n = 10)
top_10_abundant_genes
## # A tibble: 10 × 2
##    hgnc_symbol   med_raw_count
##    <chr>                 <dbl>
##  1 MALAT1              263231 
##  2 Metazoa_SRP         208862.
##  3 RN7SL1              189057.
##  4 CH507-513H4.3       140705.
##  5 CH507-513H4.4       140705.
##  6 CH507-513H4.6       139025.
##  7 MT-RNR2             132437.
##  8 RN7SK                83060.
##  9 MT-RNR1              78763 
## 10 MT-CO1               77922.

Now let’s make a boxplot with only those genes.

full_data %>% 
  filter(hgnc_symbol %in% top_10_abundant_genes$hgnc_symbol, 
         condition == 'Media') %>% 
  ggplot() +
  aes(x = hgnc_symbol, y = raw_count) +
  geom_boxplot() +
  coord_flip() +
  labs(title = 'Top 10 most abundant genes')

Try plotting just interferon gamma (IFNG) for each patient:

full_data %>% 
  filter(hgnc_symbol == 'IFNG', condition == 'Media') %>% 
  ggplot() +
  aes(x = ptID, y = raw_count) +
  geom_col() +
  labs(title = 'IFNG RNAseq counts')

3. RNA-seq data cleaning

Overview

Here, we take a raw RNA-seq counts table through quality assessment, filtering, and normalization. For more information on how counts were generated, see our [SEAsnake pipeline][SEAsnake]. These data are human, bulk, paired-end RNA-seq, but this pipeline can be applied to other organisms or single-read libraries.

Load packages

Load the following packages. For more information on installation, see the setup instructions.

#Data manipulation and plotting
library(tidyverse)
#RNAseq data
library(edgeR)
## Loading required package: limma
## Warning: package 'limma' was built under R version 4.1.3
library(limma)
library(RNAetc)

#Note we do not load biomaRt because it has conflicts with the tidyverse.
# We instead call its functions with biomaRt::

# Set random seed for reproducibility
set.seed(651)

Read in data

Using the tidyverse read_ functions you saw earlier today, read in the counts and sample metadata files.

count <- read_csv("0_data/raw_counts.csv")
## Rows: 44786 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): hgnc_symbol
## dbl (20): pt01_Media, pt01_Mtb, pt02_Media, pt02_Mtb, pt03_Media, pt03_Mtb, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
meta <- read_csv("0_data/metadata.csv")
## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Quality-filter data

Filter poor-quality libraries

We assess library quality using metrics from [samtools flagstat][flagstat] and [picard][picard]. Here, we will only look at total sequences for brevity but you can find more quality measures of interest such as percent alignment and median CV coverage in our full [tutorial][count_to_voom].

Using ggplot, let’s plot the total sequences per library.

ggplot(meta, aes(x = libID, y = total_seq)) +
  geom_col()

The above plot is somewhat difficult to read so we modify it with some ggplot features you’ve seen before and some that are new.

ggplot(meta, 
       #Reorder x axis by y-axis value
       aes(x = reorder(libID, total_seq), 
           y = total_seq)) +
  geom_col() +
  #Add cutoff line
  geom_hline(yintercept = 1E6, color = "red") +
  #Set theme
  theme_classic() +
  #Relabel axes
  labs(x = "Library", y = "Total sequences") +
  #Rotate x-axis labels
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Overall, we see that all libraries have more than 1 million sequences. Based on this and other quality metrics not shown here, we determine that all libraries pass quality check.

Filter non-protein-coding genes

Standard differential expression analyses focus on protein-coding genes as these RNA products are the most likely to result in a measurable phenotype. We annotate the counts table genes to their biotypes as well as additional identifiers with biomaRt.

First, load the gene database from biomaRt. Note that here we use biomaRt::function() because we did not load this package with library() earlier. The :: operator tells R which package contains the function and prevents downstream issues in this pipeline due to conflicts like dplyr::select vs biomaRt::select.

#Get database
ensembl <- biomaRt::useEnsembl(biomart="ensembl",
                               dataset="hsapiens_gene_ensembl",
                                mirror = "www")

#Get gene key
key <- biomaRt::getBM(attributes=c("hgnc_symbol", "gene_biotype"),
                      mart=ensembl)

Also, you could extract more than just the biotypes from biomaRt. See all the possible data with listAttributes.

attrib <- biomaRt::listAttributes(ensembl)
head(attrib)
##                            name                  description         page
## 1               ensembl_gene_id               Gene stable ID feature_page
## 2       ensembl_gene_id_version       Gene stable ID version feature_page
## 3         ensembl_transcript_id         Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5            ensembl_peptide_id            Protein stable ID feature_page
## 6    ensembl_peptide_id_version    Protein stable ID version feature_page

Next, filter the key to protein-coding genes present in the RNA-seq count data.

key.filter <- key %>% 
  #Filter genes in count table
  filter(hgnc_symbol %in% count$hgnc_symbol) %>% 
  #Filter protein-coding
  filter(gene_biotype == "protein_coding")

And filter the count table to genes in the protein-coding key.

count.pc <- count %>% 
  filter(hgnc_symbol %in% key.filter$hgnc_symbol)

This removes 26518 genes from this data set and leaves 18268 genes for analysis.

Filter PCA outliers

Sometimes one or more libraries will appear as outliers in PCA. We define this as any library more than 3 standard deviations away from the mean PC1 and/or PC2.

First, we must reformat the counts data to a log counts per million (cpm) matrix in order for PCA to work. We also transpose it so that genes are columns and libraries are rows. Without transposing, we’d get 1 dot per gene instead of 1 per library.

count.for.pca <- count.pc %>% 
  #move HGNC gene symbol to rownames
  column_to_rownames("hgnc_symbol") %>% 
  #convert to log cpm
  cpm(log=TRUE) %>% 
  #transpose
  t()

count.for.pca[1:3,1:3]
##                A1BG      A1CF      A2M
## pt01_Media 1.835063 -1.865271 7.389832
## pt01_Mtb   1.535061 -1.865271 6.397031
## pt02_Media 1.369070 -1.865271 6.106875

Then, we calculate PCA on these transformed data.

PCA <- prcomp(count.for.pca, scale.=TRUE, center=TRUE)

A basic ggplot of these PCA data looks like so, but it does not tell us anything about outliers.

#Determine percent explained for each PC
summary(PCA)$importance
##                             PC1      PC2      PC3      PC4      PC5      PC6
## Standard deviation     67.41996 46.09409 38.32512 36.86756 34.23585 31.11334
## Proportion of Variance  0.24882  0.11631  0.08040  0.07440  0.06416  0.05299
## Cumulative Proportion   0.24882  0.36513  0.44553  0.51993  0.58409  0.63709
##                             PC7      PC8      PC9     PC10     PC11     PC12
## Standard deviation     28.79221 28.08592 26.04112 24.22482 22.73842 22.68737
## Proportion of Variance  0.04538  0.04318  0.03712  0.03212  0.02830  0.02818
## Cumulative Proportion   0.68247  0.72565  0.76277  0.79489  0.82319  0.85137
##                            PC13     PC14     PC15     PC16     PC17     PC18
## Standard deviation     21.77168 20.72515 20.25228 19.57118 19.30086 18.76942
## Proportion of Variance  0.02595  0.02351  0.02245  0.02097  0.02039  0.01928
## Cumulative Proportion   0.87732  0.90083  0.92328  0.94425  0.96464  0.98393
##                            PC19     PC20
## Standard deviation     17.12298 0.664296
## Proportion of Variance  0.01605 0.000020
## Cumulative Proportion   0.99998 1.000000
#create labels with percent explained for the PC
PC1.label <- paste("PC1", " (", summary(PCA)$importance[2,1]*100, "%)", sep="")
PC2.label <- paste("PC2", " (", summary(PCA)$importance[2,2]*100, "%)", sep="")

ggplot(as.data.frame(PCA$x),
       aes(x=PC1, y=PC2)) +
  geom_point(size=3) +
  labs(x=PC1.label, y=PC2.label)

We need to calculate the means and standard deviations of our PCs and create a new variable to color the points in PCA. First, let’s get the summary metrics from the PCA data contained in PCA$x.

PCA.summ <- as.data.frame(PCA$x) %>% 
  #move rownames into column of data frame
  rownames_to_column("libID") %>% 
  #Pivot to get all PCs in 1 column
  pivot_longer(-libID) %>% 
  #calculate mean and sd
  group_by(name) %>% 
  summarize(meanPC = mean(value),
            sdPC = sd(value))
PCA.summ
## # A tibble: 20 × 3
##    name    meanPC    sdPC
##    <chr>    <dbl>   <dbl>
##  1 PC1   -0.250   67.4   
##  2 PC10   0.0951  24.2   
##  3 PC11  -0.118   22.7   
##  4 PC12   0.00668 22.7   
##  5 PC13   0.0262  21.8   
##  6 PC14   0.0299  20.7   
##  7 PC15  -0.0288  20.3   
##  8 PC16  -0.100   19.6   
##  9 PC17   0.0166  19.3   
## 10 PC18  -0.0150  18.8   
## 11 PC19   0.158   17.1   
## 12 PC2    0.362   46.1   
## 13 PC20  -0.647    0.0137
## 14 PC3    0.209   38.3   
## 15 PC4   -0.123   36.9   
## 16 PC5   -0.0577  34.2   
## 17 PC6    0.233   31.1   
## 18 PC7   -0.169   28.8   
## 19 PC8    0.141   28.1   
## 20 PC9    0.0745  26.0
#Extract a row by the PC name
PC1.summ <- PCA.summ[PCA.summ$name == "PC1",]
PC2.summ <- PCA.summ[PCA.summ$name == "PC2",]

#create variable for outliers outside 3 sd from the mean on either PC1 or PC2
PCA2 <- as.data.frame(PCA$x) %>% 
  mutate(outlier = ifelse(
    #Higher than +3 sd
    PC1 > PC1.summ$meanPC + 3*PC1.summ$sdPC | 
    PC2 > PC2.summ$meanPC + 3*PC2.summ$sdPC | 
    #Lower than -3 sd
    PC1 < PC1.summ$meanPC - 3*PC1.summ$sdPC |
    PC2 < PC2.summ$meanPC - 3*PC2.summ$sdPC, 
    #If any of the above are TRUE, make the variable equal to this
    "outlier",
    #Otherwise (else), use this
    "okay"))

Then our PCA plot reveals that there are no outliers. Note we added some additional ggplot customization to beautify this plot.

ggplot(PCA2, aes(x=PC1, y=PC2)) +
  geom_point(aes(color=outlier), size=3) +
  labs(x=PC1.label, y=PC2.label) +
  #Set theme
  theme_classic() +
  #set xy ratio
  coord_fixed(ratio=1) +
  #change colors
  scale_color_manual(values = c("okay"="grey70",
                                "outlier"="red"))

We recommend that you initially remove dramatic outliers but leave those that are borderline or questionable. Then, you can re-assess outliers after gene filtering and normalization. You may find that some are no longer outliers after these steps. If they are, you can return to this step and remove them before repeating subsequent steps.

Additional steps

This workshop covers the minimum quality control and filtering needed for RNA-seq libraries. Check out our full tutorial for additional QC metrics, batch effects, custom plotting functions, and more!

Create DGEList

At this stage, we’ve completed sample filtering and can collapse our count and meta data into a single list object. This shortens our long object names and works efficiently with the remaining cleaning steps.

We merge into a DGEList object, which is the edgeR data format. We must convert the counts data frame to a matrix for this process.

count.pc.mat <- count.pc %>% 
  column_to_rownames("hgnc_symbol") %>% 
  as.matrix()

dat <- DGEList(counts=count.pc.mat, samples=meta, genes=key.filter)

This object is a list of data frames: list[dataframe[column]]. You can access each data frame in the list with $, similar to how you’ve used it to access columns within a data frame.

typeof(dat)
## [1] "list"
names(dat)
## [1] "counts"  "samples" "genes"
#One data frame in dat
dat$samples
##            group lib.size norm.factors      libID ptID condition age_dys sex
## pt01_Media     1  7590941            1 pt01_Media pt01     Media   12410   M
## pt01_Mtb       1  6975318            1   pt01_Mtb pt01       Mtb   12410   M
## pt02_Media     1  7225255            1 pt02_Media pt02     Media   12775   M
## pt02_Mtb       1  5794398            1   pt02_Mtb pt02       Mtb   12775   M
## pt03_Media     1  4924407            1 pt03_Media pt03     Media   11315   M
## pt03_Mtb       1  5458446            1   pt03_Mtb pt03       Mtb   11315   M
## pt04_Media     1  7985262            1 pt04_Media pt04     Media    8395   M
## pt04_Mtb       1  6323624            1   pt04_Mtb pt04       Mtb    8395   M
## pt05_Media     1 11696793            1 pt05_Media pt05     Media    7300   M
## pt05_Mtb       1 10846286            1   pt05_Mtb pt05       Mtb    7300   M
## pt06_Media     1  5576621            1 pt06_Media pt06     Media    6570   F
## pt06_Mtb       1  5047579            1   pt06_Mtb pt06       Mtb    6570   F
## pt07_Media     1  8554618            1 pt07_Media pt07     Media    7665   F
## pt07_Mtb       1  6203112            1   pt07_Mtb pt07       Mtb    7665   F
## pt08_Media     1  7696280            1 pt08_Media pt08     Media    8760   M
## pt08_Mtb       1  5863972            1   pt08_Mtb pt08       Mtb    8760   M
## pt09_Media     1  9864991            1 pt09_Media pt09     Media    6935   M
## pt09_Mtb       1  9692964            1   pt09_Mtb pt09       Mtb    6935   M
## pt10_Media     1  6639203            1 pt10_Media pt10     Media    8030   F
## pt10_Mtb       1  5774444            1   pt10_Mtb pt10       Mtb    8030   F
##            ptID_old RNAseq methylation total_seq
## pt01_Media  pt00001   TRUE       FALSE   9114402
## pt01_Mtb    pt00001   TRUE       FALSE   8918699
## pt02_Media  pt00002   TRUE       FALSE   9221555
## pt02_Mtb    pt00002   TRUE       FALSE   7733260
## pt03_Media  pt00003   TRUE       FALSE   6231728
## pt03_Mtb    pt00003   TRUE       FALSE   7105193
## pt04_Media  pt00004   TRUE        TRUE  10205557
## pt04_Mtb    pt00004   TRUE        TRUE   8413543
## pt05_Media  pt00005   TRUE       FALSE  15536685
## pt05_Mtb    pt00005   TRUE       FALSE  15509446
## pt06_Media  pt00006   TRUE       FALSE   7085995
## pt06_Mtb    pt00006   TRUE       FALSE   6588160
## pt07_Media  pt00007   TRUE       FALSE  10706098
## pt07_Mtb    pt00007   TRUE       FALSE   8576245
## pt08_Media  pt00008   TRUE       FALSE   9957906
## pt08_Mtb    pt00008   TRUE       FALSE   8220348
## pt09_Media  pt00009   TRUE       FALSE  13055276
## pt09_Mtb    pt00009   TRUE       FALSE  13800442
## pt10_Media  pt00010   TRUE       FALSE   8216706
## pt10_Mtb    pt00010   TRUE       FALSE   7599609
#One column in one data frame in dat
dat$samples$libID
##  [1] "pt01_Media" "pt01_Mtb"   "pt02_Media" "pt02_Mtb"   "pt03_Media"
##  [6] "pt03_Mtb"   "pt04_Media" "pt04_Mtb"   "pt05_Media" "pt05_Mtb"  
## [11] "pt06_Media" "pt06_Mtb"   "pt07_Media" "pt07_Mtb"   "pt08_Media"
## [16] "pt08_Mtb"   "pt09_Media" "pt09_Mtb"   "pt10_Media" "pt10_Mtb"

Filter low abundance genes

Low abundance (small counts) and rare genes (many 0 counts) are removed from the data because they:

  • are unlikely to be significantly differentially expressed
  • greatly inflate multiple comparison correction
  • often do not meet linear modeling assumptions regarding mean variance trends (e.g. because of small N, they show lower variance than what is expected for their mean expression)

Our goal is to remove genes in the lower left of the mean-variance plot because their counts (x) and variance (y) are low, i.e. these genes break the red mean-variance trend line.

Here we co-opt limma’s voom function to make our mean-variance plot for us.

mv1 <- voom(dat, plot=TRUE)

We use our custom function in the package RNAetc to retain only genes with at least min.CPM counts per million in at least min.sample samples OR in at least min.pct percent of samples. Here, we use 0.5 CPM in at least 3 samples.

dat.abund <- filter_rare(dat, min.CPM = 0.5, min.sample = 3,
                         gene.var="hgnc_symbol")
## 4849 (26.54%) of 18268 genes removed.

Plotting the filtered data, we see the red trend line no longer curves down on the left. There is still a bit of the downward tail of dots but this filtering is sufficient.

mv2 <- voom(dat.abund, plot=TRUE)

This filtering generally removes around 25% of genes. There is no exact cutoff, and you should try several cutoffs and observe their effects. In general, we use a minimum CPM from 0.1 to 1, a minimum of around 3 samples for small data sets, or a minimum of 5 to 10% of samples in larger data sets. It is important to keep the minimum sample cutoff no larger than your smallest group of interest; otherwise you may filter out genes specifically associated with one group. For example, if you have 10 libraries with 5 each of media and stimulated, your min.sample value should not be > 5 or else you will remove genes expressed in only one condition.
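To compare cutoffs, you can loop over candidate values and count the genes that survive. This is a sketch assuming, as above, that RNAetc's filter_rare returns a DGEList-like object with a counts matrix; the cutoff values shown are illustrative.

```r
#Try several minimum CPM cutoffs and record how many genes remain
for(cpm.cut in c(0.1, 0.5, 1)){
  temp <- filter_rare(dat, min.CPM = cpm.cut, min.sample = 3,
                      gene.var = "hgnc_symbol")
  message("min.CPM = ", cpm.cut, ": ", nrow(temp$counts), " genes retained")
}
```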

You may also wish to look for specific genes of interest and ensure they are not being filtered. For example, we will analyze interferon gamma (IFNG) later in this workshop, so we check that this gene is present after filtering.

"IFNG" %in% rownames(dat.abund$counts)
## [1] TRUE

Normalize data

Trimmed mean of M (TMM)

RNA-seq counts are not independent within a library and not comparable across libraries. A library with 1 million sequences will have higher counts for most genes than one with 1 thousand sequences. We correct for this aspect of the data with the following normalization steps.

TMM defines a reference sample from your data set as the one with the most representative expression for the overall data set. Specifically, the reference sample is the one whose upper quartile is closest to the overall data set upper quartile. The upper quartile is the value (x) where 75% of genes have expression < x. Thus, the reference sample is the sample whose x is closest to mean(x) across all samples.

All other samples are considered test samples. For each test sample, a scaling factor is calculated based on the weighted mean of log ratios of representative genes between the test and reference. These representative genes are a subset of the data set, removing the highest and lowest expressed genes as well as genes with the highest and lowest log ratios. The exact genes used as representative genes for scaling are dynamic and specific to each test sample.
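The reference sample selection can be illustrated with a few lines of code. This is a simplified sketch of the idea described above; edgeR's calcNormFactors implements the details (including weighting and gene trimming) internally.

```r
#Upper quartile of CPM for each library
uq <- apply(cpm(dat.abund), 2, quantile, probs = 0.75)
#The reference is the library whose upper quartile is closest to the mean
names(uq)[which.min(abs(uq - mean(uq)))]
```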

The calculated scaling factors are then applied to the counts table automatically in the voom step.

dat.abund.norm <- calcNormFactors(dat.abund, method = "TMM")

voom aka log2 counts per million (CPM)

We continue normalization by converting counts to CPM within each sample, thus accounting for differential sampling depth. We also perform a log2 transformation because RNA-seq data are heavily right-skewed and thus violate assumptions of normality.

as.data.frame(dat.abund$counts) %>% 
  rownames_to_column() %>% 
  pivot_longer(-rowname) %>% 
ggplot() +
  geom_histogram(aes(x=value), bins = 100) +
  theme_classic() +
  labs(x = "count", y = "occurrences") +
  lims(x=c(0,1000))

Note: You will see a warning message with the plot above. This is telling you that some data are not represented in the plot. This is expected here as we’ve forced the x-axis limits to be 0 to 1000, therefore removing any data with counts higher than 1000. If you see this warning when you’re not expecting data to be removed, go back to the data frame and look for NA or formatting errors (like a letter in a numeric variable).

voom performs both of these steps! We use voomWithQualityWeights to additionally calculate sample-specific quality weights that can be of use as covariates in linear modeling.

#define model matrix
mm <- model.matrix(~ condition, data=dat.abund.norm$samples)
dat.abund.norm.voom <- voomWithQualityWeights(
                           dat.abund.norm,
                           design=mm,
                           plot=TRUE)

Final data structure

Let’s review the final data structure. We access each data frame within the EList object using $. The normalized log2 CPM expression data are contained in E.

dat.abund.norm.voom$E[1:3,1:7]
##       pt01_Media  pt01_Mtb pt02_Media  pt02_Mtb pt03_Media  pt03_Mtb pt04_Media
## A1BG    1.620022  1.717603   1.086946  1.095467  0.8744826  1.710618   1.573344
## A2M     7.259912  6.680758   5.939909  5.316170  6.0252987  5.541628   7.330036
## A2ML1  -4.052404 -3.515057  -4.015712 -1.816544 -3.3734450 -3.386150  -4.170278

Library and donor metadata are in targets.

dat.abund.norm.voom$targets[1:3,1:10]
##            group lib.size norm.factors      libID ptID condition age_dys sex
## pt01_Media     1  8295928     1.092872 pt01_Media pt01     Media   12410   M
## pt01_Mtb       1  5716203     0.819490   pt01_Mtb pt01       Mtb   12410   M
## pt02_Media     1  8087601     1.119352 pt02_Media pt02     Media   12775   M
##            ptID_old RNAseq
## pt01_Media  pt00001   TRUE
## pt01_Mtb    pt00001   TRUE
## pt02_Media  pt00002   TRUE

Gene metadata are in genes.

dat.abund.norm.voom$genes[1:3,]
##      hgnc_symbol   gene_biotype
## A1BG      MT-ND1 protein_coding
## A1CF      MT-ND2 protein_coding
## A2M       MT-CO1 protein_coding

Gene-level quality weights are in weights.

dat.abund.norm.voom$weights[1:3,1:3]
##            [,1]       [,2]       [,3]
## [1,]  1.5140025  1.1586136  2.0890796
## [2,] 10.0514208 10.9377042 14.1566201
## [3,]  0.3364504  0.3989864  0.4689222

Exploratory analysis

To get an initial sense of the cleaned data, we plot some variables of interest. Here, we see a clear Mtb infection signal and no relationship to sex.

#transpose normalized expression data
voom.for.pca <- t(dat.abund.norm.voom$E)
#Calculate PCA
PCA3 <- prcomp(voom.for.pca, scale.=TRUE, center=TRUE)

#create labels with percent explained for the PC
PC1.label <- paste("PC1", " (", summary(PCA3)$importance[2,1]*100, "%)", sep="")
PC2.label <- paste("PC2", " (", summary(PCA3)$importance[2,2]*100, "%)", sep="")

#Merge PCA results with metadata
PCA.meta <- as.data.frame(PCA3$x) %>% 
  rownames_to_column("libID") %>% 
  full_join(meta)
## Joining, by = "libID"
#color by condition
ggplot(PCA.meta, aes(x=PC1, y=PC2)) +
  geom_point(aes(color=condition), size=3) +
  labs(x=PC1.label, y=PC2.label) +
  #Set theme
  theme_classic() +
  #set xy ratio
  coord_fixed(ratio=1)

#color by sex
ggplot(PCA.meta, aes(x=PC1, y=PC2)) +
  geom_point(aes(color=sex), size=3) +
  labs(x=PC1.label, y=PC2.label) +
  #Set theme
  theme_classic() +
  #set xy ratio
  coord_fixed(ratio=1)

Save

Write final voom object as RData

save(dat.abund.norm.voom, file = "0_data/dat_voom.RData")

We also save information on a single gene, IFNG, for use in our next section on linear modeling. Note that this combines several of the tidyverse functions you saw this morning as well as introduces some new ones.

#Get IFNG expression
IFNG <- as.data.frame(dat.abund.norm.voom$E) %>% 
  #Move rownames into the data frame as a column
  rownames_to_column("hgnc_symbol") %>% 
  #Filter gene of interest
  filter(hgnc_symbol == "IFNG") %>% 
  #Pivot to long format so libID are in a single column
  pivot_longer(-hgnc_symbol, names_to = "libID", values_to = "E") %>% 
  #Join with sample metadata
  full_join(dat.abund.norm.voom$targets, by = "libID")

#View resulting data frame
IFNG
## # A tibble: 20 × 15
##    hgnc_symbol libID       E group lib.size norm.factors ptID  condition age_dys
##    <chr>       <chr>   <dbl> <fct>    <dbl>        <dbl> <chr> <chr>       <dbl>
##  1 IFNG        pt01_… -4.05  1       8.30e6        1.09  pt01  Media       12410
##  2 IFNG        pt01_…  0.572 1       5.72e6        0.819 pt01  Mtb         12410
##  3 IFNG        pt02_… -0.846 1       8.09e6        1.12  pt02  Media       12775
##  4 IFNG        pt02_…  3.39  1       5.28e6        0.912 pt02  Mtb         12775
##  5 IFNG        pt03_… -3.37  1       5.18e6        1.05  pt03  Media       11315
##  6 IFNG        pt03_…  0.521 1       5.23e6        0.958 pt03  Mtb         11315
##  7 IFNG        pt04_… -1.00  1       9.00e6        1.13  pt04  Media        8395
##  8 IFNG        pt04_…  4.94  1       5.40e6        0.855 pt04  Mtb          8395
##  9 IFNG        pt05_… -4.65  1       1.26e7        1.07  pt05  Media        7300
## 10 IFNG        pt05_… -2.01  1       1.01e7        0.930 pt05  Mtb          7300
## 11 IFNG        pt06_… -3.63  1       6.17e6        1.11  pt06  Media        6570
## 12 IFNG        pt06_… -1.12  1       5.42e6        1.07  pt06  Mtb          6570
## 13 IFNG        pt07_… -1.32  1       8.76e6        1.02  pt07  Media        7665
## 14 IFNG        pt07_…  1.48  1       5.19e6        0.837 pt07  Mtb          7665
## 15 IFNG        pt08_… -2.56  1       8.82e6        1.15  pt08  Media        8760
## 16 IFNG        pt08_… -0.629 1       5.41e6        0.923 pt08  Mtb          8760
## 17 IFNG        pt09_…  0.290 1       1.19e7        1.20  pt09  Media        6935
## 18 IFNG        pt09_…  3.74  1       1.10e7        1.14  pt09  Mtb          6935
## 19 IFNG        pt10_… -0.524 1       6.47e6        0.974 pt10  Media        8030
## 20 IFNG        pt10_…  4.30  1       4.54e6        0.786 pt10  Mtb          8030
## # … with 6 more variables: sex <chr>, ptID_old <chr>, RNAseq <lgl>,
## #   methylation <lgl>, total_seq <dbl>, sample.weights <dbl>
#Save as csv
write_csv(IFNG, file = "0_data/IFNG.csv")

4.1 Linear modeling

Overview

Load packages

Load the following packages. For more information on installation, see the setup instructions.

#Data manipulation and plotting
library(tidyverse)
#Linear modeling
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
#Get p-values with random effects model
library(lmerTest)
## 
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
## 
##     lmer
## The following object is masked from 'package:stats':
## 
##     step

Load data

For this section, we’re just going to be focusing on one gene, IFNG, which you collapsed into one data frame in the previous section.

modelDat<-read_csv('0_data/IFNG.csv')
## Rows: 20 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): hgnc_symbol, libID, ptID, condition, sex, ptID_old
## dbl (7): E, group, lib.size, norm.factors, age_dys, total_seq, sample.weights
## lgl (2): RNAseq, methylation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Awesome, we have our data! Let’s get to the fun part - statistics!

Introduction to (statistical) sampling

In research, we want to try to learn about a target population. However, we usually can’t get data about the entire population, so we take a sample and get data about the sample. Then, using the data about the sample, we want to make inferences about the entire population.

For example, let’s say we wanted to know if people who play video games more than 10 hours a week spend more or less money on electronics than those that play video games less than 10 hours a week. Since we can’t ask everyone about their gaming and spending habits, we would ask a select few (a sample) about their gaming and spending habits. Then we would use statistics to determine the probability or likelihood that our population behaves a certain way based on that sample.

T-tests

T-tests examine the means of two groups. To do a t-test, you need a binary variable and a continuous variable. The test then examines whether the average of the continuous variable is the same for each level of the binary variable. Using the example above, our binary variable would be whether the person plays video games more or less than 10 hours a week, and the continuous variable would be the amount spent on electronics.

In our data here, we will be looking at gene expression levels (continuous) and compare Media vs Mtb infection (binary).

Let’s first look at the distribution of the data.

p <- ggplot(modelDat, aes(x=condition, y=E)) + 
  geom_boxplot() + 
  geom_jitter(shape=16, position=position_jitter(0.2)) +
  xlab("Condition") +
  ylab(expression(paste('Log'[2], " Expression")))
p

geom_boxplot draws the box plots and geom_jitter adds the individual data points, slightly scattered so we can see them better.

We can see that the expression levels are lower in the media condition than in the Mtb condition. Now let’s do a t-test to see if that difference is statistically significant.

ttestRes<-t.test(E ~ condition, data = modelDat)
ttestRes
## 
##  Welch Two Sample t-test
## 
## data:  E by condition
## t = -3.9134, df = 16.056, p-value = 0.001231
## alternative hypothesis: true difference in means between group Media and group Mtb is not equal to 0
## 95 percent confidence interval:
##  -5.681177 -1.689521
## sample estimates:
## mean in group Media   mean in group Mtb 
##           -2.166250            1.519099

The notation here is that the continuous variable goes before the ~, followed by the binary variable, and then we specify where the data are found.

From the results we get the t-test statistic (t = -3.9134), the degrees of freedom (df = 16.056), the p-value (p-value = 0.001231), as well as estimates of the mean for each group (Media = -2.17, Mtb = 1.52).

These results suggest that there is a statistically significant difference in gene expression for the gene IFNG between the Media and the Mtb infection conditions. (In statistics language: We reject the null hypothesis that the mean gene expression is equal in the Media and Mtb conditions)
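The output of t.test is a list (class htest), so individual results can be pulled out with $ for use in downstream code:

```r
#Extract individual values from the t-test result
ttestRes$p.value
ttestRes$estimate
```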

T-test assumptions

1. Data is continuous

Easy! Our data is a transformed measure of gene expression. Definitely continuous.

2. Data is collected from a simple random sample

This would be a question for the researchers who collected this sample. However, I think this is fair to assume.

3. The data is normally distributed

This is what we would check with our boxplot. We could also look at a violin plot.

p <- ggplot(modelDat, aes(x=condition, y=E)) + 
  geom_violin() + 
  geom_jitter(shape=16, position=position_jitter(0.2)) +
  xlab("Condition") +
  ylab(expression(paste('Log'[2], " Expression")))
p

Or a histogram

modelDat %>% ggplot(aes(x=E, fill = condition)) +
  geom_bar() +
  scale_x_binned() +
  xlab(expression(paste('Log'[2], " Expression")))

We don’t have a ton of data here so these plots look a little odd, but this could be useful if you have more data.

There are also statistical tests for normality. One is the Kolmogorov-Smirnov test, which is very sensitive to departures from normality and can also test against other distributions (F-distribution, etc.). Note that ks.test with "pnorm" and no other arguments compares your data to a standard normal (mean 0, sd 1), so consider supplying the sample mean and standard deviation. The other is the Shapiro-Wilk test, which only tests for normality.

ks.test(modelDat$E, "pnorm")
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  modelDat$E
## D = 0.30116, p-value = 0.04181
## alternative hypothesis: two-sided
shapiro.test(modelDat$E)
## 
##  Shapiro-Wilk normality test
## 
## data:  modelDat$E
## W = 0.95163, p-value = 0.3925

4. The sample is “reasonably large”

Well… Generally more data is better, but we only have what we have here.

5. Homogeneity of variance

This means that the variance is the same in the Media condition and in the Mtb condition. We can test this with an F-test.

res <- var.test(E ~ condition, data = modelDat)
res
## 
##  F test to compare two variances
## 
## data:  E by condition
## F = 0.48371, num df = 9, denom df = 9, p-value = 0.2943
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1201468 1.9474147
## sample estimates:
## ratio of variances 
##          0.4837103

Or a Bartlett’s test.

bartlett.test(E ~ condition, data = modelDat)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  E by condition
## Bartlett's K-squared = 1.1005, df = 1, p-value = 0.2942

Paired t-test

A paired t-test is very similar to a regular t-test, except here we are going to be incorporating the fact that each patient has a Media sample and a Mtb infection sample. That is, one person generated both pieces of data.

Another example of paired data might be pre/post tests where a person is given a test prior to some exposure and after some exposure to determine if the exposure caused a difference in test results. For example, we might give people a memory test, then have them do an obstacle course and get physically worn out, then have them do the memory test again to see if physical tiredness causes a change in memory abilities.

Let’s look at a spaghetti plot of our data. These are plots that connect lines between data that is generated by the same person. These plots can be useful for looking at paired data, longitudinal data, or other repeated measures.

p <- modelDat %>%
  ggplot(aes(x = condition, y = E, group = ptID)) +
  geom_line() +
  xlab("Condition") +
  ylab(expression(paste('Log'[2], " Expression")))
p

Nice! We can see that there’s some correlation between the data. People with higher gene expression in the media tend to have higher gene expression in the Mtb condition.

To do the paired t-test, we will first need to modify our data.

modelDatPair<-modelDat %>% 
  select(c(E, condition, ptID)) %>%
  pivot_wider(names_from = condition,
              values_from = E)

Let’s see what the resulting dataset looks like after that data manipulation. Is it what you expected?

modelDatPair
## # A tibble: 10 × 3
##    ptID   Media    Mtb
##    <chr>  <dbl>  <dbl>
##  1 pt01  -4.05   0.572
##  2 pt02  -0.846  3.39 
##  3 pt03  -3.37   0.521
##  4 pt04  -1.00   4.94 
##  5 pt05  -4.65  -2.01 
##  6 pt06  -3.63  -1.12 
##  7 pt07  -1.32   1.48 
##  8 pt08  -2.56  -0.629
##  9 pt09   0.290  3.74 
## 10 pt10  -0.524  4.30

Let’s find out what exactly the correlation is between the media and the Mtb conditions.

cor.test(modelDatPair$Media, modelDatPair$Mtb)
## 
##  Pearson's product-moment correlation
## 
## data:  modelDatPair$Media and modelDatPair$Mtb
## t = 5.2315, df = 8, p-value = 0.0007914
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5604480 0.9713172
## sample estimates:
##       cor 
## 0.8796646

Wow! Those are really correlated. Statistics is just taking data (numbers) and asking about relationships between those numbers. Since we know that this correlation is just due to the data being generated by the same person, and not a difference in gene expression caused by the condition, we want to do statistics that can take this correlation into account.

So let’s run the paired t-test with this data:

ptTestRes<-t.test(modelDatPair$Mtb,
                  modelDatPair$Media, 
                  paired = T)
ptTestRes
## 
##  Paired t-test
## 
## data:  modelDatPair$Mtb and modelDatPair$Media
## t = 9.3464, df = 9, p-value = 6.262e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.793367 4.577330
## sample estimates:
## mean of the differences 
##                3.685349

Here, we see we get different results. We still get the t-score (9.3464), degrees of freedom (9), and p-value (<0.0001), but the values have changed. This is because we’re now testing whether the difference between the matched pairs is 0. The estimate of the mean of the differences is 3.69, which is actually the same as the difference between the two means from before (Media = -2.17, Mtb = 1.52, difference = 3.69). This answer is more accurate because we know that the data came from the same person, not two independent samples.
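Equivalently, a paired t-test is a one-sample t-test of the within-pair differences against zero, which you can verify with the wide data:

```r
#Same t, df, and p-value as the paired t-test above
t.test(modelDatPair$Mtb - modelDatPair$Media)
```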

Linear models

One major limitation of t-tests is that we can’t adjust for any other information we know about the patients. We have a lot of other information in our targets dataset, including sex, age, and total sequences. What if these variables also have an impact on gene expression in each condition? The t-test can’t tell us that.

This is when we would want to use a linear model! Let’s start with a linear model not adjusting for other factors and see what we get.

lmMod<-lm(E ~ condition, data = modelDat)
summary(lmMod)
## 
## Call:
## lm(formula = E ~ condition, data = modelDat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5322 -1.5658 -0.2135  1.7005  3.4173 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -2.1662     0.6659  -3.253  0.00442 **
## conditionMtb   3.6853     0.9417   3.913  0.00102 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.106 on 18 degrees of freedom
## Multiple R-squared:  0.4597, Adjusted R-squared:  0.4297 
## F-statistic: 15.31 on 1 and 18 DF,  p-value: 0.001019

Some of these numbers look awfully familiar. -2.17 is the mean gene expression for the Media condition, 3.69 is the difference between the means for the two conditions, and \(t = 3.91\) is the same as for our unpaired t-test. The p-value of 0.00102 is slightly different because of how the degrees of freedom are calculated.
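The remaining difference comes from the Welch correction: rerunning the unpaired t-test with var.equal = TRUE (assuming equal variances, as lm does) reproduces the lm results exactly. A sketch with made-up values:

```r
# Made-up two-group data (illustration only)
grp <- rep(c("Media", "Mtb"), each = 5)
E   <- c(-2.1, -1.8, -2.5, -2.0, -2.4,
          1.4,  1.7,  1.1,  1.6,  1.2)

# Equal-variance t-test matches the two-group linear model
tt  <- t.test(E ~ grp, var.equal = TRUE)
fit <- summary(lm(E ~ grp))

unname(tt$statistic)                   # t statistic (Media minus Mtb)
fit$coefficients["grpMtb", "t value"]  # same magnitude, opposite sign
```

The signs differ only because t.test reports Media minus Mtb while the lm coefficient reports Mtb minus Media.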

Now let’s add some other variables.

lmModBig<-lm(E ~ condition + age_dys + sex, data = modelDat)
summary(lmModBig)
## 
## Call:
## lm(formula = E ~ condition + age_dys + sex, data = modelDat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.357 -1.689 -0.254  1.531  3.550 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -2.268e+00  2.194e+00  -1.033  0.31675   
## conditionMtb  3.685e+00  9.962e-01   3.699  0.00195 **
## age_dys       3.918e-05  2.605e-04   0.150  0.88234   
## sexM         -3.600e-01  1.238e+00  -0.291  0.77503   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.228 on 16 degrees of freedom
## Multiple R-squared:  0.4625, Adjusted R-squared:  0.3618 
## F-statistic:  4.59 on 3 and 16 DF,  p-value: 0.01678

We see now that our numbers have changed and we have additional results! The main result of interest here is the p-value for condition, which is 0.00195. That means that, adjusting for age and sex, gene expression is statistically significantly different by condition (Media vs Mtb infection).

Exercise: On your own, build a model that also adjusts for the total number of sequences.

Linear Modeling Assumptions

1. The true relationship is linear

This makes more sense when we’re modeling two continuous variables, but essentially we want to make sure that a linear model is the correct model to be using. To do that, we have to assume the relationship between the predictors and the outcome is linear.
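For a continuous predictor, a simple first check is to plot the outcome against the predictor and ask whether a straight line is a reasonable summary. A sketch using R’s built-in cars data (not the workshop data):

```r
# Linearity check: scatter plot of outcome vs a continuous predictor
fit <- lm(dist ~ speed, data = cars)  # built-in cars dataset

plot(cars$speed, cars$dist, xlab = "Predictor", ylab = "Outcome")
abline(fit)  # fitted line for visual comparison
```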

2. Errors are normally distributed

3. Homoscedasticity of errors (or, equal variance around the line).

When we model, we get a predicted value for each observation. We also have our observed values. The differences between the predicted and observed values are called the errors or residuals. We want those errors to be normally distributed. If there is some pattern in the error terms, this suggests our model might be missing something and could be incorrect. We can make a few plots to look at our errors.

First we create a new data frame and then plot a scatter plot of our results.

#Create a new data frame using our results
residDat<- data.frame(pred = lmModBig$fitted.values,
                      resid = lmModBig$residuals)

#Plot a scatterplot of our residuals vs our predicted values
p <- residDat %>% ggplot(aes(x = pred, y = resid)) +
  geom_point() +
  xlab("Predicted value") +
  ylab("Residual")
p

We see a gap in our predicted values. This is just because the condition makes a big difference in our predicted values. Otherwise it looks good.

Next, we’ll plot a histogram of the residuals. We are going to use base R for this plot.

#Plot a histogram of our residuals.
hist(residDat$resid, main = "Histogram of Residuals",
     xlab = "Residuals")

Looks pretty good! We want our histogram to look like a bell curve.

Lastly we’ll plot a Q-Q plot. We want this to look like a straight line with no major trends or deviations.

#Q-Q Plot
qqnorm(residDat$resid)
qqline(residDat$resid)

These plots check both the normality and homoscedasticity of the error terms. If you just see randomness in the first plot, that’s great! If you see a funnel pattern, that’s a problem. In the Q-Q plot, if you see a generally straight line with points randomly above or below the line, that’s great! If you see a curve (or a trend where most points are below the line, then above the line, then below the line again), that’s a problem.
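Base R can also generate these diagnostics directly from the fitted model object with plot(); the same calls work on lmModBig. A sketch on a built-in dataset:

```r
# plot() on an lm object produces standard diagnostics:
# which = 1 is residuals vs fitted, which = 2 is the normal Q-Q plot
fit <- lm(dist ~ speed, data = cars)  # built-in cars dataset

plot(fit, which = 1)
plot(fit, which = 2)
```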

4. Independence of the observations

This means that each data point is independent of all others. The results of one observation do not influence another observation. Here, this would mean that none of the people in our sample are related. If there were relatives in our data, then we would expect them to have more similar results than two randomly selected people.

However, we actually break this assumption since we have paired data and linear models don’t actually take that pairing into account. We’ll talk more about that later.

Exercise: Test these assumptions using your model that includes total sequences.

Goodness-of-Fit

How do we know if we really need to adjust for age and sex? One way is to use the AIC (Akaike Information Criterion). This is a measure of “goodness of fit,” or how well the model fits the data. AIC takes into account the number of variables included in the model and the maximum likelihood, which is a measure of how well the model predicts the data. Lower AIC indicates better goodness-of-fit.

AIC(lmMod)
## [1] 90.43791
AIC(lmModBig)
## [1] 94.33237

We see that the AIC is lower for our smaller model. This means we might not actually be benefiting from adding those extra variables to our model. Always use your scientific reasoning though. If you know that there is a difference by age or sex, keep them in the model even if the AIC says otherwise. Be smarter than your data!

Another method to look at goodness-of-fit is BIC (Bayesian Information Criterion). This is similar to the AIC in that it takes into account the number of variables in the model and the maximum likelihood. Lower is also better for BIC. It is calculated slightly differently, however. BIC should not be compared to AIC.

BIC(lmMod)
## [1] 93.42511
BIC(lmModBig)
## [1] 99.31103

This agrees with AIC that the simple model is a better fit.
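Both criteria come directly from the log-likelihood: AIC = 2k − 2 log L and BIC = k log(n) − 2 log L, where k is the number of estimated parameters (including the residual variance) and n is the sample size. You can verify R’s AIC() and BIC() by hand (sketch on a built-in dataset):

```r
fit <- lm(dist ~ speed, data = cars)  # built-in data for illustration

ll <- logLik(fit)
k  <- attr(ll, "df")  # parameters: intercept, slope, residual variance
n  <- nobs(fit)

all.equal(2 * k - 2 * as.numeric(ll),      AIC(fit))
all.equal(k * log(n) - 2 * as.numeric(ll), BIC(fit))
```

Because BIC’s penalty grows with log(n), it punishes extra variables more harshly than AIC in all but the smallest samples.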

Exercise: calculate the AIC and BIC for your model including total sequences. Is it a better fit than either of these models?

Linear mixed effects models

The issue with the linear models is the same issue we had with a regular t-test. We’re not using the fact that one person generated two pieces of data!

To do that, we will use a mixed effects model. Mixed effects models are similar to linear models, except we have a “random effect” that accounts for individual differences between people.

lmeMod<-lmer(E ~ condition + (1 | ptID), data = modelDat)
summary(lmeMod)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: E ~ condition + (1 | ptID)
##    Data: modelDat
## 
## REML criterion at convergence: 72.2
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.13492 -0.51308  0.06058  0.45472  1.52648 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ptID     (Intercept) 3.6570   1.9123  
##  Residual             0.7774   0.8817  
## Number of obs: 20, groups:  ptID, 10
## 
## Fixed effects:
##              Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)   -2.1662     0.6659 10.7136  -3.253  0.00795 ** 
## conditionMtb   3.6853     0.3943  9.0000   9.346 6.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## conditinMtb -0.296

The notation here is similar to the model notation we’ve seen before, but we have a new term. The (1 | ptID) is the random effect. This tells the model which variable connects the repeated measures. Here, we use the patient ID. Every row with the same patient ID value is data generated by the same person. This is how we can fit a linear model while taking into account the fact that one person generated two pieces of data.
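You can inspect the fitted random effects with ranef(), which returns one estimated intercept shift per grouping unit. A sketch using lme4’s built-in sleepstudy data (the same call works on lmeMod, where the shifts are per patient):

```r
library(lme4)

# Random-intercept model: one baseline shift per subject
fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Estimated intercept deviation for each of the 18 subjects
head(ranef(fit)$Subject)
```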

Next, let’s add some other variables to our model!

lmeModBig<-lmer(E ~ condition + sex + age_dys + (1 | ptID), data = modelDat)
## Warning: Some predictor variables are on very different scales: consider
## rescaling

## Warning: Some predictor variables are on very different scales: consider
## rescaling
summary(lmeModBig)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: E ~ condition + sex + age_dys + (1 | ptID)
##    Data: modelDat
## 
## REML criterion at convergence: 83.6
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.09476 -0.52511  0.05475  0.45468  1.48340 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ptID     (Intercept) 4.7828   2.1870  
##  Residual             0.7774   0.8817  
## Number of obs: 20, groups:  ptID, 10
## 
## Fixed effects:
##                Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)  -2.268e+00  3.091e+00  7.057e+00  -0.734    0.487    
## conditionMtb  3.685e+00  3.943e-01  9.000e+00   9.346 6.26e-06 ***
## sexM         -3.600e-01  1.788e+00  7.000e+00  -0.201    0.846    
## age_dys       3.918e-05  3.761e-04  7.000e+00   0.104    0.920    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) cndtnM sexM  
## conditinMtb -0.064              
## sexM         0.121  0.000       
## age_dys     -0.903  0.000 -0.479
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling

It looks like our effect estimate for condition didn’t change all that much by adding age and sex. We might not need to incorporate those variables, but we’ll look at that in the next section.

Exercise: Build your own model with a random effect and include total sequences.

Goodness-of-Fit

We can use the same tools to look at goodness-of-fit with our mixed effects model as we did for our linear model.

First, we’ll look at AIC.

AIC(lmeMod)
## [1] 80.23768
AIC(lmeModBig)
## [1] 95.6416

Once again, the smaller model has a lower AIC and thus fits our data better.

Next, let’s look at BIC.

BIC(lmeMod)
## [1] 84.22061
BIC(lmeModBig)
## [1] 101.616

We see the same trend where the smaller model is a better fit.

Now, let’s do something a little funky. Let’s see which fits our model better: our random effects model or our linear model.

AIC(lmMod)
## [1] 90.43791
AIC(lmeMod)
## [1] 80.23768

Looks like our mixed effects model is a better fit based on the smaller AIC! However, we also know that our data is paired so even if the AIC was higher for the mixed effects model, we should still incorporate the random effect. This is another case of using our scientific reasoning over the statistical tests.

Exercise: Compare your mixed effects model with total sequences to the other models we created. Which model is best?

Wrap up

In this section we covered t-tests, paired t-tests, linear models, and mixed effects models. That’s a lot of content in a short amount of time!

If you hope to be doing more of these analyses, we highly recommend taking a formal statistics class. In such a class you will learn more about how to interpret model results (which we didn’t even cover) and how to model different types of data.

Perhaps an even better option is collaborating with a statistician or biostatistician. We are friendly folk and are passionate about helping scientists. We prefer to get involved in projects early on to help design your study to be the most efficient and useful. This saves us from being the bearers of bad news later on when we find out that your data can’t answer your question.

4.2 RNA-seq linear modeling

Overview

Thus far, we’ve modeled one gene at a time. However, RNA-seq data sets have 1000s of genes! Here, we introduce some R packages designed for running linear models across all genes in an efficient manner.

Load packages

Load the following packages. For more information on installation, see the setup instructions.

#Data manipulation and plotting
library(tidyverse)
#Linear modeling
library(kimma)
library(limma)

set.seed(651)

Load data

All counts, gene, and sample metadata are contained in a single object.

#Attach data
attach("0_data/dat_voom.RData")
## The following object is masked _by_ .GlobalEnv:
## 
##     dat.abund.norm.voom

Note how the data object does not appear in the R environment. This is because attach gives you access to the data in R but does not load it into the current environment. We access the data below and rename it with a shorter name.

#Rename data object to be shorter
dat <- dat.abund.norm.voom

Now the dat object appears in our current R environment! We can still access dat.abund.norm.voom if we want but don’t need it cluttering up the environment pane.

Modeling in limma

Simple linear regression in limma

limma takes model inputs as a model matrix instead of the ~ variable formula format you’ve seen with lm and lmer. This matrix encodes each non-reference level of the variables in the formula as a 0/1 indicator. For example, here is the model matrix for the formula ~ condition + sex

mm_limma <- model.matrix(~ condition + sex, data=dat$targets)

head(mm_limma)
##            (Intercept) conditionMtb sexM
## pt01_Media           1            0    1
## pt01_Mtb             1            1    1
## pt02_Media           1            0    1
## pt02_Mtb             1            1    1
## pt03_Media           1            0    1
## pt03_Mtb             1            1    1

Note that we only see one level for each variable: Mtb for condition and M for sex. This shows that these levels are being compared to the reference level (Media for condition, female for sex). The reference is determined alphabetically if the variable is a character and by level order if the variable is a factor. So, if you want a different order than alphabetical, you need to format your variables as factors and set the appropriate level order. For example, here we set the reference of sex to male instead of female.

# Set order of sex variable as factors
dat$targets <- dat$targets %>% 
  mutate(sex = factor(sex, levels = c("M","F")))

#Make model matrix
mm_limma <- model.matrix(~ condition + sex, data=dat$targets)

head(mm_limma)
##            (Intercept) conditionMtb sexF
## pt01_Media           1            0    0
## pt01_Mtb             1            1    0
## pt02_Media           1            0    0
## pt02_Mtb             1            1    0
## pt03_Media           1            0    0
## pt03_Mtb             1            1    0

Once we have a model matrix, we fit the model and estimate P-values.

#Fit model
fit_limma <- lmFit(object = dat$E, design = mm_limma,
                   weights = dat$weights)
#Estimate significance
efit_limma <- eBayes(fit = fit_limma)

These outputs contain a lot of information. We can pull out the most commonly used pieces with topTable. By default, this gives you the 10 most significant genes across the entire model.

#Extract results
fdr_limma <- topTable(fit = efit_limma)
## Removing intercept from test coefficients

With some additional parameters, we can get all gene results for each variable.

fdr_limma_mtb <- topTable(fit = efit_limma, 
                      coef = "conditionMtb", number = Inf)
fdr_limma_sex <- topTable(fit = efit_limma, 
                      coef = "sexF", number = Inf)

head(fdr_limma_mtb)
##             logFC  AveExpr        t      P.Value    adj.P.Val        B
## PLEKHM3  2.690201 5.973384 15.37621 6.888863e-13 9.244165e-09 19.59613
## MAFF     3.276295 6.164289 14.07285 3.769007e-12 2.528815e-08 17.95641
## PPP1R15B 1.856514 8.167870 13.37282 9.911130e-12 3.472082e-08 17.02685
## CCL3     5.390670 8.467644 13.34214 1.034975e-11 3.472082e-08 16.94657
## RABGEF1  1.732223 6.129960 12.94266 1.832223e-11 3.582863e-08 16.44090
## IRAK2    4.532959 5.898344 12.93913 1.841603e-11 3.582863e-08 16.36725
head(fdr_limma_sex)
##             logFC  AveExpr         t      P.Value    adj.P.Val        B
## DDX3Y  -10.860939 4.023600 -67.09595 5.979891e-26 4.012208e-22 39.77376
## RPS4Y1  -9.854802 3.295219 -67.41397 5.418008e-26 4.012208e-22 37.39048
## KDM5D   -9.253437 2.870455 -58.22743 1.149610e-24 5.142204e-21 34.82432
## UTY     -8.891150 2.884104 -40.20886 2.522091e-21 6.768787e-18 32.06835
## EIF1AY  -7.837518 1.921431 -49.99166 2.749097e-23 9.222533e-20 29.63353
## ZFY     -8.025645 2.634853 -32.79599 1.693467e-19 3.787438e-16 29.10000

The variables included are:

  • logFC: log fold change. The sign is relative to your reference. For example, negative logFC for conditionMtb means Mtb minus media is negative and thus, expression is lower in Mtb-infected samples.
  • AveExpr: average expression across all samples
  • t: test statistic for significance
  • P.Value
  • adj.P.Val: FDR adjusted P-value
  • B: log-odds that the gene is differentially expressed (limma’s B-statistic)
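Since limma models log2-scale expression, logFC is a log2 fold change; converting to a linear fold change is a simple power of 2 (made-up values for illustration):

```r
logFC <- c(2.69, -1.5, 0)  # made-up log2 fold changes

# Linear fold change: 2^2.69 is ~6.5x higher in Mtb,
# 2^-1.5 is ~2.8x lower, and 2^0 is no change
fc <- 2^logFC
fc
```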

With some tidyverse manipulation, we can get results for all genes and variables in one table. Or you can use the kimma function extract_lmFit and skip the coding! This also renames the columns to match kimma’s output.

fdr_limma <- extract_lmFit(design = mm_limma, fit = efit_limma)

head(fdr_limma)
##   model    gene    variable estimate         pval          FDR        t
## 1 limma ANKRD17 (Intercept) 8.000626 8.945156e-35 4.396458e-31 177.2443
## 2 limma   PRPF6 (Intercept) 7.134773 1.965776e-34 4.396458e-31 170.7053
## 3 limma PRPF40A (Intercept) 8.336440 1.663704e-34 4.396458e-31 172.0705
## 4 limma   SF3B2 (Intercept) 8.525784 1.416921e-34 4.396458e-31 173.3946
## 5 limma   MATR3 (Intercept) 8.149936 2.524933e-34 4.840297e-31 168.6773
## 6 limma   CNOT1 (Intercept) 8.725961 1.665219e-34 4.396458e-31 172.0630
##          B  AveExpr
## 1 65.65834 8.002042
## 2 65.55667 7.037995
## 3 65.11819 8.291966
## 4 65.09105 8.497101
## 5 64.94127 8.263252
## 6 64.89197 8.632038

Paired sample design in limma

limma uses a shortcut to model paired sample designs. Unlike what you saw with lmer (which fits a true mixed effects model), limma estimates the mean correlation of gene expression between paired samples. This is a reasonable approximation of a mixed effects model and runs very fast.

Using the same model as before, we can calculate the mean correlation.

consensus.corr <- duplicateCorrelation(object = dat$E, 
                                       design = mm_limma,
                                       block = dat$targets$ptID)
consensus.corr$consensus.correlation
## [1] 0.3925634

Note: If you get an error about the package statmod, please install this package with install.packages("statmod")

Then incorporate this estimate into our model fitting.

# Fit model
fit_limma2 <- lmFit(object = dat$E, design = mm_limma,
                    weights = dat$weights,
                    block = dat$targets$ptID,
                    correlation = consensus.corr$consensus.correlation)
#Estimate p-values
efit_limma2 <- eBayes(fit_limma2)

#Get results
fdr_limma2 <- extract_lmFit(design = mm_limma, fit = efit_limma2)

Comparing these results for the condition variable, we see that the pseudo-paired model with duplicateCorrelation tends to give lower FDR values. This doesn’t necessarily mean the model is a better fit!

#Format results for merging
temp <- fdr_limma %>% 
  select(gene, variable, FDR) %>% 
  filter(variable == "conditionMtb") %>% 
  rename(`limma` = FDR)

temp2 <- fdr_limma2 %>% 
  select(gene, variable, FDR) %>% 
  filter(variable == "conditionMtb") %>% 
  rename(limma_dupCor = FDR)

#Merge results and plot FDR
full_join(temp, temp2) %>% 
  ggplot(aes(x = `limma`, y = limma_dupCor)) +
  geom_point(alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  theme_classic() +
  coord_fixed() +
  labs(title = "FDR for Mtb vs media")
## Joining, by = c("gene", "variable")

limma provides one fit quality metric, sigma, which is the estimated standard deviation of the errors. Looking at this, we see that the best fit model (lowest sigma) varies from gene to gene. To be honest, however, sigma is not the best way to assess model fit. We’ll next move into kimma where we can compare models with more fit metrics as well as run a true paired mixed effects model.

data.frame(`limma` = efit_limma$sigma,
           limma_dupCor = efit_limma2$sigma) %>% 
  
  ggplot(aes(x = `limma`, y = limma_dupCor)) +
  geom_point(alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  theme_classic() +
  coord_fixed() +
  labs(title = "Sigma")

Modeling in kimma

kimma supports more flexible modeling of RNA-seq data, including simple linear and linear mixed effects models with covariates, weights, random effects, and matrix random effects. Let’s run the same models as we did with limma.

Note that kimma is much slower than limma. It can be run on multiple processors to increase speed but if you don’t have a very powerful laptop, DO NOT RUN the next section of code.

fit_kimma <- kmFit(dat = dat, 
                   model = "~ condition + sex + (1|ptID)", 
                   use.weights = TRUE,
                   run.lm = TRUE, run.lme = TRUE,
                   metrics = TRUE)

save(fit_kimma, file="0_data/kimma_results.RData")
lm model: expression~condition+sex
lme/lmerel model: expression~condition+sex+(1|ptID)
Input: 20 libraries from 10 unique patients
Model: 20 libraries
Complete: 13419 genes
Failed: 0 genes

In the interest of time, we’ve provided the kimma results for you.

load("0_data/kimma_results.RData")

The kimma output contains 4 data frames: model results (lm, lme) that match the limma output, plus fit metrics for each model (lm.fit, lme.fit).

names(fit_kimma)
## [1] "lm"      "lme"     "lm.fit"  "lme.fit"
head(fit_kimma$lm)
## # A tibble: 6 × 7
##   model gene  variable     estimate     pval      FDR gene_biotype  
##   <chr> <chr> <chr>           <dbl>    <dbl>    <dbl> <chr>         
## 1 lm    A1BG  (Intercept)    1.38   9.83e- 7 1.35e- 6 protein_coding
## 2 lm    A1BG  conditionMtb  -0.313  2.79e- 1 4.09e- 1 protein_coding
## 3 lm    A1BG  sexF          -0.0796 8.16e- 1 9.39e- 1 protein_coding
## 4 lm    A2M   (Intercept)    6.79   7.69e-15 1.36e-14 protein_coding
## 5 lm    A2M   conditionMtb  -0.127  7.36e- 1 8.15e- 1 protein_coding
## 6 lm    A2M   sexF          -1.44   3.96e- 3 8.24e- 2 protein_coding
head(fit_kimma$lm.fit)
## # A tibble: 6 × 8
##   model  gene   sigma   AIC   BIC    Rsq adj_Rsq gene_biotype  
##   <chr>  <chr>  <dbl> <dbl> <dbl>  <dbl>   <dbl> <chr>         
## 1 lm.fit A1BG   0.766  43.4  47.4 0.0730 -0.0360 protein_coding
## 2 lm.fit A2M    2.82   54.9  58.9 0.400   0.329  protein_coding
## 3 lm.fit A2ML1  0.582  58.4  62.4 0.638   0.595  protein_coding
## 4 lm.fit A4GALT 1.89   80.0  84.0 0.440   0.374  protein_coding
## 5 lm.fit AAAS   0.971  25.1  29.1 0.335   0.257  protein_coding
## 6 lm.fit AACS   0.706  15.6  19.6 0.336   0.258  protein_coding

We can now look at metrics like AIC where we see that best fit still varies by gene (which is very common) and the overall AIC mean is lower for the simple linear model without paired design.

fit_kimma_all <- full_join(fit_kimma$lm.fit, fit_kimma$lme.fit, 
          by = c("gene"), suffix = c("_lm","_lme"))

fit_kimma_all %>%
  ggplot(aes(x = AIC_lm, y = AIC_lme)) +
  geom_point(alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  theme_classic() +
  coord_fixed() +
  labs(title = "AIC")

mean(fit_kimma$lm.fit$AIC)
## [1] 33.39391
mean(fit_kimma$lme.fit$AIC)
## [1] 37.71397

So in this case, which model do we go with? It would be nice if one model showed consistently better fit across our genes, but this rarely happens. For this experiment, we know we have a paired design so either limma with duplicateCorrelation or kimma with run.lme is appropriate. In our experience, a true mixed effects model in kimma yields fewer false positive genes when you have a paired design, even if metrics like AIC do not support it as the best fit model.
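Whichever model you choose, a practical downstream step is counting how many genes pass an FDR cutoff per model. A base R sketch using a made-up results table shaped like kimma’s output (columns and the conditionMtb variable name match the output shown above; the gene names and FDR values are invented):

```r
# Made-up results table with kimma-style columns (illustration only)
results <- data.frame(
  model    = rep(c("lm", "lme"), each = 3),
  gene     = rep(c("geneA", "geneB", "geneC"), 2),
  variable = "conditionMtb",
  FDR      = c(0.04, 0.20, 0.01,
               0.03, 0.06, 0.001)
)

# Genes significant at FDR < 0.05, tallied per model
signif <- subset(results, variable == "conditionMtb" & FDR < 0.05)
table(signif$model)
```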

Other models

We don’t have time to go over all the potential models of these data. So here’s a quick reference for you to explore later on.

  • Interaction terms
    • limma or kimma formula with * such as ~ condition * sex which is shorthand for ~ condition + sex + condition:sex
  • Pairwise comparisons between 3+ levels
    • limma makeContrasts
    • kimma run.contrasts = TRUE
  • Matrix random effects
    • limma not supported
    • kimma provide the matrix kin and set run.lmerel = TRUE
  • Removing gene-level weights
  • dream in the package variancePartition

For more help with limma, see Chapter 15 http://bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf

For kimma, see https://github.com/BIGslu/tutorials/blob/main/RNAseq/3.Hawn_RNAseq_voom.to.DEG.pdf

Useful kimma plots

If you use kimma, our other package BIGpicture has several plotting functions you may find useful. Here are some examples.

devtools::install_github("BIGslu/BIGpicture")
library(BIGpicture)

Principal component analysis (PCA)

plot_pca(dat, vars = c("condition", "outlier"))
## Joining, by = "libID"
## $condition

## 
## $outlier

Model fit metrics

plot_fit(model_result = fit_kimma, 
         x="lm", y="lme", metrics = c("AIC","adj_Rsq"))
## Summary
##   Best fit  Metric Total genes Mean delta Stdev delta
## 1       lm adj_Rsq        3583  3.8238324  118.478571
## 2      lme adj_Rsq        9836  0.9833688   28.855033
## 3       lm     AIC       10647  6.5535246    3.859962
## 4      lme     AIC        2772  4.2584460    4.588052

Significant gene Venn

plot_venn_genes(fit_kimma, model = "lme", fdr.cutoff = 0.05)

Gene expression boxplots

plot_genes(dat, fdr = fit_kimma$lme, geneID = "hgnc_symbol",
           subset.genes = "IFNG", variables = c("condition","sex"))
## [[1]]

STRING protein-protein interaction network

#list top 10 most significant genes
genes.signif <- fit_kimma$lme %>% 
  filter(variable == "condition") %>% 
  slice_min(order_by = FDR, n = 10) %>% 
  pull(gene)

#map genes to STRING database
map <- map_string(genes.signif)
#make plot
plot_string(map)