2. Introduction to the tidyverse

The tidyverse is not just one package, but a collection of several packages that are designed to work well together for data analysis tasks. Let’s load the tidyverse and read in the metadata again.

For more information on installation, see the setup instructions.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.8
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
meta <- read_csv("0_data/metadata.csv")
## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Take a look at the dataset with View():

View(meta)

Intro dplyr

summarize

Let’s start with a useful function called summarize() that will let us summarize variables in our data frame. We can use it to find the mean, median, or other statistics about the dataset. Let’s find the average age in days below:

summarize(meta, mean_age_days = mean(age_dys))
## # A tibble: 1 × 1
##   mean_age_days
##           <dbl>
## 1         9016.

The first argument to the summarize function is our data frame, and the second argument is a new variable that we want summarize to create from the age_dys column.

pipes

Instead of including the data as an argument, we can use the pipe operator %>% to pass the data value into the summarize function.

meta %>% summarize(mean_age_days = mean(age_dys))
## # A tibble: 1 × 1
##   mean_age_days
##           <dbl>
## 1         9016.

This line of code will do the exact same thing as our first summary command, but the piping function tells R to use the meta data frame as the first argument in the next function.

This lets us “chain” together multiple functions, which will be helpful later. Note that the pipe (%>%) is a bit different from using the ggplot plus (+). Pipes take the output from the left side and use it as input to the right side. Plusses layer on additional information (right side) to a preexisting plot (left side).

We can also add an Enter to make it look nicer:

meta %>%
  summarize(mean_age_days = mean(age_dys))
## # A tibble: 1 × 1
##   mean_age_days
##           <dbl>
## 1         9016.

Using the pipe operator %>% and enter command makes our code more readable. The pipe operator %>% also helps to avoid using nested function and minimizes the need for new variables.

Since we use the pipe operator so often, there is a keyboard shortcut for it in RStudio. You can press Ctrl+Shift+M on Windows or Cmd+Shift+M on a Mac.

Tip: Saving a new dataframe

Notice that when we run the following code, we are not actually saving a new variable:

meta %>%
  summarize(mean_age_days=mean(age_dys))

This simply outputs what we have created, but does not change actually change meta or save a new dataframe. To save a new dataframe, we could run:

meta_summarized <- meta %>%
  summarize(mean_age_days=mean(age_dys))

Or if we wanted to change meta itself:

meta <- meta %>%
  summarize(mean_age_days=mean(age_dys))

IMPORTANT: This would overwrite the existing meta object.

For now, we will not be saving dataframes, since we are just experimenting with dplyr functions, but it will be useful later on in this lesson.

group_by

If we want to calculate the mean age in days for males and females separately, we can use group_by before summarize:

meta %>% 
  group_by(sex) %>% 
  summarize(mean_age_days = mean(age_dys))
## # A tibble: 2 × 2
##   sex   mean_age_days
##   <chr>         <dbl>
## 1 F             7422.
## 2 M             9699.

Exercise: group_by practice

Try calculating the mean total number of sequences for the Media and Mtb conditions. You’ll need to use group_by, summarize, and the mean function

meta %>% 
  group_by(condition) %>% 
  summarize(mean_total_seq = mean(total_seq))
## # A tibble: 2 × 2
##   condition mean_total_seq
##   <chr>              <dbl>
## 1 Media           9933191.
## 2 Mtb             9246495.

select

You can use the select function to keep only the columns that you want and remove columns that you don’t need.

meta %>% 
  select(ptID, age_dys)
## # A tibble: 20 × 2
##    ptID  age_dys
##    <chr>   <dbl>
##  1 pt01    12410
##  2 pt01    12410
##  3 pt02    12775
##  4 pt02    12775
##  5 pt03    11315
##  6 pt03    11315
##  7 pt04     8395
##  8 pt04     8395
##  9 pt05     7300
## 10 pt05     7300
## 11 pt06     6570
## 12 pt06     6570
## 13 pt07     7665
## 14 pt07     7665
## 15 pt08     8760
## 16 pt08     8760
## 17 pt09     6935
## 18 pt09     6935
## 19 pt10     8030
## 20 pt10     8030

If you want to keep all but one column, put the minus (-) sign in front of it. I like to think of it as “subtracting” the column from the data frame.

meta %>% 
  select(-ptID_old)
## # A tibble: 20 × 8
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.
##  2 pt01_Mtb   pt01  Mtb         12410 M     TRUE   FALSE        8918699.
##  3 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.
##  4 pt02_Mtb   pt02  Mtb         12775 M     TRUE   FALSE        7733260.
##  5 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.
##  6 pt03_Mtb   pt03  Mtb         11315 M     TRUE   FALSE        7105193.
##  7 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.
##  8 pt04_Mtb   pt04  Mtb          8395 M     TRUE   TRUE         8413543.
##  9 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.
## 10 pt05_Mtb   pt05  Mtb          7300 M     TRUE   FALSE       15509446.
## 11 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.
## 12 pt06_Mtb   pt06  Mtb          6570 F     TRUE   FALSE        6588160.
## 13 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.
## 14 pt07_Mtb   pt07  Mtb          7665 F     TRUE   FALSE        8576245.
## 15 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.
## 16 pt08_Mtb   pt08  Mtb          8760 M     TRUE   FALSE        8220348.
## 17 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276 
## 18 pt09_Mtb   pt09  Mtb          6935 M     TRUE   FALSE       13800442.
## 19 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.
## 20 pt10_Mtb   pt10  Mtb          8030 F     TRUE   FALSE        7599609.

Let’s save the data frame without the ptID_old column, since ptID tells us everything we need to know in a more compact way.

meta <- meta %>% 
  select(-ptID_old)

mutate

Sometimes we would like to create a new column based on the values in an existing one.

meta %>% 
  mutate(age_yrs = age_dys / 365.25)
## # A tibble: 20 × 9
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  2 pt01_Mtb   pt01  Mtb         12410 M     TRUE   FALSE        8918699.    34.0
##  3 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.    35.0
##  4 pt02_Mtb   pt02  Mtb         12775 M     TRUE   FALSE        7733260.    35.0
##  5 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.    31.0
##  6 pt03_Mtb   pt03  Mtb         11315 M     TRUE   FALSE        7105193.    31.0
##  7 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.    23.0
##  8 pt04_Mtb   pt04  Mtb          8395 M     TRUE   TRUE         8413543.    23.0
##  9 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.    20.0
## 10 pt05_Mtb   pt05  Mtb          7300 M     TRUE   FALSE       15509446.    20.0
## 11 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.    18.0
## 12 pt06_Mtb   pt06  Mtb          6570 F     TRUE   FALSE        6588160.    18.0
## 13 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.    21.0
## 14 pt07_Mtb   pt07  Mtb          7665 F     TRUE   FALSE        8576245.    21.0
## 15 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.    24.0
## 16 pt08_Mtb   pt08  Mtb          8760 M     TRUE   FALSE        8220348.    24.0
## 17 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276     19.0
## 18 pt09_Mtb   pt09  Mtb          6935 M     TRUE   FALSE       13800442.    19.0
## 19 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.    22.0
## 20 pt10_Mtb   pt10  Mtb          8030 F     TRUE   FALSE        7599609.    22.0

We can overwrite the meta data frame so it will now contain the age_yrs column.

meta <- meta %>% 
  mutate(age_yrs = age_dys / 365.25)
meta
## # A tibble: 20 × 9
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  2 pt01_Mtb   pt01  Mtb         12410 M     TRUE   FALSE        8918699.    34.0
##  3 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.    35.0
##  4 pt02_Mtb   pt02  Mtb         12775 M     TRUE   FALSE        7733260.    35.0
##  5 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.    31.0
##  6 pt03_Mtb   pt03  Mtb         11315 M     TRUE   FALSE        7105193.    31.0
##  7 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.    23.0
##  8 pt04_Mtb   pt04  Mtb          8395 M     TRUE   TRUE         8413543.    23.0
##  9 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.    20.0
## 10 pt05_Mtb   pt05  Mtb          7300 M     TRUE   FALSE       15509446.    20.0
## 11 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.    18.0
## 12 pt06_Mtb   pt06  Mtb          6570 F     TRUE   FALSE        6588160.    18.0
## 13 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.    21.0
## 14 pt07_Mtb   pt07  Mtb          7665 F     TRUE   FALSE        8576245.    21.0
## 15 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.    24.0
## 16 pt08_Mtb   pt08  Mtb          8760 M     TRUE   FALSE        8220348.    24.0
## 17 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276     19.0
## 18 pt09_Mtb   pt09  Mtb          6935 M     TRUE   FALSE       13800442.    19.0
## 19 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.    22.0
## 20 pt10_Mtb   pt10  Mtb          8030 F     TRUE   FALSE        7599609.    22.0

Exercise: practice mutate

Try creating a new column total_seq_millions based on total_seq.

meta %>% mutate(total_seq_millions = total_seq / 1000000)
## # A tibble: 20 × 10
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  2 pt01_Mtb   pt01  Mtb         12410 M     TRUE   FALSE        8918699.    34.0
##  3 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.    35.0
##  4 pt02_Mtb   pt02  Mtb         12775 M     TRUE   FALSE        7733260.    35.0
##  5 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.    31.0
##  6 pt03_Mtb   pt03  Mtb         11315 M     TRUE   FALSE        7105193.    31.0
##  7 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.    23.0
##  8 pt04_Mtb   pt04  Mtb          8395 M     TRUE   TRUE         8413543.    23.0
##  9 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.    20.0
## 10 pt05_Mtb   pt05  Mtb          7300 M     TRUE   FALSE       15509446.    20.0
## 11 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.    18.0
## 12 pt06_Mtb   pt06  Mtb          6570 F     TRUE   FALSE        6588160.    18.0
## 13 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.    21.0
## 14 pt07_Mtb   pt07  Mtb          7665 F     TRUE   FALSE        8576245.    21.0
## 15 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.    24.0
## 16 pt08_Mtb   pt08  Mtb          8760 M     TRUE   FALSE        8220348.    24.0
## 17 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276     19.0
## 18 pt09_Mtb   pt09  Mtb          6935 M     TRUE   FALSE       13800442.    19.0
## 19 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.    22.0
## 20 pt10_Mtb   pt10  Mtb          8030 F     TRUE   FALSE        7599609.    22.0
## # … with 1 more variable: total_seq_millions <dbl>

filter

For each sample, we have both a “Media” sample and a “Mtb” sample. If you want to use only the “Media” samples, we can narrow down the data frame to only the rows from “Media” with filter():

meta %>% 
  filter(condition == 'Media')
## # A tibble: 10 × 9
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  2 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.    35.0
##  3 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.    31.0
##  4 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.    23.0
##  5 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.    20.0
##  6 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.    18.0
##  7 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.    21.0
##  8 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.    24.0
##  9 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276     19.0
## 10 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.    22.0

We used the == sign to test which rows in the condition column were equal to the value “Media”. Comparison operators return TRUE or FALSE. There are many useful comparison operators in R:

operator meaning
== equal
!= not equal
< less than
> greater than
<= less than or equal to
>= greater than or equal to

Quotes vs No Quotes

Notice that we put the word “Media” inside quotes. This is because we are not using a column from inside our data frame. When you need to include actual text values in R, they will be placed inside quotes to tell them apart from other object or variable names.

The general rule is that if you want to use values from the columns of your data object, then you supply the name of the column without quotes, but if you want to specify a value that does not come from your data, then use quotes.

Can you think of another way to end up with only the Media rows? Hint: try using the “not equal” operator.

meta %>% 
  filter(condition != 'Mtb')
## # A tibble: 10 × 9
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  2 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.    35.0
##  3 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.    31.0
##  4 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.    23.0
##  5 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.    20.0
##  6 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.    18.0
##  7 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.    21.0
##  8 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.    24.0
##  9 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276     19.0
## 10 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.    22.0

With filter you can also specify multiple comparisons to perform. Maybe we want to filter down to only Media samples and where at least 7 million sequences were present. We can use the ampersand (&) to accomplish that:

meta %>% 
  filter(condition == 'Media' & total_seq > 7000000)
## # A tibble: 9 × 9
##   libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##   <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
## 1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
## 2 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.    35.0
## 3 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.    23.0
## 4 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.    20.0
## 5 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.    18.0
## 6 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.    21.0
## 7 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.    24.0
## 8 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276     19.0
## 9 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.    22.0

Alternatively, maybe you want to filter down to only samples from patients that are either male or at least 21 years old. You can use the vertical bar | to mean “or”

meta %>% 
  filter(sex == 'M'  | age_yrs >= 21)
## # A tibble: 16 × 9
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  2 pt01_Mtb   pt01  Mtb         12410 M     TRUE   FALSE        8918699.    34.0
##  3 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.    35.0
##  4 pt02_Mtb   pt02  Mtb         12775 M     TRUE   FALSE        7733260.    35.0
##  5 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.    31.0
##  6 pt03_Mtb   pt03  Mtb         11315 M     TRUE   FALSE        7105193.    31.0
##  7 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.    23.0
##  8 pt04_Mtb   pt04  Mtb          8395 M     TRUE   TRUE         8413543.    23.0
##  9 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.    20.0
## 10 pt05_Mtb   pt05  Mtb          7300 M     TRUE   FALSE       15509446.    20.0
## 11 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.    24.0
## 12 pt08_Mtb   pt08  Mtb          8760 M     TRUE   FALSE        8220348.    24.0
## 13 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276     19.0
## 14 pt09_Mtb   pt09  Mtb          6935 M     TRUE   FALSE       13800442.    19.0
## 15 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.    22.0
## 16 pt10_Mtb   pt10  Mtb          8030 F     TRUE   FALSE        7599609.    22.0

Intro ggplot

We will be using the ggplot2 package today to make our plots. This is a very powerful package that creates professional looking plots and is one of the reasons people like using R so much. All plots made using the ggplot2 package start by calling the ggplot() function.

meta %>% ggplot()

When we run this code, the Plots tab will pop to the front in the lower right corner of the RStudio screen. Right now, we just see a big grey rectangle.

What we’ve done is created a ggplot object and told it we will be using the data from the meta object that we’ve loaded into R. We’ve done this by calling the ggplot() function with meta as the data argument.

So we’ve made a plot object, now we need to start telling it what we actually want to draw in this plot. The elements of a plot have a bunch of properties like an x and y position, a size, a color, etc. These properties are called aesthetics. When creating a data visualization, we map a variable in our dataset to an aesthetic in our plot. In ggplot, we can do this by creating an “aesthetic mapping”, which we do with the aes() function.

To create our plot, we need to map variables from our meta object to ggplot aesthetics using the aes() function. Since we have already told ggplot that we are using the data in the meta object, we can access the columns of meta using the object’s column names. (Remember, R is case-sensitive, so we have to be careful to match the column names exactly!)

We are interested in the total sequences per sample, so let’s start by telling our plot object that we want to assign this variable to the x axis of our plot. We do this by adding (+) information to our plot object. Add this new line to your code and run both lines by highlighting them and pressing Ctrl+Enter on your keyboard:

meta %>% 
ggplot() +
  aes(y = total_seq)

Note that we’ve added this new function call to a second line just to make it easier to read. To do this we make sure that the + is at the end of the first line otherwise R will assume your command ends when it starts the next row. The + sign indicates not only that we are adding information, but to continue on to the next line of code.

Observe that our Plot window is no longer a grey square. We now see that we’ve mapped the total_seq column to the x axis of our plot. Note that that column name isn’t very pretty as an x-axis label, so let’s add the labs() function to make a nicer label for the x axis

meta %>% 
ggplot() +
  aes(y = total_seq) +
  labs(y = "Total sequences")

OK. That looks better.

meta %>% 
ggplot() +
  aes(x = age_dys, y = total_seq) +
  labs(x = "Age (days)", y = "Total sequences")

Excellent. We’ve now told our plot object where the x and y values are coming from and what they stand for. But we haven’t told our object how we want it to draw the data.

geom_point

There are many different plot types (bar charts, scatter plots, histograms, etc). We tell our plot object what to draw by adding a “geometry” (“geom” for short) to our object. We will talk about many different geometries today, but for our first plot, let’s draw our data using the “points” geometry for each value in the data set. To do this, we add geom_point() to our plot object:

meta %>% 
ggplot() +
  aes(x = age_dys, y = total_seq) +
  labs(x = "Age (days)", y = "Total sequences") +
  geom_point()

Now we’re really getting somewhere. It finally looks like a proper plot! One thing we could do is use a different color for the sample conditions. To map the condition of each point to a color, we will again use the aes() function:

meta %>% 
ggplot() +
  aes(x = age_dys, y = total_seq, color = condition) +
  labs(x = "Age (days)", y = "Total sequences") +
  geom_point()

So far, we’ve been using geom_point() to plot two continuous (numeric) variables. What if we want to plot a categorical variable and a numeric variable?

geom_boxplot

Using the meta data, use ggplot to create a box plot with condition on the x axis and total_seq on the y axis.

meta %>% 
 ggplot() +
  aes(x = condition, y = total_seq) +
  geom_boxplot()

This type of visualization makes it easy to compare the range and spread of values across groups. The “middle” 50% of the data is located inside the box and outliers that are far away from the central mass of the data are drawn as points.

geom_col

Another way to plot a categorical and a continuous variable is with a bar plot.

meta %>% 
  ggplot() +
  aes(x = libID, y = total_seq) +
  geom_col()

Those library IDs sure are squished together and hard to raed. Maybe we should have total_seq on the x axis and libID on the y axis. In other words, it might be easier if we flip the coordinates. We can rewrite the aes() code, or instead an easy way to change the coordinates is with coord_flip():

meta %>% 
  ggplot() +
  aes(x = libID, y = total_seq) +
  geom_col() +
  coord_flip()

This is looking pretty good. Now let’s add color using the fill aesthetic!

meta %>% 
  ggplot() +
  aes(x = libID, y = total_seq, fill = condition) +
  geom_col() +
  coord_flip()

Now let’s see if we can plot the total sequences for each patient ID:

meta %>% 
  ggplot() +
  aes(x = ptID, y = total_seq, fill = condition) +
  geom_col() +
  coord_flip()

By default, the bars are stacked. Maybe we would like them to be side-by-side instead: We can tell the geom_col shapes to “dodge” each other:

meta %>% 
  ggplot() +
  aes(x = ptID, y = total_seq, fill = condition) +
  geom_col(position = position_dodge()) +
  coord_flip()

Finally, since ptID is a short name, let’s flip our coordinates back by removing coord_flip():

meta %>% 
  ggplot() +
  aes(x = ptID, y = total_seq, fill = condition) +
  geom_col(position = position_dodge())

We now have our first data cleaning plot, the total number of sequences per library.

Saving plots

If we want to share our plots with other people or put them in manuscripts, we’re going to need to make them a little prettier and save them to our computers. Let’s make it pretty and then save the plot!

Add labels:

meta %>% 
  ggplot() +
  aes(x = ptID, y = total_seq, fill = condition) +
  geom_col(position = position_dodge()) + 
  labs(x = 'patient ID', y = 'Total sequences')

Change the theme:

meta %>% 
  ggplot() +
  aes(x = ptID, y = total_seq, fill = condition) +
  geom_col(position = position_dodge()) + 
  labs(x = 'patient ID', y = 'Total sequences') +
  theme_bw()

Save to a new variable:

seq_per_lib_plot <- meta %>% 
  ggplot() +
  aes(x = ptID, y = total_seq, fill = condition) +
  geom_col(position = position_dodge()) + 
  labs(x = 'patient ID', y = 'Total sequences') +
  theme_bw()

Take a look at it by pasting the variable name into the console: seq_per_lib_plot

ggsave("sequences_per_library.png", plot = seq_per_lib_plot)

Intro tidyr

pivot

Data comes in many shapes and sizes, and one way we classify data is either “wide” or “long.” Data that is “long” has one row per observation. The metadata is in a long format. We have one row for each patient sample and each variable is in a different column. We might describe this data as “tidy” because it makes it easy to work with ggplot2 and dplyr functions (this is where the “tidy” in “tidyverse” comes from). As tidy as it may be, sometimes we may want our data in a “wide” format. Typically in “wide” format each row represents a group of observations and each value is placed in a different column rather than a different row. For example maybe we want only one row per country and want to spread the life expectancy values into different columns (one for each year).

The tidyr package contains the functions pivot_wider and pivot_longer that make it easy to switch between the two formats. The tidyr package is included in the tidyverse package so we don’t need to do anything to load it.

pivot_wider

For each patient, we have two samples: Media and Mtb. In the metadata, the only difference in these two conditions is the libID column (which is redundant with the ptID and condition) and the total_seq column. We can take the condition column and create two new columns, one with the total seqs in the media sample and one with the total seqs in the mtb sample. This is called “pivoting” the data frame “wider” because we rearrange it by creating an additional column and making it have fewer rows. Let’s try it!

meta %>% 
  select(-libID) %>% 
  pivot_wider(names_from = condition, values_from = total_seq)
## # A tibble: 10 × 8
##    ptID  age_dys sex   RNAseq methylation age_yrs     Media       Mtb
##    <chr>   <dbl> <chr> <lgl>  <lgl>         <dbl>     <dbl>     <dbl>
##  1 pt01    12410 M     TRUE   FALSE          34.0  9114402.  8918699.
##  2 pt02    12775 M     TRUE   FALSE          35.0  9221555.  7733260.
##  3 pt03    11315 M     TRUE   FALSE          31.0  6231728.  7105193.
##  4 pt04     8395 M     TRUE   TRUE           23.0 10205557.  8413543.
##  5 pt05     7300 M     TRUE   FALSE          20.0 15536685. 15509446.
##  6 pt06     6570 F     TRUE   FALSE          18.0  7085995.  6588160.
##  7 pt07     7665 F     TRUE   FALSE          21.0 10706098.  8576245.
##  8 pt08     8760 M     TRUE   FALSE          24.0  9957906.  8220348.
##  9 pt09     6935 M     TRUE   FALSE          19.0 13055276  13800442.
## 10 pt10     8030 F     TRUE   FALSE          22.0  8216706.  7599609.

Notice here that we tell pivot_wider() which columns to pull the names we wish our new columns to be named from the year variable, and the values to populate those columns from the condition variable. (Again, neither of which have to be in quotes in the code when there are no special characters or spaces - certainly an incentive not to use special characters or spaces!) We see that the resulting table has new columns by year, and the values populate it with our remaining variables dictating the rows.

Maybe we should assign those columns more informative names:

meta %>% 
  select(-libID) %>% 
  pivot_wider(names_from = condition, values_from = total_seq,
              names_prefix = "total_seq_")
## # A tibble: 10 × 8
##    ptID  age_dys sex   RNAseq methylation age_yrs total_seq_Media total_seq_Mtb
##    <chr>   <dbl> <chr> <lgl>  <lgl>         <dbl>           <dbl>         <dbl>
##  1 pt01    12410 M     TRUE   FALSE          34.0        9114402.      8918699.
##  2 pt02    12775 M     TRUE   FALSE          35.0        9221555.      7733260.
##  3 pt03    11315 M     TRUE   FALSE          31.0        6231728.      7105193.
##  4 pt04     8395 M     TRUE   TRUE           23.0       10205557.      8413543.
##  5 pt05     7300 M     TRUE   FALSE          20.0       15536685.     15509446.
##  6 pt06     6570 F     TRUE   FALSE          18.0        7085995.      6588160.
##  7 pt07     7665 F     TRUE   FALSE          21.0       10706098.      8576245.
##  8 pt08     8760 M     TRUE   FALSE          24.0        9957906.      8220348.
##  9 pt09     6935 M     TRUE   FALSE          19.0       13055276      13800442.
## 10 pt10     8030 F     TRUE   FALSE          22.0        8216706.      7599609.

And now let’s save the new wider metadata to a new variable:

meta_wide <- meta %>% 
  select(-libID) %>% 
  pivot_wider(names_from = condition, values_from = total_seq)

Notice that the number of rows and columns has changed:

nrow(meta)
## [1] 20
ncol(meta)
## [1] 9
nrow(meta_wide)
## [1] 10
ncol(meta_wide)
## [1] 8

pivot_longer

Everything we did with pivot_wider, we can reverse with pivot_longer. Maybe you receive a data frame in a wide format, but you need it in a long format. Here’s how we can get back to a long data frame:

meta_wide %>% 
  pivot_longer(c(Media, Mtb), 
               names_to = 'condition', values_to = 'total_seq')
## # A tibble: 20 × 8
##    ptID  age_dys sex   RNAseq methylation age_yrs condition total_seq
##    <chr>   <dbl> <chr> <lgl>  <lgl>         <dbl> <chr>         <dbl>
##  1 pt01    12410 M     TRUE   FALSE          34.0 Media      9114402.
##  2 pt01    12410 M     TRUE   FALSE          34.0 Mtb        8918699.
##  3 pt02    12775 M     TRUE   FALSE          35.0 Media      9221555.
##  4 pt02    12775 M     TRUE   FALSE          35.0 Mtb        7733260.
##  5 pt03    11315 M     TRUE   FALSE          31.0 Media      6231728.
##  6 pt03    11315 M     TRUE   FALSE          31.0 Mtb        7105193.
##  7 pt04     8395 M     TRUE   TRUE           23.0 Media     10205557.
##  8 pt04     8395 M     TRUE   TRUE           23.0 Mtb        8413543.
##  9 pt05     7300 M     TRUE   FALSE          20.0 Media     15536685.
## 10 pt05     7300 M     TRUE   FALSE          20.0 Mtb       15509446.
## 11 pt06     6570 F     TRUE   FALSE          18.0 Media      7085995.
## 12 pt06     6570 F     TRUE   FALSE          18.0 Mtb        6588160.
## 13 pt07     7665 F     TRUE   FALSE          21.0 Media     10706098.
## 14 pt07     7665 F     TRUE   FALSE          21.0 Mtb        8576245.
## 15 pt08     8760 M     TRUE   FALSE          24.0 Media      9957906.
## 16 pt08     8760 M     TRUE   FALSE          24.0 Mtb        8220348.
## 17 pt09     6935 M     TRUE   FALSE          19.0 Media     13055276 
## 18 pt09     6935 M     TRUE   FALSE          19.0 Mtb       13800442.
## 19 pt10     8030 F     TRUE   FALSE          22.0 Media      8216706.
## 20 pt10     8030 F     TRUE   FALSE          22.0 Mtb        7599609.
sequence data

So far, we’ve been working with just the metadata. But this workshop is about analyzing RNAseq data, so let’s start working with actual sequencing data.

count <- read_csv('0_data/raw_counts.csv')
## Rows: 44786 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): hgnc_symbol
## dbl (20): pt01_Media, pt01_Mtb, pt02_Media, pt02_Mtb, pt03_Media, pt03_Mtb, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Take a look at the counts data with: View(count)

Notice how each column is a libID and each row is a hgnc_symbol. The values are the raw sequence counts for the genes in each library. Soon, we’re going to want to combine the sequence data with our metadata. But first, we need to change this from a wide to a long format.

count %>% 
  pivot_longer(-hgnc_symbol, names_to = 'libID', values_to = 'raw_count')
## # A tibble: 895,720 × 3
##    hgnc_symbol libID      raw_count
##    <chr>       <chr>          <dbl>
##  1 5_8S_rRNA   pt01_Media     20.0 
##  2 5_8S_rRNA   pt01_Mtb       20.0 
##  3 5_8S_rRNA   pt02_Media      7.11
##  4 5_8S_rRNA   pt02_Mtb        6.06
##  5 5_8S_rRNA   pt03_Media     12.2 
##  6 5_8S_rRNA   pt03_Mtb        4.07
##  7 5_8S_rRNA   pt04_Media      4   
##  8 5_8S_rRNA   pt04_Mtb        3   
##  9 5_8S_rRNA   pt05_Media     22.0 
## 10 5_8S_rRNA   pt05_Mtb       55.7 
## # … with 895,710 more rows

Let’s save this to a new data frame so we can re-use it later.

count_long <- count %>% 
  pivot_longer(-hgnc_symbol, names_to = 'libID', values_to = 'raw_count')

join

Now let’s combine our raw sequence counts with our metadata. To combine two data frames, we will use a join function. The dplyr package has a number of tools for joining data frames together depending on what we want to do with the rows of the data of countries that are not represented in both data frames.

?join

We have a few options here:

The mutating joins add columns from y to x, matching rows based on the keys: - inner_join(): includes all rows in x and y. - left_join(): includes all rows in x. - right_join(): includes all rows in y. - full_join(): includes all rows in x or y.

In an “inner join”, the new data frame only has those rows where the same key is found in both data frames. This is a very commonly used join.

inner_join(meta, count_long, by = 'libID')
## # A tibble: 895,720 × 11
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq age_yrs
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>   <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  2 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  3 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  4 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  5 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  6 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  7 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  8 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
##  9 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
## 10 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.    34.0
## # … with 895,710 more rows, and 2 more variables: hgnc_symbol <chr>,
## #   raw_count <dbl>

Now let’s save our joined data frame to a new variable:

full_data <- inner_join(meta, count_long, by = 'libID')

Now that we have our metadata joined with our raw count data, we’re ready to clean and analyze it!

Bonus: Other dplyr join functions

Outer joins and can be performed using left_join(), right_join(), and full_join(). In a “left join”, if the key is present in the left hand data frame, it will appear in the output, even if it is not found in the the right hand data frame. For a right join, the opposite is true. For a full join, all possible keys are included in the output data frame.

Using our joined dataset

Let’s find out which genes had the highest raw counts.

top_10_abundant_genes <- full_data %>% 
  filter(condition == 'Media') %>% 
  group_by(hgnc_symbol) %>% 
  summarize(med_raw_count = median(raw_count)) %>% 
  arrange(desc(med_raw_count)) %>% 
  slice_max(order_by = med_raw_count, n = 10)
top_10_abundant_genes
## # A tibble: 10 × 2
##    hgnc_symbol   med_raw_count
##    <chr>                 <dbl>
##  1 MALAT1              263231 
##  2 Metazoa_SRP         208862.
##  3 RN7SL1              189057.
##  4 CH507-513H4.3       140705.
##  5 CH507-513H4.4       140705.
##  6 CH507-513H4.6       139025.
##  7 MT-RNR2             132437.
##  8 RN7SK                83060.
##  9 MT-RNR1              78763 
## 10 MT-CO1               77922.

Now let’s make a boxplot with only those genes.

full_data %>% 
  filter(hgnc_symbol %in% top_10_abundant_genes$hgnc_symbol, 
         condition == 'Media') %>% 
  ggplot() +
  aes(x = hgnc_symbol, y = raw_count) +
  geom_boxplot() +
  coord_flip() +
  labs(title = 'Top 10 most abundant genes')

Try plotting just interferon 1 gamma for each patient:

full_data %>% 
  filter(hgnc_symbol == 'IFNG', condition == 'Media') %>% 
  ggplot() +
  aes(x = ptID, y = raw_count) +
  geom_col() +
  labs(title = 'IFNG RNAseq counts')