2. Introduction to the tidyverse

In this workshop, we introduce you to the R tidyverse, a suite of packages for data manipulation and visualization in R. In it, we cover how to:

  • Subset rows/columns
  • Create new variables
  • Calculate summary statistics
  • Convert wide/long formats
  • Merge data frames

During the workshop, we will build an R script together, which will be posted as ‘live_notes’ after the workshop at https://github.com/BIGslu/workshops/tree/main/2022.08.15_R.tidyverse.workshop/live_notes

The tidyverse is not just one package, but a collection of several packages that are designed to work well together for data analysis tasks. Let’s load the tidyverse. For more information on installation, see the setup instructions.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

We also load the metadata, this time using the tidyverse function read_csv. This function is very similar to base R read.csv, except that it automatically tells you the structure of the loaded data, performs better when determining different classes of data within a data frame, and save the object as a tibble.

meta <- read_csv("data/metadata.csv")
## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
class(meta)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

A tibble is just a data frame with some extra hidden formatting. You will not need to interface with these hidden features and all the following functions would behave the same if you’d read in a simple data.frame with read.csv.

Take a look at the dataset with View():

View(meta)

Here is a brief description of the variables in these data.

  • libID (character): Unique library ID for each sequencing library. Formatted as ptID_condition
  • ptID (character): Patient ID. Each patient has 2 libraries which differ in their condition
  • condition (character): Experimental condition. Media for media control. Mtb of M. tuberculosis infected.
  • age_dys (numeric): Age in ages
  • sex (character): Biological sex, M or F
  • ptID_old (character): Old patient ID with leading zeroes
  • RNAseq (logical): If the library has pass-filter RNA-seq data
  • methylation (logical): If the library has pass-filter methylation data
  • total_seq (numeric): Total number of pass-filter sequences

Intro dplyr

The first tidyverse package we will use is dplyr, which is for data manipulation. You do not need to load this package with library as it was already loaded as part of the tidyverse.

select

You can use the select function to keep only the columns that you want and remove columns that you don’t need. The first argument to the select function is our data frame, and the second argument is the name of the column we want to keep.

select(meta, ptID)
## # A tibble: 20 × 1
##    ptID 
##    <chr>
##  1 pt01 
##  2 pt01 
##  3 pt02 
##  4 pt02 
##  5 pt03 
##  6 pt03 
##  7 pt04 
##  8 pt04 
##  9 pt05 
## 10 pt05 
## 11 pt06 
## 12 pt06 
## 13 pt07 
## 14 pt07 
## 15 pt08 
## 16 pt08 
## 17 pt09 
## 18 pt09 
## 19 pt10 
## 20 pt10

We can continue to list multiple columns to keep by separating with commas.

select(meta, ptID, total_seq)
## # A tibble: 20 × 2
##    ptID  total_seq
##    <chr>     <dbl>
##  1 pt01   9114402.
##  2 pt01   8918699.
##  3 pt02   9221555.
##  4 pt02   7733260.
##  5 pt03   6231728.
##  6 pt03   7105193.
##  7 pt04  10205557.
##  8 pt04   8413543.
##  9 pt05  15536685.
## 10 pt05  15509446.
## 11 pt06   7085995.
## 12 pt06   6588160.
## 13 pt07  10706098.
## 14 pt07   8576245.
## 15 pt08   9957906.
## 16 pt08   8220348.
## 17 pt09  13055276 
## 18 pt09  13800442.
## 19 pt10   8216706.
## 20 pt10   7599609.

If you want to keep all but one column, put the minus (-) sign in front of it. I like to think of it as “subtracting” the column from the data frame.

select(meta, -ptID_old)
## # A tibble: 20 × 8
##    libID      ptID  condition age_dys sex   RNAseq methylation total_seq
##    <chr>      <chr> <chr>       <dbl> <chr> <lgl>  <lgl>           <dbl>
##  1 pt01_Media pt01  Media       12410 M     TRUE   FALSE        9114402.
##  2 pt01_Mtb   pt01  Mtb         12410 M     TRUE   FALSE        8918699.
##  3 pt02_Media pt02  Media       12775 M     TRUE   FALSE        9221555.
##  4 pt02_Mtb   pt02  Mtb         12775 M     TRUE   FALSE        7733260.
##  5 pt03_Media pt03  Media       11315 M     TRUE   FALSE        6231728.
##  6 pt03_Mtb   pt03  Mtb         11315 M     TRUE   FALSE        7105193.
##  7 pt04_Media pt04  Media        8395 M     TRUE   TRUE        10205557.
##  8 pt04_Mtb   pt04  Mtb          8395 M     TRUE   TRUE         8413543.
##  9 pt05_Media pt05  Media        7300 M     TRUE   FALSE       15536685.
## 10 pt05_Mtb   pt05  Mtb          7300 M     TRUE   FALSE       15509446.
## 11 pt06_Media pt06  Media        6570 F     TRUE   FALSE        7085995.
## 12 pt06_Mtb   pt06  Mtb          6570 F     TRUE   FALSE        6588160.
## 13 pt07_Media pt07  Media        7665 F     TRUE   FALSE       10706098.
## 14 pt07_Mtb   pt07  Mtb          7665 F     TRUE   FALSE        8576245.
## 15 pt08_Media pt08  Media        8760 M     TRUE   FALSE        9957906.
## 16 pt08_Mtb   pt08  Mtb          8760 M     TRUE   FALSE        8220348.
## 17 pt09_Media pt09  Media        6935 M     TRUE   FALSE       13055276 
## 18 pt09_Mtb   pt09  Mtb          6935 M     TRUE   FALSE       13800442.
## 19 pt10_Media pt10  Media        8030 F     TRUE   FALSE        8216706.
## 20 pt10_Mtb   pt10  Mtb          8030 F     TRUE   FALSE        7599609.

filter

You can use filter to remove/keep rows. Instead of specifying a column name as in select, you provide a statement that gives TRUE/FALSE and filter keeps all the rows for which that statement is TRUE. The conditional operators we saw in Intro R will help with this!

Let’s filter to just the Media samples.

filter(meta, condition == "Media")
## # A tibble: 10 × 9
##    libID      ptID  condition age_dys sex   ptID_old RNAseq methylation total_…¹
##    <chr>      <chr> <chr>       <dbl> <chr> <chr>    <lgl>  <lgl>          <dbl>
##  1 pt01_Media pt01  Media       12410 M     pt00001  TRUE   FALSE         9.11e6
##  2 pt02_Media pt02  Media       12775 M     pt00002  TRUE   FALSE         9.22e6
##  3 pt03_Media pt03  Media       11315 M     pt00003  TRUE   FALSE         6.23e6
##  4 pt04_Media pt04  Media        8395 M     pt00004  TRUE   TRUE          1.02e7
##  5 pt05_Media pt05  Media        7300 M     pt00005  TRUE   FALSE         1.55e7
##  6 pt06_Media pt06  Media        6570 F     pt00006  TRUE   FALSE         7.09e6
##  7 pt07_Media pt07  Media        7665 F     pt00007  TRUE   FALSE         1.07e7
##  8 pt08_Media pt08  Media        8760 M     pt00008  TRUE   FALSE         9.96e6
##  9 pt09_Media pt09  Media        6935 M     pt00009  TRUE   FALSE         1.31e7
## 10 pt10_Media pt10  Media        8030 F     pt00010  TRUE   FALSE         8.22e6
## # … with abbreviated variable name ¹​total_seq

This is the same as what we did in Intro R with the following, only it is more readable and will fit into a tidyverse workflow. More on this when we get to pipes!

meta[meta$condition == "Media", ]
## # A tibble: 10 × 9
##    libID      ptID  condition age_dys sex   ptID_old RNAseq methylation total_…¹
##    <chr>      <chr> <chr>       <dbl> <chr> <chr>    <lgl>  <lgl>          <dbl>
##  1 pt01_Media pt01  Media       12410 M     pt00001  TRUE   FALSE         9.11e6
##  2 pt02_Media pt02  Media       12775 M     pt00002  TRUE   FALSE         9.22e6
##  3 pt03_Media pt03  Media       11315 M     pt00003  TRUE   FALSE         6.23e6
##  4 pt04_Media pt04  Media        8395 M     pt00004  TRUE   TRUE          1.02e7
##  5 pt05_Media pt05  Media        7300 M     pt00005  TRUE   FALSE         1.55e7
##  6 pt06_Media pt06  Media        6570 F     pt00006  TRUE   FALSE         7.09e6
##  7 pt07_Media pt07  Media        7665 F     pt00007  TRUE   FALSE         1.07e7
##  8 pt08_Media pt08  Media        8760 M     pt00008  TRUE   FALSE         9.96e6
##  9 pt09_Media pt09  Media        6935 M     pt00009  TRUE   FALSE         1.31e7
## 10 pt10_Media pt10  Media        8030 F     pt00010  TRUE   FALSE         8.22e6
## # … with abbreviated variable name ¹​total_seq

You can string together multiple statements and ask if both/all are TRUE with and & or if at least one is TRUE with or |. Notice that the & statement removes all rows because no library has both a Media and Mtb condition. In contrast, the | statement keeps all rows because all libraries are either the Media or Mtb condition.

filter(meta, condition == "Media" & condition == "Mtb")
## # A tibble: 0 × 9
## # … with 9 variables: libID <chr>, ptID <chr>, condition <chr>, age_dys <dbl>,
## #   sex <chr>, ptID_old <chr>, RNAseq <lgl>, methylation <lgl>, total_seq <dbl>
filter(meta, condition == "Media" | condition == "Mtb")
## # A tibble: 20 × 9
##    libID      ptID  condition age_dys sex   ptID_old RNAseq methylation total_…¹
##    <chr>      <chr> <chr>       <dbl> <chr> <chr>    <lgl>  <lgl>          <dbl>
##  1 pt01_Media pt01  Media       12410 M     pt00001  TRUE   FALSE         9.11e6
##  2 pt01_Mtb   pt01  Mtb         12410 M     pt00001  TRUE   FALSE         8.92e6
##  3 pt02_Media pt02  Media       12775 M     pt00002  TRUE   FALSE         9.22e6
##  4 pt02_Mtb   pt02  Mtb         12775 M     pt00002  TRUE   FALSE         7.73e6
##  5 pt03_Media pt03  Media       11315 M     pt00003  TRUE   FALSE         6.23e6
##  6 pt03_Mtb   pt03  Mtb         11315 M     pt00003  TRUE   FALSE         7.11e6
##  7 pt04_Media pt04  Media        8395 M     pt00004  TRUE   TRUE          1.02e7
##  8 pt04_Mtb   pt04  Mtb          8395 M     pt00004  TRUE   TRUE          8.41e6
##  9 pt05_Media pt05  Media        7300 M     pt00005  TRUE   FALSE         1.55e7
## 10 pt05_Mtb   pt05  Mtb          7300 M     pt00005  TRUE   FALSE         1.55e7
## 11 pt06_Media pt06  Media        6570 F     pt00006  TRUE   FALSE         7.09e6
## 12 pt06_Mtb   pt06  Mtb          6570 F     pt00006  TRUE   FALSE         6.59e6
## 13 pt07_Media pt07  Media        7665 F     pt00007  TRUE   FALSE         1.07e7
## 14 pt07_Mtb   pt07  Mtb          7665 F     pt00007  TRUE   FALSE         8.58e6
## 15 pt08_Media pt08  Media        8760 M     pt00008  TRUE   FALSE         9.96e6
## 16 pt08_Mtb   pt08  Mtb          8760 M     pt00008  TRUE   FALSE         8.22e6
## 17 pt09_Media pt09  Media        6935 M     pt00009  TRUE   FALSE         1.31e7
## 18 pt09_Mtb   pt09  Mtb          6935 M     pt00009  TRUE   FALSE         1.38e7
## 19 pt10_Media pt10  Media        8030 F     pt00010  TRUE   FALSE         8.22e6
## 20 pt10_Mtb   pt10  Mtb          8030 F     pt00010  TRUE   FALSE         7.60e6
## # … with abbreviated variable name ¹​total_seq

Quotes vs No Quotes

Notice that we put the word “Media” inside quotes. This is because we are not using a column from inside our data frame. When you need to include actual text values in R, they will be placed inside quotes to tell them apart from other object or variable names.

The general rule is that if you want to use an R object (data frame, column, etc), then you supply the name without quotes. If you want to specify a value within the R object, then use quotes.

rename

Sometimes your variable names are not ideal. rename does just want it says; it renames columns in your data. The syntax for rename is to give the new name first and then tell R what that new name should replace, so new_name = old_name. Here, we rename condition to infection as this better represents those data.

rename(meta, infection = condition)
## # A tibble: 20 × 9
##    libID      ptID  infection age_dys sex   ptID_old RNAseq methylation total_…¹
##    <chr>      <chr> <chr>       <dbl> <chr> <chr>    <lgl>  <lgl>          <dbl>
##  1 pt01_Media pt01  Media       12410 M     pt00001  TRUE   FALSE         9.11e6
##  2 pt01_Mtb   pt01  Mtb         12410 M     pt00001  TRUE   FALSE         8.92e6
##  3 pt02_Media pt02  Media       12775 M     pt00002  TRUE   FALSE         9.22e6
##  4 pt02_Mtb   pt02  Mtb         12775 M     pt00002  TRUE   FALSE         7.73e6
##  5 pt03_Media pt03  Media       11315 M     pt00003  TRUE   FALSE         6.23e6
##  6 pt03_Mtb   pt03  Mtb         11315 M     pt00003  TRUE   FALSE         7.11e6
##  7 pt04_Media pt04  Media        8395 M     pt00004  TRUE   TRUE          1.02e7
##  8 pt04_Mtb   pt04  Mtb          8395 M     pt00004  TRUE   TRUE          8.41e6
##  9 pt05_Media pt05  Media        7300 M     pt00005  TRUE   FALSE         1.55e7
## 10 pt05_Mtb   pt05  Mtb          7300 M     pt00005  TRUE   FALSE         1.55e7
## 11 pt06_Media pt06  Media        6570 F     pt00006  TRUE   FALSE         7.09e6
## 12 pt06_Mtb   pt06  Mtb          6570 F     pt00006  TRUE   FALSE         6.59e6
## 13 pt07_Media pt07  Media        7665 F     pt00007  TRUE   FALSE         1.07e7
## 14 pt07_Mtb   pt07  Mtb          7665 F     pt00007  TRUE   FALSE         8.58e6
## 15 pt08_Media pt08  Media        8760 M     pt00008  TRUE   FALSE         9.96e6
## 16 pt08_Mtb   pt08  Mtb          8760 M     pt00008  TRUE   FALSE         8.22e6
## 17 pt09_Media pt09  Media        6935 M     pt00009  TRUE   FALSE         1.31e7
## 18 pt09_Mtb   pt09  Mtb          6935 M     pt00009  TRUE   FALSE         1.38e7
## 19 pt10_Media pt10  Media        8030 F     pt00010  TRUE   FALSE         8.22e6
## 20 pt10_Mtb   pt10  Mtb          8030 F     pt00010  TRUE   FALSE         7.60e6
## # … with abbreviated variable name ¹​total_seq

summarize

The function summarize() performs summary statistics on your data. We can use it to find the mean, median, or other statistics about the dataset. Let’s find the average age in days below:

summarize(meta, mean(age_dys))
## # A tibble: 1 × 1
##   `mean(age_dys)`
##             <dbl>
## 1           9016.

Notice that summarize automatically names the column with the summary statistic based on the function used to calculate that statistic. This is not ideal as the formatting has parentheses. So we can specify a name instead.

summarize(meta, mean_age_dys = mean(age_dys))
## # A tibble: 1 × 1
##   mean_age_dys
##          <dbl>
## 1        9016.

pipes

Instead of including the data as an argument, we can use the pipe operator %>% to pass the data value into the summarize function.

meta %>% summarize(mean_age_dys = mean(age_dys))
## # A tibble: 1 × 1
##   mean_age_dys
##          <dbl>
## 1        9016.

This line of code will do the exact same thing as our previous summary command, but the piping function tells R to use the meta data frame as the first argument in the next function. This lets us “chain” together multiple functions into a pipeline. The pipe takes the output from the left side and use it as input to the right side.

We can also add an Enter to make it look nicer:

meta %>%
  summarize(mean_age_days = mean(age_dys))
## # A tibble: 1 × 1
##   mean_age_days
##           <dbl>
## 1         9016.

Using the pipe operator %>% and enter command makes our code more readable. The pipe operator %>% also helps to avoid using nested function like f(g(x)) and minimizes the need for new variables.

Since we use the pipe operator so often, there is a keyboard shortcut for it in RStudio. You can press Ctrl+Shift+M on Windows or Cmd+Shift+M on a Mac.

group_by

The pipe comes in handy with functions like group_by. If we want to calculate the mean age in days for males and females separately, we can use group_by before summarize:

meta %>% 
  group_by(sex) %>% 
  summarize(mean_age_days = mean(age_dys))
## # A tibble: 2 × 2
##   sex   mean_age_days
##   <chr>         <dbl>
## 1 F             7422.
## 2 M             9699.

And we could merge this with a filter to only look at the Media libraries, since each age is repeated for that patient’s Media and Mtb libraries. The results are the same in this case but filtering would be important if any of our patient’s were missing one of the conditions.

meta %>% 
  filter(condition == "Media") %>% 
  group_by(sex) %>% 
  summarize(mean_age_days = mean(age_dys))
## # A tibble: 2 × 2
##   sex   mean_age_days
##   <chr>         <dbl>
## 1 F             7422.
## 2 M             9699.

mutate

Sometimes we would like to create a new column based on the values in an existing one. For example, age is days is a little difficult to understand. Let’s add a column with age in years.

meta %>% 
  mutate(age_yrs = age_dys / 365.25)
## # A tibble: 20 × 10
##    libID      ptID  condi…¹ age_dys sex   ptID_…² RNAseq methy…³ total…⁴ age_yrs
##    <chr>      <chr> <chr>     <dbl> <chr> <chr>   <lgl>  <lgl>     <dbl>   <dbl>
##  1 pt01_Media pt01  Media     12410 M     pt00001 TRUE   FALSE    9.11e6    34.0
##  2 pt01_Mtb   pt01  Mtb       12410 M     pt00001 TRUE   FALSE    8.92e6    34.0
##  3 pt02_Media pt02  Media     12775 M     pt00002 TRUE   FALSE    9.22e6    35.0
##  4 pt02_Mtb   pt02  Mtb       12775 M     pt00002 TRUE   FALSE    7.73e6    35.0
##  5 pt03_Media pt03  Media     11315 M     pt00003 TRUE   FALSE    6.23e6    31.0
##  6 pt03_Mtb   pt03  Mtb       11315 M     pt00003 TRUE   FALSE    7.11e6    31.0
##  7 pt04_Media pt04  Media      8395 M     pt00004 TRUE   TRUE     1.02e7    23.0
##  8 pt04_Mtb   pt04  Mtb        8395 M     pt00004 TRUE   TRUE     8.41e6    23.0
##  9 pt05_Media pt05  Media      7300 M     pt00005 TRUE   FALSE    1.55e7    20.0
## 10 pt05_Mtb   pt05  Mtb        7300 M     pt00005 TRUE   FALSE    1.55e7    20.0
## 11 pt06_Media pt06  Media      6570 F     pt00006 TRUE   FALSE    7.09e6    18.0
## 12 pt06_Mtb   pt06  Mtb        6570 F     pt00006 TRUE   FALSE    6.59e6    18.0
## 13 pt07_Media pt07  Media      7665 F     pt00007 TRUE   FALSE    1.07e7    21.0
## 14 pt07_Mtb   pt07  Mtb        7665 F     pt00007 TRUE   FALSE    8.58e6    21.0
## 15 pt08_Media pt08  Media      8760 M     pt00008 TRUE   FALSE    9.96e6    24.0
## 16 pt08_Mtb   pt08  Mtb        8760 M     pt00008 TRUE   FALSE    8.22e6    24.0
## 17 pt09_Media pt09  Media      6935 M     pt00009 TRUE   FALSE    1.31e7    19.0
## 18 pt09_Mtb   pt09  Mtb        6935 M     pt00009 TRUE   FALSE    1.38e7    19.0
## 19 pt10_Media pt10  Media      8030 F     pt00010 TRUE   FALSE    8.22e6    22.0
## 20 pt10_Mtb   pt10  Mtb        8030 F     pt00010 TRUE   FALSE    7.60e6    22.0
## # … with abbreviated variable names ¹​condition, ²​ptID_old, ³​methylation,
## #   ⁴​total_seq

Note that this new column is not saved in meta because we printed the result to the console and did not save it to the environment. We can overwrite the meta data frame so it will now contains the age_yrs column.

meta <- meta %>% 
  mutate(age_yrs = age_dys / 365.25)
meta
## # A tibble: 20 × 10
##    libID      ptID  condi…¹ age_dys sex   ptID_…² RNAseq methy…³ total…⁴ age_yrs
##    <chr>      <chr> <chr>     <dbl> <chr> <chr>   <lgl>  <lgl>     <dbl>   <dbl>
##  1 pt01_Media pt01  Media     12410 M     pt00001 TRUE   FALSE    9.11e6    34.0
##  2 pt01_Mtb   pt01  Mtb       12410 M     pt00001 TRUE   FALSE    8.92e6    34.0
##  3 pt02_Media pt02  Media     12775 M     pt00002 TRUE   FALSE    9.22e6    35.0
##  4 pt02_Mtb   pt02  Mtb       12775 M     pt00002 TRUE   FALSE    7.73e6    35.0
##  5 pt03_Media pt03  Media     11315 M     pt00003 TRUE   FALSE    6.23e6    31.0
##  6 pt03_Mtb   pt03  Mtb       11315 M     pt00003 TRUE   FALSE    7.11e6    31.0
##  7 pt04_Media pt04  Media      8395 M     pt00004 TRUE   TRUE     1.02e7    23.0
##  8 pt04_Mtb   pt04  Mtb        8395 M     pt00004 TRUE   TRUE     8.41e6    23.0
##  9 pt05_Media pt05  Media      7300 M     pt00005 TRUE   FALSE    1.55e7    20.0
## 10 pt05_Mtb   pt05  Mtb        7300 M     pt00005 TRUE   FALSE    1.55e7    20.0
## 11 pt06_Media pt06  Media      6570 F     pt00006 TRUE   FALSE    7.09e6    18.0
## 12 pt06_Mtb   pt06  Mtb        6570 F     pt00006 TRUE   FALSE    6.59e6    18.0
## 13 pt07_Media pt07  Media      7665 F     pt00007 TRUE   FALSE    1.07e7    21.0
## 14 pt07_Mtb   pt07  Mtb        7665 F     pt00007 TRUE   FALSE    8.58e6    21.0
## 15 pt08_Media pt08  Media      8760 M     pt00008 TRUE   FALSE    9.96e6    24.0
## 16 pt08_Mtb   pt08  Mtb        8760 M     pt00008 TRUE   FALSE    8.22e6    24.0
## 17 pt09_Media pt09  Media      6935 M     pt00009 TRUE   FALSE    1.31e7    19.0
## 18 pt09_Mtb   pt09  Mtb        6935 M     pt00009 TRUE   FALSE    1.38e7    19.0
## 19 pt10_Media pt10  Media      8030 F     pt00010 TRUE   FALSE    8.22e6    22.0
## 20 pt10_Mtb   pt10  Mtb        8030 F     pt00010 TRUE   FALSE    7.60e6    22.0
## # … with abbreviated variable names ¹​condition, ²​ptID_old, ³​methylation,
## #   ⁴​total_seq

Exercises: dplyr

  1. Create a new column for total sequences in million total_seq_millions based on total_seq.
  2. What is another way to end up with only the Media rows instead of condition == "Media"?
  3. Try calculating the mean total number of sequences for the Media and Mtb conditions.

Intro tidyr

tidyr is another data manipulation package in tidyverse. It contains functions for tidying data and generally acts on the entire data frame, as opposed to dplyr which most often works on a subset of columns or rows.

pivot

Data comes in many shapes and sizes, and one way we classify data is either “wide” or “long.” Data that is “long” has one row per observation. The metadata is in a long format. We have one row for each patient sample and each variable is in a different column. We might also describe these data as “tidy” because it makes it easy to work with ggplot2 and dplyr functions (this is where the “tidy” in “tidyverse” comes from). As tidy as it may be, sometimes we may want our data in a “wide” format. Typically, in “wide” format, each row represents a group of observations and each value is placed in a different column rather than a different row. For example, maybe we want each patient to have 1 row and their Media/Mtb data to be spread across columns.

The tidyr package in the tidyverse contains the functions pivot_wider and pivot_longer that make it easy to switch between the two formats.

pivot_wider

For each patient, we have two samples: Media and Mtb. In the metadata, the only difference in these two conditions is the libID column (which is redundant with the ptID and condition) and the total_seq column. We can take the condition column and create two new columns, one with the total seqs in the Media sample and one with the total seqs in the Mtb sample. This is called “pivoting” the data frame “wider” because we rearrange it by creating an additional column and making it have fewer rows. Let’s try it!

meta %>% 
  select(-libID) %>% 
  pivot_wider(names_from = condition, values_from = total_seq)
## # A tibble: 10 × 9
##    ptID  age_dys sex   ptID_old RNAseq methylation age_yrs     Media       Mtb
##    <chr>   <dbl> <chr> <chr>    <lgl>  <lgl>         <dbl>     <dbl>     <dbl>
##  1 pt01    12410 M     pt00001  TRUE   FALSE          34.0  9114402.  8918699.
##  2 pt02    12775 M     pt00002  TRUE   FALSE          35.0  9221555.  7733260.
##  3 pt03    11315 M     pt00003  TRUE   FALSE          31.0  6231728.  7105193.
##  4 pt04     8395 M     pt00004  TRUE   TRUE           23.0 10205557.  8413543.
##  5 pt05     7300 M     pt00005  TRUE   FALSE          20.0 15536685. 15509446.
##  6 pt06     6570 F     pt00006  TRUE   FALSE          18.0  7085995.  6588160.
##  7 pt07     7665 F     pt00007  TRUE   FALSE          21.0 10706098.  8576245.
##  8 pt08     8760 M     pt00008  TRUE   FALSE          24.0  9957906.  8220348.
##  9 pt09     6935 M     pt00009  TRUE   FALSE          19.0 13055276  13800442.
## 10 pt10     8030 F     pt00010  TRUE   FALSE          22.0  8216706.  7599609.

Notice here that we tell pivot_wider which columns to pull the names we wish our new columns to be named from the condition variable, and the values to populate those columns from the total_seq variable. (Again, neither of which have to be in quotes in the code when there are no special characters or spaces - certainly an incentive not to use special characters or spaces!) We see that the resulting table has new columns by condition, and the values populate it with our remaining variables dictating the rows.

Maybe we should assign those columns more informative names.

meta %>% 
  select(-libID) %>% 
  pivot_wider(names_from = condition, values_from = total_seq,
              names_prefix = "total_seq_")
## # A tibble: 10 × 9
##    ptID  age_dys sex   ptID_old RNAseq methylation age_yrs total_seq_M…¹ total…²
##    <chr>   <dbl> <chr> <chr>    <lgl>  <lgl>         <dbl>         <dbl>   <dbl>
##  1 pt01    12410 M     pt00001  TRUE   FALSE          34.0      9114402.  8.92e6
##  2 pt02    12775 M     pt00002  TRUE   FALSE          35.0      9221555.  7.73e6
##  3 pt03    11315 M     pt00003  TRUE   FALSE          31.0      6231728.  7.11e6
##  4 pt04     8395 M     pt00004  TRUE   TRUE           23.0     10205557.  8.41e6
##  5 pt05     7300 M     pt00005  TRUE   FALSE          20.0     15536685.  1.55e7
##  6 pt06     6570 F     pt00006  TRUE   FALSE          18.0      7085995.  6.59e6
##  7 pt07     7665 F     pt00007  TRUE   FALSE          21.0     10706098.  8.58e6
##  8 pt08     8760 M     pt00008  TRUE   FALSE          24.0      9957906.  8.22e6
##  9 pt09     6935 M     pt00009  TRUE   FALSE          19.0     13055276   1.38e7
## 10 pt10     8030 F     pt00010  TRUE   FALSE          22.0      8216706.  7.60e6
## # … with abbreviated variable names ¹​total_seq_Media, ²​total_seq_Mtb

And now let’s save the new wider metadata to a new object.

meta_wide <- meta %>% 
  select(-libID) %>% 
  pivot_wider(names_from = condition, values_from = total_seq)

Notice that the number of rows and columns has changed.

dim(meta)
## [1] 20 10
dim(meta_wide)
## [1] 10  9

pivot_longer

Everything we did with pivot_wider, we can reverse with pivot_longer. Maybe you receive a data frame in a wide format, but you need it in a long format. Here’s how we can get back to a long data frame.

meta_wide %>% 
  pivot_longer(c(Media, Mtb), 
               names_to = "condition", values_to = "total_seq")
## # A tibble: 20 × 9
##    ptID  age_dys sex   ptID_old RNAseq methylation age_yrs condition total_seq
##    <chr>   <dbl> <chr> <chr>    <lgl>  <lgl>         <dbl> <chr>         <dbl>
##  1 pt01    12410 M     pt00001  TRUE   FALSE          34.0 Media      9114402.
##  2 pt01    12410 M     pt00001  TRUE   FALSE          34.0 Mtb        8918699.
##  3 pt02    12775 M     pt00002  TRUE   FALSE          35.0 Media      9221555.
##  4 pt02    12775 M     pt00002  TRUE   FALSE          35.0 Mtb        7733260.
##  5 pt03    11315 M     pt00003  TRUE   FALSE          31.0 Media      6231728.
##  6 pt03    11315 M     pt00003  TRUE   FALSE          31.0 Mtb        7105193.
##  7 pt04     8395 M     pt00004  TRUE   TRUE           23.0 Media     10205557.
##  8 pt04     8395 M     pt00004  TRUE   TRUE           23.0 Mtb        8413543.
##  9 pt05     7300 M     pt00005  TRUE   FALSE          20.0 Media     15536685.
## 10 pt05     7300 M     pt00005  TRUE   FALSE          20.0 Mtb       15509446.
## 11 pt06     6570 F     pt00006  TRUE   FALSE          18.0 Media      7085995.
## 12 pt06     6570 F     pt00006  TRUE   FALSE          18.0 Mtb        6588160.
## 13 pt07     7665 F     pt00007  TRUE   FALSE          21.0 Media     10706098.
## 14 pt07     7665 F     pt00007  TRUE   FALSE          21.0 Mtb        8576245.
## 15 pt08     8760 M     pt00008  TRUE   FALSE          24.0 Media      9957906.
## 16 pt08     8760 M     pt00008  TRUE   FALSE          24.0 Mtb        8220348.
## 17 pt09     6935 M     pt00009  TRUE   FALSE          19.0 Media     13055276 
## 18 pt09     6935 M     pt00009  TRUE   FALSE          19.0 Mtb       13800442.
## 19 pt10     8030 F     pt00010  TRUE   FALSE          22.0 Media      8216706.
## 20 pt10     8030 F     pt00010  TRUE   FALSE          22.0 Mtb        7599609.

In the above case, the condition and total_seq names need to be in quotes because they are not yet in our data object. They are instead just characters used to name the new columns. If we forgot the quotes, we’d get an error because R cannot find a condition column in meta_wide as no such column exists.

meta_wide %>% 
  pivot_longer(c(Media, Mtb), 
               names_to = condition, values_to = "total_seq")
## Error in build_longer_spec(data, !!cols, names_to = names_to, values_to = values_to, : object 'condition' not found
colnames(meta_wide)
## [1] "ptID"        "age_dys"     "sex"         "ptID_old"    "RNAseq"     
## [6] "methylation" "age_yrs"     "Media"       "Mtb"

More from dplyr

join

Often not all your data is pre-compiled in one nice, tidy data frame. You may have multiple files, batches, etc to work with. dplyr’s join functions allow us to combine data frames without fear of copy-paste errors or missing anything!

So far, we’ve been working with just the metadata - a single data frame. Let’s adventure into the full RNA-seq dat object so we have multiple data frames to combine.

load("data/dat_voom.RData")

First, we have the normalized log2 counts per million (CPM) in E

dat$E[1:3,1:3]
## Loading required package: limma
##       pt01_Media  pt01_Mtb pt02_Media
## A1BG    1.620022  1.717603   1.086946
## A2M     7.259912  6.680758   5.939909
## A2ML1  -4.052404 -3.515057  -4.015712

Second, we have the same metadata as our meta object in the targets data.

dat$targets[1:3,]
##            group lib.size norm.factors      libID ptID condition age_dys sex
## pt01_Media     1  8295928     1.092872 pt01_Media pt01     Media   12410   M
## pt01_Mtb       1  5716203     0.819490   pt01_Mtb pt01       Mtb   12410   M
## pt02_Media     1  8087601     1.119352 pt02_Media pt02     Media   12775   M
##            ptID_old RNAseq methylation total_seq sample.weights
## pt01_Media  pt00001   TRUE       FALSE   9114402      0.8943797
## pt01_Mtb    pt00001   TRUE       FALSE   8918699      0.9850363
## pt02_Media  pt00002   TRUE       FALSE   9221555      1.2582092

Looking at these two data frames, they do not share any columns - thus there is nothing to join them with. However, notice that the column names in the counts data are the libIDs from the metadata! We can use pivot_longer to get the joining column we need.

Since E is a matrix with rownames, we must force it to be a data frame and move the rownames into a data column with some additional functions.

E_long <- as.data.frame(dat$E) %>% 
  rownames_to_column("gene") %>% 
  pivot_longer(-gene, names_to = "libID", values_to = "log2CPM")

E_long
## # A tibble: 268,380 × 3
##    gene  libID      log2CPM
##    <chr> <chr>        <dbl>
##  1 A1BG  pt01_Media  1.62  
##  2 A1BG  pt01_Mtb    1.72  
##  3 A1BG  pt02_Media  1.09  
##  4 A1BG  pt02_Mtb    1.10  
##  5 A1BG  pt03_Media  0.874 
##  6 A1BG  pt03_Mtb    1.71  
##  7 A1BG  pt04_Media  1.57  
##  8 A1BG  pt04_Mtb   -0.264 
##  9 A1BG  pt05_Media  1.63  
## 10 A1BG  pt05_Mtb   -0.0305
## # … with 268,370 more rows

Now both data frames have a libID column!

Let’s combine our sequence counts with our metadata. To combine two data frames, we will use a join function. The dplyr package has a number of tools for joining data frames together depending on what we want to do with the rows of the data of countries that are not represented in both data frames.

?join

The most common joins are:

  • inner_join: includes all rows present in BOTH data frames
  • left_join: includes all rows in the first/left data frame. Remove rows ONLY present in the second/right data frame
  • right_join: includes all rows in the second/right data frame. Remove rows ONLY present in the first/left data frame
  • full_join: includes all rows present in EITHER data frame. Fills in rows only present in one data frame with NA in the other

Here, we use inner_join to keep only libraries with both expression data and metadata, which is actually all of them.

full_data <- inner_join(meta, E_long, by = 'libID')

I find this page helpful when working with joins. https://stat545.com/join-cheatsheet.html

Exercises: pivot and join

  1. Filter the metadata to just Media samples. Without running it, outline what the data would look like if you inner_join the long expression data E_long with metadata containing only Media samples.
  2. Similarly, what would happen with full_join?