In this workshop, we introduce you to the R tidyverse, a suite of packages for data manipulation and visualization in R. In it, we cover how to:
During the workshop, we will build an R script together, which will be posted as ‘live_notes’ after the workshop at https://github.com/BIGslu/workshops/tree/main/2022.08.15_R.tidyverse.workshop/live_notes
The tidyverse is not just one package, but a collection of several packages that are designed to work well together for data analysis tasks. Let’s load the tidyverse. For more information on installation, see the setup instructions.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
We also load the metadata, this time using the tidyverse
function read_csv
. This function is very similar to base R
read.csv
, except that it automatically tells you the
structure of the loaded data, performs better when determining different
classes of data within a data frame, and save the object as a
tibble.
meta <- read_csv("data/metadata.csv")
## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
class(meta)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
A tibble is just a data frame with some extra hidden formatting. You
will not need to interface with these hidden features and all the
following functions would behave the same if you’d read in a simple
data.frame
with read.csv
.
Take a look at the dataset with View()
:
View(meta)
Here is a brief description of the variables in these data.
libID
(character): Unique library ID for each
sequencing library. Formatted as ptID_conditionptID
(character): Patient ID. Each patient has 2
libraries which differ in their condition
condition
(character): Experimental condition. Media
for media control. Mtb of M. tuberculosis infected.age_dys
(numeric): Age in agessex
(character): Biological sex, M or FptID_old
(character): Old patient ID with leading
zeroesRNAseq
(logical): If the library has pass-filter
RNA-seq datamethylation
(logical): If the library has pass-filter
methylation datatotal_seq
(numeric): Total number of pass-filter
sequencesThe first tidyverse
package we will use is
dplyr
, which is for data manipulation. You do not need to
load this package with library
as it was already loaded as
part of the tidyverse
.
You can use the select
function to keep only the columns
that you want and remove columns that you don’t need. The first argument
to the select
function is our data frame, and the second
argument is the name of the column we want to keep.
select(meta, ptID)
## # A tibble: 20 × 1
## ptID
## <chr>
## 1 pt01
## 2 pt01
## 3 pt02
## 4 pt02
## 5 pt03
## 6 pt03
## 7 pt04
## 8 pt04
## 9 pt05
## 10 pt05
## 11 pt06
## 12 pt06
## 13 pt07
## 14 pt07
## 15 pt08
## 16 pt08
## 17 pt09
## 18 pt09
## 19 pt10
## 20 pt10
We can continue to list multiple columns to keep by separating with commas.
select(meta, ptID, total_seq)
## # A tibble: 20 × 2
## ptID total_seq
## <chr> <dbl>
## 1 pt01 9114402.
## 2 pt01 8918699.
## 3 pt02 9221555.
## 4 pt02 7733260.
## 5 pt03 6231728.
## 6 pt03 7105193.
## 7 pt04 10205557.
## 8 pt04 8413543.
## 9 pt05 15536685.
## 10 pt05 15509446.
## 11 pt06 7085995.
## 12 pt06 6588160.
## 13 pt07 10706098.
## 14 pt07 8576245.
## 15 pt08 9957906.
## 16 pt08 8220348.
## 17 pt09 13055276
## 18 pt09 13800442.
## 19 pt10 8216706.
## 20 pt10 7599609.
If you want to keep all but one column, put the minus
(-
) sign in front of it. I like to think of it as
“subtracting” the column from the data frame.
select(meta, -ptID_old)
## # A tibble: 20 × 8
## libID ptID condition age_dys sex RNAseq methylation total_seq
## <chr> <chr> <chr> <dbl> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402.
## 2 pt01_Mtb pt01 Mtb 12410 M TRUE FALSE 8918699.
## 3 pt02_Media pt02 Media 12775 M TRUE FALSE 9221555.
## 4 pt02_Mtb pt02 Mtb 12775 M TRUE FALSE 7733260.
## 5 pt03_Media pt03 Media 11315 M TRUE FALSE 6231728.
## 6 pt03_Mtb pt03 Mtb 11315 M TRUE FALSE 7105193.
## 7 pt04_Media pt04 Media 8395 M TRUE TRUE 10205557.
## 8 pt04_Mtb pt04 Mtb 8395 M TRUE TRUE 8413543.
## 9 pt05_Media pt05 Media 7300 M TRUE FALSE 15536685.
## 10 pt05_Mtb pt05 Mtb 7300 M TRUE FALSE 15509446.
## 11 pt06_Media pt06 Media 6570 F TRUE FALSE 7085995.
## 12 pt06_Mtb pt06 Mtb 6570 F TRUE FALSE 6588160.
## 13 pt07_Media pt07 Media 7665 F TRUE FALSE 10706098.
## 14 pt07_Mtb pt07 Mtb 7665 F TRUE FALSE 8576245.
## 15 pt08_Media pt08 Media 8760 M TRUE FALSE 9957906.
## 16 pt08_Mtb pt08 Mtb 8760 M TRUE FALSE 8220348.
## 17 pt09_Media pt09 Media 6935 M TRUE FALSE 13055276
## 18 pt09_Mtb pt09 Mtb 6935 M TRUE FALSE 13800442.
## 19 pt10_Media pt10 Media 8030 F TRUE FALSE 8216706.
## 20 pt10_Mtb pt10 Mtb 8030 F TRUE FALSE 7599609.
You can use filter
to remove/keep rows. Instead of
specifying a column name as in select
, you provide a
statement that gives TRUE/FALSE and filter
keeps all the
rows for which that statement is TRUE. The conditional
operators we saw in Intro R will help with this!
Let’s filter to just the Media samples.
filter(meta, condition == "Media")
## # A tibble: 10 × 9
## libID ptID condition age_dys sex ptID_old RNAseq methylation total_…¹
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Media pt01 Media 12410 M pt00001 TRUE FALSE 9.11e6
## 2 pt02_Media pt02 Media 12775 M pt00002 TRUE FALSE 9.22e6
## 3 pt03_Media pt03 Media 11315 M pt00003 TRUE FALSE 6.23e6
## 4 pt04_Media pt04 Media 8395 M pt00004 TRUE TRUE 1.02e7
## 5 pt05_Media pt05 Media 7300 M pt00005 TRUE FALSE 1.55e7
## 6 pt06_Media pt06 Media 6570 F pt00006 TRUE FALSE 7.09e6
## 7 pt07_Media pt07 Media 7665 F pt00007 TRUE FALSE 1.07e7
## 8 pt08_Media pt08 Media 8760 M pt00008 TRUE FALSE 9.96e6
## 9 pt09_Media pt09 Media 6935 M pt00009 TRUE FALSE 1.31e7
## 10 pt10_Media pt10 Media 8030 F pt00010 TRUE FALSE 8.22e6
## # … with abbreviated variable name ¹total_seq
This is the same as what we did in Intro R with the following, only
it is more readable and will fit into a tidyverse
workflow.
More on this when we get to pipes!
meta[meta$condition == "Media", ]
## # A tibble: 10 × 9
## libID ptID condition age_dys sex ptID_old RNAseq methylation total_…¹
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Media pt01 Media 12410 M pt00001 TRUE FALSE 9.11e6
## 2 pt02_Media pt02 Media 12775 M pt00002 TRUE FALSE 9.22e6
## 3 pt03_Media pt03 Media 11315 M pt00003 TRUE FALSE 6.23e6
## 4 pt04_Media pt04 Media 8395 M pt00004 TRUE TRUE 1.02e7
## 5 pt05_Media pt05 Media 7300 M pt00005 TRUE FALSE 1.55e7
## 6 pt06_Media pt06 Media 6570 F pt00006 TRUE FALSE 7.09e6
## 7 pt07_Media pt07 Media 7665 F pt00007 TRUE FALSE 1.07e7
## 8 pt08_Media pt08 Media 8760 M pt00008 TRUE FALSE 9.96e6
## 9 pt09_Media pt09 Media 6935 M pt00009 TRUE FALSE 1.31e7
## 10 pt10_Media pt10 Media 8030 F pt00010 TRUE FALSE 8.22e6
## # … with abbreviated variable name ¹total_seq
You can string together multiple statements and ask if both/all are
TRUE with and &
or if at least one is TRUE with or
|
. Notice that the &
statement removes all
rows because no library has both a Media and Mtb condition. In contrast,
the |
statement keeps all rows because all libraries are
either the Media or Mtb condition.
filter(meta, condition == "Media" & condition == "Mtb")
## # A tibble: 0 × 9
## # … with 9 variables: libID <chr>, ptID <chr>, condition <chr>, age_dys <dbl>,
## # sex <chr>, ptID_old <chr>, RNAseq <lgl>, methylation <lgl>, total_seq <dbl>
filter(meta, condition == "Media" | condition == "Mtb")
## # A tibble: 20 × 9
## libID ptID condition age_dys sex ptID_old RNAseq methylation total_…¹
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Media pt01 Media 12410 M pt00001 TRUE FALSE 9.11e6
## 2 pt01_Mtb pt01 Mtb 12410 M pt00001 TRUE FALSE 8.92e6
## 3 pt02_Media pt02 Media 12775 M pt00002 TRUE FALSE 9.22e6
## 4 pt02_Mtb pt02 Mtb 12775 M pt00002 TRUE FALSE 7.73e6
## 5 pt03_Media pt03 Media 11315 M pt00003 TRUE FALSE 6.23e6
## 6 pt03_Mtb pt03 Mtb 11315 M pt00003 TRUE FALSE 7.11e6
## 7 pt04_Media pt04 Media 8395 M pt00004 TRUE TRUE 1.02e7
## 8 pt04_Mtb pt04 Mtb 8395 M pt00004 TRUE TRUE 8.41e6
## 9 pt05_Media pt05 Media 7300 M pt00005 TRUE FALSE 1.55e7
## 10 pt05_Mtb pt05 Mtb 7300 M pt00005 TRUE FALSE 1.55e7
## 11 pt06_Media pt06 Media 6570 F pt00006 TRUE FALSE 7.09e6
## 12 pt06_Mtb pt06 Mtb 6570 F pt00006 TRUE FALSE 6.59e6
## 13 pt07_Media pt07 Media 7665 F pt00007 TRUE FALSE 1.07e7
## 14 pt07_Mtb pt07 Mtb 7665 F pt00007 TRUE FALSE 8.58e6
## 15 pt08_Media pt08 Media 8760 M pt00008 TRUE FALSE 9.96e6
## 16 pt08_Mtb pt08 Mtb 8760 M pt00008 TRUE FALSE 8.22e6
## 17 pt09_Media pt09 Media 6935 M pt00009 TRUE FALSE 1.31e7
## 18 pt09_Mtb pt09 Mtb 6935 M pt00009 TRUE FALSE 1.38e7
## 19 pt10_Media pt10 Media 8030 F pt00010 TRUE FALSE 8.22e6
## 20 pt10_Mtb pt10 Mtb 8030 F pt00010 TRUE FALSE 7.60e6
## # … with abbreviated variable name ¹total_seq
Notice that we put the word “Media” inside quotes. This is because we are not using a column from inside our data frame. When you need to include actual text values in R, they will be placed inside quotes to tell them apart from other object or variable names.
The general rule is that if you want to use an R object (data frame, column, etc), then you supply the name without quotes. If you want to specify a value within the R object, then use quotes.
Sometimes your variable names are not ideal. rename
does
just want it says; it renames columns in your data. The syntax for
rename
is to give the new name first and then tell R what
that new name should replace, so new_name = old_name
.
Here, we rename condition
to infection
as this
better represents those data.
rename(meta, infection = condition)
## # A tibble: 20 × 9
## libID ptID infection age_dys sex ptID_old RNAseq methylation total_…¹
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Media pt01 Media 12410 M pt00001 TRUE FALSE 9.11e6
## 2 pt01_Mtb pt01 Mtb 12410 M pt00001 TRUE FALSE 8.92e6
## 3 pt02_Media pt02 Media 12775 M pt00002 TRUE FALSE 9.22e6
## 4 pt02_Mtb pt02 Mtb 12775 M pt00002 TRUE FALSE 7.73e6
## 5 pt03_Media pt03 Media 11315 M pt00003 TRUE FALSE 6.23e6
## 6 pt03_Mtb pt03 Mtb 11315 M pt00003 TRUE FALSE 7.11e6
## 7 pt04_Media pt04 Media 8395 M pt00004 TRUE TRUE 1.02e7
## 8 pt04_Mtb pt04 Mtb 8395 M pt00004 TRUE TRUE 8.41e6
## 9 pt05_Media pt05 Media 7300 M pt00005 TRUE FALSE 1.55e7
## 10 pt05_Mtb pt05 Mtb 7300 M pt00005 TRUE FALSE 1.55e7
## 11 pt06_Media pt06 Media 6570 F pt00006 TRUE FALSE 7.09e6
## 12 pt06_Mtb pt06 Mtb 6570 F pt00006 TRUE FALSE 6.59e6
## 13 pt07_Media pt07 Media 7665 F pt00007 TRUE FALSE 1.07e7
## 14 pt07_Mtb pt07 Mtb 7665 F pt00007 TRUE FALSE 8.58e6
## 15 pt08_Media pt08 Media 8760 M pt00008 TRUE FALSE 9.96e6
## 16 pt08_Mtb pt08 Mtb 8760 M pt00008 TRUE FALSE 8.22e6
## 17 pt09_Media pt09 Media 6935 M pt00009 TRUE FALSE 1.31e7
## 18 pt09_Mtb pt09 Mtb 6935 M pt00009 TRUE FALSE 1.38e7
## 19 pt10_Media pt10 Media 8030 F pt00010 TRUE FALSE 8.22e6
## 20 pt10_Mtb pt10 Mtb 8030 F pt00010 TRUE FALSE 7.60e6
## # … with abbreviated variable name ¹total_seq
The function summarize()
performs summary statistics on
your data. We can use it to find the mean, median, or other statistics
about the dataset. Let’s find the average age in days below:
summarize(meta, mean(age_dys))
## # A tibble: 1 × 1
## `mean(age_dys)`
## <dbl>
## 1 9016.
Notice that summarize
automatically names the column
with the summary statistic based on the function used to calculate that
statistic. This is not ideal as the formatting has parentheses. So we
can specify a name instead.
summarize(meta, mean_age_dys = mean(age_dys))
## # A tibble: 1 × 1
## mean_age_dys
## <dbl>
## 1 9016.
Instead of including the data as an argument, we can use the pipe
operator %>%
to pass the data value into the
summarize
function.
meta %>% summarize(mean_age_dys = mean(age_dys))
## # A tibble: 1 × 1
## mean_age_dys
## <dbl>
## 1 9016.
This line of code will do the exact same thing as our previous
summary command, but the piping function tells R to use the
meta
data frame as the first argument in the next function.
This lets us “chain” together multiple functions into a pipeline. The
pipe takes the output from the left side and use it as input to the
right side.
We can also add an
meta %>%
summarize(mean_age_days = mean(age_dys))
## # A tibble: 1 × 1
## mean_age_days
## <dbl>
## 1 9016.
Using the pipe operator %>%
and enter command makes
our code more readable. The pipe operator %>%
also helps
to avoid using nested function like f(g(x))
and minimizes
the need for new variables.
Since we use the pipe operator so often, there is a keyboard shortcut
for it in RStudio. You can press
The pipe comes in handy with functions like group_by
. If
we want to calculate the mean age in days for males and females
separately, we can use group_by
before summarize:
meta %>%
group_by(sex) %>%
summarize(mean_age_days = mean(age_dys))
## # A tibble: 2 × 2
## sex mean_age_days
## <chr> <dbl>
## 1 F 7422.
## 2 M 9699.
And we could merge this with a filter to only look at the Media libraries, since each age is repeated for that patient’s Media and Mtb libraries. The results are the same in this case but filtering would be important if any of our patient’s were missing one of the conditions.
meta %>%
filter(condition == "Media") %>%
group_by(sex) %>%
summarize(mean_age_days = mean(age_dys))
## # A tibble: 2 × 2
## sex mean_age_days
## <chr> <dbl>
## 1 F 7422.
## 2 M 9699.
Sometimes we would like to create a new column based on the values in an existing one. For example, age is days is a little difficult to understand. Let’s add a column with age in years.
meta %>%
mutate(age_yrs = age_dys / 365.25)
## # A tibble: 20 × 10
## libID ptID condi…¹ age_dys sex ptID_…² RNAseq methy…³ total…⁴ age_yrs
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl> <dbl>
## 1 pt01_Media pt01 Media 12410 M pt00001 TRUE FALSE 9.11e6 34.0
## 2 pt01_Mtb pt01 Mtb 12410 M pt00001 TRUE FALSE 8.92e6 34.0
## 3 pt02_Media pt02 Media 12775 M pt00002 TRUE FALSE 9.22e6 35.0
## 4 pt02_Mtb pt02 Mtb 12775 M pt00002 TRUE FALSE 7.73e6 35.0
## 5 pt03_Media pt03 Media 11315 M pt00003 TRUE FALSE 6.23e6 31.0
## 6 pt03_Mtb pt03 Mtb 11315 M pt00003 TRUE FALSE 7.11e6 31.0
## 7 pt04_Media pt04 Media 8395 M pt00004 TRUE TRUE 1.02e7 23.0
## 8 pt04_Mtb pt04 Mtb 8395 M pt00004 TRUE TRUE 8.41e6 23.0
## 9 pt05_Media pt05 Media 7300 M pt00005 TRUE FALSE 1.55e7 20.0
## 10 pt05_Mtb pt05 Mtb 7300 M pt00005 TRUE FALSE 1.55e7 20.0
## 11 pt06_Media pt06 Media 6570 F pt00006 TRUE FALSE 7.09e6 18.0
## 12 pt06_Mtb pt06 Mtb 6570 F pt00006 TRUE FALSE 6.59e6 18.0
## 13 pt07_Media pt07 Media 7665 F pt00007 TRUE FALSE 1.07e7 21.0
## 14 pt07_Mtb pt07 Mtb 7665 F pt00007 TRUE FALSE 8.58e6 21.0
## 15 pt08_Media pt08 Media 8760 M pt00008 TRUE FALSE 9.96e6 24.0
## 16 pt08_Mtb pt08 Mtb 8760 M pt00008 TRUE FALSE 8.22e6 24.0
## 17 pt09_Media pt09 Media 6935 M pt00009 TRUE FALSE 1.31e7 19.0
## 18 pt09_Mtb pt09 Mtb 6935 M pt00009 TRUE FALSE 1.38e7 19.0
## 19 pt10_Media pt10 Media 8030 F pt00010 TRUE FALSE 8.22e6 22.0
## 20 pt10_Mtb pt10 Mtb 8030 F pt00010 TRUE FALSE 7.60e6 22.0
## # … with abbreviated variable names ¹condition, ²ptID_old, ³methylation,
## # ⁴total_seq
Note that this new column is not saved in meta because we printed the
result to the console and did not save it to the environment. We can
overwrite the meta
data frame so it will now contains the
age_yrs
column.
meta <- meta %>%
mutate(age_yrs = age_dys / 365.25)
meta
## # A tibble: 20 × 10
## libID ptID condi…¹ age_dys sex ptID_…² RNAseq methy…³ total…⁴ age_yrs
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl> <dbl>
## 1 pt01_Media pt01 Media 12410 M pt00001 TRUE FALSE 9.11e6 34.0
## 2 pt01_Mtb pt01 Mtb 12410 M pt00001 TRUE FALSE 8.92e6 34.0
## 3 pt02_Media pt02 Media 12775 M pt00002 TRUE FALSE 9.22e6 35.0
## 4 pt02_Mtb pt02 Mtb 12775 M pt00002 TRUE FALSE 7.73e6 35.0
## 5 pt03_Media pt03 Media 11315 M pt00003 TRUE FALSE 6.23e6 31.0
## 6 pt03_Mtb pt03 Mtb 11315 M pt00003 TRUE FALSE 7.11e6 31.0
## 7 pt04_Media pt04 Media 8395 M pt00004 TRUE TRUE 1.02e7 23.0
## 8 pt04_Mtb pt04 Mtb 8395 M pt00004 TRUE TRUE 8.41e6 23.0
## 9 pt05_Media pt05 Media 7300 M pt00005 TRUE FALSE 1.55e7 20.0
## 10 pt05_Mtb pt05 Mtb 7300 M pt00005 TRUE FALSE 1.55e7 20.0
## 11 pt06_Media pt06 Media 6570 F pt00006 TRUE FALSE 7.09e6 18.0
## 12 pt06_Mtb pt06 Mtb 6570 F pt00006 TRUE FALSE 6.59e6 18.0
## 13 pt07_Media pt07 Media 7665 F pt00007 TRUE FALSE 1.07e7 21.0
## 14 pt07_Mtb pt07 Mtb 7665 F pt00007 TRUE FALSE 8.58e6 21.0
## 15 pt08_Media pt08 Media 8760 M pt00008 TRUE FALSE 9.96e6 24.0
## 16 pt08_Mtb pt08 Mtb 8760 M pt00008 TRUE FALSE 8.22e6 24.0
## 17 pt09_Media pt09 Media 6935 M pt00009 TRUE FALSE 1.31e7 19.0
## 18 pt09_Mtb pt09 Mtb 6935 M pt00009 TRUE FALSE 1.38e7 19.0
## 19 pt10_Media pt10 Media 8030 F pt00010 TRUE FALSE 8.22e6 22.0
## 20 pt10_Mtb pt10 Mtb 8030 F pt00010 TRUE FALSE 7.60e6 22.0
## # … with abbreviated variable names ¹condition, ²ptID_old, ³methylation,
## # ⁴total_seq
total_seq_millions
based on total_seq
.condition == "Media"
?tidyr
is another data manipulation package in
tidyverse
. It contains functions for tidying data and
generally acts on the entire data frame, as opposed to
dplyr
which most often works on a subset of columns or
rows.
Data comes in many shapes and sizes, and one way we classify data is
either “wide” or “long.” Data that is “long” has one row per
observation. The metadata is in a long format. We have one row for each
patient sample and each variable is in a different column. We might also
describe these data as “tidy” because it makes it easy to work with
ggplot2
and dplyr
functions (this is where the
“tidy” in “tidyverse” comes from). As tidy as it may be, sometimes we
may want our data in a “wide” format. Typically, in “wide” format, each
row represents a group of observations and each value is placed in a
different column rather than a different row. For example, maybe we want
each patient to have 1 row and their Media/Mtb data to be spread across
columns.
The tidyr
package in the tidyverse contains the
functions pivot_wider
and pivot_longer
that
make it easy to switch between the two formats.
For each patient, we have two samples: Media
and
Mtb
. In the metadata, the only difference in these two
conditions is the libID
column (which is redundant with the
ptID
and condition
) and the
total_seq
column. We can take the condition column and
create two new columns, one with the total seqs in the
Media
sample and one with the total seqs in the
Mtb
sample. This is called “pivoting” the data frame
“wider” because we rearrange it by creating an additional column and
making it have fewer rows. Let’s try it!
meta %>%
select(-libID) %>%
pivot_wider(names_from = condition, values_from = total_seq)
## # A tibble: 10 × 9
## ptID age_dys sex ptID_old RNAseq methylation age_yrs Media Mtb
## <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl> <dbl> <dbl>
## 1 pt01 12410 M pt00001 TRUE FALSE 34.0 9114402. 8918699.
## 2 pt02 12775 M pt00002 TRUE FALSE 35.0 9221555. 7733260.
## 3 pt03 11315 M pt00003 TRUE FALSE 31.0 6231728. 7105193.
## 4 pt04 8395 M pt00004 TRUE TRUE 23.0 10205557. 8413543.
## 5 pt05 7300 M pt00005 TRUE FALSE 20.0 15536685. 15509446.
## 6 pt06 6570 F pt00006 TRUE FALSE 18.0 7085995. 6588160.
## 7 pt07 7665 F pt00007 TRUE FALSE 21.0 10706098. 8576245.
## 8 pt08 8760 M pt00008 TRUE FALSE 24.0 9957906. 8220348.
## 9 pt09 6935 M pt00009 TRUE FALSE 19.0 13055276 13800442.
## 10 pt10 8030 F pt00010 TRUE FALSE 22.0 8216706. 7599609.
Notice here that we tell pivot_wider
which columns to
pull the names we wish our new columns to be named from the
condition
variable, and the values to populate those
columns from the total_seq
variable. (Again, neither of
which have to be in quotes in the code when there are no special
characters or spaces - certainly an incentive not to use special
characters or spaces!) We see that the resulting table has new columns
by condition, and the values populate it with our remaining variables
dictating the rows.
Maybe we should assign those columns more informative names.
meta %>%
select(-libID) %>%
pivot_wider(names_from = condition, values_from = total_seq,
names_prefix = "total_seq_")
## # A tibble: 10 × 9
## ptID age_dys sex ptID_old RNAseq methylation age_yrs total_seq_M…¹ total…²
## <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl> <dbl> <dbl>
## 1 pt01 12410 M pt00001 TRUE FALSE 34.0 9114402. 8.92e6
## 2 pt02 12775 M pt00002 TRUE FALSE 35.0 9221555. 7.73e6
## 3 pt03 11315 M pt00003 TRUE FALSE 31.0 6231728. 7.11e6
## 4 pt04 8395 M pt00004 TRUE TRUE 23.0 10205557. 8.41e6
## 5 pt05 7300 M pt00005 TRUE FALSE 20.0 15536685. 1.55e7
## 6 pt06 6570 F pt00006 TRUE FALSE 18.0 7085995. 6.59e6
## 7 pt07 7665 F pt00007 TRUE FALSE 21.0 10706098. 8.58e6
## 8 pt08 8760 M pt00008 TRUE FALSE 24.0 9957906. 8.22e6
## 9 pt09 6935 M pt00009 TRUE FALSE 19.0 13055276 1.38e7
## 10 pt10 8030 F pt00010 TRUE FALSE 22.0 8216706. 7.60e6
## # … with abbreviated variable names ¹total_seq_Media, ²total_seq_Mtb
And now let’s save the new wider metadata to a new object.
meta_wide <- meta %>%
select(-libID) %>%
pivot_wider(names_from = condition, values_from = total_seq)
Notice that the number of rows and columns has changed.
dim(meta)
## [1] 20 10
dim(meta_wide)
## [1] 10 9
Everything we did with pivot_wider
, we can reverse with
pivot_longer
. Maybe you receive a data frame in a wide
format, but you need it in a long format. Here’s how we can get back to
a long data frame.
meta_wide %>%
pivot_longer(c(Media, Mtb),
names_to = "condition", values_to = "total_seq")
## # A tibble: 20 × 9
## ptID age_dys sex ptID_old RNAseq methylation age_yrs condition total_seq
## <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl> <chr> <dbl>
## 1 pt01 12410 M pt00001 TRUE FALSE 34.0 Media 9114402.
## 2 pt01 12410 M pt00001 TRUE FALSE 34.0 Mtb 8918699.
## 3 pt02 12775 M pt00002 TRUE FALSE 35.0 Media 9221555.
## 4 pt02 12775 M pt00002 TRUE FALSE 35.0 Mtb 7733260.
## 5 pt03 11315 M pt00003 TRUE FALSE 31.0 Media 6231728.
## 6 pt03 11315 M pt00003 TRUE FALSE 31.0 Mtb 7105193.
## 7 pt04 8395 M pt00004 TRUE TRUE 23.0 Media 10205557.
## 8 pt04 8395 M pt00004 TRUE TRUE 23.0 Mtb 8413543.
## 9 pt05 7300 M pt00005 TRUE FALSE 20.0 Media 15536685.
## 10 pt05 7300 M pt00005 TRUE FALSE 20.0 Mtb 15509446.
## 11 pt06 6570 F pt00006 TRUE FALSE 18.0 Media 7085995.
## 12 pt06 6570 F pt00006 TRUE FALSE 18.0 Mtb 6588160.
## 13 pt07 7665 F pt00007 TRUE FALSE 21.0 Media 10706098.
## 14 pt07 7665 F pt00007 TRUE FALSE 21.0 Mtb 8576245.
## 15 pt08 8760 M pt00008 TRUE FALSE 24.0 Media 9957906.
## 16 pt08 8760 M pt00008 TRUE FALSE 24.0 Mtb 8220348.
## 17 pt09 6935 M pt00009 TRUE FALSE 19.0 Media 13055276
## 18 pt09 6935 M pt00009 TRUE FALSE 19.0 Mtb 13800442.
## 19 pt10 8030 F pt00010 TRUE FALSE 22.0 Media 8216706.
## 20 pt10 8030 F pt00010 TRUE FALSE 22.0 Mtb 7599609.
In the above case, the condition and total_seq names need to be in
quotes because they are not yet in our data object. They are instead
just characters used to name the new columns. If we forgot the quotes,
we’d get an error because R cannot find a condition
column
in meta_wide
as no such column exists.
meta_wide %>%
pivot_longer(c(Media, Mtb),
names_to = condition, values_to = "total_seq")
## Error in build_longer_spec(data, !!cols, names_to = names_to, values_to = values_to, : object 'condition' not found
colnames(meta_wide)
## [1] "ptID" "age_dys" "sex" "ptID_old" "RNAseq"
## [6] "methylation" "age_yrs" "Media" "Mtb"
Often not all your data is pre-compiled in one nice, tidy data frame.
You may have multiple files, batches, etc to work with.
dplyr
’s join functions allow us to combine data frames
without fear of copy-paste errors or missing anything!
So far, we’ve been working with just the metadata - a single data
frame. Let’s adventure into the full RNA-seq dat
object so
we have multiple data frames to combine.
load("data/dat_voom.RData")
First, we have the normalized log2 counts per million (CPM) in
E
dat$E[1:3,1:3]
## Loading required package: limma
## pt01_Media pt01_Mtb pt02_Media
## A1BG 1.620022 1.717603 1.086946
## A2M 7.259912 6.680758 5.939909
## A2ML1 -4.052404 -3.515057 -4.015712
Second, we have the same metadata as our meta
object in
the targets
data.
dat$targets[1:3,]
## group lib.size norm.factors libID ptID condition age_dys sex
## pt01_Media 1 8295928 1.092872 pt01_Media pt01 Media 12410 M
## pt01_Mtb 1 5716203 0.819490 pt01_Mtb pt01 Mtb 12410 M
## pt02_Media 1 8087601 1.119352 pt02_Media pt02 Media 12775 M
## ptID_old RNAseq methylation total_seq sample.weights
## pt01_Media pt00001 TRUE FALSE 9114402 0.8943797
## pt01_Mtb pt00001 TRUE FALSE 8918699 0.9850363
## pt02_Media pt00002 TRUE FALSE 9221555 1.2582092
Looking at these two data frames, they do not share any columns -
thus there is nothing to join them with. However, notice that the column
names in the counts data are the libIDs from the metadata! We can use
pivot_longer
to get the joining column we need.
Since E
is a matrix with rownames, we must force it to
be a data frame and move the rownames into a data column with some
additional functions.
E_long <- as.data.frame(dat$E) %>%
rownames_to_column("gene") %>%
pivot_longer(-gene, names_to = "libID", values_to = "log2CPM")
E_long
## # A tibble: 268,380 × 3
## gene libID log2CPM
## <chr> <chr> <dbl>
## 1 A1BG pt01_Media 1.62
## 2 A1BG pt01_Mtb 1.72
## 3 A1BG pt02_Media 1.09
## 4 A1BG pt02_Mtb 1.10
## 5 A1BG pt03_Media 0.874
## 6 A1BG pt03_Mtb 1.71
## 7 A1BG pt04_Media 1.57
## 8 A1BG pt04_Mtb -0.264
## 9 A1BG pt05_Media 1.63
## 10 A1BG pt05_Mtb -0.0305
## # … with 268,370 more rows
Now both data frames have a libID
column!
Let’s combine our sequence counts with our metadata. To combine two
data frames, we will use a join
function. The dplyr package
has a number of tools for joining data frames together depending on what
we want to do with the rows of the data of countries that are not
represented in both data frames.
?join
The most common joins are:
inner_join
: includes all rows present in BOTH data
framesleft_join
: includes all rows in the first/left data
frame. Remove rows ONLY present in the second/right data frameright_join
: includes all rows in the second/right data
frame. Remove rows ONLY present in the first/left data framefull_join
: includes all rows present in EITHER data
frame. Fills in rows only present in one data frame with NA in the
otherHere, we use inner_join
to keep only libraries with both
expression data and metadata, which is actually all of them.
full_data <- inner_join(meta, E_long, by = 'libID')
I find this page helpful when working with joins. https://stat545.com/join-cheatsheet.html
inner_join
the long expression data E_long
with metadata containing only Media samples.full_join
?