The tidyverse is not just one package, but a collection of several packages that are designed to work well together for data analysis tasks. Let’s load the tidyverse and read in the metadata again.
For more information on installation, see the setup instructions.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.8
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
meta <- read_csv("0_data/metadata.csv")
## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Take a look at the dataset with View()
:
View(meta)
Let’s start with a useful function called summarize()
that will let us summarize variables in our data frame. We can use it to find the mean, median, or other statistics about the dataset. Let’s find the average age in days below:
summarize(meta, mean_age_days = mean(age_dys))
## # A tibble: 1 × 1
## mean_age_days
## <dbl>
## 1 9016.
The first argument to the summarize function is our data frame, and the second argument is a new variable that we want summarize to create from the age_dys
column.
Instead of including the data as an argument, we can use the pipe operator %>%
to pass the data value into the summarize
function.
meta %>% summarize(mean_age_days = mean(age_dys))
## # A tibble: 1 × 1
## mean_age_days
## <dbl>
## 1 9016.
This line of code will do the exact same thing as our first summary command, but the piping function tells R to use the meta
data frame as the first argument in the next function.
This lets us “chain” together multiple functions, which will be helpful later. Note that the pipe (%>%
) is a bit different from using the ggplot plus (+
). Pipes take the output from the left side and use it as input to the right side. Plusses layer on additional information (right side) to a preexisting plot (left side).
We can also add an
meta %>%
summarize(mean_age_days = mean(age_dys))
## # A tibble: 1 × 1
## mean_age_days
## <dbl>
## 1 9016.
Using the pipe operator %>%
and enter command makes our code more readable. The pipe operator %>%
also helps to avoid using nested function and minimizes the need for new variables.
Since we use the pipe operator so often, there is a keyboard shortcut for it in RStudio. You can press
Tip: Saving a new dataframe
Notice that when we run the following code, we are not actually saving a new variable:
meta %>% summarize(mean_age_days=mean(age_dys))
This simply outputs what we have created, but does not change actually change
meta
or save a new dataframe. To save a new dataframe, we could run:meta_summarized <- meta %>% summarize(mean_age_days=mean(age_dys))
Or if we wanted to change
meta
itself:meta <- meta %>% summarize(mean_age_days=mean(age_dys))
IMPORTANT: This would overwrite the existing
meta
object.For now, we will not be saving dataframes, since we are just experimenting with
dplyr
functions, but it will be useful later on in this lesson.
If we want to calculate the mean age in days for males and females separately, we can use group_by
before summarize:
meta %>%
group_by(sex) %>%
summarize(mean_age_days = mean(age_dys))
## # A tibble: 2 × 2
## sex mean_age_days
## <chr> <dbl>
## 1 F 7422.
## 2 M 9699.
Exercise: group_by practice
Try calculating the mean total number of sequences for the Media and Mtb conditions. You’ll need to use group_by, summarize, and the mean function
meta %>% group_by(condition) %>% summarize(mean_total_seq = mean(total_seq))
## # A tibble: 2 × 2 ## condition mean_total_seq ## <chr> <dbl> ## 1 Media 9933191. ## 2 Mtb 9246495.
You can use the select
function to keep only the columns that you want and remove columns that you don’t need.
meta %>%
select(ptID, age_dys)
## # A tibble: 20 × 2
## ptID age_dys
## <chr> <dbl>
## 1 pt01 12410
## 2 pt01 12410
## 3 pt02 12775
## 4 pt02 12775
## 5 pt03 11315
## 6 pt03 11315
## 7 pt04 8395
## 8 pt04 8395
## 9 pt05 7300
## 10 pt05 7300
## 11 pt06 6570
## 12 pt06 6570
## 13 pt07 7665
## 14 pt07 7665
## 15 pt08 8760
## 16 pt08 8760
## 17 pt09 6935
## 18 pt09 6935
## 19 pt10 8030
## 20 pt10 8030
If you want to keep all but one column, put the minus (-
) sign in front of it. I like to think of it as “subtracting” the column from the data frame.
meta %>%
select(-ptID_old)
## # A tibble: 20 × 8
## libID ptID condition age_dys sex RNAseq methylation total_seq
## <chr> <chr> <chr> <dbl> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402.
## 2 pt01_Mtb pt01 Mtb 12410 M TRUE FALSE 8918699.
## 3 pt02_Media pt02 Media 12775 M TRUE FALSE 9221555.
## 4 pt02_Mtb pt02 Mtb 12775 M TRUE FALSE 7733260.
## 5 pt03_Media pt03 Media 11315 M TRUE FALSE 6231728.
## 6 pt03_Mtb pt03 Mtb 11315 M TRUE FALSE 7105193.
## 7 pt04_Media pt04 Media 8395 M TRUE TRUE 10205557.
## 8 pt04_Mtb pt04 Mtb 8395 M TRUE TRUE 8413543.
## 9 pt05_Media pt05 Media 7300 M TRUE FALSE 15536685.
## 10 pt05_Mtb pt05 Mtb 7300 M TRUE FALSE 15509446.
## 11 pt06_Media pt06 Media 6570 F TRUE FALSE 7085995.
## 12 pt06_Mtb pt06 Mtb 6570 F TRUE FALSE 6588160.
## 13 pt07_Media pt07 Media 7665 F TRUE FALSE 10706098.
## 14 pt07_Mtb pt07 Mtb 7665 F TRUE FALSE 8576245.
## 15 pt08_Media pt08 Media 8760 M TRUE FALSE 9957906.
## 16 pt08_Mtb pt08 Mtb 8760 M TRUE FALSE 8220348.
## 17 pt09_Media pt09 Media 6935 M TRUE FALSE 13055276
## 18 pt09_Mtb pt09 Mtb 6935 M TRUE FALSE 13800442.
## 19 pt10_Media pt10 Media 8030 F TRUE FALSE 8216706.
## 20 pt10_Mtb pt10 Mtb 8030 F TRUE FALSE 7599609.
Let’s save the data frame without the ptID_old
column, since ptID
tells us everything we need to know in a more compact way.
meta <- meta %>%
select(-ptID_old)
Sometimes we would like to create a new column based on the values in an existing one.
meta %>%
mutate(age_yrs = age_dys / 365.25)
## # A tibble: 20 × 9
## libID ptID condition age_dys sex RNAseq methylation total_seq age_yrs
## <chr> <chr> <chr> <dbl> <chr> <lgl> <lgl> <dbl> <dbl>
## 1 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## 2 pt01_Mtb pt01 Mtb 12410 M TRUE FALSE 8918699. 34.0
## 3 pt02_Media pt02 Media 12775 M TRUE FALSE 9221555. 35.0
## 4 pt02_Mtb pt02 Mtb 12775 M TRUE FALSE 7733260. 35.0
## 5 pt03_Media pt03 Media 11315 M TRUE FALSE 6231728. 31.0
## 6 pt03_Mtb pt03 Mtb 11315 M TRUE FALSE 7105193. 31.0
## 7 pt04_Media pt04 Media 8395 M TRUE TRUE 10205557. 23.0
## 8 pt04_Mtb pt04 Mtb 8395 M TRUE TRUE 8413543. 23.0
## 9 pt05_Media pt05 Media 7300 M TRUE FALSE 15536685. 20.0
## 10 pt05_Mtb pt05 Mtb 7300 M TRUE FALSE 15509446. 20.0
## 11 pt06_Media pt06 Media 6570 F TRUE FALSE 7085995. 18.0
## 12 pt06_Mtb pt06 Mtb 6570 F TRUE FALSE 6588160. 18.0
## 13 pt07_Media pt07 Media 7665 F TRUE FALSE 10706098. 21.0
## 14 pt07_Mtb pt07 Mtb 7665 F TRUE FALSE 8576245. 21.0
## 15 pt08_Media pt08 Media 8760 M TRUE FALSE 9957906. 24.0
## 16 pt08_Mtb pt08 Mtb 8760 M TRUE FALSE 8220348. 24.0
## 17 pt09_Media pt09 Media 6935 M TRUE FALSE 13055276 19.0
## 18 pt09_Mtb pt09 Mtb 6935 M TRUE FALSE 13800442. 19.0
## 19 pt10_Media pt10 Media 8030 F TRUE FALSE 8216706. 22.0
## 20 pt10_Mtb pt10 Mtb 8030 F TRUE FALSE 7599609. 22.0
We can overwrite the meta
data frame so it will now contain the age_yrs
column.
meta <- meta %>%
mutate(age_yrs = age_dys / 365.25)
meta
## # A tibble: 20 × 9
## libID ptID condition age_dys sex RNAseq methylation total_seq age_yrs
## <chr> <chr> <chr> <dbl> <chr> <lgl> <lgl> <dbl> <dbl>
## 1 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## 2 pt01_Mtb pt01 Mtb 12410 M TRUE FALSE 8918699. 34.0
## 3 pt02_Media pt02 Media 12775 M TRUE FALSE 9221555. 35.0
## 4 pt02_Mtb pt02 Mtb 12775 M TRUE FALSE 7733260. 35.0
## 5 pt03_Media pt03 Media 11315 M TRUE FALSE 6231728. 31.0
## 6 pt03_Mtb pt03 Mtb 11315 M TRUE FALSE 7105193. 31.0
## 7 pt04_Media pt04 Media 8395 M TRUE TRUE 10205557. 23.0
## 8 pt04_Mtb pt04 Mtb 8395 M TRUE TRUE 8413543. 23.0
## 9 pt05_Media pt05 Media 7300 M TRUE FALSE 15536685. 20.0
## 10 pt05_Mtb pt05 Mtb 7300 M TRUE FALSE 15509446. 20.0
## 11 pt06_Media pt06 Media 6570 F TRUE FALSE 7085995. 18.0
## 12 pt06_Mtb pt06 Mtb 6570 F TRUE FALSE 6588160. 18.0
## 13 pt07_Media pt07 Media 7665 F TRUE FALSE 10706098. 21.0
## 14 pt07_Mtb pt07 Mtb 7665 F TRUE FALSE 8576245. 21.0
## 15 pt08_Media pt08 Media 8760 M TRUE FALSE 9957906. 24.0
## 16 pt08_Mtb pt08 Mtb 8760 M TRUE FALSE 8220348. 24.0
## 17 pt09_Media pt09 Media 6935 M TRUE FALSE 13055276 19.0
## 18 pt09_Mtb pt09 Mtb 6935 M TRUE FALSE 13800442. 19.0
## 19 pt10_Media pt10 Media 8030 F TRUE FALSE 8216706. 22.0
## 20 pt10_Mtb pt10 Mtb 8030 F TRUE FALSE 7599609. 22.0
Exercise: practice mutate
Try creating a new column
total_seq_millions
based ontotal_seq
.meta %>% mutate(total_seq_millions = total_seq / 1000000)
## # A tibble: 20 × 10 ## libID ptID condition age_dys sex RNAseq methylation total_seq age_yrs ## <chr> <chr> <chr> <dbl> <chr> <lgl> <lgl> <dbl> <dbl> ## 1 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0 ## 2 pt01_Mtb pt01 Mtb 12410 M TRUE FALSE 8918699. 34.0 ## 3 pt02_Media pt02 Media 12775 M TRUE FALSE 9221555. 35.0 ## 4 pt02_Mtb pt02 Mtb 12775 M TRUE FALSE 7733260. 35.0 ## 5 pt03_Media pt03 Media 11315 M TRUE FALSE 6231728. 31.0 ## 6 pt03_Mtb pt03 Mtb 11315 M TRUE FALSE 7105193. 31.0 ## 7 pt04_Media pt04 Media 8395 M TRUE TRUE 10205557. 23.0 ## 8 pt04_Mtb pt04 Mtb 8395 M TRUE TRUE 8413543. 23.0 ## 9 pt05_Media pt05 Media 7300 M TRUE FALSE 15536685. 20.0 ## 10 pt05_Mtb pt05 Mtb 7300 M TRUE FALSE 15509446. 20.0 ## 11 pt06_Media pt06 Media 6570 F TRUE FALSE 7085995. 18.0 ## 12 pt06_Mtb pt06 Mtb 6570 F TRUE FALSE 6588160. 18.0 ## 13 pt07_Media pt07 Media 7665 F TRUE FALSE 10706098. 21.0 ## 14 pt07_Mtb pt07 Mtb 7665 F TRUE FALSE 8576245. 21.0 ## 15 pt08_Media pt08 Media 8760 M TRUE FALSE 9957906. 24.0 ## 16 pt08_Mtb pt08 Mtb 8760 M TRUE FALSE 8220348. 24.0 ## 17 pt09_Media pt09 Media 6935 M TRUE FALSE 13055276 19.0 ## 18 pt09_Mtb pt09 Mtb 6935 M TRUE FALSE 13800442. 19.0 ## 19 pt10_Media pt10 Media 8030 F TRUE FALSE 8216706. 22.0 ## 20 pt10_Mtb pt10 Mtb 8030 F TRUE FALSE 7599609. 22.0 ## # … with 1 more variable: total_seq_millions <dbl>
For each sample, we have both a “Media” sample and a “Mtb” sample. If you want to use only the “Media” samples, we can narrow down the data frame to only the rows from “Media” with filter()
:
meta %>%
filter(condition == 'Media')
## # A tibble: 10 × 9
## libID ptID condition age_dys sex RNAseq methylation total_seq age_yrs
## <chr> <chr> <chr> <dbl> <chr> <lgl> <lgl> <dbl> <dbl>
## 1 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## 2 pt02_Media pt02 Media 12775 M TRUE FALSE 9221555. 35.0
## 3 pt03_Media pt03 Media 11315 M TRUE FALSE 6231728. 31.0
## 4 pt04_Media pt04 Media 8395 M TRUE TRUE 10205557. 23.0
## 5 pt05_Media pt05 Media 7300 M TRUE FALSE 15536685. 20.0
## 6 pt06_Media pt06 Media 6570 F TRUE FALSE 7085995. 18.0
## 7 pt07_Media pt07 Media 7665 F TRUE FALSE 10706098. 21.0
## 8 pt08_Media pt08 Media 8760 M TRUE FALSE 9957906. 24.0
## 9 pt09_Media pt09 Media 6935 M TRUE FALSE 13055276 19.0
## 10 pt10_Media pt10 Media 8030 F TRUE FALSE 8216706. 22.0
We used the ==
sign to test which rows in the condition column were equal to the value “Media”. Comparison operators return TRUE or FALSE. There are many useful comparison operators in R:
operator | meaning |
---|---|
== |
equal |
!= |
not equal |
< |
less than |
> |
greater than |
<= |
less than or equal to |
>= |
greater than or equal to |
Quotes vs No Quotes
Notice that we put the word “Media” inside quotes. This is because we are not using a column from inside our data frame. When you need to include actual text values in R, they will be placed inside quotes to tell them apart from other object or variable names.
The general rule is that if you want to use values from the columns of your data object, then you supply the name of the column without quotes, but if you want to specify a value that does not come from your data, then use quotes.
Can you think of another way to end up with only the Media
rows? Hint: try using the “not equal” operator.
meta %>%
filter(condition != 'Mtb')
## # A tibble: 10 × 9
## libID ptID condition age_dys sex RNAseq methylation total_seq age_yrs
## <chr> <chr> <chr> <dbl> <chr> <lgl> <lgl> <dbl> <dbl>
## 1 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## 2 pt02_Media pt02 Media 12775 M TRUE FALSE 9221555. 35.0
## 3 pt03_Media pt03 Media 11315 M TRUE FALSE 6231728. 31.0
## 4 pt04_Media pt04 Media 8395 M TRUE TRUE 10205557. 23.0
## 5 pt05_Media pt05 Media 7300 M TRUE FALSE 15536685. 20.0
## 6 pt06_Media pt06 Media 6570 F TRUE FALSE 7085995. 18.0
## 7 pt07_Media pt07 Media 7665 F TRUE FALSE 10706098. 21.0
## 8 pt08_Media pt08 Media 8760 M TRUE FALSE 9957906. 24.0
## 9 pt09_Media pt09 Media 6935 M TRUE FALSE 13055276 19.0
## 10 pt10_Media pt10 Media 8030 F TRUE FALSE 8216706. 22.0
With filter you can also specify multiple comparisons to perform. Maybe we want to filter down to only Media samples and where at least 7 million sequences were present. We can use the ampersand (&
) to accomplish that:
meta %>%
filter(condition == 'Media' & total_seq > 7000000)
## # A tibble: 9 × 9
## libID ptID condition age_dys sex RNAseq methylation total_seq age_yrs
## <chr> <chr> <chr> <dbl> <chr> <lgl> <lgl> <dbl> <dbl>
## 1 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## 2 pt02_Media pt02 Media 12775 M TRUE FALSE 9221555. 35.0
## 3 pt04_Media pt04 Media 8395 M TRUE TRUE 10205557. 23.0
## 4 pt05_Media pt05 Media 7300 M TRUE FALSE 15536685. 20.0
## 5 pt06_Media pt06 Media 6570 F TRUE FALSE 7085995. 18.0
## 6 pt07_Media pt07 Media 7665 F TRUE FALSE 10706098. 21.0
## 7 pt08_Media pt08 Media 8760 M TRUE FALSE 9957906. 24.0
## 8 pt09_Media pt09 Media 6935 M TRUE FALSE 13055276 19.0
## 9 pt10_Media pt10 Media 8030 F TRUE FALSE 8216706. 22.0
Alternatively, maybe you want to filter down to only samples from patients that are either male or at least 21 years old. You can use the vertical bar |
to mean “or”
meta %>%
filter(sex == 'M' | age_yrs >= 21)
## # A tibble: 16 × 9
## libID ptID condition age_dys sex RNAseq methylation total_seq age_yrs
## <chr> <chr> <chr> <dbl> <chr> <lgl> <lgl> <dbl> <dbl>
## 1 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## 2 pt01_Mtb pt01 Mtb 12410 M TRUE FALSE 8918699. 34.0
## 3 pt02_Media pt02 Media 12775 M TRUE FALSE 9221555. 35.0
## 4 pt02_Mtb pt02 Mtb 12775 M TRUE FALSE 7733260. 35.0
## 5 pt03_Media pt03 Media 11315 M TRUE FALSE 6231728. 31.0
## 6 pt03_Mtb pt03 Mtb 11315 M TRUE FALSE 7105193. 31.0
## 7 pt04_Media pt04 Media 8395 M TRUE TRUE 10205557. 23.0
## 8 pt04_Mtb pt04 Mtb 8395 M TRUE TRUE 8413543. 23.0
## 9 pt05_Media pt05 Media 7300 M TRUE FALSE 15536685. 20.0
## 10 pt05_Mtb pt05 Mtb 7300 M TRUE FALSE 15509446. 20.0
## 11 pt08_Media pt08 Media 8760 M TRUE FALSE 9957906. 24.0
## 12 pt08_Mtb pt08 Mtb 8760 M TRUE FALSE 8220348. 24.0
## 13 pt09_Media pt09 Media 6935 M TRUE FALSE 13055276 19.0
## 14 pt09_Mtb pt09 Mtb 6935 M TRUE FALSE 13800442. 19.0
## 15 pt10_Media pt10 Media 8030 F TRUE FALSE 8216706. 22.0
## 16 pt10_Mtb pt10 Mtb 8030 F TRUE FALSE 7599609. 22.0
We will be using the ggplot2
package today to make our plots. This is a very powerful package that creates professional looking plots and is one of the reasons people like using R so much. All plots made using the ggplot2
package start by calling the ggplot()
function.
meta %>% ggplot()
When we run this code, the Plots tab will pop to the front in the lower right corner of the RStudio screen. Right now, we just see a big grey rectangle.
What we’ve done is created a ggplot object and told it we will be using the data from the meta
object that we’ve loaded into R. We’ve done this by calling the ggplot()
function with meta
as the data
argument.
So we’ve made a plot object, now we need to start telling it what we actually want to draw in this plot. The elements of a plot have a bunch of properties like an x and y position, a size, a color, etc. These properties are called aesthetics. When creating a data visualization, we map a variable in our dataset to an aesthetic in our plot. In ggplot, we can do this by creating an “aesthetic mapping”, which we do with the aes()
function.
To create our plot, we need to map variables from our meta
object to ggplot aesthetics using the aes()
function. Since we have already told ggplot
that we are using the data in the meta
object, we can access the columns of meta
using the object’s column names. (Remember, R is case-sensitive, so we have to be careful to match the column names exactly!)
We are interested in the total sequences per sample, so let’s start by telling our plot object that we want to assign this variable to the x axis of our plot. We do this by adding (+
) information to our plot object. Add this new line to your code and run both lines by highlighting them and pressing Ctrl+Enter on your keyboard:
meta %>%
ggplot() +
aes(y = total_seq)
Note that we’ve added this new function call to a second line just to make it easier to read. To do this we make sure that the +
is at the end of the first line otherwise R will assume your command ends when it starts the next row. The +
sign indicates not only that we are adding information, but to continue on to the next line of code.
Observe that our Plot window is no longer a grey square. We now see that we’ve mapped the total_seq
column to the x axis of our plot. Note that that column name isn’t very pretty as an x-axis label, so let’s add the labs()
function to make a nicer label for the x axis
meta %>%
ggplot() +
aes(y = total_seq) +
labs(y = "Total sequences")
OK. That looks better.
meta %>%
ggplot() +
aes(x = age_dys, y = total_seq) +
labs(x = "Age (days)", y = "Total sequences")
Excellent. We’ve now told our plot object where the x and y values are coming from and what they stand for. But we haven’t told our object how we want it to draw the data.
There are many different plot types (bar charts, scatter plots, histograms, etc). We tell our plot object what to draw by adding a “geometry” (“geom” for short) to our object. We will talk about many different geometries today, but for our first plot, let’s draw our data using the “points” geometry for each value in the data set. To do this, we add geom_point()
to our plot object:
meta %>%
ggplot() +
aes(x = age_dys, y = total_seq) +
labs(x = "Age (days)", y = "Total sequences") +
geom_point()
Now we’re really getting somewhere. It finally looks like a proper plot! One thing we could do is use a different color for the sample conditions. To map the condition of each point to a color, we will again use the aes()
function:
meta %>%
ggplot() +
aes(x = age_dys, y = total_seq, color = condition) +
labs(x = "Age (days)", y = "Total sequences") +
geom_point()
So far, we’ve been using geom_point()
to plot two continuous (numeric) variables. What if we want to plot a categorical variable and a numeric variable?
Using the meta
data, use ggplot to create a box plot with condition on the x axis and total_seq on the y axis.
meta %>%
ggplot() +
aes(x = condition, y = total_seq) +
geom_boxplot()
This type of visualization makes it easy to compare the range and spread of values across groups. The “middle” 50% of the data is located inside the box and outliers that are far away from the central mass of the data are drawn as points.
Another way to plot a categorical and a continuous variable is with a bar plot.
meta %>%
ggplot() +
aes(x = libID, y = total_seq) +
geom_col()
Those library IDs sure are squished together and hard to raed. Maybe we should have total_seq on the x axis and libID on the y axis. In other words, it might be easier if we flip the coordinates. We can rewrite the aes()
code, or instead an easy way to change the coordinates is with coord_flip()
:
meta %>%
ggplot() +
aes(x = libID, y = total_seq) +
geom_col() +
coord_flip()
This is looking pretty good. Now let’s add color using the fill
aesthetic!
meta %>%
ggplot() +
aes(x = libID, y = total_seq, fill = condition) +
geom_col() +
coord_flip()
Now let’s see if we can plot the total sequences for each patient ID:
meta %>%
ggplot() +
aes(x = ptID, y = total_seq, fill = condition) +
geom_col() +
coord_flip()
By default, the bars are stacked. Maybe we would like them to be side-by-side instead: We can tell the geom_col
shapes to “dodge” each other:
meta %>%
ggplot() +
aes(x = ptID, y = total_seq, fill = condition) +
geom_col(position = position_dodge()) +
coord_flip()
Finally, since ptID is a short name, let’s flip our coordinates back by removing coord_flip()
:
meta %>%
ggplot() +
aes(x = ptID, y = total_seq, fill = condition) +
geom_col(position = position_dodge())
We now have our first data cleaning plot, the total number of sequences per library.
If we want to share our plots with other people or put them in manuscripts, we’re going to need to make them a little prettier and save them to our computers. Let’s make it pretty and then save the plot!
Add labels:
meta %>%
ggplot() +
aes(x = ptID, y = total_seq, fill = condition) +
geom_col(position = position_dodge()) +
labs(x = 'patient ID', y = 'Total sequences')
Change the theme:
meta %>%
ggplot() +
aes(x = ptID, y = total_seq, fill = condition) +
geom_col(position = position_dodge()) +
labs(x = 'patient ID', y = 'Total sequences') +
theme_bw()
Save to a new variable:
seq_per_lib_plot <- meta %>%
ggplot() +
aes(x = ptID, y = total_seq, fill = condition) +
geom_col(position = position_dodge()) +
labs(x = 'patient ID', y = 'Total sequences') +
theme_bw()
Take a look at it by pasting the variable name into the console: seq_per_lib_plot
ggsave("sequences_per_library.png", plot = seq_per_lib_plot)
Data comes in many shapes and sizes, and one way we classify data is either “wide” or “long.” Data that is “long” has one row per observation. The metadata is in a long format. We have one row for each patient sample and each variable is in a different column. We might describe this data as “tidy” because it makes it easy to work with ggplot2
and dplyr
functions (this is where the “tidy” in “tidyverse” comes from). As tidy as it may be, sometimes we may want our data in a “wide” format. Typically in “wide” format each row represents a group of observations and each value is placed in a different column rather than a different row. For example maybe we want only one row per country and want to spread the life expectancy values into different columns (one for each year).
The tidyr
package contains the functions pivot_wider
and pivot_longer
that make it easy to switch between the two formats. The tidyr
package is included in the tidyverse
package so we don’t need to do anything to load it.
For each patient, we have two samples: Media
and Mtb
. In the metadata, the only difference in these two conditions is the libID
column (which is redundant with the ptID
and condition
) and the total_seq
column. We can take the condition column and create two new columns, one with the total seqs in the media sample and one with the total seqs in the mtb sample. This is called “pivoting” the data frame “wider” because we rearrange it by creating an additional column and making it have fewer rows. Let’s try it!
meta %>%
select(-libID) %>%
pivot_wider(names_from = condition, values_from = total_seq)
## # A tibble: 10 × 8
## ptID age_dys sex RNAseq methylation age_yrs Media Mtb
## <chr> <dbl> <chr> <lgl> <lgl> <dbl> <dbl> <dbl>
## 1 pt01 12410 M TRUE FALSE 34.0 9114402. 8918699.
## 2 pt02 12775 M TRUE FALSE 35.0 9221555. 7733260.
## 3 pt03 11315 M TRUE FALSE 31.0 6231728. 7105193.
## 4 pt04 8395 M TRUE TRUE 23.0 10205557. 8413543.
## 5 pt05 7300 M TRUE FALSE 20.0 15536685. 15509446.
## 6 pt06 6570 F TRUE FALSE 18.0 7085995. 6588160.
## 7 pt07 7665 F TRUE FALSE 21.0 10706098. 8576245.
## 8 pt08 8760 M TRUE FALSE 24.0 9957906. 8220348.
## 9 pt09 6935 M TRUE FALSE 19.0 13055276 13800442.
## 10 pt10 8030 F TRUE FALSE 22.0 8216706. 7599609.
Notice here that we tell pivot_wider()
which columns to pull the names we wish our new columns to be named from the year variable, and the values to populate those columns from the condition variable. (Again, neither of which have to be in quotes in the code when there are no special characters or spaces - certainly an incentive not to use special characters or spaces!) We see that the resulting table has new columns by year, and the values populate it with our remaining variables dictating the rows.
Maybe we should assign those columns more informative names:
meta %>%
select(-libID) %>%
pivot_wider(names_from = condition, values_from = total_seq,
names_prefix = "total_seq_")
## # A tibble: 10 × 8
## ptID age_dys sex RNAseq methylation age_yrs total_seq_Media total_seq_Mtb
## <chr> <dbl> <chr> <lgl> <lgl> <dbl> <dbl> <dbl>
## 1 pt01 12410 M TRUE FALSE 34.0 9114402. 8918699.
## 2 pt02 12775 M TRUE FALSE 35.0 9221555. 7733260.
## 3 pt03 11315 M TRUE FALSE 31.0 6231728. 7105193.
## 4 pt04 8395 M TRUE TRUE 23.0 10205557. 8413543.
## 5 pt05 7300 M TRUE FALSE 20.0 15536685. 15509446.
## 6 pt06 6570 F TRUE FALSE 18.0 7085995. 6588160.
## 7 pt07 7665 F TRUE FALSE 21.0 10706098. 8576245.
## 8 pt08 8760 M TRUE FALSE 24.0 9957906. 8220348.
## 9 pt09 6935 M TRUE FALSE 19.0 13055276 13800442.
## 10 pt10 8030 F TRUE FALSE 22.0 8216706. 7599609.
And now let’s save the new wider metadata to a new variable:
meta_wide <- meta %>%
select(-libID) %>%
pivot_wider(names_from = condition, values_from = total_seq)
Notice that the number of rows and columns has changed:
nrow(meta)
## [1] 20
ncol(meta)
## [1] 9
nrow(meta_wide)
## [1] 10
ncol(meta_wide)
## [1] 8
Everything we did with pivot_wider
, we can reverse with pivot_longer
. Maybe you receive a data frame in a wide format, but you need it in a long format. Here’s how we can get back to a long data frame:
meta_wide %>%
pivot_longer(c(Media, Mtb),
names_to = 'condition', values_to = 'total_seq')
## # A tibble: 20 × 8
## ptID age_dys sex RNAseq methylation age_yrs condition total_seq
## <chr> <dbl> <chr> <lgl> <lgl> <dbl> <chr> <dbl>
## 1 pt01 12410 M TRUE FALSE 34.0 Media 9114402.
## 2 pt01 12410 M TRUE FALSE 34.0 Mtb 8918699.
## 3 pt02 12775 M TRUE FALSE 35.0 Media 9221555.
## 4 pt02 12775 M TRUE FALSE 35.0 Mtb 7733260.
## 5 pt03 11315 M TRUE FALSE 31.0 Media 6231728.
## 6 pt03 11315 M TRUE FALSE 31.0 Mtb 7105193.
## 7 pt04 8395 M TRUE TRUE 23.0 Media 10205557.
## 8 pt04 8395 M TRUE TRUE 23.0 Mtb 8413543.
## 9 pt05 7300 M TRUE FALSE 20.0 Media 15536685.
## 10 pt05 7300 M TRUE FALSE 20.0 Mtb 15509446.
## 11 pt06 6570 F TRUE FALSE 18.0 Media 7085995.
## 12 pt06 6570 F TRUE FALSE 18.0 Mtb 6588160.
## 13 pt07 7665 F TRUE FALSE 21.0 Media 10706098.
## 14 pt07 7665 F TRUE FALSE 21.0 Mtb 8576245.
## 15 pt08 8760 M TRUE FALSE 24.0 Media 9957906.
## 16 pt08 8760 M TRUE FALSE 24.0 Mtb 8220348.
## 17 pt09 6935 M TRUE FALSE 19.0 Media 13055276
## 18 pt09 6935 M TRUE FALSE 19.0 Mtb 13800442.
## 19 pt10 8030 F TRUE FALSE 22.0 Media 8216706.
## 20 pt10 8030 F TRUE FALSE 22.0 Mtb 7599609.
So far, we’ve been working with just the metadata. But this workshop is about analyzing RNAseq data, so let’s start working with actual sequencing data.
count <- read_csv('0_data/raw_counts.csv')
## Rows: 44786 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): hgnc_symbol
## dbl (20): pt01_Media, pt01_Mtb, pt02_Media, pt02_Mtb, pt03_Media, pt03_Mtb, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Take a look at the counts data with: View(count)
Notice how each column is a libID
and each row is a hgnc_symbol
. The values are the raw sequence counts for the genes in each library. Soon, we’re going to want to combine the sequence data with our metadata. But first, we need to change this from a wide to a long format.
count %>%
pivot_longer(-hgnc_symbol, names_to = 'libID', values_to = 'raw_count')
## # A tibble: 895,720 × 3
## hgnc_symbol libID raw_count
## <chr> <chr> <dbl>
## 1 5_8S_rRNA pt01_Media 20.0
## 2 5_8S_rRNA pt01_Mtb 20.0
## 3 5_8S_rRNA pt02_Media 7.11
## 4 5_8S_rRNA pt02_Mtb 6.06
## 5 5_8S_rRNA pt03_Media 12.2
## 6 5_8S_rRNA pt03_Mtb 4.07
## 7 5_8S_rRNA pt04_Media 4
## 8 5_8S_rRNA pt04_Mtb 3
## 9 5_8S_rRNA pt05_Media 22.0
## 10 5_8S_rRNA pt05_Mtb 55.7
## # … with 895,710 more rows
Let’s save this to a new data frame so we can re-use it later.
count_long <- count %>%
pivot_longer(-hgnc_symbol, names_to = 'libID', values_to = 'raw_count')
Now let’s combine our raw sequence counts with our metadata. To combine two data frames, we will use a join
function. The dplyr package has a number of tools for joining data frames together depending on what we want to do with the rows of the data of countries that are not represented in both data frames.
?join
We have a few options here:
The mutating joins add columns from y to x, matching rows based on the keys: - inner_join(): includes all rows in x and y. - left_join(): includes all rows in x. - right_join(): includes all rows in y. - full_join(): includes all rows in x or y.
In an “inner join”, the new data frame only has those rows where the same key is found in both data frames. This is a very commonly used join.
inner_join(meta, count_long, by = 'libID')
## # A tibble: 895,720 × 11
## libID ptID condition age_dys sex RNAseq methylation total_seq age_yrs
## <chr> <chr> <chr> <dbl> <chr> <lgl> <lgl> <dbl> <dbl>
## 1 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## 2 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## 3 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## 4 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## 5 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## 6 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## 7 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## 8 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## 9 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## 10 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402. 34.0
## # … with 895,710 more rows, and 2 more variables: hgnc_symbol <chr>,
## # raw_count <dbl>
Now let’s save our joined data frame to a new variable:
full_data <- inner_join(meta, count_long, by = 'libID')
Now that we have our metadata joined with our raw count data, we’re ready to clean and analyze it!
Bonus: Other dplyr join functions
Outer joins and can be performed using
left_join()
,right_join()
, andfull_join()
. In a “left join”, if the key is present in the left hand data frame, it will appear in the output, even if it is not found in the the right hand data frame. For a right join, the opposite is true. For a full join, all possible keys are included in the output data frame.
Let’s find out which genes had the highest raw counts.
top_10_abundant_genes <- full_data %>%
filter(condition == 'Media') %>%
group_by(hgnc_symbol) %>%
summarize(med_raw_count = median(raw_count)) %>%
arrange(desc(med_raw_count)) %>%
slice_max(order_by = med_raw_count, n = 10)
top_10_abundant_genes
## # A tibble: 10 × 2
## hgnc_symbol med_raw_count
## <chr> <dbl>
## 1 MALAT1 263231
## 2 Metazoa_SRP 208862.
## 3 RN7SL1 189057.
## 4 CH507-513H4.3 140705.
## 5 CH507-513H4.4 140705.
## 6 CH507-513H4.6 139025.
## 7 MT-RNR2 132437.
## 8 RN7SK 83060.
## 9 MT-RNR1 78763
## 10 MT-CO1 77922.
Now let’s make a boxplot with only those genes.
full_data %>%
filter(hgnc_symbol %in% top_10_abundant_genes$hgnc_symbol,
condition == 'Media') %>%
ggplot() +
aes(x = hgnc_symbol, y = raw_count) +
geom_boxplot() +
coord_flip() +
labs(title = 'Top 10 most abundant genes')
Try plotting just interferon 1 gamma for each patient:
full_data %>%
filter(hgnc_symbol == 'IFNG', condition == 'Media') %>%
ggplot() +
aes(x = ptID, y = raw_count) +
geom_col() +
labs(title = 'IFNG RNAseq counts')