Please download the data in data.zip. Then, move data.zip to your Desktop and unzip it (usually double-clicking it will work).
If the above link does not work, please try downloading the data with the Download button at https://github.com/BIGslu/workshops/blob/main/2022.08.15_R.tidyverse.workshop/data/data.zip.
When you open RStudio, it should look like so with multiple panels. If you see only 1 panel, then you’re likely in R, not RStudio.
Install R packages by running the following script in your R console (left panel in the above image).
#CRAN packages
install.packages("tidyverse")
install.packages("ggrepel")
#Bioconductor packages
install.packages("BiocManager")
BiocManager::install("limma")
If prompted, say a to “Update all/some/none? [a/s/n]” and no to “Do you want to install from sources the packages which need compilation? (Yes/no/cancel)”. This can take several minutes.
To make sure packages are correctly installed, load them into R with library(). If you see any ERROR, please come 15 minutes early on the day of the workshop or contact Kim for assistance.
First, load the package(s) that give messages upon loading.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Then, check package(s) that load silently with no messages.
library(limma)
library(ggrepel)
For reproducibility, here is the complete list of software used in this workshop.
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggrepel_0.9.1 limma_3.52.2 forcats_0.5.2 stringr_1.4.1
## [5] dplyr_1.0.9 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
## [9] tibble_3.1.8 ggplot2_3.3.6 tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 lubridate_1.8.0 assertthat_0.2.1
## [4] rprojroot_2.0.3 digest_0.6.29 utf8_1.2.2
## [7] R6_2.5.1 cellranger_1.1.0 backports_1.4.1
## [10] reprex_2.0.2 evaluate_0.16 httr_1.4.4
## [13] pillar_1.8.1 rlang_1.0.4 googlesheets4_1.0.1
## [16] readxl_1.4.1 rstudioapi_0.14 jquerylib_0.1.4
## [19] rmarkdown_2.15 googledrive_2.0.0 munsell_0.5.0
## [22] broom_1.0.0 compiler_4.2.1 modelr_0.1.9
## [25] xfun_0.32 pkgconfig_2.0.3 htmltools_0.5.3
## [28] tidyselect_1.1.2 fansi_1.0.3 crayon_1.5.1
## [31] tzdb_0.3.0 dbplyr_2.2.1 withr_2.5.0
## [34] grid_4.2.1 jsonlite_1.8.0 gtable_0.3.0
## [37] lifecycle_1.0.1 DBI_1.1.3 magrittr_2.0.3
## [40] scales_1.2.1 cli_3.3.0 stringi_1.7.8
## [43] cachem_1.0.6 fs_1.5.2 xml2_1.3.3
## [46] bslib_0.4.0 ellipsis_0.3.2 generics_0.1.3
## [49] vctrs_0.4.1 tools_4.2.1 glue_1.6.2
## [52] hms_1.1.2 fastmap_1.1.0 yaml_2.3.5
## [55] colorspace_2.0-3 gargle_1.2.0 rvest_1.0.3
## [58] knitr_1.39 haven_2.5.0 sass_0.4.2
In this workshop, we introduce you to R and RStudio at the beginner level. In it, we cover:
We will do all of our work in RStudio. RStudio is an integrated development and analysis environment for R that brings a number of conveniences over using R in a terminal or other editing environments.
During the workshop, we will build an R script together, which will be posted as ‘live_notes’ after the workshop at https://github.com/BIGslu/workshops/tree/main/2022.08.15_R.tidyverse.workshop/live_notes
When you start RStudio, you will see something like the following window appear:
Notice that the window is divided into three “panes”:
Console (the entire left side): This is your view into the R engine. You can type R commands here and see the output printed by R. (To make it easier to tell them apart, your input and the resulting output are printed in different colors.) There are several editing conveniences available: use the up and down arrow keys to recall previously entered commands, which can then be edited and re-run; press TAB to complete the name before the cursor; see the online docs for more.
Environment/History (tabbed in upper right): View current user-defined objects and previously entered commands, respectively. The instructor may have additional tabs like Connections, Git, etc.; these are not relevant to this workshop series.
Files/Plots/Packages/Help (tabbed in lower right): As their names suggest, these are used to view the contents of the current directory, view graphics created by the user, install packages, and view the built-in help pages.
To change the look of RStudio, you can go to Tools -> Global Options -> Appearance and select colors, font size, etc. If you plan to be working for longer periods, we suggest choosing a dark background color scheme to save your computer battery and your eyes.
Projects are a great feature of RStudio. When you create a project,
RStudio creates an .Rproj
file that links all of your files
and outputs to the project directory. When you import data, R
automatically looks for the file in the project directory instead of you
having to specify a full file path on your computer like
/Users/username/Desktop/
. R also automatically saves any
output to the project directory. Finally, projects allow you to save
your R environment in .RData
so that when you close RStudio
and then re-open it, you can start right where you left off without
re-importing any data or re-calculating any intermediate steps.
RStudio has a simple interface to create and switch between projects, accessed from the button in the top-right corner of the RStudio window. (Labeled “Project: (None)”, initially.)
Let’s create a project to work in for this workshop. Start by clicking the “Project” button in the upper right or going to the “File” menu. Select “New Project” and the following will appear.
You can either create a project in an existing directory or make a new directory on your computer - just be sure you know where it is.
After your project is created, navigate to its directory using your
Finder/File explorer. You will see the .RProj
file has been
created.
To access this project in the future, simply double-click the
.RProj
file and RStudio will open the project or choose
File > Open Project from within an already open RStudio window.
R script files are the primary way in which R facilitates reproducible research. They contain the code that loads your raw data, cleans it, performs the analyses, and creates and saves visualizations. R scripts maintain a record of everything that is done to the raw data to reach the final result. That way, it is very easy to write up and communicate your methods because you have a document listing the precise steps you used to conduct your analyses. This is one of R’s primary advantages compared to traditional tools like Excel, where it may be unclear how to reproduce the results.
Generally, if you are testing an operation (e.g. what would my data look like if I applied a log-transformation to it?), you should do it in the console (left pane of RStudio). If you are committing a step to your analysis (e.g. I want to apply a log-transformation to my data and then conduct the rest of my analyses on the log-transformed data), you should add it to your R script so that it is saved for future use.
Additionally, you should annotate your R scripts with comments. In
each line of code, any text preceded by the #
symbol will
not execute. Comments can be useful to remind yourself and to tell other
readers what a specific chunk of code does. In my experience, there can
never be too much commenting.
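For example, here is a minimal sketch of a commented line of code (the numbers are just an illustration, not part of the workshop data):

# Convert a patient age from days to years; this comment line does not run
12410 / 365.25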
Let’s create an R script (File > New File > R Script) and save
it as introR_live_notes.R
in your main project directory.
If you again look to the project directory on your computer, you will
see introR_live_notes.R
is now saved there.
We will work together to create and populate the
introR_live_notes.R
script throughout this workshop.
R packages are units of shareable code, containing functions that
facilitate and enhance analyses. Packages are typically installed from
CRAN (The Comprehensive R
Archive Network), which is a database containing R itself as well as
many R packages. Any package can be installed from CRAN using the
install.packages
function. You should input installation commands into your console (as opposed to your live_notes.R script) since once a package is installed on your computer, you won't need to install it again.
You should have installed the following packages prior to the
workshop. However, if you did not, here is the code again.
tidyverse
is a meta-package containing several packages
useful in data manipulation and plotting. ggrepel
contains
functions to modify ggplot
plots.
You DO NOT need to run this again if you did so as part of the setup instructions.
install.packages("tidyverse")
install.packages("ggrepel")
After installing a package, and every time you open a new
RStudio session, the packages you want to use need to be loaded into the
R workspace with the library
function. This tells R to
access the package’s functions and prevents the lags that would occur if RStudio automatically loaded every installed package every time you opened it. To put this in perspective, I had 464 packages installed at
the time this document was made.
# Data manipulation and visualization
library(tidyverse)
Because tidyverse is a meta-package, it automatically tells you what packages it is loading and their versions. In addition, the Conflicts section lets you know which functions in the tidyverse mask functions that already exist in your R session. Because you chose to load the package, calling the function filter will use the tidyverse function, not the stats function (which comes with base R). If you for some reason need the stats version, you can specify it with package:: like stats::filter.
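For example, a quick illustration of the :: syntax (a minimal sketch, not part of the workshop analysis):

# The package::function form always calls a specific package's version
# Here, stats::filter computes a moving average, unrelated to dplyr's filter
stats::filter(1:10, rep(1/3, 3))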
Bioconductor is another repository of R packages. It has different requirements for upload and houses many biology-relevant packages. To install from Bioconductor, you first install its installer from CRAN.
install.packages("BiocManager")
Then install your package of choice using the
BiocManager
installer. Again, you did this in the setup
instructions but the code is here again for reference. We install
limma
, a package for analysis of microarray and RNA-seq
data.
If prompted, say a to “Update all/some/none? [a/s/n]” and no to “Do you want to install from sources the packages which need compilation? (Yes/no/cancel)”.
BiocManager::install("limma")
Before doing anything in R, it is a good idea to set your random seed. Your analyses may not end up using a seed but by setting it, you ensure that everything is exactly reproducible.
The most common seeds are 0, 1, 123, and 42. You can use any number and change it up between scripts if you’d like.
set.seed(4389)
Create a directory inside your project directory called
data
and move the data files into this directory. You can
do this in your regular point-and-click file explorer.
.csv, .tsv, etc.

One of R’s most essential data structures is the data frame, which is simply a table of m columns by n rows. First, we will read the RNA-seq metadata into RStudio using the base R read.table function.
Each R function uses the following basic syntax, where
Function
is the name of the function.
Function(argument1=..., argument2=..., ...)
read.table
has many arguments; however, we only need to
specify 3 arguments to correctly read in our data as a data frame. For
our data, we will need to specify:
- file - gives the path to the file that we want to load from our working directory (current project directory)
- sep - tells R that our data are comma-separated
- header - tells R that the first row in our data contains the names of the variables (columns)

We will store the data as an object named meta using the assignment operator <-, so that we can re-use it in our analysis.
# read the data and save it as an object
meta <- read.table(file="data/metadata.csv", sep=",", header=TRUE)
Now whenever we want to use these data, we simply call meta.
You can read up about the different arguments of a specific function
by typing ?Function
or help(Function)
in your
R console.
?read.table
You will notice that there are multiple functions on the read.table help page. These include similar and related
functions with additional options. For example, since our data are in
.csv
format, we could’ve instead read them into R with
read.csv
which assumes the options
sep=",", header=TRUE
by default.
# read the data with different function
meta <- read.csv(file="data/metadata.csv")
.RData
You may have data that do not fit nicely into a single table or into
a table at all (like plots). You can save these as .RData
,
which can be loaded directly into R. You can also save multiple tables
and/or other objects in a single .RData
file to make
loading your data quick and easy. Moreover, .RData files are automatically compressed so they take up less storage space than multiple separate tables.
load("data/dat_voom.RData")
Notice that these data appear already named in your R environment as
dat
. Object names are determined when saving so be sure to
create short but descriptive names before saving to
.RData
.
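For reference, here is a minimal sketch of how objects get their names at save time (the object and file names are hypothetical; you do not need to run this):

# The object keeps the name it had when saved
my_table <- data.frame(x = 1:3)
save(my_table, file = "my_objects.RData")
# load() restores it into the environment as my_table
load("my_objects.RData")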
Let’s return to the simpler metadata for now. This data frame
consists of 20 rows (observations) and 9 columns (variables). You can
see this quickly using the dimension function dim
dim(meta)
## [1] 20 9
Each column and each row of a data frame are individual R vectors. R
vectors are one-dimensional arrays of data. For example, we can extract
column vectors from data frames using the $
operator.
# Extract patient IDs
meta$ptID
## [1] "pt01" "pt01" "pt02" "pt02" "pt03" "pt03" "pt04" "pt04" "pt05" "pt05"
## [11] "pt06" "pt06" "pt07" "pt07" "pt08" "pt08" "pt09" "pt09" "pt10" "pt10"
R objects have several different classes (types). Our data frame
contains 4 R data types. The base R class
function will
tell you what data type an object is.
class(meta)
## [1] "data.frame"
class(meta$ptID)
## [1] "character"
class(meta$age_dys)
## [1] "integer"
class(meta$total_seq)
## [1] "numeric"
class(meta$RNAseq)
## [1] "logical"
We see that our ptID column is character, meaning it contains non-numeric letters and characters. On the other hand, age_dys is integer, meaning a whole number with no decimals, and total_seq is numeric, meaning any number, with or without decimals. Finally, RNAseq is logical, meaning TRUE/FALSE - which need to be in all caps for R to recognize them as such.
You can see the entire structure of the data with
str
.
str(meta)
## 'data.frame': 20 obs. of 9 variables:
## $ libID : chr "pt01_Media" "pt01_Mtb" "pt02_Media" "pt02_Mtb" ...
## $ ptID : chr "pt01" "pt01" "pt02" "pt02" ...
## $ condition : chr "Media" "Mtb" "Media" "Mtb" ...
## $ age_dys : int 12410 12410 12775 12775 11315 11315 8395 8395 7300 7300 ...
## $ sex : chr "M" "M" "M" "M" ...
## $ ptID_old : chr "pt00001" "pt00001" "pt00002" "pt00002" ...
## $ RNAseq : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ methylation: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ total_seq : num 9114402 8918699 9221555 7733260 6231728 ...
Another common data type not found in these data is factor, a non-numeric value with a set number of unique
levels. We will see factor
in the ggplot
section.
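As a brief preview (a minimal sketch, not part of the workshop data), a factor looks like this:

# A factor stores character values with a defined set of levels
factor(c("Media", "Mtb", "Media"), levels = c("Mtb", "Media"))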
S3, S4

Now, let’s look at the dat data. These data are in S3 format, a list-like structure that can hold multiple data frames, vectors, plots, etc. In the case of dat, the object is a specific class from the limma package called EList, for expression list.
class(dat)
## [1] "EList"
## attr(,"package")
## [1] "limma"
If you click on the dat
object in your Environment tab,
you will see multiple pieces.
In these data, pieces are data frames or matrices. You can again see
this with class
, only this time you specify a part of the
dat
object with $
class(dat$genes)
## [1] "data.frame"
class(dat$E)
## [1] "matrix" "array"
class(dat$targets)
## [1] "data.frame"
class(dat$weights)
## [1] "matrix" "array"
class(dat$design)
## [1] "matrix" "array"
Notice that you get the message
Loading required package: limma
. If you did not have
limma
installed, you could not work with these data because
they are a data type specific to limma
. This is why we installed the package earlier.
Or going deeper, you can specify one column in the genes
data frame
class(dat$genes$hgnc_symbol)
## [1] "character"
Of note, working with S4
objects is very similar to
S3
except that they are accessed with @
instead of $
. However, we will not use S4
in
this workshop.
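For reference, here is a minimal sketch of a toy S4 object (a made-up class, not from limma) showing how slots are accessed with @:

# Define a tiny S4 class with one slot
setClass("ToyResult", slots = c(values = "numeric"))
toy <- new("ToyResult", values = c(1.5, 2.7))
# Slots are accessed with @ instead of $
toy@values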
A large proportion of R functions operate on vectors to perform quick computations over their values. Here are some examples:
# Compute the variance of total number of sequences
var(meta$total_seq)
## [1] 7.843553e+12
# Find whether any samples have greater than 10 million sequences
meta$total_seq > 10E6
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE
## [13] TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
# Find the unique values of metadata's condition
unique(meta$condition)
## [1] "Media" "Mtb"
Functions executed on an object in R may respond exclusively to one or more data types or may respond differently depending on the data type. When you use the incorrect data type, you will get an error or warning message. For example, you cannot take the mean of a factor or character.
# Compute the mean of libID
mean(meta$libID)
## Warning in mean.default(meta$libID): argument is not numeric or logical:
## returning NA
## [1] NA
Since vectors are one dimensional arrays of a defined length, their
individual values can be retrieved using vector indices. R uses 1-based
indexing, meaning the first value in an R vector corresponds to the
index 1. (Importantly, if you use python, that language is 0-based,
meaning the first value is index 0.) Each subsequent element increases
the index by 1. For example, we can extract the value of the 5th element
of the libID
vector using the square bracket operator
[ ]
like so.
meta$libID[5]
## [1] "pt03_Media"
In contrast, data frames are two dimensional arrays so indexing is
done across both dimensions as [rows, columns]
. So, we can
extract the same library ID value directly from the data frame knowing
it is in the 5th row and 1st column.
meta[5, 1]
## [1] "pt03_Media"
The square bracket operator is often used with logical vectors
(TRUE/FALSE) to subset data. For example, we can subset our metadata to
all Media
observations (rows).
# Create logical vector for which condition is Media
logical.vector <- meta$condition == "Media"
#View vector
logical.vector
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [13] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
#Apply vector to data frame to select only observations where the logical vector is TRUE
meta[logical.vector, ]
## libID ptID condition age_dys sex ptID_old RNAseq methylation total_seq
## 1 pt01_Media pt01 Media 12410 M pt00001 TRUE FALSE 9114402
## 3 pt02_Media pt02 Media 12775 M pt00002 TRUE FALSE 9221555
## 5 pt03_Media pt03 Media 11315 M pt00003 TRUE FALSE 6231728
## 7 pt04_Media pt04 Media 8395 M pt00004 TRUE TRUE 10205557
## 9 pt05_Media pt05 Media 7300 M pt00005 TRUE FALSE 15536685
## 11 pt06_Media pt06 Media 6570 F pt00006 TRUE FALSE 7085995
## 13 pt07_Media pt07 Media 7665 F pt00007 TRUE FALSE 10706098
## 15 pt08_Media pt08 Media 8760 M pt00008 TRUE FALSE 9957906
## 17 pt09_Media pt09 Media 6935 M pt00009 TRUE FALSE 13055276
## 19 pt10_Media pt10 Media 8030 F pt00010 TRUE FALSE 8216706
Subsetting is extremely useful when working with large data. We will learn more complex subsets on day 2 using the tidyverse. But first…
Statement | Meaning
--- | ---
<- | Assign to object in environment
== | Equal to
!= | Not equal to
> | Greater than
>= | Greater than or equal to
< | Less than
<= | Less than or equal to
%in% | In or within
is.na() | Is missing, e.g. NA
!is.na() | Is not missing
& | And
| | Or
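For example, two statements from the table that we have not yet used (a minimal sketch using the meta data frame loaded earlier):

# %in% asks whether each value matches any value in a set
meta$sex %in% c("F")
# is.na() flags missing values; these data have none
is.na(meta$total_seq)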
Using the meta data frame:

- Determine what class the ptID_old variable is.

Using [ ]:

- Determine what libID value occurs in the 9th row.
- Determine the libID for a library where age_dys equals 7300 and condition equals Media.
- Subset the data to only rows where ptID equals “pt05” or “pt08”. Hint: Use a logical vector with the conditional statements above.

In this workshop, we introduce you to the R tidyverse, a suite of packages for data manipulation and visualization in R. In it, we cover how to:
During the workshop, we will build an R script together, which will be posted as ‘live_notes’ after the workshop at https://github.com/BIGslu/workshops/tree/main/2022.08.15_R.tidyverse.workshop/live_notes
The tidyverse is not just one package, but a collection of several packages that are designed to work well together for data analysis tasks. Let’s load the tidyverse. For more information on installation, see the setup instructions.
library(tidyverse)
We also load the metadata, this time using the tidyverse
function read_csv
. This function is very similar to base R
read.csv
, except that it automatically tells you the
structure of the loaded data, performs better when determining different
classes of data within a data frame, and saves the object as a
tibble.
meta <- read_csv("data/metadata.csv")
## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
class(meta)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
A tibble is just a data frame with some extra hidden formatting. You
will not need to interface with these hidden features and all the
following functions would behave the same if you’d read in a simple
data.frame
with read.csv
.
Take a look at the dataset with View():
View(meta)
Here is a brief description of the variables in these data.
- libID (character): Unique library ID for each sequencing library. Formatted as ptID_condition
- ptID (character): Patient ID. Each patient has 2 libraries which differ in their condition
- condition (character): Experimental condition. Media for media control. Mtb for M. tuberculosis infected
- age_dys (numeric): Age in days
- sex (character): Biological sex, M or F
- ptID_old (character): Old patient ID with leading zeroes
- RNAseq (logical): If the library has pass-filter RNA-seq data
- methylation (logical): If the library has pass-filter methylation data
- total_seq (numeric): Total number of pass-filter sequences

The first tidyverse
package we will use is
dplyr
, which is for data manipulation. You do not need to
load this package with library
as it was already loaded as
part of the tidyverse
.
You can use the select
function to keep only the columns
that you want and remove columns that you don’t need. The first argument
to the select
function is our data frame, and the second
argument is the name of the column we want to keep.
select(meta, ptID)
## # A tibble: 20 × 1
## ptID
## <chr>
## 1 pt01
## 2 pt01
## 3 pt02
## 4 pt02
## 5 pt03
## 6 pt03
## 7 pt04
## 8 pt04
## 9 pt05
## 10 pt05
## 11 pt06
## 12 pt06
## 13 pt07
## 14 pt07
## 15 pt08
## 16 pt08
## 17 pt09
## 18 pt09
## 19 pt10
## 20 pt10
We can continue to list multiple columns to keep by separating with commas.
select(meta, ptID, total_seq)
## # A tibble: 20 × 2
## ptID total_seq
## <chr> <dbl>
## 1 pt01 9114402.
## 2 pt01 8918699.
## 3 pt02 9221555.
## 4 pt02 7733260.
## 5 pt03 6231728.
## 6 pt03 7105193.
## 7 pt04 10205557.
## 8 pt04 8413543.
## 9 pt05 15536685.
## 10 pt05 15509446.
## 11 pt06 7085995.
## 12 pt06 6588160.
## 13 pt07 10706098.
## 14 pt07 8576245.
## 15 pt08 9957906.
## 16 pt08 8220348.
## 17 pt09 13055276
## 18 pt09 13800442.
## 19 pt10 8216706.
## 20 pt10 7599609.
If you want to keep all but one column, put the minus
(-
) sign in front of it. I like to think of it as
“subtracting” the column from the data frame.
select(meta, -ptID_old)
## # A tibble: 20 × 8
## libID ptID condition age_dys sex RNAseq methylation total_seq
## <chr> <chr> <chr> <dbl> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Media pt01 Media 12410 M TRUE FALSE 9114402.
## 2 pt01_Mtb pt01 Mtb 12410 M TRUE FALSE 8918699.
## 3 pt02_Media pt02 Media 12775 M TRUE FALSE 9221555.
## 4 pt02_Mtb pt02 Mtb 12775 M TRUE FALSE 7733260.
## 5 pt03_Media pt03 Media 11315 M TRUE FALSE 6231728.
## 6 pt03_Mtb pt03 Mtb 11315 M TRUE FALSE 7105193.
## 7 pt04_Media pt04 Media 8395 M TRUE TRUE 10205557.
## 8 pt04_Mtb pt04 Mtb 8395 M TRUE TRUE 8413543.
## 9 pt05_Media pt05 Media 7300 M TRUE FALSE 15536685.
## 10 pt05_Mtb pt05 Mtb 7300 M TRUE FALSE 15509446.
## 11 pt06_Media pt06 Media 6570 F TRUE FALSE 7085995.
## 12 pt06_Mtb pt06 Mtb 6570 F TRUE FALSE 6588160.
## 13 pt07_Media pt07 Media 7665 F TRUE FALSE 10706098.
## 14 pt07_Mtb pt07 Mtb 7665 F TRUE FALSE 8576245.
## 15 pt08_Media pt08 Media 8760 M TRUE FALSE 9957906.
## 16 pt08_Mtb pt08 Mtb 8760 M TRUE FALSE 8220348.
## 17 pt09_Media pt09 Media 6935 M TRUE FALSE 13055276
## 18 pt09_Mtb pt09 Mtb 6935 M TRUE FALSE 13800442.
## 19 pt10_Media pt10 Media 8030 F TRUE FALSE 8216706.
## 20 pt10_Mtb pt10 Mtb 8030 F TRUE FALSE 7599609.
You can use filter
to remove/keep rows. Instead of
specifying a column name as in select
, you provide a
statement that gives TRUE/FALSE and filter
keeps all the
rows for which that statement is TRUE. The conditional
operators we saw in Intro R will help with this!
Let’s filter to just the Media samples.
filter(meta, condition == "Media")
## # A tibble: 10 × 9
## libID ptID condition age_dys sex ptID_old RNAseq methylation total_…¹
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Media pt01 Media 12410 M pt00001 TRUE FALSE 9.11e6
## 2 pt02_Media pt02 Media 12775 M pt00002 TRUE FALSE 9.22e6
## 3 pt03_Media pt03 Media 11315 M pt00003 TRUE FALSE 6.23e6
## 4 pt04_Media pt04 Media 8395 M pt00004 TRUE TRUE 1.02e7
## 5 pt05_Media pt05 Media 7300 M pt00005 TRUE FALSE 1.55e7
## 6 pt06_Media pt06 Media 6570 F pt00006 TRUE FALSE 7.09e6
## 7 pt07_Media pt07 Media 7665 F pt00007 TRUE FALSE 1.07e7
## 8 pt08_Media pt08 Media 8760 M pt00008 TRUE FALSE 9.96e6
## 9 pt09_Media pt09 Media 6935 M pt00009 TRUE FALSE 1.31e7
## 10 pt10_Media pt10 Media 8030 F pt00010 TRUE FALSE 8.22e6
## # … with abbreviated variable name ¹total_seq
This is the same as what we did in Intro R with the following, only
it is more readable and will fit into a tidyverse
workflow.
More on this when we get to pipes!
meta[meta$condition == "Media", ]
## # A tibble: 10 × 9
## libID ptID condition age_dys sex ptID_old RNAseq methylation total_…¹
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Media pt01 Media 12410 M pt00001 TRUE FALSE 9.11e6
## 2 pt02_Media pt02 Media 12775 M pt00002 TRUE FALSE 9.22e6
## 3 pt03_Media pt03 Media 11315 M pt00003 TRUE FALSE 6.23e6
## 4 pt04_Media pt04 Media 8395 M pt00004 TRUE TRUE 1.02e7
## 5 pt05_Media pt05 Media 7300 M pt00005 TRUE FALSE 1.55e7
## 6 pt06_Media pt06 Media 6570 F pt00006 TRUE FALSE 7.09e6
## 7 pt07_Media pt07 Media 7665 F pt00007 TRUE FALSE 1.07e7
## 8 pt08_Media pt08 Media 8760 M pt00008 TRUE FALSE 9.96e6
## 9 pt09_Media pt09 Media 6935 M pt00009 TRUE FALSE 1.31e7
## 10 pt10_Media pt10 Media 8030 F pt00010 TRUE FALSE 8.22e6
## # … with abbreviated variable name ¹total_seq
You can string together multiple statements and ask if both/all are TRUE with and (&) or if at least one is TRUE with or (|)
. Notice that the &
statement removes all
rows because no library has both a Media and Mtb condition. In contrast,
the |
statement keeps all rows because all libraries are
either the Media or Mtb condition.
filter(meta, condition == "Media" & condition == "Mtb")
## # A tibble: 0 × 9
## # … with 9 variables: libID <chr>, ptID <chr>, condition <chr>, age_dys <dbl>,
## # sex <chr>, ptID_old <chr>, RNAseq <lgl>, methylation <lgl>, total_seq <dbl>
filter(meta, condition == "Media" | condition == "Mtb")
## # A tibble: 20 × 9
## libID ptID condition age_dys sex ptID_old RNAseq methylation total_…¹
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Media pt01 Media 12410 M pt00001 TRUE FALSE 9.11e6
## 2 pt01_Mtb pt01 Mtb 12410 M pt00001 TRUE FALSE 8.92e6
## 3 pt02_Media pt02 Media 12775 M pt00002 TRUE FALSE 9.22e6
## 4 pt02_Mtb pt02 Mtb 12775 M pt00002 TRUE FALSE 7.73e6
## 5 pt03_Media pt03 Media 11315 M pt00003 TRUE FALSE 6.23e6
## 6 pt03_Mtb pt03 Mtb 11315 M pt00003 TRUE FALSE 7.11e6
## 7 pt04_Media pt04 Media 8395 M pt00004 TRUE TRUE 1.02e7
## 8 pt04_Mtb pt04 Mtb 8395 M pt00004 TRUE TRUE 8.41e6
## 9 pt05_Media pt05 Media 7300 M pt00005 TRUE FALSE 1.55e7
## 10 pt05_Mtb pt05 Mtb 7300 M pt00005 TRUE FALSE 1.55e7
## 11 pt06_Media pt06 Media 6570 F pt00006 TRUE FALSE 7.09e6
## 12 pt06_Mtb pt06 Mtb 6570 F pt00006 TRUE FALSE 6.59e6
## 13 pt07_Media pt07 Media 7665 F pt00007 TRUE FALSE 1.07e7
## 14 pt07_Mtb pt07 Mtb 7665 F pt00007 TRUE FALSE 8.58e6
## 15 pt08_Media pt08 Media 8760 M pt00008 TRUE FALSE 9.96e6
## 16 pt08_Mtb pt08 Mtb 8760 M pt00008 TRUE FALSE 8.22e6
## 17 pt09_Media pt09 Media 6935 M pt00009 TRUE FALSE 1.31e7
## 18 pt09_Mtb pt09 Mtb 6935 M pt00009 TRUE FALSE 1.38e7
## 19 pt10_Media pt10 Media 8030 F pt00010 TRUE FALSE 8.22e6
## 20 pt10_Mtb pt10 Mtb 8030 F pt00010 TRUE FALSE 7.60e6
## # … with abbreviated variable name ¹total_seq
Notice that we put the word “Media” inside quotes. This is because we are not using a column from inside our data frame. When you need to include actual text values in R, they will be placed inside quotes to tell them apart from other object or variable names.
The general rule is that if you want to use an R object (data frame, column, etc), then you supply the name without quotes. If you want to specify a value within the R object, then use quotes.
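As a minimal illustration of this rule:

# condition (no quotes) is a column in the meta data frame
# "Media" (quotes) is a text value stored inside that column
filter(meta, condition == "Media")
# Media without quotes would make R look for an object named Media, which does not exist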
Sometimes your variable names are not ideal. rename
does
just what it says; it renames columns in your data. The syntax for
rename
is to give the new name first and then tell R what
that new name should replace, so new_name = old_name
.
Here, we rename condition
to infection
as this
better represents those data.
rename(meta, infection = condition)
## # A tibble: 20 × 9
## libID ptID infection age_dys sex ptID_old RNAseq methylation total_…¹
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl>
## 1 pt01_Media pt01 Media 12410 M pt00001 TRUE FALSE 9.11e6
## 2 pt01_Mtb pt01 Mtb 12410 M pt00001 TRUE FALSE 8.92e6
## 3 pt02_Media pt02 Media 12775 M pt00002 TRUE FALSE 9.22e6
## 4 pt02_Mtb pt02 Mtb 12775 M pt00002 TRUE FALSE 7.73e6
## 5 pt03_Media pt03 Media 11315 M pt00003 TRUE FALSE 6.23e6
## 6 pt03_Mtb pt03 Mtb 11315 M pt00003 TRUE FALSE 7.11e6
## 7 pt04_Media pt04 Media 8395 M pt00004 TRUE TRUE 1.02e7
## 8 pt04_Mtb pt04 Mtb 8395 M pt00004 TRUE TRUE 8.41e6
## 9 pt05_Media pt05 Media 7300 M pt00005 TRUE FALSE 1.55e7
## 10 pt05_Mtb pt05 Mtb 7300 M pt00005 TRUE FALSE 1.55e7
## 11 pt06_Media pt06 Media 6570 F pt00006 TRUE FALSE 7.09e6
## 12 pt06_Mtb pt06 Mtb 6570 F pt00006 TRUE FALSE 6.59e6
## 13 pt07_Media pt07 Media 7665 F pt00007 TRUE FALSE 1.07e7
## 14 pt07_Mtb pt07 Mtb 7665 F pt00007 TRUE FALSE 8.58e6
## 15 pt08_Media pt08 Media 8760 M pt00008 TRUE FALSE 9.96e6
## 16 pt08_Mtb pt08 Mtb 8760 M pt00008 TRUE FALSE 8.22e6
## 17 pt09_Media pt09 Media 6935 M pt00009 TRUE FALSE 1.31e7
## 18 pt09_Mtb pt09 Mtb 6935 M pt00009 TRUE FALSE 1.38e7
## 19 pt10_Media pt10 Media 8030 F pt00010 TRUE FALSE 8.22e6
## 20 pt10_Mtb pt10 Mtb 8030 F pt00010 TRUE FALSE 7.60e6
## # … with abbreviated variable name ¹total_seq
The function summarize()
performs summary statistics on
your data. We can use it to find the mean, median, or other statistics
about the dataset. Let’s find the average age in days below:
summarize(meta, mean(age_dys))
## # A tibble: 1 × 1
## `mean(age_dys)`
## <dbl>
## 1 9016.
Notice that summarize
automatically names the column
with the summary statistic based on the function used to calculate that
statistic. This is not ideal, as the resulting name contains parentheses. So we
can specify a name instead.
summarize(meta, mean_age_dys = mean(age_dys))
## # A tibble: 1 × 1
## mean_age_dys
## <dbl>
## 1 9016.
Instead of including the data as an argument, we can use the pipe
operator %>%
to pass the data into the
summarize
function.
meta %>% summarize(mean_age_dys = mean(age_dys))
## # A tibble: 1 × 1
## mean_age_dys
## <dbl>
## 1 9016.
This line of code will do the exact same thing as our previous
summary command, but the piping function tells R to use the
meta
data frame as the first argument in the next function.
This lets us “chain” together multiple functions into a pipeline. The
pipe takes the output from the left side and uses it as input to the
right side.
We can also add an Enter (line break) after each pipe to make the code easier to read.
meta %>%
summarize(mean_age_days = mean(age_dys))
## # A tibble: 1 × 1
## mean_age_days
## <dbl>
## 1 9016.
Using the pipe operator %>% with line breaks makes our code more readable. The pipe operator %>% also helps us avoid nested functions like f(g(x)) and minimizes the need for intermediate variables.
Since we use the pipe operator so often, there is a keyboard shortcut for it in RStudio: Ctrl+Shift+M on Windows/Linux or Cmd+Shift+M on a Mac.
The pipe comes in handy with functions like group_by
. If
we want to calculate the mean age in days for males and females
separately, we can use group_by
before summarize:
meta %>%
group_by(sex) %>%
summarize(mean_age_days = mean(age_dys))
## # A tibble: 2 × 2
## sex mean_age_days
## <chr> <dbl>
## 1 F 7422.
## 2 M 9699.
And we could merge this with a filter to only look at the Media libraries, since each age is repeated for that patient’s Media and Mtb libraries. The results are the same in this case, but filtering would be important if any of our patients were missing one of the conditions.
meta %>%
filter(condition == "Media") %>%
group_by(sex) %>%
summarize(mean_age_days = mean(age_dys))
## # A tibble: 2 × 2
## sex mean_age_days
## <chr> <dbl>
## 1 F 7422.
## 2 M 9699.
Sometimes we would like to create a new column based on the values in an existing one. For example, age in days is a little difficult to interpret. Let’s add a column with age in years.
meta %>%
mutate(age_yrs = age_dys / 365.25)
## # A tibble: 20 × 10
## libID ptID condi…¹ age_dys sex ptID_…² RNAseq methy…³ total…⁴ age_yrs
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl> <dbl>
## 1 pt01_Media pt01 Media 12410 M pt00001 TRUE FALSE 9.11e6 34.0
## 2 pt01_Mtb pt01 Mtb 12410 M pt00001 TRUE FALSE 8.92e6 34.0
## 3 pt02_Media pt02 Media 12775 M pt00002 TRUE FALSE 9.22e6 35.0
## 4 pt02_Mtb pt02 Mtb 12775 M pt00002 TRUE FALSE 7.73e6 35.0
## 5 pt03_Media pt03 Media 11315 M pt00003 TRUE FALSE 6.23e6 31.0
## 6 pt03_Mtb pt03 Mtb 11315 M pt00003 TRUE FALSE 7.11e6 31.0
## 7 pt04_Media pt04 Media 8395 M pt00004 TRUE TRUE 1.02e7 23.0
## 8 pt04_Mtb pt04 Mtb 8395 M pt00004 TRUE TRUE 8.41e6 23.0
## 9 pt05_Media pt05 Media 7300 M pt00005 TRUE FALSE 1.55e7 20.0
## 10 pt05_Mtb pt05 Mtb 7300 M pt00005 TRUE FALSE 1.55e7 20.0
## 11 pt06_Media pt06 Media 6570 F pt00006 TRUE FALSE 7.09e6 18.0
## 12 pt06_Mtb pt06 Mtb 6570 F pt00006 TRUE FALSE 6.59e6 18.0
## 13 pt07_Media pt07 Media 7665 F pt00007 TRUE FALSE 1.07e7 21.0
## 14 pt07_Mtb pt07 Mtb 7665 F pt00007 TRUE FALSE 8.58e6 21.0
## 15 pt08_Media pt08 Media 8760 M pt00008 TRUE FALSE 9.96e6 24.0
## 16 pt08_Mtb pt08 Mtb 8760 M pt00008 TRUE FALSE 8.22e6 24.0
## 17 pt09_Media pt09 Media 6935 M pt00009 TRUE FALSE 1.31e7 19.0
## 18 pt09_Mtb pt09 Mtb 6935 M pt00009 TRUE FALSE 1.38e7 19.0
## 19 pt10_Media pt10 Media 8030 F pt00010 TRUE FALSE 8.22e6 22.0
## 20 pt10_Mtb pt10 Mtb 8030 F pt00010 TRUE FALSE 7.60e6 22.0
## # … with abbreviated variable names ¹condition, ²ptID_old, ³methylation,
## # ⁴total_seq
Note that this new column is not saved in meta because we printed the
result to the console and did not save it to the environment. We can
overwrite the meta
data frame so that it now contains the
age_yrs
column.
meta <- meta %>%
mutate(age_yrs = age_dys / 365.25)
meta
## # A tibble: 20 × 10
## libID ptID condi…¹ age_dys sex ptID_…² RNAseq methy…³ total…⁴ age_yrs
## <chr> <chr> <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl> <dbl>
## 1 pt01_Media pt01 Media 12410 M pt00001 TRUE FALSE 9.11e6 34.0
## 2 pt01_Mtb pt01 Mtb 12410 M pt00001 TRUE FALSE 8.92e6 34.0
## 3 pt02_Media pt02 Media 12775 M pt00002 TRUE FALSE 9.22e6 35.0
## 4 pt02_Mtb pt02 Mtb 12775 M pt00002 TRUE FALSE 7.73e6 35.0
## 5 pt03_Media pt03 Media 11315 M pt00003 TRUE FALSE 6.23e6 31.0
## 6 pt03_Mtb pt03 Mtb 11315 M pt00003 TRUE FALSE 7.11e6 31.0
## 7 pt04_Media pt04 Media 8395 M pt00004 TRUE TRUE 1.02e7 23.0
## 8 pt04_Mtb pt04 Mtb 8395 M pt00004 TRUE TRUE 8.41e6 23.0
## 9 pt05_Media pt05 Media 7300 M pt00005 TRUE FALSE 1.55e7 20.0
## 10 pt05_Mtb pt05 Mtb 7300 M pt00005 TRUE FALSE 1.55e7 20.0
## 11 pt06_Media pt06 Media 6570 F pt00006 TRUE FALSE 7.09e6 18.0
## 12 pt06_Mtb pt06 Mtb 6570 F pt00006 TRUE FALSE 6.59e6 18.0
## 13 pt07_Media pt07 Media 7665 F pt00007 TRUE FALSE 1.07e7 21.0
## 14 pt07_Mtb pt07 Mtb 7665 F pt00007 TRUE FALSE 8.58e6 21.0
## 15 pt08_Media pt08 Media 8760 M pt00008 TRUE FALSE 9.96e6 24.0
## 16 pt08_Mtb pt08 Mtb 8760 M pt00008 TRUE FALSE 8.22e6 24.0
## 17 pt09_Media pt09 Media 6935 M pt00009 TRUE FALSE 1.31e7 19.0
## 18 pt09_Mtb pt09 Mtb 6935 M pt00009 TRUE FALSE 1.38e7 19.0
## 19 pt10_Media pt10 Media 8030 F pt00010 TRUE FALSE 8.22e6 22.0
## 20 pt10_Mtb pt10 Mtb 8030 F pt00010 TRUE FALSE 7.60e6 22.0
## # … with abbreviated variable names ¹condition, ²ptID_old, ³methylation,
## # ⁴total_seq
- Create a new column total_seq_millions based on total_seq.
- What is the mean total_seq_millions for libraries where condition == "Media"?

tidyr
is another data manipulation package in
tidyverse
. It contains functions for tidying data and
generally acts on the entire data frame, as opposed to
dplyr
which most often works on a subset of columns or
rows.
Data comes in many shapes and sizes, and one way we classify data is
either “wide” or “long.” Data that is “long” has one row per
observation. The metadata is in a long format. We have one row for each
patient sample and each variable is in a different column. We might also
describe these data as “tidy” because it makes it easy to work with
ggplot2
and dplyr
functions (this is where the
“tidy” in “tidyverse” comes from). As tidy as it may be, sometimes we
may want our data in a “wide” format. Typically, in “wide” format, each
row represents a group of observations and each value is placed in a
different column rather than a different row. For example, maybe we want
each patient to have 1 row and their Media/Mtb data to be spread across
columns.
The tidyr
package in the tidyverse contains the
functions pivot_wider
and pivot_longer
that
make it easy to switch between the two formats.
For each patient, we have two samples: Media
and
Mtb
. In the metadata, the only difference in these two
conditions is the libID
column (which is redundant with the
ptID
and condition
) and the
total_seq
column. We can take the condition column and
create two new columns, one with the total seqs in the
Media
sample and one with the total seqs in the
Mtb
sample. This is called “pivoting” the data frame
“wider” because we rearrange it by creating an additional column and
making it have fewer rows. Let’s try it!
meta %>%
select(-libID) %>%
pivot_wider(names_from = condition, values_from = total_seq)
## # A tibble: 10 × 9
## ptID age_dys sex ptID_old RNAseq methylation age_yrs Media Mtb
## <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl> <dbl> <dbl>
## 1 pt01 12410 M pt00001 TRUE FALSE 34.0 9114402. 8918699.
## 2 pt02 12775 M pt00002 TRUE FALSE 35.0 9221555. 7733260.
## 3 pt03 11315 M pt00003 TRUE FALSE 31.0 6231728. 7105193.
## 4 pt04 8395 M pt00004 TRUE TRUE 23.0 10205557. 8413543.
## 5 pt05 7300 M pt00005 TRUE FALSE 20.0 15536685. 15509446.
## 6 pt06 6570 F pt00006 TRUE FALSE 18.0 7085995. 6588160.
## 7 pt07 7665 F pt00007 TRUE FALSE 21.0 10706098. 8576245.
## 8 pt08 8760 M pt00008 TRUE FALSE 24.0 9957906. 8220348.
## 9 pt09 6935 M pt00009 TRUE FALSE 19.0 13055276 13800442.
## 10 pt10 8030 F pt00010 TRUE FALSE 22.0 8216706. 7599609.
Notice here that we tell pivot_wider to pull the names of our new columns from the condition variable and the values to populate those columns from the total_seq variable. (Again, neither has to be in quotes in the code when there are no special characters or spaces - certainly an incentive not to use special characters or spaces!) We see that the resulting table has new columns by condition, populated with the total_seq values, with our remaining variables dictating the rows.
Maybe we should assign those columns more informative names.
meta %>%
select(-libID) %>%
pivot_wider(names_from = condition, values_from = total_seq,
names_prefix = "total_seq_")
## # A tibble: 10 × 9
## ptID age_dys sex ptID_old RNAseq methylation age_yrs total_seq_M…¹ total…²
## <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl> <dbl> <dbl>
## 1 pt01 12410 M pt00001 TRUE FALSE 34.0 9114402. 8.92e6
## 2 pt02 12775 M pt00002 TRUE FALSE 35.0 9221555. 7.73e6
## 3 pt03 11315 M pt00003 TRUE FALSE 31.0 6231728. 7.11e6
## 4 pt04 8395 M pt00004 TRUE TRUE 23.0 10205557. 8.41e6
## 5 pt05 7300 M pt00005 TRUE FALSE 20.0 15536685. 1.55e7
## 6 pt06 6570 F pt00006 TRUE FALSE 18.0 7085995. 6.59e6
## 7 pt07 7665 F pt00007 TRUE FALSE 21.0 10706098. 8.58e6
## 8 pt08 8760 M pt00008 TRUE FALSE 24.0 9957906. 8.22e6
## 9 pt09 6935 M pt00009 TRUE FALSE 19.0 13055276 1.38e7
## 10 pt10 8030 F pt00010 TRUE FALSE 22.0 8216706. 7.60e6
## # … with abbreviated variable names ¹total_seq_Media, ²total_seq_Mtb
And now let’s save the new wider metadata to a new object.
meta_wide <- meta %>%
select(-libID) %>%
pivot_wider(names_from = condition, values_from = total_seq)
Notice that the number of rows and columns has changed.
dim(meta)
## [1] 20 10
dim(meta_wide)
## [1] 10 9
Everything we did with pivot_wider
, we can reverse with
pivot_longer
. Maybe you receive a data frame in a wide
format, but you need it in a long format. Here’s how we can get back to
a long data frame.
meta_wide %>%
pivot_longer(c(Media, Mtb),
names_to = "condition", values_to = "total_seq")
## # A tibble: 20 × 9
## ptID age_dys sex ptID_old RNAseq methylation age_yrs condition total_seq
## <chr> <dbl> <chr> <chr> <lgl> <lgl> <dbl> <chr> <dbl>
## 1 pt01 12410 M pt00001 TRUE FALSE 34.0 Media 9114402.
## 2 pt01 12410 M pt00001 TRUE FALSE 34.0 Mtb 8918699.
## 3 pt02 12775 M pt00002 TRUE FALSE 35.0 Media 9221555.
## 4 pt02 12775 M pt00002 TRUE FALSE 35.0 Mtb 7733260.
## 5 pt03 11315 M pt00003 TRUE FALSE 31.0 Media 6231728.
## 6 pt03 11315 M pt00003 TRUE FALSE 31.0 Mtb 7105193.
## 7 pt04 8395 M pt00004 TRUE TRUE 23.0 Media 10205557.
## 8 pt04 8395 M pt00004 TRUE TRUE 23.0 Mtb 8413543.
## 9 pt05 7300 M pt00005 TRUE FALSE 20.0 Media 15536685.
## 10 pt05 7300 M pt00005 TRUE FALSE 20.0 Mtb 15509446.
## 11 pt06 6570 F pt00006 TRUE FALSE 18.0 Media 7085995.
## 12 pt06 6570 F pt00006 TRUE FALSE 18.0 Mtb 6588160.
## 13 pt07 7665 F pt00007 TRUE FALSE 21.0 Media 10706098.
## 14 pt07 7665 F pt00007 TRUE FALSE 21.0 Mtb 8576245.
## 15 pt08 8760 M pt00008 TRUE FALSE 24.0 Media 9957906.
## 16 pt08 8760 M pt00008 TRUE FALSE 24.0 Mtb 8220348.
## 17 pt09 6935 M pt00009 TRUE FALSE 19.0 Media 13055276
## 18 pt09 6935 M pt00009 TRUE FALSE 19.0 Mtb 13800442.
## 19 pt10 8030 F pt00010 TRUE FALSE 22.0 Media 8216706.
## 20 pt10 8030 F pt00010 TRUE FALSE 22.0 Mtb 7599609.
In the above case, the condition and total_seq names need to be in
quotes because they are not yet in our data object. They are instead
just characters used to name the new columns. If we forgot the quotes,
we’d get an error because R cannot find a condition
column
in meta_wide
as no such column exists.
meta_wide %>%
pivot_longer(c(Media, Mtb),
names_to = condition, values_to = "total_seq")
## Error in build_longer_spec(data, !!cols, names_to = names_to, values_to = values_to, : object 'condition' not found
colnames(meta_wide)
## [1] "ptID" "age_dys" "sex" "ptID_old" "RNAseq"
## [6] "methylation" "age_yrs" "Media" "Mtb"
Often not all your data is pre-compiled in one nice, tidy data frame.
You may have multiple files, batches, etc to work with.
dplyr
’s join functions allow us to combine data frames
without fear of copy-paste errors or missing anything!
So far, we’ve been working with just the metadata - a single data
frame. Let’s adventure into the full RNA-seq dat
object so
we have multiple data frames to combine.
load("data/dat_voom.RData")
First, we have the normalized log2 counts per million (CPM) in
E
dat$E[1:3,1:3]
## pt01_Media pt01_Mtb pt02_Media
## A1BG 1.620022 1.717603 1.086946
## A2M 7.259912 6.680758 5.939909
## A2ML1 -4.052404 -3.515057 -4.015712
Second, we have the same metadata as our meta
object in
the targets
data.
dat$targets[1:3,]
## group lib.size norm.factors libID ptID condition age_dys sex
## pt01_Media 1 8295928 1.092872 pt01_Media pt01 Media 12410 M
## pt01_Mtb 1 5716203 0.819490 pt01_Mtb pt01 Mtb 12410 M
## pt02_Media 1 8087601 1.119352 pt02_Media pt02 Media 12775 M
## ptID_old RNAseq methylation total_seq sample.weights
## pt01_Media pt00001 TRUE FALSE 9114402 0.8943797
## pt01_Mtb pt00001 TRUE FALSE 8918699 0.9850363
## pt02_Media pt00002 TRUE FALSE 9221555 1.2582092
Looking at these two data frames, they do not share any columns -
thus there is nothing to join them with. However, notice that the column
names in the counts data are the libIDs from the metadata! We can use
pivot_longer
to get the joining column we need.
Since E
is a matrix with rownames, we must force it to
be a data frame and move the rownames into a data column with some
additional functions.
E_long <- as.data.frame(dat$E) %>%
rownames_to_column("gene") %>%
pivot_longer(-gene, names_to = "libID", values_to = "log2CPM")
E_long
## # A tibble: 268,380 × 3
## gene libID log2CPM
## <chr> <chr> <dbl>
## 1 A1BG pt01_Media 1.62
## 2 A1BG pt01_Mtb 1.72
## 3 A1BG pt02_Media 1.09
## 4 A1BG pt02_Mtb 1.10
## 5 A1BG pt03_Media 0.874
## 6 A1BG pt03_Mtb 1.71
## 7 A1BG pt04_Media 1.57
## 8 A1BG pt04_Mtb -0.264
## 9 A1BG pt05_Media 1.63
## 10 A1BG pt05_Mtb -0.0305
## # … with 268,370 more rows
Now both data frames have a libID
column!
Let’s combine our sequence counts with our metadata. To combine two
data frames, we will use a join
function. The dplyr package
has a number of tools for joining data frames together depending on what
we want to do with the rows that are not
represented in both data frames.
?join
The most common joins are:
- inner_join: includes all rows present in BOTH data frames
- left_join: includes all rows in the first/left data frame. Removes rows ONLY present in the second/right data frame
- right_join: includes all rows in the second/right data frame. Removes rows ONLY present in the first/left data frame
- full_join: includes all rows present in EITHER data frame. Fills in rows only present in one data frame with NA in the other

Here, we use inner_join to keep only libraries with both expression data and metadata, which is actually all of them.
full_data <- inner_join(meta, E_long, by = 'libID')
I find this page helpful when working with joins. https://stat545.com/join-cheatsheet.html
- inner_join the long expression data E_long with metadata containing only Media samples.
- How would the result differ if you used full_join instead?

In this workshop, we introduce you to data visualization in
ggplot
(docs.ggplot2.org), the tidyverse package for plots. In
it, we cover how to create:
During the workshop, we will build an R script together, which will be posted as ‘live_notes’ after the workshop at https://github.com/BIGslu/workshops/tree/main/2022.08.15_R.tidyverse.workshop/live_notes
We load ggplot
within the tidyverse.
For
more information on installation, see the setup
instructions.
library(tidyverse)
ggplot2 is an implementation of the Grammar of Graphics (Wilkinson 1999) for R.
Benefits:
Drawbacks:
ggplot functions by adding layers to a plot, somewhat like how dplyr/tidyr sequentially modify data with the pipe %>%. Instead, ggplot uses the + to combine layers such as the data (a data.frame of variables), the aesthetics that map variables to plot features, and the geoms that display them.

The idea is to independently specify and combine layers to create the plot you want. At a minimum, we have to specify the data, the aesthetics, and a geom to create any plot.
A quick reminder about our data. The bulk RNA-seq data are contained in the dat object in the file “data/dat_voom.RData”. Normalized
log2 counts per million (CPM) are contained in dat$E
and
library metadata are in dat$targets
. You will not need the
other pieces of dat
for this workshop.
load("data/dat_voom.RData")
dat$E[1:3,1:3]
## pt01_Media pt01_Mtb pt02_Media
## A1BG 1.620022 1.717603 1.086946
## A2M 7.259912 6.680758 5.939909
## A2ML1 -4.052404 -3.515057 -4.015712
dat$targets[1:3,]
## group lib.size norm.factors libID ptID condition age_dys sex
## pt01_Media 1 8295928 1.092872 pt01_Media pt01 Media 12410 M
## pt01_Mtb 1 5716203 0.819490 pt01_Mtb pt01 Mtb 12410 M
## pt02_Media 1 8087601 1.119352 pt02_Media pt02 Media 12775 M
## ptID_old RNAseq methylation total_seq sample.weights
## pt01_Media pt00001 TRUE FALSE 9114402 0.8943797
## pt01_Mtb pt00001 TRUE FALSE 8918699 0.9850363
## pt02_Media pt00002 TRUE FALSE 9221555 1.2582092
Here is a brief description of the variables in the metadata.
- libID (character): Unique library ID for each sequencing library. Formatted as ptID_condition
- ptID (character): Patient ID. Each patient has 2 libraries which differ in their condition
- condition (character): Experimental condition. Media for media control. Mtb for M. tuberculosis infected
- age_dys (numeric): Age in days
- sex (character): Biological sex, M or F
- ptID_old (character): Old patient ID with leading zeroes
- RNAseq (logical): If the library has pass-filter RNA-seq data
- methylation (logical): If the library has pass-filter methylation data
- total_seq (numeric): Total number of pass-filter sequences

We will start with a boxplot of the total sequences per library. First, we call a ggplot object. Since we’ve given it no data or aesthetics, this is a blank plot.
ggplot()
Then, we give the plot data and specify aesthetics for which parts of
the data we want to plot. Now, we see that ggplot knows what the scale
of the y data will be. Throughout this section, we will highlight lines
of code that change with the comment
#<--- see change
.
ggplot(dat$targets) +
aes(y = total_seq) #<--- see change
Finally, we add a geom to denote how we want the data displayed, in this case as a boxplot.
ggplot(dat$targets) +
aes(y = total_seq) +
geom_boxplot() #<--- see change
Since we did not specify an x-variable, all the data are plotted in one boxplot. We add an aesthetic to specify x and split up the data.
ggplot(dat$targets) +
aes(y = total_seq, x = condition) + #<--- see change
geom_boxplot()
Our x-variable happens to be in the correct alphabetical order with
the Media
controls before Mtb
infection.
However, alphabetical order does not always give you what you want. You can
force a variable’s order by converting it to a factor. Let’s switch the
condition
order.
dat$targets %>%
mutate(condition_fct = factor(condition,
levels = c("Mtb", "Media"))) %>% #<--- see change
ggplot() +
aes(y = total_seq, x = condition_fct) +
geom_boxplot()
Notice how we now pipe %>%
the data into ggplot so
that we can modify it in dplyr
first.
These same data could be displayed as a dotplot instead. We return to
the original condition
order since it was correct.
ggplot(dat$targets) +
aes(y = total_seq, x = condition) +
geom_point() #<--- see change
And if we want to make sure we see every point, a jitter works nicely.
ggplot(dat$targets) +
aes(y = total_seq, x = condition) +
geom_jitter() #<--- see change
The building block nature of ggplot allows us to add multiple geoms on top of each other. So we can combine the two previous plots like so
ggplot(dat$targets) +
aes(y = total_seq, x = condition) +
geom_boxplot() + #<--- see change
geom_jitter() #<--- see change
But you may notice that there are some duplicate points. Looking back at the first boxplots, there are points plotted for data outside the box (outliers). These outlier points remain in the plot above and also appear jittered from the second geom. We can fix this by telling the boxplot not to plot the outliers.
ggplot(dat$targets) +
aes(y = total_seq, x = condition) +
geom_boxplot(outlier.shape = NA) + #<--- see change
geom_jitter()
Because ggplot works in layers, the order matters. If, for example,
we put the boxplot after the jitter, the boxes cover up points! We use
alpha
for transparency so you can see these points. Thus,
be careful when considering how you set up your plot!
ggplot(dat$targets) +
aes(y = total_seq, x = condition) +
geom_jitter() + #<--- see change
geom_boxplot(outlier.shape = NA, alpha = 0.8) #<--- see change
With jitter, you are adding random error to be able to see all the points. Because this is truly random, every time you run jitter, you will get a slightly different plot. To combat this, set a seed before making the plot.
set.seed(468) #<--- see change
ggplot(dat$targets) +
aes(y = total_seq, x = condition) +
geom_boxplot(outlier.shape = NA) +
geom_jitter()
set.seed(468) #<--- see change
ggplot(dat$targets) +
aes(y = total_seq, x = condition) +
geom_boxplot(outlier.shape = NA) +
geom_jitter()
One dotplot of interest in RNA-seq analysis is principal component analysis (PCA). In this plot, each RNA-seq library is a single point. The axes are unitless but in general, points that are closer together represent libraries with more similar gene expression. You can see large-scale changes in gene expression in PCA. However, even when you don’t see clear PCA groupings, there can be significant differentially expressed genes. So don’t despair if your data don’t cluster in this plot!
Here, we calculate PCA from the normalized counts. Note that we
transpose t()
the counts data to get one PCA point per
library, instead of one per gene.
dat.pca <- prcomp(t(dat$E), scale.=TRUE, center=TRUE)
dat.pca$x[1:3,1:3]
## PC1 PC2 PC3
## pt01_Media -103.91401 -33.55704 22.40058
## pt01_Mtb 52.95006 -74.68146 24.78278
## pt02_Media -60.61507 27.13460 24.58148
Then we use join to add metadata to the PCA data, so we can annotate the PCA plot with variables of interest.
dat.pca.meta <- as.data.frame(dat.pca$x) %>%
rownames_to_column("libID") %>%
inner_join(dat$targets)
## Joining, by = "libID"
A nice trick with joins is to not specify what to join
by
. In this case, the function finds all columns shared
between the data frames and uses them to match rows.
With geom_point
, we have a PCA!
ggplot(dat.pca.meta) +
aes(x = PC1, y = PC2) +
geom_point()
But it’s not very informative, so let’s annotate with our condition groups. In general, you want to use annotations in the following order: space, color, shape, size. These are ordered by how easily the human eye can discern differences.
We already have space as libraries are separated along x and y. Now we add color.
ggplot(dat.pca.meta) +
aes(x = PC1, y = PC2, color = condition) + #<--- see change
geom_point()
I like to keep all my aesthetics in one layer, but you could keep
adding aes()
functions to the plot if you prefer. This
gives the same plot.
ggplot(dat.pca.meta) +
aes(x = PC1, y = PC2) + #<--- see change
geom_point() +
aes(color = condition) #<--- see change
We could also use shape for the same data. Which plot do you think is easier to interpret?
ggplot(dat.pca.meta) +
aes(x = PC1, y = PC2, shape = condition) + #<--- see change
geom_point()
We’re not showing size or transparency because these aesthetics are difficult to see and should be avoided unless you really need them or the differences between data points are large enough to see in these aesthetics.
We can also improve our axes and legend labels. First, we see what proportion of variation is explained by each PC.
summary(dat.pca)$importance
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 57.21487 38.99933 35.32881 34.21306 29.76604 27.13699
## Proportion of Variance 0.24395 0.11334 0.09301 0.08723 0.06603 0.05488
## Cumulative Proportion 0.24395 0.35729 0.45030 0.53753 0.60356 0.65844
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 24.48881 23.98672 21.83296 21.23941 19.84912 18.33097
## Proportion of Variance 0.04469 0.04288 0.03552 0.03362 0.02936 0.02504
## Cumulative Proportion 0.70313 0.74601 0.78153 0.81515 0.84451 0.86955
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 18.05337 17.40958 17.13667 15.67883 14.48455 14.12701
## Proportion of Variance 0.02429 0.02259 0.02188 0.01832 0.01563 0.01487
## Cumulative Proportion 0.89384 0.91642 0.93831 0.95663 0.97226 0.98713
## PC19 PC20
## Standard deviation 13.14034 1.137607e-13
## Proportion of Variance 0.01287 0.000000e+00
## Cumulative Proportion 1.00000 1.000000e+00
And use that for labels as well as change the legend.
ggplot(dat.pca.meta) +
aes(x = PC1, y = PC2, color = condition) +
geom_point() +
labs(x = "PC1 (24%)", y = "PC2 (11%)", #<--- see change
color = "Infection")
Since PCA units are relative, it is nice to have the x and y scales be the same. That is, 1 inch of space corresponds to the same values on both axes. This is accomplished with coord_fixed. Note that in this PCA, the function does not change the plot much, but it can have a big effect in other PCA! There are also several other coordinate systems you can see by typing coord_ in the console.
ggplot(dat.pca.meta) +
aes(x = PC1, y = PC2, color = condition) +
geom_point() +
labs(x = "PC1 (24%)", y = "PC2 (11%)",
color = "Infection") +
coord_fixed() #<--- see change
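As one example of another coordinate system, coord_flip() swaps the x and y axes. Applied to the jittered boxplot from earlier (here using dat.pca.meta, which also carries total_seq and condition), it turns the vertical boxes horizontal.
ggplot(dat.pca.meta) +
  aes(y = total_seq, x = condition) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter() +
  coord_flip() # swap the x and y axes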
ggplot2 has a number of themes that change the overall look of the plot. I like theme_classic for PCA. Type theme_ into the console to see the available themes suggested by auto-complete. You can also check out more themes in the package ggthemes.
ggplot(dat.pca.meta) +
aes(x = PC1, y = PC2, color = condition) +
geom_point() +
labs(x = "PC1 (24%)", y = "PC2 (11%)",
color = "Infection") +
coord_fixed() +
theme_classic() #<--- see change
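For instance, assuming you have ggthemes installed (install.packages("ggthemes")), you could swap in one of its themes such as theme_few().
library(ggthemes)

ggplot(dat.pca.meta) +
  aes(x = PC1, y = PC2, color = condition) +
  geom_point() +
  labs(x = "PC1 (24%)", y = "PC2 (11%)",
       color = "Infection") +
  coord_fixed() +
  theme_few() # a clean theme from the ggthemes package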
Let's move on to another application of geom_point: a volcano plot, which shows differentially expressed gene significance and fold change.
We'll need to perform linear modeling to determine these results. Below is the workflow for comparing Mtb-infected vs media samples, corrected for sex and blocked by patient. We use our R package kimma. You DO NOT need to run these steps. They are here in case you want to run a similar analysis in the future.
install.packages("devtools")
devtools::install_github("BIGslu/kimma")
library(kimma)
fit <- kmFit(dat, model = "~ condition + sex + (1|ptID)",
run.lme = TRUE, use.weights = TRUE)
lme/lmerel model: expression~condition+sex+(1|ptID)
Input: 20 libraries from 10 unique patients
Model: 20 libraries
Complete: 13419 genes
Failed: 0 genes
save(fit, file="data/kimma_result.RData")
Instead, load the results from the data downloaded for this workshop. Note that this is an S3 object similar to the dat object. In it, we have the linear mixed effects (lme) model results in fit$lme, which include our log2 fold change estimate and significance FDR for each gene and each variable.
load("data/kimma_result.RData")
fit$lme
## # A tibble: 40,257 × 7
## model gene variable estimate pval FDR gene_biotype
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 lme A1BG condition -0.313 0.264 0.347 protein_coding
## 2 lme A1BG sex 0.0796 0.814 0.954 protein_coding
## 3 lme A1BG (1 | ptID) 0 1 1 protein_coding
## 4 lme A2M condition -0.0940 0.605 0.682 protein_coding
## 5 lme A2M sex 1.40 0.0140 0.158 protein_coding
## 6 lme A2M (1 | ptID) 7.79 0.00525 0.0574 protein_coding
## 7 lme A2ML1 condition 0.791 0.0191 0.0356 protein_coding
## 8 lme A2ML1 sex -2.53 0.00000156 0.000388 protein_coding
## 9 lme A2ML1 (1 | ptID) 1.39 0.239 0.498 protein_coding
## 10 lme A4GALT condition 2.66 0.0000318 0.000108 protein_coding
## # … with 40,247 more rows
Let's start with a basic dotplot like before. We first filter the fit data to just the variable condition so that we have 1 point per gene.
fit$lme %>%
filter(variable == "condition") %>%
ggplot() +
aes(x = estimate, y = -log10(FDR)) +
geom_point()
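Before adding color, it can be handy to know how many genes pass a cutoff. A quick count, using the FDR < 0.01 cutoff we apply below; the column name up is ours.
fit$lme %>%
  filter(variable == "condition", FDR < 0.01) %>% # keep significant condition results
  count(up = estimate > 0)                        # tally up- vs down-regulated genes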
Now let’s color the significant points. There is no “significant” variable in the data frame, so we first make one with mutate.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 ~ "FDR < 0.01", #<--- see change
TRUE ~ "NS")) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR), color = significance) + #<--- see change
geom_point()
Or go further and color up vs down regulated genes differently.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down", #<--- see change
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR), color = significance) +
geom_point()
And we can specify the colors we want with another layer.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR), color = significance) +
geom_point() +
scale_color_manual(values = c("blue", "grey", "red")) #<--- see change
Since we're defining significance at an FDR cutoff, we can plot that cutoff as a horizontal line with geom_hline. Similar geoms are geom_vline for a vertical line and geom_abline for a line with slope a and intercept b.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR), color = significance) +
geom_point() +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) #<--- see change
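If you also wanted fold change cutoff lines, geom_vline works the same way on the x axis. A sketch adding dashed lines at an estimate of ±1; this cutoff is just for illustration, not one used elsewhere in the workshop.
fit$lme %>%
  filter(variable == "condition") %>%
  ggplot() +
  aes(x = estimate, y = -log10(FDR)) +
  geom_point() +
  geom_hline(yintercept = -log10(0.01)) +                # FDR cutoff
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") # illustrative fold change cutoffs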
You might also wish to center the 0 estimate or otherwise alter the plot's limits.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR), color = significance) +
geom_point() +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
lims(x = c(-9, 9)) #<--- see change
If you accidentally (or purposefully) remove data points with your limits, R will warn you!
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR), color = significance) +
geom_point() +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
lims(x = c(-3, 3)) #<--- see change
## Warning: Removed 165 rows containing missing values (geom_point).
We’re going to use a ggplot extension package to label our most significant genes. Let’s load that package now.
library(ggrepel)
First, we determine our top 2 up- and down-regulated genes. There are some new dplyr functions in here! See the comments for information.
top_deg <- fit$lme %>%
filter(variable == "condition") %>%
mutate(direction = case_when(estimate < 0 ~ "down",
estimate > 0 ~ "up")) %>%
group_by(direction) %>%
slice_min(FDR, n=2) %>% # Keep the n rows with the smallest FDR per group
pull(gene) # Extract the gene column as a vector
top_deg
## [1] "IFNAR2" "NAIP" "TERF2IP" "ZBTB21"
Then we add those labels to our fit data and use that variable in geom_text_repel. In this case, the data removal warning is expected, since we're not labeling the majority of points.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
mutate(label = case_when(gene %in% top_deg ~ gene)) %>% #<--- see change
ggplot() +
aes(x = estimate, y = -log10(FDR), color = significance) +
geom_point() +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
lims(x = c(-9, 9)) +
geom_text_repel(aes(label = label)) #<--- see change
## Warning: Removed 13415 rows containing missing values (geom_text_repel).
The text legend gets overlaid on the point legend, which looks odd. Let's remove one of them.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
mutate(label = case_when(gene %in% top_deg ~ gene)) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR), color = significance) +
geom_point() +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
lims(x = c(-9, 9)) +
geom_text_repel(aes(label = label), show.legend = FALSE) #<--- see change
## Warning: Removed 13415 rows containing missing values (geom_text_repel).
And here's a small edit if you don't want the text to be colored. When you put an aesthetic in a specific geom, it applies only to that one geom.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
mutate(label = case_when(gene %in% top_deg ~ gene)) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR)) + #<--- see change
geom_point(aes(color = significance)) + # <--- see change
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
lims(x = c(-9, 9)) +
geom_text_repel(aes(label = label), show.legend = FALSE)
## Warning: Removed 13415 rows containing missing values (geom_text_repel).
To round it out, let's look at theme_bw, my favorite theme for volcano plots.
fit$lme %>%
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
mutate(label = case_when(gene %in% top_deg ~ gene)) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR)) +
geom_point(aes(color = significance)) +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
lims(x = c(-9, 9)) +
geom_text_repel(aes(label = label)) +
theme_bw() #<--- see change
## Warning: Removed 13415 rows containing missing values (geom_text_repel).
There you have it! Did you think you'd reach an 11-function plot so soon?
With big data, we often want to plot more than one thing at once. Facets allow you to make similar plots across multiple subsets of the data with only 1 additional line of code. Here, we filter for 2 variables and then facet them into 2 plots.
fit$lme %>%
filter(variable %in% c("condition", "sex")) %>% #<--- see change
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
mutate(label = case_when(gene %in% top_deg ~ gene)) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR)) +
geom_point(aes(color = significance)) +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
# <--- see change: lims() removed
geom_text_repel(aes(label = label)) +
theme_bw() +
facet_wrap(~ variable) #<--- see change
## Warning: Removed 26830 rows containing missing values (geom_text_repel).
Since the x and y scales differ for the two variables, we allow them to vary with scales. This is why we removed the lims function earlier. scales can be set to "free_x", "free_y", or "free" for both x and y.
fit$lme %>%
filter(variable %in% c("condition", "sex")) %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
mutate(label = case_when(gene %in% top_deg ~ gene)) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR)) +
geom_point(aes(color = significance)) +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
geom_text_repel(aes(label = label)) +
theme_bw() +
facet_wrap(~ variable, scales = "free") #<--- see change
## Warning: Removed 26830 rows containing missing values (geom_text_repel).
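facet_wrap also takes nrow and ncol arguments if you want to control the layout, for example stacking the two panels vertically (following the document's ... + shorthand for the plot built above).
... +
  facet_wrap(~ variable, scales = "free", ncol = 1) # one column of facets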
Once you have your plot as you desire, save it to your computer for use in presentations, publications, Twitter, and more! ggsave allows you to save in a bunch of formats (see the help page with ?ggsave). Since we've only been printing our plots to the plot pane in RStudio, we need to first save a plot to the environment. We'll use the single condition variable volcano plot.
p1 <- fit$lme %>% #<--- see change
filter(variable == "condition") %>%
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
mutate(label = case_when(gene %in% top_deg ~ gene)) %>%
ggplot() +
aes(x = estimate, y = -log10(FDR)) +
geom_point(aes(color = significance)) +
scale_color_manual(values = c("blue", "grey", "red")) +
geom_hline(yintercept = -log10(0.01)) +
geom_text_repel(aes(label = label)) +
theme_bw()
p1
## Warning: Removed 13415 rows containing missing values (geom_text_repel).
Now we can save the object. My most commonly used plot formats are .pdf for publication (vectorized PDF that can be edited in Illustrator/Inkscape) and .png for all other uses (not vectorized, smaller files). You specify the file type in the filename and give dimensions in inches (the default). Without dimensions, R uses whatever the ratio of your plot pane currently is (not reproducible!).
ggsave(filename = "images/volcano_plot.png", plot = p1, width = 3, height = 4)
## Warning: Removed 13415 rows containing missing values (geom_text_repel).
ggsave(filename = "images/volcano_plot.pdf", plot = p1, width = 3, height = 4)
## Warning: Removed 13415 rows containing missing values (geom_text_repel).
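ggsave also has units and dpi arguments if you need specific dimensions or resolution, for example to meet a journal's figure requirements; the values below are arbitrary.
ggsave(filename = "images/volcano_plot.png", plot = p1,
       width = 10, height = 12, units = "cm", dpi = 300) # centimeter dimensions at 300 dpi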
Here are the final plots, fully commented, for your reference.
ggplot(dat.pca.meta) + # Make ggplot from data in dat.pca.meta
aes(x = PC1, y = PC2, # Plot PCs 1 and 2 along x and y
color = condition) + # Color by condition
geom_point() + # Plot points for each value
labs(x = "PC1 (24%)", y = "PC2 (11%)", # Relabel axes
color = "Infection") + # Relabel legend
coord_fixed() + # Force xy scales to be equal
theme_classic() # Set theme
fit$lme %>%
# Keep results for only the condition variable
filter(variable == "condition") %>%
# Create variable for significant up and down genes
mutate(significance = case_when(FDR < 0.01 & estimate < 0 ~ "down",
FDR < 0.01 & estimate > 0 ~ "up",
TRUE ~ "NS")) %>%
# Create variable for gene name labels
mutate(label = case_when(gene %in% top_deg ~ gene)) %>%
ggplot() + # initiate ggplot plot
aes(x = estimate, y = -log10(FDR)) + # plot estimate and FDR
geom_point(aes(color = significance)) + # plot points, color by significance
scale_color_manual(values = c("blue", "grey", "red")) + # recolor points
geom_hline(yintercept = -log10(0.01)) + # add FDR cutoff line
lims(x = c(-9, 9)) + # set axis limits
geom_text_repel(aes(label = label)) + # label points
theme_bw() # set theme
... +
facet_wrap(~ variable) # facet by variable
Media vs Mtb. The plot type is up to you! As a reminder, here is how we made the combined data.
full_data <- as.data.frame(dat$E) %>%
  rownames_to_column("gene") %>%
  pivot_longer(-gene, names_to = "libID", values_to = "log2CPM") %>%
  inner_join(dat$targets)
## Joining, by = "libID"
full_data
## # A tibble: 268,380 × 15
## gene libID log2CPM group lib.s…¹ norm.…² ptID condi…³ age_dys sex ptID_…⁴
## <chr> <chr> <dbl> <fct> <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 A1BG pt01… 1.62 1 8.30e6 1.09 pt01 Media 12410 M pt00001
## 2 A1BG pt01… 1.72 1 5.72e6 0.819 pt01 Mtb 12410 M pt00001
## 3 A1BG pt02… 1.09 1 8.09e6 1.12 pt02 Media 12775 M pt00002
## 4 A1BG pt02… 1.10 1 5.28e6 0.912 pt02 Mtb 12775 M pt00002
## 5 A1BG pt03… 0.874 1 5.18e6 1.05 pt03 Media 11315 M pt00003
## 6 A1BG pt03… 1.71 1 5.23e6 0.958 pt03 Mtb 11315 M pt00003
## 7 A1BG pt04… 1.57 1 9.00e6 1.13 pt04 Media 8395 M pt00004
## 8 A1BG pt04… -0.264 1 5.40e6 0.855 pt04 Mtb 8395 M pt00004
## 9 A1BG pt05… 1.63 1 1.26e7 1.07 pt05 Media 7300 M pt00005
## 10 A1BG pt05… -0.0305 1 1.01e7 0.930 pt05 Mtb 7300 M pt00005
## # … with 268,370 more rows, 4 more variables: RNAseq <lgl>, methylation <lgl>,
## # total_seq <dbl>, sample.weights <dbl>, and abbreviated variable names
## # ¹lib.size, ²norm.factors, ³condition, ⁴ptID_old
total_seq. Explore the help pages for scale_color_continuous, scale_color_gradient, or scale_color_gradient2 and recolor your PCA to more easily see differences. As a reminder, here is the base PCA plot.
ggplot(dat.pca.meta) +
  aes(x = PC1, y = PC2) +
  geom_point() +
  labs(x = "PC1 (24%)", y = "PC2 (11%)",
       color = "Infection") +
  theme_classic()
Using geom_text_repel, label the points in the PCA by their libID.