This document covers our recommended data cleaning pipeline for RNA-seq gene count tables. This pipeline includes quality assessment, filtering, and normalization of count tables generated using our SEAsnake pipeline. The example data are human, bulk, paired-end RNA-seq, but this pipeline can be applied to other organisms or single-read libraries.
This pipeline should be completed in R and RStudio. You should also install the following packages.
#CRAN packages
install.packages(c("tidyverse", "ggrepel", "scales"))
#Bioconductor packages
install.packages("BiocManager")
BiocManager::install(c("edgeR", "limma", "biomaRt", "patchwork"))
#GitHub packages
install.packages("devtools")
devtools::install_github("zhangyuqing/sva-devel")
devtools::install_github("BIGslu/BIGverse")
#Or if fails, install packages individually
devtools::install_github("BIGslu/kimma")
devtools::install_github("BIGslu/BIGpicture")
devtools::install_github("BIGslu/RNAetc")
devtools::install_github("BIGslu/SEARchways")
And load them into your current R session.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(BIGverse)
## ── Attaching packages ──────────────────────────────────────── BIGverse 1.0.0 ──
## ✔ kimma 1.5.0 ✔ BIGpicture 1.1.0
## ✔ RNAetc 1.1.0 ✔ SEARchways 1.1.0
## ── Conflicts ─────────────────────────────────────────── BIGverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(sva)
## Loading required package: mgcv
## Loading required package: nlme
##
## Attaching package: 'nlme'
##
## The following object is masked from 'package:dplyr':
##
## collapse
##
## This is mgcv 1.8-42. For overview type 'help("mgcv-package")'.
## Loading required package: genefilter
##
## Attaching package: 'genefilter'
##
## The following object is masked from 'package:readr':
##
## spec
##
## Loading required package: BiocParallel
library(ggrepel)
library(patchwork)
##
## Attaching package: 'patchwork'
##
## The following object is masked from 'package:genefilter':
##
## area
library(limma)
library(edgeR)
#Note we do not load biomaRt because it has conflicts with the tidyverse.
# We instead call its functions with biomaRt::
set.seed(651)
Example data were obtained from virus-stimulated human plasmacytoid dendritic cells. Actual patient identifiers and metadata have been altered for this tutorial.
Dill-McFarland KA, Schwartz JT, Zhao H, Shao B, Fulkerson PC, Altman MC, Gill MA. 2022. Eosinophil-mediated suppression and Anti-IL-5 enhancement of plasmacytoid dendritic cell interferon responses in asthma. J Allergy Clin Immunol. Epub ahead of print. doi: 10.1016/j.jaci.2022.03.025. — GitHub
Specifically, this tutorial uses RNAseq data processed using our SEAsnake
pipeline. These data are available in kimma
and names
example.seasnake
.
Create a directory for the outputs of this tutorial.
dir.create("data_clean", showWarnings = FALSE)
The SEAsnake pipeline outputs library quality metrics from samtools
flagstat
and picard
. For more details, see flagstat and
Picard RnaSeqMetrics.
Let’s load in the metrics files.
flagstat <- kimma::example.seasnake$flagstat
picard <- kimma::example.seasnake$picard
Next, we read in and combine any patient and/or sample metadata. You may have only one of these data types. In general, patient data include things like age, sex, and any variable that is assigned to each person in the study. In contrast, sample data are variables assigned to each library, like experimental treatment, quality, etc. We combine these along with the above metrics tables to make a single data frame with all metadata.
patient <- kimma::example.seasnake$patient %>%
mutate(asthma = fct_relevel(factor(asthma), "healthy", after = 0))
sample <- kimma::example.seasnake$sample %>%
#format any variables that need to be factors
#Or any other additional formatting that you desire on your data
mutate(virus = fct_relevel(factor(virus), "none", after = 0)) %>%
mutate(batch = factor(batch))
meta <- full_join(sample, patient, by = "ptID") %>%
full_join(flagstat, by = "libID") %>%
full_join(picard, by = "libID")
This provides the following metrics and metadata for our libraries.
## [1] "libID" "ptID"
## [3] "virus" "batch"
## [5] "asthma" "age"
## [7] "sex" "QC_pass"
## [9] "primary" "secondary"
## [11] "supplementary" "duplicate"
## [13] "primary_duplicate" "mapped"
## [15] "primary_mapped" "paired"
## [17] "read1" "read2"
## [19] "paired_proper" "paired_mapped"
## [21] "singleton" "paired_diff_chr"
## [23] "paired_diff_chr5" "PF_BASES"
## [25] "PF_ALIGNED_BASES" "RIBOSOMAL_BASES"
## [27] "CODING_BASES" "UTR_BASES"
## [29] "INTRONIC_BASES" "INTERGENIC_BASES"
## [31] "IGNORED_READS" "CORRECT_STRAND_READS"
## [33] "INCORRECT_STRAND_READS" "NUM_R1_TRANSCRIPT_STRAND_READS"
## [35] "NUM_R2_TRANSCRIPT_STRAND_READS" "NUM_UNEXPLAINED_READS"
## [37] "PCT_R1_TRANSCRIPT_STRAND_READS" "PCT_R2_TRANSCRIPT_STRAND_READS"
## [39] "PCT_RIBOSOMAL_BASES" "PCT_CODING_BASES"
## [41] "PCT_UTR_BASES" "PCT_INTRONIC_BASES"
## [43] "PCT_INTERGENIC_BASES" "PCT_MRNA_BASES"
## [45] "PCT_USABLE_BASES" "PCT_CORRECT_STRAND_READS"
## [47] "MEDIAN_CV_COVERAGE" "MEDIAN_5PRIME_BIAS"
## [49] "MEDIAN_3PRIME_BIAS" "MEDIAN_5PRIME_TO_3PRIME_BIAS"
## [51] "SAMPLE" "LIBRARY"
## [53] "READ_GROUP"
Output by Subread featureCounts
in SEAsnake, the counts
table can be read directly into R.
count <- kimma::example.seasnake$fcounts
count[1:3,1:5]
## # A tibble: 3 × 5
## Geneid lib1 lib2 lib3 lib4
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG00000284662 0 0 0 0
## 2 ENSG00000186827 240 126 595 180
## 3 ENSG00000186891 185 50 255 37
We assess sample quality using several metrics from samtools
flagstat
and picard
. Our standard assessment
includes:
QC_pass
: total sequences passing quality control (QC),
adapter and primer removal with AdapterRemoval
fastq_total_reads
in BRI processed datapaired_mapped
: total QC pass sequences where both read
1 and 2 aligned to the genome. If you had non-paired-end data, you would
use mapped
pct_align
: percent alignment
i.e. paired_mapped
/QC_pass
mapped_reads_pct
in BRI processed dataMEDIAN_CV_COVERAGE
: median coefficent of variation of
coverage i.e. how variable sequence coverage is across the genome
median_cv_coverage
in BRI processed dataIdeal libraries have high total sequences, high percent alignment, and low CV coverage. Cutoffs for sample removal will vary by data set but our starting recommendations are to remove libraries with:
Set your cutoffs here.
seq_cutoff <- 1E6
cv_cutoff <- 1
align_cutoff <- 75
For the example data, we see that we do not need to remove any libraries due to total sequences.
ggplot(meta, aes(x = reorder(libID, QC_pass), y = QC_pass)) +
geom_col() +
#Add cutoff line
geom_hline(yintercept = seq_cutoff) +
#Log scale y-axis
scale_y_continuous(trans = 'log10',
breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) +
#Beautify
theme_classic() +
labs(x = "Library", y = "Pass-filter sequences (log scale)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
However, there are two libraries with median CV coverage near our
cutoff. Note that we use geom_text_repel
to label these
libraries.
#Set x-axis max to the larger of 1 or the highest data value
CV_max <- max(1, max(meta$MEDIAN_CV_COVERAGE))
#Set cutoffs to label with libID
cv_cutoff_label <- 0.9
align_cutoff_label <- 75
ggplot(meta, aes(x = MEDIAN_CV_COVERAGE, y = paired_mapped/QC_pass*100)) +
geom_point() +
#Rescale axis limits
lims(x = c(0,CV_max), y=c(0,100)) +
#Add cutoff lines
geom_hline(yintercept = align_cutoff, lty="dashed") +
geom_vline(xintercept = cv_cutoff, lty="dashed") +
#Label points outside cutoffs
geom_text_repel(data=filter(meta, MEDIAN_CV_COVERAGE > cv_cutoff_label |
paired_mapped/QC_pass*100 < align_cutoff_label),
aes(label=libID), show.legend = FALSE, max.overlaps = Inf,
box.padding = 1) +
#Beautify
theme_classic() +
labs(x = "Median CV coverage", y="Percent alignment")
We now filter by our cutoffs. In this case, this leaves the questionable libraries (as we will see why they differ in a later step).
meta.filter <- meta %>%
filter(MEDIAN_CV_COVERAGE < cv_cutoff & QC_pass > seq_cutoff &
paired_mapped/QC_pass*100 > align_cutoff)
count.filter <- count %>%
select(1, all_of(meta.filter$libID))
This removes 0 libraries. However, you may choose to remove low-quality libraries at this stage.
Standard differential expression analyses focus on protein-coding
genes as these RNA products are the most likely to result in a
measurable phenotype. We annotate the counts table genes to their
biotypes as well as additional identifiers with
biomaRt
.
For human:
#Get database
ensembl <- biomaRt::useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl",
mirror = "uswest")
## Ensembl site unresponsive, trying asia mirror
#Format gene key
key <- biomaRt::getBM(attributes=c("ensembl_gene_id", "entrezgene_id", "hgnc_symbol",
"gene_biotype", "chromosome_name",
"start_position", "end_position"), mart=ensembl) %>%
#Filter protein coding genes
filter(gene_biotype == "protein_coding")
key.filter <- key %>%
#Filter protein coding genes in count table
filter(ensembl_gene_id %in% count$Geneid) %>%
#collapse multiannotations.
group_by(ensembl_gene_id, hgnc_symbol, gene_biotype,
chromosome_name, start_position, end_position) %>%
summarise(entrezgene_id = list(unique(entrezgene_id)), .groups = "drop") %>%
group_by(ensembl_gene_id, entrezgene_id, gene_biotype,
chromosome_name, start_position, end_position) %>%
summarise(symbol = list(unique(hgnc_symbol)), .groups = "drop")
Then, we filter the count table to genes in the protein-coding key.
count.filter.pc <- count.filter %>%
filter(Geneid %in% key.filter$ensembl_gene_id)
This removes 91 genes from this data set. These data were pre-filtered so you will see A LOT more genes removed in this step in your data. In the current human genome (GRCh38 release 106), there are 24035 protein-coding genes, of which 19027 were found in this data set.
For mouse (not run here):
#Get database
ensembl <- biomaRt::useEnsembl(biomart="ensembl", dataset="mmusculus_gene_ensembl",
mirror = "uswest")
#Format gene key
key <- biomaRt::getBM(attributes=c("ensembl_gene_id", "entrezgene_id", "mgi_symbol",
"gene_biotype", "chromosome_name",
"start_position", "end_position"), mart=ensembl) %>%
#Filter protein coding genes
filter(gene_biotype == "protein_coding")
key.filter <- key %>%
#Filter protein coding genes in count table
filter(ensembl_gene_id %in% count$Geneid) %>%
#collapse multiannotations.
group_by(ensembl_gene_id, mgi_symbol, gene_biotype,
chromosome_name, start_position, end_position) %>%
summarise(entrezgene_id = list(unique(entrezgene_id)), .groups = "drop") %>%
group_by(ensembl_gene_id, entrezgene_id, gene_biotype,
chromosome_name, start_position, end_position) %>%
summarise(symbol = list(unique(mgi_symbol)), .groups = "drop")
If your libraries were obtained from multiple experiments and/or
sequenced on multiple runs, batch effects may exist in the data. These
effects will mask true biological signals and should be corrected prior
to modeling. There are a number of ways to batch correct. Our preferred
methods are ComBat-Seq implemented
in sva
and limma
removeBatchEffect
.
First, we look for batch effects in PCA. These effects may be apparent in an overall shift of libraries in a batch.
BIGpicture::plot_pca(count.filter.pc, meta=meta.filter,
vars="batch", transform_logCPM=TRUE)
## Joining with `by = join_by(libID)`
## $batch
Or by highlighting libraries sequenced in multiple batches. In your
analysis, you will need to change the unique.ID
variables
to those that uniquely identify a sample. For example, these samples can
be uniquely identified by patient and virus stimulation.
unique.ID <- c("ptID", "virus")
#Identify duplicate libraries
dups <- meta.filter %>%
unite("dupID", all_of(unique.ID), remove=FALSE) %>%
count(dupID) %>%
filter(n>1)
#Create duplicate ID variable
meta.filter.dups <- meta.filter %>%
unite("dupID",unique.ID,remove=FALSE) %>%
mutate(duplicate = ifelse(dupID %in% dups$dupID, dupID, NA))
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(unique.ID)
##
## # Now:
## data %>% select(all_of(unique.ID))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
BIGpicture::plot_pca(count.filter.pc, meta=meta.filter.dups,
vars="duplicate", transform_logCPM=TRUE)
## Joining with `by = join_by(libID)`
## $duplicate
Here, we see apparent batch effects where batch A tends to have lower PC2 values than batch B. This is re-enforced by the duplicate samples which group on PC2 more by batch than by duplicate samples.
To correct batch effects, we use negative binomial normalization with
sva
ComBat-Seq including
batch
: the variable denoting batchesgroup
: the main effect, in this case virus stimulation
covar
: co-variates, in this case sex
group
and covar
are variables you
want to retain in our later linear models. This ensures that we don’t
over-correct the data and lose our biological signals of interest.We also reduce the data set with
shrink = TRUE, gene.subset.n = 2
to run models on 2 random
genes. This significantly reduces computational time for this tutorial.
However, a real analysis should be run on a larger subset. We
recommend anywhere from 10,000 to all genes in the data.
count.filter.pc.combat <- count.filter.pc %>%
#Convert count df to matrix
column_to_rownames("Geneid") %>%
as.matrix() %>%
#Batch correction
sva::ComBat_seq(., batch = meta.filter$batch,
group = meta.filter$virus,
covar_mod = model.matrix(~ sex, meta.filter),
shrink = TRUE, gene.subset.n = 2) %>%
as.data.frame() %>%
rownames_to_column("Geneid")
## Found 2 batches
## Using full model in ComBat-seq.
## Adjusting for 2 covariate(s) or covariate level(s)
## Estimating dispersions
## Fitting the GLM model
## Apply shrinkage - computing posterior estimates for parameters
## Using 2 random genes for Monte Carlo integration
## Apply shrinkage to mean only
## Adjusting the data
Note that the total “co-variates” listed in the above message
includes non-reference levels of the main effects (HRV) plus variables
listed as covar
(sex).
This new data set should have reduced batch effects. However, some batch effects likely still exist as this is a difficult correction and we err on the side of less correction to preserve as much of the original data as possible. In this example, we did a rather poor correction based on only 2 genes. So not surprisingly, there is still apparent splitting by batch.
BIGpicture::plot_pca(count.filter.pc.combat, meta=meta.filter.dups,
vars=c("batch","duplicate"), transform_logCPM=TRUE) %>%
wrap_plots(ncol=2)
## Joining with `by = join_by(libID)`
In another data set corrected using all genes, we can more readily see a reduction in the batch effects shown by color below.
limma
’s method for removing batch effects implements
linear regression on the normalized counts. Thus, this cannot be
completed until after data normalization and log transformation. See Section 4 for details.
The main difference between our two preferred batch correction methods is implementation on raw (negative binomial) vs normalized (linear) counts. In practice, ComBat-Seq tends to be more conservative and some batch effects remain in the final, normalized counts. In contrast, limma regresses out batch on the final counts so no batch effects remain; this can over-correct the data and remove real biological variation.
Importantly, both method rely heavily on repeated samples across batches as these technical replicates best represent batch effects. Without repeated samples, neither method performs well.
We recommend comparing both methods. In the end, we generally use ComBat-Seq for small batch effects and limma for large ones.
There are also other batch correction methods you may wish to explore.
scBatch
quantile normalization with
QuantNorm( )
implemented on unnormalized counts similar to
ComBat-SeqEven if you have measurable batch effects, you may not wish to perform batch correction. This correction directly alters count values and can lead to lose of signal for your actual variables of interest. Some things to consider include:
Sometimes one or more libraries will appear as outliers in PCA. We define this as any library more than 3 standard deviations away from the mean on PC1 and/or PC2.
The two PCA plots show scaled and unscaled values (see
?prcomp
for details). Generally, scaled values are
preferred but the unnormalized counts tend to look weird in scaled PCA
as they group but sequencing depth and/or quality. Thus, we usually only
consider outliers in the unscaled plot.
Here, we have no outliers.
pca3a <- BIGpicture::plot_pca(count.filter.pc.combat, meta=meta.filter,
vars="outlier", transform_logCPM=TRUE,
scale = FALSE)
## Joining with `by = join_by(libID)`
pca3b <- BIGpicture::plot_pca(count.filter.pc.combat, meta=meta.filter,
vars="outlier", transform_logCPM=TRUE,
scale = TRUE)
## Joining with `by = join_by(libID)`
pca3a$outlier + labs(title = "Uncaled PCA") +
pca3b$outlier + labs(title = "Scaled PCA")
You may define outliers at a more or less strict level
(outlier_sd
) or wish to determine them separately within
each group of interest (outlier_group
). For example here,
there is one outlier more than 1.5 sd away from the mean within batch
but this cutoff is much stricter than we recommend.
BIGpicture::plot_pca(count.filter.pc.combat, meta=meta.filter,
vars="outlier", transform_logCPM=TRUE,
outlier_sd = 1.5, outlier_group = "batch")
## Joining with `by = join_by(libID)`
## $outlier
We recommend that you initially remove dramatic outliers but leave those that are borderline or questionable. Then, you can re-assess outliers after gene filtering and normalization. You may find that some are no longer outliers after these steps. If they are, you can return to this step and remove them before repeating subsequent steps.
If outlier libraries need to be removed, you can use the following code to do so.
not.outlier <- pca3a$outlier$data %>%
filter(col.group == "no")
meta.filter.out <- meta.filter %>%
filter(libID %in% not.outlier$libID)
count.filter.pc.combat.out <- count.filter.pc.combat %>%
select(1, all_of(meta.filter.out$libID))
If you did not complete ComBat-Seq batch correction, the following
code completes the PCA outlier filtering. (Just remove
combat
from the count object names.)
not.outlier <- pca3a$outlier$data %>%
filter(col.group == "no")
meta.filter.out <- meta.filter %>%
filter(libID %in% not.outlier$libID)
count.filter.pc.out <- count.filter.pc %>%
select(1, all_of(meta.filter.out$libID))
Here, this removes 0 libraries because we do not have any at std dev > 3X.
Your data may have have one or more libraries that represent the same sample. This is common when multiple sequencing batches are completed and a couple of samples are re-sequenced to aid in batch effect detection. Unless the majority of samples have multiple libraries (i.e. technical replicates), the duplicates should be removed or they will skew downstream analysis.
In addition, if you have duplicate libraries and wish to perform
limma
batch correction, skip this step.
We usually keep the library with the highest number of sequences like so.
#Using the name unique ID variables as seen in batch correction
unique.ID
## [1] "ptID" "virus"
#Find libraries with highest seqs
meta.filter.out.dedup <- meta.filter.out %>%
group_by_at(unique.ID) %>%
slice_max(order_by = QC_pass)
#Filter lower seq libraries
count.filter.pc.combat.out.dedup <- count.filter.pc.combat.out %>%
select(1, all_of(meta.filter.out.dedup$libID))
Here, this removes 2 libraries, which matches what we expect from the duplicate PCA. Note that these two libraries were the ones with questionable median CV coverage. So we have resolved this quality issue without removing them in step 2.1.
At this stage, we’ve completed sample filtering and can collapse our count and meta data into a single list object. This allows us to shorten our long object names as well as works efficiently with the remaining cleaning steps.
First, let’s ensure that all the data are in the same order.
#Order metadata by library ID
meta.filter.out.dedup.ord <- meta.filter.out.dedup %>%
arrange(libID)
#Order counts by library ID and gene ID
count.filter.pc.combat.out.dedup.ord <- count.filter.pc.combat.out.dedup %>%
select(1, all_of(meta.filter.out.dedup.ord$libID)) %>%
arrange(Geneid)
#Order gene key by gene ID
key.filter.ord <- key.filter %>%
arrange(ensembl_gene_id)
#check libraries
identical(meta.filter.out.dedup.ord$libID,
colnames(count.filter.pc.combat.out.dedup.ord)[-1])
## [1] TRUE
#Check genes
identical(key.filter.ord$ensembl_gene_id,
count.filter.pc.combat.out.dedup.ord$Geneid)
## [1] TRUE
Then, we merge into the DGEList object, edgeR
format.
# Move gene names from a variable in the df to rownames and format to matrix
count.filter.pc.combat.out.dedup.ord.mat <-
count.filter.pc.combat.out.dedup.ord %>%
column_to_rownames("Geneid") %>%
as.matrix()
dat <- DGEList(
#count table
counts=count.filter.pc.combat.out.dedup.ord.mat,
#metadata
samples=meta.filter.out.dedup.ord,
#gene info
genes=key.filter.ord)
Or similarly for the version without batch correction or duplicate
library removal. We call this other output dat2
.
#Order metadata by library ID
meta.filter.out.ord <- meta.filter.out %>%
arrange(libID)
#Order counts by library ID and gene ID
count.filter.pc.out.ord <- count.filter.pc.out %>%
select(1, all_of(meta.filter.out.ord$libID)) %>%
arrange(Geneid)
#Order gene key by gene ID
key.filter.ord <- key.filter %>%
arrange(ensembl_gene_id)
#check libraries
identical(meta.filter.out.ord$libID,
colnames(count.filter.pc.out.ord)[-1])
## [1] TRUE
#Check genes
identical(key.filter.ord$ensembl_gene_id,
count.filter.pc.out.ord$Geneid)
## [1] TRUE
# Move gene names from a variable in the df to rownames and format to matrix
count.filter.pc.out.ord.mat <-
count.filter.pc.out.ord %>%
column_to_rownames("Geneid") %>%
as.matrix()
dat2 <- DGEList(
#count table
counts=count.filter.pc.out.ord.mat,
#metadata
samples=meta.filter.out.ord,
#gene info
genes=key.filter.ord)
Low abundance (small counts) and rare genes (many 0 counts) are removed from the data because they:
Our goal is to remove genes in the lower left of the mean-variance plot because counts (x) and variance (y) are low e.g. these genes break the red mean variance trend.
BIGpicture::plot_mv(dat, design = "~ virus")
We use our custom function to retain only genes that are at least
min.CPM
counts per million in at least
min.sample
number of samples OR in at least
min.pct
percent of samples. Here, we use 0.1 CPM in at
least 3 samples.
dat.abund <- RNAetc::filter_rare(dat, min_CPM = 0.1, min.sample = 3,
gene.var="ensembl_gene_id")
## 5734 (30.14%) of 19027 genes removed.
dat2.abund <- RNAetc::filter_rare(dat2, min_CPM = 0.1, min.sample = 3,
gene.var="ensembl_gene_id")
## 5558 (29.21%) of 19027 genes removed.
This yields slightly different results for the two versions of the
data. We visualize just dat
here.
Plotting the filtered data, we see the red trend line no longer curves down on the left. There is still a bit of the downward tail of dots but this filtering is sufficient.
plot_mv(dat.abund, design = "~ virus")
There is no exact cutoff for this filtering, and you should try
several cutoffs to observe the effects. In general, we use minimum CPM
from 0.1 - 1, minimum samples around 3 for small data sets, or minimum
samples from 5 - 10% in larger data sets. It is important to keep the
minimum sample cutoff larger than your smallest group of interest,
otherwise you may filter genes specifically associated with one group.
For example, if you have 10 libraries with 5 each of media vs
stimulated, you’re min.sample
value should not be > 5 or
else you will remove genes only expressed in one of media or stimulated
groups.
This filtering generally results in the removal of around 25% of genes. Here, our filtering is on the stricter side (which is common in small data sets), removing 5734 or 30% of genes.
You may also wish to look for specific genes of interest and ensure
they are not being filtered. This following calculates some statistics
on filtered genes and saves a csv
for your perusal.
rare <- as.data.frame(cpm(dat$counts)) %>%
#Filter genes removed
rownames_to_column("ensembl_gene_id") %>%
filter(!(ensembl_gene_id %in% rownames(dat.abund$counts))) %>%
#Add gene symbols
left_join(dat$genes, by = "ensembl_gene_id") %>%
#format
select(-c(chromosome_name:end_position)) %>%
pivot_longer(-c(ensembl_gene_id, gene_biotype, symbol, entrezgene_id)) %>%
group_by(ensembl_gene_id, gene_biotype, symbol) %>%
summarise(mean.CPM = mean(value, na.rm=TRUE),
min.CPM = min(value, na.rm=TRUE),
max.CPM = max(value, na.rm=TRUE),
express.in.libs = length(value[value > 0]),
.groups="drop")
write_csv(rare, file="data_clean/rare_genes.csv")
rare
## # A tibble: 5,734 × 7
## ensembl_gene_id gene_biotype symbol mean.CPM min.CPM max.CPM express.in.libs
## <chr> <chr> <list> <dbl> <dbl> <dbl> <int>
## 1 ENSG00000000005 protein_codi… <chr> 0 0 0 0
## 2 ENSG00000001617 protein_codi… <chr> 0.164 0 0.854 2
## 3 ENSG00000001626 protein_codi… <chr> 0.0786 0 0.786 1
## 4 ENSG00000002745 protein_codi… <chr> 0.0377 0 0.377 1
## 5 ENSG00000002746 protein_codi… <chr> 0 0 0 0
## 6 ENSG00000003137 protein_codi… <chr> 0.0376 0 0.376 1
## 7 ENSG00000003249 protein_codi… <chr> 0 0 0 0
## 8 ENSG00000003989 protein_codi… <chr> 0 0 0 0
## 9 ENSG00000004776 protein_codi… <chr> 0 0 0 0
## 10 ENSG00000004799 protein_codi… <chr> 0.171 0 1.71 1
## # ℹ 5,724 more rows
RNA-seq counts are not independent within a library and not comparable across libraries. A library with 1 million sequences will have higher counts for most genes than one with 1 thousand sequences. We correct for this aspect of the data with the following normalization steps.
TMM defines a reference sample from your data set as the one with the
most representative expression for the overall data set. Specifically,
the reference sample is the one whose upper quartile is closest to the
overall data set upper quartile. The upper quantile is the value
(x
) where 75% of genes have expression < x
.
Thus, the reference sample is the sample whose x
is the
closest to mean(x
) across all samples.
All other samples are considered test samples. For each test sample, a scaling factor is calculated based on the weighted mean of log ratios of representative genes between the test and reference. These representative genes are a subset of the data set, removing the highest and lowest expressed genes as well as genes with the highest and lowest log ratios. The exact genes used as representative genes for scaling are dynamic and specific to each test sample.
The calculated scaling factors are then applied to the counts table
automatically in the voom
step.
dat.abund.norm <- calcNormFactors(dat.abund, method = "TMM")
dat2.abund.norm <- calcNormFactors(dat2.abund, method = "TMM")
We continue normalization by converting counts to CPM within each sample, thus accounting for differential sampling depth. We also perform log2 transformation, because RNA-seq data are heavily right-skewed and thus, violate assumptions of normality.
as.data.frame(dat.abund$counts) %>%
rownames_to_column() %>%
pivot_longer(-rowname) %>%
ggplot() +
geom_histogram(aes(x=value), bins = 100) +
theme_classic() +
labs(x = "count", y = "occurences") +
lims(x=c(0,1000))
voom
performs both of these steps! We use
voomWithQualityWeights
to additionally calculate sample
specific quality weights that can be of use as co-variates in linear
modeling.
dat.abund.norm.voom <- voomWithQualityWeights(
dat.abund.norm,
design=model.matrix(~ virus,
data=dat.abund.norm$samples),
plot=TRUE)
dat2.abund.norm.voom <- voomWithQualityWeights(
dat2.abund.norm,
design=model.matrix(~ virus,
data=dat2.abund.norm$samples),
plot=TRUE)
Now, we return to batch correction. You should only complete this
section on data that were not batch corrected by ComBat-Seq (Section 2.3.1) and did not have
duplicates removed (Section 2.5). In this
tutorial, this is dat2.abund.norm.voom
.
With your uncorrected voom object, regress out batch effects with a
linear model in removeBatchEffect
including
batch
: the variable denoting batchescovar
: additional numeric co-variates to correct for
(not used here)
design
: model matrix with the main effects and
co-variates you want to retain signal for. , in this case sex
group
+ covar
variablesvoom.removeBatch <- removeBatchEffect(
dat2.abund.norm.voom,
batch = dat2.abund.norm.voom$targets$batch,
design = model.matrix(~ virus + sex, data = dat2.abund.norm.voom$targets)
)
Incorporate the batched corrected counts into a new voom object.
dat2.abund.norm.voom.removeBatch <- dat2.abund.norm.voom
dat2.abund.norm.voom.removeBatch$E <- voom.removeBatch
Visualize before and after batch correction. We see the batch effect is completely removed on PC2, though the duplicates still don’t look great. A larger data set generally performs better.
BIGpicture::plot_pca(dat2.abund.norm.voom$E, meta = meta.filter.dups,
vars=c("batch","duplicate")) %>%
wrap_plots(ncol=2)
## Joining with `by = join_by(libID)`
BIGpicture::plot_pca(dat2.abund.norm.voom.removeBatch$E, meta = meta.filter.dups,
vars=c("batch","duplicate")) %>%
wrap_plots(ncol=2)
## Joining with `by = join_by(libID)`
Because this version skipped Section 2.5, we now remove duplicate samples.
#Using the name unique ID variables as seen in batch correction
unique.ID
## [1] "ptID" "virus"
dat2.abund.norm.voom.removeBatch.dedup <- dat2.abund.norm.voom.removeBatch
#Find libraries with highest seqs
dat2.abund.norm.voom.removeBatch.dedup$targets <-
dat2.abund.norm.voom.removeBatch.dedup$targets %>%
group_by_at(unique.ID) %>%
slice_max(order_by = QC_pass)
#Filter lower seq libraries
dat2.abund.norm.voom.removeBatch.dedup$E <-
as.data.frame(dat2.abund.norm.voom.removeBatch.dedup$E) %>%
select(all_of(dat2.abund.norm.voom.removeBatch.dedup$targets$libID)) %>%
as.matrix()
The following summarizes libraries removed from analysis. In this tutorial, all libraries pass-filter and move on to analysis.
## # A tibble: 12 × 2
## libID filter
## <chr> <chr>
## 1 lib1 pass-filter
## 2 lib2 pass-filter
## 3 lib3 pass-filter
## 4 lib4 pass-filter
## 5 lib5 pass-filter
## 6 lib6 pass-filter
## 7 lib7 pass-filter
## 8 lib8 pass-filter
## 9 lib9 pass-filter
## 10 lib10 pass-filter
## 11 lib11 duplicate sample
## 12 lib12 duplicate sample
Re-assess PCA outliers. Some new samples may pop up after normalization but unless they are very clearly outliers, you should leave them in the data set. Also note that the unscaled and scaled PCA look much more similar after normalization. This is expected and we will use the scaled PCA in all future analyses.
You can also use this section to check if outliers identified in Section 2.4 remain outliers when you leave them
in the data at that stage. Simply return to that section and comment out
filter(col.group == "no")
then run all subsequent
steps.
pca5a <- BIGpicture::plot_pca(dat.abund.norm.voom,
vars="outlier",
scale = FALSE)
## Joining with `by = join_by(libID)`
pca5b <- BIGpicture::plot_pca(dat.abund.norm.voom,
vars="outlier",
scale = TRUE)
## Joining with `by = join_by(libID)`
pca5a$outlier + labs(title = "Unscaled PCA") +
pca5b$outlier + labs(title = "Scaled PCA")
Write as RData
save(dat.abund.norm, file = "data_clean/dat_combat.RData")
save(dat.abund.norm.voom, file = "data_clean/voom_combat.RData")
Write counts as csv
as.data.frame(dat.abund.norm$counts) %>%
rownames_to_column("ensembl_gene_id") %>%
write_csv("data_clean/counts_combat.csv")
as.data.frame(dat.abund.norm.voom$E) %>%
rownames_to_column("ensembl_gene_id") %>%
write_csv("data_clean/counts_voom_combat.csv")
And similarly for the limma batch corrected data. We rename the object so either version can be used in downstream tutorials.
dat.abund.norm <- dat2.abund.norm
dat.abund.norm.voom <- dat2.abund.norm.voom.removeBatch.dedup
Write as RData
save(dat.abund.norm, file = "data_clean/dat_limma.RData")
save(dat.abund.norm.voom, file = "data_clean/voom_limma.RData")
Write counts as csv
as.data.frame(dat.abund.norm$counts) %>%
rownames_to_column("ensembl_gene_id") %>%
write_csv("data_clean/counts_limma.csv")
as.data.frame(dat.abund.norm.voom$E) %>%
rownames_to_column("ensembl_gene_id") %>%
write_csv("data_clean/counts_voom_limma.csv")
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] edgeR_3.40.2 limma_3.54.2 patchwork_1.1.2
## [4] ggrepel_0.9.3 sva_3.44.0 BiocParallel_1.32.6
## [7] genefilter_1.78.0 mgcv_1.8-42 nlme_3.1-162
## [10] scales_1.2.1 SEARchways_1.1.0 BIGpicture_1.1.0
## [13] RNAetc_1.1.0 kimma_1.5.0 BIGverse_1.0.0
## [16] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
## [19] dplyr_1.1.2 purrr_1.0.1 readr_2.1.4
## [22] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2
## [25] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 matrixStats_0.63.0 bit64_4.0.5
## [4] filelock_1.0.2 progress_1.2.2 httr_1.4.6
## [7] GenomeInfoDb_1.34.9 tools_4.2.1 bslib_0.4.2
## [10] utf8_1.2.3 R6_2.5.1 DBI_1.1.3
## [13] BiocGenerics_0.44.0 colorspace_2.1-0 withr_2.5.0
## [16] prettyunits_1.1.1 tidyselect_1.2.0 curl_5.0.0
## [19] bit_4.0.5 compiler_4.2.1 cli_3.6.1
## [22] Biobase_2.58.0 xml2_1.3.4 labeling_0.4.2
## [25] sass_0.4.6 rappdirs_0.3.3 digest_0.6.31
## [28] rmarkdown_2.22 XVector_0.38.0 pkgconfig_2.0.3
## [31] htmltools_0.5.5 dbplyr_2.3.2 highr_0.10
## [34] fastmap_1.1.1 rlang_1.1.1 rstudioapi_0.14
## [37] RSQLite_2.3.1 farver_2.1.1 jquerylib_0.1.4
## [40] generics_0.1.3 jsonlite_1.8.4 vroom_1.6.3
## [43] RCurl_1.98-1.12 magrittr_2.0.3 GenomeInfoDbData_1.2.9
## [46] Matrix_1.5-4.1 Rcpp_1.0.10 munsell_0.5.0
## [49] S4Vectors_0.36.2 fansi_1.0.4 lifecycle_1.0.3
## [52] stringi_1.7.12 yaml_2.3.7 zlibbioc_1.44.0
## [55] BiocFileCache_2.6.1 grid_4.2.1 blob_1.2.4
## [58] parallel_4.2.1 crayon_1.5.2 lattice_0.21-8
## [61] Biostrings_2.66.0 splines_4.2.1 annotate_1.74.0
## [64] hms_1.1.3 KEGGREST_1.38.0 locfit_1.5-9.7
## [67] knitr_1.43 pillar_1.9.0 biomaRt_2.54.1
## [70] codetools_0.2-19 stats4_4.2.1 XML_3.99-0.14
## [73] glue_1.6.2 evaluate_0.21 png_0.1-8
## [76] vctrs_0.6.2 tzdb_0.4.0 foreach_1.5.2
## [79] gtable_0.3.3 cachem_1.0.8 xfun_0.39
## [82] xtable_1.8-4 survival_3.5-5 iterators_1.0.14
## [85] AnnotationDbi_1.60.2 memoise_2.0.1 IRanges_2.32.0
## [88] timechange_0.2.0