This document covers our recommended analysis pipeline to determine differentially expressed genes (DEG). This pipeline includes simple linear modeling and linear mixed effects modeling with main effects, interaction terms, co-variates, and random effects. There is discussion of how to choose a ‘best fit’ model using fit, significant genes, and biological knowledge. The example data are human, bulk, paired-end RNA-seq, but this pipeline can be applied to other organisms or single-read libraries.
This pipeline should be completed in R and RStudio. You should also install the following packages.
#CRAN packages
install.packages("tidyverse")
#Bioconductor packages
install.packages("BiocManager")
BiocManager::install(c("edgeR", "limma", "patchwork", "ComplexHeatmap"))
#GitHub packages
install.packages("devtools")
devtools::install_github("BIGslu/BIGverse")
#Or if fails, install packages individually
devtools::install_github("BIGslu/kimma")
devtools::install_github("BIGslu/BIGpicture")
devtools::install_github("BIGslu/RNAetc")
devtools::install_github("BIGslu/SEARchways")
devtools::install_github("dleelab/pvca")
And load them into your current R session.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(BIGverse)
## ── Attaching packages ──────────────────────────────────────── BIGverse 1.0.0 ──
## ✔ kimma 1.5.0 ✔ BIGpicture 1.1.0
## ✔ RNAetc 1.1.0 ✔ SEARchways 1.1.0
## ── Conflicts ─────────────────────────────────────────── BIGverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(limma)
library(edgeR)
library(pvca)
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
library(ComplexHeatmap)
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.12.1
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
##
## If you use it in published research, please cite either one:
## - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
## genomic data. Bioinformatics 2016.
## - Gu, Z. Complex Heatmap Visualization. iMeta 2022.
##
##
## The new InteractiveComplexHeatmap package can directly export static
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
##
## This message can be suppressed by:
## suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
library(patchwork)
set.seed(651)
Example data were obtained from virus-stimulated human plasmacytoid dendritic cells. Actual patient identifiers and metadata have been altered for this tutorial.
Dill-McFarland KA, Schwartz JT, Zhao H, Shao B, Fulkerson PC, Altman MC, Gill MA. 2022. Eosinophil-mediated suppression and Anti-IL-5 enhancement of plasmacytoid dendritic cell interferon responses in asthma. J Allergy Clin Immunol. Epub ahead of print. doi: 10.1016/j.jaci.2022.03.025. — GitHub
Specifically, this tutorial uses RNAseq data processed using our SEAsnake
and counts
to voom pipelines, resulting in voom-normalized, log2 counts per
million (CPM) expression and associated sample metadata in a
limma EList
object in the data_clean/
directory. These data are available in the kimma
package
within the example.voom
data object.
Our main research question is how human rhinovirus (HRV) infection
impacts gene expression. As a secondary question, we are also interested
in how individuals with asthma may respond differently to HRV compared
to healthy individuals. Thus, we will explore models comparing media and
HRV-infected samples (variable named virus
) in asthmatic
and healthy individuals (variable named asthma
). We will
then explore the impacts of patient co-variates, paired sample design,
and random effects to improve model fit and the detection of
differentially expressed genes.
All counts, gene, and sample metadata are contained in a single
object from the limma
package. Note that these data contain
only 1000 genes from the full data set create in our counts to voom pipeline. This is to reduce
computational time.
#Extract and rename data object
dat <- kimma::example.voom
We access each data frame within this Elist
using
$
. The normalized log2 CPM expression data are contained in
E
.
dat$E[1:3,1:7]
## lib1 lib2 lib3 lib4 lib5 lib6 lib7
## ENSG00000000460 6.109675 7.998060 7.322775 7.327720 7.915870 7.989109 8.294872
## ENSG00000001460 7.173806 4.828135 6.133741 8.130273 6.662114 7.241296 5.807207
## ENSG00000002587 8.672611 7.460403 8.856207 8.130273 7.852445 7.067965 8.249429
Library and donor metadata are in targets
.
dat$targets[1:3,]
## group lib.size norm.factors libID ptID median_cv_coverage virus asthma
## lib1 1 79645.59 1.001378 lib1 donor1 0.513732 none healthy
## lib2 1 88007.91 0.951334 lib2 donor1 0.435162 HRV healthy
## lib3 1 178019.51 1.091153 lib3 donor2 0.374282 none healthy
## batch
## lib1 2
## lib2 2
## lib3 2
Gene metadata are in genes
.
dat$genes[1:3,1:4]
## hgnc_symbol Previous symbols Alias symbols
## ENSG00000000460 RAB12 <NA> <NA>
## ENSG00000001460 MIA <NA> CD-RAP
## ENSG00000002587 C1orf52 <NA> gm117, FLJ44982
## gene_biotype
## ENSG00000000460 protein-coding gene
## ENSG00000001460 protein-coding gene
## ENSG00000002587 protein-coding gene
Voom gene-level quality weights are in weights
. These
were calculated with voomWithQualityWeights( )
.
example.voom$weights[1:3,1:3]
## [,1] [,2] [,3]
## [1,] 0.7585106 0.8274172 1.5710160
## [2,] 0.5288184 0.5555010 0.9553991
## [3,] 0.7259811 0.7906774 1.5015908
And finally, the null model used in voom normalization is found in
design
.
example.voom$design[1:3,]
## lib1 lib2 lib3
## 1 1 1
First, we consider our research question and what variable(s) we need
to answer that question. In these data, the first variable of interest
is virus
to determine how viral infection impacts gene
expression. In R, this model is written as:
~ virus
On a coarse scale such as PCA, we can see that virus
impacts gene expression with uninfected controls (none) grouping
together away from HRV-infected samples.
pca1 <- plot_pca(dat, vars = "virus", scale = TRUE)
## Joining with `by = join_by(libID)`
pca1$virus
We run this linear model in kimma
using
kmFit
.
virus <- kmFit(dat = dat,
model = "~ virus",
run_lm = TRUE)
lm model: expression~virus
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes
We see that numerous genes are significant for virus
in
the summary table and FDR <0.05 highlighted in the volcano plot.
However, this is not the best model for these data since we have other
factors that may impact expression.
summarise_kmFit(fdr = virus$lm)
## # A tibble: 2 × 7
## variable fdr_0.05 fdr_0.1 fdr_0.2 fdr_0.3 fdr_0.4 fdr_0.5
## <fct> <int> <int> <int> <int> <int> <int>
## 1 virusHRV 44 66 132 160 240 323
## 2 total (nonredundant) 44 66 132 160 240 323
plot_volcano(model_result = virus,
model = "lm", variables = "virusHRV",
y_cutoff = 0.05)
limma introduced gene-level weights to account for sequence library
quality Law 2014.
These weights are calculated in limma with
voomWithQualityWeights
and then kimma can incorporate them
from the EList
object or separately using the
weights
parameter.
# Check if weights exist in the data
names(example.voom)
## [1] "genes" "targets" "E" "weights" "design"
virus_weights <- kmFit(dat = example.voom,
model = "~ virus",
run_lm = TRUE,
use_weights = TRUE)
lm model: expression~virus
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes
We see that in this case, weights have little impact. However, in larger datasets with more variability in quality and in other model designs, weights can have a significant effect. In general, we use weights in our models.
plot_venn_genes(model_result = list("no weights" = virus,
"weights" = virus_weights),
fdr.cutoff = 0.2)
## $venn
## $venn$`0.2`
We are actually interested in how individuals with and without asthma
respond to virus. We can add variables to our model with +
such as
~ virus + asthma
However, this model only captures the main effects of each variable in isolation. Specifically, this model tells you how virus impacts genes expression and how asthma impacts gene expression. It does not address how viral impacts differ between those with and without asthma.
One way to assess this is with an interaction term written as:
~ virus + asthma + virus:asthma
or short-handed with *
. Note, these two models are
equivalent in R.
~ virus * asthma
This model now tests both the main effects and their interaction.
virus_interact <- kmFit(dat = dat,
model = "~ virus*asthma",
run_lm = TRUE,
use_weights = TRUE)
lm model: expression~virus*asthma
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes
We now see 3 variables in our results equivalent to the variables in
the long form of our model equation. Notice that we’ve lost significance
for a lot of genes in virus
and not found any genes at FDR
< 0.05 for asthma or the interaction term. Because this data set is
small, an interaction model is likely too complex, and we do not appear
to have the power to detect interaction effects.
summarise_kmFit(fdr = virus_interact$lm)
## # A tibble: 4 × 7
## variable fdr_0.05 fdr_0.1 fdr_0.2 fdr_0.3 fdr_0.4 fdr_0.5
## <fct> <int> <int> <int> <int> <int> <int>
## 1 asthmaasthma NA NA 4 16 30 89
## 2 virusHRV 1 9 31 52 81 116
## 3 virusHRV:asthmaasthma NA NA 1 1 1 1
## 4 total (nonredundant) 1 9 35 64 103 182
Importantly, a gene with a significant interaction term cannot be assessed for the main effects. For example at a higher FDR of 0.3, there is 1 gene here that is significant for the interaction (green) and asthma (yellow). However, we cannot use the asthma results for these genes, because they are comparing all healthy to all asthma samples without taking virus into account. Since we know there is an interaction effect for these genes, the asthma comparison alone is incorrectly averaging across samples we know to be different (none vs HRV). Similarly, we cannot use the virus results for an interaction significant gene.
plot_venn_genes(model_result = list("lm"=virus_interact),
fdr_cutoff = 0.3)
## $venn
## $venn$`0.3`
If this were our final model, our DEG list would be all interaction genes (green) as well as the intersect of virus and asthma main terms (blue-yellow overlap). This second group encompasses genes that change with virus similarly in healthy and asthma donors but are always higher in one asthma group.
Another way to model interactions is with pairwise contrasts.
Contrasts compare 2 or more groups to all other groups in that variable.
For example, we’re interested in the 4 groups within the interaction
term: none_healthy
, none_asthma
,
HRV_healthy
, HRV_asthma
. We run these
comparisons with the same interaction model as above, only now we also
set run_contrast = TRUE
. We will get pairwise comparisons
that we’re not interested in since all combinations are run but will
filter those of interest later on.
virus_contrast <- kmFit(dat = dat, model = "~ virus*asthma",
run_lm = TRUE,
run_contrast = TRUE,
use_weights = TRUE,
contrast_var = "virus:asthma")
lm model: expression~virus*asthma
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes
We see the same main model outcome as before.
summarise_kmFit(fdr = virus_contrast$lm)
## # A tibble: 4 × 7
## variable fdr_0.05 fdr_0.1 fdr_0.2 fdr_0.3 fdr_0.4 fdr_0.5
## <fct> <int> <int> <int> <int> <int> <int>
## 1 asthmaasthma NA NA 4 16 30 89
## 2 virusHRV 1 9 31 52 81 116
## 3 virusHRV:asthmaasthma NA NA 1 1 1 1
## 4 total (nonredundant) 1 9 35 64 103 182
We also get all pairwise contrasts between the 4 groups.
summarise_kmFit(fdr = virus_contrast$lm.contrast)
## # A tibble: 7 × 9
## variable contrast_ref contrast_lvl fdr_0.05 fdr_0.1 fdr_0.2 fdr_0.3 fdr_0.4
## <fct> <chr> <chr> <int> <int> <int> <int> <int>
## 1 virus*asth… HRV healthy none asthma 56 95 199 279 375
## 2 virus*asth… none asthma HRV asthma 14 28 62 111 157
## 3 virus*asth… none healthy HRV healthy 1 9 31 52 81
## 4 virus*asth… HRV healthy HRV asthma NA 4 17 26 47
## 5 virus*asth… none healthy HRV asthma NA NA 50 134 204
## 6 virus*asth… none healthy none asthma NA NA 4 16 30
## 7 total (non… <NA> <NA> 60 99 230 350 484
## # ℹ 1 more variable: fdr_0.5 <int>
Not all of these contrasts are of interest, so we select just the effects of virus (green, red) and asthma (blue, yellow). You may be interested in all of these genes or just in genes that change with virus in one group (green and red not overlapping) or those that change with virus (green and/or ref) AND are different with asthma (blue and/or yellow).
plot_venn_genes(model_result = list("lm"=virus_contrast),
fdr_cutoff = 0.2,
contrasts = c("none asthma - none healthy",
"HRV asthma - HRV healthy",
"HRV healthy - none healthy",
"HRV asthma - none asthma"))
## $venn
## $venn$`0.2`
At their heart, interaction and contrast models are trying to answer the same question. However, statistically, they are very different. Contrasts compare means between groups (see below) and you must select which pairwise comparisons are meaningful.
An interaction term tests if two slopes differ. In these data, this is comparing the change (slope) in response to virus in healthy vs asthmatic donors like below In most cases, it is more difficult to achieve significance when comparing slopes as opposed to means.
In general, we recommend using the interaction term to define differentially expressed genes (DEG). Then, as a post-hoc test, run contrasts only on significant DEGs to further probe the results. This is demonstrated later in this tutorial. It is like running an ANOVA to compare groups A,B,C and then TukeyHSD to determine which groups differ from which (A vs B, B vs C, A vs C).
Contrasts may be employed as your main model instead in cases such as:
We also want to consider the effects of variables that may impact or prevent us from seeing our main effects. In human studies, this often includes age and sex, though we will not explore these here. Here, we want to consider batch since these data were sequenced in two batches.
To determine which co-variates to include, let’s see if they have a large impact on the data in PCA and compare to the effects of virus and asthma.
plot_pca(dat, vars = c("virus","asthma","batch"), scale = TRUE) %>%
wrap_plots(ncol=1)
## Joining with `by = join_by(libID)`
We see grouping by batch; thus, we have evidence to include batch.
We run the models with and without co-variates for comparison. To
include co-variates, we add them to the models with +
. We
also set metrics = TRUE
so that kimma
gives us
model fit metrics for comparison.
virus_interact <- kmFit(dat = dat,
model = "~ virus*asthma",
run_lm = TRUE,
use_weights = TRUE,
metrics = TRUE)
lm model: expression~virus*asthma
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes
virus_batch <- kmFit(dat = dat,
model = "~ virus*asthma + batch",
run_lm = TRUE,
use_weights = TRUE,
metrics = TRUE)
lm model: expression~virus*asthma+batch
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes
We compare model fits with AIC, which summarize model fit. For more
information, see https://en.wikipedia.org/wiki/Akaike_information_criterion.
In general, a difference in AIC < 2 is considered no evidence of
better fit, 2 - 8 is moderate evidence, and > 8 is strong evidence.
There are additional metrics also available in kimma
including BIC, sigma, and R-squared.
Smaller AIC indicate a better fitting model so genes above 0 are better fit by the “x” model and genes below 0 line are better fit by the “y” model. Throughout this tutorial, we put the more complex model as y. Our plotting function also outputs a message with how many genes are best fit by each model as well as mean and standard deviation of the difference in AIC between the two models.
plot_fit2(model_result = virus_interact,
x = "lm", x_label = "virus*asthma",
model_result_y = virus_batch,
y = "lm", y_label = "virus*asthma + batch",
metrics = "AIC")
## Summary
## Best fit Metric Total genes Mean delta Stdev delta
## 1 virus*asthma AIC 757 1.479661 0.5478221
## 2 virus*asthma + batch AIC 243 2.892197 3.0911213
We see that each model is the best fit for a subset of genes. Batch improves the model fit for about 25% of genes, and this improvement is moderate to strong for some genes (e.g. genes with change in fit < -2 and mean \(\Delta\) AIC = 2.9). In contrast, not having batch does not change AIC more than 2; thus, there is no evidence to remove batch.
Next, we compare how many genes are significant with and without co-variates.
summarise_kmFit(virus_interact$lm)
## # A tibble: 4 × 7
## variable fdr_0.05 fdr_0.1 fdr_0.2 fdr_0.3 fdr_0.4 fdr_0.5
## <fct> <int> <int> <int> <int> <int> <int>
## 1 asthmaasthma NA NA 4 16 30 89
## 2 virusHRV 1 9 31 52 81 116
## 3 virusHRV:asthmaasthma NA NA 1 1 1 1
## 4 total (nonredundant) 1 9 35 64 103 182
summarise_kmFit(virus_batch$lm)
## # A tibble: 4 × 4
## variable fdr_0.3 fdr_0.4 fdr_0.5
## <fct> <int> <int> <int>
## 1 asthmaasthma 8 26 42
## 2 virusHRV 1 1 1
## 3 virusHRV:asthmaasthma NA NA 1
## 4 total (nonredundant) 9 27 43
First, consider how many genes are significant for the co-variates themselves. Batch is not significant for any genes here. Next, compare the number of virus-significant genes impacted by including the co-variate. Adding batch to the model decreases the number of virus significant genes here. The dramatic decrease in virus genes is likely evidence that we do not have enough power to support the co-variate model complexity (e.g. too few samples relative to the number of variables). Given the size of this data set (N = 12), this is not surprising.
To summarize, the batch co-variate:
All evidence except PCA points to not including batch in the model. It is rare that all the evidence points to one conclusion so it must be weighed along with biological and experimental information to determine the final model. In this case, I would choose to NOT include batch since the sample size is very small and model fit differences are also small.
In your data, you may have co-variates that are not as clear as this. In that case, it’s important to prioritize model fit and sample size over significant gene counts. You should not include or exclude a co-variate just because it gets you more significant genes for your main terms! This is especially true if the co-variate negatively impacts model fit.
You may also have non-statistical evidence for including a co-variate. If you have established biological evidence that a co-variate is important in your system or it’s standard in your field, you may wish to include it even if it does not improve fit and is not significant. If the co-variate, however, greatly worsens fit, you should not include it even if it’s standard. Instead, present fit plots like above to support its exclusion.
As shown earlier, PCA is often used to assessed co-variate effects on gene expression. Another way to visualize this (and an easier way when you have a lot of variables) is to make a heatmap of correlations to PCs. This works best for numeric co-variates but can also be roughly applied to categorical variables if they are converted to numeric levels (like 1, 2). Categorical variables with more than two levels are difficult to interpret by this method since the ordering (1,2,3 vs 1,3,2 etc) can give different results. However, than can technically still be run in correlation.
First, we correlate numeric and converted categorical variables to PCA with Pearson’s correlation.
#Extract PCA values from first PCA under the ~virus model (2.1)
pca.dat <- pca1$virus$data %>%
# convert 2 level categoricals
mutate(across(c(virus, asthma, batch, ptID), ~as.numeric(as.factor(.)))) %>%
# select var of interest
select(PC1:PC12, lib.size, median_cv_coverage, virus, asthma, batch, ptID)
pca.corr <- cor(pca.dat, method = "pearson") %>%
as.data.frame() %>%
rownames_to_column() %>%
#Remove PCs from rows
filter(!grepl("^PC[0-9]{1,2}$", rowname)) %>%
#Remove everything else from columns
select(rowname, starts_with("PC")) %>%
column_to_rownames() %>%
as.matrix()
Visualizing this in a heatmap, we see that our main variable of interest (virus) most strongly correlates to PC2. This PC is also correlated with batch which further underscores the importance of testing this variable in linear models. In addition, PC1 is related to technical variables such as median CV coverage and biological variables like asthma. Given that the technical variables are on a highly explanatory PC, this result indicates that they should be explored for model fit similar to batch as above.
ComplexHeatmap::Heatmap(pca.corr,
name = "Pearson (R)",
cluster_columns = FALSE)
Note that ptID does not have a directly interpretable R value, because it is a six level factor and does not logically convert to numeric.
Another method for correlating categorical variable to PCs in through
pvca
, which performs mixed models and determines the
percent of PC variation that is explained by each variable.
Here, we test the previously “converted to numeric” variables of virus, asthma, batch, and ptID. This is a better method for ptID in particular, because it has more than two levels.
dat.pvca <- dat$targets %>%
select(ptID, virus, asthma, batch)
pvca.corr <- pvca::PVCA(counts = dat$E,
meta = dat.pvca,
threshold = 1,
inter = FALSE)
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
The package has its own integrated plotting function. We see that the
majority of variation (38.7%) is not explained by any of the provided
variables (resid
). The next most explanatory variable is
patient ID (ptID
) at 31.8%. As you’ll see in the next
section, this effect is likely due to the paired sample design and will
be accounted for in a linear mixed effects model. Continuing down, we
see virus, asthma, and then batch in explanatory power. This indicates
that maybe batch isn’t as impactful of a co-variate as we originally
thought, though often you will consider all co-variates that explain
> 1% of variation.
PlotPVCA(pvca.corr, title = "")
These methods in addition to model fit assessment will help you in determining co-variates to include in your model. It is both an art and a science as you must take into account prior knowledge and statistics in your choices. In the end, sometimes you have to make a arbitrary choice if the results are not conclusive one way or the other.
These data come from a paired study design with uninfected (none) and infected (HRV) samples from the same donor’s cells. We take this into account in our model by using donor as a random effect. This allows the model to match samples from the same donor. It greatly improves our statistical power as we now have 1 slope per donor instead of 1 mean slope across all donors. So, we can see if all individual donors change similarly or not.
Random effects are added to the model with + (1|block)
where the block is the variable you want to pair samples by. Thus, we
now run a mixed effects model lme
with
~ virus*asthma + (1|ptID)
In kimma
, this is run similarly to a simple model except
we add our random term and ask it to run_lme
.
virus_lme <- kmFit(dat = dat,
model = "~ virus*asthma + (1|ptID)",
run_lme = TRUE,
use_weights = TRUE,
metrics = TRUE)
lme/lmerel model: expression~virus*asthma+(1|ptID)
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes
Note, we could run both models at once in kmFit
if we
set run_lm=TRUE
and run_lme=TRUE
. This is
helpful when you know you want to compare multiple models from the
start. Since we already ran the simple linear model, we’ll skip that
here.
Comparing models as we did with co-variates, we see that genes are
split by best fit with 1/3 best fit with the paired design (below 0,
mean \(\Delta\) AIC = 4.4) and 2/3 best
fit without the paired design (above 0, mean \(\Delta\) AIC = 6.1). This is a case where
fit does not clearly tell you which model to choose. We, however, know
that the experimental design is paired and thus, choose the mixed
effects model with 1|ptID
as we know it better represents
these data.
plot_fit2(model_result = virus_interact,
x="lm", x_label = "virus*asthma",
model_result_y = virus_lme,
y="lme", y_label = "virus*asthma + (1|ptID)",
metrics = "AIC")
## Summary
## Best fit Metric Total genes Mean delta Stdev delta
## 1 virus*asthma AIC 619 6.092427 3.996585
## 2 virus*asthma + (1|ptID) AIC 381 4.425601 3.632122
limma
limma
offers a sudo random effect in linear modeling
with duplicateCorrelation
. This estimates the random effect
of donor across all genes and uses one value for all models. The full
mixed effect model in kimma
estimates the random effect for
each gene, thus fitting a different value for each gene’s model.
In our analyses, we’ve found limma
and
kimma
results to be similar when the sample size or
variable effects are large. In the case of small data sets or small
effects, however, there can be dramatic differences. You can find a
side-by-side comparison in the kimma
vignette.
Given the large changes with a mixed effect model, let’s re-consider the batch and co-variate.
virus_batch_lme <- kmFit(dat = dat,
model = "~ virus*asthma + batch + (1|ptID)",
run_lme = TRUE,
use_weights = TRUE,
metrics = TRUE)
lme/lmerel model: expression~virus*asthma+batch+(1|ptID)
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes
We see no clear evidence to keep co-variates based on model fit since genes are split for best fit model and mean \(\Delta\) AIC is < 2.
plot_fit2(model_result = virus_lme,
x="lme", x_label = "virus*asthma + (1|ptID)",
model_result_y = virus_batch_lme,
y="lme", y_label = "virus*asthma + batch + (1|ptID)",
metrics="AIC")
## Summary
## Best fit Metric Total genes Mean delta Stdev delta
## 1 virus*asthma + (1|ptID) AIC 578 1.835798 1.177889
## 2 virus*asthma + batch + (1|ptID) AIC 422 1.944901 1.929572
Inclusion of batch in the model decreases the number of virus-significant genes but batch itself is significant for a number of genes.
summarise_kmFit(virus_lme$lme)
## # A tibble: 5 × 7
## variable fdr_0.05 fdr_0.1 fdr_0.2 fdr_0.3 fdr_0.4 fdr_0.5
## <fct> <int> <int> <int> <int> <int> <int>
## 1 (1 | ptID) 7 9 21 24 33 124
## 2 asthma 69 98 140 204 284 355
## 3 virus 266 314 397 468 566 628
## 4 virus:asthma 17 33 46 77 89 106
## 5 total (nonredundant) 316 384 493 590 710 797
summarise_kmFit(virus_batch_lme$lme)
## # A tibble: 6 × 7
## variable fdr_0.05 fdr_0.1 fdr_0.2 fdr_0.3 fdr_0.4 fdr_0.5
## <fct> <int> <int> <int> <int> <int> <int>
## 1 (1 | ptID) 5 5 6 21 63 89
## 2 asthma 70 90 131 174 268 336
## 3 batch 79 91 125 161 210 273
## 4 virus 158 210 296 335 410 493
## 5 virus:asthma 48 59 83 113 133 189
## 6 total (nonredundant) 259 330 441 523 643 745
Importantly, some of the interaction genes (our main variable of interest) are significant for batch. This is evidence to keep this co-variate since it appears to play a role in the genes we’ll focus on in our results. That being said, batch does not improve model fit and you could argue to remove them. So, this is a case where there is no clear right answer. As long as you understand the potential impacts on your results, you could continue your analyses with or without batch.
plot_venn_genes(model_result = list("with batch"=virus_batch_lme),
fdr_cutoff = 0.2)
## $venn
## $venn$`0.2`
For simplicity, we’ll move forward without any co-variates in the model below.
~ virus*asthma + (1|ptID)
Kinship is a summative measure of genetic relatedness. It can be from 0 to 1 with 1 being 100% identical (monozygotic twins). Some other common values are 0.5 for parent-child, 0.25 grandparent-grandchild, 0.125 first cousins, etc. This measure is a pairwise measure with 1 value per pair of individuals.
kin <- kimma::example.kin
kin
## donor1 donor2 donor3 donor4 donor5 donor6
## donor1 1.00 0.50 0.10 0.20 0.15 0.10
## donor2 0.50 1.00 0.10 0.11 0.23 0.09
## donor3 0.10 0.10 1.00 0.10 0.20 0.15
## donor4 0.20 0.11 0.10 1.00 0.11 0.13
## donor5 0.15 0.23 0.20 0.11 1.00 0.17
## donor6 0.10 0.09 0.15 0.13 0.17 1.00
Because it is not a single value per sample or individual, kinship
cannot be added to a model with + kin
. Instead, it is used
as a random effect where you block by individual so the model can
identify an individual’s kinship values relative to all other
individuals in the data set.
kimma
incorporates the lme4qtl
package’s
function relmatLmer
to use kinship in linear mixed effects
models. This feature is why kimma
exists since pairwise
kinship cannot be used in limma
. We fit this type of model
with run_lmerel = TRUE
providing the kinship matrix.
virus_kin <- kmFit(dat = dat,
kin = kin,
model = "~ virus*asthma + (1|ptID)",
run_lmerel = TRUE,
use_weights = TRUE,
metrics = TRUE)
lme/lmerel model: expression~virus*asthma+(1|ptID)
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes
We see that kinship does not improve model fit with very small \(\Delta\) AIC < 2 and even some genes with identical AIC for both models (Best fit = none). We also see kinship has very little impact on significant gene detection. Of note, this is not surprising as the example kinship data is made-up and does not reflex the actual relatedness of these patients.
plot_fit2(model_result = virus_lme,
x="lme", x_label = "virus*asthma + (1|ptID)",
model_result_y = virus_kin,
y="lmerel", y_label = "virus*asthma + (1|ptID) + kin",
metrics = "AIC")
## Summary
## Best fit Metric Total genes Mean delta Stdev delta
## 1 none AIC 272 0.0000000 0.0000000
## 2 virus*asthma + (1|ptID) AIC 242 0.3798391 0.3998915
## 3 virus*asthma + (1|ptID) + kin AIC 486 0.2843757 0.2528879
plot_venn_genes(model_result = list("lme"=virus_lme),
fdr.cutoff = 0.2)[["venn"]][["0.2"]] +
ggtitle("Without kinship") +
plot_venn_genes(model_result = list("lmerel"=virus_kin),
fdr.cutoff = 0.2)[["venn"]][["0.2"]] +
ggtitle("With kinship")
Thus, we consider our final best fit model to be the previously linear mixed effects model without kinship or co-variates.
~ virus*asthma + (1|ptID)
If you have an interaction term in your model, you may wish to further probe the results to determine how the interaction is significant. Do healthy donors increase expression in response to virus while those with asthma show no change? Or does everyone decrease in expression but those with asthma decrease more? And other potential outcomes.
Similar to the unpaired model in section 2.2, we can do with for the
mixed effects model with run_contrasts = TRUE
. Here,
though, we will only run contrasts on genes that were significant for
the interaction term. First, we select interaction DEG at FDR <
0.05.
subset.genes <- virus_lme$lme %>%
filter(variable == "virus:asthma" & FDR < 0.05) %>%
pull(gene)
And then run the contrast model on these genes.
virus_contrast_signif <- kmFit(dat = dat,
model = "~ virus*asthma + (1|ptID)",
run_lme = TRUE,
use_weights = TRUE,
run_contrast = TRUE,
contrast_var = "virus:asthma",
subset_genes = subset.genes)
lme/lmerel model: expression~virus*asthma+(1|ptID)
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 17 genes
Failed: 0 genes
We see that a number of genes only change significantly with virus in healthy or asthma individuals (right) and that some genes differ with asthma in the presence and/or absence of virus (left). The other contrasts (not plotted) are not of interest in this analysis since they represent comparisons across both variables.
plot_venn_genes(model_result = list("contrast lme"=virus_contrast_signif),
fdr.cutoff = 0.2,
contrasts = c("none asthma - none healthy",
"HRV asthma - HRV healthy",
"HRV healthy - none healthy",
"HRV asthma - none asthma"))
## $venn
## $venn$`0.2`
As noted in section 2.2, you may have instead chosen to use a contrast model for all genes and thus, would not need this redundant post-hoc test.
This tutorial is meant to walk you through the majority of experimental designs you will encounter in RNAseq analysis. But you do not need to run all these models on an actual data set. For these example data, here is the actual pipeline I would take to select a best fit model for these data.
virus
and asthma
~ virus*asthma
use_weights = TRUE
~ virus*asthma + (1|ptID)
. I run this model for comparison
to those in #4 and #5.~ virus*asthma + batch + (1|ptID)
~ virus*asthma + batch + (1|ptID)
. I determine that
batch does not improve the model fit enough to be included.The code to run the above pipeline (not run).
#3
## Run base model
virus_interact <- kmFit(dat = dat,
model = "~ virus*asthma + (1|ptID)",
run_lme = TRUE,
use_weights = TRUE,
metrics = TRUE)
#6a
## Explore co-variate batch
plot_pca(dat, vars = "batch", scale = TRUE)
virus_batch <- kmFit(dat = dat,
model = "~ virus*asthma + batch + (1|ptID)",
run_lme = TRUE,
use_weights = TRUE,
metrics = TRUE)
## Compare fit
plot_fit2(model_result = virus_interact, x="lme",
model_result_y = virus_batch, y="lme",
metrics = "AIC")
## Compare DEG
plot_venn_genes(model_result = list("without batch"=virus_interact),
fdr.cutoff = c(0.05, 0.2))[["venn"]] %>%
wrap_plots()
plot_venn_genes(model_result = list("without batch"=virus_batch),
fdr.cutoff = c(0.05, 0.2))[["venn"]] %>%
wrap_plots()
#7
## Interaction significant genes
subset.genes <- virus_interact$lme %>%
filter(variable == "virus:asthma" & FDR < 0.05) %>%
pull(gene)
contrast_model <- kmFit(dat = dat,
model = "~ virus*asthma + (1|ptID)",
run_lme = TRUE,
use_weights = TRUE,
run_contrast = TRUE,
contrast_var = "virus:asthma",
subset_genes = subset.genes)
#8
## Volcano plot
plot_volcano(virus_interact, model="lme",
y.cutoff = 0.05)
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] patchwork_1.1.2 ComplexHeatmap_2.12.1 lme4_1.1-34
## [4] Matrix_1.5-4.1 pvca_0.1.0 edgeR_3.40.2
## [7] limma_3.54.2 SEARchways_1.1.0 BIGpicture_1.1.0
## [10] RNAetc_1.1.0 kimma_1.5.0 BIGverse_1.0.0
## [13] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
## [16] dplyr_1.1.2 purrr_1.0.1 readr_2.1.4
## [19] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2
## [22] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-162 matrixStats_1.0.0 doParallel_1.0.17
## [4] RColorBrewer_1.1-3 httr_1.4.6 tools_4.2.1
## [7] bslib_0.5.0 utf8_1.2.3 R6_2.5.1
## [10] BiocGenerics_0.44.0 mgcv_1.9-0 colorspace_2.1-0
## [13] GetoptLong_1.0.5 withr_2.5.0 tidyselect_1.2.0
## [16] compiler_4.2.1 cli_3.6.1 labeling_0.4.2
## [19] sass_0.4.7 scales_1.2.1 digest_0.6.33
## [22] minqa_1.2.5 rmarkdown_2.23 pkgconfig_2.0.3
## [25] htmltools_0.5.5 fastmap_1.1.1 highr_0.10
## [28] rlang_1.1.1 GlobalOptions_0.1.2 rstudioapi_0.15.0
## [31] shape_1.4.6 jquerylib_0.1.4 generics_0.1.3
## [34] farver_2.1.1 jsonlite_1.8.7 magrittr_2.0.3
## [37] Rcpp_1.0.11 munsell_0.5.0 S4Vectors_0.36.2
## [40] fansi_1.0.4 lifecycle_1.0.3 stringi_1.7.12
## [43] yaml_2.3.7 MASS_7.3-60 parallel_4.2.1
## [46] crayon_1.5.2 lattice_0.21-8 splines_4.2.1
## [49] circlize_0.4.15 hms_1.1.3 magick_2.7.4
## [52] locfit_1.5-9.8 knitr_1.43 pillar_1.9.0
## [55] boot_1.3-28.1 rjson_0.2.21 codetools_0.2-19
## [58] stats4_4.2.1 glue_1.6.2 ggvenn_0.1.10
## [61] evaluate_0.21 png_0.1-8 vctrs_0.6.3
## [64] nloptr_2.0.3 tzdb_0.4.0 foreach_1.5.2
## [67] gtable_0.3.3 clue_0.3-64 cachem_1.0.8
## [70] xfun_0.39 iterators_1.0.14 IRanges_2.32.0
## [73] cluster_2.1.4 timechange_0.2.0