Overview

This document covers our recommended analysis pipeline to determine differentially expressed genes (DEG). This pipeline includes simple linear modeling and linear mixed effects modeling with main effects, interaction terms, co-variates, and random effects. There is discussion of how to choose a ‘best fit’ model using fit, significant genes, and biological knowledge. The example data are human, bulk, paired-end RNA-seq, but this pipeline can be applied to other organisms or single-read libraries.

0. Setup

Software

This pipeline should be completed in R and RStudio. You should also install the following packages.

#CRAN packages
install.packages("tidyverse")

#Bioconductor packages
install.packages("BiocManager")
BiocManager::install(c("edgeR", "limma", "patchwork", "ComplexHeatmap"))

#GitHub packages
install.packages("devtools")
devtools::install_github("BIGslu/kimma")
devtools::install_github("BIGslu/BIGpicture")
devtools::install_github("BIGslu/RNAetc")
devtools::install_github("BIGslu/SEARchways")
devtools::install_github("dleelab/pvca")

And load them into your current R session.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(kimma)
library(BIGpicture)
library(RNAetc)

## 
## Attaching package: 'RNAetc'
## 
## The following objects are masked from 'package:kimma':
## 
##     example.dat, example.voom

library(SEARchways)
library(limma)
library(edgeR)
library(pvca)
library(lme4)

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

library(ComplexHeatmap)

## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.24.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
## 
## If you use it in published research, please cite either one:
## - Gu, Z. Complex Heatmap Visualization. iMeta 2022.
## - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
##     genomic data. Bioinformatics 2016.
## 
## 
## The new InteractiveComplexHeatmap package can directly export static 
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================

library(patchwork)

set.seed(651)

Example data

Example data were obtained from virus-stimulated human plasmacytoid dendritic cells. Actual patient identifiers and metadata have been altered for this tutorial.

Dill-McFarland KA, Schwartz JT, Zhao H, Shao B, Fulkerson PC, Altman MC, Gill MA. 2022. Eosinophil-mediated suppression and Anti-IL-5 enhancement of plasmacytoid dendritic cell interferon responses in asthma. J Allergy Clin Immunol. Epub ahead of print. doi: 10.1016/j.jaci.2022.03.025. — GitHub

Specifically, this tutorial uses RNAseq data processed using our SEAsnake and counts to voom pipelines, resulting in voom-normalized, log2 counts per million (CPM) expression and associated sample metadata in a limma EList object in the data_clean/ directory. These data are available in the kimma package within the example.voom data object.

Research question

Our main research question is how human rhinovirus (HRV) infection impacts gene expression. As a secondary question, we are also interested in how individuals with asthma may respond differently to HRV compared to healthy individuals. Thus, we will explore models comparing media and HRV-infected samples (variable named virus) in asthmatic and healthy individuals (variable named asthma). We will then explore the impacts of patient co-variates, paired sample design, and random effects to improve model fit and the detection of differentially expressed genes.

1. Load data

All counts, gene, and sample metadata are contained in a single object from the limma package. Note that these data contain only 1000 genes from the full data set create in our counts to voom pipeline. This is to reduce computational time.

#Extract and rename data object
dat <- kimma::example.voom

We access each data frame within this Elist using $. The normalized log2 CPM expression data are contained in E.

dat$E[1:3,1:7]

##                     lib1     lib2     lib3     lib4     lib5     lib6     lib7
## ENSG00000000460 6.109675 7.998060 7.322775 7.327720 7.915870 7.989109 8.294872
## ENSG00000001460 7.173806 4.828135 6.133741 8.130273 6.662114 7.241296 5.807207
## ENSG00000002587 8.672611 7.460403 8.856207 8.130273 7.852445 7.067965 8.249429

Library and donor metadata are in targets.

dat$targets[1:3,]

##      group  lib.size norm.factors libID   ptID median_cv_coverage virus  asthma
## lib1     1  79645.59     1.001378  lib1 donor1           0.513732  none healthy
## lib2     1  88007.91     0.951334  lib2 donor1           0.435162   HRV healthy
## lib3     1 178019.51     1.091153  lib3 donor2           0.374282  none healthy
##      batch
## lib1     2
## lib2     2
## lib3     2

Gene metadata are in genes.

dat$genes[1:3,1:4]

##                 hgnc_symbol Previous symbols   Alias symbols
## ENSG00000000460       RAB12             <NA>            <NA>
## ENSG00000001460         MIA             <NA>          CD-RAP
## ENSG00000002587     C1orf52             <NA> gm117, FLJ44982
##                        gene_biotype
## ENSG00000000460 protein-coding gene
## ENSG00000001460 protein-coding gene
## ENSG00000002587 protein-coding gene

Voom gene-level quality weights are in weights. These were calculated with voomWithQualityWeights( ).

example.voom$weights[1:3,1:3]

##           [,1]      [,2]      [,3]
## [1,] 0.7585106 0.8274172 1.5710160
## [2,] 0.5288184 0.5555010 0.9553991
## [3,] 0.7259811 0.7906774 1.5015908

And finally, the null model used in voom normalization is found in design.

example.voom$design[1:3,]

## lib1 lib2 lib3 
##    1    1    1

2. Simple linear model

2.1: A single main effect (virus)

First, we consider our research question and what variable(s) we need to answer that question. In these data, the first variable of interest is virus to determine how viral infection impacts gene expression. In R, this model is written as:

~ virus

On a coarse scale such as PCA, we can see that virus impacts gene expression with uninfected controls (none) grouping together away from HRV-infected samples.

pca1 <- plot_pca(dat, vars = "virus", scale = TRUE)

## Joining with `by = join_by(libID)`

pca1$virus

We run this linear model in kimma using kmFit.

virus <- kmFit(dat = dat, 
               model = "~ virus", 
               run_lm = TRUE)

lm model: expression~virus
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

We see that numerous genes are significant for virus in the summary table and FDR <0.05 highlighted in the volcano plot. However, this is not the best model for these data since we have other factors that may impact expression.

summarise_kmFit(fdr = virus$lm)

## # A tibble: 2 × 7
##   variable          `fdr<0.05` `fdr<0.1` `fdr<0.2` `fdr<0.3` `fdr<0.4` `fdr<0.5`
##   <fct>                  <int>     <int>     <int>     <int>     <int>     <int>
## 1 virusHRV                  44        66       132       160       240       323
## 2 total (nonredund…         44        66       132       160       240       323

plot_volcano(model_result = virus, 
             model = "lm", variables = "virusHRV",
             y_cutoff = 0.05)

A note on variable levels and factor order

The order of your variable levels is important for results interpretation. Here, our virus variable is a factor with order “none”, “HRV”. Thus, the reference is “none” and the comparison level is “HRV. This means positive estimates are genes that are UP with HRV infection, because estimates are LEVEL - REFERENCE.

class(dat$targets$virus)

## [1] "factor"

levels(dat$targets$virus)

## [1] "none" "HRV"

If your variable is not a factor, the order of levels will be automatically determined alphabetically.

class(dat$targets$batch)

## [1] "character"

levels(dat$targets$batch)

## NULL

sort(unique(dat$targets$batch))

## [1] "1" "2"

If you do not like this order, convert variables to factors BEFORE running kimma models. For example, if we wanted to switch batch order.

dat$targets <- dat$targets %>% 
  mutate(batch = factor(batch, levels=c("2","1")))
levels(dat$targets$batch)

## [1] "2" "1"

Gene-level weights

limma introduced gene-level weights to account for sequence library quality Law 2014. These weights are calculated in limma with voomWithQualityWeights and then kimma can incorporate them from the EList object or separately using the weights parameter.

# Check if weights exist in the data
names(example.voom)

## [1] "genes"   "targets" "E"       "weights" "design"

virus_weights <- kmFit(dat = example.voom, 
                       model = "~ virus", 
                       run_lm = TRUE,
                       use_weights = TRUE)

lm model: expression~virus
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

We see that in this case, weights have little impact. However, in larger datasets with more variability in quality and in other model designs, weights can have a significant effect. In general, we use weights in our models.

plot_venn_genes(model_result = list("no weights" = virus,
                                    "weights" = virus_weights), 
                fdr.cutoff = 0.2)

## $venn
## $venn$`0.2`

2.2: Multiple main effects (virus & asthma)

We are actually interested in how individuals with and without asthma respond to virus. We can add variables to our model with + such as

~ virus + asthma

However, this model only captures the main effects of each variable in isolation. Specifically, this model tells you how virus impacts genes expression and how asthma impacts gene expression. It does not address how viral impacts differ between those with and without asthma.

Interaction terms

One way to assess this is with an interaction term written as:

~ virus + asthma + virus:asthma

or short-handed with *. Note, these two models are equivalent in R.

~ virus * asthma

This model now tests both the main effects and their interaction.

virus_interact <- kmFit(dat = dat, 
                        model = "~ virus*asthma", 
                        run_lm = TRUE,
                        use_weights = TRUE)

lm model: expression~virus*asthma
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

We now see 3 variables in our results equivalent to the variables in the long form of our model equation. Notice that we’ve lost significance for a lot of genes in virus and not found any genes at FDR < 0.05 for asthma or the interaction term. Because this data set is small, an interaction model is likely too complex, and we do not appear to have the power to detect interaction effects.

summarise_kmFit(fdr = virus_interact$lm)

## # A tibble: 4 × 7
##   variable          `fdr<0.05` `fdr<0.1` `fdr<0.2` `fdr<0.3` `fdr<0.4` `fdr<0.5`
##   <fct>                  <int>     <int>     <int>     <int>     <int>     <int>
## 1 asthmaasthma               0         0         4        16        30        89
## 2 virusHRV                   1         9        31        52        81       116
## 3 virusHRV:asthmaa…          0         0         1         1         1         1
## 4 total (nonredund…          1         9        35        64       103       182

Importantly, a gene with a significant interaction term cannot be assessed for the main effects. For example at a higher FDR of 0.3, there is 1 gene here that is significant for the interaction (green) and asthma (yellow). However, we cannot use the asthma results for these genes, because they are comparing all healthy to all asthma samples without taking virus into account. Since we know there is an interaction effect for these genes, the asthma comparison alone is incorrectly averaging across samples we know to be different (none vs HRV). Similarly, we cannot use the virus results for an interaction significant gene.

plot_venn_genes(model_result = list("lm"=virus_interact), 
                fdr_cutoff = 0.3)

## $venn
## $venn$`0.3`

If this were our final model, our DEG list would be all interaction genes (green) as well as the intersect of virus and asthma main terms (blue-yellow overlap). This second group encompasses genes that change with virus similarly in healthy and asthma donors but are always higher in one asthma group.

Pairwise contrasts

Another way to model interactions is with pairwise contrasts. Contrasts compare 2 or more groups to all other groups in that variable. For example, we’re interested in the 4 groups within the interaction term: none_healthy, none_asthma, HRV_healthy, HRV_asthma. We run these comparisons with the same interaction model as above, only now we also set run_contrast = TRUE. We will get pairwise comparisons that we’re not interested in since all combinations are run but will filter those of interest later on.

virus_contrast <- kmFit(dat = dat, model = "~ virus*asthma", 
                        run_lm = TRUE, 
                        run_contrast = TRUE,
                        use_weights = TRUE, 
                        contrast_var = "virus:asthma")

lm model: expression~virus*asthma
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

We see the same main model outcome as before.

summarise_kmFit(fdr = virus_contrast$lm)

## # A tibble: 4 × 7
##   variable          `fdr<0.05` `fdr<0.1` `fdr<0.2` `fdr<0.3` `fdr<0.4` `fdr<0.5`
##   <fct>                  <int>     <int>     <int>     <int>     <int>     <int>
## 1 asthmaasthma               0         0         4        16        30        89
## 2 virusHRV                   1         9        31        52        81       116
## 3 virusHRV:asthmaa…          0         0         1         1         1         1
## 4 total (nonredund…          1         9        35        64       103       182

We also get all pairwise contrasts between the 4 groups.

summarise_kmFit(fdr = virus_contrast$lm.contrast)

## # A tibble: 7 × 9
##   variable    contrast_ref contrast_lvl `fdr<0.05` `fdr<0.1` `fdr<0.2` `fdr<0.3`
##   <fct>       <chr>        <chr>             <int>     <int>     <int>     <int>
## 1 virus*asth… HRV healthy  HRV asthma            0         4        17        26
## 2 virus*asth… HRV healthy  none asthma          56        95       199       279
## 3 virus*asth… none asthma  HRV asthma           14        28        62       111
## 4 virus*asth… none healthy HRV asthma            0         0        50       134
## 5 virus*asth… none healthy HRV healthy           1         9        31        52
## 6 virus*asth… none healthy none asthma           0         0         4        16
## 7 total (non… <NA>         <NA>                 60        99       230       350
## # ℹ 2 more variables: `fdr<0.4` <int>, `fdr<0.5` <int>

Not all of these contrasts are of interest, so we select just the effects of virus (green, red) and asthma (blue, yellow). You may be interested in all of these genes or just in genes that change with virus in one group (green and red not overlapping) or those that change with virus (green and/or ref) AND are different with asthma (blue and/or yellow).

plot_venn_genes(model_result = list("lm"=virus_contrast), 
                fdr_cutoff = 0.2,
                contrasts = c("none asthma - none healthy",
                              "HRV asthma - HRV healthy",
                              "HRV healthy - none healthy",
                              "HRV asthma - none asthma"))

## $venn
## $venn$`0.2`

When to use interaction vs contrasts

At their heart, interaction and contrast models are trying to answer the same question. However, statistically, they are very different. Contrasts compare means between groups (see below) and you must select which pairwise comparisons are meaningful.

An interaction term tests if two slopes differ. In these data, this is comparing the change (slope) in response to virus in healthy vs asthmatic donors like below In most cases, it is more difficult to achieve significance when comparing slopes as opposed to means.

In general, we recommend using the interaction term to define differentially expressed genes (DEG). Then, as a post-hoc test, run contrasts only on significant DEGs to further probe the results. This is demonstrated later in this tutorial. It is like running an ANOVA to compare groups A,B,C and then TukeyHSD to determine which groups differ from which (A vs B, B vs C, A vs C).

Contrasts may be employed as your main model instead in cases such as:

Small data sets where you are under-powered and would benefit from the reduced complexity of contrasts (1 variable) vs an interaction (3 variables)
When there is no significance for the interaction term
When you are only interested in a subset of the pairwise comparisons encompassed by the interaction term. For example, if you wanted to test a longitudinal study as time point 1 vs 2, 2 vs 3, 3 vs 4, etc

2.3: Co-variates (batch)

We also want to consider the effects of variables that may impact or prevent us from seeing our main effects. In human studies, this often includes age and sex, though we will not explore these here. Here, we want to consider batch since these data were sequenced in two batches.

To determine which co-variates to include, let’s see if they have a large impact on the data in PCA and compare to the effects of virus and asthma.

plot_pca(dat, vars = c("virus","asthma","batch"), scale = TRUE) %>% 
  wrap_plots(ncol=1)

## Joining with `by = join_by(libID)`

We see grouping by batch; thus, we have evidence to include batch.

We run the models with and without co-variates for comparison. To include co-variates, we add them to the models with +. We also set metrics = TRUE so that kimma gives us model fit metrics for comparison.

virus_interact <- kmFit(dat = dat, 
                        model = "~ virus*asthma", 
                        run_lm = TRUE, 
                        use_weights = TRUE,
                        metrics = TRUE)

lm model: expression~virus*asthma
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

virus_batch <- kmFit(dat = dat, 
                     model = "~ virus*asthma + batch", 
                     run_lm = TRUE, 
                     use_weights = TRUE,
                     metrics = TRUE)

lm model: expression~virus*asthma+batch
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

We compare model fits with AIC, which summarize model fit. For more information, see https://en.wikipedia.org/wiki/Akaike_information_criterion. In general, a difference in AIC < 2 is considered no evidence of better fit, 2 - 8 is moderate evidence, and > 8 is strong evidence. There are additional metrics also available in kimma including BIC, sigma, and R-squared.

Smaller AIC indicate a better fitting model so genes above 0 are better fit by the “x” model and genes below 0 line are better fit by the “y” model. Throughout this tutorial, we put the more complex model as y. Our plotting function also outputs a message with how many genes are best fit by each model as well as mean and standard deviation of the difference in AIC between the two models.

plot_fit2(model_result = virus_interact, 
          x = "lm", x_label = "virus*asthma",
          model_result_y = virus_batch, 
          y = "lm", y_label = "virus*asthma + batch",
          metrics = "AIC")

## Summary: AIC and BIC

##                              Best fit Metric Total genes Mean delta Stdev delta
## 1 Moderate (2-7) virus*asthma + batch    AIC          98   3.813692   1.3416383
## 2                     Nonsignif (< 2)    AIC         879   1.378355   0.6001681
## 3   Signif (> 7) virus*asthma + batch    AIC          23  10.330051   2.9288461

We see that each model is the best fit for a subset of genes. Batch improves the model fit for about 25% of genes, and this improvement is moderate to strong for some genes (e.g. genes with change in fit < -2 and mean $\Delta$ AIC = 2.9). In contrast, not having batch does not change AIC more than 2; thus, there is no evidence to remove batch.

Next, we compare how many genes are significant with and without co-variates.

summarise_kmFit(virus_interact$lm)

## # A tibble: 4 × 7
##   variable          `fdr<0.05` `fdr<0.1` `fdr<0.2` `fdr<0.3` `fdr<0.4` `fdr<0.5`
##   <fct>                  <int>     <int>     <int>     <int>     <int>     <int>
## 1 asthmaasthma               0         0         4        16        30        89
## 2 virusHRV                   1         9        31        52        81       116
## 3 virusHRV:asthmaa…          0         0         1         1         1         1
## 4 total (nonredund…          1         9        35        64       103       182

summarise_kmFit(virus_batch$lm)

## # A tibble: 5 × 7
##   variable          `fdr<0.05` `fdr<0.1` `fdr<0.2` `fdr<0.3` `fdr<0.4` `fdr<0.5`
##   <fct>                  <int>     <int>     <int>     <int>     <int>     <int>
## 1 asthmaasthma               0         0         0         8        26        42
## 2 batch2                     0         0         0         0         0         0
## 3 virusHRV                   0         0         0         1         1         1
## 4 virusHRV:asthmaa…          0         0         0         0         0         1
## 5 total (nonredund…          0         0         0         9        27        43

First, consider how many genes are significant for the co-variates themselves. Batch is not significant for any genes here. Next, compare the number of virus-significant genes impacted by including the co-variate. Adding batch to the model decreases the number of virus significant genes here. The dramatic decrease in virus genes is likely evidence that we do not have enough power to support the co-variate model complexity (e.g. too few samples relative to the number of variables). Given the size of this data set (N = 12), this is not surprising.

To summarize, the batch co-variate:

Impacts overall gene expression in PCA
Moderately improves model fit for a minority of genes
Does not significantly impact fit for the majority of genes
Significant for 0 genes in a linear model
Reduces the number of virus-significant genes

All evidence except PCA points to not including batch in the model. It is rare that all the evidence points to one conclusion so it must be weighed along with biological and experimental information to determine the final model. In this case, I would choose to NOT include batch since the sample size is very small and model fit differences are also small.

More to consider with co-variates

In your data, you may have co-variates that are not as clear as this. In that case, it’s important to prioritize model fit and sample size over significant gene counts. You should not include or exclude a co-variate just because it gets you more significant genes for your main terms! This is especially true if the co-variate negatively impacts model fit.

You may also have non-statistical evidence for including a co-variate. If you have established biological evidence that a co-variate is important in your system or it’s standard in your field, you may wish to include it even if it does not improve fit and is not significant. If the co-variate, however, greatly worsens fit, you should not include it even if it’s standard. Instead, present fit plots like above to support its exclusion.

Assessing many co-variates at once

As shown earlier, PCA is often used to assessed co-variate effects on gene expression. Another way to visualize this (and an easier way when you have a lot of variables) is to make a heatmap of correlations to PCs. This works best for numeric co-variates but can also be roughly applied to categorical variables if they are converted to numeric levels (like 1, 2). Categorical variables with more than two levels are difficult to interpret by this method since the ordering (1,2,3 vs 1,3,2 etc) can give different results. However, than can technically still be run in correlation.

First, we correlate numeric and converted categorical variables to PCA with Pearson’s correlation.

#Extract PCA values from first PCA under the ~virus model (2.1)
pca.dat <- pca1$virus$data %>% 
  # convert 2 level categoricals
  mutate(across(c(virus, asthma, batch, ptID), ~as.numeric(as.factor(.)))) %>% 
  # select var of interest
  select(PC1:PC12, lib.size, median_cv_coverage, virus, asthma, batch, ptID)
  
pca.corr <- cor(pca.dat, method = "pearson") %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  #Remove PCs from rows
  filter(!grepl("^PC[0-9]{1,2}$", rowname)) %>% 
  #Remove everything else from columns
  select(rowname, starts_with("PC")) %>% 
  column_to_rownames() %>% 
  as.matrix()

Visualizing this in a heatmap, we see that our main variable of interest (virus) most strongly correlates to PC2. This PC is also correlated with batch which further underscores the importance of testing this variable in linear models. In addition, PC1 is related to technical variables such as median CV coverage and biological variables like asthma. Given that the technical variables are on a highly explanatory PC, this result indicates that they should be explored for model fit similar to batch as above.

ComplexHeatmap::Heatmap(pca.corr, 
                        name = "Pearson (R)", 
                        cluster_columns = FALSE)

Note that ptID does not have a directly interpretable R value, because it is a six level factor and does not logically convert to numeric.

Another method for correlating categorical variable to PCs in through pvca, which performs mixed models and determines the percent of PC variation that is explained by each variable.

Here, we test the previously “converted to numeric” variables of virus, asthma, batch, and ptID. This is a better method for ptID in particular, because it has more than two levels.

dat.pvca <- dat$targets %>% 
  select(ptID, virus, asthma, batch)

pvca.corr <- pvca::PVCA(counts = dat$E,
           meta = dat.pvca,
           threshold = 1,
           inter = FALSE)

## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')

The package has its own integrated plotting function. We see that the majority of variation (38.7%) is not explained by any of the provided variables (resid). The next most explanatory variable is patient ID (ptID) at 31.8%. As you’ll see in the next section, this effect is likely due to the paired sample design and will be accounted for in a linear mixed effects model. Continuing down, we see virus, asthma, and then batch in explanatory power. This indicates that maybe batch isn’t as impactful of a co-variate as we originally thought, though often you will consider all co-variates that explain > 1% of variation.

PlotPVCA(pvca.corr, title = "")

These methods in addition to model fit assessment will help you in determining co-variates to include in your model. It is both an art and a science as you must take into account prior knowledge and statistics in your choices. In the end, sometimes you have to make a arbitrary choice if the results are not conclusive one way or the other.

3. Linear mixed effects model

3.1: Random effects (ptID)

These data come from a paired study design with uninfected (none) and infected (HRV) samples from the same donor’s cells. We take this into account in our model by using donor as a random effect. This allows the model to match samples from the same donor. It greatly improves our statistical power as we now have 1 slope per donor instead of 1 mean slope across all donors. So, we can see if all individual donors change similarly or not.

Random effects are added to the model with + (1|block) where the block is the variable you want to pair samples by. Thus, we now run a mixed effects model lme with

~ virus*asthma + (1|ptID)

In kimma, this is run similarly to a simple model except we add our random term and ask it to run_lme.

virus_lme <- kmFit(dat = dat, 
                   model = "~ virus*asthma + (1|ptID)", 
                   run_lme = TRUE, 
                   use_weights = TRUE,
                   metrics = TRUE)

lme/lmerel model: expression~virus*asthma+(1|ptID)
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

Note, we could run both models at once in kmFit if we set run_lm=TRUE and run_lme=TRUE. This is helpful when you know you want to compare multiple models from the start. Since we already ran the simple linear model, we’ll skip that here.

Comparing models as we did with co-variates, we see that genes are split by best fit with 1/3 best fit with the paired design (below 0, mean $\Delta$ AIC = 4.4) and 2/3 best fit without the paired design (above 0, mean $\Delta$ AIC = 6.1). This is a case where fit does not clearly tell you which model to choose. We, however, know that the experimental design is paired and thus, choose the mixed effects model with 1|ptID as we know it better represents these data.

plot_fit2(model_result = virus_interact, 
          x="lm", x_label = "virus*asthma",
          model_result_y = virus_lme, 
          y="lme", y_label = "virus*asthma + (1|ptID)",
          metrics = "AIC")

## Summary: AIC and BIC

##                                 Best fit Metric Total genes Mean delta
## 1            Moderate (2-7) virus*asthma    AIC         287   4.585766
## 2 Moderate (2-7) virus*asthma + (1|ptID)    AIC         224   4.203401
## 3                        Nonsignif (< 2)    AIC         199   1.075858
## 4              Signif (> 7) virus*asthma    AIC         227  10.297780
## 5   Signif (> 7) virus*asthma + (1|ptID)    AIC          63  10.285674
##   Stdev delta
## 1   1.4337513
## 2   1.3190495
## 3   0.6137955
## 4   2.9592266
## 5   4.6351583

Comparison to `limma`

limma offers a sudo random effect in linear modeling with duplicateCorrelation. This estimates the random effect of donor across all genes and uses one value for all models. The full mixed effect model in kimma estimates the random effect for each gene, thus fitting a different value for each gene’s model.

In our analyses, we’ve found limma and kimma results to be similar when the sample size or variable effects are large. In the case of small data sets or small effects, however, there can be dramatic differences. You can find a side-by-side comparison in the kimma vignette.

3.2 Co-variates (batch)

Given the large changes with a mixed effect model, let’s re-consider the batch and co-variate.

virus_batch_lme <- kmFit(dat = dat, 
                         model = "~ virus*asthma + batch + (1|ptID)",
                         run_lme = TRUE, 
                         use_weights = TRUE,
                         metrics = TRUE)

lme/lmerel model: expression~virus*asthma+batch+(1|ptID)
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

We see no clear evidence to keep co-variates based on model fit since genes are split for best fit model and mean $\Delta$ AIC is < 2.

plot_fit2(model_result = virus_lme, 
          x="lme", x_label = "virus*asthma + (1|ptID)",
         model_result_y = virus_batch_lme, 
         y="lme", y_label = "virus*asthma + batch + (1|ptID)",
         metrics="AIC")

## Summary: AIC and BIC

##                                         Best fit Metric Total genes Mean delta
## 1         Moderate (2-7) virus*asthma + (1|ptID)    AIC         230  2.9523504
## 2 Moderate (2-7) virus*asthma + batch + (1|ptID)    AIC         130  3.6163481
## 3                                Nonsignif (< 2)    AIC         625  0.9726142
## 4           Signif (> 7) virus*asthma + (1|ptID)    AIC           2  7.0887799
## 5   Signif (> 7) virus*asthma + batch + (1|ptID)    AIC          13  8.5086419
##   Stdev delta
## 1   0.7443203
## 2   1.3151626
## 3   0.5750350
## 4   0.1057450
## 5   1.2285494

Inclusion of batch in the model decreases the number of virus-significant genes but batch itself is significant for a number of genes.

summarise_kmFit(virus_lme$lme)

## # A tibble: 5 × 7
##   variable          `fdr<0.05` `fdr<0.1` `fdr<0.2` `fdr<0.3` `fdr<0.4` `fdr<0.5`
##   <fct>                  <int>     <int>     <int>     <int>     <int>     <int>
## 1 (1 | ptID)                 7         9        21        24        33       124
## 2 asthma                    69        98       140       204       284       355
## 3 virus                    266       314       397       468       566       628
## 4 virus:asthma              17        33        46        77        89       106
## 5 total (nonredund…        316       384       493       590       710       797

summarise_kmFit(virus_batch_lme$lme)

## # A tibble: 6 × 7
##   variable          `fdr<0.05` `fdr<0.1` `fdr<0.2` `fdr<0.3` `fdr<0.4` `fdr<0.5`
##   <fct>                  <int>     <int>     <int>     <int>     <int>     <int>
## 1 (1 | ptID)                 5         5         6        21        63        89
## 2 asthma                    70        90       131       174       268       336
## 3 batch                     79        91       125       161       210       273
## 4 virus                    158       210       296       335       410       493
## 5 virus:asthma              48        59        83       113       133       189
## 6 total (nonredund…        259       330       441       523       643       745

Importantly, some of the interaction genes (our main variable of interest) are significant for batch. This is evidence to keep this co-variate since it appears to play a role in the genes we’ll focus on in our results. That being said, batch does not improve model fit and you could argue to remove them. So, this is a case where there is no clear right answer. As long as you understand the potential impacts on your results, you could continue your analyses with or without batch.

plot_venn_genes(model_result = list("with batch"=virus_batch_lme),
                fdr_cutoff = 0.2)

## $venn
## $venn$`0.2`

For simplicity, we’ll move forward without any co-variates in the model below.

~ virus*asthma + (1|ptID)

3.3: Kinship co-variate

Kinship is a summative measure of genetic relatedness. It can be from 0 to 1 with 1 being 100% identical (monozygotic twins). Some other common values are 0.5 for parent-child, 0.25 grandparent-grandchild, 0.125 first cousins, etc. This measure is a pairwise measure with 1 value per pair of individuals.

kin <- kimma::example.kin

kin

##        donor1 donor2 donor3 donor4 donor5 donor6
## donor1   1.00   0.50   0.10   0.20   0.15   0.10
## donor2   0.50   1.00   0.10   0.11   0.23   0.09
## donor3   0.10   0.10   1.00   0.10   0.20   0.15
## donor4   0.20   0.11   0.10   1.00   0.11   0.13
## donor5   0.15   0.23   0.20   0.11   1.00   0.17
## donor6   0.10   0.09   0.15   0.13   0.17   1.00

Because it is not a single value per sample or individual, kinship cannot be added to a model with + kin. Instead, it is used as a random effect where you block by individual so the model can identify an individual’s kinship values relative to all other individuals in the data set.

kimma incorporates the lme4qtl package’s function relmatLmer to use kinship in linear mixed effects models. This feature is why kimma exists since pairwise kinship cannot be used in limma. We fit this type of model with run_lmerel = TRUE providing the kinship matrix.

virus_kin <- kmFit(dat = dat,
                   kin = kin, 
                   model = "~ virus*asthma + (1|ptID)",
                   run_lmerel = TRUE, 
                   use_weights = TRUE,
                   metrics = TRUE)

lme/lmerel model: expression~virus*asthma+(1|ptID)
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

We see that kinship does not improve model fit with very small $\Delta$ AIC < 2 and even some genes with identical AIC for both models (Best fit = none). We also see kinship has very little impact on significant gene detection. Of note, this is not surprising as the example kinship data is made-up and does not reflex the actual relatedness of these patients.

plot_fit2(model_result = virus_lme, 
          x="lme", x_label = "virus*asthma + (1|ptID)",
          model_result_y = virus_kin, 
          y="lmerel", y_label = "virus*asthma + (1|ptID) + kin",
          metrics = "AIC")

## Summary: AIC and BIC

##          Best fit Metric Total genes Mean delta Stdev delta
## 1 Nonsignif (< 2)    AIC        1000  0.2301277   0.3015061

plot_venn_genes(model_result = list("lme"=virus_lme),
                fdr.cutoff = 0.2)[["venn"]][["0.2"]] +
  ggtitle("Without kinship") +
plot_venn_genes(model_result = list("lmerel"=virus_kin),
                fdr.cutoff = 0.2)[["venn"]][["0.2"]] +
  ggtitle("With kinship")

Thus, we consider our final best fit model to be the previously linear mixed effects model without kinship or co-variates.

~ virus*asthma + (1|ptID)

4. Post-hoc pairwise contrasts

If you have an interaction term in your model, you may wish to further probe the results to determine how the interaction is significant. Do healthy donors increase expression in response to virus while those with asthma show no change? Or does everyone decrease in expression but those with asthma decrease more? And other potential outcomes.

Similar to the unpaired model in section 2.2, we can do with for the mixed effects model with run_contrasts = TRUE. Here, though, we will only run contrasts on genes that were significant for the interaction term. First, we select interaction DEG at FDR < 0.05.

subset.genes <- virus_lme$lme %>% 
  filter(variable == "virus:asthma" & FDR < 0.05) %>% 
  pull(gene)

And then run the contrast model on these genes.

virus_contrast_signif <- kmFit(dat = dat, 
                               model = "~ virus*asthma + (1|ptID)",
                               run_lme = TRUE, 
                               use_weights = TRUE,
                               run_contrast = TRUE,
                               contrast_var = "virus:asthma",
                               subset_genes = subset.genes)

lme/lmerel model: expression~virus*asthma+(1|ptID)
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 17 genes
Failed: 0 genes

We see that a number of genes only change significantly with virus in healthy or asthma individuals (right) and that some genes differ with asthma in the presence and/or absence of virus (left). The other contrasts (not plotted) are not of interest in this analysis since they represent comparisons across both variables.

plot_venn_genes(model_result = list("contrast lme"=virus_contrast_signif),
                fdr.cutoff = 0.2, 
                contrasts = c("none asthma - none healthy",
                              "HRV asthma - HRV healthy",
                              "HRV healthy - none healthy",
                              "HRV asthma - none asthma"))

## $venn
## $venn$`0.2`

As noted in section 2.2, you may have instead chosen to use a contrast model for all genes and thus, would not need this redundant post-hoc test.

5. An actual analysis

This tutorial is meant to walk you through the majority of experimental designs you will encounter in RNAseq analysis. But you do not need to run all these models on an actual data set. For these example data, here is the actual pipeline I would take to select a best fit model for these data.

What are the main variables of interest? Is an interaction necessary?
- virus and asthma
- Yes, so my starting model is ~ virus*asthma
Are gene-level weights available?
- Yes, so use_weights = TRUE
Are the data from a paired sample design?
- Yes, so I need a mixed effects model ~ virus*asthma + (1|ptID). I run this model for comparison to those in #4 and #5.
Are there potential co-variates?
- Yes, so I run the model with each co-variate of interest such as ~ virus*asthma + batch + (1|ptID)
- If there were additional co-variates, I would run each in a model as well
Is kinship available and of interest?
- No, so we don’t add it to the #4 model.
Compare more complex models to the version in #3 as that is the most simple with all main variables and the correct paired sample design.
- 1. ~ virus*asthma + batch + (1|ptID). I determine that batch does not improve the model fit enough to be included.
- Note that the exclusion of co-variates was also influenced by the small data set size and a desire to use the simplest model possible
- Also if kinship had been important, I would check that fit before adding co-variates.
Run contrasts on interaction significant genes from best model (which ends up being #3)
Make summary plots

The code to run the above pipeline (not run).

#3
## Run base model
virus_interact <- kmFit(dat = dat, 
                        model = "~ virus*asthma + (1|ptID)",
                        run_lme = TRUE, 
                        use_weights = TRUE,
                        metrics = TRUE)

#6a
## Explore co-variate batch
plot_pca(dat, vars = "batch", scale = TRUE)

virus_batch <- kmFit(dat = dat, 
                     model = "~ virus*asthma + batch + (1|ptID)",
                     run_lme = TRUE,
                     use_weights = TRUE,
                     metrics = TRUE)
## Compare fit
plot_fit2(model_result = virus_interact, x="lme", 
         model_result_y = virus_batch, y="lme", 
         metrics = "AIC")
## Compare DEG
plot_venn_genes(model_result = list("without batch"=virus_interact),
                fdr.cutoff = c(0.05, 0.2))[["venn"]] %>% 
  wrap_plots()
plot_venn_genes(model_result = list("without batch"=virus_batch),
                fdr.cutoff = c(0.05, 0.2))[["venn"]] %>% 
  wrap_plots()

#7
## Interaction significant genes
subset.genes <- virus_interact$lme %>% 
  filter(variable == "virus:asthma" & FDR < 0.05) %>% 
  pull(gene)

contrast_model <- kmFit(dat = dat, 
                        model = "~ virus*asthma + (1|ptID)",
                        run_lme = TRUE, 
                        use_weights = TRUE,
                        run_contrast = TRUE,
                        contrast_var = "virus:asthma",
                        subset_genes = subset.genes)

#8
## Volcano plot
plot_volcano(virus_interact, model="lme",
             y.cutoff = 0.05)

R session

sessionInfo()

## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] patchwork_1.3.0       ComplexHeatmap_2.24.0 lme4_1.1-37          
##  [4] Matrix_1.7-3          pvca_0.1.0            edgeR_4.6.1          
##  [7] limma_3.64.0          SEARchways_1.4.0      RNAetc_1.2.0         
## [10] BIGpicture_1.1.0      kimma_1.6.3           lubridate_1.9.4      
## [13] forcats_1.0.0         stringr_1.5.1         dplyr_1.1.4          
## [16] purrr_1.0.4           readr_2.1.5           tidyr_1.3.1          
## [19] tibble_3.2.1          ggplot2_3.5.2         tidyverse_2.0.0      
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1    farver_2.1.2        fastmap_1.2.0      
##  [4] digest_0.6.37       timechange_0.3.0    lifecycle_1.0.4    
##  [7] cluster_2.1.8.1     statmod_1.5.0       magrittr_2.0.3     
## [10] compiler_4.5.0      rlang_1.1.6         sass_0.4.10        
## [13] tools_4.5.0         utf8_1.2.4          yaml_2.3.10        
## [16] data.table_1.17.0   knitr_1.50          labeling_0.4.3     
## [19] RColorBrewer_1.1-3  BiocParallel_1.42.0 withr_3.0.2        
## [22] BiocGenerics_0.54.0 stats4_4.5.0        colorspace_2.1-1   
## [25] scales_1.4.0        iterators_1.0.14    MASS_7.3-65        
## [28] cli_3.6.5           rmarkdown_2.29      crayon_1.5.3       
## [31] reformulas_0.4.0    generics_0.1.3      rstudioapi_0.17.1  
## [34] tzdb_0.5.0          rjson_0.2.23        minqa_1.2.8        
## [37] cachem_1.1.0        splines_4.5.0       parallel_4.5.0     
## [40] matrixStats_1.5.0   vctrs_0.6.5         boot_1.3-31        
## [43] jsonlite_2.0.0      IRanges_2.42.0      hms_1.1.3          
## [46] GetoptLong_1.0.5    S4Vectors_0.46.0    clue_0.3-66        
## [49] locfit_1.5-9.12     foreach_1.5.2       jquerylib_0.1.4    
## [52] glue_1.8.0          nloptr_2.2.1        codetools_0.2-20   
## [55] cowplot_1.1.3       stringi_1.8.7       gtable_0.3.6       
## [58] shape_1.4.6.1       pillar_1.10.2       htmltools_0.5.8.1  
## [61] fgsea_1.34.0        circlize_0.4.16     R6_2.6.1           
## [64] Rdpack_2.6.4        doParallel_1.0.17   evaluate_1.0.3     
## [67] lattice_0.22-7      rbibutils_2.3       png_0.1-8          
## [70] bslib_0.9.0         Rcpp_1.0.14         fastmatch_1.1-6    
## [73] nlme_3.1-168        mgcv_1.9-3          ggvenn_0.1.10      
## [76] xfun_0.52           pkgconfig_2.0.3     GlobalOptions_0.1.2

RNA-seq data analysis

voom to DEG

Kim Dill-McFarland, kadm@uw.edu

version April 30, 2025

Overview

0. Setup

Software

Example data

Research question

1. Load data

2. Simple linear model

2.1: A single main effect (virus)

A note on variable levels and factor order

Gene-level weights

2.2: Multiple main effects (virus & asthma)

Interaction terms

Pairwise contrasts

When to use interaction vs contrasts

2.3: Co-variates (batch)

More to consider with co-variates

Assessing many co-variates at once

3. Linear mixed effects model

3.1: Random effects (ptID)

Comparison to `limma`

3.2 Co-variates (batch)

3.3: Kinship co-variate

4. Post-hoc pairwise contrasts

5. An actual analysis

R session

Navigation

RNA-seq data analysis

voom to DEG

Kim Dill-McFarland, kadm@uw.edu

version April 30, 2025

Overview

0. Setup

Software

Example data

Research question

1. Load data

2. Simple linear model

2.1: A single main effect (virus)

A note on variable levels and factor order

Gene-level weights

2.2: Multiple main effects (virus & asthma)

Interaction terms

Pairwise contrasts

When to use interaction vs contrasts

2.3: Co-variates (batch)

More to consider with co-variates

Assessing many co-variates at once

3. Linear mixed effects model

3.1: Random effects (ptID)

Comparison to limma

3.2 Co-variates (batch)

3.3: Kinship co-variate

4. Post-hoc pairwise contrasts

5. An actual analysis

R session

Navigation

Comparison to `limma`