Overview

The identification of differentially expressed genes (DEGs) from transcriptomic datasets is a major avenue of research across diverse disciplines. Current bioinformatic tools support a number of experimental designs including covariates, random effects, and blocking. However, covariance matrices are not yet among the features available. Here, we introduce kimma (Kinship In Mixed Model Analysis), an open-source R package that provides linear and linear mixed effects modeling of RNA-seq data, including all of the aforementioned designs plus covariance random effects. kimma equals or outperforms other DEG pipelines in terms of sensitivity, computational time, and model complexity.

In particular, kimma provides:

  • A single function, kmFit, for flexible linear modeling of fixed, random, and complex random effects (previewed in the sketch after this list)
  • Multi-processor architecture for reduced computational time
  • Multiple model fit measures such as AIC and BIC
  • Easy integration with data visualization and downstream analyses in the BIGverse (see below)
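
These features come together in a single kmFit call. As a preview, here is a minimal sketch using the example objects introduced later in this vignette. This is not run here, and the processors argument is our assumed name for the parallel-processing parameter, so check ?kmFit in your installed version.

# Sketch only, not run. All arguments are explained step by step below.
kmFit(dat = example.voom,            # expression, sample, and gene data (EList)
      kin = example.kin,             # optional kinship/covariance matrix
      model = "~ virus + (1|ptID)",  # fixed effect plus random effect
      run_lm = TRUE,                 # simple linear model
      run_lme = TRUE,                # linear mixed effects model
      run_lmerel = TRUE,             # mixed effects model with kinship
      use_weights = TRUE,            # gene-level quality weights
      metrics = TRUE,                # return model fit metrics such as AIC
      processors = 2)                # assumed name for the parallelization argument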

0. Setup

Software and packages

This pipeline should be completed in R version 4.2.1 (2022-06-23) or newer. Please also download the following packages.

# CRAN packages
install.packages("tidyverse")

# Bioconductor packages
## These are dependencies for kimma that are beneficial to download separately
install.packages("BiocManager")
BiocManager::install(c("edgeR", "limma"))

# GitHub packages
install.packages("devtools")
devtools::install_github("BIGslu/BIGverse")

And load them into your current R session.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(BIGverse)
## ── Attaching packages ──────────────────────────────────────── BIGverse 1.0.0 ──
## ✔ kimma      1.5.0     ✔ BIGpicture 1.1.0
## ✔ RNAetc     1.1.0     ✔ SEARchways 1.1.0
## ── Conflicts ─────────────────────────────────────────── BIGverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

To ensure reproducibility with this document, set the random seed value.

set.seed(651)

Data

Example data were obtained from media controls and human rhinovirus (HRV)-infected human plasmacytoid dendritic cells. You can learn more about these data in

Dill-McFarland KA, Schwartz JT, Zhao H, Shao B, Fulkerson PC, Altman MC, Gill MA. 2022. Eosinophil-mediated suppression and Anti-IL-5 enhancement of plasmacytoid dendritic cell interferon responses in asthma. J Allergy Clin Immunol. Epub ahead of print. doi: 10.1016/j.jaci.2022.03.025. — GitHub

Specifically, this vignette uses bulk human RNA-seq processed from fastq to counts using SEAsnake and counts to voom using edgeR/limma. This results in voom-normalized, log2 counts per million (CPM) gene expression and associated sample and gene metadata. In total, 6 donors and 1000 random genes were selected for this vignette. Actual donor identifiers have been altered for privacy.

Research question

Our main research question is how human rhinovirus (HRV) infection impacts gene expression. As a secondary question, we are also interested in how individuals with asthma may respond differently to HRV compared to healthy individuals. Thus, we will explore models comparing media and HRV-infected samples (variable named virus) in asthmatic and healthy individuals (variable named asthma). We will then explore the impacts of patient co-variates, paired sample design, and random effects to improve model fit and the detection of differentially expressed genes.

1. Load data

All expression, gene, and sample data are contained in a single limma EList object. These data are available in the kimma package.

example.voom <- kimma::example.voom
names(example.voom)
## Loading required package: limma
## Warning: package 'limma' was built under R version 4.2.2
## [1] "genes"   "targets" "E"       "weights" "design"

The normalized log2 CPM expression data are contained in E.

example.voom$E[1:3,1:6]
##                     lib1     lib2     lib3     lib4     lib5     lib6
## ENSG00000000460 6.109675 7.998060 7.322775 7.327720 7.915870 7.989109
## ENSG00000001460 7.173806 4.828135 6.133741 8.130273 6.662114 7.241296
## ENSG00000002587 8.672611 7.460403 8.856207 8.130273 7.852445 7.067965

Sequencing library and donor metadata are in targets.

example.voom$targets[1:3,]
##      group  lib.size norm.factors libID   ptID median_cv_coverage virus  asthma
## lib1     1  79645.59     1.001378  lib1 donor1           0.513732  none healthy
## lib2     1  88007.91     0.951334  lib2 donor1           0.435162   HRV healthy
## lib3     1 178019.51     1.091153  lib3 donor2           0.374282  none healthy
##      batch
## lib1     2
## lib2     2
## lib3     2

Gene metadata are in genes.

example.voom$genes[1:3,]
##                 hgnc_symbol Previous symbols   Alias symbols
## ENSG00000000460       RAB12             <NA>            <NA>
## ENSG00000001460         MIA             <NA>          CD-RAP
## ENSG00000002587     C1orf52             <NA> gm117, FLJ44982
##                        gene_biotype        geneName
## ENSG00000000460 protein-coding gene ENSG00000206418
## ENSG00000001460 protein-coding gene ENSG00000261857
## ENSG00000002587 protein-coding gene ENSG00000162642

Voom gene-level quality weights are in weights. These were calculated with voomWithQualityWeights().

example.voom$weights[1:3,1:3]
##           [,1]      [,2]      [,3]
## [1,] 0.7585106 0.8274172 1.5710160
## [2,] 0.5288184 0.5555010 0.9553991
## [3,] 0.7259811 0.7906774 1.5015908
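
For reference, weights like these can be generated upstream from raw counts with edgeR/limma. Below is a minimal sketch, not run here; counts and mm are placeholders for a raw count matrix and a model matrix from your own data.

# Not run. Sketch of generating voom quality weights from raw counts.
# `counts` (genes x libraries) and `mm` (model matrix) are placeholders.
library(edgeR)
library(limma)

dge <- DGEList(counts = counts)                 # raw counts
dge <- calcNormFactors(dge)                     # TMM normalization factors
v <- voomWithQualityWeights(dge, design = mm)   # EList with $E (log2 CPM) and $weights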

And finally, the null model used in voom normalization is found in design.

example.voom$design[1:3,]
## lib1 lib2 lib3 
##    1    1    1

2. Simple linear model

2.1: A single main effect (virus)

First, we consider our research question and what variable(s) we need to answer that question. In these data, the first variable of interest is virus to determine how viral infection impacts gene expression. In R, this model is written as:

~ virus

On a coarse scale such as PCA, we can see that virus impacts gene expression with uninfected media controls (none) grouping together away from HRV-infected samples. Thus, we expect many differentially expressed genes for virus. Importantly, even if you don’t see good clustering in your data’s PCA, you may still have many significant genes!

# This is a BIGpicture function
plot_pca(example.voom, scale = TRUE, vars = "virus") 
## Joining with `by = join_by(libID)`
## $virus

We run this linear model in kimma using kmFit.

lm_virus <- kmFit(dat = example.voom, 
               model = "~ virus", 
               run_lm = TRUE)
lm model: expression~virus
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

We see that genes are significant for virus at several FDR cutoffs.

summarise_kmFit(fdr = lm_virus$lm)
## # A tibble: 2 × 7
##   variable             fdr_0.05 fdr_0.1 fdr_0.2 fdr_0.3 fdr_0.4 fdr_0.5
##   <fct>                   <int>   <int>   <int>   <int>   <int>   <int>
## 1 virusHRV                   45      66     133     169     259     324
## 2 total (nonredundant)       45      66     133     169     259     324

Compare to limma

We can compare this to limma (which you’ve already loaded into R as a dependency of kimma).

# Remove gene-level weights from example data
# limma uses weights by default whereas kimma does not
example.voom.noW <- example.voom
example.voom.noW$weights <- NULL
# Create model matrix
mm_virus <- model.matrix(~ virus, 
                         data = example.voom.noW$targets)

# Fit limma model
fit_virus <- lmFit(object = example.voom.noW,
                   design = mm_virus)

# Estimate P-values
efit_virus <- eBayes(fit = fit_virus)

# Summarise significant genes
# This extracts limma results into the same format as kimma results
limma_virus <- extract_lmFit(design = mm_virus,
                             fit = efit_virus)
summarise_lmFit(fdr = limma_virus$lm)
## # A tibble: 2 × 7
##   variable             fdr_0.05 fdr_0.1 fdr_0.2 fdr_0.3 fdr_0.4 fdr_0.5
##   <fct>                   <int>   <int>   <int>   <int>   <int>   <int>
## 1 virusHRV                   43      64     120     165     233     331
## 2 total (nonredundant)       43      64     120     165     233     331

We compare significant genes at FDR < 0.05. The majority of genes are found by both methods. Differences are likely attributable to limma’s empirical Bayes (eBayes) p-value estimation, which slightly increases power by shrinking gene-level variances toward the mean. You can read more about this method in the original limma paper (Ritchie et al. 2015).

# Another BIGpicture function!
plot_venn_genes(model_result = list("kimma lm"=lm_virus, 
                                    "limma lm"=limma_virus),
                fdr_cutoff = 0.05)
## $venn
## $venn$`0.05`

In parallel analyses of simulated gene expression, kimma and limma achieved the same sensitivity and specificity in simple linear models (Dill-McFarland 2022). Thus, the differences seen here are minimal, and we do not have a strong recommendation for one software over the other. Instead, we’ll move on to highlight kimma’s more complex models and additional features.

Gene-level weights

limma introduced gene-level weights to account for sequencing library quality (Law 2014). These weights are calculated in limma with voomWithQualityWeights(), and kimma can then incorporate them from the EList object or separately via the weights parameter.

# Check if weights exist in the data
names(example.voom)
## [1] "genes"   "targets" "E"       "weights" "design"
lm_weights <- kmFit(dat = example.voom, 
               model = "~ virus", 
               run_lm = TRUE,
               use_weights = TRUE)
lm model: expression~virus
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

We see that in this case, weights have little impact. However, in larger datasets with more variability in quality and in other model designs, weights can have a significant effect. In general, we use weights in our models.

plot_venn_genes(model_result = list("no weights" = lm_virus,
                                    "weights" = lm_weights), 
                fdr_cutoff = 0.05)
## $venn
## $venn$`0.05`

2.2: Multiple main effects (virus & asthma)

We are further interested in how individuals with and without asthma respond to virus. We can add variables to our model with + such as

~ virus + asthma

However, this model only captures the main effects of each variable in isolation. Specifically, this model tells you how virus impacts gene expression and how asthma impacts gene expression. It does not address how the impact of virus differs between those with and without asthma.

Interaction terms

One way to assess this is with an interaction term written as:

~ virus + asthma + virus:asthma

or short-handed with *. Note that these two models are equivalent in R.

~ virus * asthma

This model now tests both the main effects and their interaction.
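
As a quick sanity check (illustration only, not part of the pipeline), both formulas expand to the same design matrix:

# Illustration: the long and shorthand formulas give identical design matrices
mm_long  <- model.matrix(~ virus + asthma + virus:asthma, data = example.voom$targets)
mm_short <- model.matrix(~ virus * asthma, data = example.voom$targets)
all.equal(mm_long, mm_short)   # should return TRUE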

lm_virus_asthma <- kmFit(dat = example.voom, 
                         model = "~ virus*asthma", 
                         run_lm = TRUE,
                         use_weights = TRUE)
lm model: expression~virus*asthma
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

We now see 3 variables in our results, equivalent to the variables in the long form of our model equation. Notice that we’ve lost significance for many virus genes and have not found any genes at FDR < 0.05 for asthma or the interaction term. Because this data set is small, an interaction model is likely too complex, and we do not appear to have the power to detect interaction effects.

summarise_kmFit(fdr = lm_virus_asthma$lm)
## # A tibble: 4 × 7
##   variable               fdr_0.05 fdr_0.1 fdr_0.2 fdr_0.3 fdr_0.4 fdr_0.5
##   <fct>                     <int>   <int>   <int>   <int>   <int>   <int>
## 1 asthmahealthy                NA      NA       4      16      30      89
## 2 virusHRV                     14      28      62     111     157     218
## 3 virusHRV:asthmahealthy       NA      NA       1       1       1       1
## 4 total (nonredundant)         14      28      63     118     174     277

Importantly, a gene with a significant interaction term cannot be assessed for the main effects. For example at a higher FDR of 0.2, there is 1 gene that is significant for the interaction (green) and virus (blue). However, we cannot use the virus result for this gene, because it is comparing all HRV-infected to all media samples without taking asthma into account. Since we know there is an interaction effect between virus and asthma, the virus comparison alone is incorrectly averaging across samples we know to be different (e.g. the asthma groups). Similarly, we could not use an asthma main term result if the interaction term were also significant for that gene (overlap green - yellow).

plot_venn_genes(model_result = list("kimma lm"=lm_virus_asthma),
                fdr_cutoff = 0.2)
## $venn
## $venn$`0.2`

If this were our final model, our differentially expressed gene (DEG) list would be all interaction genes (green) as well as the intersect of virus and asthma main terms (blue + yellow). This second group encompasses genes that change with virus similarly in healthy and asthma donors but are always higher in one asthma group.
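
As a sketch of assembling that list in code, assuming the kimma results data frame contains gene, variable, and FDR columns matching the variable names shown in the summaries above:

# Sketch: assemble the DEG list described above at FDR < 0.2
# Assumes lm_virus_asthma$lm has gene, variable, and FDR columns
fdr.cut <- 0.2

interact_genes <- lm_virus_asthma$lm %>% 
  filter(variable == "virusHRV:asthmahealthy", FDR < fdr.cut) %>% 
  pull(gene)
virus_genes <- lm_virus_asthma$lm %>% 
  filter(variable == "virusHRV", FDR < fdr.cut) %>% 
  pull(gene)
asthma_genes <- lm_virus_asthma$lm %>% 
  filter(variable == "asthmahealthy", FDR < fdr.cut) %>% 
  pull(gene)

# All interaction genes plus genes significant for both main terms
deg <- union(interact_genes, intersect(virus_genes, asthma_genes))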

Pairwise contrasts

Another way to model interactions is with pairwise contrasts. Contrasts compare 2 or more groups to all other groups in that variable. For example, we’re interested in the 4 groups within the interaction term: none_healthy, none_asthma, HRV_healthy, HRV_asthma. We run these comparisons with the same interaction model as above, only now we also set run_contrast = TRUE. Because all combinations are run, we will get some pairwise comparisons we’re not interested in, but we will filter to those of interest later on.

lm_contrast <- kmFit(dat = example.voom, 
                     model = "~ virus*asthma", 
                     run_lm = TRUE, 
                     use_weights = TRUE,
                     run_contrast = TRUE, 
                     contrast_var = "virus:asthma")
lm model: expression~virus*asthma
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

We see the same main model outcome as before.

summarise_kmFit(fdr = lm_contrast$lm)
## # A tibble: 4 × 7
##   variable               fdr_0.05 fdr_0.1 fdr_0.2 fdr_0.3 fdr_0.4 fdr_0.5
##   <fct>                     <int>   <int>   <int>   <int>   <int>   <int>
## 1 asthmahealthy                NA      NA       4      16      30      89
## 2 virusHRV                     14      28      62     111     157     218
## 3 virusHRV:asthmahealthy       NA      NA       1       1       1       1
## 4 total (nonredundant)         14      28      63     118     174     277

We also get all pairwise contrasts between the 4 groups.

summarise_kmFit(fdr = lm_contrast$lm.contrast)
## # A tibble: 7 × 9
##   variable    contrast_ref contrast_lvl fdr_0.05 fdr_0.1 fdr_0.2 fdr_0.3 fdr_0.4
##   <fct>       <chr>        <chr>           <int>   <int>   <int>   <int>   <int>
## 1 virus*asth… none asthma  HRV asthma         14      28      62     111     157
## 2 virus*asth… none asthma  HRV healthy        56      95     199     279     375
## 3 virus*asth… none healthy HRV healthy         1       9      31      52      81
## 4 virus*asth… HRV asthma   HRV healthy        NA       4      17      26      47
## 5 virus*asth… HRV asthma   none healthy       NA      NA      50     134     204
## 6 virus*asth… none asthma  none healthy       NA      NA       4      16      30
## 7 total (non… <NA>         <NA>               60      99     230     350     484
## # ℹ 1 more variable: fdr_0.5 <int>

Not all of these contrasts are of interest, so we select just the effects of virus (blue, yellow) and asthma (green, red). We see that many genes change with virus in the asthma (yellow) and/or healthy (blue) groups. No genes differ between healthy and asthma either with (red) or without virus (green). The lack of significant genes in the right ovals is an artifact of this small data set with only 2 - 3 donors per group. In a real analysis, you may be interested in genes that only change with virus in one group (blue and yellow not overlapping) or those that change with virus (blue and/or yellow) AND are different with asthma (green and/or red).

plot_venn_genes(model_result = list("kimma contrast" = lm_contrast), 
                fdr_cutoff = 0.3,
                contrasts = c(
                  # Impacts of virus within asthma groups
                  "HRV healthy - none healthy",
                  "HRV asthma - none asthma",
                  # Impacts of asthma within virus groups
                  "none asthma - none healthy",
                  "HRV asthma - HRV healthy"))
## $venn
## $venn$`0.3`

When to use interaction vs contrasts

At their heart, interaction and contrast models are trying to answer the same question. However, statistically, they are very different. Contrasts compare means between groups (see below) and you must select which pairwise comparisons are meaningful.

An interaction term tests if two slopes differ. In these data, this is comparing the change (slope) in response to virus in healthy vs asthmatic donors like below. In most cases, it is more difficult to achieve significance when comparing slopes as opposed to means.

In general, we recommend using the interaction term to define differentially expressed genes (DEGs). Then, as a post-hoc test, run contrasts only on significant DEGs to further probe the results. This is demonstrated later in this tutorial. It is like running an ANOVA to compare groups A, B, and C and then TukeyHSD to determine which groups differ from which (A vs B, B vs C, A vs C).
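
In brief, that post-hoc workflow looks like the following sketch, assuming the kimma results contain gene, variable, and FDR columns as reflected in the summaries above:

# Sketch: post-hoc contrasts run only on interaction-significant genes
deg_interact <- lm_virus_asthma$lm %>% 
  filter(variable == "virusHRV:asthmahealthy", FDR < 0.2) %>% 
  pull(gene)

lm_posthoc <- kmFit(dat = example.voom, 
                    model = "~ virus*asthma", 
                    run_lm = TRUE, 
                    use_weights = TRUE,
                    run_contrast = TRUE, 
                    contrast_var = "virus:asthma",
                    subset_genes = deg_interact)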

Contrasts may be employed as your main model instead in cases such as:

  • Small data sets where you are under-powered and would benefit from the reduced complexity of contrasts (1 variable) vs an interaction (3 variables)
  • When there is no significance for the interaction term
  • When you are only interested in a subset of the pairwise comparisons encompassed by the interaction term. For example, if you wanted to test a longitudinal study as time point 1 vs 2, 2 vs 3, 3 vs 4, etc

Why do main model and contrast p-values not match?

If you run a two-level variable (like virus) through kimma contrasts, you may notice that the exact p-values are not the same as those from the main model. This is because contrasts are estimated using the emmeans package, which estimates significance slightly differently than the main linear model. You do not need to run two-level comparisons through contrasts, so use the main model results in the lm, lme, or lmerel data frames.

2.3: Co-variates (batch)

We also want to consider the effects of variables that may impact or prevent us from seeing our main effects. These co-variates can include patient variables like age and sex as well as technical variables like batch or data quality. As an example here, we will just use batch.

To determine if batch should be included in our model, let’s first see if it has a large impact on the data in PCA. We see clustering by batch; thus, we have evidence to include batch in the model.

plot_pca(example.voom, scale = TRUE, vars = "batch")
## Joining with `by = join_by(libID)`
## $batch

Next, we run models with and without the batch co-variate. To include co-variates, we add them to the models with +. We also set metrics = TRUE so that kimma gives us model fit metrics for comparison. For simplicity, we use only the virus variable as our main term.

lm_batch <- kmFit(dat = example.voom, 
                  model = "~ virus + batch", 
                  run_lm = TRUE,
                  use_weights = TRUE,
                  metrics = TRUE)
lm model: expression~virus+batch
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes
lm_virus <- kmFit(dat = example.voom, 
                  model = "~ virus", 
                  run_lm = TRUE,
                  use_weights = TRUE,
                  metrics = TRUE)
lm model: expression~virus
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

We then compare model fits with AIC. For more information, see https://en.wikipedia.org/wiki/Akaike_information_criterion. In general, a difference in AIC < 2 is considered no evidence of better fit, 2 - 8 is moderate evidence, and > 8 is strong evidence. Additional metrics are also available in kimma, including BIC, sigma, and R-squared.
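
For reference, the per-gene difference can also be computed by hand. Below is a minimal sketch, assuming the fit metrics returned with metrics = TRUE are stored in lm_virus$lm.fit and lm_batch$lm.fit with gene and AIC columns; check names(lm_virus) in your session for the exact element and column names.

# Sketch: per-gene delta AIC between the two models
# Assumes fit metrics live in *$lm.fit with gene and AIC columns
delta_aic <- full_join(
  select(lm_virus$lm.fit, gene, AIC_virus = AIC),
  select(lm_batch$lm.fit, gene, AIC_batch = AIC),
  by = "gene") %>% 
  mutate(delta = AIC_virus - AIC_batch)

# delta > 2: moderate evidence the batch model fits better
# delta < -2: moderate evidence the virus-only model fits better
summary(delta_aic$delta)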

Smaller AIC indicates a better-fitting model, so genes above the 0 line are better fit by the “x” model and genes below it are better fit by the “y” model. Throughout this tutorial, we put the more complex model as y. Our plotting function also outputs a message with how many genes are best fit by each model and the mean and standard deviation of the difference in AIC between the two models.

plot_fit2(model_result = lm_virus, x = "lm",
         model_result_y = lm_batch, y = "lm", 
         metrics = "AIC")
## Summary
##             Best fit Metric Total genes Mean delta Stdev delta
## 1        lm_virusHRV    AIC         822   1.511602   0.5271711
## 2 lm_virusHRV_batch2    AIC         178   2.291703   2.4258229

We see that each model is the best fit for a subset of genes. Batch improves the model fit for only about 18% of genes, and this improvement is moderate (change in AIC from 0 to -11). Also of importance, the model without batch does not have evidence of improved fit for the rest of the genes, with a mean ΔAIC below 2.

Next, we compare how many genes are significant with and without the co-variate.

plot_venn_genes(model_result = list("lm without batch"=lm_virus, 
                                    "lm with batch"=lm_batch), 
                fdr_cutoff = 0.05) 
## $venn
## $venn$`0.05`

First, consider how many genes are significant for the co-variate itself; here, batch is not significant for any genes. Next, compare the number of virus-significant genes. Adding batch to the model results in the loss of all virus-significant genes (green 45 vs blue 0). This is likely over-fitting, with a model too complex for the data set size.

To summarize, the batch co-variate:

  • Impacts overall gene expression in PCA
  • Moderately improves model fit for a minority (~18%) of genes
  • Does not meaningfully change fit for the remaining genes
  • Is significant for 0 genes in a linear model
  • Reduces the number of virus-significant genes

All evidence except PCA points to not including batch in the model. It is rare that all the evidence points to one conclusion, so it must be weighed along with biological and experimental information to determine the final model. In this case, I would choose to NOT include batch since the sample size is very small and the model fit differences are also small.

In your data, you may have co-variates with even more conflicting outcomes. In that case, it’s important to prioritize model fit and sample size over significant gene counts. You should not include or exclude a co-variate just because it gets you more significant genes for your main terms. This is especially true if the co-variate negatively impacts model fit.

You may also have non-statistical evidence for including a co-variate. If you have established biological evidence that a co-variate is important in your system or it’s standard in your field, you may wish to include it even if it does not improve fit and is not significant. If the co-variate, however, greatly worsens fit, you should not include it even if it’s standard. Instead, present fit plots like above to support its exclusion.

3. Linear mixed effects model

3.1: Random effects (ptID)

These data come from a paired study design with uninfected and virus-infected samples from the same donor’s cells. We take this into account in our model by using donor as a random effect, which allows the model to match samples from the same donor. This greatly improves our statistical power, as we now have 1 slope per donor instead of 1 mean slope across all donors, so we can see whether individual donors change similarly or not.

Random effects are added to the model with + (1|block) where the block is the variable you want to pair samples by. Thus, we now run a mixed effects model lme in kimma with

~ virus + (1|ptID)

In kimma, this is run similarly to a simple model except we add our random term and ask it to run_lme.

lme_virus <- kmFit(dat = example.voom, 
                   model = "~ virus + (1|ptID)", 
                   run_lme = TRUE, 
                   use_weights = TRUE,
                   metrics = TRUE)
lme/lmerel model: expression~virus+(1|ptID)
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

Comparing models as we did with co-variates, we see that genes are split by best fit, with more genes better fit by the simple “x” model (mean ΔAIC ≈ 3.6). However, we see many more significant genes in the model with the random effect. This illustrates the much improved power of a mixed effects model when you have paired samples. Despite a slight increase in AIC, we would choose the mixed effects model here because 1) differences in AIC are small for the majority of genes and 2) we know the experimental design is paired.

plot_fit2(model_result = lm_virus, x="lm", 
         model_result_y = lme_virus, y="lme", 
         metrics = "AIC")
## Summary
##               Best fit Metric Total genes Mean delta Stdev delta
## 1          lm_virusHRV    AIC         724   3.617818    2.356881
## 2 lme_virus_(1 | ptID)    AIC         276   2.519171    3.637943

plot_venn_genes(model_result = list("lm" = lm_virus,
                                    "lme" = lme_virus),
                fdr_cutoff = 0.05)
## $venn
## $venn$`0.05`

Compare to limma

limma offers a pseudo random effect in linear modeling with duplicateCorrelation. This estimates the random effect of donor across all genes and uses one value for all models. The full mixed effects model in kimma estimates the random effect for each gene, thus fitting a different value for each gene’s model.

In simulated analyses, we found that kimma’s true mixed effect outperforms limma’s pseudo random effect in paired designs (Dill-McFarland 2022).

For example, we run a limma model with the pseudo paired design for these data.

# Create model matrix
mm_lme <- model.matrix(~ virus,
                       data=example.voom$targets)

# Block by donor
consensus.corr <- duplicateCorrelation(
  object = example.voom$E, 
  design = mm_lme,
  block = example.voom$targets$ptID)
# View average correlation
consensus.corr$consensus.correlation
## [1] 0.357227
# Fit limma model
fit_lme <- lmFit(object = example.voom$E, 
                 design = mm_lme,
                 block = example.voom$targets$ptID,
                 correlation = consensus.corr$consensus.correlation)

# Estimate P-values
efit_lme <- eBayes(fit = fit_lme)

# Summarise significant genes
# This extracts limma results into the same format as kimma results
limma_lme <- extract_lmFit(design = mm_lme,
                             fit = efit_lme)

To compare model fit, limma only provides sigma, the standard deviation of the residuals, so we use that metric here. It shows a roughly even split in best-fit model by gene, with only small differences throughout. It is difficult to draw a conclusion from sigma alone, as it does not take into account as many aspects of model fit as metrics like AIC.

plot_fit2(model_result = lme_virus, x = "lme",
         model_result_y = limma_lme, y = "lm",
         metrics = "sigma")
## Summary
##               Best fit Metric Total genes Mean delta Stdev delta
## 1                 lm_2  sigma         491  0.2507702   0.1891429
## 2 lme_virus_(1 | ptID)  sigma         509  0.5486835   0.4096612

Continuing on, we see that using duplicate correlation in limma (blue) results in fewer significant genes than kimma’s full mixed effects model (yellow). Moreover, kimma detects all of the limma significant genes, capturing all of that signal plus additional genes.

plot_venn_genes(model_result = list("kimma" = lme_virus,
                                    "limma" = limma_lme),
                fdr_cutoff = 0.05)
## $venn
## $venn$`0.05`

Compare to dream

The dream function in the variancePartition package (Hoffman 2021) provides another way to run mixed effects models on RNA-seq data. In simulated analyses, dream had similar DEG detection to kimma but required re-calculation of gene-level weights, took more computational time, and did not provide model fit measures (Dill-McFarland 2022). Thus, we do not directly compare this method here. However, you can learn more about dream at https://www.bioconductor.org/packages/devel/bioc/vignettes/variancePartition/inst/doc/dream.html

3.2 Co-variates

Similar to a simple linear model, co-variates may be added and assessed in a mixed effects model by including them in the model formula, such as below. We will not walk through the full comparison here as it is the same process as in Section 2.3.

~ virus + batch + (1|ptID)
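
For completeness, the call itself would look like this (a sketch, not run here):

# Not run. Sketch of adding the batch co-variate to the mixed effects model.
lme_batch <- kmFit(dat = example.voom, 
                   model = "~ virus + batch + (1|ptID)", 
                   run_lme = TRUE, 
                   use_weights = TRUE,
                   metrics = TRUE)

# Then compare fit and significant genes as in Section 2.3, e.g.
# plot_fit2(model_result = lme_virus, x = "lme",
#           model_result_y = lme_batch, y = "lme", metrics = "AIC")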

4. Linear mixed effects model with covariance

Unlike limma and dream, kimma can incorporate complex random effects such as a covariance matrix. These variables contain more than 1 measure per sequencing library, such as pairwise comparisons between libraries. Thus, they cannot be used as co-variates or simple random effects. kimma leverages relmatLmer in the lme4qtl package (Ziyatdinov 2018) to fit these models.

4.1: Covariance random effect (kinship)

Kinship is a summative measure of genetic relatedness. It ranges from 0 to 1 with 1 being 100% identical (monozygotic twins). Other common values are 0.5 for parent-child, 0.25 for grandparent-grandchild, 0.125 for first cousins, etc. It is a pairwise measure with 1 value per pair of individuals. Here are the example kinship data.

example.kin
##        donor1 donor2 donor3 donor4 donor5 donor6
## donor1   1.00   0.50   0.10   0.20   0.15   0.10
## donor2   0.50   1.00   0.10   0.11   0.23   0.09
## donor3   0.10   0.10   1.00   0.10   0.20   0.15
## donor4   0.20   0.11   0.10   1.00   0.11   0.13
## donor5   0.15   0.23   0.20   0.11   1.00   0.17
## donor6   0.10   0.09   0.15   0.13   0.17   1.00

Because it is not a single value per sample or individual, kinship cannot be added to a model with + kin. Instead, it is used as a random effect where you block by individual so the model can identify an individual’s kinship values relative to all other individuals in the data set. We fit this type of model by setting run_lmerel = TRUE and providing the kinship matrix.

lmerel_virus <- kmFit(dat = example.voom, 
                   kin = example.kin, 
                   model = "~ virus + (1|ptID)",
                   run_lmerel = TRUE, 
                   use_weights = TRUE,
                   metrics = TRUE)
lme/lmerel model: expression~virus+(1|ptID)
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 1000 genes
Failed: 0 genes

Comparing to a mixed effects model without kinship, we see that kinship does not impact model fit, with very small ΔAIC (< 2) and even some genes with identical AIC for both models (Best fit = none). We also see kinship has very little impact on DEG detection. Of note, this is not surprising as the example kinship data are artificial and do not reflect the actual relatedness of these patients.

plot_fit2(model_result = lme_virus, x="lme", 
         model_result_y = lmerel_virus, y="lmerel", 
         metrics = "AIC")
## Summary
##                  Best fit Metric Total genes Mean delta Stdev delta
## 1    lme_virus_(1 | ptID)    AIC         260  0.3240494   0.3738785
## 2 lmerel_virus_(1 | ptID)    AIC         527  0.4022421   0.3637793
## 3                    none    AIC         213  0.0000000   0.0000000

plot_venn_genes(model_result = list("lme" = lme_virus,
                                    "lmerel" = lmerel_virus),
                fdr_cutoff = 0.05)
## $venn
## $venn$`0.05`

5. Additional kimma features

Multiple models

kimma can run more than one model at the same time by setting multiple run parameters to TRUE. This allows easy comparison of models when you’re considering whether or not to use mixed effects. For example, the following call runs all three of these models:

~ virus
~ virus + (1|ptID)
~ virus + (kinship|ptID)

kmFit(dat = example.voom, 
      kin = example.kin, 
      model = "~ virus + (1|ptID)",
      run_lm = TRUE,
      run_lme = TRUE,
      run_lmerel = TRUE, 
      use_weights = TRUE)

Other data inputs

kimma can run models on simple data frames. This allows inputs from other pipelines as well as use on non-RNA-seq data. For example, the following fits the same model using individual data frames instead of the single EList object.

kmFit(counts = example.voom$E,
      meta = example.voom$targets,
      genes = example.voom$genes,
      weights = example.voom$weights,
      model = "~ virus",
      run_lm = TRUE,
      use_weights = TRUE)

kimma also allows different patient and library identifiers. By default, patientID = "ptID" and libraryID = "libID", but these parameters can be changed to whatever names your data use. Note that you will also need to change your model formula for patient IDs other than ptID.
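
For example, if your metadata used a hypothetical donorID column instead of ptID, the call might look like the following sketch (not run; my.voom and donorID are placeholders):

# Not run. Hypothetical data with a "donorID" column instead of "ptID".
kmFit(dat = my.voom,                     # hypothetical EList
      model = "~ virus + (1|donorID)",   # random effect uses the new name
      run_lme = TRUE,
      use_weights = TRUE,
      patientID = "donorID",
      libraryID = "libID")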

Data subsets

Instead of creating multiple EList objects for subsets of interest, kimma directly subsets data while modeling. Simply use subset_var and subset_lvl to identify a subset of samples, such as testing virus effects in only healthy donors below.

kmFit(dat = example.voom, 
      model = "~ virus + (1|ptID)",
      run_lme = TRUE,
      use_weights = TRUE,
      subset_var = list("asthma"),
      subset_lvl = list("healthy"))

You can specify multiple subset variables and levels using lists, where the first element of subset_var corresponds to the first vector in subset_lvl. For example, this selects all of these data.

kmFit(dat = example.voom, 
      model = "~ virus + (1|ptID)",
      run_lme = TRUE,
      use_weights = TRUE,
      subset_var = list("virus", "asthma"),
      subset_lvl = list(c("none", "HRV"),
                        c("healthy", "asthma")))

Or run on only a subset of genes with subset_genes.

kmFit(dat = example.voom, 
      model = "~ asthma + (1|ptID)",
      run_lme = TRUE,
      use_weights = TRUE,
      subset_genes = c("ENSG00000206418","ENSG00000261857"))

Expression quantitative loci (eQTL)

kimma’s function kmFit_eQTL builds on the kmFit modeling structure to test gene expression against single nucleotide polymorphisms (SNPs). This is not meant as a replacement for GWAS tools like the GENESIS package but instead supports targeted eQTL analyses potentially of interest in data sets with kinship.

# Create dummy SNP data
## Genotype data
dat.snp <- data.frame(
  ptID = unique(example.voom$targets$ptID),
  snp1 = sample(0:2, size = 6, replace = TRUE),
  snp2 = sample(0:2, size = 6, replace = TRUE))

dat.snp
##     ptID snp1 snp2
## 1 donor1    2    1
## 2 donor2    2    0
## 3 donor3    0    2
## 4 donor4    1    2
## 5 donor5    1    0
## 6 donor6    2    1
## Mapping data (SNP to genes)
dat.map <- data.frame(
  gene = example.voom$genes$geneName[c(1,1,10,12)],
  snpID = c("snp1", "snp2", "snp2", "snp1")
)

dat.map
##              gene snpID
## 1 ENSG00000206418  snp1
## 2 ENSG00000206418  snp2
## 3 ENSG00000013583  snp2
## 4 ENSG00000124701  snp1
eqtl <- kmFit_eQTL(dat_snp = dat.snp,
                   dat_map = dat.map,
                   dat = example.voom, 
                   model = "~ genotype + (1|ptID)",
                   run_lme = TRUE,
                   use_weights = TRUE)
Running snp1
lme/lmerel model: expression~snp1+(1|ptID)
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 2 genes
Failed: 0 genes

Running snp2
lme/lmerel model: expression~snp2+(1|ptID)
Input: 12 libraries from 6 unique patients
Model: 12 libraries
Complete: 2 genes
Failed: 0 genes

We see that no SNPs are associated with gene expression based on the gene-SNP mappings we provided. In order to see all variables in the summary, we look at extremely high FDR cutoffs, because these are random SNP data not expected to correlate with gene expression.

summarise_kmFit(eqtl$lme,
                fdr_cutoff = c(0.05, 0.8, 0.9))
## # A tibble: 4 × 4
##   variable             fdr_0.05 fdr_0.8 fdr_0.9
##   <fct>                   <int>   <int>   <int>
## 1 (1 | ptID)                  2       2       2
## 2 snp1                       NA       2       2
## 3 snp2                       NA       2       2
## 4 total (nonredundant)        1       3       3

6. Visualize gene expression

BIGpicture provides a function to plot expression for genes of interest across many variables. For example, we plot a virus:asthma interaction-significant gene from our earlier simple linear model. Each circle is an RNA-seq library colored by donor. The black squares are overall group means, and boxplots are interquartile ranges.

plot_genes(dat = example.voom, 
           fdr = lm_virus_asthma$lm, 
           subset_genes = "ENSG00000090097",
           variables = c("virus","asthma","virus:asthma"), 
           geneID = "geneName",
           colorID="ptID")
## $ENSG00000090097

If you want custom ordering within variables, you need to create factors of the correct order, ideally before running the model.

example.voom$targets <- example.voom$targets %>% 
  mutate(interaction = paste(virus, asthma, sep="\n"),
         interaction = factor(interaction,
                              levels=c("none\nhealthy","none\nasthma",
                                       "HRV\nhealthy","HRV\nasthma")))

plot_genes(dat = example.voom, 
           fdr = lm_virus_asthma$lm, 
           subset_genes = "ENSG00000090097",
           variables = "interaction", 
           geneID = "geneName",
           colorID="ptID")
## $ENSG00000090097

R session

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] limma_3.54.2     SEARchways_1.1.0 BIGpicture_1.1.0 RNAetc_1.1.0    
##  [5] kimma_1.5.0      BIGverse_1.0.0   lubridate_1.9.2  forcats_1.0.0   
##  [9] stringr_1.5.0    dplyr_1.1.2      purrr_1.0.1      readr_2.1.4     
## [13] tidyr_1.3.0      tibble_3.2.1     ggplot2_3.4.2    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] lattice_0.21-8    digest_0.6.31     foreach_1.5.2     utf8_1.2.3       
##  [5] R6_2.5.1          backports_1.4.1   evaluate_0.21     httr_1.4.6       
##  [9] highr_0.10        pillar_1.9.0      rlang_1.1.1       rstudioapi_0.14  
## [13] car_3.1-2         jquerylib_0.1.4   Matrix_1.5-4.1    rmarkdown_2.22   
## [17] labeling_0.4.2    splines_4.2.1     statmod_1.5.0     munsell_0.5.0    
## [21] broom_1.0.4       compiler_4.2.1    xfun_0.39         pkgconfig_2.0.3  
## [25] mgcv_1.8-42       htmltools_0.5.5   tidyselect_1.2.0  ggvenn_0.1.10    
## [29] codetools_0.2-19  fansi_1.0.4       ggpubr_0.6.0      crayon_1.5.2     
## [33] tzdb_0.4.0        withr_2.5.0       grid_4.2.1        nlme_3.1-162     
## [37] jsonlite_1.8.4    gtable_0.3.3      lifecycle_1.0.3   magrittr_2.0.3   
## [41] scales_1.2.1      carData_3.0-5     cli_3.6.1         stringi_1.7.12   
## [45] cachem_1.0.8      ggsignif_0.6.4    farver_2.1.1      doParallel_1.0.17
## [49] bslib_0.4.2       generics_0.1.3    vctrs_0.6.2       cowplot_1.1.1    
## [53] iterators_1.0.14  tools_4.2.1       glue_1.6.2        hms_1.1.3        
## [57] abind_1.4-5       parallel_4.2.1    fastmap_1.1.1     yaml_2.3.7       
## [61] timechange_0.2.0  colorspace_2.1-0  rstatix_0.7.2     knitr_1.43       
## [65] patchwork_1.1.2   sass_0.4.6