4.2 RNA-seq linear modeling

Overview

Thus far, we’ve modeled one gene at a time. However, RNA-seq data sets have 1000s of genes! Here, we introduce some R packages designed for running linear models across all genes in an efficient manor.

Load packages

Load the following packages. For more information on installation, see the setup instructions.

#Data manipulation and plotting
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.8
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
#Linear modeling
library(kimma)
library(limma)
## Warning: package 'limma' was built under R version 4.1.3
set.seed(651)

Load data

All counts, gene, and sample metadata are contained in a single object.

#Attach data
attach("0_data/dat_voom.RData")

Note how the data object does not appear in the R environment. This is because attach gives you access to the data in R but does not load it into the current environment. We can access the data and do so here to rename it with a shorter name.

#Rename data object to be shorter
dat <- dat.abund.norm.voom

Now the dat object appears in our current R environment! We can still access dat.abund.norm.voom if we want but don’t need it cluttering up the environment pane.

Modeling in limma

Simple linear regression in limma

limma take model inputs as a model matrix instead of the ~ variable formula format you’ve seen with lm and lmer. This matrix encodes all the variables from the formula as 0 for N and 1 for Y. For example, here is the model matrix for the formula ~ condition + sex

mm_limma <- model.matrix(~ condition + sex, data=dat$targets)

head(mm_limma)
##            (Intercept) conditionMtb sexM
## pt01_Media           1            0    1
## pt01_Mtb             1            1    1
## pt02_Media           1            0    1
## pt02_Mtb             1            1    1
## pt03_Media           1            0    1
## pt03_Mtb             1            1    1

Note that we only see one level for each variable: Mtb for condition and female for sex. This shows that the Mtb samples are being compared to the reference level (which is media). The reference is determined alphabetically if the variable is character and by level if the variable is a factor. So, if you want a different order than alphabetical, you need to format your variables as factors and set the appropriate order. For example, here we set the reference of sex to female instead of male.

# Set order of sex variable as factors
dat$targets <- dat$targets %>% 
  mutate(sex = factor(sex, levels = c("M","F")))

#Make model matrix
mm_limma <- model.matrix(~ condition + sex, data=dat$targets)

head(mm_limma)
##            (Intercept) conditionMtb sexF
## pt01_Media           1            0    0
## pt01_Mtb             1            1    0
## pt02_Media           1            0    0
## pt02_Mtb             1            1    0
## pt03_Media           1            0    0
## pt03_Mtb             1            1    0

Once we have a model matrix, we fit the model and estimate P-values.

#Fit model
fit_limma <- lmFit(object = dat$E, design = mm_limma,
                   weights = dat$weights)
#Estimate significance
efit_limma <- eBayes(fit = fit_limma)

These outputs contain a lot of information. We can pull out the most commonly used pieces with topTable. By default, this gives your the 10 most significant genes across the entire model.

#Extract results
fdr_limma <- topTable(fit = efit_limma)
## Removing intercept from test coefficients

With some additional parameters, we can get all gene results for each variable.

fdr_limma_mtb <- topTable(fit = efit_limma, 
                      coef = "conditionMtb", number = Inf)
fdr_limma_sex <- topTable(fit = efit_limma, 
                      coef = "sexF", number = Inf)

head(fdr_limma_mtb)
##             logFC  AveExpr        t      P.Value    adj.P.Val        B
## PLEKHM3  2.690201 5.973384 15.37621 6.888863e-13 9.244165e-09 19.59613
## MAFF     3.276295 6.164289 14.07285 3.769007e-12 2.528815e-08 17.95641
## PPP1R15B 1.856514 8.167870 13.37282 9.911130e-12 3.472082e-08 17.02685
## CCL3     5.390670 8.467644 13.34214 1.034975e-11 3.472082e-08 16.94657
## RABGEF1  1.732223 6.129960 12.94266 1.832223e-11 3.582863e-08 16.44090
## IRAK2    4.532959 5.898344 12.93913 1.841603e-11 3.582863e-08 16.36725
head(fdr_limma_sex)
##             logFC  AveExpr         t      P.Value    adj.P.Val        B
## DDX3Y  -10.860939 4.023600 -67.09595 5.979891e-26 4.012208e-22 39.77376
## RPS4Y1  -9.854802 3.295219 -67.41397 5.418008e-26 4.012208e-22 37.39048
## KDM5D   -9.253437 2.870455 -58.22743 1.149610e-24 5.142204e-21 34.82432
## UTY     -8.891150 2.884104 -40.20886 2.522091e-21 6.768787e-18 32.06835
## EIF1AY  -7.837518 1.921431 -49.99166 2.749097e-23 9.222533e-20 29.63353
## ZFY     -8.025645 2.634853 -32.79599 1.693467e-19 3.787438e-16 29.10000

The variables included are:

  • logFC: log fold change. The sign is relative to your reference. For example, negative logFC for conditionMtb means Mtb minus media is negative and thus, expression is lower in Mtb-infected samples.
  • AveExpr: average expression across all samples
  • t: test statistic for significance
  • P.Value
  • adj.P.Val: FDR adjusted P-value
  • B: beta coefficient

With some tidyverse manipulation, we can get results for all genes and variables in one table. Or you can use the kimma function extract_lmFit and skip the coding! This also renames the columns to match kimma’s output.

fdr_limma <- extract_lmFit(design = mm_limma, fit = efit_limma)

head(fdr_limma)
##   model    gene    variable estimate         pval          FDR        t
## 1 limma ANKRD17 (Intercept) 8.000626 8.945156e-35 4.396458e-31 177.2443
## 2 limma   PRPF6 (Intercept) 7.134773 1.965776e-34 4.396458e-31 170.7053
## 3 limma PRPF40A (Intercept) 8.336440 1.663704e-34 4.396458e-31 172.0705
## 4 limma   SF3B2 (Intercept) 8.525784 1.416921e-34 4.396458e-31 173.3946
## 5 limma   MATR3 (Intercept) 8.149936 2.524933e-34 4.840297e-31 168.6773
## 6 limma   CNOT1 (Intercept) 8.725961 1.665219e-34 4.396458e-31 172.0630
##          B  AveExpr
## 1 65.65834 8.002042
## 2 65.55667 7.037995
## 3 65.11819 8.291966
## 4 65.09105 8.497101
## 5 64.94127 8.263252
## 6 64.89197 8.632038

Paired sample design in limma

limma uses a shortcut to model paired sample designs. Unlike what you saw with lmer (which is a true mixed effect model), limma estimates the mean correlation of gene expression between pairs. This is an okay approximation of a mixed effects model and runs very fast.

Using the same model as before, we can calculate the mean correlation.

consensus.corr <- duplicateCorrelation(object = dat$E, 
                                       design = mm_limma,
                                       block = dat$targets$ptID)
consensus.corr$consensus.correlation
## [1] 0.3925634

Note: If you get an error about the package statmod, please install this package with install.packages("statmod")

Then incorporate this estimate into our model fitting.

# Fit model
fit_limma2 <- lmFit(object = dat$E, design = mm_limma,
                    weights = dat$weights,
                    block = dat$targets$ptID,
                    correlation = consensus.corr$consensus.correlation)
#Estimate p-values
efit_limma2 <- eBayes(fit_limma2)

#Get results
fdr_limma2 <- extract_lmFit(design = mm_limma, fit = efit_limma2)

Comparing these results for the condition variable, we see that the pseudo-paired model with duplicateCorrelation tends to give lower FDR values. This doesn’t necessarily mean the model is a better fit!

#Format results for merging
temp <- fdr_limma %>% 
  select(gene, variable, FDR) %>% 
  filter(variable == "conditionMtb") %>% 
  rename(`limma` = FDR)

temp2 <- fdr_limma2 %>% 
  select(gene, variable, FDR) %>% 
  filter(variable == "conditionMtb") %>% 
  rename(limma_dupCor = FDR)

#Merge results and plot FDR
full_join(temp, temp2) %>% 
  ggplot(aes(x = `limma`, y = limma_dupCor)) +
  geom_point(alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  theme_classic() +
  coord_fixed() +
  labs(title = "FDR for Mtb vs media")
## Joining, by = c("gene", "variable")

limma provides one fit quality metric, sigma, which is the estimated standard deviation of the errors. Looking at this, we see that the best fit model (lowest sigma) varies from gene to gene. To be honest, however, sigma is not the best way to assess model fit. We’ll next move into kimma where we can compare models with more fit metrics as well as run a true paired mixed effects model.

data.frame(`limma` = efit_limma$sigma,
           limma_dupCor = efit_limma2$sigma) %>% 
  
  ggplot(aes(x = `limma`, y = limma_dupCor)) +
  geom_point(alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  theme_classic() +
  coord_fixed() +
  labs(title = "Sigma")

Modeling in kimma

kimma supports more flexible modeling of RNA-seq data including simple linear and linear mixed effects models with co-variates, weights, random effects, and matrix random effects. Let’s run the same models as we did with limma.

Note that kimma is much slower than limma. It can be run on multiple processors to increase speed but if you don’t have a very powerful laptop, DO NOT RUN the next section of code.

fit_kimma <- kmFit(dat = dat, 
                   model = "~ condition + sex + (1|ptID)", 
                   use.weights = TRUE,
                   run.lm = TRUE, run.lme = TRUE,
                   metrics = TRUE)

save(fit_kimma, file="0_data/kimma_results.RData")
lm model: expression~condition+sex
lme/lmerel model: expression~condition+sex+(1|ptID)
Input: 20 libraries from 10 unique patients
Model: 20 libraries
Complete: 13419 genes
Failed: 0 genes

In the interest of time, we’ve provided the kimma results for you.

load("0_data/kimma_results.RData")

The kimma output contains 4 data frames: one that matches the limma output and one with fit metrics for each model.

names(fit_kimma)
## [1] "lm"      "lme"     "lm.fit"  "lme.fit"
head(fit_kimma$lm)
## # A tibble: 6 × 7
##   model gene  variable     estimate     pval      FDR gene_biotype  
##   <chr> <chr> <chr>           <dbl>    <dbl>    <dbl> <chr>         
## 1 lm    A1BG  (Intercept)    1.38   9.83e- 7 1.35e- 6 protein_coding
## 2 lm    A1BG  conditionMtb  -0.313  2.79e- 1 4.09e- 1 protein_coding
## 3 lm    A1BG  sexF          -0.0796 8.16e- 1 9.39e- 1 protein_coding
## 4 lm    A2M   (Intercept)    6.79   7.69e-15 1.36e-14 protein_coding
## 5 lm    A2M   conditionMtb  -0.127  7.36e- 1 8.15e- 1 protein_coding
## 6 lm    A2M   sexF          -1.44   3.96e- 3 8.24e- 2 protein_coding
head(fit_kimma$lm.fit)
## # A tibble: 6 × 8
##   model  gene   sigma   AIC   BIC    Rsq adj_Rsq gene_biotype  
##   <chr>  <chr>  <dbl> <dbl> <dbl>  <dbl>   <dbl> <chr>         
## 1 lm.fit A1BG   0.766  43.4  47.4 0.0730 -0.0360 protein_coding
## 2 lm.fit A2M    2.82   54.9  58.9 0.400   0.329  protein_coding
## 3 lm.fit A2ML1  0.582  58.4  62.4 0.638   0.595  protein_coding
## 4 lm.fit A4GALT 1.89   80.0  84.0 0.440   0.374  protein_coding
## 5 lm.fit AAAS   0.971  25.1  29.1 0.335   0.257  protein_coding
## 6 lm.fit AACS   0.706  15.6  19.6 0.336   0.258  protein_coding

We can now look at metrics like AIC where we see that best fit still varies by gene (which is very common) and the overall AIC mean is lower for the simple linear model without paired design.

fit_kimma_all <- full_join(fit_kimma$lm.fit, fit_kimma$lme.fit, 
          by = c("gene"), suffix = c("_lm","_lme"))

fit_kimma_all %>%
  ggplot(aes(x = AIC_lm, y = AIC_lme)) +
  geom_point(alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  theme_classic() +
  coord_fixed() +
  labs(title = "AIC")

mean(fit_kimma$lm.fit$AIC)
## [1] 33.39391
mean(fit_kimma$lme.fit$AIC)
## [1] 37.71397

So in this case, which model do we go with? It would be nice if one model showed consistently better fit across our genes, but this rarely happens. For this experiment, we know we have a paired design so either limma with duplicateCorrelation or kimma with run.lme is appropriate. In our experience, a true mixed effects model in kimma yields fewer false positive genes when you have a paired design, even if metrics like AIC do not support it as the best fit model.

Other models

We don’t have time to go over all the potential models of these data. So here’s a quick reference for you to explore later on.

  • Interaction terms
    • limma or kimma formula with * such as ~ condition * sex which is shorthand for ~ condition + sex + condition:sex
  • Pairwise comparisons between 3+ levels
    • limma makeContrasts
    • kimma run.contrasts = TRUE
  • Matrix random effects
    • limma not supported
    • kimma provide the matrix kin and set run.lmerel = TRUE
  • Removing gene-level weights
  • dream in the package variancePartition

For more help with limma, see Chapter 15 http://bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf

For kimma, see https://github.com/BIGslu/tutorials/blob/main/RNAseq/3.Hawn_RNAseq_voom.to.DEG.pdf

Useful kimma plots

If you use kimma, our other package BIGpicture has several plotting functions you may find useful. Here are some examples.

devtools::install_github("BIGslu/BIGpicture")
library(BIGpicture)

Principle component analysis (PCA)

plot_pca(dat, vars = c("condition", "outlier"))
## Joining, by = "libID"
## $condition

## 
## $outlier

Model fit metrics

plot_fit(model_result = fit_kimma, 
         x="lm", y="lme", metrics = c("AIC","adj_Rsq"))
## Summary
##   Best fit  Metric Total genes Mean delta Stdev delta
## 1       lm adj_Rsq        3583  3.8238324  118.478571
## 2      lme adj_Rsq        9836  0.9833688   28.855033
## 3       lm     AIC       10647  6.5535246    3.859962
## 4      lme     AIC        2772  4.2584460    4.588052

Significant gene Venn

plot_venn_genes(fit_kimma, model = "lme", fdr.cutoff = 0.05)

Gene expression boxplots

plot_genes(dat, fdr = fit_kimma$lme, geneID = "hgnc_symbol",
           subset.genes = "IFNG", variables = c("condition","sex"))
## [[1]]

STRING protein-protein interaction network

#list top 10 most significant genes
genes.signif <- fit_kimma$lme %>% 
  filter(variable == "condition") %>% 
  slice_min(order_by = FDR, n = 10) %>% 
  pull(gene)

#map genes to STRING database
map <- map_string(genes.signif)
#make plot
plot_string(map)