Thus far, we’ve modeled one gene at a time. However, RNA-seq data sets have 1000s of genes! Here, we introduce some R packages designed for running linear models across all genes in an efficient manor.
Load the following packages. For more information on installation, see the setup instructions.
#Data manipulation and plotting
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.8
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
#Linear modeling
library(kimma)
library(limma)
## Warning: package 'limma' was built under R version 4.1.3
set.seed(651)
All counts, gene, and sample metadata are contained in a single object.
#Attach data
attach("0_data/dat_voom.RData")
Note how the data object does not appear in the R environment. This is because attach
gives you access to the data in R but does not load
it into the current environment. We can access the data and do so here to rename it with a shorter name.
#Rename data object to be shorter
dat <- dat.abund.norm.voom
Now the dat
object appears in our current R environment! We can still access dat.abund.norm.voom
if we want but don’t need it cluttering up the environment pane.
limma
limma
limma
take model inputs as a model matrix instead of the ~ variable
formula format you’ve seen with lm
and lmer
. This matrix encodes all the variables from the formula as 0 for N and 1 for Y. For example, here is the model matrix for the formula ~ condition + sex
mm_limma <- model.matrix(~ condition + sex, data=dat$targets)
head(mm_limma)
## (Intercept) conditionMtb sexM
## pt01_Media 1 0 1
## pt01_Mtb 1 1 1
## pt02_Media 1 0 1
## pt02_Mtb 1 1 1
## pt03_Media 1 0 1
## pt03_Mtb 1 1 1
Note that we only see one level for each variable: Mtb for condition and female for sex. This shows that the Mtb samples are being compared to the reference level (which is media). The reference is determined alphabetically if the variable is character and by level if the variable is a factor. So, if you want a different order than alphabetical, you need to format your variables as factors and set the appropriate order. For example, here we set the reference of sex to female instead of male.
# Set order of sex variable as factors
dat$targets <- dat$targets %>%
mutate(sex = factor(sex, levels = c("M","F")))
#Make model matrix
mm_limma <- model.matrix(~ condition + sex, data=dat$targets)
head(mm_limma)
## (Intercept) conditionMtb sexF
## pt01_Media 1 0 0
## pt01_Mtb 1 1 0
## pt02_Media 1 0 0
## pt02_Mtb 1 1 0
## pt03_Media 1 0 0
## pt03_Mtb 1 1 0
Once we have a model matrix, we fit the model and estimate P-values.
#Fit model
fit_limma <- lmFit(object = dat$E, design = mm_limma,
weights = dat$weights)
#Estimate significance
efit_limma <- eBayes(fit = fit_limma)
These outputs contain a lot of information. We can pull out the most commonly used pieces with topTable
. By default, this gives your the 10 most significant genes across the entire model.
#Extract results
fdr_limma <- topTable(fit = efit_limma)
## Removing intercept from test coefficients
With some additional parameters, we can get all gene results for each variable.
fdr_limma_mtb <- topTable(fit = efit_limma,
coef = "conditionMtb", number = Inf)
fdr_limma_sex <- topTable(fit = efit_limma,
coef = "sexF", number = Inf)
head(fdr_limma_mtb)
## logFC AveExpr t P.Value adj.P.Val B
## PLEKHM3 2.690201 5.973384 15.37621 6.888863e-13 9.244165e-09 19.59613
## MAFF 3.276295 6.164289 14.07285 3.769007e-12 2.528815e-08 17.95641
## PPP1R15B 1.856514 8.167870 13.37282 9.911130e-12 3.472082e-08 17.02685
## CCL3 5.390670 8.467644 13.34214 1.034975e-11 3.472082e-08 16.94657
## RABGEF1 1.732223 6.129960 12.94266 1.832223e-11 3.582863e-08 16.44090
## IRAK2 4.532959 5.898344 12.93913 1.841603e-11 3.582863e-08 16.36725
head(fdr_limma_sex)
## logFC AveExpr t P.Value adj.P.Val B
## DDX3Y -10.860939 4.023600 -67.09595 5.979891e-26 4.012208e-22 39.77376
## RPS4Y1 -9.854802 3.295219 -67.41397 5.418008e-26 4.012208e-22 37.39048
## KDM5D -9.253437 2.870455 -58.22743 1.149610e-24 5.142204e-21 34.82432
## UTY -8.891150 2.884104 -40.20886 2.522091e-21 6.768787e-18 32.06835
## EIF1AY -7.837518 1.921431 -49.99166 2.749097e-23 9.222533e-20 29.63353
## ZFY -8.025645 2.634853 -32.79599 1.693467e-19 3.787438e-16 29.10000
The variables included are:
With some tidyverse manipulation, we can get results for all genes and variables in one table. Or you can use the kimma
function extract_lmFit
and skip the coding! This also renames the columns to match kimma
’s output.
fdr_limma <- extract_lmFit(design = mm_limma, fit = efit_limma)
head(fdr_limma)
## model gene variable estimate pval FDR t
## 1 limma ANKRD17 (Intercept) 8.000626 8.945156e-35 4.396458e-31 177.2443
## 2 limma PRPF6 (Intercept) 7.134773 1.965776e-34 4.396458e-31 170.7053
## 3 limma PRPF40A (Intercept) 8.336440 1.663704e-34 4.396458e-31 172.0705
## 4 limma SF3B2 (Intercept) 8.525784 1.416921e-34 4.396458e-31 173.3946
## 5 limma MATR3 (Intercept) 8.149936 2.524933e-34 4.840297e-31 168.6773
## 6 limma CNOT1 (Intercept) 8.725961 1.665219e-34 4.396458e-31 172.0630
## B AveExpr
## 1 65.65834 8.002042
## 2 65.55667 7.037995
## 3 65.11819 8.291966
## 4 65.09105 8.497101
## 5 64.94127 8.263252
## 6 64.89197 8.632038
limma
limma
uses a shortcut to model paired sample designs. Unlike what you saw with lmer
(which is a true mixed effect model), limma
estimates the mean correlation of gene expression between pairs. This is an okay approximation of a mixed effects model and runs very fast.
Using the same model as before, we can calculate the mean correlation.
consensus.corr <- duplicateCorrelation(object = dat$E,
design = mm_limma,
block = dat$targets$ptID)
consensus.corr$consensus.correlation
## [1] 0.3925634
Note: If you get an error about the package statmod
, please install this package with install.packages("statmod")
Then incorporate this estimate into our model fitting.
# Fit model
fit_limma2 <- lmFit(object = dat$E, design = mm_limma,
weights = dat$weights,
block = dat$targets$ptID,
correlation = consensus.corr$consensus.correlation)
#Estimate p-values
efit_limma2 <- eBayes(fit_limma2)
#Get results
fdr_limma2 <- extract_lmFit(design = mm_limma, fit = efit_limma2)
Comparing these results for the condition variable, we see that the pseudo-paired model with duplicateCorrelation tends to give lower FDR values. This doesn’t necessarily mean the model is a better fit!
#Format results for merging
temp <- fdr_limma %>%
select(gene, variable, FDR) %>%
filter(variable == "conditionMtb") %>%
rename(`limma` = FDR)
temp2 <- fdr_limma2 %>%
select(gene, variable, FDR) %>%
filter(variable == "conditionMtb") %>%
rename(limma_dupCor = FDR)
#Merge results and plot FDR
full_join(temp, temp2) %>%
ggplot(aes(x = `limma`, y = limma_dupCor)) +
geom_point(alpha = 0.2) +
geom_abline(intercept = 0, slope = 1, color = "red") +
theme_classic() +
coord_fixed() +
labs(title = "FDR for Mtb vs media")
## Joining, by = c("gene", "variable")
limma
provides one fit quality metric, sigma, which is the estimated standard deviation of the errors. Looking at this, we see that the best fit model (lowest sigma) varies from gene to gene. To be honest, however, sigma is not the best way to assess model fit. We’ll next move into kimma
where we can compare models with more fit metrics as well as run a true paired mixed effects model.
data.frame(`limma` = efit_limma$sigma,
limma_dupCor = efit_limma2$sigma) %>%
ggplot(aes(x = `limma`, y = limma_dupCor)) +
geom_point(alpha = 0.2) +
geom_abline(intercept = 0, slope = 1, color = "red") +
theme_classic() +
coord_fixed() +
labs(title = "Sigma")
kimma
kimma
supports more flexible modeling of RNA-seq data including simple linear and linear mixed effects models with co-variates, weights, random effects, and matrix random effects. Let’s run the same models as we did with limma
.
Note that kimma
is much slower than limma
. It can be run on multiple processors to increase speed but if you don’t have a very powerful laptop, DO NOT RUN the next section of code.
fit_kimma <- kmFit(dat = dat,
model = "~ condition + sex + (1|ptID)",
use.weights = TRUE,
run.lm = TRUE, run.lme = TRUE,
metrics = TRUE)
save(fit_kimma, file="0_data/kimma_results.RData")
lm model: expression~condition+sex
lme/lmerel model: expression~condition+sex+(1|ptID)
Input: 20 libraries from 10 unique patients
Model: 20 libraries
Complete: 13419 genes
Failed: 0 genes
In the interest of time, we’ve provided the kimma
results for you.
load("0_data/kimma_results.RData")
The kimma
output contains 4 data frames: one that matches the limma
output and one with fit metrics for each model.
names(fit_kimma)
## [1] "lm" "lme" "lm.fit" "lme.fit"
head(fit_kimma$lm)
## # A tibble: 6 × 7
## model gene variable estimate pval FDR gene_biotype
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 lm A1BG (Intercept) 1.38 9.83e- 7 1.35e- 6 protein_coding
## 2 lm A1BG conditionMtb -0.313 2.79e- 1 4.09e- 1 protein_coding
## 3 lm A1BG sexF -0.0796 8.16e- 1 9.39e- 1 protein_coding
## 4 lm A2M (Intercept) 6.79 7.69e-15 1.36e-14 protein_coding
## 5 lm A2M conditionMtb -0.127 7.36e- 1 8.15e- 1 protein_coding
## 6 lm A2M sexF -1.44 3.96e- 3 8.24e- 2 protein_coding
head(fit_kimma$lm.fit)
## # A tibble: 6 × 8
## model gene sigma AIC BIC Rsq adj_Rsq gene_biotype
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 lm.fit A1BG 0.766 43.4 47.4 0.0730 -0.0360 protein_coding
## 2 lm.fit A2M 2.82 54.9 58.9 0.400 0.329 protein_coding
## 3 lm.fit A2ML1 0.582 58.4 62.4 0.638 0.595 protein_coding
## 4 lm.fit A4GALT 1.89 80.0 84.0 0.440 0.374 protein_coding
## 5 lm.fit AAAS 0.971 25.1 29.1 0.335 0.257 protein_coding
## 6 lm.fit AACS 0.706 15.6 19.6 0.336 0.258 protein_coding
We can now look at metrics like AIC where we see that best fit still varies by gene (which is very common) and the overall AIC mean is lower for the simple linear model without paired design.
fit_kimma_all <- full_join(fit_kimma$lm.fit, fit_kimma$lme.fit,
by = c("gene"), suffix = c("_lm","_lme"))
fit_kimma_all %>%
ggplot(aes(x = AIC_lm, y = AIC_lme)) +
geom_point(alpha = 0.2) +
geom_abline(intercept = 0, slope = 1, color = "red") +
theme_classic() +
coord_fixed() +
labs(title = "AIC")
mean(fit_kimma$lm.fit$AIC)
## [1] 33.39391
mean(fit_kimma$lme.fit$AIC)
## [1] 37.71397
So in this case, which model do we go with? It would be nice if one model showed consistently better fit across our genes, but this rarely happens. For this experiment, we know we have a paired design so either limma
with duplicateCorrelation
or kimma
with run.lme
is appropriate. In our experience, a true mixed effects model in kimma
yields fewer false positive genes when you have a paired design, even if metrics like AIC do not support it as the best fit model.
We don’t have time to go over all the potential models of these data. So here’s a quick reference for you to explore later on.
limma
or kimma
formula with *
such as ~ condition * sex
which is shorthand for ~ condition + sex + condition:sex
limma
makeContrasts
kimma
run.contrasts = TRUE
limma
not supportedkimma
provide the matrix kin
and set run.lmerel = TRUE
limma
don’t provide weights
kimma
use.weights = FALSE
dream
in the package variancePartition
For more help with limma
, see Chapter 15 http://bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf
For kimma
, see https://github.com/BIGslu/tutorials/blob/main/RNAseq/3.Hawn_RNAseq_voom.to.DEG.pdf
If you use kimma
, our other package BIGpicture
has several plotting functions you may find useful. Here are some examples.
devtools::install_github("BIGslu/BIGpicture")
library(BIGpicture)
Principle component analysis (PCA)
plot_pca(dat, vars = c("condition", "outlier"))
## Joining, by = "libID"
## $condition
##
## $outlier
Model fit metrics
plot_fit(model_result = fit_kimma,
x="lm", y="lme", metrics = c("AIC","adj_Rsq"))
## Summary
## Best fit Metric Total genes Mean delta Stdev delta
## 1 lm adj_Rsq 3583 3.8238324 118.478571
## 2 lme adj_Rsq 9836 0.9833688 28.855033
## 3 lm AIC 10647 6.5535246 3.859962
## 4 lme AIC 2772 4.2584460 4.588052
Significant gene Venn
plot_venn_genes(fit_kimma, model = "lme", fdr.cutoff = 0.05)
Gene expression boxplots
plot_genes(dat, fdr = fit_kimma$lme, geneID = "hgnc_symbol",
subset.genes = "IFNG", variables = c("condition","sex"))
## [[1]]
STRING protein-protein interaction network
#list top 10 most significant genes
genes.signif <- fit_kimma$lme %>%
filter(variable == "condition") %>%
slice_min(order_by = FDR, n = 10) %>%
pull(gene)
#map genes to STRING database
map <- map_string(genes.signif)
#make plot
plot_string(map)