RNA-seq data sets have 1000s of genes, but we don’t want to individually run 1000s of statistical models! In this section, we introduce some R packages designed for running linear models across all genes in an efficient manor.
Recording available at https://www.youtube.com/watch?v=UwtT6GQvJHc
Live notes at https://github.com/BIGslu/workshops/blob/main/2023.01.30_RNAseq.i4TB/2023.01.30_i4TB.R
Please follow the setup instructions at https://bigslu.github.io/workshops/setup/setup.html and install the packages under both of:
Load the following packages. For more information on installation, see the setup instructions.
#Data manipulation and plotting
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
#Linear modeling
library(kimma)
library(limma)
set.seed(651)
Example data were obtained from virus-stimulated human plasmacytoid dendritic cells. Actual patient identifiers and metadata have been altered for this workshop.
Dill-McFarland KA, Schwartz JT, Zhao H, Shao B, Fulkerson PC, Altman MC, Gill MA. 2022. Eosinophil-mediated suppression and Anti-IL-5 enhancement of plasmacytoid dendritic cell interferon responses in asthma. J Allergy Clin Immunol. Epub ahead of print. doi: 10.1016/j.jaci.2022.03.025. — GitHub
Specifically, this workshop uses RNAseq data processed using our SEAsnake
and counts
to voom pipelines, resulting in voom-normalized, log2 counts per
million (CPM) expression and associated sample metadata in a
limma EList
object. A random subset of 1000 genes is used
here to reduce computational time. These data are available in the
kimma
package within the example.voom
data
object.
Our main research question is how human rhinovirus (HRV) infection
impacts gene expression. As a secondary question, we are also interested
in how individuals with asthma may respond differently to HRV compared
to healthy individuals. Thus, we will explore models comparing media and
HRV-infected samples (variable named virus
) in asthmatic
and healthy individuals (variable named asthma
). We will
then explore the impacts of patient co-variates, paired sample design,
and random effects to improve model fit and the detection of
differentially expressed genes.
All counts, gene, and sample metadata are contained in a single
object from the kimma
package.
dat<-example.voom
We access each data frame within this Elist
using
$
. The normalized log2 CPM expression data are contained in
E
.
dat$E[1:3,1:7]
## lib1 lib2 lib3 lib4 lib5 lib6 lib7
## ENSG00000000460 6.109675 7.998060 7.322775 7.327720 7.915870 7.989109 8.294872
## ENSG00000001460 7.173806 4.828135 6.133741 8.130273 6.662114 7.241296 5.807207
## ENSG00000002587 8.672611 7.460403 8.856207 8.130273 7.852445 7.067965 8.249429
Library and donor metadata are in targets
.
dat$targets[1:3,]
## group lib.size norm.factors libID ptID median_cv_coverage virus asthma
## lib1 1 79645.59 1.001378 lib1 donor1 0.513732 none healthy
## lib2 1 88007.91 0.951334 lib2 donor1 0.435162 HRV healthy
## lib3 1 178019.51 1.091153 lib3 donor2 0.374282 none healthy
## batch
## lib1 2
## lib2 2
## lib3 2
Gene metadata are in genes
.
dat$genes[1:3,1:4]
## hgnc_symbol Previous symbols Alias symbols
## ENSG00000000460 RAB12 <NA> <NA>
## ENSG00000001460 MIA <NA> CD-RAP
## ENSG00000002587 C1orf52 <NA> gm117, FLJ44982
## gene_biotype
## ENSG00000000460 protein-coding gene
## ENSG00000001460 protein-coding gene
## ENSG00000002587 protein-coding gene
Voom gene-level quality weights are in weights
. These
were calculated with voomWithQualityWeights( )
.
example.voom$weights[1:3,1:3]
## [,1] [,2] [,3]
## [1,] 0.7585106 0.8274172 1.5710160
## [2,] 0.5288184 0.5555010 0.9553991
## [3,] 0.7259811 0.7906774 1.5015908
And finally, the null model used in voom normalization is found in
design
.
example.voom$design[1:3,]
## lib1 lib2 lib3
## 1 1 1
This workshop assumes some familiarity with statistical modeling. You can find a quick intro relevant to RNA-seq data in our Intro to linear modeling.
limma
limma
limma
take model inputs as a model matrix. This matrix
encodes all the variables from the formula as 0 for N and 1 for Y. For
example, here is the model matrix for the formula
~ virus + asthma
mm_limma <- model.matrix(~ virus + asthma, data=dat$targets)
head(mm_limma)
## (Intercept) virusHRV asthmaasthma
## lib1 1 0 0
## lib2 1 1 0
## lib3 1 0 0
## lib4 1 1 0
## lib5 1 0 0
## lib6 1 1 0
Be careful with variable order! Note that we only see one level for each variable: HRV for virus and asthma for asthma. This shows that the HRV samples are being compared to the reference level (which is
virus == "none"
). The reference is determined alphabetically if the variable is a character and by level if the variable is a factor. So, if you want a different order than alphabetical, you need to format your variables as factors and set the appropriate order. This was done for these data since “HRV” is before “none” alphabetically and similar for the asthma/healthy variable.
Once we have a model matrix, we fit the model and estimate P-values.
#Fit model
fit_limma <- lmFit(object = dat$E, design = mm_limma,
weights = dat$weights)
#Estimate significance
efit_limma <- eBayes(fit = fit_limma)
These outputs contain a lot of information. We can pull out the most
commonly used pieces with topTable
. By default, this gives
your the 10 most significant genes across the entire model.
#Extract results
fdr_limma <- topTable(fit = efit_limma)
## Removing intercept from test coefficients
head(fdr_limma)
## virusHRV asthmaasthma AveExpr F P.Value
## ENSG00000119917 2.8622101 -1.3309603 12.746965 48.84428 1.618092e-07
## ENSG00000175183 3.6668058 -1.2147603 8.448405 46.06223 2.412669e-07
## ENSG00000169248 5.9883442 -3.7794475 7.846965 39.87514 6.349601e-07
## ENSG00000064666 -0.8951822 0.4651098 11.919618 33.19140 2.102518e-06
## ENSG00000184979 2.8065694 -0.6999710 10.761801 32.18228 2.561666e-06
## ENSG00000135899 1.1958674 -0.9441346 12.231043 31.50300 2.934197e-06
## adj.P.Val
## ENSG00000119917 0.0001206335
## ENSG00000175183 0.0001206335
## ENSG00000169248 0.0002116534
## ENSG00000064666 0.0004476088
## ENSG00000184979 0.0004476088
## ENSG00000135899 0.0004476088
With some additional parameters, we can get all gene results for each variable.
fdr_limma_hrv <- topTable(fit = efit_limma,
coef = "virusHRV", number = Inf)
fdr_limma_asthma <- topTable(fit = efit_limma,
coef = "asthmaasthma", number = Inf)
head(fdr_limma_hrv)
## logFC AveExpr t P.Value adj.P.Val
## ENSG00000175183 3.666806 8.448405 9.112491 1.032395e-07 6.447752e-05
## ENSG00000119917 2.862210 12.746965 8.963128 1.289550e-07 6.447752e-05
## ENSG00000184979 2.806569 10.761801 7.784801 8.185701e-07 2.397699e-04
## ENSG00000169248 5.988344 7.846965 7.554502 1.198850e-06 2.397699e-04
## ENSG00000135070 1.848877 9.640886 7.372586 1.628530e-06 2.714216e-04
## ENSG00000164761 -5.250311 5.156882 -7.618383 1.077703e-06 2.397699e-04
## B
## ENSG00000175183 8.101733
## ENSG00000119917 7.518732
## ENSG00000184979 5.884777
## ENSG00000169248 5.720121
## ENSG00000135070 5.343999
## ENSG00000164761 5.094245
head(fdr_limma_asthma)
## logFC AveExpr t P.Value adj.P.Val B
## ENSG00000120616 -1.574174 9.828525 -6.784263 4.520576e-06 0.004520576 4.4065770
## ENSG00000138755 -5.450077 6.697474 -5.789928 2.826039e-05 0.014130193 2.4776086
## ENSG00000066084 -1.052537 9.943029 -5.276803 7.663976e-05 0.023952278 1.5684643
## ENSG00000169248 -3.779447 7.846965 -4.762329 2.152436e-04 0.023952278 0.8149666
## ENSG00000100385 1.880785 8.187354 4.729110 2.303163e-04 0.023952278 0.7298956
## ENSG00000188647 -1.215303 8.911152 -4.729181 2.302830e-04 0.023952278 0.6457409
The variables included are:
logFC
: log fold change. The sign is relative to your
reference. For example, negative logFC for virusHRV means HRV minus
media is negative and thus, expression is lower in HRV-infected
samples.AveExpr
: average expression across all samplest
: test statistic for significanceP.Value
adj.P.Val
: FDR adjusted P-valueB
: beta coefficientWith some tidyverse
manipulation, we can get results for
all genes and variables in one table. Or you can use the
kimma
function extract_lmFit
and skip the
coding! This also renames the columns to match kimma
’s
output.
fdr_limma <- extract_lmFit(design = mm_limma, fit = efit_limma)
names(fdr_limma)
## [1] "lm" "lm.fit"
head(fdr_limma$lm)
## model gene variable estimate pval FDR
## 1 limma ENSG00000120616 asthmaasthma -1.574174 4.520576e-06 0.004520576
## 2 limma ENSG00000138755 asthmaasthma -5.450077 2.826039e-05 0.014130193
## 3 limma ENSG00000066084 asthmaasthma -1.052537 7.663976e-05 0.023952278
## 4 limma ENSG00000169248 asthmaasthma -3.779447 2.152436e-04 0.023952278
## 5 limma ENSG00000100385 asthmaasthma 1.880785 2.303163e-04 0.023952278
## 6 limma ENSG00000188647 asthmaasthma -1.215303 2.302830e-04 0.023952278
## t B AveExpr
## 1 -6.784263 4.4065770 9.828525
## 2 -5.789928 2.4776086 6.697474
## 3 -5.276803 1.5684643 9.943029
## 4 -4.762329 0.8149666 7.846965
## 5 4.729110 0.7298956 8.187354
## 6 -4.729181 0.6457409 8.911152
limma
limma
uses a shortcut to model paired sample designs.
Unlike a true mixed effect model, limma
estimates the mean
correlation of gene expression between pairs. This is an okay
approximation of a mixed effects model and runs very fast.
Using the same model as before, we can calculate the mean correlation.
consensus.corr <- duplicateCorrelation(object = dat$E,
design = mm_limma,
block = dat$targets$ptID)
consensus.corr$consensus.correlation
## [1] 0.2689307
Note: If you get an error about the package statmod
,
please install this package with
install.packages("statmod")
Then incorporate this estimate into our model fitting.
# Fit model
fit_limma2 <- lmFit(object = dat$E, design = mm_limma,
weights = dat$weights,
block = dat$targets$ptID,
correlation = consensus.corr$consensus.correlation)
#Estimate p-values
efit_limma2 <- eBayes(fit_limma2)
#Get results
fdr_limma2 <- extract_lmFit(design = mm_limma,
fit = efit_limma2)
Comparing these results for the virus variable, we see that the
pseudo-paired model with duplicateCorrelation
tends to give
lower FDR values. This doesn’t necessarily mean the model is a better
fit!
#Format results for merging
temp <- fdr_limma$lm %>%
select(gene, variable, FDR) %>%
filter(variable == "virusHRV") %>%
rename(`limma` = FDR)
temp2 <- fdr_limma2$lm %>%
select(gene, variable, FDR) %>%
filter(variable == "virusHRV") %>%
rename(limma_dupCor = FDR)
#Merge results and plot FDR
full_join(temp, temp2) %>%
ggplot(aes(x = `limma`, y = limma_dupCor)) +
geom_point(alpha = 0.2) +
geom_abline(intercept = 0, slope = 1, color = "red") +
theme_classic() +
coord_fixed() +
labs(title = "FDR for HRV vs media")
## Joining, by = c("gene", "variable")
limma
provides one fit quality metric, sigma, which is
the estimated standard deviation of the errors. Looking at this, we see
that the best fit model (lowest sigma) varies from gene to gene. To be
honest, however, sigma is not the best way to assess model fit. We’ll
next move into kimma
where we can compare models with more
fit metrics as well as run a true mixed effects model.
data.frame(`limma` = fdr_limma$lm.fit$sigma,
limma_dupCor = fdr_limma2$lm.fit$sigma) %>%
ggplot(aes(x = `limma`, y = limma_dupCor)) +
geom_point(alpha = 0.2) +
geom_abline(intercept = 0, slope = 1, color = "red") +
theme_classic() +
coord_fixed() +
labs(title = "Sigma")
kimma
kimma
supports more flexible modeling of RNA-seq data
including simple linear and linear mixed effects models with
co-variates, weights, random effects, and matrix random effects. Let’s
run the same models as we did with limma
.
Note that kimma
is slower than limma
. It
can be run on multiple processors to increase speed.
fit_kimma <- kmFit(dat = dat,
model = "~ virus + asthma + (1|ptID)",
use.weights = TRUE,
run.lm = TRUE, run.lme = TRUE,
metrics = TRUE)
## lm model: expression~virus+asthma
## lme/lmerel model: expression~virus+asthma+(1|ptID)
## Input: 12 libraries from 6 unique patients
## Model: 12 libraries
## Complete: 1000 genes
## Failed: 0 genes
The kimma
output contains 4 data frames: one for each
model’s results (like limma
’s topTable
) and
one for each model’s fit metrics, which unlike limma
,
contain more than just sigma.
names(fit_kimma)
## [1] "lm" "lme" "lm.fit" "lme.fit"
head(fit_kimma$lm)
## # A tibble: 6 × 8
## model gene variable df statistic estimate pval FDR
## <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 lm ENSG00000000460 (Intercept) 9 21.9 7.32 4.06e-9 7.12e-9
## 2 lm ENSG00000000460 virusHRV 9 1.21 0.463 2.56e-1 5.67e-1
## 3 lm ENSG00000000460 asthmaasthma 9 2.06 0.787 6.93e-2 3.19e-1
## 4 lm ENSG00000001460 (Intercept) 9 12.7 6.29 4.61e-7 6.78e-7
## 5 lm ENSG00000001460 virusHRV 9 1.70 0.959 1.24e-1 4.07e-1
## 6 lm ENSG00000001460 asthmaasthma 9 1.30 0.735 2.26e-1 5.24e-1
head(fit_kimma$lm.fit)
## # A tibble: 6 × 7
## model gene sigma AIC BIC Rsq adj_Rsq
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lm.fit ENSG00000000460 0.776 29.0 30.9 0.385 0.249
## 2 lm.fit ENSG00000001460 0.905 38.3 40.3 0.333 0.185
## 3 lm.fit ENSG00000002587 0.640 24.9 26.8 0.600 0.511
## 4 lm.fit ENSG00000005189 1.31 53.0 55.0 0.206 0.0299
## 5 lm.fit ENSG00000005801 0.606 8.47 10.4 0.311 0.158
## 6 lm.fit ENSG00000005884 1.00 49.6 51.5 0.587 0.495
We can now look at metrics like AIC where we see that best fit varies by gene (which is very common)…
fit_kimma_all <- full_join(fit_kimma$lm.fit,
fit_kimma$lme.fit,
by = c("gene"),
suffix = c("_lm","_lme"))
fit_kimma_all %>%
ggplot(aes(x = AIC_lm, y = AIC_lme)) +
geom_point(alpha = 0.2) +
geom_abline(intercept = 0, slope = 1, color = "red") +
theme_classic() +
coord_fixed() +
labs(title = "AIC")
and the overall AIC mean and total are somewhat lower for the simple linear model without paired design.
#Mean
mean(fit_kimma$lm.fit$AIC)
## [1] 28.2214
mean(fit_kimma$lme.fit$AIC)
## [1] 31.09005
#Sum
sum(fit_kimma$lm.fit$AIC)
## [1] 28221.4
sum(fit_kimma$lme.fit$AIC)
## [1] 31090.05
In general, differences in mean AIC < 2 show that either model is appropriate, differences from 2 to 7 are moderate evidence for the lower value model, and differences greater than 7 are strong evidence for the lower value model.
So in this case, which model do we go with? AIC supports the simple model but our study design is paired… Always use your scientific reasoning first! If you know that there is a study design feature or confounding covariate, keep them in the model even if the AIC says otherwise. Be smarter than your data!
For this experiment, we know we have a paired design so either
limma
with duplicateCorrelation
or
kimma
with run.lme
is appropriate. In our
experience, a true mixed effects model in kimma
yields
fewer false positive genes when you have a paired design, even if
metrics like AIC do not support it as the best fit model.
We don’t have time to go over all the potential models of these data. So here’s a quick reference for you to explore later on.
limma
or kimma
formula with *
such as ~ virus * asthma
which is shorthand for
~ virus + asthma + virus:asthma
limma
makeContrasts
kimma
run.contrasts = TRUE
limma
not supportedkimma
provide the matrix example.kin
from
the kimma
package and set
run.lmerel = TRUE
limma
don’t provide weights
kimma
use.weights = FALSE
dream
in the package variancePartition
This section covers gene set related analyses of RNAseq data in the R
package SEARchways
. The goal is to determine pathways
associated with genes impacted by viral infection from the previous
linear models.
We will now load additional packages for pathway analyses.
library(SEARchways)
library(BIGpicture)
There are hundreds of genes significant for virus in this data set (FDR < 0.1). This is too many to assess individuals so we use enrichment to determine pathways associated with these genes.
summarise_kmFit(fit_kimma$lme)
## # A tibble: 4 × 7
## variable fdr_0.05 fdr_0.1 fdr_0.2 fdr_0.3 fdr_0.4 fdr_0.5
## <fct> <int> <int> <int> <int> <int> <int>
## 1 (1 | ptID) 7 7 26 34 108 162
## 2 asthma 63 93 137 200 286 357
## 3 virus 244 296 389 474 543 626
## 4 total (nonredundant) 295 359 478 580 702 794
SEARchways
has the function BIGprofiler
,
which employs clusterProfiler
to run Fisher’s exact tests
of your genes of interest against any gene set data base available in
the Broad Molecular Signatures Database (MSigDB)
or a custom one your provide.
Here, let’s run the virus significant genes against the Hallmark data base.
#Virus significant genes
genes.virus <- fit_kimma$lme %>%
filter(variable == "virus" & FDR < 0.1) %>%
pull(gene)
enrich.virus <- BIGprofiler(gene_list = list("virus" = genes.virus),
ID = "ENSEMBL",
category = "H")
## [1] "virus"
##
The output includes:
group
: Name identifier for genes of interestsize_group
: Total genes in each groupgs_cat
: Gene set data base categorygs_subcat
: Gene set data base subcategorysize_cat.subcat
: Total genes in data base
category/subcategorygroup_in_cat.subcat
: Total significant genes in data
base category/subcategory e.g. how many of your genes of interest are
present in the data basepathway
: Gene set namesize_pathway
: Total genes in gene set (K)group_in_pathway
: Total significant genes in gene set
(k)k/K
: group_in_pathway / size_pathway, a commonly used
metric of enrichmentpval
: P-valueFDR
: FDR corrected P-valueqvalue
: q-valuegenes
: List of significant genes in gene set (not shown
below for length)The results show that there is 1 pathway enriched in virus significant genes at FDR < 0.05.
enrich.virus %>%
filter(FDR < 0.05) %>%
select(-genes)
## group size_group gs_cat gs_subcat size_cat.subcat group_in_cat.subcat
## 1 virus 296 H 4917 98
## pathway size_pathway group_in_pathway k/K
## 1 HALLMARK_INTERFERON_GAMMA_RESPONSE 286 17 0.05944056
## pval FDR qvalue
## 1 3.876975e-05 0.001822178 0.001822178
We can easily plot this result in BIGpicture
. We’ll use
FDR < 0.5 so you can see multiple pathways. You can see that roughly
6% of the IFN gamma response pathway is present in our list of virus
significant genes.
plot_enrich(enrich.virus,
fdr.cutoff = 0.5, fdr.colors = c(0.05, 0.5))
This type of enrichment does not address whether genes are up or
down-regulated, just that they are significant in some way. Thus, if you
would like to know direction, you need to create subsets like
virus.up
and virus.down
, then run enrichment
on the lists separately. This should be done with caution as not all
gene sets have concordant directionality and shorter gene lists are less
likely to be significant in any pathway.
SEARchways
resultsMany SEARchways
results contain list data. For example,
the genes
column from all of the BIG
functions
is a list column. This means these data cannot be saved outside of R as
they are.
class(enrich.virus$genes)
## [1] "list"
To save this column outside R, simply convert it to a
character
.
enrich.virus %>%
mutate(genes = as.character(genes)) %>%
write_csv(file = "results/BIGprofiler.results.csv")
Note that if you read this csv back into R, the genes
column will not be formatting correctly for downstream
BIGverse
functions. Thus, we recommend saving data in
.RData
or .Rds
in order to maintain
formatting.
Alternatively, you can unnest the data to get 1 row per gene per
pathway. Then when you read the csv
into R, you can re-nest
it like so.
enrich.virus %>%
unnest(genes) %>%
write_csv(file = "results/BIGprofiler.results.csv")
enrich.virus <- read_csv("results/BIGprofiler.results.csv") %>%
group_by(across(c(-genes))) %>%
summarise(genes = list(genes), .groups = "drop")
Another gene set related analysis is GSEA. This method uses fold change values of all genes in the data set instead of having you pre-define a significant list of genes at an arbitrary FDR cutoff. GSEA, thus, can find significant pathways with no individually significant genes in them. Instead, smaller up (or down) regulations of many genes can drive overall significance of a pathway.
BIGsea
runs GSEA with parameters similar to
BIGprofiler
only now the input is log fold change (named
estimate
in the kimma
output).
FC.virus <- fit_kimma$lme %>%
filter(variable == "virus") %>%
select(variable, gene, estimate)
head(FC.virus)
## # A tibble: 6 × 3
## variable gene estimate
## <chr> <chr> <dbl>
## 1 virus ENSG00000000460 0.474
## 2 virus ENSG00000001460 0.959
## 3 virus ENSG00000002587 -1.11
## 4 virus ENSG00000005189 -0.347
## 5 virus ENSG00000005801 0.0792
## 6 virus ENSG00000005884 1.00
gsea.virus <- BIGsea(gene_df = FC.virus,
ID = "ENSEMBL",
category = "H")
## virus
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.7% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
Note the warning about ties in the data. This means that 2 or more genes have an identical log fold change value and thus, GSEA must randomly determine their rank order relative to each other. This is not an issue because we’ve already set a random seed for reproducibility. However, if there are a lot of ties in your data, like > 10%, this may indicate a need to re-assess your linear model to better represent differences across the data.
Similarly, we can plot the GSEA results with BIGpicture
.
Unlike simple enrichment, GSEA shows directionality such as here where
IFN gamma response is up-regulated in response to virus. Remember to
define your variable order with factor
in the original data
if you want something other than alphabetical as this determines whether
the fold change values are A minus B or B minus A.
plot_gsea(gsea.virus,
fdr.cutoff = 0.5, fdr.colors = c(0.05, 0.5))
GSEA can only support pairwise comparison. Thus, if you have a
variable of interest with 3 or more levels, you need to run each
pairwise comparison of interest separately. You can access this data in
kimma
with run.contrasts = TRUE
. This is also
true for interaction terms, where the model estimate is not a simple
fold change.