RNA-seq linear modeling

Overview

RNA-seq data sets have 1000s of genes, but we don’t want to individually run 1000s of statistical models! In this section, we introduce some R packages designed for running linear models across all genes in an efficient manor.

Recording available at https://www.youtube.com/watch?v=UwtT6GQvJHc

Live notes at https://github.com/BIGslu/workshops/blob/main/2023.01.30_RNAseq.i4TB/2023.01.30_i4TB.R

Prior to the workshop

Please follow the setup instructions at https://bigslu.github.io/workshops/setup/setup.html and install the packages under both of:

RNA-seq analysis in R
RNA-seq pathway analysis in R

Load packages

Load the following packages. For more information on installation, see the setup instructions.

#Data manipulation and plotting
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

#Linear modeling
library(kimma)
library(limma)

set.seed(651)

Data

Data description

Example data were obtained from virus-stimulated human plasmacytoid dendritic cells. Actual patient identifiers and metadata have been altered for this workshop.

Dill-McFarland KA, Schwartz JT, Zhao H, Shao B, Fulkerson PC, Altman MC, Gill MA. 2022. Eosinophil-mediated suppression and Anti-IL-5 enhancement of plasmacytoid dendritic cell interferon responses in asthma. J Allergy Clin Immunol. Epub ahead of print. doi: 10.1016/j.jaci.2022.03.025. — GitHub

Specifically, this workshop uses RNAseq data processed using our SEAsnake and counts to voom pipelines, resulting in voom-normalized, log2 counts per million (CPM) expression and associated sample metadata in a limma EList object. A random subset of 1000 genes is used here to reduce computational time. These data are available in the kimma package within the example.voom data object.

Our main research question is how human rhinovirus (HRV) infection impacts gene expression. As a secondary question, we are also interested in how individuals with asthma may respond differently to HRV compared to healthy individuals. Thus, we will explore models comparing media and HRV-infected samples (variable named virus) in asthmatic and healthy individuals (variable named asthma). We will then explore the impacts of patient co-variates, paired sample design, and random effects to improve model fit and the detection of differentially expressed genes.

Load data

All counts, gene, and sample metadata are contained in a single object from the kimma package.

dat<-example.voom

We access each data frame within this Elist using $. The normalized log2 CPM expression data are contained in E.

dat$E[1:3,1:7]

##                     lib1     lib2     lib3     lib4     lib5     lib6     lib7
## ENSG00000000460 6.109675 7.998060 7.322775 7.327720 7.915870 7.989109 8.294872
## ENSG00000001460 7.173806 4.828135 6.133741 8.130273 6.662114 7.241296 5.807207
## ENSG00000002587 8.672611 7.460403 8.856207 8.130273 7.852445 7.067965 8.249429

Library and donor metadata are in targets.

dat$targets[1:3,]

##      group  lib.size norm.factors libID   ptID median_cv_coverage virus  asthma
## lib1     1  79645.59     1.001378  lib1 donor1           0.513732  none healthy
## lib2     1  88007.91     0.951334  lib2 donor1           0.435162   HRV healthy
## lib3     1 178019.51     1.091153  lib3 donor2           0.374282  none healthy
##      batch
## lib1     2
## lib2     2
## lib3     2

Gene metadata are in genes.

dat$genes[1:3,1:4]

##                 hgnc_symbol Previous symbols   Alias symbols
## ENSG00000000460       RAB12             <NA>            <NA>
## ENSG00000001460         MIA             <NA>          CD-RAP
## ENSG00000002587     C1orf52             <NA> gm117, FLJ44982
##                        gene_biotype
## ENSG00000000460 protein-coding gene
## ENSG00000001460 protein-coding gene
## ENSG00000002587 protein-coding gene

Voom gene-level quality weights are in weights. These were calculated with voomWithQualityWeights( ).

example.voom$weights[1:3,1:3]

##           [,1]      [,2]      [,3]
## [1,] 0.7585106 0.8274172 1.5710160
## [2,] 0.5288184 0.5555010 0.9553991
## [3,] 0.7259811 0.7906774 1.5015908

And finally, the null model used in voom normalization is found in design.

example.voom$design[1:3,]

## lib1 lib2 lib3 
##    1    1    1

Introduction to linear modeling

This workshop assumes some familiarity with statistical modeling. You can find a quick intro relevant to RNA-seq data in our Intro to linear modeling.

Modeling in `limma`

Simple linear regression in `limma`

limma take model inputs as a model matrix. This matrix encodes all the variables from the formula as 0 for N and 1 for Y. For example, here is the model matrix for the formula ~ virus + asthma

mm_limma <- model.matrix(~ virus + asthma, data=dat$targets)

head(mm_limma)

##      (Intercept) virusHRV asthmaasthma
## lib1           1        0            0
## lib2           1        1            0
## lib3           1        0            0
## lib4           1        1            0
## lib5           1        0            0
## lib6           1        1            0

Be careful with variable order! Note that we only see one level for each variable: HRV for virus and asthma for asthma. This shows that the HRV samples are being compared to the reference level (which is virus == "none"). The reference is determined alphabetically if the variable is a character and by level if the variable is a factor. So, if you want a different order than alphabetical, you need to format your variables as factors and set the appropriate order. This was done for these data since “HRV” is before “none” alphabetically and similar for the asthma/healthy variable.

Once we have a model matrix, we fit the model and estimate P-values.

#Fit model
fit_limma <- lmFit(object = dat$E, design = mm_limma,
                   weights = dat$weights)
#Estimate significance
efit_limma <- eBayes(fit = fit_limma)

These outputs contain a lot of information. We can pull out the most commonly used pieces with topTable. By default, this gives your the 10 most significant genes across the entire model.

#Extract results
fdr_limma <- topTable(fit = efit_limma)

## Removing intercept from test coefficients

head(fdr_limma)

##                   virusHRV asthmaasthma   AveExpr        F      P.Value
## ENSG00000119917  2.8622101   -1.3309603 12.746965 48.84428 1.618092e-07
## ENSG00000175183  3.6668058   -1.2147603  8.448405 46.06223 2.412669e-07
## ENSG00000169248  5.9883442   -3.7794475  7.846965 39.87514 6.349601e-07
## ENSG00000064666 -0.8951822    0.4651098 11.919618 33.19140 2.102518e-06
## ENSG00000184979  2.8065694   -0.6999710 10.761801 32.18228 2.561666e-06
## ENSG00000135899  1.1958674   -0.9441346 12.231043 31.50300 2.934197e-06
##                    adj.P.Val
## ENSG00000119917 0.0001206335
## ENSG00000175183 0.0001206335
## ENSG00000169248 0.0002116534
## ENSG00000064666 0.0004476088
## ENSG00000184979 0.0004476088
## ENSG00000135899 0.0004476088

With some additional parameters, we can get all gene results for each variable.

fdr_limma_hrv <- topTable(fit = efit_limma, 
                      coef = "virusHRV", number = Inf)
fdr_limma_asthma <- topTable(fit = efit_limma, 
                      coef = "asthmaasthma", number = Inf)

head(fdr_limma_hrv)

##                     logFC   AveExpr         t      P.Value    adj.P.Val
## ENSG00000175183  3.666806  8.448405  9.112491 1.032395e-07 6.447752e-05
## ENSG00000119917  2.862210 12.746965  8.963128 1.289550e-07 6.447752e-05
## ENSG00000184979  2.806569 10.761801  7.784801 8.185701e-07 2.397699e-04
## ENSG00000169248  5.988344  7.846965  7.554502 1.198850e-06 2.397699e-04
## ENSG00000135070  1.848877  9.640886  7.372586 1.628530e-06 2.714216e-04
## ENSG00000164761 -5.250311  5.156882 -7.618383 1.077703e-06 2.397699e-04
##                        B
## ENSG00000175183 8.101733
## ENSG00000119917 7.518732
## ENSG00000184979 5.884777
## ENSG00000169248 5.720121
## ENSG00000135070 5.343999
## ENSG00000164761 5.094245

head(fdr_limma_asthma)

##                     logFC  AveExpr         t      P.Value   adj.P.Val         B
## ENSG00000120616 -1.574174 9.828525 -6.784263 4.520576e-06 0.004520576 4.4065770
## ENSG00000138755 -5.450077 6.697474 -5.789928 2.826039e-05 0.014130193 2.4776086
## ENSG00000066084 -1.052537 9.943029 -5.276803 7.663976e-05 0.023952278 1.5684643
## ENSG00000169248 -3.779447 7.846965 -4.762329 2.152436e-04 0.023952278 0.8149666
## ENSG00000100385  1.880785 8.187354  4.729110 2.303163e-04 0.023952278 0.7298956
## ENSG00000188647 -1.215303 8.911152 -4.729181 2.302830e-04 0.023952278 0.6457409

The variables included are:

logFC: log fold change. The sign is relative to your reference. For example, negative logFC for virusHRV means HRV minus media is negative and thus, expression is lower in HRV-infected samples.
AveExpr: average expression across all samples
t: test statistic for significance
P.Value
adj.P.Val: FDR adjusted P-value
B: beta coefficient

With some tidyverse manipulation, we can get results for all genes and variables in one table. Or you can use the kimma function extract_lmFit and skip the coding! This also renames the columns to match kimma’s output.

fdr_limma <- extract_lmFit(design = mm_limma, fit = efit_limma)

names(fdr_limma)

## [1] "lm"     "lm.fit"

head(fdr_limma$lm)

##   model            gene     variable  estimate         pval         FDR
## 1 limma ENSG00000120616 asthmaasthma -1.574174 4.520576e-06 0.004520576
## 2 limma ENSG00000138755 asthmaasthma -5.450077 2.826039e-05 0.014130193
## 3 limma ENSG00000066084 asthmaasthma -1.052537 7.663976e-05 0.023952278
## 4 limma ENSG00000169248 asthmaasthma -3.779447 2.152436e-04 0.023952278
## 5 limma ENSG00000100385 asthmaasthma  1.880785 2.303163e-04 0.023952278
## 6 limma ENSG00000188647 asthmaasthma -1.215303 2.302830e-04 0.023952278
##           t         B  AveExpr
## 1 -6.784263 4.4065770 9.828525
## 2 -5.789928 2.4776086 6.697474
## 3 -5.276803 1.5684643 9.943029
## 4 -4.762329 0.8149666 7.846965
## 5  4.729110 0.7298956 8.187354
## 6 -4.729181 0.6457409 8.911152

Paired sample design in `limma`

limma uses a shortcut to model paired sample designs. Unlike a true mixed effect model, limma estimates the mean correlation of gene expression between pairs. This is an okay approximation of a mixed effects model and runs very fast.

Using the same model as before, we can calculate the mean correlation.

consensus.corr <- duplicateCorrelation(object = dat$E, 
                                       design = mm_limma,
                                       block = dat$targets$ptID)
consensus.corr$consensus.correlation

## [1] 0.2689307

Note: If you get an error about the package statmod, please install this package with install.packages("statmod")

Then incorporate this estimate into our model fitting.

# Fit model
fit_limma2 <- lmFit(object = dat$E, design = mm_limma,
                    weights = dat$weights,
                    block = dat$targets$ptID,
                    correlation = consensus.corr$consensus.correlation)
#Estimate p-values
efit_limma2 <- eBayes(fit_limma2)

#Get results
fdr_limma2 <- extract_lmFit(design = mm_limma, 
                            fit = efit_limma2)

Comparing these results for the virus variable, we see that the pseudo-paired model with duplicateCorrelation tends to give lower FDR values. This doesn’t necessarily mean the model is a better fit!

#Format results for merging
temp <- fdr_limma$lm %>% 
  select(gene, variable, FDR) %>% 
  filter(variable == "virusHRV") %>% 
  rename(`limma` = FDR)

temp2 <- fdr_limma2$lm %>% 
  select(gene, variable, FDR) %>% 
  filter(variable == "virusHRV") %>% 
  rename(limma_dupCor = FDR)

#Merge results and plot FDR
full_join(temp, temp2) %>% 
  ggplot(aes(x = `limma`, y = limma_dupCor)) +
  geom_point(alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  theme_classic() +
  coord_fixed() +
  labs(title = "FDR for HRV vs media")

## Joining, by = c("gene", "variable")

limma provides one fit quality metric, sigma, which is the estimated standard deviation of the errors. Looking at this, we see that the best fit model (lowest sigma) varies from gene to gene. To be honest, however, sigma is not the best way to assess model fit. We’ll next move into kimma where we can compare models with more fit metrics as well as run a true mixed effects model.

data.frame(`limma` = fdr_limma$lm.fit$sigma,
           limma_dupCor = fdr_limma2$lm.fit$sigma) %>% 
  
  ggplot(aes(x = `limma`, y = limma_dupCor)) +
  geom_point(alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  theme_classic() +
  coord_fixed() +
  labs(title = "Sigma")

Modeling in `kimma`

kimma supports more flexible modeling of RNA-seq data including simple linear and linear mixed effects models with co-variates, weights, random effects, and matrix random effects. Let’s run the same models as we did with limma.

Note that kimma is slower than limma. It can be run on multiple processors to increase speed.

fit_kimma <- kmFit(dat = dat, 
                   model = "~ virus + asthma + (1|ptID)", 
                   use.weights = TRUE,
                   run.lm = TRUE, run.lme = TRUE,
                   metrics = TRUE)

## lm model: expression~virus+asthma

## lme/lmerel model: expression~virus+asthma+(1|ptID)

## Input: 12 libraries from 6 unique patients

## Model: 12 libraries

## Complete: 1000 genes

## Failed: 0 genes

The kimma output contains 4 data frames: one for each model’s results (like limma’s topTable) and one for each model’s fit metrics, which unlike limma, contain more than just sigma.

names(fit_kimma)

## [1] "lm"      "lme"     "lm.fit"  "lme.fit"

head(fit_kimma$lm)

## # A tibble: 6 × 8
##   model gene            variable        df statistic estimate       pval     FDR
##   <chr> <chr>           <chr>        <int>     <dbl>    <dbl>      <dbl>   <dbl>
## 1 lm    ENSG00000000460 (Intercept)      9     21.9     7.32     4.06e-9 7.12e-9
## 2 lm    ENSG00000000460 virusHRV         9      1.21    0.463    2.56e-1 5.67e-1
## 3 lm    ENSG00000000460 asthmaasthma     9      2.06    0.787    6.93e-2 3.19e-1
## 4 lm    ENSG00000001460 (Intercept)      9     12.7     6.29     4.61e-7 6.78e-7
## 5 lm    ENSG00000001460 virusHRV         9      1.70    0.959    1.24e-1 4.07e-1
## 6 lm    ENSG00000001460 asthmaasthma     9      1.30    0.735    2.26e-1 5.24e-1

head(fit_kimma$lm.fit)

## # A tibble: 6 × 7
##   model  gene            sigma   AIC   BIC   Rsq adj_Rsq
##   <chr>  <chr>           <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 lm.fit ENSG00000000460 0.776 29.0   30.9 0.385  0.249 
## 2 lm.fit ENSG00000001460 0.905 38.3   40.3 0.333  0.185 
## 3 lm.fit ENSG00000002587 0.640 24.9   26.8 0.600  0.511 
## 4 lm.fit ENSG00000005189 1.31  53.0   55.0 0.206  0.0299
## 5 lm.fit ENSG00000005801 0.606  8.47  10.4 0.311  0.158 
## 6 lm.fit ENSG00000005884 1.00  49.6   51.5 0.587  0.495

Picking a best fit model

We can now look at metrics like AIC where we see that best fit varies by gene (which is very common)…

fit_kimma_all <- full_join(fit_kimma$lm.fit, 
                           fit_kimma$lme.fit, 
                           by = c("gene"), 
                           suffix = c("_lm","_lme"))

fit_kimma_all %>%
  ggplot(aes(x = AIC_lm, y = AIC_lme)) +
  geom_point(alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  theme_classic() +
  coord_fixed() +
  labs(title = "AIC")

and the overall AIC mean and total are somewhat lower for the simple linear model without paired design.

#Mean
mean(fit_kimma$lm.fit$AIC)

## [1] 28.2214

mean(fit_kimma$lme.fit$AIC)

## [1] 31.09005

#Sum
sum(fit_kimma$lm.fit$AIC)

## [1] 28221.4

sum(fit_kimma$lme.fit$AIC)

## [1] 31090.05

In general, differences in mean AIC < 2 show that either model is appropriate, differences from 2 to 7 are moderate evidence for the lower value model, and differences greater than 7 are strong evidence for the lower value model.

So in this case, which model do we go with? AIC supports the simple model but our study design is paired… Always use your scientific reasoning first! If you know that there is a study design feature or confounding covariate, keep them in the model even if the AIC says otherwise. Be smarter than your data!

For this experiment, we know we have a paired design so either limma with duplicateCorrelation or kimma with run.lme is appropriate. In our experience, a true mixed effects model in kimma yields fewer false positive genes when you have a paired design, even if metrics like AIC do not support it as the best fit model.

Other models

We don’t have time to go over all the potential models of these data. So here’s a quick reference for you to explore later on.

Interaction terms
- limma or kimma formula with * such as ~ virus * asthma which is shorthand for ~ virus + asthma + virus:asthma
Pairwise comparisons between 3+ levels
- limma makeContrasts
- kimma run.contrasts = TRUE
Matrix random effects
- limma not supported
- kimma provide the matrix example.kin from the kimma package and set run.lmerel = TRUE
Removing gene-level weights
- limma don’t provide weights
- kimma use.weights = FALSE
- More about weights doi: 10.1186/gb-2014-15-2-r29
dream in the package variancePartition
- Another way to run paired sample designs
- See https://www.bioconductor.org/packages/devel/bioc/vignettes/variancePartition/inst/doc/dream.html

RNA-seq pathway analysis

Overview

This section covers gene set related analyses of RNAseq data in the R package SEARchways. The goal is to determine pathways associated with genes impacted by viral infection from the previous linear models.

Load packages

We will now load additional packages for pathway analyses.

library(SEARchways)
library(BIGpicture)

DEG enrichment

There are hundreds of genes significant for virus in this data set (FDR < 0.1). This is too many to assess individuals so we use enrichment to determine pathways associated with these genes.

summarise_kmFit(fit_kimma$lme)

## # A tibble: 4 × 7
##   variable             fdr_0.05 fdr_0.1 fdr_0.2 fdr_0.3 fdr_0.4 fdr_0.5
##   <fct>                   <int>   <int>   <int>   <int>   <int>   <int>
## 1 (1 | ptID)                  7       7      26      34     108     162
## 2 asthma                     63      93     137     200     286     357
## 3 virus                     244     296     389     474     543     626
## 4 total (nonredundant)      295     359     478     580     702     794

SEARchways has the function BIGprofiler, which employs clusterProfiler to run Fisher’s exact tests of your genes of interest against any gene set data base available in the Broad Molecular Signatures Database (MSigDB) or a custom one your provide.

Here, let’s run the virus significant genes against the Hallmark data base.

#Virus significant genes
genes.virus <- fit_kimma$lme %>% 
  filter(variable == "virus" & FDR < 0.1) %>% 
  pull(gene)

enrich.virus <- BIGprofiler(gene_list = list("virus" = genes.virus),
                            ID = "ENSEMBL",
                            category = "H")

## [1] "virus"

##

The output includes:

group: Name identifier for genes of interest
size_group: Total genes in each group
gs_cat: Gene set data base category
gs_subcat: Gene set data base subcategory
size_cat.subcat: Total genes in data base category/subcategory
group_in_cat.subcat: Total significant genes in data base category/subcategory e.g. how many of your genes of interest are present in the data base
pathway: Gene set name
size_pathway: Total genes in gene set (K)
group_in_pathway: Total significant genes in gene set (k)
k/K: group_in_pathway / size_pathway, a commonly used metric of enrichment
pval: P-value
FDR: FDR corrected P-value
qvalue: q-value
genes: List of significant genes in gene set (not shown below for length)

The results show that there is 1 pathway enriched in virus significant genes at FDR < 0.05.

enrich.virus %>% 
  filter(FDR < 0.05) %>% 
  select(-genes)

##   group size_group gs_cat gs_subcat size_cat.subcat group_in_cat.subcat
## 1 virus        296      H                      4917                  98
##                              pathway size_pathway group_in_pathway        k/K
## 1 HALLMARK_INTERFERON_GAMMA_RESPONSE          286               17 0.05944056
##           pval         FDR      qvalue
## 1 3.876975e-05 0.001822178 0.001822178

We can easily plot this result in BIGpicture. We’ll use FDR < 0.5 so you can see multiple pathways. You can see that roughly 6% of the IFN gamma response pathway is present in our list of virus significant genes.

plot_enrich(enrich.virus, 
            fdr.cutoff = 0.5, fdr.colors = c(0.05, 0.5))

What about directionality?

This type of enrichment does not address whether genes are up or down-regulated, just that they are significant in some way. Thus, if you would like to know direction, you need to create subsets like virus.up and virus.down, then run enrichment on the lists separately. This should be done with caution as not all gene sets have concordant directionality and shorter gene lists are less likely to be significant in any pathway.

A note on saving `SEARchways` results

Many SEARchways results contain list data. For example, the genes column from all of the BIG functions is a list column. This means these data cannot be saved outside of R as they are.

class(enrich.virus$genes)

## [1] "list"

To save this column outside R, simply convert it to a character.

enrich.virus %>% 
  mutate(genes = as.character(genes)) %>% 
  write_csv(file = "results/BIGprofiler.results.csv")

Note that if you read this csv back into R, the genes column will not be formatting correctly for downstream BIGverse functions. Thus, we recommend saving data in .RData or .Rds in order to maintain formatting.

Alternatively, you can unnest the data to get 1 row per gene per pathway. Then when you read the csv into R, you can re-nest it like so.

enrich.virus %>% 
  unnest(genes) %>% 
  write_csv(file = "results/BIGprofiler.results.csv")

enrich.virus <- read_csv("results/BIGprofiler.results.csv") %>% 
  group_by(across(c(-genes))) %>% 
  summarise(genes = list(genes), .groups = "drop")

Gene set enrichment analysis (GSEA)

Another gene set related analysis is GSEA. This method uses fold change values of all genes in the data set instead of having you pre-define a significant list of genes at an arbitrary FDR cutoff. GSEA, thus, can find significant pathways with no individually significant genes in them. Instead, smaller up (or down) regulations of many genes can drive overall significance of a pathway.

BIGsea runs GSEA with parameters similar to BIGprofiler only now the input is log fold change (named estimate in the kimma output).

FC.virus <- fit_kimma$lme %>% 
  filter(variable == "virus") %>% 
  select(variable, gene, estimate) 
head(FC.virus)

## # A tibble: 6 × 3
##   variable gene            estimate
##   <chr>    <chr>              <dbl>
## 1 virus    ENSG00000000460   0.474 
## 2 virus    ENSG00000001460   0.959 
## 3 virus    ENSG00000002587  -1.11  
## 4 virus    ENSG00000005189  -0.347 
## 5 virus    ENSG00000005801   0.0792
## 6 virus    ENSG00000005884   1.00

gsea.virus <- BIGsea(gene_df = FC.virus,
                   ID = "ENSEMBL",
                   category = "H")

## virus

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.7% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

Note the warning about ties in the data. This means that 2 or more genes have an identical log fold change value and thus, GSEA must randomly determine their rank order relative to each other. This is not an issue because we’ve already set a random seed for reproducibility. However, if there are a lot of ties in your data, like > 10%, this may indicate a need to re-assess your linear model to better represent differences across the data.

Similarly, we can plot the GSEA results with BIGpicture. Unlike simple enrichment, GSEA shows directionality such as here where IFN gamma response is up-regulated in response to virus. Remember to define your variable order with factor in the original data if you want something other than alphabetical as this determines whether the fold change values are A minus B or B minus A.

plot_gsea(gsea.virus,
          fdr.cutoff = 0.5, fdr.colors = c(0.05, 0.5))

What if I have more than 2 groups to compare?

GSEA can only support pairwise comparison. Thus, if you have a variable of interest with 3 or more levels, you need to run each pairwise comparison of interest separately. You can access this data in kimma with run.contrasts = TRUE. This is also true for interaction terms, where the model estimate is not a simple fold change.

RNA-seq linear modeling

`limma` and `kimma`

Kim Dill-McFarland, kadm@uw.edu

version January 30, 2023

RNA-seq linear modeling

Overview

Prior to the workshop

Load packages

Data

Data description

Load data

Introduction to linear modeling

Modeling in `limma`

Simple linear regression in `limma`

Paired sample design in `limma`

Modeling in `kimma`

Picking a best fit model

Other models

RNA-seq pathway analysis

Overview

Load packages

DEG enrichment

What about directionality?

A note on saving `SEARchways` results

Gene set enrichment analysis (GSEA)

What if I have more than 2 groups to compare?

Navigation

Additional resources

RNA-seq linear modeling

limma and kimma

Kim Dill-McFarland, kadm@uw.edu

version January 30, 2023

RNA-seq linear modeling

Overview

Prior to the workshop

Load packages

Data

Data description

Load data

Introduction to linear modeling

Modeling in limma

Simple linear regression in limma

Paired sample design in limma

Modeling in kimma

Picking a best fit model

Other models

RNA-seq pathway analysis

Overview

Load packages

DEG enrichment

What about directionality?

A note on saving SEARchways results

Gene set enrichment analysis (GSEA)

What if I have more than 2 groups to compare?

Navigation

Additional resources

`limma` and `kimma`

Modeling in `limma`

Simple linear regression in `limma`

Paired sample design in `limma`

Modeling in `kimma`

A note on saving `SEARchways` results