This document covers gene set related analyses of RNAseq data in the
R package SEARchways
. This pipeline explores differentially
expressed genes (see voom
to DEG) and gene co-expression modules (see RNAseq
modules). The goal is to determine pathways associated with genes
and modules significantly impacted by viral infection or asthma status.
The example data are human, bulk, paired-end RNA-seq, but this pipeline
can be applied to other organisms or single-read libraries.
This pipeline should be completed in R and RStudio. You should also install the following packages.
#CRAN packages
install.packages(c("tidyverse", "WGCNA"))
#Bioconductor packages
install.packages("BiocManager")
BiocManager::install(c("edgeR", "limma", "patchwork"))
#GitHub packages
install.packages("devtools")
devtools::install_github("BIGslu/BIGverse")
#Or if fails, install packages individually
devtools::install_github("BIGslu/kimma")
devtools::install_github("BIGslu/BIGpicture")
devtools::install_github("BIGslu/RNAetc")
devtools::install_github("BIGslu/SEARchways")
And load them into your current R session.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(BIGverse)
## ── Attaching packages ──────────────────────────────────────── BIGverse 1.0.0 ──
## ✔ kimma 1.5.0 ✔ BIGpicture 1.1.0
## ✔ RNAetc 1.1.0 ✔ SEARchways 1.1.0
## ── Conflicts ─────────────────────────────────────────── BIGverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(limma)
library(WGCNA)
## Loading required package: dynamicTreeCut
## Loading required package: fastcluster
##
## Attaching package: 'fastcluster'
##
## The following object is masked from 'package:stats':
##
## hclust
##
##
##
## Attaching package: 'WGCNA'
##
## The following object is masked from 'package:stats':
##
## cor
library(patchwork)
set.seed(651)
Example data were obtained from virus-stimulated human plasmacytoid dendritic cells. Actual patient identifiers and metadata have been altered for this tutorial.
Dill-McFarland KA, Schwartz JT, Zhao H, Shao B, Fulkerson PC, Altman MC, Gill MA. 2022. Eosinophil-mediated suppression and Anti-IL-5 enhancement of plasmacytoid dendritic cell interferon responses in asthma. J Allergy Clin Immunol. Epub ahead of print. doi: 10.1016/j.jaci.2022.03.025. — GitHub
Specifically, this tutorial uses RNAseq data processed using our SEAsnake
and counts
to voom pipelines, resulting in voom-normalized, log2 counts per
million (CPM) expression and associated sample metadata in a
limma EList
object in the data_clean/
directory. These data are available in the kimma
package
within the example.voom
data object.
All counts, gene, and sample metadata are contained in a single
object in the kimma
package. Note that these data contain
only 1000 genes from the full data set create in our counts to voom pipeline. This is to reduce
computational time.
#Extract and rename data object
dat <- kimma::example.voom
First, we determine genes associated with viral infection and/or asthma in this data set. You can see more about model fitting in our voom to DEG tutorial.
virus_lme <- kmFit(dat = dat,
model = "~ virus*asthma + (1|ptID)",
run_lme = TRUE,
use_weights = TRUE)
## lme/lmerel model: expression~virus*asthma+(1|ptID)
## Input: 12 libraries from 6 unique patients
## Model: 12 libraries
## Complete: 1000 genes
## Failed: 0 genes
There are hundreds of genes significant for virus and/or asthma in this data set (FDR < 0.05). This is too many to assess individuals so we use enrichment to determine pathways associated with these genes.
summarise_kmFit(virus_lme$lme)
## # A tibble: 5 × 7
## variable fdr_0.05 fdr_0.1 fdr_0.2 fdr_0.3 fdr_0.4 fdr_0.5
## <fct> <int> <int> <int> <int> <int> <int>
## 1 (1 | ptID) 7 9 21 24 33 124
## 2 asthma 69 98 140 204 284 355
## 3 virus 266 314 397 468 566 628
## 4 virus:asthma 17 33 46 77 89 106
## 5 total (nonredundant) 316 384 493 590 710 797
BIGprofiler
SEARchway
contains two functions to run enrichment. The
first BIGprofiler
employs clusterProfiler
to
run Fisher’s exact tests of your genes of interest against any gene set
data base available in the Broad Molecular Signatures Database (MSigDB).
Here, let’s run the virus significant genes against the Hallmark data
base.
#Virus significant genes
genes.virus <- virus_lme$lme %>%
filter(variable == "virus" & FDR < 0.05) %>%
pull(gene)
enrich.virus <- BIGprofiler(gene_list = list("virus" = genes.virus),
ID = "ENSEMBL",
category = "H")
## [1] "virus"
The output includes:
group
: Name identifier for genes of interestsize_group
: Total genes in each groupgs_cat
: Gene set data base categorygs_subcat
: Gene set data base subcategorysize_cat.subcat
: Total genes in data base
category/subcategorygroup_in_cat.subcat
: Total significant genes in data
base category/subcategory e.g. how many of your genes of interest are
present in the data basepathway
: Gene set namesize_pathway
: Total genes in gene set (K)group_in_pathway
: Total significant genes in gene set
(k)k/K
: group_in_pathway / size_pathway, a commonly used
metric of enrichmentpval
: P-valueFDR
: FDR corrected P-valueqvalue
: q-valuegenes
: List of significant genes in gene set (not shown
below for length)The results show that there is 1 pathway enriched in virus significant genes at FDR < 0.05.
enrich.virus %>%
filter(FDR < 0.05) %>%
select(-genes)
## group size_group gs_cat gs_subcat size_cat.subcat group_in_cat.subcat
## 1 virus 266 H 4917 92
## pathway size_pathway group_in_pathway k/K
## 1 HALLMARK_INTERFERON_GAMMA_RESPONSE 286 18 0.06293706
## pval FDR qvalue
## 1 3.960295e-06 0.0001900941 0.0001875929
We can easily plot this result in BIGpicture
. We’ll use
FDR < 0.5 so you can see multiple pathways. You can see that just
over 6% of the IFN gamma response pathway is present in our list of
virus significant genes.
plot_enrich(enrich.virus,
fdr_cutoff = 0.5, fdr_colors = c(0.05, 0.5))
You can also look at multiple variables by inputing a data frame
(gene_df
) with the variable in 1 column and the significant
genes in another. Note that the plot shows non-significant results for
one variable or the other. This can be turned off with
show.overlap = FALSE
.
gene_df <- virus_lme$lme %>%
filter(FDR < 0.05 & variable %in% c("virus","asthma")) %>%
select(variable, gene)
head(gene_df)
## # A tibble: 6 × 2
## variable gene
## <chr> <chr>
## 1 virus ENSG00000002587
## 2 asthma ENSG00000005884
## 3 asthma ENSG00000006712
## 4 virus ENSG00000010818
## 5 virus ENSG00000013810
## 6 asthma ENSG00000013810
enrich.all <- BIGprofiler(gene_df = gene_df,
ID = "ENSEMBL",
category = "H")
## [1] "virus"
## [1] "asthma"
plot_enrich(enrich.all,
fdr_cutoff = 0.5, fdr_colors = c(0.05, 0.5))
This type of enrichment does not address whether genes are up or
down-regulated, just that they are significant in some way. Thus, if you
would like to know direction, you need to create subsets like
virus.up
and virus.down
, then run enrichment
on the lists separately. This should be done with caution as not all
gene sets have concordant directionality and shorter gene lists are less
likely to be significant in any pathway.
SEARchways
resultsMany SEARchways
results contain list data. For example,
the genes
column from all of the BIG
functions
in this tutorial is a list column. This means these data cannot be saved
outside of R as they are.
class(enrich.all$genes)
## [1] "list"
To save this column outside R, simply convert it to a
character
.
enrich.all %>%
mutate(genes = as.character(genes)) %>%
write_csv(file = "results/BIGprofiler.results.csv")
Note that if you read this csv back into R, the genes
column will not be formatting correctly for downstream
BIGverse
functions. Thus, we recommend saving data in
.RData
or .Rds
in order to maintain
formatting.
Alternatively, you can unnest the data to get 1 row per gene per
pathway. Then when you read the csv
into R, you can re-nest
it like so.
enrich.all %>%
unnest(genes) %>%
write_csv(file = "results/BIGprofiler.results.csv")
enrich.all <- read_csv("results/BIGprofiler.results.csv") %>%
group_by(across(c(-genes))) %>%
summarise(genes = list(genes), .groups = "drop")
Another gene set related analysis is GSEA. This method uses fold change values of all genes in the data set instead of having you pre-define a significant list of genes. GSEA, thus, can find significant pathways with no individually significant genes in them. Instead, smaller up (or down) regulations of many genes can drive overall significance of a pathway.
BIGsea
runs GSEA with parameters similar to
BIGprofiler
only now the input is log fold change (named
estimate
in the kimma
output).
gene_df2 <- virus_lme$lme %>%
filter(variable %in% c("virus", "asthma")) %>%
select(variable, gene, estimate)
head(gene_df2)
## # A tibble: 6 × 3
## variable gene estimate
## <chr> <chr> <dbl>
## 1 virus ENSG00000000460 0.425
## 2 asthma ENSG00000000460 0.783
## 3 virus ENSG00000001460 0.369
## 4 asthma ENSG00000001460 0.225
## 5 virus ENSG00000002587 -0.858
## 6 asthma ENSG00000002587 -0.130
gsea.all <- BIGsea(gene_df = gene_df2,
ID = "ENSEMBL",
category = "H")
## virus
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.7% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## asthma
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.7% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
Note the warning about ties in the data. This means that 2 or more genes have an identical log fold change value and thus, GSEA must randomly determine their rank order relative to each other. This is not an issue because we’ve already set a random seed for reproducibility. However, if there are a lot of ties in your data, like > 10%, this may indicate a need to re-assess your linear model to better represent differences across the data.
Similarly, we can plot the GSEA results with BIGpicture
.
Until simple enrichment, GSEA shows directionality such as here where
IFN gamma response is up-regulated in response to virus. Remember to
define your variable order with factor
in the original data
if you want something other than alphabetical as this determines whether
the fold change values are A-B or B-A.
plot_gsea(gsea.all,
fdr_cutoff = 0.2)
GSEA can only support pairwise comparison. Thus, if you have a
variable of interest with 3 or more levels, you need to run each
pairwise comparison of interest separately. You can access this data in
kimma
with run.contrasts = TRUE
. This is also
true for interaction terms, where the model estimate is not a simple
fold change.
Gene set analyses can also be used to assign function to gene co-expression modules. For more information on module building, see our RNAseq module tutorial.
#Define significant genes
genes.interact <- virus_lme$lme %>%
filter(variable == "virus:asthma" & FDR < 0.2) %>%
pull(gene)
genes.main <- virus_lme$lme %>%
filter(variable %in% c("virus", "asthma") & FDR < 0.2) %>%
pull(gene)
genes.all <- unique(c(genes.interact, genes.main))
fit <- fit_modules(dat = kimma::example.voom,
genes = genes.all,
powerVector=c(1:30),
networkType = "signed",
nThread = 4)
mod.p14 <- make_modules(
fit,
sft.value = 14,
minModuleSize = 20,
maxBlockSize = 500,
deepSplit = 2,
networkType = "signed",
mods_mean = TRUE, mods_eigen = TRUE,
david = TRUE,
nThread = 4)
The SEARchways
function can be applied to module results
from RNAetc
like so.
We see several pathways enrichment in some of the modules.
Note that this also showcases that different gene identifiers can be used in enrichment.
mod_df <- mod.p14$mods %>%
select(module.char, geneName)
head(mod_df)
## module.char geneName
## 1 05 ENSG00000002587
## 2 04 ENSG00000005884
## 3 01 ENSG00000006453
## 4 04 ENSG00000006634
## 5 07 ENSG00000006712
## 6 02 ENSG00000010278
enrich.mod <- BIGprofiler(gene_df = mod_df,
ID = "ENSEMBL",
category = "H")
## [1] "05"
## [1] "04"
## [1] "01"
## [1] "07"
## [1] "02"
## [1] "06"
## [1] "08"
## [1] "03"
## [1] "00"
plot_enrich(enrich.mod, fdr_cutoff = 0.2,
show_overlap = FALSE)
The STRING protein-protein
interaction network data base is a useful tool for visualizing genes of
interest. We have incorporated this visualization with
SEARchways
in order to facilitate annotation of the a
STRING network with significant pathways.
For example, we plot the virus significant genes and color by
significant hypergeometric enrichment from BIGprofiler
.
#Map gene to STRING
map <- map_string(genes = genes.virus)
## Warning: we couldn't map to STRING 0% of your identifiers
#plot
plot_string(map = map,
enrichment = enrich.virus, fdr_cutoff = 0.05)
## Registered S3 method overwritten by 'ggnetwork':
## method from
## fortify.igraph ggtree
We used ENSEMBL gene identifiers in our enrichment, so these are used
to label the networks nodes. However, you may wish to use more
human-readable names like HGNC symbols. We can pull these data from the
original dat$genes
metadata or if you do not already have
this information, checkout biomaRt
.
#View gene metadata
head(dat$genes)
## hgnc_symbol Previous symbols Alias symbols
## ENSG00000000460 RAB12 <NA> <NA>
## ENSG00000001460 MIA <NA> CD-RAP
## ENSG00000002587 C1orf52 <NA> gm117, FLJ44982
## ENSG00000005189 C12orf73 <NA> FLJ13975, DKFZp547P055, BRAWNIN
## ENSG00000005801 EMC2 KIAA0103, TTC35 <NA>
## ENSG00000005884 CNPPD1 C2orf24 CGI-57
## gene_biotype geneName
## ENSG00000000460 protein-coding gene ENSG00000206418
## ENSG00000001460 protein-coding gene ENSG00000261857
## ENSG00000002587 protein-coding gene ENSG00000162642
## ENSG00000005189 protein-coding gene ENSG00000204954
## ENSG00000005801 protein-coding gene ENSG00000104412
## ENSG00000005884 protein-coding gene ENSG00000115649
#Convert ENSEMBL gene list to HGNC symbol
genes.virus.hgnc <- dat$genes %>%
filter(geneName %in% genes.virus) %>%
pull(hgnc_symbol)
genes.virus.hgnc[1:3]
## [1] "MIA" "C1orf52" "EMC2"
We also need to convert the enrichment results
genes.key <- dat$genes %>%
select(geneName, hgnc_symbol)
enrich.virus.hgnc <- enrich.virus %>%
#Match ENSEMBL to HGNC
unnest(genes) %>%
left_join(genes.key, by = c("genes"="geneName")) %>%
select(-genes) %>%
#Re-nest gene IDs
group_by(across(c(-hgnc_symbol))) %>%
summarise(genes = list(hgnc_symbol), .groups = "drop")
Or rerun the enrichment with the HGNC symbols.
enrich.virus.hgnc <- BIGprofiler(gene_list = list("virus" = genes.virus.hgnc),
ID = "SYMBOL",
category = "H")
Then we can re-make the network.
#Map gene to STRING
map <- map_string(genes = genes.virus.hgnc)
## Warning: we couldn't map to STRING 0% of your identifiers
#plot
plot_string(map = map,
enrichment = enrich.virus.hgnc, fdr_cutoff = 0.05)
This network is pretty big so let’s cut if down to just genes with at least 1 connection.
plot_string(map = map,
enrichment = enrich.virus.hgnc, fdr_cutoff = 0.05,
edge_min = 1)
The pathway names aren’t nicely formatted so we could change them in the enrichment results.
enrich.virus.hgnc.format <- enrich.virus.hgnc %>%
mutate(pathway = gsub("HALLMARK_","",pathway),
pathway = gsub("_"," ",pathway))
plot_string(map = map,
enrichment = enrich.virus.hgnc.format, fdr_cutoff = 0.05,
edge_min = 1)
Or we could look at just nodes with no connections (i.e. orphans). We also use a non-default layout for this visualization.
plot_string(map = map,
enrichment = enrich.virus.hgnc.format, fdr_cutoff = 0.05,
edge_max = 0,
layout = "grid")
We can also color by GSEA leading edge genes from
BIGsea
. Again, we need to convert or rerun GSEA with HGNC
symbols to match the current network.
gsea.virus.hgnc <- gsea.all %>%
filter(group == "virus") %>%
#Match ENSEMBL to HGNC
unnest(leadingEdge) %>%
left_join(genes.key, by = c("leadingEdge"="geneName")) %>%
select(-leadingEdge) %>%
#Re-nest gene IDs
group_by(across(c(-hgnc_symbol))) %>%
summarise(leadingEdge = list(hgnc_symbol), .groups = "drop")
plot_string(map = map,
enrichment = gsea.virus.hgnc, fdr_cutoff = 0.2,
edge_min = 1)
Note that fewer genes are colored in the GSEA network. This is because GSEA identifies the leading edge only. Thus, you can have genes in your network that are annotated to a significant GSEA pathway but not colored because they were not leading edge in your comparison of interest.
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] patchwork_1.1.2 WGCNA_1.72-1 fastcluster_1.2.3
## [4] dynamicTreeCut_1.63-1 limma_3.54.2 SEARchways_1.1.0
## [7] BIGpicture_1.1.0 RNAetc_1.1.0 kimma_1.5.0
## [10] BIGverse_1.0.0 lubridate_1.9.2 forcats_1.0.0
## [13] stringr_1.5.0 dplyr_1.1.2 purrr_1.0.1
## [16] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
## [19] ggplot2_3.4.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] shadowtext_0.1.2 backports_1.4.1 Hmisc_5.1-0
## [4] fastmatch_1.1-3 plyr_1.8.8 igraph_1.4.3
## [7] lazyeval_0.2.2 splines_4.2.1 BiocParallel_1.32.6
## [10] ggnetwork_0.5.12 GenomeInfoDb_1.34.9 digest_0.6.31
## [13] foreach_1.5.2 yulab.utils_0.0.6 htmltools_0.5.5
## [16] GOSemSim_2.24.0 viridis_0.6.3 GO.db_3.16.0
## [19] fansi_1.0.4 magrittr_2.0.3 checkmate_2.2.0
## [22] memoise_2.0.1 cluster_2.1.4 doParallel_1.0.17
## [25] tzdb_0.4.0 Biostrings_2.66.0 graphlayouts_1.0.0
## [28] matrixStats_0.63.0 timechange_0.2.0 enrichplot_1.18.4
## [31] colorspace_2.1-0 blob_1.2.4 ggrepel_0.9.3
## [34] xfun_0.39 crayon_1.5.2 RCurl_1.98-1.12
## [37] jsonlite_1.8.4 scatterpie_0.2.0 impute_1.72.3
## [40] hash_2.2.6.2 survival_3.5-5 iterators_1.0.14
## [43] ape_5.7-1 glue_1.6.2 polyclip_1.10-4
## [46] gtable_0.3.3 zlibbioc_1.44.0 XVector_0.38.0
## [49] BiocGenerics_0.44.0 scales_1.2.1 DOSE_3.24.2
## [52] msigdbr_7.5.1 DBI_1.1.3 Rcpp_1.0.10
## [55] plotrix_3.8-2 viridisLite_0.4.2 htmlTable_2.4.1
## [58] tidytree_0.4.2 gridGraphics_0.5-1 foreign_0.8-84
## [61] bit_4.0.5 preprocessCore_1.60.2 sqldf_0.4-11
## [64] Formula_1.2-5 stats4_4.2.1 htmlwidgets_1.6.2
## [67] httr_1.4.6 fgsea_1.24.0 gplots_3.1.3
## [70] RColorBrewer_1.1-3 pkgconfig_2.0.3 farver_2.1.1
## [73] nnet_7.3-19 sass_0.4.6 STRINGdb_2.10.1
## [76] utf8_1.2.3 labeling_0.4.2 ggplotify_0.1.0
## [79] tidyselect_1.2.0 rlang_1.1.1 reshape2_1.4.4
## [82] AnnotationDbi_1.60.2 munsell_0.5.0 tools_4.2.1
## [85] cachem_1.0.8 downloader_0.4 cli_3.6.1
## [88] gsubfn_0.7 generics_0.1.3 RSQLite_2.3.1
## [91] gson_0.1.0 evaluate_0.21 fastmap_1.1.1
## [94] yaml_2.3.7 ggtree_3.6.2 babelgene_22.9
## [97] knitr_1.43 bit64_4.0.5 tidygraph_1.2.3
## [100] caTools_1.18.2 KEGGREST_1.38.0 ggraph_2.1.0
## [103] nlme_3.1-162 aplot_0.1.10 compiler_4.2.1
## [106] rstudioapi_0.14 png_0.1-8 treeio_1.22.0
## [109] tweenr_2.0.2 bslib_0.4.2 stringi_1.7.12
## [112] highr_0.10 lattice_0.21-8 Matrix_1.5-4.1
## [115] vctrs_0.6.2 pillar_1.9.0 lifecycle_1.0.3
## [118] jquerylib_0.1.4 data.table_1.14.8 cowplot_1.1.1
## [121] bitops_1.0-7 qvalue_2.30.0 R6_2.5.1
## [124] KernSmooth_2.23-21 gridExtra_2.3 IRanges_2.32.0
## [127] codetools_0.2-19 gtools_3.9.4 MASS_7.3-60
## [130] chron_2.3-61 proto_1.0.0 withr_2.5.0
## [133] S4Vectors_0.36.2 GenomeInfoDbData_1.2.9 parallel_4.2.1
## [136] hms_1.1.3 clusterProfiler_4.6.2 grid_4.2.1
## [139] rpart_4.1.19 ggfun_0.0.9 HDO.db_0.99.1
## [142] rmarkdown_2.22 ggforce_0.4.1 Biobase_2.58.0
## [145] base64enc_0.1-3