This package is a collection of functions useful in RNAseq data analysis that don’t fit well into our other packages (see info-graphic). They are involved in data cleaning, data manipulation, WGCNA module analyses (mostly).
This pipeline should be completed in R version 4.5.0 (2025-04-11) or newer. Please also download the following packages.
# CRAN packages
install.packages(c("tidyverse","patchwork"))
# Bioconductor packages
## These are dependencies for kimma that are beneficial to download separately
install.packages("BiocManager")
BiocManager::install(c("limma", "edgeR"))
# GitHub packages
install.packages("devtools")
devtools::install_github("BIGslu/kimma")
devtools::install_github("BIGslu/BIGpicture")
devtools::install_github("BIGslu/RNAetc")
devtools::install_github("BIGslu/SEARchways")
And load them into your current R session.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr 1.1.4 âś” readr 2.1.5
## âś” forcats 1.0.0 âś” stringr 1.5.1
## âś” ggplot2 3.5.2 âś” tibble 3.2.1
## âś” lubridate 1.9.4 âś” tidyr 1.3.1
## âś” purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(edgeR)
## Loading required package: limma
library(limma)
library(RNAetc)
library(kimma)
##
## Attaching package: 'kimma'
##
## The following objects are masked from 'package:RNAetc':
##
## example.dat, example.voom
library(BIGpicture)
library(SEARchways)
library(patchwork)
To ensure reproducibility with this document, set the random seed value.
set.seed(651)
Example data were obtained from media controls and human rhinovirus (HRV) infected human plasmacytoid dendritic cells. You can learn more about these data in
Dill-McFarland KA, Schwartz JT, Zhao H, Shao B, Fulkerson PC, Altman MC, Gill MA. 2022. Eosinophil-mediated suppression and Anti-IL-5 enhancement of plasmacytoid dendritic cell interferon responses in asthma. J Allergy Clin Immunol. Epub ahead of print. doi: 10.1016/j.jaci.2022.03.025. — GitHub
Specifically, this vignette uses bulk human RNA-seq processed from fastq to counts using SEAsnake and counts to voom using edgeR/limma. This results in voom-normalized, log2 counts per million (CPM) gene expression and associated sample and gene metadata. In total, 6 donors and 1000 random genes were selected for this vignette. Actual donor identifiers have been altered for privacy.
Unnormalized RNAseq counts and metadata.
example.dat <- RNAetc::example.dat
names(example.dat)
## [1] "counts" "samples" "genes"
Voom normalized RNAseq counts and metadata.
example.voom <- RNAetc::example.voom
names(example.voom)
## [1] "genes" "targets" "E" "weights" "design"
Example WGCNS modules.
example.mods <- RNAetc::example.mods
head(example.mods)
## geneName module module.char mod.color hgnc_symbol Previous symbols
## 1 ENSG00000005189 0 00 grey REXO5 <NA>
## 2 ENSG00000006025 0 00 grey OSBPL7 <NA>
## 3 ENSG00000006118 0 00 grey TMEM132A HSPA5BP1
## 4 ENSG00000007038 0 00 grey PRSS21 <NA>
## 5 ENSG00000007080 0 00 grey CCDC124 <NA>
## 6 ENSG00000008323 0 00 grey PLEKHG6 <NA>
## Alias symbols gene_biotype
## 1 NEF-sp protein-coding gene
## 2 ORP7, MGC71150 protein-coding gene
## 3 GBP, FLJ20539 protein-coding gene
## 4 ESP-1, TEST1 protein-coding gene
## 5 <NA> protein-coding gene
## 6 FLJ10665, MYOGEF protein-coding gene
filter_rare
: Filter rare and low abundance genesRemove genes that are not at least X counts per million (CPM) in at
least Y samples or Z% of samples. Generally applied to a
DGEList
object prior to voom normalization.
dim(example.dat)
## [1] 1000 12
example.dat.filter <- filter_rare(dat = example.dat,
min_CPM = 1,
min_sample = 3,
gene_var = "geneName")
## 46 (4.6%) of 1000 genes removed.
dim(example.dat.filter)
## [1] 954 12
example.dat.filter <- filter_rare(dat = example.dat,
min_CPM = 1,
min_pct = 10,
gene_var = "geneName")
## 2 (0.2%) of 1000 genes removed.
dim(example.dat.filter)
## [1] 998 12
Automatically plot the mean variance trends before and after filtering.
example.dat.filter <- filter_rare(dat = example.dat,
min_CPM = 1,
min_pct = 10,
gene_var = "geneName",
plot = TRUE)
## 2 (0.2%) of 1000 genes removed.
Other parameters
gene_var
to identify the column in
dat$genes
to match to rownames in
dat$counts
collapse_voom
: Combine voom object into a single
dataframeCollapse a voom object by matching the libraryID
and
geneID
columns.
names(example.voom)
## [1] "genes" "targets" "E" "weights" "design"
dat2 <- collapse_voom(dat = example.voom)
colnames(dat2)
## [1] "geneName" "libID" "value"
## [4] "group" "lib.size" "norm.factors"
## [7] "ptID" "median_cv_coverage" "virus"
## [10] "asthma" "batch" "hgnc_symbol"
## [13] "Previous symbols" "Alias symbols" "gene_biotype"
head(dat2)
## # A tibble: 6 Ă— 15
## geneName libID value group lib.size norm.factors ptID median_cv_coverage
## <chr> <chr> <dbl> <fct> <dbl> <dbl> <chr> <dbl>
## 1 ENSG00000000… lib1 6.11 1 79646. 1.00 dono… 0.514
## 2 ENSG00000000… lib2 8.00 1 88008. 0.951 dono… 0.435
## 3 ENSG00000000… lib3 7.32 1 178020. 1.09 dono… 0.374
## 4 ENSG00000000… lib4 7.33 1 133836. 0.943 dono… 0.388
## 5 ENSG00000000… lib5 7.92 1 192547. 1.00 dono… 0.353
## 6 ENSG00000000… lib6 7.99 1 175144. 0.974 dono… 0.349
## # ℹ 7 more variables: virus <fct>, asthma <chr>, batch <dbl>,
## # hgnc_symbol <chr>, `Previous symbols` <chr>, `Alias symbols` <chr>,
## # gene_biotype <chr>
Add the gene-level weights in dat$weights
dat2 <- collapse_voom(dat = example.voom,
include_weights = TRUE)
colnames(dat2)
## [1] "geneName" "libID" "value"
## [4] "group" "lib.size" "norm.factors"
## [7] "ptID" "median_cv_coverage" "virus"
## [10] "asthma" "batch" "hgnc_symbol"
## [13] "Previous symbols" "Alias symbols" "gene_biotype"
## [16] "weights"
subset_voom
: Subset voom objectSubset a voom object to specific libraries.
dim(example.voom)
## [1] 1000 12
dat.sub1 <- subset_voom(dat = example.voom,
lib_keep = c("lib1","lib3"))
## Subsetting to 2 of 12 libraries
## Subsetting to 1000 of 1000 genes
## Done!
dim(dat.sub1)
## [1] 1000 2
Subset a voom object to remove specific libraries.
dim(example.voom)
## [1] 1000 12
dat.sub2 <- subset_voom(dat = example.voom,
lib_remove = c("lib1","lib3"))
## Subsetting to 10 of 12 libraries
## Subsetting to 1000 of 1000 genes
## Done!
dim(dat.sub2)
## [1] 1000 10
Use the voom metadata to subset the object based on a logical
statement. For example, keep all of the HRV infected samples. Be careful
to use single quotations ’’ within the lib_filter
parameter
which is use double quotations “” overall.
dim(example.voom)
## [1] 1000 12
dat.sub3 <- subset_voom(dat = example.voom,
lib_filter = "virus == 'HRV'")
## Subsetting to 6 of 12 libraries
## Subsetting to 1000 of 1000 genes
## Done!
dim(dat.sub3)
## [1] 1000 6
Instead of subsetting libraries, you can also subset to a vector of specific genes.
dim(example.voom)
## [1] 1000 12
dat.sub4 <- subset_voom(dat = example.voom,
gene_keep =c("ENSG00000000460","ENSG00000001460",
"ENSG00000002587"))
## Subsetting to 12 of 12 libraries
## Subsetting to 3 of 1000 genes
## Done!
dim(dat.sub4)
## [1] 3 12
combine_voom
: Combine 2 voom objectsCombine two voom objects.
dim(dat.sub1)
## [1] 1000 2
dim(dat.sub2)
## [1] 1000 10
dat.recombine <- combine_voom(dat1 = dat.sub1, dat2 = dat.sub2)
## Combining genes: 1000 from dat1 and 1000 from dat2.
## Combining libraries: 2 from dat1 and 10 from dat2.
##
## Returning voom object with 12 libraries and 1000 genes.
dim(dat.recombine)
## [1] 1000 12
This group of functions works around gene co-expression modules
created by WGCNA
.
fit_modules
, make_modules
, and
compare_modules
allow a subset of parameters from
WGCNA
that we commonly utilize. If you require further
customization, we recommend running WCGNA
functions
directly.
fit_modules
: Perform soft thresholdingUsed to determine the power to use for WGCNA module building.
mod.fit <- fit_modules(dat = example.voom)
##
head(mod.fit$sft)
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.26313660 9.943599 0.8813112 502.00648 503.15088 531.43769
## 2 2 0.03041606 -1.797055 0.8636882 283.23033 280.65901 330.74298
## 3 3 0.02812478 -1.095957 0.8733717 173.48229 171.12225 226.52349
## 4 4 0.05936983 -1.048659 0.8945757 113.01295 111.05051 164.06308
## 5 5 0.11880939 -1.048333 0.9066056 77.24630 75.01133 123.94006
## 6 6 0.17295138 -1.048439 0.9210418 54.87881 52.62128 96.89712
## signed.R.sq
## 1 -0.26313660
## 2 0.03041606
## 3 0.02812478
## 4 0.05936983
## 5 0.11880939
## 6 0.17295138
mod.fit$top.plot + mod.fit$connect.plot
Change the power levels tested. For example, decrease to improve computational time or increase if you have not reached stability fo the default 30. In all honestly, though, 1:30 is generally sufficient and weirdness can happen at higher powers.
mod.fit2 <- fit_modules(dat = example.voom,
powerVector = c(1:50))
head(mod.fit2$sft)
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.26313660 9.943599 0.8813112 502.00648 503.15088 531.43769
## 2 2 0.03041606 -1.797055 0.8636882 283.23033 280.65901 330.74298
## 3 3 0.02812478 -1.095957 0.8733717 173.48229 171.12225 226.52349
## 4 4 0.05936983 -1.048659 0.8945757 113.01295 111.05051 164.06308
## 5 5 0.11880939 -1.048333 0.9066056 77.24630 75.01133 123.94006
## 6 6 0.17295138 -1.048439 0.9210418 54.87881 52.62128 96.89712
## signed.R.sq
## 1 -0.26313660
## 2 0.03041606
## 3 0.02812478
## 4 0.05936983
## 5 0.11880939
## 6 0.17295138
mod.fit2$top.plot + mod.fit2$connect.plot
Use an “unsigned” or “signed hybrid” network instead of the default signed.
mod.fit2 <- fit_modules(dat = example.voom,
networkType = "unsigned")
mod.fit2$top.plot + mod.fit2$connect.plot
Use a subset of genes to create modules.
mod.fit2 <- fit_modules(dat = example.voom,
genes = example.voom$genes$geneName[1:100])
mod.fit2$top.plot + mod.fit2$connect.plot
make_modules
: Construct WGCNA modules and associated
dataBuild modules using power based on fit_modules
such as
power that reaches an R-squared of 0.9.
mods1 <- make_modules(fit = mod.fit,
Rsq_min = 0.9)
Or set it by-hand based on the thresholding plot local (which is what I’d do with these data) or absolute maximum.
mods1 <- make_modules(fit = mod.fit,
sft_value = 14)
Provide additional results including mods_mean
(mean
gene expression by module), mods_eigen
(eigenvalue
expression by module), or david
(DAVID formatted genes in
modules data frame).
mods1 <- make_modules(fit = mod.fit,
sft_value = 14,
mods_mean = TRUE,
mods_eigen = TRUE,
david = TRUE)
Other parameters
WGCNA::blockwiseConsensusModules( )
for more details.
minModuleSize
: Smallest possible modulemaxBlockSize
: Maximum number of genes to include in 1
dendrogram (many dendrograms go into making modules)deepSplit
: Build sensitivity, 0 least and 4 most
sensitivenetworkType
, TOMType
: Signed, unsigned,
signed hybrid network creationnames(mods1)
## [1] "genes" "mods" "sft" "top.plot" "connect.plot"
## [6] "mods.mean" "mods.eigen" "david"
mods1$genes
: Vector of genes used in module buildmods1$mods
: Module membershipmods1$sft
: Results from fit_modules
mods1$top.plot
and mods1$connect.plot
:
Plots from fit_modules
mods1$mods.mean
: Mean gene expression by modulemods1$mods.eigen
: Eigenvalue expression by modulemods1$david
: DAVID formatted genes in modulescompare_modules
: Compare WGCNA module buildsUsed to help find the “best” module build. You can create many builds with varied R-squared, soft threshold, minimum module size, maximum block size, deep split, and/or network sign.
Recommended: Increase nThread
to run on
multiple processors and speed up results.
Compare 2 soft thresholds and 2 minimum sizes.
comp1 <- compare_modules(fit = mod.fit,
sft_value = c(10,14),
minModuleSize = c(20,50),
nThread = 6)
## Running 4 module builds.
names(comp1)
## [1] "summary" "modules"
#View parameters
comp1$summary
## sft_value SFT.R.sq minModuleSize maxBlockSize deepSplit networkType TOMType
## 1 10 0.5419406 20 500 3 signed signed
## 2 14 0.8185863 20 500 3 signed signed
## 3 10 0.5419406 50 500 3 signed signed
## 4 14 0.8185863 50 500 3 signed signed
## param genes mod0 modMin modMax modN
## 1 param1 1000 88 21 114 17
## 2 param2 1000 81 20 99 18
## 3 param3 1000 212 77 186 6
## 4 param4 1000 173 50 171 7
#Summarize total genes in modules
plyr::ldply(comp1$modules) %>%
pivot_wider(names_from = `.id`, values_from = n)
## # A tibble: 19 Ă— 5
## module param1 param2 param3 param4
## <dbl> <int> <int> <int> <int>
## 1 0 88 81 212 173
## 2 1 114 99 186 171
## 3 2 104 91 174 171
## 4 3 77 81 140 140
## 5 4 71 75 107 105
## 6 5 64 75 104 99
## 7 6 61 68 77 91
## 8 7 61 68 NA 50
## 9 8 53 50 NA NA
## 10 9 48 45 NA NA
## 11 10 46 38 NA NA
## 12 11 39 37 NA NA
## 13 12 36 36 NA NA
## 14 13 33 35 NA NA
## 15 14 32 32 NA NA
## 16 15 28 24 NA NA
## 17 16 24 23 NA NA
## 18 17 21 22 NA NA
## 19 18 NA 20 NA NA
Compare 2 R-squared and 2 deep splits. Or many other combinations…
comp2 <- compare_modules(fit = mod.fit,
Rsq_min = c(0.8,0.9),
deepSplit = c(1,4))
Running 4 module builds.
Other parameters
WGCNA::blockwiseConsensusModules( )
for more details.
minModuleSize
: Smallest possible modulemaxBlockSize
: Maximum number of genes to include in 1
dendrogram (many dendrograms go into making modules)deepSplit
: Build sensitivity, 0 least and 4 most
sensitivenetworkType
, TOMType
: Signed, unsigned,
signed hybrid network creationcalculate_module_coherence
: Calculate WGCNA module
coherenceCalculate expression correlation between genes within modules. Compare coherence for modules built within the same data set (example below) or from another data set.
coher <- calculate_module_coherence(
mods = mods1$mods, mods_title = "Study1",
dat = example.voom, dat_title = "Study1")
#correlation data
head(coher$subgene_correlation_df)
## Set gene1 gene2 Cor P
## 1 01 ENSG00000000460 ENSG00000006712 0.7910182 0.002183823
## 2 01 ENSG00000000460 ENSG00000022556 0.4780634 0.115948989
## 3 01 ENSG00000000460 ENSG00000025156 0.4154863 0.179201540
## 4 01 ENSG00000000460 ENSG00000033627 0.7003896 0.011192596
## 5 01 ENSG00000000460 ENSG00000037749 0.5549353 0.061095997
## 6 01 ENSG00000000460 ENSG00000064703 0.6346082 0.026642565
#Correlations and significance
coher$coherence_boxplot_combined
Change the red cutoff lines in the plots.
coher <- calculate_module_coherence(
mods = mods1$mods, mods_title = "Study1",
dat = example.voom, dat_title = "Study21",
r_cutoff = 0.5,
p_cutoff = 0.05)
coher$coherence_boxplot_combined
Other parameters
mods_var
and gene_var
to change the
columns used to define modules and genesremove_mods
allows you to ignore select modules. By
default, it removes the 0 or grey uncorrelated module.return_plot = FALSE
sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.4.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] patchwork_1.3.0 SEARchways_1.5.0 BIGpicture_1.2.0 kimma_1.6.3
## [5] RNAetc_1.2.0 edgeR_4.6.1 limma_3.64.0 lubridate_1.9.4
## [9] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4
## [13] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.2
## [17] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 gridExtra_2.3 rlang_1.1.6
## [4] magrittr_2.0.3 matrixStats_1.5.0 compiler_4.5.0
## [7] RSQLite_2.3.9 png_0.1-8 vctrs_0.6.5
## [10] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
## [13] backports_1.5.0 XVector_0.48.0 labeling_0.4.3
## [16] utf8_1.2.4 rmarkdown_2.29 tzdb_0.5.0
## [19] preprocessCore_1.70.0 UCSC.utils_1.4.0 bit_4.6.0
## [22] xfun_0.52 cachem_1.1.0 GenomeInfoDb_1.44.0
## [25] jsonlite_2.0.0 blob_1.2.4 BiocParallel_1.42.0
## [28] parallel_4.5.0 cluster_2.1.8.1 R6_2.6.1
## [31] bslib_0.9.0 stringi_1.8.7 RColorBrewer_1.1-3
## [34] rpart_4.1.24 jquerylib_0.1.4 Rcpp_1.0.14
## [37] iterators_1.0.14 knitr_1.50 WGCNA_1.73
## [40] base64enc_0.1-3 IRanges_2.42.0 nnet_7.3-20
## [43] Matrix_1.7-3 splines_4.5.0 timechange_0.3.0
## [46] tidyselect_1.2.1 rstudioapi_0.17.1 yaml_2.3.10
## [49] doParallel_1.0.17 codetools_0.2-20 plyr_1.8.9
## [52] lattice_0.22-7 Biobase_2.68.0 withr_3.0.2
## [55] KEGGREST_1.48.0 evaluate_1.0.3 foreign_0.8-90
## [58] survival_3.8-3 Biostrings_2.76.0 pillar_1.10.2
## [61] checkmate_2.3.2 foreach_1.5.2 stats4_4.5.0
## [64] generics_0.1.3 S4Vectors_0.46.0 hms_1.1.3
## [67] scales_1.4.0 gtools_3.9.5 glue_1.8.0
## [70] Hmisc_5.2-3 tools_4.5.0 data.table_1.17.0
## [73] fgsea_1.34.0 locfit_1.5-9.12 fastcluster_1.2.6
## [76] fastmatch_1.1-6 cowplot_1.1.3 grid_4.5.0
## [79] impute_1.82.0 colorspace_2.1-1 AnnotationDbi_1.70.0
## [82] GenomeInfoDbData_1.2.14 htmlTable_2.4.3 Formula_1.2-5
## [85] cli_3.6.5 gtable_0.3.6 dynamicTreeCut_1.63-1
## [88] sass_0.4.10 digest_0.6.37 BiocGenerics_0.54.0
## [91] htmlwidgets_1.6.4 farver_2.1.2 memoise_2.0.1
## [94] htmltools_0.5.8.1 lifecycle_1.0.4 httr_1.4.7
## [97] GO.db_3.21.0 statmod_1.5.0 bit64_4.6.0-1