Overview

This package is a collection of functions useful in RNAseq data analysis that don’t fit well into our other packages (see info-graphic). They are involved in data cleaning, data manipulation, WGCNA module analyses (mostly).

Setup

Software and packages

This pipeline should be completed in R version 4.5.0 (2025-04-11) or newer. Please also download the following packages.

# CRAN packages
install.packages(c("tidyverse","patchwork"))

# Bioconductor packages
## These are dependencies for kimma that are beneficial to download separately
install.packages("BiocManager")
BiocManager::install(c("limma", "edgeR"))

# GitHub packages
install.packages("devtools")
devtools::install_github("BIGslu/kimma")
devtools::install_github("BIGslu/BIGpicture")
devtools::install_github("BIGslu/RNAetc")
devtools::install_github("BIGslu/SEARchways")

And load them into your current R session.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr     1.1.4     âś” readr     2.1.5
## âś” forcats   1.0.0     âś” stringr   1.5.1
## âś” ggplot2   3.5.2     âś” tibble    3.2.1
## âś” lubridate 1.9.4     âś” tidyr     1.3.1
## âś” purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(edgeR)
## Loading required package: limma
library(limma)
library(RNAetc)
library(kimma)
## 
## Attaching package: 'kimma'
## 
## The following objects are masked from 'package:RNAetc':
## 
##     example.dat, example.voom
library(BIGpicture)
library(SEARchways)
library(patchwork)

To ensure reproducibility with this document, set the random seed value.

set.seed(651)

Data background

Example data were obtained from media controls and human rhinovirus (HRV) infected human plasmacytoid dendritic cells. You can learn more about these data in

Dill-McFarland KA, Schwartz JT, Zhao H, Shao B, Fulkerson PC, Altman MC, Gill MA. 2022. Eosinophil-mediated suppression and Anti-IL-5 enhancement of plasmacytoid dendritic cell interferon responses in asthma. J Allergy Clin Immunol. Epub ahead of print. doi: 10.1016/j.jaci.2022.03.025. — GitHub

Specifically, this vignette uses bulk human RNA-seq processed from fastq to counts using SEAsnake and counts to voom using edgeR/limma. This results in voom-normalized, log2 counts per million (CPM) gene expression and associated sample and gene metadata. In total, 6 donors and 1000 random genes were selected for this vignette. Actual donor identifiers have been altered for privacy.

Load data

Unnormalized RNAseq counts and metadata.

example.dat <- RNAetc::example.dat
names(example.dat)
## [1] "counts"  "samples" "genes"

Voom normalized RNAseq counts and metadata.

example.voom <- RNAetc::example.voom
names(example.voom)
## [1] "genes"   "targets" "E"       "weights" "design"

Example WGCNS modules.

example.mods <- RNAetc::example.mods
head(example.mods)
##          geneName module module.char mod.color hgnc_symbol Previous symbols
## 1 ENSG00000005189      0          00      grey       REXO5             <NA>
## 2 ENSG00000006025      0          00      grey      OSBPL7             <NA>
## 3 ENSG00000006118      0          00      grey    TMEM132A         HSPA5BP1
## 4 ENSG00000007038      0          00      grey      PRSS21             <NA>
## 5 ENSG00000007080      0          00      grey     CCDC124             <NA>
## 6 ENSG00000008323      0          00      grey     PLEKHG6             <NA>
##      Alias symbols        gene_biotype
## 1           NEF-sp protein-coding gene
## 2   ORP7, MGC71150 protein-coding gene
## 3    GBP, FLJ20539 protein-coding gene
## 4     ESP-1, TEST1 protein-coding gene
## 5             <NA> protein-coding gene
## 6 FLJ10665, MYOGEF protein-coding gene

RNAseq data cleaning

filter_rare: Filter rare and low abundance genes

Remove genes that are not at least X counts per million (CPM) in at least Y samples or Z% of samples. Generally applied to a DGEList object prior to voom normalization.

dim(example.dat)
## [1] 1000   12
example.dat.filter <- filter_rare(dat = example.dat,
                                  min_CPM = 1,
                                  min_sample = 3,
                                  gene_var = "geneName")
## 46 (4.6%) of 1000 genes removed.
dim(example.dat.filter)
## [1] 954  12
example.dat.filter <- filter_rare(dat = example.dat,
                                  min_CPM = 1,
                                  min_pct = 10,
                                  gene_var = "geneName")
## 2 (0.2%) of 1000 genes removed.
dim(example.dat.filter)
## [1] 998  12

Automatically plot the mean variance trends before and after filtering.

example.dat.filter <- filter_rare(dat = example.dat,
                                  min_CPM = 1,
                                  min_pct = 10,
                                  gene_var = "geneName",
                                  plot = TRUE)

## 2 (0.2%) of 1000 genes removed.

Other parameters

  • Use gene_var to identify the column in dat$genes to match to rownames in dat$counts

RNAseq data manipulation

Shared parameters

All of these functions can be tuned to different voom objects by setting the libraryID and geneID parameters to match column names in your data. The defaults of “libID” and “geneName”, respectively.

collapse_voom: Combine voom object into a single dataframe

Collapse a voom object by matching the libraryID and geneID columns.

names(example.voom)
## [1] "genes"   "targets" "E"       "weights" "design"
dat2 <- collapse_voom(dat = example.voom)
colnames(dat2)
##  [1] "geneName"           "libID"              "value"             
##  [4] "group"              "lib.size"           "norm.factors"      
##  [7] "ptID"               "median_cv_coverage" "virus"             
## [10] "asthma"             "batch"              "hgnc_symbol"       
## [13] "Previous symbols"   "Alias symbols"      "gene_biotype"
head(dat2)
## # A tibble: 6 Ă— 15
##   geneName      libID value group lib.size norm.factors ptID  median_cv_coverage
##   <chr>         <chr> <dbl> <fct>    <dbl>        <dbl> <chr>              <dbl>
## 1 ENSG00000000… lib1   6.11 1       79646.        1.00  dono…              0.514
## 2 ENSG00000000… lib2   8.00 1       88008.        0.951 dono…              0.435
## 3 ENSG00000000… lib3   7.32 1      178020.        1.09  dono…              0.374
## 4 ENSG00000000… lib4   7.33 1      133836.        0.943 dono…              0.388
## 5 ENSG00000000… lib5   7.92 1      192547.        1.00  dono…              0.353
## 6 ENSG00000000… lib6   7.99 1      175144.        0.974 dono…              0.349
## # ℹ 7 more variables: virus <fct>, asthma <chr>, batch <dbl>,
## #   hgnc_symbol <chr>, `Previous symbols` <chr>, `Alias symbols` <chr>,
## #   gene_biotype <chr>

Add the gene-level weights in dat$weights

dat2 <- collapse_voom(dat = example.voom,
                      include_weights = TRUE)
colnames(dat2)
##  [1] "geneName"           "libID"              "value"             
##  [4] "group"              "lib.size"           "norm.factors"      
##  [7] "ptID"               "median_cv_coverage" "virus"             
## [10] "asthma"             "batch"              "hgnc_symbol"       
## [13] "Previous symbols"   "Alias symbols"      "gene_biotype"      
## [16] "weights"

subset_voom: Subset voom object

Subset a voom object to specific libraries.

dim(example.voom)
## [1] 1000   12
dat.sub1 <- subset_voom(dat = example.voom,
                        lib_keep = c("lib1","lib3"))
## Subsetting to 2 of 12 libraries
## Subsetting to 1000 of 1000 genes
## Done!
dim(dat.sub1)
## [1] 1000    2

Subset a voom object to remove specific libraries.

dim(example.voom)
## [1] 1000   12
dat.sub2 <- subset_voom(dat = example.voom,
                        lib_remove = c("lib1","lib3"))
## Subsetting to 10 of 12 libraries
## Subsetting to 1000 of 1000 genes
## Done!
dim(dat.sub2)
## [1] 1000   10

Use the voom metadata to subset the object based on a logical statement. For example, keep all of the HRV infected samples. Be careful to use single quotations ’’ within the lib_filter parameter which is use double quotations “” overall.

dim(example.voom)
## [1] 1000   12
dat.sub3 <- subset_voom(dat = example.voom,
                        lib_filter = "virus == 'HRV'")
## Subsetting to 6 of 12 libraries
## Subsetting to 1000 of 1000 genes
## Done!
dim(dat.sub3)
## [1] 1000    6

Instead of subsetting libraries, you can also subset to a vector of specific genes.

dim(example.voom)
## [1] 1000   12
dat.sub4 <- subset_voom(dat = example.voom,
                        gene_keep =c("ENSG00000000460","ENSG00000001460",
                                     "ENSG00000002587"))
## Subsetting to 12 of 12 libraries
## Subsetting to 3 of 1000 genes
## Done!
dim(dat.sub4)
## [1]  3 12

combine_voom: Combine 2 voom objects

Combine two voom objects.

dim(dat.sub1)
## [1] 1000    2
dim(dat.sub2)
## [1] 1000   10
dat.recombine <- combine_voom(dat1 = dat.sub1, dat2 = dat.sub2)
## Combining genes: 1000 from dat1 and 1000 from dat2.
## Combining libraries: 2 from dat1 and 10 from dat2.
## 
## Returning voom object with 12 libraries and 1000 genes.
dim(dat.recombine)
## [1] 1000   12

WGCNA modules

This group of functions works around gene co-expression modules created by WGCNA. fit_modules, make_modules, and compare_modules allow a subset of parameters from WGCNA that we commonly utilize. If you require further customization, we recommend running WCGNA functions directly.

fit_modules: Perform soft thresholding

Used to determine the power to use for WGCNA module building.

mod.fit <- fit_modules(dat = example.voom)
## 
head(mod.fit$sft)
##   Power   SFT.R.sq     slope truncated.R.sq   mean.k. median.k.    max.k.
## 1     1 0.26313660  9.943599      0.8813112 502.00648 503.15088 531.43769
## 2     2 0.03041606 -1.797055      0.8636882 283.23033 280.65901 330.74298
## 3     3 0.02812478 -1.095957      0.8733717 173.48229 171.12225 226.52349
## 4     4 0.05936983 -1.048659      0.8945757 113.01295 111.05051 164.06308
## 5     5 0.11880939 -1.048333      0.9066056  77.24630  75.01133 123.94006
## 6     6 0.17295138 -1.048439      0.9210418  54.87881  52.62128  96.89712
##   signed.R.sq
## 1 -0.26313660
## 2  0.03041606
## 3  0.02812478
## 4  0.05936983
## 5  0.11880939
## 6  0.17295138
mod.fit$top.plot + mod.fit$connect.plot

Change the power levels tested. For example, decrease to improve computational time or increase if you have not reached stability fo the default 30. In all honestly, though, 1:30 is generally sufficient and weirdness can happen at higher powers.

mod.fit2 <- fit_modules(dat = example.voom,
                        powerVector = c(1:50))
head(mod.fit2$sft)
##   Power   SFT.R.sq     slope truncated.R.sq   mean.k. median.k.    max.k.
## 1     1 0.26313660  9.943599      0.8813112 502.00648 503.15088 531.43769
## 2     2 0.03041606 -1.797055      0.8636882 283.23033 280.65901 330.74298
## 3     3 0.02812478 -1.095957      0.8733717 173.48229 171.12225 226.52349
## 4     4 0.05936983 -1.048659      0.8945757 113.01295 111.05051 164.06308
## 5     5 0.11880939 -1.048333      0.9066056  77.24630  75.01133 123.94006
## 6     6 0.17295138 -1.048439      0.9210418  54.87881  52.62128  96.89712
##   signed.R.sq
## 1 -0.26313660
## 2  0.03041606
## 3  0.02812478
## 4  0.05936983
## 5  0.11880939
## 6  0.17295138
mod.fit2$top.plot + mod.fit2$connect.plot

Use an “unsigned” or “signed hybrid” network instead of the default signed.

mod.fit2 <- fit_modules(dat = example.voom,
                        networkType = "unsigned")
mod.fit2$top.plot + mod.fit2$connect.plot

Use a subset of genes to create modules.

mod.fit2 <- fit_modules(dat = example.voom,
                        genes = example.voom$genes$geneName[1:100])
mod.fit2$top.plot + mod.fit2$connect.plot

make_modules: Construct WGCNA modules and associated data

Build modules using power based on fit_modules such as power that reaches an R-squared of 0.9.

mods1 <- make_modules(fit = mod.fit,
                      Rsq_min = 0.9)

Or set it by-hand based on the thresholding plot local (which is what I’d do with these data) or absolute maximum.

mods1 <- make_modules(fit = mod.fit,
                      sft_value = 14)

Provide additional results including mods_mean (mean gene expression by module), mods_eigen (eigenvalue expression by module), or david (DAVID formatted genes in modules data frame).

mods1 <- make_modules(fit = mod.fit,
                      sft_value = 14,
                      mods_mean = TRUE,
                      mods_eigen = TRUE,
                      david = TRUE)

Other parameters

  • Tune your module build with the following. See WGCNA::blockwiseConsensusModules( ) for more details.
    • minModuleSize: Smallest possible module
    • maxBlockSize: Maximum number of genes to include in 1 dendrogram (many dendrograms go into making modules)
    • deepSplit: Build sensitivity, 0 least and 4 most sensitive
    • networkType, TOMType: Signed, unsigned, signed hybrid network creation

A quick tour of module results

names(mods1)
## [1] "genes"        "mods"         "sft"          "top.plot"     "connect.plot"
## [6] "mods.mean"    "mods.eigen"   "david"
  • mods1$genes: Vector of genes used in module build
  • mods1$mods: Module membership
  • mods1$sft: Results from fit_modules
  • mods1$top.plot and mods1$connect.plot: Plots from fit_modules
  • mods1$mods.mean: Mean gene expression by module
  • mods1$mods.eigen: Eigenvalue expression by module
  • mods1$david: DAVID formatted genes in modules

compare_modules: Compare WGCNA module builds

Used to help find the “best” module build. You can create many builds with varied R-squared, soft threshold, minimum module size, maximum block size, deep split, and/or network sign.

Recommended: Increase nThread to run on multiple processors and speed up results.

Compare 2 soft thresholds and 2 minimum sizes.

comp1 <- compare_modules(fit = mod.fit,
                         sft_value = c(10,14),
                         minModuleSize = c(20,50),
                         nThread = 6)
## Running 4 module builds.
names(comp1)
## [1] "summary" "modules"
#View parameters
comp1$summary
##   sft_value  SFT.R.sq minModuleSize maxBlockSize deepSplit networkType TOMType
## 1        10 0.5419406            20          500         3      signed  signed
## 2        14 0.8185863            20          500         3      signed  signed
## 3        10 0.5419406            50          500         3      signed  signed
## 4        14 0.8185863            50          500         3      signed  signed
##    param genes mod0 modMin modMax modN
## 1 param1  1000   88     21    114   17
## 2 param2  1000   81     20     99   18
## 3 param3  1000  212     77    186    6
## 4 param4  1000  173     50    171    7
#Summarize total genes in modules
plyr::ldply(comp1$modules) %>% 
  pivot_wider(names_from = `.id`, values_from = n)
## # A tibble: 19 Ă— 5
##    module param1 param2 param3 param4
##     <dbl>  <int>  <int>  <int>  <int>
##  1      0     88     81    212    173
##  2      1    114     99    186    171
##  3      2    104     91    174    171
##  4      3     77     81    140    140
##  5      4     71     75    107    105
##  6      5     64     75    104     99
##  7      6     61     68     77     91
##  8      7     61     68     NA     50
##  9      8     53     50     NA     NA
## 10      9     48     45     NA     NA
## 11     10     46     38     NA     NA
## 12     11     39     37     NA     NA
## 13     12     36     36     NA     NA
## 14     13     33     35     NA     NA
## 15     14     32     32     NA     NA
## 16     15     28     24     NA     NA
## 17     16     24     23     NA     NA
## 18     17     21     22     NA     NA
## 19     18     NA     20     NA     NA

Compare 2 R-squared and 2 deep splits. Or many other combinations…

comp2 <- compare_modules(fit = mod.fit,
                         Rsq_min = c(0.8,0.9),
                         deepSplit = c(1,4))
Running 4 module builds.

Other parameters

  • Tune your module builds with the following. In this version, each can be a vector of multiple values to test and compare results. See WGCNA::blockwiseConsensusModules( ) for more details.
    • minModuleSize: Smallest possible module
    • maxBlockSize: Maximum number of genes to include in 1 dendrogram (many dendrograms go into making modules)
    • deepSplit: Build sensitivity, 0 least and 4 most sensitive
    • networkType, TOMType: Signed, unsigned, signed hybrid network creation

calculate_module_coherence: Calculate WGCNA module coherence

Calculate expression correlation between genes within modules. Compare coherence for modules built within the same data set (example below) or from another data set.

coher <- calculate_module_coherence(
  mods = mods1$mods, mods_title = "Study1",
  dat = example.voom, dat_title = "Study1")

#correlation data
head(coher$subgene_correlation_df)
##   Set           gene1           gene2       Cor           P
## 1  01 ENSG00000000460 ENSG00000006712 0.7910182 0.002183823
## 2  01 ENSG00000000460 ENSG00000022556 0.4780634 0.115948989
## 3  01 ENSG00000000460 ENSG00000025156 0.4154863 0.179201540
## 4  01 ENSG00000000460 ENSG00000033627 0.7003896 0.011192596
## 5  01 ENSG00000000460 ENSG00000037749 0.5549353 0.061095997
## 6  01 ENSG00000000460 ENSG00000064703 0.6346082 0.026642565
#Correlations and significance
coher$coherence_boxplot_combined

Change the red cutoff lines in the plots.

coher <- calculate_module_coherence(
  mods = mods1$mods, mods_title = "Study1",
  dat = example.voom, dat_title = "Study21",
  r_cutoff = 0.5,
  p_cutoff = 0.05)

coher$coherence_boxplot_combined

Other parameters

  • Use mods_var and gene_var to change the columns used to define modules and genes
  • remove_mods allows you to ignore select modules. By default, it removes the 0 or grey uncorrelated module.
  • Don’t make the plots with return_plot = FALSE

R session

sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] patchwork_1.3.0  SEARchways_1.5.0 BIGpicture_1.2.0 kimma_1.6.3     
##  [5] RNAetc_1.2.0     edgeR_4.6.1      limma_3.64.0     lubridate_1.9.4 
##  [9] forcats_1.0.0    stringr_1.5.1    dplyr_1.1.4      purrr_1.0.4     
## [13] readr_2.1.5      tidyr_1.3.1      tibble_3.2.1     ggplot2_3.5.2   
## [17] tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] DBI_1.2.3               gridExtra_2.3           rlang_1.1.6            
##  [4] magrittr_2.0.3          matrixStats_1.5.0       compiler_4.5.0         
##  [7] RSQLite_2.3.9           png_0.1-8               vctrs_0.6.5            
## [10] pkgconfig_2.0.3         crayon_1.5.3            fastmap_1.2.0          
## [13] backports_1.5.0         XVector_0.48.0          labeling_0.4.3         
## [16] utf8_1.2.4              rmarkdown_2.29          tzdb_0.5.0             
## [19] preprocessCore_1.70.0   UCSC.utils_1.4.0        bit_4.6.0              
## [22] xfun_0.52               cachem_1.1.0            GenomeInfoDb_1.44.0    
## [25] jsonlite_2.0.0          blob_1.2.4              BiocParallel_1.42.0    
## [28] parallel_4.5.0          cluster_2.1.8.1         R6_2.6.1               
## [31] bslib_0.9.0             stringi_1.8.7           RColorBrewer_1.1-3     
## [34] rpart_4.1.24            jquerylib_0.1.4         Rcpp_1.0.14            
## [37] iterators_1.0.14        knitr_1.50              WGCNA_1.73             
## [40] base64enc_0.1-3         IRanges_2.42.0          nnet_7.3-20            
## [43] Matrix_1.7-3            splines_4.5.0           timechange_0.3.0       
## [46] tidyselect_1.2.1        rstudioapi_0.17.1       yaml_2.3.10            
## [49] doParallel_1.0.17       codetools_0.2-20        plyr_1.8.9             
## [52] lattice_0.22-7          Biobase_2.68.0          withr_3.0.2            
## [55] KEGGREST_1.48.0         evaluate_1.0.3          foreign_0.8-90         
## [58] survival_3.8-3          Biostrings_2.76.0       pillar_1.10.2          
## [61] checkmate_2.3.2         foreach_1.5.2           stats4_4.5.0           
## [64] generics_0.1.3          S4Vectors_0.46.0        hms_1.1.3              
## [67] scales_1.4.0            gtools_3.9.5            glue_1.8.0             
## [70] Hmisc_5.2-3             tools_4.5.0             data.table_1.17.0      
## [73] fgsea_1.34.0            locfit_1.5-9.12         fastcluster_1.2.6      
## [76] fastmatch_1.1-6         cowplot_1.1.3           grid_4.5.0             
## [79] impute_1.82.0           colorspace_2.1-1        AnnotationDbi_1.70.0   
## [82] GenomeInfoDbData_1.2.14 htmlTable_2.4.3         Formula_1.2-5          
## [85] cli_3.6.5               gtable_0.3.6            dynamicTreeCut_1.63-1  
## [88] sass_0.4.10             digest_0.6.37           BiocGenerics_0.54.0    
## [91] htmlwidgets_1.6.4       farver_2.1.2            memoise_2.0.1          
## [94] htmltools_0.5.8.1       lifecycle_1.0.4         httr_1.4.7             
## [97] GO.db_3.21.0            statmod_1.5.0           bit64_4.6.0-1