Overview

This document covers our recommended analysis pipeline to determine differentially expressed co-expression modules. This pipeline includes linear regression using kimma and module building using WGCNA. The example data are human, bulk, paired-end RNA-seq, but this pipeline can be applied to other organisms or single-read libraries.

0. Setup

Software

This pipeline should be completed in R and RStudio. You should also install the following packages.

#CRAN packages
install.packages(c("tidyverse", "WGCNA"))

#Bioconductor packages
install.packages("BiocManager")
BiocManager::install(c("edgeR", "limma", "patchwork"))

#GitHub packages
install.packages("devtools")
devtools::install_github("BIGslu/kimma")
devtools::install_github("BIGslu/BIGpicture")
devtools::install_github("BIGslu/RNAetc")
devtools::install_github("BIGslu/SEARchways")

And load them into your current R session.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(kimma)
library(BIGpicture)
library(RNAetc)

## 
## Attaching package: 'RNAetc'
## 
## The following objects are masked from 'package:kimma':
## 
##     example.dat, example.voom

library(SEARchways)
library(limma)
library(patchwork)
library(WGCNA)

## Loading required package: dynamicTreeCut
## Loading required package: fastcluster
## 
## Attaching package: 'fastcluster'
## 
## The following object is masked from 'package:stats':
## 
##     hclust
## 
## 
## 
## Attaching package: 'WGCNA'
## 
## The following object is masked from 'package:stats':
## 
##     cor

set.seed(651)

Example data

Example data were obtained from virus-stimulated human plasmacytoid dendritic cells. Actual patient identifiers and metadata have been altered for this tutorial.

Dill-McFarland KA, Schwartz JT, Zhao H, Shao B, Fulkerson PC, Altman MC, Gill MA. 2022. Eosinophil-mediated suppression and Anti-IL-5 enhancement of plasmacytoid dendritic cell interferon responses in asthma. J Allergy Clin Immunol. Epub ahead of print. doi: 10.1016/j.jaci.2022.03.025. — GitHub

Specifically, this tutorial uses RNAseq data processed using our SEAsnake and counts to voom pipelines, resulting in voom-normalized, log2 counts per million (CPM) expression and associated sample metadata in a limma EList object in the data_clean/ directory. These data are available in the kimma package within the example.voom data object.

1. Load data

All counts, gene, and sample metadata are contained in a single object in the kimma package. Note that these data contain only 1000 genes from the full data set create in our counts to voom pipeline. This is to reduce computational time.

#Extract and rename data object
dat <- kimma::example.voom

2. Differential gene expression

We will take a supervised approach to co-expression module creation. Thus, we first determine genes that are associated with our variable(s) of interest. Then, these genes are grouped into modules using weighted gene correlation network analysis (WGCNA). You can also take an unsupervised approach using all genes in the data set. Please see the WGCNA tutorial for an unsupervised example.

We take a supervised approach as it reduces computational time and results in more interpretable modules.

First, we determine genes associated with viral infection and/or asthma in this data set. You can see how we determined this as our best fit model in our voom to DEG tutorial.

virus_lme <- kmFit(dat = dat, 
                   model = "~ virus*asthma + (1|ptID)", 
                   run_lme = TRUE, 
                   use_weights = TRUE)

## lme/lmerel model: expression~virus*asthma+(1|ptID)

## Input: 12 libraries from 6 unique patients

## Model: 12 libraries

## Complete: 1000 genes

## Failed: 0 genes

We extract genes significant for the interaction term as well as those significant for virus or asthma alone.

genes.interact <- virus_lme$lme %>% 
  filter(variable == "virus:asthma" & FDR < 0.2) %>% 
  pull(gene)

genes.virus <- virus_lme$lme %>% 
  filter(variable == "virus" & FDR < 0.2) %>% 
  pull(gene)

genes.asthma <- virus_lme$lme %>% 
  filter(variable == "asthma" & FDR < 0.2) %>% 
  pull(gene)


genes.all <- unique(c(genes.interact, genes.virus, genes.asthma))

This results in 485 genes for use in module building. Note that this example data set contains only 1000 genes out of the roughly 14,000 protein coding genes usually seen in an RNAseq data set. Thus, the genes used in module analysis here are many fewer than would be seen in a full data set.

We filter the dat object to contain only these genes.

dat.subset <- dat[genes.all, ]

3. Module building

The goal of weighted gene correlation network analysis (WGCNA) is to reduce data complexity by grouping genes into co-expression modules. This can take 1000s of genes down to only 10s of modules for analysis, thus dramatically reducing multiple comparison penalties (e.g. FDR). WGCNA modules contain correlated genes with similar expression patterns across samples. However, genes in a module do not necessary have the same average expression and a module can, thus, contain both lowly and highly expressed genes.

3.1 Determine soft threshold

There are a number of parameters that impact module builds. The first we will explore is soft thresholding power, which determines how correlated genes must be in order to be in the same module. We check values from 1 to 30 in a signed network because positive/negative expression values have meaning.

# You may wish to run on multiple processors
# WGCNA::allowWGCNAThreads(nThreads = 4)

fit <- pickSoftThreshold(data = t(dat.subset$E),
                         powerVector = c(1:30),
                         networkType = "signed")

##    Power   SFT.R.sq     slope truncated.R.sq    mean.k.   median.k.    max.k.
## 1      1 0.26313660  9.943599      0.8813112 502.006481 503.1508775 531.43769
## 2      2 0.03041606 -1.797055      0.8636882 283.230328 280.6590060 330.74298
## 3      3 0.02812478 -1.095957      0.8733717 173.482287 171.1222482 226.52349
## 4      4 0.05936983 -1.048659      0.8945757 113.012955 111.0505139 164.06308
## 5      5 0.11880939 -1.048333      0.9066056  77.246302  75.0113257 123.94006
## 6      6 0.17295138 -1.048439      0.9210418  54.878815  52.6212801  96.89712
## 7      7 0.25743039 -1.072228      0.9179633  40.246468  38.1184746  77.60957
## 8      8 0.36256555 -1.115785      0.9260355  30.311291  28.2546075  63.38037
## 9      9 0.47816903 -1.141153      0.9427763  23.350851  21.4726710  52.59127
## 10    10 0.54194055 -1.114224      0.9113726  18.342379  16.4120450  44.25168
## 11    11 0.62117033 -1.181646      0.9257618  14.654301  12.8113442  38.13064
## 12    12 0.72867667 -1.292638      0.9679070  11.883257  10.1154016  33.55203
## 13    13 0.77459583 -1.348804      0.9650979   9.763989   8.0714519  29.79725
## 14    14 0.81858626 -1.397540      0.9821075   8.117515   6.4733097  26.67198
## 15    15 0.78215560 -1.484216      0.9263327   6.820289   5.2857455  24.03698
## 16    16 0.79957493 -1.544640      0.9232817   5.785278   4.3256524  21.79033
## 17    17 0.88055940 -1.490901      0.9795207   4.950042   3.6054073  19.85593
## 18    18 0.89425851 -1.489373      0.9631134   4.269043   3.0279837  18.17602
## 19    19 0.91864982 -1.495883      0.9768264   3.708566   2.5429408  16.70603
## 20    20 0.93217689 -1.491355      0.9844292   3.243310   2.1659235  15.41106
## 21    21 0.92049496 -1.549493      0.9683322   2.854045   1.8600610  14.58156
## 22    22 0.91613250 -1.596012      0.9574356   2.525993   1.5838550  13.98098
## 23    23 0.94146629 -1.589420      0.9633023   2.247671   1.3623646  13.45295
## 24    24 0.91362068 -1.635744      0.9212029   2.010072   1.1693855  12.98548
## 25    25 0.93581070 -1.651049      0.9444718   1.806065   1.0251724  12.56893
## 26    26 0.91744934 -1.708221      0.9262327   1.629958   0.9068533  12.19550
## 27    27 0.90604648 -1.736650      0.9038590   1.477171   0.7978643  11.85880
## 28    28 0.89618355 -1.745676      0.8812566   1.343993   0.6930648  11.55362
## 29    29 0.85142506 -1.815295      0.8252909   1.227395   0.6114491  11.27562
## 30    30 0.85695984 -1.829880      0.8343792   1.124888   0.5439173  11.02120

You can roughly see how gene correlation associates with power in the automatically generated table but we also create plots for easier interpretation. First looking at the signed R², we see a general increase in gene correlation with power until correlation is maximized around power 17. In supervised module building, there are also often 1 or 2 local maxima such as powers 2 and 14 here.

as.data.frame(fit$fitIndices) %>%
  #Calculate scale free topology
  dplyr::mutate(signed.R.sq = -sign(slope)*SFT.R.sq) %>% 
  ggplot(aes(x = Power, y = signed.R.sq, label = Power)) +
  geom_text(size = 3) +
  theme_classic() +
  labs(y = "Scale free topology model fit,signed R^2",
       x = "Soft threshold (power)") +
  scale_y_continuous(breaks = round(seq(0, 1, by = 0.1), digits = 1))

Next, we look at connectivity which is a measure of how many genes in a module meet the correlation cutoff. This roughly correlates with how many genes are in each module with lower connectivity resulting in smaller modules. Here, we see an expected decrease in connectivity with increase in power.

as.data.frame(fit$fitIndices) %>%
  ggplot(aes(x = Power, y = mean.k., label = Power)) +
  geom_text(size = 3) +
  theme_classic() +
  labs(y = "Mean connectivity",
       x = "Soft threshold (power)")

Taken together, we choose a soft thresholding power of 14 because it is a local maximum and achieves correlation > 0.8. We will also compare to power 23 as the true maximum correlation to show how this effects modules in this tutorial.

as.data.frame(fit$fitIndices) %>% 
  filter(Power %in% c(14, 23))

##   Power  SFT.R.sq    slope truncated.R.sq  mean.k. median.k.   max.k.
## 1    14 0.8185863 -1.39754      0.9821075 8.117515  6.473310 26.67198
## 2    23 0.9414663 -1.58942      0.9633023 2.247671  1.362365 13.45295

Why is my R² so low?

If you’ve selected few genes and/or your data are not strongly correlated, you many not reach an R² of 0.8 or your local maximum may be much lower. This is because there are not many genes that meet the correlation cutoff. In this case, we either choose the local maximum closest to 0.8 or go back and select more genes for module building.

3.2 Build modules

We build modules with our two chosen soft thresholding powers and WGCNA defaults, except we continue to use a signed network and ask it for numeric labels.

mod.p14 <- blockwiseModules(datExpr = t(dat.subset$E),
                             power = 14,
                             networkType="signed",
                             numericLabels = TRUE,
                             nthreads = 4)

mod.p23 <- blockwiseModules(datExpr = t(dat.subset$E),
                             power = 23,
                             networkType="signed",
                             numericLabels = TRUE,
                             nthreads = 4)

The first thing to note is module 0. This is the trash module; genes in this module are not correlated with one another. In an unsupervised analysis with all genes, module 0 can’t be analyzed as it is a random assortment of genes. In our supervised analysis, however, we already know that these genes are associated with virus and/or asthma, so we may be able to glean some insight from them (more on this in Section 4).

Comparing the two module builds, we see the power 14 has fewer genes in module 0 than power 23. This makes sense as power 23 demands that genes be more highly correlated in order to be in a module together. Since both builds have the same number of modules, the additional genes in module 0, power 23 come from those that “fell out” of all the other modules. In a power comparison, you may also see fewer modules at higher powers when enough genes “fall out” of the smaller modules to no longer meet the minimum module size, thus that entire module is redistributed across the other modules, mostly to module 0.

as.data.frame(mod.p14$colors) %>% 
  rownames_to_column("gene") %>% 
  count(`mod.p14$colors`)

##   mod.p14$colors  n
## 1              0  9
## 2              1 76
## 3              2 67
## 4              3 66
## 5              4 64
## 6              5 60
## 7              6 56
## 8              7 52
## 9              8 35

as.data.frame(mod.p23$colors) %>% 
  rownames_to_column("gene") %>% 
  count(`mod.p23$colors`)

##   mod.p23$colors  n
## 1              0 79
## 2              1 68
## 3              2 58
## 4              3 53
## 5              4 51
## 6              5 50
## 7              6 48
## 8              7 44
## 9              8 34

3.3 Other module parameters

There are a lot of module building parameters as you can see in the ?blockwiseModules help page. We like to explore minModuleSize, deepSplit, and maxBlockSize as these generally achieve useful modules. Here, we will use the power 14 build to assess these additional parameters.

Minimum module size

As the name suggests, minimum module size is the minimum number of correlated genes needed to make a module. If you have a group of correlated genes that do not make the minimum, those genes get redistributed, thus changing the overall network.

The default minModSize is the minimum of 20 or 1/2 the number of genes. In these data, that is 242.5 so our default run had minModSize = 20 and we saw that our smallest module 8 was 35 genes.

If we change the minimum to 40 genes, we lose 3 modules, not just the former module 8 that was below this cutoff. This is because the removal of one module results in the redistribution of these genes, thus changing the overall network and shifting more than just the one small module.

mod.p14.m40 <- blockwiseModules(datExpr = t(dat.subset$E),
                                power = 14,
                                networkType="signed",
                                numericLabels = TRUE,
                                nthreads = 4,
                                minModuleSize = 40)

as.data.frame(mod.p14.m40$colors) %>% 
  rownames_to_column("gene") %>% 
  count(`mod.p14.m40$colors`)

##   mod.p14.m40$colors   n
## 1                  0  14
## 2                  1 127
## 3                  2 120
## 4                  3  96
## 5                  4  76
## 6                  5  52

Deep network split

The deep network split determines where the network can be split to form modules. It is a controlled parameter from 0 to 4 with being 0 least and 4 most sensitive. Thus, lower values result in less splitting and fewer modules. The default deepSplit is 2 so comparing to 0, we see fewer modules in our data.

mod.p14.s0 <- blockwiseModules(datExpr = t(dat.subset$E),
                                power = 14,
                                networkType="signed",
                                numericLabels = TRUE,
                                nthreads = 4,
                                deepSplit = 0) 

as.data.frame(mod.p14.s0$colors) %>% 
  rownames_to_column("gene") %>% 
  count(`mod.p14.s0$colors`)

##   mod.p14.s0$colors   n
## 1                 0   9
## 2                 1 127
## 3                 2  76
## 4                 3  67
## 5                 4  64
## 6                 5  55
## 7                 6  52
## 8                 7  35

Maximum block size

The maximum block size determines the largest group of genes analyzed. If you have more genes than the maxBlockSize, the data are pre-clusterd into blocks less than this size and modules are built from within these clusters. In general, lower maxBlockSize results in more modules and more genes in module 0 as a full network connecting all genes is not constructed and you may miss some connections.

Because we only have 485 genes in this analysis, the default maxBlockSize of 5000 includes all genes. In most supervised module builds, this is appropriate and we rarely change this parameter. However, if you have highly variable gene expression, reducing the maxBlockSize can result in smaller, more interpretable modules.

For example, if we reduce the maxBlockSize to 200, we get more genes in module 0 and generally smaller modules. This is not recommended for these data and just shown as an example here.

mod.p14.b200 <- blockwiseModules(datExpr = t(dat.subset$E),
                                power = 14,
                                networkType="signed",
                                numericLabels = TRUE,
                                nthreads = 4,
                                maxBlockSize = 200)

as.data.frame(mod.p14.b200$colors) %>% 
  rownames_to_column("gene") %>% 
  count(`mod.p14.b200$colors`)

##   mod.p14.b200$colors  n
## 1                   0 25
## 2                   1 84
## 3                   2 78
## 4                   3 67
## 5                   4 67
## 6                   5 57
## 7                   6 41
## 8                   7 35
## 9                   8 31

An Art, not a science

There are no absolute cutoffs in module building. This includes supervised gene selection and module build parameters.

Because we want to capture as much signal as possible, we use a relatively high FDR cutoff for gene selection in module building. You may go even higher than FDR < 0.2 in your analysis. The exact cutoff is arbitrary and we generally aim for 1000 to 5000 genes for module building. This is, of course, not possible in the reduced example data set here. This range is rather large given the roughly 14,000 genes in the full data set, and it will likely take some trial and error in your data. In general, if you find that you have a lot of modules not associated with your variable(s) of interest, you may need to reduce the FDR, and thus the number of genes, selected for module building. Or you may leave those genes and simply ignore modules that are not significant.

Similarly, it is helpful to compare several module builds across different parameters as summarized below. You want to minimize genes in module 0 (because these are the uncorrelated genes) without creating too many or too small modules for analysis. This process has many “right” paths. Hence, the idea that it is an art, not a science.

##   power minModSize deepSplit maxBlockSize totalModules sizeModule0
## 1    14         20         2         5000            8           9
## 2    23         20         2         5000            8          79
## 3    14         40         2         5000            5          14
## 4    14         20         0         5000            7           9
## 5    14         20         2          200            8          25

4. Differential module expression

Once you’ve finalized your modules, you can assess their expression similar to individual genes. Here, we use mean expression of genes in modules. You may use the module eigenvalues as well (contained in mod.p14$MEs). We generally see similar results with the two methods and chose mean module expression because the values are more directly interpretable in relation to actual expression.

4.1 Mean module expression

First, we calculate the mean expression of genes in each module.

# List genes in modules and format module names
mod.genes <- as.data.frame(mod.p14$colors) %>%
  tibble::rownames_to_column("geneName") %>%
  dplyr::rename(module = "mod.p14$colors") %>%
  #add leading 0 to module names for correct sorting of factor
  dplyr::mutate(module = ifelse(module <= 9,
                                paste("0", module, sep=""),
                                module)) %>%
  #Add color var
  dplyr::mutate(mod.color = WGCNA::labels2colors(mod.p14$colors))

# Mean gene expression
mod.E <- as.data.frame(dat.subset$E) %>% 
  rownames_to_column("geneName") %>% 
  inner_join(mod.genes, by = "geneName") %>% 
  #Mean
  group_by(module) %>% 
  summarise(across(where(is.numeric), ~mean(., na.rm=TRUE)),
            .groups = "drop")

mod.E

## # A tibble: 9 × 13
##   module  lib1  lib2  lib3  lib4  lib5  lib6  lib7  lib8  lib9 lib10 lib11 lib12
##   <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 00      6.51  7.13  7.35  6.99  7.40  7.31  6.82  6.32  6.55  6.40  7.47  7.22
## 2 01      9.30  8.49  9.26  8.19  9.49  8.89  9.78  8.85  9.96  9.43  9.87  9.27
## 3 02      8.36  7.19  8.26  6.60  8.02  7.00  8.30  6.91  8.22  7.00  7.75  5.78
## 4 03      8.39  9.26  7.80  9.28  8.47  9.24  7.56  9.43  7.39  8.73  7.64  8.85
## 5 04      8.57  8.26  7.98  8.30  7.98  8.23  6.88  7.36  6.59  6.75  6.59  6.90
## 6 05      9.32  8.52  9.20  8.32  8.41  8.04  8.57  7.89  8.51  7.21  8.84  7.81
## 7 06      8.32  9.51  8.50  9.85  9.34  9.97  8.78  9.80  9.09  9.88  8.62  9.68
## 8 07      9.25  9.80  9.79  9.98 10.3  10.4  10.6  10.5  10.4  10.6  10.5  10.5 
## 9 08      8.32  8.60  7.37  7.98  6.84  7.19  6.81  7.35  6.92  7.25  6.73  7.63

4.2 Linear modeling

We run a linear mixed effects model just like in Section 2, only now we give the mean gene expression for counts. We can use the original sample metadata as our library names are the same in the module data. Note that the gene-level weights cannot be applied to module data and thus, use.weights = FALSE.

virus_mod <- kmFit(counts = mod.E,
                   meta = dat.subset$targets,
                   model = "~ virus*asthma + (1|ptID)", 
                   run_lme = TRUE, 
                   use_weights = FALSE)

## lme/lmerel model: expression~virus*asthma+(1|ptID)

## Input: 12 libraries from 6 unique patients

## Model: 12 libraries

## Complete: 9 genes

## Failed: 0 genes

We see that all modules are significant for virus and/or asthma, though none are signifcant for the interaction.

summarise_kmFit(virus_mod$lme)

## # A tibble: 5 × 7
##   variable          `fdr<0.05` `fdr<0.1` `fdr<0.2` `fdr<0.3` `fdr<0.4` `fdr<0.5`
##   <fct>                  <int>     <int>     <int>     <int>     <int>     <int>
## 1 (1 | ptID)                 0         1         2         7         7         8
## 2 asthma                     4         5         5         6         6         8
## 3 virus                      7         8         8         8         8         9
## 4 virus:asthma               0         0         0         0         0         8
## 5 total (nonredund…          8         8         8         9         9         9

Thus, in a true analysis, we would remove the interaction in our final model.

virus_mod2 <- kmFit(counts = mod.E,
                   meta = dat.subset$targets,
                   model = "~ virus + asthma + (1|ptID)", 
                   run_lme = TRUE, 
                   use_weights = FALSE)

## lme/lmerel model: expression~virus+asthma+(1|ptID)

## Input: 12 libraries from 6 unique patients

## Model: 12 libraries

## Complete: 9 genes

## Failed: 0 genes

summarise_kmFit(virus_mod2$lme)

## # A tibble: 4 × 7
##   variable          `fdr<0.05` `fdr<0.1` `fdr<0.2` `fdr<0.3` `fdr<0.4` `fdr<0.5`
##   <fct>                  <int>     <int>     <int>     <int>     <int>     <int>
## 1 (1 | ptID)                 1         1         2         7         7         8
## 2 asthma                     4         5         5         6         6         8
## 3 virus                      7         8         8         8         8         9
## 4 total (nonredund…          8         8         8         9         9         9

5. `RNAetc` custom module functions

Our package RNAetc contains functions to make module building easier. You can filter to genes of interest and test soft thresholds with fit_modules.

fit <- fit_modules(dat = dat, 
                   genes = genes.all,
                   powerVector=c(1:30), 
                   networkType = "signed",
                   nThread = 4)

## Allowing multi-threading with up to 4 threads.
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
## 1      1  0.31300  5.670          0.820  244.00   245.000 262.00
## 2      2  0.08810  2.210          0.921  144.00   144.000 169.00
## 3      3  0.01860  0.601          0.931   93.60    93.000 123.00
## 4      4  0.00160  0.130          0.976   65.00    63.900  95.80
## 5      5  0.00144  0.094          0.973   47.30    46.500  77.10
## 6      6  0.00756 -0.175          0.964   35.50    34.500  63.40
## 7      7  0.04530 -0.370          0.958   27.40    26.500  53.00
## 8      8  0.16500 -0.619          0.953   21.60    20.900  44.90
## 9      9  0.23200 -0.689          0.968   17.30    16.600  38.40
## 10    10  0.29000 -0.727          0.976   14.10    13.200  33.10
## 11    11  0.33900 -0.769          0.948   11.60    10.700  28.80
## 12    12  0.40000 -0.806          0.926    9.66     8.780  25.20
## 13    13  0.47000 -0.874          0.905    8.13     7.290  22.20
## 14    14  0.53300 -0.909          0.919    6.90     5.980  19.70
## 15    15  0.59000 -0.963          0.920    5.90     4.960  17.50
## 16    16  0.64200 -1.010          0.894    5.09     4.230  15.60
## 17    17  0.68200 -1.040          0.888    4.41     3.590  14.00
## 18    18  0.74200 -1.100          0.890    3.85     3.080  12.80
## 19    19  0.76900 -1.130          0.895    3.38     2.670  11.80
## 20    20  0.80000 -1.160          0.918    2.98     2.300  10.80
## 21    21  0.81100 -1.210          0.940    2.64     1.960  10.00
## 22    22  0.84500 -1.220          0.974    2.35     1.700   9.31
## 23    23  0.85500 -1.230          0.949    2.10     1.490   8.67
## 24    24  0.86100 -1.260          0.950    1.89     1.320   8.10
## 25    25  0.88300 -1.270          0.965    1.70     1.160   7.59
## 26    26  0.88300 -1.300          0.954    1.54     1.010   7.13
## 27    27  0.86900 -1.350          0.920    1.39     0.890   6.72
## 28    28  0.86100 -1.360          0.902    1.27     0.784   6.34
## 29    29  0.88000 -1.380          0.927    1.16     0.701   5.99
## 30    30  0.89700 -1.370          0.935    1.06     0.622   5.68

This outputs:

dat: Filtered EList containing only genes of interest
genes: Vector of genes of interest
sft: Soft thresholding results from pickSoftThreshold
top.plot: Scaled topology plot (shown below)
connect.plot: Connectivity plot (shown below)

names(fit)

## [1] "dat"          "genes"        "sft"          "top.plot"     "connect.plot"

fit$top.plot + fit$connect.plot

Then build modules with make_modules where you can give a specific soft thresholding value (sft.value) or ask it to determine that value based on the minimum acceptable R² (Rsq.min). You can also set our commonly tested parameters minModuleSize, deepSplit, and maxBlockSize as well as determine is you want mean module expression (mods.mean), eigenvalues (mods.eigen), and/or DAVID formatted data frame of genes in modules (david).

mod.p14 <- make_modules(fit,
                        sft.value = 14, # Or Rsq.min = 0.8
                        minModuleSize = 20, 
                        maxBlockSize = 500, 
                        deepSplit = 2,
                        networkType = "signed", 
                        mods.mean = TRUE, mods.eigen = TRUE, david = TRUE,
                        nThread = 4)

This results in:

genes, sft, top.plot, and connect.plot from fit plus
mods: data frame of genes in modules with cleaned names
mods.mean: Mean gene expression
mods.eigen: Eigenvalues of module expression
david: DAVID-formatted data frame of genes in modules

names(mod.p14)

## [1] "genes"        "mods"         "sft"          "top.plot"     "connect.plot"
## [6] "mods.mean"    "mods.eigen"   "david"

mod.p14$mods[1:3,]

##          geneName module module.char mod.color hgnc_symbol Previous symbols
## 1 ENSG00000002587      5          05     green      HS3ST1             <NA>
## 2 ENSG00000005884      4          04    yellow       ITGA3            MSK18
## 3 ENSG00000006453      1          01 turquoise    BAIAP2L1             <NA>
##                 Alias symbols        gene_biotype
## 1                       3OST1 protein-coding gene
## 2 CD49c, VLA3a, VCA-2, GAP-B3 protein-coding gene
## 3                       IRTKS protein-coding gene

mod.p14$mods.mean[1:3,1:3]

##           module.char     lib1     lib2
## module_00          00 6.514825 7.125083
## module_01          01 9.299977 8.485937
## module_02          02 8.356833 7.191985

mod.p14$david[1:3, 1:3]

##   rowname       module_00       module_01
## 1       1 ENSG00000133302 ENSG00000006453
## 2       2 ENSG00000135446 ENSG00000013810
## 3       3 ENSG00000158560 ENSG00000063761

6. Annotating module function

See our gene set analyses tutorial for methods to assign function to modules.

R session

sessionInfo()

## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] WGCNA_1.73            fastcluster_1.2.6     dynamicTreeCut_1.63-1
##  [4] patchwork_1.3.0       limma_3.64.0          SEARchways_1.4.0     
##  [7] RNAetc_1.2.0          BIGpicture_1.1.0      kimma_1.6.3          
## [10] lubridate_1.9.4       forcats_1.0.0         stringr_1.5.1        
## [13] dplyr_1.1.4           purrr_1.0.4           readr_2.1.5          
## [16] tidyr_1.3.1           tibble_3.2.1          ggplot2_3.5.2        
## [19] tidyverse_2.0.0      
## 
## loaded via a namespace (and not attached):
##  [1] DBI_1.2.3               gridExtra_2.3           rlang_1.1.6            
##  [4] magrittr_2.0.3          matrixStats_1.5.0       compiler_4.5.0         
##  [7] RSQLite_2.3.9           png_0.1-8               vctrs_0.6.5            
## [10] pkgconfig_2.0.3         crayon_1.5.3            fastmap_1.2.0          
## [13] backports_1.5.0         XVector_0.48.0          labeling_0.4.3         
## [16] utf8_1.2.4              rmarkdown_2.29          tzdb_0.5.0             
## [19] preprocessCore_1.70.0   UCSC.utils_1.4.0        bit_4.6.0              
## [22] xfun_0.52               cachem_1.1.0            GenomeInfoDb_1.44.0    
## [25] jsonlite_2.0.0          blob_1.2.4              BiocParallel_1.42.0    
## [28] parallel_4.5.0          cluster_2.1.8.1         R6_2.6.1               
## [31] bslib_0.9.0             stringi_1.8.7           RColorBrewer_1.1-3     
## [34] rpart_4.1.24            jquerylib_0.1.4         Rcpp_1.0.14            
## [37] iterators_1.0.14        knitr_1.50              base64enc_0.1-3        
## [40] IRanges_2.42.0          Matrix_1.7-3            splines_4.5.0          
## [43] nnet_7.3-20             timechange_0.3.0        tidyselect_1.2.1       
## [46] rstudioapi_0.17.1       yaml_2.3.10             doParallel_1.0.17      
## [49] codetools_0.2-20        lattice_0.22-7          Biobase_2.68.0         
## [52] withr_3.0.2             KEGGREST_1.48.0         evaluate_1.0.3         
## [55] foreign_0.8-90          survival_3.8-3          Biostrings_2.76.0      
## [58] pillar_1.10.2           checkmate_2.3.2         foreach_1.5.2          
## [61] stats4_4.5.0            generics_0.1.3          S4Vectors_0.46.0       
## [64] hms_1.1.3               scales_1.4.0            glue_1.8.0             
## [67] Hmisc_5.2-3             tools_4.5.0             data.table_1.17.0      
## [70] fgsea_1.34.0            fastmatch_1.1-6         cowplot_1.1.3          
## [73] grid_4.5.0              impute_1.82.0           colorspace_2.1-1       
## [76] AnnotationDbi_1.70.0    GenomeInfoDbData_1.2.14 htmlTable_2.4.3        
## [79] Formula_1.2-5           cli_3.6.5               gtable_0.3.6           
## [82] sass_0.4.10             digest_0.6.37           BiocGenerics_0.54.0    
## [85] htmlwidgets_1.6.4       farver_2.1.2            memoise_2.0.1          
## [88] htmltools_0.5.8.1       lifecycle_1.0.4         httr_1.4.7             
## [91] GO.db_3.21.0            statmod_1.5.0           bit64_4.6.0-1

RNA-seq data analysis

Co-expression modules (WGCNA)

Kim Dill-McFarland, kadm@uw.edu

version April 30, 2025

Overview

0. Setup

Software

Example data

1. Load data

2. Differential gene expression

3. Module building

3.1 Determine soft threshold

Why is my R² so low?

3.2 Build modules

3.3 Other module parameters

Minimum module size

Deep network split

Maximum block size

An Art, not a science

4. Differential module expression

4.1 Mean module expression

4.2 Linear modeling

5. `RNAetc` custom module functions

6. Annotating module function

R session

Navigation

RNA-seq data analysis

Co-expression modules (WGCNA)

Kim Dill-McFarland, kadm@uw.edu

version April 30, 2025

Overview

0. Setup

Software

Example data

1. Load data

2. Differential gene expression

3. Module building

3.1 Determine soft threshold

Why is my R2 so low?

3.2 Build modules

3.3 Other module parameters

Minimum module size

Deep network split

Maximum block size

An Art, not a science

4. Differential module expression

4.1 Mean module expression

4.2 Linear modeling

5. RNAetc custom module functions

6. Annotating module function

R session

Navigation

Why is my R² so low?

5. `RNAetc` custom module functions