3. RNA-seq data cleaning

Overview

Here, we take a raw RNA-seq counts table through quality assessment, filtering, and normalization. For more information on how counts were generated, see our [SEAsnake pipeline][SEAsnake]. These data are human, bulk, paired-end RNA-seq, but this pipeline can be applied to other organisms or single-read libraries.

Load packages

Load the following packages. For more information on installation, see the setup instructions.

#Data manipulation and plotting
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.8
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

#RNAseq data
library(edgeR)

## Loading required package: limma

## Warning: package 'limma' was built under R version 4.1.3

library(limma)
library(RNAetc)

#Note we do not load biomaRt because it has conflicts with the tidyverse.
# We instead call its functions with biomaRt::

# Set random seed for reproduciblity
set.seed(651)

Read in data

Using the tidyverse read_ functions you saw earlier today, read in the counts and sample metadata files.

count <- read_csv("0_data/raw_counts.csv")

## Rows: 44786 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): hgnc_symbol
## dbl (20): pt01_Media, pt01_Mtb, pt02_Media, pt02_Mtb, pt03_Media, pt03_Mtb, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

meta <- read_csv("0_data/metadata.csv")

## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Quality-filter data

Filter poor-quality libraries

We assess library quality using metrics from [samtools flagstat][flagstat] and [picard][picard]. Here, we will only look at total sequences for brevity but you can find more quality measures of interest such as percent alignment and median CV coverage in our full [tutorial][count_to_voom].

Using ggplot, let’s plot the total sequences per library.

ggplot(meta, aes(x = libID, y = total_seq)) +
  geom_col()

The above plot is somewhat difficult to read so we modify it with some ggplot features you’ve seen before and some that are new.

ggplot(meta, 
       #Reorder x axis by y-axis value
       aes(x = reorder(libID, total_seq), 
           y = total_seq)) +
  geom_col() +
  #Add cutoff line
  geom_hline(yintercept = 1E6, color = "red") +
  #Set theme
  theme_classic() +
  #Relabel axes
  labs(x = "Library", y = "Total sequences") +
  #Rotate x-axis labels
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Overall we see that all libraries have more the 1 million sequences. Based on this and other quality metrics not shown here, we determine that all libraries pass quality check.

Filter non-protein-coding genes

Standard differential expression analyses focus on protein-coding genes as these RNA products are the most likely to result in a measurable phenotype. We annotate the counts table genes to their biotypes as well as additional identifiers with biomaRt.

First, load the gene data base from biomaRt. Note that here we use biomaRt::function( ) because we did not load this package with library( ) earlier. The :: operator tells R which package contains the function and prevents downstream issues in this pipeline due to dplyr::select vs biomaRt::select.

#Get database
ensembl <- biomaRt::useEnsembl(biomart="ensembl",
                               dataset="hsapiens_gene_ensembl",
                                mirror = "www")

## Ensembl site unresponsive, trying uswest mirror

#Get gene key
key <- biomaRt::getBM(attributes=c("hgnc_symbol", "gene_biotype"),
                      mart=ensembl)

Also, you could extract more than just the biotypes from biomaRt. See all the possible data with listAttributes.

attrib <- biomaRt::listAttributes(ensembl)
head(attrib)

##                            name                  description         page
## 1               ensembl_gene_id               Gene stable ID feature_page
## 2       ensembl_gene_id_version       Gene stable ID version feature_page
## 3         ensembl_transcript_id         Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5            ensembl_peptide_id            Protein stable ID feature_page
## 6    ensembl_peptide_id_version    Protein stable ID version feature_page

Next, filter the key to protein-coding genes present in the RNA-seq count data.

key.filter <- key %>% 
  #Filter genes in count table
  filter(hgnc_symbol %in% count$hgnc_symbol) %>% 
  #Filter protein-coding
  filter(gene_biotype == "protein_coding")

And filter the count table to genes in the protein-coding key.

count.pc <- count %>% 
  filter(hgnc_symbol %in% key.filter$hgnc_symbol)

This removes 26518 genes from this data set and leaves 18268 genes for analysis.

Filter PCA outliers

Sometimes one or more libraries will appear as outliers in PCA. We define this as any library more than 3 standard deviations away from the mean PC1 and/or PC2.

First, we must reformat the counts data to a log counts per million (cpm) matrix in order for PCA to work. We also transpose it so that genes are columns and libraries are rows. Without transposing, we’d get 1 dot per gene instead of 1 per library.

count.for.pca <- count.pc %>% 
  #move HGNC gene symbol to rownames
  column_to_rownames("hgnc_symbol") %>% 
  #convert to log cpm
  cpm(log=TRUE) %>% 
  #transpose
  t()

count.for.pca[1:3,1:3]

##                A1BG      A1CF      A2M
## pt01_Media 1.835063 -1.865271 7.389832
## pt01_Mtb   1.535061 -1.865271 6.397031
## pt02_Media 1.369070 -1.865271 6.106875

Then, we calculate PCA on these transformed data.

PCA <- prcomp(count.for.pca, scale.=TRUE, center=TRUE)

A basic ggplot of these PCA data looks like so, but it does not tell us anything about outliers.

#Determine percent explained for each PC
summary(PCA)$importance

##                             PC1      PC2      PC3      PC4      PC5      PC6
## Standard deviation     67.41996 46.09409 38.32512 36.86756 34.23585 31.11334
## Proportion of Variance  0.24882  0.11631  0.08040  0.07440  0.06416  0.05299
## Cumulative Proportion   0.24882  0.36513  0.44553  0.51993  0.58409  0.63709
##                             PC7      PC8      PC9     PC10     PC11     PC12
## Standard deviation     28.79221 28.08592 26.04112 24.22482 22.73842 22.68737
## Proportion of Variance  0.04538  0.04318  0.03712  0.03212  0.02830  0.02818
## Cumulative Proportion   0.68247  0.72565  0.76277  0.79489  0.82319  0.85137
##                            PC13     PC14     PC15     PC16     PC17     PC18
## Standard deviation     21.77168 20.72515 20.25228 19.57118 19.30086 18.76942
## Proportion of Variance  0.02595  0.02351  0.02245  0.02097  0.02039  0.01928
## Cumulative Proportion   0.87732  0.90083  0.92328  0.94425  0.96464  0.98393
##                            PC19     PC20
## Standard deviation     17.12298 0.664296
## Proportion of Variance  0.01605 0.000020
## Cumulative Proportion   0.99998 1.000000

#create labels with percent explained for the PC
PC1.label <- paste("PC1", " (", summary(PCA)$importance[2,1]*100, "%)", sep="")
PC2.label <- paste("PC2", " (", summary(PCA)$importance[2,2]*100, "%)", sep="")

ggplot(as.data.frame(PCA$x),
       aes(x=PC1, y=PC2)) +
  geom_point(size=3) +
  labs(x=PC1.label, y=PC2.label)

We need to calculate the means and standard deviations of our PCs and create a new variable to color the points in PCA. First, let’s get the summary metrics from the PCA data contained in PCA$x.

PCA.summ <- as.data.frame(PCA$x) %>% 
  #move rownames into column of data frame
  rownames_to_column("libID") %>% 
  #Pivot to get all PCs in 1 column
  pivot_longer(-libID) %>% 
  #calculate mean and sd
  group_by(name) %>% 
  summarize(meanPC = mean(value),
            sdPC = sd(value))
PCA.summ

## # A tibble: 20 × 3
##    name    meanPC    sdPC
##    <chr>    <dbl>   <dbl>
##  1 PC1   -0.250   67.4   
##  2 PC10   0.0951  24.2   
##  3 PC11  -0.118   22.7   
##  4 PC12   0.00668 22.7   
##  5 PC13   0.0262  21.8   
##  6 PC14   0.0299  20.7   
##  7 PC15  -0.0288  20.3   
##  8 PC16  -0.100   19.6   
##  9 PC17   0.0166  19.3   
## 10 PC18  -0.0150  18.8   
## 11 PC19   0.158   17.1   
## 12 PC2    0.362   46.1   
## 13 PC20  -0.647    0.0137
## 14 PC3    0.209   38.3   
## 15 PC4   -0.123   36.9   
## 16 PC5   -0.0577  34.2   
## 17 PC6    0.233   31.1   
## 18 PC7   -0.169   28.8   
## 19 PC8    0.141   28.1   
## 20 PC9    0.0745  26.0

#Extract a row by the PC name
PC1.summ <- PCA.summ[PCA.summ$name == "PC1",]
PC2.summ <- PCA.summ[PCA.summ$name == "PC1",]

#create variable for outliers outside 3 sd from mean on either PC1 of PC2
PCA2 <- as.data.frame(PCA$x) %>% 
  mutate(outlier = ifelse(
    #Higher than +3 sd
    PC1 > PC1.summ$meanPC + 3*PC1.summ$sdPC | 
    PC2 > PC2.summ$meanPC + 3*PC2.summ$sdPC | 
    #Lower than -3 sd
    PC1 < PC1.summ$meanPC - 3*PC1.summ$sdPC |
    PC2 < PC2.summ$meanPC - 3*PC2.summ$sdPC, 
    #If any of the above are TRUE, make the variable equal to this
    "outlier",
    #Otherwise (else), use this
    "okay"))

Then our PCA plot reveals that there are no outliers. Note we added some additional ggplot customization to beautify this plot.

ggplot(PCA2, aes(x=PC1, y=PC2)) +
  geom_point(aes(color=outlier), size=3) +
  labs(x=PC1.label, y=PC2.label) +
  #Set theme
  theme_classic() +
  #set xy ratio
  coord_fixed(ratio=1) +
  #change colors
  scale_color_manual(values = c("okay"="grey70",
                                "outlier"="red"))

We recommend that you initially remove dramatic outliers but leave those that are borderline or questionable. Then, you can re-assess outliers after gene filtering and normalization. You may find that some are no longer outliers after these steps. If they are, you can return to this step and remove them before repeating subsequent steps.

Additional steps

This workshop covers the minimum quality control and filtering needed for RNA-seq libraries. Checkout our full tutorial for additional QC metrics, batch effects, custom plotting functions, and more!

Create DGEList

At this stage, we’ve completed sample filtering and can collapse our count and meta data into a single list object. This allows us to shorten our long object names as well as works efficiently with the remaining cleaning steps.

We merge into a DGEList object, which is edgeR format. We must convert the counts data frame to a matrix for this process.

count.pc.mat <- count.pc %>% 
  column_to_rownames("hgnc_symbol") %>% 
  as.matrix()

dat <- DGEList(counts=count.pc.mat, samples=meta, genes=key.filter)

This object is a list of data frames. It is anS3 object in R, because it has 3 dimensions: list[dataframe[column]]. You can access each data frame in the list with $ similar to how you’ve used it to access columns within a data frame.

typeof(dat)

## [1] "list"

names(dat)

## [1] "counts"  "samples" "genes"

#One data frame in dat
dat$samples

##            group lib.size norm.factors      libID ptID condition age_dys sex
## pt01_Media     1  7590941            1 pt01_Media pt01     Media   12410   M
## pt01_Mtb       1  6975318            1   pt01_Mtb pt01       Mtb   12410   M
## pt02_Media     1  7225255            1 pt02_Media pt02     Media   12775   M
## pt02_Mtb       1  5794398            1   pt02_Mtb pt02       Mtb   12775   M
## pt03_Media     1  4924407            1 pt03_Media pt03     Media   11315   M
## pt03_Mtb       1  5458446            1   pt03_Mtb pt03       Mtb   11315   M
## pt04_Media     1  7985262            1 pt04_Media pt04     Media    8395   M
## pt04_Mtb       1  6323624            1   pt04_Mtb pt04       Mtb    8395   M
## pt05_Media     1 11696793            1 pt05_Media pt05     Media    7300   M
## pt05_Mtb       1 10846286            1   pt05_Mtb pt05       Mtb    7300   M
## pt06_Media     1  5576621            1 pt06_Media pt06     Media    6570   F
## pt06_Mtb       1  5047579            1   pt06_Mtb pt06       Mtb    6570   F
## pt07_Media     1  8554618            1 pt07_Media pt07     Media    7665   F
## pt07_Mtb       1  6203112            1   pt07_Mtb pt07       Mtb    7665   F
## pt08_Media     1  7696280            1 pt08_Media pt08     Media    8760   M
## pt08_Mtb       1  5863972            1   pt08_Mtb pt08       Mtb    8760   M
## pt09_Media     1  9864991            1 pt09_Media pt09     Media    6935   M
## pt09_Mtb       1  9692964            1   pt09_Mtb pt09       Mtb    6935   M
## pt10_Media     1  6639203            1 pt10_Media pt10     Media    8030   F
## pt10_Mtb       1  5774444            1   pt10_Mtb pt10       Mtb    8030   F
##            ptID_old RNAseq methylation total_seq
## pt01_Media  pt00001   TRUE       FALSE   9114402
## pt01_Mtb    pt00001   TRUE       FALSE   8918699
## pt02_Media  pt00002   TRUE       FALSE   9221555
## pt02_Mtb    pt00002   TRUE       FALSE   7733260
## pt03_Media  pt00003   TRUE       FALSE   6231728
## pt03_Mtb    pt00003   TRUE       FALSE   7105193
## pt04_Media  pt00004   TRUE        TRUE  10205557
## pt04_Mtb    pt00004   TRUE        TRUE   8413543
## pt05_Media  pt00005   TRUE       FALSE  15536685
## pt05_Mtb    pt00005   TRUE       FALSE  15509446
## pt06_Media  pt00006   TRUE       FALSE   7085995
## pt06_Mtb    pt00006   TRUE       FALSE   6588160
## pt07_Media  pt00007   TRUE       FALSE  10706098
## pt07_Mtb    pt00007   TRUE       FALSE   8576245
## pt08_Media  pt00008   TRUE       FALSE   9957906
## pt08_Mtb    pt00008   TRUE       FALSE   8220348
## pt09_Media  pt00009   TRUE       FALSE  13055276
## pt09_Mtb    pt00009   TRUE       FALSE  13800442
## pt10_Media  pt00010   TRUE       FALSE   8216706
## pt10_Mtb    pt00010   TRUE       FALSE   7599609

#One column in one data frame in dat
dat$samples$libID

##  [1] "pt01_Media" "pt01_Mtb"   "pt02_Media" "pt02_Mtb"   "pt03_Media"
##  [6] "pt03_Mtb"   "pt04_Media" "pt04_Mtb"   "pt05_Media" "pt05_Mtb"  
## [11] "pt06_Media" "pt06_Mtb"   "pt07_Media" "pt07_Mtb"   "pt08_Media"
## [16] "pt08_Mtb"   "pt09_Media" "pt09_Mtb"   "pt10_Media" "pt10_Mtb"

Filter low abundance genes

Low abundance (small counts) and rare genes (many 0 counts) are removed from the data because they:

are unlikely to be significantly differentially expressed
greatly inflate multiple comparison correction
often do not meet linear modeling assumptions regarding mean variance trends (e.g. because of small N, they show lower variance than what is expected for their mean expression)

Our goal is to remove genes in the lower left of the mean-variance plot because counts (x) and variance (y) are low e.g. these genes break the red mean variance trend line.

Here we co-opt limma’s voom function to make out mean-variance plot for us.

mv1 <- voom(dat, plot=TRUE)

We use our custom function in the package RNAetc to retain only genes that are at least min.CPM counts per million in at least min.sample number of samples OR in at least min.pct percent of samples. Here, we use 0.5 CPM in at least 3 samples.

dat.abund <- filter_rare(dat, min.CPM = 0.5, min.sample = 3,
                         gene.var="hgnc_symbol")

## 4849 (26.54%) of 18268 genes removed.

Plotting the filtered data, we see the red trend line no longer curves down on the left. There is still a bit of the downward tail of dots but this filtering is sufficient.

mv2 <- voom(dat.abund, plot=TRUE)

This filtering generally results in the removal of around 25% of genes. There is no exact cutoff for this filtering, and you should try several cutoffs to observe the effects. In general, we use minimum CPM from 0.1 - 1, minimum samples around 3 for small data sets, or minimum samples from 5 - 10% in larger data sets. It is important to keep the minimum sample cutoff larger than your smallest group of interest, otherwise you may filter genes specifically associated with one group. For example, if you have 10 libraries with 5 each of media and stimulated, your min.sample value should not be > 5 or else you will remove genes only expressed in one condition.

You may also wish to look for specific genes of interest and ensure they are not being filtered. For example, we will analyze interferon gammea (IFNG) later in this workshop so we check that this gene is present after filtering.

"IFNG" %in% rownames(dat.abund$counts)

## [1] TRUE

Normalize data

Trimmed mean of M (TMM)

RNA-seq counts are not independent within a library and not comparable across libraries. A library with 1 million sequences will have higher counts for most genes than one with 1 thousand sequences. We correct for this aspect of the data with the following normalization steps.

TMM defines a reference sample from your data set as the one with the most representative expression for the overall data set. Specifically, the reference sample is the one whose upper quartile is closest to the overall data set upper quartile. The upper quantile is the value (x) where 75% of genes have expression < x. Thus, the reference sample is the sample whose x is the closest to mean(x) across all samples.

All other samples are considered test samples. For each test sample, a scaling factor is calculated based on the weighted mean of log ratios of representative genes between the test and reference. These representative genes are a subset of the data set, removing the highest and lowest expressed genes as well as genes with the highest and lowest log ratios. The exact genes used as representative genes for scaling are dynamic and specific to each test sample.

The calculated scaling factors are then applied to the counts table automatically in the voom step.

dat.abund.norm <- calcNormFactors(dat.abund, method = "TMM")

voom aka log2 counts per million (CPM)

We continue normalization by converting counts to CPM within each sample, thus accounting for differential sampling depth. We also perform log2 transformation, because RNA-seq data are heavily right-skewed and thus, violate assumptions of normality.

as.data.frame(dat.abund$counts) %>% 
  rownames_to_column() %>% 
  pivot_longer(-rowname) %>% 
ggplot() +
  geom_histogram(aes(x=value), bins = 100) +
  theme_classic() +
  labs(x = "count", y = "occurences") +
  lims(x=c(0,1000))

Note: You will see a warning message with the plot above. This is telling you that some data are not represented in the plot. This is expected here as we’ve forced the x-axis limits to be 0 to 1000, therefore removing any data with counts higher than 1000. If you see this warning when you’re not expecting data to be removed, go back to the data frame and look for NA or formatting errors (like a letter in a numeric variable).

voom performs both of these steps! We use voomWithQualityWeights to additionally calculate sample specific quality weights that can be of use as co-variates in linear modeling.

#define model matrix
mm <- model.matrix(~ condition, data=dat.abund.norm$samples)
dat.abund.norm.voom <- voomWithQualityWeights(
                           dat.abund.norm,
                           design=mm,
                           plot=TRUE)

Final data structure

Let’s review the final data structure. We access each data frame within the Elist object using $. The normalized log2 CPM expression data are contained in E.

dat.abund.norm.voom$E[1:3,1:7]

##       pt01_Media  pt01_Mtb pt02_Media  pt02_Mtb pt03_Media  pt03_Mtb pt04_Media
## A1BG    1.620022  1.717603   1.086946  1.095467  0.8744826  1.710618   1.573344
## A2M     7.259912  6.680758   5.939909  5.316170  6.0252987  5.541628   7.330036
## A2ML1  -4.052404 -3.515057  -4.015712 -1.816544 -3.3734450 -3.386150  -4.170278

Library and donor metadata are in targets.

dat.abund.norm.voom$targets[1:3,1:10]

##            group lib.size norm.factors      libID ptID condition age_dys sex
## pt01_Media     1  8295928     1.092872 pt01_Media pt01     Media   12410   M
## pt01_Mtb       1  5716203     0.819490   pt01_Mtb pt01       Mtb   12410   M
## pt02_Media     1  8087601     1.119352 pt02_Media pt02     Media   12775   M
##            ptID_old RNAseq
## pt01_Media  pt00001   TRUE
## pt01_Mtb    pt00001   TRUE
## pt02_Media  pt00002   TRUE

Gene metadata are in genes.

dat.abund.norm.voom$genes[1:3,]

##      hgnc_symbol   gene_biotype
## A1BG      MT-ND1 protein_coding
## A1CF      MT-ND2 protein_coding
## A2M       MT-CO1 protein_coding

Gene-level quality weights are in weights.

dat.abund.norm.voom$weights[1:3,1:3]

##            [,1]       [,2]       [,3]
## [1,]  1.5140025  1.1586136  2.0890796
## [2,] 10.0514208 10.9377042 14.1566201
## [3,]  0.3364504  0.3989864  0.4689222

Explorative analysis

To get an initial sense of the cleaned data, we plot some variables of interest. Here, we see a clear Mtb infection signal and no relationship to sex.

#transpose normalized expression data
voom.for.pca <- t(dat.abund.norm.voom$E)
#Calculate PCA
PCA3 <- prcomp(voom.for.pca, scale.=TRUE, center=TRUE)

#create labels with percent explained for the PC
PC1.label <- paste("PC1", " (", summary(PCA)$importance[2,1]*100, "%)", sep="")
PC2.label <- paste("PC2", " (", summary(PCA)$importance[2,2]*100, "%)", sep="")

#Merge PCA results with metadata
PCA.meta <- as.data.frame(PCA3$x) %>% 
  rownames_to_column("libID") %>% 
  full_join(meta)

## Joining, by = "libID"

#color by condition
ggplot(PCA.meta, aes(x=PC1, y=PC2)) +
  geom_point(aes(color=condition), size=3) +
  labs(x=PC1.label, y=PC2.label) +
  #Set theme
  theme_classic() +
  #set xy ratio
  coord_fixed(ratio=1)

#color by sex
ggplot(PCA.meta, aes(x=PC1, y=PC2)) +
  geom_point(aes(color=sex), size=3) +
  labs(x=PC1.label, y=PC2.label) +
  #Set theme
  theme_classic() +
  #set xy ratio
  coord_fixed(ratio=1)

Save

Write final voom object as RData

save(dat.abund.norm.voom, file = "0_data/dat_voom.RData")

We also save information on a single gene, IFNG, for use in our next section on linear modeling. Note that this combines several of the tidyverse functions you saw this morning as well as introduces some new ones.

# Get 
IFNG <- as.data.frame(dat.abund.norm.voom$E) %>% 
  #Move rownames into the data frame as a column
  rownames_to_column("hgnc_symbol") %>% 
  #Filter gene of interest
  filter(hgnc_symbol == "IFNG") %>% 
  #Pivot to long format so libID are in a single column
  pivot_longer(-hgnc_symbol, names_to = "libID", values_to = "E") %>% 
  #Join with sample metadat
  full_join(dat.abund.norm.voom$targets, by = "libID")

#View resulting data frame
IFNG

## # A tibble: 20 × 15
##    hgnc_symbol libID       E group lib.size norm.factors ptID  condition age_dys
##    <chr>       <chr>   <dbl> <fct>    <dbl>        <dbl> <chr> <chr>       <dbl>
##  1 IFNG        pt01_… -4.05  1       8.30e6        1.09  pt01  Media       12410
##  2 IFNG        pt01_…  0.572 1       5.72e6        0.819 pt01  Mtb         12410
##  3 IFNG        pt02_… -0.846 1       8.09e6        1.12  pt02  Media       12775
##  4 IFNG        pt02_…  3.39  1       5.28e6        0.912 pt02  Mtb         12775
##  5 IFNG        pt03_… -3.37  1       5.18e6        1.05  pt03  Media       11315
##  6 IFNG        pt03_…  0.521 1       5.23e6        0.958 pt03  Mtb         11315
##  7 IFNG        pt04_… -1.00  1       9.00e6        1.13  pt04  Media        8395
##  8 IFNG        pt04_…  4.94  1       5.40e6        0.855 pt04  Mtb          8395
##  9 IFNG        pt05_… -4.65  1       1.26e7        1.07  pt05  Media        7300
## 10 IFNG        pt05_… -2.01  1       1.01e7        0.930 pt05  Mtb          7300
## 11 IFNG        pt06_… -3.63  1       6.17e6        1.11  pt06  Media        6570
## 12 IFNG        pt06_… -1.12  1       5.42e6        1.07  pt06  Mtb          6570
## 13 IFNG        pt07_… -1.32  1       8.76e6        1.02  pt07  Media        7665
## 14 IFNG        pt07_…  1.48  1       5.19e6        0.837 pt07  Mtb          7665
## 15 IFNG        pt08_… -2.56  1       8.82e6        1.15  pt08  Media        8760
## 16 IFNG        pt08_… -0.629 1       5.41e6        0.923 pt08  Mtb          8760
## 17 IFNG        pt09_…  0.290 1       1.19e7        1.20  pt09  Media        6935
## 18 IFNG        pt09_…  3.74  1       1.10e7        1.14  pt09  Mtb          6935
## 19 IFNG        pt10_… -0.524 1       6.47e6        0.974 pt10  Media        8030
## 20 IFNG        pt10_…  4.30  1       4.54e6        0.786 pt10  Mtb          8030
## # … with 6 more variables: sex <chr>, ptID_old <chr>, RNAseq <lgl>,
## #   methylation <lgl>, total_seq <dbl>, sample.weights <dbl>

#Save as csv
write_csv(IFNG, file = "0_data/IFNG.csv")

RNA-seq data cleaning

Counts to voom

Kim Dill-McFarland, kadm@uw.edu

version June 01, 2022