Here, we take a raw RNA-seq counts table through quality assessment, filtering, and normalization. For more information on how counts were generated, see our [SEAsnake pipeline][SEAsnake]. These data are human, bulk, paired-end RNA-seq, but this pipeline can be applied to other organisms or single-read libraries.
Load the following packages. For more information on installation, see the setup instructions.
#Data manipulation and plotting
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.8
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
#RNAseq data
library(edgeR)
## Loading required package: limma
## Warning: package 'limma' was built under R version 4.1.3
library(limma)
library(RNAetc)
#Note we do not load biomaRt because it has conflicts with the tidyverse.
# We instead call its functions with biomaRt::
# Set random seed for reproduciblity
set.seed(651)
Using the tidyverse read_
functions you saw earlier today, read in the counts and sample metadata files.
count <- read_csv("0_data/raw_counts.csv")
## Rows: 44786 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): hgnc_symbol
## dbl (20): pt01_Media, pt01_Mtb, pt02_Media, pt02_Mtb, pt03_Media, pt03_Mtb, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
meta <- read_csv("0_data/metadata.csv")
## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We assess library quality using metrics from [samtools flagstat
][flagstat] and [picard
][picard]. Here, we will only look at total sequences for brevity but you can find more quality measures of interest such as percent alignment and median CV coverage in our full [tutorial][count_to_voom].
Using ggplot
, let’s plot the total sequences per library.
ggplot(meta, aes(x = libID, y = total_seq)) +
geom_col()
The above plot is somewhat difficult to read so we modify it with some ggplot
features you’ve seen before and some that are new.
ggplot(meta,
#Reorder x axis by y-axis value
aes(x = reorder(libID, total_seq),
y = total_seq)) +
geom_col() +
#Add cutoff line
geom_hline(yintercept = 1E6, color = "red") +
#Set theme
theme_classic() +
#Relabel axes
labs(x = "Library", y = "Total sequences") +
#Rotate x-axis labels
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Overall we see that all libraries have more the 1 million sequences. Based on this and other quality metrics not shown here, we determine that all libraries pass quality check.
Standard differential expression analyses focus on protein-coding genes as these RNA products are the most likely to result in a measurable phenotype. We annotate the counts table genes to their biotypes as well as additional identifiers with biomaRt
.
First, load the gene data base from biomaRt
. Note that here we use biomaRt::function( )
because we did not load this package with library( )
earlier. The ::
operator tells R which package contains the function and prevents downstream issues in this pipeline due to dplyr::select
vs biomaRt::select
.
#Get database
ensembl <- biomaRt::useEnsembl(biomart="ensembl",
dataset="hsapiens_gene_ensembl",
mirror = "www")
## Ensembl site unresponsive, trying uswest mirror
#Get gene key
key <- biomaRt::getBM(attributes=c("hgnc_symbol", "gene_biotype"),
mart=ensembl)
Also, you could extract more than just the biotypes from biomaRt
. See all the possible data with listAttributes
.
attrib <- biomaRt::listAttributes(ensembl)
head(attrib)
## name description page
## 1 ensembl_gene_id Gene stable ID feature_page
## 2 ensembl_gene_id_version Gene stable ID version feature_page
## 3 ensembl_transcript_id Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5 ensembl_peptide_id Protein stable ID feature_page
## 6 ensembl_peptide_id_version Protein stable ID version feature_page
Next, filter the key to protein-coding genes present in the RNA-seq count data.
key.filter <- key %>%
#Filter genes in count table
filter(hgnc_symbol %in% count$hgnc_symbol) %>%
#Filter protein-coding
filter(gene_biotype == "protein_coding")
And filter the count table to genes in the protein-coding key.
count.pc <- count %>%
filter(hgnc_symbol %in% key.filter$hgnc_symbol)
This removes 26518 genes from this data set and leaves 18268 genes for analysis.
Sometimes one or more libraries will appear as outliers in PCA. We define this as any library more than 3 standard deviations away from the mean PC1 and/or PC2.
First, we must reformat the counts data to a log counts per million (cpm) matrix in order for PCA to work. We also transpose it so that genes are columns and libraries are rows. Without transposing, we’d get 1 dot per gene instead of 1 per library.
count.for.pca <- count.pc %>%
#move HGNC gene symbol to rownames
column_to_rownames("hgnc_symbol") %>%
#convert to log cpm
cpm(log=TRUE) %>%
#transpose
t()
count.for.pca[1:3,1:3]
## A1BG A1CF A2M
## pt01_Media 1.835063 -1.865271 7.389832
## pt01_Mtb 1.535061 -1.865271 6.397031
## pt02_Media 1.369070 -1.865271 6.106875
Then, we calculate PCA on these transformed data.
PCA <- prcomp(count.for.pca, scale.=TRUE, center=TRUE)
A basic ggplot
of these PCA data looks like so, but it does not tell us anything about outliers.
#Determine percent explained for each PC
summary(PCA)$importance
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 67.41996 46.09409 38.32512 36.86756 34.23585 31.11334
## Proportion of Variance 0.24882 0.11631 0.08040 0.07440 0.06416 0.05299
## Cumulative Proportion 0.24882 0.36513 0.44553 0.51993 0.58409 0.63709
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 28.79221 28.08592 26.04112 24.22482 22.73842 22.68737
## Proportion of Variance 0.04538 0.04318 0.03712 0.03212 0.02830 0.02818
## Cumulative Proportion 0.68247 0.72565 0.76277 0.79489 0.82319 0.85137
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 21.77168 20.72515 20.25228 19.57118 19.30086 18.76942
## Proportion of Variance 0.02595 0.02351 0.02245 0.02097 0.02039 0.01928
## Cumulative Proportion 0.87732 0.90083 0.92328 0.94425 0.96464 0.98393
## PC19 PC20
## Standard deviation 17.12298 0.664296
## Proportion of Variance 0.01605 0.000020
## Cumulative Proportion 0.99998 1.000000
#create labels with percent explained for the PC
PC1.label <- paste("PC1", " (", summary(PCA)$importance[2,1]*100, "%)", sep="")
PC2.label <- paste("PC2", " (", summary(PCA)$importance[2,2]*100, "%)", sep="")
ggplot(as.data.frame(PCA$x),
aes(x=PC1, y=PC2)) +
geom_point(size=3) +
labs(x=PC1.label, y=PC2.label)
We need to calculate the means and standard deviations of our PCs and create a new variable to color the points in PCA. First, let’s get the summary metrics from the PCA data contained in PCA$x
.
PCA.summ <- as.data.frame(PCA$x) %>%
#move rownames into column of data frame
rownames_to_column("libID") %>%
#Pivot to get all PCs in 1 column
pivot_longer(-libID) %>%
#calculate mean and sd
group_by(name) %>%
summarize(meanPC = mean(value),
sdPC = sd(value))
PCA.summ
## # A tibble: 20 × 3
## name meanPC sdPC
## <chr> <dbl> <dbl>
## 1 PC1 -0.250 67.4
## 2 PC10 0.0951 24.2
## 3 PC11 -0.118 22.7
## 4 PC12 0.00668 22.7
## 5 PC13 0.0262 21.8
## 6 PC14 0.0299 20.7
## 7 PC15 -0.0288 20.3
## 8 PC16 -0.100 19.6
## 9 PC17 0.0166 19.3
## 10 PC18 -0.0150 18.8
## 11 PC19 0.158 17.1
## 12 PC2 0.362 46.1
## 13 PC20 -0.647 0.0137
## 14 PC3 0.209 38.3
## 15 PC4 -0.123 36.9
## 16 PC5 -0.0577 34.2
## 17 PC6 0.233 31.1
## 18 PC7 -0.169 28.8
## 19 PC8 0.141 28.1
## 20 PC9 0.0745 26.0
#Extract a row by the PC name
PC1.summ <- PCA.summ[PCA.summ$name == "PC1",]
PC2.summ <- PCA.summ[PCA.summ$name == "PC1",]
#create variable for outliers outside 3 sd from mean on either PC1 of PC2
PCA2 <- as.data.frame(PCA$x) %>%
mutate(outlier = ifelse(
#Higher than +3 sd
PC1 > PC1.summ$meanPC + 3*PC1.summ$sdPC |
PC2 > PC2.summ$meanPC + 3*PC2.summ$sdPC |
#Lower than -3 sd
PC1 < PC1.summ$meanPC - 3*PC1.summ$sdPC |
PC2 < PC2.summ$meanPC - 3*PC2.summ$sdPC,
#If any of the above are TRUE, make the variable equal to this
"outlier",
#Otherwise (else), use this
"okay"))
Then our PCA plot reveals that there are no outliers. Note we added some additional ggplot
customization to beautify this plot.
ggplot(PCA2, aes(x=PC1, y=PC2)) +
geom_point(aes(color=outlier), size=3) +
labs(x=PC1.label, y=PC2.label) +
#Set theme
theme_classic() +
#set xy ratio
coord_fixed(ratio=1) +
#change colors
scale_color_manual(values = c("okay"="grey70",
"outlier"="red"))
We recommend that you initially remove dramatic outliers but leave those that are borderline or questionable. Then, you can re-assess outliers after gene filtering and normalization. You may find that some are no longer outliers after these steps. If they are, you can return to this step and remove them before repeating subsequent steps.
This workshop covers the minimum quality control and filtering needed for RNA-seq libraries. Checkout our full tutorial for additional QC metrics, batch effects, custom plotting functions, and more!
At this stage, we’ve completed sample filtering and can collapse our count and meta data into a single list object. This allows us to shorten our long object names as well as works efficiently with the remaining cleaning steps.
We merge into a DGEList object, which is edgeR
format. We must convert the counts data frame to a matrix for this process.
count.pc.mat <- count.pc %>%
column_to_rownames("hgnc_symbol") %>%
as.matrix()
dat <- DGEList(counts=count.pc.mat, samples=meta, genes=key.filter)
This object is a list of data frames. It is anS3
object in R, because it has 3 dimensions: list[dataframe[column]]
. You can access each data frame in the list with $
similar to how you’ve used it to access columns within a data frame.
typeof(dat)
## [1] "list"
names(dat)
## [1] "counts" "samples" "genes"
#One data frame in dat
dat$samples
## group lib.size norm.factors libID ptID condition age_dys sex
## pt01_Media 1 7590941 1 pt01_Media pt01 Media 12410 M
## pt01_Mtb 1 6975318 1 pt01_Mtb pt01 Mtb 12410 M
## pt02_Media 1 7225255 1 pt02_Media pt02 Media 12775 M
## pt02_Mtb 1 5794398 1 pt02_Mtb pt02 Mtb 12775 M
## pt03_Media 1 4924407 1 pt03_Media pt03 Media 11315 M
## pt03_Mtb 1 5458446 1 pt03_Mtb pt03 Mtb 11315 M
## pt04_Media 1 7985262 1 pt04_Media pt04 Media 8395 M
## pt04_Mtb 1 6323624 1 pt04_Mtb pt04 Mtb 8395 M
## pt05_Media 1 11696793 1 pt05_Media pt05 Media 7300 M
## pt05_Mtb 1 10846286 1 pt05_Mtb pt05 Mtb 7300 M
## pt06_Media 1 5576621 1 pt06_Media pt06 Media 6570 F
## pt06_Mtb 1 5047579 1 pt06_Mtb pt06 Mtb 6570 F
## pt07_Media 1 8554618 1 pt07_Media pt07 Media 7665 F
## pt07_Mtb 1 6203112 1 pt07_Mtb pt07 Mtb 7665 F
## pt08_Media 1 7696280 1 pt08_Media pt08 Media 8760 M
## pt08_Mtb 1 5863972 1 pt08_Mtb pt08 Mtb 8760 M
## pt09_Media 1 9864991 1 pt09_Media pt09 Media 6935 M
## pt09_Mtb 1 9692964 1 pt09_Mtb pt09 Mtb 6935 M
## pt10_Media 1 6639203 1 pt10_Media pt10 Media 8030 F
## pt10_Mtb 1 5774444 1 pt10_Mtb pt10 Mtb 8030 F
## ptID_old RNAseq methylation total_seq
## pt01_Media pt00001 TRUE FALSE 9114402
## pt01_Mtb pt00001 TRUE FALSE 8918699
## pt02_Media pt00002 TRUE FALSE 9221555
## pt02_Mtb pt00002 TRUE FALSE 7733260
## pt03_Media pt00003 TRUE FALSE 6231728
## pt03_Mtb pt00003 TRUE FALSE 7105193
## pt04_Media pt00004 TRUE TRUE 10205557
## pt04_Mtb pt00004 TRUE TRUE 8413543
## pt05_Media pt00005 TRUE FALSE 15536685
## pt05_Mtb pt00005 TRUE FALSE 15509446
## pt06_Media pt00006 TRUE FALSE 7085995
## pt06_Mtb pt00006 TRUE FALSE 6588160
## pt07_Media pt00007 TRUE FALSE 10706098
## pt07_Mtb pt00007 TRUE FALSE 8576245
## pt08_Media pt00008 TRUE FALSE 9957906
## pt08_Mtb pt00008 TRUE FALSE 8220348
## pt09_Media pt00009 TRUE FALSE 13055276
## pt09_Mtb pt00009 TRUE FALSE 13800442
## pt10_Media pt00010 TRUE FALSE 8216706
## pt10_Mtb pt00010 TRUE FALSE 7599609
#One column in one data frame in dat
dat$samples$libID
## [1] "pt01_Media" "pt01_Mtb" "pt02_Media" "pt02_Mtb" "pt03_Media"
## [6] "pt03_Mtb" "pt04_Media" "pt04_Mtb" "pt05_Media" "pt05_Mtb"
## [11] "pt06_Media" "pt06_Mtb" "pt07_Media" "pt07_Mtb" "pt08_Media"
## [16] "pt08_Mtb" "pt09_Media" "pt09_Mtb" "pt10_Media" "pt10_Mtb"
Low abundance (small counts) and rare genes (many 0 counts) are removed from the data because they:
Our goal is to remove genes in the lower left of the mean-variance plot because counts (x) and variance (y) are low e.g. these genes break the red mean variance trend line.
Here we co-opt limma
’s voom
function to make out mean-variance plot for us.
mv1 <- voom(dat, plot=TRUE)
We use our custom function in the package RNAetc
to retain only genes that are at least min.CPM
counts per million in at least min.sample
number of samples OR in at least min.pct
percent of samples. Here, we use 0.5 CPM in at least 3 samples.
dat.abund <- filter_rare(dat, min.CPM = 0.5, min.sample = 3,
gene.var="hgnc_symbol")
## 4849 (26.54%) of 18268 genes removed.
Plotting the filtered data, we see the red trend line no longer curves down on the left. There is still a bit of the downward tail of dots but this filtering is sufficient.
mv2 <- voom(dat.abund, plot=TRUE)
This filtering generally results in the removal of around 25% of genes. There is no exact cutoff for this filtering, and you should try several cutoffs to observe the effects. In general, we use minimum CPM from 0.1 - 1, minimum samples around 3 for small data sets, or minimum samples from 5 - 10% in larger data sets. It is important to keep the minimum sample cutoff larger than your smallest group of interest, otherwise you may filter genes specifically associated with one group. For example, if you have 10 libraries with 5 each of media and stimulated, your min.sample
value should not be > 5 or else you will remove genes only expressed in one condition.
You may also wish to look for specific genes of interest and ensure they are not being filtered. For example, we will analyze interferon gammea (IFNG) later in this workshop so we check that this gene is present after filtering.
"IFNG" %in% rownames(dat.abund$counts)
## [1] TRUE
RNA-seq counts are not independent within a library and not comparable across libraries. A library with 1 million sequences will have higher counts for most genes than one with 1 thousand sequences. We correct for this aspect of the data with the following normalization steps.
TMM defines a reference sample from your data set as the one with the most representative expression for the overall data set. Specifically, the reference sample is the one whose upper quartile is closest to the overall data set upper quartile. The upper quantile is the value (x
) where 75% of genes have expression < x
. Thus, the reference sample is the sample whose x
is the closest to mean(x
) across all samples.
All other samples are considered test samples. For each test sample, a scaling factor is calculated based on the weighted mean of log ratios of representative genes between the test and reference. These representative genes are a subset of the data set, removing the highest and lowest expressed genes as well as genes with the highest and lowest log ratios. The exact genes used as representative genes for scaling are dynamic and specific to each test sample.
The calculated scaling factors are then applied to the counts table automatically in the voom
step.
dat.abund.norm <- calcNormFactors(dat.abund, method = "TMM")
We continue normalization by converting counts to CPM within each sample, thus accounting for differential sampling depth. We also perform log2 transformation, because RNA-seq data are heavily right-skewed and thus, violate assumptions of normality.
as.data.frame(dat.abund$counts) %>%
rownames_to_column() %>%
pivot_longer(-rowname) %>%
ggplot() +
geom_histogram(aes(x=value), bins = 100) +
theme_classic() +
labs(x = "count", y = "occurences") +
lims(x=c(0,1000))
Note: You will see a warning message with the plot above. This is telling you that some data are not represented in the plot. This is expected here as we’ve forced the x-axis limits to be 0 to 1000, therefore removing any data with counts higher than 1000. If you see this warning when you’re not expecting data to be removed, go back to the data frame and look for NA
or formatting errors (like a letter in a numeric variable).
voom
performs both of these steps! We use voomWithQualityWeights
to additionally calculate sample specific quality weights that can be of use as co-variates in linear modeling.
#define model matrix
mm <- model.matrix(~ condition, data=dat.abund.norm$samples)
dat.abund.norm.voom <- voomWithQualityWeights(
dat.abund.norm,
design=mm,
plot=TRUE)
Let’s review the final data structure. We access each data frame within the Elist
object using $
. The normalized log2 CPM expression data are contained in E
.
dat.abund.norm.voom$E[1:3,1:7]
## pt01_Media pt01_Mtb pt02_Media pt02_Mtb pt03_Media pt03_Mtb pt04_Media
## A1BG 1.620022 1.717603 1.086946 1.095467 0.8744826 1.710618 1.573344
## A2M 7.259912 6.680758 5.939909 5.316170 6.0252987 5.541628 7.330036
## A2ML1 -4.052404 -3.515057 -4.015712 -1.816544 -3.3734450 -3.386150 -4.170278
Library and donor metadata are in targets
.
dat.abund.norm.voom$targets[1:3,1:10]
## group lib.size norm.factors libID ptID condition age_dys sex
## pt01_Media 1 8295928 1.092872 pt01_Media pt01 Media 12410 M
## pt01_Mtb 1 5716203 0.819490 pt01_Mtb pt01 Mtb 12410 M
## pt02_Media 1 8087601 1.119352 pt02_Media pt02 Media 12775 M
## ptID_old RNAseq
## pt01_Media pt00001 TRUE
## pt01_Mtb pt00001 TRUE
## pt02_Media pt00002 TRUE
Gene metadata are in genes
.
dat.abund.norm.voom$genes[1:3,]
## hgnc_symbol gene_biotype
## A1BG MT-ND1 protein_coding
## A1CF MT-ND2 protein_coding
## A2M MT-CO1 protein_coding
Gene-level quality weights are in weights
.
dat.abund.norm.voom$weights[1:3,1:3]
## [,1] [,2] [,3]
## [1,] 1.5140025 1.1586136 2.0890796
## [2,] 10.0514208 10.9377042 14.1566201
## [3,] 0.3364504 0.3989864 0.4689222
To get an initial sense of the cleaned data, we plot some variables of interest. Here, we see a clear Mtb infection signal and no relationship to sex.
#transpose normalized expression data
voom.for.pca <- t(dat.abund.norm.voom$E)
#Calculate PCA
PCA3 <- prcomp(voom.for.pca, scale.=TRUE, center=TRUE)
#create labels with percent explained for the PC
PC1.label <- paste("PC1", " (", summary(PCA)$importance[2,1]*100, "%)", sep="")
PC2.label <- paste("PC2", " (", summary(PCA)$importance[2,2]*100, "%)", sep="")
#Merge PCA results with metadata
PCA.meta <- as.data.frame(PCA3$x) %>%
rownames_to_column("libID") %>%
full_join(meta)
## Joining, by = "libID"
#color by condition
ggplot(PCA.meta, aes(x=PC1, y=PC2)) +
geom_point(aes(color=condition), size=3) +
labs(x=PC1.label, y=PC2.label) +
#Set theme
theme_classic() +
#set xy ratio
coord_fixed(ratio=1)
#color by sex
ggplot(PCA.meta, aes(x=PC1, y=PC2)) +
geom_point(aes(color=sex), size=3) +
labs(x=PC1.label, y=PC2.label) +
#Set theme
theme_classic() +
#set xy ratio
coord_fixed(ratio=1)
Write final voom object as RData
save(dat.abund.norm.voom, file = "0_data/dat_voom.RData")
We also save information on a single gene, IFNG, for use in our next section on linear modeling. Note that this combines several of the tidyverse
functions you saw this morning as well as introduces some new ones.
# Get
IFNG <- as.data.frame(dat.abund.norm.voom$E) %>%
#Move rownames into the data frame as a column
rownames_to_column("hgnc_symbol") %>%
#Filter gene of interest
filter(hgnc_symbol == "IFNG") %>%
#Pivot to long format so libID are in a single column
pivot_longer(-hgnc_symbol, names_to = "libID", values_to = "E") %>%
#Join with sample metadat
full_join(dat.abund.norm.voom$targets, by = "libID")
#View resulting data frame
IFNG
## # A tibble: 20 × 15
## hgnc_symbol libID E group lib.size norm.factors ptID condition age_dys
## <chr> <chr> <dbl> <fct> <dbl> <dbl> <chr> <chr> <dbl>
## 1 IFNG pt01_… -4.05 1 8.30e6 1.09 pt01 Media 12410
## 2 IFNG pt01_… 0.572 1 5.72e6 0.819 pt01 Mtb 12410
## 3 IFNG pt02_… -0.846 1 8.09e6 1.12 pt02 Media 12775
## 4 IFNG pt02_… 3.39 1 5.28e6 0.912 pt02 Mtb 12775
## 5 IFNG pt03_… -3.37 1 5.18e6 1.05 pt03 Media 11315
## 6 IFNG pt03_… 0.521 1 5.23e6 0.958 pt03 Mtb 11315
## 7 IFNG pt04_… -1.00 1 9.00e6 1.13 pt04 Media 8395
## 8 IFNG pt04_… 4.94 1 5.40e6 0.855 pt04 Mtb 8395
## 9 IFNG pt05_… -4.65 1 1.26e7 1.07 pt05 Media 7300
## 10 IFNG pt05_… -2.01 1 1.01e7 0.930 pt05 Mtb 7300
## 11 IFNG pt06_… -3.63 1 6.17e6 1.11 pt06 Media 6570
## 12 IFNG pt06_… -1.12 1 5.42e6 1.07 pt06 Mtb 6570
## 13 IFNG pt07_… -1.32 1 8.76e6 1.02 pt07 Media 7665
## 14 IFNG pt07_… 1.48 1 5.19e6 0.837 pt07 Mtb 7665
## 15 IFNG pt08_… -2.56 1 8.82e6 1.15 pt08 Media 8760
## 16 IFNG pt08_… -0.629 1 5.41e6 0.923 pt08 Mtb 8760
## 17 IFNG pt09_… 0.290 1 1.19e7 1.20 pt09 Media 6935
## 18 IFNG pt09_… 3.74 1 1.10e7 1.14 pt09 Mtb 6935
## 19 IFNG pt10_… -0.524 1 6.47e6 0.974 pt10 Media 8030
## 20 IFNG pt10_… 4.30 1 4.54e6 0.786 pt10 Mtb 8030
## # … with 6 more variables: sex <chr>, ptID_old <chr>, RNAseq <lgl>,
## # methylation <lgl>, total_seq <dbl>, sample.weights <dbl>
#Save as csv
write_csv(IFNG, file = "0_data/IFNG.csv")