Load the following packages. For more information on installation, see the setup instructions.
#Data manipulation and plotting
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.8
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
#Linear modeling
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
#Get p-values with random effects model
library(lmerTest)
##
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
##
## lmer
## The following object is masked from 'package:stats':
##
## step
For this section, we’re just going to be focusing on one gene, IFNG, which you collapsed into one data frame in the previous section.
modelDat <- read_csv('0_data/IFNG.csv')
## Rows: 20 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): hgnc_symbol, libID, ptID, condition, sex, ptID_old
## dbl (7): E, group, lib.size, norm.factors, age_dys, total_seq, sample.weights
## lgl (2): RNAseq, methylation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Awesome, we have our data! Let’s get to the fun part - statistics!
In research, we want to try to learn about a target population. However, we usually can’t get data about the entire population, so we take a sample and get data about the sample. Then, using the data about the sample, we want to make inferences about the entire population.
For example, let’s say we wanted to know if people who play video games more than 10 hours a week spend more or less money on electronics than those that play video games less than 10 hours a week. Since we can’t ask everyone about their gaming and spending habits, we would ask a select few (a sample) about their gaming and spending habits. Then we would use statistics to determine the probability or likelihood that our population behaves a certain way based on that sample.
T-tests examine the means of two groups. To do a t-test you need to have a binary variable and a continuous variable. Then the test examines if the average of that continuous variable is the same for each binary variable. Using the example above, our binary variable would be if the person plays video games more or less than 10 hours a week and the continuous variable would be the amount spent on electronics.
In our data here, we will be looking at gene expression levels (continuous) and comparing the Media vs Mtb infection conditions (binary).
Let’s first look at the distribution of the data.
p <- ggplot(modelDat, aes(x=condition, y=E)) +
  geom_boxplot() +
  geom_jitter(shape=16, position=position_jitter(0.2)) +
  xlab("Condition") +
  ylab(expression(paste('Log'[2], " Expression")))
p
geom_boxplot() outputs the actual box plots and geom_jitter() adds the actual data points, slightly scattered so we can see them better.
We can see that the expression levels are lower in the media condition than in the Mtb condition. Now let’s do a t-test to see if that difference is statistically significant.
ttestRes <- t.test(E ~ condition, data = modelDat)
ttestRes
##
## Welch Two Sample t-test
##
## data: E by condition
## t = -3.9134, df = 16.056, p-value = 0.001231
## alternative hypothesis: true difference in means between group Media and group Mtb is not equal to 0
## 95 percent confidence interval:
## -5.681177 -1.689521
## sample estimates:
## mean in group Media mean in group Mtb
## -2.166250 1.519099
The notation here is that the continuous variable goes before the ~, then we have the binary variable, and then we specify where the data is found.
From the results we get the t-test statistic (t = -3.9134), the degrees of freedom (df = 16.056), the p-value (p-value = 0.001231), as well as estimates of the mean for each group (Media = -2.17, Mtb = 1.52).
These results suggest that there is a statistically significant difference in gene expression for the gene IFNG between the Media and the Mtb infection conditions. (In statistics language: We reject the null hypothesis that the mean gene expression is equal in the Media and Mtb conditions)
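The output of t.test() is a list under the hood, so if you want to reuse these numbers later (say, collecting results across many genes), you can pull the pieces out directly. A quick sketch using the ttestRes object from above:
#Pull individual results out of the t-test object
ttestRes$statistic  #t statistic
ttestRes$parameter  #degrees of freedom
ttestRes$p.value    #p-value
ttestRes$estimate   #group means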
Before we trust this result, let's walk through the assumptions of a t-test. Is our data continuous? Easy! Our data is a transformed measure of gene expression. Definitely continuous.
Was the sample randomly selected? This would be a question for the researchers that collected this sample. It seems fair to assume so, however.
Is the data normally distributed? This is what we would check with our boxplot. We could also look at a violin plot.
p <- ggplot(modelDat, aes(x=condition, y=E)) +
  geom_violin() +
  geom_jitter(shape=16, position=position_jitter(0.2)) +
  xlab("Condition") +
  ylab(expression(paste('Log'[2], " Expression")))
p
Or a histogram
modelDat %>% ggplot(aes(x=E, fill = condition)) +
  geom_bar() +
  scale_x_binned() +
  xlab(expression(paste('Log'[2], " Expression")))
We don’t have a ton of data here so these plots look a little odd, but this could be useful if you have more data.
There are also statistical tests you can use to check normality. One is the Kolmogorov-Smirnov test. This test is very sensitive to departures from normality, and you can also use it to test against other distributions (F-distribution, etc.). The other is the Shapiro-Wilk test, which only tests for normality.
ks.test(modelDat$E, "pnorm")
##
## One-sample Kolmogorov-Smirnov test
##
## data: modelDat$E
## D = 0.30116, p-value = 0.04181
## alternative hypothesis: two-sided
shapiro.test(modelDat$E)
##
## Shapiro-Wilk normality test
##
## data: modelDat$E
## W = 0.95163, p-value = 0.3925
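One caveat: ks.test() above compares E against a standard normal distribution (mean 0, sd 1), which is partly why it flags a departure while Shapiro-Wilk does not. A sketch of comparing against a normal with the sample's own mean and standard deviation instead (note that estimating these from the same data makes the p-value approximate):
#Compare against a normal with the sample's estimated mean and sd
ks.test(modelDat$E, "pnorm", mean = mean(modelDat$E), sd = sd(modelDat$E))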
Do we have a big enough sample size? Well… Generally more data is better, but we only have what we have here.
Do the two groups have equal variances? This means that the variance is the same in the Media condition and in the Mtb condition. We can test this with an F-test.
res <- var.test(E ~ condition, data = modelDat)
res
##
## F test to compare two variances
##
## data: E by condition
## F = 0.48371, num df = 9, denom df = 9, p-value = 0.2943
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.1201468 1.9474147
## sample estimates:
## ratio of variances
## 0.4837103
Or with Bartlett's test.
bartlett.test(E ~ condition, data = modelDat)
##
## Bartlett test of homogeneity of variances
##
## data: E by condition
## Bartlett's K-squared = 1.1005, df = 1, p-value = 0.2942
A paired t-test is very similar to a regular t-test, except here we are going to incorporate the fact that each patient has both a Media sample and an Mtb infection sample. That is, one person generated both pieces of data.
Another example of paired data might be pre/post tests, where a person is given a test before and after some exposure to determine whether the exposure caused a difference in results. For example, we might give people a memory test, have them do an obstacle course to get physically worn out, then have them take the memory test again to see if physical tiredness causes a change in memory abilities.
Let’s look at a spaghetti plot of our data. These are plots that connect lines between data that is generated by the same person. These plots can be useful for looking at paired data, longitudinal data, or other repeated measures.
p <- modelDat %>%
  ggplot(aes(x = condition, y = E, group = ptID)) +
  geom_line() +
  xlab("Condition") +
  ylab(expression(paste('Log'[2], " Expression")))
p
Nice! We can see that there's some correlation in the data: people with higher gene expression in the Media condition tend to have higher gene expression in the Mtb condition.
To do the paired t-test, we will first need to modify our data.
modelDatPair <- modelDat %>%
  select(c(E, condition, ptID)) %>%
  pivot_wider(names_from = condition,
              values_from = E)
Let’s see what the resulting dataset looks like after that data manipulation. Is it what you expected?
modelDatPair
## # A tibble: 10 × 3
## ptID Media Mtb
## <chr> <dbl> <dbl>
## 1 pt01 -4.05 0.572
## 2 pt02 -0.846 3.39
## 3 pt03 -3.37 0.521
## 4 pt04 -1.00 4.94
## 5 pt05 -4.65 -2.01
## 6 pt06 -3.63 -1.12
## 7 pt07 -1.32 1.48
## 8 pt08 -2.56 -0.629
## 9 pt09 0.290 3.74
## 10 pt10 -0.524 4.30
Let’s find out what exactly the correlation is between the media and the Mtb conditions.
cor.test(modelDatPair$Media, modelDatPair$Mtb)
##
## Pearson's product-moment correlation
##
## data: modelDatPair$Media and modelDatPair$Mtb
## t = 5.2315, df = 8, p-value = 0.0007914
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5604480 0.9713172
## sample estimates:
## cor
## 0.8796646
Wow! Those are really correlated. Statistics is just taking data (numbers) and asking about relationships between those numbers. Since we know that this correlation is just due to the data being generated by the same person, and not a difference in gene expression caused by the condition, we want to do statistics that can take this correlation into account.
So let’s run the paired t-test with this data:
ptTestRes <- t.test(modelDatPair$Mtb,
                    modelDatPair$Media,
                    paired = TRUE)
ptTestRes
##
## Paired t-test
##
## data: modelDatPair$Mtb and modelDatPair$Media
## t = 9.3464, df = 9, p-value = 6.262e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.793367 4.577330
## sample estimates:
## mean of the differences
## 3.685349
Here, we see we get different results. We still get the t-score (9.3464), degrees of freedom (9), and p-value (<0.0001), but the values have changed. This is because we're now testing whether the difference between the matched pairs is 0. The estimate of the mean of the differences is 3.69, which is actually the same as the difference between the two means from before (Media = -2.17, Mtb = 1.52, difference = 3.69). This answer is more accurate because we know that the data came from the same people, not two independent samples.
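One way to see what the paired t-test is doing: it is equivalent to a one-sample t-test on the within-person differences. A quick sketch you can run to confirm this gives the same t, df, and p-value:
#A one-sample t-test on the per-patient differences matches the paired t-test
t.test(modelDatPair$Mtb - modelDatPair$Media, mu = 0)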
One major limitation of t-tests is that we can't adjust for any other information we know about the patients. We have a lot of other information in our targets dataset, including sex, age, and total sequences. What if these variables also have an impact on gene expression in each of the conditions? The t-test wouldn't be able to tell us that.
This is when we would want to use a linear model! Let’s start with a linear model not adjusting for other factors and see what we get.
lmMod <- lm(E ~ condition, data = modelDat)
summary(lmMod)
##
## Call:
## lm(formula = E ~ condition, data = modelDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5322 -1.5658 -0.2135 1.7005 3.4173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.1662 0.6659 -3.253 0.00442 **
## conditionMtb 3.6853 0.9417 3.913 0.00102 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.106 on 18 degrees of freedom
## Multiple R-squared: 0.4597, Adjusted R-squared: 0.4297
## F-statistic: 15.31 on 1 and 18 DF, p-value: 0.001019
Some of these numbers look awfully familiar. -2.17 is the mean gene expression for the Media condition, 3.69 is the difference between the means for the two conditions, and t = 3.91 is the same as for our unpaired t-test. The p-value of 0.00102 is slightly different because of how the degrees of freedom are calculated.
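In fact, this simple linear model is equivalent to a two-sample t-test that assumes equal variances in the two groups. As a sketch, rerunning t.test() with var.equal = TRUE should reproduce the conditionMtb row above, with df = 18 (the sign of t just depends on which group gets subtracted from which):
#A pooled (equal-variance) t-test matches the linear model
t.test(E ~ condition, data = modelDat, var.equal = TRUE)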
Now let’s add some other variables.
lmModBig <- lm(E ~ condition + age_dys + sex, data = modelDat)
summary(lmModBig)
##
## Call:
## lm(formula = E ~ condition + age_dys + sex, data = modelDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.357 -1.689 -0.254 1.531 3.550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.268e+00 2.194e+00 -1.033 0.31675
## conditionMtb 3.685e+00 9.962e-01 3.699 0.00195 **
## age_dys 3.918e-05 2.605e-04 0.150 0.88234
## sexM -3.600e-01 1.238e+00 -0.291 0.77503
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.228 on 16 degrees of freedom
## Multiple R-squared: 0.4625, Adjusted R-squared: 0.3618
## F-statistic: 4.59 on 3 and 16 DF, p-value: 0.01678
We see now that our numbers have changed and we have additional results! The main result of interest here is the p-value for condition, which is 0.00195. That means, adjusting for age and sex, gene expression is statistically significantly different by condition (Media vs Mtb infection).
Like the t-test, linear models come with assumptions. The first is linearity. This makes more sense when we're modeling two continuous variables, but essentially we want to make sure that a linear model is the correct model to use. To do that, we have to assume the relationship between the predictors and the outcome is linear.
When we model, we get a predicted value for each observation. We also have our observed values. The differences between the predicted values and the observed values are called the errors, or residuals. We want those errors to be normally distributed. If there is some pattern in the error terms, this suggests that our model might be missing something and might be incorrect. We can make a few plots to look at our errors.
First we create a new data frame and then plot a scatter plot of our results.
#Create a new data frame using our results
residDat <- data.frame(pred = lmModBig$fitted.values,
                       resid = lmModBig$residuals)
#Plot a scatterplot of our residuals vs our predicted values
p <- residDat %>% ggplot(aes(x = pred, y = resid)) +
  geom_point() +
  xlab("Predicted value") +
  ylab("Residual")
p
We see a gap in our predicted values. This is just because the condition makes a big difference in our predicted values. Otherwise it looks good.
Next, we’ll plot a histogram of the residuals. We are going to use base R for this plot.
#Plot a histogram of our residuals.
hist(residDat$resid, main = "Histogram of Residuals",
     xlab = "Residuals")
Looks pretty good! We want our histogram to look like a bell curve.
Lastly we’ll plot a Q-Q plot. We want this to look like a straight line with no major trends or deviations.
#Q-Q Plot
qqnorm(residDat$resid)
qqline(residDat$resid)
These plots check both the normality and the homoskedasticity of the error terms. If you just see randomness in the first plot, that's great! If you see a funnel pattern, that's a problem. In the Q-Q plot, if you see a generally straight line with points randomly above or below the line, that's great! If you see a curve (or a trend where most points are below the line, then above the line, then below the line again), that's a problem.
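As an aside, base R will generate standard diagnostic plots directly from a fitted lm object, so you don't have to build them by hand:
#Built-in diagnostics: 1 = residuals vs fitted, 2 = normal Q-Q
plot(lmModBig, which = 1:2)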
Linear models also assume independence. This means that each data point is independent of all the others; the results of one observation do not influence another observation. Here, this would mean that none of the people in our sample are related. If there were relatives in our data, we would expect them to have more similar results than two randomly selected people.
However, we actually break this assumption since we have paired data and linear models don’t actually take that pairing into account. We’ll talk more about that later.
How do we know if we really need to adjust for age and sex? One way is to use the AIC (Akaike Information Criterion). This is a measure of "goodness of fit", or how well the model fits the data. AIC takes into account the number of variables included in the model and the maximum likelihood, which is a measure of how well the model predicts the data. Lower AIC indicates better goodness-of-fit.
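(For reference, AIC = 2k - 2*log(L), where k is the number of estimated parameters and L is the maximized likelihood; the 2k term is the penalty for adding variables.)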
AIC(lmMod)
## [1] 90.43791
AIC(lmModBig)
## [1] 94.33237
We see that the AIC is lower for our smaller model. This means we might not actually be benefiting from adding those extra variables to our model. Always use your scientific reasoning though. If you know that there is a difference by age or sex, keep them in the model even if the AIC says otherwise. Be smarter than your data!
Another method to look at goodness-of-fit is BIC (Bayesian Information Criterion). This is similar to the AIC in that it takes into account the number of variables in the model and the maximum likelihood, and lower is also better. It is calculated slightly differently, however, so BIC values should not be compared to AIC values.
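(For reference, BIC = k*log(n) - 2*log(L), where n is the number of observations; since log(n) exceeds 2 for all but the smallest samples, BIC penalizes extra parameters more heavily than AIC.)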
BIC(lmMod)
## [1] 93.42511
BIC(lmModBig)
## [1] 99.31103
This agrees with AIC that the simple model is a better fit.
The issue with linear models is the same issue we had with a regular t-test: we're not using the fact that one person generated two pieces of data!
To do that, we will use a mixed effects model. Mixed effects models are similar to linear models, except we have a “random effect” that accounts for individual differences between people.
lmeMod <- lmer(E ~ condition + (1 | ptID), data = modelDat)
summary(lmeMod)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: E ~ condition + (1 | ptID)
## Data: modelDat
##
## REML criterion at convergence: 72.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.13492 -0.51308 0.06058 0.45472 1.52648
##
## Random effects:
## Groups Name Variance Std.Dev.
## ptID (Intercept) 3.6570 1.9123
## Residual 0.7774 0.8817
## Number of obs: 20, groups: ptID, 10
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) -2.1662 0.6659 10.7136 -3.253 0.00795 **
## conditionMtb 3.6853 0.3943 9.0000 9.346 6.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## conditinMtb -0.296
The notation here is similar to the model notation we've seen before, but we have a new term: (1 | ptID). This is the random effect. It tells the model which variable connects the repeated measures; here, we use the patient ID. Every row with the same patient ID value is data generated by the same person. This is how we can fit a linear model while taking into account the fact that one person generated two pieces of data.
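If you're curious what the random effect actually estimates, you can extract the per-patient intercepts from the fitted model:
#Per-patient random intercepts (deviations from the overall intercept)
ranef(lmeMod)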
Next, let’s add some other variables to our model!
lmeModBig <- lmer(E ~ condition + sex + age_dys + (1 | ptID), data = modelDat)
## Warning: Some predictor variables are on very different scales: consider
## rescaling
## Warning: Some predictor variables are on very different scales: consider
## rescaling
summary(lmeModBig)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: E ~ condition + sex + age_dys + (1 | ptID)
## Data: modelDat
##
## REML criterion at convergence: 83.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.09476 -0.52511 0.05475 0.45468 1.48340
##
## Random effects:
## Groups Name Variance Std.Dev.
## ptID (Intercept) 4.7828 2.1870
## Residual 0.7774 0.8817
## Number of obs: 20, groups: ptID, 10
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) -2.268e+00 3.091e+00 7.057e+00 -0.734 0.487
## conditionMtb 3.685e+00 3.943e-01 9.000e+00 9.346 6.26e-06 ***
## sexM -3.600e-01 1.788e+00 7.000e+00 -0.201 0.846
## age_dys 3.918e-05 3.761e-04 7.000e+00 0.104 0.920
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) cndtnM sexM
## conditinMtb -0.064
## sexM 0.121 0.000
## age_dys -0.903 0.000 -0.479
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
It looks like our effect estimate for condition didn’t change all that much by adding age and sex. We might not need to incorporate those variables, but we’ll look at that in the next section.
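As an aside, the "very different scales" warning appears because age_dys is measured in days, so its values are enormous next to the 0/1 indicators for condition and sex. A sketch of one fix, assuming age_dys really is age in days (the names modelDatScaled and age_yrs are just for illustration), is to convert age to years before fitting; the condition estimate should be unchanged:
#Rescale age from days to years so predictors are on comparable scales
modelDatScaled <- modelDat %>%
  mutate(age_yrs = age_dys / 365.25)
lmeModBig2 <- lmer(E ~ condition + sex + age_yrs + (1 | ptID), data = modelDatScaled)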
We can use the same tools to look at goodness-of-fit with our mixed effects model as we did for our linear model.
First, we’ll look at AIC.
AIC(lmeMod)
## [1] 80.23768
AIC(lmeModBig)
## [1] 95.6416
Once again, the smaller model has a lower AIC and thus fits our data better.
Next, let’s look at BIC.
BIC(lmeMod)
## [1] 84.22061
BIC(lmeModBig)
## [1] 101.616
We see the same trend where the smaller model is a better fit.
Now, let's do something a little funky. Let's see which fits our data better: our mixed effects model or our linear model.
AIC(lmMod)
## [1] 90.43791
AIC(lmeMod)
## [1] 80.23768
Looks like our mixed effects model is a better fit, based on the smaller AIC! However, we also know that our data is paired, so even if the AIC were higher for the mixed effects model, we should still incorporate the random effect. This is another case of using our scientific reasoning over the statistical tests.
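One caveat worth knowing: lmer() fits by REML by default, and a REML criterion is not strictly comparable to the ordinary likelihood that lm() maximizes. A more careful version of this comparison, as a sketch (lmeModML is just an illustrative name), refits the mixed model by maximum likelihood first:
#Refit by maximum likelihood so the AIC is on the same footing as lm()
lmeModML <- lmer(E ~ condition + (1 | ptID), data = modelDat, REML = FALSE)
AIC(lmMod)
AIC(lmeModML)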
In this section we covered t-tests, paired t-tests, linear models, and mixed effects models. That’s a lot of content in a short amount of time!
If you hope to be doing more of these analyses, we highly recommend taking a formal statistics class. In such a class you will learn more about how to interpret model results (which we didn’t even cover) and how to model different types of data.
Perhaps an even better option is collaborating with a statistician or biostatistician. We are friendly folk who are passionate about helping scientists. We prefer to get involved in projects early on, to help design your study to be as efficient and useful as possible. This keeps us from being the bearers of bad news later on, when we find out that your data can't answer your question.