1 Intro to linear modeling

Overview

Load packages

Load the following packages. For more information on installation, see the setup instructions.

#Data manipulation and plotting
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
#Linear modeling
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
#Get p-values with random effects model
library(lmerTest)
## 
## Attaching package: 'lmerTest'
## 
## The following object is masked from 'package:lme4':
## 
##     lmer
## 
## The following object is masked from 'package:stats':
## 
##     step
#Clean up model results
library(broom)

Load data

For this section, we’re going to focus on just one gene, IFNG. Create the data with the following.

modelDat <- data.frame(
  #Patient IDs, repeated once per condition
  ptID = rep(paste0("pt", 1:10), 2),
  #Log2 expression values for IFNG
  E = c(-4.05,-0.84,-3.37,-1.00,-4.65,-3.62,-1.32,-2.55, 0.29,-0.52,
         0.57, 3.39, 0.52, 4.93,-2.01,-1.11, 1.48,-0.62, 3.74, 4.30),
  #First 10 rows are Media, last 10 are Mtb
  condition = c(rep("Media",10), rep("Mtb",10)),
  #Recycled so that patients alternate M, F
  sex = c("M","F")
)

Awesome, we have our data! Let’s get to the fun part - statistics!

Introduction to (statistical) sampling

In research, we want to try to learn about a target population. However, we usually can’t get data about the entire population, so we take a sample and get data about the sample. Then, using the data about the sample, we want to make inferences about the entire population.

For example, let’s say we wanted to know if people who play video games more than 10 hours a week spend more or less money on electronics than those that play video games less than 10 hours a week. Since we can’t ask everyone about their gaming and spending habits, we would ask a select few (a sample) about their gaming and spending habits. Then we would use statistics to determine the probability or likelihood that our population behaves a certain way based on that sample.

T-tests

T-tests compare the means of two groups. To do a t-test, you need a binary variable and a continuous variable. The test then examines whether the average of the continuous variable is the same in each group defined by the binary variable. Using the example above, the binary variable would be whether the person plays video games more or less than 10 hours a week, and the continuous variable would be the amount spent on electronics.

In our data here, we will be looking at gene expression levels (continuous) and compare Media vs Mtb infection (binary).

Let’s first look at the distribution of the data.

p <- ggplot(modelDat, aes(x=condition, y=E)) + 
  geom_boxplot() + 
  geom_jitter(position=position_jitter(0.2)) +
  ylab(expression(paste('Log'[2], " Expression")))
p

geom_boxplot draws the box plots and geom_jitter adds the individual data points, scattering them slightly so overlapping points are easier to see.

We can see that the expression levels are lower in the media condition than in the Mtb condition. Now let’s do a t-test to see if that difference is statistically significant.

ttestRes<-t.test(E ~ condition, data = modelDat)
ttestRes
## 
##  Welch Two Sample t-test
## 
## data:  E by condition
## t = -3.9139, df = 16.066, p-value = 0.001228
## alternative hypothesis: true difference in means between group Media and group Mtb is not equal to 0
## 95 percent confidence interval:
##  -5.675613 -1.688387
## sample estimates:
## mean in group Media   mean in group Mtb 
##              -2.163               1.519

The notation here is that the continuous variable goes before the ~, the binary variable goes after it, and the data argument specifies where the variables are found.

From the results we get the t-statistic (t = -3.91), the degrees of freedom (df = 16.07), the p-value (p = 0.0012), as well as estimates of the mean for each group (Media = -2.163, Mtb = 1.519).

These results suggest that there is a statistically significant difference in gene expression for the gene IFNG between the Media and the Mtb infection conditions. (In statistics language: We reject the null hypothesis that the mean gene expression is equal in the Media and Mtb conditions)
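Since we loaded broom to clean up model results, here’s a quick sketch of tidying the t-test output into a one-row data frame, which is handy when you run many tests and want to stack the results:

#Collect the t-test results into a tidy one-row data frame
tidy(ttestRes)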

T-test assumptions

1. Data is continuous

Easy! Our data is a log2-transformed measure of gene expression, which is definitely continuous.

2. Data is collected from a simple random sample

This would be a question for the researchers who collected the sample. However, I think it is fair to assume here.

3. The data is normally distributed

This is what we would check with our boxplot. We could also look at a violin plot.

p <- ggplot(modelDat, aes(x=condition, y=E)) + 
  geom_violin() + 
  geom_jitter(position=position_jitter(0.2)) +
  ylab(expression(paste('Log'[2], " Expression")))
p

We don’t have a ton of data here so these plots look a little odd, but this could be useful if you have more data.

There are also statistical tests for normality. One is the Kolmogorov-Smirnov test, which compares your data to a reference distribution and is very sensitive to departures from it. It can also be used to test against other distributions (the F-distribution, etc.). The other is the Shapiro-Wilk test, which only tests for normality.

ks.test(modelDat$E, "pnorm")
## 
##  Exact one-sample Kolmogorov-Smirnov test
## 
## data:  modelDat$E
## D = 0.29955, p-value = 0.04357
## alternative hypothesis: two-sided
shapiro.test(modelDat$E)
## 
##  Shapiro-Wilk normality test
## 
## data:  modelDat$E
## W = 0.9517, p-value = 0.3936
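Note that ks.test(modelDat$E, "pnorm") with no further arguments compares our data to a standard normal (mean 0, sd 1), which is largely why it rejects here while Shapiro-Wilk does not. As a sketch, you can supply the sample mean and standard deviation instead, though estimating them from the same data makes the resulting p-value only approximate:

#Compare against a normal with the sample mean and SD instead of N(0,1)
ks.test(modelDat$E, "pnorm",
        mean = mean(modelDat$E),
        sd = sd(modelDat$E))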

4. The sample is “reasonably large”

Well… Generally more data is better, but we only have what we have here.

5. Homogeneity of variance

This means that the variance is the same in the Media condition and in the Mtb condition. We can test this with an F-test. A large p-value (like the one below) gives no evidence that the variances differ, so this assumption looks reasonable.

res <- var.test(E ~ condition, data = modelDat)
res
## 
##  F test to compare two variances
## 
## data:  E by condition
## F = 0.48479, num df = 9, denom df = 9, p-value = 0.2958
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1204144 1.9517527
## sample estimates:
## ratio of variances 
##          0.4847878

Paired t-test

A paired t-test is very similar to a regular t-test, except here we are going to be incorporating the fact that each patient has a Media sample and a Mtb infection sample. That is, one person generated both pieces of data.

Another example of paired data might be pre/post tests where a person is given a test prior to some exposure and after some exposure to determine if the exposure caused a difference in test results. For example, we might give people a memory test, then have them do an obstacle course and get physically worn out, then have them do the memory test again to see if physical tiredness causes a change in memory abilities.

Let’s look at a spaghetti plot of our data. These plots connect the data points generated by the same person with a line. They can be useful for looking at paired data, longitudinal data, or other repeated measures.

p <- modelDat %>%
  ggplot(aes(x = condition, y = E, group = ptID)) +
  geom_line() +
  ylab(expression(paste('Log'[2], " Expression")))
p

Nice! We can see that the paired measurements are correlated. People with higher gene expression in the Media condition tend to have higher gene expression in the Mtb condition.

To do the paired t-test, we will first need to modify our data.

#Reshape to one row per patient, with one column of expression values per condition
modelDatPair <- modelDat %>% 
  pivot_wider(names_from = condition,
              values_from = E)

Let’s see what the resulting dataset looks like after that data manipulation. Is it what you expected?

modelDatPair
## # A tibble: 10 × 4
##    ptID  sex   Media   Mtb
##    <chr> <chr> <dbl> <dbl>
##  1 pt1   M     -4.05  0.57
##  2 pt2   F     -0.84  3.39
##  3 pt3   M     -3.37  0.52
##  4 pt4   F     -1     4.93
##  5 pt5   M     -4.65 -2.01
##  6 pt6   F     -3.62 -1.11
##  7 pt7   M     -1.32  1.48
##  8 pt8   F     -2.55 -0.62
##  9 pt9   M      0.29  3.74
## 10 pt10  F     -0.52  4.3

So let’s run the paired t-test with this data:

ptTestRes<-t.test(modelDatPair$Mtb,
                  modelDatPair$Media, 
                  paired = TRUE)
ptTestRes
## 
##  Paired t-test
## 
## data:  modelDatPair$Mtb and modelDatPair$Media
## t = 9.3581, df = 9, p-value = 6.198e-06
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  2.791945 4.572055
## sample estimates:
## mean difference 
##           3.682

Here, we see we get different results. We still get the t-score (9.358), degrees of freedom (9), and p-value (6.2e-06), but the values have changed. This is because we’re now testing whether the difference between the matched pairs is 0. The estimate of the mean of the differences is 3.682, which is actually the same as the difference between the two means from before (Media = -2.163, Mtb = 1.519, difference = 3.682). This answer is more accurate because we know that the data came from the same person and not from two independent samples.
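As a quick check of that interpretation, a paired t-test is equivalent to a one-sample t-test on the per-patient differences; this sketch should reproduce the same t, df, and p-value:

#A paired t-test is a one-sample t-test on the within-patient differences
t.test(modelDatPair$Mtb - modelDatPair$Media, mu = 0)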

Linear models

One major limitation of t-tests is that we can’t adjust for any other information we know about the patients. You may have a lot of other information in your dataset, including sex, age, total sequences, etc. What if these variables also have an impact on gene expression in each of the conditions? The t-test wouldn’t be able to tell us that.

This is when we would want to use a linear model! Let’s start with a linear model not adjusting for other factors and see what we get.

lmMod<-lm(E ~ condition, data = modelDat)
summary(lmMod)
## 
## Call:
## lm(formula = E ~ condition, data = modelDat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.529 -1.565 -0.213  1.700  3.411 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -2.1630     0.6652  -3.252  0.00443 **
## conditionMtb   3.6820     0.9407   3.914  0.00102 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.104 on 18 degrees of freedom
## Multiple R-squared:  0.4598, Adjusted R-squared:  0.4298 
## F-statistic: 15.32 on 1 and 18 DF,  p-value: 0.001017

Some of these numbers look awfully familiar. -2.163 is the mean gene expression for the Media condition, 3.682 is the difference between the means for the two conditions, and t = 3.914 matches our unpaired t-test (the sign is flipped because the model estimates the Mtb effect relative to Media). The p-value of 0.00102 is slightly different because the default t.test applies the Welch correction, which calculates the degrees of freedom differently.
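In fact, a two-sample t-test that assumes equal variances uses the same pooled degrees of freedom as the linear model. As a quick sketch, this should reproduce the lm results exactly:

#Equal-variance t-test matches lm with a single binary predictor
t.test(E ~ condition, data = modelDat, var.equal = TRUE)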

Now let’s add another variable that may affect gene expression.

lmModBig<-lm(E ~ condition + sex, data = modelDat)
summary(lmModBig)
## 
## Call:
## lm(formula = E ~ condition + sex, data = modelDat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.187 -1.479  0.064  1.335  3.011 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -1.6050     0.8049  -1.994  0.06244 . 
## conditionMtb   3.6820     0.9294   3.962  0.00101 **
## sexM          -1.1160     0.9294  -1.201  0.24630   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.078 on 17 degrees of freedom
## Multiple R-squared:  0.502,  Adjusted R-squared:  0.4434 
## F-statistic: 8.568 on 2 and 17 DF,  p-value: 0.002669

We see now that our numbers have changed and we have additional results! The main result of interest here is the p-value for condition, which is 0.00101. That means, adjusting for sex, gene expression is statistically significantly different by condition (Media vs Mtb infection).
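This is another place where broom comes in handy. As a sketch, tidy() pulls the coefficient table into a data frame that is easy to filter or combine across models:

#Extract the coefficient table as a tidy data frame
tidy(lmModBig)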

Linear Modeling Assumptions

1. The true relationship is linear

This makes more sense when we’re modeling two continuous variables, but essentially we want to make sure that a linear model is the correct model to be using. To do that, we have to assume the relationship between the predictors and the outcome is linear.

2. Errors are normally distributed

3. Homoscedasticity of errors (or, equal variance around the line).

When we model, we get a predicted value for each observation, and we also have our observed values. The differences between the predicted and observed values are called the errors or residuals. We want those errors to be normally distributed. If there is some pattern in the error terms, it suggests that our model is missing some structure in the data and might be incorrect. We can make a few plots to look at our errors.

First we create a new data frame and then plot a scatter plot of our results.

#Create a new data frame using our results
residDat<- data.frame(pred = lmModBig$fitted.values,
                      resid = lmModBig$residuals)

#Plot a scatterplot of our residuals vs our predicted values
p<-residDat %>% ggplot(aes(x = pred, y = resid)) +
  geom_point() +
  xlab("Predicted value") +
  ylab("Residual") 
p

We see a gap in the predicted values. This is just because condition makes a big difference in the predictions. Otherwise it looks good.

Next, we’ll plot a histogram of the residuals. We are going to use base R for this plot.

#Plot a histogram of our residuals.
hist(residDat$resid, main="Histogram of Residuals",
 xlab="Residuals")

Looks pretty good! We want our histogram to look like a bell curve.

Lastly we’ll plot a Q-Q plot. We want this to look like a straight line with no major trends or deviations.

#Q-Q Plot
qqnorm(residDat$resid)
qqline(residDat$resid)

These plots check both the normality and homoscedasticity of the error terms. If you just see randomness in the first plot, that’s great! If you see a funnel pattern, that’s a problem. In the Q-Q plot, if you see a generally straight line with points randomly above or below the line, that’s great! If you see a curve (or a trend where most points fall below the line, then above the line, then below the line again), that’s a problem.
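Base R can also produce a standard set of diagnostic plots directly from the fitted model, covering the residuals-vs-fitted and Q-Q checks above in one step:

#Standard diagnostic plots for a fitted lm: residuals vs fitted,
#Q-Q plot, scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(lmModBig)
par(mfrow = c(1, 1))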

4. Independence of the observations

This means that each data point is independent of all others. The results of one observation do not influence another observation. Here, this would mean that none of the people in our sample are related. If there were relatives in our data, then we would expect them to have more similar results than two randomly selected people.

However, we actually break this assumption since we have paired data and linear models don’t actually take that pairing into account. We’ll talk more about that later.

Goodness-of-Fit

How do we know if we really need to adjust for sex? One way is to use the AIC (Akaike Information Criterion). This is a measure of “goodness of fit”, or how well the model fits the data. AIC takes into account both the number of variables included in the model and the maximum likelihood, which is a measure of how well the model predicts the data. Lower AIC indicates better goodness-of-fit.

AIC(lmMod)
## [1] 90.39546
AIC(lmModBig)
## [1] 90.76726

We see that the AIC is lower for our smaller model. This means we might not actually be benefiting from adding sex to our model. Always use your scientific reasoning though. If you know that there is a difference by other variables, keep them in the model even if the AIC says otherwise. Be smarter than your data!
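Concretely, AIC = 2k - 2 * log-likelihood, where k is the number of estimated parameters. As a sketch, we can reproduce AIC(lmMod) by hand:

#AIC = 2k - 2*log-likelihood; k is stored in the "df" attribute of logLik
ll <- logLik(lmMod)
2 * attr(ll, "df") - 2 * as.numeric(ll)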

Another method for assessing goodness-of-fit is BIC (Bayesian Information Criterion). It is similar to AIC in that it takes into account the number of variables in the model and the maximum likelihood, and lower is also better. It is calculated slightly differently, however, so BIC values should not be compared to AIC values.

BIC(lmMod)
## [1] 93.38266
BIC(lmModBig)
## [1] 94.75019

This agrees with AIC that the simple model is a better fit.

Linear mixed effects models

The issue with linear models is the same issue we had with the regular t-test: we’re not using the fact that one person generated two pieces of data!

To do that, we will use a mixed effects model. Mixed effects models are similar to linear models, except we have a “random effect” that accounts for individual differences between people.

lmeMod<-lmer(E ~ condition + (1 | ptID), data = modelDat)
summary(lmeMod)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: E ~ condition + (1 | ptID)
##    Data: modelDat
## 
## REML criterion at convergence: 72.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.1333 -0.5150  0.0604  0.4540  1.5267 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ptID     (Intercept) 3.651    1.9107  
##  Residual             0.774    0.8798  
## Number of obs: 20, groups:  ptID, 10
## 
## Fixed effects:
##              Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)   -2.1630     0.6652 10.7095  -3.252  0.00797 ** 
## conditionMtb   3.6820     0.3935  9.0000   9.358  6.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## conditinMtb -0.296

The notation here is similar to the model notation we’ve seen before, but we have a new term. The (1 | ptID) is the random effect. It tells the model which variable connects the repeated measures; here, we use the patient ID. Every row with the same patient ID value is data generated by the same person. This is how we can fit a linear model while taking into account the fact that one person generated two pieces of data.
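To see what the random effect actually estimates, a quick sketch: ranef() returns each patient’s estimated shift away from the overall intercept.

#Per-patient deviations from the overall intercept
ranef(lmeMod)$ptID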

Next, let’s add some other variables to our model!

lmeModBig<-lmer(E ~ condition + sex + (1 | ptID), data = modelDat)
summary(lmeModBig)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: E ~ condition + sex + (1 | ptID)
##    Data: modelDat
## 
## REML criterion at convergence: 69.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.1885 -0.5097  0.1176  0.4394  1.4607 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ptID     (Intercept) 3.766    1.9407  
##  Residual             0.774    0.8798  
## Number of obs: 20, groups:  ptID, 10
## 
## Fixed effects:
##              Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)   -1.6050     0.9324  8.7459  -1.721    0.120    
## conditionMtb   3.6820     0.3935  9.0000   9.358  6.2e-06 ***
## sexM          -1.1160     1.2889  8.0000  -0.866    0.412    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) cndtnM
## conditinMtb -0.211       
## sexM        -0.691  0.000

It looks like our effect estimate for condition didn’t change all that much when we added sex. We might not need to incorporate that variable, but we’ll look at that in the next section.

Goodness-of-Fit

We can use the same tools to look at goodness-of-fit with our mixed effects model as we did for our linear model.

First, we’ll look at AIC.

AIC(lmeMod)
## [1] 80.18157
AIC(lmeModBig)
## [1] 79.08994

In contrast to previous models, the larger model has a lower AIC and thus fits our data better. This supports including sex in our model, though the AIC difference is small enough (< 2) that either model is reasonable.

Next, let’s look at BIC.

BIC(lmeMod)
## [1] 84.1645
BIC(lmeModBig)
## [1] 84.0686

We see the same trend where the large model is a slightly better fit.

Now, let’s do something a little funky. Let’s see which fits our data better: our mixed effects model or our linear model.

AIC(lmMod)
## [1] 90.39546
AIC(lmeMod)
## [1] 80.18157

Looks like our mixed effects model is a better fit based on the smaller AIC! However, we also know that our data is paired, so even if the AIC were higher for the mixed effects model, we should still incorporate the random effect. This is another case of using our scientific reasoning over the statistical tests.
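One caveat worth knowing: lmer fits by REML by default, and a REML-based AIC is not strictly comparable to the maximum-likelihood AIC that lm reports. As a sketch, a more careful comparison refits the mixed model by maximum likelihood first:

#Refit by maximum likelihood (REML = FALSE) for a fair AIC comparison
lmeModML <- lmer(E ~ condition + (1 | ptID), data = modelDat, REML = FALSE)
AIC(lmMod)
AIC(lmeModML)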

Wrap up

In this section we covered t-tests, paired t-tests, linear models, and mixed effects models. That’s a lot of content in a short amount of time!

If you hope to be doing more of these analyses, we highly recommend taking a formal statistics class. In such a class you will learn more about how to interpret model results (which we didn’t even cover) and how to model different types of data.

Perhaps an even better option is collaborating with a statistician or biostatistician. We are friendly folk and are passionate about helping scientists. We prefer to get involved in projects early on to help design your study to be as efficient and useful as possible. This keeps us from being the bearers of bad news later on, when we find out that your data can’t answer your question.