Review of multiple linear regression

Prof. Maria Tackett

Jan 17, 2024

Announcements

  • HW 01 due Wed, Jan 24 at 11:59pm

    • Released Thursday morning
  • Labs start Thursday

  • Check website for office hours schedule

  • Added a Slack channel #internships-research-opportunities (optional)

Computing set up

library(tidyverse)
library(tidymodels)
library(GGally)
library(knitr)
library(patchwork)
library(viridis)

ggplot2::theme_set(ggplot2::theme_bw(base_size = 16))

colors <- tibble::tibble(green = "#B5BA72")

Topics

  • Define statistical models

  • Motivate generalized linear models and multilevel models

  • Review multiple linear regression

Notes based on Chapter 1 of Roback and Legler (2021) unless noted otherwise.

Statistical models

Models and statistical models

Suppose we have observations y1,…,yn

  • Model: Mathematical description of the process we think generates the observations

  • Statistical model: Model that includes an equation describing the impact of explanatory variables (deterministic part) and probability distributions for parts of the process that are assumed to be random variation (random part)

Definitions adapted from Stroup (2012)

Statistical models

A statistical model must include

  1. The observations
  2. The deterministic (systematic) part of the process
  3. The random part of the process with a statement of the presumed probability distribution

Definitions adapted from Stroup (2012)

Example

Suppose we have the model for comparing two means:

yij=μi+ϵij

where

  • i=1,2: group
  • j=1,…,n: observation number
  • ni: number of observations in group i
  • μi: mean of group i
  • yij: jth observation in the ith group
  • ϵij : random error (variation) associated with ijth observation

Adapted from Stroup (2012)

Example

yij=μi+ϵij

  • yij: the observations

  • μi: deterministic part of the model, no random variability

  • eij : random part of the model, indicating observations vary about their mean

  • Typically assume ϵij are independent and identically distributed (i.i.d.) N(0,σ2)

Adapted from Stroup (2012)

Practice

Suppose yij’s are observed outcome data and xi’s are values of the explanatory variable. Assume a linear regression model can used to describe the process of generating yij based on the xi.

  1. Write the specification of the statistical model.
  2. Label the 3 components of the model equation (observation, deterministic part, random part)
  3. What is E(yij), the expected value of the observations?

Motivating generalized linear models (GLMs) and multilevel models

Assumptions for linear regression

  • Linearity: Linear relationship between mean response and predictor variable(s)

  • Independence: Residuals are independent. There is no connection between how far any two points lie above or below regression line.

  • Normality: Response follows a normal distribution at each level of the predictor (or combination of predictors)

  • Equal variance: Variability (variance or standard deviation) of the response is equal for all levels of the predictor (or combination of predictors)

Assumptions for linear regression

Modified from Figure 1.1. in BMLR
  • Linearity: Linear relationship between mean of the response Y and the predictor X

  • Independence: No connection between how far any two points lie from regression line

  • Normality: Response Y follows a normal distribution at each level of the predictor X (red curves)

  • Equal variance: Variance of the response Y is equal for all levels of the predictor X

Violations in assumptions

Do wealthy households tend to have fewer children compared to households with lower income? Annual income and family size are recorded for a random sample of households.

  • The response variable is number of children in the household.
  • The predictor variable is annual income in US dollars.

Which assumption(s) are obviously violated, if any?

Violations in assumptions

Medical researchers investigated the outcome of a particular surgery for patients with comparable stages of disease but different ages. The 10 hospitals in the study had at least two surgeons performing the surgery of interest. Patients were randomly selected for each surgeon at each hospital. The surgery outcome was recorded on a scale of 1 - 10.

  • The response variable is surgery outcome, 1 - 10.
  • The predictor variable is patient age in years.

Which assumption(s) are obviously violated, if any?

Beyond linear regression

  • When drawing conclusions from linear regression models, we do so assuming LINE are all met

  • Generalized linear models require different assumptions and can accommodate violations in LINE

    • Relationship between response and predictor(s) can be nonlinear
    • Response variable can be non-normal
    • Variance in response can differ at each level of predictor(s)
    • The independence assumption still must hold!
  • Multilevel models are used to model data that violate the independence assumption, i.e. correlated observations

Multiple linear regression

Data: Kentucky Derby Winners

Today’s data is from the Kentucky Derby, an annual 1.25-mile horse race held at the Churchill Downs race track in Louisville, KY. The data is in the file derbyplus.csv and contains information for races 1896 - 2017.

Response variable

  • speed: Average speed of the winner in feet per second (ft/s)

Additional variable

  • winner: Winning horse

Predictor variables

  • year: Year of the race
  • condition: Condition of the track (good, fast, slow)
  • starters: Number of horses who raced

Goal: Understand variability in average winner speed based on characteristics of the race.

Data

derby <- read_csv("data/derbyplus.csv")
derby |>
  head(5) |> kable()
year winner condition speed starters
1896 Ben Brush good 51.66 8
1897 Typhoon II slow 49.81 6
1898 Plaudit good 51.16 4
1899 Manuel fast 50.00 5
1900 Lieut. Gibson fast 52.28 7

Data science workflow

Image source: Wickham, Çetinkaya-Rundel, and Grolemund (2023)

Exploratory data analysis (EDA)

  • Once you’re ready for the statistical analysis, the first step should always be exploratory data analysis.

  • The EDA will help you

    • begin to understand the variables and observations
    • identify outliers or potential data entry errors
    • begin to see relationships between variables
    • identify the appropriate model and identify a strategy
  • The EDA is exploratory; formal modeling and statistical inference are used to draw conclusions.

Univariate EDA

Univariate EDA code

p1 <- ggplot(data = derby, aes(x = speed)) + 
  geom_histogram(fill = colors$green, color = "black") + 
  labs(x = "Winning speed (ft/s)", y = "Count")

p2 <- ggplot(data = derby, aes(x = starters)) + 
  geom_histogram(fill = colors$green, color = "black",
                 binwidth = 2) + 
  labs(x = "Starters", y = "Count")

p3 <- ggplot(data = derby, aes(x = condition)) +
   geom_bar(fill = colors$green, color = "black", aes(x = ))

p1 + (p2 / p3) + 
  plot_annotation(title = "Univariate data analysis")

Bivariate EDA

Bivariate EDA code

p4 <- ggplot(data = derby, aes(x = starters, y = speed)) + 
  geom_point() + 
  labs(x = "Starters", y = "Speed (ft / s)")

p5 <- ggplot(data = derby, aes(x = year, y = speed)) + 
  geom_point() + 
  labs(x = "Year", y = "Speed (ft / s)")

p6 <- ggplot(data = derby, aes(x = condition, y = speed)) + 
  geom_boxplot(fill = colors$green, color = "black") + 
  labs(x = "Conditions", y = "Speed (ft / s)")

(p4 + p5) + p6 +
  plot_annotation(title = "Bivariate data analysis")

Scatterplot matrix

A scatterplot matrix helps quickly visualize relationships between many variable pairs. They are particularly useful to identify potentially correlated predictors.

Scatterplot matrix code

Create using the ggpairs() function in the GGally package.

library(GGally)
ggpairs(data = derby, 
        columns = c("condition", "year", "starters", "speed"))

Multivariate EDA

Plot the relationship between the response and a predictor based on levels of another predictor to assess potential interactions.

Multivariate EDA code

library(viridis)
ggplot(data = derby, aes(x = year, y = speed, color = condition, 
                         shape = condition, linetype = condition)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE, aes(linetype = condition)) + 
  labs(x = "Year", y = "Speed (ft/s)", color = "Condition",
       title = "Speed vs. year", 
       subtitle = "by track condition") +
  guides(lty = FALSE, shape = FALSE) +
  scale_color_viridis_d(end = 0.9)

Candidate models

Model 1: Main effects model (year, condition, starters)

model1 <- lm(speed ~ starters + year + condition, data = derby)


Model 2: Main effects + year2, the quadratic effect of year

model2 <- lm(speed ~ starters + year + I(year^2) + condition,
             data = derby)


Model 3: Main effects + interaction between year and condition

model3 <- lm(speed ~ starters + year + condition + year * condition, 
             data = derby)

Application exercise

📋 sta310-sp24.netlify.app/ae/ae-02-mlr-review

Inference for regression

Use statistical inference to

  • Evaluate if predictors are statistically significant (not necessarily practically significant!)

  • Quantify uncertainty in coefficient estimates

  • Quantify uncertainty in model predictions

If LINE assumptions are met, we can use inferential methods based on mathematical models. If at least linearity and independence are met, we can use simulation-based inference methods.

Inference for regression

When LINE assumptions are met… . . .

  • Use least squares regression to obtain the estimates for the model coefficients β0,β1,…,βj and for σ2

  • σ^ is the regression standard error

    σ^=∑i=1n(yi−y^i)2n−p−1=∑i=1nei2n−p−1

    where p is the number of non-intercept terms in the model (e.g., p=1 in simple linear regression)

  • Goal is to use estimated values to draw conclusions about βj

    • Use σ^ to calculate SEβ^j . Click here for more detail.

Hypothesis testing for βj

  1. State the hypotheses. H0:βj=0 vs. Ha:βj≠0, given the other variables in the model.
  1. Calculate the test statistic.

t=β^j−0SEβ^j

  1. Calculate the p-value. The p-value is calculated from a t distribution with n−p−1 degrees of freedom.

    p-value=2P(T>|t|)T∼tn−p−1

  1. State the conclusion in context of the data.
    • Reject H0 if p-value is sufficiently small.

Confidence interval for βj

The C% confidence confidence interval for βj is

β^j±t∗×SEβ^j

where the critical value t∗∼tn−p−1


General interpretation for the confidence interval [LB, UB]:

We are C% confident that for every one unit increase in xj, the response is expected to change by LB to UB units, holding all else constant.

Application exercise

📋 sta310-sp24.netlify.app/ae/ae-02-mlr-review

Measures of model performance

  • R2: Proportion of variability in the response explained by the model

    • Will always increase as predictors are added, so it shouldn’t be used to compare models
  • Adj.R2: Similar to R2 with a penalty for extra terms

  • AIC: Likelihood-based approach balancing model performance and complexity

  • BIC: Similar to AIC with stronger penalty for extra terms

Model summary statistics

Use the glance() function to get model summary statistics


model r.squared adj.r.squared AIC BIC
Model1 0.730 0.721 259.478 276.302
Model2 0.827 0.819 207.429 227.057
Model3 0.751 0.738 253.584 276.016

Which model do you choose based on these statistics?

Characteristics of a “good” final model

  • Model can be used to answer primary research questions

  • Predictor variables control for important covariates

  • Potential interactions have been investigated

  • Variables are centered, as needed, for more meaningful interpretations

  • Unnecessary terms are removed

  • Assumptions are met and influential points have been addressed

  • Model tells a “persuasive story parsimoniously”

List from Section 1.6.7 of Roback and Legler (2021)

Next class

  • Using likelihoods

  • See Prepare for Jan 22 lecture

References

Roback, Paul, and Julie Legler. 2021. Beyond multiple linear regression: applied generalized linear models and multilevel models in R. CRC Press.
Stroup, Walter W. 2012. Generalized Linear Mixed Models: Modern Concepts, Methods and Applications. CRC press.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. " O’Reilly Media, Inc.".

🔗 STA 310 - Spring 2024

1 / 41
Review of multiple linear regression Prof. Maria Tackett Jan 17, 2024

  1. Slides

  2. Tools

  3. Close
  • Review of multiple linear regression
  • Announcements
  • Computing set up
  • Topics
  • Statistical models
  • Models and statistical models
  • Statistical models
  • Example
  • Example
  • Practice
  • Motivating generalized linear models (GLMs) and multilevel models
  • Assumptions for linear regression
  • Assumptions for linear regression
  • Violations in assumptions
  • Violations in assumptions
  • Beyond linear regression
  • Multiple linear regression
  • Data: Kentucky Derby Winners
  • Data
  • Data science workflow
  • Exploratory data analysis (EDA)
  • Univariate EDA
  • Univariate EDA code
  • Bivariate EDA
  • Bivariate EDA code
  • Scatterplot matrix
  • Scatterplot matrix code
  • Multivariate EDA
  • Multivariate EDA code
  • Candidate models
  • Application exercise
  • Inference for regression
  • Inference for regression
  • Hypothesis testing for βj
  • Confidence interval for βj
  • Application exercise
  • Measures of model performance
  • Model summary statistics
  • Characteristics of a “good” final model
  • Next class
  • References
  • f Fullscreen
  • s Speaker View
  • o Slide Overview
  • e PDF Export Mode
  • b Toggle Chalkboard
  • c Toggle Notes Canvas
  • d Download Drawings
  • ? Keyboard Help