Review of multiple linear regression

Prof. Maria Tackett

Jan 17, 2024

Computing set up

library(tidyverse)
library(tidymodels)
library(GGally)
library(knitr)
library(patchwork)
library(viridis)

ggplot2::theme_set(ggplot2::theme_bw(base_size = 16))

colors <- tibble::tibble(green = "#B5BA72")

Assumptions for linear regression

Linearity: Linear relationship between the mean of the response $Y$ and the predictor $X$
Independence: No connection between how far any two points lie from the regression line
Normality: Response $Y$ follows a normal distribution at each level of the predictor $X$ (the red curves in the accompanying figure)
Equal variance: Variance of the response $Y$ is equal for all levels of the predictor $X$
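
The linearity, normality, and equal variance assumptions are commonly checked with residual plots. The sketch below is one way to do this, not necessarily the approach used in this course. It assumes simulated data (the names `sim` and `fit` are hypothetical and are not the Kentucky Derby data) and uses only base R. Independence is usually judged from how the data were collected rather than from a plot.

```r
# Simulated data for illustration only (hypothetical, not the derby data)
set.seed(1)
sim <- data.frame(x = runif(100, 0, 10))
sim$y <- 2 + 0.5 * sim$x + rnorm(100, sd = 1)

fit <- lm(y ~ x, data = sim)

# Residuals vs. fitted values:
#   curvature suggests a violation of linearity,
#   a fan shape suggests a violation of equal variance
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals:
#   large departures from the line suggest a violation of normality
qqnorm(resid(fit))
qqline(resid(fit))
```

Because the simulated data satisfy all of the assumptions, both plots should show no strong pattern.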

year	winner	condition	speed	starters
1896	Ben Brush	good	51.66	8
1897	Typhoon II	slow	49.81	6
1898	Plaudit	good	51.16	4
1899	Manuel	fast	50.00	5
1900	Lieut. Gibson	fast	52.28	7

Candidate models

Model 1: Main effects model (year, condition, starters)

model1 <- lm(speed ~ starters + year + condition, data = derby)

Model 2: Main effects + $\text{year}^{2}$, the quadratic effect of year

model2 <- lm(speed ~ starters + year + I(year^2) + condition,
             data = derby)
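
The `I()` wrapper in Model 2 matters because `^` has a special meaning inside an R formula (it denotes crossing of terms, not arithmetic). A minimal sketch of the difference, assuming the built-in `mtcars` data as a stand-in for `derby` (the object names are hypothetical):

```r
# Stand-in data: built-in mtcars (hypothetical substitute for derby)

# Without I(): hp^2 is read as formula crossing and collapses to hp,
# so no quadratic term is fit
without_I <- lm(mpg ~ hp + hp^2, data = mtcars)

# With I(): hp^2 is evaluated arithmetically, adding a quadratic term
with_I <- lm(mpg ~ hp + I(hp^2), data = mtcars)

names(coef(without_I))
names(coef(with_I))
```

The first model has only an intercept and a linear term; the second adds the squared term.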

Model 3: Main effects + interaction between year and condition

model3 <- lm(speed ~ starters + year + condition + year * condition, 
             data = derby)
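
In an R formula, `a * b` expands to `a + b + a:b`, so listing `year + condition` alongside `year * condition` is redundant but harmless. The sketch below illustrates this, assuming the built-in `mtcars` data as a stand-in for `derby` (the variable and object names are hypothetical):

```r
# Stand-in data: built-in mtcars, with am recoded as a factor
# (hypothetical substitute for the derby data)
cars <- transform(mtcars, am = factor(am))

# Short form: wt * am expands to wt + am + wt:am
short_form <- lm(mpg ~ hp + wt * am, data = cars)

# Long form: main effects and interaction written out
long_form <- lm(mpg ~ hp + wt + am + wt:am, data = cars)

# Redundant form, mirroring the Model 3 call: duplicate terms are dropped
redundant_form <- lm(mpg ~ hp + wt + am + wt * am, data = cars)

# TRUE when the fitted coefficients (and their names) agree
same_short_long <- isTRUE(all.equal(coef(short_form), coef(long_form)))
same_long_redundant <- isTRUE(all.equal(coef(long_form), coef(redundant_form)))
```

All three calls fit the same model, so either spelling can be used.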

model	r.squared	adj.r.squared	AIC	BIC
Model1	0.730	0.721	259.478	276.302
Model2	0.827	0.819	207.429	227.057
Model3	0.751	0.738	253.584	276.016
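
In the table above, Model 2 has the highest adjusted $R^2$ and the lowest AIC and BIC (lower AIC and BIC indicate a better trade-off between fit and complexity). A comparison table of this kind can be assembled roughly as follows. This is a sketch in base R, assuming the built-in `mtcars` data as a stand-in for `derby`; the helper `fit_stats()` and the model names are hypothetical. `broom::glance()` reports the same four statistics for an `lm` fit.

```r
# Stand-in data: built-in mtcars (hypothetical substitute for derby)
m1 <- lm(mpg ~ wt + hp, data = mtcars)              # main effects
m2 <- lm(mpg ~ wt + hp + I(hp^2), data = mtcars)    # + quadratic term
m3 <- lm(mpg ~ wt * hp, data = mtcars)              # + interaction

# Hypothetical helper: one row of summary statistics per model
fit_stats <- function(fit) {
  s <- summary(fit)
  data.frame(
    r.squared     = s$r.squared,
    adj.r.squared = s$adj.r.squared,
    AIC           = AIC(fit),
    BIC           = BIC(fit)
  )
}

models <- list(Model1 = m1, Model2 = m2, Model3 = m3)
comparison <- do.call(rbind, lapply(models, fit_stats))
comparison <- cbind(model = names(models), comparison)
rownames(comparison) <- NULL

print(comparison, digits = 3)
```

Note that $R^2$ never decreases when a term is added to a model, which is why adjusted $R^2$, AIC, and BIC are the more useful columns when comparing models of different sizes.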

