STA 310 - Spring 2024 - Logistic regression

Bernoulli + Binomial random variables

Logistic regression is used to analyze data with two types of responses:

Bernoulli (Binary): These responses take on two values: success $(Y = 1)$ or failure $(Y = 0)$, yes $(Y = 1)$ or no $(Y = 0)$, etc.

$P(Y = y) = p^{y} (1 - p)^{1 - y}, \quad y = 0, 1$

Binomial: The number of successes in a Bernoulli process, that is, in $n$ independent trials with a constant probability of success $p$.

$P(Y = y) = \binom{n}{y} p^{y} (1 - p)^{n - y}, \quad y = 0, 1, \dots, n$

In both instances, the goal is to model $p$, the probability of success.
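The two probability functions above can be checked numerically. The following is a minimal sketch in R; the values of `p` and `n` are illustrative assumptions, and `bernoulli_pmf` is a hypothetical helper written only for this example.

```r
# Bernoulli pmf: p^y * (1 - p)^(1 - y) for y = 0, 1
p <- 0.3  # illustrative success probability (assumed value)
bernoulli_pmf <- function(y, p) p^y * (1 - p)^(1 - y)
bern <- bernoulli_pmf(c(0, 1), p)

# Binomial pmf: choose(n, y) * p^y * (1 - p)^(n - y) for y = 0, ..., n
n <- 5    # illustrative number of trials (assumed value)
y <- 0:n
binom_manual  <- choose(n, y) * p^y * (1 - p)^(n - y)
binom_builtin <- dbinom(y, size = n, prob = p)  # base R equivalent

# Compare the hand-written formula with the built-in function
all.equal(binom_manual, binom_builtin)
```

With `n = 1` the binomial formula reduces to the Bernoulli one, which is why both response types can be handled by the same model for $p$.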

Access to personal protective equipment

age	sex	years	ppe_access
34	Male	2	1
32	Female	3	1
32	Female	1	1
40	Male	4	1
32	Male	10	1

Model results

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-2.127	0.458	-4.641	0.000	-3.058	-1.257
age	0.056	0.017	3.210	0.001	0.023	0.091
sexMale	0.341	0.224	1.524	0.128	-0.098	0.780
years	0.264	0.066	4.010	0.000	0.143	0.401
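Coefficients in a logistic regression are on the log-odds scale, so exponentiating them gives odds ratios. As a rough sketch, the estimates and confidence limits from the table above can be converted as follows; the numbers are copied from the table, and the object names are assumptions made for this example.

```r
# Estimates and 95% confidence limits copied from the coefficient table
est     <- c(age = 0.056, sexMale = 0.341, years = 0.264)
ci_low  <- c(age = 0.023, sexMale = -0.098, years = 0.143)
ci_high <- c(age = 0.091, sexMale = 0.780, years = 0.401)

# Exponentiate to move from the log-odds scale to the odds scale
odds_ratios <- data.frame(
  odds_ratio = exp(est),
  conf.low   = exp(ci_low),
  conf.high  = exp(ci_high)
)
round(odds_ratios, 3)
```

An odds ratio above 1 means the odds of the response are estimated to increase with the predictor, holding the other predictors constant. An interval that contains 1 (as for `sexMale`, whose log-odds interval contains 0) is consistent with no association.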

Data: Supporting railroads in the 1870s

County	popBlack	popWhite	popTotal	pctBlack	distance	YesVotes	NumVotes
Carthage	841	599	1440	58.40	17	61	110
Cederville	1774	146	1920	92.40	7	0	15
Five Mile Creek	140	626	766	18.28	15	4	42
Greensboro	1425	975	2400	59.38	0	1790	1804
Harrison	443	355	798	55.51	7	0	15

Model in R

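library(broom)  # tidy()
library(knitr)  # kable()

# Binomial response: successes = YesVotes, failures = NumVotes - YesVotes
rr_model <- glm(cbind(YesVotes, NumVotes - YesVotes) ~ distance + pctBlack,
                data = rr, family = binomial)
tidy(rr_model, conf.int = TRUE) |>
  kable(digits = 3)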

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	4.222	0.297	14.217	0.000	3.644	4.809
distance	-0.292	0.013	-22.270	0.000	-0.318	-0.267
pctBlack	-0.013	0.004	-3.394	0.001	-0.021	-0.006

$\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = 4.22 - 0.292 \times \text{distance} - 0.013 \times \text{pctBlack}$
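To turn the fitted equation into an estimated probability of a Yes vote, compute the log-odds and apply the inverse logit. A minimal sketch, using the coefficient estimates from the table and the `distance` and `pctBlack` values from the Harrison row of the data; `predict_prob` is a hypothetical helper written for this example, not part of the original analysis.

```r
# Coefficient estimates copied from the model output
coefs <- c(intercept = 4.222, distance = -0.292, pctBlack = -0.013)

# Hypothetical helper: estimated probability of a Yes vote
predict_prob <- function(distance, pctBlack) {
  log_odds <- coefs[["intercept"]] +
    coefs[["distance"]] * distance +
    coefs[["pctBlack"]] * pctBlack
  plogis(log_odds)  # inverse logit: exp(x) / (1 + exp(x))
}

# Harrison: distance = 7, pctBlack = 55.51
p_harrison <- predict_prob(distance = 7, pctBlack = 55.51)
p_harrison
```

With a fitted model object available, `predict(rr_model, type = "response")` should give the same kind of estimate without copying rounded coefficients by hand.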

Goodness of fit

Similar to Poisson regression, the sum of the squared deviance residuals (the residual deviance) is used to assess goodness of fit.

$\begin{aligned} H_{0} &: \text{Model is a good fit} \\ H_{a} &: \text{Model is not a good fit} \end{aligned}$

When the $m_{i}$'s (the number of trials for each observation) are large and the model is a good fit ($H_{0}$ true), the residual deviance approximately follows a $\chi^{2}$ distribution with $n - p$ degrees of freedom.
- Recall $n - p$ is the residual degrees of freedom.

If the model fits, we expect the residual deviance to be approximately what value?
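The goodness-of-fit test described above can be sketched in R as follows. The data here are simulated purely for illustration (the data frame `sim` and its columns `x`, `m`, and `yes` are assumptions, not the railroad data), so the example runs on its own.

```r
set.seed(310)

# Hypothetical binomial data: 40 observations, 50 trials each
n_obs <- 40
sim <- data.frame(x = runif(n_obs, 0, 10), m = rep(50, n_obs))
sim$yes <- rbinom(n_obs, size = sim$m, prob = plogis(1 - 0.3 * sim$x))

fit <- glm(cbind(yes, m - yes) ~ x, data = sim, family = binomial)

# Residual deviance = sum of squared deviance residuals
res_dev <- sum(residuals(fit, type = "deviance")^2)
df_res  <- df.residual(fit)  # n - p

# Upper-tail chi-square probability; a small value suggests lack of fit
p_value <- pchisq(res_dev, df = df_res, lower.tail = FALSE)
c(residual_deviance = res_dev, df = df_res, p_value = p_value)
```

The same quantities are available directly as `deviance(fit)` and `fit$df.residual`; computing the sum by hand only makes the link to the deviance residuals explicit.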

Adjusting for overdispersion

Overdispersion occurs when there is extra-binomial variation, i.e., the variance is greater than the $n p (1 - p)$ we would expect under the binomial model.
Similar to Poisson regression, we can adjust for overdispersion in the binomial regression model by using a dispersion parameter $\hat{\phi} = \frac{\sum (\text{Pearson residuals})^{2}}{n - p}$.
- By multiplying the variance by $\hat{\phi}$ (equivalently, the standard errors by $\sqrt{\hat{\phi}}$), we are accounting for the reduction in information relative to what we would expect from independent observations.
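One way to carry out this adjustment in R is to estimate $\hat{\phi}$ from the Pearson residuals and refit with `family = quasibinomial`. The sketch below uses simulated data with deliberately added extra-binomial variation; the data frame `sim`, its columns, and the noise level are all assumptions made for illustration.

```r
set.seed(310)

# Hypothetical overdispersed data: each observation gets its own
# noisy success probability, which creates extra-binomial variation
n_obs <- 40
sim <- data.frame(x = runif(n_obs, 0, 10), m = rep(50, n_obs))
p_i <- plogis(1 - 0.3 * sim$x + rnorm(n_obs, sd = 1))
sim$yes <- rbinom(n_obs, size = sim$m, prob = p_i)

# Ordinary binomial fit and the dispersion estimate from its residuals
fit <- glm(cbind(yes, m - yes) ~ x, data = sim, family = binomial)
phi_hat <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

# Quasibinomial fit: same coefficients, standard errors scaled by sqrt(phi_hat)
fit_q <- glm(cbind(yes, m - yes) ~ x, data = sim, family = quasibinomial)

se_binom <- summary(fit)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_q)$coefficients[, "Std. Error"]
cbind(se_binom, se_quasi, ratio = se_quasi / se_binom, sqrt_phi = sqrt(phi_hat))
```

The coefficient estimates are unchanged by the quasibinomial fit; only the standard errors, and therefore the test statistics and confidence intervals, are affected.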
