Logistic regression

Binomial responses + overdispersion

Prof. Maria Tackett

Feb 07, 2024

Announcements

  • HW 02 due TODAY at 11:59pm

  • Project 01

    • presentations in class Wed, Feb 14

    • write up due Thu, Feb 15 at noon

Learning goals

  • Visualizations for logistic regression

  • Fit and interpret logistic regression model for binomial response variable

  • Explore diagnostics for logistic regression

  • Summarize GLMs for independent observations

Notes based on Chapter 6 Roback and Legler (2021) unless noted otherwise.

Logistic regression

Bernoulli + Binomial random variables

Logistic regression is used to analyze data with two types of responses:

  • Bernoulli (Binary): These responses take on two values success (Y=1) or failure (Y=0), yes (Y=1) or no (Y=0), etc.

P(Y=y)=py(1−p)1−yy=0,1

  • Binomial: Number of successes in a Bernoulli process, n independent trials with a constant probability of success p.

P(Y=y)=(ny)py(1−p)n−yy=0,1,…,n

In both instances, the goal is to model p the probability of success.

Logistic regression model

log⁡(p1−p)=β0+β1x1+β2x2+⋯+βpxp

  • The response variable, log⁡(p1−p), is the log(odds) of success, i.e. the logit
  • Use the model to calculate the probability of success p^=eβ0+β1x1+β2x2+⋯+βpxp1+eβ0+β1x1+β2x2+⋯+βpxp
  • When the response is a Bernoulli random variable, the probabilities can be used to classify each observation as a success or failure

Interpreting coefficients

log⁡(p^1−p^)=β^0+β^1x1+β^2x2+⋯+β^pxp

  • β^j is predicted change in the log-odds when going from xj to xj+1, holding all else constant

  • eβj is the predicted change the odds when going from xj to xj+1, holding all else constant (odds ratio)

COVID-19 infection prevention practices at food establishments

Researchers at Wollo Univeristy in Ethiopia conducted a study in July and August 2020 to understand factors associated with good COVID-19 infection prevention practices at food establishments. Their study is published in Andualem et al. (2022) .


They were particularly interested in the understanding implementation of prevention practices at food establishments, given the workers’ increased risk due to daily contact with customers.

Results

Interpretation

  • The (adjusted) odds ratio for availability of COVID-19 infection prevention guidelines is 2.68 with 95% CI (1.52, 4.75).

  • The odds ratio between workers at a restaurant with such guidelines and those at a restaurant without the guidelines is 2.68, after adjusting for the other factors.

  • Interpretation: The odds a worker at a restaurant with COVID-19 infection prevention guidelines uses good infection prevention practices is 2.68 times the odds of a worker at a restaurant without the guidelines, holding all other factors constant.

Visualizations for logistic regression

Access to personal protective equipment

We will use the data from Andualem et al. (2022) to explore the association between age, sex, years of service, and whether someone works at a food establishment with access to personal protective equipment (PPE) as of August 2020. We will use access to PPE as a proxy for wearing PPE.

age sex years ppe_access
34 Male 2 1
32 Female 3 1
32 Female 1 1
40 Male 4 1
32 Male 10 1

EDA for binary response

library(Stat2Data)
par(mfrow = c(1, 2))
emplogitplot1(ppe_access ~ age, data = covid_df, ngroups = 10)
emplogitplot1(ppe_access ~ years, data = covid_df, ngroups = 5)

EDA for binary response

library(viridis)
ggplot(data = covid_df, aes(x = sex, fill = factor(ppe_access))) + 
  geom_bar(position = "fill")  +
  labs(x = "Sex", 
       fill = "PPE Access", 
       title = "PPE Access by Sex") + 
  scale_fill_viridis_d()

Model results

ppe_model <- glm(factor(ppe_access) ~ age + sex + years, 
                 data = covid_df, family = binomial)
tidy(ppe_model, conf.int = TRUE) |>
  kable(digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -2.127 0.458 -4.641 0.000 -3.058 -1.257
age 0.056 0.017 3.210 0.001 0.023 0.091
sexMale 0.341 0.224 1.524 0.128 -0.098 0.780
years 0.264 0.066 4.010 0.000 0.143 0.401

Visualizing coefficient estimates

model_odds_ratios <- tidy(ppe_model, exponentiate = TRUE, conf.int = TRUE)
ggplot(data = model_odds_ratios, aes(x = term, y = estimate)) +
  geom_point() +
  geom_hline(yintercept = 1, lty = 2) + 
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high))+
  labs(title = "Adjusted odds ratios",
       x = "",
       y = "Estimated AOR") +
  coord_flip()

Logistic regression for binomial response variable

Data: Supporting railroads in the 1870s

The data set RR_Data_Hale.csv contains information on support for referendums related to railroad subsidies for 11 communities in Hale County, Alabama in the 1870s. The data were originally collected from the US Census by historian Michael Fitzgerald and analyzed as part of a thesis project by a student at St. Olaf College. The variables in the data are

  • pctBlack: percentage of Black residents in the county
  • distance: distance the proposed railroad is from the community (in miles)
  • YesVotes: number of “yes” votes in favor of the proposed railroad line
  • NumVotes: number of votes cast in the election

Primary question: Was voting on the railroad referendum related to the distance from the proposed railroad line, after adjusting for the demographics of a county?

The data

rr <- read_csv("data/RR_Data_Hale.csv")
rr |> slice(1:5) |> kable()
County popBlack popWhite popTotal pctBlack distance YesVotes NumVotes
Carthage 841 599 1440 58.40 17 61 110
Cederville 1774 146 1920 92.40 7 0 15
Five Mile Creek 140 626 766 18.28 15 4 42
Greensboro 1425 975 2400 59.38 0 1790 1804
Harrison 443 355 798 55.51 7 0 15

Exploratory data analysis

rr <- rr |>
  mutate(pctYes = YesVotes/NumVotes, 
         emp_logit = log(pctYes / (1 - pctYes)))

Exploratory data analysis

rr <- rr |>
  mutate(inFavor = if_else(pctYes > 0.5, "Yes", "No"))

Check for potential multicollinearity and interaction effect.

Model

Let p be the percent of yes votes in a county. We’ll start by fitting the following model:

log⁡(p1−p)=β0+β1 dist+β2 pctBlack

Likelihood

L(p)=∏i=1n(miyi)piyi(1−pi)mi−yi=∏i=1n(miyi)[eβ0+β1 disti+β2 pctBlacki1+eβ0+β1 disti+β2 pctBlacki]yi[1eβ0+β1 disti+β2 pctBlacki]mi−yi

Use IRLS to find β^0,β^1,β^2.

Model in R

rr_model <- glm(cbind(YesVotes, NumVotes - YesVotes) ~ distance + pctBlack, 
                data = rr, family = binomial)
tidy(rr_model, conf.int = TRUE) |>
  kable(digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 4.222 0.297 14.217 0.000 3.644 4.809
distance -0.292 0.013 -22.270 0.000 -0.318 -0.267
pctBlack -0.013 0.004 -3.394 0.001 -0.021 -0.006

log⁡(p^1−p^)=4.22−0.292 dist−0.013 pctBlack

Application exercise

📋 sta310-sp24.netlify.app/ae/lec-08-logistic

Part 1

10:00

Residuals

Similar to Poisson regression, there are two types of residuals: Pearson and deviance residuals

Pearson residuals

Pearson residuali=actual count−predicted countSD count=Yi−mip^imip^i(1−p^i)

Deviance residuals

di=sign(Yi−mip^i)2[Yilog⁡(Yimip^i)+(mi−Yi)log⁡(mi−Yimi−mip^i)]

Plot of deviance residuals

rr_int_aug <- augment(rr_int_model, type.predict = "response", 
                        type.residuals = "deviance")

Goodness of fit

Similar to Poisson regression, the sum of the squared deviance residuals is used to assess goodness of fit.

H0: Model is a good fitHa: Model is not a good fit

  • When mi’s are large and the model is a good fit (H0 true) the residual deviance follows a χ2 distribution with n−p degrees of freedom.
    • Recall n−p is the residual degrees of freedom.
  • If the model fits, we expect the residual deviance to be approximately what value?

Overdispersion

Adjusting for overdispersion

  • Overdispersion occurs when there is extra-binomial variation, i.e. the variance is greater than what we would expect, np(1−p).

  • Similar to Poisson regression, we can adjust for overdispersion in the binomial regression model by using a dispersion parameter ϕ^=∑(Pearson residuals)2n−p

    • By multiplying by ϕ^, we are accounting for the reduction in information we would expect from independent observations.

Adjusting for overdispersion

  • We adjust for overdispersion using a quasibinomial model.
    • “Quasi” reflects the fact we are no longer using a binomial model with true likelihood.
  • The standard errors of the coefficients are SEQ(β^j)=ϕ^SE(β^)
    • Inference is done using the t distribution to account for extra variability

Application exercise

📋 sta310-sp24.netlify.app/ae/lec-08-logistic

Part 2

06:00

References

Andualem, Atsedemariam, Belachew Tegegne, Sewunet Ademe, Tarikuwa Natnael, Gete Berihun, Masresha Abebe, Yeshiwork Alemnew, et al. 2022. “COVID-19 Infection Prevention Practices Among a Sample of Food Handlers of Food and Drink Establishments in Ethiopia.” PLoS One 17 (1): e0259851.
Roback, Paul, and Julie Legler. 2021. Beyond multiple linear regression: applied generalized linear models and multilevel models in R. CRC Press.

🔗 STA 310 - Spring 2024

1 / 32
Logistic regression Binomial responses + overdispersion Prof. Maria Tackett Feb 07, 2024

  1. Slides

  2. Tools

  3. Close
  • Logistic regression
  • Announcements
  • Learning goals
  • Logistic regression
  • Bernoulli + Binomial random variables
  • Logistic regression model
  • Interpreting coefficients
  • COVID-19 infection prevention practices at food establishments
  • Results
  • Interpretation
  • Visualizations for logistic regression
  • Access to personal protective equipment
  • EDA for binary response
  • EDA for binary response
  • Model results
  • Visualizing coefficient estimates
  • Logistic regression for binomial response variable
  • Data: Supporting railroads in the 1870s
  • The data
  • Exploratory data analysis
  • Exploratory data analysis
  • Model
  • Model in R
  • Application exercise
  • Residuals
  • Plot of deviance residuals
  • Goodness of fit
  • Overdispersion
  • Adjusting for overdispersion
  • Adjusting for overdispersion
  • Application exercise
  • References
  • f Fullscreen
  • s Speaker View
  • o Slide Overview
  • e PDF Export Mode
  • b Toggle Chalkboard
  • c Toggle Notes Canvas
  • d Download Drawings
  • ? Keyboard Help