Feb 12, 2024
Project 01
presentations in class Wed, Feb 14
write up due Thu, Feb 15 at 9pm
Quiz 02: Tue, Feb 20 - Thu, Feb 22
Covers readings & lectures: Jan 24 - Feb 12
Poisson regression, unifying framework for GLMs, logistic regression, proportional odds models, probit regression
Introduce proportional odds and probit regression models
Understand how these models are related to logistic regression models
Interpret coefficients in context of the data
See how these models are applied in research contexts
Ataman and Sarıyer (2021) use ordinal logistic regression to predict patient wait and treatment times in an emergency department (ED). The goal is to identify relevant factors that can be used to inform recommendations for reducing wait and treatment times, thus improving the quality of care in the ED.
Data: Daily records for ED arrivals in August 2018 at a public hospital in Izmir, Turkey.
Response variables:
Wait time
Treatment time:
Patients who are treated for up to 10 minutes
Patients whose treatment time is in the range of 10 - 120 minutes
Patients who are treated for longer than 120 minutes
Predictor variables:
Gender
Age
Arrival mode
Triage level
ICD-10 diagnosis: Codes specifying patient’s diagnosis
Categorical variables with 3+ levels
Unordered (Nominal)
Voting choice in election with multiple candidates
Type of cell phone owned by adults in the U.S.
Favorite social media platform among undergraduate students
Ordered (Ordinal)
Wait and treatment times in the emergency department
Likert scale ratings on a survey
Employee job performance ratings
Let \(Y\) be an ordinal response variable that takes levels \(1, 2, \ldots, J\) with associated probabilities \(p_1, p_2, \ldots, p_J\)
The proportional odds model can be written as the following:
\[\begin{aligned}&\log\Big(\frac{P(Y \leq 1)}{P(Y > 1)}\Big) = \beta_{01} - \beta_1x_1 - \dots - \beta_px_p \\ & \log\Big(\frac{P(Y\leq 2)}{P(Y > 2)}\Big) = \beta_{02} -\beta_1x_1 - \dots - \beta_px_p \\ & \dots \\ & \log\Big(\frac{P(Y\leq J-1)}{P(Y > J-1)}\Big) = \beta_{0{J-1}} - \beta_1x_1 - \dots - \beta_px_p\end{aligned}\]
What does \(\beta_{01}\) mean? What does \(\beta_1\) mean?
Let’s consider one portion of the model:
\[ \log\Big(\frac{P(Y\leq k)}{P(Y > k)}\Big) = \beta_{0k} - \beta_1x_1 - \dots - \beta_px_p \]
The response variable is \(\text{logit}(P(Y\leq k))\), the log-odds of observing an outcome less than or equal to category \(k\).
When \(\beta_j > 0\), an increase in \(x_j\) is associated with increased log-odds of being in a higher category of \(Y\).
The effect of a one-unit increase in \(x_j\) is the same regardless of the category \(k\) of \(Y\) (the proportional odds assumption).
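The constant effect across categories can be checked numerically. The following is a minimal sketch in standard-library Python (the slides fit these models in R; this only reproduces the arithmetic). The cutpoints, the coefficient, and all function names are hypothetical values chosen for illustration, not estimates from the paper.

```python
import math

def expit(z):
    """Inverse logit: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def cumulative_probs(cutpoints, beta, x):
    """P(Y <= k) for k = 1..J-1 under logit P(Y <= k) = beta_0k - beta * x."""
    return [expit(b0k - beta * x) for b0k in cutpoints]

def category_probs(cutpoints, beta, x):
    """P(Y = k) for k = 1..J, by differencing the cumulative probabilities."""
    cum = cumulative_probs(cutpoints, beta, x) + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

def odds_of_higher(cutpoints, beta, x):
    """Odds of Y > k versus Y <= k at each cutpoint k."""
    return [(1 - p) / p for p in cumulative_probs(cutpoints, beta, x)]

# Hypothetical values for illustration only (J = 3 categories, one predictor).
cutpoints = [-1.0, 1.5]   # beta_01 and beta_02, which must be increasing
beta = 0.8                # beta_1 > 0

odds_x0 = odds_of_higher(cutpoints, beta, x=0)
odds_x1 = odds_of_higher(cutpoints, beta, x=1)
ratios = [b / a for a, b in zip(odds_x0, odds_x1)]

print(ratios)                                # each entry is approximately exp(beta)
print(category_probs(cutpoints, beta, x=0))  # probabilities at x = 0
print(category_probs(cutpoints, beta, x=1))  # mass shifts toward higher categories
```

Under this parameterization, the odds of being above category \(k\) are multiplied by \(e^{\beta_1}\) for each one-unit increase in \(x_1\), whichever \(k\) is used.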
The variable arrival mode has two possible values: ambulance and walk-in. Describe the effect of arrival mode on waiting time. Note: The baseline category is walk-in.
Consider the full output with the ordinal logistic models for wait and treatment times.
Use the results from both models to describe the effect of triage level on waiting and treatment times. Note: The baseline category is green.
Fit proportional odds models using the polr function in the MASS package.
Suppose the outcome variable \(Y\) is categorical and can take values \(1, 2, \ldots, K\) such that
\[ P(Y = 1) = p_1, \ldots , P(Y = K) = p_K \hspace{5mm} \text{ and } \hspace{5mm} \sum_{k = 1}^{K} p_k = 1 \]
Choose a baseline category. Let’s choose \(Y = 1\). Then
\[\begin{aligned}&\log\Big(\frac{P(Y = 2)}{P(Y = 1)}\Big) = \beta_{02} - \beta_{12}x_1 - \dots - \beta_{p2}x_p \\ & \log\Big(\frac{P(Y = 3)}{P(Y =1)}\Big) = \beta_{03} -\beta_{13}x_1 - \dots - \beta_{p3}x_p \\ & \dots \\ & \log\Big(\frac{P(Y = K)}{P(Y = 1)}\Big) = \beta_{0{K}} - \beta_{1K}x_1 - \dots - \beta_{pK}x_{p}\end{aligned}\]
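These baseline-category logits determine all \(K\) category probabilities. As a sketch, write \(\eta_k\) (shorthand introduced here, not part of the original notation) for the right-hand side of the equation for category \(k\), so that \(P(Y = k) = P(Y = 1)\,e^{\eta_k}\). Combining this with the constraint \(\sum_{k=1}^{K} p_k = 1\) gives

\[ P(Y = 1) = \frac{1}{1 + \sum_{m=2}^{K} e^{\eta_m}} \hspace{5mm} \text{ and } \hspace{5mm} P(Y = k) = \frac{e^{\eta_k}}{1 + \sum_{m=2}^{K} e^{\eta_m}}, \quad k = 2, \ldots, K \]

Each non-baseline category has its own set of coefficients, in contrast to the single set of slopes shared across cutpoints in the proportional odds model.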
How is the proportional odds model similar to the multinomial logistic model? How is it different? What is an advantage of each model? What is a disadvantage?
Ibanez and Roussel (2022) conducted an experiment to understand the impact of watching a nature documentary on pro-environmental behavior. The researchers randomly assigned the 113 participants to watch a video about architecture in NYC (control) or a video about Yellowstone National Park (treatment). As part of the experiment, participants were asked to dispose of their headphone coverings in a recycle bin available at the end of the experiment.
Response variable: Recycle headphone coverings vs. not
Predictor variables:
Let \(Y\) be a binary response variable that takes values 0 or 1, and let \(p = P(Y = 1 | x_1, \ldots, x_p)\)
\[ probit(p) = \Phi^{-1}(p) = \beta_0 + \beta_1 x_1+ \dots + \beta_px_p \]
where \(\Phi^{-1}\) is the inverse of the standard normal cumulative distribution function.
The outcome is the z-score at which the cumulative probability is equal to \(p\)
\(\hat{\beta}_j\) is the estimated change in z-score for each unit increase in \(x_j\), holding all other factors constant.
This is a fairly clunky interpretation, so the (average) marginal effect of \(x_j\) is often interpreted instead
The marginal effect of \(x_j\) is essentially the change in the probability associated with a change in \(x_j\).
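For a continuous predictor, the marginal effect follows from differentiating \(p = \Phi(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)\) with respect to \(x_j\). As a sketch, with \(\phi\) denoting the standard normal density (notation introduced here):

\[ \frac{\partial p}{\partial x_j} = \phi(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)\,\beta_j \]

Because this quantity depends on the values of all predictors, it differs across observations. The average marginal effect is its mean over the observed data, evaluated at the estimated coefficients.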
Interpret the effect of watching the nature documentary, Nature (T2), on recycling. Assume NEP is low, NEP-High = 0.
Pros of probit regression:
Some statisticians like assuming the normal distribution over the logistic distribution.
Easier to work with in more advanced settings, such as multivariate and Bayesian modeling
Cons of probit regression:
Z-scores are not as straightforward to interpret as the outcomes of a logistic model.
We can’t use odds ratios to describe findings.
It’s more mathematically complicated than logistic regression.
It does not work well for response variables with 3+ categories.
Fit probit regression models using the glm function with family = binomial(link = probit).
Calculate marginal effects using the margins function from the margins R package.
Let’s look at the model using ideology and party ID to explain the number of issue statements by politicians. We will use probit regression for the “hurdle” part of the model: the likelihood that a candidate comments on at least one issue (has_issue_stmt).
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 1.272 | 0.117 | 10.829 | 0.000 |
ideology | 0.262 | 0.089 | 2.926 | 0.003 |
democrat1 | 0.149 | 0.180 | 0.827 | 0.408 |
Average marginal effects:
term | average marginal effect |
---|---|
ideology | 0.04071 |
democrat1 | 0.02333 |
Interpret the effect of democrat on commenting on at least one issue.
Probit model
term | estimate |
---|---|
(Intercept) | 1.272 |
ideology | 0.262 |
democrat1 | 0.149 |
Logistic model
term | estimate |
---|---|
(Intercept) | 2.127 |
ideology | 0.575 |
democrat1 | 0.428 |
Suppose there is a Democratic representative with an ideology score of -2.5. Based on the probit model, what is the probability they will comment on at least one issue? What is the probability based on the logistic model?
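One way to check an answer to this question is to plug the estimates into the inverse of each link function. The sketch below uses standard-library Python rather than the R used to fit the models. It assumes that democrat1 = 1 codes a Democratic representative; the dictionary and function names are illustrative only.

```python
import math
from statistics import NormalDist

# Coefficient estimates copied from the probit and logistic tables above.
probit_coefs = {"intercept": 1.272, "ideology": 0.262, "democrat1": 0.149}
logit_coefs = {"intercept": 2.127, "ideology": 0.575, "democrat1": 0.428}

def linear_predictor(coefs, ideology, democrat):
    """beta_0 + beta_1 * ideology + beta_2 * democrat for one representative."""
    return (coefs["intercept"]
            + coefs["ideology"] * ideology
            + coefs["democrat1"] * democrat)

# Assumption: democrat = 1 codes a Democratic representative.
eta_probit = linear_predictor(probit_coefs, ideology=-2.5, democrat=1)
eta_logit = linear_predictor(logit_coefs, ideology=-2.5, democrat=1)

# Probit: the linear predictor is a z-score, so apply the standard normal CDF.
p_probit = NormalDist().cdf(eta_probit)

# Logistic: the linear predictor is a log-odds, so apply the inverse logit.
p_logit = 1.0 / (1.0 + math.exp(-eta_logit))

print(f"probit:   z-score  = {eta_probit:.4f}, probability = {p_probit:.3f}")
print(f"logistic: log-odds = {eta_logit:.4f}, probability = {p_logit:.3f}")
```

Under these assumptions the two models give similar probabilities, roughly 0.78 for the probit model and 0.75 for the logistic model, even though their coefficients are on different scales.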
Covered fitting, interpreting, and drawing conclusions from GLMs
Used Pearson and deviance residuals to assess model fit and determine if new variables should be added to the model
Addressed issues of overdispersion and zero-inflation
Used the properties of the one-parameter exponential family to identify the best link function for any GLM
Everything we’ve done thus far has been under the assumption that the observations are independent. Looking ahead, we will consider models for data with dependent (correlated) observations.