Poisson Regression

Goodness-of-fit & overdispersion

Prof. Maria Tackett

Jan 29, 2024

Announcements

Quiz 01: Tue, Jan 30 ~ 9am - Thu, Feb 01 at noon

  • The quiz is not timed and will be administered through Canvas

  • Covers:

    • Syllabus

    • Covers readings & lectures: Jan 17 - 24

    • Open book, open note, open internet (not crowd-sourcing sites or AI). You may not discuss the quiz with anyone else. See policy in syllabus.

    • Please email me or send a me a direct message on Slack if you have questions.

Topics

  • Define and calculate residuals for the Poisson regression model

  • Use Goodness-of-fit to assess model fit

  • Identify overdispersion

  • Apply modeling approaches to deal with overdispersion

    • Quasi-Poisson

    • Negative binomial

The data: Household size in the Philippines

The data fHH1.csv come from the 2015 Family Income and Expenditure Survey conducted by the Philippine Statistics Authority.

Goal: Understand the association between household size and various characteristics of the household

Response:

  • total: Number of people in the household other than the head

Predictors:

  • location: Where the house is located
  • age: Age of the head of household
  • roof: Type of roof on the residence (proxy for wealth)

Other variables:

  • numLT5: Number in the household under 5 years old

Poisson regression model

If \(Y_i \sim Poisson\) with \(\lambda = \lambda_i\) for the given values \(x_{i1}, \ldots, x_{ip}\), then

\[\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}\]

  • Each observation can have a different value of \(\lambda\) based on its value of the predictors \(x_1, \ldots, x_p\)

  • \(\lambda\) determines the mean and variance, so we don’t need to estimate a separate error term

Household vs. Age

hh_age <- glm(total ~ age, data = hh_data, family = poisson)

tidy(hh_age) |> 
  kable(digits = 4)
term estimate std.error statistic p.value
(Intercept) 1.5499 0.0503 30.8290 0
age -0.0047 0.0009 -5.0258 0

\[\log(\hat{\lambda}) = 1.5499 - 0.0047 ~ age\]

The mean household size is predicted to decrease by 0.47% (multiply by a factor of \(e^{-0.0047}\)) for each year older the head of the household is.

Household vs. age and location

Add \(age^2\) and \(location\) to model?

hh_age_loc <- glm(total ~ age + I(age^2) + location, 
                  data = hh_data, family = poisson)


Use Likelihood Ratio Test

anova(hh_age, hh_age_loc, test = "LRT") |>
  kable(digits = 3)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1498 2337.089 NA NA NA
1493 2187.800 5 149.289 0

Selected model

term estimate std.error statistic p.value conf.low conf.high
(Intercept) -0.3843 0.1821 -2.1107 0.0348 -0.7444 -0.0306
age 0.0704 0.0069 10.1900 0.0000 0.0569 0.0840
I(age^2) -0.0007 0.0001 -10.9437 0.0000 -0.0008 -0.0006
locationDavaoRegion -0.0194 0.0538 -0.3605 0.7185 -0.1250 0.0859
locationIlocosRegion 0.0610 0.0527 1.1580 0.2468 -0.0423 0.1641
locationMetroManila 0.0545 0.0472 1.1542 0.2484 -0.0378 0.1473
locationVisayas 0.1121 0.0417 2.6853 0.0072 0.0308 0.1945

Does this model sufficiently explain the variability in the mean household size?

Goodness-of-fit

Pearson residuals

We can calculate two types of residuals for Poisson regression: Pearson residuals and deviance residuals

\[\text{Pearson residual}_i = \frac{\text{observed} - \text{predicted}}{\text{std. error}} = \frac{Y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}\]

  • Similar interpretation as standardized residuals from linear regression

  • Expect most to fall between -2 and 2

  • Used to calculate overdispersion parameter (more on this soon)

Deviance residuals

  • The deviance residual describes how the observed data deviates from the fitted model \[\text{deviance residual}_i = \text{sign}(Y_i - \hat{\lambda}_i)\sqrt{2\Bigg[Y_i\log\bigg(\frac{Y_i}{\hat{\lambda}_i}\bigg) - (Y_i - \hat{\lambda}_i)\Bigg]}\]

where

\[\text{sign}(Y_i - \hat{\lambda}_i) = \begin{cases} 1 & \text{ if }(Y_i - \hat{\lambda}_i) > 0 \\ -1 & \text{ if }(Y_i - \hat{\lambda}_i) < 0 \\ 0 & \text{ if }(Y_i - \hat{\lambda}_i) = 0 \end{cases}\]

  • Good fitting models \(\Rightarrow\) small deviances

Selected model: Residual plots

hh_age_loc_aug_pearson <- augment(hh_age_loc, type.residuals = "pearson") 
hh_age_loc_aug_deviance <- augment(hh_age_loc, type.residuals = "deviance")

Goodness-of-fit

  • Goal: Use the (residual) deviance to assess how much the predicted values differ from the observed values.

    \[ \text{deviance} = \sum_{i=1}^{n}(\text{deviance residual})_i^2 \]

  • When a model is true, we expect
    \[\text{deviance} \sim \chi^2_{df}\]

where \(df\) is the model’s residual degrees of freedom

  • Question to answer: What is the probability of observing a deviance larger than the one we’ve observed, given this model sufficiently fits the data?

\[P(\chi^2_{df} > \text{ deviance})\]

Application exercise

Goodness-of-fit calculations

hh_age_loc$deviance
[1] 2187.8
hh_age_loc$df.residual
[1] 1493
pchisq(hh_age_loc$deviance, hh_age_loc$df.residual, lower.tail = FALSE)
[1] 3.153732e-29


The probability of observing a deviance greater than 2187.8 is \(\approx 0\), so there is significant evidence of lack-of-fit.

Lack-of-fit

There are a few potential reasons for observing lack-of-fit:

  • Missing important interactions or higher-order terms

  • Missing important variables (perhaps this means a more comprehensive data set is required)

  • There could be extreme observations causing the deviance to be larger than expected (assess based on the residual plots)

  • There could be a problem with the Poisson model

    • Only one parameter \(\lambda\) to describe mean and variance

    • May need more flexibility in the model to handle overdispersion

Overdispersion

Overdispersion: There is more variability in the response than what is implied by the Poisson model


Overall
mean var
3.685 5.534
by Location
location mean var
CentralLuzon 3.402 4.152
DavaoRegion 3.390 4.723
IlocosRegion 3.586 5.402
MetroManila 3.707 4.863
Visayas 3.902 6.602

Why overdispersion matters

If there is overdispersion, then there is more variation in the response than what’s implied by a Poisson model. This means

The standard errors of the model coefficients are artificially small

\(\Rightarrow\) The p-values are artificially small

\(\Rightarrow\) Could lead to models that are more complex than what is needed

We can take overdispersion into account by

  • inflating standard errors by multiplying them by a dispersion factor
  • using a negative-binomial regression model

Quasi-Poisson

Dispersion parameter

The dispersion parameter is represented by \(\phi\) \[\hat{\phi} = \frac{\sum_{i=1}^{n}(\text{Pearson residuals})^2}{n - p}\]

where \(p\) is the number of terms in the model (including the intercept)

  • If there is no overdispersion \(\hat{\phi} = 1\)

  • If there is overdispersion \(\hat{\phi} > 1\)

Accounting for dispersion

  • We inflate the standard errors of the coefficient by multiplying the variance by \(\hat{\phi}\)

\[SE_{Q}(\hat{\beta}) = \sqrt{\hat{\phi}} * SE(\hat{\beta})\]

“Q” stands for quasi-Poisson, since this is an ad-hoc solution - The process for model building and model comparison is called quasilikelihood (similar to likelihood without exact underlying distributions)

Quasi-Poisson model

hh_age_loc_q <- glm(total ~ age + I(age^2) + location, data = hh_data, 
                family = quasipoisson) 
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -0.3843 0.2166 -1.7744 0.0762 -0.8134 0.0358
age 0.0704 0.0082 8.5665 0.0000 0.0544 0.0866
I(age^2) -0.0007 0.0001 -9.2000 0.0000 -0.0009 -0.0006
locationDavaoRegion -0.0194 0.0640 -0.3030 0.7619 -0.1451 0.1058
locationIlocosRegion 0.0610 0.0626 0.9735 0.3304 -0.0620 0.1837
locationMetroManila 0.0545 0.0561 0.9703 0.3320 -0.0552 0.1649
locationVisayas 0.1121 0.0497 2.2574 0.0241 0.0156 0.2103

Poisson vs. Quasi-Poisson models

Poisson
term estimate std.error
(Intercept) -0.3843 0.1821
age 0.0704 0.0069
I(age^2) -0.0007 0.0001
locationDavaoRegion -0.0194 0.0538
locationIlocosRegion 0.0610 0.0527
locationMetroManila 0.0545 0.0472
locationVisayas 0.1121 0.0417
Quasi-Poisson
estimate std.error
-0.3843 0.2166
0.0704 0.0082
-0.0007 0.0001
-0.0194 0.0640
0.0610 0.0626
0.0545 0.0561
0.1121 0.0497

Quasi-Poisson: Inference for coefficients

term estimate std.error
(Intercept) -0.3843 0.2166
age 0.0704 0.0082
I(age^2) -0.0007 0.0001
locationDavaoRegion -0.0194 0.0640
locationIlocosRegion 0.0610 0.0626
locationMetroManila 0.0545 0.0561
locationVisayas 0.1121 0.0497
Test statistic

\[t = \frac{\hat{\beta} - 0}{SE_{Q}(\hat{\beta})} \sim t_{n-p}\]

Quasi-Poisson model

term estimate std.error statistic p.value conf.low conf.high
(Intercept) -0.3843 0.2166 -1.7744 0.0762 -0.8134 0.0358
age 0.0704 0.0082 8.5665 0.0000 0.0544 0.0866
I(age^2) -0.0007 0.0001 -9.2000 0.0000 -0.0009 -0.0006
locationDavaoRegion -0.0194 0.0640 -0.3030 0.7619 -0.1451 0.1058
locationIlocosRegion 0.0610 0.0626 0.9735 0.3304 -0.0620 0.1837
locationMetroManila 0.0545 0.0561 0.9703 0.3320 -0.0552 0.1649
locationVisayas 0.1121 0.0497 2.2574 0.0241 0.0156 0.2103

Application exercise

Negative binomial regression model

Negative binomial regression model

Another approach to handle overdispersion is to use a negative binomial regression model

  • This has more flexibility than the quasi-Poisson model, because there is a new parameter in addition to \(\lambda\)


Let \(Y\) be a negative binomial random variable, \(Y\sim NegBinom(r, p)\), then

\[\begin{align}P(Y = y_i) = {y_i + r - 1 \choose r - 1}(1-p)^{y_i}p^r \hspace{5mm} y_i = 0, 1, 2, \ldots, \infty \\ E(Y) = \frac{r(1-p)}{p} \hspace{8mm} SD(Y) = \sqrt{\frac{r(1-p)}{p^2}}\end{align}\]

Negative binomial regression model

  • Main idea: Generate a \(\lambda\) for each observation (household) and generate a count using the Poisson random variable with parameter \(\lambda\)

    • Makes the counts more dispersed than with a single parameter
  • Think of it as a Poisson model such that \(\lambda\) is also random \[\begin{aligned} &\text{If }\hspace{2mm} Y|\lambda \sim Poisson(\lambda)\\ &\text{ and } \lambda \sim Gamma\bigg(r, \frac{1-p}{p}\bigg)\\ &\text{ then } Y \sim NegBinom(r, p)\end{aligned}\]

Negative binomial regression in R

Use the glm.nb function in the MASS R package.


Caution

The MASS package has a select function that conflicts with the select function in dplyr. You can avoid this by (1) always loading tidyverse after MASS, or (2) use MASS::glm.nb instead of loading the package.

Negative binomial regression in R

hh_age_loc_nb <- MASS::glm.nb(total ~ age + I(age^2) + location, data = hh_data)
tidy(hh_age_loc_nb) |> 
  kable(digits = 4)
term estimate std.error statistic p.value
(Intercept) -0.3753 0.2076 -1.8081 0.0706
age 0.0699 0.0079 8.8981 0.0000
I(age^2) -0.0007 0.0001 -9.5756 0.0000
locationDavaoRegion -0.0219 0.0625 -0.3501 0.7262
locationIlocosRegion 0.0577 0.0615 0.9391 0.3477
locationMetroManila 0.0562 0.0551 1.0213 0.3071
locationVisayas 0.1104 0.0487 2.2654 0.0235

Negative binomial vs. Quasi-Poisson

Quasi-Poisson
term estimate std.error
(Intercept) -0.3843 0.2166
age 0.0704 0.0082
I(age^2) -0.0007 0.0001
locationDavaoRegion -0.0194 0.0640
locationIlocosRegion 0.0610 0.0626
locationMetroManila 0.0545 0.0561
locationVisayas 0.1121 0.0497
Negative binomial
estimate std.error
-0.3753 0.2076
0.0699 0.0079
-0.0007 0.0001
-0.0219 0.0625
0.0577 0.0615
0.0562 0.0551
0.1104 0.0487

References

Roback, Paul, and Julie Legler. 2021. Beyond multiple linear regression: applied generalized linear models and multilevel models in R. CRC Press.