Jan 24, 2024
Describe properties of the Poisson random variable
Write the mathematical equation of the Poisson regression model
Describe how the Poisson regression differs from least-squares regression
Interpret the coefficients for the Poisson regression model
Compare two Poisson regression models
Does the number of employers conducting on-campus interviews during a year differ for public and private colleges?
Does the daily number of asthma-related visits to an Emergency Room differ depending on air pollution indices?
Does the number of paint defects per square foot of wall differ based on the years of experience of the painter?
Each response variable is a count per a unit of time or space.
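Let $Y$ be the number of events in a given unit of time or space. Then $Y$ can be modeled using a Poisson distribution:

$$P(Y = y) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \quad y = 0, 1, 2, \ldots$$

Features
$E(Y) = Var(Y) = \lambda$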
| | Mean | Variance |
|---|---|---|
| lambda = 1 | 0.99351 | 0.9902178 |
| lambda = 5 | 4.99367 | 4.9865798 |
| lambda = 50 | 49.99288 | 49.8962683 |
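
Values like those in the table can be approximated by simulation. The sketch below is illustrative only: the seed and the number of draws are assumptions, not the settings used to produce the table, so the numbers will differ slightly.

```r
# Simulation sketch: sample mean and variance of Poisson draws.
# The seed and n_draws are arbitrary choices (assumptions).
set.seed(210)
n_draws <- 100000
lambdas <- c(1, 5, 50)

# One set of draws per value of lambda
draws <- lapply(lambdas, function(l) rpois(n_draws, lambda = l))

sim_summary <- data.frame(
  lambda   = lambdas,
  mean     = sapply(draws, mean),
  variance = sapply(draws, var)
)
sim_summary
```

In each row the sample mean and sample variance should both land close to the value of lambda.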
The annual number of earthquakes registering at least 2.5 on the Richter Scale and having an epicenter within 40 miles of downtown Memphis follows a Poisson distribution with mean 6.5. What is the probability there will be 3 or fewer such earthquakes next year?
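
One way to check this calculation in R is with the Poisson cumulative distribution function `ppois()`. The sketch also sums the probability mass function by hand for comparison; the variable names are illustrative.

```r
# P(Y <= 3) when Y ~ Poisson(6.5)
lambda <- 6.5

# Using the built-in cumulative distribution function
p_at_most_3 <- ppois(3, lambda = lambda)

# Summing P(Y = y) = exp(-lambda) * lambda^y / y! for y = 0, 1, 2, 3
p_by_hand <- sum(exp(-lambda) * lambda^(0:3) / factorial(0:3))

c(ppois = p_at_most_3, by_hand = p_by_hand)
```

Both approaches give a probability of about 0.112.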
The data fHH1.csv come from the 2015 Family Income and Expenditure Survey conducted by the Philippine Statistics Authority.
Goal: Understand the association between household size and various characteristics of the household
Response:
total: Number of people in the household other than the head
Predictors:
location: Where the house is located
age: Age of the head of household
roof: Type of roof on the residence (proxy for wealth)
Other variables:
numLT5: Number in the household under 5 years old

| location | age | total | numLT5 | roof |
|---|---|---|---|---|
| CentralLuzon | 65 | 0 | 0 | Predominantly Strong Material |
| MetroManila | 75 | 3 | 0 | Predominantly Strong Material |
| DavaoRegion | 54 | 4 | 0 | Predominantly Strong Material |
| Visayas | 49 | 3 | 0 | Predominantly Strong Material |
| MetroManila | 74 | 3 | 0 | Predominantly Strong Material |
| mean | var |
|---|---|
| 3.685 | 5.534 |
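The goal is to model $\lambda$, the expected number of people in the household (other than the head), as a function of the predictors
We might be tempted to try a linear model

$$\lambda_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$$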
This model won’t work because…
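If $Y_i \sim \text{Poisson}$ with $\lambda = \lambda_i$ for the given values $x_{i1}, \ldots, x_{ip}$, then

$$\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$$

Each observation can have a different value of $\lambda$ based on its values of the predictors. Because $\lambda$ determines both the mean and the variance, there is no separate error term to estimate.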

Poisson response: The response variable is a count per unit of time or space, described by a Poisson distribution, at each level of the predictor(s)
Independence: The observations must be independent of one another
Mean = Variance: The mean must equal the variance
Linearity: The log of the mean rate, $\log(\lambda)$, must be a linear function of the predictor(s)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 1.5499 | 0.0503 | 30.8290 | 0 |
| age | -0.0047 | 0.0009 | -5.0258 | 0 |
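
The estimates in the table correspond to the fitted equation $\log(\hat{\lambda}) = 1.5499 - 0.0047 \times age$. As a rough sketch, the predicted mean household size at a given age is the exponential of the linear predictor. The ages below are illustrative choices, and the coefficients are the rounded values from the table.

```r
# Rounded coefficients taken from the table above
b0 <- 1.5499
b1 <- -0.0047

# Illustrative ages for the head of household (assumption)
ages <- c(30, 50, 70)

log_mean  <- b0 + b1 * ages   # linear predictor, on the log scale
pred_mean <- exp(log_mean)    # predicted mean household size

data.frame(age = ages, log_mean = log_mean, pred_mean = pred_mean)
```

Because the model is linear on the log scale, the predictions are always positive, and here they decrease as age increases.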
The coefficient for age is -0.0047. Interpret this coefficient in context. Select all that apply.
The mean household size is predicted to decrease by 0.0047 for each year older the head of the household is.
The mean household size is predicted to multiply by a factor of 0.9953 for each year older the head of the household is.
The mean household size is predicted to decrease by 0.9953 for each year older the head of the household is.
The mean household size is predicted to multiply by a factor of 0.47% for each year older the head of the household is.
The mean household size is predicted to decrease by 0.47% for each year older the head of the household is.
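
When checking the options, it can help to move the coefficient from the log scale to the scale of the mean. A minimal sketch, using the rounded coefficient from the table:

```r
b1 <- -0.0047

# Multiplicative change in the mean for a one-year increase in age
multiplier <- exp(b1)

# The same change expressed as a percent change in the mean
pct_change <- (multiplier - 1) * 100

c(multiplier = multiplier, pct_change = pct_change)
```

The multiplier is about 0.9953, which corresponds to a change of about -0.47% in the mean per additional year of age.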
Is the coefficient of age statistically significant?

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 1.5499 | 0.0503 | 30.8290 | 0 | 1.4512 | 1.6482 |
| age | -0.0047 | 0.0009 | -5.0258 | 0 | -0.0065 | -0.0029 |
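Test statistic

$$Z = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = -5.0258 \text{ (from the table, computed using unrounded estimates)}$$

P-value

$$P(|Z| > |-5.0258|) \approx 0$$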
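
The p-value can be reproduced from the test statistic reported in the table. This sketch assumes a two-sided test based on the standard normal distribution.

```r
# Test statistic for age, taken from the table
z <- -5.0258

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))
p_value
```

The result is far below 0.001, which is why the table shows it as 0 after rounding.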
What is a plausible range of values for the coefficient of age?

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 1.5499 | 0.0503 | 30.8290 | 0 | 1.4512 | 1.6482 |
| age | -0.0047 | 0.0009 | -5.0258 | 0 | -0.0065 | -0.0029 |
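95% confidence interval for the coefficient of age

$$\hat{\beta}_1 \pm z^{*} \times SE(\hat{\beta}_1)$$

where $z^{*}$ is the critical value from the $N(0, 1)$ distribution

$$-0.0047 \pm 1.96 \times 0.0009 = (-0.0065, -0.0029)$$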
Interpret the interval in terms of the change in mean household size.
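
To interpret the interval on the scale of the mean, exponentiate both endpoints. A short sketch using the rounded endpoints from the table:

```r
# Confidence interval for the coefficient of age (log scale), from the table
ci_coef <- c(lower = -0.0065, upper = -0.0029)

# Multiplicative change in mean household size per additional year of age
ci_mult <- exp(ci_coef)

# The same endpoints expressed as percent changes
ci_pct <- (ci_mult - 1) * 100

rbind(multiplier = ci_mult, pct_change = ci_pct)
```

The endpoints are roughly 0.9935 and 0.9971, so the interval describes a decrease in mean household size of about 0.29% to 0.65% for each additional year of age.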
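Add a quadratic term for age

library(tidyverse)
library(broom)
library(knitr)
hh_data <- hh_data |>
  mutate(age2 = age * age)  # squared age term
model2 <- glm(total ~ age + age2, data = hh_data, family = poisson)
tidy(model2, conf.int = TRUE) |>  # coefficients with 95% confidence intervals
  kable(digits = 4)

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |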
|---|---|---|---|---|---|---|
| (Intercept) | -0.3325 | 0.1788 | -1.8594 | 0.063 | -0.6863 | 0.0148 |
| age | 0.0709 | 0.0069 | 10.2877 | 0.000 | 0.0575 | 0.0845 |
| age2 | -0.0007 | 0.0001 | -11.0578 | 0.000 | -0.0008 | -0.0006 |
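
With a negative coefficient on age2, the fitted log mean is a downward-opening parabola in age, so the predicted household size rises and then falls. As a rough sketch, the turning point is at $-\hat{\beta}_1 / (2\hat{\beta}_2)$. The coefficients below are the rounded values from the table, and the age2 coefficient has only one significant digit, so the result is approximate.

```r
# Rounded coefficients from the table above
b_age  <- 0.0709
b_age2 <- -0.0007

# Age at which the fitted log mean (and therefore the fitted mean) peaks
peak_age <- -b_age / (2 * b_age2)
peak_age
```

This puts the peak at roughly age 50, subject to the rounding noted above.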
We can determine whether to keep age2 in the model in two ways:
1️⃣ Use the p-value (or confidence interval) for the coefficient (since we are adding a single term to the model)
2️⃣ Conduct a drop-in-deviance test
A deviance is a way to measure how the observed data differs (deviates) from the model predictions.
It’s a measure of unexplained variability in the response variable (similar to SSE in linear regression)
Lower deviance means the model is a better fit to the data
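We can calculate the “deviance residual” for each observation in the data (more on the formula later). Let $d_i$ be the deviance residual for the $i^{th}$ observation; then the deviance of the model is $D = \sum_{i=1}^{n} d_i^2$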
Note: Deviance is also known as the “residual deviance”
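
The relationship between deviance residuals and the residual deviance can be checked numerically. The sketch below uses simulated data, not the household data; the seed, sample size, and true coefficients are all arbitrary assumptions.

```r
# Simulated example (not the household data)
set.seed(310)
n <- 200
x <- runif(n, min = 20, max = 80)
y <- rpois(n, lambda = exp(1.5 - 0.005 * x))

fit <- glm(y ~ x, family = poisson)

# Deviance residual for each observation
dev_resid <- residuals(fit, type = "deviance")

# Sum of squared deviance residuals vs. the residual deviance of the model
sum_sq <- sum(dev_resid^2)
c(sum_of_squares = sum_sq, residual_deviance = deviance(fit))
```

The two values agree, which shows that the residual deviance is the sum of the squared deviance residuals.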
We can use a drop-in-deviance test to compare two models. To conduct the test
1️⃣ Compute the deviance for each model
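2️⃣ Calculate the drop in deviance

$$\text{drop-in-deviance} = \text{Deviance(reduced model)} - \text{Deviance(larger model)}$$

3️⃣ Given the reduced model is the true model ($H_0$ true), then approximately

$$\text{drop-in-deviance} \sim \chi^2_d$$

where $d$ is the difference in the degrees of freedom between the two models (i.e., the difference in the number of terms)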
| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 1498 | 2337.089 | NA | NA | NA |
| 1497 | 2200.944 | 1 | 136.145 | 0 |
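
The numbers in this table can be reproduced by hand from the two residual deviances. A minimal sketch, using the values shown above:

```r
# Residual deviances and residual degrees of freedom from the table
dev_reduced <- 2337.089   # model without age2
dev_larger  <- 2200.944   # model with age2
df_diff     <- 1498 - 1497

# Drop in deviance and its chi-squared p-value (upper tail)
drop_in_dev <- dev_reduced - dev_larger
p_value <- pchisq(drop_in_dev, df = df_diff, lower.tail = FALSE)

c(drop_in_deviance = drop_in_dev, p_value = p_value)
```

In practice a table like this would typically come from something like `anova(model1, model2, test = "Chisq")`, where `model1` is a hypothetical name for the model that contains only age.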
Add location to the model?

Suppose we want to add location to the model, so we compare the following models:
Model A:
Model B:
Which of the following are reliable ways to determine if location should be added to the model?

