STA 310 - Spring 2024 - Poisson Regression

Data: Airbnbs in NYC

The data set NYCairbnb-sample.csv contains information about a random sample of 1000 Airbnbs in New York City. It is a subset of the data on 40628 Airbnbs scraped by Awad, Lebo, and Linden (2017).¹

Variables

number_of_reviews: Number of reviews for the unit on Airbnb (proxy for number of rentals)
price: Price per night in US dollars
room_type: Entire home/apartment, private room, or shared room
days: Number of days the unit has been listed (date when info scraped - date when unit first listed on Airbnb)

Goal: Use the price and room type of Airbnbs to describe variation in the number of reviews (a proxy for number of rentals).

Data: Airbnbs in NYC

library(tidyverse)  # provides read_csv()
airbnb <- read_csv("data/NYCairbnb-sample.csv")

id	number_of_reviews	days	room_type	price
15756544	16	1144	Private room	120
14218251	15	471	Private room	89
21644	0	2600	Private room	89
13667835	1	283	Entire home/apt	150
265912	0	1970	Entire home/apt	89

EDA

Overall

mean	var
15.916	765.969

By room type

room_type	mean	var
Entire home/apt	16.283	760.348
Private room	15.608	786.399
Shared room	15.028	605.971
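
Summaries like the ones above can be computed in a few lines of base R. The sketch below is only illustrative: it uses the five preview rows shown earlier rather than the full sample of 1000, so its numbers will not match the tables, and `airbnb_preview` is a hypothetical name. The variance-to-mean ratio is included because a Poisson model assumes the variance equals the mean.

```r
# Toy data: the five preview rows from the data slide (not the full sample)
airbnb_preview <- data.frame(
  number_of_reviews = c(16, 15, 0, 1, 0),
  days = c(1144, 471, 2600, 283, 1970),
  room_type = c("Private room", "Private room", "Private room",
                "Entire home/apt", "Entire home/apt"),
  price = c(120, 89, 89, 150, 89)
)

# Overall mean and variance of the response
overall <- c(mean = mean(airbnb_preview$number_of_reviews),
             var  = var(airbnb_preview$number_of_reviews))

# Ratio of variance to mean; values far above 1 suggest overdispersion
var_to_mean <- overall[["var"]] / overall[["mean"]]

# Mean and variance of the response within each room type
by_room <- aggregate(number_of_reviews ~ room_type, data = airbnb_preview,
                     FUN = function(x) c(mean = mean(x), var = var(x)))

print(overall)
print(var_to_mean)
print(by_room)
```

Applied to the full data, the same calls would be run on `airbnb` instead of the toy data frame.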

Considerations for modeling

We would like to fit the Poisson regression model

$\log(\lambda_i) = \beta_0 + \beta_1\,\text{price}_i + \beta_2\,\text{room\_type1}_i + \beta_3\,\text{room\_type2}_i$

Based on the EDA, what are some potential issues we may want to address in the model building?
Suppose any model fit issues are addressed. What are some potential limitations to the conclusions and interpretations from the model?

Offset

Sometimes counts are not directly comparable because the observations differ based on some characteristic directly related to the counts, i.e. the sampling effort.
An offset can be used to adjust for differences in sampling effort.

Let $x_{\text{offset}}$ be the variable that accounts for differences in sampling effort; then $\log(x_{\text{offset}})$ is added to the model.

$\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \log(x_{\text{offset},i})$

The offset is a term in the model with coefficient always equal to 1.

Adding an offset to the Airbnb model

We will add the offset $\log(\text{days})$ to the model. This accounts for the fact that we would expect Airbnbs that have been listed longer to have more reviews.

$\log(\lambda_i) = \beta_0 + \beta_1\,\text{price}_i + \beta_2\,\text{room\_type1}_i + \beta_3\,\text{room\_type2}_i + \log(\text{days}_i)$

Note: The response variable for the model is still $\log(\lambda_i)$, the log mean number of reviews.

Detail on the offset

We want to adjust for the number of days, so we are interested in $\frac{\text{reviews}}{\text{days}}$.

Given $\lambda$ is the mean number of reviews

$\log\left(\frac{\lambda_i}{\text{days}_i}\right) = \beta_0 + \beta_1\,\text{price}_i + \beta_2\,\text{room\_type1}_i + \beta_3\,\text{room\_type2}_i$

$\Rightarrow \log(\lambda_i) - \log(\text{days}_i) = \beta_0 + \beta_1\,\text{price}_i + \beta_2\,\text{room\_type1}_i + \beta_3\,\text{room\_type2}_i$

$\Rightarrow \log(\lambda_i) = \beta_0 + \beta_1\,\text{price}_i + \beta_2\,\text{room\_type1}_i + \beta_3\,\text{room\_type2}_i + \log(\text{days}_i)$
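
As one further worked step (not part of the original derivation), exponentiating both sides of the last line shows how the offset acts on the mean itself:

```latex
\lambda_i = \text{days}_i \times \exp\left(\beta_0 + \beta_1\,\text{price}_i + \beta_2\,\text{room\_type1}_i + \beta_3\,\text{room\_type2}_i\right)
```

Holding price and room type fixed, the expected number of reviews is proportional to the number of days listed, and the exponentiated linear predictor is the expected number of reviews per day.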

Airbnb model in R

airbnb_model <- glm(number_of_reviews ~ price + room_type, 
                    data = airbnb, family = poisson, 
                    offset = log(days))

term	estimate	std.error	statistic
(Intercept)	-4.1351	0.0170	-243.1397
price	-0.0005	0.0001	-7.0952
room_typePrivate room	-0.0994	0.0174	-5.6986
room_typeShared room	0.2436	0.0452	5.3841

The coefficient for $\log(\text{days})$ is fixed at 1, so it is not in the model output.
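
As a minimal sketch of how the offset can be specified, the code below fits the same kind of model in two ways: with the `offset` argument of `glm()` and with an `offset()` term in the formula. The data are simulated, not the Airbnb sample, and every name and parameter value in the simulation (`sim`, the true rate, the seed) is an assumption made for illustration.

```r
set.seed(310)
n <- 200

# Simulated listings (hypothetical values, not the Airbnb data)
sim <- data.frame(
  price = runif(n, 30, 300),
  room_type = factor(sample(c("Entire home/apt", "Private room", "Shared room"),
                            n, replace = TRUE)),
  days = sample(30:2500, n, replace = TRUE)
)

# Assumed true reviews-per-day rate; expected count scales with days listed
rate <- exp(-4 - 0.001 * sim$price)
sim$number_of_reviews <- rpois(n, lambda = rate * sim$days)

# Offset supplied through the `offset` argument
fit_arg <- glm(number_of_reviews ~ price + room_type,
               data = sim, family = poisson,
               offset = log(days))

# Offset supplied as a term in the formula
fit_formula <- glm(number_of_reviews ~ price + room_type + offset(log(days)),
                   data = sim, family = poisson)

# Both fits estimate the same coefficients; no coefficient is reported
# for the offset because it is fixed at 1
print(cbind(arg = coef(fit_arg), formula = coef(fit_formula)))
```

The two specifications give the same fit. One practical difference is that the formula version keeps the offset visible in the model formula.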

Interpretations

Interpret the coefficient of price
Interpret the coefficient of room_typePrivate room
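
One way to approach these interpretations is to exponentiate the estimates, because the model is linear on the log scale. The sketch below uses the rounded estimates from the model output; the vector names are illustrative. With the offset in the model, each exponentiated coefficient is a multiplicative change in the expected number of reviews per day listed.

```r
# Rounded estimates copied from the Poisson model output
est <- c(price = -0.0005, private_room = -0.0994, shared_room = 0.2436)

# Multiplicative change in the expected reviews per day:
# for price, per additional dollar; for room types, relative to the
# baseline room type (entire home/apt), holding the other predictor fixed
rate_ratio <- exp(est)

# The same quantities expressed as percent changes
pct_change <- 100 * (rate_ratio - 1)

print(round(rate_ratio, 4))
print(round(pct_change, 2))
```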

Quasi-Poisson model

airbnb_model_q <- glm(number_of_reviews ~ price + room_type, 
                      data = airbnb, family = quasipoisson, 
                      offset = log(days))

summary(airbnb_model_q)


Call:
glm(formula = number_of_reviews ~ price + room_type, family = quasipoisson, 
    data = airbnb, offset = log(days))

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -4.1350727  0.1159506 -35.662   <2e-16 ***
price                 -0.0004914  0.0004722  -1.041    0.298    
room_typePrivate room -0.0993582  0.1188728  -0.836    0.403    
room_typeShared room   0.2435939  0.3084581   0.790    0.430    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 46.48268)

    Null deviance: 31550  on 999  degrees of freedom
Residual deviance: 31379  on 996  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 6

Quasi-Poisson model

library(broom)  # provides tidy()
library(knitr)  # provides kable()
tidy(airbnb_model_q) |>
  kable(digits = 4)

term	estimate	std.error	statistic	p.value
(Intercept)	-4.1351	0.1160	-35.6624	0.0000
price	-0.0005	0.0005	-1.0407	0.2983
room_typePrivate room	-0.0994	0.1189	-0.8358	0.4034
room_typeShared room	0.2436	0.3085	0.7897	0.4299
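
The quasi-Poisson estimates match the Poisson estimates, while each standard error is inflated by the square root of the dispersion parameter. In the output above, $\sqrt{46.48} \approx 6.82$ and $6.82 \times 0.0170 \approx 0.116$, the quasi-Poisson standard error of the intercept. The sketch below checks this relationship on simulated overdispersed counts. The data, seed, and parameter values are assumptions for illustration, not the Airbnb sample.

```r
set.seed(310)
n <- 500

# Simulated predictors and exposure (hypothetical values)
price <- runif(n, 30, 300)
days <- sample(30:2500, n, replace = TRUE)
mu <- days * exp(-4 - 0.001 * price)

# Negative binomial counts: variance exceeds the mean (overdispersion)
reviews <- rnbinom(n, size = 0.5, mu = mu)

fit_pois <- glm(reviews ~ price, family = poisson, offset = log(days))
fit_quasi <- glm(reviews ~ price, family = quasipoisson, offset = log(days))

# Dispersion estimate: Pearson chi-square statistic / residual df
phi_hat <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

# Ratio of quasi-Poisson to Poisson standard errors
se_pois <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]
se_ratio <- se_quasi / se_pois

print(phi_hat)
print(summary(fit_quasi)$dispersion)
print(se_ratio)
print(sqrt(phi_hat))
```

The Pearson-based `phi_hat` should agree with the dispersion reported by `summary()` up to the numerical tolerance of the fitting algorithm.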

term	type	estimate	std.error	statistic	p.value
(Intercept)	zero	-0.604	0.311	-1.938	0.053
first_year	zero	1.136	0.610	1.864	0.062

term	type	estimate	std.error	statistic	p.value
(Intercept)	count	0.754	0.144	5.238	0.000
off_campus	count	0.416	0.206	2.020	0.043
sexm	count	1.021	0.175	5.827	0.000