HW 02: Poisson regression

library(tidyverse)
library(knitr)
library(kableExtra)
Important
  • This assignment is due on Wednesday, February 7 at 11:59pm, with a grace period (i.e., no late penalty) until noon (12pm) on Thursday, February 8.
    • Your access to the repo will be removed at the end of the grace period. If you wish to submit the HW late, please email me and I will extend your access to the repo.
    • You will have access to your HW repo again when grades are returned.

Instructions

  • Write all narrative using full sentences. Write all interpretations and conclusions in the context of the data.
  • Be sure all analysis code is displayed in the rendered pdf.
  • If you are fitting a model, display the model output in a neatly formatted table. (The tidy and kable functions can help!)
  • If you are creating a plot, use clear and informative labels and titles.
  • Render, commit, and push your work to GitHub regularly, at least after each exercise. Write short and informative commit messages.
  • When you’re done, we should be able to render the final version of the Quarto document in your GitHub repo to fully reproduce your pdf.
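
As a minimal sketch of the kind of formatted model table described above: tidy() comes from the broom package, which is assumed to be installed (it is not loaded in the setup code), and the built-in mtcars data is used purely as a stand-in for an assignment data set.

```r
library(broom)  # provides tidy(); assumed installed, not loaded in the setup code
library(knitr)  # provides kable()

# Stand-in example: model the number of carburetors in the built-in mtcars data
fit <- glm(carb ~ wt, family = poisson, data = mtcars)

# One row per coefficient, including 95% confidence intervals
fit_table <- tidy(fit, conf.int = TRUE, conf.level = 0.95)

# Render as a formatted table in the knitted document
kable(fit_table, digits = 3)
```

The digits argument controls rounding in the rendered table only; fit_table itself keeps full precision.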

Exercises

Note

All exercises in this assignment were adapted from exercises in Chapter 4 of Roback and Legler (2021).

Exercise 1

Answer parts a - d in the context of the following study:

A state wildlife biologist collected data from 250 park visitors as they left at the end of their stay. Each was asked to report the number of fish they caught during their one-week stay. On average, visitors caught 21.5 fish per week.

  1. Define the response.

  2. What are the possible values for the response?

  3. What does \(\lambda\) represent?

  4. Should a zero-inflated model be considered here? If so, what would be a “true zero”?

Exercise 2

Brockmann (1996) carried out a study of nesting female horseshoe crabs. Female horseshoe crabs often have male crabs, known as satellites, attached to their nests. One objective of the study was to determine which characteristics of the female were associated with the number of satellites. Of particular interest is the relationship between the width of the female carapace and the number of satellites.

The data can be found in crab.csv in the data folder. It includes the following variables:

  • Satellite = number of satellites
  • Width = carapace width (cm)
  • Weight = weight (kg)
  • Spine = spine condition (1 = both good, 2 = one worn or broken, 3 = both worn or broken)
  • Color = color (1 = light medium, 2 = medium, 3 = dark medium, 4 = dark)

Make sure to convert Spine and Color to the appropriate data types in R before doing the analysis.
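
One way to do this conversion is sketched below, assuming that the appropriate type for a coded categorical variable is a factor. The data frame crab_toy and its values are made up for illustration; only the level labels come from the variable list above.

```r
# Toy data frame standing in for crab.csv (values are made up for illustration)
crab_toy <- data.frame(
  Spine = c(1, 3, 2, 3),
  Color = c(2, 4, 1, 3)
)

# Recode the numeric codes as labeled factors using the variable list above
crab_toy$Spine <- factor(
  crab_toy$Spine,
  levels = 1:3,
  labels = c("both good", "one worn or broken", "both worn or broken")
)
crab_toy$Color <- factor(
  crab_toy$Color,
  levels = 1:4,
  labels = c("light medium", "medium", "dark medium", "dark")
)

# Check the resulting column types and levels
str(crab_toy)
```

The first level listed becomes the baseline category when the factor is used as a predictor in a regression model.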

  1. Create a histogram of Satellite. Is there preliminary evidence the number of satellites could be modeled as a Poisson response? Briefly explain.

  2. Fit a Poisson regression model including Width, Weight, and Spine as predictors. Display the model with the 95% confidence interval for each coefficient.

  3. Describe the effect of Spine in terms of the mean number of satellites.
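
The general workflow of fitting a Poisson regression and reading a categorical predictor as a multiplicative effect on the mean can be sketched as follows. The data here are simulated (not the crab data), the variable names x, group, and y are hypothetical, and Wald intervals from confint.default() are used for simplicity; profile-likelihood intervals are another reasonable choice.

```r
set.seed(1)

# Simulated stand-in data (not the crab data): a count response, a numeric
# predictor, and a three-level factor
n <- 300
sim <- data.frame(
  x = rnorm(n),
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
group_effect <- c(a = 0, b = 0.3, c = -0.2)
sim$y <- rpois(
  n,
  lambda = exp(0.5 + 0.4 * sim$x + group_effect[as.character(sim$group)])
)

fit_pois <- glm(y ~ x + group, family = poisson, data = sim)

# Wald 95% confidence intervals on the log scale
ci_log <- confint.default(fit_pois, level = 0.95)

# Exponentiate to get multiplicative effects on the mean count
rate_ratios <- exp(cbind(estimate = coef(fit_pois), ci_log))
round(rate_ratios, 3)
```

Each exponentiated factor coefficient compares the mean count for that level to the mean count for the baseline level, holding the other predictors fixed.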

Exercise 3

Use the scenario from the previous exercise to answer questions (a) - (d).

  1. We would like to fit a quasi-Poisson regression model for these data. Briefly explain why we may want to consider doing so.

  2. Fit a quasi-Poisson regression model that corresponds with the model chosen in the previous exercise. Display the model.

  3. What is the estimated dispersion parameter? Show how this value is calculated.

  4. How do the estimated coefficients change compared to the model chosen in the previous exercise? How do the standard errors change?
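
For reference, the relationship between a Poisson fit and its quasi-Poisson counterpart can be sketched on simulated data. The data below are drawn from a negative binomial distribution purely to create overdispersion; they are an assumption for illustration and not the crab data.

```r
set.seed(2)

# Simulated overdispersed counts, standing in for real data
n <- 500
x <- runif(n)
y <- rnbinom(n, size = 2, mu = exp(1 + 0.5 * x))

fit_pois  <- glm(y ~ x, family = poisson)
fit_quasi <- glm(y ~ x, family = quasipoisson)

# Dispersion estimate: sum of squared Pearson residuals over residual df
phi_hat <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

phi_hat
summary(fit_quasi)$dispersion  # the value glm() reports for the quasi-Poisson fit

# Compare standard errors between the two fits
se_pois  <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]
se_quasi / se_pois
```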

Exercise 4

The goal of this exercise is to use simulation to understand the equivalence between a gamma-Poisson mixture and a negative binomial distribution.

Tip

Remember to set a seed so your simulations are reproducible!

  1. Use the R function rpois() to generate 10,000 \(x_i\) from a regular Poisson distribution, \(X \sim \textrm{Poisson}(\lambda=1.5)\). Plot a histogram of this distribution and note its mean and variance. Next, let \(Y \sim \textrm{Gamma}(r = 3, \lambda = 2)\) and use rgamma() to generate 10,000 random \(y_i\) from this distribution.

    Now, consider 10,000 different Poisson distributions where \(\lambda_i = y_i\). Randomly generate one \(z_i\) from each Poisson distribution. Plot a histogram of these \(z_i\) and compare it to your original histogram of \(X\) (where \(X \sim \textrm{Poisson}(1.5)\)). How do the means and variances compare?

  2. A negative binomial distribution can actually be expressed as a gamma-Poisson mixture. In Part a, you looked at a gamma-Poisson mixture \(Z \sim \textrm{Poisson}(\lambda)\) where \(\lambda \sim \textrm{Gamma}(r = 3, \lambda' = 2)\).

    Find the parameters of a negative binomial distribution \(X \sim \textrm{NegBinom}(r, p)\) such that \(X\) is equivalent to \(Z\). As a hint, the means of both distributions must be the same, so \(\frac{r(1-p)}{p} = \frac{3}{2}\).

    Show through histograms and summary statistics that your negative binomial distribution is equivalent to the gamma-Poisson mixture. You can use rnbinom() in R.

  3. Make an argument that if you want a \(\textrm{NegBinom}(r, p)\) random variable, you can instead sample from a Poisson distribution, where the \(\lambda\) values are themselves sampled from a gamma distribution with parameters \(r\) and \(\lambda' = \frac{p}{1-p}\).
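
The mechanics of drawing from a gamma-Poisson mixture can be sketched as follows. The shape and rate values are illustrative assumptions chosen to differ from the exercise, and rgamma() is called with its shape/rate parameterization, which is assumed to correspond to the \((r, \lambda)\) notation above.

```r
set.seed(3)
n <- 10000

# Illustrative parameters only (deliberately different from the exercise)
shape <- 2
rate  <- 0.5

# Step 1: draw one Poisson rate per observation from the gamma distribution
lambda_i <- rgamma(n, shape = shape, rate = rate)

# Step 2: draw one Poisson count for each of those rates
z <- rpois(n, lambda = lambda_i)

# For comparison: a plain Poisson with the same mean (shape / rate)
x <- rpois(n, lambda = shape / rate)

c(mean_mixture = mean(z), var_mixture = var(z),
  mean_poisson = mean(x), var_poisson = var(x))
```

Because rpois() is vectorized over lambda, passing the vector lambda_i produces one draw from each of the n different Poisson distributions.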

Exercise 5

Chapp et al. (2018) scraped every issue statement from webpages of candidates for the U.S. House of Representatives, counting the number of issues candidates commented on and scoring the level of ambiguity of each statement. We will focus on the issue counts and on determining which attributes (of both the district as a whole and the candidates themselves) are associated with candidate silence (commenting on 0 issues) and with a willingness to comment on a greater number of issues. The data set is in ambiguity.csv in the data folder. This analysis will focus on the following variables:

  • name: candidate name
  • distID: unique identification number for Congressional district
  • ideology: candidate left-right orientation
  • democrat: 1 if Democrat, 0 if Republican
  • totalIssuePages: number of issues candidates commented on (response)

See Roback and Legler (2021) for the full list of variables.

We will use a hurdle model to analyze the data. A hurdle model is similar to a zero-inflated Poisson model, but it treats the zeros differently. A zero-inflated model assumes that “zeros” are composed of two distinct groups (those who would always be 0 and those who happen to be 0 on this occasion), whereas the hurdle model assumes that “zeros” are a single entity. Therefore, in a hurdle model, cases are classified as either “zeros” or “non-zeros”, where “non-zeros” hurdle the 0 threshold: they must always have counts of 1 or above.

We will use the hurdle function from the pscl package to fit the hurdle model. Note that coefficients in the “zero hurdle model” section of the output relate predictors to the log-odds of being a non-zero (i.e., having at least one issue statement). This is the opposite of the ZIP model, whose zero-inflation coefficients relate predictors to the log-odds of being an “always zero”.

  1. Visualize the distribution of the response variable totalIssuePages. Why might we consider using a hurdle model compared to a Poisson model? Why is a zero-inflated Poisson model not appropriate in this scenario?

  2. Create a plot of the empirical log odds of having at least one issue statement by ideology. You may want to group ideology values first. What can you conclude from this plot?

  3. Create a hurdle model with ideology and democrat as predictors in both parts (see footnote 1). Display the model. Interpret ideology in both parts of the model.

  4. Repeat part (c), but include an interaction in both parts. Interpret the interaction in the zero hurdle part of the model.
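
The hurdle() call can be sketched on simulated data as follows, using the dist and zero.dist arguments named in the footnote. The data, the single predictor x, and the helper draw_positive() are hypothetical and exist only to make the example self-contained; the pscl package is assumed to be installed.

```r
library(pscl)  # provides hurdle(); assumed installed

set.seed(4)
n <- 1000
x <- rnorm(n)

# Simulated hurdle data: first decide zero vs. non-zero, then draw a
# positive count (Poisson conditioned on being at least 1) for the non-zeros
nonzero <- rbinom(n, size = 1, prob = plogis(0.5 + 1.0 * x))
draw_positive <- function(lambda) {
  repeat {
    k <- rpois(1, lambda)
    if (k > 0) return(k)
  }
}
y <- ifelse(nonzero == 1, sapply(exp(1 + 0.3 * x), draw_positive), 0)
sim <- data.frame(y = y, x = x)

# The same predictor is used in both the count part and the zero hurdle part
fit_hurdle <- hurdle(y ~ x, data = sim, dist = "poisson", zero.dist = "binomial")
summary(fit_hurdle)

# Count-part and zero-hurdle coefficients, extracted separately
coef(fit_hurdle, model = "count")
coef(fit_hurdle, model = "zero")
```

In this simulation a positive zero-hurdle coefficient for x means that larger x raises the log-odds of a non-zero count, matching the direction described above.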

Exercise 6

  1. Awad, Lebo, and Linden (2017) scraped 40,628 Airbnb listings from New York City in March 2017 and put together the data set NYCairbnb.csv. The codebook is in the data folder of the hw-02 repo.

    Perform EDA and build a model, considering an offset and accounting for overdispersion if needed. Then, use the model to describe the characteristics of Airbnbs that are expected to have a high number of reviews.
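
As a reminder of how an offset enters a Poisson model in R, here is a minimal sketch on simulated data. The variables exposure, x, and y are hypothetical and are not columns of NYCairbnb.csv; whether an offset is appropriate for the Airbnb data, and which variable would serve as the exposure, is left to your analysis.

```r
set.seed(5)
n <- 400

# Simulated stand-in data: counts accumulate over differing exposure times
exposure <- runif(n, min = 1, max = 36)  # hypothetical, e.g. months observed
x <- rnorm(n)
y <- rpois(n, lambda = exposure * exp(-1 + 0.5 * x))
sim <- data.frame(y, x, exposure)

# The offset fixes the coefficient of log(exposure) at 1, so the model
# describes the rate of events per unit of exposure
fit_offset <- glm(y ~ x + offset(log(exposure)), family = poisson, data = sim)
coef(fit_offset)
```

If overdispersion is a concern, the same formula can be refit with family = quasipoisson.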

Submission

To submit the assignment, push your final changes to your GitHub repo. Then, you’re done! We will grade the latest versions of the files that were pushed to the GitHub repo by the deadline unless otherwise notified that you wish to submit late work.

Grading

Component              Points
Total                  50
Ex 1                   4
Ex 2                   6
Ex 3                   8
Ex 4                   8
Ex 5                   10
Ex 6                   11
Workflow & formatting  3

The “Workflow & formatting” grade is based on the organization of the assignment write-up along with the reproducible workflow. This includes having an organized write-up with neat and readable headers, code, and narrative, including properly rendered mathematical notation. It also includes having a reproducible Quarto document that can be rendered to reproduce the submitted PDF, along with implementing version control using multiple commits with informative commit messages.

References

Awad, Annika, Evan Lebo, and Anna Linden. 2017. “Intercontinental Comparative Analysis of Airbnb Booking Factors.”
Brockmann, H. Jane. 1996. “Satellite Male Groups in Horseshoe Crabs, Limulus polyphemus.” Ethology 102 (1): 1–21. https://doi.org/10.1111/j.1439-0310.1996.tb01099.x.
Roback, Paul, and Julie Legler. 2021. Beyond Multiple Linear Regression: Applied Generalized Linear Models and Multilevel Models in R. CRC Press.

Footnotes

  1. The R syntax for the hurdle function is similar to the syntax for the zero-inflated Poisson model. You will need to specify the distributions for the count and hurdle portions of the model, using dist = "poisson" and zero.dist = "binomial".