library(tidyverse)
library(knitr)
library(viridis)
library(broom.mixed)
library(mclogit)
library(ordinal)
yelp <- read_csv("data/yelp_sample.csv") |>
  mutate(stars_given = factor(stars_given))
Lecture 23: Modeling Yelp Reviews
Introduction
Today’s dataset contains a sample of 1000 Yelp reviews for restaurants in Madison, WI from 2005 to 2017. The data were originally obtained from the Yelp Dataset Challenge on Kaggle, which contains about 60,000 reviews on 888 restaurants posted by over 20,000 reviewers.
We will focus on the following variables:
stars_given: Restaurant star rating, from 1 (worst) to 5 (best)
length100: Number of words in the review / 100
useful: Number of users who rated the review as “useful”
funny: Number of users who rated the review as “funny”
cool: Number of users who rated the review as “cool”
revs_user100: Number of previous reviews the user has written / 100
revs_biz100: Number of reviews the restaurant has / 100
avg_biz_stars: Average star rating for the restaurant
user_id: Identifier for the reviewer
name: Name of the restaurant being reviewed
The goal of the analysis is to determine what factors are associated with a good Yelp review.
Exploratory data analysis
Exercise 1
Consider the distribution of the response variable stars_given.
What type of model(s) can we fit to understand the association between a good Yelp review and the other factors?
What random effect(s) do we need to account for in the model(s)?
## add code here
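One possible starting point for looking at the response is sketched below. It tabulates and plots stars_given using the data path from the setup code; the object names star_counts and star_plot are placeholders introduced here, not part of the original notes.

```r
library(tidyverse)

# Same data and path as in the setup code for these notes
yelp <- read_csv("data/yelp_sample.csv") |>
  mutate(stars_given = factor(stars_given))

# Count and proportion of reviews at each star level
# (star_counts is a placeholder name)
star_counts <- yelp |>
  count(stars_given) |>
  mutate(prop = n / sum(n))
star_counts

# Bar chart of the response (star_plot is a placeholder name)
star_plot <- ggplot(yelp, aes(x = stars_given)) +
  geom_bar() +
  labs(
    x = "Stars given",
    y = "Number of reviews",
    title = "Distribution of star ratings"
  )
star_plot
```

Because the response is a set of five ordered categories rather than a continuous measurement, the shape of this bar chart is a useful input when deciding which model family to consider.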
Exercise 2
Visualize the relationship between the response and one or two predictors. What do you observe from these plots?
## add code here
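As a rough sketch, side-by-side boxplots are one way to compare a quantitative predictor across star levels. The two predictors used here, length100 and avg_biz_stars, are chosen only for illustration, and the plot object names are placeholders.

```r
library(tidyverse)

# Same data and path as in the setup code for these notes
yelp <- read_csv("data/yelp_sample.csv") |>
  mutate(stars_given = factor(stars_given))

# Review length by star rating (p_length is a placeholder name)
p_length <- ggplot(yelp, aes(x = stars_given, y = length100)) +
  geom_boxplot() +
  labs(
    x = "Stars given",
    y = "Review length (words / 100)",
    title = "Review length by star rating"
  )
p_length

# Restaurant's average rating by star rating (p_biz is a placeholder name)
p_biz <- ggplot(yelp, aes(x = stars_given, y = avg_biz_stars)) +
  geom_boxplot() +
  labs(
    x = "Stars given",
    y = "Average stars for the restaurant",
    title = "Restaurant average rating by star rating"
  )
p_biz
```

Other predictors from the variable list can be swapped into the y aesthetic in the same way.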
Modeling
Exercise 3
Fit and display the model(s) described in Exercise 1.
## add code here
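If a cumulative logit (proportional odds) model is among the candidates from Exercise 1, a minimal sketch using clmm() from the ordinal package might look like the following. The choice of predictors and the single random intercept for user_id are assumptions made for illustration, not the intended answer; a multinomial model (for example with the mclogit package) or a different random-effects structure may be equally reasonable.

```r
library(tidyverse)
library(ordinal)

# Same data and path as in the setup code; the response is stored as an
# ordered factor and the grouping variable as a factor for clmm()
yelp <- read_csv("data/yelp_sample.csv") |>
  mutate(
    stars_given = factor(stars_given, ordered = TRUE),
    user_id = factor(user_id)
  )

# Cumulative logit mixed model with a random intercept per reviewer.
# Predictors are an illustrative subset of the variables listed above;
# ord_fit is a placeholder name.
ord_fit <- clmm(
  stars_given ~ length100 + useful + funny + cool + avg_biz_stars +
    (1 | user_id),
  data = yelp
)

# Coefficients, threshold estimates, and random-effect variance
summary(ord_fit)
```

With many reviewers contributing only one review to the sample, the fit may be slow or produce convergence warnings, so the random-effect variance estimate should be read with some caution.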
Conclusions
Exercise 4
We will use one of the models from Exercise 3 to draw conclusions about factors associated with a good Yelp review. Which model do you choose? Briefly explain your choice.
Exercise 5
Use the model selected from the previous exercise. What factors are associated with a good Yelp review?
Acknowledgements
The data set used in this exercise was obtained from the collection of data sets in Beyond Multiple Linear Regression.