Lecture 23: Modeling Yelp Reviews

Published

April 17, 2024

library(tidyverse)
library(knitr)
library(viridis)
library(broom.mixed)
library(mclogit)
library(ordinal)

yelp <- read_csv("data/yelp_sample.csv") |>
  mutate(stars_given = factor(stars_given))

Introduction

Today’s dataset contains a sample of 1000 Yelp reviews for restaurants in Madison, WI from 2005 to 2017. The data were originally obtained from the Yelp Dataset Challenge on Kaggle, which contains about 60,000 reviews of 888 restaurants posted by over 20,000 reviewers.

We will focus on the following variables:

stars_given: Star rating the reviewer gave the restaurant, from 1 (worst) to 5 (best)
length100: Number of words in the review / 100
useful: Number of users who rated the review as “useful”
funny: Number of users who rated the review as “funny”
cool: Number of users who rated the review as “cool”
revs_user100: Number of previous reviews the user has written / 100
revs_biz100: Number of reviews the restaurant has / 100
avg_biz_stars: Average star rating for restaurant
user_id: Denotes the reviewer
name: Name of restaurant being reviewed

The goal of the analysis is to determine what factors are associated with a good Yelp review.

Exploratory data analysis

Exercise 1

Consider the distribution of the response variable stars_given.

What type of model(s) can we fit to understand the association between a good Yelp review and the other factors?
What random effect(s) do we need to account for in the model(s)?

## add code here
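As one possible starting point for this exercise, the sketch below tabulates the response and checks how often the same reviewer or restaurant appears more than once, which is what would motivate a random effect. It runs on a small simulated data frame called toy (a hypothetical stand-in with made-up values, not the real Yelp sample) and uses base R only so that it works without the data file. The same steps could be applied to yelp once it is loaded.

```r
# Toy stand-in for the yelp data: values are simulated, NOT the real sample
set.seed(23)
toy <- data.frame(
  stars_given = factor(
    sample(1:5, 200, replace = TRUE,
           prob = c(0.10, 0.10, 0.15, 0.35, 0.30)),
    levels = 1:5
  ),
  user_id = sample(paste0("u", 1:120), 200, replace = TRUE),
  name    = sample(paste0("biz", 1:40), 200, replace = TRUE)
)

# Distribution of the response: counts and proportions per star level
star_counts <- table(toy$stars_given)
star_props  <- prop.table(star_counts)
print(round(star_props, 2))

# Number of reviews contributed by each reviewer and each restaurant
reviews_per_user <- table(toy$user_id)
reviews_per_biz  <- table(toy$name)

# Share of reviewers / restaurants with more than one review;
# repeated observations within a group suggest a random effect for it
print(mean(reviews_per_user > 1))
print(mean(reviews_per_biz > 1))
```

A response with five ordered levels and groups that contribute several reviews each points toward a model for categorical or ordinal outcomes with group-level random intercepts, though which groups to include is a judgment call the exercise leaves open.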

Exercise 2

Visualize the relationship between the response and one or two predictors. What do you observe from these plots?

## add code here

Modeling

Exercise 3

Fit and display the model(s) described in Exercise 1.

# add code here
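One candidate, assuming the response is treated as ordinal, is a cumulative link mixed model fit with clmm() from the ordinal package loaded at the top of these notes. The sketch below is not the intended solution; it only illustrates the model syntax on simulated data. The data frame toy, the single predictor, the cut points, and the effect sizes are all hypothetical, and only a reviewer-level random intercept is included.

```r
library(ordinal)

# Simulated stand-in data: all values are hypothetical, NOT the Yelp sample
set.seed(23)
n_users <- 60
n_obs   <- 300
uid     <- sample(seq_len(n_users), n_obs, replace = TRUE)

toy <- data.frame(
  user_id   = factor(uid),
  length100 = runif(n_obs, 0.2, 4)
)

# Latent rating = fixed effect + reviewer random intercept + logistic noise
user_effect <- rnorm(n_users, sd = 0.8)
latent <- -0.3 * toy$length100 + user_effect[uid] + rlogis(n_obs)

# Cut the latent scale into five ordered star categories
toy$stars_given <- cut(
  latent,
  breaks = c(-Inf, -2.5, -1.5, -0.5, 0.8, Inf),
  labels = 1:5,
  ordered_result = TRUE
)

# Cumulative link (proportional odds) model with a random intercept per reviewer
fit <- clmm(stars_given ~ length100 + (1 | user_id), data = toy)
summary(fit)
```

With the real data, the formula would use the predictors listed in the introduction, and a second random intercept such as (1 | name) could be added for restaurants if Exercise 1 suggests it is needed.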

Conclusions

Exercise 4

We will use one of the models from Exercise 3 to draw conclusions about factors associated with a good Yelp review. Which model do you choose? Briefly explain your choice.

Exercise 5

Use the model selected in the previous exercise. What factors are associated with a good Yelp review?

Acknowledgements

The data set used in this exercise was obtained from the collection of data sets in Beyond Multiple Linear Regression.