Preface

This is an overhauled case study of The Simpsons episodes, written because the data set has been thoroughly updated since it was last worked on. Most of the previous observations and analysis carry over, but some parts have been adapted to the newer, cleaner data set. Prior observations have also been trimmed to keep this document concise.

Season and episode references use the shorthand S##E## (Season #, Episode #).
Most plots are interactive and have hover-able elements for inspecting specific values.
Observational claims should not be treated as valid until they are explicitly stated to be, or statistically shown to be, significant.
- e.g. “The average rating of Season 2 is greater than Season 7” is a mere observation.

Data Source

Data obtained from Kaggle.
The author credits IMDb, The Movie Database (TMDb), and Wikipedia.
While investigating, it was found that Wikipedia also lists Nielsen ratings.
- These could be scraped and compared with t-tests to check whether Nielsen ratings differ significantly from IMDb ratings.
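The comparison mentioned above could be sketched as a paired t-test, since both ratings would be attached to the same episodes. The vectors below are made-up toy values (the names source_a and source_b are hypothetical), and the sketch assumes both sources have already been put on a comparable scale.

```r
# Toy ratings for the same six episodes from two hypothetical sources.
# These numbers are illustrative only and are not taken from the data set.
source_a <- c(8.2, 7.9, 7.4, 6.8, 7.1, 6.5)
source_b <- c(7.8, 7.5, 7.0, 6.9, 6.6, 6.1)

# Paired test: each episode contributes one difference (source_a - source_b).
res <- t.test(source_a, source_b, paired = TRUE)

res$estimate  # mean of the per-episode differences
res$p.value   # small values suggest the two sources differ on average
```

A paired test is used here rather than a two-sample test because the observations are matched by episode; whether that is the right choice depends on how the scraped ratings end up being aligned.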

Exploratory Data Analysis

dim(data)

## [1] 747  14

names(data)

##  [1] "id"                     "title"                  "description"           
##  [4] "original_air_date"      "production_code"        "directed_by"           
##  [7] "written_by"             "season"                 "number_in_season"      
## [10] "number_in_series"       "us_viewers_in_millions" "imdb_rating"           
## [13] "tmdb_rating"            "tmdb_vote_count"

summary(data)

##        id           title           description        original_air_date   
##  Min.   :  0.0   Length:747         Length:747         Min.   :1989-12-17  
##  1st Qu.:186.5   Class :character   Class :character   1st Qu.:1997-12-14  
##  Median :373.0   Mode  :character   Mode  :character   Median :2006-04-23  
##  Mean   :373.0                                         Mean   :2006-07-06  
##  3rd Qu.:559.5                                         3rd Qu.:2014-11-30  
##  Max.   :746.0                                         Max.   :2023-05-21  
##  production_code    directed_by         written_by            season     
##  Length:747         Length:747         Length:747         Min.   : 1.00  
##  Class :character   Class :character   Class :character   1st Qu.: 9.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :17.00  
##                                                           Mean   :17.44  
##                                                           3rd Qu.:26.00  
##                                                           Max.   :34.00  
##  number_in_season number_in_series us_viewers_in_millions  imdb_rating   
##  Min.   : 1.00    Min.   :  1.0    Length:747             Min.   :4.000  
##  1st Qu.: 6.00    1st Qu.:187.5    Class :character       1st Qu.:6.600  
##  Median :12.00    Median :374.0    Mode  :character       Median :7.000  
##  Mean   :11.61    Mean   :374.3                           Mean   :7.151  
##  3rd Qu.:17.00    3rd Qu.:560.5                           3rd Qu.:7.700  
##  Max.   :25.00    Max.   :750.0                           Max.   :9.300  
##   tmdb_rating    tmdb_vote_count 
##  Min.   :0.000   Min.   :  1.00  
##  1st Qu.:5.800   1st Qu.: 14.00  
##  Median :6.300   Median : 17.00  
##  Mean   :6.379   Mean   : 18.07  
##  3rd Qu.:7.000   3rd Qu.: 23.00  
##  Max.   :8.600   Max.   :101.00

sum(is.na(data))

## [1] 0

head(data) %>% rmarkdown::paged_table()

Things of Note:

Most variables are viable as predictor variables in some way.
- e.g. the description text can be used for text analysis (frequency, sentiment, etc.) rather than being ignored.
- number_in_season is an interesting one, as the number of episodes per season ranged from 13 to 25 in the earlier seasons.
  - A standard of 22 episodes per season was adopted later.
No variable for episode run time.
- An average run time of ~22 minutes can be assumed for a 30-minute TV slot.
- Run time of actual content fluctuates with the length of couch gags and ending credits.
Air date was originally considered because of the Friday night death slot, where low-rated shows pending cancellation are rescheduled to air.
- Likely not too relevant here given the popularity of the show and its fixed airing schedule.
- Wikipedia lists the broadcasting history, which includes time slot(s).
Ratings will be most interesting to analyze/use as an independent variable.
- There are two sources of ratings (IMDb and TMDb), so a t-test can check whether they are statistically different. For the sake of initial analysis, IMDb ratings will be used due to the presumed greater sample size and reputation.
Adding two new variables:
- length_description and age (time since original air date).
  - length_description is the number of words in the description (len_desc in the code).
  - age is measured from a fixed date of June 1st, 2023, by which Season 34 had finished airing new episodes.
  - As a point of reference, new seasons premiere around late September.
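The two derived variables described above could be computed along these lines. This is a minimal sketch on a toy two-row table: the pairing of descriptions with dates is arbitrary, and measuring age in days is an assumption, since the unit is not stated.

```r
# Toy stand-in for the episode table; the rows are illustrative only.
episodes <- data.frame(
  description = c("Marge becomes a real-estate agent.",
                  "Fat Tony becomes Maggie's godfather."),
  original_air_date = as.Date(c("1989-12-17", "2023-05-21")),
  stringsAsFactors = FALSE
)

# Description length as a word count: split on runs of whitespace.
episodes$len_desc <- lengths(strsplit(trimws(episodes$description), "\\s+"))

# Age relative to the fixed reference date, in days (assumed unit).
ref_date <- as.Date("2023-06-01")
episodes$age <- as.numeric(ref_date - episodes$original_air_date)

episodes[, c("len_desc", "age")]
```

Splitting on whitespace counts a hyphenated term such as "real-estate" as one word, which matches the five-word minimum reported in the summary of len_desc.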

Initial Observations Based Off of Summary Statistics:

747 x 14 table
747 episodes, 34 seasons
- The show has been renewed until Season 35.
First episode air date: 12-17-1989
- Pilot episode where the family adopts the family dog (Santa’s Little Helper).
Lowest rating is 4.0, but note that IMDb ratings are skewed/biased by nature.

summary(data$len_desc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   18.00   23.00   23.75   28.00   98.00

Shortest episode description length is five words:

## [1] "Marge becomes a real-estate agent."      
## [2] "Marge accidentally gets breast implants."
## [3] "Fat Tony becomes Maggie's godfather."

Potential Error(s) In the Data:

No blatant missing values as noted above, but some numeric values are stored as a different type.
- us_viewers_in_millions is of type char.
- The variable will be renamed to viewers for ease of access.
Not observable in the summary statistics above, but The Simpsons has a few two-part episodes. This updated data set is assumed not to account for them consistently, as each part might have a different viewer count, writers/directors, etc.
- Four cases in which entries need to be fixed.

Preliminary Questions/Notes:

The following are ways the data could be further analyzed and explored beyond the given data set; some will require external data.

Potentially filter data by:
- Treehouse of Horror episodes (Halloween specials)
- First and last episodes of each season
New external data points to consider:
- Disney acquisition of Hulu/Fox, and FX (March 19-20, 2019)
- Sociopolitical/culture events (e.g. Presidential election episodes)
- Pre vs Post The Simpsons movie release (July 27, 2007)
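The filters listed above could be expressed as logical flag columns. The sketch below uses a toy five-row table (season numbers, titles, and dates are illustrative), and it assumes that Treehouse of Horror episodes can be identified by that phrase appearing in the title, which may not hold for every special.

```r
# Toy episode table; rows are illustrative, not taken from the data set.
toy <- data.frame(
  season = c(18, 18, 18, 19, 19),
  number_in_season = c(1, 4, 22, 1, 5),
  title = c("Toy premiere", "Treehouse of Horror (toy row)", "Toy finale",
            "Toy premiere 2", "Treehouse of Horror II (toy row)"),
  original_air_date = as.Date(c("2006-09-10", "2006-11-05", "2007-05-20",
                                "2007-09-23", "2007-11-04")),
  stringsAsFactors = FALSE
)

# Halloween specials, assuming the phrase appears in the title.
toy$is_thoh <- grepl("Treehouse of Horror", toy$title, fixed = TRUE)

# First and last episode within each season.
toy$is_premiere <- toy$number_in_season ==
  ave(toy$number_in_season, toy$season, FUN = min)
toy$is_finale <- toy$number_in_season ==
  ave(toy$number_in_season, toy$season, FUN = max)

# Episodes that aired after the movie release date given above.
toy$post_movie <- toy$original_air_date > as.Date("2007-07-27")

toy[, c("season", "is_thoh", "is_premiere", "is_finale", "post_movie")]
```

Because the toy table only holds a subset of each season, the "finale" flag here simply marks the highest episode number present per season.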

Data Error Discussion

1. US Viewers in Millions

data$viewers <- as.numeric(data$us_viewers_in_millions)

## Warning: NAs introduced by coercion

which(is.na(as.numeric(data$viewers)))

## [1] 423

data[423, c('id','title', 'viewers')]

## # A tibble: 1 × 3
##      id title                          viewers
##   <dbl> <chr>                            <dbl>
## 1   422 Double, Double, Boy in Trouble      NA
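The coercion step above can be reproduced on a toy vector to see why the earlier sum(is.na(data)) check reported zero. This is a minimal sketch: the placeholder string "N/A" is an assumption, since the raw value stored at row 423 is not shown.

```r
# Toy character column; "N/A" is a hypothetical placeholder for a missing value.
viewers_chr <- c("33.5", "N/A", "8.09")

# A non-numeric string is not NA, so the count is zero before coercion.
n_missing_before <- sum(is.na(viewers_chr))

# Coercion turns anything that does not parse as a number into NA
# (the warning is suppressed here; it is the same one shown above).
viewers_num <- suppressWarnings(as.numeric(viewers_chr))
missing_idx <- which(is.na(viewers_num))

n_missing_before
missing_idx
```

Checking for NA only after converting to the intended type is one way to avoid this kind of silent miss.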

Simply convert the variable with as.numeric() for further use.
The missing data point was originally missed because of the type issue.
- sum(is.na(data)) did not detect it because the column was of class/type char; the entry was presumably a non-numeric string rather than a true NA, and only became NA after coercion.
- The viewership figure is listed on Wikipedia (8.09 mil.), presumed to have been updated there prior to the newest data version.
However, an interesting observation was made: Wikipedia articles report viewership in both individual viewers and viewing households.
- Seasons 1–11 are ranked by households (in millions); Seasons 12–33 are ranked by total viewers (in millions).
- This skews the data, as seen in the jump in average viewership between Seasons 11 and 12.

##    Group.1         x
## 9        9 16.052000
## 10      10 13.850000
## 11      11  8.772273
## 12      12 15.485238
## 13      13 12.510000
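The season averages above have the column names Group.1 and x, which is what base R's aggregate() produces when it is given an unnamed grouping list. A minimal sketch of such a call on made-up numbers (the toy data frame and its values are illustrative, not taken from the data set):

```r
# Toy viewers-per-episode table with one missing value.
toy <- data.frame(
  season  = c(11, 11, 12, 12, 12),
  viewers = c(8.5, 9.1, 15.0, NA, 16.0)
)

# Mean viewers per season; na.rm = TRUE is passed through to mean()
# so the single missing episode does not turn the season mean into NA.
season_means <- aggregate(toy$viewers, by = list(toy$season),
                          FUN = mean, na.rm = TRUE)

season_means
```

Passing na.rm = TRUE matters once the viewers column contains the NA found earlier; without it, that episode's season would have no average.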

2. Two-Part Episodes

In a previous version it was observed that the author most likely scraped the data improperly due to Wikipedia table formatting. This caused inconsistencies in the data when compared to Wikipedia.

The four unique cases are:

“Who Shot Mr. Burns?” (Parts One and Two) (S6 finale, S7 premiere) (id: 127, 128)
The Great Phatsby (S28E12-13) (id: 607)
Warrin’ Priests (S31E19-20)
A Serious Flanders (S33E06-07)

There are no present issues with the “Who Shot Mr. Burns?” pair of episodes, as the two parts are treated as separate entities.

However, The Great Phatsby is a more unusual case, as it is the “first two-part episode of the series since ‘Who Shot Mr. Burns?’, though it was promoted and aired as the show’s first hour-long episode in its initial airing.

On Disney+, Vol. I and Vol. II are one episode, episode 12, with a total running time of 44 minutes, with no episode 13 listed in the series.”¹

It is worth noting that both parts have different production codes and credits but the same premiere date. Thus, this special will be considered a single episode.

If the episode were to be split, the remaining data would have to be shifted (id, number_in_season/series).

number_in_series can also be used to identify episodes; id is only the index the author used.

Wikipedia Snippet

Note: The episode descriptions are from IMDb, not Wikipedia, hence the discrepancy. The credits section on IMDb does not show who wrote the episode, only the developers.

S31 Snippet

Since this special was not promoted or treated as a single episode, it will be treated as two different episodes. As seen when comparing the IMDb episodes with the data snippet, the descriptions do not line up with each other. The data will be fixed and updated accordingly. This error is also present in the next observation, i.e. when comparing the description of “The Hateful Eight-Year-Olds” on Wikipedia, on IMDb, and in the table.

Wikipedia Snippet

Graphs

Note: The graphs are expected to fall under time series analysis; id, original_air_date, and age are somewhat interchangeable here since they are effectively factors/‘categorical’.

Seasons Data

Seasons vs Ratings

Boxplot

Scatterplot

Boxplot w/ Jitter

Averages

When looking solely at the averages, the negative pattern is more definitive.
Season 30 has a large drop-off from Season 29, then recovers in the following seasons.
- Comparable to Season 8 and its steeper decline.

Note: Y-Axis scaled from 4-10 to trim excess white space.

Seasons with a large spread: 8, 23, and potentially 33.
S8 has the highest maximum, while S23 has the lowest minimum:
- S08E23: Homer’s Enemy
- S23E23: Lisa Goes Gaga
Ratings seem to plateau as seasons progress, but the occasional outlier does exist.

Seasons vs Viewers

Boxplot

Scatterplot

Boxplot w/ Jitter

Averages

As with the ratings, when observing only the averages, the negative pattern is more definitive.
The Season 11 drop and recovery is an interesting outlier, as the drop is much steeper.
- This season included an episode that killed off a recurring character, Maude Flanders.

Viewership does appear to decrease over time.
More outliers are maxima than minima.
- Two cases (S01 and S13) where the outliers are minima.
- The most extreme maximum is S16E08.
  - Interesting to see that the other maxima are not even close to the same viewership.
  - The gap between the maximum and the third quartile is the largest of any season.
  - Why is this specific episode an outlier?
    - It aired on Feb 6, 2005, the day of Super Bowl XXXIX, alongside the premiere of another FOX animated series, American Dad.
Outliers in the ratings graph(s) are more interesting to look at than those in the viewers graph(s) due to more variability.
The ratings do not seem to follow a negative trend like the viewership does.

Episode Data

Ratings

Plot

Interactive Plot

General negative trend, as expected from the seasonal graph(s).
- No notable patterns at the start/end of seasons as of yet, granted this is somewhat hard to observe.
  - More detailed graph(s) later.
- i.e. season premiere vs finale (presumably because of the episodic nature of the series and the lack of an overarching plotline).

Viewers

Plot

Interactive Plot

More detailed info can be seen when zoomed in accordingly; however, the graph is still too condensed to get a detailed view.
The overall trend shows that the rating plateaus (as expected from the seasonal boxplots).

Season Specific Episode Data

Group 1 (S01 - S09)

Group 2 (S10 - S18)

Group 3 (S19 - S27)

Group 4 (S28 - S34)

Note: Green highlights are Treehouse of Horror (THOH) episodes.
These are just rough estimations/observations, but note that the overall trend is generally a decrease in score.
This is not surprising since, as noted earlier, The Simpsons is not a series with an overarching season plotline.
S1 does not have a THOH episode.
THOH episodes fall within the first five episodes of each season; in some cases it is the first episode of the season.
In general, THOH episodes seem to be in the upper half of viewership per season.
Seasons where the THOH episode is the highest of its respective season: 5, 6, 15 (?), 17, 19, 20, 32, 33

THOH

Ratings

Viewers

The ratings of THOH specials remain flat after the initial drop, then fluctuate after the ~400th episode.
The 29th THOH special (Season 30) is the lowest for both ratings and viewers.

Writers and Directors

length(unique(new_data$writers))

## [1] 166

length(unique(new_data$directors))

## [1] 49

166 unique writer records.
- Some entries list multiple people, e.g. the two-part episodes mentioned above.
- The way the data was obtained caused errors/inconsistencies within the data.
  - Scraped with different delimiters (&, spaces, or no space at all).
  - Some entries were scraped including “Story by:” and/or “Teleplay by:”.
    - These have different semantics under the WGA. (How did the WGA strike of ’07 affect episode metrics?)
      - Both titles are treated as a writer when looking for unique names.
    - e.g. id == 176
```
## [1] "Story by : Ken KeelerTeleplay by : David X. Cohen"
```
49 unique director listings

## [1] "Mimi Pond"                                     
## [2] "Jon Vitti"                                     
## [3] "Jay Kogen & Wallace Wolodarsky"                
## [4] "Al Jean & Mike Reiss"                          
## [5] "John Swartzwelder"                             
## [6] "Al Jean, Mike Reiss, Sam Simon & Matt Groening"

## [1] 166

## [1] 15.63685

The 166 unique entries are a mix of individual writers and groups/duos.
First pass: comparing nchar (characters in the entry) to the average length of the writer entries.
- Upon further inspection, some single-writer names are longer than the average (15.63685 chars); after a manual search, 20 characters was chosen as a good cutoff.
Otherwise, there are 55 unique instances of episodes with multiple writers.
- 1st Pass: Removing ‘Teleplay’ / ‘Story’ titles
- 2nd Pass: Separating entries by delimiters
- 3rd Pass: Cleaning side effects of using the unlist function
- 4th Pass: (Done)
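The cleaning passes listed above could look roughly like the following in base R. This is a sketch under assumptions, not the code actually used: the regular expressions and the choice to turn credit labels into a delimiter are illustrative, and the input is a small sample of entries shown earlier.

```r
# Sample of raw writer entries as they appear in the scraped column.
raw <- c("Story by : Ken KeelerTeleplay by : David X. Cohen",
         "Jay Kogen & Wallace Wolodarsky",
         "Al Jean, Mike Reiss, Sam Simon & Matt Groening",
         "John Swartzwelder")

# Heuristic from the text: entries longer than 20 characters are
# likely to contain more than one writer.
multi <- nchar(raw) > 20

# 1st pass: replace the credit labels with a delimiter so that names
# glued together ("Ken KeelerTeleplay by") are separated.
step1 <- gsub("(Story|Teleplay) by ?:", "&", raw)

# 2nd pass: split each entry on the delimiters "&" and ",".
step2 <- strsplit(step1, "[&,]")

# 3rd pass: flatten the list, trim whitespace, and drop empty strings
# left behind by leading delimiters.
all_names <- trimws(unlist(step2))
all_names <- all_names[nzchar(all_names)]

# Frequency of each individual writer across the sample.
writer_freq <- sort(table(all_names))
writer_freq
```

Replacing the labels with a delimiter, rather than deleting them, is what separates "Ken Keeler" from "David X. Cohen" in the id == 176 example.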

Writers

## [1] 155

##             all_names Freq
## 2             Al Jean   26
## 97        Matt Selman   28
## 148          Tim Long   32
## 76         John Frink   35
## 75      Joel H. Cohen   36
## 77  John Swartzwelder   58

~~Unique writers cut down to 155 which is still more than what I expected from the initial 166~~
NOTE: Wikipedia has a list of The Simpsons writers (assumed to be correct) which for the most part is consistent with the parsed data. However, the way THOH episodes are listed on Wikipedia affected how the data was scraped. The episodes act as an anthology, so each segment has different writers, listed in separate tables. The data set only includes the writer credits for the first segment. In the following images, compare the listings for the second THOH episode.

THOH 2

THOH Sample

Take note of how the writers are either separated per segment or listed all together. It is assumed the author of the data set collected the data from each season’s Wikipedia page, as it included other metrics in the data set, i.e. the data was scraped from https://en.wikipedia.org/wiki/The_Simpsons_(season_X), X being each season number.

Thus, to continue, the table of writers from Wikipedia will be used. Some writers only worked with other writers and have no individual credits, so duos/groups will be treated as a single entity.

Bar Graph

## Rows: 142 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Writer
## dbl (1): Frequency
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 6 × 2
##   Writer            Frequency
##   <chr>                 <dbl>
## 1 Carolyn Omine            26
## 2 Matt Selman              30
## 3 Tim Long                 33
## 4 Joel H. Cohen            36
## 5 John Frink               37
## 6 John Swartzwelder        59

Pie Chart

Almost half of the chart consists of writers who contributed to only a single episode.
One writer, John Swartzwelder, has written 59 episodes (the maximum), while the second highest is 37 by John Frink.

Directors

The issue persisted with directors, so the Wikipedia table for directors will be used.

Bar Graph

## Rows: 41 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Director
## dbl (1): Freq
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## [1] 41

## # A tibble: 6 × 2
##   Director           Freq
##   <chr>             <dbl>
## 1 Jim Reardon          35
## 2 Michael Polcino      40
## 3 Matthew Nastuk       59
## 4 Bob Anderson         65
## 5 Steven Dean Moore    82
## 6 Mark Kirkland        84

Pie Chart

freq_df <- data.frame(table(wiki_directors$Freq))
freq_df <- freq_df %>% rename(EpisodesDirected = Var1)

fig <- plot_ly(freq_df, labels=~EpisodesDirected, values=~Freq)
fig <- fig %>% add_pie(hole = 0.6)
fig <- fig %>% layout(title = "Episodes Directed Frequency",  showlegend = FALSE,
                      xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
                      yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig

41 unique directors.
12.2% of unique directors directed only one episode, while the most common value is seven episodes directed.
One director, Mark Kirkland, has directed 84 episodes; the second highest is 82 by Steven Dean Moore.
- The gap between the top two directors isn’t as large as the gap between the top two writers.
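The percentages read off the pie chart can also be computed directly from the frequency table. A small sketch, using a made-up vector of episodes-directed counts (not the real Wikipedia table):

```r
# Toy counts of episodes directed per director; illustrative values only.
episodes_directed <- c(1, 1, 7, 7, 7, 35, 84)

# Share of directors at each count, as a percentage of all directors.
share <- round(100 * prop.table(table(episodes_directed)), 1)

# Most common number of episodes directed (the mode of the counts).
most_common <- names(share)[which.max(share)]

share
most_common
```

With the real table, the entry of share labelled "1" would correspond to the single-episode percentage quoted above.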

as.numeric(unique(wiki_writers$Writer) %in% unique(wiki_directors$Director))

##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

which(as.numeric(unique(wiki_writers$Writer) %in% unique(wiki_directors$Director)) == 1)

## [1] 120

unique(wiki_writers$Writer)[120]

## [1] "David Silverman"

For the sake of curiosity, writers and directors were checked for overlap.
- There is one entry present in both: David Silverman.
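The same overlap check can be written more compactly with intersect(), which returns the shared names directly instead of a 0/1 vector. A minimal sketch on toy vectors (the real inputs would be wiki_writers$Writer and wiki_directors$Director):

```r
# Toy name vectors; a small subset of names that appear in this document.
writers   <- c("John Swartzwelder", "David Silverman", "Tim Long")
directors <- c("Mark Kirkland", "David Silverman")

# Names that occur in both vectors (duplicates are dropped by intersect).
overlap <- intersect(writers, directors)
overlap
```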

Descriptions

## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops
## documents

## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents

## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents

## Warning in tm_map.SimpleCorpus(docs, content_transformer(tolower)):
## transformation drops documents

## Warning in tm_map.SimpleCorpus(docs, removeWords, stopwords("english")):
## transformation drops documents
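The warnings above come from tm_map cleaning steps (removeNumbers, removePunctuation, stripWhitespace, tolower, removeWords). A rough base R approximation of the same sequence, followed by a word frequency count, is sketched below. It is not equivalent to the tm pipeline, and the stop word list is a tiny hypothetical stand-in for stopwords("english").

```r
# Three descriptions taken from the shortest-description output above.
desc <- c("Marge becomes a real-estate agent.",
          "Marge accidentally gets breast implants.",
          "Fat Tony becomes Maggie's godfather.")

# Tiny illustrative stop word list (assumption, not the tm list).
stop_words <- c("a", "the", "and", "of", "to")

clean <- tolower(desc)
clean <- gsub("[[:digit:]]+", "", clean)   # drop numbers
clean <- gsub("[[:punct:]]+", "", clean)   # drop punctuation, incl. hyphens
clean <- gsub("\\s+", " ", trimws(clean))  # collapse whitespace

# Split into words, remove stop words, and count occurrences.
words <- unlist(strsplit(clean, " ", fixed = TRUE))
words <- words[!words %in% stop_words]
word_freq <- sort(table(words), decreasing = TRUE)

head(word_freq)
```

Note that stripping punctuation merges "real-estate" into "realestate" and "Maggie's" into "maggies", which may or may not be desirable for the frequency analysis.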

¹ https://en.wikipedia.org/wiki/The_Great_Phatsby

Simpsons Episode Analysis

Richard Luu
