This is an updated and overhauled case study of The Simpsons episodes, as the data set has been more thoroughly updated since it was last worked on. Most previous observations and analysis carry over, but some parts have been adapted to the newer, cleaner data set. Prior observations have also been trimmed to make this document more concise.
dim(data)
## [1] 747 14
names(data)
## [1] "id" "title" "description"
## [4] "original_air_date" "production_code" "directed_by"
## [7] "written_by" "season" "number_in_season"
## [10] "number_in_series" "us_viewers_in_millions" "imdb_rating"
## [13] "tmdb_rating" "tmdb_vote_count"
summary(data)
## id title description original_air_date
## Min. : 0.0 Length:747 Length:747 Min. :1989-12-17
## 1st Qu.:186.5 Class :character Class :character 1st Qu.:1997-12-14
## Median :373.0 Mode :character Mode :character Median :2006-04-23
## Mean :373.0 Mean :2006-07-06
## 3rd Qu.:559.5 3rd Qu.:2014-11-30
## Max. :746.0 Max. :2023-05-21
## production_code directed_by written_by season
## Length:747 Length:747 Length:747 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.: 9.00
## Mode :character Mode :character Mode :character Median :17.00
## Mean :17.44
## 3rd Qu.:26.00
## Max. :34.00
## number_in_season number_in_series us_viewers_in_millions imdb_rating
## Min. : 1.00 Min. : 1.0 Length:747 Min. :4.000
## 1st Qu.: 6.00 1st Qu.:187.5 Class :character 1st Qu.:6.600
## Median :12.00 Median :374.0 Mode :character Median :7.000
## Mean :11.61 Mean :374.3 Mean :7.151
## 3rd Qu.:17.00 3rd Qu.:560.5 3rd Qu.:7.700
## Max. :25.00 Max. :750.0 Max. :9.300
## tmdb_rating tmdb_vote_count
## Min. :0.000 Min. : 1.00
## 1st Qu.:5.800 1st Qu.: 14.00
## Median :6.300 Median : 17.00
## Mean :6.379 Mean : 18.07
## 3rd Qu.:7.000 3rd Qu.: 23.00
## Max. :8.600 Max. :101.00
sum(is.na(data))
## [1] 0
head(data) %>% rmarkdown::paged_table()
Things of Note:
- number_in_season is an interesting one, as the number of episodes per (earlier) season ranged from 13 to 25.
- length_description and age (time since original air date).
- age will be based off a fixed date of June 1st, 2023, by which Season 34 had finished airing new episodes.

Initial Observations Based Off of Summary Statistics:
summary(data$len_desc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 18.00 23.00 23.75 28.00 98.00
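The construction of len_desc and age is not shown here. A minimal sketch of one way to derive them follows, assuming that description length is counted in whitespace-separated words and that age is measured in days from the fixed date. The toy data frame, the pairing of descriptions with dates, and the helper name count_words are illustrative assumptions, not the original code.

```r
# Toy stand-in for the episode data; the description/date pairing is illustrative only.
episodes <- data.frame(
  description = c("Marge becomes a real-estate agent.",
                  "Fat Tony becomes Maggie's godfather."),
  original_air_date = as.Date(c("1989-12-17", "2023-05-21")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count whitespace-separated words in each string.
count_words <- function(x) lengths(strsplit(trimws(x), "\\s+"))

# Fixed reference date stated in the text (June 1st, 2023).
reference_date <- as.Date("2023-06-01")

episodes$len_desc <- count_words(episodes$description)
# Assumption: age is kept in days; divide by 365.25 for approximate years.
episodes$age <- as.numeric(reference_date - episodes$original_air_date)

episodes[, c("len_desc", "age")]
```

Keeping the reference date in a single variable makes it easy to recompute age if a later cutoff is chosen.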
Shortest episode description length is five words:
## [1] "Marge becomes a real-estate agent."
## [2] "Marge accidentally gets breast implants."
## [3] "Fat Tony becomes Maggie's godfather."

Potential Error(s) In the Data:
- us_viewers_in_millions is char type
- A converted copy is stored as viewers for ease of access

Preliminary Questions/Notes:
The following are ways the data could be further analyzed and explored beyond the given data set; these will require external data.
1. US Viewers in Millions
data$viewers <- as.numeric(data$us_viewers_in_millions)
## Warning: NAs introduced by coercion
which(is.na(as.numeric(data$viewers)))
## [1] 423
data[423, c('id','title', 'viewers')]
## # A tibble: 1 × 3
## id title viewers
## <dbl> <chr> <dbl>
## 1 422 Double, Double, Boy in Trouble NA
- Converted with as.numeric() for further application
- is.na() didn’t detect the missing value because the column was of class/type char

## Group.1 x
## 9 9 16.052000
## 10 10 13.850000
## 11 11 8.772273
## 12 12 15.485238
## 13 13 12.510000
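The check for values that fail numeric conversion can be packaged so that the offending strings are visible before they turn into NA. The sketch below runs on a toy data frame (the viewer figures and the non-numeric entry "N/A" are made up for illustration) and then averages viewers per season with aggregate(), whose default column names are the Group.1 and x seen in the output above. Using na.rm = TRUE is an assumption about how the missing value should be handled.

```r
# Toy data; all values here are invented for illustration.
toy <- data.frame(
  season = c(1, 1, 2, 2),
  us_viewers_in_millions = c("26.7", "24.5", "N/A", "33.6"),
  stringsAsFactors = FALSE
)

# Coerce once, silencing the expected "NAs introduced by coercion" warning.
toy$viewers <- suppressWarnings(as.numeric(toy$us_viewers_in_millions))

# Rows whose original string was present but not numeric.
bad_rows <- which(is.na(toy$viewers) & !is.na(toy$us_viewers_in_millions))
toy[bad_rows, c("season", "us_viewers_in_millions")]

# Mean viewers per season, skipping the unparseable entry.
season_means <- aggregate(toy$viewers, by = list(toy$season),
                          FUN = mean, na.rm = TRUE)
season_means
```

Inspecting the raw strings first shows whether the entry is a placeholder or a formatting problem that could be repaired instead of dropped.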
2. Two-Part Episodes
In a previous version it was observed that the author most likely scraped the data improperly because of Wikipedia's table formatting. This caused inconsistencies in the data when compared to Wikipedia.
The four unique cases are:
- “Who Shot Mr. Burns?” (S6 finale, S7 premiere) (id: 127, 128)
- The Great Phatsby (S28E12-13) (id: 607)
- Warrin’ Priests (S31E19-20)
- A Serious Flanders (S33E06-07)
There are no present issues in this pair of episodes as they are treated as two separate entities.
However, the following is a more unusual case, considering that it is the “first two-part episode of the series since ‘Who Shot Mr. Burns?’, though it was promoted and aired as the show’s first hour-long episode in its initial airing. On Disney+, Vol. I and Vol. II are one episode, episode 12, with a total running time of 44 minutes, with no episode 13 listed in the series.” [1]
It is worth noting that both parts have different production codes and credits but the same premiere date. Thus, this special will be considered a single episode.
If the episode were to be split, the remaining data would have to be shifted (id, number_in_season/series).
number_in_series is used to identify episodes; id is only for the sake of the indexing that the author used.
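Splitting one row into two means every later episode's counters move up by one. A minimal sketch of that shift on a toy data frame follows; the function name split_episode and the toy values are hypothetical, and the second part simply copies the first part's other fields, which would still need to be corrected by hand.

```r
# Hypothetical helper: duplicate row i as a second part and shift later counters.
split_episode <- function(df, i) {
  stopifnot(i >= 1, i <= nrow(df))

  # The new second part starts as a copy of row i, one position later.
  part2 <- df[i, , drop = FALSE]
  part2$id <- part2$id + 1
  part2$number_in_series <- part2$number_in_series + 1
  part2$number_in_season <- part2$number_in_season + 1

  before <- df[seq_len(i), , drop = FALSE]
  after <- df[-seq_len(i), , drop = FALSE]

  # Every later episode shifts in id and series number;
  # only later episodes of the same season shift in number_in_season.
  same_season <- after$season == part2$season
  after$id <- after$id + 1
  after$number_in_series <- after$number_in_series + 1
  after$number_in_season[same_season] <- after$number_in_season[same_season] + 1

  out <- rbind(before, part2, after)
  rownames(out) <- NULL
  out
}

# Toy data: three episodes in season 1, one in season 2 (id is 0-based as in the data set).
toy <- data.frame(id = 0:3, season = c(1, 1, 1, 2),
                  number_in_season = c(1, 2, 3, 1), number_in_series = 1:4)
res <- split_episode(toy, 2)
res
```

Incrementing the existing values, instead of renumbering from scratch, preserves any gaps already present in number_in_series.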
Note: The episode descriptions are from IMDb, not Wikipedia, hence the discrepancy. The credits section on IMDb does not show who wrote the episode, only the developers.
Since this special is not promoted or treated as a single episode, it will be treated as two different episodes. As seen when comparing the IMDb episodes with the data snippet, the descriptions do not line up with each other. The data will be fixed and updated accordingly. This error is also present in the next observation, e.g., when comparing the “The Hateful Eight-Year-Olds” description on Wikipedia, on IMDb, and in the table.
Note: The graphs are expected to fall under time series analysis, and id, original_air_date, and age are somewhat interchangeable as the time variable since they are effectively factors (‘categorical’).
Note: Y-Axis scaled from 4-10 to trim excess white space.
viewers due to more variability.





length(unique(new_data$writers))
## [1] 166
length(unique(new_data$directors))
## [1] 49
## [1] "Story by : Ken KeelerTeleplay by : David X. Cohen"
## [1] "Mimi Pond"
## [2] "Jon Vitti"
## [3] "Jay Kogen & Wallace Wolodarsky"
## [4] "Al Jean & Mike Reiss"
## [5] "John Swartzwelder"
## [6] "Al Jean, Mike Reiss, Sam Simon & Matt Groening"
## [1] 166
## [1] 15.63685
## [1] 155
## all_names Freq
## 2 Al Jean 26
## 97 Matt Selman 28
## 148 Tim Long 32
## 76 John Frink 35
## 75 Joel H. Cohen 36
## 77 John Swartzwelder 58
The number of unique writers is cut down to 155, which is still more than I expected given the initial 166.
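The parsing step that turns credit strings into individual names is not shown. A minimal sketch under stated assumptions follows: names are separated by commas or ampersands, and the "Story by :" / "Teleplay by :" labels can be treated as separators. The function name parse_writers is hypothetical, and the sample strings are the ones printed above.

```r
# Hypothetical helper: split one credit string into individual writer names.
parse_writers <- function(credit) {
  # Treat credit labels as separators so fused names are pulled apart.
  cleaned <- gsub("(Story|Teleplay) by\\s*:", "&", credit)
  parts <- trimws(unlist(strsplit(cleaned, "\\s*(,|&)\\s*")))
  parts[nzchar(parts)]  # drop empty pieces left by leading separators
}

credits <- c(
  "Story by : Ken KeelerTeleplay by : David X. Cohen",
  "Al Jean & Mike Reiss",
  "Al Jean, Mike Reiss, Sam Simon & Matt Groening"
)

# One flat vector of names, then a frequency table sorted ascending.
all_names <- unlist(lapply(credits, parse_writers))
name_freq <- data.frame(table(all_names))
name_freq[order(name_freq$Freq), ]
```

This sketch does not handle other label variants or names that themselves contain commas, so the resulting counts should be checked against a trusted list.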
NOTE: Wikipedia has a list of The Simpsons writers (assumed to be correct) which, for the most part, is consistent with the parsed data. However, the way THOH (Treehouse of Horror) episodes are listed on Wikipedia affected how the data was scraped. These episodes act as anthologies, so each segment has different writers, and the segments are listed as separate tables. The data set includes only the writer credits for the first segment. In the following images, compare the listings for the second THOH episode.
Take note of how the writers are either separated per segment or listed all together. It is assumed that the author of the data set collected the data from each season's Wikipedia page, as those pages include the other metrics in the data set, i.e., the data was scraped from https://en.wikipedia.org/wiki/The_Simpsons_(season_X), X being each season number.
Thus, to continue, the table of writers from Wikipedia will be used. Some writers only worked with other writers and have no individual credits, so duos/groups will be treated as a single entity.
## Rows: 142 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Writer
## dbl (1): Frequency
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 2
## Writer Frequency
## <chr> <dbl>
## 1 Carolyn Omine 26
## 2 Matt Selman 30
## 3 Tim Long 33
## 4 Joel H. Cohen 36
## 5 John Frink 37
## 6 John Swartzwelder 59
The issue persisted with directors, so the Wikipedia table for directors will be used.
## Rows: 41 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Director
## dbl (1): Freq
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## [1] 41
## # A tibble: 6 × 2
## Director Freq
## <chr> <dbl>
## 1 Jim Reardon 35
## 2 Michael Polcino 40
## 3 Matthew Nastuk 59
## 4 Bob Anderson 65
## 5 Steven Dean Moore 82
## 6 Mark Kirkland 84
library(dplyr)   # %>% and rename()
library(plotly)  # plot_ly(), add_pie(), layout()
# Count how many directors share each episodes-directed total
freq_df <- data.frame(table(wiki_directors$Freq)) %>%
  rename(EpisodesDirected = Var1)
# Donut chart: one slice per episodes-directed total
fig <- plot_ly(freq_df, labels = ~EpisodesDirected, values = ~Freq) %>%
  add_pie(hole = 0.6) %>%
  layout(title = "Episodes Directed Frequency", showlegend = FALSE,
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
as.numeric(unique(wiki_writers$Writer) %in% unique(wiki_directors$Director))
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
which(as.numeric(unique(wiki_writers$Writer) %in% unique(wiki_directors$Director)) == 1)
## [1] 120
unique(wiki_writers$Writer)[120]
## [1] "David Silverman"
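The overlap check above prints a long 0/1 vector before the match can be located. As a shorter alternative, base R's intersect() returns the shared names directly. The vectors below are small stand-ins built from names that appear in this document, not the full Wikipedia tables.

```r
# Small stand-ins for wiki_writers$Writer and wiki_directors$Director.
writers <- c("John Swartzwelder", "Joel H. Cohen", "David Silverman")
directors <- c("Mark Kirkland", "Steven Dean Moore", "David Silverman")

# Names credited as both writer and director; intersect() also removes duplicates.
both <- intersect(writers, directors)
both
```

The same call on the full columns would list every person with both kinds of credit without indexing by position.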
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(docs, content_transformer(tolower)):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(docs, removeWords, stopwords("english")):
## transformation drops documents