Introduction
I entered the first feeder competition for Iron Viz in March 2017 and found it to be a very worthwhile and enjoyable process. I didn’t manage to submit an entry for the second round, so I was determined not to miss out on round three! When I saw the blog on Tableau Public announcing ‘Iron Viz: Silver Screen’, it seemed like something I would be able to work with, so I started to look for a data set. I have an interest in languages and decided I’d like to look at word usage and frequencies. I knew of some word frequency lists from movies and TV shows on Wikipedia, so this was where I turned first.

A data set of the top 5,000 words by frequency for The Simpsons captured my attention. I always loved The Simpsons growing up, although I rarely watch it now. It was the highlight of my Sunday evenings back in the day – there were back-to-back Simpsons episodes at 6 o’clock (followed by Irish favourites ‘Where in the World’ and ‘Glenroe’, which just didn’t offer the excitement of The Simpsons but were an excuse to push out bedtime). The list of words I found only went up to 2012, however, so I went on the hunt for another source. I came across one on Kaggle, and it looked like a comprehensive source of data, from script lines to character codes, episode details and locations. I downloaded the script lines, the character table and the list of episodes, and decided to work with these. Todd Schneider is credited on the Kaggle site and his blog can be read here. There is some overlap with the analysis I carried out, but there were additional areas that Todd focused on. Also, I wanted to look more closely at the words themselves and create an interactive visualisation in Tableau.
Aim of my data visualisation
To provide an ‘at a glance’ overview of some potentially interesting figures and patterns in the data and give a flavour of the word usage at a character level. I wanted to address some of the crucial questions that surely must be burning in everyone’s minds such as:
- What is Sideshow Bob’s favourite word?
- Who, in addition to Homer, uses the word ‘doh’?
- What is the gender split in words uttered?
I wanted to create a visualisation that would display some of this and allow interaction should anyone have the urge to look into it in more detail, filtering on particular episodes if required.
Data Preparation Step 1
Firstly, I wanted to split the script lines into separate words. I had done something similar in Python previously, but I decided to do this in Alteryx, as I wanted to get more practice using it. A colleague, Emmet McCormack, who is very knowledgeable in all things Alteryx, got my workflow started for me. The critical step was to use a ‘Parse’ tool to split the script lines into individual rows (third icon below). Within Alteryx I then carried out some data cleansing, such as removing some symbols and creating a field to identify whether the text was a number, so that I could filter these out at a later point. Some data cleansing had already been carried out before I started working on it: I used the field ‘normalised text’, where capitals and apostrophes already seemed to have been eliminated. I then summarised the table to count the number of instances, or utterances, of each word. I grouped by character ID and episode ID so that for each character and episode I would get a sub-total, e.g. on one row we would see how many times Homer said ‘the’ in a particular episode. This is the workflow:
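For anyone who prefers code to a workflow diagram, the steps above could be sketched in pandas roughly as follows. This is only an illustrative equivalent, not the Alteryx workflow itself, and the column names (`character_id`, `episode_id`, `normalized_text`) are assumptions based on the description of the Kaggle files.

```python
import pandas as pd

# Tiny hypothetical sample standing in for the Kaggle script_lines file;
# column names are assumptions based on the description above.
script_lines = pd.DataFrame({
    "character_id": [2, 2, 1],
    "episode_id": [32, 32, 32],
    "normalized_text": ["doh the the donut", "mmm donuts", "eat my shorts"],
})

# Split each line into one row per word (the 'Parse' step in Alteryx).
words = (
    script_lines
    .assign(word=script_lines["normalized_text"].str.split())
    .explode("word")
)

# Flag purely numeric tokens so they can be filtered out.
words["is_number"] = words["word"].str.fullmatch(r"\d+")

# Count utterances of each word per character per episode (the summarise step).
counts = (
    words[~words["is_number"]]
    .groupby(["character_id", "episode_id", "word"])
    .size()
    .reset_index(name="utterances")
)
print(counts)
```

Each row of `counts` answers a question like “how many times did Homer say ‘the’ in this episode?”.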
Data Preparation Step 2
For some of my analysis I knew I would want to eliminate stop words such as ‘a, the, about, there, that’ etc. I found a number of lists here and combined the first three lists at the end of the page, giving me a list of about 640 words. I did this in Excel. I decided to then join these stop words in Tableau rather than Alteryx. Within the data source pane in Tableau I also joined the character table, so that I had the name of each character, and the episodes table, so that I could see the title of each episode and the season.
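The effect of that stop-word join can be mimicked in a few lines of pandas. This is a hedged sketch, not the Tableau join itself: the tiny stop-word list below stands in for the ~640-word list, and the `word_counts` rows are made up for illustration.

```python
import pandas as pd

# Small illustrative stop-word list; the real list (~640 words) was
# assembled in Excel from three published lists.
stop_words = pd.DataFrame({"word": ["a", "the", "about", "there", "that"]})

# Hypothetical word counts produced by the earlier preparation step.
word_counts = pd.DataFrame({
    "word": ["the", "doh", "donut"],
    "utterances": [120, 5, 3],
})

# A left join with an indicator mimics the Tableau join: rows with no
# match in the stop-word table are the "content" words worth analysing.
merged = word_counts.merge(stop_words, on="word", how="left", indicator=True)
content_words = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(content_words)
```

Keeping the stop words in a separate table, as the post does, means the filter can be switched on or off per worksheet rather than baked into the data.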
The mystery of the missing D’ohs
At one point during the data exploration process, when I was looking at levels of utterances of individual words, I noticed that the word ‘doh’ appeared only 50 times in the data. I did a comparison back to the original .csv file to ensure this wasn’t due to something that happened during the transformation process, but it didn’t appear to be. This number seemed far too low, so I wondered if the data was complete. I discussed it with others, who agreed it didn’t seem plausible, and also watched videos such as this one, which confirmed there was no way there could be so few occurrences of ‘doh’ in 26 seasons. Would I have to ditch the data set?! I then luckily came across a paragraph on Wikipedia which noted that in the scripts, the utterances of ‘doh’ are often written as an ‘annoyed grunt’. Going back to the original data set, I identified a field with the ‘raw text’ rather than the ‘normalised text’ that I had been working with. Sure enough, there were about 370 annoyed grunts, and about 300 came from Homer. This may warrant some further analysis, but it was enough to ease my worry that I had an incomplete data set.
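Hunting for the hidden d’ohs comes down to a case-insensitive text search over the raw field. A minimal sketch, assuming a `raw_text` column and the ‘annoyed grunt’ stage direction described above (the sample rows are invented for illustration):

```python
import pandas as pd

# Hypothetical rows; 'raw_text' holds the unnormalised script text, where
# d'oh is often transcribed as an annoyed grunt, per the Wikipedia note.
script_lines = pd.DataFrame({
    "character_id": [2, 2, 9],
    "raw_text": [
        "Homer Simpson: (ANNOYED GRUNT)",
        "Homer Simpson: Mmm... donuts.",
        "Moe Szyslak: (ANNOYED GRUNT) Get outta here!",
    ],
})

# Case-insensitive search for the stage direction.
grunts = script_lines[
    script_lines["raw_text"].str.contains("annoyed grunt", case=False)
]

# Count grunts per character, analogous to the ~370 found in the full data.
print(grunts.groupby("character_id").size())
```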
Layout and images
I wanted to create a comic book ‘look and feel’, as I thought it would work for the Simpsons theme, given that it’s a cartoon and some Simpsons comics have been published. Initially I considered bringing in images from Wikipedia and other sources online, but with the potential copyright implications around the official Simpsons images I decided against it. I read up a bit on Fair Use but didn’t gain enough understanding to establish whether I could use them in this way. I decided to purchase one image, a photograph of some graffiti featuring Bart Simpson (from Big Stock). In addition, I used a selection of free images from http://www.flaticon.com (spray can, wall and blackboard) and used PowerPoint for the cloud.
Points to note
One point to note is that there are other versions of the main characters, e.g. Homer’s spirit, 8-year-old Homer, etc. These have different character IDs, so they are not included in the totals for the family members. The volumes of words spoken by these variants didn’t seem to be significant overall.
It is worth mentioning that some ‘words’ included are in fact dates, or contain text such as ‘3rd’, ‘21yearold’ or ‘zzzzzzzzz’. The script lines were separated on blank spaces within the ‘normalised’ text. I removed numbers as part of the data preparation process, but about 30 were missed due to their format. Further steps could be taken to define more clearly what constitutes a word for the purposes of meaningful linguistic analysis.
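One possible tightening of what counts as a word is a stricter token pattern: purely alphabetic, optionally with an internal apostrophe or hyphen. This is only a sketch of the idea, not the method used in the post, and as the comment notes it still lets nonsense stretches like ‘zzzzzzzzz’ through.

```python
import re

# Accept lowercase alphabetic tokens, optionally with an internal
# apostrophe or hyphen; this drops '3rd', '21yearold' and bare numbers.
WORD_RE = re.compile(r"^[a-z]+(?:['-][a-z]+)*$")

tokens = ["doh", "3rd", "21yearold", "zzzzzzzzz", "donut", "1989"]
words = [t for t in tokens if WORD_RE.match(t)]
print(words)  # 'zzzzzzzzz' still passes; catching it would need a dictionary check
```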
There is a link to the finished product on Tableau Public here.