Automated pipeline, warehousing, and visualization for TV Ratings data
Image generated with DALL-E


Warehousing and Visualizing TV Ratings in the Microsoft Ecosystem

It’s no secret that the conventional recommendation engines built into most streaming services are less than helpful, especially for genres that aren’t popular among the general public. This comes coupled with a vast sea of low-effort shows designed to market streaming platforms by pandering to a wide variety of niche subgroups, often to the detriment of writing quality. Good content is certainly still being created, but it’s growing harder to find through all the noise, and it’s disappointing that recommendation algorithms aren’t more effective at helping sift through it all.

While I’m not yet training my own curation algorithms, I have taken matters into my own hands and set up a live pipeline to warehouse the open IMDB dataset in SQL Server, finding some key insights in the process. IMDB hosts a considerable amount of data about television and movies, along with its own ratings system, and the data is quite easy to access. I chose the Microsoft data ecosystem both as a training exercise for myself and because I honestly enjoy working in it; SSIS in particular is great for its ease of use, parallelization capabilities, and ecosystem integration.

To start, I used generative AI to produce a C# script capable of fetching and decompressing the raw IMDB data from their web endpoint, and implemented it as a script task in an SSIS data flow. I then ran several instances of this task to parallelize downloading and unzipping all the raw files, re-syncing at a checkpoint to ensure every file had been downloaded prior to staging. I included several such checkpoints to ensure there is no chance of updating the EDW with only a partial update.
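The C# script task itself isn’t reproduced here, but the fetch-and-decompress step can be sketched in Python. The endpoint and filenames below are IMDB’s publicly documented dataset downloads; treat the rest as an illustration of the pattern rather than the actual SSIS task:

```python
import gzip
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://datasets.imdbws.com/"
FILES = ["title.basics.tsv.gz", "title.ratings.tsv.gz", "title.episode.tsv.gz"]

def decompress(gz_path: str, dest_path: str) -> None:
    """Stream-decompress a .gz file to dest_path without loading it all in memory."""
    with gzip.open(gz_path, "rb") as src, open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

def fetch_and_decompress(filename: str) -> str:
    """Download one IMDB dump and return the path of the unzipped .tsv."""
    gz_path = filename
    urllib.request.urlretrieve(BASE_URL + filename, gz_path)
    tsv_path = gz_path[:-3]  # strip the ".gz" suffix
    decompress(gz_path, tsv_path)
    return tsv_path

# Example (requires network) -- parallel downloads, then a single join
# point analogous to the checkpoint before staging:
# with ThreadPoolExecutor(max_workers=4) as pool:
#     paths = list(pool.map(fetch_and_decompress, FILES))
```

The thread pool’s `map` call doubles as the checkpoint: it only returns once every download has finished, mirroring the re-sync before staging.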

Upon successfully retrieving the raw data, the transformation into a snowflake (or snowflake-like) schema took a considerable amount of work: this dataset features some patterns that don’t play nicely with relational databases, such as arrays in cells and polymorphic keys, and of course no ETL project would be complete without encoding issues.
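As one example of the array-in-cell problem: the title.basics dump stores genres as a comma-separated list in a single column, which has to be exploded into a bridge table before it can be joined relationally. A minimal Python sketch of that explode step (the column names follow the public IMDB dump, which uses the literal string `\N` for missing values; the bridge-table shape is my own illustration, not the exact warehouse schema):

```python
def explode_genres(rows):
    """Turn (tconst, 'Comedy,Drama') rows into (tconst, genre) bridge rows.

    IMDB dumps use the literal string '\\N' for missing values, so those
    rows are skipped rather than emitted as a bogus genre.
    """
    bridge = []
    for tconst, genres in rows:
        if not genres or genres == r"\N":
            continue
        for genre in genres.split(","):
            bridge.append((tconst, genre.strip()))
    return bridge
```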

Overall Control Flow


Refactoring the polymorphic key structure into one more intuitive for a snowflake schema not only made up the majority of the transformations, but also included some of the slowest. Because the data eventually fanned out into numerous tables to account for different jobs, roles, genres, and so on, the large tables containing polymorphic keys had to pass through large conditional splits. However, this strategy proved invaluable for query performance and downstream quality of life, as the data was now prepared for traditional SQL warehousing.
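In Python terms, a conditional split over a polymorphic key behaves roughly like routing each row into a per-category bucket, one bucket per destination dimension table. The category values below are illustrative, not the warehouse’s actual dimension list:

```python
from collections import defaultdict

def conditional_split(rows, key_index):
    """Route rows into separate buckets keyed by the polymorphic column.

    Mimics an SSIS Conditional Split: one input stream, one output per
    distinct key value, each destined for its own dimension table.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key_index]].append(row)
    return dict(buckets)
```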

Polymorphic key conditionally split into 8 dimension tables


Once the data is staged, some ecosystem features really save time and effort. I created views to reshape the data into its final tables, and then used SQL Server’s MERGE feature to “sync” the view reading the staging tables with the prod database, which automatically updates, inserts, or deletes as needed via the query below. These queries can then run in parallel, as pictured above in the data flow.
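The MERGE query in the screenshot isn’t reproduced here, but its general shape is easy to sketch. This hypothetical helper just builds the statement text for a given table; the table and column names are placeholders, and the real queries live inside the SSIS package:

```python
def build_merge_sql(target, source_view, key_cols, data_cols):
    """Build a SQL Server MERGE that syncs target with a staging view:
    matched rows are updated, missing rows inserted, removed rows deleted.
    """
    on = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    sets = ", ".join(f"t.{c} = s.{c}" for c in data_cols)
    cols = ", ".join(key_cols + data_cols)
    vals = ", ".join(f"s.{c}" for c in key_cols + data_cols)
    return (
        f"MERGE {target} AS t "
        f"USING {source_view} AS s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {sets} "
        f"WHEN NOT MATCHED BY TARGET THEN INSERT ({cols}) VALUES ({vals}) "
        f"WHEN NOT MATCHED BY SOURCE THEN DELETE;"
    )
```

The `WHEN NOT MATCHED BY SOURCE THEN DELETE` clause is what makes MERGE a true sync rather than an append: rows that vanish from the staging view also vanish from prod.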

SQL query to update person_info table


Once warehoused, navigating the data is fairly straightforward. For security purposes, I wrote a simple stored procedure for accessing just the data needed for the TV ratings analysis, created a user whose only permission is to run that procedure, and then wrote a simple Python function to retrieve the data for a given TV show. It was nice to have capabilities like scaling the data point sizes by the number of votes and color-coding by season.
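The plotting itself is ordinary matplotlib; the part worth sketching is the aesthetic mapping, sizing each point by vote count and assigning a color per season. A pure-Python version of that mapping is below; the scale factors and palette are arbitrary choices for illustration, not the ones used in the actual charts:

```python
import math

def marker_sizes(vote_counts, base=20.0, scale=4.0):
    """Square-root scaling keeps high-vote episodes visible without
    letting them dwarf everything else."""
    return [base + scale * math.sqrt(v) for v in vote_counts]

def season_colors(seasons, palette=("tab:blue", "tab:orange", "tab:green", "tab:red")):
    """Assign each distinct season a color, cycling through the palette."""
    order = sorted(set(seasons))
    lookup = {s: palette[i % len(palette)] for i, s in enumerate(order)}
    return [lookup[s] for s in seasons]
```

These lists drop straight into `plt.scatter(x, y, s=marker_sizes(votes), c=season_colors(seasons))`.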

Scatterplot and line of best fit for Young Sheldon generated with matplotlib


Significant outliers in the data emerged quite quickly, so I added further analytics functions to fetch the top 5 and bottom 5 episodes by rating, along with their names, in order to research them and understand their impact. The same functions automatically flagged significant outliers for analysis.
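An analytics function like this can be sketched as plain Python over (title, rating) tuples. Here outliers are flagged with a standard 1.5×IQR rule, which is my own assumption rather than the exact criterion used in the pipeline:

```python
import statistics

def rating_summary(episodes):
    """episodes: list of (title, rating) tuples.
    Returns (top 5, bottom 5, IQR-based outliers) by rating."""
    ranked = sorted(episodes, key=lambda e: e[1], reverse=True)
    ratings = sorted(r for _, r in episodes)
    q1, _, q3 = statistics.quantiles(ratings, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [e for e in episodes if not (lo <= e[1] <= hi)]
    return ranked[:5], ranked[-5:], outliers
```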

Top 5 and bottom 5 episodes of Young Sheldon plus outliers


While this proved useful, I was still less than impressed with the aesthetic quality of these graphs, and wanted to implement the same views in PowerBI. Since I had already created a stored procedure for accessing the TV ratings dataset by title_id, this was trivial to set up, and PowerBI produced remarkably better visualizations out of the box. I was even able to get the scatterplot color-coded by season!

Scatterplot and line of best fit for The Office generated in PowerBI


So there we go: polished, professional visualizations of television ratings over time by show. I spent a considerable amount of time exploring other shows, and even from these visuals it was easy to notice trends, like most shows’ ratings declining over time (the ones that don’t are great, though!) and shows seeming to get canceled once ratings fall to around 65% (roughly a 6.5/10).

Scatterplot and line of best fit for TBBT generated in PowerBI
Scatterplot and line of best fit for HIMYM generated in PowerBI
Scatterplot and line of best fit for Futurama generated in PowerBI
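The line of best fit in these charts is PowerBI’s built-in trend line, but the same declining-ratings check can be done by hand with an ordinary least-squares slope over episode order. This is a sketch of that calculation, not the exact fit PowerBI performs:

```python
def trend_slope(ratings):
    """OLS slope of rating against episode index; negative means declining."""
    n = len(ratings)
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(ratings) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ratings))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```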

These graphs are just one analysis. The whole dataset is now warehoused with automatic updates, so there are numerous other explorations to pursue: perhaps there’s a correlation between ratings and cast, or maybe I could set up a feed to automatically recommend shows above a certain rating threshold by genre. Once data is automated and warehoused like this, the possibilities are endless.
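For instance, the threshold-by-genre feed mentioned above might start as little more than a filter over the warehoused titles. The record shape and thresholds here are hypothetical:

```python
def recommend(shows, genre, min_rating=8.0, min_votes=1000):
    """shows: list of dicts with 'title', 'genres', 'rating', 'votes'.
    Returns titles in the genre that clear both the rating and vote thresholds
    (the vote floor screens out ratings based on tiny samples)."""
    return [
        s["title"]
        for s in shows
        if genre in s["genres"] and s["rating"] >= min_rating and s["votes"] >= min_votes
    ]
```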
