WHAT CAN WE LEARN FROM NETFLIX DATA?
Nasir Yusuf Ahmad, ASMNES
Transforming Data into Actionable Insights for Informed Decision-Making| SDG 2, 3 & 4.| Co-Founder HealStoc | Human Capital Development Advocate | AI, ML, LLM Engineer.| Economist.| Author.| Certified Data Scientist.|
"When you torture the data, it will confess to anything." -Ronald A. Fisher.
Yeah, it's the weekend again. The past two weeks have been incredible. It was Python Intensive week. The Click-on Kaduna instructors were nothing short of geeks. We had two incredible instructors and more than enough co-instructors who taught us the philosophy and anatomy of Python.
We went from writing basic Python constructs such as variables, functions, and for loops to using more advanced features such as lambda functions and list comprehensions.
Among the many projects we did, I think this Netflix analysis project, done in a Jupyter Notebook, is worth your time.
Introduction: In the era of streaming services, Netflix has become a household name, providing an extensive library of TV shows and movies. This article delves into the insights derived from a comprehensive Netflix dataset, aiming to shed light on the trends, preferences, and dynamics of content available on the platform until 2021.
The dataset encompasses key variables such as title, type, release year, age certification, runtime, genres, production countries, seasons, imdb_id, imdb_score, and imdb_votes.
I posed a series of questions and answered them in this analysis. Here are the eight (8) questions and solutions.
1. Find number of movies released in each decade from 1945, show a line chart to display this.
2. Show the distributions of run time features using an appropriate chart.
3. Show bar chart for movie genres.
4. Show bar chart for movie genres for children movies.
5. Which of the genres have the highest runtime on the overall dataset and for each age group.
6. Trends of Content Produced on Netflix Annually.
7. Sentiment Analysis From 2005 to 2020 Based on Genres.
8. Age Certification Distributions.
As with every dataset, after importing all the necessary dependencies, we first checked the head and tail to get an overview of the data we are dealing with, as you can see below:
Information About The Dataset:
The dataset has 5361 rows and 11 columns. The columns are: title, type, release_year, age_certification, runtime, genres, production_countries, seasons, imdb_id, imdb_score and imdb_votes.
The head of the Netflix dataset unveils cinematic gems from the past, including the documentary-style "Five Came Back" in 1945 and iconic movies like "Taxi Driver" and "Monty Python and the Holy Grail" in the 1970s. These early entries showcase diverse genres, ranging from crime dramas to fantasy comedies, providing a glimpse into Netflix's historical content landscape.
Moving to the tail of the dataset, contemporary offerings like "Fine Wine" and "Mighty Little Bheem: Kite Festival" reflect the platform's evolving content strategy, catering to romance, drama, and family audiences in 2021. The dynamic interplay of genres, runtimes, and IMDb scores across decades highlights Netflix's commitment to a rich and varied viewing experience.
The age_certification column has 2331 null values, the seasons column has 3450, imdb_score has 78, and imdb_votes has 94.
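The overview steps described above (shape, head, tail, and null counts) can be sketched as follows. This is only a minimal illustration on a tiny hand-made frame, not the actual Netflix file: the values in `sample` are hypothetical, and only the column names come from the article.

```python
import pandas as pd

# Hypothetical stand-in for the Netflix dataset (all values are made up).
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "type": ["MOVIE", "SHOW", "MOVIE", "SHOW"],
    "age_certification": ["R", None, None, "TV-MA"],
    "seasons": [None, 2.0, None, 1.0],
    "imdb_score": [7.1, None, 6.4, 8.0],
})

n_rows, n_cols = sample.shape          # number of rows and columns
null_counts = sample.isnull().sum()    # missing values per column

print(f"{n_rows} rows, {n_cols} columns")
print(sample.head(2))                  # first rows
print(sample.tail(2))                  # last rows
print(null_counts[null_counts > 0])    # only the columns with missing values
```

On the real data, the same calls would be run on `df` instead of `sample`.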
After this preliminary exploration, I started with the age certification ratings and what children could watch. To achieve my objective, I created the following function with a set of if/elif conditions:
def age_group(age_cert):
    if age_cert in ['TV-MA', 'R', 'NC-17']:
        return 'Adult'
    elif age_cert in ['PG','TV-14','G','PG-13','TV-PG']:
        return 'Requires Supervision'
    elif age_cert in ['TV-Y','TV-G','TV-Y7']:
        return 'Good for Children'
    else:
        return 'Not Rated'
The "age_certification" column contains the following values:
['TV-MA', 'R', 'PG', 'TV-14', 'G', 'PG-13', 'TV-PG', 'TV-Y', 'TV-G', 'TV-Y7', 'NC-17']
The above information is a list of content age certifications for TV shows and movies.
Let's break it down:
'TV-MA': "TV Mature Audiences."
'R': "Restricted; viewers under 17 require an accompanying parent or adult guardian."
'PG': "Parental Guidance Suggested."
'TV-14': "May be unsuitable for children under 14."
'G': "General audiences; suitable for all ages."
'PG-13': "Parents strongly cautioned; some material may be inappropriate for children under 13."
'TV-PG': "Parental guidance suggested; may be unsuitable for younger children."
'TV-Y': "Designed to be appropriate for all children, including preschoolers."
'TV-G': "General audience; suitable for all ages."
'TV-Y7': "Suitable for children age 7 and older."
'NC-17': "Movie rating; no one 17 and under admitted."
The list encompasses various age certifications used to guide viewers and parents on the suitability of TV shows and movies for different age groups. The certifications are assigned based on the content's themes, language, and other factors that may impact the appropriateness for certain audiences.
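As a side note, the same grouping can also be written as a lookup table instead of a chain of conditions. The sketch below is an alternative, not the code used in the notebook; the name `AGE_GROUPS` and the sample series `certs` are hypothetical.

```python
import pandas as pd

# Same grouping as age_group(), expressed as a dictionary lookup.
AGE_GROUPS = {
    "TV-MA": "Adult", "R": "Adult", "NC-17": "Adult",
    "PG": "Requires Supervision", "TV-14": "Requires Supervision",
    "G": "Requires Supervision", "PG-13": "Requires Supervision",
    "TV-PG": "Requires Supervision",
    "TV-Y": "Good for Children", "TV-G": "Good for Children",
    "TV-Y7": "Good for Children",
}

# Hypothetical sample of certifications, including a missing value.
certs = pd.Series(["TV-MA", "G", "TV-Y7", None, "PG-13"])

# Values missing from the table (including nulls) become 'Not Rated'.
groups = certs.map(AGE_GROUPS).fillna("Not Rated")
print(groups.tolist())
```

A lookup table keeps the certification-to-group mapping in one place, which makes it easy to review or change a single entry.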
1- Returning to our 1st question: "Find number of movies released in each decade from 1945, show a line chart to display this."
I ran the following code in the notebook:
df['decade'] = df['release_year'] // 10 * 10
decade_counts = df.groupby('decade').size()
plt.figure(figsize=(10,6))
decade_counts.plot.line()
plt.title('Number of Movies Released by Decade')
plt.xlabel('Decade')
plt.ylabel('Counts')
plt.show()
And here is the output:
The number of titles released rises sharply, especially from the year 2000 to 2019. Possible factors include advances in technology, an increase in total factor productivity (TFP), and strongly growing demand for entertainment since the beginning of the 21st century. The sharp decline at the 2020 mark has two likely causes: the dataset only runs to 2021, so the last decade is incomplete, and the Covid-19 pandemic of 2020/2021 brought a global economic lockdown. Note also that the code above counts every title in the dataset, both movies and TV shows.
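Since the question asks about movies from 1945 onward, one possible refinement is to filter on the type column before counting. The sketch below uses a small hypothetical frame; the labels "MOVIE" and "SHOW" are an assumption about how the type column is coded, so check `df['type'].unique()` on the real data first.

```python
import pandas as pd

# Hypothetical rows; the "MOVIE"/"SHOW" labels are an assumption.
df_demo = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "SHOW", "MOVIE", "MOVIE", "SHOW", "MOVIE"],
    "release_year": [1945, 1976, 1999, 2011, 2018, 2019, 2021],
})

# Keep movies released from 1945 onward, then bucket by decade.
is_movie = df_demo["type"] == "MOVIE"
from_1945 = df_demo["release_year"] >= 1945
movies = df_demo[is_movie & from_1945].copy()
movies["decade"] = movies["release_year"] // 10 * 10
movie_decade_counts = movies.groupby("decade").size()

# The last bucket may cover only part of a decade, so compare it with care.
last_year = int(df_demo["release_year"].max())
decade_is_complete = last_year % 10 == 9

print(movie_decade_counts)
print("last decade complete:", decade_is_complete)
```

The completeness flag is a reminder that a drop in the final bucket can simply mean the decade is not over yet.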
2- Our 2nd question is to: Show the distributions of run time features using an appropriate chart.
For the above question, the code snippet is provided below:
plt.figure(figsize=(10,6))
df['runtime'].hist(bins=30, edgecolor='black')
plt.title('Distribution of Movie Runtimes')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Counts')
plt.show()
Here is the output:
About 600 titles fall in the tallest bar, at a runtime of about 100 minutes, while fewer than 100 movies/TV shows have runtimes between 150 and 250 minutes. Children's shows mostly have the shortest runtimes, as we will see in later analysis, while seasons and adult TV shows have the longest.
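The histogram mixes movies and TV shows in one distribution. A grouped summary is one way to separate them; the sketch below uses hypothetical runtimes and assumes the type column holds "MOVIE" and "SHOW".

```python
import pandas as pd

# Hypothetical runtimes in minutes; type labels are an assumption.
df_demo = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "runtime": [95, 110, 150, 22, 48],
})

# Count, median, minimum and maximum runtime for each content type.
runtime_summary = (
    df_demo.groupby("type")["runtime"]
    .agg(["count", "median", "min", "max"])
)
print(runtime_summary)
```

The median is used here rather than the mean because a few very long titles can pull the mean upward.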
3- The third question is to: "Show bar chart for movie genres."
For this question, we create a function that groups similar genres into broader categories, which lets us plot a clean bar chart without much noise. This can be achieved using the following code:
# Define a function to map genres to broader categories
def map_to_category(genre):
    g = genre.lower()  # lowercase once so the checks are case-insensitive
    if 'action' in g or 'crime' in g or 'thriller' in g:
        return 'Crime'
    elif 'comedy' in g or 'fantasy' in g:
        return 'Comedy'
    elif 'drama' in g:
        return 'Drama'
    elif 'romance' in g or 'music' in g:
        return 'Romance'
    elif 'family' in g or 'animation' in g:
        return 'Family'
    else:
        return 'Other'
# Apply the function to create a new 'category' column
df['category'] = df['genres'].apply(map_to_category)
# Display the result
print(df[['genres', 'category']])
# Count the occurrences of each category
category_counts = df['category'].value_counts()
# Plot a bar chart
plt.figure(figsize=(10, 6))
category_counts.plot(kind='bar', color='skyblue')
plt.title('Count of Movies by Category')
plt.xlabel('Category')
plt.ylabel('Count')
# Rotate x-axis labels to be horizontal
plt.xticks(rotation=0)
# Add legend
plt.legend()
plt.show()
And here is the output below:
The above chart tells us that about 2000 titles fall into the Crime category, about 1750 into Comedy, and about 750 into Drama. Fewer than 250 fall into Family and Romance, while around 500 land in Other because their genres matched none of the keywords. Because the function returns the first matching category, a title tagged with both crime and comedy is counted only under Crime.
I think it would be worth investigating whether there is a relationship between the increase in the actual crime rate and the increase in the number of crime movies produced since the beginning of the decade.
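Because the category counts above assign each title to a single category, it can be useful to compare them with a count in which every listed genre is tallied. The sketch below assumes that the genres column stores list-like strings such as "['crime', 'comedy']"; that format and the sample rows are assumptions, not something shown in the article.

```python
import ast
import pandas as pd

# Hypothetical rows; the list-like string format is an assumption.
df_demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['crime', 'comedy']", "['comedy']", "[]"],
})

# Parse each string into a real list, then give every genre its own row.
genre_lists = df_demo["genres"].apply(ast.literal_eval)
genre_counts = genre_lists.explode().dropna().value_counts()
print(genre_counts)
```

With this approach a title tagged with two genres contributes to both counts, so the totals can exceed the number of titles.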
4- Moving forward, our fourth question is to: "Show bar charts for movie genres for children movies."
Given the preceding code, you might have guessed the approach. To reach the stated objective, I reused the age-group function with its if/elif conditions:
def age_group(age_cert):
    if age_cert in ['TV-MA', 'R', 'NC-17']:
        return 'Adult'
    elif age_cert in ['PG','TV-14','G','PG-13','TV-PG']:
        return 'Requires Supervision'
    elif age_cert in ['TV-Y','TV-G','TV-Y7']:
        return 'Good for Children'
    else:
        return 'Not Rated'
df['age_group_ratings'] = df['age_certification'].apply(age_group)
# Keep only the rows rated as suitable for children
children_movies_df = df[df['age_group_ratings'] == 'Good for Children']
# Children genres in the dataset
children_movies_df['genres'].value_counts().head(10).plot(kind='barh')
plt.title("Count of Children's Movie by Genre")
plt.show()
And below is the output:
In this output, the most common genre for children's titles is animation on its own, followed by animation and family; then animation, family and comedy; then reality; then comedy and family. The least common combination in the chart has some elements of fantasy and sci-fi in it.
If your child has access to a Netflix subscription, the entries that occur most often in the above chart tell you what he or she will likely be exposed to.
5- Which of the genres has the highest runtime on the overall dataset and for each age group?
Yeah, this is also an important question. Running time indicates how much time a viewer must commit to a title, and it also tells us who is committing that time. Let us see what we have below:
As usual, we start with the snippet:
# Average runtime by Ratings
df.groupby('age_group_ratings')['runtime'].mean().plot(kind='barh')
plt.title("Age Group Rating Vs Runtime")
plt.xlabel('Runtime')
plt.show()
And here is the result:
Titles that were not rated at all have the longest average running time, and adult titles have the second longest. Adult movies/TV shows have an average running time of about an hour, while children's titles average about 20 minutes.
This is understandable: most children's films are animations, and they require less time to conclude than adult movies with more complex plots. In other words, adults are likely to spend more time per title on Netflix than children.
Here is a warning, however: a child with unsupervised access to Netflix has a high chance of coming across adult content.
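The snippet above compares runtime across age groups. To also rank genres by runtime, overall and within each age group, one possible approach is sketched below. It assumes the genres column stores list-like strings and uses hypothetical rows; `long_df`, `overall` and `top_per_group` are illustrative names.

```python
import ast
import pandas as pd

# Hypothetical rows; the list-like genre strings are an assumption.
df_demo = pd.DataFrame({
    "genres": ["['drama', 'crime']", "['animation']",
               "['drama']", "['animation', 'family']"],
    "age_group_ratings": ["Adult", "Good for Children",
                          "Adult", "Good for Children"],
    "runtime": [120, 20, 100, 30],
})

# One row per (title, genre) pair.
long_df = (
    df_demo.assign(genre=df_demo["genres"].apply(ast.literal_eval))
    .explode("genre")
    .dropna(subset=["genre"])
)

# Mean runtime per genre over the whole frame, longest first.
overall = long_df.groupby("genre")["runtime"].mean().sort_values(ascending=False)

# Genre with the highest mean runtime inside each age group.
by_group = (
    long_df.groupby(["age_group_ratings", "genre"])["runtime"]
    .mean()
    .reset_index()
)
top_per_group = (
    by_group.sort_values("runtime", ascending=False)
    .drop_duplicates("age_group_ratings")
    .set_index("age_group_ratings")["genre"]
)

print(overall)
print(top_per_group)
```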
6- Trends of Content Produced on Netflix Annually.
Let us see the direction of the content released on Netflix from 2000 to 2020:
For this, I imported a new dependency, plotly.express, under the alias px. Then I created a new data frame called df1 containing the 'type' and 'release_year' columns. I renamed the columns, then grouped by both of them to count the titles and stored the result in a new data frame called df2.
import plotly.express as px
df1 = df[['type', 'release_year']]
df1 = df1.rename(columns={"release_year":"Release Year", "type": "Type"})
df2 = df1.groupby(['Release Year', 'Type']).size().reset_index(name="Total Counts")
To plot the chart, run the code below:
df2 = df2[df2['Release Year']>=2000]
graph = px.line(df2, x = "Release Year", y ="Total Counts", color = "Type",
title = "Trends of Content Produced on Netflix From The Year 2000 to 2020.")
graph.show()
Here is the output:
The trends of movies and shows produced from 2000 to 2020 tell us that, from the beginning of the 21st century to around 2019, more movies than TV shows have been released on Netflix. It therefore makes sense that the average running time is around 60 minutes, since movies, which run longer than TV episodes, dominate the catalogue.
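The movie-versus-show comparison can also be expressed as a yearly share, which is easier to read than two raw counts. This is a sketch on hypothetical rows, again assuming the type column is coded as "MOVIE" and "SHOW".

```python
import pandas as pd

# Hypothetical rows; type labels are an assumption.
df_demo = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW", "SHOW"],
    "release_year": [2018, 2018, 2018, 2019, 2019, 2020],
})

# Rows are years, columns are content types, cells are title counts.
yearly = pd.crosstab(df_demo["release_year"], df_demo["type"])

# Fraction of each year's titles that are movies.
movie_share = yearly["MOVIE"] / yearly.sum(axis=1)

print(yearly)
print(movie_share.round(2))
```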
7- Sentiment Analysis From 2005 to 2020 Based on Genres.
Let us quickly understand what sentiment analysis is:
Sentiment Analysis involves using computational methods to determine the emotional tone expressed in text, classifying it as positive, negative, or neutral. It is conducted to understand public perception, customer feedback, or sentiment towards a product, service, brand, or event. Businesses leverage sentiment analysis to gain insights, improve decision-making, enhance customer experience, and respond effectively to public opinion.
I was curious to run this NLP analysis using the genres column since the data set lacks descriptions. Below is the code snippet:
from textblob import TextBlob
df3 = df[['release_year', 'genres']]
df3 = df3.rename(columns = {"release_year":"Release Year", "genres": "Genre"})
for index, row in df3.iterrows():
    d = row['Genre']
    testimonial = TextBlob(d)
    p = testimonial.sentiment.polarity
    if p == 0:
        sent = 'Neutral'
    elif p > 0:
        sent = 'Positive'
    else:
        sent = 'Negative'
    # Write the label to the current row only
    df3.loc[index, 'Sentiment'] = sent
df3 = df3.groupby(['Release Year', 'Sentiment']).size().reset_index(name='Total Count')
df3 = df3[df3['Release Year']>2005]
bargraph = px.bar(df3, x = "Release Year", y ="Total Count", color="Sentiment", title="Sentiment Analysis On Netflix Dataset From 2005 to 2020.")
To plot the graph, run:
bargraph.show()
The above code performs sentiment analysis on our Netflix dataset. It first selects the 'release_year' and 'genres' columns, renames them for clarity, and then applies sentiment analysis to the 'genres' using TextBlob.
As previously stated, the sentiment polarity is categorized as 'Neutral,' 'Positive,' or 'Negative.' The sentiment analysis results are then aggregated by release year, forming a new DataFrame ('df3') with the count of each sentiment category.
The analysis is restricted to titles released after 2005. Finally, a bar graph is generated using Plotly Express ('px.bar') to visualize the sentiment distribution over the specified years, providing a graphical representation of sentiment trends in the Netflix dataset from 2006 to 2020.
Here is the output:
The sentiment analysis is based on the genres column and the release years from 2006 to 2020. It shows that we have just a few titles rated positive, while the rest are neutral. I read this as consistent with the dominance of crime titles seen above in question 3, but the result should be interpreted with caution: genre labels are short keywords rather than descriptive text, and TextBlob assigns a polarity of 0 to words that are not in its lexicon, so a mostly neutral result is expected. If the pattern holds for the content itself, it is something for those calling the shots to consider.
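The labelling step can also be written without an explicit loop. The sketch below shows the pattern with a toy scorer so that it runs without extra packages: `toy_polarity` and its made-up lexicon are hypothetical stand-ins for `TextBlob(text).sentiment.polarity`, and they do not reproduce TextBlob's actual scores.

```python
import pandas as pd

def label_polarity(p):
    """Map a polarity score to the three labels used in the analysis."""
    if p > 0:
        return "Positive"
    if p < 0:
        return "Negative"
    return "Neutral"

def toy_polarity(text):
    """Hypothetical scorer standing in for TextBlob(text).sentiment.polarity."""
    lexicon = {"romance": 0.5, "horror": -0.5}  # made-up weights
    lowered = text.lower()
    hits = [weight for word, weight in lexicon.items() if word in lowered]
    # Words outside the lexicon contribute nothing, so the score is 0.0.
    return sum(hits) / len(hits) if hits else 0.0

# Hypothetical genre strings.
genres = pd.Series(["['crime', 'drama']", "['romance']", "['horror']"])

# Score each string, then convert the score into a label.
sentiment = genres.map(toy_polarity).map(label_polarity)
print(sentiment.value_counts())
```

The first row has no word in the toy lexicon and comes out as neutral, which mirrors why keyword-only input tends to produce mostly neutral labels.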
8- Finally, we will answer the question of "Age Certification Distributions."
To plot this, run the following code:
import pandas as pd
import matplotlib.pyplot as plt
# Assuming df is your DataFrame
# Example data:
# df['age_certification'] = ...
# Create a bar chart for age certification distribution
age_cert_counts = df['age_certification'].value_counts()
age_cert_counts.plot(kind='bar', color='skyblue')
plt.title('Age Certification Distribution')
plt.xlabel('Age Certification')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()
This code snippet utilizes pandas and matplotlib to generate a bar chart depicting the distribution of age certifications in a DataFrame named 'df.' It counts the occurrences of each unique age certification, creates a bar chart, and adds labels and titles for clarity. The resulting visualization offers a quick overview of the age certification distribution in the dataset.
Here is the output:
The Netflix age certification distribution graph shows the number of movies and TV shows in each age certification category. The most common age certification is TV-MA, with over 700 movies and TV shows in this category.
This is followed by R, with over 500 movies and TV shows. The least common age certification is NC-17, with fewer than 50 movies and TV shows in this category.
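Note that value_counts() skips missing values by default, and 2331 of the 5361 rows (roughly 43%) have no age certification. One way to keep that group visible is to include nulls and look at shares rather than raw counts. The sketch below uses a small hypothetical series, not the real column.

```python
import pandas as pd

# Hypothetical certifications with missing values.
certs = pd.Series(["TV-MA", "R", "TV-MA", None, None, None, "NC-17", "TV-MA"])

# Share of each certification, with missing values kept as their own group.
shares = certs.value_counts(normalize=True, dropna=False)

# Fraction of rows with no certification at all.
missing_share = certs.isna().mean()

print(shares.round(3))
print("missing share:", round(missing_share, 3))
```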
Conclusion:
In conclusion, the Netflix dataset analysis has provided valuable insights into various aspects of the platform's content. Here's a summary based on the eight questions posed:
1. Number of Movies Released in Each Decade (1945 Onward): The analysis revealed a significant increase in movie releases from the 2000s onwards, peaking in the 2010s. The decline in 2020 can be attributed to the global impact of the COVID-19 pandemic.
2. Distribution of Run Time Features: The distribution of runtimes indicates that the majority of movies and TV shows on Netflix have runtimes around 100 minutes. Children's content tends to have shorter runtimes, while some outliers, such as seasons and adult TV shows, have longer durations.
3. Bar Chart for Movie Genres: The bar chart categorizing movie genres into broader groups like Crime, Comedy, Drama, Romance, Family, and Others showed that Crime, Comedy, and Drama are the most prevalent genres on Netflix, with Crime being the most prominent.
4. Bar Chart for Children's Movie Genres: The analysis of genres specifically for children's movies revealed that Animation, Family, and Comedy are the most common genres, emphasizing the family-friendly nature of children's content on Netflix.
5. Genres with Highest Runtime: Adult-rated content, indicated by age certifications like TV-MA, tends to have the second-longest runtime after content with no specified age certification. This suggests that adult-oriented content often requires a longer time commitment, while children's content, especially animation, has shorter runtimes.
6. Trends of Content Produced Annually (2000-2020): The trend analysis showcased a substantial increase in the production of movies compared to TV shows on Netflix from 2000 to 2019. This aligns with the growing demand for varied entertainment options.
7. Sentiment Analysis on Genres (2005-2020): The sentiment analysis, although limited by the absence of textual descriptions, indicated that the majority of genres in the dataset were classified as neutral. This suggests a lack of strongly polarized sentiments in the genres.
8. Age Certification Distributions: The bar chart depicting age certification distributions highlighted that TV-MA and R are the most common age certifications on Netflix, with the majority of content falling into these mature categories.