
WHAT CAN WE LEARN FROM NETFLIX DATA?

Nasir Yusuf Ahmad, ASMNES

Transforming Data into Actionable Insights for Informed Decision-Making| SDG 2, 3 & 4.| Co-Founder HealStoc | Human Capital Development Advocate | AI, ML, LLM Engineer.| Economist.| Author.| Certified Data Scientist.|

Published: November 26, 2023

"If you torture the data long enough, it will confess to anything." -Ronald H. Coase.

Yeah, it's the weekend again. The past two weeks have been incredible. It was Python Intensive week. The Click-on Kaduna instructors were nothing short of geeks. We had two incredible instructors and more than enough co-instructors who taught us the philosophy and anatomy of Python.

We went from writing basic Python constructs such as variables, functions, and for loops to using more advanced features such as lambda functions and list comprehensions.

Among the many projects we did, I find this Netflix analysis project, done in Jupyter Notebook, worthy of your time.

Introduction: In the era of streaming services, Netflix has become a household name, providing an extensive library of TV shows and movies. This article delves into the insights derived from a comprehensive Netflix dataset, aiming to shed light on the trends, preferences, and dynamics of content available on the platform until 2021.

The dataset encompasses key variables such as title, type, release year, age certification, runtime, genres, production countries, seasons, imdb_id, imdb_score, and imdb_votes.

I posed a series of questions and answered them in this analysis. Here are the eight (8) questions and their solutions.

1. Find the number of movies released in each decade from 1945, and show a line chart to display this.

2. Show the distribution of the runtime feature using an appropriate chart.

3. Show a bar chart for movie genres.

4. Show a bar chart for movie genres for children's movies.

5. Which of the genres have the highest runtime on the overall dataset and for each age group?

6. Trends of Content Produced on Netflix Annually.

7. Sentiment Analysis From 2005 to 2020 Based on Genres.

8. Age Certification Distributions.

As with every dataset, after importing all the necessary dependencies, we first checked the head and tail to get an overview of the data we are dealing with, as you can see below:

Author's Computation Using Jupyter Notebook.

Information About The Dataset:

The dataset has a total of 5,361 rows and 11 columns. The column headers are: title, type, release_year, age_certification, runtime, genres, production_countries, seasons, imdb_id, imdb_score and imdb_votes.

The head of the Netflix dataset unveils cinematic gems from the past, including the documentary-style "Five Came Back" in 1945 and iconic movies like "Taxi Driver" and "Monty Python and the Holy Grail" in the 1970s. These early entries showcase diverse genres, ranging from crime dramas to fantasy comedies, providing a glimpse into Netflix's historical content landscape.

Moving to the tail of the dataset, contemporary offerings like "Fine Wine" and "Mighty Little Bheem: Kite Festival" reflect the platform's evolving content strategy, catering to romance, drama, and family audiences in 2021. The dynamic interplay of genres, runtimes, and IMDb scores across decades highlights Netflix's commitment to a rich and varied viewing experience.

The age_certification column has 2,331 null values, the seasons column has 3,450, the imdb_score column has 78, and the imdb_votes column has 94.
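Null counts like these come from a single pandas call. Here is a minimal sketch, assuming a tiny made-up frame (named `sample` here, not the real dataset) with a few of the same column names:

```python
import pandas as pd

# Hypothetical four-row sample standing in for the real Netflix DataFrame
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "age_certification": ["R", None, "TV-MA", None],
    "seasons": [None, None, 2.0, None],
    "imdb_score": [8.3, 7.1, None, 6.0],
})

# Count the missing values in each column
null_counts = sample.isnull().sum()
print(null_counts)
```

On the real data, the same call on df would list the missing values for all 11 columns at once.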

After this preliminary exploration, I started with the age certification ratings and what children could watch. To achieve this objective, I created the following function, which uses if/elif conditions:

def age_group(age_cert):
    if age_cert in ['TV-MA', 'R', 'NC-17']:
        return 'Adult'
    elif age_cert in ['PG','TV-14','G','PG-13','TV-PG']:
        return 'Requires Supervision'
    elif age_cert in ['TV-Y','TV-G','TV-Y7']:
        return 'Good for Children'
    else:
        return 'Not Rated'

The "age_certification" column contains the following values:

['TV-MA', 'R', 'PG', 'TV-14', 'G', 'PG-13', 'TV-PG', 'TV-Y', 'TV-G', 'TV-Y7', 'NC-17']

The above information is a list of content age certifications for TV shows and movies.

Let's break it down:

'TV-MA': "TV Mature Audiences."

'R': "Restricted", an MPAA movie rating; viewers under 17 require an accompanying parent or adult guardian.

'PG': "Parental Guidance Suggested."

'TV-14': "Appropriate for viewers over 14 years old."

'G': A general audience rating, indicating that the content is suitable for all ages.

'PG-13': "Parents Strongly Cautioned"; some material may be inappropriate for children under 13.

'TV-PG': "Suitable for all ages but may still require some parental guidance."

'TV-Y': "This rating is for content designed to be appropriate for all children, even those as young as preschool."

'TV-G': "General audience content, suitable for all ages."

'TV-Y7': "Content is suitable for children age 7 and older."

'NC-17': "MPAA rating for movies, indicating that no one 17 and under is admitted."

The list encompasses various age certifications used to guide viewers and parents on the suitability of TV shows and movies for different age groups. The certifications are assigned based on the content's themes, language, and other factors that may impact the appropriateness for certain audiences.
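To see how the age_group function treats these certifications, including missing ones, here is a small sketch that applies it to a made-up Series (`certs` is an illustrative name, not part of the original notebook):

```python
import pandas as pd

def age_group(age_cert):
    """Map a single age certification to a broader audience group."""
    if age_cert in ['TV-MA', 'R', 'NC-17']:
        return 'Adult'
    elif age_cert in ['PG', 'TV-14', 'G', 'PG-13', 'TV-PG']:
        return 'Requires Supervision'
    elif age_cert in ['TV-Y', 'TV-G', 'TV-Y7']:
        return 'Good for Children'
    else:
        # Missing or unrecognized certifications end up here
        return 'Not Rated'

# Hypothetical values, including one missing certification
certs = pd.Series(['TV-MA', 'PG-13', 'TV-Y7', None])
groups = certs.apply(age_group)
print(groups.tolist())
```

Since the age_certification column has many nulls, the 'Not Rated' branch is the one most rows will hit, which is worth keeping in mind when reading the later charts.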

1- Returning to our 1st question: "Find number of movies released in each decade from 1945, show a line chart to display this."

I ran the following code in the notebook:

df['decade'] = df['release_year'] // 10 * 10
decade_counts = df.groupby('decade').size()
plt.figure(figsize=(10,6))
decade_counts.plot.line()
plt.title('Number of Movies Released by Decade')
plt.xlabel('Decade')
plt.ylabel('Counts')
plt.show()

And here is the output:

There has clearly been a significant increase in the number of movies released, especially from 2000 to 2019. This may be attributed to a number of factors, including advances in technology and growth in total factor productivity (TFP). The sharp decline around 2020 partly reflects the COVID-19 pandemic of 2020/2021, when much of the global economy was locked down, but it is also an artifact of the grouping: the dataset ends in 2021, so the 2020s bin covers only two years. Overall, the chart suggests that the demand for entertainment has grown rapidly since the beginning of the 21st century.
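The groupby above counts every row, movies and shows alike. If only movies are wanted, a filter can be applied first. The sketch below assumes the type column holds the values 'MOVIE' and 'SHOW' (an assumption about the dataset) and uses a made-up frame named `sample`:

```python
import pandas as pd

# Hypothetical sample; the 'MOVIE'/'SHOW' labels are assumed, not confirmed
sample = pd.DataFrame({
    "type": ["MOVIE", "SHOW", "MOVIE", "MOVIE", "SHOW", "MOVIE"],
    "release_year": [1945, 1989, 1994, 2015, 2018, 2021],
})

# Keep movies released from 1945 onward
movies = sample[(sample["type"] == "MOVIE") & (sample["release_year"] >= 1945)]

# Floor each year to the start of its decade, then count rows per decade
decade = movies["release_year"] // 10 * 10
movie_decade_counts = movies.groupby(decade).size()
print(movie_decade_counts)
```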

2- Our 2nd question is to: Show the distributions of run time features using an appropriate chart.

For the above question, the code snippet is provided below:

plt.figure(figsize=(10,6))
df['runtime'].hist(bins=30, edgecolor='black')
plt.title('Distribution of Movie Runtimes')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Counts')
plt.show()

Here is the output:

About 600 titles fall in the most common runtime range, around 100 minutes, while fewer than 100 movies and TV shows have the longest runtimes of 150 to 250 minutes. Children's shows mostly have the shortest runtimes, as we will see in later analysis, while seasons and adult TV shows have the longest.
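Reading counts off a histogram is approximate, so it can help to check them numerically. A minimal sketch with made-up runtimes (the values and bin edges below are illustrative assumptions):

```python
import pandas as pd

# Hypothetical runtimes in minutes
runtimes = pd.Series([22, 24, 45, 60, 95, 100, 104, 110, 150, 210], name="runtime")

# Summary statistics: count, mean, min, quartiles, max
summary = runtimes.describe()

# Count titles per runtime range; each bin includes its right edge
bins = pd.cut(runtimes, bins=[0, 50, 100, 150, 250]).value_counts().sort_index()
print(summary)
print(bins)
```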

3- The third question is to: "Show bar chart for movie genres."

For this question, we have to create a function that groups similar genres together so that we can plot a clear bar chart without too much noise. This can be achieved using the following code:

# Define a function to map genres to broader categories
def map_to_category(genre):
    if 'action' in genre.lower() or 'crime' in genre.lower() or 'thriller' in genre.lower():
        return 'Crime'
    elif 'comedy' in genre.lower() or 'fantasy' in genre.lower():
        return 'Comedy'
    elif 'drama' in genre.lower():
        return 'Drama'
    elif 'romance' in genre.lower() or 'music' in genre.lower():
        return 'Romance'
    elif 'family' in genre.lower() or 'animation' in genre.lower():
        return 'Family'
    else:
        return 'Other'

# Apply the function to create a new 'category' column
df['category'] = df['genres'].apply(map_to_category)


# Display the result
print(df[['genres', 'category']])


# Count the occurrences of each category
category_counts = df['category'].value_counts()

# Plot a bar chart
plt.figure(figsize=(10, 6))
category_counts.plot(kind='bar', color='skyblue')
plt.title('Count of Movies by Category')
plt.xlabel('Category')
plt.ylabel('Count')

# Rotate x-axis labels to be horizontal
plt.xticks(rotation=0)

# Add legend
plt.legend()
plt.show()

And here is the output below:

The above chart tells us that about 2,000 of the titles released fall in the Crime category (which also covers action and thriller), about 1,750 are Comedy, 750 are Drama, fewer than 250 are Family or Romance, and around 500 are Other, meaning their genres were not specified or matched none of the keywords in the function.
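Note that map_to_category returns the first match only, so a title tagged both crime and comedy is counted once, as Crime. An alternative is to count every genre a title carries. The sketch below assumes the genres column stores strings that look like Python lists (for example "['crime', 'drama']"), which may not match the real file:

```python
import ast
import pandas as pd

# Hypothetical sample; the string-encoded list format is an assumption
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['crime', 'drama']", "['comedy']", "['drama', 'comedy', 'family']"],
})

# Parse each string into a real list, then give every genre its own row
genre_lists = sample["genres"].apply(ast.literal_eval)
genre_counts = genre_lists.explode().value_counts()
print(genre_counts)
```

With this approach the counts add up to more than the number of titles, since multi-genre titles contribute to several bars.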

I think it is really important that we investigate whether there is a relationship between the increase in the actual crime rate and the increase in the number of crime movies produced since the beginning of the decade.

4- Moving forward, our fourth question is to: "Show bar charts for movie genres for children movies."

Given the preceding code, you might have guessed the approach. To arrive at the stated objective, I reused the age_group function with its if/elif conditions:

def age_group(age_cert):
    if age_cert in ['TV-MA', 'R', 'NC-17']:
        return 'Adult'
    elif age_cert in ['PG','TV-14','G','PG-13','TV-PG']:
        return 'Requires Supervision'
    elif age_cert in ['TV-Y','TV-G','TV-Y7']:
        return 'Good for Children'
    else:
        return 'Not Rated'

df['age_group_ratings']= df['age_certification'].apply(age_group)
# Keep only the rows rated as good for children
children_movies_df = df[df['age_group_ratings'] == 'Good for Children']

# Children genres in the dataset
children_movies_df['genres'].value_counts().head(10).plot(kind='barh')
plt.title("Count of Children's Movie by Genre")
plt.show()

And below is the output:

This output shows that the most common genre among children's titles is animation on its own, followed by animation and family; then animation, family and comedy; then reality; then comedy and family. The least common children's titles have some elements of fantasy and sci-fi.

If your child has access to a Netflix subscription, you can tell what he or she will likely be exposed to by looking at what recurs most frequently in the above chart.

5- Which of the genres have the highest runtime on the overall dataset and for each age group?

Yeah, this is also an important question. The running time determines how much time one must commit to watching the content, and it also tells us who is committing that time. Let us see what we have below:

As usual, we start with the snippet:

# Average runtime by Ratings
df.groupby('age_group_ratings')['runtime'].mean().plot(kind='barh')
plt.title("Age Group Rating Vs Runtime")
plt.xlabel('Runtime')
plt.show()

And here is the result:

Titles that require supervision, together with adult titles, rank just below unrated titles, which have the longest average running time. Adult movies and TV shows have an average running time of about an hour, while children's titles average about 20 minutes.

This is understandable: most children's films are animations, which take less time to conclude than adult movies with more complex plots. In other words, adults may spend more time per title on Netflix than children.

A word of warning, however: a child with unsupervised access to Netflix is very likely to come across adult content.
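The snippet above averages runtime per age group, but the question also asks about genres. One possible way to get the longest-running genre category overall and within each age group is sketched below, on a made-up frame (`sample`) that reuses the category and age_group_ratings column names:

```python
import pandas as pd

# Hypothetical sample with the columns created earlier in the analysis
sample = pd.DataFrame({
    "category": ["Crime", "Crime", "Family", "Family", "Drama", "Drama"],
    "age_group_ratings": ["Adult", "Adult", "Good for Children",
                          "Good for Children", "Adult", "Requires Supervision"],
    "runtime": [110, 90, 20, 30, 120, 100],
})

# Average runtime per category across the whole sample, longest first
overall = sample.groupby("category")["runtime"].mean().sort_values(ascending=False)

# Average runtime per (age group, category), then keep the top category per group
mean_rt = sample.groupby(["age_group_ratings", "category"], as_index=False)["runtime"].mean()
longest_per_group = (
    mean_rt.sort_values("runtime", ascending=False)
    .drop_duplicates("age_group_ratings")
    .set_index("age_group_ratings")["category"]
)
print(overall)
print(longest_per_group)
```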

6- Trends of Content Produced on Netflix Annually.

Let us see the direction of the content released on Netflix from 2000 to 2020:

For this, I imported a new dependency, Plotly Express (plotly.express as px). Then I created a new data frame called df1 from the 'type' and 'release_year' columns. I renamed the columns, then grouped by release year and type and stored the counts in a new data frame called df2.

import plotly.express as px 

df1 = df[['type', 'release_year']]

df1 = df1.rename(columns={"release_year":"Release Year", "type": "Type"})

df2 = df1.groupby(['Release Year', 'Type']).size().reset_index(name="Total Counts")

To plot the chart, run the code below:

df2 = df2[df2['Release Year']>=2000]
graph = px.line(df2, x = "Release Year", y ="Total Counts", color = "Type", 
                title = "Trends of Content Produced on Netflix From The Year 2000 to 2020.")

graph.show()

Here is the output:

The chart of movies and shows produced from 2000 to 2020 tells us that, from the beginning of the 21st century to around 2019, more movies than TV shows have been released on Netflix. It therefore makes sense for the average running time to be around 60 minutes, which is roughly the length of a full movie.
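The movie-versus-show comparison can also be read as a table rather than a chart. A small sketch, assuming the 'MOVIE' and 'SHOW' labels and using a made-up frame with the renamed columns:

```python
import pandas as pd

# Hypothetical sample using the renamed column names from df1
sample = pd.DataFrame({
    "Release Year": [2018, 2018, 2018, 2019, 2019, 2019],
    "Type": ["MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW", "SHOW"],
})

# One row per year, one column per type, zero where a type is absent
counts = sample.groupby(["Release Year", "Type"]).size().unstack(fill_value=0)

# Fraction of each year's titles that are movies
counts["Movie Share"] = counts["MOVIE"] / (counts["MOVIE"] + counts["SHOW"])
print(counts)
```

A share above 0.5 in a given year means movies outnumbered shows that year.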

7- Sentiment Analysis From 2005 to 2020 Based on Genres.

Let us quickly understand what sentiment analysis is:

Sentiment Analysis involves using computational methods to determine the emotional tone expressed in text, classifying it as positive, negative, or neutral. It is conducted to understand public perception, customer feedback, or sentiment towards a product, service, brand, or event. Businesses leverage sentiment analysis to gain insights, improve decision-making, enhance customer experience, and respond effectively to public opinion.

I was curious to run this NLP analysis using the genres column since the data set lacks descriptions. Below is the code snippet:

from textblob import TextBlob

df3 = df[['release_year', 'genres']]

df3 = df3.rename(columns = {"release_year":"Release Year", "genres": "Genre"})

# Score the genre text of each row and label it by polarity
for index, row in df3.iterrows():
    d=row['Genre']
    testimonial = TextBlob(d)
    p = testimonial.sentiment.polarity
    if p==0:
        sent = 'Neutral'
    elif p>0:
        sent = 'Positive'
    else:
        sent = 'Negative'
    # Write the label to the current row only
    df3.loc[index, 'Sentiment']=sent
df3 = df3.groupby(['Release Year', 'Sentiment']).size().reset_index(name='Total Count')

df3 = df3[df3['Release Year']>2005]

bargraph = px.bar(df3, x = "Release Year", y ="Total Count", color="Sentiment", title="Sentiment Analysis On Netflix Dataset From 2005 to 2020.")

To plot the graph, run:

bargraph.show()

The above code performs sentiment analysis on our Netflix dataset. It first selects the 'release_year' and 'genres' columns, renames them for clarity, and then applies sentiment analysis to the 'genres' using TextBlob.

As previously stated, the sentiment polarity is categorized as 'Neutral,' 'Positive,' or 'Negative.' The sentiment analysis results are then aggregated by release year, forming a new DataFrame ('df3') with the count of each sentiment category.

The analysis is restricted to titles released after 2005 (the filter is Release Year > 2005). Finally, a bar graph is generated using Plotly Express ('px.bar') to visualize the sentiment distribution over the specified years, providing a graphical representation of sentiment trends in the Netflix dataset over that period.
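The three-way labeling rule in the loop can be pulled out into a small helper, which makes the boundary at zero easy to check on its own. This is only a sketch: `label_polarity` is a hypothetical name, and the scores passed in are made-up floats rather than real TextBlob output:

```python
def label_polarity(p):
    """Turn a polarity score into a sentiment label."""
    if p > 0:
        return "Positive"
    if p < 0:
        return "Negative"
    # Exactly zero means no polarity was detected
    return "Neutral"

# Hypothetical polarity scores
labels = [label_polarity(p) for p in (0.5, 0.0, -0.3)]
print(labels)
```

Any text whose words carry no polarity scores exactly zero, so short inputs such as genre labels will tend to land in the Neutral bucket.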

Here is the output:

The sentiment analysis is based on the genres column and the release years from 2006 to 2020. It shows that only a few titles are labeled positive, while the rest are neutral. Because TextBlob is scoring the genre words themselves, this mainly tells us that most genre labels carry no polarity; it is consistent with the dominance of the crime category seen in question 3, but it should not be read as audience sentiment. Even so, the scarcity of positive labels is something for those calling the shots to consider.

8- Finally, we will answer the question of "Age Certification Distributions."

To plot this, run the following code:

import pandas as pd
import matplotlib.pyplot as plt

# Assuming df is your DataFrame
# Example data:
# df['age_certification'] = ...

# Create a bar chart for age certification distribution
age_cert_counts = df['age_certification'].value_counts()
age_cert_counts.plot(kind='bar', color='skyblue')
plt.title('Age Certification Distribution')
plt.xlabel('Age Certification')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

This code snippet utilizes pandas and matplotlib to generate a bar chart depicting the distribution of age certifications in a DataFrame named 'df.' It counts the occurrences of each unique age certification, creates a bar chart, and adds labels and titles for clarity. The resulting visualization offers a quick overview of the age certification distribution in the dataset.

Here is the output:

The Netflix age certification distribution graph shows the number of movies and TV shows in each age certification category. The most common age certification is TV-MA, with over 700 movies and TV shows in this category.

This is followed by R, with over 500 movies and TV shows. The least common age certification is NC-17, with fewer than 50 movies and TV shows in this category.
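Raw counts can be turned into shares, which makes it easier to judge how dominant a certification really is. A minimal sketch on made-up values (`certs` is illustrative); value_counts skips missing values unless dropna=False is passed:

```python
import pandas as pd

# Hypothetical certifications, two of them missing
certs = pd.Series(["TV-MA", "TV-MA", "R", "PG-13", None, None])

# Share of each certification among rated titles only
shares = certs.value_counts(normalize=True)

# Share of each certification among all titles, missing ones included
shares_with_missing = certs.value_counts(normalize=True, dropna=False)
print(shares)
print(shares_with_missing)
```

Because the age_certification column has many nulls, the two denominators can give quite different pictures.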

Conclusion:

In conclusion, the Netflix dataset analysis has provided valuable insights into various aspects of the platform's content. Here's a summary based on the eight questions posed:

1. Number of Movies Released in Each Decade (1945 Onward): The analysis revealed a significant increase in movie releases from the 2000s onwards, peaking in the 2010s. The decline in 2020 can be attributed to the global impact of the COVID-19 pandemic.

2. Distribution of Runtime Features: The distribution of runtimes indicates that the most common runtime for movies and TV shows on Netflix is around 100 minutes. Children's content tends to have shorter runtimes, while some outliers, such as seasons and adult TV shows, have longer durations.

3. Bar Chart for Movie Genres: The bar chart categorizing movie genres into broader groups like Crime, Comedy, Drama, Romance, Family, and Others showed that Crime, Comedy, and Drama are the most prevalent genres on Netflix, with Crime being the most prominent.

4. Bar Chart for Children's Movie Genres: The analysis of genres specifically for children's movies revealed that Animation, Family, and Comedy are the most common genres, emphasizing the family-friendly nature of children's content on Netflix.

5. Genres with Highest Runtime: Adult-rated content, indicated by age certifications like TV-MA, tends to have the second-longest runtime after content with no specified age certification. This suggests that adult-oriented content often requires a longer time commitment, while children's content, especially animation, has shorter runtimes.

6. Trends of Content Produced Annually (2000-2020): The trend analysis showcased a substantial increase in the production of movies compared to TV shows on Netflix from 2000 to 2019. This aligns with the growing demand for varied entertainment options.

7. Sentiment Analysis on Genres (2005-2020): The sentiment analysis, although limited by the absence of textual descriptions, indicated that the majority of genres in the dataset were classified as neutral. This suggests a lack of strongly polarized sentiments in the genres.

8. Age Certification Distributions: The bar chart depicting age certification distributions highlighted that TV-MA and R are the most common age certifications on Netflix, so these mature categories account for the largest share of rated content.

