End to End Movie Recommendation System with Flask app
AI With Noor
AI Engineer | Genai | RAG | Expertise in AI,ML,NLP,DL | YouTuber | Data Engineer - Python | Computer Vision | AI Tutor
Introduction
In this blog post, we will go through the process of building an end-to-end machine learning project. We will start by acquiring a dataset from Kaggle, then we will preprocess the data and train a machine learning model. Finally, we will deploy the model as a web application using Flask.
Getting Data from Kaggle
Kaggle is a popular platform that hosts various datasets for machine learning. To get data from Kaggle, we first need to create an account on Kaggle and join a competition or find a dataset of interest.
Once we have found a dataset, we can download it directly from Kaggle. Some datasets require us to accept a competition rule or agreement first. After downloading, we extract the files and load the data into our project. This project uses two CSV files, tmdb_5000_movies.csv and tmdb_5000_credits.csv, which the notebook below loads with pandas.
Preprocessing Data
Before we can train a machine learning model, we need to preprocess the data. Preprocessing involves tasks such as cleaning the data, handling missing values, scaling the data, and encoding categorical features.
In our project, we will use pandas for data preprocessing. Pandas is a powerful library that provides data structures and functions for data analysis. We will load the data into a pandas dataframe and perform various preprocessing tasks on the dataframe.
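As a small illustration of these steps, the sketch below builds a toy dataframe, counts missing values per column, and drops incomplete rows. The titles and overviews are made up for the example and are not taken from the TMDB files.

```python
import pandas as pd

# Toy data standing in for the real movie table (values are illustrative only)
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "overview": ["A space adventure", None, "A courtroom drama"],
})

# Count missing values per column
missing_counts = toy.isnull().sum()

# Drop rows with any missing value and renumber the remaining rows from 0
clean = toy.dropna().reset_index(drop=True)

print(missing_counts)
print(clean)
```

Resetting the index after dropping rows keeps row labels and row positions identical, which matters later when rows are looked up by position.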
Training Machine Learning Model
Once the data is preprocessed, we can build the recommender. We will use scikit-learn, a popular library that provides machine learning algorithms along with tools for model selection, evaluation, and preprocessing.
Strictly speaking, no predictive model is trained here. Instead, each movie's text is converted into a word-count vector, and the vectors are compared using cosine similarity. Cosine similarity is the cosine of the angle between two vectors: values near 1 mean two movies are described with very similar words, and values near 0 mean they share almost none.
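To make the idea concrete, here is a minimal sketch that computes cosine similarity by hand with NumPy. The helper name cosine_sim and the three toy vectors are assumptions for illustration; the notebook itself uses scikit-learn's cosine_similarity.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy word-count vectors over a hypothetical three-word vocabulary
v1 = [1, 1, 0]
v2 = [1, 0, 1]   # shares one word with v1
v3 = [2, 2, 0]   # same direction as v1, twice the length

print(cosine_sim(v1, v2))  # about 0.5
print(cosine_sim(v1, v3))  # about 1.0
```

Because only the angle matters, v3 scores as identical to v1 even though its counts are larger. That is why a long overview and a short one can still be rated as very similar.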
After computing the similarity matrix, we will serialize it, together with the movie table, using the pickle library. Pickle saves Python objects in a binary format, so we can write both objects to files and load them later in our Flask application.
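The save-and-load round trip can be sketched as follows. The dictionary and the file name artifact.pkl are placeholders chosen for this example, not the objects saved by the notebook.

```python
import pickle
import tempfile
from pathlib import Path

# Any picklable Python object works; this dict is a stand-in for the real data
artifact = {"titles": ["Movie A", "Movie B"], "scores": [[1.0, 0.2], [0.2, 1.0]]}

# Write to a temporary directory so the example leaves no files behind in the project
path = Path(tempfile.mkdtemp()) / "artifact.pkl"

# Binary mode is required; the context manager closes the file for us
with open(path, "wb") as f:
    pickle.dump(artifact, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == artifact)  # True
```

Pickle files can execute code when loaded, so only unpickle files that you created yourself or otherwise trust.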
Building Flask Application
Flask is a popular web framework for Python that allows us to build web applications quickly and easily. We will use Flask to build a web application that takes a movie title as input and returns similar movies using the saved similarity matrix.
Our Flask application will have two routes: a home route and a recommendation route. The home route displays a simple HTML page with a form for selecting a movie. The recommendation route takes the selected title, looks up the most similar movies, fetches their posters from the TMDB API, and renders the results with the same page template.
Conclusion
In this blog post, we have gone through the process of building an end-to-end machine learning project. We started by acquiring a dataset from Kaggle, then we preprocessed the data and trained a machine learning model. Finally, we deployed the model as a web application using Flask.
The code for the project can be found below.
Notebook Code:
import numpy as np
import pandas as pd
[3]
movie = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv("tmdb_5000_credits.csv")
[4]
movie.head()
[5]
credits.head()
merging both datasets
[6]
movies = movie.merge(credits, on='title')
[7]
movies.head()
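One thing to watch with a merge on title: if two different films share a title, every matching pair is joined and extra rows appear. Whether the real files contain repeated titles is not checked here, so the sketch below uses toy tables with made-up ids to show the effect, and compares it with a join on the id and movie_id columns listed in the merged column index.

```python
import pandas as pd

# Toy tables: two different films share the title "Batman" (ids are made up)
movie_toy = pd.DataFrame({"id": [1, 2], "title": ["Batman", "Batman"]})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Batman", "Batman"],
    "cast": ["cast of film 1", "cast of film 2"],
})

# Joining on title pairs every "Batman" with every other "Batman": 2 x 2 = 4 rows
by_title = movie_toy.merge(credits_toy, on="title")

# Joining on the id columns keeps exactly one row per film
by_id = movie_toy.merge(credits_toy, left_on="id", right_on="movie_id")

print(len(by_title), len(by_id))  # 4 2
```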
keeping important columns
[8]
movies.columns
Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
'original_title', 'overview', 'popularity', 'production_companies',
'production_countries', 'release_date', 'revenue', 'runtime',
'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
'vote_count', 'movie_id', 'cast', 'crew'],
dtype='object')
[9]
movies = movies[['id','title','overview','keywords','genres','cast','crew']]
[10]
movies
checking null values
[11]
movies.isnull().sum()
id 0
title 0
overview 3
keywords 0
genres 0
cast 0
crew 0
dtype: int64
[12]
movies.dropna(inplace=True)
working with overview
[13]
movies.iloc[0].overview
'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'
[14]
# the overview is a string, so convert it into a list of words
movies['overview'] = movies['overview'].apply(lambda x:x.split())
[15]
movies['overview'][0]
['In',
'the',
'22nd',
'century,',
'a',
'paraplegic',
'Marine',
'is',
'dispatched',
'to',
'the',
'moon',
'Pandora',
'on',
'a',
'unique',
'mission,',
'but',
'becomes',
'torn',
'between',
'following',
'orders',
'and',
'protecting',
'an',
'alien',
'civilization.']
working with keywords
[16]
movies['keywords'][0]
'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'
[17]
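import ast  # used to parse the string into a Python list of dicts
def keywords(obj):
    l = []
    # each entry is a dict such as {"id": 1463, "name": "culture clash"}; keep only the name
    for i in ast.literal_eval(obj):
        l.append(i['name'])
    return l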
[18]
movies['keywords'] = movies['keywords'].apply(keywords)
[19]
movies['keywords']
0 [culture clash, future, space war, space colon...
1 [ocean, drug abuse, exotic island, east india ...
2 [spy, based on novel, secret agent, sequel, mi...
3 [dc comics, crime fighter, terrorist, secret i...
4 [based on novel, mars, medallion, space travel...
...
4804 [united states–mexico barrier, legs, arms, pap...
4805 []
4806 [date, love at first sight, narration, investi...
4807 []
4808 [obsession, camcorder, crush, dream girl]
Name: keywords, Length: 4806, dtype: object
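The same parsing pattern is repeated for the other columns below, so it can be factored into one helper. This is a sketch only: extract_names and its limit parameter are hypothetical names, not part of the notebook.

```python
import ast

def extract_names(raw, limit=None):
    """Parse a string holding a list of dicts and return each dict's 'name' value."""
    items = ast.literal_eval(raw)          # safely evaluates a Python literal
    names = [item["name"] for item in items]
    return names if limit is None else names[:limit]

# Shortened sample in the same shape as the keywords column
raw = '[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}]'

print(extract_names(raw))           # ['culture clash', 'future', 'space war']
print(extract_names(raw, limit=2))  # ['culture clash', 'future']
print(extract_names("[]"))          # []
```

Unlike eval, ast.literal_eval only accepts literals such as lists, dicts, strings, and numbers, so a malformed or malicious cell raises an error instead of running code.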
working with genres
[20]
movies['genres'][0]
'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'
[21]
import ast  # used to parse the string into a Python list of dicts
def genres(obj):
    l = []
    # keep only the genre name from each dict
    for i in ast.literal_eval(obj):
        l.append(i['name'])
    return l
[22]
movies['genres'] = movies['genres'].apply(genres)
[23]
movies['genres']
0 [Action, Adventure, Fantasy, Science Fiction]
1 [Adventure, Fantasy, Action]
2 [Action, Adventure, Crime]
3 [Action, Crime, Drama, Thriller]
4 [Action, Adventure, Science Fiction]
...
4804 [Action, Crime, Thriller]
4805 [Comedy, Romance]
4806 [Comedy, Drama, Romance, TV Movie]
4807 []
4808 [Documentary]
Name: genres, Length: 4806, dtype: object
working with cast
[24]
movies['cast'][0]
'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang",
import ast  # used to parse the string into a Python list of dicts
def cast(obj):
    l = []
    # interested in the top three cast members only
    count = 0
    for i in ast.literal_eval(obj):
        if count != 3:
            l.append(i['name'])
            count += 1
        else:
            break
    return l
movies['cast'] = movies['cast'].apply(cast)
[26]
movies['cast']
0 [Sam Worthington, Zoe Saldana, Sigourney Weaver]
1 [Johnny Depp, Orlando Bloom, Keira Knightley]
2 [Daniel Craig, Christoph Waltz, Léa Seydoux]
3 [Christian Bale, Michael Caine, Gary Oldman]
4 [Taylor Kitsch, Lynn Collins, Samantha Morton]
...
4804 [Carlos Gallardo, Jaime de Hoyos, Peter Marqua...
4805 [Edward Burns, Kerry Bishé, Marsha Dietlein]
4806 [Eric Mabius, Kristin Booth, Crystal Lowe]
4807 [Daniel Henney, Eliza Coupe, Bill Paxton]
4808 [Drew Barrymore, Brian Herzlinger, Corey Feldman]
Name: cast, Length: 4806, dtype: object
working with crew
[27]
movies['crew'][0]
'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job":
import ast  # used to parse the string into a Python list of dicts
def crew(obj):
    l = []
    # interested in the director only; stop at the first match
    for i in ast.literal_eval(obj):
        if i['job'] == 'Director':
            l.append(i['name'])
            break
    return l
[29]
movies['crew'] = movies['crew'].apply(crew)
[30]
movies['crew']
0 [James Cameron]
1 [Gore Verbinski]
2 [Sam Mendes]
3 [Christopher Nolan]
4 [Andrew Stanton]
...
4804 [Robert Rodriguez]
4805 [Edward Burns]
4806 [Scott Smith]
4807 [Daniel Hsia]
4808 [Brian Herzlinger]
Name: crew, Length: 4806, dtype: object
concatenating overview, cast, crew and keywords into one column
[31]
movies['tags'] = movies['overview'] + movies['cast'] + movies['crew'] + movies['keywords']
[32]
movies = movies[['id','title','tags']]
[33]
movies
removing spaces from tags
[34]
movies['tags'] = movies['tags'].apply(lambda x: [i.replace(" ", "") for i in x])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
movies['tags'] = movies['tags'].apply(lambda x: [i.replace(" ", "") for i in x])
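The warning printed above typically appears when a dataframe created by selecting columns from another dataframe is then modified. One common way to avoid it, sketched here on toy data with made-up values, is to take an explicit copy when selecting the columns.

```python
import pandas as pd

# Toy frame with the same column names as the notebook (values are illustrative)
df = pd.DataFrame({
    "id": [1, 2],
    "title": ["A", "B"],
    "tags": [["space war", "3d"], ["love affair"]],
})

# .copy() makes the selection an independent dataframe, so assigning to it is unambiguous
subset = df[["id", "title", "tags"]].copy()
subset["tags"] = subset["tags"].apply(lambda x: [i.replace(" ", "") for i in x])

print(subset["tags"][0])  # ['spacewar', '3d']
print(df["tags"][0])      # ['space war', '3d']
```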
[35]
movies['tags'][0]
['In',
'the',
'22nd',
'century,',
'a',
'paraplegic',
'Marine',
'is',
'antiwar',
'powerrelations',
'mindandsoul',
'3d']
applying stemming
[36]
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
[37]
movies['tags']
0 [In, the, 22nd, century,, a, paraplegic, Marin...
1 [Captain, Barbossa,, long, believed, to, be, d...
2 [A, cryptic, message, from, Bond’s, past, send...
3 [Following, the, death, of, District, Attorney...
4 [John, Carter, is, a, war-weary,, former, mili...
...
4804 [El, Mariachi, just, wants, to, play, his, gui...
4805 [A, newlywed, couple's, honeymoon, is, upended...
4806 ["Signed,, Sealed,, Delivered", introduces, a,...
4807 [When, ambitious, New, York, attorney, Sam, is...
4808 [Ever, since, the, second, grade, when, he, fi...
Name: tags, Length: 4806, dtype: object
[38]
def stemming(text):
    l = []
    # stem every token, then join the tokens back into a single string
    for i in text:
        l.append(ps.stem(i))
    return " ".join(l)
[39]
movies['tags'] = movies['tags'].apply(stemming)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
[40]
movies['tags'][10]
'superman return to discov hi 5-year absenc ha allow lex luthor to walk free, and that those he wa closest too felt abandon and have move on. luthor plot hi ultim reveng that could see million kill and chang the face of the planet forever, as well as rid himself of the man of steel. brandonrouth kevinspacey katebosworth bryansing savingtheworld dccomic invulner sequel superhero basedoncomicbook kryptonit superpow superhumanstrength lexluthor'
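To see what the stemmer does to individual words, here is a short sketch using the same PorterStemmer class. It assumes NLTK is installed; two of the sample words come from the overview that produced the output above.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Related word forms are reduced to a shared stem, which need not be a real word
words = ["running", "absence", "discover"]
stems = [ps.stem(w) for w in words]

print(stems)  # ['run', 'absenc', 'discov']
```

Stemming lets "discover", "discovers", and "discovered" count as the same feature when the tags are vectorized. Punctuation that is still attached to a token, such as the comma in "free,", is not removed by the stemmer.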
Vectorization code
[41]
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=500, stop_words='english')
[42]
vectors = vectorizer.fit_transform(movies['tags']).toarray()
[43]
vectors
array([[1, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 1, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=int64)
['3d',
'accident',
'act',
'action',
'adventur',
'affair',
'aftercreditssting',
'age',
'agent',
'alcohol',
'alien',
'alway',
'young',
'zombi']
Calculating similarities
[45]
from sklearn.metrics.pairwise import cosine_similarity
[46]
similarity = cosine_similarity(vectors)
[47]
movies[movies['title']=="Avatar"]
[48]
sorted(list(enumerate(similarity[0])),reverse=True,key=lambda x:x[1])
[(0, 0.9999999999999998),
(507, 0.50709255283711),
(151, 0.46188021535170054),
(1216, 0.44262666813799045),
(539, 0.38729833462074165),
(1321, 0.36514837167011066),
(1920, 0.3544587784792833),
(305, 0.3464101615137754),
(2786, 0.3450327796711771),
(1774, 0.3442651863295481),
...]
movies.iloc[100].title
'The Curious Case of Benjamin Button'
[50]
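def Recommendation_system(movie):
    # positional row of the selected movie (row labels have gaps after dropna, so use the position)
    movie_index = movies['title'].tolist().index(movie)
    # rank all movies by similarity to the selected one, highest first
    distances = sorted(list(enumerate(similarity[movie_index])), reverse=True, key=lambda x: x[1])
    # skip the first entry (the movie itself) and print the next 19 titles
    for i in distances[1:20]:
        print(movies.iloc[i[0]].title)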
[51]
Recommendation_system('Avatar')
Independence Day
Beowulf
Aliens vs Predator: Requiem
Titan A.E.
The Thing
Lifeforce
Treasure Planet
Attack the Block
Martian Child
Edge of Tomorrow
Predators
Meet Dave
Capricorn One
Tears of the Sun
Under the Skin
Independence Daysaster
Lockout
Aliens in the Attic
E.T. the Extra-Terrestrial
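For use in a web app it is more convenient to return the titles than to print them. The sketch below shows one way to do that on a toy similarity matrix; top_similar and the three-movie data are hypothetical and only illustrate the lookup logic.

```python
import numpy as np

def top_similar(titles, similarity, title, n=5):
    """Return up to n titles most similar to `title`, excluding the title itself."""
    idx = titles.index(title)                             # position of the query movie
    order = np.argsort(-similarity[idx], kind="stable")   # positions by descending score
    return [titles[i] for i in order if i != idx][:n]

# Toy data: a symmetric 3 x 3 similarity matrix for three made-up movies
titles = ["A", "B", "C"]
similarity = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])

print(top_similar(titles, similarity, "A", n=2))  # ['C', 'B']
print(top_similar(titles, similarity, "B", n=1))  # ['C']
```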
Pickling files
[53]
import pickle
pickle.dump(movies, open('model.pkl','wb'))
pickle.dump(similarity, open('similarity.pkl','wb'))
Flask Code:
from flask import Flask, request, render_template
import requests
import pandas as pd
import pickle

app = Flask(__name__)

# loading models
# movies = pd.read_csv('movies.csv')
movies = pickle.load(open('model.pkl', 'rb'))
similarity = pickle.load(open('similarity.pkl', 'rb'))

# function to fetch movie poster
def fetch_poster(movie_id):
    url = "https://api.themoviedb.org/3/movie/{}?api_key=390e76286265f7638bb6b19d86474639&language=en-US".format(movie_id)
    data = requests.get(url)
    data = data.json()
    full_path = "https://image.tmdb.org/t/p/w500/" + data['poster_path']
    return full_path

# function to get recommended movies
def get_recommendations(movie):
    # get the positional index of the selected movie
    # (row labels have gaps after dropna, while similarity is indexed by position)
    idx = movies['title'].tolist().index(movie)
    # get pairwise similarity scores of all movies with the selected movie
    sim_scores = list(enumerate(similarity[idx]))
    # sort the movies based on similarity scores in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # get top 20 similar movies (excluding the selected movie)
    sim_scores = sim_scores[1:21]
    # get titles and posters of the recommended movies
    movie_indices = [i[0] for i in sim_scores]
    movie_titles = movies['title'].iloc[movie_indices].tolist()
    movie_posters = [fetch_poster(movies['id'].iloc[i]) for i in movie_indices]
    return movie_titles, movie_posters

# home page
@app.route('/')
def home():
    movie_list = movies['title'].tolist()
    return render_template('index.html', movie_list=movie_list)

# recommendation page
@app.route('/recommend', methods=['POST'])
def recommend():
    movie_title = request.form['selected_movie']
    recommended_movie_titles, recommended_movie_posters = get_recommendations(movie_title)
    return render_template('index.html',
                           movie_list=movies['title'].tolist(),
                           recommended_movie_titles=recommended_movie_titles,
                           recommended_movie_posters=recommended_movie_posters)

if __name__ == '__main__':
    app.run(debug=True)
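The poster lookup above embeds the API key in the source code. A common alternative is to read the key from an environment variable. The sketch below only builds the request URL; the function name build_movie_url and the variable name TMDB_API_KEY are assumptions made for this example.

```python
import os

def build_movie_url(movie_id, api_key=None):
    """Build the TMDB movie-details URL, taking the key from the environment if not given."""
    # TMDB_API_KEY is a hypothetical variable name; set it in your shell before starting the app
    key = api_key or os.environ.get("TMDB_API_KEY", "")
    if not key:
        raise RuntimeError("No TMDB API key provided")
    return "https://api.themoviedb.org/3/movie/{}?api_key={}&language=en-US".format(movie_id, key)

# Passing the key explicitly keeps the example independent of the environment
print(build_movie_url(19995, api_key="demo"))
# https://api.themoviedb.org/3/movie/19995?api_key=demo&language=en-US
```

Keeping the key out of the code means it does not end up in version control or in a published article.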
HTML Code:
<!doctype html>
<html>
<head>
    <title>Movie Recommender</title>
    <link rel="stylesheet" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
</head>
<body style="background:#D9F799">
    <div style="color:white; margin-top:15px; border-radius:20px;" class="container my-3 mt-3 bg-dark">
        <h1 class="text-center">Movie Recommendation System</h1>
        <form action="/recommend" method="POST">
            <div class="form-group">
                <label for="movie-select">Select a movie:</label>
                <select class="form-control" id="movie-select" name="selected_movie">
                    {% for movie in movie_list %}
                    <option value="{{ movie }}">{{ movie }}</option>
                    {% endfor %}
                </select>
            </div>
            <button type="submit" class="btn btn-primary">Get Recommendations</button>
        </form>
        {% if recommended_movie_titles %}
        <h2>Recommended Movies:</h2>
        <div class="row">
            {% for i in range(recommended_movie_titles|length) %}
            <div class="col-md-3">
                <div class="card mb-3">
                    <img src="{{ recommended_movie_posters[i] }}" class="card-img-top" alt="...">
                    <div class="card-body">
                        <h5 class="card-title">{{ recommended_movie_titles[i] }}</h5>
                    </div>
                </div>
            </div>
            {% endfor %}
        </div>
        {% endif %}
    </div>
    <!-- Optional JavaScript -->
    <!-- jQuery first, then Popper.js, then Bootstrap JS -->
    <script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
    <script src="https://cdn.jsdelivr.net/npm/@popperjs/[email protected]/dist/umd/popper.min.js"></script>
    <script src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" crossorigin="anonymous"></script>
</body>
</html>