Text Classification Model Development & Deployment

These days, all of us are keen to learn how to build a machine learning model to solve a business problem. In reality, though, most machine learning models that get developed never make it to a production environment. Understanding the math behind machine learning algorithms is undoubtedly important for data scientists, but our work becomes far more effective when we can also think from an end-user perspective. Showcasing your project to others is not only encouraging; it also helps you understand your own project better from an end-user standpoint.

Hence the last stage of the machine learning lifecycle, model deployment, plays an important role when it comes to showcasing your work to the world and getting practical insights from it to make better business decisions. Deploying a trained ML model to production is a somewhat tricky process that involves working with different cross-functional teams (data scientists, IT, software developers, and business professionals).

In this article, we will cover an end-to-end NLP text classification project, from model development to deployment with a Flask API, and finally publish it on the Heroku platform.

The app is deployed at https://imdbreviewramen.herokuapp.com/. There is still a lot of room for improvement on the UI side, but as of now, I think it's cool.

GitHub link: https://github.com/Ramen16july/IMDBreview

Problem statement: To classify movie review sentiment.

Data: The dataset is taken from the Kaggle competition website.

Step 1: The best practice is to set up a new environment for your project using the Anaconda Prompt.

> conda create -n IMDBreview python=3.9    ### create your environment
> conda activate IMDBreview                ### activate your environment
> cd "IMDBreview"                          ### set up your working directory
> jupyter notebook                         ### launch Jupyter Notebook from the directory
        

Step 2: Let's start with basic EDA and the data-cleaning process.

import pandas as pd
import numpy as np

imdb_data = pd.read_csv("IMDB Dataset.csv")

print(imdb_data.info())

###
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)


          

Step 3: Check the "review" feature and start preprocessing/cleaning the data accordingly.

imdb_data.review[1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well done.'        

Below are the functions used to remove noisy text from the "review" feature.

import re
from bs4 import BeautifulSoup

### Removing the HTML tags

def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

### Removing text between square brackets

def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

### Removing the noisy text

def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text

imdb_data['review'] = imdb_data['review'].apply(denoise_text)

imdb_data.review[1]

'A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well done.'        

Now let's remove the special characters.

def remove_special_characters(text, remove_digits=False):
    ### keep only letters, digits, and whitespace; drop digits too if requested
    ### (the original pattern used [^a-zA-z0-9\s], where A-z also matches some
    ### punctuation; fixed here to A-Z)
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)

imdb_data['review'] = imdb_data['review'].apply(remove_special_characters)

imdb_data.review[1]

'A wonderful little production The filming technique is very unassuming very oldtimeBBC fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece The actors are extremely well chosen Michael Sheen not only has got all the polari but he has all the voices down pat too You can truly see the seamless editing guided by the references to Williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece A masterful production about one of the great masters of comedy and his life The realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears It plays on our knowledge and our senses particularly with the scenes concerning Orton and Halliwell and the sets particularly of their flat with Halliwells murals decorating every surface are terribly well done'        

Stemming: Stemming is the process of reducing inflected or derived words to their base or root form. Let's apply it to our "review" feature.

import nltk

def simple_stemmer(text):
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

imdb_data['review'] = imdb_data['review'].apply(simple_stemmer)

imdb_data.review[1]

'a wonder littl product the film techniqu is veri unassum veri oldtimebbc fashion and give a comfort and sometim discomfort sens of realism to the entir piec the actor are extrem well chosen michael sheen not onli ha got all the polari but he ha all the voic down pat too you can truli see the seamless edit guid by the refer to william diari entri not onli is it well worth the watch but it is a terrificli written and perform piec a master product about one of the great master of comedi and hi life the realism realli come home with the littl thing the fantasi of the guard which rather than use the tradit dream techniqu remain solid then disappear it play on our knowledg and our sens particularli with the scene concern orton and halliwel and the set particularli of their flat with halliwel mural decor everi surfac are terribl well done'        

Stopwords: Next, we remove the stopwords. For this step, we first extract the English stopwords from nltk.corpus.stopwords in the NLTK library.

### run nltk.download('stopwords') first if the list isn't already available
stopword_list = nltk.corpus.stopwords.words('english')

### let's check a few stopwords

stopword_list[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]        

Let's now remove the stopwords stored in "stopword_list" from our review data using the function below.

from nltk.tokenize.toktok import ToktokTokenizer

### the original snippet uses a tokenizer object without defining it;
### NLTK's ToktokTokenizer is used here as a reasonable choice
tokenizer = ToktokTokenizer()

def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

imdb_data['review'] = imdb_data['review'].apply(remove_stopwords)

imdb_data.review[1]

'wonder littl product film techniqu veri unassum veri oldtimebbc fashion give comfort sometim discomfort sens realism entir piec actor extrem well chosen michael sheen onli ha got polari ha voic pat truli see seamless edit guid refer william diari entri onli well worth watch terrificli written perform piec master product one great master comedi hi life realism realli come home littl thing fantasi guard rather use tradit dream techniqu remain solid disappear play knowledg sens particularli scene concern orton halliwel set particularli flat halliwel mural decor everi surfac terribl well done'        

Step 4: Now that preprocessing is done, let's move ahead and build our bag-of-words model. We first import CountVectorizer from sklearn, which converts a collection of text documents into a matrix of token counts, and LabelBinarizer to encode the sentiment labels.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer

cv = CountVectorizer()

### split the data into train and test

train_reviews = imdb_data.review[:40000]
test_reviews = imdb_data.review[40000:]

lb = LabelBinarizer()
sentiment_data = lb.fit_transform(imdb_data['sentiment'])

train_sentiments = sentiment_data[:40000]
test_sentiments = sentiment_data[40000:]

### fit the vectorizer on the train data, then convert both sets into count matrices
### (note: use transform, not fit_transform, on the test set so both share one vocabulary)

cv_train_reviews = cv.fit_transform(train_reviews)
cv_test_reviews = cv.transform(test_reviews)
        

Step 5: Training our model with the logistic regression algorithm.

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2', max_iter=500, C=1, random_state=42)

### train on the vectorized reviews; ravel() flattens the (n, 1) label array
lr.fit(cv_train_reviews, train_sentiments.ravel())

Step 6: Once we have trained our model, we compute predictions for the test data and check the model's performance.

from sklearn.metrics import accuracy_score, confusion_matrix

predict = lr.predict(cv_test_reviews)

accuracy_score(test_sentiments, predict)

## 0.8861

confusion_matrix(test_sentiments, predict)

## [[4413  580]
##  [ 559 4448]]

Step 7: The accuracy of the model is fairly good. Now we need to save the model so that we can deploy it. We will save "cv" and "lr" as ".pickle" files using the commands below, and then we are ready to publish our model.

import pickle

with open('Countvector.pickle', 'wb') as f:
    pickle.dump(cv, f)

with open('logistic.pickle', 'wb') as f:
    pickle.dump(lr, f)
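As a quick sanity check (not part of the original steps), you can reload the pickles and score a raw review. The preprocess helper below is a hypothetical convenience that simply chains the cleaning functions defined earlier in this article; the sample sentence is made up.

with open('Countvector.pickle', 'rb') as f:
    cv_loaded = pickle.load(f)
with open('logistic.pickle', 'rb') as f:
    lr_loaded = pickle.load(f)

def preprocess(text):
    ### chain the cleaning steps defined earlier (hypothetical helper)
    text = denoise_text(text)
    text = remove_special_characters(text)
    text = simple_stemmer(text)
    return remove_stopwords(text)

sample = "What a wonderful, heartwarming little film!"
vec = cv_loaded.transform([preprocess(sample)])
print(lr_loaded.predict(vec))   ### 1 = positive, 0 = negative (LabelBinarizer encodes classes alphabetically)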

The full list of libraries used in this project can be found in the GitHub repo linked above.

Creating the front-end application and deploying it

Step 1: Create a requirements.txt file that captures all the libraries we have installed and used in this "IMDBreview" environment, using the command below. This text file will be required for the deployment of our model.

pip freeze > requirements.txt

### once you execute this command, it will save a "requirements.txt" file in the working directory.

Step 2: Go to your Spyder environment to create the front-end app for the user. For that, we need to create an "app.py" file and, inside it, import Flask and some other packages. (Please make sure you have installed these libraries with pip install.)

### note: jsonify is part of Flask, not a standalone package
from flask import Flask, render_template, url_for, request, jsonify
import requests

Step 3: In the app.py file: a) We load the pickle files (cv and lr) that we saved in the earlier steps. b) We call the functions that were used for preprocessing the text and apply them to the new input text entered by the user to clean it. c) We convert the cleaned text into a count-vector matrix using "cv" and use the trained model "lr" for sentiment prediction. d) We create the HTML templates and link them in the Flask app. A minimal sketch of such an app.py follows.
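Below is a minimal sketch of what such an app.py could look like. This is not the code from the repo: the template name "index.html", the form field "review", the route names, and the "preprocessing" module import are illustrative assumptions.

from flask import Flask, render_template, request
import pickle

### hypothetical module holding the cleaning helpers from the notebook
from preprocessing import denoise_text, remove_special_characters, simple_stemmer, remove_stopwords

app = Flask(__name__)

### load the vectorizer and model saved in Step 7 above
with open('Countvector.pickle', 'rb') as f:
    cv = pickle.load(f)
with open('logistic.pickle', 'rb') as f:
    lr = pickle.load(f)

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    review = request.form['review']   ### form field name is an assumption
    ### reuse the cleaning helpers defined during model development
    cleaned = remove_stopwords(simple_stemmer(
        remove_special_characters(denoise_text(review))))
    vec = cv.transform([cleaned])
    label = 'positive' if lr.predict(vec)[0] == 1 else 'negative'
    return render_template('index.html', prediction_text=label)

if __name__ == '__main__':
    app.run(debug=True)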

Step 4: Run the "python app.py" command to confirm that your Flask app starts successfully and works at your localhost URL.

Step 5: Once step 4 is done, upload all the files saved in your working directory to your GitHub repo. Then create an app on the Heroku platform, connect your GitHub repo, and deploy your model. Once it is successfully deployed, you will get a URL for the app that is accessible from anywhere in the world.
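One detail the article doesn't show: Heroku decides how to start the app from a Procfile placed in the repo root. A common setup for a Flask app uses gunicorn; treat the line below as an assumption about this repo (it presumes the Flask instance in app.py is named "app"):

web: gunicorn app:app

gunicorn then also needs to be listed in requirements.txt so Heroku installs it.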


Note: The viewpoints expressed in this article are the author's own and do not reflect those of his employer.





