
ML Model Building with Amazon Alexa Review Data set using Advanced NLP spaCy, NLTK, Regression, GridCV & Xgboost

Nilesh Gode

Manager - (Data Science & Analytics) Mettler Toledo - IMSG

Published: May 25, 2020

The data set consists of nearly 3,000 Amazon customer reviews (input text) together with star ratings, review dates, product variants, and feedback for various Amazon Alexa products such as the Alexa Echo, Echo Dots, and Alexa Firesticks. It is used here to learn how to train a machine learning model for sentiment analysis.

We use this data to analyze Amazon's Alexa products, discover insights from consumer reviews, and support machine learning models. We can train models for sentiment analysis and determine how many customer reviews are positive and how many are negative.
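As a minimal sketch of the positive/negative count, the snippet below reads tab-separated reviews and tallies the feedback label using only the standard library. The column names and the convention that feedback 1 means positive and 0 means negative are assumptions, and the three sample rows are made up for illustration; they are not taken from the real data set.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; NOT taken from the real data set.
# Column names are assumptions based on the fields described above.
SAMPLE_TSV = (
    "rating\tdate\tvariation\tverified_reviews\tfeedback\n"
    "5\t2018-07-31\tCharcoal Fabric\tLove my Echo!\t1\n"
    "1\t2018-07-30\tBlack Dot\tStopped working after a week.\t0\n"
    "4\t2018-07-30\tWhite Dot\tGood sound for the price.\t1\n"
)


def count_feedback(tsv_file):
    """Count reviews per feedback label (assumed: 1 = positive, 0 = negative)."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(int(row["feedback"]) for row in reader)


counts = count_feedback(io.StringIO(SAMPLE_TSV))
print("positive:", counts[1], "negative:", counts[0])
```

With the real file, the same function could be called on an opened 'amazon_alexa.tsv' instead of the in-memory sample, provided the column names match.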

The work was carried out in the following steps:

Import the data set, a raw file named 'amazon_alexa.tsv', from my GitHub repo.
Perform exploratory data analysis with Python commands and methods such as head, info, describe, shape, isnull, and groupby.
Then visualize the data with matplotlib and wordcloud to better understand the business problem and its solution.
Then use the spaCy NLP library for text analysis: load the English model 'en_core_web_md', build lists of punctuation marks and stop words, and use the tokenizer, tagger, parser, NER, and word vectors.
Create a tokenizer function that lemmatizes each token and converts it to lowercase, and a custom transformer based on spaCy that cleans the text.
Create training and testing data for a logistic regression classifier, and build a pipeline consisting of the cleaner, the vectorizer, and the classifier.
Logistic regression accuracy: 0.9417989417989417
Process the same data set with NLTK, creating a corpus and a bag of words for further processing.
Apply a min-max scaler and build a RandomForestClassifier, which gives a training accuracy of 0.9941043083900227 and a testing accuracy of 0.9428571428571428.
Then apply k-fold cross-validation, which gives an accuracy of 0.9365158371040725.
Then apply grid search with stratified folds to find the best hyperparameters. Best parameter combination: {'bootstrap': True, 'max_depth': 100, 'min_samples_split': 8, 'n_estimators': 300}. Accuracy score for the test set: 0.9428571428571428.
Then build a model with the XGBoost classifier, which gives a training accuracy of 0.970521541950113 and a testing accuracy of 0.9407407407407408.
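The modelling steps above can be sketched roughly as follows. This is not the original notebook: to keep it dependent on scikit-learn only, a small regex tokenizer stands in for the spaCy tokenizer (no lemmatization is done), NLTK and XGBoost are left out, and the reviews are toy sentences built from templates. The helper names, the stop-word list, and the grid values are all hypothetical and chosen for speed.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy reviews built from templates; illustrative only, not the real data.
positive = ["love this", "works great", "excellent sound on", "very happy with"]
negative = ["hate this", "stopped working", "terrible sound on", "very disappointed with"]
products = ["echo", "echo dot", "fire stick"]
texts = [f"{phrase} {product}" for phrase in positive + negative for product in products]
labels = [1] * (len(positive) * len(products)) + [0] * (len(negative) * len(products))

STOPWORDS = {"this", "on", "with", "very"}  # tiny hypothetical stop-word list


def clean_texts(docs):
    """Cleaner step: strip whitespace and lowercase each document."""
    return [doc.strip().lower() for doc in docs]


def simple_tokenizer(doc):
    """Regex stand-in for the spaCy tokenizer; no lemmatization is done here."""
    return [tok for tok in re.findall(r"[a-z']+", doc) if tok not in STOPWORDS]


def make_pipeline(classifier):
    """Cleaner -> bag-of-words vectorizer -> classifier."""
    return Pipeline([
        ("cleaner", FunctionTransformer(clean_texts)),
        ("vectorizer", CountVectorizer(tokenizer=simple_tokenizer, token_pattern=None)),
        ("clf", classifier),
    ])


# Hold-out evaluation with logistic regression.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)
logreg = make_pipeline(LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, logreg.predict(X_test))

# Stratified k-fold cross-validation over the whole toy set.
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
cv_scores = cross_val_score(logreg, texts, labels, cv=folds)

# Small grid search over a random forest (grid values chosen for speed).
grid = GridSearchCV(
    make_pipeline(RandomForestClassifier(random_state=0)),
    param_grid={"clf__n_estimators": [10, 30], "clf__min_samples_split": [2, 4]},
    cv=folds,
)
grid.fit(texts, labels)

print("hold-out accuracy:", test_accuracy)
print("mean CV accuracy:", cv_scores.mean())
print("best parameters:", grid.best_params_)
```

Keeping the cleaner, vectorizer, and classifier in one Pipeline means each cross-validation fold refits the vocabulary on its own training portion, so no information from the held-out fold leaks into the features.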

Please go through each step, compare the accuracy scores and predictions of each model, and try to interpret the results on your own.
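One simple way to start interpreting the scores is to compare training and testing accuracy for the models where both are reported. The sketch below uses only the figures listed in the steps above; the helper name is hypothetical, and reading a larger gap as a sign of stronger overfitting is a common rule of thumb rather than a firm conclusion.

```python
# (training accuracy, testing accuracy) as reported in the steps above.
reported = {
    "Random Forest": (0.9941043083900227, 0.9428571428571428),
    "XGBoost": (0.970521541950113, 0.9407407407407408),
}


def train_test_gap(train_acc, test_acc):
    """Return training accuracy minus testing accuracy."""
    return train_acc - test_acc


gaps = {name: train_test_gap(*scores) for name, scores in reported.items()}

for name, (train_acc, test_acc) in reported.items():
    print(f"{name:15s} train={train_acc:.4f} test={test_acc:.4f} gap={gaps[name]:.4f}")
```

The logistic regression, k-fold, and grid search steps report a single figure each, so they are left out of this comparison.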

Thank you! Happy learning.

Shivani Goyal

Data Scientist

3 years ago

Please give the link to the code. Thanks


