NLU classification and AutoML
Photo by Tengyart on Unsplash


Starting from an excellent intro by Charlie Flanagan and his Machine Learning for Business class, here is some experimentation with my own models, Google AutoML, and AWS SageMaker.

Problem

Product reviews from a women’s clothing ecommerce store are provided, and each review is labeled with whether the reviewer would recommend the product or not. Some additional features accompany each review. The goal is to create a model that predicts the likelihood of a product recommendation given the customer review and these additional features.


My Attempt

Using Colab and following Charlie’s example.

Step 1 — Data and Model

The raw data looks something like this table. Note the many empty columns, so it needs some cleanup. See this notebook for the various steps taken to clean up the data (example raw data).

The resulting file, ecommerce-reviews-full-set.csv, is what the rest of my code uses. Empty cells are a big problem; since many titles are missing, I merged the title into the review text itself so there is only one text feature:

# Merge the title into the review text; fillna('') keeps missing titles from becoming the literal string 'nan'
data['Title_Review'] = data['Title'].fillna('').astype(str).str.cat(data['Review_Text'].fillna('').astype(str), sep=' ').str.strip()

The next steps are in this second notebook. The text is cleaned up by removing non-alphabetic characters, converting to lowercase, removing stopwords, and lemmatizing; finally a count vector is created.
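A minimal sketch of that pipeline, assuming NLTK for the stopwords and lemmatization and scikit-learn for the count vector (the max_features cap is illustrative):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # keep letters only, lowercase, drop stopwords, lemmatize
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(lemmatizer.lemmatize(w) for w in words if w not in stop_words)

data['Cleaned'] = data['Title_Review'].apply(clean_text)
vectorizer = CountVectorizer(max_features=5000)  # illustrative vocabulary cap
X = vectorizer.fit_transform(data['Cleaned'])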

Step 2 — Results

All of the following results use the validation data. A classification attempt with logistic regression gives AUC 0.93 and weighted F1 0.90.

Logistic Regression classification

A classification attempt with Naive Bayes gives AUC 0.94 and weighted F1 0.90.

Naive Bayes classification

A classification attempt with XGBoost gives AUC 0.91 and weighted F1 0.84.

XGBoost classification
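A sketch of how these three models could be trained and scored on the count vectors above; the label column name Recommended is an assumption, and the exact numbers will differ from the notebook's:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score, f1_score
from xgboost import XGBClassifier

y = data['Recommended']  # assumed label column
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

models = [('Logistic Regression', LogisticRegression(max_iter=1000)),
          ('Naive Bayes', MultinomialNB()),
          ('XGBoost', XGBClassifier(eval_metric='logloss'))]
for name, model in models:
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    f1 = f1_score(y_val, model.predict(X_val), average='weighted')
    print(f'{name}: AUC {auc:.2f}, weighted F1 {f1:.2f}')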

Finally, on the test data set, Naive Bayes looks to be the best.

Test set AUC

Maybe AUC 0.93 is too good to be true? 


AWS SageMaker

Sign in to your AWS console and find the SageMaker service. The first step is to launch SageMaker Studio, which is basically Jupyter. I created a test user with the suggested role. It takes a few minutes to get set up the first time.

SageMaker Studio setup

Step 1 — Data and Setup

Go to S3 and upload your training dataset. Then create a new Autopilot experiment in SageMaker Studio, fill in the form, and start it.
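The S3 upload itself can also be scripted; a minimal sketch with boto3, where the bucket and key names are hypothetical:

import boto3

s3 = boto3.client('s3')
# hypothetical bucket and key names
s3.upload_file('ecommerce-reviews-full-set.csv',
               'my-sagemaker-bucket', 'reviews/train.csv')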

Create an autopilot experiment

It takes some time, so check back later. Once the “Pre-processing” stage is done, you will see links to two notebooks. The data exploration notebook shows some helpful details about the data. The candidate generation notebook is very interesting and actually shows what Autopilot is doing. I don’t understand all of it, but XGBoost appears to be one of the models it tries, with some variations, along with other models I am not familiar with. It also describes the hyperparameter tuning it will do.

Auto created Notebooks
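For reference, the whole experiment can also be launched programmatically; a sketch with the SageMaker Python SDK, assuming the hypothetical S3 path above and a label column named Recommended:

import sagemaker
from sagemaker.automl.automl import AutoML

automl = AutoML(
    role=sagemaker.get_execution_role(),  # Studio execution role
    target_attribute_name='Recommended',  # assumed label column
    max_candidates=50,                    # limit the number of tuning trials
    sagemaker_session=sagemaker.Session(),
)
# hypothetical S3 path from the upload step
automl.fit(inputs='s3://my-sagemaker-bucket/reviews/train.csv', wait=False)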

Step 2 — Results

Once complete, you can see the various “trials” it ran and the F1 score progressively improving.

Hyperparameter tuning list and F1 score

The F1 of 0.84 matches the similar F1 from my own XGBoost model, so that is good! Looking at the model detail, the winning model is XGBoost. Complete details are available, so with some effort I could probably print out an AUC as well. A new thing I learned about is SHAP values and relative feature importance.

Feature importance in terms of SHAP values
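SHAP importances like these can also be computed locally; a sketch assuming the fitted XGBoost model and count vectors from my earlier attempt (a sample is densified since the plotting step works on dense arrays):

import shap

explainer = shap.TreeExplainer(model)  # the fitted XGBClassifier from earlier
X_sample = X_val[:500].toarray()       # densify a sample for plotting
shap_values = explainer.shap_values(X_sample)
# mean |SHAP| per feature gives the global importance ranking
shap.summary_plot(shap_values, X_sample,
                  feature_names=vectorizer.get_feature_names_out())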

Step 3 — Cleanup

Frankly, SageMaker pricing is very confusing. There is a 2-month free tier, and I think my usage falls into that, but it’s very hard to see what is running and what the potential cost is. My training apparently happened on ml.m5.4xlarge, so I do owe AWS some $$ (I’m not sure where it is set to use the free-tier instance type).

  • From within SageMaker Studio, make sure to stop all kernels and instances in the “Running Terminals and Kernels” tab (see the sketch after this list for checking programmatically)
  • Any models you deploy show up under deployments; make sure you delete them
  • Look in the SageMaker dashboard; nothing should be running
  • I’m not clear whether leaving SageMaker Studio itself is OK, so delete that too if you don’t plan to use it again
  • Clean up your S3 bucket (SageMaker actually adds a lot of files here)
  • Additionally, SageMaker exhausts your free-tier KMS and S3 call quotas, so expect some charges there as well
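To double-check that nothing is still running or deployed, a quick boto3 sketch (endpoints and Studio apps only; not an exhaustive audit):

import boto3

sm = boto3.client('sagemaker')
# deployed endpoints bill until deleted
for ep in sm.list_endpoints()['Endpoints']:
    print('endpoint:', ep['EndpointName'], ep['EndpointStatus'])
# Studio apps (kernels, terminals) run on billed instances
for app in sm.list_apps()['Apps']:
    print('app:', app['AppName'], app['Status'])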
Recent SageMaker activity

I think I have stopped everything, but I will check back in a few days to see whether any new costs show up.


Google AutoML

Sign in to GCP and create a new project (this is paid-only, so you need a billing account linked to your project). Under the hamburger menu, in the Artificial Intelligence group, pick “Natural Language.” As a one-time step it asks you to “enable API,” but after that you will always see the dashboard.

GCP Natural Language dashboard

For my case, the first one applies — text and document classification.

Step 1 — Data and Setup

Google is not as forgiving as pandas, and the CSV needs some cleanup before uploading; there are some pointers in the GCP docs, and my cleanup steps are in this notebook. Google is actually very picky about what you can upload: your CSV file can contain ONLY 3 columns. The documentation is not clear and the errors are unhelpful, so I could not reuse the file that worked with AWS. Additionally, Google AutoML classification is based only on the text, with no other features.
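A minimal sketch of preparing such a file, assuming the merged text column from earlier and a label column named Recommended (both assumptions):

import pandas as pd

df = pd.read_csv('ecommerce-reviews-full-set.csv')
automl_df = pd.DataFrame({
    # collapse whitespace/newlines so each row stays on one CSV line
    'text': df['Title_Review'].str.replace(r'\s+', ' ', regex=True).str.strip(),
    # assumed label column; AutoML wants string labels
    'label': df['Recommended'].map({1: 'recommended', 0: 'not_recommended'}),
})
automl_df.to_csv('automl-reviews.csv', index=False, header=False)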

Create a dataset; “single-label classification” is what applies to this problem. Then provide the file and pick a bucket to save it in. Import the data next and start training.

Create dataset
Start training
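These steps can also be done with the google-cloud-automl client library; a sketch with hypothetical project and bucket names (MULTICLASS is the single-label classification type):

from google.cloud import automl

client = automl.AutoMlClient()
parent = 'projects/my-gcp-project/locations/us-central1'  # hypothetical project

dataset = automl.Dataset(
    display_name='ecommerce_reviews',
    text_classification_dataset_metadata=automl.TextClassificationDatasetMetadata(
        classification_type=automl.ClassificationType.MULTICLASS  # single-label
    ),
)
created = client.create_dataset(parent=parent, dataset=dataset).result()

# import the prepared CSV from a GCS bucket
input_config = automl.InputConfig(
    gcs_source=automl.GcsSource(input_uris=['gs://my-bucket/automl-reviews.csv'])
)
client.import_data(name=created.name, input_config=input_config).result()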

Step 2 — Results

The results page is very simple; there is no notebook or any other detail, so there is no insight into what was done.

Training results

Overall, Google’s flow is very simple, but there is not much to learn from it.


This topic is unrelated to my everyday work, but if it interests you, reach out to me; I would appreciate any feedback. If you would like to work on other problems, you will generally find open roles as well! Please refer to LinkedIn.

Originally published on Medium
