NLU classification and AutoML
Photo by Tengyart on Unsplash


Starting from an excellent intro by Charlie Flanagan and his Machine Learning for Business class, here is some experimentation with my own models, Google AutoML, and AWS SageMaker.

Problem

Product reviews from a women’s clothing ecommerce store are provided, and each review is labeled with whether the reviewer would recommend the product or not. Some additional features accompany each review. The goal is to create a model that predicts the likelihood of a product recommendation given the customer review and these additional features.


My Attempt

Using Colab and following Charlie’s example.

Step 1 — Data and Model

The raw data looks something like this table. Note the many empty columns, so it needs some cleanup. See this notebook for the various steps taken to clean up the data (example raw data).

The resulting file, ecommerce-reviews-full-set.csv, is what the rest of my code uses. Empty cells are a big problem; since many titles are missing, I merged the title into the review text itself so there is only one text feature:

# Merge the title into the review text; fillna('') keeps missing titles from becoming the literal string 'nan'
data['Title_Review'] = data['Title'].fillna('').astype(str).str.cat(data['Review_Text'].fillna('').astype(str), sep=' ').str.strip()

The next steps are in this second notebook. The text is cleaned up by removing non-alphabetic characters, converting to lowercase, removing stopwords, and lemmatizing; finally a count vector is created.
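A minimal sketch of that pipeline, assuming NLTK for the stopwords and lemmatization and scikit-learn for the count vector (the max_features cap is illustrative):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # keep letters only, lowercase, drop stopwords, lemmatize
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(lemmatizer.lemmatize(w) for w in words if w not in stop_words)

data['Cleaned'] = data['Title_Review'].apply(clean_text)
vectorizer = CountVectorizer(max_features=5000)  # illustrative vocabulary cap
X = vectorizer.fit_transform(data['Cleaned'])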

Step 2 — Results

All of the following results use the validation data. A classification attempt with logistic regression gives AUC 0.93 and weighted F1 0.90.

Logistic Regression classification

A classification attempt with Naive Bayes gives AUC 0.94 and weighted F1 0.90.

Naive Bayes classification

A classification attempt with XGBoost gives AUC 0.91 and weighted F1 0.84.

XGBoost classification
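A sketch of how these three models could be trained and scored on the count vectors above; the label column name Recommended is an assumption, and the exact numbers will differ from the notebook's:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score, f1_score
from xgboost import XGBClassifier

y = data['Recommended']  # assumed label column
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

models = [('Logistic Regression', LogisticRegression(max_iter=1000)),
          ('Naive Bayes', MultinomialNB()),
          ('XGBoost', XGBClassifier(eval_metric='logloss'))]
for name, model in models:
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    f1 = f1_score(y_val, model.predict(X_val), average='weighted')
    print(f'{name}: AUC {auc:.2f}, weighted F1 {f1:.2f}')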

Finally, on the test data set, Naive Bayes looks to be the best.

Test set AUC

Maybe AUC 0.93 is too good to be true? 


AWS SageMaker

Sign in to your AWS console and find the SageMaker service. The first step is to launch SageMaker Studio, which is basically Jupyter. I created a test user with the suggested role. It takes a few minutes to get set up the first time.

SageMaker Studio setup

Step 1 — Data and Setup

Go to S3 and upload your training dataset. Then create a new Autopilot experiment in SageMaker Studio, fill in the form, and start it.
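The S3 upload itself can also be scripted; a minimal sketch with boto3, where the bucket and key names are hypothetical:

import boto3

s3 = boto3.client('s3')
# hypothetical bucket and key names
s3.upload_file('ecommerce-reviews-full-set.csv',
               'my-sagemaker-bucket', 'reviews/train.csv')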

Create an autopilot experiment

It takes some time, so check back later. Once the “Pre-processing” stage is done, you will see links to two notebooks. The data exploration notebook shows some helpful details about the data. The candidate generation notebook is very interesting and actually shows what Autopilot is doing. I don’t understand all of it, but XGBoost appears to be one of the models it tries, with some variations, along with other models I am not familiar with. It also describes the hyperparameter tuning it will do.

Auto created Notebooks
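For reference, the whole experiment can also be launched programmatically; a sketch with the SageMaker Python SDK, assuming the hypothetical S3 path above and a label column named Recommended:

import sagemaker
from sagemaker.automl.automl import AutoML

automl = AutoML(
    role=sagemaker.get_execution_role(),  # Studio execution role
    target_attribute_name='Recommended',  # assumed label column
    max_candidates=50,                    # limit the number of tuning trials
    sagemaker_session=sagemaker.Session(),
)
# hypothetical S3 path from the upload step
automl.fit(inputs='s3://my-sagemaker-bucket/reviews/train.csv', wait=False)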

Step 2 — Results

Once complete, you can see the various “trials” it ran and the F1 score progressively improving.

Hyperparameter tuning list and F1 score

The F1 of 0.84 matches the similar F1 from my own XGBoost model, so that is good! Looking at the model detail, the winning model is XGBoost. Complete details are available, so with some effort I could probably print out an AUC as well. A new thing I learned about is SHAP values and relative feature importance.

Feature importance in terms of SHAP values
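SHAP importances like these can also be computed locally; a sketch assuming the fitted XGBoost model and count vectors from my earlier attempt (a sample is densified since the plotting step works on dense arrays):

import shap

explainer = shap.TreeExplainer(model)  # the fitted XGBClassifier from earlier
X_sample = X_val[:500].toarray()       # densify a sample for plotting
shap_values = explainer.shap_values(X_sample)
# mean |SHAP| per feature gives the global importance ranking
shap.summary_plot(shap_values, X_sample,
                  feature_names=vectorizer.get_feature_names_out())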

Step 3 — Cleanup

Frankly, SageMaker pricing is very confusing. There is a 2-month free tier, and I think my usage falls into that, but it’s very hard to see what is running and what the potential cost is. My training apparently happened on ml.m5.4xlarge, so I do owe AWS some $$ (I’m not sure where it is set to use the free-tier instance type).

  • From within SageMaker Studio, make sure to stop all kernels and instances in the “Running Terminals and Kernels” tab (see the sketch after this list for checking programmatically)
  • Any models you deploy show up under deployments; make sure you delete them
  • Look in the SageMaker dashboard; nothing should be running
  • I’m not clear whether leaving SageMaker Studio itself is OK, so delete that too if you don’t plan to use it again
  • Clean up your S3 bucket (SageMaker actually adds a lot of files here)
  • Additionally, SageMaker exhausts your free-tier KMS and S3 call quotas, so expect some charges there as well
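To double-check that nothing is still running or deployed, a quick boto3 sketch (endpoints and Studio apps only; not an exhaustive audit):

import boto3

sm = boto3.client('sagemaker')
# deployed endpoints bill until deleted
for ep in sm.list_endpoints()['Endpoints']:
    print('endpoint:', ep['EndpointName'], ep['EndpointStatus'])
# Studio apps (kernels, terminals) run on billed instances
for app in sm.list_apps()['Apps']:
    print('app:', app['AppName'], app['Status'])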
Recent SageMaker activity

I think I have stopped everything, but I will check back in a few days to see whether any new costs show up.


Google AutoML

Sign in to GCP and create a new project (this is paid-only, so you need a billing account linked to your project). Under the hamburger menu, in the Artificial Intelligence group, pick “Natural Language.” As a one-time step it asks you to “enable API,” but after that you will always see the dashboard.

GCP Natural Language dashboard

For my case, the first one applies — text and document classification.

Step 1 — Data and Setup

Google is not as forgiving as pandas, and the CSV needs some cleanup before uploading; there are some pointers in the GCP docs, and my cleanup steps are in this notebook. Google is actually very picky about what you can upload: your CSV file can contain ONLY 3 columns. The documentation is not clear and the errors are unhelpful, so I could not reuse the file that worked with AWS. Additionally, Google AutoML classification is based only on the text, with no other features.
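A minimal sketch of preparing such a file, assuming the merged text column from earlier and a label column named Recommended (both assumptions):

import pandas as pd

df = pd.read_csv('ecommerce-reviews-full-set.csv')
automl_df = pd.DataFrame({
    # collapse whitespace/newlines so each row stays on one CSV line
    'text': df['Title_Review'].str.replace(r'\s+', ' ', regex=True).str.strip(),
    # assumed label column; AutoML wants string labels
    'label': df['Recommended'].map({1: 'recommended', 0: 'not_recommended'}),
})
automl_df.to_csv('automl-reviews.csv', index=False, header=False)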

Create a dataset; “single-label classification” is what applies to this problem. Then provide the file and pick a bucket to save it in. Import the data next and start training.

Create dataset
Start training
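These steps can also be done with the google-cloud-automl client library; a sketch with hypothetical project and bucket names (MULTICLASS is the single-label classification type):

from google.cloud import automl

client = automl.AutoMlClient()
parent = 'projects/my-gcp-project/locations/us-central1'  # hypothetical project

dataset = automl.Dataset(
    display_name='ecommerce_reviews',
    text_classification_dataset_metadata=automl.TextClassificationDatasetMetadata(
        classification_type=automl.ClassificationType.MULTICLASS  # single-label
    ),
)
created = client.create_dataset(parent=parent, dataset=dataset).result()

# import the prepared CSV from a GCS bucket
input_config = automl.InputConfig(
    gcs_source=automl.GcsSource(input_uris=['gs://my-bucket/automl-reviews.csv'])
)
client.import_data(name=created.name, input_config=input_config).result()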

Step 2 — Results

The results page is very simple; there is no notebook or any other detail, so there is no insight into what was done.

Training results

Overall, Google’s flow is very simple, but there is not much to learn from it.


This topic is unrelated to my everyday work, but if it interests you, reach out to me; I would appreciate any feedback. If you would like to work on other problems, you will generally find open roles as well! Please refer to LinkedIn.

Originally published on Medium
