How to easily classify customer feedback w/ Machine Learning

We collect customer feedback all the time. If your traffic runs into the millions, you end up with tens of thousands of comments to read through, and it is a challenge to efficiently identify their key themes. One way to do this is to bucket the comments into themes of concern and prioritize based on which concern affects the most customers. This can be accomplished easily with text classification techniques using the scikit-learn package in Python. Let me describe how we went about it.

You can employ a semi-supervised learning technique here. That means we train a model on a training dataset containing customer comments/concerns and the theme each comment should be associated with. A small sample of comments, say about 100, needs to be tagged manually. The themes could be bad.site.experience, slow.speeds, incorrect.data, spam, unsubscribe, etc.
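
To make the tagging concrete, here is a minimal sketch of what the hand-tagged training set might look like. The column names 'comment' and 'theme' and the example rows are illustrative assumptions, not the actual data.

import pandas as pd

# Hand-tagged sample: each row is a customer comment plus the theme a human assigned
train_df = pd.DataFrame({
    'comment': [
        "The page takes forever to load on my phone",
        "Please stop emailing me, I want off this list",
        "My account balance is showing the wrong number",
    ],
    'theme': ['slow.speeds', 'unsubscribe', 'incorrect.data'],
})

X_train = train_df['comment']   # raw text that goes into the classifier
y_train = train_df['theme']     # the manually assigned theme labels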

Once the 100 comments are tagged, we train a Machine Learning model using the SVM (Support Vector Machine) technique. It's described in detail on the scikit-learn page here. Once trained, this model can tag any given comment. 100 is a rather small dataset to learn from, so we expand it by bootstrapping to about 250 comments. (We use "bootstrapping" loosely here to mean letting the model fill in the missing labels, sometimes called self-training or pseudo-labeling.) To bootstrap, we run the model on 250 comments, 100 of which we already tagged, and end up with 250 tagged comments. Not all the tags will be correct, but this is a quick way to grow the sample dataset.
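
A rough sketch of that bootstrapping step is below, assuming text_clf is the scikit-learn pipeline shown further down (already fitted on the 100 hand-tagged comments) and untagged_comments is a DataFrame holding the remaining ~150 untagged ones; both names are illustrative assumptions.

import pandas as pd

# Let the fitted model tag the comments we have not labeled by hand
pseudo_labels = text_clf.predict(untagged_comments['comment'])

# Combine the 100 manual tags with the ~150 model-generated tags
expanded_df = pd.concat([
    train_df,
    pd.DataFrame({'comment': untagged_comments['comment'],
                  'theme': pseudo_labels}),
], ignore_index=True)

# The model-generated tags will be noisy, so spot-check and fix obvious
# mistakes before retraining on the expanded ~250-comment set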

The 250-comment dataset then serves as training and test data (an 80:20 split) to improve the model. Once you are satisfied with the accuracy, you can run the fitted model on a much larger dataset, such as six months of comments. We got about 65% accuracy; for 10 categories, a random pick would be 10%, so this is roughly 6.5 times better. A larger training set (in the thousands) would improve accuracy, but we did not need a highly accurate model for this exercise. We picked a couple of categories and dug deeper into specifics. For example, if feature.request is a big theme, we dig into which types of feature requests come up most often. This makes the results more actionable for product teams to pursue and build their pipeline around.
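
A minimal sketch of the 80:20 split and accuracy check, reusing the expanded_df and text_clf names from the sketches above (both are assumed names, not the original code):

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the ~250 labeled comments as a test set
X_train, X_test, y_train, y_test = train_test_split(
    expanded_df['comment'], expanded_df['theme'],
    test_size=0.2, random_state=42)

# Retrain on the 80% split and measure accuracy on the held-out 20%
text_clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, text_clf.predict(X_test)))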

For the full code, check it out here

The core ML code is ...
# SVM - Code for classification: bag-of-words counts -> TF-IDF weighting
# -> linear classifier trained with hinge loss (a linear SVM) via SGD
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
import pandas as pd

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42,
                                           max_iter=5, tol=None))])
# Fit the model on the labeled comments (X_train: comment text, y_train: theme labels)
text_clf.fit(X_train, y_train)

# Load the six-month data on which to predict
data_all = pd.read_csv("/Users/.../six_months_data.csv")
# Predict - Classify: pass the text column to the pipeline
# (the 'comment' column name is an assumption about the CSV layout)
predicted = text_clf.predict(data_all['comment'])
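
Once the predictions are in hand, the bucketing and prioritization described above is just a matter of counting comments per theme. A minimal sketch, reusing the data_all and predicted variables from the snippet above and assuming a 'comment' column:

# Attach the predicted theme to each comment and count comments per theme
data_all['theme'] = predicted
print(data_all['theme'].value_counts())

# Drill into one big theme, e.g. feature.request, to read the specifics
feature_requests = data_all[data_all['theme'] == 'feature.request']
print(feature_requests['comment'].head(20))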