Data Science Case Study 2: NLP Complaint Classification

...or how we were able to work out what our customers were complaining about and do something about it...

Would you believe it, people complain about banks. Some banks are complained about more than others, but that may not be down to the level of service; some of the differences are down to the way that banks log and report complaints. Nevertheless, the financial services regulator cares about the complaints numbers, and so there is an enormous effort within banks to reduce them. In fact, senior executives at the UK bank I worked for were explicitly incentivised on whether the bank met its complaints reduction targets.

But there's a problem. The way we log complaints makes it very hard to understand who is complaining about what at an aggregate level, and so it is hard to monitor what is going on. When you ring up the bank to complain you'll speak to a telephone operator who has to do a number of things while you speak, including: typing out the verbatim of your conversation, dealing with your complaint, and classifying it within a hierarchical, five-level code frame of around 4,000 complaint codes. You can imagine which of these is the lowest priority for the phone operator. As a result there is significant bias in the way that complaints are categorised, and it varies markedly from one operator to the next.

This should be a problem tailor-made for machine learning. We have the text of the complaint, and we would like to apply natural language processing techniques to categorise it. We would then be able to visualise the categorised data for root cause analysis colleagues. That's the theory.

Although it sounds space-age, NLP is in fact pretty straightforward. We used Spark to create a large database of complaint text and metadata and then built one categorisation model using latent Dirichlet allocation (LDA) and another using non-negative matrix factorisation (NMF). It turns out that complaints can be adequately categorised into around 80 topics, compared to the 4,000 topics of the original code frame. So far so good. We played around with different NLP libraries, including NLTK in Python and the Stanford NLP library in Java, and they all worked about the same.
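The two approaches can be sketched with scikit-learn (an assumption on my part — the post doesn't say which libraries backed the Spark pipeline, and the corpus and topic count below are toys rather than real complaint data):

```python
# Minimal sketch of LDA and NMF topic modelling on complaint-like text.
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

complaints = [
    "charged twice for the same card transaction",
    "card transaction charged twice in error",
    "branch closed early and staff were unhelpful",
    "staff at the branch were rude and unhelpful",
]

# LDA works on raw term counts; NMF usually works better on TF-IDF weights.
counts = CountVectorizer(stop_words="english").fit_transform(complaints)
tfidf = TfidfVectorizer(stop_words="english").fit_transform(complaints)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
nmf = NMF(n_components=2, random_state=0).fit(tfidf)

# Each row of transform() is a topic mixture; argmax gives a hard category.
lda_labels = lda.transform(counts).argmax(axis=1)
nmf_labels = nmf.transform(tfidf).argmax(axis=1)
```

At production scale the same fit/transform pattern runs over millions of rows, which is where Spark came in.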

The challenge in these projects is in the cleaning of the data. The complaint text is typed in at the speed of the complainer, so the telephone operator is bound to make mistakes and to use abbreviations and jargon in order to make their job possible. This means that the data is incredibly messy and does not always conform to the rules of grammar or to generally accepted dictionaries. A lot of manipulation is required. One of the nice things about using Scala to clean the data is that monads allow you to preserve the transformations that have been applied to a complaint string. Not only are the transformations applied in a natural, easy-to-understand way, but it is also easy to audit which transformations have been applied in any specific case. This greatly aided our ability to build and debug data transformation pipelines.
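The audit-trail idea translates outside Scala too. Here is a minimal Python sketch of the same pattern — each cleaning step returns the new text plus a log entry, so any complaint can be traced through exactly the transformations applied to it (the step names and the abbreviation table are illustrative, not our actual pipeline):

```python
# Writer-monad-style cleaning: each step returns (new_text, log_entry).
import re

def lowercase(text):
    return text.lower(), "lowercase"

def strip_punctuation(text):
    return re.sub(r"[^\w\s]", "", text), "strip_punctuation"

def expand_abbreviations(text):
    # Tiny illustrative lookup; the real table would be domain-specific.
    table = {"acct": "account", "txn": "transaction"}
    return " ".join(table.get(w, w) for w in text.split()), "expand_abbreviations"

def clean(text, steps):
    """Thread the text through each step, accumulating an audit log."""
    log = []
    for step in steps:
        text, entry = step(text)
        log.append(entry)
    return text, log

cleaned, audit = clean("Acct blocked after TXN!!",
                       [lowercase, strip_punctuation, expand_abbreviations])
# cleaned == "account blocked after transaction"
# audit lists the steps applied, in order
```

The payoff is debuggability: when a cleaned string looks wrong, the log tells you which step to blame.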

Some amusing things came out of this process. There were quite a number of complaints which, when we removed all the swear words, ended up being empty strings! That must have been a fun phone call to receive.

Another challenge in the data cleaning phase was that of anonymity. Complaint text often includes all sorts of personal and financial details, for example people's names, their account numbers or details of their behaviour and daily life. It was very important to remove these at an early stage of the analysis so that customers are protected if the data were to fall into the wrong hands. But named entity recognition is tough, and we spent considerable time ensuring that the data was anonymised appropriately.
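The first, easy layer of that anonymisation can be sketched with regular expressions for structured identifiers; free-text names still need proper named entity recognition on top of this. The patterns below are illustrative (UK-style account numbers and sort codes), not the ones we actually used:

```python
# Mask structured identifiers before any further processing.
import re

PATTERNS = [
    (re.compile(r"\b\d{8}\b"), "<ACCOUNT_NUMBER>"),        # 8-digit account number
    (re.compile(r"\b\d{2}-\d{2}-\d{2}\b"), "<SORT_CODE>"),  # UK sort code
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),
]

def anonymise(text):
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

masked = anonymise(
    "My account 12345678 (sort code 20-00-00) is blocked, email me at jo@example.com"
)
# masked == "My account <ACCOUNT_NUMBER> (sort code <SORT_CODE>) is blocked,
#            email me at <EMAIL>"
```

Running this before the text leaves the secure zone means a leak exposes masked tokens rather than real identifiers.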

What nobody tells you in the NLP manuals is that while it is relatively easy to do a one-off categorisation of a large corpus of text, putting this into production is much harder. For example, LDA is a probabilistic methodology and is inherently unstable; every time you run it you get a slightly different result. In some sense this is what you want: even a human reading a complaint might not be sure which category to put it into. On the other hand, stochastic results are generally not popular with business people or regulators. So you have to do a lot of thinking about how to update the model while ensuring that it remains compatible with previous results. The way we did this with LDA was to use the output of previous runs as a prior for the current run, so that any variance due to the methodology was minor.

Just as problematic are changes in the underlying data. New topics can arise, and you'd want your model to be able to adapt and add categories when they do. The way we thought about this was to always run (n-1), n and (n+1) categories, and then use an information criterion to choose between them. In fact this is how we chose the base number of categories in the first place: by running a sequence of categorisations at 10, 20, 30, etc. categories and looking at the information gain/loss at each step.
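That selection loop can be sketched as follows, using held-out perplexity as one possible scoring rule (the post doesn't specify which information criterion we used, and the corpus here is a toy stand-in):

```python
# Fit LDA at several candidate topic counts and score each on held-out data.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

complaints = [
    "card charged twice", "duplicate card charge", "card fee wrong",
    "branch closed early", "branch queue too long", "rude branch staff",
    "mortgage rate wrong", "mortgage payment missing", "mortgage letter late",
] * 5  # repeated so the model has something to fit

counts = CountVectorizer().fit_transform(complaints)
train, held_out = counts[:36], counts[36:]

scores = {}
for k in (2, 3, 4):  # in production this was (n-1, n, n+1) around the current n
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(train)
    scores[k] = lda.perplexity(held_out)  # lower is better

best_k = min(scores, key=scores.get)
```

Rerunning this on each refresh lets the category count drift up or down by one as the underlying complaint mix changes.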

One way of stabilizing the model was to use an unsupervised approach such as LDA or NMF to label a training set of complaints and then use some supervised approach such as a neural network to build a deterministic mapping between complaint features and a categorisation. This is quite promising because introducing the intermediate step of a training set allows much more control over the quality of the output.
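A minimal sketch of that two-stage approach, with NMF providing the one-off unsupervised labels and logistic regression standing in for the supervised stage (the post mentions a neural network; the classifier choice and corpus here are illustrative):

```python
# Stage 1: unsupervised labelling; Stage 2: a deterministic supervised mapping.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

complaints = [
    "card charged twice", "duplicate card charge", "card payment failed",
    "branch closed early", "branch staff unhelpful", "long branch queue",
] * 5

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(complaints)

# Stage 1: one-off topic-model labelling of the training corpus.
nmf = NMF(n_components=2, random_state=0).fit(X)
labels = nmf.transform(X).argmax(axis=1)

# Stage 2: a deterministic classifier trained on those labels, so production
# scoring gives the same answer every time for the same text.
clf = LogisticRegression(max_iter=1000).fit(X, labels)
pred = clf.predict(vectorizer.transform(["card charged twice"]))
```

Because only stage 1 is stochastic, you can rerun it offline, review the new labels, and retrain the classifier — the model serving answers to the business never changes unannounced.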

But for this to be valuable to the business we needed to get it into production, and that is where we failed, over four attempts and four years. I think this could be for a couple of reasons.

Firstly, organisations tend to want to over-build everything. Or at least they were suspicious of our MVP because it didn't have the kind of front end that a root cause analysis colleague might be able to use easily. Our view was that we wanted to roll out an MVP early so that a few root cause analysis colleagues could just start using it and get used to the ideas and concepts. From there we could learn from them about how to improve the product. What happened, though, was that someone senior in the organisation would get interested in the potential of NLP and decide that it would be better to outsource to a huge software company. And then we got embroiled in huge requirements-gathering exercises and contract negotiation and budgeting and hardware provision and everything else, and it never happened. We presented our NLP models annually for four years, and this happened every single time. If only they had been prepared to accept a rudimentary first attempt, our colleagues would have had the tools they needed to reduce complaints within weeks. But by going for the Enterprise solution, after four years our colleagues were still working with Excel and SQL and visual inspection of text.

The other reason we had trouble is that NLP is too cool. Every IT guy in the bank wanted to get in on the act and believed that because they could download NLTK they could do NLP. This blocked both internal and external progress, because whenever we made headway the IT teams we relied on to put the solution into production intervened and insisted that they be given the mandate to do the job. Unfortunately, each time they underestimated the task and hit problems that they were not capable of solving, causing the project to fail. As a result, senior management became jaded about the potential for real change using NLP, because all they ever saw was failure, delay and excuses. We were unable to implement partly because the solution was seen as too exciting, and so many factions fought to be the ones to build it.

 

Rahul Adwani

Gen AI, NLP and ML Review at Citi | IIT Bombay

3y

Thanks for sharing this insightful post, Mr. Powell. I am initiating my NLP journey officially with a project on similar lines and your post is going to be quite helpful.

Paul Keating

Advanced Planning and Optimisation Specialist

3y

Jody Snowdon could be of interest

Graham Giller

Chief Executive Officer at Giller Investments

3y

Nice article Harry Powell — really captures the issues. Now you have to work out why my new Jag comes up with the message “Seats not available” in the morning! Pretty sure they are (I’m sitting in one). ;-)

Harry (Charis) Sfyrakis

Algomo || Helping B2B Companies Triple Conversion Rates Through real-time and individual-level Personalization

3y

I have some really vivid memories from that project. A) It was mid-2016, way before transformers and GPT. The data was so dirty that a vanilla LSTM trained on it would generate gibberish so close to the original data that it was almost impossible to tell if a text was real or generated by the model. That's an easy way to hack the Turing test! B) But as Harry Powell already mentioned, the biggest pain was the lack of engagement from the business team and their understanding (or lack thereof) of what NLP can do. They expected us to automagically come up with a taxonomy that would fit THEIR needs, without them ever feeding into that process. In other words, they wanted something that works 'out of the box' (works for everyone) but is simultaneously tailored exactly for their operations, which is, of course, a paradox.

Victor Paraschiv

Entrepreneur, engineer, scientist.

3 年

A nice comeback article of a great project. This adds another tenant to the peaceful, green graveyard of disruptive tech and innovation that failed to grow because the politicians didn't gain anything. For those who seek the revival of innovation in their companies a great starting question should be: How are we killing innovation inside our company already? Harry Powell Looking forward to an article about the elephant in the room when it comes to "the digital transformation": firing people smart.
