Traditional Text Analysis Methods In The World of Deep Learning
Shanif Dhanani
Founder of Nobi, an AI shopping assistant that boosts conversion rates by improving discovery and recommendations
Several years ago a new technique for processing text flipped the world of natural language processing (NLP) on its head. That technique, of course, was the process of converting words into vectors, known today as word embeddings.
Word embeddings allow data scientists to represent a word in a multi-dimensional space whose dimensions capture statistical properties of word co-occurrence. They let us concisely represent key properties of text using a small(ish) number of numerical weights.
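As a quick illustration, here's a minimal sketch of an embedding lookup using gensim and one of its downloadable pre-trained GloVe models. The model name and example word are purely illustrative, not necessarily what we use in production.

```python
# Minimal sketch: looking up pre-trained word embeddings with gensim.
# "glove-wiki-gigaword-50" is one example of a downloadable model.
import gensim.downloader as api

# Load 50-dimensional GloVe vectors trained on Wikipedia + Gigaword.
vectors = api.load("glove-wiki-gigaword-50")

# Each word maps to a dense vector whose dimensions reflect co-occurrence
# statistics learned from the training corpus.
print(vectors["coffee"].shape)                 # (50,)
print(vectors.most_similar("coffee", topn=3))  # semantically related words
```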
This is a huge improvement over what had to be done before word embeddings existed. Back then, words were typically one-hot encoded, which caused an explosion in input feature dimensionality. On top of that, it was common to use n-grams, stemming, lemmatization, and other text pre-processing techniques to make it easier to encode words as numbers.
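For contrast, here's a rough sketch of that older style of pipeline using NLTK and scikit-learn. The sample sentences are purely illustrative.

```python
# Sketch of the older pipeline: stemming/lemmatization plus a sparse
# bag-of-words / n-gram encoding, rather than dense embeddings.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

# The lemmatizer needs the WordNet corpus data.
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))        # "run"
print(lemmatizer.lemmatize("geese"))  # "goose"

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Unigrams and bigrams, one column per term: dimensionality grows with
# vocabulary size, unlike a fixed-width embedding.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print(X.shape)  # (2, number of unique uni/bigrams)
```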
At Apteo, we’ve been using embeddings without any of the pre-processing methods mentioned above. However, we were interested in seeing whether any of those methods could improve the accuracy of the network in which these embeddings are used.
So we grabbed our document corpus, transformed all of the words using these traditional methods, then used embeddings for the transformed words and tossed the resulting vectors into our neural network to see if there would be any improvements in our accuracy.
There weren’t.
None of the pre-processing methods helped improve our cross-validated accuracy.
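To give a flavor of the kind of comparison we ran, here's a hedged sketch: average pre-trained embeddings per document, with and without stemming, and compare cross-validated accuracy. The toy corpus, classifier, and embedding model are illustrative stand-ins, not our actual setup.

```python
# Sketch: does stemming before embedding lookup change cross-validated
# accuracy? Toy data and a simple classifier stand in for the real pipeline.
import numpy as np
import gensim.downloader as api
from nltk.stem import PorterStemmer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

vectors = api.load("glove-wiki-gigaword-50")
stemmer = PorterStemmer()

def doc_vector(doc, preprocess=None):
    """Average the embeddings of a document's tokens (skipping OOV words)."""
    tokens = doc.lower().split()
    if preprocess is not None:
        tokens = [preprocess(t) for t in tokens]
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(vectors.vector_size)

# Tiny illustrative corpus with binary labels.
docs = ["great product fast shipping", "terrible quality broke quickly",
        "loved it works perfectly", "awful experience never again"] * 10
labels = np.array([1, 0, 1, 0] * 10)

for name, prep in [("raw", None), ("stemmed", stemmer.stem)]:
    X = np.vstack([doc_vector(d, prep) for d in docs])
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```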
Of course, in retrospect, that’s not so surprising. The whole point of vectorizing a word is to transform it into a high-dimensional space that can represent the nuances of how that word is used in real life. Dropping suffixes or lemmatizing words doesn’t change the underlying context of a word all that much, so it’s unlikely that the resulting changes to the embeddings would have much of an impact.
However, despite the fact that traditional text pre-processing didn’t help us all that much, I still have hopes for other methods that could help us process our text in a more effective manner.
Topic extraction techniques like LDA and LSI, which are statistical methods for loosely labeling documents with topics, may still be able to provide deep networks with useful information about the high-level context of what’s being said in these documents.
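For example, a topic model's per-document topic distribution could be concatenated with embeddings as extra document-level features. Here's a brief sketch of LDA with gensim; the tiny corpus and topic count are illustrative only, and gensim's LsiModel can be swapped in the same way.

```python
# Sketch: fit an LDA topic model and get a topic distribution per document,
# which could then be fed to a downstream network as additional features.
from gensim import corpora, models

docs = [["shipping", "delivery", "late", "package"],
        ["fabric", "fit", "size", "color"],
        ["refund", "return", "customer", "service"]]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)

# Each document gets a sparse distribution over topics.
for doc in bow:
    print(lda.get_document_topics(doc))
```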
We have yet to see whether these methods can improve our performance. We obviously hope they do, but we won’t know until we try them.
It’s great to see how far we’ve come in the world of NLP in a few years. I have no doubt that as AI advances and more researchers start looking into text processing, we’ll get even better.