On the Variety of Encoding Text
Pratik Bhavsar
Encoding text is at the heart of understanding language. If we know how to represent words, sentences and paragraphs with small vectors, all our problems are solved!
Having one generalised model to semantically represent text in a compressed vector is the holy grail of NLP.
What does encoding text mean?
When we encode a variable-length text to a fixed-length vector, we are essentially doing feature engineering. If we use language models or embedding modules, we are also doing dimensionality reduction.
As I discussed in one of my previous posts on transfer learning, there are two approaches to modelling: fine-tuning and feature extraction. In this post, I will discuss the various ways of encoding text (feature extraction) with deep learning, which can then be used for downstream tasks.
Suppose you have the sentence "I love travelling to beaches." and you are working on a classification task. If your vocabulary is huge, it becomes difficult to train the classifier. This is exactly what happens with a TF-IDF vectorizer: you get sparse vectors whose dimensionality equals the vocabulary size, as the sketch below illustrates.
With embeddings like GloVe, you can get a dense vector of 100 dimensions for every word. But a model like GloVe has two problems: it cannot handle OOV (out-of-vocabulary) words, and it cannot deal with polysemy, where a word has many possible meanings depending on context.
So a better approach is to use a model like ELMo or USE (Universal Sentence Encoder) to encode text. ELMo builds its representations from characters, so it can handle unseen words, and both models produce context-aware vectors, which lets them deal with polysemy. The vector we get for every word or sentence encapsulates its meaning.
Once we have a fixed-length vector for a word or sentence, we can do anything with it. That is the feature extraction approach: compute the features once, then reuse them for any downstream task. We can try out different classification models and tune their hyperparameters, or build a semantic search or recommendation engine, as the sketch below shows.
Now, the real question is: what are the different models available for encoding text? Is there one model that works for everything, or is it task-dependent?
Evaluation of sentence embeddings in downstream and linguistic probing tasks
So I was reading this paper, and it opened Pandora's box for me. Ideally, we want an embedding model that gives us the smallest vector yet works great for the task: the smaller the embedding, the less compute is required for training as well as inference.
The embedding sizes vary hugely across models, from 300 to 4,800 dimensions. Intuitively, the bigger the vector, the more information it can contain. But is that actually true? Let's see how the models perform on the tasks.
Classification tasks
The authors tried out a range of classification tasks to understand the performance of these models. For the linguistic probing tasks, an MLP was used with a single hidden layer of 50 neurons, no dropout, the Adam optimizer, and a batch size of 64.
(The exception was the Word Content (WC) probing task, where logistic regression was used, since it gave consistently better results.)
From the results, we can see that the various ELMo embeddings perform really well on classification tasks. USE and InferSent also top some of the tasks. The difference between the best and the second-best model is around 2%. Word2Vec and GloVe do not top any task, as expected, but even they stay within about 3% of the best.
The thing to note here: ELMo has a vector size of 1024, USE has 512, and InferSent has 4096. So if somebody actually has to put a system into production, the first choice will be USE, and then maybe ELMo.
Semantic relatedness tasks
Then they try out the embeddings on semantic relatedness and textual similarity tasks. This time the USE (Transformer) model is a clear winner. If we set aside InferSent, whose embeddings are 8x bigger than USE's, USE is far ahead of the others.
This makes USE a clear choice for semantic search and similar-question tasks, as the sketch below shows.
By the way, when should we use USE (DAN) versus USE (Transformer)? The compute cost of USE (DAN) is O(n) in the length of the text, while it is O(n²) for USE (Transformer). So if you are dealing with long texts, you might want to go with USE (DAN).
Linguistic probing tasks
Next, they show results for Linguistic probing tasks which consist of some esoteric tasks. In this case, ELMo seems to rule the world!
In the BShift (bi-gram shift) task, the goal is to identify whether two consecutive tokens within the sentence have been inverted, as in "This is my Eve Christmas".
The differences are huge between ELMo and non-ELMo models.
Information retrieval tasks
In the caption-image retrieval task, image and language features are evaluated jointly, with the objective of ranking a collection of images with respect to a given caption (image retrieval, text2image) or ranking captions with respect to a given image (caption retrieval, image2text).
InferSent is a clear winner on this one, with ELMo second in line.
We can say that ELMo is a badass model for sure.
Universal Sentence Encoder
As we can see, USE is a great production-level model, so let's discuss it a bit. I will not talk about ELMo, as there are already many articles on it.
There are 2 models available for USE:
- Transformer
- DAN (Deep Averaging Network)
The encoder takes as input a lowercased PTB-tokenized string and outputs a 512-dimensional vector as the sentence embedding. Both encoding models are designed to be as general-purpose as possible. This is accomplished with multi-task learning, whereby a single encoding model feeds multiple downstream tasks.
USE(Transformer)
This variant uses the transformer architecture, which creates a context-aware representation for every token. The sentence embedding is created by element-wise addition of the token embeddings (the paper additionally scales the sum by the square root of the sentence length).
USE(DAN)
This is a controversial modelling choice because it disregards word order. The GloVe embeddings of the words are first averaged together and then passed through a feedforward deep neural network to produce the sentence embedding.
The model relies on the deep network to amplify the small differences in the averaged embedding that might come from a single word like good/bad. It performs well most of the time, but experiments show it fails on double negation such as "not bad", because the model strongly associates 'not' with negative sentiment.
This makes USE (DAN) a great model for classifying news articles into categories, but it might cause problems in sentiment classification, where words like 'not' can flip the meaning.
I hope you liked it so far. You can read the rest of the article on my Medium.