On Variety Of Encoding Text

Encoding text is at the heart of understanding language. If we know how to represent words, sentences and paragraphs with small vectors, all our problems are solved!

Having one generalised model that semantically represents text as a compressed vector is the holy grail of NLP.

What does encoding text mean?

When we encode variable-length text into a fixed-length vector, we are essentially doing feature engineering. If we use language models or embedding modules, we are also doing dimensionality reduction.


As I discussed in one of my previous posts on transfer learning, there are two approaches to modelling: fine-tuning and feature extraction. In this post, I will discuss the various ways of encoding text (feature extraction) with deep learning that can be used for downstream tasks.


Suppose you have the sentence “I love travelling to beaches.” and you are working on a classification project. If your vocabulary is huge, it becomes difficult to train the classifier. This happens when you use a TF-IDF vectorizer, which produces sparse vectors with one dimension per vocabulary word.
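
To see why this happens, here is a minimal scikit-learn sketch (the toy corpus is made up purely for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, just for illustration
corpus = [
    "I love travelling to beaches.",
    "I love reading about travelling.",
    "Beaches are great in summer.",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per document

# One dimension per vocabulary word; with a real corpus this easily
# grows to tens of thousands of mostly-zero columns.
print(X.shape)         # (3, vocabulary_size)
print(X.toarray()[0])  # mostly zeros even for this tiny vocabulary
```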

With embeddings like GloVe you can get a dense vector of 100 dimensions for every word. But the problem with a model like GloVe is that it cannot handle OOV (out-of-vocabulary) words and cannot deal with polysemy, where a word has many possible meanings depending on context.
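
Both limitations are easy to see with gensim’s pre-trained GloVe vectors (the model name and the misspelled OOV word below are my own choices for illustration):

```python
import gensim.downloader as api

# Downloads the pre-trained 100-d GloVe vectors on first use
glove = api.load("glove-wiki-gigaword-100")

print(glove["beach"].shape)    # (100,) dense vector
print("travellling" in glove)  # False: OOV words simply have no vector

# Polysemy: "bank" gets a single vector regardless of whether the
# sentence is about rivers or money.
print(glove.most_similar("bank", topn=3))
```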

So the best approach is to use a model like ELMo or USE (Universal Sentence Encoder) to encode text. These models produce contextual representations and can handle polysemy, and ELMo in particular works at the character level, so it can also handle unseen words. The vector we get for every word/sentence encapsulates its meaning in context.

Once we have a fixed vector for a word/sentence, we can do anything with it. This is what the feature extraction approach is: create the features once and then reuse them for any downstream task. We can try out different classification models and tune their hyperparameters. We can also build a semantic search or recommendation engine.
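
To make the feature extraction workflow concrete, here is a minimal sketch using the USE module from TensorFlow Hub and a scikit-learn classifier (the toy sentences, labels and module version are my own assumptions, not from the article):

```python
import tensorflow_hub as hub
from sklearn.linear_model import LogisticRegression

# Load USE once; every sentence maps to a fixed 512-d vector.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Toy labelled data, purely for illustration
sentences = ["I love travelling to beaches.",
             "The service at this hotel was terrible.",
             "What a wonderful trip that was!",
             "I will never come back here again."]
labels = [1, 0, 1, 0]

features = embed(sentences).numpy()  # shape: (4, 512), created once

# Any downstream model can now be trained on the frozen features.
clf = LogisticRegression().fit(features, labels)
print(clf.predict(embed(["The beach was lovely."]).numpy()))
```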

Now, the real question is: what are the different models available for encoding text? Is there a model that works for everything, or is it task-dependent?

Evaluation of sentence embeddings in downstream and linguistic probing tasks

So I was reading this paper and it opened Pandora’s box for me. Ideally, we want an embedding model that gives us the smallest embedding vector and works great for the task. The smaller the embedding size, the less compute is required for training as well as inference.

As you can see, there is a huge variation in embedding size, ranging from 300 to 4800 dimensions. Intuitively, the bigger the vector, the more information it can contain! But is that actually true? Let’s see how they perform on the tasks.

[Image: embedding sizes of the evaluated models]

Classification tasks

The authors tried out different classification tasks, as shown below, to understand the performance of these models. For the linguistic probing tasks, an MLP was used with a single hidden layer of 50 neurons, no dropout, the Adam optimizer and a batch size of 64.

(For the Word Content (WC) probing task, a logistic regression was used instead, since it provided consistently better results.)

[Image: downstream classification tasks]
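
For a concrete reference point, here is a minimal scikit-learn sketch of this kind of evaluation classifier trained on top of frozen sentence embeddings (the random data is a stand-in; only the classifier hyperparameters come from the paper):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-ins for frozen sentence embeddings and a binary probing label
X = np.random.randn(256, 512)           # e.g. 512-d USE vectors
y = np.random.randint(0, 2, size=256)

# Single hidden layer of 50 neurons, Adam optimizer, batch size 64;
# MLPClassifier has no dropout, matching the "no dropout" setting.
probe = MLPClassifier(hidden_layer_sizes=(50,), solver="adam",
                      batch_size=64, max_iter=200)
probe.fit(X, y)
print(probe.score(X, y))
```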

From the results we can see that the different ELMo embeddings perform really well on classification tasks. USE and InferSent also top some of the tasks. The difference between the best and the second best is around 2%. Word2Vec and GloVe do not top any task, as expected, but their performance is also within about 3% of the best.

The thing to note here: ELMo has a vector size of 1024, USE has 512 and InferSent has 4096. So if somebody actually has to put a system into production, their first choice will be USE and then maybe ELMo.
[Image: classification task results]

Semantic relatedness tasks

Then they try out the embeddings on semantic relatedness and textual similarity tasks. This time the USE (Transformer) model is a clear winner. If we set aside InferSent, whose embedding is 8x bigger than USE’s, USE is far ahead of the others.

This makes USE a clear choice for semantic search and similar-question kind of tasks.

BTW, when should we use USE (DAN) and when USE (Transformer)? The compute cost of USE (DAN) is O(n) in the length of the text, while it is O(n²) for USE (Transformer). So if you are dealing with long texts, you might want to go with USE (DAN).

[Image: semantic relatedness and textual similarity results]
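
As a sketch of what the semantic search use case looks like with USE (the module URL, query and candidate sentences are my own assumptions):

```python
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

query = "How do I reset my password?"
candidates = ["Steps to change your account password",
              "Best beaches to visit in summer",
              "I forgot my login credentials"]

vectors = embed([query] + candidates).numpy()
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit length for cosine similarity

# Rank candidates by cosine similarity to the query
scores = vectors[1:] @ vectors[0]
print(sorted(zip(scores, candidates), reverse=True))
```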

Linguistic probing tasks

Next, they show results for the linguistic probing tasks, which consist of some fairly esoteric tasks. In this case, ELMo seems to rule the world!

In the BShift (bigram shift) task, for example, the goal is to identify whether two consecutive tokens within the sentence have been inverted, as in “This is my Eve Christmas”.

The differences between ELMo and the non-ELMo models are huge.
[Image: linguistic probing task results]

Information retrieval tasks

In the caption-image retrieval task, image and language features are jointly evaluated with the objective of ranking a collection of images with respect to a given caption (image retrieval, text2image) or ranking captions with respect to a given image (caption retrieval, image2text).

InferSent is the clear winner in this one, with ELMo in second place.

We can say that ELMo is a badass model for sure.
[Image: caption-image retrieval results]

Universal Sentence Encoder

As we can see, USE is a great production-level model, so let’s discuss it a bit. I will not talk about ELMo, as there are already many articles on it.

There are two models available for USE:

  • Transformer
  • DAN (Deep Averaging Network)

The encoder takes as input a lowercased, PTB-tokenized string and outputs a 512-dimensional vector as the sentence embedding. Both encoding models are designed to be as general-purpose as possible. This is accomplished by multi-task learning, whereby a single encoding model is used to feed multiple downstream tasks.

USE(Transformer)

This variant uses the transformer architecture, which creates context-aware representations for every token. The sentence embedding is created by element-wise addition of the embeddings of all the tokens.
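
A minimal numpy sketch of that pooling step, assuming a transformer has already produced the context-aware token vectors (the scaling by the square root of the sentence length follows the USE paper and is an addition of mine here):

```python
import numpy as np

# Assume a transformer has already produced one context-aware
# 512-d vector per token for a 6-token sentence.
token_embeddings = np.random.randn(6, 512)

# Element-wise sum over tokens; dividing by sqrt(length) keeps the
# magnitude comparable across sentences of different lengths.
sentence_embedding = token_embeddings.sum(axis=0) / np.sqrt(len(token_embeddings))
print(sentence_embedding.shape)  # (512,)
```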

USE(DAN)

This is a controversial modelling methodology because it disregards the order of the words. The GloVe embeddings of the words are first averaged together and then passed through a feedforward deep neural network to produce the sentence embedding.
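
Here is a minimal Keras-style sketch of the DAN idea, assuming pre-trained word vectors are already available (the layer sizes are illustrative, not the ones used in USE):

```python
import numpy as np
import tensorflow as tf

EMBED_DIM = 300  # e.g. GloVe dimensionality, as in the description above

def dan_encoder():
    # Input: the average of the word vectors of a sentence
    averaged = tf.keras.Input(shape=(EMBED_DIM,))
    x = tf.keras.layers.Dense(512, activation="relu")(averaged)
    x = tf.keras.layers.Dense(512, activation="relu")(x)
    sentence_embedding = tf.keras.layers.Dense(512)(x)  # final 512-d embedding
    return tf.keras.Model(averaged, sentence_embedding)

# Word order is lost before the network ever sees the sentence:
word_vectors = np.random.randn(5, EMBED_DIM).astype("float32")  # stand-in for 5 GloVe lookups
avg = word_vectors.mean(axis=0, keepdims=True)                  # identical for any word order
print(dan_encoder()(avg).shape)                                 # (1, 512)
```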

[Image: DAN architecture]

The model makes use of a deep network to amplify the small differences in embeddings that might come from just one word like good/bad. It performs great most of the time, but experiments show it fails at double negation like “not bad”, because the model strongly associates ‘not’ with negative sentiment. Have a look at the last example.

[Image: example sentiment predictions, including a double-negation case]
This makes USE (DAN) a great model for classifying news articles into categories, but it might cause problems in sentiment classification, where words like ‘not’ can change the meaning.

I hope you liked it so far. You can read the rest of the article on my Medium.
