On the Variety of Encoding Text
Pratik Bhavsar
Encoding text is at the heart of understanding language. If we know how to represent words, sentences and paragraphs with small vectors, all our problems are solved!
Having one generalised model to semantically represent text in a compressed vector is the holy grail of NLP.
What does encoding text mean?
When we encode a variable-length text to a fixed-length vector, we are essentially doing feature engineering. If we use language models or embedding modules, we are also doing dimensionality reduction.
As I discussed in one of my previous posts on transfer learning, there are two approaches to modelling: fine-tuning and feature extraction. In this post, I will discuss the various ways of encoding text (feature extraction) with deep learning, which can then be used for downstream tasks.
Suppose you have the sentence "I love travelling to beaches." and you are working on a classification task. If your vocabulary is huge, it becomes difficult to train the classifier. This is exactly what happens with a TF-IDF vectorizer: you get sparse vectors whose dimensionality equals the vocabulary size, as the sketch below illustrates.
With embeddings like GloVe, you can get a dense vector of 100 dimensions for every word. But a model like GloVe has two problems: it cannot handle OOV (out-of-vocabulary) words, and it cannot deal with polysemy, where a word has many possible meanings depending on context.
So a better approach is to use a model like ELMo or USE (Universal Sentence Encoder) to encode text. ELMo builds its representations from characters, so it can handle unseen words, and both models produce context-aware vectors, which lets them deal with polysemy. The vector we get for every word or sentence encapsulates its meaning.
Once we have a fixed-length vector for a word or sentence, we can do anything with it. That is the feature extraction approach: compute the features once, then reuse them for any downstream task. We can try out different classification models and tune their hyperparameters, or build a semantic search or recommendation engine, as the sketch below shows.
Now, the real question is: what are the different models available for encoding text? Is there one model that works for everything, or is it task-dependent?
Evaluation of sentence embeddings in downstream and linguistic probing tasks
So I was reading this paper, and it opened Pandora's box for me. Ideally, we want an embedding model that gives us the smallest vector yet works great for the task: the smaller the embedding, the less compute is required for training as well as inference.
The embedding sizes vary hugely across models, from 300 to 4,800 dimensions. Intuitively, the bigger the vector, the more information it can contain. But is that actually true? Let's see how the models perform on the tasks.
Classification tasks
The authors tried out a range of classification tasks to understand the performance of these models. For the linguistic probing tasks, an MLP was used with a single hidden layer of 50 neurons, no dropout, the Adam optimizer, and a batch size of 64.
(The exception was the Word Content (WC) probing task, where logistic regression was used, since it gave consistently better results.)
From the results, we can see that the various ELMo embeddings perform really well on classification tasks. USE and InferSent also top some of the tasks. The difference between the best and the second-best model is around 2%. Word2Vec and GloVe do not top any task, as expected, but even they stay within about 3% of the best.
The thing to note here: ELMo has a vector size of 1024, USE has 512, and InferSent has 4096. So if somebody actually has to put a system into production, the first choice will be USE, and then maybe ELMo.
Semantic relatedness tasks
Then they try out the embeddings on semantic relatedness and textual similarity tasks. This time the USE (Transformer) model is a clear winner. If we set aside InferSent, whose embeddings are 8x bigger than USE's, USE is far ahead of the others.
This makes USE a clear choice for semantic search and similar-question tasks, as the sketch below shows.
By the way, when should we use USE (DAN) versus USE (Transformer)? The compute cost of USE (DAN) is O(n) in the length of the text, while it is O(n²) for USE (Transformer). So if you are dealing with long texts, you might want to go with USE (DAN).
Linguistic probing tasks
Next, they show results for Linguistic probing tasks which consist of some esoteric tasks. In this case, ELMo seems to rule the world!
In the BShift (bi-gram shift) task, the goal is to identify whether two consecutive tokens within the sentence have been inverted, as in "This is my Eve Christmas".
The differences are huge between ELMo and non-ELMo models.
Information retrieval tasks
In the caption-image retrieval task, image and language features are evaluated jointly, with the objective of ranking a collection of images with respect to a given caption (image retrieval, text2image) or ranking captions with respect to a given image (caption retrieval, image2text).
InferSent is a clear winner on this one, with ELMo second in line.
We can say that ELMo is a badass model for sure.
Universal Sentence Encoder
As we can see, USE is a great production-level model, so let's discuss it a bit. I will not talk about ELMo, as there are already many articles on it.
There are 2 models available for USE:
- Transformer
- DAN (Deep Averaging Network)
The encoder takes as input a lowercased PTB-tokenized string and outputs a 512-dimensional vector as the sentence embedding. Both encoding models are designed to be as general-purpose as possible. This is accomplished with multi-task learning, whereby a single encoding model feeds multiple downstream tasks.
USE(Transformer)
This variant uses the transformer architecture, which creates a context-aware representation for every token. The sentence embedding is created by element-wise addition of the token embeddings (the paper additionally scales the sum by the square root of the sentence length).
USE(DAN)
This is a controversial modelling choice because it disregards word order. The GloVe embeddings of the words are first averaged together and then passed through a feedforward deep neural network to produce the sentence embedding.
The model relies on the deep network to amplify the small differences in the averaged embedding that might come from a single word like good/bad. It performs well most of the time, but experiments show it fails on double negation such as "not bad", because the model strongly associates 'not' with negative sentiment.
This makes USE (DAN) a great model for classifying news articles into categories, but it might cause problems in sentiment classification, where words like 'not' can flip the meaning.
I hope you liked it so far. You can read the rest of the article on my Medium.