What Is the Google BERT Search Algorithm Update?

Google BERT stands for Bidirectional Encoder Representations from Transformers. It is an update to the core search algorithm aimed at improving Google's language-understanding capabilities.

BERT is one of the biggest updates Google has made since RankBrain in 2015, and it has proven successful at understanding the searcher's intent behind a query.

How Does Google BERT Work?

Let’s understand what BERT can do with the help of an example query:

[Image: the example search query]

Here, the intent of the searcher is to find out whether any family member of a patient can pick up a prescription on their behalf.

Here is what Google returned before BERT:

[Image: search results before BERT]

As you can see, Google returned an unsatisfactory search result because it was unable to process the meaning of the word “someone” in the query.

Here is what Google returned after BERT systems were integrated into the core algorithm:

[Image: search results after BERT]

This search result accurately answers the searcher's question. Google has now understood the meaning of the word “someone” in the correct context after processing the entire query.

Before BERT, Google processed a query largely word by word and gave little weight to words like “someone.” With BERT, every word in the query is turned into a token and considered in relation to all the others, so the meaning of the whole sentence informs the result. This produces much more accurate search results.
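
As a rough illustration of what “every word becomes a token” means, here is a sketch using the open-source bert-base-uncased tokenizer from the Hugging Face transformers library. This is a stand-in for illustration only; Google's production search pipeline is not public.

```python
# Illustrative only: the open-source bert-base-uncased tokenizer, not Google's
# production search stack. Requires `pip install transformers`.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
query = "math practice book for adults"   # example query from this article
tokens = tokenizer.tokenize(query)

print(tokens)                                   # every word, including small function words, becomes a token
print(tokenizer.convert_tokens_to_ids(tokens))  # the IDs the model actually attends over
```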

In another example, the query is “math practice book for adults” where the searcher is looking to buy math books for adults:

[Image: the example query “math practice book for adults”]

Before BERT, Google returned results suggesting books for grades 6-8, which misses the searcher's intent. It did so because the page description contained the phrase “young adult,” which is irrelevant in this context:

[Image: search results before BERT, suggesting books for grades 6-8]

After BERT, Google is able to correctly discern the difference between “young adult” and “adult” and excludes results with out-of-context matches:

[Image: search results after BERT]

What Is Google NLP and How Does It Work?

NLP stands for Natural Language Processing, a subset of artificial intelligence that combines machine learning and linguistics (the study of language). It is what makes natural-sounding communication between computers and humans possible.

NLP is the technology behind such popular language applications as:

  • Google Translate
  • Microsoft Word
  • Grammarly
  • OK Google, Siri, Cortana and Alexa

NLP is the field that underpins Google BERT. The Google Cloud Natural Language API offers the following five services.

1) Syntax Analysis

Google breaks down a query into individual words and extracts linguistic information for each of them.

For example, the query “who is the father of science?” is broken down via syntax analysis into individual parts such as the following (a rough code sketch follows the list):

  • who → pronoun
  • is → verb (tense = present, number = singular)
  • the → determiner
  • father → noun (number = singular)
  • of → preposition
  • science → noun
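
As a rough illustration (not Google's internal ranking code), the same kind of analysis can be requested from the Cloud Natural Language API with its Python client. This is a minimal sketch: it assumes google-cloud-language is installed and credentials are configured, and call signatures can differ slightly between client-library versions.

```python
# Minimal sketch of syntax analysis via the Cloud Natural Language API.
# Assumes `pip install google-cloud-language` and configured credentials.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="who is the father of science?",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_syntax(request={"document": document})

for token in response.tokens:
    pos = token.part_of_speech
    print(
        token.text.content,
        language_v1.PartOfSpeech.Tag(pos.tag).name,       # PRON, VERB, DET, NOUN, ...
        language_v1.PartOfSpeech.Number(pos.number).name, # SINGULAR, PLURAL, ...
    )
```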

2) Sentiment Analysis

Google’s sentiment analysis system assigns an emotional score to the query. Here are some examples of sentiment analysis:

[Image: example queries with illustrative sentiment scores]

Please note: the values in the example above are illustrative, chosen only to convey the concept of sentiment analysis. The actual algorithm Google uses is different and confidential.
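
For illustration, document-level sentiment can be requested from the same Cloud Natural Language API. A minimal, hedged sketch follows; the input text is an example, and the scores say nothing about Google's actual ranking behaviour.

```python
# Sketch: document-level sentiment with the Cloud Natural Language API.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="I absolutely love this math practice book.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment

print(sentiment.score)      # roughly -1.0 (negative) to +1.0 (positive)
print(sentiment.magnitude)  # overall strength of emotion in the text
```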

3) Entity Analysis

In this process, Google picks out “entities” from a query, generally using Wikipedia as a reference database for them.

For example, in the query “what is the age of selena gomez?”, Google detects “Selena Gomez” as the entity and returns a direct answer to the searcher from Wikipedia.
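
Here is a minimal sketch of entity analysis with the Cloud Natural Language API Python client; when the service recognizes a well-known entity, its metadata often includes a Wikipedia URL. Same installation and credential assumptions as the earlier sketches.

```python
# Sketch: entity analysis with the Cloud Natural Language API.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="what is the age of selena gomez?",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_entities(request={"document": document})

for entity in response.entities:
    print(
        entity.name,                                     # e.g. "selena gomez"
        language_v1.Entity.Type(entity.type_).name,      # e.g. PERSON
        dict(entity.metadata).get("wikipedia_url", ""),  # Wikipedia link when available
    )
```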

4) Entity Sentiment Analysis

Google goes a step further and identifies the sentiment in the overall document containing the entities. While processing web pages, Google assigns a sentiment score to each of the entities depending on how they are used in the document. The scoring is similar to the scoring done during sentiment analysis.

[Image: example entity sentiment scores]
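
A short sketch of the corresponding API call, which returns a sentiment score and magnitude per detected entity (same assumptions as the sketches above; the example sentence is invented for illustration):

```python
# Sketch: entity-level sentiment with the Cloud Natural Language API.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="The camera on this phone is excellent, but the battery life is disappointing.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_entity_sentiment(request={"document": document})

for entity in response.entities:
    # Each entity carries its own sentiment score (-1.0 to +1.0) and magnitude.
    print(entity.name, entity.sentiment.score, entity.sentiment.magnitude)
```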

5) Text Classification

Imagine a large database of categories and subcategories like DMOZ (a multilingual open-content directory of World Wide Web links). While it was active, DMOZ classified websites into categories, subcategories, and further subcategories beneath those.

This is what text classification does. Google matches the closest subcategory of web pages depending on the query entered by the user.

For example, for a query like “design of a butterfly,” Google might identify different subcategories like “modern art,” “digital art,” “artistic design,” “illustration,” “architecture,” etc., and then choose the closest matching subcategory.
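
A sketch of content classification with the same client. Note that classify_text needs a reasonable amount of text, so a very short query like “design of a butterfly” may be rejected; that is why a longer, invented description is used here.

```python
# Sketch: content classification with the Cloud Natural Language API.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content=(
        "This gallery features butterfly designs across modern art, digital art "
        "and illustration, with notes on how each artist approached the wing patterns."
    ),
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.classify_text(request={"document": document})

for category in response.categories:
    # Category paths look like "/Arts & Entertainment/Visual Art & Design".
    print(category.name, round(category.confidence, 2))
```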

In the words of Google:

“One of the biggest challenges in natural language processing (NLP) is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labeled training examples.”

To work around this shortage of labeled training data, BERT is first pre-trained on enormous amounts of unlabeled text and then fine-tuned for specific tasks. Separately, Google offers AutoML Natural Language, which lets users train customized natural-language models on their own data.

Please note: BERT is not only a way of interpreting search queries. It also helps Google understand the context of the text on web pages, so that the most relevant documents can be presented to the searcher.

BERT (Bidirectional Encoder Representations from Transformers) is a paper published by researchers at Google AI Language. It caused a stir in the machine learning community by presenting state-of-the-art results on a wide variety of NLP tasks, including question answering (SQuAD v1.1), natural language inference (MNLI), and others.

BERT’s key technical innovation is applying the bidirectional training of Transformer, a popular attention model, to language modelling. This is in contrast to previous efforts which looked at a text sequence either from left to right or combined left-to-right and right-to-left training. 

The paper’s results show that a language model which is bidirectionally trained can have a deeper sense of language context and flow than single-direction language models. In the paper, the researchers detail a novel technique named Masked LM (MLM) which allows bidirectional training in models in which it was previously impossible.

Background

In the field of computer vision, researchers have repeatedly shown the value of transfer learning — pre-training a neural network model on a known task, for instance ImageNet, and then performing fine-tuning — using the trained neural network as the basis of a new purpose-specific model. In recent years, researchers have been showing that a similar technique can be useful in many natural language tasks.

A different approach, which is also popular in NLP tasks and exemplified in the recent ELMo paper, is feature-based training. In this approach, a pre-trained neural network produces word embeddings which are then used as features in NLP models.

BERT: State of the art language model for NLP

BERT makes use of Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of Transformer are described in a paper by Google.

As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).

At a high level, the Transformer encoder works as follows. The input is a sequence of tokens, which are first embedded into vectors and then processed in the neural network. The output is a sequence of vectors of size H, in which each vector corresponds to the input token with the same index.
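
To make “a sequence of vectors of size H” concrete, here is a sketch using the open-source bert-base-uncased checkpoint from Hugging Face transformers, where H = 768:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The child came home from school.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One H-dimensional vector per input token (H = 768 for bert-base-uncased).
print(outputs.last_hidden_state.shape)  # torch.Size([1, number_of_tokens, 768])
```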

When training language models, there is a challenge of defining a prediction goal. Many models predict the next word in a sequence (e.g. “The child came home from ___”), a directional approach which inherently limits context learning. To overcome this challenge, BERT uses two training strategies:

Masked LM (MLM)

Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence. In technical terms, the prediction of the output words requires:

  1. Adding a classification layer on top of the encoder output.
  2. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
  3. Calculating the probability of each word in the vocabulary with softmax.

The BERT loss function takes into consideration only the prediction of the masked values and ignores the prediction of the non-masked words. As a consequence, the model converges more slowly than directional models, a characteristic which is offset by its increased context awareness (see Takeaway #3).

Note: In practice, the BERT implementation is slightly more elaborate and doesn’t replace all of the 15% masked words. See Appendix A for additional information.
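
As a simplified illustration of steps 1-3 above, here is a sketch using the pre-trained masked-LM head from the open-source Hugging Face transformers library (an illustrative stand-in, not the code Google runs in Search). The head projects each encoder output vector to the vocabulary dimension, and softmax turns the scores at the masked position into word probabilities.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")  # encoder + masked-LM head

inputs = tokenizer("The child came home from [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # shape: [batch, tokens, vocabulary size]

mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, mask_pos].softmax(dim=-1)                   # step 3: softmax over the vocabulary
top = probs.topk(5)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))  # most likely fillers for [MASK]
```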

Next Sentence Prediction (NSP)

In the BERT training process, the model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.

To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model:

  1. A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
  2. A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.
  3. A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are presented in the Transformer paper.
[Figure: BERT input representation — token, sentence, and position embeddings with [CLS] and [SEP] markers]

Source: BERT [Devlin et al., 2018], with modifications

To predict if the second sentence is indeed connected to the first, the following steps are performed:

  1. The entire input sequence goes through the Transformer model.
  2. The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
  3. The probability of IsNextSequence is calculated with softmax.

When training the BERT model, Masked LM and Next Sentence Prediction are trained together, with the goal of minimizing the combined loss function of the two strategies.
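
Here is a minimal sketch of the same mechanism using the open-source checkpoint and its NSP head in Hugging Face transformers (again, an illustrative stand-in). The tokenizer inserts the [CLS] and [SEP] tokens and the Sentence A / Sentence B segment IDs described above; in this implementation, logit index 0 corresponds to “sentence B follows sentence A.”

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "The man went to the store."
sentence_b = "He bought a gallon of milk."
encoding = tokenizer(sentence_a, sentence_b, return_tensors="pt")

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist()))  # [CLS] ... [SEP] ... [SEP]
print(encoding["token_type_ids"])  # 0 for Sentence A tokens, 1 for Sentence B tokens

with torch.no_grad():
    logits = model(**encoding).logits          # shape [1, 2]
probs = logits.softmax(dim=-1)
print(probs[0, 0].item())  # probability that sentence B follows sentence A (in this checkpoint)
```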

How to use BERT (Fine-tuning)

Using BERT for a specific task is relatively straightforward. BERT can be used for a wide variety of language tasks while adding only a small layer on top of the core model:

  1. Classification tasks such as sentiment analysis are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the [CLS] token.
  2. In Question Answering tasks (e.g. SQuAD v1.1), the software receives a question regarding a text sequence and is required to mark the answer in the sequence. Using BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer.
  3. In Named Entity Recognition (NER), the software receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc) that appear in the text. Using BERT, a NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.

In the fine-tuning training, most hyper-parameters stay the same as in BERT training, and the paper gives specific guidance (Section 3.5) on the hyper-parameters that require tuning. The BERT team has used this technique to achieve state-of-the-art results on a wide variety of challenging natural language tasks, detailed in Section 4 of the paper.
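
As a minimal sketch of option 1 (classification fine-tuning), using Hugging Face transformers rather than the original TensorFlow release: a classification layer is placed on top of the [CLS] output and the whole network is trained end-to-end on labeled examples. The texts, labels, and learning rate below are illustrative.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Pre-trained encoder plus a freshly initialized classification head on the [CLS] output.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # learning rate in the range the paper recommends

texts = ["great movie, loved it", "terrible plot and wooden acting"]
labels = torch.tensor([1, 0])                 # 1 = positive, 0 = negative
batch = tokenizer(texts, padding=True, return_tensors="pt")

outputs = model(**batch, labels=labels)       # cross-entropy loss is computed internally
outputs.loss.backward()                       # gradients for one fine-tuning step
optimizer.step()
```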

Takeaways

  1. Model size matters, even at huge scale. BERT_large, with 345 million parameters, is the largest model of its kind. It is demonstrably superior on small-scale tasks to BERT_base, which uses the same architecture with “only” 110 million parameters.
  2. With enough training data, more training steps == higher accuracy. For instance, on the MNLI task, the BERT_base accuracy improves by 1.0% when trained on 1M steps (128,000 words batch size) compared to 500K steps with the same batch size.
  3. BERT’s bidirectional approach (MLM) converges slower than left-to-right approaches (because only 15% of words are predicted in each batch) but bidirectional training still outperforms left-to-right training after a small number of pre-training steps.
[Chart: accuracy over pre-training steps for the bidirectional (MLM) and left-to-right approaches, referenced in Takeaway #3]


Compute considerations (training and applying)

[Image: summary of compute requirements for pre-training and fine-tuning BERT]


Conclusion

BERT is undoubtedly a breakthrough in the use of Machine Learning for Natural Language Processing. The fact that it’s approachable and allows fast fine-tuning will likely allow a wide range of practical applications in the future. In this summary, we attempted to describe the main ideas of the paper while not drowning in excessive technical details. For those wishing for a deeper dive, we highly recommend reading the full article and ancillary articles referenced in it. Another useful reference is the BERT source code and models, which cover 103 languages and were generously released as open source by the research team.

Appendix A — Word Masking

Training the language model in BERT is done by predicting 15% of the tokens in the input, which are randomly picked. These tokens are pre-processed as follows — 80% are replaced with a “[MASK]” token, 10% with a random word, and 10% use the original word. The intuition that led the authors to pick this approach is as follows (thanks to Jacob Devlin from Google for the insight):

  • If we used [MASK] 100% of the time, the model wouldn’t necessarily produce good token representations for non-masked words. The non-masked tokens would still be used for context, but the model would be optimized only for predicting masked words.
  • If we used [MASK] 90% of the time and random words 10% of the time, this would teach the model that the observed word is never correct.
  • If we used [MASK] 90% of the time and kept the same word 10% of the time, then the model could just trivially copy the non-contextual embedding.

No ablation was done on the ratios of this approach, and it may have worked better with different ratios. In addition, the model performance wasn’t tested with simply masking 100% of the selected tokens.
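
A toy sketch of the 80/10/10 rule described above (illustrative only; the released pre-training code selects exactly 15% of positions per sequence, works on WordPiece tokens, and skips special tokens):

```python
import random

def mask_for_pretraining(tokens, vocab, select_prob=0.15):
    """Toy version of BERT's masking: pick ~15% of tokens to predict,
    then apply the 80/10/10 replacement rule to the picked positions."""
    masked = list(tokens)
    targets = []                                  # (position, original token) pairs the model must recover
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:
            targets.append((i, tok))
            roll = random.random()
            if roll < 0.8:
                masked[i] = "[MASK]"              # 80%: replace with the mask token
            elif roll < 0.9:
                masked[i] = random.choice(vocab)  # 10%: replace with a random word
            # remaining 10%: keep the original token unchanged
    return masked, targets

tokens = "the child came home from school".split()
print(mask_for_pretraining(tokens, vocab=["milk", "dog", "paris", "blue"]))
```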

For more detail, see Google’s announcement post, “Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing.”

