Foundational Papers in NLP: Wordpiece Modelling for Machine Translation - Wu et al 2016
Circa 2016, the core of Google's Neural Machine Translation (NMT) system was a deep stack of Long Short-Term Memory (LSTM) networks: 8 encoder layers and 8 decoder layers. Residual connections between layers allow training to converge despite the depth required for strong accuracy. The LSTMs are unidirectional, except for a bi-directional first encoder layer that provides context from both directions.
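As a rough illustration (not the authors' code), the sketch below shows a GNMT-style encoder in PyTorch: a bi-directional first LSTM layer followed by unidirectional layers joined by residual connections. The class name, layer sizes, and the exact layer at which residuals begin are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ResidualLSTMEncoder(nn.Module):
    """Illustrative GNMT-style encoder: bi-directional first layer,
    then unidirectional layers with residual connections."""

    def __init__(self, d_model=1024, num_layers=8):
        super().__init__()
        # First layer reads the source in both directions; the two halves
        # are concatenated back to d_model.
        self.bi_layer = nn.LSTM(d_model, d_model // 2,
                                batch_first=True, bidirectional=True)
        # Remaining layers are unidirectional.
        self.layers = nn.ModuleList(
            [nn.LSTM(d_model, d_model, batch_first=True)
             for _ in range(num_layers - 1)]
        )

    def forward(self, x):                  # x: (batch, src_len, d_model)
        h, _ = self.bi_layer(x)            # forward/backward states concatenated
        for i, layer in enumerate(self.layers):
            out, _ = layer(h)
            # Residual connections help the deep stack converge
            # (assumed here to start after the second layer).
            h = out + h if i > 0 else out
        return h                           # per-position encoder states
```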
Handling lengthy source contexts is important for robust translation quality but presents computational challenges. In this system, attention is applied from the bottom decoder layer to the top encoder layer; connecting only these two layers keeps the number of sequential dependencies low. The attention module itself is a feedforward network with a single hidden layer, which is lightweight and fast to compute.
Attention probabilities are computed by comparing the current decoder state with each encoder hidden state, allowing the model to focus on relevant input segments as needed. This design strikes an effective balance between translation quality and computational efficiency. Applying attention only from the bottom decoder layer maximizes parallelizability during inference compared to attending from all decoder layers, and the lightweight feedforward attention network avoids expensive computations. Together these choices enable accurate yet fast neural translation models that retain the benefits of attention for longer, real-world sentences.
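The following sketch (in NumPy, with assumed weight names W_dec, W_enc, and v) illustrates this style of single-hidden-layer additive attention; it is a simplified stand-in for the paper's attention module, not its exact implementation.

```python
import numpy as np

def attention_context(dec_state, enc_states, W_dec, W_enc, v):
    """dec_state: (d,) current decoder state; enc_states: (src_len, d)
    top-layer encoder states. Returns the context vector and weights."""
    # Single hidden layer: tanh of projected decoder and encoder states,
    # scored against a learned vector v.
    hidden = np.tanh(enc_states @ W_enc.T + dec_state @ W_dec.T)  # (src_len, h)
    scores = hidden @ v                                           # (src_len,)
    # Softmax over source positions gives attention probabilities.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    context = probs @ enc_states                                  # (d,)
    return context, probs
```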
In addition to the attention mechanism, the paper introduces wordpiece modeling, a technique for handling the huge vocabularies needed for high-quality machine translation. Wordpiece modeling breaks words into common subword units called wordpieces using a data-driven algorithm that iteratively selects the wordpieces which most increase the language model likelihood of the training corpus. Special delimiters such as underscores mark word boundaries so the original words can be recovered from a wordpiece sequence. The result is a medium-sized vocabulary of 8k-32k wordpieces that covers most words while keeping sequences reasonably short.
The algorithm also restricts the number of basic characters to a manageable size (~500) and maps the rest to an unknown-character token to avoid rare characters. This provides a good balance between vocabulary size, sequence length, and likelihood. The final wordpiece vocabulary is built greedily, selecting at each step the wordpiece that most improves overall corpus likelihood.
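A toy sketch of this greedy construction is shown below. The real algorithm picks the merge that maximizes language model likelihood of the corpus; this sketch substitutes raw pair frequency as a stand-in for that criterion, and the function name and underscore convention are illustrative assumptions.

```python
from collections import Counter

def build_wordpieces(corpus_words, target_vocab_size):
    # "_" marks the start of a word so the original text can be
    # recovered from the wordpiece sequence.
    words = Counter("_" + w for w in corpus_words)
    segs = {w: list(w) for w in words}          # current segmentation per word
    vocab = {c for pieces in segs.values() for c in pieces}

    while len(vocab) < target_vocab_size:
        # Count adjacent wordpiece pairs, weighted by word frequency.
        pairs = Counter()
        for w, count in words.items():
            for a, b in zip(segs[w], segs[w][1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Frequency used here as a proxy for the likelihood gain of a merge.
        (a, b), _ = pairs.most_common(1)[0]
        vocab.add(a + b)
        # Apply the merge to every word's segmentation.
        for w in segs:
            pieces, merged, i = segs[w], [], 0
            while i < len(pieces):
                if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == (a, b):
                    merged.append(a + b); i += 2
                else:
                    merged.append(pieces[i]); i += 1
            segs[w] = merged
    return vocab
```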
Wordpiece modeling strikes a strategic balance between the flexibility of character modeling and the efficiency of full-word modeling, benefiting from both. The approach handles an essentially unlimited vocabulary without special treatment for rare or unknown words, and a shared source/target vocabulary makes it easy to copy names and other entities. Empirically, the paper shows a 32k shared wordpiece vocabulary achieving state-of-the-art accuracy on the WMT English-French translation task.
Along with attention and wordpiece modeling, the paper uses a specialized beam search procedure to generate the output sequence that maximizes a scoring function given the trained model and input sentence. The score combines the model's translation log-probability with a length normalization term and a coverage penalty. Length normalization allows hypotheses of different lengths to be compared fairly; the best-performing formulation divides by the hypothesis length raised to a power alpha between 0.6 and 0.7. The coverage penalty sums, over source words, the log of the attention mass each word has received (capped at 1), penalizing hypotheses that leave parts of the input untranslated and thereby incentivizing full input coverage.
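A minimal sketch of this scoring, following the length-normalization and coverage-penalty formulation described above, is given below; alpha is taken from the 0.6-0.7 range mentioned in the text, while the beta value and function names are illustrative assumptions.

```python
import math

def length_penalty(target_len, alpha=0.65):
    # Normalizes scores so longer hypotheses are not unfairly penalized.
    return ((5 + target_len) ** alpha) / ((5 + 1) ** alpha)

def coverage_penalty(attention_per_source_word, beta=0.2):
    # For each source word, take the log of the total attention it received
    # across target steps (capped at 1.0, floored to avoid log(0)).
    return beta * sum(math.log(max(min(sum(p), 1.0), 1e-9))
                      for p in attention_per_source_word)

def hypothesis_score(log_prob, target_len, attention_per_source_word,
                     alpha=0.65, beta=0.2):
    # Beam-search score: length-normalized log-probability plus coverage term.
    return (log_prob / length_penalty(target_len, alpha)
            + coverage_penalty(attention_per_source_word, beta))
```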
The implementation keeps 8-12 hypotheses during search and prunes them using multiple criteria based on per-token and normalized scores. Batch decoding parallelizes inference over up to 35 sentences simultaneously. The tuned length normalization and coverage penalty together provide roughly a +1 BLEU point gain, so the optimized search procedure balances accuracy and efficiency and complements the neural architecture and the wordpiece modeling approach.
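A rough sketch of such pruning might look like the following; the margin threshold and exact criteria here are simplifications for illustration, not the paper's precise settings.

```python
def prune_beam(hypotheses, beam_size=12, margin=3.0):
    """hypotheses: list of (tokens, score) pairs. Keep at most beam_size,
    and drop anything trailing the best score by more than `margin`."""
    hypotheses = sorted(hypotheses, key=lambda h: h[1], reverse=True)[:beam_size]
    best_score = hypotheses[0][1]
    return [h for h in hypotheses if best_score - h[1] <= margin]
```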
In the paper, the authors show how reinforcement learning (RL) offers a way to further fine-tune neural machine translation models beyond standard maximum likelihood training. They apply RL to refine the top English-to-French and English-to-German models with the goal of improving their BLEU scores.
Their key results show that RL fine-tuning provides a modest boost to the state-of-the-art models. On English-to-French translation, they achieve close to a 1 BLEU point increase on the test set after RL fine-tuning. For English-to-German, a 0.4 BLEU improvement is attained on the development set, but test performance declines slightly. They note that some of these gains overlap with improvements from separately tuning the decoding procedure through techniques like length normalization and the coverage penalty, so the boost from RL might have been more pronounced on a less-optimized decoder. Nonetheless, averaging over 8 independently trained models, the results indicate a positive impact from using reinforcement learning to refine models past standard maximum likelihood training. While not a silver bullet, RL presents a supplemental way to eke out gains in translation quality.
In summary, the optimized architecture, training strategy, and inference setup allow the model to surpass the prior state of the art on WMT benchmarks while reducing errors by over 60% on large-scale internal datasets compared to the previous Google Translate systems (as of 2016). This substantial boost in deployable neural translation quality highlights the impact of techniques tackling key challenges like vocabulary breadth through solutions such as data-driven wordpiece modeling.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., ... & Dean, J. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.