Foundational Papers in NLP: Wordpiece Modelling for Machine Translation - Wu et al 2016
Circa 2016, the core of Google's Neural Machine Translation (NMT) system was a deep stack of Long Short-Term Memory (LSTM) networks: 8 encoder layers and 8 decoder layers. Residual connections between layers allow training to converge despite the depth required for strong accuracy. The LSTMs are unidirectional, except for a bi-directional first encoder layer that provides context from both directions.
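As a rough illustration (not the authors' code), the sketch below shows a GNMT-style encoder in PyTorch: a bi-directional first LSTM layer followed by unidirectional layers joined by residual connections. The class name, layer sizes, and the exact layer at which residuals begin are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ResidualLSTMEncoder(nn.Module):
    """Illustrative GNMT-style encoder: bi-directional first layer,
    then unidirectional layers with residual connections."""

    def __init__(self, d_model=1024, num_layers=8):
        super().__init__()
        # First layer reads the source in both directions; the two halves
        # are concatenated back to d_model.
        self.bi_layer = nn.LSTM(d_model, d_model // 2,
                                batch_first=True, bidirectional=True)
        # Remaining layers are unidirectional.
        self.layers = nn.ModuleList(
            [nn.LSTM(d_model, d_model, batch_first=True)
             for _ in range(num_layers - 1)]
        )

    def forward(self, x):                  # x: (batch, src_len, d_model)
        h, _ = self.bi_layer(x)            # forward/backward states concatenated
        for i, layer in enumerate(self.layers):
            out, _ = layer(h)
            # Residual connections help the deep stack converge
            # (assumed here to start after the second layer).
            h = out + h if i > 0 else out
        return h                           # per-position encoder states
```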
Handling lengthy source contexts is important for robust translation quality but presents computational challenges. In this system, attention is applied from the bottom decoder layer to the top encoder layer; connecting only these two layers keeps the number of sequential dependencies low. The attention module itself is a feedforward network with a single hidden layer, which is lightweight and fast to compute.
Attention probabilities are computed by comparing the current decoder state with each encoder hidden state, allowing the model to focus on relevant input segments as needed. This design strikes an effective balance between translation quality and computational efficiency. Applying attention only from the bottom decoder layer maximizes parallelizability during inference compared to attending from all decoder layers, and the lightweight feedforward attention network avoids expensive computations. Together these choices enable accurate yet fast neural translation models that retain the benefits of attention for longer, real-world sentences.
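The following sketch (in NumPy, with assumed weight names W_dec, W_enc, and v) illustrates this style of single-hidden-layer additive attention; it is a simplified stand-in for the paper's attention module, not its exact implementation.

```python
import numpy as np

def attention_context(dec_state, enc_states, W_dec, W_enc, v):
    """dec_state: (d,) current decoder state; enc_states: (src_len, d)
    top-layer encoder states. Returns the context vector and weights."""
    # Single hidden layer: tanh of projected decoder and encoder states,
    # scored against a learned vector v.
    hidden = np.tanh(enc_states @ W_enc.T + dec_state @ W_dec.T)  # (src_len, h)
    scores = hidden @ v                                           # (src_len,)
    # Softmax over source positions gives attention probabilities.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    context = probs @ enc_states                                  # (d,)
    return context, probs
```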
In addition to the attention mechanism, the paper introduces wordpiece modeling, a technique for handling the huge vocabularies needed for high-quality machine translation. Wordpiece modeling breaks words into common subword units called wordpieces using a data-driven algorithm that iteratively selects the wordpieces which most increase the language model likelihood of the training corpus. Special delimiters such as underscores mark word boundaries so the original words can be recovered from a wordpiece sequence. The result is a medium-sized vocabulary of 8k-32k wordpieces that covers most words while keeping sequences reasonably short.
The algorithm also restricts the number of basic characters to a manageable size (~500) and maps the rest to an unknown-character token to avoid rare characters. This provides a good balance between vocabulary size, sequence length, and likelihood. The final wordpiece vocabulary is built greedily, selecting at each step the wordpiece that most improves overall corpus likelihood.
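A toy sketch of this greedy construction is shown below. The real algorithm picks the merge that maximizes language model likelihood of the corpus; this sketch substitutes raw pair frequency as a stand-in for that criterion, and the function name and underscore convention are illustrative assumptions.

```python
from collections import Counter

def build_wordpieces(corpus_words, target_vocab_size):
    # "_" marks the start of a word so the original text can be
    # recovered from the wordpiece sequence.
    words = Counter("_" + w for w in corpus_words)
    segs = {w: list(w) for w in words}          # current segmentation per word
    vocab = {c for pieces in segs.values() for c in pieces}

    while len(vocab) < target_vocab_size:
        # Count adjacent wordpiece pairs, weighted by word frequency.
        pairs = Counter()
        for w, count in words.items():
            for a, b in zip(segs[w], segs[w][1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Frequency used here as a proxy for the likelihood gain of a merge.
        (a, b), _ = pairs.most_common(1)[0]
        vocab.add(a + b)
        # Apply the merge to every word's segmentation.
        for w in segs:
            pieces, merged, i = segs[w], [], 0
            while i < len(pieces):
                if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == (a, b):
                    merged.append(a + b); i += 2
                else:
                    merged.append(pieces[i]); i += 1
            segs[w] = merged
    return vocab
```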
Wordpiece modeling strikes a strategic balance between the flexibility of character modeling and the efficiency of full-word modeling, benefiting from both. The approach handles an essentially unlimited vocabulary without special treatment for rare or unknown words, and a shared source/target vocabulary makes it easy to copy names and other entities. Empirically, the paper shows a 32k shared wordpiece vocabulary achieving state-of-the-art accuracy on the WMT English-French translation task.
Along with attention and wordpiece modeling, the paper uses a specialized beam search procedure to generate the output sequence that maximizes a scoring function given the trained model and input sentence. The score combines the model's translation log-probability with a length normalization term and a coverage penalty. Length normalization allows hypotheses of different lengths to be compared fairly; the best-performing formulation divides by the hypothesis length raised to a power alpha between 0.6 and 0.7. The coverage penalty sums, over source words, the log of the attention mass each word has received (capped at 1), penalizing hypotheses that leave parts of the input untranslated and thereby incentivizing full input coverage.
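A minimal sketch of this scoring, following the length-normalization and coverage-penalty formulation described above, is given below; alpha is taken from the 0.6-0.7 range mentioned in the text, while the beta value and function names are illustrative assumptions.

```python
import math

def length_penalty(target_len, alpha=0.65):
    # Normalizes scores so longer hypotheses are not unfairly penalized.
    return ((5 + target_len) ** alpha) / ((5 + 1) ** alpha)

def coverage_penalty(attention_per_source_word, beta=0.2):
    # For each source word, take the log of the total attention it received
    # across target steps (capped at 1.0, floored to avoid log(0)).
    return beta * sum(math.log(max(min(sum(p), 1.0), 1e-9))
                      for p in attention_per_source_word)

def hypothesis_score(log_prob, target_len, attention_per_source_word,
                     alpha=0.65, beta=0.2):
    # Beam-search score: length-normalized log-probability plus coverage term.
    return (log_prob / length_penalty(target_len, alpha)
            + coverage_penalty(attention_per_source_word, beta))
```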
The implementation keeps 8-12 hypotheses during search and prunes them using multiple criteria based on per-token and normalized scores. Batch decoding parallelizes inference over up to 35 sentences simultaneously. The tuned length normalization and coverage penalty together provide roughly a +1 BLEU point gain, so the optimized search procedure balances accuracy and efficiency and complements the neural architecture and the wordpiece modeling approach.
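A rough sketch of such pruning might look like the following; the margin threshold and exact criteria here are simplifications for illustration, not the paper's precise settings.

```python
def prune_beam(hypotheses, beam_size=12, margin=3.0):
    """hypotheses: list of (tokens, score) pairs. Keep at most beam_size,
    and drop anything trailing the best score by more than `margin`."""
    hypotheses = sorted(hypotheses, key=lambda h: h[1], reverse=True)[:beam_size]
    best_score = hypotheses[0][1]
    return [h for h in hypotheses if best_score - h[1] <= margin]
```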
In the paper, the authors show how reinforcement learning (RL) offers a way to further fine-tune neural machine translation models beyond standard maximum likelihood training. They apply RL to refine the top English-to-French and English-to-German models with the goal of improving their BLEU scores.
Their key results show that RL fine-tuning provides a modest boost to the state-of-the-art models. On English-to-French translation, they achieve close to a 1 BLEU point increase on the test set after RL fine-tuning. For English-to-German, a 0.4 BLEU improvement is attained on the development set, but test performance declines slightly. They note that some of these gains overlap with improvements from separately tuning the decoding procedure through techniques like length normalization and the coverage penalty, so the boost from RL might have been more pronounced on a less-optimized decoder. Nonetheless, averaging over 8 independently trained models, the results indicate a positive impact from using reinforcement learning to refine models past standard maximum likelihood training. While not a silver bullet, RL presents a supplemental way to eke out gains in translation quality.
In summary, the optimized architecture, training strategy, and inference setup allow the model to surpass the prior state of the art on WMT benchmarks while reducing errors by over 60% on large-scale internal datasets compared to the previous Google Translate systems (as of 2016). This substantial boost in deployable neural translation quality highlights the impact of techniques tackling key challenges like vocabulary breadth through solutions such as data-driven wordpiece modeling.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., ... & Dean, J. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.