A Stacking Ensemble for SQuAD2.0

Originally published at https://medium.com/@elgeish/a-stacking-ensemble-for-squad2-0-8513d96041ef — for the paper, please see https://arxiv.org/pdf/2004.07067.pdf

Despite the simplicity of the idea, ensemble learning has been widely successful in a plethora of tasks, ranging from machine-learning contests to real-world applications. We aim to use ensemble learning to build a deep-learning system for a machine reading comprehension application: question answering (QA). The main goal of QA systems is to answer a question posed naturally in a human language. Some QA systems answer a given question directly by generating complete sentences, as in ELI5, while others extract a short span of text from a corresponding context paragraph and present it as the answer; the latter is the main objective of the Stanford Question Answering Dataset (SQuAD) challenge. SQuAD2.0 introduced an additional challenge: the model has to indicate when a question is unanswerable given the corresponding context paragraph. Here are a couple of SQuAD2.0 examples:

Two examples of SQuAD2.0 questions and answers written by crowd-workers, along with plausible model predictions; the one in red is incorrect while the one in blue is incomplete.

SQuAD2.0 systems face many challenges: the task requires constructing some form of natural language representation and understanding that aids in processing a question and the context to which it relates, then selecting a correct answer that humans may find satisfactory, or indicating that no such answer exists. The vast majority of modern systems, which outperform humans according to the SQuAD2.0 leaderboard, try to find two indices: the start and end positions of the answer span in the corresponding context paragraph (or sentinel values if no answer is found). Recently, this has usually been done with the aid of Pre-trained Contextual Embeddings (PCE) models, which help with language representation and understanding.
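
To make the span-extraction recipe concrete, here's a minimal sketch using the Hugging Face transformers library; the checkpoint below is a publicly available SQuAD2.0-fine-tuned model chosen for illustration, not one of the models used in this work. The score of the sentinel position (index 0) doubles as the no-answer score:

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Any SQuAD2.0-fine-tuned checkpoint works here; this one is illustrative.
name = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

question = "What must a SQuAD2.0 model do when no answer exists?"
context = "In SQuAD2.0, a model has to abstain when the paragraph contains no answer."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # per-token start and end logits

# Best span via independent argmaxes (a real system searches valid pairs;
# see the top-N sketch later in this article).
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
span_score = float(outputs.start_logits[0, start] + outputs.end_logits[0, end])

# The sentinel (first) token's score serves as the no-answer score.
null_score = float(outputs.start_logits[0, 0] + outputs.end_logits[0, 0])

if null_score > span_score or end < start:
    print("<no answer>")
else:
    print(tokenizer.decode(inputs["input_ids"][0, start : end + 1]))
```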

The SQuAD2.0 leaderboard shows that ensembles improve upon the performance of single models: BERT introduced an ensemble of six models that added 1.4 F1 points; ALBERT's ensemble averages the scores of 6 to 17 models, gaining 1.3 F1 points over the single model; RoBERTa and XLNet also introduced ensembles but did not provide sufficient details. Our QA ensemble system, Gestalt, combined only two models yet gained 0.473 EM points (a 0.55% relative gain) and 0.546 F1 points (a 0.61% relative gain) over the best-performing single model in the ensemble, measured on the project's test set (half of the official SQuAD2.0 dev set; the other half served as our dev set).

A key difference in our approach is the use of stacking to combine the top-N predictions produced by each model in the ensemble. We picked heterogeneous PCE models, fine-tuned them for the SQuAD2.0 task, and combined their top-N predictions in a multiclass classification task using a convolutional neural meta-model that selects the best possible prediction as the ensemble's output. Since each model in the ensemble is trained differently, we expect their results (given the same input) to vary, a behavior analogous to asking humans from diverse backgrounds for their opinions:

Asking models in an ensemble for their hypotheses is akin to asking the audience of “Who Wants to Be a Millionaire?” for their opinions — they give various answers given their differences.

Creating a stacking ensemble for SQuAD2.0 entails building a pipeline of two stacked stages: level 0 and level 1. In level 0, base models learn from SQuAD2.0 and produce predictions, which are then fed as input to a meta-model in level 1 that produces better predictions. We extend this approach by feeding the top-N hypotheses from each level-0 model, rather than just the best one, as input to level 1. As the figure below shows, the set of top-N predictions (when N > 1, compared to the set of top-1 predictions) has a much better chance of including the correct answer:

Top-N EM, F1, and No-Answer Accuracy scores of level-0 models. Using the top-8 answers as input to the meta-model provides a good lift in metrics (at a relatively small N).
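
Here's a sketch of how such top-N hypotheses can be harvested from one level-0 model's per-token logits; the function name, the candidate-pool size k, and the maximum answer length are illustrative assumptions:

```python
import torch

def top_n_spans(start_logits, end_logits, n=8, k=20, max_len=30):
    """Return the n best (start, end, score) spans for one example."""
    # Consider only the k most promising start and end positions.
    top_starts = torch.topk(start_logits, k).indices.tolist()
    top_ends = torch.topk(end_logits, k).indices.tolist()
    candidates = []
    for s in top_starts:
        for e in top_ends:
            if s <= e <= s + max_len:  # keep only well-formed, short spans
                candidates.append((s, e, float(start_logits[s] + end_logits[e])))
    candidates.sort(key=lambda c: c[2], reverse=True)
    return candidates[:n]
```

Running this with n = 8 for each of the two level-0 models yields the 16 hypotheses that level 1 consumes.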

To train the meta-model in level 1, we gave it a classification task: we selected the top-8 hypotheses produced by each of albert-xxlarge-v1 and roberta-base in level 0, computed the F1 score distribution y for the resulting 16 hypotheses given the ground-truth answers, and then asked the meta-model to predict the F1 score distribution ŷ; the ensemble's predicted answer is the argmax of ŷ. We picked the Kullback–Leibler (KL) divergence loss with a sum reduction as the cost function to minimize; the KL divergence loss impels the meta-model to learn to predict log-probability scores for a mix of correct, partially correct, and incorrect hypotheses (potentially multiple of each) in the input x.
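
The following sketch ties the level-1 pieces together in PyTorch; the per-hypothesis feature layout, layer sizes, and the normalization of F1 scores into a proper target distribution are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_HYPOTHESES = 16  # top-8 from each of the two level-0 models
FEAT_DIM = 128       # assumed size of each hypothesis's feature vector

class ConvMetaModel(nn.Module):
    """A small convolutional meta-model that scores each hypothesis."""

    def __init__(self):
        super().__init__()
        # Convolving across the hypothesis axis lets the model compare
        # each hypothesis with its neighbors before scoring it.
        self.conv = nn.Conv1d(FEAT_DIM, 64, kernel_size=3, padding=1)
        self.score = nn.Linear(64, 1)

    def forward(self, x):  # x: (batch, NUM_HYPOTHESES, FEAT_DIM)
        h = torch.relu(self.conv(x.transpose(1, 2)))         # (batch, 64, N)
        logits = self.score(h.transpose(1, 2)).squeeze(-1)   # (batch, N)
        return F.log_softmax(logits, dim=-1)  # predicted log-probabilities ŷ

meta_model = ConvMetaModel()
features = torch.randn(4, NUM_HYPOTHESES, FEAT_DIM)  # a toy batch of 4

# Targets y: per-hypothesis F1 scores, normalized so the KL divergence is
# well-defined (the normalization is an assumption of this sketch).
f1_scores = torch.rand(4, NUM_HYPOTHESES)
targets = f1_scores / f1_scores.sum(dim=-1, keepdim=True)

log_probs = meta_model(features)
loss = nn.KLDivLoss(reduction="sum")(log_probs, targets)
loss.backward()

# At inference time, the ensemble's answer is the argmax hypothesis.
predictions = log_probs.argmax(dim=-1)
```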

A stacking ensemble for SQuAD2.0: Level-0 base models produce top-N hypotheses, which are used as input for the meta-model in level 1 (where the final predictions are made).

In conclusion, ensemble learning is a tremendously useful technique for improving upon state-of-the-art models; it helps models generalize better and overcome their weaknesses. In a stacking-ensemble setting, heterogeneous level-0 models can complement each other like a gestalt: when blended properly, the ensemble outperforms the best model in level 0. In our system, using only two level-0 models, it improved the EM and F1 scores by 0.55% and 0.61%, respectively. Moreover, this technique can benefit single models and other ensembles alike. The paper includes more details on related research, results, and future work.



