A Stacking Ensemble for SQuAD2.0
Originally published at https://medium.com/@elgeish/a-stacking-ensemble-for-squad2-0-8513d96041ef — for the paper, please see https://arxiv.org/pdf/2004.07067.pdf
Despite the simplicity of the idea, ensemble learning has been widely successful in a plethora of tasks — ranging from machine learning contests to real-world applications. We aim to use ensemble learning to create a deep-learning system for a machine reading comprehension application: question answering (QA). The main goal of QA systems is to answer a question posed in natural language. Some QA systems answer a given question directly by generating complete sentences, as in ELI5, while others extract a short span of text from a corresponding context paragraph and present it as the answer; the latter is the main objective of the Stanford Question Answering Dataset (SQuAD) challenge. In SQuAD2.0, an additional challenge was introduced: the model has to indicate when a question is unanswerable given the corresponding context paragraph — here are a couple of SQuAD2.0 examples:
SQuAD2.0 systems face many challenges: the task requires building some form of natural language representation and understanding that aids in processing a question and the context to which it relates, then selecting a correct answer that humans may find satisfactory, or indicating that no such answer exists. The vast majority of modern systems, which outperform humans according to the SQuAD2.0 leaderboard, try to find two indices: the start and end positions of the answer span in the corresponding context paragraph (or sentinel values if no answer was found). Recently, this has usually been done with the aid of Pre-trained Contextual Embeddings (PCE) models, which help with language representation and understanding.
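To make the span-prediction step concrete, here is a minimal sketch using a Hugging Face Transformers QA model; the checkpoint name and the simplified no-answer comparison are illustrative assumptions rather than the exact setup used in this project.

```python
# Minimal sketch: predict an answer span (or "unanswerable") from start/end logits.
# The checkpoint below is an example SQuAD2.0 model, not the one used in this post.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

name = "deepset/roberta-base-squad2"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

question = "When was SQuAD2.0 released?"
context = (
    "SQuAD2.0 combines the questions in SQuAD1.1 with over 50,000 "
    "unanswerable questions written adversarially by crowdworkers."
)

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())

# Index 0 (the [CLS]/<s> token) acts as the no-answer sentinel: if its combined
# score beats the best span's score, predict "unanswerable". A full system would
# also restrict the span to context tokens and search over top start/end pairs.
no_answer_score = outputs.start_logits[0, 0] + outputs.end_logits[0, 0]
best_span_score = outputs.start_logits[0, start] + outputs.end_logits[0, end]

if end < start or no_answer_score > best_span_score:
    print("unanswerable")
else:
    print(tokenizer.decode(inputs["input_ids"][0, start : end + 1]))
```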
The SQuAD2.0 leaderboard shows that ensembles improve upon the performance of single models: BERT introduced an ensemble of six models that added 1.4 F1 points; ALBERT’s ensemble averages the scores of 6 to 17 models, leading to a gain of 1.3 F1 points, compared to the single model; RoBERTa and XLNet also introduced ensembles but did not provide sufficient details. Our QA ensemble system, Gestalt, combined only two models and added a gain of 0.473 EM points (0.55% relative gain) and 0.546 F1 points (0.61% relative gain), compared to the best-performing single model in the ensemble, when measured using the project’s test set (which is half of the official SQuAD2.0 dev set; the other half was used as the dev set here).
A key difference in our approach is the use of stacking to combine the top-N predictions produced by each model in the ensemble. We picked heterogeneous PCE models, fine-tuned them for the SQuAD2.0 task, and combined their top-N predictions in a multiclass classification task using a convolutional neural meta-model that selects the best possible prediction as the ensemble’s output. Since each model in the ensemble is trained differently, we expect their results (given the same input) to vary — a behavior that’s analogous to asking humans, who come from diverse backgrounds, for their opinions:
Asking models in an ensemble for their hypotheses is akin to asking the audience of “Who Wants to Be a Millionaire?” for their opinions — they give various answers given their differences.
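Before walking through the pipeline, here is a minimal sketch of what combining top-N predictions looks like in code; the Hypothesis fields and helper function are illustrative assumptions, while the choice of the top 8 candidates from each of two models (16 in total) matches the setup described below.

```python
# Minimal sketch: merge the top-N hypotheses of two level-0 models into one
# candidate list for the level-1 meta-model. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    model: str    # which level-0 model produced this candidate
    text: str     # predicted answer span ("" for "unanswerable")
    score: float  # the producing model's confidence in this span

def level1_candidates(albert_nbest, roberta_nbest, n=8):
    """Concatenate the top-n hypotheses of each level-0 model (2n candidates)."""
    return albert_nbest[:n] + roberta_nbest[:n]

# Example: 8 candidates from each model yield the 16 candidates the meta-model scores.
albert_nbest = [Hypothesis("albert-xxlarge-v1", f"span {i}", 1.0 - 0.1 * i) for i in range(8)]
roberta_nbest = [Hypothesis("roberta-base", f"span {i}", 0.9 - 0.1 * i) for i in range(8)]
candidates = level1_candidates(albert_nbest, roberta_nbest)
assert len(candidates) == 16
```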
Creating a stacking ensemble for SQuAD2.0 entails building a pipeline of two stacked stages: level 0 and level 1. In level 0, models learn from SQuAD2.0 and produce predictions, which are then used as input for a meta-model in level 1 to produce better predictions. We extend this approach further by producing the top-N hypotheses from each level-0 model in the ensemble and feeding them as input to level 1. As we show in the figure below, the set of top-N predictions (when N > 1, compared to the set of top-1 predictions) has a much better chance of including the correct answer:
To learn the meta-model in level 1, we gave it a classification task: We selected the top-8 hypotheses produced by each of albert-xxlarge-v1 and roberta-base in level 0, computed the F1 score distribution y for the resulting 16 hypotheses given the ground-truth answers, and then asked the meta-model to predict the F1 score distribution ŷ; the ensemble’s predicted answer is the argmax of ŷ. We picked the Kullback–Leibler (KL) divergence loss with a summative reduction as the cost function to minimize; the KL divergence loss impels the meta-model to learn to predict log-probability scores for a mix of — potentially multiple — correct, partially correct, and incorrect hypotheses in the input x.
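The following is a minimal PyTorch sketch of that training step; the per-hypothesis feature layout and the exact convolutional architecture are assumptions made for illustration, while the 16-way candidate list, the KL divergence loss, and the summative reduction follow the description above.

```python
# Minimal sketch of the level-1 meta-model training step (PyTorch).
# The feature layout (16 hypotheses x FEAT_DIM features) and the architecture
# below are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_HYPOTHESES = 16  # top-8 from albert-xxlarge-v1 + top-8 from roberta-base
FEAT_DIM = 32      # assumed per-hypothesis features (e.g., model scores, span lengths)

class MetaModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(FEAT_DIM, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(64, 1, kernel_size=1),
        )

    def forward(self, x):               # x: (batch, N_HYPOTHESES, FEAT_DIM)
        x = x.transpose(1, 2)           # -> (batch, FEAT_DIM, N_HYPOTHESES)
        return self.conv(x).squeeze(1)  # -> (batch, N_HYPOTHESES) logits

meta = MetaModel()
optimizer = torch.optim.Adam(meta.parameters(), lr=1e-3)
kl_loss = nn.KLDivLoss(reduction="sum")  # summative reduction, as in the post

# x: hypothesis features; f1: per-hypothesis F1 scores vs. the ground-truth answers
x = torch.randn(4, N_HYPOTHESES, FEAT_DIM)
f1 = torch.rand(4, N_HYPOTHESES)

y = f1 / f1.sum(dim=-1, keepdim=True)       # target F1 score distribution y
log_y_hat = F.log_softmax(meta(x), dim=-1)  # predicted log-probabilities (ŷ in log space)

optimizer.zero_grad()
loss = kl_loss(log_y_hat, y)  # KL(y || ŷ)
loss.backward()
optimizer.step()

# At inference time, the ensemble's answer is the highest-scoring hypothesis:
# prediction = meta(x).argmax(dim=-1)
```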
In conclusion, ensemble learning is a tremendously useful technique to improve upon state-of-the-art models; it helps models generalize better and overcome their weaknesses. In a stacking-ensemble setting, heterogeneous level-0 models can complement each other like a gestalt — when blended properly, the ensemble outperforms the best model in level 0. In our system, using only two level-0 models, stacking improved the EM and F1 scores by 0.55% and 0.61% (relative), respectively. Moreover, stacking can benefit single models and other ensembles alike. The paper includes more details on related research, results, and future work.