A Stacking Ensemble for SQuAD2.0
Originally published at https://medium.com/@elgeish/a-stacking-ensemble-for-squad2-0-8513d96041ef — for the paper, please see https://arxiv.org/pdf/2004.07067.pdf
Despite the simplicity of the idea, ensemble learning has been widely successful in a plethora of tasks — ranging from machine learning contests to real-world applications. We aim to use ensemble learning to create a deep-learning system for a machine reading comprehension application: question answering (QA). The main goal of QA systems is to answer a question posed in natural language. Some QA systems answer a given question directly by generating complete sentences, as in ELI5, while others extract a short span of text from a corresponding context paragraph and present it as the answer; the latter is the main objective of the Stanford Question Answering Dataset (SQuAD) challenge. In SQuAD2.0, an additional challenge was introduced: the model has to indicate when a question is unanswerable given the corresponding context paragraph — here are a couple of SQuAD2.0 examples:
SQuAD2.0 systems face many challenges: the task requires building some form of natural language representation and understanding that aids in processing a question and the context to which it relates, then selecting a correct answer that humans may find satisfactory, or indicating that no such answer exists. The vast majority of modern systems, which outperform humans according to the SQuAD2.0 leaderboard, try to find two indices: the start and end positions of the answer span in the corresponding context paragraph (or sentinel values if no answer was found). Recently, this has usually been done with the aid of Pre-trained Contextual Embeddings (PCE) models, which help with language representation and understanding.
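To make the span-prediction step concrete, here is a minimal sketch using a Hugging Face Transformers QA model; the checkpoint name and the simplified no-answer comparison are illustrative assumptions rather than the exact setup used in this project.

```python
# Minimal sketch: predict an answer span (or "unanswerable") from start/end logits.
# The checkpoint below is an example SQuAD2.0 model, not the one used in this post.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

name = "deepset/roberta-base-squad2"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

question = "When was SQuAD2.0 released?"
context = (
    "SQuAD2.0 combines the questions in SQuAD1.1 with over 50,000 "
    "unanswerable questions written adversarially by crowdworkers."
)

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())

# Index 0 (the [CLS]/<s> token) acts as the no-answer sentinel: if its combined
# score beats the best span's score, predict "unanswerable". A full system would
# also restrict the span to context tokens and search over top start/end pairs.
no_answer_score = outputs.start_logits[0, 0] + outputs.end_logits[0, 0]
best_span_score = outputs.start_logits[0, start] + outputs.end_logits[0, end]

if end < start or no_answer_score > best_span_score:
    print("unanswerable")
else:
    print(tokenizer.decode(inputs["input_ids"][0, start : end + 1]))
```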
The SQuAD2.0 leaderboard shows that ensembles improve upon the performance of single models: BERT introduced an ensemble of six models that added 1.4 F1 points; ALBERT’s ensemble averages the scores of 6 to 17 models, leading to a gain of 1.3 F1 points, compared to the single model; RoBERTa and XLNet also introduced ensembles but did not provide sufficient details. Our QA ensemble system, Gestalt, combined only two models and added a gain of 0.473 EM points (0.55% relative gain) and 0.546 F1 points (0.61% relative gain), compared to the best-performing single model in the ensemble, when measured using the project’s test set (which is half of the official SQuAD2.0 dev set; the other half was used as the dev set here).
A key difference in our approach is the use of stacking to combine the top-N predictions produced by each model in the ensemble. We picked heterogeneous PCE models, fine-tuned them for the SQuAD2.0 task, and combined their top-N predictions in a multiclass classification task using a convolutional neural meta-model that selects the best possible prediction as the ensemble’s output. Since each model in the ensemble is trained differently, we expect their results (given the same input) to vary — a behavior that’s analogous to asking humans, who come from diverse backgrounds, for their opinions:
Asking models in an ensemble for their hypotheses is akin to asking the audience of “Who Wants to Be a Millionaire?” for their opinions — they give various answers given their differences.
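Before walking through the pipeline, here is a minimal sketch of what combining top-N predictions looks like in code; the Hypothesis fields and helper function are illustrative assumptions, while the choice of the top 8 candidates from each of two models (16 in total) matches the setup described below.

```python
# Minimal sketch: merge the top-N hypotheses of two level-0 models into one
# candidate list for the level-1 meta-model. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    model: str    # which level-0 model produced this candidate
    text: str     # predicted answer span ("" for "unanswerable")
    score: float  # the producing model's confidence in this span

def level1_candidates(albert_nbest, roberta_nbest, n=8):
    """Concatenate the top-n hypotheses of each level-0 model (2n candidates)."""
    return albert_nbest[:n] + roberta_nbest[:n]

# Example: 8 candidates from each model yield the 16 candidates the meta-model scores.
albert_nbest = [Hypothesis("albert-xxlarge-v1", f"span {i}", 1.0 - 0.1 * i) for i in range(8)]
roberta_nbest = [Hypothesis("roberta-base", f"span {i}", 0.9 - 0.1 * i) for i in range(8)]
candidates = level1_candidates(albert_nbest, roberta_nbest)
assert len(candidates) == 16
```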
Creating a stacking ensemble for SQuAD2.0 entails building a pipeline of two stacked stages: level 0 and level 1. In level 0, models learn from SQuAD2.0 and produce predictions, which are then used as input for a meta-model in level 1 to produce better predictions. We extend this approach further by producing the top-N hypotheses from each level-0 model in the ensemble and feeding them as input to level 1. As we show in the figure below, the set of top-N predictions (when N > 1, compared to the set of top-1 predictions) has a much better chance of including the correct answer:
To learn the meta-model in level 1, we gave it a classification task: We selected the top-8 hypotheses produced by each of albert-xxlarge-v1 and roberta-base in level 0, computed the F1 score distribution y for the resulting 16 hypotheses given the ground-truth answers, and then asked the meta-model to predict the F1 score distribution ŷ; the ensemble’s predicted answer is the argmax of ŷ. We picked the Kullback–Leibler (KL) divergence loss with a summative reduction as the cost function to minimize; the KL divergence loss impels the meta-model to learn to predict log-probability scores for a mix of — potentially multiple — correct, partially correct, and incorrect hypotheses in the input x.
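The following is a minimal PyTorch sketch of that training step; the per-hypothesis feature layout and the exact convolutional architecture are assumptions made for illustration, while the 16-way candidate list, the KL divergence loss, and the summative reduction follow the description above.

```python
# Minimal sketch of the level-1 meta-model training step (PyTorch).
# The feature layout (16 hypotheses x FEAT_DIM features) and the architecture
# below are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_HYPOTHESES = 16  # top-8 from albert-xxlarge-v1 + top-8 from roberta-base
FEAT_DIM = 32      # assumed per-hypothesis features (e.g., model scores, span lengths)

class MetaModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(FEAT_DIM, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(64, 1, kernel_size=1),
        )

    def forward(self, x):               # x: (batch, N_HYPOTHESES, FEAT_DIM)
        x = x.transpose(1, 2)           # -> (batch, FEAT_DIM, N_HYPOTHESES)
        return self.conv(x).squeeze(1)  # -> (batch, N_HYPOTHESES) logits

meta = MetaModel()
optimizer = torch.optim.Adam(meta.parameters(), lr=1e-3)
kl_loss = nn.KLDivLoss(reduction="sum")  # summative reduction, as in the post

# x: hypothesis features; f1: per-hypothesis F1 scores vs. the ground-truth answers
x = torch.randn(4, N_HYPOTHESES, FEAT_DIM)
f1 = torch.rand(4, N_HYPOTHESES)

y = f1 / f1.sum(dim=-1, keepdim=True)       # target F1 score distribution y
log_y_hat = F.log_softmax(meta(x), dim=-1)  # predicted log-probabilities (ŷ in log space)

optimizer.zero_grad()
loss = kl_loss(log_y_hat, y)  # KL(y || ŷ)
loss.backward()
optimizer.step()

# At inference time, the ensemble's answer is the highest-scoring hypothesis:
# prediction = meta(x).argmax(dim=-1)
```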
In conclusion, ensemble learning is a tremendously useful technique to improve upon state-of-the-art models; it helps models generalize better and overcome their weaknesses. In a stacking-ensemble setting, heterogeneous level-0 models can complement each other like a gestalt — when blended properly, the ensemble outperforms the best model in level 0. In our system, using only two level-0 models, stacking improved the EM and F1 scores by 0.55% and 0.61% (relative), respectively. Moreover, stacking can benefit single models and other ensembles alike. The paper includes more details on related research, results, and future work.