Foundational Papers in NLP: Bi-Directional Attention Flow (BIDAF) network - Seo et al., 2016.
Circa 2016, machine comprehension and question answering had become popular tasks in NLP and computer vision. End-to-end neural systems could now achieve promising results by using attention mechanisms to focus on the most relevant parts of the context for answering the question. However, previous attention approaches had some limitations: they typically summarized the context into a fixed-size vector, coupled the attention at each step to the attention computed at previous steps, and/or attended in only one direction, from query to context.
This paper introduced the Bi-Directional Attention Flow (BIDAF) network to address these limitations.
The key aspects were: representing the context at multiple levels of granularity (character, word, and contextual embeddings); a memory-less, bi-directional attention mechanism computed at every time step rather than summarized early into a fixed-size vector; and letting the attended, query-aware representations flow through to a subsequent modeling layer.
Evaluations showed BIDAF achieving state-of-the-art results on the SQuAD question answering dataset. Ablations demonstrated the importance of its components, and analyses provided visualizations and comparisons showing that BIDAF learned suitable representations for strong machine comprehension.
BIDAF represented the context paragraph at different granularity levels using the following layers:
1. Character Embedding Layer (1D CNN): Mapped words to vectors based on character n-grams. CNN outputs were max-pooled to obtain fixed-size character-based word representations.
2. Word Embedding Layer (GloVe): Leveraged pre-trained word vectors to map words to additional embeddings.
3. Contextual Embedding Layer (BiLSTM): Employed a bi-directional LSTM on top of the word embeddings to incorporate contextual cues from surrounding words and model temporal interactions.
4. Attention Flow Layer: Computed a similarity matrix between LSTM-encoded context and query embeddings to capture relevance between each query and context word. Applied context-to-query (C2Q) attention to attend from context to query words and query-to-context (Q2C) attention to identify salient context words. Combined the bidirectional attention representations with the context embeddings to create query-aware context representations (a code sketch of this layer follows the list).
5. Modeling Layer (BiLSTM): Captured interactions between context words conditioned on the query using two layers of bi-directional LSTM over the query-aware representations from the attention flow layer.
6. Output Layer: Predicted answer span start/end positions for QA, or missing words for cloze tasks, based on the BiLSTM representations of the context.
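To make the attention flow layer concrete, here is a minimal PyTorch-style sketch of the computation described in item 4 above. The class and variable names (AttentionFlow, w_sim) are illustrative, not taken from the paper's released code: the similarity matrix S is scored from [h; u; h∘u], C2Q attention is a softmax over query words for each context word, Q2C attention is a softmax over context words of the per-row maximum, and the results are fused into a query-aware representation G.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFlow(nn.Module):
    """Sketch of BIDAF's attention flow layer (illustrative names)."""
    def __init__(self, hidden_dim):
        super().__init__()
        # Trainable similarity: alpha(h, u) = w^T [h; u; h * u], where h, u have width 2d
        self.w_sim = nn.Linear(6 * hidden_dim, 1, bias=False)

    def forward(self, H, U):
        # H: (batch, T, 2d) contextual embeddings of the context paragraph
        # U: (batch, J, 2d) contextual embeddings of the query
        T, J = H.size(1), U.size(1)
        H_exp = H.unsqueeze(2).expand(-1, -1, J, -1)   # (batch, T, J, 2d)
        U_exp = U.unsqueeze(1).expand(-1, T, -1, -1)   # (batch, T, J, 2d)
        # Similarity matrix S: relevance of each query word to each context word
        S = self.w_sim(torch.cat([H_exp, U_exp, H_exp * U_exp], dim=-1)).squeeze(-1)  # (batch, T, J)

        # Context-to-query (C2Q): which query words matter for each context word
        a = F.softmax(S, dim=-1)                       # (batch, T, J)
        U_tilde = torch.bmm(a, U)                      # (batch, T, 2d)

        # Query-to-context (Q2C): which context words are most salient to the query
        b = F.softmax(S.max(dim=-1).values, dim=-1)    # (batch, T)
        h_tilde = torch.bmm(b.unsqueeze(1), H)         # (batch, 1, 2d)
        H_tilde = h_tilde.expand(-1, T, -1)            # (batch, T, 2d)

        # Query-aware context representation G, fed to the modeling BiLSTM
        return torch.cat([H, U_tilde, H * U_tilde, H * H_tilde], dim=-1)  # (batch, T, 8d)
```

For context and query encodings of width 2d, the output G has width 8d per context position, which is then passed to the modeling layer.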
Training minimized a loss defined as the negative log probability of the true start and end positions of the answer span in the context, averaged over all training examples. Minimizing this loss optimized all of the model's components (CNN, LSTMs, attention, and output layers) to predict the correct start and end indices. At test time, the model used the learned distributions to find the optimal answer span: it chose the start and end positions that maximized the product of their predicted probabilities, which can be computed in linear time with a simple dynamic-programming pass.
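As a rough sketch of that objective (the function name and signature are illustrative, assuming per-position logits of shape (batch, T) and gold indices of shape (batch,)), the loss is just the sum of cross-entropy terms on the true start and end indices:

```python
import torch.nn.functional as F

def bidaf_span_loss(start_logits, end_logits, true_start, true_end):
    """Negative log-likelihood of the true start and end positions, averaged over the batch.
    start_logits, end_logits: (batch, T) unnormalized scores over context positions.
    true_start, true_end: (batch,) gold answer-span indices."""
    # cross_entropy = -log softmax probability of the true index
    return F.cross_entropy(start_logits, true_start) + F.cross_entropy(end_logits, true_end)
```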
Overall, training tuned the model parameters to discern likely answer positions, and the test procedure then returned the start and end points with the highest joint probability as the predicted answer span. Defining suitable objectives for both phases let the model learn how to locate answers and efficiently surface the most probable span.
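A minimal sketch of that test-time span search, assuming p_start and p_end are the per-position probability lists produced by the output layer (the function name is illustrative): a single pass keeps track of the best start probability seen so far, so the pair (start, end) with start ≤ end that maximizes the product is found in linear time.

```python
def best_span(p_start, p_end):
    """Pick (start, end) with start <= end maximizing p_start[start] * p_end[end],
    in one linear pass by tracking the most probable start seen so far."""
    best_prob, best_pair = -1.0, (0, 0)
    best_start, best_start_prob = 0, p_start[0]
    for end, p_e in enumerate(p_end):
        if p_start[end] > best_start_prob:
            # A better start candidate for any later end position
            best_start, best_start_prob = end, p_start[end]
        if best_start_prob * p_e > best_prob:
            best_prob, best_pair = best_start_prob * p_e, (best_start, end)
    return best_pair, best_prob
```

For example, best_span([0.1, 0.6, 0.3], [0.2, 0.1, 0.7]) returns ((1, 2), 0.42), i.e. the span from position 1 to position 2.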
The paper conducted experiments on machine comprehension using the SQuAD question answering dataset, which contains over 100k questions paired with context paragraphs in which the answer appears as a span. As previously stated, the BIDAF architecture combined character- and word-level embeddings and passed them through recurrent and attention layers to predict the start and end positions of the answer in the context. The attention mechanism in BIDAF used both context-to-query and query-to-context alignments to obtain query-aware context representations.
The results showed that BIDAF achieved state-of-the-art performance on the hidden SQuAD test set leaderboard, with an Exact Match score of 73.3 and an F1 score of 81.1 using an ensemble of 12 models trained with identical architecture and hyperparameters. This outperformed all previously published approaches. Ablation studies on the dev set without ensembling showed that removing character embeddings, word embeddings, or bidirectional attention, or substituting a more traditional dynamic attention formulation, resulted in significant performance drops. Further analyses provided visualizations and comparisons suggesting that BIDAF learned improved intermediate representations.
Additionally, BIDAF was evaluated on cloze-style reading comprehension using the CNN/DailyMail datasets. With minor modifications to the output layer, it achieved new state-of-the-art single-model performance, even surpassing previous ensemble approaches. The strong results across two machine comprehension formulations demonstrated the effectiveness of BIDAF's hierarchical modeling and attention mechanisms for this task.
Seo, M., Kembhavi, A., Farhadi, A., & Hajishirzi, H. (2016). Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.