Foundational Papers in NLP: Bi-Directional Attention Flow (BIDAF) network - Seo et al., 2016.
DALL-E: Attention in multiple directions - Picasso style


Circa 2016, machine comprehension and question answering had become popular tasks in NLP and computer vision. End-to-end neural systems could achieve promising results by using attention mechanisms to focus on the parts of the context most relevant to the question. However, previous attention approaches had several limitations:

  1. They often summarized the context into a fixed-size vector prematurely, losing information.
  2. They employed temporally dynamic attention weights, where the attention at each step depended on the attention computed at previous steps.
  3. They were usually unidirectional, with the query attending to the context.

This paper introduced the Bi-Directional Attention Flow (BIDAF) network to address these limitations.

The key aspects were:

  1. Attention was computed at each time step and allowed to flow through to subsequent layers rather than being summarized early into a fixed-size vector. This reduced information loss.
  2. It used memory-less attention, where each attention step depended only on the current query and context, not on previous attention. This created a division of labor: the attention layer focused on learning query-context relevance, while the modeling layer learned interactions within the query-aware context representation.
  3. It employed both context-to-query and query-to-context attention to obtain complementary representations (the formulation is spelled out just after this list).
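For readers who want the precise formulation, the attention flow can be summarized as follows in (lightly adapted) paper notation, where h_t and u_j are the 2d-dimensional contextual embeddings of context word t and query word j, and w_S is a trainable weight vector:

```latex
S_{tj} = w_S^{\top} [\, h_t ;\, u_j ;\, h_t \circ u_j \,]                                  % similarity matrix
a_{tj} = \mathrm{softmax}_j(S_{tj}), \qquad \tilde{u}_t = \sum_j a_{tj}\, u_j                % context-to-query
b_t = \mathrm{softmax}_t\!\big(\max_j S_{tj}\big), \qquad \tilde{h} = \sum_t b_t\, h_t       % query-to-context
G_{:t} = [\, h_t ;\, \tilde{u}_t ;\, h_t \circ \tilde{u}_t ;\, h_t \circ \tilde{h} \,] \in \mathbb{R}^{8d}
```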

Evaluations showed BIDAF achieving state-of-the-art results on the SQuAD question answering dataset. Ablations demonstrated the importance of its components, and analyses with visualizations and comparisons suggested that BIDAF learned representations well suited to machine comprehension.

BiDirectional Attention Flow Model

BIDAF represented the context paragraph at different granularity levels using the following layers:

1. Character Embedding Layer (1D CNN): Mapped words to vectors based on character n-grams. CNN outputs were max-pooled to obtain fixed-size character-based word representations.

2. Word Embedding Layer (GloVe): Leveraged pre-trained word vectors to map words to additional embeddings.

3. Contextual Embedding Layer (BiLSTM): Employed a bi-directional LSTM on top of the word embeddings to incorporate contextual cues from surrounding words and model temporal interactions.

4. Attention Flow Layer: Computed a similarity matrix between the LSTM-encoded context and query embeddings to capture the relevance between each context word and each query word. Applied context-to-query (C2Q) attention to attend from each context word to the query words, and query-to-context (Q2C) attention to identify the context words most salient to the query. Combined both attention directions with the context embeddings to create query-aware context representations (see the code sketch after this list).

5. Modeling Layer (BiLSTM): Captured interactions between context words conditioned on the query, using two layers of bi-directional LSTM over the query-aware representations from the attention flow layer.

6. Output Layer: Predicted the answer span's start and end positions for QA, or the missing word for cloze tasks, based on the BiLSTM representations of the context.
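To make the attention flow layer concrete, here is a minimal PyTorch-style sketch of the bidirectional attention described above. The similarity function follows the paper's trainable form w^T[h; u; h∘u]; the module name BiAttention, the tensor shapes, and the variable names are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiAttention(nn.Module):
    """Sketch of BiDAF's attention flow layer (names and shapes are assumptions).

    H: contextual embeddings of the context, shape (batch, T, 2d)
    U: contextual embeddings of the query,   shape (batch, J, 2d)
    Returns G, the query-aware context representation, shape (batch, T, 8d).
    """

    def __init__(self, d):
        super().__init__()
        # Trainable similarity alpha(h, u) = w^T [h; u; h * u] on 2d-dim inputs
        self.w = nn.Linear(6 * d, 1, bias=False)

    def forward(self, H, U):
        batch, T, dim = H.shape                       # dim == 2d
        J = U.size(1)

        # Similarity matrix S: one score per (context word, query word) pair
        H_exp = H.unsqueeze(2).expand(batch, T, J, dim)
        U_exp = U.unsqueeze(1).expand(batch, T, J, dim)
        S = self.w(torch.cat([H_exp, U_exp, H_exp * U_exp], dim=-1)).squeeze(-1)

        # Context-to-query (C2Q): each context word attends over all query words
        a = F.softmax(S, dim=2)                       # (batch, T, J)
        U_tilde = torch.bmm(a, U)                     # (batch, T, 2d)

        # Query-to-context (Q2C): weight context words by their best match to any
        # query word, pool them into one vector, and tile it across the context
        b = F.softmax(S.max(dim=2).values, dim=1)     # (batch, T)
        h_tilde = torch.bmm(b.unsqueeze(1), H)        # (batch, 1, 2d)
        H_tilde = h_tilde.expand(batch, T, dim)       # (batch, T, 2d)

        # G = [H; U~; H * U~; H * H~], the query-aware context representation
        return torch.cat([H, U_tilde, H * U_tilde, H * H_tilde], dim=-1)
```

In BIDAF itself, G then feeds the two-layer modeling BiLSTM before the output layer.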

The training process for the BIDAF network defined a loss function as the negative log probability of the true start and end positions of the answer span in the context, averaged over all training examples. Minimizing this loss optimized all of the model's components end to end, including the CNN, LSTM, attention, and output layers. At test time, the model chose the answer span whose start and end positions maximized the product of their predicted probabilities (with the end position not preceding the start), which dynamic programming makes computable in linear time.

Overall, training tuned the model parameters to assign high probability to the true answer positions, and inference then surfaced the start/end pair with the highest joint probability as the predicted answer span.
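To illustrate the two phases, here is a small sketch assuming the output layer produces start and end logits over context positions; the helper names span_loss and decode_span and the running-maximum decoding are one reasonable linear-time realization of the procedure above, not the authors' implementation.

```python
import torch.nn.functional as F


def span_loss(start_logits, end_logits, true_start, true_end):
    """Negative log probability of the true start and end positions, averaged over the batch."""
    return F.cross_entropy(start_logits, true_start) + F.cross_entropy(end_logits, true_end)


def decode_span(p_start, p_end):
    """Return the (start, end) pair with start <= end maximizing p_start[start] * p_end[end].

    A running maximum over the start probabilities gives the linear-time search:
    at each end position j we only need the best start probability among 0..j.
    """
    best_prob, best_span = 0.0, (0, 0)
    best_start_prob, best_start_idx = 0.0, 0
    for j in range(len(p_end)):
        if p_start[j] > best_start_prob:              # best start among indices <= j
            best_start_prob, best_start_idx = float(p_start[j]), j
        prob = best_start_prob * float(p_end[j])
        if prob > best_prob:
            best_prob, best_span = prob, (best_start_idx, j)
    return best_span, best_prob
```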

The paper conducted machine comprehension experiments on the SQuAD question answering dataset, which contains over 100k questions paired with context paragraphs in which the answer appears as a span. As previously described, the BIDAF architecture combined character- and word-level embeddings and passed them through recurrent and attention layers to predict the start and end positions of the answer in the context. The attention mechanism used both context-to-query and query-to-context alignments to build query-aware context representations.

The results showed that BIDAF achieved state-of-the-art performance on the hidden SQuAD test set leaderboard, with an Exact Match score of 73.3 and an F1 score of 81.1 using an ensemble of 12 identically configured models, outperforming all previously published approaches. Ablation studies on the dev set (without ensembling) showed that removing the character embeddings, word embeddings, or bidirectional attention, or replacing the memory-less attention with a more traditional dynamic attention formulation, resulted in significant performance drops. Further analyses provided visualizations and comparisons suggesting that BIDAF learned better intermediate representations.

(Figure from the paper) (a) t-SNE visualizations of the month names embedded in the two feature spaces; the contextual embedding layer is able to distinguish the two usages of the word "May" using context from the surrounding text. (b) Venn diagram of the questions answered correctly by BIDAF and the more traditional baseline (Rajpurkar et al., 2016). (c) Correctly answered questions broken down by the 10 most frequent first words in the question.
(Figure from the paper) Attention matrices for question-context tuples. The left palette shows the context paragraph (correct answer in red and underlined), the middle palette shows the attention matrix (each row is a question word, each column is a context word), and the right palette shows the top attention points for each question word, above a threshold.

Additionally, BIDAF was evaluated on cloze-style reading comprehension using the CNN/DailyMail datasets. With minor modifications to the output layer, it achieved new state-of-the-art single-model performance, even surpassing previous ensemble approaches. The strong results across two machine comprehension formulations demonstrated the effectiveness of BIDAF's hierarchical modeling and attention mechanisms.
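As a rough illustration of the kind of output-layer change that cloze-style data calls for, the sketch below assumes the answer is a single candidate entity from the context and scores each candidate by summing the predicted start-position probabilities over its occurrences; the function name cloze_predict and its inputs are hypothetical simplifications, not the paper's exact formulation.

```python
from collections import defaultdict


def cloze_predict(p_start, context_entities):
    """Pick the candidate entity whose context occurrences receive the most probability mass.

    p_start:          per-token probability over context positions (list of floats)
    context_entities: same length; candidate entity id at entity positions, None elsewhere
    """
    scores = defaultdict(float)  # entity id -> summed probability (an assumed aggregation)
    for prob, entity in zip(p_start, context_entities):
        if entity is not None:
            scores[entity] += prob
    return max(scores, key=scores.get)
```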

(Table from the paper) Results on the CNN/DailyMail datasets, including the results of previous ensemble methods for completeness.


Seo, M., Kembhavi, A., Farhadi, A., & Hajishirzi, H. (2016). Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.

