PUB: How well do Large Language Models (LLMs) understand pragmatics?
Authors: Settaluri Lakshmi Sravanthi, Meet Doshi, Tankala Pavan Kalyan, Rudra Murthy, Pushpak Bhattacharyya, Raj Dabre
TL;DR: We introduce PUB, a Pragmatics Understanding Benchmark consisting of 14 tasks across four domains of pragmatics (implicature, presupposition, reference, and deixis). Our findings show that LLMs fail at the simple pragmatic understanding found in day-to-day human conversation, and that there is a significant gap to be bridged.
Note: This work has been accepted for publication at ACL 2024 (Findings).
Link to arXiv version: https://arxiv.org/abs/2401.07078
Acknowledgments: This was joint work with Prof. Pushpak Bhattacharyya and his students as part of the academic research collaboration between IBM Research and IIT Bombay.
Pragmatics
Pragmatics is the area of linguistics that examines how context influences language understanding in communication. People's ability to understand pragmatics comes from their cognitive skills and social awareness: they understand not only the words that are spoken but also the context and the implied message.
The phenomenon of pragmatics can be further divided into various types. The major types of pragmatic phenomena are implicature, presupposition, reference, and deixis.

Implicature
The semantic meaning is just that Speaker 2 ordered something, but the underlying pragmatic meaning says otherwise.

Presupposition
Hypothesis: Alan was at the bottom of the ladder.
Here the hypothesis is assumed to be true before (“pre”) the premise can be uttered (“supposition”) and considered meaningful.

Reference
Sentence: The customer really liked the dish made by Chef Gordon Ramsay.
Here we take an example of a phenomenon called metonymy, where the association between “dish” and “food is usually served on a dish” is used to refer to a meal instead of an actual plate.

Deixis
Sentence: If you come over here, I can show you where it happened, all that time ago.
In this case, “you”, “here”, “I”, “where”, “it”, and “that” are all pointing words that refer to different objects in a single conversation.
Testing LLMs' ability to understand pragmatics
This section discusses our methodology for testing whether LLMs understand pragmatics. We begin by describing the dataset creation process, followed by the evaluation methodology.
Datasets and Tasks
Figure 1 summarizes the datasets and tasks used to evaluate the LLMs. We select existing datasets, namely Circa, GRICE, FigQA, FLUTE, IMPPRES, and NOPE. However, these datasets do not explicitly evaluate implicature, presupposition, and reference understanding in conversations, and some are generated synthetically.
Therefore, we create four new datasets on top of existing conversational datasets such as Circa, DailyDialog, and ConvoKit, comprising 6,100 newly annotated data points. We convert all datasets into a Multiple-Choice Question Answering (MCQA) format.
Re-purposing Existing Datasets
Consider the Circa dataset, which consists of single-turn dialogs with additional context; the task is to classify whether Speaker 2's response is a direct answer, an indirect answer, or a yes subject to some conditions.
This dataset is re-purposed for three tasks.
Task 1 is Direct/Indirect Classification: given the context and the conversation between Speaker 1 and Speaker 2, we prompt the LLM to classify whether Speaker 2's response is direct or indirect.
Task 2 is the same as the original Circa task: given the context and the conversation between Speaker 1 and Speaker 2, the task is to classify the conversation into one of four options centering on whether Speaker 2's response is direct, indirect, or ambiguous.
Task 3 is an extended version of Task 2. We ask human annotators to provide the implied meaning of Speaker 2's response. Given the context, the conversation between Speaker 1 and Speaker 2, and the implied meaning, the task is to classify the conversation into the same four options.
We similarly re-purpose other datasets to introduce more tasks; a sketch of the resulting prompt format is shown below.
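To make the MCQA conversion concrete, here is a minimal sketch of how a single Circa-style exchange could be formatted as a multiple-choice prompt. The dialogue, option wording, field names, and helper function are illustrative assumptions, not the exact templates used in PUB (see Figure 1 for real prompt examples).

```python
# Sketch: turning a Circa-style single-turn dialogue into an MCQA prompt.
# The dialogue, option texts, and field names below are illustrative
# assumptions, not the exact templates or schema used in PUB.

OPTION_LABELS = ["A", "B", "C", "D"]

def build_mcqa_prompt(context: str, speaker1: str, speaker2: str, options: list[str]) -> str:
    """Format a single-turn exchange plus answer options as an MCQA prompt."""
    lines = [
        f"Context: {context}",
        f"Speaker 1: {speaker1}",
        f"Speaker 2: {speaker2}",
        "Question: How should Speaker 2's response be interpreted?",
    ]
    for label, option in zip(OPTION_LABELS, options):
        lines.append(f"{label}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)

# Hypothetical example in the spirit of Circa's indirect answers.
prompt = build_mcqa_prompt(
    context="X wants to know what sports Y likes.",
    speaker1="Do you like playing tennis?",
    speaker2="I have a racket at home that I never use.",
    options=[
        "Direct answer",
        "Indirect answer",
        "Yes, subject to some conditions",
        "The response is ambiguous",
    ],
)
print(prompt)
```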
The dataset can be found at: https://huggingface.co/datasets/cfilt/PUB
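To browse the released data, a minimal sketch with the Hugging Face datasets library follows; the configuration and split names here are assumptions, so check the dataset card for the actual ones.

```python
# Sketch: loading PUB from the Hugging Face Hub.
# The configuration name ("task_1") and split ("test") are assumptions;
# see https://huggingface.co/datasets/cfilt/PUB for the actual names.
from datasets import load_dataset

pub_task = load_dataset("cfilt/PUB", name="task_1", split="test")
print(pub_task[0])  # inspect one MCQA-formatted example
```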
Figure 1: Illustration of each task from PUB. The dataset used for each task is prepended to each row in the figure. Related tasks are grouped. This is followed by the task name, an illustration, and a prompt example.
Evaluation Methodology
Language models are sensitive to prompt wording and other contextual changes, which makes them unreliable to evaluate with a single prompting strategy. We therefore follow both generation-based and perplexity-based approaches to evaluate LLMs for pragmatic abilities. Other methods rely on debiasing approaches for MCQA evaluation, but we use Cloze Prompting as a reliable proxy due to its pertinence to the instruction tuning of LMs.
Figure 2: Visualization of Multiple-Choice Prompt and Cloze Prompt evaluation strategies from Robinson and Wingate (2023).
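To make the perplexity-based (Cloze Prompting) side concrete, here is a minimal sketch that scores each answer option by the log-likelihood a causal LM assigns to it when appended to the question; the model name, prompt text, and scoring details are illustrative assumptions rather than the exact setup used in the paper.

```python
# Sketch of Cloze-Prompting (perplexity-based) MCQA scoring: each answer option
# is appended to the question and scored by the total log-probability the model
# assigns to the option tokens; the highest-scoring option is the prediction.
# "gpt2" is only a small, openly available placeholder; the paper evaluates
# flan-t5, Llama-2 variants, and GPT-3.5.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def option_logprob(question: str, option: str) -> float:
    """Sum of token log-probabilities of `option` given `question` (boundary handling is approximate)."""
    prefix_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    logits = model(full_ids).logits                        # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1, :], -1)   # predictions for tokens 1..seq_len-1
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prefix_len - 1:].sum().item()       # keep only the option tokens

def cloze_predict(question: str, options: list[str]) -> str:
    """Return the answer option with the highest conditional log-likelihood."""
    return max(options, key=lambda opt: option_logprob(question, opt))
```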
We perform both zero-shot and 3-shot evaluations. We consider the following models: flan-t5-xxl; llama-2 (7b, 7b-chat, 13b, 13b-chat, 70b, 70b-chat); t5; and GPT-3.5. The OpenAI model is evaluated only using Multiple-Choice Prompting (MCP). For zero-shot prompts, all instances of the data were used as-is. For few-shot prompts, a dev set of 20 examples was created for each task.
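As an illustration of the few-shot setup, here is a minimal sketch of assembling a 3-shot prompt from such a dev set; the dev-set structure (a formatted question plus its gold answer letter) and the random exemplar selection are assumptions made for the example, not necessarily the paper's exact procedure.

```python
# Sketch: building a 3-shot prompt from a small dev set of solved examples.
# Each dev-set entry is assumed to be (formatted MCQA question ending in
# "Answer:", gold answer letter); the exact exemplar selection and joining
# format used in the paper may differ.
import random

def build_few_shot_prompt(dev_set: list[tuple[str, str]],
                          test_question: str,
                          k: int = 3,
                          seed: int = 0) -> str:
    """Prepend k solved exemplars from the dev set to the test question."""
    exemplars = random.Random(seed).sample(dev_set, k)
    shots = [f"{question} {answer}" for question, answer in exemplars]
    return "\n\n".join(shots + [test_question])
```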
Human Evaluation
We select 100 examples from the test set of each task and ask three human evaluators to answer them, resulting in 4,200 human evaluations. The evaluators are fluent English speakers who graduated from a technical university where English is the medium of instruction. It is important to note that the human evaluation does not reflect expert human performance, but rather the performance of an average human on complex pragmatic tasks. The evaluators are presented with the same prompt as the 0-shot MCP given to the LLMs.
Results and Discussion
Figure 3: Average performance of models on three different pragmatics phenomena. Average accuracy for reference and deixis is merged and plotted as Reference, as they are closely related phenomena. Human - I, P, R represent the performance of human evaluators on Implicature, Presupposition, and Reference, respectively.
We find that an increase in model size and instruction tuning are both correlated with an increase in pragmatic understanding. Llama variants optimized with preference tuning do not show any improvement in pragmatic understanding. Even so, these LLMs fail at simple pragmatic understanding tasks that humans succeed at daily. Overall, we see a significant gap in performance between humans and LLMs.
Detailed Results
We first discuss the performance of an average person compared against the ground truth.
We calculate the Matthews correlation coefficient to compare (a) the choices made by average humans with the ground truths produced by expert linguists and (b) the choices made by LLMs with those of humans. We see a clear indication that the choices made by LLMs are less correlated with what humans think and choose. We also see a consistent bias of LLMs toward responding with false negatives, even with answer-bias correction.
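For reference, this agreement can be computed with scikit-learn's Matthews correlation coefficient; the label vectors below are made-up placeholders, not our actual annotations.

```python
# Sketch: measuring agreement with the Matthews correlation coefficient (MCC).
# The label vectors are made-up placeholders; in practice they are the
# per-question choices of humans, LLMs, and the expert-annotated ground truth.
from sklearn.metrics import matthews_corrcoef

ground_truth  = [1, 0, 1, 1, 0, 1, 0, 0]   # expert linguists
human_choices = [1, 0, 1, 1, 0, 1, 1, 0]   # average human evaluators
llm_choices   = [1, 0, 0, 1, 0, 0, 0, 0]   # model predictions

print("humans vs. ground truth:", matthews_corrcoef(ground_truth, human_choices))
print("LLMs vs. humans:", matthews_corrcoef(human_choices, llm_choices))
```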
Figure 4: Confusion matrix comparing mistakes of LLMs vs. humans against ground-truth answers. These tasks are chosen to have binary and consistent options for all questions in the task.
We now discuss the performance of the models on individual tasks.
Figure 5: Results for tasks 2 & 3, tasks 5 & 6, and tasks 7, 8 & 9. The results presented in this table are the maximum across all types of evaluations (0-shot and 3-shot Cloze and MCQA) performed on the models.
For the response classification and figurative language understanding tasks, we see an improvement in performance with a hint in most cases. We also test whether a contrastive hint, which presents both sides of the story while still containing enough information to answer the question, improves or degrades performance. Here the contrastive hint acts as a distractor that humans handle robustly, yet we see a significant drop in performance for LLMs, suggesting that LLMs cannot identify the signals that matter for pragmatic understanding. For tasks like agreement and sarcasm detection, LLMs fail badly once pragmatics is introduced.
Figure 6: Results for tasks 1, 4, 10, 11, 12, 13 and 14. The results presented in this table are the maximum across all types of evaluations (0-shot and 3-shot Cloze and MCQA) performed on the models.
Here too we see that on tasks like direct/indirect classification and implicature NLI, most LLMs lag because they fail at simple implicature understanding. Presupposition meets the same fate, with models performing far worse. On newly annotated datasets like metonymy, we see that Llama-2-70B achieves superhuman performance thanks to its grasp of world knowledge and common references.
Examples
We now illustrate a few examples from our evaluation.
Proportion of Plurality Agreement
We also perform a reliability test to check how consistent the models are in their responses. Following Robinson and Wingate, we use the Proportion of Plurality Agreement (PPA) to evaluate consistency. We find that model consistency is correlated with the number of parameters and with instruction tuning. Overall, most of the models we choose (except t5-11B) can be considered consistent in the 3-shot setting, scoring significantly above the baseline PPA of 0.25.
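For completeness, here is a minimal sketch of how PPA can be computed from repeated answers to the same questions under different answer-option orderings, following our reading of Robinson and Wingate; the input layout (one row of predicted answers per question, one column per ordering) is an assumption.

```python
# Sketch: Proportion of Plurality Agreement (PPA).
# For each question the model answers the same MCQA item several times with the
# answer options presented in different orders; PPA is the fraction of orderings
# that agree with the plurality (most frequent) answer, averaged over questions.
# With 4 options, random guessing gives a baseline of 0.25.
# The toy predictions below are made-up placeholders.
from collections import Counter

def proportion_plurality_agreement(predictions_per_question: list[list[str]]) -> float:
    """predictions_per_question[i] holds one predicted answer per option ordering."""
    per_question = []
    for answers in predictions_per_question:
        plurality_count = Counter(answers).most_common(1)[0][1]
        per_question.append(plurality_count / len(answers))
    return sum(per_question) / len(per_question)

toy_predictions = [
    ["A", "A", "A", "B"],  # mostly consistent
    ["C", "D", "A", "B"],  # fully inconsistent
]
print(proportion_plurality_agreement(toy_predictions))  # (0.75 + 0.25) / 2 = 0.5
```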
Summary
PUB is a pragmatics understanding benchmark of 14 tasks spanning implicature, presupposition, reference, and deixis, built in part from 6,100 newly annotated examples. Our evaluation shows that larger and instruction-tuned models do better, but even the best LLMs fall well short of humans on the everyday pragmatic understanding these tasks probe.