LongCite - Enabling LLMs to Generate Fine-grained Citations in Long-context QA
Aditi Khare
AWS & AI Research Specialist-Principal Machine Learning Scientist & AI Architect | IIM-A | Author | AI Research [Portfolio] | Build Production-Grade AI Products from Scratch | Vision Transformers | Open-Source Contributor
#ai #airesearchskills #airesearch #genai #llms
Although current long-context large language models (LLMs) have demonstrated impressive capabilities in answering user questions based on extensive text, the lack of citations in their responses makes user verification difficult and raises concerns about their trustworthiness, given their potential to hallucinate.
This paper aims to enable long-context LLMs to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability. We introduce LongBench-Cite, an automated benchmark for assessing current LLMs' performance on Long-Context Question Answering with Citations (LQAC), which reveals considerable room for improvement. To this end, we propose CoF (Coarse to Fine), a novel pipeline that uses off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and leverage this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC.
We then train LongCite-8B and LongCite-9B on the LongCite-45k dataset, enabling them to generate accurate responses and fine-grained sentence-level citations in a single output. Evaluation results on LongBench-Cite show that the trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o. SFT with citation information also effectively reduces hallucination and encourages a more uniform utilization of the context.
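For intuition, LQAC-style inputs pre-split the context into numbered sentences, and the model interleaves citations with its answer in a single pass. The mock-up below illustrates what such an output can look like; the <statement>/<cite> markup mirrors the convention used in the THUDM/LongCite repository, but the question, answer text, and sentence numbers here are invented for illustration.

```
Context (pre-split into numbered sentences):
<C1> ... <C2> ... <C157> ...

Question: When was the treaty signed, and what territory did it cede?

Model output (answer and citations in one pass):
<statement>The treaty was signed in February 1848.<cite>[12-13]</cite></statement>
<statement>It ceded a large portion of territory to the United States.<cite>[45-47]</cite></statement>
```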
Summary -
1. We introduce LongBench-Cite, an automatic benchmark for the task of LQAC, and reveal the limited performance of current long-context LLMs.
2. We propose CoF, which utilizes off-the-shelf LLMs to automatically construct high-quality long-context QA instances with fine-grained sentence-level citations (a rough sketch of the pipeline follows this list). Using this method, we construct LongCite-45k, a large-scale SFT dataset for LQAC.
3. We train LongCite-8B and LongCite-9B on the LongCite-45k dataset, enabling the generation of accurate responses and fine-grained citations in one pass. Our experiments show that SFT on LQAC data not only enhances the capacity for generating citations from lengthy contexts but also further improves response correctness.
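As promised above, here is a minimal Python sketch of the CoF idea as I read it from the paper: generate a QA pair from the long context, attach coarse chunk-level citations, refine them to sentence level, then filter weakly grounded instances. Every helper on the `llm` wrapper (`generate_qa`, `retrieve_chunks`, `add_chunk_citations`, `extract_sentence_citations`) and the `MIN_CITATIONS` threshold are hypothetical placeholders, not the authors' actual code; the real implementation lives in the THUDM/LongCite repository.

```python
import re

MIN_CITATIONS = 3  # assumed filtering threshold, not taken from the paper


def count_citations(answer: str) -> int:
    # Count non-empty <cite>[...]</cite> spans in the cited answer.
    return len(re.findall(r"<cite>\[[^\]]+\]</cite>", answer))


def cof_pipeline(long_context: str, llm) -> dict | None:
    # Step 1: self-instruct a question-answer pair from the long context.
    question, answer = llm.generate_qa(long_context)

    # Step 2: retrieve relevant chunks and ask the LLM to attach coarse,
    # chunk-level citations to each statement of the answer.
    chunks = llm.retrieve_chunks(long_context, question)
    answer_chunk_cited = llm.add_chunk_citations(answer, chunks)

    # Step 3: within each cited chunk, extract the exact supporting
    # sentences, refining chunk-level citations to sentence level.
    answer_sentence_cited = llm.extract_sentence_citations(
        answer_chunk_cited, long_context
    )

    # Step 4: discard instances whose answers carry too few citations,
    # since these are likely ungrounded.
    if count_citations(answer_sentence_cited) < MIN_CITATIONS:
        return None

    return {
        "context": long_context,
        "question": question,
        "answer": answer_sentence_cited,
    }
```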
BENCHMARKING RESULTS OF CURRENT LONG-CONTEXT LLMS -
1. Open-source LLMs, especially smaller models, have poor citation quality and lag far behind proprietary LLMs. Although they achieve correctness close to that of proprietary LLMs, open-source LLMs have obvious difficulty citing supporting evidence for their generated statements. We attribute this to (1) poor instruction-following and in-context learning ability: open-source models often generate citations that do not conform to the prescribed format; and (2) relatively weak evidence-searching ability: they often fail to find evidence for some statements (i.e., C_i = ∅, an empty citation set), or find only partially supporting or irrelevant evidence.
2. The citation quality of proprietary LLMs is still unsatisfactory. The citation F1 of proprietary LLMs on LongBench-Chat and HotpotQA is only around 0.5, meaning that fewer than half of the statements in their responses are fully supported by their citations (see the metric sketch after this list). Furthermore, their average citation length is even larger than that of chunk-level citation (whose citation length is 128), reflecting a coarse citation granularity. For example, the citation length of Claude-3-Sonnet reaches 220, and each cited snippet contains about 6 sentences on average.
3. Generating responses and citations in one pass via in-context learning hurts long-context QA performance. On most datasets, current LLMs have correctness ratios below 100%, indicating that, compared with standard long-context question answering, generating responses and citations at once through in-context learning degrades correctness due to the distribution shift from the post-training data. Overall, the performance of current LLMs on LQAC remains to be improved. To this end, the paper explores automatic construction of SFT data to further enhance LLMs' capabilities for generating fine-grained sentence-level citations from lengthy contexts.
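To make the citation F1 figures in point 2 above concrete, the sketch below shows one plausible way to compute citation recall, precision, and F1 from per-statement support judgments, in the spirit of LongBench-Cite's LLM-based judging. The scoring scheme (full = 1, partial = 0.5, none = 0 for recall; per-snippet relevance for precision) is my reading of the benchmark's setup, and `judge` is a hypothetical stand-in for the LLM judge.

```python
# Hedged sketch: citation recall / precision / F1 in the spirit of
# LongBench-Cite. judge(statement, snippets) is a hypothetical stand-in
# for the LLM judge and returns "full", "partial", or "none".

def citation_scores(statements, judge):
    """statements: list of (statement_text, cited_snippets) pairs."""
    recall_sum, relevant, total_cites = 0.0, 0, 0

    for text, snippets in statements:
        # Recall: how well the citation set as a whole supports the
        # statement (assumed credit: full = 1, partial = 0.5, none = 0).
        verdict = judge(text, snippets)
        recall_sum += {"full": 1.0, "partial": 0.5}.get(verdict, 0.0)

        # Precision: each individual cited snippet should at least be relevant.
        for snippet in snippets:
            total_cites += 1
            if judge(text, [snippet]) != "none":
                relevant += 1

    recall = recall_sum / max(len(statements), 1)
    precision = relevant / max(total_cites, 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1
```

A citation F1 around 0.5, as reported for proprietary models, would mean that on average the judge finds full support for only about half of the response statements under this kind of scoring.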
Summary -
This paper enhances LLMs’ capacity to generate fine-grained citations from lengthy contexts. We first propose LongBench-Cite, an automatic benchmark to reveal current LLMs’ limited performance on long-context question answering with citations (LQAC).
We then introduce CoF, a novel pipeline that uses off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and use it to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B on LongCite-45k, enabling the generation of accurate responses and fine-grained citations in one pass. Extensive analyses and human evaluation further verify the effectiveness of our approach.
This paper lays a solid foundation for further research on LQAC and contributes to the development of more reliable and trustworthy LLMs.
References -
Paper Reading Link - https://arxiv.org/abs/2409.02897
GitHub Link - https://github.com/THUDM/LongCite
For more information on AI Research Papers, you can visit my GitHub Profile -
For receiving the latest updates on advancements in AI Research, Gen-AI, Quantum AI & Computer Vision, you can subscribe to my AI Research Papers Summaries Newsletter using the link below -
Thank you & Happy Reading !!