LongCite - Enabling LLMs to Generate Fine-grained Citations in Long-context QA
Aditi Khare
AWS & AI Research Specialist-Principal Machine Learning Scientist & AI Architect | IIM-A | Author | AI Research [Portfolio] | Build Production-Grade AI Products from Scratch | Vision Transformers | Open-Source Contributor
#ai #airesearchskills #airesearch #genai #llms
Although current long-context large language models (LLMs) have demonstrated impressive capabilities in answering user questions based on extensive text, the lack of citations in their responses makes user verification difficult and raises concerns about their trustworthiness, given their potential to hallucinate.
This paper aims to enable long-context LLMs to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability. We introduce LongBench-Cite, an automated benchmark for assessing current LLMs' performance on Long-Context Question Answering with Citations (LQAC), which reveals considerable room for improvement. To this end, we propose CoF (Coarse to Fine), a novel pipeline that uses off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and leverage this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC.
We then train LongCite-8B and LongCite-9B on the LongCite-45k dataset, enabling them to generate accurate responses and fine-grained sentence-level citations in a single output. Evaluation results on LongBench-Cite show that the trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o. SFT with citation information also effectively reduces hallucination and encourages a more uniform utilization of the context.
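For intuition, LQAC-style inputs pre-split the context into numbered sentences, and the model interleaves citations with its answer in a single pass. The mock-up below illustrates what such an output can look like; the <statement>/<cite> markup mirrors the convention used in the THUDM/LongCite repository, but the question, answer text, and sentence numbers here are invented for illustration.

```
Context (pre-split into numbered sentences):
<C1> ... <C2> ... <C157> ...

Question: When was the treaty signed, and what territory did it cede?

Model output (answer and citations in one pass):
<statement>The treaty was signed in February 1848.<cite>[12-13]</cite></statement>
<statement>It ceded a large portion of territory to the United States.<cite>[45-47]</cite></statement>
```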
Summary -
1. We introduce LongBench-Cite, an automatic benchmark for the task of LQAC, and reveal the limited performance of current long-context LLMs.
2. We propose CoF, which utilizes off-the-shelf LLMs to automatically construct high-quality long-context QA instances with fine-grained sentence-level citations (a rough sketch of the pipeline follows this list). Using this method, we construct LongCite-45k, a large-scale SFT dataset for LQAC.
3. We train LongCite-8B and LongCite-9B on the LongCite-45k dataset, enabling the generation of accurate responses and fine-grained citations in one pass. Our experiments show that SFT on LQAC data not only enhances the capacity for generating citations from lengthy contexts but also further improves response correctness.
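As promised above, here is a minimal Python sketch of the CoF idea as I read it from the paper: generate a QA pair from the long context, attach coarse chunk-level citations, refine them to sentence level, then filter weakly grounded instances. Every helper on the `llm` wrapper (`generate_qa`, `retrieve_chunks`, `add_chunk_citations`, `extract_sentence_citations`) and the `MIN_CITATIONS` threshold are hypothetical placeholders, not the authors' actual code; the real implementation lives in the THUDM/LongCite repository.

```python
import re

MIN_CITATIONS = 3  # assumed filtering threshold, not taken from the paper


def count_citations(answer: str) -> int:
    # Count non-empty <cite>[...]</cite> spans in the cited answer.
    return len(re.findall(r"<cite>\[[^\]]+\]</cite>", answer))


def cof_pipeline(long_context: str, llm) -> dict | None:
    # Step 1: self-instruct a question-answer pair from the long context.
    question, answer = llm.generate_qa(long_context)

    # Step 2: retrieve relevant chunks and ask the LLM to attach coarse,
    # chunk-level citations to each statement of the answer.
    chunks = llm.retrieve_chunks(long_context, question)
    answer_chunk_cited = llm.add_chunk_citations(answer, chunks)

    # Step 3: within each cited chunk, extract the exact supporting
    # sentences, refining chunk-level citations to sentence level.
    answer_sentence_cited = llm.extract_sentence_citations(
        answer_chunk_cited, long_context
    )

    # Step 4: discard instances whose answers carry too few citations,
    # since these are likely ungrounded.
    if count_citations(answer_sentence_cited) < MIN_CITATIONS:
        return None

    return {
        "context": long_context,
        "question": question,
        "answer": answer_sentence_cited,
    }
```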
BENCHMARKING RESULTS OF CURRENT LONG-CONTEXT LLMS -
1. Open-source LLMs, especially smaller models, have poor citation quality and lag far behind proprietary LLMs. Although they achieve correctness close to that of proprietary LLMs, open-source LLMs have obvious difficulty citing supporting evidence for their generated statements. We attribute this to (1) poor instruction-following and in-context learning ability: open-source models often generate citations that do not conform to the prescribed format; and (2) relatively weak evidence-searching ability: they often fail to find evidence for some statements (i.e., C_i = ∅, an empty citation set), or find only partially supporting or irrelevant evidence.
2. The citation quality of proprietary LLMs is still unsatisfactory. The citation F1 of proprietary LLMs on LongBench-Chat and HotpotQA is only around 0.5, meaning that fewer than half of the statements in their responses are fully supported by their citations (see the metric sketch after this list). Furthermore, their average citation length is even larger than that of chunk-level citation (whose citation length is 128), reflecting a coarse citation granularity. For example, the citation length of Claude-3-Sonnet reaches 220, and each cited snippet contains about 6 sentences on average.
3. Generating responses and citations in one pass via in-context learning hurts long-context QA performance. On most datasets, current LLMs have correctness ratios below 100%, indicating that, compared with standard long-context question answering, generating responses and citations at once through in-context learning degrades correctness due to the distribution shift from the post-training data. Overall, the performance of current LLMs on LQAC remains to be improved. To this end, the paper explores automatic construction of SFT data to further enhance LLMs' capabilities for generating fine-grained sentence-level citations from lengthy contexts.
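To make the citation F1 figures in point 2 above concrete, the sketch below shows one plausible way to compute citation recall, precision, and F1 from per-statement support judgments, in the spirit of LongBench-Cite's LLM-based judging. The scoring scheme (full = 1, partial = 0.5, none = 0 for recall; per-snippet relevance for precision) is my reading of the benchmark's setup, and `judge` is a hypothetical stand-in for the LLM judge.

```python
# Hedged sketch: citation recall / precision / F1 in the spirit of
# LongBench-Cite. judge(statement, snippets) is a hypothetical stand-in
# for the LLM judge and returns "full", "partial", or "none".

def citation_scores(statements, judge):
    """statements: list of (statement_text, cited_snippets) pairs."""
    recall_sum, relevant, total_cites = 0.0, 0, 0

    for text, snippets in statements:
        # Recall: how well the citation set as a whole supports the
        # statement (assumed credit: full = 1, partial = 0.5, none = 0).
        verdict = judge(text, snippets)
        recall_sum += {"full": 1.0, "partial": 0.5}.get(verdict, 0.0)

        # Precision: each individual cited snippet should at least be relevant.
        for snippet in snippets:
            total_cites += 1
            if judge(text, [snippet]) != "none":
                relevant += 1

    recall = recall_sum / max(len(statements), 1)
    precision = relevant / max(total_cites, 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1
```

A citation F1 around 0.5, as reported for proprietary models, would mean that on average the judge finds full support for only about half of the response statements under this kind of scoring.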
Summary -
This paper enhances LLMs’ capacity to generate fine-grained citations from lengthy contexts. We first propose LongBench-Cite, an automatic benchmark to reveal current LLMs’ limited performance on long-context question answering with citations (LQAC).
We then introduce CoF, a novel pipeline that uses off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and use it to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B on LongCite-45k, enabling the generation of accurate responses and fine-grained citations in one pass. Extensive analyses and human evaluation further verify the effectiveness of our approach.
This paper lays a solid foundation for further research on LQAC and contributes to the development of more reliable and trustworthy LLMs.
References -
Paper Reading Link - https://arxiv.org/abs/2409.02897
GitHub Link - https://github.com/THUDM/LongCite
For more information on AI Research Papers, you can visit my GitHub Profile -
For receiving the latest updates on advancements in AI Research, Gen-AI, Quantum AI & Computer Vision, you can subscribe to my AI Research Papers Summaries Newsletter using the link below -
Thank you & Happy Reading !!