LLM Paper Reading Notes - September 2024

Sharing short notes (from myself and others) about LLM research papers I came across in August. These notes differ in their level of detail and precision. I hope they're still useful in piquing your curiosity and helping you breathe under the waterfall. At the current pace of AI, it takes the power of all of us to keep up.

Check my newsletter for past reading notes!

Reading Notes

What Matters in Transformers? Not All Attention is Needed

https://arxiv.org/pdf/2406.15786v4

Although attention layers are the hallmark of transformers, their removal does not significantly degrade performance. Larger LLMs, such as Llama-3-70B, demonstrate even greater robustness to attention pruning, retaining 99% of their performance after 50% of the attention layers are pruned. The pruning metric is the cosine similarity between a layer's input and output, with high similarity indicating redundancy. However, removing MLP layers, or removing entire transformer blocks, significantly degrades performance.
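
For illustration, a minimal sketch of the redundancy metric and layer selection (my own code, not the authors'; the calibration activations and the drop ratio are assumptions):

```python
import torch

def layer_redundancy(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Cosine similarity between a layer's input and output hidden states.

    hidden_in, hidden_out: [batch, seq_len, d_model] activations captured on a
    small calibration set. A similarity close to 1 means the layer barely
    changes its input, i.e. it is largely redundant.
    """
    sim = torch.nn.functional.cosine_similarity(
        hidden_in.flatten(1), hidden_out.flatten(1), dim=-1
    )
    return sim.mean().item()

def select_attention_layers_to_prune(similarities: dict[int, float],
                                     drop_ratio: float = 0.5) -> list[int]:
    """Drop the attention layers with the highest input/output similarity."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return ranked[: int(len(ranked) * drop_ratio)]
```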


Self-Taught Evaluators

https://arxiv.org/pdf/2408.02666v1

Human-generated preference data are typically used to align LLMs via Reinforcement Learning from Human Feedback (RLHF) or for evaluation purposes. This paper proposes a method to generate such data without human intervention. The process involves selecting a set of instructions from a production system and generating preference pairs: one good response and one bad response for each instruction. The good response is generated using Llama3-70B-Instruct. To create bad responses, the model is asked to generate similar but semantically different instructions and then produce responses for these new instructions, with the expectation that these responses will not be suitable for the original instruction. Remarkably, models trained iteratively on this synthetic data achieve performance levels comparable to top-performing reward models trained with human-labeled data.
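
For illustration, a rough sketch of the data-generation loop as I understand it (`llm` stands in for calls to Llama3-70B-Instruct, and the prompt wording is paraphrased, not the paper's templates):

```python
def llm(prompt: str) -> str:
    """Stand-in for a call to Llama3-70B-Instruct (e.g. via an inference API)."""
    raise NotImplementedError

def build_preference_pair(instruction: str) -> dict:
    # "Good" response: answer the original instruction directly.
    chosen = llm(f"Respond to the following instruction:\n{instruction}")

    # "Bad" response: derive a similar but semantically different instruction,
    # then answer *that* one. Its answer should be a poor fit for the original.
    perturbed = llm(
        "Write an instruction that is similar to, but semantically different "
        f"from, the following instruction:\n{instruction}"
    )
    rejected = llm(f"Respond to the following instruction:\n{perturbed}")

    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}
```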


Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together

https://arxiv.org/pdf/2407.10930

The paper describes a process to improve NLP pipelines involving multiple LLMs. Given a small dataset of labeled examples, the process involves producing “self-generated” examples by initially running the pipeline on the provided training data. These examples, which consist of successful input-output pairs from these runs, are then used for prompt optimization and fine-tuning. In prompt optimization, different combinations of these self-generated few-shot examples are randomly selected and included in the prompts to improve the LLMs' performance. Similarly, fine-tuning leverages the self-generated examples to adjust the model's weights. The experiments show that while optimizing prompts is crucial across all tasks, combining prompt optimization with weight fine-tuning leads to significantly stronger performance gains compared to optimizing either component alone.
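
For illustration, a rough sketch of the bootstrapping step (my own simplification; `pipeline`, `is_correct`, and the trace format are placeholders, not the paper's API):

```python
def pipeline(example_input: str) -> dict:
    """Stand-in for the multi-LLM pipeline; returns its output plus the
    intermediate module inputs/outputs (the trace)."""
    raise NotImplementedError

def is_correct(trace: dict, gold_label: str) -> bool:
    """Stand-in for the task metric."""
    raise NotImplementedError

def bootstrap_demos(train_set: list[tuple[str, str]]) -> list[dict]:
    """Collect successful traces to reuse as few-shot demos and fine-tuning data."""
    demos = []
    for x, y in train_set:
        trace = pipeline(x)
        if is_correct(trace, y):
            demos.append(trace)  # successful input/output pairs for each module
    return demos

# Prompt optimization then searches over random subsets of `demos` to include
# in each module's prompt; fine-tuning uses the same traces as training data.
```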


RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation

https://arxiv.org/pdf/2408.08067v1

This paper introduces RAGCHECKER, a framework that provides a set of fine-grained metrics covering overall quality (precision and recall), the retriever (context precision, claim recall), and the generator (noise sensitivity, hallucination, faithfulness, etc.). The key components of this framework are a text-to-claim extractor, which extracts granular claims from model responses or gold standards, and an entailment checker, which verifies the validity of these claims by assessing whether they are entailed by a reference text. Both components are implemented using Llama3-70B. The framework is validated through experiments showing that RAGCHECKER has the strongest correlation with human preferences compared to other metrics and evaluation frameworks.

While this approach mirrors what practitioners often do, it's useful to see it systematically detailed here. However, I've observed that using an LLM for claim extraction and entailment verification can be somewhat unreliable in practice.
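
For illustration, the claim-level precision/recall idea can be sketched as follows (my own simplification; `extract_claims` and `entails` stand in for the paper's Llama3-70B-based extractor and entailment checker):

```python
def extract_claims(text: str) -> list[str]:
    """Stand-in for the LLM-based text-to-claim extractor."""
    raise NotImplementedError

def entails(reference: str, claim: str) -> bool:
    """Stand-in for the LLM-based entailment checker."""
    raise NotImplementedError

def claim_precision_recall(response: str, gold_answer: str) -> tuple[float, float]:
    response_claims = extract_claims(response)
    gold_claims = extract_claims(gold_answer)

    # Precision: fraction of response claims entailed by the gold answer.
    precision = sum(entails(gold_answer, c) for c in response_claims) / max(len(response_claims), 1)
    # Recall: fraction of gold claims entailed by the response.
    recall = sum(entails(response, c) for c in gold_claims) / max(len(gold_claims), 1)
    return precision, recall
```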


Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models

https://arxiv.org/pdf/2408.05093

The author observes that models such as GPT-4o-mini, Llama-3.1-70b, Claude-3.5-sonnet, and Gemini-1.5-flash are sensitive to the order in which they generate their reasoning. These models produce different answers when asked to provide the answer first followed by the reasoning, compared to when they provide the reasoning first and then the answer. Notably, providing the reasoning first only leads to better answers about 50% of the time. To address this issue, they propose a multi-step prompting strategy called “Reflexive Prompting.” In this approach, the model is prompted twice: once to give the answer first and then the reasoning, and once to do the reverse. The two responses are then fed back into the model for review, which, in almost all cases, results in slightly improved performance.
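
For illustration, a rough sketch of the two-pass-plus-review procedure (prompt wording is mine, not the paper's):

```python
def llm(prompt: str) -> str:
    """Stand-in for any chat model (GPT-4o-mini, Llama-3.1-70b, ...)."""
    raise NotImplementedError

def reflexive_answer(question: str) -> str:
    answer_first = llm(f"{question}\nGive your answer first, then explain your reasoning.")
    reasoning_first = llm(f"{question}\nThink step by step, then give your final answer.")

    # Feed both candidate responses back to the model for a final review.
    return llm(
        f"Question: {question}\n\n"
        f"Response A (answer first):\n{answer_first}\n\n"
        f"Response B (reasoning first):\n{reasoning_first}\n\n"
        "Review both responses and give the final, most reliable answer."
    )
```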


Cost-Effective Hallucination Detection for LLMs

https://arxiv.org/pdf/2407.21424v1

This paper addresses the challenge of detecting hallucinations in LLM outputs. It compares several scoring methods: the inverse perplexity of the LLM response; prompting the LLM to determine whether an answer is true or false and using the logits of the "true"/"false" token; prompting the LLM to assess whether the output contradicts the input, contradicts itself, or contradicts generally established facts; a DeBERTa model fine-tuned to predict if the output conflicts with the question; and prompting the LLM to estimate the probability that the answer is correct. Each method is calibrated independently using the Fortuna library. While prompting the LLM to determine whether an answer is true or false works well on most datasets, the authors found that aggregating the scoring methods using logistic regression performs the best, often recovering or surpassing the performance of the best individual method.
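
For illustration, the aggregation step is essentially a supervised combination of the individual detectors; a minimal sketch with scikit-learn (the scores and labels are placeholder data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: calibrated scores from the individual methods for one LLM answer,
# e.g. [inverse perplexity, P("true") from the true/false prompt,
#       contradiction check, DeBERTa score, self-estimated P(correct)].
scores = np.array([
    [0.12, 0.90, 0.85, 0.80, 0.75],
    [0.05, 0.20, 0.30, 0.25, 0.40],
    [0.10, 0.85, 0.70, 0.90, 0.65],
    [0.03, 0.15, 0.20, 0.10, 0.30],
])
labels = np.array([1, 0, 1, 0])  # 1 = answer judged correct, 0 = hallucination

aggregator = LogisticRegression().fit(scores, labels)
p_correct = aggregator.predict_proba(scores)[:, 1]  # aggregated confidence per answer
```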


RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

https://arxiv.org/pdf/2408.02545

Summary by The Sequence of AI Knowledge :

This paper presents a framework that streamlines the development and evaluation of Retrieval-Augmented Generation (RAG) systems. The framework uses YAML configuration files to seamlessly integrate data creation, retrieval, fine-tuning, inference, prompt design, and evaluation into a unified workflow, making it easier to prototype and experiment with various RAG techniques.


Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

https://arxiv.org/pdf/2408.03314v1

Summary by Davis Blalock :

This paper asks the question, “How well can you get your output quality to scale with various approaches to test-time computation?” And in particular, can we beat big models with small models and more complex inference techniques?

The takeaway here is roughly that you can replace parameters with inference flops for easy math problems but not hard ones. There’s also a trend where huge amounts of test-time compute start beating the larger parameter count, but it’s a more complex story because they used a single, larger model here and really you’d want to scale up your param and pretraining token counts differently for different regimes.

This paper was great in that it changed my thinking about what works and where we’re headed in multiple ways:

  1. Simple strategies like weighted voting just keep getting better as you scale up test-time compute
  2. There’s a regime where you’re limited by your ability to recognize a good solution, not your ability to generate one.
  3. Batch size 1 inference is becoming a less important workload.
  4. Tree search with Process Reward Models is a legit strategy that people have independently replicated.
  5. Many simple models may be replaced with compound systems consisting of proposer and verifier modules that can be trained and improved separately
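
My addition: a minimal sketch of the verifier-weighted voting idea from point 1 (my illustration, not the paper's code; `generate` and `verifier_score` are placeholders, and in practice candidate answers would be canonicalized, e.g. reduced to a final numeric answer, before voting):

```python
from collections import defaultdict

def generate(question: str) -> str:
    """Stand-in for sampling one candidate solution from the base model."""
    raise NotImplementedError

def verifier_score(question: str, answer: str) -> float:
    """Stand-in for a (process) reward model scoring a candidate solution."""
    raise NotImplementedError

def weighted_vote(question: str, n_samples: int = 64) -> str:
    weights: dict[str, float] = defaultdict(float)
    for _ in range(n_samples):
        answer = generate(question)
        weights[answer] += verifier_score(question, answer)
    # Pick the answer with the largest total verifier weight.
    return max(weights, key=weights.get)
```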


Meta Knowledge for Retrieval Augmented Large Language Models

https://arxiv.org/pdf/2408.09017

This paper proposes a technique to enhance Retrieval-Augmented Generation using metadata and synthetic Q&A pairs. Each document is enriched with metadata generated by a large language model (LLM) and then clustered based on this metadata. For each cluster, a Meta Knowledge Summary is created, encapsulating key concepts across the documents in the cluster. Instead of indexing the documents directly, synthetic Q&A pairs are generated for each document and indexed. During inference, the LLM uses these metadata summaries to augment the initial user query, generating additional, more focused sub-questions. This approach allows for more effective retrieval, potentially synthesizing information from multiple documents. Experimental results demonstrate that both the synthetic Q&A and metadata summaries significantly improve retrieval performance.
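
For illustration, a rough sketch of the query-augmentation step at inference time (my own simplification; `llm` and the prompt wording are placeholders):

```python
def llm(prompt: str) -> str:
    """Stand-in for the LLM used for metadata, synthetic Q&A and query expansion."""
    raise NotImplementedError

def augment_query(user_query: str, meta_knowledge_summary: str) -> list[str]:
    """Expand the user query into focused sub-questions using the cluster's
    Meta Knowledge Summary (prompt wording is mine, not the paper's)."""
    response = llm(
        f"Background knowledge about this document cluster:\n{meta_knowledge_summary}\n\n"
        f"User question: {user_query}\n"
        "Rewrite the question as a short list of focused sub-questions, one per line."
    )
    return [q.strip() for q in response.splitlines() if q.strip()]

# Retrieval then runs each sub-question against the index of synthetic Q&A
# pairs (rather than the raw documents), and the hits' source documents are
# used as context for the final answer.
```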


Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost

https://arxiv.org/pdf/2407.19825v1

Long reasoning chains, such as Chain-of-Thought, lead to higher inference time and cost. This paper proposes limiting the length of an LLM's reasoning output by appending "and limit the answer length to N words" (N = 15, 30, 40, etc.) to "Let's think a bit step by step". Large language models (LLMs) such as LLaMA2-70b show a significant improvement in both accuracy (!?) and inference time when constrained to a shorter output length. In contrast, smaller and medium-sized models like Falcon-7b and Vicuna-13b struggle to handle these length constraints effectively: they either fail to respect the constraints or suffer a drop in accuracy when forced to produce shorter outputs.
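
For illustration, the intervention amounts to a one-line prompt change; a minimal sketch (the exact trigger wording should follow the paper, and `n_words` is the length budget N):

```python
def concise_cot_prompt(question: str, n_words: int = 30) -> str:
    # Zero-shot CoT trigger plus an explicit output-length budget.
    return (
        f"{question}\n"
        f"Let's think step by step and limit the answer length to {n_words} words."
    )
```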


StructuredRAG: JSON Response Formatting with Large Language Models

https://arxiv.org/pdf/2408.11061

Proposes a benchmark to measure LLMs' ability to follow formatting instructions. Compares Gemini 1.5 Pro with a 4-bit quantized Llama 3 8B-instruct. Gemini achieves 100% accuracy; Llama requires Optimization by PROmpting (Yang et al., 2024) to reach the same performance.


Enhancing Robustness in LLMs: Prompting for Mitigating the Impact of Irrelevant Information

https://arxiv.org/pdf/2408.10615

This paper notes that while LLMs perform well on reasoning benchmarks, they often struggle in practice due to problem descriptions containing irrelevant information. To address this, the authors propose a mathematical reasoning dataset designed to measure the impact of irrelevant information. Additionally, they introduce a prompting technique that instructs the LLM to identify and exclude irrelevant statements from the problem description, thereby improving its reasoning performance.
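
For illustration, a minimal two-step sketch of the idea (prompt wording is mine, not the paper's):

```python
def llm(prompt: str) -> str:
    """Stand-in for the underlying chat model."""
    raise NotImplementedError

def solve_with_filtering(problem: str) -> str:
    # Step 1: ask the model to strip statements that do not affect the answer.
    cleaned = llm(
        "Rewrite the following problem, removing any statements that are "
        f"irrelevant to answering it:\n{problem}"
    )
    # Step 2: solve the cleaned problem.
    return llm(f"{cleaned}\nLet's think step by step.")
```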


Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability

https://arxiv.org/pdf/2408.07852v1

This paper investigates the relationship between LLM size and hallucination rate by training models with 3.15M to 1.61B parameters on knowledge graph data in the form of triplets. The authors find that while larger LLMs tend to hallucinate less, achieving a low hallucination rate requires much larger models trained for many epochs (20+). However, as models become larger and better trained, detecting their remaining hallucinations becomes increasingly difficult. The authors also acknowledge that training on structured triplets may not fully replicate the types of hallucinations seen in models trained on more typical language data.


The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models

https://arxiv.org/pdf/2404.03189

This paper investigates methods for assessing the trustworthiness of LLMs in explaining their reasoning for classification tasks. Prior work introduced the Counterfactual Test (CT), which involves inserting text into an input query and observing if it appears in the subsequent explanation. However, the CT has limitations: it doesn't evaluate whether more impactful features are more likely to be mentioned than less impactful ones, and it treats impactfulness as binary, ignoring the magnitude of the impact. To address these issues, the paper proposes the Correlational Counterfactual Test (CCT), an extension of the CT that considers the model’s predicted distributions before and after text insertion to provide a more nuanced measure of faithfulness.


Enhancing LLM’s Cognition via Structurization

https://arxiv.org/pdf/2407.16434v1

The paper proposes preprocessing a prompt before sending it to an LLM. More precisely, inspired by how the brain stores and structures information, it uses an LLM to deconstruct the prompt into bullet points and summaries. This LLM is trained by distilling Qwen-Max (200B parameters). The approach improves the performance of small LLMs on question answering, of both small and large LLMs on hallucination detection, and of RAG retrievers. It outperforms aspect- and summary-based augmentation techniques.
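
For illustration, a minimal sketch of structurization as a preprocessing step (prompt wording and function names are mine, not the paper's):

```python
def llm(prompt: str) -> str:
    """Stand-in for the structurization model (distilled from Qwen-Max in the paper)."""
    raise NotImplementedError

def structurize(context: str) -> str:
    """Rewrite a long context into a structured form (summary plus grouped
    bullet points) before passing it to the downstream LLM."""
    return llm(
        "Reorganize the following text into a short summary followed by "
        f"bullet points grouped by topic:\n{context}"
    )

def answer(question: str, context: str) -> str:
    return llm(f"Context:\n{structurize(context)}\n\nQuestion: {question}")
```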


Beyond My Bandwidth


Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models

https://arxiv.org/pdf/2408.10189v1

Summary by Last Week in AI :

Transformers can be distilled into subquadratic state space models (SSMs) using a method called MOHAWK, allowing SSMs to benefit from the computational resources invested in training Transformer-based architectures.


TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

https://arxiv.org/pdf/2408.09174v1

Summary by Last Week in AI :

Advancements in Large Language Models have improved the processing of tabular data, leading to the creation of a comprehensive benchmark called TableBench to address the challenges of applying LLMs in industrial scenarios.


Training-free Graph Neural Networks and the Power of Labels as Features

https://arxiv.org/pdf/2404.19288v2

Summary by Last Week in AI :

Training-free Graph Neural Networks can leverage the power of labels as features, eliminating the need for extensive training.


FocusLLM: Scaling LLM’s Context by Parallel Decoding

https://arxiv.org/pdf/2408.11745

Summary by Turing Post :

introduces a framework extending context length using parallel decoding, handling sequences up to 400K tokens efficiently with improved accuracy.


The Vizier Gaussian Process Bandit Algorithm

https://arxiv.org/pdf/2408.11527

Summary by Turing Post :

enhances Google Vizier's Bayesian optimization with a scalable Gaussian Process Bandit Algorithm, optimizing complex, high-dimensional tasks.


MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

https://arxiv.org/pdf/2408.11049

Summary by Turing Post :

proposes a method using speculative decoding to improve latency and throughput in long-context LLMs, demonstrating significant speedup without compromising accuracy.


STRATEGIST: Learning Strategic Skills by LLMs via Bi-Level Tree Search

https://arxiv.org/pdf/2408.10635

Summary by Turing Post :

introduces a method for developing strategic skills in LLMs using self-play and bi-level tree search, outperforming traditional reinforcement learning.


D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning

https://arxiv.org/pdf/2408.08441

Summary by Turing Post :

proposes the D5RL benchmark to evaluate offline deep reinforcement learning using diverse, realistic datasets for robotic tasks, focusing on task variability and policy robustness.


Jamba-1.5: Hybrid Transformer-Mamba Models at Scale

https://arxiv.org/pdf/2408.12570

Summary by Turing Post :

introduces Jamba-1.5 models with a hybrid architecture and a novel quantization technique, achieving high performance in long-context and standard benchmarks.


Automating Thought of Search: A Journey Towards Soundness and Completeness

https://arxiv.org/pdf/2408.11326

Summary by Turing Post :

introduces AutoToS, an extension that automates the creation of search components, ensuring accuracy and completeness without human feedback.


Scaling Law with Learning Rate Annealing

https://arxiv.org/pdf/2408.11029

Summary by Davis Blalock :

So you know how we have power law scaling curves to let you estimate the final loss for an LLM training run given a bunch of other, smaller runs? What would be even better is if we had a formula that also predicted loss within a training run, taking into account the learning rate schedule. Besides letting us save a lot of compute by predicting the outcomes of training runs early, this would also shed light on what design choices actually matter in a learning rate schedule. We now have such a formula…

So how well does this formula work? It’s not perfect but it’s pretty good, even on funky learning rate schedules… In principle, this means that you could fit your parameter and dataset size scaling coefficients based on just one or two runs, using each run to collect many samples instead of just the final loss… this paper—assuming it replicates—looks like a big chunk of progress on this front.
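
My addition: from my reading of the paper, the fitted curve has roughly the form below, where eta_i is the learning rate at step i, S1 is the cumulative learning-rate sum ("forward area"), S2 is a term that accumulates as the learning rate is annealed down from its peak, and L0, A, C, alpha are fitted constants. Treat the exact parameterization of S2 as something to verify against the paper.

```latex
L(s) \;\approx\; L_0 + A \, S_1(s)^{-\alpha} - C \, S_2(s),
\qquad S_1(s) = \sum_{i \le s} \eta_i
```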


RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation

https://arxiv.org/pdf/2408.11381

Summary by Top LLM Papers of the Week :

RAGLab is a modular, research-oriented open-source library that includes the implementation of 6 existing RAG algorithms. It provides a comprehensive ecosystem for investigating RAG algorithms, addressing the constraints in RAG development.


Graph Retrieval-Augmented Generation: A Survey

https://arxiv.org/pdf/2408.08921

Summary by Top LLM Papers of the Week :

Retrieval-Augmented Generation (RAG) addresses LLM challenges, but struggles to handle complex entity relationships in databases. GraphRAG addresses this by leveraging structural information for more precise retrieval and context-aware responses. This paper presents the first comprehensive overview of GraphRAG methodologies and also explores applications, evaluation methods, and future research directions.


CommunityKG-RAG: Leveraging Community Structures in Knowledge Graphs for Advanced Retrieval-Augmented Generation in Fact-Checking

https://arxiv.org/pdf/2408.08535

Summary by Top LLM Papers of the Week :

This paper introduces CommunityKG-RAG which integrates community structures within Knowledge Graphs with Retrieval-Augmented Generation systems to enhance fact-checking. CommunityKG-RAG can adapt to new domains and queries without additional training which makes it highly versatile and applicable across various contexts.


W-RAG: Weakly Supervised Dense Retrieval in RAG for Open-domain Question Answering

https://arxiv.org/pdf/2408.08444

Summary by Top LLM Papers of the Week :

Training of dense retrieval in RAG systems is challenging due to the scarcity of ground-truth evidence. This paper introduces W-RAG, which utilizes LLMs' ranking capabilities to create weakly labeled data for training dense retrievers. W-RAG enhances both retrieval and OpenQA performance compared to baseline models.


LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs

https://arxiv.org/pdf/2408.07055v1

Summary by Gradient Ascent :

LongWriter addresses the challenge of generating ultra-long outputs (10,000+ words) LLMs. The "AgentWrite" pipeline breaks down extensive generation tasks into manageable subtasks, enabling existing LLMs to produce coherent, extended outputs. The authors introduce the LongWriter-6k dataset to fine-tune models and create LongBench-Write for evaluation. The 9B parameter model achieves SOTA results, demonstrating that current LLMs can generate much longer outputs with appropriate training data.


Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic

https://arxiv.org/pdf/2404.07177

Summary by The Batch :

When computational resources are limited relative to the amount of data available, some AI developers try to select the highest-quality examples and train on them for multiple iterations. However, the utility of examples declines a little bit every time they're used. As computational resources rise, it's better to introduce new examples even if they're of slightly lower quality.


Graph-enhanced Large Language Models in Asynchronous Plan Reasoning

https://arxiv.org/pdf/2402.02805

Summary by Last Week in AI :

Large language models (LLMs) show promise in planning tasks when prompted with graph representations, but still struggle with complex scenarios and out-of-distribution examples, highlighting their limitations in reasoning capabilities.


To Code, or Not To Code? Exploring Impact of Code in Pre-training

https://arxiv.org/pdf/2408.10914

Summary by AlphaSignal :

The study investigates the impact of including code in pre-training data for LLMs, even when not specifically designed for code tasks… Researchers conducted extensive ablations on models ranging from 470M to 2.8B parameters. They varied code proportions, quality, and insertion points in pre-training. Evaluations covered natural language reasoning, world knowledge, code benchmarks, and LLM-as-a-judge win-rates.

Including code improved non-code task performance significantly. The best variant showed relative increases of 8.2% in NL reasoning, 4.2% in world knowledge, 6.6% in generative win-rates, and a 12x boost in code performance. Cooldown with code further improved results by 3.6% in NL reasoning, 10.1% in world knowledge, and 20% in code performance. High-quality synthetic code data, even in small proportions, had a strong impact, improving NL reasoning by 9% and code benchmarks by 44.9%.


The AI Risk Repository: A Comprehensive Meta-Review, Database, and Taxonomy of Risks From Artificial Intelligence

https://cdn.prod.website-files.com/669550d38372f33552d2516e/66bc918b580467717e194940_The%20AI%20Risk%20Repository_13_8_2024.pdf

Summary by Last Week in AI :

MIT researchers release a comprehensive AI risk repository to guide policymakers and stakeholders in understanding and addressing the diverse and fragmented landscape of AI risks.


LLM Pruning and Distillation in Practice: The Minitron Approach

https://arxiv.org/pdf/2408.11796

Summary by LLM Watch :

The researchers explore two pruning strategies to compress the Llama 3.1 8B and Mistral NeMo 12B models. Depth pruning reduces the number of layers in the model, while joint hidden/attention/MLP (width) pruning reduces the dimensionality of the hidden states, attention mechanisms, and MLPs. The pruned models are then fine-tuned using distillation, where a larger "teacher" model transfers its knowledge to a smaller "student" model. The authors also found that slightly fine-tuning the teacher models on the distillation dataset, even without access to the original training data, improves the overall performance.
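
My addition: the distillation step is standard logit matching; a minimal sketch of the loss (my illustration of the general recipe, not NVIDIA's code):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student next-token distributions.

    Both tensors have shape [batch, seq_len, vocab]. The pruned (student)
    model is fine-tuned to match the lightly fine-tuned teacher's outputs.
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)
```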


Automated Design of Agentic Systems

https://arxiv.org/pdf/2408.08435

Summary by LLM Watch :

The researchers propose a promising approach within ADAS where agents are defined in code, and new agents are automatically discovered by a meta agent that programs increasingly better agents. This approach leverages the Turing Completeness of programming languages, enabling the learning of any possible agentic system, including novel prompts, tool use, control flows, and their combinations. The authors present a simple yet effective algorithm called Meta Agent Search, where a meta agent iteratively programs interesting new agents based on an ever-growing archive of previous discoveries.
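
My addition: a rough sketch of the Meta Agent Search loop as I understand it (`meta_llm`, `evaluate_agent`, and the prompt wording are placeholders, not the paper's implementation):

```python
def meta_llm(prompt: str) -> str:
    """Stand-in for the meta agent (an LLM that writes agent code)."""
    raise NotImplementedError

def evaluate_agent(agent_code: str) -> float:
    """Stand-in: run the candidate agent on validation tasks and return a score."""
    raise NotImplementedError

def meta_agent_search(n_iterations: int = 25) -> list[dict]:
    archive: list[dict] = []  # ever-growing record of discovered agents
    for _ in range(n_iterations):
        # The meta agent reads the archive and proposes a new, hopefully
        # better agent as executable code.
        agent_code = meta_llm(
            "Here are previously discovered agents and their scores:\n"
            f"{archive}\n"
            "Write Python code for a new, interestingly different agent "
            "that is likely to score higher."
        )
        archive.append({"code": agent_code, "score": evaluate_agent(agent_code)})
    return archive
```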


Your Context Is Not an Array: Unveiling Random Access Limitations in Transformers

https://arxiv.org/pdf/2408.05506v1

Summary by The Sequence of AI Knowledge :

Qualcomm Research published a paper that explores the limitations of transformers. The paper suggests that some of the generalization challenges of transformers are related to their inability to perform random memory access within their context window.


Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

https://arxiv.org/pdf/2408.06195v1

Summary by The Sequence of AI Knowledge :

Microsoft Research published a paper introducing rStar, a self-play mutual reasoning approach that seems to improve reasoning capabilities in small language models. rStar uses a generation-discrimination process to decouple the different steps in the reasoning process.


Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models

https://arxiv.org/pdf/2408.06663v2

Summary by The Sequence of AI Knowledge :

Researchers from Johns Hopkins University published a paper exploring the relationship between pretraining and fine-tuning in LLMs. The paper explores the diminishing returns of fine-tuning after a certain scale.


ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

https://arxiv.org/pdf/2408.04682

Summary by LLM Watch :

ToolSandbox is a comprehensive evaluation framework that addresses the limitations of previous approaches. It includes stateful tool execution, allowing LLMs to maintain and manipulate state across multiple interactions. ToolSandbox also incorporates implicit state dependencies between tools, enabling the evaluation of LLMs' ability to reason about the relationships between different tools. Additionally, ToolSandbox features a built-in user simulator that supports on-policy conversational evaluation, allowing for a more realistic assessment of LLMs' performance in interactive scenarios. Finally, ToolSandbox introduces a dynamic evaluation strategy that assesses LLMs' performance at intermediate and final milestones over arbitrary trajectories, providing a more nuanced understanding of their capabilities.


Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning

https://arxiv.org/pdf/2408.00690

Summary by Turing Post :

enhances text embeddings using contrastive fine-tuning for better performance in semantic similarity tasks.


ThinK: Thinner Key Cache by Query-Driven Pruning

https://arxiv.org/pdf/2407.21018

Summary by Turing Post :

optimizes memory usage in LLMs during inference by pruning less important cache channels based on query criteria.


Synthesizing Text-to-SQL Data from Weak and Strong LLMs

https://arxiv.org/pdf/2408.03256

Summary by Turing Post :

develops a model that bridges the gap between open-source and closed-source LLMs by using a combination of synthetic data from strong and weak models to improve text-to-SQL tasks.


Better Alignment with Instruction Back-and-Forth Translation

https://arxiv.org/pdf/2408.04614

Summary by Turing Post :

proposes a method to improve LLMs by using instruction backtranslation and response rewriting, enhancing alignment and response diversity.


StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation

https://arxiv.org/pdf/2408.03281

Summary by Turing Post :

introduces a multi-layered framework to assess LLMs across various cognitive levels, reducing biases and improving the consistency of evaluations.


LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection

https://arxiv.org/pdf/2408.04284

Summary by Turing Post :

presents a system that classifies text into multiple categories to detect the extent of LLM involvement, enhancing the detection of machine-generated content.


CoverBench: A Challenging Benchmark for Complex Claim Verification

https://arxiv.org/pdf/2408.03325

Summary by Turing Post :

creates a benchmark to evaluate LLM accuracy in verifying complex claims, revealing significant challenges in this domain.


Body Transformer: Leveraging Robot Embodiment for Policy Learning

https://arxiv.org/pdf/2408.06316v1

Summary by Last Week in AI :

Leveraging the robot embodiment, the Body Transformer (BoT) architecture outperforms vanilla transformers and multilayer perceptrons in robot learning tasks.


(Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts

https://arxiv.org/pdf/2405.11804

Summary by The Batch :

Literary works are challenging to translate. Their relative length, cultural nuances, idiomatic expressions, and expression of an author’s individual style call for skills beyond swapping words in one language for semantically equivalent words in another. Researchers built a machine translation system to address these issues… Prompting a large language model (LLM) to translate literature often results in subpar quality. Employing multiple LLMs to mimic human roles involved in translation breaks down this complex problem into more tractable parts. For example, separate LLMs (or instances of a single LLM) can act as agents that take on roles such as translator and localization specialist, and they can check and revise each other’s work. An agentic workflow raises unsolved problems such as how to evaluate individual agents’ performance and how to measure translation quality. This work offers a preliminary exploration.


CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases

https://arxiv.org/pdf/2408.03910

Summary by The Sequence of AI Knowledge :

Alibaba Research published a paper introducing CodexGraph, a system that integrates LLMs with a graph database based on code repositories. The graph model allows LLMs to navigate more sophisticated code structures and tackle more complex tasks.


GENEVA: GENErating and Visualizing branching narratives using LLMs

https://arxiv.org/pdf/2311.09213

Summary by The Sequence of AI Knowledge :

Microsoft Research published a paper introducing GENEVA, a tool that can generate rich narrative graphs based on a high-level description and a set of constraints. GENEVA explores different narrative paths through a visual graph interface.


Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models

https://arxiv.org/pdf/2408.02442v1

Summary by LLM Watch :

Large Language Models (LLMs) have demonstrated remarkable capabilities in generating human-like text, but their application in real-world scenarios often requires the output to be in structured formats like JSON or XML. ... However, the impact of imposing such structural constraints on the reasoning abilities and domain knowledge comprehension of LLMs has not been thoroughly investigated. This study addresses the problem by conducting a comprehensive evaluation of LLMs' performance when generating structured output compared to free-form responses across various common tasks. By systematically varying the strictness of format constraints, the researchers aim to quantify the extent to which these restrictions affect the models' reasoning capabilities. The findings suggest that enforcing structured output formats can lead to a significant decline in LLMs' reasoning abilities, with stricter constraints resulting in greater performance degradation…


Think-on-Graph 2.0: Deep and Interpretable Large Language Model Reasoning with Knowledge Graph-guided Retrieval

https://arxiv.org/pdf/2407.10805v3

Summary by LLM Watch :

Think-on-Graph 2.0 (ToG2.0) enhances the RAG paradigm by leveraging knowledge graphs (KGs) to guide the retrieval process. Instead of simply retrieving relevant documents, ToG2.0 aligns the input questions with the KG and uses it as a navigational tool. This approach allows the model to make deep and long-range associations, ensuring logical consistency and optimizing the scope of retrieval. By incorporating semantic similarity guided by precise directives, ToG2.0 also improves the factual consistency of the generated responses.


Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation

https://arxiv.org/pdf/2408.04187v1

Summary by LLM Watch :

MedGraphRAG introduces a novel graph-based Retrieval-Augmented Generation (RAG) framework tailored for the medical domain. It starts by employing a hybrid static-semantic approach for document chunking, which significantly improves context capture compared to traditional methods. Extracted entities are then used to construct a three-tier hierarchical graph structure, connecting entities to foundational medical knowledge from papers and dictionaries. These entities are further linked to form meta-graphs, which are merged based on semantic similarities to create a comprehensive global graph. This structured representation enables precise information retrieval and evidence-based response generation. The retrieval process utilizes a U-retrieve method to balance global awareness and indexing efficiency of the LLM.


Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

https://arxiv.org/pdf/2408.03314

Summary by AlphaSignal :

The paper investigates if optimizing test-time compute can improve performance more efficiently than increasing model size… The authors propose "compute-optimal" scaling of test-time computation. They explore two approaches: (1) searching against process-based verifier reward models, and (2) updating the model's response distribution adaptively. They develop strategies that allocate test-time compute based on prompt difficulty. On the MATH benchmark using PaLM 2-S*, compute-optimal scaling outperforms best-of-N baselines by 4x in efficiency. In FLOPs-matched comparisons, test-time compute outperforms a 14x larger model on easy/medium questions. However, for very difficult questions, scaling model parameters remains more effective.


HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction

https://arxiv.org/pdf/2408.04948

Summary by AlphaSignal :

HybridRAG combines VectorRAG and GraphRAG, leveraging both vector databases and knowledge graphs. It uses GPT-3.5-turbo for generation, text-embedding-ada-002 for embeddings, and a custom knowledge graph construction method. HybridRAG concatenates contexts from both approaches to generate responses. HybridRAG demonstrates superior performance in extracting information from financial documents, balancing high-quality answers with comprehensive context retrieval. Outperforms VectorRAG and GraphRAG.

Faithfulness: 0.96 (tied with GraphRAG, vs 0.94 for VectorRAG)

Answer Relevancy: 0.96 (vs 0.91 for VectorRAG, 0.89 for GraphRAG)

Context Recall: 1.0 (tied with VectorRAG, vs 0.85 for GraphRAG)
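
My addition: the combination itself is simple; a minimal sketch of the context concatenation (function names are placeholders, not the paper's code):

```python
def vector_retrieve(query: str, k: int = 5) -> list[str]:
    """Stand-in for similarity search over a vector index (VectorRAG)."""
    raise NotImplementedError

def graph_retrieve(query: str) -> list[str]:
    """Stand-in for retrieval over the knowledge graph (GraphRAG)."""
    raise NotImplementedError

def llm(prompt: str) -> str:
    """Stand-in for the generator (GPT-3.5-turbo in the paper)."""
    raise NotImplementedError

def hybrid_rag_answer(question: str) -> str:
    # Concatenate the contexts from both retrievers and generate once.
    context = "\n\n".join(vector_retrieve(question) + graph_retrieve(question))
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```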


Converging Paradigms: The Synergy of Symbolic and Connectionist AI in LLM-Empowered Autonomous Agents

https://arxiv.org/pdf/2407.08516v3


Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption

https://arxiv.org/pdf/2407.18003v2


rLLM: Relational Table Learning with LLMs

https://arxiv.org/pdf/2407.20157v1


ByteCheckpoint: A Unified Checkpointing System for LLM Development

https://arxiv.org/pdf/2407.20143v1


MindSearch 思·索: Mimicking Human Minds Elicits Deep AI Searcher

https://arxiv.org/pdf/2407.20183v1


Transformers are Universal In-context Learners

https://arxiv.org/pdf/2408.01367v1


CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs

https://arxiv.org/pdf/2408.02193v1


MoExtend: Tuning New Experts for Modality and Task Extension

https://arxiv.org/pdf/2408.03511v1


Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs

https://arxiv.org/pdf/2408.00114v2


UNLEARN Efficient Removal of Knowledge in Large Language Models

https://arxiv.org/pdf/2408.04140v1


Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

https://arxiv.org/pdf/2407.16607


The CLRS-Text Algorithmic Reasoning Language Benchmark

https://arxiv.org/pdf/2406.04229


Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

https://arxiv.org/pdf/2407.15549


Retrieval-augmented code completion for local projects using large language models

https://arxiv.org/pdf/2408.05026


Understanding the Performance and Estimating the Cost of LLM Fine-Tuning

https://arxiv.org/pdf/2408.04693


An Empirical Study on Challenges for LLM Developers

https://arxiv.org/pdf/2408.05002


A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?

https://arxiv.org/pdf/2408.05109


A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning

https://arxiv.org/pdf/2408.05141

CRAG KDD Cup 2024. Winner announcement August 26, 2024.


Flexora: Flexible Low Rank Adaptation for Large Language Models

https://arxiv.org/pdf/2408.10774v1


Critique-out-Loud Reward Models

https://arxiv.org/pdf/2408.11791v1


Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations

https://arxiv.org/pdf/2408.10920


Generating novel experimental hypotheses from language models: A case study on cross-dative generalization

https://arxiv.org/pdf/2408.05086


A compilation of papers about non-English languages, by Turing Post

JaColBERTv2.5: Optimizing Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources

https://arxiv.org/pdf/2407.20750

Summary by Turing Post :

improves Japanese language retrieval by optimizing multi-vector retrievers with constrained computational resources.

A Large Encoder-Decoder Family of Foundation Models For Chemical Language

https://arxiv.org/pdf/2407.20267

Summary by Turing Post :

introduces a family of transformer-based models for chemical tasks, achieving state-of-the-art performance in molecular property prediction and classification.

SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages

https://arxiv.org/pdf/2407.19672

Summary by Turing Post :

introduces a multilingual model tailored for Southeast Asian languages, improving performance and efficiency in various tasks.

ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation

https://arxiv.org/pdf/2407.19835

Summary by Turing Post :

provides a dataset for improving Classical Arabic to English translation, addressing gaps in existing resources.

Sentiment Analysis of Lithuanian Online Reviews Using Large Language Models

https://arxiv.org/pdf/2407.19914

Summary by Turing Post :

explores sentiment analysis for Lithuanian using fine-tuned transformer models, highlighting challenges in less-resourced languages.

Meltemi: The first open Large Language Model for Greek

https://arxiv.org/pdf/2407.20743

Summary by Turing Post :

develops Meltemi 7B, the first open-source LLM for Greek, showing strong performance on Greek language benchmarks.

Adapting Safe-for-Work Classifier for Malaysian Language Text: Enhancing Alignment in LLM-Ops Framework

https://arxiv.org/pdf/2407.20729

Summary by Turing Post :

creates a Safe-for-Work classifier for Malay text to detect unsafe content, improving safety in AI applications for Malaysia.

Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings

https://arxiv.org/pdf/2407.20581

Summary by Turing Post :

fine-tunes a Hebrew language model on parliamentary texts for enhanced analysis of political discourse.

Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses

https://arxiv.org/pdf/2408.00584

Summary by Turing Post :

assesses LLMs' performance on Italian rebuses, revealing limitations in sequential reasoning despite fine-tuning.


Beyond Words: Other Modalities

RayGauss: Volumetric Gaussian-Based Ray Casting for Photorealistic Novel View Synthesis

https://arxiv.org/pdf/2408.03356

Summary by Turing Post :

proposes a new method for photorealistic rendering using Gaussian functions, achieving superior results in novel view synthesis.

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

https://arxiv.org/pdf/2408.01800v1

ControlNeXt: Powerful and Efficient Control for Image and Video Generation

https://arxiv.org/pdf/2408.06070v1

Summary by Last Week in AI :

A new method called ControlNeXt is proposed for controllable image and video generation, offering powerful and efficient control with reduced computational resources and improved training stability.

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

https://arxiv.org/pdf/2408.06327v1

Summary by Last Week in AI :

Introducing VisualAgentBench, a benchmark designed to train and evaluate Large Multimodal Models as visual foundation agents across diverse scenarios, showcasing their considerable yet developing capabilities.

VITA: Towards Open-Source Interactive Omni Multimodal LLM

https://arxiv.org/pdf/2408.05211v1

Summary by Last Week in AI :

VITA is the first open-source Multimodal Large Language Model (MLLM) with advanced capabilities in processing and analyzing Video, Image, Text, and Audio modalities, as well as providing a strong multimodal interactive experience.

UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

https://arxiv.org/pdf/2408.04810v1

Summary by Last Week in AI :

Visual reasoning in AI requires rethinking beyond scaling, as scaling training data or model size may not significantly improve reasoning or relations, and more precise interventions such as data quality or tailored-learning objectives offer more promise.

BRAT: Bonus oRthogonAl Token for Architecture Agnostic Textual Inversion

https://arxiv.org/pdf/2408.04785v1

Summary by Last Week in AI :

BRAT introduces a new method for textual inversion using bonus tokens and a vision transformer, improving adherence to source images and prompts without relying on the UNet.

Achieving Human Level Competitive Robot Table Tennis

https://arxiv.org/pdf/2408.03906

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

https://www.arxiv.org/pdf/2408.02718

ProCreate, Don't Reproduce! Propulsive Energy Diffusion for Creative Generation

https://www.arxiv.org/pdf/2408.02226

OmniParser for Pure Vision Based GUI Agent

https://arxiv.org/pdf/2408.00203

Summary by Turing Post :

develops a vision-based method for parsing UI screenshots into structured elements, improving model performance across various applications.

Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models

https://arxiv.org/pdf/2407.19474


MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

https://arxiv.org/pdf/2406.11271

Summary by LLM Watch :

MINT-1T addresses the need for large-scale multimodal interleaved datasets by providing an extensive and diverse collection of data. With one trillion text tokens and 3.4 billion images, MINT-1T offers a significant scale-up compared to existing open-source datasets, being 10 times larger. Moreover, MINT-1T incorporates previously untapped data sources, such as PDFs and ArXiv papers, further enhancing its diversity. By curating and releasing this dataset, the researchers aim to benefit the community and facilitate the development of LMMs.

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

https://arxiv.org/pdf/2408.04840

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

https://arxiv.org/pdf/2408.11039

Summary by AlphaSignal :

Existing approaches struggle to unify discrete (text) and continuous (image) modalities in a single model. Current methods either use separate architectures or quantize images, sacrificing information. Transfusion trains a single transformer on both next-token prediction (for text) and diffusion (for images). It uses a VAE for image encoding, linear or U-Net layers for patch processing, and a novel attention mechanism allowing bidirectional attention within images. Transfusion outperforms Chameleon (discrete approach) using 34× less compute for image generation (FID). It matches DALL-E 2 on GenEval (0.63 vs 0.52) and Llama 1 on text tasks (66.1 vs 66.1 accuracy). The 7B parameter model, trained on 2T tokens, generates high-quality images and text.
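
My addition: the key design point is that a single transformer is trained on both objectives at once; schematically (my paraphrase of the setup described above, with lambda a weighting hyperparameter):

```latex
\mathcal{L}_{\text{Transfusion}}
\;=\; \mathcal{L}_{\text{LM}}(\text{text tokens})
\;+\; \lambda \cdot \mathcal{L}_{\text{diffusion}}(\text{image patches})
```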

TurboEdit: Instant text-based image editing

https://arxiv.org/pdf/2408.08332v1

Summary by Last Week in AI :

A new text-based image editing tool, TurboEdit, allows for precise and disentangled image editing using an encoder-based iterative inversion technique, resulting in fast and realistic text-guided image edits.

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

https://arxiv.org/pdf/2408.08872v1

Summary by Last Week in AI :

Introducing xGen-MM (BLIP-3), a framework for developing Large Multimodal Models (LMMs) with meticulously curated datasets, model architectures, and a suite of LMMs, all open-sourced to advance research in the field.

Imagen 3

https://arxiv.org/pdf/2408.07009v1

Summary by Last Week in AI :

Introducing Imagen 3, a latent diffusion model that generates high-quality images from text prompts, preferred over other state-of-the-art models, with a focus on responsibility and minimizing potential harm.

SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation

https://arxiv.org/pdf/2408.08870v1

Summary by Last Week in AI :

SAM2-UNet uses Segment Anything 2 (SAM2) as a strong encoder for natural and medical image segmentation.

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

https://arxiv.org/pdf/2408.10188

Summary by Last Week in AI :

Scaling Long-Context Visual Language Models for Long Videos through the introduction of LongVILA, a full-stack solution for long-context vision-language models, including system, model training, and dataset development.

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

https://arxiv.org/pdf/2408.12528

Sapiens: Foundation for Human Vision Models

https://arxiv.org/pdf/2408.12569

Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation

https://arxiv.org/pdf/2408.11812

Summary by Turing Post :

introduces CrossFormer, a transformer model capable of controlling diverse robotic platforms, demonstrating adaptability and performance across various real-world tasks.

Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering

https://arxiv.org/pdf/2408.09702

Summary by Turing Post :

presents DiPIR, a method for realistic object insertion into images using large diffusion models to guide inverse rendering, enhancing applications in virtual production.

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

https://arxiv.org/pdf/2408.11039

Summary by Turing Post :

introduces Transfusion, a multi-modal model combining language modeling and diffusion techniques, achieving high-quality text and image generation.

Towards flexible perception with visual memory

https://arxiv.org/pdf/2408.08172

Language Model Can Listen While Speaking

https://arxiv.org/pdf/2408.02622v1

Summary by Last Week in AI :

A language model has been developed to enable real-time interaction in speech-based conversational AI, allowing for interruption and turn-taking in spoken scenarios.

My addition: based on a listening module, a speaking module, and a fusion technique to enable the LMM to listen while speaking.
