LLM Paper Reading Notes - August 2024

Sharing short notes (from myself and others) about LLM research papers I came across in July. These notes differ in their level of detail and precision. I hope they're still useful in piquing your curiosity and helping you breathe under the waterfall. At the current pace of AI, it takes the power of all of us to keep up.

Check my newsletter for past reading notes!

Reading Notes

Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach

https://arxiv.org/pdf/2407.16833

This study investigates how large language models handle question-answering tasks under two conditions: when they receive comprehensive context information (long-context) versus when they are given only selected chunks of the necessary information (RAG). It shows that long context significantly surpasses RAG for Gemini-1.5-Pro, GPT-4o, and GPT-3.5-Turbo. The authors propose a hybrid solution that first generates the context with RAG and asks the LLM whether it is sufficient to answer the question; if it is not, long-context is used. The approach surpasses long-context for GPT-4o and GPT-3.5-Turbo and is more cost-effective.
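
A minimal sketch of that routing logic, assuming hypothetical llm() and retrieve() placeholders rather than the paper's actual code:

from typing import List

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM completion call

def retrieve(question: str, k: int = 5) -> List[str]:
    raise NotImplementedError  # hypothetical top-k chunk retriever

def answer(question: str, full_document: str) -> str:
    # Cheap path first: RAG with a self-assessment instruction.
    context = "\n".join(retrieve(question))
    draft = llm(
        f"Context:\n{context}\n\nQuestion: {question}\n"
        "Answer only if the context is sufficient; "
        "otherwise reply exactly: UNANSWERABLE."
    )
    if "UNANSWERABLE" not in draft:
        return draft  # most queries stop here, at RAG cost
    # Fallback: pay for the full long-context prompt.
    return llm(f"Document:\n{full_document}\n\nQuestion: {question}")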

Confabulation: The Surprising Value of Large Language Model Hallucinations

https://arxiv.org/pdf/2406.04175v2

This paper draws a parallel between human confabulation and LLM hallucinations. On the one hand, it highlights research from psychiatry suggesting that everyday memory reconstruction often involves some degree of confabulation: when humans do not have access to sufficient information to formulate coherent semantic meaning, they often confabulate to ‘fill in the blanks’ with self-consistent narratives that are not necessarily factual but bear close semantic verisimilitude to reality. On the other hand, it shows, using a story detection model (fine-tuned from ELECTRA-large) and three datasets, that hallucinated content has higher narrativity. It concludes that some degree of confabulation may be necessary for LLMs to maintain cognitive coherence. Contrary to what the title suggests, however, it does not propose practical applications or benefits of LLM hallucinations in a way that distinctly separates beneficial confabulation from the risks of fabricating facts.

RATT: A Thought Structure for Coherent and Correct LLM Reasoning

https://arxiv.org/pdf/2406.02746v3

It is well known that asking an LLM to explicitly reason step by step improves problem-solving performance. Tree of Thoughts (ToT) extended this paradigm by exploring and evaluating multiple reasoning branches of a thought-tree structure. This paper proposes to extend ToT further by retrieving documents at each node of the tree and using them to guide the reasoning. Experiments on code generation, creative writing, and hallucination detection benchmarks, with GPT-3.5-Turbo, show the efficacy of this approach. However, problem-solving capabilities are only evaluated on the “Game of 24” mathematical puzzle and not on commonly used reasoning benchmarks.
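
A rough sketch of what retrieval-guided tree search could look like, with hypothetical llm() and retrieve() placeholders and illustrative prompts (not the paper's exact templates):

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM call

def retrieve(query: str, k: int = 3) -> list:
    raise NotImplementedError  # hypothetical document retriever

def expand(problem: str, thought: str, branching: int = 3) -> list:
    """Propose candidate next reasoning steps, grounded in retrieved documents."""
    facts = "\n".join(retrieve(thought or problem))
    return [llm(f"Problem: {problem}\nReasoning so far: {thought}\n"
                f"Relevant facts:\n{facts}\nNext reasoning step:")
            for _ in range(branching)]

def score(problem: str, thought: str) -> float:
    """Ask the LLM to rate a branch against the retrieved evidence."""
    facts = "\n".join(retrieve(thought))
    reply = llm(f"Facts:\n{facts}\nRate from 0 to 10 how correct and "
                f"useful this reasoning is for '{problem}':\n{thought}\n"
                "Reply with a single number.")
    return float(reply.strip().split()[0])

def ratt(problem: str, depth: int = 3, beam: int = 2) -> str:
    """Beam search over the thought tree, keeping the best-scored branches."""
    frontier = [""]
    for _ in range(depth):
        children = [t + "\n" + c for t in frontier
                    for c in expand(problem, t)]
        frontier = sorted(children, key=lambda t: score(problem, t),
                          reverse=True)[:beam]
    return frontier[0]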

Memory3: Language Modeling with Explicit Memory

https://arxiv.org/pdf/2407.01178v1

Summary by LLM Watch:

…Memory3 introduces a third form of memory in addition to the implicit knowledge stored in model parameters and the short-term working memory used during inference (context key-values). This explicit memory is designed to store factual knowledge more efficiently than model parameters. The researchers also developed techniques to make this approach feasible, including a memory sparsification mechanism to reduce storage requirements and a two-stage pretraining scheme to facilitate the formation of the explicit memory during training…The results show that a 2.4B parameter model with explicit memory can outperform much larger models and maintain higher decoding speed than retrieval-augmented generation (RAG) approaches.

A good summary. I would just add that the paper distinguishes three types of memory: implicit memory stored in the model parameters, working memory consisting of the cached key-values of the current sequence, and explicit memory consisting of external knowledge encoded from the knowledge base, similar to retrievable model parameters or sparsely activated neural circuits. Explicit memory is recalled through vector search, à la RAG.

Characterizing Prompt Compression Methods for Long Context Inference

https://arxiv.org/pdf/2407.08892

The paper provides a comprehensive overview and comparison of three prompt compression techniques: token pruning, abstractive compression (summarization), and extractive compression. In extractive compression, the prompt is divided into smaller chunks such as sentences or phrases. These chunks are then scored based on their relevance to the query or question using a fine-tuned DeBERTa model, with the most relevant chunks being selected. This method yields the best results, often even improving task accuracy at low compression rates (5-10%).
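
A small sketch of the extractive variant described above; the paper scores chunks with a fine-tuned DeBERTa model, for which scorer() below is a hypothetical stand-in:

def scorer(query: str, chunk: str) -> float:
    raise NotImplementedError  # hypothetical relevance model (DeBERTa in the paper)

def compress_extractive(prompt_chunks: list, query: str, budget: int) -> str:
    """Keep the highest-scoring chunks until the token budget is spent."""
    ranked = sorted(prompt_chunks,
                    key=lambda c: scorer(query, c), reverse=True)
    kept, used = [], 0
    for chunk in ranked:
        n_tokens = len(chunk.split())   # crude token count, for the sketch only
        if used + n_tokens > budget:
            continue
        kept.append(chunk)
        used += n_tokens
    # Restore the original order so the compressed prompt stays coherent.
    return "\n".join(c for c in prompt_chunks if c in kept)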

A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks

https://arxiv.org/pdf/2407.12994v2

This article summarizes 38 prompt engineering techniques for LLM reasoning and lists the types of problems and datasets they have been used with.

Distilling System 2 into System 1

https://arxiv.org/pdf/2407.06023

The paper proposes to distill an LLM performing multi-step or multi-call reasoning (such as Chain of Thought, Rephrase and Respond, or Branch-Solve-Merge), called System 2, into an LLM that outputs the response directly in a single forward pass without intermediate reasoning steps, called System 1. First, question-answer pairs are generated in an unsupervised manner from System 2 (using self-consistency to filter out errors). Second, System 1 is fine-tuned on these question-answer pairs (without the reasoning steps). Experiments with Llama-2-70B-chat show that distilling System 2 methods often maintains or improves performance compared to the original System 2 methods while significantly reducing inference cost. The effectiveness of distillation varies by task, with some tasks (like CoT for complex reasoning) proving challenging to distill effectively.
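
A condensed sketch of the data-generation step, with majority voting standing in for the self-consistency filter; llm() is a hypothetical placeholder and the prompt format is illustrative:

from collections import Counter

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM call

def system2_answer(question: str) -> str:
    """One System 2 call: reason step by step, keep only the final answer."""
    out = llm(f"{question}\nThink step by step, then give the final "
              "answer after the tag 'ANSWER:'.")
    return out.split("ANSWER:")[-1].strip()

def distillation_pairs(questions, samples: int = 8, min_agree: float = 0.75):
    """Keep (question, answer) pairs only when sampled answers agree
    (self-consistency). System 1 is then fine-tuned on these pairs,
    without the intermediate reasoning traces."""
    pairs = []
    for q in questions:
        votes = Counter(system2_answer(q) for _ in range(samples))
        answer, count = votes.most_common(1)[0]
        if count / samples >= min_agree:
            pairs.append((q, answer))
    return pairs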

Just read twice: closing the recall gap for recurrent language models

https://arxiv.org/pdf/2407.05483v1

Recurrent models such as RNNs and their modern variant, Mamba, process tokens sequentially and do not suffer from the quadratic complexity of Transformers. This paper observes that these recurrent models are brittle to the order of the input data. For example, suppose we ask question Q (e.g., “When did Galileo move to Florence?”) over documents D (e.g., the detailed Wikipedia article for Galileo Galilei). The model needs to remember just one fact from D if the prompt is ordered [Q, D], but needs to remember all facts when it is [D, Q]. Based on this insight, the authors show that repeating information in the prompt ([D, Q, D]) improves recurrent models and Transformer++ across all model sizes (up to 2.7B parameters) and tasks. They also propose JRT-RNN, a non-causal encoder-decoder architecture that is more robust to data ordering (still repeating the information). JRT-RNN provides 99% of Transformer quality at 360M parameters / 30B tokens and 96% at 1.3B parameters / 50B tokens on average across tasks, with 19.2× higher prefill throughput than FlashAttention-2.
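
The prompting half of the result (JRT-Prompt) amounts to a one-line change, sketched here with a hypothetical llm() placeholder and the [D, Q, D] ordering described above:

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM call

def jrt_prompt(document: str, question: str) -> str:
    """'Just read twice': with the [D, Q, D] ordering, the model sees the
    question before its second pass over the document, so a recurrent
    state knows which facts are worth keeping."""
    return llm(f"{document}\n\nQuestion: {question}\n\n"
               f"{document}\n\nAnswer:")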

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

https://arxiv.org/pdf/2407.04620

Transformers, though powerful in handling long sequences, suffer from quadratic complexity. In contrast, RNNs and their modern variant, Mamba, process tokens sequentially and memorize them in a hidden state represented as a fixed-size vector. Despite improvements over RNNs, Mamba struggles to remain competitive with Transformers beyond 8,000 tokens. This paper proposes a richer representation of the hidden state in the form of a set of weights, akin to an "inner" neural network, and approaches updating this hidden state as training that inner network. For a given sequence, the inner model is tasked with recovering the current token from a corrupted version using a reconstruction loss; the corruption involves low-rank projections whose parameters are learned during the overall model training (the outer training). The paper presents two instantiations of the inner model, TTT-Linear and TTT-MLP, and introduces techniques such as mini-batch TTT and dual-form operations to improve hardware efficiency. Evaluations show that TTT-Linear matches or exceeds the performance of both Mamba and Transformers, notably outperforming Mamba beyond an 8k context length while maintaining performance comparable to Transformers with greater efficiency.

However, the claim that this approach performs "Test-Time Training" of the inner model is debatable since the inner weights are reset after each sentence, ensuring that sentences are processed independently without accumulating knowledge across them.
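
To make the inner loop concrete, here is a toy NumPy version of the hidden-state update: a linear inner model trained online with a plain reconstruction loss. The real TTT-Linear uses learned low-rank corruptions and batched dual-form updates, so this is only a sketch of the idea:

import numpy as np

def ttt_step(W: np.ndarray, x: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """One token update: treat the hidden state W as a linear inner model
    trained online to reconstruct the token x. (The paper corrupts x with
    learned low-rank projections; this toy skips that.)"""
    grad = 2 * np.outer(W @ x - x, x)   # gradient of ||W x - x||^2 w.r.t. W
    return W - lr * grad

def ttt_forward(tokens: np.ndarray) -> np.ndarray:
    """Process one sequence. The state is reset per sequence, which is why
    calling this 'test-time training' is debatable."""
    d = tokens.shape[1]
    W = np.zeros((d, d))                # fresh hidden state
    outputs = []
    for x in tokens:
        W = ttt_step(W, x)              # 'learn' this token into the state
        outputs.append(W @ x)
    return np.stack(outputs)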

RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

https://arxiv.org/pdf/2407.02485v1

This paper demonstrates that fine-tuning an LLM for RAG can significantly enhance its question-answering capabilities. The fine-tuning leverages QA datasets to train the LLM both to estimate the relevance of document snippets to a question and to generate answers from those snippets. At inference time, after retrieving snippets, the RankRAG pipeline makes an additional call to the LLM to estimate the relevance of each snippet (reranking). The five most relevant snippets are then added to the LLM context for generating the answer. RankRAG, applied to Llama-3 8B and 70B, achieves state-of-the-art (SOTA) performance, albeit at the cost of an extra inference step, which can increase latency by up to 6×.
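
A bare-bones version of the rerank-then-generate loop, with hypothetical llm() and retrieve() placeholders and an illustrative relevance prompt (RankRAG uses a single fine-tuned model for both the ranking and generation steps):

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM call

def retrieve(question: str, k: int = 20) -> list:
    raise NotImplementedError  # hypothetical first-stage retriever

def rankrag_answer(question: str, k_keep: int = 5) -> str:
    candidates = retrieve(question)
    scored = []
    for snippet in candidates:
        # The extra inference step: the LLM judges each snippet's relevance.
        reply = llm(f"Question: {question}\nPassage: {snippet}\n"
                    "Is the passage relevant? Answer 'True' or 'False' "
                    "followed by a confidence in [0, 1], e.g. 'True 0.9'.")
        label, conf = reply.split()
        score = float(conf) if label == "True" else -float(conf)
        scored.append((score, snippet))
    top = [s for _, s in
           sorted(scored, key=lambda p: p[0], reverse=True)[:k_keep]]
    return llm("Context:\n" + "\n".join(top) + f"\n\nQuestion: {question}")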

Large Language Models Understand Layouts

https://arxiv.org/pdf/2407.05750v1

This paper demonstrates that GPT-3.5 Turbo, and to a lesser extent other LLMs, can understand text layout. When text is formatted into four quadrants (top-left, top-right, bottom-left, bottom-right) using only spaces and newlines, GPT-3.5 Turbo can accurately answer questions like “What is the name mentioned in the top-left corner?” with an F1 score of 87.77. The experiments suggest that training data containing code is crucial for developing this layout understanding capability in LLMs, which is further improved during instruction-tuning. This capability can be significantly enhanced with specifically crafted training data.
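
For illustration, a layout probe of this kind can be built with nothing but spaces and newlines (a toy construction, not the paper's generator):

def quadrant_prompt(tl: str, tr: str, bl: str, br: str,
                    width: int = 40, gap: int = 5) -> str:
    """Lay four snippets out in the corners of a character grid,
    using only spaces and newlines."""
    top = tl.ljust(width) + tr
    bottom = bl.ljust(width) + br
    return top + "\n" * gap + bottom

layout = quadrant_prompt("Alice", "Bob", "Carol", "Dave")
question = "What is the name mentioned in the top-left corner?"
# A layout-aware model should answer "Alice".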

Searching for Best Practices in Retrieval-Augmented Generation

https://arxiv.org/pdf/2407.01219

This paper reviews and evaluates many components commonly used in RAG pipelines, such as query classification, chunking, retrieval modules, re-rankers, summarizers, and more. All in all, a good review with some useful pointers.

Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation

https://arxiv.org/pdf/2405.13622v1

A very interesting method for evaluating RAG pipelines is proposed in this paper. It involves using an LLM to generate multiple-choice questions (MCQs), each with only one correct answer, from the RAG document corpus. Furthermore, the authors describe how Item Response Theory (IRT) can be applied to assess the sensitivity of the test across multiple cognitive dimensions (understanding, remembering, creating, etc.) and to iteratively improve its quality. This process is conducted without any human supervision, aside from a few regex filters to remove poor-quality questions. Some key findings from their experiments with various retrievers and LLMs: hybrid retrievers that combine BM25 and dense models offer greater robustness; the performance gain from using an appropriate retriever can surpass that of choosing a larger LLM; and poorly aligned retriever components can lead to worse accuracy than having no retrieval at all.

However, this approach does not assess RAG pipelines in the typical manner they are used, which is to generate natural language responses for users. Instead, the LLM is prompted to answer MCQs. Could a potential solution be to have the LLM generate a response and then use another LLM to answer the MCQs based on that response?
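
That fix could be prototyped in a few lines; llm() is a hypothetical placeholder and the grading prompt is illustrative:

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM call

def grade_freeform(question: str, choices: list, gold_letter: str,
                   rag_response: str) -> bool:
    """Let a second model answer the MCQ using only the pipeline's
    natural-language response as evidence."""
    options = "\n".join(f"{'ABCD'[i]}. {c}" for i, c in enumerate(choices))
    picked = llm(f"Using only this response as evidence:\n{rag_response}\n\n"
                 f"Question: {question}\n{options}\n"
                 "Reply with a single letter.").strip()[:1]
    return picked == gold_letter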

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

https://arxiv.org/pdf/2407.08223v1

In Retrieval-Augmented Generation, documents are retrieved based on a question and used as context within the LLM prompt to generate accurate and informed answers. This paper proposes a novel framework called Speculative RAG to enhance this process by utilizing a smaller, faster, and specifically fine-tuned model known as the drafter. The drafter generates multiple answer drafts along with their rationales by clustering the retrieved documents and sampling from these clusters. These drafts are then evaluated by a larger generalist LLM called the verifier, which estimates the probability of each answer-rationale pair and selects the best one. This approach leads to SOTA performance on QA datasets while reducing latency and cost. The fine-tuning dataset consists of triplets of questions, documents, and answers, augmented with rationales generated by the LLM.
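
Schematically, the draft-then-verify flow might look as follows, with drafter() and verifier_logprob() as hypothetical placeholders for the small fine-tuned drafter and the generalist verifier's scoring call:

import random

def drafter(question: str, docs: list) -> tuple:
    raise NotImplementedError  # hypothetical: returns (draft_answer, rationale)

def verifier_logprob(question: str, draft: str, rationale: str) -> float:
    raise NotImplementedError  # hypothetical: generalist LLM's score for the pair

def speculative_rag(question: str, clusters: list, n_drafts: int = 5) -> str:
    """One draft per diverse document subset (one document sampled from
    each cluster); the verifier keeps the most probable draft."""
    drafts = []
    for _ in range(n_drafts):
        subset = [random.choice(cluster) for cluster in clusters]
        drafts.append(drafter(question, subset))
    best = max(drafts, key=lambda d: verifier_logprob(question, d[0], d[1]))
    return best[0]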

Highlights from the Community

FlashAttention-3

https://tridao.me/publications/flash3/flash3.pdf

Summary by Last Week in AI:

FlashAttention is an important and widely used method for speeding up the inference of Large Language Models. This paper discusses FlashAttention-3, an improved method for speeding up attention on Hopper GPUs, the latest and best hardware for LLMs from Nvidia. The new method utilizes three main techniques: exploiting asynchrony of the Tensor Cores and TMA to overlap computation and data movement, interleaving block-wise matmul and softmax operations, and using block quantization and incoherent processing that leverages hardware support for FP8 low-precision. The results show that FlashAttention-3 achieves a 1.5-2.0× speedup on H100 GPUs, with FP16 reaching up to 740 TFLOPs/s (75% utilization) and FP8 reaching close to 1.2 PFLOPs/s.

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

https://arxiv.org/pdf/2407.16741

Summary by The Sequence of AI Knowledge:

Researchers from elite AI universities such as UC Berkeley, Yale, Carnegie Mellon and others published a paper introducing OpenDevin, a framework for developing AI agents that interact with environments similar to human programmers. OpenDevin agents are able to collaborate with human programmers on different tasks such as bug fixing, feature building, testing and many others.

AI models collapse when trained on recursively generated data

https://www.nature.com/articles/s41586-024-07566-y

Summary by The Sequence of AI Knowledge:

Researchers from Oxford, Cambridge, Imperial College London and other institutions published a paper in Nature outlining a curious phenomenon in LLMs coined model collapse. The thesis of model collapse states that LLMs will start showing irreversible degenerative behavior when trained on data created by other AI models.

Compact Language Models via Pruning and Knowledge Distillation

https://arxiv.org/pdf/2407.14679v1

Summary by The Sequence of AI Knowledge:

NVIDIA Research published a paper proposing a set of effective compression best practices to build compact LLMs. The techniques combine the best strategies for depth, width, attention and MLP pruning with knowledge distillation-based retraining.

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

https://arxiv.org/pdf/2406.08464v1

Summary by Last Week in AI:

Magpie presents a method for synthesizing high-quality instruction data at scale by extracting it directly from aligned large language models, demonstrating its effectiveness in comparison to other public instruction datasets.

GraphFM: A Scalable Framework for Multi-Graph Pretraining

https://arxiv.org/pdf/2407.11907v1

Summary by Last Week in AI:

GraphFM is a scalable framework for multi-graph pretraining, but the specific details of the article are not available.

Transformer Layers as Painters

https://arxiv.org/pdf/2407.09298v1

Summary by Last Week in AI:

Understanding the impact of removing or reorganizing information throughout the layers of a pretrained transformer can yield better usage of existing models and make architectural improvements to produce new variants, as shown by a series of empirical studies on frozen models.

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

https://arxiv.org/pdf/2407.12772v1

Summary by Last Week in AI:

Introducing LMMS-EVAL, a unified multimodal benchmark framework with over 50 tasks and 10 models, addressing the challenges of low cost and zero contamination in evaluating large multi-modal models.

Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning

https://arxiv.org/pdf/2406.06469v1

Summary by Last Week in AI:

Husky is an open-source language agent that outperforms existing models in addressing complex reasoning problems by using a unified action space and expert models.

LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data

https://arxiv.org/pdf/2407.11418v1

KAN or MLP: A Fairer Comparison

https://arxiv.org/pdf/2407.16674

Summary by AlphaSignal:

MLP outperformed KAN in machine learning (86.16% vs. 85.96%), computer vision (85.88% vs. 77.88%), NLP (80.45% vs. 79.95%), and audio processing (17.74% vs. 15.49%). KAN excelled only in symbolic formula representation (1.2e-3 RMSE vs. 7.4e-3).

Accuracy is Not All You Need

https://arxiv.org/pdf/2407.09141v1

Summary by LLM Watch:

The researchers propose two new metrics to evaluate compressed LLMs: KL-Divergence and flips. KL-Divergence measures the difference in probability distributions between the baseline and compressed models, providing a more nuanced understanding of how the models' outputs differ. The flips metric quantifies the proportion of answers that change from correct to incorrect (and vice versa) between the baseline and compressed models, even when overall accuracy remains similar. By incorporating these metrics, the study offers a more comprehensive evaluation framework for compressed LLMs.
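
The flips metric is easy to state precisely. A short, self-contained version (the function name is mine, not the paper's):

def flips(baseline_correct: list, compressed_correct: list) -> float:
    """Fraction of examples whose correctness changes between the baseline
    model and its compressed version, in either direction."""
    assert len(baseline_correct) == len(compressed_correct)
    changed = sum(b != c for b, c in zip(baseline_correct, compressed_correct))
    return changed / len(baseline_correct)

# Two models can have identical accuracy yet many flips:
base = [True, True, False, False, True, False]
comp = [False, True, True, False, False, True]   # same accuracy (3/6)...
print(flips(base, comp))                         # ...but 4/6 answers flipped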

Case2Code: Learning Inductive Reasoning with Synthetic Data

https://arxiv.org/pdf/2407.12504v1

Summary by LLM Watch:

Inductive reasoning, the ability to infer underlying rules by observing examples or sequential transformations, is a crucial aspect of complex reasoning. While Large Language Models (LLMs) have shown impressive deductive reasoning skills, their inductive reasoning capabilities have not been extensively evaluated or explicitly trained. Collecting large-scale, diverse human-generated inductive data is challenging, making it difficult to assess and enhance LLMs' inductive reasoning abilities. The researchers propose a novel approach called Case2Code, which leverages the expressiveness and correctness of programs to synthesize inductive reasoning tasks. They collect a diverse set of executable programs and generate input-output transformations for each program. LLMs are then tasked with inferring the underlying code implementations based on the synthetic input-output cases. By evaluating representative LLMs on the Case2Code task, the researchers demonstrate that case-to-code induction is challenging for current models. To address this, they synthesize large-scale Case2Code training samples to explicitly train LLMs in inductive reasoning.

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

https://arxiv.org/pdf/2407.13623

Summary by LLM Watch:

While a lot of research has been done on scaling laws for LLMs, most of it has been focused on the number of parameters and the amount of training data. The vocabulary size, which determines the granularity of the tokens that are used to represent the input and output sequences, has been largely overlooked. Choosing the right vocabulary size is a trade-off between representing the input and output more efficiently with fewer tokens and the risk of under-fitting rare tokens. The researchers propose three different methods for predicting the optimal vocabulary size for a given compute budget: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. All three methods converge on the same result, showing that the optimal vocabulary size depends on the available compute budget and that larger models should use larger vocabularies. For example, they predict that the Llama2-70B model should have used a vocabulary size of at least 216K instead of the 32K that was actually used.

H2O-Danube3 Technical Report

https://arxiv.org/pdf/2407.09276

Summary by Turing Post:

Presents small LLMs optimized for mobile devices, highlighting efficient operation and accessibility

LETS-C: Leveraging Language Embedding for Time Series Classification

https://arxiv.org/pdf/2407.06533

Summary by Turing Post:

Utilizes language embeddings for time-series classification, demonstrating high performance with reduced computational costs

Lynx: An Open Source Hallucination Evaluation Model

https://arxiv.org/pdf/2407.08488

Summary by Turing Post:

Develops an open-source model for detecting hallucinations in Retrieval-Augmented Generation systems

Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps

https://arxiv.org/pdf/2407.07071

Summary by Turing Post:

Detects contextual hallucinations in LLMs using attention maps, providing a tool to reduce hallucinations

Associative Recurrent Memory Transformer

https://arxiv.org/pdf/2407.04841

Summary by Turing Post:

Develops a new architecture for processing long sequences efficiently using associative memory

Human-like Episodic Memory for Infinite Context LLMs

https://arxiv.org/pdf/2407.09450

Summary by Turing Post:

Integrates features of human episodic memory into LLMs to manage infinite context lengths

MUSCLE: A Model Update Strategy for Compatible LLM Evolution

https://arxiv.org/pdf/2407.09435

Summary by Turing Post:

Introduces a model update strategy that minimizes negative flips during LLM updates, ensuring consistent task performance

InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct

https://arxiv.org/pdf/2407.05700

Summary by Turing Post:

Enhances code LLMs by generating natural language instructions from code, improving model diversity and performance

On Leakage of Code Generation Evaluation Datasets

https://arxiv.org/pdf/2407.07565

Summary by Turing Post:

Identifies contamination sources in code generation datasets and introduces a cleaner benchmark for evaluating LLMs

An accurate detection is not all you need to combat label noise in web-noisy datasets

https://arxiv.org/pdf/2407.05528

Summary by Turing Post:

Proposes a hybrid approach to improve classification performance in noisy datasets by combining unsupervised learning with noise detection methods

Self-Recognition in Language Models

https://arxiv.org/pdf/2407.06946

Summary by Turing Post:

Investigates whether LLMs can recognize their own outputs, revealing insights into model decision-making processes

From Loops to Oops: Fallback Behaviors of Language Models Under Uncertainty

https://arxiv.org/pdf/2407.06071

Summary by Turing Post:

Studies fallback behaviors of LLMs under uncertainty, detailing how advanced models handle errors and uncertainties

Understanding Visual Feature Reliance through the Lens of Complexity

https://arxiv.org/pdf/2407.06076

Summary by Turing Post:

Analyzes how deep learning models prioritize features based on complexity, impacting model decisions

SpreadsheetLLM: Encoding Spreadsheets for Large Language Models

https://arxiv.org/pdf/2407.09025

Summary by Turing Post:

Enhances LLMs' ability to handle complex spreadsheet data through advanced serialization and compression techniques

Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence

https://arxiv.org/pdf/2407.07061

Summary by Turing Post:

Proposes a collaborative framework integrating diverse autonomous agents to overcome limitations in multi-agent systems, enhancing intelligence and interaction

Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

https://arxiv.org/pdf/2407.08296

Summary by Turing Post:

Combines quantization and adaptive low-rank projections to reduce memory usage during LLM training

Inference Performance Optimization for Large Language Models on CPUs

https://arxiv.org/pdf/2407.07304

Summary by Turing Post:

Optimizes LLM inference on CPUs using techniques like SlimAttention and an INT8 KV cache approach

Toto: Time Series Optimized Transformer for Observability

https://arxiv.org/pdf/2407.07874

Summary by Turing Post:

Introduces a foundation model for time-series forecasting optimized for observability metrics

Gradient Boosting Reinforcement Learning

https://arxiv.org/pdf/2407.08250

Summary by Turing Post:

Extends gradient boosting techniques to reinforcement learning for improved performance on structured tasks

AgentInstruct: Toward Generative Teaching with Agentic Flows

https://arxiv.org/pdf/2407.03502

Summary by Turing Post:

Develops an agentic framework that autonomously generates synthetic data to teach language models new skills, significantly improving model performance

GTA: A Benchmark for General Tool Agents

https://arxiv.org/pdf/2407.08713

Summary by Turing Post:

Introduces a benchmark to evaluate language model agents in real-world scenarios, highlighting existing models' limitations in tool-use capabilities

Just read twice: closing the recall gap for recurrent language models

https://arxiv.org/pdf/2407.05483v1

Summary by Last Week in AI:

Improving the recall gap for recurrent language models by addressing the challenge of information selection and proposing JRT-Prompt and JRT-RNN as solutions.

Universal Length Generalization with Turing Programs

https://arxiv.org/pdf/2407.03310

On Leakage of Code Generation Evaluation Datasets

https://arxiv.org/pdf/2407.07565v1

Summary by The Sequence of AI Knowledge:

Researchers from Cohere published a paper providing evidence of the levels of contamination of code generation benchmarks in major LLMs. The paper also proposes Less Basic Python Problems, a new benchmark more resilient to contamination.

A Survey on Efficient Inference for Large Language Models

https://arxiv.org/pdf/2404.14294v3

On scalable oversight with weak LLMs judging strong LLMs

https://arxiv.org/pdf/2407.04622

Graph-Structured Speculative Decoding

https://arxiv.org/pdf/2407.16207v1

Train-Attention: Meta-Learning Where to Focus in Continual Knowledge Learning

https://arxiv.org/pdf/2407.16920v1

MAVEN-FACT: A Large-scale Event Factuality Detection Dataset

https://arxiv.org/pdf/2407.15352v1

xLSTMTime: Long-term Time Series Forecasting With xLSTM

https://arxiv.org/pdf/2407.10240

NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

https://arxiv.org/pdf/2407.11963

In-Context Symbolic Regression: Leveraging Large Language Models for Function Discovery

https://arxiv.org/pdf/2404.19094v2

Show, Don't Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay

https://www.arxiv.org/pdf/2407.11068

Beyond Words: On Large Language Models Actionability in Mission-Critical Risk Analysis

https://arxiv.org/pdf/2406.10273v4

Robust ASR Error Correction with Conservative Data Filtering

https://arxiv.org/pdf/2407.13300v1

Learning From Correctness Without Prompting Makes LLM Efficient Reasoner

https://arxiv.org/pdf/2403.19094v2

Evaluating Language Model Context Windows: A “Working Memory” Test and Inference-time Correction

https://arxiv.org/pdf/2407.03651

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

https://arxiv.org/pdf/2402.14905

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

https://arxiv.org/pdf/2407.04694

A nice summary is available here: https://situational-awareness-dataset.org/

OthelloGPT learned a bag of heuristics

https://www.alignmentforum.org/posts/gcpNuEZnxAPayaKBY/othellogpt-learned-a-bag-of-heuristics-1

Faux Polyglot: A Study on Information Disparity in Multilingual Large Language Models

https://arxiv.org/pdf/2407.05502

Understanding Transformers via N-Gram Statistics

https://www.researchgate.net/publication/382204056_Understanding_Transformers_via_N-Gram_Statistics

Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition

https://arxiv.org/pdf/2407.05733v1

Retrieved In-Context Principles from Previous Mistakes

https://arxiv.org/pdf/2407.05682v1

On Speeding Up Language Model Evaluation

https://arxiv.org/pdf/2407.06172v1

Towards Building Specialized Generalist AI with System 1 and System 2 Fusion

https://arxiv.org/pdf/2407.08642

Mixture of A Million Experts

https://arxiv.org/pdf/2407.04153

Summary by Last Week in AI:

DeepMind introduces PEER, a novel architecture that scales MoE models to millions of experts, improving the performance-compute tradeoff of large language models by efficiently routing input data and using tiny experts with a single neuron in the hidden layer.

PAS: Data-Efficient Plug-and-Play Prompt Augmentation System

https://arxiv.org/pdf/2407.06027v2

Summary by LLM Watch:

…PAS leverages the power of LLMs to generate high-quality prompt complementary datasets automatically. By training on these datasets, PAS achieves exceptional performance in prompt engineering tasks. The system is highly efficient, requiring only 9000 data points to reach state-of-the-art performance, which is a significant improvement over previous APE models. Additionally, PAS can autonomously generate prompt augmentation data without the need for human intervention, further streamlining the prompt engineering process…

Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps

https://arxiv.org/pdf/2407.07071

Summary by LLM Watch:

…The proposed approach [to mitigate LLM hallucinations], called Lookback Lens, leverages the attention mechanism of LLMs to detect contextual hallucinations. The key idea is to examine the ratio of attention weights that the model assigns to the provided context versus its own generated tokens. By training a simple linear classifier on these lookback ratio features, the authors demonstrate that it is possible to effectively detect hallucinations without the need for more complex models that rely on the entire hidden states of the LLM or text-based entailment. Remarkably, the detector is found to be transferable across tasks and even different-sized models, allowing for efficient deployment without retraining…
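
A simplified rendering of the idea: turn attention maps into lookback-ratio features and fit a linear probe. The tensor layout and scikit-learn classifier are my assumptions, not the authors' code:

import numpy as np
from sklearn.linear_model import LogisticRegression

def lookback_ratios(attn: np.ndarray, n_context: int) -> np.ndarray:
    """attn: [layers, heads, gen_len, src_len] attention weights for one
    generated span. Per layer/head: the share of attention mass on the
    provided context vs. context plus previously generated tokens."""
    ctx = attn[..., :n_context].sum(-1)
    gen = attn[..., n_context:].sum(-1)
    ratio = ctx / (ctx + gen + 1e-9)        # [layers, heads, gen_len]
    return ratio.mean(-1).reshape(-1)       # one feature per layer/head

# Given per-span attention dumps and binary labels (1 = hallucinated),
# the detector is just a linear probe over these features:
# X = np.stack([lookback_ratios(a, n_ctx) for a, n_ctx in spans])
# clf = LogisticRegression().fit(X, labels)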

Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems

https://arxiv.org/pdf/2407.01370v1

Summary by Last Week in AI:

Challenging long-context LLMs and RAG systems with the "Summary of a Haystack" task, the article presents a new evaluation method for AI systems' output quality on long-context tasks, highlighting the need for improved performance.

Revealing Fine-Grained Values and Opinions in Large Language Models

https://arxiv.org/pdf/2406.19238v1

Summary by Last Week in AI:

Uncovering biases and disparities in large language models through analysis of responses to politically charged statements and the impact of demographic features on outcomes.

AI Agents That Matter

https://arxiv.org/pdf/2407.01502v1

Summary by Last Week in AI:

AI agents' benchmarks and evaluation practices have shortcomings, such as a narrow focus on accuracy, leading to needlessly complex and costly agents, and a lack of standardization in evaluation practices, hindering their usefulness in real-world applications.

Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP

https://arxiv.org/pdf/2407.00402

Summary by Turing Post:

Argues that defining long-context NLP tasks by input length is insufficient, proposing a taxonomy to better evaluate and develop LLM capabilities in genuinely difficult long-context scenarios.

Consistency Flow Matching: Defining Straight Flows with Velocity Consistency

https://arxiv.org/pdf/2407.02398

Summary by Turing Post:

Enhances flow matching in generative models by enforcing self-consistency in the velocity field, improving training efficiency and sample quality.

DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning

https://arxiv.org/pdf/2407.04078

Summary by Turing Post:

Improves LLM performance on complex math tasks by decomposing problems into logical subtasks and incorporating self-correction, demonstrating robust generalization capabilities.

Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER

https://arxiv.org/pdf/2407.01272

Summary by Turing Post:

Employs instruction-tuning with enriched prompts containing definitions and guidelines, significantly improving the model's ability to generalize to unseen entity types in NER tasks.

Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs

https://arxiv.org/pdf/2407.00653

Summary by Turing Post:

Enhances LLMs with knowledge reasoning abilities using knowledge graphs and a trial-and-error mechanism, improving general reasoning capabilities and addressing rule overfitting.

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

https://arxiv.org/pdf/2406.18009

Summary by Turing Post:

Introduces a non-autoregressive zero-shot text-to-speech system with a simple architecture, achieving human-level naturalness and state-of-the-art speaker similarity and intelligibility.

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

https://arxiv.org/pdf/2407.02490

Summary by Turing Post:

Utilizes dynamic sparse attention patterns to speed up the pre-filling stage of long-context LLMs, significantly reducing inference latency while maintaining accuracy.

Agentless: Demystifying LLM-based Software Engineering Agents

https://arxiv.org/pdf/2407.01489

Summary by Turing Post:

Simplifies LLM-based software development using a two-step process of localization and repair without autonomous tool usage, achieving high performance and low cost.

RouteLLM: Learning to Route LLMs with Preference Data

https://arxiv.org/pdf/2406.18665v2

Summary by Turing Post:

Optimizes cost and performance by dynamically selecting between strong and weak LLMs, reducing costs while maintaining response quality through data augmentation and human preference data.

LiteSearch: Efficacious Tree Search for LLM

https://arxiv.org/pdf/2406.18665v2

Summary by Turing Post:

Develops a novel tree search algorithm to improve LLMs' performance on mathematical reasoning tasks, reducing computational costs while maintaining competitive performance.

Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models

https://arxiv.org/pdf/2407.01906

Summary by Turing Post:

Proposes Expert-Specialized Fine-Tuning (ESFT) for sparse Mixture-of-Experts (MoE) architectures, tuning only the most relevant experts for a task, improving tuning efficiency and performance.

UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI

https://arxiv.org/pdf/2407.00106

Summary by Turing Post:

Highlights that unlearning fails to prevent reintroduction of removed knowledge through in-context learning, emphasizing the need for robust content filtering mechanisms.

ProgressGym: Alignment with a Millennium of Moral Progress

https://arxiv.org/pdf/2406.20087

Summary by Turing Post:

Introduces a framework to align LLMs with human moral progress using historical texts and LLMs, offering benchmarks to track evolving values and address value lock-in risks in AI.

Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

https://arxiv.org/pdf/2407.02855

Summary by Turing Post:

Proposes a method to defend against jailbreak attacks by unlearning harmful knowledge, significantly reducing attack success rates and demonstrating remarkable generalizability.

A False Sense of Safety: Unsafe Information Leakage in ‘Safe’ AI Responses

https://arxiv.org/pdf/2407.02551

Summary by Turing Post:

Explores limitations of current AI safety measures, introducing "inferential adversaries" to exploit seemingly safe outputs, emphasizing the need for new defense mechanisms.

Self-Evaluation as a Defense Against Adversarial Attacks on LLMs

https://arxiv.org/pdf/2407.03234

Summary by Turing Post:

Develops a defense mechanism using self-evaluation to reduce attack success rates, outperforming existing defenses and remaining robust even under adaptive attacks.

Scaling Synthetic Data Creation with 1,000,000,000 Personas

https://arxiv.org/pdf/2406.20094v1

Summary by LLM Watch:

…The researchers propose a novel approach called Persona Hub, which automatically curates a collection of 1 billion diverse personas from web data. These personas act as distributed carriers of world knowledge, allowing the LLM to tap into various perspectives and generate synthetic data accordingly. By utilizing these personas, the LLM can create diverse and high-quality synthetic data for a wide range of scenarios, such as mathematical and logical reasoning problems, instructions, knowledge-rich texts, game NPCs, and tools (functions). This persona-driven approach ensures that the generated data is versatile, scalable, and flexible…

GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models

https://arxiv.org/pdf/2407.02936v1

AutoRAG-HP: Automatic Online Hyper-Parameter Tuning for Retrieval-Augmented Generation

https://arxiv.org/pdf/2406.19251

ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos

https://arxiv.org/pdf/2406.19392v1

AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

https://arxiv.org/pdf/2402.09742v3

Investigating How Large Language Models Leverage Internal Knowledge to Perform Complex Reasoning

https://arxiv.org/pdf/2406.19502v1

ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models

https://arxiv.org/pdf/2406.20015v1

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

https://arxiv.org/pdf/2406.15486v2

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

https://arxiv.org/pdf/2406.17557v1

Summary by The Sequence of AI Knowledge:

HuggingFace published a paper detailing how they built FineWeb, one of the largest open source datasets for LLM pretraining ever built. FineWeb boasts an impressive 15 trillion tokens drawn from 96 Common Crawl snapshots.

Symbolic Learning Enables Self-Evolving Agents

https://arxiv.org/pdf/2406.18532v1

Summary by The Sequence of AI Knowledge:

Researchers from AIWaves published a paper introducing a technique known as agent symbolic learning, aimed at self-improving agents. The core idea is to draw a parallel between an agent pipeline and a neural net and use symbolic optimizers to improve the agent network.

APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

https://arxiv.org/pdf/2406.18518

Summary by The Sequence of AI Knowledge:

Salesforce Research published a paper introducing APIGen, a pipeline designed to synthesize function-calling datasets. APIGen was used to train models of only 7B parameters that achieve state-of-the-art results on function-calling benchmarks.

Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts

https://arxiv.org/pdf/2405.01121

Summary by The Sequence of AI Knowledge:

Google Research published a paper introducing Meeting Information Seeking Dialogs (MISeD), a dataset focused on meeting transcripts. MISeD tries to optimize for finding factual information in meeting transcripts, which can be a notoriously difficult task.

OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?

https://arxiv.org/pdf/2406.16772

Summary by The Sequence of AI Knowledge:

Researchers from Shanghai Jiao Tong University's Generative AI Research Lab published a paper detailing the results of the OlympicArena superintelligence benchmark. OlympicArena was designed to evaluate models across many disciplines and modalities.

Gemma 2: Improving Open Language Models at a Practical Size

https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf

Summary by LLM Watch:

…Gemma 2 introduces several technical modifications to enhance performance and efficiency. The architecture incorporates interleaved local-global attentions, which allow the model to capture both local and global dependencies effectively. Additionally, group-query attention is employed to reduce computational complexity. For the smaller models (2B and 9B), knowledge distillation is used instead of next token prediction during training, enabling them to learn from larger, more powerful models while maintaining a compact size…

D2LLM: Decomposed and Distilled Large Language Models for Semantic Search

https://arxiv.org/pdf/2406.17262v1

MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

https://arxiv.org/pdf/2406.17565v2

A Survey of Large Language Models for Graphs

https://arxiv.org/pdf/2405.08011v2

Grokfast: Accelerated Grokking by Amplifying Slow Gradients

https://arxiv.org/pdf/2405.20233v2

Poisoned LangChain: Jailbreak LLMs by LangChain

https://arxiv.org/pdf/2406.18122v1

Beyond Words: Other Modalities

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

https://arxiv.org/pdf/2406.09406v2

Summary by Turing Post:

Trains a vision model on over twenty diverse modalities, enabling it to perform a wide range of tasks without performance loss, enhancing multimodal generation and retrieval.

Understanding Alignment in Multimodal LLMs: A Comprehensive Study

https://arxiv.org/pdf/2407.02477

Summary by Turing Post:

Explores alignment of responses in multimodal LLMs with image content, proposing Bias-Driven Hallucination Sampling (BDHS) and highlighting the benefits of combined offline and online methods.

ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning

https://arxiv.org/pdf/2406.19741

Summary by Turing Post:

Integrates LLMs with the Robot Operating System (ROS) to facilitate intuitive robot programming, incorporating feedback to refine tasks, demonstrating robustness and scalability.

STARK: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge

https://arxiv.org/pdf/2407.03958

Summary by Turing Post:

Introduces a large-scale multi-modal conversation dataset featuring diverse social personas and images, enabling the creation of advanced conversation models with superior visual imagination abilities.

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

https://arxiv.org/pdf/2406.19389v1

Summary by Last Week in AI:

A new framework called OMG-LLaVA combines powerful pixel-level vision understanding with reasoning abilities, accepting various visual and text prompts for flexible user interaction and achieving image-level, object-level, and pixel-level reasoning and understanding in a single model.

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Summary by Last Week in AI:

A new benchmark, MMEvalPro, addresses biases in evaluating Large Multimodal Models (LMMs) by introducing a trilogy evaluation pipeline and more rigorous metrics, making evaluations more challenging and trustworthy.

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

https://arxiv.org/pdf/2407.00114v1

Summary by Last Week in AI:

OmniJARVIS is a novel Vision-Language-Action model that uses unified tokenization of multimodal interaction data to enable open-world instruction-following agents in Minecraft, demonstrating strong reasoning and efficient decision-making capabilities.

Magic Insert: Style-Aware Drag-and-Drop

https://arxiv.org/pdf/2407.02489v1

Summary by Last Week in AI:

A new method called Magic Insert allows for style-aware drag-and-drop of subjects from one image to another, addressing the challenges of style-aware personalization and realistic object insertion in stylized images.

Vision language models are blind

https://arxiv.org/pdf/2407.06581

Summary by The Gradient:

Researchers from Auburn University and the University of Alberta found that state-of-the-art large language models with vision capabilities (VLMs) are surprisingly poor at understanding spatial information involving basic geometric shapes, such as whether two circles overlap. They propose BlindTest, a new benchmark of 7 simple tasks that are unlikely to have prior answers in natural language on the Internet, to test VLM ability to "see" images like humans do.

Data curation via joint example selection further accelerates multimodal learning

https://arxiv.org/pdf/2406.17711v1

Summary by Last Week in AI:

Joint example selection for data curation accelerates multimodal learning, surpassing state-of-the-art models with significantly fewer iterations and less computation.

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

https://arxiv.org/pdf/2407.04051v2

Summary by Last Week in AI:

FunAudioLLM introduces innovative models for enhancing natural voice interactions between humans and large language models, enabling applications such as speech-to-speech translation and emotional voice chat.

WildGaussians: 3D Gaussian Splatting in the Wild

https://arxiv.org/pdf/2407.08447v1

Summary by Last Week in AI:

A new approach called WildGaussians is introduced to improve 3D Gaussian Splatting's performance in handling in-the-wild data, achieving state-of-the-art results with real-time rendering speeds.

HEMM: Holistic Evaluation of Multimodal Foundation Models

https://arxiv.org/pdf/2407.03418v1

Summary by The Sequence of AI Knowledge:

Researchers from Carnegie Mellon University published a paper introducing the Holistic Evaluation of Multimodal Models (HEMM) framework. HEMM sets the primitives to systematically evaluate multimodal models across different dimensions such as basic skills, information flow, and real-world use cases.

Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs

https://arxiv.org/pdf/2407.07775v1

Summary by Turing Post:

Combines Vision-Language Models and topological graphs for effective multimodal instruction navigation in complex environments

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

https://arxiv.org/pdf/2407.08083

Summary by Turing Post:

Develops a hybrid architecture that integrates Transformer self-attention into the Mamba model, enhancing performance in various vision tasks

PaliGemma: A versatile 3B VLM for transfer

https://arxiv.org/pdf/2407.07726v1

Summary by Turing Post:

Combines a vision encoder and a language model to effectively transfer knowledge across diverse vision-language tasks

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

https://arxiv.org/pdf/2407.04842v1

Summary by Turing Post:

Introduces a benchmark for evaluating multimodal judges in text-to-image generation, assessing their performance on various criteria including safety and bias

Autoregressive Speech Synthesis without Vector Quantization

https://arxiv.org/pdf/2407.08551

Summary by Turing Post:

Proposes an autoregressive TTS model that enhances output diversity and robustness

xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart

https://arxiv.org/pdf/2407.01530

ColPali: Efficient Document Retrieval with Vision Language Models

https://arxiv.org/pdf/2407.01449

Shape of Motion: 4D Reconstruction from a Single Video

https://arxiv.org/pdf/2407.13764

Summary by AlphaSignal:

Reconstructing dynamic scenes from single videos is complex due to the ill-posed nature of the task. Traditional methods are limited as they require templates, function only in nearly static scenes, or cannot track full-sequence 3D motion, which makes them unsuitable for complex, moving scenes. This approach uses SE(3) motion bases to model motion as a combination of base movements. It integrates data-driven priors like depth maps and 2D motion tracks into a unified scene representation, enhancing consistency and accuracy.

SEED-Story: Multimodal Long Story Generation with Large Language Model

https://arxiv.org/pdf/2407.08683v1

Improving GFlowNets for Text-to-Image Diffusion Alignment

https://arxiv.org/pdf/2406.00633

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

https://arxiv.org/pdf/2407.08083v1

Summary by Last Week in AI:

A novel hybrid Mamba-Transformer vision backbone, MambaVision, is proposed and shown to achieve state-of-the-art performance in image classification and outperform comparably-sized backbones in downstream tasks.

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

https://arxiv.org/pdf/2407.15841v1

Summary by The Sequence of AI Knowledge:

Apple Research published a paper detailing SlowFast-LLaVA (SF-LLaVA), a video language model optimized for capturing the spatial semantics and temporal context in videos. SF-LLaVA uses a two-stream input design to aggregate features from different video frames in ways that facilitate knowledge extraction.

