LLM Paper Reading Notes - August 2024
Sharing short notes (from myself and others) about LLM research papers I came across in July. These notes differ in their level of detail and precision. I hope they're still useful in piquing your curiosity and helping you breathe under the waterfall. At the current pace of AI, it takes the power of all of us to keep up.
Check my newsletter for past reading notes!
Reading Notes
Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
This study investigates how large language models handle question-answering tasks under two conditions: when they receive the comprehensive context (long-context) versus when they are given only selected chunks of the necessary information (RAG). It shows that long context significantly surpasses RAG for Gemini-1.5-Pro, GPT-4o and GPT-3.5-Turbo. The authors propose a hybrid solution that consists of first generating the context with RAG and asking the LLM whether it is sufficient to answer the question; if it is not, long-context is used. The approach surpasses long-context for GPT-4o and GPT-3.5-Turbo, and is more cost-effective.
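To make the routing logic concrete, here is a minimal Python sketch of the hybrid idea, assuming hypothetical retrieve_chunks and ask_llm helpers (they stand in for a real retriever and LLM API and are not from the paper).

# Minimal sketch of the hybrid routing idea, assuming hypothetical
# retrieve_chunks() and ask_llm() helpers -- not the authors' code.
def answer_with_hybrid_routing(question: str, corpus: list[str],
                               retrieve_chunks, ask_llm) -> str:
    # Step 1: cheap RAG pass over the top-k retrieved chunks.
    chunks = retrieve_chunks(question, corpus, k=5)
    rag_prompt = (
        "Context:\n" + "\n---\n".join(chunks) +
        f"\n\nQuestion: {question}\n"
        "If the context is sufficient, answer the question. "
        "Otherwise reply exactly with: UNANSWERABLE"
    )
    rag_answer = ask_llm(rag_prompt)
    # Step 2: fall back to the full long context only when the model
    # declares the retrieved chunks insufficient.
    if "UNANSWERABLE" in rag_answer:
        long_prompt = (
            "Context:\n" + "\n\n".join(corpus) +
            f"\n\nQuestion: {question}\nAnswer:"
        )
        return ask_llm(long_prompt)
    return rag_answer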
Confabulation: The Surprising Value of Large Language Model Hallucinations
This paper draws a parallel between human confabulation and LLM hallucinations. On one hand, it highlights research from psychiatry suggesting that everyday memory reconstruction often involves some degree of confabulation. When humans do not have access to sufficient information to formulate coherent semantic meaning, they often confabulate to ‘fill in the blanks’ with self-consistent narratives that are not necessarily factual but bear close semantic verisimilitude to reality. On the other hand, it shows, using a story detection model (based on a fine-tuned ELECTRA-large) and 3 datasets, that hallucinated content has higher narrativity. It concludes that some degree of confabulation may be necessary for LLMs to maintain cognitive coherence. Contrary to what the title suggests, however, it does not propose practical applications or benefits of LLM hallucinations in a way that distinctly separates beneficial confabulation from the potential risks of fabricating facts.
RATT: A Thought Structure for Coherent and Correct LLM Reasoning
It is well known that asking an LLM to explicitly reason step by step improves problem-solving performance. Tree of Thoughts (ToT) extended this paradigm by exploring and evaluating multiple reasoning branches of a thought tree structure. This paper proposes to extend ToT further by retrieving documents at each node of the tree and using them to guide the reasoning. Experiments on code generation, creative writing and hallucination detection benchmarks, with GPT-3.5-Turbo, show the efficacy of this approach. However, problem-solving capabilities are only evaluated on the “Game of 24” mathematical puzzle and not on commonly used reasoning benchmarks.
Memory3: Language Modeling with Explicit Memory
Summary by LLM Watch:
…Memory3 introduces a third form of memory in addition to the implicit knowledge stored in model parameters and the short-term working memory used during inference (context key-values). This explicit memory is designed to store factual knowledge more efficiently than model parameters. The researchers also developed techniques to make this approach feasible, including a memory sparsification mechanism to reduce storage requirements and a two-stage pretraining scheme to facilitate the formation of the explicit memory during training…The results of Memory3 are showing that a 2.4B parameter model with explicit memory can outperform much larger models and maintain higher decoding speed than retrieval-augmented generation (RAG) approaches.
A good summary. I would just add that the paper introduces 3 types of memories: implicit memories stored in the model parameters, working memory consisting of cached key-values from the current sequence, and explicit memory consisting of external knowledge encoded from the knowledge base, similar to retrievable model parameters or sparsely-activated neural circuits. Explicit memory is recalled through vector search à la RAG.
Characterizing Prompt Compression Methods for Long Context Inference
The paper provides a comprehensive overview and comparison of three prompt compression techniques: token pruning, abstractive compression (summarization), and extractive compression. In extractive compression, the prompt is divided into smaller chunks such as sentences or phrases. These chunks are then scored based on their relevance to the query or question using a fine-tuned DeBERTa model, with the most relevant chunks being selected. This method yields the best results, often even improving task accuracy at low compression rates (5-10%).
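As a rough illustration of extractive compression (not the authors' implementation), the sketch below splits a prompt into sentences, scores them against the query with an off-the-shelf cross-encoder standing in for the fine-tuned DeBERTa scorer, and keeps the top-scoring fraction in their original order.

# Sketch of extractive prompt compression: split the prompt into sentences,
# score each against the query with a cross-encoder (a stand-in for the
# fine-tuned DeBERTa scorer used in the paper), and keep the top fraction.
from sentence_transformers import CrossEncoder

def compress_extractively(prompt: str, query: str, keep_ratio: float = 0.1) -> str:
    sentences = [s.strip() for s in prompt.split(".") if s.strip()]  # naive splitting
    scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = scorer.predict([(query, s) for s in sentences])
    n_keep = max(1, int(len(sentences) * keep_ratio))
    # Keep the highest-scoring sentences, but preserve their original order.
    keep_idx = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:n_keep])
    return ". ".join(sentences[i] for i in keep_idx) + "."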
A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks
This article summarizes 38 prompt engineering techniques for LLM reasoning and lists the types of problems and datasets they have been used with.
Distilling System 2 into System 1
The paper proposes to distill an LLM that performs multi-step or multi-call reasoning (such as Chain of Thought, Rephrase and Respond, Branch-Solve-Merge), called the System 2 LLM, into an LLM that outputs the response directly in a single forward pass without intermediate reasoning, called System 1. First, question-answer pairs are generated in an unsupervised manner from System 2 (using self-consistency to filter out errors). Second, System 1 is fine-tuned on these question-answer pairs (without the reasoning steps). Experiments with Llama-2-70B-chat show that distillation of System 2 methods often maintains or improves performance compared to the original System 2 methods while significantly reducing inference costs. The effectiveness of distillation varies depending on the task, with some tasks (like CoT for complex reasoning) proving challenging to distill effectively.
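The data-generation step can be sketched as follows; ask_llm is a hypothetical sampling helper and the answer-extraction heuristic is deliberately naive, so treat this as an outline of the self-consistency filtering rather than the paper's pipeline.

# Sketch of the distillation data pipeline: sample several System 2 (CoT)
# answers per question, keep only questions where the sampled answers agree,
# and store (question, final answer) pairs without the reasoning.
from collections import Counter

def extract_final_answer(text: str) -> str:
    # Naive heuristic: take the last line as the final answer.
    return text.strip().splitlines()[-1]

def build_system1_dataset(questions, ask_llm, n_samples=8, min_agreement=0.6):
    dataset = []
    for q in questions:
        cot_prompt = f"{q}\nLet's think step by step, then give the final answer."
        answers = [extract_final_answer(ask_llm(cot_prompt)) for _ in range(n_samples)]
        answer, count = Counter(answers).most_common(1)[0]
        if count / n_samples >= min_agreement:          # self-consistency filter
            dataset.append({"prompt": q, "completion": answer})  # no CoT kept
    return dataset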
Just read twice: closing the recall gap for recurrent language models
Recurrent models such as RNNs and their modern variant, Mamba, process tokens sequentially and do not suffer from quadratic complexity like Transformers do. This paper notes that these recurrent models are brittle to the order of the input data. For example, suppose we ask question Q (e.g., “When did Galileo move to Florence?”) over documents D (e.g., the detailed Wikipedia article for Galileo Galilei). The model needs to remember just one fact from D if the prompt is ordered [Q, D], but needs to remember all facts when it is [D, Q]. Based on this insight, they show that repeating information in a prompt [D, Q, D] improves recurrent models and Transformer++ across all model sizes (up to 2.7B parameters) and tasks. The authors also propose JRT-RNN, a non-causal encoder-decoder architecture that is more robust to data ordering (still repeating the information). JRT-RNN provides 99% of Transformer quality at 360M parameters / 30B tokens and 96% at 1.3B parameters / 50B tokens on average across the tasks, with 19.2× higher prefill throughput than FlashAttention-2 (FA2).
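The prompting variant (JRT-Prompt) amounts to a simple re-ordering trick; here is a minimal sketch, with illustrative prompt formatting of my own.

# Sketch of the "just read twice" prompting trick: repeat the document after
# the question so a recurrent model knows what to store on its second pass.
def jrt_prompt(document: str, question: str) -> str:
    # [D, Q, D] ordering: the first pass exposes the question, the second
    # pass lets the model re-read the document knowing what to look for.
    return f"{document}\n\nQuestion: {question}\n\n{document}\n\nAnswer:"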
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Transformers, though powerful in handling long sequences, suffer from quadratic complexity. In contrast, RNNs and their modern variant, Mamba, process tokens sequentially and memorize them in a hidden state represented as a fixed-size vector. Despite improvements over RNNs, Mamba struggles to remain competitive with Transformers beyond 8,000 tokens. This paper proposes a richer representation of the hidden state in the form of a set of weights, akin to an "inner" neural network. Updating this hidden state is approached similarly to training a neural network: for a given sentence, the inner model is tasked with recovering the current token from a corrupted version using a reconstruction loss. The corruption involves low-rank projections whose parameters are learned during the overall model training (outer training). The paper presents two specific instantiations of the inner model: TTT-Linear and TTT-MLP. To enhance hardware efficiency, the authors introduce techniques such as mini-batch TTT and dual-form operations. Evaluations show that TTT-Linear matches or exceeds the performance of both Mamba and Transformers, especially in long-context scenarios: it not only outperforms Mamba beyond an 8k context length but also maintains performance comparable to Transformers with greater efficiency.
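A toy numpy sketch of the core idea follows: the hidden state is the weight matrix of a small linear "inner" model, updated by one online gradient step of a reconstruction loss per token. The projection matrices and dimensions are illustrative stand-ins for the learned low-rank maps, not the paper's actual parameterization.

# Toy sketch of the test-time-training idea: the hidden state is the weight
# matrix W of a small "inner" linear model, updated by one gradient step of a
# reconstruction loss per token. theta_K, theta_V, theta_Q are illustrative
# stand-ins for the learned corruption/readout projections.
import numpy as np

d, lr = 16, 0.1
rng = np.random.default_rng(0)
theta_K, theta_V, theta_Q = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W = np.zeros((d, d))                      # inner-model weights = hidden state

def ttt_step(W, x):
    k, v, q = theta_K @ x, theta_V @ x, theta_Q @ x
    # Inner loss: reconstruct the "target" view v from the "corrupted" view k.
    grad = np.outer(W @ k - v, k)         # d/dW of 0.5 * ||W k - v||^2
    W = W - lr * grad                     # one step of online gradient descent
    return W, W @ q                       # updated state and output for this token

tokens = rng.normal(size=(10, d))         # a toy input sequence
for x in tokens:
    W, out = ttt_step(W, x)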
However, the claim that this approach performs "Test-Time Training" of the inner model is debatable since the inner weights are reset after each sentence, ensuring that sentences are processed independently without accumulating knowledge across them.
RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs
This paper demonstrates that fine-tuning an LLM for RAG can significantly enhance its question-answering capabilities. This fine-tuning process leverages QA datasets to train the LLM to both estimate the relevance of specific document snippets to a question and generate answers from these snippets. At inference time, after retrieving snippets, the RankRAG pipeline involves an additional call to the LLM to estimate the relevance of each snippet (reranking). The 5 most relevant snippets are then added to the LLM context for generating the answer. RankRAG, when applied to Llama 3 8B and 70B, achieves state-of-the-art (SOTA) performance, albeit at the cost of an extra inference step, which can result in up to a 6x increase in time.
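At inference time the flow can be sketched roughly as below, with score_llm and answer_llm as hypothetical helpers wrapping the fine-tuned model; the 0-10 scoring prompt is my own simplification of the reranking step.

# Sketch of a RankRAG-style inference flow: one extra LLM call scores each
# retrieved snippet for relevance, then the top-5 snippets are packed into
# the answering prompt. score_llm()/answer_llm() are hypothetical helpers.
def rankrag_answer(question, snippets, score_llm, answer_llm, top_k=5):
    scored = []
    for s in snippets:
        rank_prompt = (f"Question: {question}\nPassage: {s}\n"
                       "On a scale of 0-10, how relevant is the passage? "
                       "Reply with a number only.")
        scored.append((float(score_llm(rank_prompt)), s))   # naive numeric parse
    best = [s for _, s in sorted(scored, key=lambda t: t[0], reverse=True)[:top_k]]
    context = "\n\n".join(best)
    return answer_llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")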
Large Language Models Understand Layouts
This paper demonstrates that GPT-3.5 Turbo, and to a lesser extent other LLMs, can understand text layout. When text is formatted into four quadrants (top-left, top-right, bottom-left, bottom-right) using only spaces and newlines, GPT-3.5 Turbo can accurately answer questions like “What is the name mentioned in the top-left corner?” with an F1 score of 87.77. The experiments suggest that training data containing code is crucial for developing this layout understanding capability in LLMs, which is further improved during instruction-tuning. This capability can be significantly enhanced with specifically crafted training data.
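For intuition, a layout like the one used in the experiments can be rendered with nothing but spaces and newlines, as in this small sketch (the content and column width are made up).

# Sketch of a four-quadrant layout built only from spaces and newlines, so an
# LLM can be asked e.g. "What is the name mentioned in the top-left corner?"
def render_quadrants(tl: str, tr: str, bl: str, br: str, width: int = 40) -> str:
    top = tl.ljust(width) + tr
    bottom = bl.ljust(width) + br
    return top + "\n" * 5 + bottom        # blank lines separate top and bottom

print(render_quadrants("Alice Smith", "Invoice #1234", "Total: $99", "Due: 2024-08-01"))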
Searching for Best Practices in Retrieval-Augmented Generation
This paper reviews and evaluates many components commonly used in RAG pipelines, such as query classification, chunking, retrieval modules, re-rankers, summarizers and more. All in all a good review with some good pointers.
Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation
A very interesting method to evaluate RAG pipelines is proposed in this paper. It involves using an LLM to generate multiple-choice questions (MCQs), each with only one correct answer, from the RAG document corpus. Furthermore, the authors describe how Item Response Theory (IRT) can be applied to assess the sensitivity of the test across multiple cognitive dimensions (understanding, remembering, creating, etc.) and iteratively improve its quality. This process is conducted without any human supervision, aside from a few regex filters to remove poor-quality questions. Some key findings from their experiments using various retrievers and LLMs include: hybrid retrievers that combine BM25 and dense models offer greater robustness; the performance gain from using an appropriate retriever can surpass that of choosing a larger LLM; and poorly aligned retriever components can lead to worse accuracy than having no retrieval at all.
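Setting IRT aside, the grading loop itself is simple; here is a minimal sketch assuming hypothetical generate_mcq and rag_pipeline helpers (not the paper's code).

# Sketch of the exam-based evaluation loop: an LLM generates one-correct-answer
# MCQs from the corpus, then the RAG pipeline under test is graded on them.
def exam_accuracy(corpus_docs, generate_mcq, rag_pipeline, n_questions=100):
    exam = [generate_mcq(doc) for doc in corpus_docs[:n_questions]]
    correct = 0
    for item in exam:  # item: {"question": str, "choices": [str], "answer": "A".."D"}
        prompt = (item["question"] + "\n" +
                  "\n".join(f"{letter}. {c}" for letter, c in zip("ABCD", item["choices"])) +
                  "\nReply with the letter of the correct choice.")
        if rag_pipeline(prompt).strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(exam)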
However, this approach does not assess RAG pipelines in the way they are typically used, which is to generate natural language responses for users; instead, the LLM is prompted to answer MCQs. Could a potential solution be to have the LLM generate a free-form response and then use another LLM to answer the MCQs based on that response?
Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting
In Retrieval-Augmented Generation, documents are retrieved based on a question and used as context within the LLM prompt to generate accurate and informed answers. This paper proposes a novel framework called Speculative RAG to enhance this process by utilizing a smaller, faster, and specifically fine-tuned model known as the drafter. The drafter generates multiple answer drafts along with their rationales by clustering the retrieved documents and sampling from these clusters. These drafts are then evaluated by a larger generalist LLM called the verifier, which estimates the probability of each answer-rationale pair and selects the best one. This approach leads to SOTA performance on QA datasets while reducing latency and cost. The fine-tuning dataset consists of triplets of questions, documents, and answers, augmented with rationales generated by the LLM.
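A rough sketch of the draft-then-verify flow, with cluster_documents, drafter and verifier_score as hypothetical stand-ins for the fine-tuned drafter and the generalist verifier.

# Sketch of the Speculative RAG flow: a small fine-tuned drafter proposes
# several (answer, rationale) drafts from document clusters, and a larger
# generalist verifier scores them and keeps the best.
def speculative_rag(question, documents, cluster_documents, drafter, verifier_score):
    drafts = []
    for cluster in cluster_documents(documents):        # one draft per document subset
        draft_prompt = ("Documents:\n" + "\n".join(cluster) +
                        f"\nQuestion: {question}\nGive an answer and its rationale.")
        drafts.append(drafter(draft_prompt))             # answer + rationale text
    # The verifier only scores the short drafts, not the full document set.
    scored = [(verifier_score(question, d), d) for d in drafts]
    return max(scored, key=lambda t: t[0])[1]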
Highlights from the Community
FlashAttention-3
Summary by Last Week in AI:
FlashAttention is an important and widely used method for speeding up the inference of Large Language Models. This paper discusses FlashAttention-3, an improved method for speeding up attention on Hopper GPUs, the latest and best hardware for LLMs from Nvidia. The new method utilizes three main techniques: exploiting asynchrony of the Tensor Cores and TMA to overlap computation and data movement, interleaving block-wise matmul and softmax operations, and using block quantization and incoherent processing that leverages hardware support for FP8 low precision. The results show that FlashAttention-3 achieves a 1.5-2.0× speedup on H100 GPUs, with FP16 reaching up to 740 TFLOPs/s (75% utilization) and FP8 reaching close to 1.2 PFLOPs/s.
OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
Summary by The Sequence of AI Knowledge:
Researchers from elite AI universities such as UC Berkeley, Yale, Carnegie Mellon and others published a paper introducing OpenDevin, a framework for developing AI agents that interact with environments similarly to human programmers. OpenDevin agents are able to collaborate with human programmers on different tasks such as bug fixing, feature building, testing and many others.
AI models collapse when trained on recursively generated data
Summary by The Sequence of AI Knowledge:
Researchers from Oxford, Cambridge, Imperial College London and other institutions published a paper in Nature outlining a curious phenomenon in LLMs coined model collapse. The thesis of model collapse states that LLMs will start showing irreversible degenerative behavior when trained on data created by other AI models.
Compact Language Models via Pruning and Knowledge Distillation
Summary by The Sequence of AI Knowledge:
NVIDIA Research published a paper proposing a set of effective compression best practices to build compact LLMs. The techniques combine the best strategies for depth, width, attention and MLP pruning with knowledge distillation-based retraining
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Summary by Last Week in AI:
Magpie presents a method for synthesizing high-quality instruction data at scale by extracting it directly from aligned large language models, demonstrating its effectiveness in comparison to other public instruction datasets.
GraphFM: A Scalable Framework for Multi-Graph Pretraining
Summary by Last Week in AI:
GraphFM is a scalable framework for multi-graph pretraining, but the specific details of the article are not available.
Transformer Layers as Painters
Summary by Last Week in AI:
Understanding the impact of removing or reorganizing information throughout the layers of a pretrained transformer can yield better usage of existing models and make architectural improvements to produce new variants, as shown by a series of empirical studies on frozen models.
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Summary by Last Week in AI:
Introducing LMMs-Eval, a unified multimodal benchmark framework with over 50 tasks and 10 models, addressing the challenges of keeping the evaluation of large multimodal models low-cost and free of contamination.
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
Summary by Last Week in AI:
Husky is an open-source language agent that outperforms existing models in addressing complex reasoning problems by using a unified action space and expert models.
LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data
KAN or MLP: A Fairer Comparison
Summary by AlphaSignal:
MLP outperformed KAN in machine learning (86.16% vs. 85.96%), computer vision (85.88% vs. 77.88%), NLP (80.45% vs. 79.95%), and audio processing (17.74% vs. 15.49%). KAN excelled only in symbolic formula representation (1.2e-3 vs. 7.4e-3 RMSE).
Accuracy is Not All You Need
Summary by LLM Watch:
The researchers propose two new metrics to evaluate compressed LLMs: KL-Divergence and flips. KL-Divergence measures the difference in probability distributions between the baseline and compressed models, providing a more nuanced understanding of how the models' outputs differ. The flips metric quantifies the proportion of answers that change from correct to incorrect (and vice versa) between the baseline and compressed models, even when overall accuracy remains similar. By incorporating these metrics, the study offers a more comprehensive evaluation framework for compressed LLMs.
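Both metrics are straightforward to compute once you have per-example correctness flags and output distributions from the baseline and compressed models; a small sketch with toy inputs (not data from the paper) follows.

# Sketch of the two proposed metrics on toy data: "flips" counts verdict
# changes between baseline and compressed models, KL-divergence compares
# their output probability distributions.
import numpy as np

def flips(base_correct, comp_correct):
    # Fraction of examples whose verdict changes in either direction.
    return float(np.mean(np.array(base_correct) != np.array(comp_correct)))

def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

print(flips([1, 1, 0, 1], [1, 0, 1, 1]))                 # 0.5: half the answers flip
print(kl_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))   # small distributional drift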
Case2Code: Learning Inductive Reasoning with Synthetic Data
Summary by LLM Watch:
Inductive reasoning, the ability to infer underlying rules by observing examples or sequential transformations, is a crucial aspect of complex reasoning. While Large Language Models (LLMs) have shown impressive deductive reasoning skills, their inductive reasoning capabilities have not been extensively evaluated or explicitly trained. Collecting large-scale, diverse human-generated inductive data is challenging, making it difficult to assess and enhance LLMs' inductive reasoning abilities. The researchers propose a novel approach called Case2Code, which leverages the expressiveness and correctness of programs to synthesize inductive reasoning tasks. They collect a diverse set of executable programs and generate input-output transformations for each program. LLMs are then tasked with inferring the underlying code implementations based on the synthetic input-output cases. By evaluating representative LLMs on the Case2Code task, the researchers demonstrate that case-to-code induction is challenging for current models. To address this, they synthesize large-scale Case2Code training samples to explicitly train LLMs in inductive reasoning.
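A minimal sketch of the synthesis idea: run a collected program on sampled inputs to obtain input-output cases, then build a prompt asking an LLM to recover the implementation from the cases alone (the toy program and prompt wording are mine, not the paper's).

# Sketch of Case2Code-style data synthesis: execute a known program on sample
# inputs to get ground-truth I/O cases, then prompt an LLM to induce the code.
def make_case2code_example(func, sample_inputs):
    cases = [(x, func(x)) for x in sample_inputs]        # ground-truth I/O pairs
    case_text = "\n".join(f"f({x!r}) == {y!r}" for x, y in cases)
    prompt = ("Infer a Python function f that satisfies all of these cases:\n"
              + case_text + "\nReturn only the code.")
    return prompt, cases

# Toy example with a known program; the LLM should induce something equivalent.
prompt, cases = make_case2code_example(lambda s: s[::-1], ["abc", "hello", "42"])
print(prompt)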
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
Summary by LLM Watch:
While a lot of research has been done on scaling laws for LLMs, most of it has been focused on the number of parameters and the amount of training data. The vocabulary size, which determines the granularity of the tokens that are used to represent the input and output sequences, has been largely overlooked. Choosing the right vocabulary size is a trade-off between representing the input and output more efficiently with fewer tokens and the risk of under-fitting rare tokens. The researchers propose three different methods for predicting the optimal vocabulary size for a given compute budget: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. All three methods converge on the same result, showing that the optimal vocabulary size depends on the available compute budget and that larger models should use larger vocabularies. For example, they predict that the Llama2-70B model should have used a vocabulary size of at least 216K instead of the 32K that was actually used.
H2O-Danube3 Technical Report
Summary by Turing Post:
Presents small LLMs optimized for mobile devices, highlighting efficient operation and accessibility
LETS-C: Leveraging Language Embedding for Time Series Classification
Summary by Turing Post:
Utilizes language embeddings for time-series classification, demonstrating high performance with reduced computational costs
Lynx: An Open Source Hallucination Evaluation Model
Summary by Turing Post:
Develops an open-source model for detecting hallucinations in Retrieval-Augmented Generation systems
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
Summary by Turing Post:
Detects contextual hallucinations in LLMs using attention maps, providing a tool to reduce hallucinations
Associative Recurrent Memory Transformer
Summary by Turing Post:
Develops a new architecture for processing long sequences efficiently using associative memory
Human-like Episodic Memory for Infinite Context LLMs
Summary by Turing Post:
Integrates features of human episodic memory into LLMs to manage infinite context lengths
MUSCLE: A Model Update Strategy for Compatible LLM Evolution
Summary by Turing Post:
Introduces a model update strategy that minimizes negative flips during LLM updates, ensuring consistent task performance
InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct
Summary by Turing Post:
Enhances code LLMs by generating natural language instructions from code, improving model diversity and performance
On Leakage of Code Generation Evaluation Datasets
Summary by Turing Post:
Identifies contamination sources in code generation datasets and introduces a cleaner benchmark for evaluating LLMs
An accurate detection is not all you need to combat label noise in web-noisy datasets
Summary by Turing Post:
Proposes a hybrid approach to improve classification performance in noisy datasets by combining unsupervised learning with noise detection methods
Self-Recognition in Language Models
Summary by Turing Post:
Investigates whether LLMs can recognize their own outputs, revealing insights into model decision-making processes
From Loops to Oops: Fallback Behaviors of Language Models Under Uncertainty
Summary by Turing Post:
Studies fallback behaviors of LLMs under uncertainty, detailing how advanced models handle errors and uncertainties
Understanding Visual Feature Reliance through the Lens of Complexity
Summary by Turing Post:
Analyzes how deep learning models prioritize features based on complexity, impacting model decisions
SpreadsheetLLM: Encoding Spreadsheets for Large Language Models
Summary by Turing Post:
Enhances LLMs' ability to handle complex spreadsheet data through advanced serialization and compression techniques
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
Summary by Turing Post:
Proposes a collaborative framework integrating diverse autonomous agents to overcome limitations in multi-agent systems, enhancing intelligence and interaction
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients
Summary by Turing Post:
Combines quantization and adaptive low-rank projections to reduce memory usage during LLM training
Inference Performance Optimization for Large Language Models on CPUs
Summary by Turing Post:
Optimizes LLM inference on CPUs using techniques like SlimAttention and an INT8 KV cache approach
Toto: Time Series Optimized Transformer for Observability
Summary by Turing Post:
Introduces a foundation model for time-series forecasting optimized for observability metrics
Gradient Boosting Reinforcement Learning
Summary by Turing Post:
Extends gradient boosting techniques to reinforcement learning for improved performance on structured tasks
AgentInstruct: Toward Generative Teaching with Agentic Flows
Summary by Turing Post:
Develops an agentic framework that autonomously generates synthetic data to teach language models new skills, significantly improving model performance
GTA: A Benchmark for General Tool Agents
Summary by Turing Post:
Introduces a benchmark to evaluate language model agents in real-world scenarios, highlighting existing models' limitations in tool-use capabilities
Just read twice: closing the recall gap for recurrent language models
Summary by Last Week in AI:
Improving the recall gap for recurrent language models by addressing the challenge of information selection and proposing JRT-Prompt and JRT-RNN as solutions.
Universal Length Generalization with Turing Programs
On Leakage of Code Generation Evaluation Datasets
Summary by The Sequence of AI Knowledge:
Researchers from Cohere published a paper providing evidence of the levels of contamination of code generation benchmarks in major LLMs. The paper also proposes Less Basic Python Problems, a new benchmark more resilient to contamination.
A Survey on Efficient Inference for Large Language Models
On scalable oversight with weak LLMs judging strong LLMs
Graph-Structured Speculative Decoding
Train-Attention: Meta-Learning Where to Focus in Continual Knowledge Learning
MAVEN-FACT: A Large-scale Event Factuality Detection Dataset
xLSTMTime : Long-term Time Series Forecasting With xLSTM
NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?
In-Context Symbolic Regression: Leveraging Large Language Models for Function Discovery
Show, Don't Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay
Beyond Words: On Large Language Models Actionability in Mission-Critical Risk Analysis
Robust ASR Error Correction with Conservative Data Filtering
Learning From Correctness Without Prompting Makes LLM Efficient Reasoner
Evaluating Language Model Context Windows: A “Working Memory” Test and Inference-time Correction
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
A nice summary is available here: https://situational-awareness-dataset.org/
OthelloGPT learned a bag of heuristics
Faux Polyglot: A Study on Information Disparity in Multilingual Large Language Models
Understanding Transformers via N-Gram Statistics
Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition
Retrieved In-Context Principles from Previous Mistakes
On Speeding Up Language Model Evaluation
Towards Building Specialized Generalist AI with System 1 and System 2 Fusion
Mixture of A Million Experts
Summary by Last Week in AI:
DeepMind introduces PEER, a novel architecture that scales MoE models to millions of experts, improving the performance-compute tradeoff of large language models by efficiently routing input data and using tiny experts with a single neuron in the hidden layer.
PAS: Data-Efficient Plug-and-Play Prompt Augmentation System
Summary by LLM Watch:
…PAS leverages the power of LLMs to generate high-quality prompt complementary datasets automatically. By training on these datasets, PAS achieves exceptional performance in prompt engineering tasks. The system is highly efficient, requiring only 9000 data points to reach state-of-the-art performance, which is a significant improvement over previous APE models. Additionally, PAS can autonomously generate prompt augmentation data without the need for human intervention, further streamlining the prompt engineering process…
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
Summary by LLM Watch:
…The proposed approach [to mitigate LLM hallucinations], called Lookback Lens, leverages the attention mechanism of LLMs to detect contextual hallucinations. The key idea is to examine the ratio of attention weights that the model assigns to the provided context versus its own generated tokens. By training a simple linear classifier on these lookback ratio features, the authors demonstrate that it is possible to effectively detect hallucinations without the need for more complex models that rely on the entire hidden states of the LLM or text-based entailment. Remarkably, the detector is found to be transferable across tasks and even different-sized models, allowing for efficient deployment without retraining…
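A toy sketch of the lookback-ratio feature extraction and the linear detector, using random tensors in place of real attention maps and hallucination labels (the shapes and data are illustrative only).

# Sketch of the lookback-ratio idea: for each head, compute the share of
# attention mass on context tokens vs. generated tokens, then train a simple
# linear classifier on these per-head ratios to flag hallucinated spans.
import numpy as np
from sklearn.linear_model import LogisticRegression

def lookback_ratios(attn, n_context):
    # attn: (heads, query_positions, key_positions) attention weights for a span.
    ctx = attn[:, :, :n_context].sum(-1)                # mass on the provided context
    gen = attn[:, :, n_context:].sum(-1)                # mass on generated tokens
    return (ctx / (ctx + gen + 1e-9)).mean(-1)          # one ratio per head

rng = np.random.default_rng(0)
X = np.stack([lookback_ratios(rng.random((8, 12, 64)), n_context=48) for _ in range(200)])
y = rng.integers(0, 2, size=200)                        # toy hallucination labels
clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))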
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
Summary by Last Week in AI:
Challenging long-context LLMs and RAG systems with the "Summary of a Haystack" task, the article presents a new evaluation method for AI systems' output quality on long-context tasks, highlighting the need for improved performance.
Revealing Fine-Grained Values and Opinions in Large Language Models
Summary by Last Week in AI:
Uncovering biases and disparities in large language models through analysis of responses to politically charged statements and the impact of demographic features on outcomes.
AI Agents That Matter
Summary by Last Week in AI:
AI agents' benchmarks and evaluation practices have shortcomings, such as a narrow focus on accuracy, leading to needlessly complex and costly agents, and a lack of standardization in evaluation practices, hindering their usefulness in real-world applications.
Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP
Summary by Turing Post:
Argues that defining long-context NLP tasks by input length is insufficient, proposing a taxonomy to better evaluate and develop LLM capabilities in genuinely difficult long-context scenarios.
Consistency Flow Matching: Defining Straight Flows with Velocity Consistency
Summary by Turing Post:
Enhances flow matching in generative models by enforcing self-consistency in the velocity field, improving training efficiency and sample quality.
DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning
Summary by Turing Post:
Improves LLM performance on complex math tasks by decomposing problems into logical subtasks and incorporating self-correction, demonstrating robust generalization capabilities.
Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER
Summary by Turing Post:
Employs instruction-tuning with enriched prompts containing definitions and guidelines, significantly improving the model's ability to generalize to unseen entity types in NER tasks.
CHAIN-OF-KNOWLEDGE: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs
Summary by Turing Post:
Enhances LLMs with knowledge reasoning abilities using knowledge graphs and a trial-and-error mechanism, improving general reasoning capabilities and addressing rule overfitting.
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
Summary by Turing Post:
Introduces a non-autoregressive zero-shot text-to-speech system with a simple architecture, achieving human-level naturalness and state-of-the-art speaker similarity and intelligibility.
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Summary by Turing Post:
Utilizes dynamic sparse attention patterns to speed up the pre-filling stage of long-context LLMs, significantly reducing inference latency while maintaining accuracy.
Agentless: Demystifying LLM-based Software Engineering Agents
Summary by Turing Post:
Simplifies LLM-based software development using a two-step process of localization and repair without autonomous tool usage, achieving high performance and low cost.
RouteLLM: Learning to Route LLMs with Preference Data
Summary by Turing Post:
Optimizes cost and performance by dynamically selecting between strong and weak LLMs, reducing costs while maintaining response quality through data augmentation and human preference data.
LiteSearch: Efficacious Tree Search for LLM
Summary by Turing Post:
Develops a novel tree search algorithm to improve LLMs' performance on mathematical reasoning tasks, reducing computational costs while maintaining competitive performance.
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
Summary by Turing Post:
Proposes Expert-Specialized Fine-Tuning (ESFT) for sparse Mixture-of-Experts (MoE) architectures, tuning only the most relevant experts for a task, improving tuning efficiency and performance.
UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI
Summary by Turing Post:
Highlights that unlearning fails to prevent reintroduction of removed knowledge through in-context learning, emphasizing the need for robust content filtering mechanisms.
ProgressGym: Alignment with a Millennium of Moral Progress
Summary by Turing Post:
Introduces a framework to align LLMs with human moral progress using historical texts and LLMs, offering benchmarks to track evolving values and address value lock-in risks in AI.
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
Summary by Turing Post:
Proposes a method to defend against jailbreak attacks by unlearning harmful knowledge, significantly reducing attack success rates and demonstrating remarkable generalizability.
A False Sense of Safety: Unsafe Information Leakage in ‘Safe’ AI Responses
Summary by Turing Post:
Explores limitations of current AI safety measures, introducing "inferential adversaries" to exploit seemingly safe outputs, emphasizing the need for new defense mechanisms.
Self-Evaluation as a Defense Against Adversarial Attacks on LLMs
Summary by Turing Post:
Develops a defense mechanism using self-evaluation to reduce attack success rates, outperforming existing defenses and remaining robust even under adaptive attacks.
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Summary by LLM Watch:
…The researchers propose a novel approach called Persona Hub, which automatically curates a collection of 1 billion diverse personas from web data. These personas act as distributed carriers of world knowledge, allowing the LLM to tap into various perspectives and generate synthetic data accordingly. By utilizing these personas, the LLM can create diverse and high-quality synthetic data for a wide range of scenarios, such as mathematical and logical reasoning problems, instructions, knowledge-rich texts, game NPCs, and tools (functions). This persona-driven approach ensures that the generated data is versatile, scalable, and flexible…
GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models
AutoRAG-HP: Automatic Online Hyper-Parameter Tuning for Retrieval-Augmented Generation
ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator
Investigating How Large Language Models Leverage Internal Knowledge to Perform Complex Reasoning
ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Summary by The Sequence of AI Knowledge:
HuggingFace published a paper detailing how they built FineWeb, one of the largest open-source datasets for LLM pretraining ever built. FineWeb boasts an impressive 15 trillion tokens from 96 Common Crawl snapshots.
Symbolic Learning Enables Self-Evolving Agents
Summary by The Sequence of AI Knowledge:
Researchers from AIWaves published a paper introducing a technique known as agent symbolic learning, aimed at self-improving agents. The core idea is to draw a parallel between an agent pipeline and a neural net and use symbolic optimizers to improve the agent network.
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
Summary by The Sequence of AI Knowledge:
Salesforce Research published a paper introducing APIGen, a pipeline designed to synthesize function-calling datasets. APIGen was used to train models with only 7B parameters that achieve state-of-the-art performance on function-calling benchmarks.
Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts
Summary by The Sequence of AI Knowledge:
Google Research published a paper introducing Meeting Information Seeking Dialogs (MISeD), a dataset focused on meeting transcripts. MISeD tries to optimize for finding factual information in meeting transcripts, which can be a notoriously difficult task.
OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?
Summary by The Sequence of AI Knowledge:
Researchers from Shanghai Jiao Tong University's Generative AI Research Lab published a paper detailing the results of the Olympic Arena superintelligence benchmark. Olympic Arena was designed to evaluate models across many disciplines and modalities.
Gemma 2: Improving Open Language Models at a Practical Size
Summary by LLM Watch:
…Gemma 2 introduces several technical modifications to enhance performance and efficiency. The architecture incorporates interleaved local-global attentions, which allow the model to capture both local and global dependencies effectively. Additionally, group-query attention is employed to reduce computational complexity. For the smaller models (2B and 9B), knowledge distillation is used instead of next token prediction during training, enabling them to learn from larger, more powerful models while maintaining a compact size…
D2LLM: Decomposed and Distilled Large Language Models for Semantic Search
MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
A Survey of Large Language Models for Graphs
Grokfast: Accelerated Grokking by Amplifying Slow Gradients
Poisoned LangChain: Jailbreak LLMs by LangChain
Beyond Words: Other Modalities
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Summary by Turing Post:
Trains a vision model on over twenty diverse modalities, enabling it to perform a wide range of tasks without performance loss, enhancing multimodal generation and retrieval.
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Summary by Turing Post:
Explores alignment of responses in multimodal LLMs with image content, proposing Bias-Driven Hallucination Sampling (BDHS) and highlighting the benefits of combined offline and online methods.
ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning
Summary by Turing Post:
Integrates LLMs with the Robot Operating System (ROS) to facilitate intuitive robot programming, incorporating feedback to refine tasks, demonstrating robustness and scalability.
STARK: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge
Summary by Turing Post:
Introduces a large-scale multi-modal conversation dataset featuring diverse social personas and images, enabling the creation of advanced conversation models with superior visual imagination abilities.
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Summary by Last Week in AI:
A new framework called OMG-LLaVA combines powerful pixel-level vision understanding with reasoning abilities, accepting various visual and text prompts for flexible user interaction and achieving image-level, object-level, and pixel-level reasoning and understanding in a single model.
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
Summary by Last Week in AI:
A new benchmark, MMEvalPro, addresses biases in evaluating Large Multimodal Models (LMMs) by introducing a trilogy evaluation pipeline and more rigorous metrics, making evaluations more challenging and trustworthy.
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
Summary by Last Week in AI:
OmniJARVIS is a novel Vision-Language-Action model that uses unified tokenization of multimodal interaction data to enable open-world instruction-following agents in Minecraft, demonstrating strong reasoning and efficient decision-making capabilities.
Magic Insert: Style-Aware Drag-and-Drop
Summary by Last Week in AI:
A new method called Magic Insert allows for style-aware drag-and-drop of subjects from one image to another, addressing the challenges of style-aware personalization and realistic object insertion in stylized images.
Vision language models are blind
Summary by The Gradient:
Researchers from Auburn University and the University of Alberta found that state-of-the-art large language models with vision capabilities (VLMs) are surprisingly poor at understanding spatial information involving basic geometric shapes, such as whether two circles overlap. They propose BlindTest, a new benchmark of 7 simple tasks that are unlikely to have prior answers in natural language on the Internet, to test VLM ability to "see" images like humans do.
Data curation via joint example selection further accelerates multimodal learning
Summary by Last Week in AI:
Joint example selection for data curation accelerates multimodal learning, surpassing state-of-the-art models with significantly fewer iterations and less computation.
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
Summary by Last Week in AI:
FunAudioLLM introduces innovative models for enhancing natural voice interactions between humans and large language models, enabling applications such as speech-to-speech translation and emotional voice chat.
WildGaussians: 3D Gaussian Splatting in the Wild
Summary by Last Week in AI:
A new approach called WildGaussians is introduced to improve 3D Gaussian Splatting's performance in handling in-the-wild data, achieving state-of-the-art results with real-time rendering speeds.
HEMM: Holistic Evaluation of Multimodal Foundation Models
Summary by The Sequence of AI Knowledge:
Researchers from Carnegie Mellon University published a paper introducing the Holistic Evaluation of Multimodal Models (HEMM) framework. HEMM sets the primitives to systematically evaluate multimodal models across dimensions such as basic skills, information flow, and real-world use cases.
Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs
Summary by Turing Post:
Combines Vision-Language Models and topological graphs for effective multimodal instruction navigation in complex environments
MambaVision: A Hybrid Mamba-Transformer Vision Backbone
Summary by Turing Post:
Develops a hybrid architecture that integrates Transformer self-attention into the Mamba model, enhancing performance in various vision tasks
PaliGemma: A versatile 3B VLM for transfer
Summary by Turing Post:
Combines a vision encoder and a language model to effectively transfer knowledge across diverse vision-language tasks
MJ-BENCH: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Summary by Turing Post:
Introduces a benchmark for evaluating multimodal judges in text-to-image generation, assessing their performance on various criteria including safety and bias
Autoregressive Speech Synthesis without Vector Quantization
Summary by Turing Post:
Proposes an autoregressive TTS model that enhances output diversity and robustness
xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart
ColPali: Efficient Document Retrieval with Vision Language Models
Shape of Motion: 4D Reconstruction from a Single Video
Summary by AlphaSignal:
Reconstructing dynamic scenes from single videos is complex due to the ill-posed nature of the task. Traditional methods are limited as they require templates, function only in nearly static scenes, or cannot track full-sequence 3D motion, which makes them unsuitable for complex, moving scenes. This approach uses SE(3) motion bases to model motion as a combination of base movements. It integrates data-driven priors like depth maps and 2D motion tracks into a unified scene representation, enhancing consistency and accuracy.
SEED-Story: Multimodal Long Story Generation with Large Language Model
Improving GFlowNets for Text-to-Image Diffusion Alignment
MambaVision: A Hybrid Mamba-Transformer Vision Backbone
Summary by Last Week in AI:
A novel hybrid Mamba-Transformer vision backbone, MambaVision, is proposed and shown to achieve state-of-the-art performance in image classification and outperform comparably-sized backbones in downstream tasks.
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Summary by The Sequence of AI Knowledge:
Apple Research published a paper detailing SlowFast-LLaVA (SF-LLaVA), a video language model optimized for capturing the spatial semantics and temporal context in videos. SF-LLaVA uses a two-stream input design to aggregate features from different video frames in ways that facilitate knowledge extraction.