LLM Paper Reading Notes - April 2024

Sharing short notes about LLM research papers I came across in March. These notes, intended for my future self, differ in their level of detail and precision. I hope they're still useful in piquing your curiosity and helping you breathe under the waterfall. At the current pace of AI, it takes the power of all of us to keep up.

Check my newsletter for past reading notes!

Reading Notes

Simple and Scalable Strategies to Continually Pre-train Large Language Models

https://arxiv.org/pdf/2403.08763.pdf

This paper discusses updating a Large Language Model (LLM) when new data becomes available. This is usually done by merging the previous training data with the new ones and re-pre-training from scratch, which is expensive; or by continuing the pre-training with the new data, which often leads to poor performance and catastrophic forgetting. This paper highlights simple strategies to make the latter approach (continuing pre-training) successful. These include re-warming and re-decaying the learning rate, and replaying 5% of the previous data.
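
To make the recipe concrete, here is a minimal Python sketch (my own illustration, not the authors' code) of a re-warmed and re-decayed cosine learning-rate schedule combined with a 5% replay mix; the peak/minimum learning rates and warmup length are placeholder values.

import math
import random

def rewarmed_cosine_lr(step, total_steps, warmup_steps=1000,
                       peak_lr=3e-4, min_lr=3e-5):
    """Re-warm the LR from ~0 back up to peak_lr, then re-decay it with cosine annealing."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear re-warming
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

def sample_example(new_data, old_data, replay_ratio=0.05):
    """Draw each training example from the previous corpus with probability replay_ratio (5% replay)."""
    source = old_data if random.random() < replay_ratio else new_data
    return random.choice(source)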

Language models scale reliably with over-training and on downstream tasks

https://arxiv.org/pdf/2403.08540.pdf

This paper proposes an in-depth investigation of current LLM training practices (scaling laws, over-training, perplexity optimization). The main takeaways are: a) tuning an LLM training recipe on a small model is predictive of the performance of the same recipe when training a much larger model; b) training an LLM with more data than is theoretically optimal (over-training) reduces the model loss (and therefore its perplexity); and c) the perplexity reached during training is predictive of downstream task performance.

Structured Entity Extraction using Large Language Models

https://arxiv.org/pdf/2402.04437v2.pdf

This paper addresses the challenge of extracting entities and their relationships from free text. The proposed method targets a predefined set of 10 entity types and 10 entity properties selected from Wikipedia properties, and performs named-entity recognition, entity property extraction, relationship extraction, and coreference resolution in a multi-step process. It consists of fine-tuning an LLM (T5 Base and T5 Large) on Wikidata-based and GPT-4-based datasets derived from Wikipedia. Evaluation on these datasets shows improvements over previous work in both automatic metrics and human evaluation.

UltraWiki: Ultra-fine-grained Entity Set Expansion with Negative Seed Entities

https://arxiv.org/pdf/2403.04247.pdf

The paper discusses employing both a retrieval-based framework (RetExpan) and a generation-based framework (GenExpan) for ultra-fine-grained Entity Set Expansion. GenExpan leverages LLM prompting (LLaMA-7B) to generate entity sets that adhere to ultra-fine semantic classes, defined both positively and negatively. While the work is interesting and undoubtedly valuable, the paper would be more accessible if it were more concise.

Larimar: Large Language Models with Episodic Memory Control

https://arxiv.org/pdf/2403.11901.pdf

This paper tackles the problem of updating the knowledge of a pre-trained LLM with individual facts. Rather than fine-tuning the LLM on these facts or editing parts of the LLM, the approach consists of training a separate model to encode these facts into a memory largely inspired by Generative Pseudo-Inverse Memory (Pham et al., 2021). The LLM decoding is then conditioned on the output of the memory (although this part is not described).

ShortGPT: Layers in Large Language Models are More Redundant Than You Expect

https://arxiv.org/pdf/2403.03853.pdf

This paper proposes a very simple technique to prune entire transformer layers of an LLM. It consists of quantifying the impact of each layer on the hidden state: the more a layer changes the hidden state (measured by the average similarity of the rows before and after the layer), the more important it is. Experiments with Llama and Baichuan2 (7B and 13B) show that this technique outperforms related work on average. The averages reported in the main table don't always match, but this doesn't change the overall results.
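
A hypothetical NumPy sketch of such a layer-importance score as I understand it (assuming the similarity is cosine similarity between per-token hidden states; function names are mine):

import numpy as np

def layer_importance(h_in, h_out, eps=1e-8):
    """Importance of one transformer layer: 1 minus the average cosine similarity
    between each token's hidden state before (h_in) and after (h_out) the layer.
    Shapes: (seq_len, hidden_dim). Low scores mark candidate layers for pruning."""
    cos = np.sum(h_in * h_out, axis=-1) / (
        np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1) + eps)
    return 1.0 - cos.mean()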

Think before you speak: Training Language Models With Pause Tokens

https://arxiv.org/pdf/2310.02226.pdf

This paper shows that introducing artificial “pause” tokens in sentences at pre-training and fine-tuning time does improve the reasoning performance of decoder-only LLMs. The pause tokens are passed through the attention layers but are not taken into account when computing the next-token prediction loss (they are never predicted). During pre-training, N pause tokens are inserted at uniformly random locations in the training document. During fine-tuning, these pause tokens are inserted right before the answer. At inference time, the extraction of the model's outputs is delayed until the last pause token is observed, allowing the model to leverage additional attention computation steps before making its prediction. Surprisingly, this approach improves the performance of a 1B-parameter LLM on 8 benchmarks out of 9. The optimal number of pause tokens seems domain-specific (from 10 to 50). Note that inserting dots “.” (instead of special pause tokens) does not provide any gain. The authors do not compare their approach with Chain-of-Thought prompting and do not clarify whether it could improve CoT.
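
A minimal sketch of the data preparation, under the assumption that a dedicated pause token id is reserved and that masked positions use the usual ignore-index convention of the loss:

import random

PAUSE_ID = 32000   # hypothetical id reserved for the <pause> token
IGNORE = -100      # label value commonly ignored by the loss

def insert_pauses_pretraining(token_ids, n_pauses=10):
    """Insert n_pauses pause tokens at uniformly random positions; pause tokens
    are attended to but excluded from the next-token prediction loss."""
    ids = list(token_ids)
    for _ in range(n_pauses):
        ids.insert(random.randint(0, len(ids)), PAUSE_ID)
    labels = [IGNORE if t == PAUSE_ID else t for t in ids]
    return ids, labels

def insert_pauses_finetuning(prompt_ids, answer_ids, n_pauses=10):
    """At fine-tuning time, place the pause tokens right before the answer."""
    return prompt_ids + [PAUSE_ID] * n_pauses + answer_ids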

Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

https://arxiv.org/pdf/2403.09629.pdf

This paper goes a step further than “Think before you speak” and generates thoughts (short sentences) after each token to force the LLM to quietly think out loud for problems requiring multi-step reasoning. Quietly because these thoughts are not generated as part of the LLM output but used internally to enhance its understanding and predictions. Experimental results with Mistral 7B show improved performance on GSM8K and CommonsenseQA. Interestingly, this approach can also improve Chain of Thought reasoning. Contrary to “Think before you speak”, it does not involve any fine-tuning (only continued pre-training).

Quite intriguing. Is this paving the way to machines that can think?

Re-Reading Improves Reasoning in Large Language Models

https://arxiv.org/pdf/2309.06275v2.pdf

Augmenting a prompt by adding “Read the question again:” and repeating the question does improve the reasoning capabilities of fine-tuned LLMs (gpt-3.5-turbo-0613, text-davinci-003) and non-fine-tuned ones (Llama-2 13B and 70B) across several benchmarks for arithmetic, commonsense, and symbolic reasoning tasks. This approach also improves Chain-of-Thought and other reasoning-eliciting strategies. The authors hypothesize that re-reading allows unidirectional decoder-only models to perform some form of bidirectional attention.
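
The prompt template is simple enough to sketch in a couple of lines; the exact phrasing below approximates the paper's template and appends the usual CoT trigger:

def re2_prompt(question):
    """Re-Reading (RE2) prompting: repeat the question before asking for the answer.
    The wording here is an approximation, not the paper's exact template."""
    return f"{question}\nRead the question again: {question}\nLet's think step by step."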

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

https://arxiv.org/pdf/2403.12968.pdf

Techniques such as In-Context Learning, Chain-of-Thought reasoning, and Retrieval Augmented Generation enable LLMs to tackle complex tasks, albeit at significant computational and financial costs due to the necessity of processing lengthy prompts. This paper introduces a task-agnostic method for prompt compression, leveraging a binary word classifier. This classifier determines which words from the input should be retained and which should be omitted. It is trained on a dataset derived from MeetingBank, utilizing GPT-4. The dataset is segmented into smaller chunks, upon which GPT-4 performs summarization by deletion only, without adding new words. Each pair (original text, summary) serves as a distinct training instance. Although this proposed method does not outperform context-aware (question-aware) methods, it demonstrates impressive efficacy on both in-domain and out-of-domain datasets, nearly matching the performance of using the original, uncompressed prompts.
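
A hedged sketch of the inference side of such a compressor, abstracting the trained classifier behind a hypothetical keep_probabilities function:

def compress_prompt(words, keep_probabilities, compression_rate=0.5):
    """Task-agnostic prompt compression: keep_probabilities(words) stands in for the
    trained binary classifier returning P(keep) per word; we retain the top fraction
    of words and preserve their original order."""
    probs = keep_probabilities(words)
    n_keep = max(1, int(len(words) * compression_rate))
    keep_idx = sorted(sorted(range(len(words)), key=lambda i: probs[i], reverse=True)[:n_keep])
    return " ".join(words[i] for i in keep_idx)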

RAFT: Adapting Language Model to Domain Specific RAG

https://arxiv.org/pdf/2403.10131.pdf

This paper convincingly shows that fine-tuning an LLM for RAG does improve RAG-based question answering. Fine-tuning for RAG means fine-tuning the model on 1) question-answer pairs augmented with a document containing the answer and a few distractor documents, as well as question-answer pairs containing solely distractor documents; and 2) answers expressed as Chain-of-Thought reasoning. Best results are obtained when 40 to 60% of the contexts contain 1 relevant document and 4 distractor documents (and the remaining contexts contain only distractor documents).
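
A minimal sketch of how one such training instance could be assembled (field names and the golden-document probability are my own placeholders):

import random

def raft_example(question, cot_answer, golden_doc, distractor_docs,
                 p_golden=0.5, n_distractors=4):
    """Build one RAFT-style training instance: with probability p_golden the context
    holds the golden document plus distractors, otherwise distractors only.
    The target is the chain-of-thought style answer."""
    if random.random() < p_golden:
        context = [golden_doc] + random.sample(distractor_docs, n_distractors)
    else:
        context = random.sample(distractor_docs, n_distractors + 1)
    random.shuffle(context)
    return {"question": question, "context": context, "answer": cot_answer}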

Improving Sentence Embeddings with an Automatically Generated NLI Dataset

https://arxiv.org/pdf/2402.15132v1.pdf

PromptEOL (Jiang et al., 2023) proposed a technique to generate sentence embeddings using a decoder-only model (the dominant architecture in current LLMs). It consists of prompting an LLM with “This sentence:"[text]" means in one word: "”, where [text] is replaced with the sentence to embed, and using the hidden state after “in one word:” as the embedding. While this approach works, PromptEOL achieves its best results when fine-tuned on a natural language inference (NLI) dataset. This paper proposes to reduce the reliance of PromptEOL on large manually annotated datasets by automatically generating them with LLaMA-2-7B. This approach outperforms other methods using unsupervised datasets, but still lags behind manually curated datasets.
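
For reference, a hedged sketch of PromptEOL-style embedding extraction with Hugging Face transformers (the model name is a placeholder; any decoder-only LM should work):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def prompteol_embedding(text):
    """PromptEOL-style sentence embedding: the final-layer hidden state of the
    last prompt token (right after 'means in one word:"')."""
    prompt = f'This sentence:"{text}" means in one word:"'
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]    # (hidden_dim,) vector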

USER-LLM: Efficient LLM Contextualization with User Embeddings

https://arxiv.org/pdf/2402.13598.pdf

This paper addresses the challenge of contextualizing LLMs so that the output they generate is personalized to a specific user. This is accomplished by 1) training an autoregressive transformer to embed a sequence of user activities, and 2) performing cross-attention between the user embeddings and the intermediate text representations within the LLM.
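
A toy NumPy sketch of step 2 (cross-attention between the user-activity embeddings and the text hidden states), purely illustrative and not the authors' exact architecture:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(text_h, user_h, Wq, Wk, Wv):
    """Single-head cross-attention: text hidden states (queries) attend to the
    user-activity embeddings (keys/values), injecting user context into the LLM.
    text_h: (T, d), user_h: (U, d), W*: (d, d)."""
    q, k, v = text_h @ Wq, user_h @ Wk, user_h @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return text_h + scores @ v          # residual add, as in a standard cross-attention block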

DEMYSTIFYING EMBEDDING SPACES USING LARGE LANGUAGE MODELS

https://arxiv.org/pdf/2310.04475v2.pdf

This paper proposes to train a 2-layer neural network to map domain-specific embeddings to the embedding space of an LLM (PaLM 2-XS). This makes it possible to prompt the LLM with sentences that contain embeddings, e.g., “List five positive characteristics of the movie <embeddings>”. Notably, the input embeddings can be diverse, including averages or combinations of different embeddings, enabling the LLM to generate responses about fictitious entities (e.g., combining the embeddings for “Forrest Gump” and “Barbie” to express the concept of a cross-over movie). Leaving aside the cross-domain mapping advantage of their approach, I wonder why they don't touch upon the fact that fictitious entities could also be handled without training an adapter: use the LLM to generate embeddings, combine them, and feed the resulting embedding back into the LLM.
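
A minimal PyTorch sketch of such an adapter (the dimensions and hidden size are assumptions of mine):

import torch.nn as nn

class EmbeddingAdapter(nn.Module):
    """Hypothetical 2-layer MLP mapping a domain-specific embedding (e.g., a movie
    embedding) into the LLM's token-embedding space, so it can stand in for the
    <embeddings> placeholder in a prompt."""
    def __init__(self, domain_dim, llm_dim, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(domain_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, domain_embedding):
        return self.net(domain_embedding)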

Resonance RoPE: Improving Context Length Generalization of Large Language Models

https://arxiv.org/pdf/2403.00071.pdf

A lot of the research around Rotary Position Embeddings (RoPE) has focused on improving LLMs' capability to process long prompts (context length). This paper instead focuses on LLMs' ability to generate tokens correctly over very long sequences, and in particular on how well they handle out-of-distribution (OOD) token positions that they were not explicitly trained on (a.k.a. “Train Short, Test Long”, or TSTL, scenarios). It introduces POSGEN, a new synthetic benchmark specifically designed to evaluate generation at OOD positions, and proposes Resonance RoPE, demonstrating that it narrows the generalization gap in TSTL scenarios on POSGEN. Resonance RoPE works by modifying the wavelength parameters of RoPE to ensure better alignment with the positions seen during training on shorter sequences.
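
A hedged sketch of the frequency computation, assuming my reading is right that Resonance RoPE rounds each wavelength to the nearest integer number of positions:

import numpy as np

def rope_frequencies(dim, base=10000.0, resonance=True):
    """Per-pair angular frequencies of RoPE. With resonance=True, each wavelength
    2*pi/theta_i is rounded to the nearest integer number of positions (my reading
    of Resonance RoPE), so every feature repeats exactly on positions seen in training."""
    theta = base ** (-np.arange(0, dim, 2) / dim)
    if resonance:
        wavelengths = np.round(2 * np.pi / theta).clip(min=1)
        theta = 2 * np.pi / wavelengths
    return theta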

Language Models Hallucinate, but May Excel at Fact Verification

https://arxiv.org/pdf/2310.14564v2.pdf

This paper studies the factuality of LLMs. It compares FLAN-T5-11B, LLaMA 30B, LLaMA 65B and GPT-3.5 on two tasks: completing a sentence about a factual claim from Wikipedia given the first two tokens; and generating a paragraph of five sentences about a given entity from Wikipedia. Larger models tend to perform better and GPT-3.5 (175B) performs significantly better than the others. However, when used for checking the factuality of statements, FLAN-T5-11B, the least factual generator in the study, performs best as a fact verifier. It also outperforms a supervised model trained on top of FLAN-T5-780M.

Stealing Part of a Production Language Model

https://arxiv.org/pdf/2403.06634.pdf

This paper describes techniques that can be used to identify the dimension of the hidden layer, and even the complete embedding projection layer, of a black-box transformer language model. They accomplish this using clever mathematical tricks. A fun read! They confirm for the first time that OpenAI’s ada and babbage language models have a hidden dimension of 1024 and 2048, respectively.
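
The hidden-dimension part of the attack can be illustrated with a toy NumPy simulation (a simulated linear output head stands in for the real API): logits live in a subspace whose rank equals the hidden dimension, so stacking logit vectors from many queries and counting significant singular values reveals that dimension.

import numpy as np

vocab, hidden, n_queries = 4096, 1024, 1536
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab, hidden))                     # unknown output projection
logits = (W @ rng.normal(size=(hidden, n_queries))).T    # one logit vector per query
sv = np.linalg.svd(logits, compute_uv=False)
estimated_hidden_dim = int((sv > sv[0] * 1e-6).sum())    # ≈ 1024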

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

https://arxiv.org/pdf/2305.05176.pdf

Lists a few simple ideas to reduce the cost of using LLMs: reducing the size of the prompt, grouping requests, caching, or leveraging and cascading cheaper LLMs.
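
The cascading idea fits in a few lines; call_llm and is_good_enough below are hypothetical user-supplied functions, not part of any library:

def cascade(prompt, models, call_llm, is_good_enough):
    """LLM cascade in the spirit of FrugalGPT: try models from cheapest to most
    expensive and stop as soon as a scorer accepts the answer."""
    for model in models:                    # e.g., ["small", "medium", "large"]
        answer = call_llm(model, prompt)
        if is_good_enough(prompt, answer):
            return answer
    return answer                           # fall back to the most expensive model's answer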

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

https://arxiv.org/pdf/2403.09611v1.pdf

Typically, my focus is narrowly centered on textual modalities, leaving little room to explore others. However, I had to make an exception for this comprehensive overview from Apple about the design and training of their Multimodal Large Language Model (MLLM). Very informative!

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

https://arxiv.org/pdf/2402.15627.pdf

I only eyeballed this paper but it seems a great source of insights for training LLMs at scale. Definitely on my TODO list.

Reverse Training to Nurse the Reversal Curse

https://arxiv.org/pdf/2403.13799.pdf

It is well known that when LLMs are trained on “A has a feature B”, they do not generalize to “B is a feature of A”. To tackle this problem, this paper proposes to train LLMs by passing the data in both directions! The question is then how the data should be reversed: by processing sentences right to left, word by word? Should the order of entities be preserved? Can sentences be randomly chunked and reversed only within these chunks, to save the cost of entity recognition? Experimental results suggest that reverse training can help greatly, and that entity-preserving and random-segment reverse training are more effective. Interestingly, reverse training also seems to benefit standard (forward) tasks.
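
A small sketch of two of the reversal variants discussed above (entity spans are assumed to be given by an external recognizer):

def reverse_words(sentence):
    """Plain word-level reversal: 'Paris is the capital of France'
    -> 'France of capital the is Paris'."""
    return " ".join(reversed(sentence.split()))

def reverse_entity_preserving(tokens, entity_spans):
    """Entity-preserving reversal: reverse the token order but keep each entity's
    internal word order, given (start, end) spans of the entities."""
    units, i = [], 0
    while i < len(tokens):
        span = next((s for s in entity_spans if s[0] == i), None)
        if span:
            units.append(tokens[span[0]:span[1]])   # keep multi-word entity intact
            i = span[1]
        else:
            units.append([tokens[i]])
            i += 1
    return [tok for unit in reversed(units) for tok in unit]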

Cascade Speculative Drafting for Even Faster LLM Inference

https://arxiv.org/pdf/2312.11462v4.pdf

Speculative decoding accelerates the inference process (next-token prediction) in large language models. Rather than having the model predict every subsequent token, a smaller, cheaper auxiliary model is used to project several tokens ahead. The larger model's role is then primarily to verify (accept or reject) these predictions. While maintaining the same output distribution, inference is faster because verifying token predictions is cheaper than generating them, and can be parallelized. This paper proposes to apply speculative decoding to the auxiliary model itself, by cascading several models of various sizes and using the smaller models to generate the later tokens. Experiments with FLAN-T5-XXL and LLaMA-2-chat-7B show some speedup over speculative decoding, although I am not sure how they arrive at the reported 44% and 81% speedups. Also, a comparison with recent improvements over the original speculative decoding algorithm would be interesting.
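
For intuition, a simplified greedy variant of the draft-and-verify loop (the paper's cascades build on the sampled version of this; draft_next and target_next are hypothetical greedy next-token functions):

def speculative_decode_greedy(draft_next, target_next, prefix, k=4, max_new=64):
    """Simplified greedy speculative decoding: a cheap draft model proposes k tokens,
    the large target model verifies them; tokens are accepted while both models agree."""
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new:
        draft = []
        for _ in range(k):                          # draft k tokens ahead
            draft.append(draft_next(tokens + draft))
        for i, tok in enumerate(draft):             # verification (parallelizable in practice)
            expected = target_next(tokens + draft[:i])
            if tok == expected:
                tokens.append(tok)
            else:
                tokens.append(expected)             # correct the first mismatch and stop this round
                break
        else:
            tokens.append(target_next(tokens))      # bonus token when all drafts are accepted
    return tokens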

Reformatted Alignment

https://arxiv.org/pdf/2402.12219.pdf

Alignment generally refers to the process of ensuring that the outputs of LLMs are in accordance with human preferences, intentions, and values. Ensuring that LLMs follow instructions, understand questions, and provide factually accurate and contextually appropriate answers is one of its components. This paper shows that the performance of LLMs in areas like mathematical reasoning, factuality, and readability can be improved by… simply reformatting the datasets they are instruction fine-tuned on. A low-hanging but tasty fruit.

Beyond My Bandwidth

ChatMusician: Understanding and Generating Music Intrinsically with LLM

https://arxiv.org/pdf/2402.16153.pdf

LLAMAFACTORY: Unified Efficient Fine-Tuning of 100+ Language Models

https://arxiv.org/pdf/2403.13372.pdf

Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models

https://arxiv.org/pdf/2403.12881.pdf

PERL: Parameter Efficient Reinforcement Learning from Human Feedback

https://arxiv.org/pdf/2403.10704.pdf

RewardBench: Evaluating Reward Models for Language Modeling

https://arxiv.org/pdf/2403.13787.pdf

Evolutionary Optimization of Model Merging Recipes

https://arxiv.org/pdf/2403.13187.pdf

ALGORITHMIC PROGRESS IN LANGUAGE MODELS

https://arxiv.org/pdf/2403.05812.pdf

CHAIN-OF-TABLE: EVOLVING TABLES IN THE REASONING CHAIN FOR TABLE UNDERSTANDING

https://arxiv.org/pdf/2401.04398.pdf

SparQ Attention: Bandwidth-Efficient LLM Inference

https://arxiv.org/pdf/2312.04985v3.pdf

MASKED STRUCTURAL GROWTH FOR 2X FASTER LANGUAGE MODEL PRE-TRAINING

https://arxiv.org/pdf/2305.02869v2.pdf

Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification

https://arxiv.org/pdf/2403.04696v1.pdf

Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy

https://arxiv.org/pdf/2403.04283v1.pdf

GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability

https://arxiv.org/pdf/2403.04483v1.pdf

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

https://arxiv.org/pdf/2403.03218.pdf

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

https://arxiv.org/pdf/2403.03507v1.pdf

LLM4Decompile: Decompiling Binary Code with Large Language Models

https://arxiv.org/pdf/2403.05286v1.pdf

SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents

https://arxiv.org/pdf/2403.08715.pdf

Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU

https://arxiv.org/pdf/2403.06504.pdf

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

https://arxiv.org/pdf/2403.07816.pdf

AtP*: An efficient and scalable method for localizing LLM behaviour to components

https://arxiv.org/pdf/2403.00745.pdf

Theoretical Foundations of Deep Selective State-Space Models

https://arxiv.org/pdf/2402.19047.pdf

Non-Vacuous Generalization Bounds for Large Language Models

https://arxiv.org/pdf/2312.17173.pdf

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

https://arxiv.org/pdf/2402.19427.pdf

LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error

https://arxiv.org/pdf/2403.04746.pdf

Learning to Decode Collaboratively with Multiple Language Models

https://arxiv.org/pdf/2403.03870.pdf

Stop Regressing: Training Value Functions via Classification for Scalable Deep RL

https://arxiv.org/pdf/2403.03950.pdf

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

https://arxiv.org/pdf/2312.07398.pdf

Harnessing Retrieval-Augmented Generation (RAG) for Uncovering Knowledge Gaps

https://arxiv.org/pdf/2312.07796.pdf

Instruction-tuned Language Models are Better Knowledge Learners

https://arxiv.org/pdf/2402.12847.pdf

Design2Code: How Far Are We From Automating Front-End Engineering?

https://arxiv.org/pdf/2403.03163v1.pdf

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

https://arxiv.org/pdf/2402.13064.pdf

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition

https://arxiv.org/pdf/2402.15220.pdf

Divide-or-Conquer? Which Part Should You Distill Your LLM?

https://arxiv.org/pdf/2402.15000.pdf

RelayAttention for Efficient Large Language Model Serving with Long System Prompts

https://arxiv.org/pdf/2402.14808v2.pdf

Recourse for Reclamation: Chatting with Generative Language Models

https://arxiv.org/pdf/2403.14467.pdf

Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap

https://arxiv.org/pdf/2402.19450v1.pdf

CLLMs: Consistency Large Language Models

https://arxiv.org/pdf/2403.00835.pdf

Aligning Large Language Models to a Domain-specific Graph Database

https://arxiv.org/pdf/2402.16567v2.pdf

From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs

https://arxiv.org/pdf/2402.18157v1.pdf

GPTVQ: The Blessing of Dimensionality for LLM Quantization

https://arxiv.org/pdf/2402.15319v1.pdf

EVERYTHING OF THOUGHTS: DEFYING THE LAW OF PENROSE TRIANGLE FOR THOUGHT GENERATION

https://arxiv.org/pdf/2311.04254v3.pdf

FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization

https://arxiv.org/pdf/2402.17985v1.pdf

TnT-LLM: Text Mining at Scale with Large Language Models

https://arxiv.org/pdf/2403.12173.pdf

Evaluating Frontier Models for Dangerous Capabilities

https://arxiv.org/pdf/2403.13793.pdf

DATA INTERPRETER: AN LLM AGENT FOR DATA SCIENCE

https://arxiv.org/pdf/2402.18679v1.pdf

MathScale: Scaling Instruction Tuning for Mathematical Reasoning

https://arxiv.org/pdf/2403.02884.pdf

KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents

https://arxiv.org/pdf/2403.03101v1.pdf

StarCoder 2 and The Stack v2: The Next Generation

https://arxiv.org/pdf/2402.19173v1.pdf

Beyond Words: Other Modalities

When Do We Not Need Larger Vision Models?

https://arxiv.org/pdf/2403.13043.pdf

Multistep Consistency Models

https://arxiv.org/pdf/2403.06807.pdf

Improving fine-grained understanding in image-text pre-training

https://arxiv.org/pdf/2401.09865.pdf

FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation

https://arxiv.org/pdf/2403.12962v1.pdf

T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy

https://arxiv.org/pdf/2403.14610v1.pdf

UNI-SMART: UNIVERSAL SCIENCE MULTIMODAL ANALYSIS AND RESEARCH TRANSFORMER

https://arxiv.org/pdf/2403.10301.pdf

Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs

https://arxiv.org/pdf/2403.12596.pdf

MATHVERSE: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

https://arxiv.org/pdf/2403.14624.pdf

VideoAgent: Long-form Video Understanding with Large Language Model as Agent

https://arxiv.org/pdf/2403.10517.pdf

Mora: Enabling Generalist Video Generation via A Multi-Agent Framework

https://arxiv.org/pdf/2403.13248.pdf

EFFICIENT VIDEO DIFFUSION MODELS VIA CONTENT-FRAME MOTION-LATENT DECOMPOSITION

https://arxiv.org/pdf/2403.14148.pdf

DeepSeek-VL: Towards Real-World Vision-Language Understanding

https://arxiv.org/abs/2403.05525v2

MoAI: Mixture of All Intelligence for Large Language and Vision Models

https://arxiv.org/pdf/2403.07508.pdf

GiT: Towards Generalist Vision Transformer through Universal Language Interface

https://arxiv.org/pdf/2403.09394.pdf

AtomoVideo: High Fidelity Image-to-Video Generation

https://arxiv.org/pdf/2403.01800v1.pdf

Enhancing Vision-Language Pre-training with Rich Supervisions

https://arxiv.org/pdf/2403.03346v1.pdf

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

https://arxiv.org/pdf/2403.09530.pdf

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

https://arxiv.org/pdf/2403.03100v1.pdf

Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters

https://arxiv.org/pdf/2403.02677.pdf

MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets

https://arxiv.org/pdf/2403.03194.pdf

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

https://arxiv.org/pdf/2403.04692v1.pdf

Genie: Generative Interactive Environments

https://arxiv.org/pdf/2402.15391.pdf

Learning and Leveraging World Models in Visual Representation Learning

https://arxiv.org/pdf/2403.00504.pdf

Beyond Language Models: Byte Models are Digital World Simulators

https://arxiv.org/pdf/2402.19155v1.pdf
