LLM Paper Reading Notes - April 2024
Sharing short notes about LLM research papers I came across in March. These notes, intended for my future self, differ in their level of detail and precision. I hope they're still useful in piquing your curiosity and helping you breathe under the waterfall. At the current pace of AI, it takes the power of all of us to keep up.
Check my newsletter for past reading notes!
Reading Notes
Simple and Scalable Strategies to Continually Pre-train Large Language Models
This paper discusses how to update a Large Language Model (LLM) when new data becomes available. This is usually done either by merging the previous training data with the new data and re-pre-training from scratch, which is expensive, or by continuing the pre-training on the new data alone, which often leads to poor performance and catastrophic forgetting. This paper highlights simple strategies to make the latter approach (continued pre-training) successful: re-warming and re-decaying the learning rate, and replaying 5% of the previous data.
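A minimal sketch of these two ingredients, assuming a cosine schedule and illustrative hyperparameters (the warm-up length, learning-rate bounds, and replay rate below are placeholders, not the paper's exact settings):

```python
import math
import random

def rewarmed_cosine_lr(step, total_steps, max_lr=3e-4, min_lr=3e-5, warmup_steps=1000):
    """Re-warm the learning rate from min_lr back up to max_lr, then re-decay it
    with a cosine schedule over the continual pre-training run."""
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def replay_mixture(new_docs, old_docs, replay_fraction=0.05):
    """Stream the new corpus while replaying ~5% of documents from the previous one."""
    for doc in new_docs:
        if random.random() < replay_fraction:
            yield random.choice(old_docs)
        yield doc
```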
Language models scale reliably with over-training and on downstream tasks
This paper proposes an in-depth investigation of current LLM training practices (scaling laws, over-training, perplexity optimization). The main takeaways are a) tuning an LLM training recipe on a small model is predictive of the performance of the same recipe when training a much larger model; b) training an LLM with more data than theoretically optimal (over-training) reduces the model loss (and therefore perplexity); and c) perplexity during training is predictive of downstream task performance.
Structured Entity Extraction using Large Language Models
This paper addresses the challenge of extracting entities and their relationships from free text. The proposed method targets a predefined set of 10 entity types and 10 entity properties selected from Wikipedia properties, and performs named-entity recognition, entity property extraction, relationship extraction, and coreference resolution in a multi-step process. It consists of fine-tuning an LLM (T5 Base and T5 Large) on Wikidata-based and GPT-4-based datasets derived from Wikipedia. Evaluation on these datasets shows improvements over previous work in both automatic metrics and human evaluation.
UltraWiki: Ultra-fine-grained Entity Set Expansion with Negative Seed Entities
The paper discusses employing both a retrieval-based framework (RetExpan) and a generation-based framework (GenExpan) for ultra-fine-grained Entity Set Expansion. GenExpan leverages LLM prompting (LLaMA-7B) to generate entity sets that adhere to ultra-fine semantic classes, defined both positively and negatively. While the work is interesting and undoubtedly valuable, the paper would be more accessible if it were more concise.
Larimar: Large Language Models with Episodic Memory Control
This paper tackles the problem of updating the knowledge of a pre-trained LLM with piecewise facts. Rather than fine-tuning the LLM on these facts or editing parts of the LLM, the approach consists of training a separate model to encode these facts into a memory largely inspired by Generative Pseudo-Inverse Memory (Pham et al., 2021). The LLM decoding is then conditioned on the output of the memory (although this part is not described).
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
This paper proposes a very simple technique to prune entire transformer layers of an LLM. It consists of quantifying the impact of each layer on the hidden state: the more the layer changes the hidden state (measured by the average cosine similarity between the hidden-state rows before and after the layer), the more important it is. Experiments with Llama and Baichuan2 (7B and 13B) show that this technique outperforms related work on average. The averages reported in the main table don’t always match, but this doesn’t change the overall results.
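A minimal sketch of the layer-scoring idea, assuming a Hugging Face-style causal LM that exposes per-layer hidden states (the batching and pruning step itself are left out):

```python
import torch

@torch.no_grad()
def layer_importance(model, input_ids):
    """Score each transformer layer by how much it changes the hidden state:
    1 minus the average cosine similarity between token vectors entering and
    leaving the layer. Layers with the lowest scores barely alter the hidden
    state and are candidates for pruning."""
    hs = model(input_ids, output_hidden_states=True).hidden_states  # (embeddings, layer_1, ..., layer_L)
    scores = []
    for before, after in zip(hs[:-1], hs[1:]):
        cos = torch.nn.functional.cosine_similarity(before, after, dim=-1)  # (batch, seq_len)
        scores.append(1.0 - cos.mean().item())
    return scores  # one importance score per layer
```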
Think before you speak: Training Language Models With Pause Tokens
This paper shows that introducing artificial “pause” tokens in sentences at pre-training and fine-tuning time does improve the reasoning performance of decoder-only LLMs. The pause tokens are passed through the attention layers but are not taken into account when computing the next-token prediction loss (they are never predicted). During pre-training, N pause tokens are inserted at uniformly random locations in the training document. During fine-tuning, these pause tokens are inserted right before the answer. At inference time, the extraction of the model's outputs is delayed until the last pause token is observed, allowing the model to leverage additional attention computation steps to make its prediction. Surprisingly, this approach improves the performance of a 1B-parameter LLM on 8 benchmarks out of 9. The optimal number of pause tokens seems domain specific (from 10 to 50). Note that inserting dots “.” (instead of special pause tokens) does not provide any gain. The authors do not compare their approach with Chain of Thought and do not clarify whether it could improve CoT.
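A rough sketch of how the pause tokens could be injected into training sequences (the token string, counts, and helper names are assumptions; the loss masking over pause tokens is not shown):

```python
import random

PAUSE = "<pause>"  # assumed special token added to the tokenizer's vocabulary

def insert_pauses_pretraining(tokens, n_pauses=10):
    """Insert n pause tokens at uniformly random positions in a training document."""
    tokens = list(tokens)
    for _ in range(n_pauses):
        tokens.insert(random.randint(0, len(tokens)), PAUSE)
    return tokens

def insert_pauses_finetuning(prompt_tokens, answer_tokens, n_pauses=10):
    """Append n pause tokens right before the answer; the loss is computed only
    on the answer tokens, never on the pause tokens."""
    return list(prompt_tokens) + [PAUSE] * n_pauses + list(answer_tokens)
```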
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
This paper goes a step further than “Think before you speak” and generates thoughts (short sentences) after each token to force the LLM to quietly think out loud for problems requiring multi-step reasoning. Quietly because these thoughts are not generated as part of the LLM output but used internally to enhance its understanding and predictions. Experimental results with Mistral 7B show improved performance on GSM8K and CommonsenseQA. Interestingly, this approach can also improve Chain of Thought reasoning. Contrary to “Think before you speak”, it does not involve any fine-tuning (only continued pre-training).
Quite intriguing. Is this paving the way to machines that can think?
Re-Reading Improves Reasoning in Large Language Models
Augmenting a prompt by adding “Read the question again:” and repeating the question does improve the reasoning capabilities of fine-tuned LLMs (gpt-3.5-turbo-0613, text-davinci-003) and non-fine-tuned ones (Llama-2 13B and 70B) across several benchmarks for arithmetic, commonsense, and symbolic reasoning tasks. This approach also improves Chain-of-Thought and other reasoning-eliciting strategies. The authors hypothesize that re-reading allows unidirectional decoder-only models to perform a form of bidirectional attention.
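The prompt transformation itself is trivial; a sketch of what such a template could look like (the exact wording used in the paper may differ):

```python
def re_reading_prompt(question: str) -> str:
    """Wrap a question so the model reads it twice before answering."""
    return f"Q: {question}\nRead the question again: {question}\nA:"
```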
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
Techniques such as In-Context Learning, Chain-of-Thought reasoning, and Retrieval Augmented Generation enable LLMs to tackle complex tasks, albeit at significant computational and financial costs due to the necessity of processing lengthy prompts. This paper introduces a task-agnostic method for prompt compression, leveraging a binary word classifier. This classifier determines which words from the input should be retained and which should be omitted. It is trained on a dataset derived from MeetingBank, utilizing GPT-4. The dataset is segmented into smaller chunks, upon which GPT-4 performs summarization by deletion only, without adding new words. Each pair (original text, summary) serves as a distinct training instance. Although this proposed method does not outperform context-aware (question-aware) methods, it demonstrates impressive efficacy on both in-domain and out-of-domain datasets, nearly matching the performance of using the original, uncompressed prompts.
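A sketch of how such a classifier could be used at inference time, assuming a token-classification checkpoint fine-tuned with label 1 = keep and 0 = drop (the model path and keep ratio below are placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "path/to/binary-keep-drop-classifier"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)

@torch.no_grad()
def compress_prompt(text: str, keep_ratio: float = 0.5) -> str:
    """Keep the tokens most likely to be labeled 'keep' and drop the rest."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    p_keep = model(**enc).logits.softmax(-1)[0, :, 1]     # P(keep) per token
    k = max(1, int(keep_ratio * p_keep.numel()))
    keep = p_keep.topk(k).indices.sort().values            # preserve original token order
    return tokenizer.decode(enc.input_ids[0, keep], skip_special_tokens=True)
```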
RAFT: Adapting Language Model to Domain Specific RAG
This paper convincingly shows that fine-tuning an LLM for RAG does improve RAG-based question answering. Fine-tuning for RAG means fine-tuning the model on 1) question-answer pairs augmented with a document containing the answer and a few distractor documents, as well as question-answer pairs containing solely distractor documents; and 2) answers expressed as Chain-of-Thought reasoning. Best results are obtained when 40 to 60% of the contexts contain 1 relevant document and 4 distractor documents (with the remaining contexts containing only distractor documents).
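A sketch of how one such training example could be assembled (the formatting, field names, and golden-document ratio are illustrative, not the paper's exact recipe):

```python
import random

def make_raft_example(question, cot_answer, golden_doc, distractor_pool,
                      p_golden=0.5, n_distractors=4):
    """Build one fine-tuning example: with probability p_golden the context
    contains the document holding the answer plus distractors; otherwise it
    contains distractors only. The target is a chain-of-thought answer."""
    docs = random.sample(distractor_pool, n_distractors)
    docs.append(golden_doc if random.random() < p_golden else random.choice(distractor_pool))
    random.shuffle(docs)
    context = "\n\n".join(f"[Document {i + 1}]\n{d}" for i, d in enumerate(docs))
    return {"prompt": f"{context}\n\nQuestion: {question}\nAnswer:", "target": cot_answer}
```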
Improving Sentence Embeddings with an Automatically Generated NLI Dataset
PromptEOL (Jiang et al., 2023) proposed a technique to generate sentence embeddings using a decoder-only model (the dominant architecture in current LLMs). It consists of prompting an LLM with “This sentence: "[text]" means in one word: "”, where [text] is replaced with the sentence to embed, and using the hidden state after “in one word: "” as the embedding. While this approach works, PromptEOL achieves its best results when fine-tuned on a natural language inference (NLI) dataset. This paper proposes to reduce the reliance of PromptEOL on large manually annotated datasets by automatically generating them with LLaMA-2-7B. This approach outperforms other methods using unsupervised datasets, but still lags behind manually curated datasets.
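A minimal sketch of the PromptEOL extraction with a decoder-only model (the checkpoint is illustrative, not necessarily the one evaluated in the paper):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # any decoder-only LM would do
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

@torch.no_grad()
def prompteol_embedding(sentence: str) -> torch.Tensor:
    """Return the hidden state right after 'in one word: "' as the sentence embedding."""
    prompt = f'This sentence: "{sentence}" means in one word: "'
    inputs = tokenizer(prompt, return_tensors="pt")
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]  # (1, seq_len, dim)
    return hidden[0, -1]
```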
USER-LLM: Efficient LLM Contextualization with User Embeddings
This paper addresses the challenge of contextualizing LLMs so that the output they generate is personalized to a specific user. This is accomplished by 1) training an autoregressive transformer to embed a sequence of user activities, and 2) performing cross-attention between the user embeddings and the intermediate text representations within the LLM.
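A sketch of the second ingredient, a cross-attention block that lets intermediate LLM token representations attend to user-activity embeddings (the dimensions and residual wiring are assumptions):

```python
import torch
import torch.nn as nn

class UserCrossAttention(nn.Module):
    """Cross-attend from LLM token representations to a sequence of user embeddings."""
    def __init__(self, llm_dim=4096, user_dim=512, n_heads=8):
        super().__init__()
        self.project_user = nn.Linear(user_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)

    def forward(self, text_hidden, user_embeddings):
        # text_hidden: (batch, seq_len, llm_dim); user_embeddings: (batch, n_events, user_dim)
        user = self.project_user(user_embeddings)
        attended, _ = self.attn(query=text_hidden, key=user, value=user)
        return text_hidden + attended  # residual injection into the LLM block
```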
Demystifying Embedding Spaces using Large Language Models
This paper proposes to train a 2-layer neural network to map domain-specific embeddings to the embedding space of an LLM (PaLM 2-XS). This makes it possible to prompt the LLM with sentences that contain embeddings, e.g., “List five positive characteristics of the movie <embeddings>”. Notably, the input embeddings can be diverse, including averages or combinations of different embeddings, enabling the LLM to generate responses about fictitious entities (e.g., combining the embeddings for “Forrest Gump” and “Barbie” to express the concept of a cross-over movie). Leaving aside the cross-domain mapping advantage of their approach, I wonder why they don’t touch upon the fact that fictitious entities could also be handled without training an adapter: use the LLM to generate embeddings, combine them, and feed the resulting embedding back into the LLM.
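The adapter itself is tiny; a sketch with illustrative dimensions (the hidden width and activation are assumptions):

```python
import torch
import torch.nn as nn

class EmbeddingAdapter(nn.Module):
    """2-layer MLP mapping domain-specific embeddings (e.g., a recommender's item
    vectors) into the LLM's token-embedding space, so a mapped vector can be
    spliced into the prompt where <embeddings> appears."""
    def __init__(self, domain_dim=256, llm_dim=4096, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(domain_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, domain_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(domain_embedding)
```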
Resonance RoPE: Improving Context Length Generalization of Large Language Models
A lot of the research around Rotary Position Embeddings (RoPE) has focused on improving LLMs' capability to process long prompts (context length). This paper instead focuses on LLMs' ability to generate tokens correctly over very long sequences, in particular on how well they handle out-of-distribution (OOD) token positions they were not trained on (a.k.a. “Train Short, Test Long”, or TSTL, scenarios). It introduces POSGEN, a new synthetic benchmark specifically designed to evaluate OOD generation, and proposes Resonance RoPE, demonstrating that it narrows the generalization gap in TSTL scenarios on POSGEN. Resonance RoPE works by modifying the wavelength parameters of RoPE so that positions beyond the training length align better with those seen during training on shorter sequences.
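My understanding of the core trick, sketched under the assumption that each RoPE wavelength is rounded to an integer number of tokens so every feature repeats exactly within the training length (see the paper for the precise formulation):

```python
import torch

def resonance_rope_frequencies(dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE frequencies, with each feature's wavelength snapped to the
    nearest integer so positions beyond the training length reuse phases already
    seen during training."""
    freqs = base ** (-torch.arange(0, dim, 2).float() / dim)   # standard theta_i
    wavelengths = 2 * torch.pi / freqs
    rounded = torch.clamp(torch.round(wavelengths), min=1.0)   # integer token periods
    return 2 * torch.pi / rounded
```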
Language Models Hallucinate, but May Excel at Fact Verification
This paper studies the factuality of LLMs. It compares FLAN-T5-11B, LLaMA 30B, LLaMA 65B and GPT-3.5 on two tasks: completing a sentence about a factual claim from Wikipedia given the first two tokens; and generating a paragraph of five sentences about a given entity from Wikipedia. Larger models tend to perform better, and GPT-3.5 (175B) performs significantly better than the others. However, when used for checking the factuality of statements, FLAN-T5-11B, the least factual generator in the study, performs the best as a fact verifier. It also outperforms a supervised model trained on top of FLAN-T5-780M.
Stealing Part of a Production Language Model
This paper describes techniques that can be used to identify the dimension of the hidden layer, and even the complete embedding projection layer, of a black-box transformer language model. They accomplish this using clever mathematical tricks. A fun read! They confirm for the first time that OpenAI’s ada and babbage language models have a hidden dimension of 1024 and 2048, respectively.
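The core of the hidden-dimension attack, as I understand it: full-vocabulary logit vectors are a linear projection of the hidden state, so they span a subspace whose rank equals the hidden width, which an SVD reveals. A toy sketch (assumes access to complete logit vectors over many random prompts):

```python
import numpy as np

def estimate_hidden_dim(logit_vectors: np.ndarray, tol: float = 1e-3) -> int:
    """Estimate the hidden width from a (num_queries, vocab_size) matrix of logits:
    the matrix has rank ~hidden_dim, i.e., only that many non-negligible
    singular values (num_queries must exceed the hidden width)."""
    s = np.linalg.svd(logit_vectors, compute_uv=False)
    return int((s > tol * s[0]).sum())
```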
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
Lists a few simple ideas to reduce the cost of using LLMs: reducing the size of the prompt, grouping requests, caching, and leveraging cascades of cheaper LLMs.
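A toy sketch of the cascading idea (the model list, scorer, and threshold are all hypothetical):

```python
def frugal_cascade(prompt, models, scorer, threshold=0.8):
    """Query cheaper LLMs first and escalate only when a quality scorer is not
    confident enough in the answer. `models` is ordered cheapest to most expensive."""
    for model in models:
        answer = model(prompt)
        if scorer(prompt, answer) >= threshold:
            return answer
    return answer  # fall back to the most expensive model's answer
```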
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Typically, my focus is narrowly centered on textual modalities, leaving little room to explore others. However, I had to make an exception for this comprehensive overview from Apple about the design and training of their Multimodal Large Language Model (MLLM). Very informative!
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
I only eyeballed this paper, but it seems to be a great source of insights for training LLMs at scale. Definitely on my TODO list.
Reverse Training to Nurse the Reversal Curse
It is well known that when LLMs are trained on “A has a feature B”, they do not generalize to “B is a feature of A”. To tackle this problem, this paper proposes to train LLMs by passing the data in both directions! The question is then how the data should be reversed: by processing sentences right to left, word by word? Should the word order within entities be preserved? Can sentences be randomly chunked, reversing the order of the chunks while preserving the word order within each chunk, to save the cost of entity recognition? Experimental results suggest that reverse training can help greatly, and that entity-preserving and random-segment reverse training are the most effective. Interestingly, reverse training also seems to benefit standard (forward) tasks.
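A sketch of the three reversal flavors on word lists (the segment length and the entity spans are illustrative; the paper's exact procedures may differ):

```python
import random

def word_reverse(tokens):
    """Reverse the sentence word by word."""
    return tokens[::-1]

def entity_preserving_reverse(tokens, entity_spans):
    """Reverse word order but keep each entity's internal word order intact.
    entity_spans: list of (start, end) index pairs from an NER pass,
    e.g. [(0, 2)] for a two-word entity starting the sentence."""
    span_end = {start: end for start, end in entity_spans}
    chunks, i = [], 0
    while i < len(tokens):
        end = span_end.get(i, i + 1)
        chunks.append(tokens[i:end])
        i = end
    return [tok for chunk in reversed(chunks) for tok in chunk]

def random_segment_reverse(tokens, max_len=3):
    """Randomly chunk the sequence and reverse the chunk order, keeping word
    order inside each chunk (a cheap proxy for entity preservation)."""
    segments, i = [], 0
    while i < len(tokens):
        k = random.randint(1, max_len)
        segments.append(tokens[i:i + k])
        i += k
    return [tok for seg in reversed(segments) for tok in seg]
```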
Cascade Speculative Drafting for Even Faster LLM Inference
Speculative decoding accelerates the inference process (next-token prediction) in large language models. Rather than having the model predict every subsequent token, a smaller, cheaper auxiliary model is used to propose several tokens ahead. The larger model's role is then primarily to verify (accept or reject) these predictions. While maintaining the same output distribution, inference is faster because verifying token predictions is cheaper than generating them and can be parallelized. This paper proposes to apply speculative decoding to the auxiliary model itself by cascading several models of various sizes and to use smaller models to generate the later tokens. Experiments with FLAN-T5-XXL and LLaMA-2-Chat-7B show some speedup over speculative decoding, although I am not sure how they come up with the reported 44% and 81% speedups. Also, a comparison with recent improvements over the original speculative decoding algorithm would be interesting.
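A greedy sketch of plain speculative decoding (batch size 1; the full algorithm uses rejection sampling to preserve the target model's output distribution exactly, and the paper's cascading is not shown):

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, input_ids, k=4):
    """The small draft model proposes k tokens; the large target model scores
    them all in one forward pass; we keep the longest agreeing prefix plus the
    target model's correction at the first disagreement."""
    draft = input_ids
    for _ in range(k):
        next_tok = draft_model(draft).logits[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=-1)

    target_preds = target_model(draft).logits[:, :-1].argmax(-1)  # target's choice at each position

    accepted = input_ids
    for i in range(input_ids.shape[1], draft.shape[1]):
        verified = target_preds[:, i - 1]                 # what the target wants at position i
        accepted = torch.cat([accepted, verified.unsqueeze(-1)], dim=-1)
        if not torch.equal(draft[:, i], verified):
            break                                         # first disagreement: stop here
    return accepted
```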
Reformatted Alignment
Alignment generally refers to the process of ensuring that the outputs of LLMs are in accordance with human preferences, intentions, and values. Ensuring that LLMs follow instructions, understand questions, and provide factually accurate and contextually appropriate answers is one of its components. This paper shows that the performance of LLMs in areas like mathematical reasoning, factuality, and readability can be improved by… simply reformatting the datasets they are instruction fine-tuned on. A low-hanging but tasty fruit.
Beyond My Bandwidth
ChatMusician: Understanding and Generating Music Intrinsically with LLM
LLAMAFACTORY: Unified Efficient Fine-Tuning of 100+ Language Models
Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models
PERL: Parameter Efficient Reinforcement Learning from Human Feedback
RewardBench: Evaluating Reward Models for Language Modeling
Evolutionary Optimization of Model Merging Recipes
ALGORITHMIC PROGRESS IN LANGUAGE MODELS
CHAIN-OF-TABLE: EVOLVING TABLES IN THE REASONING CHAIN FOR TABLE UNDERSTANDING
SparQ Attention: Bandwidth-Efficient LLM Inference
MASKED STRUCTURAL GROWTH FOR 2X FASTER LANGUAGE MODEL PRE-TRAINING
Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification
Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy
GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
LLM4Decompile: Decompiling Binary Code with Large Language Models https://arxiv.org/pdf/2403.05286v1.pdf
SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents
Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
AtP*: An efficient and scalable method for localizing LLM behaviour to components
Theoretical Foundations of Deep Selective State-Space Models
Non-Vacuous Generalization Bounds for Large Language Models
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error
Learning to Decode Collaboratively with Multiple Language Models
Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
LLMEval: A Preliminary Study on How to Evaluate Large Language Models
Harnessing Retrieval-Augmented Generation (RAG) for Uncovering Knowledge Gaps
Instruction-tuned Language Models are Better Knowledge Learners
Design2Code: How Far Are We From Automating Front-End Engineering?
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
Divide-or-Conquer? Which Part Should You Distill Your LLM?
RelayAttention for Efficient Large Language Model Serving with Long System Prompts
Recourse for Reclamation: Chatting with Generative Language Models
Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
CLLMs: Consistency Large Language Models
Aligning Large Language Models to a Domain-specific Graph Database
From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs
GPTVQ: The Blessing of Dimensionality for LLM Quantization
EVERYTHING OF THOUGHTS: DEFYING THE LAW OF PENROSE TRIANGLE FOR THOUGHT GENERATION
FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization
TnT-LLM: Text Mining at Scale with Large Language Models
Evaluating Frontier Models for Dangerous Capabilities
DATA INTERPRETER: AN LLM AGENT FOR DATA SCIENCE
MathScale: Scaling Instruction Tuning for Mathematical Reasoning
KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents
StarCoder 2 and The Stack v2: The Next Generation
Beyond Words: Other Modalities
When Do We Not Need Larger Vision Models?
Multistep Consistency Models
Improving fine-grained understanding in image-text pre-training
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
UNI-SMART: UNIVERSAL SCIENCE MULTIMODAL ANALYSIS AND RESEARCH TRANSFORMER
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
MATHVERSE: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
EFFICIENT VIDEO DIFFUSION MODELS VIA CONTENT-FRAME MOTION-LATENT DECOMPOSITION
DeepSeek-VL: Towards Real-World Vision-Language Understanding
MoAI: Mixture of All Intelligence for Large Language and Vision Models
GiT: Towards Generalist Vision Transformer through Universal Language Interface
AtomoVideo: High Fidelity Image-to-Video Generation
Enhancing Vision-Language Pre-training with Rich Supervisions
VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
Genie: Generative Interactive Environments
Learning and Leveraging World Models in Visual Representation Learning
Beyond Language Models: Byte Models are Digital World Simulators