Large Concept Models (LCMs): A New Paradigm in AI Language Processing

Abstract

Large Concept Models (LCMs) represent a significant advancement in AI language processing, moving beyond the token-based approach of Large Language Models (LLMs). This paper explores the architecture, advantages, and potential applications of LCMs, highlighting their ability to handle long context inputs, perform hierarchical reasoning, and operate across multiple modalities.


Introduction

In recent years, Large Language Models (LLMs) have revolutionized the field of AI, becoming an essential tool for many tasks. The main component in these models’ architecture is a large Transformer model. However, to process our prompts, LLMs use another crucial component called a tokenizer. The tokenizer converts the prompt into tokens, which are part of the model’s vocabulary.

Introducing Large Concept Models (LCMs)

A recent research paper from Meta proposes an alternative to token-level processing. The paper, titled Large Concept Models: Language Modeling in a Sentence Representation Space, introduces a new architecture called Large Concept Models (LCMs). Unlike traditional LLMs, which process tokens, LCMs work with concepts.


Understanding Concepts vs. Tokens

Concepts represent the semantics of higher-level ideas or actions and are not tied to specific single words. Furthermore, concepts are not restricted to language alone and can be derived from multiple modalities. For instance, the concept behind a particular sentence remains consistent whether it is in English, another language, or conveyed through text or voice.


Advantages of LCMs

  1. Better Long Context Handling: Concept sequences are much shorter than token sequences for the same input, significantly reducing the challenge of managing long sequences.
  2. Hierarchical Reasoning: Processing concepts rather than subword tokens allows for better hierarchical reasoning. For example, a researcher giving a talk would outline higher-level ideas rather than writing out every single word.
  3. Modality and Language Independence: Through the SONAR embedding space, LCMs support 200 languages as text and several modalities, making them more versatile than traditional LLMs.
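
To make the first advantage concrete, here is a rough sketch that compares sequence lengths for the same input. Both helpers are assumptions made for illustration: `split_sentences` is a crude regex splitter standing in for real sentence segmentation, and `crude_tokens` is a word-and-punctuation splitter standing in for a real subword tokenizer (which would usually produce even more tokens).

```python
import re

def split_sentences(text):
    """Crude sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def crude_tokens(text):
    """Crude stand-in for a subword tokenizer: words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

text = (
    "The researcher opened the talk with a summary. "
    "She then described the model. "
    "Finally, she answered questions from the audience."
)

concepts = split_sentences(text)  # one concept per sentence
tokens = crude_tokens(text)       # many tokens per sentence
print(len(concepts), "concepts vs", len(tokens), "tokens")
```

Even in this tiny example the concept sequence is several times shorter than the token sequence, which is the property the long-context argument relies on.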


High-Level Architecture of LCMs

Understanding the high-level architecture of LCMs is crucial. The process begins with an input sequence of words divided into sentences, which are assumed to be the basic building blocks representing concepts.

  • Concept Encoder (SONAR): These sentences are first passed through a concept encoder, which encodes them into concept embeddings. SONAR supports 200 languages as text input and output—more than double the number of languages supported by most LLMs today. It also accepts 76 languages as speech input.
  • Large Concept Model (LCM): Next, the sequence of concepts is processed by a Large Concept Model to generate a new sequence of concepts at the output. The LCM operates solely in the embedding space, making it independent of any specific language or modality.
  • Concept Decoder (SONAR): Finally, the generated concepts are decoded back into language using SONAR. The decoder can convert the output of the LCM into more than one language or even more than one modality.
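
The three stages above can be sketched as a simple data flow. Everything in this block is a hypothetical stand-in: `toy_encode`, `toy_lcm`, and `toy_decode` are not SONAR or LCM APIs. The toy encoder hashes a sentence into a small vector, the toy LCM averages its context, and the toy decoder retrieves the nearest candidate sentence instead of generating text. Only the shape of the pipeline is meant to match the description.

```python
import hashlib

DIM = 8  # toy embedding size, chosen arbitrarily for this sketch

def toy_encode(sentence):
    """Hypothetical stand-in for the concept encoder: sentence -> fixed-size vector."""
    digest = hashlib.sha256(sentence.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]

def toy_lcm(concepts):
    """Hypothetical stand-in for the LCM: maps previous concepts to a next concept.
    It only averages the context; a real LCM is a trained Transformer."""
    n = len(concepts)
    return [sum(c[i] for c in concepts) / n for i in range(DIM)]

def toy_decode(concept, candidates):
    """Hypothetical stand-in for the concept decoder: returns the candidate
    sentence whose toy embedding is closest to the given concept."""
    def squared_distance(sentence):
        emb = toy_encode(sentence)
        return sum((a - b) ** 2 for a, b in zip(concept, emb))
    return min(candidates, key=squared_distance)

sentences = ["The cat sat on the mat.", "It looked out of the window."]
input_concepts = [toy_encode(s) for s in sentences]  # 1. encode
next_concept = toy_lcm(input_concepts)               # 2. predict in embedding space
candidates = ["Then it fell asleep.", "Stock prices rose sharply."]
output = toy_decode(next_concept, candidates)        # 3. decode
print(output)
```

Note that the middle stage never sees text, only vectors, which is what makes it independent of language and modality.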


Inner Architecture of LCMs

We’re now ready to look at several Large Concept Model architectures. We first explore Base-LCM, the first attempt at building a Large Concept Model, and then review diffusion-based LCMs, an improved architecture.


Base-LCM: Large Concept Model Naive Architecture

The first version, referred to as Base-LCM, is trained much like a large language model that predicts the next token. However, instead of predicting the next token, the model is trained to predict the next concept in the concept embedding space.

In the figure from the paper, we see the high-level architecture of Base-LCM. At the bottom on the left, we have a sequence of concepts. This sequence, excluding the last concept, is fed into the model to predict the next concept. The output is then compared to the actual next concept, which was not included in the model input. A mean squared error (MSE) loss is used to train the model.


The model comprises a main Transformer decoder component, along with smaller components before and after the Transformer, referred to as PreNet and PostNet. The PreNet component normalizes the concept embeddings received from SONAR and maps them into the Transformer’s dimension. The PostNet component projects the model output back to SONAR’s dimension.
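
The training objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `toy_predict_next` is a hypothetical stand-in that collapses PreNet, the Transformer decoder, and PostNet into a "copy the previous concept" baseline, and the concept vectors are made-up numbers. What it shows is the shifted input/target setup and the MSE loss.

```python
def mse(pred, target):
    """Mean squared error between two vectors of equal length."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def toy_predict_next(context):
    """Hypothetical stand-in for PreNet -> Transformer decoder -> PostNet.
    It just copies the most recent concept; a real model is learned."""
    return list(context[-1])

def base_lcm_loss(concepts):
    """Average next-concept MSE over all positions of one sequence.
    At position i the model sees concepts[:i] and must predict concepts[i]."""
    losses = []
    for i in range(1, len(concepts)):
        prediction = toy_predict_next(concepts[:i])
        losses.append(mse(prediction, concepts[i]))
    return sum(losses) / len(losses)

# Made-up 2-dimensional "concept embeddings" for a three-sentence document.
concepts = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
loss = base_lcm_loss(concepts)
print(loss)
```

Because the loss compares the prediction against a single target vector, the model is pushed toward one specific next concept, which is the limitation discussed in the next section.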


Base-LCM Limitation

Unlike large language models that learn a distribution for next token prediction, this model is trained to output a very specific concept. However, there are likely many other concepts that could make sense in a given context.

This leads us to the next version of the LCM architecture. The challenge of having many plausible outputs for a given input has already been tackled in the image generation domain. For example, if we ask an image generation model to generate a cute cat, we will likely be satisfied with many different generated images of cute cats. A widely used architecture for image generation is the diffusion model. Inspired by this, diffusion-based architectures are also explored for large concept models.


Understanding Diffusion Models

Diffusion models take a prompt as input, such as “A cat is sitting on a laptop”. The model learns to gradually remove noise from an image to generate a clear picture. Generation starts from an image of pure random noise, and at each step the model removes some of that noise. The noise removal is conditioned on the input prompt, so the final output is a clear image that matches the prompt, here a cat sitting on a laptop. The denoising process usually takes from tens to thousands of steps, which can add latency. To learn how to remove noise, the model is trained on clear images to which noise has been gradually added. This noising procedure is the diffusion process.
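
As a small illustration of the noising side, the sketch below uses the standard closed-form forward step, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, on a short vector instead of an image. The linear `alpha_bar_schedule` is a hypothetical choice made for readability; real systems use tuned noise schedules, and nothing here is taken from the paper.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Closed-form forward diffusion step.
    alpha_bar = 1 keeps the clean input, alpha_bar = 0 yields pure noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
        for x, e in zip(x0, eps)
    ]
    return xt, eps

def alpha_bar_schedule(num_steps):
    """Hypothetical linear schedule from 1.0 (clean) down to 0.0 (pure noise)."""
    return [1.0 - t / num_steps for t in range(num_steps + 1)]

rng = random.Random(0)
x0 = [1.0, -2.0, 0.5]  # stand-in for a clean image or concept embedding
schedule = alpha_bar_schedule(10)
for t in (0, 5, 10):
    xt, _ = add_noise(x0, schedule[t], rng)
    print(t, [round(v, 3) for v in xt])
```

During training, a denoiser is given such noised inputs together with the timestep and learns to recover the clean signal. Generation then runs the process in reverse, starting from pure noise.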


Diffusion-Based LCMs: Improved Large Concept Model Architecture

Now that we’ve recalled what diffusion models are, we can explore the two types of diffusion-based large concept models depicted in the figure from the paper.


One-Tower Large Concept Model

On the left, we see a version called the One-Tower LCM. At the bottom, there is an input sequence of concepts, each paired with a number representing its noising timestep. A timestep of zero marks a clean concept embedding. Only the last concept is noisy, marked with timestep t, and it must be denoised to obtain the clean next-concept prediction. The model is built similarly to the Base-LCM but runs multiple times. At each step, it removes some noise from the noisy next concept and feeds its output back in as the new noisy concept, repeating this for a fixed number of steps.
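
The inference loop described above might look roughly like this. The sketch is built on assumptions: `toy_denoiser` is a hypothetical stand-in for the One-Tower Transformer that simply moves the noisy vector halfway toward the mean of the clean context, which a trained model would not do. What the sketch is meant to show is the input layout (timestep 0 for clean concepts, t for the noisy one) and the iterative feedback.

```python
import random

def toy_denoiser(sequence):
    """Hypothetical stand-in for the One-Tower model.
    Takes (embedding, timestep) pairs and returns a cleaner estimate of the
    last, noisy embedding."""
    context = [emb for emb, t in sequence if t == 0]
    noisy, _ = sequence[-1]
    dim = len(noisy)
    mean = [sum(c[i] for c in context) / len(context) for i in range(dim)]
    return [0.5 * n + 0.5 * m for n, m in zip(noisy, mean)]

def one_tower_generate(context, num_steps, rng):
    """Start from pure noise and denoise it step by step."""
    dim = len(context[0])
    noisy = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(num_steps, 0, -1):
        # Clean concepts carry timestep 0; only the last entry carries t.
        sequence = [(c, 0) for c in context] + [(noisy, t)]
        noisy = toy_denoiser(sequence)  # output becomes the next noisy input
    return noisy

context = [[1.0, 1.0], [3.0, 3.0]]  # made-up clean concept embeddings
next_concept = one_tower_generate(context, num_steps=20, rng=random.Random(0))
print([round(v, 3) for v in next_concept])
```

In this layout the whole sequence, clean context included, passes through the model at every denoising step.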


Two-Tower Large Concept Model

On the right, we see another version called the Two-Tower LCM. The main difference from the One-Tower version is that it separates the encoding of the preceding context from the diffusion of the next concept embedding. The clean concept embeddings are first encoded using a decoder-only Transformer. The outputs are then fed to a second model, the denoiser, which also receives the noisy next concept and iteratively denoises it to predict the clean next concept. The denoiser consists of Transformer layers, with a cross-attention block to attend to the encoded previous concepts.
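
A rough sketch of the two-tower split follows. All names are hypothetical: `toy_contextualizer` stands in for the first tower (here an identity function), and `toy_denoise_step` stands in for one pass of the denoiser. The `cross_attention` function is plain single-head scaled dot-product attention without learned projections, a simplification made so the example stays short.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention of one query over the context."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights

def toy_contextualizer(context):
    """Hypothetical stand-in for the first tower (identity in this sketch)."""
    return [list(c) for c in context]

def toy_denoise_step(noisy, encoded):
    """Hypothetical stand-in for one denoiser pass: attend to the encoded
    context, then mix the result into the noisy concept."""
    attended, _ = cross_attention(noisy, encoded, encoded)
    return [0.5 * n + 0.5 * a for n, a in zip(noisy, attended)]

def two_tower_generate(context, noisy, num_steps):
    encoded = toy_contextualizer(context)  # in this sketch: encoded once, then reused
    for _ in range(num_steps):
        noisy = toy_denoise_step(noisy, encoded)
    return noisy

context = [[1.0, 0.0], [0.0, 1.0]]
_, weights = cross_attention([1.0, 0.0], context, context)
print([round(w, 3) for w in weights])
print(two_tower_generate(context, [0.3, -0.8], num_steps=5))
```

The design point the sketch tries to capture is the separation of concerns: the context encoding sits outside the denoising loop, and the denoiser reaches it only through cross-attention.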


Results


Comparing Different Versions of Large Concept Models (LCMs)

In the table from the paper, we see instruction-tuning evaluation results for the various models. The diffusion-based versions significantly outperform the other versions for the two reported metrics: ROUGE-L, which evaluates the quality of generated summaries by measuring the longest common subsequence between the generated text and the reference text, and the coherence metric, which evaluates how logically consistent and smoothly flowing the generated text is.
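
For readers unfamiliar with ROUGE-L, here is a simplified sketch of the idea. It assumes whitespace tokenization, lowercasing, and a balanced F1 score. Reference implementations add details such as stemming options and a weighted F-measure, so treat this as an illustration of the longest-common-subsequence computation rather than a drop-in scorer.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 3))
```

In this example the longest common subsequence is "the cat on the mat" (5 of 6 words on each side), so precision, recall, and F1 are all 5/6.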


The Quant models are additional large concept model versions that we did not cover in this post. At the bottom of the table, we see that smaLlama achieves slightly better results than the diffusion-based large concept model versions.

Higher-Scale Evaluation of Large Concept Models (LCMs)

To verify the method at a larger scale, the Two-Tower LCM was scaled up to 7B parameters. The paper's results table shows how it performs on summarization tasks compared to the following baselines:


  • Encoder-Decoder Transformer Models: T5
  • Decoder-Only LLMs: Gemma-7B, Llama-3.1-8B, and Mistral-7B-v0.3


The results show that the LCM produces competitive ROUGE-L scores, comparable to the specifically tuned T5-3B model, and surpasses the instruction-finetuned LLMs. Key findings include:


  • Abstractive Summaries: LCMs tend to generate more abstractive summaries rather than extractive ones, indicated by lower OVL-3 scores.
  • Repetition Rate: LCMs produce fewer repetitions compared to LLMs, with repetition rates closer to the ground truth.
  • Fluency: According to the CoLA classifier, LCMs generate less fluent summaries than LLMs, though even human-generated summaries scored lower than LLM outputs.
  • Source Attribution and Semantic Coverage: Similar trends are observed in source attribution (SH-4) and semantic coverage (SH-5), potentially due to biases in model-based metrics favoring LLM-generated content.
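
The first two findings rest on n-gram statistics, which are easy to sketch. The definitions below are assumptions made for illustration and may differ from the paper's exact formulas: `overlap_rate` treats an OVL-3-style score as the fraction of summary 3-grams that also appear in the source, and `repetition_rate` treats repetition as the fraction of 4-grams that duplicate an earlier 4-gram in the same text.

```python
def ngrams(words, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def overlap_rate(summary, source, n=3):
    """Assumed OVL-style score: share of summary n-grams found in the source.
    Higher values suggest a more extractive summary."""
    summary_grams = ngrams(summary.lower().split(), n)
    if not summary_grams:
        return 0.0
    source_grams = set(ngrams(source.lower().split(), n))
    return sum(1 for g in summary_grams if g in source_grams) / len(summary_grams)

def repetition_rate(text, n=4):
    """Assumed repetition score: share of n-grams that repeat an earlier one."""
    grams = ngrams(text.lower().split(), n)
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

source = "the cat sat on the mat near the door"
print(overlap_rate("the cat sat on the mat", source))           # extractive
print(overlap_rate("a feline rested by the entrance", source))  # abstractive
print(repetition_rate("a b c d a b c d"))
```

Under these assumed definitions, a summary copied from the source scores high on overlap, while a paraphrase scores low, which matches how the lower OVL-3 scores are read above.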


Conclusion

Large Concept Models (LCMs) are an innovative architecture that processes higher-level concepts instead of individual tokens, which more closely resembles how humans reason. In the reported evaluations, LCMs demonstrate competitive performance on summarization tasks, outperforming the instruction-finetuned LLM baselines on ROUGE-L and producing fewer repetitions.
