Large Concept Models (LCMs): A New Paradigm in AI Language Processing
Bazeed Shaik
Chief AI Officer (CAIO), Steering Gen AI, CCoE, Multi-Cloud Solutions & DevSecOps with Passionate Leadership | Digital Pioneer | EMBA | 5xAWS, 5xAzure, 1xGCP | CKAD, CCIE, ITILV3 & PMP | 12K+ LinkedIn Connections
Abstract
Large Concept Models (LCMs) represent a significant advancement in AI language processing, moving beyond the token-based approach of Large Language Models (LLMs). This paper explores the architecture, advantages, and potential applications of LCMs, highlighting their ability to handle long context inputs, perform hierarchical reasoning, and operate across multiple modalities.
Introduction
In recent years, Large Language Models (LLMs) have revolutionized the field of AI, becoming an essential tool for many tasks. The main component in these models’ architecture is a large Transformer model. However, to process our prompts, LLMs rely on another crucial component: a tokenizer, which converts the prompt into tokens drawn from the model’s vocabulary. This token-level view sits below the level at which humans typically reason, which is in terms of ideas that often span whole sentences.
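To make the idea concrete, here is a deliberately tiny, hypothetical tokenizer. Real LLM tokenizers learn subword units (for example via BPE) rather than whole words, and the vocabulary below is invented for illustration, but the principle is the same: text in, vocabulary ids out.

```python
# Toy illustration of tokenization. Real tokenizers use learned subword
# units; this hypothetical word-level vocabulary just shows the mapping.
TOY_VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "a": 5, "laptop": 6}

def toy_tokenize(prompt: str) -> list[int]:
    """Map each whitespace-separated word to its vocabulary id."""
    return [TOY_VOCAB.get(w, TOY_VOCAB["<unk>"]) for w in prompt.lower().split()]

ids = toy_tokenize("The cat sat on a laptop")
# ids == [1, 2, 3, 4, 5, 6]; unknown words fall back to id 0
```

An LCM, by contrast, never sees these ids: its atomic unit is a whole sentence, as described below.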
Introducing Large Concept Models (LCMs)
A recent research paper from Meta aims to bridge this gap. The paper is titled Large Concept Models: Language Modeling in a Sentence Representation Space, and it introduces a new architecture called Large Concept Models (LCMs). Unlike traditional LLMs that process tokens, LCMs work with concepts.
Understanding Concepts vs. Tokens
Concepts represent the semantics of higher-level ideas or actions and are not tied to specific single words. Furthermore, concepts are not restricted to language alone and can be derived from multiple modalities. For instance, the concept behind a particular sentence remains consistent whether it is in English, another language, or conveyed through text or voice.
Advantages of LCMs
Compared to token-based LLMs, LCMs can handle long context inputs, perform hierarchical reasoning over higher-level ideas, and operate across multiple modalities, as the architecture below makes possible.
High-Level Architecture of LCMs
At a high level, the process begins with an input sequence of words divided into sentences, which are assumed to be the basic building blocks representing concepts. Each sentence is then encoded into a fixed-size concept embedding (the paper uses the SONAR encoder for this), and the LCM operates entirely in this concept embedding space.
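The pipeline's first stage can be sketched as follows. This is a minimal stand-in, not the paper's actual pipeline: the sentence splitter is a naive regex, and `embed_concept` is a hypothetical placeholder for SONAR that just produces a deterministic pseudo-embedding per sentence.

```python
import re
import zlib
import numpy as np

def split_into_sentences(text: str) -> list[str]:
    """Naive split on sentence-ending punctuation; the real pipeline is more robust."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def embed_concept(sentence: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for SONAR: a deterministic pseudo-embedding
    seeded by the sentence text. A real encoder maps semantically similar
    sentences (across languages and modalities) to nearby vectors."""
    rng = np.random.default_rng(zlib.crc32(sentence.encode()))
    return rng.standard_normal(dim)

text = "LCMs operate on concepts. Each sentence becomes one embedding."
sentences = split_into_sentences(text)
concepts = np.stack([embed_concept(s) for s in sentences])  # (n_sentences, dim)
```

From here on, the LCM sees only the `concepts` matrix, never the raw words.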
Inner Architecture of LCMs
We’re now ready to delve into a few different architectures of Large Concept Models. Below we explore Base-LCM, the first attempt at building a Large Concept Model, and afterwards we review diffusion-based LCMs, an improved LCM architecture.
Base-LCM: Large Concept Model Naive Architecture
This method is analogous to training a large language model to predict the next token. However, instead of predicting the next token, the model is trained to predict the next concept within the concepts embedding space. This version is referred to as Base-LCM.
In the figure from the paper, we see the high-level architecture of Base-LCM. At the bottom left, we have a sequence of concepts. This sequence, excluding the last concept, is fed into the model to predict the next concept. The output is then compared to the actual next concept, which was held out of the model input, and a mean squared error (MSE) loss is used to train the model.
The model comprises a main Transformer decoder component, along with smaller components before and after the Transformer, referred to as PreNet and PostNet. The PreNet component normalizes the concept embeddings received from SONAR and maps them into the Transformer’s dimension. The PostNet component projects the model output back to SONAR’s dimension.
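One training step of this setup can be sketched numerically. Everything below is a toy: the dimensions are invented, and the Transformer decoder is replaced by a single nonlinear map, but the PreNet → core → PostNet flow and the MSE target match the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
SONAR_DIM, MODEL_DIM = 8, 16   # toy sizes; real SONAR embeddings are much larger

# PreNet: normalize/map SONAR-space concepts into the Transformer's dimension.
W_pre = rng.standard_normal((SONAR_DIM, MODEL_DIM)) * 0.1
# Stand-in for the Transformer decoder (a real Base-LCM uses causal
# self-attention over the whole concept sequence).
W_core = rng.standard_normal((MODEL_DIM, MODEL_DIM)) * 0.1
# PostNet: project the model output back to SONAR's dimension.
W_post = rng.standard_normal((MODEL_DIM, SONAR_DIM)) * 0.1

def predict_next_concept(context: np.ndarray) -> np.ndarray:
    """context: (seq_len, SONAR_DIM) clean concepts -> predicted next concept."""
    h = context @ W_pre          # PreNet
    h = np.tanh(h @ W_core)      # decoder stand-in
    return h[-1] @ W_post        # PostNet reads out the last position

concepts = rng.standard_normal((5, SONAR_DIM))    # a 5-concept sequence
pred = predict_next_concept(concepts[:-1])        # feed all but the last concept
mse = float(np.mean((pred - concepts[-1]) ** 2))  # loss against the held-out concept
```

Training would backpropagate `mse` through all three components; the sketch stops at computing the loss.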
Base-LCM Limitation
Unlike large language models that learn a distribution for next token prediction, this model is trained to output a very specific concept. However, there are likely many other concepts that could make sense in a given context.
This leads us to the next version of the LCM architecture. The challenge of having many plausible outputs for a given input has already been tackled in the image generation domain. For example, if we ask an image generation model to generate a cute cat, we will likely be satisfied with many different generated cat images. A widely used architecture for image generation is the diffusion model. Inspired by this, a diffusion-based architecture is also explored for Large Concept Models.
Understanding Diffusion Models
Diffusion models take a prompt as input, such as “A cat is sitting on a laptop”, and learn to gradually remove noise from an image to generate a clear picture. The process starts with a random noise image, and at each step the model removes some of the noise, conditioned on the input prompt, so that the emerging image matches the prompt. After the final step, we get a clear image of a cat, the model’s output for the given prompt. The noise removal usually takes between tens and thousands of steps, which can result in a latency drawback. During training, to learn how to remove noise, noise is gradually added to a clean image—this is the diffusion process.
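The forward (noising) direction can be sketched in a few lines. The linear schedule below is a toy assumption; real diffusion models use carefully tuned noise schedules, but the endpoints are the same: clean data at step 0, essentially pure noise at the final step.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0: np.ndarray, t: int, T: int = 100) -> np.ndarray:
    """Forward diffusion: blend clean data x0 with Gaussian noise.
    At t=0 the sample is untouched; at t=T it is (almost) pure noise.
    The linear schedule here is a toy simplification."""
    alpha = 1.0 - t / T
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha) * x0 + np.sqrt(1.0 - alpha) * noise

x0 = np.ones(4)          # a stand-in for clean data (image or concept embedding)
x_mid = add_noise(x0, 50)  # partially noised
x_end = add_noise(x0, 100) # fully noised: no trace of x0 remains
```

The model is trained to invert this process one step at a time; at generation time only the reverse direction runs.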
Diffusion-Based LCMs: Improved Large Concept Model Architecture
Now that we’ve recalled what diffusion models are, we can explore the two types of diffusion-based large concept models depicted in the figure from the paper.
One-Tower Large Concept Model
On the left, we see a version called the One-Tower LCM. At the bottom is an input sequence of concepts, each paired with a number representing its noising timestep. A timestep of zero indicates a clean concept embedding; only the last concept is noisy, marked with timestep t, and it must be denoised to obtain the clean next-concept prediction. The model is built similarly to the Base-LCM but runs multiple times: at each step it removes some noise from the noisy next concept, feeding its output back in as the noisy concept for a set number of steps.
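The iterative loop can be sketched as follows. `denoise_step` here is a hypothetical stand-in for the One-Tower model's forward pass: it simply nudges the noisy concept toward a context-derived target, whereas the real model is a Transformer conditioned on the clean context and the timestep.

```python
import numpy as np

def denoise_step(noisy: np.ndarray, context: np.ndarray, t: int) -> np.ndarray:
    """Toy stand-in for one LCM forward pass: pull the noisy concept part of
    the way toward a context-dependent target. A real model predicts the
    noise (or the clean concept) from the full sequence and timestep t."""
    target = context.mean(axis=0)           # toy 'prediction' from the clean context
    return noisy + 0.5 * (target - noisy)   # remove part of the discrepancy

rng = np.random.default_rng(0)
context = rng.standard_normal((4, 8))  # clean previous concepts (timestep 0)
concept = rng.standard_normal(8)       # noisy next concept (timestep t)

for t in reversed(range(20)):          # each output becomes the next input
    concept = denoise_step(concept, context, t)
# after enough steps, `concept` has converged to the (toy) clean prediction
```

The loop structure, where the model's own output is fed back in as the new noisy concept, is the point being illustrated.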
Two-Tower Large Concept Model
On the right, we see another version called the Two-Tower LCM. The main difference from the One-Tower version is that it separates the encoding of the preceding context from the diffusion of the next concept embedding. The clean concept embeddings are first encoded using a decoder-only Transformer. The outputs are then fed to a second model, the denoiser, which also receives the noisy next concept and iteratively denoises it to predict the clean next concept. The denoiser consists of Transformer layers, with a cross-attention block to attend to the encoded previous concepts.
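The split between the two towers can be sketched structurally. Both functions below are toy stand-ins (a tanh projection for the context encoder, a single-head attention read for the denoiser's cross-attention block), but they show the data flow: tower 1 encodes the clean context once, and tower 2's query (the noisy concept) attends to that encoded memory.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding dimension

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def encode_context(clean_concepts: np.ndarray) -> np.ndarray:
    """Tower 1 stand-in: a decoder-only Transformer encodes the clean context."""
    return np.tanh(clean_concepts @ W_enc)

def cross_attend(query: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Tower 2's cross-attention read: the noisy concept (query) attends
    to the encoded previous concepts (memory)."""
    scores = softmax(query @ memory.T / np.sqrt(D))  # (n_context,) weights
    return scores @ memory                           # weighted context summary

W_enc = rng.standard_normal((D, D)) * 0.1
context = rng.standard_normal((5, D))  # five clean previous concepts
noisy = rng.standard_normal(D)         # the noisy next concept

memory = encode_context(context)       # computed once, reused every denoise step
attended = cross_attend(noisy, memory) # one denoiser layer's context read
```

Because the context encoding is computed once and only the denoiser iterates, the expensive context pass is not repeated at every denoising step, which is a key design choice of the Two-Tower variant.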
Results
Comparing Different Versions of Large Concept Models (LCMs)
In the table from the paper, we see instruction-tuning evaluation results for the various models. The diffusion-based versions significantly outperform the other versions for the two reported metrics: ROUGE-L, which evaluates the quality of generated summaries by measuring the longest common subsequence between the generated text and the reference text, and the coherence metric, which evaluates how logically consistent and smoothly flowing the generated text is.
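Since ROUGE-L is central to these comparisons, here is a minimal word-level implementation of its F1 variant via the longest common subsequence. Published evaluations add tokenization and aggregation details on top, so treat this as a sketch of the metric's core.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the cat sat on the mat", "the cat sat on a mat")
# identical texts score 1.0; disjoint texts score 0.0
```

Because LCS rewards in-order overlap rather than exact n-gram matches, ROUGE-L tolerates insertions and paraphrase better than ROUGE-N, which is one reason it is used for summary evaluation.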
The Quant models are additional large concept model versions that we did not cover in this post. At the bottom of the table, we see that smaLlama achieves slightly better results than the diffusion-based large concept model versions.
Higher-Scale Evaluation of Large Concept Models (LCMs)
To verify the method at a higher scale, the Two-Tower LCM was scaled up to 7B parameters. In the table below, we can see how it performs on summarization tasks compared to the following baselines:
The results show that the LCM produces competitive ROUGE-L scores, comparable to the specifically tuned T5-3B model, and surpasses the instruction-finetuned LLMs. Key findings include:
Conclusion
This post introduced Large Concept Models (LCMs), an innovative architecture that processes higher-level concepts instead of individual tokens, more closely mimicking human reasoning. LCMs demonstrate competitive performance on summarization tasks, outperforming traditional LLMs in several key areas.