Behind the Generative AI Buzz: Transformers, LLMs, and the Future of Business
Intro
Generative AI has generated quite a buzz these days. By giving a broad overview of the world of transformers and the LLMs built on them, let's shed some light on how LLMs are constructed and what they can do versus what they cannot. This is a quickly developing, often-discussed, and use-case-rich space that is up for grabs by the organizations that best understand how to apply it to their domains.
Why Now? Machine Learning Context for Generative AI
The transformer is the modern foundation of all the excitement generative AI kicked off last year. Without this innovation, we would not be able to train models as quickly or as accurately as we can today. Each time you prompt a language model with a question and context, you are leveraging the benefits of training with transformers. The model is retrieving answers from the vast corpus of information it was trained on, saving the parts of that information it deems most important within the billions of floating-point (decimal-valued, e.g. 0.9835245) numbers that make up its parameters.
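To get a feel for the scale of those parameters, here is a back-of-the-envelope sketch. The 7-billion-parameter figure and byte sizes are illustrative assumptions, not numbers from this post:

```python
# Rough memory footprint of a model's parameters.
# Assumption: an illustrative 7-billion-parameter model stored as
# 32-bit floats (4 bytes each); real models vary in size and precision.
def param_memory_gb(num_params: int, bytes_per_param: int = 4) -> float:
    """Return the approximate memory needed to hold the parameters, in GB."""
    return num_params * bytes_per_param / 1e9

print(param_memory_gb(7_000_000_000))     # 28.0 GB at 32-bit precision
print(param_memory_gb(7_000_000_000, 2))  # 14.0 GB at 16-bit precision
```

Even holding the parameters in memory, before any training, is a serious hardware commitment at this scale.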
What does 'most important' mean? Think about the big picture: the model's parameters end up roughly 1,000,000x smaller than the data it was trained on - that's an efficient compression algorithm.
The purpose of model training is to compress information while keeping the signal high - i.e. producing language output that humans would agree with. This is similar to going into the doctor's office with a list of symptoms you're feeling: it's their job to figure out which of those symptoms are important and indicative of the true issue, and which happen across a variety of ailments (and are therefore not useful).
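The doctor analogy can be made concrete with a toy sketch: a symptom that shows up in every ailment carries little signal, while one tied to a single ailment is highly indicative. All of the data below is invented for illustration:

```python
# Toy illustration of "which signals are informative": count how many
# distinct ailments each symptom appears with. Fewer distinct ailments
# means the symptom is a stronger, more specific signal.
from collections import defaultdict

cases = [
    ({"fatigue", "fever", "rash"}, "measles"),
    ({"fatigue", "fever"}, "flu"),
    ({"fatigue", "cough"}, "cold"),
    ({"fatigue", "rash"}, "measles"),
]

ailments_per_symptom = defaultdict(set)
for symptoms, ailment in cases:
    for symptom in symptoms:
        ailments_per_symptom[symptom].add(ailment)

# Sort symptoms from most specific (few ailments) to least (many ailments).
ranked = sorted(ailments_per_symptom, key=lambda s: len(ailments_per_symptom[s]))
print(ranked[-1])  # "fatigue" appears with every ailment, so it's least informative
```

Model training performs an analogous (if vastly more complex) filtering, keeping the signals that discriminate and discarding the ones that don't.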
What is a transformer?
Now let's come to the transformer, which underpins that training. The transformer allows efficient processing of chunks of data (think sentences), making it possible to train very large models in a reasonable amount of time. It provides a quantitative structure for a neural network to learn how sentences are constructed, what types of things are important indicators in a sentence, and how the positions of words in a sentence might influence these characteristics. If you're interested in a technical deep dive, read [1].
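The core operation inside a transformer is attention: each position in the sentence scores every other position and takes a weighted average of their representations. Here is a minimal pure-Python sketch of scaled dot-product attention on tiny invented vectors - a toy for intuition, not a real implementation:

```python
# Minimal scaled dot-product attention: score each position against
# every other, turn the scores into weights via softmax, and take a
# weighted average of the value vectors.
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Attention over lists of small vectors (lists of floats)."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Three "word" positions with 2-dimensional embeddings (invented numbers).
q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(q, k, v))
```

Because the weights come from a softmax, each output is a blend of the inputs, with the blend determined by how strongly positions relate to each other - exactly the "which words matter here" signal described above.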
At a high level, transformers allow the model to learn a set of salient attributes. As an illustrative example, consider an ambiguous word like "bank", which could mean a financial institution or the side of a river. The model has to rely on context clues within the sentence - the position of the word, other nearby words, etc. - to understand which of the two situations it finds itself in. It will then be able to correctly predict what word comes next.
How do you train one?
The training process for language models involves running many, many samples from the training data through the carefully and intricately defined transformer-based architecture, each time measuring how far the model's guess is from the correct answer and adjusting the model part-way towards that correct answer. Do this enough times, and the model eventually adjusts its non-linear (read: not following a simple pattern) arrangement of floating-point values to minimize the mistakes it makes when asked for information, while at the same time representing the data compactly and efficiently.
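The "guess, measure the error, adjust part-way" loop can be sketched with a one-parameter toy model (real LLMs adjust billions of parameters at once, but the mechanic is the same):

```python
# Sketch of the core training loop: guess, measure how far off the
# guess is, and nudge the parameter part-way toward the correct answer.
target = 0.9835245      # the "correct answer" for this toy example
param = 0.0             # the model's single parameter, to be learned
learning_rate = 0.1     # how far "part-way" each adjustment moves

for step in range(200):
    guess = param
    error = guess - target          # how far off the model's guess is
    param -= learning_rate * error  # adjust part-way toward the target

print(round(param, 4))  # prints 0.9835 - converged onto the target
```

Each step only closes a fraction of the gap, which is why training takes so many passes over so many examples.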
In practice, complexities - such as large models not fitting on a single machine, or a single machine taking too long to get through the billions of examples we want to provide - push us in the direction of scaling out to use more machines.
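One common scaling-out pattern is data parallelism: split each batch across several machines, compute each machine's gradient locally, then average. The simulation below uses a one-parameter toy model and invented data to show the idea:

```python
# Simulated data-parallel training: shard the data across "machines",
# compute each shard's gradient locally, then average the gradients
# (the step real systems perform with an all-reduce over the network).

def local_gradient(param, batch):
    """Gradient of mean squared error for a one-parameter model."""
    return sum(2 * (param - y) for y in batch) / len(batch)

data = [0.8, 1.0, 1.2, 0.9, 1.1, 1.0]  # toy targets; mean is 1.0
param = 0.0
num_machines = 3
shards = [data[i::num_machines] for i in range(num_machines)]

for step in range(100):
    grads = [local_gradient(param, shard) for shard in shards]
    avg_grad = sum(grads) / num_machines  # combine the machines' work
    param -= 0.1 * avg_grad

print(round(param, 2))  # prints 1.0 - same answer one big machine would reach
```

Because the shards here are equal-sized, averaging the shard gradients gives exactly the full-batch gradient - the machines collectively behave like one larger machine.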
This scale is the reason few attempt to train a 500B-parameter model; it's expensive and time-consuming. However, it is at this scale that broad excitement for this technology started last year. It is at this scale that large companies have the resources to train these models. And it is at this scale that these models exhibit emergent behaviors; pass standardized professional exams; and, in the case of a few like Google's, accept multimodal input (data types including text, image, video, and others) to provide a seamless interaction similar to how the human mind thinks.
Architectures and Associated Use Cases
Let's dive into two transformer architectures and a key refinement: the decoder-only architecture, the more widely used encoder-decoder architecture, and the most common add-on to encoder-decoders, instruction tuning.
With a decoder-only architecture, the user doesn't provide any real prompting or input - it's as if you ask a model to "just generate me some text". The model generates content stochastically from the dataset it has been trained on - these probabilities are what make each run different from the previous. There aren't many scenarios where this is helpful, but situations where random, placeholder, or broad-focus content is needed are where this architecture proves valuable. For example, a user who needed 'lorem ipsum' placeholder content, customized in some way, could use it. While this architecture isn't often applied to use cases, understanding what it does helps you learn deeply what the technology is capable of.
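That stochastic generation can be sketched directly: the model assigns probabilities to candidate next words and samples from them, which is why two runs differ. The vocabulary and probabilities below are invented for illustration:

```python
# Sketch of stochastic text generation: sample each next word from a
# probability distribution instead of always picking the top choice.
import random

# A toy next-word distribution (a real model computes one per step).
next_word_probs = {"the": 0.4, "a": 0.3, "lorem": 0.2, "ipsum": 0.1}

def sample_word(probs, rng):
    """Draw one word, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded here only so the sketch is reproducible
print(" ".join(sample_word(next_word_probs, rng) for _ in range(5)))
```

Rerunning with a different seed (or no seed) produces a different sequence - the same mechanism behind "each time is different than the previous."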
The encoder-decoder architecture is more of the standard for language models; in its raw form, the model takes in user input and builds on that input to predict the most likely terms following the sequence. In this form, the model could be used for things like email autocomplete (which many email apps are already leveraging) or as a writing assistant.
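Raw sequence completion can be illustrated with a toy word-pair model: given the words so far, repeatedly pick the most likely next word, much as an email autocomplete might. The counts below are invented; a real model learns them from billions of examples:

```python
# Toy autocomplete: a table of "how often word B follows word A",
# used greedily to extend whatever the user has typed so far.
bigram_counts = {
    "thanks": {"for": 9, "again": 3},
    "for": {"your": 7, "the": 5},
    "your": {"help": 6, "time": 4},
}

def autocomplete(prompt, steps=3):
    words = prompt.split()
    for _ in range(steps):
        options = bigram_counts.get(words[-1])
        if not options:
            break  # no continuation known for the last word
        words.append(max(options, key=options.get))  # most likely next word
    return " ".join(words)

print(autocomplete("thanks"))  # prints "thanks for your help"
```

An LLM does the same "predict the next term" job, but over a learned, context-sensitive model rather than a fixed lookup table.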
Instruction tuning, which builds on top of the encoder-decoder architecture, allows us to start asking the model questions and getting back answers as a human would provide them, rather than just completing the phrase given as input. The model is now able to function much like the LLMs you may have used and provide answers to questions or requests. This is the current state of the art: a promising fine-tuned iteration of an LLM with which we can start to look for business value through targeted application.
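What changes in instruction tuning is the training data: instead of raw text to continue, the model is fine-tuned on (instruction, response) pairs so it learns to answer rather than merely complete. The template below is a hypothetical illustration; real formats vary by model:

```python
# Sketch of instruction-tuning data preparation: render each
# (instruction, response) pair into a single training string.
examples = [
    {"instruction": "Summarize: The meeting moved to Friday.",
     "response": "The meeting is now on Friday."},
    {"instruction": "Translate to French: Hello.",
     "response": "Bonjour."},
]

def format_example(ex):
    """Render one pair using an illustrative prompt template."""
    return (f"### Instruction:\n{ex['instruction']}\n"
            f"### Response:\n{ex['response']}")

training_text = "\n\n".join(format_example(ex) for ex in examples)
print(training_text)
```

After fine-tuning on enough text shaped like this, the model learns that text following "### Instruction:" should be answered, not continued - which is what turns a phrase-completer into an assistant.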
Pitfalls - What are these models not?
LLMs are a lot of things, but at the end of the day they are next-word generators. Their conversations feel as natural and real as one you would have with a human, but that's because they have essentially read through billions of such conversations that humans have already had with each other on the internet, in books and print, and via the other sources the model trains on. So don't expect novel results when asking a model for a new business plan - it answers strictly from how people have answered in the past.
Their sweet spot is crunching large amounts of data - video, text, audio, or images - and synthesizing that information usefully when asked about it. They are already becoming useful in low-stakes situations like accelerating software engineering, and the use cases will only grow as more organizations consider applying the technology in different areas and as it gets more and more accurate. For example, Google builds these models on a foundation of Responsible AI, enabling safe usage of the tech and building trust with organizations.
Conclusion
Through this blog post we peeked into the world of large language models, a subset of generative AI. There is not enough room in even 100 blog posts to cover the details of this space, but hopefully this gives you some ideas for further exploration. An LLM is not a single concept but a collection of steps: selecting a dataset, configuring the training process correctly, and fine-tuning the model towards your ultimate goals. Most of us will never train models of this size, but knowing how they work gives us the tools we need to best leverage them and impact business results.
AI is here to stay; are you positioned to maximize its benefits?
References: