How do LLMs Work and Their Journey to Multimodality (Intuitively)
Karol Roszak
Generative AI | Data Science | Prompt Engineering | And a bunch of other hard expressions..
Hello and welcome to the second day of the "AI Advent" series. Today we are going to cover what Large Language Models are, how they work, and also what Multimodal AI is and why it is a good thing to see it coming.
Table of contents:
1. Introduction
2. What are Large Language Models (LLMs)?
3. Selected LLM limitations, challenges, and risks
4. From LLMs to Multimodal AI
5. Conclusions
1. Introduction
Throughout this year, the world of technology has witnessed a revolution led by Large Language Models (LLMs). It all started with ChatGPT, an AI chatbot developed by OpenAI and originally built on its GPT-3.5 model. Since its debut in November 2022, ChatGPT has not just been a novelty but a game-changer across various sectors.
From the rather unfortunate use case of automating customer service (yes, I still hate being forced to talk to a chatbot), through assisting with creative writing and SEO, to programming and even recipe generation, its influence has been wide-reaching. Although, be cautious with that last one, alright?
But have you ever wondered how LLMs work? Or have you noticed recent efforts to combine LLMs with multimodality? If you haven't, don't worry; let's find out together. We aren't going to dive into technical jargon, so no matter your background, you should understand it all pretty easily.
2. What are Large Language Models (LLMs)?
At their core, Large Language Models (LLMs) are advanced AI models/programs designed to understand and generate human-like text. They are the masterminds behind tools like ChatGPT, Google's Bard, and most of the other chatbots nowadays.
These models, including the renowned GPT-3, work by analyzing vast amounts of text data and learning associations between words (or, in AI jargon, tokens, which are words or pieces of words). Although it's a simplification, you can picture them as programs trying to predict the next word in a sequence. It's like a supercharged auto-complete that can not only write whole paragraphs but also reason about tasks it has never encountered before, based on similar (though not necessarily identical) ones it has already seen.
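To make the "supercharged auto-complete" idea concrete, here is a deliberately tiny, hedged sketch in Python: a toy bigram model that counts which word follows which in a small corpus and always predicts the most frequent follower. Real LLMs use neural networks with billions of parameters and work on tokens rather than whole words, but the core idea of predicting the next piece of text from what came before is the same. Everything in this snippet is invented purely for illustration.

```python
from collections import Counter, defaultdict

# A toy "training corpus" (real LLMs see hundreds of billions of tokens).
corpus = "the cat sat on the mat and the cat slept on the sofa"

# Count how often each word follows each other word (a bigram model).
follows = defaultdict(Counter)
words = corpus.split()
for current_word, next_word in zip(words, words[1:]):
    follows[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the most frequent follower of `word` seen during 'training'."""
    if word not in follows:
        return "<unknown>"
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # -> 'cat' (seen twice after 'the' in the corpus)
print(predict_next("sat"))  # -> 'on'
```

An LLM does something conceptually similar, except instead of counting word pairs it learns statistical patterns over very long stretches of text, which is what lets it produce whole coherent paragraphs rather than one plausible next word.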
LLMs are trained on colossal datasets comprising articles, Wikipedia entries, books, and other text-rich resources. Such training enables them to grasp language nuances and produce responses that are impressively accurate and indeed human-like, and, at least in terms of sheer breadth of memorized facts, each of them arguably covers more ground than any single human.
Nevertheless, this approach is not without flaws. The use of such extensive datasets has led to legal debates, especially around copyright issues. Basically, if your book, article, or whatever else you have written is present in the training data, should that be considered copyright infringement? We don't know yet.
If you are interested in the topic of copyright in AI, have a look here or here. There, you can find more on authors like George R.R. Martin or Jodi Picoult suing OpenAI.
3. Selected LLM limitations, challenges, and risks
Building on the previous explanation, you can view LLMs as sophisticated next-word prediction engines. Under the hood they rely on parameters, which are simply numbers; think of them as the model's decision-making knobs. Usually, the more parameters there are, the more powerful the model can become, assuming it has seen enough text in training. Intuitively: the bigger the model, the better, and the more data, the better as well.
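If "parameters are numbers" still sounds abstract, the hedged sketch below (using numpy, purely for illustration) shows that a model really is just a large collection of numbers stored in arrays. The layer sizes here are made up for the example; GPT-3, for comparison, has around 175 billion parameters.

```python
import numpy as np

# A "model" is essentially stacks of weight matrices plus bias vectors.
# These layer sizes are invented for illustration only.
layer_shapes = [(512, 2048), (2048, 2048), (2048, 512)]

weights = [np.random.randn(rows, cols) for rows, cols in layer_shapes]
biases = [np.random.randn(cols) for _, cols in layer_shapes]

# Every single number in these arrays is one trainable "parameter".
total_parameters = sum(w.size for w in weights) + sum(b.size for b in biases)
print(f"{total_parameters:,} parameters")  # ~6.3 million for this toy stack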
However, the principle of "junk in, junk out" applies here. The quality of an LLM's output depends heavily on the quality of the data it was trained on and the instructions you give it. So having the most powerful model architecture at hand won't get you far if it hasn't been trained on relevant data, or if your prompt engineering basically... sucks. Prompt engineering essentially means crafting specific inputs that guide the LLM towards more accurate and relevant outputs; we are going to explore it more in another article. Now let's mention some LLM limitations.
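Before we get to those limitations, here is a quick, hedged taste of what prompt engineering looks like in practice: compare a vague prompt with a structured one. The `ask_llm` helper below is a hypothetical placeholder for whatever chat model or API you happen to use; only the prompts themselves are the point.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder: send `prompt` to your LLM of choice and return its reply."""
    raise NotImplementedError("Wire this up to an actual model or API.")

# Vague prompt: the model has to guess the audience, length, and format.
vague_prompt = "Write something about multimodal AI."

# Engineered prompt: role, audience, constraints, and output format are explicit.
engineered_prompt = (
    "You are a technology writer. Explain multimodal AI to a non-technical reader "
    "in exactly 3 short bullet points, each under 20 words, with one concrete example per point."
)

# The second prompt typically produces far more usable output from the same model.
# answer = ask_llm(engineered_prompt)
```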
One challenge with LLMs is the occurrence of "hallucinations" in their responses. These are instances where the LLM generates content that sounds confident but is off-base or simply made up. Users must be cautious of these anomalies, as they can lead to misinformation or unintended results. It might be somewhat funny when you can tell the information the AI gives you is rubbish, but what about the times when you cannot tell the difference?
Another significant issue is the potential for biases in LLM outputs. Since these models learn from existing data, they can adopt and perpetuate any biases present in their training material. This can manifest in gender, racial, or cultural biases, among others, posing challenges to fairness and neutrality in AI-generated content.
4. From LLMs to Multimodal AI
The AI journey that began with text-based models like GPT-3 has now evolved towards a more comprehensive approach called Multimodal AI. This shift marks a significant leap from solely processing text to understanding and interpreting multiple forms of data simultaneously.
Multimodal AI is an advanced form of AI that can comprehend and interact with various types of data, such as text, images, videos, or even sound. This means these AI models don't just read words; they can analyze pictures, understand videos, and interpret audio cues all at once. The latest iteration from OpenAI, GPT-4, is a forerunner in this field, showcasing the ability to understand both text and images. (And yes, you can talk to it too, but it just transcribes your speech into text and then sends that to the model, so I call it cheating!)
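To give you a feel for what "text plus image in one request" means in practice, here is a hedged sketch of what such a request can look like. The structure roughly follows the message shape OpenAI's vision-capable chat models used at the time of writing, but the model name and URL are example placeholders only; always check the current documentation.

```python
# A sketch of a combined text + image request to a vision-capable chat model.
# The model name and image URL below are example placeholders, not real values.

multimodal_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What dish is on this plate, and how might it have been made?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/dinner.jpg"}},
    ],
}

# With an OpenAI-style client the call would look roughly like this:
# response = client.chat.completions.create(
#     model="gpt-4-vision-preview",   # example model name
#     messages=[multimodal_message],
# )
```

The key idea is that the image and the question travel together in a single message, so the model can reason about both at once instead of handling them in separate steps.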
Imagine AI models that can process text, images, videos, and sounds in tandem and provide responses in a similar manner. This capability would significantly enhance how AI systems understand and interact with the world, making them more intuitive and effective in various applications, from enhanced virtual assistants to advanced content creation and beyond. I would even dare to say their potential could be described as truly limitless.
5. Conclusions
We covered what LLMs are, mentioned some challenges they pose, and described recent advances in equipping LLMs with Multimodal AI capabilities, which seem so complex yet were achieved within a single year. Perhaps by 2024 or 2025, the need for Netflix could become a thing of the past, with people harnessing AI to craft their own superior cinematic experiences. ;)
Oh, I almost forgot about the mighty quote I always give at the end of my articles. Here is one from Bob Ross:
Talent is a pursued interest. Anything that you're willing to practice, you can do.
So don't just sit there; go and practice what you like.
And if you enjoy the AI Advent series, consider liking, sharing, and supporting my work; it gives me joy to see people actually enjoying my articles. Here is the link to the previous article in the series; to see more, visit my LinkedIn page:
And here will be the link to the next one (when it's ready).