What I learned from Bloomberg's experience of building their own LLM
Hey fellow AI explorers,
Most people experimenting with AI large language models (LLMs) use readily available pre-trained models such as ChatGPT, LLaMA, and BLOOM, among a constantly growing list of others. Only a handful of companies build their own LLMs from scratch, a process also known as pretraining. What are the pros and cons of undertaking such an endeavor?
I recently gained valuable insights by listening to the TWIML AI podcast [1] (see the 1st comment below) about how Bloomberg developed BloombergGPT, an LLM for internal company experimentation. The podcast demystified the process of creating a custom LLM. The host, Sam Charrington, talks with David Rosenberg, who leads Bloomberg's machine learning strategy team. I highly recommend giving it a listen if the subject intrigues you.
One aspect of LLMs the podcast sheds light on is the terminology used to describe different ways of deploying LLM technology. For example:
The term "foundation model" when used to describe an LLM implies the expectation that this LLM is to serve as the basis of future, fine-tuned LLMs.
The podcast also covers other interesting aspects of LLM pretraining that I will discuss in a follow-up article.
Different approaches to using and building LLMs
The three approaches covered below are listed in order of difficulty, in terms of human effort, time, and expense, from most to least difficult.
Constructing your own LLM from scratch, i.e., pretraining
This approach requires a vast collection of text training data (while some LLMs utilize image data, this article focuses solely on text). The process might span several months and could cost anywhere from $1 million to $100 million.
Ultimately, the model will comprise anywhere from tens of millions to a trillion parameters, the weights of the connections within the intricate, multi-layered architecture of the LLM's deep neural network (DNN).
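As a very rough illustration of where these costs come from, here is a back-of-envelope sketch using the common rule of thumb that training takes about 6 floating-point operations per parameter per training token. The model size, token count, GPU throughput, and price below are assumptions chosen purely for illustration, not figures from the podcast.

```python
# Back-of-envelope pretraining cost estimate using the common ~6 * N * D
# FLOPs rule of thumb. All numbers are illustrative assumptions, not
# figures reported by Bloomberg.
params = 50e9                # assumed model size: 50 billion parameters
tokens = 700e9               # assumed training data: 700 billion tokens
flops = 6 * params * tokens  # total training compute, in FLOPs

gpu_flops_per_sec = 150e12   # assumed sustained throughput per GPU (~150 TFLOP/s)
gpu_cost_per_hour = 2.50     # assumed cloud price per GPU-hour, USD

gpu_hours = flops / gpu_flops_per_sec / 3600
print(f"~{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * gpu_cost_per_hour:,.0f}")
# roughly 390,000 GPU-hours and on the order of $1 million at these assumed rates
```

Plugging in different assumptions easily moves the result by an order of magnitude, which is one reason published cost estimates for pretraining vary so widely.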
Building your own LLM, while expensive and time-consuming, gives you the most control over how the raw input data is processed and offers the most protection for your proprietary data.
Fine-tuning an existing foundation model (FM)
With an FM and a unique dataset, like Bloomberg's extensive financial text data, you can refine the FM to create an updated LLM. This is particularly effective if the fine-tuning data differs significantly from the FM's training data.
Fine-tuning can be more cost-effective (ranging from thousands to tens of thousands of dollars) and quicker (from a few hours to several days) than building from scratch.
To fine-tune an FM, access to the FM's DNN parameters is essential. It's worth noting that, as of this writing, OpenAI hasn't provided access to GPT-4 or GPT-3 parameters, making them unsuitable for fine-tuning. However, other models, like Llama 2, do offer their parameters for fine-tuning.
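To make the fine-tuning option more concrete, here is a heavily simplified sketch using the Hugging Face transformers library on an open-weights model. The model name, the finance_corpus.txt data file, and the hyperparameters are illustrative assumptions, not Bloomberg's actual setup.

```python
# Minimal causal-LM fine-tuning sketch with Hugging Face transformers.
# Model, data file, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # any open-weights causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assume a plain-text file of domain documents, one example per line.
dataset = load_dataset("text", data_files={"train": "finance_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned-model",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice, parameter-efficient techniques such as LoRA are often layered on top of a setup like this to reduce the memory and compute needed for fine-tuning models of this size.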
Employing in-context learning (ICL)
ICL often involves crafting effective prompts to elicit desired outputs from an LLM chat interface, such as ChatGPT. Including sample inputs and their corresponding outputs in the prompt is a form of ICL. For example, a prompt might contain a few word/definition pairs in a particular format, followed by a final word on its own; the LLM would then be expected to produce that word's definition in the format established earlier in the prompt. See [2]. Variants of ICL include few-shot, one-shot, and zero-shot learning [3]. According to [2], ICL acts as a kind of temporary fine-tuning, although exactly how it works remains an open research question.
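To make the word/definition example above concrete, here is a minimal few-shot prompt sketch using the OpenAI Python client; the example words, model name, and wording are my own illustrative choices.

```python
# Minimal few-shot in-context learning sketch using the OpenAI Python
# client (openai>=1.0). The demonstration pairs and model name are
# illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Demonstration pairs establish the input/output format within the prompt itself.
prompt = (
    "Word: ephemeral\nDefinition: lasting for a very short time.\n\n"
    "Word: gregarious\nDefinition: fond of company; sociable.\n\n"
    "Word: laconic\nDefinition:"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model works here
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # expected: a definition in the same format
```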
ICL is the most straightforward way to leverage someone else's LLM, like using OpenAI's ChatGPT or GPT-4 APIs without the need for fine-tuning.
Using an LLM through APIs hosted over the Internet might risk exposing proprietary data to the LLM's host. This is one of the reasons Bloomberg decided to explore pretraining their own LLM: it carried the least risk of exposing their valuable data to a company like OpenAI.
ICL is the most affordable option compared to pretraining and fine-tuning, costing only the token processing and computation fees associated with each prompt submitted.
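For a sense of scale, a rough per-prompt cost calculation looks like the following; the per-token prices are placeholder assumptions, since actual API pricing varies by model and changes over time.

```python
# Rough per-prompt cost of ICL via a hosted API. Prices are placeholder
# assumptions; check the provider's current price list.
price_per_1k_input_tokens = 0.01    # assumed, USD
price_per_1k_output_tokens = 0.03   # assumed, USD

prompt_tokens = 1500     # e.g., instructions plus a few in-context examples
output_tokens = 300

cost = (prompt_tokens / 1000) * price_per_1k_input_tokens \
     + (output_tokens / 1000) * price_per_1k_output_tokens
print(f"~${cost:.4f} per prompt")   # about 2.4 cents at these assumed rates
```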
How much effort and money did Bloomberg put into their LLM?
In the podcast, David Rosenberg of Bloomberg reported that the total cost was just over $1 million. This makes their model comparable to OpenAI's GPT-3, which reportedly cost about $4.6 million to train. Compare that to OpenAI's GPT-4, which reportedly cost over $100 million.
Rosenberg reported that once they had settled on their processing methodology, the final compute run took 53 days. See the podcast [1] for details on the exact training architecture. They tried other processing methodologies first, so the total compute time exceeded those 53 days. The whole project lasted about one year.
The team that worked on this project consisted of nine full-time employees. I believe that four of them did the coding, building the machine learning system and running the experiments and training. The other five reviewed the literature for the latest methods in LLM pretraining and drove the effort to optimize the final LLM.
Conclusion
This article has mainly described the differences between LLM pretraining, fine-tuning, and ICL, as background and in the context of Bloomberg's decision to pretrain their own LLM.
In my next article, I'll share my perspective on why Bloomberg opted to go through the whole process of LLM pretraining. I'll also discuss the customization benefits they gained, such as modifying the tokenizer, which wouldn't have been possible otherwise.
I welcome your comments and will try to answer them.