Case study — LLM: Overhyped and Underwhelming?
Introduction
November 30, 2022 marked the launch of ChatGPT and the beginning of a new era in which AI finally started its quiet takeover. Instead of quickly wiping out the human race with weapons of mass destruction, it has taken a more insidious approach: making kids lazy by doing their homework for them and slowly growing the unemployment rate by making various jobs redundant. Some are excited, others are scared. While both of those reactions are justified, we would like to address an array of issues regarding large language models like ChatGPT and their usage that still need solving before anyone permanently joins either side.
Before delving into specifics, let's first get on the same page about what we consider a large language model to be. Large language models (henceforth abbreviated to LLMs) are artificial neural networks trained on a huge body of text, for example Wikipedia articles, crawled web content, and books. The aim of training such a model is to capture the general knowledge of language contained in that corpus: syntax, semantics and so on. After initial training, the network can be fine-tuned for specific downstream tasks. For instance, you may know how to speak French, i.e. which sentences are grammatically correct and in what order the words should appear (pre-training), but have no idea how to categorize butterflies into the Lycaenidae and Riodinidae families without first learning the necessary criteria (fine-tuning).
While there is a vast number of LLMs with different names (ChatGPT, BLOOM, LLaMA etc.), most are built on one of two underlying architectures: BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer). The main difference between the two is that BERT tries to predict a token considering both the preceding and the following context, while GPT takes into account only the preceding context. E.g. for the sentence "The reports of my [MASK] have been greatly exaggerated.", BERT tries to find a suitable candidate to replace the [MASK] token (Figure 1), while GPT aims to complete a given prompt by generating the most probable continuation sequence (Figure 2).
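To make the distinction concrete, here is a minimal sketch using the Hugging Face transformers library (our choice purely for illustration, not something the article's models depend on); the small bert-base-cased and gpt2 checkpoints stand in for the much larger models discussed here.

```python
# Minimal sketch of the BERT-vs-GPT difference using Hugging Face
# `transformers`. Small checkpoints stand in for the huge models in the text.
from transformers import pipeline

# BERT-style: predict the masked token using context on BOTH sides.
fill_mask = pipeline("fill-mask", model="bert-base-cased")
for candidate in fill_mask("The reports of my [MASK] have been greatly exaggerated."):
    print(candidate["token_str"], round(candidate["score"], 3))

# GPT-style: continue the prompt using only the PRECEDING context.
generate = pipeline("text-generation", model="gpt2")
print(generate("The reports of my", max_new_tokens=10)[0]["generated_text"])
```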
Although large language models aren't strictly restricted to those two types, most conversations around LLMs lately tend to revolve around GPT-like models, due to the success of OpenAI's ChatGPT and its many relatives. While LLMs gained widespread attention only after the release of ChatGPT, there haven't been any significant changes to the architecture of these models since the introduction of the attention-based Transformer in 2017. So what happened? The simple answer is: a LOT of data and fine-tuning.
While these models have proven to be rather powerful tools for solving many different problems, there are still a number of things to be cautious about. The issues range from the impact on creative industries to the environmental price of training such huge models, but here we focus mainly on the usage of LLMs as autonomous units in industry pipelines.
So what are the potential pitfalls that one should be aware of?
1. Is there an actual need for a generative LLM?
As mentioned in the previous section, generative LLMs are only one type of LLM, but for simplicity we hereafter use the term to refer only to generative models.
Generative LLMs are well suited to solving a wide range of complicated problems, but on many occasions there exists a simpler and more fitting solution. Airplanes are a lot faster than bikes, but would you use a private jet to get groceries from a nearby market? As with flying an airplane, there is a time and a place for using an LLM. For example, LLMs make great chatbots and plot summarizers, although those applications come with problems of their own, which will be addressed in later sections. However, we have noticed that the enthusiasm stemming from hands-on experience with ChatGPT tends to create the opinion that LLMs are the best solution for ALL tasks concerning natural language processing. For instance, we have heard a proposal to use an LLM for a simple binary classification task that could be solved with an "if-else" clause. Furthermore, even for more complicated problems, old-fashioned methods or lowly LLM cousins like BERT tend to have the upper hand once actual costs are analyzed. For example, generative LLMs can be used to detect toxic content, in the case of huge models like ChatGPT even with zero training examples (although the content policy sometimes blocks queries containing certain trigger words). However, using one in an actual comment moderation pipeline can turn out to be a lot more expensive than using a smaller and simpler classification model, as the toy comparison below illustrates.
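The contrast in miniature: for a simple binary check, a rule-based baseline costs nothing per query, while the LLM route means a paid API call for every single comment. The keyword list below is a hypothetical placeholder, not a real moderation policy.

```python
# Toy baseline for binary toxicity detection. BLOCKLIST is a hypothetical
# placeholder, not a production moderation policy.
BLOCKLIST = {"idiot", "moron", "trash"}

def is_toxic_rule_based(comment: str) -> bool:
    """Zero-cost check: flag the comment if it contains a blocked word."""
    words = {word.strip(".,!?").lower() for word in comment.split()}
    return bool(words & BLOCKLIST)

# The LLM alternative would be a paid API call per comment, e.g. prompting
# "Answer yes or no: is the following comment toxic? <comment>". Far more
# flexible, but it adds per-token cost and network latency to every comment.
print(is_toxic_rule_based("You absolute idiot"))    # True
print(is_toxic_rule_based("Lovely weather today"))  # False
```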
2. The balance between cost and good results
So the problem cannot be solved with a good old regular expression or a classical machine learning method, and there really is a justified need for an LLM. What are the options?
a) Use it via an API. OpenAI's pricing starts from $0.03 per 1,000 input tokens for its best model, with even cheaper options for the others. While the price itself might at first glance seem more than reasonable, is it really? Again, this depends on the actual complexity of the problem and the existence of viable alternative solutions, but if daily data volumes are sufficiently high, it can be a lot more cost-efficient to train and/or host your own smaller, problem-oriented models. It might be a bit inconvenient to label data and train a model when you could skip those steps by plugging into an API, but those small nuisances at the beginning pay off later. For example, we calculated that replacing our locally hosted models with the GPT-4 API would result in monthly costs twice as high as what the client using our service pays us.
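To see how quickly per-token pricing adds up, here is a back-of-envelope comparison. The workload, prompt length, and GPU price below are made-up numbers; plug in your own, and note that output tokens (billed separately) are ignored for simplicity.

```python
# Back-of-envelope: API fees vs self-hosting. All volumes are hypothetical.
PRICE_PER_1K_INPUT_TOKENS = 0.03   # USD, the GPT-4 input price cited above
COMMENTS_PER_DAY = 200_000         # hypothetical moderation workload
TOKENS_PER_COMMENT = 60            # hypothetical average prompt length

monthly_api_cost = (COMMENTS_PER_DAY * TOKENS_PER_COMMENT / 1000
                    * PRICE_PER_1K_INPUT_TOKENS * 30)
print(f"API: ~${monthly_api_cost:,.0f} per month")          # ~$10,800

# A small fine-tuned classifier on one rented GPU runs at a flat rate:
monthly_gpu_cost = 1.50 * 24 * 30  # hypothetical $1.50/hour GPU instance
print(f"Self-hosted: ~${monthly_gpu_cost:,.0f} per month")  # ~$1,080
```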
Moreover, there is another drawback: using a third-party application usually isn't an option for handling sensitive data. Even if data sensitivity is not an issue, there is still a certain level of danger in making your application dependent on a third-party service. What if you build your whole product around it and the service is suddenly shut down? Additionally, the price can be raised once a sufficient number of customers have become dependent on the application. A final API drawback is that it is actually hard to verify that the model "behind the curtain" stays the same one you signed up for: some Reddit users monitoring the responses noticed a sudden drop in quality starting about a week ago and speculated that it might stem from downgrading the models to keep up with the costs of hosting them (https://www.reddit.com/r/ChatGPT/comments/14xzohj/the_worlds_mostpowerful_ai_model_suddenly_got/).
b) Set up an LLM on your own infrastructure. While OpenAI's best models aren't open source, there are plenty of alternatives that claim to be on par with ChatGPT. However, hosting such huge models isn't cheap: for example, running the 65-billion-parameter LLaMA model without hiccups requires multiple Nvidia A100 GPUs (or equivalents) costing around $10,000 each. Although there are smaller models with less costly hardware requirements, they also have significantly lower capabilities than huge models like ChatGPT.
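The multi-GPU requirement follows directly from the model size. A rough weights-only estimate, assuming 16-bit weights and 80 GB A100 cards:

```python
import math

# Rough VRAM estimate for a 65-billion-parameter model. This counts the
# weights only; activations and the KV cache add further overhead on top.
params = 65e9
bytes_per_param = 2                  # fp16/bf16 weights
weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")   # ~130 GB

a100_memory_gb = 80                  # one A100 80GB card
print(f"A100s needed just for the weights: "
      f"{math.ceil(weights_gb / a100_memory_gb)}")  # 2, before overhead
```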
3. Legal issues
There are several ongoing court cases against OpenAI Inc. that could have a significant impact on how LLMs are trained and can be used. For example, there is a class action copyright lawsuit claiming that ChatGPT is trained on books without permission from the authors: "The complaint filed in San Francisco federal court on Wednesday said ChatGPT's machine learning training dataset comes from books and other texts that are "copied by OpenAI without consent, without credit, and without compensation." The complaint cited a 2020 paper from OpenAI introducing GPT-3, which said 15% of the training dataset comes from "two internet-based books corpora." The authors alleged that one of those book datasets, which contains over 290,000 titles, comes from "shadow libraries" like Library Genesis and Sci-Hub, which use torrent systems to illegally publish thousands of copyrighted works."
In addition to the use of copyrighted material in training data, there exists another kind of legal breach that is difficult to fight: fine-tuning licensed models and releasing them under a different name and license.
4. Regulations
The lack of precedent and regulation in the field of AI makes it risky to invest in such pipelines when they could become regulated in the months and years to come. For example, the EU is currently in the process of forming the first comprehensive piece of AI legislation, the AI Act.
5. Hallucinations and difficulties in validation
The knowledge of an LLM is strictly restricted to the data it was trained on. For example, the current version of ChatGPT doesn't contain any training data beyond the year 2021 and thus lacks understanding of anything that has happened since. Furthermore, even when an LLM should be able to answer a question correctly, it can still hallucinate while answering, which makes integrating such models into client-facing interfaces somewhat dangerous. In some cases, the model honestly states that it doesn't know the answer or acknowledges that its knowledge might not be up to date. For example, if ChatGPT is asked to name the winner of the last World Cup, it answers that the winner is France, but also states that its knowledge cut-off is September 2021 and that there may have been subsequent World Cups since then. However, it can also quite confidently express complete nonsense or, even worse, nonsense mixed with some truth. Figure 3 provides an example of the latter: while the names of some characters are indeed correct (Arno Tali, Joosep Toots), their descriptions are completely incorrect.
There are methods that try to combat this problem, like querying an internal database for context to include in the prompt (a minimal sketch follows below), but setting up such pipelines requires additional engineering and in the end still won't guarantee 100% accuracy. One might argue that it is extremely difficult to achieve complete accuracy with any machine learning method, so why do some incorrect answers suddenly bear more weight? Firstly, it is harder to "fix" generative LLMs: with simpler methods, one can usually get some idea of what triggers the wrong answer and find a pattern in the inputs that the model struggles with the most. After that, one can put together an enriched dataset and retrain the model. In theory, the same can be done with generative LLMs as well, but it requires a lot more knowledge of the inner workings of the model. Secondly, it might be hard to validate whether an answer is correct: when the model has answered a lot of questions correctly before, we can be more inclined to believe it, especially when we lack the knowledge and/or resources to critically evaluate the response ourselves. Furthermore, these models tend to phrase their answers in a very confident style, which makes it even harder to doubt their knowledge.
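Here is what such a context-querying pipeline looks like in miniature. The bag-of-words similarity and the ask_llm stub are toy placeholders for a real embedding index and a real model call.

```python
from collections import Counter

# Miniature retrieval-augmented pipeline: fetch the most relevant document
# from an internal knowledge base and paste it into the prompt, so the model
# answers from known facts rather than its (possibly stale) training data.
KNOWLEDGE_BASE = [
    "Argentina won the FIFA World Cup in December 2022.",
    "France won the FIFA World Cup in 2018.",
]

def similarity(a: str, b: str) -> int:
    """Toy relevance score: count of shared lowercase words."""
    return sum((Counter(a.lower().split()) & Counter(b.lower().split())).values())

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for your actual LLM call")

def answer(question: str) -> str:
    context = max(KNOWLEDGE_BASE, key=lambda doc: similarity(doc, question))
    return ask_llm(f"Using only this context: {context}\nAnswer: {question}")

# answer("Who won the last World Cup?") now grounds the model in the 2022
# result instead of its September 2021 knowledge cut-off.
```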
Conclusion
Generative LLMs can be helpful tools, but before making them an integral part of your services, it is advisable to consider whether they really are the best and only solution. Forthcoming regulations might restrict the way they can be used, and ongoing legal battles might put the brakes on further development of the most commercially successful models. Furthermore, simpler methods might actually be a better and more affordable fit, especially when it comes to handling sensitive data. This doesn't mean that LLMs should be ignored completely: people are working every day to make open-source solutions better and more affordable by reducing the size of the models without significant sacrifice in quality. So keep an open, but slightly critical, mind!