The LLM-ephants In The Room
Chris Gathercole
Director | ML/AI Strategy Development | Rapidly Prototyping Capabilities | New Product R&D
In any discussion of Large Language Models (LLMs), it feels necessary to list a variety of recurring caveats, all of which are well known, if somewhat scattered. Some are existentially significant, such as energy consumption, yet they are not receiving the same level of coverage as the breathless excitement about the latest developments. They are a very current set of metaphorical “elephants in the room” (wikipedia).
I have been sketching out some thoughts on the potent and frankly amazing text-handling capabilities of LLMs, eg. OpenAI/(Chat)GPT, Anthropic/Claude, Meta/Llama, and others. We are in a golden age of rapid, effective, and cheap/free prototyping involving rich text content. Previously esoteric and high-end ML capabilities are now available to every company, no matter how small or under-resourced.
These seriously impressive text capabilities are at their most usable ever. If you can open a tab in a web browser and you are comfortable with a bit of cut’n’pasting, you can start prototyping, exploring, and seeking relevance in and for your existing content. To go a step further and automate your prototype requires just some basic programming.
It seems pretty clear that companies not paying attention to this, those not looking for opportunities, will be seriously disadvantaged and outcompeted by those that do. LLMs are as-yet under-utilised tools for parsing and processing texts, extracting metadata, helping with first drafts (for completion by a human), etc. Generation of original content is not the best use of them.
This doc is to collate some of the caveats around the use of LLMs. I will put my spin on things, using a moderately clickbait-y “bad news / good news” approach, skipping nuance with abandon.
This doc is also an invitation for folk who just might properly understand the underlying issues of LLMs to discuss these issues, possibly even disagreeing with me, and to highlight ones I have assuredly missed. The world of LLMs is moving so fast and so profoundly that the one thing you can predict with any confidence is the immediate irrelevance of this doc, though the caveats have been around since LLMs began. Airing these issues has helped me get my thoughts in order.
The main caveat to all the caveats listed below is that every day there are new developments ameliorating many of the weaknesses and problems with LLMs, as well as adding new capabilities and use cases. There is a race to the top, comprising all the main IT providers and thousands of interest groups and highly-motivated individuals. The AI fire has been lit. But there is no need to wait. Commercially-useful, easily usable, and above all cheap capabilities, in the form of LLMs, are available right now.
A basic example of working with an LLM
The following scenario is a variation of a well-known challenge called ‘Named Entity Recognition’: identifying people and organisations explicitly named in the text. Long before LLMs appeared, there were various services and systems which could do this quite well. However, for the uninformed user these (pre-LLM) services were not straightforward to access, easy to use, or likely to produce high-quality results.
With LLMs, pretty much as a side-effect, suddenly it is easy. Furthermore, also as a side-effect, you can do so much more than simply identify named entities. In this example we group them by type, with the types automatically selected by the LLM, and obtain a clear description of their relevance to the article. But we could also have taken the prompt further and created relationship maps of who did/said what to who, or identified unnamed but implied actors, or elicited events, or highlighted specific topics, e.g. “anything to do with indirect payments”, etc. If you can express it in clear English, it is quite possibly possible. And quick and easy and free to experiment with.
Let’s assume you have access to an LLM in a browser, say OpenAI’s free ChatGPT.
You have some content — perhaps one of your own articles, or an article scraped from the web, or a client’s brief, or a content feed you are paying for but not taking full advantage of, or a whistleblower has given you a data dump to explore. Let’s use the opening section of this wikipedia page about penicillin as our content to be explored.
You want to get an overview of the content, so let’s identify any people or companies mentioned in it, and let the LLM decide how best to group them.
You construct the following prompt and paste it into the ChatGPT input box:
You are a tool for analysing text documents, summarising and extracting useful metadata.
Read this article: <article>
Ancient societies used moulds to treat infections, and in the
following centuries many people observed the inhibition of bacterial
growth by moulds. While working at St Mary’s Hospital in London in
1928, Scottish physician Alexander Fleming was the first to
experimentally determine that a Penicillium mould secretes an
antibacterial substance, which he named “penicillin”. The mould was
found to be a variant of Penicillium notatum (now called Penicillium
rubens), a contaminant of a bacterial culture in his laboratory. The
work on penicillin at St Mary’s ended in 1929.
In 1939, a team of scientists at the Sir William Dunn School of
Pathology at the University of Oxford, led by Howard Florey that
included Edward Abraham, Ernst Chain, Norman Heatley and Margaret
Jennings, began researching penicillin. They developed a method for
cultivating the mould, and extracting, purifying and storing
penicillin from it. They created an assay for measuring its purity.
They carried out experiments with animals to determine penicillin’s
safety and effectiveness before conducting clinical trials and field
tests. They derived its chemical formula and determined how it
works. The private sector and the United States Department of
Agriculture located and produced new strains and developed mass
production techniques. Penicillin became an important part of the
Allied war effort in the Second World War, saving the lives of
thousands of soldiers. Fleming, Florey and Chain shared the 1945
Nobel Prize in Physiology or Medicine for its discovery and
development.
</article>
List any entities mentioned in the article,
with their relevance to the article.
Group them by type.
And ChatGPT responds seconds later with
Entities mentioned in the article, grouped by type:
People:
Alexander Fleming — Scottish physician who experimented and determined that Penicillium mould secretes an antibacterial substance, penicillin.
Howard Florey — Led a team of scientists at the University of Oxford researching penicillin.
Edward Abraham — Member of the team at the Sir William Dunn School of Pathology involved in penicillin research.
Ernst Chain — Member of the team at the Sir William Dunn School of Pathology involved in penicillin research.
Norman Heatley — Member of the team at the Sir William Dunn School of Pathology involved in penicillin research.
Margaret Jennings — Member of the team at the Sir William Dunn School of Pathology involved in penicillin research.
Organizations:
St Mary’s Hospital, London — Where Alexander Fleming worked and conducted experiments on penicillin in 1928.
Sir William Dunn School of Pathology, University of Oxford — The institution where a team of scientists led by Howard Florey researched penicillin.
United States Department of Agriculture — Involved in locating and producing new strains of penicillin and developing mass production techniques.
Substances:
Penicillium mould — Source of the antibacterial substance later named “penicillin.”
Terms/Concepts:
Penicillin — Antibacterial substance secreted by Penicillium mould, discovered and developed by Alexander Fleming, Howard Florey, and Ernst Chain.
Allied war effort — Penicillin became an important part of the Allied war effort in the Second World War, saving the lives of thousands of soldiers.
Nobel Prize in Physiology or Medicine — Award shared by Fleming, Florey, and Chain in 1945 for the discovery and development of penicillin.
Not bad for a first go. You can now start tweaking the prompt, perhaps to resolve the clumsy distinction between ‘Substances’ and ‘Terms/Concepts’ by specifying your own classification scheme in the prompt, e.g. “group by people, organisations, and all other concepts”.
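To go from cut’n’paste to automation, the same prompt can be sent programmatically. Below is a minimal Python sketch, assuming you have an OpenAI API key and the official openai package installed; the model name and file name are illustrative placeholders, not recommendations.

# Minimal sketch: automate the entity-extraction prompt via the OpenAI API.
# Assumes the openai package is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

article_text = open("penicillin_intro.txt").read()  # whatever content you want analysed

prompt = (
    "You are a tool for analysing text documents, summarising and extracting useful metadata.\n"
    f"Read this article: <article>\n{article_text}\n</article>\n"
    "List any entities mentioned in the article, with their relevance to the article.\n"
    "Group them by type."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; pick whichever model is current and cheap enough
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)

The same loop can then be pointed at a whole folder of articles rather than a single pasted extract.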
Some LLM-ephants
Confabulations, Hallucinations, or Just Making Stuff Up
Bad News
This weakness is integral to how LLMs are constructed and trained.
The ‘G’ in GPT stands for Generative. Based on trillions of word combinations seen in billions of documents during training (the precise quantities may vary), an LLM looks at the prompt+response word sequence so far and generates, based on statistics, the word most likely to be next in the sequence (skating over a teensy bit of detail here). There is no understanding of fact or truth.
Multiple outraged author folk on X/Twitter and Reddit have highlighted cases where ChatGPT has constructed realistic-looking but fake references and citations to content they have never written. Much hilarity ensues.
Interestingly, whilst LLMs can and will make up stuff in this way, pure fiction, they can also leak verbatim chunks of their training data when they are really really not meant to. For some situations, this can be worse.
Reinforcement Learning from Human Feedback (RLHF) is a phase of training that uses humans to reward output that ‘looks’ right. In a very real and insidious sense, LLMs are trained to fool humans.
Good News
While you cannot fully prevent confabulations, you can fairly easily express the prompt in such a way that you give the LLM very little wiggle room to confabulate. By front-loading the prompt with the most relevant context, and using a variety of idioms, you can increase the likelihood that the LLM’s focus stays on the prompt, keeping its responses (mostly) relevant and useful.
See the penicillin example above. The prompt has constrained the LLM tightly to the specific task, and to the specific text to be analysed for entities.
You can have additional, automated fact-checking steps, to increase your confidence in the output. Revisit and rephrase the prompts if the LLM starts straying.
Moreover, you can turn the LLM on itself, and get it to mark its own homework, as it were. Present the previous prompt and the response as part of a new prompt and require it to identify any confabulation issues.
Obviously, the LLM might then confabulate evidence of confabulations, but the likelihood of such errors becomes very very low.
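A minimal sketch of that ‘mark its own homework’ step, continuing from the automation sketch earlier; the exact wording of the checking prompt is just one possible phrasing, not a recipe.

# Sketch: ask the LLM to review its own previous answer for confabulations.
from openai import OpenAI

client = OpenAI()

# prompt       -- the original prompt string sent earlier
# first_answer -- the response text the LLM produced for it
check_prompt = (
    "Below is a prompt that was given to an LLM, and the response it produced.\n"
    f"<prompt>\n{prompt}\n</prompt>\n"
    f"<response>\n{first_answer}\n</response>\n"
    "List any claims in the response that are not supported by the text in the prompt. "
    "If every claim is supported, reply with the single word OK."
)

check = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": check_prompt}],
)
print(check.choices[0].message.content)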
(Extremely) High Energy Consumption
This is an interesting, and properly worrying, topic. Alas there is no convenient unit for comparing energy consumption. The measurements are highly nuanced, e.g. depending on whether you factor in the renewable-ness/locality of the energy source. Obviously, different queries will consume differing amounts of energy.
Bad News
Serious amounts of energy are in fact being consumed, and the consumption is increasing rapidly.
Here are some unhelpful, incompatible numbers.
Google searches
Video streaming
and of course, Bitcoin!
and now we have LLMs
Energy consumption is clearly a hot topic. The incompatible units for the various consumers are frustrating. Perhaps this could be the topic of another essay.
Bitcoin is a beast, far outstripping mere search and AI. Even a single transaction consumes an enormous amount of energy. Multiply that by the overall amount of Bitcoin traffic, and Bitcoin consumes as much energy as a small/medium country.
Per search query, things might not seem so bad. But again, multiply that out by the accelerating use (e.g. adding LLMs to every Google and Bing query) and it is significant.
The energy cost of training an LLM eclipses that of Bitcoin transactions, but training happens far less often. Still, …
This is rarely mentioned in the press.
Good News
Issues around LLM energy consumption are (at least somewhat) under scrutiny.
The energy costs of training and using LLMs are increasingly included in assessments of LLMs.
Researchers are actively striving for ways of reducing the energy footprints of LLMs.
Improvements in technology are leading to more powerful, and more energy efficient processors.
You can refactor the way you use LLMs, reducing overall call volumes, incidentally reducing response times.
Similar to the entry about cost of productionizing, you can refactor your use of LLMs into specific (smaller, cheaper) MLMs and SLMs. There is a fairly direct correlation between lower costs and lower energy consumption.
Expensive to Productionize?
Having prototyped very cheaply, if not for free, you will be faced with the realities of automating your prototypes.
Bad News
The charging models for commercial, externally-hosted LLMs are usually based on query size. A typical charge might be $10 per million tokens, where a token is roughly three-quarters of a word in the prompt + response. For a large query, where you might be specifying a large, complicated context in the prompt and requiring a chunky response, you could be reaching 10,000 tokens, which brings the per-query cost to approx 10 cents.
For prototyping, this is basically free. But if you want to scale up to thousands of live user queries (don’t), or many thousands or millions of pieces of content, the costs escalate significantly.
Another inevitable gotcha is your prompts will get bigger as your LLM ambitions increase, so the cost-per-query will increase.
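To make the arithmetic concrete, here is a back-of-envelope sketch. The price and token figures are the illustrative numbers from above, and the archive size is a made-up example, not a benchmark.

# Back-of-envelope cost estimate for batch-processing an archive with a hosted LLM.
price_per_million_tokens = 10.00   # dollars, illustrative figure from above
tokens_per_query = 10_000          # large prompt plus a chunky response
documents = 100_000                # hypothetical size of your content archive

cost_per_query = tokens_per_query * price_per_million_tokens / 1_000_000
total_cost = cost_per_query * documents
print(f"~${cost_per_query:.2f} per query, ~${total_cost:,.0f} for the full archive")
# prints: ~$0.10 per query, ~$10,000 for the full archive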
Good News
You can easily switch between LLM providers, taking your hand-constructed prompts with you, seeking the cheapest, most suitable LLM. It is likely your prompts will work nearly as well on a different LLM, and can easily be edited to do so.
Most LLMs have a free-to-use mode, with a basic, rate-limited SLA. For example, at one stage Anthropic/Claude’s licensing on their free product allowed one concurrent request across an enterprise. While perfect for prototyping, this may even be sufficient for constrained production loads.
If you do need to scale up, the charging model makes it easy to accurately calculate expected costs, and you can make the decision as to whether it is worthwhile.
If you still want to scale up, and the external-hosted costs are too onerous, you can contemplate self-hosting. There are many downloadable LLMs available from the likes of HuggingFace. This becomes an engineering exercise, albeit a non-trivial one. Again, you can forecast costs accurately, which will now be based on availability rather than call sizes.
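For a flavour of what self-hosting looks like at the very simplest end, here is a minimal sketch using the HuggingFace transformers library. The model name is illustrative, and a real deployment adds GPU provisioning, batching, queueing, and monitoring on top of this.

# Sketch: run a downloadable instruction-tuned model locally via HuggingFace transformers.
# Assumes transformers and torch are installed and the model fits on your hardware.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # illustrative; substitute any local instruct model
)

prompt = "List the people and organisations named in this article, grouped by type: ..."
result = generator(prompt, max_new_tokens=500)
print(result[0]["generated_text"])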
If you still want to scale up, and the costs of hosting your own full-size LLM are too high, you can revisit your prototypes and investigate how to refactor them to use smaller, cheaper, MLMs or even SLMs. Most of the individual capabilities that can be handled by the Jack-of-all-trades LLMs can in fact be done better (faster, cheaper) by smaller, more focussed models. This refactoring is very doable, albeit non-trivial. You will need to invest much more time and effort in the ML-engineering aspect of the prototyping stage.
Copyrighting the Input and the Output
This is a highly nuanced, contentious topic, hedged with ifs and buts, where expertise is necessary. (bloomberglaw.com)
Bad News
It seems clear there were copyright violations galore during the training of the big LLMs, where significant amounts of copyrighted text were slurped up and used without permission.
There are debates (and doubts) about whether the raw output of an LLM can be copyrighted by the user, or even the provider.
Good News
My preference is to consider the LLM as a tool for parsing and processing texts, extracting metadata, helping with first drafts (for completion by a human), etc, and not for generation of original content, and especially not for the generation of content ‘in the style of’ a copyrighted author. Any ‘style’ instructions should be defined in the prompt, to enact your (personal or org’s) style choices.
If the LLM is used for its ability to parse natural language, and to (kinda sorta) follow natural language instructions for processing the texts, you are using its general (and potent) capabilities gleaned from the totality of its training, and not reliant on training content stolen from any specific authors. This ‘nice’ use of LLMs should survive to later generations of LLM which are only trained on ‘clean’ data, should that come to pass.
‘They’ Are Stealing Our Content
The ‘they’ in this case being the developers of the original LLMs.
Bad News
In short, yes, they probably did, up til 2021 or so. Anything available online was considered fair game. An exciting assortment of legal proceedings are underway to tackle this.
There are reports nowadays of LLM providers (and seekers of LLM training data) starting to honour the access controls on site content, such as the robots.txt file, newly developed meta tags, or copyright licences (as discussed on searchengineland.com). This relies on the seekers of training content behaving well and following the rules. Being realistic though, the only way to be sure your content is not appropriated is to not expose it online.
If you use an externally-hosted LLM, your prompt (and obviously, the response) will be known to the provider.
Good News
In external providers’ Ts&Cs there should be some statement of how long and for what purpose the query transactions are retained. Your Legal person/team should be able to assess the safety/privacy of sharing your current content with the external provider via queries.
There is an obvious, straightforward way to be sure your content is not appropriated for training, and extremely sensitive, private content is never seen by external providers: host your own LLM.
You can still prototype using the free external LLMs by preparing some safe, exemplar content, or anonymising the content, or constructing bespoke content, specifically for prototyping stage.
Look, An Eagle!, or Losing Focus
Most of the big LLMs are competing on the size of their context window, ie., how much text the LLM can ingest via the prompt and then generate a full response, while staying relevant to the prompt.
Bad News
As you increase the size and complexity of your prompts, various suboptimal behaviours crop up. A common sign is that the response simply curtails the list it is generating.
Or the format of the output changes.
Or the level of detail suddenly drops.
Or the response simply stops in the middle of the …
Good News
You can easily identify these limits during prototyping, establishing ‘safe’ sizes of content and complexity of prompt. You might find you can safely supply 5k of text and specify 3 separate tasks in the same prompt, and expect 5k of text in the response, for example.
You can do some automated processing on the LLM output to check it has (probably) honoured the full prompt.
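A minimal sketch of such a check, assuming your prompt asked for the response to be grouped under a fixed set of headings and to finish with an explicit end marker; the heading names and marker here are hypothetical.

# Sketch: cheap sanity checks that the response (probably) honoured the full prompt.
REQUIRED_HEADINGS = ["People:", "Organisations:", "Other concepts:"]  # as requested in the prompt
END_MARKER = "END OF ANSWER"                                          # as requested in the prompt

def looks_complete(response_text: str) -> bool:
    has_headings = all(h in response_text for h in REQUIRED_HEADINGS)
    has_marker = response_text.rstrip().endswith(END_MARKER)
    return has_headings and has_marker

# answer holds the LLM's response text
if not looks_complete(answer):
    print("Response looks truncated or off-format; re-issue or rephrase the prompt.")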
There are various idioms you can include in the prompt that increase the likelihood of the LLM staying on track during a sequence of tasks in the same prompt, e.g. (and yes, really)
“Take a deep breath and work on this step by step” (arstechnica)
New idioms are being discovered all the time, as our understanding improves of how LLMs tick.
You can home in on phrasing that your LLM is more ‘comfortable’ with. You can front load the prompt with your required structure (e.g. a list of terms you are specifically looking for) and leave less room for the LLM to wiggle out of it.
This is part psychology, part art form, part telepathy. LLM Whispering. Fun and frustrating. There are no hard and fast rules.
Jailbreaks, Leading Questions, or Making The LLM Say Bad Things
This is where ‘bad actors’ strive to defeat any protections built into the LLM, to access the underlying data, or to get the LLM to say bad or inappropriate things.
Bad News
People being people, especially online, this is of course inevitable. Microsoft knows this well (theverge: tay-microsoft-chatbot-racist).
Putting in effective safeguards is an enormous task. The big providers are investing much effort and money into this. Leave them to it.
Good News
By far the simplest approach, and certainly my preferred one, is simply not to allow users direct access to your LLM. If your LLM prompts are defined offline for specific offline tasks, e.g. extracting named entities from some text, the bad actors can’t hurt you.
As the user of the LLM, if you are not a bad actor, you won’t be doing bad actor-y things.
This is something of a non-problem for offline use of LLMs, but is a very big problem if you are exposing your LLM to live users.
Losing Your Voice, or Replacing An Author With An LLM
As mentioned above, my preference is simply to not use the LLM to generate original content. However, if you are using the LLM to summarise your own content, possibly summarising an article, or a set of articles by the same author, or a whole topic, or search results, etc, you do stray into the question of ‘voice’.
Bad News
This can matter where the voice of each author is distinct and part of their brand.
Good News
You can tune the prompt to summarise in specific ways, in effect giving the summaries their own voice, distinct from the underlying content being summarised. Make it clear this summary is not by the author.
You can structure the prompts to extract specific taxonomies and other metadata, or to list the main points, rather than generating prose.
You can use the summaries as hidden text to improve searching, giving users more ways of finding relevant content.
You can tune summaries to target certain types of users or use cases, again making it possible to match your content with nuanced user needs, without necessarily exposing the LLM summaries.
So so slow, or 20+ seconds per query
LLM query response times are roughly proportional to the sizes of the prompt and the response.
Bad News
When you use the ChatGPT (and similar) interface, the response starts within a second or so, and appears. one. word. at. a. time. like. a. teletype. This is not an affectation, or a design choice. It is in fact the speed at which the LLM is generating the response.
When you use the LLM to process text, you will be waiting for the entire response before you do anything with it.
For significant queries involving 10,000 tokens or more, the response might take upwards of 20 seconds to complete.
The query itself is compute-intensive, so it is not easy to run many queries in parallel.
This is another fundamental reason not to tie LLM calls to live user queries.
Good News
Many groups are striving to improve the efficiency of LLMs, and Moore’s Law (wikipedia) still seems to be holding up.
LLMs will inevitably get faster.
Meanwhile, if you use them in batch mode (ie, decoupled from live user queries) you can keep on trucking with the maximum throughput 24x7, whether hosted externally or locally.
You can also investigate refactoring your use of LLMs into smaller faster MLMs and SLMs.
Oracle of Delphi, or Treating Your LLM Like It Actually Knows Everything
This is the ‘glamorous’ use of LLMs. Striving for the powers of the ancient Oracle of Delphi, aka Pythia (wikipedia), you open up a portal for any and all users to approach and ask their questions. All your company’s information, knowledge, and wisdom accessible, just a question away.
It is certainly attention grabbing. ChatGPT for example can answer ‘any’ question, and maintain a conversation, given various guard rails to keep it wholesome. As a toy, as the latest and impressive descendant of Eliza (wikipedia), it is entertaining and eerie.
It is possible to embellish an LLM with your own content, so it can answer questions about that content.
Bad News
However, all is not as it seems.
The use of LLMs in this way exposes you to almost all of the problems and weaknesses and risks with LLMs mentioned in this doc, notably Confabulations and Jailbreaks.
LLMs are simply not reliable as ‘fact’ machines.
Are you happy for an error-prone, easily manipulated fantasist to represent your company to your users?
Good News
There is no need to fret about this in the short term, since it is not actually where the main benefits of LLMs lie. They are far more useful for parsing, summarising, and analysing texts.
There are many approaches where you can use LLMs to pre-process your content, user personas, FAQs, etc, and create a rather amazingly powerful, useful, and safe live search capability.
And all the other issues I have missed out
Bad News
There are many.
Good News
LLMs are amazingly useful and usable right now, despite all these caveats.
Use it, or lose out to it.
— — —
FAQ and Terms
LLM
Token
MLM, SLM
Externally-hosted
Query = Prompt + Response
HuggingFace
Taxonomy
Reinforcement Learning from Human Feedback (RLHF)
SLA
Metadata
Context Window