The LLM-ephants In The Room
Some elephants, probably in a room (via Dall-E)


In any discussion of Large Language Models (LLMs), I have felt it necessary to list a variety of recurring caveats, all of which are well known, if somewhat scattered. Some are existentially significant, such as energy consumption, yet they receive nothing like the coverage given to the breathless excitement over the latest developments. They are a very current set of metaphorical “elephants in the room” (wikipedia).

I have been sketching out some thoughts on the potent and frankly amazing text-handling capabilities of LLMs, e.g. OpenAI/(Chat)GPT, Anthropic/Claude, Meta/Llama, and others. We are in a golden age of rapid, effective, and cheap/free prototyping involving rich text content. Previously esoteric and high-end ML capabilities are now available to every company, no matter how small or under-resourced.

These seriously impressive text capabilities are at their most usable ever. If you can open a tab in a web browser, and you are comfortable with a bit of cut’n’pasting, you can start prototyping, exploring, and seeking relevance in and for your existing content. Going a step further, to automate your prototype, requires just some basic programming.

It seems pretty clear that companies not paying attention to this, those not looking for opportunities, will be seriously disadvantaged and outcompeted by those that do. LLMs are as-yet under-utilised tools for parsing and processing texts, extracting metadata, helping with first drafts (for completion by a human), etc. Generation of original content is not the best use of them.

This doc collates some of the caveats around the use of LLMs. I will put my spin on things, using a moderately clickbait-y “bad news / good news” approach, skipping nuance with abandon.

This doc is also an invitation for folk who just might properly understand the underlying issues of LLMs to discuss these issues, possibly even disagreeing with me, and to highlight ones I have assuredly missed. The world of LLMs is moving so fast and so profoundly that the one thing you can predict with any confidence is the immediate irrelevance of this doc, though the caveats have been around since LLMs began. Airing these issues has helped me get my thoughts in order.

The main caveat to all the caveats listed below is that every day there are new developments ameliorating many of the weaknesses and problems with LLMs, as well as adding new capabilities and use cases. There is a race to the top, comprising all the main IT providers and thousands of interest groups and highly-motivated individuals. The AI fire has been lit. But there is no need to wait. Commercially-useful, easily usable, and above all cheap capabilities, in the form of LLMs, are available right now.

A basic example of working with an LLM

The following scenario is a variation of a well-known challenge called ‘Named Entity Recognition’: identifying people and organisations explicitly named in the text. Long before LLMs appeared, there were various services and systems which could do this quite well. However, for the uninitiated user, these (pre-LLM) services were neither straightforward to access nor easy to use, and achieving high quality with them was hard.

With LLMs, pretty much as a side-effect, suddenly it is easy. Furthermore, also as a side-effect, you can do so much more than simply identify named entities. In this example we group them by type, with the types automatically selected by the LLM, and obtain a clear description of their relevance to the article. But we could also have taken the prompt further and created relationship maps of who did/said what to whom, or identified unnamed but implied actors, or elicited events, or highlighted specific topics, e.g. “anything to do with indirect payments”, etc. If you can express it in clear English, it is quite possibly possible. And quick and easy and free to experiment with.

Let’s assume you have access to an LLM in a browser, say OpenAI’s free ChatGPT.

You have some content — perhaps one of your own articles, or an article scraped from the web, or a client’s brief, or a content feed you are paying for but not taking full advantage of, or a whistleblower has given you a data dump to explore. Let’s use the opening section of this wikipedia page about penicillin as our content to be explored.

You want to get an overview of the content, so let’s identify any people or companies mentioned in it, and let the LLM decide how best to group them.

You construct the following prompt and paste it into the ChatGPT input box:

You are a tool for analysing text documents, summarising and extracting useful metadata.

Read this article: <article>
Ancient societies used moulds to treat infections, and in the 
following centuries many people observed the inhibition of bacterial 
growth by moulds. While working at St Mary’s Hospital in London in 
1928, Scottish physician Alexander Fleming was the first to 
experimentally determine that a Penicillium mould secretes an 
antibacterial substance, which he named “penicillin”. The mould was 
found to be a variant of Penicillium notatum (now called Penicillium 
rubens), a contaminant of a bacterial culture in his laboratory. The 
work on penicillin at St Mary’s ended in 1929.

In 1939, a team of scientists at the Sir William Dunn School of 
Pathology at the University of Oxford, led by Howard Florey that 
included Edward Abraham, Ernst Chain, Norman Heatley and Margaret 
Jennings, began researching penicillin. They developed a method for 
cultivating the mould, and extracting, purifying and storing 
penicillin from it. They created an assay for measuring its purity. 
They carried out experiments with animals to determine penicillin’s 
safety and effectiveness before conducting clinical trials and field 
tests. They derived its chemical formula and determined how it 
works. The private sector and the United States Department of 
Agriculture located and produced new strains and developed mass 
production techniques. Penicillin became an important part of the 
Allied war effort in the Second World War, saving the lives of 
thousands of soldiers. Fleming, Florey and Chain shared the 1945 
Nobel Prize in Physiology or Medicine for its discovery and 
development.

</article>

List any entities mentioned in the article, 
with their relevance to the article. 
Group them by type.        

And ChatGPT responds seconds later with

Entities mentioned in the article, grouped by type:

People:

Alexander Fleming — Scottish physician who experimented and determined that Penicillium mould secretes an antibacterial substance, penicillin.

Howard Florey — Led a team of scientists at the University of Oxford researching penicillin.

Edward Abraham — Member of the team at the Sir William Dunn School of Pathology involved in penicillin research.

Ernst Chain — Member of the team at the Sir William Dunn School of Pathology involved in penicillin research.

Norman Heatley — Member of the team at the Sir William Dunn School of Pathology involved in penicillin research.

Margaret Jennings — Member of the team at the Sir William Dunn School of Pathology involved in penicillin research.

Organizations:

St Mary’s Hospital, London — Where Alexander Fleming worked and conducted experiments on penicillin in 1928.

Sir William Dunn School of Pathology, University of Oxford — The institution where a team of scientists led by Howard Florey researched penicillin.

United States Department of Agriculture — Involved in locating and producing new strains of penicillin and developing mass production techniques.

Substances:

Penicillium mould — Source of the antibacterial substance later named “penicillin.”

Terms/Concepts:

Penicillin — Antibacterial substance secreted by Penicillium mould, discovered and developed by Alexander Fleming, Howard Florey, and Ernst Chain.

Allied war effort — Penicillin became an important part of the Allied war effort in the Second World War, saving the lives of thousands of soldiers.

Nobel Prize in Physiology or Medicine — Award shared by Fleming, Florey, and Chain in 1945 for the discovery and development of penicillin.        

Not bad for a first go. You can now start tweaking the prompt, perhaps to resolve the clumsy distinctions of ‘Substances’ and ‘Terms/Concepts’ by specifying your own classification scheme in the prompt, e.g. “group by people, organisations, and all other concepts”.

Some LLM-ephants

Confabulations, Hallucinations, or Just Making Stuff Up

Bad News

This weakness is integral to how LLMs are constructed and trained.

The ‘G’ in GPT stands for Generative. Based on trillions of word combinations seen in billions of documents during training (the precise quantities may vary), an LLM looks at the prompt+response word sequence so far and generates, based on statistics, the word most likely to be next in the sequence (skating over a teensy bit of detail here). There is no understanding of fact or truth.
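
To make that word-prediction loop concrete, here is a deliberately toy sketch in Python. It uses a tiny bigram frequency table instead of a neural network, and whole words instead of tokens, but the core loop (look at the sequence so far, emit the statistically most likely continuation, repeat) is the same idea. Nothing in it checks whether the output is true.

from collections import Counter, defaultdict

# A deliberately tiny "training corpus". Real LLMs train on billions of documents.
corpus = ("the mould secretes a substance . the substance kills bacteria . "
          "the mould is a contaminant").split()

# Count which word follows which: a bigram model, a vastly simplified stand-in
# for the statistics an LLM learns during training.
follows = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word][next_word] += 1

def generate(prompt_word, length=6):
    """Repeatedly emit the statistically most likely next word. No notion of truth."""
    words = [prompt_word]
    for _ in range(length):
        candidates = follows.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

print(generate("the"))  # e.g. "the mould secretes a substance . the"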

Multiple outraged author folk on X/Twitter and Reddit have highlighted cases where ChatGPT has constructed realistic-looking but fake references and citations to content they have never written. Much hilarity ensues.

Interestingly, whilst LLMs can and will make up stuff in this way, pure fiction, they can also leak verbatim chunks of their training data when they are really, really not meant to. In some situations, this can be worse.

Reinforcement Learning from Human Feedback (RLHF) is a phase of training that uses humans to reward output that ‘looks’ right. In a very real and insidious sense, LLMs are trained to fool humans.

Good News

While you cannot fully prevent confabulations, you can fairly easily express the prompt in such a way that you give the LLM very little wiggle room to confabulate. By front-loading the prompt with the most relevant context, and using a variety of idioms, you can increase the likelihood that the LLM’s focus stays on the prompt, keeping its responses (mostly) relevant and useful.

See the penicillin example above. The prompt has constrained the LLM tightly to the specific task, and to the specific text to be analysed for entities.

You can have additional, automated fact-checking steps, to increase your confidence in the output. Revisit and rephrase the prompts if the LLM starts straying.

Moreover, you can turn the LLM on itself, and get it to mark its own homework, as it were. Present the previous prompt and the response as part of a new prompt and require it to identify any confabulation issues.

Obviously, the LLM might then confabulate evidence of confabulations, but the likelihood of such errors becomes very very low.
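
A minimal sketch of that ‘mark its own homework’ pattern, assuming the OpenAI Python SDK and an API key in the environment; the model name and the exact wording of the checking prompt are illustrative choices, not recommendations:

from openai import OpenAI  # assumes the official OpenAI Python SDK; other providers have equivalents

client = OpenAI()
MODEL = "gpt-4o-mini"  # illustrative model name; use whichever model you are prototyping with

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

article = "...the penicillin text, or your own content..."
task = (f"Read this article: <article>{article}</article>\n"
        "List any entities mentioned in the article, with their relevance. Group them by type.")

first_answer = ask(task)

# Second pass: show the model its own homework and ask it to flag anything unsupported.
check = ask(
    "Here is a task and a response to it.\n"
    f"Task: {task}\n"
    f"Response: {first_answer}\n"
    "List any entities or claims in the response that are not supported by the article. "
    "If everything is supported, reply with exactly: OK"
)
print(check)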

(Extremely) High Energy Consumption

This is an interesting, and properly worrying, topic. Alas, there is no convenient unit for comparing energy consumption. The measurements are highly nuanced, e.g. depending on whether you factor in the renewable-ness/locality of the energy source. Obviously, different queries will consume differing amounts of energy.

Bad News

Serious amounts of energy are in fact being consumed, and the consumption is increasing rapidly.

Here are some unhelpful, incompatible numbers.

Google searches

  • “the energy required to power a Google search could power a low energy (10 watt) light bulb for 108 seconds.” (fullfact.org)

Video streaming

  • “the average carbon footprint of a half-hour Netflix show is equivalent to driving around 100 metres in a conventional car.” (www.iea.org)

and of course, Bitcoin!

  • “[One Bitcoin transaction is] Equivalent to the carbon footprint of 810,785 VISA transactions or 60,970 hours of watching Youtube.” (digiconomist.net)

and now we have LLMs

  • “A single LLM interaction may consume as much power as leaving a low-brightness LED lightbulb on for one hour.” (spectrum.ieee.org)
  • “we estimated GPT-4’s training electricity consumption to be between 51,772,500 and 62,318,750 KWh.” (towardsdatascience.com)

Energy consumption is clearly a hot topic. The incompatible units for the various consumers are frustrating. Perhaps this could be the topic of another essay.

Bitcoin is a beast, far outstripping mere search and AI. In no uncertain terms, even a single transaction consumes enormous amounts of energy. Multiply that by the overall amount of Bitcoin traffic, and Bitcoin consumes as much energy as a small/medium country.

Per search query, things might not seem so bad. But again, multiply that out by the accelerating use (e.g. adding LLMs to every Google and Bing query) and it is significant.

The energy cost of training an LLM eclipses that of Bitcoin transactions, but training happens far less often. Still, …

This is rarely mentioned in the press.

Good News

Issues around LLM energy consumption are (at least somewhat) under scrutiny.

The energy costs of training and using LLMs are increasingly included in assessments of LLMs.

Researchers are actively striving for ways of reducing the energy footprints of LLMs.

Improvements in technology are leading to more powerful, and more energy efficient processors.

You can refactor the way you use LLMs, reducing overall call volumes, incidentally reducing response times.

As with the productionizing entry below, you can refactor your use of LLMs into specific (smaller, cheaper) MLMs and SLMs. There is a fairly direct correlation between lower costs and lower energy consumption.

Expensive to Productionize?

Having prototyped very cheaply, if not for free, you will be faced with the realities of automating your prototypes.

  • You will need API-access to the LLM (they all have this)
  • How many LLM calls will you need?
  • Are you making LLM calls per user query? My suggestion is to not consider doing this.
  • Do you have a lot of content to process?

Bad News

The charging models for commercial, externally-hosted LLMs are usually based on query size. A typical charge might be $10 per million tokens, where a token is roughly ¾ of a word in the prompt + response. For a large query, where you might be specifying a large, complicated context in the prompt and requiring a chunky response, you could be reaching 10,000 tokens, which brings the per-query cost to approximately 10 cents.

For prototyping, this is basically free. But if you want to scale up to thousands of live user queries (don’t), or many thousands or millions of pieces of content, the costs escalate significantly.

Another inevitable gotcha is that your prompts will get bigger as your LLM ambitions increase, so the cost per query will increase too.

Good News

You can easily switch between LLM providers, taking your hand-constructed prompts with you, seeking the cheapest, most suitable LLM. It is likely your prompts will work nearly as well on a different LLM, and can easily be edited to do so.

Most LLMs have a free-to-use mode, with a basic, rate-limited SLA. For example, at one stage Anthropic/Claude’s licensing on their free product allowed one concurrent request across an enterprise. While perfect for prototyping, this may even be sufficient for constrained production loads.

If you do need to scale up, the charging model makes it easy to accurately calculate expected costs, and you can make the decision as to whether it is worthwhile.
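
As a minimal sketch of such a forecast (the prices and volumes below are purely illustrative; plug in your provider’s current rate card and your own measured query sizes):

# Back-of-the-envelope cost forecast. All numbers are illustrative.
price_per_million_tokens = 10.00  # dollars, prompt + response combined
tokens_per_query = 10_000         # a chunky prompt plus a chunky response

cost_per_query = tokens_per_query / 1_000_000 * price_per_million_tokens
print(f"Cost per query: ${cost_per_query:.2f}")  # $0.10

documents_to_process = 250_000    # e.g. a back-catalogue of articles, one query each
print(f"Back-catalogue run: ${cost_per_query * documents_to_process:,.0f}")  # $25,000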

If you still want to scale up, and the external-hosted costs are too onerous, you can contemplate self-hosting. There are many downloadable LLMs available from the likes of HuggingFace. This becomes an engineering exercise, albeit a non-trivial one. Again, you can forecast costs accurately, which will now be based on availability rather than call sizes.
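
As a minimal sketch of what ‘downloadable’ means in practice, assuming the Hugging Face transformers library; the model name below is purely illustrative, and a real deployment also has to worry about GPUs, quantisation, licences, and serving infrastructure:

# Minimal self-hosting sketch using the Hugging Face transformers library.
# The first run downloads the model weights locally; after that, no external calls.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # illustrative; pick a model that fits your hardware

result = generator("Penicillin was discovered by", max_new_tokens=30)
print(result[0]["generated_text"])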

If you still want to scale up, and the costs of hosting your own full-size LLM are too high, you can revisit your prototypes and investigate how to refactor them to use smaller, cheaper, MLMs or even SLMs. Most of the individual capabilities that can be handled by the Jack-of-all-trades LLMs can in fact be done better (faster, cheaper) by smaller, more focussed models. This refactoring is very doable, albeit non-trivial. You will need to invest much more time and effort in the ML-engineering aspect of the prototyping stage.

Copyrighting the Input and the Output

This is a highly nuanced, contentious topic, hedged with ifs and buts, where expertise is necessary. (bloomberglaw.com)

Bad News

It seems clear there were copyright violations galore during the training of the big LLMs, where significant amounts of copyrighted text were slurped up and used without permission.

There are debates (and doubts) about whether the raw output of an LLM can be copyrighted by the user, or even the provider.

Good News

My preference is to consider the LLM as a tool for parsing and processing texts, extracting metadata, helping with first drafts (for completion by a human), etc, and not for generation of original content, and especially not for the generation of content ‘in the style of’ a copyrighted author. Any ‘style’ instructions should be defined in the prompt, to enact your (personal or org’s) style choices.

If the LLM is used for its ability to parse natural language, and to (kinda sorta) follow natural language instructions for processing the texts, you are using its general (and potent) capabilities gleaned from the totality of its training, and not reliant on training content stolen from any specific authors. This ‘nice’ use of LLMs should survive to later generations of LLM which are only trained on ‘clean’ data, should that come to pass.

‘They’ Are Stealing Our Content

The ‘they’ in this case being the developers of the original LLMs.

Bad News

In short, yes, they probably did, up until 2021 or so. Anything available online was considered fair game. An exciting assortment of legal proceedings is underway to tackle this.

There are reports nowadays of LLM providers (and seekers of LLM training data) starting to honour the access controls on site content, such as the robots.txt file, newly developed meta tags, or copyright licences (as discussed on searchengineland.com). This relies on the seekers of training content behaving well and following the rules. Being realistic, though, the only way to be sure your content is not appropriated is to not expose it online.

If you use an externally-hosted LLM, your prompt (and obviously, the response) will be known to the provider.

Good News

In external providers’ Ts&Cs there should be some statement of how long and for what purpose the query transactions are retained. Your Legal person/team should be able to assess the safety/privacy of sharing your current content with the external provider via queries.

There is an obvious, straightforward way to be sure your content is not appropriated for training, and extremely sensitive, private content is never seen by external providers: host your own LLM.

You can still prototype using the free external LLMs by preparing some safe, exemplar content, or anonymising the content, or constructing bespoke content, specifically for prototyping stage.
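
A minimal sketch of that kind of anonymisation, using nothing more than regular expressions; proper redaction of names and other sensitive details needs far more than this, so treat it as the prototyping-stage shortcut it is:

import re

def anonymise(text: str) -> str:
    """Crude redaction for prototyping: mask emails and phone-like numbers
    before pasting content into an external LLM. Not a substitute for proper review."""
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"\+?\d[\d \-()]{7,}\d", "[PHONE]", text)
    return text

sample = "Contact Jane Doe on +44 20 7946 0000 or jane.doe@example.com about the invoice."
print(anonymise(sample))
# Contact Jane Doe on [PHONE] or [EMAIL] about the invoice.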

Look, An Eagle!, or Losing Focus

Most of the big LLMs are competing on the size of their context window, i.e. how much text the LLM can ingest via the prompt and then generate a full response, while staying relevant to the prompt.

Bad News

As you increase the size and complexity of your prompts, various suboptimal behaviours crop up. A common sign is the response will just curtail the list it is generating.

Or the format of the output changes.

Or the level of detail suddenly drops.

Or the response simply stops in the middle of the …

Good News

You can easily identify these limits during prototyping, establishing ‘safe’ sizes of content and complexity of prompt. You might find you can safely supply 5k of text and specify 3 separate tasks in the same prompt, and expect 5k of text in the response, for example.

You can do some automated processing on the LLM output to check it has (probably) honoured the full prompt.
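
A minimal sketch of that kind of automated check; the required headings and thresholds here are whatever your own prompt demanded, and a failed check simply means re-issuing the query or flagging it for a human:

# Cheap sanity checks on an LLM response before trusting it downstream.
REQUIRED_HEADINGS = ["People:", "Organizations:"]  # the sections your prompt asked for
MIN_LINES = 10                                     # below this, suspect a curtailed response

def looks_complete(response: str) -> bool:
    lines = [line for line in response.splitlines() if line.strip()]
    if len(lines) < MIN_LINES:
        return False  # suspiciously short: the list may have been curtailed
    if not all(heading in response for heading in REQUIRED_HEADINGS):
        return False  # a requested section is missing
    if lines[-1].rstrip().endswith(("…", "...")):
        return False  # trailing ellipsis: probably stopped mid-thought
    return True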

There are various idioms you can include in the prompt that increase the likelihood of the LLM staying on track during a sequence of tasks in the same prompt. For example (and yes, really):

“Take a deep breath and work on this step by step” (arstechnica)

New idioms are being discovered all the time, as our understanding improves of how LLMs tick.

You can home in on phrasing that your LLM is more ‘comfortable’ with. You can front load the prompt with your required structure (e.g. a list of terms you are specifically looking for) and leave less room for the LLM to wiggle out of it.

This is part psychology, part art form, part telepathy. LLM Whispering. Fun and frustrating. There are no hard and fast rules.

Jailbreaks, Leading Questions, or Making The LLM Say Bad Things

This is where ‘bad actors’ will strive to defeat any protections built into the LLM, to access the underlying data, or to get the LLM to say bad or inappropriate things.

Bad News

People being people, especially online, this is of course inevitable. Microsoft knows this well (theverge: tay-microsoft-chatbot-racist).

Putting in effective safeguards is an enormous task. The big providers are investing much effort and money into this. Leave them to it.

Good News

By far the simplest approach, and certainly my preferred one, is simply not to allow users direct access to your LLM. If your LLM prompts are defined offline for specific offline tasks, e.g. extracting named entities from some text, the bad actors can’t hurt you.

As the user of the LLM, if you are not a bad actor, you won’t be doing bad actor-y things.

This is something of a non-problem for offline use of LLMs, but is a very big problem if you are exposing your LLM to live users.

Losing Your Voice, or Replacing An Author With An LLM

As mentioned above, my preference is simply to not use the LLM to generate original content. However, if you are using the LLM to summarise your own content, possibly summarising an article, or a set of articles by the same author, or a whole topic, or search results, etc, you do stray into the question of ‘voice’.

Bad News

This can matter where the voice of each author is distinct and part of their brand.

Good News

You can tune the prompt to summarise in specific ways, in effect giving the summaries their own voice, distinct from the underlying content being summarised. Make it clear this summary is not by the author.

You can structure the prompts to extract specific taxonomies and other metadata, or to list the main points, rather than generating prose.

You can use the summaries as hidden text to improve searching, giving users more ways of finding relevant content.

You can tune summaries to target certain types of users or use cases, again making it possible to match your content with nuanced user needs, without necessarily exposing the LLM summaries.

So so slow, or 20+ seconds per query

LLM query response times are roughly proportional to the sizes of the prompt and the response.

Bad News

When you use the ChatGPT (and similar) interface, the response starts within a second or so, and appears. one. word. at. a. time. like. a. teletype. This is not an affectation, or a design choice. It is in fact the speed at which the LLM is generating the response.

When you use the LLM to process text, you will be waiting for the entire response before you do anything with it.

For significant queries involving 10,000 tokens or more, the response might take upwards of 20 seconds to complete.

The query itself is compute-intensive, so it is not easy to run many queries in parallel.

This is another fundamental reason not to tie LLM calls to live user queries.

Good News

Many groups are striving to improve the efficiency of LLMs, and Moore’s Law (wikipedia) still seems to be holding up.

LLMs will inevitably get faster.

Meanwhile, if you use them in batch mode (i.e. decoupled from live user queries) you can keep on trucking with maximum throughput 24x7, whether hosted externally or locally.
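
A minimal sketch of that batch pattern, reusing the (illustrative) OpenAI SDK call from earlier; the backlog is worked through at whatever pace the provider allows, results are written somewhere your live systems can read, and no user is ever waiting on an individual call:

import json, time
from openai import OpenAI  # or swap in a locally hosted model

client = OpenAI()

def ask_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

documents = ["article_001.txt", "article_002.txt"]  # your backlog of content

with open("llm_results.jsonl", "a") as out:
    for path in documents:
        text = open(path, encoding="utf-8").read()
        prompt = (f"Read this article: <article>{text}</article>\n"
                  "List any entities mentioned in the article, grouped by type.")
        out.write(json.dumps({"source": path, "entities": ask_llm(prompt)}) + "\n")
        time.sleep(1)  # simple pacing for rate limits; no live user is waiting on this loop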

You can also investigate refactoring your use of LLMs into smaller faster MLMs and SLMs.

Oracle of Delphi, or Treating Your LLM Like It Actually Knows Everything

This is the ‘glamorous’ use of LLMs. Striving for the powers of the ancient Oracle of Delphi, aka Pythia (wikipedia), you open up a portal for any and all users to approach and ask their questions. All your company’s information, knowledge, and wisdom accessible, just a question away.

It is certainly attention grabbing. ChatGPT for example can answer ‘any’ question, and maintain a conversation, given various guard rails to keep it wholesome. As a toy, as the latest and impressive descendant of Eliza (wikipedia), it is entertaining and eerie.

It is possible to embellish an LLM with your own content, so it can answer questions about that content.

Bad News

However, all is not as it seems.

The use of LLMs in this way exposes you to almost all of the problems and weaknesses and risks with LLMs mentioned in this doc, notably Confabulations and Jailbreaks.

LLMs are simply not reliable as ‘fact’ machines.

Are you happy for an error-prone, easily manipulated fantasist to represent your company to your users?

Good News

There is no need to fret about this in the short term, since it is not actually where the main benefits of LLMs lie. They are far more useful for parsing, summarising, and analysing texts.

There are many approaches where you can use LLMs to pre-process your content, user personas, FAQs, etc, and create a rather amazingly powerful, useful, and safe, live search capability.
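
A minimal sketch of that pre-processing pattern: offline, an LLM enriches each document once (see the batch sketch above); online, live queries are answered by perfectly ordinary keyword search over the enriched records, so users never touch the LLM at all. The summarise_with_llm function is a hypothetical stand-in for such an offline call.

# Offline step: enrich each document once with an LLM-generated summary and entities.
def summarise_with_llm(text: str) -> dict:
    """Hypothetical stand-in for an offline batch LLM call."""
    return {"summary": "Discovery and development of penicillin.",
            "entities": ["Alexander Fleming", "Howard Florey"]}

catalogue = []
for doc in [{"id": "pen-1", "text": "Alexander Fleming ... penicillin ..."}]:
    catalogue.append({**doc, **summarise_with_llm(doc["text"])})

# Online step: a plain, safe search over the enriched fields. No live LLM call.
def search(query: str) -> list:
    q = query.lower()
    return [d for d in catalogue
            if q in d["text"].lower()
            or q in d["summary"].lower()
            or any(q in e.lower() for e in d["entities"])]

print(search("fleming"))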

And all the other issues I have missed out

Bad News

There are many.

  • Artificial General Intelligence (AGI), or Are We All Doomed?
  • Widespread Loss of Roles, or Turmoil in the Job Market
  • Is This The End Of The Intern?
  • Our Tech Bro Future
  • Just Because You Can…
  • LLMs For Evil

Good News

LLMs are amazingly useful and usable right now, despite all these caveats.

Use it, or lose out to it.

— — —

FAQ and Terms

LLM

  • Large Language Model — a mind-bogglingly huge neural network.

Token

  • The unit of size of LLM queries, which is the sum of prompt and response. 1 token is roughly ¾ of a word.

MLM, SLM

  • Medium and Small Language Models. Significantly smaller (and faster and cheaper) than general-purpose LLMs, and sometimes better for specific tasks.

Externally-hosted

  • Where the provider offers access to an LLM that they host, running on their hardware systems, with full visibility of and control over what you do with the LLMs.

Query = Prompt + Response

  • An interaction with an LLM. See the example above.

HuggingFace

  • The go-to site for the LLM (and MLM, SLM, et al) community, where models are assessed and compared, and available to be hosted externally, or downloaded locally. huggingface.co

Taxonomy

  • A slice of structure applied to the world. If you are interested in tree frogs, you could define a taxonomy of the different kinds of tree frogs, and specify in the prompt that the LLM classify any mentions of tree frogs in the content according to your taxonomy.

Reinforcement Learning from Human Feedback (RLHF)

  • A phase of training for LLMs that uses humans to reward output that ‘looks’ right.

SLA

  • Service Level Agreement — what the provider commits to provide, and requires of the user.

Metadata

  • All the juicy bits in the text, including named people and organisations and places, relationships between them, dates, implications, etc.

Context Window

  • How much text the LLM can ingest via the prompt and then generate a full response, while staying relevant to the prompt.

An elephant, definitely in a room (via Dall-E)

