AI, AI, Oh

“Somewhere nearby is Colossal Cave, where others have found fortunes in treasure and gold, though it is rumoured that some who enter are never seen again”

Crowther and Woods, Colossal Cave Adventure (1976-1977), Digital Equipment Corporation PDP-10


‘Artificial Intelligence’ is an odd sort of phrase. It was coined in the 1950s, and it’s since been applied to a diverse range of information technologies that do a diverse range of things, none of which involve actual intelligence.

At any one time though there tends to be a fashionable research paradigm. In the 1980s, for example, it was all about ‘expert systems’. Neural networks were scarcely heard of, although they were decades old as a concept. By the noughties we’d scarcely hear of anything else.

In 2017 a bunch of researchers, based mainly at Google, quietly published a paper called ‘Attention is All You Need’. It described a mechanism they called a ‘Transformer’. Since then practically all AI development has swung towards this mechanism. Last year OpenAI foisted it onto an unsuspecting world through ChatGPT (GPT stands for ‘Generative Pre-trained Transformer’) and triggered a cascade of developments based on the same underlying technology.

We’ve not heard the last of them by any stretch of the imagination, but in a sense the better-known, fairly general ‘frontier’ models, such as Microsoft/OpenAI’s GPT-4 and Google’s Bard, are a bit of an evolutionary cul-de-sac. The sheer computing grunt needed to train one of these monsters could power a medium-sized country for months. Relatively few organisations can deploy this level of ‘compute’, and so many of the current research trajectories are about reducing the colossal resource demand rather than necessarily building much more powerful platforms. Some of this is done with clever little tweaks – reducing the number of decimal places used in calculations, for example, turned out to cut the cost substantially without noticeably degrading performance. But much of it is about developing smaller models that specialise in particular areas of knowledge, where they can outperform the general-purpose transformer networks.

And in this febrile maelstrom of innovation, we here at The Strategy Exchange have not been idle. Oh no, by crikey. Not by a long chalk.


“BISTRANIO:

Great Nurself hewiful,

Rich is nest sword; I say, then we dream out

That provided in their action”

Theatrical Works Augmentation Tool - Generative Pre-trained Transformer, The Strategy Exchange (2023)


Shakespearean plays represent perhaps the pinnacle of the English literary tradition. It may shock you to hear though that there haven’t been any new ones for over four hundred years. This is clearly a problem – one which our Theatrical Works Augmentation Tool sets out to tackle head on.

In fairness, this isn’t a completely original idea[1]. However, we’ve also trained our TwatGPT platform to write poetry in Old Yorkshire dialect. And we suspect that’s not something that Silicon Valley has cottoned onto yet.

For the avoidance of doubt, this isn’t altogether serious, but it is real. In this article we use TwatGPT to give the curious but non-technical professional a bit of an insight into how a transformer actually works. The idea is to help the reader navigate some of the dramatic and polarised reportage that surrounds this technology, collectively known as Large Language Models (‘LLMs’), by providing some intuition as to what the mechanism does. We then touch on some of the implications of all this from, hopefully, a reasonably informed perspective. If being down among the weeds gets a bit much, feel free to jump to the conclusions.


“Sleep.

Aw'm but keep it i' sorrowful knee,

Withaat noa mortal can;

An if pray when aw connot do be stopt,

It prevent its a chame.

Awm sewer aw think aw wor fooil, when aw sed,

An leeave thi form this pain;

Aw cannot ne'er may be wrang at exalt,

An sich pleasure untroubles an bunce.”

TwatGPT-Tyke, TSE


LLMs need material to learn from. ChatGPT learned from a good chunk of the entire internet. TwatGPT learned from the plays of William Shakespeare and the poetry of John Hartley.

The first job is to divide that material into ‘tokens’ that we can work with. TwatGPT does this the easy way – we simply use characters. Upper and lower case letters, spaces and newlines, and some punctuation symbols, make for a few dozen tokens. The tokenizers developed by Google and OpenAI use ‘sub-word’ encodings that typically break words up into chunks of characters, which might result in tens of thousands of possible tokens.
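
For the terminally curious, here’s roughly what that looks like in Python. This is a sketch in the spirit of TwatGPT rather than its exact code, and the filename is invented:

    # a minimal character-level tokenizer
    text = open('shakespeare_and_hartley.txt').read()   # hypothetical training file
    vocab = sorted(set(text))                           # the few dozen distinct characters
    stoi = {ch: i for i, ch in enumerate(vocab)}        # character -> number
    itos = {i: ch for i, ch in enumerate(vocab)}        # number -> character

    def encode(s):
        return [stoi[c] for c in s]

    def decode(nums):
        return ''.join(itos[n] for n in nums)

    print(decode(encode("Alas poor")))                  # round-trips back to "Alas poor"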

Once we’ve got our tokenized training data, the GPT can learn from it. This is done by extensively repeated application of a kind of feedback loop, discussed below. For each of its two incarnations (the Shakespeare version and the Yorkshire dialect version), TwatGPT ran through this training cycle around 5,000 times, taking a little under an hour and a half, on a megabyte or so of text (roughly a million characters), using the graphics processor on a desktop PC. By way of comparison, the Microsoft/Meta Llama 2 model, which trained on around 10 terabytes of text (roughly 10 trillion characters), took 12 days to train, using 6,000 graphics processors.

If you’re wondering what graphics has to do with all this, the answer’s nothing. It just happens that the graphics chips that power video games - ‘graphics processing units’, or ‘GPUs’ - are extremely efficient at doing a lot of simple but similar calculations in parallel, instead of one after the other. By a stroke of good fortune LLMs do virtually nothing else but massive numbers of simple but similar calculations. GPU farms can run these simultaneously and save colossal amounts of training time. Large Language Models wouldn’t be viable at all if every calculation had to be done sequentially, even at the scale of TwatGPT. (Most graphics chips are made by Nvidia, which has consequently transformed itself from a niche supplier in the video gaming industry into a world leading AI powerhouse, whose share price has risen fifteen fold over the last five years. But I digress.)

What an LLM does is take a block of text and, based on little more than statistical numerology, predict the next token. It can then do the same thing again with the newly extended text, including the new token. In principle it can babble on indefinitely in this way, without repeating itself.
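
In code, that babbling loop is surprisingly small. Here’s a sketch using PyTorch, assuming a hypothetical trained model that returns a score for every token in the vocabulary:

    import torch

    def generate(model, context, max_new_tokens=500):
        # context: a tensor of token numbers, shape (1, length so far)
        for _ in range(max_new_tokens):
            scores = model(context)                        # one score per vocabulary token, at each position
            probs = torch.softmax(scores[:, -1, :], dim=-1)  # turn the final position's scores into probabilities
            next_token = torch.multinomial(probs, num_samples=1)  # pick the next token according to those probabilities
            context = torch.cat([context, next_token], dim=1)     # tack it on and go round again
            # (a real model would also crop the context back to the block size it was trained with)
        return context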


“Provost:

Marry, brother Petar; you, masters, how say thou accord'st

Too rate, I'll fear the truth.

For, liberty must my woman, give al: our best

To keep what I with ut up the swift on my friends,

A blot fick'd my head, at longing and dear.

Provot-friages, ere it ropkiss your confer:

But let me owe on me, get them, but, not angry live,

For whom we had great late not it fifteen,

And soldier'd but thine. We will take

Keep you, let along in minument this humount no?

A duke have the way, beheld man! When my fortune,

From he galls rule aw, you wrong'd home the straight.”

TwatGPT, TSE


The first step in each training cycle is to select a bunch of example text sequences drawn from the training data. Each batch comprises a randomly selected chunk of contiguous characters. Thanks to the massive parallelism mentioned above we can train on a number of batches at the same time (TwatGPT picks a few dozen), but since each batch is essentially an independent training run that just happens to execute alongside the others, we’ll ignore this point from here on in and just talk about individual batches.

The selected sequence of text is sliced into tokens, which are encoded as numbers (‘space’ = 1, ‘!’ = 2, and so on). This batch of numbers is the input to the model. TwatGPT uses batches 256 characters long.

In order to train any kind of neural network we need to give it a ‘target’ - a good answer - for it to aim for. In the case of our LLM, we want to give it some text and have it predict the next token. That means that for any given run of characters in the training data, a good answer would be the actual next character in the training text. (We'll come back to this point.)

There’s a bit more to this than meets the eye, so let’s look at a simple example using an artificially short batch, of eight characters. We randomly pick a starting character somewhere in the training data, and read it together with the next seven characters. Let’s say this sequence is “Alas poo”. Our target, for training purposes, is the next character along, which is “r”. So if we have a letter “o”, “r” would be a fair prediction for the next letter. But if we have “oo”, “r” would also be a good, arguably better prediction. Ditto for “poo” and so on right up to “Alas poo”, at which point “r” would be an excellent prediction. So a batch of eight characters isn't just one training example, it's rich with information.
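
Here’s that slicing expressed as a sketch, reusing the encode and decode helpers from the tokenizer sketch above:

    import torch

    block_size = 8                                  # artificially short, as in the example
    data = torch.tensor(encode(text))               # the whole corpus as token numbers

    i = torch.randint(len(data) - block_size, (1,)).item()   # a random starting point
    x = data[i : i + block_size]                    # e.g. "Alas poo"
    y = data[i + 1 : i + block_size + 1]            # the same text shifted one along: "las poor"

    # every prefix of x is a training example whose target is the corresponding character in y
    for t in range(block_size):
        context = x[:t + 1]
        target = y[t]
        print(decode(context.tolist()), '-->', decode([target.item()]))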

And in order to do a good job, the LLM therefore has to learn not just from the existence of tokens, but the contexts - the relationships with preceding tokens - in which they occur.


“Tha'll run an prizzled fowk on a ceam

San in a distress aw wish;

Jenny some me aw'm flaards have roll'd

Or kitten his laad,

For his callion hands, to-day, then his maath,

Mak smile o' dece;

But monny an his mi heart wi' west,

This heart still for ivver th' door.”

TwatGPT-Tyke, TSE


Machine learning systems, including transformers, are made up of a mass of numbers, which in the first instance are random, and which are notionally arranged into layers. These numbers are collectively known as ‘parameters’. TwatGPT has around 11 million parameters. By comparison GPT-4 is reported to have around 1.76 trillion, up from 175 billion in GPT-3 and 1.5 billion in GPT-2.

An earlier article, Tron for Suits, seeks to explain how the basic neural network mechanism works. A transformer isn’t by any stretch a basic neural network, but it is a neural network and the mechanism is fundamentally the same. Firstly we move forward through the network – a ‘forward pass’ – and then we sweep backwards through it – ‘back propagation’ - nudging the parameters with a view to improving the result next time. Then we repeat the whole process. A lot.

The forward pass is about collections of numbers mating with each other. They multiply and they add. There’s a bit more to it than that, but really not much.


If you’re interested in the detail, here’s how it goes. Stay focused.

The complete set of possible tokens is effectively the ‘vocabulary’ of the LLM. Each token has a list of numbers associated with it, called ‘embeddings’, whose job it is to encode what the network learns about that token. TwatGPT uses a list of nearly 400 numbers for each of 65 possible tokens. Initially they’re just random, but that’ll change as the LLM learns: as with all the parameters in the LLM, they're tuneable.

There’s also another set of these embeddings which this time encodes information about the position in the training batch, or context. If there are 256 tokens in the context, each position from 0 to 255 has its own list of numbers. Again these are initially random.

For each token in the batch a new list of embeddings is generated, simply by adding its own list, number by number, to the list associated with its position in the context. So if “a” is the third token in the training batch, the embeddings for “a” are literally added to the embeddings for position three to give the overall embeddings for this token at this position.
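
A sketch of those two embedding tables, using sizes roughly like TwatGPT’s (65 tokens, 256 positions, lists of 384 numbers):

    import torch
    import torch.nn as nn

    vocab_size, block_size, n_embd = 65, 256, 384

    token_embeddings = nn.Embedding(vocab_size, n_embd)      # a tuneable list of 384 numbers per token
    position_embeddings = nn.Embedding(block_size, n_embd)   # and another per position, 0 to 255

    idx = torch.randint(vocab_size, (1, block_size))          # in training this would be a real chunk of encoded text
    tok = token_embeddings(idx)                                # (1, 256, 384): each token's own list
    pos = position_embeddings(torch.arange(block_size))        # (256, 384): each position's list
    x = tok + pos                                              # literally added together, as described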

At which point we come to the heart of the forward pass - the ‘attention’ mechanism. For each token, its embeddings are used to generate (initially randomly) a list of numbers called a ‘key’, another list called a ‘query’ and a final list called a ‘value’. Each list is generated by passing the embeddings through a neural network layer. How this works exactly is described in more detail in Tron for Suits, but it’s basically just the ‘matrix multiplication’ from your school days: in each case the embeddings mate with a set of other parameters (by multiplication and addition) to generate the new lists.

The queries for every token then go off and mate with the keys for every other token to generate another new list for each token representing its ‘affinities’ with all the others. Finally the affinities for each token mate with the values of each token and the results are used to update (by addition) the original embeddings. It’s one almighty, arithmetic cross-jostle.
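
For the brave, here’s what a single ‘attention head’ looks like as a sketch, continuing from the embeddings x above. Real models run several of these heads side by side, and the head size of 64 here is just an assumption:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    head_size = 64
    key   = nn.Linear(n_embd, head_size, bias=False)     # embeddings -> keys
    query = nn.Linear(n_embd, head_size, bias=False)     # embeddings -> queries
    value = nn.Linear(n_embd, head_size, bias=False)     # embeddings -> values

    k, q, v = key(x), query(x), value(x)                 # x is the (1, 256, 384) embeddings from above

    # queries mate with keys to give each token's affinity with the others
    affinities = q @ k.transpose(-2, -1) / head_size ** 0.5          # (1, 256, 256)

    # a 'causal mask' stops each token peeking at the ones that come after it
    mask = torch.tril(torch.ones(block_size, block_size))
    affinities = affinities.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(affinities, dim=-1)

    out = weights @ v                                     # affinities mate with values: (1, 256, 64)

In the full model the result is then projected back to the embedding size and added onto the original embeddings, which is the ‘update by addition’ mentioned above.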

The last step in the forward pass is to push the final embeddings through another layer to create yet another list of numbers, but this time one number for each possible token in the vocabulary. These scores effectively represent the model’s estimate of the probability of each of those tokens appearing next in the text.
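
As a sketch, that final layer is just one more matrix multiplication, continuing with the names from the sketches above (in the full model x would by now have been thoroughly updated by the attention blocks):

    lm_head = nn.Linear(n_embd, vocab_size)   # final embeddings -> one score per token in the vocabulary
    scores = lm_head(x)                       # (1, 256, 65): a next-token score at every position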

That was the detail. You can come up for air now.


“You are in a maze of twisty little passages - all alike”

Colossal Cave Adventure


So the output of a forward pass is a list of numbers for each character in the input block – one number for each token in the vocabulary of the training data – representing a kind of score that translates into the likelihood of that token coming next. Of course, in the training data itself we know which character actually comes next, so we can judge the output by how much probability it gave to that character; we want the LLM to score the actual next token highly. Comparing the scores with the actual next characters across the whole batch yields a single number that measures how good the predictions were. This number is the ‘training loss’, representing numerically the ‘distance’ between the model’s predictions and the actual data.
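
In code, that comparison is a one-liner. This sketch assumes ‘targets’ is the shifted-by-one sequence of actual next characters, shape (1, 256), alongside the scores from the previous sketch:

    import torch.nn.functional as F

    # scores: (1, 256, 65) from the forward pass; targets: (1, 256) actual next characters
    loss = F.cross_entropy(scores.view(-1, vocab_size), targets.view(-1))
    # one number: small when the actual next character got a high score, large when it didn't.
    # With completely random parameters it starts out around ln(65), roughly 4.2, and training pushes it down.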

Although the forward pass mechanism looks (and is) complex, all we’ve done in the first instance is to multiply and add a huge collection of random numbers. So we’d expect the training loss to be big since it represents a totally random guess.

What we want to do is to reduce the loss as we train the network. The key point here is that the training loss is a traceable result of a huge number of simple calculations: it’s completely, mathematically determined by all those multiplications and additions in the forward pass. So, in principle, we can see what effect each of these operations ultimately had on the loss. That’s where back propagation comes in.


“EMILIA:

In you are; bore comes well and nothing.”

TwatGPT


The real magic of the machine learning cycle comes about in the backward pass. Unfortunately there’s a bit more to this than the simple arithmetic that characterises the forward pass. It involves calculus. Differential calculus. Bear with me.

When you boil it down, all calculus does is tell you what happens to something if you make a small change in something else. In the case of our LLM, we want to know what happens to the loss function if we change the parameters. In particular, we want to minimise it.

Let’s suppose that there’s notionally only one parameter. If we increase that parameter slightly and the loss gets a bit bigger, we say that the ‘gradient’ of the loss function with respect to that parameter is positive (we’re walking uphill). If the loss is getting bigger in this way, then increasing the parameter is the last thing we should be doing - it’s just making the loss worse. In this case we need to reduce the parameter a bit and try again (we need to turn around and set off downhill).

On the other hand, if the loss gets smaller as we increase the parameter, we say that the gradient is negative. Since we want to minimise the loss, this is a good thing and we should go ahead and increase the parameter a bit. We can keep doing this until the loss starts to increase again, in which case we’ve probably just stepped over the minimum loss (like reaching the floor of a valley, which is where we want to be). The process is consequently known as ‘gradient descent’.

Calculus gives us a mathematical statement of what the gradient is and how the loss varies with changes to our single parameter. Of course, if we only have one parameter, we don’t really need this mathematics at all: we can just keep tweaking the parameter up and down until we find the minimum loss.
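
To make the one-parameter picture concrete, here’s a toy in plain Python, with an invented loss function that has nothing to do with the real LLM loss (its minimum happens to be at p = 3):

    def toy_loss(p):
        return (p - 3) ** 2              # an invented one-parameter 'loss', lowest at p = 3

    p, step_size, nudge = 10.0, 0.1, 1e-6
    for _ in range(100):
        gradient = (toy_loss(p + nudge) - toy_loss(p)) / nudge   # uphill (positive) or downhill (negative)?
        p = p - step_size * gradient                             # walk a little way downhill
    print(p)                                                     # ends up very close to 3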

The complication we have is twofold. Firstly, we have (a lot) more than one parameter, and secondly, not all of the parameters affect the loss directly. Many more of them affect other groups of parameters as we pass forward, only indirectly affecting the loss. Fortunately calculus allows us to do two things: isolate the effect of changes in individual parameters despite there being gazillions of them, and ‘chain’ effects together. The latter point means that by working out the effect of changing a ‘later’ parameter on the loss, and the effect of changing an earlier parameter on the later parameter, we can also work out the indirect effect of the earlier parameter on the loss. And because the successive computations that led to the loss are all simple mathematics, these gradients can be calculated easily.

The combined effect of all this is that we can tweak all the parameters during back propagation in such a way as to be likely to lower the loss (if it’s not already minimised). And whilst calculus might be a bit more complicated than multiplying and adding, it’s still just mechanical number wrangling.
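
In practice nobody works these gradients out by hand: libraries such as PyTorch record every multiplication and addition in the forward pass and run the back propagation automatically. One training cycle then looks something like this sketch (the model object and get_batch function are assumed here, not real TwatGPT code):

    import torch

    optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4)   # a common optimiser and learning-rate choice

    for cycle in range(5000):                     # TwatGPT ran around 5,000 of these cycles
        xb, yb = get_batch()                      # a fresh batch of inputs and their next-character targets
        scores, loss = model(xb, yb)              # forward pass: multiply, add, and measure the loss
        optimiser.zero_grad(set_to_none=True)     # clear out the previous gradients
        loss.backward()                           # back propagation: work out every parameter's gradient
        optimiser.step()                          # nudge every parameter a little way downhill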


“You are in an awkward sloping east/west canyon”

Colossal Cave Adventure


Complex as all this no doubt sounds, there are some bits of mathematical massaging I’ve not described. From time to time, for example, the lists of numbers are ‘normalised’ to make their distribution better behaved. There are also ‘non-linear’ functions applied along the way, including a final one (the ‘softmax’) that converts the raw scores into probabilities and sharpens the distinction between high and low values. The general thrust of the explanation above isn’t affected by any of this.

There’s also a nuance in minimising the loss function through gradient descent: we don’t want the model simply to memorise and replicate the original text. In the vernacular this is called ‘overfitting’, and there are ways of avoiding it (one common check is sketched below).
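
That check is simply to hold back a slice of the text that the model never trains on, and keep an eye on the loss there. A sketch, where ‘data’ is the encoded corpus from the earlier batch-sampling sketch:

    n = int(0.9 * len(data))                      # keep the last 10% of the text aside
    train_data, val_data = data[:n], data[n:]

    # if the training loss keeps falling while the loss measured on val_data starts rising,
    # the model is memorising the training text rather than generalising - overfitting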

Finally, the transformer described here is the simplified functional core of an LLM. In practice there are augmentations, such as a (human-driven) process of ‘fine-tuning’ and the model’s use of external tools (calculators, browsers, search engines, and so on).


“You are in a maze of twisty little passages - all alike”

Colossal Cave Adventure


Maybe there is a sense in which we humans also ‘babble’ in a similar fashion when we talk - continuously tacking on the next ‘token’ - I don’t know. But what should be clear from all this, if nothing else, is that there is no intelligence in the LLM, no reasoning, no understanding. Large Language Models have been disparagingly referred to as ‘stochastic parrots’, which is needlessly offensive to parrots, but otherwise is not an unfair criticism.

However, as Edsger Dijkstra once said, “the question of whether a computer can think is no more interesting than the question of whether a submarine can swim.”

What is remarkable about this generation of AI is how extraordinarily well it often performs despite being ‘just arithmetic’ - statistics played with tokens. And, so far at least, the performance of these models has scaled nicely alongside the fast-growing number of parameters, with as yet no sign of topping out.

It’s no accident though that ‘hallucinate’ is the Cambridge Dictionary word of the year.

An LLM neither knows nor cares whether it’s spouting bullshit. (At one level this is a bit puzzling to me - the fact that the LLM works by projecting a probability distribution for the next token suggests that it ought to be possible for the model to indicate when it’s taking a flying guess, but apparently this is a tough problem to crack.) The models are also extremely vulnerable to being buggered about with by mischief makers - such wheezes as ‘jailbreaks’, ‘prompt injections’, ‘data poisoning’ and ‘backdoor attacks’ - which are beyond the scope of this piece.

Some of the most closely guarded development efforts at the most advanced labs, especially Google DeepMind and Microsoft/OpenAI, seem to be concerned with grafting reasoning and deliberation - ‘System 2’ thinking - onto LLMs. OpenAI's Q* effort in this regard seems to have ruffled a few feathers, and may or may not have had a hand in the recent (at the time of writing) shenanigans there.

In any case, another earlier article, Artificial Intelligence: the Strategic Context, although not specifically covering LLMs, is still largely relevant in terms of the broader issues raised, if you're interested in the strategic perspective.


“There is a loud explosion, and a twenty foot hole appears in the far wall, burying the dwarves in the rubble”

Colossal Cave Adventure


It’s deeply confusing, and I suppose a bit worrying, that the technologists embedded in this world can’t agree on whether LLMs are the most useful and liberating tool since the printing press, or the source of all our doom. Even at OpenAI itself, which was founded with AI safety as a driving ethos, the top two gurus, Ilya Sutskever and Sam Altman, are reported to be poles apart on this.

But, to clarify one point, the science fiction issue that preoccupies some commentators - computers becoming sentient, exponentially self-improving and ‘weakly godlike’ - is not the real concern. At least not yet. The more immediate existential risk is that a tool with no capacity for common sense, and with access to third party APIs, inadvertently wreaks havoc, whether under the control of a bad actor or simply through a cock up (as in Nick Bostrom’s ‘Paperclip Apocalypse’ - see Sources below). Even the printing press caused decades of homicidal religious warfare, after all.


But, to sum up, if you’re experimenting with this technology in your own business, perhaps the best advice for you is Russian: ‘доверяй, но проверяй’ - trust, but verify. Maybe without the trust bit.


Sources

Vaswani et al, Attention is All You Need, 31st Conference on Neural Information Processing Systems (NIPS 2017)

Bostrom, Nick, Superintelligence: Paths, Dangers, Strategies, Oxford University Press (2014)

Bengio et al, A Neural Probabilistic Language Model, Journal of Machine Learning Research (2003)

Karpathy, Andrej, Neural Networks: Zero to Hero, YouTube (2022), and Intro to Large Language Models, YouTube (2023)

6.036, Introduction to Machine Learning, MIT Open Learning (2020)

6.S191, Introduction to Deep Learning, MIT (2023)

6.034, Artificial Intelligence, MIT OpenCourseWare (2005)

PyTorch.org

Sweigart, Al, Automate the Boring Stuff With Python, No Starch Press (2019)



[1] As far as I’m aware the idea of using Shakespeare to train a model is due to Andrej Karpathy - see Sources
