How "conversational AIs"? like ChatGPT work?

How "conversational AIs" like ChatGPT work?

I see a lot of confusion and optimism around the recent advances in conversational AI. It is indeed impressive, and it does have many uses - at the very least, light copywriting, clip art, and maybe even storytelling can be helped by starting from what systems like ChatGPT output. But it's not "intelligent" yet.


In this article, I'll try to describe how "conversational AI" systems work, for a non-programmer audience (but with a few bits of code here and there, for those interested in trying it out). We will reason through and build an extremely simple text generator, demonstrate what a "model" is, what "training the model" means, and how the model influences and limits what can be done with this kind of system. We will do all that with a toy model called a "Markov chain." Those who are interested in the details might start by reading the Wikipedia article on Markov chains, but that's not necessary to understand what's going on in this article, as I'll give a simplified explanation as we go along.

Simple stuff is easy

There's a lot of confusion about what "modelling" and "training a model" mean. For the purpose of this article, a model can be thought of as a set of rules which serve to reach a certain goal, usually with the help of some data which give those rules a useful direction. Producing that data is what we call "training" the model.

Let's start with our simplified Markov chain model.

We start by noticing that languages have a grammar - a set of rules which govern the order of words in a text. As children, we learn it by listening, inferring, and a lot of trial and error, and this process seems to get a huge amount of help from our biological hardware (wetware). Computers, on the other hand, need to be told what to do. Let's suppose that the words which come next are influenced only by the words which came before them. Simplifying even further, let's say that the next word in a sentence depends only on the single word before it.

Consider this quote:

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness.

We consider each pair of words:

  • it was
  • was the
  • the best
  • best of
  • of times
  • it was
  • was the
  • the worst
  • worst of
  • of times
  • ...

We immediately notice that, for this particular piece of text, "it" is always followed by "was", but "the" is sometimes followed by "best" and sometimes by "worst." Congratulations! We have just formed our model and trained it.

Because the data telling us "which word follows which word" is the result of training the model, we can call it "model data." In this particular case, our model data looks like this:

word: "it", followed_by: ["was", "was", "was", "was"]

word: "was", followed_by: ["the", "the", "the", "the"]

word: "the", followed_by: ["best", "worst", "age", "age"]

word: "best", followed_by: ["of"]

word: "of", followed_by: ["times", "times", "wisdom", "foolishness"]

word: "worst", followed_by: ["of"]

word: "age", followed_by: ["of", "of"]

word: "times", followed_by: ["it"]

...

word: "foolishness", followed_by: []

...
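For those who want to see this "training" step in code, here's a minimal Python sketch of it (the function and variable names are just for illustration, and punctuation handling is deliberately crude):

from collections import defaultdict

def train_first_order(text):
    # split the text into words, ignoring punctuation and capitalisation
    words = text.replace(",", "").replace(".", "").lower().split()
    # for every word, collect the list of words that followed it
    model_data = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        model_data[current_word].append(next_word)
    return model_data

quote = ("It was the best of times, it was the worst of times, "
         "it was the age of wisdom, it was the age of foolishness.")
model = train_first_order(quote)
print(model["the"])   # ['best', 'worst', 'age', 'age']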

That's it - no black magic there. Now that we have formed a model and have trained it with data, we can ask the model to generate text! The algorithm for this comes very naturally:

  1. To start, we pick a random word from our data, let's say "the". This is our "current word".
  2. We look up the entry for the current word and pick a single word from the list of words that follow it - this becomes our new "current word".
  3. We repeat step #2 until we have enough words or we hit a word which has no successors (in our case, "foolishness").

A sequence of words generated by this model and this model data could be:

"the age of times it was the best"

That's surprisingly ok for a model which can be implemented in about 25 lines of code. If you are being generous, you might even imagine the line was written by a person who's bad at English.
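Continuing the sketch from above, the generation step could look something like this (again, just an illustrative sketch - the output is random, so yours will differ):

import random

def generate(model_data, length=8):
    # 1. start from a random word that exists in the model data
    current_word = random.choice(list(model_data.keys()))
    output = [current_word]
    # 2. and 3. keep picking a random successor until we have enough words,
    # or we hit a word with no successors (like "foolishness")
    while len(output) < length:
        followers = model_data.get(current_word)
        if not followers:
            break
        current_word = random.choice(followers)
        output.append(current_word)
    return " ".join(output)

print(generate(model))   # e.g. "the age of times it was the best"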

This illustrates that even the simplest models can have "promising" results.

Diminishing returns

The above algorithm is called a "first order Markov chain" because it only considers a single word and the list of words that have followed it in the text.

It's easy to conclude that, if we got semi-sensible output when tracking only one word before picking its successor, we will get much better results when we track TWO previous words! Genius!

Now our model data looks like this:

words: "it was", followed_by: ["the", "the", "the", "the"]

words: "was the", followed_by: ["best", "worst", "age", "age"]

words: "the best", followed_by: ["of"]

...

We also see that the followed_by lists are basically the same as the ones we previously had for the last word of each preceding pair - this is because we have too little training data to work with.

Here's an important bit: we only know the model data is too simple because we know the English language. For example, we are pretty certain that the word "the" can be surrounded by many more different words than the model contains. The computer doesn't know that. Even we wouldn't know it if the only thing we could base our conclusions on was the single output line. It's only because we fully understand the model data that we can tell how limited it is (or because we could ask the model to create a large number of outputs and inspect them).

This illustrates the need for quality data in training the model.

For the next bit, we will ingest the entire text of A Tale of Two Cities from Project Gutenberg (the plain text version), and we'll do it with a more powerful variant of the Markov chain model: one which lets us pick how many words form the "context" before the next word is picked. This variant allows us to create an N-th order Markov chain: it looks up N consecutive words and constructs a list of words that have followed them.
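Building the N-th order table is only a small change to the earlier sketch - instead of a single word, the key is a tuple of N consecutive words (names here are again just illustrative, and the text is assumed to be already split into a list of words):

from collections import defaultdict

def train_nth_order(words, n):
    # map every run of n consecutive words to the list of words that followed it
    model_data = defaultdict(list)
    for i in range(len(words) - n):
        context = tuple(words[i:i + n])
        model_data[context].append(words[i + n])
    return model_data

# with n=2 on our quote, this would give entries such as
# ("it", "was") -> ["the", "the", "the", "the"]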

A Tale of Two Cities has about 140,000 words, of which about 11,750 are unique. This means that the first order Markov chain will be a table with 11,750 entries: for each unique word from the text, we will have constructed a list of words that have followed it in the text. A second order Markov chain, which tracks TWO words (as in the example from the start of this section), needs to form a table of about 73,000 unique two-word pairs, and list the words that have followed them.

How much does the generated text improve going from the first order Markov chain to the second order? Let's see.

Here are some 15-word lists outputted by the first order chain:

  • heavily and flowers in Paris picking their horses That was in a long But Mr Lorry
  • town for her and charcoal from the carriage sufficiently admire your knowing what that rose in
  • cold water wherein as he folded rag attached to such quiet time brought you together until

And here are some 15-word lists outputted by the second order chain:

  • before Of the men and women in Saint Antoine with a far more as though the opportunity
  • in the Street of the Tribunal’s patriotic remembrance until three days afterwards more probably a week or
  • so completely as to admission or denial of the Dover mail struggled on with his spiky head

Ok, far less gibberish, but still gibberish. So, how about we try a 5th order chain?

  • frosty air at an hour’s distance from my place of residence in the Street of the School of Medicine when
  • word for it Well If you could endure to have such a worthless fellow and a fellow of such indifferent
  • of this document was done A sound of craving and eagerness that had nothing articulate in it but blood The

Much better, right? And with a 10th order chain?

  • corkscrew Whatever tools they worked with they worked hard until the awful striking of the church clock so terrified Young Jerry that he made off
  • streets she asked him The usual noises Mr Cruncher replied and looked surprised by the question and by her aspect I don’t hear you said
  • a wild rattle and clatter and an inhuman abandonment of consideration not easy to be understood in these days the carriage dashed through streets and

Without going into heavy analysis, I'd say that the outputs of the 1st and 2nd order Markov chains make a similar amount of sense, and so do the outputs of the 5th and 10th order chains.

This illustrates that it gets progressively harder and harder to get even small gains in quality of output.

This means that 5 is about the sweet spot for the length of the context when generating text with this type of model. If you're interested in what the steps of text generation look like for the 5th order chain, here they are:

('frosty', 'air', 'at', 'an', 'hour’s') distance

('air', 'at', 'an', 'hour’s', 'distance') from

('at', 'an', 'hour’s', 'distance', 'from') my

('an', 'hour’s', 'distance', 'from', 'my') place

('hour’s', 'distance', 'from', 'my', 'place') of

('distance', 'from', 'my', 'place', 'of') residence

('from', 'my', 'place', 'of', 'residence') in

('my', 'place', 'of', 'residence', 'in') the

('place', 'of', 'residence', 'in', 'the') Street

('of', 'residence', 'in', 'the', 'Street') of

('residence', 'in', 'the', 'Street', 'of') the

('in', 'the', 'Street', 'of', 'the') School

('the', 'Street', 'of', 'the', 'School') of

('Street', 'of', 'the', 'School', 'of') Medicine

('of', 'the', 'School', 'of', 'Medicine') when

Simulating machine learning

I've just had a discussion with my fiancée about whether there can be "learning" if the thing which is supposedly learning is not intelligent. As a scientist, I'm pushing this decision into the realm of semantics. If we define "learning" as "something done by an intelligent entity", then what we are talking about here is clearly not learning.

Let's say we train our Markov chain model on the entire content of Wikipedia. Presumably, this content will be full of facts.

Since this model is not as sophisticated as ChatGPT's, we can't really input a sentence into it and expect a result, but we CAN posit a beginning of a sentence and ask it to continue.

We could choose "Marvin Lee Minsky was" as the starting words of the Markov chain and ask the model to continue. What we hope to get back is "Marvin Lee Minsky was born in New York City, to an eye surgeon father, Henry, and to a mother, Fannie...", but we are not guaranteed to get this output.

To drive the point home, let's consider algebra. Consider what would happen if we trained our model with a 1st year elementary school algebra textbook, and gave it the starting words "1+1=" (we consider each symbol a word here). Will it be able to continue on its own and output "2"? Probably. Unless there's a typo in the textbook, or even an example of a false statement which contains the words "1+1=3." The computer doesn't know the difference. It will form a model which contains both:

("1", "+", "1", "=") ["2", "3"]

When asked to generate text starting with "1+1=", it will print out either "2" or "3", each with a 50% chance.

How do we fix that, assuming we don't want to manually handle every single possible algebraic expression?

We fix it by force-feeding the model with thousands of examples of the same correct outcome, which enables the model to pick the correct word sequence more often. Our "more correct" model might contain this record:

("1", "+", "1", "=") ["2", "3", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2"]

(the actual representation of this in computer memory is much more compact)

So there's about 80 times more chance the model will pick "2" over "3" when asked to continue "1+1=". But sometimes, still, it will pick "3".
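Nothing special is needed in code to get this behaviour: picking uniformly at random from the followed_by list already favours whatever occurs in it more often. A quick sketch of the idea:

import random
from collections import Counter

followed_by = ["3"] + ["2"] * 80    # roughly the record shown above

# each continuation's probability is just its share of the list
for word, count in Counter(followed_by).items():
    print(word, count / len(followed_by))   # "2" is about 0.99, "3" about 0.01

# ...so a uniform random pick still returns "3" about once in 81 tries
print(Counter(random.choice(followed_by) for _ in range(10_000)))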

We have "taught" the model to be better, so in a way, it has "learned" and is now "improved", but it's not 100% correct.

This is, in essence, why ChatGPT has problems with trivial algebra questions:

[Screenshot: ChatGPT answering a trivial arithmetic question incorrectly]

This is also why ChatGPT, just like the Markov chain example, will ALWAYS sound right and use the correct jargon, but its answers might not make sense. It's a bit like people who learn the jargon of a profession and pretend they understand what's going on.

Again, the Markov chain example from this article is not how ChatGPT works. ChatGPT is much more complex. But deep down in its zeroes and ones, it does something kind of like that.

This illustrates why language models are not "intelligent."

Knowing all that, do you want ChatGPT diagnosing your leg pain?

Whether ChatGPT or similar models will ever be "intelligent enough" to pass for a Chinese room remains to be seen.

One way to improve models like ChatGPT is to stick something like Wolfram Alpha on top of them, to handle and correct the parts of statements related to logic, math and basic facts. We'll see how that goes soon enough.

Here's the "arbitrary order" Markov chain code - but this time you cannot run it directly online because it fetches the Tale of Two Cities text from the network.
