State of thought on GenAI

Generative AI is still not General AI

TLDR: I go into a lot of detail about the current state of thinking about GenAI and why much of it is nonsense. With the release of GPT-4o and other advancements, the hype train is again accelerating. I argue that the idea that language models could achieve intelligence or any level of cognition is a massive self-deception. There is no plausible theory by which a word guessing language model would acquire reasoning, intelligence, or any other cognitive process. Claims that scaling alone will produce cognition are the result of a logical fallacy (affirming the consequent) and are not supported by any evidence. These claims are akin to biological theories of spontaneous generation, and they demonstrate a lack of understanding of what intelligence is. If the statistical properties of language patterns were all that intelligence required, every statement would be true and accurate. Intelligence requires multiple levels of representation: of the world, of the language, and of abstract concepts.

The hype train has left the station

With the release of GPT-4o and other advancements, the hype train is again accelerating. Last month, the venerable science publication, Nature, proclaimed: “‘In awe’: scientists impressed by latest ChatGPT model o1. The chatbot excels at science, beating PhD scholars on a hard science test.”

Under the right circumstances, these models do remarkable things, but do they really exhibit PhD-level intelligence? Passing tests, for example qualifying exams, is often a requirement for receiving a PhD diploma, but it is not the same thing as having PhD-level intelligence. Qualifying exams measure PhD-level knowledge, but they do not measure PhD-level intelligence. Knowledge is necessary, but to receive a PhD diploma, at least in the sciences (and probably everywhere), one needs to make an original contribution to the science. That is, one needs to say something that is not already known. In this, GenAI fails.

GenAI models language

GenAI models use a transformer architecture. Training consists essentially of fill-in-the-blank. A text is presented with a word missing (usually the final word), and the model parameters are adjusted to better predict the missing word. When used with images, part of an image is removed, and the model weights are adjusted during training to fill in the missing pixels. Human feedback may be used to further adjust the weights (fine-tuning), to prefer some fill-in choices over others.
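
For readers who want a concrete picture of that training objective, here is a minimal sketch in Python. It is not a transformer (the real architecture is vastly more elaborate); it only shows the fill-in-the-blank idea in miniature, using a toy corpus invented for illustration and simple counting in place of learned parameters.

    from collections import Counter, defaultdict

    # Toy corpus; a real model trains on a large fraction of the web.
    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    # "Training": for every word, count which word follows it. A real model
    # adjusts billions of parameters instead of counts, but the objective,
    # predicting the missing word, is the same in spirit.
    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1

    def predict(context_word):
        """Fill in the blank: probability of each candidate next word."""
        counts = follows[context_word]
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}

    print(predict("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
    print(predict("sat"))  # {'on': 1.0}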

GenAI models are large language models. They model language patterns. The acronym GPT stands for “Generative Pre-trained Transformer.” In short, they are (usually) good at reproducing the language patterns on which they have been pre-trained. They reproduce their training patterns with some variations because the number of possible patterns far exceeds the capacity of any model to represent each one exactly. Because words with similar meaning appear in similar contexts (see below), the model smooshes together multiple text patterns from multiple training examples into an overlapping probabilistic pattern of parameter weights. Given a context, there are multiple words that could be used to continue the pattern, leading to variations in the output strings.
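
The variation mentioned above comes from sampling. Given a probability distribution over candidate continuations, the model does not always choose the single most likely word. A minimal sketch, with a made-up distribution standing in for the model's real output over its whole vocabulary:

    import random

    # Hypothetical next-word probabilities for some context; a real model
    # produces a distribution like this over its entire vocabulary.
    next_word_probs = {"mat": 0.55, "rug": 0.25, "sofa": 0.15, "moon": 0.05}

    def sample_next(probs, temperature=1.0):
        """Sample a continuation; higher temperature flattens the distribution
        and yields more varied (and more surprising) output strings."""
        words = list(probs)
        weights = [p ** (1.0 / temperature) for p in probs.values()]
        return random.choices(words, weights=weights, k=1)[0]

    random.seed(0)
    print([sample_next(next_word_probs) for _ in range(5)])
    print([sample_next(next_word_probs, temperature=2.0) for _ in range(5)])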

The idea that language models could achieve intelligence is a mass hallucination. For some it is a self-deception, for others it may be an intentional deception. Mostly, I think it is due to a lack of critical thinking and to ignorance; in short, intellectual laziness.

I know of no even semi-plausible theory by which a word guessing language model would acquire reasoning, intelligence, or any other cognitive process. We do know how these models are architected and how they are trained. When the models produce outputs that are consistent with intelligence, reasoning, or cognition, the default hypothesis must be that they are paraphrasing the text on which they have been trained. Anything else is an extraordinary claim and requires extraordinary evidence. To be sure, language models can pass law school admission and similar tests, but the simplest explanation is that the questions on those tests or ones like them have been published and are among the training text that the models used to fill in the blanks.

Before discussing the details, we should consider why the question of whether GenAI models are intelligent matters. If all you need is a system to achieve a specific task and one or more of these models is sufficient to achieve it, then it probably does not matter exactly how it does so. On the other hand, if you are concerned with the future value of these models, for example, whether they will take over the jobs of lawyers, doctors, or schoolteachers, then you may be concerned with just what the capabilities of these models are. Will these models lead to artificial general intelligence? Will they present an existential risk to humanity? Will they radically change the labor environment? Do they need to be regulated, and if so, how? If these are the questions with which you are concerned, or if you are concerned with the science of artificial intelligence, then how they achieve what they do, and even what they do, will require a more careful examination. Even if your concern is whether to invest in companies dealing with these models, you must know what they are capable of doing and how they do it.

Unrealistic expectations

In 2016, Geoffrey Hinton said that AI would take over radiology within 5 years and there was no point in training additional radiologists. That did not happen, and instead we are facing a severe shortage of radiologists. In 2023, he quit Google so he could “speak freely about the risks of AI.” But what are those risks if the AI in question is just a stochastic word guesser?

More recently, OpenAI disbanded its superalignment team, which was charged with pondering the question of how to keep AI under control and aligned with human interests. In his 2023 year-end report, US Supreme Court Chief Justice Roberts wondered whether AI would make judges obsolete. Are any of these moves and questions justified by the facts? I think not.

Former Google executive Mo Gawdat has claimed that GPT-4 matches the IQ of Einstein, and we could be just a few months away (GPT-5) from a machine with 10 times the IQ of Einstein.

The California Legislature passed, but the Governor vetoed, an AI safety bill, SB-1047. In my opinion it was a bad law and should have been vetoed, but others see the veto as California Governor Newsom caving to the tech industry. CNN said, in part: "Generative AI - which can create text, photos and videos in response to open-ended prompts - has spurred excitement as well as fears it could make some jobs obsolete, upend elections and potentially overpower humans and have catastrophic effects."

These concerns are only sensible if the available AI models are generally intelligent, not mere word guessers. Although the models sometimes produce outputs that are consistent with such autonomous intelligence, the evidence is far from compelling. The outputs are consistent with reasoning and intelligence, but they are also consistent with a machine that merely paraphrases language patterns that it has been trained on. For example, if I have a hypothesis that Jon stole the cookies from the cookie jar, the fact that the cookie jar is empty is consistent with Jon’s guilt, but it is also consistent with other causes. There may not have been any cookies to take, or someone else may have taken them, etc. A GenAI model may produce output that is consistent with the hypothesis that it is intelligent and is reasoning, but that evidence, so far, is also consistent with the alternative hypothesis that it is a language mimic.

Why are GenAI expectations wrong?

There are many causes for the credulous expectation that GenAI is either at or on the verge of achieving general intelligence. Among them is a reliance on faulty logic. As I mentioned earlier, the so-called evidence for these models engaging in cognitive processes is the observation that they behave as if they were engaging in these cognitive processes, without considering whether other explanations (e.g., word guessing) might explain this behavior. An actor might say what a mathematical genius would say, but that does not make the actor a mathematical genius. In logic this is called affirming the consequent: if the models think, then they should say this; they say this; therefore, they think. Absurdly, we might argue: if Abraham Lincoln was killed by robots from outer space, then he would be dead; Lincoln is dead; therefore, he was killed by robots from outer space. In psychology, this is called confirmation bias. People look for evidence that confirms their belief, rather than evidence that tests their belief.

A related problem is the fundamental mistaking of fluency for competence. A model that learns language patterns can be very good at mimicking those patterns. Even in humans, fluency is not a necessary indicator of intelligence. It is far easier to sound intelligent than it is to be intelligent. One can learn a few phrases of a foreign language and produce them fluently without understanding a single other word of the language. In humans, fluency is often related to competence, but not always. In machine intelligence, the two can be completely decoupled.

Learning and reproducing specific language patterns also underlies another problem with claims that GenAI models are intelligent. Many of the claims rely on benchmark tests. The claims take the form: this model passes an important test, such as a law-school admission test, an IQ test, or some other specific task. The problem is that the models have been trained on virtually everything on the World Wide Web, and these tests and benchmarks typically have been published; thus, they are likely to be included in the models’ training sets. The ability to pass the test then indicates nothing more than the system’s ability to memorize and paraphrase the text that it has been fed. It implies nothing about its level of intelligence. The empirical problem of testing machine intelligence is distinguishing between capabilities that are provided by learning language patterns and those that would require actual intelligence.

Ignorance of intelligence makes modeling difficult

Compounding the problem of poor testing is the failure to understand just what intelligence is. Intelligence is not the same thing as passing a test, even an intelligence test. Many of the people claiming that these models are as intelligent as humans have no substantial knowledge of just what that human intelligence is. These people may be great software developers or computer scientists, but unless they have knowledge of what they are trying to implement, they will fail.

Intelligence has been studied for over 100 years, and when you include studies of human (and even animal) expertise and problem solving, it is a large and complicated field. It consists of more than just solving well-structured problems. Especially if one is concerned with superintelligence, it requires solving problems that no one currently knows how to solve. Yet, in order to assess the accuracy of their problem solving, current models are restricted to solving problems for which there is a known solution.

Because Mo Gawdat claimed that GPT-4 was as intelligent as Einstein, it makes sense to examine an example of Einstein’s intelligence. In 1905, Einstein published a ground-breaking paper on the photoelectric effect. A few years earlier, in order to explain the distribution of light frequencies emitted by a hot radiator, Max Planck had determined that the atoms of the radiator could oscillate only with specific quantized energies. He wrote an equation to describe this atomic oscillation. Einstein knew of Planck’s work.

At the same time, it was known that light falling on certain materials induced an electric current. The problem was that low-frequency light would not induce the current, even at high intensity, whereas high-frequency light would induce a current even at low intensity. Einstein solved this problem by proposing that light consisted not only of waves, but also of particles. By proposing that light consisted of particles, Einstein could distinguish between the energy in the total amount of light and the energy in each individual particle (photon). Einstein could then predict that the energy of the individual particles, not the total number of particles, caused the photoelectric effect.

His genius was not in paraphrasing known physics writing, but in creating new concepts that were inconsistent with the then-current writing, yet consistent with the observations. GenAI models could, if trained on the physics papers of the time, describe the mystery of the photoelectric effect and, perhaps, describe some erroneous theory of it, but they do not have the capability of contradicting that writing and coming up with a contrary new theory. One photon affects one electron if the photon has enough energy, but that can only be an explanation if we first have the idea of photons, and that idea did not exist prior to Einstein.

Explaining the photoelectric effect required the existence of theoretical entities, photons, that no one had proposed before. No one has ever seen a particle of light, of course, or an electron, or how they interact. These are all theoretical entities. But together, they explained observations that could not be explained otherwise. In some ways, this is the opposite of the situation with GenAI models and intelligence. All of the GenAI observations that purport to show intelligence can be explained by learned language patterns.

On the other hand, Einstein’s conceptualization of light as a particle illustrates another property of intelligence that GenAI models are incapable of demonstrating. He reconceptualized the problem in a different frame of reference from the general understanding of the day.

In another example of reconceptualization, before Copernicus the solar system was generally described with the Earth at its center. Ptolemy, for example, asserted that the sun, the planets, and the stars rotated on spheres that revolved around the Earth. Astronomical observations that were inconsistent with this simple view were explained by positing additional spheres with centers displaced somewhat from the Earth. Copernicus reconceptualized this system to put the Sun at the center of the solar system, a dramatically different representation, but one that was more consistent with the astronomical observations.

Both Ptolemy and Copernicus had insights that conceptualized the problem of explaining astronomical observations in ways that were related to, but different from, the available texts of their respective times. Current artificial intelligence models are good at solving a certain kind of problem when the representations are provided by human designers, but so far they are incapable of finding their own novel ways of representing problems. The tokens, embeddings, and neural network structures that GenAI and other models rely on are provided by their human designers. The models cannot be truly or autonomously intelligent until they can create their own representations. To date, we do not know how to build systems with those capabilities; but once the representations are provided, some form of gradient descent can reach a solution through successive approximation. Intelligence appears to require more than solving these human-designed problems.
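
To make the "successive approximation" point concrete, here is a minimal sketch of gradient descent on a problem a human has already framed: fitting a slope w so that y is roughly w times x on a few invented data points. The algorithm only refines w within that human-chosen representation; it never questions the framing.

    # Toy example: fit y = w * x to invented data by gradient descent.
    # The representation (one parameter w, squared-error loss) is a human
    # choice; the algorithm only nudges w downhill within that framing.
    data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # roughly y = 2x

    w = 0.0    # initial guess
    lr = 0.02  # learning rate
    for step in range(200):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad

    print(round(w, 3))  # converges to about 2.0, the slope implied by the data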

Spontaneous generation and emergentism

In any case, there is no plausible theory for how a system designed to fill in the blanks in text could come to be capable of intelligence. The closest thing to a hypothesis that would purport to explain how word guessers could become cognitive agents is the concept of “emergence.” The basic idea is that as a model becomes complex enough, cognitive processes, including intelligence, spontaneously emerge. A similar theory in biology held that life emerges spontaneously from inanimate substances.

Here is a quotation from Mary Shelley from the 1831 edition of Frankenstein:

“Many and long were the conversations between Lord Byron and Shelley, to which I was a devout but nearly silent listener. During one of these, … were discussed, and among others the nature of the principle of life …. They talked of the experiments of Dr. [Erasmus] Darwin, … who preserved a piece of vermicelli in a glass case, till by some extraordinary means it began to move with voluntary motion.” (Actually, Darwin wrote about “vorticellae,” not vermicelli; vorticellae are tiny creatures that can remain inactive in a dried state for long periods of time and become active when in water.)

The spontaneous emergence of cognition and intelligence from complexity is no more likely than is the spontaneous emergence of voluntary motion from a glass case of vermicelli.

Many difficult problems become easy to solve once the structure of the problem has been identified, including how to represent it. Large language models represent language as tokens. It is not known how much this decision to use tokens (rather than pixels and glyphs) simplified the models’ ability to exploit language patterns. But the decision to use tokens was made by human, not machine, intelligence. Another decision was to use masked training, where part of the input is removed and the model is trained to replace it. Again, this decision was made by human designers.
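
As an illustration of what the token decision looks like in practice, here is a sketch using the tiktoken library, which exposes the byte-pair encoding used with some OpenAI models. The particular token boundaries shown in the comment are illustrative and may differ from what the encoder actually returns.

    import tiktoken  # pip install tiktoken

    # A byte-pair-encoding vocabulary chosen and built by human designers.
    enc = tiktoken.get_encoding("cl100k_base")

    text = "The photoelectric effect puzzled physicists."
    token_ids = enc.encode(text)
    tokens = [enc.decode([t]) for t in token_ids]

    # The model never sees characters or glyphs, only these integer IDs.
    print(token_ids)
    print(tokens)  # e.g. ['The', ' photo', 'electric', ' effect', ' puzzled', ' physicists', '.']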

TRICS simplify the problems

Lachter and Bever argued that these crucial representational decisions can profoundly affect what a model learns. In 1986, Rumelhart and McClelland proposed that a simple associative neural network was sufficient to learn English past-tense transformations. According to Lachter and Bever, Rumelhart and McClelland chose a phonemic representation of the words that incorporated most of the difficult parts of the problem. Humans who learn to produce the past tense do not have that information; they have to figure out on their own how it works. Rumelhart and McClelland thus simplified the problem into one that could be learned by a basic associative mechanism. The model succeeded not because it learned the linguistic relations, but because Rumelhart and McClelland’s representation provided it with those relations.

Lachter and Bever called representations like those used by Rumelhart and McClelland TRICS, “The Representations It Crucially Supposes.” Being unaware of these TRICS, Rumelhart and McClelland over-estimated the capabilities of their model. The apparent intelligence of the model was really the intelligence of Rumelhart and McClelland. We do not currently know what TRICS might underlie the performance of large language models, causing designers to over-estimate the intelligence of their models.

In a recent commentary from the Journal of the American Medical Association (JAMA) Network, Sumant R. Ranji noted that in one study (Goh et al, 2024) a GenAI model’s diagnostic accuracy exceeded that of practicing general physicians on a test, but that other research found the model performed poorly under more realistic conditions:

How do LLMs perform a diagnosis under conditions closer to actual clinical practice? A recent study evaluated the performance of LLMs on diagnosing and developing management plans for 4 common abdominal conditions, using a dataset consisting of anonymized real patient data. Information was presented to the LLM in a stepwise manner, and after each step, the LLM was asked to summarize the information and provide a diagnosis or request additional testing. Once the LLM provided a diagnosis, it was required to recommend a treatment plan. When confronted with this realistic clinical decision-making scenario, LLMs performed poorly: significantly worse than physicians for all but the simplest diagnoses. The LLMs also failed to consistently request appropriate diagnostic testing and frequently made incorrect treatment recommendations even after arriving at the correct diagnosis.

Goh et al structured their LLM test in a way that was convenient, but that inadvertently made the task easier for the model to perform. That is, Goh et al solved a key part of the physician’s diagnostic problem for the model, thereby over-estimating the capabilities of their model, just as Rumelhart and McClelland had done.

The key idea here is that the TRICS help the model to solve problems, but the intelligence comes from people, not from the model. The models do not operate autonomously; instead, they are dependent on human input. Part of their apparent capability is human capability on which they crucially depend. Unless we recognize this human contribution, we over-estimate the intelligence of the models.

Rumelhart and McClelland were psychologists who were not as familiar with linguistics as Lachter and Bever. It was, therefore, easy for Rumelhart and McClelland to miss the critical contribution of their chosen representations. Nevertheless, it is still an object lesson in keeping the implications of representational decisions in mind when evaluating the capabilities of a machine learning system. It is easy to miss the implications of the chosen representations and other techniques used to simplify the problem, and thereby to over-estimate the autonomy and intelligence of the models.

To summarize, today’s models are statistical models of past language. They are highly dependent on human input, both as the source of those language patterns and as the source of the reinforcement learning that trains them in how they should transform and deploy those patterns. Being facile with language makes them useful for a number of applications, for example summarization, but it is not enough to solve problems that are not reflections of these past patterns.

The relationship between language and thought or intelligence is complex. Language is not identical with thought. We can think of things for which we do not have words, and we may know words but not be able to use them productively. The same ideas expressed in different terms can be perceived as different. For example, two groups were asked to choose a treatment plan for a disease that was 100% fatal if untreated. One group was asked about an experimental treatment that would result in 50% of the treated patients dying. A second group was asked about an experimental treatment that would result in 50% of the treated patients surviving. The people told that 50% would survive were more likely to choose the experimental treatment than the group told that 50% would die, yet the two alternatives were exactly the same. The words used to present the alternatives mattered to the choice that was made.

Language and intelligence

Words do help to structure how we think about things. A strong version of this idea is the Sapir-Whorf hypothesis: the concepts we have of the world are determined by the categories codified in our native language. In its extreme form, it says that we can only think of things for which we have words. In this form, it is very clearly false. Lawrence Barsalou, for example, found that people are quite capable of making up ad hoc categories on the spot (for example, things to take with you if your house is on fire). We do not have a word for such a concept, but Barsalou found that these ad hoc categories have the same kinds of properties that more traditional categories, for which we have names, have. For example, people can select prototypical members of such a category.

The transformer model, on which GenAI is based, depends strongly on the idea of distributional semantics, that is, that words get their meaning from the other words with which they co-occur. The fill-in-the-blank training process that transformers use represents each word relative to the other words that accompany it in the training text. Each word is represented by a vector, called an embedding. Because similar words appear in similar contexts, the vectors representing those words will also be similar. Roughly speaking, each vector represents (at least part of) the meaning of a specific word.

All that the models know about the meaning of words is their embedding vectors, that is, how they are related to other words. This is enough to be useful for paraphrases and for searching, but it is incomplete. For example, both synonyms (words with the same meaning) and antonyms (words with opposite meanings) are represented by similar vectors. Using Word2Vec embeddings, in fact, we find that the vector representing the word “hot” is less similar to the vector for the word “warm” (0.43) than it is to the vector that represents the word “cool” (0.52).
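
Readers can run this kind of comparison themselves. A sketch using gensim's pretrained Word2Vec vectors follows; the exact scores depend on which embedding model is loaded and will not necessarily match the numbers quoted above.

    import gensim.downloader as api  # pip install gensim

    # Pretrained 300-dimensional Word2Vec vectors trained on Google News.
    vectors = api.load("word2vec-google-news-300")

    # Cosine similarity between word vectors: a high score means "occurs in
    # similar contexts," which is not the same as "means the same thing."
    print(vectors.similarity("hot", "warm"))
    print(vectors.similarity("hot", "cool"))
    print(vectors.similarity("hot", "cold"))  # antonyms can score as high as synonyms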

The exact degree of similarity between one word’s representation and another’s is not the main point; the point is that the meaning is not the word, and it is not the vector that represents it. Intelligence would have to operate on the meanings; the words are not enough. The vectors reflect some of the similarity among words, but as the synonym/antonym pattern shows, there is more to meaning than just similarity. Intelligence requires more than just unanchored similarity.

For example, Chi, Feltovich, and Glaser found that experts and novices categorized problems differently. The experts were guided by physics principles, whereas the novices were guided by the words in the problem description. When there was a conflict between the surface descriptions and the underlying physics principles, the novices were distracted by the surface features. They categorized problems in terms like “rotation,” “mass,” or “spring,” particularly when those words appeared in the problem description. The experts classified them in terms like “conservation of energy” or “conservation of linear and angular momentum.”

In order to achieve expert-level intelligence, therefore, AI models would have to achieve a deeper level of representation than just words and their vector embeddings. As Chi and her colleagues’ work shows, similarity of problem descriptions does not always correspond to similarity of problem solutions.

Language is important to intelligence, and intelligence is important to language, but they are not the same thing. If language were enough, then every statement would be true. The very idea of truth requires that there be at least one other level of representation, reflecting the facts of the world, so that a correspondence can be evaluated.

Einstein solved the problem of the photoelectric effect by attributing different facts to familiar, but unexplained, observations. He proposed a set of facts (a theory): that light consisted of particles with certain properties. These facts were separate from the observations and separate from the words used to describe those observations. The proposed facts could be true, or they could be false. Intelligence, in short, requires multiple levels of representation and representations of the relations among them. The statistical co-occurrence patterns can reflect only what people are saying about a topic, not necessarily what is true about the topic. Intelligence requires more than mere statistics.

Intelligence is not limited to scientists, of course. My dog stops and stares up the hill, and I construct a theory of what he might be looking at. The proposed fact, the theory that explains his behavior, is often a coyote. Sometimes I am correct, and I eventually see the coyote too; sometimes my theory is wrong (it’s a possum); and sometimes I just don’t know. Again, the point is that there are multiple levels of representation operating here to intelligently solve the problem of why the dog stopped and stared.

Conclusion

Large language models model language, and it is wishful thinking to expect that they model intelligence. There is no reason to think that the statistical properties of language are sufficient to provide that intelligence. Generative AI is general only to the extent that the problems against which the models are deployed are consistent with the statistical language patterns that were observed during training. Therefore, they are retrospective, and they have difficulty with novel problems and with problems that deviate from what they have seen.

To get a fair assessment of what the models can and cannot do, one needs to distinguish between the statistical properties derived from training text and the reasoning or other cognitive processes characteristic of intelligence. Occam’s razor favors the interpretation that the model performance is due to memorized and paraphrased language patterns, not cognitive processes.

Most of the existing benchmark problems have already been presented in the model training sets and so do not provide a reasonable assessment of their abilities. In coding problems, for example, ChatGPT performed 48% better on problems that were published in a well-known coding benchmark (LeetCode) before the model was trained than it did on problems that were added later. The model’s performance on well-known problems is not predictive of how well it will perform on unknown problems, but intelligence would require that it not only perform well on those problems, but also be able to seek them out, conceptualize them, and solve them.

Progress on artificial intelligence will require methods to distinguish between memorized and thought-out answers and a theory of artificial intelligence that does not depend on magic or miracles or spontaneous generation. Although the computer does not have to solve all problems in the same way that people do to be intelligent, it has to solve the same kinds of problems. Mimicking what people have done in the past is good enough for many practical tasks, but if we expect computers to ever be more autonomous than that or to solve challenging problems for which the solution is unknown, they will need capabilities far different from those that can be reasonably obtained today.


Dinakar Raj

Enterprise Architecture I Risk AI Cloud Sec Data Advisory

5 days ago

Converse error, Reversal curse, etc., questioning the fundamentals of logic in LLMs. General Purpose Teachers (GPTs) can be trained to regurgitate on any subject with fine use of language and plausible answers, but would you really rely on them to teach core subjects?

Michael S Carroll, PhD, MEd

Imagining smarter healthcare; building bridges; burning siloes

2 weeks ago

Great discussion! I would say that I have a somewhat different memory of the R & M past tense model. At the time, the claim that the symbolicists (coming out of the generative grammar tradition) were making was that rules (especially production rules as in generative grammar) were necessary for language production and most adequately explain language acquisition. R & M were trying to show that a statistical model could learn past tense morphology without rules. I think it was reasonable for them to use phonemic coding. It seemed reasonable that a child would be learning in phonemic space, not lexigraphic. Also it was easy to conjecture that phonemic acquisition might precede morphology so it was not crazy to think that the language learner might be building on the scaffold of phonemic knowledge. Finally, let’s remember that the symbolicists were not that great at reconciling actual language acquisition data. Their theories were based mostly on the (mostly conjectural at the time) poverty-of-stimulus concept and the findings in comparative linguistics that certain imaginable configurations of syntax rules did not exist in human languages. ...[Cont]...

Michael S Carroll, PhD, MEd

Imagining smarter healthcare; building bridges; burning siloes

2 weeks ago

...[Cont]...This suggested to Chomsky (and his folks) that language acquisition must be assisted by an innate universal grammar in which a small number of learning exposures could ‘set the parameters’ of the learner’s grammar model. But, coming back to R & M, most of the evidence for generative grammar was in syntax (not morphology), so to me it seems more like L & B are the ones engaging in confirmation bias by assuming that transformational generative grammar applied to morphology and then criticizing R & M’s experimental set-up based on that assumption. It is true that the ‘poverty-of-evidence’ (my little joke) problem was something that Pinker and Prince were trying to address in their response to R & M, where they argued with at least some research references that learners’ errors were more consistent with rules + exceptions than with associative learning.

Rick Marshall BSc BE

Unibase database, language, AI and semantic data models. Data modeller, cyber security, custom applications

2 weeks ago

My thoughts on this (gradually forming an idea, original I hope). By feeding LLMs with data that is not curated we are building something with very high information/Shannon entropy. One of my propositions is that by feeding more uncurated data into these things we increase their entropy and therefore they become less and less knowledgeable, let alone intelligent, over time. When allowed to consume their own drivel, hallucinations and worthless statements must increase. Can we call this AI entropy? More disturbing is the fact that people only need a fraction of the information fed into LLMs to aid their intelligence. LLMs need two things at least to progress. One is a sense of existential threat - all living things have that from day 1; and curiosity - I have yet to see evidence that LLMs are curious and yet people and most animals are incurably curious.

Eric Lane

Customer Success Strategist | Enhancing Client Experiences through Strategic Solutions

2 weeks ago

This is a compelling critique of LLMs, highlighting the limitations of relying on statistical language patterns rather than true intelligence or reasoning.
