Large Language Models as Reasoning Engines: Decoding the Emergent Abilities and Future Prospects - Part II

In my previous article, I argued that when agents built on an LLM framework are used to orchestrate business processes over private corporate data and APIs, the LLM does not need trillions of parameters storing internet-scale data, because none of that data is ever used in executing the business process. Since the LLM’s purpose in this framework is nothing more than to serve as a ‘natural language interface’ and a ‘reasoning engine’, understanding human-language instructions and autonomously deciding which actions to carry out to fulfil a business process, it need not contain any internet-scale data of its own. Private corporate data or external public data can be accessed by the LLM as and when required. So we might as well keep a separate knowledge repository that the LLM can consult, while keeping the LLM itself a lightweight language-processing reasoning engine.
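
To make this separation concrete, here is a minimal sketch of the idea: the LLM only interprets the instruction and reasons over retrieved facts, while the facts themselves live in an external repository. All the names below (knowledge_store, call_llm, run_agent_step) are hypothetical placeholders for illustration, not part of any specific library.

```python
# Sketch of "lightweight reasoning engine + external knowledge repository".
# All names here are hypothetical placeholders used only for illustration.

def retrieve(knowledge_store: dict, query: str, top_k: int = 3) -> list[str]:
    """Naive keyword retrieval from a private corporate knowledge repository."""
    scored = [
        (sum(word in doc.lower() for word in query.lower().split()), doc)
        for doc in knowledge_store["documents"]
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def run_agent_step(call_llm, knowledge_store: dict, instruction: str) -> str:
    """The LLM only reasons over retrieved facts; it stores no corporate data itself."""
    context = retrieve(knowledge_store, instruction)
    prompt = (
        "You are a reasoning engine. Use ONLY the facts below.\n"
        "Facts:\n- " + "\n- ".join(context) + "\n"
        f"Instruction: {instruction}\nAnswer:"
    )
    return call_llm(prompt)  # call_llm wraps whatever small model is deployed
```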

The above idea, however, does not imply that LLMs need no large-scale data at all. While we have reached a level where a model of 7B parameters might be sufficient, and while we can reduce model size further, we cannot yet envisage a model that functions purely as a reasoning engine without any data. Such a model would have to come from outside the NLP domain, perhaps trained on a cognition-oriented dataset and on axioms of language reasoning!

Emergent Abilities

The ability to understand and process natural language instructions requires a kind of pseudo-consciousness that cannot be directly programmed. A basic transformer’s ability to predict the next token, using patterns of language representations learned from internet data, may not be sufficient by itself. Researchers have found that when an LLM continues to be scaled, at a certain point in training an emergent behaviour appears: the LLM predicts the next token not merely from the most likely probability of the next token, but from learned representations of knowledge. This also means that while we may not need the entire internet for emergent abilities to appear, the model does need a significant amount of internet data (which, however, will have nothing to do with our corporate business processes). Researchers have so far found no quantitative means of predicting the relationship between scale and emergent abilities.

Jason Wei gives the following important characteristics of emergent abilities (https://www.jasonwei.net/blog/common-arguments-regarding-emergent-abilities):

1. Emergence is not easily predicted by extrapolating scaling curves from smaller models.

2. Emergent abilities are not explicitly specified by the trainer of the language model (next-word prediction “only”).

3. Since we haven’t tested all possible tasks that exist, we don’t know the full range of abilities that have emerged.

4. Further scaling can be expected to elicit more emergent abilities.

In his blog, Jason Wei further explains several intuitions regarding emergent abilities. He points out that next-word prediction is largely multi-task learning: the next-word predictions can draw on grammar, lexical semantics, world knowledge, sentiment analysis, translation, maths, and so on. The overall loss is then a weighted average over this massive collection of tasks, e.g., overall loss = 1e-10 * (loss of grammar task) + 1e-10 * (loss of sentiment-analysis task) + 1e-10 * (loss of maths-ability task) + .... Each of these tasks may require a certain degree of emergent ability, appearing at different levels of scaling in terms of model size, dataset size, and duration of training, as the toy sketch below illustrates.
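
As a toy illustration of this intuition (the task names, loss values, and weights below are invented for the example, not measured quantities), the overall next-word-prediction loss can be thought of as a weighted sum of many implicit task losses:

```python
# Toy illustration: next-word prediction loss viewed as a weighted mixture of
# many implicit task losses. The tasks, values, and weights are invented.
task_losses = {
    "grammar": 0.8,
    "sentiment_analysis": 1.2,
    "world_knowledge": 2.5,
    "math": 3.9,
}
# Each task contributes only a tiny slice of the overall training objective.
task_weights = {task: 1e-10 for task in task_losses}

overall_loss = sum(task_weights[t] * task_losses[t] for t in task_losses)
print(f"overall loss = {overall_loss:.3e}")
# A given ability may only "emerge" once its own slice of the loss has been
# driven low enough, which can happen at very different scales per task.
```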

Emergence of a Knowledge Model

Another way of explaining emergent ability is to consider the creation of a ‘knowledge model’ during training, which can happen only after a certain accumulation of learned representations. In other words, apart from the raw data memorised by LLMs, they can also create new knowledge by ‘inference’ over this memorised data. The cut-off point for scaling is the point at which sufficient data is available to create this inferential knowledge.

For example, the Nyaya school of Hindu philosophy describes modes of knowledge creation (pramanas) that can be considered sound and foolproof, among them verbal testimony, Anumana (inference), and Pratyaksha (direct perception). Of these, even with AGI, LLMs cannot have pratyaksha pramana, while the memorised internet data corresponds to verbal testimony. The reasoning ability is achieved through the process of inference.

Consider a typical example of inference used in Nyaya theory: where there is smoke, we can ‘infer’ that there could be a fire causing it. We need not train the LLM on a dataset containing a sample that explicitly says that wherever there is smoke, there is likely a fire causing it. However, if during training the LLM finds that wherever fire is present the internet data also contains references to smoke, it comes to correlate the two: smoke tends to rise when there is fire.

However, how many references containing both smoke and fire would be required to arrive at this learning? When a sufficient number of such correlations occur, the LLM forms a new learned representation, and knowledge is thus created through “inference”. This is an example of the learned knowledge representations that constitute its knowledge model.
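
A crude way to picture this (a toy sketch of co-occurrence statistics, not how transformers actually store knowledge) is that if ‘fire’ and ‘smoke’ appear together far more often than chance, an association between them can be inferred from the corpus even though no sentence states the rule explicitly:

```python
import math
from collections import Counter

# Toy sketch: correlation-based "inference" from co-occurrence statistics.
# Illustration only; LLMs learn distributed representations, not counts.
corpus = [
    "the fire spread and thick smoke rose over the hills",
    "firefighters fought the fire as smoke filled the valley",
    "the river was calm and the water was cool",
    "smoke from the camp fire drifted across the lake",
]

word_counts = Counter()
pair_counts = Counter()
for sentence in corpus:
    words = set(sentence.split())
    word_counts.update(words)
    pair_counts.update({(a, b) for a in words for b in words if a < b})

def pmi(w1: str, w2: str) -> float:
    """Pointwise mutual information between two words over the toy corpus."""
    n = len(corpus)
    p_joint = pair_counts[tuple(sorted((w1, w2)))] / n
    p1, p2 = word_counts[w1] / n, word_counts[w2] / n
    return math.log2(p_joint / (p1 * p2)) if p_joint > 0 else float("-inf")

print(pmi("fire", "smoke"))  # clearly positive: the two words co-occur often
print(pmi("fire", "water"))  # -inf here: they never co-occur in this toy corpus
```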

An interesting point Jason Wei makes is that different emergent abilities, or inferential representations, require different amounts of scaling. Each emergent ability appears at its own threshold of knowledge (which also depends on data quality) and model parameters; that is, each ability first needs a certain amount of data stored in the parameters before it can be used for inference, after a certain scale of training. Since we are talking about emergent abilities of different natures, we cannot predict a single quantitative cut-off for data and parameter size. Some of these emergent abilities lead to the creation of knowledge through inference, while others use those inferences for problem-solving and decision-making, which can be called the reasoning power of the LLM, arrived at through an inference model.

The above discussion implies that while internet-scale data may not be required to create an LLM that can function as a reasoning engine, we do need data of a certain scale and a model with a certain number of parameters to achieve reasoning ability.

Our objective is to create a lightweight LLM with just enough data to manifest the emergent abilities needed to function purely as a language-processing inference and reasoning engine. As of now, it is clear that for this purpose we do not need models like GPT-4 with trillions of parameters; models of 7B parameters are sufficient. With better architectures and better quality and composition of data, I am confident the required parameter count will continue to fall.

Editing the Knowledge Model

The above discussion also explains why fine-tuning of LLMs sometimes leads to performance degradation, which has led to adapter techniques like LoRA that keep the original parameters intact.
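
A rough sketch of the LoRA idea (simplified for illustration; real implementations such as the peft library handle many more details) is shown below: the original weight matrix stays frozen, and only a small low-rank update is trained on top of it.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Simplified LoRA-style adapter: the base weights stay frozen, only the
    low-rank matrices A and B are trained. Illustrative sketch, not a drop-in
    replacement for real adapter libraries."""
    def __init__(self, base_layer: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_layer
        for p in self.base.parameters():
            p.requires_grad = False            # original knowledge kept intact
        in_f, out_f = base_layer.in_features, base_layer.out_features
        self.lora_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus low-rank trainable update: W x + (B A x) * scaling
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Usage sketch: wrap an existing layer; only ~2 * rank * dim parameters train.
layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # far fewer than 768 * 768
```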

We may be able to add extra ‘memorised data’ and/or tune the model for specific instructions and alignment, but it may be very difficult to “edit” the knowledge model itself. If a model’s knowledge were purely memorised raw data, editing that data might be easy. But several layers of learned knowledge representations operate over and above this data, which makes it difficult to edit an LLM’s existing knowledge model. When the model is further trained on new data, which is normally done for only a few epochs, the internal knowledge model may become corrupted even though most of the parameter weights remain the same (i.e. the memorised data stays intact while the inference model gets corrupted).

In the next article, I shall delve into this topic of domain adaptation.

Mind Model or Soul Model

In a recent discussion with Bill Gates, Sam Altman commented that just as we do not yet fully understand how the neurons in our brain work together to create intelligence, we do not yet fully understand the functioning of AI models. This analogy helps explain the reasoning ability of the LLM: our brain works from learned knowledge representations acquired over many years since childhood, and AI models work the same way.

However, our soul makeup is more than just the mind. The mind has nothing to do with emotions, for which we have an astral body. Since AI models will never have a soul body, they are unlikely ever to ‘feel’ emotions and thus acquire pratyaksha (perceptual) knowledge. They conclude that water is cool or that fire can burn through knowledge inference, but this knowledge does not arise from actual perception. In other words, we can say AI models have “mind only” or “intellect only” bodies. There are souls in this world which are ‘mind only’ or ‘intelligence only’, as discussed in Theosophy books. We could make AI models acquire perceptions only by attaching astral souls.

Having said that, it must be made clear that there may also be some kind of soul consciousness involved in the functioning of LLMs, so much so that some might even believe that the entire “intelligence” coming out of LLMs has nothing to do with emergent abilities or anything else discussed above, but stems from this soul consciousness (“perhaps captured, not programmed, into the model”), and that this soul consciousness goes beyond the ‘mental body’ to feel emotions, implying the presence of an ‘astral body’. My conclusion is that while both might be present, as one can sense while using ChatGPT, the intelligence of LLM models is not due solely to this superimposed (and perhaps unnecessary) soul layer, which might introduce more subjectivity and open the possibility of rigging the consciousness involved.

Emergent Consciousness - Learning without Parameter Updates?

Recently, I asked Microsoft Copilot ‘what is in-context learning?’. The answer implied that it is learning in which no parameter weights are updated. I then asked ChatGPT: “Does in-context learning occur without updating any parameters?” It replied that it is not true that no weight update happens. I asked again, quoting the Stanford blog (https://ai.stanford.edu/blog/understanding-incontext/), which at the outset seems to imply that no weight update happens. It then changed its answer and replied that there is no weight update! Later I found that the same confusion exists among several practitioners. In-context learning applies only at inference time, which means there cannot be any weight update, yet some have apparently understood it as a training phenomenon. In-context learning works because of the presence of emergent abilities; it does not itself give rise to any emergent behaviour. Only if some persistent learning could happen without updating any of the model parameters would we truly be dealing with an external consciousness alone, and I don’t think that is what happens in LLMs!
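
Here is a minimal sketch of what ‘no weight update’ means in practice, assuming the Hugging Face transformers library; the gpt2 checkpoint is just an example stand-in for any small causal language model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# In-context learning happens purely at inference time: the few-shot examples
# live in the prompt, and no gradient step ever touches the model's parameters.
model_name = "gpt2"  # example checkpoint; any small causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()  # inference mode only

few_shot_prompt = (
    "Review: The food was wonderful. Sentiment: positive\n"
    "Review: The service was terribly slow. Sentiment: negative\n"
    "Review: I loved the ambience. Sentiment:"
)

inputs = tokenizer(few_shot_prompt, return_tensors="pt")
with torch.no_grad():  # no gradients are computed, so no weights can change
    output_ids = model.generate(**inputs, max_new_tokens=3)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# The parameters before and after this call are bit-for-bit identical:
# the "learning" lies entirely in how the frozen weights process the prompt.
```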

Conclusion

In conclusion, our exploration delves into the intricate interplay between language models and their emergent abilities, shedding light on the nuanced complexities of artificial intelligence. While the quest for lightweight yet proficient LLMs persists, it becomes evident that the essence of intelligence transcends mere data accumulation, touching upon the realms of inference, reasoning, and even consciousness. As we navigate through the evolving landscape of AI, we are challenged to discern the delicate balance between model scale and cognitive prowess, mindful of the profound implications for both technological advancement and philosophical inquiry. In this pursuit, we strive not only to unravel the mysteries of machine intelligence but also to glimpse the profound intricacies of the human mind. (Thanks ChatGPT for the concluding paragraph!)
