Large Language Models as Reasoning Engines: Decoding the Emergent Abilities and Future Prospects - Part III

Before we get into the article, I would like to clarify a few things. In my previous articles, my objective was not to dismiss the existence of any consciousness component in Generative AI based LLMs, but rather to acknowledge its presence and attempt to explain it through the concept of emergent abilities. My contention was that several emergent abilities might combine to create the inference and reasoning capabilities of Generative AI based LLM models, as against these capabilities being due to an externally captured consciousness. Whether a consciousness component (intrinsic or external), not arising out of emergent abilities, is what makes these LLM models artificially intelligent, and whether these emergent abilities are not themselves constituents of a consciousness component but merely serve such a consciousness, are different questions which I shall discuss in another article.

In the previous articles, I discussed using LLMs only as the central hub of a business process orchestration framework, where the LLM functions purely as a natural language processing interface and reasoning engine utilized by several agents (e.g., LangChain or AutoGen) that perform various tasks using several program components and external APIs to complete a business process. The data component involving private intranet corporate data is kept in a separate knowledge repository. In this context, this article discusses the concept of domain adaptation and its implications for the architectural design pattern.

The concept of using a knowledge repository external to the LLM itself is similar to the currently widely adopted design pattern where Retrieval Augmented Generation (RAG) is used to access external data, using data and query embeddings, vector search, and vector databases. However, this design pattern ignores the fact that we are unnecessarily using a heavyweight model like GPT-4 that can only be hosted externally, leading to huge costs.
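To make the RAG pattern concrete, here is a minimal sketch of the retrieve-then-prompt loop. It uses a toy bag-of-words "embedding" and cosine similarity in place of a learned embedding model and a vector database; the documents and function names are illustrative, not from any particular library.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. A real system would
    # use a learned embedding model and store vectors in a vector database.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Rank documents by similarity to the query and return the top-k
    # as context for the LLM prompt.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "Refunds are processed within 14 days of a return request.",
    "The warranty covers manufacturing defects for two years.",
]
context = retrieve("how many days for refunds", docs, k=1)
prompt = f"Context: {context[0]}\nQuestion: how many days for refunds"
```

The LLM then answers from the retrieved context alone, which is exactly why the pattern delivers information retrieval rather than new inferential knowledge, as discussed below.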

If the purpose of the Generative AI based LLM in the orchestration framework is to function as a natural language interface and reasoning engine, why should a corporation pay for a trillion-parameter model when it knows that none of the internet data the model comprises will be of any use to it, and when all those parameters are not required for reasoning capabilities either, as demonstrated by smaller models like the latest Mistral 7B?

Moreover, when an agent framework like LangChain or AutoGen is used, there will be a lot of to-and-fro token-based communication comprising input and output tokens (e.g., ReAct template tokens), all of which could further drive up costs dramatically, since the per-token costs of larger GPT-4 class models are quite high. Hence, it makes sense for a corporation to use its own lightweight LLM to act just as a reasoning engine, without any unnecessary internet-scale data, deployed in-house or in a cloud where only instance hours, as against tokens, are billed.
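A rough back-of-the-envelope comparison illustrates the point. The prices, query volumes, and step counts below are purely illustrative assumptions (not current vendor prices); the key structural fact is that a ReAct-style agent loop resends the growing scratchpad on every step, so input tokens multiply.

```python
def api_cost_per_month(queries, steps_per_query, in_tok, out_tok,
                       price_in_per_1k, price_out_per_1k):
    # Each agent step re-sends the prompt plus accumulated scratchpad,
    # so total billed tokens scale with queries * steps.
    tokens_in = queries * steps_per_query * in_tok
    tokens_out = queries * steps_per_query * out_tok
    return (tokens_in / 1000 * price_in_per_1k
            + tokens_out / 1000 * price_out_per_1k)

# Hypothetical hosted large model: $0.03 / 1K input, $0.06 / 1K output.
hosted = api_cost_per_month(100_000, 5, 2_000, 300, 0.03, 0.06)

# Self-hosted lightweight model: pay instance hours, not tokens.
# e.g. one GPU instance at a hypothetical $4/hour, running all month.
self_hosted = 730 * 4.00
```

Under these assumptions the hosted bill is tens of thousands of dollars per month against a few thousand for the instance, and unlike the API bill, the instance cost does not grow with agent chattiness.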

Need for Domain Adaptation

Though this is an ideal situation - the possibility of a lightweight Generative AI based LLM that functions only as a reasoning engine without any information content (all its training data being meant to create emergent abilities alone) - using these LLMs only as natural language reasoning engines and relying on RAG to deliver external data may not deliver the full potential of Generative AI.

By taking the entire data outside the LLM, we may be losing one of the major benefits of Generative AI - the learned knowledge representations pertaining to a specific domain that can unleash domain-specific intelligence. This is also an inherent weakness of RAG-based LLM models that is not widely highlighted in the literature.

For example, if the private corporate data involves domain- or company-specific vocabulary that is not part of the LLM's embedding model, we might lose the benefits of emergent abilities pertaining to domain knowledge. In other words, using RAG does not lead to the creation of new inferential knowledge, but only to information retrieval and delivery in the required format.
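The vocabulary problem can be seen mechanically at the tokenizer level. The sketch below uses a greedy longest-match subword tokenizer - a simplified stand-in for BPE/WordPiece, with a made-up vocabulary - to show how a domain term absent from the vocabulary shatters into many small pieces, each carrying little meaning for the model.

```python
def tokenize(word: str, vocab: set[str]) -> list[str]:
    # Greedy longest-match subword tokenization (simplified stand-in for
    # BPE/WordPiece). Substrings not in the vocabulary fall back to
    # single characters.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

# A generic vocabulary with everyday words but no medical shorthand.
vocab = {"take", "twice", "daily", "ta", "ke"}
common = tokenize("take", vocab)     # one whole-word piece
medical = tokenize("q.i.d.", vocab)  # shatters into single characters
```

A domain-adapted model, trained with such terms in its vocabulary and context, would represent 'q.i.d.' as a single meaningful unit instead.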

The following are some specific issues that are better handled by domain-specific LLMs (based on an excellent article on the subject by Aisera - https://aisera.com/blog/domain-specific-llm/):

1. Lexical Specificity - A generic LLM may not accurately recognize or utilize certain terms without extensive exposure to the specific vocabulary and its related context. For instance, medical shorthand such as ‘b.i.d.’, ‘q.d.’, ‘q.i.d.’, or legal phrases like “habeas corpus”.

2. Contextual Nuances - Generic LLMs might find it challenging to grasp contextual subtleties, as their training on generalized data does not offer sufficient instances of language usage in these unique and special contexts.

For instance, the term ‘positive’ in medical parlance has a negative connotation, as against general usage. Similarly, the term ‘consideration’ has an entirely different meaning in the legal domain.

A term like ‘short’ in general language might refer to a measurement of height or length, but in the finance domain, ‘shorting’ refers to a specific investment strategy that involves selling securities that the seller does not own, in the hope of repurchasing them later at a lower price.

3. Depth of Specialized Knowledge - Generic LLMs may be able to superficially imitate the discourse in a specific field, but they lack a profound and genuine understanding of the concepts due to their limited exposure.

For example, the concept of ‘osmosis’ is not merely about the movement of molecules from a region of higher concentration to one of lower concentration, as it might be loosely interpreted outside the field. In biological terms, it represents the intricate process that facilitates the transport of substances across a cell membrane without the expenditure of energy.

The concept of the ‘Black-Scholes Model’ is not just about an equation, which is how it might be loosely interpreted outside the field. In finance, it represents a complex mathematical model used for pricing options and derivatives in the market - and yet, in practice, it is rarely used directly to calculate the price of derivatives when formulating option strategies. Similarly, finance texts widely quote CAPM (the Capital Asset Pricing Model) as the model for calculating discount rates, but it is rarely ever used in practice for valuation.

4. Data Scarcity - When data is rare or highly specialized, it presents challenges in training Generic LLMs to a level of proficiency that is acceptable for professional use in such domains.

For example, in finance, the availability of training data can significantly decrease when it comes to niche investment strategies or advanced financial modeling. This is the case with ‘Quantitative Momentum Investing’, a specialized investment strategy that relies on complex mathematical models and computations. Due to its specialized nature, data related to this strategy is not commonly available, making it challenging for a Generic LLM to gain proficiency in this domain.

5. Specialized Inference - Generic LLMs are not trained with an emphasis on domain-specific inference patterns, and therefore they might struggle to apply such reasoning when generating text.

For instance, in the finance domain, analysts and investors often analyze market trends and interpret financial reports, requiring them to make inferences based on a complex array of financial data and specific economic indicators. In a stock market analysis, financial professionals must draw inferences from historical data and market trends to determine the potential performance of a particular stock.

In the realm of law, legal practitioners and adjudicators frequently scrutinize case precedents, case laws, and interpret statutes, necessitating them to deduce conclusions from a sophisticated array of prior legal decisions and precise statutory language. In the context of a personal injury litigation, legal experts are required to infer from antecedent case laws to ascertain their applicability to the present case.

In short, would you trust a Medical LLM based on Reddit and Quora discussions or the one trained on specialized medical texts?

Apart from the above issues, which pertain to domain-specific knowledge, general-purpose LLMs may also lack the emergent abilities that can provide further insights on the domain, which we have been discussing in these articles. These emergent abilities depend on an understanding of such domain-specific nuances. Hence, the very core ‘AI’ part of the LLM will be missed if we adopt an LLM that accesses all data from an external repository and does only information delivery using natural language instructions. In-context learning might also require domain-specific capabilities.

If the domain knowledge is part of the LLM, then the LLM can go beyond mere information delivery and act as a domain-specific reasoning engine, developing emergent abilities through learned knowledge representations pertaining to the domain knowledge. Hence, it is preferable that these LLM models be trained or fine-tuned on domain knowledge. For example, in the finance domain, the advantages include domain-specific understanding of jargon and terminology, contextual understanding, predictive analysis based on trends, financial reporting and analysis, risk assessment, regulatory compliance, etc.

Role of RAG in Domain-adapted Models

Another weakness of RAG-based LLM models is the possible repetitive usage of the same contextual information. Consider a customer service chatbot that refers to several manuals to provide guidance on products. Every time a customer asks a query, the same contexts need to be provided. We know very well what kind of queries come from customers, and we also know that the LLM has to refer to these manuals to answer them. So, why keep this domain knowledge outside the LLM? Even after considering cost-saving measures like a local cache of LLM responses or standard caching mechanisms, it may still be more beneficial to keep the heavily and repetitively used data as part of the LLM itself.
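The local-cache mitigation mentioned above can be sketched as follows. This is a minimal illustration, not any particular library's API: answers are stored under a hash of the normalized query, so repeated customer questions never trigger a second paid model call - but note that only verbatim repeats benefit, which is why baking the knowledge into the model remains attractive.

```python
import hashlib

class ResponseCache:
    """Cache LLM answers keyed by a normalized query string."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, query: str) -> str:
        # Normalize case and whitespace so trivial variants share a key.
        norm = " ".join(query.lower().split())
        return hashlib.sha256(norm.encode()).hexdigest()

    def get_or_call(self, query: str, llm_call) -> str:
        k = self._key(query)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        answer = llm_call(query)        # the expensive hosted-model call
        self._store[k] = answer
        return answer

cache = ResponseCache()
fake_llm = lambda q: f"answer to: {q}"   # stand-in for a real model call
cache.get_or_call("How do I reset my router?", fake_llm)
cache.get_or_call("how do i reset  my router?", fake_llm)  # served from cache
```

A rephrased question ("my router won't restart") misses this cache entirely, whereas a model that has internalized the manuals handles paraphrases for free.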

While this might imply that the importance of RAG-based models is likely to decrease in the future, the RAG model also has some inherent major advantages. For example, the private corporate data cannot be a constant repository. The data might get modified, new data will be added, and there could even be streaming of new data that changes every day. Updating the domain-adapted model with new data by retraining it frequently is not an efficient option.

This implies that a RAG component will still need to be part of the framework to take care of changing data, even when a domain-adapted LLM is deployed.

Domain and Company-specific Adaptation

The corporate private data can have several components - a core component pertaining to the domain at large (finance, automobile, medical, etc.), a company-specific component which is proprietary, and the changing component mentioned in the previous paragraph that necessitates a RAG component.

The domain knowledge can be made part of the LLM by several domain adaptation methods. For example, a finance company can start with BloombergGPT or KAI-GPT, which will have all the domain vocabulary and learned knowledge representations pertaining to the domain. The second part of the knowledge repository, which is specific to the company, can then be adapted into the model. The third part will use a RAG component to cater to the varying component - data that are likely to be modified or streamed daily based on changes in the external environment.
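The three-component split above can be expressed as a simple routing table: static domain and company knowledge lives in the adapted weights, while frequently changing data is fetched through the RAG component at query time. The tier names and the `route` helper are hypothetical, purely to make the architecture concrete.

```python
from dataclasses import dataclass

@dataclass
class KnowledgeTier:
    name: str
    changes_often: bool
    baked_into_weights: bool

# The three components of corporate private data described above.
TIERS = [
    KnowledgeTier("domain",   changes_often=False, baked_into_weights=True),
    KnowledgeTier("company",  changes_often=False, baked_into_weights=True),
    KnowledgeTier("floating", changes_often=True,  baked_into_weights=False),
]

def route(tier_name: str) -> str:
    # Stable knowledge is answered from the adapted model itself;
    # anything that changes daily goes through retrieval instead.
    tier = next(t for t in TIERS if t.name == tier_name)
    return "model-weights" if tier.baked_into_weights else "rag-retrieval"
```

The design choice this encodes is exactly the article's thesis: retraining handles the slow-moving tiers, retrieval handles the fast-moving one.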

In general, the following are the possible LLM design patterns for the agent framework:

1. A conventional RAG model that uses an externally hosted general-purpose LLM with full reasoning capabilities, like GPT-4. In this model, GPT-4 is supposed to function as well as a domain-specific LLM.

2. A domain-specific model like KAI-GPT, Google’s Med-PaLM, BloombergGPT, etc., further fine-tuned with internal corporate data, with a RAG component for floating knowledge. Each industry can have a domain-specific GPT model on which further domain adaptation can be done for private corporate domain data.

3. A lightweight LLM with emergent capabilities that is fine-tuned on the domain and corporate data, with several LoRA adapters for different business divisions and an associated RAG component for real-time data. Alternatively, a company can create its own model by fine-tuning a lightweight model with possible emergent capabilities for domain knowledge.
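The LoRA adapters mentioned in option 3 work by adding a trainable low-rank correction to a frozen weight matrix: the effective weight is W + (alpha/r) * B @ A, where A and B are small. The toy plain-Python sketch below (illustrative shapes and values, no ML framework) shows the arithmetic and why per-division adapters are cheap - only A and B are stored per division, not a full model copy.

```python
def matmul(a, b):
    # Plain-Python matrix multiply for small illustrative matrices.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_update(W, A, B, alpha, r):
    # LoRA: frozen base weight W (d x d) plus a low-rank product
    # B @ A (d x r times r x d), scaled by alpha / r. Only A and B
    # are trained, so each business division keeps just a tiny adapter.
    delta = matmul(B, A)
    s = alpha / r
    return [[W[i][j] + s * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight (toy)
A = [[1.0, 2.0]]               # rank r=1, shape (1 x 2)
B = [[0.5], [0.0]]             # shape (2 x 1)
W_adapted = lora_update(W, A, B, alpha=2.0, r=1)
```

Here the base model stores 4 weights and each adapter stores only 4 low-rank values; at realistic dimensions (d in the thousands, r around 8-64) the adapter is a small fraction of a percent of the base model's size.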

All of the above models would also be further instruction-tuned for different tasks and improved with RLHF training for output formats and legal, ethical, and privacy concerns.

Issues pertaining to Domain Adaptation

Some research, however, asserts that general-purpose models like GPT-4 perform better than domain-specific models. The case of BloombergGPT is quoted often. However, the comparison benchmarking is done on tasks like sentiment analysis, NER, QA, etc. Are these tasks domain-specific to finance at all?

Is sentiment analysis on capital markets as simple as sentiment analysis of tweets? It is not surprising that general-purpose models like GPT-4 perform better than domain-specific models on such naive general-purpose tasks. The comparisons should be done on tasks that are specific to the domain. A domain-specific finance task would be the ability to read annual reports or SEC filings and deduce insights. These studies did not necessarily envisage domain-specific or domain-tuned emergent abilities when comparing with general-purpose LLMs, so the domain-specific models end up being nothing more than models with domain-specific memorized data.

I saw one comparison where the task was to find the latest profit growth given a company name, and where the general-purpose model outperformed the finance domain-specific model (Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? A Study on Several Typical Tasks, https://arxiv.org/abs/2305.05862). But should the task of calculating profit growth be left to an LLM at all? I have seen many reasoning-related tasks that somehow focus on quantitative calculations. Why should LLMs learn such maths at all? For each of these finance-related calculations, the model should be able to invoke external programs by providing the functional parameters and data, and use the output to deliver the response. Hence, these comparisons may not be adequate to reach the conclusion that general-purpose models will always be better than smaller domain-specific models.
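The delegate-the-arithmetic approach looks like the following sketch. The tool name, registry, and call format are hypothetical stand-ins for whatever function-calling convention the agent framework uses; the point is that the model only selects the tool and fills in parameters, while a deterministic program does the arithmetic.

```python
def profit_growth(prior: float, current: float) -> float:
    # Deterministic arithmetic the LLM should delegate, not attempt itself.
    return (current - prior) / prior * 100

# Hypothetical tool registry the agent framework exposes to the model.
TOOLS = {"profit_growth": profit_growth}

def handle_tool_call(call: dict) -> str:
    # The LLM emits a structured call; the framework executes it and
    # hands the exact result back for the model to phrase as an answer.
    result = TOOLS[call["name"]](**call["arguments"])
    return f"{result:.1f}%"

# e.g. asked for the latest profit growth, the model emits:
call = {"name": "profit_growth",
        "arguments": {"prior": 120.0, "current": 150.0}}
answer = handle_tool_call(call)
```

With this division of labor, a benchmark that scores the model on the numeric result is really scoring the tool, and the fair comparison between general-purpose and domain-specific models shifts to whether the right tool and parameters were chosen.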

If we do use a domain-specific model, it is preferable that it have emergent abilities pertaining to that domain, and it is doubtful to what extent BloombergGPT does. The problem could lie in the nature of the dataset used for training. In this regard, Google’s Med-PaLM is a positive example - the model has supposedly learnt domain-specific insights and knowledge representations (Large language models encode clinical knowledge, https://www.nature.com/articles/s41586-023-06291-2).

There is also research saying that RAG-based models consistently outperform fine-tuned models on domain knowledge (Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs, https://arxiv.org/abs/2312.05934).

If a RAG model outperforms a domain-specific LLM, it implies that the domain-specific LLM has not learnt any domain-specific insights but has merely memorized domain-specific knowledge, in which case it is unlikely to perform better than a RAG-based model. This is an indicator of the challenges involved in creating and benefiting from emergent abilities when fine-tuning a general-purpose model for domain-specific knowledge.

Domain Adaptation Methods

Domain adaptation can be done in several ways. Broadly, it can be either unsupervised or supervised domain adaptation. Supervised domain adaptation is mostly similar to instruction tuning.

In my view, generic instruction tuning using question-answer pair datasets may not lead to more inference-based knowledge representations. Instruction tuning may only end up adding additional memorized data to the model without giving it any domain-specific capabilities. For example, the highly rated Med-PaLM model was initially fine-tuned on several question-answer pair datasets, which did not lead to any significant improvements. It was further trained using additional techniques like instruction prompt tuning, a data- and parameter-efficient alignment technique, to further adapt the initial Flan-PaLM to the medical domain.

Current research in this field indicates that unsupervised domain adaptation has not always been successful. Some researchers report that the models struggle to retrieve knowledge when it is present in the middle or at the end of the document they were trained on (Unsupervised LLM Adaptation for Question Answering, https://arxiv.org/abs/2402.12170). This phenomenon implies that such unsupervised domain adaptation may not have led to any emergent abilities.

Domain adaptation should go far beyond what RAG can achieve, i.e., beyond just retrieving contextual knowledge; without that, RAG might become more preferable. I think more research on unsupervised domain adaptation training is required, as against the current focus on general-purpose models. I don’t think reasons like the distribution of the “source training dataset” of the pre-trained model being different from the distribution of the “target dataset” used for domain-specific unsupervised fine-tuning are such major stumbling blocks.

Conclusion

This article argues that a lightweight Generative AI model, used only for its natural language interface and reasoning capabilities, with an external knowledge repository containing corporate private data, is sufficient as the central hub of business process frameworks that utilize autonomous agents, as against externally hosted large models like GPT-4, which can be quite costly. However, if such LLMs can be adapted to the domain-specific knowledge of the company, they would realize the full potential of Generative AI, as against a RAG-based model. A small RAG component would remain to take care of changing and real-time knowledge, bringing down the costs of RAG as well.

This design pattern would bring down costs tremendously while greatly improving the performance of the Generative AI models. Such domain-specific models can outperform large general-purpose models and RAG-based models only if they are able to develop domain-specific emergent abilities. Such emergent abilities may not arise from methods like instruction fine-tuning with QA datasets; the solution could be unsupervised domain adaptation, which is not currently very successful. More research is needed on unsupervised domain adaptation techniques, which could result in lightweight domain-specific models that make LLM-based business process workflows less costly.

Kasiraman Ramachandran

Retired as MD & Chief Product Owner at Standard Chartered Global Business Services

10 months ago

Very well written with a lot of analysis. The examples quoted are quite useful for understanding the context.

Murugesan Narayanaswamy

From Finance & IT to AI Innovation: Mastering the Future | Deep Learning | NLP | Generative AI

10 months ago

Some addition: Researchers have introduced a new RAG method termed "Retrieval Augmented Fine-Tuning (RAFT)" in their paper RAFT: Adapting Language Model to Domain Specific RAG (https://arxiv.org/abs/2403.10131). This method stands in between RAG and domain adaptation. It involves fine-tuning the LLM for RAG-based retrieval using training prompts that contain context-based CoT reasoning. This way, the LLM can further learn the reasoning required for RAG-based retrieval. By training on a significant amount of domain-related documents, the model can also be made to memorize the required domain knowledge, which helps it perform better on RAG-based retrieval. This method can also be used where RAG is not performing well - the same RAFT-tuned model can be used across different domains for RAG-based retrieval.
