Navigating the LLM Labyrinth: The Dance of Knowledge and Reasoning
Unveiling the Nature of GPT Models: More than Just Knowledge Databases


Why LLMs Are Reasoning Engines

Percival Lowell, an astronomer from Boston, provides a compelling narrative from 1894 that highlights the distinction between knowledge and reasoning. Although Lowell meticulously documented Martian "canals," his intriguing yet flawed conclusions were later debunked by NASA's Mariner missions. This story underlines that reasoning, in the absence of sound knowledge, can lead to well-articulated but incorrect conclusions.

Similarly, AI models like GPT-4 are often mischaracterized as extensive knowledge repositories, whereas they are, in reality, advanced reasoning engines. Despite being trained on vast swaths of the internet, their performance is hindered by a lack of explicit knowledge. Some may argue that this limitation renders GPT-4 merely a "stochastic parrot," but this perspective is misguided. The model's performance improves dramatically when it has access to the right information. Let's first understand how and why the technology is now ready to drive credible change.


A map of Mars drawn by Italian astronomer Giovanni Schiaparelli in the 1880s, showing waterways on the planet's surface. It was Schiaparelli who first proposed the canal theory. From the book "A Popular Handbook and Atlas of Astronomy" by William Peck, 1891

How AI Is Breaking Polanyi's Paradox

Michael Polanyi, a philosopher and scientist of the 20th century, posited a concept that would later be termed "Polanyi's Paradox." He introduced the idea with the assertion, "we know more than we can tell," emphasizing the vast difference between tacit knowledge, which is personal and hard to formalize, and explicit knowledge, which can be readily articulated and codified.

This paradox becomes critically relevant in today’s AI-driven context. While humans possess a vast reservoir of tacit knowledge — comprising skills, intuition, and experiences — much of this knowledge is difficult to articulate or codify. In stark contrast, AI, built on algorithms and data, excels in handling explicit knowledge but often stumbles when nuanced intuition and experience are paramount.

Recent progress in generative AI has been driven by four factors:
1) Computing power
2) Innovations in model architecture
3) Ability to “pre-train” using large amounts of unlabeled data
4) Refinements in training techniques        

1) Model performance depends strongly on scale, which includes the amount of computing power used for training, the number of model parameters, and dataset size. Pre-training an LLM requires thousands of GPUs and weeks to months of dedicated training time. For example, estimates indicate that a single training run for a GPT-3 model with 175 billion parameters, trained on 300 billion tokens, may cost around $5 million in computing costs alone.

2) In terms of model architecture, modern LLMs make use of two earlier key innovations: positional encoding and self-attention. Positional encodings keep track of the order in which a word occurs in a given input. This allows large bodies of input text to be broken into smaller segments that can be processed simultaneously without "forgetting" earlier parts of the input. Self-attention, meanwhile, assigns importance weights to each word in the context of the entire input text. Older approaches assigned importance based on word frequencies, which can misrepresent a word's true semantic importance, and they typically considered semantic context only within a small window. In contrast, self-attention enables models to capture long-range semantic relationships within an input text, even when that text is broken up and processed in parallel (a minimal sketch of self-attention follows this list).

3) LLMs can be pre-trained on large amounts of unlabeled data. For instance, GPT is trained on unlabeled text data, allowing it to learn patterns in human language without explicit guidance. Because unlabeled data is far more prevalent than labeled data, this allows LLMs to learn about natural language on a much larger training corpus. The resulting model can be used in multiple applications because its training is not specific to a particular set of tasks.

4) Finally, general-purpose LLMs can be further "fine-tuned" to generate output that matches the priorities of any specific setting. For example, an LLM may generate several potential responses to a given query, but some of them may be factually incorrect or biased. To align the model, human evaluators can rank these outputs to train a reward function that prioritizes some responses over others. Such refinements can significantly improve model quality by making a general-purpose model better suited to its specific application.
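To make the self-attention mechanism from point 2 concrete, below is a minimal sketch of scaled dot-product attention in plain NumPy. The toy shapes and random inputs are illustrative assumptions, not production code; real LLMs add multiple attention heads, causal masking, and positional encodings on top of this core operation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: learned projections
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Every token scores every other token; scaling stabilizes training
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores)   # importance of each token to each token
    return weights @ V          # context-aware representation per token

# Toy usage: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```

Note that the attention weights span the entire sequence, which is exactly how long-range relationships survive parallel processing.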

The Interplay of Knowledge and Reasoning in AI

When we talk about AI, especially models like GPT-4, LLaMA, Claude, and Falcon, we often find ourselves at the crossroads of knowledge and reasoning. These two elements, while distinct, are deeply intertwined in the realm of artificial intelligence.

1. The Nature of Knowledge in AI:

The "knowledge" in these models is derived from vast datasets they've been trained on. However, this knowledge isn't explicit or definitive. Instead, it's a probabilistic understanding based on patterns in the data. Unlike a human who can recall a specific fact from memory, these models generate responses based on patterns they've identified during training.

Limitations of Embedded Knowledge:

  • Fuzziness: The knowledge isn't clear-cut. It's based on the likelihood of a piece of information being accurate, given the patterns the model has seen.
  • Lack of Explicit Recall: These models don't have a "lookup" function in the traditional sense. They can't pull a specific fact from a database. Instead, they generate responses based on patterns.
  • Temporal Limitations: Without continuous updates, the models' knowledge becomes static, limited to the last training cut-off.

2. Reasoning – The AI's Thought Process:

Reasoning in AI is the model's ability to take a piece of information (like a question) and process it using its embedded knowledge to generate a coherent response. It's the logic and structure behind how AI models think and respond.

Vector Databases – Bridging Knowledge and Reasoning:

Vector databases play a pivotal role in AI's knowledge-reasoning interplay. They allow for efficient indexing, storage, and querying of vast datasets. When an AI model reasons and processes a query, it's essentially navigating through these high-dimensional vector spaces to find the most relevant and probable response.
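As a rough illustration of that navigation, here is a minimal brute-force sketch: documents and a query are embedded as vectors, and documents are ranked by cosine similarity to the query. Real vector databases replace this linear scan with approximate nearest-neighbor indexes; the 384-dimensional random vectors below merely stand in for real embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity compares direction, ignoring vector magnitude
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query_vec, doc_vecs, top_k=3):
    # Score every document against the query; return the best top_k indices
    scores = np.array([cosine_similarity(query_vec, d) for d in doc_vecs])
    return np.argsort(scores)[::-1][:top_k]

rng = np.random.default_rng(42)
docs = rng.normal(size=(5, 384))   # five "documents" as embedding vectors
query = rng.normal(size=384)
print(retrieve(query, docs))       # indices of the most similar documents
```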

3. The Symbiotic Relationship:

While knowledge and reasoning can be discussed as separate entities, in AI, they're symbiotic. Knowledge without the ability to reason is just a static database. Reasoning without knowledge is directionless. It's the combination of vast, pattern-based knowledge with advanced reasoning capabilities that gives models like GPT-4 their power.

The Value of Input Data:

The adage "garbage in, garbage out" is particularly relevant here. The quality, diversity, and breadth of input data directly influence an AI model's knowledge base. A well-trained model with diverse data will have a richer understanding and, consequently, better reasoning capabilities. Conversely, biased or limited data can lead to skewed knowledge and flawed reasoning.

The dance between knowledge and reasoning in AI is intricate. As we continue to develop and refine AI models, understanding this interplay becomes crucial. It's not just about feeding data but ensuring that the data is representative, diverse, and unbiased. Only then can we harness the true potential of AI, where knowledge and reasoning coalesce to generate insights previously thought impossible.

Key Things to Know When Working with LLMs

While the focus often lies on the output of AI, the input—the information provided to the model for analysis—is equally vital. The responses generated by the model are heavily influenced by the available information, and we often neglect the limitations of its knowledge, the cost associated with sourcing information, and the challenge of surfacing pertinent information at opportune moments. Addressing these challenges is as fundamental as enhancing the model's reasoning capabilities. First, let's understand the basic building blocks:

Language

Language is not just a random jumble of words. Instead, there are (fairly) definite grammatical rules for how words of different kinds can be put together: in English, for example, nouns can be preceded by adjectives and followed by verbs, but typically two nouns can’t be right next to each other. Such grammatical structure can (at least approximately) be captured by a set of rules that define how what amount to “parse trees” can be put together:

Source: Wolfram

ChatGPT doesn't have any explicit "knowledge" of such rules. But somehow in its training it implicitly "discovers" them—and then seems to be good at following them.
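For contrast, here is what explicitly codified grammar rules look like, a minimal sketch using NLTK's context-free grammar tools. The toy grammar itself is an illustrative assumption, not anything ChatGPT actually uses:

```python
import nltk

# A hand-written toy grammar: the kind of explicit rule set
# that ChatGPT never sees, yet implicitly learns to follow
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det N | Det Adj N
    VP  -> V NP
    Det -> 'the'
    Adj -> 'red'
    N   -> 'cat' | 'ball'
    V   -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat chased the red ball".split()):
    tree.pretty_print()  # renders the parse tree as ASCII art
```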

Pre-processing

The reality is that data ingestion is a far more complex and difficult process than it first appears. LLMs need an interpretable, well-structured corpus of natural language data to kickstart their training process. Data, in this multifaceted age of the internet, is anything but uniform. It's messy, complex, and tangled. This makes piecing together a structured corpus of training data a tall task and emphasizes just how important data preparation is in the LLM stack and pipeline.

For example, in the design pattern below, the File Context Extractor (data extraction) holds the keys to the kingdom for getting the right data into the vector index.

This is really challenging if your data inputs include graphs, bar charts, or pie charts (e.g., data available in management presentations), or even tabular data. Chart images can easily be found in news articles, web pages, company reports, and scientific papers. However, the raw numerical tables are lost when charts are published as images. The industry is still working on this problem, but at a high level, below is a design pattern for converting chart data into structured data.

ChartOCR Framework

Chunking Text for Vector Databases

Vector databases often require data to be in smaller, consistent chunks for efficient storage and retrieval. Chunking involves dividing lengthy texts into smaller segments or chunks, ensuring that each piece retains enough context to be meaningful.
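Here is a minimal character-based chunker with overlap. This is a sketch only: the 500-character size and 50-character overlap are assumed defaults, and production pipelines often chunk by tokens or sentences instead.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size characters.

    The overlap preserves context across boundaries, so a sentence that
    straddles two chunks remains retrievable from either one.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

document = "Vector databases often require data in smaller chunks. " * 40
pieces = chunk_text(document, chunk_size=200, overlap=20)
print(len(pieces), "chunks; first chunk starts:", pieces[0][:40], "...")
```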

Embeddings

Embeddings convert textual data into fixed-size vectors, preserving semantic context. These vector representations can then be used for a myriad of tasks, including similarity searches, clustering, and classification. Different embeddings might prioritize different aspects of the text, from semantic meaning to sentence structure.
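As a sketch of how embeddings behave in practice, assuming the sentence-transformers library with all-MiniLM-L6-v2 as one illustrative model choice:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

sentences = [
    "The cannon ball fell from the tower.",
    "A projectile was dropped from a great height.",
    "I enjoy eating pasta.",
]
embeddings = model.encode(sentences)  # each sentence -> fixed-size vector

# Semantically similar sentences land close together in vector space
print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # low similarity
```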

Vector Databases

Vector databases store embeddings in a manner optimized for high-speed similarity searches. Given a query embedding, these databases can quickly retrieve the most similar vectors, facilitating tasks like recommendation, anomaly detection, and clustering.
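A minimal sketch of that retrieval step, using FAISS as one representative vector index; the dimensionality and random vectors are placeholders for real embeddings:

```python
import numpy as np
import faiss  # Facebook AI Similarity Search

d = 384  # embedding dimensionality (assumed)
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(10_000, d)).astype("float32")

# A flat (exact) L2 index; real deployments often use approximate
# indexes such as IVF or HNSW, trading a little recall for speed
index = faiss.IndexFlatL2(d)
index.add(doc_vectors)

query = rng.normal(size=(1, d)).astype("float32")
distances, ids = index.search(query, 5)  # five nearest neighbors
print(ids)
```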

These foundational concepts provide the groundwork for more advanced NLP methodologies and pipelines. Proper understanding and implementation can vastly improve the outcomes of NLP projects.

Tokens

Tokenization decomposes texts into smaller units, called tokens. A token might represent a word, part of a word, or even a single character. This process helps in analyzing and processing the text, making it digestible for models and algorithms.
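To see tokenization in action, here is a short sketch using OpenAI's tiktoken library as one concrete tokenizer. Other models use different vocabularies, so the exact token splits below are illustrative:

```python
import tiktoken

# cl100k_base is the encoding used by several recent OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization decomposes texts into smaller units."
tokens = enc.encode(text)
print(tokens)                              # integer IDs, one per token
print([enc.decode([t]) for t in tokens])   # the text piece behind each ID
print(len(tokens), "tokens for", len(text), "characters")
```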

Choosing the Right Foundation Model

What Is a Model?

Say you want to know (as Galileo did back in the late 1500s) how long it’s going to take a cannon ball dropped from each floor of the Tower of Pisa to hit the ground. Well, you could just measure it in each case and make a table of the results. Or you could do what is the essence of theoretical science: make a model that gives some kind of procedure for computing the answer rather than just measuring and remembering each case.

Any model you use has some particular underlying structure—then a certain set of “knobs you can turn” (i.e. parameters you can set) to fit your data. And in the case of ChatGPT, lots of such “knobs” are used—actually, 175 billion of them.

But the remarkable thing is that the underlying structure of ChatGPT or LLMs—with "just" that many parameters—is sufficient to make a model that computes next-word probabilities "well enough" to give us reasonable essay-length pieces of text.
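That "computes next-word probabilities" step ultimately reduces to sampling from a distribution the model produces. A toy sketch, with made-up logits standing in for a real model's output scores:

```python
import numpy as np

def sample_next_word(vocab, logits, temperature=0.8):
    # Lower temperature -> more deterministic; higher -> more adventurous
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return np.random.choice(vocab, p=probs)

vocab = ["the", "cat", "sat", "on", "mat"]
logits = np.array([2.0, 0.5, 1.0, 0.2, 0.8])  # made-up model scores
print(sample_next_word(vocab, logits))
```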

Below is the evolutionary tree for LLMs.


Neural Networks: The Brain Behind AI

In human brains there are about 100 billion neurons (nerve cells), each capable of producing an electrical pulse up to perhaps a thousand times a second. The neurons are connected in a complicated net, with each neuron having tree-like branches allowing it to pass electrical signals to perhaps thousands of other neurons.

When we “see an image” what’s happening is that when photons of light from the image fall on (“photoreceptor”) cells at the back of our eyes they produce electrical signals in nerve cells. These nerve cells are connected to other nerve cells, and eventually the signals go through a whole sequence of layers of neurons.
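Artificial neural networks mimic this signal-passing with a far simpler abstraction: each artificial neuron computes a weighted sum of its inputs and passes the result through an activation function. A minimal sketch:

```python
import numpy as np

def neuron(inputs, weights, bias):
    # Weighted sum of incoming signals, squashed by a sigmoid:
    # loosely analogous to a biological neuron's firing rate
    activation = np.dot(inputs, weights) + bias
    return 1 / (1 + np.exp(-activation))

signals = np.array([0.9, 0.1, 0.4])   # incoming signals
weights = np.array([0.7, -1.2, 0.3])  # learned connection strengths
print(neuron(signals, weights, bias=0.1))
```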

In Conclusion

As we stand on the precipice of a new era in artificial intelligence, it's essential to remember the lessons from our past, like Lowell's Martian canals. The blend of knowledge and reasoning, tacit and explicit understanding, is what will drive the next wave of AI innovations. While the technical intricacies of AI and LLMs are vast and ever-evolving, at their core lies a simple truth: the harmonious marriage of data and logic. As we continue to refine and advance these models, we're not just building better algorithms; we're shaping the future of human-machine collaboration. Let's embrace this journey with curiosity, diligence, and a commitment to harnessing AI's potential responsibly and ethically. Feel free to share your thoughts below!

#GPT4 #OpenAI #AI #NaturalLanguageProcessing #TaskComposability #ProductionReady #AdvancedApplications
