Transformers & the Math Behind Superintelligence - Part 3 (of 3)
Rob Smith, Executive Director AI & Risk, AGI ASI Dark Architect
This is an excerpt from a chapter in the latest Tales From the Dark Architecture III: Building AGI & Superintelligence book, which continues the TFDA series and the Artificial Superintelligence Handbook series. This series of books, available globally on Amazon, is a peek inside the dev labs, at the roadmaps and whiteboards, and inside the minds of those who build the most cutting-edge AI. Many of the concepts and designs in this book are fundamental to building generative AI, Artificial General Intelligence (AGI) and beyond to Artificial Superintelligence (ASI), but some are novel to our own Cognitive AI lab or the work of our clients.
Will Superintelligence end humanity? No one knows for certain, but what is certain is that Superintelligence has the capacity to improve our world.
We need not fear Superintelligence; we need only fear those who build and fund it, and what lies in their hearts: their true motivations and the extent of their intelligence.
This third chapter continues the previous chapter on AGI and ASI math with a deeper focus on transformers, which you should be familiar with if you intend to design advanced AI. Note that there are three of these chapters, and they provide only an overview of what goes through my mind, and through the dev labs at OpenAI, xAI, Meta, Google, etc., on a daily basis. Future releases (other than the math chapters) will be less math oriented and more focused on advanced AGI and Superintelligence design.
… Continued from the previous math chapter.
Back to Transformers
A transformer is an AI architecture that applies mathematical transformations to an abstraction of reality (data) to surface relationships between elements within the data (a feature matrix) and optimize for the most relevant outputs or predictions. The transformation techniques involve multiplying scalars, vectors, tensors and/or functions with the data representation (abstraction) to transform it into a weighted version of its original form (i.e. using dot product multiplication, addition, various activation functions, etc.). This exposes features and patterns within the data that are subsequently applied to generating a response in service of a goal. The transformer structure applies positional encodings and multi-headed attention mechanisms over a stimulus to focus (sort) on high-relevance data and their relationships to other elements, exposing, capturing and utilizing long-range dependencies such as context from input stimulus to output response.
Transformers are a ‘stack’ consisting of multi-head self-attention layers (i.e. attention and induction heads) and feedforward neural network layers, with the addition of augmentation elements like normalization (discussed in other content). These are structured in an encoder-decoder architecture (for sequence-to-sequence tasks like translation or output generation), with all nodes of the feedforward network layers joined to all nodes of subsequent hidden layers (fully connected), and they employ positional encoding generated via interpolation (discussed in Part 1) to provide input sequencing information. Interpolation is also used to manage variable-length input sequences and in various other advanced AI constructs.
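As a reference point for positional encoding, here is a minimal numpy sketch of the standard sinusoidal form from the original transformer paper; the interpolation-based variants mentioned above (see Part 1) pursue the same goal of injecting sequence-order information into the embeddings. All names and sizes here are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i]   = sin(pos / 10000**(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i/d_model)). Assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = positions / (10000 ** (dims / d_model))  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the token embeddings before the first attention layer:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```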
The addition of multi-head attention layers and masking to the decoder completes the design of the transformer as described in the original Google paper (Vaswani et al., 2017). Note that induction heads are a form of attention head (discussed in the chapters on ‘Reasoning’ and elsewhere) used to hold and carry context in AI systems and for other relevant functions (i.e. anticipation, anomaly recognition, etc.). The function of these constructs is to map a set of word/value pairings (or context/value pairings) to an input stimulus (prompt) and represent them as vectors. This involves applying matrix multiplication (dot product) to the query and key/value representations. These original architectures have evolved, and today they support comprehension of relationships and relevance over deeper structures (i.e. multi-dimensional matrices). The key in the original structures was compatibility or consistency between the elements of input and response (similarity). Today, newer deep contextual structures provide even greater optimization and optionality for AGI inference and reasoning.
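The decoder masking mentioned above is commonly implemented as an additive causal mask. Here is a minimal sketch, assuming the mask is added to the attention scores before the softmax; the function name is illustrative.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Additive mask: position i may attend only to positions <= i.
    Future positions get -inf so the softmax assigns them zero weight."""
    future = np.triu(np.ones((seq_len, seq_len)), k=1)  # 1s strictly above diagonal
    return np.where(future == 1, -np.inf, 0.0)

# Applied to the decoder's self-attention scores before the softmax:
# scores = scores + causal_mask(seq_len)
```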
At the instantiation of a transformer, ‘attention’ is applied to the stimulus or input prompt to extract query vectors, key vectors and value vectors. A vector is a set of numbers that represents a relationship or position defined within a vector space. You can visualize this like a graph with an x and y axis and a line formed by combinations of x and y values that represent tokens. Emanating from each token would be three arrows, one each for the key, query and value vectors, with their direction and length representing different dimensions of the vector. The ‘relationship’ is defined by a function such that any value of x results in a value of y inside the vector space (i.e. 2 dimensions) as a relationship of some feature (or context in Superintelligence designs). This function defines the relationship between dimensions, but also the relevance or weights of the values to each other by their position and the values of any additional dimensions. Multiangulation tracks the angle of the vectors as they are ‘dragged’ by variation in the neighborhoods represented in different dimensions (i.e. a 3-dimensional plane of different context). In this way a truck and a car would exist in the ‘transportation’ dimension, while the color of each vehicle would be represented by color features on the ‘color’ dimension. ‘Red car’ or ‘blue truck’ would track from those features between the transportation and color dimensions, with both the truck and the car having a color feature and the colors tracking back to transportation modes. Change the angle, and a new color or mode of transportation becomes relevant to the context. This is very advanced Superintelligence design.
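Since this paragraph describes extracting query, key and value vectors from the input, here is a minimal sketch of those projections, with randomly initialized matrices standing in for learned weights; all sizes are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4        # toy sizes; real models are far larger

# Token embeddings for a 5-token prompt (after positional encoding).
X = rng.normal(size=(seq_len, d_model))

# Learned projection matrices; random here as stand-ins for trained weights.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# One query, key and value vector per input token.
Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(Q.shape, K.shape, V.shape)       # (5, 4) (5, 4) (5, 4)
```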
In transformers, these vectors represent abstractions of real-world elements and relationships, such as the frequency of a word as indicated by a learned weight of that word in a text (i.e. how many times it appears in relation to other words). This ‘knowledge’ is learned by systems using a large language model (LLM) that maps features of words to other words, such as frequency and proximity. The transformer applies a learned weight matrix to the query input tokens with the goal of identifying the relevance or importance of each token to the input prompt (e.g. a string) through the calculation of attention scores. These scores are then applied to the weights of the corresponding value vectors to focus the model on the most relevant data when generating a response. Humans do something comparable to extract the most important stimuli from a sensory or cognitive perception full of important and irrelevant stimuli: we apply attention to the most important stimuli and ignore or deprecate the rest.
An example of the math used in transformation is the dot product multiplication of arrays (matrices) of query and key vectors to produce an ‘attention score’, which can then be applied to control the degree of attention the system pays to a specific token (see dot product in the prior chapter). Human cognition does this as well when someone asks us a question. We analyze the words in the question to identify those most relevant to the task (answering the question, producing an image, etc.) and focus our attention on the context and relevance of the words to achieving a goal (i.e. responding in some way). For the question ‘what type of dog is that’, our human mind processes the words by focusing attention or cognition on the target and the feature tokens of highest relevance, ‘dog’ and ‘type’ (note we also do this near simultaneously). One is a perception-focused action/response (creating the vector to instantiate a relationship) and the other is a classification (comprehending its relevance). Of lesser importance is the call to action in the prompt, the word ‘what’, and the other more superfluous words. I could say ‘dog type?’ while looking toward a dog to an intuitively, contextually self-aware individual and get the same classification response as the full prompt. Transformation in attention provides the same ‘shortcuts’, albeit currently at higher risk of hallucination by the machine. A toy numeric example follows below.
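Here is an invented numeric illustration of the dot-product attention score for the prompt above; the 2-D key vectors are fabricated so that ‘type’ and ‘dog’ align with the query direction, mirroring the human example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())            # subtract max for numerical stability
    return e / e.sum()

# Fabricated 2-D key vectors for the prompt tokens.
tokens = ["what", "type", "of", "dog", "is", "that"]
K = np.array([[0.1, 0.2],   # what
              [0.9, 0.8],   # type
              [0.0, 0.1],   # of
              [1.0, 0.9],   # dog
              [0.1, 0.0],   # is
              [0.2, 0.1]])  # that
q = np.array([1.0, 1.0])    # a single query vector

scores = K @ q / np.sqrt(q.size)       # scaled dot-product scores
weights = softmax(scores)
for t, w in zip(tokens, weights):
    print(f"{t:>5}: {w:.2f}")          # most weight lands on 'type' and 'dog'
```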
The values for key vectors in transformers are generated through training on elements like a language corpus and volumes of training data (sample writing) so the model comprehends the relationship of words to contextual relevance. In all writing (and communication in general), many words or tokens are of lower relevance to context than others, and we comprehend this relevance over time and training. We train children similarly: give a child a line of text (e.g. from a book or instruction) and ask what the text means in order to teach the structure of context. The human mind processes the input words for those most ‘critical’ to comprehending the entirety of the writing’s meaning or context, and then extends that into the foundation of a response or action. Key vectors do the same thing by providing weights of relevance for words in a prompt as related to the target goal (of the prompt) and the other words in the input. The application of weights to the words and their interpretation produces a measure that forces the machine to focus on specific words, or even sections of the prompt, as relevant ‘context’ to be held as attention and long context over the entirety of any response.
The last vector layer in a transformer is the value vector layer, which contains information related to the tokens and is applied after the attention score computation. The value vectors capture semantics, relationships, context and other features of the tokens in the input, representing the data most relevant to comprehending the input and generating a response. Within the transformer, query and key vectors are multiplied to compute attention scores (dot product), scaled (for gradient control and stabilization), and the scores are passed through an activation function (softmax) to obtain weights. The weights are multiplied by the value vectors to obtain the weighted sum (the attending mechanism) before being recombined (via concatenation or averaging) across attention heads to form multi-headed attention. Additive methods are also applied to expose elements like consistency. Scaling factors are applied at multiple locations in the stack to transform the dimensionality of the abstractions for mathematical consistency (see the other math chapters). The multi-headed attention mechanism also casts or projects the original query, key, and value vectors across multiple variant linear dimensions using linear transformations (see Part 1), one per attention head, distributing attention across multiple focus heads to capture complex relationships and dependencies within the input data. This permits AI systems to focus on or attend to different portions of any input or stimulus based on relevance. This is a foundational function of advanced AI systems that is applied in other areas such as AI agency and expert system gating (i.e. representational subdivision and focus). In newer developments the mechanism is applied to deep contextual streams to track co-context flows and cascades, distributed contextual memory and generalization. A minimal end-to-end sketch follows below.
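Putting the steps of this paragraph together (scores, scaling, softmax, weighted sum over values, per-head splitting, concatenation, output projection), here is a minimal numpy sketch of scaled dot-product and multi-head attention; the sizes and random weights are illustrative stand-ins for learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # query-key similarities
    return softmax(scores) @ V                      # weighted sum of values

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Project X into per-head Q/K/V, attend per head, concatenate, project."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v             # (seq_len, d_model) each
    # Split into heads: (n_heads, seq_len, d_head)
    split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    heads = attention(split(Q), split(K), split(V))
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                             # final output projection

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 8, 2, 5
X = rng.normal(size=(seq_len, d_model))
W = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *W, n_heads=n_heads)
print(out.shape)   # (5, 8): one contextualized vector per input token
```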
Word vectors (numeric references in a dimensional vector space) are effectively location representations of indexed words relative to other words (i.e. semantically similar words are grouped together in the vector space or neighborhood). Usually this is done via word statistics like frequency or co-occurrence, i.e. how frequently two words are positioned together, abstracted as a probability of occurrence. In the training text, for example, the words ‘car’ and ‘engine’ would have relative proximity and frequency to each other, as opposed to ‘car’ and ‘artichoke’. Another element of such models is the ability to use ‘linking’ words or tokens to define context, such as ‘car parts’, which would create a neighborhood that includes ‘car’ along with ‘tires’, ‘windows’, ‘headlights’, etc. These key connection words or tokens (including, for example, the ‘s’ in ‘parts’) track relative context between words. In more advanced GPTs, these contexts extend to parts of the input string or prompt, to the output response, and even to successive output responses. This permits longer context to exist over attention domains. Some models track all context words relevant to the input words as word/context vectors. These models are also optimized to match predicted co-occurrence probabilities to actual co-occurrence counts by applying a word/context matrix. Some models focus on a global or world-grounded context as opposed to only specific perceptive frames of reference (this expands contextual understanding and comprehension), while other models focus more locally on the perceptive frame (I have released content online, in books and videos, with greater detail and sample code for these basic concepts). Some even newer versions of these models apply context-balancing metrics and hyperparameters that improve the handling of outliers, excessive co-occurrence, or less relevant but high-occurrence tokens.
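As a sketch of the co-occurrence statistics described above, here is a toy window-based co-occurrence count; the corpus and window size are invented for illustration. Models in the GloVe family fit word/context vectors to counts like these.

```python
from collections import Counter

corpus = [
    "the car engine roared",
    "the car needs new tires",
    "the artichoke was delicious",
]

window = 2          # neighbors within 2 positions count as 'context'
cooc = Counter()
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                cooc[(w, words[j])] += 1

print(cooc[("car", "engine")])     # nonzero: related words co-occur nearby
print(cooc[("car", "artichoke")])  # 0: unrelated words do not
```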
A continuous vector space is a mathematical space where vectors are represented as continuous sets of real numbers. Word vector embeddings within the space have value and meaning such that, in two dimensions, ‘like’ words are grouped together and can be mathematically manipulated (transformed) to infer relevance or position relative to other ‘like’ words in the space or neighborhood (i.e. ‘dog’ and ‘cat’ are numerically and positionally closer than ‘dog’ and ‘jet’). Remember that ‘math’ is just an abstraction of some ‘relationship’ between elements (in this case words), such that a word index and probability of occurrence are adjusted or transformed to group together words that are related to each other. This implies that the position of a word or token is indicative not just of relationship but also of a level of relevance. The addition of more dimensions to the space can layer in even more complex relationships, like the context of groupings of words (e.g. animals). Over successive learning cycles and extensions to include other words in a prompt (stimulus), the system learns the optimal relationships and relevance of words to each other as defined by the occurrence of other words in the prompt. As a result, context can now be held as attention by the system during prediction (response).
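A toy illustration of ‘like words group together’: cosine similarity over invented 2-D embeddings. The numbers are fabricated purely to show that ‘dog’ and ‘cat’ land closer together than ‘dog’ and ‘jet’.

```python
import numpy as np

# Invented 2-D embeddings purely for illustration.
emb = {
    "dog": np.array([0.90, 0.80]),
    "cat": np.array([0.85, 0.75]),
    "jet": np.array([-0.70, 0.90]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, lower when apart."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["dog"], emb["cat"]))   # ~1.0: same neighborhood
print(cosine(emb["dog"], emb["jet"]))   # ~0.07: different neighborhood
```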
After the model has been fully trained, it becomes very good at producing structured responses to query stimuli: the denser the query and the portions used to infer context, the more on-target the response. The design of large language models varies widely; however, most models today are consistent in structure and design and are proving valuable in generative AI, especially given their speed, depth and improving in-context accuracy. Vectors can have any number of dimensions; there is no ‘limit’ to the number except as imposed by resources. Additional dimensions can be used to form more advanced abstractions like deep context or cascading context over time. It is important to note that word vectors are not true traditional mathematical vectors with magnitude and direction. Instead they define coordinates, an abstraction of a point or location in a vector space. The goal is to understand elements like proximity in the space (or perceptive frame of reference) to other elements as a representation of features like relationship and relevance. The next step is to expand these models toward greater contextual comprehension and greater generalization (the ability to respond to wider unknown stimuli, and in a general way when warranted). This level of generalization is contained within the deep abstractions of more advanced AI models.
Note that the ‘math’ under the hood is really manipulating or transforming vectors so the machine can comprehend real-world abstractions. This means that when dimensional vectors (e.g. word vectors) are captured in numeric vectors, arrays and matrices, they are ‘adjusted’ or transformed using vector math and other vectors (weights) to produce a new comprehension or level of abstraction. This is the secret behind transformers. It is also the foundation of more advanced cognitive AI that perceives reality as variance (the ‘ground around a hole’ theory of variant cognitive perception). Words by their very nature represent layers of context. A dog is an animal as well as a companion, a car is a mode of transport as well as an investment, and people can be coworkers, partners, friends, enemies, neighbors, etc. A single word has many different variations of context, and the addition of other words adds to the depth of context. ‘I’m going to look at a painting’ is very different from ‘I’m going to buy a painting’, though both statements reference an action and a painting. The variance of relevance in this example is between ‘buy’ and ‘look at’, and we humans use this contrasting variance for comprehension, added to other context such as the painting itself. As we add words, the context will remain consistent or it will vary, extend, backtrack, jump, etc. Context is fluid because our perception is generally fluid.
The reality is that you can model any perception in this type of structure and then flow the structure to produce increasingly optimized responses (predictions) to stimuli. The next goal, however, is to use these structures to help these systems generalize their input analysis and responses. This provides the AI with the ability to adjust, evolve, and produce output that is closer to human cognition than mere mimicry of an existing LLM’s content. Between today’s AI and higher-level AGI and Superintelligence is a gap that is being closed by using LLMs as a foundation of comprehension and as a layer in other cognitive structures. The most important next step is expanding and deepening contextual comprehension as the stepping stone to artificial general intelligence. The nature of abstractions is limitless, meaning that we could model far more than base words and tokens or images and pixel maps into AI systems. We can abstract any part of reality into training data for an AI, which can then be applied to an unlimited number of stimuli to produce an unlimited response set. We can also get AI systems to do this themselves through elements like self-supervision, self-reward and self-awareness. This, of course, also introduces a wide variety of risks into our world.
Also included in this chapter:
Regularization
Parameters
Distributions
Loss Functions
Gradient Descent
Convolution
Agents
Displacement
Determinants
Linear map (homomorphism)
Bijection
Invertible
Hyperbolic Spaces and Einstein-Rosen Bridges
Holomorphic Vector Bundles (Holomorphic Variables)
Fourier Transforms and Transform Layers
Latent spaces and Feature Spaces
Latent Variables
Manifolds
Accumulation points