Quick read: Generative AI & Large Language Models (LLM) #4

Part 4: Latent States: Roles and Applications

Former articles in the series:

Part 1: Generative & Discriminative Models: Describe versus Decide

Part 2: Contemporary examples of Generative AI Models and their usage

Part 3: Embeddings: Roles and Applications


Figure 1


Latent states: the hidden representation of the accumulated knowledge


Figure 2: Schematic of the input sentence (X), the generated embedding (H), and the weight matrix (W).

Let's revisit Figure 2 from the previous article. By repeating the process used to produce our embedding layer, the H vector enables us to create further encodings, namely latent layers. These layers, combined with normalization and additional mathematical operations, yield highly non-linear "mixtures." With proper training, the latent layers capture the model's inner learning of the data's essential nature, stripping away noise and unnecessary information. They often have a lower dimensionality than the embedding input, acting as a bottleneck layer. In most cases, this latent-space representation is the true encapsulation of the knowledge accumulated within the model.


Semantic proximity and embeddings powered by the classic Skip-Gram model


The Skip-Gram model predicts context words (words that appear nearby in a sentence) given a target word. It learns to represent words in a high-dimensional vector space based on their semantic context and is used in tasks such as sentence completion or sentence translation. Skip-Gram is a classic algorithm and part of the Word2Vec framework developed by Mikolov et al. Word2Vec is the most popular word-embedding algorithm; it brings out the semantic similarity of words and captures different facets of a word's meaning within a sentence.

During training, the model processes each word of the sentence in turn as the target word and repeatedly generates context words for training.

For example, in the sentence "The cat sat on the mat," if the target word is "cat" and the context window covers one word on each side, the input is "cat" and the outputs are ["The", "sat"]. We prepare the training samples in the form (center word, candidate word, label), with label 1 for a true context word and 0 otherwise; see the example in Figure 3. We then train the appropriate matrices to minimize the overall error.


(Cat, sat, 1), (Cat, The, 1) ← wanted (true context)
vs.
(Cat, on, 0), (Cat, the, 0), (Cat, mat, 0) ← "false" (negative) content

Figure 3: Preparing word pairs with proper labels to train the model to predict the context of the word "cat".
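
For a concrete illustration (a minimal sketch in plain Python, not the author's code), the pair-preparation step of Figure 3 with negative sampling could look like this; the window size and the number of negative samples per center word are illustrative assumptions:

```python
import random

def make_training_pairs(tokens, window=1, num_negatives=3, seed=0):
    """Build (center, candidate, label) pairs: label 1 for true context words,
    label 0 for randomly sampled non-context ("negative") words."""
    rng = random.Random(seed)
    vocab = list(set(tokens))
    pairs = []
    for i, center in enumerate(tokens):
        context = {tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i}
        for word in context:
            pairs.append((center, word, 1))          # wanted (true context) pair
        negatives = [w for w in vocab if w not in context and w != center]
        for word in rng.sample(negatives, min(num_negatives, len(negatives))):
            pairs.append((center, word, 0))          # "false" (negative) pair
    return pairs

sentence = "The cat sat on the mat".split()
for pair in make_training_pairs(sentence):
    print(pair)   # for center "cat" this reproduces the pairs of Figure 3
```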


Key Characteristics of Generative Models:


  1. Multiple Outputs per Input: During training, the model generates several outputs for each input (center word).
  2. Output Structure: Each output is a combination of the center word, a context word, and the associated probability.
  3. Dynamic Pair Generation: For a vocabulary of size N, the model produces roughly N×N input-output pairs, so its dynamic nature ensures training on every possible word pair.
  4. Density Function: A density function built from the trained model performs effectively, since the model already accounts for the probability distribution.
  5. Sentence Completion: For tasks like sentence completion, additional steps are required to discriminate between potential outputs, typically selecting the word with the highest probability (see Part 1: Generative & Discriminative Models: Describe versus Decide).
  6. Computational Complexity: Maintaining approximately N×N probabilities reflects both the model's power and its main weakness: excessive computational demands. This challenge is particularly evident in transformer models.


In reality, multiple matrices must be learned to propagate and generate various outputs (see Figure 4). These matrices are refined iteratively: repeatedly feeding inputs, performing calculations, and adjusting coefficients until the error is minimized.
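
As a minimal sketch (with assumed, illustrative shapes and learning rate, not the article's actual implementation), the two weight matrices below stand in for the matrices of Figure 4; each (center, candidate, label) pair from Figure 3 is fed through a forward pass, and the resulting error is used to adjust both matrices iteratively:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 7, 16                                  # vocabulary size, embedding dimension (illustrative)
W_in = rng.normal(scale=0.1, size=(V, D))     # center-word (input) embedding matrix
W_out = rng.normal(scale=0.1, size=(V, D))    # context-word (output) matrix
lr = 0.05

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center_id, candidate_id, label):
    """One gradient step for a (center, candidate, label) pair (negative-sampling loss)."""
    h = W_in[center_id]                        # hidden/latent representation of the center word
    v = W_out[candidate_id]
    score = sigmoid(h @ v)                     # predicted probability that candidate is context
    error = score - label                      # gradient of the log loss w.r.t. the score
    W_out[candidate_id] -= lr * error * h      # adjust the output matrix
    W_in[center_id] -= lr * error * v          # adjust the input matrix
    return error

# repeatedly feed the labeled pairs until the overall error is minimized
err = train_pair(center_id=1, candidate_id=2, label=1)
```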


Figure 4




Path to Data Authenticity


Figure 5: Path to genuine and effective newly generated data; see text.


The generation of new, genuine data requires meeting several important criteria, as depicted in Figure 5. This data should span all possible states, avoiding gaps (who said sparse?). Achieving this allows us to construct distributions that produce new data truly reflecting the core meaning of the original dataset.

Effective Novelty

Naturally, intrinsic order exists in data. We often observe modes of states and varying degrees of clustering. To evaluate the usefulness of the generated latent space, Effective Novelty provides a straightforward metric.

Definition: Effective Novelty is a quantitative measurement ranging from 0 to 1. It represents the proportion of novel and unique entities generated out of the total existing and generated entities.
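
As a small illustration (my own sketch based on the definition above, with hypothetical entity names), Effective Novelty can be computed directly from the two collections of entities:

```python
def effective_novelty(existing, generated):
    """Effective Novelty: the proportion of novel and unique generated entities
    out of the total existing and generated entities (a value between 0 and 1)."""
    existing_set = set(existing)
    novel_unique = {g for g in generated if g not in existing_set}
    total = len(existing) + len(generated)
    return len(novel_unique) / total if total else 0.0

# Toy example: two of the four generated entities are both new and unique
existing = ["mol_A", "mol_B", "mol_C"]
generated = ["mol_A", "mol_D", "mol_D", "mol_E"]
print(effective_novelty(existing, generated))   # 2 / 7 ≈ 0.29
```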


Posterior Collapse

The Posterior Collapse phenomenon occurs when latent variables become uninformative. In such cases, the generative model disregards the latent space and relies solely on input data to reconstruct the output. This is a common issue in Variational Autoencoders (VAEs).
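
One practical way to detect the problem (a sketch under my own assumptions, not taken from the article) is to monitor the KL term of the VAE objective per latent dimension during training: dimensions whose KL divergence stays close to zero are being ignored by the decoder, which is the signature of posterior collapse.

```python
import numpy as np

def kl_per_dimension(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) for each latent dimension, averaged over the batch.
    Dimensions whose KL stays near zero carry no information: posterior collapse."""
    kl = 0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var)   # shape: (batch, latent_dim)
    return kl.mean(axis=0)

# Toy encoder outputs: the first two latent dimensions are informative, the last two collapsed
mu = np.array([[1.2, -0.8, 0.01, 0.0],
               [0.9,  1.1, -0.02, 0.0]])
log_var = np.array([[-1.0, -0.5, 0.0, 0.0],
                    [-0.8, -0.7, 0.0, 0.0]])
print(kl_per_dimension(mu, log_var))   # large values = informative, ~0 = collapsed
```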




Figure 6: An effective generative model innovates by adding new latent states to the input data (green circles) based on the learned density function (the blue line), which reflects an accurate representation of the "real world."

Effective learning introduces evenly spaced (second row, right) or tightly spaced (second row, left) latent states (yellow plus signs) and removes redundant states when necessary. In contrast, a non-innovative model fails by introducing poorly spaced latent states or by neglecting to add meaningful ones (third row).


Two Conflicting Phenomena:

  1. High Clustering + Effective Novelty: Imagine a library where books are grouped by subject (clusters). Within each subject, there is a wide variety of unique titles, and each book serves a specific purpose.
  2. Mode Collapse: Imagine a library with mostly empty shelves. Of the few available books, many are duplicates of the same group, representing a lack of variety.


Figure 6 illustrates the conflicting modes. Table 1 summarizes the differences between these behaviors across various characteristics.

Clearly, High Clustering + Effective Novelty is preferable to Mode Collapse for achieving meaningful and diverse data generation.


Table 1: Adequate generative latent space


Usage Example 1: Leveraging a Self-Organizing Map (SOM) combined with a Convolutional Autoencoder to characterize current and future users by persona and behavioral traits.


Figure 7(a): A trained Self-Organizing Map (SOM) identified clusters, including one representing patients at high risk of quitting (low retention). The density of states among existing patients (the numerical data) enabled us to generate synthetic patient profiles on demand and to predict in advance how future users might behave.


Figure 7(b): An example of the reward system we implemented, based on patient personas as well as their past and predicted future activities. This system was designed to reward and motivate users toward desired health indicators, ultimately achieving sustainable patient reconditioning and health benefits.


We aimed to characterize different user groups, identifying individuals at medium to high risk of developing diabetes in the coming years. Since these users carried their mobile phones, we could measure their physical activity levels. This data served as the input for a convolutional autoencoder.

We utilized transfer learning by initializing convolutional filters using the classic high-pass, low-pass, and band-pass filter families, rather than starting with random initial values. The bottleneck layers consisted of several deep latent states. These latent states were subsequently mapped to a secondary network, a semi-continuous neural network known as a Self-Organizing Map (SOM). After further training using appropriate metrics (such as the U-matrix) and information propagation techniques, we successfully identified clusters (Figure 7(a)). These clusters were interpreted using phenotypic data, including user age, gender, fasting blood glucose levels, and Glycosylated Hemoglobin (Hemoglobin A1c) test results.
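
For illustration only (not the original system), here is a minimal sketch of the second stage using the open-source MiniSom library; the latent vectors, map size, and training parameters are assumptions standing in for the real pipeline:

```python
import numpy as np
from minisom import MiniSom   # pip install minisom

# latent_vectors: bottleneck activations from the convolutional autoencoder
# (random stand-ins here, with an assumed latent dimension of 16)
rng = np.random.default_rng(0)
latent_vectors = rng.normal(size=(500, 16))

som = MiniSom(x=10, y=10, input_len=16, sigma=1.5, learning_rate=0.5, random_seed=0)
som.random_weights_init(latent_vectors)
som.train_random(latent_vectors, num_iteration=5000)

u_matrix = som.distance_map()                              # U-matrix: high values mark cluster boundaries
winning_nodes = [som.winner(v) for v in latent_vectors]    # map each patient's latent vector to a SOM cell
```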

Our work achieved the following:

  1. Tracking patient behavior over time: Monitoring responsiveness, consistency, and engagement.
  2. Classifying patients by persona type: Grouping users based on shared traits and characteristics.
  3. Predicting patient retention risk: Identifying individuals at risk of disengagement.
  4. Implementing preventive measures: Designing interventions for patients at risk of low retention.
  5. Personalizing responses and rewards: Using reinforcement learning to motivate patients toward improved health outcomes, achieving sustainable patient reconditioning (Figure 7(b)).
  6. Tracking behavioral changes over time: Enabling personal alerts or physician intervention when necessary.
  7. Generating plausible profiles of future users: Simulating new personas with varying traits for future analysis.


It is important to note that all of the above was achieved before the advent of the LLM (Large Language Model) revolution. Although generative models existed, they were far less mature and recognized than today. Overcoming various constraints required creativity, leveraging numerous techniques and combinations of generative models, deep learning, machine learning, reinforcement learning, and classic signal processing methods.


Usage Example 2: Using a generative model to create virtual molecules


Figure 8: MolMIM latent-space perturbations around a seed molecule; see text.



Enter MolMIM, a novel probabilistic autoencoder model designed for generating small molecules with desired pharmacokinetic (PK) and pharmacodynamic (PD) properties. MolMIM leverages Mutual Information Machine (MIM) learning to create a clustered latent space, enabling efficient sampling of valid, unique, and novel molecules. The model outperforms existing methods in both single- and multi-objective property optimization tasks, such as balancing solubility, permeability, and receptor binding affinity, using a simple evolutionary search algorithm. Its success is attributed to the inherent structure of its learned latent space, which naturally clusters molecules with similar PK/PD profiles.

Figure 8 illustrates this process: small perturbations over the initial molecule (8a & 8b) result in minor modifications to molecular structure, while larger perturbations sample from more distant regions of the latent distribution, yielding significantly altered molecules (8c & 8d). Structural differences are highlighted in red, while similarities are marked in green (see 8e).
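
To make the latent-perturbation idea concrete, here is a schematic sketch (my own illustration; the encode/decode calls are hypothetical placeholders, not the actual MolMIM API): adding small Gaussian noise to a molecule's latent vector produces close analogues, while larger noise samples more distant regions of the latent distribution and yields more heavily modified molecules.

```python
import numpy as np

def perturb_latent(z, scale, n_samples=5, seed=0):
    """Sample new latent points around z; a larger `scale` explores more distant
    regions of the latent space and yields more heavily modified molecules."""
    rng = np.random.default_rng(seed)
    return z + rng.normal(scale=scale, size=(n_samples, z.shape[-1]))

# Hypothetical placeholders for the model's encoder/decoder (not the real MolMIM API):
# z_seed = encode(seed_smiles)                              # latent vector of the starting molecule
# analogues = decode(perturb_latent(z_seed, scale=0.1))     # minor modifications (Fig. 8a & 8b)
# variants  = decode(perturb_latent(z_seed, scale=1.0))     # significantly altered molecules (Fig. 8c & 8d)
z_seed = np.zeros(64)                                       # illustrative 64-dimensional latent vector
print(perturb_latent(z_seed, scale=0.1).shape)              # (5, 64)
```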

Having a generative tool like MolMIM provides an expedited path to experimental assays, facilitating the design of lead compounds with optimized pharmacokinetic (PK) and pharmacodynamic (PD) profiles, thereby accelerating drug discovery and development.

Word of caution: Are we certain that we have achieved only the desired qualities? This, of course, requires thorough verification—an aspect I will address in a future article.


This article explores generative AI models, focusing on latent states as crucial components for capturing accumulated knowledge. It explains how these models, like Skip-Gram, learn semantic relationships and generate new data by creating a latent-space representation. Key concepts such as effective novelty and posterior collapse are discussed, and practical applications are illustrated through examples of predicting user behavior with Self-Organizing Maps and generating novel molecules with MolMIM, emphasizing the potential and limitations of generative models in various fields.


Stay tuned for the next article: Transformer - High Level View

