UMAP: If Geometry Could Talk

In a recent thought experiment, I was thinking about various frameworks of symmetry and asymmetry and how abstract states within a system give rise to symmetry. In this context, I aim to map primarily UMAP, and to some extent t-SNE, onto these frameworks, examining them through the lenses of symmetry, geometry, analogy, and mathematics to better connect their underlying principles.

UMAP (paper reference: arXiv:1802.03426) can be viewed as using non-abstract, structured symmetry: it maintains matching graph structures between the high- and low-dimensional spaces. This approach preserves both local and global data relationships, making UMAP more effective for practical applications that require a balanced representation of the data.

In contrast, t-SNE utilizes an abstract symmetry approach by aligning probability distributions that emphasize local similarities, often at the expense of global structure preservation. While both techniques adjust point positions in space to reflect data relationships, their underlying methodologies differ significantly in how they handle the balance between local and global structure.

While t-SNE focuses on probabilistic similarities to project higher-dimensional symmetry into lower dimensions, UMAP employs a more direct approach by constructing and preserving graph-like structures in both higher and lower dimensions.

You can imagine the data as a ball that can be squeezed or expanded in both the high and low dimensions. The squeezing here is an optimization process on the graph structure, rather than the literal squeezing of a ball.

Why are graphs important, and how do they make this process better than t-SNE?

Graphs have nodes, which represent data points, and edges that represent connections or relationships between these points. These nodes are interconnected in higher dimensions. When projecting the data into lower dimensions, the relationships shift to adjust for additional context. For practical purposes, you can imagine these points as being randomly placed initially. Since the connections between points in the higher-dimensional space are known, UMAP retains these pairwise relationships and their strengths in the lower-dimensional space.

During the optimization process, points in the lower dimensions are iteratively moved to better reflect their original structure in the higher dimensions.

You can think of it another way: in high-dimensional space, connections between points can exist whether they are close or far apart, depending on the strength of their association. In t-SNE, these connections exist in the higher dimension, but in the lower dimension some points may drift apart. While some connections are maintained through the heavy-tailed distribution, others may be lost. As a result, local structures are preserved, but global structures become distorted.

In contrast, UMAP’s graph-based approach ensures that both global and local structures are preserved in the lower dimension.

Enough of the description, what about the math?

This time, I will index each step the old-fashioned way, to see if that makes it more readable.

1. Construct the graph in the high dimension

The first step is to calculate the Euclidean distance between all pairs of points in the space, using the formula below. You can reference this in most of the articles I have written.
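For two points i and j, with k indexing the features, that distance is:

$$ \mathrm{dist}(i,j) = \sqrt{\sum_{k} \big(x_{ik} - x_{jk}\big)^2} $$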

Once you have calculated the distances, you want to find the nearest neighbors of every point in the space, as governed by the Euclidean distances between points (k-nearest neighbors).
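As a minimal sketch of this step (using scikit-learn's exact brute-force search; the real UMAP implementation uses an approximate nearest-neighbor search for speed, so this is just an illustration on a random stand-in dataset):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(100, 10)  # stand-in dataset: 100 points, 10 features
k = 15                       # neighborhood size, UMAP's key hyperparameter

# Build a kNN index on the Euclidean metric and query every point.
nn = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(X)
knn_dists, knn_indices = nn.kneighbors(X)  # both of shape (100, k)

# Note: each point comes back as its own first neighbor at distance 0,
# so the smallest *nonzero* distance per row plays the role of rho_i below.
```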

It’s fine to know the Euclidean distance between points, but that doesn’t fully explain the strength of the connection between them. A simple assumption might be that closer points have stronger connections, but this isn’t always true because strength is a relative phenomenon.

To understand the formula below, imagine all the points in a 2-dimensional space. Instead of interpreting absolute distances, you treat distances as relative by scaling them based on the smallest distance. This gives a sense of the relative connection strength. Additionally, normalizing these strengths to a range (e.g., 0 to 1) ensures the results are easier to interpret and consistent.

And the final factor is the ability to adjust the embedding to grow or shrink dense and sparse regions. Dense regions in the high-dimensional space are expanded in the lower-dimensional embedding to prevent overcrowding and preserve local relationships, while sparse regions may be contracted to maintain a balanced representation of the data’s structure.
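Putting this together, the directed connection strength from a point i to a neighbor j, as defined in the UMAP paper (arXiv:1802.03426), is:

$$ v_{j|i} = \exp\!\left( -\,\frac{\max\!\big(0,\ \mathrm{dist}(i,j) - \rho_i\big)}{\sigma_i} \right) $$

The max(0, ·) simply clamps the relative distance at zero, so each point’s nearest neighbor always receives a strength of 1.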

Now that we understand the intuition behind it, let’s examine each term in the math. dist(i,j) is the Euclidean distance between two points, and ρi, the smallest Euclidean distance from i to any of its neighbors, is subtracted from it to obtain the relative distance, which sets the strength between points.

However, certain regions might be too sparse or too dense, leading to situations where connections are disproportionately weak or strong. This could distort the true contextual relationships, since the connection strength may reflect the local density of points rather than their actual associations.

If we could dynamically determine a value to normalize the connections, making dense regions a bit sparser and sparse regions a bit denser, it would improve the balance. This is achieved through σ, a dynamic controlling factor that adapts based on the density of the region.
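In the paper, σi is found by a binary search so that each point’s total outgoing strength matches a fixed target tied to the neighborhood size k:

$$ \sum_{j=1}^{k} \exp\!\left( -\,\frac{\max\!\big(0,\ \mathrm{dist}(i,j) - \rho_i\big)}{\sigma_i} \right) = \log_2 k $$

Dense neighborhoods (many small distances) end up with a small σi that weakens their far connections, while sparse neighborhoods get a large σi that boosts theirs.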

The last aspect of the function is the exponential of the negated term. Since the relative distance is never negative, the exponential of its negation is bounded between 0 and 1, which makes the strengths easy to interpret: the nearest neighbor gets exp(0) = 1, and the value decays toward 0 as points move relatively farther apart.

As the points get relatively closer, the strength increases, and it decreases as they become sparser. Finally, we calculate the strengths between all the points in a region, effectively constructing a fuzzy graph.

The next step involves combining the two directed strengths into a single weighted connection between each pair of points, using the formula:
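$$ v_{ij} = v_{j|i} + v_{i|j} - v_{j|i}\, v_{i|j} $$

(using the directed strengths defined above; in fuzzy-set terms, this is the probabilistic union of the two one-way connections)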

As we saw in t-SNE, in UMAP the canvas is different for each point, so the relative distances from i to j and from j to i will differ, based on the reference distance ρ and the dynamic controlling factor σ (which adjusts the canvas size). These terms are summed, but summing them directly could make the output larger than 1, and we still want to keep it within the range of 0 to 1.

To achieve this, we subtract the product of both terms from their sum. This is interesting because, in the general case, we might simply have taken the mean (dividing by 2) to keep the value within 1. However, doing so would not preserve the relative strength of each term. The subtraction of the product instead introduces a non-linear adjustment, dynamically balancing the weight based on the relative strengths of the terms. This is more adaptive than simply taking the mean.
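As a quick worked example, take directed strengths of 0.9 and 0.2: the mean would give 0.55, while 0.9 + 0.2 − (0.9)(0.2) = 0.92, so a single strong directed connection keeps the pair strongly connected. The sum-minus-product form also stays within 0 to 1 whenever both inputs do.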

Now imagine the canvases shrinking and expanding dynamically at a specific rate of change. Could we adjust for that rate of change as well? This is an experimental idea, a figment of my imagination: if both sides of the canvas change at a specific rate, the expression above could be extended with a rate-of-change term for each canvas.

I don’t want to complicate any further, so let’s move to the lower dimension math.

2. Initialize the points in the low dimension

Initially, the points are placed in the low-dimensional space (random placement is the simplest way to start), and the weights of the connections between them are defined using the following equation.
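$$ w_{ij} = \frac{1}{1 + a\, d_{ij}^{\,2b}} $$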

Here, dij is the Euclidean distance between two points in the low-dimensional space. There are two scaling factors, a and b. Think of a as a starting factor that diminishes or enhances the weight, and b as controlling the exponential scaling. Since the expression is an inverse, the larger the distance between points, the faster the connection strength drops. Both a and b are derived from hyperparameters (in the reference implementation, they are fitted from the min_dist and spread settings). This is a generic setup to begin with, but we will gradually bring a lot more clarity to it.
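A minimal sketch of this weight, with a = b = 1 as illustrative placeholders (the library fits the real values, so the numbers here are not UMAP’s defaults):

```python
import numpy as np

def low_dim_weight(yi, yj, a=1.0, b=1.0):
    """UMAP-style connection strength between two embedded points.

    With a = b = 1 this reduces to 1 / (1 + d^2), a Student-t-like curve.
    """
    d = np.linalg.norm(yi - yj)            # Euclidean distance in the embedding
    return 1.0 / (1.0 + a * d ** (2 * b))  # inverse form: decays as d grows

print(low_dim_weight(np.array([0.0, 0.0]), np.array([1.0, 1.0])))  # ~0.333
```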

3. Evolve the graph in the low dimension

We know the real weights between points in the higher dimension, and now we have the weights for similar connections in the lower dimension. If we can iteratively update the weights in the lower dimension to bring them closer to the high-dimensional weights, we can preserve both the local cluster structure and the global structure.

For this, we rely on the cross-entropy loss function and gradient descent, concepts borrowed from neural networks. Since we aim to adjust weights, the cross-entropy function here should depend on the weights.

I will not go deeper into this, as you can reference the cross-entropy function in another article I have written. It measures the loss of information in the weights of the lower dimension compared to the reference weights in the higher dimension for the entire set of points.
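For reference, the fuzzy cross-entropy from the UMAP paper, summed over all pairs of points, looks like this:

$$ C = \sum_{i \ne j} \left[ v_{ij} \log\frac{v_{ij}}{w_{ij}} + \big(1 - v_{ij}\big) \log\frac{1 - v_{ij}}{1 - w_{ij}} \right] $$

The first term acts as an attractive force between strongly connected points, and the second as a repulsive force between weakly connected ones.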

Similar to what we do in neural networks, once we have the loss, we calculate how it changes with respect to the point positions that determine the weights. Using the chain rule, we compute the gradient of the loss for a specific point in space. These gradients form the basis of gradient descent: the first step is calculating the gradient, and the second is applying the descent update.

Points are updated, the weights of the entire system in the low dimension are recalculated, and this back-and-forth continues until most of the loss is minimized. The resulting low-dimensional structure preserves both local and global relationships.
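To make the loop concrete, here is a toy, full-gradient sketch of one descent step, assuming a = b = 1 so that w = 1 / (1 + d²). The real umap-learn optimizer instead uses stochastic updates with negative sampling, so treat this only as an illustration of the math above:

```python
import numpy as np

def umap_toy_step(Y, V, lr=0.1, eps=1e-4):
    """One full-gradient descent step on the fuzzy cross-entropy.

    Y: (n, 2) low-dimensional positions; V: (n, n) symmetric high-dim strengths.
    Assumes a = b = 1, so w_ij = 1 / (1 + ||y_i - y_j||^2).
    """
    n = Y.shape[0]
    grad = np.zeros_like(Y)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            diff = Y[i] - Y[j]
            w = 1.0 / (1.0 + diff @ diff)
            # dC/dy_i = [2*v*w - 2*w^2*(1 - v)/(1 - w)] * (y_i - y_j):
            # the first piece attracts connected points, the second repels.
            coeff = 2.0 * V[i, j] * w - 2.0 * (1.0 - V[i, j]) * w**2 / (1.0 - w + eps)
            grad[i] += coeff * diff
    return Y - lr * grad

# One step on a tiny example: three points, the first two strongly connected.
Y = np.random.randn(3, 2)
V = np.array([[0.0, 0.9, 0.1], [0.9, 0.0, 0.1], [0.1, 0.1, 0.0]])
Y = umap_toy_step(Y, V)
```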

The verdict is that UMAP is a much better tool than t-SNE for visualization. One caveat: the higher the number of features, the more computationally expensive it becomes to construct the graphs. So for datasets with very high dimensionality, it makes sense to first reduce the feature count with PCA, retaining only the leading principal components, before applying UMAP.
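A minimal sketch of that pipeline with scikit-learn and the umap-learn package (the dataset and the component counts here are stand-ins):

```python
import numpy as np
from sklearn.decomposition import PCA
import umap  # pip install umap-learn

X = np.random.rand(1000, 500)  # stand-in for a very wide dataset

# Cheap linear reduction first, then the graph-based embedding.
X_reduced = PCA(n_components=50).fit_transform(X)
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(X_reduced)
print(embedding.shape)  # (1000, 2)
```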

As I end this article, keeping Richard Feynman’s thinking alive from the last article, the name of the bird is “Uniform Manifold Approximation and Projection”.
