Studying Deep Learning Through the Lens of Information Theory
Himanshu S.
Engineering Leader, ML & LLM / Gen AI enthusiast, Senior Engineering Manager @ Cohesity | Ex-Veritas | Ex-DDN
In 1943, Claude Shannon met Alan Turing. Turing suggested that machines could think, and Shannon listened. Later, Shannon built Theseus, a mechanical mouse that learned to escape mazes. It was one of the earliest hints of AI. Decades later, researchers unknowingly followed his path and developed deep learning. His information theory became a foundation of modern AI, including neural networks and transformers.
Shannon introduced information theory in 1948, providing the mathematical tools to measure, transmit and store information efficiently. At the time, he applied it to compression, noise reduction and error correction in transmission, but the theory turned out to be so universal that even the universe seems to harness its principles, using randomness and order to play a game of roulette.
Today, information theory is used to describe the universe as a holographic construct, and it has been applied to explain gravity and the emergence of time. Given its universal scope, how could it not show up in one of the greatest discoveries of our time - deep learning networks? Information theory has deep roots in transformers, and this article maps its principles to several different functions of the transformer.
At its core, information theory is about answering a simple question: How much information is contained in a message, how can it be compressed and how can we transmit it efficiently?
What is information?
This is where things get interesting. Shannon revolutionized the field by introducing probability distributions and entropy into the discussion. The more uniform a probability distribution is, the higher the entropy and the greater the uncertainty about the outcome. In this context, information is the quantity required to reduce that uncertainty. In simple terms, higher entropy means more information is needed to resolve the uncertainty.
Take a coin toss as an example. If heads and tails each have a 50% chance, the uncertainty is at its maximum, so each toss carries the most information. But if the coin is biased and always lands heads, there is no uncertainty and hence no new information.
Shannon defined entropy as a way to measure this uncertainty:
H(X) = -∑ p(x) log p(x)
This function, H(X), measures the uncertainty in the system. If you look closely at the right-hand side, you see a summation over all events, where each event's probability is multiplied by the logarithm of that probability. The negative logarithm gives the "surprise" value for each event: events with higher probability yield lower surprise, and vice versa. Multiplying each event's surprise by its probability weights each outcome by its likelihood; simply adding the unweighted surprise values would not accurately reflect each event's contribution. This weighting yields a balanced measure of the system's overall uncertainty.
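To make this concrete, here is a minimal sketch in Python (NumPy only) that computes Shannon entropy for the fair and biased coins discussed above; the probability values are illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability events contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(entropy([0.5, 0.5]))    # fair coin   -> 1.0 bit (maximum uncertainty)
print(entropy([0.99, 0.01]))  # biased coin -> ~0.08 bits (almost no uncertainty)
print(entropy([1.0, 0.0]))    # always heads -> 0 bits (no new information)
```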
Before we move to the next part of information theory, keep this in mind: more uncertainty means more information, and less uncertainty means less information.
A key extension of information theory is the information bottleneck framework, which builds on Shannon's work by focusing on compression and reconstruction. Shannon showed that information can be compressed by aggregating similarly structured data, retaining its essential variance while removing noise and accepting an approximation. This process of noise removal and approximation is at the heart of the modern AI revolution.
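For reference, the information bottleneck framework (Tishby, Pereira and Bialek) formalizes this trade-off with mutual information: learn a compressed representation T of the input X that stays predictive of the target Y by minimizing

L = I(X; T) - β I(T; Y)

where I(·;·) denotes mutual information and β controls how aggressively the input is compressed versus how much predictive information is kept.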
How do the layers of a transformer model connect with information theory? Let's connect the dots.
First, let's take a quick look at the components of the Transformer architecture. Input tokens are first converted into embeddings (combined with positional encodings). These embeddings are then processed through stacks of encoder and decoder layers. Each encoder layer typically includes multi-head self-attention, feed-forward networks (FFNs), and normalization/residual connections. Similarly, decoder layers incorporate multi-head self-attention, cross-attention over the encoder outputs, and FFNs.
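As a rough sketch (not the exact implementation of any particular model), a single encoder layer can be written in a few lines of PyTorch; the dimensions d_model, n_heads and d_ff below are illustrative defaults, not values from the article.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Multi-head self-attention, followed by residual connection and normalization
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward network, again with residual + normalization
        return self.norm2(x + self.ffn(x))
```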
When computing self-attention, you capture the information, and indirectly the uncertainty, of each token through its interactions with other tokens. This process generates a probability distribution over each token's contribution, indicating how relevant it is likely to be in the given context. Sharpening these distributions reduces uncertainty and lowers entropy, condensing the representation into more meaningful signals. This step can be compared to a form of lossy compression.
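A small NumPy sketch of this idea: each row of the softmaxed attention matrix is a probability distribution, and its entropy tells you how spread out (uncertain) or concentrated (certain) that token's attention is. The score values below are made up for illustration.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy attention scores for 3 tokens attending to each other (hypothetical values)
scores = np.array([[4.0, 1.0, 0.5],
                   [1.0, 3.0, 1.0],
                   [0.2, 0.2, 5.0]])

weights = softmax(scores)                           # each row is a probability distribution
entropy = -(weights * np.log2(weights)).sum(axis=-1)
print(entropy)  # lower entropy = attention concentrated on fewer, more relevant tokens
```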
In the next step, adding the residual connection back to the original embedding matrix helps recover some of the information lost in self-attention's lossy (representational) compression. Layer normalization again squeezes the signal into a fixed range, but this time the compression is lossless: values are rescaled without discarding their relative structure.
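A toy example of that rescaling, with made-up values for one token's hidden vector: layer normalization shifts and scales the features into a standard range while preserving their relative ordering and spacing.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])                 # one token's features (toy values)
normed = (x - x.mean()) / np.sqrt(x.var() + 1e-5)  # zero mean, unit variance
print(normed)  # roughly [-1.34, -0.45, 0.45, 1.34]; relative structure is preserved
```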
Now, if you look at it from an encoder-decoder perspective, you will see information theory at work here as well.
From an entropy perspective, the encoder acts as a compression module. It takes the raw, high-entropy input, loaded with noise and redundancy, and processes it through multi-head self-attention, residual connections and layer normalization. In each encoder layer, self-attention computes query, key and value matrices. The dot products between queries and keys are normalized with softmax to produce a probability distribution over tokens, which weights their relevance and reduces uncertainty by filtering out less informative parts. The feed-forward networks then further compress these representations, lowering the overall entropy while retaining essential features.
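The core computation described above, scaled dot-product attention, fits in a few lines. This is a generic sketch with toy shapes, not the code of any specific library or model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarities, scaled
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)     # softmax -> probability distribution per token
    return weights @ V                            # weighted mix of the value vectors

# Toy example: 4 tokens, 8-dimensional head (shapes are illustrative)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
```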
On the flip side, the decoder functions as a reconstruction module. It takes the compressed, lower-entropy representation from the encoder and, using masked self-attention along with cross-attention over the encoder outputs, rebuilds it into a coherent output sequence. In short, the encoder reduces entropy by compressing information, while the decoder reconstructs and expands it with just the right amount of certainty.
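A minimal sketch of these two decoder attention steps using PyTorch's nn.MultiheadAttention; the tensors and sizes here are placeholders for illustration.

```python
import torch
import torch.nn as nn

d_model, n_heads, T = 512, 8, 6
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

tgt = torch.randn(1, T, d_model)        # decoder input generated so far (random placeholder)
memory = torch.randn(1, 10, d_model)    # compressed encoder output (random placeholder)

# Causal mask: position i may only attend to positions <= i (True = blocked)
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

x, _ = self_attn(tgt, tgt, tgt, attn_mask=causal)  # masked self-attention over the target
y, _ = cross_attn(x, memory, memory)               # cross-attention over encoder outputs
```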
As we end this article, the big picture that emerges is that deep learning isn't just computation; it's structured information flow. Every part of a deep learning network is a channel that processes, compresses and refines data, just as information theory describes.