Studying Deep Learning Through the Lens of Information Theory
Himanshu S.
Engineering Leader, ML & LLM / Gen AI enthusiast, Senior Engineering Manager @ Cohesity | Ex-Veritas | Ex-DDN
In 1943, Claude Shannon met Alan Turing. Turing suggested that machines could think, and Shannon listened. Later, Shannon built Theseus, a mechanical mouse that learned to escape mazes. It was one of the earliest hints of AI. Decades later, researchers unknowingly followed his path and developed deep learning. His information theory became a foundation of modern AI, including neural networks and transformers.
Shannon introduced information theory in 1948, providing the mathematical tools to measure, transmit and store information efficiently. At the time, he applied it to compression, noise reduction and error correction in transmission, but the theory turned out to be so universal that even the universe seems to harness its principles, using randomness and order to play a game of roulette.
Today, information theory is used to describe the universe as a holographic construct, and it has been applied to explain gravity and the emergence of time. Given its universal scope, how could it not show up in one of the greatest discoveries of our time - deep learning networks? Information theory has deep roots in transformers, and this article maps its principles to several different functions of the transformer.
At its core, information theory is about answering a simple question: How much information is contained in a message, how can it be compressed and how can we transmit it efficiently?
What is information?
This is where things get interesting. Shannon revolutionized the field by introducing probability distributions and entropy into the discussion. The more uniform a probability distribution is, the higher the entropy and the greater the uncertainty about the outcome. In this context, information is the quantity required to reduce that uncertainty. In simple terms, higher entropy means more information is needed to resolve the uncertainty.
Take a coin toss as an example. If heads and tails each have a 50% chance, the uncertainty is at its maximum, so each toss carries the most information. But if the coin is biased and always lands heads, there is no uncertainty and hence no new information.
Shannon defined entropy as a way to measure this uncertainty:
H(X) = -∑ p(x) log p(x)
This function, H(X), measures the uncertainty in the system. If you look closely at the right-hand side, you see a summation over all events, where each event's probability is multiplied by the logarithm of that probability. The negative logarithm gives the "surprise" value for each event: events with higher probability yield lower surprise, and vice versa. Multiplying each event's surprise by its probability weights each outcome by its likelihood; simply adding the unweighted surprise values would not accurately reflect each event's contribution. This weighting yields a balanced measure of the system's overall uncertainty.
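To make this concrete, here is a minimal sketch in Python (NumPy only) that computes Shannon entropy for the fair and biased coins discussed above; the probability values are illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability events contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(entropy([0.5, 0.5]))    # fair coin   -> 1.0 bit (maximum uncertainty)
print(entropy([0.99, 0.01]))  # biased coin -> ~0.08 bits (almost no uncertainty)
print(entropy([1.0, 0.0]))    # always heads -> 0 bits (no new information)
```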
Before we move to the next part of information theory, keep this in mind: more uncertainty means more information, and less uncertainty means less information.
A key extension of information theory is the information bottleneck framework, which builds on Shannon's work by focusing on compression and reconstruction. Shannon showed that information can be compressed by aggregating similarly structured data, retaining its essential variance while removing noise and accepting an approximation. This process of noise removal and approximation is at the heart of the modern AI revolution.
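For reference, the information bottleneck framework (Tishby, Pereira and Bialek) formalizes this trade-off with mutual information: learn a compressed representation T of the input X that stays predictive of the target Y by minimizing

L = I(X; T) - β I(T; Y)

where I(·;·) denotes mutual information and β controls how aggressively the input is compressed versus how much predictive information is kept.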
How do the layers of a transformer model connect with information theory? Let's connect the dots.
First, let's take a quick look at the components of the Transformer architecture. Input tokens are first converted into embeddings (combined with positional encodings). These embeddings are then processed through stacks of encoder and decoder layers. Each encoder layer typically includes multi-head self-attention, feed-forward networks (FFNs), and normalization/residual connections. Similarly, decoder layers incorporate multi-head self-attention, cross-attention over the encoder outputs, and FFNs.
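As a rough sketch (not the exact implementation of any particular model), a single encoder layer can be written in a few lines of PyTorch; the dimensions d_model, n_heads and d_ff below are illustrative defaults, not values from the article.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Multi-head self-attention, followed by residual connection and normalization
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward network, again with residual + normalization
        return self.norm2(x + self.ffn(x))
```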
When computing self-attention, you capture the information, and indirectly the uncertainty, of each token through its interactions with other tokens. This process generates a probability distribution over each token's contribution, indicating how relevant it is likely to be in the given context. Sharpening these distributions reduces uncertainty and lowers entropy, condensing the representation into more meaningful signals. This step can be compared to a form of lossy compression.
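A small NumPy sketch of this idea: each row of the softmaxed attention matrix is a probability distribution, and its entropy tells you how spread out (uncertain) or concentrated (certain) that token's attention is. The score values below are made up for illustration.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy attention scores for 3 tokens attending to each other (hypothetical values)
scores = np.array([[4.0, 1.0, 0.5],
                   [1.0, 3.0, 1.0],
                   [0.2, 0.2, 5.0]])

weights = softmax(scores)                           # each row is a probability distribution
entropy = -(weights * np.log2(weights)).sum(axis=-1)
print(entropy)  # lower entropy = attention concentrated on fewer, more relevant tokens
```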
In the next step, adding the residual connection back to the original embedding matrix helps recover some of the information lost in self-attention's lossy (representational) compression. Layer normalization again squeezes the signal into a fixed range, but this time the compression is lossless: values are rescaled without discarding their relative structure.
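A toy example of that rescaling, with made-up values for one token's hidden vector: layer normalization shifts and scales the features into a standard range while preserving their relative ordering and spacing.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])                 # one token's features (toy values)
normed = (x - x.mean()) / np.sqrt(x.var() + 1e-5)  # zero mean, unit variance
print(normed)  # roughly [-1.34, -0.45, 0.45, 1.34]; relative structure is preserved
```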
Now, if you look at it from an encoder-decoder perspective, you will see information theory at work here as well.
From an entropy perspective, the encoder acts as a compression module. It takes the raw, high-entropy input, loaded with noise and redundancy, and processes it through multi-head self-attention, residual connections and layer normalization. In each encoder layer, self-attention computes query, key and value matrices. The dot products between queries and keys are normalized with softmax to produce a probability distribution over tokens, which weights their relevance and reduces uncertainty by filtering out less informative parts. The feed-forward networks then further compress these representations, lowering the overall entropy while retaining essential features.
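The core computation described above, scaled dot-product attention, fits in a few lines. This is a generic sketch with toy shapes, not the code of any specific library or model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarities, scaled
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)     # softmax -> probability distribution per token
    return weights @ V                            # weighted mix of the value vectors

# Toy example: 4 tokens, 8-dimensional head (shapes are illustrative)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
```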
On the flip side, the decoder functions as a reconstruction module. It takes the compressed, lower-entropy representation from the encoder and, using masked self-attention along with cross-attention over the encoder outputs, rebuilds it into a coherent output sequence. In short, the encoder reduces entropy by compressing information, while the decoder reconstructs and expands it with just the right amount of certainty.
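A minimal sketch of these two decoder attention steps using PyTorch's nn.MultiheadAttention; the tensors and sizes here are placeholders for illustration.

```python
import torch
import torch.nn as nn

d_model, n_heads, T = 512, 8, 6
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

tgt = torch.randn(1, T, d_model)        # decoder input generated so far (random placeholder)
memory = torch.randn(1, 10, d_model)    # compressed encoder output (random placeholder)

# Causal mask: position i may only attend to positions <= i (True = blocked)
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

x, _ = self_attn(tgt, tgt, tgt, attn_mask=causal)  # masked self-attention over the target
y, _ = cross_attn(x, memory, memory)               # cross-attention over encoder outputs
```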
As we end this article, the big picture that emerges is that deep learning isn't just computation; it's structured information flow. Every part of a deep learning network is a channel that processes, compresses and refines data, just as information theory describes.