How to Explain Deep Learning using Chaos and Complexity
Carlos E. Perez
Author of The Deep Learning Playbook, Artificial Intuition, Fluency & Empathy, A Pattern Language for Generative AI, and Long Reasoning AI
I want to talk to you today about Non-Equilibrium Information Dynamics and how an understanding of its features leads us to a better intuition about Deep Learning systems, or learning systems in general.
Just to recap my observation from a previous post, “Deep Learning in Non-Equilibrium Dynamics”: in our study of Deep Learning, practitioners derive their intuition from the mathematics of physical systems. However, since these are not physical systems that we study but rather information systems, we apply information-theoretic principles. Now, information theory also has its origins in the mathematics that describes physics (i.e. Thermodynamics). Both theories are essentially bulk observations of nature. What I mean by bulk is that they are aggregate measures of systems with a large number of interacting particles or entities.
Kieran D. Kelly, whose writing I only recently stumbled upon, has one of the better intuitions out there about non-equilibrium dynamics. His blog is a pleasure to read, and I recommend it highly for anyone interested in this kind of esoteric thing.
Wired posted an article the other day titled “Move Over Coders - Physicists Will Soon Rule Silicon Valley”. Now, we might make the observation that physicists in general have to have a decent IQ to do what they do and thus are able to handle computer science. We can also argue that the mathematics found in Deep Learning isn’t really that advanced compared to what’s found in a typical undergraduate physics curriculum (emphasis on undergraduate). However, there is something else that most people do not understand but that is generally understood by anyone studying physics.
What people can’t seem to comprehend, even among folks with a technical background like computer science and mathematics, is the relationship between math and reality. They don’t recognize that the math we use is just an approximation of reality, and that this math has serious limitations beyond certain dimensions. People doing physics know this because, despite using analytic forms, we are constantly performing hand-waving approximations (e.g. expand a function as a Taylor series and throw out every term beyond the quadratic). So when I write about the limits of math with respect to AI, I get a ton of outrage from math-inclined folk! The ignorance in this world, even among the learned, is really surprising.
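To make the Taylor-series remark concrete, here is a minimal sketch (my own illustrative example, not from the original post) showing how a quadratic truncation of cos(x) behaves: excellent near the expansion point, increasingly wrong away from it.

```python
import numpy as np

# cos(x) and its quadratic Taylor approximation around 0: cos(x) ~ 1 - x**2/2.
# The approximation is excellent near 0 and degrades rapidly away from it,
# which is the trade-off made whenever higher-order terms are thrown out.
x = np.linspace(0.0, np.pi, 7)
exact = np.cos(x)
approx = 1.0 - x**2 / 2.0

for xi, e, a in zip(x, exact, approx):
    print(f"x={xi:4.2f}  cos(x)={e:+.3f}  quadratic={a:+.3f}  error={abs(e - a):.3f}")
```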
Going back to Kelly, he echoes the same sentiment about math and reality:
Physics is, in a sense, a science of linear dynamics, a science of “dynamics without feedback”; such dynamics are indeed easily compressible, but the real world is a world that abounds with feedback, a “nonlinear” world full of “incompressible dynamics”.
For many, this statement may come as a shock, but it really shouldn’t: it is just the basic reality that there are limits to analytic forms. Another thing that seems to confuse people is the use of the words “linear” and “non-linear” by physicists. Most people think of “linear” as meaning a linear equation, and non-linear as anything that is not, so a quadratic equation qualifies as non-linear. What the physicist defines as linear and non-linear, however, is from the point of view of differential equations. Linear differential equations have a chance of being solvable in closed form. In contrast, with non-linear differential equations, almost all bets are off. The classic example is the Navier-Stokes equation for fluids, solvable analytically only up to 2 dimensions. Yes, 2 dimensions; that is an unrealistic flat-land world.
Basically, though, think of non-linear systems as systems that have feedback; in other words, most of our reality. So to understand a bit about our reality, we have to understand a bit about the nature of non-linearity. It turns out that, over the years, two features of feedback systems have been studied: chaos and complexity. Kelly has a whole set of articles about these two subjects, and I’ll redirect you there for an introduction.
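To get a feel for how a single feedback term produces chaos, here is a minimal sketch (my own illustration, not from Kelly’s articles) of the logistic map; the parameter r sets the strength of the feedback.

```python
# Logistic map: x_{n+1} = r * x_n * (1 - x_n), the textbook example of a
# feedback system. As r grows, the long-run behavior goes from a stable
# fixed point, to oscillation, to chaos.

def logistic_tail(r, x0=0.2, burn_in=200, show=5):
    x = x0
    for _ in range(burn_in):            # discard the transient
        x = r * x * (1.0 - x)
    tail = []
    for _ in range(show):               # record the long-run behavior
        x = r * x * (1.0 - x)
        tail.append(round(x, 4))
    return tail

for r in (2.5, 3.2, 3.9):
    print(f"r={r}: {logistic_tail(r)}")
```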
Now what I want to focus on is information systems (not physical systems), so what we are really looking for is chaos and complexity in the context of information systems. (Side note: Deep Learning systems are information systems, despite the poor association with the term Neural Networks.) So here is the very nice table from Kelly:
Source: https://www.kierandkelly.com/what-is-complexity/
Kelly writes:
What drives evolution’s spontaneous and progressive complexity is the interplay of insufficient negative feedback and strong positive feedback; or in other words what drives evolution is The Interplay of Random Innovation and Natural Reinforcement.
Negative feedback here is the natural tendency expressed in the Second Law of Thermodynamics (which really is the law of large numbers): systems tend towards maximum entropy. Positive feedback, however, is a mechanism that can lead to chaos. But in the upper right quadrant we discover emergent complexity. In other words, one has to embrace the existence of mutual feedback as well as randomness. Unfortunately, our mathematical legacy, that of assuming nice independent Gaussian distributions and favoring sparsity (or parsimony) over randomness, demands an unnatural constraint on the system.
An assumption of IID (i.e. Independent and Identically Distributed) features, and an assumption that sparsity is the favored solution, is walking every researcher in an entirely wrong direction! These assumptions are the equivalent of physicists making their equations linear. It is all so that our mathematics becomes convenient. Unfortunately, God did not mandate that reality be conveniently expressed in mathematics. We are pushing our researchers to buy into religion and not reality.
Now, before I completely forget, let me explain how chaos and complexity relate to explaining Deep Learning. Let’s start with randomness, or entropy; I wrote about this in “The Unreasonable Effectiveness of Randomness”. When we study Deep Learning, we simply can’t ignore the presence of randomness. It just seems to be an intrinsic feature of these systems. The simplest intuition I can offer here is that diversity leads to survivability. Monocultures tend toward less adaptability and possible extinction. In fact, the most counter-intuitive notion is that randomness leads to information preservation. An example of this in computer science is “Information Dispersal Algorithms”: you take information and scatter it among different storage nodes, and at massive scale you do it randomly. You basically build storage that is highly redundant. This is the same mechanism as you find in holographic memories. So here we establish the value of high entropy.
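Here is a toy sketch of that idea (random replication with redundancy, not Rabin’s actual Information Dispersal Algorithm, which uses erasure coding); all names and parameters are illustrative.

```python
import random

# Each chunk of the message is copied to several randomly chosen storage
# nodes. With enough redundancy, losing a few nodes rarely loses any chunk.

def disperse(chunks, num_nodes=10, copies=3, seed=0):
    rng = random.Random(seed)
    nodes = [dict() for _ in range(num_nodes)]
    for idx, chunk in enumerate(chunks):
        for node_id in rng.sample(range(num_nodes), copies):
            nodes[node_id][idx] = chunk
    return nodes

def recover(nodes, num_chunks):
    found = {}
    for node in nodes:                  # pool whatever the surviving nodes hold
        found.update(node)
    return "".join(found.get(i, "?") for i in range(num_chunks))

chunks = list("DEEP LEARNING")
nodes = disperse(chunks)
print(recover(nodes[:7], len(chunks)))  # simulate losing 3 of the 10 nodes
```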
Let’s examine the other axis, that of high mutual information, which can lead to unstable feedback and thus chaos. Mutual information is the antithesis of many probabilistic methods; that’s because the math simply can’t handle it. But should we shoehorn reality to fit the math? I think not. One of the better characterizations of how Deep Learning is able to work well in domains of high mutual information is the paper “Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language”:
Source: https://arxiv.org/abs/1606.06737v2
How can we know when machines are bad or good? The old answer is to compute the loss function. The new answer is to also compute the mutual information as a function of separation, which can immediately show how well the model is doing at capturing correlations on different scales.
Deep Learning must be able to learn correlations at multiple scales to be of any use. To phrase it another way, Deep Learning must be able to understand the composition of language, from letters to words, to sentences, and eventually to complete texts. Deep Learning works because it captures this multi-scale structure of language.
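To make that measurement concrete, here is a minimal sketch (an assumed plug-in estimator, not the authors’ own code) of mutual information between characters as a function of their separation; on real natural-language text this curve decays slowly with distance.

```python
import math
from collections import Counter

def mutual_information(text, d):
    """Plug-in estimate of I(X; Y) between characters d positions apart."""
    pairs = [(text[i], text[i + d]) for i in range(len(text) - d)]
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(a for a, _ in pairs)
    py = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi

# Toy corpus (periodic, so correlations persist); substitute a real text
# file to see how correlations behave across scales.
text = "the cat sat on the mat and the dog ate the log " * 200
for d in (1, 2, 4, 8, 16, 32):
    print(f"d={d:2d}  I={mutual_information(text, d):.3f}")
```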
And what exactly is the learning mechanism for this? Jeremy England actually has a very compelling argument as to how life self-organizes; you can read about it at Quanta: “A New Physics Theory of Life”. We can take this idea and use it to explain how learning works in Deep Learning. I’ve written earlier about the 3 Ilities. Explaining “Trainability” is extremely important. A layered DL system builds a representation of language from the lower layers up to the more abstract higher layers. Each layer has its own mutual entanglement that is discovered through training. Over time, the entanglement gets reinforced such that breaking it becomes less likely. So, for example, if the network only sees Latin characters, then it never develops the ability to understand Arabic characters. Layers are also interconnected, so there is a constraint at the bottom (more fundamental concepts) and at the top (minimizing relative entropy). So, eventually, a language hierarchy is built.
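A minimal PyTorch sketch of that picture (purely illustrative; the layer sizes and the character-window setup are my own assumptions, not anything from the post): lower layers see raw characters, successive layers combine them into more abstract features, and the top layer minimizes relative entropy (cross-entropy) against the targets.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, context = 64, 16, 32, 8

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),          # characters -> low-level features
    nn.Flatten(),                                 # concatenate the context window
    nn.Linear(context * embed_dim, hidden_dim),   # combine into "word-like" features
    nn.ReLU(),
    nn.Linear(hidden_dim, hidden_dim),            # combine again into higher abstractions
    nn.ReLU(),
    nn.Linear(hidden_dim, vocab_size),            # predict the next character
)

x = torch.randint(0, vocab_size, (4, context))    # a batch of 4 character windows
y = torch.randint(0, vocab_size, (4,))            # the characters that follow them
loss = nn.CrossEntropyLoss()(model(x), y)         # the "top" constraint: relative entropy
loss.backward()                                   # training reinforces the learned couplings
print(loss.item())
```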
The objection here, though, is that it should take an infinite amount of time to arrive at a proper representation. That’s where the interplay with entropy comes into the picture. The basic theory is not unlike the holographic principle. Randomness begets robustness, while mutual information begets self-organization and compression. What begets generalization? Not sure, but something seems to emerge in the upper right quadrant!
To understand more, either keep reading this blog or head over and talk to us at “Intuition Machine”.