Who we are: Answers From Data Science
Could it be that our brains and our societies behave in the same way? If so, can we characterize that behavior with a fundamental and elegant mathematical formula? If behavior can be understood in these terms, then marketing, fraud detection, and resource management could be built on rational, empirical, and testable foundations. Behavior has interested us for as far back as we have evidence, and probably long before that. Philosophers, theologians, and economists have speculated on the root causes of human behavior, but it wasn't until the middle of the twentieth century that we had the mathematical tools, like expectation values and information theory, to describe human behavior in terms of the basic laws of physics.
The older models have withstood the test of time and deserve serious consideration, so any new theory should accord, in large part, with established theories while surpassing them in simplicity, consistency, accuracy, and predictive power. When we harmonize behavior with the laws of physics, we simplify our description of the world by using one set of laws instead of many.
Some might object that applying a physical principle, such as energy minimization, to behavior abrogates free will, but this is not so. Individuals are still free to perceive any system's constraints differently. What we call free will should correlate with differences in perceived constraints. A rock would fall to the valley floor but for the mountain holding it up. One person may perceive no way to alter the situation while another may perceive that a lever is at hand. Whether the rock remains in place will depend on people's perception of the constraints of reality and utility. In any case, the entire system can be described in terms of energy minimization with constraints. Free will, then, should not stand as a barrier to our describing a model of human behavior with a few well-chosen, simple physical rules.
Traditionally, many economic models begin with a few simple assertions that seem reasonable but sweep difficult questions under the rug for the sake of “practical simplicity.” Let's illustrate a typical economic, or game-theoretic, model and its typical assertions.
The y-axis might be time or some other variable and the x-axis might be price. The seller A would like the price of his goods to be as high as possible, while the buyer B would like to minimize his capital outlay; by utility theory they will, all else being equal, settle on a price C in the middle. This model appeals to common sense but leaves many questions unanswered. What is Utility, much less Utility Theory? Why is the line straight? Why will the parties meet in the middle? What are the other things that are equal? How does B know that A wants to sell in the first place? How can this model be measured? To which the reply is often something like, “This is only a model and it cannot be proved, but it can be used to predict the outcome of a negotiation ...”
In contrast, a model based on the physical sciences might say that when B exchanges a dollar with A in this transaction, or through a series of equivalent transactions, he is buying enough energy to perform some measurable amount of work. So, for instance, B has a choice of buying cake or gasoline, and then pushing or driving his car to work. When the physical and economic worlds are united under one law, any economic theory becomes testable within the limits of the theory and equipment of the day. And as an aside, it is clear that food is more valuable than energy, which we knew at the beginning of the industrial revolution but forgot when we learned that we could make biofuels.
But there is a problem – the minds of A and B are unknown, so how does B know that A will sell, and how will they communicate? Happily, physics has encountered such problems before and developed tools to deal with them – statistical mechanics. Just as Ludwig Boltzmann could not measure the individual speeds of atoms but could describe their motion in a statistical fashion, the minds of A and B cannot be known, but their actions can be described in a statistical fashion. The rule is to assign equal, or random, probabilities to the entities when nothing else is known, i.e. if there is no constraint (or equivalently no constraint is known), then just assign equal probabilities to the variables of the equation. Thus, in the case above, the reason C is in the middle is that it is the average of many random possible outcomes. It might not be true in any particular case, but on average it will be true. And just as importantly, it will be true with very limited knowledge.
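Here is a minimal Python sketch of that rule in action, with made-up bounds standing in for whatever limits A and B actually have in mind:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bounds: suppose A will not sell below 10 and B will not pay above 20.
low, high = 10.0, 20.0

# Knowing nothing else, assign equal probability to every price in the range.
prices = rng.uniform(low, high, size=100_000)

# Any single negotiation can land anywhere in the range, but on average
# the price sits in the middle - the "C in the middle" of the earlier model.
print(prices.mean())   # ~15
```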
When something is true on average, we say that it is the expected value. Statisticians and natural scientists call this the expectation value, and we write it like this:
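<f(x)> = ∑_x p(x) · f(x)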
This is just a fancy way of saying take the average value of f(x). An expectation value doesn't have to be an average, but it often is. Even when the expectation value is not itself an average, it usually contains some term that is an average. We usually find that the researcher has found some stable pattern, like a bell curve, around an average, or an exponential function where the exponent contains the average. And the inventor of the equation will often throw in a constant or two to make things fit between whatever scale nature is using and what we humans have devised. Instead of being thrown off, we should take comfort when we see that an expectation value contains an average and a constant or two. Moreover, we can guess that the form of the equation will commonly be one of a half dozen repeating forms: a straight line, a sinusoidal wave function, an exponentially growing or decaying function, an "s" curve (or tanh) function, a bell curve, or a gamma function that grows rapidly on one side and then tails off far more gradually on the other.
One very common expectation value is the one where entropy is at a maximum. Entropy maximization is energy minimization's other face. Don't worry, we'll explain how later; just take it on faith for the moment, because we have a really important point to cover before we get deeper into the math. We want to prove, or at least show, that people really do behave in accord with the principle of energy minimization.
Which leads us now to our maximum entropy diagram. If income is distributed in the US according to a maximum entropy equation,
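p(x) = (1/T) · e^(-x/T)

(this is the exponential form we will derive later in this article),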
then we have a statistical basis, and not just a common-sense reason, for saying that human behavior is based on the principle of energy minimization. The value 'T' is the average temperature in statistical mechanics, but in a moment we will use it as the average income. Shortly this paper will give a more detailed explanation of how we got this equation and work at developing an intuition for the math, so we will be prepared to venture on our own into the world of statistical analysis.
If it turns out that income obeys entropy maximization (i.e. people are energy minimizers), this observation is very important, for we can make some bold statements with confidence, like these:
- When a bank account's activity uses more than the minimum amount of energy, something suspicious is happening.
- In any transaction where one party is expending a great deal of energy there must be some solution where that party is spending the minimum amount of energy given some, as yet unknown, constraint.
- When we find a constraint that causes the entropy distribution to fit the behavior, we can expect that explanation to be most likely.
Enough talk, show us the graph!

[Figure: Entropy fit vs. the 1996 US income distribution, from US Census data.]
But results without understanding are meaningless; how can one make sense of the math? Empirical equations tell you what the data looks like, while derived equations tell you why the data looks like that. Both approaches have merit. In this case you could inspect the data and immediately grasp that there was an exponential decay function at work. And knowing that averages and constants are likely to show up, you could have listed half a dozen forms of potential expectation equations, like the maximum entropy equation shown earlier. With a little trial and error anyone could have come up with that equation, even knowing nothing about entropy.
There is a really great website, https://www.betterexplained.com, which explains that equations using e^x tell us how things evolve, while equations involving log(x) tell us why they evolve. This is a pretty good insight, since you would naturally have taken the log of our equation, log(<f(x)>), relying on the fact that log(e^x) = x, just to see how you could transform the fitted curve. Then you could have looked up the resulting equation, found that you had recovered the entropy equation, and been done. But that still doesn't feel like understanding; it feels like a rush job just to hand in the homework before the deadline so you can move on to other subjects. Deriving the equation from first principles is the extra-credit version.
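As a quick illustration of the log trick, here is a minimal Python sketch on synthetic, exponentially decaying stand-in data (the value of T below is invented for the example): taking the log of the binned data turns the decay curve into a straight line whose slope recovers -1/T.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 35_000.0                                  # an invented "average income" for the example
income = rng.exponential(T, size=50_000)      # synthetic stand-in data with an exponential decay

counts, edges = np.histogram(income, bins=40, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
mask = counts > 0                             # avoid log(0) in empty tail bins

# log(p(x)) = -x/T - log(T): a straight line whose slope is -1/T.
slope, intercept = np.polyfit(centers[mask], np.log(counts[mask]), 1)
print("recovered T ≈", -1.0 / slope)
```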
Maximum entropy without tears or losing your mind
Claude Shannon developed a theory of information that has been called a stunning intellectual achievement. And when his formula turned out to resemble the entropy equation, John von Neumann suggested he call it entropy "because no one knows what entropy is." It took generations to prove that actual entropy, the kind that is used in chemistry, is the same thing. Very few authors understand the distinction and how to merge the two ideas, so the Internet is filled with all sorts of misconceptions that make understanding entropy almost impossible. See the von Neumann quote above! Deriving our own information equation removes the scariness and leads to clarity. Besides, knowing how to do this will make you, regular Joe, the geekiest geek at the next tech convention, since you alone will be able to derive the entropy equation on a napkin, to the utter horror of mister smart-e-pants, the hipster know-it-all geek.
To begin with, we already suspect that the answer will involve log(x), so when we design our information equation we should try to arrange the rules so that only the log function will satisfy them, while the rules themselves should still satisfy common sense. We suspect the log because 1) the "why things work" side is usually a "log(x)" function, and 2) the curve we are trying to explain is an "e^x" function.
First, we want the information about a probability event, I(p), to be greater than or equal to zero. After all, what would negative information be?
Second, like any good mathematician, you want the problem to be bounded, and a good choice is to make sure that all the probabilities we feed into the equation lie between 0 and 1.
Next, we want at least to be able to say when we do not have any information. Let's demand that when we know something in advance, e.g. that the probability of an outcome is certain, we gain no information. Said another way, when the probability p_1 is 1, the information I(p_1) is zero. And in the back of our minds we are remembering that e^0 = 1, or equivalently log(1) = 0.
Then we want to say something intelligent about how to combine information I(p) and probabilities p(x). We recall that for independent probabilities p_1, p_2, and so on up to p_n, you just multiply the probabilities together to get the combined result. After all, anyone can remember that when flipping a coin the probability of getting heads is 1/2, that getting heads twice in a row has probability 1/4, and so on. So let's require that the information simply add when the probabilities multiply: I(p_1) + I(p_2) = I(p_1 × p_2). And again we are remembering that e^x × e^y = e^(x+y), and therefore log(x) + log(y) = log(x × y).
Lastly, we want a function that is both proportional and continuous. That is, for small changes in probability there are small changes in information, while for large changes in probability there are large changes in information. And of course we want a smooth function that we can differentiate or integrate. If the function were to hop all over the place and fluctuate wildly as we changed p(x), it would be so difficult to work with that it would be useless.
Hmm.... take that, mister hipster! That's it. We have enough specification to build the theory of information. Sure, we need to dress it up in nice mathematical garb, but it's really not so scary as we thought. Any actual algebra we forgot we can just look up on the Internet.
Summarizing our requirements into axioms we have:
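- I(p) ≥ 0: information is never negative.
- Probabilities are bounded: 0 ≤ p ≤ 1.
- I(1) = 0: a certain outcome carries no information.
- I(p_1 × p_2) = I(p_1) + I(p_2) for independent events: information adds when probabilities multiply.
- I(p) is continuous and smooth, so small changes in p mean small changes in I.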
When mathematicians say they can "infer" something, they often mean they can guess because they already know the answer. In this case we can "infer" that a logarithmic equation will satisfy our requirements, and we can "infer" that the solution to our equation is:
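I(p) = log(1/p) = -log(p)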
You might want to go back and check our inference. Does this equation satisfy our requirements?
Happily it does, so it turns out we "inferred" correctly. You can test it out in some common-sense ways. For a coin flip, p = 1/2 and 1/p = 2, so we can write down the information in a single coin flip like this.
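I(1/2) = log2(1/(1/2)) = log2(2) = 1 bit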
Perfect. What this equation tells us is that, using log base 2, the sender can send one of two signals (i.e. there is a fifty percent chance she will send a one or a zero, heads or tails, up or down), and for each signal we receive that we could not have predicted better than fifty percent of the time, we get one bit of information. That is just what we need to build a communication system, so we have reason to believe we have done a good job constructing our information function.
Now, using our knowledge that expectation values are usually just average values, we are going to define the entropy as the expectation value <I(P)>, the weighted average of the information, and name it H(P), where P = {p_1, p_2, ..., p_n}. Let's take a moment to recall that a weighted average of any function f(x) is done like this:
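<f(x)> = ∑_i p(x_i) · f(x_i)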
which makes sense; after all, you want to weight the value of the function at x_1 by the probability of landing on x_1 at all. So naturally, when we want the expectation value of a particular function, like I(P), we write it like this.
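<I(P)> = ∑_i p(x_i) · I(p(x_i))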
Now for a couple of quick substitutions: let p_i = p(x_i), I(p_i) = log(1/p_i), and ∑_i p_i = 1.
And we can write the Shannon Information Entropy Equation in its standard form like this:
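H(P) = <I(P)> = ∑_i p_i · log(1/p_i) = -∑_i p_i · log(p_i)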
You should take a moment to congratulate yourself, for you have just reconstructed Shannon's 1948 paper on information theory without having a PhD in computer science, statistics, or physics. Well done.
Is that it? Are we finished? Is this what people mean by entropy? Is this the only entropy equation? What does it look like when you plot it out?
OK, this is not the only entropy equation; if you go searching you will find many different entropy equations, but you should now have the tools to follow along and see how the others are derived. Most importantly, it is not the chemical maximum entropy equation of Boltzmann and Gibbs:
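p(x) = (1/T) · e^(-x/T)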
which we used earlier when we matched the US income distribution, so we are not yet in a position to claim that people minimize energy based on our newly derived entropy equation. We will have to work a little harder to get this equation into the correct form before we can say that people minimize energy. But out of curiosity, let's quickly plot the function to see what the entropy H(P) looks like.
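Here is a minimal Python sketch of that plot, assuming what gets charted is each outcome's contribution p_i * log2(1/p_i) under a uniform p(x) = 1/100:

```python
import numpy as np
import matplotlib.pyplot as plt

n = 100
p = np.full(n, 1.0 / n)            # a constant probability function, p(x) = 1/100

terms = p * np.log2(1.0 / p)       # each outcome's contribution p_i * log2(1/p_i) to H(P)

plt.plot(terms)                    # a flat line: every outcome contributes equally
plt.xlabel("outcome i")
plt.ylabel("p_i * log2(1/p_i)")
plt.title(f"H(P) = {terms.sum():.2f} bits for a uniform distribution over {n} outcomes")
plt.show()
```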
Well, that's not very interesting. What happened? Our probability function was a constant, p(x) = 1/100: we simply divided our total probability, one, by the number of outcomes, 100. If our probability had been a bell curve or an "S" curve, then the entropy would have had some point at which it was a maximum. Furthermore, we would have a maximum expectation value at the same point as the highest probability value, for example at the top of the bell curve. This doesn't seem all that interesting; why not just stick with the probability function?
It turns out that we can find the probability function P that maximizes the entropy by treating each p_i as a variable and solving for the entropy function's extreme points. The method used is called Lagrange multipliers, and we'll cover it a little, but here we are going to use a tool before we fully understand it. The method consists of adding constraint equations to our function, and then adding an extra variable that lets us find where the big equation, constraints and all, levels off (its slope becomes zero), or where it runs into a boundary and is cut off there.
The method consists of taking our equation y = f(x) and moving it around until one side equals zero:

f(x) - y = 0

Then we add a constraint, say that x can't be greater than 1 and sits right at that boundary, and we multiply the constraint equation by a variable λ, called a Lagrange multiplier, and set it to zero as well. The constraint

x = 1

becomes (by subtracting one from both sides and multiplying both sides by lambda)

λ(x - 1) = 0

And since adding equations that each equal zero still gives zero, we can write our new equation, called the Lagrangian, as L = 0:

L = f(x) - y + λ(x - 1) = 0
Then we look for the maximum, i.e. where the slope of our new equation equals zero, or more precisely where the difference in slope between our original equation and the constraints is zero:

∂L/∂x = 0 and ∂L/∂λ = 0

Then we solve these equations for a λ that satisfies the derivative equations, and we put that λ back into our Lagrangian as a constant to produce an equation that yields a maximum under the given set of constraints. In case you forgot how to do partial derivatives: taking f(x) = 4x^2 as an example, so that L = 4x^2 - y + λx - λ, you are going to determine the rate at which the function is changing at a point (a,b), first holding y fixed and allowing x to vary, then holding x fixed and allowing y to vary.

We'll start by looking at the case of holding y fixed and allowing x to vary. Since we are interested in the rate of change of the function at (a,b) and are holding y fixed, we are always going to have y = b (if we didn't, then eventually y would have to change in order to get to the point). Doing this gives us a function involving only x's, and we can define a new function g(x) = 4x^2 - b + λx - λ; at x = a we get L = 4a^2 - b + λa - λ. When we are done we change (a,b) back to (x,y).

Then you solve for λ in terms of x and y, put that back into the constraint equations to solve for λ, and substitute the result into the Lagrangian.
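If you would rather let the computer keep the algebra straight, here is a minimal SymPy sketch of the same recipe applied to the toy example above (f(x) = 4x^2 with the constraint x = 1, dropping the y bookkeeping):

```python
import sympy as sp

x, lam = sp.symbols("x lambda")

f = 4 * x**2              # the toy objective from the example above
g = x - 1                 # the constraint x = 1, written so that g = 0

L = f + lam * g           # the Lagrangian: objective plus multiplier times constraint

# Stationary points: dL/dx = 0 and dL/dlambda = 0, solved together.
solutions = sp.solve([sp.diff(L, x), sp.diff(L, lam)], [x, lam], dict=True)
print(solutions)          # [{x: 1, lambda: -8}]
```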
OK, we have enough math to continue with deriving a maximum information entropy equation.
Let's suppose that we have a simple economy where the money starts out evenly distributed, like our boring flat line above, and then look for a maximum entropy equation and see what we get. In our economy no one is allowed to go into debt, p_i ≥ 0, and we have some amount of money M that is distributed evenly among N people. Remember our rule: when you don't know the distribution, always distribute the variable with equal probability! So in our economy each agent will act for random reasons about which we know nothing.
Our hypothesis is that the system as a whole will evolve to some state where entropy is maximized. We'll test this maximum entropy function against real data and compare it to other probability functions to see what other conclusions can be drawn.
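Before doing the algebra, a quick Monte Carlo sketch (made-up sizes, one-dollar transfers between randomly chosen agents, and no debt allowed) suggests what to expect: the flat starting distribution drifts toward an exponential one.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M, STEPS = 1_000, 10_000, 500_000       # made-up sizes: N agents sharing M dollars
money = np.full(N, M // N)                 # money starts out evenly distributed
pairs = rng.integers(0, N, size=(STEPS, 2)).tolist()   # random payer/receiver pairs

for a, b in pairs:                         # agents act for random reasons we know nothing about
    if money[a] >= 1:                      # no one is allowed to go into debt
        money[a] -= 1                      # one dollar changes hands per encounter
        money[b] += 1

# Compare the simulated holdings with the maximum entropy form p(m) = (1/T) e^(-m/T).
T = M / N
hist, _ = np.histogram(money, bins=np.arange(0, 51), density=True)
predicted = np.exp(-np.arange(0, 50) / T) / T
print(np.round(hist[:5], 3))
print(np.round(predicted[:5], 3))
```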
First, we have to find the probability function p_i that maximizes the entropy, and then see whether it matches our real economy.
If we let n_i be the number of agents who have i dollars, we have two constraints:

∑_i n_i = N

and

∑_i i·n_i = M

or we could say that p_i = n_i / N and get

∑_i p_i = 1

and

∑_i i·p_i = M/N
Now we apply the Lagrange multipliers. Using the natural logarithm for the entropy and attaching one multiplier to each constraint, the Lagrangian is:

L = -∑_i p_i ln(p_i) - μ (∑_i p_i - 1) - λ (∑_i i·p_i - M/N)

from which, setting ∂L/∂p_i = 0, we get:

-ln(p_i) - 1 - μ - λ·i = 0

and we solve this for p_i:

ln(p_i) = -(1 + μ) - λ·i

and so

p_i = e^(-λ0) · e^(-λ·i)

where we have set 1 + μ = λ0.
Now we put our new p_i back into our constraint equations to solve for λ:

∑_i e^(-λ0) e^(-λ·i) = 1

and

∑_i i · e^(-λ0) e^(-λ·i) = M/N

We can approximate the sums by integrals for large M. From these equations we have, approximately:

∫ e^(-λ0) e^(-λx) dx = e^(-λ0) / λ = 1

and

∫ x · e^(-λ0) e^(-λx) dx = e^(-λ0) / λ^2 = M/N

(both integrals running from 0 to infinity). And from this we can solve for λ0 in terms of λ:

e^(-λ0) = λ and 1/λ = M/N

and therefore, if we let T = M/N, we have:

p(x) = (1/T) · e^(-x/T)
This is exactly the equation, apart from a constant, that we matched to the 1996 US income distribution above, so we are justified in calling that economy information entropic. But we can go further, because it is also the Boltzmann-Gibbs distribution for the ideal gas, where we can think of T, the average temperature, as the average income of the economy, T = M/N. And now we can say that people do behave, or at least in the US in 1996 behaved, as if they were subject to simple energy minimization principles, just as ideal gases are.
But aren't you puzzled about why we behave like that? Some people might take it for granted that people seek to minimize the energy used in gathering food, finding shelter, and so on, but what about the underlying biology? Shouldn't the brain exhibit the same, or at least a similar, probability distribution? Let's look at the power vs. frequency distribution of an EEG of a normal human brain and see.
And in humans we see this:
In fly brains we see the same, albeit simpler, distribution: https://jn.physiology.org/content/110/7/1703
Cool: seeing that same distribution likely means that we have a good explanation. It's no guarantee, of course, because correlation does not equal causation, but here we have a good correlation with a fundamental physical principle - not with UFO sightings. We are probably on firm ground in believing that the brain, and humans, do seek to minimize energy outlays.
This spectral analysis probably means that there are lots of low frequency brainwaves and just a few high frequency brainwaves. Some recent researchers speculate that the few high frequency waves are our "thoughts". In this view the low frequency items "work for" the high frequency items in much the same way as the bottom forty percent of Americans work for the Walton family, owners of Walmart. But another view is that the top frequency items in the brain float on top of, and are supported by, the lower frequency items. Whichever viewpoint is correct, it does suggest a new model of brain behavior.
The chemistry of thought happens at a wave-front gradient boundary; the brain as a whole seeks the lowest energy state; and the high frequency items, because they carry more information, are the important ones for understanding and differentiating thought.
Moreover, in a Proceedings of the National Academy of Sciences (PNAS) paper on the effects of anesthesia titled "Rapid fragmentation of neuronal networks at the onset of propofol-induced unconsciousness" (doi:10.1073/pnas.1210907109), the researchers reported that a few seconds before consciousness was lost, as indicated by the patients’ response to sound stimuli, the EEG revealed the onset of low-frequency (below about 1 hertz), or “slow wave,” oscillations. Meanwhile, the implanted electrodes showed that the firing of ensembles of individual neurons in different but nearby regions of the cortex was interrupted every few hundred milliseconds, an unusual pattern, since cortical neurons usually fire regularly and without interruption. The slow oscillations, the team realized, were occurring asynchronously across the cortex, meaning that when one set of neurons in one area was firing, another set of neurons in a nearby area was often silent. This pattern likely disrupts the passage of information between cortical areas, something that has been associated with loss of consciousness by several studies over the last few years.
So it appears as though the brain creates waves, and on each wave front there is a chemical gradient that allows neurons to interact with each other. Low frequency patterns are more global and suppress individual regions, while higher frequency regions communicate with each other, rather like radio stations, across the brain. Do high frequency signals occur on the lower frequency wave front? That is not known. However, it appears that the entire ensemble of brainwaves is driven by a function that matches the Boltzmann ideal gas probability function. In short, we behave as we do because our brains behave as they do.
We can also speculate that injured brains, after, say, a concussion, do not have the energy resources to support waves moving across the damaged nerves, and that this interrupts the flow of information across the brain.
One last thing before we go: our earlier utility theory discussion can be put on much firmer ground. Actors A and B are thermodynamic reservoirs, each containing a certain amount of entropy, and the price C is the equilibrium position of the two systems. We could work through utility theory and carefully match each concept and equation to a maximum entropy argument. At the end of that exercise we would be justified in saying that Utility Theory is a domain-specific application of thermodynamics, and that the meaning of utility is the maximization of entropy or, put another way, the minimization of energy.