#18) Why Not Become an AI Genius Today? Seems Like Today is Your Day...
Dearest Students Of The Matrix, you are the chosen ones: those who dare to sell the Kitty Litter; those who wish to find the Secret AI Entrance in the Math Of Mirrors! Wait no longer--today your quest will bear fruit!
Here is your daily reminder to please open this link in a separate window so you can toggle back-and-forth between the code and my explanations of it (use Alt + Tab on Windows, or Command + Tab and Command + Shift + Tab on Macs).
3.3) Let's Walk Through the Math of Feed Forward Slowly:
l0 x syn0 = l1_LH, so in our example 1 x 3.66 = 3.66, but don't forget we have to add the other two products of l0 x the corresponding weights of syn0. In our example, l0,2 x syn0,2 = 0 x something = 0, so it doesn't matter. But l0,3 x syn0,3 does matter, because l0,3 = 1, and we know from our matrix example in the last section that the value for syn0,3 is 0.16. Therefore, l0,3 x syn0,3 = 1 x 0.16 = 0.16. Our product of l0,1 x syn0,1 plus our product of l0,3 x syn0,3 = 3.66 + 0.16 = 3.82, and 3.82 is l1_LH. Next, we have to run l1_LH through our nonlin() function to turn it into a probability between 0 and 1. nonlin(l1_LH) uses the code, return 1/(1+np.exp(-x)), so in our example that would be 1/(1+(2.718^-3.82)) = 0.98, so l1 (the RH side of the l1 node) is 0.98.
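If it helps to see that arithmetic as code, here is a minimal NumPy sketch using the numbers above (3.66 and 0.16 come from this walkthrough; the -4.84 in the middle is the syn0,2 value shown in the matrix example later in this section, and it doesn't matter here because it gets multiplied by 0):

```python
import numpy as np

# Customer one's survey answers (row 1 of l0): [1, 0, 1]
l0_row = np.array([1, 0, 1])

# The three syn0 weights feeding our first hidden neuron.
# 3.66 and 0.16 are used in the walkthrough above; -4.84 is taken from
# the matrix example later in this section (it's multiplied by 0 anyway).
syn0_col = np.array([3.66, -4.84, 0.16])

def nonlin(x):
    """Sigmoid: squishes any number into a value between 0 and 1."""
    return 1 / (1 + np.exp(-x))

l1_LH = np.dot(l0_row, syn0_col)   # 1*3.66 + 0*(-4.84) + 1*0.16 = 3.82
l1 = nonlin(l1_LH)                 # 1/(1 + e^-3.82) is approximately 0.98

print(l1_LH, l1)
```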
So, what just happened with the equation 1/(1+np.exp(-x)) = 1/(1+2.718^(-3.82)) = 0.98 above? The computer used some fancy code, return 1/(1+np.exp(-x)), to do what we could do manually with our eyeballs--it told us the corresponding y value of x = 3.82 on the sigmoid curve as pictured in this diagram:
[Diagram: the sigmoid curve (taken with gratitude from Andrew Trask)]
Notice that, at 3.82 on the x axis, the corresponding point on the blue curved line is about 0.98 on the y axis. Our code converted 3.82 into a statistical probability between 0 and 1. It's helpful to visualize this on the diagram so you know there's no abstract hocus-pocus going on here. The computer did what we did: it used math to "eyeball what 3.82 on the x axis would be on the y axis of our diagram." Nothing more.
Again: nonlin() is the part of the sigmoid function that renders any number as a value between 0 and 1. It is the code, return 1/(1+np.exp(-x)). It does not take the slope. But in back prop, we're going to use the other part of the sigmoid function, the part that does take the slope (i.e., the derivative), return x*(1-x), because you will notice that lines 57 and 71 specifically ask the sigmoid to take the slope with the code, (deriv==True).
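Putting those two quoted pieces together, the nonlin() function this article keeps referring to looks roughly like this (a sketch assembled from the snippets above; the exact formatting in the full script may differ):

```python
import numpy as np

def nonlin(x, deriv=False):
    # When deriv==True (as requested on lines 57 and 71), return the slope
    # of the sigmoid at x -- here x is assumed to already be a sigmoid output.
    if deriv == True:
        return x * (1 - x)
    # Otherwise, squish x into a statistical probability between 0 and 1.
    return 1 / (1 + np.exp(-x))
```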
Now, rinse-and-repeat: we'll multiply our l1 value by our syn1,1 value. l1 x syn1 = l2_LH, which in our example would be 0.98 x 12.21 = 11.97. But again, don't forget that to 11.97 we must add the products of all the other l1 neurons times their corresponding syn1 weights, so for simplicity's sake trust me that they all added up to -11.97 (I am using the same matrix). So you end up with 11.97 + (-11.97) = 0.00, which is l2_LH. Next we run l2_LH through our fabulous nonlin() function, which would be: 1/(1+2.718^-0) = 0.5, which is l2, which is our very first prediction of what the truth, y, might be! Congratulations! You just completed your first feed forward!
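Here is the same rinse-and-repeat step as a short NumPy sketch, using the numbers above. The -11.97 contribution from the other hidden neurons is the total the article asks us to take on faith, so it is folded into a single placeholder variable:

```python
import numpy as np

def nonlin(x):
    return 1 / (1 + np.exp(-x))

l1_first = 0.98          # output of our first hidden neuron (from above)
syn1_1 = 12.21           # its weight into the output neuron
other_contrib = -11.97   # sum of the other l1 neurons x their syn1 weights
                         # (placeholder total, taken on faith from the article)

l2_LH = l1_first * syn1_1 + other_contrib   # roughly 11.97 + (-11.97) = 0.00
l2 = nonlin(l2_LH)                          # 1/(1 + e^0) = about 0.5: our first prediction

y = 1                    # the true answer: "Yes" to survey Question 4
l2_error = y - l2        # about 1 - 0.5 = 0.5 (see the summary below)

print(l2_LH, l2, l2_error)
```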
Now, let's assemble all our variables in one place, for clarity:
l0 = 1
syn0,1 = 3.66
l1_LH = 3.82
l1 = 0.98
syn1,1 = 12.21
l2_LH = 0
l2 = ~0.5
y = 1 (this is a "Yes" answer to survey Question 4, "Actually bought Litter Rip?")
l2_error = y - l2 = 1 - 0.5 = 0.5
OK, let's now take a look at the matrix multiplication that makes this all happen (for those of you who are rookies to matrix multiplication and linear algebra, Grant Sanderson teaches it brilliantly, with lovely graphics, in 14 YouTube videos. Watch those first, if you wish, then return here).
First, on line 58 we multiply the 4x3 l0 and the 3x4 syn0 to create (hidden layer) l1, a 4x4 matrix:
Now we pass it through the "nonlin()" function on line 58, which is a fancy math expression I explained above that "squishes" all values down to values between 0 and 1:

1/(1 + 2.71828^-x)

This creates layer 1, the hidden layer of our neural network:

l1:
[0.98 0.03 0.61 0.58]
[0.01 0.95 0.43 0.34]
[0.54 0.34 0.06 0.87]
[0.27 0.50 0.95 0.10]
If you find yourself feeling faint at the mere sight of matrix multiplication, fear not. We're going to start simple and break down our multiplication into tiny pieces, so you can get a feel for how this works. Let's take one single training example from our input: Row 1 (customer one's survey answers), [1,0,1], a 1x3 matrix. We're going to multiply that by syn0, which is still a 3x4 matrix, and our new l1 would be a 1x4 matrix. Here's how that simplified process can be visualized:
(multiply row 1 of l0 by column 1 of syn0, then row 1 by column 2, etc.)

row 1 of l0:        col 1 of syn0:
[1 0 1]      X      [ 3.66]   +           [ 3.82  -3.54  0.44  0.34 ]
[1 0 1]      X      [-4.84]   +     =     [ (row 2 of l0 x cols. 1, 2, 3, and 4 of syn0...) ]
[1 0 1]      X      [ 0.16]               [ etc... ]

Then pass the above 4x4 product through "nonlin()" and you get the l1 values:

l1:
[0.98 0.03 0.61 0.58]
[0.01 0.95 0.43 0.34]
[0.54 0.34 0.06 0.87]
[0.27 0.50 0.95 0.10]
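If you'd like to see NumPy do that single-row multiplication, here is a small sketch. Only the first column of syn0 (3.66, -4.84, 0.16) and the resulting first rows of l1_LH and l1 come from the article; the other syn0 values below are placeholders I picked so that rows 1 and 3 add up to the article's result row (the middle row is multiplied by 0 anyway):

```python
import numpy as np

def nonlin(x):
    return 1 / (1 + np.exp(-x))

# Row 1 of l0: customer one's survey answers (a 1x3 matrix)
l0_row1 = np.array([[1, 0, 1]])

# syn0 is 3x4. Only its first column appears in the article;
# the remaining entries are illustrative placeholders.
syn0 = np.array([
    [ 3.66, -2.00, 0.20, 0.14],
    [-4.84,  1.00, 1.00, 1.00],
    [ 0.16, -1.54, 0.24, 0.20],
])

l1_LH = np.dot(l0_row1, syn0)   # shape (1, 4): [[ 3.82 -3.54  0.44  0.34]]
l1 = nonlin(l1_LH)              # squish each value: [[0.98 0.03 0.61 0.58]]

print(l1_LH)
print(l1)
```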
Note that, on line 58, we wrap that dot product in the sigmoid function because we need l1 to have values between 0 and 1, hence: l1 = nonlin(np.dot(l0,syn0))
It is on line 58 that we see Big Advantage #1 of the Four Big Advantages of the sigmoid function. When we pass the dot product matrix of l0 and syn0 through the nonlin() function, the sigmoid converts each value in the matrix into a statistical probability between 0 and 1.
My loveliest and most astute readers will then pose a clever question: "Gee, Dave, 'statistical probability' of what?" Well, Friends, prepare yourself for (yet another) stunning breakthrough in deep learning. Although it pains me, I must delay your joyful discovery until tomorrow so that I can go have a life. Ta-ta for now, and I leave you with visual joy in this travel photo from the jaw-dropping Quijiang Art Museum in Xi'an, China: