#20) Section 4 of 5: How a Bunch of Numbers Can Actually Learn From Trial-and-Error: Gradient Descent
Mighty Friends, one simple question has always fascinated me: I understand how a child can learn by trial-and-error. But how does the box of silicon that we call a computer learn from trial-and-error? Today, we begin to answer that question.
As always, I humbly beseech thee to please open this link in a separate window so you can toggle back-and-forth between the code and my explanations of it (use Alt + Tab on Windows, or Command + Tab on Macs).
4.1) The Big Picture of Gradient Descent
What is the purpose of gradient descent? It is to find the best set of adjustments to our network of weights so that it gives a better prediction in the next iteration. In other words, certain values in the synapse matrices of our network need to be increased or decreased in order to give a better prediction next time. To adjust each of these values, we must answer two key questions:
1) In what direction do I adjust the number? Do I increase the value, or decrease it? Positive direction, or negative? and
2) By how much do I increase or decrease the number? A little, or a lot?
We will examine these two basic questions in great detail below. But if you want to visualize what gradient descent does, simply remember that "gradient" is just a fancy word for "slope." If you remember our curvy red bowl from Section 1, The Big Picture, gradient descent simply means calculating the optimal slope of the surface of that bowl to get that little white ball down to the bottom of the bowl as quickly and efficiently as possible. So keep that curvy red bowl in your mind.
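To make the "slope of the bowl" idea concrete before we dive into the network code, here is a minimal one-dimensional sketch (not part of our network's script, and the function `f(w) = (w - 3)**2` is just an invented stand-in for the bowl): the sign of the slope answers question 1 (which direction?), and its size answers question 2 (by how much?).

```python
# A toy "bowl": f(w) = (w - 3)**2, whose slope at any point w is 2 * (w - 3).
# The bottom of the bowl sits at w = 3.
def gradient_descent(start, learning_rate=0.1, steps=100):
    w = start
    for _ in range(steps):
        slope = 2 * (w - 3)            # sign tells us the direction to move
        w = w - learning_rate * slope  # magnitude tells us how big a step
    return w

print(round(gradient_descent(start=0.0), 4))  # rolls down to 3.0
```

Starting the ball at w = 0, each step nudges it toward the bottom at w = 3, taking big steps where the bowl is steep and tiny steps near the flat bottom.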
Our first step in gradient descent is to calculate how much our current prediction missed the actual truth, namely a 1/yes or a 0/no in y.
4.2) How Far Off is our Prediction when Compared to Survey Question #4?
Line 66
l2_error = y - l2
So, by how much did our first prediction miss the target of "Yes/1," the actual truth from survey question four that Customer One did indeed buy Litter Rip? Well, with Customer One (Row One of l0), we want to compare our l2 prediction to the y value of 1, since Customer One did indeed buy Litter Rip! When I say, "compare our l2 prediction," I mean we subtract the l2 probability from the y value, and the difference is our l2_error, or "how much we missed the target value y by."
So, big picture here again: our network took the input of each customer's response to the first three survey questions, and manipulated that data to come up with a prediction of whether that customer bought Litter Rip! or not. Because we have four customers, our network made four predictions. And you may recall that the 4x1 y vector contains the answers of four customers to question four, "Have you purchased Litter Rip!?" It contains four "0" or "1" values to which we want to compare the four predictions our network came up with.
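To see the subtraction in Line 66 with actual numbers, here is a hedged sketch: the four l2 values below are made up for illustration (your network's real predictions will differ), but the shapes match our script, a 4x1 y vector minus a 4x1 prediction vector.

```python
import numpy as np

# Actual answers to survey question four ("Have you purchased Litter Rip!?"):
y = np.array([[1], [1], [0], [0]])

# Four HYPOTHETICAL predictions from the network, one per customer:
l2 = np.array([[0.5], [0.9], [0.05], [0.7]])

# Line 66 of our script: how far each prediction missed the truth.
l2_error = y - l2
print(l2_error)
# [[ 0.5 ]
#  [ 0.1 ]
#  [-0.05]
#  [-0.7 ]]
```

Notice the signs: a positive error (Customer One, 0.5) means the prediction was too low and must be increased; a negative error (Customer Four, -0.7) means it was too high and must be decreased. That is question 1, direction, answered for free by simple subtraction.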
Once we know our l2_error (which of course is also a vector of four errors, one for each prediction), next we want to print that error, so we can eyeball our process in real time:
Print Error: Lines 72-73
Line 72 is a clever parlor trick to have the computer print out our l2_error every 10,000 iterations. It's helpful for us to envision the learning the network is doing if it "shows us its homework" every 10,000 iterations, so we can see its progress. The line, if (j % 10000) == 0: means, "If your iterator is at a number of iterations that, when divided by 10,000, leaves no remainder, then..." So, j % 10000 would have a remainder of 0 only six times: at 0 iterations, 10,000, 20,000, and so on up to 50,000 (j never reaches 60,000 inside the loop). So this print-out gives us a nice report on the progress of our network's learning.
The code + str(np.mean(np.abs(l2_error))) simplifies our print-out by taking the absolute value of each of the 4 errors, then averaging all 4 into one mean number and printing that.
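Put together, Lines 72-73 look like the sketch below. The l2_error values are hypothetical stand-ins (in the real script they are recomputed every iteration), but the print logic is the same:

```python
import numpy as np

# Hypothetical 4x1 error vector, one entry per customer:
l2_error = np.array([[0.5], [0.1], [-0.05], [-0.7]])

for j in range(60000):
    if (j % 10000) == 0:  # True at j = 0, 10000, ..., 50000: six print-outs
        # abs() of each of the four errors, then their mean, as one number:
        print("Error: " + str(np.mean(np.abs(l2_error))))
```

With these made-up errors, each report reads "Error: 0.3375", the mean of 0.5, 0.1, 0.05, and 0.7. In the real network this number shrinks from one report to the next, which is exactly the progress we want to eyeball.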
OK, so we now know how much our predictions about four customers (l2) missed the Actual Truth about who purchased Litter Rip! (y). And we've printed that.
But of course, any distance between us and The Oracle's castle is too much for our hearts to bear, so how can we reduce the current, unsatisfactory prediction error of 0.5 to finally attain enlightenment?
One step at a time. First, let's get clear on what part of our network needs to change in order to improve our network's next prediction. After that, we'll discuss how to adjust our network. Tune in again tomorrow, and bask in this beauty tonight: my favorite nightclub in Mexico City, Dixon's: