machine learning: intuition of Gaussian processes


This article talks about the Gaussian process and Gaussian process regression. We begin with the intuitive assumption that if data points are close to each other, then their heights should also be close. With this, we can draw prior samples at test inputs and draw posterior samples from the conditional distribution. The conditional distribution is obtained from the multivariate Gaussian theorem applied to the joint distribution of the stacked vector. The following expands on this by referencing slides from the machine learning lecture taught by Professor Nando, whose video lectures can be found on YouTube.

Now we're going to assume the X's are given, say {x_1, x_2, x_3}, and we want to model the f(x)'s. Moreover, we assume the f's are jointly Gaussian, so we have a vector (f_1, f_2, f_3) with zero mean and a covariance matrix that captures the relationship between the three points in the diagram. For example, x_1 and x_2 should be more correlated than x_1 and x_3 because they are nearby, and thus k_12 and k_21 have a larger value (0.7) than k_13 or k_31.

There are many ways to measure the similarity of x_i and x_j; one possible choice, which we use here, is the squared exponential kernel. It can serve as a similarity function because it goes to zero when the distance between x_i and x_j is very large and equals 1 when the two points coincide. We use this function to fill in the covariance matrix that describes the cloud of points, or at least describes their heights. We don't need to describe the x-axis because X is given; we only use X to construct K, the covariance matrix.
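As a small sketch of that idea (the exact kernel code and length scale on the slides may differ; the function name and the three example inputs below are my own), the squared exponential kernel and the resulting covariance matrix can be written as:

import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    # Squared exponential (RBF) kernel: 1 when two points coincide, decays to 0 as they move apart.
    sqdist = np.sum(a**2, axis=1).reshape(-1, 1) + np.sum(b**2, axis=1) - 2 * (a @ b.T)
    return np.exp(-0.5 * sqdist / length_scale**2)

X = np.array([[-1.0], [-0.9], [1.0]])   # x_1 and x_2 are close; x_3 is far away
K = sq_exp_kernel(X, X)                 # 3x3 covariance matrix over the heights (f_1, f_2, f_3)
print(np.round(K, 2))                   # the (x_1, x_2) entry is close to 1, the (x_1, x_3) entry is small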

As shown in the following slide, we now add a new point x* between x_2 and x_3, but we don't know the height (i.e., f*) of this new point. So where could the height of the new point plausibly be? The shaded triangle seems a more reasonable region than the other choices in the diagram, because if the x-axis distance between x* and its nearest point x_3 is small, we expect the distance between f* and the heights of its nearest points to be small as well. This is called smoothness in machine learning: a small variation along the x-axis should produce only a small variation along the y-axis.

Here we assume f* also comes from a Gaussian distribution with zero mean, because we assume the test data comes from the same distribution as the training data. If we have a joint multivariate Gaussian over (f, f*) and we want the conditional p(f* | f), we can use the multivariate Gaussian theorem to obtain the mean mu* and variance sigma*.
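In standard GP notation (my notation here, not necessarily the slide's: K is the covariance among the training inputs, K_* between training and test inputs, and K_** among the test inputs), the joint distribution is

\begin{bmatrix} f \\ f_* \end{bmatrix}
\sim
\mathcal{N}\!\left( \mathbf{0},\;
\begin{bmatrix} K & K_* \\ K_*^\top & K_{**} \end{bmatrix} \right)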

In other words, for any point from the test set, I can predict the mean and variance. If I take a large number of test points, I can plot a beautiful line. In the following diagram, we predict the mean and variance, and thus the confidence, for the red dots, which are test points. Intuitively, the confidence is high where the (training) data is, so the uncertainty is low there. Where we don't have data, say at test points far from the training points, I cannot be too confident in the prediction.

Now what I have is a function that takes x* as input and gives me the mean and variance as output. The Gaussian process is a distribution over functions, because the mean and variance are themselves functions of x. If I have those two functions, I can evaluate them on a very fine-grained grid of points.

The following Python code snippet creates the prior functions. The first thing I do is create a test dataset X_(1:N), where the N points are very close to each other; then I assume a zero mean vector mu and compute the covariance matrix K using the kernel function. We draw 10 vectors of standard normal samples and obtain the prior functions as the product of the Cholesky factor L and those samples, where the lower-triangular L satisfies L L^T = K. As shown in the diagram on the slide above, each color is a single prior function obtained in this way.
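A minimal sketch of that step, reusing the sq_exp_kernel above (the grid size, the jitter term, and the variable names are my own choices, not necessarily those on the slide):

n = 50
Xtest = np.linspace(-5, 5, n).reshape(-1, 1)          # N test points packed closely together
K_ss = sq_exp_kernel(Xtest, Xtest)                    # covariance of the heights at the test points
L = np.linalg.cholesky(K_ss + 1e-6 * np.eye(n))       # small jitter keeps the factorization stable
# Each column is one prior function: Cov(L z) = L L^T = K_ss when z is standard normal.
f_prior = L @ np.random.normal(size=(n, 10))          # 10 prior sample functions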

Here we only need the multivariate Gaussian theorem, which gives the posterior mean and variance of the conditional distribution p(f* | f) from the distribution of the joint vector (f, f*).
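Written out in the same notation as above (for a single test point the covariance Sigma_* reduces to the scalar variance sigma_*^2), the theorem gives

f_* \mid f \;\sim\; \mathcal{N}(\mu_*, \Sigma_*), \qquad
\mu_* = K_*^\top K^{-1} f, \qquad
\Sigma_* = K_{**} - K_*^\top K^{-1} K_* .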

In other words, given that I have a training set D, I combine the data with my prior functions. The prior, as the slides above show, is specified via the covariance: I want the function to be smooth, because the similarity matrix assumes that if two points are close by, their heights are also close by.

And if I draw functions from the conditional Gaussian, I get the bottom-right figure. Remember that I can evaluate the conditional Gaussian p(f* | f) at any point, and for all those points I can compute the mean and the variance. Then I get those beautiful plots where the training data basically squishes the uncertainty, grabbing those functions and tying them down. Note that in the bottom-right picture, each color is one vector (f_*1, f_*2, ...), and as the number of f_* points along the x-axis grows very large, the vector becomes a smooth curve, in other words a function.
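Continuing the sketch above (the training inputs and the sin heights below are invented purely for illustration), drawing posterior sample functions might look like:

X = np.array([[-4.0], [-3.0], [-1.0], [0.0], [2.0]])   # made-up training inputs
f = np.sin(X)                                          # made-up noiseless training heights

K = sq_exp_kernel(X, X)
K_s = sq_exp_kernel(X, Xtest)                          # cross-covariance between training and test points
K_inv = np.linalg.inv(K + 1e-6 * np.eye(len(X)))

mu_star = K_s.T @ K_inv @ f                            # posterior mean at every test point
Sigma_star = K_ss - K_s.T @ K_inv @ K_s                # posterior covariance at the test points

L_post = np.linalg.cholesky(Sigma_star + 1e-6 * np.eye(n))
f_post = mu_star + L_post @ np.random.normal(size=(n, 10))   # 10 posterior sample functions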

With this, we have noiseless Gaussian process regression as defined in the following two slides.

There are cases where we cannot observe the f's exactly; the observations carry some error. This is the case of noisy Gaussian process regression, with an extra noise term epsilon that is normally distributed. As the following slide shows, having Gaussian noise epsilon only involves a minor modification of the algorithm: the addition of a diagonal term sigma^2 I to the covariance, and everything else stays the same. When you plot this, however, you see that even where you have data, as shown in the blue frame in the following diagram, the uncertainty no longer collapses to zero, because we assume the observations are noisy.
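With noisy observations y = f + epsilon, epsilon ~ N(0, sigma^2 I), the predictive equations only change by that diagonal term (same notation as above):

\mu_* = K_*^\top \left( K + \sigma^2 I \right)^{-1} y, \qquad
\Sigma_* = K_{**} - K_*^\top \left( K + \sigma^2 I \right)^{-1} K_* .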

For the noisy GP, we can rewrite the mean of f* as a sum of basis functions. So if you look at the mean of the Gaussian process, what you're doing is fitting an RBF model: you're fitting a non-linear function using a combination of basis functions, where you place one basis function at each data point. Now we know that putting the basis functions where the data is is the principled thing to do.
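Concretely (again in the notation used above), the posterior mean is a weighted sum of kernel basis functions centered at the training inputs:

\mu(x_*) = \sum_{i=1}^{N} \alpha_i \, k(x_i, x_*), \qquad
\alpha = \left( K + \sigma^2 I \right)^{-1} y .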

Chen Yang

