Linear Regression Notes
1, find the objective/loss/penalty function for a given scenario
In linear regression, our goal is to predict a target value y given a vector of input values x. Suppose we have a function f that satisfies f(x) = y, or at least f(x) is very close to y. Naturally, the difference between f(x) and y should be very close to 0. With that said, our goal is to minimize
f(x) - y
Most of the time, we have a lot of x's (the training set), which we can denote as x1, x2, ..., xn. Accordingly, we have y1, y2, all the way to yn, and we can rewrite our goal as minimizing
(f(x1) - y1) + (f(x2) - y2) + ... + (f(xn) - yn)
Mathematically, the difference between f(xi) and yi can be either negative or positive. With that said, if we add all these differences together, they may cancel each other out: residuals of +3 and -3 sum to 0 even though neither prediction is accurate. Taking absolute values would fix the sign problem, but the absolute value function is not differentiable at 0, which makes it awkward to find the minimum point by setting derivatives to 0 and so to figure out the parameters. To make things easier, people use the squared form instead and rewrite the goal as
(f(x1) - y1)^2 + (f(x2) - y2)^2 + ... + (f(xn) - yn)^2
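To make this concrete, here is a minimal sketch in NumPy that evaluates this squared-error loss, assuming f is a linear model f(x) = w . x + b; the data and the values of w and b below are made-up illustrative numbers, not part of these notes.

```python
import numpy as np

# Made-up training set: n examples, each x a feature vector (shape (n, d)).
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([5.0, 4.0, 8.0])

# Assume f is linear: f(x) = w . x + b (w and b are illustrative values).
w = np.array([1.0, 1.0])
b = 0.5

predictions = X @ w + b                 # f(x1), ..., f(xn)
loss = np.sum((predictions - y) ** 2)   # (f(x1)-y1)^2 + ... + (f(xn)-yn)^2
print(loss)
```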
2, try to minimize the objective/loss/penalty function to figure out parameters
Calculus tells us that if a function reaches a minimum or maximum at some point, its first-order derivative at that point is 0. The same rule applies to partial derivatives. We can use this to minimize the loss function and solve for the related parameters directly. But if function f has thousands of parameters, solving that system of equations is not convenient. Aha! Gradient descent comes to the rescue! Mathematically, the gradient points in the direction of steepest increase of a function, so if we repeatedly take small steps in the opposite direction of the gradient, the function value decreases and we approach the said minimum point. This follows from the definition of the derivative.
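Below is a minimal gradient-descent sketch for the squared-error loss above, again assuming a linear f(x) = w . x + b; the learning rate lr and step count are illustrative defaults that usually need tuning for real data.

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, steps=1000):
    """Minimize the sum of squared errors for f(x) = w . x + b.

    lr and steps are illustrative defaults, not recommended settings.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        residuals = X @ w + b - y       # f(xi) - yi for every training example
        grad_w = 2 * X.T @ residuals    # partial derivative of the loss w.r.t. w
        grad_b = 2 * residuals.sum()    # partial derivative of the loss w.r.t. b
        w -= lr * grad_w                # step AGAINST the gradient to decrease the loss
        b -= lr * grad_b
    return w, b

# Usage on noiseless synthetic data: should recover w ~ [2, -1], b ~ 3.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0]) + 3.0
w, b = gradient_descent(X, y)
print(w, b)
```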
3, test the final function to overcome both under-fitting and over-fitting and make the model more general
Our parameters are figured out on the training set using the above-mentioned techniques. But the training set may be sampled noisily, and this noise leaks into the parameters. As a result, our model may fit the training set well but fail on unseen data; this is over-fitting. To detect it, we split our data into a training set and a test set and check performance on the held-out test set; to reduce it, we add a penalty term (regularization) to the loss function that discourages overly complex fits.
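As an illustration, the sketch below holds out a test split and adds an L2 (ridge) penalty to the squared-error loss; the 80/20 split, the penalty weight lam, and the synthetic data are arbitrary example choices, not prescribed by these notes.

```python
import numpy as np

def ridge_loss(w, b, X, y, lam=0.1):
    """Sum of squared errors plus an L2 penalty lam * ||w||^2 (lam is illustrative)."""
    residuals = X @ w + b - y
    return np.sum(residuals ** 2) + lam * np.sum(w ** 2)

# Synthetic, noisy data standing in for a real scenario.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out the last 20% as a test set to check how well we fit unseen data.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Fit on the training split only, then compare losses: a training loss far
# below the test loss is a sign of over-fitting.
```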