Linear Regression - gradient descent optimization
Debi Prasad Rath
@AmazeDataAI- Technical Architect | Machine Learning | Deep Learning | NLP | Gen AI | Azure | AWS | Databricks
Hi connections. Trust you are doing well. In this post we will continue from where we left off, which is to discuss the "gradient descent" algorithm. As the name suggests, gradient descent is an optimization algorithm that finds the optimal values of the beta terms that minimize the cost function. It is an iterative process: at each pass we estimate the error and update the beta terms so that the error keeps shrinking until it becomes nearly zero or stops improving. It is one of the most powerful tools in all of machine learning.
Let me break it down for you guys. In "gradient descent", the gradient refers to a vector that gives the direction of the slope of the cost function at the current point, which we use to find the minimum of the cost function. By definition, it is the rate of change of the cost function for a tiny/very very small change in its parameters. In simple terms, you can write it as d(cost_function)/d(parameter), i.e. dy/dX parametrized by the beta terms. For instance, while riding a bike, you apply the brakes just enough that the bike slows down before crashing into an obstacle. Well, that is roughly gradient descent for you.
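To make the "rate of change" idea concrete, here is a minimal Python sketch (the toy cost function and the values are my own illustration, not taken from any library) that approximates a gradient with a tiny finite difference:

# Toy cost with its minimum at beta = 3; purely illustrative.
def cost(beta):
    return (beta - 3) ** 2

# Rate of change of the cost for a very small change in beta.
def numerical_gradient(f, beta, eps=1e-6):
    return (f(beta + eps) - f(beta - eps)) / (2 * eps)

print(numerical_gradient(cost, 5.0))  # roughly 4.0, the slope at beta = 5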
Technically, gradient descent is a first-order optimization method, meaning it uses only first derivatives. Precisely, it performs two tasks: first the gradient is computed, and then we make a step (move) in the direction opposite to the gradient. The size of this step comes from multiplying the current gradient by the learning rate. Here, the learning rate controls how big a move we make at each update. It should be neither a very big step nor a tiny little baby step; we want to strike a balance so that the cost function converges to its minimum.
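As a tiny sketch of that two-step idea, compute the gradient and then move against it scaled by the learning rate (all numbers below are illustrative assumptions, continuing the toy cost from above):

learning_rate = 0.1
beta = 5.0                              # current parameter value
gradient = 2 * (beta - 3)               # gradient of the toy cost (beta - 3)**2
beta = beta - learning_rate * gradient  # step opposite to the gradient
print(beta)                             # 4.6, a small move toward the minimum at 3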
Let us understand it in pieces. Try to recollect the equation of linear regression with one independent variable, y = Beta0 + Beta1 * X1 + error, where Beta0 represents the intercept term and Beta1 represents the slope/gradient/coefficient/weight, i.e. the rate of change of the target with respect to a small change in X1. The intercept is the constant value where the fitted line cuts the y-axis. The slope term (in this case Beta1) tells us how much the target changes for a unit change in X1.
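A quick sketch of that equation with made-up numbers:

beta0 = 2.0   # intercept: where the fitted line cuts the y-axis
beta1 = 0.5   # slope: change in the target for a unit change in X1
x1 = 10.0
y_hat = beta0 + beta1 * x1   # prediction from the fitted line
print(y_hat)                 # 7.0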
Well, you can find the derivative/slope/gradient by drawing a tangent line and observing its steepness. In this way, the slope tells us how the Beta terms need to be updated along the way. More fundamentally, the slope on the initial attempt will be steep; with each new parameter update it gets reduced. This parameter update happens iteratively until the cost attains its minimum value, a point alternatively known as the "convergence point". Intuitively, the convergence point is where the cost function is minimum, as you can see from the image below. This also indicates that at this point the model should stop learning. Keep in mind that both "loss" and "cost" are used in this context: loss means the error for one training example, while cost is the error averaged over the entire training dataset.
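To make the loss-versus-cost distinction concrete, here is a minimal sketch with made-up data, using squared error as the per-example loss:

y_true = [3.0, 5.0, 7.0]   # made-up targets
y_pred = [2.5, 5.5, 6.0]   # made-up predictions

def loss(y, y_hat):
    # squared error for a single training example
    return (y - y_hat) ** 2

# cost = loss averaged over the entire (tiny) dataset
cost = sum(loss(y, y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)
print(cost)   # mean squared error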
gradient descent formula
-------------------------
Beta0 = Beta0 - learning_rate * d(cost_function)/dBeta0
Beta1 = Beta1 - learning_rate * d(cost_function)/dBeta1
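
Putting the formula to work, below is a minimal sketch of gradient descent for simple linear regression with a mean squared error cost. The data, learning rate, and iteration count are illustrative assumptions, not tuned values.

gradient descent in python (illustrative sketch)
-------------------------------------------------
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 9.0, 10.8]          # roughly y = 1 + 2 * x with noise

beta0, beta1 = 0.0, 0.0                 # start from arbitrary values
learning_rate = 0.01
n = len(x)

for _ in range(2000):
    y_hat = [beta0 + beta1 * xi for xi in x]
    errors = [yi - yh for yi, yh in zip(y, y_hat)]
    # partial derivatives of the MSE cost with respect to each beta
    d_beta0 = (-2.0 / n) * sum(errors)
    d_beta1 = (-2.0 / n) * sum(e * xi for e, xi in zip(errors, x))
    # update step: move opposite to the gradient, scaled by the learning rate
    beta0 = beta0 - learning_rate * d_beta0
    beta1 = beta1 - learning_rate * d_beta1

print(beta0, beta1)                     # should land close to (1, 2)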