# Explain Gradient Descent optimization approach.

Easy Last updated on May 7, 2022, 1:17 a.m.

Gradient Descent is a method to learn parameters in Machine Learning Algorithms when other optimization Algorithms cannot provide close form solutions. It is a first-order iterative algorithm for finding the minimum of a function. We use this algorithm to minimize the loss function, and it works on convex optimization.

Let us see this basic example:

Step 0: Initialize w, l(learning rate)
Step 1: While Stopping Criterion not meet do:
#Mini_Batch Logic#
Compute Slope of function (say f'(x,w))
update w:=w-l*f'(x,w)
end While

If we look at slope at w1 point, we will see its slope is negative, so the value of W should move in right direction. Whereas, if we look at the slope of w2 Point, it is Positive. Therefore update of W will make it move in the left direction.

Batch gradient descent computes the gradient using the whole dataset. It is excellent for convex functions, as we move somewhat directly towards an optimum solution, either local or global. However, the disadvantage of Batch Gradient Descent Algorithm is that gradient is being computed for the whole dataset, so it is computationally expensive.

For example, we train a dog breed classifier (CNN) with 1 million image data for 1000 epochs. We will now have to calculate the gradient value using 1 million images for each epoch weight update.

Stochastic gradient descent (SGD) on the other hand, computes the gradient using a single sample. It tries to learn the direction of loss minimization using just one example, which is not a good approximation of the dataset, thus creating a noisier gradient. It reaches minima faster than Batch gradient descent but is not guaranteed to converge to global/optimum minima.
“Stochastic Gradient Descent(SGD) often converges much faster compared to Gradient Descent, but the error function is not as minimized. Often in most cases, the close approximation that we get in SGD for the parameter values are enough because they reach the optimal values and keep oscillating there.”

Mini-Batch gradient descent is what we mostly use in real-world scenarios. Here we divide our dataset into mini-batches(usually 64,128 cause of RAM storage) and then approximate our gradient based on those mini-batches, which turns out to be a better approximation than what we obtained with one example.