DL Lesson 7 – Gradient Descent Variants

In the previous lesson, we learned how loss functions measure how wrong a neural network’s predictions are.

Now we answer the most important question in training:

How does the network reduce this loss?

The answer is gradient descent.


What Is Gradient Descent?

Gradient descent is an optimization algorithm. Its job is to adjust the weights of a neural network so that the loss becomes smaller step by step.

Mathematically, it works by computing how much the loss changes when each weight changes, and then updating the weights in the opposite direction.

This direction is determined by the gradient: the vector of partial derivatives of the loss with respect to the weights.


Real-World Intuition

Imagine you are standing on a mountain covered in fog. Your goal is to reach the lowest point in the valley.

You cannot see the whole landscape, but you can feel the slope beneath your feet.

At every step, you move slightly downhill. Eventually, you reach the bottom.

This is exactly how gradient descent works.


The Basic Gradient Descent Update Rule

Every weight in the network is updated using this idea:

new_weight = old_weight - learning_rate * gradient

Each part of this equation matters.

The gradient tells us the direction of steepest increase. We subtract it to move downhill.

The learning rate controls how big each step is.
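
To make the rule concrete, here is a minimal sketch on a single weight, assuming a toy loss loss(w) = (w - 3)^2 chosen purely for illustration, so its gradient is 2 * (w - 3):

# Toy loss: loss(w) = (w - 3)**2, whose gradient is 2 * (w - 3)
w = 0.0                  # old_weight (starting value)
learning_rate = 0.1

for step in range(25):
    gradient = 2 * (w - 3)                # direction of steepest increase
    w = w - learning_rate * gradient      # step downhill

print(w)   # approaches 3.0, the weight that minimizes the toy loss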


Why Do We Need Variants?

In theory, basic gradient descent works. In practice, it is slow and inefficient for large datasets and deep networks.

Modern deep learning models can have millions or even billions of parameters. Computing gradients using the entire dataset for every single update is expensive.

That is why we use different variants.


Batch Gradient Descent

Batch gradient descent computes gradients using the entire training dataset before making a single weight update.

This approach is stable and accurate, but it is very slow and requires a lot of memory.
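
As an illustrative sketch, assume a small linear-regression problem built with NumPy (the data and learning rate below are arbitrary toy choices, not part of the lesson). A single batch update then looks like this; note that the gradient averages over every sample:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))             # toy inputs: 1000 samples, 3 features
y = X @ np.array([1.5, -2.0, 0.5])         # toy linear targets
w, learning_rate = np.zeros(3), 0.1

# One update of batch gradient descent: the gradient uses ALL 1000 samples
gradient = X.T @ (X @ w - y) / len(X)      # gradient of 1/2 * mean squared error over the full dataset
w -= learning_rate * gradient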

It is rarely used in deep learning today.


Stochastic Gradient Descent (SGD)

Stochastic gradient descent updates the weights using only one training example at a time.

This makes learning much faster, but also noisier.

The loss may fluctuate, but the model often converges faster than batch gradient descent.

# Conceptual example: one pass over the training data
for x, y in training_data:                  # one sample at a time
    gradient = compute_gradient(w, x, y)    # gradient from this single sample only
    w = w - learning_rate * gradient        # update immediately
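
A runnable version of the same idea, again assuming a toy NumPy linear-regression setup (the data and learning rate are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))               # toy inputs: 200 samples, 3 features
y = X @ np.array([1.5, -2.0, 0.5])          # toy linear targets
w, learning_rate = np.zeros(3), 0.01

# One epoch of SGD: one noisy update per (shuffled) sample
for i in rng.permutation(len(X)):
    gradient = X[i] * (X[i] @ w - y[i])     # gradient of 1/2 * squared error for one sample
    w -= learning_rate * gradient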


Mini-Batch Gradient Descent

Mini-batch gradient descent is the most commonly used variant.

Instead of using one sample or the entire dataset, it uses small batches of data.

This provides a balance between speed and stability.

Almost all modern deep learning frameworks use this approach internally.
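
A sketch of one epoch of mini-batch updates, again assuming a toy NumPy linear-regression setup and a batch size of 32 (both are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))              # toy inputs: 1000 samples, 3 features
y = X @ np.array([1.5, -2.0, 0.5])          # toy linear targets
w, learning_rate, batch_size = np.zeros(3), 0.1, 32

# One epoch: shuffle once, then make one update per mini-batch of 32 samples
order = rng.permutation(len(X))
for start in range(0, len(X), batch_size):
    idx = order[start:start + batch_size]
    gradient = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # average gradient over this batch only
    w -= learning_rate * gradient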


Why Learning Rate Is Critical

If the learning rate is too small, training becomes extremely slow.

If it is too large, the model may overshoot the minimum and fail to converge.

Choosing the right learning rate is one of the most important skills in deep learning.
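
Both failure modes are easy to see on the same toy loss loss(w) = (w - 3)^2 used above (an illustrative assumption, not from the lesson):

# Toy loss: loss(w) = (w - 3)**2, gradient 2 * (w - 3)
def run(learning_rate, steps=20):
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * 2 * (w - 3)
    return w

print(run(0.001))   # too small: w barely moves away from 0 after 20 steps
print(run(0.1))     # reasonable: w ends up close to the minimum at 3
print(run(1.1))     # too large: each step overshoots and w diverges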


Gradient Descent in Practice

In real training code, we do not manually update weights.

Frameworks handle this internally.

model.compile(
    optimizer="sgd",                     # stochastic gradient descent
    loss="categorical_crossentropy"
)
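
The mini-batch size is then chosen when training starts. Here x_train and y_train are placeholders for your own data:

# Each update uses a mini-batch of 32 samples; the framework computes and applies the gradients
model.fit(x_train, y_train, batch_size=32, epochs=10)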

Different optimizers are built on top of gradient descent.

We will study them in upcoming lessons.


Exercises

Exercise 1:
Why is batch gradient descent slow for deep learning?

Because it computes gradients using the entire dataset for every update.

Exercise 2:
Why does stochastic gradient descent introduce noise?

Because updates are based on individual samples rather than averages.

Quick Quiz

Q1. Which gradient descent variant is most commonly used?

Mini-batch gradient descent.

Q2. What happens if learning rate is too large?

Training becomes unstable and may not converge.

In the next lesson, we will study advanced optimization techniques that improve gradient descent and make training faster and more stable.