DL Lesson 22 – Vanishing Gradients | Dataplexa

Vanishing and Exploding Gradients

In the previous lesson, we learned how early stopping helps us decide when to stop training.

In this lesson, we explore a deeper and more fundamental problem — why deep neural networks become hard to train as they grow deeper.

This difficulty is caused by vanishing gradients and exploding gradients.


Why Gradients Matter

During training, neural networks learn by adjusting weights using gradients computed during backpropagation.

Gradients tell the model:

“How much should this weight change to reduce error?”

If gradients are too small, learning slows to a crawl or stops; if they are too large, training becomes unstable.
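To make this concrete, here is a minimal sketch of a single gradient-descent update on one weight (the learning rate and gradient value are hypothetical):

```python
# One gradient-descent step: the gradient says which way (and how far)
# to move the weight to reduce the error.
w, lr = 2.0, 0.1
grad = 4.0          # dLoss/dw at the current weight (hypothetical value)
w = w - lr * grad   # move against the gradient
print(w)            # 1.6
```

If `grad` were nearly zero, the weight would barely move; if it were huge, the weight would jump far past any good value. That is the whole problem in miniature.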


What Is the Vanishing Gradient Problem?

Vanishing gradients occur when gradients become extremely small as they move backward through many layers.

When this happens:

Earlier layers receive almost no learning signal and stop updating.

As a result, the network behaves as if only the last few layers are learning.


Why Does This Happen?

During backpropagation, gradients are multiplied by derivatives of activation functions.

If each of those derivatives has magnitude less than 1, repeated multiplication shrinks the gradient exponentially with depth.

This problem was very common when using sigmoid and tanh activation functions.
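The effect is easy to see numerically. This self-contained NumPy sketch computes the sigmoid derivative at its peak and shows how quickly a backward signal shrinks across ten such layers:

```python
import numpy as np

def sigmoid_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# The sigmoid derivative is largest at z = 0, where it equals exactly 0.25
peak = sigmoid_deriv(0.0)
print(peak)        # 0.25

# Even in this best case, the backward signal through 10 sigmoid layers
# is scaled by at most 0.25 ** 10
print(peak ** 10)  # ~9.5e-07
```

After only ten layers the learning signal is less than one millionth of its original size, which is why early layers effectively stop updating.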


What Is the Exploding Gradient Problem?

Exploding gradients are the opposite situation.

Here, gradients grow exponentially as they move backward, becoming very large.

This causes:

• Unstable updates
• Wildly changing weights
• Sometimes numerical overflow
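The same multiplication works in reverse when the per-layer factors exceed 1. A quick sketch with illustrative numbers:

```python
# If each layer scales the backward signal by a factor greater than 1,
# the gradient grows exponentially with depth (numbers are illustrative).
factor, depth = 1.5, 30
grad_scale = factor ** depth
print(grad_scale)  # ~1.9e5 -- an update this large wrecks the weights
```

A modest per-layer factor of 1.5 turns into a gradient nearly 200,000 times larger after 30 layers.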


Real-World Analogy

Imagine giving instructions through a long chain of people.

If the message gets whispered too softly, it fades away — vanishing gradient.

If the message gets shouted louder at each step, it becomes chaos — exploding gradient.

Deep networks suffer from the same problem.


How This Affects Deep Networks

Very deep networks may:

• Learn extremely slowly
• Fail to converge
• Produce unstable loss values

This is why training deep models was nearly impossible before modern solutions were introduced.


Detecting Gradient Problems

Signs of vanishing gradients:

• Training loss stops decreasing
• Early layers stop changing

Signs of exploding gradients:

• Loss becomes NaN
• Training suddenly diverges
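A practical way to spot both problems is to log per-layer gradient magnitudes during training. The self-contained sketch below backpropagates by hand through a toy chain of scalar sigmoid units (not a real framework model) so the shrinking signal is visible directly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_layers = 10
weights = rng.normal(0.0, 0.5, n_layers)  # small random weights

# Forward pass through a chain of scalar sigmoid units
x, y = 1.0, 0.0
acts = [x]
for w in weights:
    acts.append(sigmoid(w * acts[-1]))

# Backward pass: each step multiplies the signal by sigmoid'(z) * w
grads = np.zeros(n_layers)
delta = acts[-1] - y                 # dLoss/da for loss 0.5 * (a - y)**2
for i in reversed(range(n_layers)):
    s = acts[i + 1]
    dz = delta * s * (1.0 - s)       # sigmoid derivative is at most 0.25
    grads[i] = dz * acts[i]          # gradient for this layer's weight
    delta = dz * weights[i]          # signal passed to the previous layer

print(np.abs(grads))  # magnitudes fall off sharply toward the first layer
```

In a real model you would inspect the gradient norms your framework reports per layer; if the earliest layers are orders of magnitude smaller (or the latest are exploding toward NaN), you have found the problem.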


Gradient Clipping (Practical Solution)

One common solution for exploding gradients is gradient clipping.

Gradients are limited to a fixed maximum value to prevent instability.

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,
    clipnorm=1.0  # rescale any gradient whose L2 norm exceeds 1.0
)

This prevents gradients from becoming too large while preserving learning direction.
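Under the hood, clipping by norm is a simple rescaling. Here is a minimal NumPy sketch of the idea (not Keras' actual implementation):

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    # If the gradient's L2 norm exceeds max_norm, rescale it;
    # the direction is unchanged, only the magnitude shrinks.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])   # L2 norm = 5.0
clipped = clip_by_norm(g)
print(clipped)             # same direction as g, but norm 1.0
```

Gradients already below the threshold pass through untouched, so well-behaved updates are unaffected.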


Modern Solutions to Vanishing Gradients

Several techniques were developed to solve vanishing gradients:

• ReLU and its variants
• Better weight initialization
• Batch normalization
• Residual connections

We will study each of these techniques in upcoming lessons.
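As a small preview of why ReLU helps (covered properly in a later lesson): its derivative is exactly 1 for active units, so the backward product does not shrink the way sigmoid's does. A quick NumPy comparison:

```python
import numpy as np

z = np.array([-2.0, 0.5, 1.0, 3.0])

s = 1.0 / (1.0 + np.exp(-z))
sigmoid_deriv = s * (1.0 - s)       # never exceeds 0.25
relu_deriv = (z > 0).astype(float)  # exactly 1 for positive inputs

print(sigmoid_deriv.max())  # at most 0.25
print(relu_deriv)           # [0. 1. 1. 1.]
```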


Mini Practice

Why do you think ReLU helps reduce vanishing gradients compared to sigmoid?


Exercises

Exercise 1:
Why do sigmoid activations cause vanishing gradients?

Because the sigmoid derivative is at most 0.25, and approaches zero for large positive or negative inputs, multiplying it across many layers shrinks the gradient exponentially.

Exercise 2:
What is the main risk of exploding gradients?

Training becomes unstable and may diverge.

Quick Quiz

Q1. Which problem affects early layers more?

Vanishing gradients.

Q2. What technique limits gradient size?

Gradient clipping.

In the next lesson, we will learn evaluation metrics and understand how to properly measure the performance of deep learning models.