DL Lesson 47 – Vanishing Gradients | Dataplexa

Vanishing Gradients in Recurrent Neural Networks

Recurrent Neural Networks introduced memory into deep learning, but they also introduced a serious training challenge known as vanishing gradients.

This problem explains why early RNNs struggled with long sequences and why newer architectures were eventually created.


What Is a Gradient in Simple Terms?

During training, neural networks learn by adjusting weights.

These adjustments are guided by gradients, which tell the model how much each weight contributed to an error.

If gradients are strong, learning is effective. If gradients become extremely small, learning slows down or completely stops.
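To make this concrete, here is a minimal sketch using TensorFlow (the same library used later in this lesson); the toy weight, input, and target values are chosen purely for illustration:

import tensorflow as tf

# One trainable weight and a toy squared-error loss: loss = (w * x - target)^2
w = tf.Variable(2.0)
x, target = 3.0, 12.0

with tf.GradientTape() as tape:
    prediction = w * x
    loss = (prediction - target) ** 2

# The gradient says how the loss changes as w changes, and therefore
# in which direction (and roughly how strongly) to adjust w.
grad = tape.gradient(loss, w)
print(grad.numpy())  # -36.0, so increasing w reduces the error

Here the gradient is large, so a single update moves the weight noticeably. The vanishing gradient problem is about what happens when this number becomes almost zero.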


Why RNNs Are Especially Vulnerable

RNNs reuse the same weights at every time step.

When training an RNN, gradients must flow backward through every step in the sequence.

For long sequences, this means gradients are multiplied many times during backpropagation.

If the per-step factors in this product are smaller than 1, repeated multiplication shrinks the gradient exponentially.

This is the vanishing gradient problem.
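The effect is easy to simulate with plain Python. The per-step factor of 0.5 below is an assumption chosen only to illustrate the arithmetic; real factors depend on the weights and activations:

# Toy illustration, not a real RNN: the same per-step factor is applied
# once for every time step during backpropagation through time.
per_step_factor = 0.5  # stands in for |activation derivative| * |recurrent weight|

gradient_signal = 1.0
for step in range(1, 21):
    gradient_signal *= per_step_factor
    if step in (5, 10, 20):
        print(f"after {step:2d} steps: {gradient_signal:.8f}")

# after  5 steps: 0.03125000
# after 10 steps: 0.00097656
# after 20 steps: 0.00000095

After 20 steps the signal is about a million times weaker than when it started.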


What Happens When Gradients Vanish?

When gradients vanish, the network:

- Fails to learn long-term dependencies
- Focuses only on recent inputs
- Appears to train but never truly improves

In practice, this means the model forgets early information in a sequence.


Real-World Example

Consider a sentence:

"I grew up in France, so I speak fluent ____."

To predict the missing word, the model must remember information from the beginning of the sentence.

A basic RNN often fails here because the signal from early words vanishes before it reaches the prediction step.


Mathematical Intuition

During backpropagation through time, the gradient is multiplied at every time step by the derivative of the activation function (and by the recurrent weights).

For sigmoid, this derivative is at most 0.25; for tanh, it is at most 1 and drops quickly away from zero.

Repeated multiplication of such fractions causes the gradient to shrink rapidly:

(0.5) ^ 20 ≈ 0.000001

Such tiny gradients are effectively useless for learning.
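A quick NumPy check makes the derivative claim visible (NumPy is used here only for illustration; the lesson's other snippets use TensorFlow):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.0, 1.0, 2.0, 3.0])

sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))  # derivative of sigmoid
tanh_grad = 1.0 - np.tanh(x) ** 2               # derivative of tanh

print(sigmoid_grad.round(3))  # approximately [0.25, 0.197, 0.105, 0.045]
print(tanh_grad.round(3))     # approximately [1.0, 0.42, 0.071, 0.01]

The sigmoid derivative never exceeds 0.25, and the tanh derivative falls well below 1 as soon as the activation moves away from zero, so the factors multiplied together during backpropagation are almost always fractions.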


Why This Is a Serious Problem

Vanishing gradients prevent RNNs from:

- Learning long sentences
- Modeling long time-series dependencies
- Capturing distant context in speech or text

This limitation made early RNNs unreliable for many real-world tasks.


Early Mitigation Techniques

Before advanced architectures existed, researchers tried several partial solutions:

- Careful weight initialization
- Using ReLU instead of tanh
- Gradient clipping

While helpful, these approaches did not fully solve the problem.
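As a rough sketch of how these mitigations might be combined in Keras (the layer size, input shape, and clipping threshold below are placeholder values, not recommendations):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 8)),           # (time steps, features)
    tf.keras.layers.SimpleRNN(
        64,
        activation="relu",                     # ReLU instead of the default tanh
        recurrent_initializer="orthogonal",    # careful recurrent weight initialization
    ),
    tf.keras.layers.Dense(1),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(clipnorm=1.0),  # gradient clipping
    loss="mse",
)

Even with all three in place, a plain RNN still tends to lose information over long sequences; these tricks delay the problem rather than remove it.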


Gradient Clipping (Simple Fix)

Gradient clipping rescales gradients whose norm exceeds a chosen threshold, preventing them from growing too large. It mainly addresses the opposite problem, exploding gradients, rather than vanishing ones.

import tensorflow as tf

# Rescale the overall gradient whenever its norm exceeds 1.0
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)

This improves stability but does not restore long-term memory.


The Breakthrough Insight

The real solution required changing the architecture itself.

Instead of forcing gradients to flow through many steps, networks needed a mechanism to selectively remember important information.

This insight led directly to advanced sequence models.


Exercises

Exercise 1:
Why do gradients vanish more severely in long sequences?

Because gradients are multiplied repeatedly during backpropagation through time.

Exercise 2:
Why do sigmoid and tanh contribute to vanishing gradients?

Their derivatives are typically less than 1, causing exponential shrinkage.

Quick Check

Q: Does gradient clipping completely solve the vanishing gradient problem?

No. It stabilizes training but does not enable long-term memory.

Understanding vanishing gradients explains why basic RNNs struggle with long-term dependencies.

In the next lesson, we will explore architectures specifically designed to overcome this limitation.