ADAM & RMSProp Optimizers
In the previous lesson, we explored Stochastic Gradient Descent and Mini-Batch Gradient Descent, which significantly improved training speed.
However, even these methods can struggle when the learning rate is not chosen properly.
To solve this, modern optimization algorithms adapt the learning rate automatically. Two of the most widely used optimizers are RMSProp and ADAM.
The Problem with a Fixed Learning Rate
A single global learning rate for all parameters is often inefficient.
Some parameters need large updates, while others need very small adjustments.
One shared value therefore tends to be too small for some directions (slow learning) and too large for others (instability).
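To see the issue concretely, here is a minimal sketch (pure Python; the loss function and learning rate are illustrative choices, not from this lesson). One direction of the loss is 100 times more curved than the other, so no single learning rate fits both parameters:

```python
# Gradient descent on f(w1, w2) = 100*w1**2 + w2**2.
# The curvature along w1 is 100x stronger than along w2.

def step(w, lr):
    g = (200 * w[0], 2 * w[1])       # partial derivatives of f
    return (w[0] - lr * g[0], w[1] - lr * g[1])

w = (1.0, 1.0)
for _ in range(50):
    # lr = 0.009 is near the stability limit for w1 (which is 0.01),
    # yet far too small to make real progress on w2.
    w = step(w, lr=0.009)

print(w)  # w1 has converged; w2 is still far from 0
```

Adaptive optimizers give each parameter its own effective step size, so this trade-off disappears.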
RMSProp – Root Mean Square Propagation
RMSProp solves this problem by adapting the learning rate for each parameter individually.
It keeps track of the moving average of squared gradients.
# Conceptual RMSProp update
cache = decay_rate * cache + (1 - decay_rate) * gradient**2
weights = weights - learning_rate * gradient / sqrt(cache + epsilon)
Dividing by the root mean square of recent gradients keeps the effective step size in check: parameters with consistently large gradients take smaller steps, and parameters with small gradients take relatively larger ones.
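The update above can be turned into a small runnable sketch (NumPy; the function names and hyperparameter values here are illustrative assumptions, not part of this lesson):

```python
import numpy as np

def rmsprop_minimize(grad_fn, w, learning_rate=0.01,
                     decay_rate=0.9, epsilon=1e-8, steps=500):
    """Minimize a function given its gradient, using the RMSProp update."""
    cache = np.zeros_like(w)  # moving average of squared gradients
    for _ in range(steps):
        gradient = grad_fn(w)
        cache = decay_rate * cache + (1 - decay_rate) * gradient**2
        w = w - learning_rate * gradient / np.sqrt(cache + epsilon)
    return w

# Example: f(w) = 100*w0**2 + w1**2, whose gradient is [200*w0, 2*w1].
grad = lambda w: np.array([200.0 * w[0], 2.0 * w[1]])
w_final = rmsprop_minimize(grad, np.array([1.0, 1.0]))
print(w_final)
```

Both coordinates approach zero despite their very different curvatures, which a single fixed learning rate struggles with.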
Why RMSProp Works Well
RMSProp automatically reduces the step size when gradients become large.
This makes training:
- More stable
- Faster to converge
- Less sensitive to learning rate choice
Real-World Analogy
Imagine walking downhill with adjustable brakes.
When the slope is steep, you slow down automatically.
When it is gentle, you move faster.
That is exactly what RMSProp does.
ADAM – Adaptive Moment Estimation
ADAM combines the best ideas from both RMSProp and Momentum.
It keeps track of:
- A moving average of the gradients (momentum)
- A moving average of the squared gradients (as in RMSProp)
# Conceptual ADAM update
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient**2
m_hat = m / (1 - beta1**t)  # bias correction, t = current timestep
v_hat = v / (1 - beta2**t)
weights = weights - learning_rate * m_hat / (sqrt(v_hat) + epsilon)
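Here is a runnable sketch of the full update (NumPy; the hyperparameters shown are the commonly used defaults, and epsilon is placed outside the square root as in the original ADAM paper):

```python
import numpy as np

def adam_minimize(grad_fn, w, learning_rate=0.05,
                  beta1=0.9, beta2=0.999, epsilon=1e-8, steps=1000):
    """Minimize a function given its gradient, using the ADAM update."""
    m = np.zeros_like(w)  # first moment: moving average of gradients
    v = np.zeros_like(w)  # second moment: moving average of squared gradients
    for t in range(1, steps + 1):
        gradient = grad_fn(w)
        m = beta1 * m + (1 - beta1) * gradient
        v = beta2 * v + (1 - beta2) * gradient**2
        m_hat = m / (1 - beta1**t)  # bias correction for zero-initialized m
        v_hat = v / (1 - beta2**t)  # bias correction for zero-initialized v
        w = w - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return w

# Example: f(w) = 100*w0**2 + w1**2, whose gradient is [200*w0, 2*w1].
grad = lambda w: np.array([200.0 * w[0], 2.0 * w[1]])
w_final = adam_minimize(grad, np.array([1.0, 1.0]))
print(w_final)
```

The bias-correction terms matter early in training: because m and v start at zero, the raw averages underestimate the true moments until t grows.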
Why ADAM Is So Popular
ADAM is often the default optimizer in deep learning frameworks because:
- It converges fast
- Works well with noisy gradients
- Requires minimal tuning
In many cases, ADAM works well even without changing default settings.
ADAM vs RMSProp
Both optimizers are powerful, but ADAM usually converges faster on complex problems because its momentum term smooths out noisy update directions.
RMSProp is simpler, with one fewer hyperparameter, while ADAM is more adaptive.
When Should You Use Them?
In practice:
- Use RMSProp for simpler models
- Use ADAM for deep or complex models
Most production systems start with ADAM.
Exercises
Exercise 1:
What problem does RMSProp solve?
Exercise 2:
What two ideas does ADAM combine?
Exercise 3:
Why is ADAM widely used?
Quick Quiz
Q1. What does RMSProp track?
Q2. What makes ADAM better than SGD?
In the next lesson, we will study A* Search Algorithm and see how heuristics guide intelligent search.