Algorithms Lesson 47 – ADAM & RMSProp

ADAM & RMSProp Optimizers

In the previous lesson, we explored Stochastic Gradient Descent and Mini-Batch Gradient Descent, which significantly improved training speed.

However, even these methods can struggle when the learning rate is not chosen properly.

To solve this, modern optimization algorithms adapt the learning rate automatically. Two of the most widely used optimizers are RMSProp and ADAM.


The Problem with a Fixed Learning Rate

A single learning rate for all parameters is often inefficient.

Some parameters need large updates, while others need very small adjustments.

Using one learning rate for all can cause slow learning or instability.
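
To see the problem concretely, here is a tiny sketch. The badly scaled quadratic loss, its curvatures, and the learning rate below are made up purely for illustration: a rate small enough to keep the steep parameter stable leaves the flat parameter crawling.

# Toy illustration of one fixed learning rate serving two parameters badly.
# Assumed loss: L(w) = 0.5 * (100 * w1^2 + 1 * w2^2), so gradient = (100 * w1, 1 * w2).
import numpy as np

curvature = np.array([100.0, 1.0])
w = np.array([1.0, 1.0])
learning_rate = 0.01   # any value above 0.02 would make the w1 updates diverge

for step in range(100):
    gradient = curvature * w
    w = w - learning_rate * gradient

print(w)   # w1 reaches 0 almost immediately; w2 is still around 0.37 after 100 steps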


RMSProp – Root Mean Square Propagation

RMSProp solves this problem by adapting the learning rate for each parameter individually.

It keeps track of the moving average of squared gradients.

# Conceptual RMSProp update (per parameter)
cache = decay_rate * cache + (1 - decay_rate) * gradient ** 2   # moving average of squared gradients
weights = weights - learning_rate * gradient / np.sqrt(cache + epsilon)

Dividing by the root of this running average shrinks the effective step for parameters whose gradients are consistently large, so updates stay controlled without a hand-tuned rate for every parameter.
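
Below is a minimal runnable sketch of this update, applied to the same badly scaled quadratic used above. The decay rate, epsilon, and step count are just common defaults chosen for illustration.

import numpy as np

# Same toy loss as above: L(w) = 0.5 * (100 * w1^2 + w2^2)
curvature = np.array([100.0, 1.0])

def rmsprop(steps=200, learning_rate=0.01, decay_rate=0.9, epsilon=1e-8):
    w = np.array([1.0, 1.0])
    cache = np.zeros_like(w)   # moving average of squared gradients, one entry per parameter
    for _ in range(steps):
        gradient = curvature * w
        cache = decay_rate * cache + (1 - decay_rate) * gradient ** 2
        w = w - learning_rate * gradient / np.sqrt(cache + epsilon)
    return w

print(rmsprop())   # both coordinates move toward 0 at a similar pace

Because each parameter divides by its own running average, the steep and flat directions now make comparable progress per step.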


Why RMSProp Works Well

RMSProp automatically reduces the step size when gradients become large.

This makes training:

  • More stable
  • Faster to converge
  • Less sensitive to learning rate choice

Real-World Analogy

Imagine walking downhill with adjustable brakes.

When the slope is steep, you slow down automatically.

When it is gentle, you move faster.

That is exactly what RMSProp does.


ADAM – Adaptive Moment Estimation

ADAM combines the best ideas from both RMSProp and Momentum.

It keeps track of:

  • A moving average of the gradients (the momentum part)
  • A moving average of the squared gradients (the RMSProp part)

# Conceptual ADAM update (t is the current step number, starting at 1)
m = beta1 * m + (1 - beta1) * gradient          # first moment (momentum)
v = beta2 * v + (1 - beta2) * gradient ** 2     # second moment (RMSProp-style)

m_hat = m / (1 - beta1 ** t)                    # bias correction: m and v start at zero
v_hat = v / (1 - beta2 ** t)

weights = weights - learning_rate * m_hat / np.sqrt(v_hat + epsilon)
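
For completeness, here is the same toy problem optimized with a minimal ADAM sketch. The hyperparameter values are the commonly used defaults; the loss is the same made-up quadratic used earlier.

import numpy as np

# Same toy loss: L(w) = 0.5 * (100 * w1^2 + w2^2)
curvature = np.array([100.0, 1.0])

def adam(steps=500, learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8):
    w = np.array([1.0, 1.0])
    m = np.zeros_like(w)   # first moment: moving average of gradients
    v = np.zeros_like(w)   # second moment: moving average of squared gradients
    for t in range(1, steps + 1):
        gradient = curvature * w
        m = beta1 * m + (1 - beta1) * gradient
        v = beta2 * v + (1 - beta2) * gradient ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero-initialized moments
        v_hat = v / (1 - beta2 ** t)
        w = w - learning_rate * m_hat / np.sqrt(v_hat + epsilon)
    return w

print(adam())   # both coordinates end up near 0; small oscillations around the minimum are normal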

Why ADAM Is So Popular

ADAM is often the default optimizer in deep learning frameworks because:

  • It converges fast
  • Works well with noisy gradients
  • Requires minimal tuning

In many cases, ADAM works well even without changing default settings.
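
As a usage sketch (assuming PyTorch; the tiny linear model and random data below are made up purely to show the API), choosing ADAM is usually a single line, and the defaults match the formulas above.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # a made-up one-layer model, just to have parameters to optimize
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)      # defaults: betas=(0.9, 0.999), eps=1e-8
# optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3) # drop-in alternative

x, y = torch.randn(64, 10), torch.randn(64, 1)                 # random placeholder data
loss_fn = nn.MSELoss()

for _ in range(100):
    optimizer.zero_grad()         # clear gradients from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()               # compute gradients
    optimizer.step()              # apply the ADAM update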


ADAM vs RMSProp

Both optimizers adapt the learning rate per parameter, but ADAM adds momentum and bias correction on top of RMSProp's scaling, so it often converges faster and more smoothly on complex problems.

RMSProp is simpler, with one fewer moving average to maintain, while ADAM is more adaptive.


When Should You Use Them?

In practice:

  • Use RMSProp for simpler models
  • Use ADAM for deep or complex models

Most production systems start with ADAM.


Exercises

Exercise 1:
What problem does RMSProp solve?

It adapts the learning rate for each parameter individually, preventing unstable updates when one global rate does not fit all parameters.

Exercise 2:
What two ideas does ADAM combine?

Momentum and RMSProp.

Exercise 3:
Why is ADAM widely used?

It converges fast and works well with minimal tuning.

Quick Quiz

Q1. What does RMSProp track?

Moving average of squared gradients.

Q2. What makes ADAM better than SGD?

Adaptive learning rates and momentum.

In the next lesson, we will study A* Search Algorithm and see how heuristics guide intelligent search.