DL Lesson 14 – Optimization Algorithms

Optimization Algorithms

In the previous lesson, we learned how Batch Normalization stabilizes activations inside deep neural networks.

Now we move into one of the most critical topics in Deep Learning — optimization algorithms.

Optimization algorithms decide how a neural network learns, how fast it learns, and whether it converges at all.


What Optimization Means in Deep Learning

Training a neural network means finding the best values for millions of parameters.

The goal is to minimize a loss function, which requires adjusting the weights in the correct direction at every step.

The optimizer controls this adjustment process.


Loss Surface and Learning Intuition

You can imagine training as moving downhill on a complex, high-dimensional surface.

Each point represents a set of weights, and the height represents the loss.

An optimizer decides:

• How big each step should be
• In which direction to move
• How to avoid unstable oscillations


Gradient Descent (The Foundation)

Almost all optimization algorithms are built on top of gradient descent.

At each step, the gradient tells us the direction in which the loss increases most steeply.

We move in the opposite direction to reduce error.

weights = weights - learning_rate * gradient

This simple rule is powerful, but on its own it is often inefficient for deep networks.
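
To make the update rule concrete, here is a minimal sketch (not part of the lesson's Keras code) of gradient descent on a toy one-dimensional loss, loss(w) = (w - 3)**2, whose gradient is 2 * (w - 3):

def gradient(w):
    return 2.0 * (w - 3.0)

w = 10.0              # illustrative starting weight
learning_rate = 0.1

for step in range(50):
    w = w - learning_rate * gradient(w)

print(w)  # ends very close to 3.0, the minimum of the toy loss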


Why Basic Gradient Descent Is Not Enough

Real loss surfaces are not smooth bowls.

They contain:

• Flat regions
• Steep valleys
• Local minima
• Saddle points

A naive optimizer can get stuck, oscillate wildly, or converge very slowly.


Stochastic Gradient Descent (SGD)

Instead of computing gradients over the entire dataset, SGD updates the weights using one example or a small mini-batch at a time.

This adds noise to the updates, which, perhaps surprisingly, helps the optimizer escape poor local minima.

SGD became the backbone of modern deep learning.

from tensorflow.keras.optimizers import SGD

optimizer = SGD(learning_rate=0.01)
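
As a quick usage sketch, the optimizer object is passed to model.compile; the small model below and the commented-out training call are placeholders, not part of this lesson's code:

from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Placeholder two-layer model, used only to show where the optimizer goes
model = Sequential([
    Input(shape=(20,)),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=SGD(learning_rate=0.01),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# batch_size sets the mini-batch size used for each weight update
# model.fit(x_train, y_train, epochs=10, batch_size=32)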

Momentum: Learning with Memory

Momentum improves SGD by keeping a running average of past gradients.

Instead of reacting only to the current gradient, updates build up inertia in directions that stay consistent from step to step.

This accelerates convergence and reduces oscillations.

optimizer = SGD(
    learning_rate=0.01,
    momentum=0.9
)
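
To see the "memory" at work, here is a hand-written sketch of the classical momentum update on the same toy loss as above; the variable names are illustrative:

def gradient(w):
    return 2.0 * (w - 3.0)   # same toy loss as in the gradient descent sketch

w = 10.0
velocity = 0.0
learning_rate = 0.01
momentum = 0.9

for step in range(300):
    # velocity accumulates past gradients; the weight follows the velocity
    velocity = momentum * velocity - learning_rate * gradient(w)
    w = w + velocity

print(w)  # approaches 3.0; the velocity builds up speed in the consistent downhill direction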

Adaptive Learning Rate Methods

One learning rate does not fit all parameters.

Some weights need large updates, others need tiny corrections.

Adaptive optimizers adjust learning rates automatically during training.
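
As a rough sketch of the idea, here is an RMSprop-style rule written by hand (illustrative names, not a library implementation): each parameter's step is divided by a running average of its own squared gradients, so parameters with consistently large gradients take smaller effective steps.

import numpy as np

def gradient(w):
    return 2.0 * (w - 3.0)          # toy gradient, applied elementwise

w = np.array([10.0, -5.0])          # two parameters with different starting points
avg_sq_grad = np.zeros_like(w)
learning_rate = 0.01
decay = 0.9
eps = 1e-8

for step in range(1500):
    g = gradient(w)
    # running average of squared gradients, one value per parameter
    avg_sq_grad = decay * avg_sq_grad + (1 - decay) * g ** 2
    w = w - learning_rate * g / (np.sqrt(avg_sq_grad) + eps)

print(w)  # both parameters end up close to 3.0, each with its own effective step size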


Adam Optimizer (Industry Standard)

Adam combines momentum with adaptive learning rates.

It works well out of the box and is widely used in practice.

from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001)

Adam is often the default choice when training deep networks.
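
Adam's main hyperparameters can also be set explicitly; the values below are the standard Keras defaults and rarely need tuning:

from tensorflow.keras.optimizers import Adam

# beta_1 and beta_2 control the running averages of the gradient and the
# squared gradient; epsilon guards against division by zero.
optimizer = Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
)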


Choosing the Right Optimizer

There is no universally best optimizer.

The choice depends on:

• Model architecture
• Dataset size
• Training stability
• Generalization behavior

Understanding optimizers gives you control over learning itself.


Exercises

Exercise 1:
Why is momentum useful in optimization?

Momentum accelerates learning and reduces oscillations by using past gradients.

Exercise 2:
Why are adaptive optimizers helpful?

They automatically adjust learning rates for different parameters.

Quick Quiz

Q1. What is the main role of an optimizer?

To control how weights are updated to minimize loss.

Q2. Is Adam always the best choice?

No. Different problems may require different optimizers.

In the next lesson, we will bring everything together and study the Deep Learning Training Pipeline — from initialization to convergence.