Algorithms Lesson 46 – Stochastic & Mini-Batch GD | Dataplexa

Stochastic & Mini-Batch Gradient Descent

In the previous lesson, we learned how standard Gradient Descent works by using the entire dataset to update parameters.

While this approach is mathematically clean, it becomes very slow and impractical when datasets grow large.

This is where two powerful variants come into play: Stochastic Gradient Descent and Mini-Batch Gradient Descent.


Why Standard Gradient Descent Is Not Enough

Standard Gradient Descent computes gradients using all data points before making a single update.

For small datasets, this is fine. But in real systems with millions of records, this becomes extremely slow.

We need faster updates.
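To make the cost concrete, here is a minimal full-batch sketch on a toy one-variable linear model (the dataset and learning rate are made up for illustration, not part of the lesson). Notice that every single parameter update loops over the entire dataset:

```python
# Full-batch gradient descent on a toy 1-D linear model y = w * x.
# The data is synthetic: targets are generated with the true weight w = 2.
data = [(x, 2.0 * x) for x in range(1, 11)]

w = 0.0
learning_rate = 0.01

for step in range(200):
    # One update = one pass over the ENTIRE dataset (cost grows with N).
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad

print(round(w, 3))  # converges toward the true weight 2.0
```

With 10 points this is instant, but the per-update cost scales linearly with the dataset size, which is exactly the problem the variants below address.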


Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent updates the model after every single data point.

Instead of waiting for the full dataset, it learns continuously.

# Pseudo-code for SGD
for each epoch:
    shuffle(dataset)
    for each data_point in dataset:
        gradient = compute_gradient(data_point)
        weights = weights - learning_rate * gradient

Each update is cheap, so the model starts improving long before it has seen the full dataset even once.
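The pseudo-code above can be turned into a runnable sketch on a toy one-variable linear model (the synthetic data and learning rate are assumptions for illustration):

```python
import random

# SGD on a toy 1-D linear model y = w * x: one update per data point.
random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 11)]  # synthetic, true weight = 2

w = 0.0
learning_rate = 0.005

for epoch in range(50):
    random.shuffle(data)            # visit the points in a fresh order
    for x, y in data:
        grad = 2 * (w * x - y) * x  # gradient from a SINGLE point
        w -= learning_rate * grad

print(round(w, 3))  # ends up near the true weight 2.0
```

Compare this with the full-batch version: here the weights move after every point, so one pass over the data performs ten updates instead of one.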


Intuition Behind SGD

Think of SGD like adjusting your direction after every step while walking downhill.

The path looks noisy, but each step is far cheaper, so on average you make progress toward the bottom much sooner.


Advantages of SGD

SGD is widely used because:

  • Each update is very fast
  • It uses very little memory per update
  • It scales well to large datasets

However, it also has downsides.


Problems with SGD

Because SGD updates parameters using only one data point, the updates can be very noisy.

This may cause:

  • Unstable convergence
  • Oscillations near the minimum

Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a compromise between full Gradient Descent and SGD.

Instead of using all data or just one record, it uses a small batch.

# Pseudo-code for mini-batch gradient descent
batch_size = 32

for batch in split(dataset, batch_size):
    gradient = compute_gradient(batch)  # averaged over the batch
    weights = weights - learning_rate * gradient
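As a runnable sketch (again with made-up synthetic data and hyperparameters), the batching step might look like this:

```python
import random

# Mini-batch gradient descent on a toy 1-D linear model y = w * x.
random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 101)]  # synthetic, true weight = 2

w = 0.0
learning_rate = 5e-5
batch_size = 32

for epoch in range(100):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Average the gradient over the batch before taking one step.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad

print(round(w, 3))  # approaches the true weight 2.0
```

Averaging over 32 points smooths out most of the single-point noise while still giving several updates per pass over the data.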

Why Mini-Batch Is the Industry Standard

Mini-Batch Gradient Descent balances:

  • Speed
  • Stability
  • Memory efficiency

This is why almost all modern ML frameworks use mini-batches by default.


Real-World Example

When training a neural network on millions of images, computing a gradient over all images at once would not fit in memory on typical hardware.

Mini-batches allow the model to learn efficiently using limited memory.


Choosing the Batch Size

Common batch sizes are:

  • 16
  • 32
  • 64
  • 128

Smaller batches give noisier gradient estimates but more frequent updates; larger batches give more stable estimates but fewer updates per epoch and a higher memory cost.
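The noise side of this trade-off can be measured directly: sample many batches of each size, compute a gradient estimate from each, and compare the spread. The setup below (toy noisy data, evaluation point, sample counts) is an illustration, not part of the lesson:

```python
import random
import statistics

# The spread of mini-batch gradient estimates shrinks as batch size grows.
random.seed(0)
data = [(x, 2.0 * x + random.gauss(0, 1)) for x in range(1, 101)]
w = 1.0  # evaluate gradients away from the optimum

def batch_gradient(batch):
    """Mean-squared-error gradient averaged over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

for batch_size in (1, 16, 64):
    estimates = [batch_gradient(random.sample(data, batch_size))
                 for _ in range(500)]
    # Larger batches -> noticeably smaller standard deviation.
    print(batch_size, round(statistics.stdev(estimates)))
```

Running this shows the standard deviation of the estimates dropping sharply as the batch size increases, which is why larger batches converge more steadily.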


Exercises

Exercise 1:
Why is standard Gradient Descent slow on large datasets?

Because it requires processing the entire dataset before every update.

Exercise 2:
What is the main drawback of SGD?

Its updates are noisy and may cause unstable convergence.

Exercise 3:
Why is Mini-Batch Gradient Descent preferred?

It balances speed, stability, and memory usage.

Quick Quiz

Q1. How often does SGD update the parameters?

After every single data point.

Q2. What does mini-batch mean?

Using a small subset of data for each update.

In the next lesson, we will explore ADAM and RMSProp, advanced optimizers that further improve convergence.