Stochastic & Mini-Batch Gradient Descent
In the previous lesson, we learned how standard Gradient Descent works by using the entire dataset to update parameters.
While this approach is mathematically clean, it becomes very slow and impractical when datasets grow large.
This is where two powerful variants come into play: Stochastic Gradient Descent and Mini-Batch Gradient Descent.
Why Standard Gradient Descent Is Not Enough
Standard Gradient Descent computes gradients using all data points before making a single update.
For small datasets, this is fine. But in real systems with millions of records, this becomes extremely slow.
We need faster updates.
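For reference, the full-batch update described above can be sketched in a few lines. This is a minimal sketch, assuming a 1D linear model with synthetic data and a squared-error loss (none of which appear in the lesson itself):

```python
import numpy as np

# Synthetic data: y = 3x + noise (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X + rng.normal(0, 0.1, size=200)

w = 0.0                # single weight, no bias, for simplicity
learning_rate = 0.1

for epoch in range(100):
    # Gradient of mean squared error over the ENTIRE dataset:
    # every update touches all 200 points before w changes once
    predictions = w * X
    gradient = 2 * np.mean((predictions - y) * X)
    w = w - learning_rate * gradient
```

With 200 points this loop is instant; the point is that each of the 100 updates requires a full pass over the data, which is exactly what becomes prohibitive at scale.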
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent updates the model after every single data point, typically visiting the data in random order (hence "stochastic").
Instead of waiting for the full dataset, it learns continuously.
# Pseudo-code for SGD
for data_point in dataset:
    gradient = compute_gradient(data_point)
    weights = weights - learning_rate * gradient
Each update is cheap, so the model starts improving long before it has seen the whole dataset.
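The pseudo-code above can be made concrete. The following is a hedged sketch, assuming the same kind of setup as before: a 1D linear model fit to synthetic data, with per-epoch shuffling (standard practice for SGD, though not shown in the pseudo-code):

```python
import numpy as np

# Synthetic data: y = 3x + noise (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X + rng.normal(0, 0.1, size=200)

w = 0.0
learning_rate = 0.1

for epoch in range(5):
    # Visit the data in a fresh random order each epoch
    for i in rng.permutation(len(X)):
        # Gradient computed from a SINGLE data point
        gradient = 2 * (w * X[i] - y[i]) * X[i]
        w = w - learning_rate * gradient
```

After only 5 passes (1,000 tiny updates) the weight lands near the true value of 3.0, whereas the full-batch version needed a complete pass over the data for every single update.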
Intuition Behind SGD
Think of SGD like adjusting your direction after every step while walking downhill.
The path looks noisy, but you typically reach the neighborhood of the bottom much sooner.
Advantages of SGD
SGD is widely used because:
- It is fast, making an update after every data point
- It uses very little memory, since only one data point is needed at a time
- It works well with large (even streaming) datasets
However, it also has downsides.
Problems with SGD
Because SGD updates parameters using only one data point, the updates can be very noisy.
This may cause:
- Unstable convergence
- Oscillations near the minimum
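This noise is easy to observe directly. The sketch below (again assuming a 1D linear fit on synthetic data, with deliberately noisy labels) records the weight after every SGD update and measures how much it keeps fluctuating late in training:

```python
import numpy as np

# Synthetic data with NOISY labels: y = 3x + noise (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X + rng.normal(0, 0.5, size=200)

w = 0.0
learning_rate = 0.1
trace = []

for epoch in range(20):
    for i in rng.permutation(len(X)):
        # One-point gradient: noisy estimate of the true gradient
        w = w - learning_rate * 2 * (w * X[i] - y[i]) * X[i]
        trace.append(w)

# Even after convergence, w keeps oscillating around the true value
late_spread = float(np.std(trace[-500:]))
```

The spread of the last 500 recorded weights stays clearly above zero: with a fixed learning rate, single-point updates never settle exactly at the minimum.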
Mini-Batch Gradient Descent
Mini-Batch Gradient Descent is a compromise between full Gradient Descent and SGD.
Instead of using all data or just one record, it uses a small batch.
# Pseudo-code for Mini-Batch Gradient Descent
batch_size = 32
for batch in batches(dataset, batch_size):  # groups of 32 records
    gradient = compute_gradient(batch)
    weights = weights - learning_rate * gradient
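A runnable version of this idea might look as follows. This is a sketch under the same assumptions as before (1D linear model, synthetic data), with the batching done by slicing a shuffled index array:

```python
import numpy as np

# Synthetic data: y = 3x + noise (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=256)
y = 3.0 * X + rng.normal(0, 0.1, size=256)

w = 0.0
learning_rate = 0.1
batch_size = 32

for epoch in range(30):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient averaged over ONE mini-batch of 32 points
        gradient = 2 * np.mean((w * Xb - yb) * Xb)
        w = w - learning_rate * gradient
```

Averaging over 32 points smooths out most of the single-point noise while still giving 8 updates per pass over the data, rather than 1.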
Why Mini-Batch Is the Industry Standard
Mini-Batch Gradient Descent balances:
- Speed
- Stability
- Memory efficiency
This is why almost all modern ML frameworks use mini-batches by default.
Real-World Example
When training a neural network on millions of images, computing gradients over all images at once would not fit in memory.
Mini-batches allow the model to learn efficiently using limited memory.
Choosing the Batch Size
Common batch sizes are:
- 16
- 32
- 64
- 128
Smaller batches make more frequent but noisier updates; larger batches give smoother gradient estimates but use more memory and make fewer updates per pass over the data.
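Whatever size you choose, splitting a dataset into batches is a one-liner. A minimal sketch (the helper name `make_batches` is an assumption, not a standard library function):

```python
def make_batches(data, batch_size):
    """Split a sequence into consecutive chunks of at most batch_size items."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

dataset = list(range(100))
batches = make_batches(dataset, 32)
print([len(b) for b in batches])  # → [32, 32, 32, 4]
```

Note the final batch is smaller than the rest; most frameworks either keep it or offer an option to drop it.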
Exercises
Exercise 1:
Why is standard Gradient Descent slow on large datasets?
Exercise 2:
What is the main drawback of SGD?
Exercise 3:
Why is Mini-Batch Gradient Descent preferred?
Quick Quiz
Q1. After how many data points does SGD update the parameters?
Q2. What does the term "mini-batch" refer to?
In the next lesson, we will explore ADAM and RMSProp, advanced optimizers that further improve convergence.