Batch Normalization
In the previous lesson, we studied Dropout, a technique that reduces overfitting by randomly disabling neurons during training.
In this lesson, we focus on a technique that improves training stability and speed — Batch Normalization.
Batch Normalization is one of the most important breakthroughs that made very deep neural networks practical.
The Core Problem: Internal Covariate Shift
As a neural network trains, the distribution of activations inside the network keeps changing.
When earlier layers update their weights, later layers suddenly receive inputs with new ranges and patterns.
This constant shift slows down training and makes optimization unstable.
This problem is known as internal covariate shift.
What Batch Normalization Does
Batch Normalization stabilizes learning by normalizing activations within each mini-batch.
Instead of letting activation values drift unpredictably, BatchNorm keeps each layer's inputs on a consistent scale from one training step to the next.
The network can then train faster and scale to greater depth.
How Normalization Works (Intuition)
For a given batch of activations:
First, the batch mean is subtracted from each feature. Then, the values are divided by the batch standard deviation (with a small constant added for numerical stability).
This produces normalized values with a stable scale across training steps.
Importantly, BatchNorm does NOT remove model expressiveness. It adds two learnable parameters, a scale (gamma) and a shift (beta), that let the network rescale the values or even undo the normalization if needed.
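To make this concrete, here is a minimal NumPy sketch of the transformation for a single feature across one mini-batch (the input values, the small constant eps, and the initial gamma/beta are illustrative, not taken from a real network):

import numpy as np

# One feature observed across a mini-batch of 4 examples (illustrative values).
x = np.array([2.0, 4.0, 6.0, 8.0])

eps = 1e-3               # small constant for numerical stability
gamma, beta = 1.0, 0.0   # learnable scale and shift, shown at their initial values

mean = x.mean()                          # batch mean
var = x.var()                            # batch variance
x_hat = (x - mean) / np.sqrt(var + eps)  # normalized: zero mean, unit variance
y = gamma * x_hat + beta                 # learnable rescaling and shifting

print(x_hat)   # approximately [-1.34, -0.45, 0.45, 1.34]
print(y)       # identical here, because gamma is 1 and beta is 0

Because gamma and beta are trained together with the rest of the network, the layer can restore any scale and offset that turns out to be useful.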
Where Batch Normalization Is Applied
Batch Normalization is usually applied after the linear transformation and before the activation function.
This ordering matters: the activation function then always receives inputs with a consistent scale, which is what keeps training stable.
Code Example: Batch Normalization Layer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

model = Sequential([
    Dense(128),               # linear transformation (no activation yet)
    BatchNormalization(),     # normalize the pre-activation values
    Activation("relu"),       # nonlinearity applied to the normalized values
    Dense(64),
    BatchNormalization(),
    Activation("relu"),
    Dense(1)                  # output layer
])
Here, BatchNormalization is applied before each activation function.
Why Batch Normalization Helps Training
With normalized activations, the optimizer does not need to fight unstable gradients.
This allows:
• Faster convergence
• Higher learning rates
• More stable training in deep networks
BatchNorm also has a mild regularizing effect: the mean and variance computed on each mini-batch add a small amount of noise during training, which in some cases reduces the need for other regularizers.
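To illustrate the higher-learning-rate point, the model defined earlier could be compiled with a larger step size than the usual Adam default of 1e-3 (the 1e-2 value and the mean-squared-error loss are illustrative choices, not a recommendation from this lesson):

from tensorflow.keras.optimizers import Adam

# A learning rate above the Adam default often remains stable
# when BatchNorm keeps the activations well scaled (illustrative value).
model.compile(optimizer=Adam(learning_rate=1e-2), loss="mse")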
Batch Normalization vs Dropout
Batch Normalization stabilizes learning; Dropout prevents the network from relying too heavily on any individual neuron.
They solve different problems and are often used together in deep networks.
However, because BatchNorm already provides some regularization, the dropout rate is often reduced, or dropout removed entirely, when the two are combined.
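As a rough sketch of how the two are often combined, the block below follows the linear → BatchNorm → activation → Dropout pattern (the layer sizes and the 0.3 dropout rate are illustrative):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation, Dropout

model_combined = Sequential([
    Dense(128),
    BatchNormalization(),    # stabilize the pre-activation values
    Activation("relu"),
    Dropout(0.3),            # then randomly disable neurons (illustrative rate)
    Dense(1)
])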
Training vs Inference Behavior
During training, BatchNorm computes statistics from each mini-batch.
During inference, it uses running (moving) averages of the mean and variance that were accumulated during training.
This ensures consistent predictions on unseen data.
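Here is a minimal sketch that calls a standalone BatchNormalization layer in both modes (the random input and its shape are illustrative):

import numpy as np
from tensorflow.keras.layers import BatchNormalization

bn = BatchNormalization()
x = np.random.randn(32, 4).astype("float32")   # a mini-batch of 32 examples

# Training mode: normalize with this batch's mean and variance
# and update the layer's moving averages.
y_train = bn(x, training=True)

# Inference mode: normalize with the moving averages
# accumulated so far (this is what model.predict uses).
y_infer = bn(x, training=False)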
Common Mistakes
Placing BatchNorm after activation or using it inconsistently can reduce its effectiveness.
BatchNorm also works best when batch sizes are reasonably large; with very small batches, the per-batch statistics become noisy and the normalization less reliable.
Exercises
Exercise 1:
What problem does Batch Normalization primarily solve?
Exercise 2:
Should BatchNorm be applied before or after activation?
Quick Quiz
Q1. Does BatchNorm remove model flexibility?
Q2. Can BatchNorm replace Dropout?
In the next lesson, we will dive into Optimization Algorithms and see how different optimizers control learning dynamics in deep networks.