DL Lesson 13 – Batch Normalization | Dataplexa

Batch Normalization

In the previous lesson, we studied Dropout, a technique that reduces overfitting by randomly disabling neurons during training.

In this lesson, we focus on a technique that improves training stability and speed: Batch Normalization.

Batch Normalization is one of the most important breakthroughs that made very deep neural networks practical.


The Core Problem: Internal Covariate Shift

As a neural network trains, the distribution of activations inside the network keeps changing.

When earlier layers update their weights, later layers suddenly receive inputs with new ranges and patterns.

This constant shift slows down training and makes optimization unstable.

This problem is known as internal covariate shift.


What Batch Normalization Does

Batch Normalization stabilizes learning by normalizing activations within each mini-batch.

Instead of letting values grow unpredictably, BatchNorm ensures that activations remain well-behaved.

The network can then train faster and support deeper architectures.


How Normalization Works (Intuition)

For a given batch of activations:

First, the mean is subtracted. Then, values are divided by the standard deviation.

This produces normalized values with a stable scale across training steps.

Importantly, BatchNorm does NOT remove model expressiveness. It adds learnable scale (gamma) and shift (beta) parameters that can restore the original range if the network needs it.
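
To make this concrete, here is a minimal NumPy sketch of the computation for one mini-batch; the batch size, feature count, gamma, beta, and eps values are illustrative assumptions, and in a real layer gamma and beta are learned during training.

import numpy as np

# A batch of activations with an arbitrary scale and offset
x = np.random.randn(32, 4) * 3.0 + 5.0

mean = x.mean(axis=0)        # per-feature mean over the batch
var = x.var(axis=0)          # per-feature variance over the batch
eps = 1e-5                   # small constant for numerical stability

x_hat = (x - mean) / np.sqrt(var + eps)   # roughly zero mean, unit variance

gamma = np.ones(4)           # learnable scale (trained in practice)
beta = np.zeros(4)           # learnable shift (trained in practice)
y = gamma * x_hat + beta     # the layer's output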


Where Batch Normalization Is Applied

Batch Normalization is usually applied after the linear transformation and before the activation function.

This ordering follows the original BatchNorm paper and is the convention used throughout this lesson.


Code Example: Batch Normalization Layer

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

model = Sequential([
    # Linear transformation -> BatchNorm -> activation, repeated per block
    Dense(128),
    BatchNormalization(),
    Activation("relu"),

    Dense(64),
    BatchNormalization(),
    Activation("relu"),

    # Output layer (e.g. a single regression value); no BatchNorm needed here
    Dense(1)
])

Here, BatchNormalization is applied before each activation function.
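
To see the model train end to end, here is a hedged usage sketch on synthetic data; the feature count, optimizer, learning rate, batch size, and epoch count are illustrative assumptions rather than recommendations.

import numpy as np
from tensorflow.keras.optimizers import Adam

# Synthetic regression data, just to exercise the model
X = np.random.randn(256, 20).astype("float32")
y = np.random.randn(256, 1).astype("float32")

# A relatively high learning rate is often workable once BatchNorm is in place
model.compile(optimizer=Adam(learning_rate=1e-2), loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)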


Why Batch Normalization Helps Training

With normalized activations, the optimizer does not need to fight unstable gradients.

This allows:

• Faster convergence
• Higher learning rates
• More stable deep networks

The noise introduced by mini-batch statistics also acts as a mild regularizer, so in many cases BatchNorm reduces overfitting on its own.


Batch Normalization vs Dropout

Batch Normalization stabilizes learning. Dropout prevents the network from relying too heavily on individual neurons.

They solve different problems and are often used together in deep networks.

However, when BatchNorm already provides enough regularization, dropout can often be reduced or removed.


Training vs Inference Behavior

During training, BatchNorm computes statistics from each mini-batch.

During inference, it uses moving averages of the mean and variance accumulated during training.

This ensures consistent predictions on unseen data.
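
The difference is easy to observe with a standalone layer; the sketch below assumes TensorFlow 2.x, and the input values and shapes are arbitrary.

import numpy as np
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
x = np.random.randn(32, 4).astype("float32")

# Training mode: statistics come from this mini-batch,
# and the layer's moving averages are updated
y_train = bn(x, training=True)

# Inference mode: the stored moving averages are used instead
y_infer = bn(x, training=False)

print(bn.moving_mean.numpy())      # running mean accumulated so far
print(bn.moving_variance.numpy())  # running variance accumulated so far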


Common Mistakes

Placing BatchNorm after activation or using it inconsistently can reduce its effectiveness.

BatchNorm also works best when batch sizes are reasonably large, because the batch statistics become noisy when batches are very small.


Exercises

Exercise 1:
What problem does Batch Normalization primarily solve?

It reduces internal covariate shift and stabilizes training.

Exercise 2:
Should BatchNorm be applied before or after activation?

Before the activation function.

Quick Quiz

Q1. Does BatchNorm remove model flexibility?

No, it includes learnable scaling and shifting parameters.

Q2. Can BatchNorm replace Dropout?

Sometimes it reduces the need, but they solve different problems.

In the next lesson, we will dive into Optimization Algorithms and see how different optimizers control learning dynamics in deep networks.