Weight Initialization Strategies
In the previous lesson, we built an understanding of Multi-Layer Perceptrons and how stacking layers helps neural networks learn complex patterns.
However, even a well-designed network can fail before learning begins if weights are initialized poorly.
Why Weight Initialization Matters
When training starts, the network has no knowledge. All learning begins from the initial weights.
If weights are too small, signals shrink as they pass through each layer and eventually vanish. If weights are too large, signals grow layer after layer until they explode.
Either way, learning becomes slow or unstable.
The Core Problem
During training, gradients flow backward through the network.
Bad initialization causes gradients to either:
• Shrink layer by layer (vanishing gradients)
• Grow uncontrollably (exploding gradients)
This makes deep networks extremely hard to train.
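A minimal sketch makes this concrete. Below, a signal is pushed through a stack of purely linear layers (activations are omitted to isolate the scaling effect); the depth of 10 and width of 256 are arbitrary choices for illustration. A small weight scale shrinks the signal toward zero, while a large one blows it up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 256))  # batch of 64 signals, 256 features each

final_std = {}
for scale in (0.01, 1.0):
    h = x
    # 10 purely linear layers, to isolate the effect of the weight scale
    for _ in range(10):
        W = rng.standard_normal((256, 256)) * scale
        h = h @ W
    final_std[scale] = h.std()
    print(f"scale={scale}: final std = {final_std[scale]:.3e}")
```

During backpropagation, gradients are multiplied by the same weight matrices, so they shrink or grow in exactly the same way.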
Naive Initialization (What NOT to Do)
Many beginners initialize all weights with zeros.
This is a serious mistake.
If every neuron in a layer starts with the same weights, they all compute the same output, receive the same gradient, and therefore stay identical forever.
The network loses its ability to learn diverse features.
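This symmetry problem can be demonstrated with a toy network (the architecture, inputs, and learning rate below are arbitrary illustration choices). Two zero-initialized hidden neurons receive identical gradient updates at every step, so their weight rows never diverge:

```python
import numpy as np

# Toy network: 3 inputs -> 2 hidden (sigmoid) -> 1 output, all weights zero
x = np.array([0.5, -0.2, 0.8])
W1 = np.zeros((2, 3))   # both hidden neurons start identical
W2 = np.zeros((1, 2))
y = 1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(100):  # plain gradient descent on squared error
    h = sigmoid(W1 @ x)          # hidden activations
    out = sigmoid(W2 @ h)[0]     # prediction
    d_out = (out - y) * out * (1 - out)
    dW2 = d_out * h[None, :]
    d_h = d_out * W2[0] * h * (1 - h)
    dW1 = d_h[:, None] * x[None, :]
    W1 -= 0.5 * dW1
    W2 -= 0.5 * dW2

# The two hidden neurons never diverge: their weight rows stay equal
print(np.allclose(W1[0], W1[1]))  # True
```

Random initialization breaks this symmetry from the very first step.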
Random Initialization
A better idea is to initialize weights randomly.
Random values break symmetry and allow neurons to learn different patterns.
import numpy as np
# 10 x 5 weight matrix drawn from N(0, 1), scaled down to keep early activations small
weights = np.random.randn(10, 5) * 0.01
However, a fixed scale like 0.01 ignores the width of each layer, so signals can still vanish or explode in deep networks.
Xavier (Glorot) Initialization
Xavier initialization was designed to keep the variance of activations stable across layers.
It works well with activation functions like sigmoid and tanh.
from tensorflow.keras.initializers import GlorotUniform
initializer = GlorotUniform()
This approach balances incoming and outgoing signals, reducing gradient issues.
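The uniform variant of the rule is simple enough to sketch directly in NumPy: weights are drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)), which gives a variance of 2 / (fan_in + fan_out). The layer sizes below are arbitrary examples:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(256, 128, np.random.default_rng(0))
print(W.std())  # close to sqrt(2 / (256 + 128)) ~ 0.072
```

Because the variance depends on both fan-in and fan-out, the signal scale is preserved in the forward pass and the backward pass alike.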
He Initialization
He initialization is specifically designed for ReLU-based networks.
Since ReLU zeroes out negative values, roughly halving the variance of the signal, He initialization compensates by using a larger weight variance of 2 / fan_in.
from tensorflow.keras.initializers import HeNormal
initializer = HeNormal()
Many modern ReLU-based architectures use He initialization by default.
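The normal variant of the rule can also be sketched in NumPy: weights are drawn from N(0, 2 / fan_in). The quick check below (with an arbitrary width of 256) shows that after a ReLU layer, the factor of 2 keeps the output on roughly the same scale as the input:

```python
import numpy as np

def he_normal(fan_in, fan_out, rng):
    """He/Kaiming normal: N(0, 2 / fan_in), suited to ReLU layers."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
# ReLU zeroes roughly half the activations; the factor 2 restores the variance
h = np.maximum(0.0, rng.standard_normal((64, 256)))  # ReLU activations
W = he_normal(256, 256, rng)
print((h @ W).std())  # stays on the same order as the input scale
```

With Xavier's smaller variance instead, the same stack of ReLU layers would see its signal shrink by a constant factor at every layer.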
Real-World Intuition
Think of learning like passing information through pipes.
If pipes are too narrow, information fades. If pipes are too wide, pressure builds uncontrollably.
Good initialization sizes the pipes correctly before learning begins.
How Frameworks Handle Initialization
Modern libraries like TensorFlow and PyTorch choose sensible defaults automatically (Keras Dense layers, for example, default to Glorot uniform).
However, understanding initialization helps you:
• Debug unstable training
• Design custom architectures
• Improve convergence speed
Mini Practice
Ask yourself:
Why does ReLU need a different initialization than sigmoid or tanh?
Exercises
Exercise 1:
Why is initializing all weights to zero a bad idea?
Exercise 2:
Which initialization works best with ReLU?
Quick Quiz
Q1. What problem does Xavier initialization solve?
Q2. Can bad initialization stop learning completely?
In the next lesson, we will move into hyperparameter tuning and understand how learning rate, depth, and batch size affect training behavior.