
Weight Initialization Strategies

In the previous lesson, we built an understanding of Multi-Layer Perceptrons and how stacking layers helps neural networks learn complex patterns.

However, even a well-designed network can fail before learning begins if weights are initialized poorly.


Why Weight Initialization Matters

When training starts, the network has no knowledge. All learning begins from the initial weights.

If weights are too small, signals may vanish. If weights are too large, signals may explode.

Both situations make learning slow or completely unstable.


The Core Problem

During training, gradients flow backward through the network.

Bad initialization causes gradients to either:

• Shrink layer by layer (vanishing gradients)
• Grow uncontrollably (exploding gradients)

This makes deep networks extremely hard to train.
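The forward-pass version of this effect is easy to simulate. The sketch below (a minimal NumPy experiment; the helper name forward_std and the layer sizes are our own choices) pushes random data through a stack of tanh layers and measures how the spread of activations depends on the weight scale:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_std(scale, n_layers=20, width=256):
    """Push random inputs through n_layers of (linear -> tanh)
    and return the standard deviation of the final activations."""
    x = rng.standard_normal((1000, width))
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * scale
        x = np.tanh(x @ W)
    return x.std()

print(forward_std(0.01))  # tiny weights: activations collapse toward 0
print(forward_std(1.0))   # large weights: tanh saturates near +/-1
```

With a scale of 0.01 the signal shrinks by a constant factor at every layer and is numerically zero after 20 layers; with a scale of 1.0 every tanh saturates, which flattens gradients just as badly.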


Naive Initialization (What NOT to Do)

Many beginners initialize all weights with zeros.

This is a serious mistake.

If all neurons in a layer start with identical weights, they compute identical outputs, receive identical gradients, and therefore make identical updates — they remain copies of each other forever.

The network loses its ability to learn diverse features.
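This symmetry can be checked numerically. Below is a minimal sketch of a hypothetical 3-5-1 network in which every weight starts at the same constant: every hidden unit ends up with exactly the same gradient, so gradient descent can never make the units differ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))   # batch of 4 inputs, 3 features
y = rng.standard_normal((4, 1))   # targets

W1 = np.full((3, 5), 0.5)         # input -> hidden: all identical
W2 = np.full((5, 1), 0.5)         # hidden -> output: all identical

# Forward pass (tanh hidden layer, squared-error loss).
h = np.tanh(x @ W1)
err = h @ W2 - y

# Backward pass (gradients up to a constant factor).
dW2 = h.T @ err
dW1 = x.T @ (err @ W2.T * (1.0 - h**2))

# Every hidden unit sees the same activations and the same gradient,
# so all columns of dW1 are equal: the units stay clones.
print(np.allclose(dW1, dW1[:, :1]))  # True
```

With all-zero weights the situation is even worse for this architecture: h and W2 are both zero, so both gradients vanish entirely and no weight moves at all.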


Random Initialization

A better idea is to initialize weights randomly.

Random values break symmetry and allow neurons to learn different patterns.

import numpy as np

# Small random weights break symmetry, but the fixed 0.01 scale
# ignores layer width, so deep networks can still suffer.
weights = np.random.randn(10, 5) * 0.01

However, pure random initialization still has limitations for deep networks.


Xavier (Glorot) Initialization

Xavier initialization was designed to keep the variance of activations stable across layers.

It works well with activation functions like sigmoid and tanh.

from tensorflow.keras.initializers import GlorotUniform
from tensorflow.keras.layers import Dense

# Draws from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))
initializer = GlorotUniform()
layer = Dense(64, activation="tanh", kernel_initializer=initializer)

By scaling weights according to both a layer's input count (fan_in) and output count (fan_out), this approach balances signal variance in the forward and backward passes, reducing gradient issues.
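The recipe is simple enough to write by hand. Here is a minimal NumPy sketch (the helper name xavier_uniform is ours) that draws from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)) and checks that activations keep a reasonable spread through a deep tanh stack:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    """Glorot/Xavier uniform: U(-limit, limit) with
    limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Push activations through 20 tanh layers and watch the spread.
x = rng.standard_normal((1000, 256))
for _ in range(20):
    x = np.tanh(x @ xavier_uniform(256, 256))

# The spread stays well away from zero instead of collapsing
# the way a fixed 0.01 scale does.
print(round(x.std(), 3))
```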


He Initialization

He initialization is specifically designed for ReLU-based networks.

Since ReLU zeroes out negative values, roughly halving the signal variance, He initialization compensates by doubling the weight variance to 2 / fan_in.

from tensorflow.keras.initializers import HeNormal
from tensorflow.keras.layers import Dense

# Draws from N(0, sqrt(2 / fan_in)), sized for ReLU layers
initializer = HeNormal()
layer = Dense(64, activation="relu", kernel_initializer=initializer)

Most ReLU-based deep learning models use He initialization, or a close variant of it, by default.
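As with Xavier, the effect is easy to verify by hand. This sketch (the helper name he_normal is ours) samples weights from N(0, sqrt(2 / fan_in)) and pushes data through 20 ReLU layers; the 2 / fan_in variance offsets the half of the signal that ReLU removes, so the activation spread stays roughly constant:

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    """He initialization: N(0, sqrt(2 / fan_in))."""
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

# 20 ReLU layers: the activation spread neither vanishes nor explodes.
x = rng.standard_normal((1000, 256))
for _ in range(20):
    x = np.maximum(0.0, x @ he_normal(256, 256))

print(round(x.std(), 2))
```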


Real-World Intuition

Think of learning like passing information through pipes.

If pipes are too narrow, information fades. If pipes are too wide, pressure builds uncontrollably.

Good initialization sizes the pipes correctly before learning begins.


How Frameworks Handle Initialization

Modern libraries like TensorFlow and PyTorch automatically choose good defaults.

However, understanding initialization helps you:

• Debug unstable training
• Design custom architectures
• Improve convergence speed


Mini Practice

Ask yourself:

Why does ReLU need a different initialization than sigmoid or tanh?


Exercises

Exercise 1:
Why is initializing all weights to zero a bad idea?

All neurons learn identical features, so the network cannot develop diverse representations.

Exercise 2:
Which initialization works best with ReLU?

He initialization.

Quick Quiz

Q1. What problem does Xavier initialization solve?

It keeps activation variance stable across layers.

Q2. Can bad initialization stop learning completely?

Yes, by causing vanishing or exploding gradients.

In the next lesson, we will move into hyperparameter tuning and understand how learning rate, depth, and batch size affect training behavior.