DL Lesson 8 – Weight Initialization | Dataplexa

Weight Initialization

In the previous lesson, we studied gradient descent variants and how neural networks update their weights to minimize loss.

Now we turn to a critical question, one that can decide whether training succeeds or fails before learning even starts.

How should the weights be initialized?


Why Weight Initialization Matters

A neural network does not learn from a blank slate.

It starts from initial weight values, and all learning happens relative to those values.

If the initialization is poor, the network may:

fail to learn anything, learn extremely slowly, or get stuck producing meaningless outputs.

This is one of the most common hidden reasons why beginners think deep learning “does not work”.


What Happens If We Initialize All Weights to Zero?

At first glance, initializing weights to zero may seem logical and clean.

In reality, this completely breaks learning.

When all weights are the same, all neurons in a layer behave identically.

During backpropagation, they receive the same gradients and update in the same way.

This means the network never learns diverse features.

This problem is known as a failure to break symmetry: identical neurons can never become different through gradient descent.
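
The failure can be demonstrated in a few lines of numpy. The sketch below is a toy two-layer network with made-up numbers; a constant value of 0.5 is used instead of zero so the gradients are not trivially zero, which makes the symmetry easier to see:

```python
import numpy as np

# Toy network: 2 inputs -> 3 hidden neurons (tanh) -> 1 output.
# Every weight shares the same constant value, so the hidden
# neurons start out as exact clones of each other.
X = np.array([[1.0, 2.0], [3.0, 4.0]])   # two training examples
y = np.array([[1.0], [0.0]])

W1 = np.full((2, 3), 0.5)
W2 = np.full((3, 1), 0.5)

# Forward pass: all three hidden columns come out identical.
h = np.tanh(X @ W1)
out = h @ W2

# Backward pass (mean squared error).
d_out = 2 * (out - y) / len(X)
dW2 = h.T @ d_out
dh = (d_out @ W2.T) * (1 - h**2)
dW1 = X.T @ dh

# Every hidden neuron receives the same gradient, so gradient
# descent updates them in lockstep: the symmetry is never broken.
print(np.allclose(dW1[:, 0], dW1[:, 1]))  # True
print(np.allclose(dW1[:, 1], dW1[:, 2]))  # True
```

With all-zero weights the situation is even worse: the hidden activations and W2 are both zero, so every gradient is exactly zero and no weight moves at all.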


Random Initialization: The First Improvement

To break symmetry, weights must be initialized randomly.

This allows different neurons to learn different patterns.

A simple random initialization looks like this:

import numpy as np

# Draw a 3x4 weight matrix from a standard normal distribution,
# scaled down so the initial weights are small
weights = np.random.randn(3, 4) * 0.01

While this works better than zeros, it still has serious limitations for deep networks.


The Vanishing and Exploding Gradient Problem

In deep networks, signals pass through many layers.

If weights are too small, gradients shrink layer by layer until learning stops.

This is called vanishing gradients.

If weights are too large, gradients grow exponentially and cause unstable training.

This is called exploding gradients.

Good initialization must balance both.


Xavier (Glorot) Initialization

Xavier initialization was designed to keep the variance of activations roughly constant across layers.

It works especially well with sigmoid and tanh activation functions.

Conceptually, it scales the weights based on the number of input and output neurons of the layer (its fan-in and fan-out).

from tensorflow.keras.initializers import GlorotUniform

# Samples from U(-limit, limit), with limit = sqrt(6 / (fan_in + fan_out))
initializer = GlorotUniform()

This method significantly improves training stability in deep networks.
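
To make the scaling rule concrete, here is an illustrative numpy re-implementation of the Glorot uniform rule (a sketch of the formula, not the framework's internal code):

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    # Xavier/Glorot uniform rule: sample from U(-limit, limit) with
    # limit = sqrt(6 / (fan_in + fan_out)), which gives the weights
    # a variance of 2 / (fan_in + fan_out).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = glorot_uniform(256, 128)
print(W.var())   # close to 2 / (256 + 128), i.e. about 0.0052
```

The variance of a uniform distribution on (-limit, limit) is limit² / 3, which is exactly how the factor of 6 under the square root turns into the target variance of 2 / (fan_in + fan_out).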


He Initialization

When ReLU-based activations are used, Xavier initialization is no longer ideal.

He initialization was created specifically for ReLU and its variants.

It allows stronger signal flow without causing instability.

from tensorflow.keras.initializers import HeNormal

# Samples from a truncated normal with stddev = sqrt(2 / fan_in)
initializer = HeNormal()

Most modern deep learning models using ReLU rely on He initialization.
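
The idea can be sketched in numpy. This is a simplified, non-truncated version (Keras's HeNormal samples from a truncated normal), using toy dimensions to show that the signal survives a stack of ReLU layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    # Simplified He initialization: N(0, sqrt(2 / fan_in)).
    # The factor of 2 compensates for ReLU zeroing out roughly
    # half of each layer's activations.
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

# Push a random batch through 10 ReLU layers: the signal's magnitude
# stays on the order of 1 instead of collapsing or exploding.
h = rng.standard_normal((64, 256))
for _ in range(10):
    h = np.maximum(0.0, h @ he_normal(256, 256))
print(h.std())
```

Replacing the factor 2 with the Xavier-style 1 in this sketch makes the signal shrink by roughly half its variance at every layer, which is why Xavier is not ideal for ReLU.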


How Initialization Is Used in Practice

In real-world training, you rarely initialize weights manually.

Frameworks allow you to specify initialization strategies directly in layers.

from tensorflow.keras.layers import Dense

Dense(
    units=128,
    activation="relu",
    kernel_initializer="he_normal"
)

This ensures consistent and safe initialization across large architectures.


Real-World Intuition

Think of training a neural network like teaching a group of students.

If everyone starts with exactly the same knowledge, they all make the same mistakes and never improve collectively.

If everyone starts with slightly different understanding, they learn different aspects and improve faster together.

Weight initialization plays this role in neural networks.


Exercises

Exercise 1:
Why does zero initialization fail in deep learning?

All neurons in a layer receive identical gradients, so they learn the same thing and the network never develops diverse features.

Exercise 2:
Which initialization is preferred for ReLU activations?

He initialization.

Quick Quiz

Q1. What problem does Xavier initialization solve?

It keeps the variance of activations roughly constant across layers, which mitigates vanishing and exploding gradients.

Q2. Why does initialization affect convergence speed?

Because it determines how signals and gradients flow at the start of training.

In the next lesson, we will connect initialization with bias–variance behavior and understand how model capacity affects learning.