Activation Functions
In the previous lesson, we explored how a neural network is structured using layers, neurons, weights, and biases.
Now we answer a very important question: How does a neural network learn complex patterns instead of behaving like a simple calculator?
The answer lies in activation functions. They introduce non-linearity into the network, which makes Deep Learning powerful.
Why Activation Functions Are Necessary
If a neural network had no activation functions, each layer would perform only a linear operation.
Stacking multiple linear layers without activation is mathematically equivalent to a single linear layer. That means no matter how deep the network is, it would still behave like a simple linear model.
Activation functions solve this problem by allowing the network to learn non-linear relationships.
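To make this concrete, here is a tiny sketch with made-up scalar weights: two stacked linear layers compute exactly the same function as a single linear layer with combined weights.
w1, b1 = 2.0, 1.0   # first linear layer: y = w1*x + b1 (illustrative values)
w2, b2 = 3.0, -4.0  # second linear layer: z = w2*y + b2

def two_linear_layers(x):
    return w2 * (w1 * x + b1) + b2

def one_linear_layer(x):
    # combined weights: z = (w2*w1)*x + (w2*b1 + b2)
    return (w2 * w1) * x + (w2 * b1 + b2)

for x in [-1.0, 0.0, 2.5]:
    assert two_linear_layers(x) == one_linear_layer(x)   # identical outputs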
Real-World Intuition
Think of a decision like approving a loan.
The decision is not based on one straight rule. It depends on income, credit history, age, debts, and many hidden interactions between them.
Activation functions help neural networks model such complex, real-world decision boundaries.
How Activation Functions Work (Technical View)
Each neuron computes a weighted sum of inputs and adds a bias:
z = w1*x1 + w2*x2 + ... + wn*xn + b
This value z is then passed through an activation function f(z) to produce the final output:
a = f(z)
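As a small numeric illustration (the weights, bias, and inputs below are made-up values, and ReLU, introduced in the next section, stands in for f):
def f(z):
    return max(0.0, z)   # ReLU, used here as the activation f

w = [0.4, -0.7]          # weights w1, w2
b = 0.1                  # bias
x = [2.0, 1.5]           # inputs x1, x2

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # weighted sum plus bias
a = f(z)                                       # activation output
print(z, a)              # z is about -0.15, so ReLU outputs 0.0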
Common Activation Functions
Different activation functions serve different purposes. Let’s understand the most important ones used in practice.
1. Sigmoid Function
The sigmoid function squashes any input value into a range between 0 and 1.
This makes it useful for probability-based outputs, such as binary classification.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))
However, sigmoid suffers from a major problem: vanishing gradients. For inputs far from zero, its gradient becomes tiny, which slows learning in deep networks.
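The gradient of sigmoid is sigmoid(z) * (1 - sigmoid(z)), which approaches zero once z moves away from zero. A minimal sketch (reusing the sigmoid defined above, with a few made-up test inputs) shows how quickly it shrinks:
def sigmoid_derivative(z):
    s = sigmoid(z)        # sigmoid as defined above
    return s * (1 - s)

for z in [0, 2, 5, 10]:
    print(z, sigmoid_derivative(z))
# prints roughly 0.25, 0.105, 0.0066, 0.000045: the gradient fades as |z| grows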
2. ReLU (Rectified Linear Unit)
ReLU is the most widely used activation function in modern deep learning models.
It outputs zero for negative values and returns the input directly for positive values.
def relu(z):
    return max(0, z)
ReLU is computationally efficient and helps reduce the vanishing gradient problem.
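For comparison with the sigmoid sketch above, the gradient of ReLU is 1 for any positive input and 0 for negative input (the value at exactly 0 is a convention), so it does not fade as inputs grow:
def relu_derivative(z):
    return 1.0 if z > 0 else 0.0   # convention: 0.0 at z == 0

for z in [2, 5, 10]:
    print(z, relu_derivative(z))   # always 1.0 for positive inputs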
3. Tanh Function
Tanh is similar to sigmoid, but it outputs values between -1 and 1.
It is zero-centered, which often leads to better convergence compared to sigmoid.
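Python's math module already provides tanh, so a minimal implementation in the same style as the examples above could simply wrap it:
import math

def tanh(z):
    return math.tanh(z)   # outputs values in (-1, 1), centered at 0

print(tanh(-2), tanh(0), tanh(2))   # roughly -0.964, 0.0, 0.964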
Choosing the Right Activation Function
There is no single activation function that works best for all layers.
In practice:
- ReLU is used in hidden layers
- Sigmoid is used for binary output layers
- Softmax is used for multi-class outputs
Understanding this choice is a key skill for every Deep Learning engineer.
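Sigmoid and ReLU were implemented above; for completeness, here is a minimal plain-Python sketch of softmax (not an optimized library version), which turns a list of scores into probabilities that sum to 1:
import math

def softmax(zs):
    m = max(zs)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))   # roughly [0.659, 0.242, 0.099]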
Mini Practice
Think about this scenario:
If your model always predicts the same output, what might be the issue with activation functions?
Exercises
Exercise 1:
Why do neural networks need non-linear activation functions?
Exercise 2:
Why is ReLU preferred over sigmoid in hidden layers?
Quick Quiz
Q1. Which activation function outputs values between 0 and 1?
Q2. What happens if a deep network has no activation functions?
In the next lesson, we will move deeper into the mechanics of learning by understanding forward propagation and backpropagation, where gradients and optimization truly begin.