DL Lesson 4 – Activation Functions

Activation Functions

In the previous lesson, we explored how a neural network is structured using layers, neurons, weights, and biases.

Now we answer a very important question: How does a neural network learn complex patterns instead of behaving like a simple calculator?

The answer lies in activation functions. They introduce non-linearity into the network, which makes Deep Learning powerful.


Why Activation Functions Are Necessary

If a neural network had no activation functions, each layer would perform only a linear operation.

Stacking multiple linear layers without activation is mathematically equivalent to a single linear layer. That means no matter how deep the network is, it would still behave like a simple linear model.
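
To make this concrete, here is a minimal sketch using NumPy (an assumption here; the rest of the lesson uses plain Python) showing that two stacked linear layers always collapse into one equivalent linear layer:

import numpy as np

rng = np.random.default_rng(0)

# Two "linear layers" with random weights and biases, and no activation between them
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Applying the layers one after the other...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...gives exactly the same result as a single combined linear layer
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(two_layers, one_layer))  # True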

Activation functions solve this problem by allowing the network to learn non-linear relationships.


Real-World Intuition

Think of a decision like approving a loan.

The decision is not based on a single straight-line rule. It depends on income, credit history, age, existing debts, and many hidden interactions between them.

Activation functions help neural networks model such complex, real-world decision boundaries.


How Activation Functions Work (Technical View)

Each neuron computes a weighted sum of inputs and adds a bias:

z = w1*x1 + w2*x2 + ... + wn*xn + b

This value z is then passed through an activation function f(z) to produce the final output.

a = f(z)
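
Putting both steps together, here is a minimal sketch of a single neuron; the input, weight, and bias values are made-up numbers chosen purely for illustration:

def neuron(inputs, weights, bias, activation):
    # Step 1: weighted sum plus bias, z = w1*x1 + ... + wn*xn + b
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: apply the activation function, a = f(z)
    return activation(z)

# Any of the activation functions defined below can be passed in; ReLU is used here
a = neuron(inputs=[2.0, 4.0], weights=[0.5, 0.25], bias=-1.0, activation=lambda z: max(0, z))
print(a)  # 1.0, because z = 0.5*2.0 + 0.25*4.0 - 1.0 = 1.0 and ReLU keeps positive values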

Common Activation Functions

Different activation functions serve different purposes. Let's look at the most important ones used in practice.


1. Sigmoid Function

The sigmoid function squashes any input value into a range between 0 and 1.

This makes it useful for probability-based outputs, such as binary classification.

import math

def sigmoid(z):
    # Squash any real z into the (0, 1) range: large negative z gives ~0, large positive z gives ~1
    return 1 / (1 + math.exp(-z))

However, sigmoid suffers from a major problem: for inputs far from zero its gradient becomes vanishingly small (the vanishing gradient problem), which slows learning in deep networks.
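
To see why, note that the derivative of sigmoid is sigmoid(z) * (1 - sigmoid(z)), which is never larger than 0.25. A small sketch, reusing the sigmoid function defined just above:

def sigmoid_derivative(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)); uses the sigmoid() defined just above
    s = sigmoid(z)
    return s * (1 - s)

for z in [0, 2, 5, 10]:
    print(z, sigmoid_derivative(z))
# The derivative peaks at 0.25 (at z = 0) and drops to about 0.000045 at z = 10,
# so gradients shrink quickly when many sigmoid layers are stacked.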


2. ReLU (Rectified Linear Unit)

ReLU is the most widely used activation function in modern deep learning models.

It outputs zero for negative values and returns the input directly for positive values.

def relu(z):
    # Keep positive values unchanged; clamp negative values to zero
    return max(0, z)

ReLU is computationally cheap, and because its gradient is exactly 1 for positive inputs, it helps reduce the vanishing gradient problem.
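
For example, applying relu to a few illustrative values:

for z in [-3.0, -0.5, 0.0, 2.0, 7.5]:
    print(z, relu(z))
# Negative inputs are clamped to 0, while zero and positive inputs pass through unchanged.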


3. Tanh Function

Tanh is similar to sigmoid, but it outputs values between -1 and 1.

It is zero-centered, which often leads to better convergence compared to sigmoid.
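
Unlike sigmoid and ReLU, tanh has no snippet above, so here is a sketch in the same style; Python's built-in math.tanh is the numerically safer equivalent:

import math

def tanh(z):
    # tanh(z) = (e^z - e^-z) / (e^z + e^-z); outputs lie in (-1, 1) and are centered at 0
    # (for large |z| this direct formula can overflow; math.tanh(z) is the safe built-in)
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))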


Choosing the Right Activation Function

There is no single activation function that works best for all layers.

In practice:

  • ReLU is the default choice for hidden layers
  • Sigmoid is used in the output layer for binary classification
  • Softmax is used in the output layer for multi-class classification (see the sketch below)

Understanding this choice is a key skill for every Deep Learning engineer.
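
Softmax does not get its own snippet in this lesson, so here is a minimal sketch of the standard formulation: it turns a vector of raw scores (logits) into probabilities that sum to 1.

import math

def softmax(logits):
    # Subtract the maximum score first so the exponentials cannot overflow
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    # Each entry becomes a probability, and the entries sum to 1
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))  # roughly [0.66, 0.24, 0.10]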


Mini Practice

Think about this scenario:

If your model always predicts the same output, what might be the issue with activation functions?


Exercises

Exercise 1:
Why do neural networks need non-linear activation functions?

Without non-linearity, deep networks collapse into a linear model and cannot learn complex patterns.

Exercise 2:
Why is ReLU preferred over sigmoid in hidden layers?

ReLU reduces vanishing gradients and is computationally efficient.

Quick Quiz

Q1. Which activation function outputs values between 0 and 1?

Sigmoid.

Q2. What happens if a deep network has no activation functions?

It behaves like a linear model regardless of depth.

In the next lesson, we will move deeper into the mechanics of learning by understanding forward propagation and backpropagation, where gradients and optimization truly begin.