Algorithms Lesson 20 – Huffman Coding | Dataplexa

Huffman Coding

In the previous lesson, we studied the Fractional Knapsack Problem and saw how greedy choices based on ratios lead to optimal solutions.

In this lesson, we explore another powerful greedy algorithm — Huffman Coding.

Huffman Coding is widely used in data compression and forms the foundation of formats like ZIP, JPEG, and MP3.


What Is Huffman Coding?

Huffman Coding is an algorithm used to compress data by assigning variable-length binary codes to characters.

The key idea is simple:

Characters that appear more frequently get shorter codes, while rare characters get longer codes.

This reduces the total number of bits needed to represent the data.
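The saving is easy to see with a small worked count. The frequencies and code table below are hypothetical, chosen only to illustrate the idea:

```python
# Toy comparison: fixed-length vs variable-length codes.
# Frequencies and code table are hypothetical, for illustration only.
freqs = {"A": 8, "B": 1, "C": 1}

# Fixed-length: 2 bits per character suffice for a 3-symbol alphabet.
fixed_bits = 2 * sum(freqs.values())  # 2 * 10 = 20 bits

# Variable-length prefix code: the frequent 'A' gets a 1-bit code.
codes = {"A": "0", "B": "10", "C": "11"}
variable_bits = sum(freqs[c] * len(codes[c]) for c in freqs)  # 8 + 2 + 2 = 12 bits

print(fixed_bits, variable_bits)  # prints 20 12
```

Even in this tiny example, the variable-length code cuts the total from 20 bits to 12.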


Why Compression Matters

Imagine sending a large text file over the internet.

If we can reduce the file size, we save:

• Storage space
• Transmission time
• Network bandwidth

Huffman Coding achieves this without losing any information.


Greedy Strategy Behind Huffman Coding

Huffman Coding uses a greedy approach based on frequency.

At each step:

We combine the two characters with the smallest frequencies.
(Initially these are single characters; in later steps they may be previously merged subtrees.)

This local greedy decision leads to a globally optimal prefix code.


Important Property: Prefix Codes

Huffman Codes are prefix-free.

This means:

No character’s code is a prefix of another character’s code.

Because of this property, the encoded data can be decoded uniquely without ambiguity.
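Unambiguous decoding can be sketched in a few lines: read bits one at a time, and the moment the buffer matches a codeword, emit that character. The code table below is a hypothetical prefix-free example, not one produced by Huffman's algorithm:

```python
# Decoding a prefix-free code: because no codeword is a prefix of
# another, the first full match is always the right one -- no lookahead.
# Hypothetical code table for illustration.
codes = {"A": "0", "B": "10", "C": "11"}
decode_table = {bits: ch for ch, bits in codes.items()}

def decode(bitstring):
    result, buffer = [], ""
    for bit in bitstring:
        buffer += bit
        if buffer in decode_table:       # a complete codeword matched
            result.append(decode_table[buffer])
            buffer = ""
    return "".join(result)

print(decode("010110"))  # prints ABCA
```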


Step-by-Step Construction

Let us walk through the process conceptually.

Step 1: Calculate the frequency of each character.

Step 2: Create a leaf node for each character and insert them into a min-heap based on frequency.

Step 3: Remove the two nodes with the smallest frequencies.

Step 4: Create a new internal node with frequency equal to their sum and insert it back into the heap.

Step 5: Repeat until only one node remains.

This final node becomes the root of the Huffman Tree.
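The five steps above can be sketched directly with Python's standard `heapq` module. The node layout and names here are my own choices, not a fixed API: each heap entry is `(frequency, tie_breaker, node)`, where a node is either a character (a leaf) or a `(left, right)` pair (an internal node):

```python
import heapq

def build_huffman_tree(freqs):
    # Step 1-2: one leaf per character, arranged into a min-heap by frequency.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)                   # tie-breaker for equal frequencies
    while len(heap) > 1:                  # Step 5: repeat until one node remains
        f1, _, left = heapq.heappop(heap)   # Step 3: two smallest nodes
        f2, _, right = heapq.heappop(heap)
        # Step 4: internal node whose frequency is the sum of its children.
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    return heap[0][2]                     # root of the Huffman Tree

root = build_huffman_tree({"A": 5, "B": 9, "C": 12})
print(root)  # prints ('C', ('A', 'B'))
```

The integer tie-breaker keeps the heap comparisons well defined when two nodes share a frequency; any consistent rule works, and different rules can produce different (but equally optimal) trees.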


Simple Example

Suppose we have the following characters:

A: 5
B: 9
C: 12
D: 13
E: 16
F: 45

Characters with smaller frequencies end up deeper in the tree, receiving longer binary codes.

Frequent characters stay near the root and get shorter codes.


Pseudocode Overview

create a min-heap of nodes keyed by frequency
insert a leaf node (frequency, character) for every character

while heap size > 1:
    left = extract min
    right = extract min

    create newNode
    newNode.frequency = left.frequency + right.frequency
    newNode.left = left
    newNode.right = right

    insert newNode into heap

return the remaining node (the root of the tree)

Once the tree is built, binary codes are generated by traversing left (0) and right (1).
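Putting the pseudocode and the traversal together, the lesson's example frequencies can be run end to end. This is a sketch under my own representation choices (tuple nodes, `heapq` entries with an integer tie-breaker); the function names are illustrative:

```python
import heapq

def build_tree(freqs):
    # Min-heap of (frequency, tie_breaker, node); leaves are characters.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    return heap[0][2]

def assign_codes(node, prefix="", codes=None):
    # Walk the tree, appending "0" on left edges and "1" on right edges.
    if codes is None:
        codes = {}
    if isinstance(node, str):            # leaf: record the accumulated code
        codes[node] = prefix or "0"      # single-character edge case
    else:
        left, right = node
        assign_codes(left, prefix + "0", codes)
        assign_codes(right, prefix + "1", codes)
    return codes

freqs = {"A": 5, "B": 9, "C": 12, "D": 13, "E": 16, "F": 45}
codes = assign_codes(build_tree(freqs))
total_bits = sum(freqs[c] * len(codes[c]) for c in freqs)
print(total_bits)  # prints 224
```

With these frequencies the frequent character F receives a 1-bit code while A and B receive 4-bit codes, and the total of 224 bits compares with 300 bits at a fixed 3 bits per character.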


Time Complexity

Let n be the number of characters.

Building the heap takes O(n).

Each extraction and insertion takes O(log n), and the loop performs n − 1 merge steps.

Overall time complexity:

O(n log n)


Why Greedy Works Here

By always combining the two least frequent characters, we ensure that the longest codes are assigned to the least frequent characters.

This minimizes the total number of bits in the encoded output, which is exactly the quantity an optimal prefix code must minimize.


Mini Practice

Think about this:

  • Why would assigning long codes to frequent characters be inefficient?
  • How does prefix-free coding simplify decoding?

Exercises

Exercise 1:
What is the greedy choice in Huffman Coding?

Always combining the two nodes (characters or merged subtrees) with the smallest frequencies.

Exercise 2:
Why must Huffman codes be prefix-free?

To allow unique and unambiguous decoding.

Exercise 3:
Is Huffman Coding lossy or lossless?

Lossless.

Quick Quiz

Q1. What data structure is used to build Huffman Trees?

Min-Heap (Priority Queue).

Q2. What is the main advantage of Huffman Coding?

It minimizes total encoded data size.

Huffman Coding shows how greedy strategies can be applied to real-world systems like file compression and media storage.

In the next lesson, we will explore graph-based greedy algorithms starting with Dijkstra’s Shortest Path Algorithm.