One-Hot Encoding
Until now, you have learned how text is cleaned, tokenized, and analyzed using NLP techniques. However, machines still cannot directly understand words.
To use text in Machine Learning models, we must convert words into numbers. One-Hot Encoding is the simplest way to do this.
In this lesson, you will understand what one-hot encoding is, why it is used, how it works step by step, its limitations, and where it fits in the NLP pipeline.
Why Do We Need One-Hot Encoding?
Machine Learning algorithms work with numbers, not words.
For example, a model cannot understand:
"NLP is powerful"
So we must convert words into a numeric representation. One-hot encoding is the first and most basic approach to achieve this.
What Is One-Hot Encoding?
One-hot encoding represents each word as a vector where:
- The vector length equals the vocabulary size
- Only one position has value 1
- All other positions are 0
Each word gets a unique position in the vector.
Simple Intuition
Assume a vocabulary:
["nlp", "is", "fun"]
| Word | One-Hot Vector |
|---|---|
| nlp | [1, 0, 0] |
| is | [0, 1, 0] |
| fun | [0, 0, 1] |
Each word is uniquely identified by a binary vector.
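The table above can be reproduced in a few lines of Python. The following is a minimal sketch that builds one one-hot vector per vocabulary word:

```python
vocab = ["nlp", "is", "fun"]

# each word's vector has a single 1 at that word's own index
one_hot = {word: [1 if i == j else 0 for j in range(len(vocab))]
           for i, word in enumerate(vocab)}

print(one_hot["nlp"])  # → [1, 0, 0]
print(one_hot["fun"])  # → [0, 0, 1]
```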
One-Hot Encoding in NLP Pipeline
In classic NLP systems, one-hot encoding appears early:
- Text cleaning
- Tokenization
- One-Hot Encoding
- Machine Learning model
It is often used as a learning concept before Bag of Words and TF-IDF.
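The pipeline steps above can be sketched end to end. The snippet below is a minimal illustration that assumes lowercasing and punctuation stripping as the cleaning step and whitespace splitting as the tokenizer:

```python
text = "NLP is FUN!"

# 1. text cleaning: lowercase and strip punctuation
cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())

# 2. tokenization: split on whitespace
tokens = cleaned.split()

# 3. one-hot encoding over the token vocabulary
vocab = sorted(set(tokens))
vectors = [[1 if tok == w else 0 for w in vocab] for tok in tokens]

print(vocab)    # → ['fun', 'is', 'nlp']
print(vectors)  # one row per token, ready to feed into a model
```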
Practical Example Using Python
Let us convert text into one-hot vectors using Python.
Where to run this code:
- Google Colab (recommended)
- Jupyter Notebook
- VS Code with Python
Manual One-Hot Encoding Example
```python
sentences = ["nlp is fun", "nlp is powerful"]

# build the vocabulary from all words across both sentences
vocab = sorted(set(" ".join(sentences).split()))
print("Vocabulary:", vocab)

# encode each sentence as a binary vector over the vocabulary
one_hot_vectors = []
for sentence in sentences:
    words = sentence.split()
    vector = [1 if word in words else 0 for word in vocab]
    one_hot_vectors.append(vector)

print("One-Hot Vectors:")
for vec in one_hot_vectors:
    print(vec)
```
Output:

```
Vocabulary: ['fun', 'is', 'nlp', 'powerful']
One-Hot Vectors:
[1, 1, 1, 0]
[0, 1, 1, 1]
```
How to Understand This Output
The vocabulary defines the vector length. Each position corresponds to one word.
For sentence "nlp is fun":
- fun → 1
- is → 1
- nlp → 1
- powerful → 0
This vector simply indicates which vocabulary words are present in the sentence. Note that a sentence vector can contain several 1s, so strictly speaking it is a multi-hot (presence) vector; a true one-hot vector represents a single word and contains exactly one 1.
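One useful consequence of presence vectors: the dot product of two sentence vectors counts the vocabulary words the sentences share. A quick sketch using the two vectors from the output above:

```python
v1 = [1, 1, 1, 0]  # "nlp is fun"      over vocab ['fun', 'is', 'nlp', 'powerful']
v2 = [0, 1, 1, 1]  # "nlp is powerful"

# dot product = number of vocabulary words present in both sentences
shared = sum(a * b for a, b in zip(v1, v2))
print(shared)  # → 2  ("nlp" and "is")
```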
One-Hot Encoding Using scikit-learn
In practice, we use libraries instead of manual encoding.
```python
from sklearn.preprocessing import OneHotEncoder

sentences = ["nlp is fun", "nlp is powerful"]
words = " ".join(sentences).split()

# sparse_output=False returns a dense array (use sparse=False before scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform([[w] for w in words])  # one row per word
print(encoded)
```
Advantages of One-Hot Encoding
- Very simple to understand
- No mathematical complexity
- Works for small vocabularies
Limitations of One-Hot Encoding
One-hot encoding has serious drawbacks:
- Vector size grows with vocabulary
- No semantic meaning between words
- Sparse and memory inefficient
For example, the one-hot vectors for "king" and "queen" are exactly as far apart as those for "king" and "banana".
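You can verify this directly: for any vocabulary, the Euclidean distance between two distinct one-hot vectors is always √2, and their cosine similarity is always 0. A small sketch with a toy three-word vocabulary:

```python
import math

vocab = ["banana", "king", "queen"]
vec = {w: [1 if i == j else 0 for j in range(len(vocab))]
       for i, w in enumerate(vocab)}

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# every pair of distinct words is equally far apart
print(distance(vec["king"], vec["queen"]))   # → 1.414...
print(distance(vec["king"], vec["banana"]))  # → 1.414...
```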
Why One-Hot Encoding Is Not Enough
Because it does not capture meaning, modern NLP systems use dense word embeddings instead, such as:
- Word2Vec
- GloVe
- FastText
One-hot encoding is mainly a learning foundation.
Real-Life Usage
- Teaching NLP fundamentals
- Small experiments
- Binary categorical features
It is rarely used alone in large NLP systems.
Assignment / Homework
Practice Environment:
- Google Colab
- Jupyter Notebook
Tasks:
- Create one-hot vectors for 5 custom sentences
- Compare vocabulary size vs vector size
- Try adding new words and observe vector expansion
Practice Questions
Q1. Why is one-hot encoding used in NLP?
Q2. Does one-hot encoding capture word meaning?
Quick Quiz
Q1. What is the length of a one-hot vector?
Q2. Which value is non-zero in one-hot encoding?
Quick Recap
- One-hot encoding converts words into binary vectors
- Vector size equals vocabulary size
- Simple but inefficient
- No semantic meaning
- Foundation for advanced embeddings