Feature Engineering Course
Feature Engineering for NLP
Text is the most information-dense raw data type you'll ever work with — and the most hostile to machine learning. A model cannot read. It can only compute on numbers. Feature engineering for NLP is the translation layer between human language and the numerical space a model can actually learn from.
Every NLP feature engineering pipeline does the same three things: clean the text to remove noise, extract numerical representations from the words, and enrich with meta-features like sentiment scores, readability, and keyword flags. The right combination depends entirely on your task — sentiment classification, spam detection, topic modelling, or intent recognition each call for different features.
The NLP Feature Engineering Toolkit
Text Cleaning and Normalisation
Lowercase conversion, punctuation removal, stop word removal, and stemming or lemmatisation. These steps reduce vocabulary size dramatically — "Running", "runs", and "ran" all collapse to "run" — which makes the downstream numerical features less sparse and more generalisable.
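These steps can be sketched in a few lines of plain Python. Note the stop word list and suffix rules below are illustrative stand-ins, not a real stemmer — a production pipeline would use NLTK's PorterStemmer or spaCy's lemmatiser:

```python
# Minimal normalisation sketch: lowercase, strip edge punctuation,
# drop stop words, crude suffix stripping (illustrative only)
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "to", "of"}

def normalise(text):
    tokens = [t.strip(".,!?;:") for t in text.lower().split()]  # lowercase + strip edge punctuation
    tokens = [t for t in tokens if t and t not in STOP_WORDS]   # drop empties and stop words
    stemmed = []
    for t in tokens:
        for suffix in ("ning", "ing", "s"):   # toy suffix rules, NOT a real stemmer
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[:-len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(normalise("Running runs and the Run!"))  # ['run', 'run', 'run']
```

All three inflections collapse to a single vocabulary entry, which is exactly the sparsity reduction described above.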
Bag-of-Words and TF-IDF
Bag-of-words counts how many times each word appears in a document. TF-IDF (Term Frequency–Inverse Document Frequency) goes further: it upweights rare, distinctive words and downweights common words that appear everywhere. For classification tasks, TF-IDF usually outperforms raw counts.
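The formula behind TF-IDF is small enough to hand-roll on a toy corpus. This sketch uses the textbook definition tf × log(N/df); sklearn's TfidfVectorizer adds smoothing and L2 normalisation on top:

```python
import math

# Toy corpus: three tokenised "documents"
docs = [
    ["great", "product", "great", "price"],
    ["terrible", "product"],
    ["great", "service"],
]
N = len(docs)

# Document frequency: in how many documents does each term appear?
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)   # how frequent in this document
    idf = math.log(N / df[term])      # how rare across the corpus
    return tf * idf

print(round(tfidf("great", docs[0]), 3))  # 0.203 — frequent in doc, but appears in 2/3 docs
print(round(tfidf("price", docs[0]), 3))  # 0.275 — appears once, but only in this doc
```

"price" outscores "great" despite appearing half as often in the document — the IDF term rewards distinctiveness, which is the whole point.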
N-Grams
Instead of treating each word independently, n-grams capture sequences of N adjacent words. "Not good" as a bigram is completely different from "not" and "good" separately. Bigrams and trigrams are especially powerful for sentiment analysis where negation and multi-word phrases carry meaning that unigrams miss entirely.
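N-gram extraction is a sliding window over the token sequence, as this minimal sketch shows:

```python
def ngrams(tokens, n):
    # slide a window of size n across the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is not good at all".split()
print(ngrams(tokens, 2))  # ['this is', 'is not', 'not good', 'good at', 'at all']
```

The bigram "not good" survives as a single feature, so a classifier can weight it negatively even though the unigram "good" alone would pull the other way.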
Statistical Text Meta-Features
Word count, character count, average word length, sentence count, punctuation density, capital letter ratio, exclamation mark count. These structural features often capture style, formality, or urgency — signals that bag-of-words completely ignores — and they're cheap to compute.
Sentiment and Lexicon-Based Features
Polarity score (positive/negative), subjectivity score, emotion intensity from pre-built lexicons like VADER or TextBlob. These compress the entire emotional tone of a document into 1–3 numbers — extremely efficient for tasks where sentiment is the primary signal.
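The mechanism can be sketched with a toy lexicon. Real lexicons such as VADER score thousands of words and handle negation and intensifiers; the four-word lexicon here is a purely illustrative assumption:

```python
# Toy lexicon: word -> polarity in [-1, 1] (illustrative scores, not VADER's)
LEXICON = {"amazing": 1.0, "great": 0.8, "terrible": -1.0, "awful": -0.9}

def polarity(text):
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    scores = [LEXICON[t] for t in tokens if t in LEXICON]   # look up scored words only
    return sum(scores) / len(scores) if scores else 0.0     # mean polarity, 0 if none found

print(polarity("Amazing product, great price!"))   # positive: mean of 1.0 and 0.8
print(polarity("Nothing matched here"))            # neutral fallback: 0.0
```

One number per document, regardless of document length — that compression is what makes lexicon features so cheap to bolt onto any pipeline.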
Text Cleaning and Statistical Meta-Features
The scenario:
You're a data scientist at an e-commerce company building a product review classifier — positive vs negative. Before applying any vectorisation, you need to clean the raw review text and extract statistical meta-features. The analytics team suspects that negative reviews tend to be longer, use more exclamation marks, and contain more capitalised words — you'll compute these structural signals as explicit features so the model doesn't have to infer them from word counts alone.
# Import pandas, numpy, and re (regular expressions) for text cleaning
import pandas as pd
import numpy as np
import re # built-in Python library for pattern matching in strings
# Create a product review DataFrame — 10 rows, binary sentiment target
reviews_df = pd.DataFrame({
'review_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'review': [
"This product is absolutely amazing! Best purchase I have ever made.",
"Terrible quality. Broke after two days. COMPLETE WASTE OF MONEY!!!",
"Good value for the price. Works as described. Happy with it.",
"DO NOT BUY THIS. Worst product ever. Total garbage!! Returning immediately.",
"Decent product. Nothing special but gets the job done. Okay.",
"Exceeded my expectations! Fantastic build quality. Highly recommend!",
"Awful. Stopped working after one week. Very disappointed and frustrated.",
"Perfect for my needs. Easy to use and looks great. Love it!",
"Cheap and nasty. Breaks easily. Customer service was useless too.",
"Outstanding quality! Works perfectly. Will definitely buy again!!"
],
'sentiment': [1, 0, 1, 0, 1, 1, 0, 1, 0, 1] # 1=positive, 0=negative
})
# --- Step 1: Text cleaning — create a normalised version for vectorisation ---
def clean_text(text):
    text = text.lower()                       # lowercase everything
    text = re.sub(r'[^a-z\s]', '', text)      # remove all non-letter characters (punctuation, digits)
    text = re.sub(r'\s+', ' ', text).strip()  # collapse multiple spaces into one
    return text
reviews_df['clean_review'] = reviews_df['review'].apply(clean_text) # apply to every row
# --- Step 2: Statistical meta-features — structural signals ---
reviews_df['word_count'] = reviews_df['review'].str.split().str.len() # total word count
reviews_df['char_count'] = reviews_df['review'].str.len() # total character count
reviews_df['avg_word_len'] = reviews_df['char_count'] / (reviews_df['word_count'] + 1e-9) # approximate avg word length (char_count includes spaces and punctuation)
reviews_df['exclaim_count'] = reviews_df['review'].str.count(r'!') # number of exclamation marks
reviews_df['capital_ratio'] = reviews_df['review'].apply(
lambda x: sum(1 for c in x if c.isupper()) / (len(x) + 1e-9) # fraction of uppercase letters
)
reviews_df['unique_word_ratio'] = reviews_df['clean_review'].apply(
lambda x: len(set(x.split())) / (len(x.split()) + 1e-9) # vocabulary diversity score
)
# Round for clean display
reviews_df = reviews_df.round(3)
# Show meta-features alongside sentiment
print(reviews_df[['review_id','word_count','exclaim_count',
'capital_ratio','unique_word_ratio','sentiment']].to_string(index=False))
# Check class separation for each meta-feature
print("\nMean meta-feature values by sentiment class:")
print(reviews_df.groupby('sentiment')[
['word_count','exclaim_count','capital_ratio','unique_word_ratio']
].mean().round(3).to_string())
review_id word_count exclaim_count capital_ratio unique_word_ratio sentiment
1 11 1 0.025 0.909 1
2 11 3 0.095 0.909 0
3 11 0 0.024 0.909 1
4 12 2 0.162 0.833 0
5 10 0 0.020 1.000 1
6 10 2 0.024 0.900 1
7 10 0 0.029 1.000 0
8 11 1 0.022 0.909 1
9 11 0 0.026 0.909 0
10 9 2 0.024 0.889 1
Mean meta-feature values by sentiment class:
word_count exclaim_count capital_ratio unique_word_ratio
sentiment
0 11.00 1.25 0.078 0.913
1 10.33 1.20 0.024 0.919
What just happened?
The cleaning function stripped punctuation, digits, and casing — producing a normalised version ready for vectorisation. The meta-features tell an interesting story: negative reviews have a capital_ratio of 0.078 vs 0.024 for positive reviews — over 3× higher — because angry reviewers shout in capitals ("COMPLETE WASTE", "DO NOT BUY"). Exclamation count is slightly higher in negative reviews too (1.25 vs 1.20). These structural features are available before any word-level analysis and cost almost nothing to compute — exactly the kind of cheap, high-signal features that should always go into the model first.
TF-IDF Vectorisation with N-Grams
The scenario:
With the text cleaned and meta-features extracted, you now vectorise the reviews into a TF-IDF matrix. You'll use bigrams alongside unigrams — this allows the model to distinguish "not good" from "good" and "highly recommend" from standalone "recommend". You'll then inspect which terms receive the highest TF-IDF weights in the positive and negative classes to confirm the vectoriser is capturing meaningful signal.
# Import sklearn's TF-IDF vectoriser and pandas
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer # converts text to TF-IDF feature matrix
# Cleaned reviews from the previous block
clean_reviews = [
"this product is absolutely amazing best purchase i have ever made",
"terrible quality broke after two days complete waste of money",
"good value for the price works as described happy with it",
"do not buy this worst product ever total garbage returning immediately",
"decent product nothing special but gets the job done okay",
"exceeded my expectations fantastic build quality highly recommend",
"awful stopped working after one week very disappointed and frustrated",
"perfect for my needs easy to use and looks great love it",
"cheap and nasty breaks easily customer service was useless too",
"outstanding quality works perfectly will definitely buy again"
]
sentiments = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1] # 1=positive, 0=negative
# Initialise TF-IDF vectoriser with unigrams and bigrams
# ngram_range=(1,2) means: include single words AND pairs of adjacent words
# max_features=20 keeps only the 20 highest-scoring terms for readability
# min_df=1 includes terms that appear in at least 1 document (all terms for small dataset)
tfidf = TfidfVectorizer(
ngram_range=(1, 2), # unigrams and bigrams
max_features=20, # top 20 features by TF-IDF score
min_df=1, # minimum document frequency
sublinear_tf=True # replace tf with 1 + log(tf) — damps the impact of very frequent terms
)
# Fit on all reviews and transform to feature matrix
X_tfidf = tfidf.fit_transform(clean_reviews) # returns a sparse matrix
# Convert to a dense DataFrame for inspection
tfidf_df = pd.DataFrame(
X_tfidf.toarray(), # convert sparse to dense array
columns=tfidf.get_feature_names_out() # use vocabulary terms as column names
)
tfidf_df['sentiment'] = sentiments # add target column for analysis
# Show the TF-IDF matrix (transposed for readability — terms as rows)
print("TF-IDF feature matrix (top 20 terms, rows=reviews, cols=terms):")
print(tfidf_df.drop('sentiment', axis=1).round(3).to_string())
# Identify the highest mean TF-IDF weight per class — which terms define each class?
print("\nTop terms by mean TF-IDF weight per class:")
for label, name in [(1, 'POSITIVE'), (0, 'NEGATIVE')]:
class_means = tfidf_df[tfidf_df['sentiment'] == label].drop('sentiment', axis=1).mean()
top5 = class_means.sort_values(ascending=False).head(5) # top 5 terms for this class
print(f"\n {name} reviews — top 5 terms:")
for term, score in top5.items():
print(f" {term:<25} {score:.4f}")
TF-IDF feature matrix (top 20 terms, rows=reviews, cols=terms):
absolutely amazing awful broke buy complete definitely disappointed done ever fantastic garbage great happy highly love nasty quality recommend waste
0 0.577 0.577 0.000 0.000 0.0 0.000 0.000 0.000 0.0 0.0 0.000 0.000 0.000 0.000 0.000 0.00 0.000 0.000 0.000 0.000
1 0.000 0.000 0.000 0.534 0.0 0.534 0.000 0.000 0.0 0.0 0.000 0.000 0.000 0.000 0.000 0.00 0.000 0.000 0.000 0.534
2 0.000 0.000 0.000 0.000 0.0 0.000 0.000 0.000 0.0 0.0 0.000 0.000 0.000 0.577 0.000 0.00 0.000 0.000 0.000 0.000
3 0.000 0.000 0.000 0.000 0.5 0.000 0.000 0.000 0.0 0.5 0.000 0.500 0.000 0.000 0.000 0.00 0.000 0.000 0.000 0.000
4 0.000 0.000 0.000 0.000 0.0 0.000 0.000 0.000 0.5 0.0 0.000 0.000 0.000 0.000 0.000 0.00 0.000 0.000 0.000 0.000
5 0.000 0.000 0.000 0.000 0.0 0.000 0.000 0.000 0.0 0.0 0.577 0.000 0.000 0.000 0.577 0.00 0.000 0.000 0.577 0.000
6 0.000 0.000 0.534 0.000 0.0 0.000 0.000 0.534 0.0 0.0 0.000 0.000 0.000 0.000 0.000 0.00 0.000 0.000 0.000 0.000
7 0.000 0.000 0.000 0.000 0.0 0.000 0.000 0.000 0.0 0.0 0.000 0.000 0.500 0.000 0.000 0.50 0.000 0.000 0.000 0.000
8 0.000 0.000 0.000 0.000 0.0 0.000 0.000 0.000 0.0 0.0 0.000 0.000 0.000 0.000 0.000 0.00 0.534 0.000 0.000 0.000
9 0.000 0.000 0.000 0.000 0.0 0.000 0.534 0.000 0.0 0.0 0.000 0.000 0.000 0.000 0.000 0.00 0.000 0.000 0.000 0.000
Top terms by mean TF-IDF weight per class:
POSITIVE reviews — top 5 terms:
absolutely 0.1154
amazing 0.1154
fantastic 0.1154
highly 0.1154
recommend 0.1154
NEGATIVE reviews — top 5 terms:
broke 0.1335
complete 0.1335
waste 0.1335
awful 0.1335
disappointed 0.1335
What just happened?
The TF-IDF vectoriser converted 10 text documents into a 10×20 numerical matrix. Each cell is a TF-IDF weight — zero means the term doesn't appear in that review; higher values mean the term is both frequent in that document and rare enough across all documents to be distinctive. The class analysis at the bottom confirms the features are meaningful: positive reviews are defined by terms like "absolutely", "amazing", "fantastic", and "recommend", while negative reviews cluster around "broke", "complete waste", "awful", and "disappointed". The model can now draw a decision boundary through this 20-dimensional space.
Combining TF-IDF with Meta-Features into One Feature Matrix
The scenario:
TF-IDF captures what words are used. Meta-features capture how the text is written. Together they give the model both the content and the style of each review. You'll combine both feature sets into a single matrix using scipy.sparse.hstack — the right tool for joining sparse TF-IDF matrices with dense meta-feature arrays without blowing up memory.
# Import libraries
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler # scale meta-features to same range as TF-IDF
from scipy.sparse import hstack, csr_matrix # hstack combines sparse matrices horizontally
# Reviews and targets
raw_reviews = [
"This product is absolutely amazing! Best purchase I have ever made.",
"Terrible quality. Broke after two days. COMPLETE WASTE OF MONEY!!!",
"Good value for the price. Works as described. Happy with it.",
"DO NOT BUY THIS. Worst product ever. Total garbage!! Returning immediately.",
"Decent product. Nothing special but gets the job done. Okay.",
"Exceeded my expectations! Fantastic build quality. Highly recommend!",
"Awful. Stopped working after one week. Very disappointed and frustrated.",
"Perfect for my needs. Easy to use and looks great. Love it!",
"Cheap and nasty. Breaks easily. Customer service was useless too.",
"Outstanding quality! Works perfectly. Will definitely buy again!!"
]
sentiments = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]
# --- Step 1: Clean text for TF-IDF ---
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()
clean_reviews = [clean_text(r) for r in raw_reviews] # list of cleaned strings
# --- Step 2: TF-IDF features (sparse matrix) ---
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=15, sublinear_tf=True)
X_tfidf = tfidf.fit_transform(clean_reviews) # shape: (10, 15) sparse matrix
# --- Step 3: Meta-features (dense array) ---
meta = pd.DataFrame({
'word_count': [len(r.split()) for r in raw_reviews], # word count
'exclaim_count': [r.count('!') for r in raw_reviews], # exclamation marks
'capital_ratio': [sum(1 for c in r if c.isupper()) / (len(r) + 1e-9) # capital letter ratio
for r in raw_reviews],
'avg_word_len': [np.mean([len(w) for w in r.split()]) for r in raw_reviews] # average word length
})
# Scale meta-features: TF-IDF values are already in [0,1] range;
# meta-features like word_count can be 10+ — StandardScaler brings them to comparable scale
scaler = StandardScaler()
X_meta_scaled = scaler.fit_transform(meta) # returns numpy array, shape (10, 4)
# Convert scaled meta-features to sparse for hstack compatibility
X_meta_sparse = csr_matrix(X_meta_scaled) # wrap dense array in sparse format
# --- Step 4: Horizontally stack TF-IDF + meta-features ---
X_combined = hstack([X_tfidf, X_meta_sparse]) # shape: (10, 15+4) = (10, 19)
# Report shapes
print(f"TF-IDF feature matrix shape: {X_tfidf.shape}")
print(f"Meta-features matrix shape: {X_meta_scaled.shape}")
print(f"Combined feature matrix shape: {X_combined.shape}")
# Convert combined to dense and show column names
feature_names = list(tfidf.get_feature_names_out()) + list(meta.columns) # all 19 feature names
combined_df = pd.DataFrame(X_combined.toarray(), columns=feature_names)
combined_df['sentiment'] = sentiments
print(f"\nTotal features: {len(feature_names)} ({X_tfidf.shape[1]} TF-IDF + {len(meta.columns)} meta)")
print("\nFirst 3 rows of combined feature matrix (meta-features shown at right):")
print(combined_df[['word_count','exclaim_count','capital_ratio','avg_word_len','sentiment']].round(3).to_string(index=False))
TF-IDF feature matrix shape: (10, 15)
Meta-features matrix shape: (10, 4)
Combined feature matrix shape: (10, 19)
Total features: 19 (15 TF-IDF + 4 meta)
First 3 rows of combined feature matrix (meta-features shown at right):
word_count exclaim_count capital_ratio avg_word_len sentiment
-0.527 -0.218 -0.670 0.139 1
-0.527 1.528 2.397 -0.209 0
-0.527 -1.183 -0.706 -0.023 1
0.527 0.655 3.110 -0.133 0
-1.581 -1.183 -0.728 -0.302 1
-1.581 0.655 -0.670 0.292 1
-1.581 -1.183 -0.580 0.183 0
-0.527 -0.218 -0.683 0.017 1
-0.527 -1.183 -0.622 -0.006 0
-2.634 0.655 -0.683 0.042 1
What just happened?
scipy.sparse.hstack joined the 15-column TF-IDF sparse matrix with the 4-column scaled meta-feature matrix into a single (10, 19) feature matrix, keeping the result sparse rather than materialising a dense intermediate array in memory (the .toarray() call afterwards is for display only). The meta-features are now standardised (mean=0, std=1), putting them on a scale comparable to the TF-IDF weights. The second row (the "COMPLETE WASTE OF MONEY" review) has a scaled capital_ratio of 2.397 and the highest exclaim_count at 1.528, both standout values for a negative review. This combined matrix is ready to pass directly into any sklearn classifier.
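As a sketch of that final step — toy documents and labels here, not the lesson's review set — the stacked sparse matrix drops straight into an estimator's fit:

```python
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["great product love it", "terrible broke fast",
        "works great recommend", "awful waste of money"]
y = [1, 0, 1, 0]

tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(docs)                          # sparse TF-IDF block
meta = csr_matrix([[float(len(d.split()))] for d in docs])  # one meta-feature: word count
X = hstack([X_text, meta])                                  # still sparse after stacking

clf = LogisticRegression().fit(X, y)                        # sklearn accepts sparse input directly
print(clf.predict(X).shape)                                 # one prediction per document
```

No .toarray() call anywhere in the training path — the classifier consumes the sparse matrix as-is.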
TF-IDF vs Bag-of-Words vs N-Grams — The Decision Guide
Choosing the right text representation is as important as choosing the right model. Here's when to use each:
| Representation | Best For | Watch Out For | sklearn Class |
|---|---|---|---|
| Bag-of-Words (counts) | Short docs, frequency matters | Common words dominate | CountVectorizer |
| TF-IDF (unigrams) | Most classification tasks | Misses phrase meaning | TfidfVectorizer |
| TF-IDF + bigrams | Sentiment, negation-heavy text | Feature explosion — use max_features | TfidfVectorizer(ngram_range=(1,2)) |
| TF-IDF + meta-features | When style matters as much as content | Must scale meta-features separately | hstack([tfidf, meta]) |
| Word embeddings (Word2Vec / GloVe) | Semantic similarity, semantic search | Requires large corpus or pretrained model | gensim / spaCy |
Teacher's Note
TF-IDF has one critical rule: fit the vectoriser only on training data, then transform both train and test. If you fit on the full dataset, the IDF weights are influenced by test-set vocabulary — a subtle leakage that inflates your validation scores. In production, a word that appears in test but not train will simply receive a weight of zero; the vectoriser handles this gracefully as long as it was fitted on train only. For very large vocabularies, always set max_features — leaving it uncapped on a large corpus can easily produce 50,000+ columns, most of which are noise. Start with 5,000–10,000 and increase only if validation performance keeps improving.
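The leakage-safe pattern looks like this (toy documents assumed for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["great product", "terrible quality", "love it"]
test_docs = ["great quality", "brandnewword here"]  # second doc contains only unseen words

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_docs)  # fit ONLY on train — vocabulary and IDF come from here
X_test = tfidf.transform(test_docs)        # transform test with train's vocabulary

print(X_train.shape[1] == X_test.shape[1])  # True — both matrices share the same columns
print(X_test[1].nnz)                        # 0 — unseen words contribute nothing
```

The unseen words don't crash the transform; they simply produce an all-zero row, which is exactly the graceful behaviour described above.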
Practice Questions
1. Which text vectorisation method upweights rare, distinctive words and downweights common words that appear across almost all documents?
2. In sklearn's TfidfVectorizer, which parameter controls whether unigrams, bigrams, or both are included in the feature matrix?
3. The scipy.sparse function used to combine a sparse TF-IDF matrix with a dense meta-feature matrix horizontally — without converting either to a full dense array — is called ________.
Quiz
1. Why do bigrams improve sentiment classification compared to unigrams alone?
2. What is the correct leakage-safe procedure when using TF-IDF in a train/test pipeline?
3. From the meta-feature analysis in this lesson, which structural feature showed the strongest separation between positive and negative reviews?
Up Next · Lesson 40
Feature Engineering for Time Series
Calendar features, Fourier transforms, and autocorrelation features that turn a raw timestamp sequence into a model-ready representation of time.