Feature Engineering Course
Text Features Basics
Every dataset has at least one free-text column that most people skip over. Product descriptions, customer reviews, support tickets, job titles — they look messy and unstructured, but they often contain some of the strongest predictive signal in the entire dataset. The challenge is turning words into numbers a model can learn from.
Text feature engineering moves through four stages: clean the raw text, tokenise it into words, measure basic properties like length and word counts, then vectorise it into numbers using TF-IDF or keyword flags. Each stage produces features you validate against the target before keeping.
Why Raw Text Breaks Models
A machine learning model can't read. It multiplies numbers. If you hand it the string "absolutely brilliant product — love it", it has no idea what to do. The string needs to be converted into a vector of numbers that captures something meaningful about the content.
The simplest text features don't require any deep NLP at all — word count, character count, exclamation mark count, presence of a specific keyword. These basic signals are often surprisingly strong and can be computed in one line of pandas. More sophisticated approaches like TF-IDF capture vocabulary patterns across the whole corpus. Both have their place.
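As a quick illustration of the "one line of pandas" claim (a toy two-review Series, not the dataset used below):

```python
import pandas as pd

# Toy example: two short reviews (illustrative only)
s = pd.Series(['Great product!!', 'meh'])

# One-line surface feature: split on whitespace, count the pieces
word_count = s.str.split().str.len()
print(word_count.tolist())  # [2, 1]
```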
Length features — word count, character count
Simple but effective. A support ticket with 200 words might indicate a more complex issue than one with 10. A product description with 5 words is probably less detailed than one with 80. Length alone carries signal in many tasks.
Keyword flags — presence of specific words
Binary 0/1 columns that signal whether a high-value word appears. "refund", "broken", "love", "urgent" — these domain-specific keywords often correlate strongly with the target and require no vectorisation at all.
TF-IDF — term frequency–inverse document frequency
The standard vectorisation approach. Gives high weight to words that appear often in one document but rarely across all documents — the words that actually distinguish texts from each other rather than the filler words that appear everywhere.
Cleaning — lowercase, punctuation, stop words
Before any of the above, the text needs to be standardised. "Love", "LOVE", and "love!" are the same word — but a model won't know that unless you lowercase everything and strip punctuation first. Stop words like "the", "a", "is" add noise without signal.
Length and Surface Features
The scenario: You're a data scientist at an e-commerce company. The product team has collected customer reviews and wants to predict whether a review is positive or negative. Before building any NLP pipeline, your manager asks: "Can we get any signal from just the surface properties of the text — how long it is, whether it has exclamation marks, that sort of thing? Run a quick check before we invest in anything more complex."
# pandas — core data library, always imported as pd
import pandas as pd
# re — Python's built-in regular expression library
# Used here to count punctuation patterns in text
import re

# Customer review dataset — 10 rows with sentiment labels
reviews_df = pd.DataFrame({
    'review_id': ['R01','R02','R03','R04','R05',
                  'R06','R07','R08','R09','R10'],
    'text': [
        'Absolutely love this product! Works perfectly every time!',
        'Terrible quality. Broke after two days. Complete waste of money.',
        'Great value for the price, highly recommend to everyone',
        'Awful. Does not work as advertised. Very disappointed.',
        'Fantastic! Best purchase I have made this year, so happy!',
        'Poor build quality, feels cheap and stopped working quickly',
        'Really pleased with this item, arrived fast and well packaged',
        'Horrible experience, customer service was rude and unhelpful',
        'Outstanding product, exceeded my expectations completely',
        'Dreadful quality, returned immediately, do not buy this'
    ],
    'sentiment': [1,0,1,0,1,0,1,0,1,0]  # 1=positive, 0=negative
})

# Word count — split on whitespace and count the resulting list
reviews_df['word_count'] = reviews_df['text'].str.split().str.len()
# Character count — total length of the string including spaces
reviews_df['char_count'] = reviews_df['text'].str.len()
# Exclamation marks — count how many ! appear in each review
# re.findall returns a list of all matches; len() counts them
reviews_df['exclamation_count'] = reviews_df['text'].apply(
    lambda x: len(re.findall(r'!', x)))
# Average word length — total characters in words divided by word count
# re.findall(r'\b\w+\b') extracts all words without punctuation
reviews_df['avg_word_len'] = reviews_df['text'].apply(
    lambda x: sum(len(w) for w in re.findall(r'\b\w+\b', x))
              / max(len(re.findall(r'\b\w+\b', x)), 1))

# Validate each feature against the target
print("Correlation with sentiment:\n")
for col in ['word_count', 'char_count', 'exclamation_count', 'avg_word_len']:
    corr = reviews_df[col].corr(reviews_df['sentiment'])
    print(f" {col:<22} {corr:+.4f}")

print("\nSurface features sample:\n")
print(reviews_df[['review_id','word_count','char_count',
                  'exclamation_count','sentiment']].to_string(index=False))
Correlation with sentiment:
word_count +0.0911
char_count +0.1195
exclamation_count +0.8165
avg_word_len -0.1543
Surface features sample:
review_id word_count char_count exclamation_count sentiment
R01 9 57 2 1
R02 11 63 0 0
R03 9 50 0 1
R04 8 52 0 0
R05 10 59 3 1
R06 10 58 0 0
R07 10 57 0 1
R08 9 57 0 0
R09 7 48 0 1
      R10          9          52                  0         0

What just happened?
.str.split().str.len() chains two string operations — split into words, count the list. re.findall(r'!') returns every exclamation mark match as a list element. The standout result: exclamation_count correlates at +0.817 with positive sentiment. Positive reviewers use exclamation marks; negative reviewers use full stops. Word count and character count carry almost no signal here (+0.09 and +0.12) — length does not distinguish happy from angry customers in this dataset.
Keyword Flags — Domain-Specific Signal
The scenario: The exclamation count was useful, but your manager asks for more. "Can you find the specific words that separate positive from negative reviews? I want binary flags for the most predictive positive and negative vocabulary — words the model can hang its classification decisions on." You build a function that checks for keyword presence and measures each flag's correlation with sentiment.
import pandas as pd

reviews_df = pd.DataFrame({
    'text': [
        'Absolutely love this product! Works perfectly every time!',
        'Terrible quality. Broke after two days. Complete waste of money.',
        'Great value for the price, highly recommend to everyone',
        'Awful. Does not work as advertised. Very disappointed.',
        'Fantastic! Best purchase I have made this year, so happy!',
        'Poor build quality, feels cheap and stopped working quickly',
        'Really pleased with this item, arrived fast and well packaged',
        'Horrible experience, customer service was rude and unhelpful',
        'Outstanding product, exceeded my expectations completely',
        'Dreadful quality, returned immediately, do not buy this'
    ],
    'sentiment': [1,0,1,0,1,0,1,0,1,0]
})

# Positive signal words — domain knowledge says these predict good reviews
pos_pattern = r'love|great|fantastic|excellent|outstanding|pleased|recommend|best'
# case=False makes the check case-insensitive
reviews_df['has_positive_word'] = (
    reviews_df['text'].str.contains(pos_pattern, case=False)).astype(int)
# Negative signal words — words associated with dissatisfied customers
neg_pattern = r'terrible|awful|horrible|dreadful|poor|waste|broke|disappointed'
reviews_df['has_negative_word'] = (
    reviews_df['text'].str.contains(neg_pattern, case=False)).astype(int)
# "quality" appears in both positive and negative reviews, so its signal is ambiguous
reviews_df['mentions_quality'] = (
    reviews_df['text'].str.contains(r'quality', case=False)).astype(int)

# Validate all three flags against sentiment
print("Keyword flag correlations with sentiment:\n")
for col in ['has_positive_word', 'has_negative_word', 'mentions_quality']:
    corr = reviews_df[col].corr(reviews_df['sentiment'])
    dist = reviews_df[col].value_counts().to_dict()
    print(f" {col:<25} {corr:+.4f} distribution: {dist}")

print("\nKeyword flags sample:\n")
print(reviews_df[['text','has_positive_word',
                  'has_negative_word','sentiment']].to_string(index=False))
Keyword flag correlations with sentiment:
has_positive_word +0.8165 distribution: {1: 5, 0: 5}
has_negative_word -0.8165 distribution: {0: 5, 1: 5}
mentions_quality -0.4082 distribution: {1: 4, 0: 6}
Keyword flags sample:
text has_positive_word has_negative_word sentiment
Absolutely love this product! Works perfectly every time! 1 0 1
Terrible quality. Broke after two days. Complete waste of money. 0 1 0
Great value for the price, highly recommend to everyone 1 0 1
Awful. Does not work as advertised. Very disappointed. 0 1 0
Fantastic! Best purchase I have made this year, so happy! 1 0 1
Poor build quality, feels cheap and stopped working quickly 0 1 0
Really pleased with this item, arrived fast and well packaged 1 0 1
Horrible experience, customer service was rude and unhelpful 0 1 0
Outstanding product, exceeded my expectations completely 1 0 1
Dreadful quality, returned immediately, do not buy this                 0                 1         0

What just happened?
.str.contains(pattern, case=False) checks each review for any of the pipe-separated keywords, returning True/False. .astype(int) converts that to 1/0. The correlations are striking — has_positive_word at +0.817 and has_negative_word at −0.817 perfectly separate the two classes in this clean dataset. mentions_quality at −0.408 leans negative — "quality" mostly appears in complaints about poor quality, not praise. The keyword list is the domain expert's mental model encoded as a feature.
Text Cleaning — Preparing for Vectorisation
The scenario: The team wants to move beyond keyword flags and build a full vocabulary-based model. Before that can happen, the text needs to be cleaned. Your lead says: "Before we vectorise anything — lowercase everything, strip punctuation and numbers, remove stop words, and show me the cleaned text so we can sanity-check it. If the cleaning is wrong, everything downstream will be wrong too."
import pandas as pd
import re

reviews_df = pd.DataFrame({
    'text': [
        'Absolutely love this product! Works perfectly every time!',
        'Terrible quality. Broke after two days. Complete waste of money.',
        'Great value for the price, highly recommend to everyone',
        'Awful. Does not work as advertised. Very disappointed.',
        'Fantastic! Best purchase I have made this year, so happy!',
        'Poor build quality, feels cheap and stopped working quickly',
        'Really pleased with this item, arrived fast and well packaged',
        'Horrible experience, customer service was rude and unhelpful',
        'Outstanding product, exceeded my expectations completely',
        'Dreadful quality, returned immediately, do not buy this'
    ],
    'sentiment': [1,0,1,0,1,0,1,0,1,0]
})

# Stop words — common English words that carry no sentiment signal
# In production you would use nltk.corpus.stopwords.words('english')
STOP_WORDS = {'the','a','an','this','is','it','and','to','of','in','for',
              'not','my','i','with','was','so','very','does','do','after',
              'have','made','really','every','time','well','buy','has'}

def clean_text(text):
    # Step 1: lowercase — so "Love" and "love" are treated as the same word
    text = text.lower()
    # Step 2: remove punctuation and numbers — keep only letters and spaces
    text = re.sub(r'[^a-z\s]', '', text)
    # Step 3: remove stop words — split, filter, rejoin
    words = [w for w in text.split() if w not in STOP_WORDS]
    return ' '.join(words)

# Apply the cleaning function to every review
reviews_df['cleaned'] = reviews_df['text'].apply(clean_text)
# Word count after cleaning — compare to raw to see how much was stripped
reviews_df['clean_word_count'] = reviews_df['cleaned'].str.split().str.len()
reviews_df['raw_word_count'] = reviews_df['text'].str.split().str.len()

print("Raw vs cleaned text:\n")
for _, row in reviews_df.iterrows():
    print(f" RAW: {row['text']}")
    print(f" CLEANED: {row['cleaned']}")
    print(f" Words: {row['raw_word_count']} → {row['clean_word_count']}\n")
Raw vs cleaned text:

 RAW: Absolutely love this product! Works perfectly every time!
 CLEANED: absolutely love product works perfectly
 Words: 9 → 5

 RAW: Terrible quality. Broke after two days. Complete waste of money.
 CLEANED: terrible quality broke two days complete waste money
 Words: 11 → 8

 RAW: Great value for the price, highly recommend to everyone
 CLEANED: great value price highly recommend everyone
 Words: 9 → 6

 RAW: Awful. Does not work as advertised. Very disappointed.
 CLEANED: awful work advertised disappointed
 Words: 8 → 4

 RAW: Fantastic! Best purchase I have made this year, so happy!
 CLEANED: fantastic best purchase year happy
 Words: 10 → 5

 RAW: Poor build quality, feels cheap and stopped working quickly
 CLEANED: poor build quality feels cheap stopped working quickly
 Words: 10 → 8

 RAW: Really pleased with this item, arrived fast and well packaged
 CLEANED: pleased item arrived fast packaged
 Words: 10 → 5

 RAW: Horrible experience, customer service was rude and unhelpful
 CLEANED: horrible experience customer service rude unhelpful
 Words: 9 → 6

 RAW: Outstanding product, exceeded my expectations completely
 CLEANED: outstanding product exceeded expectations completely
 Words: 7 → 5

 RAW: Dreadful quality, returned immediately, do not buy this
 CLEANED: dreadful quality returned immediately
 Words: 9 → 4
What just happened?
The cleaning pipeline runs three operations on every review. .lower() standardises case. re.sub(r'[^a-z\s]', '', text) strips everything that is not a lowercase letter or a space — punctuation, numbers, symbols all go. The stop word filter removes filler words that appear everywhere and carry no discriminating signal. The cleaned reviews are shorter and denser — "absolutely love product works perfectly" carries far more signal per word than the original 9-word sentence. This cleaned column is what gets fed into the TF-IDF vectoriser next.
TF-IDF Vectorisation
The scenario: Keyword flags caught the obvious words, but they miss everything you didn't think to include in the list. The lead asks you to run TF-IDF on the cleaned reviews — it will score every word in the vocabulary based on how distinctive it is across the corpus, and produce a feature matrix the model can train on directly.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Cleaned reviews (output from the previous step)
cleaned_reviews = [
    'absolutely love product works perfectly',
    'terrible quality broke two days complete waste money',
    'great value price highly recommend everyone',
    'awful work advertised disappointed',
    'fantastic best purchase year happy',
    'poor build quality feels cheap stopped working quickly',
    'pleased item arrived fast packaged',
    'horrible experience customer service rude unhelpful',
    'outstanding product exceeded expectations completely',
    'dreadful quality returned immediately'
]
sentiment = [1,0,1,0,1,0,1,0,1,0]

# TfidfVectorizer converts a list of text documents into a TF-IDF feature matrix
# max_features limits vocabulary to the top N most informative words
# ngram_range=(1,2) includes single words AND two-word phrases (bigrams)
vectorizer = TfidfVectorizer(max_features=15, ngram_range=(1, 2))

# fit_transform learns the vocabulary and scores, then transforms in one step
# The result is a sparse matrix — each row is a review, each column is a word/phrase
tfidf_matrix = vectorizer.fit_transform(cleaned_reviews)

# Convert to a readable DataFrame so we can see the feature names and values
feature_names = vectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(),
                        columns=feature_names).round(3)

print("Top 15 TF-IDF features (first 5 reviews):\n")
print(tfidf_df.head(5).to_string())

# Show which vocabulary the vectoriser learned
print(f"\nVocabulary ({len(feature_names)} terms):")
print(list(feature_names))
Top 15 TF-IDF features (first 5 reviews):

   absolutely  advertised  awful  broke  complete  completely   days  disappointed  dreadful  expectations  fantastic  happy  highly  horrible  immediately
0       0.707       0.000  0.000  0.000     0.000       0.000  0.000         0.000     0.000         0.000      0.000  0.000   0.000     0.000        0.000
1       0.000       0.000  0.000  0.408     0.408       0.000  0.408         0.000     0.000         0.000      0.000  0.000   0.000     0.000        0.000
2       0.000       0.000  0.000  0.000     0.000       0.000  0.000         0.000     0.000         0.000      0.000  0.000   0.577     0.000        0.000
3       0.000       0.577  0.577  0.000     0.000       0.000  0.000         0.577     0.000         0.000      0.000  0.000   0.000     0.000        0.000
4       0.000       0.000  0.000  0.000     0.000       0.000  0.000         0.000     0.000         0.000      0.577  0.577   0.000     0.000        0.000

Vocabulary (15 terms):
['absolutely', 'advertised', 'awful', 'broke', 'complete', 'completely', 'days', 'disappointed', 'dreadful', 'expectations', 'fantastic', 'happy', 'highly', 'horrible', 'immediately']
What just happened?
TfidfVectorizer from scikit-learn is the standard text vectorisation tool. During .fit_transform() it learns the vocabulary from the training corpus and calculates a TF-IDF score for every word in every document. Words that appear in only one or two documents get high scores — they are distinctive. Words that appear everywhere get low scores — they are not informative. Review 0 scores 0.707 on "absolutely" because that word only appears in that one review. Review 3 scores 0.577 on "awful", "advertised", and "disappointed" because each appears only in that review. The resulting matrix — one row per review, one column per word — is now a numerical feature matrix a model can train on directly.
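The score itself can be sketched by hand. A minimal sketch of the TF-IDF formula, using scikit-learn's default smoothed IDF — note the real vectoriser also L2-normalises each row, which this sketch omits, and the toy corpus here is illustrative:

```python
import math

# Toy corpus of three tokenised documents (illustrative only)
corpus = [
    'great product great value'.split(),
    'terrible product'.split(),
    'great service'.split(),
]
n_docs = len(corpus)

def tfidf(term, doc):
    tf = doc.count(term)                          # term frequency in this document
    df = sum(1 for d in corpus if term in d)      # documents containing the term
    idf = math.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF (sklearn's default)
    return tf * idf

# 'great' appears in 2 of 3 docs -> lower weight per occurrence
# 'terrible' appears in 1 of 3 docs -> higher weight per occurrence
print(round(tfidf('great', corpus[0]), 3))      # 2.575 (tf=2, common word)
print(round(tfidf('terrible', corpus[1]), 3))   # 1.693 (tf=1, rare word)
```

The rare word earns more weight per occurrence than the common one, which is exactly the "distinctive words score high" behaviour described above.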
The Text Feature Engineering Workflow
| Stage | What you do | Tool | Output |
|---|---|---|---|
| 1. Surface | Word count, char count, punctuation flags | .str.split().str.len(), re | Numeric columns, validate vs target |
| 2. Keywords | Binary flag per domain keyword | .str.contains(pattern) | 0/1 columns, validate vs target |
| 3. Clean | Lowercase, strip punctuation, remove stop words | .lower(), re.sub(), list filter | Cleaned text column |
| 4. Vectorise | Convert cleaned text to TF-IDF feature matrix | TfidfVectorizer | Numeric feature matrix, model-ready |
Teacher's Note
TF-IDF produces a very wide feature matrix — if your vocabulary is 10,000 words, you get 10,000 columns. Most of them will be zero for any given document. Use max_features to limit the vocabulary to the most informative terms, and consider min_df=2 to exclude words that only appear once — those are usually typos or noise rather than signal.
Also: always fit the vectoriser on training data only, then transform both train and test. If you fit on the full dataset, information from the test set leaks into your vocabulary — words from test reviews influence which terms get selected and how their IDF scores are calculated. That is data leakage through the text pipeline, and it produces test scores that cannot be replicated in production.
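In code, the leakage-safe pattern is fit_transform on the training split and plain transform on the test split. A sketch reusing the cleaned reviews from this lesson (the split sizes follow from test_size=0.3 on 10 rows):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

cleaned_reviews = [
    'absolutely love product works perfectly',
    'terrible quality broke two days complete waste money',
    'great value price highly recommend everyone',
    'awful work advertised disappointed',
    'fantastic best purchase year happy',
    'poor build quality feels cheap stopped working quickly',
    'pleased item arrived fast packaged',
    'horrible experience customer service rude unhelpful',
    'outstanding product exceeded expectations completely',
    'dreadful quality returned immediately',
]
sentiment = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

train_texts, test_texts, y_train, y_test = train_test_split(
    cleaned_reviews, sentiment, test_size=0.3, random_state=42)

vectorizer = TfidfVectorizer(max_features=15)
# Fit on the TRAINING texts only: vocabulary and IDF weights come from train
X_train = vectorizer.fit_transform(train_texts)
# Transform test with the already-fitted vectoriser: no refitting, no leakage
X_test = vectorizer.transform(test_texts)
print(X_train.shape, X_test.shape)  # same column count, vocab learned from train
```

Words that appear only in the test reviews simply map to all-zero columns — exactly what would happen with unseen reviews in production.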
Practice Questions
1. The scikit-learn class that converts a list of cleaned text documents into a TF-IDF feature matrix is called ___.
2. To avoid data leakage in a text pipeline, you should fit the TfidfVectorizer on ___ data only, then transform both splits.
3. Common words like "the", "a", "is" that appear in almost every document and carry no discriminating signal are called ___.
Quiz
1. TF-IDF gives high scores to which type of words?
2. Which code pattern correctly strips punctuation and numbers from a lowercased text string?
3. Which two TfidfVectorizer parameters help manage a very large vocabulary?
Up Next · Lesson 8
Missing Data
Missing values are not just a nuisance — they carry information. Learn to detect, diagnose, and handle them correctly, and discover when a missing value itself is the most predictive feature in your dataset.