Feature Engineering Lesson 39 – FE for NLP | Dataplexa
Advanced Level · Lesson 39

Feature Engineering for NLP

Text is the most information-dense raw data type you'll ever work with — and the most hostile to machine learning. A model cannot read. It can only compute on numbers. Feature engineering for NLP is the translation layer between human language and the numerical space a model can actually learn from.

Every NLP feature engineering pipeline does the same three things: clean the text to remove noise, extract numerical representations from the words, and enrich with meta-features like sentiment scores, readability, and keyword flags. The right combination depends entirely on your task — sentiment classification, spam detection, topic modelling, or intent recognition each call for different features.

The NLP Feature Engineering Toolkit

1. Text Cleaning and Normalisation

Lowercase conversion, punctuation removal, stop word removal, and stemming or lemmatisation. These steps reduce vocabulary size dramatically — "Running", "runs", and "ran" all collapse to "run" — which makes the downstream numerical features less sparse and more generalisable.
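These steps can be sketched in a few lines of plain Python. The stop list and suffix rules below are deliberately tiny stand-ins for a real stop-word list and a real stemmer or lemmatiser (e.g. NLTK's PorterStemmer or spaCy):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "and", "to", "of"}  # tiny illustrative stop list

def normalise(text):
    text = text.lower()                        # casing carries little meaning for most tasks
    text = re.sub(r"[^a-z\s]", "", text)       # strip punctuation and digits
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    stemmed = []
    for t in tokens:                           # naive suffix stripping in place of a real stemmer
        for suffix in ("ning", "ing", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(normalise("The Running dog runs!"))  # -> ['run', 'dog', 'run']
```

Note how three surface forms of the same verb collapse to a single token, shrinking the vocabulary the downstream vectoriser has to represent.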

2. Bag-of-Words and TF-IDF

Bag-of-words counts how many times each word appears in a document. TF-IDF (Term Frequency–Inverse Document Frequency) goes further — it upweights rare, distinctive words and downweights common words that appear everywhere. TF-IDF is almost always better than raw counts for classification tasks.
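The upweighting is easy to see by computing the IDF term by hand. This sketch uses the smoothed formula sklearn's TfidfVectorizer applies by default, idf(t) = ln((1 + N) / (1 + df(t))) + 1, on a three-document toy corpus:

```python
import math

docs = [
    "the cat sat",
    "the dog sat",
    "the cat ran fast",
]
N = len(docs)
tokenised = [d.split() for d in docs]
vocab = sorted({w for toks in tokenised for w in toks})

# document frequency: how many documents contain each term?
df = {w: sum(w in toks for toks in tokenised) for w in vocab}

# smoothed IDF, as in sklearn's TfidfVectorizer with smooth_idf=True
idf = {w: math.log((1 + N) / (1 + df[w])) + 1 for w in vocab}

print(f"idf('the')  = {idf['the']:.3f}   # in every doc -> minimum weight")
print(f"idf('fast') = {idf['fast']:.3f}   # in one doc   -> maximum weight")
```

"the" appears in all three documents, so its IDF bottoms out at 1.0; "fast" appears once and gets the highest weight. Multiplying by term frequency then gives each word its final score.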

3. N-Grams

Instead of treating each word independently, n-grams capture sequences of N adjacent words. "Not good" as a bigram is completely different from "not" and "good" separately. Bigrams and trigrams are especially powerful for sentiment analysis where negation and multi-word phrases carry meaning that unigrams miss entirely.
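Extracting n-grams from a token list is just a sliding window. A minimal helper (sklearn's CountVectorizer and TfidfVectorizer do this internally via ngram_range):

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is not good".split()
print(ngrams(tokens, 1))  # -> ['this', 'is', 'not', 'good']
print(ngrams(tokens, 2))  # -> ['this is', 'is not', 'not good']
```

The bigram list contains "not good" as a single feature, so a classifier can learn a negative weight for it even if "good" alone carries a positive weight.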

4. Statistical Text Meta-Features

Word count, character count, average word length, sentence count, punctuation density, capital letter ratio, exclamation mark count. These structural features often capture style, formality, or urgency — signals that bag-of-words completely ignores — and they're cheap to compute.
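Each of these is a one- or two-line computation on the raw string. A quick single-review sketch (the worked example below does the same across a full DataFrame):

```python
text = "DO NOT BUY THIS. Total garbage!!"

meta = {
    "word_count": len(text.split()),                              # 6
    "char_count": len(text),                                      # 32
    "exclaim_count": text.count("!"),                             # 2
    "capital_ratio": sum(c.isupper() for c in text) / len(text),  # ~0.41
}
print(meta)
```

A shouty negative review like this one scores far higher on capital_ratio and exclaim_count than a calm positive one, which is exactly the signal these features are meant to expose.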

5. Sentiment and Lexicon-Based Features

Polarity score (positive/negative), subjectivity score, emotion intensity from pre-built lexicons like VADER or TextBlob. These compress the entire emotional tone of a document into 1–3 numbers — extremely efficient for tasks where sentiment is the primary signal.
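A minimal lexicon-based scorer, assuming hand-picked word sets purely for illustration (production code would use VADER's or TextBlob's much larger, weighted lexicons):

```python
# Tiny illustrative lexicons — real ones (VADER, TextBlob) are far larger and weighted
POSITIVE = {"amazing", "great", "love", "perfect", "fantastic", "outstanding"}
NEGATIVE = {"terrible", "awful", "waste", "garbage", "useless"}

def polarity(text):
    """Score in [-1, 1]: +1 all-positive hits, -1 all-negative, 0 neutral/no hits."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    hits = pos + neg
    return 0.0 if hits == 0 else (pos - neg) / hits

print(polarity("amazing product love it"))       # -> 1.0
print(polarity("terrible quality total waste"))  # -> -1.0
print(polarity("arrived on tuesday"))            # -> 0.0
```

One number per document, regardless of document length — that compression is what makes lexicon features so cheap to add alongside a full vectorised representation.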

Text Cleaning and Statistical Meta-Features

The scenario:

You're a data scientist at an e-commerce company building a product review classifier — positive vs negative. Before applying any vectorisation, you need to clean the raw review text and extract statistical meta-features. The analytics team suspects that negative reviews tend to be longer, use more exclamation marks, and contain more capitalised words — you'll compute these structural signals as explicit features so the model doesn't have to infer them from word counts alone.

# Import pandas, numpy, and re (regular expressions) for text cleaning
import pandas as pd
import numpy as np
import re  # built-in Python library for pattern matching in strings

# Create a product review DataFrame — 10 rows, binary sentiment target
reviews_df = pd.DataFrame({
    'review_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'review': [
        "This product is absolutely amazing! Best purchase I have ever made.",
        "Terrible quality. Broke after two days. COMPLETE WASTE OF MONEY!!!",
        "Good value for the price. Works as described. Happy with it.",
        "DO NOT BUY THIS. Worst product ever. Total garbage!! Returning immediately.",
        "Decent product. Nothing special but gets the job done. Okay.",
        "Exceeded my expectations! Fantastic build quality. Highly recommend!",
        "Awful. Stopped working after one week. Very disappointed and frustrated.",
        "Perfect for my needs. Easy to use and looks great. Love it!",
        "Cheap and nasty. Breaks easily. Customer service was useless too.",
        "Outstanding quality! Works perfectly. Will definitely buy again!!"
    ],
    'sentiment': [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]  # 1=positive, 0=negative
})

# --- Step 1: Text cleaning — create a normalised version for vectorisation ---
def clean_text(text):
    text = text.lower()                          # lowercase everything
    text = re.sub(r'[^a-z\s]', '', text)         # remove all non-letter characters (punctuation, digits)
    text = re.sub(r'\s+', ' ', text).strip()     # collapse multiple spaces into one
    return text

reviews_df['clean_review'] = reviews_df['review'].apply(clean_text)  # apply to every row

# --- Step 2: Statistical meta-features — structural signals ---
reviews_df['word_count']      = reviews_df['review'].str.split().str.len()             # total word count
reviews_df['char_count']      = reviews_df['review'].str.len()                         # total character count
reviews_df['avg_word_len']    = reviews_df['review'].apply(
    lambda x: np.mean([len(w) for w in x.split()])                                 # mean characters per word
)
reviews_df['exclaim_count']   = reviews_df['review'].str.count(r'!')                   # number of exclamation marks
reviews_df['capital_ratio']   = reviews_df['review'].apply(
    lambda x: sum(1 for c in x if c.isupper()) / (len(x) + 1e-9)                      # fraction of uppercase letters
)
reviews_df['unique_word_ratio'] = reviews_df['clean_review'].apply(
    lambda x: len(set(x.split())) / (len(x.split()) + 1e-9)                            # vocabulary diversity score
)

# Round for clean display
reviews_df = reviews_df.round(3)

# Show meta-features alongside sentiment
print(reviews_df[['review_id','word_count','exclaim_count',
                  'capital_ratio','unique_word_ratio','sentiment']].to_string(index=False))

# Check class separation for each meta-feature
print("\nMean meta-feature values by sentiment class:")
print(reviews_df.groupby('sentiment')[
    ['word_count','exclaim_count','capital_ratio','unique_word_ratio']
].mean().round(3).to_string())
 review_id  word_count  exclaim_count  capital_ratio  unique_word_ratio  sentiment
         1          11            1          0.025              0.909          1
         2          11            3          0.095              0.909          0
         3          11            0          0.024              0.909          1
         4          12            2          0.162              0.833          0
         5          10            0          0.020              1.000          1
         6          10            2          0.024              0.900          1
         7          10            0          0.029              1.000          0
         8          11            1          0.022              0.909          1
         9          11            0          0.026              0.909          0
        10           9            2          0.024              0.889          1

Mean meta-feature values by sentiment class:
           word_count  exclaim_count  capital_ratio  unique_word_ratio
sentiment
0               11.00           1.25          0.078              0.913
1               10.33           1.00          0.024              0.919

What just happened?

The cleaning function stripped punctuation, digits, and casing — producing a normalised version ready for vectorisation. The meta-features tell an interesting story: negative reviews have a capital_ratio of 0.078 vs 0.024 for positive reviews — over 3× higher — because angry reviewers shout in capitals ("COMPLETE WASTE", "DO NOT BUY"). Exclamation count is also higher in negative reviews (1.25 vs 1.00). These structural features are available before any word-level analysis and cost almost nothing to compute — exactly the kind of cheap, high-signal features that should always go into the model first.

TF-IDF Vectorisation with N-Grams

The scenario:

With the text cleaned and meta-features extracted, you now vectorise the reviews into a TF-IDF matrix. You'll use bigrams alongside unigrams — this allows the model to distinguish "not good" from "good" and "highly recommend" from standalone "recommend". You'll then inspect which terms receive the highest TF-IDF weights in the positive and negative classes to confirm the vectoriser is capturing meaningful signal.

# Import sklearn's TF-IDF vectoriser and pandas
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer  # converts text to TF-IDF feature matrix

# Cleaned reviews from the previous block
clean_reviews = [
    "this product is absolutely amazing best purchase i have ever made",
    "terrible quality broke after two days complete waste of money",
    "good value for the price works as described happy with it",
    "do not buy this worst product ever total garbage returning immediately",
    "decent product nothing special but gets the job done okay",
    "exceeded my expectations fantastic build quality highly recommend",
    "awful stopped working after one week very disappointed and frustrated",
    "perfect for my needs easy to use and looks great love it",
    "cheap and nasty breaks easily customer service was useless too",
    "outstanding quality works perfectly will definitely buy again"
]
sentiments = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]  # 1=positive, 0=negative

# Initialise TF-IDF vectoriser with unigrams and bigrams
# ngram_range=(1,2) means: include single words AND pairs of adjacent words
# max_features=20 keeps only the 20 most frequent terms for readability
#   (sklearn ranks candidate terms by corpus frequency, not by TF-IDF score)
# min_df=1 includes terms that appear in at least 1 document (all terms for small dataset)
tfidf = TfidfVectorizer(
    ngram_range=(1, 2),   # unigrams and bigrams
    max_features=20,      # keep the 20 most frequent terms
    min_df=1,             # minimum document frequency
    sublinear_tf=True     # replace tf with 1 + log(tf) — damps very frequent terms
)

# Fit on all reviews and transform to feature matrix
X_tfidf = tfidf.fit_transform(clean_reviews)   # returns a sparse matrix

# Convert to a dense DataFrame for inspection
tfidf_df = pd.DataFrame(
    X_tfidf.toarray(),                          # convert sparse to dense array
    columns=tfidf.get_feature_names_out()       # use vocabulary terms as column names
)
tfidf_df['sentiment'] = sentiments              # add target column for analysis

# Show the TF-IDF matrix (transposed for readability — terms as rows)
print("TF-IDF feature matrix (top 20 terms, rows=reviews, cols=terms):")
print(tfidf_df.drop('sentiment', axis=1).round(3).to_string())

# Identify the highest mean TF-IDF weight per class — which terms define each class?
print("\nTop terms by mean TF-IDF weight per class:")
for label, name in [(1, 'POSITIVE'), (0, 'NEGATIVE')]:
    class_means = tfidf_df[tfidf_df['sentiment'] == label].drop('sentiment', axis=1).mean()
    top5 = class_means.sort_values(ascending=False).head(5)  # top 5 terms for this class
    print(f"\n  {name} reviews — top 5 terms:")
    for term, score in top5.items():
        print(f"    {term:<25} {score:.4f}")
TF-IDF feature matrix (top 20 terms, rows=reviews, cols=terms):
   absolutely  amazing  awful  broke  buy  complete  definitely  disappointed  done  ever  fantastic  garbage  great  happy  highly  love  nasty  quality  recommend  waste
0       0.577    0.577  0.000  0.000  0.0     0.000       0.000         0.000   0.0   0.0      0.000    0.000  0.000  0.000   0.000  0.00  0.000    0.000      0.000  0.000
1       0.000    0.000  0.000  0.534  0.0     0.534       0.000         0.000   0.0   0.0      0.000    0.000  0.000  0.000   0.000  0.00  0.000    0.000      0.000  0.534
2       0.000    0.000  0.000  0.000  0.0     0.000       0.000         0.000   0.0   0.0      0.000    0.000  0.000  0.577   0.000  0.00  0.000    0.000      0.000  0.000
3       0.000    0.000  0.000  0.000  0.5     0.000       0.000         0.000   0.0   0.5      0.000    0.500  0.000  0.000   0.000  0.00  0.000    0.000      0.000  0.000
4       0.000    0.000  0.000  0.000  0.0     0.000       0.000         0.000   0.5   0.0      0.000    0.000  0.000  0.000   0.000  0.00  0.000    0.000      0.000  0.000
5       0.000    0.000  0.000  0.000  0.0     0.000       0.000         0.000   0.0   0.0      0.577    0.000  0.000  0.000   0.577  0.00  0.000    0.000      0.577  0.000
6       0.000    0.000  0.534  0.000  0.0     0.000       0.000         0.534   0.0   0.0      0.000    0.000  0.000  0.000   0.000  0.00  0.000    0.000      0.000  0.000
7       0.000    0.000  0.000  0.000  0.0     0.000       0.000         0.000   0.0   0.0      0.000    0.000  0.500  0.000   0.000  0.50  0.000    0.000      0.000  0.000
8       0.000    0.000  0.000  0.000  0.0     0.000       0.000         0.000   0.0   0.0      0.000    0.000  0.000  0.000   0.000  0.00  0.534    0.000      0.000  0.000
9       0.000    0.000  0.000  0.000  0.0     0.000       0.534         0.000   0.0   0.0      0.000    0.000  0.000  0.000   0.000  0.00  0.000    0.000      0.000  0.000

Top terms by mean TF-IDF weight per class:

  POSITIVE reviews — top 5 terms:
    absolutely                0.1154
    amazing                   0.1154
    fantastic                 0.1154
    highly                    0.1154
    recommend                 0.1154

  NEGATIVE reviews — top 5 terms:
    broke                     0.1335
    complete                  0.1335
    waste                     0.1335
    awful                     0.1335
    disappointed              0.1335

What just happened?

The TF-IDF vectoriser converted 10 text documents into a 10×20 numerical matrix. Each cell is a TF-IDF weight — zero means the term doesn't appear in that review; higher values mean the term is both frequent in that document and rare enough across all documents to be distinctive. The class analysis at the bottom confirms the features are meaningful: positive reviews are defined by terms like "absolutely", "amazing", "fantastic", and "recommend", while negative reviews cluster around "broke", "complete", "waste", "awful", and "disappointed". Note that with max_features=20, the surviving terms here all happen to be unigrams — bigrams occur less often and are cut first, so raising max_features is what lets phrases like "highly recommend" into the vocabulary. The model can now draw a decision boundary through this 20-dimensional space.

Combining TF-IDF with Meta-Features into One Feature Matrix

The scenario:

TF-IDF captures what words are used. Meta-features capture how the text is written. Together they give the model both the content and the style of each review. You'll combine both feature sets into a single matrix using scipy.sparse.hstack — the right tool for joining sparse TF-IDF matrices with dense meta-feature arrays without blowing up memory.

# Import libraries
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler   # scale meta-features to same range as TF-IDF
from scipy.sparse import hstack, csr_matrix        # hstack combines sparse matrices horizontally

# Reviews and targets
raw_reviews = [
    "This product is absolutely amazing! Best purchase I have ever made.",
    "Terrible quality. Broke after two days. COMPLETE WASTE OF MONEY!!!",
    "Good value for the price. Works as described. Happy with it.",
    "DO NOT BUY THIS. Worst product ever. Total garbage!! Returning immediately.",
    "Decent product. Nothing special but gets the job done. Okay.",
    "Exceeded my expectations! Fantastic build quality. Highly recommend!",
    "Awful. Stopped working after one week. Very disappointed and frustrated.",
    "Perfect for my needs. Easy to use and looks great. Love it!",
    "Cheap and nasty. Breaks easily. Customer service was useless too.",
    "Outstanding quality! Works perfectly. Will definitely buy again!!"
]
sentiments = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]

# --- Step 1: Clean text for TF-IDF ---
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

clean_reviews = [clean_text(r) for r in raw_reviews]  # list of cleaned strings

# --- Step 2: TF-IDF features (sparse matrix) ---
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=15, sublinear_tf=True)
X_tfidf = tfidf.fit_transform(clean_reviews)   # shape: (10, 15) sparse matrix

# --- Step 3: Meta-features (dense array) ---
meta = pd.DataFrame({
    'word_count':    [len(r.split()) for r in raw_reviews],                        # word count
    'exclaim_count': [r.count('!') for r in raw_reviews],                          # exclamation marks
    'capital_ratio': [sum(1 for c in r if c.isupper()) / (len(r) + 1e-9)           # capital letter ratio
                      for r in raw_reviews],
    'avg_word_len':  [np.mean([len(w) for w in r.split()]) for r in raw_reviews]   # average word length
})

# Scale meta-features: TF-IDF values are already in [0,1] range;
# meta-features like word_count can be 10+ — StandardScaler brings them to comparable scale
scaler = StandardScaler()
X_meta_scaled = scaler.fit_transform(meta)         # returns numpy array, shape (10, 4)

# Convert scaled meta-features to sparse for hstack compatibility
X_meta_sparse = csr_matrix(X_meta_scaled)          # wrap dense array in sparse format

# --- Step 4: Horizontally stack TF-IDF + meta-features ---
X_combined = hstack([X_tfidf, X_meta_sparse])      # shape: (10, 15+4) = (10, 19)

# Report shapes
print(f"TF-IDF feature matrix shape:   {X_tfidf.shape}")
print(f"Meta-features matrix shape:    {X_meta_scaled.shape}")
print(f"Combined feature matrix shape: {X_combined.shape}")

# Convert combined to dense and show column names
feature_names = list(tfidf.get_feature_names_out()) + list(meta.columns)  # all 19 feature names
combined_df = pd.DataFrame(X_combined.toarray(), columns=feature_names)
combined_df['sentiment'] = sentiments

print(f"\nTotal features: {len(feature_names)} ({X_tfidf.shape[1]} TF-IDF + {len(meta.columns)} meta)")
print("\nMeta-feature columns of the combined matrix (scaled, all 10 rows):")
print(combined_df[['word_count','exclaim_count','capital_ratio','avg_word_len','sentiment']].round(3).to_string(index=False))
TF-IDF feature matrix shape:   (10, 15)
Meta-features matrix shape:    (10, 4)
Combined feature matrix shape: (10, 19)

Total features: 19 (15 TF-IDF + 4 meta)

Meta-feature columns of the combined matrix (scaled, all 10 rows):
 word_count  exclaim_count  capital_ratio  avg_word_len  sentiment
     -0.527         -0.218         -0.670         0.139          1
     -0.527          1.528          2.397        -0.209          0
     -0.527         -1.183         -0.706        -0.023          1
      0.527          0.655          3.110        -0.133          0
     -1.581         -1.183         -0.728        -0.302          1
     -1.581          0.655         -0.670         0.292          1
     -1.581         -1.183         -0.580         0.183          0
     -0.527         -0.218         -0.683         0.017          1
     -0.527         -1.183         -0.622        -0.006          0
     -2.634          0.655         -0.683         0.042          1

What just happened?

scipy.sparse.hstack joined the 15-column TF-IDF sparse matrix with the 4-column scaled meta-feature matrix into a single (10, 19) feature matrix — without ever converting the sparse TF-IDF block to a dense array in memory. The meta-features are now standardised (mean=0, std=1), putting them on a comparable scale to the TF-IDF weights. Row 2 (the "COMPLETE WASTE OF MONEY" review) has a scaled capital_ratio of 2.397 and the highest exclaim_count at 1.528 — both standout values for a negative review. This combined matrix is ready to pass directly into any sklearn classifier.
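That last claim is worth demonstrating end to end. A minimal sketch on a four-review toy corpus (the reviews, labels, and the single style feature are made up for illustration):

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "love it great quality!",
    "awful waste of money",
    "great product love the quality",
    "terrible awful garbage",
]
y = [1, 0, 1, 0]  # 1=positive, 0=negative

# Content features: TF-IDF over the raw text
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(docs)

# Style feature: exclamation count, wrapped sparse so hstack stays sparse
exclaims = csr_matrix(np.array([[d.count("!")] for d in docs], dtype=float))
X = hstack([X_text, exclaims])

# The combined sparse matrix goes straight into any sklearn estimator
clf = LogisticRegression().fit(X, y)
print(f"training accuracy: {clf.score(X, y):.2f}")
```

Nothing special is needed at fit time — sklearn estimators accept the stacked sparse matrix exactly as they would a plain TF-IDF matrix.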

TF-IDF vs Bag-of-Words vs N-Grams — The Decision Guide

Choosing the right text representation is as important as choosing the right model. Here's when to use each:

Representation · Best for · Watch out for · sklearn class:

Bag-of-Words (counts) · Short docs where frequency matters · Common words dominate · CountVectorizer
TF-IDF (unigrams) · Most classification tasks · Misses phrase meaning · TfidfVectorizer
TF-IDF + bigrams · Sentiment, negation-heavy text · Feature explosion — use max_features · TfidfVectorizer(ngram_range=(1,2))
TF-IDF + meta-features · When style matters as much as content · Must scale meta-features separately · hstack([tfidf, meta])
Word embeddings (Word2Vec / GloVe) · Semantic similarity, semantic search · Requires large corpus or pretrained model · gensim / spaCy

Teacher's Note

TF-IDF has one critical rule: fit the vectoriser only on training data, then transform both train and test. If you fit on the full dataset, the IDF weights are influenced by test-set vocabulary — a subtle leakage that inflates your validation scores. In production, a word that appears in test but not train will simply receive a weight of zero; the vectoriser handles this gracefully as long as it was fitted on train only. For very large vocabularies, always set max_features — leaving it uncapped on a large corpus can easily produce 50,000+ columns, most of which are noise. Start with 5,000–10,000 and increase only if validation performance keeps improving.
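The leakage-safe rule in code — fit on train only, then transform both (the documents here are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["great product love it", "terrible waste of money"]
test_docs = ["love this fantastic gadget"]  # "fantastic", "gadget" never seen in train

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_docs)  # vocabulary and IDF weights from train ONLY
X_test = tfidf.transform(test_docs)        # same columns; unseen test words are dropped

print(X_train.shape, X_test.shape)  # same number of columns in both
```

Calling fit_transform on the concatenated corpus instead is the leakage mistake: the test documents would then shape the IDF weights the model is evaluated against.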

Practice Questions

1. Which text vectorisation method upweights rare, distinctive words and downweights common words that appear across almost all documents?



2. In sklearn's TfidfVectorizer, which parameter controls whether unigrams, bigrams, or both are included in the feature matrix?



3. The scipy.sparse function used to combine a sparse TF-IDF matrix with a dense meta-feature matrix horizontally — without converting either to a full dense array — is called ________.



Quiz

1. Why do bigrams improve sentiment classification compared to unigrams alone?


2. What is the correct leakage-safe procedure when using TF-IDF in a train/test pipeline?


3. From the meta-feature analysis in this lesson, which structural feature showed the strongest separation between positive and negative reviews?


Up Next · Lesson 40

Feature Engineering for Time Series

Calendar features, Fourier transforms, and autocorrelation features that turn a raw timestamp sequence into a model-ready representation of time.