Feature Engineering Lesson 7 – Text Feature Basics | Dataplexa
Beginner Level · Lesson 7

Text Feature Basics

Every dataset has at least one free-text column that most people skip over. Product descriptions, customer reviews, support tickets, job titles — they look messy and unstructured, but they often contain some of the strongest predictive signal in the entire dataset. The challenge is turning words into numbers a model can learn from.

Text feature engineering moves through four stages: clean the raw text, tokenise it into words, measure basic properties like length and word counts, then vectorise it into numbers using TF-IDF or keyword flags. Each stage produces features you validate against the target before keeping.

Why Raw Text Breaks Models

A machine learning model can't read. It multiplies numbers. If you hand it the string "absolutely brilliant product — love it", it has no idea what to do. The string needs to be converted into a vector of numbers that captures something meaningful about the content.

The simplest text features don't require any deep NLP at all — word count, character count, exclamation mark count, presence of a specific keyword. These basic signals are often surprisingly strong and can be computed in one line of pandas. More sophisticated approaches like TF-IDF capture vocabulary patterns across the whole corpus. Both have their place.

1. Length features — word count, character count

Simple but effective. A support ticket with 200 words might indicate a more complex issue than one with 10. A product description with 5 words is probably less detailed than one with 80. Length alone carries signal in many tasks.

2. Keyword flags — presence of specific words

Binary 0/1 columns that signal whether a high-value word appears. "refund", "broken", "love", "urgent" — these domain-specific keywords often correlate strongly with the target and require no vectorisation at all.

3. TF-IDF — term frequency–inverse document frequency

The standard vectorisation approach. Gives high weight to words that appear often in one document but rarely across all documents — the words that actually distinguish texts from each other rather than the filler words that appear everywhere.

4. Cleaning — lowercase, punctuation, stop words

Before any of the above, the text needs to be standardised. "Love", "LOVE", and "love!" are the same word — but a model won't know that unless you lowercase everything and strip punctuation first. Stop words like "the", "a", "is" add noise without signal.

Length and Surface Features

The scenario: You're a data scientist at an e-commerce company. The product team has collected customer reviews and wants to predict whether a review is positive or negative. Before building any NLP pipeline, your manager asks: "Can we get any signal from just the surface properties of the text — how long it is, whether it has exclamation marks, that sort of thing? Run a quick check before we invest in anything more complex."

# pandas — core data library, always imported as pd
import pandas as pd

# re — Python's built-in regular expression library
# Used here to count punctuation patterns in text
import re

# Customer review dataset — 10 rows with sentiment labels
reviews_df = pd.DataFrame({
    'review_id': ['R01','R02','R03','R04','R05',
                  'R06','R07','R08','R09','R10'],
    'text': [
        'Absolutely love this product! Works perfectly every time!',
        'Terrible quality. Broke after two days. Complete waste of money.',
        'Great value for the price, highly recommend to everyone',
        'Awful. Does not work as advertised. Very disappointed.',
        'Fantastic! Best purchase I have made this year, so happy!',
        'Poor build quality, feels cheap and stopped working quickly',
        'Really pleased with this item, arrived fast and well packaged',
        'Horrible experience, customer service was rude and unhelpful',
        'Outstanding product, exceeded my expectations completely',
        'Dreadful quality, returned immediately, do not buy this'
    ],
    'sentiment': [1,0,1,0,1,0,1,0,1,0]  # 1=positive, 0=negative
})

# Word count — split on whitespace and count the resulting list
reviews_df['word_count'] = reviews_df['text'].str.split().str.len()

# Character count — total length of the string including spaces
reviews_df['char_count'] = reviews_df['text'].str.len()

# Exclamation marks — count how many ! appear in each review
# re.findall returns a list of all matches; len() counts them
reviews_df['exclamation_count'] = reviews_df['text'].apply(
    lambda x: len(re.findall(r'!', x)))

# Average word length — total characters in words divided by word count
# re.findall(r'\b\w+\b') extracts all words without punctuation
reviews_df['avg_word_len'] = reviews_df['text'].apply(
    lambda x: sum(len(w) for w in re.findall(r'\b\w+\b', x))
              / max(len(re.findall(r'\b\w+\b', x)), 1))

# Validate each feature against the target
print("Correlation with sentiment:\n")
for col in ['word_count', 'char_count', 'exclamation_count', 'avg_word_len']:
    corr = reviews_df[col].corr(reviews_df['sentiment'])
    print(f"  {col:<22}  {corr:+.4f}")

print("\nSurface features sample:\n")
print(reviews_df[['review_id','word_count','char_count',
                  'exclamation_count','sentiment']].to_string(index=False))
Correlation with sentiment:

  word_count              +0.0000
  char_count              -0.2004
  exclamation_count       +0.5000
  avg_word_len            +0.0536

Surface features sample:

 review_id  word_count  char_count  exclamation_count  sentiment
       R01           8          57                  2          1
       R02          10          64                  0          0
       R03           9          55                  0          1
       R04           8          54                  0          0
       R05          10          57                  2          1
       R06           9          59                  0          0
       R07          10          61                  0          1
       R08           8          60                  0          0
       R09           6          56                  0          1
       R10           8          55                  0          0

What just happened?

.str.split().str.len() chains two string operations — split into words, count the list. re.findall(r'!') returns every exclamation mark match as a list element. The standout result: exclamation_count correlates at +0.50 with sentiment. Positive reviewers use exclamation marks; negative reviewers stick to full stops. Word count carries no signal at all (0.00) and character count only a weak negative one (−0.20) — length does not distinguish happy from angry customers in this dataset.

Keyword Flags — Domain-Specific Signal

The scenario: The exclamation count was useful, but your manager asks for more. "Can you find the specific words that separate positive from negative reviews? I want binary flags for the most predictive positive and negative vocabulary — words the model can hang its classification decisions on." You build a function that checks for keyword presence and measures each flag's correlation with sentiment.

import pandas as pd

reviews_df = pd.DataFrame({
    'text': [
        'Absolutely love this product! Works perfectly every time!',
        'Terrible quality. Broke after two days. Complete waste of money.',
        'Great value for the price, highly recommend to everyone',
        'Awful. Does not work as advertised. Very disappointed.',
        'Fantastic! Best purchase I have made this year, so happy!',
        'Poor build quality, feels cheap and stopped working quickly',
        'Really pleased with this item, arrived fast and well packaged',
        'Horrible experience, customer service was rude and unhelpful',
        'Outstanding product, exceeded my expectations completely',
        'Dreadful quality, returned immediately, do not buy this'
    ],
    'sentiment': [1,0,1,0,1,0,1,0,1,0]
})

# Positive signal words — domain knowledge says these predict good reviews
# case=False makes the check case-insensitive
pos_pattern = r'love|great|fantastic|excellent|outstanding|pleased|recommend|best'
reviews_df['has_positive_word'] = (
    reviews_df['text'].str.contains(pos_pattern, case=False)).astype(int)

# Negative signal words — words associated with dissatisfied customers
neg_pattern = r'terrible|awful|horrible|dreadful|poor|waste|broke|disappointed'
reviews_df['has_negative_word'] = (
    reviews_df['text'].str.contains(neg_pattern, case=False)).astype(int)

# Quality mentions — does 'quality' signal praise or complaints? Check it
reviews_df['mentions_quality'] = (
    reviews_df['text'].str.contains(r'quality', case=False)).astype(int)

# Validate all three flags against sentiment
print("Keyword flag correlations with sentiment:\n")
for col in ['has_positive_word', 'has_negative_word', 'mentions_quality']:
    corr = reviews_df[col].corr(reviews_df['sentiment'])
    dist = reviews_df[col].value_counts().to_dict()
    print(f"  {col:<25}  {corr:+.4f}   distribution: {dist}")

print("\nKeyword flags sample:\n")
print(reviews_df[['text','has_positive_word',
                  'has_negative_word','sentiment']].to_string(index=False))
Keyword flag correlations with sentiment:

  has_positive_word          +1.0000   distribution: {1: 5, 0: 5}
  has_negative_word          -1.0000   distribution: {0: 5, 1: 5}
  mentions_quality           -0.6547   distribution: {0: 7, 1: 3}

Keyword flags sample:

                                                            text  has_positive_word  has_negative_word  sentiment
       Absolutely love this product! Works perfectly every time!                  1                  0          1
Terrible quality. Broke after two days. Complete waste of money.                  0                  1          0
         Great value for the price, highly recommend to everyone                  1                  0          1
          Awful. Does not work as advertised. Very disappointed.                  0                  1          0
       Fantastic! Best purchase I have made this year, so happy!                  1                  0          1
     Poor build quality, feels cheap and stopped working quickly                  0                  1          0
   Really pleased with this item, arrived fast and well packaged                  1                  0          1
    Horrible experience, customer service was rude and unhelpful                  0                  1          0
        Outstanding product, exceeded my expectations completely                  1                  0          1
         Dreadful quality, returned immediately, do not buy this                  0                  1          0

What just happened?

.str.contains(pattern, case=False) checks each review for any of the pipe-separated keywords and returns True/False; .astype(int) converts that to 1/0. The correlations are striking — has_positive_word at +1.0 and has_negative_word at −1.0 perfectly separate the two classes in this small, clean dataset (real review data is never this tidy). mentions_quality at −0.65 leans firmly negative — every mention of "quality" here sits inside a complaint about poor quality, not praise. The keyword list is the domain expert's mental model encoded as a feature.
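If the keyword lists grow, the two hand-written patterns above turn into a loop. A minimal sketch, assuming a hypothetical keyword_flags helper and made-up word lists (not part of the lesson's code):

```python
import pandas as pd

def keyword_flags(series, keyword_groups):
    """Build one 0/1 column per named keyword group (hypothetical helper)."""
    flags = {}
    for name, words in keyword_groups.items():
        pattern = '|'.join(words)            # e.g. 'love|great'
        flags[f'has_{name}'] = series.str.contains(
            pattern, case=False).astype(int)
    return pd.DataFrame(flags)

texts = pd.Series(['Love it!', 'Broke after two days', 'Great value'])
groups = {'positive': ['love', 'great'], 'negative': ['broke', 'awful']}
flags = keyword_flags(texts, groups)
print(flags)   # three rows, one 0/1 column per group
```

Each group becomes one binary column, so adding a new keyword category is a one-line change to the dict rather than another copy-pasted block.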

Text Cleaning — Preparing for Vectorisation

The scenario: The team wants to move beyond keyword flags and build a full vocabulary-based model. Before that can happen, the text needs to be cleaned. Your lead says: "Before we vectorise anything — lowercase everything, strip punctuation and numbers, remove stop words, and show me the cleaned text so we can sanity-check it. If the cleaning is wrong, everything downstream will be wrong too."

import pandas as pd
import re

reviews_df = pd.DataFrame({
    'text': [
        'Absolutely love this product! Works perfectly every time!',
        'Terrible quality. Broke after two days. Complete waste of money.',
        'Great value for the price, highly recommend to everyone',
        'Awful. Does not work as advertised. Very disappointed.',
        'Fantastic! Best purchase I have made this year, so happy!',
        'Poor build quality, feels cheap and stopped working quickly',
        'Really pleased with this item, arrived fast and well packaged',
        'Horrible experience, customer service was rude and unhelpful',
        'Outstanding product, exceeded my expectations completely',
        'Dreadful quality, returned immediately, do not buy this'
    ],
    'sentiment': [1,0,1,0,1,0,1,0,1,0]
})

# Stop words — common English words that carry no sentiment signal
# In production you would use nltk.corpus.stopwords.words('english')
STOP_WORDS = {'the','a','an','this','is','it','and','to','of','in','for',
              'not','my','i','with','was','as','so','very','does','do','after',
              'have','made','really','every','time','well','buy','has'}

def clean_text(text):
    # Step 1: lowercase — so "Love" and "love" are treated as the same word
    text = text.lower()
    # Step 2: remove punctuation and numbers — keep only letters and spaces
    text = re.sub(r'[^a-z\s]', '', text)
    # Step 3: remove stop words — split, filter, rejoin
    words = [w for w in text.split() if w not in STOP_WORDS]
    return ' '.join(words)

# Apply the cleaning function to every review
reviews_df['cleaned'] = reviews_df['text'].apply(clean_text)

# Word count after cleaning — compare to raw to see how much was stripped
reviews_df['clean_word_count'] = reviews_df['cleaned'].str.split().str.len()
reviews_df['raw_word_count']   = reviews_df['text'].str.split().str.len()

print("Raw vs cleaned text:\n")
for _, row in reviews_df.iterrows():
    print(f"  RAW:     {row['text']}")
    print(f"  CLEANED: {row['cleaned']}")
    print(f"  Words:   {row['raw_word_count']} → {row['clean_word_count']}\n")
Raw vs cleaned text:

  RAW:     Absolutely love this product! Works perfectly every time!
  CLEANED: absolutely love product works perfectly
  Words:   8 → 5

  RAW:     Terrible quality. Broke after two days. Complete waste of money.
  CLEANED: terrible quality broke two days complete waste money
  Words:   10 → 8

  RAW:     Great value for the price, highly recommend to everyone
  CLEANED: great value price highly recommend everyone
  Words:   9 → 6

  RAW:     Awful. Does not work as advertised. Very disappointed.
  CLEANED: awful work advertised disappointed
  Words:   8 → 4

  RAW:     Fantastic! Best purchase I have made this year, so happy!
  CLEANED: fantastic best purchase year happy
  Words:   10 → 5

  RAW:     Poor build quality, feels cheap and stopped working quickly
  CLEANED: poor build quality feels cheap stopped working quickly
  Words:   9 → 8

  RAW:     Really pleased with this item, arrived fast and well packaged
  CLEANED: pleased item arrived fast packaged
  Words:   10 → 5

  RAW:     Horrible experience, customer service was rude and unhelpful
  CLEANED: horrible experience customer service rude unhelpful
  Words:   8 → 6

  RAW:     Outstanding product, exceeded my expectations completely
  CLEANED: outstanding product exceeded expectations completely
  Words:   6 → 5

  RAW:     Dreadful quality, returned immediately, do not buy this
  CLEANED: dreadful quality returned immediately
  Words:   8 → 4

What just happened?

The cleaning pipeline runs three operations on every review. .lower() standardises case. re.sub(r'[^a-z\s]', '', text) strips everything that is not a lowercase letter or a space — punctuation, numbers, symbols all go. The stop word filter removes filler words that appear everywhere and carry no discriminating signal. The cleaned reviews are shorter and denser — "absolutely love product works perfectly" carries far more signal per word than the original 8-word sentence. This cleaned column is what gets fed into the TF-IDF vectoriser next.
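The hand-rolled STOP_WORDS set keeps the example self-contained. In practice you would reach for a standard list; a minimal sketch using scikit-learn's built-in ENGLISH_STOP_WORDS frozenset (NLTK's stopwords.words('english') is the other common choice):

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_text(text):
    # Same three steps as above, but with a standard stop word list
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return ' '.join(w for w in text.split() if w not in ENGLISH_STOP_WORDS)

print(clean_text('Absolutely love this product! Works perfectly every time!'))
```

The standard lists are longer than any hand-rolled set, so expect slightly different output — always eyeball a few cleaned rows before vectorising, exactly as this section does.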

TF-IDF Vectorisation

The scenario: Keyword flags caught the obvious words, but they miss everything you didn't think to include in the list. The lead asks you to run TF-IDF on the cleaned reviews — it will score every word in the vocabulary based on how distinctive it is across the corpus, and produce a feature matrix the model can train on directly.

import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Cleaned reviews (output from the previous step)
cleaned_reviews = [
    'absolutely love product works perfectly',
    'terrible quality broke two days complete waste money',
    'great value price highly recommend everyone',
    'awful work advertised disappointed',
    'fantastic best purchase year happy',
    'poor build quality feels cheap stopped working quickly',
    'pleased item arrived fast packaged',
    'horrible experience customer service rude unhelpful',
    'outstanding product exceeded expectations completely',
    'dreadful quality returned immediately'
]
sentiment = [1,0,1,0,1,0,1,0,1,0]

# TfidfVectorizer converts a list of text documents into a TF-IDF feature matrix
# max_features limits vocabulary to the top N most informative words
# ngram_range=(1,2) includes single words AND two-word phrases (bigrams)
vectorizer = TfidfVectorizer(max_features=15, ngram_range=(1, 2))

# fit_transform learns the vocabulary and scores, then transforms in one step
# The result is a sparse matrix — each row is a review, each column is a word/phrase
tfidf_matrix = vectorizer.fit_transform(cleaned_reviews)

# Convert to a readable DataFrame so we can see the feature names and values
feature_names = vectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(),
                        columns=feature_names).round(3)

print("Top 15 TF-IDF features (first 5 reviews):\n")
print(tfidf_df.head(5).to_string())

# Show which vocabulary the vectoriser learned
print(f"\nVocabulary ({len(feature_names)} terms):")
print(list(feature_names))
Top 15 TF-IDF features (first 5 reviews):

   absolutely  advertised  awful  broke  complete  completely  days  disappointed  dreadful  expectations  fantastic  happy  highly  horrible  immediately
0       0.707       0.000  0.000  0.000     0.000       0.000 0.000         0.000     0.000         0.000      0.000  0.000   0.000     0.000        0.000
1       0.000       0.000  0.000  0.408     0.408       0.000 0.408         0.000     0.000         0.000      0.000  0.000   0.000     0.000        0.000
2       0.000       0.000  0.000  0.000     0.000       0.000 0.000         0.000     0.000         0.000      0.000  0.000   0.577     0.000        0.000
3       0.000       0.577  0.577  0.000     0.000       0.000 0.000         0.577     0.000         0.000      0.000  0.000   0.000     0.000        0.000
4       0.000       0.000  0.000  0.000     0.000       0.000 0.000         0.000     0.000         0.000      0.577  0.577   0.000     0.000        0.000

Vocabulary (15 terms):
['absolutely', 'advertised', 'awful', 'broke', 'complete', 'completely',
 'days', 'disappointed', 'dreadful', 'expectations', 'fantastic', 'happy',
 'highly', 'horrible', 'immediately']

What just happened?

TfidfVectorizer from scikit-learn is the standard text vectorisation tool. During .fit_transform() it learns the vocabulary from the training corpus and calculates a TF-IDF score for every word in every document. Words that appear in only one or two documents get high scores — they are distinctive. Words that appear everywhere get low scores — they are not informative. Review 0 scores 0.707 on "absolutely" because that word only appears in that one review. Review 3 scores 0.577 on "awful", "advertised", and "disappointed" because each appears only in that review. The resulting matrix — one row per review, one column per word — is now a numerical feature matrix a model can train on directly.

The Text Feature Engineering Workflow

Stage         What you do                                      Tool                             Output
1. Surface    Word count, char count, punctuation flags        .str.split().str.len(), re       Numeric columns, validate vs target
2. Keywords   Binary flag per domain keyword                   .str.contains(pattern)           0/1 columns, validate vs target
3. Clean      Lowercase, strip punctuation, remove stop words  .lower(), re.sub(), list filter  Cleaned text column
4. Vectorise  Convert cleaned text to TF-IDF feature matrix    TfidfVectorizer                  Numeric feature matrix, model-ready

Teacher's Note

TF-IDF produces a very wide feature matrix — if your vocabulary is 10,000 words, you get 10,000 columns, and most of them will be zero for any given document. Use max_features to cap the vocabulary at the most frequent terms, and consider min_df=2 to exclude words that appear in only one document — those are usually typos or noise rather than signal.

Also: always fit the vectoriser on training data only, then transform both train and test. If you fit on the full dataset, information from the test set leaks into your vocabulary — words from test reviews influence which terms get selected and how their IDF scores are calculated. That is data leakage through the text pipeline, and it produces test scores that cannot be replicated in production.
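A minimal sketch of the leak-free pattern, with made-up train and test strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ['love this product', 'terrible quality']
test_texts  = ['love the quality', 'completely unseen gadget']

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)   # vocabulary and IDF learned here
X_test  = vec.transform(test_texts)        # reuses them; no refit

# Test-only words never enter the vocabulary, so the train and test
# matrices share exactly the same columns
print('gadget' in vec.vocabulary_)         # False
print(X_train.shape[1] == X_test.shape[1]) # True
```

Unseen test words like "gadget" are simply ignored at transform time — that is the correct behaviour, because the model trained on X_train has no column for them anyway.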

Practice Questions

1. The scikit-learn class that converts a list of cleaned text documents into a TF-IDF feature matrix is called ___.



2. To avoid data leakage in a text pipeline, you should fit the TfidfVectorizer on ___ data only, then transform both splits.



3. Common words like "the", "a", "is" that appear in almost every document and carry no discriminating signal are called ___.



Quiz

1. TF-IDF gives high scores to which type of words?


2. Which code pattern correctly strips punctuation and numbers from a lowercased text string?


3. Which two TfidfVectorizer parameters help manage a very large vocabulary?


Up Next · Lesson 8

Missing Data

Missing values are not just a nuisance — they carry information. Learn to detect, diagnose, and handle them correctly, and discover when a missing value itself is the most predictive feature in your dataset.