EDA Lesson 33 – Feature Extraction Basics | Dataplexa
Intermediate Level · Lesson 33

Feature Extraction Basics

Your raw columns are rarely in the best shape for a model to learn from. A date column is just a label until you extract the day of the week from it. A price and a quantity are two separate numbers until you multiply them into revenue. Feature extraction is the skill of turning what you have into what your model actually needs.

What Is Feature Extraction?

A raw column is like a locked safe. The information is in there — but the model can't read it. Feature extraction picks the lock. You take a column the model can't use directly and create new columns that contain the same information in a form the model can understand and learn from.

📅 Dates

Extract hour, weekday, month, is_weekend, quarter

🔢 Numbers

Create ratios, products, differences, binary flags

📝 Text

Extract length, word count, keyword presence

🏷️ Categories

Group rare values, create buckets, map ordinals
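Each of these families boils down to one or two pandas calls. A minimal sketch on toy values (the column names and data here are invented, not the lesson's dataset):

```python
import pandas as pd

# Toy frame — one example extraction per family (invented values)
toy = pd.DataFrame({
    'ts':    pd.to_datetime(['2024-03-04 12:35', '2024-03-09 20:48']),
    'price': [12.99, 22.50],
    'qty':   [2, 1],
    'name':  ['Margherita Pizza - Large', 'Veggie Wrap'],
    'city':  ['London', 'Leeds'],
})
toy['weekday']  = toy['ts'].dt.dayofweek                       # dates      -> 0=Mon … 6=Sun
toy['revenue']  = toy['price'] * toy['qty']                    # numbers    -> product
toy['n_words']  = toy['name'].str.split().str.len()            # text       -> word count
toy['city_grp'] = toy['city'].where(toy['city'] == 'London', 'Other')  # categories -> grouped
print(toy[['weekday', 'revenue', 'n_words', 'city_grp']])
```

The rest of the lesson applies exactly these moves, one column family at a time, to a realistic orders table.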

The Dataset We'll Use

The scenario: You're a data scientist at a food delivery platform. The data team has handed you a raw orders table and asked you to prepare it for a delivery time prediction model. The raw columns — a timestamp, a product description, a price and quantity, and a distance — aren't in a state the model can use. Your job is to extract useful features from every column before modelling begins.

import pandas as pd
import numpy as np

# Raw delivery orders — timestamp, free text name, price, quantity, distance
df = pd.DataFrame({
    'order_id':    [1001,1002,1003,1004,1005,1006,1007,1008,1009,1010],
    'order_time':  ['2024-03-04 12:35:00','2024-03-08 19:22:00','2024-03-09 08:05:00',
                    '2024-03-10 20:48:00','2024-03-15 13:10:00','2024-03-16 22:05:00',
                    '2024-03-17 07:55:00','2024-03-18 12:00:00','2024-03-22 18:30:00',
                    '2024-03-23 23:45:00'],
    'item_name':   ['Margherita Pizza - Large','Chicken Burger with Fries','Veggie Wrap',
                    'BBQ Ribs - Full Rack with Sides','Caesar Salad','Spicy Noodle Bowl',
                    'Breakfast Bagel with Egg','Fish and Chips','Pad Thai Noodles - Large',
                    'Loaded Nachos with Guacamole and Salsa'],
    'unit_price':  [12.99,9.49,7.99,22.50,8.99,10.49,6.50,11.99,12.49,13.99],
    'quantity':    [2,1,3,1,2,1,2,1,2,3],
    'distance_km': [2.1,5.8,1.4,3.2,4.7,8.1,0.9,2.6,3.8,6.4],
    'delivery_min':[22,38,15,28,35,52,12,24,30,45]   # TARGET — what we want to predict
})

print("Raw data — four columns the model can't use yet:")
print(df[['order_time','item_name','unit_price','quantity']].head(4).to_string(index=False))

What just happened?

Four columns, four problems: order_time is a text string the model can't do maths with. item_name is free text — meaningless to any algorithm. unit_price and quantity are separate when the model really needs to know total value. We'll fix all four in the next steps.
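You can see the first problem directly in the dtypes — an unparsed timestamp is just an object (string) column. A quick sketch on a toy frame, previewing the fix from Step 1:

```python
import pandas as pd

# A string timestamp is dtype 'object' — no .dt accessor, no time arithmetic
raw = pd.DataFrame({'order_time': ['2024-03-04 12:35:00'], 'unit_price': [12.99]})
print(raw.dtypes)   # order_time: object, unit_price: float64

# After parsing, the column becomes datetime64 and unlocks .dt
raw['order_time'] = pd.to_datetime(raw['order_time'])
print(raw['order_time'].dt.hour.iloc[0])   # 12
```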

Step 1 — Extract Features From the Date Column

The scenario: You have a hunch that delivery time varies by time of day and day of week — late-night Saturday orders are probably slower than Monday lunchtime ones. But "2024-03-08 19:22:00" as a text string is completely useless to the model. You need to pull hour, weekday, and a meal period label out of it as separate numeric or categorical columns.

# Convert text to a real datetime object first — required for all .dt operations
df['order_time'] = pd.to_datetime(df['order_time'])

# The .dt accessor unlocks all datetime properties on the column
df['hour']        = df['order_time'].dt.hour          # 0–23
df['day_of_week'] = df['order_time'].dt.dayofweek     # 0=Monday … 6=Sunday
df['is_weekend']  = (df['order_time'].dt.dayofweek >= 5).astype(int)  # Sat/Sun = 1

# Custom meal period label — a business-logic feature the model couldn't create itself
def meal_period(hour):
    if 6 <= hour < 11:    return 'Breakfast'
    elif 11 <= hour < 15: return 'Lunch'
    elif 17 <= hour < 22: return 'Dinner'
    else:                 return 'Late Night'   # 22:00–05:59, plus the 15–16 afternoon gap

# .apply() runs the function on every row of the column
df['meal_period'] = df['hour'].apply(meal_period)

print(df[['order_id','hour','day_of_week','is_weekend','meal_period']].to_string(index=False))

What just happened?

pandas' pd.to_datetime() converts the text string to a real datetime object. The .dt accessor then unlocks properties like .hour and .dayofweek — returning a full column of values, no loop needed. .apply(meal_period) runs our custom function row by row to assign the business-logic label.

From one raw timestamp, we now have four model-ready features. Order 1006 (Late Night, Saturday) took 52 minutes — the slowest in the dataset. Order 1007 (Breakfast, Sunday, 07:55) took just 12 minutes. The time signal is real.
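.apply() calls the function once per row, which gets slow on large tables. On a production-sized table, one vectorised alternative is np.select — a sketch using the same cutoffs as above, on toy hour values:

```python
import numpy as np
import pandas as pd

# Vectorised alternative to .apply for the meal-period label (same cutoffs as above)
hours = pd.Series([8, 12, 19, 23])   # toy hours, not the lesson's dataframe
conditions = [
    hours.between(6, 10),    # 06:00–10:59 -> Breakfast
    hours.between(11, 14),   # 11:00–14:59 -> Lunch
    hours.between(17, 21),   # 17:00–21:59 -> Dinner
]
labels = ['Breakfast', 'Lunch', 'Dinner']
meal = pd.Series(np.select(conditions, labels, default='Late Night'), index=hours.index)
print(meal.tolist())   # ['Breakfast', 'Lunch', 'Dinner', 'Late Night']
```

np.select evaluates the conditions in order and takes the first match, so it behaves exactly like the if/elif chain.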

Step 2 — Extract Features From the Text Column

The scenario: You suspect that complex orders — the ones with long names and many components — take longer to prepare. "Loaded Nachos with Guacamole and Salsa" probably takes longer than "Veggie Wrap." But the model can't read text. You need to extract the signal from the name: how long is it, how many words, is it a large portion, does it contain ingredients that suggest a heavy meal?

# .str gives access to string methods on the whole column — no loops needed

# Character count — longer names suggest more complex orders
df['name_length'] = df['item_name'].str.len()

# Word count — split into words, then count the words
df['word_count']  = df['item_name'].str.split().str.len()

# Is it a Large portion? Returns True/False, .astype(int) turns it into 1/0
df['is_large']    = df['item_name'].str.contains('Large', case=False).astype(int)

# Is it a heavy meal? Check for multiple keywords with | (OR) pattern
heavy = 'Pizza|Ribs|Burger|Chips|Nachos'
df['is_heavy']    = df['item_name'].str.contains(heavy, case=False).astype(int)

print(df[['item_name','name_length','word_count','is_large','is_heavy']].to_string(index=False))

What just happened?

pandas' .str accessor applies string operations to every row at once. .str.split().str.len() chains two operations: first splits the name into a list of words, then counts the list length. .str.contains('pattern', case=False) returns True wherever the pattern matches — the | character means OR in regex, so it checks for any of the heavy meal keywords.

"Loaded Nachos with Guacamole and Salsa" — name_length 38, word_count 6, took 45 minutes. "Veggie Wrap" — name_length 11, word_count 2, took 15 minutes. Two rows aren't proof — Step 4 will check the whole column — but the model now has four numbers where it had one unreadable string.
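Two more .str extractions are often worth having — pulling a token out with a regex capture group, and counting matches. A sketch on toy names (the patterns here are invented for illustration, not part of the lesson's pipeline):

```python
import pandas as pd

# Toy names, not the lesson's column
names = pd.Series(['BBQ Ribs - Full Rack with Sides', 'Veggie Wrap',
                   'Pad Thai Noodles - Large'])

# .str.extract pulls the capture group out of each string (NaN where nothing matches)
size = names.str.extract(r'- (Large|Small)', expand=False)

# .str.count counts regex matches — connector words as a rough "components" proxy
n_connectors = names.str.count(r'\b(?:with|and)\b')

print(size.tolist())          # [nan, nan, 'Large']
print(n_connectors.tolist())  # [1, 0, 0]
```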

Step 3 — Extract Features From Numbers

The scenario: The modelling lead points out that unit_price and quantity are two separate signals — but what matters for delivery is the total order value and how dense the revenue is per kilometre driven. A £40 order 1km away is very different from a £40 order 8km away. These ratio and combination features are things the model cannot create on its own — you have to build them.

# Total order value — price × quantity — the most important combination feature
df['order_value'] = df['unit_price'] * df['quantity']

# Revenue density — how much money per km of delivery?
df['value_per_km'] = (df['order_value'] / df['distance_km']).round(2)

# High-value binary flag — above the median order value → 1, below → 0
median_val         = df['order_value'].median()
df['is_high_value']= (df['order_value'] > median_val).astype(int)

# Order size bucket — single / double / multi item
df['size_bucket'] = pd.cut(df['quantity'], bins=[0,1,2,10],
                            labels=['Single','Double','Multi'])

print(df[['order_id','unit_price','quantity','order_value',
          'value_per_km','is_high_value','size_bucket']].to_string(index=False))

What just happened?

pandas column arithmetic is automatic and element-wise — df['unit_price'] * df['quantity'] multiplies each row's values together without a loop. Comparisons like df['order_value'] > median_val return a boolean Series; .astype(int) converts True/False to 1/0. pd.cut() bins the quantity column into labelled brackets.

The value_per_km feature is the most interesting: order 1003 (Veggie Wrap ×3, 1.4km) earns £17.12/km — very efficient. Order 1006 (Spicy Noodle Bowl, 8.1km) earns just £1.30/km. A routing algorithm would treat these very differently. The model couldn't compute this on its own — you had to build it first.
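pd.cut uses edges you choose; its sibling pd.qcut lets the data choose the edges (quantiles) instead, so each bucket holds roughly the same number of rows. A sketch on toy quantities:

```python
import pandas as pd

# Fixed edges vs quantile edges (toy quantities, not the lesson's column)
qty = pd.Series([1, 1, 1, 2, 2, 2, 3, 3])

fixed = pd.cut(qty, bins=[0, 1, 2, 10], labels=['Single', 'Double', 'Multi'])
equal = pd.qcut(qty, q=2, labels=['Small', 'Big'])   # ~half the rows in each bucket

print(fixed.tolist())
print(equal.tolist())
```

Fixed bins encode business rules ("a multi-item order"); quantile bins adapt to the distribution — useful when you don't know sensible cutoffs in advance.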

Step 4 — Validate Every Feature Against the Target

The scenario: You've created 12 new features. Before handing them to the modelling team, your lead asks one question: "Are these actually useful?" Building features the model can't use is wasted time and wasted columns. A quick correlation check against delivery_min tells you which ones to prioritise and which to deprioritise.

from scipy import stats

# The 10 numeric features — meal_period and size_bucket are categorical,
# and Pearson correlation needs numbers
new_features = ['hour','day_of_week','is_weekend','name_length',
                'word_count','is_large','is_heavy',
                'order_value','value_per_km','is_high_value']

print("Feature validation — correlation with delivery_min:\n")
print(f"{'Feature':<16} {'r':>7}  {'Strength':>10}  Recommendation")
print("─" * 58)

results = []
for feat in new_features:
    r, p = stats.pearsonr(df[feat], df['delivery_min'])
    strength = "Strong"   if abs(r) > 0.6 else \
               "Moderate" if abs(r) > 0.3 else "Weak"
    rec      = "✓ Keep"   if abs(r) > 0.3 else "? Review"
    results.append((abs(r), feat, r, strength, rec))

# Sort by absolute correlation — strongest first
for _, feat, r, strength, rec in sorted(results, reverse=True):
    print(f"  {feat:<14} {r:>+7.3f}  {strength:>10}  {rec}")

What just happened?

scipy's stats.pearsonr() returns the correlation coefficient r (and a p-value) for each numeric feature against the target. We sort by absolute correlation so the most valuable features rise to the top.

On this sample, hour (+0.88) comes out strongest — the dinner-rush and late-night orders are the slow ones — with value_per_km (−0.82) close behind: short, high-value trips get delivered fastest. The text features name_length and word_count show a positive but weak correlation, and is_weekend is weakest of all — possibly because 10 rows isn't enough to show a pattern, or possibly because the signal genuinely isn't there. With a larger dataset you'd validate everything again before deciding what to keep.
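One caveat worth a sketch: with only 10 rows, Pearson's r is fragile — a single outlier can swing it. A rank-based Spearman correlation, also in scipy, is a cheap cross-check (toy values here, not the lesson's columns):

```python
from scipy import stats

# Pearson measures linear association; Spearman measures monotonic association on ranks
x = [2, 3, 5, 8, 13, 21]        # toy feature values
y = [12, 15, 22, 30, 38, 52]    # toy target values

r_p, _ = stats.pearsonr(x, y)
r_s, _ = stats.spearmanr(x, y)
print(f"Pearson  r = {r_p:+.3f}")
print(f"Spearman r = {r_s:+.3f}")
```

If the two disagree sharply, plot the feature against the target before trusting either number.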

Teacher's Note

Feature extraction is where domain knowledge pays off. The idea that longer item names predict slower delivery came from understanding the business — complex items take longer to prepare. A pure statistician running automated feature selection might never think to extract name_length. Domain intuition generates the candidate features; validation against the target decides which ones earn a place in the model.

Always validate extracted features against the target before including them. If a feature you thought would be useful shows near-zero correlation, don't delete it immediately — with small samples the signal may just be hidden. But move it to the bottom of your priority list and only add it to the model if it improves validation performance.

Practice Questions

1. After converting a column to datetime with pd.to_datetime(), which pandas accessor unlocks properties like .hour, .month, and .dayofweek?



2. Which pandas string method checks whether a keyword appears in each row of a text column — returning True or False per row?



3. Which pandas method runs a custom Python function on every row of a column — used here to assign a meal period label based on the hour value?



Quiz

1. You have a free-text item_name column. Why extract name_length and word_count rather than passing the raw text directly to the model?


2. Your dataset has unit_price and quantity as separate columns. What is the most important feature to create from them?


3. A new feature is_weekend shows r = −0.04 with the target on a 10-row sample. What should you do?


Up Next · Lesson 34

Domain-Driven EDA

How knowing the industry transforms which patterns you look for, which anomalies you trust, and which features you build — the difference between analysis and expertise.