EDA Lesson 33 – Feature Extraction Basics | Dataplexa
Intermediate Level · Lesson 33

Feature Extraction Basics

Your raw columns are rarely in the best shape for a model to learn from. A date column is just a label until you extract the day of the week from it. A price and a quantity are two separate numbers until you multiply them into revenue. Feature extraction is the skill of turning what you have into what your model actually needs.

What Is Feature Extraction?

A raw column is like a locked safe. The information is in there — but the model can't read it. Feature extraction picks the lock. You take a column the model can't use directly and create new columns that contain the same information in a form the model can understand and learn from.

📅 Dates

Extract hour, weekday, month, is_weekend, quarter

🔢 Numbers

Create ratios, products, differences, binary flags

📝 Text

Extract length, word count, keyword presence

🏷️ Categories

Group rare values, create buckets, map ordinals
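Each of these families boils down to one or two pandas calls. A minimal sketch on toy values (the column names and data here are invented, not the lesson's dataset):

```python
import pandas as pd

# Toy frame — one example extraction per family (invented values)
toy = pd.DataFrame({
    'ts':    pd.to_datetime(['2024-03-04 12:35', '2024-03-09 20:48']),
    'price': [12.99, 22.50],
    'qty':   [2, 1],
    'name':  ['Margherita Pizza - Large', 'Veggie Wrap'],
    'city':  ['London', 'Leeds'],
})
toy['weekday']  = toy['ts'].dt.dayofweek                       # dates      -> 0=Mon … 6=Sun
toy['revenue']  = toy['price'] * toy['qty']                    # numbers    -> product
toy['n_words']  = toy['name'].str.split().str.len()            # text       -> word count
toy['city_grp'] = toy['city'].where(toy['city'] == 'London', 'Other')  # categories -> grouped
print(toy[['weekday', 'revenue', 'n_words', 'city_grp']])
```

The rest of the lesson applies exactly these moves, one column family at a time, to a realistic orders table.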

The Dataset We'll Use

The scenario: You're a data scientist at a food delivery platform. The data team has handed you a raw orders table and asked you to prepare it for a delivery time prediction model. The raw columns — a timestamp, a product description, a price and quantity, and a distance — aren't in a state the model can use. Your job is to extract useful features from every column before modelling begins.

import pandas as pd
import numpy as np

# Raw delivery orders — timestamp, free text name, price, quantity, distance
df = pd.DataFrame({
    'order_id':    [1001,1002,1003,1004,1005,1006,1007,1008,1009,1010],
    'order_time':  ['2024-03-04 12:35:00','2024-03-08 19:22:00','2024-03-09 08:05:00',
                    '2024-03-10 20:48:00','2024-03-15 13:10:00','2024-03-16 22:05:00',
                    '2024-03-17 07:55:00','2024-03-18 12:00:00','2024-03-22 18:30:00',
                    '2024-03-23 23:45:00'],
    'item_name':   ['Margherita Pizza - Large','Chicken Burger with Fries','Veggie Wrap',
                    'BBQ Ribs - Full Rack with Sides','Caesar Salad','Spicy Noodle Bowl',
                    'Breakfast Bagel with Egg','Fish and Chips','Pad Thai Noodles - Large',
                    'Loaded Nachos with Guacamole and Salsa'],
    'unit_price':  [12.99,9.49,7.99,22.50,8.99,10.49,6.50,11.99,12.49,13.99],
    'quantity':    [2,1,3,1,2,1,2,1,2,3],
    'distance_km': [2.1,5.8,1.4,3.2,4.7,8.1,0.9,2.6,3.8,6.4],
    'delivery_min':[22,38,15,28,35,52,12,24,30,45]   # TARGET — what we want to predict
})

print("Raw data — four columns the model can't use yet:")
print(df[['order_time','item_name','unit_price','quantity']].head(4).to_string(index=False))

What just happened?

Four columns, four problems: order_time is a text string the model can't do maths with. item_name is free text — meaningless to any algorithm. unit_price and quantity are separate when the model really needs to know total value. We'll fix all four in the next steps.
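You can see the first problem directly in the dtypes — an unparsed timestamp is just an object (string) column. A quick sketch on a toy frame, previewing the fix from Step 1:

```python
import pandas as pd

# A string timestamp is dtype 'object' — no .dt accessor, no time arithmetic
raw = pd.DataFrame({'order_time': ['2024-03-04 12:35:00'], 'unit_price': [12.99]})
print(raw.dtypes)   # order_time: object, unit_price: float64

# After parsing, the column becomes datetime64 and unlocks .dt
raw['order_time'] = pd.to_datetime(raw['order_time'])
print(raw['order_time'].dt.hour.iloc[0])   # 12
```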

Step 1 — Extract Features From the Date Column

The scenario: You have a hunch that delivery time varies by time of day and day of week — late-night Saturday orders are probably slower than Monday lunchtime ones. But "2024-03-08 19:22:00" as a text string is completely useless to the model. You need to pull hour, weekday, and a meal period label out of it as separate numeric or categorical columns.

# Convert text to a real datetime object first — required for all .dt operations
df['order_time'] = pd.to_datetime(df['order_time'])

# The .dt accessor unlocks all datetime properties on the column
df['hour']        = df['order_time'].dt.hour          # 0–23
df['day_of_week'] = df['order_time'].dt.dayofweek     # 0=Monday … 6=Sunday
df['is_weekend']  = (df['order_time'].dt.dayofweek >= 5).astype(int)  # Sat/Sun = 1

# Custom meal period label — a business-logic feature the model couldn't create itself
def meal_period(hour):
    if 6 <= hour < 11:    return 'Breakfast'
    elif 11 <= hour < 15: return 'Lunch'
    elif 17 <= hour < 22: return 'Dinner'
    else:                 return 'Late Night'   # 22:00–05:59, plus the 15–16 afternoon gap

# .apply() runs the function on every row of the column
df['meal_period'] = df['hour'].apply(meal_period)

print(df[['order_id','hour','day_of_week','is_weekend','meal_period']].to_string(index=False))

What just happened?

pandas' pd.to_datetime() converts the text string to a real datetime object. The .dt accessor then unlocks properties like .hour and .dayofweek — returning a full column of values, no loop needed. .apply(meal_period) runs our custom function row by row to assign the business-logic label.

From one raw timestamp, we now have four model-ready features. Order 1006 (Late Night, Saturday) took 52 minutes — the slowest in the dataset. Order 1007 (Breakfast, Sunday, 07:55) took just 12 minutes. The time signal is real.
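.apply() calls the function once per row, which gets slow on large tables. On a production-sized table, one vectorised alternative is np.select — a sketch using the same cutoffs as above, on toy hour values:

```python
import numpy as np
import pandas as pd

# Vectorised alternative to .apply for the meal-period label (same cutoffs as above)
hours = pd.Series([8, 12, 19, 23])   # toy hours, not the lesson's dataframe
conditions = [
    hours.between(6, 10),    # 06:00–10:59 -> Breakfast
    hours.between(11, 14),   # 11:00–14:59 -> Lunch
    hours.between(17, 21),   # 17:00–21:59 -> Dinner
]
labels = ['Breakfast', 'Lunch', 'Dinner']
meal = pd.Series(np.select(conditions, labels, default='Late Night'), index=hours.index)
print(meal.tolist())   # ['Breakfast', 'Lunch', 'Dinner', 'Late Night']
```

np.select evaluates the conditions in order and takes the first match, so it behaves exactly like the if/elif chain.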

Step 2 — Extract Features From the Text Column

The scenario: You suspect that complex orders — the ones with long names and many components — take longer to prepare. "Loaded Nachos with Guacamole and Salsa" probably takes longer than "Veggie Wrap." But the model can't read text. You need to extract the signal from the name: how long is it, how many words, is it a large portion, does it contain ingredients that suggest a heavy meal?

# .str gives access to string methods on the whole column — no loops needed

# Character count — longer names suggest more complex orders
df['name_length'] = df['item_name'].str.len()

# Word count — split into words, then count the words
df['word_count']  = df['item_name'].str.split().str.len()

# Is it a Large portion? Returns True/False, .astype(int) turns it into 1/0
df['is_large']    = df['item_name'].str.contains('Large', case=False).astype(int)

# Is it a heavy meal? Check for multiple keywords with | (OR) pattern
heavy = 'Pizza|Ribs|Burger|Chips|Nachos'
df['is_heavy']    = df['item_name'].str.contains(heavy, case=False).astype(int)

print(df[['item_name','name_length','word_count','is_large','is_heavy']].to_string(index=False))

What just happened?

pandas' .str accessor applies string operations to every row at once. .str.split().str.len() chains two operations: first splits the name into a list of words, then counts the list length. .str.contains('pattern', case=False) returns True wherever the pattern matches — the | character means OR in regex, so it checks for any of the heavy meal keywords.

"Loaded Nachos with Guacamole and Salsa" — name_length 38, word_count 6, took 45 minutes. "Veggie Wrap" — name_length 11, word_count 2, took 15 minutes. Two rows aren't proof — Step 4 will check the whole column — but the model now has four numbers where it had one unreadable string.
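Two more .str extractions are often worth having — pulling a token out with a regex capture group, and counting matches. A sketch on toy names (the patterns here are invented for illustration, not part of the lesson's pipeline):

```python
import pandas as pd

# Toy names, not the lesson's column
names = pd.Series(['BBQ Ribs - Full Rack with Sides', 'Veggie Wrap',
                   'Pad Thai Noodles - Large'])

# .str.extract pulls the capture group out of each string (NaN where nothing matches)
size = names.str.extract(r'- (Large|Small)', expand=False)

# .str.count counts regex matches — connector words as a rough "components" proxy
n_connectors = names.str.count(r'\b(?:with|and)\b')

print(size.tolist())          # [nan, nan, 'Large']
print(n_connectors.tolist())  # [1, 0, 0]
```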

Step 3 — Extract Features From Numbers

The scenario: The modelling lead points out that unit_price and quantity are two separate signals — but what matters for delivery is the total order value and how dense the revenue is per kilometre driven. A £40 order 1km away is very different from a £40 order 8km away. These ratio and combination features are things the model cannot create on its own — you have to build them.

# Total order value — price × quantity — the most important combination feature
df['order_value'] = df['unit_price'] * df['quantity']

# Revenue density — how much money per km of delivery?
df['value_per_km'] = (df['order_value'] / df['distance_km']).round(2)

# High-value binary flag — above the median order value → 1, below → 0
median_val         = df['order_value'].median()
df['is_high_value']= (df['order_value'] > median_val).astype(int)

# Order size bucket — single / double / multi item
df['size_bucket'] = pd.cut(df['quantity'], bins=[0,1,2,10],
                            labels=['Single','Double','Multi'])

print(df[['order_id','unit_price','quantity','order_value',
          'value_per_km','is_high_value','size_bucket']].to_string(index=False))

What just happened?

pandas column arithmetic is automatic and element-wise — df['unit_price'] * df['quantity'] multiplies each row's values together without a loop. Comparisons like df['order_value'] > median_val return a boolean Series; .astype(int) converts True/False to 1/0. pd.cut() bins the quantity column into labelled brackets.

The value_per_km feature is the most interesting: order 1003 (Veggie Wrap ×3, 1.4km) earns £17.12/km — very efficient. Order 1006 (Spicy Noodle Bowl, 8.1km) earns just £1.30/km. A routing algorithm would treat these very differently. The model couldn't compute this on its own — you had to build it first.
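pd.cut uses edges you choose; its sibling pd.qcut lets the data choose the edges (quantiles) instead, so each bucket holds roughly the same number of rows. A sketch on toy quantities:

```python
import pandas as pd

# Fixed edges vs quantile edges (toy quantities, not the lesson's column)
qty = pd.Series([1, 1, 1, 2, 2, 2, 3, 3])

fixed = pd.cut(qty, bins=[0, 1, 2, 10], labels=['Single', 'Double', 'Multi'])
equal = pd.qcut(qty, q=2, labels=['Small', 'Big'])   # ~half the rows in each bucket

print(fixed.tolist())
print(equal.tolist())
```

Fixed bins encode business rules ("a multi-item order"); quantile bins adapt to the distribution — useful when you don't know sensible cutoffs in advance.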

Step 4 — Validate Every Feature Against the Target

The scenario: You've created 12 new features. Before handing them to the modelling team, your lead asks one question: "Are these actually useful?" Building features the model can't use is wasted time and wasted columns. A quick correlation check against delivery_min tells you which ones to prioritise and which to deprioritise.

from scipy import stats

# The 10 numeric features — meal_period and size_bucket are categorical,
# and Pearson correlation needs numbers
new_features = ['hour','day_of_week','is_weekend','name_length',
                'word_count','is_large','is_heavy',
                'order_value','value_per_km','is_high_value']

print("Feature validation — correlation with delivery_min:\n")
print(f"{'Feature':<16} {'r':>7}  {'Strength':>10}  Recommendation")
print("─" * 58)

results = []
for feat in new_features:
    r, p = stats.pearsonr(df[feat], df['delivery_min'])
    strength = "Strong"   if abs(r) > 0.6 else \
               "Moderate" if abs(r) > 0.3 else "Weak"
    rec      = "✓ Keep"   if abs(r) > 0.3 else "? Review"
    results.append((abs(r), feat, r, strength, rec))

# Sort by absolute correlation — strongest first
for _, feat, r, strength, rec in sorted(results, reverse=True):
    print(f"  {feat:<14} {r:>+7.3f}  {strength:>10}  {rec}")

What just happened?

scipy's stats.pearsonr() returns the correlation coefficient r (and a p-value) for each numeric feature against the target. We sort by absolute correlation so the most valuable features rise to the top.

On this sample, hour (+0.88) comes out strongest — the dinner-rush and late-night orders are the slow ones — with value_per_km (−0.82) close behind: short, high-value trips get delivered fastest. The text features name_length and word_count show a positive but weak correlation, and is_weekend is weakest of all — possibly because 10 rows isn't enough to show a pattern, or possibly because the signal genuinely isn't there. With a larger dataset you'd validate everything again before deciding what to keep.
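One caveat worth a sketch: with only 10 rows, Pearson's r is fragile — a single outlier can swing it. A rank-based Spearman correlation, also in scipy, is a cheap cross-check (toy values here, not the lesson's columns):

```python
from scipy import stats

# Pearson measures linear association; Spearman measures monotonic association on ranks
x = [2, 3, 5, 8, 13, 21]        # toy feature values
y = [12, 15, 22, 30, 38, 52]    # toy target values

r_p, _ = stats.pearsonr(x, y)
r_s, _ = stats.spearmanr(x, y)
print(f"Pearson  r = {r_p:+.3f}")
print(f"Spearman r = {r_s:+.3f}")
```

If the two disagree sharply, plot the feature against the target before trusting either number.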

Teacher's Note

Feature extraction is where domain knowledge pays off. The idea that longer item names predict slower delivery came from understanding the business — complex items take longer to prepare. A pure statistician running automated feature selection might never think to extract name_length. Domain intuition generates the candidate features; validation against the target decides which ones earn a place in the model.

Always validate extracted features against the target before including them. If a feature you thought would be useful shows near-zero correlation, don't delete it immediately — with small samples the signal may just be hidden. But move it to the bottom of your priority list and only add it to the model if it improves validation performance.

Practice Questions

1. After converting a column to datetime with pd.to_datetime(), which pandas accessor unlocks properties like .hour, .month, and .dayofweek?



2. Which pandas string method checks whether a keyword appears in each row of a text column — returning True or False per row?



3. Which pandas method runs a custom Python function on every row of a column — used here to assign a meal period label based on the hour value?



Quiz

1. You have a free-text item_name column. Why extract name_length and word_count rather than passing the raw text directly to the model?


2. Your dataset has unit_price and quantity as separate columns. What is the most important feature to create from them?


3. A new feature is_weekend shows r = −0.04 with the target on a 10-row sample. What should you do?


Up Next · Lesson 34

Domain-Driven EDA

How knowing the industry transforms which patterns you look for, which anomalies you trust, and which features you build — the difference between analysis and expertise.