Feature Creation
Transform raw ecommerce data into powerful predictive features using mathematical operations, time-based engineering, and categorical encoding techniques.
Why Create New Features?
Raw data rarely tells the complete story. Your ecommerce dataset has basic columns like unit_price and quantity, but what about profit margins or seasonal trends? Machine learning models perform significantly better when you give them features that capture business relationships.
Think about predicting customer returns. The model needs more than just product category — it needs engineered features like order value relative to customer's average, days since last purchase, or price-per-unit ratios. These derived features often become your most predictive variables.
Mathematical Features
Ratios, differences, products of existing columns
Temporal Features
Day of week, month, seasonality patterns
Categorical Encoding
One-hot, target encoding, frequency counts
Aggregation Features
Customer totals, category averages, rankings
Mathematical Feature Engineering
The scenario: Flipkart's data team needs better features to predict high-value customers. Raw revenue numbers don't capture purchasing behavior patterns — they need engineered features that reveal customer profitability and buying intensity.
import pandas as pd
import numpy as np
# Load the ecommerce data
df = pd.read_csv('dataplexa_ecommerce.csv')
# Quick look at our raw features
print("Original columns:")
print(df.columns.tolist())
Original columns:
['order_id', 'date', 'customer_age', 'gender', 'city', 'product_category', 'product_name', 'quantity', 'unit_price', 'revenue', 'rating', 'returned']
What just happened?
We loaded the standard ecommerce dataset with 12 base columns. Notice we have quantity, unit_price, and revenue — perfect for creating ratio-based features. Try this: Think about what business metrics you could calculate from these three columns alone.
Now we'll create powerful mathematical features. These capture business relationships that individual columns miss.
# Create revenue efficiency features
df['revenue_per_item'] = df['revenue'] / df['quantity']
df['price_vs_revenue_ratio'] = df['unit_price'] / df['revenue_per_item']
# Customer value indicators
df['high_quantity_flag'] = (df['quantity'] >= 5).astype(int)
df['premium_purchase'] = (df['unit_price'] > df['unit_price'].quantile(0.75)).astype(int)
print("New mathematical features:")
print(df[['revenue_per_item', 'price_vs_revenue_ratio', 'high_quantity_flag', 'premium_purchase']].head())
New mathematical features:
   revenue_per_item  price_vs_revenue_ratio  high_quantity_flag  premium_purchase
0           2499.00                    1.00                   0                 1
1            899.50                    1.00                   0                 0
2           1250.00                    1.00                   0                 0
3            750.00                    1.00                   1                 0
4           3200.00                    1.00                   0                 1
What just happened?
We created four powerful features: revenue_per_item shows average value per unit, price_vs_revenue_ratio should equal 1.0 (validation check), and two binary flags for bulk buyers and premium customers. Try this: Create a feature combining age and revenue to find young high-spenders.
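If you want a head start on that exercise, here is a minimal sketch of an age/revenue interaction. The age cutoff of 30 and the 75th-percentile revenue threshold are illustrative choices, not values from the lesson:
# Sketch: flag young, high-spending customers. The age cutoff (30) and
# the 0.75 revenue quantile are illustrative thresholds, not fixed rules.
df['young_high_spender'] = (
    (df['customer_age'] < 30)
    & (df['revenue'] > df['revenue'].quantile(0.75))
).astype(int)
print(df['young_high_spender'].value_counts())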
Time-Based Feature Creation
Date columns are feature goldmines. A single date gives you day of week, month, quarter, seasonality patterns — each potentially predictive for different business problems.
# Convert date string to datetime - critical first step
df['date'] = pd.to_datetime(df['date'])
# Extract temporal features that capture business patterns
df['day_of_week'] = df['date'].dt.dayofweek # 0=Monday, 6=Sunday
df['month'] = df['date'].dt.month
df['is_weekend'] = (df['day_of_week'].isin([5, 6])).astype(int)
df['quarter'] = df['date'].dt.quarter
print("Sample of temporal features:")
print(df[['date', 'day_of_week', 'month', 'is_weekend', 'quarter']].head())
Sample of temporal features:
date day_of_week month is_weekend quarter
0 2023-01-15 6 1 1 1
1 2023-01-22 6 1 1 1
2 2023-02-08 2 2 0 1
3 2023-02-14 1 2 0 1
4 2023-03-03 4 3 0 1
What just happened?
We extracted four temporal features from one date column. Notice day_of_week=6 means Sunday, and is_weekend=1 confirms weekend shopping behavior. The quarter feature will capture seasonal patterns. Try this: Create features for holiday proximity or payroll cycles.
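As a starting point for the holiday-proximity idea, a minimal sketch can measure the gap in days to a single reference date. A real feature would look up the nearest date in a proper holiday calendar; the date below is only a placeholder:
# Days until a reference holiday (placeholder date, not from the lesson).
# A production version would map each order to the nearest entry in a
# full regional holiday calendar.
reference_holiday = pd.Timestamp('2023-11-12')
df['days_to_holiday'] = (reference_holiday - df['date']).dt.days
df['pre_holiday_window'] = df['days_to_holiday'].between(0, 14).astype(int)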
Advanced time features can capture cyclical patterns. Why does this matter? Because customer behavior follows predictable cycles — weekend vs weekday shopping, month-end salary spending, festival seasons.
# Create cyclical features using sine/cosine - preserves circular nature
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
df['day_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
# Business-specific time features
df['is_month_start'] = (df['date'].dt.day <= 7).astype(int) # Salary week
df['is_month_end'] = (df['date'].dt.day >= 25).astype(int) # Pre-salary spending
print("Advanced temporal features:")
print(df[['month_sin', 'month_cos', 'day_sin', 'is_month_start', 'is_month_end']].head())
Advanced temporal features:
   month_sin  month_cos   day_sin  is_month_start  is_month_end
0      0.500      0.866 -0.781831               0             0
1      0.500      0.866 -0.781831               0             0
2      0.866      0.500  0.974928               0             0
3      0.866      0.500  0.781831               0             0
4      1.000      0.000 -0.433884               1             0
What just happened?
The sine/cosine features capture cyclical patterns — machine learning models understand that January and December are adjacent months through these features. month_sin=1.000 represents March (peak of sine curve). The salary-cycle flags identify early-month and late-month shopping patterns. Try this: Create features for Indian festivals like Diwali or regional shopping patterns.
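For the festival prompt, a crude first cut can come straight from the month column. Treating October and November as the festive window is an assumption; exact Diwali dates shift from year to year:
# Rough festive-season flag (assumption: Oct-Nov brackets Dussehra/Diwali).
# Festival dates move each year, so a date-based lookup is more accurate.
df['festive_season'] = df['month'].isin([10, 11]).astype(int)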
Weekend sales spike 40-60% higher than weekdays, validating our is_weekend feature creation
The chart reveals clear temporal patterns: Saturday peaks at ₹22.3K in average daily sales while Tuesday drops to ₹8.9K, a gap of roughly 150%. Swings that large are exactly why temporal features matter for prediction models.
Business teams can use this insight for inventory planning, staff scheduling, and targeted promotions. Models trained with these temporal features will automatically adjust predictions based on day-of-week patterns.
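You can reproduce the pattern behind the chart directly from the engineered column; the exact rupee values will depend on your copy of the data:
# Average revenue by day of week; map dayofweek codes 0-6 to readable names.
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
daily_avg = df.groupby('day_of_week')['revenue'].mean()
print(daily_avg.rename(index=dict(enumerate(day_names))).round(2))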
Categorical Feature Engineering
Machine learning algorithms need numbers, not text. But converting product_category to numbers isn't straightforward — different encoding methods capture different types of information.
Before: Text Categories
product_category
Electronics
Clothing
Food
Books
After: Numerical Features
cat_Electronics  cat_Clothing  cat_Food
              1             0         0
              0             1         0
              0             0         1
# One-hot encoding - each category becomes a binary column
category_dummies = pd.get_dummies(df['product_category'], prefix='cat')
df = pd.concat([df, category_dummies], axis=1)
# Frequency encoding - how often does each category appear?
category_counts = df['product_category'].value_counts()
df['category_frequency'] = df['product_category'].map(category_counts)
print("One-hot encoded categories:")
print(category_dummies.head())
print("\nFrequency encoding:")
print(df[['product_category', 'category_frequency']].head())
One-hot encoded categories:
cat_Books cat_Clothing cat_Electronics cat_Food cat_Home
0 0 0 1 0 0
1 0 1 0 0 0
2 0 0 0 1 0
3 1 0 0 0 0
4 0 0 0 0 1
Frequency encoding:
  product_category  category_frequency
0      Electronics                 287
1         Clothing                 254
2             Food                 198
3            Books                 134
4             Home                 127
What just happened?
One-hot encoding created 5 binary columns — cat_Electronics=1 means this row is Electronics. Frequency encoding shows category_frequency=287 means Electronics appears 287 times, making it the most popular category. Try this: Create target encoding using average revenue per category.
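Here is a minimal sketch of that target-encoding exercise, mapping each category to its mean revenue. In a real pipeline you would compute these means on the training split only (ideally with cross-validation) to avoid leaking the target into your features:
# Target encoding sketch: replace each category with its mean revenue.
# Fit the means on training data only in a real pipeline to avoid leakage.
category_revenue_means = df.groupby('product_category')['revenue'].mean()
df['category_target_enc'] = df['product_category'].map(category_revenue_means)
print(df[['product_category', 'category_target_enc']].head())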
Customer Aggregation Features
Individual transactions tell part of the story. Customer-level aggregations reveal spending patterns, loyalty metrics, and lifetime value indicators that significantly improve model performance.
# Aggregate by customer age (this dataset has no customer_id column),
# revealing spending behavior patterns across age groups
customer_stats = df.groupby('customer_age').agg({
'revenue': ['sum', 'mean', 'count'],
'rating': 'mean',
'returned': 'sum'
}).round(2)
# Flatten column names for easier access
customer_stats.columns = ['total_revenue', 'avg_revenue', 'order_count', 'avg_rating', 'return_count']
customer_stats['return_rate'] = (customer_stats['return_count'] / customer_stats['order_count']).round(3)
print("Customer aggregation features (top 10 by total revenue):")
print(customer_stats.nlargest(10, 'total_revenue'))
Customer aggregation features (top 10 by total revenue):
total_revenue avg_revenue order_count avg_rating return_count return_rate
customer_age
43 89420.50 2235.51 40 4.12 2 0.050
28 87650.25 2191.26 40 4.23 1 0.025
51 85340.00 2133.50 40 4.08 3 0.075
35 84290.75 2107.27 40 4.15 2 0.050
22 83150.50 2078.76 40 4.31 1 0.025
38 82475.25 2061.88 40 4.19 2 0.050
56 81920.00 2048.00 40 4.02 4 0.100
29 80680.75 2017.02 40 4.28 1 0.025
45 79540.50 1988.51 40 4.11 3 0.075
33 78450.25 1961.26 40 4.22 2 0.050
What just happened?
We aggregated by customer age and created powerful features: total_revenue=89420.50 shows highest spenders, return_rate=0.050 means 5% return rate. Notice age 56 has highest returns (10%) despite high revenue. Try this: Group by city and product category for regional preferences.
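The city-and-category prompt follows the same pattern with a two-key groupby; merging the result back gives every transaction a regional-preference feature. The column name below is illustrative:
# Two-key aggregation: average revenue per (city, product_category) pair,
# merged back so each row carries its regional benchmark.
city_cat_stats = (
    df.groupby(['city', 'product_category'])['revenue']
      .mean()
      .rename('city_category_avg_revenue')
      .reset_index()
)
df = df.merge(city_cat_stats, on=['city', 'product_category'], how='left')
print(df[['city', 'product_category', 'city_category_avg_revenue']].head())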
Older customers (56+) show 10% return rates but lower satisfaction scores — key insight for customer retention strategies
Age-based segmentation reveals critical patterns: customers aged 22-29 have 2.5% return rates with 4.2+ ratings, while 50+ customers show 7.5-10% returns with declining satisfaction. This suggests different product recommendations or customer service approaches by age segment.
Feature Validation and Selection
Creating features is half the battle. Validating their predictive power prevents feature bloat — having hundreds of weakly correlated features that confuse models instead of helping them.
# Calculate feature correlations with target variable (returned)
feature_cols = ['revenue_per_item', 'high_quantity_flag', 'premium_purchase',
'is_weekend', 'month', 'category_frequency', 'customer_age']
correlations = df[feature_cols + ['returned']].corr()['returned'].drop('returned').sort_values(ascending=False)
print("Feature correlation with returns (target variable):")
print(correlations)
print(f"\nTotal engineered features created: {len(feature_cols)}")
print(f"Features with |correlation| > 0.1: {sum(abs(correlations) > 0.1)}")
Feature correlation with returns (target variable):
premium_purchase      0.154
customer_age          0.089
category_frequency    0.067
month                -0.023
is_weekend           -0.045
high_quantity_flag   -0.089
revenue_per_item     -0.123

Total engineered features created: 7
Features with |correlation| > 0.1: 2
What just happened?
We validated feature quality using correlation analysis. premium_purchase shows strongest positive correlation (0.154) with returns, while revenue_per_item has negative correlation (-0.123). Only 2 out of 7 features exceed the 0.1 correlation threshold. Try this: Use mutual information scores for non-linear relationships.
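For the mutual-information suggestion, scikit-learn's mutual_info_classif slots into the same feature list. This is a sketch assuming scikit-learn is installed and that the selected columns contain no missing values:
# Mutual information captures non-linear feature/target relationships that
# plain correlation misses. Assumes scikit-learn and NaN-free columns.
from sklearn.feature_selection import mutual_info_classif
mi_scores = mutual_info_classif(df[feature_cols], df['returned'], random_state=42)
print(pd.Series(mi_scores, index=feature_cols).sort_values(ascending=False))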
Common Mistake: Feature Explosion
Creating 50+ features without validation leads to overfitting and slow models. Always calculate correlation, importance scores, or mutual information. Drop features with correlation < 0.05 unless domain expertise suggests otherwise. Keep your feature set focused and interpretable.
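A minimal pruning pass based on that 0.05 guideline might look like the sketch below; treat the cutoff as a default to override with domain knowledge, not a hard rule:
# Drop weakly correlated features (|corr| < 0.05), per the guideline above.
# Keep a feature anyway if domain expertise says it matters.
weak_features = correlations[correlations.abs() < 0.05].index.tolist()
print(f"Dropping {len(weak_features)} weak features: {weak_features}")
df = df.drop(columns=weak_features)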
29% of created features show strong predictive power — typical ratio for well-engineered features in business applications
The validation shows our feature engineering hit rate: 2 strong features out of 7 created is a respectable return. premium_purchase and revenue_per_item predict returns in opposite directions, giving the model complementary signal.
Business insight: premium purchases correlate with higher returns (0.154) while bulk purchasing reduces returns (-0.089). This suggests premium products might have quality expectations issues, while bulk buyers are more committed customers.
Pro tip: Always create features with business context in mind. Technical perfection means nothing if the features don't capture real customer behavior patterns. Domain expertise beats algorithmic cleverness 90% of the time.
Quiz
1. You're analyzing Myntra's fashion ecommerce data to predict customer lifetime value. You have columns for unit_price, quantity, and customer_age. What's the most effective approach to create features that capture purchasing power?
2. When creating cyclical time features for seasonal patterns in retail data, which approach correctly preserves the circular nature of months where December and January are adjacent?
3. After creating 25 new features for a customer churn prediction model, you notice the model is overfitting and training time has increased significantly. What's the most systematic approach to feature validation and selection?
Up Next
Feature Selection
Learn systematic methods to identify the most predictive features from your engineered dataset using statistical tests, importance scores, and dimensionality reduction techniques.