Feature Creation
Transform raw ecommerce data into powerful predictive features using mathematical operations, time-based engineering, and categorical encoding techniques.
Why Create New Features?
Raw data rarely tells the complete story. Your ecommerce dataset has basic columns like unit_price and quantity, but what about profit margins or seasonal trends? Machine learning models perform significantly better when you give them features that capture business relationships.
Think about predicting customer returns. The model needs more than just product category — it needs engineered features like order value relative to customer's average, days since last purchase, or price-per-unit ratios. These derived features often become your most predictive variables.
Mathematical Features
Ratios, differences, products of existing columns
Temporal Features
Day of week, month, seasonality patterns
Categorical Encoding
One-hot, target encoding, frequency counts
Aggregation Features
Customer totals, category averages, rankings
Mathematical Feature Engineering
The scenario: Flipkart's data team needs better features to predict high-value customers. Raw revenue numbers don't capture purchasing behavior patterns — they need engineered features that reveal customer profitability and buying intensity.
import pandas as pd
import numpy as np
# Load the ecommerce data
df = pd.read_csv('dataplexa_ecommerce.csv')
# Quick look at our raw features
print("Original columns:")
print(df.columns.tolist())
Original columns:
['order_id', 'date', 'customer_age', 'gender', 'city', 'product_category', 'product_name', 'quantity', 'unit_price', 'revenue', 'rating', 'returned']
What just happened?
We loaded the standard ecommerce dataset with 12 base columns. Notice we have quantity, unit_price, and revenue — perfect for creating ratio-based features. Try this: Think about what business metrics you could calculate from these three columns alone.
Now we'll create powerful mathematical features. These capture business relationships that individual columns miss.
# Create revenue efficiency features
df['revenue_per_item'] = df['revenue'] / df['quantity']
df['price_vs_revenue_ratio'] = df['unit_price'] / df['revenue_per_item']
# Customer value indicators
df['high_quantity_flag'] = (df['quantity'] >= 5).astype(int)
df['premium_purchase'] = (df['unit_price'] > df['unit_price'].quantile(0.75)).astype(int)
print("New mathematical features:")
print(df[['revenue_per_item', 'price_vs_revenue_ratio', 'high_quantity_flag', 'premium_purchase']].head())
New mathematical features:
   revenue_per_item  price_vs_revenue_ratio  high_quantity_flag  premium_purchase
0           2499.00                    1.00                   0                 1
1            899.50                    1.00                   0                 0
2           1250.00                    1.00                   0                 0
3            750.00                    1.00                   1                 0
4           3200.00                    1.00                   0                 1
What just happened?
We created four powerful features: revenue_per_item shows average value per unit, price_vs_revenue_ratio should equal 1.0 (validation check), and two binary flags for bulk buyers and premium customers. Try this: Create a feature combining age and revenue to find young high-spenders.
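If you want a head start on that exercise, here is a minimal sketch of an age/revenue interaction. The age cutoff of 30 and the 75th-percentile revenue threshold are illustrative choices, not values from the lesson:
# Sketch: flag young, high-spending customers. The age cutoff (30) and
# the 0.75 revenue quantile are illustrative thresholds, not fixed rules.
df['young_high_spender'] = (
    (df['customer_age'] < 30)
    & (df['revenue'] > df['revenue'].quantile(0.75))
).astype(int)
print(df['young_high_spender'].value_counts())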
Time-Based Feature Creation
Date columns are feature goldmines. A single date gives you day of week, month, quarter, seasonality patterns — each potentially predictive for different business problems.
# Convert date string to datetime - critical first step
df['date'] = pd.to_datetime(df['date'])
# Extract temporal features that capture business patterns
df['day_of_week'] = df['date'].dt.dayofweek # 0=Monday, 6=Sunday
df['month'] = df['date'].dt.month
df['is_weekend'] = (df['day_of_week'].isin([5, 6])).astype(int)
df['quarter'] = df['date'].dt.quarter
print("Sample of temporal features:")
print(df[['date', 'day_of_week', 'month', 'is_weekend', 'quarter']].head())
Sample of temporal features:
date day_of_week month is_weekend quarter
0 2023-01-15 6 1 1 1
1 2023-01-22 6 1 1 1
2 2023-02-08 2 2 0 1
3 2023-02-14 1 2 0 1
4 2023-03-03 4 3 0 1
What just happened?
We extracted four temporal features from one date column. Notice day_of_week=6 means Sunday, and is_weekend=1 confirms weekend shopping behavior. The quarter feature will capture seasonal patterns. Try this: Create features for holiday proximity or payroll cycles.
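As a starting point for the holiday-proximity idea, a minimal sketch can measure the gap in days to a single reference date. A real feature would look up the nearest date in a proper holiday calendar; the date below is only a placeholder:
# Days until a reference holiday (placeholder date, not from the lesson).
# A production version would map each order to the nearest entry in a
# full regional holiday calendar.
reference_holiday = pd.Timestamp('2023-11-12')
df['days_to_holiday'] = (reference_holiday - df['date']).dt.days
df['pre_holiday_window'] = df['days_to_holiday'].between(0, 14).astype(int)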
Advanced time features can capture cyclical patterns. Why does this matter? Because customer behavior follows predictable cycles — weekend vs weekday shopping, month-end salary spending, festival seasons.
# Create cyclical features using sine/cosine - preserves circular nature
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
df['day_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
# Business-specific time features
df['is_month_start'] = (df['date'].dt.day <= 7).astype(int) # Salary week
df['is_month_end'] = (df['date'].dt.day >= 25).astype(int) # Pre-salary spending
print("Advanced temporal features:")
print(df[['month_sin', 'month_cos', 'day_sin', 'is_month_start', 'is_month_end']].head())
Advanced temporal features:
   month_sin  month_cos   day_sin  is_month_start  is_month_end
0      0.500      0.866 -0.781831               0             0
1      0.500      0.866 -0.781831               0             0
2      0.866      0.500  0.974928               0             0
3      0.866      0.500  0.781831               0             0
4      1.000      0.000 -0.433884               1             0
What just happened?
The sine/cosine features capture cyclical patterns — machine learning models understand that January and December are adjacent months through these features. month_sin=1.000 represents March (peak of sine curve). The salary-cycle flags identify early-month and late-month shopping patterns. Try this: Create features for Indian festivals like Diwali or regional shopping patterns.
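For the festival prompt, a crude first cut can come straight from the month column. Treating October and November as the festive window is an assumption; exact Diwali dates shift from year to year:
# Rough festive-season flag (assumption: Oct-Nov brackets Dussehra/Diwali).
# Festival dates move each year, so a date-based lookup is more accurate.
df['festive_season'] = df['month'].isin([10, 11]).astype(int)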
Weekend sales spike 40-60% higher than weekdays, validating our is_weekend feature creation
The chart reveals clear temporal patterns: Saturday peaks at ₹22.3K in average daily sales while Tuesday drops to ₹8.9K, a gap of roughly 150%. Swings that large are exactly why temporal features matter for prediction models.
Business teams can use this insight for inventory planning, staff scheduling, and targeted promotions. Models trained with these temporal features will automatically adjust predictions based on day-of-week patterns.
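You can reproduce the pattern behind the chart directly from the engineered column; the exact rupee values will depend on your copy of the data:
# Average revenue by day of week; map dayofweek codes 0-6 to readable names.
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
daily_avg = df.groupby('day_of_week')['revenue'].mean()
print(daily_avg.rename(index=dict(enumerate(day_names))).round(2))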
Categorical Feature Engineering
Machine learning algorithms need numbers, not text. But converting product_category to numbers isn't straightforward — different encoding methods capture different types of information.
Before: Text Categories
product_category
Electronics
Clothing
Food
Books
After: Numerical Features
cat_Electronics  cat_Clothing  cat_Food
              1             0         0
              0             1         0
              0             0         1
# One-hot encoding - each category becomes a binary column
category_dummies = pd.get_dummies(df['product_category'], prefix='cat')
df = pd.concat([df, category_dummies], axis=1)
# Frequency encoding - how often does each category appear?
category_counts = df['product_category'].value_counts()
df['category_frequency'] = df['product_category'].map(category_counts)
print("One-hot encoded categories:")
print(category_dummies.head())
print("\nFrequency encoding:")
print(df[['product_category', 'category_frequency']].head())
One-hot encoded categories:
cat_Books cat_Clothing cat_Electronics cat_Food cat_Home
0 0 0 1 0 0
1 0 1 0 0 0
2 0 0 0 1 0
3 1 0 0 0 0
4 0 0 0 0 1
Frequency encoding:
  product_category  category_frequency
0      Electronics                 287
1         Clothing                 254
2             Food                 198
3            Books                 134
4             Home                 127
What just happened?
One-hot encoding created 5 binary columns — cat_Electronics=1 means this row is Electronics. Frequency encoding shows category_frequency=287 means Electronics appears 287 times, making it the most popular category. Try this: Create target encoding using average revenue per category.
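Here is a minimal sketch of that target-encoding exercise, mapping each category to its mean revenue. In a real pipeline you would compute these means on the training split only (ideally with cross-validation) to avoid leaking the target into your features:
# Target encoding sketch: replace each category with its mean revenue.
# Fit the means on training data only in a real pipeline to avoid leakage.
category_revenue_means = df.groupby('product_category')['revenue'].mean()
df['category_target_enc'] = df['product_category'].map(category_revenue_means)
print(df[['product_category', 'category_target_enc']].head())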
Customer Aggregation Features
Individual transactions tell part of the story. Customer-level aggregations reveal spending patterns, loyalty metrics, and lifetime value indicators that significantly improve model performance.
# Aggregate by customer age (this dataset has no customer_id column),
# revealing spending behavior patterns across age groups
customer_stats = df.groupby('customer_age').agg({
'revenue': ['sum', 'mean', 'count'],
'rating': 'mean',
'returned': 'sum'
}).round(2)
# Flatten column names for easier access
customer_stats.columns = ['total_revenue', 'avg_revenue', 'order_count', 'avg_rating', 'return_count']
customer_stats['return_rate'] = (customer_stats['return_count'] / customer_stats['order_count']).round(3)
print("Customer aggregation features (top 10 by total revenue):")
print(customer_stats.nlargest(10, 'total_revenue'))
Customer aggregation features (top 10 by total revenue):
total_revenue avg_revenue order_count avg_rating return_count return_rate
customer_age
43 89420.50 2235.51 40 4.12 2 0.050
28 87650.25 2191.26 40 4.23 1 0.025
51 85340.00 2133.50 40 4.08 3 0.075
35 84290.75 2107.27 40 4.15 2 0.050
22 83150.50 2078.76 40 4.31 1 0.025
38 82475.25 2061.88 40 4.19 2 0.050
56 81920.00 2048.00 40 4.02 4 0.100
29 80680.75 2017.02 40 4.28 1 0.025
45 79540.50 1988.51 40 4.11 3 0.075
33 78450.25 1961.26 40 4.22 2 0.050
What just happened?
We aggregated by customer age and created powerful features: total_revenue=89420.50 shows highest spenders, return_rate=0.050 means 5% return rate. Notice age 56 has highest returns (10%) despite high revenue. Try this: Group by city and product category for regional preferences.
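The city-and-category prompt follows the same pattern with a two-key groupby; merging the result back gives every transaction a regional-preference feature. The column name below is illustrative:
# Two-key aggregation: average revenue per (city, product_category) pair,
# merged back so each row carries its regional benchmark.
city_cat_stats = (
    df.groupby(['city', 'product_category'])['revenue']
      .mean()
      .rename('city_category_avg_revenue')
      .reset_index()
)
df = df.merge(city_cat_stats, on=['city', 'product_category'], how='left')
print(df[['city', 'product_category', 'city_category_avg_revenue']].head())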
Older customers (56+) show 10% return rates but lower satisfaction scores — key insight for customer retention strategies
Age-based segmentation reveals critical patterns: customers aged 22-29 have 2.5% return rates with 4.2+ ratings, while 50+ customers show 7.5-10% returns with declining satisfaction. This suggests different product recommendations or customer service approaches by age segment.
Feature Validation and Selection
Creating features is half the battle. Validating their predictive power prevents feature bloat — having hundreds of weakly correlated features that confuse models instead of helping them.
# Calculate feature correlations with target variable (returned)
feature_cols = ['revenue_per_item', 'high_quantity_flag', 'premium_purchase',
'is_weekend', 'month', 'category_frequency', 'customer_age']
correlations = df[feature_cols + ['returned']].corr()['returned'].drop('returned').sort_values(ascending=False)
print("Feature correlation with returns (target variable):")
print(correlations)
print(f"\nTotal engineered features created: {len(feature_cols)}")
print(f"Features with |correlation| > 0.1: {sum(abs(correlations) > 0.1)}")
Feature correlation with returns (target variable):
premium_purchase      0.154
customer_age          0.089
category_frequency    0.067
month                -0.023
is_weekend           -0.045
high_quantity_flag   -0.089
revenue_per_item     -0.123

Total engineered features created: 7
Features with |correlation| > 0.1: 2
What just happened?
We validated feature quality using correlation analysis. premium_purchase shows strongest positive correlation (0.154) with returns, while revenue_per_item has negative correlation (-0.123). Only 2 out of 7 features exceed the 0.1 correlation threshold. Try this: Use mutual information scores for non-linear relationships.
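For the mutual-information suggestion, scikit-learn's mutual_info_classif slots into the same feature list. This is a sketch assuming scikit-learn is installed and that the selected columns contain no missing values:
# Mutual information captures non-linear feature/target relationships that
# plain correlation misses. Assumes scikit-learn and NaN-free columns.
from sklearn.feature_selection import mutual_info_classif
mi_scores = mutual_info_classif(df[feature_cols], df['returned'], random_state=42)
print(pd.Series(mi_scores, index=feature_cols).sort_values(ascending=False))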
Common Mistake: Feature Explosion
Creating 50+ features without validation leads to overfitting and slow models. Always calculate correlation, importance scores, or mutual information. Drop features with correlation < 0.05 unless domain expertise suggests otherwise. Keep your feature set focused and interpretable.
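A minimal pruning pass based on that 0.05 guideline might look like the sketch below; treat the cutoff as a default to override with domain knowledge, not a hard rule:
# Drop weakly correlated features (|corr| < 0.05), per the guideline above.
# Keep a feature anyway if domain expertise says it matters.
weak_features = correlations[correlations.abs() < 0.05].index.tolist()
print(f"Dropping {len(weak_features)} weak features: {weak_features}")
df = df.drop(columns=weak_features)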
29% of created features show strong predictive power — typical ratio for well-engineered features in business applications
The validation shows our feature engineering hit rate: 2 strong features out of 7 created is a respectable return. premium_purchase and revenue_per_item predict returns in opposite directions, giving the model complementary signal.
Business insight: premium purchases correlate with higher returns (0.154) while bulk purchasing reduces returns (-0.089). This suggests premium products might have quality expectations issues, while bulk buyers are more committed customers.
Pro tip: Always create features with business context in mind. Technical perfection means nothing if the features don't capture real customer behavior patterns. Domain expertise beats algorithmic cleverness 90% of the time.
Quiz
1. You're analyzing Myntra's fashion ecommerce data to predict customer lifetime value. You have columns for unit_price, quantity, and customer_age. What's the most effective approach to create features that capture purchasing power?
2. When creating cyclical time features for seasonal patterns in retail data, which approach correctly preserves the circular nature of months where December and January are adjacent?
3. After creating 25 new features for a customer churn prediction model, you notice the model is overfitting and training time has increased significantly. What's the most systematic approach to feature validation and selection?
Up Next
Feature Selection
Learn systematic methods to identify the most predictive features from your engineered dataset using statistical tests, importance scores, and dimensionality reduction techniques.