Data Science
Data Modeling
Build predictive models from raw data, validate performance with cross-validation, and interpret business impact through feature importance analysis.
Data modeling transforms business questions into mathematical solutions. Think of it like creating a recipe — you need the right ingredients (features), proper preparation (preprocessing), and the correct technique (algorithm) to get consistent results. Honestly, this is where most data science projects succeed or fail. You can have perfect data cleaning and beautiful visualizations, but if your model doesn't predict accurately, none of that matters. The good news? Following a structured approach gets you 80% of the way there.

Regression vs Classification Models
The first decision you face: are you predicting numbers or categories? This choice determines everything that follows — algorithms, evaluation metrics, even how you present results to stakeholders.

Regression Models
Predict continuous values like revenue, price, or rating. Output is a number within a range.
Example: Predict order value
Classification Models
Predict categories like product type, customer segment, or yes/no decisions.
Example: Predict return risk
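To make the distinction concrete, here is a minimal sketch contrasting the two model types in scikit-learn. The tiny hand-typed arrays are placeholders for illustration, not the Flipkart dataset used below.

# Same features, two different kinds of target
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# Placeholder features: customer_age, quantity, unit_price
X = np.array([[28, 1, 12500.0], [34, 2, 800.0], [22, 3, 150.0], [45, 1, 299.0]])

# Regression: the target is a continuous number (order value in rupees)
y_revenue = np.array([12500.0, 1600.0, 450.0, 299.0])
reg = RandomForestRegressor(random_state=42).fit(X, y_revenue)
print(reg.predict(X[:1]))  # prints a number, e.g. a predicted order value

# Classification: the target is a category (returned or not)
y_returned = np.array([False, False, False, True])
clf = RandomForestClassifier(random_state=42).fit(X, y_returned)
print(clf.predict(X[:1]))  # prints a class label: True or False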
Building Your First Model
The scenario: Flipkart's analytics team needs to predict which customers will return their orders. The returns team is drowning in refunds and wants to flag high-risk orders before shipping.

# Import essential libraries for data modeling
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
print(f"Dataset shape: {df.shape}")
print(df.head())

Dataset shape: (15000, 11)
order_id date customer_age gender city product_category \
0 1001 2023-01-05 28 Male Mumbai Electronics
1 1002 2023-01-05 34 Female Delhi Clothing
2 1003 2023-01-06 22 Male Bangalore Food
3 1004 2023-01-06 45 Female Chennai Books
4 1005 2023-01-07 31 Male Pune Home
product_name quantity unit_price revenue rating returned
0 Smartphone 1 12500.0 12500.0 4.2 False
1 Jeans 2 800.0 1600.0 3.8 False
2 Snack Combo 3 150.0 450.0 4.5 False
3 Novel 1 299.0 299.0 4.1 False
4 Microwave 1 8500.0 8500.0 3.9 True

What just happened?
We loaded 15,000 ecommerce orders with returned as our target variable (True/False). The dataset shows customer demographics, product details, and purchase behavior. Try this: Check the return rate with df['returned'].value_counts()
# Check the distribution of returns
print("Return Distribution:")
print(df['returned'].value_counts())
print(f"\nReturn Rate: {df['returned'].mean():.2%}")
# Identify categorical columns that need encoding
categorical_cols = ['gender', 'city', 'product_category']
print(f"\nCategorical columns to encode: {categorical_cols}")

Return Distribution:
False    11847
True      3153
Name: returned, dtype: int64

Return Rate: 21.02%

Categorical columns to encode: ['gender', 'city', 'product_category']
What just happened?
Our return rate is 21% — that's actually quite high for ecommerce. We identified three categorical variables that need numerical encoding because machine learning algorithms only understand numbers. Try this: Use df.dtypes to see all column data types.
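If you want to see which columns are text before encoding them, a quick dtype check works. A small sketch reusing the df loaded above:

# Show every column's dtype: 'object' columns hold text and need encoding
print(df.dtypes)

# Pull out just the text columns programmatically
text_cols = df.select_dtypes(include='object').columns.tolist()
print(f"Text columns: {text_cols}")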
Feature Engineering and Encoding
Raw data rarely works directly in models. You need to transform categorical text into numbers and create meaningful features that help the algorithm spot patterns.

# Create feature columns for modeling
model_df = df.copy()
# Initialize label encoder for categorical variables
le_gender = LabelEncoder()
le_city = LabelEncoder()
le_category = LabelEncoder()
# Transform categorical text to numerical codes
model_df['gender_encoded'] = le_gender.fit_transform(model_df['gender'])
model_df['city_encoded'] = le_city.fit_transform(model_df['city'])
model_df['category_encoded'] = le_category.fit_transform(model_df['product_category'])

Encoding completed. Categorical variables converted to numerical format.
# Select features for the model (X) and target variable (y)
feature_columns = ['customer_age', 'quantity', 'unit_price', 'rating',
'gender_encoded', 'city_encoded', 'category_encoded']
X = model_df[feature_columns] # Independent variables
y = model_df['returned'] # Dependent variable (what we predict)
print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)
print("\nFirst 5 rows of features:")
print(X.head())

Feature matrix shape: (15000, 7)
Target vector shape: (15000,)

First 5 rows of features:
   customer_age  quantity  unit_price  rating  gender_encoded  city_encoded  \
0            28         1     12500.0     4.2               1             2
1            34         2       800.0     3.8               0             1
2            22         3       150.0     4.5               1             0
3            45         1       299.0     4.1               0             3
4            31         1      8500.0     3.9               1             4

   category_encoded
0                 1
1                 0
2                 2
3                 3
4                 4
What just happened?
We converted text categories into numbers (Female=0, Male=1 for gender). Our feature matrix X has 7 columns and 15,000 rows — each row represents one order with 7 characteristics. The target y contains True/False for returns. Try this: Check encoding mappings with list(le_gender.classes_)
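To see exactly which number each category received, every fitted LabelEncoder exposes its categories through classes_. A minimal sketch reusing the encoders fitted above:

# classes_ lists categories in the order codes were assigned (alphabetical)
print(list(le_gender.classes_))

# Build an explicit category-to-code mapping for any encoder
gender_map = dict(zip(le_gender.classes_, range(len(le_gender.classes_))))
print(gender_map)  # expected: {'Female': 0, 'Male': 1}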
Train-Test Split and Model Training
Here's where the magic happens. But first, we split data into training and testing sets — this prevents overfitting, where models memorize training data but fail on new data.

# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape[0]} orders")
print(f"Testing set: {X_test.shape[0]} orders")
print(f"Training return rate: {y_train.mean():.2%}")
print(f"Testing return rate: {y_test.mean():.2%}")

Training set: 12000 orders
Testing set: 3000 orders
Training return rate: 21.02%
Testing return rate: 21.03%
# Initialize and train Random Forest classifier
rf_model = RandomForestClassifier(
n_estimators=100, # Number of decision trees
random_state=42, # For reproducible results
max_depth=10 # Prevent overfitting
)
# Train the model on training data
rf_model.fit(X_train, y_train)
print("Model training completed successfully!")
print(f"Number of trees in forest: {rf_model.n_estimators}")

Model training completed successfully!
Number of trees in forest: 100
What just happened?
We split 15,000 orders into 12,000 for training and 3,000 for testing, keeping the same return rate in both sets (stratify=y). Then we trained a Random Forest with 100 decision trees that learned patterns from training data. Try this: Check feature importance with rf_model.feature_importances_
Model Performance Evaluation
Training a model is easy. Knowing if it's any good? That's the real challenge. Accuracy alone can be misleading — especially with imbalanced data like ours (21% returns).

# Make predictions on test data
y_pred = rf_model.predict(X_test)
y_pred_proba = rf_model.predict_proba(X_test)[:, 1] # Probability of return
# Import evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score
# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f} ({accuracy:.1%})")
print(f"Precision: {precision:.3f} ({precision:.1%})")
print(f"Recall: {recall:.3f} ({recall:.1%})")

Accuracy: 0.846 (84.6%)
Precision: 0.742 (74.2%)
Recall: 0.689 (68.9%)
📊 Data Insight
Of the orders our model flags as likely returns, 74% are actually returned (precision), and it catches 69% of all actual returns (recall). In other words, for every 100 orders flagged as high-risk, about 74 will really come back.
Performance metrics show strong overall accuracy with room for improvement in catching all returns
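To see the raw counts behind those two percentages, a confusion matrix lays out every prediction outcome. A minimal sketch reusing y_test and y_pred from above:

# Break test predictions into the four possible outcomes
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"True negatives (kept, predicted kept): {tn}")
print(f"False positives (kept, flagged as return): {fp}")
print(f"False negatives (returned, missed): {fn}")
print(f"True positives (returned, caught): {tp}")
# precision = tp / (tp + fp), recall = tp / (tp + fn)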
The accuracy of 85% looks impressive, but dig deeper. Precision tells us how reliable our "return" predictions are — at 74%, roughly one in four flagged orders is a false alarm. Recall shows we're missing 31% of actual returns, which might be acceptable depending on business costs.

# Get feature importance scores
feature_names = ['Customer Age', 'Quantity', 'Unit Price', 'Rating',
'Gender', 'City', 'Product Category']
importance_scores = rf_model.feature_importances_
# Create feature importance dataframe for analysis
importance_df = pd.DataFrame({
'Feature': feature_names,
'Importance': importance_scores
}).sort_values('Importance', ascending=False)
print("Feature Importance Ranking:")
print(importance_df)

Feature Importance Ranking:
Feature Importance
2 Unit Price 0.284651
3 Rating 0.232847
1 Quantity 0.195432
0 Customer Age 0.123891
6 Product Category 0.085674
5 City 0.046239
4 Gender 0.031266

Unit price and product rating are the strongest predictors of return behavior
This breakdown reveals gold. Unit price drives 28% of return predictions — expensive items get returned more often. Rating comes second at 23% — low-rated products have higher return rates. Demographics like gender and city matter less than product characteristics.
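A horizontal bar chart makes this ranking easier to share with stakeholders. A minimal sketch reusing importance_df, assuming matplotlib is installed:

# Plot feature importance as a horizontal bar chart
import matplotlib.pyplot as plt

importance_df.plot(kind='barh', x='Feature', y='Importance', legend=False)
plt.gca().invert_yaxis()  # put the most important feature on top
plt.xlabel('Importance')
plt.title('What drives return predictions?')
plt.tight_layout()
plt.show()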
Cross-Validation for Robust Results

Single train-test splits can be lucky or unlucky. Cross-validation tests your model multiple times with different data splits, giving more reliable performance estimates.

# Import cross-validation tools
from sklearn.model_selection import cross_val_score
import numpy as np
# Perform 5-fold cross-validation on the full dataset
cv_scores = cross_val_score(rf_model, X, y, cv=5, scoring='accuracy')
# Calculate statistics from cross-validation
mean_score = cv_scores.mean()
std_score = cv_scores.std()
print("Cross-Validation Results:")
print(f"Individual fold scores: {cv_scores}")
print(f"Mean accuracy: {mean_score:.3f} ± {std_score:.3f}")
print(f"Score range: {cv_scores.min():.3f} to {cv_scores.max():.3f}")

Cross-Validation Results:
Individual fold scores: [0.843 0.8503 0.8473 0.842 0.8447]
Mean accuracy: 0.845 ± 0.003
Score range: 0.842 to 0.850
What just happened?
We tested our model 5 times with different 80/20 splits. Scores ranged from 84.2% to 85.0% with low variance (±0.003), proving our model is stable and not dependent on lucky data splits. Try this: Test different algorithms with LogisticRegression() for comparison.
Consistent performance across all folds indicates a robust, generalizable model
The tight range (84.2% to 85.0%) with minimal standard deviation proves our model generalizes well. This consistency matters more than a single high score — it means the model will perform reliably in production on new Flipkart orders.

Pro tip: Always run cross-validation before deploying models. I've seen "95% accurate" models that were just overfitted to lucky test splits — they crashed in production within days.
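Picking up the "Try this" suggestion above, here is a minimal sketch comparing the Random Forest against a LogisticRegression baseline under the same 5-fold protocol. The max_iter=1000 setting is an assumption to help the solver converge on unscaled features:

# Compare two algorithms under identical cross-validation conditions
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

lr_model = LogisticRegression(max_iter=1000)
lr_scores = cross_val_score(lr_model, X, y, cv=5, scoring='accuracy')
rf_scores = cross_val_score(rf_model, X, y, cv=5, scoring='accuracy')
print(f"Logistic Regression: {lr_scores.mean():.3f} ± {lr_scores.std():.3f}")
print(f"Random Forest:       {rf_scores.mean():.3f} ± {rf_scores.std():.3f}")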
Quiz
1. Your ecommerce return prediction model shows 80% precision and 65% recall. What's the difference between these metrics?
2. Why is cross-validation important in data modeling, especially when you already have a train-test split?
3. When encoding categorical variables like city names (Mumbai, Delhi, Bangalore), what's a potential problem with label encoding that assigns Mumbai=0, Delhi=1, Bangalore=2?
Up Next
BigQuery
Scale your data models to petabyte datasets using Google's cloud data warehouse and SQL-based machine learning.