Data Science Lesson 39 – Data Modeling | Dataplexa

Data Modeling

Build predictive models from raw data, validate performance with cross-validation, and interpret business impact through feature importance analysis.

Data modeling transforms business questions into mathematical solutions. Think of it like creating a recipe — you need the right ingredients (features), proper preparation (preprocessing), and the correct technique (algorithm) to get consistent results. Honestly, this is where most data science projects succeed or fail. You can have perfect data cleaning and beautiful visualizations, but if your model doesn't predict accurately, none of that matters. The good news? Following a structured approach gets you 80% of the way there.
1. Problem Definition
2. Data Preparation
3. Model Training
4. Validation & Deployment

Regression vs Classification Models

The first decision you face: are you predicting numbers or categories? This choice determines everything that follows — algorithms, evaluation metrics, even how you present results to stakeholders.

Regression Models

Predict continuous values like revenue, price, or rating. Output is a number within a range.

Example: Predict order value

Classification Models

Predict categories like product type, customer segment, or yes/no decisions.

Example: Predict return risk

Why does this matter? Because using the wrong model type kills performance instantly. I've seen analysts try to use classification algorithms on revenue prediction — the results were unusable.
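To make the split concrete, here's a minimal sketch on synthetic data — the features and targets below are invented for illustration, not from the Flipkart dataset. A regressor outputs a number; a classifier outputs a class label.

```python
# Sketch: the same features, two different prediction tasks (synthetic data)
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))

# Regression target: a continuous value (think order value in rupees)
y_reg = 500 + 100 * X[:, 0] + rng.normal(scale=20, size=200)
reg = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y_reg)
print(reg.predict(X[:1]))   # outputs a number within a range

# Classification target: a category (think returned yes/no)
y_clf = (X[:, 0] > 0).astype(int)
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y_clf)
print(clf.predict(X[:1]))   # outputs a class label, 0 or 1
```

Swap the two models and neither task works: a classifier can only emit the labels it has seen, and a regressor has no notion of discrete categories.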

Building Your First Model

The scenario: Flipkart's analytics team needs to predict which customers will return their orders. The returns team is drowning in refunds and wants to flag high-risk orders before shipping.
# Import essential libraries for data modeling
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
print(f"Dataset shape: {df.shape}")
print(df.head())

What just happened?

We loaded 15,000 ecommerce orders with returned as our target variable (True/False). The dataset shows customer demographics, product details, and purchase behavior. Try this: Check the return rate with df['returned'].value_counts().

# Check the distribution of returns
print("Return Distribution:")
print(df['returned'].value_counts())
print(f"\nReturn Rate: {df['returned'].mean():.2%}")

# Identify categorical columns that need encoding
categorical_cols = ['gender', 'city', 'product_category']
print(f"\nCategorical columns to encode: {categorical_cols}")

What just happened?

Our return rate is 21% — that's actually quite high for ecommerce. We identified three categorical variables that need numerical encoding because machine learning algorithms only understand numbers. Try this: Use df.dtypes to see all column data types.
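The dtypes check suggested above can also find categorical columns automatically. Here's a hedged sketch — the tiny DataFrame below is hypothetical stand-in data, not the real dataplexa_ecommerce.csv:

```python
# Sketch: inspecting dtypes and missing values before encoding (toy data)
import pandas as pd

df = pd.DataFrame({
    "customer_age": [34, 27, 45],
    "gender": ["Female", "Male", "Female"],
    "returned": [True, False, False],
})

print(df.dtypes)        # numeric vs object (text) columns
print(df.isna().sum())  # missing values per column

# Object-dtype columns are the ones that need numerical encoding
categorical_cols = df.select_dtypes(include="object").columns.tolist()
print(categorical_cols)
```

Building the list from select_dtypes instead of hardcoding it means new text columns won't silently slip past your encoding step.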

Feature Engineering and Encoding

Raw data rarely works directly in models. You need to transform categorical text into numbers and create meaningful features that help the algorithm spot patterns.
# Create feature columns for modeling
model_df = df.copy()

# Initialize label encoder for categorical variables
le_gender = LabelEncoder()
le_city = LabelEncoder()
le_category = LabelEncoder()

# Transform categorical text to numerical codes
model_df['gender_encoded'] = le_gender.fit_transform(model_df['gender'])
model_df['city_encoded'] = le_city.fit_transform(model_df['city'])
model_df['category_encoded'] = le_category.fit_transform(model_df['product_category'])
# Select features for the model (X) and target variable (y)
feature_columns = ['customer_age', 'quantity', 'unit_price', 'rating', 
                   'gender_encoded', 'city_encoded', 'category_encoded']

X = model_df[feature_columns]  # Independent variables
y = model_df['returned']       # Dependent variable (what we predict)

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)
print("\nFirst 5 rows of features:")
print(X.head())

What just happened?

We converted text categories into numbers (Female=0, Male=1 for gender). Our feature matrix X has 7 columns and 15,000 rows — each row represents one order with 7 characteristics. The target y contains True/False for returns. Try this: Check encoding mappings with list(le_gender.classes_).
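One caveat worth knowing: label encoding imposes an artificial order on categories (the quiz below returns to this). A common alternative sketch, shown here on hypothetical city values rather than the real dataset:

```python
# Sketch: one-hot encoding avoids implying an order between categories
import pandas as pd

sample = pd.DataFrame({"city": ["Mumbai", "Delhi", "Bangalore", "Mumbai"]})

# Label encoding would map these to 0/1/2, implying Bangalore < Delhi < Mumbai.
# One-hot encoding creates one binary column per city instead.
encoded = pd.get_dummies(sample, columns=["city"], prefix="city")
print(encoded.columns.tolist())
```

Tree-based models like Random Forest tolerate label encoding reasonably well, but linear models generally need one-hot encoding to avoid learning a spurious ordering.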

Train-Test Split and Model Training

Here's where the magic happens. But first, we split data into training and testing sets — this lets us detect overfitting, where a model memorizes training data but fails on new data.
# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} orders")
print(f"Testing set: {X_test.shape[0]} orders")
print(f"Training return rate: {y_train.mean():.2%}")
print(f"Testing return rate: {y_test.mean():.2%}")
# Initialize and train Random Forest classifier
rf_model = RandomForestClassifier(
    n_estimators=100,    # Number of decision trees
    random_state=42,     # For reproducible results
    max_depth=10         # Prevent overfitting
)

# Train the model on training data
rf_model.fit(X_train, y_train)
print("Model training completed successfully!")
print(f"Number of trees in forest: {rf_model.n_estimators}")

What just happened?

We split 15,000 orders into 12,000 for training and 3,000 for testing, keeping the same return rate in both sets (stratify=y). Then we trained a Random Forest with 100 decision trees that learned patterns from training data. Try this: Check feature importance with rf_model.feature_importances_.

Model Performance Evaluation

Training a model is easy. Knowing if it's any good? That's the real challenge. Accuracy alone can be misleading — especially with imbalanced data like ours (21% returns).
# Make predictions on test data
y_pred = rf_model.predict(X_test)
y_pred_proba = rf_model.predict_proba(X_test)[:, 1]  # Probability of return

# Import evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.3f} ({accuracy:.1%})")
print(f"Precision: {precision:.3f} ({precision:.1%})")
print(f"Recall: {recall:.3f} ({recall:.1%})")

📊 Data Insight

When our model flags an order as a likely return, it's right 74% of the time (precision), and it catches 69% of all actual returns (recall). In other words, for every 100 orders flagged as high-risk, 74 will actually be returned.

Performance metrics show strong overall accuracy with room for improvement in catching all returns

The accuracy of 85% looks impressive, but dig deeper. Precision tells us how reliable our "return" predictions are — at 74%, roughly one in four flagged orders is a false alarm. Recall shows we're missing 31% of actual returns, which may or may not be acceptable depending on the business cost of a missed return versus an unnecessary intervention.
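Because we already computed y_pred_proba, the precision/recall balance is tunable: lowering the decision threshold flags more orders, raising recall at the cost of precision. A hedged sketch on synthetic data (not the Flipkart set):

```python
# Sketch: trading precision for recall by lowering the decision threshold
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0.8).astype(int)  # imbalanced target

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]   # probability of the positive class

for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_te, pred):.2f}, "
          f"recall={recall_score(y_te, pred):.2f}")
```

If a missed return costs far more than an unnecessary intervention, a lower threshold like 0.3 may be the better business choice.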
# Get feature importance scores
feature_names = ['Customer Age', 'Quantity', 'Unit Price', 'Rating', 
                'Gender', 'City', 'Product Category']
importance_scores = rf_model.feature_importances_

# Create feature importance dataframe for analysis
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importance_scores
}).sort_values('Importance', ascending=False)

print("Feature Importance Ranking:")
print(importance_df)

Unit price and product rating are the strongest predictors of return behavior

This breakdown reveals gold. Unit price drives 28% of return predictions — expensive items get returned more often. Rating comes second at 23% — low-rated products obviously have higher return rates. Demographics like gender and city matter less than product characteristics.
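A word of caution: the built-in impurity-based importances can overstate high-cardinality features like city. Permutation importance is a common cross-check — here's a minimal sketch on synthetic data where only the first feature actually matters:

```python
# Sketch: permutation importance as a sanity check on feature_importances_
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)   # only feature 0 is informative

model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Shuffle each feature in turn and measure how much the score drops
result = permutation_importance(model, X, y, n_repeats=5, random_state=1)
print(result.importances_mean)   # feature 0 should dominate
```

If the two rankings disagree sharply, trust the permutation-based one — it measures actual impact on predictions rather than how often a feature was used for splits.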

Cross-Validation for Robust Results

Single train-test splits can be lucky or unlucky. Cross-validation tests your model multiple times with different data splits, giving more reliable performance estimates.
# Import cross-validation tools
from sklearn.model_selection import cross_val_score
import numpy as np

# Perform 5-fold cross-validation on the full dataset
cv_scores = cross_val_score(rf_model, X, y, cv=5, scoring='accuracy')

# Calculate statistics from cross-validation
mean_score = cv_scores.mean()
std_score = cv_scores.std()

print("Cross-Validation Results:")
print(f"Individual fold scores: {cv_scores}")
print(f"Mean accuracy: {mean_score:.3f} ± {std_score:.3f}")
print(f"Score range: {cv_scores.min():.3f} to {cv_scores.max():.3f}")

What just happened?

We tested our model 5 times with different 80/20 splits. Scores ranged from 84.2% to 85.0% with low variance (±0.003), indicating the model is stable rather than dependent on one lucky split. Try this: Test a different algorithm with LogisticRegression() for comparison.

Consistent performance across all folds indicates a robust, generalizable model

The tight range (84.2% to 85.0%) with minimal standard deviation suggests our model generalizes well. This consistency matters more than a single high score — it means the model should perform reliably in production on new Flipkart orders.
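The comparison suggested above can use the same cross-validation loop for every candidate algorithm. A hedged sketch, with synthetic data standing in for the real feature matrix:

```python
# Sketch: comparing two algorithms with identical 5-fold cross-validation
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(600, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

candidates = [
    ("RandomForest", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("LogisticRegression", LogisticRegression(max_iter=1000)),
]
for name, model in candidates:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```

Because both models see exactly the same folds, the comparison is apples-to-apples — any gap reflects the algorithm, not the data split.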

Pro tip: Always run cross-validation before deploying models. I've seen "95% accurate" models that were just overfitted to lucky test splits — they crashed in production within days.

Data modeling success comes down to three things: understanding your business problem, choosing appropriate evaluation metrics, and validating performance rigorously. The technical implementation — algorithms, hyperparameters, feature engineering — that's the easy part once you nail the fundamentals. Your Flipkart model now predicts returns with 84.5% accuracy and identifies the key risk factors. Unit price and rating dominate predictions, suggesting the returns team should focus on expensive, low-rated products for intervention strategies.

Quiz

1. Your ecommerce return prediction model shows 80% precision and 65% recall. What's the difference between these metrics?


2. Why is cross-validation important in data modeling, especially when you already have a train-test split?


3. When encoding categorical variables like city names (Mumbai, Delhi, Bangalore), what's a potential problem with label encoding that assigns Mumbai=0, Delhi=1, Bangalore=2?


Up Next

BigQuery

Scale your data models to petabyte datasets using Google's cloud data warehouse and SQL-based machine learning.