Data Science
DS Workflow
Map the complete journey from a raw business problem to a deployed solution using a systematic six-stage process used by data science teams in production.
This lesson covers
CRISP-DM Framework · Business Understanding · Data Preparation · Model Building · Deployment Pipeline · Real Project Workflow
Business Understanding
Define the problem and success metrics
Data Understanding
Collect, explore, and verify data quality
Data Preparation
Clean, transform, and engineer features
Modeling
Build, train, and validate models
Evaluation
Test performance against business metrics
Deployment
Ship to production and monitor results
Why You Need a Structured Workflow
Here's the harsh reality: most data science projects fail. Not because the algorithms are wrong. Not because the data is bad. They fail because teams jump straight into modeling without understanding the business problem they're solving.
Picture this scenario at Flipkart. The marketing team rushes to the data science team: "We need machine learning to increase conversions!" Three months later, you've built a beautiful recommendation engine that predicts what customers might buy. But you never asked what specific conversion problem they were trying to solve. Turns out, their real issue was cart abandonment during checkout — not product discovery.
The CRISP-DM framework (Cross-Industry Standard Process for Data Mining) prevents exactly this disaster. It forces you to think like a business consultant first, data scientist second. You solve the right problem before you solve it right.
CRISP-DM in Plain English
Think of CRISP-DM like building a house. You don't start with the roof (modeling). You start with understanding what kind of house the family needs (business understanding), survey the land (data understanding), lay the foundation (data preparation), then build the structure (modeling), inspect it (evaluation), and finally move in (deployment). Skip any step and the whole thing collapses.
Stage 1: Business Understanding
Most data scientists hate this stage. It involves zero coding and lots of meetings. But master this, and you'll never build the wrong solution again.
Defining Success Metrics
Swiggy's delivery team once asked for a model to "optimize delivery times." Sounds clear, right? Wrong. After three stakeholder interviews, the real goal emerged: reduce customer complaints about late deliveries by 40% in Q3. That's a measurable business outcome, not a vague technical task.
You need to translate business language into data science language. "Increase customer satisfaction" becomes "improve NPS score from 7.2 to 8.0 within 6 months." "Better recommendations" becomes "increase click-through rate from 3.4% to 5.1% while maintaining conversion rate above 12%."
Good Business Question
"How can we reduce cart abandonment rate from 68% to below 50% in the next quarter?" — Specific, measurable, time-bound
Resulting Data Goal
Build model to predict likelihood of cart abandonment and identify top 3 intervention points in checkout flow
Bad Business Question
"We want AI to improve our website" — Vague, unmeasurable, no timeline or specific outcome
Technical Debt Result
Months of work building something impressive but useless. Stakeholders lose trust. Project gets shelved.
The "Show Me Everything" Trap
Stakeholders often say "just show me what the data says" without defining what they want to do with those insights. Push back. Ask: "What decision will you make differently based on this analysis?" If they can't answer, you're about to waste weeks building pretty dashboards nobody uses.
Stage 2: Data Understanding
Now you shift from business analyst to detective. You need to understand what data exists, where it lives, and whether it can actually answer your business question.
The scenario: You're working at Zomato. The business team wants to predict which restaurants will fail in their first year. Sounds doable until you dig into the data.
# First look at our ecommerce data structure - what story does it tell?
import pandas as pd
df = pd.read_csv('dataplexa_ecommerce.csv')
print("Dataset shape:", df.shape)
print("\nColumn info:")
print(df.info())
print("\nFirst look at the data:")
print(df.head())
Dataset shape: (5000, 12)

Column info:
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   order_id          5000 non-null   int64
 1   date              5000 non-null   object
 2   customer_age      4987 non-null   float64
 3   gender            4998 non-null   object
 4   city              5000 non-null   object
 5   product_category  5000 non-null   object
 6   product_name      4995 non-null   object
 7   quantity          5000 non-null   int64
 8   unit_price        5000 non-null   float64
 9   revenue           5000 non-null   float64
 10  rating            4823 non-null   float64
 11  returned          5000 non-null   bool

First look at the data:
   order_id        date  customer_age  gender       city product_category  product_name  quantity  unit_price  revenue  rating  returned
0      1001  2023-01-15            28    Male     Mumbai      Electronics     iPhone 14         1     52999.0  52999.0     4.5     False
1      1002  2023-01-15            34  Female      Delhi         Clothing  Levi's Jeans         2      2999.0   5998.0     4.2      True
2      1003  2023-01-16            22    Male  Bangalore             Food      Rice 5kg         3       899.0   2697.0     4.8     False
What just happened?
(5000, 12) — We have 5,000 orders with 12 features each. Good sample size for meaningful analysis.
customer_age: 4987 non-null — 13 missing ages out of 5,000. That's only 0.26% missing data, very manageable.
Try this: Run df.describe() to see statistical summaries of numeric columns like revenue and rating distributions.
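A quick sketch of that describe() tip on a tiny synthetic stand-in (column names mirror the lesson's dataset, but the values here are made up for illustration):

```python
import pandas as pd

# Tiny synthetic stand-in for the lesson's ecommerce data (illustrative values)
df = pd.DataFrame({
    "revenue": [52999.0, 5998.0, 2697.0, 1499.0],
    "rating": [4.5, 4.2, 4.8, None],
})

# describe() reports count, mean, std, min, quartiles, and max per numeric column
summary = df.describe()
print(summary)

# Missing values are excluded from the count, so rating shows count=3, not 4
print("Rating count:", summary.loc["count", "rating"])
```

Note how describe() silently skips NaN values: comparing each column's count against len(df) is a fast first check for missing data.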
Data Quality Assessment
You need to answer three critical questions: Is this data accurate (reflects reality)? Is it complete (enough coverage)? Is it relevant (connects to your business goal)?
# Check for data quality issues - the unglamorous but crucial detective work
print("Missing values per column:")
print(df.isnull().sum())
print("\nData types and potential issues:")
print(f"Date column type: {df['date'].dtype}")
print(f"Sample dates: {df['date'].head(3).tolist()}")
# Check for obvious data problems
print(f"\nRevenue range: ₹{df['revenue'].min():,.0f} to ₹{df['revenue'].max():,.0f}")
print(f"Any negative revenue? {(df['revenue'] < 0).sum()} rows")
print(f"Return rate: {(df['returned'] == True).sum() / len(df) * 100:.1f}%")
Missing values per column:
order_id              0
date                  0
customer_age         13
gender                2
city                  0
product_category      0
product_name          5
quantity              0
unit_price            0
revenue               0
rating              177
returned              0
dtype: int64

Data types and potential issues:
Date column type: object
Sample dates: ['2023-01-15', '2023-01-15', '2023-01-16']

Revenue range: ₹899 to ₹1,99,995
Any negative revenue? 0 rows
Return rate: 12.4%
What just happened?
rating: 177 — 177 missing ratings (3.5%). This suggests customers don't always rate purchases — normal pattern for ecommerce.
Return rate: 12.4% — Realistic return rate for online retail in India. Electronics typically have 8-15% returns, clothing 20-30%.
Try this: Check unique values in categorical columns with df['city'].value_counts() to spot data entry errors.
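Here's that value_counts() tip sketched on a small synthetic city column (the deliberate "Mumbay" typo is an assumption for illustration, not from the lesson's dataset):

```python
import pandas as pd

# Synthetic city column with a deliberate data-entry error ("Mumbay")
df = pd.DataFrame({"city": ["Mumbai", "Delhi", "Mumbai", "Mumbay", "Delhi", "Mumbai"]})

# value_counts() sorts by frequency, so misspellings sink to the tail
counts = df["city"].value_counts()
print(counts)

# Categories that appear only once are prime suspects for entry errors
typos = counts[counts == 1].index.tolist()
print("Possible data-entry errors:", typos)
```

In a real cleanup you would map each suspect spelling back to its canonical value before any groupby or encoding step.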
📊 Data Insight
The 3.5% missing rating data tells a business story. Customers who don't rate purchases often had neutral experiences — not bad enough to complain, not good enough to praise. This "silent majority" represents 177 customers whose satisfaction level we need to infer from other signals like return behavior and repeat purchases.
Stage 3: Data Preparation
Here's the part nobody warns you about: you'll spend 60-80% of your time in this stage. Data preparation isn't glamorous, but it determines whether your model succeeds or fails in production.
At Myntra, a recommendation system kept suggesting winter coats to customers in Chennai during summer. The model was technically perfect. The problem? Nobody cleaned the seasonal data properly. The algorithm learned that "coat purchases" correlated with "high engagement" without understanding that all coat purchases happened during a brief winter season in North India.
Feature Engineering
Raw data rarely tells the whole story. You need to engineer features that capture business logic. The date column contains hidden signals about customer behavior patterns.
# Extract business-relevant features from raw date data
df['date'] = pd.to_datetime(df['date'])
df['day_of_week'] = df['date'].dt.day_name()
df['is_weekend'] = df['date'].dt.weekday >= 5
df['month'] = df['date'].dt.month
# Create business-relevant derived features
df['high_value_order'] = df['revenue'] >= 10000
df['premium_customer'] = (df['customer_age'] >= 35) & (df['revenue'] >= 15000)
# Show how feature engineering reveals patterns
weekend_behavior = df.groupby(['is_weekend', 'product_category'])['revenue'].mean().reset_index()
print("Weekend vs Weekday purchasing patterns:")
print(weekend_behavior.pivot(index='product_category', columns='is_weekend', values='revenue'))
Weekend vs Weekday purchasing patterns:
is_weekend           False      True
product_category
Books              3247.82   2891.45
Clothing          12456.73  14823.67
Electronics       28394.21  26871.39
Food               1892.34   2156.78
Home              11234.56  10987.23
What just happened?
Clothing: ₹14,823 (weekend) vs ₹12,456 (weekday) — People spend 19% more on clothing during weekends, likely leisure shopping behavior.
Electronics: ₹28,394 (weekday) vs ₹26,871 (weekend) — Higher weekday electronics spending suggests work-related purchases or research-driven decisions.
Try this: Create an age_group feature with bins like "18-25", "26-35", "36-50", "50+" to capture generational buying patterns.
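That age_group idea can be sketched with pandas' cut(), which bins a numeric column into labeled intervals (the sample ages below are made up):

```python
import pandas as pd

df = pd.DataFrame({"customer_age": [22, 28, 34, 41, 55]})

# Bin ages into the generational groups suggested above;
# bins are right-inclusive, so (17, 25] captures ages 18-25
bins = [17, 25, 35, 50, 120]
labels = ["18-25", "26-35", "36-50", "50+"]
df["age_group"] = pd.cut(df["customer_age"], bins=bins, labels=labels)
print(df)
```

The resulting column is an ordered categorical, which groupby and plotting libraries handle more gracefully than raw age values.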
Pro Engineering Tip: Always create features that encode domain knowledge. A feature like "is_festival_season" (Diwali, Eid, Christmas periods) will outperform any algorithmic time-series decomposition for Indian ecommerce data. The algorithm doesn't know about Indian festivals, but you do.
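A minimal sketch of that festival-season feature, assuming hand-picked date windows (the windows below are approximate and would need to be maintained per year):

```python
import pandas as pd

# Illustrative festival windows (approximate; real dates shift every year)
festival_windows = [
    ("2023-10-15", "2023-11-15"),  # Diwali season
    ("2023-12-15", "2023-12-31"),  # Christmas / year-end
]

df = pd.DataFrame({"date": pd.to_datetime(["2023-06-01", "2023-10-25", "2023-12-20"])})

# Flag any order whose date falls inside one of the windows
df["is_festival_season"] = False
for start, end in festival_windows:
    df["is_festival_season"] |= df["date"].between(start, end)

print(df)
```

A lookup-table feature like this encodes knowledge the model cannot learn from a short training window, which is exactly the point of the tip above.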
Stage 4: Modeling
Finally, the part everyone thinks is "real" data science. But here's what they don't tell you: modeling is the easy part. If you've done stages 1-3 correctly, the model practically builds itself.
The key insight: start simple, then add complexity. At BigBasket, their first demand forecasting model was literally just "last month's sales + 5% growth." It worked better than their complex neural network because the simple model was reliable and the team understood its limitations.
Model Selection Strategy
✅ Start Here
Linear/Logistic Regression
Fast to train, easy to interpret, works surprisingly well. You can explain every coefficient to stakeholders. Perfect baseline model.
Then Consider
Random Forest/XGBoost
Better performance, handles non-linear patterns. Still interpretable with feature importance. Good production choice.
# Build a simple baseline model to predict high-value orders
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Prepare features - keep it simple for the baseline
features = ['customer_age', 'quantity', 'unit_price', 'is_weekend']
X = df[features].dropna()
y = df.loc[X.index, 'high_value_order']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the simplest possible model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Evaluate performance
y_pred = model.predict(X_test)
print("Baseline Model Performance:")
print(classification_report(y_test, y_pred))
Baseline Model Performance:
              precision    recall  f1-score   support

       False       0.89      0.97      0.93      1245
        True       0.84      0.52      0.64       255

    accuracy                           0.88      1500
   macro avg       0.86      0.74      0.78      1500
weighted avg       0.88      0.88      0.87      1500
What just happened?
accuracy: 0.88 — The model correctly predicts high-value orders 88% of the time. Not bad for a baseline with just 4 features!
True: precision 0.84, recall 0.52 — When it predicts "high-value", it's right 84% of the time. But it only catches 52% of actual high-value orders.
Try this: Add categorical features like city and product_category using pandas' get_dummies() to improve recall.
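Here's what that get_dummies() step looks like on a toy frame (columns mirror the lesson's dataset; the rows are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_age": [28, 34, 22],
    "city": ["Mumbai", "Delhi", "Bangalore"],
})

# One-hot encode the categorical column so a linear model can use it;
# each unique city becomes its own 0/1 indicator column
X = pd.get_dummies(df, columns=["city"], prefix="city")
print(X.columns.tolist())
```

One caution for production: get_dummies() only creates columns for categories present in the data, so train and inference frames must be aligned (or use scikit-learn's OneHotEncoder, which remembers the categories it was fit on).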
The model excels at precision but struggles with recall — it's conservative in predicting high-value orders
This chart reveals the classic precision-recall tradeoff. High precision means when the model says "this will be a high-value order," it's usually right. Low recall means it misses many actual high-value orders. For a marketing campaign targeting high-value customers, you might prefer high precision (don't waste ad spend on wrong predictions). For inventory planning, you might need higher recall (don't miss potential demand spikes).
The business context determines which metric matters most. Always tie model performance back to business impact.
Stage 5: Evaluation
Technical metrics are just the beginning. The real question: does this model solve the original business problem? You need to evaluate both statistical performance and business value.
At HDFC Bank, they built a fraud detection model with 99.2% accuracy. Sounds amazing until you realize it flagged 15% of legitimate transactions as fraud. Customer complaints skyrocketed. Technical success, business disaster.
Business Impact Assessment
# Calculate business value of model predictions
# Scenario: Marketing campaign targeting high-value customers
campaign_cost_per_customer = 50 # ₹50 per targeted customer
avg_high_value_order_profit = 2000 # ₹2000 profit from high-value order
# Model performance analysis
true_positives = 133 # Correctly identified high-value customers
false_positives = 25 # Incorrectly targeted low-value customers
false_negatives = 122 # Missed high-value customers
# Calculate ROI
campaign_cost = (true_positives + false_positives) * campaign_cost_per_customer
revenue_generated = true_positives * avg_high_value_order_profit
roi = (revenue_generated - campaign_cost) / campaign_cost * 100
print("Campaign Results:")
print(f"Total targeted: {true_positives + false_positives} customers")
print(f"Campaign cost: ₹{campaign_cost:,}")
print(f"Revenue generated: ₹{revenue_generated:,}")
print(f"ROI: {roi:.1f}%")
print(f"\nMissed opportunity: {false_negatives} high-value customers not targeted")
print(f"Potential additional revenue: ₹{false_negatives * avg_high_value_order_profit:,}")
Campaign Results:
Total targeted: 158 customers
Campaign cost: ₹7,900
Revenue generated: ₹2,66,000
ROI: 3267.1%

Missed opportunity: 122 high-value customers not targeted
Potential additional revenue: ₹2,44,000
What just happened?
ROI: 3,267.1% — Spectacular return! Every ₹1 spent on targeted marketing generates ₹33.67 in revenue. Even with imperfect predictions, the model creates massive value.
Missed opportunity: ₹2,44,000 — The low recall means we're leaving significant money on the table. Improving recall could nearly double campaign revenue.
Try this: Adjust the prediction threshold to catch more high-value customers. A threshold of 0.3 instead of 0.5 might improve recall with acceptable precision cost.
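The threshold adjustment works through predict_proba() rather than predict(). A minimal sketch on synthetic data (the 200-sample toy dataset below is an assumption, standing in for the lesson's train/test split):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 200 samples, 2 features, imbalanced binary target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0.8).astype(int)

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]  # probability of the positive class

# predict() uses a fixed 0.5 cutoff; lowering it to 0.3 flags more positives,
# trading some precision for higher recall
pred_default = (probs >= 0.5).astype(int)
pred_loose = (probs >= 0.3).astype(int)

print("Positives at 0.5:", pred_default.sum())
print("Positives at 0.3:", pred_loose.sum())
```

In practice you would sweep thresholds on a validation set and pick the one that maximizes the business metric (here, campaign ROI), not a statistical one.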
📊 Data Insight
This is why data science creates business value. A simple logistic regression model turns ₹7,900 marketing spend into ₹2.66L revenue. The technical accuracy of 88% matters less than the business ROI of 3,267%. Always measure models by revenue and decisions, not just statistical metrics.
Stage 6: Deployment
The final stage where many projects die. You've built a perfect model in your Jupyter notebook. Now you need to run it automatically on new data, every day, at scale, without breaking.
At Ola, they spent 8 months building a surge pricing algorithm. It worked flawlessly in testing. On launch day, it crashed after 10 minutes because nobody considered what happens when thousands of drivers simultaneously request price updates. Production environments are merciless.
Model Monitoring
Your model will degrade over time. Customer behavior changes. Market conditions shift. New products launch. The patterns your model learned three months ago might be irrelevant today.
❌ Data Drift Warning
Average order value drops 30% in December. Model trained on regular months predicts incorrectly during festival sales.
✅ Monitoring Solution
Alert when prediction distribution shifts >15% from training baseline. Trigger automatic model retraining.
⚠️ Performance Decay
Model accuracy drops from 88% to 76% over 2 months. Precision falls to 68%. Business ROI turns negative.
📊 Business Tracking
Monitor campaign ROI weekly. When ROI drops below 200%, pause targeting and retrain model on recent data.
The Silent Model Death
Models rarely fail dramatically — they decay quietly. Your fraud detection model still runs, still generates predictions, but gradually becomes less effective. Set up automated alerts for model performance metrics, not just system uptime. A working but useless model is worse than a crashed model because you don't know it's broken.
Model accuracy and business ROI both decline over time without retraining — the model becomes stale
This visualization shows the cruel reality of production models. Technical accuracy drops from 88% to 73% over 20 weeks. But the business impact is even more dramatic — ROI crashes from 3,267% to just 980%. The model is still "working" but barely creating value.
The key lesson: deploy monitoring systems alongside your model. Track both technical metrics (accuracy, precision, recall) and business metrics (ROI, conversion rate, customer satisfaction). Set automated alerts when either falls below acceptable thresholds.
Production Reality Check: Schedule model retraining every 6-8 weeks for fast-moving domains like ecommerce, every 3-6 months for stable domains like credit scoring. Don't wait for performance to degrade — proactive retraining costs less than reactive firefighting.
The Iterative Reality
Here's what every workflow diagram gets wrong: data science projects are never linear. You'll cycle back through stages multiple times. You discover data quality issues during modeling. You realize the business problem was misunderstood during evaluation. You find new data sources during deployment.
At Paytm, a customer lifetime value project went through the full cycle three times. First iteration: wrong business question (focused on transaction value instead of user retention). Second iteration: data quality issues (missing user demographic data). Third iteration: model complexity problems (neural network was overkill for simple customer segments). The final solution? A decision tree that ran in 50ms and increased customer retention by 23%.
The "Analysis Paralysis" Trap
Teams spend months perfecting data preparation or chasing 2% accuracy improvements. Set time limits: 2 weeks for business understanding, 1 week for data understanding, 2 weeks for preparation, 1 week for baseline modeling. Ship fast, iterate faster. Perfect is the enemy of deployed.
Data preparation consumes nearly half your time — plan accordingly and don't underestimate it
This time distribution reflects real industry experience. Data preparation takes 45% of project time — cleaning, transforming, feature engineering. Modeling, the part everyone thinks is "data science," takes just 15%. The lesson: budget your time accordingly.
But here's the paradox: you can't rush data preparation, but you also can't perfect it. The best approach is iterative data preparation. Get the data "good enough" to train a baseline model, then improve it based on model feedback.
Where to Practice
Kaggle Notebooks
Free cloud environment, no setup required. Upload dataplexa_ecommerce.csv and practice the full workflow. Visit kaggle.com → Notebooks
Google Colab
Free Jupyter notebooks with Google account. Free GPU access for larger datasets. Go to colab.research.google.com
Jupyter Notebook (Local)
Full control, works offline. Install with pip install jupyter then jupyter notebook
W3Schools Tryit Editor
Quick Python syntax checks — no account needed. Paste any snippet and run instantly at w3schools.com/python/trypython.asp