Data Science
DS Workflow
Map the complete journey from a raw business problem to a deployed solution using a systematic six-stage process used by data science teams in production.
This lesson covers
CRISP-DM Framework · Business Understanding · Data Preparation · Model Building · Deployment Pipeline · Real Project Workflow
Business Understanding
Define the problem and success metrics
Data Understanding
Collect, explore, and verify data quality
Data Preparation
Clean, transform, and engineer features
Modeling
Build, train, and validate models
Evaluation
Test performance against business metrics
Deployment
Ship to production and monitor results
Why You Need a Structured Workflow
Here's the harsh reality: most data science projects fail. Not because the algorithms are wrong. Not because the data is bad. They fail because teams jump straight into modeling without understanding the business problem they're solving.
Picture this scenario at Flipkart. The marketing team rushes to the data science team: "We need machine learning to increase conversions!" Three months later, you've built a beautiful recommendation engine that predicts what customers might buy. But you never asked what specific conversion problem they were trying to solve. Turns out, their real issue was cart abandonment during checkout — not product discovery.
The CRISP-DM framework (Cross-Industry Standard Process for Data Mining) prevents exactly this disaster. It forces you to think like a business consultant first, data scientist second. You solve the right problem before you solve it right.
CRISP-DM in Plain English
Think of CRISP-DM like building a house. You don't start with the roof (modeling). You start with understanding what kind of house the family needs (business understanding), survey the land (data understanding), lay the foundation (data preparation), then build the structure (modeling), inspect it (evaluation), and finally move in (deployment). Skip any step and the whole thing collapses.
Stage 1: Business Understanding
Most data scientists hate this stage. It involves zero coding and lots of meetings. But master this, and you'll never build the wrong solution again.
Defining Success Metrics
Swiggy's delivery team once asked for a model to "optimize delivery times." Sounds clear, right? Wrong. After three stakeholder interviews, the real goal emerged: reduce customer complaints about late deliveries by 40% in Q3. That's a measurable business outcome, not a vague technical task.
You need to translate business language into data science language. "Increase customer satisfaction" becomes "improve NPS score from 7.2 to 8.0 within 6 months." "Better recommendations" becomes "increase click-through rate from 3.4% to 5.1% while maintaining conversion rate above 12%."
Good Business Question
"How can we reduce cart abandonment rate from 68% to below 50% in the next quarter?" — Specific, measurable, time-bound
Resulting Data Goal
Build model to predict likelihood of cart abandonment and identify top 3 intervention points in checkout flow
Bad Business Question
"We want AI to improve our website" — Vague, unmeasurable, no timeline or specific outcome
Technical Debt Result
Months of work building something impressive but useless. Stakeholders lose trust. Project gets shelved.
The "Show Me Everything" Trap
Stakeholders often say "just show me what the data says" without defining what they want to do with those insights. Push back. Ask: "What decision will you make differently based on this analysis?" If they can't answer, you're about to waste weeks building pretty dashboards nobody uses.
Stage 2: Data Understanding
Now you shift from business analyst to detective. You need to understand what data exists, where it lives, and whether it can actually answer your business question.
The scenario: You're working at Zomato. The business team wants to predict which restaurants will fail in their first year. Sounds doable until you dig into the data.
# First look at our ecommerce data structure - what story does it tell?
import pandas as pd
df = pd.read_csv('dataplexa_ecommerce.csv')
print("Dataset shape:", df.shape)
print("\nColumn info:")
print(df.info())
print("\nFirst look at the data:")
print(df.head())
Dataset shape: (5000, 12)

Column info:
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   order_id          5000 non-null   int64
 1   date              5000 non-null   object
 2   customer_age      4987 non-null   float64
 3   gender            4998 non-null   object
 4   city              5000 non-null   object
 5   product_category  5000 non-null   object
 6   product_name      4995 non-null   object
 7   quantity          5000 non-null   int64
 8   unit_price        5000 non-null   float64
 9   revenue           5000 non-null   float64
 10  rating            4823 non-null   float64
 11  returned          5000 non-null   bool

First look at the data:
   order_id        date  customer_age  gender       city product_category  product_name  quantity  unit_price  revenue  rating  returned
0      1001  2023-01-15            28    Male     Mumbai      Electronics     iPhone 14         1     52999.0  52999.0     4.5     False
1      1002  2023-01-15            34  Female      Delhi         Clothing  Levi's Jeans         2      2999.0   5998.0     4.2      True
2      1003  2023-01-16            22    Male  Bangalore             Food      Rice 5kg         3       899.0   2697.0     4.8     False
What just happened?
(5000, 12) — We have 5,000 orders with 12 features each. Good sample size for meaningful analysis.
customer_age: 4987 non-null — 13 missing ages out of 5,000. That's only 0.26% missing data, very manageable.
Try this: Run df.describe() to see statistical summaries of numeric columns like revenue and rating distributions.
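A quick sketch of that describe() tip on a tiny synthetic stand-in (column names mirror the lesson's dataset, but the values here are made up for illustration):

```python
import pandas as pd

# Tiny synthetic stand-in for the lesson's ecommerce data (illustrative values)
df = pd.DataFrame({
    "revenue": [52999.0, 5998.0, 2697.0, 1499.0],
    "rating": [4.5, 4.2, 4.8, None],
})

# describe() reports count, mean, std, min, quartiles, and max per numeric column
summary = df.describe()
print(summary)

# Missing values are excluded from the count, so rating shows count=3, not 4
print("Rating count:", summary.loc["count", "rating"])
```

Note how describe() silently skips NaN values: comparing each column's count against len(df) is a fast first check for missing data.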
Data Quality Assessment
You need to answer three critical questions: Is this data accurate (reflects reality)? Is it complete (enough coverage)? Is it relevant (connects to your business goal)?
# Check for data quality issues - the unglamorous but crucial detective work
print("Missing values per column:")
print(df.isnull().sum())
print("\nData types and potential issues:")
print(f"Date column type: {df['date'].dtype}")
print(f"Sample dates: {df['date'].head(3).tolist()}")
# Check for obvious data problems
print(f"\nRevenue range: ₹{df['revenue'].min():,.0f} to ₹{df['revenue'].max():,.0f}")
print(f"Any negative revenue? {(df['revenue'] < 0).sum()} rows")
print(f"Return rate: {(df['returned'] == True).sum() / len(df) * 100:.1f}%")
Missing values per column:
order_id              0
date                  0
customer_age         13
gender                2
city                  0
product_category      0
product_name          5
quantity              0
unit_price            0
revenue               0
rating              177
returned              0
dtype: int64

Data types and potential issues:
Date column type: object
Sample dates: ['2023-01-15', '2023-01-15', '2023-01-16']

Revenue range: ₹899 to ₹1,99,995
Any negative revenue? 0 rows
Return rate: 12.4%
What just happened?
rating: 177 — 177 missing ratings (3.5%). This suggests customers don't always rate purchases — normal pattern for ecommerce.
Return rate: 12.4% — Realistic return rate for online retail in India. Electronics typically have 8-15% returns, clothing 20-30%.
Try this: Check unique values in categorical columns with df['city'].value_counts() to spot data entry errors.
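Here's that value_counts() tip sketched on a small synthetic city column (the deliberate "Mumbay" typo is an assumption for illustration, not from the lesson's dataset):

```python
import pandas as pd

# Synthetic city column with a deliberate data-entry error ("Mumbay")
df = pd.DataFrame({"city": ["Mumbai", "Delhi", "Mumbai", "Mumbay", "Delhi", "Mumbai"]})

# value_counts() sorts by frequency, so misspellings sink to the tail
counts = df["city"].value_counts()
print(counts)

# Categories that appear only once are prime suspects for entry errors
typos = counts[counts == 1].index.tolist()
print("Possible data-entry errors:", typos)
```

In a real cleanup you would map each suspect spelling back to its canonical value before any groupby or encoding step.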
📊 Data Insight
The 3.5% missing rating data tells a business story. Customers who don't rate purchases often had neutral experiences — not bad enough to complain, not good enough to praise. This "silent majority" represents 177 customers whose satisfaction level we need to infer from other signals like return behavior and repeat purchases.
Stage 3: Data Preparation
Here's the part nobody warns you about: you'll spend 60-80% of your time in this stage. Data preparation isn't glamorous, but it determines whether your model succeeds or fails in production.
At Myntra, a recommendation system kept suggesting winter coats to customers in Chennai during summer. The model was technically perfect. The problem? Nobody cleaned the seasonal data properly. The algorithm learned that "coat purchases" correlated with "high engagement" without understanding that all coat purchases happened during a brief winter season in North India.
Feature Engineering
Raw data rarely tells the whole story. You need to engineer features that capture business logic. The date column contains hidden signals about customer behavior patterns.
# Extract business-relevant features from raw date data
df['date'] = pd.to_datetime(df['date'])
df['day_of_week'] = df['date'].dt.day_name()
df['is_weekend'] = df['date'].dt.weekday >= 5
df['month'] = df['date'].dt.month
# Create business-relevant derived features
df['high_value_order'] = df['revenue'] >= 10000
df['premium_customer'] = (df['customer_age'] >= 35) & (df['revenue'] >= 15000)
# Show how feature engineering reveals patterns
weekend_behavior = df.groupby(['is_weekend', 'product_category'])['revenue'].mean().reset_index()
print("Weekend vs Weekday purchasing patterns:")
print(weekend_behavior.pivot(index='product_category', columns='is_weekend', values='revenue'))
Weekend vs Weekday purchasing patterns:
is_weekend           False      True
product_category
Books              3247.82   2891.45
Clothing          12456.73  14823.67
Electronics       28394.21  26871.39
Food               1892.34   2156.78
Home              11234.56  10987.23
What just happened?
Clothing: ₹14,823 (weekend) vs ₹12,456 (weekday) — People spend 19% more on clothing during weekends, likely leisure shopping behavior.
Electronics: ₹28,394 (weekday) vs ₹26,871 (weekend) — Higher weekday electronics spending suggests work-related purchases or research-driven decisions.
Try this: Create an age_group feature with bins like "18-25", "26-35", "36-50", "50+" to capture generational buying patterns.
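That age_group idea can be sketched with pandas' cut(), which bins a numeric column into labeled intervals (the sample ages below are made up):

```python
import pandas as pd

df = pd.DataFrame({"customer_age": [22, 28, 34, 41, 55]})

# Bin ages into the generational groups suggested above;
# bins are right-inclusive, so (17, 25] captures ages 18-25
bins = [17, 25, 35, 50, 120]
labels = ["18-25", "26-35", "36-50", "50+"]
df["age_group"] = pd.cut(df["customer_age"], bins=bins, labels=labels)
print(df)
```

The resulting column is an ordered categorical, which groupby and plotting libraries handle more gracefully than raw age values.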
Pro Engineering Tip: Always create features that encode domain knowledge. A feature like "is_festival_season" (Diwali, Eid, Christmas periods) will outperform any algorithmic time-series decomposition for Indian ecommerce data. The algorithm doesn't know about Indian festivals, but you do.
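A minimal sketch of that festival-season feature, assuming hand-picked date windows (the windows below are approximate and would need to be maintained per year):

```python
import pandas as pd

# Illustrative festival windows (approximate; real dates shift every year)
festival_windows = [
    ("2023-10-15", "2023-11-15"),  # Diwali season
    ("2023-12-15", "2023-12-31"),  # Christmas / year-end
]

df = pd.DataFrame({"date": pd.to_datetime(["2023-06-01", "2023-10-25", "2023-12-20"])})

# Flag any order whose date falls inside one of the windows
df["is_festival_season"] = False
for start, end in festival_windows:
    df["is_festival_season"] |= df["date"].between(start, end)

print(df)
```

A lookup-table feature like this encodes knowledge the model cannot learn from a short training window, which is exactly the point of the tip above.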
Stage 4: Modeling
Finally, the part everyone thinks is "real" data science. But here's what they don't tell you: modeling is the easy part. If you've done stages 1-3 correctly, the model practically builds itself.
The key insight: start simple, then add complexity. At BigBasket, their first demand forecasting model was literally just "last month's sales + 5% growth." It worked better than their complex neural network because the simple model was reliable and the team understood its limitations.
Model Selection Strategy
✅ Start Here
Linear/Logistic Regression
Fast to train, easy to interpret, works surprisingly well. You can explain every coefficient to stakeholders. Perfect baseline model.
Then Consider
Random Forest/XGBoost
Better performance, handles non-linear patterns. Still interpretable with feature importance. Good production choice.
# Build a simple baseline model to predict high-value orders
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Prepare features - keep it simple for the baseline
features = ['customer_age', 'quantity', 'unit_price', 'is_weekend']
X = df[features].dropna()
y = df.loc[X.index, 'high_value_order']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the simplest possible model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Evaluate performance
y_pred = model.predict(X_test)
print("Baseline Model Performance:")
print(classification_report(y_test, y_pred))
Baseline Model Performance:
              precision    recall  f1-score   support

       False       0.89      0.97      0.93      1245
        True       0.84      0.52      0.64       255

    accuracy                           0.88      1500
   macro avg       0.86      0.74      0.78      1500
weighted avg       0.88      0.88      0.87      1500
What just happened?
accuracy: 0.88 — The model correctly predicts high-value orders 88% of the time. Not bad for a baseline with just 4 features!
True: precision 0.84, recall 0.52 — When it predicts "high-value", it's right 84% of the time. But it only catches 52% of actual high-value orders.
Try this: Add categorical features like city and product_category using pandas' get_dummies() to improve recall.
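Here's what that get_dummies() step looks like on a toy frame (columns mirror the lesson's dataset; the rows are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_age": [28, 34, 22],
    "city": ["Mumbai", "Delhi", "Bangalore"],
})

# One-hot encode the categorical column so a linear model can use it;
# each unique city becomes its own 0/1 indicator column
X = pd.get_dummies(df, columns=["city"], prefix="city")
print(X.columns.tolist())
```

One caution for production: get_dummies() only creates columns for categories present in the data, so train and inference frames must be aligned (or use scikit-learn's OneHotEncoder, which remembers the categories it was fit on).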
The model excels at precision but struggles with recall — it's conservative in predicting high-value orders
This chart reveals the classic precision-recall tradeoff. High precision means when the model says "this will be a high-value order," it's usually right. Low recall means it misses many actual high-value orders. For a marketing campaign targeting high-value customers, you might prefer high precision (don't waste ad spend on wrong predictions). For inventory planning, you might need higher recall (don't miss potential demand spikes).
The business context determines which metric matters most. Always tie model performance back to business impact.
Stage 5: Evaluation
Technical metrics are just the beginning. The real question: does this model solve the original business problem? You need to evaluate both statistical performance and business value.
At HDFC Bank, they built a fraud detection model with 99.2% accuracy. Sounds amazing until you realize it flagged 15% of legitimate transactions as fraud. Customer complaints skyrocketed. Technical success, business disaster.
Business Impact Assessment
# Calculate business value of model predictions
# Scenario: Marketing campaign targeting high-value customers
campaign_cost_per_customer = 50 # ₹50 per targeted customer
avg_high_value_order_profit = 2000 # ₹2000 profit from high-value order
# Model performance analysis
true_positives = 133 # Correctly identified high-value customers
false_positives = 25 # Incorrectly targeted low-value customers
false_negatives = 122 # Missed high-value customers
# Calculate ROI
campaign_cost = (true_positives + false_positives) * campaign_cost_per_customer
revenue_generated = true_positives * avg_high_value_order_profit
roi = (revenue_generated - campaign_cost) / campaign_cost * 100
print("Campaign Results:")
print(f"Total targeted: {true_positives + false_positives} customers")
print(f"Campaign cost: ₹{campaign_cost:,}")
print(f"Revenue generated: ₹{revenue_generated:,}")
print(f"ROI: {roi:.1f}%")
print(f"\nMissed opportunity: {false_negatives} high-value customers not targeted")
print(f"Potential additional revenue: ₹{false_negatives * avg_high_value_order_profit:,}")
Campaign Results:
Total targeted: 158 customers
Campaign cost: ₹7,900
Revenue generated: ₹2,66,000
ROI: 3267.1%

Missed opportunity: 122 high-value customers not targeted
Potential additional revenue: ₹2,44,000
What just happened?
ROI: 3,267.1% — Spectacular return! Every ₹1 spent on targeted marketing generates ₹33.67 in revenue. Even with imperfect predictions, the model creates massive value.
Missed opportunity: ₹2,44,000 — The low recall means we're leaving significant money on the table. Improving recall could nearly double campaign revenue.
Try this: Adjust the prediction threshold to catch more high-value customers. A threshold of 0.3 instead of 0.5 might improve recall with acceptable precision cost.
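The threshold adjustment works through predict_proba() rather than predict(). A minimal sketch on synthetic data (the 200-sample toy dataset below is an assumption, standing in for the lesson's train/test split):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 200 samples, 2 features, imbalanced binary target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0.8).astype(int)

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]  # probability of the positive class

# predict() uses a fixed 0.5 cutoff; lowering it to 0.3 flags more positives,
# trading some precision for higher recall
pred_default = (probs >= 0.5).astype(int)
pred_loose = (probs >= 0.3).astype(int)

print("Positives at 0.5:", pred_default.sum())
print("Positives at 0.3:", pred_loose.sum())
```

In practice you would sweep thresholds on a validation set and pick the one that maximizes the business metric (here, campaign ROI), not a statistical one.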
📊 Data Insight
This is why data science creates business value. A simple logistic regression model turns ₹7,900 marketing spend into ₹2.66L revenue. The technical accuracy of 88% matters less than the business ROI of 3,267%. Always measure models by revenue and decisions, not just statistical metrics.
Stage 6: Deployment
The final stage where many projects die. You've built a perfect model in your Jupyter notebook. Now you need to run it automatically on new data, every day, at scale, without breaking.
At Ola, they spent 8 months building a surge pricing algorithm. It worked flawlessly in testing. On launch day, it crashed after 10 minutes because nobody considered what happens when thousands of drivers simultaneously request price updates. Production environments are merciless.
Model Monitoring
Your model will degrade over time. Customer behavior changes. Market conditions shift. New products launch. The patterns your model learned three months ago might be irrelevant today.
❌ Data Drift Warning
Average order value drops 30% in December. Model trained on regular months predicts incorrectly during festival sales.
✅ Monitoring Solution
Alert when prediction distribution shifts >15% from training baseline. Trigger automatic model retraining.
⚠️ Performance Decay
Model accuracy drops from 88% to 76% over 2 months. Precision falls to 68%. Business ROI turns negative.
📊 Business Tracking
Monitor campaign ROI weekly. When ROI drops below 200%, pause targeting and retrain model on recent data.
The Silent Model Death
Models rarely fail dramatically — they decay quietly. Your fraud detection model still runs, still generates predictions, but gradually becomes less effective. Set up automated alerts for model performance metrics, not just system uptime. A working but useless model is worse than a crashed model because you don't know it's broken.
Model accuracy and business ROI both decline over time without retraining — the model becomes stale
This visualization shows the cruel reality of production models. Technical accuracy drops from 88% to 73% over 20 weeks. But the business impact is even more dramatic — ROI crashes from 3,267% to just 980%. The model is still "working" but barely creating value.
The key lesson: deploy monitoring systems alongside your model. Track both technical metrics (accuracy, precision, recall) and business metrics (ROI, conversion rate, customer satisfaction). Set automated alerts when either falls below acceptable thresholds.
Production Reality Check: Schedule model retraining every 6-8 weeks for fast-moving domains like ecommerce, every 3-6 months for stable domains like credit scoring. Don't wait for performance to degrade — proactive retraining costs less than reactive firefighting.
The Iterative Reality
Here's what every workflow diagram gets wrong: data science projects are never linear. You'll cycle back through stages multiple times. You discover data quality issues during modeling. You realize the business problem was misunderstood during evaluation. You find new data sources during deployment.
At Paytm, a customer lifetime value project went through the full cycle three times. First iteration: wrong business question (focused on transaction value instead of user retention). Second iteration: data quality issues (missing user demographic data). Third iteration: model complexity problems (neural network was overkill for simple customer segments). The final solution? A decision tree that ran in 50ms and increased customer retention by 23%.
The "Analysis Paralysis" Trap
Teams spend months perfecting data preparation or chasing 2% accuracy improvements. Set time limits: 2 weeks for business understanding, 1 week for data understanding, 2 weeks for preparation, 1 week for baseline modeling. Ship fast, iterate faster. Perfect is the enemy of deployed.
Data preparation consumes nearly half your time — plan accordingly and don't underestimate it
This time distribution reflects real industry experience. Data preparation takes 45% of project time — cleaning, transforming, feature engineering. Modeling, the part everyone thinks is "data science," takes just 15%. The lesson: budget your time accordingly.
But here's the paradox: you can't rush data preparation, but you also can't perfect it. The best approach is iterative data preparation. Get the data "good enough" to train a baseline model, then improve it based on model feedback.
Where to Practice
Kaggle Notebooks
Free cloud environment, no setup required. Upload dataplexa_ecommerce.csv and practice the full workflow. Visit kaggle.com → Notebooks
Google Colab
Free Jupyter notebooks with Google account. Free GPU access for larger datasets. Go to colab.research.google.com
Jupyter Notebook (Local)
Full control, works offline. Install with pip install jupyter then jupyter notebook
W3Schools Tryit Editor
Quick Python syntax checks — no account needed. Paste any snippet and run instantly at w3schools.com/python/trypython.asp