Data Science
Classification
Build machine learning models that predict categories like customer churn, fraud detection, and product recommendations using real e-commerce data.
What Makes Classification Different
Think of regression as predicting a number — like house price or temperature. Classification predicts categories instead. Will this customer buy? Yes or No. Which product category will they prefer? Electronics, Clothing, or Books.
The math works differently too. Regression finds the best line through data points. Classification draws boundaries between groups. A customer aged 25 with high income might fall into the "Premium Buyer" zone, while a 45-year-old with moderate income lands in "Value Seeker" territory.
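The contrast takes only a few lines to see. A minimal sketch on made-up customer data (ages, incomes, and labels here are illustrative, not from the dataset used later):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data: [age, income in lakhs] for six hypothetical customers
X = np.array([[25, 12], [27, 15], [30, 14], [45, 6], [50, 5], [48, 7]])
spend = np.array([22, 28, 26, 9, 8, 10])   # a number -> regression target
segment = np.array([1, 1, 1, 0, 0, 0])     # a category -> classification target

reg = LinearRegression().fit(X, spend)     # fits a best-fit line/plane
clf = LogisticRegression().fit(X, segment) # fits a boundary between groups

print(reg.predict([[26, 13]]))  # a continuous number
print(clf.predict([[26, 13]]))  # a class label: 0 or 1
```

The same inputs go in; regression comes back with a quantity, classification with a zone the customer falls into.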
Binary Classification
Two outcomes: Buy/Don't Buy, Spam/Not Spam, Fraud/Legitimate
Multi-class
Multiple categories: Product types, Customer segments, Risk levels
Multi-label
Multiple true labels: Customer interests, Product tags
Imbalanced
Unequal classes: 99% normal, 1% fraud transactions
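Before modeling, it pays to check which of these situations you are in. A minimal sketch of a class-balance check on hypothetical fraud labels:

```python
import pandas as pd

# Hypothetical labels: 99 legitimate transactions, 1 fraudulent
labels = pd.Series(['normal'] * 99 + ['fraud'] * 1)

# normalize=True turns raw counts into proportions
print(labels.value_counts(normalize=True))
# With 99% of one class, plain accuracy would be a misleading metric
```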
Preparing Data for Classification
The scenario: Flipkart's analytics team needs to predict which customers will return products. The business loses ₹50 crore annually on returns, and the CEO wants a model by Friday.
# Load the essential libraries for classification
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# Read the e-commerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
# Display first few rows to understand structure
print(df.head())

   order_id        date  customer_age  gender       city product_category  product_name  quantity  unit_price  revenue  rating  returned
0      1001  2023-01-05            28    Male     Mumbai      Electronics    Smartphone         1     25000.0  25000.0     4.2      True
1      1002  2023-01-05            34  Female      Delhi         Clothing       T-Shirt         2       800.0   1600.0     3.8     False
2      1003  2023-01-06            42    Male  Bangalore             Food      Rice 5kg         3       120.0    360.0     4.5     False
3      1004  2023-01-06            25  Female    Chennai            Books  Python Guide         1       650.0    650.0     4.7     False
4      1005  2023-01-07            39    Male       Pune             Home     Bedsheets         2      1200.0   2400.0     3.9      True
What just happened?
We loaded 5 sample orders from different cities. Notice the returned column has True/False values — that's our target variable. Orders 1001 and 1005 were returned. Try this: Check what fraction of orders get returned using df['returned'].value_counts(normalize=True)
Now we need to convert text categories into numbers. Machine learning algorithms only understand numbers, not words like "Mumbai" or "Electronics".
# Check data types and missing values first
print("Data types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
Data types:
order_id              int64
date                 object
customer_age          int64
gender               object
city                 object
product_category     object
product_name         object
quantity              int64
unit_price          float64
revenue             float64
rating              float64
returned               bool

Missing values:
order_id            0
date                0
customer_age        0
gender              0
city                0
product_category    0
product_name        0
quantity            0
unit_price          0
revenue             0
rating              0
returned            0
What just happened?
Perfect! Zero missing values means clean data. The object types are text columns we need to convert. The returned column is already boolean (True/False) which works for classification. Try this: Use df['city'].unique() to see all city names.
# Convert categorical variables to numbers using LabelEncoder
le = LabelEncoder()
# Convert gender: Male/Female to 0/1
df['gender_encoded'] = le.fit_transform(df['gender'])
print("Gender encoding:", dict(zip(le.classes_, le.transform(le.classes_))))
Gender encoding: {'Female': 0, 'Male': 1}

# Convert city names to numbers
df['city_encoded'] = le.fit_transform(df['city'])
print("City encoding:", dict(zip(le.classes_, le.transform(le.classes_))))
# Convert product categories to numbers
df['category_encoded'] = le.fit_transform(df['product_category'])
print("Category encoding:", dict(zip(le.classes_, le.transform(le.classes_))))
City encoding: {'Bangalore': 0, 'Chennai': 1, 'Delhi': 2, 'Mumbai': 3, 'Pune': 4}
Category encoding: {'Books': 0, 'Clothing': 1, 'Electronics': 2, 'Food': 3, 'Home': 4}
What just happened?
We converted text to numbers systematically. Mumbai became 3, Electronics became 2. Each unique text value gets a unique number. The model can now process these as mathematical inputs. Try this: Check the new encoded columns with df[['gender', 'gender_encoded']].head()
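One caveat worth knowing: LabelEncoder assigns arbitrary integers, so a linear model may treat "Pune (4) > Mumbai (3)" as a real ordering that doesn't exist. Tree-based models shrug this off, but for Logistic Regression one-hot encoding is usually the safer choice. A minimal sketch using pandas' get_dummies on a stand-in city column:

```python
import pandas as pd

# Stand-in for the dataset's city column
cities = pd.DataFrame({'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Pune']})

# One-hot encoding: one 0/1 column per city, no implied ordering
dummies = pd.get_dummies(cities['city'], prefix='city')
print(dummies.columns.tolist())
```

The trade-off is more columns (one per category), which matters for high-cardinality features like product_name.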
Building Your First Classification Model
Time to train the model. We'll use Logistic Regression — despite the name, it's for classification, not regression. Think of it as drawing a curved line that separates "returned" from "not returned" customers.
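Under the hood, that curve comes from the sigmoid function, which squeezes any weighted sum of features into a probability between 0 and 1. A minimal sketch:

```python
import numpy as np

# The sigmoid maps any real number z to a probability in (0, 1)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))   # 0.5: right on the decision boundary
print(sigmoid(4))   # near 1: confident "returned"
print(sigmoid(-4))  # near 0: confident "not returned"
```

Training is just finding the feature weights that feed this function well.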
# Select features (input variables) for the model
features = ['customer_age', 'gender_encoded', 'city_encoded',
'category_encoded', 'quantity', 'unit_price', 'rating']
# Create feature matrix X and target vector y
X = df[features]
y = df['returned']
print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)
Feature matrix shape: (5, 7)
Target vector shape: (5,)
# Split data into training and testing sets
# Note: stratify=y would normally keep the True/False proportions similar
# across the splits, but a 1-sample test set cannot contain both classes,
# so sklearn would raise an error here — we omit it for this tiny demo
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Training set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])
print("Training labels distribution:")
print(y_train.value_counts())

Training set size: 4
Test set size: 1
Training labels distribution:
False    2
True     2
Name: returned, dtype: int64

What just happened?
We split our 5 rows into 4 for training and 1 for testing. On real data you would pass stratify=y so class proportions stay balanced across splits; here that would fail, because a single test sample cannot contain both classes. With only 5 rows, this is a demo — real datasets need thousands of examples. Try this: change random_state and rerun to see how the split changes.
# Import and train the Logistic Regression model
from sklearn.linear_model import LogisticRegression
# Create the classifier
model = LogisticRegression(random_state=42)
# Train it on our training data
model.fit(X_train, y_train)
print("Model trained successfully!")
print("Model coefficients shape:", model.coef_.shape)
Model trained successfully!
Model coefficients shape: (1, 7)
What just happened?
Our model learned 7 coefficients (one per feature) that determine how much each factor influences return probability. The fit() method found the optimal weights by analyzing patterns in training data. Try this: Print model.coef_ to see the actual learned weights for each feature.
Making Predictions and Measuring Performance
# Make predictions on test data
y_pred = model.predict(X_test)
# Get prediction probabilities (how confident is the model)
y_prob = model.predict_proba(X_test)
print("Actual vs Predicted:")
print(f"Actual: {y_test.values}")
print(f"Predicted: {y_pred}")
print(f"Probabilities: {y_prob}")
Actual vs Predicted:
Actual: [False]
Predicted: [False]
Probabilities: [[0.73 0.27]]
# Calculate accuracy and other metrics
from sklearn.metrics import accuracy_score, classification_report
# Calculate accuracy percentage
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
# Get detailed performance report
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred))
Accuracy: 100.00%
Detailed Classification Report:
              precision    recall  f1-score   support

       False       1.00      1.00      1.00         1
        True       0.00      0.00      0.00         0

    accuracy                           1.00         1
   macro avg       0.50      0.50      0.50         1
weighted avg       1.00      1.00      1.00         1

What just happened?
Perfect accuracy on 1 test sample isn't meaningful — it's like getting 100% on a 1-question quiz. The probabilities show 73% confidence for "No Return" vs 27% for "Return". With more data, we'd see precision (accuracy of positive predictions) and recall (percentage of positives found). Try this: Test on training data with model.predict(X_train) for comparison.
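Precision and recall are easiest to grasp by computing them on a batch big enough to contain mistakes. A minimal sketch with hypothetical labels for 10 orders:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical results for 10 orders (True = returned)
y_true = [True, True, True, False, False, False, False, False, True, False]
y_pred = [True, False, True, False, False, False, True, False, True, False]

# Precision: of the orders flagged as returns, how many really were returns?
precision = precision_score(y_true, y_pred)
# Recall: of the actual returns, how many did the model catch?
recall = recall_score(y_true, y_pred)
print(precision, recall)
```

Here 3 of the 4 flagged orders were real returns (precision 0.75), and 3 of the 4 real returns were caught (recall 0.75).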
Chart: Electronics and Home products show higher return rates, indicating quality or expectation issues.
Electronics leads returns at 28% — likely due to technical defects or buyer remorse on expensive items. Food has the lowest return rate at 8%, which makes sense since consumables can't easily be returned. This insight helps Flipkart focus quality control efforts on high-return categories.
Business teams can now allocate quality assurance budgets proportionally — spend more on electronics inspection, less on food verification. The model quantifies what product managers suspected but couldn't prove.
Understanding Different Classification Algorithms
| Algorithm | Best For | Speed | Accuracy |
|---|---|---|---|
| Logistic Regression | Linear relationships, interpretable results | Fast | Good |
| Decision Tree | Rule-based decisions, easy to explain | Fast | Moderate |
| Random Forest | Complex patterns, robust predictions | Medium | High |
| SVM | High-dimensional data, text classification | Slow | High |
| Neural Networks | Complex non-linear patterns, large datasets | Very Slow | Very High |
📊 Data Insight
Random Forest is a very common first choice — and a frequent leaderboard topper — for tabular-data problems, because it handles mixed data types well, needs little tuning, and is relatively resistant to overfitting. Start here for business problems with tabular data.
# Compare multiple algorithms quickly
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
# Create different classifiers
models = {
'Logistic Regression': LogisticRegression(random_state=42),
'Decision Tree': DecisionTreeClassifier(random_state=42),
'Random Forest': RandomForestClassifier(random_state=42)
}
# Models initialized successfully - ready for training comparison
# Train each model and compare accuracies
results = {}
for name, model in models.items():
# Train the model
model.fit(X_train, y_train)
# Test accuracy on training data (since test set is tiny)
accuracy = model.score(X_train, y_train)
results[name] = accuracy
print(f"{name}: {accuracy:.2%}")
Logistic Regression: 100.00%
Decision Tree: 100.00%
Random Forest: 100.00%
What just happened?
All models achieved 100% accuracy on our tiny dataset — they memorized the 4 training examples. Real evaluation needs more data and techniques like cross-validation to prevent overfitting. The .score() method returns accuracy by default. Try this: Use larger datasets to see meaningful differences between algorithms.
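Cross-validation can be sketched with a synthetic dataset large enough for the scores to mean something (make_classification is a stand-in for a real returns table, not the 5-row sample above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 1000 rows, 7 features, 2 classes
X, y = make_classification(n_samples=1000, n_features=7, random_state=42)

# 5-fold CV: train on 4 folds, score on the held-out 5th, rotate 5 times
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print(scores.mean())
```

Because every row is held out exactly once, the mean score is a far more honest estimate than accuracy on training data.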
Chart: Returns cluster around lower ratings and specific age groups — useful features for prediction models.
The scatter plot reveals that returned orders (red dots) cluster around lower ratings. Customers aged 28-39 show mixed behavior, while younger and older customers tend to keep their purchases. This suggests age-based personalization strategies could reduce returns.
Why does this pattern emerge? Younger customers might be more impulsive buyers who regret purchases. Mid-age customers are pickier about quality. Older customers research thoroughly before buying, leading to fewer returns.
Feature Importance and Model Interpretation
# Get feature importance from Random Forest model
rf_model = models['Random Forest']
# Extract feature importance scores
importance_scores = rf_model.feature_importances_
feature_names = ['Age', 'Gender', 'City', 'Category', 'Quantity', 'Price', 'Rating']
# Create a DataFrame for better display
import pandas as pd
importance_df = pd.DataFrame({
'Feature': feature_names,
'Importance': importance_scores
}).sort_values('Importance', ascending=False)
print(importance_df)
    Feature  Importance
6    Rating    0.486432
0       Age    0.289764
4  Quantity    0.112847
5     Price    0.067832
3  Category    0.043125
1    Gender    0.000000
2      City    0.000000
What just happened?
Rating dominates with 48.6% importance — customers return products they rate poorly. Age follows at 28.9%. Gender and City contribute 0%, meaning they don't help predict returns in our sample. These scores guide feature selection and business focus. Try this: Compare importance scores across different algorithms to see which features consistently matter.
Chart: Product rating accounts for nearly half the prediction power — focus quality improvements here first.
The doughnut chart reveals that rating alone drives 48.6% of return predictions. Combined with customer age, these two features explain nearly 80% of the model's decision-making. This concentration means quality control and age-targeted marketing could dramatically reduce returns.
Smart businesses focus on the vital few rather than the trivial many. Instead of improving all features equally, Flipkart should prioritize product quality (rating) and customer segmentation (age). The remaining features add marginal value.
Common Classification Mistake
Using accuracy alone to evaluate models with imbalanced classes. If 95% of transactions are legitimate and 5% fraudulent, a model that predicts "legitimate" for everything achieves 95% accuracy but catches zero fraud. Use precision, recall, and F1-score for balanced evaluation. Fix: Always check class distribution first with y.value_counts().
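The trap is easy to demonstrate. A minimal sketch with 95 legitimate and 5 fraudulent transactions, and a lazy "model" that predicts legitimate for everything:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 95 legitimate (0) and 5 fraudulent (1) transactions
y_true = np.array([0] * 95 + [1] * 5)
# A "model" that always predicts legitimate
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)  # looks great: 95%
rec = recall_score(y_true, y_pred)    # catches zero fraud: 0%
print(acc, rec)
```

Accuracy rewards the lazy model; recall exposes it immediately.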
Quiz
1. You're building a product return prediction model for an e-commerce company. After training a Random Forest classifier, you find the feature importance scores: Rating (48.6%), Age (29.0%), Quantity (11.3%), Price (6.8%). What should be your primary business recommendation?
2. Your fraud detection model achieves 95% accuracy on a dataset where 97% of transactions are legitimate and 3% are fraudulent. The model predicts "legitimate" for almost all transactions. What's the main problem with using accuracy as your evaluation metric?
3. You're preparing an e-commerce dataset for classification. Your data includes categorical columns like 'city' (Mumbai, Delhi, Bangalore), 'gender' (Male, Female), and 'product_category' (Electronics, Clothing, Books). What must you do before training a machine learning model?
Up Next
Clustering
Discover hidden customer segments and product groups without labeled data using unsupervised machine learning techniques.