Data Science Lesson 48 – Classification | Dataplexa
Machine Learning · Lesson 48

Classification

Build machine learning models that predict categories, such as whether a customer will churn, whether a transaction is fraudulent, or which products to recommend, using real e-commerce data.

1. Prepare Data
2. Choose Algorithm
3. Train Model
4. Make Predictions

What Makes Classification Different

Think of regression as predicting a number — like house price or temperature. Classification predicts categories instead. Will this customer buy? Yes or No. Which product category will they prefer? Electronics, Clothing, or Books.

The math works differently too. Regression finds the best line through data points. Classification draws boundaries between groups. A customer aged 25 with high income might fall into the "Premium Buyer" zone, while a 45-year-old with moderate income lands in "Value Seeker" territory.
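To make the boundary idea concrete, here is a minimal sketch (with made-up customers, not this lesson's dataset) where a classifier separates two groups by age and income:

```python
# Toy sketch: a classifier draws a boundary between two customer groups.
# The data below is hypothetical, chosen only to illustrate the idea.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: [age, income_lakhs] -> 1 = Premium Buyer, 0 = Value Seeker
X = np.array([[25, 18], [28, 20], [23, 16], [45, 8], [50, 7], [42, 9]])
y = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression().fit(X, y)

# The fitted model assigns any new customer to one side of the boundary
print(clf.predict([[26, 19]]))  # young, high income
print(clf.predict([[48, 8]]))   # older, moderate income
```

The model never outputs "somewhere in between": every point in the age/income plane falls on one side of the learned boundary.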

Binary Classification

Two outcomes: Buy/Don't Buy, Spam/Not Spam, Fraud/Legitimate

Multi-class

Multiple categories: Product types, Customer segments, Risk levels

Multi-label

Multiple true labels: Customer interests, Product tags

Imbalanced

Unequal classes: 99% normal, 1% fraud transactions
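Checking the class balance is the first step for any of these variants, and it decides which metrics you can trust later. A minimal sketch with made-up transaction labels:

```python
import pandas as pd

# Made-up transaction labels: heavily imbalanced, like real fraud data
labels = pd.Series(['normal'] * 99 + ['fraud'] * 1)

# value_counts(normalize=True) shows each class's share of the data
print(labels.value_counts(normalize=True))
# normal    0.99
# fraud     0.01
```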

Preparing Data for Classification

The scenario: Flipkart's analytics team needs to predict which customers will return products. The business loses ₹50 crore annually on returns, and the CEO wants a model by Friday.

# Load the essential libraries for classification
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Read the e-commerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
# Display first few rows to understand structure
print(df.head())

What just happened?

We loaded 5 sample orders from different cities. Notice the returned column has True/False values — that's our target variable. Orders 1001 and 1005 were returned. Try this: Check what percentage of orders get returned using df['returned'].value_counts(normalize=True)

Now we need to convert text categories into numbers. Machine learning algorithms only understand numbers, not words like "Mumbai" or "Electronics".

# Check data types and missing values first
print("Data types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())

What just happened?

Perfect! Zero missing values means clean data. The object types are text columns we need to convert. The returned column is already boolean (True/False) which works for classification. Try this: Use df['city'].unique() to see all city names.

# Convert categorical variables to numbers using LabelEncoder
# (we reuse one encoder and refit it per column; keep a separate encoder
#  per column if you need inverse_transform to recover the labels later)
le = LabelEncoder()

# Convert gender: Male/Female to 0/1
df['gender_encoded'] = le.fit_transform(df['gender'])
print("Gender encoding:", dict(zip(le.classes_, le.transform(le.classes_))))
# Convert city names to numbers
df['city_encoded'] = le.fit_transform(df['city'])
print("City encoding:", dict(zip(le.classes_, le.transform(le.classes_))))

# Convert product categories to numbers  
df['category_encoded'] = le.fit_transform(df['product_category'])
print("Category encoding:", dict(zip(le.classes_, le.transform(le.classes_))))

What just happened?

We converted text to numbers systematically. Mumbai became 3, Electronics became 2. Each unique text value gets a unique number. The model can now process these as mathematical inputs. Try this: Check the new encoded columns with df[['gender', 'gender_encoded']].head()
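One caveat: LabelEncoder gives categories an arbitrary numeric order, and nothing makes one city "greater" than another. Tree-based models tolerate this, but for linear models one-hot encoding is often safer. A sketch with a made-up frame, using pandas' get_dummies:

```python
import pandas as pd

# Tiny made-up frame; LabelEncoder would give these cities an arbitrary order
df_demo = pd.DataFrame({'city': ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai']})

# One-hot encoding creates a 0/1 column per city, so no false ordering is implied
encoded = pd.get_dummies(df_demo, columns=['city'])
print(encoded.columns.tolist())
# ['city_Bangalore', 'city_Delhi', 'city_Mumbai']
```

The trade-off is width: a column with hundreds of cities becomes hundreds of new columns, which is why label encoding still gets used with tree models.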

Building Your First Classification Model

Time to train the model. We'll use Logistic Regression — despite the name, it's for classification, not regression. Think of it as drawing a curved line that separates "returned" from "not returned" customers.

# Select features (input variables) for the model
features = ['customer_age', 'gender_encoded', 'city_encoded', 
           'category_encoded', 'quantity', 'unit_price', 'rating']

# Create feature matrix X and target vector y
X = df[features]
y = df['returned']
print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)
# Split data into training and testing sets
# (stratify=y would keep class proportions in both splits, but scikit-learn
#  raises an error when the test split has fewer rows than there are
#  classes, as it would here with a 1-row test set)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])
print("Training labels distribution:")
print(y_train.value_counts())

What just happened?

We split our 5 rows into 4 for training and 1 for testing. On a real dataset you would pass stratify=y so that True and False returns appear in both splits in similar proportions; our 1-row test set is too small for that. With only 5 rows, this is a demo — real datasets need thousands of examples. Try this: On a larger dataset, add stratify=y and compare y_train.value_counts() with and without it.

# Import and train the Logistic Regression model
from sklearn.linear_model import LogisticRegression

# Create the classifier
model = LogisticRegression(random_state=42)
# Train it on our training data
model.fit(X_train, y_train)

print("Model trained successfully!")
print("Model coefficients shape:", model.coef_.shape)

What just happened?

Our model learned 7 coefficients (one per feature) that determine how much each factor influences return probability. The fit() method found the optimal weights by analyzing patterns in training data. Try this: Print model.coef_ to see the actual learned weights for each feature.
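To read those weights alongside their feature names, you can zip the two lists together. A self-contained sketch with a stand-in model and two made-up features (the lesson's own model and feature list would work the same way):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for the lesson's trained model, on made-up data
X = np.array([[25, 1], [32, 5], [41, 2], [29, 4]])
y = np.array([0, 1, 0, 1])
model = LogisticRegression(random_state=42).fit(X, y)

# Pair each feature name with its learned weight; the sign shows the
# direction of influence, the magnitude shows (unscaled) strength
features = ['customer_age', 'quantity']
for name, coef in zip(features, model.coef_[0]):
    print(f"{name}: {coef:+.4f}")
```

Note that comparing coefficient magnitudes across features is only fair after the features are scaled to similar ranges.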

Making Predictions and Measuring Performance

# Make predictions on test data
y_pred = model.predict(X_test)
# Get prediction probabilities (how confident is the model)
y_prob = model.predict_proba(X_test)

print("Actual vs Predicted:")
print(f"Actual: {y_test.values}")
print(f"Predicted: {y_pred}")
print(f"Probabilities: {y_prob}")
# Calculate accuracy and other metrics
from sklearn.metrics import accuracy_score, classification_report

# Calculate accuracy percentage
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")

# Get detailed performance report
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred))

What just happened?

Perfect accuracy on 1 test sample isn't meaningful — it's like getting 100% on a 1-question quiz. The probabilities show 73% confidence for "No Return" vs 27% for "Return". With more data, we'd see precision (accuracy of positive predictions) and recall (percentage of positives found). Try this: Test on training data with model.predict(X_train) for comparison.
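To see precision and recall behave meaningfully, here is a sketch on ten made-up test labels rather than our 1-row test set:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for 10 test orders: True = returned
y_true = [True, True, True, False, False, False, False, False, False, False]
y_pred = [True, True, False, False, False, False, False, True, False, False]

# Precision: of the orders we flagged as returns, how many really were?
print("Precision:", precision_score(y_true, y_pred))  # 2 correct of 3 flagged
# Recall: of the real returns, how many did we catch?
print("Recall:", recall_score(y_true, y_pred))        # 2 caught of 3 real
print(confusion_matrix(y_true, y_pred))
```

Here both scores are 2/3: the model flagged three returns (two right, one wrong) and missed one real return.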

Electronics and Home products show higher return rates, indicating quality or expectation issues

Electronics leads returns at 28% — likely due to technical defects or buyer remorse on expensive items. Food has the lowest return rate at 8%, which makes sense since consumables can't easily be returned. This insight helps Flipkart focus quality control efforts on high-return categories.

Business teams can now allocate quality assurance budgets proportionally — spend more on electronics inspection, less on food verification. The model quantifies what product managers suspected but couldn't prove.

Understanding Different Classification Algorithms

Algorithm | Best For | Speed | Accuracy
Logistic Regression | Linear relationships, interpretable results | Fast | Good
Decision Tree | Rule-based decisions, easy to explain | Fast | Moderate
Random Forest | Complex patterns, robust predictions | Medium | High
SVM | High-dimensional data, text classification | Slow | High
Neural Networks | Complex non-linear patterns, large datasets | Very Slow | Very High

📊 Data Insight

Random Forest is a strong default for business problems with tabular data: it handles mixed data types well and is relatively resistant to overfitting. Start here before trying more complex models.

# Compare multiple algorithms quickly
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Create different classifiers
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42), 
    'Random Forest': RandomForestClassifier(random_state=42)
}
# Train each model and compare accuracies
results = {}
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    # Test accuracy on training data (since test set is tiny)
    accuracy = model.score(X_train, y_train)
    results[name] = accuracy
    print(f"{name}: {accuracy:.2%}")

What just happened?

All models achieved 100% accuracy on our tiny dataset — they memorized the 4 training examples. Real evaluation needs more data and techniques like cross-validation to prevent overfitting. The .score() method returns accuracy by default. Try this: Use larger datasets to see meaningful differences between algorithms.
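Cross-validation, mentioned above, looks like this in scikit-learn. A sketch on a made-up dataset large enough for 5 folds (our 5-row demo is far too small):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Made-up dataset: 200 rows, 4 numeric features, label driven by the
# first two features (purely synthetic, for illustration only)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Each of the 5 folds trains on 80% of the rows and scores the held-out 20%
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print("Fold accuracies:", scores.round(2))
print("Mean accuracy:", scores.mean().round(2))
```

The spread across folds is as informative as the mean: a model that scores 0.95 on one fold and 0.60 on another is unstable, no matter how good its average looks.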

Returns cluster around lower ratings and specific age groups - useful features for prediction models

The scatter plot reveals that returned orders (red dots) cluster around lower ratings. Customers aged 28-39 show mixed behavior, while younger and older customers tend to keep their purchases. This suggests age-based personalization strategies could reduce returns.

Why does this pattern emerge? Younger customers might be more impulsive buyers who regret purchases, middle-aged customers may be pickier about quality, and older customers tend to research thoroughly before buying, leading to fewer returns.

Feature Importance and Model Interpretation

# Get feature importance from Random Forest model
rf_model = models['Random Forest']

# Extract feature importance scores
importance_scores = rf_model.feature_importances_
feature_names = ['Age', 'Gender', 'City', 'Category', 'Quantity', 'Price', 'Rating']

# Create a DataFrame for better display (pandas was imported earlier)
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importance_scores
}).sort_values('Importance', ascending=False)

print(importance_df)

What just happened?

Rating dominates with 48.6% importance — customers return products they rate poorly. Age follows at 28.9%. Gender and City contribute 0%, meaning they don't help predict returns in our sample. These scores guide feature selection and business focus. Try this: Compare importance scores across different algorithms to see which features consistently matter.

Product rating accounts for nearly half the prediction power - focus quality improvements here first

The doughnut chart reveals that rating alone drives 48.6% of return predictions. Combined with customer age, these two features explain nearly 80% of the model's decision-making. This concentration means quality control and age-targeted marketing could dramatically reduce returns.

Smart businesses focus on the vital few rather than the trivial many. Instead of improving all features equally, Flipkart should prioritize product quality (rating) and customer segmentation (age). The remaining features add marginal value.

Common Classification Mistake

Using accuracy alone to evaluate models with imbalanced classes. If 95% of transactions are legitimate and 5% fraudulent, a model that predicts "legitimate" for everything achieves 95% accuracy but catches zero fraud. Use precision, recall, and F1-score for balanced evaluation. Fix: Always check class distribution first with y.value_counts().
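The trap is easy to demonstrate numerically. A sketch with made-up fraud labels and a "model" that predicts legitimate for everything:

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up dataset: 97 legitimate (0) and 3 fraudulent (1) transactions
y_true = [0] * 97 + [1] * 3
y_pred = [0] * 100  # a "model" that predicts legitimate for every transaction

print("Accuracy:", accuracy_score(y_true, y_pred))    # 0.97, looks great
print("Fraud recall:", recall_score(y_true, y_pred))  # 0.0, catches nothing
```

Accuracy rewards the model for the easy majority class; recall on the fraud class exposes that it never catches a single fraudulent transaction.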

Quiz

1. You're building a product return prediction model for an e-commerce company. After training a Random Forest classifier, you find the feature importance scores: Rating (48.6%), Age (29.0%), Quantity (11.3%), Price (6.8%). What should be your primary business recommendation?


2. Your fraud detection model achieves 95% accuracy on a dataset where 97% of transactions are legitimate and 3% are fraudulent. The model predicts "legitimate" for almost all transactions. What's the main problem with using accuracy as your evaluation metric?


3. You're preparing an e-commerce dataset for classification. Your data includes categorical columns like 'city' (Mumbai, Delhi, Bangalore), 'gender' (Male, Female), and 'product_category' (Electronics, Clothing, Books). What must you do before training a machine learning model?


Up Next

Clustering

Discover hidden customer segments and product groups without labeled data using unsupervised machine learning techniques.