Feature Engineering Course
Binning & Discretization
Sometimes a number's exact value matters less than which bucket it falls into. Binning turns continuous features into meaningful categories — and in the right situations, that simplification actually makes your model stronger.
Binning (also called discretization) is the process of grouping continuous numerical values into discrete intervals or categories. Instead of feeding a model raw ages like 23, 24, 25, 26 — you might create bins like "18–25", "26–35", "36–50", "51+". This can reduce the impact of noise, expose non-linear relationships, and produce features that are easier for both models and humans to interpret.
When Discretization Actually Helps
Binning isn't always the right move. Used blindly, you lose information. Used strategically, you can encode real-world domain knowledge directly into your features. Here are the situations where it genuinely pays off:
Non-linear relationships
If the effect of a feature jumps at certain thresholds (e.g., credit risk spikes above age 65 or below age 25), binning captures those breakpoints that a linear model would otherwise miss entirely.
Noisy continuous measurements
When sensor data or user-reported values have measurement noise, grouping nearby values together smooths out irrelevant variation and focuses the model on meaningful differences.
Embedding domain expertise
A doctor knows BMI categories (underweight, normal, overweight, obese) are clinically meaningful. A bank knows credit score bands map to real risk tiers. Binning encodes that knowledge directly.
Handling outliers gracefully
An "80+" age bin contains everyone from 80 to 103 — a single outlier at 103 doesn't distort anything. The bin absorbs extreme values without removing them from the dataset.
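To make the non-linear point concrete, here is a small synthetic sketch (the threshold ages, noise level, and seed are invented for illustration): a U-shaped risk curve barely correlates with raw age, yet the per-bin means expose the pattern immediately.

```python
import numpy as np
import pandas as pd

# Synthetic illustration: risk is high below 35 and above 65 (U-shaped),
# so a straight line through raw age almost cancels out
rng = np.random.default_rng(0)
age = rng.integers(18, 80, size=500)
risk = ((age < 35) | (age > 65)).astype(float) + rng.normal(0, 0.1, size=500)

df = pd.DataFrame({'age': age, 'risk': risk})
print("correlation with raw age:", round(df['age'].corr(df['risk']), 2))

# Binning at the thresholds recovers the relationship
df['age_band'] = pd.cut(df['age'], bins=[17, 35, 65, 80])
print(df.groupby('age_band', observed=True)['risk'].mean().round(2))
```

The outer bands show a mean risk near 1.0 while the middle band sits near 0.0, a breakpoint structure the raw correlation hides.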
Equal-Width Binning with pandas cut()
The scenario: You're a data analyst at a health insurance company building a premium prediction model. One of your features is age — a continuous value from 18 to 75 across your customer base. Your actuary tells you that risk increases meaningfully at certain age thresholds, and the underwriting team already uses four age brackets internally. Your job is to create an age_group feature that mirrors these brackets so the model picks up the same signal the underwriters already know about.
# Import pandas for DataFrame operations and binning
import pandas as pd
# Health insurance customer data with age as a continuous feature
insurance_df = pd.DataFrame({
'customer_id': ['C01', 'C02', 'C03', 'C04', 'C05',
'C06', 'C07', 'C08', 'C09', 'C10'],
'age': [22, 34, 45, 58, 29, 63, 19, 41, 72, 51],
'annual_premium': [1800, 2400, 3100, 4200, 2100,
5400, 1600, 2900, 6800, 3700]
})
# pd.cut() creates equal-width bins or custom bins based on edges you provide
# bins= defines the boundary points — here we use the actuary's four brackets
# labels= assigns a human-readable string to each resulting bin
# include_lowest=True closes the first interval on the left ([17, 30]), so a value equal to the lowest edge would still be binned
insurance_df['age_group'] = pd.cut(
insurance_df['age'],
bins=[17, 30, 45, 60, 80],
labels=['18-30', '31-45', '46-60', '61-80'],
include_lowest=True
)
# Print the raw age alongside the new binned feature
print(insurance_df[['customer_id', 'age', 'age_group', 'annual_premium']].to_string(index=False))
# Count how many customers fall into each age group
print("\nCustomers per age group:")
print(insurance_df['age_group'].value_counts().sort_index())
customer_id age age_group annual_premium
C01 22 18-30 1800
C02 34 31-45 2400
C03 45 31-45 3100
C04 58 46-60 4200
C05 29 18-30 2100
C06 63 61-80 5400
C07 19 18-30 1600
C08 41 31-45 2900
C09 72 61-80 6800
C10 51 46-60 3700
Customers per age group:
age_group
18-30 3
31-45 3
46-60 2
61-80 2
Name: count, dtype: int64
What just happened?
pd.cut() read each customer's age, found which of the four boundary ranges it fell into, and assigned the matching label. The second print shows the count per group — three customers in 18-30, three in 31-45, and two each in the older bands. The new age_group column is a categorical type that the model can now encode and learn from.
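One detail worth internalizing from that output: pd.cut() builds right-closed intervals by default, which is why the 45-year-old (C03) landed in "31-45" rather than "46-60". A quick sketch of the edge behavior, using values that sit exactly on the bin boundaries:

```python
import pandas as pd

# Values sitting exactly on the bin edges
ages = pd.Series([17, 18, 30, 31, 45, 80])
groups = pd.cut(ages, bins=[17, 30, 45, 60, 80],
                labels=['18-30', '31-45', '46-60', '61-80'],
                include_lowest=True)
# Each edge value falls in the bin where it is the RIGHT endpoint;
# pass right=False instead if you want left-closed intervals
print(list(groups.astype(str)))
```

So 30 counts as "18-30" and 45 as "31-45"; if the underwriting brackets are defined the other way around, flip the behavior with right=False.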
Equal-Frequency Binning with pandas qcut()
The scenario: You're building a churn prediction model at a SaaS company. One feature is monthly_spend — how much each customer pays per month. The distribution is heavily right-skewed: most customers are on the $49 plan, but a small number of enterprise clients pay thousands. If you use equal-width bins, the top two tiers might contain only one customer each. Your product manager suggests quartiles instead: four equally populated groups, from the bottom 25% of spenders to the top 25%, so every bin is statistically meaningful.
# Import pandas
import pandas as pd
# SaaS customer data with skewed monthly spend distribution
churn_df = pd.DataFrame({
'customer_id': ['S01', 'S02', 'S03', 'S04', 'S05',
'S06', 'S07', 'S08', 'S09', 'S10',
'S11', 'S12'],
'monthly_spend': [49, 49, 99, 49, 199,
49, 499, 99, 49, 1200,
99, 4800],
'churned': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
})
# pd.qcut() bins by quantile — each bin gets an equal share of the data
# q=4 creates four quartile bins: Q1 (bottom 25%), Q2, Q3, Q4 (top 25%)
# labels= assigns meaningful tier names instead of numeric quantile ranges
# With five customers tied at $49, the raw values cannot form four distinct
# quantile bins, so we rank first: method='first' breaks ties by row order,
# giving twelve unique ranks that split cleanly into quartiles
churn_df['spend_tier'] = pd.qcut(
churn_df['monthly_spend'].rank(method='first'),
q=4,
labels=['budget', 'standard', 'premium', 'enterprise']
)
# Print the spend alongside its assigned tier
print(churn_df[['customer_id', 'monthly_spend', 'spend_tier', 'churned']].to_string(index=False))
# Check average churn rate per tier — this is how you validate a bin is meaningful
print("\nChurn rate by spend tier:")
print(churn_df.groupby('spend_tier', observed=True)['churned'].mean().round(2))
customer_id monthly_spend spend_tier churned
S01 49 budget 1
S02 49 budget 0
S03 99 standard 1
S04 49 budget 1
S05 199 premium 0
S06 49 standard 1
S07 499 enterprise 0
S08 99 premium 0
S09 49 standard 1
S10 1200 enterprise 0
S11 99 premium 1
S12 4800 enterprise 0
Churn rate by spend tier:
spend_tier
budget        0.67
standard      1.00
premium       0.33
enterprise    0.00
Name: churned, dtype: float64
What just happened?
pd.qcut() ranked the customers by spend and split the ranks into four equally populated tiers of three. Notice the tie handling: five customers all pay $49, so the budget/standard boundary falls among them and two $49 customers land in standard. That is the price of forcing equal-frequency bins onto heavily tied data. The groupby() check shows why the feature is worth it: budget and standard customers churn at 67% and 100%, premium at 33%, enterprise at 0%. That pattern was hidden in the raw spend column; the spend_tier feature makes it explicit and usable.
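Quantile binning has a classic failure mode when values are heavily tied, which is exactly the situation here with five customers on the $49 plan: the 25% quantile collides with the minimum, and duplicates='drop' silently merges the colliding bins, leaving fewer bins than labels. A quick sketch of the collision (ranking the values first, e.g. monthly_spend.rank(method='first'), is a common workaround):

```python
import pandas as pd

spend = pd.Series([49, 49, 99, 49, 199, 49, 499, 99, 49, 1200, 99, 4800])

# retbins=True returns the edges actually used after duplicate edges are dropped
tiers, edges = pd.qcut(spend, q=4, duplicates='drop', retbins=True)
print(len(edges) - 1, "bins survived out of 4 requested")
print(list(edges))
```

Only three bins survive, so passing a four-element labels list here would raise a ValueError. Always check the surviving edges before trusting quantile bins on tied data.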
cut() vs qcut() — Choosing the Right Tool
pd.cut() — Equal Width
Splits the value range into equal-sized intervals. Bin boundaries are fixed by the data range, not by population.
Use when: the boundaries themselves are meaningful (e.g., age brackets, BMI thresholds, tax bands).
Risk: Bin populations can be very unequal when the data is skewed; some bins may end up nearly empty.
pd.qcut() — Equal Frequency
Splits by quantile so each bin has roughly the same number of data points. Boundaries are computed from the data distribution.
Use when: you want balanced bins regardless of the value distribution — e.g., spend tiers, percentile-based rankings.
Risk: Bin boundaries may be non-intuitive or hard to explain to stakeholders.
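To see the population difference directly, here is a small sketch on a synthetic right-skewed sample (the lognormal parameters and seed are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Lognormal: most values small, with a long upper tail
values = pd.Series(rng.lognormal(mean=3, sigma=1, size=1000))

equal_width = pd.cut(values, bins=4)   # same value range per bin
equal_freq = pd.qcut(values, q=4)      # same row count per bin

print("cut  bin counts:", equal_width.value_counts().sort_index().tolist())
print("qcut bin counts:", equal_freq.value_counts().sort_index().tolist())
```

On this kind of skew, cut() piles the vast majority of rows into the first bin while qcut() produces exactly 250 rows per bin.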
scikit-learn KBinsDiscretizer
The scenario: You're preparing a feature engineering pipeline for a mortgage approval model. The model will be retrained monthly on new data and deployed via a REST API. Using pandas cut() is fine in a notebook, but it doesn't fit into a scikit-learn Pipeline object. You need a discretizer that can be fitted, serialized, and applied consistently to new incoming data — enter KBinsDiscretizer.
# Import pandas and numpy
import pandas as pd
import numpy as np
# KBinsDiscretizer fits bin edges from data and transforms into integer or one-hot bins
from sklearn.preprocessing import KBinsDiscretizer
# Mortgage applicant data with income and credit score features
mortgage_df = pd.DataFrame({
'applicant_id': ['M01', 'M02', 'M03', 'M04', 'M05',
'M06', 'M07', 'M08', 'M09', 'M10'],
'annual_income': [42000, 85000, 63000, 120000, 54000,
38000, 210000, 77000, 95000, 49000],
'credit_score': [620, 740, 680, 810, 655,
590, 790, 720, 760, 635]
})
# Select the two features to discretize — must be 2D array for sklearn
features = mortgage_df[['annual_income', 'credit_score']].values
# n_bins=4 creates 4 bins per feature
# strategy='quantile' makes each bin equally populated (same as qcut)
# encode='ordinal' outputs bin indices as integers: 0, 1, 2, 3
kbd = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
# Fit finds the quantile boundaries from training data
kbd.fit(features)
# Transform converts each value to its bin index (0 = lowest, 3 = highest)
binned = kbd.transform(features)
# Store binned features back in the DataFrame as integer columns
mortgage_df['income_bin'] = binned[:, 0].astype(int)
mortgage_df['credit_bin'] = binned[:, 1].astype(int)
# Print the bin edges that were fitted — useful for documentation and auditing
print("Fitted bin edges:")
print(f" annual_income: {[round(e,0) for e in kbd.bin_edges_[0]]}")
print(f" credit_score: {[round(e,1) for e in kbd.bin_edges_[1]]}")
print()
# Print a comparison of raw values and their assigned bin indices
print(mortgage_df[['applicant_id', 'annual_income', 'income_bin', 'credit_score', 'credit_bin']].to_string(index=False))
Fitted bin edges:
annual_income: [38000.0, 50250.0, 70000.0, 92500.0, 210000.0]
credit_score: [590.0, 640.0, 700.0, 755.0, 810.0]
applicant_id annual_income income_bin credit_score credit_bin
M01 42000 0 620 0
M02 85000 2 740 2
M03 63000 1 680 1
M04 120000 3 810 3
M05 54000 1 655 1
M06 38000 0 590 0
M07 210000 3 790 3
M08 77000 2 720 2
M09 95000 3 760 3
M10 49000 0 635 0
What just happened?
KBinsDiscretizer fitted quantile boundaries from the training data — printed as the bin edges — then assigned each applicant an integer bin index from 0 (lowest quartile) to 3 (highest). Both income and credit score now have a simple ordinal feature the model can use. Because this is a proper sklearn transformer, these exact bin edges can be saved and re-applied to new applicants at inference time.
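Here is a sketch of how the scenario's production angle might look: the discretizer fitted inside a scikit-learn Pipeline, so the learned edges travel with the model. The synthetic data, labels, and suggested file name are illustrative stand-ins, not part of the lesson's dataset.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for (annual_income, credit_score) and an approval label
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(30_000, 200_000, 200),
                     rng.uniform(580, 820, 200)])
y = (X[:, 1] > 700).astype(int)

pipe = Pipeline([
    ('binner', KBinsDiscretizer(n_bins=4, encode='onehot-dense',
                                strategy='quantile')),
    ('model', LogisticRegression()),
])
pipe.fit(X, y)  # learns the bin edges and the coefficients in one step

# The fitted edges are stored on the step and reused verbatim at inference;
# joblib.dump(pipe, 'mortgage_pipeline.joblib') would serialize both together
print(pipe.named_steps['binner'].bin_edges_[0].round(0))
```

Because binning and modeling live in one object, a monthly retrain refits both consistently, and the REST API only ever loads a single artifact.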
Binning Strategy Comparison
| Strategy | Tool | Bin Sizes | Best For |
|---|---|---|---|
| Equal width | pd.cut() / uniform | Same value range | Domain-defined thresholds |
| Equal frequency | pd.qcut() / quantile | Same row count | Skewed distributions, ranking |
| K-Means | KBinsDiscretizer(kmeans) | Cluster-based | Data with natural groupings |
| Custom | pd.cut(bins=[...]) | Manually defined | Industry-standard categories |
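The K-Means row deserves a quick illustration: strategy='kmeans' lets the bin edges settle into natural gaps in the data rather than fixed widths or quantiles. The toy values below are invented to show the effect.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Two obvious clusters around 1 and 10 — kmeans puts the edge in the gap
values = np.array([[0.9], [1.0], [1.1], [1.2], [9.8], [9.9], [10.0], [10.2]])

kbd = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='kmeans')
binned = kbd.fit_transform(values)
print(kbd.bin_edges_[0])   # the interior edge lands between the clusters
print(binned.ravel())
```

Neither an equal-width nor a quantile split is guaranteed to respect that gap; kmeans finds it because the edge is placed midway between the two cluster centers.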
The Rule of Thumb on Bin Count
More bins preserve more information but increase model complexity and risk overfitting. Start with 4–5 bins and validate using a downstream metric like churn rate or target mean per bin. If adjacent bins show similar target rates, merge them.
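A sketch of that validation loop on invented data: compute the target mean per bin, then look for adjacent bins with near-identical rates.

```python
import pandas as pd

# Hypothetical check: churn rate per age band
df = pd.DataFrame({
    'age':     [22, 27, 33, 38, 44, 52, 58, 64, 71, 76],
    'churned': [1,  1,  1,  1,  0,  0,  0,  0,  0,  0],
})
df['band'] = pd.cut(df['age'], bins=[18, 30, 40, 50, 60, 80])
rates = df.groupby('band', observed=True)['churned'].mean()
print(rates)
# (18,30] and (30,40] churn identically, as do the top three bands,
# so the five bins carry no more signal than two: merge them
```

Here the per-bin rates collapse into just two distinct levels, a strong hint that two wider bands would do the same job with less complexity.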
Ordinal vs One-Hot Encoding of Bins
If your bins have a natural order (low, medium, high), ordinal encoding is fine. If the bins represent fundamentally different categories with no inherent rank, one-hot encode them to prevent the model from treating the integer labels as a continuous scale.
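For the one-hot case, pd.get_dummies() is the quickest route from a single binned column to per-bin indicator columns. The data below reuses the lesson's age brackets on a few toy rows.

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 34, 45, 58, 72]})
df['age_group'] = pd.cut(df['age'], bins=[17, 30, 45, 60, 80],
                         labels=['18-30', '31-45', '46-60', '61-80'])

# One indicator column per bin; all four appear even if a bin is empty,
# because the categorical dtype remembers every category
one_hot = pd.get_dummies(df['age_group'], prefix='age')
print(one_hot)
```

The model now sees four independent indicators instead of one integer scale, so no spurious ordering is imposed on the groups.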
Teacher's Note
Binning is a one-way door. Once you discretize a feature, the model can no longer see the continuous variation within each bin. A 34-year-old and a 44-year-old both become "31-45" — the ten-year difference disappears. This is sometimes exactly what you want, and sometimes a meaningful loss of signal. Always compare model performance with and without binning before committing to it. The best practice is to create both the raw and binned versions as separate features and let feature selection or the model itself decide which version is more useful.
Practice Questions
1. Which pandas function creates bins where each bin contains roughly the same number of data points?
2. What argument should you pass to pd.cut() to ensure the minimum value in the data is included in the first bin?
3. What encode value should you use in KBinsDiscretizer to output bin indices as integers?
Quiz
1. What is the key difference between pd.cut() and pd.qcut()?
2. What is the main advantage of using KBinsDiscretizer over pd.cut() in a production pipeline?
3. What is the main trade-off when discretizing a continuous feature?
Up Next · Lesson 12
Feature Scaling
Min-max, standard scaling, robust scaling — learn which method protects your model from the wrong kinds of dominance.