Feature Engineering Lesson 11 – Binning & Discretization | Dataplexa
Beginner Level · Lesson 11

Binning & Discretization

Sometimes a number's exact value matters less than which bucket it falls into. Binning turns continuous features into meaningful categories — and in the right situations, that simplification actually makes your model stronger.

Binning (also called discretization) is the process of grouping continuous numerical values into discrete intervals or categories. Instead of feeding a model raw ages like 23, 24, 25, 26 — you might create bins like "18–25", "26–35", "36–50", "51+". This can reduce the impact of noise, expose non-linear relationships, and produce features that are easier for both models and humans to interpret.
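To make that concrete, here is a minimal sketch with made-up ages showing how raw values map onto those example brackets (the bin edges and labels are illustrative, not from any real dataset):

```python
import pandas as pd

# A few made-up ages — the exact values matter less than their brackets
ages = pd.Series([23, 24, 25, 26, 41, 67])

# The right edge of each interval is inclusive, so 25 falls in '18-25'
groups = pd.cut(ages, bins=[17, 25, 35, 50, 120],
                labels=['18-25', '26-35', '36-50', '51+'])
print(groups.tolist())  # → ['18-25', '18-25', '18-25', '26-35', '36-50', '51+']
```

Note how 25 and 26, numerically almost identical, land in different buckets: that boundary is exactly the kind of information binning makes explicit.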

When Discretization Actually Helps

Binning isn't always the right move. Used blindly, you lose information. Used strategically, you can encode real-world domain knowledge directly into your features. Here are the situations where it genuinely pays off:

1. Non-linear relationships

If the effect of a feature jumps at certain thresholds (e.g., credit risk spikes above age 65 or below age 25), binning captures those breakpoints that a linear model would otherwise miss entirely.

2. Noisy continuous measurements

When sensor data or user-reported values have measurement noise, grouping nearby values together smooths out irrelevant variation and focuses the model on meaningful differences.

3. Embedding domain expertise

A doctor knows BMI categories (underweight, normal, overweight, obese) are clinically meaningful. A bank knows credit score bands map to real risk tiers. Binning encodes that knowledge directly.

4. Handling outliers gracefully

An "80+" age bin contains everyone from 80 to 103 — a single outlier at 103 doesn't distort anything. The bin absorbs extreme values without removing them from the dataset.
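As a quick sketch of that last point (made-up ages), an open-ended top bin absorbs the 103 without any special outlier handling:

```python
import pandas as pd
import numpy as np

ages = pd.Series([79, 81, 85, 103])  # 103 is an extreme value

# np.inf as the last edge makes the top bin open-ended: '80+' absorbs any outlier
groups = pd.cut(ages, bins=[0, 80, np.inf], labels=['under 80', '80+'])
print(groups.tolist())  # → ['under 80', '80+', '80+', '80+']
```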

Equal-Width and Custom Binning with pandas cut()

The scenario: You're a data analyst at a health insurance company building a premium prediction model. One of your features is age — a continuous value from 18 to 75 across your customer base. Your actuary tells you that risk increases meaningfully at certain age thresholds, and the underwriting team already uses four age brackets internally. Your job is to create an age_group feature that mirrors these brackets so the model picks up the same signal the underwriters already know about.

# Import pandas for DataFrame operations and binning
import pandas as pd

# Health insurance customer data with age as a continuous feature
insurance_df = pd.DataFrame({
    'customer_id': ['C01', 'C02', 'C03', 'C04', 'C05',
                    'C06', 'C07', 'C08', 'C09', 'C10'],
    'age': [22, 34, 45, 58, 29, 63, 19, 41, 72, 51],
    'annual_premium': [1800, 2400, 3100, 4200, 2100,
                      5400, 1600, 2900, 6800, 3700]
})

# pd.cut() creates equal-width bins (pass an integer) or custom bins based on
# edges you provide — here we use the actuary's four brackets
# bins= defines the boundary points; intervals are right-closed by default,
# so age 30 falls in '18-30' and age 45 in '31-45'
# labels= assigns a human-readable string to each resulting bin
# include_lowest=True ensures a value equal to the first edge (18) lands in the first bin
insurance_df['age_group'] = pd.cut(
    insurance_df['age'],
    bins=[18, 30, 45, 60, 80],
    labels=['18-30', '31-45', '46-60', '61-80'],
    include_lowest=True
)

# Print the raw age alongside the new binned feature
print(insurance_df[['customer_id', 'age', 'age_group', 'annual_premium']].to_string(index=False))

# Count how many customers fall into each age group
print("\nCustomers per age group:")
print(insurance_df['age_group'].value_counts().sort_index())
 customer_id  age age_group  annual_premium
         C01   22     18-30            1800
         C02   34     31-45            2400
         C03   45     31-45            3100
         C04   58     46-60            4200
         C05   29     18-30            2100
         C06   63     61-80            5400
         C07   19     18-30            1600
         C08   41     31-45            2900
         C09   72     61-80            6800
         C10   51     46-60            3700

Customers per age group:
age_group
18-30    3
31-45    3
46-60    2
61-80    2
Name: count, dtype: int64

What just happened?

pd.cut() read each customer's age, found which of the four boundary ranges it fell into, and assigned the matching label. The second print shows the count per group — three customers in 18-30, three in 31-45, and two each in the older bands. The new age_group column is a categorical type that the model can now encode and learn from.
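One natural next step, sketched here with pd.get_dummies() (any categorical encoder would do), is turning the new column into model-ready indicator features. The miniature DataFrame below is illustrative, not the insurance data above:

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 34, 63]})
df['age_group'] = pd.cut(df['age'], bins=[18, 30, 45, 60, 80],
                         labels=['18-30', '31-45', '46-60', '61-80'],
                         include_lowest=True)

# One indicator column per bin — unobserved bins still get a column,
# because the categorical dtype remembers all four categories
encoded = pd.get_dummies(df['age_group'], prefix='age')
print(encoded.columns.tolist())
# → ['age_18-30', 'age_31-45', 'age_46-60', 'age_61-80']
```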

Equal-Frequency Binning with pandas qcut()

The scenario: You're building a churn prediction model at a SaaS company. One feature is monthly_spend — how much each customer pays per month. The distribution is heavily right-skewed: most customers sit on cheap plans under $100, while a handful of enterprise clients pay hundreds or thousands. If you use equal-width bins, the top tiers might contain only one customer each. Your product manager suggests ranking customers into quartiles instead (four groups, each holding roughly 25% of customers) so every bin is equally populated and statistically meaningful.

# Import pandas
import pandas as pd

# SaaS customer data with a right-skewed monthly spend distribution
churn_df = pd.DataFrame({
    'customer_id': ['S01', 'S02', 'S03', 'S04', 'S05',
                    'S06', 'S07', 'S08', 'S09', 'S10',
                    'S11', 'S12'],
    'monthly_spend': [49, 49, 99, 79, 199,
                     49, 499, 149, 89, 1200,
                     99, 4800],
    'churned': [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
})

# pd.qcut() bins by quantile — each bin gets a roughly equal share of the data
# q=4 creates four quartile bins: Q1 (bottom 25%) through Q4 (top 25%)
# labels= assigns meaningful tier names instead of numeric quantile ranges
# duplicates='drop' guards against repeated quantile edges when many rows share
# one value; if edges really are dropped you get fewer bins, and you must then
# pass correspondingly fewer labels or pd.qcut() raises a ValueError
churn_df['spend_tier'] = pd.qcut(
    churn_df['monthly_spend'],
    q=4,
    labels=['budget', 'standard', 'premium', 'enterprise'],
    duplicates='drop'
)

# Print the spend alongside its assigned tier
print(churn_df[['customer_id', 'monthly_spend', 'spend_tier', 'churned']].to_string(index=False))

# Check average churn rate per tier — this is how you validate a bin is meaningful
print("\nChurn rate by spend tier:")
print(churn_df.groupby('spend_tier', observed=True)['churned'].mean().round(2))
 customer_id  monthly_spend spend_tier  churned
         S01             49     budget        1
         S02             49     budget        0
         S03             99   standard        1
         S04             79   standard        1
         S05            199    premium        0
         S06             49     budget        1
         S07            499 enterprise        0
         S08            149    premium        0
         S09             89   standard        1
         S10           1200 enterprise        0
         S11             99   standard        0
         S12           4800 enterprise        0

Churn rate by spend tier:
spend_tier
budget        0.67
standard      0.75
premium       0.00
enterprise    0.00
Name: churned, dtype: float64

What just happened?

pd.qcut() sorted customers by spend and divided them into four roughly equal groups (ties at the quartile boundaries make the counts 3-4-2-3 rather than exactly 3 each). We then used groupby() to calculate the mean churn rate per tier, and the result is striking. Budget and standard customers churn at 67–75%; premium and enterprise customers don't churn at all. That pattern was hidden in the raw spend column; the spend_tier feature makes it explicit and usable.

cut() vs qcut() — Choosing the Right Tool

pd.cut() — Equal Width

Splits the value range into equal-sized intervals. Bin boundaries are fixed by the data range, not by population.

Use when: the boundaries themselves are meaningful (e.g., age brackets, BMI thresholds, tax bands).

Risk: Bins may be very unequal in size if data is skewed.

pd.qcut() — Equal Frequency

Splits by quantile so each bin has roughly the same number of data points. Boundaries are computed from the data distribution.

Use when: you want balanced bins regardless of the value distribution — e.g., spend tiers, percentile-based rankings.

Risk: Bin boundaries may be non-intuitive or hard to explain to stakeholders.
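The contrast is easiest to see side by side. A sketch with made-up skewed values: equal-width piles almost everything into the first interval, while equal-frequency splits the rows evenly.

```python
import pandas as pd

spend = pd.Series([49, 55, 60, 70, 90, 4800])  # one extreme value skews the range

wide = pd.cut(spend, bins=3)   # equal width: each interval spans ~1584 units
deep = pd.qcut(spend, q=3)     # equal frequency: two rows per bin

print(wide.value_counts().sort_index().tolist())  # → [5, 0, 1]
print(deep.value_counts().sort_index().tolist())  # → [2, 2, 2]
```

The middle equal-width bin is empty and the top one holds a single row, exactly the failure mode described above.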

scikit-learn KBinsDiscretizer

The scenario: You're preparing a feature engineering pipeline for a mortgage approval model. The model will be retrained monthly on new data and deployed via a REST API. Using pandas cut() is fine in a notebook, but it doesn't fit into a scikit-learn Pipeline object. You need a discretizer that can be fitted, serialized, and applied consistently to new incoming data — enter KBinsDiscretizer.

# Import pandas and numpy
import pandas as pd
import numpy as np

# KBinsDiscretizer fits bin edges from data and transforms into integer or one-hot bins
from sklearn.preprocessing import KBinsDiscretizer

# Mortgage applicant data with income and credit score features
mortgage_df = pd.DataFrame({
    'applicant_id': ['M01', 'M02', 'M03', 'M04', 'M05',
                     'M06', 'M07', 'M08', 'M09', 'M10'],
    'annual_income': [42000, 85000, 63000, 120000, 54000,
                     38000, 210000, 77000, 95000, 49000],
    'credit_score': [620, 740, 680, 810, 655,
                    590, 790, 720, 760, 635]
})

# Select the two features to discretize — must be 2D array for sklearn
features = mortgage_df[['annual_income', 'credit_score']].values

# n_bins=4 creates 4 bins per feature
# strategy='quantile' makes each bin equally populated (same as qcut)
# encode='ordinal' outputs bin indices as integers: 0, 1, 2, 3
kbd = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')

# Fit finds the quantile boundaries from training data
kbd.fit(features)

# Transform converts each value to its bin index (0 = lowest, 3 = highest)
binned = kbd.transform(features)

# Store binned features back in the DataFrame as integer columns
mortgage_df['income_bin'] = binned[:, 0].astype(int)
mortgage_df['credit_bin'] = binned[:, 1].astype(int)

# Print the bin edges that were fitted — useful for documentation and auditing
# float() converts each numpy scalar so the printed list stays clean
print("Fitted bin edges:")
print(f"  annual_income: {[round(float(e), 0) for e in kbd.bin_edges_[0]]}")
print(f"  credit_score:  {[round(float(e), 1) for e in kbd.bin_edges_[1]]}")
print()

# Print a comparison of raw values and their assigned bin indices
print(mortgage_df[['applicant_id', 'annual_income', 'income_bin', 'credit_score', 'credit_bin']].to_string(index=False))
Fitted bin edges:
  annual_income: [38000.0, 50250.0, 70000.0, 92500.0, 210000.0]
  credit_score:  [590.0, 640.0, 700.0, 755.0, 810.0]

 applicant_id  annual_income  income_bin  credit_score  credit_bin
          M01          42000           0           620           0
          M02          85000           2           740           2
          M03          63000           1           680           1
          M04         120000           3           810           3
          M05          54000           1           655           1
          M06          38000           0           590           0
          M07         210000           3           790           3
          M08          77000           2           720           2
          M09          95000           3           760           3
          M10          49000           0           635           0

What just happened?

KBinsDiscretizer fitted quantile boundaries from the training data — printed as the bin edges — then assigned each applicant an integer bin index from 0 (lowest quartile) to 3 (highest). Both income and credit score now have a simple ordinal feature the model can use. Because this is a proper sklearn transformer, these exact bin edges can be saved and re-applied to new applicants at inference time.
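That reuse step can be sketched like this, with hypothetical new applicants (in a real pipeline, joblib.dump and joblib.load would carry the fitted object between training and serving):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Fit quartile edges on training incomes (same values as the example above)
train_income = np.array([[42000.], [85000.], [63000.], [120000.], [54000.],
                         [38000.], [210000.], [77000.], [95000.], [49000.]])
kbd = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
kbd.fit(train_income)

# At inference time, new applicants are mapped with the SAME fitted edges —
# no refitting, so training and serving stay consistent
new_income = np.array([[40000.], [100000.]])
print(kbd.transform(new_income).ravel())  # → [0. 3.]
```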

Binning Strategy Comparison

Strategy         Tool                                 Bin Sizes         Best For
Equal width      pd.cut() / strategy='uniform'        Same value range  Evenly distributed values
Equal frequency  pd.qcut() / strategy='quantile'      Same row count    Skewed distributions, ranking
K-Means          KBinsDiscretizer(strategy='kmeans')  Cluster-based     Data with natural groupings
Custom           pd.cut(bins=[...])                   Manually defined  Domain or industry-standard categories
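The K-Means row is the only strategy not demonstrated above; a minimal sketch with two obvious value clusters (made-up numbers) shows the idea:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Two natural groupings: values near 50 and values near 500
x = np.array([[48.], [50.], [52.], [490.], [500.], [510.]])

# strategy='kmeans' places bin edges between 1-D cluster centres,
# so bins follow the data's natural groupings rather than fixed widths
kbd = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='kmeans')
print(kbd.fit_transform(x).ravel())  # → [0. 0. 0. 1. 1. 1.]
```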

The Rule of Thumb on Bin Count

More bins preserve more information but increase model complexity and the risk of overfitting. Start with 4–5 bins and validate using a downstream metric such as the target mean per bin (e.g., churn rate). If adjacent bins show similar target rates, merge them.

Ordinal vs One-Hot Encoding of Bins

If your bins have a natural order (low, medium, high), ordinal encoding is fine. If the bins represent fundamentally different categories with no inherent rank, one-hot encode them to prevent the model from treating the integer labels as a continuous scale.
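In KBinsDiscretizer, that choice is the encode parameter; a small sketch with made-up values contrasts the one-hot output with the ordinal codes used earlier:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.array([[10.], [20.], [30.], [40.]])

# encode='onehot-dense' returns one indicator column per bin
# instead of a single column of integer codes 0..k-1
kbd = KBinsDiscretizer(n_bins=2, encode='onehot-dense', strategy='uniform')
print(kbd.fit_transform(x))
# → [[1. 0.]
#    [1. 0.]
#    [0. 1.]
#    [0. 1.]]
```

(encode='onehot', the default, returns the same information as a sparse matrix.)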

Teacher's Note

Binning is a one-way door. Once you discretize a feature, the model can no longer see the continuous variation within each bin. A 34-year-old and a 44-year-old both become "31-45" — the ten-year difference disappears. This is sometimes exactly what you want, and sometimes a meaningful loss of signal. Always compare model performance with and without binning before committing to it. The best practice is to create both the raw and binned versions as separate features and let feature selection or the model itself decide which version is more useful.
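That last recommendation can be sketched in one step, with made-up ages and hypothetical bin edges: keep the raw column and add the binned view beside it, then let feature selection arbitrate.

```python
import pandas as pd

df = pd.DataFrame({'age': [23, 34, 44, 68]})

# Keep the raw column AND add the binned version as a separate feature;
# downstream feature selection can then drop whichever is less useful
df['age_group'] = pd.cut(df['age'], bins=[18, 30, 45, 80],
                         labels=['18-30', '31-45', '46-80'],
                         include_lowest=True)
print(df.columns.tolist())  # → ['age', 'age_group']
```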

Practice Questions

1. Which pandas function creates bins where each bin contains roughly the same number of data points?



2. What argument should you pass to pd.cut() to ensure the minimum value in the data is included in the first bin?



3. What encode value should you use in KBinsDiscretizer to output bin indices as integers?



Quiz

1. What is the key difference between pd.cut() and pd.qcut()?


2. What is the main advantage of using KBinsDiscretizer over pd.cut() in a production pipeline?


3. What is the main trade-off when discretizing a continuous feature?


Up Next · Lesson 12

Feature Scaling

Min-max, standard scaling, robust scaling — learn which method protects your model from the wrong kinds of dominance.