Feature Engineering Lesson 14 – Feature Construction | Dataplexa
Beginner Level · Lesson 14

Feature Construction

The best features in your dataset don't always exist yet. Sometimes you have to build them — from ratios, differences, flags, and combinations of what you already have. That's feature construction, and it's where domain knowledge turns into model performance.

Feature construction (also called feature creation) is the process of engineering entirely new columns by combining, transforming, or extracting information from existing features. Unlike scaling or encoding — which reshape existing data — construction generates signals that were never directly measured but are often the most predictive thing in your entire dataset.

Why Constructed Features Often Outperform Raw Ones

A bank has monthly_income and monthly_debt as separate columns. Each one alone tells part of the story. But the ratio — debt_to_income_ratio — tells the whole story in a single number. A model could theoretically learn this relationship from raw data, but you're making its job dramatically easier — and more reliable on small datasets — by encoding the relationship directly.

1

Ratio features

Divide one feature by another to express a proportion or rate. Debt-to-income, click-through rate, revenue-per-user. Ratios normalise scale and often carry more signal than either raw value alone.

2

Difference features

Subtract one value from another to capture change, gap, or deviation. Price vs list price, actual vs expected, current vs historical. Differences highlight relative performance rather than absolute magnitude.

3

Flag features

Binary 0/1 indicators that capture whether a condition is met. Is the account overdue? Did the customer open the email? Has the property been renovated? Flags encode business rules directly into the feature space.

4

Aggregate features

Group-level statistics: the average spend per customer segment, total transactions per account, maximum price in a product category. Aggregates compress multi-row information into a single per-entity number.

Ratio and Difference Features

The scenario: You're a data scientist at a mortgage lending company. Your model predicts loan default risk. The underwriting team has told you that raw income and raw debt figures matter less than their relationship — how much of someone's income is already spoken for. You're going to construct three new features: debt_to_income_ratio, loan_to_value_ratio, and income_after_debt — all derived from columns already in the dataset.

# Import pandas
import pandas as pd

# Mortgage applicant data with raw financial features
mortgage_df = pd.DataFrame({
    'applicant_id': ['M01', 'M02', 'M03', 'M04', 'M05',
                     'M06', 'M07', 'M08', 'M09', 'M10'],
    'monthly_income': [5200, 8400, 3900, 12000, 6100,
                      4500, 9800, 7200, 3200, 11000],
    'monthly_debt':   [1800, 2100, 2200, 1500, 3100,
                      1200, 2800, 4100, 1900, 1800],
    'loan_amount':    [180000, 320000, 140000, 480000, 210000,
                      160000, 390000, 270000, 120000, 420000],
    'property_value': [220000, 400000, 175000, 600000, 250000,
                      200000, 480000, 310000, 150000, 520000]
})

# Ratio: monthly debt divided by monthly income — the core risk metric in lending
mortgage_df['debt_to_income'] = (mortgage_df['monthly_debt'] /
                                    mortgage_df['monthly_income']).round(3)

# Ratio: loan amount divided by property value — how much of the asset is borrowed
mortgage_df['loan_to_value'] = (mortgage_df['loan_amount'] /
                                   mortgage_df['property_value']).round(3)

# Difference: how much income remains after servicing all existing debt each month
mortgage_df['income_after_debt'] = mortgage_df['monthly_income'] - mortgage_df['monthly_debt']

# Print the constructed features alongside the applicant IDs
print(mortgage_df[['applicant_id', 'monthly_income', 'monthly_debt',
                   'debt_to_income', 'loan_to_value', 'income_after_debt']].to_string(index=False))
 applicant_id  monthly_income  monthly_debt  debt_to_income  loan_to_value  income_after_debt
          M01            5200          1800           0.346          0.818               3400
          M02            8400          2100           0.250          0.800               6300
          M03            3900          2200           0.564          0.800               1700
          M04           12000          1500           0.125          0.800               10500
          M05            6100          3100           0.508          0.840               3000
          M06            4500          1200           0.267          0.800               3300
          M07            9800          2800           0.286          0.812               7000
          M08            7200          4100           0.569          0.871               3100
          M09            3200          1900           0.594          0.800               1300
          M10           11000          1800           0.164          0.808               9200

What just happened?

Three new columns were created entirely from arithmetic on existing ones. M03 and M08 both earn less than M04, but their debt_to_income ratios of 0.564 and 0.569 flag them as high-risk — over half their income already goes to debt. M04 earns the highest income in the table and has a ratio of just 0.125. That risk story was invisible in the raw columns alone.

Flag Features

The scenario: You're working at a subscription box company building a churn prediction model. Your dataset tracks customer behaviour over time. The data team suspects that certain binary conditions — whether a customer has ever contacted support, whether they skipped a delivery, whether their last order was a discounted one — are stronger predictors of churn than any continuous feature. You're going to construct these as explicit binary flag columns.

# Import pandas
import pandas as pd

# Subscription customer behavioural data
sub_df = pd.DataFrame({
    'customer_id':      ['C01', 'C02', 'C03', 'C04', 'C05',
                         'C06', 'C07', 'C08', 'C09', 'C10'],
    'support_contacts':    [0, 3, 0, 1, 0, 5, 0, 2, 0, 1],
    'skipped_deliveries':  [0, 0, 2, 0, 3, 1, 0, 4, 0, 0],
    'last_order_discount': [0, 15, 0, 0, 20, 10, 0, 25, 0, 5],
    'months_subscribed':   [24, 3, 8, 14, 2, 5, 36, 6, 18, 11]
})

# Flag: has the customer ever contacted support? (1 if support_contacts > 0)
sub_df['has_contacted_support'] = (sub_df['support_contacts'] > 0).astype(int)

# Flag: has the customer ever skipped a delivery? (1 if skipped_deliveries > 0)
sub_df['has_skipped_delivery'] = (sub_df['skipped_deliveries'] > 0).astype(int)

# Flag: was the last order discounted at all? (1 if discount > 0)
sub_df['last_order_was_discounted'] = (sub_df['last_order_discount'] > 0).astype(int)

# Flag: is the customer a long-term subscriber? (1 if subscribed > 12 months)
sub_df['is_long_term'] = (sub_df['months_subscribed'] > 12).astype(int)

# Print the raw behavioural columns and the four constructed flags
flag_cols = ['customer_id', 'support_contacts', 'skipped_deliveries',
             'has_contacted_support', 'has_skipped_delivery',
             'last_order_was_discounted', 'is_long_term']
print(sub_df[flag_cols].to_string(index=False))
 customer_id  support_contacts  skipped_deliveries  has_contacted_support  has_skipped_delivery  last_order_was_discounted  is_long_term
         C01                 0                   0                      0                     0                          0             1
         C02                 3                   0                      1                     0                          1             0
         C03                 0                   2                      0                     1                          0             0
         C04                 1                   0                      1                     0                          0             1
         C05                 0                   3                      0                     1                          1             0
         C06                 5                   1                      1                     1                          1             0
         C07                 0                   0                      0                     0                          0             1
         C08                 2                   4                      1                     1                          1             0
         C09                 0                   0                      0                     0                          0             1
         C10                 1                   0                      1                     0                          1             0

What just happened?

The boolean condition inside each parenthesis returns a True/False Series, and .astype(int) converts that to 1s and 0s. C06 now shows the riskiest profile in the table — all three behavioural flags set (5 support contacts, a skipped delivery, a discounted last order) and is_long_term at 0 after just 5 months. That profile is a clean churn signal even before the model sees it.

Aggregate Features from Group Statistics

The scenario: You're a data analyst at a retail chain building a store performance model. Each row in your transaction dataset represents a single sale. Before modelling, you need to construct store-level aggregate features — the average transaction value per store, the total number of transactions, and the maximum single-sale amount. These group-level summaries compress hundreds of rows per store into a handful of powerful features.

# Import pandas
import pandas as pd

# Individual transaction records across three stores
transactions_df = pd.DataFrame({
    'transaction_id': range(1, 13),
    'store_id': ['S01', 'S01', 'S02', 'S01', 'S03', 'S02',
                 'S03', 'S01', 'S02', 'S03', 'S01', 'S02'],
    'sale_amount': [120, 340, 95, 210, 450, 180,
                    320, 155, 270, 490, 88, 310]
})

# groupby + agg computes multiple statistics per store in one call
# 'mean', 'sum', 'count', 'max' are all standard aggregation functions
store_features = transactions_df.groupby('store_id')['sale_amount'].agg(
    avg_transaction='mean',
    total_revenue='sum',
    transaction_count='count',
    max_sale='max'
).reset_index()

# Round avg_transaction for readability
store_features['avg_transaction'] = store_features['avg_transaction'].round(1)

# Merge the store-level features back onto the original transaction rows
# left join so every transaction keeps its row, now enriched with store stats
enriched_df = transactions_df.merge(store_features, on='store_id', how='left')

# Print the store-level aggregate feature table first
print("Store-level aggregate features:")
print(store_features.to_string(index=False))
print()

# Print the first six rows of the enriched transaction table
print("Enriched transaction rows (first 6):")
print(enriched_df.head(6).to_string(index=False))
Store-level aggregate features:
 store_id  avg_transaction  total_revenue  transaction_count  max_sale
      S01            182.6            913                  5       340
      S02            213.8            855                  4       310
      S03            420.0           1260                  3       490

Enriched transaction rows (first 6):
 transaction_id store_id  sale_amount  avg_transaction  total_revenue  transaction_count  max_sale
              1      S01          120            182.6            913                  5       340
              2      S01          340            182.6            913                  5       340
              3      S02           95            213.8            855                  4       310
              4      S01          210            182.6            913                  5       340
              5      S03          450            420.0           1260                  3       490
              6      S02          180            213.8            855                  4       310

What just happened?

groupby().agg() collapsed all transactions per store into a single summary row. Then .merge() broadcast those store-level stats back onto every individual transaction row. Now each transaction knows its store's average sale and total revenue — context a model can use to judge whether this particular sale is above or below the norm for that location.

Combining Construction Techniques

The scenario: You're finalising the feature set for a credit card fraud detection model. The raw data has transaction amount, the customer's average transaction, and whether the merchant category is high-risk. A single constructed feature — amount_vs_avg_ratio — combines ratio logic with the group aggregate you just computed. Then a flag marks whether both the ratio is high and the merchant is risky. These compound features are often the most powerful ones in fraud models.

# Import pandas
import pandas as pd

# Credit card transaction data for fraud detection
fraud_df = pd.DataFrame({
    'txn_id':          ['TX01', 'TX02', 'TX03', 'TX04', 'TX05',
                        'TX06', 'TX07', 'TX08', 'TX09', 'TX10'],
    'amount':             [45, 1800, 62, 33, 2400, 78, 55, 3100, 41, 90],
    'customer_avg_txn':   [60, 55, 70, 40, 65, 80, 50, 72, 45, 88],
    'high_risk_merchant': [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
})

# Ratio: how many times larger is this transaction vs the customer's usual spend?
fraud_df['amount_vs_avg_ratio'] = (fraud_df['amount'] /
                                     fraud_df['customer_avg_txn']).round(2)

# Flag: is this transaction more than 5x the customer's average spend?
fraud_df['is_unusually_large'] = (fraud_df['amount_vs_avg_ratio'] > 5).astype(int)

# Compound flag: large AND at a high-risk merchant — the classic fraud signal
fraud_df['high_risk_flag'] = (
    (fraud_df['is_unusually_large'] == 1) &
    (fraud_df['high_risk_merchant'] == 1)
).astype(int)

# Print the constructed features for all transactions
print(fraud_df[['txn_id', 'amount', 'customer_avg_txn', 'amount_vs_avg_ratio',
                'is_unusually_large', 'high_risk_merchant', 'high_risk_flag']].to_string(index=False))
 txn_id  amount  customer_avg_txn  amount_vs_avg_ratio  is_unusually_large  high_risk_merchant  high_risk_flag
   TX01      45                60                 0.75                   0                   0               0
   TX02    1800                55                32.73                   1                   1               1
   TX03      62                70                 0.89                   0                   0               0
   TX04      33                40                 0.82                   0                   0               0
   TX05    2400                65                36.92                   1                   1               1
   TX06      78                80                 0.97                   0                   0               0
   TX07      55                50                 1.10                   0                   0               0
   TX08    3100                72                43.06                   1                   1               1
   TX09      41                45                 0.91                   0                   0               0
   TX10      90                88                 1.02                   0                   0               0

What just happened?

TX02, TX05, and TX08 were flagged by high_risk_flag. All three have amounts 30–43× the customer's normal spend AND were at high-risk merchants. The & operator combines two boolean conditions — both must be true for the compound flag to fire. TX10 spent slightly above average but at a normal merchant, so it stays clean.

Protect against division by zero

When constructing ratio features, always check whether the denominator can be zero. Use df['col'].replace(0, np.nan) or add a small epsilon before dividing to avoid silent infinities breaking your pipeline.
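
A minimal sketch of both guards, using a hypothetical click-through dataset (the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical click data where the denominator can legitimately be zero
df = pd.DataFrame({'clicks': [5, 0, 12], 'impressions': [100, 0, 400]})

# Guard 1: replace zero denominators with NaN, so the ratio becomes NaN
# instead of inf. NaN honestly says "no rate exists for this row".
df['ctr_nan'] = df['clicks'] / df['impressions'].replace(0, np.nan)

# Guard 2: add a tiny epsilon to the denominator so the division
# always produces a finite number.
eps = 1e-9
df['ctr_eps'] = df['clicks'] / (df['impressions'] + eps)

print(df)
```

The NaN route is usually preferable: an epsilon keeps the pipeline running, but it silently converts "undefined" into a potentially misleading number.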

Construct on train, apply to test

For aggregate features computed from group statistics, compute the aggregates on training data only and join them onto the test set. Computing aggregates on the full dataset before splitting leaks future group statistics into training — the same data leakage rule from Lessons 10 and 12 applies here.
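
A sketch of that workflow, using a hypothetical train/test split of store transactions (store IDs and amounts are illustrative). Note that a group absent from training, S03 here, comes through the join as NaN and needs a fallback such as the global training mean:

```python
import pandas as pd

# Hypothetical train/test split of transaction rows
train_df = pd.DataFrame({
    'store_id':    ['S01', 'S01', 'S02', 'S02'],
    'sale_amount': [120, 340, 95, 180]
})
test_df = pd.DataFrame({
    'store_id':    ['S01', 'S02', 'S03'],   # S03 never appears in train
    'sale_amount': [210, 270, 450]
})

# Compute group statistics on the training rows ONLY
store_stats = train_df.groupby('store_id')['sale_amount'].agg(
    avg_transaction='mean',
    transaction_count='count'
).reset_index()

# Join the train-derived stats onto both splits
train_df = train_df.merge(store_stats, on='store_id', how='left')
test_df = test_df.merge(store_stats, on='store_id', how='left')

# S03 has no training rows, so its stats are NaN after the join.
# Fall back to global statistics computed from training data only.
global_avg = train_df['sale_amount'].mean()
test_df['avg_transaction'] = test_df['avg_transaction'].fillna(global_avg)
test_df['transaction_count'] = test_df['transaction_count'].fillna(0)

print(test_df)
```

The test set never contributes to the statistics it receives — which is exactly the guarantee the leakage rule demands.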

Teacher's Note

Feature construction is where your domain knowledge pays its biggest dividends. A ratio or flag that encodes something a subject-matter expert already knows — like a debt-to-income threshold or a fraud velocity rule — will almost always beat a raw column that a model has to discover the same insight from scratch. Talk to the people who understand the business problem. The best constructed features often come from a ten-minute conversation with an underwriter, a fraud analyst, or a customer success manager rather than from automated feature search tools.

Practice Questions

1. After creating a boolean condition like (df['col'] > 0), what method converts it to a binary 0/1 integer column?



2. Which pandas method is used to compute group-level aggregate statistics like mean or sum per category?



3. To avoid data leakage, group-level aggregate features should be computed on ________ data only.



Quiz

1. Which constructed feature best captures credit risk from income and debt columns?


2. What is a flag feature?


3. What should you do before constructing a ratio feature to avoid silent errors in production?


Up Next · Lesson 15

Feature Engineering Workflow

Bring everything together — a structured end-to-end process for applying FE techniques in the right order on any real dataset.