Feature Engineering Lesson 15 – FE Workflow | Dataplexa
Beginner Level · Lesson 15

Feature Engineering Workflow

You've learned the individual techniques. Now it's time to see how they fit together. A disciplined FE workflow is what separates a notebook experiment from a pipeline that actually ships — and survives contact with real, messy data.

A feature engineering workflow is a structured, repeatable sequence of steps that takes raw data and produces a model-ready feature matrix. Each step has a specific purpose, a correct order relative to the others, and a set of mistakes that silently corrupt everything downstream if you get it wrong. This lesson walks through the full sequence on a single realistic dataset.

The Seven-Step FE Workflow

Order matters here. Steps that come too early can leak information. Steps that come too late can undo earlier work. This sequence is battle-tested across production ML pipelines.

1

Understand the data and the problem

Before writing a single line of code: what is the target variable? What does each feature represent in the real world? Which columns are numerical, categorical, ordinal, datetime? Are there any leakage risks — features that would not be available at prediction time?
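This first look needs no modelling code at all — a minimal audit sketch, with an illustrative frame and column names (not the lesson's dataset):

```python
import pandas as pd

# Minimal first-look audit — frame and column names are illustrative
df = pd.DataFrame({
    'age': [67, 45, 72],
    'admission_type': ['emergency', 'elective', 'emergency'],
    'readmitted': [1, 0, 1],   # target
})

print(df.dtypes)                        # numerical vs categorical at a glance
print(df['readmitted'].value_counts())  # class balance of the target
print(df.isnull().sum())                # missingness per column
```

Three lines of output answer most of the questions above; leakage risks still require reading the data dictionary, not just the dtypes.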

2

Train-test split — first, before anything else

Split immediately after loading. All subsequent steps — imputation, scaling, encoding, transformation — must be fitted on training data only and applied to test data. Everything before this split is safe. Everything after must respect the boundary.

3

Handle missing values

Impute or flag missing values before any transformation or scaling. Imputing after scaling can produce imputed values outside the fitted scale range. Imputing after encoding can produce category conflicts.
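The lesson's dataset below happens to be complete, so this step is a no-op there — but when gaps do exist, a sketch of train-only imputation looks like this (illustrative column names, using sklearn's SimpleImputer):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative frames with a gap in a numerical column (names are made up)
train = pd.DataFrame({'los_days': [4.0, 2.0, np.nan, 8.0]})
test  = pd.DataFrame({'los_days': [np.nan, 5.0]})

# Flag missingness BEFORE imputing overwrites the NaN
train['los_missing'] = train['los_days'].isna().astype(int)
test['los_missing']  = test['los_days'].isna().astype(int)

# Fit on training data only — test gaps are filled with the TRAINING median
imputer = SimpleImputer(strategy='median')
train['los_days'] = imputer.fit_transform(train[['los_days']])
test['los_days']  = imputer.transform(test[['los_days']])

print(imputer.statistics_)   # the training median, used for both sets
```

Note the order inside the step itself: the missingness flag is created before imputation, because afterwards the NaNs are gone.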

4

Construct new features

Build ratios, flags, differences, and aggregates from existing clean columns. Do this before encoding and scaling so constructed features benefit from the same downstream preprocessing as the originals.

5

Encode categorical features

Apply label, ordinal, or one-hot encoding depending on the column type. Fit encoders on training data only. Encoding comes before scaling so that every column entering the scaler is already numeric.

6

Transform and scale numerical features

Apply log or power transformations to correct skew, then scale all numerical features. Fit transformers and scalers on training data only. This is the final numerical preprocessing step before modelling.

7

Validate and review the feature matrix

Check shapes, data types, value ranges, and null counts before handing the matrix to a model. Confirm no original text columns remain, no infinity values exist, and all columns are numeric. A 30-second sanity check here prevents hours of debugging later.

Steps 1–2: Load Data and Split Immediately

The scenario: You're a data scientist at a healthcare analytics firm building a patient readmission risk model. You've just received a dataset of hospital discharge records. Before doing anything else — before even looking at distributions — you're going to split the data. This is non-negotiable. Every transformation you perform from this point on will be fitted on training data only.

# Import all libraries needed for the full workflow
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Hospital discharge dataset — mix of numerical, categorical, and ordinal features
hospital_df = pd.DataFrame({
    'patient_id':      ['P01','P02','P03','P04','P05','P06','P07','P08','P09','P10',
                        'P11','P12','P13','P14','P15','P16'],
    'age':             [67,45,72,38,58,81,49,63,55,70,
                        42,66,53,78,60,47],
    'num_diagnoses':   [3,1,5,2,4,6,2,3,1,4,
                        2,5,3,7,2,1],
    'los_days':        [4,2,8,1,5,12,3,6,2,7,
                        3,9,4,14,3,2],    # length of stay in days
    'admission_type':  ['emergency','elective','emergency','elective','urgent',
                        'emergency','elective','urgent','elective','emergency',
                        'urgent','emergency','elective','emergency','urgent','elective'],
    'risk_level':      ['high','low','high','low','medium','high','low','medium',
                        'low','high','medium','high','low','high','medium','low'],
    'readmitted':      [1,0,1,0,0,1,0,1,0,1,
                        0,1,0,1,0,0]   # target: was patient readmitted within 30 days?
})

# Separate features (X) from target (y) before splitting
X = hospital_df.drop(columns=['patient_id', 'readmitted'])
y = hospital_df['readmitted']

# Split: 75% train, 25% test — random_state ensures reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Confirm shapes — train and test must be separate from here on
print(f"Training set:  {X_train.shape[0]} rows, {X_train.shape[1]} features")
print(f"Test set:      {X_test.shape[0]} rows, {X_test.shape[1]} features")
print(f"\nFeature columns: {X_train.columns.tolist()}")
Training set:  12 rows, 5 features
Test set:       4 rows, 5 features

Feature columns: ['age', 'num_diagnoses', 'los_days', 'admission_type', 'risk_level']

What just happened?

We loaded the data, separated the target column, and immediately called train_test_split(). The random_state=42 ensures the same split happens every time the code runs. From this point forward, any transformer we fit will only see X_train.
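One optional refinement, not used in the run above: with only 16 rows and a binary target, a plain random split can leave the test set class-skewed. Passing stratify=y preserves the target's class proportions in both splits — a variation sketched here on the same values:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same split as above, but stratified so the 0/1 ratio of the target is
# preserved in both sets — useful for small or imbalanced datasets
X = pd.DataFrame({'age': [67,45,72,38,58,81,49,63,55,70,42,66,53,78,60,47]})
y = pd.Series([1,0,1,0,0,1,0,1,0,1,0,1,0,1,0,0])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())   # class proportions stay close to the full set's
```

Note that a stratified split produces a different row assignment than the plain split, so the outputs shown in the rest of this lesson come from the unstratified version.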

Steps 4–5: Construct Features and Encode Categoricals

The scenario: Continuing the hospital dataset — the split is done, and Step 3 (missing-value handling) is a no-op here because this dataset is complete. We construct a new feature capturing how complex this admission was (diagnoses per day of stay), then encode the two categorical columns. The ordinal risk_level gets ordinal encoding. The nominal admission_type gets one-hot encoding. All encoders are fitted on training data only.

# Import encoders
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# --- STEP 4: Feature Construction ---

# Construct: diagnoses per day — captures admission complexity normalised by stay length
# Apply to both train and test using the same arithmetic (no fitting needed here)
X_train = X_train.copy()
X_test  = X_test.copy()

X_train['diagnoses_per_day'] = (X_train['num_diagnoses'] / X_train['los_days']).round(3)
X_test['diagnoses_per_day']  = (X_test['num_diagnoses']  / X_test['los_days']).round(3)

# Construct: flag for long stay — stays over 7 days are clinically significant
X_train['is_long_stay'] = (X_train['los_days'] > 7).astype(int)
X_test['is_long_stay']  = (X_test['los_days']  > 7).astype(int)

# --- STEP 5: Encoding ---

# Ordinal encode risk_level — low < medium < high is a real clinical order
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X_train['risk_encoded'] = oe.fit_transform(X_train[['risk_level']]).astype(int)
X_test['risk_encoded']  = oe.transform(X_test[['risk_level']]).astype(int)

# One-hot encode admission_type — no ordering exists between elective/emergency/urgent
ohe = OneHotEncoder(sparse_output=False, drop='first')
train_ohe = ohe.fit_transform(X_train[['admission_type']])
test_ohe  = ohe.transform(X_test[['admission_type']])

# Convert OHE arrays to DataFrames with proper column names
ohe_cols = ohe.get_feature_names_out(['admission_type'])
train_ohe_df = pd.DataFrame(train_ohe, columns=ohe_cols, index=X_train.index, dtype=int)
test_ohe_df  = pd.DataFrame(test_ohe,  columns=ohe_cols, index=X_test.index,  dtype=int)

# Join OHE columns and drop originals that are no longer needed
X_train = pd.concat([X_train, train_ohe_df], axis=1).drop(columns=['admission_type', 'risk_level'])
X_test  = pd.concat([X_test,  test_ohe_df],  axis=1).drop(columns=['admission_type', 'risk_level'])

# Check the training feature matrix after construction and encoding
print(f"Columns after Steps 4–5: {X_train.columns.tolist()}")
print(f"Shape: {X_train.shape}")
print()
print(X_train.to_string())
Columns after Steps 4–5: ['age', 'num_diagnoses', 'los_days', 'diagnoses_per_day', 'is_long_stay', 'risk_encoded', 'admission_type_emergency', 'admission_type_urgent']
Shape: (12, 8)

    age  num_diagnoses  los_days  diagnoses_per_day  is_long_stay  risk_encoded  admission_type_emergency  admission_type_urgent
      67              3         4              0.750             0             2                         1                      0
      72              5         8              0.625             1             2                         1                      0
      81              6        12              0.500             1             2                         1                      0
      49              2         3              0.667             0             0                         0                      0
      55              1         2              0.500             0             0                         0                      0
      70              4         7              0.571             0             2                         1                      0
      66              5         9              0.556             1             2                         1                      0
      60              2         3              0.667             0             1                         0                      1
      38              2         1              2.000             0             0                         0                      0
      45              1         2              0.500             0             0                         0                      0
      53              3         4              0.750             0             0                         0                      0
      47              1         2              0.500             0             0                         0                      0

What just happened?

We went from 5 raw columns to 8 engineered ones. Two new features were constructed, risk_level was ordinally encoded into a single integer column, and admission_type was one-hot encoded into two binary columns (elective is the dropped baseline). Both encoders were fitted on X_train only and then applied to X_test.

Steps 6–7: Transform, Scale, and Validate

The scenario: The final steps — apply a log transformation to correct skew on los_days, scale all numerical features with StandardScaler, then run a validation check to confirm the feature matrix is clean and ready for modelling.

# Import transformation and scaling tools
from sklearn.preprocessing import StandardScaler
import numpy as np

# --- STEP 6a: Transform skewed numerical features ---

# los_days is right-skewed (most stays short, a few very long) — apply log1p
X_train['log_los_days'] = np.log1p(X_train['los_days'])
X_test['log_los_days']  = np.log1p(X_test['los_days'])

# Drop the raw los_days — the log version replaces it
X_train = X_train.drop(columns=['los_days'])
X_test  = X_test.drop(columns=['los_days'])

# --- STEP 6b: Scale all numerical features ---

# Columns to scale — exclude binary flags and OHE columns (already 0/1)
scale_cols = ['age', 'num_diagnoses', 'diagnoses_per_day', 'log_los_days', 'risk_encoded']

# Fit scaler on training data only
scaler = StandardScaler()
X_train[scale_cols] = scaler.fit_transform(X_train[scale_cols])
X_test[scale_cols]  = scaler.transform(X_test[scale_cols])

# Round for readability
X_train = X_train.round(3)
X_test  = X_test.round(3)

# --- STEP 7: Validation check ---

# Confirm shape, dtypes, null counts, and value ranges look correct
print("=== Final Feature Matrix Validation ===")
print(f"Train shape: {X_train.shape}  |  Test shape: {X_test.shape}")
print(f"Null values: {X_train.isnull().sum().sum()} in train, {X_test.isnull().sum().sum()} in test")
print(f"Inf values:  {np.isinf(X_train.values).sum()} in train")
print()
print("Column summary (train):")
print(X_train.describe().round(2).to_string())
=== Final Feature Matrix Validation ===
Train shape: (12, 8)  |  Test shape: (4, 8)
Null values: 0 in train, 0 in test
Inf values:  0 in train

Column summary (train):
        age  num_diagnoses  diagnoses_per_day  is_long_stay  risk_encoded  admission_type_emergency  admission_type_urgent  log_los_days
count  12.0          12.00              12.00         12.00         12.00                     12.00                  12.00         12.00
mean    0.0           0.00               0.00          0.42          0.00                      0.58                   0.17          0.00
std     1.0           1.00               1.00          0.51          1.00                      0.51                   0.39          1.00
min    -1.6          -1.31              -1.07          0.00         -1.27                      0.00                   0.00         -1.59
max     1.8           2.06               3.15          1.00          1.05                      1.00                   1.00          1.62

What just happened?

The log-transformed log_los_days replaced raw los_days, then all continuous columns were standardised. The validation block confirmed zero nulls, zero infinities, and sensible ranges — scaled columns sit near mean 0 and std 1, binary flags remain 0/1 as expected. The feature matrix is ready.

The Order Rules — What Breaks If You Get It Wrong

Mistake → What silently breaks
Fit scaler on full dataset before split → Test statistics leak into training; evaluation metrics are optimistic
Impute after scaling → Imputed values may fall outside the fitted scale range; model sees out-of-distribution inputs
Construct features after encoding → Arithmetic on one-hot columns produces nonsense features
Scale binary / one-hot columns → 0/1 flags lose their binary meaning and gain decimal values
Skip the validation check → Null values or text columns crash the model at fit time with cryptic errors
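The first mistake in the table is easy to demonstrate in miniature: fit one scaler on train-plus-test and another on train only, then compare the fitted means (the values here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fitting on train+test lets the test point's value shift the fitted mean
train_vals = np.array([[2.0], [4.0], [6.0]])
test_vals  = np.array([[40.0]])   # an extreme test point

clean = StandardScaler().fit(train_vals)                          # train only
leaky = StandardScaler().fit(np.vstack([train_vals, test_vals]))  # train + test

print(clean.mean_)   # [4.]  — training statistics only
print(leaky.mean_)   # [13.] — the test outlier has leaked into the fit
```

Every training value scaled by the leaky scaler now carries information about the test set, which is exactly what makes the resulting evaluation optimistic.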

Teacher's Note

In real production systems, every step in this workflow should live inside a scikit-learn Pipeline object. A Pipeline chains transformers in sequence, automatically applies .fit() only to training data during cross-validation, and makes the entire FE sequence serialisable with joblib.dump(). The manual workflow you built here teaches you what the Pipeline is doing internally — which means when something breaks in production, you'll know exactly which step to inspect and why.
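A rough sketch of that idea — column names mirror this lesson's dataset, but this is a simplified illustration (it omits the constructed features and log transform), not the exact code run above:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# Each transformer is fitted on training data only when model.fit() is
# called, and re-fitted inside every fold during cross-validation
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'num_diagnoses', 'los_days']),
    ('ord', OrdinalEncoder(categories=[['low', 'medium', 'high']]), ['risk_level']),
    ('nom', OneHotEncoder(drop='first'), ['admission_type']),
])

model = Pipeline([
    ('prep', preprocess),
    ('clf', LogisticRegression()),
])

# model.fit(X_train, y_train) runs the whole FE sequence plus the model;
# joblib.dump(model, 'readmission.joblib') serialises all of it together
```

Because the fitted encoders and scaler travel inside the single saved object, the serving code cannot accidentally preprocess new data with different statistics than training used.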

Practice Questions

1. What is the very first operation you should perform on a dataset after loading it, before any feature engineering?



2. Transformers and scalers should be fitted on ________ data only, then applied to both sets.



3. Which step should come before construction, encoding, and scaling? Handling ________ ________.



Quiz

1. What goes wrong if you fit a StandardScaler on the full dataset before the train-test split?


2. Why should feature construction happen before encoding?


3. What should the final validation step check before passing the feature matrix to a model?


Up Next · Lesson 16

Polynomial Features

Welcome to Intermediate — learn how to generate squared, cubed, and interaction terms that let linear models capture curved and combined relationships.