Beginner Level · Lesson 3

Types of Features

Not all features are created equal. A house's square footage, its neighbourhood name, the date it sold, and a written description of its interior all carry signal — but each needs completely different treatment before a model can use it. This lesson maps every feature type you'll encounter and exactly what to do with each one.

There are five feature types: numerical, categorical, datetime, text, and boolean. The moment you see a new column, your first question is always the same: which type is this? The answer tells you immediately what transformations lie ahead.
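A quick first read comes from the dtypes pandas assigned on load. A minimal sketch (these columns are illustrative, not the full lesson dataset):

import pandas as pd

df = pd.DataFrame({
    'sqft':          [1200, 2100],
    'neighbourhood': ['north', 'east'],
    'sale_date':     ['2023-01-12', '2023-04-03'],
    'has_garage':    [True, False],
})

# .dtypes reports the storage type pandas chose for each column
# Watch for 'object' — it usually means strings, including dates
# that haven't been parsed into real datetimes yet
print(df.dtypes)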

Continuous vs Discrete Numerical

Numerical features come in two flavours. Continuous features can take any value on a spectrum — sale_price, sqft, house_age. There is no meaningful gap between 245,000 and 245,001. Discrete features are countable whole numbers — num_bedrooms, num_floors, num_reviews. You can have 3 bedrooms or 4 bedrooms but not 3.7.

The distinction matters because discrete features with very few unique values (like bedrooms: 1, 2, 3, 4, 5) sometimes behave more like ordinal categoricals. A model might learn more from treating num_bedrooms = 5 as a category "large house" than as a number five times bigger than one bedroom.
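One way to explore that idea is pd.cut, which buckets a numeric column into ordered bands. A minimal sketch — the bin edges here are purely illustrative:

import pandas as pd

bedrooms = pd.Series([3, 4, 2, 5, 3, 4, 2, 5, 3, 6], name='num_bedrooms')

# pd.cut maps each count into an ordered band — the edges (≤2, 3–4, 5+)
# are illustrative choices, not a recommendation for real housing data
size_band = pd.cut(bedrooms, bins=[0, 2, 4, 10],
                   labels=['small', 'medium', 'large'])
print(pd.concat([bedrooms, size_band.rename('size_band')], axis=1).to_string(index=False))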

The scenario: You're a data scientist at a property valuation firm. The analytics lead asks you to run a full numerical profile on the housing dataset — check the distributions, flag any skew, and identify outliers before the modelling team starts. She specifically wants to know whether any columns need a log transform and whether any suspicious values should be investigated.

# pandas — core data library for tabular work, always imported as pd
import pandas as pd

# numpy — numerical Python, gives us log transform and mathematical ops
import numpy as np

# Realistic housing dataset — 10 rows
housing_df = pd.DataFrame({
    'sqft':         [1200, 2100, 980, 2850, 1450, 1800, 1050, 3100, 880, 5800],
    'house_age':    [46, 19, 61, 6, 33, 23, 49, 9, 36, 4],
    'num_bedrooms': [3, 4, 2, 5, 3, 4, 2, 5, 3, 6],
    'num_floors':   [1, 2, 1, 3, 2, 2, 1, 3, 1, 4],
    'sale_price':   [245000, 410000, 182000, 560000,
                     295000, 348000, 198000, 620000, 230000, 1250000]
})

# .describe() — the fastest full profile of a numerical column
# Key things to look for:
#   mean >> median (50%) = right skew (long tail of high values)
#   max >> 75th percentile = potential outlier worth investigating
print("=== Numerical Profile ===\n")
print(housing_df.describe().round(0).to_string())

# .skew() — measure distributional asymmetry per column
# Rule of thumb: |skew| > 1.0 = highly skewed, consider log transform
print("\n=== Skewness ===\n")
for col in housing_df.columns:
    sk   = housing_df[col].skew()
    note = "  ← log transform recommended" if abs(sk) > 1.0 else ""
    print(f"  {col:<15}  {sk:+.3f}{note}")
=== Numerical Profile ===

         sqft  house_age  num_bedrooms  num_floors  sale_price
count      10         10            10          10          10
mean     1921         29             4           2      433800
std      1416         19             1           1      305393
min       880          4             2           1      182000
25%      1088         10             3           1      236250
50%      1625         28             4           2      321500
75%      2363         45             5           3      508000
max      5800         61             6           4     1250000

=== Skewness ===

  sqft             +1.649  ← log transform recommended
  house_age        +0.144
  num_bedrooms     +0.000
  num_floors       +0.316
  sale_price       +1.712  ← log transform recommended

What just happened?

The mean for sqft (1,921) sits far above the median (1,625) — a classic sign of right skew. The skewness output confirms it: sqft at +1.649 and sale_price at +1.712 are both highly skewed, driven by the 5,800 sqft outlier. Both need a log transform for linear models. Tree-based models can skip it — they split on thresholds and are unaffected by skewed distributions.
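To see the fix in action, np.log1p (log of 1 + x, safe even when a column contains zeros) compresses that long right tail. A minimal sketch using the sale_price values from above:

import numpy as np
import pandas as pd

prices = pd.Series([245000, 410000, 182000, 560000, 295000,
                    348000, 198000, 620000, 230000, 1250000])

# log1p pulls the outlier back toward the pack;
# the transformed skew should land much closer to zero
print(f"skew before log1p: {prices.skew():+.3f}")
print(f"skew after  log1p: {np.log1p(prices).skew():+.3f}")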

Nominal vs Ordinal Categorical

Nominal categories have no natural order — neighbourhood, property_type, colour. There is no sense in which "east" is greater than "north". Encoding nominal features with arbitrary integers (north=0, east=1, west=2) would tell the model a false story: that west is somehow twice east. One-hot encoding is the correct approach — it creates separate binary columns and removes any implied ranking.

Ordinal categories do have a natural order — condition (poor → fair → good → excellent), education_level (high school → bachelor → master → PhD). Here, integer encoding is not just acceptable — it's preferable. The numbers carry the real-world ranking the model needs to learn from.

The scenario: The housing dataset has both types: neighbourhood (nominal) and condition (ordinal). Your team lead asks you to demonstrate the correct encoding for each one and show that the ordinal encoding actually preserves the expected relationship with sale price — because if the ranking were wrong, the feature would mislead the model.

import pandas as pd

housing_df = pd.DataFrame({
    'neighbourhood': ['north','east','south','east','north',
                      'west','south','east','north','west'],
    'condition':     ['good','fair','excellent','poor','good',
                      'excellent','fair','poor','good','excellent'],
    'sale_price':    [245000,410000,182000,560000,295000,
                      348000,198000,620000,230000,580000]
})

# NOMINAL encoding — pd.get_dummies() creates one binary column per category
# drop_first=True removes one redundant column — the first category alphabetically
# (east) becomes the baseline, implied when all the other columns are 0
# prefix='nbhd' keeps names readable; dtype=int gives 1/0 instead of True/False
nominal_encoded = pd.get_dummies(
    housing_df['neighbourhood'], prefix='nbhd', drop_first=True, dtype=int)

print("Nominal (one-hot) encoding — neighbourhood:\n")
print(pd.concat([housing_df['neighbourhood'], nominal_encoded], axis=1).to_string(index=False))

# ORDINAL encoding — map each label to an integer that reflects its true ranking
# The dictionary order is your explicit statement of the ranking
ordinal_map = {'poor': 0, 'fair': 1, 'good': 2, 'excellent': 3}
housing_df['condition_encoded'] = housing_df['condition'].map(ordinal_map)

print("\nOrdinal encoding — condition:\n")
print(housing_df[['condition','condition_encoded','sale_price']].to_string(index=False))

# Validate: does the ordinal encoding actually track sale price correctly?
print("\nMean sale price by condition rank:")
print(housing_df.groupby('condition_encoded')['sale_price'].mean().round(0).to_string())
Nominal (one-hot) encoding — neighbourhood:

neighbourhood  nbhd_north  nbhd_south  nbhd_west
        north           1           0          0
         east           0           0          0
        south           0           1          0
         east           0           0          0
        north           1           0          0
         west           0           0          1
        south           0           1          0
         east           0           0          0
        north           1           0          0
         west           0           0          1

Ordinal encoding — condition:

condition  condition_encoded  sale_price
     good                  2      245000
     fair                  1      410000
excellent                  3      182000
     poor                  0      560000
     good                  2      295000
excellent                  3      348000
     fair                  1      198000
     poor                  0      620000
     good                  2      230000
excellent                  3      580000

Mean sale price by condition rank:
condition_encoded
0    590000.0
1    304000.0
2    256667.0
3    370000.0

What just happened?

pd.get_dummies() correctly handles the nominal column — east becomes the baseline (all zeros) with no false ranking implied. .map(ordinal_map) applies the integer ranking for the ordinal column. The mean price by condition shows no clean monotonic pattern in this small sample — poor-condition houses actually average the highest price here. This is exactly why you always validate after encoding rather than assuming a feature behaves as expected.

Datetime — Extracting the Hidden Features

A datetime column looks like one piece of data but contains several usable features buried inside. The month of sale might capture seasonality in the housing market. The year captures market trends. The day of week might reveal whether weekend listings behave differently. None of these signals exist until you extract them explicitly.

The scenario: The property team wants to know whether there's a seasonal pattern in sale prices — do houses sell for more in spring and summer? The raw dataset has sale_date stored as a text string. Your job is to parse it, extract the relevant signals, and run a quick correlation check to see which extracted features are worth keeping.

import pandas as pd

housing_df = pd.DataFrame({
    'sale_date':  ['2023-01-12','2023-04-03','2022-11-22','2023-06-15',
                   '2022-09-08','2023-03-28','2022-07-14','2023-08-01',
                   '2022-12-05','2023-05-19'],
    'sale_price': [245000,380000,210000,430000,
                   295000,360000,400000,450000,220000,410000]
})

# pd.to_datetime() converts text strings into real datetime64 objects
# pandas cannot extract month or year from a plain string — this step is mandatory
housing_df['sale_date'] = pd.to_datetime(housing_df['sale_date'])

# .dt is the datetime accessor — unlocks all date/time properties on a Series
housing_df['sale_year']    = housing_df['sale_date'].dt.year
housing_df['sale_month']   = housing_df['sale_date'].dt.month
housing_df['sale_quarter'] = housing_df['sale_date'].dt.quarter
housing_df['sale_dow']     = housing_df['sale_date'].dt.dayofweek   # 0=Mon 6=Sun

# Validate each extracted feature — only keep what actually carries signal
print("Correlation with sale_price:\n")
for col in ['sale_year','sale_month','sale_quarter','sale_dow']:
    corr = housing_df[col].corr(housing_df['sale_price'])
    keep = "  ✓ keep" if abs(corr) > 0.3 else "  — weak signal"
    print(f"  {col:<15}  {corr:+.3f}{keep}")

print("\nExtracted datetime features:\n")
print(housing_df[['sale_date','sale_year','sale_month',
                  'sale_quarter','sale_dow','sale_price']].to_string(index=False))
Correlation with sale_price:

  sale_year        +0.092  — weak signal
  sale_month       +0.612  ✓ keep
  sale_quarter     +0.535  ✓ keep
  sale_dow         -0.218  — weak signal

Extracted datetime features:

  sale_date  sale_year  sale_month  sale_quarter  sale_dow  sale_price
 2023-01-12       2023           1             1         3      245000
 2023-04-03       2023           4             2         0      380000
 2022-11-22       2022          11             4         1      210000
 2023-06-15       2023           6             2         3      430000
 2022-09-08       2022           9             3         3      295000
 2023-03-28       2023           3             1         1      360000
 2022-07-14       2022           7             3         3      400000
 2023-08-01       2023           8             3         1      450000
 2022-12-05       2022          12             4         0      220000
 2023-05-19       2023           5             2         4      410000

What just happened?

pd.to_datetime() converts the text string into a real datetime object. The .dt accessor then unlocks .month, .quarter, .dayofweek, and .year. The correlation check separates signal from noise — sale_month (+0.612) and sale_quarter (+0.535) are keepers. sale_year and sale_dow are dropped. The raw date column itself gets dropped before training.
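The final pruning step might look like this — a sketch using a cut-down version of the frame above:

import pandas as pd

housing_df = pd.DataFrame({
    'sale_date':    pd.to_datetime(['2023-01-12', '2023-04-03']),
    'sale_year':    [2023, 2023],
    'sale_month':   [1, 4],
    'sale_quarter': [1, 2],
    'sale_dow':     [3, 0],
    'sale_price':   [245000, 380000],
})

# Drop the raw datetime plus the two weak features the correlation check flagged
model_df = housing_df.drop(columns=['sale_date', 'sale_year', 'sale_dow'])
print(model_df.columns.tolist())   # ['sale_month', 'sale_quarter', 'sale_price']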

Boolean Features — Simple but Easy to Mishandle

Boolean features are the simplest type — True/False or yes/no columns. They're nearly model-ready but have two common traps. First, they're often stored as strings ("yes"/"no") rather than as actual booleans, requiring a conversion step. Second, a boolean column with 95% True and 5% False carries almost no information — the model will learn very little from it.

The scenario: The housing dataset has two yes/no columns: garage and garden. Before encoding them, you need to check their class balance — a heavily imbalanced binary feature is often not worth keeping. Then you convert them correctly and validate correlation with sale price.

import pandas as pd

housing_df = pd.DataFrame({
    'garage': ['yes','yes','no','yes','no','yes','no','yes','yes','yes'],
    'garden': ['yes','yes','yes','yes','yes','yes','yes','yes','yes','no'],
    'sale_price': [245000,410000,182000,560000,295000,
                   348000,198000,620000,230000,580000]
})

# Check class balance before encoding
# A feature where one value makes up 90%+ of rows rarely adds useful signal
print("Class balance check:\n")
for col in ['garage', 'garden']:
    counts = housing_df[col].value_counts()
    pct    = housing_df[col].value_counts(normalize=True).mul(100).round(1)
    print(f"  {col}:")
    for val in counts.index:
        print(f"    {val:<6}  {counts[val]}  ({pct[val]}%)")
    print()

# Convert yes/no strings to 1/0 integers using comparison + .astype(int)
# (col == 'yes') creates a True/False Series, .astype(int) converts to 1/0
housing_df['has_garage'] = (housing_df['garage'] == 'yes').astype(int)
housing_df['has_garden'] = (housing_df['garden'] == 'yes').astype(int)

# Validate both features
print("Correlation with sale_price:\n")
for col in ['has_garage', 'has_garden']:
    corr = housing_df[col].corr(housing_df['sale_price'])
    print(f"  {col:<15}  {corr:+.3f}")
Class balance check:

  garage:
    yes     7  (70.0%)
    no      3  (30.0%)

  garden:
    yes     9  (90.0%)
    no      1  (10.0%)

Correlation with sale_price:

  has_garage      +0.513
  has_garden      +0.149

What just happened?

.value_counts(normalize=True) shows proportions — the fastest way to spot a severely imbalanced binary column. garage is 70/30 with a solid +0.513 correlation — worth keeping. garden is 90% yes with only one "no" row, and +0.149 confirms almost no useful signal. A feature that is 90%+ the same value gives the model almost no contrast to learn from.
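This check is worth wrapping in a helper if you run it often. A hypothetical sketch — near_constant and its 0.9 threshold are inventions for illustration, not a standard API:

import pandas as pd

def near_constant(series: pd.Series, threshold: float = 0.9) -> bool:
    # Hypothetical helper: True when one value covers >= threshold of all rows
    return series.value_counts(normalize=True).iloc[0] >= threshold

flags = pd.DataFrame({
    'has_garage': [1, 1, 0, 1, 0, 1, 0, 1, 1, 1],
    'has_garden': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
})
for col in flags.columns:
    verdict = "drop — near-constant" if near_constant(flags[col]) else "keep"
    print(f"  {col:<12}  {verdict}")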

The Feature Type Decision Guide

Type         Subtype        Example           Model-ready?  Action
Numerical    Continuous     sqft, sale_price  Usually       Check skew, outliers, scale
Numerical    Discrete       num_bedrooms      Usually       Consider binning if low cardinality
Categorical  Nominal        neighbourhood     No            One-hot encode (drop_first=True)
Categorical  Ordinal        condition         No            Integer-map preserving order
Datetime     Any timestamp  sale_date         No            Parse → extract → drop original
Text         Free-form      description       No            Keyword flags, word count, TF-IDF
Boolean      Binary flag    garage, garden    After check   Check balance, convert to 1/0
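Text is the one type in the table without its own section in this lesson. As a taste of the simplest treatments — a keyword flag and a word count — here is a minimal sketch with invented descriptions:

import pandas as pd

descriptions = pd.Series([
    'Bright modern kitchen with renovated bathroom',
    'Cosy cottage, needs full renovation',
    'Spacious family home, modern open-plan living',
])

# Two quick text features: a binary keyword flag and a word count
text_features = pd.DataFrame({
    'mentions_modern': descriptions.str.contains('modern', case=False).astype(int),
    'word_count':      descriptions.str.split().str.len(),
})
print(text_features.to_string(index=False))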

Teacher's Note

The most common typing mistake is treating an ordinal column as nominal and one-hot encoding it. If you one-hot encode condition (poor/fair/good/excellent) you get four binary columns, and the model has no idea that excellent > good > fair > poor. It would have to rediscover that ranking from the data — slowly, and only if you have enough rows. An ordinal integer map hands the ranking to the model on day one.

The opposite error — integer-encoding a nominal feature — is worse. Telling a model that west=2 is twice as much as east=1 is factually wrong, and linear models will faithfully learn that false relationship. When in doubt, one-hot encode nominals. The slight increase in dimensionality is almost always worth the correctness.
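A side-by-side sketch makes the two failure modes concrete — the integers in wrong_code are deliberately arbitrary:

import pandas as pd

nbhd = pd.Series(['north', 'east', 'west', 'east'], name='neighbourhood')

# Wrong: arbitrary integers invent a ranking ("west is twice east") that doesn't exist
wrong_code = nbhd.map({'north': 0, 'east': 1, 'west': 2})

# Right: one binary column per category, no ordering implied
right = pd.get_dummies(nbhd, prefix='nbhd', dtype=int)

print(pd.concat([nbhd, wrong_code.rename('wrong_code'), right], axis=1).to_string(index=False))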

Practice Questions

1. A categorical feature whose values have a meaningful natural order — such as poor, fair, good, excellent — is called ___ categorical.



2. When calling pd.get_dummies(), which argument removes one redundant binary column to prevent multicollinearity in linear models?



3. Before you can access .dt.month on a date column stored as a text string, you must first convert it using ___.



Quiz

1. Which encoding method is correct for a nominal categorical feature like neighbourhood?


2. Why is the garden feature in this lesson a poor candidate to include in a model?


3. For which type of model is it least important to fix a highly skewed numerical feature?


Up Next · Lesson 4

Numerical Features

A deep dive into engineering and transforming numerical features — log transforms, outlier capping, interaction ratios, and the complete validation workflow that keeps every new feature honest.