Feature Engineering Lesson 2 – What are Features? | Dataplexa
Beginner Level · Lesson 2

What Are Features?

Everyone uses the word "feature" — but very few beginners get a definition precise enough to actually work with. Let's fix that, because everything in this course builds on it.

A feature is any measurable property of the thing you're making a prediction about. One row = one observation. The features describe it. The target is what you're trying to predict about it. Not every column is a feature — some are identifiers, some are targets, some are leakage risks. Deciding what belongs in the model, and in what form, is the job.

Features vs Targets vs Non-Features

Every column in a dataset plays exactly one of four roles. Getting this classification right before you write a single line of model code prevents the most common and most expensive beginner mistakes.

Feature — model input

sqft, house_age, has_garage, neighbourhood_encoded — measurable properties that help the model learn the relationship with the target.

🎯

Target — model output

sale_price, approved — what the model is trying to predict. Never used as input. Using it as input is data leakage.

🚫

Identifier — exclude before training

house_id, listing_url — unique row labels carry zero predictive signal and cause models to memorise training data rather than learning general patterns.

⚠️

Leaky feature — dangerous to include

days_on_market — you don't know this at the moment you'd use the model to predict price. Looks amazing in test metrics. Fails completely in production.
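In pandas, this classification comes down to two lines: drop everything that is not a feature, then pull the target out on its own. A minimal sketch with made-up housing values (the column names mirror the examples above; the numbers are illustrative, not real data):

```python
import pandas as pd

# A toy housing frame echoing the four roles above (values are illustrative)
df = pd.DataFrame({
    'house_id':       ['H1', 'H2', 'H3'],         # identifier: drop before training
    'sqft':           [850, 1200, 1550],          # feature
    'has_garage':     [0, 1, 1],                  # feature
    'days_on_market': [12, 45, 30],               # leaky: unknown at prediction time
    'sale_price':     [210000, 340000, 405000],   # target
})

# X gets only the true features; y gets only the target
X = df.drop(columns=['house_id', 'days_on_market', 'sale_price'])
y = df['sale_price']

print(list(X.columns))   # ['sqft', 'has_garage']
```

The point of writing it this way is that the drop list is an explicit, reviewable record of every column you excluded and why.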

The Four Feature Types

Every feature in every dataset falls into one of four types. The type determines exactly what you do with it before it can enter a model.

🔢 Numerical

Continuous numbers a model can do maths on directly — sqft, sale_price, house_age, credit_score.

Action: Check for skew, outliers, and scale differences. Usually model-ready after that.

🏷️ Categorical

Text labels the model cannot compute on — neighbourhood, employment_type, property_type.

Action: Encode into numbers before use — one-hot, ordinal, or target encoding depending on the situation.

📅 Datetime

Timestamps hiding multiple features inside — sale_date, application_date, last_login.

Action: Parse the string, then extract year, month, day-of-week, quarter, season — each becomes its own feature.

📝 Text

Free-form written content — property_description, review_text, support_ticket.

Action: Tokenise, clean, and vectorise — the most complex type, covered in depth in Lesson 7.

Running a Feature Type Audit

The scenario: You're a data scientist at a mortgage lending company. A colleague sends over a loan application dataset and says: "Can you run a quick audit before we start modelling? I need to know which features are ready to use, which ones need transformation, and whether there are any columns we should drop entirely." This is the first thing you do with any new dataset — classify every column before touching a single value.

# pandas — core data manipulation library, always imported as pd
import pandas as pd

# numpy — numerical Python library, imported as np
# used here for np.nan — the standard way to represent a missing value
import numpy as np

# A realistic loan applications dataset — 10 rows, 8 columns
loan_df = pd.DataFrame({
    'application_id':   ['L001','L002','L003','L004','L005',
                         'L006','L007','L008','L009','L010'],
    'application_date': ['2023-01-15','2023-02-03','2023-02-18','2023-03-07',
                         '2023-03-22','2023-04-10','2023-05-01','2023-05-19',
                         '2023-06-08','2023-06-25'],
    'loan_amount':      [150000,320000,85000,500000,210000,
                         175000,430000,95000,280000,360000],
    'annual_income':    [52000,95000,38000,140000,67000,
                         48000,112000,41000,78000,105000],
    'employment_type':  ['employed','self-employed','employed','employed',
                         'unemployed','employed','self-employed',
                         'employed','employed','self-employed'],
    'credit_score':     [720,810,640,890,580,695,775,660,740,820],
    'property_type':    ['apartment','house','apartment','house','apartment',
                         'house','house','apartment','house','apartment'],
    'approved':         [1,1,0,1,0,0,1,0,1,1]   # target: 1=approved, 0=rejected
})

# .dtypes shows the data type pandas assigned to every column
# object = text/string | int64 = integer | float64 = decimal number
print("Column dtypes:\n")
print(loan_df.dtypes)

# A manual audit dictionary — classify every column by its role
# This becomes a shared reference document for the whole team
audit = {
    'application_id':   'IDENTIFIER  — drop before training, no signal',
    'application_date': 'DATETIME    — parse and extract month, quarter, DOW',
    'loan_amount':      'NUMERICAL   — check skew and outliers',
    'annual_income':    'NUMERICAL   — check skew and outliers',
    'employment_type':  'CATEGORICAL — one-hot encode before use',
    'credit_score':     'NUMERICAL   — already on a standard scale',
    'property_type':    'CATEGORICAL — one-hot encode before use',
    'approved':         'TARGET      — predict this, never use as input'
}

print("\nFeature audit:\n")
for col, verdict in audit.items():
    print(f"  {col:<20}  {verdict}")
Column dtypes:

application_id      object
application_date    object
loan_amount          int64
annual_income        int64
employment_type     object
credit_score         int64
property_type       object
approved             int64
dtype: object

Feature audit:

  application_id        IDENTIFIER  — drop before training, no signal
  application_date      DATETIME    — parse and extract month, quarter, DOW
  loan_amount           NUMERICAL   — check skew and outliers
  annual_income         NUMERICAL   — check skew and outliers
  employment_type       CATEGORICAL — one-hot encode before use
  credit_score          NUMERICAL   — already on a standard scale
  property_type         CATEGORICAL — one-hot encode before use
  approved              TARGET      — predict this, never use as input

What just happened?

.dtypes is a pandas attribute — no parentheses — that returns the stored data type of every column. object means text/string. Notice application_date is object — a plain string that pandas cannot do date arithmetic on until you parse it with pd.to_datetime(). Three of the eight columns need transformation before a model can use them. The audit makes that visible before anyone writes a line of modelling code.
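You can get a rough first pass of this audit automatically with select_dtypes, which groups columns by their stored dtype. It cannot tell an identifier from a genuine feature, so the judgement calls stay manual, but it instantly separates numeric candidates from text ones. A sketch on a trimmed slice of the loan data:

```python
import pandas as pd

loan_df = pd.DataFrame({
    'application_id':   ['L001', 'L002', 'L003'],
    'application_date': ['2023-01-15', '2023-02-03', '2023-02-18'],
    'loan_amount':      [150000, 320000, 85000],
    'employment_type':  ['employed', 'self-employed', 'employed'],
    'approved':         [1, 1, 0],
})

# Numeric columns: candidate numerical features (the target lands here too)
numeric_cols = loan_df.select_dtypes(include='number').columns.tolist()

# object columns: candidate categoricals, identifiers, or unparsed dates
object_cols = loan_df.select_dtypes(include='object').columns.tolist()

print(numeric_cols)  # ['loan_amount', 'approved']
print(object_cols)   # ['application_id', 'application_date', 'employment_type']
```

Note how the automatic split lumps the target in with the numerics and the identifier in with the categoricals: dtype tells you how a column is stored, not what role it plays.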

Numerical Features — Checking the Distribution

The scenario: Your manager looks at the loan dataset and says: "Two of our numerical columns — loan amount and annual income — look very different in scale. A £500k loan sitting next to a £38k salary might confuse a distance-based model like KNN. Can you show me the distributions and tell me if we have a skew problem before we decide on scaling?" You pull summary statistics and run a skewness check.

import pandas as pd

loan_df = pd.DataFrame({
    'loan_amount':   [150000,320000,85000,500000,210000,
                      175000,430000,95000,280000,360000],
    'annual_income': [52000,95000,38000,140000,67000,
                      48000,112000,41000,78000,105000],
    'credit_score':  [720,810,640,890,580,695,775,660,740,820],
    'approved':      [1,1,0,1,0,0,1,0,1,1]
})

# .describe() runs count, mean, std, min, quartiles and max in one shot
# std = standard deviation — measures how spread out values are from the mean
# A large gap between mean and max hints at right skew or outliers
print("Summary statistics:\n")
print(loan_df[['loan_amount','annual_income','credit_score']].describe().round(0))

# .skew() measures distributional asymmetry
# Positive = long right tail (a few very large values pulling the mean up)
# Negative = long left tail
# Rule of thumb: |skew| > 1.0 = highly skewed, consider log transform
print("\nSkewness check:\n")
for col in ['loan_amount', 'annual_income', 'credit_score']:
    sk   = loan_df[col].skew()
    flag = "  ← consider log transform" if abs(sk) > 1.0 else "  ✓ acceptable"
    print(f"  {col:<20}  {sk:+.3f}{flag}")
Summary statistics:

       loan_amount  annual_income  credit_score
count         10.0           10.0          10.0
mean      260500.0        77600.0         733.0
std       141489.0        34452.0          94.0
min        85000.0        38000.0         580.0
25%       156250.0        49000.0         669.0
50%       245000.0        72500.0         730.0
75%       350000.0       102500.0         801.0
max       500000.0       140000.0         890.0

Skewness check:

  loan_amount           +0.381  ✓ acceptable
  annual_income         +0.525  ✓ acceptable
  credit_score          +0.038  ✓ acceptable

What just happened?

.describe() runs count, mean, std, min, quartiles, and max in one shot — the fastest statistical profile of any numerical column. .skew() measures distributional asymmetry. None of the three features are severely skewed (all under |1.0|), so log transformation is not needed here. The real issue is scale: loan_amount ranges 85k–500k while credit_score runs 580–890. In a KNN model, that difference would make credit score nearly invisible in every distance calculation.
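Scaling is covered properly later in the course, but the fix is worth a quick sketch now. Standardisation, subtracting each column's mean and dividing by its standard deviation, puts loan_amount and credit_score on the same footing. An illustrative five-row slice of the loan data, using plain pandas rather than any scaling library:

```python
import pandas as pd

loan_df = pd.DataFrame({
    'loan_amount':  [150000, 320000, 85000, 500000, 210000],
    'credit_score': [720, 810, 640, 890, 580],
})

# Standardisation by hand: subtract the mean, divide by the standard deviation.
# Afterwards every column has mean 0 and standard deviation 1, so a
# distance-based model like KNN weighs them equally.
scaled = (loan_df - loan_df.mean()) / loan_df.std()

print(scaled.round(3))
```

In practice you would fit the mean and standard deviation on the training split only and reuse them on the test split, a detail a later lesson on scaling should make explicit.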

Categorical Features — Why Strings Break Models

The scenario: A junior data scientist on your team asks: "Why can't we just leave employment_type as text? Surely the model can figure out that 'employed' is different from 'unemployed'?" You write a demonstration that shows exactly what happens — and then show the correct encoding.

import pandas as pd

loan_df = pd.DataFrame({
    'employment_type': ['employed','self-employed','employed','employed',
                        'unemployed','employed','self-employed',
                        'employed','employed','self-employed'],
    'approved': [1,1,0,1,0,0,1,0,1,1]
})

# .value_counts() shows category distribution — check for imbalance before encoding
# A category with only 1 row causes problems in train/test splits
print("Category distribution:\n")
print(loan_df['employment_type'].value_counts())

# pd.get_dummies() — pandas built-in one-hot encoder
# Creates one binary column per category (1 = row has that category, 0 = it doesn't)
# prefix='emp' adds a readable prefix to every new column name
# drop_first=True drops one column to avoid the dummy variable trap
#   (if self-employed=0 and unemployed=0, the person must be employed — no third column needed)
# dtype=int forces the flags to 0/1 integers (pandas 2.x defaults to True/False)
encoded = pd.get_dummies(loan_df['employment_type'],
                         prefix='emp',
                         drop_first=True,
                         dtype=int)

result = pd.concat([encoded, loan_df['approved']], axis=1)

print("\nAfter one-hot encoding:\n")
print(result.to_string())
Category distribution:

employed         6
self-employed    3
unemployed       1
Name: employment_type, dtype: int64

After one-hot encoding:

   emp_self-employed  emp_unemployed  approved
0                  0               0         1
1                  1               0         1
2                  0               0         0
3                  0               0         1
4                  0               1         0
5                  0               0         0
6                  1               0         1
7                  0               0         0
8                  0               0         1
9                  1               0         1

What just happened?

.value_counts() checks category distribution before encoding — a category that appears only once will cause problems in train/test splits, and unemployed does exactly that here (one row out of ten). pd.get_dummies() is pandas' built-in one-hot encoder. prefix='emp' adds a readable prefix to every new column name. drop_first=True removes the 'employed' column — it is implied when both other flags are 0. The string "employed" is gone. The model gets pure 0/1 integers it can compute with.
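The same call also scales to several categorical columns at once: pd.get_dummies accepts a whole DataFrame plus a columns= list, encodes only those columns, and passes the rest through untouched. A sketch on a three-row slice of the loan data:

```python
import pandas as pd

loan_df = pd.DataFrame({
    'employment_type': ['employed', 'self-employed', 'unemployed'],
    'property_type':   ['apartment', 'house', 'apartment'],
    'credit_score':    [720, 810, 640],
})

# Encode both categoricals in one call; credit_score passes through unchanged.
# drop_first=True drops one category per encoded column; dtype=int keeps 0/1.
encoded = pd.get_dummies(loan_df,
                         columns=['employment_type', 'property_type'],
                         drop_first=True,
                         dtype=int)

print(list(encoded.columns))
```

The result keeps credit_score plus one dummy column per remaining category: employment_type_self-employed, employment_type_unemployed, and property_type_house.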

Datetime Features — One Column, Many Signals

The scenario: Your team lead says: "There's probably a seasonal pattern in loan approvals — January applications might behave differently from June ones. Day-of-week could matter too. Can you extract the useful time signals from application_date before we throw the raw timestamp away?" You parse the column and pull out everything that could carry signal.

import pandas as pd

loan_df = pd.DataFrame({
    'application_date': ['2023-01-15','2023-02-03','2023-02-18','2023-03-07',
                         '2023-03-22','2023-04-10','2023-05-01','2023-05-19',
                         '2023-06-08','2023-06-25'],
    'approved': [1,1,0,1,0,0,1,0,1,1]
})

# pd.to_datetime() converts text strings into proper datetime64 objects
# Without this step, Python treats '2023-01-15' as plain text — no date maths possible
loan_df['application_date'] = pd.to_datetime(loan_df['application_date'])

# .dt is the datetime accessor — it unlocks all date/time properties on a Series
# Without .dt, calling .month on a column throws an AttributeError
loan_df['app_month']   = loan_df['application_date'].dt.month      # 1=Jan, 12=Dec
loan_df['app_quarter'] = loan_df['application_date'].dt.quarter    # Q1–Q4
loan_df['app_dow']     = loan_df['application_date'].dt.dayofweek  # 0=Mon, 6=Sun
loan_df['app_year']    = loan_df['application_date'].dt.year

# Show extracted features alongside the original and target
print(loan_df[['application_date','app_month','app_quarter',
               'app_dow','app_year','approved']].to_string(index=False))
application_date  app_month  app_quarter  app_dow  app_year  approved
      2023-01-15          1            1        6      2023         1
      2023-02-03          2            1        4      2023         1
      2023-02-18          2            1        5      2023         0
      2023-03-07          3            1        1      2023         1
      2023-03-22          3            1        2      2023         0
      2023-04-10          4            2        0      2023         0
      2023-05-01          5            2        0      2023         1
      2023-05-19          5            2        4      2023         0
      2023-06-08          6            2        3      2023         1
      2023-06-25          6            2        6      2023         1

What just happened?

pd.to_datetime() converts text strings into datetime64 objects — mandatory before any date arithmetic. The .dt accessor unlocks all date and time properties on a Series. Without it, calling .month on a column throws an AttributeError. One raw text column became four usable numeric features. Each would then be validated against the target before deciding which to keep.
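Beyond calendar parts, a parsed datetime column also supports arithmetic. One common derived feature is "days since the first application", a single numeric column that captures trend over time. A sketch on three of the lesson's dates:

```python
import pandas as pd

# Parse three of the application dates from the lesson's dataset
dates = pd.to_datetime(pd.Series(['2023-01-15', '2023-02-03', '2023-06-25']))

# Subtracting datetimes yields Timedelta values; .dt.days converts to integers
days_since_first = (dates - dates.min()).dt.days

print(days_since_first.tolist())  # [0, 19, 161]
```

This is exactly the kind of feature that only exists after pd.to_datetime(): on the raw strings, subtraction would simply fail.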

Text Features — From Words to Numbers

The scenario: Each loan application includes a short free-text field where applicants describe their employment situation. Your team lead asks: "Can we get any signal from those descriptions? Can we tell from the words alone whether someone is likely to get approved?" You start with the simplest possible text feature engineering — word counts and keyword flags — before the full NLP pipeline covered in Lesson 7.

import pandas as pd

loan_df = pd.DataFrame({
    'description': [
        'stable government job for ten years',
        'running my own consultancy business since 2018',
        'between jobs currently looking for work',
        'senior engineer at large tech company permanent contract',
        'recently made redundant still searching',
        'part time work irregular hours',
        'director of ltd company profitable for five years',
        'temporary contract ending soon',
        'nurse full time NHS permanent',
        'freelance designer multiple clients regular income'
    ],
    'approved': [1,1,0,1,0,0,1,0,1,1]
})

# Basic text feature 1: word count — longer descriptions may indicate more stability
# .str.split() splits each string into a list of words
# .str.len() counts how many words are in that list
loan_df['word_count'] = loan_df['description'].str.split().str.len()

# Basic text feature 2: keyword flags — presence of high-signal words
# .str.contains() returns True/False — .astype(int) converts to 1/0
loan_df['has_permanent'] = loan_df['description'].str.contains(
    'permanent|stable|full time', case=False).astype(int)

loan_df['has_risk_word'] = loan_df['description'].str.contains(
    'redundant|temporary|irregular|searching|between', case=False).astype(int)

# Check whether our new text features correlate with approval
print("Text feature correlations with approved:\n")
for col in ['word_count', 'has_permanent', 'has_risk_word']:
    corr = loan_df[col].corr(loan_df['approved'])
    print(f"  {col:<18}  {corr:+.3f}")

print("\nSample output:\n")
print(loan_df[['description','word_count',
               'has_permanent','has_risk_word','approved']].to_string(index=False))
Text feature correlations with approved:

  word_count          +0.645
  has_permanent       +0.535
  has_risk_word       -1.000

Sample output:

                                              description  word_count  has_permanent  has_risk_word  approved
                      stable government job for ten years           6              1              0         1
           running my own consultancy business since 2018           7              0              0         1
                  between jobs currently looking for work           6              0              1         0
 senior engineer at large tech company permanent contract           8              1              0         1
                  recently made redundant still searching           5              0              1         0
                           part time work irregular hours           5              0              1         0
        director of ltd company profitable for five years           8              0              0         1
                           temporary contract ending soon           4              0              1         0
                            nurse full time NHS permanent           5              1              0         1
       freelance designer multiple clients regular income           6              0              0         1

What just happened?

.str.split().str.len() chains two string operations: split each description into a list of words, then count how many are in that list. .str.contains() checks for a keyword pattern — the pipe | means "or" and case=False ignores capitalisation. The keyword flags carry the signal: has_risk_word correlates at −1.000, separating approved from rejected perfectly in this ten-row sample, and has_permanent comes in at +0.535. Treat numbers that clean with suspicion, though: a perfect score is an artefact of a tiny hand-built dataset, and even the raw word count reaches +0.645 here for the same reason. The takeaway survives on real data: a couple of deliberately chosen keyword flags can pull genuine predictive signal out of free text.

💻

Try it yourself — extend the keyword list

Open Google Colab and paste the text features code block. Then add your own keywords to the has_permanent and has_risk_word patterns — try adding "contract", "pension", or "director" — and rerun the correlations. You'll immediately see how choosing keywords deliberately versus randomly changes the signal you extract.
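If you want the data, rather than intuition, to nominate keywords, one rough approach is to break every description into words and look at the approval rate per word. A pure-pandas sketch on toy descriptions (not the lesson's dataset; the four rows here are invented for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    'description': ['stable permanent job',
                    'temporary contract ending',
                    'permanent full time nurse',
                    'between jobs searching'],
    'approved': [1, 0, 1, 0],
})

# One row per word, each keeping its row's approval label:
# .str.split() makes a list column, .explode() unpacks it
words = (df.assign(word=df['description'].str.split())
           .explode('word'))

# Mean approval per word: 1.0 = only in approved rows, 0.0 = only in rejected
word_signal = words.groupby('word')['approved'].mean().sort_values()

print(word_signal)
```

Words at the extremes (approval rate near 0.0 or 1.0) are candidates for the risk and permanent flag patterns; on a dataset this small, every rate is noisy, so treat the ranking as a shortlist to sanity-check, not an answer.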

The Complete Feature Audit Table

Here is the full loan dataset classified — every column mapped to its type, role, required action, and whether it goes into the model:

Column            Type         Role            Action required                           Include?
application_id    Identifier   Non-feature     Drop before training                      No
application_date  Datetime     Feature source  Extract month, quarter, day-of-week       Derived only
loan_amount       Numerical    Feature         Scale before use                          Yes
annual_income     Numerical    Feature         Scale before use                          Yes
employment_type   Categorical  Feature         One-hot encode                            Yes
credit_score      Numerical    Feature         Ready to use                              Yes
property_type     Categorical  Feature         One-hot encode                            Yes
description       Text         Feature source  Keyword flags, word count; TF-IDF in L7   Derived only
approved          Binary       Target          Predict this — never use as input         No

Teacher's Note

The most expensive mistake in feature engineering is including a leaky feature — something containing future information about the target. In the loan dataset, imagine a column called underwriter_flag that an underwriter manually adds after reviewing the application outcome. If you train on that, your model learns from a human decision that already incorporated the approval result. It'll score brilliantly in testing and be completely useless in production.

Always ask before including any column: "At the exact moment I would use this model to make a prediction, would this feature's value be available?" If the answer is "only after the fact" — it's leakage. Drop it without hesitation.
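The audit dictionary from the first example can double as a guard against exactly this mistake: before training, check that every column entering the model was explicitly classified as a feature. A minimal sketch (the role labels below are assumptions mirroring the earlier audit, simplified to one word per column):

```python
# Roles assigned during the audit, simplified from the earlier audit dict
audit_roles = {
    'loan_amount':    'FEATURE',
    'credit_score':   'FEATURE',
    'application_id': 'IDENTIFIER',
    'approved':       'TARGET',
}

def check_model_columns(columns, roles):
    """Return every column that was never classified as a feature."""
    return [c for c in columns if roles.get(c) != 'FEATURE']

# Catching a leak: someone accidentally left the target in X
bad = check_model_columns(['loan_amount', 'credit_score', 'approved'], audit_roles)
print(bad)  # ['approved']
```

An empty list means X contains only audited features; anything else, including a brand-new column nobody classified, gets flagged before it can leak into training.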

Practice Questions

1. The column a model is trained to predict — such as approved or sale_price — is called the ___.



2. The pandas function used to perform one-hot encoding on a categorical column is pd.___().



3. To access datetime properties like .month or .dayofweek on a pandas Series, you must first use the ___ accessor.



Quiz

1. Why should a column like application_id be dropped before model training?


2. From the text feature engineering results, which engineered feature had the strongest relationship with loan approval?


3. What is a leaky feature?


Up Next · Lesson 3

Types of Features

Go deep on continuous vs discrete numerics, nominal vs ordinal categoricals, and exactly which transformation each type demands before a model can use it.