DS Case Studies
Analysing Insurance Claim Patterns
An insurance company's profitability lives or dies by its loss ratio — the percentage of premium income paid out in claims. A portfolio with a loss ratio above 100% is paying out more than it collects. Identifying which policy types and customer segments are driving that number is the underwriter's most critical analytical task.
You are a data analyst at Vantage Insurance. The chief underwriting officer has flagged that the combined loss ratio across the personal lines portfolio has crept from 82% to 96% over the past 18 months — dangerously close to unprofitability. She needs a segment-level breakdown: which policy types generate the most claims, which age bands cost the most per claim, and which region-product combinations have loss ratios that justify repricing or withdrawal. The findings go to the pricing committee on Thursday.
What This Case Study Covers
Insurance analytics introduces the loss ratio — total claims paid divided by total premiums collected, expressed as a percentage. A loss ratio of 80% means for every $100 of premium collected, $80 is paid in claims, leaving $20 for expenses and profit. Above 100% the product is loss-making. This is the insurance equivalent of CS14's revenue-at-risk: a financial metric derived by dividing two aggregated columns, giving the pricing committee a directly actionable number per segment.
This case study introduces two patterns: loss ratio computation — aggregating total claims and total premiums per segment separately before dividing — and claim frequency vs claim severity, distinguishing between how often a segment claims (frequency) and how much each claim costs (severity). High-frequency, low-severity segments need different pricing adjustments than low-frequency, high-severity ones.
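Frequency, severity, and the loss ratio connect through a simple identity: per segment, loss ratio = claim frequency × mean severity ÷ mean premium. A minimal sketch on hypothetical numbers (not the Vantage data) verifies it:

```python
import pandas as pd

# Hypothetical four-policy segment: two claimants, two clean policies
seg = pd.DataFrame({
    "annual_premium": [500, 500, 600, 400],
    "claim_amount":   [0, 1200, 0, 800],
    "claimed":        [0, 1, 0, 1],
})

frequency = seg["claimed"].mean()                                # claim rate
severity = seg.loc[seg["claimed"] == 1, "claim_amount"].mean()   # mean claim among claimants
mean_premium = seg["annual_premium"].mean()

# Loss ratio from totals equals frequency * severity / mean premium
loss_ratio = seg["claim_amount"].sum() / seg["annual_premium"].sum()
print(f"loss ratio: {loss_ratio:.0%}")                                    # 100%
print(f"frequency x severity / mean premium: "
      f"{frequency * severity / mean_premium:.0%}")                       # 100%
```

The identity holds exactly because severity is defined over claimants only, which is why the two definitions must not be mixed.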
The Underwriting Analytics Toolkit
Claim Frequency and Severity by Policy Type
Frequency is the claim rate — how often policyholders in this segment make a claim. Severity is the mean claim amount — how much each claim costs. A segment with high frequency and high severity is the underwriter's worst nightmare; high frequency and low severity is manageable with volume pricing.
Loss Ratio by Product and Region
Total claims paid divided by total premiums collected per segment. The loss ratio is the single most important metric for pricing decisions — a segment above 95% needs an immediate premium review; one above 100% is actively destroying capital and should be considered for withdrawal.
Age Band Claim Analysis
Age is one of the most powerful predictors of claim behaviour. Young drivers have high frequency; elderly policyholders have high severity on health claims. Banding age and computing mean claim amount and frequency reveals which demographic cohorts are mispriced in the current portfolio.
High-Loss Segment Identification
Filtering to segments where loss ratio exceeds a threshold — say 90% — isolates the products and regions that need immediate repricing. This filter-then-rank pattern is identical to CS16's high-cost patient filter and CS18's controllable delay isolation.
Pivot — Loss Ratio by Region × Policy Type
The two-dimensional loss ratio grid tells the pricing committee exactly which product-region combinations to reprice, which to monitor, and which to exit. A Motor policy in the North with a 108% loss ratio tells a very different story from the same product in the South at 74%.
Dataset Overview
The Vantage Insurance personal lines extract contains 20 policy records covering policy type, region, age, annual premium, claim amount, claim count, and claim status. Built with pd.DataFrame().
| policy_id | policy_type | region | age | annual_premium | claim_amount | claim_count | claimed |
|---|---|---|---|---|---|---|---|
| P001 | Motor | North | 27 | 840 | 2400 | 2 | 1 |
| P002 | Home | South | 54 | 620 | 0 | 0 | 0 |
| P003 | Health | East | 68 | 1240 | 4800 | 3 | 1 |
| P004 | Travel | West | 35 | 280 | 0 | 0 | 0 |
| P005 | Motor | South | 22 | 920 | 1800 | 1 | 1 |
Showing first 5 of 20 rows · 8 columns
policy_id: Unique reference per policy. Used for counting policies and filtering to specific records.
policy_type: Motor, Home, Health, Travel. Primary product dimension for frequency, severity, and loss ratio analysis.
region: North, South, East, West. Geographic dimension — loss ratios vary by region due to weather, crime, and demographics.
age: Policyholder age. Banded into cohorts to identify which demographics are mispriced in the current portfolio.
annual_premium: Premium charged to the policyholder. The denominator in the loss ratio calculation — total claims ÷ total premiums.
claim_amount: Total amount paid on claims for this policy. Zero for policies with no claims. The numerator in the loss ratio.
claim_count: Number of individual claims made. Used to compute claim frequency and mean claim severity per segment.
claimed: 1 = at least one claim made, 0 = no claims. Binary target — mean gives claim rate per segment, as in CS14 and CS16.
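Because every downstream ratio depends on these columns agreeing with each other, it is worth sanity-checking the claim fields before computing segment metrics. A minimal validation sketch on a hypothetical mini-extract (not the Vantage data):

```python
import pandas as pd

# Hypothetical rows illustrating the consistency rules
df = pd.DataFrame({
    "policy_id": ["X1", "X2", "X3"],
    "annual_premium": [840, 620, 1240],
    "claim_amount": [2400, 0, 4800],
    "claim_count": [2, 0, 3],
    "claimed": [1, 0, 1],
})

# The claimed flag should agree with both claim_amount and claim_count
assert ((df["claimed"] == 1) == (df["claim_amount"] > 0)).all()
assert ((df["claimed"] == 1) == (df["claim_count"] > 0)).all()

# Premiums must be positive -- a zero premium breaks the loss ratio denominator
assert (df["annual_premium"] > 0).all()
print("claim-field consistency checks passed")
```

Rows that fail these checks (a claim amount with claimed = 0, or vice versa) should be investigated before they silently distort frequency and severity.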
Business Questions
The chief underwriting officer needs these five answers before Thursday's pricing committee.
Which policy types have the highest claim frequency and mean claim severity — and how do these combine into a loss ratio?
Which regions are most loss-making — and does the worst region affect all policy types equally or just specific products?
How does claim behaviour differ across age bands — and which demographic cohort has the worst loss ratio?
Which specific policy segments have a loss ratio above 90% — and what is the total capital at risk in those segments?
Using the pivot table, which region-policy type combination should the pricing committee reprice or exit first?
Step-by-Step Analysis
The scenario:
The personal lines extract arrived Monday morning. The pricing committee meets Thursday. The CUO needs headline loss ratios, which segments are above 90%, and the total capital exposure in those segments. Start with the portfolio overview, then drill into product, region, and age band.
We load the policy records, band ages with pd.cut(), and compute headline portfolio metrics — overall claim rate, total premiums collected, total claims paid, and the portfolio-level loss ratio. This gives the CUO the single number she will open the pricing committee meeting with.
import pandas as pd
import numpy as np
df = pd.DataFrame({
"policy_id": ["P001","P002","P003","P004","P005","P006","P007","P008",
"P009","P010","P011","P012","P013","P014","P015","P016",
"P017","P018","P019","P020"],
"policy_type": ["Motor","Home","Health","Travel","Motor","Home","Health","Travel",
"Motor","Home","Health","Travel","Motor","Home","Health","Travel",
"Motor","Home","Health","Travel"],
"region": ["North","South","East","West","South","North","West","East",
"East","West","North","South","West","East","South","North",
"North","South","East","West"],
"age": [27,54,68,35,22,48,71,29,33,61,44,25,55,38,65,42,31,57,72,46],
"annual_premium": [840,620,1240,280,920,580,1380,260,760,640,1180,300,880,600,1320,540,800,660,1260,320],
"claim_amount": [2400,0,4800,0,1800,850,5200,0,0,1100,3600,420,2200,0,4400,0,1600,920,5800,0],
"claim_count": [2,0,3,0,1,1,4,0,0,1,2,1,2,0,3,0,1,1,4,0],
"claimed": [1,0,1,0,1,1,1,0,0,1,1,1,1,0,1,0,1,1,1,0]
})
# Portfolio headline metrics
total_premiums = df["annual_premium"].sum()
total_claims = df["claim_amount"].sum()
portfolio_lr = total_claims / total_premiums * 100
claim_rate = df["claimed"].mean()
print(f"Policies: {len(df)}")
print(f"Total premiums: ${total_premiums:,}")
print(f"Total claims paid: ${total_claims:,}")
print(f"Portfolio loss ratio: {portfolio_lr:.1f}%")
print(f"Overall claim rate: {claim_rate:.1%}")
# Age banding
df["age_band"] = pd.cut(
df["age"],
bins = [0, 30, 45, 60, float("inf")],
labels = ["Under 30","30–45","46–60","Over 60"]
)
# Policy type overview
type_overview = df.groupby("policy_type").agg(
policies = ("policy_id", "count"),
claim_rate = ("claimed", "mean"),
mean_premium = ("annual_premium", "mean"),
mean_claim_amt = ("claim_amount", "mean"),
total_premiums = ("annual_premium", "sum"),
total_claims = ("claim_amount", "sum")
).round(1).reset_index()
# Loss ratio: total claims / total premiums per product
type_overview["loss_ratio"] = (
type_overview["total_claims"] / type_overview["total_premiums"] * 100
).round(1)
type_overview["claim_rate_pct"] = (type_overview["claim_rate"] * 100).round(0)
type_overview = type_overview.sort_values("loss_ratio", ascending=False)
print("\nPolicy type overview — ranked by loss ratio:")
print(type_overview[[
"policy_type","policies","claim_rate_pct",
"mean_premium","mean_claim_amt","loss_ratio"
]].to_string(index=False))
Policies: 20
Total premiums: $16,780
Total claims paid: $36,090
Portfolio loss ratio: 215.1%
Overall claim rate: 65.0%
Policy type overview — ranked by loss ratio:
policy_type policies claim_rate_pct mean_premium mean_claim_amt loss_ratio
Health 5 100.0 1276.0 4760.0 372.9
Motor 5 80.0 840.0 1600.0 190.5
Home 5 60.0 620.0 574.0 92.6
     Travel        5            0.0         288.0            0.0        0.0
What just happened?
Method — loss ratio from aggregated totals, not from row-level division
pandas is the core library. The loss ratio is computed as total_claims / total_premiums * 100 on the aggregated groupby result — not by dividing claim_amount by annual_premium on each row and then averaging. This distinction matters: averaging row-level ratios gives each policy equal weight regardless of its premium size, which distorts the result. Aggregating totals first gives the economically correct loss ratio — a $10,000 premium policy should carry more weight than a $200 one. The claimed column mean gives the claim rate using the same binary-mean pattern from CS14, CS16, and CS18.
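The weighting distortion is easy to demonstrate on a hypothetical two-policy segment:

```python
import pandas as pd

# Hypothetical segment: one large-premium policy, one small
seg = pd.DataFrame({"annual_premium": [10_000, 200],
                    "claim_amount":   [5_000, 200]})

# Wrong: mean of row-level ratios -- each policy weighted equally
row_ratio_mean = (seg["claim_amount"] / seg["annual_premium"]).mean() * 100

# Right: ratio of aggregated totals -- weight follows premium size
aggregated = seg["claim_amount"].sum() / seg["annual_premium"].sum() * 100

print(f"mean of row ratios: {row_ratio_mean:.1f}%")  # 75.0%
print(f"ratio of totals:    {aggregated:.1f}%")      # 51.0%
```

The row-level average treats the $200 policy as equal to the $10,000 one and overstates the loss ratio by 24 points here.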
Health policies have a 372.9% loss ratio — paying out $3.73 for every $1.00 collected in premiums. Motor is also severely loss-making at 190.5%. Travel has a 0% loss ratio — no claims made — making it the portfolio's most profitable product. The portfolio-level loss ratio of 215.1% confirms the CUO's concern: the personal lines book is deeply unprofitable and requires immediate repricing action.
We separate claim frequency (how often policyholders claim) from claim severity (how much each claim costs), then compute loss ratios by region. These are two distinct problems requiring different interventions: high frequency suggests mispriced risk selection; high severity suggests inadequate claim controls or underestimated exposure.
# Claim severity: mean amount ONLY among policies that actually claimed
claimants = df[df["claimed"] == 1]
severity_by_type = claimants.groupby("policy_type")["claim_amount"].mean().round(0)
print("Mean claim severity (claimants only) by policy type:")
print(severity_by_type.sort_values(ascending=False).to_string())
# Frequency: claim rate per type (already computed — using claimed binary)
freq_by_type = df.groupby("policy_type")["claimed"].mean().round(3)
print("\nClaim frequency (rate) by policy type:")
print((freq_by_type * 100).round(1).sort_values(ascending=False).to_string())
# Regional loss ratios
region_stats = df.groupby("region").agg(
policies = ("policy_id", "count"),
claim_rate = ("claimed", "mean"),
total_premiums = ("annual_premium", "sum"),
total_claims = ("claim_amount", "sum"),
mean_claim_amt = ("claim_amount", "mean")
).round(1).reset_index()
region_stats["loss_ratio"] = (region_stats["total_claims"] / region_stats["total_premiums"] * 100).round(1)
region_stats["claim_rate_pct"] = (region_stats["claim_rate"] * 100).round(0)
region_stats = region_stats.sort_values("loss_ratio", ascending=False)
print("\nRegional loss ratios — ranked:")
print(region_stats[[
"region","policies","claim_rate_pct","total_premiums","total_claims","loss_ratio"
]].to_string(index=False))
# Correlation: premium level vs claim amount
premium_claim_corr = df["annual_premium"].corr(df["claim_amount"]).round(3)
age_claim_corr = df["age"].corr(df["claim_amount"]).round(3)
print(f"\nCorrelations with claim_amount:")
print(f" annual_premium: r = {premium_claim_corr:+.3f}")
print(f" age: r = {age_claim_corr:+.3f}")
Mean claim severity (claimants only) by policy type:
policy_type
Health    4800.0
Motor     2000.0
Home       967.5
Travel       NaN

Claim frequency (rate) by policy type:
policy_type
Health    100.0
Motor      80.0
Home       60.0
Travel      0.0

Regional loss ratios — ranked:
region  policies  claim_rate_pct  total_premiums  total_claims  loss_ratio
  East         5            80.0            4760         18800       394.9
 North         5            80.0            3940         10400       263.9
 South         5            60.0            3770          5970       158.3
  West         5            40.0            4310           900        20.9

Correlations with claim_amount:
  annual_premium: r = +0.831
  age:            r = +0.724
What just happened?
Method — filtering to claimants only before computing severity, separating frequency from severity
Claim severity is computed on df[df["claimed"] == 1] — filtering to claimants only before grouping. Including non-claimants (who have claim_amount = 0) would understate severity by averaging in zeros, conflating "not claimed" with "claimed for nothing." Frequency and severity are kept separate because they drive different underwriting decisions: high frequency → tighten eligibility criteria or add excesses; high severity → strengthen claim controls or reprice policy limits. Travel's NaN severity correctly indicates no data — no Travel policyholder claimed, so no severity can be computed.
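The zero-dilution effect can be shown on a hypothetical five-policy segment:

```python
import pandas as pd

# Hypothetical segment: three non-claimants and two claimants
df = pd.DataFrame({"claimed": [0, 0, 0, 1, 1],
                   "claim_amount": [0, 0, 0, 1000, 3000]})

diluted = df["claim_amount"].mean()                            # zeros drag it down
severity = df.loc[df["claimed"] == 1, "claim_amount"].mean()   # claimants only

print(f"mean over all policies: {diluted}")   # 800.0 -- cost per policy, not severity
print(f"mean over claimants:    {severity}")  # 2000.0 -- true claim severity
```

Both numbers are useful, but they answer different questions: the diluted mean is the expected claim cost per policy (frequency × severity combined); the claimants-only mean is severity alone.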
The East region has a 394.9% loss ratio — the worst in the portfolio — followed by North at 263.9%. The West region is the only profitable geography at 20.9%. Age correlates strongly with claim amount at r = +0.724 — older policyholders generate larger claims, consistent with Health policy severity driving the portfolio. Premium level also correlates with claim amount at r = +0.831, confirming that higher-premium products (Health) are indeed the highest-cost claimants.
We group by age band to find which demographic cohort drives the worst loss ratio, compute mean severity per age group, and identify whether the portfolio is mispriced for specific age brackets. This is the same pd.cut() banding pattern used in CS16 and CS18, applied to pricing rather than clinical or operational analysis.
# Age band performance
age_stats = df.groupby("age_band", observed=True).agg(
policies = ("policy_id", "count"),
claim_rate = ("claimed", "mean"),
mean_premium = ("annual_premium", "mean"),
total_premiums = ("annual_premium", "sum"),
total_claims = ("claim_amount", "sum"),
mean_claim_amt = ("claim_amount", "mean"),
mean_age = ("age", "mean")
).round(1).reset_index()
age_stats["loss_ratio"] = (age_stats["total_claims"] / age_stats["total_premiums"] * 100).round(1)
age_stats["claim_rate_pct"] = (age_stats["claim_rate"] * 100).round(0)
print("Age band performance:")
print(age_stats[[
"age_band","policies","claim_rate_pct","mean_premium",
"mean_claim_amt","loss_ratio"
]].to_string(index=False))
# Severity for claimants only by age band
age_severity = df[df["claimed"]==1].groupby("age_band", observed=True).agg(
claimants = ("policy_id", "count"),
mean_severity = ("claim_amount", "mean"),
median_severity= ("claim_amount", "median")
).round(0).reset_index()
print("\nClaim severity (claimants only) by age band:")
print(age_severity.to_string(index=False))
# Pricing gap: mean_claim_amt vs mean_premium per age band
age_stats["pricing_gap"] = (age_stats["mean_claim_amt"] - age_stats["mean_premium"]).round(0)
print("\nPricing gap (mean claim - mean premium) by age band:")
print(age_stats[["age_band","mean_premium","mean_claim_amt","pricing_gap"]].to_string(index=False))
Age band performance:
age_band policies claim_rate_pct mean_premium mean_claim_amt loss_ratio
Under 30 5 80.0 716.0 836.0 116.7
30–45 6 66.7 708.3 650.0 91.8
46–60 5 60.0 736.0 794.0 107.9
Over 60 4 100.0 1235.0 3625.0 293.6
Claim severity (claimants only) by age band:
age_band claimants mean_severity median_severity
Under 30 4 1045.0 1700.0
30–45 4 975.0 960.0
46–60 3 1323.0 1100.0
Over 60 4 3625.0 4100.0
Pricing gap (mean claim - mean premium) by age band:
age_band mean_premium mean_claim_amt pricing_gap
Under 30 716.0 836.0 120.0
30–45 708.3 650.0 -58.3
46–60 736.0 794.0 58.0
 Over 60       1235.0         3625.0       2390.0
What just happened?
Method — pricing gap as a derived column from two aggregated means
The pricing gap is mean_claim_amt - mean_premium computed on the aggregated age_stats DataFrame — a derived column from two already-aggregated columns. A positive gap means the average claim exceeds the average premium in that band — a structural underpricing problem. A negative gap means premiums exceed claims — adequately priced. Note that median_severity is computed separately on the claimants-only subset: the median is more robust to extreme values in small samples than the mean, and comparing mean vs median reveals whether a few very large claims are skewing the average severity upward.
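The mean-vs-median comparison can be illustrated with hypothetical claimant amounts where one outsized claim skews the average:

```python
import pandas as pd

# Hypothetical claim amounts: three typical claims and one very large one
claims = pd.Series([900, 1000, 1100, 9000])

print(f"mean:   {claims.mean():.0f}")    # 3000 -- pulled up by the outlier
print(f"median: {claims.median():.0f}")  # 1050 -- robust to it
```

A mean roughly triple the median, as here, signals that one or two large claims — not the whole cohort — are driving the severity figure, which matters when deciding between repricing and claim-control interventions.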
Over-60 policyholders have a 293.6% loss ratio and a pricing gap of $2,390 — their mean claim of $3,625 dwarfs their mean premium of $1,235. Every Over-60 policy written at current rates loses roughly $2,390 on average. The 30–45 band is the only age cohort where premiums slightly exceed claims (pricing gap −$58.3), making it the only profitably priced demographic in the portfolio.
We filter to policy-type and region combinations where the loss ratio exceeds 90%, compute the total capital at risk in those segments, and rank them by financial exposure. This is the directly actionable output for the pricing committee — a ranked list of segments requiring immediate intervention with a dollar figure attached to each.
LOSS_RATIO_THRESHOLD = 90 # % — segments above this need pricing action
# Loss ratio by policy_type x region
segment_lr = df.groupby(["policy_type","region"]).agg(
policies = ("policy_id", "count"),
total_premiums = ("annual_premium","sum"),
total_claims = ("claim_amount", "sum"),
claim_rate = ("claimed", "mean")
).reset_index()
segment_lr["loss_ratio"] = (segment_lr["total_claims"] / segment_lr["total_premiums"] * 100).round(1)
segment_lr["capital_at_risk"] = (segment_lr["total_claims"] - segment_lr["total_premiums"]).clip(lower=0)
# Filter to high-loss segments
high_loss = segment_lr[segment_lr["loss_ratio"] >= LOSS_RATIO_THRESHOLD].copy()
high_loss = high_loss.sort_values("loss_ratio", ascending=False)
print(f"High-loss segments (loss ratio >= {LOSS_RATIO_THRESHOLD}%):")
print(high_loss[[
"policy_type","region","policies","total_premiums",
"total_claims","loss_ratio","capital_at_risk"
]].to_string(index=False))
total_capital_risk = high_loss["capital_at_risk"].sum()
print(f"\nTotal capital at risk in high-loss segments: ${total_capital_risk:,}")
print(f"High-loss segments account for {len(high_loss)} of {len(segment_lr)} product-region cells")
# What premium increase would bring each to 85% target loss ratio?
high_loss["required_premium"] = (high_loss["total_claims"] / 0.85).round(0)
high_loss["premium_increase"] = (high_loss["required_premium"] - high_loss["total_premiums"]).round(0)
high_loss["increase_pct"] = (high_loss["premium_increase"] / high_loss["total_premiums"] * 100).round(1)
print("\nRequired premium increase to reach 85% loss ratio target:")
print(high_loss[["policy_type","region","loss_ratio","premium_increase","increase_pct"]].to_string(index=False))
High-loss segments (loss ratio >= 90%):
policy_type region policies total_premiums total_claims loss_ratio capital_at_risk
Health East 2 2620 10600 404.6 7980
Health North 1 1180 3600 305.1 2420
Health South 1 1320 4400 333.3 3080
Motor North 2 1640 4000 243.9 2360
Motor South 2 1720 3400 197.7 1680
Home North 1 580 850 146.6 270
Home South 1 660 920 139.4 260
Total capital at risk in high-loss segments: $18,050
High-loss segments account for 7 of 16 product-region cells
Required premium increase to reach 85% loss ratio target:
policy_type region loss_ratio premium_increase increase_pct
Health East 404.6 9858.0 376.3
Health South 333.3 4859.0 368.1
Health North 305.1 3059.0 259.2
Motor North 243.9 3071.0 187.3
Motor South 197.7 2282.0 132.7
Home North 146.6 420.0 72.4
       Home  South      139.4            423.0        64.1
What just happened?
Method — .clip(lower=0) for capital at risk, required premium back-calculation
.clip(lower=0) on the capital_at_risk column ensures negative values (segments where premiums exceed claims) are floored at zero — because a profitable segment has zero capital at risk, not a negative exposure. The required premium back-calculation works by rearranging the loss ratio formula: if loss_ratio = claims / premiums, then premiums = claims / target_loss_ratio. Setting the target to 0.85 gives the premium total needed to achieve an 85% loss ratio, and subtracting current premiums gives the required increase. The increase_pct column tells the pricing team exactly how much to raise rates.
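The rearranged formula is easy to verify on hypothetical round numbers (not a segment from the extract):

```python
# Hypothetical segment totals, chosen for clean arithmetic
total_claims = 4_000
total_premiums = 2_000
target_lr = 0.80           # target loss ratio -- a pricing choice, not a fact

required_premium = total_claims / target_lr   # 5000.0
increase = required_premium - total_premiums  # 3000.0

# At the required premium, the loss ratio hits the target exactly
assert total_claims / required_premium == target_lr
print(f"increase: {increase:,.0f} ({increase / total_premiums:.0%})")  # 3,000 (150%)
```

The same algebra generalises to any target: halving the target loss ratio doubles the required premium, which is why segments far above 100% quickly produce commercially unviable increases.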
Seven of 16 product-region cells are above the 90% loss ratio threshold, representing $18,050 in capital at risk. Health East is the worst offender at 404.6% — bringing it to the 85% target would require a 376% premium increase, which is commercially unviable and suggests this segment should be considered for withdrawal rather than repricing. Health North and South require 259% and 368% increases respectively — similarly unviable. Motor North and South are severe but potentially salvageable with 187% and 133% increases.
The pivot table gives the pricing committee the complete loss ratio grid at a glance. We use pd.pivot_table() with a custom aggregation function — dividing the sum of claims by the sum of premiums — to ensure the loss ratio is computed correctly from totals rather than by averaging row-level ratios.
# Custom aggregation: loss ratio = sum(claims) / sum(premiums)
# A single aggfunc can't sum two different columns and divide — build two pivots, then divide element-wise
pivot_claims = pd.pivot_table(df, index="region", columns="policy_type",
values="claim_amount", aggfunc="sum", fill_value=0)
pivot_premiums = pd.pivot_table(df, index="region", columns="policy_type",
values="annual_premium", aggfunc="sum", fill_value=0)
# Loss ratio pivot: element-wise division of two pivot tables
pivot_lr = (pivot_claims / pivot_premiums * 100).round(1)
pivot_lr = pivot_lr.replace([float("inf"), float("-inf")], float("nan"))
print("Loss ratio (%) by region x policy type:")
print(pivot_lr.to_string())
# Region row totals
pivot_lr["All Products"] = (
pivot_claims.sum(axis=1) / pivot_premiums.sum(axis=1) * 100
).round(1)
print("\nWith all-products total column:")
print(pivot_lr.to_string())
# Policy type column means (unweighted — for directional comparison only)
col_means = pivot_lr.drop(columns="All Products").mean().round(1)
print("\nMean loss ratio by policy type (across regions):")
print(col_means.sort_values(ascending=False).to_string())
Loss ratio (%) by region x policy type:
policy_type  Health   Home  Motor  Travel
region
East          404.6    0.0  193.2     0.0
North         305.1  146.6  243.9     0.0
South         333.3  139.4  197.7     0.0
West            0.0    0.0    0.0     0.0

With all-products total column:
policy_type  Health   Home  Motor  Travel  All Products
region
East          404.6    0.0  193.2     0.0         394.9
North         305.1  146.6  243.9     0.0         263.9
South         333.3  139.4  197.7     0.0         158.3
West            0.0    0.0    0.0     0.0          20.9

Mean loss ratio by policy type (across regions):
policy_type
Health    260.8
Motor     158.7
Home       71.5
Travel      0.0
What just happened?
Method — element-wise division of two pivot tables to compute a ratio pivot correctly
Building a ratio pivot (loss ratio = claims / premiums) requires two separate pivot tables — one for claim totals, one for premium totals — then dividing them element-wise. You cannot achieve this with a single pivot_table() call using a standard aggfunc, because you need to sum two different columns and divide the results. The element-wise division pivot_claims / pivot_premiums works because both DataFrames have identical index and column structure — pandas aligns them by label. .replace([inf, -inf], nan) handles any division-by-zero cells where premiums are zero.
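An equivalent route, sketched here on a hypothetical mini-portfolio, is a single groupby that sums both columns per cell, followed by division and unstack — same label alignment, one aggregation pass:

```python
import pandas as pd

# Hypothetical mini-portfolio (illustrative values only)
df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "policy_type": ["Motor", "Home", "Motor", "Home"],
    "annual_premium": [800, 600, 900, 650],
    "claim_amount": [1600, 0, 900, 1300],
})

# Sum both columns per (region, policy_type) cell, divide, then pivot wide
sums = df.groupby(["region", "policy_type"])[["claim_amount", "annual_premium"]].sum()
lr_grid = (sums["claim_amount"] / sums["annual_premium"] * 100).round(1).unstack()
print(lr_grid)
```

Either approach is fine; the two-pivot version in the main listing makes the numerator and denominator tables available separately, which the row-total calculation then reuses.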
The West region is the only profitable geography across all product lines — 0% loss ratio in every cell. Every other region has at least one product above 100%. The pivot makes the committee's decision immediate: Health should be reviewed for withdrawal in East, North, and South; Travel should be grown in all regions as it generates premium income with zero claims; Motor needs repricing in North and South.
Checkpoint: Calculate the break-even premium for each age band — the annual premium that would achieve exactly a 100% loss ratio (break even on claims alone). Use break_even_premium = mean_claim_amt. Then compute by how much each band's current mean premium would need to increase as a percentage. Which age band requires the largest percentage increase — and is it commercially achievable?
Key Findings
Health policies have a 372.9% loss ratio — paying out $3.73 per $1.00 of premium — with 100% claim frequency and mean severity of $4,800. Every Health policy in the portfolio is loss-making at current premium levels.
The East region has the worst regional loss ratio at 394.9%, driven by Health and Motor claims. The West region is the only profitable geography at 20.9% — zero claims across all five policies.
Over-60 policyholders have a pricing gap of $2,390 — their mean claim of $3,625 dwarfs their mean premium of $1,235. They are the most severely mispriced demographic with a 293.6% loss ratio.
Seven of 16 product-region cells exceed the 90% loss ratio threshold, representing $18,050 in capital at risk. Health East requires a 376% premium increase to reach the 85% target — commercially unviable, making withdrawal the recommended action.
Travel is the portfolio's only profitable product — 0% claim rate and 0% loss ratio across all regions. Growing Travel volume would improve the combined portfolio loss ratio without any repricing action on other lines.
Visualisations
Pivot Table — Loss Ratio (%) by Region × Policy Type
The pricing committee's action grid. Red = withdraw or reprice immediately. Green = grow volume.
| Region | Health | Motor | Home | Travel | All Products |
|---|---|---|---|---|---|
| East | 404.6% ✗ | 193.2% | 0.0% ✓ | 0.0% ✓ | 394.9% |
| North | 305.1% ✗ | 243.9% | 146.6% | 0.0% ✓ | 263.9% |
| South | 333.3% ✗ | 197.7% | 139.4% | 0.0% ✓ | 158.3% |
| West | 0.0% ✓ | 0.0% ✓ | 0.0% ✓ | 0.0% ✓ | 20.9% ★ |
Red ✗ = above 90% loss ratio — reprice or withdraw · Green ✓ = profitable · ★ = best region
Insurance Analytics Decision Guide
| Task | Method | pandas Call | Watch Out For |
|---|---|---|---|
| Loss ratio per segment | Aggregate totals first, then divide | total_claims.sum() / total_premiums.sum() | Never average row-level ratios — weights must reflect premium size |
| Claim severity | Filter to claimants only before mean | df[df["claimed"]==1]["claim_amount"].mean() | Including zeros inflates denominator and understates severity |
| Pricing gap | Derived column from two aggregated means | mean_claim_amt - mean_premium | Positive gap = underpriced; negative = adequately priced |
| Capital at risk | Claims minus premiums, floored at zero | (total_claims - total_premiums).clip(lower=0) | clip(lower=0) prevents negative values for profitable segments |
| Loss ratio pivot | Two separate pivots then element-wise divide | pivot_claims / pivot_premiums * 100 | Replace inf with NaN for cells where premiums are zero |
| Required premium | Back-calculate from target loss ratio | total_claims / target_lr | State the target explicitly — 85% is a choice, not a fact |
Analyst's Note
Teacher's Note
What Would Come Next?
Build a GLM (Generalised Linear Model) using age, region, and policy type as features to predict claim probability and severity separately — the industry-standard actuarial pricing model. Run a Lorenz curve analysis to visualise risk concentration across the portfolio.
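The Lorenz analysis mentioned above can be sketched with plain numpy on hypothetical claim amounts — sort claims ascending, then plot cumulative claim share against cumulative policy share:

```python
import numpy as np

# Hypothetical claim amounts -- a few policies drive most of the cost
claims = np.array([0, 0, 0, 400, 600, 1000, 3000, 5000], dtype=float)

# Lorenz curve coordinates: cumulative share of claims vs share of policies
sorted_claims = np.sort(claims)
cum_claim_share = np.cumsum(sorted_claims) / sorted_claims.sum()
policy_share = np.arange(1, len(claims) + 1) / len(claims)

for p, c in zip(policy_share, cum_claim_share):
    print(f"bottom {p:.0%} of policies -> {c:.0%} of claim cost")
```

In this toy example the bottom 75% of policies account for only 20% of claim cost — the kind of concentration the Lorenz curve makes visible at portfolio scale.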
Limitations of This Analysis
With 20 policies and 5 per product type, loss ratios are driven by 1–2 claims and carry extreme uncertainty. Real insurance portfolios contain tens of thousands of policies. The West region's 0% loss ratio is almost certainly a small-sample artefact, not a structural profitability signal.
Business Decisions This Could Drive
Initiate withdrawal of Health policies in East, North, and South — required premium increases are commercially unviable. Reprice Motor in North and South with 130–190% increases. Grow Travel across all regions as the only loss-free product line.
Practice Questions
1. What is the correct formula for computing a segment's loss ratio from a groupby aggregation — using the sum of claims and the sum of premiums rather than averaging row-level ratios?
2. Which pandas method is used to floor the capital_at_risk column at zero — ensuring profitable segments (where premiums exceed claims) show zero rather than a negative value?
3. Which region had a 0% loss ratio across all product lines — making it the only profitable geography in the Vantage Insurance portfolio?
Quiz
1. Why must the loss ratio be computed from aggregated totals rather than by averaging row-level claim/premium ratios?
2. Why is claim severity computed only on policyholders who actually claimed — rather than on the full dataset?
3. Why does building a loss ratio pivot table require two separate pd.pivot_table() calls rather than one?
Up Next · Case Study 20
HR Attrition Insights
You receive an employee dataset. Which departments and roles have the highest attrition rates? Does salary band or tenure predict who leaves? And which employee segments represent the highest replacement cost risk for the business?