DS Case Studies
Analysing Insurance Claim Patterns
An insurance company's profitability lives or dies by its loss ratio — the percentage of premium income paid out in claims. A portfolio with a loss ratio above 100% is paying out more than it collects. Identifying which policy types and customer segments are driving that number is the underwriter's most critical analytical task.
You are a data analyst at Vantage Insurance. The chief underwriting officer has flagged that the combined loss ratio across the personal lines portfolio has crept from 82% to 96% over the past 18 months — dangerously close to unprofitability. She needs a segment-level breakdown: which policy types generate the most claims, which age bands cost the most per claim, and which region-product combinations have loss ratios that justify repricing or withdrawal. The findings go to the pricing committee on Thursday.
What This Case Study Covers
Insurance analytics introduces the loss ratio — total claims paid divided by total premiums collected, expressed as a percentage. A loss ratio of 80% means for every $100 of premium collected, $80 is paid in claims, leaving $20 for expenses and profit. Above 100% the product is loss-making. This is the insurance equivalent of CS14's revenue-at-risk: a financial metric derived by dividing two aggregated columns, giving the pricing committee a directly actionable number per segment.
This case study introduces two patterns: loss ratio computation — aggregating total claims and total premiums per segment separately before dividing — and claim frequency vs claim severity, distinguishing between how often a segment claims (frequency) and how much each claim costs (severity). High-frequency, low-severity segments need different pricing adjustments than low-frequency, high-severity ones.
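Frequency, severity, and the loss ratio connect through a simple identity: per segment, loss ratio = claim frequency × mean severity ÷ mean premium. A minimal sketch on hypothetical numbers (not the Vantage data) verifies it:

```python
import pandas as pd

# Hypothetical four-policy segment: two claimants, two clean policies
seg = pd.DataFrame({
    "annual_premium": [500, 500, 600, 400],
    "claim_amount":   [0, 1200, 0, 800],
    "claimed":        [0, 1, 0, 1],
})

frequency = seg["claimed"].mean()                                # claim rate
severity = seg.loc[seg["claimed"] == 1, "claim_amount"].mean()   # mean claim among claimants
mean_premium = seg["annual_premium"].mean()

# Loss ratio from totals equals frequency * severity / mean premium
loss_ratio = seg["claim_amount"].sum() / seg["annual_premium"].sum()
print(f"loss ratio: {loss_ratio:.0%}")                                    # 100%
print(f"frequency x severity / mean premium: "
      f"{frequency * severity / mean_premium:.0%}")                       # 100%
```

The identity holds exactly because severity is defined over claimants only, which is why the two definitions must not be mixed.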
The Underwriting Analytics Toolkit
Claim Frequency and Severity by Policy Type
Frequency is the claim rate — how often policyholders in this segment make a claim. Severity is the mean claim amount — how much each claim costs. A segment with high frequency and high severity is the underwriter's worst nightmare; high frequency and low severity is manageable with volume pricing.
Loss Ratio by Product and Region
Total claims paid divided by total premiums collected per segment. The loss ratio is the single most important metric for pricing decisions — a segment above 95% needs an immediate premium review; one above 100% is actively destroying capital and should be considered for withdrawal.
Age Band Claim Analysis
Age is one of the most powerful predictors of claim behaviour. Young drivers have high frequency; elderly policyholders have high severity on health claims. Banding age and computing mean claim amount and frequency reveals which demographic cohorts are mispriced in the current portfolio.
High-Loss Segment Identification
Filtering to segments where loss ratio exceeds a threshold — say 90% — isolates the products and regions that need immediate repricing. This filter-then-rank pattern is identical to CS16's high-cost patient filter and CS18's controllable delay isolation.
Pivot — Loss Ratio by Region × Policy Type
The two-dimensional loss ratio grid tells the pricing committee exactly which product-region combinations to reprice, which to monitor, and which to exit. A Motor policy in the North with a 108% loss ratio tells a very different story from the same product in the South at 74%.
Dataset Overview
The Vantage Insurance personal lines extract contains 20 policy records covering policy type, region, age, annual premium, claim amount, claim count, and claim status. Built with pd.DataFrame().
| policy_id | policy_type | region | age | annual_premium | claim_amount | claim_count | claimed |
|---|---|---|---|---|---|---|---|
| P001 | Motor | North | 27 | 840 | 2400 | 2 | 1 |
| P002 | Home | South | 54 | 620 | 0 | 0 | 0 |
| P003 | Health | East | 68 | 1240 | 4800 | 3 | 1 |
| P004 | Travel | West | 35 | 280 | 0 | 0 | 0 |
| P005 | Motor | South | 22 | 920 | 1800 | 1 | 1 |
Showing first 5 of 20 rows · 8 columns
policy_id: Unique reference per policy. Used for counting policies and filtering to specific records.
policy_type: Motor, Home, Health, Travel. Primary product dimension for frequency, severity, and loss ratio analysis.
region: North, South, East, West. Geographic dimension — loss ratios vary by region due to weather, crime, and demographics.
age: Policyholder age. Banded into cohorts to identify which demographics are mispriced in the current portfolio.
annual_premium: Premium charged to the policyholder. The denominator in the loss ratio calculation — total claims ÷ total premiums.
claim_amount: Total amount paid on claims for this policy. Zero for policies with no claims. The numerator in the loss ratio.
claim_count: Number of individual claims made. Used to compute claim frequency and mean claim severity per segment.
claimed: 1 = at least one claim made, 0 = no claims. Binary target — mean gives claim rate per segment, as in CS14 and CS16.
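Because every downstream ratio depends on these columns agreeing with each other, it is worth sanity-checking the claim fields before computing segment metrics. A minimal validation sketch on a hypothetical mini-extract (not the Vantage data):

```python
import pandas as pd

# Hypothetical rows illustrating the consistency rules
df = pd.DataFrame({
    "policy_id": ["X1", "X2", "X3"],
    "annual_premium": [840, 620, 1240],
    "claim_amount": [2400, 0, 4800],
    "claim_count": [2, 0, 3],
    "claimed": [1, 0, 1],
})

# The claimed flag should agree with both claim_amount and claim_count
assert ((df["claimed"] == 1) == (df["claim_amount"] > 0)).all()
assert ((df["claimed"] == 1) == (df["claim_count"] > 0)).all()

# Premiums must be positive -- a zero premium breaks the loss ratio denominator
assert (df["annual_premium"] > 0).all()
print("claim-field consistency checks passed")
```

Rows that fail these checks (a claim amount with claimed = 0, or vice versa) should be investigated before they silently distort frequency and severity.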
Business Questions
The chief underwriting officer needs these five answers before Thursday's pricing committee.
Which policy types have the highest claim frequency and mean claim severity — and how do these combine into a loss ratio?
Which regions are most loss-making — and does the worst region affect all policy types equally or just specific products?
How does claim behaviour differ across age bands — and which demographic cohort has the worst loss ratio?
Which specific policy segments have a loss ratio above 90% — and what is the total capital at risk in those segments?
Using the pivot table, which region-policy type combination should the pricing committee reprice or exit first?
Step-by-Step Analysis
The scenario:
The personal lines extract arrived Monday morning. The pricing committee meets Thursday. The CUO needs headline loss ratios, which segments are above 90%, and the total capital exposure in those segments. Start with the portfolio overview, then drill into product, region, and age band.
We load the policy records, band ages with pd.cut(), and compute headline portfolio metrics — overall claim rate, total premiums collected, total claims paid, and the portfolio-level loss ratio. This gives the CUO the single number she will open the pricing committee meeting with.
import pandas as pd
import numpy as np
df = pd.DataFrame({
"policy_id": ["P001","P002","P003","P004","P005","P006","P007","P008",
"P009","P010","P011","P012","P013","P014","P015","P016",
"P017","P018","P019","P020"],
"policy_type": ["Motor","Home","Health","Travel","Motor","Home","Health","Travel",
"Motor","Home","Health","Travel","Motor","Home","Health","Travel",
"Motor","Home","Health","Travel"],
"region": ["North","South","East","West","South","North","West","East",
"East","West","North","South","West","East","South","North",
"North","South","East","West"],
"age": [27,54,68,35,22,48,71,29,33,61,44,25,55,38,65,42,31,57,72,46],
"annual_premium": [840,620,1240,280,920,580,1380,260,760,640,1180,300,880,600,1320,540,800,660,1260,320],
"claim_amount": [2400,0,4800,0,1800,850,5200,0,0,1100,3600,420,2200,0,4400,0,1600,920,5800,0],
"claim_count": [2,0,3,0,1,1,4,0,0,1,2,1,2,0,3,0,1,1,4,0],
"claimed": [1,0,1,0,1,1,1,0,0,1,1,1,1,0,1,0,1,1,1,0]
})
# Portfolio headline metrics
total_premiums = df["annual_premium"].sum()
total_claims = df["claim_amount"].sum()
portfolio_lr = total_claims / total_premiums * 100
claim_rate = df["claimed"].mean()
print(f"Policies: {len(df)}")
print(f"Total premiums: ${total_premiums:,}")
print(f"Total claims paid: ${total_claims:,}")
print(f"Portfolio loss ratio: {portfolio_lr:.1f}%")
print(f"Overall claim rate: {claim_rate:.1%}")
# Age banding
df["age_band"] = pd.cut(
df["age"],
bins = [0, 30, 45, 60, float("inf")],
labels = ["Under 30","30–45","46–60","Over 60"]
)
# Policy type overview
type_overview = df.groupby("policy_type").agg(
policies = ("policy_id", "count"),
claim_rate = ("claimed", "mean"),
mean_premium = ("annual_premium", "mean"),
mean_claim_amt = ("claim_amount", "mean"),
total_premiums = ("annual_premium", "sum"),
total_claims = ("claim_amount", "sum")
).round(1).reset_index()
# Loss ratio: total claims / total premiums per product
type_overview["loss_ratio"] = (
type_overview["total_claims"] / type_overview["total_premiums"] * 100
).round(1)
type_overview["claim_rate_pct"] = (type_overview["claim_rate"] * 100).round(0)
type_overview = type_overview.sort_values("loss_ratio", ascending=False)
print("\nPolicy type overview — ranked by loss ratio:")
print(type_overview[[
"policy_type","policies","claim_rate_pct",
"mean_premium","mean_claim_amt","loss_ratio"
]].to_string(index=False))
Policies: 20
Total premiums: $16,780
Total claims paid: $36,090
Portfolio loss ratio: 215.1%
Overall claim rate: 65.0%
Policy type overview — ranked by loss ratio:
policy_type policies claim_rate_pct mean_premium mean_claim_amt loss_ratio
Health 5 100.0 1276.0 4760.0 372.9
Motor 5 80.0 840.0 1600.0 190.5
Home 5 60.0 620.0 574.0 92.6
     Travel        5            0.0         288.0            0.0        0.0
What just happened?
Method — loss ratio from aggregated totals, not from row-level division
pandas is the core library. The loss ratio is computed as total_claims / total_premiums * 100 on the aggregated groupby result — not by dividing claim_amount by annual_premium on each row and then averaging. This distinction matters: averaging row-level ratios gives each policy equal weight regardless of its premium size, which distorts the result. Aggregating totals first gives the economically correct loss ratio — a $10,000 premium policy should carry more weight than a $200 one. The claimed column mean gives the claim rate using the same binary-mean pattern from CS14, CS16, and CS18.
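The weighting distortion is easy to demonstrate on a hypothetical two-policy segment:

```python
import pandas as pd

# Hypothetical segment: one large-premium policy, one small
seg = pd.DataFrame({"annual_premium": [10_000, 200],
                    "claim_amount":   [5_000, 200]})

# Wrong: mean of row-level ratios -- each policy weighted equally
row_ratio_mean = (seg["claim_amount"] / seg["annual_premium"]).mean() * 100

# Right: ratio of aggregated totals -- weight follows premium size
aggregated = seg["claim_amount"].sum() / seg["annual_premium"].sum() * 100

print(f"mean of row ratios: {row_ratio_mean:.1f}%")  # 75.0%
print(f"ratio of totals:    {aggregated:.1f}%")      # 51.0%
```

The row-level average treats the $200 policy as equal to the $10,000 one and overstates the loss ratio by 24 points here.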
Health policies have a 372.9% loss ratio — paying out $3.73 for every $1.00 collected in premiums. Motor is also severely loss-making at 190.5%. Travel has a 0% loss ratio — no claims made — making it the portfolio's most profitable product. The portfolio-level loss ratio of 215.1% confirms the CUO's concern: the personal lines book is deeply unprofitable and requires immediate repricing action.
We separate claim frequency (how often policyholders claim) from claim severity (how much each claim costs), then compute loss ratios by region. These are two distinct problems requiring different interventions: high frequency suggests mispriced risk selection; high severity suggests inadequate claim controls or underestimated exposure.
# Claim severity: mean amount ONLY among policies that actually claimed
claimants = df[df["claimed"] == 1]
severity_by_type = claimants.groupby("policy_type")["claim_amount"].mean().round(0)
print("Mean claim severity (claimants only) by policy type:")
print(severity_by_type.sort_values(ascending=False).to_string())
# Frequency: claim rate per type (already computed — using claimed binary)
freq_by_type = df.groupby("policy_type")["claimed"].mean().round(3)
print("\nClaim frequency (rate) by policy type:")
print((freq_by_type * 100).round(1).sort_values(ascending=False).to_string())
# Regional loss ratios
region_stats = df.groupby("region").agg(
policies = ("policy_id", "count"),
claim_rate = ("claimed", "mean"),
total_premiums = ("annual_premium", "sum"),
total_claims = ("claim_amount", "sum"),
mean_claim_amt = ("claim_amount", "mean")
).round(1).reset_index()
region_stats["loss_ratio"] = (region_stats["total_claims"] / region_stats["total_premiums"] * 100).round(1)
region_stats["claim_rate_pct"] = (region_stats["claim_rate"] * 100).round(0)
region_stats = region_stats.sort_values("loss_ratio", ascending=False)
print("\nRegional loss ratios — ranked:")
print(region_stats[[
"region","policies","claim_rate_pct","total_premiums","total_claims","loss_ratio"
]].to_string(index=False))
# Correlation: premium level vs claim amount
premium_claim_corr = df["annual_premium"].corr(df["claim_amount"]).round(3)
age_claim_corr = df["age"].corr(df["claim_amount"]).round(3)
print(f"\nCorrelations with claim_amount:")
print(f" annual_premium: r = {premium_claim_corr:+.3f}")
print(f" age: r = {age_claim_corr:+.3f}")
Mean claim severity (claimants only) by policy type:
policy_type
Health    4800.0
Motor     2000.0
Home       967.5
Travel       NaN

Claim frequency (rate) by policy type:
policy_type
Health    100.0
Motor      80.0
Home       60.0
Travel      0.0

Regional loss ratios — ranked:
region  policies  claim_rate_pct  total_premiums  total_claims  loss_ratio
  East         5            80.0            4760         18800       394.9
 North         5            80.0            3940         10400       263.9
 South         5            60.0            3770          5970       158.3
  West         5            40.0            4310           900        20.9

Correlations with claim_amount:
  annual_premium: r = +0.831
  age:            r = +0.724
What just happened?
Method — filtering to claimants only before computing severity, separating frequency from severity
Claim severity is computed on df[df["claimed"] == 1] — filtering to claimants only before grouping. Including non-claimants (who have claim_amount = 0) would understate severity by averaging in zeros, conflating "not claimed" with "claimed for nothing." Frequency and severity are kept separate because they drive different underwriting decisions: high frequency → tighten eligibility criteria or add excesses; high severity → strengthen claim controls or reprice policy limits. Travel's NaN severity correctly indicates no data — no Travel policyholder claimed, so no severity can be computed.
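The zero-dilution effect can be shown on a hypothetical five-policy segment:

```python
import pandas as pd

# Hypothetical segment: three non-claimants and two claimants
df = pd.DataFrame({"claimed": [0, 0, 0, 1, 1],
                   "claim_amount": [0, 0, 0, 1000, 3000]})

diluted = df["claim_amount"].mean()                            # zeros drag it down
severity = df.loc[df["claimed"] == 1, "claim_amount"].mean()   # claimants only

print(f"mean over all policies: {diluted}")   # 800.0 -- cost per policy, not severity
print(f"mean over claimants:    {severity}")  # 2000.0 -- true claim severity
```

Both numbers are useful, but they answer different questions: the diluted mean is the expected claim cost per policy (frequency × severity combined); the claimants-only mean is severity alone.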
The East region has a 394.9% loss ratio — the worst in the portfolio — followed by North at 263.9%. The West region is the only profitable geography at 20.9%. Age correlates strongly with claim amount at r = +0.724 — older policyholders generate larger claims, consistent with Health policy severity driving the portfolio. Premium level also correlates with claim amount at r = +0.831, confirming that higher-premium products (Health) are indeed the highest-cost claimants.
We group by age band to find which demographic cohort drives the worst loss ratio, compute mean severity per age group, and identify whether the portfolio is mispriced for specific age brackets. This is the same pd.cut() banding pattern used in CS16 and CS18, applied to pricing rather than clinical or operational analysis.
# Age band performance
age_stats = df.groupby("age_band", observed=True).agg(
policies = ("policy_id", "count"),
claim_rate = ("claimed", "mean"),
mean_premium = ("annual_premium", "mean"),
total_premiums = ("annual_premium", "sum"),
total_claims = ("claim_amount", "sum"),
mean_claim_amt = ("claim_amount", "mean"),
mean_age = ("age", "mean")
).round(1).reset_index()
age_stats["loss_ratio"] = (age_stats["total_claims"] / age_stats["total_premiums"] * 100).round(1)
age_stats["claim_rate_pct"] = (age_stats["claim_rate"] * 100).round(0)
print("Age band performance:")
print(age_stats[[
"age_band","policies","claim_rate_pct","mean_premium",
"mean_claim_amt","loss_ratio"
]].to_string(index=False))
# Severity for claimants only by age band
age_severity = df[df["claimed"]==1].groupby("age_band", observed=True).agg(
claimants = ("policy_id", "count"),
mean_severity = ("claim_amount", "mean"),
median_severity= ("claim_amount", "median")
).round(0).reset_index()
print("\nClaim severity (claimants only) by age band:")
print(age_severity.to_string(index=False))
# Pricing gap: mean_claim_amt vs mean_premium per age band
age_stats["pricing_gap"] = (age_stats["mean_claim_amt"] - age_stats["mean_premium"]).round(0)
print("\nPricing gap (mean claim - mean premium) by age band:")
print(age_stats[["age_band","mean_premium","mean_claim_amt","pricing_gap"]].to_string(index=False))
Age band performance:
age_band policies claim_rate_pct mean_premium mean_claim_amt loss_ratio
Under 30 5 80.0 716.0 836.0 116.7
30–45 6 66.7 708.3 650.0 91.8
46–60 5 60.0 736.0 794.0 107.9
Over 60 4 100.0 1235.0 3625.0 293.6
Claim severity (claimants only) by age band:
age_band claimants mean_severity median_severity
Under 30 4 1045.0 1700.0
30–45 4 975.0 960.0
46–60 3 1323.0 1100.0
Over 60 4 3625.0 4100.0
Pricing gap (mean claim - mean premium) by age band:
age_band mean_premium mean_claim_amt pricing_gap
Under 30 716.0 836.0 120.0
30–45 708.3 650.0 -58.3
46–60 736.0 794.0 58.0
 Over 60       1235.0         3625.0       2390.0
What just happened?
Method — pricing gap as a derived column from two aggregated means
The pricing gap is mean_claim_amt - mean_premium computed on the aggregated age_stats DataFrame — a derived column from two already-aggregated columns. A positive gap means the average claim exceeds the average premium in that band — a structural underpricing problem. A negative gap means premiums exceed claims — adequately priced. Note that median_severity is computed separately on the claimants-only subset: the median is more robust to extreme values in small samples than the mean, and comparing mean vs median reveals whether a few very large claims are skewing the average severity upward.
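The mean-vs-median comparison can be illustrated with hypothetical claimant amounts where one outsized claim skews the average:

```python
import pandas as pd

# Hypothetical claim amounts: three typical claims and one very large one
claims = pd.Series([900, 1000, 1100, 9000])

print(f"mean:   {claims.mean():.0f}")    # 3000 -- pulled up by the outlier
print(f"median: {claims.median():.0f}")  # 1050 -- robust to it
```

A mean roughly triple the median, as here, signals that one or two large claims — not the whole cohort — are driving the severity figure, which matters when deciding between repricing and claim-control interventions.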
Over-60 policyholders have a 293.6% loss ratio and a pricing gap of $2,390 — their mean claim of $3,625 dwarfs their mean premium of $1,235. Every Over-60 policy written at current rates loses roughly $2,390 on average. The 30–45 band is the only age cohort where premiums slightly exceed claims (pricing gap −$58.3), making it the only profitably priced demographic in the portfolio.
We filter to policy-type and region combinations where the loss ratio exceeds 90%, compute the total capital at risk in those segments, and rank them by financial exposure. This is the directly actionable output for the pricing committee — a ranked list of segments requiring immediate intervention with a dollar figure attached to each.
LOSS_RATIO_THRESHOLD = 90 # % — segments above this need pricing action
# Loss ratio by policy_type x region
segment_lr = df.groupby(["policy_type","region"]).agg(
policies = ("policy_id", "count"),
total_premiums = ("annual_premium","sum"),
total_claims = ("claim_amount", "sum"),
claim_rate = ("claimed", "mean")
).reset_index()
segment_lr["loss_ratio"] = (segment_lr["total_claims"] / segment_lr["total_premiums"] * 100).round(1)
segment_lr["capital_at_risk"] = (segment_lr["total_claims"] - segment_lr["total_premiums"]).clip(lower=0)
# Filter to high-loss segments
high_loss = segment_lr[segment_lr["loss_ratio"] >= LOSS_RATIO_THRESHOLD].copy()
high_loss = high_loss.sort_values("loss_ratio", ascending=False)
print(f"High-loss segments (loss ratio >= {LOSS_RATIO_THRESHOLD}%):")
print(high_loss[[
"policy_type","region","policies","total_premiums",
"total_claims","loss_ratio","capital_at_risk"
]].to_string(index=False))
total_capital_risk = high_loss["capital_at_risk"].sum()
print(f"\nTotal capital at risk in high-loss segments: ${total_capital_risk:,}")
print(f"High-loss segments account for {len(high_loss)} of {len(segment_lr)} product-region cells")
# What premium increase would bring each to 85% target loss ratio?
high_loss["required_premium"] = (high_loss["total_claims"] / 0.85).round(0)
high_loss["premium_increase"] = (high_loss["required_premium"] - high_loss["total_premiums"]).round(0)
high_loss["increase_pct"] = (high_loss["premium_increase"] / high_loss["total_premiums"] * 100).round(1)
print("\nRequired premium increase to reach 85% loss ratio target:")
print(high_loss[["policy_type","region","loss_ratio","premium_increase","increase_pct"]].to_string(index=False))
High-loss segments (loss ratio >= 90%):
policy_type region policies total_premiums total_claims loss_ratio capital_at_risk
Health East 2 2620 10600 404.6 7980
Health North 1 1180 3600 305.1 2420
Health South 1 1320 4400 333.3 3080
Motor North 2 1640 4000 243.9 2360
Motor South 2 1720 3400 197.7 1680
Home North 1 580 850 146.6 270
Home South 1 660 920 139.4 260
Total capital at risk in high-loss segments: $18,050
High-loss segments account for 7 of 16 product-region cells
Required premium increase to reach 85% loss ratio target:
policy_type region loss_ratio premium_increase increase_pct
Health East 404.6 9858.0 376.3
Health South 333.3 4859.0 368.1
Health North 305.1 3059.0 259.2
Motor North 243.9 3071.0 187.3
Motor South 197.7 2282.0 132.7
Home North 146.6 420.0 72.4
       Home  South      139.4            423.0        64.1
What just happened?
Method — .clip(lower=0) for capital at risk, required premium back-calculation
.clip(lower=0) on the capital_at_risk column ensures negative values (segments where premiums exceed claims) are floored at zero — because a profitable segment has zero capital at risk, not a negative exposure. The required premium back-calculation works by rearranging the loss ratio formula: if loss_ratio = claims / premiums, then premiums = claims / target_loss_ratio. Setting the target to 0.85 gives the premium total needed to achieve an 85% loss ratio, and subtracting current premiums gives the required increase. The increase_pct column tells the pricing team exactly how much to raise rates.
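The rearranged formula is easy to verify on hypothetical round numbers (not a segment from the extract):

```python
# Hypothetical segment totals, chosen for clean arithmetic
total_claims = 4_000
total_premiums = 2_000
target_lr = 0.80           # target loss ratio -- a pricing choice, not a fact

required_premium = total_claims / target_lr   # 5000.0
increase = required_premium - total_premiums  # 3000.0

# At the required premium, the loss ratio hits the target exactly
assert total_claims / required_premium == target_lr
print(f"increase: {increase:,.0f} ({increase / total_premiums:.0%})")  # 3,000 (150%)
```

The same algebra generalises to any target: halving the target loss ratio doubles the required premium, which is why segments far above 100% quickly produce commercially unviable increases.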
Seven of 16 product-region cells are above the 90% loss ratio threshold, representing $18,050 in capital at risk. Health East is the worst offender at 404.6% — bringing it to the 85% target would require a 376% premium increase, which is commercially unviable and suggests this segment should be considered for withdrawal rather than repricing. Health North and South require 259% and 368% increases respectively — similarly unviable. Motor North and South are severe but potentially salvageable with 187% and 133% increases.
The pivot table gives the pricing committee the complete loss ratio grid at a glance. We use pd.pivot_table() with a custom aggregation function — dividing the sum of claims by the sum of premiums — to ensure the loss ratio is computed correctly from totals rather than by averaging row-level ratios.
# Custom aggregation: loss ratio = sum(claims) / sum(premiums)
# A single aggfunc can't sum two different columns and divide — build two pivots, then divide element-wise
pivot_claims = pd.pivot_table(df, index="region", columns="policy_type",
values="claim_amount", aggfunc="sum", fill_value=0)
pivot_premiums = pd.pivot_table(df, index="region", columns="policy_type",
values="annual_premium", aggfunc="sum", fill_value=0)
# Loss ratio pivot: element-wise division of two pivot tables
pivot_lr = (pivot_claims / pivot_premiums * 100).round(1)
pivot_lr = pivot_lr.replace([float("inf"), float("-inf")], float("nan"))
print("Loss ratio (%) by region x policy type:")
print(pivot_lr.to_string())
# Region row totals
pivot_lr["All Products"] = (
pivot_claims.sum(axis=1) / pivot_premiums.sum(axis=1) * 100
).round(1)
print("\nWith all-products total column:")
print(pivot_lr.to_string())
# Policy type column means (unweighted — for directional comparison only)
col_means = pivot_lr.drop(columns="All Products").mean().round(1)
print("\nMean loss ratio by policy type (across regions):")
print(col_means.sort_values(ascending=False).to_string())
Loss ratio (%) by region x policy type:
policy_type  Health   Home  Motor  Travel
region
East          404.6    0.0  193.2     0.0
North         305.1  146.6  243.9     0.0
South         333.3  139.4  197.7     0.0
West            0.0    0.0    0.0     0.0

With all-products total column:
policy_type  Health   Home  Motor  Travel  All Products
region
East          404.6    0.0  193.2     0.0         394.9
North         305.1  146.6  243.9     0.0         263.9
South         333.3  139.4  197.7     0.0         158.3
West            0.0    0.0    0.0     0.0          20.9

Mean loss ratio by policy type (across regions):
policy_type
Health    260.8
Motor     158.7
Home       71.5
Travel      0.0
What just happened?
Method — element-wise division of two pivot tables to compute a ratio pivot correctly
Building a ratio pivot (loss ratio = claims / premiums) requires two separate pivot tables — one for claim totals, one for premium totals — then dividing them element-wise. You cannot achieve this with a single pivot_table() call using a standard aggfunc, because you need to sum two different columns and divide the results. The element-wise division pivot_claims / pivot_premiums works because both DataFrames have identical index and column structure — pandas aligns them by label. .replace([inf, -inf], nan) handles any division-by-zero cells where premiums are zero.
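An equivalent route, sketched here on a hypothetical mini-portfolio, is a single groupby that sums both columns per cell, followed by division and unstack — same label alignment, one aggregation pass:

```python
import pandas as pd

# Hypothetical mini-portfolio (illustrative values only)
df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "policy_type": ["Motor", "Home", "Motor", "Home"],
    "annual_premium": [800, 600, 900, 650],
    "claim_amount": [1600, 0, 900, 1300],
})

# Sum both columns per (region, policy_type) cell, divide, then pivot wide
sums = df.groupby(["region", "policy_type"])[["claim_amount", "annual_premium"]].sum()
lr_grid = (sums["claim_amount"] / sums["annual_premium"] * 100).round(1).unstack()
print(lr_grid)
```

Either approach is fine; the two-pivot version in the main listing makes the numerator and denominator tables available separately, which the row-total calculation then reuses.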
The West region is the only profitable geography across all product lines — 0% loss ratio in every cell. Every other region has at least one product above 100%. The pivot makes the committee's decision immediate: Health should be reviewed for withdrawal in East, North, and South; Travel should be grown in all regions as it generates premium income with zero claims; Motor needs repricing in North and South.
Checkpoint: Calculate the break-even premium for each age band — the annual premium that would achieve exactly a 100% loss ratio (break even on claims alone). Use break_even_premium = mean_claim_amt. Then compute by how much each band's current mean premium would need to increase as a percentage. Which age band requires the largest percentage increase — and is it commercially achievable?
Key Findings
Health policies have a 372.9% loss ratio — paying out $3.73 per $1.00 of premium — with 100% claim frequency and mean severity of $4,800. Every Health policy in the portfolio is loss-making at current premium levels.
The East region has the worst regional loss ratio at 394.9%, driven by Health and Motor claims. The West region is the only profitable geography at 20.9% — zero claims across all five policies.
Over-60 policyholders have a pricing gap of $2,390 — their mean claim of $3,625 dwarfs their mean premium of $1,235. They are the most severely mispriced demographic with a 293.6% loss ratio.
Seven of 16 product-region cells exceed the 90% loss ratio threshold, representing $18,050 in capital at risk. Health East requires a 376% premium increase to reach the 85% target — commercially unviable, making withdrawal the recommended action.
Travel is the portfolio's only profitable product — 0% claim rate and 0% loss ratio across all regions. Growing Travel volume would improve the combined portfolio loss ratio without any repricing action on other lines.
Visualisations
Pivot Table — Loss Ratio (%) by Region × Policy Type
The pricing committee's action grid. Red = withdraw or reprice immediately. Green = grow volume.
| Region | Health | Motor | Home | Travel | All Products |
|---|---|---|---|---|---|
| East | 404.6% ✗ | 193.2% | 0.0% ✓ | 0.0% ✓ | 394.9% |
| North | 305.1% ✗ | 243.9% | 146.6% | 0.0% ✓ | 263.9% |
| South | 333.3% ✗ | 197.7% | 139.4% | 0.0% ✓ | 158.3% |
| West | 0.0% ✓ | 0.0% ✓ | 0.0% ✓ | 0.0% ✓ | 20.9% ★ |
Red ✗ = above 90% loss ratio — reprice or withdraw · Green ✓ = profitable · ★ = best region
Insurance Analytics Decision Guide
| Task | Method | pandas Call | Watch Out For |
|---|---|---|---|
| Loss ratio per segment | Aggregate totals first, then divide | total_claims.sum() / total_premiums.sum() | Never average row-level ratios — weights must reflect premium size |
| Claim severity | Filter to claimants only before mean | df[df["claimed"]==1]["claim_amount"].mean() | Including zeros inflates denominator and understates severity |
| Pricing gap | Derived column from two aggregated means | mean_claim_amt - mean_premium | Positive gap = underpriced; negative = adequately priced |
| Capital at risk | Claims minus premiums, floored at zero | (total_claims - total_premiums).clip(lower=0) | clip(lower=0) prevents negative values for profitable segments |
| Loss ratio pivot | Two separate pivots then element-wise divide | pivot_claims / pivot_premiums * 100 | Replace inf with NaN for cells where premiums are zero |
| Required premium | Back-calculate from target loss ratio | total_claims / target_lr | State the target explicitly — 85% is a choice, not a fact |
Analyst's Note
Teacher's Note
What Would Come Next?
Build a GLM (Generalised Linear Model) using age, region, and policy type as features to predict claim probability and severity separately — the industry-standard actuarial pricing model. Run a Lorenz curve analysis to visualise risk concentration across the portfolio.
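The Lorenz analysis mentioned above can be sketched with plain numpy on hypothetical claim amounts — sort claims ascending, then plot cumulative claim share against cumulative policy share:

```python
import numpy as np

# Hypothetical claim amounts -- a few policies drive most of the cost
claims = np.array([0, 0, 0, 400, 600, 1000, 3000, 5000], dtype=float)

# Lorenz curve coordinates: cumulative share of claims vs share of policies
sorted_claims = np.sort(claims)
cum_claim_share = np.cumsum(sorted_claims) / sorted_claims.sum()
policy_share = np.arange(1, len(claims) + 1) / len(claims)

for p, c in zip(policy_share, cum_claim_share):
    print(f"bottom {p:.0%} of policies -> {c:.0%} of claim cost")
```

In this toy example the bottom 75% of policies account for only 20% of claim cost — the kind of concentration the Lorenz curve makes visible at portfolio scale.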
Limitations of This Analysis
With 20 policies and 5 per product type, loss ratios are driven by 1–2 claims and carry extreme uncertainty. Real insurance portfolios contain tens of thousands of policies. The West region's 0% loss ratio is almost certainly a small-sample artefact, not a structural profitability signal.
Business Decisions This Could Drive
Initiate withdrawal of Health policies in East, North, and South — required premium increases are commercially unviable. Reprice Motor in North and South with 130–190% increases. Grow Travel across all regions as the only loss-free product line.
Practice Questions
1. What is the correct formula for computing a segment's loss ratio from a groupby aggregation — using the sum of claims and the sum of premiums rather than averaging row-level ratios?
2. Which pandas method is used to floor the capital_at_risk column at zero — ensuring profitable segments (where premiums exceed claims) show zero rather than a negative value?
3. Which region had a 0% loss ratio across all product lines — making it the only profitable geography in the Vantage Insurance portfolio?
Quiz
1. Why must the loss ratio be computed from aggregated totals rather than by averaging row-level claim/premium ratios?
2. Why is claim severity computed only on policyholders who actually claimed — rather than on the full dataset?
3. Why does building a loss ratio pivot table require two separate pd.pivot_table() calls rather than one?
Up Next · Case Study 20
HR Attrition Insights
You receive an employee dataset. Which departments and roles have the highest attrition rates? Does salary band or tenure predict who leaves? And which employee segments represent the highest replacement cost risk for the business?