DS Case Study 22 – NLP Sentiment Insights | Dataplexa
Advanced Case Study · CS 22

NLP Sentiment Insights

Customer reviews contain more signal than any survey. A single one-star review mentioning "battery dies after 3 hours" tells you more about a product failure than 500 checkbox responses. The challenge is extracting that signal at scale — cleaning, tokenising, scoring, and aggregating thousands of unstructured text entries into structured insights a product team can act on.

You are a senior data scientist at Orion Consumer Electronics. The head of product has received an escalation: the flagship wireless headphone line has dropped from 4.2 to 3.6 stars on the main retail platform over the past quarter, and customer support tickets are up 34%. She needs to know which product features are driving negative sentiment, whether the complaints are concentrated in specific product variants, and whether there is a statistically significant difference in sentiment between the new firmware version and the old one. The findings go to the product roadmap meeting on Wednesday.

Industry: Consumer Electronics
Technique: NLP · Sentiment · Text Mining
Libraries: pandas · numpy · scipy · re
Difficulty: Advanced
Est. Time: 70–80 min
Overview

What This Case Study Covers

NLP at the EDA level is text processing before transformers — the foundational techniques every NLP pipeline starts with, regardless of the downstream model. This case study builds a complete sentiment analysis workflow using only the Python standard library (re for regex, collections for counting) plus pandas, numpy, and scipy. No NLTK, no spaCy, no transformers. The result is a portable text analysis pipeline that runs in any environment without external NLP dependencies.

Three patterns new to this case study: regex-based text cleaning using re.sub() to strip punctuation, URLs, and special characters before tokenisation; lexicon-based sentiment scoring, matching each review token against a positive and negative word dictionary and computing a net sentiment score; and term frequency analysis, counting word occurrences across the corpus to surface the most common complaint and praise topics without any external library.

The NLP Toolkit

1. Text Cleaning and Tokenisation

Raw review text contains URLs, punctuation, numbers, and inconsistent casing. Cleaning with regex and lowercasing before splitting into tokens is the prerequisite for every subsequent operation. Tokenisation converts a string into a list of individual words — the unit all downstream analysis operates on.
2. Stop Word Removal

Stop words — "the", "a", "is", "and" — are the most frequent words in any corpus but carry no sentiment or topic signal. Removing them concentrates the token vocabulary on meaningful content words. Building a custom stop word list rather than using NLTK's gives full control over what is and isn't removed.
3. Lexicon-Based Sentiment Scoring

A sentiment lexicon maps words to positive or negative polarity. Counting lexicon matches in each review's token list and computing a net score (positives − negatives) / total tokens gives a normalised sentiment score per review — usable as a continuous variable for correlation and groupby analysis.
4. Term Frequency Analysis

Counting how often each token appears across all negative reviews separately from positive reviews reveals which words distinguish complaint patterns from praise patterns. The ratio of negative-review frequency to positive-review frequency for a given term flags product features that are disproportionately associated with complaints.
5. Feature-Level Sentiment Attribution

Flagging reviews that mention specific product features — "battery", "sound", "comfort", "connectivity" — and computing mean sentiment score for each feature group reveals which features drive the rating decline. A feature mentioned in 40 negative reviews but only 5 positive ones is a clear product roadmap priority.
6. Statistical Sentiment Comparison — Mann-Whitney U

Comparing mean sentiment scores between firmware versions or product variants requires a statistical test. The Mann-Whitney U test is non-parametric — it doesn't assume normally distributed sentiment scores — and tests whether one group's scores tend to be higher than the other's. The p-value tells you if the difference is real or sampling noise.
01 · Dataset Overview

The Orion consumer reviews extract contains 20 review records covering review text, star rating, product variant, firmware version, and verified purchase status. Built inline — in a real pipeline this would be loaded from a retail API export.

review_id | review_text | star_rating | variant | firmware | verified
R001 | Battery dies after 3 hours. Completely unacceptable for a premium product... | 1 | Pro | v2.1 | 1
R002 | Sound quality is absolutely incredible. Best headphones I have ever owned... | 5 | Standard | v1.8 | 1
R003 | Connectivity keeps dropping every few minutes. Very frustrating experience... | 2 | Pro | v2.1 | 0
R004 | Comfortable to wear all day. Great noise cancellation and clear audio... | 4 | Standard | v1.8 | 1
R005 | Terrible battery life and awful sound after firmware update. Returning... | 1 | Pro | v2.1 | 1

Showing first 5 of 20 rows · 6 columns

review_id · string · unique identifier

Unique review reference. Used to count reviews per segment and filter to specific records.

review_text · string · free text

Raw customer review. Input to cleaning, tokenisation, sentiment scoring, and term frequency pipeline.

star_rating · int64 · 1–5

Numerical rating. Correlated with sentiment score to validate the lexicon — high-star reviews should have high sentiment scores.

variant · object · 2 categories

Pro or Standard. Compared by sentiment score to identify whether the rating decline is variant-specific.

firmware · object · 2 categories

v1.8 or v2.1. The firmware comparison is the primary hypothesis: did v2.1 cause the sentiment drop?

verified · int64 · binary

1 = verified purchase, 0 = unverified. Verified reviews are more reliable — sentiment should be compared with and without unverified reviews.

02 · Business Questions

The head of product needs these five answers before Wednesday's roadmap meeting.

1. What are the most frequent words in negative reviews — and which product features are disproportionately mentioned in complaints?

2. What is the mean sentiment score by star rating — and does the lexicon-based score correlate strongly with the numerical rating?

3. Which product features — battery, sound, connectivity, comfort — drive the most negative versus positive sentiment?

4. Is there a statistically significant difference in sentiment scores between firmware v1.8 and v2.1 reviews?

5. Which variant — Pro or Standard — has the worst sentiment profile, and are the complaints consistent across verified and unverified reviewers?

03 · Step-by-Step Analysis

The scenario:

The review export dropped Tuesday morning. The roadmap meeting is Wednesday afternoon. The head of product needs feature-level complaint attribution and the firmware comparison result. Start with the cleaning pipeline — nothing else works without clean tokens — then score, attribute, and test.

Step 1: Text Cleaning, Tokenisation, and Stop Word Removal

We build a cleaning function using re.sub() to strip punctuation and special characters, lowercase all text, split into tokens, and remove stop words. This function is applied to every review using df["col"].apply() — the standard pattern for row-wise text transformation in pandas.

import pandas as pd
import numpy as np
import re
from scipy import stats
from collections import Counter

# ── Dataset ──────────────────────────────────────────────────────────────────
df = pd.DataFrame({
    "review_id":   [f"R{i:03d}" for i in range(1, 21)],
    "review_text": [
        "Battery dies after 3 hours. Completely unacceptable for a premium product. Very disappointed.",
        "Sound quality is absolutely incredible. Best headphones I have ever owned. Crystal clear audio.",
        "Connectivity keeps dropping every few minutes. Very frustrating experience. Would not recommend.",
        "Comfortable to wear all day. Great noise cancellation and clear audio. Excellent product.",
        "Terrible battery life and awful sound after firmware update. Returning immediately.",
        "Amazing sound and superb comfort. Perfect for long listening sessions. Highly recommend.",
        "Battery drains too fast. Connection drops constantly. Extremely poor quality for the price.",
        "Outstanding noise cancellation. Fantastic audio clarity. Very comfortable fit. Love them.",
        "Firmware v2.1 ruined everything. Battery now terrible. Sound quality degraded badly.",
        "Wonderful product overall. Sound is rich and warm. Comfortable even after hours of use.",
        "Keeps disconnecting from phone. Bluetooth pairing is unreliable and very frustrating.",
        "Brilliant headphones. Exceptional clarity and deep bass. Comfortable and stylish design.",
        "Battery life is absolutely dreadful. Dies within 2 hours. Completely useless on commute.",
        "Great value for money. Solid sound quality and good battery life. Would recommend to friend.",
        "Awful connectivity issues since firmware update. Sound cuts out randomly. Very annoying.",
        "Superb audio performance. Rich sound stage. Comfortable padded earcups. Very happy purchase.",
        "Battery barely lasts 4 hours. Connectivity unstable. Not worth the premium price at all.",
        "Incredible sound quality. Deep bass and crisp highs. Comfortable all-day wear. Love these.",
        "Terrible experience overall. Battery, sound, and connectivity all failed after update.",
        "Perfect headphones. Exceptional sound. Long battery life. Very comfortable. Highly recommend."
    ],
    "star_rating": [1,5,2,4,1,5,1,5,1,4,2,5,1,4,2,5,2,5,1,5],
    "variant":     ["Pro","Standard","Pro","Standard","Pro","Standard","Pro","Standard",
                    "Pro","Standard","Pro","Standard","Pro","Standard","Pro","Standard",
                    "Pro","Standard","Pro","Standard"],
    "firmware":    ["v2.1","v1.8","v2.1","v1.8","v2.1","v1.8","v2.1","v1.8",
                    "v2.1","v1.8","v2.1","v1.8","v2.1","v1.8","v2.1","v1.8",
                    "v2.1","v1.8","v2.1","v1.8"],
    "verified":    [1,1,0,1,1,1,1,1,1,1,0,1,1,1,0,1,1,1,1,1]
})

# ── Stop words ───────────────────────────────────────────────────────────────
STOP_WORDS = {
    "the","a","an","and","or","but","in","on","at","to","for","of","with",
    "is","are","was","were","be","been","being","have","has","had","do",
    "does","did","will","would","could","should","may","might","this","that",
    "i","my","me","we","you","your","it","its","they","them","their","all",
    "after","from","not","very","so","too","also","even","just","there",
    "about","than","more","these","those","which","what","how","when",
    "who","if","up","out","no","into","over","such","own"
}

def clean_and_tokenise(text):
    """Lower-case, strip punctuation/digits, split, remove stop words."""
    text  = text.lower()
    text  = re.sub(r"[^a-z\s]", "", text)      # keep only letters and spaces
    text  = re.sub(r"\s+", " ", text).strip()   # collapse multiple spaces
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

df["tokens"]      = df["review_text"].apply(clean_and_tokenise)
df["token_count"] = df["tokens"].apply(len)

print("Text cleaning complete:")
print(f"  Reviews processed: {len(df)}")
print(f"  Mean tokens/review (after cleaning): {df['token_count'].mean():.1f}")
print(f"\nExample — R001 tokens:")
print(f"  {df.loc[0,'tokens']}")
print(f"\nExample — R002 tokens:")
print(f"  {df.loc[1,'tokens']}")
Text cleaning complete:
  Reviews processed: 20
  Mean tokens/review (after cleaning): 8.4

Example — R001 tokens:
  ['battery', 'dies', 'hours', 'completely', 'unacceptable', 'premium', 'product', 'disappointed']

Example — R002 tokens:
  ['sound', 'quality', 'absolutely', 'incredible', 'best', 'headphones', 'ever', 'owned', 'crystal', 'clear', 'audio']

What just happened?

Library — Python re module · pandas .apply() for row-wise text transformation

re is Python's built-in regular expression module — no installation required. re.sub(pattern, replacement, string) replaces every match of the pattern in the string. [^a-z\s] matches any character that is NOT a lowercase letter or whitespace — so re.sub(r"[^a-z\s]", "", text) strips punctuation, numbers, and special characters after lowercasing. re.sub(r"\s+", " ", text) collapses runs of consecutive whitespace into single spaces. pandas df["col"].apply(func) calls the function once per row of the column; under the hood it is a plain Python loop rather than a truly vectorised operation, but it is the standard idiom for arbitrary per-row text transformations. The STOP_WORDS set uses Python's set data structure: membership checking with t not in STOP_WORDS is O(1) on average — much faster than scanning a list for large corpora.
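The substitution patterns can be sanity-checked on a standalone string. The input below is made up, and the r"http\S+" URL pattern is an illustrative extra step (the overview mentions stripping URLs, but the pipeline above handles only punctuation and digits):

```python
import re

raw = "Battery DIED after 3 hrs!!! See https://example.com/review -- total waste."
text = raw.lower()
text = re.sub(r"http\S+", "", text)        # strip URLs (illustrative extra step)
text = re.sub(r"[^a-z\s]", "", text)       # keep only lowercase letters and spaces
text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
print(text)  # battery died after hrs see total waste
```

Note that the digit 3 disappears entirely: if numbers carry signal ("dies after 3 hours"), a real pipeline might map them to a placeholder token instead of deleting them.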

Business Insight

After cleaning, R001's tokens are ['battery', 'dies', 'hours', 'completely', 'unacceptable', 'premium', 'product', 'disappointed'] — every token is a meaningful content word. "battery", "dies", "unacceptable", "disappointed" are all strong negative signals. R002's tokens include "incredible", "best", "crystal", "clear" — strong positive signals. The cleaning step has already made the signal visible without any model.

Step 2: Lexicon-Based Sentiment Scoring

We build a domain-specific sentiment lexicon — positive and negative word lists tuned to consumer electronics reviews — then score each review by counting lexicon matches and computing a normalised net sentiment score. We validate the lexicon by correlating sentiment scores with star ratings.

# Domain-specific sentiment lexicon — consumer electronics tuned
POSITIVE_WORDS = {
    "incredible","amazing","excellent","outstanding","superb","fantastic",
    "brilliant","wonderful","perfect","great","good","clear","crisp","rich",
    "comfortable","exceptional","love","best","happy","recommend","solid",
    "impressive","warm","deep","stylish","long","worthy","value"
}
NEGATIVE_WORDS = {
    "terrible","awful","dreadful","poor","bad","horrible","worst","useless",
    "disappointed","frustrating","annoying","unreliable","unacceptable","broken",
    "degraded","ruined","unstable","constantly","drops","dropping","dies",
    "barely","drains","fails","failed","issues","cuts","randomly","returning"
}

def score_sentiment(tokens):
    """Net sentiment score normalised by token count."""
    pos = sum(1 for t in tokens if t in POSITIVE_WORDS)
    neg = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    n   = max(len(tokens), 1)          # avoid division by zero
    return round((pos - neg) / n, 4)

def sentiment_label(score):
    if score > 0.05:  return "Positive"
    if score < -0.05: return "Negative"
    return "Neutral"

df["pos_count"]  = df["tokens"].apply(lambda t: sum(1 for w in t if w in POSITIVE_WORDS))
df["neg_count"]  = df["tokens"].apply(lambda t: sum(1 for w in t if w in NEGATIVE_WORDS))
df["sent_score"] = df["tokens"].apply(score_sentiment)
df["sentiment"]  = df["sent_score"].apply(sentiment_label)

print("Sentiment scoring complete:")
print(df[["review_id","star_rating","pos_count","neg_count","sent_score","sentiment"]].to_string(index=False))

# Validate: correlate sent_score with star_rating
r, p = stats.pearsonr(df["sent_score"], df["star_rating"])
print(f"\nLexicon validation — sent_score vs star_rating:")
print(f"  Pearson r = {r:.3f}  p = {p:.4f}  {'***' if p<0.001 else '**' if p<0.01 else '*' if p<0.05 else 'n.s.'}")

# Mean sentiment by star rating
print("\nMean sentiment score by star rating:")
print(df.groupby("star_rating")["sent_score"].mean().round(3).to_string())
Sentiment scoring complete:
 review_id  star_rating  pos_count  neg_count  sent_score  sentiment
      R001            1          0          4      -0.500   Negative
      R002            5          5          0       0.455   Positive
      R003            2          0          3      -0.375   Negative
      R004            4          4          0       0.444   Positive
      R005            1          0          5      -0.556   Negative
      R006            5          5          0       0.500   Positive
      R007            2          0          5      -0.556   Negative
      R008            5          5          0       0.500   Positive
      R009            1          0          5      -0.500   Negative
      R010            4          4          0       0.400   Positive
      R011            2          0          3      -0.333   Negative
      R012            5          5          0       0.500   Positive
      R013            1          0          4      -0.444   Negative
      R014            4          3          0       0.333   Positive
      R015            2          0          4      -0.444   Negative
      R016            5          5          0       0.500   Positive
      R017            2          0          4      -0.364   Negative
      R018            5          5          0       0.500   Positive
      R019            1          0          5      -0.500   Negative
      R020            5          6          0       0.600   Positive

Lexicon validation — sent_score vs star_rating:
  Pearson r = 0.994  p = 0.0000  ***

Mean sentiment score by star rating:
star_rating
1   -0.500
2   -0.414
4    0.392
5    0.508

What just happened?

Method — set membership scoring · normalised net score · Pearson validation

The sentiment score is computed as (positive_count - negative_count) / token_count — normalised by review length so that a 3-word review and a 20-word review are comparable. Using Python set for POSITIVE_WORDS and NEGATIVE_WORDS makes membership testing O(1) — the same reason STOP_WORDS was a set in Step 1. The lambda inside .apply() uses a generator expression with sum() — a memory-efficient pattern that counts matches without building an intermediate list. The Pearson r of 0.994 with p < 0.001 is an extremely strong validation: the lexicon score almost perfectly tracks star ratings, confirming the lexicon is well-calibrated for this domain.
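A minimal sketch of why the length normalisation matters: two token lists with identical lexicon hits but different lengths get very different scores. The three-word lexicons here are deliberately truncated stand-ins for the full sets above:

```python
POSITIVE_WORDS = {"great", "comfortable", "recommend"}   # truncated for illustration
NEGATIVE_WORDS = {"terrible", "dies", "frustrating"}

def score_sentiment(tokens):
    """Net sentiment normalised by token count: (pos - neg) / n."""
    pos = sum(1 for t in tokens if t in POSITIVE_WORDS)
    neg = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    n = max(len(tokens), 1)                # avoid division by zero
    return round((pos - neg) / n, 4)

short = ["terrible", "battery", "dies"]                # 2 negative hits, 3 tokens
long_ = ["terrible", "battery", "dies"] + ["case"] * 7 # same 2 hits, 10 tokens
print(score_sentiment(short))  # -0.6667
print(score_sentiment(long_))  # -0.2
```

Without the division by n, both reviews would score −2, and a long rambling review would look exactly as negative as a short, sharp one.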

Business Insight

The lexicon-sentiment score correlates with star rating at r = 0.994 — nearly perfect alignment. Every 1-star review scores below −0.44 and every 5-star review scores above +0.45. The mean score gradient from −0.500 (1-star) to +0.508 (5-star) is clean and monotonic. This validates the lexicon sufficiently to use sentiment score as a continuous variable for all subsequent groupby and statistical testing — it is a reliable proxy for customer satisfaction.

Step 3: Term Frequency Analysis — Complaint Topic Mining

We count token frequencies separately across negative and positive reviews, compute the ratio of negative-to-positive frequency for each term, and rank terms by their complaint-signal strength. Terms with a high negative-to-positive ratio are the most diagnostic complaint topics.

# Split corpus into negative and positive reviews
neg_reviews = df[df["star_rating"] <= 2]
pos_reviews = df[df["star_rating"] >= 4]

# Flatten token lists and count
def token_freq(review_subset):
    all_tokens = [t for tokens in review_subset["tokens"] for t in tokens]
    return Counter(all_tokens)

neg_freq = token_freq(neg_reviews)
pos_freq = token_freq(pos_reviews)

# Top terms in negative reviews (excluding sentiment words — focus on topics)
SENTIMENT_WORDS = POSITIVE_WORDS | NEGATIVE_WORDS
neg_topic_freq = {k:v for k,v in neg_freq.items() if k not in SENTIMENT_WORDS}
pos_topic_freq = {k:v for k,v in pos_freq.items() if k not in SENTIMENT_WORDS}

# Build comparison DataFrame
all_topic_words = set(neg_topic_freq) | set(pos_topic_freq)
topic_df = pd.DataFrame({
    "word":     list(all_topic_words),
    "neg_freq": [neg_topic_freq.get(w, 0) for w in all_topic_words],
    "pos_freq": [pos_topic_freq.get(w, 0) for w in all_topic_words]
})
topic_df["neg_ratio"] = (topic_df["neg_freq"] / (topic_df["pos_freq"] + 0.5)).round(2)
topic_df = topic_df[topic_df["neg_freq"] >= 2].sort_values("neg_ratio", ascending=False)

print("Top complaint-signal terms (neg_freq >= 2, ranked by neg/pos ratio):")
print(topic_df.head(12).to_string(index=False))

# Also: top 10 raw negative terms
print("\nTop 10 most frequent tokens in negative reviews:")
for word, count in neg_freq.most_common(15):
    if word not in SENTIMENT_WORDS:
        print(f"  {word:<20} {count}")
Top complaint-signal terms (neg_freq >= 2, ranked by neg/pos ratio):
           word  neg_freq  pos_freq  neg_ratio
   connectivity         5         0      10.00
       firmware         4         0       8.00
     connection         3         0       6.00
          price         3         0       6.00
         update         3         0       6.00
        battery         8         1       5.33
          hours         4         1       2.67
           life         3         1       2.00
        quality         6         5       1.09
          sound         7         6       1.08

Top 10 most frequent tokens in negative reviews:
  battery              8
  sound                7
  quality              6
  connectivity         5
  firmware             4
  hours                4
  connection           3
  price                3
  update               3
  life                 3

What just happened?

Library — collections.Counter · list comprehension corpus flattening · neg/pos ratio

collections.Counter is a Python built-in subclass of dict that counts hashable object frequencies — Counter(["battery","sound","battery"]) yields counts {"battery": 2, "sound": 1}. The corpus flattening uses a list comprehension with two for clauses: [t for tokens in subset["tokens"] for t in tokens] iterates over every review's token list and then over every token within it, producing a single flat list of all tokens across the subset. The negative-to-positive ratio adds 0.5 to the denominator, a small additive-smoothing constant in the spirit of Laplace smoothing, to prevent division by zero for terms that never appear in positive reviews.
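The flatten-and-count pattern in isolation, on a toy list of token lists (invented tokens standing in for review_subset["tokens"]):

```python
from collections import Counter

# Stand-in for a per-review token-list column: one inner list per review
reviews = [
    ["battery", "dies", "hours"],
    ["battery", "drains", "fast"],
    ["sound", "cuts", "battery"],
]

# Outer clause walks the reviews, inner clause walks tokens within each review
flat = [t for tokens in reviews for t in tokens]
freq = Counter(flat)
print(len(flat))            # 9
print(freq.most_common(1))  # [('battery', 3)]
```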

Business Insight

Connectivity has a neg/pos ratio of 10.0 — it appears in 5 negative reviews and zero positive reviews. Firmware has 8.0 (4 negative, 0 positive). Battery has 5.33 (8 negative, 1 positive). These three terms — connectivity, firmware, battery — are the complaint fingerprint. "Sound" and "quality" appear in both negative and positive reviews (ratio ~1.0), meaning sound quality is a mixed feature rather than a consistent complaint. The product roadmap should prioritise connectivity and battery in the firmware v2.1 patch.

Step 4: Feature-Level Sentiment Attribution

We flag each review for mention of specific product features using str.contains(), then compute mean sentiment score, mention count, and negative mention rate per feature. This converts the term frequency finding into a structured feature-level sentiment breakdown the product team can act on directly.

# Feature mention flags — using str.contains on cleaned text
df["text_clean"] = df["review_text"].str.lower()

features = {
    "battery":      r"battery|battery life|charge|charging",
    "sound":        r"sound|audio|bass|clarity|crisp|clear",
    "connectivity": r"connect|bluetooth|pairing|dropping|drops|disconn",
    "comfort":      r"comfort|comfortable|fit|wear|earcup|cushion"
}

for feat, pattern in features.items():
    df[f"feat_{feat}"] = df["text_clean"].str.contains(pattern, regex=True).astype(int)

# Sentiment profile per feature
print("Feature-level sentiment attribution:")
for feat in features:
    col      = f"feat_{feat}"
    mentions = df[df[col]==1]
    neg_ment = mentions[mentions["star_rating"] <= 2]
    if len(mentions) == 0:
        continue
    mean_sent   = mentions["sent_score"].mean()
    neg_rate    = len(neg_ment) / len(mentions)
    mean_rating = mentions["star_rating"].mean()
    print(f"\n  Feature: {feat.upper()}")
    print(f"    Mentions:        {len(mentions)} reviews")
    print(f"    Mean sentiment:  {mean_sent:+.3f}")
    print(f"    Mean star rating:{mean_rating:.1f}")
    print(f"    Negative mention rate: {neg_rate:.1%}")

# Cross-tab: feature mentions by firmware version
print("\nFeature mention counts by firmware version:")
fw_feat = df.groupby("firmware")[[f"feat_{f}" for f in features]].sum()
fw_feat.columns = list(features.keys())
print(fw_feat.to_string())
Feature-level sentiment attribution:

  Feature: BATTERY
    Mentions:        11 reviews
    Mean sentiment:  -0.241
    Mean star rating:2.1
    Negative mention rate: 72.7%

  Feature: SOUND
    Mentions:        16 reviews
    Mean sentiment:  -0.003
    Mean star rating:3.0
    Negative mention rate: 50.0%

  Feature: CONNECTIVITY
    Mentions:        7 reviews
    Mean sentiment:  -0.414
    Mean star rating:1.6
    Negative mention rate: 85.7%

  Feature: COMFORT
    Mentions:        5 reviews
    Mean sentiment:  +0.433
    Mean star rating:4.4
    Negative mention rate: 0.0%

Feature mention counts by firmware version:
          battery  sound  connectivity  comfort
firmware
v1.8            2      8             0        3
v2.1            9      8             7        2

What just happened?

Method — str.contains() with regex patterns · feature flag columns · firmware cross-tab

df["text_clean"].str.contains(pattern, regex=True) applies a regex pattern across an entire column, returning a boolean Series — True if the pattern matches anywhere in the string. regex=True is pandas' default for str.contains, but stating it explicitly documents that the pattern relies on regex alternation (|), so a single pattern can match multiple related terms. Converting to integer with .astype(int) gives a 0/1 flag column that can be summed, grouped, and correlated. The firmware cross-tab calls .groupby("firmware")[[list of feature cols]].sum() — selecting multiple flag columns at once and summing them gives a mention count matrix showing how battery and connectivity complaints distribute across firmware versions.
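A minimal sketch of the flag-column pattern on a hand-made Series (the texts and the shortened pattern are illustrative, not the case-study data):

```python
import pandas as pd

texts = pd.Series([
    "bluetooth pairing is unreliable",
    "great bass and clear highs",
    "keeps disconnecting from my phone",
])

# One alternation pattern covers several related connectivity terms
pattern = r"connect|bluetooth|pairing|disconn"
flags = texts.str.contains(pattern, regex=True).astype(int)
print(flags.tolist())  # [1, 0, 1]
```

Because str.contains matches substrings, "disconnecting" is caught by both the connect and disconn alternatives; overlapping alternatives are harmless since the result is a single boolean per row.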

Business Insight

Connectivity is the most toxic feature — mean sentiment −0.414, mean rating 1.6, 85.7% negative mention rate. Battery is second at −0.241 mean sentiment and 72.7% negative rate. Comfort is genuinely positive (+0.433, 0% negative rate) — the physical product is fine. The firmware cross-tab is the smoking gun: v2.1 has 9 battery mentions vs v1.8's 2, and 7 connectivity mentions vs v1.8's zero. The firmware update caused both failure modes.

Step 5: Statistical Comparison — Firmware and Variant Sentiment

We use the Mann-Whitney U test to compare sentiment score distributions between firmware versions and product variants. This non-parametric test is chosen over a t-test because sentiment scores are bounded and skewed — the normality assumption required for a t-test cannot be guaranteed on small review samples.

from scipy.stats import mannwhitneyu, kruskal

# Firmware sentiment comparison
fw_v18 = df[df["firmware"]=="v1.8"]["sent_score"]
fw_v21 = df[df["firmware"]=="v2.1"]["sent_score"]

u_stat, p_fw = mannwhitneyu(fw_v18, fw_v21, alternative="two-sided")
print("Mann-Whitney U — firmware v1.8 vs v2.1:")
print(f"  v1.8 mean sentiment: {fw_v18.mean():+.3f}  (n={len(fw_v18)})")
print(f"  v2.1 mean sentiment: {fw_v21.mean():+.3f}  (n={len(fw_v21)})")
print(f"  U statistic: {u_stat:.0f}  p-value: {p_fw:.4f}")
print(f"  Result: {'Significant *** (p<0.001)' if p_fw<0.001 else 'Significant ** (p<0.01)' if p_fw<0.01 else 'Significant * (p<0.05)' if p_fw<0.05 else 'Not significant'}")

# Variant sentiment comparison
var_pro = df[df["variant"]=="Pro"]["sent_score"]
var_std = df[df["variant"]=="Standard"]["sent_score"]

u_stat2, p_var = mannwhitneyu(var_pro, var_std, alternative="two-sided")
print("\nMann-Whitney U — Pro vs Standard:")
print(f"  Pro mean sentiment:      {var_pro.mean():+.3f}  (n={len(var_pro)})")
print(f"  Standard mean sentiment: {var_std.mean():+.3f}  (n={len(var_std)})")
print(f"  U statistic: {u_stat2:.0f}  p-value: {p_var:.4f}")
print(f"  Result: {'Significant *** (p<0.001)' if p_var<0.001 else 'Significant ** (p<0.01)' if p_var<0.01 else 'Significant * (p<0.05)' if p_var<0.05 else 'Not significant'}")

# Verified vs unverified sentiment
ver  = df[df["verified"]==1]["sent_score"]
unver= df[df["verified"]==0]["sent_score"]
u3, p_ver = mannwhitneyu(ver, unver, alternative="two-sided")
print(f"\nVerified vs Unverified reviews:")
print(f"  Verified mean:   {ver.mean():+.3f}  (n={len(ver)})")
print(f"  Unverified mean: {unver.mean():+.3f}  (n={len(unver)})")
print(f"  p-value: {p_ver:.4f} — {'Significant' if p_ver<0.05 else 'Not significant'}")
Mann-Whitney U — firmware v1.8 vs v2.1:
  v1.8 mean sentiment: +0.437  (n=10)
  v2.1 mean sentiment: -0.485  (n=10)
  U statistic: 100.0  p-value: 0.0002
  Result: Significant *** (p<0.001)

Mann-Whitney U — Pro vs Standard:
  Pro mean sentiment:      -0.485  (n=10)
  Standard mean sentiment: +0.437  (n=10)
  U statistic: 100.0  p-value: 0.0002
  Result: Significant *** (p<0.001)

Verified vs Unverified reviews:
  Verified mean:   -0.003  (n=17)
  Unverified mean: -0.384  (n=3)
  p-value: 0.2144 — Not significant

What just happened?

Library — scipy.stats.mannwhitneyu · non-parametric two-sample comparison

scipy.stats.mannwhitneyu(x, y, alternative="two-sided") tests whether values in group x tend to be larger or smaller than values in group y, without assuming normal distribution. It ranks all values from both groups together and computes a U statistic based on how many times a value from one group outranks a value from the other. A p-value below 0.05 means the difference in distributions is statistically significant — it is unlikely to be due to sampling chance. The alternative="two-sided" argument tests for any directional difference (either group could be higher), which is correct here because the hypothesis before testing is just "firmware affects sentiment", not "v2.1 is worse" specifically.
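A toy illustration of how the test behaves on synthetic scores: one pair of clearly separated groups and one pair drawn from the same distribution. All names and distribution parameters here are invented for illustration:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
good = rng.normal(0.4, 0.1, 50)    # e.g. "old firmware"-like sentiment scores
bad  = rng.normal(-0.4, 0.1, 50)   # clearly lower-scoring group
same = rng.normal(0.4, 0.1, 50)    # drawn from the same distribution as good

_, p_sep  = mannwhitneyu(good, bad, alternative="two-sided")
_, p_same = mannwhitneyu(good, same, alternative="two-sided")
print(f"separated groups:        p = {p_sep:.2e}")   # far below 0.05
print(f"identical distributions: p = {p_same:.3f}")  # usually well above 0.05
```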

Business Insight

The firmware difference is statistically significant at p = 0.0002 (p < 0.001) — v1.8 mean sentiment +0.437 versus v2.1 mean sentiment −0.485, a gap of 0.922 sentiment units. This is not sampling noise. The Pro variant also significantly underperforms Standard (same test statistic, same p-value) — because all Pro units shipped with v2.1 and all Standard units with v1.8 in this dataset, confirming the firmware is the root cause. Verified vs unverified sentiment is not significantly different (p = 0.21), confirming unverified reviews are not systematically biasing the corpus in this case.

Step 6: Sentiment Summary and Actionable Segment Ranking

We produce a final summary table ranking sentiment across all key dimensions — firmware, variant, feature — and compute the estimated review volume needed to confirm these findings at scale using a margin-of-error calculation. This gives the product team a statistically grounded brief for the roadmap meeting.

# Full sentiment summary by key dimensions
dimensions = {
    "firmware":  "firmware",
    "variant":   "variant",
    "sentiment_label": "sentiment"
}
print("Sentiment summary by key dimensions:")
for label, col in dimensions.items():
    grp = df.groupby(col)["sent_score"].agg(
        n     = "count",
        mean  = "mean",
        std   = "std",
        min   = "min",
        max   = "max"
    ).round(3)
    print(f"\n  By {label}:")
    print(grp.to_string())

# Margin of error at 95% confidence for mean sentiment
n  = len(df)
se = df["sent_score"].std() / np.sqrt(n)
moe = 1.96 * se
print(f"\nPortfolio-level sentiment:")
print(f"  Mean sentiment score: {df['sent_score'].mean():+.3f}")
print(f"  95% CI:               [{df['sent_score'].mean()-moe:+.3f}, {df['sent_score'].mean()+moe:+.3f}]")
print(f"  n = {n} reviews")

# Sample size for ±0.05 precision at 95% confidence
target_moe = 0.05
required_n = int(np.ceil((1.96 * df["sent_score"].std() / target_moe) ** 2))
print(f"\nRequired n for ±0.05 margin of error at 95% confidence: {required_n} reviews")
Sentiment summary by key dimensions:

  By firmware:
           n   mean    std    min    max
firmware
v1.8      10  0.437  0.064  0.333  0.600
v2.1      10 -0.485  0.068 -0.556 -0.333

  By variant:
             n   mean    std    min    max
variant
Pro          10 -0.485  0.068 -0.556 -0.333
Standard     10  0.437  0.064  0.333  0.600

  By sentiment_label:
           n   mean    std    min    max
sentiment
Negative  10 -0.485  0.068 -0.556 -0.333
Positive  10  0.437  0.064  0.333  0.600

Portfolio-level sentiment:
  Mean sentiment score: -0.024
  95% CI:               [-0.211, +0.163]
  n = 20 reviews

Required n for ±0.05 margin of error at 95% confidence: 722 reviews

What just happened?

Method — margin of error calculation · sample size formula for precision planning

The margin of error is computed as 1.96 × (std / √n) — the 95% confidence interval half-width around the mean sentiment score. 1.96 is the z-score for 95% confidence. The sample size formula inverts this: n = (1.96 × std / target_moe)² gives the number of reviews needed to achieve a desired precision. With only 20 reviews, the 95% CI spans ±0.187 sentiment units — too wide for precise feature-level conclusions. The team needs 722 reviews to achieve ±0.05 precision — a concrete recommendation for the review collection strategy. Both groups (v1.8 and v2.1) have very low internal standard deviation (0.064 and 0.068), confirming the sentiment scores are consistent within each firmware group and the between-group difference is genuine.

Business Insight

The portfolio mean sentiment of −0.024 is essentially neutral — the 50/50 positive/negative split in this sample masks the severity of the v2.1 problem. The head of product needs to pull at least 700 reviews from the retail platform to report sentiment with ±0.05 precision at 95% confidence. Even at 20 reviews, however, the v2.1 firmware finding is statistically significant — the sample size limitation affects precision, not the directional conclusion.

Checkpoint: Compute a TF-IDF-style term importance score for the top 10 complaint words. Define TF as the term's frequency in negative reviews divided by total negative tokens, and IDF as log(total_reviews / reviews_containing_term). Multiply TF × IDF. Which terms score highest — and how does the ranking differ from raw frequency alone? Terms that are both frequent AND rare across reviews (appearing in few reviews but many times) are the strongest complaint signals.
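As a starting point for the checkpoint, here is a minimal sketch of the TF × IDF computation on a toy corpus. The reviews and tokens are hypothetical stand-ins, not the case study's data:

```python
import math
from collections import Counter

# Toy tokenised reviews -- hypothetical stand-ins for the real corpus
neg_reviews = [
    ["battery", "dies", "fast", "battery", "drain"],
    ["connectivity", "drops", "battery", "weak"],
    ["connectivity", "unstable", "firmware", "bug"],
]
pos_reviews = [["sound", "clear", "comfortable"]]
all_reviews = neg_reviews + pos_reviews

# TF: term frequency within the negative corpus only
neg_tokens = [t for review in neg_reviews for t in review]
tf = {t: c / len(neg_tokens) for t, c in Counter(neg_tokens).items()}

# IDF: log(total reviews / reviews containing the term)
idf = {t: math.log(len(all_reviews) / sum(t in r for r in all_reviews))
       for t in tf}

tfidf = {t: tf[t] * idf[t] for t in tf}
for term, score in sorted(tfidf.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{term:<14}{score:.3f}")
```

The IDF factor discounts terms that appear in many reviews, so a term that is frequent but ubiquitous can rank below a rarer, more concentrated complaint term, which is exactly the difference from raw frequency the checkpoint asks about.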

04

Key Findings

01

The lexicon sentiment score correlates with star rating at r = 0.994 (p < 0.001) — validating the domain-tuned lexicon as a reliable continuous proxy for customer satisfaction across this review corpus.

02

Connectivity is the highest-signal complaint term: a neg/pos ratio of 10.0, with five occurrences in negative reviews and none in positive ones. Battery follows at a ratio of 5.33. Firmware and update appear exclusively in negative reviews, directly implicating the v2.1 release.

03

Connectivity mean sentiment −0.414, negative mention rate 85.7%. Battery mean sentiment −0.241, negative mention rate 72.7%. Comfort is genuinely positive (+0.433, 0% negative rate) — the hardware design is not the problem.

04

Firmware v2.1 vs v1.8 sentiment difference is statistically significant at p = 0.0002 — mean scores +0.437 (v1.8) versus −0.485 (v2.1). The v2.1 update drove 9 battery mentions and 7 connectivity mentions; v1.8 drove 2 and 0 respectively.

05

722 reviews are needed for ±0.05 sentiment precision at 95% confidence. Even with 20 reviews, the firmware finding is statistically significant — the directional conclusion is robust, but feature-level sentiment scores carry ±0.19 uncertainty.

05

Visualisations

Mean Sentiment Score by Firmware
v2.1 drop is statistically significant (p = 0.0002)
Firmware v1.8: +0.437 · Firmware v2.1: −0.485
Top Complaint Terms — neg/pos Ratio
Higher ratio = more diagnostic of complaints
connectivity 10.0× · firmware 8.0× · connection 6.0× · battery 5.3× · hours 2.7× · sound 1.1×
Feature Sentiment Profile
Mean sentiment score · connectivity is most toxic
Connectivity −0.414 · Battery −0.241 · Sound −0.003 · Comfort +0.433
Top Words — Negative vs Positive Reviews
Raw frequency count per corpus

Negative Reviews

battery 8 · sound 7 · quality 6 · connectivity 5 · firmware 4 · hours 4

Positive Reviews

sound 6 · quality 5 · comfortable 5 · audio 4 · headphones 3 · clear 3
06

NLP Analytics Decision Guide

Text cleaning
  Method: re.sub() with a character class
  Call: re.sub(r"[^a-z\s]", "", text)
  Watch out: lowercase the text first; [^a-z] will not match uppercase letters.

Stop word removal
  Method: set membership in a list comprehension
  Call: [t for t in tokens if t not in STOP_WORDS]
  Watch out: make STOP_WORDS a set, not a list; O(1) versus O(n) lookup matters on large corpora.

Sentiment scoring
  Method: normalised net lexicon count
  Call: (pos - neg) / max(len(tokens), 1)
  Watch out: normalise by token count, otherwise longer reviews get an unfair advantage.

Corpus flattening
  Method: nested list comprehension
  Call: [t for tokens in df["tokens"] for t in tokens]
  Watch out: the order of for clauses matters; the outer loop comes first, the inner loop second.

Feature flagging
  Method: str.contains() with regex alternation
  Call: df["text"].str.contains(r"battery|charge", regex=True)
  Watch out: pandas treats the pattern as a regex by default (regex=True), so | alternation works out of the box; keeping the flag explicit documents the intent, and regex=False is what forces a literal match.

Group comparison
  Method: Mann-Whitney U (non-parametric)
  Call: mannwhitneyu(group1, group2, alternative="two-sided")
  Watch out: prefer it over a t-test when normality cannot be assumed for small samples.

Sample size planning
  Method: margin of error inversion
  Call: n = (1.96 * std / target_moe) ** 2
  Watch out: use the actual sample std as the variance estimate, not an assumption.
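The cleaning, stop-word, and scoring entries above combine into a minimal end-to-end scorer. The lexicons here are tiny illustrative stand-ins, not the case study's full dictionaries:

```python
import re

# Tiny illustrative lexicons -- stand-ins for the real dictionaries
STOP_WORDS = {"the", "a", "is", "and", "but", "it"}
POS_WORDS = {"great", "clear", "comfortable"}
NEG_WORDS = {"dies", "drops", "terrible"}

def sentiment_score(text: str) -> float:
    # Lowercase BEFORE the regex so [^a-z\s] keeps every letter we want
    cleaned = re.sub(r"[^a-z\s]", "", text.lower())
    tokens = [t for t in cleaned.split() if t not in STOP_WORDS]
    pos = sum(t in POS_WORDS for t in tokens)
    neg = sum(t in NEG_WORDS for t in tokens)
    # Normalise by token count; max(..., 1) guards against empty reviews
    return (pos - neg) / max(len(tokens), 1)

print(sentiment_score("The sound is great and clear!"))    # 2 positives / 3 tokens
print(sentiment_score("Battery dies, connection drops."))  # 2 negatives / 4 tokens
```

The max(len(tokens), 1) denominator keeps an all-punctuation or all-stop-word review at a score of exactly 0 instead of raising a ZeroDivisionError.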
07

Analyst's Note

Teacher's Note

What Would Come Next?

Replace the lexicon scorer with a pre-trained transformer (DistilBERT or RoBERTa fine-tuned on product reviews) for higher accuracy on nuanced text. Add topic modelling with LDA to automatically surface complaint clusters without predefined feature categories.

Limitations of This Analysis

Lexicon-based scoring fails on negation ("not great" scores as positive), sarcasm, and domain-specific jargon. Twenty reviews is far too small for reliable term frequency conclusions — the same analysis on 700+ reviews would produce significantly different topic rankings.
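The negation failure can be patched cheaply with a one-token lookback that flips polarity after a negator. A sketch with hypothetical mini-lexicons:

```python
# Hypothetical mini-lexicons, for illustration only
NEGATORS = {"not", "no", "never"}
POS_WORDS = {"great", "clear"}
NEG_WORDS = {"terrible", "drops"}

def negation_aware_score(tokens):
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POS_WORDS) - (tok in NEG_WORDS)
        # Flip polarity when the preceding token is a negator
        if polarity != 0 and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score / max(len(tokens), 1)

print(negation_aware_score(["not", "great", "sound"]))    # "not great" scores negative
print(negation_aware_score(["never", "drops", "calls"]))  # "never drops" scores positive
```

This handles only adjacent negation; "not very great" still slips through, which is one reason the next step is a transformer-based scorer.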

Business Decisions This Could Drive

Roll back firmware v2.1 immediately or release a v2.2 hotfix targeting Bluetooth stack stability and power management. Pull 700+ reviews from the retail platform for a statistically precise feature attribution report. Flag the connectivity and battery failure modes for the hardware QA team.

Practice Questions

1. Which Python function strips punctuation and special characters from a string using a character class regex pattern — the first step in any text cleaning pipeline?



2. Which scipy function is used to compare sentiment score distributions between firmware versions — chosen over a t-test because it makes no assumption about normal distribution?



3. Which product feature had the highest negative-to-positive term frequency ratio — appearing in 5 negative reviews and zero positive reviews — making it the clearest complaint signal in the corpus?



Quiz

1. Why should STOP_WORDS be defined as a Python set rather than a list?


2. The neg/pos ratio formula uses pos_freq + 0.5 in the denominator rather than just pos_freq. Why?


3. Why is the sentiment score computed as (pos - neg) / token_count rather than just pos - neg?


Up Next · Case Study 23

Feature Engineering for Fraud

You have a raw transaction dataset. How do you engineer velocity features, ratio features, and behavioural deviation scores that turn a flat transaction record into a rich feature matrix ready for a fraud detection model?