Speech AI Lesson 44 – Keyword Spotting | Dataplexa

Keyword Spotting

Keyword Spotting is the ability of a system to detect specific words or phrases from continuous audio.

Unlike full speech recognition, keyword spotting focuses only on listening for particular words, often called wake words or trigger words.

This task is critical for always-on, low-power Speech AI systems.

What Is Keyword Spotting?

Keyword spotting answers the question:

“Did the target word appear in this audio stream?”

Examples of keywords:

  • “Hey Siri”
  • “OK Google”
  • “Alexa”
  • Custom command words

The system ignores all other speech.

Why Keyword Spotting Is Different

Keyword spotting systems must be:

  • Always listening
  • Extremely fast
  • Low power
  • Highly precise

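In always-on products, false activations are usually treated as more harmful than missed detections: a missed wake word can simply be repeated, while an unintended activation makes the device respond to speech that was never directed at it.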

How Keyword Spotting Works (High-Level)

Most systems follow this pipeline:

Audio → Feature Extraction → Keyword Model → Decision

The model outputs a probability that the keyword is present.

Feature Extraction for Keyword Spotting

Keyword spotting does not need full linguistic information.

Instead, it relies on short-term acoustic patterns.

Common features:

  • MFCCs
  • Log-mel spectrograms
  • Energy features

Why This Code Exists

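This code uses random numbers as a stand-in for the features extracted from one short audio window. No real audio is processed; the point is the shape of the data a keyword model receives.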


import numpy as np

# Simulated feature window (time frames × features)
features = np.random.rand(30, 40)

print(features.shape)
  

What happens inside:

  • In a real system, audio is split into short windows
  • Each window captures acoustic patterns as an array of time frames × features (here 30 frames × 40 features)

Output:

(30, 40)
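
The random matrix above only stands in for real features. As a rough sketch of where a (time frames × features) array could come from, the snippet below splits a synthetic one-second signal into overlapping frames and computes a log power spectrum for each frame. The 16 kHz sample rate, the 25 ms frame, the 10 ms hop, and the helper name `frame_log_spectrum` are illustrative assumptions, not values from this lesson. A real system would typically add a mel filterbank or MFCC step on top of this.

```python
import numpy as np

def frame_log_spectrum(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames and return log power spectra.

    Hypothetical helper: frame_len=400 and hop=160 correspond to 25 ms and
    10 ms at an assumed 16 kHz sample rate.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)  # taper each frame to reduce edge effects
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)  # small constant avoids log(0)

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)  # 1 second of synthetic noise, not real speech
feats = frame_log_spectrum(audio)
print(feats.shape)  # (98, 201): time frames × frequency bins
```

Each row is one 25 ms frame and each column is one frequency bin, which is the same layout as the simulated array above, only with different dimensions.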

Sliding Window Detection

Keyword spotting systems scan audio continuously.

They use a sliding window to check overlapping segments.

Why This Code Exists

This example simulates scanning audio segments for keyword probability.


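import numpy as np

def keyword_probability(features):
    # Placeholder for a trained model: returns a random score in [0, 1)
    return np.random.rand()

# Five simulated feature windows (30 time frames × 40 features each)
windows = [np.random.rand(30, 40) for _ in range(5)]

# Score every window independently
scores = [keyword_probability(w) for w in windows]
print(scores)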
  

What happens:

  • Each window is evaluated independently
  • A probability score is produced

Example output (rounded; the scores are random, so actual values differ on every run):

[0.12, 0.08, 0.91, 0.15, 0.05]

How to read this:

Higher values mean higher confidence that the keyword is present.
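
The list of pre-cut windows above hides the sliding step itself. A minimal sketch of how overlapping windows could be cut from one long feature stream follows. The window length of 30 frames matches the example above, while the hop of 10 frames, the stream length of 100 frames, and the function name `sliding_windows` are assumptions chosen for illustration.

```python
import numpy as np

def sliding_windows(stream, win_len=30, hop=10):
    """Yield (start_frame, window) pairs of overlapping windows."""
    for start in range(0, len(stream) - win_len + 1, hop):
        yield start, stream[start : start + win_len]

rng = np.random.default_rng(0)
stream = rng.random((100, 40))  # simulated stream: 100 frames × 40 features

starts = []
for start, window in sliding_windows(stream):
    assert window.shape == (30, 40)  # every window has the model's input shape
    starts.append(start)

print(starts)  # [0, 10, 20, 30, 40, 50, 60, 70]
```

With these settings, consecutive windows share 20 of their 30 frames, so a keyword that straddles one window boundary is more likely to fall fully inside a neighboring window.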

Threshold-Based Decision

A detection occurs only if the probability exceeds a threshold.

Why This Code Exists

This logic reduces false activations by discarding windows whose score is too low.


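# Scores taken from the example output above
scores = [0.12, 0.08, 0.91, 0.15, 0.05]
threshold = 0.8

# Keep the indices of windows whose score exceeds the threshold
detections = [i for i, s in enumerate(scores) if s > threshold]
print(detections)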
  

What happens:

  • Low-confidence windows are ignored
  • Only strong signals trigger activation

Output:

[2]

Small Models, Big Impact

Keyword spotting models are usually tiny. Common choices include:

  • CNNs
  • Depthwise separable networks
  • Quantized models

This allows them to run on:

  • Smart speakers
  • Wearables
  • IoT devices
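
To see why depthwise separable layers and quantization help on small devices, the arithmetic can be sketched as follows. The layer sizes (a 3×3 kernel with 64 input and 64 output channels) are made-up illustrative values, biases are ignored, and the function names are hypothetical.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k×k convolution (biases ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k×k conv followed by a 1×1 pointwise conv."""
    depthwise = k * k * c_in   # one k×k filter per input channel
    pointwise = c_in * c_out   # 1×1 conv that mixes channels
    return depthwise + pointwise

k, c_in, c_out = 3, 64, 64     # illustrative layer sizes
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)

print(std, sep)                # 36864 4672
print(round(std / sep, 2))     # 7.89 (times fewer weights)

# Storage in bytes: 4 bytes per float32 weight vs 1 byte per int8 weight
print(std * 4, sep * 1)        # 147456 4672
```

For this one layer, combining the separable structure with 8-bit weights shrinks storage from roughly 147 KB to under 5 KB. Actual savings depend on the full architecture and on how quantization affects accuracy.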

False Positives vs False Negatives

Designing thresholds involves trade-offs:

  • Lower threshold → more false positives
  • Higher threshold → missed activations

Production systems often favor precision, accepting some missed activations in exchange for fewer false triggers.
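
The trade-off can be made concrete by counting both error types at two thresholds. This is a small sketch on made-up scores and labels; the data, the thresholds, and the helper name `count_errors` are assumptions for illustration, not measurements from a real system.

```python
# Made-up (score, is_keyword) pairs for illustration only
scored = [
    (0.12, False), (0.08, False), (0.91, True), (0.15, False),
    (0.05, False), (0.62, True), (0.71, False), (0.85, True),
]

def count_errors(scored, threshold):
    """Return (false positives, false negatives) at the given threshold."""
    false_pos = sum(1 for s, y in scored if s > threshold and not y)
    false_neg = sum(1 for s, y in scored if s <= threshold and y)
    return false_pos, false_neg

results = {t: count_errors(scored, t) for t in (0.5, 0.8)}
for t, (fp, fn) in results.items():
    print(f"threshold={t}: false positives={fp}, false negatives={fn}")
# threshold=0.5: false positives=1, false negatives=0
# threshold=0.8: false positives=0, false negatives=1
```

Raising the threshold from 0.5 to 0.8 removes the false activation at score 0.71 but misses the genuine keyword at score 0.62.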

Noise and Robustness

Keyword spotting systems must handle:

  • Background conversations
  • TV or music
  • Different accents

Noise augmentation during training is essential.
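
One common form of noise augmentation is mixing background noise into clean recordings at a chosen signal-to-noise ratio (SNR). The sketch below shows the idea with synthetic signals: a sine wave stands in for a clean utterance and random noise for the background. The 10 dB target, the 16 kHz sample rate, and the function name `mix_at_snr` are illustrative assumptions.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db, then add it."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
speech = np.sin(2 * np.pi * 220 * t)   # stand-in for a clean utterance
noise = rng.standard_normal(16000)     # stand-in for background noise

noisy = mix_at_snr(speech, noise, snr_db=10)

# Verify the achieved SNR from the noise component that was added
added = noisy - speech
achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean(added ** 2))
print(round(float(achieved), 2))  # 10.0
```

During training, the same clean example could be mixed with different noise clips at several SNR values so the model sees the keyword under varied conditions.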

Real-World Applications

  • Voice assistants
  • Hands-free control
  • Emergency keyword detection
  • Smart home activation

Practice

  • What task detects specific words from continuous audio?
  • What technique scans overlapping audio segments?
  • What value controls activation sensitivity?

Recap: Keyword spotting detects specific trigger words using lightweight models and threshold-based decisions.

Next up: You’ll learn about Speech Emotion Recognition and how systems infer emotions from voice.