Speech AI Course
Keyword Spotting
Keyword spotting is the ability of a system to detect specific words or phrases in continuous audio.
Unlike full speech recognition, keyword spotting focuses only on listening for particular words, often called wake words or trigger words.
This task is critical for always-on, low-power Speech AI systems.
What Is Keyword Spotting?
Keyword spotting answers the question:
“Did the target word appear in this audio stream?”
Examples of keywords:
- “Hey Siri”
- “OK Google”
- “Alexa”
- Custom command words
The system ignores all other speech.
Why Keyword Spotting Is Different
Keyword spotting systems must be:
- Always listening
- Extremely fast
- Low power
- Highly precise
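In most consumer products, false activations are treated as more harmful than missed detections: a device that wakes up unprompted erodes trust, while a missed keyword can simply be repeated.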
How Keyword Spotting Works (High-Level)
Most systems follow this pipeline:
Audio → Feature Extraction → Keyword Model → Decision
The model outputs a probability that the keyword is present.
Feature Extraction for Keyword Spotting
Keyword spotting does not need full linguistic information.
Instead, it relies on short-term acoustic patterns.
Common features:
- MFCCs
- Log-mel spectrograms
- Energy features
Why This Code Exists
This code uses random numbers to stand in for the short block of audio features a keyword detector receives. No real audio is processed.
import numpy as np
# Simulated feature window (time frames × features)
features = np.random.rand(30, 40)
print(features.shape)
What happens inside a real feature extractor:
- Audio is split into short, overlapping frames
- Each frame is converted into a feature vector that captures acoustic patterns
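The framing step described above can be sketched with NumPy alone. This is a minimal illustration, not a production front end: the 16 kHz sampling rate, the 25 ms frame and 10 ms hop, and the helper names `frame_signal` and `log_power_spectrum` are all assumptions made for the example, and the input is synthetic noise rather than speech. It computes a log power spectrum and a per-frame log energy; a log-mel spectrogram would additionally apply a mel filterbank, which is omitted here.

```python
import numpy as np

def frame_signal(signal, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames of frame_len samples."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row i holds samples [i * hop_len, i * hop_len + frame_len)
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]

def log_power_spectrum(frames, eps=1e-10):
    """Log power spectrum of each frame (Hann window, then real FFT)."""
    windowed = frames * np.hanning(frames.shape[1])
    power = np.abs(np.fft.rfft(windowed, axis=1)) ** 2
    return np.log(power + eps)

sample_rate = 16000                        # assumed sampling rate in Hz
rng = np.random.default_rng(0)
audio = rng.standard_normal(sample_rate)   # 1 second of synthetic noise, not real speech

frames = frame_signal(audio, frame_len=400, hop_len=160)   # 25 ms frames, 10 ms hop
spectrum = log_power_spectrum(frames)
log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)   # simple energy feature

print(frames.shape)      # (98, 400)
print(spectrum.shape)    # (98, 201)
print(log_energy.shape)  # (98,)
```

Each row of `spectrum` plays the role of one row in the simulated `features` matrix: one feature vector per short frame of audio.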
Sliding Window Detection
Keyword spotting systems scan audio continuously.
They use a sliding window to check overlapping segments.
Why This Code Exists
This example simulates scanning audio segments for keyword probability.
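def keyword_probability(features):
    # Placeholder for a trained model: returns a random score in [0, 1)
    return np.random.rand()

# Five simulated windows, each 30 time frames × 40 features
windows = [np.random.rand(30, 40) for _ in range(5)]
scores = [keyword_probability(w) for w in windows]
print(scores)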
What happens:
- Each window is evaluated independently
- A probability score is produced
How to read this:
Higher values mean higher confidence that the keyword is present.
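The simulation above builds five unrelated windows. To show how overlapping windows could be cut from one continuous feature stream, here is a small sketch. The window length of 30 frames, the hop of 10 frames, and the helpers `sliding_windows` and `placeholder_score` are assumptions for illustration; `placeholder_score` is a hypothetical stand-in for a trained model.

```python
import numpy as np

def sliding_windows(features, window_len, hop):
    """Yield (start_frame, window) pairs over a (frames x features) matrix."""
    for start in range(0, features.shape[0] - window_len + 1, hop):
        yield start, features[start:start + window_len]

def placeholder_score(window):
    """Hypothetical stand-in for a model: squashes the window mean into (0, 1)."""
    return float(1.0 / (1.0 + np.exp(-window.mean())))

rng = np.random.default_rng(0)
stream = rng.random((100, 40))   # 100 frames of 40 simulated features

# With a hop of 10 and a length of 30, neighboring windows share 20 frames
results = [(start, placeholder_score(w))
           for start, w in sliding_windows(stream, window_len=30, hop=10)]

for start, score in results:
    print(f"frames {start}-{start + 29}: score {score:.2f}")
```

A smaller hop gives finer time resolution at the cost of more model evaluations per second, which matters on low-power devices.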
Threshold-Based Decision
A detection occurs only if the probability exceeds a threshold.
Why This Code Exists
This logic reduces false activations by ignoring low-confidence windows.
threshold = 0.8
detections = [i for i, s in enumerate(scores) if s > threshold]
print(detections)
What happens:
- Low-confidence windows are ignored
- Only strong signals trigger activation
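Because overlapping windows can all score above the threshold for a single spoken keyword, a system may want to report one activation per run of detections rather than one per window. The sketch below shows one possible way to do that. The function name `group_detections` and the example scores are made up for illustration and are not part of the lesson's code.

```python
def group_detections(scores, threshold):
    """Merge runs of consecutive above-threshold windows into (first, last) index pairs."""
    events = []
    run_start = None
    for i, s in enumerate(scores):
        if s > threshold:
            if run_start is None:
                run_start = i            # a new run begins here
        elif run_start is not None:
            events.append((run_start, i - 1))   # the run ended at the previous window
            run_start = None
    if run_start is not None:
        events.append((run_start, len(scores) - 1))   # run reaches the end of the list
    return events

example_scores = [0.10, 0.85, 0.92, 0.88, 0.30, 0.20, 0.95, 0.40]
events = group_detections(example_scores, threshold=0.8)
print(events)  # [(1, 3), (6, 6)]
```

Here four above-threshold windows collapse into two activation events.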
Small Models, Big Impact
Keyword spotting models are usually tiny. Common choices:
- Small CNNs
- Depthwise separable convolutional networks
- Quantized models
This allows them to run on:
- Smart speakers
- Wearables
- IoT devices
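A quick calculation shows why depthwise separable layers and quantization shrink a model. The layer sizes below (3 × 3 kernel, 64 input and 64 output channels) are illustrative values, not taken from any particular keyword spotting model, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a k x k standard convolution (biases ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise convolution plus a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 64   # illustrative layer sizes
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)

print(std)   # 36864
print(sep)   # 4672
print(f"{std / sep:.1f}x fewer weights")   # 7.9x fewer weights

# Storage for the separable layer: float32 uses 4 bytes per weight, int8 uses 1
print(sep * 4, "bytes as float32")   # 18688 bytes as float32
print(sep * 1, "bytes as int8")      # 4672 bytes as int8
```

Combining the two techniques in this example cuts the layer's weight storage from 147,456 bytes (standard convolution in float32) to 4,672 bytes.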
False Positives vs False Negatives
Designing thresholds involves trade-offs:
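- Lower threshold → more false positives (false activations)
- Higher threshold → more false negatives (missed activations)
Production systems usually favor precision, accepting some missed activations in exchange for fewer false ones.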
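The trade-off can be made concrete by counting both error types at several thresholds. The scores below are drawn from hypothetical beta distributions chosen only so that keyword windows tend to score high and background windows tend to score low. They are not measurements from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic scores from assumed distributions, not real model outputs
keyword_scores = rng.beta(5, 2, size=1000)      # windows that contain the keyword
background_scores = rng.beta(2, 5, size=1000)   # windows that do not

def count_errors(threshold):
    """Return (false positives, false negatives) at the given threshold."""
    false_positives = int(np.sum(background_scores > threshold))
    false_negatives = int(np.sum(keyword_scores <= threshold))
    return false_positives, false_negatives

for threshold in (0.3, 0.5, 0.7, 0.9):
    fp, fn = count_errors(threshold)
    print(f"threshold={threshold:.1f}  false positives={fp}  false negatives={fn}")
```

As the threshold rises, the false positive count can only fall or stay the same, while the false negative count can only rise or stay the same.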
Noise and Robustness
Keyword spotting systems must handle:
- Background conversations
- TV or music
- Different accents
Noise augmentation during training is essential.
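One common form of noise augmentation is mixing background noise into a clean clip at a chosen signal-to-noise ratio (SNR). The sketch below assumes both signals are NumPy arrays of equal length. The helper name `mix_at_snr` is hypothetical, and a sine tone and white noise stand in for real speech and real background recordings.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio equals snr_db."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled_noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
speech = np.sin(2 * np.pi * 220 * t)   # sine tone standing in for a speech clip
noise = rng.standard_normal(16000)     # white noise standing in for background sound

noisy = mix_at_snr(speech, noise, snr_db=10)

# Check the result: measure the SNR of the mixture against the clean signal
measured = 10 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))
print(f"measured SNR: {measured:.1f} dB")
```

During training, the SNR would typically be varied from clip to clip so the model sees both mild and severe noise.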
Real-World Applications
- Voice assistants
- Hands-free control
- Emergency keyword detection
- Smart home activation
Practice
What task detects specific words from continuous audio?
What technique scans overlapping audio segments?
What value controls activation sensitivity?
Quick Quiz
Keyword spotting often detects:
Keyword spotting models must be:
What controls false activations?
Recap: Keyword spotting detects specific trigger words using lightweight models and threshold-based decisions.
Next up: You’ll learn about Speech Emotion Recognition and how systems infer emotions from voice.