Speech AI Lesson 39 – Virtual Assistants | Dataplexa

Virtual Assistants

Speech AI becomes truly powerful when it is part of a system that can listen, understand, decide, and respond.

That complete experience is what we call a Virtual Assistant.

Virtual assistants are not just “speech tools”. They are full conversational systems where Speech AI is only one layer.

What Is a Virtual Assistant?

A virtual assistant is a software system that can:

  • Receive user input (voice or text)
  • Understand intent
  • Perform actions
  • Respond naturally (often in voice)

In voice assistants, the pipeline is usually:

Audio → ASR → NLU → Dialogue Manager → Action → NLG → TTS → Audio
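The stages above can be sketched as a chain of functions, each passing its output to the next. This is a minimal illustration with stub functions (all names and return values here are invented stand-ins, not real ASR/TTS engines):

```python
# Illustrative stubs for each pipeline stage.
def asr(audio):
    return "set an alarm for 7 am"          # pretend transcription

def nlu(text):
    return {"intent": "set_alarm", "time": "7 am"}

def dialogue_manager(parsed):
    return (parsed["intent"], parsed["time"])  # decide which action to run

def act(action, arg):
    return f"Alarm set for {arg}"

def nlg(result):
    return f"Okay. {result}."

def tts(text):
    return f"<audio: {text}>"               # pretend synthesis

def assistant(audio):
    parsed = nlu(asr(audio))
    action, arg = dialogue_manager(parsed)
    return tts(nlg(act(action, arg)))

print(assistant(b"..."))
```

Each stage only needs to understand the output of the stage before it, which is why the layers can be developed and swapped independently.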

Core Components of a Voice Assistant

1) ASR (Automatic Speech Recognition)

Converts spoken audio into text. Without a strong ASR layer, everything else fails.

2) NLU (Natural Language Understanding)

Extracts intent and entities from text. Example: “Book a table for 2 at 7 PM” → intent: reservation.

3) Dialogue Manager

Decides what to do next based on context. If needed, it asks follow-up questions.

4) Action Layer

Calls tools/APIs (calendar, weather, reminders, tickets).

5) TTS (Text to Speech)

Converts response text into audio for the user.

Intent Detection (NLU Basics)

Intent detection is the foundation of assistant understanding.

The system tries to classify what the user wants:

  • “Set an alarm” → alarm intent
  • “Play music” → media intent
  • “What’s the weather?” → weather intent

Why This Code Exists

This code demonstrates a simple keyword-based intent classifier. (Real systems use ML models, but the underlying logic is the same.)


def detect_intent(text):
    t = text.lower()                 # normalize case before matching
    if "weather" in t:
        return "get_weather"
    if "alarm" in t or "wake" in t:
        return "set_alarm"
    if "play" in t and "music" in t:
        return "play_music"
    return "unknown"                 # fallback when no rule matches

print(detect_intent("Can you set an alarm for 7 AM?"))

What happens inside:

  • User text is normalized to lowercase
  • Simple rules classify intent

Output: set_alarm

Why this matters:

Even advanced assistants still rely on intent routing to choose the right action path.

Entity Extraction

Entities are important details inside the command.

Example:

  • “Set alarm for 7 AM” → entity: time = 7 AM
  • “Call John” → entity: contact = John

Why This Code Exists

This example shows extracting a basic time entity.


import re

def extract_time(text):
    # Match an hour (1–2 digits), an optional space, then am/pm
    m = re.search(r"(\d{1,2})\s?(am|pm)", text.lower())
    if not m:
        return None
    return f"{m.group(1)} {m.group(2)}"   # e.g. "7 am"

print(extract_time("Set an alarm for 7 AM"))

What happens here:

  • Regex finds time patterns
  • Extracts a clean structured value

Output: 7 am
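The same pattern covers the "Call John" example from earlier. Here is a hedged sketch using a naive word-after-"call" heuristic (a real assistant would resolve the name against the user's actual contact list):

```python
import re

def extract_contact(text):
    # Naive heuristic: take the word right after "call" as the contact name.
    m = re.search(r"\bcall\s+(\w+)", text, re.IGNORECASE)
    if not m:
        return None
    return m.group(1).capitalize()

print(extract_contact("Call John"))
```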

Dialogue Management

Dialogue management handles the conversation flow.

If a user says:

"Book a table"

The system must ask:

  • For how many people?
  • Which time?
  • Which restaurant?

Why This Code Exists

This code shows a simple dialogue state machine.


# Dialogue state: what the user wants and which slot is still missing
state = {"intent": "set_alarm", "time": None}

def next_question(state):
    # Ask only for information that has not been provided yet
    if state["intent"] == "set_alarm" and state["time"] is None:
        return "What time should I set the alarm for?"
    return "Done."

print(next_question(state))

What happens here:

  • The system checks missing info
  • Asks only what is required

Output: What time should I set the alarm for?
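The same approach extends to the "Book a table" example, where several slots may be missing. Each turn asks for the first unfilled slot (the slot names and questions here are illustrative):

```python
# Slots the reservation intent needs, with the question to ask for each.
QUESTIONS = {
    "people": "For how many people?",
    "time": "Which time?",
    "restaurant": "Which restaurant?",
}

def next_slot_question(state):
    # Ask for the first missing slot; confirm once everything is filled.
    for slot, question in QUESTIONS.items():
        if state.get(slot) is None:
            return question
    return "Booking your table now."

state = {"intent": "reservation", "people": 2, "time": None, "restaurant": None}
print(next_slot_question(state))
```

Because the loop walks the slots in order, the assistant asks one focused question per turn instead of overwhelming the user.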

Action Layer (Tool Execution)

Once intent and entities are ready, the assistant performs an action:

  • Set an alarm
  • Call an API
  • Update a database

In production, action layers are often microservices.

Why This Code Exists

This code simulates executing an action safely.


def set_alarm(time_str):
    # In production this would call a real alarm/scheduler API
    return f"Alarm set for {time_str}"

print(set_alarm("7 am"))

What happens:

  • Action is executed using structured data
  • A confirmation message is created

Output: Alarm set for 7 am
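In a fuller system, the dialogue manager routes each intent to its action handler. A minimal dispatch table might look like this (the handler names and messages are illustrative, not a real API):

```python
def set_alarm(time_str):
    return f"Alarm set for {time_str}"

def get_weather(city):
    return f"Fetching weather for {city}"

# Map each intent name to the function that performs it.
ACTIONS = {
    "set_alarm": set_alarm,
    "get_weather": get_weather,
}

def execute(intent, argument):
    handler = ACTIONS.get(intent)
    if handler is None:
        return "Sorry, I can't do that yet."   # graceful fallback
    return handler(argument)

print(execute("set_alarm", "7 am"))
```

This kind of table mirrors the microservice routing mentioned above: adding a new capability means registering one new handler, not rewriting the pipeline.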

Response Generation (NLG + TTS)

After the assistant decides what to say, it produces a response that is natural and polite.

Then TTS converts the response into speech.

This is where prosody, emotion, and realism matter.
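A minimal NLG step can be one response template per intent; real systems generate far richer language, but the idea is the same (these templates are invented for illustration):

```python
# One natural-sounding template per intent.
TEMPLATES = {
    "set_alarm": "Done! Your alarm is set for {time}.",
    "get_weather": "Here's the weather for {city}.",
}

def generate_response(intent, **slots):
    # Fill the intent's template with the extracted entity values.
    template = TEMPLATES.get(intent, "Sorry, I didn't catch that.")
    return template.format(**slots)

print(generate_response("set_alarm", time="7 am"))
```

The resulting string is what the TTS layer would then render as speech.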

Engineering Challenges in Virtual Assistants

  • Latency: must respond quickly
  • Noise handling: voice input is messy
  • Context: conversation memory matters
  • Safety: prevent misuse and sensitive actions
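Latency, the first challenge above, is usually tracked per pipeline stage so slow components are visible. A simple timing wrapper sketch (the stage function here is a stand-in):

```python
import time

def timed(stage_name, fn, *args):
    # Run one pipeline stage and report how long it took.
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{stage_name}: {elapsed_ms:.1f} ms")
    return result

# Example: time a fake ASR stage.
text = timed("ASR", lambda audio: "set an alarm", b"...")
```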

Real-World Use Cases

  • Smart home control
  • Customer support automation
  • Scheduling assistants
  • In-car voice assistants

Practice

Which component converts audio into text?



Which component identifies intent and entities?



Which component decides what the assistant does next?



Quick Quiz

ASR converts:





NLU is mainly responsible for:





One major production challenge in assistants is:





Recap: Virtual assistants combine ASR, NLU, dialogue management, tools/actions, and TTS into a full conversational pipeline.

Next up: You’ll learn about Call Center Automation and how Speech AI powers real business workflows at scale.