Speech AI Lesson 39 – Virtual Assistants | Dataplexa

Virtual Assistants

Speech AI becomes truly powerful when it is part of a system that can listen, understand, decide, and respond.

That complete experience is what we call a Virtual Assistant.

Virtual assistants are not just “speech tools”. They are full conversational systems where Speech AI is only one layer.

What Is a Virtual Assistant?

A virtual assistant is a software system that can:

  • Receive user input (voice or text)
  • Understand intent
  • Perform actions
  • Respond naturally (often in voice)

In voice assistants, the pipeline is usually:

Audio → ASR → NLU → Dialogue Manager → Action → NLG → TTS → Audio
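The stages above can be sketched as a chain of functions, each passing its output to the next. This is a minimal illustration with stub functions (all names and return values here are invented stand-ins, not real ASR/TTS engines):

```python
# Illustrative stubs for each pipeline stage.
def asr(audio):
    return "set an alarm for 7 am"          # pretend transcription

def nlu(text):
    return {"intent": "set_alarm", "time": "7 am"}

def dialogue_manager(parsed):
    return (parsed["intent"], parsed["time"])  # decide which action to run

def act(action, arg):
    return f"Alarm set for {arg}"

def nlg(result):
    return f"Okay. {result}."

def tts(text):
    return f"<audio: {text}>"               # pretend synthesis

def assistant(audio):
    parsed = nlu(asr(audio))
    action, arg = dialogue_manager(parsed)
    return tts(nlg(act(action, arg)))

print(assistant(b"..."))
```

Each stage only needs to understand the output of the stage before it, which is why the layers can be developed and swapped independently.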

Core Components of a Voice Assistant

1) ASR (Automatic Speech Recognition)

Converts spoken audio into text. Without a strong ASR layer, everything else fails.

2) NLU (Natural Language Understanding)

Extracts intent and entities from text. Example: “Book a table for 2 at 7 PM” → intent: reservation.

3) Dialogue Manager

Decides what to do next based on context. If needed, it asks follow-up questions.

4) Action Layer

Calls tools/APIs (calendar, weather, reminders, tickets).

5) TTS (Text to Speech)

Converts response text into audio for the user.

Intent Detection (NLU Basics)

Intent detection is the foundation of assistant understanding.

The system tries to classify what the user wants:

  • “Set an alarm” → alarm intent
  • “Play music” → media intent
  • “What’s the weather?” → weather intent

Why This Code Exists

This code demonstrates a simple keyword-based intent classifier. (Real systems use ML models, but the underlying logic is the same.)


def detect_intent(text):
    t = text.lower()                 # normalize case before matching
    if "weather" in t:
        return "get_weather"
    if "alarm" in t or "wake" in t:
        return "set_alarm"
    if "play" in t and "music" in t:
        return "play_music"
    return "unknown"                 # fallback when no rule matches

print(detect_intent("Can you set an alarm for 7 AM?"))

What happens inside:

  • User text is normalized to lowercase
  • Simple rules classify intent

Output: set_alarm

Why this matters:

Even advanced assistants still rely on intent routing to choose the right action path.

Entity Extraction

Entities are important details inside the command.

Example:

  • “Set alarm for 7 AM” → entity: time = 7 AM
  • “Call John” → entity: contact = John

Why This Code Exists

This example shows extracting a basic time entity.


import re

def extract_time(text):
    # Match an hour (1–2 digits), an optional space, then am/pm
    m = re.search(r"(\d{1,2})\s?(am|pm)", text.lower())
    if not m:
        return None
    return f"{m.group(1)} {m.group(2)}"   # e.g. "7 am"

print(extract_time("Set an alarm for 7 AM"))

What happens here:

  • Regex finds time patterns
  • Extracts a clean structured value

Output: 7 am
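The same pattern covers the "Call John" example from earlier. Here is a hedged sketch using a naive word-after-"call" heuristic (a real assistant would resolve the name against the user's actual contact list):

```python
import re

def extract_contact(text):
    # Naive heuristic: take the word right after "call" as the contact name.
    m = re.search(r"\bcall\s+(\w+)", text, re.IGNORECASE)
    if not m:
        return None
    return m.group(1).capitalize()

print(extract_contact("Call John"))
```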

Dialogue Management

Dialogue management handles the conversation flow.

If a user says:

"Book a table"

The system must ask:

  • For how many people?
  • Which time?
  • Which restaurant?

Why This Code Exists

This code shows a simple dialogue state machine.


# Dialogue state: what the user wants and which slot is still missing
state = {"intent": "set_alarm", "time": None}

def next_question(state):
    # Ask only for information that has not been provided yet
    if state["intent"] == "set_alarm" and state["time"] is None:
        return "What time should I set the alarm for?"
    return "Done."

print(next_question(state))

What happens here:

  • The system checks missing info
  • Asks only what is required

Output: What time should I set the alarm for?
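The same approach extends to the "Book a table" example, where several slots may be missing. Each turn asks for the first unfilled slot (the slot names and questions here are illustrative):

```python
# Slots the reservation intent needs, with the question to ask for each.
QUESTIONS = {
    "people": "For how many people?",
    "time": "Which time?",
    "restaurant": "Which restaurant?",
}

def next_slot_question(state):
    # Ask for the first missing slot; confirm once everything is filled.
    for slot, question in QUESTIONS.items():
        if state.get(slot) is None:
            return question
    return "Booking your table now."

state = {"intent": "reservation", "people": 2, "time": None, "restaurant": None}
print(next_slot_question(state))
```

Because the loop walks the slots in order, the assistant asks one focused question per turn instead of overwhelming the user.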

Action Layer (Tool Execution)

Once intent and entities are ready, the assistant performs an action:

  • Set an alarm
  • Call an API
  • Update a database

In production, action layers are often microservices.

Why This Code Exists

This code simulates executing an action safely.


def set_alarm(time_str):
    # In production this would call a real alarm/scheduler API
    return f"Alarm set for {time_str}"

print(set_alarm("7 am"))

What happens:

  • Action is executed using structured data
  • A confirmation message is created

Output: Alarm set for 7 am
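In a fuller system, the dialogue manager routes each intent to its action handler. A minimal dispatch table might look like this (the handler names and messages are illustrative, not a real API):

```python
def set_alarm(time_str):
    return f"Alarm set for {time_str}"

def get_weather(city):
    return f"Fetching weather for {city}"

# Map each intent name to the function that performs it.
ACTIONS = {
    "set_alarm": set_alarm,
    "get_weather": get_weather,
}

def execute(intent, argument):
    handler = ACTIONS.get(intent)
    if handler is None:
        return "Sorry, I can't do that yet."   # graceful fallback
    return handler(argument)

print(execute("set_alarm", "7 am"))
```

This kind of table mirrors the microservice routing mentioned above: adding a new capability means registering one new handler, not rewriting the pipeline.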

Response Generation (NLG + TTS)

After the assistant decides what to say, it produces a response that is natural and polite.

Then TTS converts the response into speech.

This is where prosody, emotion, and realism matter.
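A minimal NLG step can be one response template per intent; real systems generate far richer language, but the idea is the same (these templates are invented for illustration):

```python
# One natural-sounding template per intent.
TEMPLATES = {
    "set_alarm": "Done! Your alarm is set for {time}.",
    "get_weather": "Here's the weather for {city}.",
}

def generate_response(intent, **slots):
    # Fill the intent's template with the extracted entity values.
    template = TEMPLATES.get(intent, "Sorry, I didn't catch that.")
    return template.format(**slots)

print(generate_response("set_alarm", time="7 am"))
```

The resulting string is what the TTS layer would then render as speech.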

Engineering Challenges in Virtual Assistants

  • Latency: must respond quickly
  • Noise handling: voice input is messy
  • Context: conversation memory matters
  • Safety: prevent misuse and sensitive actions
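Latency, the first challenge above, is usually tracked per pipeline stage so slow components are visible. A simple timing wrapper sketch (the stage function here is a stand-in):

```python
import time

def timed(stage_name, fn, *args):
    # Run one pipeline stage and report how long it took.
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{stage_name}: {elapsed_ms:.1f} ms")
    return result

# Example: time a fake ASR stage.
text = timed("ASR", lambda audio: "set an alarm", b"...")
```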

Real-World Use Cases

  • Smart home control
  • Customer support automation
  • Scheduling assistants
  • In-car voice assistants

Practice

Which component converts audio into text?



Which component identifies intent and entities?



Which component decides what the assistant does next?



Quick Quiz

ASR converts:





NLU is mainly responsible for:





One major production challenge in assistants is:





Recap: Virtual assistants combine ASR, NLU, dialogue management, tools/actions, and TTS into a full conversational pipeline.

Next up: You’ll learn about Call Center Automation and how Speech AI powers real business workflows at scale.