Voice-Controlled AI Agent Tutorial: Build Your Own Using AssemblyAI and Groq in Python

Manaal Khan | April 15, 2026 at 9:23 PM | 7 min read

Key Takeaways

The system converts voice commands into structured intents that trigger actions like code generation and file creation
Using API-based services (AssemblyAI, Groq) solved major performance issues compared to running local models
The architecture supports compound commands, letting users request multiple actions in a single voice input
Adding human-in-the-loop confirmation for file operations provides an important safety layer

ℹ️

Read in Short

This tutorial walks through building a voice-controlled AI agent in Python. You'll use AssemblyAI for converting speech to text, Groq's llama-3.1-8b-instant model for understanding what the user wants, and Streamlit for a simple web interface. The whole thing can execute commands like 'create a Python file with a Fibonacci function' from just your voice.

So you want to talk to your computer and have it actually do stuff? Not just answer questions, but write code, create files, summarize documents? Yeah, that's exactly what one developer built, and the approach is surprisingly accessible once you know how the pieces fit together.

The project we're looking at today is a voice-controlled AI agent that takes your spoken commands and converts them into real actions. Think of it as building your own Jarvis, except it actually works and you can see exactly how. Let me break down the architecture and the lessons learned along the way.

How the Pipeline Works

The system follows a straightforward flow that's easy to wrap your head around. Audio goes in, gets converted to text, the AI figures out what you want, and then it executes the appropriate tool. Here's the actual pipeline:

Audio Input: User uploads or records an audio file
Speech-to-Text: AssemblyAI transcribes the audio into text
Intent Detection: Groq's LLM analyzes the text and returns structured JSON with intents and parameters
Tool Execution: The system runs the appropriate action based on detected intents
Output: Results display in the Streamlit UI

What I like about this design is that each stage is modular. If AssemblyAI isn't working for you, swap in another STT service. Want to use a different LLM? Go for it. The pieces don't care about each other as long as the interfaces stay consistent.
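To make the "swap any stage" idea concrete, here's a minimal sketch of the pipeline as three injected callables. The names (`VoicePipeline`, `Transcriber`, and so on) are hypothetical and not taken from the original project, and the stages are stand-ins so the sketch runs without API keys.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; the original project may structure these differently.
Transcriber = Callable[[bytes], str]        # audio bytes -> transcript
IntentDetector = Callable[[str], dict]      # transcript -> {"intents": [...], "params": {...}}
ToolExecutor = Callable[[dict], list[str]]  # intent data -> one result string per action


@dataclass
class VoicePipeline:
    """Wires the stages together; each one can be replaced independently."""
    transcribe: Transcriber
    detect_intents: IntentDetector
    execute: ToolExecutor

    def run(self, audio: bytes) -> list[str]:
        text = self.transcribe(audio)
        intent_data = self.detect_intents(text)
        return self.execute(intent_data)


# Stand-in stages. In the real app these would wrap AssemblyAI, Groq, and the tools.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_detect(text: str) -> dict:
    return {"intents": ["chat"], "params": {"text": text}}


def fake_execute(intent_data: dict) -> list[str]:
    return [f"ran {name}" for name in intent_data["intents"]]


pipeline = VoicePipeline(fake_transcribe, fake_detect, fake_execute)
print(pipeline.run(b"hello there"))
```

Swapping the STT provider then means passing a different `transcribe` function. Nothing else in the pipeline has to change.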

The Tech Stack

Component	Technology	Why This Choice
Speech-to-Text	AssemblyAI	Fast, accurate, no local setup headaches
Language Model	Groq (llama-3.1-8b-instant)	Blazing fast inference, generous free tier
Frontend	Streamlit	Quick prototyping, built-in components
Backend	Python	Obvious choice for ML/AI work

Intent Detection: The Brain of the Operation

Here's where things get interesting. When you say something like 'Create a Python file with a Fibonacci function,' the LLM doesn't just parrot back your words. It returns structured data that the system can actually act on.

json

{
  "intents": ["write_code", "create_file"],
  "params": {
    "filename": "fibonacci.py",
    "language": "python"
  }
}

See what happened there? The model understood you wanted two things: code generation AND file creation. It extracted the filename and programming language from context. This structured output is what makes the whole system work. Without it, you're just doing fancy dictation.
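Before acting on that JSON, it's worth checking that it has the shape you expect. Here's a rough sketch of a parser, assuming the four intent names used in this tutorial's prompt (`write_code`, `create_file`, `summarize`, `chat`). The `IntentResult` class and `parse_intents` function are illustrative names, not the project's actual code.

```python
import json
from dataclasses import dataclass, field

# Intent names as listed in the tutorial's prompt; adjust to match your own tools.
KNOWN_INTENTS = {"write_code", "create_file", "summarize", "chat"}


@dataclass
class IntentResult:
    intents: list
    params: dict = field(default_factory=dict)


def parse_intents(raw: str) -> IntentResult:
    """Parse the LLM reply and reject anything that isn't the expected shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    intents = data.get("intents")
    if not isinstance(intents, list) or not intents:
        raise ValueError("'intents' must be a non-empty list")

    unknown = [i for i in intents if not isinstance(i, str) or i not in KNOWN_INTENTS]
    if unknown:
        raise ValueError(f"unknown intents: {unknown}")

    params = data.get("params") or {}
    if not isinstance(params, dict):
        raise ValueError("'params' must be an object")

    return IntentResult(intents=intents, params=params)


raw_reply = '{"intents": ["write_code", "create_file"], "params": {"filename": "fibonacci.py", "language": "python"}}'
result = parse_intents(raw_reply)
print(result.intents, result.params["filename"])
```

Every failure surfaces as a `ValueError`, so the caller has one exception type to catch before deciding what to do next.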

Compound Commands Are the Real Magic

The kicker? You can chain multiple actions in a single voice command. Say 'Summarize this text and save it to summary.txt' and the system handles both the summarization AND the file creation. That's not trivial to implement, but it's what separates a useful tool from a toy demo.
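One plausible way to handle compound commands is to run the detected intents in order and hand each tool the previous tool's output. The sketch below assumes that design. The tool functions are stand-ins (no LLM call, no disk writes), and the `TOOLS` registry is a hypothetical name.

```python
def summarize(params: dict, previous: str) -> str:
    # Stand-in for an LLM summarization call: keep only the first sentence.
    text = params.get("text", "")
    return text.split(".")[0].strip() + "."


def create_file(params: dict, previous: str) -> str:
    # Stand-in: report what would be written instead of touching the disk.
    filename = params.get("filename", "untitled.txt")
    return f"would write {len(previous)} chars to {filename}"


# Hypothetical registry mapping intent names to tool functions.
TOOLS = {"summarize": summarize, "create_file": create_file}


def run_intents(intents: list, params: dict) -> list:
    """Run each intent in order, feeding the previous result into the next tool."""
    results = []
    previous = ""
    for name in intents:
        tool = TOOLS.get(name)
        if tool is None:
            results.append(f"skipped unknown intent: {name}")
            continue
        previous = tool(params, previous)
        results.append(previous)
    return results


results = run_intents(
    ["summarize", "create_file"],
    {"text": "Groq is fast. It hosts open models.", "filename": "summary.txt"},
)
print(results)
```

The order of the `intents` list matters here: summarizing first means the file step receives the summary rather than the raw text.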

⚠️

Safety First

All file operations are sandboxed to an output/ directory. The system also includes human-in-the-loop confirmation before any file gets created or modified. You won't accidentally overwrite important files because you mumbled something weird.
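The article doesn't show how the sandbox is implemented, so treat this as one possible approach: resolve the requested path and refuse anything that lands outside output/. The function names and the `confirm` callback are assumptions. In a Streamlit app the callback would typically be backed by a button or checkbox.

```python
from pathlib import Path
from typing import Callable

OUTPUT_DIR = Path("output")


def resolve_in_sandbox(filename: str, base: Path = OUTPUT_DIR) -> Path:
    """Return the absolute target path, or raise if it is not inside base."""
    base = base.resolve()
    target = (base / filename).resolve()
    # is_relative_to needs Python 3.9+. It rejects "../x" and absolute paths.
    if target == base or not target.is_relative_to(base):
        raise ValueError(f"{filename!r} is not a file inside the sandbox")
    return target


def write_with_confirmation(
    filename: str,
    content: str,
    confirm: Callable[[Path], bool],
    base: Path = OUTPUT_DIR,
) -> bool:
    """Write content only if the path is sandboxed and the user approves it."""
    target = resolve_in_sandbox(filename, base)
    if not confirm(target):
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return True
```

A call such as `write_with_confirmation("fibonacci.py", code, confirm=ask_user)` writes nothing unless `ask_user` returns `True`, and a filename like `../../etc/passwd` is rejected before the user is even asked.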

The Struggles Were Real

Look, the finished product sounds smooth, but getting there was messy. The developer initially tried running everything locally, and it was a disaster.

Local Models: A Cautionary Tale

The first attempt used Whisper from HuggingFace for speech-to-text and Ollama for the language model. Sounds reasonable, right? In theory, yes. In practice:

FFmpeg setup on Windows was a nightmare
Memory usage went through the roof
Performance on CPU was painfully slow
The system crashed. A lot.

After digging through Reddit threads and developer forums, the developer switched to API-based services, and that fixed these problems. AssemblyAI handles the speech recognition. Groq handles the LLM inference. Both are fast, stable, and way easier to set up than wrestling with local model deployments.

10x+ faster

Switching from local models to API-based services dramatically improved both speed and stability

Model Deprecation Headaches

Here's something nobody warns you about when building with LLM APIs: models get deprecated mid-development. The Groq team updated their available models during the project, which meant scrambling to update model names and adapt to API changes on the fly. Always build with some flexibility for model swaps: keep the model name in one configurable place instead of scattering it through the code.
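A small guard against this is to read the model name from configuration. This is a minimal sketch, and the `GROQ_MODEL` environment variable is a name made up for illustration, not something Groq's SDK reads on its own.

```python
import os

# Default taken from this tutorial; override it when the model is retired.
DEFAULT_MODEL = "llama-3.1-8b-instant"


def get_model_name(env=os.environ) -> str:
    """Return the model from the (hypothetical) GROQ_MODEL variable, else the default."""
    return env.get("GROQ_MODEL", "").strip() or DEFAULT_MODEL


print(get_model_name())
```

When a model is deprecated, the fix becomes a one-line environment change rather than a code edit.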

Cleaning Up LLM Output

Another gotcha: language models love to explain themselves. Ask for code and you might get code plus a paragraph about why the code works that way. Great for learning, terrible for automation. The fix involves strict prompting and post-processing to strip out explanations before saving files.
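The post-processing step might look something like this. It assumes the model wraps code in Markdown fences, which is common but not guaranteed. If no fence is found, it falls back to returning the whole reply. `extract_code` is an illustrative name, not the project's function.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to paste

# Matches a fenced block with an optional language tag and captures its body.
FENCE_RE = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(llm_output: str) -> str:
    """Keep only fenced code from an LLM reply, dropping the surrounding prose."""
    blocks = FENCE_RE.findall(llm_output)
    if blocks:
        return "\n\n".join(block.strip("\n") for block in blocks) + "\n"
    # No fences found: assume the reply is already bare code.
    return llm_output.strip() + "\n"


reply = (
    "Here is the function:\n"
    + FENCE + "python\ndef f():\n    return 1\n" + FENCE
    + "\nThis works because it returns 1."
)
print(extract_code(reply))
```

This only strips prose outside the fences. Explanatory comments inside the code survive, which is usually what you want in a saved file.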

Building This Yourself

Want to try this? Here's the basic flow you'd follow:

Sign up for AssemblyAI and grab your API key
Create a Groq account and get API access
Set up a Python environment with streamlit, requests, and your API client libraries
Build the audio upload component in Streamlit
Connect the uploaded audio to AssemblyAI's transcription endpoint
Send the transcribed text to Groq with a prompt designed to extract intents
Parse the JSON response and route to appropriate tool functions
Display results and maintain session history

python

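import json
import os

from groq import Groq

# The client needs an API key; here it is read from the GROQ_API_KEY environment variable
groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])

# Placeholder: in the real app this is the transcript returned by AssemblyAI
transcribed_text = "Create a Python file with a Fibonacci function"

# Simplified intent detection prompt structure
system_prompt = """
Analyze the user's command and return JSON with:
- intents: list of actions (write_code, create_file, summarize, chat)
- params: relevant parameters extracted from the command

Return ONLY valid JSON, no explanations.
"""

# Send to Groq
response = groq_client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": transcribed_text}
    ]
)

# The reply text should be the JSON object requested in the prompt
intent_data = json.loads(response.choices[0].message.content)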

Graceful Degradation Matters

One smart design choice here: if the LLM fails to return proper intent detection, the system falls back to keyword-based classification. It's not as smart, but it doesn't crash. Your users don't care about your elegant architecture. They care that stuff works.
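The article doesn't list the actual keywords, so the table below is invented for illustration. The shape of the fallback is the point: try the LLM first, and if the call fails or the reply isn't usable JSON, drop down to simple substring matching.

```python
import json
from typing import Callable

# Hypothetical keyword table; tune it to the commands your users actually say.
KEYWORD_INTENTS = [
    ("summarize", ("summarize", "summary", "tl;dr")),
    ("write_code", ("code", "function", "script", "class")),
    ("create_file", ("file", "save", "write to")),
]


def classify_by_keywords(text: str) -> dict:
    """Crude substring matching; defaults to plain chat when nothing matches."""
    lowered = text.lower()
    intents = [
        intent
        for intent, keywords in KEYWORD_INTENTS
        if any(keyword in lowered for keyword in keywords)
    ]
    return {"intents": intents or ["chat"], "params": {}}


def detect_intents(text: str, llm_call: Callable[[str], str]) -> dict:
    """Use the LLM result when it is valid, otherwise fall back to keywords."""
    try:
        data = json.loads(llm_call(text))
        if isinstance(data, dict) and isinstance(data.get("intents"), list) and data["intents"]:
            data.setdefault("params", {})
            return data
    except Exception:
        # Network errors, timeouts, and malformed JSON all end up here.
        pass
    return classify_by_keywords(text)


def chatty_llm(text: str) -> str:
    # Simulates a model that ignored the "JSON only" instruction.
    return "Sure! Here's the JSON you asked for"


print(detect_intents("Summarize this text and save it to summary.txt", chatty_llm))
```

Substring matching is deliberately dumb ("class" also matches "classify"), and the fallback extracts no parameters, so it's a safety net rather than a replacement for the LLM.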

ℹ️

Session Memory

The agent maintains interaction history within each session, so you can reference previous commands and build on earlier results. This makes the experience feel more like working with an actual assistant rather than a stateless tool.
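Here's a rough sketch of what that memory could look like. It's a plain Python class so it runs anywhere. In a Streamlit app you'd likely keep one instance in `st.session_state` so it survives reruns. The class name and the context format are assumptions, not the project's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class SessionHistory:
    """Keeps the most recent command/result pairs for one session."""
    max_turns: int = 20
    turns: list = field(default_factory=list)

    def add(self, command: str, result: str) -> None:
        self.turns.append({"command": command, "result": result})
        # Drop the oldest turns once the cap is exceeded.
        if len(self.turns) > self.max_turns:
            self.turns = self.turns[len(self.turns) - self.max_turns:]

    def as_context(self, last_n: int = 3) -> str:
        """Format recent turns as text that can be prepended to the next prompt."""
        recent = self.turns[-last_n:] if last_n > 0 else []
        return "\n".join(
            f"User: {turn['command']}\nAgent: {turn['result']}" for turn in recent
        )


history = SessionHistory(max_turns=2)
history.add("write a fibonacci function", "generated fibonacci.py")
history.add("summarize it", "It computes Fibonacci numbers.")
print(history.as_context())
```

Capping the history keeps prompts from growing without bound as the session gets longer.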

Why This Architecture Makes Sense

The beauty of this setup is in its simplicity. Each component does one thing well. The STT service transcribes. The LLM understands intent. The tool executor acts. The UI presents results. You can upgrade any piece without touching the others.

And honestly? This is how a lot of production AI systems work. The fancy demos at tech conferences are built on modular pipelines just like this one. Understanding this pattern gives you the foundation to build increasingly sophisticated agents.

What's Next?

This project is a solid foundation, but there's plenty of room to grow. You could add more tools (web search, database queries, system commands). You could improve the intent detection with fine-tuned models. You could add real-time streaming transcription instead of file uploads.

The point is: you now have a working pattern for voice-to-action AI agents. That's not nothing. Go build something cool with it.

Frequently Asked Questions

Is this free to build?

Mostly. Both AssemblyAI and Groq offer free tiers that are generous enough for development and personal projects. You'll only hit paywalls at significant scale.

Can I use a different LLM instead of Groq?

Absolutely. The architecture is model-agnostic. OpenAI, Anthropic, or even local models through Ollama would work with some prompt adjustments.

How accurate is the speech recognition?

AssemblyAI is quite good for clear audio. Background noise and heavy accents will degrade quality, same as any STT system.

Is this secure for production use?

For personal projects, yes. For production, you'd want additional sandboxing and input validation, and you probably shouldn't let it execute arbitrary code without serious guardrails.

Source: DEV Community