
Open-Source Voice AI Decides to Speak Every 0.4 Seconds

Huma Shazia · 6 June 2026 at 4:42 pm · 5 min read

Key Takeaways

Source: The Decoder
  • Audio Interaction processes continuous audio in 0.4-second chunks, deciding after each whether to respond or stay silent
  • The model was trained on 302,000 hours of synthetic audio data and combines multiple tasks previously requiring separate models
  • It outperforms its base model on benchmarks and beats Gemini 3 Flash in proactive noise detection

Today's voice assistants have a problem. They wait for you to finish talking before they respond. It's like having a conversation with someone who stares blankly until you say "over." GPT-4o, Qwen 3.5-Omni, and similar models work this way. They process audio only after the recording ends.

A new open-source model called Audio Interaction takes a different approach. It listens continuously and makes a decision every 0.4 seconds: speak or stay silent. The model comes from researchers at institutions in China, Hong Kong, and Singapore.

0.4 seconds
The interval at which Audio Interaction decides whether to respond, allowing near-instantaneous conversational timing

How the 0.4-Second Loop Works

The system breaks incoming audio into 0.4-second chunks. After each chunk, it outputs one of two tokens: <silent> or <response>. If it picks <silent>, it keeps listening. Only when it outputs <response> does it start generating speech.
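The loop described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the released model's code: the 16 kHz sample rate, the function names, and the energy-threshold `toy_decide` stand-in are all assumptions made for the example. In the real system the decision comes from the model itself.

```python
from typing import Callable, Iterator, List, Tuple

CHUNK_SECONDS = 0.4          # decision interval reported in the article
SAMPLE_RATE = 16_000         # assumption: sample rate is not stated in the article
CHUNK_SAMPLES = round(CHUNK_SECONDS * SAMPLE_RATE)  # 6400 samples per chunk

SILENT = "<silent>"
RESPONSE = "<response>"

Chunk = List[float]


def iter_chunks(samples: List[float]) -> Iterator[Chunk]:
    """Yield consecutive full 0.4-second chunks; a trailing partial chunk is dropped."""
    for start in range(0, len(samples) - CHUNK_SAMPLES + 1, CHUNK_SAMPLES):
        yield samples[start:start + CHUNK_SAMPLES]


def run_loop(
    samples: List[float],
    decide: Callable[[List[Chunk]], str],
    respond: Callable[[List[Chunk]], str],
) -> List[Tuple[float, str]]:
    """After each chunk, ask decide() for a token; call respond() only on <response>."""
    history: List[Chunk] = []
    outputs: List[Tuple[float, str]] = []
    for index, chunk in enumerate(iter_chunks(samples)):
        history.append(chunk)
        if decide(history) == RESPONSE:
            elapsed = round((index + 1) * CHUNK_SECONDS, 1)
            outputs.append((elapsed, respond(history)))
        # On <silent> nothing is emitted and the loop simply keeps listening.
    return outputs


def toy_decide(history: List[Chunk]) -> str:
    """Hypothetical stand-in for the model: respond when the latest chunk is loud."""
    last = history[-1]
    mean_level = sum(abs(x) for x in last) / len(last)
    return RESPONSE if mean_level > 0.5 else SILENT
```

Swapping `toy_decide` for a real model call would leave the loop itself unchanged: the only contract is that each chunk produces exactly one of the two tokens.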

This lets the model handle multiple tasks in a single stream. Translation, transcription, dialogue, and reacting to everyday sounds all run through the same 3-billion-parameter architecture. Classic tasks like "Translate into English" become instructions within the continuous audio flow.

Figure: Audio Interaction processes a continuous audio stream and decides moment by moment whether to stay silent or react. The overview shows a spectrogram of the stream and four applications: online instruction following, real-time transcription, voice chatting, and proactive intervention.

"By breaking the audio stream into atomic 0.4-second chunks, we move away from traditional 'turn-taking' AI and toward a model that can finally coexist with human conversation flows."

— Zhifei Xie, Lead Researcher

One Model Replaces Many

Current streaming systems like Moshi for dialogue or Paraformer for live subtitles can listen in real-time, but they only handle one task. They also treat sounds like coughing as background noise to ignore.

Audio Interaction combines recognition, translation, dialogue, and proactive response in a single setup. The model scored 58.15 points on the MMAU audio benchmark, narrowly beating its base model Qwen2.5-Omni-3B. It also comes close to much larger 7B models on several tasks.

Figure: Audio Interaction combines multiple tasks that previously required separate specialized models. Specialized single-task models for ASR, translation, and speech dialog appear on the left; the unified model on the right uses an audio encoder, adapter, and language model to handle them through a single stream.

On English-Chinese translation, the model shows substantial improvement over the base. In proactive noise detection tests, it beat Gemini 3 Flash.

Training Data: 302,000 Hours of Synthetic Audio

Teaching a model when to jump into a conversation requires specific training data. Existing audio datasets consist of short, isolated clips. They lack the long sequences with sparse response signals that Audio Interaction needs.

The research team built their own training data in three stages. In the first, a language model designed plausible settings, such as a 30-second household scene. In total, the team generated 302,000 hours of synthetic audio to train the model's perceive-decide-respond loop.
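To make "long sequences with sparse response signals" concrete, here is a rough sketch of how a per-chunk label sequence for such a scene could look. The event times, function names, and labeling rule (mark the chunk in which a response-worthy event begins) are assumptions for illustration; the article does not describe the team's actual labeling pipeline.

```python
from typing import List

CHUNK_MS = 400               # 0.4-second decision interval, in milliseconds
SILENT = "<silent>"
RESPONSE = "<response>"


def chunk_count(duration_ms: int) -> int:
    """Number of full 0.4-second decision points in a clip."""
    return duration_ms // CHUNK_MS


def label_sequence(duration_ms: int, response_onsets_ms: List[int]) -> List[str]:
    """Build one label per chunk: <response> where an event begins, else <silent>."""
    labels = [SILENT] * chunk_count(duration_ms)
    for onset in response_onsets_ms:
        index = onset // CHUNK_MS
        if 0 <= index < len(labels):
            labels[index] = RESPONSE
    return labels


# Hypothetical 30-second scene with two response-worthy events at 6 s and 21 s.
labels = label_sequence(30_000, [6_000, 21_000])
print(len(labels), labels.count(RESPONSE))  # 75 decision points, 2 of them <response>
```

Under these assumptions a 30-second scene yields 75 decisions, of which only a handful are responses. If the full 302,000 hours were chunked the same way, that would come to roughly 2.7 billion decision points.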

Figure: A 30-second household scene where the model decides every 0.4 seconds whether to respond. The timeline includes proactive warnings, such as one for a child crying.

The architecture, called SoundFlow, combines an audio encoder, an adapter, and a language model. This design lets the system listen and speak in parallel, which reduces the wait before a response.

Figure: The SoundFlow framework, with audio encoder, adapter, and model components. The model predicts a special token for silence or response per 0.4-second chunk across audio understanding, counting, simultaneous interpretation, and proactive intervention.

Open Source Under Apache 2.0

The model is released under an Apache 2.0 license. This has generated excitement in developer communities, particularly on Hacker News and Reddit's r/LocalLLaMA. The permissive license means developers can use it in commercial applications.

Some developers are already working on integrating Audio Interaction into open-source home automation projects. The SoundFlow architecture is designed to run on consumer-grade hardware, not just cloud infrastructure.

"This isn't just a chatbot that talks back; it's a model that understands the silence between words as much as the words themselves."

— AI Research Analyst, The Decoder

There's debate about performance in highly noisy backgrounds. Some users report that proprietary models like Gemini 3 still handle certain edge cases better. But for an open-source model that developers can run locally, Audio Interaction represents a significant step forward.

Why This Matters for Voice Applications

For years, voice AI has forced users into unnatural patterns. Push-to-talk. Wait for the beep. Speak clearly into the microphone. These constraints exist because models needed discrete audio segments to process.

Audio Interaction's always-on loop mimics how humans actually converse. We decide constantly whether to interject, acknowledge with a "mm-hmm," or stay quiet. Making this work in a lightweight model that runs locally opens new possibilities for real-time voice agents.

Developers building voice interfaces no longer need to rely entirely on massive cloud APIs. A 3-billion-parameter model that handles multiple tasks gives them a foundation to build responsive, full-duplex voice applications.


Frequently Asked Questions

What makes Audio Interaction different from GPT-4o voice?

GPT-4o waits for you to finish speaking before responding. Audio Interaction listens continuously and decides every 0.4 seconds whether to speak, enabling more natural back-and-forth conversation.

Can Audio Interaction run on consumer hardware?

Yes. The 3-billion-parameter model is designed for local deployment on consumer-grade hardware, not just cloud infrastructure.

What tasks can Audio Interaction handle?

It combines dialogue, translation, transcription, and sound recognition in a single model. Previous systems needed separate models for each task.

Is Audio Interaction free to use commercially?

Yes. The model is released under an Apache 2.0 license, which permits commercial use.

How does Audio Interaction compare to Gemini 3 Flash?

Audio Interaction beat Gemini 3 Flash in proactive noise detection tests. However, some users report Gemini 3 still handles highly noisy backgrounds better in certain scenarios.


Source: The Decoder / Jonathan Kemper


Huma Shazia

Senior AI & Tech Writer