In the previous chapter, Central Data & Identity Server, we gave our AI a permanent memory so it remembers who you are. Before that, in The "Stage" (Visual Presentation Layer), we gave it a face to smile at you.
But right now, communication is a bit one-sided. You have to type everything on a keyboard. To make airi feel like a real companion, we want to talk to it naturally, just like talking to a friend on a call.
In this chapter, we will build the Sensory Audio Processing system. This gives our AI "ears."
We want the user to be able to speak freely. However, we cannot just record 24/7 and send everything to a server.
The Solution: We need a smart filter. We need a system that listens locally on your device, detects only when a human is speaking, and then sends only that specific sentence to be transcribed.
To understand how airi hears, think of a news reporter's crew.

The audio capture layer is the cameraman: it grabs the raw audio stream from your microphone, creating a continuous flow of data (floating-point numbers).

VAD stands for Voice Activity Detection. Think of the VAD as the producer in the van, watching the feed live. The producer doesn't care what is being said; they only care whether someone is speaking.

Once this gatekeeper says "that was a sentence," we pass the audio clip to the Scribe. The Scribe uses a heavy AI model (OpenAI Whisper) to turn those sounds into text: "Hello Airi."
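The crew above can be sketched as a tiny state machine: frames flow in continuously, the gatekeeper buffers them while someone is speaking, and the finished segment is handed off when they stop. This is a minimal illustrative sketch; `VadGate` and its `push` method are hypothetical names, not airi's actual API.

```ts
// Hypothetical sketch of the pipeline: capture -> VAD gate -> transcribe.
type Frame = Float32Array

class VadGate {
  private buffered: Frame[] = []
  private speaking = false

  // Feed one frame plus the VAD's verdict. Returns a finished speech
  // segment once the speaker goes quiet, otherwise null.
  push(frame: Frame, isSpeech: boolean): Frame[] | null {
    if (isSpeech) {
      this.speaking = true
      this.buffered.push(frame) // keep collecting the sentence
      return null
    }
    if (this.speaking) {
      this.speaking = false
      const segment = this.buffered
      this.buffered = []
      return segment // hand the collected sentence to the Scribe
    }
    return null // silence before any speech: nothing to do
  }
}
```

Only the segments returned by the gate ever reach the transcription step, which is exactly the "smart filter" property we wanted.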
In the frontend application (The Stage), we use a "Composable" helper to manage the microphone. This hides the complex browser Media APIs.
Here is how we set up the recording logic (simplified from packages/stage-ui/src/composables/audio/audio-recorder.ts):
```ts
// Import the helper
import { useAudioRecorder } from './audio-recorder'

// 1. Initialize the recorder
const { startRecord, stopRecord, recording } = useAudioRecorder(mediaStream)

// 2. Start capturing audio
await startRecord()

// ... time passes ...

// 3. Stop capturing and get the file (Blob)
await stopRecord()
console.log(recording.value) // This is the audio file!
```
Explanation: This is the manual way to record. But we don't want to press buttons ourselves; we want the AI to listen automatically. That's where VAD comes in.
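Under the hood, a recorder like this typically collects many small audio buffers while recording and concatenates them into one contiguous buffer when you stop. A minimal sketch of that concatenation step (`mergeChunks` is a hypothetical helper, not the composable's real internals):

```ts
// Hypothetical helper: merge the small Float32Array chunks a recorder
// collects into one contiguous buffer when recording stops.
function mergeChunks(chunks: Float32Array[]): Float32Array {
  const total = chunks.reduce((n, c) => n + c.length, 0)
  const out = new Float32Array(total)
  let offset = 0
  for (const chunk of chunks) {
    out.set(chunk, offset) // copy each chunk into place
    offset += chunk.length
  }
  return out
}
```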
How does sound travel from your mouth to the AI's brain? The system is split into two parts: the browser, which captures audio and runs the lightweight VAD, and the Rust backend, which runs the heavy Whisper transcription model.
Running AI models in a web browser can make the UI laggy. To prevent this, airi runs the VAD in a Web Worker (a background thread).
We use a library called onnx-community/silero-vad. Let's look at packages/stage-ui/src/workers/vad/vad.ts.
The VAD processor receives small chunks of audio (buffers) continuously.
```ts
// Inside the processAudio loop

// 1. Ask the small AI model: "Is this speech?"
const isSpeech = await this.detectSpeech(inputBuffer)

// 2. If it is speech...
if (isSpeech) {
  if (!this.isRecording) {
    // Notify the app: "User started talking!"
    this.emit('speech-start', undefined)
  }
  // Keep recording
  this.isRecording = true
  return
}
```
Explanation: This code runs dozens of times per second. It checks the probability that the current sound is human speech.
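Under the hood, the Silero model doesn't answer yes or no; it outputs a probability. A common way to turn that into a stable boolean is hysteresis: require a high probability to enter the "speaking" state but only a lower one to stay in it, so brief dips don't flicker the state. The thresholds below are illustrative assumptions, not airi's real configuration:

```ts
// Sketch of probability thresholding with hysteresis (assumed values).
function makeSpeechDetector(startThreshold = 0.6, keepThreshold = 0.4) {
  let speaking = false
  return (probability: number): boolean => {
    // Harder to start speaking than to keep speaking.
    speaking = speaking
      ? probability >= keepThreshold
      : probability >= startThreshold
    return speaking
  }
}
```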
We don't want to cut off immediately if you pause for a split second.
```ts
// If not speech, count how long the silence has lasted
this.postSpeechSamples += inputBuffer.length

// Only stop if the silence is longer than the limit (e.g. 0.5 seconds)
if (this.postSpeechSamples >= minSilenceDurationSamples) {
  // Package up the audio we collected
  this.processSpeechSegment()
  // Notify the app: "User finished talking!"
  this.emit('speech-end', undefined)
}
```
Explanation: This logic ensures smooth recording. It waits for a "comfortable pause" before deciding you are done talking.
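Note that the silence limit is measured in samples, not seconds. Converting between the two is simple arithmetic; the sketch below assumes a 16 kHz sample rate, the rate Silero VAD and Whisper typically expect:

```ts
// Convert a silence duration in seconds into a sample count,
// assuming 16 kHz mono audio.
function silenceDurationToSamples(seconds: number, sampleRate = 16000): number {
  return Math.round(seconds * sampleRate)
}
```

At 16 kHz, a 0.5 s pause means 8000 consecutive non-speech samples must accumulate before `speech-end` fires.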
Once the browser has a tidy audio file of your sentence, it sends it to the Rust backend. This is handled by a Tauri plugin: crates/tauri-plugin-ipc-audio-transcription-ort.
We use Rust because the transcription model (Whisper) is large; it is far too heavy to run smoothly inside a web page.
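The Rust side expects normalized 32-bit floats. Browser audio is already in that format, but if a capture source hands you 16-bit PCM instead, it must be normalized first. A hypothetical conversion helper (not part of airi's codebase) looks like this:

```ts
// Hypothetical helper: convert 16-bit PCM samples into normalized
// 32-bit floats in [-1, 1], the shape a Vec<f32> parameter expects.
function pcm16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length)
  for (let i = 0; i < pcm.length; i++)
    out[i] = pcm[i] / 32768 // Int16 range is [-32768, 32767]
  return out
}
```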
The command below receives the raw audio samples as a Vec<f32>:
```rust
#[tauri::command]
async fn ipc_audio_transcription<R: Runtime>(
  app: tauri::AppHandle<R>,
  chunk: Vec<f32>, // The audio data
  language: Option<String>,
) -> Result<String, String> {
  // 1. Get access to the loaded Whisper model (locked for thread safety)
  let state = app.state::<Mutex<AppDataWhisperProcessor>>();
  let mut processor = state.lock().unwrap();

  // 2. Configure the settings
  let mut config = whisper::whisper::GenerationConfig::default();
  config.language = language;

  // 3. Transcribe! (Audio -> Text)
  let transcription = processor
    .transcribe(chunk.as_slice(), &config)
    .map_err(|e| e.to_string())?;

  // 4. Return the text string
  Ok(transcription)
}
```
Explanation:

- Mutex: We lock the model so two requests can't use it at the exact same time (thread safety).
- transcribe: This performs the magic. It takes the numbers and outputs a string like "Hello."

In this chapter, we gave airi the sense of hearing.
Now our AI can see us (via the Stage), remember us (via the Server), and hear us (via Audio Processing). But it is still trapped inside the computer screen. It cannot control anything outside of its chat window.
In the next chapter, we will give the AI hands to control your computer.
Next Chapter: Native Capabilities Bridge
Generated by Code IQ