Chapter 5 · CORE

📄 05_sensory_audio_processing__hearing_.md 🏷 Core

Chapter 5: Sensory Audio Processing (Hearing)

In the previous chapter, Central Data & Identity Server, we gave our AI a permanent memory so it remembers who you are. Before that, in The "Stage" (Visual Presentation Layer), we gave it a face to smile at you.

But right now, communication is a bit one-sided. You have to type everything on a keyboard. To make airi feel like a real companion, we want to talk to it naturally, just like talking to a friend on a call.

In this chapter, we will build the Sensory Audio Processing system. This gives our AI "ears."

The Motivation: The "Hot Mic" Problem

We want the user to be able to speak freely. However, we cannot just record 24/7 and send everything to a server.

  1. Privacy: We don't want to record you eating chips or typing.
  2. Cost: Processing audio is expensive (computationally).
  3. Latency: Sending hours of silence to a server slows everything down.

The Solution: We need a smart filter. We need a system that listens locally on your device, detects only when a human is speaking, and then sends only that specific sentence to be transcribed.

Key Concepts

To understand how Airi hears, think of a News Reporter.

1. The Recorder (Audio Capture)

This is the cameraman. It grabs the raw audio stream from your microphone. It creates a continuous flow of data (floating point numbers).
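To make "a continuous flow of floating point numbers" concrete, here is a tiny illustration. Each sample is a float in the range [-1, 1], and at a 16 kHz sample rate one second of audio is 16,000 of them. The values below are made up for demonstration:

```typescript
// A (made-up) chunk of raw audio samples, as the microphone delivers them.
// Real chunks are much longer; this one has just 4 samples.
const chunk = new Float32Array([0.0, 0.12, -0.08, 0.25])

console.log(chunk.length)       // 4 samples
console.log(Math.max(...chunk)) // 0.25, the loudest point in this chunk
```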

2. The Gatekeeper (VAD)

VAD stands for Voice Activity Detection. Think of this as the producer in the news van. They are watching the feed live. They don't care what words are being said; they only care whether someone is speaking.

3. The Scribe (Transcription)

Once the Gatekeeper says "That was a sentence," we pass that audio clip to the Scribe. The Scribe uses a heavy AI model (OpenAI Whisper) to turn those sounds into text: "Hello Airi."

How to Use: The Audio Recorder

In the frontend application (The Stage), we use a "Composable" helper to manage the microphone. This hides the complex browser Media APIs.

Here is how we set up the recording logic (simplified from packages/stage-ui/src/composables/audio/audio-recorder.ts):

// Import the helper
import { useAudioRecorder } from './audio-recorder'

// 1. Get microphone access, then initialize the recorder
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true })
const { startRecord, stopRecord, recording } = useAudioRecorder(mediaStream)

// 2. Start capturing audio
await startRecord()

// ... time passes ...

// 3. Stop capturing and get the file (Blob)
await stopRecord()
console.log(recording.value) // This is the audio file!

Explanation: This is the manual way to record. But we don't want to manually press buttons. We want the AI to listen automatically. That's where VAD comes in.

Internal Implementation: The Hearing Pipeline

How does sound travel from your mouth to the AI's brain?

The Flow of Sound

The system is split into two parts:

  1. Browser (Lightweight): Detects when to record.
  2. Rust Backend (Heavyweight): Translates audio to text.

sequenceDiagram
    participant User
    participant VAD as Browser VAD (Worker)
    participant Rust as Rust Core (Whisper)
    participant Brain as Cognitive Brain
    User->>VAD: Speaks "Hello!"
    Note right of VAD: Detects human voice<br/>Starts buffering audio.
    User->>VAD: Stops speaking.
    Note right of VAD: Detects silence.<br/>Cuts the clip.
    VAD->>Rust: Sends audio clip (Blob)
    Rust->>Rust: Runs Whisper Model
    Rust->>Brain: Returns text: "Hello!"

Deep Dive 1: The Gatekeeper (VAD) inside the Browser

Running AI models in a web browser can make the UI laggy. To prevent this, airi runs the VAD in a Web Worker (a background thread).

We use a library called onnx-community/silero-vad. Let's look at packages/stage-ui/src/workers/vad/vad.ts.

Detecting Speech

The VAD processor receives small chunks of audio (buffers) continuously.

// Inside the processAudio loop
// 1. Ask the small AI model: "Is this speech?"
const isSpeech = await this.detectSpeech(inputBuffer)

// 2. If it is speech...
if (isSpeech) {
  if (!this.isRecording) {
    // Notify the app: "User started talking!"
    this.emit('speech-start', undefined)
  }
  
  // Keep recording
  this.isRecording = true
  return
}

Explanation: This code runs dozens of times per second. It checks the probability that the current sound is human speech.
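Under the hood, the Silero VAD model outputs a probability per chunk, and detectSpeech reduces that to a yes/no answer by comparing it against a threshold. A hypothetical sketch of that decision (the 0.5 threshold is an assumption for illustration, not the project's actual value):

```typescript
// Assumed threshold: probabilities at or above this count as speech
const SPEECH_THRESHOLD = 0.5

// The model's per-chunk output is a probability in [0, 1];
// this turns it into the boolean that the recording loop needs.
function isSpeechProbability(probability: number): boolean {
  return probability >= SPEECH_THRESHOLD
}

console.log(isSpeechProbability(0.92)) // true, very likely speech
console.log(isSpeechProbability(0.03)) // false, very likely silence
```

Tuning this threshold trades false alarms (keyboard clicks recorded as speech) against missed words (quiet speech discarded as noise).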

Handling Silence

We don't want to cut off immediately if you pause for a split second.

// If not speech, we count how long the silence is
this.postSpeechSamples += inputBuffer.length

// Only stop if silence is longer than the limit (e.g. 0.5 seconds)
if (this.postSpeechSamples >= minSilenceDurationSamples) {
  
  // Package up the audio we collected
  this.processSpeechSegment()
  
  // Notify the app: "User finished talking!"
  this.emit('speech-end', undefined)
}

Explanation: This logic ensures smooth recording. It waits for a "comfortable pause" before deciding you are done talking.
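Note that the silence limit is naturally configured in seconds but compared in samples, so it has to be scaled by the sample rate. A small sketch of that conversion (16 kHz is a typical rate for speech models; the helper name is mine, not the project's):

```typescript
// Convert a duration in seconds to a sample count at the given sample rate.
function secondsToSamples(seconds: number, sampleRate = 16000): number {
  return Math.round(seconds * sampleRate)
}

// A 0.5 s "comfortable pause" at 16 kHz:
console.log(secondsToSamples(0.5)) // 8000 samples
```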

Deep Dive 2: The Scribe (Rust Backend)

Once the browser has a tidy audio file of your sentence, it sends it to the Rust backend. This is handled by a Tauri plugin: crates/tauri-plugin-ipc-audio-transcription-ort.

We use Rust because the transcription model (Whisper) is too large and compute-heavy to run smoothly inside a web browser.

The Rust Command

This function receives the raw audio numbers (Vec<f32>).

#[tauri::command]
async fn ipc_audio_transcription<R: Runtime>(
  app: tauri::AppHandle<R>,
  chunk: Vec<f32>, // The audio data
  language: Option<String>,
) -> Result<String, String> {
  
  // 1. Get access to the loaded Whisper model
  let state = app.state::<Mutex<AppDataWhisperProcessor>>();
  let mut processor = state.lock().unwrap();
  
  // 2. Configure the settings
  let mut config = whisper::whisper::GenerationConfig::default();
  config.language = language;

  // 3. Transcribe! (Audio -> Text)
  let transcription = processor
    .transcribe(chunk.as_slice(), &config)
    .map_err(|e| e.to_string())?;

  // 4. Return the text string
  Ok(transcription)
}

Explanation:

  1. Mutex: We lock the model so two transcription requests can't use it at the exact same time (thread safety).
  2. transcribe: This performs the magic. It takes the numbers and outputs a string like "Hello."
  3. Return: This text is sent back to the frontend, which then forwards it to The Cognitive Brain.
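From the frontend side, calling this command goes through Tauri's `invoke` (from `@tauri-apps/api`); the command name matches the Rust function above. The audio must be serialized as a plain number array for the IPC payload. A hedged sketch, where `toIpcChunk` is my own helper name, not the project's:

```typescript
// Convert captured samples into a plain array so they survive JSON serialization
// across the Tauri IPC boundary.
function toIpcChunk(samples: Float32Array): number[] {
  return Array.from(samples)
}

// In the real app this would look roughly like (assumption, untested here):
// import { invoke } from '@tauri-apps/api/core'
// const text = await invoke<string>('ipc_audio_transcription', {
//   chunk: toIpcChunk(samples),
//   language: 'en',
// })
```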

Summary

In this chapter, we gave airi the sense of hearing.

  1. We solved the "Hot Mic" problem using VAD (Voice Activity Detection).
  2. We perform VAD in the Browser to be fast and private.
  3. We perform Transcription in Rust to be powerful and accurate.

Now our AI can see us (via the Stage), remember us (via the Server), and hear us (via Audio Processing). But it is still trapped inside the computer screen. It cannot control anything outside of its chat window.

In the next chapter, we will give the AI hands to control your computer.

Next Chapter: Native Capabilities Bridge


Generated by Code IQ