Welcome to the Jarvis AI Assistant project!
In this first chapter, we are going to build the "Director" of our application. Before we worry about complex AI models or transcribing audio, we need a system that knows when to listen and when to act.
Imagine using a standard AI tool (like ChatGPT):
That is too slow.
Our goal is Push-to-Talk Orchestration. You press and hold a key (like the button on a walkie-talkie), speak, and when you release the key, the AI types the answer straight into your active window.
To achieve this, we need a "Director", a central piece of code that orchestrates the flow:
This is the trigger. Unlike a key press on a web page (which is only detected while the browser has focus), our app needs to detect the Fn (Function) key even when the app is minimized or in the background.
The application needs to know what "mode" it is in. It acts like a traffic light:
This is the code that connects the input (Key Press) to the service (Audio Recorder). It also tells gestures apart: releasing the key quickly (a tap) can trigger one action, while holding it triggers another.
Before looking at code, let's visualize the "Director" in action.
Let's look at how we build this, starting from the lowest level (the keyboard) up to the high-level orchestration.
JavaScript on its own cannot listen for key presses outside its own window; the operating system restricts this for security reasons. We use a small piece of native C++/Objective-C code to "hook" into the operating system's event stream.
We will cover the deep details of this in Native System Bridges, but here is the logic used in src/native/fn_key_monitor.mm.
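    // src/native/fn_key_monitor.mm
    // A simplified look at how we catch the key press
    #include <ApplicationServices/ApplicationServices.h>

    static bool wasFnKeyPressed = false;

    // Standard CGEventTap callback signature
    CGEventRef eventCallback(CGEventTapProxy proxy, CGEventType type,
                             CGEventRef event, void *refcon) {
        // Modifier keys (Fn, Ctrl, Alt, ...) arrive as "flags changed" events
        if (type == kCGEventFlagsChanged) {
            bool isFnKeyPressed =
                (CGEventGetFlags(event) & kCGEventFlagMaskSecondaryFn) != 0;
            // Did the Fn key state change?
            if (isFnKeyPressed != wasFnKeyPressed) {
                wasFnKeyPressed = isFnKeyPressed;  // remember the new state
                // Send signal to JavaScript world!
                sendToMainProcess("FN_KEY_CHANGE");
            }
        }
        return event;
    }
Explanation: This code runs at the operating-system level, outside the JavaScript runtime. When the Fn bit in the event's modifier flags changes, it notifies our JavaScript app.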
In src/main.ts, our application receives that signal. This is where the "Director" decides what to do. It handles logic like: "Is this a double tap?" or "Should I start recording?"
    // src/main.ts
    async function handleHotkeyDown() {
      // 1. Check if we are already recording (prevent glitches)
      if (pushToTalkService.active) return;

      // 2. Start the visual feedback (show the waveform window)
      waveformWindow.webContents.send('push-to-talk-start');

      // 3. Tell the service to start listening
      await pushToTalkService.start();
    }
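The `active` check stops a second press from starting a second recording. If `active` only becomes true after `start()` resolves (an assumption; this chapter does not show when the flag is set), two key-down signals arriving close together could both pass the check. A minimal sketch of one way to close that gap follows. The `singleFlight` helper is hypothetical and not part of the project.

```typescript
// Hypothetical helper: wraps an async task so that overlapping calls
// share one in-flight promise instead of starting the task twice.
function singleFlight(task: () => Promise<void>): () => Promise<void> {
  let inFlight: Promise<void> | null = null;
  return () => {
    if (inFlight !== null) return inFlight; // a call is already running
    inFlight = task().finally(() => {
      inFlight = null; // allow the next call once this one settles
    });
    return inFlight;
  };
}

// Usage sketch: count how often the wrapped task really runs.
let startCalls = 0;
const guardedStart = singleFlight(async () => {
  startCalls += 1;
});

const first = guardedStart();
const second = guardedStart(); // arrives before the first call has settled
```

Here `second` is the same promise as `first`, so the task body runs once. In the handler above, the same idea would mean wrapping the call to `pushToTalkService.start()`.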
When the user releases the key, the Director shouts "Cut!":
    // src/main.ts
    async function handleHotkeyUp() {
      // 1. Only stop if we were actually recording
      if (pushToTalkService.active) {
        // 2. Stop the service
        // The service handles the "magic" (transcription) automatically after stop
        await pushToTalkService.stop();

        // 3. Hide the visual feedback
        waveformWindow.webContents.send('push-to-talk-stop');
      }
    }
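The handlers above treat every press the same way, but the Director also has to tell a quick tap from a hold, and a single tap from a double tap. The sketch below shows one way to classify those gestures from key-down and key-up timestamps. The class name and both thresholds are assumptions made for illustration; the project's real values are not shown in this chapter.

```typescript
// Hypothetical thresholds, chosen only for this example.
const HOLD_THRESHOLD_MS = 250;    // held at least this long counts as a hold
const DOUBLE_TAP_WINDOW_MS = 400; // max gap between two taps of a double tap

type Gesture = "hold" | "tap" | "double-tap";

class GestureClassifier {
  private pressedAt: number | null = null;
  private lastTapAt: number | null = null;

  keyDown(now: number): void {
    // Ignore repeated key-down signals while the key is already held
    if (this.pressedAt === null) this.pressedAt = now;
  }

  keyUp(now: number): Gesture | null {
    if (this.pressedAt === null) return null; // release without a press
    const heldFor = now - this.pressedAt;
    this.pressedAt = null;

    if (heldFor >= HOLD_THRESHOLD_MS) {
      this.lastTapAt = null; // a hold breaks any tap sequence
      return "hold";
    }

    const isDouble =
      this.lastTapAt !== null && now - this.lastTapAt <= DOUBLE_TAP_WINDOW_MS;
    // After a double tap, start a fresh sequence
    this.lastTapAt = isDouble ? null : now;
    return isDouble ? "double-tap" : "tap";
  }
}

// Usage sketch: timestamps are passed in, which keeps the logic testable.
const classifier = new GestureClassifier();
classifier.keyDown(0);
const gesture = classifier.keyUp(500); // held for 500 ms
```

Passing timestamps in, rather than reading the clock inside the class, is a design choice that lets the logic run without real delays.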
Finally, we have the PushToTalkService in src/input/push-to-talk-refactored.ts. This class hides the complexity. When main.ts calls .start(), this service warms up the microphone, prepares the AI, and manages errors.
    // src/input/push-to-talk-refactored.ts
    export class PushToTalkService {
      private orchestrator: PushToTalkOrchestrator;

      // The Main process calls this
      async start(): Promise<void> {
        // We delegate the heavy lifting to the internal orchestrator,
        // which keeps state management and call ordering in one place
        await this.orchestrator.start();
      }

      async stop(): Promise<void> {
        // When stopped, the orchestrator automatically triggers
        // the Transcription -> AI -> Paste pipeline
        await this.orchestrator.stop();
      }
    }
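The "traffic light" idea from earlier can be sketched as a small state machine that refuses invalid transitions. The mode names (`idle`, `recording`, `processing`) and the `ModeMachine` class are assumptions made for illustration. The real orchestrator's states are not shown in this chapter.

```typescript
// Hypothetical modes for this example.
type Mode = "idle" | "recording" | "processing";

class ModeMachine {
  private mode: Mode = "idle";

  get current(): Mode {
    return this.mode;
  }

  // Mirrors the `active` flag that main.ts checks before starting or stopping
  get active(): boolean {
    return this.mode === "recording";
  }

  // Each transition returns true if it happened, false if it was not allowed.
  start(): boolean {
    if (this.mode !== "idle") return false;
    this.mode = "recording";
    return true;
  }

  stop(): boolean {
    if (this.mode !== "recording") return false;
    this.mode = "processing";
    return true;
  }

  finish(): boolean {
    if (this.mode !== "processing") return false;
    this.mode = "idle";
    return true;
  }
}

// Usage sketch: a second start() while recording is rejected.
const machine = new ModeMachine();
const started = machine.start();      // idle -> recording
const startedAgain = machine.start(); // rejected, already recording
```

Returning a boolean from each transition lets the caller decide whether a rejected transition should be ignored or logged.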
In this chapter, we established the Push-to-Talk Orchestration:
This architecture decouples the input (the key press) from the action (recording). This allows us to easily change the hotkey later or add features like "Hands-Free Mode" (double-tap) without rewriting the recording logic.
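To make the decoupling concrete, here is a minimal sketch in which the Director depends only on two small interfaces. `HotkeySource`, `Recorder`, `wireDirector`, and both fake classes are hypothetical names invented for this example and are not taken from the project.

```typescript
// Hypothetical interfaces: any key (or other trigger) can act as a source.
interface HotkeySource {
  onDown(listener: () => void): void;
  onUp(listener: () => void): void;
}

interface Recorder {
  readonly active: boolean;
  start(): Promise<void>;
  stop(): Promise<void>;
}

// The wiring knows nothing about which key is used or how audio is recorded.
function wireDirector(hotkey: HotkeySource, recorder: Recorder): void {
  hotkey.onDown(() => {
    if (!recorder.active) recorder.start().catch(console.error);
  });
  hotkey.onUp(() => {
    if (recorder.active) recorder.stop().catch(console.error);
  });
}

// In-memory stand-ins so the sketch runs without native code or a microphone.
class FakeHotkey implements HotkeySource {
  private downListeners: Array<() => void> = [];
  private upListeners: Array<() => void> = [];

  onDown(listener: () => void): void {
    this.downListeners.push(listener);
  }
  onUp(listener: () => void): void {
    this.upListeners.push(listener);
  }
  press(): void {
    this.downListeners.forEach((listener) => listener());
  }
  release(): void {
    this.upListeners.forEach((listener) => listener());
  }
}

class FakeRecorder implements Recorder {
  active = false;
  log: string[] = [];

  async start(): Promise<void> {
    this.active = true;
    this.log.push("start");
  }
  async stop(): Promise<void> {
    this.active = false;
    this.log.push("stop");
  }
}

const fakeHotkey = new FakeHotkey();
const fakeRecorder = new FakeRecorder();
wireDirector(fakeHotkey, fakeRecorder);

fakeHotkey.press();
fakeHotkey.press(); // repeated key-down is ignored while active
fakeHotkey.release();
```

Swapping the Fn key for another trigger would then mean writing a new `HotkeySource`, with `wireDirector` and the recorder left unchanged.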
In the next chapter, we will dive deeper into how that C++ code actually communicates with our TypeScript application to make the global hotkey possible.
Next Chapter: Native System Bridges