Chapter 2: Intelligent Routing & Failover

Welcome to the second chapter of the cc-switch tutorial!

In the Local Proxy Gateway chapter, we built a server that intercepts your AI traffic. But right now, it's just a dumb pipe: it receives a request and blindly forwards it.

What happens if Anthropic's API goes down? Or if your OpenAI credit runs out mid-coding session?

In this chapter, we will turn our dumb pipe into a Smart Router that can detect failures and automatically switch to a backup provider.

The Problem: APIs Break

Imagine you are using Claude Code. You are in the flow, writing a complex feature. Suddenly, you get a 500 Server Error. The API is down.

Without cc-switch: You have to stop coding, go to your settings, find an API key for a different provider (like OpenRouter), paste it in, and restart your tool. Flow broken.

With cc-switch: The system acts like a GPS with live traffic data. It sees the "roadblock" (API error) and instantly reroutes you to the next best path without you even noticing.

Key Concepts

To achieve this, we need three new concepts:

1. The Circuit Breaker

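Think of this like the fuse box in your house. If an appliance shorts out, the fuse blows to protect the house. In the same way, when a provider keeps failing, its circuit "opens" and we stop sending requests to it until it is healthy again.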
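
To make the fuse-box idea concrete, here is a minimal sketch of a breaker that counts consecutive failures. The `CircuitBreaker` type, its `failure_threshold` field, and the rule "trip once the count reaches the threshold" are assumptions for illustration; the real cc-switch breaker is async and may track more state.

```rust
/// Hypothetical, simplified circuit breaker (not the cc-switch implementation).
#[derive(Debug)]
pub struct CircuitBreaker {
    /// Number of consecutive failures that trips the circuit (assumed rule).
    failure_threshold: u32,
    consecutive_failures: u32,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32) -> Self {
        Self { failure_threshold, consecutive_failures: 0 }
    }

    /// The circuit is Closed (healthy) while failures stay below the threshold.
    pub fn is_available(&self) -> bool {
        self.consecutive_failures < self.failure_threshold
    }

    /// Count one more failed request.
    pub fn record_failure(&mut self) {
        self.consecutive_failures = self.consecutive_failures.saturating_add(1);
    }

    /// A successful request resets the counter and closes the circuit.
    pub fn record_success(&mut self) {
        self.consecutive_failures = 0;
    }
}
```

With a threshold of 3, the breaker stays available after two failures, becomes unavailable after the third, and recovers on the next recorded success.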

2. The Failover Queue

Instead of having just one active provider, we have a prioritized list.

  1. Primary: OpenAI (Fastest)
  2. Backup 1: OpenRouter (Cheaper)
  3. Backup 2: Anthropic (Reliable)
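
One way to picture the queue is as a list of entries sorted by priority. The sketch below assumes a hypothetical `QueueEntry` type where a lower `priority` number means "try this first"; the actual storage format in cc-switch may differ.

```rust
/// Hypothetical queue entry: lower `priority` is tried first (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct QueueEntry {
    pub provider_id: String,
    pub priority: u32,
}

/// Returns provider ids ordered from Primary to last Backup.
pub fn ordered_ids(mut entries: Vec<QueueEntry>) -> Vec<String> {
    // `sort_by_key` is stable, so entries with equal priority keep their order.
    entries.sort_by_key(|entry| entry.priority);
    entries.into_iter().map(|entry| entry.provider_id).collect()
}
```

Sorting once and handing the router a plain ordered list keeps the selection loop simple: it only has to walk the list front to back.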

3. The Router

This is the brain. For every request, it looks at the Failover Queue and asks the Circuit Breaker: "Is the Primary healthy? No? Okay, is Backup 1 healthy?"
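
The router's question-and-answer loop can be sketched as a filter over the ordered queue. Here `select_healthy` and the `is_healthy` callback are hypothetical stand-ins for the router and the circuit breaker lookup.

```rust
/// Walks the queue in priority order and keeps only providers that the
/// health check accepts. `is_healthy` stands in for the circuit breaker.
pub fn select_healthy<F>(queue: &[String], is_healthy: F) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    queue
        .iter()
        .filter(|id| is_healthy(id.as_str()))
        .cloned()
        .collect()
}
```

Because the queue order is preserved, the first element of the result is always the highest-priority provider that is currently healthy.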


Usage: Asking for a Provider

In our code, we stop asking for a specific provider by name. Instead, we ask the router for the "best available" one.

Here is how we use the ProviderRouter in our application logic:

// Inside a request handler
// Ask the router: "Who should handle this request for Claude?"
let providers = router.select_providers("claude").await?;

// The router returns a list of healthy providers
// We pick the first one (the highest priority healthy one)
let best_provider = providers.first().ok_or(AppError::NoProvidersConfigured)?;

println!("Routing request to: {}", best_provider.name);

If the primary provider is "tripped" (broken), select_providers won't even return it. It will return the backup immediately.


Internal Implementation: The Decision Flow

How does the router make this decision? Let's look at the lifecycle of a request with failover enabled.

Sequence Diagram

sequenceDiagram
    participant App
    participant Router
    participant DB as Database
    participant CB as Circuit Breaker
    App->>Router: select_providers("claude")
    Router->>DB: Get Failover Queue (Order: A, B)
    loop Check Provider A
        Router->>CB: Is A healthy?
        CB-->>Router: No (Circuit Open)
    end
    loop Check Provider B
        Router->>CB: Is B healthy?
        CB-->>Router: Yes (Circuit Closed)
    end
    Router-->>App: Return [Provider B]

1. Selecting the Provider (provider_router.rs)

The core logic lives in ProviderRouter::select_providers. It iterates through your configured providers and checks their health.

// src-tauri/src/proxy/provider_router.rs

pub async fn select_providers(&self, app_type: &str) -> Result<Vec<Provider>, AppError> {
    // 1. Get the ordered list of providers (Primary -> Backup)
    let ordered_ids = self.db.get_failover_queue(app_type)?;

    let mut available_providers = Vec::new();

    // 2. Iterate through them
    for provider_id in ordered_ids {
        // 3. Check the circuit breaker for this specific provider
        let circuit_key = format!("{app_type}:{}", provider_id);
        let breaker = self.get_or_create_circuit_breaker(&circuit_key).await;

        // 4. If healthy, add to list
        if breaker.is_available().await {
            available_providers.push(self.db.get_provider(&provider_id)?);
        }
    }
    
    Ok(available_providers)
}

Simplified: We loop through the list. If the "Circuit Breaker" says the provider is okay, we keep it. If not, we skip it.

2. Recording Success or Failure

The system only learns that a provider is down if we tell it. After every request, we report the result back to the router.

// src-tauri/src/proxy/provider_router.rs

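pub async fn record_result(
    &self,
    app_type: &str,
    provider_id: &str,
    success: bool,
) {
    // Use the same "app_type:provider_id" key as select_providers,
    // so both functions read and update the same breaker.
    let circuit_key = format!("{app_type}:{provider_id}");
    let breaker = self.get_or_create_circuit_breaker(&circuit_key).await;

    if success {
        // If it worked, mark the provider healthy again
        breaker.record_success().await;
    } else {
        // If it failed, count the error.
        // If errors > threshold, the circuit trips (Open).
        breaker.record_failure().await;
    }
}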
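
Since select_providers looks breakers up by an "app_type:provider_id" key, results have to be recorded under that same key. Below is a rough, synchronous sketch of such a registry. `BreakerRegistry`, `Breaker`, and the threshold rule are assumptions for illustration, not the actual cc-switch types.

```rust
use std::collections::HashMap;

/// Hypothetical per-provider state: just a failure counter.
#[derive(Debug, Default)]
struct Breaker {
    consecutive_failures: u32,
}

/// Hypothetical registry holding one breaker per "app_type:provider_id" key.
#[derive(Debug)]
pub struct BreakerRegistry {
    failure_threshold: u32,
    breakers: HashMap<String, Breaker>,
}

impl BreakerRegistry {
    pub fn new(failure_threshold: u32) -> Self {
        Self { failure_threshold, breakers: HashMap::new() }
    }

    fn key(app_type: &str, provider_id: &str) -> String {
        format!("{app_type}:{provider_id}")
    }

    /// Creates the breaker on first use, then updates its counter.
    pub fn record_result(&mut self, app_type: &str, provider_id: &str, success: bool) {
        let breaker = self
            .breakers
            .entry(Self::key(app_type, provider_id))
            .or_default();
        if success {
            breaker.consecutive_failures = 0;
        } else {
            breaker.consecutive_failures = breaker.consecutive_failures.saturating_add(1);
        }
    }

    /// Providers with no recorded results are treated as healthy.
    pub fn is_available(&self, app_type: &str, provider_id: &str) -> bool {
        self.breakers
            .get(&Self::key(app_type, provider_id))
            .map_or(true, |b| b.consecutive_failures < self.failure_threshold)
    }
}
```

Keying by app type as well as provider id means the same provider can be tripped for one app while staying available for another.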

3. The Failover Switch (failover_switch.rs)

If the router decides to use a Backup provider (different from your usual Primary), we want to update the UI so the user knows a switch happened.

This is handled by the FailoverSwitchManager. It prevents "flickering" (switching back and forth too fast) and notifies the frontend.

// src-tauri/src/proxy/failover_switch.rs

pub async fn try_switch(&self, app_type: &str, new_provider_id: &str) -> Result<bool, AppError> {
    // 1. Check if we are already switching (debounce)
    if self.pending_switches.contains(new_provider_id) {
        return Ok(false);
    }

    // 2. Update the database to make this the new "Current" provider
    self.db.set_current_provider(app_type, new_provider_id)?;

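    // 3. Tell the Frontend React App to update the UI
    //    (simplified: `app_handle`, the Tauri app handle, is assumed to be in scope;
    //    a failed emit is ignored because the switch itself already succeeded)
    let _ = app_handle.emit("provider-switched", json!({
        "appType": app_type,
        "providerId": new_provider_id
    }));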

    Ok(true)
}
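
The debounce check only works if something adds the switch to the pending set and removes it afterwards. A minimal sketch of that bookkeeping follows; the `SwitchGuard` type, the `Mutex<HashSet<String>>` field, and the `do_switch` callback are all assumptions, and the real FailoverSwitchManager is async.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Hypothetical debounce helper: tracks switches that are in progress.
#[derive(Default)]
pub struct SwitchGuard {
    pending: Mutex<HashSet<String>>,
}

impl SwitchGuard {
    /// Runs `do_switch` unless a switch for the same app and provider is
    /// already in progress. Returns `false` when the call was skipped.
    pub fn try_switch<F: FnOnce()>(
        &self,
        app_type: &str,
        provider_id: &str,
        do_switch: F,
    ) -> bool {
        let key = format!("{app_type}:{provider_id}");

        // `insert` returns false if the key was already present.
        // The lock is released at the end of this condition.
        if !self.pending.lock().unwrap().insert(key.clone()) {
            return false;
        }

        do_switch();

        // Clear the marker so later switches to this provider are allowed.
        // (A panic inside `do_switch` would leave the key behind in this sketch.)
        self.pending.lock().unwrap().remove(&key);
        true
    }
}
```

Checking and inserting in a single `insert` call avoids a gap in which two callers could both see "not pending" and both proceed.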

Putting it together

Here is the complete story of a "Failover" event:

  1. Request: You send a prompt. The Router picks Provider A (Primary).
  2. Failure: The request fails with a 500 error.
  3. Record: We call record_result with success = false. Once the failures exceed the threshold, the Circuit Breaker for A trips to Open.
  4. Retry: Your tool (or our proxy) retries the request.
  5. Re-Route: The Router sees A is Open. It skips A. It checks Provider B. B is healthy.
  6. Switch: The FailoverSwitchManager updates the UI: "Switched to Provider B".
  7. Success: The request is sent to Provider B and succeeds. You keep coding.
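
The steps above can be simulated in a few lines. This is a sketch under stated assumptions: `send_with_failover` is a hypothetical helper, the `send` callback stands in for the real HTTP call, a plain failure counter replaces the circuit breaker, and the retry happens inside the same call (the "our proxy retries" variant).

```rust
use std::collections::HashMap;

/// Tries providers in queue order, skipping tripped ones and recording each
/// outcome. Returns the id of the provider that served the request, if any.
pub fn send_with_failover<F>(
    queue: &[&str],
    failures: &mut HashMap<String, u32>,
    threshold: u32,
    mut send: F,
) -> Option<String>
where
    F: FnMut(&str) -> Result<(), String>,
{
    for &provider_id in queue {
        // Re-Route: skip providers whose circuit is Open.
        let count = failures.get(provider_id).copied().unwrap_or(0);
        if count >= threshold {
            continue;
        }

        match send(provider_id) {
            Ok(()) => {
                // Success: reset the counter and stop here.
                failures.insert(provider_id.to_string(), 0);
                return Some(provider_id.to_string());
            }
            Err(_) => {
                // Failure: record it, then fall through to the next provider.
                *failures.entry(provider_id.to_string()).or_insert(0) += 1;
            }
        }
    }
    None
}
```

With a threshold of 1, a first request that fails on Provider A is served by Provider B, and the next request skips A entirely because its counter has already reached the threshold.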

Summary

In this chapter, we made our proxy robust:

  1. We implemented Circuit Breakers to detect broken APIs.
  2. We created a Failover Queue to order our backups.
  3. We built a Router to intelligently pick the best available provider.

Now we know where to send the request. But there is a catch: Claude Code expects to talk to Anthropic, but what if we route it to OpenAI? They speak different languages (JSON formats)!

In the next chapter, we will solve this by building the translation layer.

Next Chapter: Provider Adaptation Layer


Generated by Code IQ