
Chapter 7: LLM Client Wrapper (The Translator)

Welcome to the final chapter of the memU tutorial!

In the previous chapter, Workflow Pipeline Engine (The Assembly Line), we built a factory that processes information step-by-step.

However, there is one problem. Our factory speaks Python, but the AI models (like GPT-4, Claude, or Llama) speak different "languages" (APIs).

If you hardcode "OpenAI" into your pipeline, you are trapped. If you want to switch to a cheaper local model later, you have to rewrite your entire app.

We need a universal adapter. We need the LLM Client Wrapper.

The Motivation: The Universal Travel Adapter

Imagine you are traveling around the world. Every country has a different electrical outlet, so you carry one universal adapter: your laptop's plug never changes, and the adapter handles each country's socket.

The LLM Client Wrapper is that adapter.

It sits between memU and the outside world. The rest of the system just says "Chat with this", and the Wrapper figures out the technical details of talking to OpenAI, Anthropic, or a local server.

Key Capabilities

The Wrapper doesn't just send text. It standardizes four distinct "senses" for the AI:

  1. Chat: Standard text conversation.
  2. Vision: Looking at images and describing them.
  3. Transcribe: Listening to audio files and turning them into text.
  4. Embed: Turning text into lists of numbers (vectors) for the database.
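These four capabilities can be pictured as a single abstract interface that every provider backend implements. Here is a minimal sketch of that idea; the class and method signatures are illustrative assumptions, not memU's actual code:

```python
from abc import ABC, abstractmethod


class BaseLLMClient(ABC):
    """Hypothetical unified interface: every provider backend
    (OpenAI, Anthropic, a local server) implements the same
    four 'senses', so callers never care which one is behind it."""

    @abstractmethod
    async def chat(self, prompt: str) -> str:
        """Standard text conversation."""

    @abstractmethod
    async def vision(self, prompt: str, image_path: str) -> str:
        """Describe an image file."""

    @abstractmethod
    async def transcribe(self, audio_path: str) -> str:
        """Turn an audio file into text."""

    @abstractmethod
    async def embed(self, text: str) -> list[float]:
        """Turn text into a vector for the database."""
```

Because every backend satisfies the same contract, the rest of memU can be written once against this interface and never rewritten when a provider changes.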

The Interceptor System (The "Middleman")

This is the most powerful feature of the Wrapper.

Imagine you want to know how much money you are spending on AI. You could modify every single function in your code to print the cost... or you could just place a "toll booth" inside the Wrapper.

This toll booth is called an Interceptor. It allows you to hook into the process Before or After the AI speaks.

```mermaid
sequenceDiagram
    participant App as memU Logic
    participant Pre as Pre-Interceptor
    participant AI as External AI
    participant Post as Post-Interceptor
    App->>Pre: "Hello!"
    Note right of Pre: Log: "User sent message"
    Pre->>AI: "Hello!"
    AI-->>Post: "Hi there!"
    Note right of Post: Log: "Cost: 0.002 cents"
    Post-->>App: "Hi there!"
    style Pre fill:#ffcc80
    style Post fill:#ffcc80
```
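A toy version of this toll booth makes the idea concrete. The sketch below (hypothetical names, not memU's actual API) wraps a single chat call with before/after hooks:

```python
class InterceptedClient:
    """Minimal sketch of the interceptor ('toll booth') pattern:
    registered hooks run before and after every AI call, without
    the calling code knowing they exist."""

    def __init__(self, call_fn):
        self._call_fn = call_fn  # the real provider call (async)
        self._before = []        # pre-interceptors
        self._after = []         # post-interceptors

    def intercept_before(self, hook):
        self._before.append(hook)

    def intercept_after(self, hook):
        self._after.append(hook)

    async def chat(self, prompt: str) -> str:
        # Toll booth #1: hooks see the request before it leaves
        for hook in self._before:
            await hook(prompt)
        response = await self._call_fn(prompt)
        # Toll booth #2: hooks see the response before the caller does
        for hook in self._after:
            await hook(prompt, response)
        return response
```

The business logic only ever calls `chat()`; logging, cost tracking, or redaction can all be bolted on from the outside.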

How to Use It

You typically access the wrapper through the MemoryService (The Central Brain).

1. Basic Interaction (The Polyglot)

You can use the client to perform different tasks without worrying about API headers or JSON formats.

```python
# 1. Access the client from the service
client = service.llm_client

# 2. Chat (Text)
response = await client.chat("Why is the sky blue?")

# 3. Vision (Images)
# The wrapper handles opening the file and encoding it
desc = await client.vision("Describe this", "photo.jpg")

# 4. Transcribe (Audio)
text = await client.transcribe("meeting.mp3")

# 5. Embed (Vectors) -- exact signature may vary
vector = await client.embed("The sky is blue")
```

2. Using Interceptors (Spying on the Line)

Let's say you want to print the token usage (cost) every time the AI answers. You don't need to change your business logic. You just register a hook.

```python
# Define a function to run AFTER the AI responds
async def log_cost(context, request, response, usage):
    print("--- Call Finished ---")
    print(f"Model: {context.model}")
    print(f"Tokens Used: {usage.total_tokens}")

# Register it with the service
service.intercept_after_llm_call(log_cost)
```

```python
# Now, whenever you use the service, it prints the cost!
await service.retrieve("Hello", "user_1")
```

Output:

```
--- Call Finished ---
Model: gpt-4o-mini
Tokens Used: 154
```

Internal Implementation: The Standardization Layer

How does the wrapper handle different providers returning different data formats? It converts everything into a Standard View.

The Request/Response View
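One way to picture such a standard view is as a pair of small dataclasses that the interceptors receive, regardless of which provider answered. The field names below are illustrative assumptions, not memU's actual types:

```python
from dataclasses import dataclass, field


@dataclass
class RequestView:
    """Hypothetical standardized request: the same shape no
    matter which provider the wrapper is about to call."""
    kind: str        # "chat", "vision", "transcribe", or "embed"
    model: str       # e.g. "gpt-4o-mini"
    messages: list = field(default_factory=list)


@dataclass
class UsageView:
    """Hypothetical standardized token accounting, normalized
    from each provider's own usage format."""
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens
```

With a normalized `UsageView`, the `log_cost` hook from the previous section works identically for every provider.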

The _invoke Method (The Traffic Controller)

Let's look at src/memu/llm/wrapper.py. The heart of the system is the _invoke method. Every call (Chat, Vision, Embed) goes through this function.

```python
async def _invoke(self, kind, call_fn, ...):
    # 1. Run "Before" interceptors
    await self._run_before(interceptors, context, request_view)

    try:
        # 2. Call the actual AI (OpenAI, local, etc.)
        result = await call_fn()

    except Exception:
        # 3. Handle errors (and run error interceptors)
        await self._run_on_error(...)
        raise

    # 4. Standardize the result (and calculate tokens)
    usage = self._extract_usage(result)

    # 5. Run "After" interceptors
    await self._run_after(interceptors, ..., usage)

    return result
```

This design ensures that no AI call ever happens unmonitored. You can always track, log, or modify data flowing in and out of your system.
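The control flow above can be reduced to a runnable miniature. This is an illustrative, heavily simplified sketch of the same pattern, not the real `_invoke`:

```python
async def invoke(call_fn, before=(), after=(), on_error=()):
    """Miniature of the _invoke pattern: hooks wrap a single
    provider call, and failures trigger their own interceptors
    before the exception propagates."""
    for hook in before:
        await hook()
    try:
        result = await call_fn()
    except Exception as exc:
        for hook in on_error:
            await hook(exc)
        raise  # the caller still sees the original error
    for hook in after:
        await hook(result)
    return result
```

Note that the error path re-raises after running its hooks: monitoring never swallows failures, it only observes them.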

Configuration (Profiles)

Remember the LLM Profiles from Chapter 2? This is where they are used.

In src/memu/app/service.py, the service decides which wrapper to build based on your config:

```python
def _init_llm_client(self, config):
    if config.client_backend == "sdk":
        # Use the official OpenAI Python library
        return OpenAISDKClient(...)

    elif config.client_backend == "httpx":
        # Use raw HTTP requests (good for local models)
        return HTTPLLMClient(...)
```

This Factory Pattern means memU is future-proof. If a new AI provider comes out tomorrow, we just add a new backend class, and your code remains unchanged.
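One common way to keep such a factory open to new providers is a small registry, so adding a backend never touches the factory itself. The sketch below uses stand-in stub classes for the two backends (illustrative only, not memU's implementation):

```python
_BACKENDS = {}

def register_backend(name):
    """Decorator: adding a new provider is one class plus one
    decorator, with no changes to any calling code."""
    def wrap(cls):
        _BACKENDS[name] = cls
        return cls
    return wrap

def build_client(backend, **kwargs):
    """Factory: look up the backend by its config name."""
    try:
        return _BACKENDS[backend](**kwargs)
    except KeyError:
        raise ValueError(f"unknown client_backend: {backend!r}") from None

# Stand-in stubs for the two real backend classes
@register_backend("sdk")
class OpenAISDKClient:
    def __init__(self, **kwargs):
        self.kind = "sdk"

@register_backend("httpx")
class HTTPLLMClient:
    def __init__(self, **kwargs):
        self.kind = "httpx"
```

A new provider tomorrow would be a third decorated class; `build_client` and everything above it stay untouched.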


Tutorial Conclusion

Congratulations! You have completed the memU architecture tutorial.

We have built a complete mental model of an advanced AI memory system:

  1. Data Model: We organize memories into Files (Items) and Folders (Categories).
  2. MemoryService: The central brain that manages the system.
  3. Memorize Pipeline: How the system digests raw files into facts.
  4. Retrieve Pipeline: How the system acts like a smart librarian to find answers.
  5. Storage Layer: Where the data lives (In-Memory or Database).
  6. Workflow Engine: How we snap logic blocks together.
  7. Client Wrapper: How we talk to the outside AI world.

You are now ready to build intelligent, memory-augmented applications that go far beyond simple chatbots. Good luck!


Generated by Code IQ