Chapter 4: Interleaved Thinking

In the previous chapter, Tool Design (The Contract), we built a dashboard of buttons (Tools) for our agent. We taught it how to check the weather or search a database.

However, just because an agent can push a button doesn't mean it should.

The Problem: The Trigger-Happy Agent

Imagine you hire an assistant and say, "Book me a flight to Paris." A careless assistant books the first flight it finds without asking about dates or budget.

Most basic LLM implementations are reactive in the same way. They see a prompt and immediately rush to generate the final answer or call a tool. This can lead to hallucinations, money wasted on API calls, and silly mistakes.

We need to force the agent to "show its work" before it acts. We call this Interleaved Thinking.

What is Interleaved Thinking?

Interleaved Thinking is the process in which the model explicitly outputs a "Thought" block before it outputs an "Action" or "Answer."

Think of a mathematician solving a hard problem. They don't just stare at the page and write "42." They scribble in the margins:

  1. "First, I need to isolate X..."
  2. "Now I divide by Y..."
  3. "Wait, Y is zero, I can't do that. I need a different approach."

By writing this down, the mathematician (and the Agent) can catch their own errors before they commit to an answer.

The Loop: Think, Act, Observe

Instead of a straight line, our agent now moves in a loop.

sequenceDiagram
    participant U as User
    participant A as Agent (Brain)
    participant T as Tool (Action)
    U->>A: "Find a restaurant."
    loop Interleaved Thinking
        Note over A: THINK: "I don't know the user's location."
        Note over A: THINK: "I should use the location_tool first."
        A->>T: Calls location_tool()
        T->>A: Returns "New York, NY"
        Note over A: THINK: "Okay, I have the city."
        Note over A: THINK: "Now I search for restaurants in NY."
        A->>T: Calls restaurant_search("NY")
    end
    A->>U: "Here is a pizza place in NY."

How to Implement Interleaved Thinking

Some recent models (such as MiniMax M2.1 and other reasoning-focused models) support this natively. They return the Thinking (internal monologue) separately from the Content (external speech).

Step 1: Parsing the Thought

When the model responds, it doesn't just give us text. It gives us a structure.

# Conceptual response structure
response = {
    "thinking": "The user wants weather. I need to convert 'The Big Apple' to 'New York'.",
    "tool_call": "get_weather(city='New York')",
    "content": "" # Empty, because we are calling a tool
}

We need to display the "Thinking" to the developer (for debugging) but hide it from the final user (to keep the interface clean).
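One way to keep the two audiences apart is to route the thinking to a debug log and return only the content. The sketch below assumes the response is a plain dict shaped like the conceptual structure above; `render_for_user` and the logger name are hypothetical, not part of any particular SDK.

```python
import logging

# Developers can enable this logger at DEBUG level to see the agent's thoughts.
logger = logging.getLogger("agent.thinking")


def render_for_user(response: dict) -> str:
    """Log the thinking for developers and return only the user-facing content."""
    thinking = response.get("thinking")
    if thinking:
        logger.debug("THOUGHT: %s", thinking)
    return response.get("content", "")


# Hypothetical final response, using the same keys as the conceptual structure.
response = {
    "thinking": "The user wants weather. I need to convert 'The Big Apple' to 'New York'.",
    "tool_call": None,
    "content": "It is sunny in New York.",
}
visible = render_for_user(response)
```

Here `visible` holds only the content string. The thought reaches the log only when the developer has turned on debug logging.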

Step 2: Saving the Thought (The Critical Step)

This is where most beginners fail.

If the agent thinks "I need to check the database," and you execute the tool, you must feed that thought back into the agent's memory for the next turn.

If you don't save the thought, the agent forgets why it called the tool. It wakes up with a database result in its hand and no idea what it was looking for.

# WRONG: We only save the tool result
messages.append({"role": "tool", "content": "Database Result: 50 users"})

# RIGHT: We save the thought AND the result
messages.append({
    "role": "assistant", 
    "thinking": "I will query the database for active users.", # <--- CRITICAL
    "tool_call": "query_db()"
})
messages.append({"role": "tool", "content": "Database Result: 50 users"})
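To make it harder to forget the thought, the two appends can be wrapped in a single helper. This is a minimal sketch that reuses the simplified message shape from the snippet above. `record_tool_turn` is a hypothetical name, and real provider APIs use their own field names for thinking and tool calls.

```python
def record_tool_turn(messages: list, thinking: str, tool_call: str, result: str) -> list:
    """Append the assistant turn (thought + tool call), then the tool result."""
    messages.append({
        "role": "assistant",
        "thinking": thinking,    # the reason the tool was called
        "tool_call": tool_call,
    })
    messages.append({"role": "tool", "content": result})
    return messages


history = [{"role": "user", "content": "How many active users do we have?"}]
record_tool_turn(
    history,
    thinking="I will query the database for active users.",
    tool_call="query_db()",
    result="Database Result: 50 users",
)
```

After the call, `history` holds the user message, the assistant turn with its thought, and the tool result, in that order.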

Internal Implementation: The Reasoning Loop

Let's look at how we code this loop. We are essentially building a REPL (Read-Eval-Print Loop) for the agent.

The Code Structure

We check if the agent wants to perform an action. If yes, we run it, append the result, and ask the agent again.

def run_agent_loop(user_query):
    messages = [{"role": "user", "content": user_query}]

    while True:
        # 1. Ask the LLM
        response = llm.generate(messages)

        # 2. Print the "Thinking" (Show your work)
        if response.thinking:
            print(f"🧠 THOUGHT: {response.thinking}")

        # 3. If it wants to use a tool, do it
        if response.tool_call:
            print(f"⚙️ ACTION: {response.tool_call.name}")
            result = execute_tool(response.tool_call)

            # 4. CRITICAL: Add the Thought + Tool Call + Result to history
            messages.append(response.full_message_object)
            messages.append({"role": "tool", "content": result})
        else:
            # No tool call? We are done.
            return response.content

Explanation:

  1. The while True creates the loop.
  2. We explicitly print response.thinking so we can see the agent planning.
  3. If the agent calls a tool, we don't return yet. We add the result to history and loop back to the top.
  4. The agent reads its own past thoughts, reads the tool result, and decides what to do next.
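The loop can be exercised without a real model. The sketch below swaps in a scripted stand-in that replays canned responses, and adds a step cap so a model that never stops calling tools cannot loop forever. `ScriptedLLM`, `Response`, `ToolCall`, `as_message`, and `max_steps` are all assumptions made for illustration; a real client library will expose different names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCall:
    name: str
    argument: str


@dataclass
class Response:
    thinking: str
    content: str = ""
    tool_call: Optional[ToolCall] = None

    def as_message(self) -> dict:
        """Full assistant message, including the thought, for the history."""
        return {"role": "assistant", "thinking": self.thinking,
                "tool_call": self.tool_call, "content": self.content}


class ScriptedLLM:
    """Stand-in for a real model: replays canned responses in order."""

    def __init__(self, script):
        self._script = iter(script)

    def generate(self, messages):
        return next(self._script)


# Hypothetical tool registry with a single fake tool.
TOOLS = {"location_tool": lambda _arg: "New York, NY"}


def execute_tool(call: ToolCall) -> str:
    return TOOLS[call.name](call.argument)


def run_agent_loop(llm, user_query: str, max_steps: int = 5):
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        response = llm.generate(messages)
        if response.tool_call is None:
            return response.content, messages  # final answer
        result = execute_tool(response.tool_call)
        messages.append(response.as_message())  # thought + tool call
        messages.append({"role": "tool", "content": result})
    raise RuntimeError(f"No final answer after {max_steps} steps")


llm = ScriptedLLM([
    Response(thinking="I don't know the user's location.",
             tool_call=ToolCall("location_tool", "")),
    Response(thinking="Okay, I have the city.",
             content="Here is a pizza place in NY."),
])
answer, history = run_agent_loop(llm, "Find a restaurant.")
```

The step cap is a design choice rather than a requirement: it trades a hard failure for protection against runaway tool calls and the API costs that come with them.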

Example Scenario: Debugging a Failure

Let's see why this is powerful. Suppose the user asks who the CEO of Apple is and how old he is.

  1. Thought: "I need to find the CEO."
  2. Action: search("current CEO of Apple")
  3. Result: "Tim Cook."
  4. Thought: "Now I need Tim Cook's birth date to calculate age."
  5. Action: search("Tim Cook birth date")
  6. Result: "November 1, 1960."
  7. Thought: "Now I calculate 2024 - 1960."
  8. Answer: "Tim Cook is the CEO and he is 64 years old."

By interleaving Thought -> Action -> Thought -> Action, the agent can chain simple steps into complex reasoning.
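The subtraction in step 7 is only exact once the birthday has passed in the current year. If the agent hands the arithmetic to a tool, a date-aware version is safer. The following is a small sketch using Python's standard `datetime` module; `age_on` is a hypothetical helper, not something defined earlier in this tutorial.

```python
from datetime import date


def age_on(birth: date, today: date) -> int:
    """Return the number of completed years between birth and today."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


birth = date(1960, 11, 1)  # birth date from the search result in step 6

age_late_2024 = age_on(birth, date(2024, 12, 1))   # birthday already passed: 64
age_early_2024 = age_on(birth, date(2024, 6, 1))   # birthday still ahead: 63
```

A visible thought such as "Now I calculate 2024 - 1960" makes this kind of off-by-one easy to spot in the trace.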

Summary

In this chapter, you learned:

  1. The Problem with Reacting: Agents that don't think make mistakes.
  2. Interleaved Thinking: The process of outputting a "Thought" block before an "Action."
  3. The Loop: We must execute tools and feed the result plus the original thought back into the context.

We have a problem though. As our agent "thinks" and loops, the chat history grows very fast. Thoughts, tool calls, JSON results, error messages... our Context Window is filling up with implementation details.

If the agent works long enough, early details such as the user's name can get pushed out of the context window. We need a better way to store information than just a linear list of messages.

Next Chapter: Structured Memory Systems


Generated by Code IQ