Chapter 7 · CORE

LLM-as-a-Judge


Chapter 7: LLM-as-a-Judge

In the previous chapter, Multi-Agent Architecture, we built a company of agents. We had a Supervisor delegating tasks to a Flight Booker and a Hotel Manager.

But we have a lingering problem. Just because an agent does a task doesn't mean it did a good job.

The Problem: The "Vibe Check"

In traditional programming, testing is easy: assert add(2, 2) == 4 either passes or fails, the same way every time.

In AI Engineering, outputs are "fuzzy." Suppose an agent answers an angry customer with: "Your package is in transit. Wait longer."

If you run a traditional unit test on that email, it passes (it is valid English text). But as a product, it is a disaster. It is rude.
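The gap can be made concrete (a toy sketch; is_polite is a hypothetical function, named here precisely because nobody knows how to write it):

```python
# Traditional code: exact answers, exact asserts.
assert sorted([3, 1, 2]) == [1, 2, 3]

# LLM output: the strongest honest assert is a shape check.
reply = "Your package is in transit. Wait longer."
assert isinstance(reply, str) and len(reply) > 0

# What we actually care about has no assert:
# assert is_polite(reply)   # <- no such function exists
```

The shape check passes for the rude reply just as easily as for a good one. We need a different kind of test.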

How do we write a test for "politeness"? How do we test for "creativity" or "helpfulness"?

The Solution: The Professor Agent

We solve this using LLM-as-a-Judge.

We hire a specific agent whose only job is to grade the work of other agents.

If the grade is low, the Professor sends it back with feedback. This creates a Self-Correcting Loop.

Use Case: The Customer Service Reviewer

We will build a system where a "Service Agent" drafts replies, but they are not sent to the user until a "Judge Agent" rates them 4/5 or higher on "Empathy."

Key Concepts

1. The Rubric

A human teacher doesn't just say "Bad essay." They grade against a rubric: named criteria (e.g., Empathy, Clarity, Accuracy), each with a defined scale and a description of what each score means.

We must give our Judge Agent an equally strict rubric. "Rate this 1-5" invites arbitrary numbers; "A 5 means the reply acknowledges the customer's frustration and apologizes" does not.

2. Chain of Thought (Justification)

A grade alone isn't enough. We need the Judge to explain why. This prevents the Judge from being random and provides useful feedback to the Student agent.

Implementation: The Grading Loop

Let's build this mechanism. We need to define the Judge's instructions clearly.

Step 1: Defining the Judge's Persona

We create a System Prompt specifically for evaluation. We ask for structured JSON output so our code can read the score.

judge_prompt = """
You are a Quality Assurance expert.
Evaluate the following draft email on 'Empathy'.
Scale: 1-5 (5 is best).

Write your reasoning first, then the score.

Output format: JSON
{
  "reason": "short explanation",
  "score": number
}
"""

Step 2: The Student Drafts

The Service Agent tries to answer a complaint.

# The user is angry about a delay
complaint = "Where is my stuff? I ordered it weeks ago!"

# The Naive Student drafts a reply
student_draft = "It is in transit. Wait longer."

Step 3: The Judge Evaluates

Now, we combine the draft and the rubric and send it to the Judge LLM.

import json

def evaluate_draft(draft):
    """Sends the draft to the Judge Agent and parses its JSON verdict."""
    input_text = f"Draft: {draft}\n\n{judge_prompt}"

    # In reality, this calls the OpenAI/Anthropic API
    response = llm.generate(input_text)

    # json.loads raises JSONDecodeError if the model wraps the JSON
    # in extra prose -- production code should handle that case.
    return json.loads(response)

Explanation: We treat the Student's output as the Judge's input.

Step 4: The Decision Logic

Now our Python code acts as the gatekeeper.

evaluation = evaluate_draft(student_draft)

if evaluation['score'] >= 4:
    send_email_to_user(student_draft)
else:
    print(f"Rejected! Score: {evaluation['score']}")
    print(f"Feedback: {evaluation['reason']}")
    # Trigger a rewrite...
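Here is how that rewrite trigger closes the loop, sketched end to end. The StubLLM and the canned evaluate_draft verdicts below are stand-ins so the loop is runnable on its own; in a real system these are the Student and Judge API calls. The max_attempts cap is an assumption, added so a harsh Judge cannot loop forever.

```python
class StubLLM:
    """Stand-in for a real model client -- replace with your API calls."""
    def __init__(self):
        self.calls = 0

    def generate(self, prompt):
        self.calls += 1
        return f"draft #{self.calls}"

llm = StubLLM()

def evaluate_draft(draft):
    # Canned Judge: rejects the first draft, approves the second.
    if draft == "draft #1":
        return {"score": 2, "reason": "Too blunt; apologize first."}
    return {"score": 5, "reason": "Empathetic and clear."}

def draft_until_approved(complaint, max_attempts=3):
    feedback = None
    for _ in range(max_attempts):
        # Student drafts, incorporating the Judge's feedback if any
        prompt = f"Reply to this complaint: {complaint}"
        if feedback:
            prompt += f"\nYour last draft was rejected. Feedback: {feedback}"
        draft = llm.generate(prompt)

        # Judge grades the draft; only high scores escape the loop
        evaluation = evaluate_draft(draft)
        if evaluation["score"] >= 4:
            return draft

        feedback = evaluation["reason"]  # feed criticism back to the Student

    raise RuntimeError("Draft never passed review; escalate to a human.")

approved = draft_until_approved("Where is my stuff?")
```

Note that the Judge's "reason" field is not just logging: it becomes part of the Student's next prompt, which is what makes the loop self-correcting.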

Internal Implementation: Under the Hood

How does this look in a full system? It works like a loop. The draft might bounce back and forth between the Student and Judge several times before it is "good enough."

sequenceDiagram
    participant U as User
    participant S as Student Agent
    participant J as Judge Agent
    U->>S: "Write an apology."
    S->>J: Draft: "My bad."
    Note over J: EVALUATING...
    J->>J: Score: 2/5 (Too casual)
    J->>S: REJECTED. Feedback: "Be more formal."
    Note over S: REWRITING...
    S->>J: Draft: "We sincerely apologize..."
    Note over J: EVALUATING...
    J->>J: Score: 5/5 (Perfect)
    J->>U: "We sincerely apologize..."

Advanced Skill: Pairwise Comparison

Sometimes, giving a score (1-5) is hard. Is this essay a 3 or a 4? It is often easier to compare two options. This is called Pairwise Comparison.

  1. Student generates Draft A.
  2. Student generates Draft B.
  3. Judge asks: "Which is better, A or B?"

This is used heavily in the llm-as-judge-skills examples.

def compare_drafts(draft_a, draft_b):
    prompt = f"""
    Which response is more polite?
    A: {draft_a}
    B: {draft_b}
    Reply with exactly one character: 'A' or 'B'.
    """
    # Strip whitespace so "A\n" still matches 'A'
    return llm.generate(prompt).strip()

Mitigation: Position Bias

LLMs have a weird quirk: they often prefer the first option they read. To fix this, professional systems (like the one in examples/llm-as-judge-skills) swap the order and check twice.

  1. Compare A vs B.
  2. Compare B vs A.
  3. Only accept the result if the Judge picks the same winner both times.
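Those three steps can be sketched as follows. The compare_drafts stub below deliberately simulates a position-biased Judge that always prefers whichever option it reads first, and compare_without_position_bias is a hypothetical name, not an API from a specific library:

```python
def compare_drafts(draft_a, draft_b):
    # Stub Judge with position bias: it always picks the first option.
    return "A"

def compare_without_position_bias(draft_a, draft_b):
    first = compare_drafts(draft_a, draft_b)   # Judge sees draft_a first
    second = compare_drafts(draft_b, draft_a)  # Judge sees draft_b first

    # In the swapped round, 'A' actually refers to draft_b,
    # so translate the verdict back to the original labels.
    second_unswapped = "B" if second == "A" else "A"

    if first == second_unswapped:
        return first   # same winner both ways -- trust it
    return "TIE"       # Judge flip-flopped -> position bias detected

verdict = compare_without_position_bias("Thanks for your patience!", "No.")
```

With the biased stub, the two rounds disagree and the function correctly reports "TIE" instead of trusting a verdict that depended only on ordering.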

Why This Matters

By adding a Judge, we move from "hope the output is good" to "measure and enforce quality before it ships." The system is still probabilistic underneath, but bad outputs are now caught instead of sent.

This allows you to use cheaper, faster models for the drafting (the Student) and a smart, expensive model for the checking (the Judge).

Conclusion to the Series

Congratulations! You have completed Agent Skills for Context Engineering.

Let's look back at your journey:

  1. Context Engineering: You learned to clean the workbench.
  2. Progressive Disclosure: You organized instructions into Skills.
  3. Tool Design: You built strict interfaces for action.
  4. Interleaved Thinking: You taught the agent to plan before acting.
  5. Structured Memory: You cured the agent's amnesia.
  6. Multi-Agent Architecture: You built a team of specialists.
  7. LLM-as-a-Judge: You established quality control.

You now possess the complete toolkit to build robust, production-grade AI agents. Go build something amazing!


Generated by Code IQ