In the previous chapter, Multi-Agent Architecture, we built a company of agents. We had a Supervisor delegating tasks to a Flight Booker and a Hotel Manager.
But we have a lingering problem. Just because an agent does a task doesn't mean it did a good job.
In traditional programming, testing is easy:

```python
result = 2 + 2
assert result == 4
```

In AI Engineering, outputs are "fuzzy." Imagine an agent drafting a customer-service email that is technically accurate but curt. If you run a traditional unit test on that email, it passes (it is valid English text). But as a product, it is a disaster. It is rude.
How do we write a test for "politeness"? How do we test for "creativity" or "helpfulness"?
We solve this using LLM-as-a-Judge.
We hire a specific agent whose only job is to grade the work of other agents.
If the grade is low, the Judge sends it back with feedback. This creates a Self-Correcting Loop.
We will build a system where a "Service Agent" drafts replies, but they are not sent to the user until a "Judge Agent" rates them 4/5 or higher on "Empathy."
A human teacher doesn't just say "Bad essay." They grade against a rubric with explicit criteria. We must give our Judge Agent an equally strict rubric.
A grade alone isn't enough. We need the Judge to explain why. This prevents the Judge from being random and provides useful feedback to the Student agent.
Let's build this mechanism. We need to define the Judge's instructions clearly.
We create a System Prompt specifically for evaluation. We ask for structured JSON output so our code can read the score.
judge_prompt = """
You are a Quality Assurance expert.
Evaluate the following draft email on 'Empathy'.
Scale: 1-5 (5 is best).
Output format: JSON
{
"score": number,
"reason": "short explanation"
}
"""
The Service Agent tries to answer a complaint.
```python
# The user is angry about a delay
complaint = "Where is my stuff? I ordered it weeks ago!"

# The Naive Student drafts a reply
student_draft = "It is in transit. Wait longer."
```
Now, we combine the draft and the rubric and send it to the Judge LLM.
```python
import json

def evaluate_draft(draft):
    """Sends the draft to the Judge Agent."""
    input_text = f"Draft: {draft}\n\n{judge_prompt}"
    # In reality, this calls the OpenAI/Anthropic API
    response = llm.generate(input_text)
    return json.loads(response)
```
Explanation: We treat the Student's output as the Judge's input.
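One practical wrinkle: `json.loads` fails if the Judge wraps its JSON in prose or a markdown fence, which models often do. A defensive parser helps; this is a sketch, not part of the original function, and `parse_judge_json` is a name we introduce here:

```python
import json
import re

def parse_judge_json(raw):
    """Models sometimes wrap JSON in prose or markdown fences.
    Extract the outermost {...} block before parsing."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("Judge returned no JSON object")
    return json.loads(match.group(0))
```

You would then call `parse_judge_json(response)` instead of `json.loads(response)` inside `evaluate_draft`.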
Now our Python code acts as the gatekeeper.
```python
evaluation = evaluate_draft(student_draft)

if evaluation['score'] >= 4:
    send_email_to_user(student_draft)
else:
    print(f"Rejected! Score: {evaluation['score']}")
    print(f"Feedback: {evaluation['reason']}")
    # Trigger a rewrite...
How does this look in a full system? It works like a loop. The draft might bounce back and forth between the Student and Judge several times before it is "good enough."
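The loop can be sketched in plain Python. Everything below is a stub: `judge` and `rewrite` stand in for real LLM calls, and the toy rubric simply checks for an apology, so treat this as a shape, not an implementation.

```python
MAX_ATTEMPTS = 3  # cap retries so a stubborn draft cannot loop forever

def judge(draft):
    """Stubbed Judge: a real system would call an LLM with the rubric.
    This toy rubric scores empathy by checking for an apology."""
    score = 5 if "sorry" in draft.lower() else 2
    reason = "apology present" if score >= 4 else "no apology"
    return {"score": score, "reason": reason}

def rewrite(draft, feedback):
    """Stubbed Student rewrite: a real system would re-prompt the LLM
    with the Judge's feedback. Here we just prepend an apology."""
    return "We are so sorry for the delay. " + draft

def self_correcting_loop(draft):
    for attempt in range(MAX_ATTEMPTS):
        evaluation = judge(draft)
        if evaluation["score"] >= 4:
            return draft  # good enough to send
        draft = rewrite(draft, evaluation["reason"])
    return draft  # out of attempts; in production, escalate to a human
```

Note the attempt cap: without it, a Student that never satisfies the Judge would loop forever and burn tokens.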
Sometimes, giving a score (1-5) is hard. Is this essay a 3 or a 4? It is often easier to compare two options. This is called Pairwise Comparison.
This is used heavily in the llm-as-judge-skills examples.
```python
def compare_drafts(draft_a, draft_b):
    prompt = f"""
Which response is more polite?
A: {draft_a}
B: {draft_b}
Reply with 'A' or 'B'.
"""
    return llm.generate(prompt)
```
LLMs have a well-documented quirk called position bias: they often prefer the first option they read, regardless of quality.
To fix this, professional systems (like the one in examples/llm-as-judge-skills) swap the order and check twice.
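Here is one way that swap-and-check could look. This is a sketch: `llm_generate` is a placeholder for your provider's completion call, passed in as a function so the logic stays testable.

```python
def compare_with_swap(llm_generate, draft_a, draft_b):
    """Ask the Judge twice, swapping the order of the options.
    Only trust a verdict that survives the swap."""
    first = llm_generate(
        f"Which response is more polite?\nA: {draft_a}\nB: {draft_b}\nReply with 'A' or 'B'."
    ).strip()
    second = llm_generate(
        f"Which response is more polite?\nA: {draft_b}\nB: {draft_a}\nReply with 'A' or 'B'."
    ).strip()
    # Map the second verdict back to the original labels
    second_mapped = "A" if second == "B" else "B"
    if first == second_mapped:
        return first
    return "TIE"  # inconsistent verdicts: treat as a tie, or re-run
```

A Judge that always answers "A" will contradict itself after the swap and land on "TIE", which is exactly the failure this guard is meant to catch.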
By adding a Judge, we move from hoping a probabilistic output is good to verifying it before it ships.
This allows you to use cheaper, faster models for the drafting (the Student) and a smart, expensive model for the checking (the Judge).
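A minimal sketch of that split might look like the following. The model names are placeholders, not real API identifiers, and `generate` stands in for your provider's completion call:

```python
# Route each role to a different model: a cheap one drafts, a strong one grades.
MODELS = {
    "student": "cheap-fast-model",    # placeholder name
    "judge": "strong-careful-model",  # placeholder name
}

def route(role, prompt, generate):
    """Dispatch a prompt to the model assigned to a role.
    `generate(model, prompt)` is your provider's completion call."""
    return generate(MODELS[role], prompt)
```

The economics work because the Student may run many times per request, while the Judge runs once per draft.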
Congratulations! You have completed Agent Skills for Context Engineering.
Let's look back at your journey.
You now possess the complete toolkit to build robust, production-grade AI agents. Go build something amazing!