In Chapter 4: Model Context Protocol (MCP) Server, we turned our search engine into a tool that AI agents like Claude can use. But this raises a critical question: How do we know if the AI is doing a good job?
If you search for "login issues," and the AI gives you a document about "lunch menus," the system has failed. But computers don't have "common sense" to know that lunch menus aren't relevant to login issues.
In this chapter, we will build the Reward & Evaluation Logic. Think of this component as the Strict Teacher. It uses math and rules to grade the AI's performance.
"Good search" is hard to define.
We solve this with two distinct tools: the Evaluation Harness, which grades the final search results, and the Reward Function, which grades the text the AI generates along the way.
The Evaluation Harness simulates a user asking questions and checks if the correct document appears in the results.
We define a list of "Golden Queries" where we know exactly which document should appear.
We categorize queries by difficulty.
// test/eval-harness.ts
const evalQueries = [
  {
    query: "API versioning",
    expectedDoc: "api-design", // The filename we expect
    difficulty: "easy", // Direct keyword match
  },
  {
    query: "what went wrong with the launch",
    expectedDoc: "product-launch",
    difficulty: "medium", // Conceptual query
  },
];
The harness runs the actual qmd command we built in Chapter 1 (Hybrid Search Orchestrator) and captures the output.
// test/eval-harness.ts
import { execSync } from "node:child_process";

function runSearch(query: string) {
  // We execute the CLI command via the operating system
  const output = execSync(
    `bun src/qmd.ts search "${query}" --json -n 5`,
    { encoding: "utf-8" }
  );
  // Parse the JSON result
  return JSON.parse(output);
}
We use a standard search metric called Hit@K: did the expected document appear within the top K results?
// test/eval-harness.ts
function evaluate() {
  for (const testCase of evalQueries) {
    const results = runSearch(testCase.query);
    // Find where the expected document ranked (-1 if absent)
    const rank = results.findIndex((r) =>
      r.file.includes(testCase.expectedDoc)
    );
    if (rank === 0) console.log("✅ Perfect match!");
    else if (rank > 0) console.log("⚠️ Found in top 5");
    else console.log("❌ Failed to find document");
  }
}
Why is this important? Before you change any code in the Orchestrator or the AI Service, you run this harness. If your changes cause "Hit@1" scores to drop, you know you broke something.
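The aggregate metric itself is simple to compute. Below is a minimal, language-agnostic sketch (shown in Python) of turning a list of ranks into Hit@1 and Hit@5 rates; the function name hit_at_k is our own illustration, not part of the project:

```python
# Minimal sketch: computing Hit@K from a list of ranks.
# rank 0 = expected doc came first, -1 = not found at all.

def hit_at_k(ranks: list[int], k: int) -> float:
    """Fraction of queries whose expected doc appeared in the top k."""
    if not ranks:
        return 0.0
    hits = sum(1 for r in ranks if 0 <= r < k)
    return hits / len(ranks)

ranks = [0, 2, -1, 0]          # ranks from four golden queries
print(hit_at_k(ranks, 1))      # Hit@1 -> 0.5
print(hit_at_k(ranks, 5))      # Hit@5 -> 0.75
```

Tracking these two numbers over time is what lets you say "this commit made search worse" with confidence.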
While the Harness checks the result, the Reward Function checks the process.
When qmd performs a "Deep Search," it uses an AI to generate synonyms and expansion terms (see Chapter 2: Local AI Service). We need to make sure the AI isn't hallucinating or being lazy.
We use a Python script (finetune/reward.py) to mathematically score the text generated by the AI. This is used during Reinforcement Learning (training the model).
The Reward Function analyzes the AI's output based on five strict rules:
It asks questions like: does the output use the lex: and vec: syntax? Does it merely echo the query? Does it preserve named entities from the query? Let's look at the Python code that enforces these rules.
One of the biggest problems with small AI models is that they just parrot back what you said. We write code to detect this.
# finetune/reward.py
def echoes_query(expansion: str, query: str) -> bool:
    exp = expansion.lower().strip()
    q = query.lower().strip()
    # 1. Exact match?
    if exp == q:
        return True
    # 2. Contained within?
    if q in exp and len(exp) < len(q) + 10:
        return True
    return False
Explanation: If the user asks "login", and the AI outputs "lex: login", that is useless. We want "lex: authentication" or "lex: sign in". This function catches those lazy answers.
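To see the check in action, here is a small self-contained sketch (the function is reproduced from above so the example runs on its own):

```python
def echoes_query(expansion: str, query: str) -> bool:
    exp = expansion.lower().strip()
    q = query.lower().strip()
    if exp == q:
        return True
    if q in exp and len(exp) < len(q) + 10:
        return True
    return False

# A lazy answer that just restates the query is flagged...
print(echoes_query("lex: login", "login"))           # True
# ...while a genuine expansion passes.
print(echoes_query("lex: authentication", "login"))  # False
```

The `len(q) + 10` slack means a short prefix like "lex: " does not disguise an echo, while a longer expansion that happens to contain the query still passes.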
If you search for "React Props", and the AI expands it to "JavaScript attributes", you might lose the context of the React library. We must protect proper nouns.
# finetune/reward.py
def extract_named_entities(query: str) -> set:
    entities = set()
    for word in query.split():
        # Check for Capitalized words (like "React")
        if word[0].isupper():
            entities.add(word.lower())
        # Check for technical terms (like "node.js")
        if "." in word or "-" in word:
            entities.add(word.lower())
    return entities
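The extracted entities can then be checked against the expansion. The helper below, preserves_entities, is our own illustrative name rather than code from finetune/reward.py; extract_named_entities is reproduced so the sketch is self-contained:

```python
def extract_named_entities(query: str) -> set:
    entities = set()
    for word in query.split():
        if word[0].isupper():           # Capitalized words (like "React")
            entities.add(word.lower())
        if "." in word or "-" in word:  # Technical terms (like "node.js")
            entities.add(word.lower())
    return entities

# Hypothetical helper: every entity from the query must survive in the expansion.
def preserves_entities(query: str, expansion: str) -> bool:
    exp = expansion.lower()
    return all(entity in exp for entity in extract_named_entities(query))

print(preserves_entities("debug React", "lex: react framework troubleshooting"))  # True
print(preserves_entities("debug React", "lex: javascript troubleshooting"))       # False
```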
We combine all these checks into a single function that returns a detailed report.
# finetune/reward.py
def score_expansion_detailed(query, expansion):
    score = 0
    deductions = []
    # 1. Check Format
    if "lex:" in expansion:
        score += 30
    else:
        deductions.append("Missing lex prefix")
    # 2. Check Echoing
    if echoes_query(expansion, query):
        score -= 20
        deductions.append("Echoed query")
    # ... other checks ...
    return {"total": score, "issues": deductions}
Output Example:
If the AI does a bad job, this function might return:
{ "total": 45, "issues": ["Echoed query", "Lex longer than vec"] }
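During training, a raw score like this is typically normalized into a bounded reward. Here is a minimal sketch, assuming totals land roughly in the 0 to 100 range; the function name to_reward and the clamping behavior are our assumptions, not project code:

```python
# Hypothetical sketch: converting the detailed report into a [0, 1] reward.
def to_reward(report: dict, max_score: float = 100.0) -> float:
    # Clamp so heavily-deducted (even negative) totals floor at 0 reward.
    return max(0.0, min(1.0, report["total"] / max_score))

print(to_reward({"total": 45, "issues": ["Echoed query", "Lex longer than vec"]}))  # 0.45
print(to_reward({"total": -10, "issues": ["Echoed query"]}))                        # 0.0
```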
This feedback is crucial. In the next chapter, we will use this exact feedback loop to teach the AI to get better over time.
In this chapter, we built the "Judge" of our system.
Now that we have a way to grade the AI, we can train it.
In the final chapter, we will connect this Reward Function to a training pipeline to make our Local AI smarter than the generic models.
Next Chapter: Fine-Tuning Pipeline
Generated by Code IQ