In the previous chapter, Communication Gateway, we connected Dexter to the outside world, allowing users to chat via WhatsApp.
We now have a fully functional agent. But there is a scary question we haven't answered yet: Is Dexter actually smart?
If you change a line of code in the Financial Data Layer, how do you know you didn't accidentally break the search tool? You can't manually chat with the bot for 5 hours every time you make a change.
This chapter introduces Evaluation (Evals). Think of this as the "Quality Assurance" department or the "Final Exam" that Dexter must pass before going to work.
Imagine you are a teacher. You want to know if your student (Dexter) understands finance.
In software engineering, we call this Benchmarking. We want to run a script that automatically asks 50 hard financial questions and gives us a score: "Dexter is 92% accurate."
To build this system, we need three distinct parts working together:

1. The Dataset: a list of questions and the Ground Truth (the correct answer) for each.
2. The Target: this is our Agent. It takes the question and produces an answer.
3. The Judge: here is the tricky part. Suppose the Ground Truth is "Elon Musk" and Dexter answers "The CEO of Tesla is Elon Musk." If we use standard code to compare these strings (stringA === stringB), the computer will say FALSE. They look different. But a human knows they are the same. Solution: we use a Judge LLM. We ask another AI (like GPT-4) to compare the two answers and decide if they match.
Let's look at the flow of a single evaluation.
dataset.csv contains the question "Who is the CEO of Tesla?" with the Ground Truth "Elon Musk". The Target runs Dexter on the question, the Judge compares Dexter's answer against the Ground Truth, and we record the result. Score: 1 (Correct).
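Each parsed dataset row can be modeled with a simple type. This is a sketch of the shape the runner in this chapter assumes; the field names `inputs` and `outputs` mirror how the evaluation loop accesses each example:

```typescript
// Assumed shape of one row after parsing dataset.csv.
interface EvalExample {
  inputs: { question: string };
  outputs: { answer: string }; // the Ground Truth
}

// The Tesla example above, as a parsed row.
const example: EvalExample = {
  inputs: { question: 'Who is the CEO of Tesla?' },
  outputs: { answer: 'Elon Musk' },
};
```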
We implement this in src/evals/run.ts. It acts as the coordinator.
First, we need a wrapper function that runs Dexter. The evaluation script doesn't care about tools or gateways; it just wants a string back.
// src/evals/run.ts
async function target(inputs: { question: string }) {
  // 1. Create a fresh instance of Dexter
  const agent = Agent.create({ model: 'gpt-4', maxIterations: 5 });

  let answer = '';

  // 2. Run the agent loop (from Chapter 2)
  for await (const event of agent.run(inputs.question)) {
    if (event.type === 'done') {
      answer = event.answer;
    }
  }

  // 3. Return just the text
  return { answer };
}
Explanation: This creates a clean, isolated version of Dexter for every single question on the test.
This is the most interesting part. We write a prompt for a "Teacher AI" to grade the "Student AI."
// src/evals/run.ts
async function correctnessEvaluator(actual: string, expected: string) {
  // The prompt is built inside the function so it can interpolate the
  // answers being compared.
  const prompt = `
You are evaluating the correctness of an AI assistant.
Expected Answer: ${expected}
Actual Answer: ${actual}
Evaluate and provide:
- score: 1 if the answer is correct (contains key facts), 0 if incorrect.
`;

  // We use a structured LLM call to force it to return a number
  const result = await structuredLlm.invoke(prompt);
  return { score: result.score };
}
Explanation: We tell the Judge: "The answer is correct if it conveys the same key information." This allows Dexter to phrase things differently but still pass the test.
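One practical design choice: if the judge is written against a small interface instead of a concrete model, you can unit-test the grading plumbing without paying for API calls. This is a sketch, not the chapter's actual implementation; the `StructuredLlm` interface, the `gradeAnswer` name, and the keyword-based fake judge are all assumptions for illustration:

```typescript
// Anything that can take a grading prompt and return a structured score.
interface StructuredLlm {
  invoke(prompt: string): Promise<{ score: 0 | 1 }>;
}

async function gradeAnswer(
  actual: string,
  expected: string,
  llm: StructuredLlm
): Promise<{ score: 0 | 1 }> {
  const prompt = `Expected Answer: ${expected}\nActual Answer: ${actual}`;
  const result = await llm.invoke(prompt);
  return { score: result.score };
}

// A fake judge for tests: "correct" if the expected key fact appears verbatim.
// A real Judge LLM is far more lenient about phrasing.
const fakeJudge: StructuredLlm = {
  async invoke(prompt) {
    const [expectedLine, actualLine] = prompt.split('\n');
    const expected = expectedLine.replace('Expected Answer: ', '').toLowerCase();
    const actual = actualLine.replace('Actual Answer: ', '').toLowerCase();
    return { score: actual.includes(expected) ? 1 : 0 };
  },
};
```

In production you would pass the real structured LLM; in tests, the fake. The grading code itself never changes.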
Finally, we loop through our dataset (CSV file) and run the test.
// src/evals/run.ts
import fs from 'node:fs';

async function* runEvaluation() {
  // 1. Load the "Test Paper" (read as text, not a Buffer)
  const examples = parseCSV(fs.readFileSync('dataset.csv', 'utf-8'));

  for (const example of examples) {
    // 2. Student takes the test
    const outputs = await target(example.inputs);

    // 3. Teacher grades the test
    const evalResult = await correctnessEvaluator(
      outputs.answer,          // What Dexter said
      example.outputs.answer   // What the CSV said
    );

    // 4. Report back to the UI
    yield {
      question: example.inputs.question,
      score: evalResult.score,
    };
  }
}
Explanation: This loop runs automatically. You can start it, go grab a coffee, and come back to see if Dexter passed or failed.
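Once the loop finishes, the per-question scores roll up into the single accuracy number mentioned at the start of the chapter. A minimal sketch of that aggregation (the result objects mirror what `runEvaluation` yields; the demo data is made up):

```typescript
interface EvalResult {
  question: string;
  score: 0 | 1;
}

// Percentage of questions answered correctly, rounded to a whole number.
function accuracy(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  const correct = results.filter((r) => r.score === 1).length;
  return Math.round((correct / results.length) * 100);
}

// e.g. 46 correct out of 50 questions -> "Dexter is 92% accurate"
const demo: EvalResult[] = Array.from({ length: 50 }, (_, i) => ({
  question: `q${i}`,
  score: i < 46 ? 1 : 0,
}));
```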
In the code, you will notice references to client.createRun. We use a tool called LangSmith to visualize these results.
Instead of staring at a terminal, LangSmith gives us a dashboard.
If you see a lot of Red Rows regarding "Balance Sheets," you know exactly which Skill (Chapter 3) or Tool (Chapter 5) needs fixing!
In this chapter, we built the Safety Net: a dataset of questions with Ground Truth answers, a Target wrapper around Dexter, a Judge LLM that grades answers by meaning, and a runner that turns it all into a score.
Congratulations! You have completed the Dexter tutorial.
Let's look back at what you have built:
You have moved beyond simple "chatbots" and built a Cognitive Architecture. Dexter isn't just predicting the next word; it is reasoning, using tools, following procedures, and correcting its own mistakes.
You are now ready to extend Dexter with new skills, connect it to new platforms, or use this architecture to build completely different types of agents. Good luck!
Generated by Code IQ