In the previous chapter, Context Augmentation Hooks, we built a "whisperer" that secretly feeds code context to the AI. We think this makes the AI smarter and faster.
But thinking isn't enough. In software engineering, we don't rely on "vibes"; we rely on data.
This chapter introduces the Evaluation Framework. Think of this as the "Exam Proctor." It forces the AI to sit down, take a standardized test (SWE-bench), and grades its performance with and without GitNexus.
Imagine a student (the AI) taking a difficult calculus exam (fixing a bug). In one scenario the student works entirely from memory (Baseline); in the other, they may consult a textbook (GitNexus). Our Evaluation Framework sets up these two scenarios and compares the scores. We use SWE-bench, a collection of real-world GitHub issues (bugs) from popular Python libraries such as Django and Flask.
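Each SWE-bench instance is essentially a record pairing a real GitHub issue with the commit it must be fixed against. A simplified sketch of such a record (the field names follow the public SWE-bench dataset schema; the values here are illustrative, and the commit hash is a placeholder):

```python
# Illustrative SWE-bench-style instance record. Values are made up,
# but the field names match the public SWE-bench dataset schema.
instance = {
    "instance_id": "django__django-16527",  # unique "repo__repo-issue" identifier
    "repo": "django/django",                # GitHub repository under test
    "base_commit": "0123abc",               # placeholder commit the fix must apply to
    "problem_statement": "Bug: ...",        # the issue text handed to the agent
}

# The harness keys everything off instance_id, which embeds the repo name:
assert instance["instance_id"].startswith(instance["repo"].replace("/", "__"))
```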
To run these tests safely and accurately, we need three components:

1. **The Sandbox (Docker).** We cannot let the AI run wild on your laptop; it might try to delete files or install strange packages. Every test runs inside a Docker container: a disposable, isolated room where the AI can break things without consequence.
2. **The Agent (the player).** We use a customized version of mini-swe-agent. It runs a simple loop: read the problem, propose a shell command, execute it, observe the output, and repeat until it submits a fix.
3. **The Bridge.** The Evaluation Framework is written in Python, but GitNexus is written in TypeScript. The bridge lets the Python test runner talk to the GitNexus MCP server.
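The sandbox idea can be sketched with plain `docker run` calls. This is an illustration of the concept only, not the framework's actual `GitNexusDockerEnvironment` class, and the image name in the comment is hypothetical:

```python
import subprocess

def sandbox_argv(image: str, command: str) -> list:
    """Build the docker invocation. --rm makes the container disposable,
    so nothing the AI breaks survives the run."""
    return ["docker", "run", "--rm", image, "sh", "-c", command]

def run_in_sandbox(image: str, command: str) -> str:
    """Run one shell command in a throwaway container and return its stdout."""
    result = subprocess.run(
        sandbox_argv(image, command),
        capture_output=True, text=True, timeout=300,
    )
    return result.stdout

# e.g. run_in_sandbox("swebench/django:latest", "python -V")  # image name is illustrative
```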
The entry point is a Python script: eval/run_eval.py. You act as the teacher administering the test.
Let's ask the AI (using the Claude model) to fix a specific bug in Django, using GitNexus for help.
```bash
# Run in "Native Augment" mode (with GitNexus)
python run_eval.py single \
  --model claude-sonnet \
  --mode native_augment \
  --instance django__django-16527
```
What happens? The framework builds a Docker container for that Django issue, starts the GitNexus MCP server, and lets the agent attempt a fix with context augmentation switched on.
Now we run the exact same test, but we take the textbook away.
```bash
# Run in "Baseline" mode (NO GitNexus)
python run_eval.py single \
  --model claude-sonnet \
  --mode baseline \
  --instance django__django-16527
```
The script generates a `summary.json` file:
```json
{
  "instance_id": "django__django-16527",
  "baseline_status": "FAILED",
  "gitnexus_status": "PASSED",
  "gitnexus_metrics": {
    "augmentation_hits": 12,
    "tool_calls": 5
  }
}
```
In this example, the AI failed on its own but passed when GitNexus helped it 12 times!
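One run is an anecdote; the interesting numbers come from tallying many instances. A sketch of an aggregator, assuming each run writes a `summary.json` with the status fields shown above (the directory layout is an assumption):

```python
import json
from pathlib import Path

def aggregate(results_dir: str) -> dict:
    """Tally pass counts for both modes across every summary.json under results_dir."""
    totals = {"baseline": 0, "gitnexus": 0, "instances": 0}
    for path in Path(results_dir).rglob("summary.json"):
        summary = json.loads(path.read_text())
        totals["instances"] += 1
        # bool is an int in Python, so this adds 1 for PASSED and 0 otherwise
        totals["baseline"] += summary["baseline_status"] == "PASSED"
        totals["gitnexus"] += summary["gitnexus_status"] == "PASSED"
    return totals
```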
Let's look at how the system orchestrates this complex dance.
The evaluation logic is built in Python. Let's look at the core pieces.
### The Orchestrator (`process_instance`)

Found in `eval/run_eval.py`, this function sets up the "Exam Room."
```python
def process_instance(instance, config, output_dir, model_name, mode_name):
    # 1. Create the Docker environment (the sandbox)
    env = GitNexusDockerEnvironment(image=get_image(instance), **config)

    # 2. Create the agent (the student), telling it which
    #    mode to use (Baseline vs. GitNexus).
    #    (get_model stands in for model construction, which the tutorial elides.)
    model = get_model(model_name)
    agent = GitNexusAgent(model, env, gitnexus_mode=mode_name)

    # 3. Run the test!
    logger.info(f"Starting {instance['instance_id']}")
    info = agent.run(instance["problem_statement"])

    # 4. Collect the grades
    result = {
        "exit_status": info.get("exit_status"),
        "metrics": agent.gitnexus_metrics.to_dict(),
    }
    return result
```
Explanation: each instance gets a fresh Docker sandbox, the agent is handed the problem statement, and the result bundles the exit status with the GitNexus metrics so the two modes can be compared later.
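A full benchmark run is then just this worker mapped over the whole dataset. A generic sketch of a parallel driver (`ThreadPoolExecutor` is an assumption here; `run_eval.py` may schedule runs differently):

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(instances, worker, max_workers=4):
    """Run a per-instance worker over the whole dataset in parallel.

    Each instance gets its own Docker container, so runs cannot
    interfere with one another.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, instances))
```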
### The Agent (`GitNexusAgent`)

Found in `eval/agents/gitnexus_agent.py`, this class extends the standard agent to track our specific metrics.
```python
class GitNexusAgent(DefaultAgent):
    def execute_actions(self, message):
        # 0. Parse the shell commands out of the model's reply
        #    (helper name assumed; the base agent handles parsing)
        actions = self.parse_actions(message)

        # 1. Let the agent run each command (e.g., grep)
        outputs = [self.env.execute(action) for action in actions]

        # 2. If in "Augment" mode, check whether we helped
        if self.gitnexus_mode == GitNexusMode.NATIVE_AUGMENT:
            for i, action in enumerate(actions):
                # Did GitNexus inject extra info?
                if "[GitNexus]" in outputs[i]["output"]:
                    self.gitnexus_metrics.augmentation_hits += 1
        return outputs
```
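The agent increments counters on `self.gitnexus_metrics`, and the orchestrator later calls `.to_dict()` on it. A minimal sketch of such a metrics container (the real class in the repo may track more fields):

```python
from dataclasses import dataclass, asdict

@dataclass
class GitNexusMetrics:
    augmentation_hits: int = 0  # commands where GitNexus injected extra context
    tool_calls: int = 0         # explicit MCP tool invocations by the agent

    def to_dict(self) -> dict:
        return asdict(self)

m = GitNexusMetrics()
m.augmentation_hits += 1
assert m.to_dict() == {"augmentation_hits": 1, "tool_calls": 0}
```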
Explanation: whenever a command's output contains the `[GitNexus]` marker, we know our hook (from Chapter 6) successfully intercepted the call. We count this as a "Hit."

### The Bridge (`MCPBridge`)
Found in `eval/bridge/mcp_bridge.py`. Since Python cannot directly call TypeScript functions, we spawn GitNexus as a subprocess and talk to it via Standard Input/Output (stdio).
```python
class MCPBridge:
    def start(self):
        # 1. Launch the TypeScript binary
        self.process = subprocess.Popen(
            ["npx", "gitnexus", "mcp"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )

    def call_tool(self, tool_name, args):
        # 2. Send a JSON-RPC request to GitNexus
        request = {
            "jsonrpc": "2.0",
            "method": "tools/call",
            "params": {"name": tool_name, "arguments": args},
        }
        self._send_request(request)
        # 3. Read the JSON response
        return self._read_response()
```
Explanation: the bridge speaks JSON-RPC 2.0 over stdio, the same protocol any MCP client uses, so the Python harness can call GitNexus tools exactly as an AI assistant would.
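The `_send_request` and `_read_response` helpers are not shown above. One way to implement them, assuming the MCP stdio convention of newline-delimited JSON messages (a sketch under that assumption, not the repo's exact code):

```python
import json

class StdioFraming:
    """Minimal newline-delimited JSON framing over a subprocess's pipes."""

    def __init__(self, process):
        self.process = process
        self._next_id = 0

    def _send_request(self, request):
        self._next_id += 1
        request = {**request, "id": self._next_id}  # JSON-RPC requests need an id
        line = json.dumps(request) + "\n"           # one message per line, no embedded newlines
        self.process.stdin.write(line.encode())
        self.process.stdin.flush()

    def _read_response(self):
        line = self.process.stdout.readline()       # blocks until a full line arrives
        return json.loads(line)
```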
Congratulations! You have reached the end of the GitNexus Developer Tutorial.
We have journeyed through the entire lifecycle of a modern AI tool: from building the tool itself, to feeding its knowledge to an AI through context augmentation hooks and the MCP server, to proving its value with a rigorous evaluation.
You now possess a complete blueprint for building advanced RAG (Retrieval-Augmented Generation) systems for code. Go forth and build the next generation of developer tools!
Generated by Code IQ