In the previous chapter, AI Evaluation (Judging), we acted like an art critic. We used an AI to look at the server's tools and give us a qualitative opinion ("Is this description helpful?").
But in engineering, feelings aren't enough. We also need hard numbers.
This brings us to Statistics Collection. If the AI Judge provides the Quality, the Statistics system provides the Dimensions.
Imagine you are packing for a flight.
An LLM's context window is like a suitcase with a strict size limit. If you have a tool with a massive, complex schema (definition), it's like trying to pack a winter coat: it takes up a lot of space. If it's too big, the AI literally cannot "carry" it, and your tool becomes unusable.
Statistics Collection answers questions like:

- How many tokens does each tool's schema consume?
- How deeply nested is its input schema?
- How many parameters does it expose?
The system breaks this down into small, calculable units using the Statistic class.
This is a small class designed to measure one specific thing.
- `ToolInputSchemaTokenCount`: Calculates the "weight" in tokens.
- `ToolInputSchemaMaxDepthCount`: Calculates the structural depth.
When a Statistic runs, it produces a StatisticValue. This isn't just a number; it links back to the calculator that produced it, so the report knows what that number represents.
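The exact class definitions live in the library; as a rough sketch of the idea (class shapes here are assumed for illustration, not the actual source):

```python
from dataclasses import dataclass
from typing import Any


class Statistic:
    """Stand-in for the library's base calculator class."""

    @property
    def name(self) -> str:
        # The class name doubles as the metric's label in reports
        return type(self).__name__


class ToolInputSchemaTokenCount(Statistic):
    pass


@dataclass
class StatisticValue:
    statistic: Statistic  # the calculator that produced this number
    value: Any            # the number itself


sv = StatisticValue(ToolInputSchemaTokenCount(), 128)
print(f"{sv.statistic.name}: {sv.value}")  # ToolInputSchemaTokenCount: 128
```

Because each value carries a reference to its calculator, a report can group and label results without any extra bookkeeping.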
Using the statistics engine is very similar to using the Constraint system. You pick a calculator and feed it the server data.
Here is how you might calculate the "Token Cost" of your tools manually:
```python
from mcp_interviewer.statistics.tool import ToolInputSchemaTokenCount

# 1. Create the calculator
calculator = ToolInputSchemaTokenCount()

# 2. Run it against your server's data (ServerScoreCard)
# This returns a generator, so we convert to a list
stats = list(calculator.compute(server_scorecard))

# 3. Print results
for stat in stats:
    print(f"Tool Cost: {stat.value} tokens")
```
Explanation:
We instantiate the ToolInputSchemaTokenCount class. When we call .compute(), it iterates through every tool in the server, does the math, and returns the results.
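That per-tool iteration is the core mechanic. A simplified stand-in for the base class (not the library's actual code) makes the pattern clear: `compute()` loops over the server's tools and delegates each one to `compute_tool()`.

```python
from types import SimpleNamespace


class ToolStatistic:
    """Simplified sketch: fan compute() out across every tool."""

    def compute(self, scorecard):
        for tool in scorecard.tools:
            yield from self.compute_tool(tool)


class ParameterCount(ToolStatistic):
    # Toy statistic: number of top-level parameters in the schema
    def compute_tool(self, tool):
        yield len(tool.inputSchema.get("properties", {}))


# Minimal fake scorecard with two tools
scorecard = SimpleNamespace(tools=[
    SimpleNamespace(inputSchema={"properties": {"city": {}, "units": {}}}),
    SimpleNamespace(inputSchema={"properties": {"query": {}}}),
])
print(list(ParameterCount().compute(scorecard)))  # [2, 1]
```

Subclasses only ever have to answer one question: "what does this single tool measure as?"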
How does the system actually count these abstract concepts?
Let's look at src/mcp_interviewer/statistics/tool.py to see how these calculations are implemented.
The most important statistic is the token count. We use a library called tiktoken (from OpenAI) to accurately simulate how an LLM reads text.
```python
# src/mcp_interviewer/statistics/tool.py
class ToolInputSchemaTokenCount(ToolStatistic):
    def compute_tool(self, tool: Tool):
        # 1. Convert our tool to OpenAI's format
        oai_tool = convert_to_openai_format(tool)

        # 2. Use helper to count tokens based on model "gpt-4o"
        token_count = num_tokens_for_tool(oai_tool, "gpt-4o")

        # 3. Yield the result
        yield StatisticValue(self, token_count)
```
Explanation:
- `convert_to_openai_format`: MCP tools look slightly different from OpenAI tools, so we align them first.
- `num_tokens_for_tool`: This acts like a virtual scale. It weighs every letter, punctuation mark, and whitespace according to the specific tokenization rules of gpt-4o.

Deeply nested JSON (objects inside objects inside objects) confuses AI models, so we use a recursive function to measure nesting.
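To make the conversion step concrete, here is a hypothetical sketch (the tool data and the helper body are illustrative, not the library's real implementation): MCP puts the schema under `inputSchema`, while OpenAI's function format expects it under `parameters`.

```python
import json

# Hypothetical MCP tool definition (illustrative, not from a real server)
mcp_tool = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "inputSchema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def convert_to_openai_format(tool: dict) -> dict:
    """Simplified sketch: rename MCP's inputSchema to OpenAI's parameters."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["inputSchema"],
        },
    }


oai_tool = convert_to_openai_format(mcp_tool)

# Crude stand-in for the real scale: the actual helper uses tiktoken's
# gpt-4o encoding, which splits the text far more precisely than this.
rough_token_count = len(json.dumps(oai_tool).split())
print(oai_tool["function"]["name"], rough_token_count)
```

The key point is that the token count is measured on the *converted* form, because that is the shape the LLM actually receives.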
```python
# src/mcp_interviewer/statistics/tool.py
class ToolInputSchemaMaxDepthCount(ToolStatistic):
    def compute_tool(self, tool: Tool):
        # Recursive helper function
        def get_max_depth(obj, depth=0):
            # Guard against empty dicts: max() over nothing would raise
            if isinstance(obj, dict) and obj:
                # If it's a dictionary, dig deeper into its values
                return max(get_max_depth(v, depth + 1) for v in obj.values())
            return depth

        # Run the check
        depth = get_max_depth(tool.inputSchema.get("properties", {}))
        yield StatisticValue(self, depth)
```
Explanation: Think of Russian nesting dolls. Every time we open a doll (a dictionary) and find another one inside, the depth counter goes up by one; when we hit something that isn't a dictionary, we stop and report how deep we got.
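To see the counter in action, here is the same recursive helper run against two hypothetical schemas, one flat and one deeply nested (the example schemas are made up for illustration):

```python
def get_max_depth(obj, depth=0):
    # Dig one level deeper for every non-empty dictionary we encounter
    if isinstance(obj, dict) and obj:
        return max(get_max_depth(v, depth + 1) for v in obj.values())
    return depth


# Flat: a single string parameter
flat = {"city": {"type": "string"}}

# Nested: an object inside an object inside an object
nested = {
    "filters": {
        "type": "object",
        "properties": {
            "date": {
                "type": "object",
                "properties": {"year": {"type": "integer"}},
            }
        },
    }
}

print(get_max_depth(flat))    # 2
print(get_max_depth(nested))  # 6
```

A depth of 2 is typical for simple tools; large numbers are a warning sign that the schema may confuse the model.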
Just like with Constraints, we often want to run all the math at once. We use a CompositeStatistic to bundle them.
```python
# src/mcp_interviewer/statistics/tool.py
class AllToolStatistics(CompositeStatistic):
    def __init__(self) -> None:
        super().__init__(
            ToolInputSchemaTokenCount(),
            ToolInputSchemaTotalParametersCount(),
            ToolInputSchemaMaxDepthCount(),
            # ... add more calculators here
        )
```
Explanation:
This AllToolStatistics class acts as a master dashboard. When the Interviewer calls this, it triggers every calculator in the list, returning a comprehensive set of metrics in one go.
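The composite pattern itself is simple. A minimal sketch (class shapes and toy statistics assumed for illustration) shows how one object fans `compute()` out to every bundled calculator:

```python
class CompositeStatistic:
    """Sketch: forward compute() to every bundled statistic in order."""

    def __init__(self, *statistics):
        self._statistics = statistics

    def compute(self, scorecard):
        for statistic in self._statistics:
            yield from statistic.compute(scorecard)


# Two toy calculators standing in for the real ones
class ToolCount:
    def compute(self, scorecard):
        yield ("tool_count", len(scorecard["tools"]))


class LongestName:
    def compute(self, scorecard):
        yield ("longest_name", max(len(t) for t in scorecard["tools"]))


dashboard = CompositeStatistic(ToolCount(), LongestName())
scorecard = {"tools": ["get_weather", "search"]}
print(dict(dashboard.compute(scorecard)))
# {'tool_count': 2, 'longest_name': 11}
```

Because the composite exposes the same `compute()` interface as a single statistic, callers never need to know whether they are running one calculator or a whole dashboard.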
This data is crucial for Benchmarking.
In this chapter, we added objective measurements to our interview process.
We have now reached the end of the data gathering phase. We have:

- Checked the server against hard Constraints
- Collected a qualitative opinion from the AI Judge
- Measured objective statistics like token cost and schema depth
The final step is to take this massive pile of data and turn it into something a human can actually read.
Next Chapter: Reporting System
Generated by Code IQ