How to evaluate with Agent Spec Eval#
Prerequisites#
This guide assumes you are familiar with the core concepts of Agent Spec.
Additionally, you need to have Python 3.10+ installed.
Overview#
Agent Spec Eval is the evaluation extension of Agent Spec. It standardizes a minimal, framework-agnostic API for evaluating agentic systems with:
Datasets: collections of samples.
Metrics: reusable measurements (deterministic or LLM-based).
Evaluators: orchestrators that run metrics over datasets, with optional concurrency control.
For the formal specification and background, see Open Agent Specification Evaluation (Agent Spec Eval).
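Before touching the library, the three abstractions can be pictured in plain Python. The sketch below is purely conceptual and is not the pyagentspec.evaluation API: a dataset is a list of samples, a metric is an async callable returning a (value, details) pair, and an evaluator runs the metric over the dataset with a semaphore providing concurrency control.

```python
import asyncio

# Conceptual sketch only -- these names and signatures are illustrative,
# not the actual pyagentspec.evaluation API.

async def exact_match(sample: dict) -> tuple[float, dict]:
    """A deterministic metric: 1.0 if the response equals the reference."""
    value = 1.0 if sample["response"] == sample["reference"] else 0.0
    return value, {"reference": sample["reference"], "response": sample["response"]}

async def evaluate(samples: list[dict], metric, max_concurrency: int = 4) -> list[float]:
    """An evaluator: run the metric over all samples with bounded concurrency."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(sample: dict) -> float:
        async with sem:
            value, _details = await metric(sample)
            return value

    return await asyncio.gather(*(run_one(s) for s in samples))

samples = [
    {"reference": "Bern", "response": "Bern"},
    {"reference": "Zürich", "response": "Zurich"},
]
print(asyncio.run(evaluate(samples, exact_match)))  # [1.0, 0.0]
```

The semaphore matters once metrics make network calls (e.g. to an LLM): it caps how many samples are judged at once while still overlapping I/O.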
Run an end-to-end evaluation#
Create a dataset from in-memory samples, configure one or more metrics, and run them
with an Evaluator.
import asyncio

from pyagentspec.evaluation import Dataset, Evaluator
from pyagentspec.evaluation.metrics.implementations import ExactBinaryMatchMetric

data = [
    {
        "query": "What is the largest city of CH?",
        "reference": "Zürich",
        "response": "Zurich",
    },
    {
        "query": "What is the capital of Switzerland?",
        "reference": "Bern",
        "response": "Bern",
    },
    {
        "query": "Where is the UN European HQ?",
        "reference": "Geneva",
        "response": "Genève",
    },
]
dataset = Dataset.from_dict(data)


async def evaluator_example() -> None:
    evaluator = Evaluator(
        metrics=[
            ExactBinaryMatchMetric(name="ExactBinaryMatchStrict"),
            ExactBinaryMatchMetric(name="ExactBinaryMatchRelaxed", ignore_glyph=True),
        ]
    )
    results = await evaluator.evaluate(dataset)
    print("as JSON:")
    print(results.to_dict())
    print("as DF:")
    print(results.to_df())


asyncio.run(evaluator_example())
The returned EvaluationResults can be exported as:
a dictionary via results.to_dict() (includes each metric value and its details)
a pandas DataFrame via results.to_df() (only the main metric values)
Use different dataset field names (input mapping)#
In practice, your dataset may use different keys than the defaults expected by a metric.
For example, you may have ground_truth instead of reference and answer instead of response.
Many metrics in pyagentspec.evaluation.metrics.implementations support mapping the
dataset feature names to the metric’s expected parameters.
mapped_data = [
    {"ground_truth": "Zürich", "answer": "Zurich"},
    {"ground_truth": "Bern", "answer": "Bern"},
    {"ground_truth": "Geneva", "answer": "Genève"},
]
mapped_dataset = Dataset.from_dict(mapped_data)


async def input_mapping_example() -> None:
    evaluator = Evaluator(
        metrics=[
            ExactBinaryMatchMetric(
                name="ExactBinaryMatchStrict",
                reference_feature_name="ground_truth",
                response_feature_name="answer",
            ),
        ]
    )
    results = await evaluator.evaluate(mapped_dataset)
    print(results.to_df())
Use an LLM-based metric#
Some metrics call an LLM to judge semantic equivalence (or other rubric-based criteria). These metrics take an Agent Spec LLM configuration.
from pyagentspec.evaluation.metrics.implementations import SemanticBinaryMatchMetric
from pyagentspec.llms import OpenAiConfig


async def llm_metric_example() -> None:
    llm_config = OpenAiConfig(name="openai-config", model_id="gpt-5-mini")
    metric = SemanticBinaryMatchMetric(llm_config)
    for reference, response in [("Zeurich", "Zurich"), ("Beijing", "Peking")]:
        value, details = await metric(reference=reference, response=response)
        print((value, details))
Reduce LLM non-determinism with repeats and ensembles#
LLM-based metrics can be noisy. Agent Spec Eval provides wrappers that run a single metric multiple times (repeat) or run several semantically equivalent metrics (ensemble), then aggregate the resulting values.
from pyagentspec.evaluation.aggregators import MeanAggregator
from pyagentspec.evaluation.metrics.wrappers import EnsembleMetric, RepeatMetric


async def repeat_and_ensemble_example() -> None:
    llm_config = OpenAiConfig(name="openai-config", model_id="gpt-5-mini")
    repeat_metric = RepeatMetric(
        metric=SemanticBinaryMatchMetric(llm_config),
        aggregator=MeanAggregator(),
        num_repeats=3,
    )
    ensemble_metric = EnsembleMetric(
        name="SemanticBinaryMatch",
        metrics=[
            SemanticBinaryMatchMetric(name="SemanticBinaryMatch-A", llm_config=llm_config),
            SemanticBinaryMatchMetric(name="SemanticBinaryMatch-B", llm_config=llm_config),
        ],
        aggregator=MeanAggregator(),
    )
    for reference, response in [("Zeurich", "Zurich"), ("Beijing", "Peking")]:
        print("repeat:")
        print(await repeat_metric(reference=reference, response=response))
        print("ensemble:")
        print(await ensemble_metric(reference=reference, response=response))
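Why repeating helps can be seen with a toy simulation that needs no pyagentspec at all: model a noisy judge that returns the wrong binary verdict 20% of the time, and compare the spread of single-shot scores against the mean of several repeats (analogous to RepeatMetric with a MeanAggregator). All names here are illustrative.

```python
import random
import statistics

random.seed(0)

def noisy_judge(true_value: float, flip_prob: float = 0.2) -> float:
    """A judge that returns the wrong binary verdict with probability flip_prob."""
    return true_value if random.random() > flip_prob else 1.0 - true_value

def repeat_mean(true_value: float, num_repeats: int) -> float:
    """Aggregate several independent judgments with a mean."""
    return statistics.mean(noisy_judge(true_value) for _ in range(num_repeats))

# Spread of single-shot vs. mean-of-5 scores for a true value of 1.0:
single = [noisy_judge(1.0) for _ in range(1000)]
repeated = [repeat_mean(1.0, 5) for _ in range(1000)]
print(statistics.stdev(single) > statistics.stdev(repeated))  # True
```

Averaging n independent judgments shrinks the standard deviation by roughly a factor of sqrt(n), at the cost of n times as many LLM calls.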
Recap#
This guide covered how to:
Build a pyagentspec.evaluation.datasets.dataset.Dataset from in-memory samples.
Evaluate samples with deterministic and LLM-based metrics.
Export results to JSON or pandas DataFrame.
Map dataset feature names to metric inputs.
Use repeat/ensemble wrappers to improve robustness.
Next steps#
Check the Tracing specification of Agent Spec: Open Agent Specification Tracing.