How to evaluate with Agent Spec Eval#

Prerequisites#

This guide assumes you are familiar with the core concepts of Agent Spec.

Additionally, you need to have Python 3.10+ installed.

Overview#

Agent Spec Eval is the evaluation extension of Agent Spec. It standardizes a minimal, framework-agnostic API for evaluating agentic systems with:

  • Datasets: collections of samples.

  • Metrics: reusable measurements (deterministic or LLM-based).

  • Evaluators: orchestrators that run metrics over datasets, with optional concurrency control.

For the formal specification and background, see Open Agent Specification Evaluation (Agent Spec Eval).
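Conceptually, these three pieces compose in a simple way. The sketch below is illustrative plain Python, not the library's actual classes: a metric is modeled as an async callable returning a (value, details) pair, and the "evaluator" is just a loop that scores every sample concurrently.

```python
import asyncio
from typing import Any

# Illustrative sketch only -- not the actual pyagentspec classes.
# A metric scores one sample and returns a (value, details) pair;
# an evaluator applies it to every sample and collects the results.


async def exact_match(reference: str, response: str) -> tuple[float, dict[str, Any]]:
    value = 1.0 if reference == response else 0.0
    return value, {"reference": reference, "response": response}


async def run_all(samples: list[dict[str, str]]) -> list[float]:
    # Score all samples concurrently, as an evaluator with a
    # concurrency limit would
    scored = await asyncio.gather(
        *(exact_match(s["reference"], s["response"]) for s in samples)
    )
    return [value for value, _ in scored]


samples = [
    {"reference": "Bern", "response": "Bern"},
    {"reference": "Zürich", "response": "Zurich"},
]
print(asyncio.run(run_all(samples)))  # [1.0, 0.0]
```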

Run an end-to-end evaluation#

Create a dataset from in-memory samples, configure one or more metrics, and run them with an Evaluator.

import asyncio
from pyagentspec.evaluation import Dataset, Evaluator
from pyagentspec.evaluation.metrics.implementations import ExactBinaryMatchMetric

data = [
    {
        "query": "What is the largest city of Switzerland?",
        "reference": "Zürich",
        "response": "Zurich",
    },
    {
        "query": "What is the capital of Switzerland?",
        "reference": "Bern",
        "response": "Bern",
    },
    {
        "query": "Where is the UN European HQ?",
        "reference": "Geneva",
        "response": "Genève",
    },
]
dataset = Dataset.from_dict(data)


async def evaluator_example() -> None:
    evaluator = Evaluator(
        metrics=[
            ExactBinaryMatchMetric(name="ExactBinaryMatchStrict"),
            ExactBinaryMatchMetric(name="ExactBinaryMatchRelaxed", ignore_glyph=True),
        ]
    )
    results = await evaluator.evaluate(dataset)

    print("as JSON:")
    print(results.to_dict())

    print("as DF:")
    print(results.to_df())


asyncio.run(evaluator_example())

The returned EvaluationResults can be exported as:

  • a dictionary via results.to_dict() (includes each metric value and its details)

  • a pandas DataFrame via results.to_df() (only the main metric values)
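Because results.to_df() yields a pandas DataFrame, the usual pandas export and summary tools apply. The frame below is a hand-built stand-in (column names and values are illustrative, not actual evaluator output):

```python
import pandas as pd

# Stand-in for results.to_df(); columns and values are illustrative only
df = pd.DataFrame({
    "ExactBinaryMatchStrict": [0.0, 1.0, 0.0],
    "ExactBinaryMatchRelaxed": [1.0, 1.0, 1.0],
})

df.to_csv("results.csv", index=False)  # persist for later comparison
print(df.mean())  # average score per metric
```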

Use different dataset field names (input mapping)#

In practice, your dataset may use different keys than the defaults expected by a metric. For example, you may have ground_truth instead of reference and answer instead of response.

Many metrics in pyagentspec.evaluation.metrics.implementations support mapping the dataset feature names to the metric’s expected parameters.

mapped_data = [
    {"ground_truth": "Zürich", "answer": "Zurich"},
    {"ground_truth": "Bern", "answer": "Bern"},
    {"ground_truth": "Geneva", "answer": "Genève"},
]
mapped_dataset = Dataset.from_dict(mapped_data)


async def input_mapping_example() -> None:
    evaluator = Evaluator(
        metrics=[
            ExactBinaryMatchMetric(
                name="ExactBinaryMatchStrict",
                reference_feature_name="ground_truth",
                response_feature_name="answer",
            ),
        ]
    )
    results = await evaluator.evaluate(mapped_dataset)
    print(results.to_df())


asyncio.run(input_mapping_example())

Use an LLM-based metric#

Some metrics call an LLM to judge semantic equivalence (or other rubric-based criteria). These metrics take an Agent Spec LLM configuration.

from pyagentspec.evaluation.metrics.implementations import SemanticBinaryMatchMetric
from pyagentspec.llms import OpenAiConfig


async def llm_metric_example() -> None:
    llm_config = OpenAiConfig(name="openai-config", model_id="gpt-5-mini")
    metric = SemanticBinaryMatchMetric(llm_config)
    for reference, response in [("Zeurich", "Zurich"), ("Beijing", "Peking")]:
        value, details = await metric(reference=reference, response=response)
        print((value, details))


asyncio.run(llm_metric_example())

Reduce LLM non-determinism with repeats and ensembles#

LLM-based metrics can be noisy. Agent Spec Eval provides wrappers that run a single metric multiple times (repeat) or run several semantically equivalent metrics (ensemble), then aggregate the resulting values.

from pyagentspec.evaluation.aggregators import MeanAggregator
from pyagentspec.evaluation.metrics.wrappers import EnsembleMetric, RepeatMetric


async def repeat_and_ensemble_example() -> None:
    llm_config = OpenAiConfig(name="openai-config", model_id="gpt-5-mini")
    repeat_metric = RepeatMetric(
        metric=SemanticBinaryMatchMetric(llm_config),
        aggregator=MeanAggregator(),
        num_repeats=3,
    )

    ensemble_metric = EnsembleMetric(
        name="SemanticBinaryMatch",
        metrics=[
            SemanticBinaryMatchMetric(name="SemanticBinaryMatch-A", llm_config=llm_config),
            SemanticBinaryMatchMetric(name="SemanticBinaryMatch-B", llm_config=llm_config),
        ],
        aggregator=MeanAggregator(),
    )

    for reference, response in [("Zeurich", "Zurich"), ("Beijing", "Peking")]:
        print("repeat:")
        print(await repeat_metric(reference=reference, response=response))
        print("ensemble:")
        print(await ensemble_metric(reference=reference, response=response))


asyncio.run(repeat_and_ensemble_example())
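The benefit of repeating is easiest to see outside the library: averaging several independent noisy judgments lowers the variance of the final score. A small self-contained simulation, where a biased coin flip stands in for an LLM judge:

```python
import random
import statistics

random.seed(0)


def noisy_judge(p_correct: float = 0.7) -> float:
    # Stand-in for an LLM judge that is right 70% of the time
    return 1.0 if random.random() < p_correct else 0.0


# Score the same sample 1000 times: once with a single call, and once
# with the mean of three repeated calls (as a repeat wrapper would do)
single = [noisy_judge() for _ in range(1000)]
repeated = [statistics.mean(noisy_judge() for _ in range(3)) for _ in range(1000)]

print(statistics.pvariance(single))    # ~0.21 for a Bernoulli(0.7) judge
print(statistics.pvariance(repeated))  # roughly a third of that
```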

Recap#

This guide covered how to:

  • Build a pyagentspec.evaluation.datasets.dataset.Dataset from in-memory samples.

  • Evaluate samples with deterministic and LLM-based metrics.

  • Export results to JSON or pandas DataFrame.

  • Map dataset feature names to metric inputs.

  • Use repeat/ensemble wrappers to improve robustness.

Next steps#