Evaluation#

Open Agent Specification Evaluation (short: Agent Spec Eval) is an extension of Agent Spec that standardizes how agentic systems are evaluated in a framework-agnostic way.

Evaluation#

class pyagentspec.evaluation.Dataset(_data_source)#

Bases: _DataSource

Concrete wrapper around _DataSource implementations used during evaluation.

Parameters:

_data_source (_DataSource) –

features()#

Return the sequence of feature names provided by this data source.

Returns:

Names of all features available in samples.

Return type:

Sequence[str]

static from_df(df)#

Create a dataset from a pandas DataFrame. The DataFrame must have a single-level column header.

Parameters:

df (DataFrame) – An instance of pandas dataframe.

Returns:

A dataset that wraps the dataframe.

Return type:

Dataset

Raises:

ValueError – If any of the column headers is not a string.

static from_dict(data, features_consistency='strict')#

Initialize a dataset with a collection of samples and determine feature consistency.

Parameters:
  • data (Dict[Hashable, Dict[str, Any]] or List[Dict[str, Any]]) – The dataset. If a dictionary, keys are sample identifiers and values are feature dictionaries. If a list, each item is a feature dictionary and sample identifiers are assigned as sequential indices.

  • features_consistency ({"strict", "relaxed", "bypass"}, default "strict") –

    Policy for validating feature keys consistency across samples:
    • ”strict”: All samples must have identical feature keys.

    • ”relaxed”: Uses only the intersection of keys from all samples.

    • ”bypass”: Uses feature keys from the first sample only.

    Warning

    Bypass consistency control is solely a performance optimization. If the dataset is inconsistent, it may later result in errors during evaluation. Use bypass only when you are certain your dataset is consistent.

Raises:
  • TypeError – If the data input is neither a dict nor a sequence of feature dictionaries.

  • ValueError – If samples are missing, feature keys are inconsistent in “strict” mode, or no features are found.

Return type:

Dataset
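The three features_consistency policies can be illustrated with plain Python (a sketch of the documented behaviour, not the library's internal validation code):

```python
# Two samples whose feature keys disagree: the second lacks "difficulty".
samples = [
    {"question": "2+2?", "answer": "4", "difficulty": "easy"},
    {"question": "Capital of France?", "answer": "Paris"},
]

key_sets = [set(s) for s in samples]

# "strict": all samples must share identical feature keys.
strict_ok = all(ks == key_sets[0] for ks in key_sets)

# "relaxed": keep only the features common to every sample.
relaxed_features = set.intersection(*key_sets)

# "bypass": trust the first sample's keys without further checks.
bypass_features = set(samples[0])

print(strict_ok)                    # False
print(sorted(relaxed_features))     # ['answer', 'question']
print(sorted(bypass_features))      # ['answer', 'difficulty', 'question']
```

With this data, "strict" mode would raise a ValueError, "relaxed" silently drops the "difficulty" feature, and "bypass" would later fail when a metric asks the second sample for "difficulty".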

async get_sample(id)#

Asynchronously fetch a data sample given its identifier.

Parameters:

id (Hashable) – Unique identifier for the sample to fetch.

Returns:

Dictionary containing feature values for the sample.

Return type:

Dict[str, Any]

ids()#

Asynchronously yield all available sample identifiers.

Yields:

Hashable – Unique identifier for a sample.

Return type:

AsyncIterator[Hashable]
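The async ids/get_sample access pattern can be sketched with a minimal in-memory stand-in that mirrors the documented accessors (the real Dataset wraps a _DataSource and may differ internally):

```python
import asyncio
from typing import Any, AsyncIterator, Dict, Hashable

# Hypothetical in-memory stand-in mirroring the documented Dataset accessors.
class InMemoryDataset:
    def __init__(self, samples: Dict[Hashable, Dict[str, Any]]) -> None:
        self._samples = samples

    async def ids(self) -> AsyncIterator[Hashable]:
        # Asynchronously yield every sample identifier.
        for sample_id in self._samples:
            yield sample_id

    async def get_sample(self, id: Hashable) -> Dict[str, Any]:
        # Fetch the feature dictionary for one sample.
        return self._samples[id]

async def main() -> Dict[Hashable, Dict[str, Any]]:
    ds = InMemoryDataset({"s1": {"response": "4"}, "s2": {"response": "Paris"}})
    collected = {}
    async for sample_id in ds.ids():
        collected[sample_id] = await ds.get_sample(sample_id)
    return collected

samples = asyncio.run(main())
print(samples)  # {'s1': {'response': '4'}, 's2': {'response': 'Paris'}}
```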

class pyagentspec.evaluation.Evaluator(metrics, max_concurrency=-1)#

Bases: object

Evaluator orchestrates the execution of a set of metrics over input data, supporting optional concurrency control.

Parameters:
  • metrics (Sequence[Metric[Any]]) –

  • max_concurrency (int) –

async evaluate(dataset)#

Execute every metric against dataset and collect the results.

Parameters:

dataset (Dataset) – Dataset exposing async ids/get_sample accessors. Each sample must provide the features required by the configured metrics.

Returns:

Structured view over the metric values and their associated metadata.

Return type:

EvaluationResults

Notes

Metrics run concurrently whenever max_concurrency permits. Any pyagentspec.evaluation.exceptions.EvaluationException raised by an underlying metric propagates to the caller if the metric's on_failure strategy is set to raise.
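One common way to bound concurrency in asyncio is a semaphore; the sketch below shows that pattern (an assumption about how a max_concurrency limit can work, not the Evaluator's actual scheduler):

```python
import asyncio

# Bound the number of concurrently running coroutines with a semaphore.
async def run_bounded(tasks, max_concurrency):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(coro):
        async with semaphore:       # at most max_concurrency enter at once
            return await coro

    # gather preserves the submission order of results.
    return await asyncio.gather(*(guarded(t) for t in tasks))

async def fake_metric(sample_id):
    await asyncio.sleep(0)          # stand-in for real metric work
    return (sample_id, 1.0)

results = asyncio.run(
    run_bounded([fake_metric(i) for i in range(4)], max_concurrency=2)
)
print(results)  # [(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0)]
```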

class pyagentspec.evaluation.EvaluationResults(results, sample_ids=None, metric_names=None)#

Bases: object

Container for storing and accessing evaluation metric results for multiple samples and metrics.

This class provides utilities to work with evaluation results that are organized as a mapping between (sample_id, metric_name) pairs and their corresponding result values and details. It enables exporting the results to common formats such as JSON and pandas DataFrame for further analysis or reporting.

Parameters:
  • results (Dict[Tuple[Hashable, str], Tuple[Any, Dict[Hashable, Any]]]) –

  • sample_ids (List[Hashable] | None) –

  • metric_names (List[str] | None) –

results#

Dictionary mapping (sample_id, metric_name) pairs to their metric result and related details.

Type:

Dict[Tuple[Hashable, str], Tuple[Any, Dict[str, Any]]]

sample_ids#

List of sample identifiers present in the results.

Type:

List[Hashable]

metric_names#

List of metric names present in the results.

Type:

List[str]

to_df()#

Return the results as a pandas.DataFrame indexed by sample id.

Returns:

DataFrame indexed by sample_id with columns as metric names. Each cell contains the main result value for the corresponding (sample_id, metric_name) pair.

Return type:

pandas.DataFrame

to_dict()#

Return the results keyed by sample and metric in dictionary form.

Returns:

Nested mapping of the form {sample_id: {metric_name: result_dict, …}, …}, where each result_dict has keys ‘value’ and ‘details’.

Return type:

Dict[Hashable, Dict[str, Dict[str, Any]]]
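The reshaping performed by to_dict() can be sketched in plain Python: the flat {(sample_id, metric_name): (value, details)} mapping becomes a nested per-sample dictionary (illustrative data, not real results):

```python
# Flat results keyed by (sample_id, metric_name), as stored in `results`.
flat = {
    ("s1", "accuracy"): (1.0, {"reason": "exact match"}),
    ("s1", "latency"): (0.2, {}),
    ("s2", "accuracy"): (0.0, {"reason": "mismatch"}),
}

# Reshape into {sample_id: {metric_name: {"value": ..., "details": ...}}}.
nested = {}
for (sample_id, metric_name), (value, details) in flat.items():
    nested.setdefault(sample_id, {})[metric_name] = {
        "value": value,
        "details": details,
    }

print(nested["s1"]["accuracy"])
# {'value': 1.0, 'details': {'reason': 'exact match'}}
```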

Aggregators#

class pyagentspec.evaluation.aggregators.Aggregator#

Bases: ABC, Generic[MetricToAggregateValueType, AggregatedValueType]

Combine a collection of metric values into a single aggregate result.

Abstract base class for aggregating a collection of values into a single, aggregated value.

This class provides a callable interface for aggregating values. Subclasses must implement the aggregate method to define the aggregation logic. When the instance is called, it invokes the aggregate method on the provided sequence of values.

Note

Call the aggregator instance directly (e.g., aggregator(values)) rather than calling the aggregate method directly.

abstract aggregate(values)#

Abstract method to aggregate a sequence of input values into a single value.

Warning

This method is intended for internal use. Users should not call it directly; instead, call the aggregator instance (i.e. aggregator(values)).

Parameters:

values (Collection[MetricToAggregateValueType]) – The collection of values to aggregate. Subclasses may choose to preprocess these values if needed.

Returns:

The aggregated value resulting from applying the aggregation logic to the inputs.

Return type:

AggregatedValueType
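The callable-aggregator pattern can be sketched as follows; MaxAggregator is a hypothetical subclass for illustration, not part of pyagentspec:

```python
from abc import ABC, abstractmethod
from typing import Collection

# Sketch of the pattern: __call__ delegates to the subclass's aggregate().
class Aggregator(ABC):
    def __call__(self, values: Collection[float]) -> float:
        return self.aggregate(values)

    @abstractmethod
    def aggregate(self, values: Collection[float]) -> float:
        """Combine a collection of values into one aggregate value."""

# Hypothetical subclass: keeps only the largest metric value.
class MaxAggregator(Aggregator):
    def aggregate(self, values: Collection[float]) -> float:
        return max(values)

aggregator = MaxAggregator()
print(aggregator([0.2, 0.9, 0.5]))  # 0.9  (call the instance, not aggregate)
```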

class pyagentspec.evaluation.aggregators.HarmonicMeanAggregator#

Bases: Aggregator[bool | float | int, float]

Aggregator that computes the harmonic mean of a collection of non-negative numerical values.

Call an instance of this class with a sequence of non-negative numbers (bool, int, or float) to obtain their harmonic mean.

If any value is zero, the result is zero. Negative values will raise a ValueError.

Note

Call the aggregator instance directly (e.g., aggregator(values)) rather than calling the aggregate method directly.

aggregate(values)#

Compute the harmonic mean of the provided non-negative values.

Parameters:

values (Collection[bool | float | int]) – Iterable of non-negative numeric values. bool entries are coerced to 0 or 1.

Returns:

Harmonic mean defined as len(values) / sum(1 / v for v in values).

Return type:

float

Notes

Users should not invoke aggregate() directly. Call the instance itself instead (e.g. aggregator(values)).

class pyagentspec.evaluation.aggregators.MeanAggregator#

Bases: Aggregator[bool | float | int, float]

Aggregator that computes the arithmetic mean of a collection of numerical values.

Call an instance of this class with a sequence of numbers (bool, int, or float) to obtain their arithmetic mean.

Note

Call the aggregator instance directly (e.g., aggregator(values)) rather than calling the aggregate method directly.

aggregate(values)#

Return the arithmetic mean for the provided numeric values.

Parameters:

values (Collection[bool | float | int]) – Finite collection of numeric values. bool entries are coerced to 0 or 1 to match Python’s arithmetic semantics.

Returns:

Arithmetic mean computed as sum(values) / len(values).

Return type:

float

Notes

Users should not invoke aggregate() directly. Call the aggregator instance itself, e.g. aggregator(values).
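The two documented formulas can be written side by side in plain Python, including the harmonic mean's zero short-circuit and negative-value error (a sketch of the documented behaviour, not the library's implementation):

```python
def harmonic_mean(values):
    vals = [float(v) for v in values]        # bools coerce to 0.0 / 1.0
    if any(v < 0 for v in vals):
        raise ValueError("negative values are not allowed")
    if any(v == 0 for v in vals):
        return 0.0                           # documented zero short-circuit
    return len(vals) / sum(1 / v for v in vals)

def arithmetic_mean(values):
    return sum(values) / len(values)         # bools count as 0 or 1

print(arithmetic_mean([0.5, 1.0, True]))     # (0.5 + 1.0 + 1) / 3
print(harmonic_mean([1.0, 0.5]))             # 2 / (1/1.0 + 1/0.5)
print(harmonic_mean([1.0, 0.0]))             # 0.0
```

The harmonic mean penalizes small values much more strongly than the arithmetic mean, which is why it is often used to combine precision-like scores where a single near-zero component should dominate.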

Intermediates#

class pyagentspec.evaluation.intermediates.Intermediate(name, input_mapping=None)#

Bases: ABC, Generic[IntermediateValueType]

Base abstraction for reusable intermediate values shared across metrics.

Intermediates compute auxiliary artefacts (for example embeddings or normalised text) that multiple metrics may depend on. They expose a uniform compute_value coroutine to materialise the result and a __call__ wrapper that handles keyword binding and input name mapping.

Parameters:
  • name (str) –

  • input_mapping (Dict[str, str] | None) –

abstract async compute_value(*args, **kwargs)#

Compute the intermediate value and return it with metadata details.

Parameters:
  • args (Any) –

  • kwargs (Any) –

Return type:

Tuple[IntermediateValueType, Dict[str, Any]]
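The keyword-binding and input-mapping behaviour described above can be sketched with a hypothetical intermediate; the mapping direction and binding details of the real base class may differ:

```python
import asyncio
from typing import Any, Dict, Optional, Tuple

# Hypothetical intermediate producing normalised text, mirroring the
# documented compute_value / __call__ split. Not part of pyagentspec.
class NormalisedText:
    def __init__(self, name: str, input_mapping: Optional[Dict[str, str]] = None) -> None:
        self.name = name
        self.input_mapping = input_mapping or {}

    async def compute_value(self, text: str) -> Tuple[str, Dict[str, Any]]:
        # Return the value together with metadata details.
        normalised = " ".join(text.lower().split())
        return normalised, {"original_length": len(text)}

    async def __call__(self, **features: Any) -> Tuple[str, Dict[str, Any]]:
        # Rename dataset feature keys onto compute_value parameter names.
        kwargs = {self.input_mapping.get(k, k): v for k, v in features.items()}
        return await self.compute_value(**kwargs)

intermediate = NormalisedText("normalised_text", input_mapping={"response": "text"})
value, details = asyncio.run(intermediate(response="  Hello   WORLD "))
print(value)  # "hello world"
```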

Metrics#

class pyagentspec.evaluation.metrics.Metric(name, input_mapping, num_retries, on_failure)#

Bases: ABC, Generic[MetricValueType]

The Metric class serves as the base for implementing both metrics and metric wrappers. To define a custom metric, inherit from this class and implement the compute_metric method.

The Metric class is generically typed. The generic type parameter should correspond to the return type of the metric.

Warning

Do not call the compute_metric method directly; instead, use the instance itself as it is callable.

Parameters:
  • name (str) –

  • input_mapping (Dict[str, str] | None) –

  • num_retries (int) –

  • on_failure (Literal['raise', 'set_none', 'set_zero'] | ~pyagentspec.evaluation.exceptions.handling_strategies.ExceptionHandlingStrategy) –

abstract async compute_metric(*args, **kwargs)#

Implements the logic for computing the metric.

Parameters:
  • The method signature can take one of the following forms:

    • It accepts the attributes necessary to compute the metric. For example: async def compute_metric(self, reference: str, response: str) -> ....

    • It may be defined as async def compute_metric(self, *args: Any, **kwargs: Any) -> ... if the metric serves as a wrapper for other metric(s).

  • args (Any) –

  • kwargs (Any) –

Raises:

EvaluationException – Raised if the evaluation attempt fails for any reason. The exception must include an informative message and the underlying error. This method must never return None or a placeholder value.

Returns:

  • value – The computed value of the metric for the input sample.

  • value_details – Additional information about the value, such as justification, reasoning, or further details. Keys with a leading double underscore are reserved for system use.

Warning

Any value returned by this method is considered a valid metric measurement. Never return None or a placeholder value; instead, raise an EvaluationException and rely on the on_failure strategy. If you return a value, retries will not be triggered.

Return type:

Tuple[MetricValueType, Dict[str, Any]]
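A custom metric following this contract might look like the sketch below; it returns a (value, details) tuple and raises on failure rather than returning a placeholder. The retry and on_failure machinery of the real Metric base class is omitted, and ExactMatchMetric is hypothetical:

```python
import asyncio
from typing import Any, Dict, Tuple

# Hypothetical metric honouring the documented compute_metric contract.
class ExactMatchMetric:
    name = "exact_match"

    async def compute_metric(
        self, reference: str, response: str
    ) -> Tuple[float, Dict[str, Any]]:
        if response is None:
            # The real contract requires raising an EvaluationException here;
            # ValueError stands in for it in this self-contained sketch.
            raise ValueError("missing response")
        value = 1.0 if reference.strip() == response.strip() else 0.0
        # details carries justification for the value; double-underscore
        # keys are reserved for system use, so plain keys are used.
        return value, {"reference": reference, "response": response}

metric = ExactMatchMetric()
value, details = asyncio.run(
    metric.compute_metric(reference="Paris", response="Paris ")
)
print(value)  # 1.0
```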

class pyagentspec.evaluation.metrics.LlmBasedMetric(name, input_mapping, num_retries, on_failure, llm_config)#

Bases: Metric[MetricValueType]

Metric base class for scoring via a Language Model invocation.

Parameters:
  • name (str) –

  • input_mapping (Dict[str, str] | None) –

  • num_retries (int) –

  • on_failure (Literal['raise', 'set_none', 'set_zero'] | ~pyagentspec.evaluation.exceptions.handling_strategies.ExceptionHandlingStrategy) –

  • llm_config (LlmConfig) –

CONTENT = 'content'#

Key for content in an LLM conversation.

ROLE = 'role'#

Key for role in an LLM conversation.

SYSTEM = 'system'#

System role in an LLM conversation.

USER = 'user'#

User role in an LLM conversation.

async ask_llm(conversation)#

Return the assistant message text and token usage from the LLM provider.

Parameters:

conversation (List[Dict[str, str]]) –

Return type:

Tuple[str, Tuple[int, int]]

class pyagentspec.evaluation.metrics.LlmAsAJudgeMetric(name, input_mapping, num_retries, on_failure, llm_config, system_prompt, user_prompt_template, value_pattern, metadata_patterns=(), output_transformer=None)#

Bases: LlmBasedMetric[MetricValueType]

Base class for metrics that rely on an LLM as the judge.

The system_prompt encodes the rubric, whereas user_prompt_template is rendered per sample. Subclasses configure regex patterns for extracting the final value and optional metadata fields from the LLM response.

Parameters:
  • name (str) –

  • input_mapping (Dict[str, str] | None) –

  • num_retries (int) –

  • on_failure (Literal['raise', 'set_none', 'set_zero'] | ~pyagentspec.evaluation.exceptions.handling_strategies.ExceptionHandlingStrategy) –

  • llm_config (LlmConfig) –

  • system_prompt (str) –

  • user_prompt_template (str) –

  • value_pattern (str) –

  • metadata_patterns (Collection[Tuple[str, str]]) –

  • output_transformer (Callable[[Any], MetricValueType] | None) –

async compute_metric(*args, **kwargs)#

Ask the LLM to judge the sample and parse the response into value/details.

Parameters:
  • args (Any) –

  • kwargs (Any) –

Return type:

Tuple[MetricValueType, Dict[str, Any]]
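How a value_pattern and metadata_patterns might extract structured output from a judge response can be sketched with plain regular expressions; the response text and patterns below are illustrative, not defaults of the library:

```python
import re

# Hypothetical judge response following a "Reasoning: ... Score: N" rubric.
response = "Reasoning: the answer matches the reference.\nScore: 4"

# value_pattern captures the final value; metadata_patterns capture extras.
value_pattern = r"Score:\s*(\d+)"
metadata_patterns = [("reasoning", r"Reasoning:\s*(.+)")]

value = int(re.search(value_pattern, response).group(1))
details = {
    key: re.search(pattern, response).group(1)
    for key, pattern in metadata_patterns
}

print(value)    # 4
print(details)  # {'reasoning': 'the answer matches the reference.'}
```

An output_transformer could then map the raw captured value onto the metric's final type, for example normalising an integer score in [1, 5] to a float in [0, 1].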