Evaluation#
Open Agent Specification Evaluation (short: Agent Spec Eval) is an extension of Agent Spec that standardizes how agentic systems are evaluated in a framework-agnostic way.
Evaluation#
- class pyagentspec.evaluation.Dataset(_data_source)#
Bases: _DataSource

Concrete wrapper around _DataSource implementations used during evaluation.
- Parameters:
_data_source (_DataSource) –
- features()#
Return the sequence of feature names provided by this data source.
- Returns:
Names of all features available in samples.
- Return type:
Sequence[str]
- static from_df(df)#
Create a dataset from a pandas dataframe. The dataframe must have a single-level header.
- Parameters:
df (DataFrame) – An instance of pandas dataframe.
- Returns:
A dataset that wraps the given dataframe.
- Return type:
Dataset
- Raises:
ValueError – If any of the column headers is not a string.
- static from_dict(data, features_consistency='strict')#
Initialize a dataset with a collection of samples and determine feature consistency.
- Parameters:
data (Dict[Hashable, Dict[str, Any]] or List[Dict[str, Any]]) – The dataset. If a dictionary, keys are sample identifiers and values are feature dictionaries. If a list, each item is a feature dictionary and sample identifiers are assigned as sequential indices.
features_consistency ({"strict", "relaxed", "bypass"}, default "strict") –
- Policy for validating feature keys consistency across samples:
“strict”: All samples must have identical feature keys.
“relaxed”: Uses only the intersection of keys from all samples.
“bypass”: Uses feature keys from the first sample only.
Warning
Bypassing consistency control is solely a performance optimization. If the dataset is inconsistent, it may later result in errors during evaluation. Use bypass only when you are sure your dataset is consistent.
- Raises:
TypeError – If the data input is neither a dict nor a sequence of feature dictionaries.
ValueError – If samples are missing, feature keys are inconsistent in “strict” mode, or no features are found.
- Return type:
Dataset
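The three consistency policies can be illustrated with plain Python over a list of feature dictionaries (this is an illustration of the documented semantics, not the library's code):

```python
# Two samples whose feature keys disagree: the second lacks "score".
samples = [
    {"reference": "a", "response": "b", "score": 1},
    {"reference": "c", "response": "d"},
]

key_sets = [set(s) for s in samples]

# "strict": every sample must share identical feature keys.
strict_ok = all(ks == key_sets[0] for ks in key_sets)

# "relaxed": keep only the keys common to every sample.
relaxed_keys = set.intersection(*key_sets)

# "bypass": trust the first sample's keys without checking the rest.
bypass_keys = set(samples[0])

print(strict_ok)             # False: the second sample lacks "score"
print(sorted(relaxed_keys))  # ['reference', 'response']
print(sorted(bypass_keys))   # ['reference', 'response', 'score']
```

With this data, "strict" mode would raise a ValueError, "relaxed" would silently drop the score feature, and "bypass" would later fail when score is missing from the second sample.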
- async get_sample(id)#
Asynchronously fetch a data sample given its identifier.
- Parameters:
id (Hashable) – Unique identifier for the sample to fetch.
- Returns:
Dictionary containing feature values for the sample.
- Return type:
Dict[str, Any]
- ids()#
Asynchronously yield all available sample identifiers.
- Yields:
Hashable – Unique identifier for a sample.
- Return type:
AsyncIterator[Hashable]
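A typical consumer drives these two accessors together: iterate ids() asynchronously and await get_sample() for each one. The in-memory class below is a hypothetical stand-in that mimics the same ids/get_sample shape:

```python
import asyncio
from typing import Any, AsyncIterator, Dict, Hashable

class InMemoryDataset:
    """Hypothetical stand-in exposing the same async accessors as Dataset."""

    def __init__(self, data: Dict[Hashable, Dict[str, Any]]) -> None:
        self._data = data

    async def ids(self) -> AsyncIterator[Hashable]:
        # Asynchronously yield every sample identifier.
        for sample_id in self._data:
            yield sample_id

    async def get_sample(self, id: Hashable) -> Dict[str, Any]:
        # Fetch the feature dictionary for one sample.
        return self._data[id]

async def collect(dataset: InMemoryDataset) -> Dict[Hashable, Dict[str, Any]]:
    return {sid: await dataset.get_sample(sid) async for sid in dataset.ids()}

dataset = InMemoryDataset({0: {"response": "Paris"}, 1: {"response": "Rome"}})
samples = asyncio.run(collect(dataset))
print(samples[0])  # {'response': 'Paris'}
```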
- class pyagentspec.evaluation.Evaluator(metrics, max_concurrency=-1)#
Bases: object

Evaluator orchestrates the execution of a set of metrics over input data, supporting optional concurrency control.
- Parameters:
metrics (Sequence[Metric[Any]]) –
max_concurrency (int) –
- async evaluate(dataset)#
Execute every metric against the dataset and collect the results.
- Parameters:
dataset (Dataset) – Dataset exposing async ids / get_sample accessors. Each sample must provide the features required by the configured metrics.
- Returns:
Structured view over the metric values and their associated metadata.
- Return type:
EvaluationResults
Notes
Metrics run concurrently whenever max_concurrency permits. Any pyagentspec.evaluation.exceptions.EvaluationException raised by an underlying metric propagates to the caller if the metric's on_failure behavior requires raising.
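One common way to implement such a concurrency bound is an asyncio.Semaphore; the sketch below is an assumption about the general pattern, not the library's actual implementation:

```python
import asyncio

async def run_bounded(coros, max_concurrency: int):
    # A non-positive max_concurrency means "unbounded", mirroring the -1 default.
    limit = max_concurrency if max_concurrency > 0 else len(coros)
    sem = asyncio.Semaphore(limit)

    async def guarded(coro):
        # At most `limit` coroutines hold the semaphore at once.
        async with sem:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))

async def fake_metric(value: int) -> int:
    # Stand-in for one metric invocation on one sample.
    await asyncio.sleep(0)
    return value * 2

results = asyncio.run(run_bounded([fake_metric(i) for i in range(4)], max_concurrency=2))
print(results)  # [0, 2, 4, 6]
```

asyncio.gather preserves input order, which is why bounding concurrency does not scramble the result list.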
- class pyagentspec.evaluation.EvaluationResults(results, sample_ids=None, metric_names=None)#
Bases: object

Container for storing and accessing evaluation metric results for multiple samples and metrics.
This class provides utilities to work with evaluation results that are organized as a mapping between (sample_id, metric_name) pairs and their corresponding result values and details. It enables exporting the results to common formats such as JSON and pandas DataFrame for further analysis or reporting.
- Parameters:
results (Dict[Tuple[Hashable, str], Tuple[Any, Dict[Hashable, Any]]]) –
sample_ids (List[Hashable] | None) –
metric_names (List[str] | None) –
- results#
Dictionary mapping (sample_id, metric_name) pairs to their metric result and related details.
- Type:
Dict[Tuple[Hashable, str], Tuple[Any, Dict[str, Any]]]
- sample_ids#
List of sample identifiers present in the results.
- Type:
List[Hashable]
- metric_names#
List of metric names present in the results.
- Type:
List[str]
- to_df()#
Return the results as a pandas.DataFrame indexed by sample id.
- Returns:
DataFrame indexed by sample_id with metric names as columns. Each cell contains the main result value for the corresponding (sample_id, metric_name) pair.
- Return type:
pandas.DataFrame
- to_dict()#
Return the results keyed by sample and metric in dictionary form.
- Returns:
Nested mapping of the form {sample_id: {metric_name: result_dict, …}, …}, where each result_dict has keys ‘value’ and ‘details’.
- Return type:
Dict[Hashable, Dict[str, Dict[str, Any]]]
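The flat-to-nested conversion that to_dict() documents can be reproduced with plain Python, starting from the {(sample_id, metric_name): (value, details)} mapping described under the results attribute (the data here is invented for illustration):

```python
# Flat mapping as stored on EvaluationResults.results.
flat = {
    (0, "accuracy"): (1.0, {"justification": "exact match"}),
    (0, "length"): (12, {}),
}

# Nest by sample id, then metric name, exposing 'value' and 'details' keys.
nested: dict = {}
for (sample_id, metric_name), (value, details) in flat.items():
    nested.setdefault(sample_id, {})[metric_name] = {"value": value, "details": details}

print(nested[0]["accuracy"])  # {'value': 1.0, 'details': {'justification': 'exact match'}}
```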
Aggregators#
- class pyagentspec.evaluation.aggregators.Aggregator#
Bases: ABC, Generic[MetricToAggregateValueType, AggregatedValueType]

Combine a collection of metric values into a single aggregate result.
Abstract base class for aggregating a collection of values into a single, aggregated value.
This class provides a callable interface for aggregating values. Subclasses must implement the aggregate method to define the aggregation logic. When the instance is called, it invokes the aggregate method on the provided sequence of values.
Note
Call the aggregator instance directly (e.g., aggregator(values)) rather than calling the aggregate method externally.
- abstract aggregate(values)#
Abstract method to aggregate a sequence of input values into a single value.
Warning
This method is intended for internal use. Users should not call it directly; instead, call the aggregator instance (i.e., aggregator(values)).
- Parameters:
values (Collection[MetricToAggregateValueType]) – The collection of values to aggregate. Subclasses may choose to preprocess these values if needed.
- Returns:
The aggregated value resulting from applying the aggregation logic to the inputs.
- Return type:
AggregatedValueType
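The callable-instance pattern can be re-created in a few lines. The base class below is a self-contained stand-in mirroring the contract above, and MaxAggregator is a hypothetical subclass, not one shipped by the library:

```python
from abc import ABC, abstractmethod
from typing import Collection

class Aggregator(ABC):
    """Stand-in for the documented Aggregator base class."""

    def __call__(self, values: Collection[float]) -> float:
        # The instance is the public entry point; aggregate() stays internal.
        return self.aggregate(values)

    @abstractmethod
    def aggregate(self, values: Collection[float]) -> float: ...

class MaxAggregator(Aggregator):
    """Hypothetical subclass: keeps the largest metric value."""

    def aggregate(self, values: Collection[float]) -> float:
        return max(values)

aggregator = MaxAggregator()
print(aggregator([0.2, 0.9, 0.5]))  # 0.9
```

Routing all calls through __call__ gives the base class a single place to add validation or preprocessing later without touching subclasses.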
- class pyagentspec.evaluation.aggregators.HarmonicMeanAggregator#
Bases: Aggregator[bool | float | int, float]

Aggregator that computes the harmonic mean of a collection of non-negative numerical values.
Call an instance of this class with a sequence of non-negative numbers (bool, int, or float) to obtain their harmonic mean.
If any value is zero, the result is zero. Negative values raise a ValueError.
Note
Call the aggregator instance directly (e.g., aggregator(values)) rather than calling the aggregate method externally.
- aggregate(values)#
Compute the harmonic mean of the provided non-negative values.
- Parameters:
values (Collection[bool | float | int]) – Iterable of non-negative numeric values. bool entries are coerced to 0 or 1.
- Returns:
Harmonic mean defined as len(values) / sum(1 / v for v in values).
- Return type:
float
Notes
Users should not invoke aggregate() directly. Call the instance itself instead (e.g., aggregator(values)).
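The documented semantics (the n / sum(1/v) formula, the zero short-circuit, and the rejection of negatives) can be checked with a stand-alone function; this mirrors the behavior described above rather than reproducing the library's implementation:

```python
def harmonic_mean(values):
    # bool entries coerce to 0.0 / 1.0, matching the documented coercion.
    values = [float(v) for v in values]
    if any(v < 0 for v in values):
        raise ValueError("negative values are not allowed")
    if any(v == 0 for v in values):
        # Any zero drives the harmonic mean to zero.
        return 0.0
    return len(values) / sum(1.0 / v for v in values)

print(harmonic_mean([1, 4, 4]))  # 2.0  (3 / (1 + 0.25 + 0.25))
print(harmonic_mean([True, 0]))  # 0.0  (zero short-circuit)
```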
- class pyagentspec.evaluation.aggregators.MeanAggregator#
Bases: Aggregator[bool | float | int, float]

Aggregator that computes the arithmetic mean of a collection of numerical values.
Call an instance of this class with a sequence of numbers (bool, int, or float) to obtain their arithmetic mean.
Note
Call the aggregator instance directly (e.g., aggregator(values)) rather than calling the aggregate method externally.
- aggregate(values)#
Return the arithmetic mean for the provided numeric values.
- Parameters:
values (Collection[bool | float | int]) – Finite collection of numeric values. bool entries are coerced to 0 or 1 to match Python's arithmetic semantics.
- Returns:
Arithmetic mean computed as sum(values) / len(values).
- Return type:
float
Notes
Users should not invoke aggregate() directly. Call the aggregator instance itself, e.g., aggregator(values).
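The bool coercion noted above needs no special handling in Python, since bool is a subclass of int and already participates in arithmetic; a one-line check:

```python
# True counts as 1 and False as 0 in the sum, so the mean is (1 + 0 + 2) / 3.
values = [True, False, 2]
mean = sum(values) / len(values)
print(mean)  # 1.0
```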
Intermediates#
- class pyagentspec.evaluation.intermediates.Intermediate(name, input_mapping=None)#
Bases: ABC, Generic[IntermediateValueType]

Base abstraction for reusable intermediate values shared across metrics.
Intermediates compute auxiliary artefacts (for example embeddings or normalised text) that multiple metrics may depend on. They expose a uniform compute_value coroutine to materialise the result and a __call__ wrapper that handles keyword binding and input name mapping.
- Parameters:
name (str) –
input_mapping (Dict[str, str] | None) –
- abstract async compute_value(*args, **kwargs)#
Compute the intermediate value and return it with metadata details.
- Parameters:
args (Any) –
kwargs (Any) –
- Return type:
Tuple[IntermediateValueType, Dict[str, Any]]
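A minimal sketch of the compute_value contract: the coroutine returns the materialised value together with a details dictionary. NormalizedTextIntermediate is a hypothetical example written without the real base class so it stays self-contained:

```python
import asyncio
from typing import Any, Dict, Tuple

class NormalizedTextIntermediate:
    """Hypothetical intermediate: normalises text that several metrics share."""

    async def compute_value(self, text: str) -> Tuple[str, Dict[str, Any]]:
        # Lowercase and collapse whitespace; report metadata in the details dict.
        normalized = " ".join(text.lower().split())
        return normalized, {"original_length": len(text)}

value, details = asyncio.run(
    NormalizedTextIntermediate().compute_value("  Hello   WORLD ")
)
print(value)    # 'hello world'
print(details)  # {'original_length': 16}
```

Computing such an artefact once and sharing it avoids every text-based metric re-normalising the same sample.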
Metrics#
- class pyagentspec.evaluation.metrics.Metric(name, input_mapping, num_retries, on_failure)#
Bases: ABC, Generic[MetricValueType]

The Metric class serves as the base for implementing both metrics and metric wrappers. To define a custom metric, inherit from this class and implement the compute_metric method.
The Metric class is generically typed. The generic type parameter should correspond to the return type of the metric.
Warning
Do not call the compute_metric method directly; instead, use the instance itself, as it is callable.
- Parameters:
name (str) –
input_mapping (Dict[str, str] | None) –
num_retries (int) –
on_failure (Literal['raise', 'set_none', 'set_zero'] | ~pyagentspec.evaluation.exceptions.handling_strategies.ExceptionHandlingStrategy) –
- abstract async compute_metric(*args, **kwargs)#
Implements the logic for computing the metric.
- Parameters:
The method signature can take one of the following forms:
Accepts the attributes necessary to compute the metric, for example: async def compute_metric(self, reference: str, response: str) -> ...
May be defined as async def compute_metric(self, *args: Any, **kwargs: Any) -> ... if the metric serves as a wrapper for other metric(s).
args (Any) –
kwargs (Any) –
- Raises:
EvaluationException – Raised if the evaluation attempt fails for any reason. The exception must include an informative message and the underlying error. This method must never return None or a placeholder value.
- Returns:
value – The computed value of the metric for the input sample.
value_details – Additional information about the value, such as justification, reasoning, or further details. Keys with a leading double underscore are reserved for system use.
Warning
Any value returned by this method is considered a valid metric measurement. Never return None or a placeholder value; instead, raise an EvaluationException and use the on_failure strategy. If you return a value, retries will not be triggered.
- Return type:
Tuple[MetricValueType, Dict[str, Any]]
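The contract above (return a (value, value_details) tuple on success, raise rather than return a placeholder on failure) can be sketched as follows. ExactMatchMetric and EvaluationError are hypothetical names, written without the real Metric base class to stay self-contained:

```python
import asyncio
from typing import Any, Dict, Tuple

class EvaluationError(Exception):
    """Stand-in for pyagentspec.evaluation.exceptions.EvaluationException."""

class ExactMatchMetric:
    """Hypothetical metric: True iff reference and response match after stripping."""

    async def compute_metric(
        self, reference: str, response: str
    ) -> Tuple[bool, Dict[str, Any]]:
        if not isinstance(reference, str) or not isinstance(response, str):
            # Never return None or a placeholder: raise so that the retry
            # and on_failure machinery can take over.
            raise EvaluationError("reference and response must be strings")
        value = reference.strip() == response.strip()
        return value, {"reference": reference, "response": response}

value, details = asyncio.run(ExactMatchMetric().compute_metric("Paris", " Paris "))
print(value)  # True
```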
- class pyagentspec.evaluation.metrics.LlmBasedMetric(name, input_mapping, num_retries, on_failure, llm_config)#
Bases: Metric[MetricValueType]

Metric base class for scoring via a Language Model invocation.
- Parameters:
name (str) –
input_mapping (Dict[str, str] | None) –
num_retries (int) –
on_failure (Literal['raise', 'set_none', 'set_zero'] | ~pyagentspec.evaluation.exceptions.handling_strategies.ExceptionHandlingStrategy) –
llm_config (LlmConfig) –
- CONTENT = 'content'#
Key for content in an LLM conversation.
- ROLE = 'role'#
Key for role in an LLM conversation.
- SYSTEM = 'system'#
System role in an LLM conversation.
- USER = 'user'#
User role in an LLM conversation.
- async ask_llm(conversation)#
Return the assistant message text and token usage from the LLM provider.
- Parameters:
conversation (List[Dict[str, str]]) –
- Return type:
Tuple[str, Tuple[int, int]]
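The conversation argument is a list of role/content dictionaries; this sketch builds one with the string constants documented above (the prompt texts are invented for the example):

```python
# Keys and roles matching the class constants documented above.
ROLE, CONTENT = "role", "content"
SYSTEM, USER = "system", "user"

conversation = [
    {ROLE: SYSTEM, CONTENT: "You are a strict evaluation judge."},
    {ROLE: USER, CONTENT: "Rate the response between 0 and 1."},
]
print(conversation[0][ROLE])  # 'system'
```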
- class pyagentspec.evaluation.metrics.LlmAsAJudgeMetric(name, input_mapping, num_retries, on_failure, llm_config, system_prompt, user_prompt_template, value_pattern, metadata_patterns=(), output_transformer=None)#
Bases: LlmBasedMetric[MetricValueType]

Base class for metrics that rely on an LLM as the judge.
The system_prompt encodes the rubric, whereas user_prompt_template is rendered per sample. Subclasses configure regex patterns for extracting the final value and optional metadata fields from the LLM response.
- Parameters:
name (str) –
input_mapping (Dict[str, str] | None) –
num_retries (int) –
on_failure (Literal['raise', 'set_none', 'set_zero'] | ~pyagentspec.evaluation.exceptions.handling_strategies.ExceptionHandlingStrategy) –
llm_config (LlmConfig) –
system_prompt (str) –
user_prompt_template (str) –
value_pattern (str) –
metadata_patterns (Collection[Tuple[str, str]]) –
output_transformer (Callable[[Any], MetricValueType] | None) –
- async compute_metric(*args, **kwargs)#
Ask the LLM to judge the sample and parse the response into value/details.
- Parameters:
args (Any) –
kwargs (Any) –
- Return type:
Tuple[MetricValueType, Dict[str, Any]]
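The value_pattern / metadata_patterns extraction step can be illustrated with plain re; the patterns and the judge response below are invented for the example, not defaults shipped by the library:

```python
import re

# A hypothetical judge response and the patterns a subclass might configure.
response = "Reasoning: the answer matches the reference.\nScore: 0.9"
value_pattern = r"Score:\s*([0-9.]+)"
metadata_patterns = [("reasoning", r"Reasoning:\s*(.+)")]

# The first capture group of value_pattern yields the metric value...
value = float(re.search(value_pattern, response).group(1))
# ...and each metadata pattern contributes one entry to the details dict.
details = {
    name: re.search(pattern, response).group(1)
    for name, pattern in metadata_patterns
}
print(value)    # 0.9
print(details)  # {'reasoning': 'the answer matches the reference.'}
```

An output_transformer, when supplied, would then map the parsed value into the metric's final MetricValueType.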