Evaluations

Evaluations are how you keep an AI agent's quality from regressing. Unlike unit tests with deterministic outputs, agent behaviour has to be judged across several dimensions: correctness, tool selection, groundedness, safety, and tone. Evaluations in WSO2 Integrator give you a structured way to measure those dimensions, see why a run regressed, and watch the trend across builds.

The feature is built around three stages.

Stage	What it is	Page
Evalsets	A golden dataset of conversation traces, captured from real chats with your agent and refined in the Evalset Viewer.	Create evalsets
Evaluations	Evaluation functions, configured in a form and assembled in the visual designer, that score agent behaviour against an evalset (or run standalone logic).	Create evaluations
Runs and reports	Run an evaluation on the current agent build and review the Evaluation Report and the Evaluation History trend across runs.	Run evaluations

How the stages fit together

Chat with your agent and export the session traces into an evalset. Use the editor to edit messages, reorder or add turns, and update tool calls.
Create an evaluation from the Test Explorer. Pick the evalset to score against, set the target pass rate, then build the checks in the visual designer (including LLM-as-judge if you need subjective scoring).
Run the evaluation. The report shows pass or fail per case, and the Evaluation History view tracks pass rate across runs so you can correlate regressions with the changes that caused them.

When to use evaluations

Use evaluations when...	Look elsewhere when...
You're about to change instructions, tools, or the model and want a regression check.	You need a single deterministic unit test for a pure function.
You want a baseline of agent quality you can track across runs and commits.	You're debugging one specific failed run. Use Observability.
You need to verify safety and refusal behaviour before shipping.	You're still prototyping and haven't picked the agent's tools yet.

What's next

Create evalsets — Capture real chats and curate them into a golden dataset.
Create evaluations — Configure scoring and build the checks in the visual designer.
Run evaluations — Execute runs and review the report and history.

How the stages fit together​

When to use evaluations​

What's next​

How the stages fit together

When to use evaluations

What's next