Run Evaluations

Once an evaluation is configured, you can run it and review the results in the Evaluation Report. Run history is preserved so you can track quality over time and see exactly which code changes affected each run.

Run an evaluation

In the Test Explorer, hover over the evaluation name and click the run icon next to it.

Test Explorer with the run icon highlighted next to the testToolTrajectory evaluation, with the evaluation flow visible on the canvas.

The evaluation iterates over every case in the selected evalset and records the pass rate against the target threshold.
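Conceptually, a run reduces to scoring each case and comparing the observed pass rate to the configured target. A minimal sketch of that logic; the function and field names here are illustrative, not the tool's API:

```python
def run_evaluation(case_results, target_pass_rate):
    """case_results: list of booleans, one per evalset case (True = passed)."""
    total = len(case_results)
    passed = sum(case_results)
    observed = passed / total if total else 0.0
    return {
        "total": total,
        "passed": passed,
        "observed_pass_rate": observed,
        # The run passes only if the observed rate meets the target.
        "status": "Passed" if observed >= target_pass_rate else "Failed",
    }

# 4 of 5 cases pass against an 80% target.
result = run_evaluation([True, True, True, False, True], 0.80)
print(result["observed_pass_rate"], result["status"])  # 0.8 Passed
```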

Review the report

The Evaluation Report opens automatically after a run.

Evaluation Report with summary counters, an evaluation card showing 100% observed pass rate against an 80% target, the per-case run list, and the Test Results panel below.

| Section | What it shows |
| --- | --- |
| Top counters | Total tests, passed, and failed across every evaluation in the project. |
| Evaluation card | Per-evaluation stats: total runs, target pass rate, observed pass rate, and a Passed or Failed badge. |
| Evaluation Runs | The most recent run, with each evalset case listed alongside its pass or fail status. |
| Test Results panel | The terminal-style log of the run, including paths to the JSON results file and the generated HTML report. |
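Because the Test Results panel prints a path to a JSON results file, you can post-process a run programmatically, for example in CI. A minimal sketch of loading such a file and listing failing cases; the schema shown is an assumption for illustration, not the tool's documented format:

```python
import json

# Hypothetical results payload; in practice you would read the file
# whose path appears in the Test Results panel.
results_json = """
{
  "evaluation": "testToolTrajectory",
  "cases": [
    {"name": "case-1", "status": "passed"},
    {"name": "case-2", "status": "failed"},
    {"name": "case-3", "status": "passed"}
  ]
}
"""

data = json.loads(results_json)
failed = [c["name"] for c in data["cases"] if c["status"] == "failed"]
print(f"{data['evaluation']}: {len(failed)} failing case(s): {failed}")
```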

Track runs over time

To compare runs across days or commits, open Evaluation History. There are two entry points: the Evaluation History button at the top right of the report, and the history icon next to evaluations in the Test Explorer.

Evaluation Report with the Evaluation History button highlighted at the top right and the history icon highlighted next to evaluations in the Test Explorer.

Either entry point opens the Evaluation History view.

Evaluation History view with summary counters, a pass-rate trend chart for testToolTrajectory, and a Run History table with status, code changes, and report links.

The trend chart shows the pass rate over time. The Run History table lists every recorded run.

| Column | What it shows |
| --- | --- |
| Date | When the run was triggered. |
| Pass Rate | Observed pass rate against the target. |
| Status | Whether the run met the target (Passed or Failed). |
| Code Changes | Whether the project was committed or had local edits at the time of the run. View changes opens the diff against the current state. |
| Outcomes | Number of cases evaluated. |
| Report | Opens the full report for that run. |
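The trend chart is built from exactly this kind of run history: one (date, pass rate) point per recorded run. A small sketch of scanning that history for regressions between consecutive runs; the record fields are illustrative, not the tool's export format:

```python
# Hypothetical run-history records, ordered oldest to newest.
run_history = [
    {"date": "2024-05-01", "pass_rate": 0.60},
    {"date": "2024-05-03", "pass_rate": 0.80},
    {"date": "2024-05-07", "pass_rate": 0.70},
]

# Compare each run with the one before it and flag drops in pass rate.
regressions = []
for prev, curr in zip(run_history, run_history[1:]):
    delta = curr["pass_rate"] - prev["pass_rate"]
    if delta < 0:
        regressions.append((curr["date"], delta))
        print(f"Regression on {curr['date']}: {delta:+.0%}")
```

A drop flagged this way is a natural candidate for View changes, which shows the diff associated with that run.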

Inspect code changes for any run

Click View changes on any row to see what changed between that run and the current project state.

Code changes since this run dialog showing a single file with a diff that adds tools to the math tutor model.

This makes it easy to correlate a regression with a specific change. Restore to this state rolls the project back to its state at that run, replacing the current project files. The IDE prompts for confirmation before restoring; commit or stash any work you want to keep first.

What's next