
Evaluation & Testing

Build test suites to validate agent behavior, catch regressions, and enable safe continuous improvement.

Evals help you test and benchmark your AI analyst's reliability.

Why Evals Matter

A successful AI analyst is a reliable AI analyst. Evals establish benchmarks for agent reliability while enabling safe iteration on instructions, context, and LLM configurations.

Two Evaluation Approaches

Deterministic Tests (Create Data Rules)

Machine-checkable assertions that verify:

  • Specific tables or columns are used
  • Row counts meet expected thresholds
  • Generated code is valid SQL
  • Output matches expected patterns

Judge Tests (LLM Judge)

Natural-language evaluation using a lightweight LLM that assesses:

  • Presentation quality
  • User experience
  • Reasoning quality
  • Adherence to custom rubrics

Creating Tests

When adding a test, specify:

  • User prompt — The question to test (e.g., "revenue by film chart")
  • Data sources — Which connections to use
  • LLM — Which model to evaluate
  • File attachments — Optional supporting files
  • Expectations — Pass/fail criteria

Expectation Types

  • Create Data — Conditions checking tables used, columns, row counts, or code validity
  • Clarify — Requires the agent to ask clarifying questions for ambiguous prompts
  • Judge — Evaluator model applies plain-English rubrics to assess quality

Test Suites

Group tests into suites (e.g., "Finance", "Marketing") for organized batch execution. Results display logs, expectation outcomes, and generated artifacts including code and visualizations.

Best Practices

  1. Start with deterministic checks for foundational data integrity
  2. Layer judge rubrics for workflow and reasoning quality
  3. Use realistic prompts that match actual user behavior
  4. Run suites regularly after instruction or context changes
