
Evaluation & Testing

Build test suites to validate agent behavior, catch regressions, and enable safe continuous improvement.

Evals help you test and benchmark your AI analyst's reliability.

Why Evals Matter

A successful AI analyst is a reliable AI analyst. Evals establish benchmarks for agent reliability while enabling safe iteration on instructions, context, and LLM configurations.

Two Evaluation Approaches

Deterministic Tests (Create Data Rules)

Machine-checkable assertions that verify:

  • Specific tables or columns are used
  • Row counts meet expected thresholds
  • Generated code is valid SQL
  • Output matches expected patterns

Judge Tests (LLM Judge)

Natural-language evaluation using a lightweight LLM that assesses:

  • Presentation quality
  • User experience
  • Reasoning quality
  • Adherence to custom rubrics

Creating Tests

When adding a test, specify:

  • User prompt — The question to test (e.g., "revenue by film chart")
  • Data sources — Which connections to use
  • LLM — Which model to evaluate
  • File attachments — Optional supporting files
  • Expectations — Pass/fail criteria

Expectation Types

  • Create Data — Conditions checking tables used, columns, row counts, or code validity
  • Clarify — Requires the agent to ask clarifying questions for ambiguous prompts
  • Judge — Evaluator model applies plain-English rubrics to assess quality

Test Suites

Group tests into suites (e.g., "Finance", "Marketing") for organized batch execution. Results display logs, expectation outcomes, and generated artifacts including code and visualizations.

Best Practices

  1. Start with deterministic checks for foundational data integrity
  2. Layer judge rubrics for workflow and reasoning quality
  3. Use realistic prompts that match actual user behavior
  4. Run suites regularly after instruction or context changes
