import { definitions } from '/snippets/definitions.mdx'

Evaluation is the systematic process of measuring how well your AI capability performs.

## Why systematic evaluation matters

AI systems fail in non-deterministic ways. The same prompt can produce different results. Edge cases emerge unpredictably. As capabilities grow from simple single-turn interactions to complex multi-agent systems, manual testing becomes impossible to scale.

Systematic evaluation solves this by:

  • Establishing baselines: Measure current performance before making changes
  • Preventing regressions: Catch quality degradation before it reaches production
  • Enabling experimentation: Compare different models, prompts, or architectures
  • Building confidence: Deploy changes knowing they improve aggregate performance

## Evaluation approaches

Axiom supports two complementary approaches:

  • Offline evaluations test your capability against a curated collection of inputs with expected outputs (ground truth). Run them before deploying to catch regressions.
  • Online evaluations score live production traffic with reference-free scorers. Run them after deploying to monitor quality continuously.

Both approaches use the same Scorer API. The scorers you write for one context work in the other.
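As a rough sketch of what "the same scorer works in both contexts" means (the type and field names below are illustrative assumptions, not Axiom's exact Scorer API), a scorer is a function that receives the input, the output, and optionally an expected value, and returns a score:

```typescript
// Illustrative scorer shapes — field names are assumptions, not Axiom's exact API.
type ScorerArgs = {
  input: string;
  output: string;
  expected?: string; // present in offline evals, absent in online evals
};

type Score = { name: string; score: number }; // normalized to [0, 1]

// Ground-truth scorer: only meaningful offline, where `expected` exists.
function exactMatch({ output, expected }: ScorerArgs): Score {
  return {
    name: 'exact-match',
    score: output.trim() === expected?.trim() ? 1 : 0,
  };
}

// Reference-free scorer: needs no expected value, so it works offline and online.
function isValidJson({ output }: ScorerArgs): Score {
  try {
    JSON.parse(output);
    return { name: 'valid-json', score: 1 };
  } catch {
    return { name: 'valid-json', score: 0 };
  }
}
```

A reference-free scorer like `isValidJson` is the kind you can reuse unchanged on live traffic, since it never touches `expected`.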

## Which evaluation approach to use

Use offline evaluations when you need to test against known-good answers before shipping. Use online evaluations when you want to continuously monitor production quality. You can use both approaches together to get the best of both worlds.

|  | Offline evaluations | Online evaluations |
| --- | --- | --- |
| When | Development, before deploy | Production, on live traffic |
| Expected values | Requires expected output per case | No ground truth needed |
| Scorers | Can compare output to expected | Reference-free |
| Execution | CLI runner with vitest | Fire-and-forget inside your app |
| Sampling | Runs every case | Per-scorer sampling rate |
| Telemetry | OTel spans in eval dataset | OTel spans linked to production traces |

## Offline evaluation workflow

Offline evaluations test your capability against a curated dataset before you deploy. Axiom's evaluation framework follows a simple pattern:

1. Build a set of test cases with inputs and expected outputs (ground truth). Start small with 10-20 examples and grow over time.
2. Write functions that compare your capability's output against the expected result. Use custom logic or prebuilt scorers from libraries like `autoevals`.
3. Execute your capability against the collection and score the results. Track metrics like accuracy, pass rate, and cost.
4. Review results in the Axiom Console. Compare against baselines. Identify failures. Make improvements and re-evaluate.
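The core loop an eval runner performs can be sketched in plain TypeScript. Everything here is a hypothetical stand-in (`runCapability`, `TestCase`, the metric names), not Axiom's runner API:

```typescript
// Hypothetical sketch of an offline eval loop — not Axiom's actual runner.
type TestCase = { input: string; expected: string };

// Placeholder for your real capability (e.g. an LLM call).
async function runCapability(input: string): Promise<string> {
  return input.toUpperCase(); // stand-in logic for the sketch
}

// Run every case, compare output to ground truth, and report a pass rate.
async function runEval(cases: TestCase[]) {
  let passed = 0;
  for (const c of cases) {
    const output = await runCapability(c.input);
    if (output === c.expected) passed++;
  }
  return { total: cases.length, passed, passRate: passed / cases.length };
}
```

In practice the runner also records per-case scores and cost as OTel spans, so a baseline run and a candidate run can be compared case by case rather than only in aggregate.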

## Online evaluation workflow

Online evaluations score live production traffic continuously after you deploy. They use the same Scorer API as offline evaluations, but without expected values.

1. Create scorers that assess output quality using only the input and output, with no ground truth required. Use heuristic checks for format and structure, or LLM-as-judge patterns for semantic quality.
2. Call `onlineEval` inside your capability code to run scorers as fire-and-forget operations that don't affect your response latency.
3. Set per-scorer sampling rates to balance coverage and cost. Run cheap heuristic scorers on every request and expensive LLM judges on a fraction of traffic.
4. Review online evaluation scores in the Axiom Console alongside your production traces. Use the insights to add targeted offline test cases and refine your capability.
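Per-scorer sampling can be as simple as a random draw against a configured rate. The shape below is a sketch of that idea only; it is not Axiom's `onlineEval` signature:

```typescript
// Hypothetical per-scorer sampling — Axiom's actual onlineEval API may differ.
type OnlineScorer = {
  name: string;
  sampleRate: number; // 1 = score every request, 0.05 = score 5% of traffic
  score: (input: string, output: string) => number;
};

// Decide independently per scorer whether this request gets scored.
function scoreSampled(scorers: OnlineScorer[], input: string, output: string) {
  const results: Record<string, number> = {};
  for (const s of scorers) {
    if (Math.random() < s.sampleRate) {
      // In production this would run fire-and-forget (not awaited on the
      // request path), so scoring never adds to response latency.
      results[s.name] = s.score(input, output);
    }
  }
  return results;
}
```

This is why cheap heuristic scorers can run at a rate of 1 while an LLM judge runs at 0.05: each scorer's cost is throttled independently, and the scores attach to the same production trace either way.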

## What's next?

Shared:

  • To set up your environment and authenticate, see Quickstart.
  • To learn how to write scoring functions that work in both offline and online evaluations, see Scorers.

Offline evaluations:

Online evaluations:
