import { definitions } from '/snippets/definitions.mdx'
Evaluation is the systematic process of measuring how well your AI capability performs.
## Why systematic evaluation matters
AI systems fail in non-deterministic ways. The same prompt can produce different results. Edge cases emerge unpredictably. As capabilities grow from simple single-turn interactions to complex multi-agent systems, manual testing can't keep up.
Systematic evaluation solves this by:
- Establishing baselines: Measure current performance before making changes
- Preventing regressions: Catch quality degradation before it reaches production
- Enabling experimentation: Compare different models, prompts, or architectures
- Building confidence: Deploy changes knowing they improve aggregate performance
## Evaluation approaches
Axiom supports two complementary approaches:
- Offline evaluations test your capability against a curated collection of inputs with expected outputs (ground truth). Run them before deploying to catch regressions.
- Online evaluations score live production traffic with reference-free scorers. Run them after deploying to monitor quality continuously.
Both approaches use the same Scorer API. The scorers you write for one context work in the other.
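To make the shared Scorer API concrete, the sketch below shows the conceptual shape of a scorer: a function that receives the capability's output (and, in offline runs, an expected value) and returns a bounded score. The type and function names here are illustrative assumptions, not Axiom's exact API; see the Scorers page for the real signatures.

```typescript
// Illustrative sketch of a scorer's shape (assumed names, not Axiom's exact API).
type ScorerArgs = { output: string; expected?: string };
type Score = { name: string; score: number }; // score in [0, 1]

// Reference-based scorer: usable offline, where expected values exist.
const exactMatch = ({ output, expected }: ScorerArgs): Score => ({
  name: 'exact-match',
  score: output.trim() === expected?.trim() ? 1 : 0,
});

// Reference-free scorer: needs no ground truth, so it works both
// offline and against live production traffic.
const isNonEmpty = ({ output }: ScorerArgs): Score => ({
  name: 'non-empty',
  score: output.trim().length > 0 ? 1 : 0,
});
```

Because `isNonEmpty` never touches `expected`, the same function can run in an offline suite and against production traffic unchanged.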
## Which evaluation approach to use
Use offline evaluations when you need to test against known-good answers before shipping. Use online evaluations when you want to continuously monitor production quality. You can combine both: catch regressions before deploy, then keep monitoring quality on live traffic.
| | Offline evaluations | Online evaluations |
|---|---|---|
| When | Development, before deploy | Production, on live traffic |
| Expected values | Requires expected output per case | No ground truth needed |
| Scorers | Can compare output to expected | Reference-free |
| Execution | CLI runner with vitest | Fire-and-forget inside your app |
| Sampling | Runs every case | Per-scorer sampling rate |
| Telemetry | OTel spans in eval dataset | OTel spans linked to production traces |
## Offline evaluation workflow
Offline evaluations test your capability against a curated dataset before you deploy. Axiom's evaluation framework follows a simple pattern: run your capability over every case in the dataset, then score each output against its expected value.
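That pattern can be sketched in plain TypeScript. This is an illustrative stand-in, not the Axiom CLI runner's actual API: `task`, `exactMatch`, and the dataset are hypothetical placeholders for your capability, a scorer, and your curated cases.

```typescript
// Illustrative sketch of the offline evaluation pattern (not Axiom's API):
// curated cases with expected outputs -> run the capability -> score each case.
type Case = { input: string; expected: string };

// Stand-in for the capability under test (in practice, an LLM call).
const task = (input: string): string => input.toUpperCase();

// Reference-based scorer: compares output against ground truth.
const exactMatch = (output: string, expected: string): number =>
  output === expected ? 1 : 0;

const dataset: Case[] = [
  { input: 'hello', expected: 'HELLO' },
  { input: 'axiom', expected: 'AXIOM' },
];

// Offline evals run every case (no sampling) and aggregate the scores.
const scores = dataset.map(({ input, expected }) =>
  exactMatch(task(input), expected),
);
const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
// A drop in `mean` relative to your previous baseline signals a regression.
```

The aggregate score is what you compare across runs: a baseline before a change, a new run after it.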
## Online evaluation workflow
Online evaluations score live production traffic continuously after you deploy. They use the same Scorer API as offline evaluations, but without expected values.
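The online flow can be sketched as: sample a fraction of live outputs, score them with a reference-free scorer, and record the result without blocking the request path. The sketch below is illustrative plain TypeScript under assumed names (`SAMPLE_RATE`, `scoreInBackground`), not Axiom's actual API; in practice the score is emitted as an OTel span linked to the production trace.

```typescript
// Illustrative sketch of online evaluation (assumed names, not Axiom's API).

// Reference-free scorer: no expected value required.
const isNonEmpty = (output: string): number =>
  output.trim().length > 0 ? 1 : 0;

const SAMPLE_RATE = 0.1; // per-scorer sampling: score ~10% of live traffic

function scoreInBackground(output: string): void {
  if (Math.random() >= SAMPLE_RATE) return; // skip unsampled requests
  // Fire-and-forget: defer scoring so the request path isn't blocked.
  // In practice the score would be attached to the production trace via OTel.
  queueMicrotask(() => {
    const score = isNonEmpty(output);
    console.log('non-empty score:', score);
  });
}
```

Because scoring is sampled and deferred, it adds no latency to the user-facing request.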
## What's next?
Shared:
- To set up your environment and authenticate, see Quickstart.
- To learn how to write scoring functions that work in both offline and online evaluations, see Scorers.
Offline evaluations:
- To learn how to write evaluation functions, see Write offline evaluations.
- To understand flags and experiments, see Flags and experiments.
- To view results in the Console, see Analyze results.
Online evaluations:
- To learn how to write and run online evaluation functions, see Write and run online evaluations.
- To view results in the Console, see Analyze online evaluation results.