import { definitions } from '/snippets/definitions.mdx'

Evaluation is the systematic process of measuring how well your AI capability performs.

## Why systematic evaluation matters

AI systems fail in non-deterministic ways. The same prompt can produce different results. Edge cases emerge unpredictably. As capabilities grow from simple single-turn interactions to complex multi-agent systems, manual testing becomes impossible to scale.

Systematic evaluation solves this by:

  • Establishing baselines: Measure current performance before making changes
  • Preventing regressions: Catch quality degradation before it reaches production
  • Enabling experimentation: Compare different models, prompts, or architectures
  • Building confidence: Deploy changes knowing they improve aggregate performance

## Evaluation approaches

Axiom supports two complementary approaches:

  • Offline evaluations test your capability against a curated collection of inputs with expected outputs (ground truth). Run them before deploying to catch regressions.
  • Online evaluations score live production traffic with reference-free scorers. Run them after deploying to monitor quality continuously.

Both approaches use the same Scorer API. The scorers you write for one context work in the other.
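As a rough sketch of what "the same scorer works in both contexts" means (the type and field names below are illustrative assumptions, not Axiom's exact Scorer API), a scorer is a function that receives the input, the output, and optionally an expected value, and returns a score:

```typescript
// Illustrative scorer shapes — field names are assumptions, not Axiom's exact API.
type ScorerArgs = {
  input: string;
  output: string;
  expected?: string; // present in offline evals, absent in online evals
};

type Score = { name: string; score: number }; // normalized to [0, 1]

// Ground-truth scorer: only meaningful offline, where `expected` exists.
function exactMatch({ output, expected }: ScorerArgs): Score {
  return {
    name: 'exact-match',
    score: output.trim() === expected?.trim() ? 1 : 0,
  };
}

// Reference-free scorer: needs no expected value, so it works offline and online.
function isValidJson({ output }: ScorerArgs): Score {
  try {
    JSON.parse(output);
    return { name: 'valid-json', score: 1 };
  } catch {
    return { name: 'valid-json', score: 0 };
  }
}
```

A reference-free scorer like `isValidJson` is the kind you can reuse unchanged on live traffic, since it never touches `expected`.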

## Which evaluation approach to use

Use offline evaluations when you need to test against known-good answers before shipping. Use online evaluations when you want to continuously monitor production quality. You can use both approaches together to get the best of both worlds.

|  | Offline evaluations | Online evaluations |
| --- | --- | --- |
| When | Development, before deploy | Production, on live traffic |
| Expected values | Requires expected output per case | No ground truth needed |
| Scorers | Can compare output to expected | Reference-free |
| Execution | CLI runner with vitest | Fire-and-forget inside your app |
| Sampling | Runs every case | Per-scorer sampling rate |
| Telemetry | OTel spans in eval dataset | OTel spans linked to production traces |

## Offline evaluation workflow

Offline evaluations test your capability against a curated dataset before you deploy. Axiom's evaluation framework follows a simple pattern:

1. Build a set of test cases with inputs and expected outputs (ground truth). Start small with 10-20 examples and grow over time.
2. Write functions that compare your capability's output against the expected result. Use custom logic or prebuilt scorers from libraries like `autoevals`.
3. Execute your capability against the collection and score the results. Track metrics like accuracy, pass rate, and cost.
4. Review results in the Axiom Console. Compare against baselines. Identify failures. Make improvements and re-evaluate.
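The core loop an eval runner performs can be sketched in plain TypeScript. Everything here is a hypothetical stand-in (`runCapability`, `TestCase`, the metric names), not Axiom's runner API:

```typescript
// Hypothetical sketch of an offline eval loop — not Axiom's actual runner.
type TestCase = { input: string; expected: string };

// Placeholder for your real capability (e.g. an LLM call).
async function runCapability(input: string): Promise<string> {
  return input.toUpperCase(); // stand-in logic for the sketch
}

// Run every case, compare output to ground truth, and report a pass rate.
async function runEval(cases: TestCase[]) {
  let passed = 0;
  for (const c of cases) {
    const output = await runCapability(c.input);
    if (output === c.expected) passed++;
  }
  return { total: cases.length, passed, passRate: passed / cases.length };
}
```

In practice the runner also records per-case scores and cost as OTel spans, so a baseline run and a candidate run can be compared case by case rather than only in aggregate.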

## Online evaluation workflow

Online evaluations score live production traffic continuously after you deploy. They use the same Scorer API as offline evaluations, but without expected values.

1. Create scorers that assess output quality using only the input and output, with no ground truth required. Use heuristic checks for format and structure, or LLM-as-judge patterns for semantic quality.
2. Call `onlineEval` inside your capability code to run scorers as fire-and-forget operations that don't affect your response latency.
3. Set per-scorer sampling rates to balance coverage and cost. Run cheap heuristic scorers on every request and expensive LLM judges on a fraction of traffic.
4. Review online evaluation scores in the Axiom Console alongside your production traces. Use the insights to add targeted offline test cases and refine your capability.
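Per-scorer sampling can be as simple as a random draw against a configured rate. The shape below is a sketch of that idea only; it is not Axiom's `onlineEval` signature:

```typescript
// Hypothetical per-scorer sampling — Axiom's actual onlineEval API may differ.
type OnlineScorer = {
  name: string;
  sampleRate: number; // 1 = score every request, 0.05 = score 5% of traffic
  score: (input: string, output: string) => number;
};

// Decide independently per scorer whether this request gets scored.
function scoreSampled(scorers: OnlineScorer[], input: string, output: string) {
  const results: Record<string, number> = {};
  for (const s of scorers) {
    if (Math.random() < s.sampleRate) {
      // In production this would run fire-and-forget (not awaited on the
      // request path), so scoring never adds to response latency.
      results[s.name] = s.score(input, output);
    }
  }
  return results;
}
```

This is why cheap heuristic scorers can run at a rate of 1 while an LLM judge runs at 0.05: each scorer's cost is throttled independently, and the scores attach to the same production trace either way.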

## What's next?

Shared:

  • To set up your environment and authenticate, see Quickstart.
  • To learn how to write scoring functions that work in both offline and online evaluations, see Scorers.

Offline evaluations:

Online evaluations:
