The Measure stage is where you quantify the quality and effectiveness of your AI
Evaluations (evals) are systematic tests that measure how well your AI features perform. Instead of manually testing AI outputs, evals automatically run your AI code against test datasets and score the results using custom metrics. This lets you catch regressions, compare different approaches, and confidently improve your AI features over time.
Prerequisites
Follow the Quickstart:
- To run evals within the context of an existing AI app, follow the instrumentation setup in the Quickstart.
- To run evals without an existing AI app, skip the part in the Quickstart about instrumenting your app.
Write an evaluation function
The Eval function provides a simple, declarative way to define a test suite for your capability directly in your codebase.
The key parameters of the Eval function:
- data: An async function that returns your collection of { input, expected } pairs, which serve as your ground truth.
- task: The function that executes your AI capability, taking an input and producing an output.
- scorers: An array of scorer functions that score the output against the expected value.
- metadata: Optional metadata for the evaluation, such as a description.
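To see how these parameters relate to each other, the following is a conceptual sketch in plain TypeScript of what an eval run does: load the data, run the task on each input, and apply every scorer to the output. It is not the Axiom AI SDK's implementation. The names TestCase, ScorerFn, and runEval are hypothetical, and the stub task stands in for a real model call.

```typescript
// Conceptual sketch only: NOT the Axiom AI SDK's implementation.
// TestCase, ScorerFn, and runEval are hypothetical names for illustration.
type TestCase<I, E> = { input: I; expected: E };
type ScorerFn<O, E> = (args: { output: O; expected: E }) => number;
type CaseResult<I, O> = { input: I; output: O; scores: Record<string, number> };

async function runEval<I, O, E>(config: {
  data: () => Promise<TestCase<I, E>[]>;
  task: (args: { input: I }) => Promise<O>;
  scorers: Record<string, ScorerFn<O, E>>;
}): Promise<CaseResult<I, O>[]> {
  const cases = await config.data();
  const results: CaseResult<I, O>[] = [];
  for (const { input, expected } of cases) {
    // Run the capability once per test case.
    const output = await config.task({ input });
    // Apply every scorer to the same output.
    const scores: Record<string, number> = {};
    for (const name of Object.keys(config.scorers)) {
      scores[name] = config.scorers[name]({ output, expected });
    }
    results.push({ input, output, scores });
  }
  return results;
}

// Stub capability: treats any text containing "free" as spam.
const stubResults = runEval<string, string, string>({
  data: async () => [
    { input: 'FREE CA$H', expected: 'spam' },
    { input: 'How do I reset my password?', expected: 'question' },
  ],
  task: async ({ input }) =>
    input.toLowerCase().includes('free') ? 'spam' : 'question',
  scorers: {
    exact: ({ output, expected }) => (output === expected ? 1 : 0),
  },
});
```

Each result pairs one input with its output and a score per scorer, which is the shape of information that the evaluation below produces for real test cases.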
The example below creates an evaluation for a support ticket classification system in the file /src/evals/ticket-classification.eval.ts.
import { Eval, Scorer } from 'axiom/ai/evals';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { wrapAISDKModel } from 'axiom/ai';
import { flag, pickFlags } from '../lib/app-scope';
import { z } from 'zod';
// The function you want to evaluate
async function classifyTicket({ subject, content }: { subject?: string; content: string }) {
const model = flag('ticketClassification.model');
const result = await generateObject({
model: wrapAISDKModel(openai(model)),
messages: [
{
role: 'system',
content: `You are a customer support engineer classifying tickets as: spam, question, feature_request, or bug_report.
If spam, return a polite auto-close message. Otherwise, say a team member will respond shortly.`,
},
{
role: 'user',
content: subject ? `Subject: ${subject}\n\n${content}` : content,
},
],
schema: z.object({
category: z.enum(['spam', 'question', 'feature_request', 'bug_report']),
response: z.string()
}),
});
return result.object;
}
// Custom exact-match scorer that returns score and metadata
const ExactMatchScorer = Scorer(
'Exact-Match',
({ output, expected }: { output: { response: string }; expected: { response: string } }) => {
const normalizedOutput = output.response.trim().toLowerCase();
const normalizedExpected = expected.response.trim().toLowerCase();
return {
score: normalizedOutput === normalizedExpected,
metadata: {
details: 'A scorer that checks for exact match',
},
};
}
);
// Custom spam classification scorer
const SpamClassificationScorer = Scorer(
"Spam-Classification",
({ output, expected }: {
output: { category: string };
expected: { category: string };
}) => {
const isSpam = (item: { category: string }) => item.category === "spam";
return isSpam(output) === isSpam(expected) ? 1 : 0;
}
);
// Define the evaluation
Eval('spam-classification', {
// Specify which flags this eval uses
configFlags: pickFlags('ticketClassification'),
// Test data with input/expected pairs
data: [
{
input: {
subject: "Congratulations! You've Been Selected for an Exclusive Reward",
content: 'Claim your $500 gift card now by clicking this link!',
},
expected: {
category: 'spam',
response: "We're sorry, but your message has been automatically closed.",
},
},
{
input: {
subject: 'FREE CA$H',
content: 'BUY NOW ON WWW.BEST-DEALS.COM!',
},
expected: {
category: 'spam',
response: "We're sorry, but your message has been automatically closed.",
},
},
],
// The task to run for each test case
task: async ({ input }) => {
return await classifyTicket(input);
},
// Scorers to measure performance
scorers: [SpamClassificationScorer, ExactMatchScorer],
// Optional metadata
metadata: {
description: 'Classify support tickets as spam or not spam',
},
});
Set up flags
Create the file src/lib/app-scope.ts:
import { createAppScope } from 'axiom/ai';
import { z } from 'zod';
export const flagSchema = z.object({
ticketClassification: z.object({
model: z.string().default('gpt-4o-mini'),
}),
});
const { flag, pickFlags } = createAppScope({ flagSchema });
export { flag, pickFlags };
Run evaluations
To run your evaluation suites from your terminal, install the Axiom CLI and use the following commands.
| Description | Command |
|---|---|
| Run all evals | axiom eval |
| Run specific eval file | axiom eval src/evals/ticket-classification.eval.ts |
| Run evals matching a glob pattern | axiom eval "**/*spam*.eval.ts" |
| Run eval by name | axiom eval "spam-classification" |
| List available evals without running | axiom eval --list |
Analyze results in Console
When you run an eval, Axiom AI SDK captures a detailed OpenTelemetry trace for the entire run. This includes parent spans for the evaluation suite and child spans for each individual test case, task execution, and scorer result. Axiom enriches the traces with eval.* attributes, allowing you to deeply analyze results in the Axiom Console.
The results of an eval run include:
- Pass/fail status for each test case
- Scores from each scorer
- Comparison to baseline (if available)
- Links to view detailed traces in Axiom
The Console features leaderboards and comparison views to track score progression across different versions of a capability, helping you verify that your changes are leading to measurable improvements.
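As a rough illustration of what a baseline comparison measures, the sketch below averages each scorer's results across test cases and reports the difference between a candidate run and a baseline run. This is an assumption about the kind of calculation involved, not a description of how the Console computes its views. The names CaseScores, meanScores, and compareToBaseline are hypothetical.

```typescript
// Hypothetical illustration: not how the Axiom Console computes comparisons.
// Each CaseScores object maps a scorer name to that test case's score.
type CaseScores = Record<string, number>;

// Average each scorer's score across all test cases in a run.
function meanScores(cases: CaseScores[]): Record<string, number> {
  const sums: Record<string, number> = {};
  const counts: Record<string, number> = {};
  for (const scores of cases) {
    for (const name of Object.keys(scores)) {
      sums[name] = (sums[name] ?? 0) + scores[name];
      counts[name] = (counts[name] ?? 0) + 1;
    }
  }
  const means: Record<string, number> = {};
  for (const name of Object.keys(sums)) {
    means[name] = sums[name] / counts[name];
  }
  return means;
}

// Positive delta: the candidate scored higher than the baseline on average.
// Scorers missing from the baseline are skipped.
function compareToBaseline(
  baseline: CaseScores[],
  candidate: CaseScores[],
): Record<string, number> {
  const base = meanScores(baseline);
  const cand = meanScores(candidate);
  const deltas: Record<string, number> = {};
  for (const name of Object.keys(cand)) {
    if (name in base) deltas[name] = cand[name] - base[name];
  }
  return deltas;
}

const deltas = compareToBaseline(
  [{ 'Spam-Classification': 1 }, { 'Spam-Classification': 0 }],
  [{ 'Spam-Classification': 1 }, { 'Spam-Classification': 1 }],
);
```

A negative delta for any scorer is a signal to inspect the individual traces for the test cases that changed.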
Additional configuration options
Custom scorers
A scorer is a function that scores a capability’s output. Scorers receive the input, the generated output, and the expected value, and return a score.
The example above uses two custom scorers. Scorers can return metadata alongside the score.
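For illustration, here is a minimal sketch of scorer logic that returns metadata explaining the score. It is written as a plain function so it can run on its own. With the SDK, you would wrap equivalent logic in Scorer(...) as in the example above. The Ticket and ScoreResult types and the categoryMatch name are hypothetical.

```typescript
// Hypothetical types and names for illustration only.
type Ticket = { category: string; response: string };
type ScoreResult = { score: number; metadata: { reason: string } };

// Scores 1 when the predicted category matches the expected one, else 0,
// and records the reason so a failing case is easier to diagnose.
function categoryMatch({
  output,
  expected,
}: {
  output: Ticket;
  expected: Ticket;
}): ScoreResult {
  const matched = output.category === expected.category;
  return {
    score: matched ? 1 : 0,
    metadata: {
      reason: matched
        ? 'categories match'
        : `expected ${expected.category}, got ${output.category}`,
    },
  };
}

const mismatch = categoryMatch({
  output: { category: 'question', response: 'A team member will respond shortly.' },
  expected: { category: 'spam', response: 'Closed automatically.' },
});
```

Metadata like this reason string does not affect the score. It exists to make a low score easier to interpret when you review results.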
You can use the autoevals library instead of custom scorers. autoevals provides prebuilt scorers for common tasks like semantic similarity, factual correctness, and text matching.
Run experiments
Flags let you parameterize your AI behavior (like model choice or prompting strategies) and run experiments with different configurations. They’re type-safe via Zod schemas, and you can override them at runtime.
The example above uses the ticketClassification flag to test different language models. Flags have a default value that you can override at runtime in one of the following ways:
- Override flags directly when you run the eval:
  axiom eval --flag.ticketClassification.model=gpt-4o
- Alternatively, specify the flag overrides in a JSON file:
  { "ticketClassification": { "model": "gpt-4o" } }
  Then specify the JSON file as the value of the flags-config parameter when you run the eval:
  axiom eval --flags-config=experiment.json
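To make the relationship between defaults and overrides concrete, the sketch below turns a dotted override such as --flag.ticketClassification.model=gpt-4o into a nested object and merges it over the default values. This only illustrates the idea. It is not how the Axiom CLI parses or applies flags, and the names FlagTree, parseFlagArg, and mergeFlags are hypothetical.

```typescript
// Hypothetical illustration: not the Axiom CLI's actual parsing logic.
type FlagTree = { [key: string]: string | FlagTree };

// Turn "--flag.a.b=value" into { a: { b: "value" } }.
function parseFlagArg(arg: string): FlagTree {
  const prefix = '--flag.';
  if (!arg.startsWith(prefix)) throw new Error(`Not a flag override: ${arg}`);
  const eq = arg.indexOf('=');
  if (eq === -1) throw new Error(`Missing value in: ${arg}`);
  const path = arg.slice(prefix.length, eq).split('.');
  const value = arg.slice(eq + 1);
  // Build the nested object from the innermost key outwards.
  return path.reduceRight<string | FlagTree>(
    (acc, key) => ({ [key]: acc }),
    value,
  ) as FlagTree;
}

// Recursively merge overrides over defaults without mutating either input.
function mergeFlags(defaults: FlagTree, overrides: FlagTree): FlagTree {
  const merged: FlagTree = { ...defaults };
  for (const key of Object.keys(overrides)) {
    const base = merged[key];
    const next = overrides[key];
    merged[key] =
      typeof base === 'object' && typeof next === 'object'
        ? mergeFlags(base, next)
        : next;
  }
  return merged;
}

const flagDefaults: FlagTree = { ticketClassification: { model: 'gpt-4o-mini' } };
const effectiveFlags = mergeFlags(
  flagDefaults,
  parseFlagArg('--flag.ticketClassification.model=gpt-4o'),
);
```

The JSON file shown above already has the nested shape, so under the same assumptions it could be passed to mergeFlags directly after parsing.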
What’s next?
A capability is ready to be deployed when it meets your quality benchmarks. After deployment, the next steps can be the following:
- Baseline comparisons: Run evals multiple times to track regression over time.
- Experiment with flags: Test different models or strategies using flag overrides.
- Advanced scorers: Build custom scorers for domain-specific metrics.
- CI/CD integration: Add axiom eval to your CI pipeline to catch regressions.
The next step is to monitor your capability’s performance with real-world traffic. To learn more about this step of the AI engineering workflow, see Observe.