import { definitions } from "/snippets/definitions.mdx"
An offline evaluation is a test suite for your AI capability. It runs your capability against a collection of test cases using the Eval API.
## Prerequisites
- Follow the procedure in Quickstart to set up Axiom AI SDK in your TypeScript project.
- For offline evaluations, use an API token with permissions to ingest and query your dataset. Other AI engineering workflows only require a token with ingest permissions.
- Wrap your AI model with `wrapAISDKModel` for automatic tracing. See Instrumentation with Axiom AI SDK for details.
## Authentication

Instead of using the environment variables explained in the Quickstart, you can authenticate using OAuth.
The Axiom AI SDK includes a CLI for authenticating and running offline evaluations. Authenticate so that evaluation runs are recorded in Axiom and attributed to your user account.
### Login
```sh
npx axiom auth login
```

This opens your browser and prompts you to authorize the CLI with your Axiom account. Once authorized, the CLI stores your credentials locally.
### Check authentication status

```sh
npx axiom auth status
```

### Switch organizations
If you belong to multiple Axiom organizations:
```sh
npx axiom auth switch
```

### Logout
```sh
npx axiom auth logout
```

## Anatomy of an offline evaluation
The `Eval` function defines a complete test suite for your capability. Here's the basic structure:
```ts
import { Eval } from 'axiom/ai/evals';
import { Scorer } from 'axiom/ai/scorers';

Eval('evaluation-name', {
  data: [/* test cases */],
  task: async ({ input }) => {/* run capability */},
  scorers: [/* scoring functions */],
  metadata: {/* optional metadata */},
});
```

### Key parameters
- `data`: An array of test cases, or a function that returns an array of test cases. Each test case has an `input` (what you send to your capability) and an `expected` output (the ground truth).
- `task`: An async function that executes your capability for a given input and returns the output.
- `scorers`: An array of scorer functions that evaluate the output against the expected result.
- `metadata`: Optional metadata like a description or tags.
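These parameters can be sketched as TypeScript types. This is an illustrative shape only, not the SDK's actual type definitions:

```ts
// Illustrative shapes only; the SDK's real types and generics may differ.
interface TestCase<Input, Expected> {
  input: Input;        // what you send to your capability
  expected: Expected;  // the ground truth to score against
}

interface EvalOptions<Input, Expected, Output> {
  // Inline array, or an async loader for external collections
  data:
    | TestCase<Input, Expected>[]
    | (() => Promise<TestCase<Input, Expected>[]>);
  // Executes the capability for one test case
  task: (args: { input: Input }) => Promise<Output>;
  // Each scorer compares the output against the expected value
  scorers: Array<
    (args: { input: Input; output: Output; expected: Expected }) => boolean | number
  >;
  metadata?: { description?: string; tags?: string[] };
}
```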
## Create collections
The `data` parameter defines your collection of test cases. Start with a small set of examples and grow it over time as you discover edge cases.
### Inline collections
For small collections, define test cases directly in the offline evaluation:
```ts
Eval('classify-sentiment', {
  data: [
    {
      input: { text: 'I love this product!' },
      expected: { sentiment: 'positive' },
    },
    {
      input: { text: 'This is terrible.' },
      expected: { sentiment: 'negative' },
    },
    {
      input: { text: 'It works as expected.' },
      expected: { sentiment: 'neutral' },
    },
  ],
  // ... rest of eval
});
```

### External collections
For larger collections, load test cases from external files or databases:
```ts
import { readFile } from 'fs/promises';

Eval('classify-sentiment', {
  data: async () => {
    const content = await readFile('./test-cases/sentiment.json', 'utf-8');
    return JSON.parse(content);
  },
  // ... rest of eval
});
```

## Define tasks
The `task` function executes your AI capability for each test case. It receives the `input` from the test case and should return the output your capability produces.
```ts
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { wrapAISDKModel } from 'axiom/ai';

async function classifySentiment(text: string) {
  const result = await generateText({
    model: wrapAISDKModel(openai('gpt-4o-mini')),
    prompt: `Classify the sentiment of this text as positive, negative, or neutral: "${text}"`,
  });
  return { sentiment: result.text };
}

Eval('classify-sentiment', {
  data: [/* ... */],
  task: async ({ input }) => {
    return await classifySentiment(input.text);
  },
  scorers: [/* ... */],
});
```

## Create scorers
Scorers evaluate your capability's output. In offline evaluations, scorers receive `input`, `output`, and `expected` (ground truth), and return a score. For the full Scorer API reference including return types, patterns, and third-party integrations, see Scorers.
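Scorers can return a boolean for pass/fail, or a fractional score between 0 and 1 for partial credit. As a sketch, a plain scoring function for keyword coverage might look like this (the field names are illustrative assumptions; you would wrap it with `Scorer` like any other scorer):

```ts
// Fraction of expected keywords found in the output (illustrative shape).
function keywordOverlap({
  output,
  expected,
}: {
  output: { text: string };
  expected: { keywords: string[] };
}): number {
  const text = output.text.toLowerCase();
  const hits = expected.keywords.filter((k) => text.includes(k.toLowerCase()));
  return hits.length / expected.keywords.length; // score in [0, 1]
}
```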
Here's a quick example of an offline scorer that compares output to expected values:
```ts
import { Scorer } from 'axiom/ai/scorers';

const ExactMatchScorer = Scorer(
  'exact-match',
  ({ output, expected }) => {
    return output.sentiment === expected.sentiment;
  }
);
```

## Complete example
Here's a complete evaluation for a support ticket classification system:
```ts
import { Eval } from 'axiom/ai/evals';
import { Scorer } from 'axiom/ai/scorers';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { wrapAISDKModel } from 'axiom/ai';
import { z } from 'zod';

// The capability function
async function classifyTicket({
  subject,
  content,
}: {
  subject?: string;
  content: string;
}) {
  const result = await generateObject({
    model: wrapAISDKModel(openai('gpt-4o-mini')),
    messages: [
      {
        role: 'system',
        content: `You are a customer support engineer. Classify tickets as:
spam, question, feature_request, or bug_report.`,
      },
      {
        role: 'user',
        content: subject ? `Subject: ${subject}\n\n${content}` : content,
      },
    ],
    schema: z.object({
      category: z.enum(['spam', 'question', 'feature_request', 'bug_report']),
      confidence: z.number().min(0).max(1),
    }),
  });
  return result.object;
}

// Custom scorer for category matching
const CategoryScorer = Scorer(
  'category-match',
  ({ output, expected }) => {
    return output.category === expected.category;
  }
);

// Custom scorer for high-confidence predictions
const ConfidenceScorer = Scorer(
  'high-confidence',
  ({ output }) => {
    return output.confidence >= 0.8;
  }
);

// Define the evaluation
Eval('spam-classification', {
  data: [
    {
      input: {
        subject: "Congratulations! You've Won!",
        content: 'Claim your $500 gift card now!',
      },
      expected: {
        category: 'spam',
      },
    },
    {
      input: {
        subject: 'How do I reset my password?',
        content: 'I forgot my password and need help resetting it.',
      },
      expected: {
        category: 'question',
      },
    },
    {
      input: {
        subject: 'Feature request: Dark mode',
        content: 'Would love to see a dark mode option in the app.',
      },
      expected: {
        category: 'feature_request',
      },
    },
    {
      input: {
        subject: 'App crashes on startup',
        content: 'The app crashes immediately when I try to open it.',
      },
      expected: {
        category: 'bug_report',
      },
    },
  ],
  task: async ({ input }) => {
    return await classifyTicket(input);
  },
  scorers: [CategoryScorer, ConfidenceScorer],
  metadata: {
    description: 'Classify support tickets into categories',
  },
});
```

## File naming conventions
Name your evaluation files with the `.eval.ts` extension so they're automatically discovered by the Axiom CLI:
```
src/
└── lib/
    └── capabilities/
        └── classify-ticket/
            └── evaluations/
                ├── spam-classification.eval.ts
                ├── category-accuracy.eval.ts
                └── edge-cases.eval.ts
```
The CLI finds all files matching `**/*.eval.{ts,js,mts,mjs,cts,cjs}` based on your `axiom.config.ts` configuration.
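As an illustration of what that pattern matches, discovery can be sketched with Node's standard library. This is not the CLI's actual implementation, just a minimal equivalent filter:

```ts
import { readdir } from 'node:fs/promises';

// Mirrors the documented pattern **/*.eval.{ts,js,mts,mjs,cts,cjs}
const EVAL_FILE = /\.eval\.(ts|js|mts|mjs|cts|cjs)$/;

async function findEvalFiles(root: string): Promise<string[]> {
  // Node >= 18.17 supports recursive directory listing
  const entries = await readdir(root, { recursive: true });
  return entries.filter((p) => EVAL_FILE.test(p)).map((p) => `${root}/${p}`);
}
```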
## What's next?
- To parameterize your capabilities and run experiments, see Flags and experiments.
- To run offline evaluations using the CLI, see Run offline evaluations.