import { definitions } from "/snippets/definitions.mdx"
An offline evaluation is a test suite for your AI capability. It runs your capability against a collection of test cases using the Eval API.
## Prerequisites
- Follow the procedure in Quickstart to set up Axiom AI SDK in your TypeScript project.
- For offline evaluations, use an API token with permissions to ingest and query your dataset. Other AI engineering workflows only require a token with ingest permissions.
- Wrap your AI model with `wrapAISDKModel` for automatic tracing. See Instrumentation with Axiom AI SDK for details.
## Authentication

Instead of using the environment variables explained in the Quickstart, you can authenticate using OAuth.
The Axiom AI SDK includes a CLI for authenticating and running offline evaluations. Authenticate so that evaluation runs are recorded in Axiom and attributed to your user account.
### Login
```sh
npx axiom auth login
```

This opens your browser and prompts you to authorize the CLI with your Axiom account. Once authorized, the CLI stores your credentials locally.
### Check authentication status

```sh
npx axiom auth status
```

### Switch organizations
If you belong to multiple Axiom organizations:
```sh
npx axiom auth switch
```

### Logout
```sh
npx axiom auth logout
```

## Anatomy of an offline evaluation
The `Eval` function defines a complete test suite for your capability. Here's the basic structure:
```ts
import { Eval } from 'axiom/ai/evals';
import { Scorer } from 'axiom/ai/scorers';

Eval('evaluation-name', {
  data: [/* test cases */],
  task: async ({ input }) => {/* run capability */},
  scorers: [/* scoring functions */],
  metadata: {/* optional metadata */},
});
```

### Key parameters
- `data`: An array of test cases, or a function that returns an array of test cases. Each test case has an `input` (what you send to your capability) and an `expected` output (the ground truth).
- `task`: An async function that executes your capability for a given input and returns the output.
- `scorers`: An array of scorer functions that evaluate the output against the expected result.
- `metadata`: Optional metadata like a description or tags.
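These parameters can be sketched as TypeScript types. This is an illustrative shape only, not the SDK's actual type definitions:

```ts
// Illustrative shapes only; the SDK's real types and generics may differ.
interface TestCase<Input, Expected> {
  input: Input;        // what you send to your capability
  expected: Expected;  // the ground truth to score against
}

interface EvalOptions<Input, Expected, Output> {
  // Inline array, or an async loader for external collections
  data:
    | TestCase<Input, Expected>[]
    | (() => Promise<TestCase<Input, Expected>[]>);
  // Executes the capability for one test case
  task: (args: { input: Input }) => Promise<Output>;
  // Each scorer compares the output against the expected value
  scorers: Array<
    (args: { input: Input; output: Output; expected: Expected }) => boolean | number
  >;
  metadata?: { description?: string; tags?: string[] };
}
```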
## Create collections
The `data` parameter defines your collection of test cases. Start with a small set of examples and grow it over time as you discover edge cases.
### Inline collections
For small collections, define test cases directly in the offline evaluation:
```ts
Eval('classify-sentiment', {
  data: [
    {
      input: { text: 'I love this product!' },
      expected: { sentiment: 'positive' },
    },
    {
      input: { text: 'This is terrible.' },
      expected: { sentiment: 'negative' },
    },
    {
      input: { text: 'It works as expected.' },
      expected: { sentiment: 'neutral' },
    },
  ],
  // ... rest of eval
});
```

### External collections
For larger collections, load test cases from external files or databases:
```ts
import { readFile } from 'fs/promises';

Eval('classify-sentiment', {
  data: async () => {
    const content = await readFile('./test-cases/sentiment.json', 'utf-8');
    return JSON.parse(content);
  },
  // ... rest of eval
});
```

## Define tasks
The `task` function executes your AI capability for each test case. It receives the `input` from the test case and should return the output your capability produces.
```ts
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { wrapAISDKModel } from 'axiom/ai';

async function classifySentiment(text: string) {
  const result = await generateText({
    model: wrapAISDKModel(openai('gpt-4o-mini')),
    prompt: `Classify the sentiment of this text as positive, negative, or neutral: "${text}"`,
  });
  return { sentiment: result.text };
}

Eval('classify-sentiment', {
  data: [/* ... */],
  task: async ({ input }) => {
    return await classifySentiment(input.text);
  },
  scorers: [/* ... */],
});
```

## Create scorers
Scorers evaluate your capability's output. In offline evaluations, scorers receive `input`, `output`, and `expected` (ground truth), and return a score. For the full Scorer API reference including return types, patterns, and third-party integrations, see Scorers.
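Scorers can return a boolean for pass/fail, or a fractional score between 0 and 1 for partial credit. As a sketch, a plain scoring function for keyword coverage might look like this (the field names are illustrative assumptions; you would wrap it with `Scorer` like any other scorer):

```ts
// Fraction of expected keywords found in the output (illustrative shape).
function keywordOverlap({
  output,
  expected,
}: {
  output: { text: string };
  expected: { keywords: string[] };
}): number {
  const text = output.text.toLowerCase();
  const hits = expected.keywords.filter((k) => text.includes(k.toLowerCase()));
  return hits.length / expected.keywords.length; // score in [0, 1]
}
```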
Here's a quick example of an offline scorer that compares output to expected values:
```ts
import { Scorer } from 'axiom/ai/scorers';

const ExactMatchScorer = Scorer(
  'exact-match',
  ({ output, expected }) => {
    return output.sentiment === expected.sentiment;
  }
);
```

## Complete example
Here's a complete evaluation for a support ticket classification system:
```ts
import { Eval } from 'axiom/ai/evals';
import { Scorer } from 'axiom/ai/scorers';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { wrapAISDKModel } from 'axiom/ai';
import { z } from 'zod';

// The capability function
async function classifyTicket({
  subject,
  content,
}: {
  subject?: string;
  content: string;
}) {
  const result = await generateObject({
    model: wrapAISDKModel(openai('gpt-4o-mini')),
    messages: [
      {
        role: 'system',
        content: `You are a customer support engineer. Classify tickets as:
spam, question, feature_request, or bug_report.`,
      },
      {
        role: 'user',
        content: subject ? `Subject: ${subject}\n\n${content}` : content,
      },
    ],
    schema: z.object({
      category: z.enum(['spam', 'question', 'feature_request', 'bug_report']),
      confidence: z.number().min(0).max(1),
    }),
  });
  return result.object;
}

// Custom scorer for category matching
const CategoryScorer = Scorer(
  'category-match',
  ({ output, expected }) => {
    return output.category === expected.category;
  }
);

// Custom scorer for high-confidence predictions
const ConfidenceScorer = Scorer(
  'high-confidence',
  ({ output }) => {
    return output.confidence >= 0.8;
  }
);

// Define the evaluation
Eval('spam-classification', {
  data: [
    {
      input: {
        subject: "Congratulations! You've Won!",
        content: 'Claim your $500 gift card now!',
      },
      expected: {
        category: 'spam',
      },
    },
    {
      input: {
        subject: 'How do I reset my password?',
        content: 'I forgot my password and need help resetting it.',
      },
      expected: {
        category: 'question',
      },
    },
    {
      input: {
        subject: 'Feature request: Dark mode',
        content: 'Would love to see a dark mode option in the app.',
      },
      expected: {
        category: 'feature_request',
      },
    },
    {
      input: {
        subject: 'App crashes on startup',
        content: 'The app crashes immediately when I try to open it.',
      },
      expected: {
        category: 'bug_report',
      },
    },
  ],
  task: async ({ input }) => {
    return await classifyTicket(input);
  },
  scorers: [CategoryScorer, ConfidenceScorer],
  metadata: {
    description: 'Classify support tickets into categories',
  },
});
```

## File naming conventions
Name your evaluation files with the `.eval.ts` extension so they're automatically discovered by the Axiom CLI:
```
src/
└── lib/
    └── capabilities/
        └── classify-ticket/
            └── evaluations/
                ├── spam-classification.eval.ts
                ├── category-accuracy.eval.ts
                └── edge-cases.eval.ts
```
The CLI finds all files matching `**/*.eval.{ts,js,mts,mjs,cts,cjs}` based on your `axiom.config.ts` configuration.
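As an illustration of what that pattern matches, discovery can be sketched with Node's standard library. This is not the CLI's actual implementation, just a minimal equivalent filter:

```ts
import { readdir } from 'node:fs/promises';

// Mirrors the documented pattern **/*.eval.{ts,js,mts,mjs,cts,cjs}
const EVAL_FILE = /\.eval\.(ts|js|mts|mjs|cts|cjs)$/;

async function findEvalFiles(root: string): Promise<string[]> {
  // Node >= 18.17 supports recursive directory listing
  const entries = await readdir(root, { recursive: true });
  return entries.filter((p) => EVAL_FILE.test(p)).map((p) => `${root}/${p}`);
}
```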
## What's next?
- To parameterize your capabilities and run experiments, see Flags and experiments.
- To run offline evaluations using the CLI, see Run offline evaluations.