Scorers are functions that measure your AI capability's output. They receive the inputs and outputs of a capability run, and return a score. The same Scorer API works in both offline and online evaluations.

The key difference between the two contexts is what the scorer receives:

  • Offline scorers receive input, output, and expected (ground truth from your test collection).
  • Online scorers are reference-free. They receive input and output without an expected value.

Because the API is the same, you can reuse scorers across both contexts. A scorer you write for offline evaluations works in online evaluations as long as it doesn't depend on expected.

Create scorers

Create scorers using the Scorer wrapper. A scorer takes a name and a scoring function:

import { Scorer } from 'axiom/ai/scorers';
 
const MyScorer = Scorer(
  'my-scorer',
  ({ input, output }) => {
    // Return a boolean, a number (0-1), or { score, metadata }
  }
);
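
For example, a minimal pass/fail scorer that only checks the output is non-empty:

const NotEmpty = Scorer(
  'not-empty',
  ({ output }: { output: string }) => output.trim().length > 0,
);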

Return types

Scorers can return three types of values:

Boolean

Return true or false for simple pass/fail checks. The SDK converts booleans to 1 (pass) or 0 (fail) and marks the score as boolean in telemetry.

const isKnownCategory = Scorer(
  'is-known-category',
  ({ output }: { output: string }) => {
    return ['support', 'complaint', 'spam', 'unknown'].includes(output);
  },
);

Numeric

Return a number between 0 and 1 for graded scoring:

const formatConfidence = Scorer(
  'format-confidence',
  ({ output }: { output: string }) => {
    const trimmed = output.trim().toLowerCase();
    const isSingleWord = !trimmed.includes(' ');
    const isClean = /^[a-z_]+$/.test(trimmed);
 
    return (isSingleWord ? 0.5 : 0) + (isClean ? 0.5 : 0);
  },
);

Score with metadata

Return an object with score and metadata to attach additional context to the eval span:

const validCategory = Scorer(
  'valid-category',
  ({ output }: { output: string }) => {
    const validCategories = ['support', 'complaint', 'spam', 'unknown'];
    return {
      score: validCategories.includes(output),
      metadata: {
        category: output,
        validCategories,
      },
    };
  },
);

Scorer patterns

Exact match (offline)

Compare the output directly against the expected value. This pattern only works in offline evaluations where ground truth is available.

const ExactMatchScorer = Scorer(
  'exact-match',
  ({ output, expected }) => {
    return output.sentiment === expected.sentiment;
  }
);
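
Exact comparison is sensitive to formatting. A more forgiving variant (a sketch, assuming string values) normalizes case and whitespace before comparing:

const NormalizedMatch = Scorer(
  'normalized-match',
  ({ output, expected }: { output: string; expected: string }) => {
    // Collapse whitespace and lowercase both sides so trivial formatting
    // differences don't fail the check.
    const norm = (s: string) => s.trim().toLowerCase().replace(/\s+/g, ' ');
    return norm(output) === norm(expected);
  },
);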

Heuristic checks

Validate output structure or format without ground truth. These scorers work in both offline and online evaluations.

const formatScorer = Scorer('format', ({ output }: { output: string }) => {
  const trimmed = output.trim();
  return /[.!?]$/.test(trimmed) && !trimmed.includes('\n') && trimmed.length <= 200;
});
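
Another reference-free heuristic (a sketch) checks that the output parses as JSON and attaches the parse error as metadata when it doesn't:

const jsonScorer = Scorer('valid-json', ({ output }: { output: string }) => {
  try {
    JSON.parse(output);
    return true;
  } catch (err) {
    // Failed parses score 0; the error message lands on the eval span.
    return {
      score: false,
      metadata: { error: err instanceof Error ? err.message : String(err) },
    };
  }
});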

LLM-as-judge

Use a second model to evaluate the output. Scorers can be async, so they can call a judge model. This works in both contexts and is especially useful in online evaluations, where there's no ground truth and you need a semantic quality signal.

import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
 
// Any AI SDK language model can serve as the judge; gpt-4o-mini is one option, not a requirement.
const judgeModel = openai('gpt-4o-mini');
 
const relevanceScorer = Scorer(
  'relevance',
  async ({ input, output }: { input: string; output: string }) => {
    const result = await generateObject({
      model: judgeModel,
      schema: z.object({
        relevant: z.boolean().describe('Whether the response answers the question'),
      }),
      system: 'You evaluate if an AI response answers the user question.',
      prompt: `Question: ${input}\n\nResponse: ${output}`,
    });
    return result.object.relevant;
  },
);
LLM judge scorers add latency and cost per evaluation. In online evaluations, use [sampling](/ai-engineering/evaluate/online-evaluations/write-run-evaluations#sampling) to control how often they run.
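
The same approach supports graded judgments. A sketch that reuses judgeModel, asks for a 1-5 rating (an arbitrary scale chosen here), and normalizes it to the 0-1 range:

const helpfulnessScorer = Scorer(
  'helpfulness',
  async ({ input, output }: { input: string; output: string }) => {
    const result = await generateObject({
      model: judgeModel,
      schema: z.object({
        rating: z.number().min(1).max(5).describe('1 (unhelpful) to 5 (excellent)'),
      }),
      system: 'You rate how helpful an AI response is to the user question.',
      prompt: `Question: ${input}\n\nResponse: ${output}`,
    });
    // Map the 1-5 rating onto the 0-1 score range.
    return (result.object.rating - 1) / 4;
  },
);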

Use autoevals

The autoevals library provides prebuilt scorers for common tasks:

npm install autoevals

import { Scorer } from 'axiom/ai/scorers';
import { Levenshtein, Factuality } from 'autoevals';
 
const LevenshteinScorer = Scorer(
  'levenshtein',
  ({ output, expected }) => {
    return Levenshtein({ output: output.text, expected: expected.text });
  }
);
 
const FactualityCheck = Scorer(
  'factuality',
  async ({ input, output, expected }) => {
    return await Factuality({
      input, // the original question; Factuality's judge prompt uses it
      output: output.text,
      expected: expected.text,
    });
  }
);
Use multiple scorers to evaluate different aspects of your capability. For example, check both exact accuracy and semantic similarity to get a complete picture of performance.
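
For example (a sketch reusing scorers defined on this page; where the list is registered depends on your eval setup):

const scorers = [LevenshteinScorer, FactualityCheck]; // string similarity + semantic factuality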

Telemetry

Each scorer produces an OTel span with the following attributes:

  • gen_ai.operation.name: Always eval.score
  • eval.name: The eval name
  • eval.score.name: The scorer name
  • eval.score.value: The numeric score (0-1)
  • eval.score.metadata: JSON string of scorer metadata. Includes eval.score.is_boolean: true when the scorer returned a boolean.
  • eval.capability.name: The capability being evaluated
  • eval.step.name: The step within the capability (when set)
  • eval.tags: ["online"] for online evaluations
