Scorers are functions that measure your AI capability's output. They receive the inputs and outputs of a capability run, and return a score. The same Scorer API works in both offline and online evaluations.
The key difference between the two contexts is what the scorer receives:
- Offline scorers receive `input`, `output`, and `expected` (ground truth from your test collection).
- Online scorers are reference-free. They receive `input` and `output` without an `expected` value.
Because the API is the same, you can reuse scorers across both contexts. A scorer you write for offline evaluations works in online evaluations as long as it doesn't depend on `expected`.
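As a minimal sketch (plain functions, not SDK code), a scoring function can treat `expected` as optional, using ground truth when it's available and falling back to a reference-free check when it isn't:

```typescript
// Sketch of a scoring function usable in both contexts.
// `expected` is optional: present in offline runs, absent in online runs.
type ScoreArgs = { input: string; output: string; expected?: string };

function categoryScore({ output, expected }: ScoreArgs): number {
  if (expected !== undefined) {
    // Offline: ground truth available, so score by exact match.
    return output === expected ? 1 : 0;
  }
  // Online: reference-free fallback, check the output is a known category.
  const validCategories = ['support', 'complaint', 'spam', 'unknown'];
  return validCategories.includes(output) ? 1 : 0;
}
```

The names here (`categoryScore`, the category list) are illustrative, not part of the SDK.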
## Create scorers
Create scorers using the `Scorer` wrapper. A scorer takes a name and a scoring function:
```typescript
import { Scorer } from 'axiom/ai/scorers';

const MyScorer = Scorer(
  'my-scorer',
  ({ input, output }) => {
    // Return a boolean, a number (0-1), or { score, metadata }
  },
);
```

## Return types
Scorers can return three types of values:
### Boolean
Return `true` or `false` for simple pass/fail checks. The SDK converts booleans to `1` (pass) or `0` (fail) and marks the score as boolean in telemetry.
```typescript
const isKnownCategory = Scorer(
  'is-known-category',
  ({ output }: { output: string }) => {
    return ['support', 'complaint', 'spam', 'unknown'].includes(output);
  },
);
```

### Numeric
Return a number between 0 and 1 for graded scoring:
```typescript
const formatConfidence = Scorer(
  'format-confidence',
  ({ output }: { output: string }) => {
    const trimmed = output.trim().toLowerCase();
    const isSingleWord = !trimmed.includes(' ');
    const isClean = /^[a-z_]+$/.test(trimmed);
    return (isSingleWord ? 0.5 : 0) + (isClean ? 0.5 : 0);
  },
);
```

### Score with metadata
Return an object with `score` and `metadata` to attach additional context to the eval span:
```typescript
const validCategory = Scorer(
  'valid-category',
  ({ output }: { output: string }) => {
    const validCategories = ['support', 'complaint', 'spam', 'unknown'];
    return {
      score: validCategories.includes(output),
      metadata: {
        category: output,
        validCategories,
      },
    };
  },
);
```

## Scorer patterns
### Exact match (offline)
Compare the output directly against the expected value. This pattern only works in offline evaluations where ground truth is available.
```typescript
const ExactMatchScorer = Scorer(
  'exact-match',
  ({ output, expected }) => {
    return output.sentiment === expected.sentiment;
  },
);
```

### Heuristic checks
Validate output structure or format without ground truth. These scorers work in both offline and online evaluations.
```typescript
const formatScorer = Scorer('format', ({ output }: { output: string }) => {
  const trimmed = output.trim();
  return /[.!?]$/.test(trimmed) && !trimmed.includes('\n') && trimmed.length <= 200;
});
```

### LLM-as-judge
Use a second model to evaluate the output. Async scorers are useful in both contexts, especially in online evaluations where you don't have ground truth and need semantic quality assessment.
```typescript
import { generateObject } from 'ai';
import { z } from 'zod';

// judgeModel is any AI SDK model instance you've configured as the judge.
const relevanceScorer = Scorer(
  'relevance',
  async ({ input, output }: { input: string; output: string }) => {
    const result = await generateObject({
      model: judgeModel,
      schema: z.object({
        relevant: z.boolean().describe('Whether the response answers the question'),
      }),
      system: 'You evaluate if an AI response answers the user question.',
      prompt: `Question: ${input}\n\nResponse: ${output}`,
    });
    return result.object.relevant;
  },
);
```

## Use autoevals
The `autoevals` library provides prebuilt scorers for common tasks:
```shell
npm install autoevals
```

```typescript
import { Scorer } from 'axiom/ai/scorers';
import { Levenshtein, Factuality } from 'autoevals';

const LevenshteinScorer = Scorer(
  'levenshtein',
  ({ output, expected }) => {
    return Levenshtein({ output: output.text, expected: expected.text });
  },
);

const FactualityCheck = Scorer(
  'factuality',
  async ({ output, expected }) => {
    return await Factuality({
      output: output.text,
      expected: expected.text,
    });
  },
);
```

## Telemetry
Each scorer produces an OTel span with the following attributes:
| Attribute | Description |
|---|---|
| `gen_ai.operation.name` | Always `eval.score` |
| `eval.name` | The eval name |
| `eval.score.name` | The scorer name |
| `eval.score.value` | The numeric score (0-1) |
| `eval.score.metadata` | JSON string of scorer metadata. Includes `eval.score.is_boolean: true` when the scorer returned a boolean. |
| `eval.capability.name` | The capability being evaluated |
| `eval.step.name` | The step within the capability (when set) |
| `eval.tags` | `["online"]` for online evaluations |
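The mapping from scorer return values to these attributes can be sketched as follows. The `toSpanAttributes` helper is hypothetical, not SDK code; only the attribute names come from the table above:

```typescript
// A scorer may return a boolean, a number, or { score, metadata }.
type ScorerResult =
  | boolean
  | number
  | { score: boolean | number; metadata?: Record<string, unknown> };

// Hypothetical helper showing how a result could flatten onto span attributes.
function toSpanAttributes(
  scorerName: string,
  result: ScorerResult,
): Record<string, string | number> {
  const raw = typeof result === 'object' ? result.score : result;
  const isBoolean = typeof raw === 'boolean';
  const metadata = {
    ...(typeof result === 'object' ? result.metadata : {}),
    // Booleans are flagged in metadata so the UI can render pass/fail.
    ...(isBoolean ? { 'eval.score.is_boolean': true } : {}),
  };
  return {
    'gen_ai.operation.name': 'eval.score',
    'eval.score.name': scorerName,
    'eval.score.value': isBoolean ? (raw ? 1 : 0) : raw, // booleans become 1/0
    'eval.score.metadata': JSON.stringify(metadata),
  };
}
```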
## What's next?
- Use scorers in offline evaluations to test against known-good answers before shipping.
- Use scorers in online evaluations to monitor production quality continuously.