Code Evaluators

Code evaluators let you write Python functions that deterministically score spans without calling an LLM. They're fast, free, and ideal for structural validation.

Function Contract

Your code must define an evaluate(ctx) function that receives an EvaluationContext and returns an EvaluationResult:

import json

def evaluate(ctx):
    output = ctx.observation.output

    if isinstance(output, (dict, list)):
        # Already parsed upstream, so it was valid JSON
        has_json = True
    else:
        try:
            json.loads(str(output))
            has_json = True
        except (json.JSONDecodeError, TypeError):
            has_json = False

    return EvaluationResult(scores=[
        Score(
            name="valid_json",
            value=has_json,
            data_type="BOOLEAN",
            comment="Output is valid JSON"
                if has_json else "Not valid JSON",
        ),
    ])
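To try an evaluator on your own machine before pasting it into the editor, you can stub the names the runtime provides. The sketch below is a minimal local harness: the `Score` and `EvaluationResult` dataclasses and the `make_ctx` helper are hypothetical stand-ins that only mirror the fields used on this page, not the platform's actual classes.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Any, Optional


# Hypothetical stand-ins for the objects the evaluator runtime provides.
@dataclass
class Score:
    name: str
    value: Any
    data_type: str
    comment: Optional[str] = None


@dataclass
class EvaluationResult:
    scores: list = field(default_factory=list)


def make_ctx(input_=None, output=None, metadata=None):
    """Build a fake context exposing ctx.observation.input/output/metadata."""
    observation = SimpleNamespace(
        input=input_, output=output, metadata=metadata or {}
    )
    return SimpleNamespace(observation=observation)


def evaluate(ctx):
    """Toy evaluator: passes when the output is non-empty."""
    text = str(ctx.observation.output or "")
    return EvaluationResult(scores=[
        Score(
            name="non_empty",
            value=bool(text.strip()),
            data_type="BOOLEAN",
        ),
    ])


result = evaluate(make_ctx(output="Hello"))
print(result.scores[0].value)  # True
```

Remove the stand-in classes before saving the template, since the examples on this page use `Score` and `EvaluationResult` without defining them.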

Context Fields

The ctx.observation object contains:

Field	Description
`ctx.observation.input`	Parsed `gen_ai.input.messages` (JSON if valid, raw string otherwise)
`ctx.observation.output`	Parsed `gen_ai.output.messages` (JSON if valid, raw string otherwise)
`ctx.observation.metadata`	All span tags as a dict (e.g. `{"gen_ai.request.model": "gpt-4o", ...}`)
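Because `input` and `output` can arrive either as parsed JSON or as a raw string, text-based checks benefit from normalizing the value first. Calling `str()` on a parsed dict or list produces Python's repr (single quotes), which is not valid JSON. The `as_text` helper below is a hypothetical name, shown as one possible way to handle both cases.

```python
import json


def as_text(value):
    """Return a string for either a raw string or an already-parsed JSON value."""
    if value is None:
        return ""
    if isinstance(value, str):
        return value
    # Parsed JSON (dict, list, number, bool): re-serialize so the
    # resulting text is valid JSON rather than a Python repr.
    return json.dumps(value, ensure_ascii=False)


print(as_text({"a": 1}))  # {"a": 1}
```

Inside an evaluator you would call it as `as_text(ctx.observation.output)`.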

Accessing metadata

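# Direct indexing raises KeyError if the tag is absent
model = ctx.observation.metadata["gen_ai.request.model"]
# .get() returns the fallback value instead of raising
system = ctx.observation.metadata.get(
    "gen_ai.system_instructions", ""
)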

Score Types

Each Score requires a data_type:

data_type	value type	Example
`NUMERIC`	int or float	`0.85`
`BOOLEAN`	bool	`True` / `False`
`CATEGORICAL`	str	`"good"`, `"bad"`

You can return multiple scores per evaluation.
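As a sketch of returning several scores at once, the evaluator below emits one score of each data type. The score names, the 200 and 1000 character thresholds, and the `length_bucket` helper are illustrative assumptions. `Score` and `EvaluationResult` are taken to be provided by the runtime, as in the other examples on this page.

```python
def length_bucket(n_chars):
    """Map a character count to a coarse label (thresholds are arbitrary)."""
    if n_chars < 200:
        return "short"
    if n_chars < 1000:
        return "medium"
    return "long"


def evaluate(ctx):
    text = str(ctx.observation.output or "")
    n_chars = len(text)

    return EvaluationResult(scores=[
        Score(name="output_chars", value=n_chars, data_type="NUMERIC"),
        Score(name="non_empty", value=n_chars > 0, data_type="BOOLEAN"),
        Score(
            name="length_bucket",
            value=length_bucket(n_chars),
            data_type="CATEGORICAL",
        ),
    ])
```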

Setup Steps

1. Go to GenAI → Evaluators → New Code Evaluator
2. Write your Python code in the editor
3. Select a sample span and click Run Test to verify
4. Save the template
5. Create a rule to deploy the evaluator

Runtime Constraints

Each execution runs in an isolated microVM (hardware-level isolation). The VM is destroyed after each run, so no state persists between evaluations.

Constraint	Value
Language	Python only
Timeout	5 seconds
Memory	128 MB
Network access	None
File system	Ephemeral (destroyed after run)
Max source size	256 KB
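If a check loops over a large output, one way to stay inside the 5-second timeout is to stop early against a time budget. This is a minimal sketch: the function name and the 4-second default are assumptions chosen to leave headroom, not platform settings.

```python
import time


def count_matches_within_budget(lines, predicate, budget_seconds=4.0):
    """Count lines satisfying predicate, stopping early if time runs out.

    Returns (count, completed); completed is False if the scan was cut short.
    """
    deadline = time.monotonic() + budget_seconds
    count = 0
    for line in lines:
        if time.monotonic() >= deadline:
            return count, False
        if predicate(line):
            count += 1
    return count, True


count, completed = count_matches_within_budget(
    ["ok", "error: x", "ok"], lambda s: s.startswith("error")
)
print(count, completed)  # 1 True
```

When `completed` is False, you could note the partial scan in the score's `comment` rather than letting the run end in `TIMEOUT`.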

Available modules

The full Python standard library is available inside the VM, including json, re, math, datetime, string, collections, itertools, and functools.

No third-party packages or network access.
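Many fuzzy checks need nothing beyond the standard library. As an illustration, the sketch below uses `difflib.SequenceMatcher` to produce a NUMERIC similarity score. The `similarity` helper, the score name, and the hard-coded reference answer are hypothetical. `Score` and `EvaluationResult` are assumed to come from the runtime, as elsewhere on this page.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def evaluate(ctx):
    output = str(ctx.observation.output or "").strip()
    expected = "Hello, world!"  # hypothetical reference answer

    return EvaluationResult(scores=[
        Score(
            name="similarity",
            value=round(similarity(output, expected), 3),
            data_type="NUMERIC",
        ),
    ])
```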

Examples

Exact match

def evaluate(ctx):
    output = str(ctx.observation.output or "")
    expected = "Hello, world!"
    match = output.strip() == expected

    return EvaluationResult(scores=[
        Score(
            name="exact_match",
            value=match,
            data_type="BOOLEAN",
        ),
    ])

Regex validation

import re

def evaluate(ctx):
    output = str(ctx.observation.output or "")
    has_email = bool(
        re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", output)
    )

    return EvaluationResult(scores=[
        Score(
            name="contains_email",
            value=has_email,
            data_type="BOOLEAN",
        ),
    ])

JSON schema check

import json

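def evaluate(ctx):
    output = ctx.observation.output
    required_keys = ["name", "age", "email"]

    try:
        # Output may already be parsed; only raw strings need json.loads
        parsed = json.loads(output) if isinstance(output, str) else output
    except json.JSONDecodeError:
        parsed = None

    # Require a JSON object: `in` on a list or string would test
    # membership or substrings rather than keys
    has_all = isinstance(parsed, dict) and all(
        k in parsed for k in required_keys
    )

    return EvaluationResult(scores=[
        Score(
            name="schema_valid",
            value=has_all,
            data_type="BOOLEAN",
            comment="Missing keys or not a JSON object"
                if not has_all else None,
        ),
    ])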
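The check above only looks at key presence. If you also want to verify value types and report what is wrong, a helper along these lines may serve as a starting point. The `schema_problems` name and the key-to-type mapping are assumptions for illustration, and each expected type must be a single Python type.

```python
def schema_problems(parsed, expected_types):
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(parsed, dict):
        return [f"expected object, got {type(parsed).__name__}"]

    problems = []
    for key, expected in expected_types.items():
        if key not in parsed:
            problems.append(f"missing key: {key}")
            continue
        value = parsed[key]
        # bool is a subclass of int in Python, so reject it explicitly
        # unless a bool is what the schema asks for.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            problems.append(
                f"{key}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


print(schema_problems({"name": "Ada", "age": "36"}, {"name": str, "age": int}))
# ['age: expected int, got str']
```

The joined problem list fits naturally into the score's `comment`, for example `"; ".join(problems) or None`.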

Keyword detection

def evaluate(ctx):
    output = str(ctx.observation.output or "").lower()
    blocked = ["password", "secret", "api_key"]
    found = [w for w in blocked if w in output]

    return EvaluationResult(scores=[
        Score(
            name="no_secrets",
            value=len(found) == 0,
            data_type="BOOLEAN",
            comment=f"Found: {found}" if found else None,
        ),
    ])
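Plain substring matching also flags words that merely contain a blocked term, such as "secretary" for "secret". A possible variant, sketched below, matches whole words only using `re` word boundaries. The `find_blocked_terms` helper is a hypothetical name. `Score` and `EvaluationResult` are assumed to come from the runtime, as in the example above.

```python
import re


def find_blocked_terms(text, blocked):
    """Return the blocked terms that appear in text as whole words."""
    return [
        term for term in blocked
        if re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE)
    ]


def evaluate(ctx):
    output = str(ctx.observation.output or "")
    found = find_blocked_terms(output, ["password", "secret", "api_key"])

    return EvaluationResult(scores=[
        Score(
            name="no_secrets",
            value=not found,
            data_type="BOOLEAN",
            comment=f"Found: {found}" if found else None,
        ),
    ])
```

The trade-off is that inflected forms such as "passwords" no longer match, so pick whichever behavior suits your rule.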

Viewing Execution History

Code evaluator executions are recorded as trace spans. To view them:

1. Go to the Traces page
2. Filter by gen_ai.operation.name = code_eval
3. Each span shows status, duration, scores, and errors

From the Evaluators page, click View executions on any rule to jump to a pre-filtered trace view.

Error Codes

Code	Meaning
`INVALID_SOURCE`	Syntax error or missing `evaluate` function
`USER_CODE_ERROR`	Runtime exception in your code
`TIMEOUT`	Exceeded 5-second limit
`INVALID_RESULT`	Return value doesn't match expected shape
`RESULT_TOO_LARGE`	Result exceeds 256 KB
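If you would rather record a failing score than have a run end in `USER_CODE_ERROR`, you can catch exceptions inside your own code. The `run_check` wrapper below is a hypothetical sketch. It also truncates the comment, with an arbitrary 500-character default, so that a long error message does not inflate the result.

```python
def run_check(check, value, max_comment_chars=500):
    """Run check(value); return (passed, comment) instead of raising."""
    try:
        return bool(check(value)), None
    except Exception as exc:  # deliberate catch-all for this sketch
        comment = f"{type(exc).__name__}: {exc}"
        return False, comment[:max_comment_chars]


print(run_check(lambda v: v["k"] == 1, {}))
# (False, "KeyError: 'k'")
```

Catching everything can hide genuine bugs, so you may prefer to let unexpected exceptions surface while developing and add the wrapper only once the check is stable.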

Support

If you need assistance or have any questions, please reach out to us through:

Email at [email protected]
