Code Evaluators

Code evaluators let you write Python functions that deterministically score spans without calling an LLM. They're fast, free, and ideal for structural validation.

Function Contract

Your code must define an evaluate(ctx) function that receives an EvaluationContext and returns an EvaluationResult:

import json

def evaluate(ctx):
    output = ctx.observation.output

    # json.loads raises JSONDecodeError on malformed text.
    try:
        json.loads(str(output))
        has_json = True
    except (json.JSONDecodeError, TypeError):
        has_json = False

    return EvaluationResult(scores=[
        Score(
            name="valid_json",
            value=has_json,
            data_type="BOOLEAN",
            comment="Output is valid JSON" if has_json else "Not valid JSON",
        ),
    ])
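To exercise an evaluator before pasting it into the editor, one option is to run it locally against stand-in objects. The sketch below is only an approximation: the Score and EvaluationResult dataclasses and the make_ctx helper are hypothetical stand-ins that mirror just the fields used on this page, not the platform's actual classes.

```python
import json
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Any, List, Optional


# Hypothetical local stand-ins. The real runtime supplies its own
# Score and EvaluationResult, which may differ in detail.
@dataclass
class Score:
    name: str
    value: Any
    data_type: str
    comment: Optional[str] = None


@dataclass
class EvaluationResult:
    scores: List[Score] = field(default_factory=list)


def make_ctx(output, input_value=None, metadata=None):
    """Build a fake ctx exposing ctx.observation.input/output/metadata."""
    observation = SimpleNamespace(
        input=input_value, output=output, metadata=metadata or {}
    )
    return SimpleNamespace(observation=observation)


def evaluate(ctx):
    """Same check as the example above: is the output valid JSON?"""
    try:
        json.loads(str(ctx.observation.output))
        has_json = True
    except (json.JSONDecodeError, TypeError):
        has_json = False
    return EvaluationResult(scores=[
        Score(name="valid_json", value=has_json, data_type="BOOLEAN"),
    ])


if __name__ == "__main__":
    for sample in ['{"a": 1}', "not json"]:
        result = evaluate(make_ctx(sample))
        print(sample, "->", result.scores[0].value)
```

When moving the function into the editor, drop the stand-in classes and make_ctx. Run Test against a real sample span remains the authoritative check.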

Context Fields

The ctx.observation object contains:

| Field | Description |
| --- | --- |
| ctx.observation.input | Parsed gen_ai.input.messages (JSON if valid, raw string otherwise) |
| ctx.observation.output | Parsed gen_ai.output.messages (JSON if valid, raw string otherwise) |
| ctx.observation.metadata | All span tags as a dict (e.g. {"gen_ai.request.model": "gpt-4o", ...}) |

Accessing metadata

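# Indexing raises KeyError if the tag is absent on the span.
model = ctx.observation.metadata["gen_ai.request.model"]
# .get() returns the default instead of raising.
system = ctx.observation.metadata.get("gen_ai.system_instructions", "")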
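Because input and output may arrive either as parsed JSON or as a raw string, text-based checks may want to normalize the value first. Calling str() on a dict or list yields Python's repr (single quotes), which is not valid JSON. The as_text helper below is a hypothetical name, and the sketch assumes parsed values are ordinary dicts, lists, strings, and numbers.

```python
import json


def as_text(value):
    """Return a string view of a field that may be parsed JSON or raw text.

    Hypothetical helper: strings pass through, None becomes "", and
    anything else (dict, list, number) is serialized back to JSON text.
    """
    if value is None:
        return ""
    if isinstance(value, str):
        return value
    try:
        return json.dumps(value, ensure_ascii=False)
    except (TypeError, ValueError):
        # Fall back to str() for values json cannot serialize.
        return str(value)
```

Inside evaluate, this could be used as output = as_text(ctx.observation.output) wherever a check needs plain text.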

Score Types

Each Score requires a data_type:

| data_type | Value type | Example |
| --- | --- | --- |
| NUMERIC | int or float | 0.85 |
| BOOLEAN | bool | True / False |
| CATEGORICAL | str | "good", "bad" |

You can return multiple scores per evaluation.
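As an illustration of returning several scores at once, the sketch below emits one score of each data_type from a single pass over the output. The score names, the 200-word limit, and the bucket thresholds are arbitrary choices for the example, and Score and EvaluationResult are assumed to be supplied by the evaluator runtime, as in the other examples on this page.

```python
def score_fields(text):
    """Compute one (name, value, data_type) triple per score type."""
    words = text.split()
    # BOOLEAN: non-empty and within an illustrative 200-word limit.
    length_ok = 0 < len(words) <= 200
    # NUMERIC: fraction of words that are unique, in [0.0, 1.0].
    diversity = len(set(words)) / len(words) if words else 0.0
    # CATEGORICAL: coarse size bucket.
    if len(words) < 20:
        bucket = "short"
    elif len(words) < 100:
        bucket = "medium"
    else:
        bucket = "long"
    return [
        ("length_ok", length_ok, "BOOLEAN"),
        ("word_diversity", diversity, "NUMERIC"),
        ("length_bucket", bucket, "CATEGORICAL"),
    ]


def evaluate(ctx):
    # Score and EvaluationResult are assumed to be provided by the
    # evaluator runtime; they are not defined in this snippet.
    text = str(ctx.observation.output or "")
    return EvaluationResult(scores=[
        Score(name=name, value=value, data_type=data_type)
        for name, value, data_type in score_fields(text)
    ])
```

Keeping the scoring logic in a plain function such as score_fields makes it easy to test without the runtime classes.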

Setup Steps

  1. Go to GenAI → Evaluators → New Code Evaluator
  2. Write your Python code in the editor
  3. Select a sample span and click Run Test to verify
  4. Save the template
  5. Create a rule to deploy the evaluator

Runtime Constraints

Each execution runs in an isolated microVM (hardware-level isolation). The VM is destroyed after each run, so no state persists between evaluations.

| Constraint | Value |
| --- | --- |
| Language | Python only |
| Timeout | 5 seconds |
| Memory | 128 MB |
| Network access | None |
| File system | Ephemeral (destroyed after run) |
| Max source size | 256 KB |
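Given the 5-second timeout, an evaluator that loops over a large output may want to stop early and report a partial result rather than be cut off. The following is a minimal sketch of that idea. The helper name and the 2-second default budget are assumptions chosen for illustration, not platform recommendations.

```python
import time


def count_matches_within_budget(lines, predicate, budget_seconds=2.0):
    """Apply predicate to lines until done or the time budget runs out.

    Returns (matches, lines_checked, completed).
    """
    deadline = time.monotonic() + budget_seconds
    matches = 0
    checked = 0
    for line in lines:
        if time.monotonic() >= deadline:
            # Out of time: return what was counted so far.
            return matches, checked, False
        if predicate(line):
            matches += 1
        checked += 1
    return matches, checked, True
```

The completed flag can be surfaced in a score comment so that partial evaluations are distinguishable from full ones.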

Available modules

The full Python standard library is available inside the VM, including json, re, math, datetime, string, collections, itertools, and functools.

Third-party packages cannot be imported, and the VM has no network access.
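Standard-library modules beyond those listed can cover checks that might otherwise call for a third-party package. As one sketch, difflib can produce a NUMERIC similarity score without any extra dependency. The score name and the reference string are illustrative, and Score and EvaluationResult are assumed to come from the evaluator runtime.

```python
from difflib import SequenceMatcher


def similarity(output, expected):
    """Return a 0.0-1.0 similarity ratio between two stripped strings."""
    return SequenceMatcher(None, output.strip(), expected.strip()).ratio()


def evaluate(ctx):
    # Score and EvaluationResult are assumed to be provided by the
    # evaluator runtime; they are not defined in this snippet.
    output = str(ctx.observation.output or "")
    expected = "Hello, world!"  # illustrative reference text
    return EvaluationResult(scores=[
        Score(
            name="similarity",
            value=round(similarity(output, expected), 3),
            data_type="NUMERIC",
        ),
    ])
```

SequenceMatcher compares character sequences, so this measures surface similarity only, not meaning.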

Examples

Exact match

def evaluate(ctx):
    output = str(ctx.observation.output or "")
    expected = "Hello, world!"
    match = output.strip() == expected

    return EvaluationResult(scores=[
        Score(
            name="exact_match",
            value=match,
            data_type="BOOLEAN",
        ),
    ])

Regex validation

import re

def evaluate(ctx):
    output = str(ctx.observation.output or "")
    has_email = bool(
        re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", output)
    )

    return EvaluationResult(scores=[
        Score(
            name="contains_email",
            value=has_email,
            data_type="BOOLEAN",
        ),
    ])

JSON schema check

import json

def evaluate(ctx):
    output = str(ctx.observation.output or "")
    required_keys = ["name", "age", "email"]

    try:
        parsed = json.loads(output)
        # Only a JSON object can carry keys; lists, strings, and numbers fail.
        has_all = isinstance(parsed, dict) and all(
            k in parsed for k in required_keys
        )
    except (json.JSONDecodeError, TypeError):
        has_all = False

    return EvaluationResult(scores=[
        Score(
            name="schema_valid",
            value=has_all,
            data_type="BOOLEAN",
            comment=None if has_all else "Invalid JSON or missing keys",
        ),
    ])

Keyword detection

def evaluate(ctx):
    output = str(ctx.observation.output or "").lower()
    blocked = ["password", "secret", "api_key"]
    found = [w for w in blocked if w in output]

    return EvaluationResult(scores=[
        Score(
            name="no_secrets",
            value=len(found) == 0,
            data_type="BOOLEAN",
            comment=f"Found: {found}" if found else None,
        ),
    ])

Viewing Execution History

Code evaluator executions are recorded as trace spans. To view them:

  1. Go to the Traces page
  2. Filter by gen_ai.operation.name = code_eval
  3. Each span shows status, duration, scores, and errors

From the Evaluators page, click View executions on any rule to jump to a pre-filtered trace view.

Error Codes

| Code | Meaning |
| --- | --- |
| INVALID_SOURCE | Syntax error or missing evaluate function |
| USER_CODE_ERROR | Runtime exception in your code |
| TIMEOUT | Exceeded 5-second limit |
| INVALID_RESULT | Return value doesn't match expected shape |
| RESULT_TOO_LARGE | Result exceeds 256 KB |
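Two of these errors can be made less likely from inside the evaluator: USER_CODE_ERROR by catching exceptions in the scoring logic, and RESULT_TOO_LARGE by capping comment length. The sketch below shows one possible pattern. The helper names, the "pass"/"fail"/"error" categories, and the 500-character cap are assumptions made for illustration.

```python
MAX_COMMENT_CHARS = 500  # arbitrary illustrative cap, far below 256 KB


def short_comment(text, limit=MAX_COMMENT_CHARS):
    """Truncate a comment so a large span cannot inflate the result."""
    if len(text) <= limit:
        return text
    return text[: limit - 3] + "..."


def run_check(check, text):
    """Run check(text) and report failures as a category instead of raising.

    Returns (category, comment), where category is "pass", "fail",
    or "error".
    """
    try:
        passed = bool(check(text))
    except Exception as exc:  # deliberately broad: any failure becomes "error"
        return "error", short_comment(f"{type(exc).__name__}: {exc}")
    return ("pass" if passed else "fail"), None
```

The returned category fits a CATEGORICAL score. Whether to catch exceptions at all is a design choice: letting them propagate surfaces the failure as USER_CODE_ERROR in the execution history, while catching them keeps a score on every span.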
