
Benchmarking

TemplateBenchmark runs automated test suites against cognitive templates. Each benchmark is a YAML file defining test questions, minimum score thresholds, and required output terms. The benchmark runs every test case and reports pass/fail rates, average quality scores, and CAI values.

from mycontext.intelligence import TemplateBenchmark

bench = TemplateBenchmark(provider="openai")

result = bench.run("root_cause_analyzer")
print(f"Passed {result.passed}/{result.total_cases}")
print(f"Avg score: {result.avg_score:.1%}")
print(f"Avg CAI: {result.avg_cai:.2f}x")
print(TemplateBenchmark.report(result))

Constructor

TemplateBenchmark(
    provider: str = "openai",
    eval_mode: str = "heuristic",
    benchmarks_dir: Path | None = None,
    model: str | None = None,
)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| provider | str | "openai" | LLM provider for execution and evaluation |
| eval_mode | str | "heuristic" | How to score outputs: "heuristic", "llm", or "hybrid" |
| benchmarks_dir | Path \| None | SDK benchmarks dir | Custom directory for benchmark YAML files |
| model | str \| None | None | Override model name |

Running Benchmarks

run(template_name) — Single Template

result = bench.run(
    "root_cause_analyzer",
    api_key="sk-...",  # Optional: pass API key explicitly
)

Returns: BenchmarkResult

run_all() — All Templates

all_results = bench.run_all()
for template_name, result in all_results.items():
    status = "PASS" if result.passed == result.total_cases else "PARTIAL"
    print(f"[{status}] {template_name}: {result.passed}/{result.total_cases} ({result.avg_cai:.2f}x CAI)")

Returns: dict[str, BenchmarkResult]

list_benchmarks() — Discover Available Benchmarks

available = bench.list_benchmarks()
print(available)
# → ['root_cause_analyzer', 'step_by_step_reasoner', 'hypothesis_generator', ...]

Benchmark YAML Format

Benchmark files live at src/mycontext/benchmarks/<template_name>.yaml. Each file defines test cases:

template: root_cause_analyzer
description: Benchmark for root cause analysis quality

test_cases:
  - question: "Why did our API response times triple after the last deployment?"
    expected:
      min_score: 0.60      # Minimum OutputEvaluator score (0.0–1.0)
      must_contain:        # Terms that must appear in the output
        - "root cause"
        - "recommendation"

  - question: "Our mobile app crash rate spiked 300% on iOS 17 devices."
    expected:
      min_score: 0.65
      must_contain:
        - "cause"
        - "impact"

  - question: "Database query times increased 10x after migrating to a new server."
    expected:
      min_score: 0.55
      must_contain:
        - "investigate"
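
The expected block implies a simple gate: a case passes only if its evaluator score meets min_score and every must_contain term appears in the output. A minimal sketch of that check (case-insensitive substring matching is an assumption here, not documented SDK behavior):

```python
# Sketch of the pass/fail gate implied by the YAML schema above.
# NOTE: case-insensitive substring matching is an assumption; the SDK's
# exact matching rules may differ.
def case_passes(output: str, score: float, expected: dict) -> bool:
    if score < expected.get("min_score", 0.0):
        return False
    haystack = output.lower()
    return all(term.lower() in haystack for term in expected.get("must_contain", []))

expected = {"min_score": 0.60, "must_contain": ["root cause", "recommendation"]}
print(case_passes("Root cause: bad deploy. Recommendation: roll back.", 0.74, expected))
# → True
print(case_passes("Looks like a config issue.", 0.74, expected))
# → False
```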

Writing Your Own Benchmarks

Create a YAML file in your benchmarks directory:

template: my_custom_template
description: Tests for my domain-specific template

test_cases:
  - question: "Your test question here"
    expected:
      min_score: 0.60
      must_contain:
        - "term that must appear"
        - "another required term"
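
Before running a new file, it can be worth sanity-checking its structure. A small validator over the parsed document (load it with any YAML library first; the field names below come from the format shown on this page, nothing more):

```python
# Validate a parsed benchmark definition against the schema shown above.
# Operates on a plain dict, so any YAML loader can feed it.
def validate_benchmark(data: dict) -> list[str]:
    errors = []
    if not data.get("template"):
        errors.append("missing 'template'")
    cases = data.get("test_cases") or []
    if not cases:
        errors.append("no 'test_cases' defined")
    for i, case in enumerate(cases):
        if not case.get("question"):
            errors.append(f"case {i}: missing 'question'")
        exp = case.get("expected", {})
        score = exp.get("min_score", 0.0)
        if not 0.0 <= score <= 1.0:
            errors.append(f"case {i}: min_score {score} out of range")
    return errors

doc = {
    "template": "my_custom_template",
    "test_cases": [{"question": "Your test question here",
                    "expected": {"min_score": 0.60,
                                 "must_contain": ["term that must appear"]}}],
}
print(validate_benchmark(doc))
# → []
```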

Then run with a custom directory:

from pathlib import Path
from mycontext.intelligence import TemplateBenchmark

bench = TemplateBenchmark(
    provider="openai",
    benchmarks_dir=Path("./my-benchmarks"),
)
result = bench.run("my_custom_template")

Result Objects

BenchmarkResult

@dataclass
class BenchmarkResult:
    template_name: str
    total_cases: int
    passed: int
    failed: int
    avg_score: float   # Average OutputEvaluator score across all cases
    avg_cai: float     # Average CAI across all cases
    per_case: list[CaseResult]
    metadata: dict     # Provider, eval_mode
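
A pass rate is not stored on the result, but it falls out of the documented fields directly. A self-contained sketch (the stand-in class mirrors the fields above, with per_case and metadata omitted for brevity; the values are invented):

```python
from dataclasses import dataclass

# Stand-in mirroring the documented BenchmarkResult fields.
@dataclass
class BenchmarkResult:
    template_name: str
    total_cases: int
    passed: int
    failed: int
    avg_score: float
    avg_cai: float

r = BenchmarkResult("root_cause_analyzer", 3, 3, 0, 0.714, 1.58)
# Guard against zero cases before dividing.
pass_rate = r.passed / r.total_cases if r.total_cases else 0.0
print(f"{r.template_name}: {pass_rate:.0%} pass rate")
# → root_cause_analyzer: 100% pass rate
```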

CaseResult

@dataclass
class CaseResult:
    question: str
    passed: bool
    output_score: float   # OutputEvaluator score for this case
    cai: float            # CAI for this case
    issues: list[str]     # Why it failed (empty if passed)
    details: dict         # raw_score, verdict

report() — Detailed Report

print(TemplateBenchmark.report(result))

Output:

Benchmark Report: root_cause_analyzer
======================================

Cases: 3 | Passed: 3 | Failed: 0
Avg Output Score: 71.4%
Avg CAI: 1.58x

[PASS] Case 1: score=74.2%, CAI=1.71x
Q: Why did our API response times triple after the last deployment?...
[PASS] Case 2: score=69.8%, CAI=1.54x
Q: Our mobile app crash rate spiked 300% on iOS 17 devices....
[PASS] Case 3: score=70.3%, CAI=1.48x
Q: Database query times increased 10x after migrating to a new server....

Examples

Regression Testing in CI

from mycontext.intelligence import TemplateBenchmark

def test_template_quality():
    bench = TemplateBenchmark(
        provider="openai",
        eval_mode="heuristic",
    )

    critical_templates = [
        "root_cause_analyzer",
        "risk_assessor",
        "decision_framework",
    ]

    for template_name in critical_templates:
        result = bench.run(template_name)

        # Assert minimum pass rate
        pass_rate = result.passed / result.total_cases if result.total_cases > 0 else 0
        assert pass_rate >= 0.80, (
            f"{template_name}: only {result.passed}/{result.total_cases} passed"
        )

        # Assert minimum average quality
        assert result.avg_score >= 0.60, (
            f"{template_name}: avg score {result.avg_score:.1%} below 60%"
        )

        # Assert positive CAI (template adds value)
        assert result.avg_cai >= 1.1, (
            f"{template_name}: CAI {result.avg_cai:.2f}x, template not adding value"
        )

        print(f"[PASS] {template_name}: {result.passed}/{result.total_cases}, "
              f"avg={result.avg_score:.1%}, CAI={result.avg_cai:.2f}x")

Inspect Failures

bench = TemplateBenchmark(provider="openai")
result = bench.run("risk_assessor")

for case in result.per_case:
    if not case.passed:
        print(f"FAILED: {case.question}")
        for issue in case.issues:
            print(f"  - {issue}")
        print(f"  Score: {case.output_score:.1%}, CAI: {case.cai:.2f}x")
        print(f"  Raw score was: {case.details.get('raw_score', 0):.1%}")
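
When several cases fail, tallying the issues strings across cases can surface a common failure mode faster than reading case by case. A self-contained sketch (the issue messages here are invented for illustration, not actual SDK output):

```python
from collections import Counter

# Invented lists standing in for the issues field of failed cases.
failed_case_issues = [
    ["missing required term: 'recommendation'"],
    ["missing required term: 'recommendation'", "score 0.52 below min_score 0.60"],
]
# Flatten and count each distinct issue string.
tally = Counter(issue for issues in failed_case_issues for issue in issues)
print(tally.most_common(1))
# → [("missing required term: 'recommendation'", 2)]
```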

Compare Two Models

from mycontext.intelligence import TemplateBenchmark

template = "root_cause_analyzer"

bench_mini = TemplateBenchmark(provider="openai", model="gpt-4o-mini")
bench_full = TemplateBenchmark(provider="openai", model="gpt-4o")

r1 = bench_mini.run(template)
r2 = bench_full.run(template)

print(f"gpt-4o-mini: {r1.avg_score:.1%} avg score, {r1.avg_cai:.2f}x CAI")
print(f"gpt-4o: {r2.avg_score:.1%} avg score, {r2.avg_cai:.2f}x CAI")
print(f"Quality delta: +{(r2.avg_score - r1.avg_score) * 100:.1f}%")

Run Full Suite, Filter Poor Performers

bench = TemplateBenchmark(provider="openai", eval_mode="heuristic")
all_results = bench.run_all()

print("Templates with CAI < 1.2 (may need improvement):")
for name, result in all_results.items():
    if result.total_cases > 0 and result.avg_cai < 1.2:
        print(f"  {name}: CAI={result.avg_cai:.2f}x, score={result.avg_score:.1%}")

print("\nTop performers by CAI:")
sorted_by_cai = sorted(
    [(n, r) for n, r in all_results.items() if r.total_cases > 0],
    key=lambda x: x[1].avg_cai,
    reverse=True,
)
for name, result in sorted_by_cai[:5]:
    print(f"  {name}: CAI={result.avg_cai:.2f}x, score={result.avg_score:.1%}")

API Reference

TemplateBenchmark

| Method | Returns | Description |
| --- | --- | --- |
| __init__(provider, eval_mode, benchmarks_dir, model) | None | Initialize the benchmark runner |
| run(template_name, **kwargs) | BenchmarkResult | Run benchmark for one template |
| run_all(**kwargs) | dict[str, BenchmarkResult] | Run all available benchmarks |
| list_benchmarks() | list[str] | Names of available benchmark files |
| report(result) | str | Human-readable report (static method) |

BenchmarkResult

| Field | Type | Description |
| --- | --- | --- |
| template_name | str | Template that was benchmarked |
| total_cases | int | Total test cases |
| passed | int | Cases that passed all criteria |
| failed | int | Cases that failed |
| avg_score | float | Average output quality score |
| avg_cai | float | Average Context Amplification Index |
| per_case | list[CaseResult] | Individual case results |
| metadata | dict | Provider, eval_mode |

CaseResult

| Field | Type | Description |
| --- | --- | --- |
| question | str | Test question |
| passed | bool | Whether all criteria were met |
| output_score | float | OutputEvaluator score |
| cai | float | CAI for this case |
| issues | list[str] | Failure reasons |
| details | dict | raw_score, verdict |