
Benchmarking

TemplateBenchmark runs automated test suites against cognitive templates. Each benchmark is a YAML file defining test questions, minimum score thresholds, and required output terms. The benchmark runs every test case and reports pass/fail rates, average quality scores, and CAI values.

from mycontext.intelligence import TemplateBenchmark

bench = TemplateBenchmark(provider="openai")

result = bench.run("root_cause_analyzer")
print(f"Passed {result.passed}/{result.total_cases}")
print(f"Avg score: {result.avg_score:.1%}")
print(f"Avg CAI: {result.avg_cai:.2f}x")
print(TemplateBenchmark.report(result))

Constructor

TemplateBenchmark(
    provider: str = "openai",
    eval_mode: str = "heuristic",
    benchmarks_dir: Path | None = None,
    model: str | None = None,
)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| provider | str | "openai" | LLM provider for execution and evaluation |
| eval_mode | str | "heuristic" | How to score outputs: "heuristic", "llm", or "hybrid" |
| benchmarks_dir | Path \| None | SDK benchmarks dir | Custom directory for benchmark YAML files |
| model | str \| None | None | Override model name |

Running Benchmarks

run(template_name) — Single Template

result = bench.run(
    "root_cause_analyzer",
    api_key="sk-...",  # Optional: pass API key explicitly
)

Returns: BenchmarkResult

run_all() — All Templates

all_results = bench.run_all()
for template_name, result in all_results.items():
    status = "PASS" if result.passed == result.total_cases else "PARTIAL"
    print(f"[{status}] {template_name}: {result.passed}/{result.total_cases} ({result.avg_cai:.2f}x CAI)")

Returns: dict[str, BenchmarkResult]

list_benchmarks() — Discover Available Benchmarks

available = bench.list_benchmarks()
print(available)
# → ['root_cause_analyzer', 'step_by_step_reasoner', 'hypothesis_generator', ...]

Benchmark YAML Format

Benchmark files live at src/mycontext/benchmarks/<template_name>.yaml. Each file defines test cases:

template: root_cause_analyzer
description: Benchmark for root cause analysis quality

test_cases:
  - question: "Why did our API response times triple after the last deployment?"
    expected:
      min_score: 0.60      # Minimum OutputEvaluator score (0.0–1.0)
      must_contain:        # Terms that must appear in the output
        - "root cause"
        - "recommendation"

  - question: "Our mobile app crash rate spiked 300% on iOS 17 devices."
    expected:
      min_score: 0.65
      must_contain:
        - "cause"
        - "impact"

  - question: "Database query times increased 10x after migrating to a new server."
    expected:
      min_score: 0.55
      must_contain:
        - "investigate"
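
The expected block implies a simple gate: a case passes only if its evaluator score meets min_score and every must_contain term appears in the output. A minimal sketch of that check (case-insensitive substring matching is an assumption here, not documented SDK behavior):

```python
# Sketch of the pass/fail gate implied by the YAML schema above.
# NOTE: case-insensitive substring matching is an assumption; the SDK's
# exact matching rules may differ.
def case_passes(output: str, score: float, expected: dict) -> bool:
    if score < expected.get("min_score", 0.0):
        return False
    haystack = output.lower()
    return all(term.lower() in haystack for term in expected.get("must_contain", []))

expected = {"min_score": 0.60, "must_contain": ["root cause", "recommendation"]}
print(case_passes("Root cause: bad deploy. Recommendation: roll back.", 0.74, expected))
# → True
print(case_passes("Looks like a config issue.", 0.74, expected))
# → False
```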

Writing Your Own Benchmarks

Create a YAML file in your benchmarks directory:

template: my_custom_template
description: Tests for my domain-specific template

test_cases:
  - question: "Your test question here"
    expected:
      min_score: 0.60
      must_contain:
        - "term that must appear"
        - "another required term"
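
Before running a new file, it can be worth sanity-checking its structure. A small validator over the parsed document (load it with any YAML library first; the field names below come from the format shown on this page, nothing more):

```python
# Validate a parsed benchmark definition against the schema shown above.
# Operates on a plain dict, so any YAML loader can feed it.
def validate_benchmark(data: dict) -> list[str]:
    errors = []
    if not data.get("template"):
        errors.append("missing 'template'")
    cases = data.get("test_cases") or []
    if not cases:
        errors.append("no 'test_cases' defined")
    for i, case in enumerate(cases):
        if not case.get("question"):
            errors.append(f"case {i}: missing 'question'")
        exp = case.get("expected", {})
        score = exp.get("min_score", 0.0)
        if not 0.0 <= score <= 1.0:
            errors.append(f"case {i}: min_score {score} out of range")
    return errors

doc = {
    "template": "my_custom_template",
    "test_cases": [{"question": "Your test question here",
                    "expected": {"min_score": 0.60,
                                 "must_contain": ["term that must appear"]}}],
}
print(validate_benchmark(doc))
# → []
```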

Then run with a custom directory:

from pathlib import Path
from mycontext.intelligence import TemplateBenchmark

bench = TemplateBenchmark(
    provider="openai",
    benchmarks_dir=Path("./my-benchmarks"),
)
result = bench.run("my_custom_template")

Result Objects

BenchmarkResult

@dataclass
class BenchmarkResult:
    template_name: str
    total_cases: int
    passed: int
    failed: int
    avg_score: float   # Average OutputEvaluator score across all cases
    avg_cai: float     # Average CAI across all cases
    per_case: list[CaseResult]
    metadata: dict     # Provider, eval_mode
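
A pass rate is not stored on the result, but it falls out of the documented fields directly. A self-contained sketch (the stand-in class mirrors the fields above, with per_case and metadata omitted for brevity; the values are invented):

```python
from dataclasses import dataclass

# Stand-in mirroring the documented BenchmarkResult fields.
@dataclass
class BenchmarkResult:
    template_name: str
    total_cases: int
    passed: int
    failed: int
    avg_score: float
    avg_cai: float

r = BenchmarkResult("root_cause_analyzer", 3, 3, 0, 0.714, 1.58)
# Guard against zero cases before dividing.
pass_rate = r.passed / r.total_cases if r.total_cases else 0.0
print(f"{r.template_name}: {pass_rate:.0%} pass rate")
# → root_cause_analyzer: 100% pass rate
```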

CaseResult

@dataclass
class CaseResult:
    question: str
    passed: bool
    output_score: float   # OutputEvaluator score for this case
    cai: float            # CAI for this case
    issues: list[str]     # Why it failed (empty if passed)
    details: dict         # raw_score, verdict

report() — Detailed Report

print(TemplateBenchmark.report(result))

Output:

Benchmark Report: root_cause_analyzer
======================================

Cases: 3 | Passed: 3 | Failed: 0
Avg Output Score: 71.4%
Avg CAI: 1.58x

[PASS] Case 1: score=74.2%, CAI=1.71x
Q: Why did our API response times triple after the last deployment?...
[PASS] Case 2: score=69.8%, CAI=1.54x
Q: Our mobile app crash rate spiked 300% on iOS 17 devices....
[PASS] Case 3: score=70.3%, CAI=1.48x
Q: Database query times increased 10x after migrating to a new server....

Examples

Regression Testing in CI

from mycontext.intelligence import TemplateBenchmark

def test_template_quality():
    bench = TemplateBenchmark(
        provider="openai",
        eval_mode="heuristic",
    )

    critical_templates = [
        "root_cause_analyzer",
        "risk_assessor",
        "decision_framework",
    ]

    for template_name in critical_templates:
        result = bench.run(template_name)

        # Assert minimum pass rate
        pass_rate = result.passed / result.total_cases if result.total_cases > 0 else 0
        assert pass_rate >= 0.80, (
            f"{template_name}: only {result.passed}/{result.total_cases} passed"
        )

        # Assert minimum average quality
        assert result.avg_score >= 0.60, (
            f"{template_name}: avg score {result.avg_score:.1%} below 60%"
        )

        # Assert positive CAI (template adds value)
        assert result.avg_cai >= 1.1, (
            f"{template_name}: CAI {result.avg_cai:.2f}x, template not adding value"
        )

        print(f"[PASS] {template_name}: {result.passed}/{result.total_cases}, "
              f"avg={result.avg_score:.1%}, CAI={result.avg_cai:.2f}x")

Inspect Failures

bench = TemplateBenchmark(provider="openai")
result = bench.run("risk_assessor")

for case in result.per_case:
    if not case.passed:
        print(f"FAILED: {case.question}")
        for issue in case.issues:
            print(f"  - {issue}")
        print(f"  Score: {case.output_score:.1%}, CAI: {case.cai:.2f}x")
        print(f"  Raw score was: {case.details.get('raw_score', 0):.1%}")
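
When several cases fail, tallying the issues strings across cases can surface a common failure mode faster than reading case by case. A self-contained sketch (the issue messages here are invented for illustration, not actual SDK output):

```python
from collections import Counter

# Invented lists standing in for the issues field of failed cases.
failed_case_issues = [
    ["missing required term: 'recommendation'"],
    ["missing required term: 'recommendation'", "score 0.52 below min_score 0.60"],
]
# Flatten and count each distinct issue string.
tally = Counter(issue for issues in failed_case_issues for issue in issues)
print(tally.most_common(1))
# → [("missing required term: 'recommendation'", 2)]
```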

Compare Two Models

from mycontext.intelligence import TemplateBenchmark

template = "root_cause_analyzer"

bench_mini = TemplateBenchmark(provider="openai", model="gpt-4o-mini")
bench_full = TemplateBenchmark(provider="openai", model="gpt-4o")

r1 = bench_mini.run(template)
r2 = bench_full.run(template)

print(f"gpt-4o-mini: {r1.avg_score:.1%} avg score, {r1.avg_cai:.2f}x CAI")
print(f"gpt-4o: {r2.avg_score:.1%} avg score, {r2.avg_cai:.2f}x CAI")
print(f"Quality delta: +{(r2.avg_score - r1.avg_score) * 100:.1f}%")

Run Full Suite, Filter Poor Performers

bench = TemplateBenchmark(provider="openai", eval_mode="heuristic")
all_results = bench.run_all()

print("Templates with CAI < 1.2 (may need improvement):")
for name, result in all_results.items():
    if result.total_cases > 0 and result.avg_cai < 1.2:
        print(f"  {name}: CAI={result.avg_cai:.2f}x, score={result.avg_score:.1%}")

print("\nTop performers by CAI:")
sorted_by_cai = sorted(
    [(n, r) for n, r in all_results.items() if r.total_cases > 0],
    key=lambda x: x[1].avg_cai,
    reverse=True,
)
for name, result in sorted_by_cai[:5]:
    print(f"  {name}: CAI={result.avg_cai:.2f}x, score={result.avg_score:.1%}")

API Reference

TemplateBenchmark

| Method | Returns | Description |
| --- | --- | --- |
| __init__(provider, eval_mode, benchmarks_dir, model) | None | Initialize the benchmark runner |
| run(template_name, **kwargs) | BenchmarkResult | Run benchmark for one template |
| run_all(**kwargs) | dict[str, BenchmarkResult] | Run all available benchmarks |
| list_benchmarks() | list[str] | Names of available benchmark files |
| report(result) | str | Human-readable report (static method) |

BenchmarkResult

| Field | Type | Description |
| --- | --- | --- |
| template_name | str | Template that was benchmarked |
| total_cases | int | Total test cases |
| passed | int | Cases that passed all criteria |
| failed | int | Cases that failed |
| avg_score | float | Average output quality score |
| avg_cai | float | Average Context Amplification Index |
| per_case | list[CaseResult] | Individual case results |
| metadata | dict | Provider, eval_mode |

CaseResult

| Field | Type | Description |
| --- | --- | --- |
| question | str | Test question |
| passed | bool | Whether all criteria were met |
| output_score | float | OutputEvaluator score |
| cai | float | CAI for this case |
| issues | list[str] | Failure reasons |
| details | dict | raw_score, verdict |