Eval Criteria

eval_criteria provides a catalog of pre-built GEvalCriteria objects for use with DeepEval's GEval metric. Each criterion is an LLM-judge rubric — a name, scoring steps, and threshold — validated against mycontext's own template evaluation experiments.
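
Each GEvalCriteria carries exactly the fields GEval needs: a name, a list of evaluation steps, and a threshold (the same attributes used in the "Using individual criteria" section below). A quick way to inspect one, with illustrative output values:

```python
from mycontext.intelligence.eval_criteria import DATA_GAP_HONESTY

# The three fields every criterion exposes: rubric name, the judge's
# scoring steps, and the pass/fail threshold. Printed values are
# illustrative, not the actual rubric text.
print(DATA_GAP_HONESTY.name)       # e.g. "Data Gap Honesty"
print(DATA_GAP_HONESTY.threshold)  # e.g. 0.7
for step in DATA_GAP_HONESTY.evaluation_steps:
    print("-", step)
```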

Optional dependency

eval_criteria requires DeepEval: pip install deepeval. It is not needed for QualityMetrics or OutputEvaluator, which run without any extra dependencies.
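
If DeepEval may be absent in some environments, a guard like the following (a minimal sketch, not part of the library) lets you fall back to the native evaluators:

```python
# Minimal sketch: skip the LLM-judge path when deepeval isn't installed.
try:
    import deepeval  # noqa: F401
    HAVE_DEEPEVAL = True
except ImportError:
    HAVE_DEEPEVAL = False  # fall back to QualityMetrics / OutputEvaluator
```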

When to use this (vs the native evaluators)

| Tool | When to use |
| --- | --- |
| QualityMetrics | Score any prompt/context on 6 dimensions; instant, no LLM call, use in every experiment |
| OutputEvaluator | Score LLM output on 5 dimensions; fast heuristic, use in every experiment |
| eval_criteria + DeepEval | Validate specific behavioral properties (gap honesty, causation discipline) on a sample; use when you need LLM-judge confidence on a targeted criterion |

Import

```python
from mycontext.intelligence import get_criteria, to_deepeval_metrics
from mycontext.intelligence.eval_criteria import (
    EVIDENCE_CITATION,
    CAUSATION_DISCIPLINE,
    DATA_GAP_HONESTY,
    INSTRUCTION_ADHERENCE,
    ACTIONABILITY,
    REASONING_SOUNDNESS,
    STRUCTURE_COMPLIANCE,
    COGNITIVE_SCAFFOLDING_USE,
    CODE_REVIEW_SEVERITY_ACCURACY,
    CODE_REVIEW_ACTIONABILITY,
)
```

The ten criteria

| Criterion | Bundle | What it measures |
| --- | --- | --- |
| EVIDENCE_CITATION | data_analysis, reasoning | Does every claim cite the specific data point supporting it? |
| CAUSATION_DISCIPLINE | data_analysis, reasoning | Does the response avoid inferring causation from correlation? |
| DATA_GAP_HONESTY | data_analysis | Does the response explicitly name what data is missing and what that prevents? |
| INSTRUCTION_ADHERENCE | instruction_following, general | Does the output follow all explicit instructions in the context? |
| ACTIONABILITY | general | Are recommendations specific enough to act on without further research? |
| REASONING_SOUNDNESS | reasoning | Is the reasoning chain logically valid and free of fallacies? |
| STRUCTURE_COMPLIANCE | instruction_following | Does the output match the requested format/structure? |
| COGNITIVE_SCAFFOLDING_USE | general | Does the output use the reasoning framework specified in the template? |
| CODE_REVIEW_SEVERITY_ACCURACY | code_review | Are severity labels (critical/medium/low) applied correctly? |
| CODE_REVIEW_ACTIONABILITY | code_review | Does each finding include a concrete, executable fix? |
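
The Bundle column doubles as the argument to get_criteria. A quick sketch, assuming get_criteria accepts each bundle name listed above (only "data_analysis" is demonstrated explicitly below):

```python
from mycontext.intelligence import get_criteria

# Bundle names taken from the table above; each call returns the
# matching list of GEvalCriteria objects.
for bundle in ("data_analysis", "reasoning", "instruction_following",
               "general", "code_review"):
    criteria = get_criteria(bundle)
    print(bundle, "->", [c.name for c in criteria])
```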

Using a bundle

```python
from mycontext.intelligence import get_criteria, to_deepeval_metrics

criteria = get_criteria("data_analysis")
# → [EVIDENCE_CITATION, CAUSATION_DISCIPLINE, DATA_GAP_HONESTY]

metrics = to_deepeval_metrics(criteria)
# → list[deepeval.metrics.GEval]
```
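
Bundles overlap (EVIDENCE_CITATION and CAUSATION_DISCIPLINE appear in both data_analysis and reasoning), so if you combine bundles, a small dedupe avoids judging the same criterion twice. A minimal sketch; the dedupe is user code, not a library helper:

```python
from mycontext.intelligence import get_criteria, to_deepeval_metrics

combined = get_criteria("data_analysis") + get_criteria("reasoning")

# Deduplicate by criterion name while preserving order.
seen, unique = set(), []
for c in combined:
    if c.name not in seen:
        seen.add(c.name)
        unique.append(c)

metrics = to_deepeval_metrics(unique)
```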

Full example with DeepEval

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from mycontext.intelligence import get_criteria, to_deepeval_metrics
from mycontext.templates.free.analysis import DataAnalyzer

template = DataAnalyzer()
ctx = template.build_context(dataset_description="Monthly sales by product line, Q1–Q4 2024")
response = ctx.execute(provider="openai")

test_case = LLMTestCase(
    input=ctx.assemble(),
    actual_output=response.response,
)

metrics = to_deepeval_metrics(get_criteria("data_analysis"))
evaluate([test_case], metrics)
```
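
The same check maps directly onto DeepEval's pytest integration if you want it as a CI gate. A sketch, assuming DeepEval's assert_test helper, which raises when any metric scores below its threshold:

```python
from deepeval import assert_test

def test_data_analysis_criteria():
    # test_case and metrics as built in the example above; the test
    # fails if any bundled criterion falls below its threshold.
    assert_test(test_case, metrics)
```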

Using individual criteria

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
from mycontext.intelligence.eval_criteria import DATA_GAP_HONESTY

metric = GEval(
    name=DATA_GAP_HONESTY.name,
    evaluation_steps=DATA_GAP_HONESTY.evaluation_steps,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=DATA_GAP_HONESTY.threshold,
)
```
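
The resulting metric can also run on its own, outside evaluate(), via DeepEval's standard measure interface. A minimal sketch, reusing ctx and response from the full example above:

```python
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(input=ctx.assemble(), actual_output=response.response)

metric.measure(test_case)      # calls the LLM judge once
print(metric.score)            # 0.0–1.0
print(metric.reason)           # the judge's explanation
print(metric.is_successful())  # whether score cleared the threshold
```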

Using as a quality gate in experiments

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from mycontext.intelligence import QualityMetrics, OutputEvaluator, get_criteria, to_deepeval_metrics

qm = QualityMetrics()
evaluator = OutputEvaluator()

# ctx and response come from the full example above.
# Step 1: fast heuristic pass (always)
prompt_score = qm.evaluate(ctx)
output_score = evaluator.evaluate(ctx, response.response)

# Step 2: targeted LLM-judge pass (on samples that pass step 1)
if output_score.overall > 0.70:
    metrics = to_deepeval_metrics(get_criteria("data_analysis"))
    test_case = LLMTestCase(input=ctx.assemble(), actual_output=response.response)
    evaluate([test_case], metrics)
```

See also