Eval Criteria

eval_criteria provides a catalog of pre-built GEvalCriteria objects for use with DeepEval's GEval metric. Each criterion is an LLM-judge rubric — a name, scoring steps, and threshold — validated against mycontext's own template evaluation experiments.
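
Each GEvalCriteria carries exactly the fields GEval needs: a name, a list of evaluation steps, and a threshold (the same attributes used in the "Using individual criteria" section below). A quick way to inspect one, with illustrative output values:

```python
from mycontext.intelligence.eval_criteria import DATA_GAP_HONESTY

# The three fields every criterion exposes: rubric name, the judge's
# scoring steps, and the pass/fail threshold. Printed values are
# illustrative, not the actual rubric text.
print(DATA_GAP_HONESTY.name)       # e.g. "Data Gap Honesty"
print(DATA_GAP_HONESTY.threshold)  # e.g. 0.7
for step in DATA_GAP_HONESTY.evaluation_steps:
    print("-", step)
```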

Optional dependency

eval_criteria requires DeepEval: pip install deepeval. It is not needed for QualityMetrics or OutputEvaluator, which run without any extra dependencies.
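
If DeepEval may be absent in some environments, a guard like the following (a minimal sketch, not part of the library) lets you fall back to the native evaluators:

```python
# Minimal sketch: skip the LLM-judge path when deepeval isn't installed.
try:
    import deepeval  # noqa: F401
    HAVE_DEEPEVAL = True
except ImportError:
    HAVE_DEEPEVAL = False  # fall back to QualityMetrics / OutputEvaluator
```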

When to use this (vs the native evaluators)

| Tool | When to use |
| --- | --- |
| QualityMetrics | Score any prompt/context on 6 dimensions; instant, no LLM call, use in every experiment |
| OutputEvaluator | Score LLM output on 5 dimensions; fast heuristic, use in every experiment |
| eval_criteria + DeepEval | Validate specific behavioral properties (gap honesty, causation discipline) on a sample; use when you need LLM-judge confidence on a targeted criterion |

Import

```python
from mycontext.intelligence import get_criteria, to_deepeval_metrics
from mycontext.intelligence.eval_criteria import (
    EVIDENCE_CITATION,
    CAUSATION_DISCIPLINE,
    DATA_GAP_HONESTY,
    INSTRUCTION_ADHERENCE,
    ACTIONABILITY,
    REASONING_SOUNDNESS,
    STRUCTURE_COMPLIANCE,
    COGNITIVE_SCAFFOLDING_USE,
    CODE_REVIEW_SEVERITY_ACCURACY,
    CODE_REVIEW_ACTIONABILITY,
)
```

The ten criteria

| Criterion | Bundle | What it measures |
| --- | --- | --- |
| EVIDENCE_CITATION | data_analysis, reasoning | Does every claim cite the specific data point supporting it? |
| CAUSATION_DISCIPLINE | data_analysis, reasoning | Does the response avoid inferring causation from correlation? |
| DATA_GAP_HONESTY | data_analysis | Does the response explicitly name what data is missing and what that prevents? |
| INSTRUCTION_ADHERENCE | instruction_following, general | Does the output follow all explicit instructions in the context? |
| ACTIONABILITY | general | Are recommendations specific enough to act on without further research? |
| REASONING_SOUNDNESS | reasoning | Is the reasoning chain logically valid and free of fallacies? |
| STRUCTURE_COMPLIANCE | instruction_following | Does the output match the requested format/structure? |
| COGNITIVE_SCAFFOLDING_USE | general | Does the output use the reasoning framework specified in the template? |
| CODE_REVIEW_SEVERITY_ACCURACY | code_review | Are severity labels (critical/medium/low) applied correctly? |
| CODE_REVIEW_ACTIONABILITY | code_review | Does each finding include a concrete, executable fix? |
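
The Bundle column doubles as the argument to get_criteria. A quick sketch, assuming get_criteria accepts each bundle name listed above (only "data_analysis" is demonstrated explicitly below):

```python
from mycontext.intelligence import get_criteria

# Bundle names taken from the table above; each call returns the
# matching list of GEvalCriteria objects.
for bundle in ("data_analysis", "reasoning", "instruction_following",
               "general", "code_review"):
    criteria = get_criteria(bundle)
    print(bundle, "->", [c.name for c in criteria])
```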

Using a bundle

```python
from mycontext.intelligence import get_criteria, to_deepeval_metrics

criteria = get_criteria("data_analysis")
# → [EVIDENCE_CITATION, CAUSATION_DISCIPLINE, DATA_GAP_HONESTY]

metrics = to_deepeval_metrics(criteria)
# → list[deepeval.metrics.GEval]
```
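
Bundles overlap (EVIDENCE_CITATION and CAUSATION_DISCIPLINE appear in both data_analysis and reasoning), so if you combine bundles, a small dedupe avoids judging the same criterion twice. A minimal sketch; the dedupe is user code, not a library helper:

```python
from mycontext.intelligence import get_criteria, to_deepeval_metrics

combined = get_criteria("data_analysis") + get_criteria("reasoning")

# Deduplicate by criterion name while preserving order.
seen, unique = set(), []
for c in combined:
    if c.name not in seen:
        seen.add(c.name)
        unique.append(c)

metrics = to_deepeval_metrics(unique)
```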

Full example with DeepEval

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from mycontext.intelligence import get_criteria, to_deepeval_metrics
from mycontext.templates.free.analysis import DataAnalyzer

template = DataAnalyzer()
ctx = template.build_context(dataset_description="Monthly sales by product line, Q1–Q4 2024")
response = ctx.execute(provider="openai")

test_case = LLMTestCase(
    input=ctx.assemble(),
    actual_output=response.response,
)

metrics = to_deepeval_metrics(get_criteria("data_analysis"))
evaluate([test_case], metrics)
```
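
The same check maps directly onto DeepEval's pytest integration if you want it as a CI gate. A sketch, assuming DeepEval's assert_test helper, which raises when any metric scores below its threshold:

```python
from deepeval import assert_test

def test_data_analysis_criteria():
    # test_case and metrics as built in the example above; the test
    # fails if any bundled criterion falls below its threshold.
    assert_test(test_case, metrics)
```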

Using individual criteria

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
from mycontext.intelligence.eval_criteria import DATA_GAP_HONESTY

metric = GEval(
    name=DATA_GAP_HONESTY.name,
    evaluation_steps=DATA_GAP_HONESTY.evaluation_steps,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=DATA_GAP_HONESTY.threshold,
)
```
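
The resulting metric can also run on its own, outside evaluate(), via DeepEval's standard measure interface. A minimal sketch, reusing ctx and response from the full example above:

```python
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(input=ctx.assemble(), actual_output=response.response)

metric.measure(test_case)      # calls the LLM judge once
print(metric.score)            # 0.0–1.0
print(metric.reason)           # the judge's explanation
print(metric.is_successful())  # whether score cleared the threshold
```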

Using as a quality gate in experiments

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from mycontext.intelligence import QualityMetrics, OutputEvaluator, get_criteria, to_deepeval_metrics

qm = QualityMetrics()
evaluator = OutputEvaluator()

# ctx and response come from the full example above.
# Step 1: fast heuristic pass (always)
prompt_score = qm.evaluate(ctx)
output_score = evaluator.evaluate(ctx, response.response)

# Step 2: targeted LLM-judge pass (on samples that pass step 1)
if output_score.overall > 0.70:
    metrics = to_deepeval_metrics(get_criteria("data_analysis"))
    test_case = LLMTestCase(input=ctx.assemble(), actual_output=response.response)
    evaluate([test_case], metrics)
```

See also