QualityMetrics

QualityMetrics evaluates a Context object across six dimensions and returns a score with issues, strengths, and specific improvement suggestions — before you send anything to an LLM.

from mycontext.intelligence import QualityMetrics
from mycontext.templates.free.reasoning import RootCauseAnalyzer

ctx = RootCauseAnalyzer().build_context(
    problem="API response times tripled after deployment",
    depth="comprehensive",
)

metrics = QualityMetrics()
score = metrics.evaluate(ctx)

print(f"Quality: {score.overall:.1%}")
print(metrics.report(score))

Why Measure Context Quality?

A poorly structured context produces poor responses. QualityMetrics lets you:

  • Catch weak prompts before execution — save tokens and API costs
  • Improve iteratively — specific suggestions tell you exactly what to fix
  • Compare approaches — compare() shows exactly how much improvement you made
  • Validate in CI — assert minimum quality thresholds before deploying

The Six Dimensions

| Dimension | Weight | What it measures |
|---|---|---|
| Completeness | 25% | Are all components present? (role, goal, directive, rules, constraints, examples) |
| Clarity | 20% | Clear, unambiguous instructions. Checks pronoun ratio, hedge density, and modal commitment |
| Reasoning Depth | 20% | Chain-of-thought markers, structured steps, depth of analysis |
| Actionability | 20% | Concrete, implementable recommendations with specific metrics |
| Specificity | 15% | Concrete domain terms, detailed directives, not generic phrases |
| Efficiency | 10% | Concise without being incomplete. Directive length is scored separately from total length |

Scoring Scale

0–25%   Broken / empty / placeholder
25–45% Minimal / generic ("Expert Assistant, be helpful")
45–65% Adequate basics but missing depth
65–80% Good: clear role, specific directive, some constraints
80–95% Very good: comprehensive, examples, tight constraints
95–100% Exceptional: publishable quality
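
A small helper (hypothetical, not part of the library) that maps an overall score onto the bands above:

```python
def quality_band(overall: float) -> str:
    """Map a 0.0-1.0 overall score to the scale described above."""
    if overall < 0.25:
        return "broken"
    if overall < 0.45:
        return "minimal"
    if overall < 0.65:
        return "adequate"
    if overall < 0.80:
        return "good"
    if overall < 0.95:
        return "very good"
    return "exceptional"

print(quality_band(0.842))  # the 84.2% report example falls in "very good"
```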

Constructor

QualityMetrics(
    mode: str = "heuristic",
    llm_provider: str = "openai",
    llm_model: str = "gpt-4o-mini",
)

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| mode | str | "heuristic" | Evaluation mode: "heuristic", "llm", or "hybrid" |
| llm_provider | str | "openai" | Provider for LLM mode |
| llm_model | str | "gpt-4o-mini" | Model for LLM mode |

Three Evaluation Modes

| Mode | Speed | Cost | Best for |
|---|---|---|---|
| "heuristic" | Instant | Free | CI/CD, bulk evaluation, development |
| "llm" | ~2s | ~$0.02/eval | Authoritative scoring, production QA |
| "hybrid" | Fast | Low | Best of both: heuristic for clear cases, LLM for borderline (0.45–0.75) |
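
The hybrid routing rule can be sketched as a simple threshold check. This is illustrative only; the borderline band (0.45–0.75) comes from the table above, and the library's internals may differ:

```python
def hybrid_route(heuristic_score: float,
                 low: float = 0.45, high: float = 0.75) -> str:
    """Decide which evaluator produces the final score in hybrid mode:
    trust the heuristic for clear-cut scores, escalate borderline ones."""
    return "llm" if low <= heuristic_score <= high else "heuristic"

print(hybrid_route(0.30))  # clearly weak: heuristic verdict stands
print(hybrid_route(0.60))  # borderline: escalate to the LLM judge
```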

evaluate(context)

score = metrics.evaluate(context)

Returns: QualityScore

@dataclass
class QualityScore:
    overall: float                             # 0.0 to 1.0 weighted score
    dimensions: dict[QualityDimension, float]  # Per-dimension scores
    issues: list[str]                          # Specific problems found
    strengths: list[str]                       # What's working well
    suggestions: list[str]                     # Actionable improvements
    metadata: dict                             # Mode, word count, etc.

report(score)

Generate a human-readable quality report:

metrics = QualityMetrics()
score = metrics.evaluate(ctx)
print(metrics.report(score))

Output:

Context Quality Report
======================

Overall Score: 84.2% ✅

Dimension Scores:
✅ Clarity: 88.0%
✅ Completeness: 90.0%
✅ Specificity: 78.0%
✅ Relevance: 85.0%
✅ Structure: 82.0%
⚠️ Efficiency: 71.0%

Strengths (5):
✓ Clear role and directive structure
✓ Specific role/guidance defined
✓ Well-defined rules (4)
✓ Good formatting with headers and lists
✓ Rich domain-specific terminology

Issues (1):
✗ No examples — add concrete examples of expected input/output

Suggestions for Improvement:
1. Strong prompt. Consider adding edge cases or constraints for even better results.

compare(context1, context2)

Measure improvement between two contexts:

from mycontext import Context
from mycontext.foundation import Guidance, Directive, Constraints

# Before: generic
ctx_before = Context(
    guidance=Guidance(role="Expert Assistant"),
    directive=Directive(content="Analyze the problem"),
)

# After: specific
from mycontext.templates.free.reasoning import RootCauseAnalyzer
ctx_after = RootCauseAnalyzer().build_context(
    problem="API response times tripled after deployment",
    depth="comprehensive",
)

metrics = QualityMetrics()
comparison = metrics.compare(ctx_before, ctx_after)

print(f"Before: {comparison['original_score']:.1%}")
print(f"After: {comparison['improved_score']:.1%}")
print(f"Lift: +{comparison['improvement_percentage']:.1f}%")

print("\nDimension changes:")
for dim, delta in comparison['dimension_changes'].items():
    arrow = "↑" if delta > 0.05 else "↓" if delta < -0.05 else "→"
    print(f"  {arrow} {dim.value}: {delta:+.1%}")

print("\nResolved issues:", comparison['resolved_issues'])
print("New strengths:", comparison['new_strengths'])

Returns dict:

{
    "original_score": 0.28,
    "improved_score": 0.84,
    "improvement": 0.56,
    "improvement_percentage": 56.0,
    "dimension_changes": {QualityDimension.COMPLETENESS: +0.70, ...},
    "resolved_issues": {"Missing guidance/role component", ...},
    "new_strengths": {"Clear role and directive structure", ...},
}
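
The per-dimension deltas are simply after-minus-before differences between the two scores' dimension dicts. A minimal sketch, using plain string keys for readability rather than the QualityDimension enum:

```python
def dimension_changes(before: dict[str, float],
                      after: dict[str, float]) -> dict[str, float]:
    """Delta per dimension: positive means the rewrite improved it."""
    return {dim: after[dim] - before[dim] for dim in before}

changes = dimension_changes(
    {"completeness": 0.20, "clarity": 0.40},
    {"completeness": 0.90, "clarity": 0.88},
)
for dim, delta in changes.items():
    print(f"{dim}: {delta:+.1%}")
```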

Examples

Evaluate a Pattern-Built Context

from mycontext.intelligence import QualityMetrics
from mycontext.templates.free.planning import ScenarioPlanner

ctx = ScenarioPlanner().build_context(
    topic="AI regulation impact on our SaaS business",
    timeframe="3 years",
)

metrics = QualityMetrics()
score = metrics.evaluate(ctx)
print(f"Score: {score.overall:.1%}")
for suggestion in score.suggestions:
    print(f"  → {suggestion}")

Assert Minimum Quality in CI

from mycontext.intelligence import QualityMetrics

def validate_context(ctx, min_quality=0.70):
    metrics = QualityMetrics(mode="heuristic")
    score = metrics.evaluate(ctx)

    if score.overall < min_quality:
        issues = "\n".join(f"  - {i}" for i in score.issues)
        raise ValueError(
            f"Context quality {score.overall:.1%} below minimum {min_quality:.1%}\n{issues}"
        )
    return score

# In your pipeline
ctx = build_my_context()
score = validate_context(ctx, min_quality=0.75)
result = ctx.execute(provider="openai")

Iterative Improvement Loop

from mycontext import Context
from mycontext.foundation import Guidance, Directive, Constraints
from mycontext.intelligence import QualityMetrics

metrics = QualityMetrics()

# Start with basic context
ctx = Context(
    guidance=Guidance(role="Analyst"),
    directive=Directive(content="Analyze our churn problem"),
)
score = metrics.evaluate(ctx)
print(f"v1: {score.overall:.1%}")
print("Issues:", score.issues)

# Improve based on suggestions
ctx = Context(
    guidance=Guidance(
        role="Senior customer success analyst with SaaS retention expertise",
        rules=[
            "Identify root causes using the Five Whys methodology",
            "Quantify impact: revenue at risk, affected cohorts, trend direction",
            "Distinguish correlation from causation",
            "Provide 3 prioritized recommendations with implementation timeline",
        ],
    ),
    directive=Directive(
        content="""Analyze our customer churn problem.

Context: 40% churn increase in Q3. New enterprise tier launched Q2.
Analyze: cohort patterns, product usage correlation, support ticket themes.
Output: root causes ranked by confidence, immediate actions, 90-day plan.""",
    ),
    constraints=Constraints(
        must_include=["cohort analysis", "revenue impact", "root cause", "recommendations"],
        format_rules=["Executive summary first", "Each finding supported by evidence"],
    ),
)
score = metrics.evaluate(ctx)
print(f"v2: {score.overall:.1%}")

QualityDimension Enum

from mycontext.intelligence import QualityDimension

class QualityDimension(Enum):
    CLARITY = "clarity"
    COMPLETENESS = "completeness"
    SPECIFICITY = "specificity"
    RELEVANCE = "relevance"
    STRUCTURE = "structure"
    EFFICIENCY = "efficiency"

Access a specific dimension score:

score = metrics.evaluate(ctx)
print(score.dimensions[QualityDimension.COMPLETENESS]) # 0.90
print(score.dimensions[QualityDimension.CLARITY]) # 0.88

What Gets Penalized

Common issues detected automatically:

| Issue | Dimension | Penalty |
|---|---|---|
| Extremely short content (< 10 words) | Global | Up to 60% |
| Minimal generic prompt ("Expert Assistant, be helpful") | Global | 45% |
| Generic role + generic rules | Global | 30% |
| Very short prompt (< 30 words) | Global | 30% |
| Missing role, directive, or goal | Global | 15% each |
| Empty JSON schema | Global | 10% |
| Likely typos detected | Global | 5% per typo |
| High pronoun ratio (> 10%) — ambiguous references | Clarity | 15% |
| Moderate pronoun ratio (5–10%) | Clarity | 7% |
| High hedge density — "try to", "if applicable", "ideally" | Clarity | 12% |
| Weak modal commitment — more should/could than must/shall | Clarity | 8% |
| Directive too long (> 100 words) — dilutes the core instruction | Efficiency | 10% |
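
The tiered pronoun-ratio check can be sketched in a few lines. The pronoun list and whitespace tokenization here are assumptions for illustration, not the library's actual implementation:

```python
# Assumed pronoun inventory for the sketch.
PRONOUNS = {"it", "this", "that", "they", "them", "these", "those", "he", "she"}

def pronoun_penalty(text: str) -> float:
    """Clarity penalty implied by the tiers above: 15% above a 10%
    pronoun ratio, 7% between 5% and 10%, otherwise none."""
    words = text.lower().split()
    if not words:
        return 0.0
    ratio = sum(w.strip(".,;:!?") in PRONOUNS for w in words) / len(words)
    if ratio > 0.10:
        return 0.15
    if ratio > 0.05:
        return 0.07
    return 0.0

print(pronoun_penalty("Fix it and then test it before they deploy this."))
print(pronoun_penalty("Analyze cohort retention by plan tier."))
```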

Research Foundation

The heuristics in QualityMetrics are grounded in two internal experiments and peer-reviewed NLP research.

Empirical experiments

Experiment 1 — POS Profile & Entropy (n=10 prompt pairs)
Tested whether linguistic features (parts of speech, perplexity, semantic density) predict output quality. Key finding: named entity count and noun-to-pronoun ratio showed directional signal on a small controlled dataset.

Experiment 2 — TruthfulQA Scale Test (n=200, gpt-4o-mini)
Scaled the most promising features to the TruthfulQA benchmark. Statistically significant findings that directly shaped the Clarity and Efficiency scorers:

| Feature | Correlation with accuracy | p-value | Applied to |
|---|---|---|---|
| Pronoun ratio | r = −0.187 | p = 0.008 | Clarity — tiered penalty at 5% and 10% |
| Directive length | r = −0.176 | p = 0.013 | Efficiency — penalty for directives > 100 words |

Academic foundations

| Heuristic | Source |
|---|---|
| Pronoun ratio & ambiguity | Experiment 2 (TruthfulQA, n=200) |
| Hedge density | Hyland (1996) — hedging in instructional discourse |
| Modal commitment ratio | Deontic logic; Ouyang et al. (2022) InstructGPT |
| Concreteness preference | Brysbaert et al. (2014) — concreteness ratings for 40k English words |
| Frame Semantics completeness | Fillmore (1982) — FrameNet argument slots |
| Sentence complexity | Gibson (1998) — dependency locality theory |

API Reference

QualityMetrics

| Method | Returns | Description |
|---|---|---|
| __init__(mode, llm_provider, llm_model) | — | Initialize |
| evaluate(context, reference) | QualityScore | Evaluate a context |
| compare(context1, context2) | dict | Compare two contexts |
| report(score) | str | Human-readable report |

QualityScore

| Field | Type | Description |
|---|---|---|
| overall | float | Weighted score (0.0–1.0) |
| dimensions | dict[QualityDimension, float] | Per-dimension scores |
| issues | list[str] | Problems found |
| strengths | list[str] | What's working |
| suggestions | list[str] | Actionable improvements |
| metadata | dict | Mode, word count, component flags |