Prompt Optimization Workflow
mycontext provides a complete, reproducible pipeline for turning any prompt — from a one-liner to a partially-structured paragraph — into a high-quality 9-section context. This document covers the full journey: what the workflow is, when to use each tool, and how to run it end to end.
The Two Workflows
Depending on what you're optimizing, there are two distinct paths:
| You have... | Use | What it does |
|---|---|---|
| A raw prompt string | PromptArchitect | Parse → Score → Rewrite all 9 sections → Diff |
| An SDK template Guidance object | GuidanceOptimizer | Audit rules → Detect weak language → Rewrite to binding rules |
Both tools share the same underlying principle: the 9-Section Prompt Architecture.
The 9-Section Architecture
Every high-quality prompt has nine components in a specific order, grounded in LLM attention research:
```text
PRIMACY ZONE  → ① Role             Who the model is
                ② Goal             What success looks like
INSTRUCTIONS  → ③ Rules            Binding behavioral constraints
                ④ Style            Tone and voice
MIDDLE        → ⑤ Reasoning        Thinking strategy to apply
                ⑥ Examples         Negative + positive demonstrations
LATE          → ⑦ Output Contract  Exact structure required
                ⑧ Guard Rails      What to avoid, edge-case handling
RECENCY ZONE  → ⑨ Task             The final, clearest imperative sentence
```
Research basis: Liu et al. 2023 (primacy-recency bias), Li et al. 2023 (+9.7 BLEU for task-last placement), CO-STAR (Ng, 2024).
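The ordering above can be sketched as a tiny assembler that always places the task last, in the recency zone. This is an illustration only: `SECTION_ORDER` and `assemble` are hypothetical names, not part of mycontext's API.

```python
# Minimal sketch: join whichever sections are present, in canonical order,
# so the task is always the final attention anchor.
SECTION_ORDER = [
    "role", "goal", "rules", "style", "reasoning",
    "examples", "output_contract", "guard_rails", "task",
]

def assemble(sections: dict[str, str]) -> str:
    """Join the provided sections in canonical order, skipping absent ones."""
    return "\n\n".join(sections[name] for name in SECTION_ORDER if name in sections)

prompt = assemble({
    "role": "You are a senior data analyst.",
    "task": "Summarize the attached feedback report.",
})
print(prompt.splitlines()[-1])  # the task is the final sentence
```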
The key differentiators from generic prompts:
- Rules use binding language — `must`/`always`/`never`, not `should`/`try to`/`ideally`
- Goal is imperative — "Identify the root cause", not "You will look at..."
- Examples include negative cases — show the WRONG approach before the correct one
- Task restates the request last — the recency zone is the final attention anchor
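The binding-language distinction is mechanical enough to check with a regex. The helper below is a hypothetical toy, not the library's actual detector; it only flags the suggestive modals that a rewrite would replace with must/always/never.

```python
import re

# Hypothetical weak-language check (illustration only): suggestive modals
# that signal a non-binding rule.
WEAK_MODALS = re.compile(r"\b(should|try to|ideally|maybe|consider)\b", re.IGNORECASE)

def is_binding(rule: str) -> bool:
    """True if the rule avoids suggestive modals."""
    return WEAK_MODALS.search(rule) is None

print(is_binding("You should mention limitations"))            # False
print(is_binding("Never speculate beyond the data provided"))  # True
```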
PromptArchitect — for raw prompts
Use PromptArchitect when you have a raw string prompt that needs to be evaluated and improved.
Quick start
```python
from mycontext.intelligence import PromptArchitect

arch = PromptArchitect(provider="openai", model="gpt-4o-mini")

# Option A: Improve an existing prompt
result = arch.improve("You are an analyst. Summarize this customer feedback.")
print(result.summary())
# Score: 28% → 79% (+51%)
# Added sections: goal, rules, style, reasoning, examples, output_contract, guard_rails
# Resolved 4 issue(s): Missing guidance/role component; No behavioral rules defined; ...

print(result.diff_report())    # Section-by-section what changed and why
print(result.improved_prompt)  # The full assembled prompt, ready to use

# Option B: Build from scratch
result = arch.build("Analyze customer churn and identify at-risk segments")
print(result.improved_prompt)

# Option C: Just inspect — zero LLM calls
parsed = arch.parse("You are an analyst. Summarize this customer feedback.")
print(parsed.present())  # ['role', 'task']
print(parsed.missing())  # ['goal', 'rules', 'style', 'reasoning', 'examples', 'output_contract', 'guard_rails']
```
The improve() pipeline
improve() runs five steps internally:
1. PARSE — heuristically detect which sections already exist (no LLM)
2. SCORE — QualityMetrics on the original context
3. REWRITE — LLM fills missing sections, strengthens weak ones
4. SCORE — QualityMetrics on the improved context
5. DIFF — section-by-section change record with rationale
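The five steps can be sketched as plain functions. Everything here is a trivial stand-in for illustration, not mycontext internals: the real parser uses richer heuristics, the real scorer is QualityMetrics, and step 3 is the single LLM call.

```python
# Illustrative sketch of the improve() pipeline with stand-in helpers.
def parse(prompt: str) -> dict:
    # 1. PARSE: naive heuristic section detection, no LLM call
    return {"role": "analyst" if prompt.startswith("You are") else None}

def score(prompt: str) -> float:
    # 2./4. SCORE: stand-in metric; longer, more structured prompts score higher
    return min(len(prompt) / 1000, 1.0)

def rewrite(prompt: str, parsed: dict) -> str:
    # 3. REWRITE: in the real pipeline this is the single LLM call
    return prompt + "\n\nTask: restate the request as the final sentence."

def improve(prompt: str) -> dict:
    parsed = parse(prompt)
    before = score(prompt)
    improved = rewrite(prompt, parsed)
    after = score(improved)
    # 5. DIFF: record what changed (sketched here as a plain dict)
    return {"before": before, "after": after,
            "delta": after - before, "improved_prompt": improved}

result = improve("You are an analyst. Summarize this feedback.")
assert result["after"] > result["before"]
```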
Result object
```python
@dataclass
class ArchitectResult:
    improved_context: Context     # Full mycontext Context, ready for .execute()
    improved_prompt: str          # Assembled prompt string
    before_score: float           # QualityMetrics score before (0.0–1.0)
    after_score: float            # QualityMetrics score after (0.0–1.0)
    score_delta: float            # Improvement
    parsed: ParsedSections        # What was detected in the original
    diffs: list[SectionDiff]      # Per-section change records
    before_issues: list[str]      # Issues found in original
    after_issues: list[str]       # Remaining issues after rewrite
    resolved_issues: list[str]    # Issues that were fixed
    metadata: dict                # Mode, model, provider
```
parse() — zero-cost inspection
```python
parsed = arch.parse(prompt)
parsed.present()  # ['role', 'task']
parsed.missing()  # ['goal', 'rules', ...]

# Access what was found per section
parsed.role   # "data analyst"
parsed.rules  # ["Must include actionable recommendations"]
parsed.task   # "Summarize the attached feedback report"
```
Reading the diff report
```python
print(result.diff_report())
```

```text
── SECTION DIFF ──────────────────────────────────────────

[ADDED] § ROLE
  AFTER: Senior data analyst with expertise in customer success and churn analytics
  WHY:   Section was absent; added using 9-section principle: Include seniority level

[STRENGTHENED] § RULES
  BEFORE: Be helpful and accurate
  AFTER:  Always quantify findings with specific metrics; Never speculate beyond the data provided; ...
  WHY:    Existing content was weak; upgraded to binding language (must/always/never)

[ADDED] § EXAMPLES
  AFTER: WRONG: "Customers seem unhappy with onboarding." CORRECT: "42% of churned users (Q3 cohort) ...
  WHY:   Section was absent; added using 9-section principle: Include negative + positive examples
──────────────────────────────────────────────────────────
```
Constructor
```python
PromptArchitect(
    provider: str = "openai",
    model: str = "gpt-4o-mini",
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `provider` | str | `"openai"` | LLM provider for rewrite calls |
| `model` | str | `"gpt-4o-mini"` | Model for rewrite calls |
parse() makes no LLM calls. improve() and build() make one LLM call each.
GuidanceOptimizer — for SDK templates
Use GuidanceOptimizer when you're working with a Guidance object inside an SDK template and want to automatically upgrade its rules to binding, evidence-backed language.
```python
from mycontext.intelligence import GuidanceOptimizer
from mycontext.foundation import Guidance

guidance = Guidance(
    role="Data analyst",
    rules=[
        "Try to look for patterns",
        "You should mention limitations",
        "Be accurate",
    ],
)

opt = GuidanceOptimizer(provider="openai", model="gpt-4o-mini")
result = opt.optimize(guidance)

print(result.optimized_guidance.rules)
# [
#   "Identify and quantify every pattern — report the specific metric and its value.",
#   "State data limitations explicitly: what is absent, what it prevents, what is needed.",
#   "Every claim must trace to a specific data point — no unsupported assertions.",
# ]

print(f"Rule strength: {result.before_score:.0%} → {result.after_score:.0%}")
```
What it optimizes
GuidanceOptimizer targets three specific weaknesses that degrade template quality:
| Problem | Example before | Example after |
|---|---|---|
| Suggestive modals | "Should mention limitations" | "Must explicitly state each data gap and what it prevents" |
| Vague directives | "Be accurate" | "Every claim must cite the specific metric that supports it" |
| Under-specified rules | "Try to look for patterns" | "Identify and quantify every pattern — report the metric and its magnitude" |
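One plausible way a rule-strength score could work is as the share of rules that use a binding modal and are specific enough to act on. This is a hypothetical heuristic for intuition; mycontext's real metric may differ.

```python
# Hypothetical rule-strength heuristic: fraction of rules containing a
# binding modal and enough words to carry a measurable criterion.
BINDING = ("must", "always", "never", "every")

def rule_strength(rules: list[str]) -> float:
    if not rules:
        return 0.0
    strong = sum(
        1 for rule in rules
        if any(word in rule.lower() for word in BINDING)
        and len(rule.split()) >= 5  # filters vague directives like "Be accurate"
    )
    return strong / len(rules)

weak = ["Try to look for patterns", "You should mention limitations", "Be accurate"]
upgraded = ["Every claim must trace to a specific data point."]
print(f"{rule_strength(weak):.0%} → {rule_strength(upgraded):.0%}")  # 0% → 100%
```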
The audit report
```python
print(result.audit_report())
```

```text
── GUIDANCE AUDIT ────────────────────────────────────────
Role: Data analyst
Rules: 3 total | 0 binding | 3 weak

WEAK RULES DETECTED:
[1] "Try to look for patterns"
    Issue: suggestive modal ("try to")
    Fix:   "Identify and quantify every pattern..."
[2] "You should mention limitations"
    Issue: weak modal ("should")
    Fix:   "Must explicitly state each data gap..."
[3] "Be accurate"
    Issue: vague directive (< 5 words, no measurable criterion)
    Fix:   "Every claim must cite the specific metric..."

Rule strength score: 15% → 91%
──────────────────────────────────────────────────────────
```
End-to-End Workflow
Here is the full recommended flow when improving a prompt from scratch to production-ready:
```python
from mycontext.intelligence import PromptArchitect, QualityMetrics

arch = PromptArchitect(provider="openai", model="gpt-4o-mini")
qm = QualityMetrics()

# ── Step 1: Inspect ────────────────────────────────────────
raw_prompt = """
You are a helpful assistant. Analyze the customer support data and
provide insights. Be thorough and helpful.
"""

parsed = arch.parse(raw_prompt)
print("Present:", parsed.present())  # ['role', 'task']
print("Missing:", parsed.missing())  # ['goal', 'rules', 'style', ...]

# ── Step 2: Score the original ─────────────────────────────
from mycontext import Context
from mycontext.foundation import Guidance, Directive

original_ctx = Context(
    guidance=Guidance(role="helpful assistant", rules=[]),
    directive=Directive(content=raw_prompt),
)
before = qm.evaluate(original_ctx)
print(f"Before: {before.overall:.1%}")  # 29%
print("Issues:", before.issues[:3])

# ── Step 3: Improve ────────────────────────────────────────
result = arch.improve(raw_prompt)
print(result.summary())
# Score: 29% → 82% (+53%)

# ── Step 4: Inspect the diff ───────────────────────────────
print(result.diff_report())

# ── Step 5: Execute the improved prompt ────────────────────
response = result.improved_context.execute(provider="openai")
print(response.response)

# ── Step 6: Evaluate the output ────────────────────────────
from mycontext.intelligence import OutputEvaluator

evaluator = OutputEvaluator(mode="heuristic")
output_score = evaluator.evaluate(result.improved_context, response.response)
print(f"Output quality: {output_score.overall:.1%}")
```
Typical Score Lifts
Based on internal experiments across prompt categories:
| Prompt type | Typical before | Typical after | Common additions |
|---|---|---|---|
| One-liner ("Summarize this") | 15–25% | 70–85% | role, goal, rules, output contract |
| Role-only ("You are an analyst. Analyze this.") | 28–35% | 75–88% | goal, rules, examples, guard rails |
| Partially structured | 45–60% | 78–90% | strengthened rules, examples, guard rails |
| Already structured | 65–75% | 80–92% | binding language upgrades, gap examples |
The largest single gains come from:
- Adding binding rules — typically +15–20% on `clarity` and `completeness`
- Adding examples — typically +10–15% on `specificity`
- Adding an output contract — typically +8–12% on `structure`
Quality Gate Integration
Use PromptArchitect as a pre-execution quality gate:
```python
from mycontext.intelligence import PromptArchitect, QualityMetrics
from mycontext import Context
from mycontext.foundation import Guidance, Directive

def ensure_quality(prompt: str, min_score: float = 0.70) -> str:
    """Return an architected prompt that meets the quality bar."""
    qm = QualityMetrics()

    # Build a minimal context to check the raw prompt
    ctx = Context(
        guidance=Guidance(role="Assistant"),
        directive=Directive(content=prompt),
    )
    score = qm.evaluate(ctx)
    if score.overall >= min_score:
        return prompt  # Already good enough

    # Improve and return the upgraded prompt
    arch = PromptArchitect()
    result = arch.improve(prompt)
    print(f"Auto-improved: {score.overall:.0%} → {result.after_score:.0%}")
    return result.improved_prompt

# In production
prompt = ensure_quality("You are an assistant. Review this contract.")
response = Context(directive=Directive(content=prompt)).execute(provider="openai")
```
Choosing Between the Two Tools
```text
Have a raw string prompt?
└─ Use PromptArchitect.improve() or .build()

Working on an SDK template's Guidance?
└─ Use GuidanceOptimizer.optimize()

Want to inspect without changing anything?
└─ Use PromptArchitect.parse() — zero LLM calls

Want to score before/after?
└─ Use QualityMetrics.compare(ctx_before, ctx_after)

Want to evaluate the output too?
└─ Use OutputEvaluator after .execute()
```
Related
- QualityMetrics → Score a context before sending it, with detailed scoring dimensions
- OutputEvaluator → Score an LLM response after receiving it