
Incident Response & On-Call Triage

Scenario: Your on-call engineer gets paged at 2am. They need structured, fast analysis: what broke, why it broke, whether the system is still at risk, and what to do now. You want AI to do the first pass so the human can focus on fixing rather than diagnosing.

Patterns used:

  • RootCauseAnalyzer — immediate cause + contributing factors + timeline
  • DiagnosticRootCauseAnalyzer (enterprise) — deeper diagnostic with differential reasoning
  • SystemHealthAuditor (enterprise) — assesses whether the system is in a stable state

Integration: AutoGen multi-agent group chat with a triage agent, a diagnostic agent, and a health auditor; the postmortem is drafted in a separate step afterwards


import mycontext
mycontext.activate_license("MC-ENT-YOUR-KEY")

from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager
from mycontext.templates.free.reasoning import RootCauseAnalyzer
from mycontext.templates.enterprise.diagnostic import (
    DiagnosticRootCauseAnalyzer,
    SystemHealthAuditor,
)

def create_incident_crew(incident: dict) -> list[dict]:
    """Run a three-agent triage chat and return the group chat message history.

    Example input:

    incident = {
        "title": "Payment service 503s",
        "symptoms": "67% error rate since 14:32 UTC",
        "context": "Deployment 3.8.2 at 14:28, DB CPU spike at 14:31",
        "metrics": "p99 latency: 8.2s (was 220ms)",
    }
    """
    incident_brief = (
        f"Title: {incident['title']}\n"
        f"Symptoms: {incident['symptoms']}\n"
        f"Context: {incident['context']}\n"
        f"Metrics: {incident['metrics']}"
    )

    # Build specialized contexts for each agent
    triage_ctx = RootCauseAnalyzer().build_context(
        problem=incident_brief,
        depth="immediate",
    )
    diagnostic_ctx = DiagnosticRootCauseAnalyzer().build_context(
        observation=incident_brief,
        system="payment microservice",
    )
    health_ctx = SystemHealthAuditor().build_context(
        system="payment service + database cluster",
        observation=incident_brief,
    )

    llm_config = {"config_list": [{"model": "gpt-4o-mini"}]}

    # Each assistant uses its assembled context as the system message
    triage_agent = AssistantAgent(
        name="TriageAgent",
        system_message=triage_ctx.assemble(),
        llm_config=llm_config,
    )
    diagnostic_agent = AssistantAgent(
        name="DiagnosticAgent",
        system_message=diagnostic_ctx.assemble(),
        llm_config=llm_config,
    )
    health_agent = AssistantAgent(
        name="HealthAuditor",
        system_message=health_ctx.assemble(),
        llm_config=llm_config,
    )
    # The proxy only opens the chat: no human input, no code execution
    user_proxy = UserProxyAgent(
        name="OnCallEngineer",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=1,
        code_execution_config=False,
    )

    # Round-robin means the agents speak one after another, in list order
    group_chat = GroupChat(
        agents=[user_proxy, triage_agent, diagnostic_agent, health_agent],
        messages=[],
        max_round=6,
        speaker_selection_method="round_robin",
    )
    manager = GroupChatManager(groupchat=group_chat, llm_config=llm_config)

    user_proxy.initiate_chat(
        manager,
        message=(
            f"INCIDENT ACTIVE: {incident['title']}\n\n"
            f"{incident_brief}\n\n"
            "Each agent: provide your analysis. "
            "TriageAgent: immediate cause and action. "
            "DiagnosticAgent: differential diagnosis (what else could this be?). "
            "HealthAuditor: is the system stable enough to continue or should we roll back now?"
        ),
    )
    # A list of message dicts, not a string
    return group_chat.messages


# Trigger from PagerDuty webhook or CLI
incident = {
    "title": "Payment service: 67% error rate",
    "symptoms": "503 errors from /api/checkout, /api/payment-methods since 14:32 UTC",
    "context": "Deployment 3.8.2 pushed at 14:28 UTC. DB CPU spiked to 98% at 14:31 UTC.",
    "metrics": "p99: 8200ms (baseline: 220ms). Connection pool exhausted.",
}

messages = create_incident_crew(incident)
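
For the CLI trigger mentioned above, one possible sketch follows. It only builds the incident dict from command-line flags; the flag names and the `parse_incident_args` helper are hypothetical and not part of mycontext or AutoGen.

```python
import argparse
from typing import Dict, List, Optional

# The four keys that create_incident_crew reads from the incident dict.
INCIDENT_FIELDS = ("title", "symptoms", "context", "metrics")


def parse_incident_args(argv: Optional[List[str]] = None) -> Dict[str, str]:
    """Parse --title/--symptoms/--context/--metrics into an incident dict.

    Hypothetical helper: argparse exits with an error if a flag is missing,
    so an incomplete incident never reaches the agents.
    """
    parser = argparse.ArgumentParser(
        description="Start an AI first-pass triage for an incident."
    )
    for field in INCIDENT_FIELDS:
        parser.add_argument(f"--{field}", required=True, help=f"Incident {field}")
    args = parser.parse_args(argv)
    return {field: getattr(args, field) for field in INCIDENT_FIELDS}
```

In a script, `create_incident_crew(parse_incident_args())` would then start the chat from the flags passed on the command line.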

What You Get

Three analytical perspectives on the same incident, produced in a single round-robin conversation:

| Agent | Analytical framework | Output |
|---|---|---|
| TriageAgent | 5-why causal chain | Immediate cause + actions to take now |
| DiagnosticAgent | Differential diagnosis | Alternative hypotheses, rules out false leads |
| HealthAuditor | System health checklist | Stable/unstable verdict, rollback recommendation |
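
To read each agent's output separately, the returned message list can be grouped by speaker. The sketch below assumes each message is a dict with a `name` key (the speaking agent) and a `content` key, which is how AutoGen group chat messages are usually shaped; `analyses_by_agent` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List


def analyses_by_agent(messages: List[dict]) -> Dict[str, List[str]]:
    """Group non-empty message contents by the agent that produced them."""
    grouped = defaultdict(list)
    for message in messages:
        content = message.get("content")
        if not content:
            continue  # skip empty turns, e.g. a silent proxy reply
        # Assumption: the speaker is stored under "name"
        grouped[message.get("name", "unknown")].append(content)
    return dict(grouped)
```

With this, `analyses_by_agent(messages).get("HealthAuditor")` would return only the stability verdict turns, which is often the first thing to check during an active incident.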

The conversation produces a structured incident analysis in under 60 seconds, a first pass that typically takes an on-call engineer 20–30 minutes of log diving.
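
Latency depends on the model, the provider, and the number of rounds, so it can be worth measuring in your own environment. A minimal timing wrapper, with `timed` as a hypothetical helper name, might look like this:

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, `messages, seconds = timed(create_incident_crew, incident)` returns the chat history together with how long the full conversation took.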

Postmortem Integration

After the incident, feed the conversation into a SynthesisBuilder to auto-draft the postmortem:

from mycontext.templates.free.reasoning import SynthesisBuilder

postmortem_ctx = SynthesisBuilder().build_context(
    # Skip messages whose content is empty or None so the join cannot fail
    sources="\n\n".join(m["content"] for m in messages if m.get("content")),
    topic="postmortem report",
)
postmortem = postmortem_ctx.execute(provider="openai").response
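
Joining raw contents drops the information about which agent said what. If the postmortem should attribute findings, a labeled transcript could be passed as `sources` instead. This is a sketch only: `labeled_transcript` is a hypothetical helper, and it assumes messages carry a `name` or `role` key.

```python
from typing import Dict, List


def labeled_transcript(messages: List[Dict]) -> str:
    """Render messages in order as '[speaker]' blocks, skipping empty turns."""
    parts = []
    for message in messages:
        content = (message.get("content") or "").strip()
        if not content:
            continue
        # Assumption: prefer the agent name, fall back to the chat role
        speaker = message.get("name") or message.get("role") or "unknown"
        parts.append(f"[{speaker}]\n{content}")
    return "\n\n".join(parts)
```

Passing `sources=labeled_transcript(messages)` keeps the triage, diagnostic, and health findings distinguishable in the drafted report.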