| Contract | Role |
|---|---|
| Signals | The read-only ground truth: services, recent_deploys, metrics, logs, protected_resources. |
| Diagnosis | The structured LLM output: hypothesis, suspected_resource, suspected_deploy_sha, confidence (0–1), recommended_action. Never prose. |
| ProposedAction | A typed tool call (rollback_deploy / scale_service) with an explicit blast-radius scope. |
| Verdict | A guardrail result: passed, the individual checks, and human-readable reasons (rendered live in the UI). |
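
The four contracts above could be sketched as frozen dataclasses. This is a minimal illustration, not the project's actual definitions: the field names come from the table, but every field type, the `target` field, and the `scope` values ("single" / "all") are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Signals:
    """Read-only ground truth gathered from the cluster (types are assumed)."""
    services: list[str]
    recent_deploys: list[dict]
    metrics: dict[str, float]
    logs: list[str]
    protected_resources: list[str]


@dataclass(frozen=True)
class Diagnosis:
    """Structured LLM output; never free-form prose."""
    hypothesis: str
    suspected_resource: str
    suspected_deploy_sha: str
    confidence: float
    recommended_action: str

    def __post_init__(self) -> None:
        # The table states confidence is in the range 0 to 1.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


@dataclass(frozen=True)
class ProposedAction:
    """A typed tool call with an explicit blast-radius scope."""
    tool: Literal["rollback_deploy", "scale_service"]
    target: str  # hypothetical field: the service the tool acts on
    scope: str   # assumed values: "single" or "all"


@dataclass(frozen=True)
class Verdict:
    """Guardrail result with per-check outcomes and readable reasons."""
    passed: bool
    checks: dict[str, bool]
    reasons: list[str]
```

Freezing the dataclasses is one way to keep `Signals` read-only after gathering; the real code may enforce that differently.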
The triage loop
RunEvents over SSE
Everything streams as `RunEvents` (step, gate, fallback, action, blocked, breaker, done) over Server-Sent Events, so the dashboard is just a live view of the agent's decision trail.
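
As a rough sketch of what one event might look like on the wire, the helper below serializes an event into the standard SSE text framing (`event:` and `data:` fields, terminated by a blank line). The function name `format_sse`, the payload shape, and the choice to carry the kind in the `event:` field are assumptions; only the seven event kinds come from the text.

```python
import json

# The seven event kinds named above.
EVENT_KINDS = {"step", "gate", "fallback", "action", "blocked", "breaker", "done"}


def format_sse(kind: str, payload: dict) -> str:
    """Serialize one run event as a Server-Sent Events message (hypothetical helper)."""
    if kind not in EVENT_KINDS:
        raise ValueError(f"unknown event kind: {kind}")
    # json.dumps never emits raw newlines, so the payload fits on one data: line.
    data = json.dumps(payload, separators=(",", ":"))
    # A blank line terminates each SSE message.
    return f"event: {kind}\ndata: {data}\n\n"


print(format_sse("gate", {"passed": True}), end="")
```

On the browser side, an `EventSource` listener registered for each kind would receive these messages in order, which is what lets a dashboard act as a plain view over the stream.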
The InfraBackend interface
The cluster sits behind one interface, `InfraBackend`, with two implementations:
- `K8sBackend`: the real `kind` cluster. Reads ReplicaSet revisions and ready-ratios, performs real image rollbacks and scales.
- `MockBackend`: a deterministic fixture for tests. It mirrors the same newest-first deploy ordering as the real one so behavior is identical across both.
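
One way to express such an interface in Python is a `typing.Protocol` that both backends satisfy structurally. The sketch below is illustrative only: the method names, their signatures, the `sha-new` / `sha-old` fixture data, and the rollback semantics (roll back *to* a given SHA) are all assumptions rather than the project's actual API.

```python
from __future__ import annotations

from typing import Protocol


class InfraBackend(Protocol):
    """Hypothetical shape of the backend interface."""

    def recent_deploys(self, service: str) -> list[str]: ...
    def rollback_deploy(self, service: str, to_sha: str) -> None: ...
    def scale_service(self, service: str, replicas: int) -> None: ...


class MockBackend:
    """Deterministic in-memory fixture; deploys are kept newest first."""

    def __init__(self) -> None:
        self._deploys = {"checkout": ["sha-new", "sha-old"]}  # newest first
        self.replicas = {"checkout": 2}

    def recent_deploys(self, service: str) -> list[str]:
        return list(self._deploys[service])

    def rollback_deploy(self, service: str, to_sha: str) -> None:
        history = self._deploys[service]
        if to_sha not in history:
            raise ValueError(f"unknown deploy {to_sha} for {service}")
        # Assumed semantics: drop every deploy newer than the rollback target.
        self._deploys[service] = history[history.index(to_sha):]

    def scale_service(self, service: str, replicas: int) -> None:
        self.replicas[service] = replicas


def newest_deploy(backend: InfraBackend, service: str) -> str:
    """Caller code depends only on the interface, never on a concrete backend."""
    return backend.recent_deploys(service)[0]
```

Because callers rely on index 0 being the newest deploy, a mock that ordered deploys differently from the real backend would quietly change behavior, which is presumably why the ordering is mirrored.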
The triage loop, step by step
This is `run_hardened`, and every step is failure-aware:
1. Trigger: an alert opens a triage run.
2. Gather: pull `Signals` from the cluster. Nothing destructive is reachable on this path; `prod-db` and `payments` are excluded from the actionable `services` list at the source.
3. Redact: mask secrets and PII in the gathered logs before the model ever sees them (the cluster signals deliberately include a leaked `postgres://…` credential line so you can watch this work).
4. Diagnose: the gateway routes to `prod-triage` and returns a structured `Diagnosis`.
5. Quality gate: rule-based groundedness (`suspected_resource` must be a real service, `suspected_deploy_sha` must be a real recent deploy, `confidence` ≥ 0.5) plus an independent LLM-as-judge that reasons about whether the action is actually justified by the evidence. Fail → re-route to a stronger model and re-diagnose.
6. Plan: turn the validated diagnosis into a typed `ProposedAction`.
7. Action gate: before any write, reject `scope=all` (blast radius), reject protected resources, confirm the target exists, and confirm the action matches the diagnosis. Fail → block and escalate; the destructive action simply never runs.
8. Execute: only a validated action runs, against the real cluster, through a narrow write path; tool failures are caught and degrade to a human hand-off.
9. Notify: page on-call and open an incident ticket through the MCP Gateway.
10. Resolve: re-gather to confirm the heal (`error_rate → 0.0`).
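
The redaction step and the rule-based half of the quality gate can be sketched in a few lines. This is an illustration under stated assumptions: the regex covers only credentials embedded in URLs (real redaction would also need PII patterns), the LLM-as-judge half is omitted, and `redact`, `groundedness_failures`, and the dict-shaped diagnosis are hypothetical names and shapes.

```python
from __future__ import annotations

import re

# Matches scheme://user:password@ so the credentials can be masked.
CREDENTIAL_URL = re.compile(r"\b([a-z][a-z0-9+.-]*)://[^\s/@:]+:[^\s/@]+@")


def redact(line: str) -> str:
    """Mask URL-embedded credentials in one log line (sketch, not exhaustive)."""
    return CREDENTIAL_URL.sub(r"\1://[REDACTED]@", line)


def groundedness_failures(
    diagnosis: dict,
    services: list[str],
    deploy_shas: list[str],
    min_confidence: float = 0.5,
) -> list[str]:
    """Return the reasons the diagnosis is ungrounded; an empty list means it passes."""
    reasons = []
    if diagnosis["suspected_resource"] not in services:
        reasons.append("suspected_resource is not a real service")
    if diagnosis["suspected_deploy_sha"] not in deploy_shas:
        reasons.append("suspected_deploy_sha is not a recent deploy")
    if diagnosis["confidence"] < min_confidence:
        reasons.append("confidence is below the threshold")
    return reasons


print(redact("connect postgres://admin:hunter2@db:5432/app failed"))
# prints: connect postgres://[REDACTED]@db:5432/app failed
```

Returning the list of reasons rather than a bare boolean keeps the failure explainable, which fits the human-readable reasons a `Verdict` carries.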
The `run_naive` path skips steps 3, 5, and 7 as well as the tool-failure handling. It trusts the first output and has every tool in hand, so it executes the catastrophe. That contrast is the demo.
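
To make the contrast concrete, here is a minimal sketch of the four action-gate checks that the hardened path applies before any write. The `ProposedAction` fields, the `action_gate` signature, and the reading of "matches the diagnosis" as "targets the suspected resource" are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    tool: str    # "rollback_deploy" or "scale_service"
    target: str  # hypothetical field: the service to act on
    scope: str   # assumed values: "single" or "all"


def action_gate(
    action: ProposedAction,
    services: list[str],
    protected: list[str],
    suspected_resource: str,
) -> tuple[bool, list[str]]:
    """Run every check and collect all failures, so the reasons are complete."""
    reasons = []
    if action.scope == "all":
        reasons.append("scope=all exceeds the allowed blast radius")
    if action.target in protected:
        reasons.append(f"{action.target} is a protected resource")
    if action.target not in services:
        reasons.append(f"{action.target} is not an actionable service")
    if action.target != suspected_resource:
        reasons.append("action target does not match the diagnosis")
    return (not reasons, reasons)


passed, reasons = action_gate(
    ProposedAction("scale_service", "prod-db", "all"),
    services=["checkout"],
    protected=["prod-db", "payments"],
    suspected_resource="checkout",
)
print(passed, len(reasons))  # prints: False 4
```

A path without this gate would hand the same `ProposedAction` straight to the write tools, which is the failure the naive run demonstrates.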
