Sisyphus — Lessons & Post-mortems

Lessons & Post-mortems

Honest notes from building and operating Sisyphus. These are real incidents with real timelines, published because post-mortems that stay internal don't help anyone. This page documents the full body of knowledge from the platform's first major review session and the architectural decisions that followed.

1. The Multi-Agent Bus Protocol

Sisyphus is operated by multiple AI agents, each with a single role: architect (strategic decisions, documentation), operator (infrastructure, VPS, deployments), and PM (one per project, owns source code and deploys). These roles coordinate through a file-based communication protocol — no shared memory, no API calls, just files in known directories.

The ticket lifecycle

Create ticket → Place in tickets/to_<role>/ → Receiver picks up → Write response → Move to archive

Every ticket has a sender, receiver, priority, and deliverable. Responses close tickets with a status: done, rejected, needs_clarification, or blocked. Each role checks for incoming tickets as the first action in every session.

Why it exists

The bus exists because role-mixing wastes time at scale. During the onboarding of two memorial sites, a single AI agent wore all hats at once — writing project source, editing platform config, and proposing architectural changes in the same session. The result: 9 of 14 hours wasted on dead ends. The agent designed deployment architecture from scratch instead of reading the platform's existing patterns. It proposed upgrading the VPS rather than examining how seven prior projects deployed. It presented A/B/C/D option menus instead of deciding.

The root cause was not incompetence — it was the absence of role boundaries. With no separation, the agent freely mixed concerns: project code, platform infrastructure, and architectural strategy tangled in every action. The human owner became the only consistency check, routing every handoff manually.

Operational gravity

Roles are safety boundaries, not work boundaries. An operator session cannot edit project source; a PM session cannot SSH to the VPS; an architect session cannot modify infrastructure. These constraints are enforced by per-role CLAUDE.md contracts and persona files. When a role needs to switch hats temporarily, it writes an intent file before acting — a first-class audit trail that makes hat-switches visible and reversible.

2. The 18-Role Review Panel

Platform reviews are conducted by an 18-persona panel where the architect assumes each expert perspective in sequence. The personas are: PM, SRE, DEV, Solutions Architect, Software Architect, Systems Architect, DBA, Cybersecurity, UX, Marketing, GDPR/Compliance, QA, Technical Editor, Statistician, Peer Reviewer, i18n, Financial, and Pedagogy.

Why 18? Because a single reviewer — even an expert one — has blind spots. A security specialist overweights CVE risk; an SRE underweights UX friction; a financial reviewer misses pedagogy gaps. The panel forces breadth.

Activation matrix

Not every role reviews every deliverable. An infrastructure change skips UX and Marketing. A documentation update skips DBA and Cybersecurity. The activation matrix maps deliverable types to the subset of personas that have relevant expertise, keeping reviews focused.

Ticket attribution

Panel findings are attributed as from: architect (<persona>) in each ticket. The receiver knows exactly which expertise made each claim — a Cybersecurity finding carries different weight than a Pedagogy finding, and the receiver can evaluate accordingly.

Bias correction

AI reviewers inflate severity, especially in security findings. The panel corrects for this with three mechanisms:

Dual grades: every finding gets an absolute score and a scope-adjusted score. A theoretical RCE on an internal-only admin endpoint scores differently than one on a public API.
Risk Score formula: combines severity, likelihood, and exposure surface into a single number, preventing inflated severity from dominating triage.
Self-critique pass: after the full panel review, the architect re-examines every Critical/High finding and asks: "is this finding based on verified exposure, or assumed exposure?" Unverified claims are downgraded.

3. Enforcement Mechanisms

Writing rules is the easy part. The hard part is making sure they're followed. During the first major review session, SLOs were defined but never checked, protocol rules were written but ignored, and a DNS migration script was authored for the wrong registrar. Rules without enforcement are aspirational text.

Dispatcher gates

Preflight validation: before executing any infrastructure action, run a context probe — dig NS before DNS changes, ls before path assumptions, curl before endpoint claims. No exceptions.
Postflight verification: after every change, verify the result from the user's perspective. Deploy a site? curl it. Rotate a secret? Test the webhook. Reload Caddy? Check the response headers.
Budget gate: no agent may silently raise infrastructure costs. "Upgrade the VPS" must first demonstrate that architectural alternatives have been exhausted by examining existing comparable projects.

Technical context validation

Every automation that touches DNS, deployments, or infrastructure must verify its assumptions before executing. A DNS switchover script that targets the wrong registrar is not a "minor bug" — it's a class of error that compounds: once to execute, once to undo, once to redo correctly. A single dig NS prevents the entire chain.

Response follow-up

A ticket in needs_clarification status that sits unaddressed for more than 1 hour triggers a stale-ticket alert. Unanswered questions don't age well — context evaporates, assumptions drift, and the cost of resolution grows.

Receiver authority

The receiver of any ticket has full authority to reject, downgrade, or return it. A ticket from the architect is not an order — it's a request backed by reasoning. If the operator determines the request is based on false premises, the correct response is rejected with an explanation, not silent compliance.

4. SLO Enforcement

The platform defines 7 SLOs: availability (99.5%), latency (p95 <500ms), container health (100%), TLS expiry (>7 days), backup freshness (daily), DNS resolution, and certificate validity. Before this review session, 2 were checked, 0 critical SLOs were measured. The verification script reported "compliant" while actual availability was 80.2%.

SLOs defined

actually checked

critical SLOs measured

checked after fix

Container healthy does not mean site working

Docker healthchecks reported all containers as "healthy." Five sites returned 503. The healthcheck tested whether the process was alive inside the container — it did not test whether the site was reachable through the reverse proxy, whether the volume mount was correct, or whether the response was 200. External HTTP probes are the only real health check.

After the fix

verify_slo.sh rewritten to check all 7 SLOs every 5 minutes via external HTTP probes, not container introspection.
Error budget tracking: each SLO has a 30-day rolling error budget. When budget drops below 20%, a deploy freeze is triggered automatically.
3-tier alert escalation: first alert at 60 seconds, repeat at 5 minutes, then every 15 minutes until acknowledged or auto-remediated.

5. Incident Lifecycle

The full incident lifecycle, from detection through post-mortem:

Detect → Alert → Auto-remediate → Re-check → KB match → Auto-close

Each incident records MTTI (mean time to identify), MTTR (mean time to resolve), and error budget burn. If auto-remediation succeeds on re-check, the incident auto-closes. If it fails, the Knowledge Base is searched for matching patterns.

The Knowledge Base

A growing catalog of pattern → cause → fix entries. Each time an incident is resolved, its signature is added to the KB. Future incidents matching the same pattern skip the diagnosis phase and go straight to the known fix. The KB is reinforced on use: successful matches increase the fix's confidence score; failed matches trigger a review.

Post-mortem generation

When an incident closes, a post-mortem ticket is auto-generated with the timeline, metrics, and remediation steps. If the error budget is exhausted, a deploy freeze is triggered until budget recovers. The goal is not just fixing the incident — it's making the system learn from it.

6. Disaster Recovery

4-8h

RTO (before)

∞

RPO (before)

30-45m

RTO (after)

24h

RPO (after)

Before the review session, disaster recovery was undefined. RTO (recovery time objective) was estimated at 4–8 hours — the time it would take someone to manually reconstruct the platform from memory. RPO (recovery point objective) was effectively infinity: total data loss, because no automated backups existed.

After the fix

backup.sh — daily automated backup of all site data, configuration, and secrets to off-VPS storage.
restore-from-backup.sh — single-command restore that rebuilds the platform from a backup artifact, targeting 30–45 minute RTO.
export-secrets.sh — exports all secrets (deploy tokens, API keys, basic_auth hashes) in a format that restore-from-backup.sh can consume. Secrets never live in git.
DNS switchover — documented procedure for re-pointing domains to a new VPS via the registrar's API. Written for the actual registrar (Gandi), not the wrong one (the first version targeted Cloudflare because nobody ran dig NS).
DR documentation — step-by-step runbook covering "the VPS is gone, rebuild everything from scratch" with estimated times for each phase.

7. Post-mortems

7.1 The monitoring that didn't monitor

During a structured review session, an 18-persona panel evaluated Sisyphus across SRE dimensions: monitoring, self-healing, incident response, SLO compliance. The panel produced detailed scores and recommendations. There was one problem: nobody curled a single site.

While the panel graded SRE capabilities, 5 of 10 hosted sites were returning 503. The platform was running at 80.2% availability against a 99.5% SLO. The kitchen was on fire while we reviewed the fire extinguisher manual.

5/10

sites down

80.2%

actual availability

9.5h

undetected

2/7

SLOs actually checked

Timeline

T+0h

Review session begins. 18 AI personas evaluate Sisyphus SRE capabilities. No live probes are run.

T+1h

Panel produces scores for monitoring (7/10), self-healing (6/10), incident response (8/10). All scores based on reading code, not observing behavior.

T+2h

DNS switchover script written targeting the wrong registrar. Nobody ran dig NS.

T+3h

SLO verification script written. Checks 2 of 7 defined SLOs. Reports "compliant". 5 SLOs not measured.

T+9.5h

Manual curl discovers 5 sites returning 503. Docker healthchecks show all containers "healthy".

T+10h

Root cause identified: container healthy does not mean site serves 200. SRE agent checked containers, not endpoints.

Changes made after this incident

External HTTP probes added. The SRE agent now curls every site endpoint, not just docker ps. A 503 from the outside is an incident, regardless of container status.
SLO verification covers all 7 SLOs. The verify script was rewritten to check every defined SLO. Partial verification is treated as a bug.
Pre-flight validation protocol. Any automation that touches DNS, deployments, or infrastructure must run a context probe first: dig, curl, docker ps --format. No exceptions.
Review sessions must start with live probes. Before scoring any SRE capability, verify the platform is actually running. Read the metrics, don't read the code that produces them.

7.2 Process lessons

Patterns that emerged across the full session — not from a single incident, but from observing how work went wrong repeatedly.

1Health check is job zero

Before reviewing, designing, or fixing anything, verify what's actually live. The first action in any SRE session should be a probe — curl, dig, docker ps — not reading code or documentation. Reading code tells you what should happen; a probe tells you what is happening.

2Commit after fix

Uncommitted changes cannot be rolled back. Every fix, no matter how small, gets committed before moving on. If the next change breaks something, git revert works. If the change was never committed, you're reconstructing from memory.

3Automate, don't delegate to the user

If an API exists, use it. Don't write a fix and then tell the user "now please run these 5 commands." The point of an AI agent is to compress human effort. Every manual step handed to the user is a failure of automation.

4Dispatch proactively

Don't wait for the user to launch PM sessions or route tickets manually. If the architect identifies work for a PM, write the ticket immediately. If the operator needs the PM to rebuild, say so in a ticket — don't wait for a human to relay the message.

5Parallel not sequential

Independent work runs in parallel. If two projects need the same class of change and neither depends on the other, don't serialize them. Agent time is not calendar time — parallelism is free when the tasks are independent.

6Validate technical context

Run dig, curl, ls before executing. A DNS migration script for the wrong registrar wastes triple: write it, discover it's wrong, rewrite it for the right one. A $0.01 verification check is cheaper than any script.

7Panel severity inflation

LLM reviewers systematically inflate security findings. A theoretical remote code execution on an internal-only admin endpoint is not the same as one on a public API. Always verify the exposure surface before accepting a severity rating. Unverified Critical findings are downgraded to Medium until exposure is confirmed.

8Cost of unverified claims

Every action taken on a false premise costs double: once to execute, once to undo and redo correctly. The wrong-registrar script had to be thrown away. The SLO report had to be rewritten. The review scores had to be discarded. A wrong assumption in the first 5 minutes poisons every minute that follows.

8. The Last Mile Problem

12 hours. 90+ tickets. Dozens of agents. 24 review personas. At the end of the day, Andrea said: "it's so sad that AI cannot really do much." He's right. Here's what we learned about what AI can and cannot do, measured against one full session.

What AI can do

Generate code fast — hundreds of lines per minute, across languages and frameworks.
Write documents — protocols, retrospectives, runbooks, post-mortems. This page is one of them.
Parallelize mechanical work — batch file edits, ticket generation, review passes across repos.
Find patterns in codebases — grep, cross-reference, trace dependencies at scale.
Produce review findings — 24 personas, 11 repos, structured scoring, prioritized roadmaps.

What AI cannot do

Verify its own output — self-audit score: 1/12 compliance with its own rules. The panel wrote rules the panel didn't follow.
Deploy code — push does not equal live. 17 fixes sat undeployed while marked "done."
Close loops — 124 responses accumulated unread. A critical clarification request was buried for 3+ hours.
Know when it's wrong — wrote "Cloudflare" into the disaster recovery runbook when DNS was Gandi. Nobody ran dig NS.
Know when to stop — 100+ tickets when 10 would have sufficed. Each review round generated more work than the system could execute.
Replace human judgment — Andrea corrected the system 15+ times. Added Umami analytics to a children's site. Declared work done that wasn't deployed. Deflected when asked about cost.
Remember context across the session — the same mistakes repeated hour after hour. Rules added at hour 4 were violated at hour 6.

The real cost

The real cost isn't tokens — it's Andrea's time correcting the machine. Every "done" that wasn't deployed. Every finding that wasn't verified. Every ticket that got lost. That's not AI failure in the dramatic sense. It's the gap between generating output and producing results.

The session grade was D+. Not because the outcomes were bad — all 12 sites ended up live, the protocol was hardened, the roadmap was generated. The D+ was for the process: 11 hours and 15 human corrections to reach outcomes that should have taken 3–4 hours with 2–3 touchpoints.

Policy-as-code partially solves this

Mechanical enforcement catches some failures. A dispatcher that runs curl after deploy, that rejects tickets with unverified context, that hard-stops at budget caps — these work because the entity being constrained doesn't control the constraint. Every rule implemented in code was followed. Every rule that existed only as text was violated.

But the fundamental problem remains: AI generates work, humans verify it works. Until the AI can curl a site, read the response, and honestly say "this is broken" without being asked — the last mile is human. The boulder rolls back down.

1/12

rules self-followed

"done" not deployed

124

responses unread

15+

human corrections

9. What Sisyphus Actually Is

Sisyphus is split into public and private layers. The public pages — including this one — document the methodology: how multi-agent coordination works, what the review panel looks like, how SLOs are enforced, what went wrong and how it was fixed. These pages exist because the methodology is the interesting part, and it's useful to others.

The confidential layer (/ops/) contains the actual infrastructure: container definitions, secrets, Caddyfile, deploy webhooks, project registry, SRE agent code. That layer is not public because it contains operational details specific to this deployment — not because the patterns are secret, but because the specifics (credentials, internal endpoints, project configurations) belong behind authentication.

If you're reading this page to learn from it: the lessons above are real. The methodology is in production. The mistakes were made and the fixes were applied. Nothing here is theoretical.