Sisyphus

Why Sisyphus

Coolify, CapRover, and Dokku give you push-to-deploy on your own server. Sisyphus does that too — but it also watches what it deployed.

An AI-powered SRE agent monitors every service, diagnoses failures with the Anthropic API, and remediates automatically. External HTTP probes check every site every 60 seconds. Three-tier alert escalation pages you when something is actually wrong — not when a container restarts.

No Kubernetes. No cluster. No control plane. One VPS, Docker, Caddy, and an agent that knows when to restart a service and when to roll back a deploy.

Under 16 EUR/month for the entire stack.

What it does

Hosts static sites, containerized apps, and AI-powered products on a single VPS. Auto-provisioned TLS, centralized authentication, file-based multi-agent coordination, and structured SLO enforcement.

Currently runs 12 sites across 4 domains with a 99.5% availability target, automated backups, and a 24-role review panel for continuous improvement.

Deploy a project

Three steps to go from repo to live site:

Add a CLAUDE.md to your repo with the project tech contract and PM role section.
Add Dockerfile + validate.sh — the Dockerfile builds your app, validate.sh checks it before deploy.
Push to main — CI/CD triggers the deploy webhook, the SRE agent pulls, builds, and verifies. Post-deploy health check confirms the site is live.

Static sites are even simpler: push HTML, the webhook pulls it, Caddy serves it. No build step.

Architecture

                        Internet
                            |
                [ Caddy v2 - auto-HTTPS ]
              /     |       |       |     \
         static  static    app     app   static
          sites   sites    apps    apps   sites
              \     |       |       |     /
             [ Docker - container runtime ]
                            |
       +------------+-------+------+--------------+
       |            |              |              |
  [ Consul ]   [ CrowdSec ]  [ SRE Agent ]  [ Headscale ]
  svc disc.    threat intel  monitor+heal   zero-trust VPN
                                   |
                            [ Claude API ]
                             AI diagnosis
                                   |
                          +--------+--------+
                          |                 |
                    [ Telegram ]    [ Incident API ]
                    operator        project owners
                    alerts          self-service

Infrastructure

OS	Flatcar Container Linux
Runtime	Docker
Proxy	Caddy v2 (ACME TLS)
VPN	Headscale (WireGuard)
Provisioning	Bash + Butane/Ignition

Operations

Monitoring	SRE Agent (Python)
AI	Claude API (capped)
Security	CrowdSec + ACME
Backups	restic (encrypted)
Secrets	env vars (not in git)

Deploy flow

Static sites (CI-built)

push to main → GitHub Actions builds → gh-pages branch → VPS pulls + serves

Static sites (local-build)

build locally → deploy.sh (rsync) → webhook validates HMAC → VPS unpacks + serves
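
The HMAC check in the local-build flow could look roughly like the sketch below. The page only says the webhook validates an HMAC, so the SHA-256 digest, the hex encoding, and the function names `sign_payload` and `is_valid_signature` are assumptions made for illustration.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Return the hex HMAC-SHA256 digest of the payload (assumed scheme)."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def is_valid_signature(secret: bytes, payload: bytes, signature: str) -> bool:
    """Check a received signature against the one computed locally."""
    expected = sign_payload(secret, payload)
    # compare_digest runs in constant time, so timing does not leak the digest.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

In this sketch deploy.sh would send the output of `sign_payload` alongside the upload, and the webhook would refuse to unpack anything for which `is_valid_signature` returns False.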

App projects

push to main → CI tests → deploy webhook → VPS builds image → health check 60s

unhealthy? → auto-rollback to previous SHA
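
A minimal sketch of the health-check-then-rollback step, assuming "health check 60s" means the new build has up to 60 seconds to report healthy. The `start` and `is_healthy` callables, the 5-second retry interval, and the function name are hypothetical stand-ins, not the agent's actual interface.

```python
import time
from typing import Callable


def deploy_with_rollback(
    new_sha: str,
    previous_sha: str,
    start: Callable[[str], None],
    is_healthy: Callable[[], bool],
    window_s: float = 60.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Start new_sha and return the SHA left running after the health window."""
    start(new_sha)
    waited = 0.0
    while True:
        if is_healthy():
            return new_sha
        if waited >= window_s:
            break
        sleep(interval_s)
        waited += interval_s
    # Still unhealthy when the window closed: go back to the last good build.
    start(previous_sha)
    return previous_sha
```

Passing `sleep` as a parameter keeps the function testable without waiting a real minute.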

Self-healing

The SRE agent polls every 60 seconds. It detects anomalies, remediates automatically, verifies the fix, and notifies the project owner. Claude API provides AI root-cause analysis on correlated failures (budget-capped).

poll (60s) → detect anomaly → remediate → verify → notify owner
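
One pass through that loop can be sketched as a single function. The four callables and the outcome strings are hypothetical; the real agent's structure is not described here.

```python
from typing import Callable, Optional


def run_cycle(
    detect: Callable[[], Optional[str]],
    remediate: Callable[[str], None],
    verify: Callable[[], bool],
    notify: Callable[[str], None],
) -> str:
    """Run one poll cycle and return its outcome."""
    anomaly = detect()
    if anomaly is None:
        return "healthy"
    remediate(anomaly)
    # Verify after remediating so the owner hears the actual result.
    outcome = "resolved" if verify() else "unresolved"
    notify(f"{anomaly}: {outcome}")
    return outcome
```

A scheduler would call `run_cycle` once per 60-second interval. The owner is notified only when an anomaly was found, whether or not the fix held.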

Automated remediations

Trigger	Action	Mode
Container unhealthy >2min	Restart (3/hr limit)	observe
Deploy unhealthy	Rollback to previous SHA	observe
Disk >85%	Prune images + containers	observe
Disk >90%	Full cleanup + log rotation	diagnose
Memory >90%	Restart heaviest container	auto
Correlated failures	Claude AI root-cause analysis	diagnose
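
The threshold rows of the table can be read as a small rule set. The sketch below is one possible encoding, not the agent's code: the `Snapshot` fields and action names are hypothetical, the table does not say whether the 90% disk rule replaces the 85% rule (assumed here), and the 3/hr restart limit is modelled as a sliding one-hour window, which is also an assumption.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class Snapshot:
    unhealthy_s: float = 0.0  # seconds the container has been unhealthy
    disk_pct: float = 0.0
    memory_pct: float = 0.0


def pick_actions(s: Snapshot) -> List[str]:
    """Map one metrics snapshot to the remediations the table calls for."""
    actions = []
    if s.unhealthy_s > 120:
        actions.append("restart")
    if s.disk_pct > 90:
        actions.append("full-cleanup")  # assumed to supersede the prune rule
    elif s.disk_pct > 85:
        actions.append("prune")
    if s.memory_pct > 90:
        actions.append("restart-heaviest")
    return actions


@dataclass
class RestartLimiter:
    limit: int = 3
    window_s: float = 3600.0
    stamps: Deque[float] = field(default_factory=deque)

    def allow(self, now: float) -> bool:
        """Record and allow a restart unless the hourly limit is used up."""
        while self.stamps and now - self.stamps[0] >= self.window_s:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            return False
        self.stamps.append(now)
        return True
```

With this limiter a fourth restart inside the same hour is refused. What the agent does instead at that point (escalate, alert) is not specified in the table.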

Multi-agent bus

Three AI agents coordinate via file-based tickets. Roles define safety boundaries (what you must not do unsupervised), not work boundaries. Sensing, reading, and QA flow freely to whoever has context.
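
As an illustration of file-based coordination, a ticket can be written atomically so that another agent never reads a half-written file. The ticket format is not specified on this page; the JSON fields, the `.json` naming, and the `write_ticket` helper below are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def write_ticket(bus_dir: Path, ticket_id: str, from_role: str,
                 to_role: str, body: str) -> Path:
    """Write a ticket file so readers see it whole or not at all."""
    bus_dir.mkdir(parents=True, exist_ok=True)
    final = bus_dir / f"{ticket_id}.json"
    payload = {"id": ticket_id, "from": from_role, "to": to_role, "body": body}
    # Write to a temp file in the same directory, then rename into place.
    fd, tmp = tempfile.mkstemp(dir=bus_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    os.replace(tmp, final)
    return final
```

`os.replace` is atomic when source and destination are on the same filesystem, which is why the temp file is created inside the bus directory.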

Operational gravity

Free layer — any role, no ticket needed

read any file · git log/diff · headless chromium · curl · decide (with docs) · write tickets

Hard boundaries — designated role + intent file required

operator	SSH · secrets · containers · VPS rm
PM	git push project · edit source · deploy.sh
architect	git push bus/ · edit protocol · edit personas

Hat-switch

When the round-trip cost of the chain exceeds the safety value of separation, an agent switches hats — a first-class protocol operation, not a violation.

agent wearing hat A → intent file (scope + boundaries) → work as hat B → intent file (return)

self-authorized if no hard boundary crossed · user override if hard boundary
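
The authorization rule for a hat-switch can be sketched as a lookup against the hard boundaries listed earlier. The action identifiers and function names are hypothetical labels for this sketch. The sketch also treats a user override as sufficient whenever a hard boundary is crossed, which is a reading of the rule rather than a documented behaviour.

```python
from typing import Iterable, Optional

# Hard boundaries per role; identifiers are made-up labels for this sketch.
HARD_BOUNDARIES = {
    "operator": {"ssh", "secrets", "containers", "vps-rm"},
    "pm": {"git-push-project", "edit-source", "deploy.sh"},
    "architect": {"git-push-bus", "edit-protocol", "edit-personas"},
}


def owning_role(action: str) -> Optional[str]:
    """Return the role whose hard boundary covers the action, if any."""
    for role, actions in HARD_BOUNDARIES.items():
        if action in actions:
            return role
    return None  # free layer: any role may do this


def authorize_switch(planned: Iterable[str], user_override: bool = False) -> bool:
    """Self-authorize free-layer work; require an override past a boundary."""
    crosses_hard_boundary = any(owning_role(a) is not None for a in planned)
    return user_override or not crosses_hard_boundary
```

For example, `authorize_switch(["read-file", "curl"])` passes without an override, while `authorize_switch(["edit-source"])` does not.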

“Roles define safety boundaries, not work boundaries. The function with senses absorbs the function without them.” — Decision record, after Only-Ops: Operational Gravity in the AI Era

SLOs

Service levels

Availability	99.5% / 24h
Latency p95	<500ms
Container health	100%
TLS expiry	>7 days
Backups	daily
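
To make the availability target concrete: 0.5% of 24 hours is 7.2 minutes, or 432 seconds of allowed downtime per day. The sketch below works that out, and also converts it to a probe count on the assumption that availability is measured by the 60-second external probes (the page does not say how it is measured). The function names are hypothetical.

```python
def error_budget_seconds(target: float, window_s: float) -> float:
    """Seconds of downtime the target tolerates within one window."""
    return (1.0 - target) * window_s


def allowed_failed_probes(target: float, window_s: float, interval_s: float) -> int:
    """Whole number of failed probes that still meets the target."""
    total_probes = round(window_s / interval_s)
    # Small epsilon guards against float error just below a whole number.
    return int(total_probes * (1.0 - target) + 1e-9)
```

With a 99.5% target over 86,400 seconds and one probe per minute, that is 1,440 probes a day, of which at most 7 may fail.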

Resolution targets

P0	15 min
P1	1 hour
P2	half day
P3	1 day
P4	1 week

Cost

Hetzner CX22 (2 vCPU, 4 GB)	€4.00
Encrypted backup storage	€3.50
Domains	~€3.00
AI diagnosis API	€5.00 max
Total	€15.50/mo