Sisyphus / SRE Agent

What the agent observes, diagnoses, and remediates in each operating mode

Escalation Flow

Collect data
Check thresholds
Alert (Telegram)
Remediate
AI diagnosis
Verify SLOs

Each mode unlocks more steps in this pipeline. off only writes a heartbeat and polls Telegram for mode switches. observe adds monitoring, alerts, and safe scripted fixes. diagnose adds AI analysis and SLO verification. auto enables the full closed loop.
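The mode-to-step relationship above can be sketched as a small lookup. This is a minimal illustration, not the agent's actual code: the `Mode` enum, the step names, and the minimum mode assigned to each step are assumptions read off the mode descriptions in this document.

```python
from enum import IntEnum


class Mode(IntEnum):
    """Operating modes, ordered so that a higher value unlocks more steps."""
    OFF = 0
    OBSERVE = 1
    DIAGNOSE = 2
    AUTO = 3


# Assumed mapping: the lowest mode in which each escalation step runs.
# The step names are hypothetical labels for the steps of the escalation flow.
STEP_MIN_MODE = {
    "collect_data": Mode.OBSERVE,
    "check_thresholds": Mode.OBSERVE,
    "alert_telegram": Mode.OBSERVE,
    "remediate": Mode.OBSERVE,
    "ai_diagnosis": Mode.DIAGNOSE,
    "verify_slos": Mode.DIAGNOSE,
}


def enabled_steps(mode: Mode) -> list[str]:
    """Return the pipeline steps that run in the given mode, in flow order."""
    return [step for step, minimum in STEP_MIN_MODE.items() if mode >= minimum]
```

In this sketch, auto enables the same step names as diagnose; the difference between the two lies in which remediations are permitted, not in which steps run.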

Operating Modes

off: Off (€0/mo)

Heartbeat only. The agent stays alive but does nothing else. Telegram commands are still processed so you can switch modes remotely.

What it does
  • Write heartbeat file every cycle (healthcheck dependency)
  • Poll Telegram for /sre commands (mode switching)
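A heartbeat file that a healthcheck depends on is usually written atomically, so the checker never reads a half-written file. The sketch below shows one way to do that; the function names, the file contents (a Unix timestamp), and the freshness window are assumptions, not details taken from the agent.

```python
import os
import tempfile
import time


def write_heartbeat(path, now=None):
    """Write the current Unix timestamp to `path` via a temp file and rename."""
    now = time.time() if now is None else now
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so os.replace stays atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".heartbeat-")
    with os.fdopen(fd, "w") as handle:
        handle.write(f"{int(now)}\n")
    os.replace(tmp_path, path)


def heartbeat_is_fresh(path, max_age_seconds, now=None):
    """Return True if the heartbeat exists and is at most `max_age_seconds` old."""
    now = time.time() if now is None else now
    try:
        with open(path) as handle:
            written = int(handle.read().strip())
    except (OSError, ValueError):
        # A missing or unreadable heartbeat counts as stale.
        return False
    return now - written <= max_age_seconds
```

A healthcheck could call `heartbeat_is_fresh(path, 120)` to tolerate one missed 60-second cycle; the 120-second figure is only an example.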

observe: Observe (€0/mo)

Full monitoring with Telegram alerts and safe, pre-scripted remediations. No AI calls. This is the default mode.

Observes
  • System resources (CPU, memory, disk) via /proc and df
  • Container health (all containers) via docker/podman ps
  • Caddy metrics endpoint via HTTP poll
  • TLS certificate expiry via Caddy admin API
  • Backup freshness via restic snapshots
Alerts on
  • Disk ≥ 85% (warning)
  • Memory ≥ 90% (warning)
  • CPU ≥ 95% sustained (warning)
  • Container unhealthy / exited / dead
  • TLS cert expiring within 7 days
  • Backup older than 26 hours
  • Caddy metrics endpoint unreachable
Remediates
  • Restart unhealthy containers (max 3/hr per container)
  • Light disk prune on disk ≥ 85% (prune_disk.sh --light)
  • Light disk prune on disk ≥ 90% (prune_disk.sh --light; the full prune requires diagnose mode)
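The "max 3/hr per container" restart limit can be modelled as a sliding-window counter keyed by container name. This is a sketch under assumptions: the `RestartLimiter` class is hypothetical, and the agent may count restarts differently (for example per clock hour rather than per rolling hour).

```python
import time
from collections import defaultdict, deque


class RestartLimiter:
    """Allow at most `max_restarts` per container within a rolling window."""

    def __init__(self, max_restarts=3, window_seconds=3600.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # container name -> restart times

    def allow(self, container, now=None):
        """Record and permit a restart, or return False if the limit is hit."""
        now = time.monotonic() if now is None else now
        history = self._history[container]
        # Drop restarts that have aged out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_restarts:
            return False
        history.append(now)
        return True
```

Because the history is kept per container, a flapping container exhausts only its own allowance and does not block restarts of others.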

diagnose: Diagnose (~€0.01–0.05/incident)

Everything in observe, plus AI-powered root cause analysis when multiple alerts fire together. Uses Claude Haiku with a monthly budget cap.

Observes
  • Everything from observe mode
Diagnoses
  • 🧠 AI root cause analysis on ≥ 2 correlated alerts (Claude Haiku)
  • 🧠 Includes recent incident history in analysis context
  • 🧠 Includes runbook/playbook in analysis prompt
  • 🧠 Auto-downgrades to observe if budget exhausted (€5/mo cap)
Remediates
  • Everything from observe mode
  • Full disk prune on disk ≥ 90% (prune_disk.sh, full)
  • SLO verification after remediations (verify_slo.sh)
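The diagnosis trigger and the context it sends can be sketched as one function that returns nothing below the alert threshold and a prompt otherwise. The function name, the prompt layout, and the treatment of "correlated" as "active in the same poll cycle" are assumptions; the real prompt and the call to the model are not shown in this document.

```python
def build_diagnosis_prompt(alerts, recent_incidents, runbook, min_alerts=2):
    """Return a prompt string when enough alerts are active, else None.

    `alerts` and `recent_incidents` are lists of short strings; `runbook`
    is the playbook text. All three shapes are hypothetical.
    """
    if len(alerts) < min_alerts:
        # A single alert is handled by the scripted path with no AI call.
        return None
    alert_lines = "\n".join(f"- {alert}" for alert in alerts)
    incident_lines = "\n".join(f"- {item}" for item in recent_incidents) or "- none"
    sections = [
        "Active alerts:\n" + alert_lines,
        "Recent incidents:\n" + incident_lines,
        "Runbook:\n" + runbook,
    ]
    return "\n\n".join(sections)
```

Returning None for the single-alert case keeps the cost decision in one place: the caller spends budget only when it receives a prompt.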

auto: Auto (~€0.02–0.10/incident)

Full closed loop. Everything in diagnose, plus aggressive remediations that involve restarting healthy-but-heavy containers to free resources.

Observes
  • Everything from diagnose mode
Diagnoses
  • 🧠 Everything from diagnose mode
Remediates
  • Everything from diagnose mode
  • Restart heaviest non-Caddy container on memory ≥ 90% (max 3/hr)
  • Full disk prune on disk ≥ 90%
  • SLO verification after remediations
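Choosing the "heaviest non-Caddy container" amounts to a filtered maximum over per-container memory usage. In the sketch below, the input shape (dicts with `name` and `memory_bytes`) and the exact-name match against "caddy" are assumptions about how the agent identifies containers.

```python
def pick_restart_candidate(containers, protected=("caddy",)):
    """Return the name of the heaviest container not in `protected`, or None."""
    candidates = [c for c in containers if c["name"] not in protected]
    if not candidates:
        return None
    heaviest = max(candidates, key=lambda c: c["memory_bytes"])
    return heaviest["name"]
```

Excluding Caddy keeps the reverse proxy serving traffic even when it happens to be the largest memory consumer.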

SLO Thresholds

Metric | Threshold | Action | Min. mode
Disk usage | ≥ 85% | Light prune | observe
Disk usage | ≥ 90% | Full prune + log cleanup | diagnose
Memory usage | ≥ 90% | Restart heaviest container | auto
CPU sustained | ≥ 95% | Alert only | observe
Container health | unhealthy > 2 min | Restart (3/hr limit) | observe
TLS cert expiry | < 7 days | Alert (Caddy auto-renews) | observe
Backup freshness | > 26 hours | Alert | observe
Correlated alerts | ≥ 2 simultaneous | AI root cause analysis | diagnose
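The resource rows of the table can be read as a function from current metrics and mode to a list of actions. The sketch below covers disk, memory, and CPU only; the action labels and the function name are hypothetical, and it assumes that below auto mode a memory breach produces only an alert.

```python
MODE_RANK = {"off": 0, "observe": 1, "diagnose": 2, "auto": 3}


def plan_actions(mode, disk_pct, memory_pct, cpu_pct):
    """Map resource metrics to action labels according to the threshold table."""
    rank = MODE_RANK[mode]
    actions = []
    if rank < MODE_RANK["observe"]:
        return actions  # off mode: heartbeat only
    if disk_pct >= 90 and rank >= MODE_RANK["diagnose"]:
        actions.append("full_prune")
    elif disk_pct >= 85:
        # Below diagnose mode the prune stays light, even at 90% and above.
        actions.append("light_prune")
    if memory_pct >= 90:
        if rank >= MODE_RANK["auto"]:
            actions.append("restart_heaviest_container")
        else:
            actions.append("alert_memory")
    if cpu_pct >= 95:
        actions.append("alert_cpu")  # CPU is alert-only in every mode
    return actions
```

For example, `plan_actions("observe", 92, 50, 10)` yields only a light prune, because the full prune requires diagnose mode.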

Telegram Commands

Command | Description
/sre status | Current mode, budget, container health, resource usage
/sre incidents | Last 5 incidents with triggers and actions taken
/sre off | Switch to off mode
/sre observe | Switch to observe mode
/sre diagnose | Switch to diagnose mode
/sre auto | Switch to auto mode
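Parsing these commands reduces to validating a two-token message. The sketch below is an assumption about how incoming message text might be classified; it does not call the Telegram API, and the `("set_mode", ...)` / `("query", ...)` return shape is hypothetical.

```python
MODES = ("off", "observe", "diagnose", "auto")
QUERIES = ("status", "incidents")


def parse_sre_command(text):
    """Classify a message as a mode switch, a query, or not a valid command."""
    parts = text.strip().split()
    if len(parts) != 2 or parts[0] != "/sre":
        return None
    argument = parts[1].lower()
    if argument in MODES:
        return ("set_mode", argument)
    if argument in QUERIES:
        return ("query", argument)
    return None
```

Rejecting anything that is not exactly `/sre <argument>` keeps unrelated chat messages and typos from changing the operating mode.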

AI Budget

Claude Haiku API spend is capped at €5.00/month, and the spend counter resets on the 1st of each month. When spend reaches 80% of the cap, a warning is sent via Telegram. At 100%, the agent auto-downgrades to observe mode.

Event | Action
Spend ≥ 80% of cap | Telegram warning
Spend ≥ 100% of cap | Auto-downgrade to observe mode + Telegram alert
New month | Spend resets to €0.00
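The budget rules can be sketched as a small tracker that works in integer cents to avoid floating-point rounding at the 80% and 100% boundaries. The `Budget` class, its field names, and the returned labels are assumptions; how the agent actually persists spend is not described here.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """Track monthly AI spend in cents against a fixed cap."""
    cap_cents: int = 500          # €5.00/month
    spent_cents: int = 0
    period: tuple = (1970, 1)     # (year, month) of the current counter

    def record(self, cost_cents, year, month):
        """Add a call's cost and return 'ok', 'warn', or 'downgrade_to_observe'."""
        if (year, month) != self.period:
            # A new calendar month starts a fresh counter.
            self.period = (year, month)
            self.spent_cents = 0
        self.spent_cents += cost_cents
        if self.spent_cents >= self.cap_cents:
            return "downgrade_to_observe"
        if self.spent_cents * 100 >= self.cap_cents * 80:
            return "warn"
        return "ok"
```

This sketch returns "warn" on every call between 80% and 100%, so a caller would need to remember that the warning was already sent to avoid repeating the Telegram message.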

Data Sources

Source | Method | Interval
CPU / Memory | /proc/stat, /proc/meminfo | Every poll cycle (60s)
Disk | df | Every poll cycle
Containers | docker/podman ps --format json | Every poll cycle
Caddy metrics | HTTP GET /metrics | Every poll cycle
TLS certificates | Caddy admin API /certificates | Every poll cycle
Backups | restic snapshots --json --latest 1 | Every poll cycle
Incidents | Local JSONL file | Appended on alert
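As an illustration of the /proc/meminfo source, the sketch below derives a memory usage percentage from the file's text. Defining "used" as MemTotal minus MemAvailable is an assumption; the agent may compute the figure differently, and the function name is hypothetical.

```python
def memory_used_percent(meminfo_text):
    """Compute used memory as a percentage from /proc/meminfo contents."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token is the value (in kB for the memory fields).
            fields[key.strip()] = int(parts[0])
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total
```

On a live system the input would come from reading /proc/meminfo once per poll cycle; taking the text as a parameter keeps the parsing testable without a Linux host.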