Sisyphus / SRE Agent

What the agent observes, diagnoses, and remediates in each operating mode

Escalation Flow

Collect data → Check thresholds → Alert (Telegram) → Remediate → AI diagnosis → Verify SLOs

Each mode unlocks more steps in this pipeline. off writes a heartbeat and still listens for mode-switch commands, but nothing else. observe adds monitoring, alerts, and light fixes. diagnose adds AI analysis and SLO verification. auto enables the full closed loop.
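The way modes gate the pipeline can be sketched as a lookup from each step to the lowest mode that runs it. This is a minimal illustration, not the agent's actual code: the `Mode` enum, the step names, and `enabled_steps` are hypothetical, and the placement of SLO verification at diagnose follows the Remediates list for that mode.

```python
from enum import IntEnum


class Mode(IntEnum):
    # Ordered so that a higher mode includes everything below it.
    OFF = 0
    OBSERVE = 1
    DIAGNOSE = 2
    AUTO = 3


# Steps in escalation-flow order, each paired with the lowest mode that runs it.
# Step names are illustrative placeholders.
PIPELINE = [
    ("collect_data", Mode.OBSERVE),
    ("check_thresholds", Mode.OBSERVE),
    ("alert_telegram", Mode.OBSERVE),
    ("remediate", Mode.OBSERVE),
    ("ai_diagnosis", Mode.DIAGNOSE),
    ("verify_slos", Mode.DIAGNOSE),
]


def enabled_steps(mode):
    """Return the pipeline steps that run in the given mode, in order."""
    return [name for name, min_mode in PIPELINE if mode >= min_mode]
```

In this sketch auto runs the same steps as diagnose; what auto adds is a wider set of remediations inside the remediate step, not a new stage.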

Operating Modes

off Off €0/mo

Heartbeat only. The agent stays alive but does nothing else. Telegram commands are still processed so you can switch modes remotely.

What it does

✓ Write heartbeat file every cycle (healthcheck dependency)
✓ Poll Telegram for /sre commands (mode switching)
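A heartbeat that a healthcheck depends on could look roughly like the sketch below. The file format (a Unix timestamp on one line), the function names, and the freshness window are assumptions; the page only states that a heartbeat file is written every cycle.

```python
import os
import tempfile
import time


def write_heartbeat(path, now=None):
    """Atomically write the current Unix timestamp to `path`."""
    ts = time.time() if now is None else now
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the target,
    # so a reader never sees a half-written heartbeat.
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(f"{ts:.0f}\n")
    os.replace(tmp, path)


def heartbeat_is_fresh(path, max_age_s, now=None):
    """Healthcheck side: True if the heartbeat is at most `max_age_s` old."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            return now - float(f.read().strip()) <= max_age_s
    except (OSError, ValueError):
        # Missing or unreadable heartbeat counts as unhealthy.
        return False
```

With a 60-second poll cycle, a `max_age_s` of two or three cycles would tolerate one slow cycle without flapping; the exact value is a tuning choice.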

observe Observe €0/mo

Full monitoring with Telegram alerts and safe, pre-scripted remediations. No AI calls. This is the default mode.

Observes

● System resources (CPU, memory, disk) via /proc + df
● Container health, all containers (docker/podman ps)
● Caddy metrics endpoint (HTTP poll)
● TLS certificate expiry (Caddy admin API)
● Backup freshness (restic snapshots)

Alerts on

⚠ Disk ≥ 85% (warning)
⚠ Memory ≥ 90% (warning)
⚠ CPU ≥ 95% sustained (warning)
⚠ Container unhealthy / exited / dead
⚠ TLS cert expiring within 7 days
⚠ Backup older than 26 hours
⚠ Caddy metrics endpoint unreachable

Remediates

⚒ Restart unhealthy containers (max 3/hr per container)
⚒ Light disk prune on disk ≥ 85% (prune_disk.sh --light)
⚒ Light disk prune only, even on disk ≥ 90% (prune_disk.sh --light); the full prune requires diagnose mode
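The "max 3/hr per container" limit can be illustrated with a sliding-window counter. This is one plausible reading; the page does not say whether the agent uses a sliding window or a fixed hourly bucket, and the `RestartLimiter` class is hypothetical.

```python
import time
from collections import defaultdict, deque


class RestartLimiter:
    """Allow at most `max_restarts` per container within `window_s` seconds."""

    def __init__(self, max_restarts=3, window_s=3600):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._history = defaultdict(deque)  # container name -> restart times

    def allow(self, container, now=None):
        """Record and permit a restart, or return False if over the limit."""
        now = time.monotonic() if now is None else now
        history = self._history[container]
        # Drop restarts that have aged out of the window.
        while history and now - history[0] >= self.window_s:
            history.popleft()
        if len(history) >= self.max_restarts:
            return False
        history.append(now)
        return True
```

A container that keeps failing after three restarts is then left alone until the oldest restart ages out, which keeps a crash loop from turning into a restart loop.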

diagnose Diagnose ~€0.01–0.05/incident

Everything in observe, plus AI-powered root cause analysis when multiple alerts fire together. Uses Claude Haiku with a monthly budget cap.

Observes

● Everything from observe mode

Diagnoses

🧠 AI root cause analysis on ≥ 2 correlated alerts (Claude Haiku)
🧠 Includes recent incident history in analysis context
🧠 Includes runbook/playbook in analysis prompt
🧠 Auto-downgrades to observe if budget exhausted (€5/mo cap)
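As a rough sketch of the diagnosis trigger and its inputs: the functions below treat "correlated" as "distinct alerts active in the same cycle", which is an assumption, and the prompt layout, `history_limit`, and function names are all hypothetical. The actual model call is deliberately left out.

```python
def should_diagnose(active_alerts, min_alerts=2):
    """True when at least `min_alerts` distinct alerts are active together."""
    return len(set(active_alerts)) >= min_alerts


def build_context(active_alerts, incidents, runbook, history_limit=5):
    """Assemble plain-text context: alerts, recent incidents, then the runbook."""
    recent = incidents[-history_limit:]
    lines = ["Active alerts:"]
    lines += [f"- {alert}" for alert in sorted(set(active_alerts))]
    lines.append("Recent incidents:")
    lines += [f"- {incident}" for incident in recent]
    lines.append("Runbook:")
    lines.append(runbook)
    return "\n".join(lines)
```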

Remediates

⚒ Everything from observe mode
⚒ Full disk prune on disk ≥ 90% (prune_disk.sh, full)
⚒ SLO verification after remediations (verify_slo.sh)

auto Auto ~€0.02–0.10/incident

Full closed loop. Everything in diagnose, plus aggressive remediations that involve restarting healthy-but-heavy containers to free resources.

Observes

● Everything from diagnose mode

Diagnoses

🧠 Everything from diagnose mode

Remediates

⚒ Everything from diagnose mode
⚒ Restart heaviest non-Caddy container on memory ≥ 90% (max 3/hr)
⚒ Full disk prune on disk ≥ 90%
⚒ SLO verification after remediations
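Choosing the "heaviest non-Caddy container" might be sketched as follows. The record shape (dicts with `name` and `mem_bytes`) and the `protected` tuple are assumptions; the page does not say how per-container memory is measured.

```python
def heaviest_restartable(containers, protected=("caddy",)):
    """Return the name of the largest-memory container that is not protected.

    `containers` is a list of dicts with hypothetical keys `name` and
    `mem_bytes`. Returns None when every container is protected.
    """
    candidates = [c for c in containers if c["name"] not in protected]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["mem_bytes"])["name"]
```

Excluding Caddy keeps the reverse proxy up while a heavy backend is restarted, so the restart does not take every site offline at once.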

SLO Thresholds

Metric	Threshold	Action	Min. Mode
Disk usage	≥ 85%	Light prune	observe
Disk usage	≥ 90%	Full prune + log cleanup	diagnose
Memory usage	≥ 90%	Restart heaviest container	auto
CPU sustained	≥ 95%	Alert only	observe
Container health	unhealthy > 2 min	Restart (3/hr limit)	observe
TLS cert expiry	< 7 days	Alert (Caddy auto-renews)	observe
Backup freshness	> 26 hours	Alert	observe
Correlated alerts	≥ 2 simultaneous	AI root cause analysis	diagnose
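The resource rows of this table can be read as rules with a minimum mode. The sketch below is a simplified, hypothetical encoding (metric keys, action names, and `actions_for` are not from the agent) that picks, per metric, the highest breached threshold the current mode is allowed to act on.

```python
from dataclasses import dataclass

MODE_RANK = {"off": 0, "observe": 1, "diagnose": 2, "auto": 3}


@dataclass(frozen=True)
class Rule:
    metric: str
    threshold: float
    action: str
    min_mode: str


# Mirrors the disk, memory, and CPU rows of the table above.
RULES = [
    Rule("disk_pct", 85, "light_prune", "observe"),
    Rule("disk_pct", 90, "full_prune", "diagnose"),
    Rule("memory_pct", 90, "restart_heaviest", "auto"),
    Rule("cpu_pct", 95, "alert_only", "observe"),
]


def actions_for(metrics, mode):
    """Return the sorted actions permitted for `metrics` in `mode`."""
    rank = MODE_RANK[mode]
    chosen = {}
    for rule in RULES:
        value = metrics.get(rule.metric)
        if value is None or value < rule.threshold:
            continue  # threshold not breached
        if MODE_RANK[rule.min_mode] > rank:
            continue  # current mode may not take this action
        prev = chosen.get(rule.metric)
        if prev is None or rule.threshold > prev.threshold:
            chosen[rule.metric] = rule
    return sorted(rule.action for rule in chosen.values())
```

For example, `actions_for({"disk_pct": 92}, "observe")` yields only the light prune, while the same reading in diagnose yields the full prune.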

Telegram Commands

Command	Description
`/sre status`	Current mode, budget, container health, resource usage
`/sre incidents`	Last 5 incidents with triggers and actions taken
`/sre off`	Switch to off mode
`/sre observe`	Switch to observe mode
`/sre diagnose`	Switch to diagnose mode
`/sre auto`	Switch to auto mode
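Parsing these commands is simple enough to sketch. The return shape (`("set_mode", ...)` or `("query", ...)`) and the function name are hypothetical, and the sketch ignores Telegram specifics such as bot-name suffixes in group chats.

```python
MODES = ("off", "observe", "diagnose", "auto")
QUERIES = ("status", "incidents")


def parse_sre_command(text):
    """Parse a message like '/sre observe' into an (kind, argument) pair.

    Returns None for anything that is not a recognised /sre command.
    """
    parts = text.strip().split()
    if len(parts) != 2 or parts[0] != "/sre":
        return None
    arg = parts[1].lower()
    if arg in MODES:
        return ("set_mode", arg)
    if arg in QUERIES:
        return ("query", arg)
    return None
```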

AI Budget

Claude Haiku API spend is capped at €5.00/month. The spend counter resets on the 1st of each month. When spend reaches 80% of the cap, a warning is sent via Telegram. At 100%, the agent auto-downgrades to observe mode.

Event	Action
Spend ≥ 80% of cap	Telegram warning
Spend ≥ 100% of cap	Auto-downgrade to observe mode + Telegram alert
New month	Spend resets to €0.00
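The budget rules above amount to a small state machine. The sketch below is one way to express it; the `Budget` class, the event names, and the choice to send the 80% warning only once per month are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Budget:
    cap_eur: float = 5.00
    warn_ratio: float = 0.80
    spent_eur: float = 0.0
    month: Optional[Tuple[int, int]] = None  # (year, month) of current period
    warned: bool = False

    def record(self, cost_eur, year, month):
        """Add one API call's cost and return the events it triggers."""
        if self.month != (year, month):
            # New calendar month: reset spend and the warning flag.
            self.month = (year, month)
            self.spent_eur = 0.0
            self.warned = False
        self.spent_eur += cost_eur
        events = []
        if self.spent_eur >= self.cap_eur:
            events.append("downgrade_to_observe")
        elif self.spent_eur >= self.warn_ratio * self.cap_eur and not self.warned:
            self.warned = True
            events.append("warn")
        return events
```

A production version would likely track money in integer cents rather than floats to avoid rounding drift near the thresholds.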

Data Sources

Source	Method	Interval
CPU / Memory	`/proc/stat`, `/proc/meminfo`	Every poll cycle (60s)
Disk	`df`	Every poll cycle
Containers	`docker/podman ps --format json`	Every poll cycle
Caddy metrics	HTTP GET `/metrics`	Every poll cycle
TLS certificates	Caddy admin API `/certificates`	Every poll cycle
Backups	`restic snapshots --json --latest 1`	Every poll cycle
Incidents	Local JSONL file	Appended on alert
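To make the first row concrete, here is a hedged sketch of turning `/proc/meminfo` and `/proc/stat` text into percentages. It assumes memory usage is derived from `MemAvailable` and that CPU usage is the busy share between two consecutive samples; the agent may compute these differently. Both functions take file contents as strings, so they can be tried without reading `/proc`.

```python
def memory_used_pct(meminfo_text):
    """Percent of memory in use, from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])  # value in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def parse_cpu_line(stat_text):
    """Return (total_jiffies, idle_jiffies) from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            # user nice system idle iowait irq softirq steal
            values = [int(v) for v in parts[1:9]]
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            return sum(values), idle
    raise ValueError("no aggregate cpu line found")


def cpu_busy_pct(prev, curr):
    """Busy percent between two (total, idle) samples from parse_cpu_line."""
    total_delta = curr[0] - prev[0]
    idle_delta = curr[1] - prev[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (total_delta - idle_delta) / total_delta
```

Because `/proc/stat` counters are cumulative since boot, a single sample says nothing about current load; the difference between two poll cycles is what a "sustained" CPU check would be built on.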