What the agent observes, diagnoses, and remediates in each operating mode
Each mode unlocks more steps in this pipeline. off only writes a heartbeat. observe adds monitoring and light fixes. diagnose adds AI analysis. auto enables the full closed loop.
off: Heartbeat only. The agent stays alive but does nothing else. Telegram commands are still processed, so you can switch modes remotely.
observe: Full monitoring with Telegram alerts and safe, pre-scripted remediations. No AI calls. This is the default mode.
diagnose: Everything in observe, plus AI-powered root cause analysis when multiple alerts fire together. Uses Claude Haiku with a monthly budget cap.
auto: Full closed loop. Everything in diagnose, plus aggressive remediations that involve restarting healthy-but-heavy containers to free resources.
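
Because each mode includes everything the previous one does, the four modes can be treated as an ordered scale. The following is a minimal sketch of that idea, not the agent's actual code; the names `Mode`, `allows`, and `parse_mode` are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class Mode(IntEnum):
    """Operating modes, ordered so that a higher value unlocks more steps."""
    OFF = 0
    OBSERVE = 1
    DIAGNOSE = 2
    AUTO = 3


def allows(current: Mode, required: Mode) -> bool:
    """Return True if an action that needs `required` may run in `current`."""
    return current >= required


def parse_mode(name: str) -> Mode:
    """Map a mode name such as "observe" to its Mode member (hypothetical helper)."""
    return Mode[name.strip().upper()]
```

With this ordering, a check such as `allows(Mode.OBSERVE, Mode.DIAGNOSE)` is false, which matches the rule that AI analysis is unavailable in observe mode.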
| Metric | Threshold | Action | Min. Mode |
|---|---|---|---|
| Disk usage | ≥ 85% | Light prune | observe |
| Disk usage | ≥ 90% | Full prune + log cleanup | diagnose |
| Memory usage | ≥ 90% | Restart heaviest container | auto |
| CPU sustained | ≥ 95% | Alert only | observe |
| Container health | unhealthy > 2 min | Restart (3/hr limit) | observe |
| TLS cert expiry | < 7 days | Alert (Caddy auto-renews) | observe |
| Backup freshness | > 26 hours | Alert | observe |
| Correlated alerts | ≥ 2 simultaneous | AI root cause analysis | diagnose |
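
The percentage-based rows of the table above can be read as rules gated by a minimum mode. The sketch below shows one way to evaluate them; it is an illustration under assumptions, not the agent's implementation. The metric keys (`disk_pct`, `mem_pct`, `cpu_pct`), the `Rule` type, and the choice that the highest permitted rule wins per metric (so a 92% disk reading falls back to a light prune in observe mode) are all assumptions.

```python
from dataclasses import dataclass

# Mode names from the text, ranked so a higher rank unlocks more actions.
MODE_RANK = {"off": 0, "observe": 1, "diagnose": 2, "auto": 3}


@dataclass(frozen=True)
class Rule:
    metric: str       # hypothetical metric key, e.g. "disk_pct"
    threshold: float  # rule fires when the reading is >= threshold
    action: str
    min_mode: str


# Only the percentage-based rows of the table are modelled here.
RULES = [
    Rule("disk_pct", 85.0, "light prune", "observe"),
    Rule("disk_pct", 90.0, "full prune + log cleanup", "diagnose"),
    Rule("mem_pct", 90.0, "restart heaviest container", "auto"),
    Rule("cpu_pct", 95.0, "alert only", "observe"),
]


def plan_actions(readings: dict[str, float], mode: str) -> list[str]:
    """Return at most one action per metric that is permitted in `mode`.

    Assumption: for each metric, the triggered rule with the highest
    threshold that the current mode allows is the one that runs.
    """
    best: dict[str, Rule] = {}
    for rule in RULES:
        value = readings.get(rule.metric)
        if value is None or value < rule.threshold:
            continue  # no reading, or threshold not reached
        if MODE_RANK[mode] < MODE_RANK[rule.min_mode]:
            continue  # current mode does not unlock this action
        current = best.get(rule.metric)
        if current is None or rule.threshold > current.threshold:
            best[rule.metric] = rule
    return [rule.action for rule in best.values()]
```

For example, `plan_actions({"disk_pct": 92, "mem_pct": 95}, "observe")` yields only the light prune, while the same readings in auto mode yield the full prune and the container restart.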
| Command | Description |
|---|---|
| /sre status | Current mode, budget, container health, resource usage |
| /sre incidents | Last 5 incidents with triggers and actions taken |
| /sre off | Switch to off mode |
| /sre observe | Switch to observe mode |
| /sre diagnose | Switch to diagnose mode |
| /sre auto | Switch to auto mode |
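
The commands above fall into two groups: mode switches and read-only queries. A parser for them might look like the sketch below. The function name `parse_sre_command` and the `(kind, argument)` return shape are hypothetical; how the agent actually receives and dispatches Telegram messages is not described in the source.

```python
MODES = ("off", "observe", "diagnose", "auto")
QUERIES = ("status", "incidents")


def parse_sre_command(text: str):
    """Parse a message into (kind, argument), or None if it is not valid.

    Returns ("set_mode", <mode>) for mode switches and ("query", <name>)
    for status/incidents. Anything else yields None.
    """
    parts = text.strip().split()
    if len(parts) != 2 or parts[0] != "/sre":
        return None
    arg = parts[1].lower()
    if arg in MODES:
        return ("set_mode", arg)
    if arg in QUERIES:
        return ("query", arg)
    return None
```

Keeping the parser strict (exactly two tokens, known arguments only) means an unrecognized message is ignored rather than changing the operating mode by accident.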
Claude Haiku API spend is capped at €5.00 per month, and the counter resets on the 1st of each month. When spend reaches 80% of the cap, a warning is sent via Telegram. At 100%, the agent automatically downgrades to observe mode.
| Event | Action |
|---|---|
| Spend ≥ 80% of budget | Telegram warning |
| Spend ≥ 100% of budget | Auto-downgrade to observe mode + Telegram alert |
| New month | Spend resets to €0.00 |
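
The budget rules above amount to a small state machine: accumulate spend, warn at 80%, downgrade at 100%, and reset when the month changes. The sketch below illustrates this under assumptions. The `Budget` class, its field names, the event strings, and the choice to track money in integer cents (to avoid floating-point rounding at the thresholds) are all hypothetical, and sending the warning only once per month is also an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Budget:
    year: int
    month: int
    cap_cents: int = 500     # €5.00 monthly cap from the text
    spent_cents: int = 0
    warned: bool = False     # assumption: warn only once per month

    def record(self, cost_cents: int, today: date) -> list[str]:
        """Add one API call's cost and return the events it triggers."""
        events = []
        if (today.year, today.month) != (self.year, self.month):
            # New month: start a fresh accounting period.
            self.year, self.month = today.year, today.month
            self.spent_cents = 0
            self.warned = False
            events.append("reset")
        self.spent_cents += cost_cents
        if self.spent_cents >= self.cap_cents:
            events.append("downgrade_to_observe")
        elif self.spent_cents * 100 >= self.cap_cents * 80 and not self.warned:
            self.warned = True
            events.append("warn")
        return events
```

A caller would translate the returned events into a Telegram message or a mode change. Once the agent is back in observe mode it makes no AI calls, so no further cost is recorded until the mode is raised again.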
| Source | Method | Interval |
|---|---|---|
| CPU / Memory | /proc/stat, /proc/meminfo | Every poll cycle (60s) |
| Disk | df | Every poll cycle |
| Containers | docker/podman ps --format json | Every poll cycle |
| Caddy metrics | HTTP GET /metrics | Every poll cycle |
| TLS certificates | Caddy admin API /certificates | Every poll cycle |
| Backups | restic snapshots --json --latest 1 | Every poll cycle |
| Incidents | Local JSONL file | Appended on alert |
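
As an illustration of the `/proc/meminfo` row above, the sketch below derives a memory usage percentage from that file's contents. It assumes usage is computed from the `MemTotal` and `MemAvailable` fields; the source does not say which fields the agent reads, and the function names are hypothetical.

```python
def memory_used_percent(meminfo_text: str) -> float:
    """Compute used memory as a percentage from /proc/meminfo contents.

    Assumption: "used" means MemTotal minus MemAvailable (both in kB).
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def read_memory_used_percent(path: str = "/proc/meminfo") -> float:
    """Read the file once per poll cycle and return the usage percentage."""
    with open(path, encoding="ascii") as handle:
        return memory_used_percent(handle.read())
```

Separating the parsing from the file read keeps the calculation testable with a fixed string, without needing a Linux host.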