System health¶
Since v0.17.3
Adds a pipeline-health endpoint, a system-tab banner, a sidebar status LED, a global red banner, and an email alarm when red is sustained for more than 30 minutes.
What it shows¶
/admin/system → System tab shows a live view of the pipeline:
- Stream backlog: how many messages in the Redis stream are still unprocessed
- Insert rate: how many check results are inserted per second right now
- DB free + disk free: free space on the Postgres data disk and on the host disk
- AI provider: responding, or in an error state
- Worker pool: how many workers are running and how far each one lags
Plus historical sparklines (last 60 minutes).
Endpoint¶
{
  "stream_pending": 23,
  "insert_rate_per_s": 178,
  "db_disk_free_gb": 142.4,
  "host_disk_free_gb": 87.1,
  "ai_provider": { "name": "ollama", "ok": true, "latency_ms": 45 },
  "workers": [
    { "id": 0, "lag_s": 2, "active": true },
    { "id": 1, "lag_s": 4, "active": true }
  ],
  "status": "green",
  "evaluated_at": "2026-04-25T10:15:30Z"
}
Responses are cached server-side for 5 s, so polling more often than that is cheap: repeated calls within the window return the cached snapshot.
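A client that consumes this payload might reduce it to a one-line summary for logs or a status line. The following is a minimal Python sketch under stated assumptions: the helper name `summarize_health` is hypothetical, and it operates on an already-fetched snapshot because the endpoint path and authentication are not given here. Only the field names come from the sample response above.

```python
import json


def summarize_health(snapshot: dict) -> str:
    """Reduce a health snapshot to a one-line summary (hypothetical helper)."""
    workers = snapshot.get("workers", [])
    # Count workers that report themselves as active.
    active = sum(1 for w in workers if w.get("active"))
    # Worst lag across all workers; 0 if the list is empty.
    max_lag = max((w.get("lag_s", 0) for w in workers), default=0)
    ai = snapshot.get("ai_provider", {})
    ai_state = "ok" if ai.get("ok") else "error"
    return (
        f"status={snapshot.get('status')} "
        f"pending={snapshot.get('stream_pending')} "
        f"workers={active}/{len(workers)} max_lag={max_lag}s "
        f"ai={ai.get('name')}:{ai_state}"
    )


# Sample payload copied from the response shown above.
SAMPLE = json.loads("""
{
  "stream_pending": 23,
  "insert_rate_per_s": 178,
  "db_disk_free_gb": 142.4,
  "host_disk_free_gb": 87.1,
  "ai_provider": { "name": "ollama", "ok": true, "latency_ms": 45 },
  "workers": [
    { "id": 0, "lag_s": 2, "active": true },
    { "id": 1, "lag_s": 4, "active": true }
  ],
  "status": "green",
  "evaluated_at": "2026-04-25T10:15:30Z"
}
""")

print(summarize_health(SAMPLE))
```

Because of the 5 s server-side cache, a poller built around a helper like this gains nothing from intervals shorter than 5 s.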
Status colors¶
flowchart LR
    G[green<br/>all ok] --> Y[yellow<br/>individual values over threshold]
    Y --> R[red<br/>sustained, >5 min]
    R --> R2[red + alarm<br/>>30 min]
Important: red is shown only when the bad value is sustained for 5 minutes. A single spike changes the color briefly, not permanently.
The evaluation is trend-based, so short load spikes do not make the status flicker.
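The escalation from yellow to red to red + alarm can be pictured as a small timer-based state machine. The sketch below is an illustration under assumptions, not the server's actual evaluator: the class `StatusTracker` and its constants are hypothetical, it collapses all metrics into a single "over threshold" boolean, and it treats "sustained" as "continuously over threshold since the first breach".

```python
from typing import Optional

RED_AFTER_S = 5 * 60      # breach must last longer than 5 min before red
ALARM_AFTER_S = 30 * 60   # breach must last longer than 30 min before the alarm


class StatusTracker:
    """Tracks how long a breach has lasted (hypothetical sketch)."""

    def __init__(self) -> None:
        # Timestamp (seconds) of the first sample of the current breach.
        self.breach_since: Optional[float] = None

    def update(self, breached: bool, now: float) -> str:
        if not breached:
            # Any healthy sample resets the timer.
            self.breach_since = None
            return "green"
        if self.breach_since is None:
            self.breach_since = now
        elapsed = now - self.breach_since
        if elapsed > ALARM_AFTER_S:
            return "red+alarm"
        if elapsed > RED_AFTER_S:
            return "red"
        # Over threshold, but not yet sustained.
        return "yellow"
```

With this shape, a spike that clears within 5 minutes only ever produces yellow, which matches the no-flicker behavior described above.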
Thresholds¶
Defaults:
| Metric | Yellow at | Red at |
|---|---|---|
| Stream pending | > 1 000 | > 5 000 (sustained 5 min) |
| Insert-rate drop | < 50 % of average | < 25 % (sustained) |
| DB free | < 20 % | < 10 % |
| Disk free | < 15 % | < 5 % |
| AI latency | > 10 s | > 30 s |
| Worker lag | > 30 s | > 120 s (sustained) |
Configurable in system_settings.
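To make the table concrete, the following Python sketch classifies a single reading against the default thresholds. It is an assumption-laden illustration: the metric keys (`stream_pending`, `insert_rate_pct`, and so on) and the function `classify` are hypothetical and are not the actual keys in system_settings. It also returns the instantaneous level only and ignores the "sustained" qualifiers.

```python
# Default thresholds from the table above: (direction, yellow, red).
# "above" means a higher value is worse; "below" means a lower value is worse.
DEFAULTS = {
    "stream_pending": ("above", 1_000, 5_000),
    "insert_rate_pct": ("below", 50, 25),   # percent of the average insert rate
    "db_free_pct": ("below", 20, 10),
    "disk_free_pct": ("below", 15, 5),
    "ai_latency_s": ("above", 10, 30),
    "worker_lag_s": ("above", 30, 120),
}


def classify(metric: str, value: float, thresholds: dict = DEFAULTS) -> str:
    """Return 'green', 'yellow', or 'red' for one reading (hypothetical helper)."""
    direction, yellow, red = thresholds[metric]
    if direction == "above":
        if value > red:
            return "red"
        if value > yellow:
            return "yellow"
    else:
        if value < red:
            return "red"
        if value < yellow:
            return "yellow"
    return "green"


print(classify("stream_pending", 23))   # well under the yellow threshold
print(classify("db_free_pct", 15))      # under 20 %, not yet under 10 %
```

The comparisons are strict, as in the table, so a value exactly at a threshold stays at the lower severity.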
Banner¶
Sidebar LED¶
A small status lamp at the top of the sidebar shows green, yellow, or red. Hovering over it shows the status text.
Global banner on red¶
When the status flips to red (sustained), a banner appears at the top of the frontend ("Pipeline overloaded — stream backlog growing for 7 min"). The banner can be dismissed, but it returns until the issue is fixed.
It behaves analogously to CriticalUpdateBanner.
Email alarm¶
When red is sustained for more than 30 minutes:
- Email to all super admins
- Cooldown 6 h (no spam if the issue persists)
- Background task: health_alerts.py
The email contains the current snapshot and a link to the System tab.
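The interaction between the 30-minute trigger and the 6-hour cooldown can be sketched as a single decision function. This is not the content of health_alerts.py: the function `should_send_alarm` and its parameters are hypothetical, and the sketch assumes the caller tracks when red began and when the last alarm went out.

```python
from datetime import datetime, timedelta
from typing import Optional

ALARM_AFTER = timedelta(minutes=30)  # red must be sustained longer than this
COOLDOWN = timedelta(hours=6)        # minimum gap between two alarm emails


def should_send_alarm(
    red_since: Optional[datetime],
    last_sent: Optional[datetime],
    now: datetime,
) -> bool:
    """Decide whether to send an alarm email now (hypothetical sketch)."""
    if red_since is None:
        # Status is not red at all.
        return False
    if now - red_since <= ALARM_AFTER:
        # Red, but not yet sustained for more than 30 minutes.
        return False
    if last_sent is not None and now - last_sent < COOLDOWN:
        # An alarm already went out within the cooldown window.
        return False
    return True
```

Under these assumptions an incident that stays red for a full day produces one email roughly every 6 hours rather than one per evaluation cycle.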
What to do on red¶
| Symptom | Possible cause | Action |
|---|---|---|
| Stream pending growing monotonically | too few workers | raise WORKER_REPLICAS (see Scaling) |
| Insert rate dropping | DB bottleneck, index lock | check DB logs, kill long-running queries |
| DB free low | retention too long | enable TimescaleDB compression, lower retention |
| Disk free low | log volume | check log source filters |
| AI latency high | provider load | smaller model, swap provider |
| Worker lag high | receiver burst, DB locks | check the DB connection pool and worker concurrency |
Details: System health, Scaling.
Setup wizard sizing¶
Since v0.17.3
Between the tenant and SMTP steps, the setup wizard asks for a sizing profile (Small / Medium / Large / XL). The choice is stored in system_settings.size_profile.
Currently the wizard sizing is advisory: the server does not yet read it to set defaults. Auto-tuning is phase 2 of scaling-ux-plan and ships later.
In the meantime, watch system health and scale manually as needed.
Performance baseline¶
Scaling test with 50 k checks on self-hosting defaults (1 worker, 256 MB shared_buffers):
- ~960 checks/s sustained
- p95 ≤ 150 ms
- 0 errors over 30 min soak
Details: scale-test/REPORT.md in the repo. Six bugs were fixed during the test; all fixes are in v0.17.x.