System health¶

Since v0.17.3

Pipeline-health endpoint, system-tab banner, sidebar status LED, global red banner, email alarm on sustained red > 30 min.

What it shows¶

/admin/system → System tab shows a live view of the pipeline:

Stream backlog — how many messages are unprocessed in the Redis stream?
Insert rate — how many check results per second right now?
DB free + disk free — Postgres data disk
AI provider — responding, or error status
Worker pool — how many workers run, what's the lag

Plus historical sparklines (last 60 minutes).

Endpoint¶

curl -H "Authorization: Bearer <ADMIN_JWT>" \
  https://your-domain.tld/api/v1/admin/health/snapshot

{
  "stream_pending": 23,
  "insert_rate_per_s": 178,
  "db_disk_free_gb": 142.4,
  "host_disk_free_gb": 87.1,
  "ai_provider": { "name": "ollama", "ok": true, "latency_ms": 45 },
  "workers": [
    { "id": 0, "lag_s": 2, "active": true },
    { "id": 1, "lag_s": 4, "active": true }
  ],
  "status": "green",
  "evaluated_at": "2026-04-25T10:15:30Z"
}

Server-side cache: 5 s. More frequent calls are cheap.

Status colors¶

flowchart LR
    G[green<br/>all ok] --> Y[yellow<br/>individual values over threshold]
    Y --> R[red<br/>sustained, >5 min]
    R --> R2[red + alarm<br/>>30 min]

Important: red is shown only when the bad value is sustained for 5 minutes. Single spikes only briefly color, not permanently.

Trend-based — no flicker on short load spikes.

Thresholds¶

Defaults:

Metric	Yellow at	Red at
Stream pending	> 1 000	> 5 000 (sustained 5 min)
Insert-rate drop	< 50 % of average	< 25 % (sustained)
DB free	< 20 %	< 10 %
Disk free	< 15 %	< 5 %
AI latency	> 10 s	> 30 s
Worker lag	> 30 s	> 120 s (sustained)

Configurable in system_settings.

Small status lamp at the top of the sidebar — green, yellow, or red. Mouse-over shows status text.

When status flips to red (sustained), a banner appears on top of the frontend („Pipeline overloaded — stream backlog growing for 7 min"). Banner is persistent, dismissible, returns until the issue is fixed.

Analogous to CriticalUpdateBanner.

Email alarm¶

On sustained-red > 30 minutes:

Email to all super admins
Cooldown 6 h (no spam if the issue persists)
Background task health_alerts.py

Email contains the current snapshot + link to system tab.

What to do on red¶

Symptom	Possible cause	Action
Stream pending growing monotonically	too few workers	raise `WORKER_REPLICAS`, scaling
Insert rate dropping	DB bottleneck, index lock	check DB logs, kill long-running queries
DB free low	retention too long	enable TimescaleDB compression, lower retention
Disk free low	log volume	check log source filters
AI latency high	provider load	smaller model, swap provider
Worker lag high	receiver burst, DB locks	DB connection pool, worker concurrency

Details: System health, Scaling.

Setup wizard sizing¶

Since v0.17.3

Setup wizard asks between tenant and SMTP for sizing (Small / Medium / Large / XL). Stored in system_settings.size_profile.

Currently the wizard sizing is advisory — server doesn't actively read it to set defaults yet. Auto-tuning is phase 2 of scaling-ux-plan, ships later.

In the meantime: watch system health, scale manually as needed.

Performance baseline¶

50 k checks scaling test (on self-hosting defaults: 1 worker, 256 MB shared_buffers):

~960 checks/s sustained
p95 ≤ 150 ms
0 errors over 30 min soak

Details: scale-test/REPORT.md in the repo. Six bugs fixed during the test, all in v0.17.x.

Next¶

Scaling — when the system pushes back
Updates — update mechanism uses health check too

System health¶

What it shows¶

Endpoint¶

Status colors¶

Thresholds¶

Banner¶

Sidebar LED¶

Global banner on red¶