
System health

Since v0.17.3

Covers the pipeline-health endpoint, the System tab banner, the sidebar status LED, the global red banner, and the email alarm sent when red is sustained for more than 30 minutes.

What it shows

/admin/system → System tab shows a live view of the pipeline:

  • Stream backlog — how many messages are unprocessed in the Redis stream?
  • Insert rate — how many check results per second right now?
  • DB free + disk free — free space on the Postgres data disk and on the host disk
  • AI provider — responding, or error status
  • Worker pool — how many workers run, what's the lag

Plus historical sparklines (last 60 minutes).

Endpoint

curl -H "Authorization: Bearer <ADMIN_JWT>" \
  https://your-domain.tld/api/v1/admin/health/snapshot

Response:

{
  "stream_pending": 23,
  "insert_rate_per_s": 178,
  "db_disk_free_gb": 142.4,
  "host_disk_free_gb": 87.1,
  "ai_provider": { "name": "ollama", "ok": true, "latency_ms": 45 },
  "workers": [
    { "id": 0, "lag_s": 2, "active": true },
    { "id": 1, "lag_s": 4, "active": true }
  ],
  "status": "green",
  "evaluated_at": "2026-04-25T10:15:30Z"
}

The snapshot is cached server-side for 5 s, so polling more often than that is cheap.
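As a rough illustration of consuming the endpoint from a script, the sketch below builds the same authenticated request as the curl call and condenses a snapshot into one log line. The helper names (`build_request`, `fetch_snapshot`, `summarize`) and the summary format are assumptions made for this example, not part of the product; only the URL path, the header, and the JSON fields come from the example above.

```python
import json
import urllib.request

# Placeholder base URL taken from the curl example; replace with your own.
BASE_URL = "https://your-domain.tld"


def build_request(token: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the authenticated GET request for the snapshot endpoint."""
    return urllib.request.Request(
        f"{base_url}/api/v1/admin/health/snapshot",
        headers={"Authorization": f"Bearer {token}"},
    )


def fetch_snapshot(token: str, timeout: float = 10.0) -> dict:
    """Fetch and decode one snapshot (performs a network call)."""
    with urllib.request.urlopen(build_request(token), timeout=timeout) as resp:
        return json.load(resp)


def summarize(snapshot: dict) -> str:
    """Condense a snapshot into a one-line summary (hypothetical format)."""
    workers = snapshot.get("workers", [])
    max_lag = max((w["lag_s"] for w in workers), default=0)
    active = sum(1 for w in workers if w.get("active"))
    return (
        f"status={snapshot['status']} "
        f"pending={snapshot['stream_pending']} "
        f"workers={active}/{len(workers)} max_lag_s={max_lag}"
    )
```

Calling `summarize(fetch_snapshot(token))` from a cron job or monitoring script would be one way to log the pipeline state. Given the 5 s server-side cache, polling more often than every few seconds yields no fresher data.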

Status colors

flowchart LR
    G[green<br/>all ok] --> Y[yellow<br/>individual values over threshold]
    Y --> R[red<br/>sustained, >5 min]
    R --> R2[red + alarm<br/>>30 min]

Important: red is shown only once a bad value has been sustained for 5 minutes. A single spike changes the color briefly, not permanently.

The evaluation is trend-based, so short load spikes do not cause flicker.
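The escalation in the flowchart can be read as a function of how long a value has been over its threshold. The following is a simplified sketch of that idea, not the server's actual evaluation logic: the function name, the `"red+alarm"` label, and the single "over threshold since" timestamp are assumptions for illustration.

```python
from typing import Optional

RED_AFTER_S = 5 * 60      # sustained for more than 5 min -> red
ALARM_AFTER_S = 30 * 60   # sustained for more than 30 min -> red + alarm


def status_for(over_since: Optional[float], now: float) -> str:
    """Map the time spent over threshold to a status color.

    over_since is the timestamp (in seconds) at which a value first went
    over its threshold, or None if all values are within limits.
    """
    if over_since is None:
        return "green"
    elapsed = now - over_since
    if elapsed > ALARM_AFTER_S:
        return "red+alarm"
    if elapsed > RED_AFTER_S:
        return "red"
    # Over threshold, but not yet sustained: a short spike stays yellow.
    return "yellow"
```

In this model a spike that clears within 5 minutes resets `over_since` to `None`, so the status returns to green without ever reaching red.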

Thresholds

Defaults:

| Metric | Yellow at | Red at |
|---|---|---|
| Stream pending | > 1 000 | > 5 000 (sustained 5 min) |
| Insert-rate drop | < 50 % of average | < 25 % (sustained) |
| DB free | < 20 % | < 10 % |
| Disk free | < 15 % | < 5 % |
| AI latency | > 10 s | > 30 s |
| Worker lag | > 30 s | > 120 s (sustained) |

Configurable in system_settings.
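To make the default thresholds concrete, here is a minimal sketch that classifies a single instantaneous reading against them. The metric keys and the `classify` helper are hypothetical names chosen for this example, and the "sustained" condition on red is deliberately left out, so this shows only the per-value comparison.

```python
# Default thresholds from the table above, as (direction, yellow, red).
# "higher": larger values are worse; "lower": smaller values are worse.
# The metric keys are hypothetical names for this sketch.
THRESHOLDS = {
    "stream_pending": ("higher", 1_000, 5_000),
    "insert_rate_pct_of_avg": ("lower", 50, 25),
    "db_free_pct": ("lower", 20, 10),
    "disk_free_pct": ("lower", 15, 5),
    "ai_latency_s": ("higher", 10, 30),
    "worker_lag_s": ("higher", 30, 120),
}


def classify(metric: str, value: float) -> str:
    """Classify one reading; ignores the 'sustained' requirement for red."""
    direction, yellow, red = THRESHOLDS[metric]
    if direction == "higher":
        if value > red:
            return "red"
        if value > yellow:
            return "yellow"
    else:
        if value < red:
            return "red"
        if value < yellow:
            return "yellow"
    return "green"
```

For example, `classify("stream_pending", 1500)` falls in the yellow band, while `classify("db_free_pct", 8)` falls in the red band.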

Sidebar status LED

A small status lamp at the top of the sidebar shows green, yellow, or red. Hovering over it shows the status text.

Global banner on red

When the status flips to red (sustained), a banner appears at the top of the frontend ("Pipeline overloaded — stream backlog growing for 7 min"). The banner is dismissible, but it reappears until the issue is fixed.

Analogous to CriticalUpdateBanner.

Email alarm

When red has been sustained for more than 30 minutes:

  • Email to all super admins
  • Cooldown 6 h (no spam if the issue persists)
  • Background task health_alerts.py

The email contains the current snapshot and a link to the System tab.
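The interaction between the 30-minute trigger and the 6-hour cooldown can be sketched as a small decision function. This is an illustration under stated assumptions, not the contents of health_alerts.py: the function name and its timestamp arguments are hypothetical.

```python
from typing import Optional

ALARM_AFTER_S = 30 * 60      # red must be sustained for more than 30 min
COOLDOWN_S = 6 * 60 * 60     # at most one email per 6 h


def should_send_alarm(
    red_since: Optional[float],
    last_sent: Optional[float],
    now: float,
) -> bool:
    """Decide whether an alarm email is due (timestamps in seconds)."""
    if red_since is None:
        return False  # not red at all
    if now - red_since <= ALARM_AFTER_S:
        return False  # red, but not yet sustained long enough
    if last_sent is not None and now - last_sent < COOLDOWN_S:
        return False  # still inside the cooldown window
    return True
```

Under this model, an issue that stays red for a full day would produce an email roughly every 6 hours rather than on every evaluation.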

What to do on red

| Symptom | Possible cause | Action |
|---|---|---|
| Stream pending growing monotonically | too few workers | raise WORKER_REPLICAS, scaling |
| Insert rate dropping | DB bottleneck, index lock | check DB logs, kill long-running queries |
| DB free low | retention too long | enable TimescaleDB compression, lower retention |
| Disk free low | log volume | check log source filters |
| AI latency high | provider load | smaller model, swap provider |
| Worker lag high | receiver burst, DB locks | DB connection pool, worker concurrency |

Details: System health, Scaling.
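To tell a backlog that grows monotonically from one that merely fluctuates, a script can compare successive `stream_pending` values collected from the snapshot endpoint. The helper below is a minimal sketch; its name and the `min_samples` default are assumptions for this example.

```python
from typing import Sequence


def is_growing_monotonically(samples: Sequence[int], min_samples: int = 3) -> bool:
    """Return True if every sample is strictly larger than the previous one.

    samples holds stream_pending readings in chronological order. Fewer
    than min_samples readings are treated as inconclusive.
    """
    if len(samples) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))
```

A series such as `[1200, 1900, 2600, 3400]` would count as growing, which points toward the "too few workers" row; a series that dips at any point would not.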

Setup wizard sizing

Since v0.17.3

Between the tenant and SMTP steps, the setup wizard asks for a sizing profile (Small / Medium / Large / XL). The choice is stored in system_settings.size_profile.

The wizard sizing is currently advisory: the server does not yet read it to set defaults. Auto-tuning is phase 2 of scaling-ux-plan and ships later.

In the meantime, watch system health and scale manually as needed.

Performance baseline

50 k checks scaling test (on self-hosting defaults: 1 worker, 256 MB shared_buffers):

  • ~960 checks/s sustained
  • p95 ≤ 150 ms
  • 0 errors over 30 min soak

Details: scale-test/REPORT.md in the repo. Six bugs were fixed during the test, all in v0.17.x.

Next

  • Scaling — when the system pushes back
  • Updates — update mechanism uses health check too