Anomaly detection¶
Thresholds are static — >80 % CPU = WARN applies 24/7. That ignores that your server never sees more than 5 % on Sunday at 04:00 and is normally at 70 % weekday afternoons.
Anomaly detection learns these patterns and alerts when a value deviates significantly for the current time slot.
Concept¶
Per service a baseline is computed:
- Observation window: 28 days
- Bucket size: 1 hour × 7 days = 168 buckets
- Per bucket: mean + standard deviation
flowchart LR
H[28d history] --> B1[Bucket Mon 00:00<br/>μ=12, σ=3]
H --> B2[Bucket Mon 01:00<br/>μ=10, σ=2]
H --> Bn[... 168 buckets]
LIVE[Live value] --> Z[Z-score = (x - μ) / σ]
Bn --> Z
Z -->|"|z| > threshold"| A[Anomaly event]
Z-score: how many standard deviations does the current value deviate from the historic mean of the same weekday-hour bucket?
| Sensitivity | Z threshold | When |
|---|---|---|
| Low | 4 | only extreme outliers |
| Medium | 3 | default, ~99.7 % confidence |
| High | 2 | many hits, ~95 % confidence |
Prerequisites¶
- At least 7 days of data — earlier no useful baseline
- Continuous values —
gaugetypes like CPU/memory/network. Not forstatus(binary) orinfo(string).
Baseline calculation¶
Background job runs daily at 03:30 UTC and refreshes the baseline per service:
- Fetch last 28 days values
- Per bucket (weekday × hour) compute mean + standard deviation
- Store in
metric_baselines(migration 040)
The baseline rolls — new data in, old out.
Anomaly event¶
Live evaluation runs every 5 minutes:
- Fetch latest value
- Determine current bucket
- Compute z-score
- If
|z| > sensitivity_threshold→anomaly_eventsentry
Anomaly events appear:
- In the host detail anomaly section
- In the dashboard widget „Anomaly events"
- As push, if a configured channel exists
False-positive feedback¶
Each anomaly event has a „False alarm" button. Click → marked as FP, weighted lower in future calculations.
In the UI as small thumbs-down icons.
Predictions¶
Linear regression on 28-day trend → prediction when a threshold is crossed.
| Output | Meaning |
|---|---|
predicted_breach_date |
Date when value crosses crit (NULL if not in next 90 days) |
confidence |
R² of regression — low = trend too volatile, ignore |
slope_per_day |
Change rate |
Visible in dashboard widget „Predictions" with sort by urgency:
🔴 sw-storage-vol1 — 100% disk in 3 days
🟠 backup01 /var — 90% disk in 12 days
🟡 prod-pgsql RAM — 85% RAM in 30 days
Useful for capacity planning — you don't see a full disk first when it's full, but weeks before.
Configuration¶
Host detail → Anomaly:
- Set sensitivity per service (default medium)
- Pause / resume anomaly detection per service
- Trigger manual re-baseline (after major change on the machine)
What it can't¶
- Seasonal effects beyond 1 week — month-end load, Black Friday: not detected. 28d window is too short.
- Step changes — when load fundamentally shifts (migration to new service), the baseline takes 28 days to learn. Workaround: manual re-baseline.
- Multivariate patterns — „CPU high AND memory high AND disk-IO high" not recognized as combined anomaly. Per service separately.
Next¶
- Custom dashboards → Widgets — Anomaly + Predictions widgets
- Alert rules — bring anomalies into notifications