Anomaly detection¶

Thresholds are static — >80 % CPU = WARN applies 24/7. That ignores that your server never sees more than 5 % on Sunday at 04:00 and is normally at 70 % weekday afternoons.

Anomaly detection learns these patterns and alerts when a value deviates significantly for the current time slot.

Concept¶

Per service a baseline is computed:

Observation window: 28 days
Bucket size: 1 hour × 7 days = 168 buckets
Per bucket: mean + standard deviation

flowchart LR
    H[28d history] --> B1[Bucket Mon 00:00<br/>μ=12, σ=3]
    H --> B2[Bucket Mon 01:00<br/>μ=10, σ=2]
    H --> Bn[... 168 buckets]
    LIVE[Live value] --> Z[Z-score = (x - μ) / σ]
    Bn --> Z
    Z -->|"|z| > threshold"| A[Anomaly event]

Z-score: how many standard deviations does the current value deviate from the historic mean of the same weekday-hour bucket?

Sensitivity	Z threshold	When
Low	4	only extreme outliers
Medium	3	default, ~99.7 % confidence
High	2	many hits, ~95 % confidence

Prerequisites¶

At least 7 days of data — earlier no useful baseline
Continuous values — gauge types like CPU/memory/network. Not for status (binary) or info (string).

Baseline calculation¶

Background job runs daily at 03:30 UTC and refreshes the baseline per service:

Fetch last 28 days values
Per bucket (weekday × hour) compute mean + standard deviation
Store in metric_baselines (migration 040)

The baseline rolls — new data in, old out.

Anomaly event¶

Live evaluation runs every 5 minutes:

Fetch latest value
Determine current bucket
Compute z-score
If |z| > sensitivity_threshold → anomaly_events entry

Anomaly events appear:

In the host detail anomaly section
In the dashboard widget „Anomaly events"
As push, if a configured channel exists

False-positive feedback¶

Each anomaly event has a „False alarm" button. Click → marked as FP, weighted lower in future calculations.

In the UI as small thumbs-down icons.

Predictions¶

Linear regression on 28-day trend → prediction when a threshold is crossed.

Output	Meaning
`predicted_breach_date`	Date when value crosses crit (NULL if not in next 90 days)
`confidence`	R² of regression — low = trend too volatile, ignore
`slope_per_day`	Change rate

Visible in dashboard widget „Predictions" with sort by urgency:

🔴 sw-storage-vol1 — 100% disk in 3 days
🟠 backup01 /var   —  90% disk in 12 days
🟡 prod-pgsql RAM  —  85% RAM in 30 days

Useful for capacity planning — you don't see a full disk first when it's full, but weeks before.

Configuration¶

Host detail → Anomaly:

Set sensitivity per service (default medium)
Pause / resume anomaly detection per service
Trigger manual re-baseline (after major change on the machine)

What it can't¶

Seasonal effects beyond 1 week — month-end load, Black Friday: not detected. 28d window is too short.
Step changes — when load fundamentally shifts (migration to new service), the baseline takes 28 days to learn. Workaround: manual re-baseline.
Multivariate patterns — „CPU high AND memory high AND disk-IO high" not recognized as combined anomaly. Per service separately.

Next¶

Custom dashboards → Widgets — Anomaly + Predictions widgets
Alert rules — bring anomalies into notifications