Skip to content

Anomaly detection

Thresholds are static — >80 % CPU = WARN applies 24/7. That ignores that your server never sees more than 5 % on Sunday at 04:00 and is normally at 70 % weekday afternoons.

Anomaly detection learns these patterns and alerts when a value deviates significantly for the current time slot.

Concept

Per service a baseline is computed:

  • Observation window: 28 days
  • Bucket size: 1 hour × 7 days = 168 buckets
  • Per bucket: mean + standard deviation
flowchart LR
    H[28d history] --> B1[Bucket Mon 00:00<br/>μ=12, σ=3]
    H --> B2[Bucket Mon 01:00<br/>μ=10, σ=2]
    H --> Bn[... 168 buckets]
    LIVE[Live value] --> Z[Z-score = (x - μ) / σ]
    Bn --> Z
    Z -->|"|z| > threshold"| A[Anomaly event]

Z-score: how many standard deviations does the current value deviate from the historic mean of the same weekday-hour bucket?

Sensitivity Z threshold When
Low 4 only extreme outliers
Medium 3 default, ~99.7 % confidence
High 2 many hits, ~95 % confidence

Prerequisites

  • At least 7 days of data — earlier no useful baseline
  • Continuous values — gauge types like CPU/memory/network. Not for status (binary) or info (string).

Baseline calculation

Background job runs daily at 03:30 UTC and refreshes the baseline per service:

  1. Fetch last 28 days values
  2. Per bucket (weekday × hour) compute mean + standard deviation
  3. Store in metric_baselines (migration 040)

The baseline rolls — new data in, old out.

Anomaly event

Live evaluation runs every 5 minutes:

  1. Fetch latest value
  2. Determine current bucket
  3. Compute z-score
  4. If |z| > sensitivity_thresholdanomaly_events entry

Anomaly events appear:

  • In the host detail anomaly section
  • In the dashboard widget „Anomaly events"
  • As push, if a configured channel exists

False-positive feedback

Each anomaly event has a „False alarm" button. Click → marked as FP, weighted lower in future calculations.

In the UI as small thumbs-down icons.

Predictions

Linear regression on 28-day trend → prediction when a threshold is crossed.

Output Meaning
predicted_breach_date Date when value crosses crit (NULL if not in next 90 days)
confidence R² of regression — low = trend too volatile, ignore
slope_per_day Change rate

Visible in dashboard widget „Predictions" with sort by urgency:

🔴 sw-storage-vol1 — 100% disk in 3 days
🟠 backup01 /var   —  90% disk in 12 days
🟡 prod-pgsql RAM  —  85% RAM in 30 days

Useful for capacity planning — you don't see a full disk first when it's full, but weeks before.

Configuration

Host detail → Anomaly:

  • Set sensitivity per service (default medium)
  • Pause / resume anomaly detection per service
  • Trigger manual re-baseline (after major change on the machine)

What it can't

  • Seasonal effects beyond 1 week — month-end load, Black Friday: not detected. 28d window is too short.
  • Step changes — when load fundamentally shifts (migration to new service), the baseline takes 28 days to learn. Workaround: manual re-baseline.
  • Multivariate patterns — „CPU high AND memory high AND disk-IO high" not recognized as combined anomaly. Per service separately.

Next