
Alert rules

An alert rule defines when and how a notification fires. Setting one up means answering five questions:

  1. Which hosts/services are covered? (filter)
  2. On what status / pattern does it fire? (trigger)
  3. Which channel gets the notification? (channels)
  4. In what order / after what wait? (escalation)
  5. Are simultaneous alerts grouped? (grouping)

Create a rule

Go to /alert-rules → New.

Filter

Which services?

Filter                          Example
Tenants                         "only Acme GmbH"
Hosts                           specific hostnames
Tags                            production, db
Profiles                        "only APC Smart-UPS"
Check type                      agent_disk, snmp_*
Service display name (regex)    ^Battery

If all filters are empty, the rule applies globally.
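
The filter semantics above can be sketched in a few lines of Python. This is an illustration, not the product's implementation. It assumes that all non-empty criteria must match (AND), that check-type patterns such as snmp_* are shell-style globs, that every listed tag must be present, and that the display-name regex is searched rather than fully matched. The names Service and RuleFilter and their fields are hypothetical.

```python
import re
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Service:
    tenant: str
    host: str
    tags: set[str]
    check_type: str
    display_name: str


@dataclass
class RuleFilter:
    # Empty criteria are skipped, so an entirely empty filter matches everything.
    tenants: set[str] = field(default_factory=set)
    tags: set[str] = field(default_factory=set)
    check_types: list[str] = field(default_factory=list)  # globs (assumed)
    display_name_regex: str = ""

    def matches(self, svc: Service) -> bool:
        if self.tenants and svc.tenant not in self.tenants:
            return False
        # Assumption: every listed tag must be present on the service.
        if self.tags and not self.tags <= svc.tags:
            return False
        if self.check_types and not any(
            fnmatchcase(svc.check_type, pat) for pat in self.check_types
        ):
            return False
        if self.display_name_regex and not re.search(
            self.display_name_regex, svc.display_name
        ):
            return False
        return True


ups = Service("Acme GmbH", "ups-01", {"production"}, "snmp_ups", "Battery status")
print(RuleFilter().matches(ups))  # empty filter: applies globally
print(RuleFilter(check_types=["agent_disk", "snmp_*"],
                 display_name_regex="^Battery").matches(ups))
```

Under these assumptions both calls print True, while a filter with tags {"production", "db"} would not match the example service because it lacks the db tag.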

Trigger

Trigger type                 Meaning
Status                       fires on WARN, CRIT, UNKNOWN, NO_DATA (multi-select)
Pattern (message regex)      regex on check_results.message
Threshold (value compare)    value > 80, value < 0.5
Anomaly                      when service has anomaly detection and |z| > sensitivity
Log pattern                  see Logs (separate rule type)

Status triggers fire on hard states only. A service that enters a soft state and recovers before the state becomes hard passes silently.
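
The anomaly trigger in the table compares |z| with the rule's sensitivity. A minimal sketch of that comparison follows. It assumes the baseline is the mean and standard deviation of recent values; how the product actually builds its baseline is not described here. The function names and the default sensitivity of 3.0 are hypothetical.

```python
import math
from statistics import fmean, pstdev


def z_score(value: float, history: list[float]) -> float:
    """Distance of value from the historical mean, in standard deviations."""
    mean = fmean(history)
    std = pstdev(history)
    if std == 0:
        # Flat baseline: any deviation at all is treated as infinitely unusual.
        return 0.0 if value == mean else math.copysign(math.inf, value - mean)
    return (value - mean) / std


def anomaly_fires(value: float, history: list[float], sensitivity: float = 3.0) -> bool:
    """True when |z| exceeds the sensitivity (default is an assumption)."""
    return abs(z_score(value, history)) > sensitivity


history = [10, 12, 11, 9, 10, 12, 11, 9]
print(anomaly_fires(20, history))  # far outside the usual range
print(anomaly_fires(11, history))  # within the usual range
```

A lower sensitivity makes the trigger fire on smaller deviations; a higher one only on pronounced outliers.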

Test button

Before saving, use Test to simulate a hit. It shows:

  • Which hosts/services would be affected
  • Which channels would receive the notification
  • Which escalation path would apply

This keeps your first alert wave from hitting 2 000 hosts at once.

Since v0.17.4

The Test button now works with tenant filters. In earlier versions it silently did nothing when a tenant filter was set.

Channels

You can attach any number of channels to a rule. When the rule fires, all of them are notified at once, or in stages if an escalation is configured.

Channel config: Notification channels.

Escalation

A multi-stage strategy for cases like "the operator misses it, so the manager should see it, then the team".

escalation:
  - after: 0min
    channels: [ops-email]
  - after: 15min
    channels: [ops-push, manager-email]
  - after: 60min
    channels: [oncall-pager, ceo-sms]

Stages are evaluated as long as the alert is active. Acknowledgement or recovery stops escalation.
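
The stage evaluation described above can be sketched as a pure function. This is an illustration under stated assumptions, not the worker's code: the Stage type, the due_stages function, and the idea of tracking already-sent stage indices are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    after_min: int
    channels: tuple[str, ...]


# Mirrors the escalation config shown above.
STAGES = [
    Stage(0, ("ops-email",)),
    Stage(15, ("ops-push", "manager-email")),
    Stage(60, ("oncall-pager", "ceo-sms")),
]


def due_stages(stages, elapsed_min, already_sent, acknowledged=False, recovered=False):
    """Return the indices of stages that should be notified now."""
    if acknowledged or recovered:
        return []  # acknowledgement or recovery stops escalation
    return [
        i for i, stage in enumerate(stages)
        if stage.after_min <= elapsed_min and i not in already_sent
    ]


print(due_stages(STAGES, 0, set()))                      # first stage only
print(due_stages(STAGES, 20, {0}))                       # second stage is now due
print(due_stages(STAGES, 90, {0, 1}, acknowledged=True)) # nothing after an ack
```

Keeping the function free of clocks and I/O makes the timing rules easy to check in isolation.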

Grouping

If 50 hosts go down at once, do you want 50 emails or a single "50 hosts critical" email? Grouping gives you the latter.

grouping:
  by: [tenant, profile]   # one mail per (tenant, profile)
  wait_seconds: 30        # collect for 30 s
  group_interval: 300     # don't re-group within 5 min

Grouping is implemented in the worker (AlertGrouper). The first match starts a timer, further matches within wait_seconds (30 s here) are collected, and then one combined payload is sent to the channels.
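
The collect-then-send behaviour can be illustrated with a small batching function. This is a sketch only and not the AlertGrouper code: it works on a pre-sorted list of alerts instead of live timers, it does not model group_interval, and the group_alerts name and the "ts" field are hypothetical.

```python
def group_alerts(alerts, by, wait_seconds=30):
    """Batch alerts that share the same `by` key and arrive within
    wait_seconds of the first alert of their batch.

    alerts: dicts with a numeric "ts" plus the fields named in `by`,
    sorted by "ts". Returns a list of (key, [alerts]) batches.
    """
    open_batches = {}  # key -> (window start timestamp, batch list)
    batches = []
    for alert in alerts:
        key = tuple(alert[name] for name in by)
        current = open_batches.get(key)
        if current is None or alert["ts"] - current[0] > wait_seconds:
            # First match for this key, or the window has closed: new batch.
            batch = [alert]
            open_batches[key] = (alert["ts"], batch)
            batches.append((key, batch))
        else:
            current[1].append(alert)
    return batches


alerts = [
    {"ts": 0, "tenant": "acme", "profile": "ups", "host": "a"},
    {"ts": 10, "tenant": "acme", "profile": "ups", "host": "b"},
    {"ts": 12, "tenant": "beta", "profile": "ups", "host": "c"},
    {"ts": 45, "tenant": "acme", "profile": "ups", "host": "d"},
]
for key, batch in group_alerts(alerts, by=["tenant", "profile"]):
    print(key, [a["host"] for a in batch])
```

With the sample data, hosts a and b end up in one batch, c in its own batch because the tenant differs, and d in a third batch because it arrives after the 30 s window.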

Pause / active

Each rule has an enabled toggle. A paused rule is not evaluated, but it is kept along with its configuration. Useful during migrations or maintenance waves.

Order / priority

Rules have a priority value. When several rules match, the one with the highest priority wins. This matters for mute patterns ("when this rule fires, suppress all others for the service").

Inhibition (short)

A rule can declare it suppresses other rules:

inhibits:
  - target_filter: { tag: 'critical-infra' }

In practice: "If the parent switch is CRIT, suppress alerts on dependent hosts." More: Dependencies & inhibition.
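
One way to read the inhibits declaration is as a check run before an alert is sent. The sketch below assumes that a firing rule suppresses alerts from other rules on services that carry the tag in its target_filter, and that a rule never inhibits itself. The suppressed function and the dictionary shapes are hypothetical; see Dependencies & inhibition for the actual behaviour.

```python
def suppressed(alert, firing_rules):
    """True if any other firing rule inhibits this alert's service."""
    for rule in firing_rules:
        if rule["name"] == alert["rule"]:
            continue  # assumption: a rule does not inhibit itself
        for inhibit in rule.get("inhibits", []):
            tag = inhibit["target_filter"].get("tag")
            if tag is not None and tag in alert["service_tags"]:
                return True
    return False


firing = [
    {"name": "Parent switch down",
     "inhibits": [{"target_filter": {"tag": "critical-infra"}}]},
    {"name": "Disk Full Critical"},
]
alert = {"rule": "Disk Full Critical", "service_tags": {"critical-infra", "db"}}
print(suppressed(alert, firing))
```

In this example the disk alert is suppressed because the switch rule is firing and the service carries the critical-infra tag.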

Log alert rules

A special variant for logs: regex pattern, threshold, absence. The UI is under /alert-rules/log.

Details: Logs → Log alert rules.

Recovery notifications

By default, a notification is also sent on recovery (hard → OK), for example "Service X is back to OK". This is configurable per channel.

Audit

Changes to alert rules are written to the audit log with a diff. To find out who lowered a threshold from 80 to 50, filter the audit log by target_kind = alert_rule.

Combined examples

Rule 1: disk full

name: Disk Full Critical
filter:
  check_type: agent_disk
trigger:
  threshold: value >= 95
channels: [ops-email, ops-push]
escalation:
  - { after: 0min, channels: [ops-email] }
  - { after: 30min, channels: [ops-push, manager-email] }
grouping:
  by: [host]
recovery_notify: true

Rule 2: cert expiring

name: TLS cert expiring soon
filter:
  check_type: ssl_certificate
trigger:
  status: [WARN, CRIT]
channels: [ops-email]
recovery_notify: false   # once renewed, no recovery spam

Rule 3: suspicious auth pattern

name: Brute force attempts
type: log_alert
mode: threshold
pattern: 'Failed password for'
window_minutes: 5
threshold: 20
channels: [security-team-email]
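
To make the threshold mode of Rule 3 concrete, here is a sliding-window counter in Python. It is a sketch, not the product's log engine: it assumes the rule fires once the number of matching lines within window_minutes reaches the threshold (at least 20 here), and the LogThresholdRule class and its observe method are hypothetical.

```python
import re
from collections import deque


class LogThresholdRule:
    def __init__(self, pattern, window_minutes, threshold):
        self.pattern = re.compile(pattern)
        self.window_s = window_minutes * 60
        self.threshold = threshold
        self.hits = deque()  # timestamps (seconds) of matching lines

    def observe(self, ts, line):
        """Feed one log line; return True if the rule fires on it."""
        if not self.pattern.search(line):
            return False
        self.hits.append(ts)
        # Drop hits that have fallen out of the sliding window.
        while self.hits and self.hits[0] <= ts - self.window_s:
            self.hits.popleft()
        return len(self.hits) >= self.threshold


rule = LogThresholdRule("Failed password for", window_minutes=5, threshold=20)
fired = [rule.observe(t, "Failed password for root from 10.0.0.1") for t in range(20)]
print(fired.index(True))  # index of the first line that trips the rule
```

Twenty failed logins one second apart trip the rule on the twentieth line; the same twenty spread over an hour would not.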
