Alert rules¶

An alert rule defines when and how a notification fires. Each rule answers five questions, in this order:

Which hosts/services are covered? (filter)
On what status / pattern does it fire? (trigger)
Which channels get the notification? (channels)
In what order / after what wait? (escalation)
Are simultaneous alerts grouped? (grouping)

Create a rule¶

/alert-rules → New.

Filter¶

Which services does the rule cover?

Filter	Example
Tenants	"only Acme GmbH"
Hosts	specific hostnames
Tags	`production`, `db`
Profiles	"only APC Smart-UPS"
Check type	`agent_disk`, `snmp_*`
Service display name (regex)	`^Battery`

If all filters are empty, the rule applies globally.
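
The matching logic can be pictured with a minimal Python sketch. The names `Service`, `RuleFilter`, and `matches` are hypothetical and not part of the product. The sketch assumes that the filter criteria are AND-combined, that one shared tag is enough for the tag filter, and that check types are matched as glob patterns; none of this is confirmed here.

```python
import re
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class Service:
    tenant: str
    host: str
    tags: set
    profile: str
    check_type: str
    display_name: str


@dataclass
class RuleFilter:
    # An empty collection (or None) means "no restriction" for that criterion.
    tenants: set = field(default_factory=set)
    hosts: set = field(default_factory=set)
    tags: set = field(default_factory=set)
    profiles: set = field(default_factory=set)
    check_types: list = field(default_factory=list)  # glob patterns, e.g. "snmp_*"
    name_regex: Optional[str] = None

    def matches(self, svc: Service) -> bool:
        # Assumption: all non-empty criteria must hold (AND semantics).
        if self.tenants and svc.tenant not in self.tenants:
            return False
        if self.hosts and svc.host not in self.hosts:
            return False
        # Assumption: sharing at least one tag is sufficient.
        if self.tags and not (self.tags & svc.tags):
            return False
        if self.profiles and svc.profile not in self.profiles:
            return False
        if self.check_types and not any(
            fnmatchcase(svc.check_type, pat) for pat in self.check_types
        ):
            return False
        if self.name_regex and not re.search(self.name_regex, svc.display_name):
            return False
        return True
```

With this sketch, `RuleFilter()` (all filters empty) matches every service, which corresponds to the global rule described above.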

Trigger¶

Trigger type	Meaning
Status	fires on `WARN`, `CRIT`, `UNKNOWN`, `NO_DATA` (multi-select)
Pattern (message regex)	regex on `check_results.message`
Threshold (value compare)	`value > 80`, `value < 0.5`
Anomaly	when service has anomaly detection and `|z| > sensitivity`
Log pattern	see Logs — separate rule type

Status triggers fire on hard states only. A service that enters a soft state and recovers before the state becomes hard passes silently.
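
To make the trigger types concrete, here is a small Python sketch of how each one could be evaluated against a single check result. `CheckResult` and the four functions are hypothetical names chosen for illustration; the real evaluation code may differ.

```python
import operator
import re
from dataclasses import dataclass
from typing import Optional

# Comparison operators a threshold trigger might support (assumption).
_COMPARE = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass
class CheckResult:
    status: str                      # "OK", "WARN", "CRIT", "UNKNOWN", "NO_DATA"
    hard: bool                       # True once the state is a hard state
    message: str = ""
    value: Optional[float] = None
    z_score: Optional[float] = None  # only set when anomaly detection is on


def status_trigger(result: CheckResult, statuses: set) -> bool:
    # Soft states never fire, regardless of the selected statuses.
    return result.hard and result.status in statuses


def pattern_trigger(result: CheckResult, pattern: str) -> bool:
    return re.search(pattern, result.message) is not None


def threshold_trigger(result: CheckResult, op: str, limit: float) -> bool:
    if result.value is None:
        return False
    return _COMPARE[op](result.value, limit)


def anomaly_trigger(result: CheckResult, sensitivity: float) -> bool:
    # Fires when |z| > sensitivity.
    return result.z_score is not None and abs(result.z_score) > sensitivity
```

For example, `status_trigger(CheckResult("CRIT", hard=False), {"WARN", "CRIT"})` is false in this sketch because the state is still soft.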

Test button¶

Before saving, use Test to simulate a match. It shows:

Which hosts/services would be affected
Which channels would receive the notification
Which escalation path would apply

This keeps the first alert wave of a new rule from unexpectedly hitting 2,000 hosts.

Since v0.17.4

Test button bug fixed: previously it silently did nothing when tenant filters were set.

Channels¶

You can attach any number of channels to a rule. When the rule fires, all of them are notified simultaneously, or staggered if escalation is configured.

Channel config: Notification channels.

Escalation¶

A multi-stage strategy for cases like "the operator misses it, so the manager should see it, then the team".

escalation:
  - after: 0min
    channels: [ops-email]
  - after: 15min
    channels: [ops-push, manager-email]
  - after: 60min
    channels: [oncall-pager, ceo-sms]

Stages are evaluated as long as the alert is active. Acknowledgement or recovery stops escalation.
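
A rough Python sketch of this stage logic follows. `Stage`, `STAGES`, and `due_stages` are hypothetical names; the sketch assumes that `after` is measured from the moment the alert became active and that each stage notifies at most once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    after_min: int
    channels: tuple


# Mirrors the escalation example above.
STAGES = [
    Stage(0, ("ops-email",)),
    Stage(15, ("ops-push", "manager-email")),
    Stage(60, ("oncall-pager", "ceo-sms")),
]


def due_stages(stages, elapsed_min, already_sent, acknowledged=False, recovered=False):
    """Return the indexes of stages that should notify now."""
    # Acknowledgement or recovery stops escalation entirely.
    if acknowledged or recovered:
        return []
    return [
        i for i, stage in enumerate(stages)
        if stage.after_min <= elapsed_min and i not in already_sent
    ]
```

Calling `due_stages(STAGES, 20, already_sent={0})` returns `[1]`: the 15-minute stage is due, the 60-minute stage is not yet.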

Grouping¶

If 50 hosts go down at once, do you want 50 emails or a single "50 hosts critical" email? Grouping gives you the latter.

grouping:
  by: [tenant, profile]   # one mail per (tenant, profile)
  wait_seconds: 30        # collect for 30 s
  group_interval: 300     # don't re-group within 5 min

Grouping is implemented in the worker (AlertGrouper). The first match starts a timer, further matches arriving within wait_seconds (30 s in the example) are collected, and then a single combined payload is sent to the channels.
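
The collect-then-send behaviour can be sketched in a few lines of Python. This is not the actual AlertGrouper code: `GrouperSketch`, its methods, and the payload shape are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and `group_interval` is left out.

```python
class GrouperSketch:
    def __init__(self, by, wait_seconds):
        self.by = by                  # e.g. ["tenant", "profile"]
        self.wait_seconds = wait_seconds
        self._pending = {}            # group key -> (deadline, [alerts])

    def add(self, alert, now):
        """Queue an alert; the first alert of a group starts the timer."""
        key = tuple(alert[name] for name in self.by)
        if key not in self._pending:
            self._pending[key] = (now + self.wait_seconds, [])
        self._pending[key][1].append(alert)

    def flush(self, now):
        """Return one combined payload per group whose timer has expired."""
        expired = [k for k, (deadline, _) in self._pending.items() if deadline <= now]
        payloads = []
        for key in expired:
            _, alerts = self._pending.pop(key)
            payloads.append({
                "group": dict(zip(self.by, key)),
                "count": len(alerts),
                "alerts": alerts,
            })
        return payloads
```

Feeding 50 alerts with the same tenant and profile into `add()` within the wait window and then calling `flush()` after the deadline yields a single payload with `count == 50`.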

Pause / active¶

Each rule has an enabled toggle. A paused rule is not evaluated, but it is kept. This is useful during migrations or maintenance waves.

Order / priority¶

Rules have a priority value. When multiple rules match, the highest priority wins. This matters for mute patterns ("when this rule fires, suppress all others for the service").
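
As a minimal illustration in Python, selecting the winning rule could look like the sketch below. `Rule` and `winning_rule` are hypothetical names, and the sketch assumes that a larger number means a higher priority; how ties are resolved is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    priority: int
    enabled: bool = True


def winning_rule(matching):
    """Pick the enabled rule with the highest priority, or None if none apply."""
    # Paused rules are not evaluated, so they cannot win.
    candidates = [rule for rule in matching if rule.enabled]
    return max(candidates, key=lambda rule: rule.priority, default=None)
```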

Inhibition (short)¶

A rule can declare that it suppresses other rules:

inhibits:
  - target_filter: { tag: 'critical-infra' }

In practice: "If the parent switch is CRIT, suppress alerts on dependent hosts." More: Dependencies & inhibition.

Log alert rules¶

A special variant for logs, supporting regex patterns, thresholds, and absence detection. The UI is under /alert-rules/log.

Details: Logs → Log alert rules.

Recovery notifications¶

By default, a notification is also sent on recovery (hard → OK), for example "Service X is back to OK". This is configurable per channel.

Audit¶

Alert-rule changes are written to the audit log with a diff, so you can answer questions like "Who lowered the threshold from 80 to 50?". Filter by target_kind = alert_rule.

Combined examples¶

Rule 1: disk full¶

name: Disk Full Critical
filter:
  check_type: agent_disk
trigger:
  threshold: value >= 95
channels: [ops-email, ops-push]
escalation:
  - after: 0min
    channels: [ops-email]
  - after: 30min
    channels: [ops-push, manager-email]
grouping:
  by: [host]
recovery_notify: true

Rule 2: cert expiring¶

name: TLS cert expiring soon
filter:
  check_type: ssl_certificate
trigger:
  status: [WARN, CRIT]
channels: [ops-email]
recovery_notify: false   # once renewed, no recovery spam

Rule 3: suspicious auth pattern¶

name: Brute force attempts
type: log_alert
mode: threshold
pattern: 'Failed password for'
window_minutes: 5
threshold: 20
channels: [security-team-email]
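
How `window_minutes` and `threshold` interact in Rule 3 can be pictured as a sliding-window counter. The Python sketch below is an assumption about the semantics (fire once the number of matching lines inside the window reaches the threshold); `LogThresholdSketch` and `observe` are hypothetical names.

```python
import re
from collections import deque


class LogThresholdSketch:
    def __init__(self, pattern, window_minutes, threshold):
        self.regex = re.compile(pattern)
        self.window_s = window_minutes * 60
        self.threshold = threshold
        self.hits = deque()  # timestamps (seconds) of matching lines

    def observe(self, ts, line):
        """Feed one log line; return True while the rule would fire."""
        if self.regex.search(line):
            self.hits.append(ts)
        # Drop matches that have fallen out of the window.
        while self.hits and self.hits[0] <= ts - self.window_s:
            self.hits.popleft()
        return len(self.hits) >= self.threshold
```

With the values from Rule 3, twenty lines containing `Failed password for` within five minutes make `observe()` return true; once older matches leave the window, it returns false again.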

Next¶

Notification channels
Dependencies & inhibition
Downtimes — suppress alerts during maintenance