Alert rules¶
An alert rule defines when and how a notification fires. Each rule answers five questions:
- Which hosts/services are covered? (filter)
- On what status / pattern does it fire? (trigger)
- Which channel gets the notification? (channels)
- In what order / after what wait? (escalation)
- Are simultaneous alerts grouped? (grouping)
Create a rule¶
/alert-rules → New.
Filter¶
Which services?
| Filter | Example |
|---|---|
| Tenants | "only Acme GmbH" |
| Hosts | specific hostnames |
| Tags | production, db |
| Profiles | "only APC Smart-UPS" |
| Check type | agent_disk, snmp_* |
| Service display name (regex) | ^Battery |
If all filters are empty, the rule applies globally.
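The filter semantics can be pictured with a minimal Python sketch. It is not the product's implementation: the `Service` and `RuleFilter` types and their field names are hypothetical, the profile filter is left out for brevity, and two behaviours are assumptions that the table does not state: all filters that are set must match (AND), and a tag filter matches if the service carries any of the listed tags.

```python
from __future__ import annotations

import fnmatch
import re
from dataclasses import dataclass, field


@dataclass
class Service:
    # Hypothetical shape of a monitored service; field names are illustrative.
    tenant: str
    host: str
    tags: set[str]
    check_type: str
    display_name: str


@dataclass
class RuleFilter:
    # An empty collection (or None) means "this filter is not set".
    tenants: set[str] = field(default_factory=set)
    hosts: set[str] = field(default_factory=set)
    tags: set[str] = field(default_factory=set)
    check_types: list[str] = field(default_factory=list)  # globs such as snmp_*
    display_name_regex: str | None = None

    def matches(self, svc: Service) -> bool:
        # Assumption: every filter that is set must match (logical AND).
        if self.tenants and svc.tenant not in self.tenants:
            return False
        if self.hosts and svc.host not in self.hosts:
            return False
        # Assumption: one shared tag is enough.
        if self.tags and not (self.tags & svc.tags):
            return False
        if self.check_types and not any(
            fnmatch.fnmatchcase(svc.check_type, pat) for pat in self.check_types
        ):
            return False
        if self.display_name_regex and not re.search(
            self.display_name_regex, svc.display_name
        ):
            return False
        return True  # nothing set, or everything matched


ups = Service("Acme GmbH", "ups-01", {"production"}, "snmp_ups", "Battery charge")
print(RuleFilter().matches(ups))                    # empty filter: global rule
print(RuleFilter(check_types=["snmp_*"],
                 display_name_regex=r"^Battery").matches(ups))
print(RuleFilter(tags={"db"}).matches(ups))         # tag not present
```

An empty `RuleFilter()` falls through every check, which mirrors the "applies globally" behaviour described above.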
Trigger¶
| Trigger type | Meaning |
|---|---|
| Status | fires on WARN, CRIT, UNKNOWN, NO_DATA (multi-select) |
| Pattern (message regex) | regex on check_results.message |
| Threshold (value compare) | value > 80, value < 0.5 |
| Anomaly | when the service has anomaly detection and \|z\| > sensitivity |
| Log pattern | see Logs — separate rule type |
Status triggers fire on hard states only. A soft state that recovers before it becomes hard passes silently.
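The threshold and anomaly rows of the trigger table both reduce to a single comparison. The following Python sketch illustrates that idea only. The function names are hypothetical, and the z-score is computed here from a plain mean and sample standard deviation over recent values, which is an assumption: the way the product derives z is not described on this page.

```python
import operator
import statistics

# Comparison operators that an expression such as "value > 80" may use.
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def threshold_fires(expr: str, value: float) -> bool:
    """Evaluate an expression of the form 'value <op> <number>'."""
    name, op, limit = expr.split()
    if name != "value" or op not in _OPS:
        raise ValueError(f"unsupported threshold expression: {expr!r}")
    return _OPS[op](value, float(limit))


def anomaly_fires(history: list[float], value: float, sensitivity: float) -> bool:
    """Fire when |z| of `value` relative to `history` exceeds `sensitivity`."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)  # needs at least two data points
    if stdev == 0:
        return False  # flat history: z is undefined, so do not fire
    z = (value - mean) / stdev
    return abs(z) > sensitivity


print(threshold_fires("value > 80", 81.0))
print(anomaly_fires([10, 11, 9, 10, 10], 20, sensitivity=3.0))
```

Parsing the expression into an operator lookup avoids calling `eval` on user-supplied rule text.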
Test button¶
Before saving, use Test to simulate a hit. It shows:
- Which hosts/services would be affected
- Which channels would receive the notification
- Which escalation path would apply
This keeps your first alert wave from hitting 2 000 hosts at once.
Since v0.17.4
Test button bug fixed: it used to silently do nothing when tenant filters were set.
Channels¶
Per rule you can attach any number of channels. When the rule fires, all go out simultaneously (or staggered via escalation).
Channel config: Notification channels.
Escalation¶
Multi-stage strategy for "the operator misses it, so the manager should see it, then the team".
escalation:
  - after: 0min
    channels: [ops-email]
  - after: 15min
    channels: [ops-push, manager-email]
  - after: 60min
    channels: [oncall-pager, ceo-sms]
Stages are evaluated as long as the alert is active. Acknowledgement or recovery stops escalation.
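A minimal Python sketch of this evaluation, using the three stages from the example. The `Stage` type, the `due_stages` function, and the idea of tracking already-sent stages by index are illustrative assumptions, not the worker's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    after_min: int
    channels: tuple[str, ...]


STAGES = [
    Stage(0, ("ops-email",)),
    Stage(15, ("ops-push", "manager-email")),
    Stage(60, ("oncall-pager", "ceo-sms")),
]


def due_stages(stages, active_minutes, already_sent,
               acknowledged=False, recovered=False):
    """Return the indices of stages that should be sent now."""
    if acknowledged or recovered:
        return []  # acknowledgement or recovery stops escalation
    return [
        i
        for i, stage in enumerate(stages)
        if stage.after_min <= active_minutes and i not in already_sent
    ]


print(due_stages(STAGES, active_minutes=0, already_sent=set()))
print(due_stages(STAGES, active_minutes=20, already_sent={0}))
print(due_stages(STAGES, active_minutes=90, already_sent={0, 1}, acknowledged=True))
```

Calling the function on every evaluation tick and recording the returned indices gives each stage exactly one delivery while the alert stays active.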
Grouping¶
If 50 hosts go down at once, do you want 50 emails or one "50 hosts critical" email? Grouping does the latter.
grouping:
  by: [tenant, profile]   # one mail per (tenant, profile)
  wait_seconds: 30        # collect for 30 s
  group_interval: 300     # don't re-group within 5 min
Grouping is implemented in the worker (AlertGrouper). The first match starts a timer, further matches within wait_seconds (30 s in the example) are collected, and then one unified payload is sent to the channels.
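The collection window can be sketched in Python as a pure function over timestamped alerts. This is an illustration, not the AlertGrouper code: the function name and the dict shape of an alert are hypothetical, real timers are replaced by a numeric `ts` field, and `group_interval` is not modelled.

```python
def group_alerts(alerts, by, wait_seconds):
    """Batch alerts that share a key and arrive within `wait_seconds` of the
    first alert in their batch.

    `alerts` is an iterable of dicts with a numeric "ts" (seconds) plus the
    fields named in `by`. Returns a list of (key, [alerts]) batches.
    """
    open_batches = {}  # key -> (window_start, [alerts])
    finished = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = tuple(alert[f] for f in by)
        batch = open_batches.get(key)
        if batch is not None and alert["ts"] - batch[0] <= wait_seconds:
            batch[1].append(alert)  # still inside the collection window
        else:
            if batch is not None:
                finished.append((key, batch[1]))  # window closed: flush it
            open_batches[key] = (alert["ts"], [alert])  # first match starts the timer
    finished.extend((key, items) for key, (_, items) in open_batches.items())
    return finished


alerts = [
    {"ts": 0, "tenant": "acme", "profile": "ups", "host": "a"},
    {"ts": 10, "tenant": "acme", "profile": "ups", "host": "b"},
    {"ts": 12, "tenant": "other", "profile": "ups", "host": "c"},
    {"ts": 100, "tenant": "acme", "profile": "ups", "host": "d"},
]
for key, items in group_alerts(alerts, by=["tenant", "profile"], wait_seconds=30):
    print(key, [a["host"] for a in items])
```

Hosts `a` and `b` end up in one payload because they share a (tenant, profile) key and arrive 10 s apart; host `d` arrives after the window has closed and starts a new batch.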
Pause / active¶
Each rule has an enabled toggle. A paused rule is not evaluated, but it keeps its configuration. Useful during migrations or maintenance waves.
Order / priority¶
Rules have a priority value. When several rules match, the one with the highest priority wins. This matters for mute patterns ("when this rule fires, suppress all others for the service").
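As a rough Python sketch of that selection: the dict-based rule shape and the `winning_rule` name are hypothetical, and the sketch assumes that a larger number means higher priority and that disabled rules never take part. Tie-breaking between equal priorities is not specified here, so the sketch simply keeps the first one.

```python
def winning_rule(matching_rules):
    """Return the enabled matching rule with the highest priority, or None."""
    enabled = [r for r in matching_rules if r.get("enabled", True)]
    return max(enabled, key=lambda r: r["priority"], default=None)


rules = [
    {"name": "Disk Full Critical", "priority": 10},
    {"name": "Mute during migration", "priority": 50},
    {"name": "Old catch-all", "priority": 90, "enabled": False},
]
print(winning_rule(rules)["name"])
```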
Inhibition (short)¶
A rule can declare that it suppresses other rules.
In practice: "If the parent switch is CRIT, suppress alerts on dependent hosts." More: Dependencies & inhibition.
Log alert rules¶
Special variant for logs — regex pattern, threshold, absence. UI under /alert-rules/log.
Details: Logs → Log alert rules.
Recovery notifications¶
By default, a notification is also sent on recovery (hard → OK), for example "Service X is back to OK". Configurable per channel.
Audit¶
Alert-rule changes are written to the audit log with a diff. To find out who lowered a threshold from 80 to 50, filter the audit log by target_kind = alert_rule.
Combined examples¶
Rule 1: disk full¶
name: Disk Full Critical
filter:
  check_type: agent_disk
trigger:
  threshold: value >= 95
channels: [ops-email, ops-push]
escalation:
  - after: 0min
    channels: [ops-email]
  - after: 30min
    channels: [ops-push, manager-email]
grouping:
  by: [host]
recovery_notify: true
Rule 2: cert expiring¶
name: TLS cert expiring soon
filter:
  check_type: ssl_certificate
trigger:
  status: [WARN, CRIT]
channels: [ops-email]
recovery_notify: false   # once renewed, no recovery spam
Rule 3: suspicious auth pattern¶
name: Brute force attempts
type: log_alert
mode: threshold
pattern: 'Failed password for'
window_minutes: 5
threshold: 20
channels: [security-team-email]
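How a threshold-mode log rule like this one could be evaluated is sketched below in Python. It is an assumption-laden illustration, not the product's evaluator: the function name is hypothetical, log lines are assumed to arrive as chronologically ordered (timestamp in seconds, text) pairs, and the rule is assumed to fire once the match count within the window reaches the threshold (>=).

```python
import re
from collections import deque


def log_threshold_fires(lines, pattern, window_minutes, threshold):
    """Return True if `pattern` matches at least `threshold` lines inside any
    sliding window of `window_minutes`."""
    regex = re.compile(pattern)
    window = window_minutes * 60
    hits = deque()  # timestamps of recent matches, oldest first
    for ts, text in lines:
        if not regex.search(text):
            continue
        hits.append(ts)
        while ts - hits[0] > window:
            hits.popleft()  # drop matches that fell out of the window
        if len(hits) >= threshold:
            return True
    return False


# 20 failed logins within 20 seconds: enough for the rule above.
burst = [(t, "sshd: Failed password for root from 203.0.113.7") for t in range(20)]
print(log_threshold_fires(burst, r"Failed password for", window_minutes=5, threshold=20))
```

The deque keeps only the matches that are still inside the window, so memory stays bounded by the threshold-sized burst rather than by the whole log.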
Next¶
- Notification channels
- Dependencies & inhibition
- Downtimes — suppress alerts during maintenance