Alerting
When a check goes CRITICAL, someone needs to know, but not the wrong person, not too late, not 50 times in a row, and not for every dependent symptom. Getting that right is the job of the alerting subsystem.
flowchart LR
R[Check result CRIT] --> RULE[Alert rule matches?]
RULE -->|no| NOP[do nothing]
RULE -->|yes| INHIB[Inhibition: parent CRIT?]
INHIB -->|yes| MUTE[Suppress alert]
INHIB -->|no| GROUP[Group]
GROUP --> CHAN[Notification channels]
CHAN --> EMAIL[Email]
CHAN --> PUSH[Mobile push]
CHAN --> WH[Webhook]
CHAN --> SLACK[Slack/Teams]
GROUP --> ESC[Escalation stage 1, 2, 3 ...]
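The flow above can be read as a small pipeline: match a rule, check inhibition, group, then fan out to channels. The following is only a minimal sketch of that idea, not the subsystem's actual implementation. All names (`CheckResult`, `route`, `notify`, the `parent` field) are hypothetical. It assumes a rule that matches any CRIT result, treats a parent as CRIT when any of its checks is CRIT, and groups by host. Escalation stages are left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckResult:
    host: str
    check: str
    state: str                    # "OK", "WARN" or "CRIT"
    parent: Optional[str] = None  # hypothetical: host this one depends on


def rule_matches(result):
    # Assumed rule: alert on CRIT only. Real rules may use thresholds or patterns.
    return result.state == "CRIT"


def is_inhibited(result, crit_hosts):
    # Inhibition: a CRIT parent suppresses the alerts of its children.
    return result.parent is not None and result.parent in crit_hosts


def route(results, crit_hosts):
    """Return (groups, suppressed); groups maps host -> list of check names."""
    groups = defaultdict(list)
    suppressed = []
    for result in results:
        if not rule_matches(result):
            continue  # "do nothing" branch
        if is_inhibited(result, crit_hosts):
            suppressed.append(result)
            continue
        groups[result.host].append(result.check)
    return dict(groups), suppressed


def notify(groups, channels):
    """Send one message per group to every channel; return what was sent."""
    sent = []
    for host, checks in groups.items():
        message = f"{host}: CRIT on {', '.join(checks)}"
        for name, send in channels.items():
            send(message)
            sent.append((name, message))
    return sent


results = [
    CheckResult("switch-1", "ping", "CRIT"),
    CheckResult("web-1", "http", "CRIT", parent="switch-1"),
    CheckResult("db-1", "disk", "CRIT"),
    CheckResult("db-1", "replication", "CRIT"),
    CheckResult("db-1", "load", "OK"),
]
# Assumption: a host counts as CRIT if any of its checks is CRIT.
crit_hosts = {r.host for r in results if r.state == "CRIT"}
groups, suppressed = route(results, crit_hosts)

outbox = []  # stand-in for real channels such as email or a webhook
channels = {"email": outbox.append, "webhook": outbox.append}
sent = notify(groups, channels)
```

In this run `web-1` is suppressed because its parent `switch-1` is CRIT, and the two failing `db-1` checks are folded into a single message per channel instead of one message per check.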
- Alert rules: thresholds, patterns, escalation, test button
- Notification channels: email, push, webhook, Slack, Teams
- Dependencies & inhibition: parent down → suppress child alerts
- Downtimes: planned maintenance, RRULE, mobile quick-add
- Acknowledgements: "seen, I'm on it"