Status & state model
The five status values
| Status | Code | Color | Meaning |
|---|---|---|---|
| OK | 0 | green | Everything within thresholds |
| WARNING | 1 | yellow | Warn threshold crossed, not yet critical |
| CRITICAL | 2 | red | Crit threshold crossed or check failed |
| NO_DATA | 3 | orange | No data within expected interval |
| UNKNOWN | 4 | gray | Check ran, result not interpretable (plugin exit code 3) |
NO_DATA and UNKNOWN are often confused. Rule of thumb:
- NO_DATA = we never heard from the check (agent offline, collector dead, brand-new service)
- UNKNOWN = the check did run, but the result could not be interpreted (parser failure, missing OID on SNMP device)
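The table and rule of thumb above could be captured in a small enum. This is a minimal sketch: the class name `Status` and the helper `describe` are hypothetical, and only the names and numeric codes come from the table.

```python
from enum import IntEnum


class Status(IntEnum):
    """Status values with the numeric codes from the table above."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    NO_DATA = 3
    UNKNOWN = 4


def describe(status: Status) -> str:
    """Hypothetical helper that spells out the NO_DATA / UNKNOWN distinction."""
    if status is Status.NO_DATA:
        return "no result received within the expected interval"
    if status is Status.UNKNOWN:
        return "check ran, but the result could not be interpreted"
    return status.name.lower()


print(describe(Status.NO_DATA))
print(describe(Status.UNKNOWN))
```

The numeric code identifies a status; it should not be read as a severity rank.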
Severity order
The severity order matters for aggregation (tenant status, host status, alert groups).
If a host has 3 services (OK, NO_DATA, WARNING), its rolled-up status is WARNING, not NO_DATA. Why: a concrete problem (WARNING) is more relevant than a data gap (NO_DATA), because WARNING requires action. Note that severity does not follow the numeric code: NO_DATA has code 3 but ranks below WARNING (code 1).
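The roll-up in the example above can be sketched as a "pick the worst" function. The ranking used here is an assumption: the text only states that WARNING outranks NO_DATA, so the position of UNKNOWN and the names `SEVERITY_ORDER` and `rollup` are purely illustrative.

```python
# Assumed ranking, least severe first. Only "WARNING outranks NO_DATA" is
# stated in the text; where UNKNOWN sits is a guess for illustration.
SEVERITY_ORDER = ["OK", "NO_DATA", "UNKNOWN", "WARNING", "CRITICAL"]
RANK = {name: i for i, name in enumerate(SEVERITY_ORDER)}


def rollup(statuses: list[str]) -> str:
    """Return the most severe status; `statuses` must be non-empty."""
    return max(statuses, key=RANK.__getitem__)


print(rollup(["OK", "NO_DATA", "WARNING"]))  # WARNING
```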
When is each status set?
Creating a new service
New host_service → initial status NO_DATA, not UNKNOWN. This makes it explicit: "awaiting first result" rather than "check is broken".
On incoming check result
Worker translates the plugin result into a status:
- Exit code 0, value below `threshold_warn` → OK
- Value ≥ `threshold_warn`, < `threshold_crit` → WARNING
- Value ≥ `threshold_crit` or exit code 2 → CRITICAL
- Exit code 3 / unparseable → UNKNOWN
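The mapping above might look roughly like the following. This is a sketch, not the worker's actual code: the function name `translate`, the use of `None` for an unparseable value, the precedence between rules, and the handling of exit codes not listed are all assumptions. It also assumes that higher values are worse.

```python
from typing import Optional


def translate(exit_code: int, value: Optional[float],
              threshold_warn: float, threshold_crit: float) -> str:
    """Map a plugin result to a status, following the rules listed above."""
    # Plugin itself reported critical.
    if exit_code == 2:
        return "CRITICAL"
    # Exit code 3, or a value that could not be parsed (None here).
    if exit_code == 3 or value is None:
        return "UNKNOWN"
    if value >= threshold_crit:
        return "CRITICAL"
    if value >= threshold_warn:
        return "WARNING"
    if exit_code == 0:
        return "OK"
    # Exit codes the list does not cover are treated as UNKNOWN (assumption).
    return "UNKNOWN"


print(translate(0, 85.0, threshold_warn=80, threshold_crit=90))  # WARNING
```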
When data is missing
Watcher loops set NO_DATA:
| Watcher | Trigger |
|---|---|
| `dead_agent_watcher` | Agent silent for > `agent_dead_after` (default 3 × heartbeat ≈ 3 min) |
| `dead_collector_watcher` | Collector silent for > 3 min |
| `service_overdue_watcher` | Service result older than `interval_seconds` × 1.5 |
When a result arrives, the status flips to the actual outcome.
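As an illustration of the overdue rule used by `service_overdue_watcher`, the check reduces to a single comparison. The function name `is_overdue` and its parameters are hypothetical; only the `interval_seconds × 1.5` rule comes from the table.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_result_at: datetime, interval_seconds: int,
               now: datetime) -> bool:
    """True when the last result is older than interval_seconds × 1.5."""
    return now - last_result_at > timedelta(seconds=interval_seconds * 1.5)


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(is_overdue(now - timedelta(seconds=80), 60, now))   # False (80 s <= 90 s)
print(is_overdue(now - timedelta(seconds=100), 60, now))  # True  (100 s > 90 s)
```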
Soft vs. hard state
A single CRITICAL result is not necessarily a real problem: network glitches, load spikes, and reboot windows happen. The soft/hard model filters out such transient failures.
stateDiagram-v2
[*] --> OK: check ok
OK --> SOFT_CRIT: check critical
SOFT_CRIT --> SOFT_CRIT: not yet max_attempts
SOFT_CRIT --> HARD_CRIT: max_attempts reached
SOFT_CRIT --> OK: check ok (recovery)
HARD_CRIT --> OK: check ok (recovery → notification)
HARD_CRIT --> HARD_CRIT: still critical
| Phase | What happens |
|---|---|
| First failures | state_type = SOFT, attempt = 1..max-1 |
| Threshold crossed | at attempt = max_check_attempts → state_type = HARD, alert evaluation runs |
| Recovery before hard | Status returns to OK silently |
| Recovery after hard | Status to OK, recovery notification (if enabled) |
Default: max_check_attempts = 3. Configurable per profile-check.
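The transitions in the diagram and table above can be sketched as a small state machine. All names here (`ServiceState`, `apply_result`, the event strings) are hypothetical, and the sketch simplifies by treating every non-OK result the same way and by modelling a steady OK as a hard state.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    status: str = "OK"
    state_type: str = "HARD"  # assumption: a steady OK counts as hard
    attempt: int = 0


def apply_result(state: ServiceState, result: str,
                 max_check_attempts: int = 3) -> str:
    """Update `state` in place; return 'alert', 'recovery' or 'none'."""
    if result == "OK":
        was_hard_problem = state.status != "OK" and state.state_type == "HARD"
        state.status, state.state_type, state.attempt = "OK", "HARD", 0
        # Recovery from a soft state is silent.
        return "recovery" if was_hard_problem else "none"

    state.status = result
    if state.state_type == "HARD" and state.attempt >= max_check_attempts:
        return "none"  # already hard and still failing
    state.attempt += 1
    if state.attempt >= max_check_attempts:
        state.state_type = "HARD"
        return "alert"  # this is where alert evaluation would run
    state.state_type = "SOFT"
    return "none"


s = ServiceState()
for r in ["CRITICAL", "CRITICAL", "CRITICAL", "CRITICAL", "OK"]:
    print(r, "->", apply_result(s, r))
```

With the default of 3 attempts, the third consecutive failure returns `alert`, the fourth returns `none`, and the final OK returns `recovery`.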
Retry interval
In soft state, the retry interval is used, not the main interval. This gives faster detection without the cost of more frequent checks in steady state.
| Field | Meaning | Default |
|---|---|---|
| `interval_seconds` | Main interval in OK state | 60 s |
| `retry_interval_seconds` | Soft state interval | 15 s |
| `max_check_attempts` | Failures before hard | 3 |
Example timeline:
t=0 OK
t=60 CRIT → state SOFT, attempt 1
t=75 CRIT → state SOFT, attempt 2 (15 s retry)
t=90 CRIT → state HARD, attempt 3 (alert!)
t=150 OK → recovery (main interval)
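Without a retry interval, attempts 2 and 3 would run at t=120 and t=180, so the hard state would only be reached at t=180. With it, the hard state is reached at t=90: 30 s after the first failure instead of 120 s.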
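The arithmetic behind the timeline can be checked with a one-line formula: after the first failure, `max_check_attempts - 1` further checks are needed, each one retry interval apart. The function name `seconds_until_hard` is hypothetical.

```python
def seconds_until_hard(max_check_attempts: int, retry_seconds: int) -> int:
    """Seconds from the first failed check to the check that turns the state hard."""
    return (max_check_attempts - 1) * retry_seconds


first_failure = 60  # t of the first CRIT in the timeline above
print(first_failure + seconds_until_hard(3, 15))  # 90, retrying every 15 s
print(first_failure + seconds_until_hard(3, 60))  # 180, retrying at the main interval
```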
Acknowledged
A hard state can be acknowledged: the operator has seen it, and further notifications are suppressed. The next recovery clears the ack automatically (sticky ack is optional).
HARD_CRIT (notified) → ACK → no more notifications
ACK + recovery → status OK, ack auto-cleared
ACK + new hard → notification re-enabled (configurable)
Details: Alerting → Acknowledgements.
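A minimal sketch of the acknowledgement behaviour described above, under stated assumptions: the names `Alert`, `notifications_enabled`, `on_recovery` and the `sticky_ack` flag are hypothetical, and "sticky" is interpreted here as "the ack survives a recovery".

```python
from dataclasses import dataclass


@dataclass
class Alert:
    status: str = "CRITICAL"
    acknowledged: bool = False


def notifications_enabled(alert: Alert) -> bool:
    """An acknowledged alert sends no further notifications."""
    return not alert.acknowledged


def on_recovery(alert: Alert, sticky_ack: bool = False) -> None:
    """Set the status to OK; clear the ack unless sticky acks are enabled."""
    alert.status = "OK"
    if not sticky_ack:
        alert.acknowledged = False


a = Alert()
a.acknowledged = True
print(notifications_enabled(a))  # False
on_recovery(a)
print(a.status, a.acknowledged)  # OK False
```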
Downtime
During an active downtime the check runs as normal, but alert rules see it as "in maintenance". No notifications are sent, and the status appears in the UI with a maintenance badge.
Maintenance != ACK: ACK means "I've seen it", downtime means "this is expected".
Details: Alerting → Downtimes.
Aggregated status
Host status = worst severity of its services (with inhibition applied). Tenant status = worst severity of its hosts.
Inhibition can reduce effective status: if the parent host is CRITICAL, dependent services appear grayed in the UI and don't count toward tenant status.
Details: Alerting → Dependencies & inhibition.
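The two-level aggregation with inhibition can be sketched as follows. The ranking is an assumption (the documentation only states that WARNING outranks NO_DATA), as are the function names, the `(status, inhibited)` tuple shape, and the choice to report OK when nothing is left to count.

```python
# Assumed ranking (see the caveat in the lead-in); higher number = more severe.
RANK = {"OK": 0, "NO_DATA": 1, "UNKNOWN": 2, "WARNING": 3, "CRITICAL": 4}

Service = tuple[str, bool]  # (status, inhibited)


def worst(statuses: list[str]) -> str:
    # Assumption: with nothing left to count, the roll-up is OK.
    return max(statuses, key=RANK.__getitem__, default="OK")


def host_status(services: list[Service]) -> str:
    """Worst status among services that are not inhibited."""
    return worst([status for status, inhibited in services if not inhibited])


def tenant_status(hosts: list[list[Service]]) -> str:
    """Worst status among the tenant's hosts."""
    return worst([host_status(services) for services in hosts])


web = [("OK", False), ("WARNING", False)]
db = [("CRITICAL", True), ("OK", False)]  # the CRITICAL service is inhibited
print(tenant_status([web, db]))  # WARNING
```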
Next
- Profiles & checks — where thresholds and intervals are defined
- Alerting — turning status into notifications