Status & state model

The five status values

| Status | Code | Color | Meaning |
|---|---|---|---|
| OK | 0 | green | Everything within thresholds |
| WARNING | 1 | yellow | Warn threshold crossed, not yet critical |
| CRITICAL | 2 | red | Crit threshold crossed or check failed |
| NO_DATA | 3 | orange | No data within expected interval |
| UNKNOWN | 4 | gray | Check ran, result not interpretable (plugin exit code 3) |

NO_DATA and UNKNOWN are often confused. Rule of thumb:

NO_DATA = we have not heard from the check (agent offline, collector dead, brand-new service)
UNKNOWN = the check did run, but its result can't be interpreted (parser failure, missing OID on SNMP device)

Severity order

Important for aggregation (tenant status, host status, alert groups):

CRITICAL > WARNING > NO_DATA > UNKNOWN > OK

If a host has 3 services — OK, NO_DATA, WARNING — its rolled-up status is WARNING, not NO_DATA. Why: a concrete problem (WARNING) is more relevant than a data gap (NO_DATA), because WARNING requires action.
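The numeric status code and the severity rank are two different orderings: UNKNOWN has the highest code (4) but ranks below NO_DATA. A minimal sketch of the roll-up, assuming a hypothetical `worst_status` helper and a `SEVERITY_RANK` table that are not part of the product:

```python
from enum import IntEnum


class Status(IntEnum):
    # Numeric codes as listed in the status table.
    OK = 0
    WARNING = 1
    CRITICAL = 2
    NO_DATA = 3
    UNKNOWN = 4


# Severity rank (higher = worse), following
# CRITICAL > WARNING > NO_DATA > UNKNOWN > OK.
# Hypothetical lookup table for illustration only.
SEVERITY_RANK = {
    Status.OK: 0,
    Status.UNKNOWN: 1,
    Status.NO_DATA: 2,
    Status.WARNING: 3,
    Status.CRITICAL: 4,
}


def worst_status(statuses):
    """Return the most severe status of a non-empty iterable."""
    return max(statuses, key=SEVERITY_RANK.__getitem__)


services = [Status.OK, Status.NO_DATA, Status.WARNING]
print(worst_status(services).name)  # → WARNING
```

Taking `max()` over the raw codes would return NO_DATA for this host, which is why the rank is kept in a separate table.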

When is each status set?

Creating a new service

New host_service → initial status NO_DATA, not UNKNOWN. This makes it explicit: "awaiting first result" rather than "check is broken".

On incoming check result

The worker translates the plugin result into a status:

Exit code 0, value below threshold_warn → OK
Value ≥ threshold_warn, < threshold_crit → WARNING
Value ≥ threshold_crit or exit code 2 → CRITICAL
Exit code 3 / unparseable → UNKNOWN
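The rules above can be sketched as a small function. This is an illustration, not the worker's actual code: the function name `translate`, its signature, and the order in which the rules are checked (exit codes before thresholds) are assumptions.

```python
def translate(exit_code, raw_value, threshold_warn, threshold_crit):
    """Map a plugin result to a status name (hypothetical helper)."""
    if exit_code == 3:
        return "UNKNOWN"
    if exit_code == 2:
        return "CRITICAL"
    try:
        value = float(raw_value)
    except (TypeError, ValueError):
        # Unparseable output counts as UNKNOWN.
        return "UNKNOWN"
    # Exit codes other than 0, 2 and 3 are not covered by the rules
    # above; this sketch simply falls through to the thresholds.
    if value >= threshold_crit:
        return "CRITICAL"
    if value >= threshold_warn:
        return "WARNING"
    return "OK"


print(translate(0, "85", threshold_warn=80, threshold_crit=90))  # → WARNING
```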

When data is missing

Watcher loops set NO_DATA:

| Watcher | Trigger |
|---|---|
| `dead_agent_watcher` | Agent silent for > `agent_dead_after` (default 3 × heartbeat ≈ 3 min) |
| `dead_collector_watcher` | Collector silent for > 3 min |
| `service_overdue_watcher` | Service result older than `interval_seconds × 1.5` |

When a result arrives, the status flips to the actual outcome.
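As an illustration of the `service_overdue_watcher` trigger, the check boils down to one timestamp comparison. The function name `is_overdue` and its parameters are hypothetical; only the `interval_seconds × 1.5` rule comes from the table above.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_result_at, interval_seconds, now, factor=1.5):
    """True if the last result is older than interval_seconds * factor."""
    return now - last_result_at > timedelta(seconds=interval_seconds * factor)


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
last = now - timedelta(seconds=95)

# With a 60 s interval the limit is 90 s, so a 95 s old result is overdue.
print(is_overdue(last, interval_seconds=60, now=now))  # → True
```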

Soft vs. hard state

A check that returns CRITICAL once isn't an immediate problem — network glitches, load spikes, reboot windows happen. Hence: soft/hard model.

stateDiagram-v2
    [*] --> OK: check ok
    OK --> SOFT_CRIT: check critical
    SOFT_CRIT --> SOFT_CRIT: not yet max_attempts
    SOFT_CRIT --> HARD_CRIT: max_attempts reached
    SOFT_CRIT --> OK: check ok (recovery)
    HARD_CRIT --> OK: check ok (recovery → notification)
    HARD_CRIT --> HARD_CRIT: still critical

| Phase | What happens |
|---|---|
| First failures | `state_type = SOFT`, `attempt = 1..max-1` |
| Threshold crossed | at `attempt = max_check_attempts` → `state_type = HARD`, alert evaluation runs |
| Recovery before hard | Status returns to `OK` silently |
| Recovery after hard | Status to `OK`, recovery notification (if enabled) |

Default: max_check_attempts = 3. Configurable per profile-check.
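The transitions in the diagram can be sketched as a small state machine. This is a simplified model covering only OK and CRITICAL, as in the diagram; the `CheckState` class, the `apply_result` function, the event names, and the choice to treat a steady OK as `HARD` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CheckState:
    status: str = "OK"
    state_type: str = "HARD"  # assumption: a steady OK counts as HARD
    attempt: int = 0


def apply_result(state, result, max_check_attempts=3):
    """Update state in place; return "alert", "recovery", or None."""
    if result == "OK":
        # Only a recovery from a hard problem produces an event.
        was_hard_problem = state.status != "OK" and state.state_type == "HARD"
        state.status, state.state_type, state.attempt = "OK", "HARD", 0
        return "recovery" if was_hard_problem else None

    already_hard = state.status != "OK" and state.state_type == "HARD"
    state.status = result
    if already_hard:
        return None  # still critical, nothing new to report
    state.attempt += 1
    if state.attempt >= max_check_attempts:
        state.state_type = "HARD"
        return "alert"  # alert evaluation runs here
    state.state_type = "SOFT"
    return None


state = CheckState()
results = ["CRITICAL", "CRITICAL", "CRITICAL", "CRITICAL", "OK"]
print([apply_result(state, r) for r in results])
# → [None, None, 'alert', None, 'recovery']
```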

Retry interval

In soft state, the retry interval is used, not the main interval. Faster detection without paying the cost in steady state.

| Field | Meaning | Default |
|---|---|---|
| `interval_seconds` | Main interval in OK state | 60 s |
| `retry_interval_seconds` | Soft state interval | 15 s |
| `max_check_attempts` | Failures before hard | 3 |

Example timeline:

t=0    OK
t=60   CRIT  → state SOFT, attempt 1
t=75   CRIT  → state SOFT, attempt 2          (15 s retry)
t=90   CRIT  → state HARD, attempt 3          (alert!)
t=150  OK    → recovery (main interval)

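Without a retry interval, the three attempts would fall at t=60, t=120 and t=180, so hard state would only be reached at t=180. With the retry interval it is reached at t=90: 30 s after the first failure instead of 120 s.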
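The arithmetic behind the timeline: the first failure is attempt 1, and each further attempt adds one retry interval. A small sketch with a hypothetical helper name, using the default values from the table:

```python
def hard_state_time(first_failure_at, retry_interval_seconds, max_check_attempts):
    """Time at which attempt number max_check_attempts is reached."""
    return first_failure_at + (max_check_attempts - 1) * retry_interval_seconds


# With the 15 s retry interval.
print(hard_state_time(60, 15, 3))  # → 90
# Retrying at the 60 s main interval instead.
print(hard_state_time(60, 60, 3))  # → 180
```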

Acknowledged

A hard state can be acknowledged: the operator has seen it, and further notifications are suppressed. The next recovery clears the ack automatically (sticky ack is optional).

HARD_CRIT (notified) → ACK → no more notifications
ACK + recovery       → status OK, ack auto-cleared
ACK + new hard       → notification re-enabled (configurable)

Details: Alerting → Acknowledgements.

Downtime

During an active downtime the check runs as normal, but alert rules see it as "in maintenance". No notifications are sent, and the status appears in the UI with a maintenance badge.

Downtime is not the same as an ACK: an ACK means "I've seen it", a downtime means "this is expected".
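Both mechanisms end in the same place, with no notification, but for different reasons. A rough sketch of how a notification gate might combine them with the soft/hard model; `should_notify` and its flags are hypothetical, and the real rules are described under Alerting:

```python
def should_notify(state_type, acknowledged, in_downtime):
    """Decide whether a problem notification may go out (illustrative only)."""
    if state_type != "HARD":
        return False  # soft states never notify
    if in_downtime:
        return False  # "this is expected"
    if acknowledged:
        return False  # "I've seen it"
    return True


print(should_notify("HARD", acknowledged=False, in_downtime=False))  # → True
print(should_notify("HARD", acknowledged=True, in_downtime=False))   # → False
```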

Details: Alerting → Downtimes.

Aggregated status

Host status = worst severity of its services (with inhibition applied). Tenant status = worst severity of its hosts.

Inhibition can reduce effective status: if the parent host is CRITICAL, dependent services appear grayed in the UI and don't count toward tenant status.
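A sketch of the two roll-up levels with inhibition applied. The data model is an assumption made for this example: each host names at most one parent host, and a host whose parent is CRITICAL is left out of the tenant roll-up entirely. The function names and the OK default for an empty input are likewise hypothetical.

```python
# Ascending severity: CRITICAL > WARNING > NO_DATA > UNKNOWN > OK.
SEVERITY = ["OK", "UNKNOWN", "NO_DATA", "WARNING", "CRITICAL"]


def worst(statuses, default="OK"):
    """Most severe status; `default` for an empty input is an assumption."""
    return max(statuses, key=SEVERITY.index, default=default)


def tenant_status(hosts):
    """hosts: name -> {"parent": name or None, "services": [status, ...]}.

    Returns (tenant status, sorted names of inhibited hosts).
    """
    host_status = {name: worst(h["services"]) for name, h in hosts.items()}
    inhibited = sorted(
        name for name, h in hosts.items()
        if host_status.get(h["parent"]) == "CRITICAL"
    )
    counted = [s for name, s in host_status.items() if name not in inhibited]
    return worst(counted), inhibited


hosts = {
    "router": {"parent": None, "services": ["CRITICAL"]},
    "web": {"parent": "router", "services": ["CRITICAL", "WARNING"]},
    "db": {"parent": None, "services": ["OK", "WARNING"]},
}
print(tenant_status(hosts))  # → ('CRITICAL', ['web'])
```

The tenant is still CRITICAL because of `router` itself, but `web` no longer contributes a second problem of its own.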

Details: Alerting → Dependencies & inhibition.

Next

Profiles & checks — where thresholds and intervals are defined
Alerting — turning status into notifications