Status & state model

The five status values

| Status | Code | Color | Meaning |
| --- | --- | --- | --- |
| OK | 0 | green | Everything within thresholds |
| WARNING | 1 | yellow | Warn threshold crossed, not yet critical |
| CRITICAL | 2 | red | Crit threshold crossed or check failed |
| NO_DATA | 3 | orange | No data within expected interval |
| UNKNOWN | 4 | gray | Check ran, result not interpretable (plugin exit code 3) |

NO_DATA and UNKNOWN are often confused. Rule of thumb:

  • NO_DATA = we never heard from the check (agent offline, collector dead, brand-new service)
  • UNKNOWN = the check did run, but its result could not be interpreted (parser failure, missing OID on SNMP device)

Severity order

Important for aggregation (tenant status, host status, alert groups):

CRITICAL > WARNING > NO_DATA > UNKNOWN > OK

If a host has 3 services — OK, NO_DATA, WARNING — its rolled-up status is WARNING, not NO_DATA. Why: a concrete problem (WARNING) is more relevant than a data gap (NO_DATA), because WARNING requires action.
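
The roll-up rule can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the names `Status`, `SEVERITY_RANK` and `rollup` are hypothetical, and returning NO_DATA for an empty input is an assumption the text does not cover.

```python
from enum import Enum


class Status(Enum):
    # Numeric codes as listed in the status table.
    OK = 0
    WARNING = 1
    CRITICAL = 2
    NO_DATA = 3
    UNKNOWN = 4


# Severity is kept separate from the numeric code, because the order
# CRITICAL > WARNING > NO_DATA > UNKNOWN > OK does not follow the codes.
SEVERITY_RANK = {
    Status.OK: 0,
    Status.UNKNOWN: 1,
    Status.NO_DATA: 2,
    Status.WARNING: 3,
    Status.CRITICAL: 4,
}


def rollup(statuses):
    """Return the most severe status in `statuses`.

    Assumption: an empty collection rolls up to NO_DATA.
    """
    return max(statuses, key=SEVERITY_RANK.__getitem__, default=Status.NO_DATA)


# The three-service example from the text rolls up to WARNING.
host = [Status.OK, Status.NO_DATA, Status.WARNING]
print(rollup(host).name)
```

Keeping the rank in its own table means the numeric codes can stay stable for storage while the severity order is changed in one place.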

When is each status set?

Creating a new service

New host_service → initial status NO_DATA, not UNKNOWN. This makes it explicit: "awaiting first result" rather than "check is broken".

On incoming check result

Worker translates the plugin result into a status:

  • Exit code 0, value below threshold_warn → OK
  • Value ≥ threshold_warn, < threshold_crit → WARNING
  • Value ≥ threshold_crit or exit code 2 → CRITICAL
  • Exit code 3 / unparseable → UNKNOWN
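
As a rough sketch, the mapping above could look like the following. The function name `classify` and its signature are hypothetical. It further assumes that higher values are worse, that the exit code is checked before the thresholds, and that any other exit code without a threshold hit is treated as UNKNOWN. None of these details are specified here.

```python
import math
from typing import Optional


def classify(exit_code: int, raw_value: Optional[str],
             threshold_warn: float, threshold_crit: float) -> str:
    """Translate a plugin result into a status name (illustrative only)."""
    if exit_code == 2:
        return "CRITICAL"          # the plugin itself reported a failure
    if exit_code == 3:
        return "UNKNOWN"           # the plugin could not produce a verdict
    try:
        value = float(raw_value)   # None or garbage raises and becomes UNKNOWN
    except (TypeError, ValueError):
        return "UNKNOWN"
    if math.isnan(value):
        return "UNKNOWN"           # NaN cannot be compared against thresholds
    if value >= threshold_crit:
        return "CRITICAL"
    if value >= threshold_warn:
        return "WARNING"
    # Assumption: only exit code 0 below both thresholds counts as OK.
    return "OK" if exit_code == 0 else "UNKNOWN"


print(classify(0, "85", threshold_warn=80, threshold_crit=90))
```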

When data is missing

Watcher loops set NO_DATA:

| Watcher | Trigger |
| --- | --- |
| dead_agent_watcher | Agent silent for > agent_dead_after (default 3 × heartbeat ≈ 3 min) |
| dead_collector_watcher | Collector silent for > 3 min |
| service_overdue_watcher | Service result older than interval_seconds × 1.5 |

When a result arrives, the status flips to the actual outcome.
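
The overdue rule for services is simple arithmetic. The sketch below is illustrative, not the actual watcher: `is_overdue` and `overdue_services` are hypothetical helpers, timestamps are plain seconds, and the shape of the `services` mapping is an assumption.

```python
def is_overdue(last_result_at: float, now: float,
               interval_seconds: float, factor: float = 1.5) -> bool:
    """True if the last result is older than interval_seconds × factor."""
    return now - last_result_at > interval_seconds * factor


def overdue_services(services: dict, now: float) -> list:
    """Return the names of services that should be flipped to NO_DATA.

    `services` maps a name to (last_result_at, interval_seconds);
    this layout is assumed for the example only.
    """
    return [name for name, (last, interval) in services.items()
            if is_overdue(last, now, interval)]


services = {"cpu": (0.0, 60.0), "disk": (50.0, 60.0)}
# At t=100, "cpu" is 100 s old (limit 90 s), "disk" only 50 s old.
print(overdue_services(services, now=100.0))
```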

Soft vs. hard state

A check that returns CRITICAL once isn't an immediate problem — network glitches, load spikes, reboot windows happen. Hence: soft/hard model.

stateDiagram-v2
    [*] --> OK: check ok
    OK --> SOFT_CRIT: check critical
    SOFT_CRIT --> SOFT_CRIT: not yet max_attempts
    SOFT_CRIT --> HARD_CRIT: max_attempts reached
    SOFT_CRIT --> OK: check ok (recovery)
    HARD_CRIT --> OK: check ok (recovery → notification)
    HARD_CRIT --> HARD_CRIT: still critical

| Phase | What happens |
| --- | --- |
| First failures | state_type = SOFT, attempt = 1..max-1 |
| Threshold crossed | At attempt = max_check_attempts → state_type = HARD, alert evaluation runs |
| Recovery before hard | Status returns to OK silently |
| Recovery after hard | Status to OK, recovery notification (if enabled) |

Default: max_check_attempts = 3. Configurable per profile-check.
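
The state diagram can be written as a small state machine. This is a sketch under stated assumptions: `CheckState`, `step`, and the event names "alert" and "recovery" are hypothetical, and only the OK/CRITICAL path shown in the diagram is modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckState:
    phase: str = "OK"   # "OK", "SOFT_CRIT" or "HARD_CRIT", as in the diagram
    attempt: int = 0    # consecutive critical results seen so far


def step(state: CheckState, critical: bool,
         max_check_attempts: int = 3) -> Optional[str]:
    """Apply one check result; return "alert", "recovery", or None."""
    if not critical:
        # Only a recovery from the hard state is reported; soft recovers silently.
        event = "recovery" if state.phase == "HARD_CRIT" else None
        state.phase, state.attempt = "OK", 0
        return event
    if state.phase == "HARD_CRIT":
        return None     # still critical, nothing new to report
    state.attempt += 1
    if state.attempt >= max_check_attempts:
        state.phase = "HARD_CRIT"
        return "alert"  # this is where alert evaluation would run
    state.phase = "SOFT_CRIT"
    return None


state = CheckState()
for result in (True, True, True, False):
    print(state.phase, "->", step(state, result), state.phase)
```

With the default of 3 attempts, the first two critical results stay soft, the third produces the alert event, and the following OK result produces a recovery event.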

Retry interval

In soft state, the retry interval is used, not the main interval. Faster detection without paying the cost in steady state.

| Field | Meaning | Default |
| --- | --- | --- |
| interval_seconds | Main interval in OK state | 60 s |
| retry_interval_seconds | Soft state interval | 15 s |
| max_check_attempts | Failures before hard | 3 |

Example timeline:

t=0    OK
t=60   CRIT  → state SOFT, attempt 1
t=75   CRIT  → state SOFT, attempt 2          (15 s retry)
t=90   CRIT  → state HARD, attempt 3          (alert!)
t=150  OK    → recovery (main interval)

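Without a retry interval, the second and third failing checks would run at t=120 and t=180, so the hard state would only be reached at t=180. With it, the hard state is reached at t=90: 30 s after the first failure instead of 120 s.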
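
The timeline arithmetic can be checked with a one-line formula: the first failure counts as attempt 1, and each further attempt costs one retry interval. The helper name `time_to_hard` is hypothetical.

```python
def time_to_hard(first_failure_at: float, retry_interval_seconds: float,
                 max_check_attempts: int = 3) -> float:
    """Time at which the hard state is reached if every retry fails."""
    return first_failure_at + (max_check_attempts - 1) * retry_interval_seconds


# First failure at t=60, defaults from the table (15 s retry, 3 attempts).
with_retry = time_to_hard(60, 15)
# Same failure if retries were scheduled at the main 60 s interval instead.
without_retry = time_to_hard(60, 60)
print(with_retry, without_retry)  # 90 180
```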

Acknowledged

A hard state can be acknowledged: the operator has seen it, and further notifications are suppressed. The next recovery clears the ack automatically (sticky ack is optional).

HARD_CRIT (notified) → ACK → no more notifications
ACK + recovery       → status OK, ack auto-cleared
ACK + new hard       → notification re-enabled (configurable)

Details: Alerting → Acknowledgements.

Downtime

During an active downtime the check runs as normal, but alert rules see it as "in maintenance". No notifications are sent, and the status appears in the UI with a maintenance badge.

Maintenance != ACK: ACK is "I've seen it", downtime is "this is expected".
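
Both mechanisms suppress notifications, but for different reasons, and a notification gate might combine them roughly as follows. This is a sketch only: `in_downtime` and `should_notify` are hypothetical names, and representing downtimes as (start, end) pairs with an exclusive end is an assumption.

```python
from datetime import datetime, timedelta, timezone


def in_downtime(now: datetime, windows) -> bool:
    """True if `now` falls inside any (start, end) downtime window."""
    return any(start <= now < end for start, end in windows)


def should_notify(is_hard: bool, acknowledged: bool,
                  now: datetime, downtime_windows) -> bool:
    """Notify only for hard states that are neither acked nor in maintenance."""
    return is_hard and not acknowledged and not in_downtime(now, downtime_windows)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
maintenance = [(now - timedelta(minutes=5), now + timedelta(minutes=5))]

print(should_notify(True, False, now, []))           # plain hard state
print(should_notify(True, True, now, []))            # acknowledged
print(should_notify(True, False, now, maintenance))  # inside a downtime
```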

Details: Alerting → Downtimes.

Aggregated status

Host status = worst severity of its services (with inhibition applied). Tenant status = worst severity of its hosts.

Inhibition can reduce effective status: if the parent host is CRITICAL, dependent services appear grayed in the UI and don't count toward tenant status.

Details: Alerting → Dependencies & inhibition.