SLA reports¶
How available was the service in a given period? How much downtime is attributable to it? How much was planned?
Definition¶
Availability (in %) = (total seconds − unavailable seconds) / total seconds × 100.
What counts as "unavailable":
- Hard CRITICAL phases
- Hard NO_DATA phases (default)
- Hard WARNING phases (not counted by default; configurable)
What does not count:
- Planned downtimes (by default; the "strict" weighting counts them)
Special cases:
- ACK phases still count as outage. An ACK only pauses notifications; it does not mean the service was OK.
- Inhibition phases count for the inhibited service, because from the customer's view it was down. They are not counted a second time for the parent.
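The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the `Phase` type, the `availability` function and its parameters are hypothetical names. It assumes all phases are already hard states and that downtime windows do not overlap each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    start: int   # epoch seconds, inclusive
    end: int     # epoch seconds, exclusive
    state: str   # hard state: "OK", "WARNING", "CRITICAL" or "NO_DATA"


def overlap(a_start, a_end, b_start, b_end):
    """Seconds shared by two half-open intervals (0 if they do not touch)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def availability(phases, downtimes, period_start, period_end,
                 warn_as_outage=False, strict=False):
    """Availability in percent for one service.

    downtimes: list of (start, end) planned windows, assumed non-overlapping.
    strict=True counts planned downtime as outage.
    """
    bad_states = {"CRITICAL", "NO_DATA"}
    if warn_as_outage:
        bad_states.add("WARNING")

    total = period_end - period_start
    unavailable = 0
    for p in phases:
        if p.state not in bad_states:
            continue
        # Clip the phase to the reporting period.
        start, end = max(p.start, period_start), min(p.end, period_end)
        if end <= start:
            continue
        bad = end - start
        if not strict:
            # Planned seconds inside the phase are not counted as outage.
            bad -= sum(overlap(start, end, d0, d1) for d0, d1 in downtimes)
        unavailable += max(0, bad)
    return (total - unavailable) / total * 100
```

ACK and inhibition need no special handling in this sketch: both leave the hard state of the affected service unchanged, so the seconds are counted as for any other CRITICAL phase.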
Generate report¶
/reports/sla → New report:
| Field | Meaning |
|---|---|
| Scope | tenant / tag / hosts |
| Period | last month / quarter / custom |
| Granularity | per service / per host / per tenant |
| Downtime weighting | "don't count" (default) / "strict" |
| WARN as outage? | default no |
Output: table with availability per row, plus expandable details with all incidents.
Example output¶
Tenant Acme — April 2026
─────────────────────────────────────────────────────────────────
Service                        Available    Planned   Unplanned
api.acme.com / HTTP               99.92%     02:14h      00:32h
api.acme.com / TLS cert          100.00%          -           -
db01.acme.local / Postgres       100.00%          -           -
sw-core / IF Gi1/0/1              99.45%          -      03:58h
sw-core / IF Gi1/0/2             100.00%          -           -
─────────────────────────────────────────────────────────────────
Tenant total                      99.83%     02:14h      04:30h
Clicking a row expands a detail list with start, end, duration, and cause (CRIT reason or downtime comment).
Calculation¶
flowchart LR
H[check_results hypertable] --> AGG[State buckets per service]
AGG --> CALC[Seconds per status]
DT[Downtimes] --> EXCL[Exclude planned seconds]
CALC --> EXCL
EXCL --> AVAIL[Availability %]
Implemented as a Python job (api/app/services/sla.py) using TimescaleDB window functions. A report covering 90 days × 5 000 services takes about 3 s.
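The "seconds per status" step can be illustrated in plain Python. This is only a sketch of the idea, not the code in sla.py: the function name `seconds_per_status` and the input shape are assumptions. Each check result is treated as valid until the next one arrives, which is roughly what a SQL window function such as `LEAD()` would provide.

```python
from collections import defaultdict


def seconds_per_status(results, period_end):
    """Sum the seconds spent in each state.

    results: list of (timestamp, state) tuples, sorted by timestamp
             (epoch seconds). A state lasts until the next result or
             until period_end, whichever comes first.
    """
    totals = defaultdict(int)
    # Pair every result with its successor; the last one runs to period_end.
    successors = results[1:] + [(period_end, None)]
    for (ts, state), (next_ts, _) in zip(results, successors):
        totals[state] += max(0, min(next_ts, period_end) - ts)
    return dict(totals)


print(seconds_per_status([(0, "OK"), (60, "CRITICAL"), (180, "OK")], 300))
```

The planned seconds from the downtime table would then be subtracted from the non-OK totals before the percentage is computed.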
Export¶
| Format | For |
|---|---|
| HTML in browser | Interactive view |
| PDF | Send to customer (see PDF reports) |
| CSV | Process in Excel |
Scheduled reports¶
/reports/scheduled → New:
- Monthly SLA report per tenant
- Auto-generated on the 1st at 06:00
- Result as PDF to configured email addresses
Recommendation: schedule one monthly report per tenant, so each customer automatically receives a compliance document.
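To make the schedule concrete, the following sketch computes when the next run would fire and which period it would cover. The function names are hypothetical. It assumes naive local datetimes and that a monthly report covers the preceding calendar month; neither is stated by the product.

```python
from datetime import datetime


def next_run(now):
    """Next 1st-of-month 06:00 strictly after `now`."""
    candidate = now.replace(day=1, hour=6, minute=0, second=0, microsecond=0)
    if candidate > now:
        return candidate
    # This month's slot has passed: roll over to the 1st of next month.
    if now.month == 12:
        year, month = now.year + 1, 1
    else:
        year, month = now.year, now.month + 1
    return candidate.replace(year=year, month=month)


def previous_month(run):
    """(start, end) of the calendar month before `run`; end is exclusive."""
    end = run.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if end.month == 1:
        year, month = end.year - 1, 12
    else:
        year, month = end.year, end.month - 1
    return end.replace(year=year, month=month), end


run = next_run(datetime(2026, 4, 15, 12, 0))
print(run, previous_month(run))
```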
Multiple SLA classes¶
When different services have different SLA expectations ("API: 99.9 %, wiki: 99.0 %"), use tags:
- Tag `sla-tier-99-9` for API services
- Tag `sla-tier-99-0` for internal services
Filter reports by tag to get a separate availability view per SLA class.
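If the tag names follow the pattern shown above, an exported report could be checked against its target outside the tool. The sketch below is an assumption, not a built-in feature: it derives the target percentage from the tag name, and `target_from_tag` and `meets_target` are hypothetical helpers.

```python
import re

# Matches tags such as "sla-tier-99-9" (target 99.9 %).
TIER_TAG = re.compile(r"^sla-tier-(\d+)-(\d+)$")


def target_from_tag(tag):
    """Return the target in percent encoded in the tag, or None."""
    m = TIER_TAG.match(tag)
    if m is None:
        return None
    return float(f"{m.group(1)}.{m.group(2)}")


def meets_target(availability_pct, tags):
    """True/False against the strictest tier tag; None if no tier tag is set."""
    targets = [t for t in map(target_from_tag, tags) if t is not None]
    if not targets:
        return None
    return availability_pct >= max(targets)


print(meets_target(99.92, ["prod", "sla-tier-99-9"]))
```

With the figures from the example output, a service at 99.45 % would miss a 99.9 % tier but satisfy a 99.0 % tier.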
SLA + anomaly¶
Anomaly events do not automatically count as outage — they're just hints. If an anomaly leads to a hard CRIT (after escalation or manual action), the CRIT counts, not the anomaly.
Permission¶
| Permission | Effect |
|---|---|
| `sla.view` | View reports |
| `sla.export` | Generate PDF / CSV |
| `sla.schedule` | Create scheduled reports |
Next¶
- PDF reports — distribution workflow
- Downtimes — how downtimes flow into availability