Architecture¶
Overview¶
Vesana is push-based: agents and collectors connect outbound to the server and send check results in. The server has no outbound connection to monitored machines.
flowchart LR
    subgraph "Customer network"
        A[Agent on servers] -->|HTTPS POST| EXIT[outbound 443]
        C[Collector VM] -->|HTTPS POST| EXIT
    end
    subgraph "Vesana server"
        EXIT --> R[Receiver]
        R --> RS[(Redis stream)]
        RS --> W1[Worker 0]
        RS --> W2[Worker 1]
        RS --> W3[Worker N]
        W1 --> DB[(Postgres + TimescaleDB)]
        W2 --> DB
        W3 --> DB
        DB --> API[REST API]
        API --> FE[React frontend]
        API --> M[Mobile app]
    end
Components¶
Receiver¶
- FastAPI service receiving agent and collector packets
- Authenticates via X-API-Key (collector) or X-Agent-Token (agent)
- Validates schema, writes immediately to Redis stream: no logic, no DB write
- Goal: lowest possible latency, highest throughput
Redis stream¶
- Backpressure-capable ingress queue (XADD / XREADGROUP)
- Multiple workers consume in parallel
- Full stream → receiver rejects (noeviction policy), no silent drops
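The reject-when-full behaviour can be illustrated with a minimal sketch. It uses an in-memory queue as a stand-in for the Redis stream, so `BoundedStream`, `StreamFull`, `receive`, and the 202/503 status codes are all assumptions made for illustration, not Vesana's actual code.

```python
from collections import deque


class StreamFull(Exception):
    """Raised when the ingress queue cannot accept another entry."""


class BoundedStream:
    """In-memory stand-in for a Redis stream with a hard capacity."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._entries = deque()

    def add(self, message: dict) -> None:
        # Refuse new entries instead of evicting old ones (like noeviction).
        if len(self._entries) >= self.max_entries:
            raise StreamFull("ingress queue is full")
        self._entries.append(message)

    def read(self, count: int) -> list:
        # Hand up to `count` of the oldest entries to a worker.
        batch = []
        while self._entries and len(batch) < count:
            batch.append(self._entries.popleft())
        return batch


def receive(stream: BoundedStream, packet: dict) -> int:
    """Return an HTTP-style status code for one incoming packet."""
    try:
        stream.add(packet)
    except StreamFull:
        return 503  # hypothetical status: tell the sender to retry later
    return 202
```

Because the sender gets an explicit rejection, it can buffer and retry, which is what makes the drop visible rather than silent.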
Worker¶
- Reads messages, fetches host/service context from DB
- Applies profile-check effective config, normalizes values
- Writes check results to check_results (hypertable)
- Updates current_status (hot table with fillfactor=80)
- Triggers alert evaluation, notification dispatch, AI analysis cache invalidation
API¶
- FastAPI with JWT auth, automatic tenant scope via ORM filter
- Endpoints: hosts, services, profiles, discoveries, alerts, reports, wiki, AI, admin
- Background tasks: downtime watcher, dead-collector watcher, anomaly baselines, auto-purge, tester phone-home
- Distributed locking via Redis — with multiple API replicas, each watcher runs only once
Frontend¶
- React 18 + TypeScript + Vite
- Themed via CSS variables (20 themes × dark/light)
- Lazy-loaded ECharts, lazy-loaded ReactMarkdown for wiki
Agent (Go)¶
- Single binary, statically linked (CGO_ENABLED=0), ~6.5 MB
- Fetches config every 5 minutes, runs checks locally
- Auto-update on config refresh when server reports a newer version
Collector (Go)¶
- Single binary, runs on a Linux VM in customer network
- Runs remote checks: SNMP, ping, SSH, HTTP, discovery (nmap)
- Fetches config every 60 seconds, sends results + discovery output to server
Mobile app¶
- React Native + Expo, Android APK
- Own API client, push token registration via FCM
- Tap on push → deep link to host detail
Multi-tenant isolation¶
Tenants are the central isolation boundary. Every DB table with customer data has a tenant_id column. The ORM-level apply_tenant_filter() (api/app/auth.py) enforces filtering: a query without tenant scope raises a runtime error.
Super admins have tenant_scope = null and see everything. Regular users are bound to one tenant, with optional cross-tenant read in custom roles.
flowchart TB
    subgraph Super-Admin
        SA[user.tenant_scope = null] --> ALLES[(all tenants)]
    end
    subgraph Tenant A
        UA[user.tenant_id = A] --> A[(Hosts/Alerts A)]
    end
    subgraph Tenant B
        UB[user.tenant_id = B] --> B[(Hosts/Alerts B)]
    end
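The fail-closed rule described in this section can be sketched as follows. Only the name apply_tenant_filter() comes from the text; its signature here, the `User` dataclass, the `is_super_admin` flag (standing in for tenant_scope = null), and filtering over plain dictionaries instead of ORM queries are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


class MissingTenantScope(RuntimeError):
    """Raised when a query is attempted without any tenant scope."""


@dataclass
class User:
    name: str
    tenant_id: Optional[str] = None
    is_super_admin: bool = False  # hypothetical stand-in for tenant_scope = null


def apply_tenant_filter(rows: list, user: Optional[User]) -> list:
    """Return only the rows the user is allowed to see."""
    if user is None:
        raise MissingTenantScope("query issued without a tenant scope")
    if user.is_super_admin:
        return list(rows)  # super admins see every tenant
    if user.tenant_id is None:
        raise MissingTenantScope(f"user {user.name!r} has no tenant")
    return [row for row in rows if row["tenant_id"] == user.tenant_id]
```

Raising instead of returning an unfiltered result means a forgotten scope shows up as an error in testing rather than as a cross-tenant leak in production.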
Security architecture¶
Four pillars:
1 — Encryption of sensitive fields¶
shared/encryption.py provides encrypt_field() / decrypt_field() (AES-256-GCM). Encrypted: SNMP communities, SSH passwords, etc. Key: FIELD_ENCRYPTION_KEY (Base64url, 32 bytes). The server holds plaintext only briefly in RAM.
Details: Security → Encryption.
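As a sketch of the key format only (not of encrypt_field() itself), the following shows how a 32-byte Base64url key might be generated and validated using the Python standard library. The helper names `generate_field_key` and `load_field_key` are hypothetical.

```python
import base64
import secrets


def generate_field_key() -> str:
    """Create a random 32-byte key, encoded as Base64url text."""
    return base64.urlsafe_b64encode(secrets.token_bytes(32)).decode("ascii")


def load_field_key(encoded: str) -> bytes:
    """Decode a Base64url key and insist on the AES-256 key length."""
    padded = encoded + "=" * (-len(encoded) % 4)  # tolerate stripped padding
    key = base64.urlsafe_b64decode(padded)
    if len(key) != 32:
        raise ValueError(f"expected 32 key bytes, got {len(key)}")
    return key
```

Checking the decoded length at startup catches a truncated or mistyped FIELD_ENCRYPTION_KEY before any field is encrypted with it.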
2 — Token-based authentication¶
| Token | Format | Storage | Used by |
|---|---|---|---|
| User JWT | RS256 | Browser/mobile local | End-user login |
| Agent token | vesana_agent_ + 32 url-safe base64 | SHA256 hash in agent_tokens.token_hash | Agent → receiver |
| API key | Custom prefix + 32 bytes | SHA256 hash in api_keys.key_hash | Collector → receiver |
Plaintext is never in the DB — only hashes. Keys are shown exactly once (at creation).
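The hash-only storage scheme might look roughly like this. The sketch assumes the random part is 32 random bytes rendered as URL-safe base64 and that the hash is stored as a hex digest; the function names are hypothetical.

```python
import hashlib
import hmac
import secrets

AGENT_PREFIX = "vesana_agent_"


def issue_agent_token() -> tuple:
    """Return (plaintext, sha256_hex); only the hash would be stored."""
    # Assumption: 32 random bytes, rendered as URL-safe base64 text.
    plaintext = AGENT_PREFIX + secrets.token_urlsafe(32)
    return plaintext, hashlib.sha256(plaintext.encode()).hexdigest()


def verify_agent_token(presented: str, stored_hash: str) -> bool:
    """Hash the presented token and compare in constant time."""
    if not presented.startswith(AGENT_PREFIX):
        return False
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(candidate, stored_hash)
```

The plaintext is returned to the caller once at creation; after that, only the digest is available for comparison.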
3 — Rate limiting¶
Login and 2FA-verify endpoints capped at 10 req/min per IP (slowapi). Goal: slow brute force against weak passwords and 2FA codes.
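To make the 10 req/min cap concrete, here is a minimal sliding-window limiter keyed by client IP. It is a standard-library stand-in and does not reproduce slowapi's own API or algorithm; the class name and the explicit `now` parameter are assumptions chosen to keep the example deterministic.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit: int = 10, window: float = 60.0) -> None:
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Forget requests that have left the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

At 10 attempts per minute, exhausting the one million possible 6-digit 2FA codes from a single IP would take roughly 69 days.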
4 — Distributed locking¶
When several API replicas are running, downtime_expiry_watcher, dead_collector_watcher, and the other watchers still run only once; Redis locks with a 55 s timeout ensure that.
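A minimal sketch of the locking idea, assuming the lock is taken with the equivalent of Redis `SET key value NX EX 55`. `FakeRedis` is an in-memory stand-in so the example runs without a server; the lock key name and `run_watcher_once` are hypothetical.

```python
class FakeRedis:
    """Tiny in-memory stand-in for the one Redis command the sketch needs."""

    def __init__(self) -> None:
        self._data = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key: str, value: str, ttl: float, now: float) -> bool:
        # Behaves like `SET key value NX EX ttl`: succeed only if the key
        # is absent or its previous TTL has run out.
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False
        self._data[key] = (value, now + ttl)
        return True


def run_watcher_once(redis: FakeRedis, replica: str, now: float) -> bool:
    """Run the watcher body only on the replica that wins the lock."""
    if not redis.set_nx_ex("lock:downtime_expiry_watcher", replica, 55.0, now):
        return False  # another replica holds the lock
    # ... the watcher's actual work would run here ...
    return True
```

The timeout means that if the replica holding the lock crashes, another replica can take over once the 55 s have elapsed.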
Subsystems¶
Profile + checks¶
Two-tier model. Hosts have a profile (e.g. "APC Smart UPS"). Profiles have profile-checks (e.g. "Battery voltage"). host_services are the per-host instances, with optional overrides.
Details: Profiles & checks.
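How a per-host override could combine with a profile-check default is sketched below. The field names (`interval_s`, `warn_below`) and the rule that `None` means "inherit from the profile" are assumptions for illustration.

```python
from typing import Optional


def effective_config(profile_check: dict, overrides: Optional[dict]) -> dict:
    """Overlay per-host overrides on the profile-check defaults."""
    merged = dict(profile_check)
    for key, value in (overrides or {}).items():
        if value is not None:  # None means "inherit from the profile"
            merged[key] = value
    return merged


# Hypothetical profile-check and host_services override.
battery_voltage = {"name": "Battery voltage", "interval_s": 60, "warn_below": 24.0}
host_override = {"warn_below": 26.5, "interval_s": None}
print(effective_config(battery_voltage, host_override))
```

Here the host keeps the profile's 60 s interval and changes only the warning threshold.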
Policies (v2)¶
Declarative rule system for "on these hosts, apply this config". JsonLogic subset, form builder, AI generator. Bulk configuration without SSH grunt work.
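To show what a JsonLogic-style host selector might look like, here is a small evaluator for a handful of operators (`var`, `==`, `in`, `!`, `and`, `or`). Which operators Vesana's subset actually supports is not stated above, so this selection is an assumption. The sketch also simplifies JsonLogic semantics: `==` is strict Python equality, and `and`/`or` return booleans.

```python
def evaluate(rule, data):
    """Evaluate a small JsonLogic-style subset against a host record."""
    if not isinstance(rule, dict):
        return rule  # literals evaluate to themselves
    op, args = next(iter(rule.items()))
    if not isinstance(args, list):
        args = [args]
    if op == "var":
        # Resolve a dotted path such as "tags.prod" inside the data.
        value = data
        for part in str(args[0]).split("."):
            value = value.get(part) if isinstance(value, dict) else None
        return value
    values = [evaluate(arg, data) for arg in args]
    if op == "==":
        return values[0] == values[1]
    if op == "in":
        return values[0] in values[1]
    if op == "!":
        return not values[0]
    if op == "and":
        return all(values)
    if op == "or":
        return any(values)
    raise ValueError(f"unsupported operator: {op}")


# Hypothetical policy: APC UPS hosts at two sites.
rule = {"and": [
    {"==": [{"var": "profile"}, "APC Smart UPS"]},
    {"in": [{"var": "site"}, ["berlin", "munich"]]},
]}
host = {"profile": "APC Smart UPS", "site": "berlin"}
print(evaluate(rule, host))
```

Because the rule is plain JSON, the same structure can be produced by a form builder, generated by an AI assistant, or written by hand.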
Wiki + AI¶
Built-in knowledge base (Markdown, FTS, pgvector). The AI queries the wiki first via RAG, falls back to web search, and marks its sources.
Auto-discovery¶
Collector scans the network with nmap. SNMP sysOID matches profiles. On match, profile is suggested.
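One plausible way to match a discovered sysOID against profiles is a longest-prefix lookup, sketched below. The mapping is illustrative: the enterprise prefixes for APC (318) and Cisco (9) are real, but the profile names attached to them and the matching strategy itself are assumptions.

```python
from typing import Optional

# Illustrative mapping from sysObjectID prefixes to profile names.
PROFILE_PREFIXES = {
    "1.3.6.1.4.1.318": "APC Smart UPS",
    "1.3.6.1.4.1.9": "Generic Cisco",
    "1.3.6.1.4.1.9.1": "Cisco device",
}


def suggest_profile(sys_oid: str) -> Optional[str]:
    """Return the profile whose OID prefix is the longest match."""
    oid = sys_oid.strip(".")
    best = None
    for prefix, profile in PROFILE_PREFIXES.items():
        # Match on whole arcs so "...3181" never matches the "...318" prefix.
        if oid == prefix or oid.startswith(prefix + "."):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, profile)
    return best[1] if best else None
```

A return value of `None` corresponds to "no suggestion", which leaves the profile choice to the operator.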
NSCA receiver¶
Optional receiver on port 5667. Accepts packets from send_nsca clients. Migration path for Nagios estates.
Performance model¶
With the self-hosting defaults (1 worker, 256 MB shared_buffers): ~960 checks/s sustained, p95 ≤ 150 ms, 0 errors over a 30 min soak test. Scaling: Administration → Scaling.
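As a rough back-of-the-envelope reading of that throughput figure, the helper below converts a sustained rate into the number of services that fit when every service reports once per interval. The 60 s and 300 s intervals are assumptions, and the calculation leaves no headroom for bursts.

```python
def max_services(checks_per_second: float, interval_seconds: float) -> int:
    """Services that fit if each one reports exactly once per interval."""
    return int(checks_per_second * interval_seconds)


print(max_services(960, 60))   # 57600 services at a 60 s interval
print(max_services(960, 300))  # 288000 services at a 5 min interval
```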