Architecture¶

Overview¶

Vesana is push-based: agents and collectors connect outbound to the server and send check results in. The server has no outbound connection to monitored machines.

flowchart LR
    subgraph "Customer network"
        A[Agent on servers] -->|HTTPS POST| EXIT[outbound 443]
        C[Collector VM] -->|HTTPS POST| EXIT
    end

    subgraph "Vesana server"
        EXIT --> R[Receiver]
        R --> RS[(Redis stream)]
        RS --> W1[Worker 0]
        RS --> W2[Worker 1]
        RS --> W3[Worker N]
        W1 --> DB[(Postgres + TimescaleDB)]
        W2 --> DB
        W3 --> DB
        DB --> API[REST API]
        API --> FE[React frontend]
        API --> M[Mobile app]
    end

Components¶

Receiver¶

FastAPI service receiving agent and collector packets
Authenticates via X-API-Key (collector) or X-Agent-Token (agent)
Validates schema, writes immediately to Redis stream — no logic, no DB write
Goal: lowest possible latency, highest throughput

Redis stream¶

Backpressure-capable ingress queue (XADD / XREADGROUP)
Multiple workers consume in parallel
When the stream is full, the receiver rejects new packets (noeviction policy) instead of dropping them silently
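The reject-when-full behaviour can be sketched without a Redis server. The following is a minimal in-memory stand-in, not the Vesana receiver: `BoundedStream`, `StreamFull`, `receive`, and the 202/503 status codes are all assumptions made for illustration.

```python
from collections import deque


class StreamFull(Exception):
    """Raised when the ingress queue cannot accept another message."""


class BoundedStream:
    """In-memory stand-in for the Redis stream (hypothetical, illustration only)."""

    def __init__(self, max_len: int) -> None:
        self.max_len = max_len
        self._entries = deque()

    def add(self, message: dict) -> int:
        # Refuse the write instead of evicting older entries: no silent drops.
        if len(self._entries) >= self.max_len:
            raise StreamFull("ingress queue is full")
        self._entries.append(message)
        return len(self._entries)

    def read(self, count: int) -> list:
        # A worker takes up to `count` messages, which frees capacity again.
        batch = []
        while self._entries and len(batch) < count:
            batch.append(self._entries.popleft())
        return batch


def receive(stream: BoundedStream, packet: dict) -> int:
    """Return an HTTP-style status code (202 and 503 are assumed values)."""
    try:
        stream.add(packet)
    except StreamFull:
        return 503  # the sender learns about the overload and can retry
    return 202
```

The point of the design is that overload becomes visible to the sender: a rejected packet can be retried, whereas an evicted one would simply vanish.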

Worker¶

Reads messages, fetches host/service context from DB
Applies profile-check effective config, normalizes values
Writes check results to check_results (hypertable)
Updates current_status (hot table with fillfactor=80)
Triggers alert evaluation, notification dispatch, AI analysis cache invalidation

API¶

FastAPI with JWT auth, automatic tenant scope via ORM filter
Endpoints: hosts, services, profiles, discoveries, alerts, reports, wiki, AI, admin
Background tasks: downtime watcher, dead-collector watcher, anomaly baselines, auto-purge, tester phone-home
Distributed locking via Redis — with multiple API replicas, each watcher runs only once

Frontend¶

React 18 + TypeScript + Vite
Themed via CSS variables (20 themes × dark/light)
Lazy-loaded ECharts, lazy-loaded ReactMarkdown for wiki

Agent (Go)¶

Single binary, statically linked (CGO_ENABLED=0), ~6.5 MB
Fetches config every 5 minutes, runs checks locally
Auto-update on config refresh when server reports a newer version

Collector (Go)¶

Single binary, runs on a Linux VM in customer network
Runs remote checks: SNMP, ping, SSH, HTTP, discovery (nmap)
Fetches config every 60 seconds, sends results + discovery output to server

Mobile app¶

React Native + Expo, Android APK
Own API client, push token registration via FCM
Tap on push → deep link to host detail

Multi-tenant isolation¶

The tenant is the central isolation boundary. Every DB table with customer data has a tenant_id column. The ORM-level apply_tenant_filter() (api/app/auth.py) enforces filtering: a query without tenant scope raises a runtime error.

Super admins have tenant_scope = null and see everything. Regular users are bound to one tenant, with optional cross-tenant read in custom roles.
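The scoping rules above can be illustrated on plain dictionaries. This is a sketch under stated assumptions, not the code in api/app/auth.py: the `User` class, its `is_super_admin` flag, and the `visible_rows` helper are hypothetical, and custom roles with cross-tenant read are left out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class User:
    """Hypothetical user model; super admins carry no tenant binding."""
    name: str
    tenant_id: Optional[str] = None
    is_super_admin: bool = False


def visible_rows(user: Optional[User], rows: Iterable[dict]) -> List[dict]:
    # A query with no user context has no tenant scope and must fail loudly.
    if user is None:
        raise RuntimeError("query issued without tenant scope")
    # Super admins see every tenant.
    if user.is_super_admin:
        return list(rows)
    # A regular user without a tenant is a configuration error, not "see all".
    if user.tenant_id is None:
        raise RuntimeError("regular user has no tenant binding")
    return [row for row in rows if row["tenant_id"] == user.tenant_id]
```

Failing on a missing scope, rather than defaulting to an empty or unfiltered result, turns a forgotten filter into an immediate error instead of a data leak.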

flowchart TB
    subgraph Super-Admin
        SA[user.tenant_scope = null] --> ALL[(all tenants)]
    end
    subgraph Tenant A
        UA[user.tenant_id = A] --> A[(Hosts/Alerts A)]
    end
    subgraph Tenant B
        UB[user.tenant_id = B] --> B[(Hosts/Alerts B)]
    end

Security architecture¶

Four pillars:

1 — Encryption of sensitive fields¶

shared/encryption.py provides encrypt_field() / decrypt_field() (AES-256-GCM). Encrypted fields include SNMP communities and SSH passwords. The key is read from FIELD_ENCRYPTION_KEY (Base64url-encoded, 32 bytes). The server holds plaintext only briefly in RAM.

Details: Security → Encryption.
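A key of the stated shape (32 random bytes, Base64url-encoded) can be generated and checked with the standard library alone. The helper names `generate_key` and `load_key` are hypothetical, and whether Vesana expects the trailing `=` padding is not stated, so this sketch accepts both forms.

```python
import base64
import os


def generate_key() -> str:
    """Return 32 random bytes as a Base64url string (AES-256 key size)."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")


def load_key(encoded: str) -> bytes:
    """Decode a Base64url key and insist on exactly 32 bytes."""
    # Restore padding, which is often stripped from Base64url values.
    padded = encoded + "=" * (-len(encoded) % 4)
    key = base64.urlsafe_b64decode(padded)
    if len(key) != 32:
        raise ValueError(f"expected 32 key bytes, got {len(key)}")
    return key
```

Validating the length at startup catches a truncated or mis-pasted key before any field is encrypted with it.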

2 — Token-based authentication¶

| Token | Format | Storage | Used by |
|---|---|---|---|
| User JWT | RS256 | Browser/mobile local | End-user login |
| Agent token | `vesana_agent_` + 32 url-safe base64 | SHA256 hash in `agent_tokens.token_hash` | Agent → receiver |
| API key | Custom prefix + 32 bytes | SHA256 hash in `api_keys.key_hash` | Collector → receiver |

Agent tokens and API keys are never stored in plaintext in the DB, only as hashes. Each one is shown exactly once, at creation.
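The hash-only storage scheme can be sketched as follows. The function names are hypothetical, and two details are assumptions: that "32 url-safe base64" means 32 characters (24 random bytes), and that the hash is stored hex-encoded.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

AGENT_PREFIX = "vesana_agent_"


def issue_agent_token() -> Tuple[str, str]:
    """Return (plaintext, hash). Only the hash would be persisted."""
    # 24 random bytes encode to exactly 32 URL-safe Base64 characters.
    plaintext = AGENT_PREFIX + secrets.token_urlsafe(24)
    token_hash = hashlib.sha256(plaintext.encode("ascii")).hexdigest()
    return plaintext, token_hash


def verify_agent_token(presented: str, stored_hash: str) -> bool:
    """Hash the presented token and compare in constant time."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_hash)
```

Because the token is high-entropy random data, a plain SHA-256 is a reasonable choice here. Slow password hashes exist to protect low-entropy secrets, which these tokens are not.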

3 — Rate limiting¶

Login and 2FA-verify endpoints capped at 10 req/min per IP (slowapi). Goal: slow brute force against weak passwords and 2FA codes.
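To make the 10 req/min per IP cap concrete, here is a minimal in-memory sliding-window limiter. It is a stand-in for illustration, not slowapi's implementation or API; the class name and the injectable `now` parameter are assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window_s` seconds."""

    def __init__(self, limit: int = 10, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # client IP -> request timestamps

    def allow(self, client_ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_ip]
        # Forget requests that have left the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the cap: the caller would answer HTTP 429
        hits.append(now)
        return True
```

At this rate an attacker gets at most 14,400 guesses per day from a single IP, which is the sense in which the limit slows brute force rather than preventing it.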

4 — Distributed locking¶

When multiple API replicas are running, downtime_expiry_watcher, dead_collector_watcher, and the other background watchers must each run on only one replica at a time. Redis locks with a 55 s timeout enforce this.
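A common way to build such a lock is Redis `SET key value NX EX ttl`, which succeeds only when the key does not exist yet. The sketch below assumes that pattern. `FakeRedis` is a tiny in-memory stand-in so the example runs without a server, and the `lock:` key prefix and `try_run_watcher` name are hypothetical. The `set(name, value, nx=..., ex=...)` call mirrors the redis-py client method of the same shape.

```python
import time
from typing import Dict, Optional, Tuple


class FakeRedis:
    """Minimal in-memory stand-in for a Redis client (illustration only)."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}

    def set(self, name: str, value: str, nx: bool = False,
            ex: Optional[int] = None) -> Optional[bool]:
        now = time.monotonic()
        current = self._data.get(name)
        if current is not None and current[1] <= now:
            current = None  # an expired key behaves as if it were absent
        if nx and current is not None:
            return None  # NX: refuse to overwrite an existing key
        expires = now + ex if ex is not None else float("inf")
        self._data[name] = (value, expires)
        return True


def try_run_watcher(client, watcher_name: str, replica_id: str,
                    ttl_s: int = 55) -> bool:
    """Return True if this replica won the lock and should run the watcher."""
    acquired = client.set(f"lock:{watcher_name}", replica_id, nx=True, ex=ttl_s)
    return bool(acquired)
```

The expiry is what keeps a crashed replica from holding the lock forever: if the holder dies, the key lapses and another replica can take over on a later attempt.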

Subsystems¶

Profile + checks¶

Two-tier model. Hosts have a profile (e.g. "APC Smart UPS"). Profiles have profile-checks (e.g. "Battery voltage"). host_services are the per-host instances of those checks, with optional overrides.

Details: Profiles & checks.
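The override step can be pictured as a merge of the profile-check defaults with the per-host values. This is a sketch under assumptions: the function name, the field names in the usage example, and the rule that `None` means "inherit" are all hypothetical.

```python
from typing import Any, Dict, Optional


def effective_config(profile_check: Dict[str, Any],
                     overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge per-host overrides over the profile-check defaults."""
    merged = dict(profile_check)  # copy, so the shared defaults stay untouched
    for key, value in (overrides or {}).items():
        if value is not None:  # None is treated as "inherit from the profile-check"
            merged[key] = value
    return merged


# Hypothetical defaults for a "Battery voltage" profile-check.
battery_voltage = {"warn": 26.0, "crit": 24.0, "interval_s": 60}

# One host tightens the warning threshold and inherits everything else.
host_config = effective_config(battery_voltage, {"warn": 25.5, "crit": None})
```

Keeping overrides sparse means a later change to the profile-check still reaches every host that did not override that particular field.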

Policies (v2)¶

Declarative rule system for "on these hosts, apply this config". Rules use a JsonLogic subset and can be written with a form builder or an AI generator. This allows bulk configuration without SSH grunt work.
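To give a feel for JsonLogic-style host matching, here is a small evaluator for a handful of operators (`var`, `==`, `!`, `and`, `or`, `in`). Which operators Vesana's subset actually supports is not stated, so this selection is an assumption. It also simplifies the JsonLogic semantics: `==` uses Python equality, and `and`/`or` evaluate every argument and return a plain boolean.

```python
from typing import Any, Dict


def evaluate(rule: Any, data: Dict[str, Any]) -> Any:
    """Evaluate a JsonLogic-style rule against a host's attributes."""
    # Lists are evaluated element-wise; other literals evaluate to themselves.
    if isinstance(rule, list):
        return [evaluate(item, data) for item in rule]
    if not isinstance(rule, dict):
        return rule
    if len(rule) != 1:
        raise ValueError("a rule object must have exactly one operator")
    (op, raw_args), = rule.items()
    if op == "var":
        # {"var": "a.b"} walks nested keys; a missing key yields None.
        name = raw_args[0] if isinstance(raw_args, list) else raw_args
        value: Any = data
        for part in str(name).split("."):
            value = value.get(part) if isinstance(value, dict) else None
        return value
    args = raw_args if isinstance(raw_args, list) else [raw_args]
    values = [evaluate(arg, data) for arg in args]
    if op == "==":
        return values[0] == values[1]
    if op == "!":
        return not values[0]
    if op == "and":
        return all(values)
    if op == "or":
        return any(values)
    if op == "in":
        return values[0] in values[1]
    raise ValueError(f"unsupported operator: {op}")
```

A rule such as `{"and": [{"==": [{"var": "os"}, "linux"]}, {"in": [{"var": "tags.site"}, ["berlin", "munich"]]}]}` would then select Linux hosts at either site. Rejecting unknown operators outright keeps the subset explicit.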

Wiki + AI¶

Built-in knowledge base (Markdown, full-text search, pgvector). The AI queries the wiki first via RAG, falls back to web search, and marks its sources.

Auto-discovery¶

The collector scans the network with nmap. The SNMP sysObjectID (sysOID) of each discovered device is matched against the profiles. On a match, the profile is suggested.
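One plausible matching strategy is longest-prefix matching on the OID arcs, sketched below. The strategy itself, the function name, and the prefix table are assumptions. Only the enterprise number 318 for APC is a registered value; the deeper arcs are illustrative.

```python
from typing import Dict, Optional


def suggest_profile(sys_object_id: str,
                    profile_prefixes: Dict[str, str]) -> Optional[str]:
    """Return the profile whose OID prefix matches most specifically, if any."""
    oid = sys_object_id.strip(".").split(".")
    best_name, best_len = None, 0
    for prefix, profile_name in profile_prefixes.items():
        parts = prefix.strip(".").split(".")
        # Compare whole arcs so that "...4.1.3" does not match "...4.1.318".
        if oid[:len(parts)] == parts and len(parts) > best_len:
            best_name, best_len = profile_name, len(parts)
    return best_name


# Hypothetical prefix table: a vendor-wide fallback plus a specific model line.
PREFIXES = {
    "1.3.6.1.4.1.318": "APC generic",
    "1.3.6.1.4.1.318.1.3.27": "APC Smart UPS",
}
```

Comparing arc by arc instead of using a string prefix test avoids false matches between enterprise numbers that merely share leading digits.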

NSCA receiver¶

Optional receiver on port 5667. Accepts packets from send_nsca clients. Migration path for Nagios estates.

Performance model¶

On self-hosting defaults (1 worker, 256 MB shared_buffers): ~960 checks/s sustained, p95 ≤ 150 ms, 0 errors over a 30 min soak test. Scaling: Administration → Scaling.
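The throughput figure translates into a rough service count once a check interval is chosen. The interval and the headroom factor below are assumptions for the sake of the arithmetic, not measured values.

```python
def max_services(checks_per_second: float, check_interval_s: float,
                 headroom: float = 1.0) -> int:
    """Estimate how many services a given ingest rate can sustain."""
    # Each service reports once per interval, so rate * interval bounds the
    # service count. `headroom` below 1.0 reserves capacity for bursts.
    return int(checks_per_second * headroom * check_interval_s)
```

With the ~960 checks/s above and a 60 s interval, `max_services(960, 60)` gives 57,600 services. Reserving half the capacity with `headroom=0.5` lowers that to 28,800.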