
Architecture

Overview

Vesana is push-based: agents and collectors connect outbound to the server and send check results in. The server has no outbound connection to monitored machines.

flowchart LR
    subgraph "Customer network"
        A[Agent on servers] -->|HTTPS POST| EXIT[outbound 443]
        C[Collector VM] -->|HTTPS POST| EXIT
    end

    subgraph "Vesana server"
        EXIT --> R[Receiver]
        R --> RS[(Redis stream)]
        RS --> W1[Worker 0]
        RS --> W2[Worker 1]
        RS --> W3[Worker N]
        W1 --> DB[(Postgres + TimescaleDB)]
        W2 --> DB
        W3 --> DB
        DB --> API[REST API]
        API --> FE[React frontend]
        API --> M[Mobile app]
    end

Components

Receiver

  • FastAPI service receiving agent and collector packets
  • Authenticates via X-API-Key (collector) or X-Agent-Token (agent)
  • Validates schema, writes immediately to Redis stream — no logic, no DB write
  • Goal: lowest possible latency, highest throughput

Redis stream

  • Backpressure-capable ingress queue (XADD / XREADGROUP)
  • Multiple workers consume in parallel
  • When the stream is full, the receiver rejects new packets (Redis noeviction policy); nothing is dropped silently
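The receiver-to-stream handoff described above can be sketched roughly as follows. This is an in-memory stand-in and not Vesana's code: `BoundedStream`, `StreamFull`, `handle_packet`, the required packet fields, and the 401/422/503/202 status codes are all assumptions made for illustration. The real receiver is a FastAPI service writing to Redis via XADD.

```python
from collections import deque


class StreamFull(Exception):
    """Raised when the ingress queue cannot accept more entries."""


class BoundedStream:
    """Hypothetical in-memory stand-in for the Redis stream."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self.entries = deque()

    def add(self, fields: dict) -> int:
        # Refuse the write instead of evicting older entries.
        if len(self.entries) >= self.max_entries:
            raise StreamFull("stream is full")
        self.entries.append(fields)
        return len(self.entries)


REQUIRED_FIELDS = {"host", "service", "value"}  # assumed packet schema


def handle_packet(headers: dict, packet: dict, stream: BoundedStream) -> int:
    """Return an HTTP-like status code for one incoming packet."""
    # 1. Authenticate: either header is accepted (token lookup omitted here).
    if not (headers.get("X-API-Key") or headers.get("X-Agent-Token")):
        return 401
    # 2. Validate the schema only; no business logic and no DB access.
    if not REQUIRED_FIELDS.issubset(packet):
        return 422
    # 3. Enqueue; a full stream is reported to the sender, never hidden.
    try:
        stream.add(packet)
    except StreamFull:
        return 503
    return 202
```

The point of the design is that the sender always learns whether its packet was accepted, so it can retry later instead of losing results unnoticed.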

Worker

  • Reads messages, fetches host/service context from DB
  • Applies profile-check effective config, normalizes values
  • Writes check results to check_results (hypertable)
  • Updates current_status (hot table with fillfactor=80)
  • Triggers alert evaluation, notification dispatch, AI analysis cache invalidation

API

  • FastAPI with JWT auth, automatic tenant scope via ORM filter
  • Endpoints: hosts, services, profiles, discoveries, alerts, reports, wiki, AI, admin
  • Background tasks: downtime watcher, dead-collector watcher, anomaly baselines, auto-purge, tester phone-home
  • Distributed locking via Redis — with multiple API replicas, each watcher runs only once

Frontend

  • React 18 + TypeScript + Vite
  • Themed via CSS variables (20 themes × dark/light)
  • Lazy-loaded ECharts, lazy-loaded ReactMarkdown for wiki

Agent (Go)

  • Single binary, statically linked (CGO_ENABLED=0), ~6.5 MB
  • Fetches config every 5 minutes, runs checks locally
  • Auto-update on config refresh when server reports a newer version

Collector (Go)

  • Single binary, runs on a Linux VM in customer network
  • Runs remote checks: SNMP, ping, SSH, HTTP, discovery (nmap)
  • Fetches config every 60 seconds, sends results + discovery output to server

Mobile app

  • React Native + Expo, Android APK
  • Own API client, push token registration via FCM
  • Tap on push → deep link to host detail

Multi-tenant isolation

Tenants are the central separator. Every DB table with customer data has a tenant_id column. The ORM-level apply_tenant_filter() (api/app/auth.py) enforces filtering — a query without tenant scope raises a runtime error.

Super admins have tenant_scope = null and see everything. Regular users are bound to one tenant, with optional cross-tenant read in custom roles.
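The scoping rule can be illustrated with a small sketch. It operates on plain dictionaries rather than ORM queries, and `User`, `is_super_admin`, `MissingTenantScope`, and `filter_rows_by_tenant` are hypothetical names; the real `apply_tenant_filter()` in api/app/auth.py may look quite different.

```python
from dataclasses import dataclass
from typing import Optional


class MissingTenantScope(RuntimeError):
    """Raised when a query is attempted without any tenant scope."""


@dataclass
class User:
    is_super_admin: bool = False  # assumed flag standing in for tenant_scope = null
    tenant_id: Optional[str] = None


def filter_rows_by_tenant(rows: list, user: Optional[User]) -> list:
    """Return only the rows the user may see; fail loudly without a scope."""
    if user is None:
        raise MissingTenantScope("query issued without tenant scope")
    if user.is_super_admin:
        return list(rows)  # super admins see every tenant
    if user.tenant_id is None:
        raise MissingTenantScope("regular user has no tenant_id")
    return [row for row in rows if row["tenant_id"] == user.tenant_id]
```

Raising an error instead of returning an unfiltered result means a forgotten scope shows up as a crash in testing rather than as a data leak in production.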

flowchart TB
    subgraph Super-Admin
        SA[user.tenant_scope = null] --> ALLES[(all tenants)]
    end
    subgraph Tenant A
        UA[user.tenant_id = A] --> A[(Hosts/Alerts A)]
    end
    subgraph Tenant B
        UB[user.tenant_id = B] --> B[(Hosts/Alerts B)]
    end

Security architecture

Four pillars:

1 — Encryption of sensitive fields

shared/encryption.py provides encrypt_field() / decrypt_field() (AES-256-GCM). Encrypted: SNMP communities, SSH passwords, etc. Key: FIELD_ENCRYPTION_KEY (Base64url, 32 bytes). The server holds plaintext only briefly in RAM.

Details: Security → Encryption.
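As an illustration of the key format only (Base64url text that decodes to 32 bytes), a key could be generated and checked as below. The helper names `generate_field_key` and `load_field_key` are hypothetical and are not part of shared/encryption.py; the AES-256-GCM encryption itself is not shown.

```python
import base64
import os


def generate_field_key() -> str:
    """Return 32 random bytes encoded as Base64url text."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")


def load_field_key(encoded: str) -> bytes:
    """Decode a Base64url key and insist on exactly 32 bytes (AES-256)."""
    padded = encoded + "=" * (-len(encoded) % 4)  # tolerate stripped padding
    key = base64.urlsafe_b64decode(padded)
    if len(key) != 32:
        raise ValueError(f"expected 32 key bytes, got {len(key)}")
    return key
```

Checking the length at startup catches a truncated or mis-pasted key before any field is encrypted with it.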

2 — Token-based authentication

| Token | Format | Storage | Used by |
|---|---|---|---|
| User JWT | RS256 | Browser/mobile local | End-user login |
| Agent token | vesana_agent_ + 32 url-safe base64 | SHA256 hash in agent_tokens.token_hash | Agent → receiver |
| API key | Custom prefix + 32 bytes | SHA256 hash in api_keys.key_hash | Collector → receiver |

Plaintext is never in the DB — only hashes. Keys are shown exactly once (at creation).
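The hash-only storage scheme can be sketched with the standard library. The function names are hypothetical, and the code assumes that "32 url-safe base64" means 32 characters and that the hash is stored as hex; neither is confirmed by this page.

```python
import hashlib
import hmac
import secrets

AGENT_PREFIX = "vesana_agent_"


def issue_agent_token() -> tuple:
    """Return (plaintext, sha256_hex); only the hash would be stored."""
    # token_urlsafe(24) yields 32 URL-safe characters (assumed length).
    plaintext = AGENT_PREFIX + secrets.token_urlsafe(24)
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    return plaintext, digest


def verify_agent_token(presented: str, stored_hash: str) -> bool:
    """Hash the presented token and compare in constant time."""
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(candidate, stored_hash)
```

The plaintext is returned to the caller once and then discarded, which matches the "shown exactly once" behavior: a lost token cannot be recovered from the database, only replaced.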

3 — Rate limiting

Login and 2FA-verify endpoints capped at 10 req/min per IP (slowapi). Goal: slow brute force against weak passwords and 2FA codes.
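To make the limit concrete, here is a hand-rolled sliding-window limiter. It is not slowapi's API and not necessarily the algorithm slowapi uses; `SlidingWindowLimiter` and its parameters are illustrative only.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds per client IP."""

    def __init__(self, limit: int = 10, window_s: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str) -> bool:
        now = self.clock()
        timestamps = self.hits[ip]
        # Forget requests that have left the window.
        while timestamps and now - timestamps[0] >= self.window_s:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True
```

At 10 attempts per minute, an attacker guessing six-digit 2FA codes from a single IP would need on the order of weeks to cover the code space, which is the kind of slowdown the limit aims for.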

4 — Distributed locking

With multiple API replicas, watchers such as downtime_expiry_watcher and dead_collector_watcher still run only once: each takes a Redis lock with a 55 s timeout before doing its work.
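A common way to build such a lock is Redis `SET` with the `NX` and `EX` options. The sketch below assumes a redis-py style client whose `set(name, value, nx=True, ex=...)` returns a truthy value only when the key was created; `try_acquire`, `run_once`, and the lock name format are hypothetical, and letting the lock expire instead of releasing it is an assumption about the design.

```python
import uuid
from typing import Optional


def try_acquire(client, name: str, ttl_s: int = 55) -> Optional[str]:
    """Return a lock token on success, None if another replica holds the lock."""
    token = uuid.uuid4().hex
    # SET key value NX EX ttl: create only if absent, expire automatically.
    if client.set(name, token, nx=True, ex=ttl_s):
        return token
    return None


def run_once(client, name: str, task) -> bool:
    """Run task only if this replica wins the lock; report whether it ran."""
    if try_acquire(client, name) is None:
        return False
    task()
    # The lock is not released here; it simply expires after ttl_s seconds.
    return True
```

The expiry matters for crash safety: if the replica holding the lock dies mid-task, the key disappears on its own and another replica can take over on the next run.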

Subsystems

Profile + checks

Two-tier model. Hosts have a profile (e.g. "APC Smart UPS"). Profiles have profile-checks (e.g. "Battery voltage"). host_services are instances per host with optional overrides.

Details: Profiles & checks.
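The override mechanism can be pictured as a shallow merge of profile-check defaults with per-host values. The field names (`warn`, `crit`, `interval_s`) and the rule that `None` means "inherit" are assumptions for this sketch, not the documented schema.

```python
from typing import Optional


def effective_config(profile_check: dict, overrides: Optional[dict] = None) -> dict:
    """Start from the profile-check defaults, then apply host-level overrides."""
    merged = dict(profile_check)  # copy so the shared defaults stay untouched
    for key, value in (overrides or {}).items():
        if value is not None:  # None means "inherit from the profile-check"
            merged[key] = value
    return merged
```

For example, a host whose battery runs slightly hot could raise only the warning threshold and inherit every other setting from the profile-check.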

Policies (v2)

Declarative rule system for "apply this config on these hosts". Rules use a JsonLogic subset and can be written with a form builder or an AI generator. This enables bulk configuration without SSH grunt work.
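To show what a JsonLogic-style host selector looks like, here is a tiny evaluator. Which operators Vesana's subset actually supports is not stated here, so the set below (`var`, `==`, `in`, `and`, `or`, `!`) is an assumption, and `and`/`or` return plain booleans, which simplifies full JsonLogic semantics.

```python
def evaluate(rule, data: dict):
    """Evaluate a small JsonLogic-like rule against one host's attributes."""
    if not isinstance(rule, dict):
        return rule  # literals (strings, numbers, lists) evaluate to themselves
    if len(rule) != 1:
        raise ValueError("a rule must have exactly one operator")
    op, args = next(iter(rule.items()))
    if op == "var":
        return data.get(args)  # look up a host attribute by name
    if not isinstance(args, list):
        args = [args]  # unary shorthand, e.g. {"!": {...}}
    values = [evaluate(arg, data) for arg in args]
    if op == "==":
        return values[0] == values[1]
    if op == "in":
        return values[0] in values[1]
    if op == "and":
        return all(values)
    if op == "or":
        return any(values)
    if op == "!":
        return not values[0]
    raise ValueError(f"unsupported operator: {op}")
```

A rule such as `{"==": [{"var": "profile"}, "APC Smart UPS"]}` would then select every host whose `profile` attribute equals that name.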

Wiki + AI

Built-in knowledge base (Markdown, FTS, pgvector). The AI queries the wiki first via RAG, falls back to web search, and cites its sources.

Auto-discovery

The collector scans the network with nmap. The SNMP sysOID of each discovered device is matched against profiles. On a match, the profile is suggested.

NSCA receiver

Optional receiver on port 5667. Accepts packets from send_nsca clients. Migration path for Nagios estates.

Performance model

On self-hosting defaults (1 worker, 256 MB shared_buffers): ~960 checks/s sustained, p95 ≤ 150 ms, 0 errors over 30 min soak. Scaling: Administration → Scaling.
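The throughput figure can be turned into rough capacity numbers with simple arithmetic. The 60 s check interval used below is an assumed example value and not a documented default.

```python
def checks_in_window(rate_per_s: float, seconds: float) -> int:
    """Total checks processed at a steady rate over a time window."""
    return int(rate_per_s * seconds)


def services_supported(rate_per_s: float, interval_s: float) -> int:
    """Services that fit if each one reports once per interval."""
    return int(rate_per_s * interval_s)


soak_total = checks_in_window(960, 30 * 60)  # 30-minute soak at 960 checks/s
capacity = services_supported(960, 60)       # assumed 60 s check interval
```

At the quoted ~960 checks/s, the 30-minute soak corresponds to about 1.73 million checks, and a 60 s interval would correspond to roughly 57,600 services on a single worker.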
