
Scaling

Use this page when the system health view raises alerts, or when you want to prepare for growth ahead of time.

Sizing profiles

Profile  Hosts      Hardware (RAM / CPU)  WORKER_REPLICAS  WORKER_CONCURRENCY  PG_SHARED_BUFFERS  PG_EFFECTIVE_CACHE
Small    < 50       2 GB / 1 core         1                4                   256 MB             512 MB
Medium   50–500     4 GB / 2 cores        2                4                   512 MB             1 GB
Large    500–2 000  8 GB / 4 cores        3                8                   1 GB               4 GB
XL       > 2 000    16 GB+ / 8 cores      4+               8                   2 GB               8 GB

These are starting points. Actual tuning depends on the check mix: setups with many SNMP walks need more workers, while setups with many agent checks need more DB write throughput.
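As an illustration, the sizing table can be expressed as a small lookup. This is only a sketch: `Profile`, `pick_profile` and `env_lines` are hypothetical names, not part of the product, and the handling of the overlapping range boundaries (500 counts as Medium, 2 000 as Large) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    max_hosts: float  # upper bound of the host range (inclusive, by assumption)
    worker_replicas: int
    worker_concurrency: int
    pg_shared_buffers: str
    pg_effective_cache: str


# Values copied from the sizing table; XL uses the table's lower bound ("4+").
PROFILES = [
    Profile("Small", 49, 1, 4, "256MB", "512MB"),
    Profile("Medium", 500, 2, 4, "512MB", "1GB"),
    Profile("Large", 2000, 3, 8, "1GB", "4GB"),
    Profile("XL", float("inf"), 4, 8, "2GB", "8GB"),
]


def pick_profile(hosts: int) -> Profile:
    """Return the first profile whose host range covers `hosts`."""
    if hosts < 0:
        raise ValueError("hosts must be non-negative")
    return next(p for p in PROFILES if hosts <= p.max_hosts)


def env_lines(profile: Profile) -> list[str]:
    """Render the profile as .env-style lines."""
    return [
        f"WORKER_REPLICAS={profile.worker_replicas}",
        f"WORKER_CONCURRENCY={profile.worker_concurrency}",
        f"PG_SHARED_BUFFERS={profile.pg_shared_buffers}",
        f"PG_EFFECTIVE_CACHE={profile.pg_effective_cache}",
    ]
```

For example, `env_lines(pick_profile(1200))` yields the Large row as four settings. Treat the result as a starting point, as with the table itself.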

Workers

Workers consume the Redis stream and write to the DB.

WORKER_REPLICAS=3      # Number of worker containers
WORKER_CONCURRENCY=8   # Goroutines per worker
WORKER_BATCH_SIZE=200  # Messages per DB insert batch

Adding workers helps when the load consists of many small inserts. A larger batch size helps when DB round-trips are the limiting factor.

Rule of thumb: one extra worker per 1 000 hosts.
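The arithmetic behind both knobs is simple enough to sketch. The helper names below are hypothetical, and the round-trip count assumes workers always fill a batch before inserting, which real workers may not do under light load.

```python
import math


def db_round_trips(messages: int, batch_size: int) -> int:
    """Number of insert batches needed to write `messages` results."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(messages / batch_size)


def extra_workers(hosts: int, hosts_per_worker: int = 1000) -> int:
    """Extra worker replicas per the rule of thumb (one per 1 000 hosts)."""
    if hosts < 0:
        raise ValueError("hosts must be non-negative")
    return hosts // hosts_per_worker
```

With WORKER_BATCH_SIZE=200, writing 10 000 messages takes 50 round-trips; with a batch size of 50 it takes 200. For 3 500 hosts the rule of thumb suggests 3 extra workers on top of whatever baseline you start from.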

Apply:

docker compose -f /opt/vesana/docker-compose.prod.yml up -d

Postgres

PG_SHARED_BUFFERS=1GB       # ~25 % of RAM
PG_WORK_MEM=64MB
PG_EFFECTIVE_CACHE=4GB      # ~50 % of RAM
PG_MAX_CONNECTIONS=400

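The connection limit is the most common bottleneck when running multiple API replicas. If pg_stat_activity shows the connection count at the limit, either put pgbouncer in front of Postgres (see below) or raise PG_MAX_CONNECTIONS (each additional connection costs RAM).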
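A rough budget check can show whether a planned replica count fits under the limit. This is a sketch under stated assumptions: the per-process pool sizes are hypothetical parameters (this page does not document them), and `reserved` defaults to 3 to mirror Postgres' default superuser_reserved_connections.

```python
def connections_needed(
    api_replicas: int,
    api_pool_size: int,       # hypothetical: max DB connections per API container
    worker_replicas: int,
    worker_pool_size: int,    # hypothetical: max DB connections per worker container
    reserved: int = 3,        # Postgres default for superuser_reserved_connections
) -> int:
    """Worst-case number of Postgres connections the stack may open."""
    return (
        api_replicas * api_pool_size
        + worker_replicas * worker_pool_size
        + reserved
    )


def fits(needed: int, max_connections: int) -> bool:
    """True if the worst case stays within PG_MAX_CONNECTIONS."""
    return needed <= max_connections
```

With illustrative pool sizes of 100 per API container and 10 per worker, three API replicas and three workers need 333 connections and fit under PG_MAX_CONNECTIONS=400; a fourth API replica (433) would not.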

TimescaleDB compression

check_results and logs are hypertables. Compression reduces their disk usage by roughly 6×.

Since v0.17.x

The default compression window is 3 days (previously 30), set by migration 116. Chunks older than the window are compressed automatically.

Manual trigger:

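-- if_not_compressed => true skips chunks that are already compressed instead of raising an error
SELECT compress_chunk(c, if_not_compressed => true) FROM show_chunks('check_results') c;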

Retention

Defaults:

Table          Retention
check_results  90 days
logs           30 days

Configurable in Admin → Settings → Retention. Lower values use less disk but keep less history.
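To get a feel for how retention and compression interact, the following sketch estimates how many days' worth of uncompressed data end up on disk. It assumes a constant daily volume and that the ~6× ratio applies evenly to every compressed chunk; `effective_days` is a hypothetical helper, not a product function.

```python
def effective_days(retention_days: float, compress_after_days: float, ratio: float) -> float:
    """Disk footprint expressed in days of uncompressed data."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    uncompressed = min(retention_days, compress_after_days)
    compressed = max(retention_days - compress_after_days, 0)
    return uncompressed + compressed / ratio
```

With the defaults for check_results (90 days retention, 3-day compression window, ~6× ratio) this gives 3 + 87 / 6 = 17.5 day-equivalents instead of 90. For logs (30 days) it gives 7.5.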

Redis

REDIS_MAXMEMORY=1gb          # default 512mb

The eviction policy is noeviction: Redis rejects new writes instead of dropping old data. When the stream overflows, the system tab shows "receiver rejecting", which means you need to scale urgently.

Redis backups are not needed because the stream content is transient. After a restart, everything continues from the last DB state.

API replicas

Multiple API containers behind nginx upstream:

api:
  deploy:
    replicas: 3

Prerequisite (already met): distributed locking via Redis is built in, so critical background jobs (watchers) run on only one replica at a time, guarded by the lock.
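For readers unfamiliar with the pattern, a Redis-based lock is commonly built on SET with the NX and PX options. The sketch below illustrates that general technique and is not the product's actual implementation. `KeyValueStore`, `InMemoryStore` and `try_acquire` are hypothetical names; the in-memory store stands in for Redis so the example runs without a server, and its `set` call mirrors the keyword arguments of redis-py's `Redis.set`.

```python
import uuid
from typing import Optional, Protocol


class KeyValueStore(Protocol):
    def set(self, name: str, value: str, *, nx: bool = False,
            px: Optional[int] = None) -> Optional[bool]: ...


class InMemoryStore:
    """Single-process stand-in for Redis, only for trying the sketch."""

    def __init__(self) -> None:
        self.now_ms = 0  # manual clock so expiry can be simulated
        self._data = {}  # name -> (value, expiry in ms or None)

    def set(self, name, value, *, nx=False, px=None):
        current = self._data.get(name)
        expired = (
            current is not None
            and current[1] is not None
            and current[1] <= self.now_ms
        )
        if nx and current is not None and not expired:
            return None  # like redis-py: None when NX prevents the write
        self._data[name] = (value, None if px is None else self.now_ms + px)
        return True


def try_acquire(store: KeyValueStore, lock_name: str, ttl_ms: int) -> Optional[str]:
    """Try to take the lock; return an owner token on success, else None."""
    token = uuid.uuid4().hex
    if store.set(lock_name, token, nx=True, px=ttl_ms):
        return token
    return None
```

The first replica to call `try_acquire` gets a token and runs the job; the others get `None` and skip it. The TTL ensures that a crashed holder does not block the job forever. A safe explicit release (compare the token, then delete) is omitted here.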

pgbouncer

Connection pool for many replicas. In docker-compose.prod.yml:

pgbouncer:
  image: edoburu/pgbouncer
  environment:
    DATABASE_URL: postgresql://vesana:${POSTGRES_PASSWORD}@postgres:5432/vesana
    POOL_MODE: transaction
    MAX_CLIENT_CONN: 1000
    DEFAULT_POOL_SIZE: 50

API services then connect via pgbouncer:6432 instead of postgres:5432.

Activate profile:

docker compose -f /opt/vesana/docker-compose.prod.yml --profile pgbouncer up -d

Hot tables

current_status and agent_tokens are updated very frequently. Migration 115 sets fillfactor=80 on both tables, which leaves free space in each page for in-place updates and reduces update bloat.

Updates to agent_tokens.last_seen_at are batched in the receiver (60 s); otherwise every heartbeat would cost an UPDATE round-trip.
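The batching idea can be sketched as follows. This is an illustration of the technique, not the receiver's actual code: `LastSeenBatcher` and its `flush` callback are hypothetical, and the clock is passed in explicitly to keep the example deterministic.

```python
from typing import Callable, Dict, Optional


class LastSeenBatcher:
    """Collect heartbeats in memory and write them out in one batch."""

    def __init__(self, flush: Callable[[Dict[str, float]], None],
                 interval_s: float = 60.0) -> None:
        self._flush = flush              # e.g. one multi-row UPDATE
        self._interval_s = interval_s
        self._pending: Dict[str, float] = {}
        self._last_flush: Optional[float] = None

    def heartbeat(self, token_id: str, now: float) -> None:
        if self._last_flush is None:
            self._last_flush = now
        # Keep only the newest timestamp per token.
        self._pending[token_id] = now
        if now - self._last_flush >= self._interval_s:
            self.flush_now(now)

    def flush_now(self, now: float) -> None:
        if self._pending:
            self._flush(dict(self._pending))
            self._pending.clear()
        self._last_flush = now
```

Four heartbeats from two agents within one interval result in a single flush carrying two rows, rather than four separate UPDATE statements.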

Compose profiles

Profile    When
Default    always
ai         with local Ollama
backup     with backup sidecar
pgbouncer  with many replicas

Activate all profiles:

docker compose -f docker-compose.prod.yml \
  --profile ai --profile backup --profile pgbouncer \
  up -d

Vertical vs. horizontal scaling

Path                                            Pro                                     Con
Vertical (more RAM/CPU on the server)           simple, no architecture refactor        hits a hardware limit eventually
Horizontal (multiple API replicas + pgbouncer)  near-linear scaling up to the DB limit  more complexity

Recommendation: scale vertically up to 16 GB / 8 cores, then scale horizontally.

Performance baseline

Results of the 50 000-check scaling test in scale-test/:

  • ~960 checks/s sustained with default config
  • p95 ≤ 150 ms
  • 0 errors over 30 min soak
  • 6 bugs found and fixed during the test (see changelog v0.17.x)

For your own workload, watch the system tab and start scaling when a metric enters the yellow band.
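To relate the baseline to your own setup, compare the check rate you need with the measured sustained rate. The sketch below assumes every check runs at the same interval, which is a simplification; the 60 s interval in the example is hypothetical and not taken from the test.

```python
def required_rate(checks: int, interval_s: float) -> float:
    """Checks per second needed if every check runs once per interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return checks / interval_s


def headroom(checks: int, interval_s: float, sustained_rate: float = 960.0) -> float:
    """Ratio of sustained to required rate; below 1.0 means overloaded."""
    return sustained_rate / required_rate(checks, interval_s)
```

50 000 checks at a 60 s interval need about 833 checks/s, which leaves a headroom factor of roughly 1.15 against ~960 checks/s. At a 30 s interval the factor drops below 1, so the default config would not keep up.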

Externalize (Postgres / Redis)

For very large setups, consider moving Postgres or Redis out of the compose stack to a managed service or dedicated VMs.

Steps:

  1. Set up external instance
  2. Migrate data (pg_dump / pg_restore)
  3. Switch DATABASE_URL in .env
  4. Remove postgres service from compose or adjust dependencies

Test the restore in a test environment first; this path isn't officially supported.
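Step 3 amounts to replacing the host and port in DATABASE_URL while leaving credentials and database name untouched. A minimal sketch using only the standard library follows; `switch_host` is a hypothetical helper, and the target host in the example is a placeholder.

```python
from urllib.parse import urlsplit


def switch_host(database_url: str, new_host: str, new_port: int) -> str:
    """Return the URL with host and port replaced, credentials preserved."""
    parts = urlsplit(database_url)
    # Split on the last "@" so passwords containing "@" stay intact.
    userinfo, at, _old_hostport = parts.netloc.rpartition("@")
    return parts._replace(netloc=f"{userinfo}{at}{new_host}:{new_port}").geturl()
```

For example, `switch_host("postgresql://vesana:secret@postgres:5432/vesana", "db.example.internal", 5432)` returns `postgresql://vesana:secret@db.example.internal:5432/vesana`.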
