Scaling¶

Scale when system health raises alerts, or proactively when you are preparing for growth.

Sizing profiles¶

Profile	Hosts	Hardware	`WORKER_REPLICAS`	`WORKER_CONCURRENCY`	`PG_SHARED_BUFFERS`	`PG_EFFECTIVE_CACHE`
Small	< 50	2 GB / 1 core	1	4	256 MB	512 MB
Medium	50–500	4 GB / 2 cores	2	4	512 MB	1 GB
Large	500–2 000	8 GB / 4 cores	3	8	1 GB	4 GB
XL	> 2 000	16 GB+ / 8 cores	4+	8	2 GB	8 GB

These are starting points. Actual tuning depends on the check mix: many SNMP walks need more workers, while many agent checks need a higher DB write rate.
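As a rough illustration, the table can be turned into a lookup that picks a starting profile from the host count. This is only a sketch: the `Profile` class and the helper names are hypothetical, the boundary handling (500 counts as Medium, 2 000 as Large) is an assumption because the table ranges overlap, and "4+" replicas for XL is represented by its minimum of 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    max_hosts: float          # inclusive upper bound (assumed boundary handling)
    worker_replicas: int
    worker_concurrency: int
    pg_shared_buffers: str
    pg_effective_cache: str


# Values copied from the sizing table above.
PROFILES = [
    Profile("Small", 49, 1, 4, "256MB", "512MB"),
    Profile("Medium", 500, 2, 4, "512MB", "1GB"),
    Profile("Large", 2000, 3, 8, "1GB", "4GB"),
    Profile("XL", float("inf"), 4, 8, "2GB", "8GB"),
]


def pick_profile(hosts: int) -> Profile:
    """Return the first profile whose upper bound covers the host count."""
    if hosts < 0:
        raise ValueError("hosts must be non-negative")
    for profile in PROFILES:
        if hosts <= profile.max_hosts:
            return profile
    return PROFILES[-1]


def env_lines(profile: Profile) -> list:
    """Render the profile as .env-style lines using the documented variable names."""
    return [
        f"WORKER_REPLICAS={profile.worker_replicas}",
        f"WORKER_CONCURRENCY={profile.worker_concurrency}",
        f"PG_SHARED_BUFFERS={profile.pg_shared_buffers}",
        f"PG_EFFECTIVE_CACHE={profile.pg_effective_cache}",
    ]
```

For example, `env_lines(pick_profile(800))` yields the Large values. Treat the result as a first guess and adjust it for your check mix.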

Workers¶

Workers consume the Redis stream and write to the DB.

WORKER_REPLICAS=3      # Number of worker containers
WORKER_CONCURRENCY=8   # Goroutines per worker
WORKER_BATCH_SIZE=200  # Messages per DB insert batch

More workers help when there are many small inserts. A larger batch size helps when DB round-trips are the limit.
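To see how the three settings interact, the following sketch multiplies them out. It assumes each goroutine handles one insert batch at a time, which the settings suggest but which is not confirmed here; the function name is hypothetical.

```python
def worker_capacity(replicas: int, concurrency: int, batch_size: int) -> dict:
    """Upper bounds on parallelism implied by the worker settings.

    Assumption: one goroutine processes one DB insert batch at a time.
    """
    goroutines = replicas * concurrency
    return {
        "goroutines": goroutines,                          # parallel DB writers
        "max_messages_in_flight": goroutines * batch_size,
    }


# Settings from the example above: 3 replicas, 8 goroutines, batches of 200.
print(worker_capacity(3, 8, 200))
```

With the example values this gives 24 goroutines and at most 4 800 messages in flight. Raising `WORKER_REPLICAS` or `WORKER_CONCURRENCY` adds parallel writers, while raising `WORKER_BATCH_SIZE` makes each round-trip carry more rows.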

Rule of thumb: one extra worker per 1 000 hosts.

Apply:

docker compose -f /opt/vesana/docker-compose.prod.yml up -d

Postgres¶

PG_SHARED_BUFFERS=1GB       # ~25 % of RAM
PG_WORK_MEM=64MB
PG_EFFECTIVE_CACHE=4GB      # ~50 % of RAM
PG_MAX_CONNECTIONS=400

The connection limit is the most common bottleneck with multiple API replicas. If pg_stat_activity shows the connection count at the limit, put pgbouncer in front of Postgres (see below) or raise PG_MAX_CONNECTIONS (costs RAM).
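One way to judge whether the limit is close is to compare the number of client sessions with `max_connections`. The sketch below holds a query you could run with any Postgres client and a small helper to interpret the result. The helper name is hypothetical, and the default of 3 reserved slots is PostgreSQL's stock `superuser_reserved_connections` value, which your instance may override.

```python
# Counts client sessions only; the backend_type column exists in PostgreSQL 10+.
CONNECTION_USAGE_SQL = """
SELECT count(*) AS used,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity
WHERE backend_type = 'client backend';
"""


def connection_usage(used: int, max_connections: int, reserved: int = 3) -> float:
    """Fraction of the non-reserved connection slots currently in use."""
    usable = max_connections - reserved
    if usable <= 0:
        raise ValueError("max_connections must exceed the reserved slots")
    return used / usable
```

A value that stays near 1.0 under normal load suggests adding pgbouncer or raising the limit.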

TimescaleDB compression¶

check_results and logs are hypertables. Compression reduces disk usage by roughly 6×.

Since v0.17.x

The default compression window is 3 days (previously 30), set by migration 116. Chunks older than the window are compressed automatically.

Manual trigger:

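-- if_not_compressed => true skips chunks that are already compressed instead of raising an error
SELECT compress_chunk(c, if_not_compressed => true) FROM show_chunks('check_results') c;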

Retention¶

Defaults:

Table	Retention
`check_results`	90 days
`logs`	30 days

Configurable in Admin → Settings → Retention. Lower values use less disk but keep less history.
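Retention and compression together determine steady-state disk usage. The following sketch estimates it from a daily ingest volume, which you have to measure yourself. It takes the ~6× ratio and the 3-day compression window from this page as defaults and ignores indexes and other overhead; the function name is hypothetical.

```python
def estimated_disk_gb(
    daily_gb: float,
    retention_days: int,
    compress_after_days: int = 3,
    ratio: float = 6.0,
) -> float:
    """Rough steady-state size of a hypertable.

    Chunks younger than the compression window are stored at full size,
    older chunks are assumed to shrink by `ratio`.
    """
    uncompressed_days = min(retention_days, compress_after_days)
    compressed_days = retention_days - uncompressed_days
    return daily_gb * uncompressed_days + daily_gb * compressed_days / ratio


# check_results at 1 GB/day with the default 90-day retention:
print(estimated_disk_gb(1.0, 90))   # 17.5
# logs at 1 GB/day with the default 30-day retention:
print(estimated_disk_gb(1.0, 30))   # 7.5
```

With the earlier 30-day compression window, the same 90-day case comes to 40 GB (`estimated_disk_gb(1.0, 90, compress_after_days=30)`), which shows why the shorter window matters.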

Redis¶

REDIS_MAXMEMORY=1gb          # default 512mb

The policy is noeviction: Redis rejects new data instead of dropping old data. On stream overflow, the system tab shows "receiver rejecting", which means you need to scale urgently.

Redis backups are not needed because stream content is transient. After a restart, everything continues from the last DB state.
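Because noeviction turns a full stream into rejected writes, it can help to estimate how long a backlog may grow before `REDIS_MAXMEMORY` is reached. This is a back-of-the-envelope sketch: the average bytes per message is an input you would have to measure, Redis' own bookkeeping overhead is ignored, and the function name is hypothetical.

```python
from typing import Optional


def seconds_until_full(
    maxmemory_bytes: int,
    used_bytes: int,
    ingest_per_s: float,
    drain_per_s: float,
    bytes_per_message: float,
) -> Optional[float]:
    """Seconds until the stream hits maxmemory, or None if workers keep up."""
    net_rate = ingest_per_s - drain_per_s       # backlog growth in messages/s
    if net_rate <= 0:
        return None                             # backlog is not growing
    free = max(maxmemory_bytes - used_bytes, 0)
    return free / (net_rate * bytes_per_message)


# Default 512 MB, empty stream, workers 100 msg/s behind, 1 KiB per message (assumed).
print(seconds_until_full(512 * 1024**2, 0, 1000, 900, 1024))
```

In this example the result is a bit under 90 minutes, which is the time available to add workers before the receiver starts rejecting.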

API replicas¶

Run multiple API containers behind an nginx upstream:

api:
  deploy:
    replicas: 3

Prerequisite: distributed locking via Redis is built in. A lock ensures that critical background jobs (watchers) run on only one replica at a time.
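For readers unfamiliar with the pattern, the sketch below shows the common Redis locking idiom based on `SET key value NX EX ttl`. It is not the product's actual implementation, which is not shown on this page. The key name `lock:watchers` and the replica names are hypothetical, and `InMemoryRedis` is a stand-in so the example runs without a server; it ignores TTL expiry.

```python
class InMemoryRedis:
    """Tiny stand-in mimicking the subset of redis-py calls used below."""

    def __init__(self):
        self._data = {}

    def set(self, name, value, nx=False, ex=None):
        if nx and name in self._data:
            return None                 # redis-py returns None when NX fails
        self._data[name] = value        # TTL (ex) is ignored in this stand-in
        return True

    def get(self, name):
        return self._data.get(name)

    def delete(self, name):
        return 1 if self._data.pop(name, None) is not None else 0


def try_acquire(client, lock_key: str, owner: str, ttl_s: int = 30) -> bool:
    """Take the lock only if nobody holds it; the TTL frees it if the holder dies."""
    return bool(client.set(lock_key, owner, nx=True, ex=ttl_s))


def release(client, lock_key: str, owner: str) -> bool:
    """Release only our own lock. Against real Redis this check-and-delete
    should be made atomic, for example with a Lua script."""
    if client.get(lock_key) == owner:
        client.delete(lock_key)
        return True
    return False


client = InMemoryRedis()
print(try_acquire(client, "lock:watchers", "replica-a"))   # True
print(try_acquire(client, "lock:watchers", "replica-b"))   # False
```

The replica that wins the lock runs the job, and the others skip the cycle. With a real `redis.Redis` client, note that `get` returns bytes unless `decode_responses=True` is set.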

pgbouncer¶

pgbouncer pools connections for many replicas. It is defined in docker-compose.prod.yml:

pgbouncer:
  image: edoburu/pgbouncer
  environment:
    DATABASE_URL: postgresql://vesana:${POSTGRES_PASSWORD}@postgres:5432/vesana
    POOL_MODE: transaction
    MAX_CLIENT_CONN: 1000
    DEFAULT_POOL_SIZE: 50

API services then connect via pgbouncer:6432 instead of postgres:5432.
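If you rewrite connection strings by script, the host and port swap can be sketched as follows. This assumes the API reads a URL of the same shape as the `DATABASE_URL` shown above and that special characters in the password are percent-encoded; the helper name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def via_pgbouncer(database_url: str, host: str = "pgbouncer", port: int = 6432) -> str:
    """Return the same URL with only the host and port replaced."""
    parts = urlsplit(database_url)
    # Keep "user:password" untouched; only the part after the last "@" changes.
    userinfo, _, _ = parts.netloc.rpartition("@")
    netloc = f"{userinfo}@{host}:{port}" if userinfo else f"{host}:{port}"
    return urlunsplit(parts._replace(netloc=netloc))


print(via_pgbouncer("postgresql://vesana:secret@postgres:5432/vesana"))
# postgresql://vesana:secret@pgbouncer:6432/vesana
```

The database name, user, and password stay the same, so only the network target changes.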

Activate the profile:

docker compose -f /opt/vesana/docker-compose.prod.yml --profile pgbouncer up -d

Hot tables¶

current_status and agent_tokens receive frequent updates. Migration 115 sets fillfactor=80 on both, which reduces update bloat.

Updates to agent_tokens.last_seen_at are batched in the receiver (every 60 s); otherwise every heartbeat would cost an UPDATE round-trip.
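The batching idea can be illustrated with a small coalescing buffer: heartbeats only touch memory, and a periodic flush produces one row per token. This is a conceptual sketch, not the receiver's code. The class name is hypothetical, and a real implementation would call `flush()` from a 60-second timer and write the rows in a single statement.

```python
class LastSeenBatcher:
    """Keeps only the newest heartbeat timestamp per token until flushed."""

    def __init__(self):
        self._pending = {}

    def touch(self, token_id: int, seen_at: float) -> None:
        """Record a heartbeat in memory; no database round-trip happens here."""
        previous = self._pending.get(token_id)
        if previous is None or seen_at > previous:
            self._pending[token_id] = seen_at

    def flush(self) -> list:
        """Return (token_id, last_seen_at) rows for one batched write and reset."""
        rows = sorted(self._pending.items())
        self._pending = {}
        return rows


batcher = LastSeenBatcher()
batcher.touch(1, 10.0)
batcher.touch(1, 20.0)      # a later heartbeat replaces the earlier one
batcher.touch(2, 15.0)
print(batcher.flush())      # [(1, 20.0), (2, 15.0)]
```

Three heartbeats collapse into two rows here. With many agents reporting several times per minute, the reduction in UPDATE traffic is correspondingly larger.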

Compose profiles¶

Profile	When
Default	always
`ai`	with local Ollama
`backup`	with backup sidecar
`pgbouncer`	with many replicas

All active:

docker compose -f docker-compose.prod.yml \
  --profile ai --profile backup --profile pgbouncer \
  up -d

Vertical vs. horizontal scaling¶

Path	Pro / con
Vertical (more RAM/CPU at the server)	simple, no architecture refactor, hardware limit eventually
Horizontal (multiple API replicas + pgbouncer)	more complexity, near-linear up to DB limit

Recommendation: scale vertically up to 16 GB / 8 cores, then horizontally.

Performance baseline¶

Results of the 50 000-check scaling test in scale-test/:

~960 checks/s sustained with default config
p95 ≤ 150 ms
0 errors over 30 min soak
6 bugs found and fixed during the test (see changelog v0.17.x)

For your own workload, watch the system tab and start scaling when a metric enters the yellow band.
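To relate your own setup to the measured baseline, you can compare the check rate you need with the ~960 checks/s figure. The sketch below assumes all checks share one interval, which is a simplification, and the baseline only applies to the hardware and configuration used in the test. The function names are hypothetical.

```python
BASELINE_CHECKS_PER_S = 960   # sustained rate from the scaling test above


def required_rate(total_checks: int, interval_s: float) -> float:
    """Checks per second needed if every check runs once per interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return total_checks / interval_s


def baseline_utilization(total_checks: int, interval_s: float) -> float:
    """Required rate as a fraction of the measured baseline."""
    return required_rate(total_checks, interval_s) / BASELINE_CHECKS_PER_S


# 50 000 checks on a 60 s interval (the interval is an assumed example value):
print(round(required_rate(50_000, 60), 1))          # 833.3
print(round(baseline_utilization(50_000, 60), 2))   # 0.87
```

A utilization close to or above 1.0 suggests the default configuration would not keep up, so the system tab remains the authoritative signal.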

Externalize (Postgres / Redis)¶

For very large setups, consider moving Postgres or Redis out of the compose stack to a managed service or dedicated VMs.

Steps:

Set up external instance
Migrate data (pg_dump / pg_restore)
Switch DATABASE_URL in .env
Remove postgres service from compose or adjust dependencies

Test the restore in a test environment first; this path isn't officially supported.

Next¶

System health — when to scale
Self-hosting — compose profiles, env variables