Scaling¶
Use this page when system health raises alerts, or when you are proactively preparing for growth.
Sizing profiles¶
| Profile | Hosts | Hardware | WORKER_REPLICAS | WORKER_CONCURRENCY | PG_SHARED_BUFFERS | PG_EFFECTIVE_CACHE |
|---|---|---|---|---|---|---|
| Small | < 50 | 2 GB / 1 core | 1 | 4 | 256 MB | 512 MB |
| Medium | 50–500 | 4 GB / 2 cores | 2 | 4 | 512 MB | 1 GB |
| Large | 500–2 000 | 8 GB / 4 cores | 3 | 8 | 1 GB | 4 GB |
| XL | > 2 000 | 16 GB+ / 8 cores | 4+ | 8 | 2 GB | 8 GB |
These are starting points. Actual tuning depends on the check mix: many SNMP walks need more workers, while many agent checks need a higher DB write rate.
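As a rough illustration, the table can be read as a lookup from host count to profile. The sketch below uses a hypothetical helper, `pick_profile`, which is not part of the product. It assumes that the shared boundary values 500 and 2 000 fall into the smaller profile, which the table does not specify.

```shell
# Hypothetical helper: map a host count to a sizing profile from the table.
# Assumption: the boundary values 500 and 2000 go to the smaller profile.
pick_profile() {
  hosts=$1
  if [ "$hosts" -lt 50 ]; then
    echo "Small"
  elif [ "$hosts" -le 500 ]; then
    echo "Medium"
  elif [ "$hosts" -le 2000 ]; then
    echo "Large"
  else
    echo "XL"
  fi
}

pick_profile 300   # prints "Medium"
```

Treat the result as a starting point only, since the check mix can shift the right choice in either direction.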
Workers¶
Workers consume the Redis stream and write to the DB.
WORKER_REPLICAS=3 # Number of worker containers
WORKER_CONCURRENCY=8 # Goroutines per worker
WORKER_BATCH_SIZE=200 # Messages per DB insert batch
More workers help when there are many small inserts. A larger batch size helps when DB round-trips are the limiting factor.
Rule of thumb: one extra worker per 1 000 hosts.
Apply:
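The command that originally followed is not reproduced here. As a hedged sketch, assuming the variables above are set in the `.env` file next to `docker-compose.prod.yml` and that Compose picks them up on startup, re-running `up -d` recreates the services whose configuration changed. `apply_scaling` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper. Assumes .env sits next to docker-compose.prod.yml
# and already contains the WORKER_* values shown above.
apply_scaling() {
  docker compose -f docker-compose.prod.yml up -d
}
```

Call `apply_scaling` from the directory that contains the compose file.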
Postgres¶
PG_SHARED_BUFFERS=1GB # ~25 % of RAM
PG_WORK_MEM=64MB
PG_EFFECTIVE_CACHE=4GB # ~50 % of RAM
PG_MAX_CONNECTIONS=400
The connection limit is the most common bottleneck with multiple API replicas. If pg_stat_activity shows the connections exhausted, put pgbouncer in front of Postgres (see below) or raise PG_MAX_CONNECTIONS (costs RAM).
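One way to check how close you are to the limit is to compare the number of client backends with `max_connections`. This is a minimal sketch. It assumes the compose service is named `postgres` and that user and database are both `vesana`, as in the DATABASE_URL shown in the pgbouncer section. `pg_connection_usage` is a hypothetical helper name.

```shell
# Compare client connections in use with the configured limit.
# Assumptions: compose service "postgres", user and database "vesana".
pg_connection_usage() {
  docker compose -f docker-compose.prod.yml exec postgres \
    psql -U vesana -d vesana -c \
    "SELECT count(*) AS in_use,
            current_setting('max_connections')::int AS max_allowed
       FROM pg_stat_activity
      WHERE backend_type = 'client backend';"
}
```

If `in_use` stays close to `max_allowed` under normal load, pooling is usually the cheaper fix than raising the limit.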
TimescaleDB compression¶
check_results and logs are hypertables. Compression reduces disk usage by roughly 6×.
Since v0.17.x, the default compression window is 3 days (previously 30), set by migration 116. Chunks older than the window are compressed automatically.
Manual trigger:
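The project's own trigger command is not reproduced here. As a hedged alternative, stock TimescaleDB exposes `show_chunks` and `compress_chunk`, which can be combined to compress everything older than the window. The service name `postgres` and the `vesana` user and database are assumptions taken from the DATABASE_URL in the pgbouncer section. `compress_old_chunks` is a hypothetical helper name, and function names may differ between TimescaleDB versions.

```shell
# Compress all check_results chunks older than 3 days that are not yet compressed.
# Assumptions: compose service "postgres", user and database "vesana".
compress_old_chunks() {
  docker compose -f docker-compose.prod.yml exec postgres \
    psql -U vesana -d vesana -c \
    "SELECT compress_chunk(c, if_not_compressed => true)
       FROM show_chunks('check_results', older_than => INTERVAL '3 days') AS c;"
}
```

The same query with `'logs'` in place of `'check_results'` would cover the second hypertable.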
Retention¶
Defaults:
| Table | Retention |
|---|---|
| check_results | 90 days |
| logs | 30 days |
Configurable in Admin → Settings → Retention. Lower values use less disk but keep less history.
Redis¶
The eviction policy is noeviction: Redis rejects new data instead of dropping old data. When the stream overflows, the system tab shows "receiver rejecting", which means you need to scale urgently.
Redis backups are not needed because stream content is transient. After a restart, everything continues from the last DB state.
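To see how close Redis is to rejecting writes, you can inspect the active policy and memory usage with standard `redis-cli` commands. This is a sketch that assumes the compose service is named `redis` and that no password is required. `redis_pressure` is a hypothetical helper name.

```shell
# Show the eviction policy and current memory usage.
# Assumptions: compose service "redis", no authentication configured.
redis_pressure() {
  docker compose -f docker-compose.prod.yml exec redis \
    redis-cli CONFIG GET maxmemory-policy
  docker compose -f docker-compose.prod.yml exec redis \
    redis-cli INFO memory
}
```

In the `INFO memory` output, compare `used_memory` with `maxmemory` to judge the remaining headroom.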
API replicas¶
Multiple API containers behind nginx upstream:
Prerequisite: distributed locking via Redis is built in. Critical background jobs (watchers) take a lock so that each job runs on only one replica at a time.
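For readers unfamiliar with the idea, the generic Redis locking pattern is a `SET` with the `NX` and `EX` options: only the first caller gets `OK`, and the key expires on its own if the holder dies. The sketch below illustrates that general pattern, not the product's actual implementation. The key name `lock:watchers`, the 30 s TTL, and the service name `redis` are all hypothetical.

```shell
# Generic Redis lock pattern (illustration only, not the product's code).
# Hypothetical values: key "lock:watchers", TTL 30 s, compose service "redis".
try_acquire_lock() {
  reply=$(docker compose -f docker-compose.prod.yml exec -T redis \
    redis-cli SET lock:watchers "$(hostname)" NX EX 30)
  # Redis answers "OK" only to the caller that created the key.
  [ "$reply" = "OK" ]
}
```

`try_acquire_lock` returns success for one caller and failure for all others until the key expires or is deleted.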
pgbouncer¶
Connection pool for many replicas. In docker-compose.prod.yml:
pgbouncer:
  image: edoburu/pgbouncer
  environment:
    DATABASE_URL: postgresql://vesana:${POSTGRES_PASSWORD}@postgres:5432/vesana
    POOL_MODE: transaction
    MAX_CLIENT_CONN: 1000
    DEFAULT_POOL_SIZE: 50
API services then connect via pgbouncer:6432 instead of postgres:5432.
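In practice this means changing only the host and port in the connection string. The sketch below shows the rewrite with `sed`. `to_pgbouncer_url` is a hypothetical helper, the password `secret` is a placeholder, and the host and port pairs are the ones named in the text above.

```shell
# Rewrite a DATABASE_URL so that it targets pgbouncer instead of Postgres.
to_pgbouncer_url() {
  printf '%s\n' "$1" | sed 's|@postgres:5432/|@pgbouncer:6432/|'
}

to_pgbouncer_url 'postgresql://vesana:secret@postgres:5432/vesana'
# prints: postgresql://vesana:secret@pgbouncer:6432/vesana
```

User, password, and database name stay unchanged, so the API services need no other configuration change.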
Activate profile:
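The original command is not reproduced here. Based on the `--profile` usage shown in the Compose profiles section, enabling only the pgbouncer profile would plausibly look like this. `enable_pgbouncer` is a hypothetical wrapper name.

```shell
# Start the stack with the pgbouncer profile enabled.
# Mirrors the --profile syntax from the "Compose profiles" section.
enable_pgbouncer() {
  docker compose -f docker-compose.prod.yml --profile pgbouncer up -d
}
```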
Hot tables¶
current_status and agent_tokens are updated at a high rate. Migration 115 sets fillfactor=80 on both tables to reduce update bloat.
agent_tokens.last_seen_at is batched in the receiver (every 60 s); otherwise every heartbeat would be a separate UPDATE round-trip.
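To verify that the fillfactor is in place and to see how many updates avoid index maintenance, you can query the standard PostgreSQL catalogs. This is a sketch under the same assumptions as elsewhere on this page: compose service `postgres`, user and database `vesana`. `hot_table_stats` is a hypothetical helper name.

```shell
# Show storage options and update counters for the two hot tables.
# Assumptions: compose service "postgres", user and database "vesana".
hot_table_stats() {
  docker compose -f docker-compose.prod.yml exec postgres \
    psql -U vesana -d vesana -c \
    "SELECT c.relname, c.reloptions, s.n_tup_upd, s.n_tup_hot_upd
       FROM pg_class c
       JOIN pg_stat_user_tables s ON s.relid = c.oid
      WHERE c.relname IN ('current_status', 'agent_tokens');"
}
```

`reloptions` should contain `fillfactor=80`, and a high ratio of `n_tup_hot_upd` to `n_tup_upd` indicates that the free space per page is being used for in-page updates.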
Compose profiles¶
| Profile | When |
|---|---|
| Default | always |
| ai | with local Ollama |
| backup | with backup sidecar |
| pgbouncer | with many replicas |
All active:
docker compose -f docker-compose.prod.yml \
--profile ai --profile backup --profile pgbouncer \
up -d
Vertical vs. horizontal scaling¶
| Path | Pro / con |
|---|---|
| Vertical (more RAM/CPU at the server) | simple, no architecture refactor, hardware limit eventually |
| Horizontal (multiple API replicas + pgbouncer) | more complexity, near-linear up to DB limit |
Recommendation: scale vertically up to 16 GB / 8 cores, then scale horizontally.
Performance baseline¶
50 000 checks scaling test in scale-test/:
- ~960 checks/s sustained with default config
- p95 ≤ 150 ms
- 0 errors over 30 min soak
- 6 bugs found + fixed during test (see changelog v0.17.x)
For your own workload, watch the system tab and start scaling when a metric enters the yellow band.
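To relate the baseline to your own numbers, it can help to estimate the average rate your setup needs. The sketch below is a back-of-the-envelope calculation. It assumes every check runs at the same fixed interval, which is a simplification, and `required_rate` is a hypothetical helper.

```shell
# Hypothetical helper: average checks per second for a given workload.
# Arguments: total number of checks, check interval in seconds (assumed uniform).
required_rate() {
  checks=$1
  interval_s=$2
  # Integer division, rounded up.
  echo $(( (checks + interval_s - 1) / interval_s ))
}

required_rate 50000 60   # prints 834
```

A result well below the ~960 checks/s measured in the scaling test suggests headroom with the default config, although your hardware and check mix may behave differently.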
Externalize (Postgres / Redis)¶
For very large setups, consider moving Postgres or Redis out of the compose stack to a managed service or dedicated VMs.
Steps:
- Set up external instance
- Migrate data (pg_dump/pg_restore)
- Switch DATABASE_URL in .env
- Remove postgres service from compose or adjust dependencies
Test restore in a test environment first — this path isn't officially supported.
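The migration step can be sketched as follows. Because the database uses TimescaleDB hypertables, the restore is wrapped in `timescaledb_pre_restore()` and `timescaledb_post_restore()`, which is the procedure TimescaleDB documents for `pg_restore`. Everything else is an assumption: the service name `postgres`, the `vesana` user and database (from the DATABASE_URL in the pgbouncer section), the placeholder variable `NEW_DB_HOST`, and a target database that already exists with the same TimescaleDB version installed. The function names are hypothetical.

```shell
# Dump the database from the compose stack in custom format.
# Assumptions: compose service "postgres", user and database "vesana".
dump_vesana_db() {
  docker compose -f docker-compose.prod.yml exec -T postgres \
    pg_dump -U vesana -d vesana -Fc > vesana.dump
}

# Restore into the external instance. NEW_DB_HOST is a placeholder that
# you must set; the target database and extension must already exist.
restore_vesana_db() {
  psql -h "$NEW_DB_HOST" -U vesana -d vesana \
    -c "SELECT timescaledb_pre_restore();"
  pg_restore -h "$NEW_DB_HOST" -U vesana -d vesana --no-owner vesana.dump
  psql -h "$NEW_DB_HOST" -U vesana -d vesana \
    -c "SELECT timescaledb_post_restore();"
}
```

Run both functions against a test environment first and compare row counts before switching DATABASE_URL.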
Next¶
- System health — when to scale
- Self-hosting — compose profiles, env variables