AI provider setup¶

Three paths. Don't mix them — one provider per instance, embeddings can run separately.

Ollama (local, recommended for self-hosting)¶

Pros¶

Data stays in stack
No API cost
No internet needed

Prerequisites¶

8 GB RAM for llama3.1:8b, 64 GB+ for llama3.1:70b
GPU strongly recommended (NVIDIA with nvidia-container-toolkit)
10–80 GB disk for models

Activate¶

In .env:

AI_ENABLED=true

Start Ollama as compose profile:

docker compose -f /opt/vesana/docker-compose.prod.yml --profile ai up -d

GPU acceleration¶

In docker-compose.prod.yml (or override file):

ollama:
  image: ollama/ollama:latest
  profiles: ["ai"]
  deploy:
    resources:
      reservations:
        devices:
          - capabilities: [gpu]
  volumes:
    - ollama-models:/root/.ollama

Prerequisite on host: nvidia-container-toolkit installed.

Install model¶

/admin/ai:

Provider: Ollama
Model name: llama3.1:8b (or other)
Install — progress streams via SSE
Embedding model: nomic-embed-text, install too

Recommendations by RAM:

RAM	Model	Params
8 GB	`llama3.1:8b`	8B
16 GB	`llama3.1:8b`	8B (faster)
32 GB+	`mistral` or `mixtral:8x7b`	7B/56B
64 GB+ GPU	`llama3.1:70b`	70B

Manage models¶

/admin/ai/models:

List installed models
Delete models (POST endpoint, not DELETE — axios.delete() with body unreliable)
Re-install models

Anthropic (cloud, top quality)¶

Prerequisites¶

Anthropic account (console.anthropic.com)
API key
Outbound to api.anthropic.com

Configuration¶

/admin/ai:

Provider: Anthropic
Set API key
Pick model: claude-sonnet-4-6, claude-opus-4-7, …
Embedding model separately (Ollama needed, see above)

Cost¶

Pay-as-you-go per token. Rule of thumb:

Service analysis: ~2 000 input + 500 output tokens → cents
Chat question: ~1 000–3 000 input + 200–800 output → cents

Many tenants can mean tens of euros per month — watch costs in Anthropic console.

External OpenAI-compatible provider¶

For vLLM, LM Studio, text-generation-inference, locally hosted cloud models:

/admin/ai:

Provider: External
API URL: http://gpu-server:8080/v1
API key (if needed)
Model name

The endpoint must be OpenAI-compatible (chat completion with messages array).

Test connection¶

After config: Test connection sends a small test prompt, shows:

Response time
Answer content (or error code)
Token counts (input + output)

Immediate feedback whether config works.

Default parameters¶

Per provider:

Parameter	Default	Meaning
`temperature`	0.3	low = factual, high = creative
`max_tokens`	1024	output limit
`top_p`	0.9	nucleus sampling

Adjustable in admin UI.

Fallback on provider outage¶

If the configured provider doesn't respond (Anthropic outage, Ollama container down):

Chat widget shows „AI unavailable — retry later"
Service analysis falls back to cached answers if any
No auto-retry, no other provider — intentionally, to avoid surprise cloud costs

Schema¶

Config in ai_config table (migration 065+067):

Field	Value
`provider`	`ollama` / `anthropic` / `openai_compat`
`chat_model`	model name
`embed_provider`	default `ollama` (even when chat runs elsewhere)
`embed_model`	`nomic-embed-text`
`temperature`	float
`api_url`	for external
`api_key`	encrypted

Permission¶

ai.config for admins. Other users see only „AI active / inactive" without API key.

Next¶

AI chat & analysis — what the provider is needed for
Wiki — embeddings only useful when wiki articles exist