AI provider setup¶
Three paths. Don't mix them — one provider per instance, embeddings can run separately.
Ollama (local, recommended for self-hosting)¶
Pros¶
- Data stays in stack
- No API cost
- No internet needed
Prerequisites¶
- 8 GB RAM for
llama3.1:8b, 64 GB+ forllama3.1:70b - GPU strongly recommended (NVIDIA with
nvidia-container-toolkit) - 10–80 GB disk for models
Activate¶
In .env:
Start Ollama as compose profile:
GPU acceleration¶
In docker-compose.prod.yml (or override file):
ollama:
image: ollama/ollama:latest
profiles: ["ai"]
deploy:
resources:
reservations:
devices:
- capabilities: [gpu]
volumes:
- ollama-models:/root/.ollama
Prerequisite on host: nvidia-container-toolkit installed.
Install model¶
/admin/ai:
- Provider: Ollama
- Model name:
llama3.1:8b(or other) - Install — progress streams via SSE
- Embedding model:
nomic-embed-text, install too
Recommendations by RAM:
| RAM | Model | Params |
|---|---|---|
| 8 GB | llama3.1:8b |
8B |
| 16 GB | llama3.1:8b |
8B (faster) |
| 32 GB+ | mistral or mixtral:8x7b |
7B/56B |
| 64 GB+ GPU | llama3.1:70b |
70B |
Manage models¶
/admin/ai/models:
- List installed models
- Delete models (POST endpoint, not DELETE —
axios.delete()with body unreliable) - Re-install models
Anthropic (cloud, top quality)¶
Prerequisites¶
- Anthropic account (
console.anthropic.com) - API key
- Outbound to
api.anthropic.com
Configuration¶
/admin/ai:
- Provider: Anthropic
- Set API key
- Pick model:
claude-sonnet-4-6,claude-opus-4-7, … - Embedding model separately (Ollama needed, see above)
Cost¶
Pay-as-you-go per token. Rule of thumb:
- Service analysis: ~2 000 input + 500 output tokens → cents
- Chat question: ~1 000–3 000 input + 200–800 output → cents
Many tenants can mean tens of euros per month — watch costs in Anthropic console.
External OpenAI-compatible provider¶
For vLLM, LM Studio, text-generation-inference, locally hosted cloud models:
/admin/ai:
- Provider: External
- API URL:
http://gpu-server:8080/v1 - API key (if needed)
- Model name
The endpoint must be OpenAI-compatible (chat completion with messages array).
Test connection¶
After config: Test connection sends a small test prompt, shows:
- Response time
- Answer content (or error code)
- Token counts (input + output)
Immediate feedback whether config works.
Default parameters¶
Per provider:
| Parameter | Default | Meaning |
|---|---|---|
temperature |
0.3 | low = factual, high = creative |
max_tokens |
1024 | output limit |
top_p |
0.9 | nucleus sampling |
Adjustable in admin UI.
Fallback on provider outage¶
If the configured provider doesn't respond (Anthropic outage, Ollama container down):
- Chat widget shows „AI unavailable — retry later"
- Service analysis falls back to cached answers if any
- No auto-retry, no other provider — intentionally, to avoid surprise cloud costs
Schema¶
Config in ai_config table (migration 065+067):
| Field | Value |
|---|---|
provider |
ollama / anthropic / openai_compat |
chat_model |
model name |
embed_provider |
default ollama (even when chat runs elsewhere) |
embed_model |
nomic-embed-text |
temperature |
float |
api_url |
for external |
api_key |
encrypted |
Permission¶
ai.config for admins. Other users see only „AI active / inactive" without API key.
Next¶
- AI chat & analysis — what the provider is needed for
- Wiki — embeddings only useful when wiki articles exist