Skip to content

AI provider setup

Three paths. Don't mix them — one provider per instance, embeddings can run separately.

Pros

  • Data stays in stack
  • No API cost
  • No internet needed

Prerequisites

  • 8 GB RAM for llama3.1:8b, 64 GB+ for llama3.1:70b
  • GPU strongly recommended (NVIDIA with nvidia-container-toolkit)
  • 10–80 GB disk for models

Activate

In .env:

AI_ENABLED=true

Start Ollama as compose profile:

docker compose -f /opt/vesana/docker-compose.prod.yml --profile ai up -d

GPU acceleration

In docker-compose.prod.yml (or override file):

ollama:
  image: ollama/ollama:latest
  profiles: ["ai"]
  deploy:
    resources:
      reservations:
        devices:
          - capabilities: [gpu]
  volumes:
    - ollama-models:/root/.ollama

Prerequisite on host: nvidia-container-toolkit installed.

Install model

/admin/ai:

  1. Provider: Ollama
  2. Model name: llama3.1:8b (or other)
  3. Install — progress streams via SSE
  4. Embedding model: nomic-embed-text, install too

Recommendations by RAM:

RAM Model Params
8 GB llama3.1:8b 8B
16 GB llama3.1:8b 8B (faster)
32 GB+ mistral or mixtral:8x7b 7B/56B
64 GB+ GPU llama3.1:70b 70B

Manage models

/admin/ai/models:

  • List installed models
  • Delete models (POST endpoint, not DELETE — axios.delete() with body unreliable)
  • Re-install models

Anthropic (cloud, top quality)

Prerequisites

  • Anthropic account (console.anthropic.com)
  • API key
  • Outbound to api.anthropic.com

Configuration

/admin/ai:

  1. Provider: Anthropic
  2. Set API key
  3. Pick model: claude-sonnet-4-6, claude-opus-4-7, …
  4. Embedding model separately (Ollama needed, see above)

Cost

Pay-as-you-go per token. Rule of thumb:

  • Service analysis: ~2 000 input + 500 output tokens → cents
  • Chat question: ~1 000–3 000 input + 200–800 output → cents

Many tenants can mean tens of euros per month — watch costs in Anthropic console.

External OpenAI-compatible provider

For vLLM, LM Studio, text-generation-inference, locally hosted cloud models:

/admin/ai:

  1. Provider: External
  2. API URL: http://gpu-server:8080/v1
  3. API key (if needed)
  4. Model name

The endpoint must be OpenAI-compatible (chat completion with messages array).

Test connection

After config: Test connection sends a small test prompt, shows:

  • Response time
  • Answer content (or error code)
  • Token counts (input + output)

Immediate feedback whether config works.

Default parameters

Per provider:

Parameter Default Meaning
temperature 0.3 low = factual, high = creative
max_tokens 1024 output limit
top_p 0.9 nucleus sampling

Adjustable in admin UI.

Fallback on provider outage

If the configured provider doesn't respond (Anthropic outage, Ollama container down):

  • Chat widget shows „AI unavailable — retry later"
  • Service analysis falls back to cached answers if any
  • No auto-retry, no other provider — intentionally, to avoid surprise cloud costs

Schema

Config in ai_config table (migration 065+067):

Field Value
provider ollama / anthropic / openai_compat
chat_model model name
embed_provider default ollama (even when chat runs elsewhere)
embed_model nomic-embed-text
temperature float
api_url for external
api_key encrypted

Permission

ai.config for admins. Other users see only „AI active / inactive" without API key.

Next

  • AI chat & analysis — what the provider is needed for
  • Wiki — embeddings only useful when wiki articles exist