Your team uses ChatGPT. Some people are pasting client contracts into it. Someone else dumped the company’s database schema in there to get SQL help. Legal has no idea this is happening, and neither does your CISO — yet.
This is the real reason to self-host your own AI stack. Not because it’s cheaper (though it is, once the hardware pays for itself). It’s because your data never leaves your network, you control which models run, and you can wire it straight into your existing directory service so access is tied to employment status like any other internal tool.
This guide gets you from zero to a production-hardened Open WebUI + Ollama deployment with LDAP/Active Directory authentication in a single working day. No cloud dependencies, no per-seat SaaS billing, no mystery about where your prompts end up.
Official repos you’ll need:
- Ollama: https://github.com/ollama/ollama
- Open WebUI: https://github.com/open-webui/open-webui
What You’re Building
The stack is straightforward:
- Ollama — handles model downloads, inference, and the OpenAI-compatible API. Runs entirely on your hardware.
- Open WebUI — the ChatGPT-like frontend. Conversation history, user management, model switching, RAG pipeline. Connects to Ollama over the internal Docker network.
- Nginx — TLS termination and reverse proxy. Keeps Ollama’s API off the internet entirely.
- Your existing LDAP/AD — Open WebUI talks to it directly at login time. No separate SSO service needed.
Ollama is never exposed externally. Open WebUI is the only surface reachable from the browser. Nginx enforces TLS and rate-limits the API endpoint.
Hardware Requirements
You need to be honest about this before buying anything. The GPU is the bottleneck, not the CPU.
| Model size | VRAM needed | Minimum RAM | Comfortable RAM |
|---|---|---|---|
| 7–8B (Mistral 7B, Llama 3 8B) | 6–8 GB | 16 GB | 16 GB |
| 13B (Llama 2 13B) | 10–12 GB | 32 GB | 32 GB |
| 32B (Qwen 2.5 32B) | 20–24 GB | 64 GB | 64 GB |
| 70B (Llama 3.3 70B) | 40–48 GB | 96 GB | 128 GB |
CPU-only inference works, but anything above 7B becomes genuinely painful for users. A single NVIDIA RTX 3090/4090 covers 7B and 13B models comfortably for a small team of 10–20 people. For larger teams or bigger models, look at used A100s or multi-GPU setups.
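The VRAM column can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate, not a measurement: it assumes weight memory is parameter count times bits per weight, plus a flat allowance for the runtime and a short context. The function name, the 4-bit default, and the 1.5 GB overhead figure are illustrative assumptions, not values taken from Ollama.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.0,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate in GB: quantized weights plus a flat allowance."""
    # 1e9 parameters * bits per weight / 8 bits per byte = GB of weights
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


if __name__ == "__main__":
    for size in (7, 13, 32, 70):
        low = estimate_vram_gb(size)                      # assumed 4-bit quantization
        high = estimate_vram_gb(size, bits_per_weight=8)  # assumed 8-bit quantization
        print(f"{size}B: ~{low:.1f} GB at 4-bit, ~{high:.1f} GB at 8-bit")
```

Under these assumptions the table's ranges fall between the 4-bit and 8-bit estimates for every row, which leaves headroom for longer contexts.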
Software prerequisites:
- Docker Engine 24+ and Docker Compose v2
- NVIDIA Container Toolkit if you’re using a GPU (nvidia-container-toolkit)
- A domain with a valid TLS cert (Let’s Encrypt is fine)
- LDAP/AD server reachable from the Docker host
Project Layout
ai-stack/
├── docker-compose.yml
├── .env
├── nginx/
│   ├── nginx.conf
│   └── ssl/
│       ├── fullchain.pem
│       └── privkey.pem
└── data/
    ├── ollama/
    └── openwebui/
Create it:
mkdir -p ai-stack/{nginx/ssl,data/ollama,data/openwebui}
cd ai-stack
The Docker Compose File
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    # Remove the deploy block entirely if you have no GPU
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data/ollama:/root/.ollama
    networks:
      - ai-internal
    # Ollama listens on 11434 but we DO NOT expose this port externally.
    # Open WebUI reaches it via the internal Docker network only.

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    depends_on:
      - ollama
    volumes:
      - ./data/openwebui:/app/backend/data
    networks:
      - ai-internal
    environment:
      # Point at Ollama over the internal network
      - OLLAMA_BASE_URL=http://ollama:11434
      # LDAP configuration
      - ENABLE_LDAP=${LDAP_ENABLED}
      - LDAP_SERVER_HOST=${LDAP_HOST}
      - LDAP_SERVER_PORT=${LDAP_PORT}
      - LDAP_USE_TLS=${LDAP_USE_TLS}
      - LDAP_CA_CERT_FILE=${LDAP_CA_CERT}
      - LDAP_ATTRIBUTE_FOR_USERNAME=${LDAP_UID_ATTR}
      - LDAP_APP_DN=${LDAP_BIND_DN}
      - LDAP_APP_PASSWORD=${LDAP_BIND_PASSWORD}
      - LDAP_SEARCH_BASE=${LDAP_SEARCH_BASE}
      - LDAP_SEARCH_FILTERS=${LDAP_SEARCH_FILTER}
      # Security
      - WEBUI_SECRET_KEY=${WEBUI_SECRET_KEY}
      - WEBUI_URL=https://${DOMAIN}
      # Disable open registration: LDAP users only
      - ENABLE_SIGNUP=false
      # Optional: default models for new users
      - DEFAULT_MODELS=${DEFAULT_MODELS}

  nginx:
    image: nginx:stable-alpine
    container_name: nginx
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/ssl:/etc/nginx/ssl:ro
    depends_on:
      - open-webui
    networks:
      - ai-internal

networks:
  ai-internal:
    driver: bridge
    # This network is internal: nothing in here is reachable from outside
    # except via the nginx container's published ports
The .env File
# .env — keep this out of git (add to .gitignore immediately)
DOMAIN=ai.yourcompany.com
# Generate with: openssl rand -hex 32
WEBUI_SECRET_KEY=replace_with_random_64_char_hex
# LDAP / Active Directory
LDAP_ENABLED=true
LDAP_HOST=ldap.yourcompany.com
LDAP_PORT=636
LDAP_USE_TLS=true
# Path inside the container if you mount a custom CA cert, otherwise leave empty
LDAP_CA_CERT=
# Bind account — use a read-only service account, never a domain admin
LDAP_BIND_DN=cn=svc-openwebui,ou=ServiceAccounts,dc=yourcompany,dc=com
LDAP_BIND_PASSWORD=your_service_account_password
LDAP_SEARCH_BASE=ou=Users,dc=yourcompany,dc=com
# Filter to an AD security group for granular access control
LDAP_SEARCH_FILTER=(memberOf=CN=AI-Users,ou=Groups,dc=yourcompany,dc=com)
# The LDAP attribute that becomes the username in Open WebUI
# Use 'uid' for OpenLDAP, 'sAMAccountName' for Active Directory
LDAP_UID_ATTR=sAMAccountName
DEFAULT_MODELS=llama3.2:latest
Gotcha — service account permissions: Never bind as a domain admin. Create a dedicated read-only service account with permissions scoped to search the user OU. If that account’s password leaks, the blast radius is a search query, not a domain compromise.
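Placeholder values left in .env are an easy mistake to ship. The following is a minimal sketch of a pre-flight check, assuming the variable names used in this guide; the function names and the placeholder list are hypothetical, and the parser handles only plain KEY=VALUE lines, not quoting or variable expansion.

```python
import secrets

# Substrings that indicate a value copied from this guide without being changed.
PLACEHOLDERS = ("replace_with", "your_service_account_password", "yourcompany")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so DNs such as cn=svc,dc=... stay intact.
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def check_env(env: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    for key, value in env.items():
        if any(p in value for p in PLACEHOLDERS):
            problems.append(f"{key} still contains a placeholder value")
    if len(env.get("WEBUI_SECRET_KEY", "")) < 64:
        problems.append("WEBUI_SECRET_KEY is shorter than 64 characters")
    if env.get("LDAP_PORT") == "389":
        problems.append("LDAP_PORT is 389; this guide expects LDAPS on 636")
    return problems


def new_secret_key() -> str:
    """Python equivalent of `openssl rand -hex 32`: 32 random bytes as 64 hex chars."""
    return secrets.token_hex(32)
```

A typical use would be `check_env(parse_env(open(".env").read()))` before the first `docker compose up`, failing the deploy if the list is not empty.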
Nginx Configuration
# nginx/nginx.conf
events {
    worker_connections 1024;
}

http {
    # Rate limiting: 10 req/s per IP, burst up to 20
    limit_req_zone $binary_remote_addr zone=webui:10m rate=10r/s;

    # Redirect all HTTP to HTTPS
    server {
        listen 80;
        server_name ai.yourcompany.com;
        return 301 https://$host$request_uri;
    }

    server {
        listen 443 ssl;
        server_name ai.yourcompany.com;

        ssl_certificate /etc/nginx/ssl/fullchain.pem;
        ssl_certificate_key /etc/nginx/ssl/privkey.pem;

        # Modern TLS only: drop anything below 1.2
        ssl_protocols TLSv1.2 TLSv1.3;
        ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384;
        ssl_prefer_server_ciphers off;
        ssl_session_cache shared:SSL:10m;

        # Security headers
        add_header Strict-Transport-Security "max-age=63072000; includeSubDomains" always;
        add_header X-Content-Type-Options nosniff;
        add_header X-Frame-Options SAMEORIGIN;
        add_header Referrer-Policy strict-origin-when-cross-origin;

        # Proxy to Open WebUI
        location / {
            limit_req zone=webui burst=20 nodelay;

            proxy_pass http://open-webui:8080;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # Required for Open WebUI's streaming responses
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";

            # Long timeout for slow model inference
            proxy_read_timeout 300s;
            proxy_send_timeout 300s;
        }
    }
}
Gotcha - streaming timeout: The default Nginx proxy timeout is 60 seconds. A 70B model generating a long response can easily blow past that, and the user gets a broken connection mid-stream. Set proxy_read_timeout to at least 300s. For very large models on slower hardware, go higher.
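To get a feel for what rate=10r/s with burst=20 nodelay allows, here is a simplified model of the leaky-bucket accounting that limit_req performs per client IP. It is a sketch for building intuition, not a port of the Nginx source: the function name is hypothetical, time is passed in as plain millisecond values, and only a single client is modelled.

```python
def simulate_limit_req(arrivals_ms, rate_per_s=10, burst=20):
    """Return an accept (True) / reject (False) decision per arrival time in ms."""
    decisions = []
    excess = 0.0    # requests currently queued in the bucket above the steady rate
    last_ms = None  # arrival time of the last accepted request
    for t in arrivals_ms:
        if last_ms is None:
            new_excess = 0.0
        else:
            # The bucket drains at the configured rate, then this request adds 1.
            drained = rate_per_s * (t - last_ms) / 1000.0
            new_excess = max(0.0, excess - drained + 1.0)
        if new_excess > burst:
            decisions.append(False)  # Nginx answers 503 here by default
            continue                 # a rejected request does not fill the bucket
        excess, last_ms = new_excess, t
        decisions.append(True)
    return decisions


if __name__ == "__main__":
    spike = simulate_limit_req([0] * 30)
    print("spike of 30:", sum(spike), "accepted,", spike.count(False), "rejected")
    steady = simulate_limit_req(range(0, 5000, 100))
    print("steady 10 req/s:", sum(steady), "of", len(steady), "accepted")
```

In this model a spike of 30 simultaneous requests ends with 21 accepted (the first request plus a burst of 20) and 9 rejected, while a steady 10 requests per second is never throttled. A chat UI loading many assets at once can hit the spike case, so raise burst if legitimate users see 503s.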
First Boot and Model Setup
Bring the stack up:
docker compose up -d
docker compose logs -f open-webui
Wait until you see Application startup complete in the Open WebUI logs.
Pull your first model. Connect to the Ollama container directly:
# Pull Llama 3.2 3B — fast, fits in 4GB VRAM, good for testing
docker exec -it ollama ollama pull llama3.2
# Pull Llama 3.1 8B for better quality
docker exec -it ollama ollama pull llama3.1:8b
# Verify what's available
docker exec -it ollama ollama list
Model files land in ./data/ollama/models. A 7B model is roughly 4 GB and downloads in a few minutes on a fast connection.
LDAP Wiring — What Actually Happens
Open WebUI does not cache LDAP credentials. Every login attempt binds to your directory using the service account, searches for the user, and if found, verifies the password against the directory directly. This means:
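- Disabling an account in AD takes effect at that user's next login attempt
- A fired employee cannot start a new session. A session that is already open stays valid until its token expires, so set a finite JWT_EXPIRES_IN if you need fast offboarding
- The LDAP_SEARCH_FILTER is your access control list: scope it to an AD group and only members of that group can log in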
For Active Directory, your filter will look like:
(memberOf=CN=AI-Users,OU=Groups,DC=yourcompany,DC=com)
For OpenLDAP with posixGroup membership:
(|(memberUid=%s)(uid=%s))
The %s is replaced with the username at query time by Open WebUI.
Gotcha - nested group membership in AD: AD’s memberOf attribute is not recursive by default at the LDAP level. If your user is in AI-Users via a nested group, the filter will fail. Use the LDAP_MATCHING_RULE_IN_CHAIN matching rule (OID 1.2.840.113556.1.4.1941) to handle nesting: (memberOf:1.2.840.113556.1.4.1941:=CN=AI-Users,OU=Groups,DC=yourcompany,DC=com).
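Filter strings are easy to get subtly wrong by hand, so it can help to generate them. The sketch below builds the direct and nested-group variants and escapes user-supplied values following RFC 4515. The helper names are hypothetical, and joining the username match and the access filter with (&...) is an assumption made for illustration; it is not a statement about how Open WebUI assembles its query internally.

```python
NESTED_GROUP_RULE = "1.2.840.113556.1.4.1941"  # LDAP_MATCHING_RULE_IN_CHAIN (AD only)

# Characters RFC 4515 requires to be escaped inside a filter assertion value.
_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\x00": r"\00"}


def escape_filter_value(value: str) -> str:
    """Escape one assertion value so it cannot change the filter's structure."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def group_filter(group_dn: str, nested: bool = True) -> str:
    """Build a memberOf filter; nested=True adds the AD in-chain matching rule."""
    attr = f"memberOf:{NESTED_GROUP_RULE}:" if nested else "memberOf"
    return f"({attr}={escape_filter_value(group_dn)})"


def login_filter(username_attr: str, username: str, access_filter: str) -> str:
    """Combine a username match with an access filter (assumed AND semantics)."""
    return f"(&({username_attr}={escape_filter_value(username)}){access_filter})"


if __name__ == "__main__":
    dn = "CN=AI-Users,OU=Groups,DC=yourcompany,DC=com"
    print(group_filter(dn))
    print(login_filter("sAMAccountName", "jdoe", group_filter(dn, nested=False)))
```

Pasting the printed filter into ldapsearch against your directory is a quick way to confirm it returns the users you expect before putting it in .env.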
First Admin Account
On the very first startup, navigate to https://ai.yourcompany.com and create the admin account manually; Open WebUI makes the first account created the administrator. This becomes the local fallback admin, so keep the credentials in your password manager. After that, make sure ENABLE_SIGNUP=false is set (in this guide it lives in the environment block of docker-compose.yml, not in .env) and restart the container if you changed it. All future logins go through LDAP.
The admin account can:
- Manage which models are visible to which users/groups
- Set per-user or per-group rate limits and context window sizes
- Enable or disable features like web search, image generation, document RAG
GPU Passthrough — Common Pitfalls
If you have an NVIDIA GPU, install the toolkit before bringing the stack up:
# Debian/Ubuntu
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify the container sees the GPU:
docker exec -it ollama nvidia-smi
Gotcha - GPU not visible after toolkit install: If nvidia-smi works on the host but fails inside the container, you likely forgot to restart the Docker daemon after nvidia-ctk runtime configure. Also double-check that your Docker Compose version supports the deploy.resources.reservations.devices syntax; anything below Compose v2 silently ignores it.
Gotcha - model loads into RAM instead of VRAM: Ollama logs [cpu] next to layer offloading when it can’t fit the model in VRAM. This isn’t an error, but inference will be 10–20x slower. Check docker exec -it ollama ollama ps; its PROCESSOR column shows how the loaded model is split between GPU and CPU.
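The same CPU/GPU split can be read programmatically from Ollama's /api/ps endpoint, which is handy for alerting. This is a minimal sketch that works on the JSON body only; it assumes each entry in "models" carries "name", "size", and "size_vram" byte counts, so verify the field names against your Ollama version. The function names are hypothetical, and how you fetch the JSON from inside the Docker network is left out.

```python
import json


def gpu_fraction(model: dict) -> float:
    """Fraction of a loaded model's memory resident in VRAM (0.0 to 1.0)."""
    size = model.get("size", 0)
    if size <= 0:
        return 0.0
    return model.get("size_vram", 0) / size


def summarize(ps_json: str) -> list[tuple[str, float]]:
    """Turn an /api/ps response body into (model name, GPU fraction) pairs."""
    data = json.loads(ps_json)
    return [(m.get("name", "?"), round(gpu_fraction(m), 2))
            for m in data.get("models", [])]


if __name__ == "__main__":
    # Made-up response body for illustration; real sizes are in bytes.
    sample = '{"models": [{"name": "llama3.1:8b", "size": 8000, "size_vram": 6000}]}'
    for name, frac in summarize(sample):
        note = "" if frac >= 0.99 else "  <- partly on CPU, expect slow inference"
        print(f"{name}: {frac:.0%} in VRAM{note}")
```

Anything well below 100% means part of the model is being served from system RAM, which is the slow path described above.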
Persistent Storage and Backups
The only two directories you need to back up:
./data/ollama/ — model weights (large, but re-downloadable if needed)
./data/openwebui/ — conversation history, user accounts, settings (irreplaceable)
The Open WebUI data directory contains a SQLite database (webui.db). Back it up with:
# Safe backup while container is running — SQLite WAL mode handles this fine
sqlite3 ./data/openwebui/webui.db ".backup '/backup/webui-$(date +%Y%m%d).db'"
Add that to a cron job. Daily is enough for most teams.
Model weights can be excluded from frequent backups — they’re large and re-downloadable. Just keep a note of which models you’re running (docker exec ollama ollama list).
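If the host has Python but no sqlite3 command-line tool, the standard library offers the same online backup mechanism. The following is a sketch under that assumption; the function name and the dated file naming (mirroring the command above) are illustrative choices.

```python
import sqlite3
from datetime import date
from pathlib import Path


def backup_sqlite(src: Path, dest_dir: Path) -> Path:
    """Copy a live SQLite database to dest_dir/webui-YYYYMMDD.db and return the path."""
    if not src.is_file():
        # sqlite3.connect would silently create an empty database otherwise.
        raise FileNotFoundError(src)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"webui-{date.today():%Y%m%d}.db"
    source = sqlite3.connect(str(src))
    target = sqlite3.connect(str(dest))
    try:
        # Connection.backup uses SQLite's online backup API, so the copy is
        # consistent even if the application writes during the backup.
        source.backup(target)
    finally:
        target.close()
        source.close()
    return dest


if __name__ == "__main__":
    print(backup_sqlite(Path("./data/openwebui/webui.db"), Path("/backup")))
```

Restoring is the reverse: stop the container, copy the dated file back over webui.db, and start the container again.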
Gotchas: The List
LDAP over TLS with self-signed certs. If your internal LDAP uses a certificate from your own CA, you need to mount that CA cert into the Open WebUI container and set LDAP_CA_CERT to its path. Without it, the TLS handshake fails silently and users just see "Invalid credentials" with nothing useful in the logs.
# In docker-compose.yml, under open-webui volumes:
- ./certs/internal-ca.pem:/certs/internal-ca.pem:ro
# In .env:
LDAP_CA_CERT=/certs/internal-ca.pem
Ollama listens on 0.0.0.0 inside the container. If you ever publish port 11434 in Docker Compose (maybe for debugging), it’s reachable from anywhere on the host’s network with zero authentication. Never publish that port in production. Ollama has no auth layer.
Open WebUI container restarts reset in-memory state, not DB state. Conversations and users persist in the SQLite DB in the mounted volume. But if you upgrade the image and the schema changes, you might need to run a migration. Check the Open WebUI release notes before pulling main in production.
Model context windows and RAM. Loading a 13B model with a 128K context window requires significantly more VRAM than the same model at 4K context. Open WebUI lets you set the context length per-session. Train your users to keep it reasonable or you’ll start seeing OOM errors on the GPU.
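The reason context length is so expensive is the key/value cache, which grows linearly with the number of tokens. A rough sketch of the arithmetic follows. The architecture numbers are assumptions for a Llama-2-13B-style model (40 layers, 40 KV heads, head dimension 128, 16-bit cache values); models that use grouped-query attention or a quantized cache need considerably less.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for one sequence at full context."""
    # Factor 2: one key vector and one value vector per layer, head, and token.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 2**30


if __name__ == "__main__":
    for ctx in (4096, 32768, 131072):
        print(f"{ctx:>6} tokens: ~{kv_cache_gib(40, 40, 128, ctx):.1f} GiB of KV cache")
```

Under these assumptions the cache is about 3.1 GiB at 4K context and about 100 GiB at 128K, on top of the weights, which is why an oversized context setting is a common cause of GPU out-of-memory errors.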
Production Hardening Checklist
- Nginx rate limiting configured (done above)
- HSTS header with includeSubDomains
- Ollama port NOT published in Docker Compose
- ENABLE_SIGNUP=false after admin account creation
- LDAP service account is read-only, scoped to user OU
- LDAP connection uses TLS (port 636, not 389)
- .env file has 600 permissions, excluded from git
- WEBUI_SECRET_KEY is random 64+ character hex
- Daily backup of ./data/openwebui/webui.db
- Log rotation configured for Docker container logs
- Firewall: only ports 80 and 443 open externally
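Some checklist items can be verified by a script rather than by eye. As one example, here is a small sketch that checks the .env permission item on a POSIX host; the function name is hypothetical, and the rule it applies (no permission bits for group or others) is one reasonable reading of "600".

```python
import os
import stat


def env_file_is_private(path: str) -> bool:
    """True if the file grants no permissions to group or others (e.g. mode 600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0


if __name__ == "__main__":
    path = ".env"
    if env_file_is_private(path):
        print(f"{path}: permissions OK")
    else:
        print(f"{path}: readable by group or others, run: chmod 600 {path}")
```

Running it from the ai-stack directory as part of a deploy script catches a loosened .env before the stack starts.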
For log rotation, add this to /etc/docker/daemon.json:
{
"log-driver": "json-file",
"log-opts": {
"max-size": "50m",
"max-file": "5"
}
}
Restart Docker after changing daemon config.
Day-Two Operations
Adding a new model: docker exec -it ollama ollama pull modelname. It appears in Open WebUI immediately. No restart needed.
Restricting model access by user role: Open WebUI has a role system — users can be Admins, regular Users, or pending. You can expose specific models only to certain roles from the admin panel under Workspace → Models.
Monitoring inference load: docker exec -it ollama ollama ps shows currently loaded models and their GPU/CPU split. For proper metrics, Ollama’s HTTP API exposes /api/version and /api/ps (the loaded models and their memory use); wire those to Prometheus with a community exporter if you care about utilization dashboards.
Upgrading: Pull the new image, bring down the stack, bring it back up. Open WebUI applies database schema migrations automatically on startup, so back up webui.db first and watch the logs on first boot after an upgrade.
docker compose pull
docker compose down
docker compose up -d
docker compose logs -f open-webui
This stack handles a team of 20–50 comfortably on a single well-specced server. Your data stays on your hardware, access is tied to your existing directory, and you’re not dependent on any third-party uptime or pricing decisions. The whole thing costs whatever your server hardware costs, which after the first year is essentially nothing compared to per-seat SaaS pricing at scale.