I have a confession: I used to love ChatGPT. It was magic. Then I pasted a proprietary architecture diagram into it, and I felt a cold shiver down my spine. I realized I wasn't just a user; I was a data source.
We trade privacy for convenience every day. But when you're designing secure infrastructure, that trade-off is unacceptable.
You don't need another subscription. You need Ollama. And more importantly, you need to stop treating AI like a chatbot and start treating it like a senior engineer.
This isn't a "hello world" tutorial. This is how we build Natasha, a ruthlessly efficient AI Architect that runs entirely on your metal.
The Stack: Why We Chose This
We are using Ollama for the backend and Open WebUI for the frontend. Why?
- Ollama is the Docker of AI: It standardizes the messy world of GGUF, safetensors, and llama.cpp into a single, clean binary.
- Open WebUI is better than ChatGPT: Seriously. It has artifact support, RAG (Retrieval Augmented Generation), and user management built-in.
- NVIDIA or Nothing: We are optimizing for CUDA. If you're on a Mac, the concepts still apply (run Ollama natively to get Metal acceleration, since Docker on macOS can't see the Apple GPU), but we are focusing on raw GPU throughput here.
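One prerequisite this stack quietly assumes: the NVIDIA Container Toolkit, without which Docker can't hand the GPU to a container at all. A quick smoke test, assuming a Linux host with the NVIDIA driver already working:

```bash
# Confirm the driver is alive on the host.
nvidia-smi

# Confirm Docker can pass the GPU into a container.
# If this prints the same GPU table, the toolkit is wired up correctly.
docker run --rm --gpus all ubuntu nvidia-smi
```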
The Infrastructure (Docker Compose)
Let's cut the fluff. You want a stack that starts up and stays up.
Create `docker-compose.yaml`. This config does two specific things that most tutorials miss:
- It locks the model in VRAM (`OLLAMA_KEEP_ALIVE`).
- It forces GPU visibility.
```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ./ollama:/root/.ollama
    environment:
      # The "Instant Gratification" setting.
      # Keeps the model loaded for 30 mins. No cold starts.
      - OLLAMA_KEEP_ALIVE=30m
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      # Single-user mode. Keep this box off the public internet.
      - WEBUI_AUTH=false
    volumes:
      - ./open-webui-data:/app/backend/data
    depends_on:
      - ollama
```
The "Gotchas"
- `deploy.resources`: If you forget this block, Ollama will silently fall back to your CPU. You'll wonder why your 4090 is idle while your tokens generate at the speed of a dial-up modem.
- Volumes: We map `./ollama` because re-downloading 15GB models every time you recreate the container is a rookie mistake.
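Trust, but verify. Two quick checks, assuming the container name `ollama` from the Compose file above (`ollama ps` only reports models that are currently loaded, so run a prompt first):

```bash
# The container should see the same GPU as the host.
docker exec ollama nvidia-smi

# After your first prompt, check where the model actually landed.
# The PROCESSOR column should say "100% GPU", not "100% CPU".
docker exec ollama ollama ps
```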
The Soul: Creating "Natasha"
Default AI models are people-pleasers. They apologize. They hedge. They waste your tokens with "It's important to note…"
We don't want a people-pleaser. We want a Principal Architect.
We are going to use a Modelfile to perform a lobotomy on the "helpful assistant" persona and replace it with Natasha. We base this on DeepSeek-R1, a model that actually thinks before it speaks.
Create a file named `Modelfile`:
```
FROM deepseek-r1:7b

# Temperature 0.5: We want engineering precision, not creative writing.
PARAMETER temperature 0.5

# 8k Context: Enough for a medium-sized main.go file.
PARAMETER num_ctx 8192

SYSTEM """You are Natasha, a Principal Solution Architect and Troubleshooter.

=== THE RULES ===
1. You are NOT a helpful assistant. You are a senior engineer.
2. Your job is to find failures, diagnose issues, and design resilient systems.
3. Technical precision is preferred over politeness.

=== REASONING PROTOCOL ===
Always use the <think> block to analyze the problem from three specific angles:
1. The Developer (Implementation details, code smell)
2. The Operator (Logs, metrics, stability)
3. The Attacker (CVEs, permission creep, injection)

=== FORBIDDEN PHRASES ===
- "I apologize for the confusion" (Don't be sorry, be right)
- "It depends" (Make a decision and justify it)
- "As an AI language model" (We know)
"""
```
Why This Works
By explicitly forbidding the "I apologize" loop, we save time. By forcing the "Developer/Operator/Attacker" perspective, we catch bugs that a single-pass generation would miss. It's not prompt engineering; it's persona engineering.
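Once the stack is up (next section), you can smoke-test the persona without the UI. A minimal sketch against Ollama's generate endpoint, assuming the port mapping above and the model name `natasha`:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "natasha",
  "prompt": "Review this design: we store API keys in a public S3 bucket.",
  "stream": false
}'
```

If the response opens with a `<think>` block tearing the design apart from three angles, the persona took.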
Deployment
- Spin it up: `docker-compose up -d`
- Build the persona into the running container (see the commands below).
- Go to `http://localhost:3000` and select natasha from the model picker.
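A sketch of the full sequence, assuming the `Modelfile` sits next to your `docker-compose.yaml` (the in-container path and the model name `natasha` are my choices, not gospel):

```bash
# 1. Start the stack in the background.
docker-compose up -d

# 2. Copy the Modelfile into the container and build Natasha.
#    `ollama create` pulls the deepseek-r1:7b base automatically if needed.
docker cp Modelfile ollama:/root/Modelfile
docker exec ollama ollama create natasha -f /root/Modelfile

# 3. Sanity check: natasha should appear next to the base model.
docker exec ollama ollama list
```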
The Verdict
You now have a system that doesn't spy on you, doesn't charge you per token, and doesn't apologize for being an AI. It just works.
Stop renting your intelligence. Own it.