If you've ever tried to serve an LLM to more than one user at a time using standard tools, you've seen the problem. It's the "queue of death." User A asks a question, and User B waits. And waits.
It's not your GPU's fault. It's memory management.
Traditional inference engines handle memory like a messy teenager: they reserve a huge contiguous slab of VRAM for every request, sized for the longest response it might ever produce, "just in case." Most of that space sits empty, and the gaps between slabs can't be reused. This is fragmentation, and it kills concurrency.
Enter vLLM. It doesn't just manage memory better; it fundamentally changes the physics of LLM serving.
The Secret Sauce: PagedAttention
vLLM steals a trick from operating systems: paging. Instead of allocating one contiguous block of memory per request for the KV cache (the attention keys and values for every token seen so far), it breaks the cache into small, non-contiguous pages.
- The Result? Near-zero memory waste.
- The Impact? Compared with naive contiguous allocation, you can fit 10x, sometimes 20x more concurrent requests on the same card.
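To make that concrete, here is a toy sketch in plain Python (not vLLM's internals) of the bookkeeping PagedAttention relies on: a shared pool of fixed-size physical blocks is handed out on demand, and each sequence keeps a small block table mapping its logical positions to whichever physical blocks it was given. The names and pool size are illustrative.

# Toy model of paged KV-cache bookkeeping. Real vLLM tracks actual key/value
# tensors in GPU memory; here we only track which sequence owns which block.
BLOCK_SIZE = 16                          # tokens per block (vLLM's default)

free_blocks = list(range(1024))          # shared pool of physical block ids
block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks, in logical order

def block_for_token(seq_id: int, token_pos: int) -> int:
    """Return the physical block holding this token, allocating a new one
    from the shared pool only when the sequence crosses a block boundary."""
    table = block_tables.setdefault(seq_id, [])
    logical_block = token_pos // BLOCK_SIZE
    if logical_block == len(table):       # sequence just grew past its last block
        table.append(free_blocks.pop())   # any free block will do: no contiguity needed
    return table[logical_block]

def release(seq_id: int) -> None:
    """When a request finishes, its blocks go straight back to the pool."""
    free_blocks.extend(block_tables.pop(seq_id, []))

Because blocks are only allocated as a sequence actually grows, the worst-case waste per request is less than one block, which is why so many sequences fit on one card at once.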
If you are building an API, vLLM isn't optional. It's the standard.
The "Speed Demon" Configuration
We aren't just going to run vLLM. We are going to tune it for maximum throughput on consumer hardware.
We will use GPTQ quantization. Why? Because LLM inference, especially token-by-token decoding, is memory-bandwidth bound. Moving 16-bit weights from VRAM to the compute cores is the bottleneck. Moving 4-bit weights cuts that traffic by roughly 4x.
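A quick back-of-envelope illustrates why. The numbers are rough and ignore KV-cache traffic and quantization metadata:

# Rough weight-traffic math for a ~3B-parameter model. At small batch sizes,
# every generated token has to stream most of the weights from VRAM.
params = 3e9
fp16_gb = params * 2 / 1e9      # 16-bit weights: ~6 GB read per token
int4_gb = params * 0.5 / 1e9    # 4-bit weights:  ~1.5 GB read per token
print(f"FP16: {fp16_gb:.1f} GB/token, INT4: {int4_gb:.1f} GB/token, "
      f"~{fp16_gb / int4_gb:.0f}x less traffic")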
Here is the docker-compose.yaml that turns your GPU into a token factory:
services:
  vllm:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: vllm
    restart: unless-stopped
    runtime: nvidia
    ports:
      - "8000:8000"
    volumes:
      - ./cache:/home/appuser/.cache/huggingface
    ipc: host
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      --model Qwen/Qwen2.5-3B-Instruct-GPTQ-Int4
      --gpu-memory-utilization 0.78
      --max-model-len 8192
      --max-num-seqs 256
      --enforce-eager
      --dtype half
      --api-key vllm-secret-key
Dissecting the Flags
Every flag here is a deliberate performance choice.
1. ipc: host (The Widowmaker)
If you ignore everything else, remember this. PyTorch uses shared memory to pass tensors between worker processes, and Docker's default /dev/shm allocation is 64 MB, which is a joke for AI workloads. Without ipc: host, your container will crash randomly with cryptic shared-memory errors, and you will lose sleep debugging it.
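If you want to see the difference for yourself, a quick check from inside the container (assuming Python is available in your image) shows what it actually got:

# Print the size of the container's shared-memory filesystem.
# Without ipc: host you'll typically see Docker's 64 MiB default;
# with it, you see the host's /dev/shm (usually half of system RAM).
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {total / 2**20:.0f} MiB total, {free / 2**20:.0f} MiB free")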
2. --gpu-memory-utilization 0.78
vLLM is greedy. It wants 90% of your VRAM by default. On a dedicated server, that's fine. On a machine running anything else (like a desktop environment), it causes an OOM crash. We cap it at 78% to keep the system stable.
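For a sense of scale, here is the math on a hypothetical 24 GB card (illustrative only, adjust for your GPU):

# What 0.78 means in practice on a 24 GB GPU.
vram_gb = 24.0
vllm_budget = 0.78 * vram_gb        # ~18.7 GB for weights + KV cache + activations
leftover = vram_gb - vllm_budget    # ~5.3 GB left for the desktop, other processes, etc.
print(f"vLLM claims {vllm_budget:.1f} GB, leaving {leftover:.1f} GB for everything else")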
3. --max-num-seqs 256
This is the magic number. It tells vLLM, "I want you to handle up to 256 concurrent sequences." Thanks to PagedAttention, this is actually possible on modest hardware.
Benchmarking the Beast
vLLM speaks "OpenAI." It exposes the same /v1/chat/completions endpoint, so you can swap it into your existing codebase without changing anything but the base URL and API key.
The "Is it working?" Test:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer vllm-secret-key" \
  -d '{
    "model": "Qwen/Qwen2.5-3B-Instruct-GPTQ-Int4",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain PagedAttention like I am five."}
    ],
    "temperature": 0.7
  }'
Conclusion
Stop using llama.cpp for your production API. Stop using unoptimized Python scripts.
vLLM is the difference between a toy project and a scalable platform. It forces your hardware to work as hard as you do.