If you've ever tried to serve an LLM to more than one user at a time using standard tools, you've seen the problem. It's the "queue of death." User A asks a question, and User B waits. And waits.
It's not your GPU's fault. It's memory management.
Traditional inference engines handle memory like a messy teenager: they reserve a huge contiguous slab of VRAM for every request, sized for the longest response it might ever produce, "just in case." Most of that space sits empty, and the gaps between slabs can't be reused. This is fragmentation, and it kills concurrency.
Enter vLLM. It doesn't just manage memory better; it fundamentally changes the physics of LLM serving.
The Secret Sauce: PagedAttention
vLLM steals a trick from operating systems: paging. Instead of allocating one contiguous block of memory per request for the KV cache (the attention keys and values for every token seen so far), it breaks the cache into small, non-contiguous pages.
- The Result? Near-zero memory waste.
- The Impact? Compared with naive contiguous allocation, you can fit 10x, sometimes 20x more concurrent requests on the same card.
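To make that concrete, here is a toy sketch in plain Python (not vLLM's internals) of the bookkeeping PagedAttention relies on: a shared pool of fixed-size physical blocks is handed out on demand, and each sequence keeps a small block table mapping its logical positions to whichever physical blocks it was given. The names and pool size are illustrative.

# Toy model of paged KV-cache bookkeeping. Real vLLM tracks actual key/value
# tensors in GPU memory; here we only track which sequence owns which block.
BLOCK_SIZE = 16                          # tokens per block (vLLM's default)

free_blocks = list(range(1024))          # shared pool of physical block ids
block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks, in logical order

def block_for_token(seq_id: int, token_pos: int) -> int:
    """Return the physical block holding this token, allocating a new one
    from the shared pool only when the sequence crosses a block boundary."""
    table = block_tables.setdefault(seq_id, [])
    logical_block = token_pos // BLOCK_SIZE
    if logical_block == len(table):       # sequence just grew past its last block
        table.append(free_blocks.pop())   # any free block will do: no contiguity needed
    return table[logical_block]

def release(seq_id: int) -> None:
    """When a request finishes, its blocks go straight back to the pool."""
    free_blocks.extend(block_tables.pop(seq_id, []))

Because blocks are only allocated as a sequence actually grows, the worst-case waste per request is less than one block, which is why so many sequences fit on one card at once.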
If you are building an API, vLLM isn't optional. It's the standard.
The "Speed Demon" Configuration
We aren't just going to run vLLM. We are going to tune it for maximum throughput on consumer hardware.
We will use GPTQ quantization. Why? Because LLM inference, especially token-by-token decoding, is memory-bandwidth bound. Moving 16-bit weights from VRAM to the compute cores is the bottleneck. Moving 4-bit weights cuts that traffic by roughly 4x.
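A quick back-of-envelope illustrates why. The numbers are rough and ignore KV-cache traffic and quantization metadata:

# Rough weight-traffic math for a ~3B-parameter model. At small batch sizes,
# every generated token has to stream most of the weights from VRAM.
params = 3e9
fp16_gb = params * 2 / 1e9      # 16-bit weights: ~6 GB read per token
int4_gb = params * 0.5 / 1e9    # 4-bit weights:  ~1.5 GB read per token
print(f"FP16: {fp16_gb:.1f} GB/token, INT4: {int4_gb:.1f} GB/token, "
      f"~{fp16_gb / int4_gb:.0f}x less traffic")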
Here is the docker-compose.yaml that turns your GPU into a token factory:
services:
  vllm:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: vllm
    restart: unless-stopped
    runtime: nvidia
    ports:
      - "8000:8000"
    volumes:
      - ./cache:/home/appuser/.cache/huggingface
    ipc: host
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      --model Qwen/Qwen2.5-3B-Instruct-GPTQ-Int4
      --gpu-memory-utilization 0.78
      --max-model-len 8192
      --max-num-seqs 256
      --enforce-eager
      --dtype half
      --api-key vllm-secret-key
Dissecting the Flags
Every flag here is a deliberate performance choice.
1. ipc: host (The Widowmaker)
If you ignore everything else, remember this. PyTorch uses shared memory to pass tensors between worker processes, and Docker's default /dev/shm allocation is 64 MB, which is a joke for AI workloads. Without ipc: host, your container will crash randomly with cryptic shared-memory errors, and you will lose sleep debugging it.
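If you want to see the difference for yourself, a quick check from inside the container (assuming Python is available in your image) shows what it actually got:

# Print the size of the container's shared-memory filesystem.
# Without ipc: host you'll typically see Docker's 64 MiB default;
# with it, you see the host's /dev/shm (usually half of system RAM).
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {total / 2**20:.0f} MiB total, {free / 2**20:.0f} MiB free")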
2. --gpu-memory-utilization 0.78
vLLM is greedy. It wants 90% of your VRAM by default. On a dedicated server, that's fine. On a machine running anything else (like a desktop environment), it causes an OOM crash. We cap it at 78% to keep the system stable.
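For a sense of scale, here is the math on a hypothetical 24 GB card (illustrative only, adjust for your GPU):

# What 0.78 means in practice on a 24 GB GPU.
vram_gb = 24.0
vllm_budget = 0.78 * vram_gb        # ~18.7 GB for weights + KV cache + activations
leftover = vram_gb - vllm_budget    # ~5.3 GB left for the desktop, other processes, etc.
print(f"vLLM claims {vllm_budget:.1f} GB, leaving {leftover:.1f} GB for everything else")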
3. --max-num-seqs 256
This is the magic number. It tells vLLM, "I want you to handle up to 256 concurrent sequences." Thanks to PagedAttention, this is actually possible on modest hardware.
Benchmarking the Beast
vLLM speaks "OpenAI." It exposes the same /v1/chat/completions endpoint, so you can swap it into your existing codebase without changing anything but the base URL and API key.
The "Is it working?" Test:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer vllm-secret-key" \
  -d '{
    "model": "Qwen/Qwen2.5-3B-Instruct-GPTQ-Int4",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain PagedAttention like I am five."}
    ],
    "temperature": 0.7
  }'
Conclusion
Stop using llama.cpp for your production API. Stop using unoptimized Python scripts.
vLLM is the difference between a toy project and a scalable platform. It forces your hardware to work as hard as you do.