Self-Hosted LLM Cost Engineering: Token Economics, Prefix Caching, and KV-Cache Sizing

You spun up a GPU server, deployed a model, and now you’re watching inference latency climb under real load while half your VRAM sits wasted. Welcome to the gap between "it runs" and "it runs efficiently."

Cloud providers bill you per token and quietly absorb this complexity. When you self-host, that complexity lands directly on your hardware budget. The good news: the levers are all exposed, and the math is not complicated once you understand what’s actually happening inside the serving stack.

This article covers three tightly related topics: the economics of token computation, how to size your KV cache correctly, and how prefix caching turns shared prompt prefixes into free inference. All with real numbers, real configs, and real gotchas.

What a Token Actually Costs You

LLM inference has two distinct phases with completely different performance characteristics: prefill and decode.

During prefill, the model processes your entire input prompt in parallel. It’s compute-bound — you’re doing a massive matrix multiply across all input tokens at once. GPU utilization spikes. This phase is fast per token but scales with prompt length.

During decode, the model generates one token at a time, autoregressively. Each step loads the entire model’s weights from VRAM into compute units to produce a single output token. This is memory-bandwidth-bound, not compute-bound. Your expensive A100s are often waiting on VRAM reads, not doing matrix math.

This distinction matters enormously for cost engineering:

Long system prompts + short answers = prefill-heavy workload. Optimize for batch size and KV reuse.
Short prompts + long generated text = decode-heavy workload. Optimize for memory bandwidth, smaller models, quantization.
Mixed workloads = need continuous batching and careful queue management.

If you’re serving a RAG application where every request shares a 2000-token system prompt and retrieves 3000 tokens of context, you’re burning 5000 tokens of prefill compute per request. Multiply that by your QPS. That’s where prefix caching pays off.
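To put rough numbers on the two phases, here is a back-of-envelope sketch. It assumes A100 80GB peak specs (312 TFLOP/s dense bf16, about 2 TB/s of memory bandwidth), roughly 2 FLOPs per parameter per token, and it ignores attention FLOPs and KV cache reads. Treat the results as lower bounds, not measurements.

```python
# Back-of-envelope phase costs for Llama 3 8B.
# Hardware figures are ASSUMED peak specs for an A100 80GB; sustained
# numbers on a real server will be lower.
PARAMS = 8e9             # approximate parameter count
BYTES_PER_PARAM = 2      # bf16 weights
PEAK_FLOPS = 312e12      # assumed dense bf16 peak, FLOP/s
MEM_BANDWIDTH = 2.0e12   # assumed memory bandwidth, bytes/s


def prefill_seconds(prompt_tokens):
    """Compute-bound floor: about 2 FLOPs per parameter per prompt token."""
    return 2 * PARAMS * prompt_tokens / PEAK_FLOPS


def decode_step_seconds():
    """Bandwidth-bound floor: every weight is read once per decode step."""
    return PARAMS * BYTES_PER_PARAM / MEM_BANDWIDTH


print(f"Prefill, 5000 tokens: {prefill_seconds(5000) * 1000:.0f} ms")
print(f"Decode, per step:     {decode_step_seconds() * 1000:.1f} ms")
```

The decode floor is paid once per step regardless of how many sequences share that step, which is why batching helps decode so much.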

The KV Cache: What It Is and Why It Eats Your VRAM

During prefill, for every token in the sequence, the transformer computes Key and Value matrices in every attention layer. These get stored so that each new decode step doesn’t have to recompute attention over the entire previous context — it just appends the new token’s KV pair and attends over the cached history.

Without this cache, generating a 1000-token response would require re-processing the full growing context on every single decode step. The cache trades VRAM for time.

The size formula for the KV cache is:

KV cache bytes per token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element

The factor of 2 is for K and V. Let’s run this for a few common models:

Llama 3 8B (32 layers, 8 KV heads, 128 head_dim, fp16):

2 × 32 × 8 × 128 × 2 = 131,072 bytes = 128 KB per token

Llama 3 70B (80 layers, 8 KV heads, 128 head_dim, fp16):

2 × 80 × 8 × 128 × 2 = 327,680 bytes = 320 KB per token

Llama 3.1 405B (126 layers, 8 KV heads, 128 head_dim, fp16):

2 × 126 × 8 × 128 × 2 = 516,096 bytes ≈ 504 KB per token

At a 32K context window with 8B: 128 KB × 32,768 = 4 GB per single full-context sequence. On a 24 GB GPU with the model itself taking ~16 GB, you have ~8 GB for KV cache — that’s two concurrent max-context requests, or many more short-context ones.

This is why context length is not a dial you crank up without thinking.
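If you want to rerun this arithmetic for another model, a small helper is enough. This is a minimal sketch of the formula above; the function name is illustrative and the layer and head counts are the ones quoted in this section.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """KV cache bytes per token: K and V, for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# (layers, KV heads, head_dim) as quoted above, fp16 cache
MODELS = {
    "Llama 3 8B": (32, 8, 128),
    "Llama 3 70B": (80, 8, 128),
    "Llama 3.1 405B": (126, 8, 128),
}

for name, shape in MODELS.items():
    per_token = kv_bytes_per_token(*shape)
    full_ctx_gib = per_token * 32768 / 1024**3
    print(f"{name}: {per_token // 1024} KB/token, "
          f"{full_ctx_gib:.2f} GiB for one 32K-token sequence")
```

The per-sequence figure grows linearly with both context length and layer count, so a 32K sequence on 70B already needs 10 GiB of cache on its own.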

Sizing the KV Cache in vLLM

vLLM is the de facto standard for production LLM serving. Official repo: https://github.com/vllm-project/vllm

vLLM manages KV cache through a paged attention mechanism — it allocates VRAM in fixed-size blocks (default 16 tokens per block) and handles them like virtual memory pages. This enables flexible batching without requiring contiguous memory for each sequence.
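The block accounting is easy to reason about with a toy model. The sketch below is not vLLM code; it only illustrates how many 16-token blocks a sequence occupies and how little is wasted compared with reserving a full max-length region per sequence.

```python
import math

BLOCK_SIZE = 16  # tokens per block, the default mentioned above


def blocks_needed(num_tokens, block_size=BLOCK_SIZE):
    """Number of fixed-size blocks a sequence of num_tokens occupies."""
    return math.ceil(num_tokens / block_size)


def wasted_slots(num_tokens, block_size=BLOCK_SIZE):
    """Unused token slots in the sequence's last, partially filled block."""
    return blocks_needed(num_tokens, block_size) * block_size - num_tokens


seq_len = 1000
print(f"{seq_len} tokens -> {blocks_needed(seq_len)} blocks, "
      f"{wasted_slots(seq_len)} slots wasted")
print(f"Contiguous 16384-token reservation would waste {16384 - seq_len} slots")
```

With paging, waste is bounded by one block per sequence, so short requests no longer pay for context they never use.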

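The critical parameter is gpu_memory_utilization, the fraction of each GPU’s VRAM that vLLM is allowed to use. vLLM runs a profiling step at startup: it loads the weights, measures peak activation memory with a dummy forward pass, and allocates whatever remains of that budget as KV cache blocks.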

# docker-compose.yml for vLLM serving
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    ports:
      - "8000:8000"
    volumes:
      - /data/models:/models
    # The vllm/vllm-openai image's entrypoint already starts the API server,
    # so command only needs to carry the flags.
    command: >
      --model /models/Meta-Llama-3-8B-Instruct
      --dtype bfloat16
      --max-model-len 16384
      --gpu-memory-utilization 0.88
      --block-size 16
      --max-num-seqs 128
      --enable-prefix-caching
      --port 8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Working out --max-model-len:

Don’t just set this to the model’s maximum supported context. Calculate what you can actually serve:

# Quick estimate script
python3 - <<'EOF'
model_vram_gb = 16.5      # measure with nvidia-smi after loading weights
total_vram_gb = 24.0
gpu_util = 0.88
kv_bytes_per_token = 128 * 1024   # 128 KB for 8B in bf16

available_for_kv = (total_vram_gb * gpu_util - model_vram_gb) * 1024**3
max_concurrent = 32  # target concurrent sequences

tokens_per_seq = int(available_for_kv / (kv_bytes_per_token * max_concurrent))
print(f"Available for KV cache: {available_for_kv/1024**3:.1f} GB")
print(f"Max context per sequence at {max_concurrent} concurrent: {tokens_per_seq} tokens")
EOF

Gotcha: gpu_memory_utilization applies to each GPU in a tensor-parallel setup, not the aggregate. On a 2×A100 80GB setup with tensor parallelism, you have 2×80 GB available, but vLLM profiles per-GPU.
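A per-GPU version of the estimate makes this concrete. The sketch below assumes tensor parallelism splits both the weights and the KV heads evenly across GPUs (true when the number of KV heads is divisible by the TP size) and ignores activation memory. The 131 GiB weight figure for 70B in bf16 is an approximation.

```python
def per_gpu_kv_tokens(gpu_vram_gib, gpu_util, weights_gib_total,
                      kv_bytes_per_token, tp_size):
    """Tokens of KV cache that fit when weights and KV heads are sharded."""
    budget_bytes = (gpu_vram_gib * gpu_util - weights_gib_total / tp_size) * 1024**3
    kv_per_token_per_gpu = kv_bytes_per_token / tp_size  # each GPU holds 1/tp of the heads
    return max(0, int(budget_bytes / kv_per_token_per_gpu))


# Llama 3 70B in bf16 on 2 x 80 GB with tensor parallelism (assumed figures)
tokens = per_gpu_kv_tokens(
    gpu_vram_gib=80, gpu_util=0.90, weights_gib_total=131,
    kv_bytes_per_token=327680, tp_size=2,
)
print(f"KV cache capacity: about {tokens} tokens across all sequences")
```

Under these assumptions the pair of GPUs holds about 42,600 tokens in total, far less than "160 GB of VRAM" suggests, because the weights eat most of each card’s budget.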

Prefix Caching: Making Shared Prompts Free

Prefix caching (also called prompt caching; SGLang’s variant is known as RadixAttention) is the single highest-leverage optimization for workloads with shared prompt prefixes: system prompts, few-shot examples, RAG context preambles.

The mechanism: vLLM hashes the token IDs of each KV cache block. When a new request comes in, it checks whether the prefix blocks are already cached from a previous request. If they are, it skips prefill for those tokens entirely and reuses the cached KV pairs.

The win is not just time — it’s also memory efficiency. Multiple concurrent requests sharing the same system prompt share the same physical KV cache blocks. That system prompt’s memory is paid once.
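To see why block hashing gives you prefix reuse, here is a deliberately simplified sketch. It is not vLLM’s actual hashing scheme; it only assumes that each full block’s hash covers its own tokens plus the hash of the block before it, so a block matches only if everything before it matches too.

```python
import hashlib

BLOCK_SIZE = 16


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block together with its parent block's hash."""
    hashes = []
    parent = b""
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


def cached_prefix_blocks(new_hashes, cache):
    """Count leading blocks of a new request that are already cached."""
    hits = 0
    for h in new_hashes:
        if h not in cache:
            break
        hits += 1
    return hits


system_prompt = list(range(64))                      # 4 full blocks of fake token IDs
request_a = system_prompt + list(range(1000, 1020))
request_b = system_prompt + list(range(2000, 2020))
shifted = [999] + request_b                          # one extra token at position 0

cache = set(block_hashes(request_a))
print(cached_prefix_blocks(block_hashes(request_b), cache))  # 4 shared blocks
print(cached_prefix_blocks(block_hashes(shifted), cache))    # 0, every block shifted
```

The second result is the important one: a single token inserted at the front changes every block boundary and every hash after it.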

Enable it with --enable-prefix-caching. On vLLM 0.4+, there’s also --enable-chunked-prefill which works well alongside it.

To verify caching is working, hit the metrics endpoint:

curl -s http://localhost:8000/metrics | grep prefix_cache
# vllm:cpu_prefix_cache_queries_total
# vllm:cpu_prefix_cache_hits_total
# vllm:gpu_prefix_cache_queries_total
# vllm:gpu_prefix_cache_hits_total

Cache hit rate above 80% for a RAG or chatbot workload means you’ve essentially eliminated prefill cost for your system prompt. Below 20% usually means your prompts are too dynamic or the cache is being thrashed by a KV cache budget that is too small for your concurrency or by high sequence turnover.
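The counters are cumulative, so the hit rate is simply hits divided by queries. Below is a minimal sketch that parses the Prometheus text format by hand; the sample payload and its numbers are made up for illustration, and the metric names are the ones listed above, which may differ in your vLLM version.

```python
def counter_value(metrics_text, name):
    """Sum all samples of one counter in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        metric, _, value = line.rpartition(" ")
        if metric.split("{", 1)[0] == name:
            total += float(value)
    return total


def hit_rate(metrics_text, prefix="vllm:gpu_prefix_cache"):
    """Prefix cache hit rate since server start (0.0 if no queries yet)."""
    queries = counter_value(metrics_text, f"{prefix}_queries_total")
    hits = counter_value(metrics_text, f"{prefix}_hits_total")
    return hits / queries if queries else 0.0


# Illustrative payload, not real server output
SAMPLE = """\
# TYPE vllm:gpu_prefix_cache_queries_total counter
vllm:gpu_prefix_cache_queries_total{model_name="llama"} 5000.0
vllm:gpu_prefix_cache_hits_total{model_name="llama"} 4100.0
"""

print(f"Hit rate: {hit_rate(SAMPLE):.0%}")
```

For a dashboard you would compare deltas between two scrapes rather than lifetime totals, so that a good morning does not hide a bad afternoon.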

Gotcha: Prefix caching only works if the prefix is identical token for token. Even a single whitespace difference at position 0 invalidates the entire cache chain. If you’re dynamically injecting dates, user names, or request IDs into your system prompt, move those to the end, after the static prefix. This is not optional advice: it’s the difference between an 85% cache hit rate and 0%.
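The effect of prompt layout is easy to demonstrate. This sketch uses whitespace splitting as a stand-in for a real tokenizer, and the prompt text and helper names are hypothetical; only the comparison between the two layouts matters.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical static system prompt (15 whitespace-separated "tokens")
STATIC = "You are a support assistant for ExampleCo . Answer briefly and cite the docs ."


def prompt_dynamic_first(user, date):
    return f"Date: {date} User: {user} {STATIC}".split()


def prompt_static_first(user, date):
    return f"{STATIC} Date: {date} User: {user}".split()


a, b = ("alice", "2024-05-01"), ("bob", "2024-05-02")
print(shared_prefix_len(prompt_dynamic_first(*a), prompt_dynamic_first(*b)))  # 1
print(shared_prefix_len(prompt_static_first(*a), prompt_static_first(*b)))    # 16
```

With the dynamic fields first, only the literal "Date:" token is shared. With the static text first, the whole system prompt is reusable.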

Gotcha: Prefix cache blocks are evicted under memory pressure using LRU. If you’re running close to VRAM capacity with high concurrency, you may see cache hit rates collapse during traffic spikes. Monitor gpu_prefix_cache_hits_total against QPS in your metrics stack.

KV Cache Quantization: The 2x Free Lunch

Standard KV cache is stored in the same dtype as your model weights (bf16 or fp16). But KV activations are typically less sensitive to quantization than weights — you can store them in fp8 or even int8 with minimal quality degradation.

vLLM supports KV cache quantization via --kv-cache-dtype:

python -m vllm.entrypoints.openai.api_server \
  --model /models/Meta-Llama-3-8B-Instruct \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Halving KV cache memory (fp16 → fp8) means you can either:

Double your concurrent sequence capacity at the same context length, or
Double your max context length at the same concurrency

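For a 24 GB GPU running 8B, fp8 halves the per-token cost from 128 KB to 64 KB. On a 6-8 GB KV budget that saves 3-4 GB for the same token count, or equivalently doubles capacity from roughly 49,000-65,000 tokens to 98,000-131,000. That’s meaningful.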
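The capacity change is a one-line calculation. This sketch assumes an 8 GiB KV budget on the 8B model and ignores any small overhead for quantization scaling factors.

```python
def token_capacity(kv_budget_gib, kv_bytes_per_token):
    """Total tokens of KV cache that fit in the given budget."""
    return int(kv_budget_gib * 1024**3 // kv_bytes_per_token)


# Llama 3 8B: 2 x 32 layers x 8 KV heads x 128 head_dim x bytes per element
FP16_PER_TOKEN = 2 * 32 * 8 * 128 * 2   # 128 KB
FP8_PER_TOKEN = 2 * 32 * 8 * 128 * 1    # 64 KB

budget_gib = 8  # assumed KV budget
fp16_tokens = token_capacity(budget_gib, FP16_PER_TOKEN)
fp8_tokens = token_capacity(budget_gib, FP8_PER_TOKEN)
print(f"fp16: {fp16_tokens} tokens, fp8: {fp8_tokens} tokens, "
      f"gain: {fp8_tokens - fp16_tokens}")
```

Whether you spend the extra slots on more concurrent sequences or on longer contexts is a scheduling decision, not a memory one.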

Gotcha: fp8 KV cache requires compute capability 8.9+ (Ada Lovelace cards such as the RTX 4090 and L40S, or Hopper cards such as the H100). On Ampere (A100, RTX 3090), you’re limited to a 16-bit KV cache. Check before you commit to a hardware budget.

llama.cpp: The Memory Math Is Different

For CPU inference or mixed CPU/GPU offloading, llama.cpp exposes KV cache control through -c (context size) and --n-gpu-layers.

The KV cache in llama.cpp is allocated per slot (each slot is one concurrent sequence):

# Calculate approximate KV cache per slot for Llama 3 8B Q4_K_M
# KV cache dtype defaults to f16 regardless of weight quantization
# (it can be changed with --cache-type-k / --cache-type-v)
# Formula same as above: 2 × layers × kv_heads × head_dim × 2 bytes × context

python3 -c "
layers, kv_heads, head_dim, ctx = 32, 8, 128, 8192
kv_per_slot_gb = 2 * layers * kv_heads * head_dim * 2 * ctx / 1024**3
print(f'KV cache per slot at {ctx} ctx: {kv_per_slot_gb:.2f} GB')
print(f'At 4 slots: {kv_per_slot_gb*4:.2f} GB')
"
# KV cache per slot at 8192 ctx: 1.00 GB
# At 4 slots: 4.00 GB

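A typical llama.cpp server config for a 24 GB GPU running 8B Q4 follows. Note that many llama-server builds treat --ctx-size as the total shared by all --parallel slots, so 8192 with 4 slots gives each slot 2048 tokens. Check the per-slot context your build prints at startup.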

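# Comments cannot follow a trailing backslash, so the flags are explained here:
#   --parallel     number of concurrent slots
#   --flash-attn   enabled for cache reuse (see the gotcha below)
#   --cache-reuse  minimum reusable chunk size, in tokens
./llama-server \
  --model /models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 33 \
  --parallel 4 \
  --flash-attn \
  --cache-reuse 256 \
  --port 8080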

--cache-reuse 256 sets the minimum chunk size, in tokens, that the server will try to reuse from a slot’s cached KV data via KV shifting. Reuse of a plain common prefix is handled by the server’s prompt cache (the cache_prompt request option). Together these are llama.cpp’s version of prefix caching, and they work per-slot: each slot maintains its own cache independently.

Gotcha: Flash attention (--flash-attn) is required for cache reuse to work in llama.cpp. Without it, --cache-reuse is silently ignored. No warning. Check your server logs for flash_attn = 1.

Continuous Batching: The Scheduler Is Your Cost Center

Under low load, your GPU is idle between requests. Under high load without proper batching, requests queue up and latency spikes. Continuous batching (iteration-level scheduling) is what separates a hobby deployment from a production one.

vLLM does continuous batching by default. The relevant knobs:

--max-num-seqs 256          # max concurrent sequences in the scheduler
--max-num-batched-tokens 8192  # max tokens processed in one forward pass
--scheduler-delay-factor 0.0   # 0 = aggressive batching, >0 = wait for more requests

For throughput-maximizing workloads (batch embedding, offline processing):

--max-num-batched-tokens 32768
--scheduler-delay-factor 0.5

For latency-sensitive workloads (real-time chat):

--max-num-batched-tokens 4096
--scheduler-delay-factor 0.0

Gotcha: --max-num-batched-tokens caps the number of tokens processed per step, which bounds the prefill work. With chunked prefill enabled, a prompt longer than this limit is split across multiple forward passes. This is usually fine, but it adds latency variance. Without chunked prefill, vLLM generally expects the value to be at least max_model_len and rejects smaller settings. Set it based on your p99 input length, not your average.
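A quick way to sanity-check a candidate value is to count how many forward passes a long prompt would need. This sketch assumes the request gets the whole per-step token budget to itself; under real load, decode tokens from other sequences share that budget, so the true count can be higher.

```python
import math


def prefill_steps(prompt_tokens, max_num_batched_tokens):
    """Forward passes needed to prefill one prompt, best case."""
    return math.ceil(prompt_tokens / max_num_batched_tokens)


p99_prompt = 12000  # hypothetical p99 input length
for budget in (4096, 8192, 32768):
    print(f"max-num-batched-tokens={budget}: "
          f"{prefill_steps(p99_prompt, budget)} step(s)")
```

Each extra step is another scheduling round before the first token appears, which is where the added time-to-first-token variance comes from.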

Monitoring What Actually Matters

Throw these into your Prometheus/Grafana stack. All are exposed at /metrics on vLLM:

Metric	What it tells you
`vllm:gpu_cache_usage_perc`	KV cache utilization. Sustained >90% means you need more VRAM or shorter context
`vllm:gpu_prefix_cache_hits_total`	Prefix cache hits. Low rate = prompts aren’t stable enough or cache is thrashing
`vllm:e2e_request_latency_seconds`	End-to-end latency histogram. Watch p99, not mean
`vllm:time_to_first_token_seconds`	Prefill time. Spikes here mean you’re compute-bound or doing too much prefill
`vllm:time_per_output_token_seconds`	Per-token decode latency. Tends to rise as batch size grows, even while total throughput improves
`vllm:num_requests_running`	Active sequences in the scheduler right now
`vllm:num_requests_waiting`	Queue depth. Nonzero means you’re saturated

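A simple alerting rule: if gpu_cache_usage_perc stays above 85% for more than 2 minutes during business hours, you’re either about to see requests preempted and queued (vLLM preallocates its KV cache, so exhaustion shows up as preemption rather than an OOM kill) or your cache hit rate is about to collapse. Either investigate or scale up.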
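In a Prometheus setup you would normally express this as an alerting rule with a `for: 2m` clause. If you are polling the endpoint from your own script instead, the "sustained above threshold" logic can be sketched as below. The class name is hypothetical, and the code assumes the gauge is reported as a 0-1 fraction and scraped every 15 seconds, so 2 minutes is 8 samples.

```python
from collections import deque


class SustainedThreshold:
    """Fires only when every sample in a full window exceeds the threshold."""

    def __init__(self, threshold, window_samples):
        self.threshold = threshold
        self.samples = deque(maxlen=window_samples)

    def observe(self, value):
        """Record one sample; return True if the alert condition holds."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(v > self.threshold for v in self.samples)


# 2 minutes at an assumed 15-second scrape interval = 8 samples
alert = SustainedThreshold(threshold=0.85, window_samples=8)
readings = [0.90] * 8 + [0.50]
print([alert.observe(r) for r in readings])
```

A single dip below the threshold resets the condition, which keeps short bursts from paging anyone.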

Production Checklist

Before you call a self-hosted LLM setup production-ready, verify:

Memory:

KV cache size calculated explicitly for your target context + concurrency
max_model_len set to what you can actually serve, not the model maximum
KV cache utilization monitored and alerted at 85%

Prefix caching:

Static portions of your system prompt are at the beginning, never vary
Dynamic content (dates, user info, request-specific data) appended after the static prefix
Prefix cache hit rate being tracked
Flash attention enabled if on llama.cpp

Batching:

Continuous batching enabled (it’s default on vLLM, manual on llama.cpp)
max_num_seqs tuned to your expected QPS and latency SLA
Queue depth metric alerting before you saturate

Quantization:

Weight quantization (Q4/Q5/Q8 or AWQ/GPTQ) chosen based on quality/memory tradeoff
KV cache quantization evaluated if on Ada/Hopper hardware
Benchmark quality degradation before production rollout

The Numbers That Should Drive Your Hardware Choice

If you’re still deciding on GPU hardware for self-hosted inference, the math above gives you a framework. Run it for your target model and workload:

Calculate model VRAM: roughly parameters × bytes_per_element (e.g., 8B × 2 bytes/param = 16 GB in bf16, ~4.5 GB in Q4)
Calculate KV cache per token for your chosen context length
Decide max concurrency target
Add 10-15% headroom for CUDA context, activations, and vLLM overhead
What’s left is your KV cache budget
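The steps above can be sketched as one function. This is a worst-case estimate that assumes every sequence fills its full context and that headroom is a flat fraction of total VRAM; the example figures are the bf16 8B numbers used earlier in this article.

```python
def max_concurrency(total_vram_gib, weights_gib, kv_bytes_per_token,
                    context_tokens, headroom=0.12):
    """Full-context sequences that fit after weights and headroom."""
    kv_budget = (total_vram_gib * (1 - headroom) - weights_gib) * 1024**3
    per_sequence = kv_bytes_per_token * context_tokens
    return max(0, int(kv_budget // per_sequence))


KV_8B = 2 * 32 * 8 * 128 * 2  # 128 KB per token, fp16 cache

for ctx in (4096, 16384):
    n = max_concurrency(total_vram_gib=24, weights_gib=16,
                        kv_bytes_per_token=KV_8B, context_tokens=ctx)
    print(f"24 GB card, 8B in bf16, {ctx}-token context: {n} concurrent")
```

Real workloads rarely fill every context, so paged allocation usually lets you run more sequences than this floor. It is still the number to plan against if you promise users the full window.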

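An RTX 4090 (24 GB) running Llama 3 8B in Q4 with a 4K context limit can handle ~20 concurrent requests. The same card with a 16K context drops to ~5 concurrent requests. That’s not a coincidence: both cases use about 10 GB of KV cache. Keeping 20 requests at 16K would take 128 KB/token × 12,288 extra tokens × 20 requests ≈ 30 GB of additional cache, which the card doesn’t have.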

Understanding this math is what separates an operator who rents a second GPU from one who tunes their way to the same throughput on the hardware they have.