One GPU, Many Pods: NVIDIA MPS vs Time-Slicing in Kubernetes (Working Manifests Included)

A single A100 costs somewhere north of $10,000. Your inference service uses maybe 8% of it at steady state. The other 92% is heating a data center rack while your manager asks why the cloud bill is so high. This is the GPU utilization problem, and it’s endemic to almost every ML platform that grew organically from "one model, one GPU" thinking.

Kubernetes has two main answers: time-slicing and NVIDIA MPS (Multi-Process Service). They solve the same problem from completely different angles, with very different failure modes. Pick the wrong one and you’ll either get mysterious OOM kills or one bad actor crashing everyone else’s inference pod.

This guide covers both — how they actually work under the hood, working manifests you can drop into a real cluster, and the traps that aren’t obvious from the docs.

Official NVIDIA device plugin repo: https://github.com/NVIDIA/k8s-device-plugin

How GPU Sharing Actually Works (The Short Version)

Without any sharing configured, Kubernetes treats a GPU like a binary resource. You either have it or you don’t. Request nvidia.com/gpu: 1 and you own that entire physical card. Request nvidia.com/gpu: 2 on a node with one GPU and your pod will never schedule.

This made sense when the workload was a training job that actually saturated the GPU. It makes zero sense for inference, batch preprocessing, or dev environments where a single model is using a fraction of VRAM and a handful of streaming multiprocessors.

Time-slicing fakes additional GPU resources at the scheduler level. The driver still time-slices access between processes, much as an OS scheduler shares a CPU core between threads, but now Kubernetes knows about it and will schedule multiple pods onto the "same" GPU.

MPS goes deeper. It’s a CUDA-level feature in which multiple client processes attach to a shared MPS server instead of each holding a separately time-sliced context, enabling genuinely concurrent kernel execution without full context switches. It’s not fake resources; it’s restructured hardware access.

And then there’s MIG (Multi-Instance GPU), which is hardware partitioning available only on select data-center GPUs from the Ampere generation onward, such as the A100, A30, and H100. MIG gives you hard memory and compute isolation but requires specific hardware. That’s a separate article; here we focus on what works across most node types.

Time-Slicing: Setup and Manifests

Time-slicing is configured through the nvidia-device-plugin ConfigMap. You tell it how many "replicas" to advertise per physical GPU, and it multiplies the reported resource count accordingly.

Step 1 — Create the Plugin ConfigMap

# nvidia-device-plugin-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.json: |
    {
      "version": "v1",
      "sharing": {
        "timeSlicing": {
          "renameByDefault": false,
          "failRequestsGreaterThanOne": false,
          "resources": [
            {
              "name": "nvidia.com/gpu",
              "replicas": 4
            }
          ]
        }
      }
    }

replicas: 4 means one physical GPU becomes four schedulable units. There’s no memory isolation — all four pods share the same VRAM pool.

renameByDefault: true would advertise the shared resources as nvidia.com/gpu.shared instead, which lets you distinguish "real" GPU requests from sliced ones. Useful if you have mixed workloads. failRequestsGreaterThanOne prevents pods from accidentally requesting more than one slice, which would be meaningless but wouldn’t fail loudly by default.
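
To make the replica arithmetic concrete, here is a minimal sketch of what a node ends up advertising. The function name and signature are hypothetical (this is not the plugin's code); it only mirrors the behavior described above: the count is multiplied by replicas, and the resource name gains a .shared suffix when renameByDefault is true.

```python
def advertised_gpu_resources(physical_gpus, replicas, rename_by_default=False):
    """Return the (resource_name, count) a node would advertise under sharing.

    Illustrative only: mirrors the multiplication and renaming described
    in the text, not the device plugin's actual implementation.
    """
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("physical_gpus must be >= 0 and replicas >= 1")
    name = "nvidia.com/gpu.shared" if rename_by_default else "nvidia.com/gpu"
    return name, physical_gpus * replicas


# One physical GPU with replicas: 4, default naming
print(advertised_gpu_resources(1, 4))
# Two physical GPUs with replicas: 4 and renameByDefault: true
print(advertised_gpu_resources(2, 4, rename_by_default=True))
```

Note that the count says nothing about VRAM or compute per replica; it is purely a scheduling number.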

Step 2 — Deploy the Device Plugin with This Config

If you’re using Helm (the standard path):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set config.name=nvidia-device-plugin-config

If you’re not using Helm and prefer raw manifests, mount the ConfigMap into the DaemonSet:

# Relevant snippet from the DaemonSet spec
containers:
  - name: nvidia-device-plugin-ctr
    image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0  # pin to the plugin release you have validated
    env:
      - name: CONFIG_FILE
        value: /etc/nvidia/config.json
    volumeMounts:
      - name: config
        mountPath: /etc/nvidia
volumes:
  - name: config
    configMap:
      name: nvidia-device-plugin-config

Step 3 — Request a GPU Slice in Your Pod

apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  containers:
    - name: model-server
      image: your-inference-image:latest
      resources:
        limits:
          nvidia.com/gpu: 1   # One time-shared replica; no guaranteed 1/4 of compute or VRAM
      env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"

Verify the node is advertising the right count:

kubectl describe node <your-gpu-node> | grep nvidia.com/gpu
# Should show: nvidia.com/gpu: 4 (or whatever your replica count is)
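
If you would rather check this from a script than by eye, the same number is available in the Node object. The sketch below assumes you have already fetched the node as JSON (for example with kubectl get node <your-gpu-node> -o json); the helper name and the trimmed-down sample document are made up for illustration.

```python
import json


def gpu_allocatable(node, resource="nvidia.com/gpu"):
    """Return the allocatable count for `resource` from a Node object dict.

    Extended resources appear as strings under status.allocatable;
    a missing entry is treated as zero.
    """
    value = node.get("status", {}).get("allocatable", {}).get(resource, "0")
    return int(value)


# Hypothetical, heavily trimmed output of `kubectl get node <name> -o json`
sample = json.loads(
    '{"status": {"allocatable": {"cpu": "16", "nvidia.com/gpu": "4"}}}'
)
print(gpu_allocatable(sample))
```

If you enabled renameByDefault, pass resource="nvidia.com/gpu.shared" instead.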

NVIDIA MPS: Setup and Manifests

MPS is meaningfully more complex. There’s a daemon (nvidia-cuda-mps-control) that must be running on the host, all client processes connect through it, and the device plugin needs to be told to use MPS mode.

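The big architectural difference: with time-slicing, each process has its own CUDA context and the driver switches between them. With MPS, client processes attach to a shared MPS server instead. On pre-Volta GPUs they literally share one CUDA context inside the server; on Volta and later each client keeps its own GPU address space but still shares the GPU’s scheduling resources. Either way, this enables concurrent kernel execution (actual parallelism, not taking turns) and dramatically reduces context-switch overhead.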

Step 1 — Enable MPS in the Device Plugin Config

# nvidia-device-plugin-config-mps.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.json: |
    {
      "version": "v1",
      "sharing": {
        "mps": {
          "renameByDefault": false,
          "failRequestsGreaterThanOne": true,
          "resources": [
            {
              "name": "nvidia.com/gpu",
              "replicas": 5
            }
          ]
        }
      }
    }

Step 2 — Deploy the MPS DaemonSet

The MPS control daemon needs to run on every GPU node, with access to the host PID namespace and GPU devices. This is the part most guides skip or get wrong:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-mps-daemon
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: nvidia-mps-daemon
  template:
    metadata:
      labels:
        app: nvidia-mps-daemon
    spec:
      nodeSelector:
        accelerator: nvidia  # Only schedule on GPU nodes
      hostPID: true          # Required — MPS operates on host PID namespace
      hostIPC: true          # Required for shared memory pipe
      containers:
        - name: mps-control
          image: nvidia/cuda:12.2.0-base-ubuntu22.04
          # Foreground mode (-f) keeps the daemon as the container's main process.
          # Started with -d in an init container, it would be killed as soon as
          # that container exits.
          command: ["nvidia-cuda-mps-control", "-f"]
          securityContext:
            privileged: true
          volumeMounts:
            - name: nvidia-mps
              mountPath: /tmp/nvidia-mps
          resources:
            requests:
              memory: "32Mi"
      volumes:
        - name: nvidia-mps
          hostPath:
            path: /tmp/nvidia-mps
            type: DirectoryOrCreate

This approach runs the MPS control daemon in the foreground as the pod’s main container, so Kubernetes supervises it directly and restarts it if it exits. Launching it with -d from an init container does not work: the init container finishes, its processes are cleaned up, and the daemon goes with them.

Step 3 — Pod Spec for MPS Workloads

MPS client pods need the pipe directory mounted:

apiVersion: v1
kind: Pod
metadata:
  name: mps-inference-worker
spec:
  hostIPC: true   # Share the IPC namespace the MPS daemon uses (it also runs with hostIPC)
  containers:
    - name: model-server
      image: your-inference-image:latest
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: CUDA_MPS_PIPE_DIRECTORY
          value: /tmp/nvidia-mps
        - name: CUDA_MPS_LOG_DIRECTORY
          value: /tmp/nvidia-log
      volumeMounts:
        - name: nvidia-mps
          mountPath: /tmp/nvidia-mps
  volumes:
    - name: nvidia-mps
      hostPath:
        path: /tmp/nvidia-mps

When to Use Which

This is the actual decision you need to make, and the answer isn’t always obvious.

Use time-slicing when:

Your workloads are latency-sensitive but not throughput-bound
You can tolerate shared VRAM and mainly want predictable, easy-to-reason-about behavior (neither approach gives you hard memory limits by default)
Your frameworks have poor MPS compatibility (some PyTorch versions, TensorFlow 1.x)
You’re running heterogeneous workloads that crash unpredictably — you don’t want one bad pod killing everyone else
Simplicity matters — time-slicing is 30 lines of config, MPS is a daemon and a prayer

Use MPS when:

You’re running multiple small inference services on one GPU and they’re spending most of their time waiting for context switches
Your CUDA kernels are genuinely small and would benefit from concurrent execution
You’re on Volta+ architecture (V100, A100, H100) — MPS safety features only exist from Volta onward
You’ve measured the context switch overhead and it’s actually hurting you

The honest answer for most teams: start with time-slicing. It’s simpler, more battle-tested in Kubernetes, and the failure modes are more contained. Profile your GPU utilization after a week. If you’re seeing high idle time with bursty activity that isn’t spreading well across the time-slice window, then investigate MPS.
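
The criteria above can be condensed into a small decision helper. This is only a sketch of this article's rule of thumb; the function and parameter names are hypothetical, and the inputs are things you have to establish yourself by profiling and testing.

```python
def recommend_sharing_mode(
    volta_or_newer,
    runs_untrusted_code,
    framework_supports_mps,
    context_switch_overhead_measured,
):
    """Return "time-slicing" or "mps" following the article's rule of thumb.

    Time-slicing is the default; MPS is suggested only when nothing rules
    it out and context-switch overhead has actually been measured as a
    problem.
    """
    # Any of these makes MPS a poor fit according to the criteria above.
    if not volta_or_newer or runs_untrusted_code or not framework_supports_mps:
        return "time-slicing"
    # MPS has to earn its complexity with a measured benefit.
    if context_switch_overhead_measured:
        return "mps"
    return "time-slicing"


print(recommend_sharing_mode(True, False, True, True))
print(recommend_sharing_mode(True, False, True, False))
```

The deliberate bias is that every unknown falls back to time-slicing, matching the "start simple" advice.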

Gotchas

Memory is not isolated in either approach. This is the one that bites everyone. With 4 replicas on a 24GB GPU, every pod sees the full 24GB, and Kubernetes does no VRAM accounting at all. Your pods can collectively try to allocate 60GB, and whichever allocation pushes past the physical limit fails with a CUDA out-of-memory error, with no clean pod-level accounting to tell you who caused it. You have to enforce VRAM limits at the application layer, through your framework’s per-process memory fraction or GPU memory growth settings.

For TensorFlow:

import tensorflow as tf
# Grow VRAM usage on demand; must run before the GPU is first used
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

For PyTorch, set an explicit reserved fraction at startup. Don’t let frameworks pre-allocate the full visible VRAM.
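
As a rough sketch of what that can look like: split the card's VRAM evenly across replicas, keep some headroom, and hand the resulting fraction to PyTorch's torch.cuda.set_per_process_memory_fraction. The helper names and the 10% headroom are assumptions for illustration, not recommendations, and the cap applies to PyTorch's caching allocator only, not to the CUDA context's own overhead.

```python
def per_pod_vram_budget_gib(total_gib, replicas, headroom=0.10):
    """Evenly split VRAM across replicas after reserving a headroom share."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return total_gib * (1 - headroom) / replicas


def apply_pytorch_cap(fraction, device=0):
    """Cap this process's share of VRAM in PyTorch's caching allocator."""
    import torch  # imported lazily so the arithmetic works without PyTorch

    if torch.cuda.is_available():
        torch.cuda.set_per_process_memory_fraction(fraction, device)


total_gib = 24   # physical VRAM on the card (example value from the text)
replicas = 4     # matches the time-slicing replica count
budget = per_pod_vram_budget_gib(total_gib, replicas)
fraction = budget / total_gib
print(f"per-pod budget: {budget:.2f} GiB, fraction: {fraction:.3f}")

# Call once at process startup, before the model is loaded:
# apply_pytorch_cap(fraction)
```

Allocations beyond the cap raise an out-of-memory error inside the offending process, which is a far better failure mode than starving a neighbor.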

MPS error propagation is brutal. If one MPS client process hits a fatal CUDA error (illegal memory access, watchdog timeout), the MPS server may terminate, bringing down all other clients. This is documented NVIDIA behavior, not a bug. On Volta and newer you get some protection with fault isolation mode, but pre-Volta hardware is completely unprotected. Running untrusted user code through MPS on shared infrastructure is asking for trouble.

The device plugin config doesn’t hot-reload cleanly. If you change the replicas count after the plugin is running, you need to restart the DaemonSet pods. On a live cluster this means GPU workloads briefly lose device access. Plan maintenance windows or use node cordoning.

nvidia-smi lies to you inside containers. It shows the full physical GPU, not the slice. nvidia-smi inside a time-sliced pod will report the complete GPU memory and all compute resources. Don’t trust it for capacity planning inside the pod — query it on the host.

MPS and CUDA Graph don’t always mix. If your inference stack uses CUDA Graphs for optimization (popular in TensorRT deployments), test MPS compatibility carefully. Some combinations cause hangs. The NVIDIA docs have a compatibility matrix but it’s not always up to date with current CUDA versions.

Node labels matter more than you think. If you’re mixing GPU and CPU-only nodes, make sure your time-sliced GPU nodes have clear labels and your workloads have matching nodeSelector or nodeAffinity. A pod that requests nvidia.com/gpu: 1 when no eligible node has a running device plugin with free replicas will sit Pending indefinitely, and the scheduler’s "Insufficient nvidia.com/gpu" event doesn’t tell you whether the cause is a missing plugin, a wrong label, or exhausted replicas.

Production-Ready Patterns

Separate resource names for shared vs. dedicated GPUs. Use renameByDefault: true to expose shared GPUs as nvidia.com/gpu.shared. Your training jobs request nvidia.com/gpu: 1 (full card) and your inference pods request nvidia.com/gpu.shared: 1. This prevents a training job from accidentally scheduling onto a GPU that’s already running 4 inference slices.

Set replica count conservatively. The temptation is to set replicas: 8 and see what happens. In practice, context switch overhead and VRAM contention degrade performance non-linearly. For inference with transformer models, 2-4 replicas per GPU is usually the ceiling before latency degrades noticeably. Benchmark with your actual models.

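Use PriorityClasses. Time-slicing has no notion of priority: processes sharing the GPU take roughly equal turns, and the GPU itself won’t favor production over dev work. Define a high-priority PriorityClass for your production inference pods and a lower one for dev/batch workloads, and reference them via priorityClassName in the pod spec. When every GPU replica is taken, Kubernetes will at least schedule, and if needed evict lower-priority pods, in the right order.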

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-inference-prod
value: 1000
globalDefault: false
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-batch-dev
value: 100
globalDefault: false

Monitor with DCGM. Deploy NVIDIA’s Data Center GPU Manager exporter as a DaemonSet and scrape it with Prometheus. You want metrics on actual GPU utilization, memory used/free, and SM occupancy per node — not per-pod (which you can’t get with time-slicing anyway). This is how you know whether your sharing strategy is actually working.

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true
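
For a quick sanity check without a full Prometheus stack, you can read the exporter's text output directly. The sketch below parses Prometheus exposition text for two metrics that dcgm-exporter publishes by default (DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_FB_USED). The sample payload and its label values are made up, and the parser assumes no trailing timestamps on metric lines.

```python
def parse_dcgm_metrics(text, wanted=("DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_FB_USED")):
    """Return {metric_name: [values]} from Prometheus exposition text."""
    values = {name: [] for name in wanted}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The metric name ends at the label block or at the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name in values:
            # The sample value is the last whitespace-separated field.
            values[name].append(float(line.rsplit(" ", 1)[1]))
    return values


# Made-up sample in the format the exporter serves on its /metrics endpoint
sample = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 37
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-a"} 18432
DCGM_FI_DEV_FB_FREE{gpu="0",Hostname="node-a"} 6144
"""

metrics = parse_dcgm_metrics(sample)
print(metrics)
```

Each list holds one value per GPU on the node, so a node-level view is just a matter of averaging or summing them.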

Test your failure scenarios before production. Deliberately OOM one pod and watch what happens to its neighbors. With time-slicing they should be unaffected. With MPS, observe whether the server restarts and how quickly clients recover. Know your blast radius before your users do.

The GPU sharing landscape in Kubernetes is still maturing — MIG is genuinely superior when you have the hardware, MPS is powerful but demanding, and time-slicing is the pragmatic workaround that works everywhere. Most clusters end up with a combination: time-slicing for the inference tier, dedicated allocations for training, and MIG where the budget stretched to H100s.

Start simple, measure what’s actually happening, and add complexity only when the numbers demand it.
