Engineering Intermediate 8 min June 30, 2026

Operating GPU Inference at Scale: A Production Checklist

A vendor-neutral production checklist for operating GPU inference workloads with Kubernetes scheduling, runtime health probes, secrets, observability, scaling, and rollback planning.

Engineering abstract

GPU inference in production requires more than a model container. Validate runtime startup behavior, GPU scheduling, node isolation, readiness probes, secrets, networking, request limits, observability, rollback procedures, and cost controls before sending real traffic.

## Problem

Running an AI inference service in production is not the same as running a model locally.

Local demos usually validate only one thing: whether a model can produce a response. Production platforms must validate many more concerns, including GPU scheduling, model startup time, network routing, secrets, runtime health, request limits, observability, rollback behavior, and failure recovery.

A reliable GPU inference platform should be designed around operational constraints first, not model selection first.

## Reference Architecture

A production AI inference system usually includes API routing, GPU-backed serving, optional embedding services, persistent storage or cache layers, and centralized monitoring.

```text
Client
  |
  v
API Gateway
  |
  v
Load Balancer
  |
  +-----------------------------+
  |                             |
  v                             v
AI Inference API          Embedding API
  |                             |
  +-------------+---------------+
                |
                v
          GPU Node Pool
                |
                v
   Model Runtime Layer
   (Ollama, vLLM, TGI,
    TensorRT-LLM, llama.cpp)
                |
                v
 Persistent Storage / Cache
                |
                v
 Metrics, Logs, Alerts
```

The specific runtime can change. The operating pattern should remain stable.

## Inference Runtime Selection

Different inference runtimes optimize for different operational goals.

Common runtime choices include:

* Ollama for simple local or internal deployments
* vLLM for high-throughput serving built on PagedAttention
* Hugging Face Text Generation Inference (TGI) for text generation serving with continuous batching
* NVIDIA TensorRT-LLM for optimized GPU inference
* llama.cpp for lightweight CPU or edge-oriented deployments

The runtime should be selected based on workload characteristics:

* expected request volume
* model size
* latency targets
* batching requirements
* GPU memory availability
* quantization strategy
* operational complexity
* team experience

Avoid making the runtime the center of the architecture. Treat it as a replaceable serving layer behind a stable API contract.
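One way to keep the runtime replaceable is to put a Kubernetes Service in front of it, so that clients depend on a stable name and port rather than on a particular runtime's pods. The sketch below is an assumption for illustration: the name `inference-api` and the `app` label are hypothetical, and port 8080 simply matches the probe port used later in this checklist.

```yaml
# Hypothetical Service that fronts whichever runtime is currently deployed.
apiVersion: v1
kind: Service
metadata:
  name: inference-api          # hypothetical name; clients address this, not the runtime
spec:
  selector:
    app: inference-api         # any Deployment carrying this label receives traffic
  ports:
    - name: http
      port: 80
      targetPort: 8080         # assumed container port, matching the probes in this checklist
```

With this shape, swapping one runtime for another means rolling out a different Deployment with the same `app` label. The Service, and anything that calls it, stays unchanged, provided the new runtime serves the same request and response format.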

## GPU Scheduling Strategy

Before deploying the workload, confirm that the node pool has the correct labels, taints, and available accelerator capacity.

```bash
kubectl get nodes --show-labels
kubectl describe node <node-name>
```

The deployment should include a matching node selector and toleration.

```yaml
nodeSelector:
  accelerator: gpu

tolerations:
  - key: "ai.workload"
    operator: "Equal"
    value: "inference"
    effect: "NoSchedule"
```

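The two settings do different jobs. The node selector keeps inference pods off non-GPU nodes, while the taint on the GPU node pool (which only pods carrying the matching toleration can schedule onto) keeps general application workloads off GPU nodes. Together they isolate GPU inference workloads from the rest of the cluster.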
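Labels and tolerations control where a pod may land, but the scheduler only reserves an accelerator when the container requests one. The fragment below is a minimal sketch that assumes the NVIDIA device plugin is installed and exposes GPUs as the `nvidia.com/gpu` extended resource. The image name and the CPU and memory sizes are hypothetical placeholders.

```yaml
# Container fragment of a pod spec; values are illustrative assumptions.
containers:
  - name: inference
    image: registry.example.com/ai/inference:1.0.0   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1      # assumes the NVIDIA device plugin; one whole GPU per pod
        memory: 32Gi           # hypothetical host memory limit
      requests:
        cpu: "4"               # hypothetical
        memory: 32Gi
```

Extended resources such as GPUs are specified under `limits`. Kubernetes uses the same value as the request, and a pod whose GPU request cannot be satisfied stays in Pending rather than starting without an accelerator.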

## Startup and Readiness Probes

Foundation models frequently require significant initialization time. The container may need to load weights into GPU memory, compile optimized kernels, initialize tokenizers, warm caches, validate runtime dependencies, and expose a serving endpoint.

Startup probes should reflect actual model initialization behavior rather than generic web application defaults.

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 300
  periodSeconds: 30
  failureThreshold: 60
```

Readiness probes should only return success when the model is actually ready to serve requests.

```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 10
  failureThreshold: 6
```

A common mistake is exposing the pod to traffic as soon as the web server starts, even though the model runtime is still loading.

## Secrets and Image Pull Access

Production inference platforms often require private container registries, model artifact repositories, API keys, storage credentials, and observability tokens.

Validate secret availability before rollout:

```bash
kubectl get secret -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
```

Secrets should not be baked into container images. They should be injected through Kubernetes secrets, workload identity, vault integrations, or equivalent secret management systems.
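As a sketch of injecting credentials at runtime rather than baking them into the image, the pod spec fragment below pulls from a private registry with an image pull secret and reads a token from a Kubernetes Secret. All names here (`registry-credentials`, `model-store-credentials`, `MODEL_STORE_TOKEN`, and the image) are hypothetical. Workload identity or a vault integration would replace this pattern where available.

```yaml
# Pod spec fragment; every name below is a hypothetical placeholder.
spec:
  imagePullSecrets:
    - name: registry-credentials            # Secret of type kubernetes.io/dockerconfigjson
  containers:
    - name: inference
      image: registry.example.com/ai/inference:1.0.0
      env:
        - name: MODEL_STORE_TOKEN           # read by the application at startup
          valueFrom:
            secretKeyRef:
              name: model-store-credentials
              key: token
```

If either Secret is missing, the failure appears in the pod events (an image pull error or a container configuration error), which is why checking `kubectl describe pod` before rollout is useful.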

## Common Failure Modes

### GPU Memory Exhaustion

If the pod starts and then crashes, inspect logs and node-level GPU allocation.

```bash
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl describe pod <pod-name> -n <namespace>
kubectl describe node <node-name>
```

Typical causes include:

* model too large for available VRAM
* insufficient quantization
* excessive batch size
* multiple replicas placed on the same GPU
* memory fragmentation
* runtime cache growth

Mitigation options include using a smaller model, reducing context length, enabling quantization, adjusting batch size, moving to a larger GPU profile, or isolating replicas more strictly.
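How these mitigations are expressed depends on the runtime. As one illustration only, and assuming vLLM is the chosen runtime, context length, concurrent batch size, and the share of VRAM the engine may claim can be capped through container arguments. The image tag and the numeric values are placeholders, and the model argument is omitted.

```yaml
# Container fragment assuming vLLM as the runtime; values are illustrative.
containers:
  - name: inference
    image: vllm/vllm-openai:<tag>          # pin a specific tag in practice
    args:
      - "--gpu-memory-utilization=0.85"    # fraction of GPU memory the engine may use
      - "--max-model-len=8192"             # cap context length to shrink the KV cache
      - "--max-num-seqs=64"                # cap the number of sequences batched together
    resources:
      limits:
        nvidia.com/gpu: 1                  # one whole GPU per replica, no sharing
```

Other runtimes expose comparable controls under different names, so check the documentation of the runtime in use rather than copying these flags.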

### Slow Startup

Large models can take several minutes to become ready. If startup probes are too aggressive, Kubernetes may restart a healthy pod before initialization completes.

Mitigation options include increasing initialDelaySeconds, increasing failureThreshold, separating /health from /ready, and adding explicit runtime readiness checks.
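To judge whether a startup probe is too aggressive, it helps to work out the time budget it allows. Roughly, the container has `initialDelaySeconds + failureThreshold × periodSeconds` to pass its first check before the kubelet restarts it. The sketch below applies that arithmetic to the probe values used earlier in this checklist. The second variant is an assumption showing an alternative with the same budget.

```yaml
# Variant A: values from this checklist.
# Budget is roughly 300 + 60 * 30 = 2100 seconds (about 35 minutes).
startupProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 300
  periodSeconds: 30
  failureThreshold: 60
---
# Variant B (assumed alternative): same budget, no fixed delay.
# Budget is roughly 0 + 210 * 10 = 2100 seconds, but a model that loads
# in 90 seconds is detected at about 90 seconds instead of waiting 300.
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 210
```

Either way, the budget should exceed the slowest observed model load, including the case where weights are pulled from remote storage onto a fresh node.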

### Image Pull Failures

Private model registries require the correct image pull secret or workload identity configuration.

```bash
kubectl describe pod <pod-name> -n <namespace>
```

Look for events such as:

* `ImagePullBackOff`
* `ErrImagePull`
* `unauthorized`
* `manifest unknown`

### Incorrect Scheduling

If a pod remains in the Pending state, the scheduler could not find a node that satisfies its node selector, tolerations, GPU resource requests, or CPU and memory requests.

```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl get nodes --show-labels
```

Check:

* node labels
* taints and tolerations
* GPU resource requests
* namespace quotas
* node pool capacity
* cluster autoscaler limits
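Namespace quotas are easy to overlook because they block pod creation even when nodes have free GPUs. As a sketch, a quota on GPU requests might look like the following. The quota name and the limit of 4 are hypothetical, and `nvidia.com/gpu` again assumes the NVIDIA device plugin.

```yaml
# Hypothetical quota capping total GPU requests in one namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: <namespace>
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # extended resources are quota-limited via the requests. prefix
```

Running `kubectl describe resourcequota -n <namespace>` shows used versus hard limits, which tells you whether a new replica fits inside the quota.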

## Observability Requirements

A production inference service should expose enough telemetry to answer operational questions quickly.

Minimum metrics should include:

* request count
* request latency
* error rate
* queue depth
* GPU utilization
* GPU memory usage
* model load time
* token throughput
* pod restart count
* readiness failures

Logs should include request identifiers, runtime startup events, model load status, failure reasons, and dependency errors.

Without this telemetry, debugging inference failures becomes guesswork.
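Metrics become operationally useful once they drive alerts. The rules file below is a sketch under several assumptions: Prometheus is the metrics backend, kube-state-metrics supplies `kube_pod_container_status_restarts_total`, and the NVIDIA DCGM exporter supplies the `DCGM_FI_DEV_FB_*` framebuffer metrics. The namespace label, thresholds, and alert names are hypothetical.

```yaml
# Prometheus alerting rules; thresholds and names are illustrative assumptions.
groups:
  - name: inference-alerts
    rules:
      - alert: InferencePodRestarting
        # More than 2 container restarts in 15 minutes in the inference namespace.
        expr: increase(kube_pod_container_status_restarts_total{namespace="inference"}[15m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Inference pod {{ $labels.pod }} is restarting repeatedly"
      - alert: GpuMemoryNearlyFull
        # Used framebuffer memory above 95% of used + free.
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU memory above 95% for 10 minutes"
```

Request latency, queue depth, and token throughput come from the serving runtime itself, and their metric names vary by runtime, so they are left out of this sketch.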

## Scaling Considerations

GPU inference scaling is not the same as scaling stateless web APIs.

Before enabling autoscaling, define:

* whether scaling is based on request rate, queue depth, GPU utilization, or latency
* whether the runtime supports batching
* how long new replicas take to become ready
* whether model artifacts are cached locally
* whether node autoscaling can provision GPU nodes fast enough
* whether cold starts are acceptable

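For many inference systems, queue-based scaling is more predictable than CPU-based scaling, because CPU utilization says little about how saturated the GPU is or how many requests are waiting.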
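A queue-depth policy can be sketched with the `autoscaling/v2` HorizontalPodAutoscaler. This assumes a custom metrics adapter is installed and that the service exports a per-pod metric, here given the hypothetical name `inference_queue_depth`. The Deployment name, replica bounds, and target value are also placeholders.

```yaml
# Hypothetical HPA scaling on per-pod queue depth; requires a custom metrics adapter.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 1
  maxReplicas: 4                 # bounded by available GPU capacity and budget
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth   # hypothetical metric exposed by the service
        target:
          type: AverageValue
          averageValue: "8"             # scale out when average waiting requests per pod exceed 8
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # avoid dropping replicas that take minutes to recreate
```

The long scale-down window reflects the cost of cold starts. Removing a replica is cheap, but bringing one back may require provisioning a GPU node and reloading the model.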

## Production Readiness Checklist

Before exposing an AI inference service to production traffic, verify:

* GPU scheduling and node affinity
* taints and tolerations
* runtime health checks
* model readiness checks
* private image pull access
* secrets managed outside container images
* persistent storage or cache requirements
* request timeout behavior
* rate limiting
* TLS termination
* ingress routing
* centralized logging
* metrics and alerts
* GPU memory visibility
* autoscaling thresholds
* rollback procedure
* disaster recovery procedure
* cost monitoring
* documented failure response
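For the rollback item, part of the procedure can be encoded in the Deployment itself. The fragment below is a sketch with hypothetical values: it keeps old replicas serving until a new one passes readiness, retains earlier revisions to roll back to, and extends the progress deadline to allow for slow model startup.

```yaml
# Deployment fragment; selector and pod template omitted, values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api            # hypothetical name
spec:
  revisionHistoryLimit: 5        # keep earlier ReplicaSets available for rollback
  progressDeadlineSeconds: 2400  # longer than the startup probe budget (default is 600)
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                # needs one spare GPU during the rollout
      maxUnavailable: 0          # never remove a ready replica before its replacement is ready
```

With revisions retained, `kubectl rollout undo deployment/<name> -n <namespace>` returns to the previous revision. Note that `maxSurge: 1` only works if the node pool has, or can autoscale to, one GPU beyond steady-state capacity.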

## Key Takeaways

GPU inference platforms should be designed around operational reliability, not just model execution.

The runtime may be Ollama, vLLM, TGI, TensorRT-LLM, llama.cpp, or another serving layer. The production pattern remains the same: isolate GPU workloads, validate readiness, protect secrets, observe runtime behavior, plan for slow startup, and document recovery procedures.

A model that works locally is only the first step. A model that operates safely in production requires platform engineering.