# Operating GPU Inference at Scale: A Production Checklist
A vendor-neutral production checklist for operating GPU inference workloads with Kubernetes scheduling, runtime health probes, secrets, observability, scaling, and rollback planning.
## Engineering abstract
GPU inference in production requires more than a model container. Validate runtime startup behavior, GPU scheduling, node isolation, readiness probes, secrets, networking, request limits, observability, rollback procedures, and cost controls before sending real traffic.
## Problem
Running an AI inference service in production is not the same as running a model locally.
Local demos usually validate only one thing: whether a model can produce a response. Production platforms must validate many more concerns, including GPU scheduling, model startup time, network routing, secrets, runtime health, request limits, observability, rollback behavior, and failure recovery.
A reliable GPU inference platform should be designed around operational constraints first, not model selection first.
## Reference Architecture
A production AI inference system usually includes API routing, GPU-backed serving, optional embedding services, persistent storage or cache layers, and centralized monitoring.
```
               Client
                  |
                  v
             API Gateway
                  |
                  v
            Load Balancer
                  |
    +-----------------------------+
    |                             |
    v                             v
AI Inference API            Embedding API
    |                             |
    +-------------+---------------+
                  |
                  v
            GPU Node Pool
                  |
                  v
         Model Runtime Layer
         (Ollama, vLLM, TGI,
      TensorRT-LLM, llama.cpp)
                  |
                  v
     Persistent Storage / Cache
                  |
                  v
        Metrics, Logs, Alerts
```
The specific runtime can change. The operating pattern should remain stable.
## Inference Runtime Selection
Different inference runtimes optimize for different operational goals.
Common runtime choices include:
* Ollama for simple local or internal deployments
* vLLM for high-throughput serving with PagedAttention-based KV cache management
* Hugging Face TGI (Text Generation Inference) for text generation serving with continuous batching and token streaming
* NVIDIA TensorRT-LLM for optimized GPU inference
* llama.cpp for lightweight CPU or edge-oriented deployments
The runtime should be selected based on workload characteristics:
* expected request volume
* model size
* latency targets
* batching requirements
* GPU memory availability
* quantization strategy
* operational complexity
* team experience
Avoid making the runtime the center of the architecture. Treat it as a replaceable serving layer behind a stable API contract.
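One way to keep the runtime replaceable is to have the API layer depend on a small contract rather than on a specific runtime client. The sketch below is illustrative only: `InferenceBackend`, `EchoBackend`, and `handle_request` are hypothetical names, and a real adapter would call the chosen runtime's own API instead of echoing the prompt.

```python
from typing import Protocol


class InferenceBackend(Protocol):
    """Hypothetical contract that every runtime adapter would satisfy."""

    def generate(self, prompt: str, max_tokens: int) -> str: ...


class EchoBackend:
    """Stand-in adapter for tests; it returns the first max_tokens words."""

    def generate(self, prompt: str, max_tokens: int) -> str:
        return " ".join(prompt.split()[:max_tokens])


def handle_request(backend: InferenceBackend, prompt: str, max_tokens: int = 64) -> dict:
    # The API layer validates input and depends only on the contract,
    # so swapping the runtime means swapping the adapter, not this code.
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return {"output": backend.generate(prompt, max_tokens)}
```

With this shape, moving from one runtime to another would only require a new adapter class that implements `generate`.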
## GPU Scheduling Strategy
Before deploying the workload, confirm that the node pool has the correct labels, taints, and available accelerator capacity.
```bash
kubectl get nodes --show-labels
kubectl describe node <node-name>
```
The deployment should include a matching node selector and toleration.
```yaml
nodeSelector:
  accelerator: gpu
tolerations:
  - key: "ai.workload"
    operator: "Equal"
    value: "inference"
    effect: "NoSchedule"
```
This keeps GPU inference workloads isolated from general application workloads and helps prevent accidental placement on non-GPU nodes.
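To make the matching rules concrete, the following sketch models how a node selector and tolerations decide placement. It is a simplified approximation of the scheduler's behavior, not its implementation: it ignores resource requests, affinity rules, and priorities, and the function names and dictionary shapes are assumptions made for this example.

```python
def selector_matches(node_labels: dict, node_selector: dict) -> bool:
    # Every nodeSelector key/value pair must be present on the node.
    return all(node_labels.get(k) == v for k, v in node_selector.items())


def tolerates(toleration: dict, taint: dict) -> bool:
    # A toleration with no effect matches taints of any effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # With Exists, an empty key tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value"))


def can_schedule(node: dict, pod: dict) -> bool:
    if not selector_matches(node.get("labels", {}), pod.get("nodeSelector", {})):
        return False
    # NoSchedule and NoExecute taints block placement; PreferNoSchedule is only a preference.
    blocking = [t for t in node.get("taints", [])
                if t["effect"] in ("NoSchedule", "NoExecute")]
    tolerations = pod.get("tolerations", [])
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


gpu_node = {
    "labels": {"accelerator": "gpu"},
    "taints": [{"key": "ai.workload", "value": "inference", "effect": "NoSchedule"}],
}
pod = {
    "nodeSelector": {"accelerator": "gpu"},
    "tolerations": [{"key": "ai.workload", "operator": "Equal",
                     "value": "inference", "effect": "NoSchedule"}],
}
print(can_schedule(gpu_node, pod))  # True
```

The taint keeps general workloads off the GPU node, while the node selector keeps the inference pod off everything else; both halves are needed for isolation.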
## Startup and Readiness Probes
Foundation models frequently require significant initialization time. The container may need to load weights into GPU memory, compile optimized kernels, initialize tokenizers, warm caches, validate runtime dependencies, and expose a serving endpoint.
Startup probes should reflect actual model initialization behavior rather than generic web application defaults.
```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 300
  periodSeconds: 30
  failureThreshold: 60
```
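A quick way to sanity-check startup probe settings is to compute the approximate time they allow before the kubelet restarts the container. The helper below is a rough sketch: the function names are hypothetical, and the estimate ignores per-probe timeouts and scheduling jitter.

```python
def startup_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time a container has to pass its startup probe."""
    # Probing begins after initial_delay, then the container may fail
    # failure_threshold consecutive checks spaced period seconds apart.
    return initial_delay + period * failure_threshold


def budget_covers(load_time_seconds: int, initial_delay: int,
                  period: int, failure_threshold: int) -> bool:
    """True if the measured model load time fits inside the probe budget."""
    return load_time_seconds <= startup_budget_seconds(
        initial_delay, period, failure_threshold)


print(startup_budget_seconds(300, 30, 60))  # 2100
print(startup_budget_seconds(0, 10, 3))     # 30
```

The first call uses the values from the probe above and gives roughly 35 minutes. The second uses the Kubernetes defaults (no initial delay, a 10-second period, a failure threshold of 3), which leave only about 30 seconds, far too little for most large models.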
Readiness probes should only return success when the model is actually ready to serve requests.
```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 10
  failureThreshold: 6
```
A common mistake is exposing the pod to traffic as soon as the web server starts, even though the model runtime is still loading.
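A minimal sketch of separating the two endpoints, using only the Python standard library. The paths `/health` and `/ready` and port 8080 follow the probe definitions in this section; `probe_status`, `model_ready`, and `start_probe_server` are hypothetical names, and a real service would set `model_ready` from its own model loading code.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set by the (hypothetical) model loader once weights are loaded and warm.
model_ready = threading.Event()


def probe_status(path: str, ready: bool) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/health":
        return 200  # the process is up and answering HTTP
    if path == "/ready":
        return 200 if ready else 503  # serve traffic only after model load
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status = probe_status(self.path, model_ready.is_set())
        body = str(status).encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of stderr in this sketch


def start_probe_server(host: str = "0.0.0.0", port: int = 8080) -> ThreadingHTTPServer:
    """Serve probes on a background thread and return the server handle."""
    server = ThreadingHTTPServer((host, port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The application would call `start_probe_server()` early, load the model, and then call `model_ready.set()`. Until that point `/ready` returns 503, so the pod stays out of the Service endpoints even though the web server is already up.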
## Secrets and Image Pull Access
Production inference platforms often require private container registries, model artifact repositories, API keys, storage credentials, and observability tokens.
Validate secret availability before rollout:
```bash
kubectl get secret -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
```
Secrets should not be baked into container images. They should be injected through Kubernetes secrets, workload identity, vault integrations, or equivalent secret management systems.
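On the application side, injected secrets can be read at startup and validated before the service accepts traffic. The sketch below assumes a secret arrives either as a file in a mounted volume or as an environment variable; `load_secret` and the default mount path are hypothetical and would need to match the actual pod spec.

```python
import os
from pathlib import Path


def load_secret(name: str, mount_dir: str = "/var/run/secrets/inference") -> str:
    """Return a secret from a mounted file, falling back to the environment.

    mount_dir is an assumed path for this sketch. Kubernetes exposes each
    key of a mounted Secret as a file inside the volume directory.
    """
    secret_file = Path(mount_dir) / name
    if secret_file.is_file():
        return secret_file.read_text().strip()
    value = os.environ.get(name)
    if not value:
        # Fail fast at startup rather than at the first request that needs it.
        raise RuntimeError(f"required secret {name!r} is not configured")
    return value
```

Failing at startup keeps a misconfigured pod from ever becoming ready, which is easier to diagnose than a request-time error.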
## Common Failure Modes
### GPU Memory Exhaustion
If the pod starts and then crashes, inspect logs and node-level GPU allocation.
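```bash
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous   # output of the last crashed container
kubectl describe pod <pod-name> -n <namespace>
kubectl describe node <node-name>                   # check "Allocated resources" for GPU requests
```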
Typical causes include:
- model too large for available VRAM
- insufficient quantization
- excessive batch size
- multiple replicas placed on the same GPU
- memory fragmentation
- runtime cache growth
Mitigation options include using a smaller model, reducing context length, enabling quantization, adjusting batch size, moving to a larger GPU profile, or isolating replicas more strictly.
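A back-of-the-envelope estimate often shows whether a model can fit before anything is deployed. The sketch below computes the memory needed for the weights alone (parameters times bits per parameter, divided by 8 to get bytes) and compares it with available VRAM. The function names are hypothetical, and the 20% headroom is an assumed placeholder for KV cache, activations, and runtime overhead; the real figure depends on context length, batch size, and the runtime.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    # (params_billions * 1e9 params) * (bits / 8 bytes) / 1e9 bytes per GB
    return params_billions * bits_per_param / 8


def fits_on_gpu(params_billions: float, bits_per_param: int,
                vram_gb: float, headroom: float = 0.2) -> bool:
    """True if the weights fit while leaving `headroom` of VRAM free."""
    usable_gb = vram_gb * (1 - headroom)
    return weight_memory_gb(params_billions, bits_per_param) <= usable_gb


print(weight_memory_gb(7, 16))           # 14.0
print(fits_on_gpu(70, 16, vram_gb=80))   # False
print(fits_on_gpu(70, 4, vram_gb=80))    # True
```

The last two calls illustrate why quantization appears in the mitigation list: under these assumptions a 70-billion-parameter model at 16 bits needs about 140 GB for weights, while 4-bit quantization brings that down to about 35 GB.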
### Slow Startup
Large models can take several minutes to become ready. If startup probes are too aggressive, Kubernetes may restart a healthy pod before initialization completes.
Mitigation options include increasing `initialDelaySeconds`, increasing `failureThreshold`, separating `/health` from `/ready`, and adding explicit runtime readiness checks.
### Image Pull Failures
Private container registries require the correct image pull secret or workload identity configuration.
```bash
kubectl describe pod <pod-name> -n <namespace>
```
Look for events such as:
- `ImagePullBackOff`
- `ErrImagePull`
- `unauthorized`
- `manifest unknown`
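These strings usually point to different root causes, so a small triage helper can speed up diagnosis. The sketch below is illustrative: `PULL_HINTS` and `triage_pull_events` are hypothetical names, the hints are general guidance rather than an exhaustive mapping, and the sample event text is made up for the example.

```python
PULL_HINTS = {
    "unauthorized": "registry rejected the credentials; check the image pull secret or workload identity",
    "manifest unknown": "tag or digest not found in the registry; check the image reference",
    "ErrImagePull": "the kubelet failed to pull the image; read the event message for the cause",
    "ImagePullBackOff": "the kubelet is backing off after repeated pull failures",
}


def triage_pull_events(events: list) -> list:
    """Return one hint per known marker found in the event messages."""
    hints = []
    for marker, hint in PULL_HINTS.items():
        # Case-insensitive substring match against every event line.
        if any(marker.lower() in event.lower() for event in events):
            hints.append(hint)
    return hints


sample_events = [
    "Failed to pull image: unauthorized: authentication required",
    "Error: ErrImagePull",
]
for hint in triage_pull_events(sample_events):
    print(hint)
```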
### Incorrect Scheduling
If a pod remains in the `Pending` state, the scheduler could not find a node that satisfies the pod's node selector, tolerations, GPU resource requests, or memory requests.
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl get nodes --show-labels
```
Check:
- node labels
- taints and tolerations
- GPU resource requests
- namespace quotas
- node pool capacity
- cluster autoscaler limits
## Observability Requirements
A production inference service should expose enough telemetry to answer operational questions quickly.
Minimum metrics should include:
- request count
- request latency
- error rate
- queue depth
- GPU utilization
- GPU memory usage
- model load time
- token throughput
- pod restart count
- readiness failures
Logs should include request identifiers, runtime startup events, model load status, failure reasons, and dependency errors.
Without this telemetry, debugging inference failures becomes guesswork.
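One lightweight way to meet the logging requirement is to emit one JSON object per event, always carrying the request identifier. The sketch below uses only the Python standard library; the field names (`ts`, `event`, `request_id`, and the extra keyword fields) are assumptions for illustration, not a required schema.

```python
import json
import logging
import time

logger = logging.getLogger("inference")


def format_event(event: str, request_id: str, **fields) -> str:
    """Serialize one log event as a single JSON line."""
    record = {"ts": time.time(), "event": event, "request_id": request_id, **fields}
    return json.dumps(record, sort_keys=True)


def log_event(event: str, request_id: str, **fields) -> str:
    """Emit the event through the standard logger and return the line."""
    line = format_event(event, request_id, **fields)
    logger.info(line)
    return line
```

A request handler might call `log_event("request_completed", request_id, latency_ms=84.2, tokens_out=128)`, and the model loader might call `log_event("model_loaded", "startup", load_seconds=212.0)`. Because every line is valid JSON, a log pipeline can filter by `request_id` without custom parsing.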
## Scaling Considerations
GPU inference scaling is not the same as scaling stateless web APIs.
Before enabling autoscaling, define:
- whether scaling is based on request rate, queue depth, GPU utilization, or latency
- whether the runtime supports batching
- how long new replicas take to become ready
- whether model artifacts are cached locally
- whether node autoscaling can provision GPU nodes fast enough
- whether cold starts are acceptable
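For many inference systems, queue-based scaling is more predictable than CPU-based scaling, because CPU utilization says little about how loaded a GPU-bound service is, while queue depth directly reflects requests waiting for capacity.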
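As a rough illustration of queue-based scaling, the sketch below applies the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler (desired replicas = ceil(current replicas × current metric / target metric)) to an average queue depth per replica. The function name and the replica bounds are assumptions, and the sketch ignores the autoscaler's tolerance band and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Proportional scaling on a per-replica metric such as queue depth."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds, for example to cap GPU spend.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(2, current_metric=30, target_metric=10))  # 6
print(desired_replicas(4, current_metric=5, target_metric=10))   # 2
```

Whatever replica count the rule produces only helps if new replicas become ready in time, which is why replica startup time and GPU node provisioning speed appear in the list above.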
## Production Readiness Checklist
Before exposing an AI inference service to production traffic, verify:
- GPU scheduling and node affinity
- taints and tolerations
- runtime health checks
- model readiness checks
- private image pull access
- secrets managed outside container images
- persistent storage or cache requirements
- request timeout behavior
- rate limiting
- TLS termination
- ingress routing
- centralized logging
- metrics and alerts
- GPU memory visibility
- autoscaling thresholds
- rollback procedure
- disaster recovery procedure
- cost monitoring
- documented failure response
## Key Takeaways
GPU inference platforms should be designed around operational reliability, not just model execution.
The runtime may be Ollama, vLLM, TGI, TensorRT-LLM, llama.cpp, or another serving layer. The production pattern remains the same: isolate GPU workloads, validate readiness, protect secrets, observe runtime behavior, plan for slow startup, and document recovery procedures.
A model that works locally is only the first step. A model that operates safely in production requires platform engineering.