Deploying LLMs on Kubernetes with GPUs
Kubernetes with GPU nodes is the standard for scalable LLM deployments in the enterprise. It provides autoscaling, rolling updates, health checks, and resource isolation. While more complex than bare metal, it offers significantly better manageability and reliability.
Preparing a Kubernetes cluster for GPUs
The NVIDIA Device Plugin is a required component:

# Install via Helm
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set gfd.enabled=true \
  --set devicePlugin.config.sharing.timeSlicing.resources[0].name=nvidia.com/gpu \
  --set devicePlugin.config.sharing.timeSlicing.resources[0].replicas=4  # time-slicing for small models
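Before deploying anything real, it is worth confirming that the plugin actually advertises `nvidia.com/gpu` to the scheduler. One quick check is a throwaway pod (hypothetical name `gpu-smoke-test`; any CUDA base image works) that runs nvidia-smi and exits:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # any CUDA base image works here
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"   # requests one GPU from the device plugin
```

`kubectl logs gpu-smoke-test` should print the familiar nvidia-smi table; if the pod stays Pending, the device plugin is not advertising GPU resources on any node.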
NVIDIA GPU Operator (for managed K8s or when driver management is needed):
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace
Deployment for vLLM
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b
  namespace: ai-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama3-8b
  template:
    metadata:
      labels:
        app: vllm-llama3-8b
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"   # vLLM serves /metrics on the API port
        prometheus.io/path: "/metrics"
    spec:
      # Pin to GPU nodes; the label value is set by GPU Feature Discovery,
      # verify the exact string on your nodes with kubectl get nodes --show-labels
      nodeSelector:
        nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.5.0
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model=/models/llama-3-8b-instruct
            - --tensor-parallel-size=1
            - --max-model-len=8192
            - --max-num-seqs=256
            - --gpu-memory-utilization=0.90
            - --port=8000
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "32Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: "1"
              memory: "24Gi"
              cpu: "4"
          volumeMounts:
            - name: model-storage
              mountPath: /models
              readOnly: true
            - name: shm
              mountPath: /dev/shm  # shared memory for PyTorch/NCCL
          env:
            - name: NCCL_DEBUG
              value: "WARN"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60  # model loading takes time
            periodSeconds: 10
            failureThreshold: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 30
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-storage-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
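The --max-model-len and --gpu-memory-utilization flags above trade KV-cache capacity against batch size. A rough sizing sketch, using the public Llama-3-8B architecture numbers (32 layers, 8 KV heads, head dim 128, fp16); this is a back-of-the-envelope estimate, not an exact vLLM accounting:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # One K and one V tensor per layer -> factor of 2
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(32, 8, 128)   # Llama-3-8B, fp16
per_seq = per_token * 8192                   # one full --max-model-len sequence
print(per_token)            # 131072 bytes (128 KiB) per token
print(per_seq // 2**30)     # 1 GiB per max-length sequence
```

On an 80 GB A100 at 90% utilization, with roughly 16 GB of fp16 weights, that leaves on the order of 50 concurrent full-length sequences; since typical prompts are much shorter than 8192 tokens, --max-num-seqs=256 is a realistic ceiling.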
Service and Ingress
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b
  namespace: ai-serving
spec:
  selector:
    app: vllm-llama3-8b
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: ai-serving
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"  # required for streaming responses
spec:
  ingressClassName: nginx
  rules:
    - host: llm.company.internal
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service:
                name: vllm-llama3-8b
                port:
                  number: 80
HorizontalPodAutoscaler for GPU
A standard CPU-based HPA is a poor fit for LLM serving: GPU-bound pods often show low CPU usage even when fully saturated. Scale on custom metrics instead (this requires a metrics pipeline, typically Prometheus plus prometheus-adapter, to expose them through the custom metrics API):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3-8b
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_queue_size  # custom metric exposed via Prometheus
        target:
          type: AverageValue
          averageValue: "10"  # scale out when > 10 requests are queued per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120  # at most one new pod every 2 minutes
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling in
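The autoscaler applies the standard HPA formula to this metric; a small sketch shows how the averageValue target and the scaleUp policy interact (the formula is the one documented for the Kubernetes HPA):

```python
import math

def hpa_desired(current_replicas: int, metric_avg: float, target_avg: float) -> int:
    # Kubernetes HPA core formula: ceil(current * metric / target)
    return math.ceil(current_replicas * metric_avg / target_avg)

# 2 replicas, an average of 25 queued requests per pod, target 10:
print(hpa_desired(2, 25, 10))  # -> 5 desired replicas
# The scaleUp policy above still limits growth to 1 pod per 120 s,
# so the deployment ramps 2 -> 3 -> 4 -> 5 over roughly 6 minutes.
```

The slow ramp is deliberate: each new pod spends about a minute loading the model before it passes readiness, so faster scale-up would mostly add pods that cannot serve traffic yet.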
PersistentVolume for models
# Mounting one model volume into many pods (ReadOnlyMany/ReadWriteMany)
# requires NFS or a CSI driver such as AWS EFS
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage-pvc
  namespace: ai-serving
spec:
  accessModes: [ReadOnlyMany]
  storageClassName: nfs-fast
  resources:
    requests:
      storage: 200Gi
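The 200Gi request can be sanity-checked against checkpoint sizes: fp16/bf16 weights take roughly 2 bytes per parameter (a rule of thumb; tokenizer and config files add comparatively little):

```python
def weights_gib(params_billions: float, dtype_bytes: int = 2) -> float:
    # fp16/bf16 checkpoints: ~2 bytes per parameter
    return params_billions * 1e9 * dtype_bytes / 2**30

print(round(weights_gib(8)))    # ~15 GiB for an 8B model
print(round(weights_gib(70)))   # ~130 GiB for a 70B model
```

So 200Gi comfortably holds one 70B checkpoint or several 8B ones, with headroom for revisions.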
Multi-GPU with tensor parallelism
# For a 70B model: one pod with 4 GPUs
resources:
  limits:
    nvidia.com/gpu: "4"
    memory: "320Gi"
    cpu: "32"
# and add --tensor-parallel-size=4 to args
Note: all containers of a pod are always scheduled onto a single node, so a 4-GPU pod automatically gets all four GPUs on the same physical host (with NVLink between them, if the node has it). Affinity rules solve a different problem: spreading replicas across nodes so that the failure of one host does not take down all serving capacity:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: vllm-llama3-70b  # hypothetical label for the 70B deployment
        topologyKey: kubernetes.io/hostname
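Why exactly 4 GPUs: with tensor parallelism vLLM shards the weight matrices across GPUs, so the per-GPU weight budget shrinks accordingly. A back-of-the-envelope for fp16 on 80 GB A100s (an estimate, not an exact figure):

```python
params = 70e9
weight_bytes = params * 2                 # fp16 -> ~130 GiB of weights in total
per_gpu_gib = weight_bytes / 4 / 2**30    # sharded across 4 GPUs
print(round(per_gpu_gib))                 # ~33 GiB of weights per GPU
# leaving roughly 40 GiB per A100-80GB for KV cache and activations
```

A single 80 GB GPU cannot hold the ~130 GiB of weights at all, while 4-way sharding leaves about half of each GPU free for the KV cache, which is what makes batched serving of a 70B model practical.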
Implementation timeframes
Week 1: install the NVIDIA Device Plugin, run a test deployment, verify GPU access
Week 2: set up the model PVC, Ingress, and health checks
Week 3: HPA with custom metrics, monitoring, rolling updates
Month 2: multi-model deployment, cost optimization, disaster recovery