Configuring Kubernetes for AI/ML Workloads: GPU Scheduling and NVIDIA GPU Operator
Kubernetes orchestrates ML training jobs, inference services, and pipelines with automatic GPU resource management. The NVIDIA GPU Operator simplifies GPU management in a cluster: it automatically installs the driver, the container toolkit, and the device plugin on every GPU node.
NVIDIA GPU Operator
The GPU Operator manages all GPU components as Kubernetes Custom Resources:
# Install via Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set driver.version="545.23.06" \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set gfd.enabled=true  # GPU Feature Discovery
# Verify the installation
kubectl get pods -n gpu-operator
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'
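To confirm that pods can actually be scheduled onto a GPU, it helps to run a throwaway smoke-test pod. A minimal sketch (the CUDA image tag is an assumption; any image containing nvidia-smi works):

```yaml
# Smoke-test pod: requests one GPU and prints the nvidia-smi report
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.3.1-base-ubuntu22.04  # assumed image; any CUDA base image with nvidia-smi
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

If `kubectl logs gpu-smoke-test` shows the usual GPU table, the driver, toolkit, and device plugin are all working end to end.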
GPU Scheduling
Basic GPU allocation:
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: your-registry/ml-trainer:v1.0
        resources:
          limits:
            nvidia.com/gpu: 4  # 4 full GPUs
            memory: "64Gi"
            cpu: "16"
          requests:
            nvidia.com/gpu: 4
MIG (Multi-Instance GPU) for A100 split:
# Allocate a 1/7 slice of an A100 (MIG profile 1g.10gb)
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1
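Before MIG resources can be requested, the GPU must be partitioned. With the GPU Operator this is typically done by choosing a MIG strategy at install time and letting the bundled MIG Manager reconfigure the node via a label; a sketch, assuming the node name and one of the standard A100-80GB profile configs:

```shell
# Choose how MIG devices are exposed (single = uniform profiles, mixed = heterogeneous)
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
  --set mig.strategy=mixed

# Ask MIG Manager to partition this node into 7x 1g.10gb instances
kubectl label node gpu-node-1 nvidia.com/mig.config=all-1g.10gb --overwrite
```

After the reconfiguration completes, `kubectl describe node gpu-node-1` should list `nvidia.com/mig-1g.10gb` under allocatable resources.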
Node Labels and GPU selection
# Label nodes by GPU type
# (GPU Feature Discovery applies nvidia.com/gpu.product automatically;
# manual labeling is shown here for clarity)
kubectl label node gpu-node-1 nvidia.com/gpu.product=A100-SXM4-80GB
kubectl label node gpu-node-2 nvidia.com/gpu.product=A10G

# Pin a pod to a specific GPU type
nodeSelector:
  nvidia.com/gpu.product: A100-SXM4-80GB
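A nodeSelector is a hard requirement: the pod stays Pending if no matching node exists. When A100 nodes are preferred but another GPU type is acceptable as a fallback, a soft node-affinity rule expresses that instead. A sketch using the same GFD label:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: nvidia.com/gpu.product
          operator: In
          values: ["A100-SXM4-80GB"]
```

The scheduler tries A100 nodes first but still places the pod elsewhere if none have free GPUs.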
Priority Classes for GPU workloads
# High priority for production inference
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-inference-critical
value: 1000
globalDefault: false
---
# Low priority for batch training (can be preempted by higher-priority pods)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-training-batch
value: 100
preemptionPolicy: Never  # batch pods never preempt others; they can still be evicted
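A workload opts into a class via priorityClassName in its pod spec; for example, an inference Deployment might carry (the image name here is hypothetical):

```yaml
spec:
  template:
    spec:
      priorityClassName: ml-inference-critical
      containers:
      - name: inference
        image: your-registry/ml-inference:v1  # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1
```

If the cluster is full, the scheduler may evict ml-training-batch pods to free GPUs for this pod.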
Gang Scheduling with Volcano
For distributed training, all pods must start simultaneously:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  minAvailable: 4  # all 4 pods (1 master + 3 workers) must be schedulable together
  schedulerName: volcano
  plugins:
    pytorch: ["--master=1", "--worker=3", "--port=23456"]
  tasks:
  - replicas: 1
    name: master
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        containers:
        - name: master
          image: your-registry/pytorch-trainer:v1
          resources:
            limits:
              nvidia.com/gpu: 8
  - replicas: 3
    name: worker
    template:
      spec:
        containers:
        - name: worker
          image: your-registry/pytorch-trainer:v1
          resources:
            limits:
              nvidia.com/gpu: 8
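Beyond gang scheduling, Volcano can partition GPU capacity between teams with Queue resources. A sketch (the queue name and capability numbers are assumptions):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: training-queue
spec:
  weight: 1            # relative share when the cluster is contended
  capability:
    nvidia.com/gpu: 32  # hard cap on GPUs this queue may consume
```

A Volcano Job is then assigned to the queue via `spec.queue: training-queue`, so one team's batch jobs cannot starve another's.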
GPU Monitoring via DCGM Exporter
# DCGM Exporter ships as part of the GPU Operator
# Grafana dashboard ID: 12239 (NVIDIA DCGM Exporter Dashboard)
# Key Prometheus metrics:
# DCGM_FI_DEV_GPU_UTIL: GPU utilization (target: >80% during training)
# DCGM_FI_DEV_MEM_USED: GPU memory usage
# DCGM_FI_DEV_POWER_USAGE: power draw
# DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL: NVLink throughput
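These metrics make it straightforward to alert on idle GPUs, which usually means a stuck job or wasted hardware. A sketch of a Prometheus alerting rule (the threshold, duration, and label names are assumptions to adapt to your setup):

```yaml
groups:
- name: gpu-alerts
  rules:
  - alert: GPUUnderutilized
    expr: avg by (gpu, Hostname) (DCGM_FI_DEV_GPU_UTIL) < 20
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} under 20% utilization for 30m"
```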
Cluster Autoscaler for GPU nodes
# Autoscaling a GPU node pool (GKE/EKS/AKS):
# nodes are added when GPU requests cannot be satisfied
# and removed after prolonged idleness (>10 min)
annotations:
  cluster-autoscaler.kubernetes.io/safe-to-evict: "false"  # protect training jobs from scale-down eviction
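GPU node pools are usually tainted so that CPU-only pods don't land on expensive hardware, and GPU workloads carry a matching toleration. A sketch (the taint key/value is a common convention, not mandated):

```yaml
# Taint applied to the GPU node pool, e.g.:
#   kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
# Toleration in the GPU workload's pod spec:
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
```

With the taint in place, the autoscaler only scales the GPU pool for pods that both request nvidia.com/gpu and tolerate the taint.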
A properly configured Kubernetes GPU cluster can sustain 75-85% GPU utilization with mixed workloads (training + inference), significantly better than running the same workloads without an orchestrator.