Configuring Kubernetes for AI/ML Workloads: GPU Scheduling and NVIDIA GPU Operator

Kubernetes for ML orchestrates training jobs, inference services, and ML pipelines with automatic GPU resource management. The NVIDIA GPU Operator simplifies GPU management in a K8s cluster: it automatically installs the driver, the NVIDIA container toolkit, and the device plugin.

NVIDIA GPU Operator

The GPU Operator manages all GPU components as Kubernetes Custom Resources:

# Install via Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set driver.version="545.23.06" \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set gfd.enabled=true  # GPU Feature Discovery

# Verify that operator pods are running and GPUs are advertised
kubectl get pods -n gpu-operator
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'
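Once the operator pods are Running, a quick end-to-end check is a throwaway pod that runs nvidia-smi on an allocated GPU (a minimal sketch; the CUDA image tag is an assumption, pick one matching your driver):

```yaml
# Smoke test: schedules onto any node exposing nvidia.com/gpu
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.3.2-base-ubuntu22.04  # tag is an assumption
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

`kubectl apply -f gpu-smoke-test.yaml && kubectl logs gpu-smoke-test` should print the familiar driver/GPU table if the whole stack (driver, toolkit, device plugin) is healthy.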

GPU Scheduling

Basic GPU allocation:

apiVersion: batch/v1
kind: Job
metadata:
  name: model-training
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: your-registry/ml-trainer:v1.0
          resources:
            limits:
              nvidia.com/gpu: 4  # 4 whole GPUs, allocated exclusively to this container
              memory: "64Gi"
              cpu: "16"
            requests:
              nvidia.com/gpu: 4

MIG (Multi-Instance GPU) for A100 split:

# Request 1/7 of an A100 (MIG 1g.10gb profile)
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1
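The mig-1g.10gb resource only appears after MIG mode is enabled on the node. With the GPU Operator's MIG Manager this is driven by a node label (the node name below is illustrative):

```shell
# Partition every GPU on the node into 1g.10gb instances
kubectl label node gpu-node-1 nvidia.com/mig.config=all-1g.10gb --overwrite

# After the MIG Manager reconfigures the GPUs, the node advertises
# nvidia.com/mig-1g.10gb resources instead of whole nvidia.com/gpu
kubectl get node gpu-node-1 -o jsonpath='{.status.capacity}'
```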

Node Labels and GPU selection

# Label nodes by GPU model (GPU Feature Discovery normally applies these labels automatically)
kubectl label node gpu-node-1 nvidia.com/gpu.product=A100-SXM4-80GB
kubectl label node gpu-node-2 nvidia.com/gpu.product=A10G

# Pin a pod to a specific GPU model
nodeSelector:
  nvidia.com/gpu.product: A100-SXM4-80GB
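When a job can run on more than one GPU model, nodeAffinity with an In operator is more flexible than a single-value nodeSelector (a sketch; the listed products are illustrative):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.product
              operator: In
              values:
                - A100-SXM4-80GB
                - A100-SXM4-40GB  # acceptable fallback
```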

Priority Classes for GPU workload

# High priority for production inference
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-inference-critical
value: 1000
globalDefault: false
---
# Low priority for batch training: the lower value means these pods are
# preempted first when GPUs run short; preemptionPolicy: Never keeps
# queued batch jobs from evicting anyone else themselves
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-training-batch
value: 100
preemptionPolicy: Never
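Pods opt into a class by name. For example, an inference Deployment would reference the critical class in its pod template (the Deployment name and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels: {app: llm-inference}
  template:
    metadata:
      labels: {app: llm-inference}
    spec:
      priorityClassName: ml-inference-critical  # can preempt batch training when GPUs are scarce
      containers:
        - name: server
          image: your-registry/ml-inference:v1.0
          resources:
            limits:
              nvidia.com/gpu: 1
```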

Gang Scheduling with Volcano

For distributed training, all workers must start together: a partially scheduled job holds GPUs while deadlocked waiting for its missing peers. Volcano's gang scheduling guarantees all-or-nothing placement:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  minAvailable: 4  # All 4 pods (1 master + 3 workers) must be schedulable before any start
  schedulerName: volcano
  plugins:
    pytorch: ["--master=1", "--worker=3", "--port=23456"]
  tasks:
    - replicas: 1
      name: master
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - name: master
              image: your-registry/pytorch-trainer:v1
              resources:
                limits:
                  nvidia.com/gpu: 8
    - replicas: 3
      name: worker
      template:
        spec:
          containers:
            - name: worker
              image: your-registry/pytorch-trainer:v1
              resources:
                limits:
                  nvidia.com/gpu: 8
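The manifest above assumes Volcano is already installed; a minimal install sketch via its official Helm chart:

```shell
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano \
  --namespace volcano-system \
  --create-namespace
```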

GPU Monitoring via DCGM Exporter

# DCGM Exporter is installed together with the GPU Operator (dcgmExporter.enabled=true)
# Grafana dashboard ID: 12239 (NVIDIA DCGM Exporter Dashboard)

# Key Prometheus metrics:
# DCGM_FI_DEV_GPU_UTIL - GPU utilization (target: >80% during training)
# DCGM_FI_DEV_FB_USED - GPU framebuffer (memory) used, MiB
# DCGM_FI_DEV_POWER_USAGE - power draw, watts
# DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL - total NVLink bandwidth
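These metrics plug directly into alerting. A sketch of a PrometheusRule that flags allocated-but-idle GPUs (assumes the Prometheus Operator is installed; the threshold and duration are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts
spec:
  groups:
    - name: gpu
      rules:
        - alert: GPUUnderutilized
          # average utilization per GPU below 20% for half an hour
          expr: avg by (gpu, Hostname) (DCGM_FI_DEV_GPU_UTIL) < 20
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} below 20% utilization for 30m"
```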

Cluster Autoscaler for GPU nodes

# Autoscaling a GPU node pool (GKE/EKS/AKS):
# nodes are added when pods stay Pending for lack of GPU resources,
# and removed after prolonged idleness (default scale-down delay: 10 min)
annotations:
  cluster-autoscaler.kubernetes.io/safe-to-evict: "false"  # for training jobs
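safe-to-evict is a pod-level annotation, so in a training Job it belongs on the pod template, not the Job object (a sketch reusing the model-training Job from above):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training
spec:
  template:
    metadata:
      annotations:
        # tell Cluster Autoscaler not to drain this node mid-training
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: your-registry/ml-trainer:v1.0
          resources:
            limits:
              nvidia.com/gpu: 4
```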

A properly configured K8s GPU cluster sustains 75-85% GPU utilization on mixed workloads (training plus inference), significantly higher than statically assigned GPU instances managed without an orchestrator.