GPU Cluster Setup for Model Training (NVIDIA, A100, H100)


A properly configured GPU cluster is more than just a collection of servers with graphics cards. It's a coherent system: a low-latency network between nodes, correctly configured drivers and CUDA, storage optimized for datasets, an orchestrator for managing training jobs, and monitoring of how well the expensive hardware is utilized.

Hardware selection

NVIDIA A100 (40GB / 80GB SXM) is the primary workhorse for training large models. The SXM version with NVLink provides 600 GB/s GPU-to-GPU bandwidth within the server, which is critical for tensor parallelism.

NVIDIA H100 (80GB SXM5) is the next generation: FP8 training support, 3.35 TB/s HBM3 bandwidth, 900 GB/s NVLink 4.0. Roughly 2-3x faster than the A100 on Transformer training.

Interconnect: InfiniBand HDR/NDR (200-400 Gbps) between nodes is required for effective multi-node training. 100 GbE Ethernet is acceptable, but with a significant reduction in scaling efficiency.
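The bandwidth requirement can be sanity-checked with a back-of-the-envelope model: a ring all-reduce moves roughly 2*(N-1)/N of the gradient buffer over each node's link. A minimal sketch; the model size, node count, and link speeds below are illustrative assumptions, not measurements from a specific cluster:

```python
# Back-of-the-envelope gradient sync cost: a ring all-reduce moves
# about 2*(N-1)/N of the gradient buffer over each node's link.
# Model size, node count and link speeds are illustrative assumptions.

def allreduce_seconds(params: int, bytes_per_param: int,
                      nodes: int, link_gbps: float) -> float:
    traffic_bytes = 2 * (nodes - 1) / nodes * params * bytes_per_param
    link_bytes_per_s = link_gbps * 1e9 / 8
    return traffic_bytes / link_bytes_per_s

# 7B parameters, fp16 gradients, 4 nodes
for link_gbps, name in [(400, "NDR InfiniBand"), (100, "100 GbE")]:
    t = allreduce_seconds(int(7e9), 2, 4, link_gbps)
    print(f"{name}: {t:.2f} s per full gradient sync")
```

The 4x gap per synchronization step is what erodes scaling efficiency on Ethernet unless communication is heavily overlapped with compute.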

Storage: parallel file system (GPFS, Lustre, WekaFS) or NVMe over Fabrics. An inadvertently bottlenecked storage path is one of the most common causes of GPU underutilization.
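A quick way to confirm the storage path is not the bottleneck is to measure raw sequential read throughput from the dataset mount and compare it against what the dataloaders need. A minimal probe (fio is the proper tool for real benchmarking; the path in the usage comment is illustrative):

```python
# Minimal sequential-read probe for the dataset volume. For real
# benchmarking use fio; note the OS page cache will inflate repeat
# runs unless the file is larger than RAM or the cache is dropped.
import os
import time

def read_throughput_gbs(path: str, block_size: int = 8 << 20) -> float:
    """Read a file once, sequentially, and return the observed GB/s."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while f.read(block_size):
            pass
    elapsed = time.perf_counter() - start
    return size / elapsed / 1e9

# Usage (path is illustrative):
# print(read_throughput_gbs("/data/datasets/shard-00000.tar"))
```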

Installing drivers and CUDA

# Ubuntu 22.04
# 1. NVIDIA driver
apt install linux-headers-$(uname -r)
apt install nvidia-driver-535  # or the latest production-branch driver

# 2. CUDA Toolkit (runfile is preferable to apt)
wget https://developer.download.nvidia.com/compute/cuda/12.3.0/local_installers/cuda_12.3.0_545.23.06_linux.run
sh cuda_12.3.0_545.23.06_linux.run --silent --toolkit

# 3. cuDNN (the tarball unpacks into its own directory, not cuda/)
tar -xvf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz
cp cudnn-linux-x86_64-8.9.7.29_cuda12-archive/include/cudnn*.h /usr/local/cuda/include
cp cudnn-linux-x86_64-8.9.7.29_cuda12-archive/lib/libcudnn* /usr/local/cuda/lib64
ldconfig

# 4. Verification
nvidia-smi
nvcc --version
python -c "import torch; print(torch.cuda.device_count())"
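On a multi-node cluster it helps to script this verification step. A sketch that parses the CSV output of `nvidia-smi --query-gpu` to confirm every GPU is visible and on the expected driver branch; the function names and the expected driver string are assumptions to adapt:

```python
# Parse the CSV output of
#   nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv,noheader
# to confirm every GPU is visible and on the expected driver branch.
import subprocess

def gpu_inventory(csv_text: str) -> list[dict]:
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, driver, mem = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name,
                     "driver": driver, "memory": mem})
    return gpus

def check_node(expected_driver_branch: str = "535") -> list[dict]:
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,driver_version,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True).stdout
    gpus = gpu_inventory(out)
    assert gpus, "no GPUs visible"
    assert all(g["driver"].startswith(expected_driver_branch) for g in gpus)
    return gpus
```

Run via Slurm or pdsh across all nodes to catch the one machine with a stale driver before a multi-day job starts.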

Configuring NCCL and testing interconnect

# Install NCCL
apt install libnccl2 libnccl-dev

# Test bandwidth between GPUs within a node
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests && make
./build/all_reduce_perf -b 1G -e 4G -f 2 -g 8

# Expected results on 8x A100 SXM with NVLink:
# 1GB: ~280 GB/s (algbw)
# 4GB: ~300 GB/s (algbw)
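nccl-tests also reports busbw alongside algbw; for all-reduce the relationship is busbw = algbw * 2*(N-1)/N, and busbw is the figure to compare against hardware link bandwidth:

```python
# Convert nccl-tests all-reduce algbw to busbw:
# busbw = algbw * 2*(N-1)/N. busbw is what you compare against
# hardware link bandwidth such as the NVLink numbers above.

def busbw(algbw_gbs: float, ranks: int) -> float:
    return algbw_gbs * 2 * (ranks - 1) / ranks

print(busbw(300, 8))  # 525.0 -> 300 GB/s algbw on 8 GPUs is 525 GB/s busbw
```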

Orchestration with Kubernetes + NVIDIA GPU Operator

GPU Operator automates the installation of drivers, container toolkit, and device plugin:

# Install via Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true

MIG (Multi-Instance GPU) on A100/H100 splits a single GPU into isolated instances, which makes small-batch inference workloads efficient:

nvidia-smi -i 0 -mig 1                        # enable MIG mode on GPU 0 first (may require a GPU reset)
nvidia-smi mig -lgip                          # list available instance profiles
nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C   # 7x MIG 1g.10gb on an A100 80GB
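Before creating instances, it is worth validating that the requested layout fits the seven compute slices an A100/H100 exposes. A simplified check; it counts compute slices only and ignores the memory-slice placement rules the driver additionally enforces:

```python
# Sanity-check a requested MIG layout against the 7 compute slices an
# A100/H100 exposes. Simplified: counts compute slices only and ignores
# the memory-slice placement rules the driver additionally enforces.

SLICES = {"1g": 1, "2g": 2, "3g": 3, "4g": 4, "7g": 7}

def mig_layout_fits(profiles: list[str]) -> bool:
    used = sum(SLICES[p.split(".")[0]] for p in profiles)
    return used <= 7

print(mig_layout_fits(["1g.10gb"] * 7))  # True: 7 slices
print(mig_layout_fits(["3g.40gb"] * 3))  # False: 9 slices
```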

Setting up the task scheduler

Slurm is the de facto standard in HPC and well suited for batch training:

#!/bin/bash
# sbatch job for training
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --partition=a100
#SBATCH --time=48:00:00

# One launcher task per node; torchrun spawns 8 worker processes on each
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
srun torchrun \
  --nnodes=4 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$head_node:29500" \
  train.py
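Alternatively, srun can start one task per GPU (`--ntasks-per-node=8`) with no launcher, and the training script derives its distributed ranks directly from Slurm's environment. A sketch of that mapping; the variable names are standard Slurm, and the wiring into `torch.distributed` is left out:

```python
# Map Slurm's per-task environment onto torch.distributed's expected
# variables. Assumes one srun task per GPU (--ntasks-per-node=8).
import os

def slurm_dist_env() -> dict:
    return {
        "RANK": int(os.environ["SLURM_PROCID"]),         # global rank
        "LOCAL_RANK": int(os.environ["SLURM_LOCALID"]),  # GPU index on this node
        "WORLD_SIZE": int(os.environ["SLURM_NTASKS"]),   # total processes
    }

# train.py would feed these into torch.distributed.init_process_group(),
# plus MASTER_ADDR/MASTER_PORT pointing at the first node.
```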

Volcano is a Kubernetes scheduler for ML workloads with gang-scheduling support: all GPU pods of a job start simultaneously, or none do.

GPU Cluster Monitoring

# DCGM Exporter for Prometheus
helm install dcgm-exporter nvidia/dcgm-exporter

# Key metrics:
# DCGM_FI_DEV_GPU_UTIL: GPU utilization (target > 85%)
# DCGM_FI_DEV_MEM_COPY_UTIL: PCIe bandwidth utilization
# DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL: NVLink throughput
# DCGM_FI_DEV_POWER_USAGE: power draw
# DCGM_FI_DEV_SM_CLOCK: current SM clock frequency
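The utilization target translates directly into an alert rule. A sketch of the check; the samples below are illustrative, where a real setup would use a Prometheus range query over DCGM_FI_DEV_GPU_UTIL:

```python
# Turn the >85% utilization target into a simple check. The samples
# below are illustrative; in production they would come from a
# Prometheus range query over DCGM_FI_DEV_GPU_UTIL.

def underutilized(samples: dict[str, list[float]], target: float = 85.0) -> list[str]:
    """Return GPUs whose mean utilization is below the target."""
    return [gpu for gpu, vals in samples.items()
            if sum(vals) / len(vals) < target]

print(underutilized({"gpu0": [97, 95, 92], "gpu1": [40, 35, 55]}))  # ['gpu1']
```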

Typical setup result

On a 4-node cluster with 8x A100 each (32 GPUs total), properly configured InfiniBand and tuned NCCL yield 85-90% scaling efficiency for LLM pretraining: a run that takes 40 hours on 4 GPUs finishes in ~5.5-6 hours on 32 GPUs, versus the theoretical 5.
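These figures follow from the usual efficiency arithmetic, efficiency = (T_base * N_base) / (T_scaled * N_scaled); solving for the scaled time:

```python
# Scaling-efficiency arithmetic behind the numbers above:
# efficiency = (T_base * N_base) / (T_scaled * N_scaled).

def scaled_hours(base_hours: float, base_gpus: int, gpus: int, efficiency: float) -> float:
    """Expected wall-clock time after scaling out at a given efficiency."""
    return base_hours * base_gpus / (gpus * efficiency)

print(scaled_hours(40, 4, 32, 1.0))    # 5.0 hours, the ideal case
print(scaled_hours(40, 4, 32, 0.875))  # ~5.7 hours at 87.5% efficiency
```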