Setting up a GPU cluster for training models (NVIDIA A100, H100)
A properly configured GPU cluster is more than just a collection of servers with graphics cards. It's a coherent system: a low-latency network between nodes, properly configured drivers and CUDA, optimized storage for datasets, an orchestrator for managing training tasks, and monitoring the utilization of expensive hardware.
Hardware selection
NVIDIA A100 (40GB / 80GB SXM) is the primary workhorse for training large models. The SXM version with NVLink provides 600 GB/s GPU-to-GPU bandwidth within the server, which is critical for tensor parallelism.
NVIDIA H100 (80GB SXM5) is the next generation: FP8 training, 3.35 TB/s HBM3 bandwidth, 900 GB/s NVLink 4.0. Roughly 2-3x faster than the A100 for Transformer training.
Interconnect: InfiniBand HDR/NDR (200-400 Gbps) between nodes is required for effective multi-node training. 100 GbE Ethernet is acceptable, but with a significant reduction in scaling efficiency.
Storage: Parallel file system (GPFS, Lustre, WekaFS) or NVMe over Fabrics. Storage I/O bottlenecks are a common and easily overlooked cause of GPU underutilization.
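Before a long run, it is worth sanity-checking sequential read throughput from the dataset path. A minimal Python sketch (file size, block size, and path are illustrative; use fio for serious benchmarking, and note that the OS page cache makes this number optimistic):

```python
# Rough sequential-read throughput probe for a dataset directory.
import os
import time
import tempfile

def read_throughput_mb_s(path: str, size_mb: int = 64, block_kb: int = 1024) -> float:
    """Write a temp file of size_mb, then time a sequential read of it."""
    data = os.urandom(block_kb * 1024)
    fname = os.path.join(path, "io_probe.bin")
    with open(fname, "wb") as f:
        for _ in range(size_mb * 1024 // block_kb):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())          # make sure the writes hit the device
    start = time.perf_counter()
    with open(fname, "rb") as f:
        while f.read(block_kb * 1024):
            pass
    elapsed = time.perf_counter() - start
    os.remove(fname)
    return size_mb / elapsed

if __name__ == "__main__":
    print(f"{read_throughput_mb_s(tempfile.gettempdir()):.0f} MB/s")
```

If the number is far below what the data loader needs per GPU, the GPUs will stall on input regardless of interconnect tuning.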
Installing drivers and CUDA
# Ubuntu 22.04
# 1. NVIDIA driver
apt install linux-headers-$(uname -r)
apt install nvidia-driver-535   # or the latest production-branch driver
# 2. CUDA Toolkit (the runfile is preferable to apt)
wget https://developer.download.nvidia.com/compute/cuda/12.3.0/local_installers/cuda_12.3.0_545.23.06_linux.run
sh cuda_12.3.0_545.23.06_linux.run --silent --toolkit
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# 3. cuDNN
tar -xvf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz
cp cudnn-linux-x86_64-8.9.7.29_cuda12-archive/include/cudnn*.h /usr/local/cuda/include
cp cudnn-linux-x86_64-8.9.7.29_cuda12-archive/lib/libcudnn* /usr/local/cuda/lib64
ldconfig
# 4. Verification
nvidia-smi
nvcc --version
python -c "import torch; print(torch.cuda.device_count())"
Configuring NCCL and testing interconnect
# Install NCCL
apt install libnccl2 libnccl-dev
# Bandwidth test between GPUs within a node
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests && make
./build/all_reduce_perf -b 1G -e 4G -f 2 -g 8
# Expected results on 8x A100 SXM with NVLink at large message sizes:
# busbw ~230-240 GB/s (algbw is lower for all-reduce; see conversion below)
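nccl-tests reports two numbers: algbw (message size divided by time) and busbw, which normalizes for the traffic pattern so it can be compared against hardware link bandwidth. For all-reduce the relation is busbw = algbw * 2*(n-1)/n, per the nccl-tests performance notes; a small helper:

```python
def allreduce_busbw(algbw_gb_s: float, n_gpus: int) -> float:
    """Convert nccl-tests algbw to busbw for the all-reduce pattern.

    Each byte of an n-GPU ring all-reduce traverses the links
    2*(n-1)/n times, so busbw = algbw * 2*(n-1)/n.
    """
    return algbw_gb_s * 2 * (n_gpus - 1) / n_gpus

# e.g. 8 GPUs: busbw is 1.75x the reported algbw
```

This is why an 8-GPU algbw in the low 100s of GB/s corresponds to a busbw close to the ~300 GB/s aggregate NVLink figure.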
Orchestration with Kubernetes + NVIDIA GPU Operator
GPU Operator automates the installation of drivers, container toolkit, and device plugin:
# Install via Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=true \
--set toolkit.enabled=true
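Once the operator's pods are healthy, a minimal pod requesting nvidia.com/gpu serves as a smoke test (the image tag below is illustrative; any CUDA base image works):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.3.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # scheduled via the device plugin the operator installs
```

If the pod completes and its logs show nvidia-smi output, the driver, container toolkit, and device plugin are all wired up correctly.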
MIG (Multi-Instance GPU) for A100/H100 — splitting a single GPU into isolated instances for efficient use in small batch inference:
nvidia-smi -mig 1                             # enable MIG mode (may require a GPU reset)
nvidia-smi mig -lgip                          # list available profiles
nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C   # 7x MIG 1g.10gb on A100 80GB
Setting up the task scheduler
Slurm is a standard in HPC, well suited for batch training:
# sbatch job for training
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --partition=a100
#SBATCH --time=48:00:00
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
srun torchrun \
    --nnodes=4 \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:29500 \
    train.py
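An alternative to torchrun is launching one srun task per GPU and deriving the variables torch.distributed expects from Slurm's environment inside train.py. A minimal sketch of that mapping (the function name is illustrative):

```python
# Map Slurm's per-task variables to the env vars torch.distributed
# init reads (one srun task per GPU, e.g. --ntasks-per-node=8).
import os

def slurm_dist_env(env: dict) -> dict:
    """Derive rank/world-size variables from Slurm's task layout."""
    return {
        "RANK": env["SLURM_PROCID"],        # global task index
        "LOCAL_RANK": env["SLURM_LOCALID"], # task index on this node -> GPU index
        "WORLD_SIZE": env["SLURM_NTASKS"],  # total tasks = total GPUs
    }

# In train.py, before torch.distributed.init_process_group():
# os.environ.update(slurm_dist_env(os.environ))
```

MASTER_ADDR and MASTER_PORT still need to be set (typically to the first node in SLURM_JOB_NODELIST) for rendezvous.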
Volcano is a Kubernetes scheduler for ML workloads with gang-scheduling support: all of a job's GPU pods start together, or none of them do, which prevents partial allocations from deadlocking the cluster.
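A sketch of a Volcano Job using gang scheduling via minAvailable (job name and image are illustrative):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llm-pretrain
spec:
  minAvailable: 4        # gang scheduling: schedule all 4 pods or none
  schedulerName: volcano
  tasks:
  - name: worker
    replicas: 4          # one pod per 8-GPU node
    template:
      spec:
        containers:
        - name: trainer
          image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
          resources:
            limits:
              nvidia.com/gpu: 8
```

With the default scheduler, 3 of 4 pods could start and hold GPUs indefinitely while waiting for the fourth; minAvailable avoids exactly that.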
GPU Cluster Monitoring
# DCGM Exporter for Prometheus
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter
# Key metrics:
# DCGM_FI_DEV_GPU_UTIL: GPU utilization (target > 85%)
# DCGM_FI_DEV_MEM_COPY_UTIL: memory copy (PCIe) bandwidth utilization
# DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL: NVLink throughput
# DCGM_FI_DEV_POWER_USAGE: power draw
# DCGM_FI_DEV_SM_CLOCK: current SM clock
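These metrics can feed alerting. A sketch of a Prometheus rule flagging sustained underutilization of expensive hardware (the Hostname label assumes dcgm-exporter defaults; the threshold and duration are illustrative):

```yaml
groups:
- name: gpu
  rules:
  - alert: GPUUnderutilized
    # Average utilization across a node's GPUs below 50% for 30 minutes
    expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL) < 50
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "GPUs on {{ $labels.Hostname }} underutilized for 30m"
```

Sustained low utilization usually points at data-loading or interconnect bottlenecks rather than a lack of work.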
Typical setup result
On a 4-node cluster of 8x A100 each (32 GPUs in total), with InfiniBand and NCCL configured correctly, LLM pretraining reaches 85-90% scaling efficiency: a run that takes 40 hours on 4 GPUs finishes in ~5.6-5.9 hours on 32 GPUs, versus the theoretical 5.
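The arithmetic above generalizes into a quick estimator for planning runs (function name is illustrative):

```python
def scaled_time_hours(base_hours: float, base_gpus: int,
                      target_gpus: int, efficiency: float) -> float:
    """Estimate wall-clock time when scaling out at a given efficiency.

    efficiency=1.0 is perfect linear scaling; multi-node LLM
    pretraining over InfiniBand typically lands at 0.85-0.90.
    """
    speedup = (target_gpus / base_gpus) * efficiency
    return base_hours / speedup

# 40 h on 4 GPUs -> 32 GPUs at 87% efficiency:
# 40 / (8 * 0.87) ≈ 5.7 h (ideal scaling would give 5.0 h)
```

The gap between estimated and measured time is a useful regression signal when tuning NCCL or swapping interconnects.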