Setting up DeepSpeed for distributed LLM training
DeepSpeed is a Microsoft library for efficiently training large language models. Its key feature is ZeRO (Zero Redundancy Optimizer), which eliminates redundant copies of optimizer state, gradients, and model parameters across GPUs; it also supports mixed precision, gradient checkpointing, and pipeline parallelism.
ZeRO: DeepSpeed's Key Innovation
ZeRO Stage 1 — shards the optimizer state (the Adam moments and FP32 master weights) across GPUs. On 8 GPUs, per-GPU optimizer memory drops ~8x.
ZeRO Stage 2 — adds gradient sharding on top of Stage 1. Total per-GPU reduction: ~8x for optimizer state and ~8x for gradients.
ZeRO Stage 3 — full sharding: parameters, gradients, and optimizer state. Allows training models that do not fit in any single GPU's memory, as long as their states fit across the cluster as a whole. The trade-off is communication: parameters must be gathered and scattered on every forward/backward pass, so the overhead is higher than in Stage 2.
ZeRO-Infinity — offloading parameters to CPU RAM and NVMe SSD. Allows training models with trillions of parameters on a limited number of GPUs using PCIe/NVMe bandwidth.
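The memory arithmetic behind these stages can be sketched in a few lines (a simplification assuming FP16 parameters and gradients plus Adam state with FP32 master weights, i.e. 2 + 2 + 12 bytes per parameter; activations, buffers, and fragmentation are ignored):

```python
def zero_memory_per_gpu(n_params, n_gpus, stage):
    """Rough per-GPU model-state memory (bytes) for ZeRO stages 0-3.

    Assumes FP16 params (2 B) and grads (2 B) plus Adam optimizer state
    with FP32 master weights (4 + 4 + 4 = 12 B per parameter).
    """
    params, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params
    if stage >= 1:
        optim /= n_gpus    # Stage 1: shard optimizer state
    if stage >= 2:
        grads /= n_gpus    # Stage 2: also shard gradients
    if stage >= 3:
        params /= n_gpus   # Stage 3: also shard parameters
    return params + grads + optim

# A 7B-parameter model on 8 GPUs, per-GPU model states in GB:
for stage in range(4):
    print(stage, zero_memory_per_gpu(7e9, 8, stage) / 1e9)
```

Even this crude estimate shows why a 7B model that needs ~112 GB of model states per GPU without ZeRO fits comfortably under Stage 2/3 on 80 GB cards.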
DeepSpeed Configuration
{
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": false // или true для A100/H100
},
"gradient_accumulation_steps": 4,
"gradient_clipping": 1.0,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": 4,
"wall_clock_breakdown": false
}
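One invariant worth knowing: DeepSpeed resolves "train_batch_size": "auto" as micro-batch per GPU x gradient-accumulation steps x number of data-parallel GPUs, and rejects a config in which an explicit value disagrees with that product. A tiny sketch of the arithmetic:

```python
def effective_batch_size(micro_batch_per_gpu, grad_accum_steps, n_gpus):
    """Global batch size implied by a DeepSpeed config:
    per-GPU micro-batch x gradient-accumulation steps x data-parallel GPUs."""
    return micro_batch_per_gpu * grad_accum_steps * n_gpus

# With the config above (micro-batch 4, accumulation 4) on 8 GPUs:
print(effective_batch_size(4, 4, 8))  # 128
```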
Integration with Hugging Face Transformers
from transformers import TrainingArguments, Trainer

# model, tokenizer, and train_dataset are assumed to be created beforehand
training_args = TrainingArguments(
    output_dir="./results",
    deepspeed="ds_config_zero2.json",  # path to the DeepSpeed config
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    fp16=True,
    num_train_epochs=3,
    logging_steps=100,
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
Launch:
deepspeed --num_gpus=8 train.py --deepspeed ds_config.json
# Or via torchrun:
torchrun --nproc_per_node=8 train.py --deepspeed ds_config.json
ZeRO Stage 3: Training Very Large Models
For 30B+ models on clusters with limited GPU memory:
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
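The host-RAM cost of the offloads above can be approximated as follows (a rough sketch: offloaded Adam state is taken as 12 bytes per parameter for the FP32 master weights plus the two moments, offloaded FP16 parameters as 2 bytes per parameter; gradients and pinned-buffer overhead are ignored):

```python
def offload_ram_bytes(n_params, offload_optimizer=True, offload_param=True):
    """Rough host-RAM footprint of ZeRO-3 CPU offload per node.

    Offloaded optimizer state: FP32 master weights + Adam momentum
    + variance = 12 B/param; offloaded parameters: FP16 = 2 B/param.
    """
    ram = 0
    if offload_optimizer:
        ram += 12 * n_params
    if offload_param:
        ram += 2 * n_params
    return ram

# A 13B model with both offloads needs on the order of 182 GB of host RAM:
print(offload_ram_bytes(13e9) / 1e9)
```

If the estimate exceeds available RAM, ZeRO-Infinity's NVMe tier ("device": "nvme") is the next step down the memory hierarchy.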
Estimating memory requirements
DeepSpeed provides a memory estimation tool:
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")
estimate_zero3_model_states_mem_needs_all_live(
    model,
    num_gpus_per_node=8,
    num_nodes=1
)
# Prints estimated CPU RAM and GPU memory requirements for different configurations
Practical benchmarks
| Configuration | Model | Cluster | Throughput |
|---|---|---|---|
| ZeRO-2, BF16 | LLaMA 7B | 8x A100 80GB | ~7000 tokens/s |
| ZeRO-2, BF16 | LLaMA 13B | 8x A100 80GB | ~3500 tokens/s |
| ZeRO-3, BF16 | LLaMA 30B | 8x A100 80GB | ~1200 tokens/s |
| ZeRO-3 + Offload | LLaMA 65B | 8x A100 80GB + 512GB RAM | ~400 tokens/s |
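These throughput figures translate into wall-clock time directly; a back-of-the-envelope helper (ignoring checkpointing, evaluation, and restart overhead):

```python
def training_days(total_tokens, tokens_per_second):
    """Wall-clock training time in days at a sustained token throughput."""
    return total_tokens / tokens_per_second / 86400  # 86400 s per day

# E.g. 100B tokens at ~3500 tokens/s (the LLaMA 13B row above):
print(f"{training_days(100e9, 3500):.0f} days")
```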
DeepSpeed, combined with activation (gradient) checkpointing, allows training models 3-5x larger on the same hardware than a naive DDP setup.