Setting up GPU and VRAM Utilization Monitoring
GPU resource monitoring is critical for AI infrastructure: running out of VRAM crashes services, while underutilized GPUs waste money. The stack covered here: DCGM Exporter + Prometheus + Grafana.
DCGM Exporter for detailed GPU metrics
NVIDIA DCGM (Data Center GPU Manager) is the official tool for collecting GPU metrics; it provides significantly more detail than nvidia-smi:
# Run DCGM Exporter via Docker
docker run -d \
  --gpus all \
  --cap-add SYS_ADMIN \
  -p 9400:9400 \
  --name dcgm-exporter \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04

# Or via docker-compose
services:
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - DCGM_EXPORTER_COLLECTORS=/etc/dcgm-exporter/dcp-metrics-included.csv
    ports:
      - "9400:9400"
    cap_add:
      - SYS_ADMIN
    restart: unless-stopped
DCGM Metrics:
DCGM_FI_DEV_GPU_UTIL # GPU utilization %
DCGM_FI_DEV_FB_USED # VRAM used (MiB)
DCGM_FI_DEV_FB_FREE # VRAM free (MiB)
DCGM_FI_DEV_FB_TOTAL # VRAM total (MiB)
DCGM_FI_DEV_SM_CLOCK # SM clock frequency (MHz)
DCGM_FI_DEV_MEM_CLOCK # Memory clock (MHz)
DCGM_FI_DEV_GPU_TEMP # Temperature (°C)
DCGM_FI_DEV_POWER_USAGE # Power draw (W)
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL # NVLink bandwidth
DCGM_FI_PROF_SM_ACTIVE # % of SMs active (more accurate than GPU_UTIL)
DCGM_FI_PROF_GR_ENGINE_ACTIVE # Graphics engine active
DCGM_FI_PROF_DRAM_ACTIVE # Memory bandwidth utilization
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE # Tensor core utilization
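DCGM Exporter serves these metrics as Prometheus exposition text on port 9400. As a quick sanity check of what that output looks like, here is a small sketch that parses the FB_USED/FB_TOTAL lines and computes VRAM usage per GPU (the sample text below is illustrative, not captured from a real exporter):

```python
import re

def vram_usage_pct(metrics_text: str) -> dict:
    """Parse Prometheus exposition text and return VRAM usage % keyed by gpu label."""
    used, total = {}, {}
    line_re = re.compile(r'^(DCGM_FI_DEV_FB_(?:USED|TOTAL))\{([^}]*)\}\s+([0-9.eE+-]+)')
    for line in metrics_text.splitlines():
        m = line_re.match(line)
        if not m:
            continue
        name, labels, value = m.group(1), m.group(2), float(m.group(3))
        gpu = re.search(r'gpu="([^"]*)"', labels)
        key = gpu.group(1) if gpu else '?'
        (used if name.endswith('USED') else total)[key] = value
    return {k: used[k] / total[k] * 100 for k in used if total.get(k)}

# Illustrative sample in DCGM Exporter's output format (values made up)
sample = '''
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-abc"} 72000
DCGM_FI_DEV_FB_TOTAL{gpu="0",UUID="GPU-abc"} 81920
'''
print(vram_usage_pct(sample))  # {'0': 87.890625}
```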
Prometheus configuration
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: dcgm
    static_configs:
      - targets:
          - gpu-server-1:9400
          - gpu-server-2:9400
          - gpu-server-3:9400
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):.*'
        target_label: instance

  - job_name: vllm
    static_configs:
      - targets:
          - gpu-server-1:8000
          - gpu-server-2:8000
    metrics_path: /metrics

# Alerting rules
rule_files:
  - "gpu_alerts.yml"
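The relabel rule above rewrites the `instance` label to drop the scrape port, so alerts read "gpu-server-1" rather than "gpu-server-1:9400". Prometheus anchors relabel regexes at both ends and substitutes the default replacement `$1`, which behaves like this sketch:

```python
import re

# Mirror of the relabel rule: regex '(.*):.*' applied to __address__,
# with the default replacement '$1' written into target_label `instance`.
# Prometheus fully anchors the regex, so fullmatch is the right analogue.
addr = "gpu-server-1:9400"
m = re.fullmatch(r'(.*):.*', addr)
print(m.group(1))  # gpu-server-1
```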
Alerting rules
# gpu_alerts.yml
groups:
  - name: gpu_alerts
    rules:
      - alert: GPUMemoryNearFull
        expr: |
          (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }}: VRAM > 95%"
          description: "VRAM usage: {{ $value | humanizePercentage }}"

      - alert: GPUTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature > 85°C on {{ $labels.instance }}"

      - alert: GPUUtilizationLow
        expr: |
          avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Low GPU utilization on {{ $labels.instance }}, consider scale-down"

      - alert: GPUServiceDown
        expr: up{job="vllm"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "vLLM is down on {{ $labels.instance }}"
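The `for:` clause is what keeps these alerts quiet during brief spikes: the expression must stay true across consecutive evaluations for the whole duration before the alert leaves the pending state. A rough model of that behavior, with one sample per minute (this is a sketch, not Prometheus's actual evaluator):

```python
def alert_fires(ratios, threshold=0.95, for_samples=5):
    """Return True if `ratios` (one sample per minute) exceeds `threshold`
    for at least `for_samples` consecutive samples -- a rough model of
    Prometheus `for:` semantics, not its real evaluator."""
    streak = 0
    for r in ratios:
        streak = streak + 1 if r > threshold else 0
        if streak >= for_samples:
            return True
    return False

# One dip below the threshold resets the pending timer
print(alert_fires([0.96, 0.97, 0.93, 0.96, 0.97, 0.98, 0.99]))  # False
print(alert_fires([0.96] * 6))  # True
```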
Grafana dashboard
Key panels for GPU monitoring:
{
  "panels": [
    {
      "title": "GPU Utilization by server",
      "type": "timeseries",
      "targets": [{
        "expr": "DCGM_FI_DEV_GPU_UTIL{job='dcgm'}",
        "legendFormat": "{{instance}} GPU{{gpu}}"
      }]
    },
    {
      "title": "VRAM Usage %",
      "type": "gauge",
      "targets": [{
        "expr": "DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100",
        "legendFormat": "{{instance}} GPU{{gpu}}"
      }],
      "fieldConfig": {
        "thresholds": {
          "steps": [
            {"color": "green", "value": 0},
            {"color": "yellow", "value": 80},
            {"color": "red", "value": 95}
          ]
        }
      }
    },
    {
      "title": "Tensor Core Utilization",
      "type": "timeseries",
      "targets": [{
        "expr": "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
        "legendFormat": "{{instance}} Tensor Cores"
      }]
    }
  ]
}
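The threshold steps on the gauge work like a step function: the gauge takes the color of the highest step whose value the current reading has reached, so 72% renders green, 85% yellow, 96% red. A minimal sketch of that mapping (assuming steps sorted ascending, as in the panel above):

```python
def threshold_color(value, steps):
    """Return the color of the highest step whose threshold <= value
    (a sketch of how Grafana gauge thresholds behave)."""
    color = steps[0]["color"]
    for step in steps:
        if value >= step["value"]:
            color = step["color"]
    return color

steps = [
    {"color": "green", "value": 0},
    {"color": "yellow", "value": 80},
    {"color": "red", "value": 95},
]
print(threshold_color(72.5, steps))  # green
print(threshold_color(96.0, steps))  # red
```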
Python script for GPU health check
import subprocess
import sys

def check_gpu_health() -> bool:
    """Poll nvidia-smi and flag overheating, near-full VRAM, and ECC errors."""
    result = subprocess.run(
        ['nvidia-smi',
         '--query-gpu=index,name,temperature.gpu,memory.used,memory.total,'
         'utilization.gpu,ecc.errors.uncorrected.volatile.total',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True
    )
    unhealthy = []
    for line in result.stdout.strip().split('\n'):
        idx, name, temp, mem_used, mem_total, util, ecc_errors = \
            [field.strip() for field in line.split(',')]
        mem_pct = int(mem_used) / int(mem_total) * 100
        # ECC is reported as "[N/A]" on GPUs without ECC support
        ecc = int(ecc_errors) if ecc_errors.isdigit() else 0
        if float(temp) > 87:
            unhealthy.append(f"GPU{idx}: temperature {temp}°C")
        if mem_pct > 97:
            unhealthy.append(f"GPU{idx}: VRAM {mem_pct:.1f}%")
        if ecc > 0:
            unhealthy.append(f"GPU{idx}: {ecc} uncorrected ECC errors")
    if unhealthy:
        print("UNHEALTHY:", "; ".join(unhealthy))
        return False
    return True

sys.exit(0 if check_gpu_health() else 1)
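Besides exiting non-zero for a cron or systemd timer, the result can be surfaced back into Prometheus through node_exporter's textfile collector by writing a small `.prom` file. A sketch of the formatting step (the metric names `gpu_health_status` and `gpu_health_problems`, and the output path, are my own choices, not a standard):

```python
def health_metric_lines(healthy: bool, problems: list) -> str:
    """Format health-check results in Prometheus text format for the
    node_exporter textfile collector (metric names here are illustrative)."""
    lines = [
        '# HELP gpu_health_status 1 if all GPUs pass the health check, 0 otherwise',
        '# TYPE gpu_health_status gauge',
        f'gpu_health_status {1 if healthy else 0}',
        '# HELP gpu_health_problems Number of problems found by the last check',
        '# TYPE gpu_health_problems gauge',
        f'gpu_health_problems {len(problems)}',
    ]
    return '\n'.join(lines) + '\n'

# Typically written atomically (write to a temp file, then rename) to a path
# like /var/lib/node_exporter/gpu_health.prom that node_exporter scans
print(health_metric_lines(False, ["GPU0: temperature 91°C"]))
```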







