GPU Utilization and VRAM Monitoring Setup

GPU resource monitoring is critical for AI infrastructure: a VRAM OOM crashes the service, while an underutilized GPU wastes money. The full stack: DCGM Exporter + Prometheus + Grafana.

DCGM Exporter for detailed GPU metrics

NVIDIA DCGM (Data Center GPU Manager) is the official tool for collecting GPU metrics. It provides significantly more detail than nvidia-smi:

# Run DCGM Exporter via Docker
docker run -d \
  --gpus all \
  --cap-add SYS_ADMIN \
  -p 9400:9400 \
  --name dcgm-exporter \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04

# Or via docker-compose
services:
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - DCGM_EXPORTER_COLLECTORS=/etc/dcgm-exporter/dcp-metrics-included.csv
    ports:
      - "9400:9400"
    cap_add:
      - SYS_ADMIN
    restart: unless-stopped

DCGM Metrics:

DCGM_FI_DEV_GPU_UTIL          # GPU utilization %
DCGM_FI_DEV_FB_USED           # VRAM used (MiB)
DCGM_FI_DEV_FB_FREE           # VRAM free (MiB)
DCGM_FI_DEV_FB_TOTAL          # VRAM total (MiB)
DCGM_FI_DEV_SM_CLOCK          # SM clock frequency (MHz)
DCGM_FI_DEV_MEM_CLOCK         # Memory clock (MHz)
DCGM_FI_DEV_GPU_TEMP          # Temperature (°C)
DCGM_FI_DEV_POWER_USAGE       # Power draw (W)
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL  # NVLink bandwidth
DCGM_FI_PROF_SM_ACTIVE        # % of SMs active (more accurate than GPU_UTIL)
DCGM_FI_PROF_GR_ENGINE_ACTIVE # Graphics engine activity
DCGM_FI_PROF_DRAM_ACTIVE      # Memory bandwidth utilization
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE  # Tensor core utilization
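Each of these names is exposed in Prometheus exposition format on :9400/metrics. A minimal parsing sketch in Python, for ad-hoc checks without Prometheus (the `parse_dcgm_metrics` helper and the sample payload are illustrative, not part of DCGM):

```python
import re

def parse_dcgm_metrics(text: str) -> dict:
    """Parse Prometheus exposition-format lines like
    DCGM_FI_DEV_FB_USED{gpu="0"} 12288 into {(name, labels): value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):  # skip blanks and HELP/TYPE comments
            continue
        m = re.match(r'(\w+)(\{[^}]*\})?\s+([-\d.eE+]+)$', line)
        if m:
            name, labels, value = m.groups()
            metrics[(name, labels or '')] = float(value)
    return metrics

# Illustrative sample of what the exporter returns
sample = '''# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
DCGM_FI_DEV_FB_USED{gpu="0"} 12288
DCGM_FI_DEV_FB_TOTAL{gpu="0"} 24576
'''
parsed = parse_dcgm_metrics(sample)
used = parsed[('DCGM_FI_DEV_FB_USED', '{gpu="0"}')]
total = parsed[('DCGM_FI_DEV_FB_TOTAL', '{gpu="0"}')]
print(f"VRAM: {used / total * 100:.0f}%")  # VRAM: 50%
```

In practice the text would come from an HTTP GET to the exporter's /metrics endpoint.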

Prometheus configuration

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: dcgm
    static_configs:
      - targets:
          - gpu-server-1:9400
          - gpu-server-2:9400
          - gpu-server-3:9400
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):.*'
        target_label: instance

  - job_name: vllm
    static_configs:
      - targets:
          - gpu-server-1:8000
          - gpu-server-2:8000
    metrics_path: /metrics

# Alerting rules
rule_files:
  - "gpu_alerts.yml"

Alerting rules

# gpu_alerts.yml
groups:
  - name: gpu_alerts
    rules:
      - alert: GPUMemoryNearFull
        expr: |
          (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} на {{ $labels.instance }}: VRAM > 95%"
          description: "VRAM usage: {{ $value | humanizePercentage }}"

      - alert: GPUTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU температура > 85°C на {{ $labels.instance }}"

      - alert: GPUUtilizationLow
        expr: |
          avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Низкая утилизация GPU на {{ $labels.instance }} — рассмотреть scale-down"

      - alert: GPUServiceDown
        expr: up{job="vllm"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "vLLM недоступен на {{ $labels.instance }}"

Grafana dashboard

Key panels for GPU monitoring:

{
  "panels": [
    {
      "title": "GPU Utilization по серверам",
      "type": "timeseries",
      "targets": [{
        "expr": "DCGM_FI_DEV_GPU_UTIL{job='dcgm'}",
        "legendFormat": "{{instance}} GPU{{gpu}}"
      }]
    },
    {
      "title": "VRAM Usage %",
      "type": "gauge",
      "targets": [{
        "expr": "DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100",
        "legendFormat": "{{instance}} GPU{{gpu}}"
      }],
      "fieldConfig": {
        "thresholds": {
          "steps": [
            {"color": "green", "value": 0},
            {"color": "yellow", "value": 80},
            {"color": "red", "value": 95}
          ]
        }
      }
    },
    {
      "title": "Tensor Core Utilization",
      "type": "timeseries",
      "targets": [{
        "expr": "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
        "legendFormat": "{{instance}} Tensor Cores"
      }]
    }
  ]
}

Python script for GPU health check

import subprocess
import sys

def check_gpu_health() -> bool:
    result = subprocess.run(
        ['nvidia-smi',
         '--query-gpu=index,name,temperature.gpu,memory.used,memory.total,'
         'utilization.gpu,ecc.errors.uncorrected.volatile.total',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        print("UNHEALTHY: nvidia-smi failed:", result.stderr.strip())
        return False

    unhealthy = []
    for line in result.stdout.strip().split('\n'):
        idx, name, temp, mem_used, mem_total, util, ecc_errors = \
            [field.strip() for field in line.split(',')]
        mem_pct = int(mem_used) / int(mem_total) * 100
        # GeForce cards report "[N/A]" for ECC counters
        ecc = int(ecc_errors) if ecc_errors.isdigit() else 0

        if float(temp) > 87:
            unhealthy.append(f"GPU{idx}: temperature {temp}°C")
        if mem_pct > 97:
            unhealthy.append(f"GPU{idx}: VRAM {mem_pct:.1f}%")
        if ecc > 0:
            unhealthy.append(f"GPU{idx}: {ecc} uncorrected ECC errors")

    if unhealthy:
        print("UNHEALTHY:", "; ".join(unhealthy))
        return False
    return True

if __name__ == '__main__':
    sys.exit(0 if check_gpu_health() else 1)
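The health check above can feed its result back into the same Prometheus setup via node_exporter's textfile collector. A hedged sketch: the metric name `gpu_health_status` and the textfile directory are assumptions, not part of the stack described above.

```python
import os
import tempfile

def write_health_metric(healthy: bool,
                        path: str = '/var/lib/node_exporter/gpu_health.prom') -> str:
    """Write a 0/1 gauge in Prometheus textfile-collector format.
    Writes to a temp file and renames it, so node_exporter never
    reads a half-written file."""
    content = (
        '# HELP gpu_health_status 1 if all GPUs pass the health check.\n'
        '# TYPE gpu_health_status gauge\n'
        f'gpu_health_status {1 if healthy else 0}\n'
    )
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, 'w') as f:
        f.write(content)
    os.replace(tmp, path)  # atomic on POSIX
    return content
```

Run the check from cron, call `write_health_metric(check_gpu_health())`, and an `gpu_health_status == 0` alert rule covers hosts where DCGM itself is not deployed.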