Setting up GPU autoscaling for AI workloads
GPU autoscaling is the automatic addition and removal of GPU instances based on load. It is more complex than CPU autoscaling: GPU instances are far more expensive, cold starts (model loading) take 3–10 minutes, and a GPU typically cannot be shared between services without MIG or time-slicing.
Specifics of GPU autoscaling
Cold start issue: It takes 3-10 minutes for a new GPU pod to launch. During this time, the request queue may overflow. Solutions:
- Keepalive instance (at least 1 pod is always running)
- Pre-warming: preventive start when the load increases to a threshold value
- Request queuing: buffering requests during scaling
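The request-queuing option can be sketched as a bounded buffer in front of the model server. Everything here is illustrative: the class name, the limits, and the return conventions are ours, not part of vLLM or Kubernetes.

```python
import time
from collections import deque
from typing import Optional

class RequestBuffer:
    """Bounded FIFO buffer that absorbs requests while new GPU pods warm up.

    Requests beyond `max_size` are rejected immediately (fail fast rather
    than queue forever); buffered requests older than `max_wait_s` are
    dropped on dequeue.
    """

    def __init__(self, max_size: int = 100, max_wait_s: float = 300.0):
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self._queue: deque = deque()

    def enqueue(self, request_id: str) -> bool:
        # Reject when full: a fast 429 beats an unbounded queue.
        if len(self._queue) >= self.max_size:
            return False
        self._queue.append((request_id, time.monotonic()))
        return True

    def dequeue(self) -> Optional[str]:
        # Skip requests that have already waited past their deadline.
        while self._queue:
            request_id, enqueued_at = self._queue.popleft()
            if time.monotonic() - enqueued_at <= self.max_wait_s:
                return request_id
        return None

    def depth(self) -> int:
        return len(self._queue)
```

The `depth()` value is exactly the pending-request count that the autoscaling metrics later in this section report.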
GPU utilization vs. request queue: GPU utilization is a poor metric for LLM scaling. While processing a long request, the GPU is 100% utilized, but new requests are pending. The correct metric is queue depth or pending requests.
Scale-to-zero: Complete shutdown when there is no traffic. Suitable for batch workloads and dev/staging, but dangerous for production due to cold start.
Kubernetes HPA with custom metrics
# Requires the Prometheus Adapter so custom metrics reach the HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-autoscaler
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3
  minReplicas: 1
  maxReplicas: 8
  metrics:
  # Primary metric: pending-request queue depth
  - type: Pods
    pods:
      metric:
        name: vllm_pending_requests
      target:
        type: AverageValue
        averageValue: "5"   # scale up above 5 queued requests per pod
  # Secondary: GPU utilization (for scale-down)
  - type: Pods
    pods:
      metric:
        name: nvidia_gpu_duty_cycle
      target:
        type: AverageValue
        averageValue: "70"  # scale down below 70% utilization
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # fast scale-up
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60              # +2 pods per minute
    scaleDown:
      stabilizationWindowSeconds: 600  # slow scale-down (10 minutes)
      policies:
      - type: Pods
        value: 1
        periodSeconds: 300             # -1 pod every 5 minutes
KEDA for event-driven autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) supports scaling via Prometheus, Kafka, RabbitMQ, and SQS:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-keda-scaler
  namespace: ai-serving
spec:
  scaleTargetRef:
    name: vllm-llama3
  minReplicaCount: 1
  maxReplicaCount: 10
  cooldownPeriod: 300
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: vllm_queue_size
      query: sum(vllm:num_requests_waiting{namespace="ai-serving"})
      threshold: "10"  # one replica per 10 waiting requests
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: request_rate
      query: sum(rate(http_requests_total{job="vllm"}[2m]))
      threshold: "20"  # additional RPS-based trigger
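KEDA translates each trigger into an HPA external metric, and the HPA then sizes the deployment roughly as ceil(metric / threshold), clamped to the replica bounds. A small sketch of that arithmetic (the function name is ours):

```python
import math

def desired_replicas(metric_value: float, threshold: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate the HPA sizing rule KEDA relies on:
    desired = ceil(metric / threshold), clamped to [min, max]."""
    desired = math.ceil(metric_value / threshold)
    return max(min_replicas, min(max_replicas, desired))
```

With 35 waiting requests and the threshold of 10 above, this yields 4 replicas.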
Cloud-native autoscaling
AWS Auto Scaling Group for GPU instances:
import boto3

autoscaling = boto3.client('autoscaling', region_name='us-east-1')

# Create a target-tracking scaling policy on the GPU ASG
autoscaling.put_scaling_policy(
    AutoScalingGroupName='llm-gpu-asg',
    PolicyName='scale-on-queue-depth',
    PolicyType='TargetTrackingScaling',
    TargetTrackingConfiguration={
        'CustomizedMetricSpecification': {
            'MetricName': 'LLMQueueDepth',
            'Namespace': 'Custom/LLMMetrics',
            'Statistic': 'Average',
        },
        'TargetValue': 5.0,  # keep ~5 queued requests per instance
        'DisableScaleIn': False,
    },
    # EC2 target tracking has no per-direction cooldowns; instead, give
    # new GPU instances time to pull the model before they count
    EstimatedInstanceWarmup=600,
)
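For the target-tracking policy to act, something must actually publish LLMQueueDepth into the Custom/LLMMetrics namespace. A sketch using the standard boto3 `put_metric_data` call; the client is injected as a parameter (our choice, for testability without AWS credentials):

```python
def publish_queue_depth(cloudwatch, queue_depth: float) -> None:
    """Publish the current queue depth to CloudWatch.

    `cloudwatch` is a boto3 CloudWatch client, e.g.
    boto3.client('cloudwatch', region_name='us-east-1').
    """
    cloudwatch.put_metric_data(
        Namespace='Custom/LLMMetrics',       # must match the scaling policy
        MetricData=[{
            'MetricName': 'LLMQueueDepth',   # must match the policy metric
            'Value': queue_depth,
            'Unit': 'Count',
        }],
    )
```

In production this would run roughly once a minute from the serving host or a sidecar.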
Publishing custom metrics from vLLM
import subprocess
import time

import requests
from prometheus_client import Gauge, start_http_server

QUEUE_SIZE = Gauge('llm_queue_depth', 'Number of pending requests')
GPU_MEMORY = Gauge('llm_gpu_memory_used_gb', 'GPU memory usage in GB', ['gpu_id'])

def collect_metrics():
    # vLLM metrics: scrape the built-in /metrics endpoint
    response = requests.get("http://localhost:8000/metrics", timeout=5).text
    for line in response.split('\n'):
        if 'vllm:num_requests_waiting' in line and not line.startswith('#'):
            queue_size = float(line.split()[-1])
            QUEUE_SIZE.set(queue_size)

    # nvidia-smi metrics: per-GPU memory usage
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'],
        capture_output=True, text=True,
    )
    for i, mem_mb in enumerate(result.stdout.strip().split('\n')):
        GPU_MEMORY.labels(gpu_id=str(i)).set(float(mem_mb) / 1024)

if __name__ == "__main__":
    start_http_server(9091)
    while True:
        collect_metrics()
        time.sleep(15)
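The substring scan in collect_metrics can mismatch when the target name appears inside another metric's labels. A slightly more careful parser for the Prometheus text format (the helper name is ours):

```python
from typing import Optional

def parse_prometheus_metric(text: str, name: str) -> Optional[float]:
    """Return the first sample value for `name` from Prometheus
    text-format output, or None if the metric is absent. Requires the
    metric name to be followed by a label set or a space, so a name that
    merely appears inside another line's labels is not picked up."""
    for line in text.splitlines():
        if line.startswith('#'):
            continue  # HELP/TYPE comment lines
        if line.startswith(name) and line[len(name):len(name) + 1] in ('{', ' '):
            return float(line.rsplit(' ', 1)[-1])
    return None
```

For long-term use, the official `prometheus_client.parser.text_string_to_metric_families` helper is the more robust option.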
Pre-warming strategy
class PreWarmingStrategy:
    def __init__(self, warmup_threshold: float = 0.7, warmup_lead_time: int = 180):
        self.warmup_threshold = warmup_threshold  # fraction of max queue (70%)
        self.warmup_lead_time = warmup_lead_time  # start warming 3 minutes ahead

    def should_scale_up(self, current_queue: int, max_queue: int,
                        forecast: list) -> bool:
        # Immediate scale-up if the queue is near capacity
        if current_queue / max_queue >= self.warmup_threshold:
            return True
        # Pre-warming: the forecast (one point per 15 s scrape interval)
        # predicts the queue will cross the threshold in ~3 minutes
        idx = self.warmup_lead_time // 15
        if idx >= len(forecast):
            return False
        return forecast[idx] / max_queue >= self.warmup_threshold
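should_scale_up expects a forecast list with one point per scrape interval. A deliberately naive way to produce one, linear extrapolation from the last two queue samples (a production system would use a seasonal or exponential-smoothing model instead):

```python
from typing import List

def linear_forecast(history: List[float], horizon_steps: int) -> List[float]:
    """Extrapolate queue depth `horizon_steps` points ahead (one point
    per scrape interval, e.g. 15 s) from the slope of the last two
    samples. Illustrative only."""
    if len(history) < 2:
        last = history[-1] if history else 0.0
        return [last] * horizon_steps
    slope = history[-1] - history[-2]
    last = history[-1]
    # Queue depth cannot go negative
    return [max(0.0, last + slope * (i + 1)) for i in range(horizon_steps)]
```

With samples [10, 14] the queue is growing by 4 per interval, so the forecast keeps climbing and can trip the pre-warming threshold well before the live queue does.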
Implementation timeframes
Week 1: Setting up metrics (vLLM + DCGM exporter), Prometheus, basic HPA
Week 2: KEDA, threshold tuning, pre-warming logic
Week 3–4: Load testing, calibration, scaling policies, documentation
Month 2: Cost monitoring, spot/preemptible integration, multi-region failover