Deploying LLM on a dedicated GPU server
A dedicated GPU server, either on-premise or leased bare metal, provides predictable performance, no cold starts, and complete data control. It is ideal for high-load production LLMs with data-residency requirements.
GPU Selection and Resource Planning
7B models (Llama-3-8B, Mistral-7B):
- BF16: 16 GB VRAM → RTX 4080/4090, A10G, L4
- 4-bit AWQ/GPTQ: 6-8 GB VRAM → RTX 3080/4070
13B models (Llama-2-13B):
- BF16: 28 GB → A100 40GB / A40; fits 24 GB cards (RTX 4090, A30) with INT8
- 4-bit: 8-10 GB → RTX 3080+
70B models (Llama-3-70B, Qwen-72B):
- BF16: 140 GB → 2xA100 80GB or 4xA40 48GB
- 4-bit: 40 GB → A100 40GB or 2xA40
Mixtral-8x7B (MoE):
- BF16: 90 GB → 2xA100 80GB (all ~47B parameters must sit in VRAM, even though only ~13B are active per token)
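The sizes above follow a simple rule of thumb: weights need roughly params × bits/8 bytes, plus a KV cache that grows with context length and batch size. A quick sketch (the Llama-3-8B dimensions used below, 32 layers, 8 KV heads via GQA, head dim 128, are assumptions to check against your model's config.json):

```python
def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate VRAM for model weights alone: 1e9 params * bits/8 bytes ~ GB."""
    return params_billion * bits_per_param / 8

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 tensors (K and V) * layers * kv_heads * head_dim per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Llama-3-8B in BF16: ~16 GB of weights ...
print(weight_vram_gb(8, 16))                           # → 16.0
# ... plus ~1 GB of KV cache per 8k-token sequence (GQA: only 8 KV heads)
print(round(kv_cache_gb(32, 8, 128, 8192, batch=1), 1))  # → 1.1
# Llama-3-70B at 4-bit: weights alone
print(weight_vram_gb(70, 4))                           # → 35.0
```

The KV cache and CUDA overhead are why you should leave headroom on top of the weight size (and why vLLM's `--gpu-memory-utilization` below is set under 1.0).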
Server configuration
# Install the driver and the CUDA 12.1 toolkit (Ubuntu 22.04)
apt-get install -y nvidia-driver-545
apt-get install -y cuda-toolkit-12-1
# For cuDNN, follow NVIDIA's official documentation for your OS
# Verify the GPU and toolchain
nvidia-smi
nvcc --version
python3 -c "import torch; print(torch.cuda.get_device_name(0))"
Deploying vLLM as a systemd service
# /etc/systemd/system/vllm-llama.service
[Unit]
Description=vLLM LLaMA-3-8B Inference Server
After=network.target
[Service]
Type=simple
User=mlserving
WorkingDirectory=/opt/vllm
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="HF_TOKEN=hf_xxx"
ExecStart=/opt/vllm/venv/bin/python -m vllm.entrypoints.openai.api_server \
--model /data/models/llama-3-8b-instruct \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--max-num-seqs 128 \
--gpu-memory-utilization 0.92 \
--host 127.0.0.1 \
--port 8000 \
--log-level info
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl enable vllm-llama
systemctl start vllm-llama
journalctl -u vllm-llama -f
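vLLM's api_server speaks the OpenAI-compatible API, so once the unit is running it can be queried with plain HTTP. A minimal client sketch using only the standard library (the model name must match what the server reports at /v1/models; here it is the served path from the unit file):

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Request body for the OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def chat(endpoint: str, model: str, prompt: str) -> str:
    """Send one chat request and return the assistant's reply."""
    req = urllib.request.Request(
        f"{endpoint}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

# Example (requires the service to be running):
# print(chat("http://127.0.0.1:8000", "/data/models/llama-3-8b-instruct", "Say hello"))
```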
Nginx as a reverse proxy with rate limiting
# /etc/nginx/sites-available/vllm
upstream vllm_backend {
server 127.0.0.1:8000;
keepalive 100;
}
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=60r/m;
server {
listen 443 ssl http2;
server_name llm.company.internal;
ssl_certificate /etc/nginx/ssl/cert.pem;
ssl_certificate_key /etc/nginx/ssl/key.pem;
location /v1/ {
limit_req zone=api_limit burst=20 nodelay;
proxy_pass http://vllm_backend;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_read_timeout 300s;
# For streaming responses
proxy_buffering off;
proxy_cache off;
chunked_transfer_encoding on;
}
location /health {
proxy_pass http://vllm_backend/health;
}
}
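limit_req implements a leaky-bucket counter: with rate=60r/m (one request per second) and burst=20 nodelay, about 21 back-to-back requests pass before nginx starts rejecting, and capacity drains back at 1 req/s. A simplified simulation of that accounting (an approximation of nginx's fixed-point algorithm, not a reimplementation):

```python
def simulate_limit_req(arrival_times, rate_per_sec=1.0, burst=20):
    """Approximate nginx limit_req with nodelay: track 'excess' requests,
    draining at rate_per_sec; reject when the bucket is already full."""
    accepted, excess, last = [], 0.0, None
    for t in arrival_times:
        if last is not None:
            excess = max(excess - (t - last) * rate_per_sec, 0.0)
        last = t
        if excess > burst:      # bucket full: nginx returns an error for this request
            accepted.append(False)
        else:
            excess += 1         # this request occupies a slot
            accepted.append(True)
    return accepted

# 30 requests within the same second: 21 pass (1 + burst of 20), 9 are rejected
print(sum(simulate_limit_req([i * 0.001 for i in range(30)])))  # → 21
```

At a steady pace at or below the configured rate, the bucket drains as fast as it fills and nothing is ever rejected.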
GPU and service monitoring
# docker-compose.monitoring.yml
services:
prometheus:
image: prom/prometheus:latest
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
grafana:
image: grafana/grafana:latest
ports: ["3000:3000"]
nvidia-smi-exporter:
image: utkuozdemir/nvidia_gpu_exporter:latest
volumes:
- /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
devices:
- /dev/nvidiactl
- /dev/nvidia0
- /dev/nvidia1
ports: ["9835:9835"]
Alerts: GPU temperature > 85°C, VRAM utilization > 95% (OOM risk), service unavailable for > 30 sec.
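Those thresholds translate directly into Prometheus alerting rules. A sketch (metric names follow utkuozdemir/nvidia_gpu_exporter and the job label is an assumption; verify both against your exporter's /metrics output and prometheus.yml):

```yaml
# alert_rules.yml (illustrative)
groups:
  - name: gpu
    rules:
      - alert: GpuTemperatureHigh
        expr: nvidia_smi_temperature_gpu > 85
        for: 2m
      - alert: GpuMemoryNearFull        # OOM risk
        expr: nvidia_smi_memory_used_bytes / nvidia_smi_memory_total_bytes > 0.95
        for: 1m
      - alert: VllmDown
        expr: up{job="vllm"} == 0
        for: 30s
```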
Automatic restart on OOM
vLLM and TGI occasionally crash with CUDA OOM (memory spikes while processing long sequences). Combine systemd's Restart=always with a health-check watchdog:
# /opt/vllm/watchdog.sh
#!/bin/bash
# Restart vLLM when the health endpoint stops responding.
while true; do
  if ! curl -sf --max-time 10 http://127.0.0.1:8000/health > /dev/null; then
    systemctl restart vllm-llama
    echo "$(date) - vLLM restarted due to health check failure" >> /var/log/vllm-watchdog.log
    sleep 180  # grace period: give the model time to reload, or the watchdog loops
  fi
  sleep 30
done
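The watchdog itself can run as a small systemd unit so it survives reboots (unit name and paths below are assumptions; remember `chmod +x /opt/vllm/watchdog.sh`):

```ini
# /etc/systemd/system/vllm-watchdog.service
[Unit]
Description=vLLM health watchdog
After=vllm-llama.service

[Service]
ExecStart=/opt/vllm/watchdog.sh
Restart=always

[Install]
WantedBy=multi-user.target
```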
Model update without downtime
# Start the new version on a different port
systemctl start vllm-llama-v2  # port 8001
# Test the new version
python test_model_quality.py --endpoint http://127.0.0.1:8001
# Switch the nginx upstream
sed -i 's/server 127.0.0.1:8000/server 127.0.0.1:8001/' /etc/nginx/sites-available/vllm
nginx -t && nginx -s reload
# Stop the old version
systemctl stop vllm-llama
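The test_model_quality.py script referenced above is not reproduced in this guide; a minimal version could look like the sketch below (the prompts and keyword checks are illustrative placeholders, and the endpoint is assumed to speak the OpenAI-compatible API):

```python
import json
import urllib.request

SMOKE_CASES = [  # (prompt, keywords a sane answer should contain)
    ("What is the capital of France?", ["paris"]),
    ("Write the word 'ready' and nothing else.", ["ready"]),
]

def keyword_score(answer: str, keywords: list) -> float:
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    answer = answer.lower()
    return sum(k.lower() in answer for k in keywords) / len(keywords)

def run(endpoint: str, model: str = "/data/models/llama-3-8b-instruct") -> bool:
    """Return True only if every smoke case passes; model must match the served name."""
    for prompt, keywords in SMOKE_CASES:
        body = json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 64,
        }).encode()
        req = urllib.request.Request(f"{endpoint}/v1/chat/completions",
                                     data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req, timeout=60) as resp:
            answer = json.loads(resp.read())["choices"][0]["message"]["content"]
        if keyword_score(answer, keywords) < 1.0:
            return False
    return True

# Usage against the candidate before flipping nginx:
# assert run("http://127.0.0.1:8001")
```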