LLM Deployment on Dedicated GPU Server


Deploying LLM on a dedicated GPU server

A dedicated GPU server, either on-premise or leased bare metal, provides predictable performance, no cold starts, and complete control over your data. It is ideal for high-load production LLM workloads with data-residency requirements.

GPU Selection and Resource Planning

7-8B models (Llama-3-8B, Mistral-7B):

  • BF16: 16 GB VRAM → RTX 4080/4090, A10G, L4
  • 4-bit AWQ/GPTQ: 6-8 GB VRAM → RTX 3080/4070

13B models (Llama-2-13B):

  • BF16: 28 GB; fits 24 GB cards (A30, RTX 4090) only with INT8
  • 4-bit: 8-10 GB → RTX 3080+

70B models (Llama-3-70B, Qwen-72B):

  • BF16: 140 GB → 2xA100 80GB or 4xA40 48GB
  • 4-bit: 40 GB → A100 40GB or 2xA40

Mixtral-8x7B (MoE):

  • BF16: 90 GB → 2xA100 80GB (only ~13B parameters are active per token, but all ~47B weights must be resident in VRAM)
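The figures above follow from a simple rule of thumb: weight memory is parameter count times bytes per parameter, plus a KV cache that grows with context length and concurrency. A rough estimator (a sketch; real usage also includes framework overhead and activation buffers, and the Llama-3-8B shape constants below are assumptions to verify against the model config):

```python
def weight_vram_gb(params_b: float, bits: int) -> float:
    """Approximate VRAM for weights alone: parameter count x bytes per parameter."""
    return params_b * bits / 8  # params in billions -> result in GB

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per: int = 2) -> float:
    """KV cache: 2 tensors (K and V) per layer, per KV head, per cached token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per / 1e9

# Llama-3-8B in BF16: weights alone
print(weight_vram_gb(8, 16))   # -> 16.0

# Same model, 4-bit quantized
print(weight_vram_gb(8, 4))    # -> 4.0

# KV cache for an 8192-token context and 16 concurrent sequences
# (assuming Llama-3-8B: 32 layers, 8 KV heads with GQA, head_dim 128)
print(round(kv_cache_gb(32, 8, 128, 8192, 16), 1))  # -> 17.2
```

This is why a 16 GB card that technically holds the BF16 weights still cannot serve long contexts at high concurrency: the KV cache quickly rivals the weights themselves.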

Server configuration

# Check the GPU and driver
nvidia-smi

# Install CUDA 12.1 + cuDNN
# Follow NVIDIA's official documentation for your OS

# Install the driver and toolkit (Ubuntu 22.04)
apt-get install -y nvidia-driver-545
apt-get install -y cuda-toolkit-12-1

# Verify
nvcc --version
python3 -c "import torch; print(torch.cuda.get_device_name(0))"  # requires PyTorch with CUDA

Deploying vLLM as a systemd service

# /etc/systemd/system/vllm-llama.service
[Unit]
Description=vLLM LLaMA-3-8B Inference Server
After=network.target

[Service]
Type=simple
User=mlserving
WorkingDirectory=/opt/vllm
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="HF_TOKEN=hf_xxx"
ExecStart=/opt/vllm/venv/bin/python -m vllm.entrypoints.openai.api_server \
    --model /data/models/llama-3-8b-instruct \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --max-num-seqs 128 \
    --gpu-memory-utilization 0.92 \
    --host 127.0.0.1 \
    --port 8000 \
    --log-level info
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

# Enable and start the service
systemctl daemon-reload
systemctl enable vllm-llama
systemctl start vllm-llama
journalctl -u vllm-llama -f
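The service exposes an OpenAI-compatible API on port 8000 (directly, or behind the nginx proxy described below). A minimal client sketch using only the standard library; note that vLLM serves the model under the name passed to --model unless overridden, so the model string must match:

```python
import json
from urllib import request

MODEL = "/data/models/llama-3-8b-instruct"  # must match the --model value above

def build_payload(prompt: str, model: str = MODEL) -> dict:
    """OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.2,
    }

def chat(prompt: str, base_url: str = "http://127.0.0.1:8000/v1") -> dict:
    """POST the payload to the vLLM OpenAI-compatible endpoint."""
    req = request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())
```

Because the API is OpenAI-compatible, any OpenAI SDK client also works by pointing its base URL at the server.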

Nginx as a reverse proxy with rate limiting

# /etc/nginx/sites-available/vllm
upstream vllm_backend {
    server 127.0.0.1:8000;
    keepalive 100;
}

limit_req_zone $binary_remote_addr zone=api_limit:10m rate=60r/m;

server {
    listen 443 ssl http2;
    server_name llm.company.internal;

    ssl_certificate /etc/nginx/ssl/cert.pem;
    ssl_certificate_key /etc/nginx/ssl/key.pem;

    location /v1/ {
        limit_req zone=api_limit burst=20 nodelay;

        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;

        # For streaming responses
        proxy_buffering off;
        proxy_cache off;
        chunked_transfer_encoding on;
    }

    location /health {
        proxy_pass http://vllm_backend/health;
    }
}
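The limit_req settings above allow a sustained 1 request per second (60r/m), with nodelay admitting a burst immediately instead of queueing it. Under the hood nginx uses leaky-bucket accounting per client address; a simplified sketch of that accounting (an approximation, not nginx's exact millirequest arithmetic):

```python
class LeakyBucket:
    """Approximation of nginx limit_req: accumulated excess drains at `rate`
    requests/second; a request is rejected once excess would exceed `burst`."""
    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.excess = 0.0
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Drain the accumulated excess at the configured rate
        self.excess = max(0.0, self.excess - (now - self.last) * self.rate)
        self.last = now
        if self.excess + 1 > self.burst:
            return False  # nginx would answer 503 here
        self.excess += 1
        return True

bucket = LeakyBucket(rate=1.0, burst=20)  # mirrors rate=60r/m, burst=20
results = [bucket.allow(0.0) for _ in range(25)]
print(results.count(True))  # -> 20 simultaneous requests admitted, the rest rejected

print(bucket.allow(5.0))    # -> True: 5 s of drain frees room for one more
```

The practical takeaway: burst sets how ragged client traffic may be, while rate caps the steady-state load reaching the GPU.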

GPU and service monitoring

# docker-compose.monitoring.yml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]

  nvidia-smi-exporter:
    image: utkuozdemir/nvidia_gpu_exporter:latest
    volumes:
      - /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
    devices:
      - /dev/nvidiactl
      - /dev/nvidia0
      - /dev/nvidia1
    ports: ["9835:9835"]

Alerts: GPU temperature > 85°C, VRAM utilization > 95% (OOM risk), service unavailable for > 30 sec.
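These thresholds can be encoded as Prometheus alerting rules. A sketch assuming the metric names exposed by nvidia_gpu_exporter and a scrape job named vllm; both are assumptions to verify against your exporter version and prometheus.yml:

# alert_rules.yml (sketch -- verify metric and job names)
groups:
  - name: llm-serving
    rules:
      - alert: GPUOverheating
        expr: nvidia_smi_temperature_gpu > 85
        for: 2m
      - alert: GPUMemoryNearOOM
        expr: nvidia_smi_memory_used_bytes / nvidia_smi_memory_total_bytes > 0.95
        for: 1m
      - alert: VLLMDown
        expr: up{job="vllm"} == 0
        for: 30s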

Automatic restart on OOM

vLLM and TGI can occasionally crash with a CUDA OOM (memory spikes while processing long prompts). Combine systemd's Restart=always with an external health-check watchdog:

# /opt/vllm/watchdog.sh
#!/bin/bash
while true; do
    if ! curl -sf http://127.0.0.1:8000/health > /dev/null; then
        systemctl restart vllm-llama
        echo "$(date) - vLLM restarted due to health check failure" >> /var/log/vllm-watchdog.log
    fi
    sleep 30
done
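Rather than launching this loop by hand, the watchdog itself can be supervised so it survives reboots. A sketch of a companion unit (the unit name vllm-watchdog.service is hypothetical):

# /etc/systemd/system/vllm-watchdog.service
[Unit]
Description=vLLM health-check watchdog
After=vllm-llama.service

[Service]
ExecStart=/opt/vllm/watchdog.sh
Restart=always

[Install]
WantedBy=multi-user.target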

Model update without downtime

# Start the new version on a different port
systemctl start vllm-llama-v2  # port 8001

# Test the new version
python test_model_quality.py --endpoint http://127.0.0.1:8001

# Switch the nginx upstream
sed -i 's/server 127.0.0.1:8000/server 127.0.0.1:8001/' /etc/nginx/sites-available/vllm
nginx -s reload

# Stop the old version
systemctl stop vllm-llama
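What the quality gate checks depends on your test script; one simple, scriptable gate is asserting that the candidate endpoint returns expected substrings for a set of canary prompts. A hypothetical sketch of such a check (only the comparison logic is shown; fetching the responses from the new endpoint is left to the caller):

```python
def passes_canaries(responses: dict, canaries: dict) -> bool:
    """responses: prompt -> model output; canaries: prompt -> required substring.
    Returns True only if every canary substring appears (case-insensitive)."""
    failures = [p for p, must in canaries.items()
                if must.lower() not in responses.get(p, "").lower()]
    for p in failures:
        print(f"FAIL: {p!r} missing expected substring {canaries[p]!r}")
    return not failures

canaries = {"What is 2+2?": "4", "Capital of France?": "Paris"}
ok = passes_canaries(
    {"What is 2+2?": "The answer is 4.",
     "Capital of France?": "Paris is the capital of France."},
    canaries,
)
print(ok)  # -> True
```

Only switch the nginx upstream once the gate passes; if it fails, the old version on port 8000 is still serving traffic untouched.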