Deploy LLM on Yandex Cloud
Yandex Cloud is the leading Russian cloud provider, offering GPU instances, a managed ML platform, and its own LLM family (YandexGPT). It is the natural choice for Russian companies with data-residency and import-substitution requirements.
GPU instances in Yandex Cloud
GPU platforms include Tesla V100 32GB (gpu-standard-v1/v2) and A100 80GB (gpu-standard-v3) configurations:
# Create a VM with a GPU via the YC CLI
yc compute instance create \
--name llm-server \
--zone ru-central1-a \
--platform gpu-standard-v3 \
--gpus 1 \
--memory 48GB \
--cores 14 \
--core-fraction 100 \
--image-family ubuntu-2204-lts-gpu \
--image-folder-id standard-images \
--disk-type network-ssd \
--disk-size 300GB \
--network-interface subnet-name=default,nat-ip-version=ipv4 \
--ssh-key ~/.ssh/id_rsa.pub
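After creation, the VM's public address can be pulled from `yc compute instance get llm-server --format json`. A minimal parsing sketch, assuming the snake_case JSON layout the CLI emits (field names here are an assumption worth checking against your CLI version):

```python
import json

def nat_ip(instance_json: str) -> str:
    """Extract the one-to-one NAT (public) IPv4 address from
    `yc compute instance get --format json` output."""
    instance = json.loads(instance_json)
    iface = instance["network_interfaces"][0]
    # Assumed field names; verify against your yc output.
    return iface["primary_v4_address"]["one_to_one_nat"]["address"]

# Example with a trimmed-down instance document:
sample = json.dumps({
    "network_interfaces": [{
        "primary_v4_address": {
            "address": "10.0.0.10",
            "one_to_one_nat": {"address": "51.250.1.2"},
        }
    }]
})
print(nat_ip(sample))  # 51.250.1.2
```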
vLLM configuration on YC VM
# Install after SSHing into the VM
sudo apt-get update && sudo apt-get install -y python3-pip
pip install vllm
# Download the model from Yandex Object Storage (S3-compatible)
aws s3 sync s3://my-bucket/models/mistral-7b/ /data/models/mistral-7b/ \
--endpoint-url https://storage.yandexcloud.net \
--profile yandex
# Start the server
python -m vllm.entrypoints.openai.api_server \
--model /data/models/mistral-7b/ \
--tensor-parallel-size 1 \
--max-model-len 8192 \
--max-num-seqs 128 \
--port 8000 \
--host 0.0.0.0
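Once vLLM is up, any OpenAI-compatible client works against it. A stdlib-only sketch of a chat call (the model path matches the server flags above; the host is a placeholder for your VM's address):

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "/data/models/mistral-7b/") -> dict:
    """Request body for vLLM's OpenAI-compatible /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """Send a chat completion request and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```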
Yandex Object Storage for models
import os

import boto3

# Yandex Object Storage is compatible with the S3 API
s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.yandexcloud.net",
    aws_access_key_id=os.getenv("YC_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("YC_SECRET_KEY"),
    region_name="ru-central1",
)

# Upload the model files
model_files = os.listdir("/local/models")
for file in model_files:
    s3.upload_file(
        Filename=f"/local/models/{file}",
        Bucket="llm-models-bucket",
        Key=f"mistral-7b/{file}",
        ExtraArgs={"StorageClass": "COLD"},  # for rarely used versions
    )
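For single-part uploads, the object's S3 ETag equals the hex MD5 of its contents, which gives a cheap post-upload integrity check (this does not hold for multipart uploads; a sketch under that assumption):

```python
import hashlib

def local_etag(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 of a local file, matching the S3 ETag for single-part uploads."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

# After s3.upload_file(...), compare against the stored object:
# remote = s3.head_object(Bucket="llm-models-bucket",
#                         Key="mistral-7b/config.json")["ETag"].strip('"')
# assert remote == local_etag("/local/models/config.json")
```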
YandexGPT API
To use Yandex's own models:
import requests

def call_yandexgpt(prompt: str, folder_id: str, api_key: str) -> str:
    url = "https://llm.api.cloud.yandex.net/foundationModels/v1/completion"
    payload = {
        "modelUri": f"gpt://{folder_id}/yandexgpt-lite/latest",
        "completionOptions": {
            "stream": False,
            "temperature": 0.6,
            "maxTokens": 2000,
        },
        "messages": [
            {"role": "system", "text": "You are a helpful assistant."},
            {"role": "user", "text": prompt},
        ],
    }
    response = requests.post(
        url,
        headers={
            "Authorization": f"Api-Key {api_key}",
            "x-folder-id": folder_id,
        },
        json=payload,
    )
    response.raise_for_status()
    return response.json()["result"]["alternatives"][0]["message"]["text"]
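The completion endpoint can throttle under load, so production callers usually retry. A generic retry-with-backoff wrapper (a hypothetical helper, not part of any Yandex SDK) keeps the calling code clean:

```python
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 1.0,
                 retriable=(Exception,), sleep=time.sleep):
    """Call fn(), retrying on retriable errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))

# Usage:
# with_retries(lambda: call_yandexgpt(prompt, folder_id, api_key))
```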
Yandex DataSphere for ML development
DataSphere — a managed Jupyter environment with GPUs on demand:
# In a DataSphere notebook
#!g1.1  # directive requesting a V100 configuration
import torch
print(torch.cuda.get_device_name(0))  # Tesla V100-SXM2-32GB

# Train or fine-tune a model
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    fp16=True,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
)
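With the arguments above, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps = 8 × 4 = 32 per GPU. A quick sanity check for planning optimizer steps per epoch:

```python
def steps_per_epoch(dataset_size: int, per_device_batch: int,
                    grad_accum: int, n_gpus: int = 1) -> int:
    """Optimizer steps per epoch given the effective batch size."""
    effective = per_device_batch * grad_accum * n_gpus
    return -(-dataset_size // effective)  # ceiling division

print(steps_per_epoch(10_000, 8, 4))  # 313 optimizer steps per epoch
```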
Load balancing via Application Load Balancer
# Create a target group from the GPU VMs
yc alb target-group create llm-targets \
--target subnet-name=default,ip-address=10.0.0.10 \
--target subnet-name=default,ip-address=10.0.0.11
# Backend group
yc alb backend-group create llm-backends \
--http-backend name=vllm-backend,port=8000,target-group-id=xxx,healthcheck-path=/health
# HTTP router
yc alb http-router create llm-router \
--virtual-host name=llm,authority=llm.company.ru \
--route name=api,path-prefix=/v1,backend-group-id=xxx
Monitoring via Yandex Monitoring
Built-in integration: VM metrics (CPU, memory, and GPU utilization via the DCGM exporter) appear in Yandex Monitoring automatically. Custom metrics can be shipped with Unified Agent:
# /etc/yandex-unified-agent/config.yml
routes:
  - input:
      plugin: prometheus_puller
      config:
        url: http://localhost:8000/metrics
        pull_period: 15s
    output:
      plugin: yc_metrics
      config:
        folder_id: xxx
        iam_token_file: /etc/yandex-unified-agent/iam_token
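Beyond scraped Prometheus metrics, custom gauges can also be pushed directly to the Monitoring ingestion API. A small payload builder, where the field names ("metrics", "type": "DGAUGE", "ts") are assumptions to verify against the Monitoring API documentation:

```python
import json
import time

def gauge_payload(name: str, value: float, labels=None) -> str:
    """JSON body for one DGAUGE metric in the (assumed) Monitoring format."""
    return json.dumps({
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "metrics": [{
            "name": name,
            "type": "DGAUGE",
            "labels": labels or {},
            "value": value,
        }],
    })

# Example: report the number of in-flight vLLM requests
# gauge_payload("vllm_requests_running", 3.0, {"host": "llm-server"})
```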