LLM Deployment on Yandex Cloud (DataSphere)

We design and deploy artificial intelligence systems, from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering, and MLOps to make AI work not just in the lab, but in real business.
Complexity: Medium · Turnaround: ~3-5 business days

Deploying an LLM on Yandex Cloud

Yandex Cloud is the leading Russian cloud provider, offering GPU instances, its own ML platform, and its own LLM family (YandexGPT). It is a natural fit for Russian companies with data-residency and import-substitution requirements.

GPU instances in Yandex Cloud

GPU platforms: g2 (Tesla V100, 32 GB) and g3 (A100, 80 GB):

# Create a GPU VM via the YC CLI
yc compute instance create \
  --name llm-server \
  --zone ru-central1-a \
  --platform gpu-standard-v3 \
  --gpus 1 \
  --memory 48GB \
  --cores 14 \
  --core-fraction 100 \
  --image-family ubuntu-2204-lts-gpu \
  --image-folder-id standard-images \
  --disk-type network-ssd \
  --disk-size 300GB \
  --network-interface subnet-name=default,nat-ip-version=ipv4 \
  --ssh-key ~/.ssh/id_rsa.pub

vLLM configuration on YC VM

# Install after SSH'ing into the VM
sudo apt-get update && sudo apt-get install -y python3-pip
pip install vllm

# Download the model from Yandex Object Storage
aws s3 sync s3://my-bucket/models/mistral-7b/ /data/models/mistral-7b/ \
  --endpoint-url https://storage.yandexcloud.net \
  --profile yandex

# Start the OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model /data/models/mistral-7b/ \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --max-num-seqs 128 \
  --port 8000 \
  --host 0.0.0.0
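Once the server is up, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch; the host, port, and model path are the example values from above, adjust them to your deployment:

```python
import json
import urllib.request

def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 256) -> urllib.request.Request:
    """Build a request against vLLM's OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Usage (against the VM started above):
# req = build_completion_request("http://10.0.0.10:8000",
#                                "/data/models/mistral-7b/", "Hello!")
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["text"])
```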

Yandex Object Storage for models

import os

import boto3

# Yandex Object Storage is compatible with the S3 API
s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.yandexcloud.net",
    aws_access_key_id=os.getenv("YC_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("YC_SECRET_KEY"),
    region_name="ru-central1"
)

# Upload the model files (weights, tokenizer, config)
model_files = os.listdir("/local/models")
for file in model_files:
    s3.upload_file(
        Filename=f"/local/models/{file}",
        Bucket="llm-models-bucket",
        Key=f"mistral-7b/{file}",
        ExtraArgs={"StorageClass": "COLD"}  # for rarely used versions
    )
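The loop above assumes a flat file list; real model directories usually nest (tokenizer subfolders, shard files). A small stdlib helper to enumerate a local model directory and derive bucket keys, preserving subpaths (the directory layout is an assumption):

```python
import os

def model_upload_plan(local_dir: str, key_prefix: str):
    """Walk local_dir and yield (local_path, s3_key) pairs preserving subpaths."""
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            rel = os.path.relpath(local_path, local_dir)
            yield local_path, f"{key_prefix}/{rel.replace(os.sep, '/')}"

# Usage: feed each pair to s3.upload_file
# for path, key in model_upload_plan("/local/models/mistral-7b", "mistral-7b"):
#     s3.upload_file(Filename=path, Bucket="llm-models-bucket", Key=key)
```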

YandexGPT API

To call Yandex's own foundation models over the REST API:

import requests

def call_yandexgpt(prompt: str, folder_id: str, api_key: str) -> str:
    url = "https://llm.api.cloud.yandex.net/foundationModels/v1/completion"

    payload = {
        "modelUri": f"gpt://{folder_id}/yandexgpt-lite/latest",
        "completionOptions": {
            "stream": False,
            "temperature": 0.6,
            "maxTokens": 2000
        },
        "messages": [
            {
                "role": "system",
                "text": "Ты полезный помощник."
            },
            {
                "role": "user",
                "text": prompt
            }
        ]
    }

    response = requests.post(
        url,
        headers={
            "Authorization": f"Api-Key {api_key}",
            "x-folder-id": folder_id
        },
        json=payload,
        timeout=60
    )
    response.raise_for_status()
    return response.json()["result"]["alternatives"][0]["message"]["text"]
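The API can return transient 429/5xx errors under load, so production callers usually wrap the request in retries. A generic retry-with-backoff sketch; the schedule is an arbitrary choice, not a Yandex recommendation:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0,
                 retriable=(Exception,), sleep=time.sleep):
    """Call fn(), retrying retriable errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of attempts, propagate the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Usage:
# answer = with_retries(lambda: call_yandexgpt("Summarize...", folder_id, api_key))
```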

Yandex DataSphere for ML development

DataSphere is a managed Jupyter environment with GPUs on demand:

# In a DataSphere notebook
#!g1.1  # directive to run the cell on a V100

import torch
print(torch.cuda.get_device_name(0))  # Tesla V100-SXM2-32GB

# Train or fine-tune a model
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    fp16=True,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
)
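With these arguments the effective batch size is 8 × 4 = 32 per optimizer step. A quick helper to sanity-check the training schedule before burning GPU hours; the dataset size in the example is hypothetical:

```python
import math

def training_schedule(num_examples: int, per_device_batch: int,
                      grad_accum: int, epochs: int, num_gpus: int = 1):
    """Return (effective_batch_size, total_optimizer_steps) for a run."""
    effective = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective)
    return effective, steps_per_epoch * epochs

# For a hypothetical 10 000-example dataset with the arguments above:
# training_schedule(10_000, 8, 4, 3)  # -> (32, 939)
```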

Load balancing via Application Load Balancer

# Create a target group from the GPU VMs
yc alb target-group create llm-targets \
  --target subnet-name=default,ip-address=10.0.0.10 \
  --target subnet-name=default,ip-address=10.0.0.11

# Backend group
yc alb backend-group create llm-backends \
  --http-backend name=vllm-backend,port=8000,target-group-id=xxx,healthcheck-path=/health

# HTTP router
yc alb http-router create llm-router \
  --virtual-host name=llm,authority=llm.company.ru \
  --route name=api,path-prefix=/v1,backend-group-id=xxx
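The backend group's health check expects each VM to answer on /health; the same check is handy to run by hand before adding a VM to the target group. A sketch with an injectable fetcher (IPs and port match the example above):

```python
import urllib.request

def _http_status(url: str, timeout: float = 2.0) -> int:
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status

def healthy_backends(targets, fetch=_http_status):
    """Return the backends whose /health endpoint answers HTTP 200."""
    healthy = []
    for host in targets:
        try:
            if fetch(f"http://{host}:8000/health") == 200:
                healthy.append(host)
        except OSError:
            pass  # connection refused / timeout: treat as unhealthy
    return healthy

# Usage:
# healthy_backends(["10.0.0.10", "10.0.0.11"])
```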

Monitoring via Yandex Monitoring

Monitoring is built in: VM metrics (CPU, memory, and GPU utilization via the DCGM exporter) land in Yandex Monitoring automatically. Custom metrics can be shipped via Unified Agent:

# /etc/yandex-unified-agent/config.yml
routes:
  - input:
      plugin: prometheus_puller
      config:
        url: http://localhost:8000/metrics
        pull_period: 15s
    output:
      plugin: yc_metrics
      config:
        folder_id: xxx
        iam_token_file: /etc/yandex-unified-agent/iam_token
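The prometheus_puller input above scrapes vLLM's Prometheus-format /metrics endpoint. To read a single value ad hoc (e.g. in an alerting script), a minimal parser of the text exposition format is enough; the metric name shown is an example and may differ between vLLM versions:

```python
def parse_prom_metric(text: str, name: str):
    """Return the first sample value for `name` from Prometheus text format."""
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        metric, _, value = line.rpartition(" ")
        # Match both bare metrics and metrics with a {label} set
        if metric == name or metric.startswith(name + "{"):
            return float(value)
    return None
```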