Fine-Tuning GigaChat Language Model (Sber)
GigaChat is Sber's language model, available through the GigaChat API. Fine-tuning is offered within the corporate segment (GigaChat Pro / GigaChat Enterprise) and lets you adapt the model to your business specifics. The model is oriented primarily toward the Russian-language market, making it a priority choice for tasks involving Russian-language content, Russian accounting documents (RSBU), and interactions within the Russian legal framework.
Access to GigaChat Fine-Tuning
The GigaChat API is provided through the Sber Cloud platform. Fine-tuning requires an enterprise contract and access to the appropriate tier. Standard API access provides the base models GigaChat Lite, GigaChat Pro, and GigaChat Max; fine-tuning is available through GigaChat Enterprise or by special request to the corporate department.
Authentication via OAuth 2.0:
import base64
import requests

# Encode client credentials for HTTP Basic authentication
credentials = base64.b64encode(
    f"{client_id}:{client_secret}".encode()
).decode()

response = requests.post(
    "https://ngw.devices.sberbank.ru:9443/api/v2/oauth",
    headers={
        "Authorization": f"Basic {credentials}",
        "RqUID": "unique-request-id",  # unique request identifier (UUID)
    },
    data={"scope": "GIGACHAT_API_CORP"},
)
response.raise_for_status()
access_token = response.json()["access_token"]
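Once a token is obtained, requests to the chat API carry it as a Bearer header. A minimal sketch of assembling such a request follows; the endpoint URL, the `build_chat_request` helper, and the `GigaChat-Pro` model identifier are illustrative assumptions, not guaranteed API details:

```python
import requests  # needed only when actually sending the request

# Assumed chat completions endpoint; verify against current Sber documentation.
GIGACHAT_URL = "https://gigachat.devices.sberbank.ru/api/v1/chat/completions"

def build_chat_request(access_token: str, messages: list, model: str = "GigaChat-Pro"):
    """Assemble headers and JSON body for a chat completion call."""
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    body = {"model": model, "messages": messages}
    return headers, body

headers, body = build_chat_request(
    "ACCESS_TOKEN",
    [{"role": "user", "content": "What does CASCO cover?"}],
)
# To send: requests.post(GIGACHAT_URL, headers=headers, json=body)
```

Keeping payload construction in a pure function makes it easy to unit-test request shapes without touching the network.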
Data Format for Fine-Tuning
GigaChat accepts training data as dialogues in JSON format. Each example is a complete conversation context with roles:
{
  "messages": [
    {
      "role": "system",
      "content": "You are an insurance company assistant. Help customers understand insurance product conditions."
    },
    {
      "role": "user",
      "content": "What is covered under CASCO insurance for accidents caused by third parties?"
    },
    {
      "role": "assistant",
      "content": "For accidents caused by third parties under CASCO: collision damage regardless of fault is covered, restoration repair costs are covered..."
    }
  ]
}
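Before uploading, it is worth validating every training example against this shape. A minimal sketch, assuming the data is stored one JSON object per line (JSONL) and that each example must end with an assistant reply; the exact requirements of GigaChat's training pipeline should be confirmed with Sber:

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_example(line: str) -> bool:
    """Check one JSONL line: parses, has a non-empty messages list,
    valid roles, non-empty content, and ends with an assistant turn."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = obj.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for m in messages:
        if m.get("role") not in ALLOWED_ROLES or not str(m.get("content", "")).strip():
            return False
    return messages[-1]["role"] == "assistant"

good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
bad = '{"messages": [{"role": "user", "content": ""}]}'
print(validate_example(good), validate_example(bad))  # → True False
```

Running such a check over the whole file before submission catches malformed examples early, when they are cheap to fix.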
Advantages for Russian Market
Financial regulatory compliance: GigaChat is trained on Russian regulatory documents, including Central Bank guidelines, RSBU standards, tax reporting forms. This reduces data volume needed for financial task fine-tuning.
Medical documentation: Ministry of Health templates, clinical guidelines, ICD-10, Social Fund integration. Native support for Russian medical terminology.
Government services: work with SMEV documents, department formats, specific public sector terminology.
Data security: data remains within Sber Cloud perimeter, critical for banks, insurance companies, government organizations.
Practical Example: Assistant for Banking Chatbot
Task: fine-tune GigaChat Pro to process incoming requests to a retail bank: answering product questions and routing complex requests to operators.
Dataset: 3500 dialogues from real correspondence (anonymized), covering 45 topics (loans, deposits, cards, transfers, transaction disputes).
Data preparation stages:
- Extract dialogues from CRM
- Depersonalize (replace names, card numbers, phones)
- Filter out dialogues with negative outcomes (where the customer received no answer)
- Mark complex cases requiring routing
- Balance by topics (no more than 15% from one category)
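The depersonalization and topic-balancing steps above can be sketched in a few lines. The regex patterns and the `(topic, text)` record format are hypothetical simplifications; a production pipeline would also need NER-based name removal:

```python
import re
from collections import Counter

# Hypothetical patterns for common PII in banking chats; illustrative only.
PII_PATTERNS = {
    "card": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
    "phone": re.compile(r"(\+7|8)[\s(-]*\d{3}[\s)-]*\d{3}[\s-]*\d{2}[\s-]*\d{2}"),
}

def depersonalize(text: str) -> str:
    """Replace card numbers and phone numbers with placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label.upper()}>", text)
    return text

def balance_topics(dialogues, max_share=0.15):
    """Keep at most max_share of the dataset per topic.
    dialogues: list of (topic, text) tuples."""
    limit = max(1, int(len(dialogues) * max_share))
    counts, kept = Counter(), []
    for topic, text in dialogues:
        if counts[topic] < limit:
            counts[topic] += 1
            kept.append((topic, depersonalize(text)))
    return kept

print(depersonalize("Card 1234 5678 9012 3456, call +7 999 123-45-67"))
# → Card <CARD>, call <PHONE>
```

The single-pass cap is a deliberately simple balancing strategy; stratified sampling would preserve topic proportions more evenly.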
Results:
- CSAT (customer satisfaction with bot answer): 3.2 → 4.1 out of 5
- Correct routing percentage: 71% → 94%
- Escalation rate (requests passed to an operator): 61% → 38%
- Average dialogue time: -22%
GigaChat Fine-Tuning Limitations
- Closed weights: as with GPT-4o, you get a hosted endpoint without access to the weights
- Infrastructure lock-in: Sber Cloud only, no on-premise deployment
- Corporate threshold: fine-tuning is unavailable on free tiers
- Context size: 32K tokens, smaller than Qwen2.5 or Claude 3.5 Sonnet
Comparison with Related Solutions
| Parameter | GigaChat | YandexGPT | Llama (self-hosted) |
|---|---|---|---|
| Ecosystem | Sber Cloud | Yandex Cloud | Arbitrary |
| Russian language | Excellent | Excellent | Good |
| 152-FZ compliance | Yes | Yes | Yes (on-prem) |
| Integrations | SberBusiness API | Yandex Tracker/Telemost | REST/OpenAI-compat |
| Fine-tuning access | Enterprise | Enterprise | Open |
Project Timeline
- Task audit, dataset evaluation: 3–5 days
- Data preparation and depersonalization: 2–4 weeks
- Iterative training: 1–2 weeks
- Testing, A/B: 1 week
- Integration, monitoring: 1–2 weeks
- Total: 5–9 weeks