Kandinsky (Sber) integration for image generation
Kandinsky is a Russian image generation model from Sber. Its main advantage is native support for Russian-language prompts without translation; the CLIP model is trained on Russian texts. It is a candidate for projects with localization and data sovereignty requirements.
Kandinsky 3 via API
import asyncio
import base64
import json

import httpx
class KandinskyClient:
    """Async client for the Kandinsky (FusionBrain) text-to-image REST API.

    Authenticates with an API key/secret pair and exposes a single
    ``generate`` coroutine that submits a job and polls until the images
    are ready.
    """

    def __init__(self, api_key: str, secret_key: str):
        self.api_key = api_key
        self.secret_key = secret_key
        self.base_url = "https://api-key.fusionbrain.ai/key/api/v1"

    def _auth_headers(self) -> dict[str, str]:
        # FusionBrain expects both the key and the secret on every request.
        return {
            "X-Key": f"Key {self.api_key}",
            "X-Secret": f"Secret {self.secret_key}",
        }

    async def generate(
        self,
        prompt: str,
        width: int = 1024,
        height: int = 1024,
        num_images: int = 1,
        style: str = "DEFAULT",  # DEFAULT, KANDINSKY, UHD, ANIME, DIGITAL_ART
    ) -> list[bytes]:
        """Submit a text-to-image job and wait for the resulting images.

        Args:
            prompt: Text prompt (Russian prompts are supported natively).
            width: Output width in pixels.
            height: Output height in pixels.
            num_images: Number of images to request.
            style: Named style preset understood by the API.

        Returns:
            List of decoded image byte strings.

        Raises:
            RuntimeError: if the service reports the job failed.
            TimeoutError: if the job does not finish within the polling budget.
        """
        async with httpx.AsyncClient() as client:
            # Fetch the list of available models; the first entry is used.
            models_resp = await client.get(
                f"{self.base_url}/models", headers=self._auth_headers()
            )
            model_id = models_resp.json()[0]["id"]

            # Kick off the generation job. The API takes the parameters as a
            # JSON string inside a form field, not as a JSON request body.
            params = {
                "type": "GENERATE",
                "numImages": num_images,
                "width": width,
                "height": height,
                "generateParams": {"query": prompt},
                "style": style,
            }
            gen_resp = await client.post(
                f"{self.base_url}/text2image/run",
                headers=self._auth_headers(),
                data={"model_id": str(model_id), "params": json.dumps(params)},
            )
            uuid = gen_resp.json()["uuid"]

            # Block until the job completes (or fails / times out).
            return await self.poll_result(client, uuid)

    async def poll_result(
        self,
        client,
        uuid: str,
        max_attempts: int = 30,
        poll_interval: float = 2.0,
    ) -> list[bytes]:
        """Poll the job status until it is DONE and return the decoded images.

        Args:
            client: An open ``httpx.AsyncClient`` (or compatible object).
            uuid: Job identifier returned by the run endpoint.
            max_attempts: Maximum number of status checks before giving up.
            poll_interval: Seconds to sleep before each status check.

        Raises:
            RuntimeError: if the API reports status ``FAIL``.
            TimeoutError: if the job is still pending after all attempts.
        """
        headers = self._auth_headers()
        for _ in range(max_attempts):
            await asyncio.sleep(poll_interval)
            resp = await client.get(
                f"{self.base_url}/text2image/status/{uuid}", headers=headers
            )
            data = resp.json()
            status = data["status"]
            if status == "DONE":
                # Images arrive base64-encoded; decode to raw bytes.
                return [base64.b64decode(img) for img in data["images"]]
            if status == "FAIL":
                # Bug fix: the original kept polling a failed job until the
                # timeout; surface the failure immediately instead.
                raise RuntimeError(
                    f"Generation failed: {data.get('errorDescription', 'unknown error')}"
                )
        raise TimeoutError("Generation timeout")
Self-hosted via Hugging Face (Kandinsky 2.2 with diffusers)
# Self-hosted Kandinsky 2.2 is a two-stage pipeline: a prior model that maps
# text to image embeddings, and a decoder model that renders the image.
from diffusers import KandinskyV22Pipeline, KandinskyV22PriorPipeline
import torch

# Prior: text prompt -> image embeddings.
# NOTE(review): fp16 weights + .to("cuda") assume an NVIDIA GPU with enough
# VRAM is available — confirm for the target deployment.
prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior",
    torch_dtype=torch.float16
).to("cuda")

# Decoder: embeddings -> image.
pipeline = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder",
    torch_dtype=torch.float16
).to("cuda")
def generate_kandinsky(
    prompt: str,
    negative_prompt: str = "",
    height: int = 768,
    width: int = 768,
    num_inference_steps: int = 25,
) -> bytes:
    """Generate one image with the two-stage Kandinsky 2.2 pipeline.

    Uses the module-level ``prior`` and ``pipeline`` models. The resolution
    and step count, previously hard-coded, are now keyword parameters with
    the same defaults, so existing callers are unaffected.

    Args:
        prompt: Text prompt (Russian or English).
        negative_prompt: Concepts the image should avoid.
        height: Output height in pixels.
        width: Output width in pixels.
        num_inference_steps: Decoder denoising steps (more = slower, finer).

    Returns:
        PNG-encoded image bytes.
    """
    import io  # local import kept function-scoped, hoisted from mid-body

    # Stage 1 (prior): text -> image embeddings.
    image_embeds, negative_image_embeds = prior(
        prompt, negative_prompt=negative_prompt
    ).to_tuple()

    # Stage 2 (decoder): embeddings -> PIL image.
    image = pipeline(
        image_embeds=image_embeds,
        negative_image_embeds=negative_image_embeds,
        height=height,
        width=width,
        num_inference_steps=num_inference_steps,
    ).images[0]

    # Serialize the PIL image to PNG bytes in memory.
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return buf.getvalue()
Kandinsky 3 (used in the API example above; the self-hosted snippet uses Kandinsky 2.2) processes Russian prompts directly, improving accuracy for Russian-specific concepts (cultural references, Russian toponymy, folklore). For international projects, FLUX or SDXL are preferable. API integration takes 1–2 days.







