Generative AI: Images, Video, Music, 3D
The request "generate a product image" sounds simple. The reality: choosing among dozens of models, setting up an inference pipeline, solving frame consistency, integrating with the product backend, and answering "why does the model generate six-fingered hands on staging but not in production." Let's break it down by direction.
Image Generation: From Prompt to Production API
The current landscape: FLUX.1 [dev/schnell/pro] from Black Forest Labs and Stable Diffusion 3.5. FLUX.1 [schnell] needs 4 steps versus 20–50 for SDXL while keeping higher quality. On an A100 80GB: 1.2–1.8 s per 1024×1024 image at batch_size=4.
A common deployment issue: FLUX.1 [dev] requires 24+ GB of VRAM in fp16. On an A10G 24GB it barely fits, and batch_size>1 causes OOM. Solutions: torch_dtype=torch.bfloat16 plus enable_model_cpu_offload() from diffusers, or NF4 quantization via bitsandbytes, with minimal quality loss and memory dropping to 12–14 GB.
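Both workarounds can be sketched with diffusers (assuming a recent diffusers build with bitsandbytes integration; exact class and argument names may shift between versions):

```python
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

MODEL = "black-forest-labs/FLUX.1-dev"

# Option 1: bfloat16 + CPU offload -- fits FLUX.1 [dev] into a 24 GB A10G.
pipe = FluxPipeline.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # keeps idle submodules in CPU RAM

# Option 2: NF4-quantize the transformer (the memory hog) via bitsandbytes.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    MODEL,
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipe_nf4 = FluxPipeline.from_pretrained(
    MODEL, transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")

image = pipe_nf4(
    "studio photo of a ceramic mug",
    height=1024, width=1024, num_inference_steps=28,
).images[0]
```

Option 1 trades throughput for memory (offload adds PCIe transfers per step); option 2 keeps everything on the GPU and is usually the better default for serving.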
ControlNet and IP-Adapter are the key production tools for control. ControlNet with a Canny/Depth/Pose map gives structural control. IP-Adapter (especially IP-Adapter-FaceID) transfers character identity, the foundation for personalized content.
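A minimal ControlNet sketch, assuming the SDXL Canny checkpoint `diffusers/controlnet-canny-sdxl-1.0` and a local reference photo (file names are illustrative):

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

# Build a Canny edge map from the reference photo -- the structural "skeleton".
ref = cv2.imread("reference.jpg")
gray = cv2.cvtColor(ref, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 1ch -> RGB

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "modern living room, soft daylight",
    image=control_image,
    controlnet_conditioning_scale=0.7,  # how strictly to follow the edges
).images[0]
```

`controlnet_conditioning_scale` is the main dial: closer to 1.0 locks composition to the edge map, lower values give the model room to reinterpret.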
Case: e-commerce photography. A retailer with 8000 SKUs needed lifestyle photos. Pipeline: product segmentation (Segment Anything Model 2) → background removal → FLUX.1 [dev] inpainting with the product as IP-Adapter reference → upscaling via RealESRGAN_x4plus. Generation cost: $0.003/image on a rented A100 versus $15–40 for a professional shoot. Throughput: 200 images/hour on 2× A100.
Fine-tuning to Specific Style or Character
Dreambooth and LoRA are the standard for adapting to a specific visual style or object. A LoRA trains in 2–4 h on 20–30 reference images on an A100. Rank 16–32 is usually enough for style; 64+ for precise face reproduction.
A common mistake is training a LoRA too long: it overfits on the references and loses variability. The telltale sign: at cfg_scale=7 every image is a copy-paste of a reference. The fix is early stopping (usually 1500–2000 steps for 20 images) plus prior_preservation_loss.
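At inference time, loading a trained LoRA looks roughly like this (the path and the 0.8 scale are illustrative; lowering the scale is the quick lever when an already-overfit LoRA starts copying references):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# "path/to/brand_style_lora" is a placeholder for your trained checkpoint.
pipe.load_lora_weights("path/to/brand_style_lora")

# Fuse at reduced strength: below 1.0 blends the LoRA with the base model,
# trading style fidelity for variability.
pipe.fuse_lora(lora_scale=0.8)

image = pipe("product shot in brand style, studio lighting").images[0]
```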
For deeper customization there is full fine-tuning via diffusers + accelerate with FSDP across multiple GPUs. Expect 40–80 h of training and a genuinely large dataset (1000+ images).
Video Generation: Technology State 2025
| Model | Availability | Length | Resolution | Control |
|---|---|---|---|---|
| Sora (OpenAI) | API (limited) | up to 60 s | 1080p | prompt, image-to-video |
| Wan2.1 (Alibaba) | open weights | up to 81 frames | 720p | prompt, I2V, V2V |
| CogVideoX-5B | open weights | 6 s | 720p | prompt, I2V |
| Kling 1.6 | API | up to 30 s | 1080p | prompt, I2V |
| Mochi-1 | open weights | 5.4 s | 480p | prompt |
Open-weight video models lag behind commercial ones in stability and clip length. Wan2.1 is the best option for self-hosting: 14B parameters, 2× A100, acceptable quality on short clips.
The main pain of video generation is temporal consistency: a character changes clothes at the third second, objects "float." Partial solutions: tuning motion_bucket_id and noise_aug_strength in Stable Video Diffusion, or using I2V (image-to-video) instead of pure text-to-video.
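In Stable Video Diffusion the two knobs look like this (a sketch; 127 and 0.02 are common starting values, not universal answers, and `keyframe.png` is a placeholder):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")

# I2V: anchoring on a fixed keyframe is itself a consistency fix.
image = load_image("keyframe.png")

frames = pipe(
    image,
    num_frames=25,
    motion_bucket_id=127,     # lower = calmer motion, fewer "floating" objects
    noise_aug_strength=0.02,  # higher = more freedom to deviate from keyframe
).frames[0]

export_to_video(frames, "clip.mp4", fps=7)
```

In practice the two parameters trade off: more motion means more drift from the keyframe, so consistency bugs are usually debugged by lowering both.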
AnimateDiff remains useful for short loops and motion effects on top of SD/FLUX. It is no Sora, but it is self-hosted and predictable.
Music and Audio Generation
AudioCraft from Meta (MusicGen + AudioGen) is a production-ready stack for music. musicgen-large (3.3B) generates 30 s of music in ~8 s on an A100. Control is via text prompt and melody conditioning: you can specify the melody by humming it.
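A sketch with the audiocraft API (assuming `audiocraft` is installed; the melody-conditioning call is shown commented out since it needs a recorded reference):

```python
from audiocraft.data.audio import audio_write
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained("facebook/musicgen-large")
model.set_generation_params(duration=30)  # seconds of audio per sample

# Text-only generation; returns a batch of waveform tensors.
wav = model.generate(["lo-fi hip hop, warm vinyl texture, 80 bpm"])

# Melody conditioning: record yourself humming, pass it as a chroma reference.
# import torchaudio
# melody, sr = torchaudio.load("hummed_melody.wav")
# wav = model.generate_with_chroma(
#     ["lo-fi hip hop, warm vinyl texture"], melody[None], sr
# )

for i, one_wav in enumerate(wav):
    # loudness-normalized .wav output next to the script
    audio_write(f"track_{i}", one_wav.cpu(), model.sample_rate,
                strategy="loudness")
```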
Stable Audio Open from Stability AI is an alternative with clips up to 47 s and better structural control (intro/verse/chorus). Deployment is the same: diffusers + FastAPI.
For voice-over and narration: the ElevenLabs API or self-hosted XTTS v2 (see the Speech AI service). For sound design and foley: AudioGen.
3D Generation: Practical State
3D generation has not reached the maturity of 2D, but for specific tasks the tools already work:
TripoSG and Shap-E handle text/image-to-3D. Shap-E from OpenAI generates simple 3D meshes in seconds but with rough geometry. TripoSG is more detailed but requires post-processing (remesh, UV unwrap).
Wonder3D and Zero123++ do 3D reconstruction from a single image. They work via multi-view generation (6–8 views) followed by 3D recovery via NeuS or instant-ngp.
Gaussian Splatting (3DGS) is not generation but reconstruction from a series of photos or video. For product cards and real estate it is already production-ready: 50–200 photos → a 3DGS model in 15–30 min on an RTX 4090 → an interactive 3D viewer in the browser.
Infrastructure and Deployment
For generative models, the critical pieces are:
- Task queue — Celery + Redis or Ray Serve. Synchronous HTTP for image generation breaks down above ~5 concurrent requests.
- Caching — similar prompts yield similar results. A semantic cache over embeddings (faiss + sentence-transformers) can cut GPU load by 20–40%.
- Quality monitoring — CLIP score for text-image alignment, FID for the generation distribution. Integrate into MLflow or W&B.
- Storage — generated images go straight to S3/MinIO, not the server disk.
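The semantic-cache idea fits in a few lines. This sketch uses brute-force numpy cosine similarity for clarity; at scale the search would sit behind a faiss index, and the embeddings would come from a sentence-transformers model (e.g. `model.encode(prompt)`):

```python
import numpy as np

class SemanticPromptCache:
    """Return a cached result when a new prompt embedding is close enough
    (cosine similarity) to a previously served one. Brute-force numpy here;
    swap the search loop for a faiss IndexFlatIP in production."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.embeddings: list[np.ndarray] = []
        self.results: list[str] = []

    @staticmethod
    def _normalize(v: np.ndarray) -> np.ndarray:
        return v / (np.linalg.norm(v) + 1e-12)

    def add(self, embedding: np.ndarray, result: str) -> None:
        # Store normalized so dot product == cosine similarity.
        self.embeddings.append(self._normalize(embedding))
        self.results.append(result)

    def lookup(self, embedding: np.ndarray):
        if not self.embeddings:
            return None
        sims = np.stack(self.embeddings) @ self._normalize(embedding)
        best = int(np.argmax(sims))
        return self.results[best] if sims[best] >= self.threshold else None
```

A hit returns the stored artifact (e.g. an S3 key) instead of spending GPU time; the threshold is tuned per model, since setting it too low serves visibly wrong images.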
Workflow
Before model selection, define the use case: do you need real-time (<3 s) or batch, do you need control (brand style, specific faces), and what is the GPU budget. The first conversation takes 1–2 h.
Then comes a proof of concept on your own content: real results, not GitHub demos. It often turns out a hybrid is needed: an API for urgent requests plus self-hosted for bulk.
Timelines: integrating a ready-made API (DALL-E 3, Midjourney API, Stability API) takes 1–2 weeks; a self-hosted pipeline with fine-tuning, 6–12 weeks; a full platform with UI, queues, and monitoring, 3–6 months.







