Textual Inversion for Stable Diffusion
Textual Inversion trains a new token in the CLIP text embedding space that "encodes" a specific subject or style. It is the lightest-weight SD personalization method: the embedding file is 50–100 KB, training takes 30–60 minutes, and the result is used like an ordinary word in a prompt.
Operating principle
Textual Inversion does not change the model's weights. Instead, it finds a new vector in the CLIP embedding space that best describes the training images. The token <mystyle> is added to the tokenizer vocabulary and then used like any regular word.
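The core mechanic can be sketched in plain PyTorch: append one row to an embedding table, initialize it from an existing token, and let gradients update only that row. This is a toy illustration under assumed dimensions (the real text encoder is 768-dim CLIP), not the diffusers implementation:

```python
import torch
import torch.nn as nn

# Toy "text encoder" embedding table: 10 existing tokens, 8-dim vectors.
vocab_size, dim = 10, 8
emb = nn.Embedding(vocab_size + 1, dim)  # +1 row for the new <mystyle> token
new_token_id = vocab_size                # id of the freshly added token

# Initialize the new row from an existing "initializer" token (cf. --initializer_token).
initializer_id = 3
with torch.no_grad():
    emb.weight[new_token_id] = emb.weight[initializer_id]

original = emb.weight.detach().clone()
# weight_decay=0 so AdamW's decoupled decay cannot touch the frozen rows.
optimizer = torch.optim.AdamW([emb.weight], lr=1e-2, weight_decay=0.0)

# One training step with a dummy target; real training uses the diffusion loss.
target = torch.randn(dim)
loss = ((emb(torch.tensor([new_token_id]))[0] - target) ** 2).mean()
loss.backward()

# Mask gradients so only the new token's row is ever updated.
grad_mask = torch.zeros_like(emb.weight)
grad_mask[new_token_id] = 1.0
emb.weight.grad *= grad_mask
optimizer.step()

# Only the new token's vector has changed; the rest of the table is intact.
changed = (emb.weight.detach() != original).any(dim=1)
print(changed.tolist())  # True only at index new_token_id
```

The gradient mask is what makes this "inversion" rather than fine-tuning: the model is frozen, and optimization searches only for the one vector that best reconstructs the training images.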
```python
from diffusers import StableDiffusionPipeline
import torch

# Training via the diffusers example script:
# accelerate launch textual_inversion.py \
#   --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
#   --train_data_dir="./ti_images" \
#   --learnable_property="style" \      # "style" or "object"
#   --placeholder_token="<mystyle>" \
#   --initializer_token="painting" \
#   --resolution=512 \
#   --train_batch_size=1 \
#   --max_train_steps=3000 \
#   --learning_rate=5.0e-04 \
#   --output_dir="./ti_output"

# Applying the trained embedding
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Load the embedding
pipe.load_textual_inversion("./ti_output/learned_embeds.bin")

# Use the token in the prompt
image = pipe(
    "a portrait in <mystyle> style, dramatic lighting",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
```
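The learned_embeds.bin file produced by training is simply a torch-serialized dict mapping the placeholder token to its learned vector, which is why it stays so small and portable. A quick sketch of that layout (the 768-dim size matches the SD 1.5 CLIP text encoder; the fake vector and temp path are illustrative):

```python
import os
import tempfile

import torch

# A fake learned embedding: one 768-dim vector keyed by the placeholder token,
# mirroring the dict layout of learned_embeds.bin.
learned = {"<mystyle>": torch.randn(768)}

path = os.path.join(tempfile.mkdtemp(), "learned_embeds.bin")
torch.save(learned, path)

# The file holds a single vector, so it stays well under 100 KB.
print(os.path.getsize(path) < 100 * 1024)  # True

restored = torch.load(path)
print(list(restored.keys()))  # ['<mystyle>']
```

This is also why embeddings are easy to mix: `pipe.load_textual_inversion` can be called several times to register multiple tokens in one pipeline.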
Preparing the training dataset in Python
```python
import os
import random

from PIL import Image
from torch.utils.data import Dataset


class TextualInversionDataset(Dataset):
    def __init__(self, images_dir: str, tokenizer, placeholder_token: str, size: int = 512):
        self.images = [
            os.path.join(images_dir, f)
            for f in os.listdir(images_dir)
            if f.endswith((".jpg", ".png", ".webp"))
        ]
        self.tokenizer = tokenizer
        self.placeholder_token = placeholder_token
        self.size = size

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = Image.open(self.images[idx]).convert("RGB").resize((self.size, self.size))
        # Simple prompt templates containing the placeholder token
        prompts = [
            f"a photo of {self.placeholder_token}",
            f"{self.placeholder_token} in the scene",
            f"an image featuring {self.placeholder_token}",
        ]
        prompt = random.choice(prompts)
        return {"image": img, "prompt": prompt}
```
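The dataset can be smoke-tested without downloading any model by pointing it at a folder of generated placeholder images. The snippet below reproduces the class so it runs standalone; the tokenizer argument is not used at this stage, so None is passed, and the solid-color images are stand-ins for real training data:

```python
import os
import random
import tempfile

from PIL import Image
from torch.utils.data import Dataset


# The dataset class from this section, reproduced so the snippet is self-contained.
class TextualInversionDataset(Dataset):
    def __init__(self, images_dir, tokenizer, placeholder_token, size=512):
        self.images = [os.path.join(images_dir, f) for f in os.listdir(images_dir)
                       if f.endswith((".jpg", ".png", ".webp"))]
        self.tokenizer = tokenizer
        self.placeholder_token = placeholder_token
        self.size = size

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = Image.open(self.images[idx]).convert("RGB").resize((self.size, self.size))
        prompts = [f"a photo of {self.placeholder_token}",
                   f"{self.placeholder_token} in the scene",
                   f"an image featuring {self.placeholder_token}"]
        return {"image": img, "prompt": random.choice(prompts)}


# Point it at a throwaway folder of solid-color images.
images_dir = tempfile.mkdtemp()
for i in range(3):
    Image.new("RGB", (640, 480), (i * 40, 100, 200)).save(
        os.path.join(images_dir, f"img_{i}.png")
    )

ds = TextualInversionDataset(images_dir, tokenizer=None, placeholder_token="<mystyle>")
item = ds[0]
print(len(ds), item["image"].size)  # 3 (512, 512)
```

Note that non-square inputs are distorted by the plain `resize`; center-cropping before resizing is a common refinement for real training sets.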
Comparison of personalization methods
| Method | File Size | Training Time | Quality | Compatibility |
|---|---|---|---|---|
| Textual Inversion | 50–100 KB | 30–60 min | Moderate | Any SD |
| LoRA | 10–150 MB | 30–120 min | Good | Compatible architecture |
| DreamBooth (full) | 4–7 GB | 60–120 min | Excellent | Specific version |
| DreamBooth + LoRA | 50–150 MB | 30–60 min | Good | Compatible |
Textual Inversion wins on size and portability: an embedding can be shared as a single ~100 KB file. For accurate facial likeness or complex styles, LoRA or DreamBooth + LoRA are preferable. Training a single embedding takes 30–60 minutes, and integrating it into a pipeline takes about an hour.