Fine-tuning Stable Diffusion with LoRA
LoRA (Low-Rank Adaptation) is an efficient method for fine-tuning Stable Diffusion with a minimal number of trainable parameters. A LoRA file is 10–150 MB compared to 6–7 GB for a full model checkpoint, trains in 30–120 minutes on a consumer GPU, and is applied on top of a base model with adjustable strength.
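The core idea can be sketched in a few lines: instead of updating a full weight matrix, LoRA trains two small low-rank matrices whose product is added as a delta. This is a minimal illustrative sketch (not the kohya-ss implementation); `rank` and `alpha` play the same role as `network_dim` and `network_alpha` below.

```python
import torch

class LoRALinear(torch.nn.Module):
    """Wraps a frozen Linear layer and adds a trainable low-rank delta B @ A."""

    def __init__(self, base: torch.nn.Linear, rank: int = 32, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)  # freeze the original weights
        # A: (rank x in_features), B: (out_features x rank); B starts at zero,
        # so the wrapped layer initially behaves exactly like the base layer
        self.lora_A = torch.nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = torch.nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank  # same role as network_alpha / network_dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

base = torch.nn.Linear(768, 768)
layer = LoRALinear(base, rank=32, alpha=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, total)  # 49152 639744 — LoRA trains under 8% of this layer
```

This size difference, accumulated over every adapted layer, is why a LoRA file is megabytes instead of gigabytes.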
Differences between LoRA and DreamBooth
| Parameter | DreamBooth | LoRA |
|---|---|---|
| Modifies | Entire model | Delta matrices only |
| Result size | 6–7 GB | 10–150 MB |
| Training time | 30–60 min | 30–120 min |
| Combination | No | Up to 5 LoRA at a time |
| Application | One model | Any compatible |
LoRA Training for Style
```bash
# kohya-ss/sd-scripts is the de facto standard for LoRA training
git clone https://github.com/kohya-ss/sd-scripts
cd sd-scripts
pip install -r requirements.txt

# SDXL models are trained with sdxl_train_network.py
# (train_network.py is for SD 1.x/2.x)
python sdxl_train_network.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --dataset_config="dataset.toml" \
  --output_dir="./lora_output" \
  --output_name="my_style_v1" \
  --network_module="networks.lora" \
  --network_dim=32 \
  --network_alpha=16 \
  --learning_rate=1e-4 \
  --max_train_epochs=10 \
  --train_batch_size=2 \
  --save_every_n_epochs=2 \
  --mixed_precision="fp16" \
  --xformers
```
dataset.toml:

```toml
[general]
shuffle_caption = true
caption_dropout_rate = 0.05

[[datasets]]
resolution = 1024
batch_size = 2

  [[datasets.subsets]]
  image_dir = "./training_images"
  caption_extension = ".txt"
  num_repeats = 10
```
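With this config, the total number of optimizer steps is easy to estimate: each image is seen `num_repeats` times per epoch, divided into batches. A back-of-the-envelope sketch (the image count of 40 is a hypothetical example):

```python
# Rough step-count estimate for the kohya-ss config above
num_images = 40    # images in ./training_images (hypothetical example)
num_repeats = 10   # from dataset.toml
batch_size = 2     # train_batch_size
epochs = 10        # max_train_epochs

steps_per_epoch = num_images * num_repeats // batch_size
total_steps = steps_per_epoch * epochs
print(total_steps)  # 2000
```

For style LoRAs, roughly 1,000–3,000 total steps is a common target; `num_repeats` is the usual knob for hitting that range with a small dataset.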
Automatic image captioning
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import os

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
caption_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def auto_caption_dataset(
    images_dir: str,
    trigger_word: str = "mystyle",
    style_suffix: str = "in the style of mystyle",
) -> None:
    for img_file in os.listdir(images_dir):
        if not img_file.lower().endswith((".jpg", ".png", ".webp")):
            continue
        img = Image.open(os.path.join(images_dir, img_file)).convert("RGB")
        inputs = processor(img, return_tensors="pt")
        caption = processor.decode(
            caption_model.generate(**inputs, max_new_tokens=50)[0],
            skip_special_tokens=True,
        )
        # Prepend the trigger word and append the style suffix to the BLIP caption
        full_caption = f"{trigger_word}, {caption}, {style_suffix}"
        txt_path = os.path.join(images_dir, img_file.rsplit(".", 1)[0] + ".txt")
        with open(txt_path, "w", encoding="utf-8") as f:
            f.write(full_caption)
```
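The caption file must share the image's base name, since kohya-ss pairs `image.jpg` with `image.txt` (per the `caption_extension` setting). A quick self-contained check of that pairing logic, with the BLIP model replaced by a stub caption:

```python
import os
import tempfile

def write_caption(images_dir: str, img_file: str, caption: str) -> str:
    """Write a caption file next to the image, same base name, .txt extension."""
    txt_path = os.path.join(images_dir, img_file.rsplit(".", 1)[0] + ".txt")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(f"mystyle, {caption}, in the style of mystyle")
    return txt_path

with tempfile.TemporaryDirectory() as d:
    path = write_caption(d, "portrait_01.jpg", "a woman in a red coat")
    print(os.path.basename(path))  # portrait_01.txt
```

A mismatched base name silently drops the caption, so it's worth verifying the pairing before a long training run.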
Using multiple LoRAs
```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Load several LoRAs as named adapters
pipe.load_lora_weights("style_lora.safetensors", adapter_name="style")
pipe.load_lora_weights("character_lora.safetensors", adapter_name="character")

# Combine them with per-adapter strengths
pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])

image = pipe(
    "mystyle character, cinematic scene, detailed background",
    guidance_scale=7.5,
    num_inference_steps=30,
).images[0]
```
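Conceptually, stacking adapters amounts to a weighted sum of low-rank deltas on each adapted weight: W' = W + 0.7·ΔW_style + 0.5·ΔW_character. A minimal numeric sketch of that combination (toy matrices, not the diffusers internals):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2
W = rng.normal(size=(d, d))  # frozen base weight

# Two independent LoRA deltas, each a low-rank product B @ A
delta_style = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))
delta_char = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))

# set_adapters(..., adapter_weights=[0.7, 0.5]) acts like a weighted sum
W_combined = W + 0.7 * delta_style + 0.5 * delta_char

# Dropping one adapter's contribution recovers the other alone
assert np.allclose(W + 0.7 * delta_style, W_combined - 0.5 * delta_char)
```

Because the deltas simply add, strengths above ~1.0 in total can oversaturate the output; lowering the weights when combining several LoRAs is the usual remedy.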
Typical LoRA tasks
- **Artist style:** 50–200 images in the target style → the LoRA reproduces the style on new prompts.
- **Specific product:** 20–50 product photos with captions → the LoRA generates the product in different scenes.
- **Character (anime/game):** 30–100 images of the character → the LoRA reproduces it in different poses.
- **Professional photography:** 200+ photos by a specific photographer → the LoRA transfers the shooting style.

Timeframe: training a single LoRA (1,000 steps on an RTX 3090) takes 20–40 minutes; building a service with custom LoRA training takes 3–4 weeks.