Image Segmentation System Development
Segmentation is pixel-wise image annotation. Unlike detection, which outputs a rectangular bounding box, segmentation provides precise object contours. This is critical wherever shape matters: medical imaging, satellite data, autonomous driving, and quality control where defect area must be measured.
Semantic vs Instance vs Panoptic
Semantic segmentation — each pixel gets a class, objects of one class are not distinguished. Example: all cars are one class "car", all pedestrians are "person". Models: SegFormer, DeepLabV3+.
Instance segmentation — each object is separate, even within the same class. Example: car #1, car #2. Models: Mask R-CNN, YOLOv8-seg, YOLO11-seg.
Panoptic segmentation — combination of semantic and instance: "things" (countable objects) — by instances, "stuff" (sky, road) — semantically. Models: Mask2Former.
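The difference between the three output formats can be made concrete with array shapes (a minimal numpy sketch; the toy image size and class ids are illustrative):

```python
import numpy as np

H, W = 4, 6  # toy image size

# Semantic: a single (H, W) map of class ids; both cars share id 1
semantic = np.zeros((H, W), dtype=np.int64)
semantic[1:3, 0:2] = 1  # car A
semantic[1:3, 4:6] = 1  # car B (same id -- not separable from car A)

# Instance: one boolean (H, W) mask per object, plus a class per mask
car_a = np.zeros((H, W), dtype=bool); car_a[1:3, 0:2] = True
car_b = np.zeros((H, W), dtype=bool); car_b[1:3, 4:6] = True
instance_masks = np.stack([car_a, car_b])  # shape (2, H, W)
instance_classes = np.array([1, 1])        # both objects are "car"

# Panoptic: per-pixel (class_id, instance_id); "stuff" keeps instance_id 0
panoptic = np.zeros((H, W, 2), dtype=np.int64)
panoptic[..., 0] = semantic
panoptic[car_a, 1] = 1
panoptic[car_b, 1] = 2
```

The semantic map loses object identity, the instance stack loses nothing but grows with the number of objects, and the panoptic encoding carries both in a fixed-size array.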
Segment Anything Model (SAM)
Meta's SAM — a breakthrough in segmentation. Zero-shot: it requires no training for specific classes. Input prompt: a point, a box, or a mask.
```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load the ViT-H checkpoint and wrap it in a predictor
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Segmentation by bbox prompt (image: HxWx3 RGB uint8 array)
predictor.set_image(image)
masks, scores, _ = predictor.predict(
    box=np.array([x1, y1, x2, y2]),
    multimask_output=False,
)
```
SAM2 (2024) — improved version with video support: segmentation tracking through frames.
When SAM doesn't fit: tasks that require class labels (SAM segments but doesn't classify), and speed-critical tasks (SAM ViT-H: ~50 ms per image on an A100, too slow for real-time).
Fine-tuning for Domain-Specific Tasks
For medical imaging and industrial data, SAM is fine-tuned for domain:
```python
from ultralytics import SAM

# SAM2 fine-tuning via the Ultralytics interface.
# Note: not every Ultralytics release supports train() for SAM models;
# if it is unavailable, fine-tune the mask decoder with a custom loop.
model = SAM('sam2_b.pt')
model.train(
    data='medical_dataset.yaml',
    epochs=50,
    imgsz=1024,
    batch=4,
    lr0=1e-4,
)
```
For semantic segmentation — SegFormer (HuggingFace) with fine-tuning on custom data. SegFormer-B5 achieves mIoU 84.0 on Cityscapes at reasonable speed.
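A minimal SegFormer setup via HuggingFace Transformers might look as follows (sketched from a randomly initialised config so the snippet is self-contained; in practice you would load pretrained weights and fine-tune on your labels):

```python
import torch
from transformers import SegformerConfig, SegformerForSemanticSegmentation

# Randomly initialised small config for illustration; real fine-tuning
# would start from pretrained weights instead, e.g.
# SegformerForSemanticSegmentation.from_pretrained("nvidia/mit-b0", num_labels=3)
config = SegformerConfig(num_labels=3)
model = SegformerForSemanticSegmentation(config).eval()

pixel_values = torch.randn(1, 3, 128, 128)  # one normalised RGB image
with torch.no_grad():
    logits = model(pixel_values=pixel_values).logits  # (1, 3, 32, 32)

# SegFormer outputs logits at 1/4 resolution; upsample to input size
mask = torch.nn.functional.interpolate(
    logits, size=(128, 128), mode='bilinear', align_corners=False
).argmax(dim=1)  # (1, 128, 128) of class ids
```

The 1/4-resolution output is a deliberate design choice in SegFormer: the lightweight MLP decoder trades a final upsampling step for speed, which is why the interpolation back to input size is done outside the model.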
U-Net for Medical Tasks
U-Net — the standard for biomedical segmentation. Its encoder-decoder with skip connections trains well even on small datasets (200–500 images):
```python
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name='efficientnet-b4',
    encoder_weights='imagenet',
    in_channels=1,    # grayscale MRI/CT
    classes=3,        # background, organ, tumor
    activation=None,  # raw logits; apply softmax/argmax downstream
)
```
Quality Metrics
- mIoU (mean Intersection over Union) — main metric for semantic segmentation
- AP (Average Precision) — for instance segmentation
- Dice coefficient — for medical tasks (equivalent to F1 at pixel level)
- Boundary IoU — contour quality, important for precision tasks
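For binary masks, IoU and Dice are straightforward to compute directly (a minimal numpy sketch; most frameworks ship their own implementations):

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union for boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice = 2|A∩B| / (|A| + |B|), i.e. pixel-level F1."""
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2 * inter / total if total else 1.0

pred = np.zeros((4, 4), dtype=bool); pred[:2, :] = True  # top half
gt   = np.zeros((4, 4), dtype=bool); gt[:, :2] = True    # left half
print(iou(pred, gt))   # 4 / 12 ≈ 0.333
print(dice(pred, gt))  # 8 / 16 = 0.5
```

Note that Dice = 2·IoU / (1 + IoU), so the two metrics rank models identically; Dice is simply more forgiving of small boundary errors, which is why it dominates in medical work.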

| Model | mIoU Cityscapes | FPS |
|---|---|---|
| SegFormer-B2 | 81.0 | 48 |
| SegFormer-B5 | 84.0 | 15 |
| DeepLabV3+ ResNet101 | 80.9 | 22 |
| YOLOv8x-seg | — | 120 (instance) |

| Task | Timeline |
|---|---|
| Instance segmentation based on YOLOv8 | 2–4 weeks |
| Semantic segmentation, custom dataset | 3–6 weeks |
| Medical segmentation, SAM fine-tuning | 5–10 weeks |