Developing VLM-based image analysis systems
Vision-Language Models (VLMs) pair a visual encoder (e.g., ViT or CLIP) with a language model (LLaMA, Mistral, Qwen). The result: a model that understands an image and answers questions about it in natural language. GPT-4o, Claude Sonnet, LLaVA, and Qwen-VL are all VLMs.
The gap between "asking GPT-4o about an image" and "running a production VLM pipeline" is huge: the API is expensive at scale, 2–5 second latency is unacceptable for real-time use, and your data goes to the provider.
On-premises VLMs: Deploying on your own hardware
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch
from PIL import Image
import numpy as np


class LocalVLMAnalyzer:
    def __init__(self, model_name: str = 'Qwen/Qwen2-VL-7B-Instruct'):
        """
        Qwen2-VL-7B: a good balance of quality and speed.
        On an A100 40GB: ~0.8 s per image.
        LLaVA-Next-34B: better quality, but needs an H100 or multiple GPUs.
        """
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = AutoModelForVision2Seq.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            attn_implementation='flash_attention_2',
            device_map='auto'
        )

    @torch.no_grad()
    def analyze(self, image: np.ndarray, question: str,
                max_tokens: int = 256) -> str:
        pil_image = Image.fromarray(image)
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": pil_image},
                    {"type": "text", "text": question}
                ]
            }
        ]
        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self.processor(
            text=[text], images=[pil_image], return_tensors='pt'
        ).to('cuda')
        generated = self.model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=False  # greedy decoding for reproducible answers
        )
        # Decode only the newly generated tokens, not the echoed prompt
        return self.processor.decode(
            generated[0][inputs['input_ids'].shape[1]:],
            skip_special_tokens=True
        )
```
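One easy mistake with the `np.ndarray` input above: `Image.fromarray` assumes RGB channel order, while frames read with OpenCV (`cv2.imread`, `cv2.VideoCapture`) arrive as BGR. A minimal sketch of the conversion (the helper name `to_rgb` is my own, not part of the pipeline above):

```python
import numpy as np

def to_rgb(frame: np.ndarray) -> np.ndarray:
    """Convert an OpenCV-style BGR frame to the RGB layout PIL expects.

    Without this swap, colors are silently wrong and VLM answers about
    color-dependent features degrade.
    """
    if frame.ndim == 3 and frame.shape[2] == 3:
        return frame[:, :, ::-1].copy()  # reverse the channel axis
    return frame  # grayscale input: leave as-is

# Example: a pure-blue BGR pixel becomes pure blue in the RGB layout
bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[..., 0] = 255  # blue channel first in BGR
rgb = to_rgb(bgr)
```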
Model Selection: VLM 2024–2025 Comparison
| Model | VRAM | Latency (A100) | MMBench | Application |
|---|---|---|---|---|
| Qwen2-VL-2B | 6GB | 0.3 sec | 74.9 | Edge, mobile devices |
| Qwen2-VL-7B | 16GB | 0.8 sec | 83.0 | Server, production |
| LLaVA-Next-13B | 28GB | 1.4 sec | 79.7 | Server |
| InternVL2-26B | 52GB | 2.8 sec | 88.0 | High precision |
| GPT-4o API | — | 2–5 sec | 87.0+ | Cloud; data leaves your perimeter |
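The latency column translates directly into capacity planning. A back-of-envelope sketch (the function and the 90% utilization assumption are mine, using the table's per-image latencies and assuming sequential, unbatched inference):

```python
def daily_throughput(latency_s: float, utilization: float = 0.9) -> int:
    """Rough images/day for one GPU at a given per-image latency.

    86,400 seconds per day, discounted by a utilization factor to
    account for preprocessing, I/O, and idle gaps.
    """
    return int(86_400 / latency_s * utilization)

# Per-GPU estimates from the latency column:
for name, latency in [("Qwen2-VL-2B", 0.3),
                      ("Qwen2-VL-7B", 0.8),
                      ("InternVL2-26B", 2.8)]:
    print(name, daily_throughput(latency))
```

So a single A100 running Qwen2-VL-7B handles on the order of 100k images/day; batching raises this further.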
Structured output: JSON from an image
For production, structured JSON is more important than free text. We use constrained generation via Outlines or grammar-based decoding:
```python
from pydantic import BaseModel
from typing import Optional
import outlines


class ProductInspectionResult(BaseModel):
    defect_detected: bool
    defect_type: Optional[str]
    defect_location: Optional[str]
    severity: str  # 'none', 'minor', 'major', 'critical'
    confidence: float
    notes: str


class StructuredVLMInspector:
    def __init__(self, model_name: str):
        # NB: the exact Outlines API for vision models varies by version
        self.model = outlines.models.transformers(model_name)
        self.generator = outlines.generate.json(
            self.model, ProductInspectionResult
        )

    def inspect(self, image: Image.Image,
                context: str = '') -> ProductInspectionResult:
        prompt = f"""Inspect this product image for defects.
Context: {context}
Provide structured assessment."""
        return self.generator(prompt, image)
```
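Constrained decoding guarantees schema-valid output, but when falling back to plain generation (e.g., through an API without grammar support) the raw text still needs defensive parsing. A stdlib-only sketch of that fallback path (the names `parse_inspection` and `REQUIRED` are mine, mirroring the Pydantic schema above):

```python
import json

REQUIRED = {"defect_detected", "severity", "confidence"}
SEVERITIES = {"none", "minor", "major", "critical"}

def parse_inspection(raw: str) -> dict:
    """Parse and sanity-check a JSON inspection result from raw model text.

    VLMs sometimes wrap JSON in markdown fences even when asked not to,
    so strip them before parsing; reject payloads missing required keys
    or using an out-of-vocabulary severity label.
    """
    text = raw.strip()
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[4:]
    result = json.loads(text)
    missing = REQUIRED - result.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if result["severity"] not in SEVERITIES:
        raise ValueError(f"bad severity: {result['severity']}")
    return result
```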
Case Study: Automating Data Labeling
A VLM was used to automatically label 50k manufacturing-defect images in place of manual annotation:
- Task: determine the defect type and its coordinates
- Model: Qwen2-VL-7B + structured output (JSON with bbox and class)
- Agreement between automatic and manual labels: 87%
- The remaining 13% went through quick manual correction (30 sec/image vs. 5 min from scratch)
- Result: 50k images labeled in 3 days instead of 6 weeks
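The economics above follow from a simple triage: auto-labels above a confidence threshold are accepted, the rest go to a review queue. A hedged sketch of that arithmetic (function names are mine; the 30 sec / 5 min figures are the case study's):

```python
def triage(results, threshold=0.85):
    """Split auto-labels into accepted vs manual-review queues by confidence."""
    accepted = [r for r in results if r["confidence"] >= threshold]
    review = [r for r in results if r["confidence"] < threshold]
    return accepted, review

def hours_saved(n_total, n_review, review_sec=30, scratch_sec=300):
    """Annotation hours saved: full manual pass vs auto-label + spot correction."""
    manual = n_total * scratch_sec      # every image labeled from scratch
    assisted = n_review * review_sec    # only the review queue needs a human
    return (manual - assisted) / 3600

# Case-study numbers: 13% of 50k images needed correction
print(hours_saved(50_000, 6_500))
```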
VLM limitations
Hallucinations: the model can confidently describe a non-existent defect. This is critical wherever false positives are costly. Mitigations: ensemble with a classical detector, and a confidence threshold based on token logprobs.
Reproducibility: for the same image and question, the answer may vary slightly. For deterministic tasks, use greedy decoding (temperature=0 / do_sample=False) or structured generation.
Image tokenization: Qwen2-VL encodes a 1024×1024 image into 256 visual tokens. Small defects under ~20 px sit at the VLM's effective resolution limit; for such cases a classical detector works better.
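One way to implement the logprob-based threshold mentioned above (with Hugging Face `generate`, per-token scores can be obtained via `return_dict_in_generate=True, output_scores=True`; the gating helpers below are my own sketch, not a library API):

```python
import math

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability: exp(mean logprob), in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def gate(token_logprobs: list[float], threshold: float = 0.7):
    """Route low-confidence VLM answers to a classical detector or human review.

    Returns ("accept" | "escalate", confidence). The 0.7 threshold is an
    assumed starting point to tune on a held-out labeled set.
    """
    conf = sequence_confidence(token_logprobs)
    return ("accept" if conf >= threshold else "escalate"), conf
```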
Typical project timelines:
| Project type | Timeline |
|---|---|
| VLM API integration (GPT-4o/Claude) | 1–3 weeks |
| Self-hosted VLM pipeline | 4–7 weeks |
| Fine-tuning VLM on domain data | 6–12 weeks |