Vision Language Model VLM Implementation for Image with Text Analysis

We design and deploy artificial intelligence systems, from prototype to production-ready solution. Our team combines expertise in machine learning, data engineering, and MLOps to make AI work in real business, not just in the lab.

Development of image analysis systems based on VLM

Vision-Language Models (VLMs) pair a visual encoder (ViT, CLIP) with a language model (LLaMA, Mistral, Qwen). The result: a model that understands an image and answers questions about it in natural language. GPT-4o, Claude Sonnet, LLaVA, and Qwen-VL are all VLMs.

The gap between "asking GPT-4o about an image" and "running a VLM pipeline in production" is huge: the API is expensive at scale, 2–5 seconds of latency is unacceptable for real-time use, and your data leaves your infrastructure for the provider.

On-premises VLMs: Deploying on your own hardware

from transformers import AutoProcessor, AutoModelForVision2Seq
import torch
from PIL import Image
import numpy as np

class LocalVLMAnalyzer:
    def __init__(self, model_name: str = 'Qwen/Qwen2-VL-7B-Instruct'):
        """
        Qwen2-VL-7B: отличный баланс качества и скорости.
        На A100 40GB: ~0.8 сек на изображение.
        LLaVA-Next-34B: лучше, но нужна H100 или несколько GPU.
        """
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = AutoModelForVision2Seq.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            attn_implementation='flash_attention_2',
            device_map='auto'
        )

    @torch.no_grad()
    def analyze(self, image: np.ndarray, question: str,
                 max_tokens: int = 256) -> str:
        pil_image = Image.fromarray(image)
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": pil_image},
                    {"type": "text", "text": question}
                ]
            }
        ]

        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self.processor(
            text=[text], images=[pil_image], return_tensors='pt'
        ).to(self.model.device)  # follow device_map placement instead of hardcoding 'cuda'

        generated = self.model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=False
        )
        return self.processor.decode(
            generated[0][inputs['input_ids'].shape[1]:],
            skip_special_tokens=True
        )

Model Selection: VLM 2024–2025 Comparison

Model            VRAM    Latency (A100)   MMBench   Application
Qwen2-VL-2B      6 GB    0.3 s            74.9      Edge, mobile devices
Qwen2-VL-7B      16 GB   0.8 s            83.0      Server, production
LLaVA-Next-13B   28 GB   1.4 s            79.7      Server
InternVL2-26B    52 GB   2.8 s            88.0      High precision
GPT-4o           API     2–5 s            87.0+     Cloud; data goes to the provider

Structured output: JSON from image

For production, structured JSON is more important than free text. We use constrained generation via Outlines or grammar-based decoding:

from pydantic import BaseModel
from typing import Optional
from PIL import Image
from transformers import AutoModelForVision2Seq
import outlines

class ProductInspectionResult(BaseModel):
    defect_detected: bool
    defect_type: Optional[str]
    defect_location: Optional[str]
    severity: str  # 'none', 'minor', 'major', 'critical'
    confidence: float
    notes: str

class StructuredVLMInspector:
    def __init__(self, model_name: str):
        # Image inputs need Outlines' vision wrapper, not the text-only
        # transformers loader (API details vary across Outlines versions).
        self.model = outlines.models.transformers_vision(
            model_name, model_class=AutoModelForVision2Seq
        )
        self.generator = outlines.generate.json(
            self.model, ProductInspectionResult
        )

    def inspect(self, image: Image.Image,
                context: str = '') -> ProductInspectionResult:
        prompt = (
            'Inspect this product image for defects.\n'
            f'Context: {context}\n'
            'Provide a structured assessment.'
        )
        return self.generator(prompt, [image])
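
Even with constrained generation, it is worth a defensive validation pass before results enter downstream systems: JSON can be schema-valid yet internally inconsistent (e.g., a detected defect with no type). A stdlib-only sketch against the `ProductInspectionResult` fields above:

```python
import json

ALLOWED_SEVERITIES = {'none', 'minor', 'major', 'critical'}

def validate_inspection(raw_json: str) -> dict:
    """Parse a VLM inspection result and reject out-of-range or inconsistent values."""
    result = json.loads(raw_json)
    if result['severity'] not in ALLOWED_SEVERITIES:
        raise ValueError(f"unexpected severity: {result['severity']!r}")
    if not 0.0 <= result['confidence'] <= 1.0:
        raise ValueError(f"confidence out of range: {result['confidence']}")
    if result['defect_detected'] and not result.get('defect_type'):
        raise ValueError('defect detected but defect_type is missing')
    return result

ok = validate_inspection(
    '{"defect_detected": true, "defect_type": "scratch",'
    ' "defect_location": "top-left", "severity": "minor",'
    ' "confidence": 0.91, "notes": "shallow scratch"}'
)
print(ok['severity'])  # minor
```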

Case Study: Automating Data Labeling

A VLM was used to automatically label 50k images of manufacturing defects (instead of manual annotation):

  • Task: determine the defect type and its coordinates
  • Model: Qwen2-VL-7B + structured output (JSON with bbox and class)
  • Agreement between automatic and manual labels: 87%
  • The remaining 13% went through quick manual correction (30 s/image vs. 5 min from scratch)
  • Result: 50k images labeled in 3 days instead of 6 weeks
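
The triage step in such a workflow can be sketched as a comparison against a manually labeled sample, routing mismatches into the correction queue (function name and label values here are illustrative):

```python
def split_for_review(auto_labels: dict, manual_labels: dict):
    """Return (agreement_rate, image_ids_needing_manual_correction)."""
    agreed, to_review = [], []
    for image_id, manual in manual_labels.items():
        (agreed if auto_labels.get(image_id) == manual else to_review).append(image_id)
    return len(agreed) / len(manual_labels), to_review

auto = {'img1': 'scratch', 'img2': 'dent', 'img3': 'scratch', 'img4': 'none'}
manual = {'img1': 'scratch', 'img2': 'dent', 'img3': 'dent', 'img4': 'none'}
rate, queue = split_for_review(auto, manual)
print(rate, queue)  # 0.75 ['img3']
```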

VLM limitations

Hallucinations: the model can confidently describe a non-existent defect, which is critical where false positives are costly. Mitigation: ensemble the VLM with a classical detector and apply a confidence threshold derived from token logprobs.
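
The logprob gate can be sketched without a model in the loop: given per-token logprobs (obtainable via `generate(..., output_scores=True, return_dict_in_generate=True)` in transformers), accept the answer only if the geometric-mean token probability clears a threshold. The 0.7 threshold here is an assumption to calibrate on your own data:

```python
import math

def answer_confidence(token_logprobs: list) -> float:
    """Geometric-mean token probability: exp(mean of logprobs)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def gate(token_logprobs: list, threshold: float = 0.7) -> str:
    return 'accept' if answer_confidence(token_logprobs) >= threshold else 'fallback_to_detector'

confident = [-0.05, -0.10, -0.02]  # model was sure of every token
uncertain = [-1.20, -0.90, -2.00]  # flat next-token distributions
print(gate(confident), gate(uncertain))  # accept fallback_to_detector
```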

Reproducibility: for the same image and question, the answer may vary slightly between runs. For deterministic tasks, use greedy decoding (do_sample=False / temperature=0.0) or structured generation.

Image tokenization: Qwen2-VL encodes a 1024×1024 image into roughly 256 visual tokens, so defects smaller than ~20 px sit at the limit of what the VLM can resolve. For such cases, a classical detector is the better tool.
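
One common workaround for small defects is tiling: run detection over overlapping crops so a 20 px defect occupies a larger fraction of each input. A coordinate-only sketch (tile size and overlap are assumptions to tune):

```python
def tile_coords(width: int, height: int, tile: int = 512, overlap: int = 64) -> list:
    """Return (x, y, w, h) boxes covering the image with the given overlap."""
    step = tile - overlap
    boxes = []
    for y in range(0, max(height - overlap, 1), step):
        for x in range(0, max(width - overlap, 1), step):
            boxes.append((x, y, min(tile, width - x), min(tile, height - y)))
    return boxes

print(len(tile_coords(1024, 1024)))  # 9 boxes (edge tiles are clipped smaller)
```

Each box can then be cropped and sent through the VLM or a classical detector, with box coordinates mapped back to the full image.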

Project type                           Timeline
VLM API integration (GPT-4o/Claude)    1–3 weeks
Self-hosted VLM pipeline               4–7 weeks
Fine-tuning a VLM on domain data       6–12 weeks