AI Deployment on Intel Movidius / OpenVINO
Intel OpenVINO — toolkit for optimizing and deploying ML models on Intel hardware: CPU (x86), integrated/discrete GPU (Intel Iris Xe/Arc), NPU (the Neural Processing Unit in Core Ultra), and VPU (Intel Movidius). Roughly Intel's counterpart to NVIDIA's TensorRT.
OpenVINO Toolkit
Model conversion → IR (Intermediate Representation): models from TensorFlow, PyTorch (via ONNX or direct conversion), ONNX, and PaddlePaddle are converted to OpenVINO IR (.xml topology + .bin weights). The legacy Model Optimizer (mo) is deprecated in favor of openvino.convert_model / ovc. INT8 calibration via NNCF (the older Post-Training Optimization Tool, POT, is deprecated in favor of NNCF).
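Conceptually, the INT8 calibration step boils down to choosing a scale per tensor from calibration data. A minimal NumPy sketch of symmetric post-training quantization (illustrative of the math only, not the NNCF API):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: scale derived from the calibration max."""
    scale = np.abs(x).max() / 127.0          # map observed range onto [-127, 127]
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale      # recover approximate FP32 values

x = np.array([0.5, -1.27, 0.254], dtype=np.float32)   # stand-in for calibration activations
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
# per-element quantization error is bounded by ~scale/2
```

Real calibration (POT/NNCF) refines this with per-channel scales, percentile-based range clipping, and accuracy-aware tuning, but the scale/round/clip core is the same.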
Inference Engine (OpenVINO Runtime):
from openvino.runtime import Core
import numpy as np
core = Core()
compiled = core.compile_model("model.xml", "NPU")  # device: "CPU", "GPU", "NPU", or "AUTO"
result = compiled(np.zeros((1, 3, 224, 224), dtype=np.float32))  # input shape is model-specific
Intel Neural Processing Unit (NPU)
Intel Core Ultra (Meteor Lake, Arrow Lake) contains embedded NPU:
- Core Ultra 5/7 125H: ~10 TOPS NPU
- Core Ultra 9 185H: ~11 TOPS NPU
- Core Ultra 200V: ~48 TOPS NPU
Ideal for: always-on AI tasks (face detection, keyword spotting) with minimal power.
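These TOPS figures give a rough capacity ceiling: given a model's per-inference compute in GOPs, the rating bounds throughput from above. A back-of-envelope sketch (the 20% utilization and 8 GOP model size below are illustrative assumptions, not measured numbers):

```python
def max_inferences_per_sec(npu_tops: float, model_gops: float, utilization: float = 0.2) -> float:
    """Theoretical throughput ceiling: sustained ops/s divided by ops per inference."""
    effective_ops = npu_tops * 1e12 * utilization   # sustained ops/s at assumed utilization
    return effective_ops / (model_gops * 1e9)

# e.g. a hypothetical ~8 GOP detection model on a ~10 TOPS NPU at 20% utilization
print(round(max_inferences_per_sec(10, 8)))  # -> 250
```

Real throughput depends heavily on memory bandwidth, operator coverage on the NPU, and precision, so treat this only as an upper bound for feasibility checks.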
Intel Movidius VPU
Myriad X (in the Intel Neural Compute Stick 2, now discontinued): ~4 TOPS total compute (roughly 1 TOPS dedicated to DNN inference), USB-connected. Competitor to the Google Coral USB Accelerator. Note: OpenVINO 2022.3 LTS was the last release line with Myriad/NCS2 support.
Application
Edge servers on Intel Xeon, industrial PCs on Core i5/i7, edge gateways with Intel Atom. OpenVINO Model Server (OVMS) for production serving, exposing gRPC and REST APIs compatible with the TensorFlow Serving and KServe protocols.
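OVMS's REST endpoint accepts KServe v2-style JSON. A small helper that builds such a request body (the input name, tensor shape, and endpoint URL below are placeholders; the real layout depends on the deployed model):

```python
import json

def build_kserve_request(input_name: str, data: list, shape, dtype: str = "FP32") -> str:
    """Build a KServe v2 inference request body, as accepted by OVMS's REST API."""
    payload = {
        "inputs": [{
            "name": input_name,      # must match the model's input tensor name
            "shape": list(shape),
            "datatype": dtype,       # KServe v2 datatype string, e.g. "FP32", "INT8"
            "data": data,            # flattened tensor values
        }]
    }
    return json.dumps(payload)

# POST this to http://<ovms-host>:8000/v2/models/<model_name>/infer (placeholder URL)
body = build_kserve_request("input0", [0.1, 0.2, 0.3, 0.4], (1, 4))
```

For higher-throughput clients, the gRPC API with the same KServe v2 protocol avoids the JSON serialization overhead.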