Model Conversion to TensorFlow Lite Format for Mobile Devices
TensorFlow Lite — the standard format for on-device ML on Android, iOS, and embedded Linux. Supports hardware acceleration via NNAPI (Android), the GPU delegate, the Hexagon DSP delegate, and the Core ML delegate (Apple).
Conversion Pipeline
TF/Keras → TFLite:
```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_model = converter.convert()
open("model.tflite", "wb").write(tflite_model)
```
PyTorch → ONNX → TFLite:
PyTorch has no direct converter. The usual path is torch.onnx.export → onnx-tf (ONNX → TensorFlow SavedModel) → TFLite. The double conversion can introduce accuracy and operator-compatibility losses — careful output testing against the original model is mandatory.
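A minimal sketch of the two-stage path, assuming torch, onnx, onnx-tf, and tensorflow are installed; `model`, `example_input`, and the file paths are placeholders:

```python
# Sketch of PyTorch -> ONNX -> TF SavedModel -> TFLite. Imports are deferred
# into each step so the recipe can be read (and the functions defined)
# without the heavy packages present.

def export_onnx(model, example_input, onnx_path="model.onnx"):
    """Step 1: trace the PyTorch model and export it to ONNX."""
    import torch
    model.eval()
    torch.onnx.export(model, example_input, onnx_path, opset_version=13)

def onnx_to_savedmodel(onnx_path="model.onnx", saved_model_dir="saved_model"):
    """Step 2: convert the ONNX graph to a TensorFlow SavedModel via onnx-tf."""
    import onnx
    from onnx_tf.backend import prepare
    prepare(onnx.load(onnx_path)).export_graph(saved_model_dir)

def savedmodel_to_tflite(saved_model_dir="saved_model"):
    """Step 3: convert the SavedModel to a TFLite flatbuffer."""
    import tensorflow as tf
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    return converter.convert()
```

After conversion, run the same inputs through the original PyTorch model and the TFLite interpreter and compare outputs to catch losses introduced by either stage.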
Quantization
Post-Training Quantization:
- Dynamic range: weights quantized to INT8, activations remain float. Minimal quality loss
- Full integer: both weights and activations INT8. Requires representative dataset for calibration. Best performance
- Float16: good for GPU delegate
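Full-integer conversion can be sketched as follows, assuming a Keras `model` and a float32 numpy array `calib_data` of sample inputs for calibration:

```python
# Full-integer post-training quantization sketch; assumes tensorflow is
# installed. Wrapped in a function so it reads as a recipe.

def convert_full_int8(model, calib_data):
    import tensorflow as tf

    def representative_dataset():
        # Yield ~100 calibration samples so the converter can estimate
        # activation ranges.
        for sample in calib_data[:100]:
            yield [sample[None, ...]]  # add batch dimension

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    # Fail conversion if any op cannot be expressed in INT8:
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()
```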
Quantization-Aware Training (QAT): training with quantization simulation → better quality at INT8
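What INT8 quantization does numerically can be shown with the standard affine scheme (a scale and a zero point); this is a stdlib-only illustration of the arithmetic, not TFLite's implementation (which adds per-channel scales and its own rounding rules):

```python
# Affine INT8 quantization: real ≈ scale * (q - zero_point).

def quant_params(rmin, rmax, qmin=-128, qmax=127):
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)  # range must include 0
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = round(qmin - rmin / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    return max(qmin, min(qmax, round(x / scale) + zero_point))

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

weights = [-0.91, -0.2, 0.0, 0.37, 1.42]
scale, zp = quant_params(min(weights), max(weights))
q = [quantize(w, scale, zp) for w in weights]
recovered = [dequantize(v, scale, zp) for v in q]
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
# Round-trip error is bounded by half a quantization step (scale / 2),
# which is the "minimal quality loss" the bullet above refers to.
assert max_err <= scale / 2
```

Note that 0.0 maps exactly onto the zero point, so zero-padding and ReLU zeros stay exact after quantization.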
Delegate Selection
| Platform | Delegate | Acceleration (vs CPU) |
|---|---|---|
| Android GPU | GPU Delegate | 3–10x |
| Qualcomm | NNAPI/Hexagon | 5–20x |
| iOS | Core ML Delegate | 5–15x |
| Edge TPU | EdgeTPU Delegate | 100x (INT8 only) |
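Attaching a delegate at inference time follows the same pattern on every platform; a sketch assuming tensorflow is installed and the platform's delegate shared library is present on the device:

```python
# Delegate loading sketch. The library name passed in is platform-specific
# (e.g. "libedgetpu.so.1" for Edge TPU); with no library, the interpreter
# runs on CPU.

def make_interpreter(model_path, delegate_lib=None):
    import tensorflow as tf
    delegates = []
    if delegate_lib:
        delegates.append(tf.lite.experimental.load_delegate(delegate_lib))
    interpreter = tf.lite.Interpreter(model_path=model_path,
                                      experimental_delegates=delegates)
    interpreter.allocate_tensors()
    return interpreter
```

On Android and iOS the GPU and Core ML delegates are usually attached through the platform Interpreter APIs rather than load_delegate; ops a delegate cannot handle fall back to the CPU for that subgraph.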