Model Conversion for Edge (TensorFlow Lite Micro, TFLite, Edge TPU)
Three different target platforms, three different conversion pipelines: TFLite Micro for MCUs, TFLite for mobile and single-board computers, and the Edge TPU (Google Coral) for hardware-accelerated inference.
TFLite (mobile / Raspberry Pi / x86 edge)
Standard conversion:
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # dynamic range quantization by default
tflite_model = converter.convert()
Supports INT8, FP16, and dynamic range quantization. Hardware delegates: GPU, NNAPI, Hexagon DSP.
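The quantization modes above can be sketched with the converter API. A minimal example, assuming a tiny stand-in Keras model (substitute your own SavedModel or Keras model):

```python
import tensorflow as tf

# Tiny stand-in model for illustration only.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(2),
])

# Dynamic range quantization: INT8 weights, float activations at runtime.
conv = tf.lite.TFLiteConverter.from_keras_model(model)
conv.optimizations = [tf.lite.Optimize.DEFAULT]
dynamic_tflite = conv.convert()

# FP16 quantization: halves model size, pairs well with the GPU delegate.
conv = tf.lite.TFLiteConverter.from_keras_model(model)
conv.optimizations = [tf.lite.Optimize.DEFAULT]
conv.target_spec.supported_types = [tf.float16]
fp16_tflite = conv.convert()
```

Full INT8 quantization additionally needs a representative dataset for calibration (shown in the Edge TPU section, which requires it).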
TFLite Micro (MCU, <1 MB)
Subset of TFLite operations in a portable C++ runtime (no OS or dynamic allocation required). The model is embedded as a C array:
xxd -i model.tflite > model_data.cc # convert to C array
Supported on STM32, Arduino, ESP32, nRF52840. The operation set is limited, so checking operator compatibility before deployment is mandatory.
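One way to run that compatibility check is the TFLite model analyzer (available in recent TF releases), which lists every operator a converted model uses; the output can then be cross-checked against the ops registered in your TFLM resolver. A sketch with a hypothetical stand-in model:

```python
import tensorflow as tf

# Tiny stand-in model for illustration; replace with your real model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(2),
])
tflite_bytes = tf.lite.TFLiteConverter.from_keras_model(model).convert()

# Prints the model structure and every builtin operator it contains.
tf.lite.experimental.Analyzer.analyze(model_content=tflite_bytes)
```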
Edge TPU (Google Coral)
The Edge TPU requires full INT8 quantization. Only TPU-supported operations execute in hardware; the rest fall back to the CPU:
edgetpu_compiler model_quant.tflite # Google Coral compiler
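The full INT8 conversion that must precede the compiler call can be sketched as follows. The model and the random calibration data are stand-ins (assumptions); in practice the representative dataset must be real input samples, since it determines the quantization ranges:

```python
import numpy as np
import tensorflow as tf

# Stand-in model for illustration only.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(2),
])

def representative_dataset():
    # Calibration samples; random here for illustration, use real data.
    for _ in range(100):
        yield [np.random.rand(1, 4).astype(np.float32)]

conv = tf.lite.TFLiteConverter.from_keras_model(model)
conv.optimizations = [tf.lite.Optimize.DEFAULT]
conv.representative_dataset = representative_dataset
# Force full integer quantization; conversion fails if an op can't be INT8.
conv.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
conv.inference_input_type = tf.int8
conv.inference_output_type = tf.int8
quant_tflite = conv.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(quant_tflite)  # input for edgetpu_compiler
```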
Performance: 4 TOPS for both the Coral USB Accelerator and the Coral PCIe/M.2 modules. Well suited to image classification and object detection.
Limitation: models larger than ~8 MB do not fit entirely in the Edge TPU's on-chip memory, so parameters are streamed in at runtime and the speedup drops. Design for <8 MB for maximum acceleration.