AI Deployment on Google Coral Edge TPU
Google Coral is a platform for highly efficient ML inference at the edge. Its core is the Edge TPU, a specialized ASIC for INT8 inference that delivers 4 TOPS at roughly 0.5–2 W, making it ideal for battery-powered or otherwise power-constrained applications.
Coral Form Factors
- USB Accelerator: plugs into a Raspberry Pi or x86 host over USB 3.0; plug-and-play
- M.2 Accelerator (A+E key) and Mini PCIe Accelerator: for embedded systems
- Dev Board: NXP i.MX 8M SoC + Edge TPU; a standalone edge computer
- Dev Board Mini: compact version with a MediaTek SoC
Optimal Applications
Image classification: MobileNetV2 runs at roughly 400 FPS on the USB Accelerator. Object detection: MobileNet SSD and EfficientDet-Lite run comfortably in real time. Pose estimation and face detection: pretrained models are available in the Coral Model Zoo.
Model Requirements
The Edge TPU executes only a supported subset of TFLite operations; any unsupported operation falls back to the CPU, which is slow. For full TPU execution: full INT8 quantization; only supported ops (Conv2D, DepthwiseConv2D, FullyConnected, ReLU, etc.); and model parameters that fit the roughly 8 MB of on-chip SRAM, otherwise weights stream from the host and acceleration drops.
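The full-INT8 requirement above can be met with TensorFlow Lite post-training quantization. A minimal sketch, assuming a trained Keras model is available; the tiny network and random calibration data here are placeholders for a real model and a few hundred real input samples:

```python
import numpy as np
import tensorflow as tf

# Placeholder for a real trained network (assumption for illustration).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

def representative_dataset():
    # Calibration samples used to pick quantization ranges;
    # use a few hundred real inputs in practice.
    for _ in range(100):
        yield [np.random.rand(1, 32, 32, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Reject any op that cannot be expressed in INT8 (required for the Edge TPU).
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model = converter.convert()
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
```

With `TFLITE_BUILTINS_INT8` as the only supported op set, conversion fails loudly if any layer cannot be quantized, which is exactly what you want before handing the model to the Edge TPU compiler.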
Workflow
- Train the model (TensorFlow / Keras)
- Apply post-training INT8 quantization with a representative dataset
- Compile for the Edge TPU: `edgetpu_compiler model_quant.tflite`
- Deploy and run inference with the PyCoral API
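The deploy step can be sketched with the PyCoral classification helpers. This assumes the compiler produced `model_quant_edgetpu.tflite`, that an Edge TPU device is attached, and that `image.jpg` is a placeholder input:

```python
from PIL import Image
from pycoral.adapters import classify, common
from pycoral.utils.edgetpu import make_interpreter

# Load the compiled model onto the Edge TPU (file name is an assumption).
interpreter = make_interpreter("model_quant_edgetpu.tflite")
interpreter.allocate_tensors()

# Resize the input image to the model's expected input size.
image = Image.open("image.jpg").resize(common.input_size(interpreter),
                                       Image.LANCZOS)
common.set_input(interpreter, image)

interpreter.invoke()

# Print the top-3 class ids and scores.
for c in classify.get_classes(interpreter, top_k=3):
    print(c.id, c.score)
```

`make_interpreter` loads the Edge TPU delegate automatically; without an attached accelerator it raises an error rather than silently falling back to the CPU.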
Limitations
The Edge TPU is efficient for standard CNNs. For transformers, RNNs, and other non-standard architectures, a Jetson board or an x86 machine with OpenVINO is a better fit.