Edge AI: Deploying Models on Devices Without the Cloud
A model that works perfectly on a server with an A100 behaves very differently on a Jetson Orin or a phone: latency jumps to 4 seconds, the battery dies in an hour, the model crashes with OOM. The gap between research code and edge deployment is a separate engineering discipline.
Why a Simple "Export the Model" Doesn't Work
A PyTorch model trained with float32 weights and batch_size=32 is not ready for the edge. Common first-deployment surprises:
- ResNet-50 fp32 weighs 98 MB and takes 380 ms per inference on a Cortex-A78. After INT8 quantization via `torch.ao.quantization`: 24 MB and 95 ms. After ONNX export and TensorRT compilation on a Jetson: 28 ms.
- YOLOv8m on a Raspberry Pi 5 manages 2.8 fps in fp32. After TFLite INT8: 9.4 fps. With the `XNNPACK` delegate: 14 fps.
- A transformer encoder for NLP on a mobile CPU: `MobileBERT` fp16 via CoreML on an iPhone 15 takes 18 ms per inference; `distilbert-base-uncased` via ONNX takes 42 ms. That difference is meaningful for real-time use.
The question is not "to quantize or not"; there is no single answer. The correct path is determined by the target device, the task, and the acceptable metric degradation.
Quantization: What Really Works
There are three types of quantization, and they differ greatly in the effort required.
PTQ (Post-Training Quantization) is the fastest path. Take the trained model, run it through a calibration dataset (200–1000 examples), and get INT8 or INT4 weights. Tools: `torch.ao.quantization`, ONNX Runtime, `bitsandbytes` for LLMs. Typical degradation is 0.5–2% on classification. The red zone is small-object detection and segmentation, where PTQ can cost 4–8% mAP.
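The eager-mode PTQ flow in `torch.ao.quantization` looks roughly like this; the toy model and the random calibration batches are placeholders for a real network and a real calibration set:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qconfig, prepare, convert

# Toy model standing in for a trained network (hypothetical)
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.relu = nn.ReLU()
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model = TinyNet().eval()
model.qconfig = get_default_qconfig("fbgemm")  # x86 backend
prepared = prepare(model)                      # inserts observers

# Calibration: run representative batches through the observed model
for _ in range(20):
    prepared(torch.randn(4, 3, 32, 32))

quantized = convert(prepared)                  # swaps in INT8 kernels
```

`prepare()` inserts observers that record activation ranges during calibration; `convert()` then replaces the float modules with INT8 implementations using those ranges.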
QAT (Quantization-Aware Training) trains with simulated quantization noise. It is costlier, but the degradation is minimal: 0.1–0.5%. Worthwhile when PTQ results are unacceptable. In PyTorch: `torch.ao.quantization.prepare_qat()`.
GPTQ and AWQ are specialized for LLMs. AWQ (Activation-aware Weight Quantization) generally preserves quality at 4-bit better than GPTQ. The main libraries are `llm-compressor` and `autoawq`.
Pruning and Distillation
Structured pruning removes entire channels or layers. `torch.nn.utils.prune` is the baseline; for transformers, prune attention heads. Example result: ResNet-50 after removing 40% of channels with fine-tuning loses 35% of its size and 28% of its latency at a cost of 1.2% top-1 accuracy.
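A minimal structured-pruning sketch with `torch.nn.utils.prune`, removing whole output channels of a convolution by L2 norm; the layer and the 40% ratio are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(16, 32, 3)  # toy layer standing in for a real network

# Structured pruning: zero out 40% of output channels (dim=0) by L2 norm (n=2)
prune.ln_structured(conv, name="weight", amount=0.4, n=2, dim=0)

# Whole filters are now zero; count them via per-channel absolute sums
channel_norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
print(int((channel_norms == 0).sum()))

# Make the pruning permanent (removes the mask reparametrization)
prune.remove(conv, "weight")
```

Note that pruning alone only zeroes weights; the latency win comes from physically removing the dead channels (e.g. rebuilding the layer with fewer filters) and then fine-tuning.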
Knowledge distillation trains a small student model to imitate a large teacher. The classic recipe uses `KLDivLoss` on soft labels; feature distillation on intermediate layers is often more effective. Hugging Face's DistilBERT is the canonical example: 66M vs 110M parameters, 40% lower latency, about 3% lost on the GLUE benchmark.
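The classic soft-label loss is a weighted sum of a temperature-scaled KL term and the usual cross-entropy. A sketch; the temperature and mixing weight below are common defaults, not canonical values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft-label KL term + hard-label CE term (classic logit distillation)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient magnitudes comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Usage with random stand-ins for student/teacher outputs
s = torch.randn(8, 10, requires_grad=True)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
loss = distillation_loss(s, t, y)
loss.backward()
```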
The combined approach works best: distillation → pruning → QAT. It increases development time but gives the maximum effect on constrained hardware.
Target Platforms and Tools
| Platform | Preferred Format | Tool | Specifics |
|---|---|---|---|
| NVIDIA Jetson | TensorRT engine | `trtexec`, `torch2trt` | INT8 with calibration, DLA offload |
| Apple Silicon / iOS | CoreML (`.mlmodel`) | `coremltools` | ANE (Neural Engine) used automatically |
| Android | TFLite (`.tflite`) | `tf.lite.TFLiteConverter` | GPU delegate, NNAPI |
| x86 CPU | ONNX + ORT | `onnxruntime` | AVX-512, VNNI instructions |
| Arm Cortex | TFLite / ONNX | `ort-arm`, `tflite` | XNNPACK, NEON |
| Qualcomm NPU | QNN (`.dlc`) | Qualcomm AI Hub | Hexagon DSP |
TensorRT is the main Jetson tool, and it is more than an export step: TRT rebuilds the graph with operator fusion and selects optimal kernels for the specific architecture. YOLOv8m on a Jetson AGX Orin under TRT INT8 gives 78 fps vs 22 fps for fp16 PyTorch.
CoreML Tools automatically routes computation to the ANE (Apple Neural Engine) when the model structure allows it. Not all ops are supported; custom layers fall back to the CPU. Profiling in Xcode Instruments shows the bottlenecks.
Practical Case: In-Line Defect Detection
Task: real-time scratch detection on metal parts at 30 fps on a Jetson Xavier NX (16 GB).
Starting point: YOLOv8l trained on 12,000 annotated images, mAP50 0.91, server inference at 28 ms/frame. On the Jetson Xavier NX in fp16: 110 ms/frame (9 fps). Unusable.
Optimization steps:
- Switch to YOLOv8m: mAP50 0.887 (-2.3%), 68 ms on the Jetson
- TensorRT FP16 via `yolo export format=engine half=True`: 31 ms (32 fps)
- INT8 calibration on 500 production frames: 22 ms (45 fps), mAP50 0.879
Result: mAP50 dropped from 0.91 to 0.879 (about 3.4% relative) for a 5× speedup (110 → 22 ms). Acceptable, since an operator manually verifies flagged parts.
Workflow and Timelines
Start with profiling: run the model on the target device, measure per-layer latency, and find the bottlenecks. Without this, optimization is blind.
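On-device numbers come from the vendor profiler (Nsight Systems, Xcode Instruments), but `torch.profiler` already gives an operator-level breakdown on a development machine. A sketch with a toy model standing in for the real one:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Toy model (hypothetical); substitute your actual network
model = nn.Sequential(
    nn.Conv2d(3, 16, 3), nn.ReLU(),
    nn.Conv2d(16, 32, 3), nn.ReLU(),
).eval()
x = torch.randn(1, 3, 224, 224)

# Record per-operator CPU time for one inference pass
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(x)

# Top operators by total CPU time: the first-pass bottleneck list
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```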
Then draw up an optimization plan with an explicit tradeoff assessment: each method buys a specific speedup at a specific quality cost, and you decide what is acceptable.
Timelines: optimizing a ready model for a device takes 2–4 weeks. Developing from scratch to edge requirements (architecture + training + optimization + hardware testing) takes 6–16 weeks.