ML Model Optimization for Edge Device Execution
A model trained on a server GPU with 80 GB of memory will not run on a Raspberry Pi. Edge optimization is a family of techniques that reduce model size and latency while preserving acceptable quality.
Optimization Techniques
Quantization: The most impactful method. Float32 → INT8: 4× size reduction, 2–4× speedup (on hardware with INT8 support). INT4: 8× size reduction; quality loss depends on the task.
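A minimal sketch of the idea behind INT8 quantization, using symmetric per-tensor quantization in plain NumPy (real toolchains like TFLite also support per-channel and asymmetric schemes):

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: w ~= q * scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
# q is 4x smaller than w; worst-case rounding error is scale / 2
err = np.abs(dequantize(q, s) - w).max()
```

Each float32 weight (4 bytes) becomes one int8 (1 byte) plus a single shared scale, which is where the 4× size reduction comes from.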
Post-Training Quantization (PTQ): fast, needs only a small calibration dataset (100–1000 samples). Quantization-Aware Training (QAT): simulates quantization during training; typically 1–3 percentage points more accurate than PTQ.
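What the PTQ calibration pass actually computes can be sketched as follows: run the calibration samples through the network, record the observed activation range, and derive a scale and zero-point from it (a simple min/max observer; production converters also offer percentile or entropy-based calibration). The calibration data here is hypothetical:

```python
import numpy as np

def calibrate_ptq(activations: list) -> tuple:
    """Derive an asymmetric UINT8 scale/zero-point from calibration batches."""
    lo = min(float(a.min()) for a in activations)
    hi = max(float(a.max()) for a in activations)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # range must contain zero exactly
    scale = (hi - lo) / 255.0
    zero_point = int(round(-lo / scale))
    return scale, zero_point

# Hypothetical calibration set: 200 batches of ReLU-style activations.
calib = [np.abs(np.random.randn(32, 128)).astype(np.float32) for _ in range(200)]
scale, zp = calibrate_ptq(calib)
```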
Pruning: Remove insignificant weights. Unstructured pruning reaches 80%+ sparsity but is hard to accelerate on standard hardware. Structured pruning (removing whole filters/attention heads) shrinks the layers themselves, giving direct acceleration on any hardware.
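The contrast between the two pruning styles can be sketched with magnitude-based pruning in NumPy. Note why the difference matters: the unstructured version keeps the tensor shape (sparse kernels are needed for any speedup), while the structured version returns a genuinely smaller tensor:

```python
import numpy as np

def unstructured_prune(w: np.ndarray, sparsity: float = 0.8) -> np.ndarray:
    """Zero the smallest-magnitude weights; shape unchanged."""
    k = int(np.ceil(w.size * sparsity))
    thresh = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) < thresh, 0.0, w)

def structured_prune(w: np.ndarray, keep: float = 0.5) -> np.ndarray:
    """Drop whole output filters (rows) with the smallest L2 norm."""
    norms = np.linalg.norm(w.reshape(w.shape[0], -1), axis=1)
    n_keep = max(1, int(w.shape[0] * keep))
    idx = np.sort(np.argsort(norms)[-n_keep:])
    return w[idx]

w = np.random.randn(64, 128).astype(np.float32)
sparse_w = unstructured_prune(w)   # same shape, 80% zeros
small_w = structured_prune(w)      # half the rows are gone
```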
Knowledge Distillation: A small student model is trained to reproduce the outputs of a large teacher. BERT → TinyBERT: 7.5× smaller, 9.4× faster, retaining ~96% of the teacher's GLUE score.
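The standard distillation objective (Hinton-style, which TinyBERT builds on) mixes a temperature-softened KL term against the teacher with the ordinary cross-entropy against the labels. A NumPy sketch; the temperature and mixing weight are illustrative defaults:

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * hard-label CE."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return alpha * (T * T) * kl.mean() + (1 - alpha) * ce.mean()

student = np.random.randn(16, 10)
teacher = np.random.randn(16, 10)
labels = np.random.randint(0, 10, size=16)
loss = distill_loss(student, teacher, labels)
```

The T² factor keeps the soft-target gradient magnitude comparable to the hard-label term as the temperature changes.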
Neural Architecture Search (NAS): Find the optimal architecture for a target latency/memory budget. MnasNet and MobileNetV3 were found with hardware-aware NAS for mobile devices.
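The core loop of hardware-aware NAS, stripped to a toy: sample architectures from a search space, reject those over the latency budget, keep the best proxy score. Everything here is a placeholder (real NAS measures latency on-device and trains or estimates accuracy):

```python
import random

# Toy search space: depth x width. Real spaces cover kernel sizes,
# expansion ratios, attention heads, etc.
SEARCH_SPACE = {"depth": [2, 4, 8], "width": [32, 64, 128]}

def predicted_latency_ms(cfg):
    # Placeholder cost model, not real numbers.
    return 0.1 * cfg["depth"] * cfg["width"] / 32

def predicted_accuracy(cfg):
    # Placeholder proxy: larger models score higher.
    return 1 - 1 / (cfg["depth"] * cfg["width"])

def random_search(budget_ms=1.0, trials=100, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        if predicted_latency_ms(cfg) > budget_ms:
            continue  # hardware constraint: reject over-budget candidates
        if best is None or predicted_accuracy(cfg) > predicted_accuracy(best):
            best = cfg
    return best

best_cfg = random_search()
```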
Operator Fusion: Merge adjacent operations: Conv + BatchNorm + ReLU executes as a single kernel. Implemented in the TFLite converter, ONNX Runtime, and TensorRT.
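The Conv+BN part of this fusion is pure algebra: at inference time BatchNorm is an affine transform per channel, so it folds into the conv's weights and bias. A NumPy sketch, verified on a 1×1 conv (which is just a matmul):

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma*(conv(x)-mean)/sqrt(var+eps)+beta into one conv."""
    scale = gamma / np.sqrt(var + eps)        # one factor per output channel
    w_fused = w * scale.reshape(-1, 1, 1, 1)  # w: (out_ch, in_ch, kh, kw)
    b_fused = (b - mean) * scale + beta
    return w_fused, b_fused

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 4, 1, 1)); b = rng.standard_normal(8)
gamma, beta = rng.standard_normal(8), rng.standard_normal(8)
mean, var = rng.standard_normal(8), rng.random(8) + 0.5
x = rng.standard_normal(4)                    # one pixel, 4 input channels

conv = w[:, :, 0, 0] @ x + b                  # separate conv then BN
bn_out = gamma * (conv - mean) / np.sqrt(var + 1e-5) + beta
wf, bf = fuse_conv_bn(w, b, gamma, beta, mean, var)
fused_out = wf[:, :, 0, 0] @ x + bf           # single fused conv
```

After folding, the BN layer disappears entirely; fusing ReLU on top is then a matter of applying the activation inside the same kernel.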
Benchmark Approach
Profiling on the target device is the only honest benchmark: RTX 4090 latency says nothing about Jetson Nano latency. Use layer-wise profiling to identify bottlenecks.
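A minimal layer-wise timing harness, assuming the model can be expressed as a list of named callables (the two-layer numpy "model" is hypothetical; dedicated profilers in TFLite, ONNX Runtime, etc. report this per built-in op):

```python
import time
import numpy as np

def profile_layers(layers, x, warmup=3, runs=20):
    """Time each layer on the current device; run this on the target hardware."""
    timings = {}
    for name, fn in layers:
        for _ in range(warmup):   # discard cold-start runs (caches, allocs)
            fn(x)
        start = time.perf_counter()
        for _ in range(runs):
            y = fn(x)
        timings[name] = (time.perf_counter() - start) / runs * 1e3  # ms
        x = y                     # feed this layer's output to the next
    return timings

# Hypothetical two-layer model as plain numpy ops.
w1, w2 = np.random.randn(256, 256), np.random.randn(256, 64)
layers = [("fc1", lambda x: np.maximum(x @ w1, 0)),
          ("fc2", lambda x: x @ w2)]
report = profile_layers(layers, np.random.randn(32, 256))
```

Sorting the resulting dict by value immediately shows which layer to attack first with quantization or pruning.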