AI Model Development for Microcontrollers (TinyML)
Developing an ML model for an MCU is primarily an architectural task: the model must be designed around the resource constraints from the start, not compressed as an afterthought once training is done.
Design Under Constraints
Model Footprint Budget: RAM holds the activation buffers at inference time; Flash holds the model weights. Typical budget on an STM32H7 (1 MB RAM, 2 MB Flash): model weights ≤ 300 KB of Flash, peak activations ≤ 100 KB of RAM.
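The budget above can be sanity-checked before training. A minimal sketch, assuming INT8 weights (1 byte/parameter) and a single reusable activation arena whose peak is roughly the largest input+output pair of adjacent layers; the parameter count and per-layer activation sizes are illustrative placeholders, not a real model:

```python
FLASH_BUDGET_KB = 300   # weights budget carved out of the 2 MB Flash
RAM_BUDGET_KB = 100     # peak activation budget carved out of 1 MB RAM

def flash_kb(num_params, bytes_per_weight=1):
    """Weight storage: INT8 quantization -> 1 byte per parameter."""
    return num_params * bytes_per_weight / 1024

def peak_activation_kb(layer_activation_counts, bytes_per_act=1):
    """Runtimes reuse one tensor arena; peak is roughly the largest
    simultaneous input+output pair of adjacent layers."""
    pairs = zip(layer_activation_counts, layer_activation_counts[1:])
    return max(a + b for a, b in pairs) * bytes_per_act / 1024

params = 250_000                          # hypothetical parameter count
acts = [16384, 32768, 16384, 4096, 10]    # hypothetical per-layer activations

print(flash_kb(params) <= FLASH_BUDGET_KB)        # fits the Flash budget?
print(peak_activation_kb(acts) <= RAM_BUDGET_KB)  # fits the RAM budget?
```

Real arena sizes also include scratch buffers and alignment padding, so treat this as a lower bound and leave headroom.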
Architecture Design:
- MobileNetV3-Small: ~2.5 MB in FP32; INT8 quantization brings it down to ~600 KB
- MCUNet: designed specifically for MCUs, fits within 1 MB of Flash
- EfficientNet-Lite0: good balance for vision
- DS-CNN: depthwise separable CNN, classic for audio
- 1D CNN for time series: 50–200 KB for simple tasks
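Most of the architectures above lean on depthwise separable convolutions (the "DS" in DS-CNN): a k×k depthwise pass per input channel followed by a 1×1 pointwise mix. A quick parameter count shows why this matters on an MCU (layer sizes are arbitrary examples):

```python
def standard_conv_params(k, c_in, c_out):
    """Standard convolution: every output channel sees every input channel."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k*k per input channel, then 1x1 pointwise mixing."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)          # 73728 parameters
dws = depthwise_separable_params(k, c_in, c_out)    # 8768 parameters
print(std // dws)                                   # ~8x fewer parameters
```

The saving grows with channel count, which is why the factorization dominates MCU-oriented vision and audio models.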
Neural Architecture Search (NAS) for MCUs: Once-for-All, ProxylessNAS — these search for the best-performing architecture under a specific hardware constraint (Flash, RAM, latency).
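The core loop of constrained NAS is simple to illustrate. A toy sketch, assuming an INT8 MLP whose size is just the product of adjacent layer widths, random search in place of a real search strategy, and a stand-in `proxy_accuracy` where a trained predictor would normally go — none of this is from Once-for-All or ProxylessNAS themselves:

```python
import random

FLASH_LIMIT = 300 * 1024        # assumed weight budget in bytes

def model_size_bytes(widths):
    """INT8 MLP sketch: parameters = sum of products of adjacent widths."""
    dims = [64] + widths + [10]  # fixed input/output dims, hypothetical
    return sum(a * b for a, b in zip(dims, dims[1:]))

def proxy_accuracy(widths):
    """Stand-in for an accuracy predictor: wider scores 'better'."""
    return sum(widths)

random.seed(0)
best = None
for _ in range(500):
    widths = [random.choice([64, 128, 256, 512]) for _ in range(3)]
    if model_size_bytes(widths) > FLASH_LIMIT:
        continue                 # reject candidates that blow the budget
    if best is None or proxy_accuracy(widths) > proxy_accuracy(best):
        best = widths

print(best, model_size_bytes(best) <= FLASH_LIMIT)
```

Real systems replace random search with evolutionary or gradient-based search and the proxy with measured or predicted accuracy, but the reject-over-budget structure is the same.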
Training and Optimization
Quantization-Aware Training (QAT): training with simulated INT8/INT4 quantization in the forward pass, so the network learns to tolerate the rounding error. Typically recovers 2–4 percentage points of accuracy over post-training quantization (PTQ).
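The simulation step is a quantize-dequantize ("fake quant") round trip. A minimal sketch for symmetric per-tensor INT8, with an illustrative weight vector; real QAT also passes gradients straight through the rounding:

```python
def fake_quant(x, num_bits=8):
    """Quantize to the integer grid, then dequantize back to float,
    so downstream computation sees the rounding error during training."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for INT8
    scale = max(abs(v) for v in x) / qmax or 1.0   # symmetric per-tensor scale
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in x]
    return [qi * scale for qi in q]

w = [0.8, -0.33, 0.057, -1.0]      # hypothetical weight values
print(fake_quant(w))               # close to w, snapped to the INT8 grid
```

Each value lands within half a quantization step of the original, which is exactly the error the network learns to absorb.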
Knowledge Distillation: a small student model is trained on soft labels produced by a large teacher. The student typically reaches 90–95% of the teacher's quality at 5–10% of its size.
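The "soft labels" are temperature-softened teacher probabilities, and the distillation term is a KL divergence between teacher and student distributions. A minimal sketch in the style of Hinton et al., with made-up logits; a full loss would mix this with the ordinary cross-entropy on hard labels:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax: higher T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so the gradient magnitude stays comparable across temperatures."""
    p = softmax(teacher_logits, T)     # soft labels from the teacher
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl

t = [6.0, 1.0, -2.0]                   # hypothetical teacher logits
s = [4.0, 2.0, -1.0]                   # hypothetical student logits
print(distillation_loss(s, t) >= 0.0)  # KL divergence is non-negative
```

The temperature exposes the teacher's relative confidence across wrong classes ("dark knowledge"), which is the extra signal the small student benefits from.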
Pruning: structured pruning (removing entire filters or channels) gives a deployment-friendly size reduction — the remaining layers stay dense, so the MCU runtime needs no sparse-kernel support.
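A common criterion is to drop the filters with the smallest L1 norm. A minimal sketch on hand-written 2×2 filters (the filters, their count, and the keep ratio are all illustrative); in practice pruning is iterated with fine-tuning to recover accuracy:

```python
def l1_norm(filt):
    """Sum of absolute weights in one filter."""
    return sum(abs(w) for row in filt for w in row)

def prune_filters(filters, keep_ratio=0.5):
    """Structured pruning: keep only the filters with the largest
    L1 norms; the result is a smaller but still dense layer."""
    keep = max(1, int(len(filters) * keep_ratio))
    ranked = sorted(filters, key=l1_norm, reverse=True)
    return ranked[:keep]

# Four hypothetical 2x2 filters; the two near-zero ones get dropped.
filters = [[[0.9, -0.8], [0.7, 0.6]],
           [[0.01, 0.02], [-0.01, 0.0]],
           [[-0.5, 0.4], [0.6, -0.7]],
           [[0.0, 0.03], [0.02, -0.01]]]
pruned = prune_filters(filters, keep_ratio=0.5)
print(len(pruned))  # 2 filters remain
```

Dropping a filter also shrinks the next layer's input channels, which is where much of the real saving comes from.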
Tools
- Edge Impulse: end-to-end pipeline from data collection to MCU deployment.
- STM32Cube.AI: converts and optimizes trained networks for STM32 MCUs (including parts with ST's Neural-ART accelerator).
- TensorFlow Lite for Microcontrollers (TFLite Micro): interpreter for running quantized models on bare-metal targets.