On-Device ML Model Integration (ONNX Runtime) for Mobile App
ONNX Runtime Mobile offers a compelling advantage: one model for both platforms. After converting a PyTorch model to ONNX, add the onnxruntime-android and onnxruntime-objc dependencies and run the same .onnx file on both. In practice, differences in execution providers between iOS and Android still require platform-specific code, but the model itself stays unified.
Preparing Models for Mobile
Standard ONNX export from PyTorch:
import torch
import onnx
from onnxsim import simplify  # onnx-simplifier for graph optimization
model = MyModel()
model.eval()
dummy = torch.zeros(1, 3, 224, 224)
torch.onnx.export(
    model, dummy, "model.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}
)
# Graph simplification: removes redundant Reshape/Transpose ops and cleans up the graph
model_onnx = onnx.load("model.onnx")
model_simplified, check = simplify(model_onnx)
assert check, "simplified model failed validation"
onnx.save(model_simplified, "model_simplified.onnx")
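Before relying on the simplified graph, it is worth confirming it still matches the original numerically. A minimal sketch, assuming the file names above and an arbitrary 1e-5 tolerance:
import numpy as np
import onnxruntime as ort
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
ref = ort.InferenceSession("model.onnx").run(None, {"input": x})[0]
simp = ort.InferenceSession("model_simplified.onnx").run(None, {"input": x})[0]
# Simplification should be semantics-preserving; fail loudly if it is not
assert np.allclose(ref, simp, atol=1e-5), "outputs diverge after simplification"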
For mobile deployment, add quantization via onnxruntime.quantization:
from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic(
    "model_simplified.onnx",
    "model_int8.onnx",
    weight_type=QuantType.QInt8
)
# Model size shrinks roughly 4x compared to FP32 (weights stored as INT8)
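Dynamic quantization is lossy, so check both the size win and the output drift before shipping. A hedged sketch along the same lines (a random input is only a smoke test; measure real accuracy on validation data):
import os
import numpy as np
import onnxruntime as ort
print("FP32 MB:", os.path.getsize("model_simplified.onnx") / 1e6)
print("INT8 MB:", os.path.getsize("model_int8.onnx") / 1e6)
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
fp32 = ort.InferenceSession("model_simplified.onnx").run(None, {"input": x})[0]
int8 = ort.InferenceSession("model_int8.onnx").run(None, {"input": x})[0]
# Quantization drift: small values are expected, large ones mean trouble
print("max abs diff:", np.abs(fp32 - int8).max())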
Android: Integration and Execution
// build.gradle
implementation("com.microsoft.onnxruntime:onnxruntime-android:1.18.0")
import ai.onnxruntime.*
import ai.onnxruntime.providers.NNAPIFlags
import java.nio.FloatBuffer
import java.util.EnumSet
// Creating a session
val sessionOptions = OrtSession.SessionOptions().apply {
    // NNAPI Execution Provider for Android NPU/DSP; flags are passed as an EnumSet
    addNnapi(EnumSet.of(NNAPIFlags.USE_FP16)) // FP16 mode in NNAPI
    // Or: addXnnpack(emptyMap()) for XNNPACK (CPU SIMD)
    setOptimizationLevel(OrtSession.SessionOptions.OptLevel.ALL_OPT)
    setIntraOpNumThreads(4)
}
val env = OrtEnvironment.getEnvironment()
val session = env.createSession(
    // use {} closes the asset stream after reading
    context.assets.open("model_simplified.onnx").use { it.readBytes() },
    sessionOptions
)
// Inference
val inputTensor = OnnxTensor.createTensor(
    env,
    FloatBuffer.wrap(preprocessedArray),
    longArrayOf(1, 3, 224, 224)
)
val results = session.run(mapOf("input" to inputTensor))
// Result.get(name) returns Optional<OnnxValue>
val outputArray = (results.get("output").get().value as Array<FloatArray>)[0]
// Releasing native resources is mandatory
inputTensor.close()
results.close()
Native-memory leaks from unclosed OnnxTensor and OrtSession.Result objects are a common issue. In Kotlin, prefer use {} blocks: results.use { ... }.
iOS: Objective-C/Swift Integration
// Podfile: pod 'onnxruntime-objc' (also available via Swift Package Manager)
import onnxruntime_objc
// Configuration
let env = try ORTEnv(loggingLevel: .warning)
let options = try ORTSessionOptions()
try options.setIntraOpNumThreads(4)
// On iOS: CoreML Execution Provider, configured via an options object
let coreMLOptions = ORTCoreMLExecutionProviderOptions()
coreMLOptions.enableOnSubgraphs = true
try options.appendCoreMLExecutionProvider(with: coreMLOptions)
let session = try ORTSession(
    env: env,
    modelPath: Bundle.main.path(forResource: "model_simplified", ofType: "onnx")!,
    sessionOptions: options
)
// Input preparation
let inputShape: [NSNumber] = [1, 3, 224, 224]
let inputData = Data(
    bytes: preprocessedFloats,
    count: preprocessedFloats.count * MemoryLayout<Float>.size
)
let inputTensor = try ORTValue(
    tensorData: NSMutableData(data: inputData),
    elementType: .float,
    shape: inputShape
)
let outputs = try session.run(
    withInputs: ["input": inputTensor],
    outputNames: ["output"],
    runOptions: nil
)
let outputTensor = outputs["output"]!
let outputData = try outputTensor.tensorData() as Data
let floats = outputData.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }
The CoreML Execution Provider on iOS 13+ delegates supported operations to Core ML, giving access to the Apple Neural Engine (ANE). Operations Core ML does not support automatically fall back to the CPU. It is not as fast as a native Core ML model with full ANE acceleration, but it is convenient for rapid cross-platform deployment.
When ONNX Runtime Outperforms Native Formats
Yes: prototyping, cross-platform models, models with non-standard operations that coremltools cannot convert, frequent model updates without pipeline rebuilds.
No: maximum single-platform performance. Native Core ML on iOS with full ANE acceleration is typically 20–40% faster than ORT+CoreML EP. TFLite with GPU Delegate on Android sometimes outperforms ORT+NNAPI. If the model deploys on only one platform and performance is critical, use native formats.
Debugging Incompatible Operations
# Check which operators the mobile execution providers (NNAPI, CoreML) can take
python -m onnxruntime.tools.check_onnx_model_mobile_usability model.onnx
# Unsupported operators fall back to CPU execution:
# not a crash, but it can erase the NNAPI acceleration gains
For bottleneck identification, the ORT profiling API logs execution time per operator. In the Java/Kotlin API, enable it with sessionOptions.enableProfiling("ort_profile"); the resulting JSON trace opens in Chrome's chrome://tracing.
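The same profiler is available off-device through the Python API, which is handy for quick iteration. A minimal sketch, assuming the model file from the steps above:
import numpy as np
import onnxruntime as ort
opts = ort.SessionOptions()
opts.enable_profiling = True
opts.profile_file_prefix = "ort_profile"  # prefix for the generated trace file
session = ort.InferenceSession("model_simplified.onnx", sess_options=opts)
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
session.run(None, {"input": x})
# end_profiling() stops the profiler and returns the JSON trace path
print(session.end_profiling())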
Process
ONNX graph export and simplification → quantization → iOS and Android integration with appropriate execution providers → profiling and native format comparison → production runtime selection.
Timeline Estimates
Basic cross-platform ONNX Runtime integration — 2–3 weeks. With execution provider optimization, profiling, and multi-device testing — 4–6 weeks.