On-device ML model ONNX Runtime integration for mobile app

TRUETECH develops, supports, and maintains iOS, Android, and PWA mobile applications. We have extensive experience and expertise in publishing mobile applications on popular marketplaces such as Google Play, the App Store, Amazon Appstore, AppGallery, and others.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators.
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems.
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms.

These are just some of the types of mobile applications we work with; each has its own features and functionality tailored to the client's specific needs and goals.

On-Device ML Model Integration (ONNX Runtime) for Mobile App

ONNX Runtime Mobile offers a compelling advantage: one model for both platforms. After exporting a PyTorch model to ONNX, you add onnxruntime-android and onnxruntime-objc to the respective projects and run the same .onnx file on both. In practice, differences in execution providers between iOS and Android still require platform-specific code, but the model itself remains unified.

Preparing Models for Mobile

Standard ONNX export from PyTorch:

import torch
import onnx
from onnxsim import simplify  # onnx-simplifier for graph optimization

model = MyModel()  # your torch.nn.Module
model.eval()
dummy = torch.zeros(1, 3, 224, 224)

torch.onnx.export(
    model, dummy, "model.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}
)

# Graph simplification: removes redundant reshape/transpose operations and cleans up the graph
model_onnx = onnx.load("model.onnx")
model_simplified, check = simplify(model_onnx)
assert check, "simplified model failed validation"
onnx.save(model_simplified, "model_simplified.onnx")
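
Before relying on the exported file, it is worth checking that ONNX Runtime reproduces the PyTorch outputs. A minimal sketch, reusing the model and dummy tensor from the export above:

import numpy as np
import onnxruntime as ort

# Run the simplified model on the CPU provider and compare against PyTorch
sess = ort.InferenceSession("model_simplified.onnx", providers=["CPUExecutionProvider"])

with torch.no_grad():
    torch_out = model(dummy).numpy()

ort_out = sess.run(None, {"input": dummy.numpy()})[0]

# Tolerances loose enough to absorb minor operator-level numeric differences
np.testing.assert_allclose(torch_out, ort_out, rtol=1e-3, atol=1e-5)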

For mobile deployment, add quantization via onnxruntime.quantization:

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "model_simplified.onnx",
    "model_int8.onnx",
    weight_type=QuantType.QInt8
)
# Dynamic quantization stores weights as INT8, roughly 4x smaller than FP32
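
The size win is easy to verify directly; a short sketch assuming the two files produced above:

import os

fp32_mb = os.path.getsize("model_simplified.onnx") / 1e6
int8_mb = os.path.getsize("model_int8.onnx") / 1e6
print(f"FP32: {fp32_mb:.1f} MB, INT8: {int8_mb:.1f} MB ({fp32_mb / int8_mb:.1f}x smaller)")

Accuracy should be re-checked as well: rerun the PyTorch parity test above against model_int8.onnx with looser tolerances, since INT8 weights shift the outputs slightly.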

Android: Integration and Execution

// build.gradle
implementation("com.microsoft.onnxruntime:onnxruntime-android:1.18.0")

// Creating a session
val sessionOptions = OrtSession.SessionOptions().apply {
    // NNAPI Execution Provider for Android NPU/DSP; flags go in a java.util.EnumSet
    addNnapi(EnumSet.of(NNAPIFlags.USE_FP16))  // FP16 mode in NNAPI
    // Or: addXnnpack(emptyMap()) for XNNPACK (CPU SIMD)
    setOptimizationLevel(OrtSession.SessionOptions.OptLevel.ALL_OPT)
    setIntraOpNumThreads(4)
}

val env = OrtEnvironment.getEnvironment()
val session = env.createSession(
    context.assets.open("model_simplified.onnx").readBytes(),
    sessionOptions
)

// Inference
val inputTensor = OnnxTensor.createTensor(
    env,
    FloatBuffer.wrap(preprocessedArray),
    longArrayOf(1, 3, 224, 224)
)

val results = session.run(mapOf("input" to inputTensor))
// Result.get(name) returns Optional<OnnxValue>
val outputArray = (results.get("output").get().value as Array<FloatArray>)[0]

// Release resources — mandatory
inputTensor.close()
results.close()

Leaks from unclosed OnnxTensor and OrtSession.Result are a common problem. In Kotlin, lean on use {} blocks, for example session.run(mapOf("input" to inputTensor)).use { results -> ... }, so the Result is closed even if reading the output throws.

iOS: Objective-C/Swift Integration

// Package.swift or Podfile: pod 'onnxruntime-objc'
import onnxruntime_objc

// Configuration
let env = try ORTEnv(loggingLevel: ORTLoggingLevel.warning)
let options = try ORTSessionOptions()
try options.setIntraOpNumThreads(4)
// On iOS — CoreML Execution Provider (configured via an options object)
let coreMLOptions = ORTCoreMLExecutionProviderOptions()
coreMLOptions.enableOnSubgraphs = true
try options.appendCoreMLExecutionProvider(with: coreMLOptions)

let session = try ORTSession(
    env: env,
    modelPath: Bundle.main.path(forResource: "model_simplified", ofType: "onnx")!,
    sessionOptions: options
)

// Input preparation
let inputShape: [NSNumber] = [1, 3, 224, 224]
let inputData = Data(bytes: preprocessedFloats, count: preprocessedFloats.count * MemoryLayout<Float>.size)
let inputTensor = try ORTValue(
    tensorData: NSMutableData(data: inputData),
    elementType: .float,
    shape: inputShape
)

let outputs = try session.run(
    withInputs: ["input": inputTensor],
    outputNames: ["output"],
    runOptions: nil
)

let outputTensor = outputs["output"]!
let outputData = try outputTensor.tensorData() as Data
let floats = outputData.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }

appendCoreMLExecutionProvider on iOS 13+ delegates supported operations to Core ML, enabling access to the Apple Neural Engine (ANE). Operations Core ML does not support automatically execute on the CPU. While not as fast as a native Core ML model with full ANE acceleration, this is convenient for rapid cross-platform deployment.

When ONNX Runtime Outperforms Native Formats

Yes: prototyping, cross-platform models, models with non-standard operations that coremltools cannot convert, frequent model updates without pipeline rebuilds.

No: maximum single-platform performance. Native Core ML on iOS with full ANE acceleration is typically 20–40% faster than ORT+CoreML EP. TFLite with GPU Delegate on Android sometimes outperforms ORT+NNAPI. If the model deploys on only one platform and performance is critical, use native formats.

Debugging Incompatible Operations

# Check how well the model maps to the NNAPI and CoreML Execution Providers
python -m onnxruntime.tools.check_onnx_model_mobile_usability model.onnx

# Unsupported operations fall back to CPU execution
# This is not a crash, but the resulting graph partitioning may erase NNAPI acceleration gains

For bottleneck identification, the ORT profiling API logs execution time per operator. Enable it via options.enableProfiling("ort_profile") on the session options; each run then produces a JSON trace viewable in Chrome's chrome://tracing.
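
The same profiler is exposed through the Python API, which makes it convenient to take a first pass on a workstation before debugging device builds. A minimal sketch, assuming model_simplified.onnx and the 1x3x224x224 input from earlier:

import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.enable_profiling = True  # write a chrome://tracing-compatible JSON trace

sess = ort.InferenceSession(
    "model_simplified.onnx", sess_options=opts, providers=["CPUExecutionProvider"]
)
sess.run(None, {"input": np.zeros((1, 3, 224, 224), dtype=np.float32)})

profile_path = sess.end_profiling()  # e.g. onnxruntime_profile_<timestamp>.json
print(f"Trace written to {profile_path}")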

Process

ONNX graph export and simplification → quantization → iOS and Android integration with appropriate execution providers → profiling and native format comparison → production runtime selection.

Timeline Estimates

Basic cross-platform ONNX Runtime integration — 2–3 weeks. With execution provider optimization, profiling, and multi-device testing — 4–6 weeks.