Implementing Fallback Logic When the AI Service Is Unavailable in a Mobile App
OpenAI returns 503 errors every few weeks, typically during peak hours or incidents. For a mobile app where the AI assistant is part of the core user flow, that means a white screen or a crash unless fallback logic is in place.
Degradation Levels
A proper fallback is not a single stub; it is a degradation cascade:
Level 1: Retry with backoff. For transient errors (429 Too Many Requests, 503, timeouts), retry with exponential backoff: three attempts spaced 1 s, 3 s, and 9 s apart. If all fail, move to Level 2.
Level 2: Provider switch. If the primary provider is OpenAI, fall back to the Anthropic Claude API or Google Gemini. Answer style varies between providers, but quality is comparable for most tasks. Keep the secondary provider's API keys in server-side config.
Level 3: Local model. For critical flows, ship a small local model (for example Phi-3.5-mini via llama.cpp, ~2.2 GB). Quality is lower than GPT-4o, but it works offline. On iOS it can run via Core ML (MLModel) or llama.swift.
Level 4: Static answers. Serve FAQs and common questions from a cache or database. The user gets a useful answer without ever knowing the AI was unavailable.
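The cascade above can be sketched in TypeScript (the implementation language is not specified in the article; `AskFn`, the provider calls, the static-answer table, and the error message are hypothetical placeholders):

```typescript
// A provider call; in a real app this wraps an HTTP request to OpenAI, Claude, etc.
type AskFn = (prompt: string) => Promise<string>;

// Level 4 source: a tiny stand-in for a cache or database of FAQ answers.
const STATIC_ANSWERS: Record<string, string> = {
  "reset password": "Open Settings, then Account, then Reset password.",
};

// Level 1: three attempts spaced 1 s / 3 s / 9 s apart (configurable for tests).
async function withRetry(
  fn: AskFn,
  prompt: string,
  delaysMs: number[] = [1000, 3000, 9000]
): Promise<string> {
  let lastErr: unknown;
  for (let attempt = 0; attempt <= delaysMs.length; attempt++) {
    if (attempt > 0) await new Promise((r) => setTimeout(r, delaysMs[attempt - 1]));
    try {
      return await fn(prompt);
    } catch (err) {
      lastErr = err; // assume transient: 429 / 503 / timeout
    }
  }
  throw lastErr;
}

// Levels 1, 2 and 4 chained; Level 3 (local model) is omitted from this sketch.
async function askWithFallback(
  prompt: string,
  primary: AskFn,
  secondary: AskFn,
  delaysMs: number[] = [1000, 3000, 9000]
): Promise<string> {
  try {
    return await withRetry(primary, prompt, delaysMs); // Level 1: retry primary
  } catch {
    try {
      return await secondary(prompt);                  // Level 2: secondary provider
    } catch {
      const key = prompt.toLowerCase().trim();         // Level 4: static answer
      return STATIC_ANSWERS[key]
        ?? "Assistant temporarily unavailable, please try again in a few minutes.";
    }
  }
}
```

The final `??` branch doubles as the user-facing message for full unavailability, so the UI never has to render a raw provider error.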
Circuit Breaker Implementation
The Circuit Breaker pattern prevents cascading load on a degrading service: it tracks failures, opens the circuit once a failure threshold is reached, and periodically attempts recovery.
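A minimal sketch of that pattern in TypeScript; the threshold and cooldown values are illustrative, not taken from the article:

```typescript
// Minimal circuit breaker: opens after N consecutive failures, then allows a
// trial call again once the cooldown has elapsed (the "half-open" probe).
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly threshold = 5,       // consecutive failures before opening
    private readonly cooldownMs = 30_000  // how long the circuit stays open
  ) {}

  private isOpen(): boolean {
    return (
      this.failures >= this.threshold &&
      Date.now() - this.openedAt < this.cooldownMs
    );
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isOpen()) {
      // Fail fast instead of waiting out another provider timeout.
      throw new Error("circuit open: skipping call, use fallback");
    }
    try {
      const result = await fn();
      this.failures = 0;                  // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

When the breaker throws "circuit open", the caller drops straight to the next degradation level instead of burning another timeout against a service that is already known to be down.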
UX During Degradation
Users should never see technical errors. When a static answer is served, show the normal UI without marking it as a fallback. On full unavailability, show "The assistant is temporarily unavailable, please try again in a few minutes" instead of a raw Error 503.
A degradation indicator is still useful for internal analytics: log each fallback with its level and cause.
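Logging each fallback with its level and cause could look like this; the event shape, level names, and the in-memory sink are assumptions (a real app would forward to Firebase, Amplitude, or an in-house pipeline):

```typescript
// Hypothetical analytics event for one fallback activation.
type FallbackLevel = "retry" | "secondary_provider" | "local_model" | "static_answer";

interface FallbackEvent {
  level: FallbackLevel;
  cause: string;     // e.g. "503", "429", "timeout"
  timestamp: string; // ISO 8601
}

// Stand-in for a real analytics sink.
const fallbackEvents: FallbackEvent[] = [];

function logFallback(level: FallbackLevel, cause: string): FallbackEvent {
  const event: FallbackEvent = {
    level,
    cause,
    timestamp: new Date().toISOString(),
  };
  fallbackEvents.push(event);
  return event;
}
```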
Timeline Estimates
A basic retry with backoff takes about a day. A full cascade with a circuit breaker and two providers takes 2–3 days.