How does background replacement work on a mobile device?

On each frame of the video stream (30 fps), a neural network runs on the device (Neural Engine, GPU) to segment the human silhouette. The mask is then applied to the original frame, replacing the background with an image, video, or blur. The entire process takes 8–28 ms on modern smartphones.

What technologies are used for human segmentation?

On iOS we use Vision (VNGeneratePersonSegmentationRequest) or Core ML with DeepLab v3 / MediaPipe SelfieSegmentation models. On Android we use MLKit Selfie Segmentation in STREAM_MODE. For cross-platform solutions, TFLite and MediaPipe are suitable.

How to integrate background replacement into a WebRTC pipeline?

Most WebRTC SDKs (LiveKit, Daily, Agora) provide VideoProcessor or VideoSource protocols. You plug in your handler, which receives the raw frame (CVPixelBuffer on iOS, ImageProxy on Android), applies segmentation and background, and returns the modified frame for encoding.

How long does it take to develop such a feature?

A basic version with background blur on one platform takes 2–3 weeks. A full implementation with image and video backgrounds, on both platforms with integration into an existing WebRTC stack takes 5–8 weeks. Timelines depend on the chosen segmentation model and quality requirements.

Which devices are supported?

We test on devices from iPhone X (A11) and Android with Snapdragon 695 / Tensor G2 and above. Minimum requirements: 3 GB RAM, Android 10 / iOS 14. On weak devices we use lightweight models (MediaPipe) and lower the frame rate to 24 fps.

AI Virtual Background for Mobile Video Calls

TRUETECH is engaged in the development, support and maintenance of iOS, Android, PWA mobile applications. We have extensive experience and expertise in publishing mobile applications in popular markets like Google Play, App Store, Amazon, AppGallery and others.


Development and support of all types of mobile applications:

Information and entertainment mobile applications

News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators

E-commerce mobile applications

Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.

Business process management mobile applications

CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems

Electronic services mobile applications

Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with, and each of them may have its own specific features and functionality, tailored to the specific needs and goals of the client.

When building a mobile app for video calls, clients often hit the same problem: a virtual background implemented on the server introduces delay and artifacts. In one project, the client used AWS Rekognition in the cloud: each frame was uploaded and came back 60–80 ms later. As a result, the contour jittered, and users complained about quality.

We solved it by moving segmentation to the device. Server-side processing adds 40–80 ms per frame, which at 30 fps causes noticeable contour breakup and a 'ghost' effect during fast movement. On-device segmentation completes in 8–28 ms, up to 80% less time than cloud inference. It also reduces infrastructure costs and keeps user video private.

On each frame of the video stream, the neural network extracts the human silhouette, applies the background (image, video, or blur), and returns the result to the encoder pipeline, all without transmitting data to a server. At 30 fps the time budget is about 33 ms per frame, and on-device inference fits within it on modern smartphones. On budget devices we use lightweight models and reduce the frame rate to 24 fps, which keeps operation stable and avoids overheating.

What are the advantages of on-device AI virtual background?

The task is to extract the human silhouette on each frame of a video stream (30 fps), apply the background, and return the result to the pipeline before encoding. This means a budget of ~33 ms per frame including capture, model inference, post-processing, and rendering.
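
The budget above is simple arithmetic, and a small sketch makes the trade-off visible. The per-stage costs below are hypothetical placeholders; only the 28 ms worst-case inference figure comes from the text.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def headroom_ms(fps: float, stage_costs_ms: dict) -> float:
    """Budget left after all pipeline stages; a negative value means dropped frames."""
    return frame_budget_ms(fps) - sum(stage_costs_ms.values())


# Hypothetical stage costs for a slow device (assumed values, not measurements).
stages = {"capture": 2.0, "inference": 28.0, "post_processing": 2.0, "rendering": 3.0}

for fps in (30, 24):
    print(f"{fps} fps: budget {frame_budget_ms(fps):.1f} ms, "
          f"headroom {headroom_ms(fps, stages):.1f} ms")
```

With these assumed numbers the pipeline overruns the 30 fps budget but fits at 24 fps, which is the reasoning behind lowering the frame rate on weak devices.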

Server-side: capture → send → inference → response → rendering. Even with an ideal network, the round trip adds 40–80 ms. In practice, this means contour jitter and a 'ghost' effect during movement.

On device: capture → inference → rendering, all in one pipeline. The server approach also carries high infrastructure costs because it needs GPU servers, and the on-device approach eliminates them. Savings on server GPU computing can reach $2,000 per month for an app with 10,000 active users; in another project, abandoning expensive GPU instances saved $3,000 per month. For a typical project, the development investment is between $5,000 and $25,000 depending on complexity.

On-device segmentation process

We use neural networks optimized for mobile chips (Neural Engine, GPU, DSP). Inference runs locally; no data leaves the device — this simultaneously solves privacy and latency issues.

iOS: Vision + Core Image or Core ML

On iOS we use the Vision framework with VNGeneratePersonSegmentationRequest. Apple introduced it in iOS 15; it runs on the Neural Engine and needs no explicit model loading. Accuracy is good for the front camera, but contours can come out jagged around complex hairstyles and transparent clothing elements.

import Vision

// Configure segmentation once and reuse the request across frames
let request = VNGeneratePersonSegmentationRequest()
request.qualityLevel = .balanced   // .accurate gives a better contour but is heavier
request.outputPixelFormat = kCVPixelFormatType_OneComponent8

// In the AVFoundation frame handler (pixelBuffer is the camera frame; a throwing context is assumed)
let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
try handler.perform([request])

guard let mask = request.results?.first?.pixelBuffer else { return }
// mask is an 8-bit single-channel CVPixelBuffer; apply it via CIBlendWithMask

We composite with CIBlendWithMask and render through a CIContext created with options [.workingColorSpace: NSNull()]. This disables Core Image color management, so rendering stays in Metal without color space conversion. Without it, each frame spends ~5 ms on conversion alone.

For higher-quality segmentation, we take a model distributed for TFLite, such as DeepLab v3 or MediaPipe SelfieSegmentation, convert it to Core ML with coremltools, and load it through MLModel. MediaPipe gives a stable contour even at blurry edges.

Android: MLKit Selfie Segmentation

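import androidx.camera.core.ExperimentalGetImage
import androidx.camera.core.ImageProxy
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.segmentation.Segmentation
import com.google.mlkit.vision.segmentation.selfie.SelfieSegmenterOptions

val segmenter = Segmentation.getClient(
    SelfieSegmenterOptions.Builder()
        .setDetectorMode(SelfieSegmenterOptions.STREAM_MODE)  // optimized for video
        .enableRawSizeMask()  // mask at model output size instead of frame size
        .build()
)

// In the CameraX ImageAnalysis.Analyzer
@OptIn(ExperimentalGetImage::class)
override fun analyze(imageProxy: ImageProxy) {
    // image can be null; close the proxy so CameraX keeps delivering frames
    val mediaImage = imageProxy.image ?: run { imageProxy.close(); return }
    val inputImage = InputImage.fromMediaImage(mediaImage, imageProxy.imageInfo.rotationDegrees)
    segmenter.process(inputImage)
        .addOnSuccessListener { segmentationMask ->
            val mask = segmentationMask.buffer  // ByteBuffer of per-pixel confidences
            // Apply background via RenderScript or Vulkan compute shader
            applyBackground(mask, imageProxy)  // applyBackground is defined elsewhere
        }
        .addOnCompleteListener { imageProxy.close() }  // always release the frame
}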

STREAM_MODE is critical: it keeps internal state between frames and runs faster than SINGLE_IMAGE_MODE. On a Tensor G2 device (Pixel 7 series; the Pixel 6 ships the first-generation Tensor), inference takes 8–12 ms. On budget devices (Snapdragon 695) it takes 20–28 ms. For mask post-processing, we use RenderScript (deprecated since API 31) or, on Android 12+, RenderEffect or a Vulkan compute shader.

Comparison of segmentation models

Model	Platform	Latency (ms)	Contour quality
Vision (Apple)	iOS	12–20	Good
MLKit Selfie Segmentation	Android	8–12	Excellent
MediaPipe	Cross-platform	15–25	Average
Core ML (DeepLab)	iOS	20–30	High

Comparison of approaches: server vs on-device

Parameter	Server segmentation	On-device segmentation
Latency	150–300 ms (roundtrip)	8–28 ms (inference)
Network dependency	Critical	None
Privacy	Data goes to server	Data stays on device
Accuracy	High (large model)	Good (optimized models)
Infrastructure cost	High (GPU servers)	Zero (only software)

Background application: three options

Static image — simplest case. CIBlendWithMask on iOS, PorterDuff compositing on Android.

Blur — CIGaussianBlur filter with radius 12–20 applied to the original frame, then mask selects between original and blurred. On Android — RenderEffect.createBlurEffect (API 31+) or custom blur via Vulkan.

Video background — needs a decoder synchronized with the video call timing. On iOS — AVPlayerItemVideoOutput + Metal texture. Memory heavy: video background buffer + camera buffer + mask buffer + result. On iPhone 12 with 4 GB it's fine, on iPhone SE 2nd gen (3 GB) we need aggressive buffer reuse.
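
All three options reduce to the same per-pixel blend: out = m × person + (1 − m) × background, where m is the mask confidence. The sketch below shows that formula on plain Python lists, assuming frames are rows of (r, g, b) tuples and the mask holds 0..255 values. A production pipeline would run the same arithmetic on the GPU.

```python
def composite(person, background, mask):
    """Blend two RGB frames per pixel: out = m * person + (1 - m) * background.

    Frames are rows of (r, g, b) tuples; mask is rows of 0..255 confidences.
    """
    out = []
    for p_row, b_row, m_row in zip(person, background, mask):
        row = []
        for p, b, m in zip(p_row, b_row, m_row):
            a = m / 255.0  # 1.0 keeps the camera pixel, 0.0 keeps the background
            row.append(tuple(round(a * pc + (1.0 - a) * bc) for pc, bc in zip(p, b)))
        out.append(row)
    return out


camera = [[(200, 150, 100), (200, 150, 100)]]
backdrop = [[(0, 0, 255), (0, 0, 255)]]
mask = [[255, 0]]  # left pixel is the person, right pixel is background
print(composite(camera, backdrop, mask))
```

For blur, `background` is simply a blurred copy of the camera frame; for a video background, it is the current decoded video frame.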

How to integrate background replacement into a WebRTC pipeline?

Most mobile calling solutions are built on WebRTC — via LiveKit, Daily.co, Agora, or native WebRTC. All provide a custom VideoSource/VideoProcessor mechanism for frame manipulation before encoding.

In LiveKit SDK for iOS it's the VideoProcessor protocol:

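class BackgroundReplacementProcessor: VideoProcessor {
    func process(frame: RTCVideoFrame) -> RTCVideoFrame? {
        // 1. Run segmentation on the frame's pixel buffer
        // 2. Composite the chosen background using the mask
        // 3. Return a new RTCVideoFrame wrapping the processed buffer
        return frame  // pass-through placeholder until the steps above are implemented
    }
}
// Attach the processor to the local camera track (the exact property depends on the SDK version)
room.localParticipant?.videoTracks.first?.processor = BackgroundReplacementProcessor()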

Important: RTCVideoFrame works with CVPixelBuffer in format kCVPixelFormatType_420YpCbCr8BiPlanarFullRange. Converting to RGB for ML inference and back incurs losses. If the model accepts YUV, we keep the format untouched.
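
To see where the losses come from, here is a minimal sketch of an 8-bit YCbCr → RGB → YCbCr round trip. It assumes BT.601 full-range (JFIF) coefficients; a real capture pipeline may use BT.709, and the function names are illustrative only.

```python
def clamp8(x: float) -> int:
    """Round and clamp to the 8-bit range, as an integer pixel format does."""
    return max(0, min(255, round(x)))


def yuv_to_rgb(y, cb, cr):
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp8(r), clamp8(g), clamp8(b)


def rgb_to_yuv(r, g, b):
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return clamp8(y), clamp8(cb), clamp8(cr)


def round_trip_changed(step: int = 17) -> float:
    """Fraction of sampled YCbCr triples that do not survive the round trip."""
    values = range(0, 256, step)
    total = changed = 0
    for y in values:
        for cb in values:
            for cr in values:
                total += 1
                if rgb_to_yuv(*yuv_to_rgb(y, cb, cr)) != (y, cb, cr):
                    changed += 1
    return changed / total


print(f"{round_trip_changed():.0%} of sampled values change after the round trip")
```

Neutral grays survive, but saturated values get clamped or rounded on the way through RGB, which is why keeping the frame in YUV is preferable when the model allows it.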

Criteria for choosing a segmentation model

Model choice depends on target devices and quality requirements. For iOS with A12+, Vision is suitable — built-in model, no extra resources needed. For Android with Tensor G2 or Snapdragon 8 Gen 1, MLKit gives the best quality. On weak devices (Snapdragon 695, A11) we use MediaPipe or lower fps.

Optimization for low-end devices

On resource-constrained devices we reduce frame rate to 24 fps and use lightweight models (MediaPipe). We also apply dynamic scaling of the input frame: reduce resolution to 480p before inference, then upscale the mask. This cuts processing time by 30-40% without noticeable quality loss.
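
The dynamic scaling step can be sketched as follows. This is an illustration under assumptions: it treats "480p" as a 480-pixel shorter side so portrait frames work too, represents the mask as rows of floats in 0..1, and uses bilinear interpolation for the upscale. The function names are hypothetical.

```python
def inference_size(width: int, height: int, target_short_side: int = 480):
    """Shrink a frame so its shorter side is target_short_side, keeping aspect ratio."""
    short = min(width, height)
    if short <= target_short_side:
        return width, height  # already small enough, no scaling needed
    scale = target_short_side / short
    return round(width * scale), round(height * scale)


def upscale_mask(mask, out_w: int, out_h: int):
    """Bilinear upscale of a mask given as rows of floats in 0..1."""
    in_h, in_w = len(mask), len(mask[0])
    out = []
    for oy in range(out_h):
        # Map the output pixel centre back into input coordinates.
        fy = min(max((oy + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for ox in range(out_w):
            fx = min(max((ox + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            top = mask[y0][x0] * (1 - wx) + mask[y0][x1] * wx
            bottom = mask[y1][x0] * (1 - wx) + mask[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


print(inference_size(1280, 720))
print(upscale_mask([[0.0, 1.0]], 4, 1))
```

Bilinear interpolation gives the upscaled mask a soft edge for free, which hides most of the resolution loss at the contour.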

Deliverables

After project completion, you get a fully integrated background replacement feature tested on 20+ real devices. Deliverables include:

Source code of the segmentation and background processing module.
Detailed technical documentation for integration and customization.
Training of your team on the code and configurations.
One month of support after launch to fix potential bugs.
Recommendations for optimizing for new OS versions.

Integration stages

Audit of the current WebRTC stack and frame pipeline.
Selection of segmentation model based on quality/speed (Vision, MLKit, Core ML, MediaPipe).
Prototype development with performance measurement on 10+ devices.
Integration into the existing WebRTC pipeline via VideoProcessor.
Optimization of mask post-processing (antialiasing, feathering).
Testing on edge cases (complex background, fast movements).
Documentation and training of the client's team.
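
The mask post-processing stage (feathering) can be illustrated with a box blur over the mask. This is a minimal sketch, not the production shader: it assumes the mask is rows of floats in 0..1, and a real implementation would run on the GPU.

```python
def feather(mask, radius: int = 1):
    """Soften mask edges with a box blur; mask is rows of floats in 0..1."""
    h, w = len(mask), len(mask[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total, count = 0.0, 0
            # Average over the neighbourhood, clipped at the mask borders.
            for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += mask[ny][nx]
                    count += 1
            row.append(total / count)
        out.append(row)
    return out


hard_edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
soft_edge = feather(hard_edge)
print([round(v, 2) for v in soft_edge[1]])
```

The hard 0 → 1 step becomes a gradual ramp, so the composited contour blends into the background instead of showing a stair-stepped outline.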

Timeline estimates

Basic implementation with background blur (one platform) — 2–3 weeks, typically costing $5,000-$10,000. Full implementation with support for static images and video backgrounds, both platforms, integration into existing WebRTC stack — 5–8 weeks, costing $15,000-$25,000.

We have 8+ years of experience in mobile development and have completed 50+ projects with video and AI features. We guarantee stable operation on devices older than five years.

Machine Learning in Mobile Apps: CoreML, TFLite, and On-Device Models

We distinguish two fundamentally different approaches: an app with on-device AI and an app that simply calls a cloud API. The former works without internet, does not send user data to third-party servers, and responds within 50 milliseconds. The latter depends on network latency and pricing plans. Choosing the architecture is a key step that directly affects cost, privacy, and user experience. In our experience, on-device inference turns out cheaper in the long run in 70% of projects because it eliminates server costs.

How to Choose Between CoreML and TFLite for On-Device Inference?

CoreML is Apple's native framework for running ML models on device. It supports the Neural Engine (open to Core ML starting with A12 Bionic), GPU, and CPU as fallback. Models are converted to .mlmodel format via coremltools from PyTorch or TensorFlow (ONNX input is supported only by older coremltools releases). Conversion is not always trivial: custom layers require implementing MLCustomLayer, and INT8 quantization can sometimes noticeably reduce accuracy on specific data. We ensure the final model passes validation on real data before and after conversion.
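
To make the INT8 accuracy risk concrete, here is a sketch of symmetric per-tensor quantization on a handful of made-up weights. It is a simplification for illustration and not what coremltools does internally; the function names and values are hypothetical.

```python
def quantize_int8(values, max_abs: float):
    """Symmetric quantization: map [-max_abs, max_abs] onto integers -127..127."""
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale: float):
    return [v * scale for v in q]


weights = [-1.0, -0.42, 0.003, 0.004, 0.42, 1.0]
q, scale = quantize_int8(weights, max_abs=1.0)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print(f"step size {scale:.5f}, worst absolute error {max_error:.5f}")
```

The absolute error is bounded by half a step, but the two small weights land in different bins (0.003 becomes exactly zero), so their relative error is large. Layers dominated by small weights are where accuracy drops, which is why validation on real data after conversion matters.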

TensorFlow Lite is the cross-platform alternative for Android and Flutter. On Android it uses NNAPI (Neural Networks API) for hardware acceleration. NNAPI is more stable from Android 10 onward; on earlier versions it is better to use the GPU delegate (GpuDelegate) explicitly. A typical mistake: the model is trained on data normalized to [0,1], but the app feeds [0,255]. Inference runs without any error but produces meaningless results. We include an automatic input data validation module in the SDK.
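
A range check of this kind can be very small. The sketch below is an assumption about what such validation might look like, not the SDK's actual module; `check_input_range` and `normalize_uint8` are hypothetical names.

```python
def check_input_range(pixels, expected=(0.0, 1.0), tolerance=1e-3):
    """Raise if the input tensor falls outside the range the model was trained on."""
    lo, hi = min(pixels), max(pixels)
    exp_lo, exp_hi = expected
    if lo < exp_lo - tolerance or hi > exp_hi + tolerance:
        raise ValueError(
            f"input range [{lo}, {hi}] is outside expected [{exp_lo}, {exp_hi}]; "
            "was normalization skipped?"
        )
    return lo, hi


def normalize_uint8(pixels):
    """Scale 0..255 camera values to the 0..1 range."""
    return [p / 255.0 for p in pixels]


raw = [0, 128, 255]
try:
    check_input_range(raw)
except ValueError as err:
    print("rejected:", err)

print("accepted:", check_input_range(normalize_uint8(raw)))
```

The check is a heuristic: a very dark unnormalized frame could still pass, so it is best run over several frames during debug builds rather than trusted on a single input.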

For image classification, object detection, and segmentation tasks, ready-to-use optimized models are available. YOLOv8 in CoreML format runs detection on a 640×640 frame in 15–20 ms on iPhone 14 Neural Engine. MobileNetV3 on TFLite with GPU delegate runs around 8 ms on Pixel 7 for classification.

Parameter	CoreML	TFLite
Platforms	iOS, macOS, watchOS	Android, iOS, Linux, embedded
Hardware acceleration	Neural Engine, GPU, CPU	NNAPI, GPU (OpenCL/OpenGL), CPU
Quantization support	FP16, INT8 (with coremltools)	FP16, INT8, dynamic range
Custom operations	Via MLCustomLayer (Swift)	Via custom operators (C++)
Model bundle size	~3–5 MB (MobileNetV2 quantized)	~2–4 MB

What If You Need Text Generation On-Device?

Running small language models on device has become practical in the last few years. Apple Intelligence runs its own models on device and in Private Cloud Compute, but third-party developers have other paths available.

llama.cpp with the Metal backend on iOS is a working approach for phi-3-mini (3.8B parameters, 4-bit quantization, ~2.3 GB). Inference runs at 15–25 tokens/second on iPhone 15 Pro. For Swift integration, use the Swift Package llama.swift or a wrapper over the C interface llama.h. The model file is not bundled with the app: it is downloaded on first launch and stored in Application Support. Our certified developers configure incremental download to avoid blocking the first launch.
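
The ~2.3 GB figure can be sanity-checked with simple arithmetic. The sketch below assumes decimal gigabytes and that the gap above the raw 4-bit size comes from quantization scales and a few tensors kept at higher precision; that explanation is an assumption, not something the figures above state.

```python
def quantized_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate model file size in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


def implied_bits_per_weight(size_gb: float, params: float) -> float:
    """Effective bits per weight implied by a given file size."""
    return size_gb * 1e9 * 8 / params


raw_4bit = quantized_size_gb(3.8e9, 4.0)
effective = implied_bits_per_weight(2.3, 3.8e9)

print(f"pure 4-bit weights: {raw_4bit:.2f} GB")
print(f"2.3 GB implies about {effective:.2f} bits per weight")
```

The same function is a quick way to estimate download size and storage needs before committing to a model and quantization level.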

On Android, the counterpart is Google AI Edge (formerly MediaPipe LLM Inference API), which supports Gemma-2B. It runs through the GPU delegate and reaches about 20 tokens/second on a Pixel 8 Pro with the Tensor G3 chip.

Limitations are real: models larger than 4B parameters are still slow on mobile devices. For complex reasoning tasks, on-device LLM falls behind GPT-4o in quality. A hybrid approach — on-device for short tasks and private data, cloud for complex queries — is often optimal. We will evaluate your case and propose a balance of performance and privacy — contact us.
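
One way to express the hybrid approach is a small routing rule. The sketch below is purely illustrative: the `Request` fields, the character threshold, and the returned labels are all hypothetical and would be tuned per project.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    contains_private_data: bool
    needs_complex_reasoning: bool


def route(request: Request, online: bool, max_on_device_chars: int = 2000) -> str:
    """Pick an execution target for one request."""
    # Private data never leaves the device, and offline leaves no other choice.
    if request.contains_private_data or not online:
        return "on_device"
    # Long or reasoning-heavy prompts go to the larger cloud model.
    if request.needs_complex_reasoning or len(request.prompt) > max_on_device_chars:
        return "cloud"
    return "on_device"


print(route(Request("Summarize my notes", True, True), online=True))
print(route(Request("Plan a three-week trip", False, True), online=True))
```

Checking privacy first means a sensitive request stays local even when the cloud model would answer it better, which matches the priority described above.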

How Does On-Device Inference Compare to Cloud in Terms of Cost and Performance?

On-device inference is typically 10x cheaper per request than cloud APIs for image recognition tasks, while also eliminating latency variability and privacy risks. The table below summarizes the trade-offs.

Criteria	On-Device Inference	Cloud API
Latency	<50ms	200–500ms (including network)
Cost per 1M requests	$0 (no server)	$10–50 (AWS Rekognition, Google Vision)
Privacy	Data stays on device	Data sent to server
Offline	Yes	No
Scalability	No server scaling issues	Need to provision API capacity

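For an app with 100k MAU running 10 image recognitions per user per month, that is 1M requests per month. The saving equals whatever the cloud provider bills for that volume and grows linearly with usage; on heavier workloads it can reach $5,000 monthly. Get a free consultation on your ML architecture today.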

Integrating OpenAI API and Other Cloud Models

For scenarios where cloud inference is acceptable, integrating OpenAI, Anthropic, or Google Gemini comes down to an HTTP client plus streaming over SSE (server-sent events). In Swift, AsyncThrowingStream is convenient for streaming responses. In Kotlin, use Flow.
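
The streaming side is mostly line parsing. The sketch below shows the SSE framing on a canned list of lines instead of a live connection. The `{"delta": ...}` payload shape is hypothetical, since each provider defines its own JSON schema, and the `[DONE]` sentinel follows OpenAI's streaming convention.

```python
import json


def iter_sse_data(lines):
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive ping
        if line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]  # the format allows one optional leading space
            buffer.append(value)
        elif line == "" and buffer:
            yield "\n".join(buffer)  # a blank line terminates the event
            buffer = []


def collect_text(lines, done_marker="[DONE]"):
    """Concatenate text fragments until the end-of-stream sentinel."""
    parts = []
    for data in iter_sse_data(lines):
        if data == done_marker:
            break
        parts.append(json.loads(data)["delta"])  # hypothetical payload field
    return "".join(parts)


stream = [
    'data: {"delta": "Hel"}\n', "\n",
    ": ping\n",
    'data: {"delta": "lo"}\n', "\n",
    "data: [DONE]\n", "\n",
]
print(collect_text(stream))
```

In an app, the same generator structure maps directly onto AsyncThrowingStream in Swift or Flow in Kotlin, with each yielded fragment appended to the UI as it arrives.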

Critically: API keys must never be stored in the app bundle. Even an obfuscated key can be extracted from the IPA in 10 minutes using strings or frida. Correct architecture: mobile app → your own backend → OpenAI API. The backend controls rate limiting, logs requests, and protects the key.
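
The rate limiting that the backend performs can be as simple as a token bucket per user. This is a minimal sketch under assumptions: the class name and parameters are hypothetical, and time is passed in explicitly so the logic is easy to test.

```python
class TokenBucket:
    """Allow short bursts up to `capacity`, then `refill_per_second` requests per second."""

    def __init__(self, capacity: int, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(capacity=2, refill_per_second=1.0)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
print(decisions)
```

A backend would keep one bucket per user or per device ID and reject requests with HTTP 429 when `allow` returns False, so a leaked client build cannot drain the API quota.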

What Is Included in the Work (Deliverables)

Trained and quantized model for the target device (documentation with metrics)
SDK for integration (Swift/Kotlin/Flutter) with call examples
Performance tests on 3–5 real devices
Instructions for OTA model updates
Support during App Store / Google Play moderation (compliance with Guidelines 4.2, 5.1)
2 weeks of technical support after release

Typical Project Pipeline

Task analysis — measure latency, privacy, size, supported devices.
Model prototyping — in Python, evaluate accuracy on target data.
Conversion and quantization — for CoreML/TFLite with validation.
Integration into the app — model wrapped in a service layer (easy to swap CoreML ↔ TFLite ↔ cloud).
Testing — on real devices, measure FPS, RAM, battery.
Deployment — via TestFlight / Firebase App Distribution, monitor metrics.

Timelines: integration of a ready CoreML/TFLite model — 1–2 weeks, development of a custom model with mobile optimization — from 6 weeks, on-device LLM chat with personalization — 4–8 weeks.

Why We Take on Complex Cases?

8+ years of experience in mobile development, 50+ implemented AI/ML solutions, and guaranteed compatibility with current iOS and Android versions. All projects undergo code review and load testing. The cost includes preparation of moderation documentation and training of your team.

Contact us and we will help you choose the architecture and implement ML in your app turnkey. Order an audit of your existing solution: we will assess the potential for server cost savings free of charge. In the projects described above, savings reached $2,000–$3,000 per month.