AI Virtual Background (Background Replacement) for Video Calls
A standard virtual background implementation built on WebRTC and a third-party segmentation service works while the network is stable. On mobile, when 4G degrades, the roundtrip for sending a frame to the server and receiving the masked result grows to 150–300 ms, which causes visible artifacts in real time. The correct approach is on-device segmentation.
Why Server-Side Segmentation Doesn't Work on Mobile
The task is to isolate the human silhouette on each video frame (30 fps), apply a background, and return the frame to the pipeline before encoding. That leaves a budget of about 33 ms per frame, including capture, model inference, post-processing, and rendering.
Server approach: capture → send → infer → response → render. Even on a perfect network, the roundtrip adds 40–80 ms, so the mask arrives after the frame it was computed for. In practice this shows up as contour jitter and motion "ghosting".
On-device: capture → infer → render. Everything runs in one pipeline.
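To make the budget concrete, here is a minimal sketch of the arithmetic, using the figures quoted above (30 fps, 40–80 ms best-case roundtrip, 150–300 ms on degraded 4G). The function names are illustrative and not part of any SDK, and rounding the lag up to whole frames is a simplifying assumption.

```kotlin
import kotlin.math.ceil

// Time available per frame at a given frame rate, in milliseconds.
fun frameBudgetMs(fps: Int): Double = 1000.0 / fps

// Number of whole frames by which a remotely computed mask trails the
// live camera feed, given the extra roundtrip latency (rounded up).
fun maskLagFrames(roundtripMs: Double, fps: Int): Int =
    ceil(roundtripMs * fps / 1000.0).toInt()

fun main() {
    val fps = 30
    println("Budget per frame: %.1f ms".format(frameBudgetMs(fps)))
    for (roundtrip in listOf(40.0, 80.0, 150.0, 300.0)) {
        println("Roundtrip $roundtrip ms: mask lags by ${maskLagFrames(roundtrip, fps)} frame(s)")
    }
}
```

At 30 fps, a 150 ms roundtrip means the mask is about five frames stale by the time it is applied, which is what the eye reads as ghosting around a moving silhouette.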
iOS: MLKit + CoreImage or Vision
On iOS, use the Vision framework with VNGeneratePersonSegmentationRequest. Apple added it in iOS 15; it runs on the Neural Engine without explicit model loading. Accuracy is good for the front camera, but contours become ragged with complex hairstyles and transparent clothing.
import Vision

// Segmentation setup: create once and reuse for every frame
let request = VNGeneratePersonSegmentationRequest()
request.qualityLevel = .balanced // .accurate gives a better contour but is heavier
request.outputPixelFormat = kCVPixelFormatType_OneComponent8

// In the AVFoundation frame handler
let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
do {
    try handler.perform([request])
} catch {
    return // skip this frame if segmentation fails
}
guard let mask = request.results?.first?.pixelBuffer else { return }
// mask is an 8-bit CVPixelBuffer; apply it via CIBlendWithMask
Render CIBlendWithMask through a CIContext created with options [.workingColorSpace: NSNull()]. This disables color management, so Core Image renders on Metal without a color space conversion. Without this option, each frame spends about 5 ms on conversion alone.
For better segmentation, convert TFLite DeepLab v3 or MediaPipe SelfieSegmentation to Core ML via coremltools and load the result via MLModel. MediaPipe gives stable contours even with soft edges.
Android: MLKit Selfie Segmentation
val segmenter = Segmentation.getClient(
    SelfieSegmenterOptions.Builder()
        .setDetectorMode(SelfieSegmenterOptions.STREAM_MODE) // optimized for video
        .enableRawSizeMask()
        .build()
)

// In the CameraX ImageAnalysis handler
@OptIn(ExperimentalGetImage::class) // ImageProxy.image is an opt-in API
override fun analyze(imageProxy: ImageProxy) {
    val mediaImage = imageProxy.image
    if (mediaImage == null) {
        imageProxy.close()
        return
    }
    val inputImage = InputImage.fromMediaImage(mediaImage, imageProxy.imageInfo.rotationDegrees)
    segmenter.process(inputImage)
        .addOnSuccessListener { segmentationMask ->
            // Float confidences, segmentationMask.width x segmentationMask.height
            val mask = segmentationMask.buffer
            // Apply background via RenderScript or Vulkan compute shader
            applyBackground(mask, imageProxy)
        }
        .addOnCompleteListener { imageProxy.close() } // always release the frame
}
STREAM_MODE is critical: the segmenter maintains internal state between frames and runs faster than SINGLE_IMAGE_MODE. On a Pixel 6 with Google Tensor, inference takes 8–12 ms; on budget phones (Snapdragon 695), 20–28 ms. For mask post-processing, use RenderScript (deprecated since API 31) or, on Android 12+, a Vulkan compute shader or RenderEffect.
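The mask buffer holds one 32-bit float per pixel: the confidence that the pixel belongs to the person. Below is a minimal sketch of converting it to 8-bit alpha values before compositing. It uses only java.nio so it can be run off-device; maskToAlpha and the synthetic 2x2 buffer in main are illustrative stand-ins, not ML Kit API.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Converts float confidences (0.0 = background, 1.0 = person) into
// 8-bit alpha values. On a device, width and height would come from
// the segmentation mask object.
fun maskToAlpha(buffer: ByteBuffer, width: Int, height: Int): ByteArray {
    // duplicate() resets the byte order, so restore it explicitly.
    val view = buffer.duplicate().order(buffer.order())
    view.rewind()
    val floats = view.asFloatBuffer()
    require(floats.remaining() >= width * height) { "mask buffer too small" }
    val alpha = ByteArray(width * height)
    for (i in alpha.indices) {
        val confidence = floats.get(i).coerceIn(0f, 1f)
        alpha[i] = (confidence * 255f + 0.5f).toInt().toByte()
    }
    return alpha
}

fun main() {
    // Synthetic 2x2 mask standing in for a real segmentation result.
    val values = floatArrayOf(0f, 0.25f, 0.9f, 1f)
    val buffer = ByteBuffer.allocateDirect(values.size * 4).order(ByteOrder.nativeOrder())
    buffer.asFloatBuffer().put(values)
    val alpha = maskToAlpha(buffer, 2, 2)
    println(alpha.map { it.toInt() and 0xFF })
}
```

In production this loop would more likely live in a shader, but a CPU version is useful for checking mask orientation and size against the camera frame, especially with enableRawSizeMask(), where the mask can be smaller than the input.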
Applying Backgrounds: Three Variants
Static image: the simplest case. Use CIBlendWithMask on iOS and PorterDuff compositing on Android.
Blur: apply CIGaussianBlur with a radius of 12–20 to the original frame, then let the mask select between the original and the blurred version. On Android, use RenderEffect.createBlurEffect (API 31+) or a custom blur via Vulkan.
Video background: needs a decoder synchronized with the call timing. On iOS, use AVPlayerItemVideoOutput with a Metal texture. This is heavy on memory: video buffer + camera buffer + mask buffer + result. An iPhone 12 with 4 GB copes; an iPhone SE 2nd gen (3 GB) needs aggressive buffer reuse.
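All three variants end in the same per-pixel operation: out = frame * mask + background * (1 - mask). The sketch below shows it on the CPU with ARGB pixels packed into Int values; on a device the same math runs inside CIBlendWithMask or a shader. The function names and the sample colors are illustrative.

```kotlin
// Blends one ARGB pixel of the camera frame over the background.
// alpha is the mask value in 0..255 (255 = person, 0 = background).
fun blendPixel(fg: Int, bg: Int, alpha: Int): Int {
    val inv = 255 - alpha
    fun channel(shift: Int): Int {
        val f = (fg shr shift) and 0xFF
        val b = (bg shr shift) and 0xFF
        return (f * alpha + b * inv + 127) / 255 // rounded integer blend
    }
    return (0xFF shl 24) or (channel(16) shl 16) or (channel(8) shl 8) or channel(0)
}

// Composites a whole frame. background can be a static image, a blurred
// copy of the frame, or a decoded video frame; the blend is identical.
fun composite(frame: IntArray, background: IntArray, mask: ByteArray): IntArray {
    require(frame.size == background.size && frame.size == mask.size)
    return IntArray(frame.size) { i ->
        blendPixel(frame[i], background[i], mask[i].toInt() and 0xFF)
    }
}

fun main() {
    val person = 0xFFCC8844.toInt()
    val backdrop = 0xFF102030.toInt()
    val frame = intArrayOf(person, person, person)
    val background = intArrayOf(backdrop, backdrop, backdrop)
    val mask = byteArrayOf(0, 128.toByte(), 255.toByte())
    composite(frame, background, mask).forEach { println(Integer.toHexString(it)) }
}
```

Because the output array is the same size as the inputs, a real pipeline would write into a preallocated buffer instead of allocating per frame, which matters on the 3 GB devices mentioned above.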
Integration in WebRTC Pipeline
Most mobile calling solutions use WebRTC, via LiveKit, Daily.co, Agora, or native WebRTC. All of them provide a custom VideoSource/VideoProcessor mechanism for replacing frames before encoding.
In LiveKit SDK for iOS, this is the VideoProcessor protocol:
class BackgroundReplacementProcessor: VideoProcessor {
    func process(frame: RTCVideoFrame) -> RTCVideoFrame? {
        // Segmentation + background application
        // Return new RTCVideoFrame with processed buffer
        return frame // placeholder: passes the frame through unchanged
    }
}
room.localParticipant?.videoTracks.first?.processor = BackgroundReplacementProcessor()
Important: RTCVideoFrame wraps a CVPixelBuffer in the kCVPixelFormatType_420YpCbCr8BiPlanarFullRange format. Converting to RGB for ML inference and back is lossy. If the model accepts YUV, keep the format untouched.
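The loss is easy to reproduce with a YUV → RGB → YUV roundtrip on single pixels. The sketch below assumes full-range BT.601 coefficients purely for illustration (a real camera buffer may use BT.709), and the Yuv and Rgb types are hypothetical helpers, not WebRTC types.

```kotlin
import kotlin.math.roundToInt

data class Yuv(val y: Int, val cb: Int, val cr: Int)
data class Rgb(val r: Int, val g: Int, val b: Int)

// Rounds to the nearest integer and clamps to the 8-bit range.
fun clamp8(v: Double): Int = v.roundToInt().coerceIn(0, 255)

// Full-range BT.601 conversion (an assumption for this sketch).
fun yuvToRgb(p: Yuv): Rgb {
    val cb = p.cb - 128.0
    val cr = p.cr - 128.0
    return Rgb(
        clamp8(p.y + 1.402 * cr),
        clamp8(p.y - 0.344136 * cb - 0.714136 * cr),
        clamp8(p.y + 1.772 * cb)
    )
}

fun rgbToYuv(p: Rgb): Yuv = Yuv(
    clamp8(0.299 * p.r + 0.587 * p.g + 0.114 * p.b),
    clamp8(128.0 - 0.168736 * p.r - 0.331264 * p.g + 0.5 * p.b),
    clamp8(128.0 + 0.5 * p.r - 0.418688 * p.g - 0.081312 * p.b)
)

fun main() {
    val samples = listOf(Yuv(128, 128, 128), Yuv(255, 128, 255), Yuv(16, 240, 16))
    for (s in samples) {
        val back = rgbToYuv(yuvToRgb(s))
        println("$s -> $back ${if (back == s) "(unchanged)" else "(changed)"}")
    }
}
```

Neutral gray survives the roundtrip, but saturated values fall outside the RGB gamut and get clamped, so they come back different. With 4:2:0 frames, chroma subsampling on the way back adds further loss that this per-pixel sketch does not model.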
Assessment and Process
Start with an audit of the existing WebRTC stack: which SDK is used, how the frame pipeline is organized, and what the target device list is. Then build a prototype with Vision/MLKit and measure it on real devices from the minimum requirements list.
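When measuring, averages hide the slow frames that cause visible stutter, so it helps to look at percentiles against the 33 ms budget. A minimal sketch, with hypothetical helper names and a sleep standing in for the real segmentation and compositing pass:

```kotlin
import kotlin.math.ceil

// Nearest-rank percentile over per-frame timings in milliseconds.
fun percentile(samplesMs: List<Double>, p: Double): Double {
    require(samplesMs.isNotEmpty() && p > 0.0 && p <= 100.0)
    val sorted = samplesMs.sorted()
    val rank = ceil(p / 100.0 * sorted.size).toInt()
    return sorted[(rank - 1).coerceIn(0, sorted.size - 1)]
}

// Runs work once per frame and returns each duration in milliseconds.
fun measureMs(frames: Int, work: () -> Unit): List<Double> =
    List(frames) {
        val start = System.nanoTime()
        work()
        (System.nanoTime() - start) / 1_000_000.0
    }

fun main() {
    val budgetMs = 1000.0 / 30
    // Stand-in workload; on a device this would be one full frame pass.
    val timings = measureMs(frames = 200) { Thread.sleep(1) }
    val p50 = percentile(timings, 50.0)
    val p95 = percentile(timings, 95.0)
    println("p50=%.1f ms, p95=%.1f ms, within budget: %b".format(p50, p95, p95 <= budgetMs))
}
```

Running this on the weakest device in the target list, after the device has warmed up, gives a more honest picture than a cold run on a flagship.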
Critical steps: tune the segmentation model for quality versus speed, optimize mask post-processing (contour antialiasing, edge feathering), and test edge cases such as uneven lighting, complex backgrounds, and fast motion.
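Two common post-processing steps can be sketched in a few lines: feathering the edge with a smoothstep ramp, and blending each mask with the previous one to damp frame-to-frame jitter. This is one possible approach, not the only one; the thresholds (0.3 and 0.7) and the smoothing weight (0.6) are illustrative assumptions that need tuning per model and device.

```kotlin
// Remaps raw confidence: below lo becomes fully background, above hi
// fully foreground, with a smooth ramp in between.
fun feather(confidence: Float, lo: Float = 0.3f, hi: Float = 0.7f): Float {
    val t = ((confidence - lo) / (hi - lo)).coerceIn(0f, 1f)
    return t * t * (3f - 2f * t) // smoothstep
}

// Exponential smoothing across frames. A higher weight follows motion
// more closely; a lower weight suppresses more jitter but adds trailing.
fun smoothOverTime(previous: FloatArray, current: FloatArray, weight: Float = 0.6f): FloatArray {
    require(previous.size == current.size)
    return FloatArray(current.size) { i -> weight * current[i] + (1f - weight) * previous[i] }
}

fun main() {
    var smoothed = FloatArray(3) // starts as all background
    val frames = listOf(floatArrayOf(0.1f, 0.5f, 0.9f), floatArrayOf(0.2f, 0.6f, 1.0f))
    for (raw in frames) {
        val feathered = FloatArray(raw.size) { feather(raw[it]) }
        smoothed = smoothOverTime(smoothed, feathered)
        println(smoothed.joinToString())
    }
}
```

The trade-off shows up under fast motion: too much temporal smoothing brings back the trailing effect that on-device segmentation was meant to remove, so this is worth testing with a waving hand in front of the camera.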
Timeline Estimates
A basic implementation with a blurred background on one platform takes 2–3 weeks. A full implementation that supports static images and video backgrounds on both platforms and integrates into an existing WebRTC stack requires 5–8 weeks.