AI Video Quality Enhancement in Mobile Apps
Video quality enhancement is fundamentally different from photo enhancement. A photo can take several seconds to process because the user will wait. Video needs either real-time processing (during recording or streaming) or an acceptable post-processing speed (roughly a 1-minute clip in 2–3 minutes). This constraint dictates both the model choice and the solution architecture.
Two Modes—Different Architectures
Post-processing of recorded video: read the file frame by frame through a decoder, run each frame through an ML model, and encode the result. Throughput matters, but real-time is not required.
Real-time enhancement during recording or playback: a strict budget of about 33 ms per frame at 30 fps. Complex models won't fit, so quality has to be traded for speed.
Most client tasks fall into the first mode: the user has an old recording or an uploaded file and taps "enhance".
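The two budgets can be compared with a little arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it assumes a 30 fps clip, which matches the 33 ms figure above.

```kotlin
// Per-frame time budget for real-time processing, in milliseconds.
fun realTimeBudgetMs(fps: Double): Double = 1000.0 / fps

// Per-frame budget for post-processing: the total allowed processing time
// divided by the number of frames in the clip.
fun postProcessBudgetMs(clipSeconds: Double, fps: Double, allowedSeconds: Double): Double {
    val frames = clipSeconds * fps
    return allowedSeconds * 1000.0 / frames
}
```

For a 1-minute clip at 30 fps, a 2–3 minute target works out to roughly 67–100 ms per frame, against about 33 ms in real time. That budget has to cover decoding, color conversion, inference and encoding together, not inference alone.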
Post-Processing on iOS: AVAssetReader + Metal + Core ML
import AVFoundation
// Read the video as a sequence of CVPixelBuffers (videoURL points to the source file)
let asset = AVAsset(url: videoURL)
let reader = try AVAssetReader(asset: asset)
// Note: tracks(withMediaType:) is deprecated since iOS 16 in favor of the async loadTracks(withMediaType:)
let output = AVAssetReaderTrackOutput(
    track: asset.tracks(withMediaType: .video).first!,
    outputSettings: [kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange]
)
reader.add(output)
reader.startReading()
// Parallel writer for the result (resultURL and videoSettings are defined elsewhere)
let writer = try AVAssetWriter(outputURL: resultURL, fileType: .mp4)
let writerInput = AVAssetWriterInput(mediaType: .video, outputSettings: videoSettings)
let adaptor = AVAssetWriterInputPixelBufferAdaptor(assetWriterInput: writerInput, ...)
// The input must be attached to the writer before writing starts
writer.add(writerInput)
Per frame: convert YUV → RGB in a Metal shader, run the Core ML model, convert RGB → YUV, and append the result to the AVAssetWriter. Doing the conversion in Metal is critical: on the CPU, color-space conversion alone costs 15–20 ms per frame.
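The YUV → RGB step is a fixed per-pixel matrix, which is why it maps well onto a shader. As a rough CPU reference for checking shader output, here is a minimal sketch of the math. It assumes video-range BT.709 input, which is typical for HD but not guaranteed; SD content usually needs the BT.601 matrix. The names `Rgb` and `yuvToRgbBt709` are invented for this example, and it is written in Kotlin only to keep it short and testable.

```kotlin
import kotlin.math.roundToInt

// RGB triple with 8-bit channels.
data class Rgb(val r: Int, val g: Int, val b: Int)

// Converts one video-range pixel (Y in 16..235, Cb/Cr in 16..240) to RGB
// using the commonly quoted BT.709 coefficients.
// Assumption: the source really is BT.709 video range.
fun yuvToRgbBt709(y: Int, cb: Int, cr: Int): Rgb {
    val yf = 1.164 * (y - 16)   // expand luma from video range to full range
    val u = cb - 128            // center chroma around zero
    val v = cr - 128
    fun clamp(x: Double) = x.roundToInt().coerceIn(0, 255)
    return Rgb(
        r = clamp(yf + 1.793 * v),
        g = clamp(yf - 0.213 * u - 0.533 * v),
        b = clamp(yf + 2.112 * u)
    )
}
```

A production pipeline would run the same arithmetic per pixel on the GPU and read the actual matrix and range from the track's color metadata rather than hard-coding them.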
Which model? For video denoising and super-resolution, the reference temporal models are RVRT and BasicVSR++, which take neighboring frames into account. But they need batches of frames and were not designed for mobile. The compromise is either a per-frame model such as Real-ESRGAN (ignores temporal consistency, simpler to deploy) or a lightweight temporal model with a 3–5 frame window.
Temporal flickering is the main problem of the per-frame approach: neighboring frames are processed independently, so fine details and edges "flicker" from frame to frame. The fix is either a temporal consistency loss during fine-tuning, or post-processing that blends in activations from the neighboring frame with a weight of 0.1–0.2.
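The blending idea can be sketched in a few lines. This is a simplified variant, not the exact method described above: it blends output pixel values rather than intermediate activations, and the function name and default weight of 0.15 are assumptions chosen from within the 0.1–0.2 range.

```kotlin
// Blends the current enhanced frame with the previous output frame.
// `weight` is the share taken from the previous frame (0.1–0.2 in the text).
// Both frames are flat arrays of channel values with identical layout.
fun blendWithPrevious(
    current: FloatArray,
    previous: FloatArray,
    weight: Float = 0.15f
): FloatArray {
    require(current.size == previous.size) { "frames must have the same size" }
    require(weight in 0f..1f) { "weight must be in [0, 1]" }
    return FloatArray(current.size) { i ->
        (1f - weight) * current[i] + weight * previous[i]
    }
}
```

Feeding each blended result back in as `previous` for the next frame gives an exponential moving average. Higher weights suppress flicker more but start to ghost on fast motion, which is one reason to keep the weight small.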
Android: MediaCodec + TFLite
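import android.media.MediaCodec
import android.media.MediaFormat
import android.view.Surface
// Use MediaCodec for decoding; format is the video track's MediaFormat
val decoder = MediaCodec.createDecoderByType(format.getString(MediaFormat.KEY_MIME)!!)
// Surface for decoding directly into an OpenGL texture.
// configure() may only be called once per codec instance, so pass the Surface here.
val surface = Surface(surfaceTexture)
decoder.configure(format, surface, null, 0)
decoder.start()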
On Android, decode to a Surface, process the texture with OpenGL or Vulkan compute, and feed the result back to a MediaCodec encoder. The TFLite GPU delegate can read its input directly from GPU memory (setExternalContext() here; the exact call varies between TFLite versions), which eliminates one memory copy per frame.
For Full HD (1920×1080), slicing into 256×256 tiles gives 40 tiles on a plain grid (8 × 5) and roughly 48 once tiles overlap to hide seams. At 15 ms of inference per tile, 48 tiles cost 720 ms per frame, so a one-minute video (1800 frames at 30 fps) takes about 22 minutes. That is acceptable for an editor, not for quick-enhance. For quick-enhance, use a light model (2–4 MB TFLite), no tiling, and downscale to 540p for inference.
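The tiling arithmetic above is easy to reproduce for other resolutions and tile sizes. The helper names below are hypothetical. Note that a plain 256×256 grid over 1920×1080 needs 8 × 5 = 40 tiles; overlapping tiles raise that count, and the time estimate uses the ~48 figure from the text.

```kotlin
// Number of tiles needed to cover one axis of length `size`,
// where neighboring tiles share `overlap` pixels.
fun tilesAlongAxis(size: Int, tile: Int, overlap: Int = 0): Int {
    require(tile > 0 && overlap >= 0 && overlap < tile) { "invalid tile/overlap" }
    if (size <= tile) return 1
    val stride = tile - overlap
    // Ceiling division: extra steps after the first tile, plus the first tile itself.
    return (size - tile + stride - 1) / stride + 1
}

// Total tiles per frame.
fun tileCount(width: Int, height: Int, tile: Int, overlap: Int = 0): Int =
    tilesAlongAxis(width, tile, overlap) * tilesAlongAxis(height, tile, overlap)

// Total inference time for a clip, in minutes.
fun clipMinutes(tilesPerFrame: Int, msPerTile: Double, frames: Int): Double =
    tilesPerFrame * msPerTile * frames / 60_000.0
```

With 48 tiles at 15 ms each over 1800 frames, `clipMinutes` gives 21.6 minutes, in line with the ~22 minutes quoted above. The estimate covers inference only; decode, color conversion and encode come on top.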
Real-Time Denoising During Recording
If real-time is required, work at reduced resolution. Models trained on 360p/480p input are fast enough. On iPhone 14 Pro, ESRGAN x2 on 480p input runs in about 28 ms on the ANE; an Android device with Snapdragon 8 Gen 2 is comparable via the GPU delegate.
For CameraX with custom processing, use the ImageAnalysis use case with setBackpressureStrategy(STRATEGY_KEEP_ONLY_LATEST): frames are dropped rather than queued when processing lags.
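The keep-only-latest behavior can be modeled as a single-slot mailbox. The sketch below is a plain-Kotlin model of that policy for illustration, not CameraX code; `LatestFrameSlot` is a made-up name.

```kotlin
import java.util.concurrent.atomic.AtomicReference

// Single-slot mailbox: a new frame overwrites any frame not yet consumed,
// so the consumer always sees the most recent frame and nothing queues up.
class LatestFrameSlot<T : Any> {
    private val slot = AtomicReference<T?>(null)

    // Called by the producer (camera). Returns the frame that was displaced,
    // if any, so the caller can release its buffer.
    fun offer(frame: T): T? = slot.getAndSet(frame)

    // Called by the consumer (ML worker). Returns null when nothing new arrived.
    fun take(): T? = slot.getAndSet(null)
}
```

If the camera delivers three frames while the model is still busy, the worker's next `take()` returns only the third, and the first two are handed back to the producer for release. Latency stays bounded by one inference instead of growing with a backlog.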
Don't Break Audio
The video pipeline processes only the video track. Audio is copied unmodified via AVAssetReaderTrackOutput (iOS) or MediaExtractor (Android). A typical mistake is forgetting to keep PTS (presentation timestamps) in sync between the audio and the processed video track. AI processing doesn't change frame duration, so copy each PTS from the original one-to-one.
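The one-to-one PTS rule can be enforced with a small guard before frames reach the encoder. This is a minimal sketch with hypothetical names (`TimedFrame`, `attachOriginalPts`); it assumes timestamps are collected from the decoder output, which arrives in presentation order.

```kotlin
// A processed frame paired with the timestamp it will be written with.
data class TimedFrame<T>(val ptsUs: Long, val frame: T)

// Pairs each processed frame with the PTS of the source frame at the same index.
// Fails fast if frames were dropped or duplicated during processing.
fun <T> attachOriginalPts(originalPtsUs: List<Long>, processed: List<T>): List<TimedFrame<T>> {
    require(originalPtsUs.size == processed.size) { "frame count changed during processing" }
    require(originalPtsUs.zipWithNext().all { (a, b) -> a < b }) {
        "PTS must be strictly increasing"
    }
    return originalPtsUs.zip(processed) { pts, frame -> TimedFrame(pts, frame) }
}
```

Because the timestamps are reused rather than recomputed from a nominal frame rate, variable-frame-rate recordings stay aligned with the untouched audio track.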
Process
Analyze the target scenarios (post-processing vs real-time), pick a model based on measurements on target devices, implement decoding and encoding via AVFoundation or MediaCodec, build the ML pipeline with or without tiling, and test edge cases: HDR video, non-standard resolutions, rotation.
Timeline Estimates
One-platform post-processing with a per-frame model takes 3–5 weeks. Both platforms with temporal consistency and a real-time mode require 8–14 weeks.







