Implementing AI Short Clip Generation in a Mobile App
Short clips in the TikTok/Reels format differ from regular video generation in three ways: aspect ratio (9:16), duration (5–15 seconds), and turnaround speed. Users expect a quick result that is ready to publish without manual cropping.
Breaking the task into components
Generating a short clip from text or a photo is not a single operation but a pipeline:
- Text/Image → Video: the actual generation (Kling, Hailuo, or Runway in 9:16 mode)
- Music: AI selection or generation (Suno API, ElevenLabs Sound Effects)
- Subtitles/text: overlay with a custom font
- Final compression: H.264/H.265 for an optimal file size
Each step can run on the backend or on the client. Video generation always runs on the server; editing can be done on-device with FFmpeg.
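The split described above can be written down as a small table in code, which makes it easy to route each stage to the right executor. This is only a sketch of one possible split: the enum and function names are illustrative and do not come from any SDK, and placing music and subtitles on the device is an assumption (music generation may equally run on the backend).

```kotlin
// Where a pipeline stage executes.
enum class RunsOn { SERVER, DEVICE }

// The four stages from the list above, tagged with an assumed execution target.
enum class Stage(val runsOn: RunsOn) {
    GENERATE_VIDEO(RunsOn.SERVER), // provider API call, always server-side
    ADD_MUSIC(RunsOn.DEVICE),      // assumption: muxed on device with FFmpeg
    ADD_SUBTITLES(RunsOn.DEVICE),  // assumption: text overlay on device
    COMPRESS(RunsOn.DEVICE)        // final H.264/H.265 encode
}

// Returns the stages that run on the given target, in pipeline order.
fun stagesOn(target: RunsOn): List<Stage> =
    Stage.values().filter { it.runsOn == target }
```

With this layout, `stagesOn(RunsOn.SERVER)` contains only the generation step, and everything else is queued for the on-device editor.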
Video generation in 9:16
Most APIs support aspect ratio selection:
// Kling API: create task
{
  "prompt": "A cinematic shot of...",
  "aspect_ratio": "9:16",
  "duration": "5",
  "mode": "std"
}
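Hailuo (MiniMax) and Luma Dream Machine expose similar options. Runway Gen-3 supports 768:1280, which is 3:5: close to 9:16 but slightly wider, so crop or pad the output if you need exact 9:16. For Image-to-Video, crop the source photo to 9:16 on the client before sending it.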
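The client-side crop comes down to a little integer arithmetic. The sketch below computes a centered 9:16 crop rectangle for a source of any size; `CropRect` and `centerCropTo9x16` are hypothetical names, and a centered crop is an assumption (a real app might let the user move the crop window or use face detection).

```kotlin
// Pixel rectangle to cut out of the source image.
data class CropRect(val x: Int, val y: Int, val width: Int, val height: Int)

// Computes the largest centered 9:16 rectangle that fits inside the source.
fun centerCropTo9x16(srcWidth: Int, srcHeight: Int): CropRect {
    require(srcWidth > 0 && srcHeight > 0) { "dimensions must be positive" }
    // Compare srcWidth/srcHeight with 9/16 by cross-multiplying to stay in integers.
    return if (srcWidth * 16 > srcHeight * 9) {
        // Source is wider than 9:16: keep the full height and trim the sides.
        val w = srcHeight * 9 / 16
        CropRect((srcWidth - w) / 2, 0, w, srcHeight)
    } else {
        // Source is 9:16 or taller: keep the full width and trim top and bottom.
        val h = srcWidth * 16 / 9
        CropRect(0, (srcHeight - h) / 2, srcWidth, h)
    }
}
```

For a 4000×3000 landscape photo this yields a 1687×3000 window starting at x = 1156. The same function also trims a 768×1280 frame to exact 9:16 (720×1280, 24 pixels off each side).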
FFmpeg on mobile: editing without a server
Once the generated clip arrives, music, subtitles, and transitions can be added directly on the device. Use ffmpeg-kit-react-native or the native ffmpeg-kit for iOS/Android; the LGPLv3 build bundles FFmpeg without GPL dependencies.
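// Android: overlay audio on video via FFmpegKit
import com.arthenica.ffmpegkit.FFmpegKit
import com.arthenica.ffmpegkit.ReturnCode

FFmpegKit.executeAsync(
    // -y overwrites an existing output; paths are quoted so spaces do not split them
    "-y -i \"$videoPath\" -i \"$audioPath\" " +
        // Fade the music out over the last second (starts at 4 s, so this assumes a 5-second clip)
        "-filter_complex \"[1:a]afade=t=out:st=4:d=1[a]\" " +
        "-map 0:v -map \"[a]\" " +
        // Copy the video stream as-is, encode audio to AAC, stop at the shorter input
        "-c:v copy -c:a aac -shortest " +
        "\"$outputPath\""
) { session ->
    if (ReturnCode.isSuccess(session.returnCode)) {
        // Done: outputPath now holds the clip with music
    } else {
        // Failed or cancelled: inspect session.output for the FFmpeg log
    }
}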
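Compression for Stories/Reels: -c:v libx264 -crf 23 -preset fast -vf scale=1080:1920. A typical 10-second clip comes out at 5–8 MB in H.264 at 1080p. Note that libx264 is GPL-licensed and is included only in the GPL variants of ffmpeg-kit; with an LGPL build you need a different encoder, such as the VideoToolbox encoder (h264_videotoolbox) on iOS.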
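Building the command as an argument list rather than one concatenated string avoids quoting problems with paths. The sketch below assembles the compression flags quoted above; `compressArgs` is a hypothetical helper, `-c:a copy` (keep the existing audio track untouched) is an added assumption, and `libx264` presumes an FFmpeg build that includes it.

```kotlin
// Builds the FFmpeg argument list for a 1080x1920 H.264 encode.
// crf: x264 quality setting, 0..51 for 8-bit video; lower means higher quality and larger files.
fun compressArgs(
    input: String,
    output: String,
    crf: Int = 23,
    preset: String = "fast"
): List<String> {
    require(crf in 0..51) { "crf must be in 0..51" }
    return listOf(
        "-y",                     // overwrite the output if it exists
        "-i", input,
        "-c:v", "libx264",        // requires a build that ships libx264
        "-crf", crf.toString(),
        "-preset", preset,
        "-vf", "scale=1080:1920", // force the 9:16 Full HD frame size
        "-c:a", "copy",           // assumption: audio is already in its final form
        output
    )
}
```

On Android the resulting list can be converted with `toTypedArray()` and passed to FFmpegKit's argument-array entry point (`executeWithArgumentsAsync`) instead of the string-based call.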
Clip templates
Real apps (CapCut-like) work from templates with a fixed structure: for example, a 1.5-second intro, 7 seconds of main content, and a 1.5-second outro. The user supplies only the text or photo; the template dictates timing and transitions.
The template is stored as JSON:
{
  "duration": 10,
  "segments": [
    {"type": "title_card", "duration": 1.5, "text_position": "center"},
    {"type": "ai_video", "duration": 7.0, "transition_in": "fade"},
    {"type": "outro", "duration": 1.5, "logo": true}
  ],
  "aspect_ratio": "9:16",
  "music": {"genre": "upbeat", "volume": 0.4}
}
The client renders the intro card (a custom text layer), replaces the ai_video segment with the generated clip, and composites the outro, all through AVFoundation (iOS) or MediaCodec (Android).
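Before compositing, the client needs each segment's position on the timeline. The sketch below turns the segment durations from a template like the one above into start/end offsets and rejects templates whose segments do not add up to the declared total. `Segment`, `Placement`, and `layoutTimeline` are hypothetical names, and JSON parsing is left out.

```kotlin
// One template segment: only the fields needed for timing.
data class Segment(val type: String, val duration: Double)

// A segment placed on the clip timeline, in seconds.
data class Placement(val type: String, val start: Double, val end: Double)

// Lays segments out back to back and checks them against the template's total duration.
fun layoutTimeline(totalDuration: Double, segments: List<Segment>): List<Placement> {
    val sum = segments.sumOf { it.duration }
    require(kotlin.math.abs(sum - totalDuration) < 1e-6) {
        "segments add up to $sum s, template declares $totalDuration s"
    }
    val placements = mutableListOf<Placement>()
    var start = 0.0
    for (segment in segments) {
        placements.add(Placement(segment.type, start, start + segment.duration))
        start += segment.duration
    }
    return placements
}
```

For the template shown earlier this places the title card at 0–1.5 s, the generated clip at 1.5–8.5 s, and the outro at 8.5–10 s, so the AI video request should ask for at least 7 seconds of footage.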
Music sync
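AI-generated music rarely syncs perfectly with the rhythm of the video. A better option is a library of short royalty-free tracks pre-segmented by beat. Where the generation API accepts an audio input (Kling and Hailuo are candidates; verify against their current documentation), the video can be generated to the beat rather than fitting the music afterwards.
Alternatively, send the generated video to the backend, generate music with Suno or ElevenLabs (text prompt → audio), then composite the result on the client.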
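With beat-segmented tracks, syncing mostly means moving cut points onto the nearest beat. The sketch below does that for a track with a known, constant tempo; `snapToBeat` is a hypothetical helper, and the constant-BPM assumption does not hold for tracks with tempo changes, where a list of detected beat timestamps would be needed instead.

```kotlin
import kotlin.math.round

// Moves a timestamp (seconds) to the nearest beat of a constant-tempo track.
// Assumes the first beat falls at 0.0 s.
fun snapToBeat(timeSec: Double, bpm: Double): Double {
    require(bpm > 0) { "bpm must be positive" }
    val beatLength = 60.0 / bpm // seconds per beat
    return round(timeSec / beatLength) * beatLength
}
```

At 120 BPM a beat lasts 0.5 s, so a transition planned at 8.6 s moves to 8.5 s, while a cut already on the grid, such as 1.5 s, stays where it is.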
Export and publishing
Final clip compression matters. Target sizes (platform limits change, so check the current upload requirements):
- TikTok: max 287.6 MB (impractical on mobile; aim for 50–100 MB)
- Instagram Reels: max 4 GB (safe at 100–200 MB)
- YouTube Shorts: max 100 MB
On-client compression via FFmpeg: -crf 25 gives a good quality/size balance (higher CRF values produce smaller files at lower quality), and -preset fast needs roughly 5–10 seconds to encode a 10-second clip on a mid-range device.
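The size figures translate directly into bitrates, which is useful for checking an export before upload. The sketch below uses the limits quoted in this article; the function names and platform keys are hypothetical, and megabytes are assumed to be decimal (1 MB = 1,000,000 bytes).

```kotlin
// Upload limits as quoted in this article, in decimal megabytes (assumption).
val uploadLimitMb: Map<String, Double> = mapOf(
    "tiktok" to 287.6,
    "instagram_reels" to 4000.0,
    "youtube_shorts" to 100.0
)

// Average bitrate in Mbit/s for a file of the given size and duration.
fun averageBitrateMbps(sizeMb: Double, durationSec: Double): Double {
    require(durationSec > 0) { "duration must be positive" }
    return sizeMb * 8.0 / durationSec
}

// Expected file size in MB for a given average bitrate and duration.
fun estimatedSizeMb(bitrateMbps: Double, durationSec: Double): Double =
    bitrateMbps * durationSec / 8.0

// True if a file of this size is within the platform's quoted limit.
fun fitsLimit(platform: String, sizeMb: Double): Boolean {
    val limit = uploadLimitMb[platform] ?: error("unknown platform: $platform")
    return sizeMb <= limit
}
```

A 10-second clip of 5–8 MB corresponds to an average of 4–6.4 Mbit/s, far below every limit listed, so for short clips the practical constraint is upload time rather than the platform cap.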
Timeline
Template-based clip generation with API integration, music, and on-device editing takes 2–3 weeks. A full editor with custom layouts, transitions, and effects takes 4–6 weeks. The cost depends on the complexity of the FFmpeg integration and on provider API calls.







