Building an AI Assistant with Llama (Meta) in a Mobile Application
Llama is Meta's family of open-weight models. It is the choice when you need an assistant that works completely on-device (no server, no cloud, no data leaving the phone), or when you need to adapt the model to a domain by fine-tuning on your own data. But "open weights" doesn't mean "works on your phone out of the box": significant technical work is required for quantization and runtime selection.
Architecture: On-Device vs Server Llama
Two completely different scenarios:
On-device — the model is loaded into the phone's memory and inference runs without internet. Realistic for Llama 3.2 1B and 3B in INT4 quantization. Llama 3.2 3B INT4 takes ~2 GB of RAM and runs on an iPhone 15 Pro at 15–25 tokens/sec.
Server Llama — the model runs on your own GPU server (or a rented one), and the mobile client communicates via an API. This allows using Llama 3.3 70B or Llama 3.1 405B: full models, indistinguishable in quality from GPT-4.
For most commercial applications, the server option wins. On-device is justified by strict privacy requirements (e.g., medical data that must never leave the device) or by the need for offline operation.
On-Device: llama.cpp, Core ML, ExecuTorch
llama.cpp — the most mature runtime for running GGUF models. On iOS it compiles as a C++ library, called through an Objective-C++ bridging header; on Android, through JNI. The main complexity is building for multiple ABIs (arm64-v8a for modern devices, armeabi-v7a for older ones).
```swift
// iOS — minimal wrapper over llama.cpp (C API via Objective-C++ bridging header)
class LlamaContext {
    private var model: OpaquePointer?
    private var context: OpaquePointer?

    init?(modelPath: String) {
        var params = llama_context_default_params()
        params.n_ctx = 4096
        params.n_threads = 4  // fewer threads — less heat and battery drain
        guard let model = llama_load_model_from_file(modelPath, llama_model_default_params()) else {
            return nil  // model file missing or corrupt
        }
        self.model = model
        context = llama_new_context_with_model(model, params)
    }

    deinit {
        // llama.cpp objects are manually managed C resources — free them explicitly
        llama_free(context)
        llama_free_model(model)
    }

    func generate(prompt: String, maxTokens: Int = 256) -> AsyncStream<String> {
        // tokenize → sample loop → detokenize
    }
}
```
Apple MLX / Core ML — Apple provides official converter for Llama models to Core ML format. Advantage: Neural Engine is engaged automatically, inference is faster and cooler than via CPU. Limitation: iOS 17+ only.
ExecuTorch — Meta's runtime for mobile, officially supports Llama 3. More complex build, but better Android Neural Networks API integration.
Quantization: Choosing Precision
| Type | Size (3B) | Quality | Speed |
|---|---|---|---|
| FP16 | ~6 GB | Reference | Slow |
| Q8_0 | ~3.3 GB | ≈FP16 | Moderate |
| Q4_K_M | ~2.0 GB | Good | Fast |
| Q2_K | ~1.3 GB | Noticeably worse | Very fast |
For most mobile tasks, Q4_K_M is the optimal balance. Q2_K can be considered for devices with only 4 GB of RAM.
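The sizes in the table follow from simple arithmetic: parameter count times effective bits-per-weight. A minimal sketch, assuming approximate effective bpw values for llama.cpp's quant formats (K-quants mix precisions per tensor, so effective bpw exceeds the nominal bit count) and Llama 3.2 3B's ~3.21B parameters:

```python
# Back-of-the-envelope model size: params × effective bits-per-weight / 8.
# The bpw figures below are approximate averages for llama.cpp quant formats.

def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in GB (decimal) of a quantized model."""
    return n_params * bits_per_weight / 8 / 1e9

LLAMA_32_3B = 3.21e9  # parameter count of Llama 3.2 3B

for name, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85), ("Q2_K", 3.35)]:
    print(f"{name:7s} ≈ {model_size_gb(LLAMA_32_3B, bpw):.1f} GB")
```

The same formula lets you check in advance whether a given quant will fit in a target device's RAM budget (leave headroom for the KV cache and the OS).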
Server Llama: Ollama and vLLM
For server deployment — Ollama (simplicity) or vLLM (performance). Ollama exposes an OpenAI-compatible API at POST /v1/chat/completions (its native endpoint, POST /api/chat, uses a similar but not identical format). A mobile client written for the OpenAI Chat Completions API works with Ollama without changes — just change the base URL.
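A minimal sketch of what such a request looks like, assuming Ollama runs locally on its default port 11434 and has pulled a model tagged "llama3.2" (both the host and the model tag are assumptions for illustration):

```python
# Build an OpenAI-style chat completion request for Ollama's
# OpenAI-compatible endpoint; the caller performs the actual I/O.
import json
import urllib.request

def build_chat_request(base_url: str, model: str, user_message: str) -> urllib.request.Request:
    payload = {
        "model": model,  # assumed tag of a locally pulled model
        "messages": [
            {"role": "system", "content": "You are a concise mobile assistant."},
            {"role": "user", "content": user_message},
        ],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:11434", "llama3.2", "Hello!")
# urllib.request.urlopen(req)  # uncomment with a running Ollama instance
```

Because the payload shape is identical to OpenAI's, the mobile client needs only the base URL (and possibly the model name) as configuration.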
vLLM is preferable for production under load: continuous batching and tensor parallelism across GPUs give throughput far higher than Ollama's when serving many concurrent requests.
Fine-tuning: When and How
The base Llama suffices for a general assistant. Fine-tuning is justified when specialization is needed: medical terminology, legal style, industry specifics. LoRA/QLoRA is the standard approach for fine-tuning on a single GPU. The trained adapters (~50–100 MB) are loaded on top of the base model.
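The small adapter size follows from LoRA's construction: each adapted d×k weight matrix gains two low-rank factors, adding only r·(d+k) parameters. A rough sketch with illustrative layer counts and matrix shapes (not tied to a specific model):

```python
# Rough LoRA adapter sizing. Each adapted d×k weight gains factors
# A (r×k) and B (d×r), i.e. r*(d + k) extra parameters.
# Layer count and shapes below are illustrative, not a specific model.

def lora_params(d: int, k: int, r: int) -> int:
    """Extra parameters LoRA adds to one d×k weight matrix at rank r."""
    return r * (d + k)

n_layers = 32
per_layer = 4 * lora_params(4096, 4096, 16)  # e.g. adapting 4 projections per layer
total = n_layers * per_layer
size_mb = total * 2 / 1e6  # FP16 = 2 bytes per parameter
print(f"{total / 1e6:.1f}M params, ~{size_mb:.0f} MB adapter")
```

Higher ranks, or adapting more matrices per layer, push the adapter toward the ~50–100 MB range quoted above; either way it stays a tiny fraction of the base model.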
Timeline Estimates
A server Llama with Ollama and a mobile client — 1–2 weeks. On-device via llama.cpp, with iOS/Android builds and model-download management — 3–5 weeks. Fine-tuning plus deployment is a separate estimate.