LLM Deployment on Edge Devices
LLMs on edge devices are not marketing buzz: they solve real problems in data privacy, offline operation, and latency-critical applications. Hardware requirements and model selection are the key engineering decisions.
Spectrum of Edge Devices for LLM
Apple Silicon (M-series): the best edge-LLM hardware today. Unified memory gives the GPU full-bandwidth access to all system RAM, with no PCIe transfer bottleneck. M2 Ultra: 192 GB of unified memory, enough to run Llama 3 70B in float16 (~140 GB of weights). Stack: the MLX framework or llama.cpp with the Metal backend.
NVIDIA Jetson Orin: up to 64 GB on the AGX Orin. CUDA-native, so TensorRT-LLM and DeepSpeed work. A production-grade edge AI server.
x86 Server (no GPU): llama.cpp with AVX-512. Llama 3 8B at Q4: 10–20 tokens/sec. A fit for low-throughput corporate tasks.
ARM Server (Ampere, AWS Graviton): good price/performance for batch inference.
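As a sanity check on throughput figures like the 10–20 tokens/sec above: single-stream decode is typically memory-bandwidth-bound, so a rough ceiling is bandwidth divided by model size. A minimal sketch; the one-weight-read-per-token assumption and the 100 GB/s bandwidth figure are illustrative, not from the text:

```python
# Rough decode-throughput ceiling for memory-bound LLM inference.
# Assumption: each generated token requires reading every weight once,
# so tokens/sec ~ memory bandwidth / model size in bytes.

def est_tokens_per_sec(params_b: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """params_b: parameter count in billions; bandwidth in GB/s."""
    model_gb = params_b * bytes_per_param
    return bandwidth_gb_s / model_gb

# Llama 3 8B at Q4 (~0.5 bytes/param) on a box with ~100 GB/s DRAM:
print(est_tokens_per_sec(8, 0.5, 100))  # -> 25.0 tokens/sec ceiling
```

Real throughput lands below this ceiling (attention, cache misses, sampling overhead), which is consistent with the 10–20 tokens/sec range quoted for CPU-only x86.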
Model Selection for Edge
| Parameter Count | RAM Required (Q4) | Use Case |
|---|---|---|
| 1–3B | 1.5–2.5 GB | Smartphones and other mobile devices |
| 7–8B | 5–6 GB | Raspberry Pi 5, low-end desktop |
| 13B | 9 GB | Mid-range edge server |
| 70B | 40 GB | Jetson Orin AGX, M2 Ultra |
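The Q4 column can be approximated as roughly 0.5–0.6 bytes per parameter (4-bit weights plus quantization scales) plus fixed runtime overhead for KV cache and buffers. A rough sketch; the 0.56 coefficient and 1 GB overhead are assumptions tuned to track the table, not measured values:

```python
# Back-of-envelope Q4 RAM estimate: ~0.56 bytes/param (4-bit weights
# plus quantization scales, an assumed coefficient) plus ~1 GB of
# runtime overhead for KV cache and buffers.

def q4_ram_gb(params_b: float, bytes_per_param: float = 0.56,
              overhead_gb: float = 1.0) -> float:
    return params_b * bytes_per_param + overhead_gb

for n in (8, 13, 70):
    print(f"{n}B -> ~{q4_ram_gb(n):.1f} GB")
```

The estimates (about 5.5, 8.3, and 40 GB) line up with the table within a gigabyte; longer context windows push the KV-cache overhead up.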
Serving Stack
Ollama: the simplest deployment, OpenAI-compatible API, automatic model management. Production-ready for a single instance.
vLLM (if CUDA available): best throughput via PagedAttention. For concurrent requests.
llama-server: part of llama.cpp, OpenAI-compatible, lightweight.
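All three servers expose an OpenAI-compatible chat-completions endpoint, so a stdlib-only client works against any of them. A minimal sketch; the base URL (Ollama's default port) and model name are assumptions to adjust for your deployment:

```python
import json
import urllib.request

# Minimal client for any OpenAI-compatible endpoint (Ollama, vLLM and
# llama-server all serve /v1/chat/completions). BASE_URL below assumes
# Ollama's default port; change it for your setup.
BASE_URL = "http://localhost:11434/v1"

def build_chat_request(model: str, prompt: str,
                       max_tokens: int = 256) -> dict:
    """Assemble the request body for /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(model: str, prompt: str) -> str:
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Usage (requires a running server): chat("llama3", "Hello!")
```

Because the wire protocol is the same everywhere, switching between Ollama, vLLM, and llama-server is a one-line URL change.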
Edge Optimizations
Speculative decoding (a small draft model proposes tokens; the larger target model verifies them in one pass): 2–3× decode speedup at a modest extra memory cost. KV-cache quantization cuts per-token context memory. Limiting the context window also helps: smaller context, less KV-cache memory.
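The accept/reject loop behind speculative decoding can be sketched with stand-in "models" (here just functions over a token list; greedy verification only, no sampling, so this is a toy illustration of the control flow rather than a real implementation):

```python
# Toy sketch of speculative decoding with greedy verification: a cheap
# draft model proposes k tokens, the target model checks them and keeps
# the longest agreeing prefix plus one corrected (or bonus) token.
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft: Callable[[List[int]], int],
                     target: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    # Draft proposes k tokens autoregressively (cheap forward passes).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # Target verifies: accept while its greedy choice matches the draft;
    # on the first mismatch, emit the target's token instead and stop.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        want = target(ctx)
        if want == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(want)
            break
    else:
        accepted.append(target(ctx))  # bonus token: all k were accepted
    return accepted

# Stub models: target counts upward; draft agrees except when the
# context length is a multiple of 3.
target = lambda ctx: len(ctx)
draft = lambda ctx: len(ctx) if len(ctx) % 3 else len(ctx) + 1

print(speculative_step([0], draft, target))  # -> [1, 2, 3]
```

The speedup comes from the target verifying all k draft tokens in a single batched forward pass instead of k sequential ones; in production the verification step also resamples from the two models' probability distributions rather than comparing greedy picks.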
Pipeline: 2–4 weeks
Hardware evaluation, model and quantization selection, serving setup, application integration, load testing.