LLM Inference Optimization via llama.cpp
llama.cpp — a C++ implementation of LLM inference with aggressive optimizations for CPU and mixed CPU+GPU execution. It enables running 7–70B-parameter models on ordinary servers and even laptops without expensive GPU infrastructure.
Key llama.cpp Optimizations
Quantization: the primary reason to choose llama.cpp. The GGUF format supports Q4_0, Q4_K_M, Q5_K_M, Q8_0, F16, and others. Q4_K_M offers the best quality/size balance for most cases: 4-bit quantization that preserves roughly 95–98% of F16 quality.
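Quantization is done with the llama-quantize tool shipped with llama.cpp; a minimal sketch (the file names and the ~4.8 effective bits/weight figure for Q4_K_M are assumptions):

```shell
# Quantize an F16 GGUF to Q4_K_M (run from a llama.cpp build; file names are placeholders):
#   ./llama-quantize llama-3-8b-f16.gguf llama-3-8b-Q4_K_M.gguf Q4_K_M

# Rough file-size estimate: params (billions) * effective bits per weight / 8 = GB.
# K-quants store block scales alongside 4-bit weights, so effective bits > 4 (assumption: ~4.8).
est_gb() { awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f", p * b / 8 }'; }
est_gb 8 4.8    # Llama 3 8B at ~4.8 effective bits/weight, close to the ~5 GB in the table below
```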
Metal (Apple Silicon): on M-series chips, GPU acceleration is enabled automatically via the Metal API. Llama 3 8B: 30–50 tokens/sec on M2 Pro.
CUDA acceleration: partial layer offload to the GPU (n_gpu_layers). If the entire model doesn't fit in GPU memory, the remaining layers run on the CPU (hybrid CPU+GPU).
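A sketch of hybrid offload with llama-cli (the model file and the layer count of 32 are example values; `-ngl` is the short form of `--n-gpu-layers`):

```shell
# Offload the first 32 transformer layers to the GPU; the rest run on the CPU.
# A 70B model has ~80 layers, so this keeps roughly 60% of the weights in RAM.
./llama-cli -m llama-3-70b-Q4_K_M.gguf -ngl 32 -p "Hello"
```

Raising `-ngl` until GPU memory is nearly full is the usual way to find the fastest split.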
AVX/AVX2/AVX-512: CPU optimizations for Intel/AMD servers; compile for the specific target CPU to enable them.
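A build sketch using llama.cpp's CMake flags (native CPU tuning picks up AVX variants automatically):

```shell
# CPU build tuned for the machine it is compiled on (AVX/AVX2/AVX-512 auto-detected):
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j

# CUDA variant for hybrid CPU+GPU inference:
#   cmake -B build -DGGML_CUDA=ON
```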
Use Cases
Privacy-first deployment: a corporate chatbot with the LLM entirely on-premise, without GPUs. Llama 3 70B Q4_K_M on a dual-socket Xeon server: 5–12 tokens/sec, acceptable for most corporate tasks.
Edge servers: Raspberry Pi 5, Orange Pi, Intel NUC. Llama 3 8B Q4 on a Pi 5: 3–5 tokens/sec, sufficient for simple tasks.
Performance
| Model | Q4_K_M Size | Hardware | Speed |
|---|---|---|---|
| Llama 3.2 3B | 2 GB | M2 Pro | 60–80 t/s |
| Llama 3 8B | 5 GB | M2 Max | 40–60 t/s |
| Llama 3 70B | 40 GB | 2×RTX 4090 | 20–30 t/s |
| Llama 3 8B | 5 GB | RTX 4090 | 100–120 t/s |
Setup: 1–2 weeks
Compilation for the target hardware, quantization selection, llama-server configuration (OpenAI-compatible API), and monitoring.
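A sketch of serving a model through llama-server's OpenAI-compatible API (model path, port, and context size are example values):

```shell
# Start the server with a 4096-token context on port 8080:
./llama-server -m llama-3-8b-Q4_K_M.gguf --port 8080 -c 4096

# Query it with the standard chat completions endpoint, so existing
# OpenAI client libraries work by pointing their base URL at this server:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```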