Model Conversion to GGUF Format for llama.cpp
GGUF (GPT-Generated Unified Format) — a binary format for storing LLM weights and metadata, used by llama.cpp and tools built on it (Ollama, LM Studio, GPT4All). It replaced the deprecated GGML format. Most HuggingFace LLMs can be converted to GGUF in a few commands.
Conversion Process
Step 1: Get convert_hf_to_gguf.py from the llama.cpp repository and install its Python dependencies (pip install -r requirements.txt in the repo root)
Step 2: Convert to F16 GGUF:
python convert_hf_to_gguf.py /path/to/model --outtype f16 --outfile model-f16.gguf
Step 3: Quantize via llama-quantize:
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
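Steps 2–3 can be scripted. A minimal sketch, assuming llama.cpp has been cloned and built in `llama.cpp/` (the model directory and output prefix below are placeholders):

```python
import subprocess

def convert_and_quantize(model_dir: str, out_prefix: str, quant: str = "Q4_K_M"):
    """Build the two pipeline commands: HF -> F16 GGUF, then quantize.

    The llama.cpp paths are an assumption (repo cloned and built in
    ./llama.cpp). Returns the command lists so they can be inspected
    before running.
    """
    f16 = f"{out_prefix}-f16.gguf"
    quantized = f"{out_prefix}-{quant.lower()}.gguf"
    convert_cmd = [
        "python", "llama.cpp/convert_hf_to_gguf.py", model_dir,
        "--outtype", "f16", "--outfile", f16,
    ]
    quantize_cmd = ["llama.cpp/llama-quantize", f16, quantized, quant]
    return convert_cmd, quantize_cmd

# Actually running the pipeline (needs llama.cpp built and a local model):
# for cmd in convert_and_quantize("/path/to/model", "model"):
#     subprocess.run(cmd, check=True)
```

Keeping the F16 intermediate file around lets you re-quantize to other types without repeating the slower HF conversion step.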
Quantization Selection
| Type | Size (7B model) | Quality | Use Case |
|---|---|---|---|
| Q4_K_M | ~4.1 GB | Good | Optimal balance |
| Q5_K_M | ~5.0 GB | Very good | When RAM allows |
| Q8_0 | ~7.7 GB | Excellent | Maximum quality |
| Q3_K_M | ~3.3 GB | Acceptable | Minimum size |
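The sizes in the table follow directly from bits per weight: a quantized file is roughly n_params × bits_per_weight / 8 bytes, plus a small amount of metadata. A sketch for estimating sizes before quantizing; the bits-per-weight values are rough estimates of my own, not official llama.cpp figures:

```python
# Approximate effective bits per weight for common quantization types
# (rough estimates, not official figures).
BITS_PER_WEIGHT = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}

def estimate_gguf_size_gb(n_params: int, quant: str) -> float:
    """Rough GGUF file size in GB: params * bits-per-weight / 8."""
    bpw = BITS_PER_WEIGHT[quant]
    return n_params * bpw / 8 / 1e9

# A 7B model at Q4_K_M lands near the ~4 GB row in the table above.
print(round(estimate_gguf_size_gb(7_000_000_000, "Q4_K_M"), 1))  # -> 4.2
```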
Supported Architectures
LLaMA, Mistral, Qwen, Phi, Gemma, DeepSeek, Falcon, MPT, GPT-J/NeoX. Full list in llama.cpp documentation.
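A quick pre-check before converting: the model's config.json declares a model_type that must map to a llama.cpp architecture. A sketch with a partial, illustrative set of values (not the full supported list):

```python
import json
from pathlib import Path

# Partial set of HF `model_type` values convertible by llama.cpp
# (illustrative only; see llama.cpp docs for the full list).
SUPPORTED = {
    "llama", "mistral", "qwen2", "phi3", "gemma",
    "falcon", "mpt", "gptj", "gpt_neox",
}

def check_convertible(model_dir: str) -> bool:
    """Return True if the model's declared architecture looks convertible."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config.get("model_type", "").lower() in SUPPORTED
```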
Timeframe: 1–3 days
The conversion itself is a quick, mechanical procedure. Most of the time goes to testing output quality after quantization and selecting the optimal quantization type.
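Quality testing is typically done by comparing perplexity across quantization types with llama.cpp's llama-perplexity tool on a reference text. A sketch that builds the command and parses the result; the binary path and the "Final estimate: PPL = …" output line are assumptions based on recent llama.cpp builds and may change:

```python
import re

def perplexity_cmd(model_path: str, text_file: str) -> list[str]:
    """Command for llama.cpp's perplexity tool (binary path is an assumption)."""
    return ["llama.cpp/llama-perplexity", "-m", model_path, "-f", text_file]

def parse_final_ppl(output: str) -> float:
    """Extract the final perplexity from llama-perplexity output.

    Assumes a line like 'Final estimate: PPL = 5.1234 +/- 0.02'.
    """
    match = re.search(r"Final estimate: PPL = ([0-9.]+)", output)
    if match is None:
        raise ValueError("no final PPL line found in output")
    return float(match.group(1))

# Compare e.g. Q4_K_M against Q8_0 on the same text: if the PPL gap is
# only a few percent, the smaller quantization is probably good enough.
```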