Fine-Tuning Phi Language Models (Microsoft)
Phi is a family of compact language models from Microsoft Research, built on the principle that "data quality matters more than parameter count". Phi-3 and Phi-4 deliver results comparable to models 3–5× larger on reasoning and programming tasks, which makes them attractive for edge deployment, mobile applications, and other compute-constrained scenarios.
Phi Model Lineup
| Model | Parameters | VRAM (fp16) | Key Feature |
|---|---|---|---|
| Phi-3-mini-4k | 3.8B | 7.6 GB | Edge/mobile |
| Phi-3-mini-128k | 3.8B | 7.6 GB | Long context |
| Phi-3-small | 7B | 14 GB | Balance |
| Phi-3-medium | 14B | 28 GB | High quality |
| Phi-4 | 14B | 28 GB | Current flagship |
| Phi-4-mini | 3.8B | 7.6 GB | Compact flagship |
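The fp16 VRAM figures in the table are essentially parameters × 2 bytes. A quick back-of-envelope sketch (weights only; KV cache and activations add more on top):

```python
def model_size_gb(n_params: float, bytes_per_param: float) -> float:
    """Rough weight-memory footprint in GB (weights only, excludes KV cache/activations)."""
    return n_params * bytes_per_param / 1e9

# fp16/bf16 stores 2 bytes per parameter
print(model_size_gb(3.8e9, 2))  # Phi-3-mini / Phi-4-mini -> 7.6 GB
print(model_size_gb(14e9, 2))   # Phi-3-medium / Phi-4   -> 28.0 GB
```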
With 14B parameters, Phi-4 outperforms Llama 3.1 70B on several math and programming benchmarks, a result of its high-quality training data (synthetic data and textbook-style text).
LoRA Fine-Tuning Phi-4 via Transformers + TRL
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
import torch

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")

# `dataset` is assumed to be a prepared Hugging Face Dataset
# with a "messages" or "text" column.
trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./phi4-finetuned",
        num_train_epochs=4,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=1e-4,
        bf16=True,
        max_seq_length=8192,
    ),
    peft_config=LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    ),
    train_dataset=dataset,
)
trainer.train()
```
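With r=16 on four attention projections, the adapter is a tiny fraction of the 14B base. A back-of-envelope count (the hidden size and layer count below are illustrative assumptions, not official Phi-4 config values):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA factorizes the weight update as B @ A: A is (r, d_in), B is (d_out, r)
    return r * d_in + d_out * r

# Illustrative: a 5120-dim square projection with r=16
per_module = lora_param_count(5120, 5120, 16)  # 163,840 params
# 4 target modules per layer × 40 layers (assumed depth)
total = per_module * 4 * 40
print(f"{total / 1e6:.1f}M trainable params")  # ~26.2M, roughly 0.2% of 14B
```

This is why QLoRA fits on a single GPU: only the adapter gradients and optimizer states are kept in full precision.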
Specifics: Fine-Tuning for Edge and Mobile
Phi-3/4-mini (3.8B) is the most popular choice for deployment to mobile apps and browser extensions. After fine-tuning and quantization:
- GGUF Q4_K_M: ~2.2 GB, runs on CPU (MacBook M-series: ~12 tok/s)
- ONNX INT4: used in ONNX Runtime for Windows/Android
- ExecuTorch: deployment to iPhone/Android via PyTorch Mobile
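The GGUF size above follows directly from bits per weight: Q4_K_M averages roughly 4.5 bits/weight across tensors. A sketch (the bits-per-weight figure is an approximation, and real files add metadata overhead):

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size of a quantized model in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# Q4_K_M: ~4.5 bits/weight on average (approximate)
print(f"{quantized_size_gb(3.8e9, 4.5):.1f} GB")  # ~2.1 GB for a 3.8B model
```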
Microsoft provides ONNX versions of Phi through microsoft/Phi-3-mini-4k-instruct-onnx, simplifying integration into .NET and Windows applications.
Practical Case: Offline Assistant for Field Engineers
Task: a mobile app for industrial-equipment maintenance engineers. The assistant works offline (no internet at field sites), answers questions about maintenance procedures, and helps diagnose malfunctions.
Base model: Phi-3-mini-128k-instruct (3.8B, 128K context needed for long technical manuals).
Dataset: 1400 pairs (documentation fragment / engineer question → answer with procedure number and steps).
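One way to shape such pairs for SFTTrainer is the chat "messages" format, with the documentation fragment embedded in the user turn. A sketch (the field layout and system prompt are illustrative, not from the project):

```python
import json

def to_chat_example(fragment: str, question: str, answer: str) -> dict:
    """Pack one (fragment, question, answer) triple into chat-messages format."""
    return {
        "messages": [
            {"role": "system", "content": "You are a maintenance assistant. "
             "Answer strictly from the provided manual excerpt, citing the procedure number."},
            {"role": "user", "content": f"Manual excerpt:\n{fragment}\n\nQuestion: {question}"},
            {"role": "assistant", "content": answer},
        ]
    }

example = to_chat_example(
    "Procedure 7.3: Pump seal replacement...",
    "How do I replace the pump seal?",
    "Per procedure 7.3: 1) depressurize the line, 2) ...",
)
print(json.dumps(example, ensure_ascii=False)[:60])
```

Anchoring answers to the excerpt and procedure number in every training example is what drives the hallucination rate down.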
Result:
- Answer accuracy (compliance with procedures): 58% → 86%
- Hallucination rate (invents non-existent steps): 31% → 8%
- Model after GGUF Q4_K_M: 2.1 GB, 9 tok/s on smartphone CPU (Snapdragon 8 Gen 3)
Timeline
- Dataset preparation: 2–4 weeks
- Training (Phi-4 14B, QLoRA, A100): 4–10 hours
- Quantization and device testing: 3–5 days
- Total: 3–6 weeks
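The training-time figure can be sanity-checked from the run parameters above. A sketch using the document's numbers (1400 examples, 4 epochs, effective batch 4 × 4 = 16):

```python
def total_steps(n_examples: int, epochs: int, batch: int, grad_accum: int) -> int:
    """Optimizer steps for a full SFT run."""
    effective_batch = batch * grad_accum              # 4 * 4 = 16
    steps_per_epoch = -(-n_examples // effective_batch)  # ceiling division
    return steps_per_epoch * epochs

print(total_steps(1400, 4, 4, 4))  # 88 steps/epoch * 4 epochs = 352 steps
```

At a few tens of seconds per step for a 14B QLoRA run at 8K context on an A100, a few hundred steps lands in the stated 4–10 hour range.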