Command R (Cohere) Language Model Fine-Tuning

We design and deploy artificial intelligence systems, from prototype to production-ready solution. Our team combines expertise in machine learning, data engineering, and MLOps to make AI work not just in the lab, but in real business.

Fine-Tuning Command R Language Model (Cohere)

Command R and Command R+ are language models from Cohere, specialized for RAG tasks, tool use, and enterprise applications. Cohere provides managed fine-tuning through its own API and platform. Command R's key differentiator from most LLMs is its out-of-the-box optimization for RAG scenarios: the model is trained to cite sources properly and to hallucinate less when working with documents.

Command R Family

Model        Parameters  Context  Key Feature
Command R    35B         128K     RAG, citation
Command R+   104B        128K     Complex tasks, reasoning
Command R7B  7B          128K     Fast, cheap
Command A    —           256K     Latest generation

Cohere also provides open weights of Command R through Hugging Face (CohereForAI/c4ai-command-r-v01), enabling self-hosted fine-tuning.

Fine-Tuning via Cohere API

import cohere
from cohere.finetuning import (
    BaseModel,
    FinetunedModel,
    Hyperparameters,
    Settings,
)

co = cohere.Client(api_key="...")

# Upload training and validation sets (JSONL, chat format)
dataset = co.datasets.create(
    name="legal-analysis-dataset",
    type="chat-finetune-input",
    data=open("train.jsonl", "rb"),
    eval_data=open("val.jsonl", "rb"),
)
co.wait(dataset)  # block until dataset validation completes

# Launch fine-tuning on the validated dataset
ft = co.finetuning.create_finetuned_model(
    request=FinetunedModel(
        name="command-r-legal",
        settings=Settings(
            base_model=BaseModel(base_type="BASE_TYPE_CHAT"),
            dataset_id=dataset.id,
            hyperparameters=Hyperparameters(
                train_epochs=5,
                learning_rate=0.001,
            ),
        ),
    ),
)

Data Format: Chat with Preamble

Command R uses a special chat format with system prompt support (preamble), documents for RAG, and dialogue history:

{
  "messages": [
    {
      "role": "System",
      "message": "You are a legal assistant. Always cite specific law articles."
    },
    {
      "role": "User",
      "message": "What is the statute of limitations for property sale contracts?"
    },
    {
      "role": "Chatbot",
      "message": "The statute of limitations for property sale contracts is **3 years** (Article 196 of Civil Code). For void contracts — also 3 years from when person knew or should have known of violation (Article 181)..."
    }
  ]
}
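
Records in this format can be assembled and sanity-checked with a few lines of stdlib Python before upload. This is a sketch of ours, not part of the Cohere SDK; the `build_example` helper and file names are illustrative:

```python
import json

VALID_ROLES = {"System", "User", "Chatbot"}

def build_example(preamble, turns):
    """Assemble one chat-finetune record; turns = [(role, text), ...]."""
    messages = [{"role": "System", "message": preamble}]
    messages += [{"role": role, "message": text} for role, text in turns]
    for m in messages:
        assert m["role"] in VALID_ROLES, f"bad role: {m['role']}"
    # The model is trained to produce the final turn, so it must be the answer
    assert messages[-1]["role"] == "Chatbot", "last turn must be Chatbot"
    return {"messages": messages}

example = build_example(
    "You are a legal assistant. Always cite specific law articles.",
    [("User", "What is the statute of limitations?"),
     ("Chatbot", "Three years (Article 196 of the Civil Code).")],
)

# One JSON object per line — the JSONL layout the dataset upload expects
with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

Validating roles and turn order locally saves a round trip: malformed records otherwise surface only after the dataset upload is processed.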

RAG-Specific: Fine-Tuning with Documents

A unique capability of Command R is training with documents in context. This allows fine-tuning the model to a specific citation style and level of detail when working with corporate documents:

{
  "messages": [...],
  "documents": [
    {
      "title": "Claim Processing Regulations",
      "snippet": "3.4. Claim review period — no more than 30 calendar days..."
    }
  ]
}

With this approach, the model learns not just to generate an answer, but to properly ground it in the relevant fragments of the provided documents.
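
Before training on such records, it is worth screening that each target answer actually draws on the attached snippets. The heuristic below is a crude sketch of ours (word n-gram overlap), not Cohere tooling; real pipelines use metrics like RAGAS faithfulness:

```python
def grounded(record, min_overlap=3):
    """True if the Chatbot answer shares a min_overlap-word run with any snippet."""
    answer = next(m["message"] for m in record["messages"]
                  if m["role"] == "Chatbot").lower().split()
    ngrams = {" ".join(answer[i:i + min_overlap])
              for i in range(len(answer) - min_overlap + 1)}
    for doc in record.get("documents", []):
        snippet = doc["snippet"].lower().split()
        for i in range(len(snippet) - min_overlap + 1):
            if " ".join(snippet[i:i + min_overlap]) in ngrams:
                return True
    return False

record = {
    "messages": [
        {"role": "User", "message": "What is the claim review period?"},
        {"role": "Chatbot",
         "message": "The claim review period is no more than 30 calendar days."},
    ],
    "documents": [
        {"title": "Claim Processing Regulations",
         "snippet": "3.4. Claim review period — no more than 30 calendar days..."},
    ],
}
```

Filtering out ungrounded records keeps the fine-tune from learning to answer past its sources, which is the failure mode this whole setup exists to avoid.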

Practical Case: Legal Assistant for Corporate Law

Task: an assistant for the legal department of a large company — contract analysis, answers about internal regulations, and work with the regulatory base.

Dataset: 2,800 examples (question + relevant document fragment → answer with source reference), built from real lawyer requests to the knowledge base.

Critical metric: faithfulness — the share of answers fully grounded in the provided documents, with no hallucinated content.

Results:

  • Faithfulness (RAGAS): 0.71 → 0.93
  • Answer relevancy: 0.78 → 0.91
  • Citation accuracy (references to correct sources): 64% → 89%
  • Hallucination rate: 18% → 4%
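
A metric like citation accuracy reduces to a simple exact-match scorer over cited source sets. The sketch below (helper name and sample data are ours, purely illustrative) shows the idea:

```python
def citation_accuracy(gold, predicted):
    """Fraction of examples whose cited sources exactly match the gold set."""
    assert len(gold) == len(predicted), "one prediction per gold example"
    hits = sum(set(g) == set(p) for g, p in zip(gold, predicted))
    return hits / len(gold)

gold = [["Article 196"], ["Article 181"], ["Regulations 3.4"]]
pred = [["Article 196"], ["Article 183"], ["Regulations 3.4"]]
acc = citation_accuracy(gold, pred)  # 2 of 3 match exactly
```

Exact set matching is deliberately strict: citing a correct article plus a spurious one counts as a miss, which is the right bias for a legal assistant.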

Self-Hosted Option via Open Weights

For on-premise deployment of Command R via Hugging Face:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/c4ai-command-r-v01",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Command R ships its own tokenizer with the Cohere chat template
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
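
When building training strings for the self-hosted path, rely on tokenizer.apply_chat_template so the prompt always matches the model. The sketch below only illustrates the turn structure Command R's template produces, using its published special tokens; it is our simplification, not a replacement for the real template:

```python
# Command R special tokens (from the model's published chat template)
ROLE_TOKENS = {
    "System": "<|SYSTEM_TOKEN|>",
    "User": "<|USER_TOKEN|>",
    "Chatbot": "<|CHATBOT_TOKEN|>",
}

def render_prompt(messages):
    """Render a message list into Command R's turn structure (illustrative)."""
    parts = ["<BOS_TOKEN>"]
    for m in messages:
        parts.append("<|START_OF_TURN_TOKEN|>" + ROLE_TOKENS[m["role"]]
                     + m["message"] + "<|END_OF_TURN_TOKEN|>")
    # Leave the chatbot turn open so the model generates the answer
    parts.append("<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>")
    return "".join(parts)

prompt = render_prompt([{"role": "User", "message": "Hi"}])
```

Hand-built prompts like this are fine for understanding what the collator feeds the model, but any drift from the tokenizer's own template silently degrades fine-tuning quality.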

Managed vs Self-Hosted Comparison

Parameter             Cohere API Fine-Tuning   Self-Hosted (Open Weights)
Infrastructure        Managed                  Requires a GPU cluster
Weight control        No                       Yes
On-premise            No                       Yes
RAG citation          Native                   Native (same weights)
Cost at high volume   Higher                   Lower
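
The "cost at high volume" row can be made concrete with back-of-the-envelope arithmetic. All prices and throughput figures below are illustrative assumptions of ours, not Cohere's or any cloud provider's actual rates:

```python
# Assumed, illustrative numbers — adjust to real quotes before deciding
API_COST_PER_1M_TOKENS = 5.00    # $, managed fine-tuned model inference
GPU_HOUR = 4.00                  # $, rented GPU node for self-hosting
TOKENS_PER_GPU_HOUR = 2_000_000  # throughput of a self-hosted 35B deployment

def monthly_cost_api(tokens):
    """Pay-per-token: cost scales linearly with usage."""
    return tokens / 1_000_000 * API_COST_PER_1M_TOKENS

def monthly_cost_self_hosted(tokens, min_hours=720):
    """You pay for the cluster even when idle: at least a month of uptime."""
    hours = max(tokens / TOKENS_PER_GPU_HOUR, min_hours)
    return hours * GPU_HOUR

low, high = 50_000_000, 5_000_000_000  # 50M vs 5B tokens per month
```

Under these assumptions the API wins at low volume (the idle cluster dominates), while self-hosting wins once monthly traffic grows large enough to keep the GPUs busy — which is exactly the crossover the table summarizes.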

Timeline

  • Dataset preparation with documents: 3–6 weeks
  • Training (Cohere API): 2–5 days (managed)
  • Training (self-hosted, 35B, QLoRA): 12–36 hours
  • RAG quality testing: 1–2 weeks
  • Total: 6–10 weeks