Implementing Graph RAG (Knowledge Graph Extraction)
Graph RAG is an architecture that extends standard vector RAG with a knowledge graph. Instead of retrieving only semantically similar chunks, the system can traverse the graph: starting from an entity and following its relations, it finds related concepts that share no keywords with the query but are still relevant to it. Microsoft Research published GraphRAG in 2024, the most influential implementation of this approach.
When Graph RAG Is Needed
Standard RAG fails with:
- Questions about relationships between entities ("How are company X and contract Y related?")
- Global summarizing questions ("What are the main topics in the document corpus?")
- Multi-hop reasoning ("Who is the head of the department responsible for contract №123?")
- Tracking how entities and their relations change over time
Microsoft GraphRAG Architecture
```
Documents
    ↓
LLM extracts entities and relations
    ↓
Knowledge Graph (NetworkX/Neo4j)
    ↓
Hierarchical community detection (Leiden algorithm)
    ↓
Community summaries → Community reports
    ↓
Two search modes:
├── Local search: vector search + graph traversal from an entry point
└── Global search: summarization over community reports
```
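The community-detection step can be sketched directly on a NetworkX graph. The reference GraphRAG pipeline uses the Leiden algorithm (the `graspologic` package provides an implementation); the sketch below substitutes NetworkX's built-in greedy modularity communities as a dependency-light stand-in that plays the same role of grouping densely connected entities:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def detect_communities(graph: nx.Graph) -> list:
    # Community detection runs on an undirected view of the knowledge graph;
    # each community is a set of entity ids that are densely interconnected.
    undirected = graph.to_undirected() if graph.is_directed() else graph
    return [set(c) for c in greedy_modularity_communities(undirected)]

# Toy graph: two triangles joined by a single bridge edge
g = nx.Graph()
g.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),   # cluster 1
                  ("x", "y"), ("y", "z"), ("x", "z"),   # cluster 2
                  ("c", "x")])                          # bridge
communities = detect_communities(g)
print(len(communities))  # the two triangles become two communities
```

Each detected community then gets its own LLM-written summary (the "community report") used by Global Search.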
Extracting Entities and Relations via LLM
```python
from openai import OpenAI
import json

client = OpenAI()

ENTITY_EXTRACTION_PROMPT = """Extract entities and relations from the following text.
Return JSON:
{{
  "entities": [
    {{"id": "1", "name": "...", "type": "PERSON|ORG|CONTRACT|REGULATION|CONCEPT", "description": "..."}}
  ],
  "relationships": [
    {{"source": "id1", "target": "id2", "relation": "SIGNED|MANAGES|REFERS_TO|PART_OF", "description": "..."}}
  ]
}}

Text:
{text}"""

def extract_graph_elements(text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": ENTITY_EXTRACTION_PROMPT.format(text=text)}],
        response_format={"type": "json_object"},  # force JSON mode
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```
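Even with JSON mode, LLM output does not always conform to the schema, so it is worth validating before inserting into the graph. A minimal sketch (the function name and the exact checks are illustrative, not part of any library):

```python
ALLOWED_ENTITY_TYPES = {"PERSON", "ORG", "CONTRACT", "REGULATION", "CONCEPT"}

def validate_graph_elements(elements: dict) -> dict:
    """Drop malformed entities and relations that reference unknown entity ids."""
    entities = [
        e for e in elements.get("entities", [])
        if e.get("id") and e.get("name") and e.get("type") in ALLOWED_ENTITY_TYPES
    ]
    known_ids = {e["id"] for e in entities}
    relationships = [
        r for r in elements.get("relationships", [])
        if r.get("source") in known_ids and r.get("target") in known_ids
    ]
    return {"entities": entities, "relationships": relationships}

raw = {
    "entities": [
        {"id": "1", "name": "Acme", "type": "ORG", "description": "supplier"},
        {"id": "2", "name": "???", "type": "ALIEN", "description": "bad type"},
    ],
    "relationships": [
        {"source": "1", "target": "2", "relation": "SIGNED", "description": ""},
    ],
}
clean = validate_graph_elements(raw)
print(len(clean["entities"]), len(clean["relationships"]))  # 1 0
```

Dropping relations whose endpoints were filtered out keeps the graph free of dangling edges.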
Building a Knowledge Graph with NetworkX
```python
import networkx as nx

class KnowledgeGraph:
    def __init__(self):
        self.graph = nx.DiGraph()
        self.entity_embeddings = {}

    def add_elements(self, elements: dict, source_doc: str):
        # Add entities as nodes
        for entity in elements["entities"]:
            self.graph.add_node(
                entity["id"],
                name=entity["name"],
                type=entity["type"],
                description=entity["description"],
                source=source_doc,
            )
        # Add relations as directed edges
        for rel in elements["relationships"]:
            self.graph.add_edge(
                rel["source"],
                rel["target"],
                relation=rel["relation"],
                description=rel["description"],
            )

    def get_subgraph(self, entity_id: str, depth: int = 2) -> nx.DiGraph:
        """Returns the subgraph within `depth` hops of an entity (both directions)"""
        nodes = {entity_id}
        for _ in range(depth):
            neighbors = set()
            for node in nodes:
                neighbors.update(self.graph.predecessors(node))
                neighbors.update(self.graph.successors(node))
            nodes.update(neighbors)
        return self.graph.subgraph(nodes)

    def serialize_subgraph(self, subgraph: nx.DiGraph) -> str:
        """Converts a subgraph to text for LLM context"""
        lines = []
        for _, data in subgraph.nodes(data=True):
            lines.append(f"Entity: {data.get('name')} ({data.get('type')})")
            lines.append(f"  Description: {data.get('description', '')}")
        for source, target, data in subgraph.edges(data=True):
            source_name = subgraph.nodes[source].get("name", source)
            target_name = subgraph.nodes[target].get("name", target)
            lines.append(f"Relation: {source_name} → {target_name} ({data.get('relation')})")
            lines.append(f"  {data.get('description', '')}")
        return "\n".join(lines)
```
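As a side note, the manual two-way expansion in `get_subgraph` is equivalent to NetworkX's built-in `ego_graph` with `undirected=True`. A self-contained check on a toy chain:

```python
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])

# ego_graph with undirected=True matches the manual predecessor/successor
# expansion: all nodes within 2 hops of "c" in either edge direction.
sub = nx.ego_graph(g, "c", radius=2, undirected=True)
print(sorted(sub.nodes))  # ['a', 'b', 'c', 'd', 'e']
```

The explicit loop in the class is kept because it is easier to extend (for example, filtering by relation type during expansion).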
Local Search: GraphRAG Query
```python
class GraphRAGRetriever:
    def __init__(self, knowledge_graph: KnowledgeGraph, vectorstore, embeddings):
        self.kg = knowledge_graph
        self.vectorstore = vectorstore
        self.embeddings = embeddings

    def local_search(self, query: str, top_k: int = 5) -> str:
        """
        Local Search: combines vector search with graph traversal
        starting from the entities found in the retrieved chunks.
        """
        # 1. Vector search over chunks
        vector_docs = self.vectorstore.similarity_search(query, k=top_k)

        # 2. Identify entities mentioned in the retrieved chunks
        mentioned_entities = self._extract_entities_from_docs(vector_docs, query)

        # 3. Graph traversal: expand context through related nodes
        graph_contexts = []
        for entity_id in mentioned_entities[:3]:
            subgraph = self.kg.get_subgraph(entity_id, depth=2)
            graph_contexts.append(self.kg.serialize_subgraph(subgraph))

        # 4. Combine text and graph context
        vector_context = "\n\n".join(d.page_content for d in vector_docs)
        graph_context = "\n\n".join(graph_contexts)
        return f"## Text Context\n{vector_context}\n\n## Knowledge Graph Context\n{graph_context}"

    def _extract_entities_from_docs(self, docs, query: str) -> list:
        """Naive entity linking: match known entity names against the chunk text."""
        text = " ".join(d.page_content for d in docs).lower() + " " + query.lower()
        return [
            node_id
            for node_id, data in self.kg.graph.nodes(data=True)
            if data.get("name") and data["name"].lower() in text
        ]
```
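Global Search, the second mode, is structurally a map-reduce over community reports. A minimal sketch with a pluggable answer function standing in for the LLM call (`global_search` and `echo_llm` are illustrative names, not GraphRAG APIs):

```python
from typing import Callable, List

def global_search(query: str, community_reports: List[str],
                  answer_fn: Callable[[str, str], str]) -> str:
    """Map-reduce over community reports (the Global Search pattern).

    `answer_fn(query, context) -> str` stands in for an LLM call so the
    pipeline shape can be shown without API access.
    """
    # Map: produce a partial answer from each community report independently
    partial = [answer_fn(query, report) for report in community_reports]
    # Reduce: merge the partial answers into one final response
    return answer_fn(query, "\n\n".join(partial))

# Stub "LLM" that just reports how much context it saw
def echo_llm(query: str, context: str) -> str:
    return f"[{len(context)} chars considered]"

result = global_search("main topics?", ["report A", "report B"], echo_llm)
print(result)
```

Because each report is processed independently in the map phase, the real implementation can parallelize these calls and rank partial answers by relevance before the reduce step.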
Practical Case: Corporate Documentation Analysis
Task: an assistant for a legal department that analyzes relationships between contractors, contracts, and employees (6,500 contracts spanning 12 years of history).
Questions standard RAG could not answer:
- "Which suppliers participated in tenders whose winner was later declared bankrupt?"
- "Which contracts will be affected by a leadership change at company X?"
Graph: 45,000 entities, 180,000 relations (stored in Neo4j).
Results:
- Multi-hop questions (2+ hops): accuracy rose from 12% with standard RAG to 71% with Graph RAG
- Global summarizing questions: 34% → 82%
- Standard fact-lookup questions: comparable, with a minor regression (-3%)
- Graph construction time: 4 days (GPT-4o for extraction, $240 in API costs)
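To make the multi-hop pattern concrete: the bankrupt-winner question above reduces to a two-hop traversal. A toy sketch on a hand-built graph (entity names, relation labels, and the `status` attribute are invented for illustration):

```python
import networkx as nx

g = nx.DiGraph()
# supplier -PARTICIPATED_IN-> tender <-WON- winner
g.add_edge("SupplierA", "Tender1", relation="PARTICIPATED_IN")
g.add_edge("SupplierB", "Tender2", relation="PARTICIPATED_IN")
g.add_edge("WinnerX", "Tender1", relation="WON")
g.add_edge("WinnerY", "Tender2", relation="WON")
g.nodes["WinnerX"]["status"] = "bankrupt"

def suppliers_in_tenders_with_bankrupt_winner(g: nx.DiGraph) -> set:
    hits = set()
    # hop 1: find tenders won by a bankrupt entity
    for winner, tender, data in g.edges(data=True):
        if data.get("relation") == "WON" and g.nodes[winner].get("status") == "bankrupt":
            # hop 2: collect everyone who participated in that tender
            for supplier, _, d2 in g.in_edges(tender, data=True):
                if d2.get("relation") == "PARTICIPATED_IN":
                    hits.add(supplier)
    return hits

found = suppliers_in_tenders_with_bankrupt_winner(g)
print(found)  # {'SupplierA'}
```

In Neo4j the same two hops would be a single declarative Cypher pattern match; pure vector retrieval has no way to chain these hops.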
Tools for Graph RAG
- Microsoft GraphRAG library: `pip install graphrag`, the complete reference implementation from Microsoft
- Neo4j + LangChain: `Neo4jGraph` + `GraphCypherQAChain` for Cypher queries
- LlamaIndex + Knowledge Graph: `KnowledgeGraphIndex`
- NetworkX: a lightweight in-memory graph in Python with no external dependencies
Timelines
- Developing the extraction pipeline (LLM → graph): 2–3 weeks
- Building the graph from existing documents: 1–4 weeks
- Implementing Local/Global search: 2 weeks
- Testing and evaluation: 1–2 weeks
- Total: 6–11 weeks