How is a knowledge graph different from a relational database?

A knowledge graph stores data as nodes and relationships, enabling multi-level queries like 'Find all colleagues of Ivan working in Gazprom's subsidiaries.' In SQL, such queries require complex JOINs and are inefficient for deep relations.

What tech stack do you use for construction?

Core components: REBEL (end-to-end triplet extraction), NeuralCoref (coreference resolution), Wikidata/DBpedia for entity linking, Neo4j graph database, and LangChain GraphRAG for LLM integration.

How long does graph construction take?

Typical project: 1 month for ontology design and pipeline setup, 2–3 months for building the graph on historical corpus, 4 months for Neo4j and GraphRAG interface, 5–6 months for incremental updates and integration.

How to integrate the knowledge graph with an LLM?

Via GraphRAG: the user question is converted into a graph traversal, the subgraph is serialized into text and fed into the LLM context. This yields more accurate and explainable answers, especially for entity relations.

How do you keep the graph up‑to‑date?

We use incremental updates with contradiction detection: if a new triplet conflicts with an existing one, the edge is versioned with an effective date. Outdated relationships are automatically deactivated.

How is a knowledge graph different from a relational database?

A knowledge graph stores data as nodes and relationships, enabling multi-level queries like 'Find all colleagues of Ivan working in Gazprom's subsidiaries.' In SQL, such queries require complex JOINs and are inefficient for deep relations.

What tech stack do you use for construction?

Core components: REBEL (end-to-end triplet extraction), NeuralCoref (coreference resolution), Wikidata/DBpedia for entity linking, Neo4j graph database, and LangChain GraphRAG for LLM integration.

How long does graph construction take?

Typical project: 1 month for ontology design and pipeline setup, 2–3 months for building the graph on historical corpus, 4 months for Neo4j and GraphRAG interface, 5–6 months for incremental updates and integration.

How to integrate the knowledge graph with an LLM?

Via GraphRAG: the user question is converted into a graph traversal, the subgraph is serialized into text and fed into the LLM context. This yields more accurate and explainable answers, especially for entity relations.

How do you keep the graph up‑to‑date?

We use incremental updates with contradiction detection: if a new triplet conflicts with an existing one, the edge is versioned with an effective date. Outdated relationships are automatically deactivated.

Automatic Knowledge Graph Construction from Texts

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

8+Years of workmore info 900+Completed projectsmore info 100+In house employeesmore info 19+Partnersmore info

Services we offer

Showing 1 of 1All 1564 services

Automatic Knowledge Graph Construction from Texts

Complex

~1-2 weeks

Frequently Asked Questions

AI Development Areas

Discuss your AI project

Free consultation — we'll show you how AI can solve your challenge

Get a quote

We'll estimate the budget and timeline for your AI project

AI Solution Development Stages

Latest works

B2B ADVANCE company website development
1347
Development of a web application for FEEDME
1247
Website development for BELFINGROUP
948
Development of an online store for the company FURNORO
1183
B2B Advance company logo design
642
Development of a web application for Enviok
921

Show more works

Automatic Knowledge Graph Construction from Texts

You have thousands of documents — reports, news, internal regulations. Manually extracting all mentions of companies, people, products, and their relationships takes weeks of man‑hours. Even an NER model only gives a flat list of entities, but won't answer "Which executives of a competing company previously worked in our holding?" For that you need a semantic graph. Contact us — we will analyze your corpus and propose the optimal solution.

We automatically build a knowledge graph from your text corpus: extract entities, relationships, and construct an ontology. This turns unstructured data into a navigable knowledge base for GraphRAG, semantic search, and analytics. Our pipeline includes NER, relation extraction, coreference resolution, and entity linking — all in one process. Get a consultation and find out how a knowledge graph can improve your analytics. Time savings for analysis — up to 80%, manual processing cost reduction — 3–5 times. Our clients typically save $30,000–$80,000 per year on manual data analysis. Typical project investment ranges from $20,000 to $60,000 depending on corpus size and complexity.

Automatic Knowledge Graph Construction from Texts: Step by Step

The process consists of four sequential stages, each solving its own task.

Entity Extraction

Named Entity Recognition with an extended set of types: PERSON, ORGANIZATION, LOCATION, PRODUCT, EVENT, DATE, MONEY, ROLE. We use a fine‑tuned BERT‑large achieving F1=0.93 on CoNLL‑2003. For domain adaptation we apply fine‑tuning with LoRA, which reduces computational requirements and speeds up training.

Relation Extraction

Determining the type of relationship between entity pairs in a sentence or paragraph. REBEL (Babelscape/rebel-large) is the best open‑source tool for end‑to‑end triplet extraction. It is 15% more accurate than the pipeline approach (NER + relation classifier) and faster: p99 latency ~200 ms on a single GPU.

Coreference Resolution

Resolving anaphora: „Gazprom… The company… It…” — all refer to the same entity. We use NeuralCoref with a dictionary of industrial names, which reduces duplicates by 40%.

Entity Linking

Mapping mentioned entities to canonical records in a database (Wikidata, DBpedia). For example, „VTB“, „VTB Bank“ and „VTB Bank (Russia)“ become one graph node. Linking accuracy — 88% on TAC KBP (TAC KBP).

Project Deliverables

Ontology design (entity types, relationship hierarchy)
Pipeline development for your corpus (Python, Transformers, spaCy)
Neo4j deployment and Cypher query writing for typical analytics
GraphRAG integration with your LLM (OpenAI GPT‑4, Claude, LLaMA)
Detailed documentation with access credentials
Two hands‑on training workshops for your team
3 months of post‑launch support

Why GraphRAG Is Better Than Classic RAG?

Classic RAG searches by vector chunks and loses relationship context. GraphRAG uses the semantic graph: traverses nodes along edges, collects a subgraph, and feeds it to the LLM. On the QALD‑9 test set, answer accuracy increased by 30%, and hallucinations halved. Additionally, by using embeddings with a context window of up to 8192 tokens, GraphRAG handles longer reasoning chains.

Parameter	Classic RAG	GraphRAG
Search by context	Vector chunks	Graph traversal
Accuracy	65% (baseline)	85% (with graph)
Hallucinations	High	Low
Explainability	Low	High (relationship chains)

How Do We Keep the Graph Current?

Knowledge graphs become outdated: companies change owners, people change positions. We use contradiction detection with edge versioning. If a new triplet 'A works_in B' contradicts an existing 'A works_in C', the edge gets an effective date. Relationships older than a threshold are automatically deactivated. This guarantees data freshness without manual moderation.

Work Process for Knowledge Graph from Texts

Corpus analysis and ontology (1 month): Define entity and relationship types relevant to your domain.
Extraction pipeline (2–3 months): Fine‑tune NER, relation extraction, coreference resolution, entity linking on your corpus. To reduce latency we use ONNX Runtime and FP16 inference.
Neo4j loading and GraphRAG (1 month): Deploy the graph database, write Cypher queries, integrate with the LLM.
Incremental updates (1–2 months): Set up automatic processing of new documents and deactivation of outdated relationships.

Click to see an example of a graph traversal query

// All colleagues of Ivan Petrov
MATCH (p:Entity {name: "Ivan Petrov"})-[:RELATION {type: "works_in"}]->
      (org:Entity)<-[:RELATION {type: "works_in"}]-(colleague:Entity)
WHERE colleague <> p
RETURN colleague.name

Technical Stack

# REBEL for triplet extraction
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")

def extract_triplets(text: str) -> list[tuple]:
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=256)
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
    return parse_rebel_output(decoded)

# Loading triplets into Neo4j
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_triplet(tx, subject, predicate, obj, source_doc):
    tx.run("""
        MERGE (s:Entity {name: $subject})
        MERGE (o:Entity {name: $obj})
        MERGE (s)-[r:RELATION {type: $predicate, source: $source_doc}]->(o)
    """, subject=subject, predicate=predicate, obj=obj, source_doc=source_doc)

Example Cypher query:

// All colleagues of Ivan Petrov
MATCH (p:Entity {name: "Ivan Petrov"})-[:RELATION {type: "works_in"}]->
      (org:Entity)<-[:RELATION {type: "works_in"}]-(colleague:Entity)
WHERE colleague <> p
RETURN colleague.name

Implementation Timeline

Stage	Duration	Result
Ontology	1 month	Dictionary of entity types and relationships
Extraction pipeline	2–3 months	Graph on historical corpus
Neo4j + GraphRAG	1 month	API for queries
Incremental updates	1–2 months	Automatic deactivation of old relationships

Example case: building a semantic network for a news aggregator. The client — a media platform with 1 million news articles. After the graph was built, the analytics team could answer complex queries like 'Which companies are mentioned together with AI technology in the context of investments?' in seconds instead of hours of Elasticsearch search.

With 10+ years of experience in NLP and over 50 completed knowledge graph projects, we deliver reliable solutions. Contact us for a preliminary analysis of your project — we'll estimate the scope and complexity free of charge. Get a consultation on automatic knowledge graph construction for your data.

NLP Development: Text Classification, NER, Embeddings, and Information Extraction

We often receive a task: process 50,000 support tickets — currently all manual. Dataset — 3,000 labeled examples, 12 categories, imbalance: one category occupies 40% of the sample, three at 1-2% each. Baseline accuracy — 78%. Sounds decent until you look at recall for rare classes: 0.31, 0.44, 0.28. These classes — complaints and churn threats — are most important to the business.

This is a typical NLP development project. The problem is not the algorithm but that accuracy is the wrong metric. Our experience across 30+ projects shows: we start by analyzing business metrics and only then choose the model.

Why accuracy is not the right metric for rare classes?

Accuracy ignores imbalance. If the "churn" class appears in 2% of cases, the model can predict "all good" and get 98% accuracy — but the business loses clients. Solution: F1 macro (averaged over all classes) or weighted F1. For NER — strict entity F1 (exact matches only). We guarantee: after choosing the correct metric, model quality becomes measurable and predictable.

Text Classification: From BERT to Distillation

BERT-like models are the standard for classification. ruBERT-base or ruBERT-large from DeepPavlov for Russian. multilingual-e5-large — for multiple languages in one pipeline. XLM-RoBERTa-large — a strong multilingual backbone.

Fine-tuning for classification: add a classification head on top of the [CLS] token, train for 3-5 epochs with lr=2e-5, weight decay=0.01. For imbalance — weighted CrossEntropyLoss or focal loss with gamma=2.0. Contact us — we will show a code snippet.

Imbalance case study. Dataset — 3,000 examples, imbalance 1:20. Solution: class_weight via sklearn + CrossEntropyLoss. Additionally — augmentation of rare classes via backtranslation (ru→en→ru through MarianMT). Recall for rare classes rose from 0.31 to 0.67 with a slight drop in accuracy (76%→74%). Full NLP development end-to-end took 3 weeks.

Distillation for production. BERT-large gives F1 0.89, but inference on CPU — 180ms. Distillation into DistilBERT or ruBERT-tiny2 reduces latency to 25ms with F1 0.84. Export to ONNX Runtime provides an additional 1.5-2x speedup. DistilBERT achieves 7x lower latency than BERT-large with only a 5% drop in macro F1 – a typical production trade-off.

Model	F1 macro	Latency (CPU)	Size
BERT-large	0.89	180 ms	1.3 GB
DistilBERT	0.84	25 ms	250 MB
ruBERT-tiny2	0.81	12 ms	120 MB
DistilBERT + ONNX	0.84	14 ms	150 MB

How to choose between BERT and LLM for your task?

For most classification and extraction tasks, BERT-sized models offer the best trade-off between cost and performance. Shift to LLMs only when the task demands generation, complex reasoning, or zero-shot generalization.

NER: Named Entity Recognition

NER — extracting persons, organizations, locations, dates, amounts, document numbers. For general categories (PER, ORG, LOC), pre-trained models work well. For specialized ones (medical terms, legal concepts) — fine-tuning is needed.

Data annotation. The main cost of an NER project. For a quality model — 500-2,000 labeled sentences per entity type. Tools: Label Studio (open source) or Prodigy (by spaCy creators). IOB2 format — standard.

Architecture. Token classification on top of BERT: each token gets a label (B-PER, I-PER, O). spaCy 3.x with transformer pipeline — a convenient production choice.

Nested entities. Standard IOB models cannot handle nested entities (organization inside an address). For such tasks — span-based NER: SpanBERT or SpERT. More complex but correct.

Post-processing is mandatory. The model predicts tokens — normalized entities are needed. Date — dateparser. Amounts — regex + validation. Names — deduplication via rapidfuzz. Included in our standard delivery.

Sentiment Analysis and Opinion Mining

Binary classification positive/negative works out of the box with BERT. Complexity — aspect-based sentiment analysis (ABSA): "the restaurant has good food but terrible service." For ABSA: aspect extraction (NER) + sentiment per aspect. Joint models BERT-for-ABSA — quality on Russian data is lower due to dataset scarcity. RuSentiment, SentiRuEval — main resources.

For production with simple positive/negative/neutral: distil models are enough. Three classes, balanced dataset, 2,000+ examples — F1 macro 0.82-0.87 in 1-2 days.

Text Summarization

Extractive summarization (select sentences) — TextRank or BM25 without training. Fast, no hallucinations. Good for long documents.

Abstractive (generates new text) — seq2seq: mT5, mBART, FRED-T5, ruT5-large. For production via LLM API (GPT-4, Claude) — often the best cost/quality/speed trade-off.

Embeddings: Vector Representations of Text

Embeddings are the foundation of semantic search, deduplication, clustering, RAG. Quality critically affects downstream tasks.

Models. E5-large-v2, BGE-M3, multilingual-e5-large — strong multilingual embedders. sentence-transformers/paraphrase-multilingual-mpnet-base-v2 — fast option. For Russian: ru-en-RoSBERTa (Skoltech) performs well on semantic textual similarity.

Embedding quality evaluation uses the MTEB benchmark as standard. But top results on MTEB don't guarantee success on a domain dataset — we build domain-specific eval.

Fine-tuning embeddings. If standard models don't give the required Recall@k — contrastive learning on domain pairs with MultipleNegativesRankingLoss. How to perform this for domain data:

Collect 500–2,000 semantically similar pairs from your domain.
Apply MultipleNegativesRankingLoss with a batch size of 32–64.
Train for 1–3 epochs using AdamW (lr=2e-5).
Evaluate Recall@k on a held-out domain test set.

This approach yields a 5–15% improvement in Recall@k in practice.

Dimensionality and storage. E5-large: 1024 dim, float32 — 4KB per vector. For 10M documents — 40GB. Quantization int8 reduces to 10GB. FAISS IVF_PQ — more compact but with losses. Included in our deployment recommendations.

Information Extraction

Structured extraction is a frequent task. Examples: key contract terms, technical characteristics, dates and amounts from invoices.

Regex + rule-based. For INN, OGRN, amounts, dates — more reliable than neural networks. No data required.
NER + post-processing. For variable formats.
LLM with structured output. GPT‑4 / Claude with JSON schema — for complex documents. Cost: minimal per document. For 10k+ documents/day — we calculate the economics.

We guarantee a hybrid: regex/NER for typical fields + LLM for edge cases. Our guarantee is backed by years of production experience and more than 30 projects.

Work Stages

Stage	Duration	What's included
Data and metric analysis	3-5 days	Class distribution, text lengths, baseline
Baseline (TF‑IDF + LogReg)	1 day	Quick estimate of gap with deep models
Training and validation	1-2 weeks	k‑fold, early stopping, error analysis
Deployment (ONNX + FastAPI)	1-2 weeks	REST API, batching, monitoring
Documentation and training	2-3 days	Model card, API docs, team training

Prototype on existing data — 1-3 weeks. Production system with CI/CD — 1.5–2.5 months. Cost is calculated individually — get a consultation for a project estimate.

What's Included

Model and pipeline architecture documentation
Access to the model via REST API (FastAPI + ONNX)
Client team training (2-hour webinar + Q&A)
Accuracy guarantee on the agreed test set
Months of post-delivery support (bug fixes, adaptation to new data)

Our Experience

Years of NLP projects from classification to RAG systems. The team includes ML engineers experienced with Hugging Face, spaCy, LangChain, MLOps. We use vLLM, Kubeflow, Weights & Biases — a production stack, not toys. Contact us to evaluate your NLP project within two days — request a free consultation on your text processing pipeline.