Implementation of Knowledge Graph Construction from Text
A knowledge graph is a structured representation of knowledge as triples (subject, predicate, object) stored as a graph. Automatic construction from text allows transforming unstructured corpora into a navigable, queryable knowledge base.
What is a Knowledge Graph and When Is It Needed
Unlike a relational database, a knowledge graph naturally represents highly interconnected data: "Ivan Petrov → works_in → Gazprom → located_in → Moscow → is_capital_of → Russia". Graph queries ("find all employees of companies located in Moscow") are possible in SQL but require chains of joins and grow unwieldy with depth; in a graph database they are a single traversal pattern.
A knowledge graph is needed when:
- Data is highly interconnected with many relation types
- Multi-level queries are needed (graph traversal)
- Integration of data from different sources is planned
- Explainability is required: "why did the system decide this" — because A is connected to B through C
Architecture of Automatic Construction
Three key components work sequentially:
Entity Extraction — NER with an expanded set of types. For corporate graphs: PERSON, ORGANIZATION, LOCATION, PRODUCT, EVENT, DATE, MONEY, ROLE.
Relation Extraction — determining the relationship type between entity pairs within a sentence or paragraph. REBEL (Babelscape) is one of the strongest open-source models for end-to-end triple extraction.
Coreference Resolution — resolving coreferences: "Gazprom... the company... it..." all refer to one entity. NeuralCoref (spaCy 2.x only, no longer maintained) or the experimental coreference component in spacy-experimental can be used.
Entity Linking — linking mentioned entities to canonical records in a knowledge base (Wikidata, DBpedia): "VTB", "Bank VTB", "VTB Bank" → one graph node.
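Full entity linking against Wikidata is a project in itself; as a minimal sketch, the normalization step that collapses surface forms into one canonical node can be done with an alias table. All names and mappings below are illustrative — in production the table would be built from Wikidata labels/aliases or a trained linking model:

```python
# Illustrative alias table: maps lowercased surface forms of an entity
# to one canonical name, so the graph gets a single node per entity.
ALIASES = {
    "vtb": "VTB Bank",
    "bank vtb": "VTB Bank",
    "vtb bank": "VTB Bank",
}

def canonicalize(mention: str) -> str:
    """Return the canonical entity name for a surface mention.

    Unknown mentions pass through unchanged (stripped of whitespace).
    """
    key = mention.strip().lower()
    return ALIASES.get(key, mention.strip())
```

With this step applied before writing to the graph, canonicalize("VTB") and canonicalize("Bank VTB") both resolve to "VTB Bank", so triples about the same bank land on one node.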
Technical Stack
# REBEL for triple extraction
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")

def extract_triplets(text: str) -> list[tuple]:
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=256)
    # skip_special_tokens=False: the special tokens delimit the triplets
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
    # REBEL's linearized format: <triplet> subject <subj> object <obj> relation
    return parse_rebel_output(decoded)
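The helper parse_rebel_output is left undefined above. A sketch following the parsing logic from the Babelscape model card, adapted here to return (subject, relation, object) tuples, could look like this:

```python
def parse_rebel_output(decoded: str) -> list[tuple]:
    """Parse REBEL's linearized output into (subject, relation, object) tuples.

    The format is: <triplet> subject <subj> object <obj> relation [<triplet> ...]
    """
    triplets = []
    subject, relation, object_ = "", "", ""
    current = None
    cleaned = decoded.replace("<s>", "").replace("<pad>", "").replace("</s>", "")
    for token in cleaned.split():
        if token == "<triplet>":
            # A new triplet starts: flush the previous one if it is complete
            if subject and relation and object_:
                triplets.append((subject.strip(), relation.strip(), object_.strip()))
            subject, relation, object_ = "", "", ""
            current = "subject"
        elif token == "<subj>":
            current = "object"
        elif token == "<obj>":
            current = "relation"
        elif current == "subject":
            subject += " " + token
        elif current == "object":
            object_ += " " + token
        elif current == "relation":
            relation += " " + token
    # Flush the last triplet
    if subject and relation and object_:
        triplets.append((subject.strip(), relation.strip(), object_.strip()))
    return triplets
```

Note the counterintuitive token order: the object comes after `<subj>` and the relation after `<obj>` — that is how REBEL linearizes triples.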
Graph Storage
Neo4j — de facto standard for graph databases:
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_triplet(tx, subject, predicate, obj, source_doc):
    tx.run("""
        MERGE (s:Entity {name: $subject})
        MERGE (o:Entity {name: $obj})
        MERGE (s)-[r:RELATION {type: $predicate, source: $source_doc}]->(o)
    """, subject=subject, predicate=predicate, obj=obj, source_doc=source_doc)
Queries in Cypher:
// Find all colleagues of a person (working in the same company)
MATCH (p:Entity {name: "Ivan Petrov"})-[:RELATION {type: "works_in"}]->
(org:Entity)<-[:RELATION {type: "works_in"}]-(colleague:Entity)
WHERE colleague <> p
RETURN colleague.name
Integration with LLM (GraphRAG)
Knowledge graph + LLM = GraphRAG: instead of semantic search over chunks, the LLM receives context from a connected subgraph. Microsoft's GraphRAG (with community implementations in LangChain and LlamaIndex) reports significantly better results than classic RAG on questions about relations between entities.
Workflow:
- User question → entity extraction
- Graph traversal from these entities (2–3 levels)
- Subgraph → text representation → LLM context
- LLM generates answer
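The traversal and serialization steps above can be sketched without external dependencies, using an in-memory edge list as a stand-in for the Neo4j graph (the data and function names are illustrative):

```python
from collections import deque

# Toy edge list standing in for the Neo4j graph
EDGES = [
    ("Ivan Petrov", "works_in", "Gazprom"),
    ("Gazprom", "located_in", "Moscow"),
    ("Moscow", "is_capital_of", "Russia"),
]

def subgraph_as_text(start: str, max_depth: int = 2) -> str:
    """BFS from a start entity up to max_depth hops; serialize the
    visited edges as plain-text lines suitable for an LLM prompt."""
    lines = []
    visited = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_depth:
            continue  # do not expand beyond the depth limit
        for s, p, o in EDGES:
            if s == node:
                lines.append(f"{s} -{p}-> {o}")
                if o not in visited:
                    visited.add(o)
                    frontier.append((o, depth + 1))
    return "\n".join(lines)
```

With the default depth of 2, subgraph_as_text("Ivan Petrov") captures the works_in and located_in edges but stops before is_capital_of; the resulting text is prepended to the user's question as LLM context.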