Implementation of Fact Extraction from Text (Information Extraction)
Information Extraction (IE) is the automatic extraction of structured information from unstructured text. The goal is to transform free text into filled database fields.
Components of an IE System
A complete IE system includes several interconnected tasks:
Named Entity Recognition → identifies entities (persons, organizations, dates, amounts)
Relation Extraction → determines relationships between entities ("Ivan works at Gazprom")
Event Extraction → extracts events with participants, time, location
Attribute Extraction → fills entity attributes ("Gazprom, revenue 10 trillion rubles, 2024")
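The four tasks above produce complementary pieces of one structured record. A minimal sketch of the target data model (the class and field names here are illustrative, not from any specific library):

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    text: str   # surface form as it appears in the document
    label: str  # PERSON, ORG, DATE, MONEY, ...

@dataclass
class Relation:
    subject: Entity
    predicate: str  # e.g. "works_at"
    obj: Entity

@dataclass
class ExtractedRecord:
    entities: list[Entity] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)

# "Ivan works at Gazprom" -> two entities linked by one relation
ivan = Entity("Ivan", "PERSON")
gazprom = Entity("Gazprom", "ORG")
record = ExtractedRecord(
    entities=[ivan, gazprom],
    relations=[Relation(ivan, "works_at", gazprom)],
)
```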
LLM-based Extraction (Modern Approach)
For most IE tasks today, an LLM with structured output is the optimal choice:
from pydantic import BaseModel
from openai import OpenAI

# Target schema: optional fields stay None when the text lacks them
class CompanyInfo(BaseModel):
    name: str
    revenue: float | None
    revenue_year: int | None
    ceo: str | None
    headquarters: str | None
    employees_count: int | None

text = "..."  # the source document

client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Extract company information from text:\n{text}"
    }],
    response_format=CompanyInfo,
)
result = response.choices[0].message.parsed  # a CompanyInfo instance
Classical Pipeline (for High Load)
For systems processing > 1000 documents/hour with latency requirements under 100 ms:
- spaCy / natasha for basic NER (persons, orgs, locations, dates)
- Dependency parsing for extracting simple relations (subject-verb-object)
- Pattern matching (spaCy Matcher) for structured patterns ("price X rubles", "rate X%")
- Normalization — conversion to canonical form (dates → ISO, amounts → float + currency)
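The pattern-matching and normalization steps can be sketched with plain regular expressions (spaCy's Matcher works on token-level patterns; this dependency-free regex version is a simplified stand-in):

```python
import re

# "price X rubles" and "rate X%" patterns; capture the numeric part
PRICE_RE = re.compile(r"price\s+([\d\s]+(?:\.\d+)?)\s+rubles", re.IGNORECASE)
RATE_RE = re.compile(r"rate\s+([\d.]+)\s*%", re.IGNORECASE)

def normalize_amount(raw: str) -> float:
    # normalization: "10 500.50" -> 10500.5 (strip thousands separators)
    return float(raw.replace(" ", ""))

def extract_prices(text: str) -> list[dict]:
    facts = []
    for m in PRICE_RE.finditer(text):
        facts.append({"type": "price",
                      "value": normalize_amount(m.group(1)),
                      "currency": "RUB"})
    for m in RATE_RE.finditer(text):
        facts.append({"type": "rate", "value": float(m.group(1))})
    return facts

facts = extract_prices("The price 10 500.50 rubles, rate 16%.")
```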
Working with Tabular Data in Text
Texts often contain tables in PDF/HTML. Strategy:
- PDF: Camelot or pdfplumber for table extraction
- HTML: BeautifulSoup + pandas.read_html
- Table images: Azure Document Intelligence or Table Transformer (Microsoft)
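For the HTML case, pandas does most of the work: `read_html` parses every `<table>` on a page into a DataFrame (the sample markup below is invented for illustration):

```python
from io import StringIO
import pandas as pd

html = """
<table>
  <tr><th>Company</th><th>Revenue</th></tr>
  <tr><td>Gazprom</td><td>10000</td></tr>
  <tr><td>Lukoil</td><td>8000</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found
tables = pd.read_html(StringIO(html))
df = tables[0]  # header row taken from the <th> cells
```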
Quality Evaluation
Metrics for IE:
- Precision/Recall/F1 by entity types
- Relation-level F1 (correct entity pair + correct relation type)
- Slot-filling accuracy (percentage of correctly filled fields)
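Entity-level precision/recall/F1 reduces to set intersection when both predictions and gold labels are represented as (span, type) pairs. A minimal sketch with strict matching:

```python
def prf1(predicted: set, gold: set) -> tuple[float, float, float]:
    # strict matching: an entity counts only if span and type both agree
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("Gazprom", "ORG"), ("Ivan", "PERSON"), ("2024", "DATE")}
pred = {("Gazprom", "ORG"), ("Ivan", "PERSON"), ("Moscow", "LOC")}
p, r, f = prf1(pred, gold)  # 2 of 3 predictions correct, 2 of 3 gold found
```

The same function works for relation-level F1 if the set elements are (subject, relation type, object) triples instead of (span, type) pairs.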
Typical results: 90–95% F1 for well-structured texts (financial reports, contracts), 75–85% for news and free text.