Information Extraction from Text Implementation


Implementation of Fact Extraction from Text (Information Extraction)

Information Extraction (IE) is the automatic extraction of structured information from unstructured text. The goal is to transform free text into filled database fields.

Components of an IE System

A complete IE system includes several interconnected tasks:

Named Entity Recognition → identifies entities (persons, organizations, dates, amounts)

Relation Extraction → determines relationships between entities ("Ivan works at Gazprom")

Event Extraction → extracts events with participants, time, location

Attribute Extraction → fills entity attributes ("Gazprom, revenue 10 trillion rubles, 2024")

LLM-based Extraction (Modern Approach)

For most IE tasks today, an LLM with structured output is usually the most practical choice:

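from openai import OpenAI
from pydantic import BaseModel


# Target schema: every field except the name is nullable, so the model can
# return null when the text does not mention a value.
class CompanyInfo(BaseModel):
    name: str
    revenue: float | None
    revenue_year: int | None
    ceo: str | None
    headquarters: str | None
    employees_count: int | None


# Source document to extract from (sample input).
text = "Gazprom, revenue 10 trillion rubles, 2024"

client = OpenAI()  # reads OPENAI_API_KEY from the environment
# parse() passes the Pydantic model as the output schema and validates the reply.
response = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Extract company information from text:\n{text}"
    }],
    response_format=CompanyInfo,
)
# A CompanyInfo instance, or None if the model refused to answer.
result = response.choices[0].message.parsed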

Classical Pipeline (for High Load)

For systems that process more than 1,000 documents per hour and require latency under 100 ms, a classical pipeline is the usual choice:

  1. spaCy / natasha for basic NER (persons, orgs, locations, dates)
  2. Dependency parsing for extracting simple relations (subject-verb-object)
  3. Pattern matching (spaCy Matcher) for structured patterns ("price X rubles", "rate X%")
  4. Normalization — conversion to canonical form (dates → ISO, amounts → float + currency)
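
Steps 3 and 4 can be sketched without any dependencies. The example below swaps in plain regular expressions for spaCy Matcher so that it runs on the standard library alone. The patterns, the `DD.MM.YYYY` date format, and the currency table are assumptions made for illustration.

```python
import re
from datetime import date

# Hypothetical patterns; a production system would maintain many more and
# would likely express them as spaCy Matcher rules over tokens.
PRICE_RE = re.compile(
    r"(?P<amount>\d+(?:[.,]\d+)?)\s*(?P<currency>rubles|RUB|USD|EUR)",
    re.IGNORECASE,
)
RATE_RE = re.compile(r"(?P<rate>\d+(?:[.,]\d+)?)\s*%")
DATE_RE = re.compile(r"\b(?P<day>\d{1,2})\.(?P<month>\d{1,2})\.(?P<year>\d{4})\b")

# Maps surface forms to canonical currency codes (illustrative subset).
CURRENCY_CODES = {"rubles": "RUB", "rub": "RUB", "usd": "USD", "eur": "EUR"}


def to_float(raw: str) -> float:
    """Accept both comma and dot as the decimal separator."""
    return float(raw.replace(",", "."))


def extract_facts(text: str) -> dict[str, list]:
    """Match prices, rates, and dates, then normalize them to canonical form."""
    prices = [
        {
            "amount": to_float(m["amount"]),
            "currency": CURRENCY_CODES[m["currency"].lower()],
        }
        for m in PRICE_RE.finditer(text)
    ]
    rates = [to_float(m["rate"]) for m in RATE_RE.finditer(text)]
    # date() raises ValueError for impossible dates such as 31.02.2024.
    dates = [
        date(int(m["year"]), int(m["month"]), int(m["day"])).isoformat()
        for m in DATE_RE.finditer(text)
    ]
    return {"prices": prices, "rates": rates, "dates": dates}


facts = extract_facts("On 05.03.2024 the price was 1500,50 rubles at a rate of 7.5%.")
print(facts)
```

Regular expressions work on raw characters, whereas spaCy Matcher works on tokens and can also use part-of-speech and entity annotations, which tends to make its rules more robust on real text.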

Working with Tabular Data in Text

Source documents often contain tables embedded in PDF or HTML. The extraction tool depends on the format:

  • PDF: Camelot or pdfplumber for table extraction
  • HTML: BeautifulSoup + pandas.read_html
  • Table images: Azure Document Intelligence or Table Transformer (Microsoft)
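
For the HTML case, the idea can be shown with a dependency-free sketch. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup and pandas.read_html, which would normally do this in one call. The `TableParser` class is a hypothetical helper and does not handle nested tables or merged cells.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self) -> None:
        super().__init__()
        self.tables: list[list[list[str]]] = []
        self._row: list[str] | None = None   # row being built, if any
        self._cell: list[str] | None = None  # text chunks of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


html = """
<table>
  <tr><th>Company</th><th>Revenue</th></tr>
  <tr><td>Gazprom</td><td>10 trillion rubles</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)

# Treat the first row as the header and turn the rest into records.
header, *rows = parser.tables[0]
records = [dict(zip(header, row)) for row in rows]
print(records)
```

The resulting records can go through the same normalization step as facts extracted from running text.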

Quality Evaluation

Metrics for IE:

  • Precision/Recall/F1 by entity types
  • Relation-level F1 (correct entity pair + correct relation type)
  • Slot-filling accuracy (percentage of correctly filled fields)
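
The three metrics above can be computed as in the following sketch. It assumes exact-match scoring: an entity counts as correct only if both text and type match, and a relation only if both entities and the relation type match. The tuple layouts and sample data are illustrative.

```python
from collections import defaultdict


def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def per_type_scores(gold: set[tuple], predicted: set[tuple], type_index: int) -> dict:
    """Score exact matches, grouped by the type stored at position type_index."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for item in predicted:
        counts[item[type_index]]["tp" if item in gold else "fp"] += 1
    for item in gold - predicted:
        counts[item[type_index]]["fn"] += 1
    return {label: prf(**c) for label, c in counts.items()}


def slot_accuracy(gold: dict, predicted: dict) -> float:
    """Share of gold slots whose predicted value matches exactly."""
    correct = sum(predicted.get(slot) == value for slot, value in gold.items())
    return correct / len(gold) if gold else 0.0


# Entities as (text, type); "Gazprom" is mislabeled in the prediction.
gold_entities = {("Ivan", "PERSON"), ("Gazprom", "ORG"), ("2024", "DATE")}
pred_entities = {("Ivan", "PERSON"), ("Gazprom", "PERSON"), ("2024", "DATE")}
entity_scores = per_type_scores(gold_entities, pred_entities, type_index=1)

# Relations as (head, relation type, tail).
gold_relations = {("Ivan", "WORKS_AT", "Gazprom")}
pred_relations = {("Ivan", "WORKS_AT", "Gazprom")}
relation_scores = per_type_scores(gold_relations, pred_relations, type_index=1)

gold_slots = {"name": "Gazprom", "revenue_year": 2024, "ceo": None}
pred_slots = {"name": "Gazprom", "revenue_year": 2023, "ceo": None}
accuracy = slot_accuracy(gold_slots, pred_slots)
```

In this sample the mislabeled entity costs PERSON precision and ORG recall at the same time, which is why per-type reporting is more informative than a single aggregate score.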

Typical results: 90–95% F1 for well-structured texts (financial reports, contracts), 75–85% for news and free text.