Parsing articles and content for auto-population in 1C-Bitrix

Content sections of a website (blog, knowledge base, article catalog) require regular updates. Manual content creation is expensive and does not scale well. Parsing articles from external sources helps maintain publication frequency, but it differs from product parsing: the emphasis here is on text quality and on preserving structure and formatting, rather than on processing speed.

Difference from news parsing

A news parser works with RSS feeds — structured, predictable data. Article parsing involves working with arbitrary HTML pages, where each source site has its own markup, navigation structure, and content presentation method.

Key differences:

  • No single format — each source requires an individual parser or universal extractor.
  • Complex content structure — an article contains headings, lists, tables, embedded media, code blocks. All of this must be preserved.
  • Text volume — an article of 5,000–10,000 characters versus a 500-character news item. More data means more failure points.
  • Update frequency — articles are published less frequently than news, but each piece of content is more valuable.

Extracting content from HTML

The main task is to separate article text from navigation, sidebars, ads, comments, and footer. Three approaches:

1. CSS selectors for a specific site. For each source, a main-content selector is determined: article.post-content, div#main-text, .entry-body. This works reliably for a limited set of sources but breaks whenever a site is redesigned.

2. Content extraction algorithms. Libraries like Readability (Mozilla Readability port to PHP — andreskrey/readability.php) analyze the DOM and identify main content using heuristics: text density, link-to-text ratio, semantic tags <article>, <main>.

3. Hybrid approach. Readability for initial extraction + custom rules for specific sources where automation fails.

In practice, the hybrid approach is the only one that works for 10+ sources. Pure automation loses important blocks (tables, lists), pure selectors don't scale.
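
A minimal sketch of the hybrid approach using PHP's built-in DOMDocument/DOMXPath. The selector map, host names, and fallback heuristic here are illustrative assumptions, not rules from any specific project:

```php
<?php
// Hybrid extraction sketch: per-site XPath rules first, heuristic fallback.

function extractContent(string $html, string $host): ?string
{
    // 1. Hand-maintained, site-specific rules (illustrative hosts/selectors).
    $rules = [
        'example.com'      => "//article[contains(@class,'post-content')]",
        'blog.example.org' => "//div[@id='main-text']",
    ];

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate real-world markup
    $doc->loadHTML('<?xml encoding="utf-8"?>' . $html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    if (isset($rules[$host])) {
        $nodes = $xpath->query($rules[$host]);
        if ($nodes->length > 0) {
            return $doc->saveHTML($nodes->item(0));
        }
    }

    // 2. Fallback: prefer semantic containers, pick the node with the most
    //    text. A crude stand-in for what Readability does with heuristics.
    $best = null;
    $bestLen = 0;
    foreach ($xpath->query('//article | //main | //div') as $node) {
        $len = mb_strlen(trim($node->textContent));
        if ($len > $bestLen) {
            $bestLen = $len;
            $best = $node;
        }
    }
    return $best ? $doc->saveHTML($best) : null;
}
```

In a real setup the fallback branch would call the Readability library instead of this longest-text heuristic, which naively prefers an outer wrapper over its children and can drag navigation along with the article.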

Preserving structure and formatting

After HTML extraction, it must be converted to a format suitable for storage in the DETAIL_TEXT field of a Bitrix info block:

  • Cleanup — removal of <script>, <style>, <iframe>, inline styles, data attributes. Use HTMLPurifier with custom configuration allowing <h2>–<h4>, <p>, <ul>, <ol>, <li>, <table>, <img>, <a>, <strong>, <em>, <blockquote>, <pre>, <code>.
  • Heading normalization — the original article <h1> becomes <h2> in the Bitrix page context (where <h1> is the element heading).
  • Image localization — downloading external images to /upload/ and replacing the URLs in the HTML. Without this, images disappear as soon as the source blocks hotlinking or changes its URLs.
  • Lazy loading — many sites use data-src instead of src for images. The parser must account for this.
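
The DOM-level fixes above (heading demotion, lazy-loading attributes, stripping dangerous tags) can be sketched with DOMDocument; the function name and the root-wrapper trick are assumptions of this sketch, and an HTMLPurifier pass would still follow it:

```php
<?php
// Post-extraction normalization sketch: demote <h1>, resolve data-src,
// remove <script>/<style>/<iframe>. Image download is out of scope here.

function normalizeArticleHtml(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // Wrap the fragment so we can find and serialize it back precisely.
    $doc->loadHTML('<?xml encoding="utf-8"?><div id="root">' . $html . '</div>');
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // <h1> becomes <h2>: the Bitrix page already renders its own <h1>.
    foreach (iterator_to_array($xpath->query('//h1')) as $h1) {
        $h2 = $doc->createElement('h2');
        while ($h1->firstChild) {
            $h2->appendChild($h1->firstChild);
        }
        $h1->parentNode->replaceChild($h2, $h1);
    }

    // Lazy loading: prefer data-src when present.
    foreach ($xpath->query('//img[@data-src]') as $img) {
        $img->setAttribute('src', $img->getAttribute('data-src'));
        $img->removeAttribute('data-src');
    }

    // Drop script/style/iframe outright.
    foreach (iterator_to_array($xpath->query('//script | //style | //iframe')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize only the wrapper's children, not the synthetic document.
    $root = $xpath->query('//div[@id="root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```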

Info block mapping

Extracted data → info block field → processing:

  • Title (<h1> / <title>) → NAME — truncate to 255 characters, strip HTML.
  • First 300 characters of text → PREVIEW_TEXT — strip_tags() + truncate at a sentence boundary.
  • Full article HTML → DETAIL_TEXT — cleanup via HTMLPurifier.
  • First image → PREVIEW_PICTURE — download + resize.
  • Source URL → PROPERTY_SOURCE_URL — stored unchanged.
  • Publication date → ACTIVE_FROM — parsed via strtotime().
  • md5(url) → XML_ID — used for deduplication.
  • Author → PROPERTY_AUTHOR — extracted from meta tags or the byline.
  • Tags / keywords → PROPERTY_TAGS — multiple property of type "string".
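
A sketch of assembling the fields array from this mapping. The property codes, helper names, and the ACTIVE_FROM format are assumptions (Bitrix expects the date in the site's configured format):

```php
<?php
// Mapping sketch: build the $fields array for a later CIBlockElement::Add().

function truncateAtSentence(string $text, int $limit = 300): string
{
    $text = trim(strip_tags($text));
    if (mb_strlen($text) <= $limit) {
        return $text;
    }
    $cut = mb_substr($text, 0, $limit);
    $dot = mb_strrpos($cut, '.');
    // Cut at the last full sentence; fall back to a hard cut.
    return $dot !== false ? mb_substr($cut, 0, $dot + 1) : $cut;
}

function buildElementFields(array $article, int $iblockId): array
{
    return [
        'IBLOCK_ID'        => $iblockId,
        'NAME'             => mb_substr(strip_tags($article['title']), 0, 255),
        'PREVIEW_TEXT'     => truncateAtSentence($article['html']),
        'DETAIL_TEXT'      => $article['html'], // assumed already purified
        'DETAIL_TEXT_TYPE' => 'html',
        'ACTIVE_FROM'      => date('d.m.Y H:i:s', strtotime($article['date'])),
        'XML_ID'           => md5($article['url']), // deduplication key
        'PROPERTY_VALUES'  => [
            'SOURCE_URL' => $article['url'],
            'AUTHOR'     => $article['author'] ?? '',
            'TAGS'       => $article['tags'] ?? [],
        ],
    ];
}
```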

Step-by-step parsing process

Step 1. URL collection. The parser traverses list pages (pagination, categories, sitemap.xml) and collects article URLs. They are saved to a queue: a parser_queue table with fields url, status, created_at.
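
The queue can be sketched like this; an in-memory SQLite database stands in for the real one, and INSERT OR IGNORE keeps repeated collector runs from duplicating entries:

```php
<?php
// Queue sketch (parser_queue table as described above, SQLite for brevity).

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE parser_queue (
    url        TEXT PRIMARY KEY,
    status     TEXT DEFAULT 'new',
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)");

// Collector: the PRIMARY KEY on url plus OR IGNORE makes inserts idempotent.
$ins = $db->prepare('INSERT OR IGNORE INTO parser_queue (url) VALUES (?)');
foreach (['https://example.com/a1', 'https://example.com/a2',
          'https://example.com/a1'] as $url) {
    $ins->execute([$url]);
}

// Parsing stage: take a batch of new URLs and mark them as in progress.
$batch = $db->query(
    "SELECT url FROM parser_queue WHERE status = 'new' LIMIT 50"
)->fetchAll(PDO::FETCH_COLUMN);
$upd = $db->prepare("UPDATE parser_queue SET status = 'processing' WHERE url = ?");
foreach ($batch as $url) {
    $upd->execute([$url]);
}
```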

Step 2. Loading and extraction. For each URL from the queue: load HTML, extract content, parse metadata. Result — a structured array saved to intermediate table parser_articles.

Step 3. Moderation (optional). The administrator reviews parsed articles in the interface, approves or rejects. For full automation, this step is replaced by rule-based filtering.

Step 4. Import. Approved articles are loaded into the info block via CIBlockElement::Add(). Images are saved via CFile::MakeFileArray().
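
A guarded sketch of the import call. CIBlockElement and CFile exist only inside a loaded Bitrix kernel, so the snippet checks for them and does nothing elsewhere; the fields array is assumed to be prepared per the mapping table above:

```php
<?php
// Import-step sketch: add one parsed article as an info block element.

function importArticle(array $fields, ?string $pictureUrl): ?int
{
    if (!class_exists('CIBlockElement')) {
        return null; // not running under Bitrix: skip
    }
    if ($pictureUrl !== null) {
        // CFile::MakeFileArray accepts a path or URL and returns the
        // file array expected by PREVIEW_PICTURE.
        $fields['PREVIEW_PICTURE'] = CFile::MakeFileArray($pictureUrl);
    }
    $el = new CIBlockElement();
    $id = $el->Add($fields);
    return $id ? (int)$id : null; // null on failure; see $el->LAST_ERROR
}
```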

Dealing with anti-parsing protection

Content sites are protected less heavily than marketplaces, but basic measures are still common:

  • robots.txt — check Disallow for parsed sections. Ignoring robots.txt is an additional legal risk.
  • Rate limiting — 1–2 requests per second are safe for most sites. Aggressive parsing (10+ rps) will result in blocking.
  • JavaScript rendering — SPA sites require a headless browser. For static sites, cURL is sufficient.
  • Cloudflare / WAF — identify bots by fingerprint. Solved with a headless browser using realistic headers.
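
The rate-limiting advice above can be enforced with a small helper called before each HTTP request; the class and the interval value are illustrative:

```php
<?php
// Rate-limiting sketch: enforce a minimum interval between requests.
// With $minInterval = 0.5 this yields at most 2 requests per second.

class RateLimiter
{
    private float $last = 0.0;

    public function __construct(private float $minInterval) {}

    // Seconds the caller must still wait before the next request is allowed.
    public function delayNeeded(float $now): float
    {
        return max(0.0, $this->minInterval - ($now - $this->last));
    }

    // Sleep out the remaining interval, then record the request time.
    public function waitAndMark(): void
    {
        $delay = $this->delayNeeded(microtime(true));
        if ($delay > 0) {
            usleep((int)round($delay * 1e6));
        }
        $this->last = microtime(true);
    }
}
```

Usage: create one limiter per source host and call waitAndMark() immediately before each cURL request to that host.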

Automation with cron

Recommended cron task structure:

# Collect new URLs from sources — once daily
0 2 * * * php /home/bitrix/parsers/collect_urls.php

# Parse articles from queue — every 2 hours
0 */2 * * * php /home/bitrix/parsers/parse_articles.php --limit=50

# Import to info block — every hour
0 * * * * php /home/bitrix/parsers/import_articles.php

Splitting into three tasks allows controlling each stage independently and quickly localizing problems when failures occur.
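
One practical guard worth adding to each of the three scripts (an assumption, not something the cron lines above provide by themselves): a non-blocking file lock, so a slow run is never overlapped by the next scheduled one. The lock path is illustrative:

```php
<?php
// Overlap guard sketch: take an exclusive lock or exit immediately.

function acquireLock(string $path)
{
    $fp = fopen($path, 'c'); // create if missing, never truncate
    if ($fp === false) {
        return null;
    }
    if (!flock($fp, LOCK_EX | LOCK_NB)) {
        fclose($fp);
        return null; // previous run still holds the lock
    }
    return $fp; // keep the handle open for the lifetime of the script
}

$lock = acquireLock(sys_get_temp_dir() . '/parse_articles.lock');
if ($lock === null) {
    exit(0); // another instance is running; let it finish
}
// ... parsing work goes here ...
flock($lock, LOCK_UN);
fclose($lock);
```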