Parsing articles and content for auto-populating 1C-Bitrix
Content sections of a website — blog, knowledge base, article catalog — require regular updates. Manual content creation is expensive and doesn't scale well. Parsing articles from external sources allows maintaining publication frequency, but differs from product parsing: here the emphasis is on text quality, structure and formatting preservation, rather than processing speed.
Difference from news parsing
A news parser works with RSS feeds — structured, predictable data. Article parsing involves working with arbitrary HTML pages, where each source site has its own markup, navigation structure, and content presentation method.
Key differences:
- No single format — each source requires an individual parser or universal extractor.
- Complex content structure — an article contains headings, lists, tables, embedded media, code blocks. All of this must be preserved.
- Text volume — an article of 5,000–10,000 characters versus a 500-character news item. More data means more failure points.
- Update frequency — articles are published less frequently than news, but each piece of content is more valuable.
Extracting content from HTML
The main task is to separate article text from navigation, sidebars, ads, comments, and footer. Three approaches:
1. CSS selectors for a specific site. For each source, a main content selector is determined: article.post-content, div#main-text, .entry-body. Works reliably for a limited set of sources but breaks when the site is redesigned.
2. Content extraction algorithms. Libraries like Readability (Mozilla Readability port to PHP — andreskrey/readability.php) analyze the DOM and identify main content using heuristics: text density, link-to-text ratio, semantic tags <article>, <main>.
3. Hybrid approach. Readability for initial extraction + custom rules for specific sources where automation fails.
In practice, the hybrid approach is the only one that works for 10+ sources. Pure automation loses important blocks (tables, lists), pure selectors don't scale.
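The hybrid approach can be sketched as a per-site selector map with a generic fallback. Everything here is illustrative: the selector map entries, host names, and the choice of semantic-tag fallback are assumptions, not a drop-in implementation.

```php
<?php
// Hybrid extractor sketch: per-site XPath rules first, generic
// semantic-tag fallback second. Site rules are hypothetical examples.

function extractArticleHtml(string $html, string $host): ?string
{
    // Per-site rules: maintained manually for sources where
    // automatic extraction misses tables or lists.
    $siteRules = [
        'example-blog.com' => "//article[contains(@class,'post-content')]",
        'example-kb.com'   => "//div[@id='main-text']",
    ];

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);               // tolerate real-world HTML
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // 1. Site-specific selector, if one is configured for this host.
    $queries = [];
    if (isset($siteRules[$host])) {
        $queries[] = $siteRules[$host];
    }
    // 2. Generic fallback: semantic tags that usually wrap the content.
    $queries[] = '//article';
    $queries[] = '//main';

    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            return $doc->saveHTML($nodes->item(0));
        }
    }
    return null; // nothing matched: send the URL to manual review
}
```

A Readability-style library would replace the generic fallback here; the per-site rules stay as the override layer.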
Preserving structure and formatting
After extraction, the HTML must be converted to a format suitable for storage in the DETAIL_TEXT field of a Bitrix info block:
- Cleanup — removal of <script>, <style>, <iframe>, inline styles, and data attributes. Use HTMLPurifier with a custom configuration allowing <h2>–<h4>, <p>, <ul>, <ol>, <li>, <table>, <img>, <a>, <strong>, <em>, <blockquote>, <pre>, <code>.
- Heading normalization — the original article's <h1> becomes <h2> in the Bitrix page context (where <h1> is the element heading).
- Image localization — downloading external images to /upload/ and replacing their URLs in the HTML. Without this, images disappear when the source blocks hotlinking or changes URLs.
- Lazy loading — many sites use data-src instead of src for images. The parser must account for this.
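The heading-normalization and lazy-loading steps can be shown with plain DOM manipulation. This is a minimal sketch, not a replacement for HTMLPurifier: it only demonstrates the mechanics of dropping dangerous tags, demoting <h1>, and promoting data-src.

```php
<?php
// Minimal DOM-based sketch of the cleanup steps above. In production,
// HTMLPurifier with a whitelist configuration is more robust.

function normalizeArticleHtml(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Remove script/style/iframe entirely.
    foreach ($xpath->query('//script|//style|//iframe') as $node) {
        $node->parentNode->removeChild($node);
    }

    // Demote <h1> to <h2>: <h1> is reserved for the element name in Bitrix.
    foreach (iterator_to_array($xpath->query('//h1')) as $h1) {
        $h2 = $doc->createElement('h2');
        while ($h1->firstChild) {
            $h2->appendChild($h1->firstChild);
        }
        $h1->parentNode->replaceChild($h2, $h1);
    }

    // Lazy loading: promote data-src to src.
    foreach ($xpath->query('//img[@data-src]') as $img) {
        $img->setAttribute('src', $img->getAttribute('data-src'));
        $img->removeAttribute('data-src');
    }

    // Return only the body's inner HTML (loadHTML adds html/body wrappers).
    $out = '';
    foreach ($xpath->query('//body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```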
Info block mapping
| Extracted data | Info block field | Processing |
|---|---|---|
| Title <h1> / <title> | NAME | Truncate to 255 characters, clean HTML |
| First 300 characters of text | PREVIEW_TEXT | strip_tags() + truncate at sentence boundary |
| Full article HTML | DETAIL_TEXT | Cleanup via HTMLPurifier |
| First image | PREVIEW_PICTURE | Download + resize |
| Source URL | PROPERTY_SOURCE_URL | No changes |
| Publication date | ACTIVE_FROM | Parse via strtotime() |
| md5(url) | XML_ID | For deduplication |
| Author | PROPERTY_AUTHOR | Extract from meta or byline |
| Tags / keywords | PROPERTY_TAGS | Multiple property of type "string" |
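The mapping above can be expressed as a builder for the field array passed to CIBlockElement::Add(). The property codes (SOURCE_URL, AUTHOR, TAGS) and the input array shape are assumptions about the info block setup; the date format is the common Bitrix site default.

```php
<?php
// Sketch: turn extracted article data into a field array following
// the mapping table. Property codes are assumed info block configuration.

function buildElementFields(array $article, int $iblockId): array
{
    $plain = trim(strip_tags($article['html']));

    // PREVIEW_TEXT: first ~300 characters, cut back to a sentence boundary.
    $preview  = mb_substr($plain, 0, 300);
    $lastStop = max(mb_strrpos($preview, '.'), mb_strrpos($preview, '!'), mb_strrpos($preview, '?'));
    if ($lastStop !== false && $lastStop > 0) {
        $preview = mb_substr($preview, 0, $lastStop + 1);
    }

    return [
        'IBLOCK_ID'        => $iblockId,
        'NAME'             => mb_substr(strip_tags($article['title']), 0, 255),
        'PREVIEW_TEXT'     => $preview,
        'DETAIL_TEXT'      => $article['html'],          // already cleaned
        'DETAIL_TEXT_TYPE' => 'html',
        'ACTIVE_FROM'      => date('d.m.Y H:i:s', strtotime($article['date'])),
        'XML_ID'           => md5($article['url']),      // deduplication key
        'PROPERTY_VALUES'  => [
            'SOURCE_URL' => $article['url'],
            'AUTHOR'     => $article['author'] ?? '',
            'TAGS'       => $article['tags'] ?? [],
        ],
    ];
}
```

Before calling CIBlockElement::Add(), the importer checks whether an element with the same XML_ID already exists; that is what makes re-running the parser idempotent.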
Step-by-step parsing process
Step 1. URL collection. The parser traverses list pages (pagination, categories, sitemap.xml) and collects article URLs, saving them to a queue — a parser_queue table with url, status, and created_at fields.
Step 2. Loading and extraction. For each URL from the queue: load HTML, extract content, parse metadata. Result — a structured array saved to intermediate table parser_articles.
Step 3. Moderation (optional). The administrator reviews parsed articles in the interface, approves or rejects. For full automation, this step is replaced by rule-based filtering.
Step 4. Import. Approved articles are loaded into the info block via CIBlockElement::Add(). Images are saved via CFile::MakeFileArray().
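The queue from steps 1–2 reduces to two operations: claim the next pending URL and record the result. A minimal sketch, using SQLite in place of the project's MySQL database; table and column names follow the steps above, the status values are assumptions.

```php
<?php
// Queue sketch: claim one pending URL, process it elsewhere, mark the result.

function claimNextUrl(PDO $pdo): ?string
{
    $row = $pdo->query(
        "SELECT url FROM parser_queue WHERE status = 'pending'
         ORDER BY created_at LIMIT 1"
    )->fetch(PDO::FETCH_ASSOC);
    if ($row === false) {
        return null;                              // queue is empty
    }
    // Mark as in-progress so a parallel run does not pick it up.
    $stmt = $pdo->prepare("UPDATE parser_queue SET status = 'processing' WHERE url = ?");
    $stmt->execute([$row['url']]);
    return $row['url'];
}

function markDone(PDO $pdo, string $url, bool $ok): void
{
    $stmt = $pdo->prepare('UPDATE parser_queue SET status = ? WHERE url = ?');
    $stmt->execute([$ok ? 'done' : 'error', $url]);
}
```

Keeping failed URLs with an 'error' status (rather than deleting them) makes step-by-step debugging possible: the administrator can see exactly which sources break and why.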
Dealing with anti-parsing protection
Content sites are protected less robustly than marketplaces, but basic measures exist:
- robots.txt — check the Disallow rules for the parsed sections. Ignoring robots.txt is an additional legal risk.
- Rate limiting — 1–2 requests per second are safe for most sites. Aggressive parsing (10+ rps) will result in blocking.
- JavaScript rendering — SPA sites require a headless browser. For static sites, cURL is sufficient.
- Cloudflare / WAF — identify bots by browser fingerprint. Worked around with a headless browser and realistic headers.
Automation with cron
Recommended cron task structure:
```
# Collect new URLs from sources — once daily
0 2 * * * php /home/bitrix/parsers/collect_urls.php

# Parse articles from queue — every 2 hours
0 */2 * * * php /home/bitrix/parsers/parse_articles.php --limit=50

# Import to info block — every hour
0 * * * * php /home/bitrix/parsers/import_articles.php
```
Splitting into three tasks allows controlling each stage independently and quickly localizing problems when failures occur.
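One practical detail worth adding to each script: a lock so that a run which overruns its slot (e.g. a slow source during the 2-hour parse window) is not started again by the next cron tick. A common flock()-based sketch; the lock-file path is an illustrative assumption.

```php
<?php
// Run-lock sketch: prevents overlapping cron executions of one script.

function acquireRunLock(string $lockFile)
{
    $handle = fopen($lockFile, 'c');            // create if missing, keep contents
    if ($handle === false) {
        return false;
    }
    // LOCK_NB: fail immediately instead of waiting for the other run.
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return false;                            // previous run still active
    }
    return $handle;                              // keep the handle open until exit
}
```

Each of the three scripts would call this at startup with its own lock file and exit quietly when acquireRunLock() returns false.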