Parsing articles and content for auto-populating 1C-Bitrix
Content sections of a website — blog, knowledge base, article catalog — require regular updates. Manual content creation is expensive and doesn't scale well. Parsing articles from external sources allows maintaining publication frequency, but differs from product parsing: here the emphasis is on text quality, structure and formatting preservation, rather than processing speed.
Difference from news parsing
A news parser works with RSS feeds — structured, predictable data. Article parsing involves working with arbitrary HTML pages, where each source site has its own markup, navigation structure, and content presentation method.
Key differences:
- No single format — each source requires an individual parser or universal extractor.
- Complex content structure — an article contains headings, lists, tables, embedded media, code blocks. All of this must be preserved.
- Text volume — an article of 5,000–10,000 characters versus a 500-character news item. More data means more failure points.
- Update frequency — articles are published less frequently than news, but each piece of content is more valuable.
Extracting content from HTML
The main task is to separate article text from navigation, sidebars, ads, comments, and footer. Three approaches:
1. CSS selectors for a specific site. For each source, a main content selector is determined: article.post-content, div#main-text, .entry-body. Works reliably for a limited set of sources but breaks when the site is redesigned.
2. Content extraction algorithms. Libraries like Readability (Mozilla Readability port to PHP — andreskrey/readability.php) analyze the DOM and identify main content using heuristics: text density, link-to-text ratio, semantic tags <article>, <main>.
3. Hybrid approach. Readability for initial extraction + custom rules for specific sources where automation fails.
In practice, the hybrid approach is the only one that works for 10+ sources. Pure automation loses important blocks (tables, lists), pure selectors don't scale.
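The hybrid approach can be sketched as a per-site selector map with a generic fallback. Everything here is illustrative: the selector map entries, host names, and the choice of semantic-tag fallback are assumptions, not a drop-in implementation.

```php
<?php
// Hybrid extractor sketch: per-site XPath rules first, generic
// semantic-tag fallback second. Site rules are hypothetical examples.

function extractArticleHtml(string $html, string $host): ?string
{
    // Per-site rules: maintained manually for sources where
    // automatic extraction misses tables or lists.
    $siteRules = [
        'example-blog.com' => "//article[contains(@class,'post-content')]",
        'example-kb.com'   => "//div[@id='main-text']",
    ];

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);               // tolerate real-world HTML
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // 1. Site-specific selector, if one is configured for this host.
    $queries = [];
    if (isset($siteRules[$host])) {
        $queries[] = $siteRules[$host];
    }
    // 2. Generic fallback: semantic tags that usually wrap the content.
    $queries[] = '//article';
    $queries[] = '//main';

    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            return $doc->saveHTML($nodes->item(0));
        }
    }
    return null; // nothing matched: send the URL to manual review
}
```

A Readability-style library would replace the generic fallback here; the per-site rules stay as the override layer.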
Preserving structure and formatting
After extraction, the HTML must be converted to a format suitable for storage in the DETAIL_TEXT field of a Bitrix info block:
- Cleanup — removal of <script>, <style>, <iframe>, inline styles, and data attributes. Use HTMLPurifier with a custom configuration allowing <h2>–<h4>, <p>, <ul>, <ol>, <li>, <table>, <img>, <a>, <strong>, <em>, <blockquote>, <pre>, <code>.
- Heading normalization — the original article's <h1> becomes <h2> in the Bitrix page context (where <h1> is the element heading).
- Image localization — downloading external images to /upload/ and replacing their URLs in the HTML. Without this, images disappear when the source blocks hotlinking or changes URLs.
- Lazy loading — many sites use data-src instead of src for images. The parser must account for this.
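The heading-normalization and lazy-loading steps can be shown with plain DOM manipulation. This is a minimal sketch, not a replacement for HTMLPurifier: it only demonstrates the mechanics of dropping dangerous tags, demoting <h1>, and promoting data-src.

```php
<?php
// Minimal DOM-based sketch of the cleanup steps above. In production,
// HTMLPurifier with a whitelist configuration is more robust.

function normalizeArticleHtml(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Remove script/style/iframe entirely.
    foreach ($xpath->query('//script|//style|//iframe') as $node) {
        $node->parentNode->removeChild($node);
    }

    // Demote <h1> to <h2>: <h1> is reserved for the element name in Bitrix.
    foreach (iterator_to_array($xpath->query('//h1')) as $h1) {
        $h2 = $doc->createElement('h2');
        while ($h1->firstChild) {
            $h2->appendChild($h1->firstChild);
        }
        $h1->parentNode->replaceChild($h2, $h1);
    }

    // Lazy loading: promote data-src to src.
    foreach ($xpath->query('//img[@data-src]') as $img) {
        $img->setAttribute('src', $img->getAttribute('data-src'));
        $img->removeAttribute('data-src');
    }

    // Return only the body's inner HTML (loadHTML adds html/body wrappers).
    $out = '';
    foreach ($xpath->query('//body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```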
Info block mapping
| Extracted data | Info block field | Processing |
|---|---|---|
| Title <h1> / <title> | NAME | Truncate to 255 characters, clean HTML |
| First 300 characters of text | PREVIEW_TEXT | strip_tags() + truncate at sentence boundary |
| Full article HTML | DETAIL_TEXT | Cleanup via HTMLPurifier |
| First image | PREVIEW_PICTURE | Download + resize |
| Source URL | PROPERTY_SOURCE_URL | No changes |
| Publication date | ACTIVE_FROM | Parse via strtotime() |
| md5(url) | XML_ID | For deduplication |
| Author | PROPERTY_AUTHOR | Extract from meta or byline |
| Tags / keywords | PROPERTY_TAGS | Multiple property of type "string" |
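The mapping above can be expressed as a builder for the field array passed to CIBlockElement::Add(). The property codes (SOURCE_URL, AUTHOR, TAGS) and the input array shape are assumptions about the info block setup; the date format is the common Bitrix site default.

```php
<?php
// Sketch: turn extracted article data into a field array following
// the mapping table. Property codes are assumed info block configuration.

function buildElementFields(array $article, int $iblockId): array
{
    $plain = trim(strip_tags($article['html']));

    // PREVIEW_TEXT: first ~300 characters, cut back to a sentence boundary.
    $preview  = mb_substr($plain, 0, 300);
    $lastStop = max(mb_strrpos($preview, '.'), mb_strrpos($preview, '!'), mb_strrpos($preview, '?'));
    if ($lastStop !== false && $lastStop > 0) {
        $preview = mb_substr($preview, 0, $lastStop + 1);
    }

    return [
        'IBLOCK_ID'        => $iblockId,
        'NAME'             => mb_substr(strip_tags($article['title']), 0, 255),
        'PREVIEW_TEXT'     => $preview,
        'DETAIL_TEXT'      => $article['html'],          // already cleaned
        'DETAIL_TEXT_TYPE' => 'html',
        'ACTIVE_FROM'      => date('d.m.Y H:i:s', strtotime($article['date'])),
        'XML_ID'           => md5($article['url']),      // deduplication key
        'PROPERTY_VALUES'  => [
            'SOURCE_URL' => $article['url'],
            'AUTHOR'     => $article['author'] ?? '',
            'TAGS'       => $article['tags'] ?? [],
        ],
    ];
}
```

Before calling CIBlockElement::Add(), the importer checks whether an element with the same XML_ID already exists; that is what makes re-running the parser idempotent.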
Step-by-step parsing process
Step 1. URL collection. The parser traverses list pages (pagination, categories, sitemap.xml) and collects article URLs, saving them to a queue — a parser_queue table with url, status, and created_at fields.
Step 2. Loading and extraction. For each URL from the queue: load HTML, extract content, parse metadata. Result — a structured array saved to intermediate table parser_articles.
Step 3. Moderation (optional). The administrator reviews parsed articles in the interface, approves or rejects. For full automation, this step is replaced by rule-based filtering.
Step 4. Import. Approved articles are loaded into the info block via CIBlockElement::Add(). Images are saved via CFile::MakeFileArray().
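The queue from steps 1–2 reduces to two operations: claim the next pending URL and record the result. A minimal sketch, using SQLite in place of the project's MySQL database; table and column names follow the steps above, the status values are assumptions.

```php
<?php
// Queue sketch: claim one pending URL, process it elsewhere, mark the result.

function claimNextUrl(PDO $pdo): ?string
{
    $row = $pdo->query(
        "SELECT url FROM parser_queue WHERE status = 'pending'
         ORDER BY created_at LIMIT 1"
    )->fetch(PDO::FETCH_ASSOC);
    if ($row === false) {
        return null;                              // queue is empty
    }
    // Mark as in-progress so a parallel run does not pick it up.
    $stmt = $pdo->prepare("UPDATE parser_queue SET status = 'processing' WHERE url = ?");
    $stmt->execute([$row['url']]);
    return $row['url'];
}

function markDone(PDO $pdo, string $url, bool $ok): void
{
    $stmt = $pdo->prepare('UPDATE parser_queue SET status = ? WHERE url = ?');
    $stmt->execute([$ok ? 'done' : 'error', $url]);
}
```

Keeping failed URLs with an 'error' status (rather than deleting them) makes step-by-step debugging possible: the administrator can see exactly which sources break and why.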
Dealing with anti-parsing protection
Content sites are protected less robustly than marketplaces, but basic measures exist:
- robots.txt — check the Disallow rules for the parsed sections. Ignoring robots.txt is an additional legal risk.
- Rate limiting — 1–2 requests per second are safe for most sites. Aggressive parsing (10+ rps) will result in blocking.
- JavaScript rendering — SPA sites require a headless browser. For static sites, cURL is sufficient.
- Cloudflare / WAF — identify bots by browser fingerprint. Worked around with a headless browser and realistic headers.
Automation with cron
Recommended cron task structure:
```
# Collect new URLs from sources — once daily
0 2 * * * php /home/bitrix/parsers/collect_urls.php

# Parse articles from queue — every 2 hours
0 */2 * * * php /home/bitrix/parsers/parse_articles.php --limit=50

# Import to info block — every hour
0 * * * * php /home/bitrix/parsers/import_articles.php
```
Splitting into three tasks allows controlling each stage independently and quickly localizing problems when failures occur.
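One practical detail worth adding to each script: a lock so that a run which overruns its slot (e.g. a slow source during the 2-hour parse window) is not started again by the next cron tick. A common flock()-based sketch; the lock-file path is an illustrative assumption.

```php
<?php
// Run-lock sketch: prevents overlapping cron executions of one script.

function acquireRunLock(string $lockFile)
{
    $handle = fopen($lockFile, 'c');            // create if missing, keep contents
    if ($handle === false) {
        return false;
    }
    // LOCK_NB: fail immediately instead of waiting for the other run.
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return false;                            // previous run still active
    }
    return $handle;                              // keep the handle open until exit
}
```

Each of the three scripts would call this at startup with its own lock file and exit quietly when acquireRunLock() returns false.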