Parsing news feeds for auto-filling 1C-Bitrix
A news section on a 1C-Bitrix site that updates once a month is worse than no news section at all: search engines see an abandoned resource, and users lose trust. Auto-filling via news-feed parsing solves the problem of regular content updates, but it requires a proper implementation — otherwise you get duplicates, broken layout, and uniqueness issues.
Data sources
News feeds are available in several formats:
- RSS/Atom feeds — standardized XML with title, description, link, date. Supported by most news outlets and blogs. Most reliable source.
- News aggregator APIs — NewsAPI, Mediastack, Currents API. Structured JSON, paid rates for commercial use.
- HTML pages — parsing source sites directly. Unreliable: layout changes, bot protection, legal risks.
For auto-filling Bitrix sites, RSS feeds are the optimal balance of reliability and simplicity. Start with them.
Architecture of the RSS parser
A news parser for Bitrix consists of three layers:
1. Fetcher. Retrieves RSS feeds from a list of URLs. Uses file_get_contents with context or cURL with timeouts. Each feed is parsed via SimpleXMLElement or the SimplePie library.
```php
$xml = simplexml_load_string($rssContent);
if ($xml === false) {
    return; // malformed XML — skip this feed
}
foreach ($xml->channel->item as $item) {
    $title = (string)$item->title;
    $link  = (string)$item->link;
    $date  = strtotime((string)$item->pubDate);
    $desc  = (string)$item->description;
}
```
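The fetching step itself can be sketched with cURL and explicit timeouts, so one slow feed does not stall the whole run. This is a minimal sketch; the user agent string and timeout values are assumptions.

```php
// Fetch one feed with connect/total timeouts; returns null on failure.
function fetchFeed(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 5,   // seconds to establish the connection
        CURLOPT_TIMEOUT        => 15,  // total request timeout
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => 'NewsImporter/1.0', // assumed name
    ]);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return ($body !== false && $code === 200) ? $body : null;
}
```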
2. Processor. Cleans HTML tags from descriptions, downloads and saves images, normalizes dates, determines category by keywords or source.
3. Importer. Creates elements in the Bitrix infoblock via CIBlockElement::Add(). Checks for duplicates by XML_ID (usually article URL or GUID from feed).
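The importer step can be sketched as follows. The infoblock ID (5) and the property code SOURCE_URL are assumptions; substitute your own values.

```php
// Sketch: create a news element only if no element with the same XML_ID exists.
CModule::IncludeModule('iblock');

$xmlId = md5($link); // md5 of the article URL — stable even if the feed GUID changes

$exists = CIBlockElement::GetList(
    [],
    ['IBLOCK_ID' => 5, 'XML_ID' => $xmlId], // IBLOCK_ID is an assumption
    false,
    false,
    ['ID']
)->Fetch();

if (!$exists) {
    $el = new CIBlockElement();
    $el->Add([
        'IBLOCK_ID'       => 5,
        'XML_ID'          => $xmlId,
        'NAME'            => $title,
        'ACTIVE_FROM'     => ConvertTimeStamp($date, 'FULL'),
        'PREVIEW_TEXT'    => $desc,
        'PROPERTY_VALUES' => ['SOURCE_URL' => $link],
    ]);
}
```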
Storage in infoblock
News in Bitrix is stored in an infoblock with the standard news structure. Recommended mapping:
| RSS field | Infoblock field | Type |
|---|---|---|
| title | NAME | String |
| link | PROPERTY_SOURCE_URL | Link |
| description | PREVIEW_TEXT | HTML/text |
| content:encoded | DETAIL_TEXT | HTML |
| pubDate | ACTIVE_FROM | Date |
| guid / link | XML_ID | String (for deduplication) |
| category | IBLOCK_SECTION_ID | Section link |
| enclosure / media:content | PREVIEW_PICTURE | File |
XML_ID is a mandatory field: without it, the parser creates duplicates on every run. Use an md5 hash of the article URL as the XML_ID — this guarantees uniqueness even if the GUID in the feed changes.
Content processing
Raw HTML from RSS is unsuitable for publication. Typical problems:
- External images — image links point to the source site. If it becomes unavailable, the images disappear. Solution: download images to /upload/ during import.
- Third-party scripts and iframes — feeds may contain widgets, counters, and embedded videos. Use strip_tags() with a whitelist of allowed tags, or the HTMLPurifier library.
- Relative links — links like /article/123 without a domain. Convert them to absolute by prepending the source domain.
- Encoding — feeds may arrive in UTF-8, Windows-1251, or ISO-8859-1. Detect the encoding via mb_detect_encoding() and convert to UTF-8.
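The cleaning steps above can be sketched in one processor function. The allowed-tag list and the source-domain parameter are assumptions for illustration.

```php
// Sketch: normalize encoding, strip unwanted tags, absolutize relative links.
function cleanRssHtml(string $html, string $sourceDomain): string
{
    // 1. Normalize encoding to UTF-8 (strict detection over common candidates)
    $enc = mb_detect_encoding($html, ['UTF-8', 'Windows-1251', 'ISO-8859-1'], true);
    if ($enc !== false && $enc !== 'UTF-8') {
        $html = mb_convert_encoding($html, 'UTF-8', $enc);
    }

    // 2. Keep only whitelisted tags — drops scripts, iframes, widgets
    $html = strip_tags($html, '<p><a><b><i><ul><ol><li><br>');

    // 3. Turn relative links like /article/123 into absolute ones
    $html = preg_replace(
        '~href="/(?!/)~',
        'href="' . rtrim($sourceDomain, '/') . '/',
        $html
    );

    return $html;
}
```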
Scheduling and cron
The parser runs via cron. Frequency depends on the news type:
- Breaking news (news agencies) — every 15–30 minutes.
- Industry news — every 1–2 hours.
- Blogs and analytics — 1–2 times daily.
The cron task calls a PHP script that includes the Bitrix core:
```php
// Standard constants for non-interactive Bitrix scripts
define('NO_KEEP_STATISTIC', true);
define('NOT_CHECK_PERMISSIONS', true);

$_SERVER['DOCUMENT_ROOT'] = '/home/bitrix/www';
require $_SERVER['DOCUMENT_ROOT'] . '/bitrix/modules/main/include/prolog_before.php';
CModule::IncludeModule('iblock');
```
An alternative is a Bitrix agent (b_agent), but for long operations cron is more reliable: agents have execution-time limits and can block each other.
Deduplication and quality control
In addition to the XML_ID check, the following filters are recommended:
- Date filter — don't import news older than N days. Otherwise, when connecting a new feed, catalog fills with outdated content.
- Minimum length — discard entries with description shorter than 100 characters. This removes technical entries and announcements without content.
- Stop-words — filter news by keywords irrelevant to site topic.
- Source limit — no more than N news per day from one feed, so one active source doesn't push out the rest.
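The first three filters above can be sketched as a single pre-import check. The thresholds (7 days, 100 characters) and the stop-word list are assumptions.

```php
// Sketch: reject an item that is too old, too short, or off-topic.
function passesQualityChecks(array $item, array $stopWords, int $maxAgeDays = 7): bool
{
    // Date filter: skip news older than N days
    if ($item['date'] < strtotime("-{$maxAgeDays} days")) {
        return false;
    }

    // Minimum length: skip announcements without real content
    if (mb_strlen(trim(strip_tags($item['desc']))) < 100) {
        return false;
    }

    // Stop-words: skip items irrelevant to the site topic
    $haystack = mb_strtolower($item['title'] . ' ' . $item['desc']);
    foreach ($stopWords as $word) {
        if (mb_strpos($haystack, mb_strtolower($word)) !== false) {
            return false;
        }
    }

    return true;
}
```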
Automatic categorization
The simplest approach is a "source → infoblock section" mapping: all TechCrunch news goes to the "Technology" section, all RBK news to "Economics".
A more flexible approach is classification by keywords in the title and text, with a rules array like:
```php
$rules = [
    'Technology' => ['AI', 'blockchain', 'startup', 'app'],
    'Finance'    => ['stocks', 'rate', 'investment', 'IPO'],
];
```
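Applying such a rules array can look like this sketch; the fallback section name is an assumption.

```php
// Sketch: return the first category whose keyword occurs in the title or text.
function detectCategory(string $title, string $text, array $rules): string
{
    $haystack = mb_strtolower($title . ' ' . $text);
    foreach ($rules as $category => $keywords) {
        foreach ($keywords as $keyword) {
            if (mb_strpos($haystack, mb_strtolower($keyword)) !== false) {
                return $category;
            }
        }
    }
    return 'General'; // default section when nothing matches (assumed)
}
```

Case-insensitive substring matching keeps the rules simple, at the cost of false positives on short keywords like "app".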
For 10+ categories or serious accuracy requirements, connect an external classifier (OpenAI API, YandexGPT) or a trained model.
Legal aspects
Publishing others' news "as is" violates copyright. Acceptable options:
- Publish title + first 2–3 sentences with source link (fair dealing citation).
- Automatic rewrite via LLM (GPT, YandexGPT) — legally questionable, but used in practice.
- Use feeds with open license (Creative Commons, government sources).