Parsing news feeds for automatic filling of 1C-Bitrix


A news section on a 1C-Bitrix site that is updated once a month is worse than no news section at all: search engines see an abandoned resource, and users lose trust. Auto-filling via news feed parsing solves the problem of regular content updates, but it requires a careful implementation; otherwise you get duplicates, broken layout, and content-uniqueness problems.

Data sources

News feeds are available in several formats:

  • RSS/Atom feeds — standardized XML with title, description, link, date. Supported by most news outlets and blogs. Most reliable source.
  • News aggregator APIs — NewsAPI, Mediastack, Currents API. Structured JSON; paid plans are required for commercial use.
  • HTML pages — parsing source sites directly. Unreliable: layout changes, bot protection, legal risks.

For auto-filling Bitrix sites, RSS feeds are the optimal balance of reliability and simplicity. Start with them.

Architecture of RSS parser

A news parser for Bitrix consists of three layers:

1. Fetcher. Retrieves RSS feeds from a list of URLs, using file_get_contents() with a stream context or cURL with timeouts. Each feed is parsed via SimpleXMLElement or the SimplePie library.

libxml_use_internal_errors(true); // collect XML errors instead of emitting warnings
$xml = simplexml_load_string($rssContent);
if ($xml === false) {
    // malformed feed: log and skip it instead of crashing the whole run
    return;
}
foreach ($xml->channel->item as $item) {
    $title = (string)$item->title;
    $link  = (string)$item->link;
    $date  = strtotime((string)$item->pubDate);
    $desc  = (string)$item->description;
}
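The fetch step can be sketched with cURL as described above; the function name `fetchFeed()`, the timeout values, and the User-Agent string are illustrative assumptions, not part of any Bitrix API:

```php
<?php
// Hypothetical helper: fetch one feed URL with strict timeouts.
// Returns the response body, or false on any network/HTTP error.
function fetchFeed(string $url): string|false
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects (feed URLs often move)
        CURLOPT_CONNECTTIMEOUT => 5,     // seconds allowed to establish a connection
        CURLOPT_TIMEOUT        => 15,    // total request time limit
        CURLOPT_USERAGENT      => 'MyBitrixParser/1.0',
    ]);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return ($body !== false && $code === 200) ? $body : false;
}
```

Returning false on any failure lets the cron run skip a dead feed and continue with the rest of the list.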

2. Processor. Cleans HTML tags from descriptions, downloads and saves images, normalizes dates, determines category by keywords or source.

3. Importer. Creates elements in the Bitrix infoblock via CIBlockElement::Add(). Checks for duplicates by XML_ID (usually article URL or GUID from feed).
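A sketch of the importer step, assuming the iblock module is already included and $item holds the fields parsed from the feed; IBLOCK_ID = 5 and the SOURCE_URL property code are illustrative assumptions:

```php
// Deduplication by XML_ID: skip the element if it was already imported.
$xmlId  = md5($item['link']);
$exists = CIBlockElement::GetList(
    [], ['IBLOCK_ID' => 5, 'XML_ID' => $xmlId], false, false, ['ID']
)->Fetch();

if (!$exists) {
    $el = new CIBlockElement();
    $el->Add([
        'IBLOCK_ID'       => 5,
        'XML_ID'          => $xmlId,
        'NAME'            => $item['title'],
        'ACTIVE_FROM'     => ConvertTimeStamp($item['date'], 'FULL'),
        'PREVIEW_TEXT'    => $item['desc'],
        'PROPERTY_VALUES' => ['SOURCE_URL' => $item['link']],
    ]);
}
```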

Storage in infoblock

News in Bitrix is stored in an infoblock with a standard structure. Recommended field mapping:

RSS field                  Infoblock field      Type
title                      NAME                 String
link                       PROPERTY_SOURCE_URL  Link
description                PREVIEW_TEXT         HTML/text
content:encoded            DETAIL_TEXT          HTML
pubDate                    ACTIVE_FROM          Date
guid / link                XML_ID               String (for deduplication)
category                   IBLOCK_SECTION_ID    Section link
enclosure / media:content  PREVIEW_PICTURE      File

XML_ID is a mandatory field: without it, the parser creates duplicates on every run. Use an md5 hash of the article URL as the XML_ID; this guarantees uniqueness even if the GUID in the feed changes.

Content processing

Raw HTML from RSS is unsuitable for publication. Typical problems:

  • External images — image links point to source site. If it becomes unavailable, images disappear. Solution: download images to /upload/ during import.
  • Third-party scripts and iframes — feeds may contain widgets, counters, embedded videos. Use strip_tags() with whitelist of allowed tags or the HTMLPurifier library.
  • Relative links — links like /article/123 without domain. Convert to absolute by adding source domain.
  • Encoding — feeds may arrive in UTF-8, Windows-1251, ISO-8859-1. Detect encoding via mb_detect_encoding() and convert to UTF-8.
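The cleanup rules above can be combined into a single pass. A minimal sketch; the function name and the tag whitelist are assumptions (note that strip_tags() removes tags but keeps their inner text, so for full sanitizing of script bodies HTMLPurifier is the safer choice):

```php
<?php
// Hypothetical cleanup pass for one RSS description.
// $baseUrl is the source site's origin, e.g. 'https://example.com'.
function cleanDescription(string $html, string $baseUrl): string
{
    // 1. Normalize encoding to UTF-8 (feeds also arrive in Windows-1251 etc.).
    $enc = mb_detect_encoding($html, ['UTF-8', 'Windows-1251', 'ISO-8859-1'], true);
    if ($enc !== false && $enc !== 'UTF-8') {
        $html = mb_convert_encoding($html, 'UTF-8', $enc);
    }

    // 2. Strip all tags except a whitelist (drops <script>, <iframe>, counters).
    $html = strip_tags($html, '<p><a><img><b><i><ul><ol><li><br>');

    // 3. Convert root-relative href/src values to absolute URLs.
    $html = preg_replace('~(href|src)="(/[^"/][^"]*)"~', '$1="' . $baseUrl . '$2"', $html);

    return $html;
}
```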

Scheduling and cron

The parser runs via cron. The frequency depends on the type of news:

  • Breaking news (news agencies) — every 15–30 minutes.
  • Industry news — every 1–2 hours.
  • Blogs and analytics — 1–2 times daily.

The cron task calls a PHP script that includes the Bitrix core:

define('NO_KEEP_STATISTIC', true);      // skip statistics collection in CLI runs
define('NOT_CHECK_PERMISSIONS', true);  // no authorized user in a cron context
$_SERVER['DOCUMENT_ROOT'] = '/home/bitrix/www';
require $_SERVER['DOCUMENT_ROOT'] . '/bitrix/modules/main/include/prolog_before.php';
CModule::IncludeModule('iblock');
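A matching crontab entry for a feed checked every 30 minutes might look like this; the PHP binary path and the script location are assumptions:

```
*/30 * * * * /usr/bin/php -f /home/bitrix/www/local/tools/rss_import.php >/dev/null 2>&1
```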

An alternative is a Bitrix agent (the b_agent table), but for long-running operations cron is more reliable: agents are subject to execution time limits and can block one another.

Deduplication and quality control

In addition to the XML_ID check, the following filters are recommended:

  • Date filter — don't import news older than N days. Otherwise, when a new feed is connected, the catalog fills up with outdated content.
  • Minimum length — discard entries with description shorter than 100 characters. This removes technical entries and announcements without content.
  • Stop-words — filter news by keywords irrelevant to site topic.
  • Source limit — no more than N news per day from one feed, so one active source doesn't push out the rest.
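The first three filters can be expressed as a single predicate applied before import. A sketch under assumed defaults; the thresholds and the $stopWords list are illustrative, and the per-source daily cap is omitted because it needs a counter shared across the run:

```php
<?php
// Hypothetical pre-import filter: returns true if an item should be imported.
function shouldImport(array $item, array $stopWords, int $maxAgeDays = 7, int $minLength = 100): bool
{
    // Date filter: skip items older than N days.
    if ($item['date'] < strtotime("-{$maxAgeDays} days")) {
        return false;
    }
    // Minimum length: drop announcements without real content.
    if (mb_strlen(strip_tags($item['desc'])) < $minLength) {
        return false;
    }
    // Stop-words: skip items irrelevant to the site's topic.
    foreach ($stopWords as $word) {
        if (mb_stripos($item['title'] . ' ' . $item['desc'], $word) !== false) {
            return false;
        }
    }
    return true;
}
```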

Automatic categorization

The simplest approach is a "source → infoblock section" mapping: all TechCrunch news goes to the "Technology" section, all RBK news to "Economics".

A more flexible approach is classification by keywords in the title and text, driven by a rules array like:

$rules = [
    'Technology' => ['AI', 'blockchain', 'startup', 'app'],
    'Finance'    => ['stocks', 'rate', 'investment', 'IPO'],
];
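A minimal classifier over such a rules array might look like the sketch below; the fallback category name 'Other' is an assumption. Word-boundary matching is used deliberately, because naive substring search produces false positives (e.g. 'AI' inside 'raises'):

```php
<?php
// Return the first category whose keyword list matches the text, case-insensitively.
function classify(string $text, array $rules, string $fallback = 'Other'): string
{
    foreach ($rules as $category => $keywords) {
        foreach ($keywords as $kw) {
            // \b…\b keeps 'AI' from matching inside words like 'raises'
            if (preg_match('~\b' . preg_quote($kw, '~') . '\b~iu', $text)) {
                return $category;
            }
        }
    }
    return $fallback;
}
```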

For 10+ categories and serious accuracy requirements, connect an external classifier (OpenAI API, YandexGPT) or a trained model.

Legal aspects

Publishing others' news "as is" violates copyright. Acceptable options:

  • Publish title + first 2–3 sentences with source link (fair dealing citation).
  • Automatic rewrite via LLM (GPT, YandexGPT) — legally questionable, but used in practice.
  • Use feeds with open license (Creative Commons, government sources).
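The first option (title plus a short quoted lead with a source link) is easy to automate. A sketch; the function name and output layout are assumptions, not legal advice:

```php
<?php
// Hypothetical excerpt builder for the "title + first sentences + link" option.
function makeExcerpt(string $title, string $text, string $url, int $sentences = 2): string
{
    // Split plain text into sentences on ., ! or ? followed by whitespace.
    $parts = preg_split('~(?<=[.!?])\s+~u', trim(strip_tags($text)));
    $lead  = implode(' ', array_slice($parts, 0, $sentences));
    return $title . "\n\n" . $lead . "\n\n" . 'Source: ' . $url;
}
```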