Development of a PHP parser for 1C-Bitrix

PHP is a natural choice for a parser working alongside 1C-Bitrix: one language, one runtime, and direct access to the info block API with no intermediate layers. But a PHP parser has limitations that must be considered at design time: single-threading, memory limits, and the lack of a built-in event loop.

PHP parser architecture

A parser consists of four components:

1. Source configuration. An array or database table with parameters for each source: URL, type (RSS, HTML, API), CSS selectors for data extraction, field mapping, update frequency.

2. HTTP client. For simple tasks, \Bitrix\Main\Web\HttpClient from the D7 kernel, or native curl_multi for parallel requests. For complex scenarios, Guzzle with middleware for retries, logging, and proxy rotation.

3. HTML/XML parser. DOMDocument + DOMXPath for precise DOM navigation. For CSS selectors — Symfony\Component\DomCrawler library. For RSS — SimpleXMLElement.

4. Importer. A layer for writing data to Bitrix info blocks via D7 API or old API (CIBlockElement).
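The source configuration from item 1 can be as simple as a PHP array. A sketch of one possible shape (all keys and values below are illustrative, not a Bitrix convention):

```php
<?php
// Illustrative source registry: one entry per parsed site.
// Field names are arbitrary; the importer and scheduler read them.
$sources = [
    'partner_news' => [
        'url'       => 'https://example.com/news',
        'type'      => 'html',                     // html | rss | api
        'selectors' => [
            'list'  => '//div[@class="news-item"]', // one node per item
            'title' => './/h2/a',                   // relative to the list node
            'link'  => './/h2/a/@href',
        ],
        'iblock_id' => 12,                          // target info block
        'period'    => 3600,                        // update frequency, seconds
    ],
];
```

Keeping this in a database table instead of a file makes sense once editors, not developers, manage the sources.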

Basic implementation

A minimal PHP HTML page parser:

use Bitrix\Main\Loader;

Loader::includeModule('iblock'); // needed later, at the import stage

function parseSource(string $url, array $selectors): array
{
    $html = file_get_contents($url, false, stream_context_create([
        'http' => [
            'timeout' => 30,
            'user_agent' => 'Mozilla/5.0 (compatible; SiteBot/1.0)',
        ],
    ]));

    if ($html === false) {
        return []; // network failure; log the URL in production
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // broken HTML is the norm, not the exception
    // The XML prologue forces UTF-8; mb_convert_encoding with
    // 'HTML-ENTITIES' is deprecated since PHP 8.2
    $dom->loadHTML('<?xml encoding="utf-8"?>' . $html);
    $xpath = new DOMXPath($dom);

    $items = [];
    foreach ($xpath->query($selectors['list']) as $node) {
        // title/link selectors must be relative to the list node (start with ./)
        $title = $xpath->evaluate('string(' . $selectors['title'] . ')', $node);
        $link  = $xpath->evaluate('string(' . $selectors['link'] . ')', $node);
        $items[] = ['title' => trim($title), 'link' => $link];
    }
    return $items;
}

This skeleton works, but is not production-ready. Error handling, timeouts, logging, and deduplication are needed.
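For RSS sources the same skeleton is shorter, since SimpleXMLElement already understands the structure. A sketch that returns the same ['title', 'link'] shape (the function name is mine):

```php
<?php
// Parse an RSS 2.0 feed body into the same item shape as the HTML parser.
// $xml is the raw feed, e.g. the result of file_get_contents($url).
function parseRss(string $xml): array
{
    $feed = new SimpleXMLElement($xml);

    $items = [];
    foreach ($feed->channel->item as $item) {
        $items[] = [
            'title' => trim((string)$item->title),
            'link'  => (string)$item->link,
        ];
    }
    return $items;
}
```

Atom feeds use a different element layout (`entry`, `link href="..."`), so they need a separate branch in a real importer.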

Parallel requests via curl_multi

The main bottleneck of a PHP parser is sequential requests. Loading 1,000 pages at 2 seconds each = 33 minutes. With curl_multi_exec, you can process 10–20 requests in parallel:

$multiHandle = curl_multi_init();
$handles = [];

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($multiHandle, $ch);
    $handles[$i] = $ch;
}

do {
    $status = curl_multi_exec($multiHandle, $active);
    if ($active) {
        curl_multi_select($multiHandle); // wait for socket activity instead of busy-looping
    }
} while ($active && $status === CURLM_OK);

$results = [];
foreach ($handles as $i => $ch) {
    $results[$i] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($multiHandle, $ch);
    curl_close($ch);
}
curl_multi_close($multiHandle);

Limitation: above roughly 50 parallel connections, PHP memory consumption grows quickly. For large-scale parsing (10,000+ URLs), split the list into batches of 20–30 connections.
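Batching is a one-liner with array_chunk. A sketch of the outer loop (the batch size and pause are illustrative; the batch body is where the curl_multi code above would run):

```php
<?php
// Split a large URL list into batches to cap concurrent connections.
$urls = array_map(fn (int $i) => "https://example.com/page/{$i}", range(1, 100));

$batches = array_chunk($urls, 25); // 25 concurrent connections per batch

foreach ($batches as $batch) {
    // run the curl_multi loop from above on $batch here
    // ...
    // optional pause between batches to stay polite to the source:
    // usleep(500000);
}
```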

Integration with Bitrix kernel

The advantage of a PHP parser is direct API access. No REST, no intermediate database needed. Import to info block:

$element = new CIBlockElement();
$elementId = $element->Add([
    'IBLOCK_ID'        => IBLOCK_CATALOG,
    'NAME'             => $parsedData['title'],
    'XML_ID'           => $parsedData['external_id'],
    'ACTIVE'           => 'Y',
    'PREVIEW_TEXT'     => $parsedData['description'],
    'DETAIL_TEXT'      => $parsedData['content'],
    'DETAIL_TEXT_TYPE' => 'html',
    'PREVIEW_PICTURE'  => CFile::MakeFileArray($parsedData['image_path']),
]);

if ($elementId) {
    CIBlockElement::SetPropertyValuesEx($elementId, IBLOCK_CATALOG, [
        'SOURCE_URL' => $parsedData['url'],
        'ARTICLE'    => $parsedData['sku'],
    ]);
} else {
    // Add() returns false on failure; the reason is in LAST_ERROR
    AddMessage2Log($element->LAST_ERROR);
}

Important: when importing large volumes, skip per-element search reindexing. The third argument of CIBlockElement::Add() controls it, and the faceted index can be deferred for the whole run:

$elementId = $element->Add($fields, false, false, false); // 3rd argument $bUpdateSearch = false

\Bitrix\Iblock\PropertyIndex\Manager::enableDeferredIndexing(); // before the import loop

Without this, each Add() call triggers search reindexing, a faceted index update, and other handlers, and importing 10,000 products stretches into hours. Run the reindex once after the import completes instead.

Error handling and resilience

A production PHP parser must handle:

  • Timeouts — server doesn't respond, connection hangs. Set CURLOPT_TIMEOUT and CURLOPT_CONNECTTIMEOUT.
  • HTTP errors — 403, 429, 503. For 429 (rate limit) — increase delay. For 403 — change proxy. For 503 — retry later.
  • Malformed HTML — DOMDocument::loadHTML generates warnings. Suppress them via @ or libxml_use_internal_errors(true), but log the problematic URLs.
  • Memory exhaustion — large HTML pages (5+ MB) consume memory. Set memory_limit appropriately and free DOM after processing: unset($dom).
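The libxml_use_internal_errors(true) route from the list above keeps warnings out of the output while still letting you log what went wrong. A minimal sketch (the function name is mine):

```php
<?php
// Parse broken HTML silently and collect the libxml errors for logging.
function parseHtmlCollectErrors(string $html): array
{
    $previous = libxml_use_internal_errors(true); // buffer errors instead of emitting warnings

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // One entry per problem: line number and trimmed message, ready for a log record
    $errors = array_map(
        fn (LibXMLError $e) => ['line' => $e->line, 'message' => trim($e->message)],
        libxml_get_errors()
    );

    libxml_clear_errors();                 // don't leak errors into the next parse
    libxml_use_internal_errors($previous); // restore the previous mode

    return ['dom' => $dom, 'errors' => $errors];
}
```

Restoring the previous error mode matters when the parser runs inside a Bitrix agent alongside other code that also touches libxml.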

Retry pattern with exponential backoff:

function fetchWithRetry(string $url, int $maxRetries = 3): ?string
{
    for ($i = 0; $i < $maxRetries; $i++) {
        $response = @file_get_contents($url);
        if ($response !== false) {
            return $response;
        }
        if ($i < $maxRetries - 1) {
            sleep(2 ** $i); // 1, then 2 seconds; no pointless sleep after the last attempt
        }
    }
    return null;
}
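The backoff above is blind. For 429 responses the server often says exactly how long to wait in a Retry-After header, which file_get_contents exposes via $http_response_header. A helper for the delta-seconds form (the function name is mine; the HTTP-date form of the header is not handled in this sketch):

```php
<?php
// Extract Retry-After (delta-seconds form) from an array of raw response
// headers, e.g. $http_response_header after a file_get_contents() call.
// Returns null when the header is absent or not a plain number.
function retryAfterSeconds(array $headers): ?int
{
    foreach ($headers as $header) {
        if (preg_match('/^Retry-After:\s*(\d+)\s*$/i', $header, $m)) {
            return (int)$m[1];
        }
    }
    return null;
}
```

Inside the retry loop, a non-null result from this helper would replace the computed backoff for that attempt.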

Logging

Without logs, debugging a parser is impossible. Minimal set of events to log:

  • Start and completion of parsing session (time, number of processed URLs).
  • Each HTTP request: URL, response status, load time.
  • Parsing errors: URL, error type, context.
  • Import result: created, updated, skipped (duplicates), errors.

Use \Bitrix\Main\Diag\FileLogger from the D7 kernel or write to a separate parser_log table.
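A file-based fallback takes a dozen lines. A sketch that appends one JSON record per event (the function name, file path, and record fields are illustrative, not a Bitrix convention):

```php
<?php
// Append one JSON line per parser event to a log file.
function logParserEvent(string $file, string $event, array $context = []): void
{
    $record = [
        'time'    => date('c'),
        'event'   => $event,    // e.g. 'fetch', 'parse_error', 'import'
        'context' => $context,  // URL, HTTP status, counters, ...
    ];
    // LOCK_EX keeps lines intact when a cron run and a manual run overlap
    file_put_contents(
        $file,
        json_encode($record, JSON_UNESCAPED_SLASHES) . "\n",
        FILE_APPEND | LOCK_EX
    );
}
```

One JSON object per line keeps the log greppable and trivially parseable for the session summaries listed above.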

When PHP is insufficient

A PHP parser is not suitable if:

  • JavaScript rendering is needed — SPA sites, dynamic content loading. You need a headless browser (Puppeteer/Playwright), which requires Node.js or Python.
  • Parsing volume exceeds 50,000 pages per session — PHP hits single-threading and memory consumption limits.
  • Complex text processing is required (NLP, classification, entity extraction) — the Python ecosystem is significantly richer.

In these cases, consider a hybrid approach: Python/Node.js for data collection, PHP for import into Bitrix.