Development of a PHP parser for 1C-Bitrix
PHP is a natural choice for a parser working together with 1C-Bitrix. One language, one runtime, direct access to info block APIs without intermediate layers. But a PHP parser has limitations that must be considered during design: single-threading, memory limits, lack of a built-in event loop.
PHP parser architecture
A parser consists of four components:
1. Source configuration. An array or database table with parameters for each source: URL, type (RSS, HTML, API), CSS selectors for data extraction, field mapping, update frequency.
2. HTTP client. For simple tasks — the kernel's \Bitrix\Main\Web\HttpClient or native curl_multi for parallel requests. For complex ones — Guzzle with middleware for retry, logging, proxy rotation.
3. HTML/XML parser. DOMDocument + DOMXPath for precise DOM navigation. For CSS selectors — Symfony\Component\DomCrawler library. For RSS — SimpleXMLElement.
4. Importer. A layer for writing data to Bitrix info blocks via D7 API or old API (CIBlockElement).
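Component 1 can be sketched as a plain PHP array. Every key and value below is illustrative, not a Bitrix convention; the selectors match the XPath-based parser shown later:

```php
// Hypothetical source registry: field names are assumptions for illustration.
$sources = [
    'news_rss' => [
        'url'       => 'https://example.com/feed.xml',
        'type'      => 'rss',            // rss | html | api
        'iblock_id' => 12,               // target info block
        'schedule'  => 3600,             // seconds between runs
    ],
    'partner_catalog' => [
        'url'       => 'https://example.com/catalog',
        'type'      => 'html',
        'selectors' => [                 // XPath expressions for extraction
            'list'  => '//div[@class="product"]',
            'title' => './/h2',
            'link'  => './/a/@href',
        ],
        'iblock_id' => 15,
        'schedule'  => 86400,
    ],
];
```

Keeping this in a database table instead of an array lets editors add sources without a deployment.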
Basic implementation
A minimal PHP HTML page parser:
use Bitrix\Main\Loader;
use Bitrix\Iblock\ElementTable;

Loader::includeModule('iblock');

function parseSource(string $url, array $selectors): array
{
    $html = file_get_contents($url, false, stream_context_create([
        'http' => [
            'timeout'    => 30,
            'user_agent' => 'Mozilla/5.0 (compatible; SiteBot/1.0)',
        ],
    ]));

    $dom = new DOMDocument();
    // Note: the 'HTML-ENTITIES' target is deprecated since PHP 8.2;
    // on modern PHP, prepend '<?xml encoding="UTF-8">' to $html instead.
    @$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
    $xpath = new DOMXPath($dom);

    $items = [];
    foreach ($xpath->query($selectors['list']) as $node) {
        $title = $xpath->evaluate('string(' . $selectors['title'] . ')', $node);
        $link  = $xpath->evaluate('string(' . $selectors['link'] . ')', $node);
        $items[] = ['title' => trim($title), 'link' => $link];
    }

    return $items;
}
This skeleton works, but is not production-ready. Error handling, timeouts, logging, and deduplication are needed.
Parallel requests via curl_multi
The main bottleneck of a PHP parser is sequential requests. Loading 1,000 pages at 2 seconds each = 33 minutes. With curl_multi_exec, you can process 10–20 requests in parallel:
$multiHandle = curl_multi_init();
$handles = [];

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($multiHandle, $ch);
    $handles[$i] = $ch;
}

// Drive all transfers; curl_multi_select() sleeps until there is activity.
do {
    $status = curl_multi_exec($multiHandle, $active);
    if ($active) {
        curl_multi_select($multiHandle);
    }
} while ($active && $status === CURLM_OK);

// Collect response bodies and release the handles.
$results = [];
foreach ($handles as $i => $ch) {
    $results[$i] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($multiHandle, $ch);
    curl_close($ch);
}
curl_multi_close($multiHandle);
Limitation: beyond roughly 50 parallel connections, PHP memory consumption grows quickly. For large-scale parsing (10,000+ URLs), split the work into batches of 20–30 connections.
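The batching approach can be sketched with array_chunk. The $fetchBatch callable is assumed to wrap the curl_multi loop above and return [url => body] pairs:

```php
/**
 * Split a URL list into batches and process them sequentially,
 * so at most $batchSize connections are open at any moment.
 * $fetchBatch is assumed to wrap the curl_multi loop shown above.
 */
function processInBatches(array $urls, callable $fetchBatch, int $batchSize = 25): array
{
    $results = [];
    foreach (array_chunk($urls, $batchSize) as $batch) {
        // Memory stays bounded even for 10,000+ URLs: only one
        // batch of handles and bodies is alive at a time.
        $results += $fetchBatch($batch);
        usleep(200_000); // small pause between batches to be polite to the source
    }
    return $results;
}
```

The pause between batches doubles as primitive rate limiting; tune it per source.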
Integration with Bitrix kernel
The advantage of a PHP parser is direct API access. No REST, no intermediate database needed. Import to info block:
$element = new CIBlockElement();
$elementId = $element->Add([
    'IBLOCK_ID'        => IBLOCK_CATALOG,
    'NAME'             => $parsedData['title'],
    'XML_ID'           => $parsedData['external_id'],
    'ACTIVE'           => 'Y',
    'PREVIEW_TEXT'     => $parsedData['description'],
    'DETAIL_TEXT'      => $parsedData['content'],
    'DETAIL_TEXT_TYPE' => 'html',
    'PREVIEW_PICTURE'  => CFile::MakeFileArray($parsedData['image_path']),
]);

if ($elementId) {
    CIBlockElement::SetPropertyValuesEx($elementId, IBLOCK_CATALOG, [
        'SOURCE_URL' => $parsedData['url'],
        'ARTICLE'    => $parsedData['sku'],
    ]);
}
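Deduplication fits naturally here: key elements on XML_ID, look an element up before adding, and update it instead of creating a duplicate. A sketch against the old API; the field mapping carries over from the example above:

```php
// Sketch: import with deduplication by XML_ID (external id from the source).
function importElement(array $parsedData, int $iblockId): ?int
{
    $fields = [
        'IBLOCK_ID'    => $iblockId,
        'NAME'         => $parsedData['title'],
        'XML_ID'       => $parsedData['external_id'],
        'ACTIVE'       => 'Y',
        'PREVIEW_TEXT' => $parsedData['description'],
    ];

    // Look for an existing element with the same external id.
    $existing = CIBlockElement::GetList(
        [],
        ['IBLOCK_ID' => $iblockId, 'XML_ID' => $parsedData['external_id']],
        false,
        false,
        ['ID']
    )->Fetch();

    $element = new CIBlockElement();
    if ($existing) {
        // Update in place instead of creating a duplicate.
        return $element->Update($existing['ID'], $fields) ? (int)$existing['ID'] : null;
    }

    $id = $element->Add($fields);
    return $id ? (int)$id : null;
}
```

Counting the add/update/skip branches here also gives the import statistics worth logging.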
Important: when importing large amounts, disable search and URL updates:
CIBlockElement::DisableEvents(); // Disables event handlers
Without this, each Add() call triggers search reindexing, faceted index update, and other handlers — importing 10,000 products will take hours.
Error handling and resilience
A production PHP parser must handle:
- Timeouts — server doesn't respond, connection hangs. Set CURLOPT_TIMEOUT and CURLOPT_CONNECTTIMEOUT.
- HTTP errors — 403, 429, 503. For 429 (rate limit) — increase delay. For 403 — change proxy. For 503 — retry later.
- Malformed HTML — DOMDocument::loadHTML generates warnings. Suppress via @ or libxml_use_internal_errors(true), but log problematic URLs.
- Memory exhaustion — large HTML pages (5+ MB) consume memory. Set memory_limit appropriately and free the DOM after processing: unset($dom).
Retry pattern with exponential backoff:
function fetchWithRetry(string $url, int $maxRetries = 3): ?string
{
    for ($i = 0; $i < $maxRetries; $i++) {
        $response = @file_get_contents($url);
        if ($response !== false) {
            return $response;
        }
        sleep(pow(2, $i)); // 1, 2, 4 seconds
    }
    return null;
}
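file_get_contents() hides the HTTP status, so the per-status strategy from the list above (429 — wait, 403 — give up and rotate the proxy, 503 — retry) is easier to implement with cURL. A simplified sketch:

```php
// Sketch: status-aware retry. 403 aborts immediately, 429/503 and
// transport errors back off exponentially before the next attempt.
function fetchWithStatusRetry(string $url, int $maxRetries = 3): ?string
{
    for ($i = 0; $i < $maxRetries; $i++) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
        $body   = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
        curl_close($ch);

        if ($body !== false && $status === 200) {
            return $body;
        }
        if ($status === 403) {
            // Blocked: retrying from the same IP is pointless.
            return null;
        }
        sleep(2 ** $i); // 1, 2, 4 seconds
    }
    return null;
}
```

A production version would also honor the Retry-After header on 429 responses instead of a fixed backoff.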
Logging
Without logs, debugging a parser is impossible. Minimal set of events to log:
- Start and completion of parsing session (time, number of processed URLs).
- Each HTTP request: URL, response status, load time.
- Parsing errors: URL, error type, context.
- Import result: created, updated, skipped (duplicates), errors.
Use the D7 logger (\Bitrix\Main\Diag\Logger, e.g. its FileLogger implementation) or write to a separate parser_log table.
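When a dependency-free option is preferable, the minimal event set above fits in a few lines: one JSON object per line keeps the log both grep-able and machine-parseable. The field names are illustrative, not a Bitrix convention:

```php
// Sketch: append-only JSON-lines parser log. Event names and
// context fields are assumptions for illustration.
function parserLog(string $file, string $event, array $context = []): void
{
    $record = json_encode([
        'ts'    => date('c'),  // ISO 8601 timestamp
        'event' => $event,     // session_start, http_request, parse_error, import_result...
    ] + $context, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
    file_put_contents($file, $record . PHP_EOL, FILE_APPEND | LOCK_EX);
}
```

Usage: parserLog('/var/log/parser.log', 'http_request', ['url' => $url, 'status' => 200, 'ms' => 512]);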
When PHP is insufficient
A PHP parser is not suitable if:
- JavaScript rendering is needed — SPA sites, dynamic content loading. You need a headless browser (Puppeteer/Playwright), which requires Node.js or Python.
- Parsing volume exceeds 50,000 pages per session — PHP hits single-threading and memory consumption limits.
- Complex text processing is required (NLP, classification, entity extraction) — the Python ecosystem is significantly richer.
In these cases, consider a hybrid approach: Python/Node.js for data collection, PHP for import into Bitrix.
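In the hybrid scheme, the handoff point can be as simple as a JSON file (or a queue): the collector writes records, and the PHP side validates them before feeding the importer. A sketch; the record schema here is an assumption:

```php
/**
 * Sketch: read records produced by an external collector (Python/Node.js)
 * and keep only the ones the importer can key on. The title/external_id
 * schema is illustrative.
 */
function readCollectedItems(string $jsonFile): array
{
    $raw = @file_get_contents($jsonFile);
    if ($raw === false) {
        return [];
    }
    $items = json_decode($raw, true);
    if (!is_array($items)) {
        return [];
    }
    // Drop records without a title or a deduplication key.
    return array_values(array_filter($items, fn($item) =>
        is_array($item) && !empty($item['title']) && !empty($item['external_id'])
    ));
}
```

Each valid record then goes through the same info block importer as in the pure-PHP variant.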