Development of a PHP parser for 1C-Bitrix
PHP is a natural choice for a parser working together with 1C-Bitrix. One language, one runtime, direct access to info block APIs without intermediate layers. But a PHP parser has limitations that must be considered during design: single-threading, memory limits, lack of a built-in event loop.
PHP parser architecture
A parser consists of four components:
1. Source configuration. An array or database table with parameters for each source: URL, type (RSS, HTML, API), CSS selectors for data extraction, field mapping, update frequency.
2. HTTP client. For simple tasks — the kernel's \Bitrix\Main\Web\HttpClient or native curl_multi for parallel requests. For complex ones — Guzzle with middleware for retry, logging, proxy rotation.
3. HTML/XML parser. DOMDocument + DOMXPath for precise DOM navigation. For CSS selectors — Symfony\Component\DomCrawler library. For RSS — SimpleXMLElement.
4. Importer. A layer for writing data to Bitrix info blocks via D7 API or old API (CIBlockElement).
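Component 1 can be sketched as a plain PHP array. Every key and value below is illustrative, not a Bitrix convention; the selectors match the XPath-based parser shown later:

```php
// Hypothetical source registry: field names are assumptions for illustration.
$sources = [
    'news_rss' => [
        'url'       => 'https://example.com/feed.xml',
        'type'      => 'rss',            // rss | html | api
        'iblock_id' => 12,               // target info block
        'schedule'  => 3600,             // seconds between runs
    ],
    'partner_catalog' => [
        'url'       => 'https://example.com/catalog',
        'type'      => 'html',
        'selectors' => [                 // XPath expressions for extraction
            'list'  => '//div[@class="product"]',
            'title' => './/h2',
            'link'  => './/a/@href',
        ],
        'iblock_id' => 15,
        'schedule'  => 86400,
    ],
];
```

Keeping this in a database table instead of an array lets editors add sources without a deployment.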
Basic implementation
A minimal PHP HTML page parser:
use Bitrix\Main\Loader;
use Bitrix\Iblock\ElementTable;

Loader::includeModule('iblock');

function parseSource(string $url, array $selectors): array
{
    $html = file_get_contents($url, false, stream_context_create([
        'http' => [
            'timeout'    => 30,
            'user_agent' => 'Mozilla/5.0 (compatible; SiteBot/1.0)',
        ],
    ]));

    $dom = new DOMDocument();
    // Note: the 'HTML-ENTITIES' target is deprecated since PHP 8.2;
    // on modern PHP, prepend '<?xml encoding="UTF-8">' to $html instead.
    @$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
    $xpath = new DOMXPath($dom);

    $items = [];
    foreach ($xpath->query($selectors['list']) as $node) {
        $title = $xpath->evaluate('string(' . $selectors['title'] . ')', $node);
        $link  = $xpath->evaluate('string(' . $selectors['link'] . ')', $node);
        $items[] = ['title' => trim($title), 'link' => $link];
    }

    return $items;
}
This skeleton works, but is not production-ready. Error handling, timeouts, logging, and deduplication are needed.
Parallel requests via curl_multi
The main bottleneck of a PHP parser is sequential requests. Loading 1,000 pages at 2 seconds each = 33 minutes. With curl_multi_exec, you can process 10–20 requests in parallel:
$multiHandle = curl_multi_init();
$handles = [];

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($multiHandle, $ch);
    $handles[$i] = $ch;
}

// Drive all transfers; curl_multi_select() sleeps until there is activity.
do {
    $status = curl_multi_exec($multiHandle, $active);
    if ($active) {
        curl_multi_select($multiHandle);
    }
} while ($active && $status === CURLM_OK);

// Collect response bodies and release the handles.
$results = [];
foreach ($handles as $i => $ch) {
    $results[$i] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($multiHandle, $ch);
    curl_close($ch);
}
curl_multi_close($multiHandle);
Limitation: beyond roughly 50 parallel connections, PHP memory consumption grows quickly. For large-scale parsing (10,000+ URLs), split the work into batches of 20–30 connections.
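The batching approach can be sketched with array_chunk. The $fetchBatch callable is assumed to wrap the curl_multi loop above and return [url => body] pairs:

```php
/**
 * Split a URL list into batches and process them sequentially,
 * so at most $batchSize connections are open at any moment.
 * $fetchBatch is assumed to wrap the curl_multi loop shown above.
 */
function processInBatches(array $urls, callable $fetchBatch, int $batchSize = 25): array
{
    $results = [];
    foreach (array_chunk($urls, $batchSize) as $batch) {
        // Memory stays bounded even for 10,000+ URLs: only one
        // batch of handles and bodies is alive at a time.
        $results += $fetchBatch($batch);
        usleep(200_000); // small pause between batches to be polite to the source
    }
    return $results;
}
```

The pause between batches doubles as primitive rate limiting; tune it per source.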
Integration with Bitrix kernel
The advantage of a PHP parser is direct API access. No REST, no intermediate database needed. Import to info block:
$element = new CIBlockElement();
$elementId = $element->Add([
    'IBLOCK_ID'        => IBLOCK_CATALOG,
    'NAME'             => $parsedData['title'],
    'XML_ID'           => $parsedData['external_id'],
    'ACTIVE'           => 'Y',
    'PREVIEW_TEXT'     => $parsedData['description'],
    'DETAIL_TEXT'      => $parsedData['content'],
    'DETAIL_TEXT_TYPE' => 'html',
    'PREVIEW_PICTURE'  => CFile::MakeFileArray($parsedData['image_path']),
]);

if ($elementId) {
    CIBlockElement::SetPropertyValuesEx($elementId, IBLOCK_CATALOG, [
        'SOURCE_URL' => $parsedData['url'],
        'ARTICLE'    => $parsedData['sku'],
    ]);
}
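Deduplication fits naturally here: key elements on XML_ID, look an element up before adding, and update it instead of creating a duplicate. A sketch against the old API; the field mapping carries over from the example above:

```php
// Sketch: import with deduplication by XML_ID (external id from the source).
function importElement(array $parsedData, int $iblockId): ?int
{
    $fields = [
        'IBLOCK_ID'    => $iblockId,
        'NAME'         => $parsedData['title'],
        'XML_ID'       => $parsedData['external_id'],
        'ACTIVE'       => 'Y',
        'PREVIEW_TEXT' => $parsedData['description'],
    ];

    // Look for an existing element with the same external id.
    $existing = CIBlockElement::GetList(
        [],
        ['IBLOCK_ID' => $iblockId, 'XML_ID' => $parsedData['external_id']],
        false,
        false,
        ['ID']
    )->Fetch();

    $element = new CIBlockElement();
    if ($existing) {
        // Update in place instead of creating a duplicate.
        return $element->Update($existing['ID'], $fields) ? (int)$existing['ID'] : null;
    }

    $id = $element->Add($fields);
    return $id ? (int)$id : null;
}
```

Counting the add/update/skip branches here also gives the import statistics worth logging.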
Important: when importing large amounts, disable search and URL updates:
CIBlockElement::DisableEvents(); // Disables event handlers
Without this, each Add() call triggers search reindexing, faceted index update, and other handlers — importing 10,000 products will take hours.
Error handling and resilience
A production PHP parser must handle:
- Timeouts — server doesn't respond, connection hangs. Set CURLOPT_TIMEOUT and CURLOPT_CONNECTTIMEOUT.
- HTTP errors — 403, 429, 503. For 429 (rate limit) — increase delay. For 403 — change proxy. For 503 — retry later.
- Malformed HTML — DOMDocument::loadHTML generates warnings. Suppress via @ or libxml_use_internal_errors(true), but log problematic URLs.
- Memory exhaustion — large HTML pages (5+ MB) consume memory. Set memory_limit appropriately and free the DOM after processing: unset($dom).
Retry pattern with exponential backoff:
function fetchWithRetry(string $url, int $maxRetries = 3): ?string
{
    for ($i = 0; $i < $maxRetries; $i++) {
        $response = @file_get_contents($url);
        if ($response !== false) {
            return $response;
        }
        sleep(pow(2, $i)); // 1, 2, 4 seconds
    }
    return null;
}
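file_get_contents() hides the HTTP status, so the per-status strategy from the list above (429 — wait, 403 — give up and rotate the proxy, 503 — retry) is easier to implement with cURL. A simplified sketch:

```php
// Sketch: status-aware retry. 403 aborts immediately, 429/503 and
// transport errors back off exponentially before the next attempt.
function fetchWithStatusRetry(string $url, int $maxRetries = 3): ?string
{
    for ($i = 0; $i < $maxRetries; $i++) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
        $body   = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
        curl_close($ch);

        if ($body !== false && $status === 200) {
            return $body;
        }
        if ($status === 403) {
            // Blocked: retrying from the same IP is pointless.
            return null;
        }
        sleep(2 ** $i); // 1, 2, 4 seconds
    }
    return null;
}
```

A production version would also honor the Retry-After header on 429 responses instead of a fixed backoff.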
Logging
Without logs, debugging a parser is impossible. Minimal set of events to log:
- Start and completion of parsing session (time, number of processed URLs).
- Each HTTP request: URL, response status, load time.
- Parsing errors: URL, error type, context.
- Import result: created, updated, skipped (duplicates), errors.
Use the D7 logger (\Bitrix\Main\Diag\Logger, e.g. its FileLogger implementation) or write to a separate parser_log table.
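When a dependency-free option is preferable, the minimal event set above fits in a few lines: one JSON object per line keeps the log both grep-able and machine-parseable. The field names are illustrative, not a Bitrix convention:

```php
// Sketch: append-only JSON-lines parser log. Event names and
// context fields are assumptions for illustration.
function parserLog(string $file, string $event, array $context = []): void
{
    $record = json_encode([
        'ts'    => date('c'),  // ISO 8601 timestamp
        'event' => $event,     // session_start, http_request, parse_error, import_result...
    ] + $context, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
    file_put_contents($file, $record . PHP_EOL, FILE_APPEND | LOCK_EX);
}
```

Usage: parserLog('/var/log/parser.log', 'http_request', ['url' => $url, 'status' => 200, 'ms' => 512]);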
When PHP is insufficient
A PHP parser is not suitable if:
- JavaScript rendering is needed — SPA sites, dynamic content loading. You need a headless browser (Puppeteer/Playwright), which requires Node.js or Python.
- Parsing volume exceeds 50,000 pages per session — PHP hits single-threading and memory consumption limits.
- Complex text processing is required (NLP, classification, entity extraction) — the Python ecosystem is significantly richer.
In these cases, consider a hybrid approach: Python/Node.js for data collection, PHP for import into Bitrix.
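In the hybrid scheme, the handoff point can be as simple as a JSON file (or a queue): the collector writes records, and the PHP side validates them before feeding the importer. A sketch; the record schema here is an assumption:

```php
/**
 * Sketch: read records produced by an external collector (Python/Node.js)
 * and keep only the ones the importer can key on. The title/external_id
 * schema is illustrative.
 */
function readCollectedItems(string $jsonFile): array
{
    $raw = @file_get_contents($jsonFile);
    if ($raw === false) {
        return [];
    }
    $items = json_decode($raw, true);
    if (!is_array($items)) {
        return [];
    }
    // Drop records without a title or a deduplication key.
    return array_values(array_filter($items, fn($item) =>
        is_array($item) && !empty($item['title']) && !empty($item['external_id'])
    ));
}
```

Each valid record then goes through the same info block importer as in the pure-PHP variant.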