Parsers and Auto-Population for 1C-Bitrix
XMLReader, not SimpleXML — that's where working with an 800 MB supplier catalog begins. SimpleXML pulls the entire file into memory, and PHP crashes with a fatal error at the 512 MB limit. XMLReader reads as a stream, node by node, consuming 20-30 MB regardless of file size. Every auto-population project we do on Bitrix starts with this detail.
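When the collector is written in Python rather than PHP, the same streaming idea is available through the standard library's `xml.etree.ElementTree.iterparse`. The sketch below is only an illustration of node-by-node reading: the `offer` element and its `name`/`price` children are assumptions modeled on a YML-style feed, not a real supplier schema.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def iter_offers(source: BinaryIO, tag: str = "offer") -> Iterator[dict]:
    """Yield one dict per <offer> element without building the whole tree."""
    path = []  # stack of currently open elements
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem)
            continue
        path.pop()
        if elem.tag == tag:
            yield {
                "id": elem.get("id"),
                "name": elem.findtext("name"),
                "price": elem.findtext("price"),
            }
            if path:
                # Detach the processed node from its parent so it can be freed
                path[-1].remove(elem)


# Hypothetical miniature feed, used only to show the call
SAMPLE = b"""<yml_catalog><shop><offers>
<offer id="1"><name>A</name><price>10</price></offer>
<offer id="2"><name>B</name><price>20</price></offer>
</offers></shop></yml_catalog>"""

offers = list(iter_offers(io.BytesIO(SAMPLE)))
```

For a real file, the `io.BytesIO` object would be replaced by a file opened in binary mode, and the generator would be consumed one item at a time instead of being collected into a list.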
What Parsing Actually Does
- Initial catalog population — 15,000 product cards with descriptions, specifications, photos. Manually, that's three months of a content manager's work; a parser — one week including debugging.
- Competitor price monitoring — data collection from Ozon, Wildberries, competitor websites. A competitor dropped the price on a popular item — you find out in two hours, not two weeks.
- Supplier aggregation — five price lists in different formats (CSV in CP1251, XML in CommerceML, Excel with merged cells) turn into a unified catalog with a common infoblock property system.
- Product card enrichment — pulling specifications, manuals, 3D models from manufacturer websites. Without this, a product card is an SEO dead end.
- Assortment updates — products missing from the supplier's feed are deactivated via CIBlockElement::Update($ID, ['ACTIVE' => 'N']). New ones are created. The catalog stays synchronized.
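The assortment update in the last item comes down to comparing two sets of identifiers. A minimal sketch, assuming products are matched by SKU (the matching key and the function name `plan_sync` are illustrative, not taken from a real project):

```python
from __future__ import annotations


def plan_sync(catalog_skus: set[str], feed_skus: set[str]) -> dict[str, set[str]]:
    """Split SKUs into what to create, what to deactivate, and what to update."""
    return {
        "create": feed_skus - catalog_skus,      # in the feed, not yet in the catalog
        "deactivate": catalog_skus - feed_skus,  # in the catalog, gone from the feed
        "update": catalog_skus & feed_skus,      # present on both sides
    }


plan = plan_sync({"A-1", "A-2", "A-3"}, {"A-2", "A-3", "A-4"})
```

On the Bitrix side, the `deactivate` set would be the list of elements that receive `ACTIVE => 'N'`.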
Sources and Tools
Static websites — PHP (Goutte, Symfony DomCrawler) or Python (Scrapy, lxml). Speed: 50-100 pages/sec. Sufficient for catalogs without JS rendering.
SPAs and dynamic websites — Puppeteer or Playwright. Infinite scroll, AJAX filters, lazy-loaded images — a headless browser handles all of it. Speed drops to 1-10 pages/sec, but there's no alternative: the data only exists after JavaScript execution.
Supplier files:
- Excel (XLS, XLSX) — PhpSpreadsheet. Be careful with merged cells and formulas — they break automatic mapping.
- CSV — fgetcsv() with proper encoding. Suppliers love CP1251, BOM in UTF-8, and semicolons instead of commas. All of this needs to be detected and handled.
- XML/YML — XMLReader for large files, SimpleXML for feeds up to 50 MB.
- CommerceML — the standard exchange format with 1C. We parse import.xml and offers.xml, mapping to the infoblock structure.
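The CSV quirks listed above (CP1251, a UTF-8 BOM, semicolons) can be handled before any parsing starts. The following is a rough sketch, not a complete detector: it assumes a file is either UTF-16 with a BOM, UTF-8, or CP1251, and it guesses the delimiter from the header line only. All function names are hypothetical.

```python
import csv
import io


def decode_supplier_bytes(raw: bytes) -> str:
    """Decode UTF-16 (with BOM) or UTF-8 (BOM optional); otherwise assume CP1251."""
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        return raw.decode("utf-16")
    try:
        return raw.decode("utf-8-sig")  # "utf-8-sig" strips the BOM if present
    except UnicodeDecodeError:
        return raw.decode("cp1251", errors="replace")


def guess_delimiter(header_line: str) -> str:
    """Pick whichever candidate occurs most often in the header line."""
    return max([";", ",", "\t"], key=header_line.count)


def read_rows(raw: bytes) -> list:
    text = decode_supplier_bytes(raw)
    lines = text.splitlines()
    delimiter = guess_delimiter(lines[0]) if lines else ","
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    return list(reader)


cp1251_rows = read_rows("Артикул;Цена\r\nA-1;100\r\n".encode("cp1251"))
utf8_bom_rows = read_rows(b"\xef\xbb\xbfsku,price\n1,2\n")
```

Falling back to CP1251 whenever UTF-8 decoding fails works for typical Russian-language price lists, but a byte sequence that happens to be valid in both encodings would be misread, so a sanity check on the decoded header is still worth adding.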
API — supplier REST endpoints, marketplace APIs (Ozon Seller API, Wildberries API). We work within rate limits and handle pagination.
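Pagination and rate limiting can be separated from the HTTP client itself. The sketch below assumes cursor-style pagination and a caller-supplied `fetch_page` function; both are illustrative assumptions, and the real Ozon and Wildberries APIs have their own pagination schemes and limits that are not reproduced here.

```python
from __future__ import annotations

import time
from typing import Callable, Iterator, Optional


def fetch_all(
    fetch_page: Callable[[Optional[str]], tuple],
    min_interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Walk every page, waiting so that requests are at least min_interval apart.

    fetch_page(cursor) is expected to return (items, next_cursor),
    with next_cursor set to None on the last page.
    """
    cursor = None
    while True:
        started = time.monotonic()
        items, cursor = fetch_page(cursor)
        yield from items
        if cursor is None:
            break
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            sleep(min_interval - elapsed)


# Fake two-page API used only to demonstrate the call
PAGES = {None: ([{"id": 1}, {"id": 2}], "p2"), "p2": ([{"id": 3}], None)}
sleeps: list = []
items = list(fetch_all(lambda c: PAGES[c], min_interval=5.0, sleep=sleeps.append))
```

Passing `sleep` as a parameter keeps the function testable without real delays.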
Auto-Population Pipeline
Four stages. Each can fail in its own way.
1. Collection. The parser crawls sources on a cron schedule. Raw data goes into an intermediate table — not directly into b_iblock_element. We log everything: how many pages were crawled, how many items were parsed, where we got a 403 or timeout. Without logs, debugging a parser is guesswork.
2. Normalization. This is where the main work happens:
- Cleaning HTML tags, extra whitespace, Unicode garbage
- Units of measurement: every spelling variant ("mm", "millimeters", "millimeter") maps to the canonical "mm"
- Mapping supplier categories → Bitrix infoblock sections. One supplier has "Laptops," another has "Laptops and Tablets," a third has "Notebooks" — all go into a single section
- Deduplication by SKU, EAN/GTIN. One product from three suppliers shouldn't appear three times
3. Loading into Bitrix. Via CIBlockElement::Add() for new items, CIBlockElement::Update() for existing ones. Images: download, resize via CFile::ResizeImageGet(), convert to WebP. Properties — via CIBlockElement::SetPropertyValuesEx(). SEO meta via \Bitrix\Iblock\InheritedProperty\ElementValues. SEF URLs generated from transliterated names.
4. Updates. The key point — don't overwrite manual edits by content managers. We only update price, stock, and active status. Descriptions and photos that were manually refined are flagged with UF_MANUAL_EDIT in element properties and skipped during import. Products missing from the feed are deactivated but not deleted.
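Stages 2 and 4 can be sketched as three small functions. This is an illustration under stated assumptions: the section code `laptops`, the dictionary keys, and the `manual_edit` key (standing in for the UF_MANUAL_EDIT flag) are hypothetical, and the sketch reads the update rule as "price, stock, and active status always; descriptions and photos only when the element is not flagged".

```python
from __future__ import annotations

UNIT_ALIASES = {"mm": "mm", "millimeter": "mm", "millimeters": "mm"}
CATEGORY_MAP = {  # supplier category -> hypothetical Bitrix section code
    "Laptops": "laptops",
    "Laptops and Tablets": "laptops",
    "Notebooks": "laptops",
}
ALWAYS_UPDATED = ("price", "stock", "active")
CONTENT_FIELDS = ("description", "photos")


def normalize(item: dict) -> dict:
    """Collapse whitespace, canonicalize the unit, map the category to a section."""
    out = dict(item)
    out["name"] = " ".join(item["name"].split())
    unit = item.get("unit", "").strip().lower()
    out["unit"] = UNIT_ALIASES.get(unit, unit)
    out["section"] = CATEGORY_MAP.get(item.get("category", ""))
    return out


def dedupe(items: list[dict]) -> list[dict]:
    """Keep the first item per EAN, or per SKU when the EAN is missing."""
    unique, seen = [], set()
    for item in items:
        if item.get("ean"):
            key = ("ean", item["ean"])
        elif item.get("sku"):
            key = ("sku", item["sku"])
        else:
            key = None  # nothing to match on, keep the item as-is
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        unique.append(item)
    return unique


def fields_to_update(existing: dict, incoming: dict) -> dict:
    """Select the fields an import is allowed to overwrite."""
    fields = {n: incoming[n] for n in ALWAYS_UPDATED if n in incoming}
    if not existing.get("manual_edit"):
        fields.update({n: incoming[n] for n in CONTENT_FIELDS if n in incoming})
    return fields
```

Keeping the first item per key is an arbitrary choice for the sketch; a real pipeline would need a rule for which supplier's record wins.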
Competitor Price Monitoring
A separate subsystem with its own specifics:
| Parameter | How It Works |
|---|---|
| Frequency | From once daily to every 2 hours — depends on market volatility |
| Matching | By SKU, EAN, fuzzy name comparison via Levenshtein distance |
| Storage | Custom vendor_price_monitor table with history, not infoblock |
| Alerts | Telegram/email when a competitor's price deviates by more than X% |
| Auto-rules | "Keep price 3% below the lowest competitor, but not below cost + 15%" |
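The fuzzy step of the matching row can be illustrated with a plain dynamic-programming Levenshtein distance. The 0.8 similarity threshold and the function names are assumptions chosen for the example; exact SKU or EAN matches would normally be tried before falling back to names.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Edit distance using two rolling rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance to a 0..1 score, ignoring case and outer spaces."""
    a, b = a.casefold().strip(), b.casefold().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def best_match(name: str, candidates: list[str], threshold: float = 0.8) -> str | None:
    """Return the closest candidate, or None if nothing reaches the threshold."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = similarity(name, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

Plain edit distance is sensitive to word order, so names like "T14 Lenovo" and "Lenovo T14" score poorly; token sorting before comparison is one common workaround.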
The result — a dashboard: your product vs competitors, price history, trends. The manager sees where they can raise the price without losing position and where they need to react.
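The auto-rule quoted in the table ("3% below the lowest competitor, but not below cost + 15%") is simple arithmetic. A hedged sketch using `Decimal` to avoid float rounding; the function name and the choice to return `None` when there are no competitor prices are assumptions:

```python
from __future__ import annotations

from decimal import Decimal, ROUND_HALF_UP


def suggest_price(
    cost: Decimal,
    competitor_prices: list[Decimal],
    undercut: Decimal = Decimal("0.03"),
    min_margin: Decimal = Decimal("0.15"),
) -> Decimal | None:
    """Undercut the cheapest competitor, but never go below cost plus margin."""
    if not competitor_prices:
        return None  # no data, the rule does not apply
    floor = cost * (1 + min_margin)
    target = min(competitor_prices) * (1 - undercut)
    return max(target, floor).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


undercutting = suggest_price(Decimal("100"), [Decimal("150"), Decimal("140")])
floored = suggest_price(Decimal("100"), [Decimal("110")])
```

In the first call the lowest competitor price is 140, so the target is 140 × 0.97 = 135.80, which is above the floor of 115.00. In the second call the target would be 106.70, so the floor of 115.00 wins.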
CSV/XML Import Module
For supplier files — a custom module with an admin panel:
- Configurable mapping: "column B in the file → BRAND property of the infoblock"
- Auto-detect encoding (CP1251, UTF-8, UTF-16) via mb_detect_encoding() with verification
- Image download by URL with a queue — to avoid saturating the connection
- Incremental updates by row hash: row changed — update; unchanged — skip
- Cron schedule, report: created 145, updated 892, errors 3 (with details)
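The incremental update by row hash can be sketched as follows. The assumptions here are that each row has a stable key (for example, the SKU) and that hashes from the previous run are stored somewhere as a key-to-hash mapping; the names `row_hash` and `classify` are illustrative.

```python
from __future__ import annotations

import hashlib


def row_hash(row: list[str]) -> str:
    """Hash a row's cells; the unit separator is unlikely to occur inside a cell."""
    payload = "\x1f".join(cell.strip() for cell in row)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def classify(rows: dict[str, list[str]], stored: dict[str, str]) -> dict[str, list[str]]:
    """Sort row keys into created / updated / skipped against the stored hashes."""
    result: dict[str, list[str]] = {"created": [], "updated": [], "skipped": []}
    for key, row in rows.items():
        digest = row_hash(row)
        if key not in stored:
            result["created"].append(key)
        elif stored[key] != digest:
            result["updated"].append(key)
        else:
            result["skipped"].append(key)
    return result


previous = {"A-1": row_hash(["A-1", "Bolt", "100"]), "A-2": row_hash(["A-2", "Nut", "50"])}
report = classify(
    {"A-1": ["A-1", "Bolt", "120"], "A-2": ["A-2", "Nut", "50"], "A-3": ["A-3", "Nail", "5"]},
    previous,
)
```

The lengths of the three lists map directly onto the "created / updated" counters in the import report.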
Large files: CSV processed in batches of 1000 rows via fgetcsv(), XML streamed via XMLReader, background execution via the Bitrix agent queue — no PHP timeouts.
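The batching idea is independent of the language. In PHP it would sit around an fgetcsv() loop; the Python sketch below shows the same pattern over any row iterator. The batch size of 1000 comes from the text, while the function name is an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(rows: Iterable[List[str]], size: int = 1000) -> Iterator[List[List[str]]]:
    """Yield lists of at most `size` rows without reading the whole input first."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# 2500 generated rows stand in for a large supplier file
batch_sizes = [len(b) for b in batches(([str(i), "x"] for i in range(2500)))]
```

Each batch would then be handed to one background step, so a failure affects at most one batch rather than the whole file.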
Legal Considerations
- robots.txt — we respect it. Crawl-delay — we comply
- Request frequency — 1-2 per second, no more. No need to DDoS someone else's site
- Manufacturer content — we use it. Unique authored texts — we don't copy
- Personal data — we don't collect
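The first two rules can be checked in code with Python's standard `urllib.robotparser`. In this sketch the user agent name `catalog-bot` and the 0.5-second floor (two requests per second at most) are assumptions; the robots.txt content is passed in as text so the example does not touch the network.

```python
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, agent: str, min_interval: float = 0.5) -> float:
    """Seconds between requests: the site's Crawl-delay or our own floor, whichever is larger."""
    declared = parser.crawl_delay(agent)
    return max(float(declared or 0), min_interval)


rules = build_rules("User-agent: *\nDisallow: /admin/\nCrawl-delay: 2\n")
allowed = rules.can_fetch("catalog-bot", "https://example.com/catalog/item-1")
blocked = rules.can_fetch("catalog-bot", "https://example.com/admin/login")
delay = polite_delay(rules, "catalog-bot")
```

A crawler would call `can_fetch` before each request and sleep for `polite_delay` seconds between requests to the same host.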
Our Process
- Prototype — a parser for 1-2 sources in 2-3 days. We assess data quality and potential pitfalls (Cloudflare protection, CAPTCHA, dynamic loading).
- Development — the full pipeline: parser → normalization → import into Bitrix → admin panel for management.
- Testing — we run it on the full catalog volume, checking edge cases (empty fields, broken HTML, corrupted images).
- Launch — set up cron, error monitoring via Telegram bot.
- Support — a competitor redesigned their layout? We update the CSS selectors in the parser.
Timelines
| Task | Timeline |
|---|---|
| Parser for a single site (static HTML) | 3-5 days |
| Parser for an SPA site (Puppeteer/Playwright, protection bypass) | 1-2 weeks |
| CSV/XML import module for Bitrix | 1-2 weeks |
| Price monitoring system (5-10 competitors) | 2-4 weeks |
| Comprehensive auto-population system | 4-8 weeks |
| Parser support and adaptation | subscription-based |







