Developing a Python parser for 1C-Bitrix


Python is chosen for parsing when PHP hits its limits: you need a headless browser, asynchronous processing of thousands of URLs, machine learning for content classification, or dealing with anti-bot protection. The downside — Python has no direct access to the Bitrix API, so there's always an intermediate layer between the parser and the CMS.

Why Python, not PHP

Concrete reasons, not abstract advantages:

  • Asynchronicity. asyncio + aiohttp allow processing 100+ parallel requests in a single thread. PHP's curl_multi is limited to 20–50 parallel connections in practice.
  • Headless browser. Playwright for Python is a mature tool with full support for Chromium, Firefox, WebKit. PHP wrappers for Puppeteer exist but are significantly less stable.
  • NLP and ML. Text classification, entity extraction, language detection — spaCy, transformers, langdetect libraries have no PHP analogs.
  • Parsing libraries. BeautifulSoup, lxml, Scrapy — battle-tested tools with enormous community support.

Architecture

A Python parser for Bitrix works as a separate service, interacting with the CMS through one of the data transmission channels:

[Python Parser] → [Intermediate storage] → [PHP importer → Bitrix]

Intermediate storage options:

Method           | When to use          | Pros                    | Cons
JSON files       | Up to 1,000 elements | Simple, no dependencies | Doesn't scale
PostgreSQL/MySQL | 1,000–100,000        | Transactions, indexes   | Needs a shared DB
Bitrix REST API  | Any volume           | Direct write to CMS     | Slow (HTTP overhead)
Redis/RabbitMQ   | Stream processing    | Asynchronicity, queues  | Infrastructure complexity

For most projects, a shared database is optimal: Python writes to intermediate tables, PHP script reads and imports into info blocks.
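The staging table only needs the columns the parser actually fills, plus a UNIQUE constraint on the source URL so repeat runs update rather than duplicate. A runnable sketch of that upsert flow (the schema is illustrative and uses in-memory SQLite for demonstration; production would use the shared PostgreSQL/MySQL database):

```python
import sqlite3

# Illustrative staging schema; in production this is a table in the shared DB.
# The UNIQUE constraint on source_url is what makes ON CONFLICT upserts work.
DDL = """
CREATE TABLE parser_staging (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    price       TEXT,
    description TEXT,
    image_url   TEXT,
    source_url  TEXT NOT NULL UNIQUE,
    status      TEXT NOT NULL DEFAULT 'new',
    updated_at  TEXT DEFAULT CURRENT_TIMESTAMP
)
"""

def upsert(conn, item):
    # Re-parsing the same URL updates the price instead of duplicating the row
    conn.execute(
        """INSERT INTO parser_staging (name, price, description, image_url, source_url)
           VALUES (?, ?, ?, ?, ?)
           ON CONFLICT (source_url) DO UPDATE SET
               price = excluded.price,
               updated_at = CURRENT_TIMESTAMP""",
        (item['name'], item['price'], item['description'], item['image'], item['url']),
    )

conn = sqlite3.connect(':memory:')
conn.execute(DDL)
item = {'name': 'Widget', 'price': '100', 'description': '', 'image': '', 'url': '/p/1'}
upsert(conn, item)                     # first run: inserts the row
upsert(conn, {**item, 'price': '120'}) # second run: updates price in place
rows = conn.execute("SELECT name, price, status FROM parser_staging").fetchall()
```

Re-running the parser over the same catalog then leaves exactly one row per source URL, with the price refreshed.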

Implementation with Scrapy

Scrapy is a parsing framework that handles URL queues, retries, throttling, and middleware out of the box. Project structure:

bitrix_parser/
├── scrapy.cfg
├── bitrix_parser/
│   ├── spiders/
│   │   ├── catalog_spider.py
│   │   └── news_spider.py
│   ├── items.py
│   ├── pipelines.py
│   ├── middlewares.py
│   └── settings.py

Spider for catalog parsing:

import scrapy

class CatalogSpider(scrapy.Spider):
    name = 'catalog'
    start_urls = ['https://example.com/catalog/']

    def parse(self, response):
        for product in response.css('.product-card'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'description': product.css('.desc::text').get(),
                'image': product.css('img::attr(src)').get(),
                'url': product.css('a::attr(href)').get(),
            }
        next_page = response.css('.pagination .next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Pipeline for writing to Bitrix database:

import psycopg2

class BitrixPipeline:
    def open_spider(self, spider):
        # Connection parameters are illustrative; take real credentials from config
        self.conn = psycopg2.connect(
            host='localhost', port=5433,
            dbname='bitrix_db', user='bitrix'
        )

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        # ON CONFLICT requires a UNIQUE constraint on parser_staging.source_url
        with self.conn.cursor() as cursor:
            cursor.execute("""
                INSERT INTO parser_staging (name, price, description, image_url, source_url, status)
                VALUES (%s, %s, %s, %s, %s, 'new')
                ON CONFLICT (source_url) DO UPDATE SET
                    price = EXCLUDED.price,
                    updated_at = NOW()
            """, (item['name'], item['price'], item['description'],
                  item['image'], item['url']))
        self.conn.commit()
        return item
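For the pipeline to run at all, Scrapy has to know about it in settings.py. The fragment below registers it and adds polite throttling; the specific delay and concurrency numbers are illustrative starting points, not requirements:

```python
# bitrix_parser/settings.py (fragment)
BOT_NAME = 'bitrix_parser'

# Register the pipeline; the number is its priority (lower runs first)
ITEM_PIPELINES = {
    'bitrix_parser.pipelines.BitrixPipeline': 300,
}

# Be polite to the source: adapt the delay to server response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Retry transient failures a couple of times before giving up
RETRY_ENABLED = True
RETRY_TIMES = 2
```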

Asynchronous parsing with aiohttp

For tasks where Scrapy is overkill (simple APIs, RSS feeds), an asynchronous approach with aiohttp is enough:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
        return await resp.text()

async def parse_all(urls):
    connector = aiohttp.TCPConnector(limit=50)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

50 parallel connections process 10,000 URLs in 5–10 minutes — vs 5–6 hours with sequential loading.
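With return_exceptions=True, a failed request comes back as an exception object in the result list instead of aborting the whole batch, so the caller has to filter. A self-contained sketch of that filtering (fake_fetch is a stub standing in for the real fetch, so the example runs without network access):

```python
import asyncio

async def fake_fetch(url):
    # Stub for fetch(session, url): one URL fails to show error handling
    if url.endswith('/broken'):
        raise TimeoutError(url)
    return f'<html>{url}</html>'

async def fetch_batch(urls):
    tasks = [fake_fetch(u) for u in urls]
    # Exceptions are returned in-place rather than raised
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(fetch_batch(['/a', '/broken', '/b']))
pages  = [r for r in results if not isinstance(r, Exception)]
failed = [r for r in results if isinstance(r, Exception)]
```

Failed URLs can then be logged or pushed back onto a retry queue without losing the rest of the batch.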

Headless browser for SPA

Sites built with React, Vue, or Angular return nearly empty HTML to a plain HTTP client: the content is rendered by JavaScript in the browser. Playwright solves this:

from playwright.async_api import async_playwright

async def parse_spa(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until='networkidle')
        content = await page.content()
        await browser.close()
        return content

Resource intensity: each Chromium instance consumes 100–300 MB RAM. For mass parsing, use a pool of 3–5 instances with a task queue.

Passing data to Bitrix

A PHP script on the Bitrix side retrieves data from the intermediate table:

CModule::IncludeModule('iblock');

$element = new CIBlockElement();
$rows = $DB->Query("SELECT * FROM parser_staging WHERE status = 'new' LIMIT 100");
while ($row = $rows->Fetch()) {
    $elementId = $element->Add([
        'IBLOCK_ID' => CATALOG_IBLOCK_ID,
        'NAME'      => $row['name'],
        'XML_ID'    => md5($row['source_url']), // stable key for re-imports
        // ...
    ]);
    if ($elementId) {
        // Cast to int before interpolating into SQL
        $DB->Query("UPDATE parser_staging SET status='imported', bx_id=" . (int)$elementId
            . " WHERE id=" . (int)$row['id']);
    }
}

This script runs via cron every 5–15 minutes and processes new records in batches.

Deploy and monitoring

A Python parser is deployed separately from Bitrix:

  • Systemd service or cron for scheduled runs.
  • Virtual environment (venv) — dependency isolation from system Python.
  • Logging — the logging module with file rotation or syslog export.
  • Monitoring — a script verifies the parser completed within N hours and sends an alert if it hangs.
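The monitoring check can be as simple as comparing the mtime of a heartbeat file that the parser touches after a successful run. The path and the age threshold below are assumptions for illustration:

```python
import os
import tempfile
import time

def is_stale(path, max_age_hours, now=None):
    """True if the heartbeat file is missing or older than the threshold."""
    now = now if now is not None else time.time()
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return True  # file missing: the parser never completed
    return age > max_age_hours * 3600

# Example with a temporary file standing in for the real heartbeat path
# (e.g. a daily job checked with a 26-hour threshold: 24h schedule + slack)
with tempfile.NamedTemporaryFile() as f:
    fresh = is_stale(f.name, 26)                                # just touched
    old   = is_stale(f.name, 26, now=time.time() + 30 * 3600)   # pretend 30h passed
```

If the check reports stale, the monitoring script sends the alert (mail, Telegram, whatever the team already uses).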

Typical crontab:

0 1 * * * cd /opt/parsers && /opt/parsers/venv/bin/scrapy crawl catalog 2>> /var/log/parser.log
0 */4 * * * cd /opt/parsers && /opt/parsers/venv/bin/python news_parser.py 2>> /var/log/parser.log

When to choose Python

Criterion                | PHP         | Python
Simple RSS/XML parsing   | Best choice | Overkill
JavaScript rendering     | Limited     | Playwright/Selenium
10,000+ URLs per session | Hard        | asyncio/Scrapy
Content classification   | No tools    | spaCy, transformers
Direct Bitrix import     | Native      | Via intermediate layer

A Python parser is justified with complex sources, large volume, or text processing needs. For simple tasks, a PHP parser is easier to maintain — one codebase, one stack.