Development of a Python parser for 1C-Bitrix
Python is chosen for parsing when PHP hits its limits: you need a headless browser, asynchronous processing of thousands of URLs, machine learning for content classification, or dealing with anti-bot protection. The downside — Python has no direct access to the Bitrix API, so there's always an intermediate layer between the parser and the CMS.
Why Python, not PHP
Concrete reasons, not abstract advantages:
- Asynchronicity. `asyncio` + `aiohttp` allow processing 100+ parallel requests in a single thread. PHP's `curl_multi` is limited to 20–50 parallel connections in practice.
- Headless browser. Playwright for Python is a mature tool with full support for Chromium, Firefox, and WebKit. PHP wrappers for Puppeteer exist but are significantly less stable.
- NLP and ML. Text classification, entity extraction, language detection: `spaCy`, `transformers`, and `langdetect` have no PHP analogs.
- Parsing libraries. `BeautifulSoup`, `lxml`, `Scrapy`: battle-tested tools with enormous community support.
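As a minimal illustration of the library ecosystem, extracting structured data with BeautifulSoup takes a few lines (the HTML and class names here are made up for the example):

```python
from bs4 import BeautifulSoup

html = """
<div class="product-card"><h2>Widget A</h2><span class="price">100</span></div>
<div class="product-card"><h2>Widget B</h2><span class="price">200</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
products = [
    # CSS selectors work the same way they will in the Scrapy spider below
    {"name": card.h2.get_text(), "price": card.select_one(".price").get_text()}
    for card in soup.select(".product-card")
]
# products == [{"name": "Widget A", "price": "100"},
#              {"name": "Widget B", "price": "200"}]
```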
Architecture
A Python parser for Bitrix works as a separate service, interacting with the CMS through one of the data transmission channels:
[Python Parser] → [Intermediate storage] → [PHP importer → Bitrix]
Intermediate storage options:
| Method | When to use | Pros | Cons |
|---|---|---|---|
| JSON files | Up to 1,000 elements | Simple, no dependencies | Doesn't scale |
| PostgreSQL/MySQL | 1,000–100,000 | Transactions, indexes | Need shared DB |
| REST API Bitrix | Any volume | Direct write to CMS | Slow (HTTP overhead) |
| Redis/RabbitMQ | Stream processing | Asynchronicity, queues | Infrastructure complexity |
For most projects, a shared database is optimal: Python writes to intermediate tables, PHP script reads and imports into info blocks.
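For the shared-database approach, the staging table might look like this. A sketch: the column names are assumptions chosen to match the pipeline and PHP importer examples in this article, and the UNIQUE constraint on source_url is what makes upserts by source URL possible.

```sql
CREATE TABLE parser_staging (
    id          SERIAL PRIMARY KEY,
    name        TEXT NOT NULL,
    price       TEXT,
    description TEXT,
    image_url   TEXT,
    source_url  TEXT UNIQUE NOT NULL,       -- required for ON CONFLICT (source_url)
    status      VARCHAR(16) DEFAULT 'new',  -- 'new' -> 'imported'
    bx_id       INTEGER,                    -- ID of the created info block element
    updated_at  TIMESTAMP DEFAULT NOW()
);
```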
Implementation with Scrapy
Scrapy is a parsing framework that handles URL queues, retries, throttling, and middleware out of the box. Project structure:
```
bitrix_parser/
├── scrapy.cfg
├── bitrix_parser/
│   ├── spiders/
│   │   ├── catalog_spider.py
│   │   └── news_spider.py
│   ├── items.py
│   ├── pipelines.py
│   ├── middlewares.py
│   └── settings.py
```
Spider for catalog parsing:
```python
import scrapy

class CatalogSpider(scrapy.Spider):
    name = 'catalog'
    start_urls = ['https://example.com/catalog/']

    def parse(self, response):
        for product in response.css('.product-card'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'description': product.css('.desc::text').get(),
                'image': product.css('img::attr(src)').get(),
                'url': product.css('a::attr(href)').get(),
            }
        next_page = response.css('.pagination .next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
Pipeline for writing to Bitrix database:
```python
import psycopg2

class BitrixPipeline:
    def open_spider(self, spider):
        self.conn = psycopg2.connect(
            host='localhost', port=5433,
            dbname='bitrix_db', user='bitrix'
        )

    def close_spider(self, spider):
        # Release the connection when the crawl finishes
        self.conn.close()

    def process_item(self, item, spider):
        cursor = self.conn.cursor()
        cursor.execute("""
            INSERT INTO parser_staging (name, price, description, image_url, source_url, status)
            VALUES (%s, %s, %s, %s, %s, 'new')
            ON CONFLICT (source_url) DO UPDATE SET
                price = EXCLUDED.price,
                updated_at = NOW()
        """, (item['name'], item['price'], item['description'],
              item['image'], item['url']))
        self.conn.commit()
        cursor.close()
        return item
```
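For the pipeline to run, it has to be registered in settings.py. A fragment of a possible configuration; the throttling and retry values here are starting points, not recommendations:

```python
# bitrix_parser/settings.py (fragment)
BOT_NAME = "bitrix_parser"

# Register the pipeline; 300 is its priority among pipelines
ITEM_PIPELINES = {
    "bitrix_parser.pipelines.BitrixPipeline": 300,
}

# Be polite to the source site: AutoThrottle adjusts the delay dynamically
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Retry transient errors a couple of times before giving up
RETRY_ENABLED = True
RETRY_TIMES = 2
```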
Asynchronous parsing with aiohttp
For tasks where Scrapy is overkill (simple APIs, RSS feeds), an asynchronous approach with aiohttp:
```python
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
        return await resp.text()

async def parse_all(urls):
    connector = aiohttp.TCPConnector(limit=50)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)
```
With 50 parallel connections, 10,000 URLs are processed in roughly 5–10 minutes, versus 5–6 hours with sequential loading.
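Because parse_all uses return_exceptions=True, failed downloads come back as exception objects mixed in with HTML strings. A small helper (illustrative) separates the two so failures can be logged or retried; the demo uses stub coroutines instead of real HTTP requests:

```python
import asyncio

def split_results(urls, results):
    """Pair each URL with its gather() result; separate successes from failures."""
    ok, failed = {}, {}
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            failed[url] = result
        else:
            ok[url] = result
    return ok, failed

async def demo():
    async def success():
        return "<html>ok</html>"
    async def failure():
        raise ValueError("boom")
    # Same pattern as parse_all: exceptions are returned, not raised
    results = await asyncio.gather(success(), failure(), return_exceptions=True)
    return split_results(["https://a.test", "https://b.test"], results)

ok, failed = asyncio.run(demo())
# ok == {"https://a.test": "<html>ok</html>"}; failed holds a ValueError for b.test
```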
Headless browser for SPA
Sites built with React, Vue, Angular return empty HTML. Playwright solves this:
```python
from playwright.async_api import async_playwright

async def parse_spa(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until='networkidle')
        content = await page.content()
        await browser.close()
        return content
```
Resource intensity: each Chromium instance consumes 100–300 MB RAM. For mass parsing, use a pool of 3–5 instances with a task queue.
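The pool pattern itself can be sketched without Playwright: N workers drain a shared queue, so at most N renders run concurrently. Here render_page is a stub standing in for a real browser call like parse_spa above:

```python
import asyncio

async def run_pool(urls, render_page, workers=3):
    """Render URLs with at most `workers` concurrent render_page calls."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = {}

    async def worker():
        while True:
            try:
                url = queue.get_nowait()  # all URLs are enqueued up front
            except asyncio.QueueEmpty:
                return  # queue drained, worker exits
            results[url] = await render_page(url)

    await asyncio.gather(*(worker() for _ in range(workers)))
    return results

# Stub render function standing in for a real Playwright call:
async def fake_render(url):
    await asyncio.sleep(0)
    return f"<html>{url}</html>"

pages = asyncio.run(run_pool(["u1", "u2", "u3", "u4"], fake_render, workers=2))
# pages maps each URL to its rendered HTML
```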
Passing data to Bitrix
A PHP script on the Bitrix side retrieves data from the intermediate table:
```php
$rows = $DB->Query("SELECT * FROM parser_staging WHERE status = 'new' LIMIT 100");
while ($row = $rows->Fetch()) {
    $elementId = (new CIBlockElement())->Add([
        'IBLOCK_ID' => CATALOG_IBLOCK_ID,
        'NAME' => $row['name'],
        'XML_ID' => md5($row['source_url']),
        // ...
    ]);
    if ($elementId) {
        // Cast to int before interpolating into SQL
        $id = (int)$row['id'];
        $bxId = (int)$elementId;
        $DB->Query("UPDATE parser_staging SET status='imported', bx_id={$bxId} WHERE id={$id}");
    }
}
```
This script runs via cron every 5–15 minutes and processes new records in batches.
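The PHP importer derives XML_ID as the md5 of the source URL. If the Python side ever needs to reference the same elements (deduplication, update tracking), it can compute an identical value with hashlib, since md5 of the same byte string gives the same hex digest in both languages:

```python
import hashlib

def bitrix_xml_id(source_url: str) -> str:
    """Same value as PHP's md5($row['source_url']): 32-char lowercase hex."""
    return hashlib.md5(source_url.encode("utf-8")).hexdigest()

print(bitrix_xml_id("https://example.com/catalog/item-1/"))
```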
Deploy and monitoring
A Python parser is deployed separately from Bitrix:
- Systemd service or cron for scheduled runs.
- Virtual environment (`venv`): dependency isolation from the system Python.
- Logging: the `logging` module with file rotation or syslog export.
- Monitoring: a script verifies the parser completed within N hours and sends an alert on a hang.
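The monitoring check can be as simple as looking at the log file's modification time. A minimal sketch, assuming the cron jobs write to the log at least once per run; the threshold and alert channel are up to the project:

```python
import os
import time

def hours_since_update(path: str) -> float:
    """Hours since the file was last modified."""
    return (time.time() - os.path.getmtime(path)) / 3600

def check_parser(log_path: str, max_age_hours: float) -> bool:
    """Return True if the parser looks alive (log touched recently enough)."""
    try:
        return hours_since_update(log_path) <= max_age_hours
    except FileNotFoundError:
        return False  # no log at all: treat as a failure

# Example: a daily cron job writing to /var/log/parser.log means anything
# older than ~26 hours is a hang or a dead parser.
# if not check_parser("/var/log/parser.log", 26):
#     ...send an alert (email, Telegram, monitoring API)...
```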
Typical crontab:
```
0 1 * * * cd /opt/parsers && /opt/parsers/venv/bin/scrapy crawl catalog 2>> /var/log/parser.log
0 */4 * * * cd /opt/parsers && /opt/parsers/venv/bin/python news_parser.py 2>> /var/log/parser.log
```
When to choose Python
| Criterion | PHP | Python |
|---|---|---|
| Simple RSS/XML parsing | Best choice | Overkill |
| JavaScript rendering | Limited | Playwright/Selenium |
| 10,000+ URLs per session | Hard | asyncio/Scrapy |
| Content classification | No tools | spaCy, transformers |
| Direct Bitrix import | Native | Via intermediate layer |
A Python parser is justified with complex sources, large volume, or text processing needs. For simple tasks, a PHP parser is easier to maintain — one codebase, one stack.