Development of a Python parser for 1C-Bitrix
Python is chosen for parsing when PHP hits its limits: you need a headless browser, asynchronous processing of thousands of URLs, machine learning for content classification, or dealing with anti-bot protection. The downside — Python has no direct access to the Bitrix API, so there's always an intermediate layer between the parser and the CMS.
Why Python, not PHP
Concrete reasons, not abstract advantages:
- Asynchronicity. `asyncio` + `aiohttp` allow processing 100+ parallel requests in a single thread. PHP's `curl_multi` is limited to 20–50 parallel connections in practice.
- Headless browser. Playwright for Python is a mature tool with full support for Chromium, Firefox, and WebKit. PHP wrappers for Puppeteer exist but are significantly less stable.
- NLP and ML. Text classification, entity extraction, language detection: `spaCy`, `transformers`, and `langdetect` have no PHP analogs.
- Parsing libraries. `BeautifulSoup`, `lxml`, `Scrapy`: battle-tested tools with enormous community support.
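As a minimal illustration of the library ecosystem, extracting structured data with BeautifulSoup takes a few lines (the HTML and class names here are made up for the example):

```python
from bs4 import BeautifulSoup

html = """
<div class="product-card"><h2>Widget A</h2><span class="price">100</span></div>
<div class="product-card"><h2>Widget B</h2><span class="price">200</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
products = [
    # CSS selectors work the same way they will in the Scrapy spider below
    {"name": card.h2.get_text(), "price": card.select_one(".price").get_text()}
    for card in soup.select(".product-card")
]
# products == [{"name": "Widget A", "price": "100"},
#              {"name": "Widget B", "price": "200"}]
```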
Architecture
A Python parser for Bitrix works as a separate service, interacting with the CMS through one of the data transmission channels:
[Python Parser] → [Intermediate storage] → [PHP importer → Bitrix]
Intermediate storage options:
| Method | When to use | Pros | Cons |
|---|---|---|---|
| JSON files | Up to 1,000 elements | Simple, no dependencies | Doesn't scale |
| PostgreSQL/MySQL | 1,000–100,000 | Transactions, indexes | Need shared DB |
| REST API Bitrix | Any volume | Direct write to CMS | Slow (HTTP overhead) |
| Redis/RabbitMQ | Stream processing | Asynchronicity, queues | Infrastructure complexity |
For most projects, a shared database is optimal: Python writes to intermediate tables, PHP script reads and imports into info blocks.
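For the shared-database approach, the staging table might look like this. A sketch: the column names are assumptions chosen to match the pipeline and PHP importer examples in this article, and the UNIQUE constraint on source_url is what makes upserts by source URL possible.

```sql
CREATE TABLE parser_staging (
    id          SERIAL PRIMARY KEY,
    name        TEXT NOT NULL,
    price       TEXT,
    description TEXT,
    image_url   TEXT,
    source_url  TEXT UNIQUE NOT NULL,       -- required for ON CONFLICT (source_url)
    status      VARCHAR(16) DEFAULT 'new',  -- 'new' -> 'imported'
    bx_id       INTEGER,                    -- ID of the created info block element
    updated_at  TIMESTAMP DEFAULT NOW()
);
```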
Implementation with Scrapy
Scrapy is a parsing framework that handles URL queues, retries, throttling, and middleware out of the box. Project structure:
```
bitrix_parser/
├── scrapy.cfg
├── bitrix_parser/
│   ├── spiders/
│   │   ├── catalog_spider.py
│   │   └── news_spider.py
│   ├── items.py
│   ├── pipelines.py
│   ├── middlewares.py
│   └── settings.py
```
Spider for catalog parsing:
```python
import scrapy

class CatalogSpider(scrapy.Spider):
    name = 'catalog'
    start_urls = ['https://example.com/catalog/']

    def parse(self, response):
        for product in response.css('.product-card'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'description': product.css('.desc::text').get(),
                'image': product.css('img::attr(src)').get(),
                'url': product.css('a::attr(href)').get(),
            }
        next_page = response.css('.pagination .next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
Pipeline for writing to Bitrix database:
```python
import psycopg2

class BitrixPipeline:
    def open_spider(self, spider):
        self.conn = psycopg2.connect(
            host='localhost', port=5433,
            dbname='bitrix_db', user='bitrix'
        )

    def close_spider(self, spider):
        # Release the connection when the crawl finishes
        self.conn.close()

    def process_item(self, item, spider):
        cursor = self.conn.cursor()
        cursor.execute("""
            INSERT INTO parser_staging (name, price, description, image_url, source_url, status)
            VALUES (%s, %s, %s, %s, %s, 'new')
            ON CONFLICT (source_url) DO UPDATE SET
                price = EXCLUDED.price,
                updated_at = NOW()
        """, (item['name'], item['price'], item['description'],
              item['image'], item['url']))
        self.conn.commit()
        cursor.close()
        return item
```
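For the pipeline to run, it has to be registered in settings.py. A fragment of a possible configuration; the throttling and retry values here are starting points, not recommendations:

```python
# bitrix_parser/settings.py (fragment)
BOT_NAME = "bitrix_parser"

# Register the pipeline; 300 is its priority among pipelines
ITEM_PIPELINES = {
    "bitrix_parser.pipelines.BitrixPipeline": 300,
}

# Be polite to the source site: AutoThrottle adjusts the delay dynamically
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Retry transient errors a couple of times before giving up
RETRY_ENABLED = True
RETRY_TIMES = 2
```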
Asynchronous parsing with aiohttp
For tasks where Scrapy is overkill (simple APIs, RSS feeds), an asynchronous approach with aiohttp:
```python
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
        return await resp.text()

async def parse_all(urls):
    connector = aiohttp.TCPConnector(limit=50)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)
```
With 50 parallel connections, 10,000 URLs are processed in roughly 5–10 minutes, versus 5–6 hours with sequential loading.
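Because parse_all uses return_exceptions=True, failed downloads come back as exception objects mixed in with HTML strings. A small helper (illustrative) separates the two so failures can be logged or retried; the demo uses stub coroutines instead of real HTTP requests:

```python
import asyncio

def split_results(urls, results):
    """Pair each URL with its gather() result; separate successes from failures."""
    ok, failed = {}, {}
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            failed[url] = result
        else:
            ok[url] = result
    return ok, failed

async def demo():
    async def success():
        return "<html>ok</html>"
    async def failure():
        raise ValueError("boom")
    # Same pattern as parse_all: exceptions are returned, not raised
    results = await asyncio.gather(success(), failure(), return_exceptions=True)
    return split_results(["https://a.test", "https://b.test"], results)

ok, failed = asyncio.run(demo())
# ok == {"https://a.test": "<html>ok</html>"}; failed holds a ValueError for b.test
```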
Headless browser for SPA
Sites built with React, Vue, Angular return empty HTML. Playwright solves this:
```python
from playwright.async_api import async_playwright

async def parse_spa(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until='networkidle')
        content = await page.content()
        await browser.close()
        return content
```
Resource intensity: each Chromium instance consumes 100–300 MB RAM. For mass parsing, use a pool of 3–5 instances with a task queue.
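The pool pattern itself can be sketched without Playwright: N workers drain a shared queue, so at most N renders run concurrently. Here render_page is a stub standing in for a real browser call like parse_spa above:

```python
import asyncio

async def run_pool(urls, render_page, workers=3):
    """Render URLs with at most `workers` concurrent render_page calls."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = {}

    async def worker():
        while True:
            try:
                url = queue.get_nowait()  # all URLs are enqueued up front
            except asyncio.QueueEmpty:
                return  # queue drained, worker exits
            results[url] = await render_page(url)

    await asyncio.gather(*(worker() for _ in range(workers)))
    return results

# Stub render function standing in for a real Playwright call:
async def fake_render(url):
    await asyncio.sleep(0)
    return f"<html>{url}</html>"

pages = asyncio.run(run_pool(["u1", "u2", "u3", "u4"], fake_render, workers=2))
# pages maps each URL to its rendered HTML
```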
Passing data to Bitrix
A PHP script on the Bitrix side retrieves data from the intermediate table:
```php
$rows = $DB->Query("SELECT * FROM parser_staging WHERE status = 'new' LIMIT 100");
while ($row = $rows->Fetch()) {
    $elementId = (new CIBlockElement())->Add([
        'IBLOCK_ID' => CATALOG_IBLOCK_ID,
        'NAME' => $row['name'],
        'XML_ID' => md5($row['source_url']),
        // ...
    ]);
    if ($elementId) {
        // Cast to int before interpolating into SQL
        $id = (int)$row['id'];
        $bxId = (int)$elementId;
        $DB->Query("UPDATE parser_staging SET status='imported', bx_id={$bxId} WHERE id={$id}");
    }
}
```
This script runs via cron every 5–15 minutes and processes new records in batches.
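The PHP importer derives XML_ID as the md5 of the source URL. If the Python side ever needs to reference the same elements (deduplication, update tracking), it can compute an identical value with hashlib, since md5 of the same byte string gives the same hex digest in both languages:

```python
import hashlib

def bitrix_xml_id(source_url: str) -> str:
    """Same value as PHP's md5($row['source_url']): 32-char lowercase hex."""
    return hashlib.md5(source_url.encode("utf-8")).hexdigest()

print(bitrix_xml_id("https://example.com/catalog/item-1/"))
```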
Deploy and monitoring
A Python parser is deployed separately from Bitrix:
- Systemd service or cron for scheduled runs.
- Virtual environment (`venv`): dependency isolation from the system Python.
- Logging: the `logging` module with file rotation or syslog export.
- Monitoring: a script verifies the parser completed within N hours and sends an alert on a hang.
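The monitoring check can be as simple as looking at the log file's modification time. A minimal sketch, assuming the cron jobs write to the log at least once per run; the threshold and alert channel are up to the project:

```python
import os
import time

def hours_since_update(path: str) -> float:
    """Hours since the file was last modified."""
    return (time.time() - os.path.getmtime(path)) / 3600

def check_parser(log_path: str, max_age_hours: float) -> bool:
    """Return True if the parser looks alive (log touched recently enough)."""
    try:
        return hours_since_update(log_path) <= max_age_hours
    except FileNotFoundError:
        return False  # no log at all: treat as a failure

# Example: a daily cron job writing to /var/log/parser.log means anything
# older than ~26 hours is a hang or a dead parser.
# if not check_parser("/var/log/parser.log", 26):
#     ...send an alert (email, Telegram, monitoring API)...
```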
Typical crontab:
```
0 1 * * * cd /opt/parsers && /opt/parsers/venv/bin/scrapy crawl catalog 2>> /var/log/parser.log
0 */4 * * * cd /opt/parsers && /opt/parsers/venv/bin/python news_parser.py 2>> /var/log/parser.log
```
When to choose Python
| Criterion | PHP | Python |
|---|---|---|
| Simple RSS/XML parsing | Best choice | Overkill |
| JavaScript rendering | Limited | Playwright/Selenium |
| 10,000+ URLs per session | Hard | asyncio/Scrapy |
| Content classification | No tools | spaCy, transformers |
| Direct Bitrix import | Native | Via intermediate layer |
A Python parser is justified with complex sources, large volume, or text processing needs. For simple tasks, a PHP parser is easier to maintain — one codebase, one stack.