Configuring Bypass of Anti-Parsing Protection for 1C-Bitrix
A data source has updated its protection, and a parser that had worked for months stops retrieving content. Instead of HTML with prices, the server returns a CAPTCHA, a JavaScript challenge, or an empty body. This is the reality of industrial parsing: defense systems evolve, and parsers must adapt. Let's review the main types of protection and the technical approaches to handling them.
Types of Protections and Their Indicators
JavaScript Challenge (Cloudflare, DataDome). The server typically returns HTTP 503 or 403 with JS code that must execute in a browser and set a clearance cookie (cf_clearance for Cloudflare, datadome for DataDome). Indicator: the body contains <noscript> and window._cf_chl_opt or a similar obfuscated script.
Rate Limiting. HTTP 429 or 403 after N requests per period. The limit can be keyed to the IP address, the session cookie, or the fingerprint. Indicator: requests succeed for the first few minutes, then get blocked.
Browser Fingerprinting. The server checks the TLS fingerprint (JA3), the HTTP header order, and the presence of JavaScript APIs (navigator, canvas). cURL with default settings has a characteristic JA3 that differs from browser fingerprints.
Honeypot Links. Links hidden via CSS (display:none, visibility:hidden) that only bots follow. Requesting such a link results in an immediate IP ban.
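The indicators above can be turned into a rough first-pass diagnosis. The sketch below is one possible way to do it; the function name detectProtection, the returned labels, and the marker strings are illustrative assumptions, not part of Bitrix or of any protection vendor's API, and real sources will need their own markers.

```php
<?php
/**
 * Hypothetical helper: guess the protection type from a response.
 * Labels and marker strings are illustrative, not exhaustive.
 */
function detectProtection(int $status, string $body): string
{
    // JS challenge: Cloudflare / DataDome markers inside the body
    foreach (['_cf_chl_opt', 'datadome'] as $marker) {
        if (stripos($body, $marker) !== false) {
            return 'js_challenge';
        }
    }
    // A CAPTCHA page usually mentions it in the markup
    if (stripos($body, 'captcha') !== false) {
        return 'captcha';
    }
    if ($status === 429) {
        return 'rate_limit';
    }
    if ($status === 403) {
        // Could be a rate limit or a fingerprint ban; needs a closer look
        return 'blocked';
    }
    if ($status === 200 && trim($body) === '') {
        return 'empty_body';
    }
    return 'none';
}
```

A parser could log this label for every failed request, which makes it easier to see whether a source has switched from, say, rate limiting to a JS challenge.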
Headless Browser for JavaScript Challenge
When the source requires JS execution, the Bitrix HttpClient cannot help: it does not execute JavaScript. The solution is a headless browser.
Puppeteer / Playwright run as a separate service (Node.js), with the Bitrix parser calling it via HTTP API. Scheme:
- The PHP parser sends the URL to the internal service: http://localhost:3000/render?url=...
- The Node.js service opens the page in Chromium, waits for JS execution, retrieves the cookies and rendered HTML.
- The service returns the HTML and cookies to PHP.
- The PHP parser uses the obtained cookies for subsequent requests via the regular HttpClient; the JS challenge provides a cookie valid for 15-30 minutes.
This avoids running each request through a browser (slow and resource-intensive) and instead gets a "pass" once and uses it for a series of regular HTTP requests.
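The "get a pass once, reuse it" idea can be sketched on the PHP side as a small cookie cache. This is only an illustration: the class name ClearanceCache and the JSON shape of the renderer's response ({"html": ..., "cookies": [{"name": ..., "value": ...}]}) are assumptions, since the scheme above does not fix a response format. The default lifetime of 900 seconds is the lower bound of the 15-30 minute range. PHP 8 is assumed.

```php
<?php
/**
 * Hypothetical cache for the "pass" obtained from the headless renderer.
 * Assumed renderer response (not a fixed API):
 *   {"html": "...", "cookies": [{"name": "cf_clearance", "value": "..."}]}
 */
final class ClearanceCache
{
    /** @var array<string, string> cookie name => value */
    private array $cookies = [];
    private int $expiresAt = 0;

    // 900 s = 15 minutes, the conservative end of the 15-30 minute range
    public function __construct(private int $ttlSeconds = 900)
    {
    }

    /** Store cookies from the renderer's JSON response. */
    public function storeFromRenderer(string $json, int $now): void
    {
        $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
        $this->cookies = [];
        foreach ($data['cookies'] ?? [] as $cookie) {
            $this->cookies[$cookie['name']] = $cookie['value'];
        }
        $this->expiresAt = $now + $this->ttlSeconds;
    }

    /** True while the cached cookies can still be reused. */
    public function isFresh(int $now): bool
    {
        return $this->cookies !== [] && $now < $this->expiresAt;
    }

    /** Value for a Cookie request header. */
    public function cookieHeader(): string
    {
        $pairs = [];
        foreach ($this->cookies as $name => $value) {
            $pairs[] = $name . '=' . $value;
        }
        return implode('; ', $pairs);
    }
}
```

In the parser, the renderer would be called only when isFresh(time()) is false; otherwise the value of cookieHeader() is attached to the regular request as a Cookie header (for example via the setHeader() method of the Bitrix HttpClient).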
Important: the headless browser must be masked. Stock Puppeteer is detected by navigator.webdriver = true, the absence of plugins, and characteristic window sizes. Use puppeteer-extra with the stealth plugin or an equivalent for Playwright.
TLS Fingerprint Rotation
To get past fingerprinting, rotating IP addresses is not enough: the TLS fingerprint must change as well. In PHP/cURL it is influenced through these options:
- CURLOPT_SSLVERSION: sets the TLS version.
- CURLOPT_SSL_CIPHER_LIST: sets the cipher order, one of the inputs to the JA3 hash.
curl-impersonate, a cURL fork, emulates the TLS fingerprints of specific browsers (Chrome, Firefox, Safari). It is installed on the server as a replacement for standard cURL.
CAPTCHA Handling
If the source shows a CAPTCHA, the options are:
- Recognition service (2Captcha, Anti-Captcha): the parser sends the image, receives the answer via the API, and submits it in the form. Cost: $2-3 per 1000 solutions. Delay: 10-30 seconds.
- Reduced frequency: a CAPTCHA often appears as a reaction to rate limiting. Reducing the request frequency and rotating proxies may eliminate the CAPTCHA entirely.
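Reducing frequency is usually implemented as a growing pause after each blocked response. The following is a minimal sketch of exponential backoff with optional jitter; the function name backoffDelay and the default base and cap values are assumptions chosen for illustration and should be tuned per source.

```php
<?php
/**
 * Hypothetical helper: pause (in seconds) before the next request after
 * $attempt consecutive blocked responses (429, 403 or a CAPTCHA page).
 */
function backoffDelay(int $attempt, float $base = 2.0, float $cap = 300.0, float $jitter = 0.0): float
{
    // Exponential growth: base, 2*base, 4*base, ... limited by the cap
    $delay = min($cap, $base * (2 ** max(0, $attempt)));

    // Optional random extra of up to $jitter * delay, so that parallel
    // workers do not retry in lockstep
    $delay += $delay * $jitter * (mt_rand() / mt_getrandmax());

    return $delay;
}
```

A worker could then call usleep((int) (backoffDelay($failures, 2.0, 300.0, 0.25) * 1000000)) before retrying and reset $failures to zero after the first successful response.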
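Integration with 2Captcha from the PHP parser:
// Submit the image via POST: raw base64 contains "+", "/" and "=" and is too long for a query string
$ctx = stream_context_create(['http' => ['method' => 'POST', 'header' => 'Content-Type: application/x-www-form-urlencoded', 'content' => http_build_query(['key' => $apiKey, 'method' => 'base64', 'body' => base64_encode($captchaImage)])]]);
// in.php answers "OK|<task id>" on success; on an error there is no id and $taskId stays null
[, $taskId] = explode('|', (string) file_get_contents('https://2captcha.com/in.php', false, $ctx)) + [1 => null];
// Waiting for the solution (polling): res.php returns CAPCHA_NOT_READY until solved, then "OK|<answer>"
do { sleep(5); $result = file_get_contents('https://2captcha.com/res.php?' . http_build_query(['key' => $apiKey, 'action' => 'get', 'id' => $taskId])); } while ($result === 'CAPCHA_NOT_READY');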
Honeypot Protection
Before following a link, check the element's computed styles: display, visibility, opacity, position (outside the viewport). If the parser works with the DOM (DOMDocument in PHP), computed styles are not available, so check inline styles and classes. If it works through a headless browser, use getComputedStyle() to verify visibility.
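For the DOMDocument case, the check might look like the sketch below. It is a heuristic under stated assumptions: the function names visibleLinks and isHiddenNode are hypothetical, only inline styles, the hidden attribute, and a caller-supplied list of "hidden" class names are inspected, and rules from external stylesheets are not resolved.

```php
<?php
/**
 * Hypothetical filter: return hrefs of links that are not obviously hidden.
 * $hiddenClasses lists class names known (from the source's CSS) to hide elements.
 */
function visibleLinks(string $html, array $hiddenClasses = []): array
{
    if (trim($html) === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from imperfect real-world markup

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (!isHiddenNode($a, $hiddenClasses)) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}

function isHiddenNode(DOMElement $element, array $hiddenClasses): bool
{
    // Walk up the tree: a link inside a hidden container is hidden too
    for ($node = $element; $node instanceof DOMElement; $node = $node->parentNode) {
        $style = strtolower(preg_replace('/\s+/', '', $node->getAttribute('style')));
        // display/visibility/opacity, plus large negative offsets (off-screen)
        if (preg_match('/display:none|visibility:hidden|opacity:0(?![.\d])|(?:left|top):-\d{3,}px/', $style)) {
            return true;
        }
        if ($node->hasAttribute('hidden')) {
            return true;
        }
        $classes = preg_split('/\s+/', trim($node->getAttribute('class')), -1, PREG_SPLIT_NO_EMPTY);
        if (array_intersect($classes, $hiddenClasses)) {
            return true;
        }
    }
    return false;
}
```

The parser would then enqueue only the URLs returned by visibleLinks($html, $hiddenClasses), with the class list collected during the diagnosis of the specific source.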
What We Configure in One Day
- Diagnosis of the protection type on the specific source.
- Setup of a headless renderer (for a JS challenge) or header rotation (for fingerprinting).
- Integration with the Bitrix parser: retrieval of cookies/HTML.
- Testing on the real source and fine-tuning of delays.
- Documentation of the protection's behavior for further support.







