Setting up product deduplication when auto-populating a 1C-Bitrix catalog
Filling a catalog from multiple sources inevitably creates duplicates. The same product — "Bosch GSR 18V-50" — comes from three suppliers with different names, SKUs, and descriptions. Without deduplication, the catalog grows, filters show duplicates, and managers spend hours on manual cleanup. Let's look at deduplication mechanisms at the Bitrix level.
Deduplication levels
1. Exact match by key. The most reliable method. If a product has a unique external identifier (EAN, GTIN, manufacturer SKU), deduplication is trivial: check whether an element with that XML_ID or PROPERTY_ARTICLE value already exists.
```php
$existing = CIBlockElement::GetList(
    [],
    ['IBLOCK_ID' => $iblockId, 'XML_ID' => $externalId],
    false,
    ['nTopCount' => 1],
    ['ID']
)->Fetch();

if ($existing) {
    // Update the existing element
    (new CIBlockElement())->Update($existing['ID'], $arFields);
} else {
    // Create a new one
    (new CIBlockElement())->Add($arFields);
}
```
Problem: not all sources provide stable unique identifiers. Supplier SKU ≠ manufacturer SKU. One product may have 3–5 different SKUs from different suppliers.
2. Match by field combination. If there's no unique key — search by combination: name + brand + key characteristic (volume, weight, size).
```php
$filter = [
    'IBLOCK_ID' => $iblockId,
    '%NAME' => $normalizedName,
    'PROPERTY_BRAND' => $brand,
];
```
Before comparing names, normalize them: convert to lower case, remove extra spaces, replace typographic characters.
3. Fuzzy matching. When names differ between suppliers: "Bosch GSR 18V-50 Professional" vs "Шуруповёрт Bosch GSR18V50" ("шуруповёрт" is Russian for "cordless screwdriver"). Use fuzzy comparison algorithms: similar_text(), Levenshtein distance, trigrams.
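A minimal sketch of such fuzzy comparison using PHP's built-in similar_text(); the fuzzyMatch() helper, its 85% threshold, and the inline normalization are illustrative assumptions, not part of the Bitrix API. Note that similar_text() operates on bytes, so the percentage for Cyrillic strings is only approximate:

```php
/**
 * Compare two product names after basic normalization.
 * The 85% similarity threshold is an assumption; tune it on real data.
 */
function fuzzyMatch(string $a, string $b, float $threshold = 85.0): bool
{
    $norm = function (string $s): string {
        $s = mb_strtolower($s);
        // Collapse everything that is not a letter or digit into a space
        $s = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $s);
        return trim($s);
    };

    similar_text($norm($a), $norm($b), $percent);
    return $percent >= $threshold;
}
```

Pairwise similar_text() over the whole catalog is quadratic, so narrow the candidate set first (for example by brand) and only then run fuzzy comparison.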
Normalization before comparison
Deduplication quality directly depends on normalization. Minimal transformation set:
- Convert to lower case: mb_strtolower().
- Remove special characters: parentheses, quotes, hyphens, slashes.
- Remove stop words: "art.", "article", "code", "model" (the Russian tokens арт, артикул, код, модель in the function below).
- Normalize spaces: multiple spaces → one.
- Remove unit and size indicators from the name (if they are stored in separate properties).
```php
function normalizeName(string $name): string
{
    $name = mb_strtolower(trim($name));
    // The /u modifier is required: «» are multibyte characters
    $name = preg_replace('/[()«»"\'\/\-]/u', ' ', $name);
    $name = preg_replace('/\b(арт|артикул|код|модель)\b\.?/u', '', $name);
    $name = preg_replace('/\s+/', ' ', $name);
    return trim($name);
}
```
Merge strategy
When a duplicate is found — what to do with the data? Three strategies:
| Strategy | Logic | When to use |
|---|---|---|
| Source priority | Data from highest-priority source overwrites others | Have one "reference" supplier |
| Field merging | Empty fields filled from alternative source | Different sources complement each other |
| Manual moderation | Duplicate flagged, manager decides | Critical data, few duplicates |
In practice, a combination is most common: automatic merging for non-critical fields (description, photos) and flagging for manual review when prices or key characteristics differ.
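This combined approach can be sketched as a pure merge function; the field names, the NEEDS_REVIEW flag, and the 5% price tolerance are illustrative assumptions:

```php
/**
 * Fill empty fields from the incoming source, but never silently
 * overwrite a critical field: a diverging price flags the record
 * for manual review instead.
 */
function mergeProductData(array $master, array $incoming, float $priceTolerance = 0.05): array
{
    $needsReview = false;
    foreach ($incoming as $field => $value) {
        if ($field === 'PRICE') {
            // Critical field: compare, flag on conflict, don't overwrite
            if (!empty($master['PRICE'])
                && abs($master['PRICE'] - $value) / $master['PRICE'] > $priceTolerance) {
                $needsReview = true;
            }
            continue;
        }
        // Non-critical fields: fill only the gaps
        if (empty($master[$field]) && $value !== '') {
            $master[$field] = $value;
        }
    }
    $master['NEEDS_REVIEW'] = $needsReview;
    return $master;
}
```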
Implementation in Bitrix
The info block element's XML_ID field is the primary deduplication tool: it is indexed by default, so lookups are fast. But for multi-source catalogs, a single XML_ID isn't enough.
Recommended scheme: a separate reference info block parser_external_ids with fields:

- NAME — external identifier (supplier SKU).
- PROPERTY_SOURCE — source (supplier name).
- PROPERTY_ELEMENT_ID — ID of the main catalog element.
- PROPERTY_MATCH_TYPE — match type (exact, fuzzy, manual).
On import, the parser first searches for the external ID in the reference. If found — update the linked element. If not — check fuzzy match by name. If match found — create link in reference and update element. If not — create new.
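This decision flow can be sketched as a small dispatcher. The lookups themselves (CIBlockElement::GetList against parser_external_ids, and the fuzzy search) are passed in as callables so the routing logic stays framework-independent; the function name and result keys are assumptions:

```php
/**
 * Route an incoming product to update / link_and_update / create.
 * In a real import the callables would wrap CIBlockElement::GetList()
 * queries against the parser_external_ids reference and the catalog.
 */
function resolveImportAction(
    string $externalId,
    string $name,
    callable $findByExternalId,  // external ID -> element ID or null
    callable $findFuzzy          // name -> element ID or null
): array {
    if ($elementId = $findByExternalId($externalId)) {
        // Known external ID: just update the linked element
        return ['action' => 'update', 'element_id' => $elementId];
    }
    if ($elementId = $findFuzzy($name)) {
        // Fuzzy hit: store the new external ID in the reference, then update
        return ['action' => 'link_and_update', 'element_id' => $elementId];
    }
    // No match at all: create a new catalog element
    return ['action' => 'create', 'element_id' => null];
}
```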
Batch deduplication of existing catalog
If the catalog already contains duplicates — one-time cleanup is needed. Algorithm:
1. Export all elements: ID, NAME, XML_ID, key properties.
2. Normalize the names.
3. Group by normalized name + brand.
4. In each group, select a "master record" (most complete card, highest ID, or priority source).
5. Transfer orders, links, and properties from duplicates to the master record.
6. Deactivate the duplicates (ACTIVE = 'N'); don't delete them.
Don't delete duplicates immediately. Deactivate and leave for 2–4 weeks. If algorithm error is found — elements can be restored.
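The master-record selection in step 4 can be sketched as a completeness-based sort; using the count of non-empty fields as the completeness metric and breaking ties by highest ID are assumptions drawn from the criteria above:

```php
/**
 * Pick the "master record" in a duplicate group:
 * the most complete card wins, ties go to the highest ID.
 */
function selectMaster(array $group): array
{
    usort($group, function (array $a, array $b): int {
        // Completeness = number of non-empty fields in the card
        $completeness = fn(array $row): int =>
            count(array_filter($row, fn($v) => $v !== '' && $v !== null));
        // Sort descending: more complete first, then higher ID
        return [$completeness($b), (int)$b['ID']]
           <=> [$completeness($a), (int)$a['ID']];
    });
    return $group[0];
}
```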