Setting up product deduplication when auto-populating a 1C-Bitrix catalog
Filling a catalog from multiple sources inevitably creates duplicates. The same product — "Bosch GSR 18V-50" — comes from three suppliers with different names, SKUs, and descriptions. Without deduplication, the catalog grows, filters show duplicates, and managers spend hours on manual cleanup. Let's look at deduplication mechanisms at the Bitrix level.
Deduplication levels
1. Exact match by key. The most reliable method. If a product has a unique external identifier (EAN, GTIN, manufacturer SKU), deduplication is trivial: check whether an element with that XML_ID or PROPERTY_ARTICLE value already exists.
```php
$existing = CIBlockElement::GetList(
    [],
    ['IBLOCK_ID' => $iblockId, 'XML_ID' => $externalId],
    false,
    ['nTopCount' => 1],
    ['ID']
)->Fetch();

if ($existing) {
    // Update the existing element
    (new CIBlockElement())->Update($existing['ID'], $arFields);
} else {
    // Create a new one
    (new CIBlockElement())->Add($arFields);
}
```
Problem: not all sources provide stable unique identifiers. Supplier SKU ≠ manufacturer SKU. One product may have 3–5 different SKUs from different suppliers.
2. Match by field combination. If there's no unique key — search by combination: name + brand + key characteristic (volume, weight, size).
```php
$filter = [
    'IBLOCK_ID' => $iblockId,
    '%NAME' => $normalizedName,
    'PROPERTY_BRAND' => $brand,
];
```
Before comparing names, normalize them: convert to lower case, remove extra spaces, replace typographic characters.
3. Fuzzy matching. When names differ between suppliers: "Bosch GSR 18V-50 Professional" vs "Шуруповёрт Bosch GSR18V50" ("шуруповёрт" is Russian for "cordless screwdriver"). Use fuzzy comparison algorithms: similar_text(), Levenshtein distance, trigrams.
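A minimal sketch of such fuzzy comparison using PHP's built-in similar_text(); the fuzzyMatch() helper, its 85% threshold, and the inline normalization are illustrative assumptions, not part of the Bitrix API. Note that similar_text() operates on bytes, so the percentage for Cyrillic strings is only approximate:

```php
/**
 * Compare two product names after basic normalization.
 * The 85% similarity threshold is an assumption; tune it on real data.
 */
function fuzzyMatch(string $a, string $b, float $threshold = 85.0): bool
{
    $norm = function (string $s): string {
        $s = mb_strtolower($s);
        // Collapse everything that is not a letter or digit into a space
        $s = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $s);
        return trim($s);
    };

    similar_text($norm($a), $norm($b), $percent);
    return $percent >= $threshold;
}
```

Pairwise similar_text() over the whole catalog is quadratic, so narrow the candidate set first (for example by brand) and only then run fuzzy comparison.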
Normalization before comparison
Deduplication quality directly depends on normalization. Minimal transformation set:
- Convert to lower case: mb_strtolower().
- Remove special characters: parentheses, quotes, hyphens, slashes.
- Remove stop words: "art.", "article", "code", "model" (the Russian tokens арт, артикул, код, модель in the function below).
- Normalize spaces: multiple spaces → one.
- Remove unit and size indicators from the name (if they are stored in separate properties).
```php
function normalizeName(string $name): string
{
    $name = mb_strtolower(trim($name));
    // The /u modifier is required: «» are multibyte characters
    $name = preg_replace('/[()«»"\'\/\-]/u', ' ', $name);
    $name = preg_replace('/\b(арт|артикул|код|модель)\b\.?/u', '', $name);
    $name = preg_replace('/\s+/', ' ', $name);
    return trim($name);
}
```
Merge strategy
When a duplicate is found — what to do with the data? Three strategies:
| Strategy | Logic | When to use |
|---|---|---|
| Source priority | Data from highest-priority source overwrites others | Have one "reference" supplier |
| Field merging | Empty fields filled from alternative source | Different sources complement each other |
| Manual moderation | Duplicate flagged, manager decides | Critical data, few duplicates |
In practice, a combination is most common: automatic merging for non-critical fields (description, photos) and flagging for manual review when prices or key characteristics differ.
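This combined approach can be sketched as a pure merge function; the field names, the NEEDS_REVIEW flag, and the 5% price tolerance are illustrative assumptions:

```php
/**
 * Fill empty fields from the incoming source, but never silently
 * overwrite a critical field: a diverging price flags the record
 * for manual review instead.
 */
function mergeProductData(array $master, array $incoming, float $priceTolerance = 0.05): array
{
    $needsReview = false;
    foreach ($incoming as $field => $value) {
        if ($field === 'PRICE') {
            // Critical field: compare, flag on conflict, don't overwrite
            if (!empty($master['PRICE'])
                && abs($master['PRICE'] - $value) / $master['PRICE'] > $priceTolerance) {
                $needsReview = true;
            }
            continue;
        }
        // Non-critical fields: fill only the gaps
        if (empty($master[$field]) && $value !== '') {
            $master[$field] = $value;
        }
    }
    $master['NEEDS_REVIEW'] = $needsReview;
    return $master;
}
```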
Implementation in Bitrix
The info block element's XML_ID field is the primary deduplication tool: it is indexed by default, so lookups are fast. But for multi-source catalogs, a single XML_ID isn't enough.
Recommended scheme: a separate reference info block parser_external_ids with fields:

- NAME — external identifier (supplier SKU).
- PROPERTY_SOURCE — source (supplier name).
- PROPERTY_ELEMENT_ID — ID of the main catalog element.
- PROPERTY_MATCH_TYPE — match type (exact, fuzzy, manual).
On import, the parser first searches for the external ID in the reference. If found — update the linked element. If not — check fuzzy match by name. If match found — create link in reference and update element. If not — create new.
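This decision flow can be sketched as a small dispatcher. The lookups themselves (CIBlockElement::GetList against parser_external_ids, and the fuzzy search) are passed in as callables so the routing logic stays framework-independent; the function name and result keys are assumptions:

```php
/**
 * Route an incoming product to update / link_and_update / create.
 * In a real import the callables would wrap CIBlockElement::GetList()
 * queries against the parser_external_ids reference and the catalog.
 */
function resolveImportAction(
    string $externalId,
    string $name,
    callable $findByExternalId,  // external ID -> element ID or null
    callable $findFuzzy          // name -> element ID or null
): array {
    if ($elementId = $findByExternalId($externalId)) {
        // Known external ID: just update the linked element
        return ['action' => 'update', 'element_id' => $elementId];
    }
    if ($elementId = $findFuzzy($name)) {
        // Fuzzy hit: store the new external ID in the reference, then update
        return ['action' => 'link_and_update', 'element_id' => $elementId];
    }
    // No match at all: create a new catalog element
    return ['action' => 'create', 'element_id' => null];
}
```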
Batch deduplication of existing catalog
If the catalog already contains duplicates — one-time cleanup is needed. Algorithm:
1. Export all elements: ID, NAME, XML_ID, key properties.
2. Normalize the names.
3. Group by normalized name + brand.
4. In each group, select a "master record" (most complete card, highest ID, or priority source).
5. Transfer orders, links, and properties from duplicates to the master record.
6. Deactivate the duplicates (ACTIVE = 'N'); don't delete them.
Don't delete duplicates immediately. Deactivate and leave for 2–4 weeks. If algorithm error is found — elements can be restored.
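The master-record selection in step 4 can be sketched as a completeness-based sort; using the count of non-empty fields as the completeness metric and breaking ties by highest ID are assumptions drawn from the criteria above:

```php
/**
 * Pick the "master record" in a duplicate group:
 * the most complete card wins, ties go to the highest ID.
 */
function selectMaster(array $group): array
{
    usort($group, function (array $a, array $b): int {
        // Completeness = number of non-empty fields in the card
        $completeness = fn(array $row): int =>
            count(array_filter($row, fn($v) => $v !== '' && $v !== null));
        // Sort descending: more complete first, then higher ID
        return [$completeness($b), (int)$b['ID']]
           <=> [$completeness($a), (int)$a['ID']];
    });
    return $group[0];
}
```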