Implementing Website Auto-Fill from a Parser
Auto-fill is the bridge between the parser and the CMS: data collected from external sources automatically appears on the site as products, articles, announcements, or profiles. The task involves more than import: it also covers normalization, conflict handling, and content management.
System Architecture
Parser → Raw data → Processor → Normalized data → Import Queue
                                                       ↓
                                                 CMS / Site DB
                                                       ↓
                                           Status: draft / published
Importing directly from the parser into the site database is bad practice. You need an intermediate queue where data is validated before publication.
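The staging step can be sketched roughly as follows. This is a minimal in-memory illustration, not a definitive design: the names `ImportQueue`, `QueuedRecord`, `enqueue`, `drain`, and `require_title` are hypothetical, and a real deployment would presumably back the queue with a database table or a message broker.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class QueuedRecord:
    source: str
    payload: dict
    status: str = "pending"  # becomes "draft" or "review_required" after validation
    errors: list[str] = field(default_factory=list)


class ImportQueue:
    """Holds normalized records until they have been validated (illustrative only)."""

    def __init__(self, validate: Callable[[dict], list[str]]):
        self._validate = validate
        self._pending: deque[QueuedRecord] = deque()

    def enqueue(self, source: str, payload: dict) -> QueuedRecord:
        record = QueuedRecord(source, payload)
        self._pending.append(record)
        return record

    def drain(self) -> list[QueuedRecord]:
        """Validate every pending record and assign its resulting status."""
        processed = []
        while self._pending:
            record = self._pending.popleft()
            record.errors = self._validate(record.payload)
            record.status = "review_required" if record.errors else "draft"
            processed.append(record)
        return processed


def require_title(payload: dict) -> list[str]:
    # Hypothetical validator: returns a list of problems, empty when the record is fine.
    return [] if payload.get("title") else ["title is missing"]


queue = ImportQueue(require_title)
queue.enqueue("supplier_catalog", {"title": "Kettle"})
queue.enqueue("supplier_catalog", {"title": ""})
statuses = [record.status for record in queue.drain()]
```

Only records that leave the queue as drafts would be written to the site database; the rest stay out of it until someone has looked at them.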
Field Mapping
Each source has its own structure, so fields are mapped through a per-source configuration:
{
  "source": "supplier_catalog",
  "mappings": {
    "title": "$.name",
    "description": "$.full_description",
    "price": "$.price_rub",
    "category": { "field": "$.category_id", "transform": "category_map" },
    "images": "$.photos[*].url",
    "sku": "$.article"
  },
  "category_map": {
    "1": "electronics",
    "2": "clothing",
    "15": "home-garden"
  }
}
JSONPath mapping lets you adapt to a new source through configuration alone, with no code changes.
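To illustrate how such a configuration might be applied, here is a minimal sketch. It assumes a hypothetical `apply_mapping` helper and a deliberately tiny resolver, `resolve_path`, that understands only dotted keys and `[*]` wildcards. That is a small subset of JSONPath; a production system would more likely rely on a full JSONPath library. The sample record is invented for the example.

```python
from typing import Any


def resolve_path(record: Any, path: str) -> Any:
    """Resolve a tiny JSONPath subset: '$.a.b' and '$.items[*].key'."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    current, many = [record], False
    for part in path[2:].split("."):
        wildcard = part.endswith("[*]")
        key = part[:-3] if wildcard else part
        # Step into the key on every current node, skipping nodes that lack it.
        current = [node[key] for node in current if isinstance(node, dict) and key in node]
        if wildcard:
            many = True
            # Flatten list values so later keys apply to each element.
            current = [item for node in current if isinstance(node, list) for item in node]
    if many:
        return current
    return current[0] if current else None


def apply_mapping(record: dict, config: dict) -> dict:
    """Build a normalized item from a raw record using the mapping config."""
    result = {}
    for target, rule in config["mappings"].items():
        if isinstance(rule, str):
            result[target] = resolve_path(record, rule)
        else:
            raw_value = resolve_path(record, rule["field"])
            lookup = config[rule["transform"]]           # e.g. the "category_map" table
            result[target] = lookup.get(str(raw_value))  # None when the id is unmapped
    return result


config = {
    "source": "supplier_catalog",
    "mappings": {
        "title": "$.name",
        "category": {"field": "$.category_id", "transform": "category_map"},
        "images": "$.photos[*].url",
        "sku": "$.article",
    },
    "category_map": {"1": "electronics", "2": "clothing", "15": "home-garden"},
}
raw = {
    "name": "Kettle",
    "category_id": 15,
    "photos": [{"url": "https://example.com/a.jpg"}, {"url": "https://example.com/b.jpg"}],
    "article": "KT-17",
}
item = apply_mapping(raw, config)
```

An unmapped category id yields `None` here rather than raising, on the assumption that the later validation step is where such records get flagged.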
Image Processing
Images from the source are downloaded, optimized, and uploaded to your own storage:
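import asyncio
from io import BytesIO
from uuid import uuid4

import boto3
import httpx
from PIL import Image

s3 = boto3.client('s3')  # assumes credentials and endpoint come from the environment
BUCKET = 'site-media'    # placeholder bucket name

async def process_image(url: str, product_id: int) -> str:
    async with httpx.AsyncClient() as client:
        resp = await client.get(url, timeout=30)
        resp.raise_for_status()  # fail on 4xx/5xx instead of decoding an error page
    img = Image.open(BytesIO(resp.content))
    img = img.convert('RGB')
    # resize preserving aspect ratio (thumbnail only shrinks, never enlarges)
    img.thumbnail((1200, 1200), Image.LANCZOS)
    # save as WebP
    output = BytesIO()
    img.save(output, 'WEBP', quality=85)
    # upload to S3/MinIO; boto3 is blocking, so run it off the event loop
    s3_key = f'products/{product_id}/{uuid4()}.webp'
    body = output.getvalue()
    await asyncio.to_thread(s3.put_object, Bucket=BUCKET, Key=s3_key, Body=body)
    return f'https://cdn.example.com/{s3_key}'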
Quality Control
Before publication, data passes validation:
- Required fields are filled (title, price, at least one photo)
- Price is within an acceptable range (guards against source errors such as 0 or 999,999,999)
- Description is not shorter than N characters
- Images are accessible and have adequate size
Records that fail validation are marked as review_required and need manual review.
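The checks above might look like this in code. This is a sketch under stated assumptions: the function name `validate_item` and the thresholds `MIN_PRICE`, `MAX_PRICE`, and `MIN_DESCRIPTION_LENGTH` are illustrative values, not figures from a real project, and the image accessibility and size check is left out because it needs network access.

```python
# Illustrative thresholds; real limits depend on the catalog.
MIN_PRICE = 1
MAX_PRICE = 10_000_000
MIN_DESCRIPTION_LENGTH = 50


def validate_item(item: dict) -> list[str]:
    """Return a list of validation problems; an empty list means the item passed."""
    errors = []
    if not item.get("title"):
        errors.append("title is missing")
    if not item.get("images"):
        errors.append("at least one photo is required")
    price = item.get("price")
    if price is None:
        errors.append("price is missing")
    elif not MIN_PRICE <= price <= MAX_PRICE:
        errors.append(f"price {price} is outside {MIN_PRICE}..{MAX_PRICE}")
    if len(item.get("description") or "") < MIN_DESCRIPTION_LENGTH:
        errors.append("description is too short")
    return errors


good = {
    "title": "Kettle",
    "price": 2490,
    "images": ["https://example.com/a.jpg"],
    "description": "Electric kettle with a 1.7 litre capacity and auto shut-off.",
}
bad = {"title": "", "price": 999_999_999, "images": [], "description": "short"}

good_errors = validate_item(good)
bad_errors = validate_item(bad)
bad_status = "review_required" if bad_errors else "draft"
```

Returning the full list of problems, rather than stopping at the first one, gives the reviewer everything that needs fixing in a single pass.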
Publication Strategies
- Auto-publish: content from trusted sources is published immediately
- Draft: content is created as a draft and an editor publishes it
- Diff-update: when source data changes, only the changed fields are updated, not the entire item
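The diff-update strategy can be illustrated with a small sketch. The helper `diff_fields`, the `TRACKED_FIELDS` tuple, and the sample records are hypothetical; the idea is only that fields the source does not own, such as an editor's note, are never overwritten.

```python
def diff_fields(stored: dict, incoming: dict, tracked: tuple[str, ...]) -> dict:
    """Return only the tracked fields whose incoming value differs from the stored one."""
    return {
        name: incoming[name]
        for name in tracked
        if name in incoming and incoming[name] != stored.get(name)
    }


# Fields the source is allowed to update; everything else belongs to editors.
TRACKED_FIELDS = ("title", "description", "price", "images")

stored = {
    "title": "Kettle",
    "description": "Electric kettle",
    "price": 2490,
    "editor_note": "featured on the home page",
}
incoming = {"title": "Kettle", "description": "Electric kettle", "price": 2290}

changes = diff_fields(stored, incoming, TRACKED_FIELDS)
stored.update(changes)  # apply only what actually changed
```

An empty `changes` dictionary means the record can be skipped entirely, which also avoids needless writes and cache invalidation.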
Timeline
An auto-fill system for a CMS with one source and basic validation: 5–8 days. With multiple sources, a UI for managing mappings, and a moderation system: 15–20 days.







