Implementing Distributed Scraping (Multiple Workers) for Scaling
A single worker is limited by network speed, target-site restrictions, and proxy capacity. A distributed system scales horizontally: five workers provide not only roughly 5× the throughput, they also use different proxies and IPs, scrape different sections in parallel, and do not interfere with each other when writing results.
Overall Architecture
             Coordinator (Scheduler)
                        │
                        ▼
 Task Queue (Redis + BullMQ / Celery / Sidekiq)
                ┌───────┴───────┐
                │               │
            Worker-1        Worker-2  ...  Worker-N
            (proxy-1)       (proxy-2)      (proxy-N)
                │               │
                └───────┬───────┘
                        ▼
        Shared Storage (PostgreSQL + S3)
                        │
                        ▼
              Deduplicator / Merger
The coordinator does not scrape anything itself; it only generates tasks and monitors progress. Workers are stateless: any worker can take any task.
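The stateless-worker idea can be sketched in a few lines. This is only an illustration: an in-process asyncio.Queue stands in for the Redis-backed queue, and the names fake_scrape, worker, and run_workers are hypothetical, not part of any library.

```python
import asyncio


async def fake_scrape(url: str) -> str:
    """Placeholder for the real fetch-and-parse step (assumption)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"parsed:{url}"


async def worker(name: str, queue: asyncio.Queue, results: list) -> None:
    """Pull tasks until cancelled; nothing is remembered between tasks."""
    while True:
        url = await queue.get()
        try:
            results.append((name, await fake_scrape(url)))
        finally:
            queue.task_done()


async def run_workers(urls: list[str], n_workers: int = 3) -> list:
    """Coordinator role: enqueue tasks, start workers, wait for completion."""
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: list = []
    workers = [
        asyncio.create_task(worker(f"worker-{i}", queue, results))
        for i in range(1, n_workers + 1)
    ]
    await queue.join()  # returns once every queued task is marked done
    for w in workers:
        w.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Because no worker keeps state between tasks, changing n_workers changes only throughput, not the set of results.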
Task Structure
{
  "task_id": "uuid-v4",
  "job_type": "scrape_product_list",
  "url": "https://target.com/catalog?page=42",
  "site_id": 7,
  "depth": 2,
  "retry_count": 0,
  "max_retries": 3,
  "proxy_pool": "residential",
  "created_at": "2025-03-01T12:00:00Z"
}
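One possible way to model this payload in Python is a dataclass that round-trips through JSON. The field names come from the structure above; the default values, the helper _utc_now_iso, and the can_retry method are assumptions made for this sketch.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now_iso() -> str:
    """Current UTC time in the same format as the sample payload."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@dataclass
class ScrapeTask:
    url: str
    job_type: str
    site_id: int
    depth: int = 0                    # assumed default
    retry_count: int = 0
    max_retries: int = 3
    proxy_pool: str = "residential"   # assumed default
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: str = field(default_factory=_utc_now_iso)

    def to_json(self) -> str:
        """Serialize for storage in the queue."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "ScrapeTask":
        """Rebuild a task from the JSON a worker pulled off the queue."""
        return cls(**json.loads(payload))

    def can_retry(self) -> bool:
        """True while the task has retry attempts left."""
        return self.retry_count < self.max_retries
```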
Coordinator: Task Generation
For a paginated site, the coordinator first determines the page count, then issues one task per page:
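import logging

logger = logging.getLogger(__name__)

# Queue and ScrapeTask are the project's own queue wrapper and task model.


class CatalogScheduler:
    def __init__(self, queue: Queue, site_config: dict):
        self.queue = queue
        self.config = site_config

    async def schedule_full_crawl(self):
        # detect_total_pages() is not shown here; it returns the number of catalog pages.
        total_pages = await self.detect_total_pages(self.config['catalog_url'])
        # One listing task per catalog page, numbered from 1.
        tasks = [
            ScrapeTask(
                url=self.config['catalog_url'] + f"?page={i}",
                job_type="scrape_listing",
                site_id=self.config['id'],
            )
            for i in range(1, total_pages + 1)
        ]
        await self.queue.bulk_add(tasks, priority=5)
        logger.info(f"Scheduled {len(tasks)} tasks for site {self.config['id']}")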
Listing-page tasks spawn secondary tasks for individual product cards. The result is a two-level queue in which the two task types have different priorities.
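A minimal in-memory sketch of such a two-level queue follows. It assumes that a lower number is served first and that product tasks outrank listing tasks; the source does not state which way the priorities run, so both are assumptions. TwoLevelQueue, process_all, and extract_product_urls are hypothetical names.

```python
import heapq
import itertools

LISTING_PRIORITY = 5   # same value the scheduler passes for listing tasks
PRODUCT_PRIORITY = 1   # assumption: lower number is served first


class TwoLevelQueue:
    """Priority queue that keeps FIFO order within one priority level."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str, str]] = []
        self._seq = itertools.count()  # tie-breaker for equal priorities

    def add(self, job_type: str, url: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), job_type, url))

    def pop(self) -> tuple[str, str]:
        _, _, job_type, url = heapq.heappop(self._heap)
        return job_type, url

    def __len__(self) -> int:
        return len(self._heap)


def process_all(queue: TwoLevelQueue, extract_product_urls) -> list[tuple[str, str]]:
    """Drain the queue; every listing task enqueues its product tasks."""
    order = []
    while len(queue):
        job_type, url = queue.pop()
        order.append((job_type, url))
        if job_type == "scrape_listing":
            for product_url in extract_product_urls(url):
                queue.add("scrape_product", product_url, PRODUCT_PRIORITY)
    return order
```

With these priorities, the products found on one listing page are processed before the next listing page is opened, which keeps the backlog of product tasks small. Reversing the two constants would instead discover all listings first.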
Proxy Management
Each worker is tied to a proxy pool. Rotation is round-robin, with a "quarantine" (cooldown) for banned IPs:
from datetime import datetime, timedelta, timezone


class NoProxyAvailable(Exception):
    """Raised when every proxy in the pool is in cooldown."""


class ProxyRotator:
    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self.banned: dict[str, datetime] = {}  # proxy -> end of cooldown (UTC)
        self.idx = 0

    def get_proxy(self) -> str:
        # Try each proxy at most once per call, skipping quarantined ones.
        for _ in range(len(self.proxies)):
            proxy = self.proxies[self.idx % len(self.proxies)]
            self.idx += 1
            ban_until = self.banned.get(proxy)
            if ban_until and ban_until > datetime.now(timezone.utc):
                continue
            return proxy
        raise NoProxyAvailable("All proxies are in cooldown")

    def report_banned(self, proxy: str, cooldown_minutes: int = 30):
        self.banned[proxy] = datetime.now(timezone.utc) + timedelta(minutes=cooldown_minutes)
Residential proxies are rotated less often; datacenter proxies are rotated more often because they are detected faster.
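One way to express "rotate less often" versus "rotate more often" is to reuse each proxy for a fixed number of requests that depends on the pool type. The sketch below is an assumption-heavy illustration: StickyRotator is a hypothetical class, and the per-pool limits are made-up values that would need tuning per target site.

```python
from itertools import cycle

# Assumed values for illustration only; tune them per target site.
REQUESTS_PER_PROXY = {"residential": 50, "datacenter": 5}


class StickyRotator:
    """Reuse each proxy for a fixed number of requests before moving on."""

    def __init__(self, proxies: list[str], pool_type: str):
        if not proxies:
            raise ValueError("proxy list is empty")
        self._cycle = cycle(proxies)
        self._limit = REQUESTS_PER_PROXY[pool_type]
        self._current = next(self._cycle)
        self._used = 0

    def get_proxy(self) -> str:
        # Switch to the next proxy once the current one reached its limit.
        if self._used >= self._limit:
            self._current = next(self._cycle)
            self._used = 0
        self._used += 1
        return self._current
```

This sketch leaves out the cooldown logic; in practice it would be combined with the quarantine handling of ProxyRotator.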
Preventing Duplicate Work
Before adding a URL to the queue, the coordinator checks a Bloom filter or a Redis Set:
async def should_schedule(self, url: str) -> bool:
    normalized = normalize_url(url)  # remove UTM, trailing slash, etc.
    key = f"visited:{self.site_id}"
    return not await self.redis.sismember(key, normalized)

async def mark_visited(self, url: str):
    normalized = normalize_url(url)
    key = f"visited:{self.site_id}"
    await self.redis.sadd(key, normalized)
    await self.redis.expire(key, 86400 * 7)  # 7 days, refreshed on every call
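The normalize_url helper used above is not shown in the text. A minimal sketch using only the standard library could look like the following; the list of tracking parameters and the decision to sort query parameters are assumptions, and some sites may need different rules.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)          # assumption: prefixes to drop
TRACKING_PARAMS = {"gclid", "fbclid"}  # assumption: exact names to drop


def normalize_url(url: str) -> str:
    """Return a canonical form so equivalent URLs compare equal."""
    parts = urlsplit(url.strip())
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES)
        and k.lower() not in TRACKING_PARAMS
    ]
    query.sort()  # assumes parameter order does not change the page
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, drop the fragment entirely.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )
```

For example, https://Target.com/catalog/?utm_source=x&page=42#top and https://target.com/catalog?page=42 both normalize to the same string, so only one of them is scheduled.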
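A Bloom filter is preferable at scales above 10M URLs: it needs roughly 50–100× less memory than a Redis Set, at an acceptable false-positive rate below 0.1%. A false positive means that a URL is wrongly treated as already visited and skipped.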
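To make the memory claim concrete, here is a small in-process Bloom filter sized with the standard formulas m = -n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions. For n = 10M URLs and p = 0.1% this gives about 144 million bits, roughly 18 MB, with k = 10. The class below is a sketch, not a production component: it lives in one process, so a filter shared between several coordinators would need a different backing store.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, capacity: int, error_rate: float = 0.001):
        # Standard sizing: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hashes.
        self.size = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # never zero
        for i in range(self.hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        # False is definite; True may be a false positive.
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item)
        )
```

Usage mirrors the Set-based check: test with `url in bloom` before scheduling and call `bloom.add(url)` afterwards.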
Consistent Result Writing
Multiple workers write to the same database simultaneously. Conflicts are handled with INSERT ... ON CONFLICT DO UPDATE:
INSERT INTO scraped_products (site_id, external_id, url, data, scraped_at)
VALUES (%s, %s, %s, %s, NOW())
ON CONFLICT (site_id, external_id)
DO UPDATE SET
    data = EXCLUDED.data,
    scraped_at = EXCLUDED.scraped_at,
    updated_at = NOW()
WHERE scraped_products.scraped_at < EXCLUDED.scraped_at - INTERVAL '1 hour';
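The condition WHERE scraped_products.scraped_at < EXCLUDED.scraped_at - INTERVAL '1 hour' applies the update only when the stored row is more than an hour older than the incoming one, so two workers that hit the same product at about the same time do not keep overwriting each other.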
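The effect of the one-hour guard can be tried locally without a PostgreSQL server. The sketch below uses the standard-library sqlite3 module as a stand-in, since SQLite (3.24 and later) accepts the same ON CONFLICT ... DO UPDATE ... WHERE shape. Timestamps are plain Unix seconds, the table is reduced to the columns that matter, and make_db, save, and current_data are hypothetical helper names.

```python
import sqlite3

UPSERT = """
INSERT INTO scraped_products (site_id, external_id, data, scraped_at)
VALUES (?, ?, ?, ?)
ON CONFLICT (site_id, external_id)
DO UPDATE SET
    data = excluded.data,
    scraped_at = excluded.scraped_at
WHERE scraped_products.scraped_at < excluded.scraped_at - 3600
"""


def make_db() -> sqlite3.Connection:
    """In-memory table with the same unique key as the article's query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE scraped_products (
               site_id     INTEGER NOT NULL,
               external_id TEXT    NOT NULL,
               data        TEXT,
               scraped_at  INTEGER NOT NULL,  -- Unix seconds
               PRIMARY KEY (site_id, external_id)
           )"""
    )
    return conn


def save(conn, site_id: int, external_id: str, data: str, scraped_at: int) -> None:
    with conn:  # commit on success, roll back on error
        conn.execute(UPSERT, (site_id, external_id, data, scraped_at))


def current_data(conn, site_id: int, external_id: str) -> str:
    row = conn.execute(
        "SELECT data FROM scraped_products WHERE site_id = ? AND external_id = ?",
        (site_id, external_id),
    ).fetchone()
    return row[0]
```

A second write 60 seconds after the first leaves the row unchanged, while a write more than an hour later replaces it.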
Scaling Workers
Workers run in Docker containers. They scale horizontally with docker compose up --scale (the older docker-compose scale command is deprecated) or with a Kubernetes HorizontalPodAutoscaler (HPA):
# docker-compose.yml
services:
  scraper-worker:
    image: scraper:latest
    environment:
      - REDIS_URL=redis://redis:6379
      - DB_URL=postgresql://...
      - PROXY_LIST=/run/secrets/proxies
    deploy:
      replicas: 5
    restart: unless-stopped
New workers start pulling tasks from the queue as soon as they come up, so load is redistributed without any manual configuration.
Progress Monitoring
The coordinator maintains counters in Redis:
scrape_job:{job_id}:total → 2400 (total tasks)
scrape_job:{job_id}:done → 1837 (completed)
scrape_job:{job_id}:failed → 12 (errors)
scrape_job:{job_id}:started_at → timestamp
Estimated time remaining (in seconds) = (total - done) / (done / elapsed_seconds), where elapsed_seconds is the time since started_at.
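As a worked example of this formula, the function below computes the remaining time from the three counters. The function name and the choice to return None before any task has finished are assumptions of this sketch, and the elapsed time in the usage note is a made-up figure.

```python
from typing import Optional


def estimate_remaining_seconds(
    total: int, done: int, elapsed_seconds: float
) -> Optional[float]:
    """Remaining time = remaining tasks / observed tasks per second."""
    if done <= 0 or elapsed_seconds <= 0:
        return None  # no throughput observed yet, so no estimate
    rate = done / elapsed_seconds        # tasks per second so far
    return max(total - done, 0) / rate   # seconds until the queue drains
```

With total = 2400 and done = 1837 from the counters above, and a hypothetical 1837 seconds elapsed, the rate is 1 task per second and the estimate is 563 seconds. Failed tasks are not counted here; whether they should count as finished is a design choice.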
A dashboard (Bull Board for BullMQ, or a custom UI) shows active tasks, processing speed, the distribution of tasks across workers, and error statistics.
Typical Configurations and Performance
| Workers | Proxies | Speed | Suitable for |
|---|---|---|---|
| 3 | 10 datacenter | ~500 pages/min | Catalogs up to 100k products |
| 10 | 50 residential | ~1500 pages/min | Large marketplaces |
| 20+ | 100+ residential | ~5000 pages/min | Daily full crawls |
Implementation Timeline
A basic distributed system with 2–3 workers, a Redis queue, and PostgreSQL storage takes 8–10 business days. Dynamic proxy rotation, Bloom filter deduplication, Docker/K8s autoscaling, and a monitoring dashboard add another 5–7 days.







