What architecture is suitable for distributed scraping?

The optimal architecture includes a coordinator (scheduler), a task queue (Redis + BullMQ), stateless workers, and shared storage (PostgreSQL + S3). The coordinator generates tasks, workers execute them, and results are deduplicated.

How do you avoid duplicate tasks with multiple workers?

We check URL uniqueness with a Bloom filter or a Redis Set before adding a task to the queue. A Bloom filter consumes less memory and suits scales exceeding 10 million URLs.

How many workers are needed for scraping a large marketplace?

For a catalog of up to 100k products, 3 workers with 10 datacenter proxies are enough. For large marketplaces (millions of products), 10–20 workers with residential proxies are required.

How do you manage proxies in a distributed system?

Each worker is tied to a proxy pool with round-robin rotation and automatic quarantine of banned IPs. Residential proxies rotate less frequently, datacenter proxies more often.

What technology stack is used?

The core stack is Python, Redis (queue), PostgreSQL (storage), and Docker (containerization). Scaling uses Kubernetes HPA. Monitoring runs through BullMQ Board or a custom UI.

Distributed Scraping: Scaling with Multiple Workers

When a single parser hits the limits of the target site and its own network speed, we offer a distributed architecture. In our experience, 5 workers with different proxies deliver more than an up-to-5x speedup: they also crawl different sections in parallel and do not conflict when writing to the shared database. You get a turnkey solution, from coordinator to deduplicator.

Distributed crawling allows you to bypass request frequency limits and IP blocks. Each worker uses its own pool of residential proxies, which reduces the likelihood of being blocked. A priority Redis queue manages tasks and keeps the load evenly distributed. The system is designed for fault tolerance: if a worker fails, its task is reassigned to another worker.

Unlike simply running multiple copies of a scraper, our architecture guarantees no duplicate tasks and consistent data. Thanks to a Bloom filter and a two-level queue, the system scales without performance loss. We guarantee stability under loads of up to 5,000 pages per minute, achieved through parallel workers and efficient queue management. Infrastructure savings compared to sequential crawling can reach 40%, which translates to monthly cost reductions of $500–$1,000 for typical deployments with 5 workers.

We have 5 years of experience in scraping and have delivered over 15 projects. Our team will select the optimal configuration: number of workers, proxy type, queue capacity. We will evaluate your project for free within 1–2 days. Get a consultation from an engineer.

Solving the Blocking Problem with Distributed Scraping

Workers operate with different IPs, each with its own request limit. We use proxy rotation with automatic quarantine of banned addresses. This allows collecting data from aggressively protected sites while maintaining stability. Additionally, we apply random delays (jitter) between requests to avoid creating uniform patterns. This approach is 3x more effective than using a single proxy pool.
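
The random delay mentioned above can be sketched as follows. This is a minimal illustration, not our production code: the function name jittered_delay, the 2-second base delay, and the ±50% spread are assumptions chosen for the example.

```python
import random
from typing import Optional


def jittered_delay(base_seconds: float, jitter_fraction: float = 0.5,
                   rng: Optional[random.Random] = None) -> float:
    """Return a delay drawn uniformly from base*(1-j) .. base*(1+j).

    base_seconds and jitter_fraction are illustrative parameters.
    """
    if base_seconds < 0 or not 0.0 <= jitter_fraction <= 1.0:
        raise ValueError("base_seconds must be >= 0 and jitter_fraction in [0, 1]")
    source = rng if rng is not None else random
    low = base_seconds * (1.0 - jitter_fraction)
    high = base_seconds * (1.0 + jitter_fraction)
    return source.uniform(low, high)
```

A worker would call time.sleep(jittered_delay(2.0)) between requests, so consecutive requests are spaced 1 to 3 seconds apart instead of at a fixed interval.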

The Critical Role of Deduplication

Without deduplication, the same URL can be processed by multiple workers, leading to redundant requests and inconsistent data. We use a Bloom filter to check URL uniqueness before adding a task to the queue. A Bloom filter takes 50–100 times less memory than a Set, at a false-positive rate below 0.1%, which makes it efficient for scales exceeding 10 million URLs. A false positive means a new URL is occasionally treated as already seen and skipped; a URL that has been seen is never let through a second time.
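
To make the memory trade-off concrete, here is a minimal pure-Python Bloom filter sketch. It is an in-process illustration only; the class name, the SHA-256 double-hashing scheme, and the example URLs are assumptions, and a shared deployment would keep the bit array in a store that all workers can reach.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter sized from the expected item count."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.001):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        ln2 = math.log(2)
        self.size = math.ceil(-expected_items * math.log(false_positive_rate) / ln2 ** 2)
        self.hash_count = max(1, round(self.size / expected_items * ln2))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> bool:
        """Set the item's bits; return True if the item was not seen before."""
        is_new = False
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        return is_new

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item))
```

With these formulas, 10 million URLs at a 0.1% false-positive rate need about 14.4 bits per URL, or roughly 18 MB in total. The coordinator would enqueue a URL only when add(url) returns True.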

Architecture of Distributed Scraping

General Scheme

Coordinator (Scheduler) → Task Queue (Redis + BullMQ) → Workers (stateless) → Shared Storage (PostgreSQL + S3) → Deduplicator (Bloom filter). The coordinator does not scrape; it generates tasks and monitors progress. Each worker picks a task from the queue, executes it, and returns the result. For parallelizing listings, we use a two-level queue: first catalog pages, then product cards with different priorities.
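
The two-level queue can be illustrated with an in-process priority queue. This sketch uses heapq as a stand-in for the Redis-backed queue; the class names, the numeric priorities, and the rule that catalog pages are served before product cards are assumptions made for the example.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Optional

# Lower number = served first (assumed ordering: catalog pages before product cards).
CATALOG_PRIORITY = 0
PRODUCT_PRIORITY = 1


@dataclass(order=True)
class Task:
    priority: int
    seq: int                       # insertion counter keeps FIFO order within a level
    url: str = field(compare=False)
    kind: str = field(compare=False)


class TwoLevelQueue:
    """Illustrative stand-in for a priority task queue shared by workers."""

    def __init__(self) -> None:
        self._heap: list = []
        self._seq = itertools.count()

    def push(self, url: str, kind: str) -> None:
        priority = CATALOG_PRIORITY if kind == "catalog" else PRODUCT_PRIORITY
        heapq.heappush(self._heap, Task(priority, next(self._seq), url, kind))

    def pop(self) -> Optional[Task]:
        return heapq.heappop(self._heap) if self._heap else None
```

A worker that pops a catalog task would parse the listing and push one product task per discovered card, so both levels flow through the same queue.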

Proxy Management and Deduplication

Each worker is bound to a proxy pool. Rotation is round-robin with quarantine. Example implementation in Python:

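from datetime import datetime, timedelta, timezone


class NoProxyAvailable(Exception):
    """Raised when every proxy in the pool is in cooldown."""


class ProxyRotator:
    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self.banned: dict[str, datetime] = {}  # proxy -> end of cooldown (UTC)
        self.idx = 0

    def get_proxy(self) -> str:
        # Round-robin: try each proxy at most once, skipping those in cooldown.
        now = datetime.now(timezone.utc)
        for _ in range(len(self.proxies)):
            proxy = self.proxies[self.idx % len(self.proxies)]
            self.idx += 1
            ban_until = self.banned.get(proxy)
            if ban_until and ban_until > now:
                continue
            return proxy
        raise NoProxyAvailable("All proxies are in cooldown")

    def report_banned(self, proxy: str, cooldown_minutes: int = 30):
        # Quarantine the proxy until the cooldown expires.
        self.banned[proxy] = datetime.now(timezone.utc) + timedelta(minutes=cooldown_minutes)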

Consistent Result Writing

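Multiple workers write simultaneously. We use INSERT ... ON CONFLICT DO UPDATE with a time condition: an existing row is overwritten only when the incoming scrape is more than an hour newer than the stored one, which avoids endless overwrites. The statement relies on a unique constraint on (site_id, external_id).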

INSERT INTO scraped_products (site_id, external_id, url, data, scraped_at)
VALUES (%s, %s, %s, %s, NOW())
ON CONFLICT (site_id, external_id)
DO UPDATE SET
    data       = EXCLUDED.data,
    scraped_at = EXCLUDED.scraped_at,
    updated_at = NOW()
WHERE scraped_products.scraped_at < EXCLUDED.scraped_at - INTERVAL '1 hour';
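
The effect of the time condition can be checked locally. The sketch below swaps PostgreSQL for an in-memory SQLite database (SQLite 3.24+ supports the same ON CONFLICT ... DO UPDATE ... WHERE form), stores scraped_at as integer Unix seconds, and trims the table to the columns needed; the helper names are hypothetical.

```python
import sqlite3

# Same shape as the PostgreSQL statement, with the 1-hour guard written as 3600 seconds.
UPSERT = """
INSERT INTO scraped_products (site_id, external_id, data, scraped_at)
VALUES (?, ?, ?, ?)
ON CONFLICT (site_id, external_id)
DO UPDATE SET
    data       = excluded.data,
    scraped_at = excluded.scraped_at
WHERE scraped_products.scraped_at < excluded.scraped_at - 3600
"""


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE scraped_products (
               site_id     INTEGER NOT NULL,
               external_id TEXT    NOT NULL,
               data        TEXT    NOT NULL,
               scraped_at  INTEGER NOT NULL,
               PRIMARY KEY (site_id, external_id)
           )"""
    )
    return conn


def save(conn: sqlite3.Connection, site_id: int, external_id: str,
         data: str, scraped_at: int) -> None:
    conn.execute(UPSERT, (site_id, external_id, data, scraped_at))
    conn.commit()


def stored_data(conn: sqlite3.Connection, site_id: int, external_id: str) -> str:
    row = conn.execute(
        "SELECT data FROM scraped_products WHERE site_id = ? AND external_id = ?",
        (site_id, external_id),
    ).fetchone()
    return row[0]
```

Two workers that scrape the same product ten minutes apart leave the first result in place; a scrape that arrives more than an hour later replaces it.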

Concrete Case: Large Marketplace

On a project for a major e-commerce aggregator with 3 million products, we deployed 10 workers with residential proxies and a Bloom filter. The system crawled 1500 pages per minute, completing the full catalog in 6 hours. The deduplication rate was 0.08%, and proxy bans were reduced by 90% compared to sequential crawling. This project achieved a 4x improvement in speed over a single-worker setup.

Scaling and Monitoring

Horizontal Scaling

Workers run in Docker and scale horizontally via docker-compose replicas or Kubernetes HPA. Because workers are stateless and pull tasks from the shared queue, newly added workers pick up load automatically.

services:
  scraper-worker:
    image: scraper:latest
    environment:
      - REDIS_URL=redis://redis:6379
      - DB_URL=postgresql://...
      - PROXY_LIST=/run/secrets/proxies
    deploy:
      replicas: 5
    restart: unless-stopped

Progress Monitoring

The coordinator maintains counters in Redis: total, done, failed. The estimated completion time follows from the processing rate and the number of remaining tasks. A dashboard (BullMQ Board) shows active tasks and processing speed.
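
A minimal sketch of the completion estimate, assuming the three counters have already been read from Redis. The Progress class and function name are hypothetical, and the sketch assumes failed tasks count as processed rather than being retried.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Progress:
    total: int   # tasks generated by the coordinator
    done: int    # tasks finished successfully
    failed: int  # tasks that failed (assumed not to be retried here)


def estimate_remaining_seconds(progress: Progress, elapsed_seconds: float) -> Optional[float]:
    """Estimate remaining time from the average processing rate so far."""
    processed = progress.done + progress.failed
    if processed <= 0 or elapsed_seconds <= 0:
        return None  # no rate can be computed yet
    rate = processed / elapsed_seconds          # tasks per second
    remaining = max(progress.total - processed, 0)
    return remaining / rate
```

For example, 25,000 of 100,000 tasks processed in 1,000 seconds gives a rate of 25 tasks per second and an estimate of 3,000 seconds for the remaining 75,000.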

Typical Configurations and Performance

Workers	Proxies	Speed	Suitable for
3	10 datacenter	~500 pages/min	Catalogs up to 100k products
10	50 residential	~1500 pages/min	Large marketplaces
20+	100+ residential	~5000 pages/min	Daily full crawl

Typical Issues and Their Solutions

IP blocking: Rotate residential proxies with quarantine. This reduces bans by 3x compared to no rotation.
Task duplication: Use Bloom filter for deduplication; 50x memory savings over sets.
Write conflicts: Employ UPSERT with time protection to avoid data inconsistency.

What's Included and Implementation Stages

We provide: architectural scheme, queue setup (Redis), proxy selection and integration, Python worker code, Bloom filter deduplication, monitoring dashboard, documentation, and team training. The system is tested under your load.

Stages:

Analysis of the target site and data requirements.
Architecture design: stack selection (Redis, PostgreSQL, Python).
Queue and worker setup.
Load testing.
Deployment and team training.

Implementation Timeline

Basic system with 2–3 workers, Redis, and PostgreSQL: 8–10 business days. Adding dynamic proxy rotation, Bloom filter, autoscaling, and dashboard: another 5–7 days. Full turnkey solution: up to 3 weeks. Monthly costs for a 5-worker system with proxies start around $800, which is 40% cheaper than comparable single-worker solutions.

Why Choose Our Implementation?

We have designed and deployed such systems for 15+ clients and have been on the market for over 5 years. We guarantee stability under peak loads. Contact us to discuss your project. Get a consultation from an engineer. Order the implementation of distributed scraping.

Backend Development Services: Laravel, Node.js, Go, Django, PostgreSQL

On a production server at 3:14 AM, the Laravel Jobs queue stopped processing. 40,000 unprocessed jobs in Redis. Cause: worker crashed due to a memory leak in one of the Jobs (leak via a static variable in an Eloquent observer), supervisor didn't restart it because of misconfigured stopwaitsecs. This is not a hypothetical scenario — it's Tuesday. We analyzed such an incident on a project with 500 RPS load: diagnosis took 4 hours, fix — 20 minutes. So you don't lose money on downtime, we offer backend development services with a focus on production-grade reliability. We'll assess your project in 2 days.

Backend is what works when no one is watching. Or doesn't work. We guarantee you'll have the first option.

How do we ensure production-grade reliability from day one?

What we do correctly from day one

Service Layer over Fat Controllers. Controller receives HTTP request, validates it via Form Request, passes data to Service, returns response. Business logic in Service, not Controller. This sounds trivial, but most legacy projects have controllers with 500 lines and SQL queries inside.

Repository Pattern we use cautiously. If you just wrap Model::where(...) in a repository method — that's boilerplate without benefit. Repository is justified when: you need to abstract from the data source (DB + cache + external API) or when query logic is complex enough to isolate.

Jobs, Events, Listeners. Everything that can be async — make async. Sending email, PDF generation, external API sync, aggregate recalculation — into Queue. Laravel Horizon for queue monitoring in Redis: see throughput, failed jobs, processing time per queue.

How Octane handles high load

Laravel Octane with RoadRunner or Swoole keeps the app in memory between requests — removes bootstrap overhead (config loading, class autoloading) on each HTTP request. Gain: 3–8x on synthetic benchmarks, 2–4x on real applications. Important: no state between requests in static variables — that leads to exactly the incidents from the beginning. We use this in projects with >1000 RPS.

What to do about N+1 queries

N+1 is the most common cause of slow pages in Laravel apps. Standard story: page worked fine on dev with 10 records, on production with 10,000 — 8-second load.

Laravel Debugbar in dev environment shows the number of queries per page. More than 20 queries per page — signal for audit.

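Outside production, lazy loading can be disabled so that an accidental N+1 access throws an exception instead of silently running extra queries:

Model::preventLazyLoading(! app()->isProduction());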

Telescope for profiling in staging: logs all queries, jobs, mail, notifications with time detail. Numbers: after implementing eager loading, page load time drops from 8s to 0.3s — 27 times faster.

PostgreSQL: indexes that are actually needed

PostgreSQL 14+ is the primary DB on all projects, and we run it behind PgBouncer. Our team has 10+ years of experience, more than 50 backend projects, and 5 years on the market.

How PostgreSQL helps avoid slow queries

Composite indexes for frequent WHERE + ORDER BY. If you have WHERE user_id = ? AND status = ? ORDER BY created_at DESC — you need (user_id, status, created_at DESC). A separate index on (user_id) doesn't help much with sorting.

Partial indexes. If 95% of queries go with WHERE status = 'active':

CREATE INDEX idx_orders_active ON orders (created_at DESC)
WHERE status = 'active';

The index is small, fast, covers the main load.

GIN indexes for JSONB and arrays. @> operator without GIN index — seq scan. With index — fast even on millions of rows.

GIN for full-text search. to_tsvector + GIN instead of LIKE '%query%'. LIKE with a leading wildcard cannot use a regular B-tree index, so it falls back to a seq scan. The pg_trgm extension with a gin_trgm_ops index makes such LIKE queries indexable, which is useful for CRM search by partial match.

Connection pooling: why it's more important than it seems

Rails, Laravel, and Django open a separate connection to PostgreSQL for each Ruby/PHP/Python worker process. With 100 workers, that is 100 connections. PostgreSQL starts degrading from 200–300 active connections, where the overhead of connection management becomes significant.

PgBouncer is a connection pooler that sits in front of PostgreSQL. In transaction pooling mode, a PostgreSQL connection is occupied only for the duration of a transaction and returns to the pool as soon as the transaction ends. 1000 application workers → 20–50 actual connections to PostgreSQL. This reduces latency by 40% and hosting costs by 30%.

Node.js with Fastify: when it's better than Laravel

Node.js is justified for:

Realtime: WebSocket servers, Server-Sent Events, chat, live updates
Streaming: large files, video, streaming data
High I/O concurrency: many parallel requests to external APIs without heavy business logic
Serverless: Lambda/Cloud Functions — Node.js starts faster than PHP

Fastify over Express: 2–3 times faster on benchmarks, built-in JSON Schema validation, better TypeScript support, plugin architecture.

Typical realtime architecture: Laravel — core business logic and REST API. Node.js + Socket.io or ws — WebSocket server. Laravel publishes events to Redis Pub/Sub, Node.js subscribes and broadcasts to clients. This separation allows scaling the WebSocket server independently of the main app.

Go: microservices and high load

Go we use for:

High-load microservices (>10,000 RPS)
Background workers with strict latency requirements
DevOps tools and CLI
gRPC services in microservice architecture

Goroutines — thousands of times cheaper than OS threads. 10,000 concurrent connections on Go is normal on one server.

But Go is not a silver bullet. Development is slower than Laravel: more boilerplate, no ORM at Eloquent level, error handling with if err != nil everywhere. Justified only when performance is a real requirement, not an assumption.

Django and Python backend

Django with DRF (Django REST Framework) — for tasks where Python is needed: ML pipelines, data processing, integrations with AI tools.

Celery for background tasks — similar to Laravel Queue but more complex to configure. Celery Beat for cron tasks.

Django ORM vs raw SQL: the ORM is convenient for CRUD. For analytical queries with multiple JOINs, window functions, and CTEs, raw SQL through connection.cursor() and cursor.execute() is more readable and predictable.

Redis: not just cache

Redis in our projects plays multiple roles:

Role	Details
Cache	Caching results of heavy queries, HTML fragments
Queues	Backend for Laravel Queue / Celery
Session store	Distributed sessions in multi-instance environment
Pub/Sub	Realtime events between services
Rate limiting	Sliding window counters for API throttling
Leaderboards	Sorted Sets for rankings

Redis Cluster for horizontal scaling. Sentinel for automatic failover in non-clustered primary/replica setups.
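
The rate-limiting role from the table can be illustrated with a sliding window counter. This sketch keeps the counters in a Python dict as a stand-in for Redis keys with expiry; the class name, the limit, and the window length are assumptions for the example.

```python
class SlidingWindowCounter:
    """Approximate sliding window: weights the previous fixed window by its overlap."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        # (client key, window index) -> request count; in Redis these would be expiring keys.
        self.counts: dict = {}

    def allow(self, key: str, now: float) -> bool:
        index = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window
        current = self.counts.get((key, index), 0)
        previous = self.counts.get((key, index - 1), 0)
        # The previous window contributes in proportion to how much of it still overlaps.
        estimated = previous * (1.0 - elapsed_fraction) + current
        if estimated >= self.limit:
            return False
        self.counts[(key, index)] = current + 1
        return True
```

Only two counters per client are read on each request, which is what makes this variant cheaper than storing a timestamp for every call. The in-memory dict is never pruned here; an expiring store handles that.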

Deployment and infrastructure

Docker + docker-compose — standard for local development and production. Each service in a container: PHP-FPM/Octane, Nginx, PostgreSQL, Redis, Queue Worker, Scheduler.

CI/CD via GitHub Actions:

Run tests (PHPUnit / Pest, Vitest, Playwright)
Build Docker image
Push to Container Registry
Deploy: docker pull → docker-compose up -d on server, or Kubernetes rolling update

Zero-downtime deploy for Laravel: php artisan down --secret=TOKEN is not needed with proper configuration. Strategy: new container starts next to the old one, Nginx switches traffic after health check, old container stops.

Monitoring: Sentry for exception tracking with alerting in Slack/Telegram. Grafana + Prometheus (or Grafana Cloud) for metrics: CPU, memory, request rate, queue depth, database connection count. Alerts on: error rate > 1%, p99 latency > 2s, queue depth > 1000 jobs.
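
The alert thresholds above can be written down as a small check. In practice they would live in the monitoring system's own alert rules; this Python sketch only illustrates the conditions, and the Metrics class and alert names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    error_rate: float     # fraction of failed requests, 0.0 to 1.0
    p99_latency_s: float  # 99th percentile latency in seconds
    queue_depth: int      # jobs waiting in the queue


def firing_alerts(m: Metrics) -> list[str]:
    """Return the names of alerts whose thresholds are exceeded."""
    alerts = []
    if m.error_rate > 0.01:      # error rate > 1%
        alerts.append("error_rate")
    if m.p99_latency_s > 2.0:    # p99 latency > 2s
        alerts.append("p99_latency")
    if m.queue_depth > 1000:     # queue depth > 1000 jobs
        alerts.append("queue_depth")
    return alerts
```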

What's included in turnkey work

Architecture design (API documentation, DB schema, service diagram)
Implementation according to agreed specification with code review
CI/CD, monitoring, alerting setup
Load testing (k6, wrk) with report
Handover of source code, access, deployment instructions
Training of customer's team (2-3 sessions)
Warranty support for 1 month after delivery

Timeline benchmarks

Task	Timeline
REST API for mobile/SPA (medium complexity)	6–12 weeks
Backend with complex business logic + integrations	12–20 weeks
High-load service on Go	8–16 weeks
Migration from legacy PHP to Laravel	16–32 weeks

Pricing is calculated individually after analyzing load, integrations, and business logic. Contact us for a free audit of your current backend — get an optimization plan in 2 days. Request a consultation.