How is a custom crawler better than off-the-shelf tools?

Off-the-shelf tools (Screaming Frog, Sitebulb) are limited in crawl volume, crawl frequency, and flexibility. A custom Python crawler has no such limits, supports any data source (APIs, dynamic pages), and integrates easily with your infrastructure.

What data does the crawler collect?

The crawler collects HTTP status codes, metadata (title, description, canonical, hreflang), the internal link graph, and page text. It also flags duplicate content, slow pages, and structural errors.

How many pages can the crawler process?

The asynchronous architecture lets the crawler cover up to 100,000 pages in a single run. If needed, the load can be distributed across several workers, or the crawl can run incrementally.

How does the crawler handle robots.txt and rate limits?

By default the crawler obeys robots.txt: it parses the Disallow and Crawl-delay directives. You can also configure a politeness policy (delays between requests) and exclude URLs with query parameters.
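As a rough illustration of the robots.txt handling described here, the sketch below uses Python's standard urllib.robotparser. The user agent name, the sample rules, and the helper names load_robots and politeness_delay are assumptions made for the example, not part of the delivered crawler.

```python
from urllib.robotparser import RobotFileParser


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def politeness_delay(parser: RobotFileParser, user_agent: str,
                     default: float = 1.0) -> float:
    """Use Crawl-delay when the site declares one, else a default delay."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


# Sample rules; a real crawler would fetch /robots.txt from the target site.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

robots = load_robots(SAMPLE)
allowed = robots.can_fetch('example-crawler', 'https://example.com/blog/post')
blocked = robots.can_fetch('example-crawler', 'https://example.com/admin/users')
delay = politeness_delay(robots, 'example-crawler')
```

The crawler would check can_fetch before queuing a URL and sleep for the returned delay between requests to the same host.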

In what form are the results delivered?

Results are indexed into the database of your choice (PostgreSQL, Elasticsearch, Meilisearch) and are available through an API. You also receive a CSV report with all collected data and a visualization of the link graph.

Website Crawler for Internal Content Indexing


We build custom crawlers that index your website content automatically. Our crawlers traverse all pages, extract metadata, and store results in PostgreSQL, Elasticsearch, or Meilisearch to power site search, content audits, and SEO analysis. Delivery takes three to five working days. We have built crawlers for content portals, e-commerce sites, and knowledge bases. Our clients use the indexed data for real-time search, automated sitemap generation, and regular content quality reports.

Every content-heavy website needs a way to know what it contains. A custom crawler gives you a complete, structured map of your site: all URLs, their titles, descriptions, headings, and text content. This powers faster site search, automated duplicate detection, broken link reports, and hreflang validation.

What's Included in Our Crawler Development Service

We deliver the crawler as a turnkey system. The scope covers:

Async crawler using Python with asyncio and httpx for high-speed traversal
HTML parsing with BeautifulSoup to extract title, description, canonical, H1, and body text
Link graph construction for internal link analysis
Configurable depth limit, domain restriction, and URL exclusion patterns
Storage integration: PostgreSQL with tsvector, Elasticsearch, or Meilisearch
Incremental re-crawl mode that processes only changed pages
CLI interface for manual runs and scheduling via cron
Detailed run report: pages found, errors, redirect chains, and indexing statistics
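The depth limit, domain restriction, and URL exclusion patterns from the list above could be combined roughly as follows. This is a minimal sketch: the function names, the set of tracking parameters, and the idea of stripping them during normalization are illustrative assumptions.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of tracking parameters that should not create distinct URLs.
TRACKING_PARAMS = {'utm_source', 'utm_medium', 'utm_campaign',
                   'utm_term', 'utm_content'}


def normalize_url(url: str) -> str:
    """Drop the fragment and tracking parameters so equivalent URLs compare equal."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc.lower(),
                       parts.path or '/', urlencode(query), ''))


def in_scope(url: str, depth: int, *, domain: str, max_depth: int,
             exclude: list[re.Pattern]) -> bool:
    """Return True when the URL passes the domain, depth, and exclusion checks."""
    parts = urlsplit(url)
    if parts.scheme not in ('http', 'https'):
        return False
    if parts.netloc.lower() != domain:
        return False
    if depth > max_depth:
        return False
    return not any(pattern.search(url) for pattern in exclude)
```

A crawler would call normalize_url on every extracted link and queue it only when in_scope returns True.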

Why Build a Custom Crawler Instead of Using Off-the-Shelf Tools?

General-purpose crawlers like Screaming Frog export CSV files. They do not integrate with your application database. They cannot push new content into your search index in real time. They do not know which content types matter for your specific use case.

A custom crawler reads your site and writes structured data exactly where your application needs it. It runs on your infrastructure, respects your authentication and rate limits, and produces the exact data schema your team designed. Build once, run on schedule, no seat licenses.

Technical Architecture

The crawler is built as an async Python application. It manages a frontier queue of pending URLs, tracks visited URLs to avoid duplicates, and processes pages concurrently with a configurable concurrency limit.

import asyncio
import httpx
from bs4 import BeautifulSoup
from urllib.parse import urldefrag, urljoin, urlparse


def _text(tag) -> str:
    # soup.find() returns None when the element is missing.
    return tag.get_text(strip=True) if tag else ''


def _attr(tag, name: str) -> str:
    return tag.get(name, '') if tag else ''


class SiteCrawler:
    def __init__(self, base_url: str, concurrency: int = 10):
        self.base_url = base_url
        self.domain = urlparse(base_url).netloc
        # Caps the number of requests in flight at any moment.
        self.semaphore = asyncio.Semaphore(concurrency)
        self.visited: set[str] = set()
        self.queue: asyncio.Queue = asyncio.Queue()
        self.results: list[dict] = []

    async def fetch_page(self, client: httpx.AsyncClient, url: str) -> dict:
        async with self.semaphore:
            try:
                response = await client.get(url, timeout=15, follow_redirects=True)
                if response.status_code != 200:
                    return {'url': url, 'status': response.status_code, 'error': 'non-200'}
                return self.parse_html(url, response.text, response.status_code)
            except (httpx.HTTPError, httpx.InvalidURL) as e:
                # Network failures and timeouts are recorded, not raised.
                return {'url': url, 'status': None, 'error': str(e)}

    def parse_html(self, url: str, html: str, status: int) -> dict:
        soup = BeautifulSoup(html, 'html.parser')
        # Resolve relative hrefs, drop #fragments, keep same-domain links only.
        links = set()
        for a in soup.find_all('a', href=True):
            absolute = urldefrag(urljoin(url, a['href'])).url
            if urlparse(absolute).netloc == self.domain:
                links.add(absolute)
        return {
            'url': url,
            'status': status,
            'title': _text(soup.find('title')),
            'description': _attr(soup.find('meta', {'name': 'description'}), 'content'),
            'h1': _text(soup.find('h1')),
            'canonical': _attr(soup.find('link', {'rel': 'canonical'}), 'href'),
            # Truncated to keep index rows small.
            'body_text': soup.get_text(separator=' ', strip=True)[:5000],
            'links': sorted(links),
        }
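
The class above shows fetching and parsing but not the loop that drives the frontier queue. One possible shape for that loop is sketched below. The crawl function, its fetch parameter, and max_pages are hypothetical names for this example. In practice fetch could wrap SiteCrawler.fetch_page with a shared httpx.AsyncClient; here it is injected so the loop can run against an in-memory site.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def crawl(start_url: str,
                fetch: Callable[[str], Awaitable[dict]],
                concurrency: int = 10,
                max_pages: int = 1000) -> list[dict]:
    """Traverse links with a frontier queue and a fixed pool of workers."""
    queue: asyncio.Queue = asyncio.Queue()
    visited: set[str] = {start_url}   # URLs already queued or processed
    results: list[dict] = []
    queue.put_nowait(start_url)

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                page = await fetch(url)
                results.append(page)
                for link in page.get('links', []):
                    # No await between check and add, so no URL is queued twice.
                    if link not in visited and len(visited) < max_pages:
                        visited.add(link)
                        queue.put_nowait(link)
            except Exception as exc:
                # Keep the worker alive; record the failure for the report.
                results.append({'url': url, 'status': None, 'error': str(exc)})
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    await queue.join()                # returns once every queued URL is done
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

The number of workers plays the role of the concurrency limit, and queue.join() provides a clean stopping condition once no pending URLs remain.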

Saving to Search Index

Results go into the search backend your team prefers.

For PostgreSQL full-text search:

CREATE TABLE page_index (
    url         TEXT PRIMARY KEY,
    title       TEXT,
    description TEXT,
    h1          TEXT,
    body_text   TEXT,
    tsv         tsvector GENERATED ALWAYS AS (
        to_tsvector('english',
            coalesce(title, '') || ' ' ||
            coalesce(description, '') || ' ' ||
            coalesce(body_text, ''))
    ) STORED,
    crawled_at  TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX ON page_index USING gin(tsv);
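
Writing crawl results into this table could look like the sketch below. It assumes a DB-API cursor that uses %s placeholders (as psycopg does); the names UPSERT_SQL, page_rows, and upsert_pages are illustrative. The tsv column is omitted from the INSERT because PostgreSQL computes generated columns itself.

```python
UPSERT_SQL = """
INSERT INTO page_index (url, title, description, h1, body_text)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT (url) DO UPDATE SET
    title = EXCLUDED.title,
    description = EXCLUDED.description,
    h1 = EXCLUDED.h1,
    body_text = EXCLUDED.body_text,
    crawled_at = now()
"""


def page_rows(results: list[dict]) -> list[tuple]:
    """Keep successfully parsed pages and map them to the column order above."""
    return [
        (p['url'], p.get('title', ''), p.get('description', ''),
         p.get('h1', ''), p.get('body_text', ''))
        for p in results
        if 'error' not in p
    ]


def upsert_pages(cursor, results: list[dict]) -> int:
    """Insert new pages and refresh existing ones; returns the row count sent."""
    rows = page_rows(results)
    if rows:
        cursor.executemany(UPSERT_SQL, rows)
    return len(rows)
```

ON CONFLICT makes repeated runs idempotent: a re-crawled URL updates its row instead of failing on the primary key.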

For Meilisearch, results are pushed via the Python client after each crawl run. Typo tolerance works out of the box; faceted filtering requires declaring the relevant fields as filterable attributes in the index settings.
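
Meilisearch document ids may contain only letters, digits, hyphens, and underscores, so a URL cannot serve as the primary key directly. The sketch below prepares documents with a hashed id. The function name to_documents and the field layout are assumptions for illustration; the resulting list is the kind of payload that could be passed to the client's add_documents call.

```python
import hashlib


def to_documents(results: list[dict]) -> list[dict]:
    """Convert crawl results into Meilisearch-ready documents."""
    documents = []
    for page in results:
        if 'error' in page:
            continue  # failed fetches are reported, not indexed
        documents.append({
            # SHA-1 hex digest: stable per URL and valid as a document id.
            'id': hashlib.sha1(page['url'].encode('utf-8')).hexdigest(),
            'url': page['url'],
            'title': page.get('title', ''),
            'description': page.get('description', ''),
            'h1': page.get('h1', ''),
            'body_text': page.get('body_text', ''),
        })
    return documents
```

Because the id is derived from the URL, re-pushing a page replaces the existing document instead of creating a duplicate.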

How Does an Incremental Crawl Work?

Full site crawls can take minutes on large sites. Incremental mode fetches only pages where the content has changed since the last crawl. We detect changes by comparing the ETag or Last-Modified response header, or by checksumming the extracted text. Changed pages are re-indexed; unchanged pages are skipped.

This makes the crawler suitable for running on a schedule every hour without excessive server load.

What Reports Does the Crawler Generate?

After each run the crawler produces a structured report:

Total pages crawled and crawl duration
HTTP error breakdown: 404 pages, 500 errors, redirect chains longer than two hops
Pages missing title, description, or H1
Duplicate title and description groups
Internal link graph statistics: pages with no inbound links, pages with too many outbound links
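
Several of these report sections can be derived directly from the list of page dictionaries the crawler returns. The sketch below is one way to do it under that assumption. build_report and its output keys are illustrative names, and redirect chains are left out because the result dictionaries shown earlier do not record them.

```python
from collections import defaultdict


def build_report(results: list[dict]) -> dict:
    """Summarize errors, missing metadata, duplicate titles, and orphan pages."""
    pages = [p for p in results if 'error' not in p]

    errors_by_status = defaultdict(list)
    for p in results:
        if 'error' in p:
            errors_by_status[p['status']].append(p['url'])

    titles = defaultdict(list)
    for p in pages:
        if p.get('title'):
            titles[p['title']].append(p['url'])

    # Count inbound internal links, ignoring self-links and repeated links.
    inbound = {p['url']: 0 for p in pages}
    for p in pages:
        for link in set(p.get('links', [])):
            if link in inbound and link != p['url']:
                inbound[link] += 1

    return {
        'total_pages': len(results),
        'errors_by_status': dict(errors_by_status),
        'missing_title': [p['url'] for p in pages if not p.get('title')],
        'missing_description': [p['url'] for p in pages if not p.get('description')],
        'missing_h1': [p['url'] for p in pages if not p.get('h1')],
        'duplicate_titles': {t: u for t, u in titles.items() if len(u) > 1},
        'orphan_pages': [url for url, count in inbound.items() if count == 0],
    }
```

Note that the start page appears as an orphan if nothing links back to it, so a real report would usually exclude it.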

Scope	Timeline
Crawler with PostgreSQL index	3–5 working days
Crawler with Meilisearch and admin dashboard	5–7 working days
Incremental crawler with change detection	4–6 working days

Contact us to discuss your indexing requirements. We will review your site structure, choose the right stack, and deliver a working crawler with documentation and deployment instructions.

Why are Core Web Vitals critical for technical SEO?

PageSpeed 34/100 on mobile. Search Console shows red on all category pages. A competitor with an older site outranks you despite weaker content. Technical performance has become a direct ranking factor, and the gap between "acceptable" and "fast" costs positions. We have over 8 years of experience in technical SEO and performance optimization and have completed more than 150 projects across e-commerce, SaaS, and enterprise sites. For a typical mid-size e-commerce store with 50k monthly visits, moving Core Web Vitals from poor to good increased organic traffic by 35% within three months, adding an estimated $12,000 in monthly revenue.

Core Web Vitals: what really affects rankings

Google uses three metrics as Page Experience ranking signals: Largest Contentful Paint (LCP), Cumulative Layout Shift (CLS), and Interaction to Next Paint (INP), which replaced First Input Delay (FID) as a Core Web Vital in March 2024. Google has reported that users are 24% less likely to abandon page loads on sites that meet these thresholds.

LCP: why 8 seconds is not an image problem

LCP measures rendering time of the largest visible element. Good <2.5s, poor >4s.

Real case: online clothing store, LCP 7.8s on mobile. Hero image was a 4.2MB JPEG without srcset, loaded via CSS background-image (not <img>). The problem: the browser's preload scanner cannot discover a CSS background image until the stylesheet has been downloaded and parsed, and 4.2MB over a mobile connection is slow.

Solution:

Move to <img> with fetchpriority="high" and loading="eager"
Convert to WebP, add srcset: 800w for mobile, 1400w for desktop
<link rel="preload" as="image" href="hero-800.webp" media="(max-width: 768px)"> in <head>
Add defer to render-blocking scripts that load before the hero

Result: LCP 7.8s → 1.9s without changing hosting or CDN. That's 4x faster — a competitive advantage in search ranking.

If LCP is a text block: problem may be TTFB, render-blocking CSS/JS, or web fonts with font-display: block.

CLS: what causes layout shifts and how to stop them

CLS measures cumulative layout shift. Good <0.1, poor >0.25. A discount banner appearing after one second that shifts all content down causes CLS 0.35.

Sources:

Images without dimensions. <img src="photo.jpg"> without width/height — browser doesn't reserve space. Fix: explicit width/height or aspect-ratio in CSS.
Ad blocks and widgets — Google Ads, chat, cookie consent. Reserve space via min-height or load before main content.
Web fonts. font-display: swap with size-adjust minimizes CLS.
Dynamic content — add skeleton placeholder with dimensions.

Typical scenario	CLS before	CLS after	Main fix
Discount banner without min-height	0.42	0.02	min-height: 300px
Article images without attributes	0.18	0.01	width/height + aspect-ratio
Chat widget loaded after 3s	0.35	0.05	position: fixed with reserved margin
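
To see where numbers of this size come from: a single layout shift is scored as impact fraction times distance fraction. The sketch below works through a hypothetical banner case. The 800px viewport and the assumption that the shifted content fills the viewport are illustrative, not measurements from the table.

```python
def layout_shift_score(impact_fraction: float, distance_fraction: float) -> float:
    """Score of one layout shift: share of viewport affected times move distance."""
    return impact_fraction * distance_fraction


# Hypothetical scenario: a 300px banner is injected above content that
# fills an 800px-tall viewport, pushing all of it down by 300px.
viewport_height = 800
banner_height = 300

# The shifted content covers the whole viewport across its old and new positions.
impact = 1.0
# It moved by the banner height, expressed as a fraction of the viewport height.
distance = banner_height / viewport_height

score = layout_shift_score(impact, distance)
```

Under these assumptions the single shift scores 0.375, well past the 0.25 "poor" threshold, which is why reserving the banner's height up front removes almost all of the CLS.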

INP: why interface freezes for 500ms

INP measures response delay to any user interaction. Good <200ms, poor >500ms. INP 680ms means user presses filter button and waits half a second.

Main cause: blocked main thread. A 2.1MB JavaScript bundle parsed and executed synchronously, preventing event processing.

Diagnosis: Chrome DevTools → Performance → interact → find Long Tasks (>50ms). Typical culprits:

Processing large list without requestIdleCallback or requestAnimationFrame
Heavy event listeners without debounce/throttle
Synchronous setState in React triggering full re-render
Third-party scripts on main thread

Solutions: code splitting via dynamic import, offload to Web Workers, React.memo + useMemo, Scheduler API.
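
The good and poor thresholds quoted for LCP, CLS, and INP can be applied to field data with a small helper. This is a sketch: the THRESHOLDS table and the rate function are illustrative names, and the middle band uses Google's "needs improvement" label.

```python
# (good upper bound, poor lower bound); LCP in seconds, INP in milliseconds.
THRESHOLDS = {
    'LCP': (2.5, 4.0),
    'CLS': (0.1, 0.25),
    'INP': (200, 500),
}


def rate(metric: str, value: float) -> str:
    """Classify a Core Web Vitals measurement as good, needs improvement, or poor."""
    good, poor = THRESHOLDS[metric]
    if value <= good:
        return 'good'
    if value <= poor:
        return 'needs improvement'
    return 'poor'


# Values taken from the examples in this article.
before = rate('LCP', 7.8)
after = rate('LCP', 1.9)
filter_click = rate('INP', 680)
```

Running such a check per page template makes it easy to see which templates fail and on which metric.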

How do structured data and Schema.org improve search visibility?

Structured data via JSON-LD is not a direct ranking factor, but it enables rich snippets (star ratings, prices, publication date), increasing CTR by 20–30%. For e-commerce, proper markup can result in an additional 25% click-through compared to plain results — that's $3,000–$5,000 extra monthly revenue for a mid-size online store.

Markup types by scenario:

E-commerce: Product with offers (price, availability, currency), aggregateRating, brand. BreadcrumbList, ItemList.
Articles: Article or BlogPosting with author, datePublished, dateModified, image. Organization and WebSite.
Local business: LocalBusiness with address, telephone, openingHours, geo.
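FAQ: FAQPage with mainEntity. Questions can appear as an expandable block, although since 2023 Google shows FAQ rich results mainly for well-known government and health sites.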

Validation: Google Rich Results Test, Schema Markup Validator. Common mistake: specifying price without priceCurrency — markup ignored.

How to conduct a technical SEO audit

Crawlability. robots.txt blocks necessary pages or doesn't block service pages. Canonical URLs incorrectly set — duplicates with UTM parameters. Sitemap contains noindex pages. Tools like Screaming Frog or Sitebulb show this in an hour.

Core Web Vitals at scale. Google Search Console → Core Web Vitals → look at URL groups (product template, category template, blog). Problem is usually systemic.

JavaScript SEO. Google renders JS with delay. For critical content, SSR or SSG are mandatory. Check via Search Console → Inspect URL → View Crawled Page.

Internal linking. Orphan pages receive no internal PageRank. Broken links (404) are a negative quality signal.

Common mistakes when implementing Schema.org: price without priceCurrency, ratingValue without reviewCount, several Product items on one page without an ItemList, and JSON-LD injected through Google Tag Manager where server-side rendering would be more reliable.

What does the optimization process look like?

Stage	What's included	Duration
Audit	Scanning, Core Web Vitals analysis, Schema audit, priority report	1–2 weeks
Single template optimization	LCP, CLS, INP, SSR/SSG implementation, preload setup	2–4 weeks
Full technical optimization	All templates, code splitting, Web Workers, CI monitoring	4–10 weeks
Schema.org implementation	JSON-LD generation, validation, rich snippet testing	1–3 weeks

What deliverables do you receive?

Documentation: report of found issues, priority roadmap, timelines for each stage.
Access: monitoring setup (SpeedCurve, Sentry, Search Console) and dashboard handover.
Training: one or two calls reviewing typical mistakes for your team.
Support: one month of post-deployment support, including metric checks and regression fixes.

How many positions can you regain through technical SEO?

We have 8+ years on the market and 150+ completed projects. One case study: a SaaS platform with 200k monthly visits had LCP 6.2s, CLS 0.45, INP 600ms. After optimization, LCP dropped to 1.8s, CLS to 0.02, INP to 180ms. Organic traffic increased by 40% within two months, generating an additional $18,000 in monthly revenue from trial sign-ups.

Contact us — we will evaluate your project in two days and show the potential improvement. Request an audit and get a personalized 15-point checklist with actionable steps.