Developing Blockchain Data Stream Processing (Kafka/Flink)
An Ethereum node generates roughly 2–5 MB of data per second during periods of high network activity: Transfer events, contract calls, state changes. If your analytics system or trading engine receives this data by periodically polling an RPC node, you are working with stale data and missing events. For tasks where a latency of 1–2 blocks matters (arbitrage, liquidation monitoring, fraud detection), you need a streaming architecture with delivery guarantees.
Data Sources: From Node to Kafka
WebSocket Subscriptions vs Polling
A standard eth_subscribe("newHeads") over WebSocket delivers new-block notifications without polling latency. But WebSocket connections drop over long periods, so you need reconnect logic that catches up on missed blocks:
func (s *NodeSubscriber) subscribeWithRecovery(ctx context.Context) error {
	for {
		lastBlock, err := s.db.GetLastProcessedBlock()
		if err != nil {
			return err
		}
		// Catch up on blocks missed while disconnected
		if err := s.catchUpFromBlock(ctx, lastBlock+1); err != nil {
			return err
		}
		// Subscribe to new heads; s.headers is drained by a separate consumer goroutine
		sub, err := s.client.SubscribeNewHead(ctx, s.headers)
		if err != nil {
			time.Sleep(backoffDuration)
			continue
		}
		select {
		case err := <-sub.Err():
			log.Warnf("subscription error: %v, reconnecting", err)
		case <-ctx.Done():
			sub.Unsubscribe()
			return nil
		}
	}
}
Firehose Protocol (StreamingFast/Pinax)
For Ethereum and other EVM networks, the most efficient way to get raw data is Firehose, developed by StreamingFast (now a core developer team in The Graph ecosystem). The protocol instruments the node client itself and exports blocks in a binary format (protobuf) with minimal latency. Throughput is an order of magnitude higher than JSON-RPC.
# Connect to a Firehose endpoint (TLS on port 443; most providers also require an API token)
grpcurl -H "Authorization: Bearer $FIREHOSE_API_TOKEN" \
  mainnet.eth.streamingfast.io:443 \
  sf.firehose.v2.Stream/Blocks
For projects requiring full historical replay — Firehose + storing flat files in S3/GCS allows replaying any block range without re-syncing the node.
Kafka as Transport Layer
Kafka is a log-based message broker with configurable retention. Unlike RabbitMQ or Redis Streams, it retains all messages for the configured retention period (days or weeks), so consumers can replay data. This is critical for blockchain analytics: a new consumer group can read the entire event history without touching the node.
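The replay property can be sketched with a toy model in plain Java (illustrative only, not the Kafka client API): a partition is an append-only log, and each consumer group keeps its own offset into it, so a new group starts from offset 0 and replays everything.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a Kafka partition: an append-only log with
// per-consumer-group offsets. Not the real client API.
class PartitionLog {
    private final List<String> records = new ArrayList<>();
    private final Map<String, Integer> groupOffsets = new HashMap<>();

    void append(String record) {
        records.add(record);
    }

    // Return the next record for a group, advancing only that group's offset.
    String poll(String groupId) {
        int offset = groupOffsets.getOrDefault(groupId, 0);
        if (offset >= records.size()) return null; // caught up
        groupOffsets.put(groupId, offset + 1);
        return records.get(offset);
    }
}
```

A group that has consumed to the end gets null, while a freshly created group sees the history from the beginning, which is exactly what makes backfilling a new analytics consumer cheap.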
Topic topologies for blockchain pipeline:
raw.blocks → raw blocks (partitioned by block_number % N)
raw.transactions → all transactions
raw.logs → all event logs
decoded.transfers → decoded ERC-20 Transfer events
decoded.swaps → decoded Swap events (Uniswap, Curve, etc.)
alerts.large-txns → transactions > threshold
analytics.prices → aggregated price data
Partitioning strategy matters: for events of a specific contract, partition by contractAddress (guarantees per-contract ordering); for transactions, partition by from address or blockNumber.
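Why a fixed key preserves ordering can be shown with a minimal sketch of hash-based partition selection (Kafka's default partitioner actually uses murmur2; String.hashCode() stands in here):

```java
// Illustrative hash-based partition selection. Records with the same key
// always land on the same partition, so their relative order is preserved;
// Kafka's real default partitioner uses murmur2 rather than hashCode().
class KeyPartitioner {
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

The corollary is that changing the partition count reshuffles keys, so plan topic sizing before relying on per-key ordering.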
Apache Flink: Stateful Stream Processing
Flink is the right tool for tasks that require state: sliding aggregates, stream joins, pattern detection over time. Spark Streaming processes micro-batches, which is batching disguised as streaming; Flink does true event-at-a-time, event-time processing.
ABI Decoding On-The-Fly
Incoming logs are raw hex data. The Flink job should decode them into typed events:
public class LogDecoderFunction extends RichFlatMapFunction<RawLog, DecodedEvent> {
    private transient Map<String, ContractABI> abiRegistry;

    @Override
    public void open(Configuration parameters) {
        // Load the ABI registry once per task, e.g. from PostgreSQL/Redis
        // (loader name is illustrative)
        abiRegistry = AbiRegistryLoader.load();
    }

    @Override
    public void flatMap(RawLog log, Collector<DecodedEvent> out) {
        String contractAddress = log.getAddress().toLowerCase();
        ContractABI abi = abiRegistry.get(contractAddress);
        if (abi == null) return; // unknown contract
        String topic0 = log.getTopics().get(0);
        EventDefinition eventDef = abi.findEventBySignatureHash(topic0);
        if (eventDef != null) {
            out.collect(AbiDecoder.decode(eventDef, log));
        }
    }
}
The ABI registry is loaded from PostgreSQL/Redis at job start and updated via the Broadcast State pattern, so new contracts can be added without restarting the job.
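The registry lookup in the decoder can be sketched as a plain map from topic0 (the keccak256 hash of the canonical event signature) to an event definition; the Transfer hash below is the well-known ERC-20 constant, and the value type is simplified to a string for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a topic0 -> event-signature registry.
// topic0 = keccak256("Transfer(address,address,uint256)") is the
// well-known ERC-20 Transfer constant used below.
class AbiRegistry {
    private final Map<String, String> eventsByTopic0 = new HashMap<>();

    AbiRegistry() {
        eventsByTopic0.put(
            "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef",
            "Transfer(address,address,uint256)");
    }

    // Normalize case before lookup: nodes and libraries differ here.
    String lookup(String topic0) {
        return eventsByTopic0.get(topic0.toLowerCase());
    }
}
```

An unknown topic0 returns null; in the pipeline above that record would be routed to the dead letter queue rather than silently dropped.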
Temporal Windows and Aggregations
Task: compute 5-minute VWAP (Volume Weighted Average Price) from Uniswap V3 swaps in real-time.
DataStream<SwapEvent> swaps = source
    .filter(e -> e.getType().equals("Swap"))
    .map(e -> (SwapEvent) e);

DataStream<VWAPResult> vwap = swaps
    .keyBy(SwapEvent::getPoolAddress)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new VWAPAggregator(), new VWAPWindowFunction());
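The VWAPAggregator itself is not shown above; a plain-Java sketch of its accumulator logic, mirroring Flink's AggregateFunction contract (add/merge/getResult) without the Flink dependency, might look like this:

```java
// Plain-Java sketch of a VWAP accumulator, mirroring the shape of
// Flink's AggregateFunction (createAccumulator/add/merge/getResult).
// VWAP = sum(price * volume) / sum(volume).
class VWAPAccumulator {
    double priceVolumeSum = 0.0;
    double volumeSum = 0.0;

    void add(double price, double volume) {
        priceVolumeSum += price * volume;
        volumeSum += volume;
    }

    // merge() lets Flink combine partial accumulators, e.g. for session windows
    VWAPAccumulator merge(VWAPAccumulator other) {
        priceVolumeSum += other.priceVolumeSum;
        volumeSum += other.volumeSum;
        return this;
    }

    double getResult() {
        return volumeSum == 0.0 ? 0.0 : priceVolumeSum / volumeSum;
    }
}
```

Keeping only two running sums per key is what makes this aggregation cheap: state size is constant regardless of how many swaps fall into the window.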
Event time vs processing time is a critical choice. Event time (block timestamp) gives deterministic results on replay; processing time is simpler but produces different results on every run.
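The determinism claim follows from window assignment being a pure function of the event timestamp, independent of arrival order; a minimal sketch:

```java
// Event-time tumbling-window assignment: the window an event falls into
// depends only on its timestamp, so replays are deterministic.
class EventTimeWindows {
    static final long WINDOW_MS = 5 * 60 * 1000L; // 5-minute tumbling windows

    // Start of the window containing the given event time (ms).
    static long windowStart(long eventTimeMs) {
        return eventTimeMs - Math.floorMod(eventTimeMs, WINDOW_MS);
    }
}
```

Two events with the same block timestamp land in the same window whether they arrive first or last, which is precisely what processing-time windows cannot guarantee.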
Watermarks handle late events; blockchain transactions may arrive in Kafka with a slight delay:
WatermarkStrategy
    .<RawLog>forBoundedOutOfOrderness(Duration.ofSeconds(10))
    .withTimestampAssigner((log, ts) -> log.getBlockTimestamp() * 1000L);
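What forBoundedOutOfOrderness does can be sketched in a few lines: the watermark trails the maximum observed event time by the configured bound, and events behind the watermark count as late:

```java
// Sketch of bounded-out-of-orderness watermarking: the watermark trails
// the max seen event time by a fixed bound; events older than the
// current watermark are considered late.
class BoundedWatermark {
    private final long boundMs;
    private long maxEventTimeMs = Long.MIN_VALUE;

    BoundedWatermark(long boundMs) {
        this.boundMs = boundMs;
    }

    // Observe an event and return the resulting watermark.
    long onEvent(long eventTimeMs) {
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
        return maxEventTimeMs - boundMs;
    }

    boolean isLate(long eventTimeMs) {
        return eventTimeMs < maxEventTimeMs - boundMs;
    }
}
```

The bound is a latency/completeness trade-off: a larger bound admits more out-of-order events but delays every window result by that amount.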
Complex Patterns: CEP for Anomaly Detection
Flink CEP (Complex Event Processing) allows describing sequences of events. Task: detect a sandwich attack, i.e. a front-run transaction, a victim, and a back-run transaction in one block.
Pattern<DecodedEvent, ?> sandwichPattern = Pattern
    .<DecodedEvent>begin("frontrun")
    .where(SimpleCondition.of(e -> e.isSwap() && e.getGasPrice() > threshold))
    .next("victim")
    // conditions referencing an earlier match need an IterativeCondition
    .where(new IterativeCondition<DecodedEvent>() {
        @Override public boolean filter(DecodedEvent e, Context<DecodedEvent> ctx) throws Exception {
            DecodedEvent front = ctx.getEventsForPattern("frontrun").iterator().next();
            return e.isSwap() && e.getPool().equals(front.getPool());
        }
    })
    .next("backrun")
    .where(new IterativeCondition<DecodedEvent>() {
        @Override public boolean filter(DecodedEvent e, Context<DecodedEvent> ctx) throws Exception {
            DecodedEvent front = ctx.getEventsForPattern("frontrun").iterator().next();
            return e.isSwap() && e.getPool().equals(front.getPool())
                    && e.getSender().equals(front.getSender());
        }
    })
    .within(Time.seconds(12)); // within one block
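The same check, stripped of the CEP machinery, can be expressed as a plain scan over one block's ordered swaps (the Swap fields here are illustrative, not the pipeline's actual event schema):

```java
import java.util.List;

// Illustrative swap event: sender and pool only.
class Swap {
    final String sender;
    final String pool;

    Swap(String sender, String pool) {
        this.sender = sender;
        this.pool = pool;
    }
}

// Sandwich check over one block's swaps, in transaction order:
// same sender swaps the same pool immediately before and after a
// different sender's swap on that pool.
class SandwichDetector {
    static boolean isSandwich(List<Swap> blockSwaps) {
        for (int i = 0; i + 2 < blockSwaps.size(); i++) {
            Swap front = blockSwaps.get(i);
            Swap victim = blockSwaps.get(i + 1);
            Swap back = blockSwaps.get(i + 2);
            if (front.pool.equals(victim.pool)
                    && front.pool.equals(back.pool)
                    && front.sender.equals(back.sender)
                    && !victim.sender.equals(front.sender)) {
                return true;
            }
        }
        return false;
    }
}
```

The CEP version buys you the cross-block windowing, state management, and timeout handling that this naive scan lacks, but the matching logic is the same.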
State Backend and Fault Tolerance
A Flink checkpoint is a snapshot of all operators' state to S3/HDFS. On failure, the job recovers from the last checkpoint; Kafka consumer offsets are stored atomically with the state. This gives exactly-once state semantics (end-to-end exactly-once additionally requires a transactional sink).
The RocksDB state backend is required in production with large state (millions of keys); the in-memory backend doesn't scale.
# flink-conf.yaml
state.backend: rocksdb
state.checkpoints.dir: s3://your-bucket/flink-checkpoints
execution.checkpointing.interval: 60000 # every minute
execution.checkpointing.mode: EXACTLY_ONCE
Monitoring and Dead Letter Queues
Unprocessed events (unknown ABI, parse error, unexpected format) cannot simply be dropped. The standard pattern is a dead letter queue (DLQ): a separate Kafka topic carrying the original message plus the error stack trace.
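A dead-letter envelope can be as simple as the original payload plus error context; the field names below are illustrative, not a fixed schema:

```java
// Illustrative dead-letter envelope: the raw payload travels with the
// error details, so the record can be reprocessed after a fix is deployed.
class DeadLetterRecord {
    final String sourceTopic;
    final byte[] originalPayload;
    final String errorMessage;
    final String stackTrace;
    final long failedAtMs;

    DeadLetterRecord(String sourceTopic, byte[] originalPayload, Throwable error) {
        this.sourceTopic = sourceTopic;
        this.originalPayload = originalPayload;
        this.errorMessage = error.toString();
        StringBuilder sb = new StringBuilder();
        for (StackTraceElement el : error.getStackTrace()) {
            sb.append(el).append('\n');
        }
        this.stackTrace = sb.toString();
        this.failedAtMs = System.currentTimeMillis();
    }
}
```

Preserving the untouched original payload is the important part: once the decoder is fixed, a small replay job can feed the DLQ topic back through the pipeline.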
Flink metrics + Prometheus + Grafana: lag per topic, operator throughput, backpressure across the job graph. Backpressure is the first indicator that downstream can't keep up.
Typical Use Cases and Latencies
| Use case | Acceptable latency | Tool |
|---|---|---|
| MEV bot / arbitrage | < 100ms | WebSocket → in-process |
| Liquidation monitoring | < 1 sec | Kafka + Flink CEP |
| Real-time DeFi analytics | 1–5 sec | Kafka + Flink aggregations |
| On-chain analytics/BI | < 1 min | Kafka + Flink → ClickHouse |
| Historical analysis | no limits | Firehose → S3 → Spark/dbt |
Infrastructure and Stack
Minimal production cluster: 3 Kafka brokers (3 replicas for durability), Flink cluster with 1 JobManager + 3–5 TaskManager pods in Kubernetes. Results storage — ClickHouse for analytics queries (columnar, fast aggregations on large volumes) or PostgreSQL + TimescaleDB for time series metrics.
Managed services reduce operational burden: Confluent Cloud (Kafka), Amazon Kinesis (alternative for AWS-native stack). For on-premise or compliance requirements — own cluster.
Expect 4–6 weeks to develop the first production pipeline with event decoding, basic aggregations, and monitoring; complex CEP patterns and multi-chain support add another 3–4 weeks.