Development of Node-as-a-Service Platform
Running a single blockchain node manually is straightforward. Running hundreds of them with uptime guarantees, versioning, client isolation, billing, and API proxies is a full-scale infrastructure product. This is the platform built by teams that want to compete with Infura, Alchemy, or QuickNode, or to offer managed infrastructure to enterprise clients in a specific region or ecosystem.
Before diving into development, honestly answer the question: are you building NaaS for public blockchains (Ethereum, Solana, BNB) or for private/permissioned networks (Hyperledger Besu, Quorum)? These are architecturally different systems with different challenges.
NaaS Platform Architecture Layers
Node Orchestration Layer
Kubernetes is the standard for managing the node lifecycle. However, a stock K8s Deployment doesn't work for blockchain nodes: they carry massive stateful data, require specific network policies, and a pod restart that loses its data volume means resyncing from scratch (which takes days).
StatefulSet with PVC:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ethereum-geth
spec:
  serviceName: "geth"
  replicas: 1
  selector:
    matchLabels:
      app: ethereum-geth
  template:
    metadata:
      labels:
        app: ethereum-geth  # must match spec.selector
    spec:
      containers:
        - name: geth
          image: ethereum/client-go:v1.13.14
          args:
            - "--datadir=/data"
            - "--http"
            - "--http.addr=0.0.0.0"
            - "--http.vhosts=*"
            - "--http.api=eth,net,web3,txpool"
            - "--ws"
            - "--ws.addr=0.0.0.0"
            - "--maxpeers=50"
            - "--cache=4096"
          ports:
            - containerPort: 8545   # HTTP RPC
              protocol: TCP
            - containerPort: 8546   # WebSocket
              protocol: TCP
            - containerPort: 30303  # P2P
              protocol: TCP
            - containerPort: 30303
              protocol: UDP
          volumeMounts:
            - name: data
              mountPath: /data
          resources:
            requests:
              memory: "16Gi"
              cpu: "4"
            limits:
              memory: "32Gi"
              cpu: "8"
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "fast-nvme"
        resources:
          requests:
            storage: 3Ti  # Ethereum archive node
```
P2P port issue in K8s: blockchain nodes need stable, externally reachable ports for peer discovery (30303 TCP+UDP for Ethereum). A NodePort or LoadBalancer Service with a fixed port per node is the only workable approach; hostNetwork is an alternative but sacrifices isolation.
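One way to pin the P2P port is a per-pod NodePort Service that selects a single StatefulSet pod by name. A sketch (Service name and labels are illustrative; 30303 falls inside the default 30000–32767 NodePort range):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ethereum-geth-0-p2p
spec:
  type: NodePort
  selector:
    # StatefulSets label each pod with its stable name
    statefulset.kubernetes.io/pod-name: ethereum-geth-0
  ports:
    - name: p2p-tcp
      port: 30303
      nodePort: 30303
      protocol: TCP
    - name: p2p-udp
      port: 30303
      nodePort: 30303
      protocol: UDP
```

For discovery to actually work, the client typically also needs to advertise the externally reachable address (for geth, a flag like `--nat extip:<public-ip>`).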
Bootstrapping: "Day One" Problem
Ethereum mainnet sync from zero (snap sync) takes 12–24 hours; an archive node takes 2–5 weeks. That is unacceptable for a platform where the client starts paying from minute one.
Solutions:
Snapshot distribution — initialize node from current database snapshot. Must maintain current snapshots (~500GB–3TB depending on type) and provide HTTP/S3 access for bootstrap. Strategy: snapshot every 7 days, incremental diffs daily.
```bash
#!/bin/bash
# Bootstrap a node's data directory from the latest snapshot
set -euo pipefail

NODE_ID=$1
CHAIN=$2
SNAPSHOT_BASE="s3://your-snapshots/${CHAIN}/latest"

# Download snapshot
aws s3 sync "${SNAPSHOT_BASE}" "/data/${NODE_ID}/chaindata" \
    --no-sign-request \
    --region eu-west-1

# Verify integrity (the checksum file lists paths relative to chaindata)
cd "/data/${NODE_ID}/chaindata"
sha256sum -c CHECKSUM

# Restart the node so it starts from the restored datadir
kubectl rollout restart "statefulset/${CHAIN}-node-${NODE_ID}"
```
Firehose / Erigon snapshots — for some chains, the community maintains public snapshots (Erigon uploads snapshots to BitTorrent/IPFS).
API Gateway and Routing
Client gets one endpoint. Behind it — load balancer, health checks, rate limiting, API key management.
```
Client → API Gateway (Kong/custom) → Node Pool → Blockchain Node
                  ↓
  [Auth, RateLimit, Billing, Logging]
```
A custom RPC proxy is necessary because:
- Dangerous methods must be filtered (`debug_traceTransaction` is expensive, premium tier only)
- Requests must be routed by type (archive requests go to archive nodes, latest-block requests to full nodes)
- Responses to common requests should be cached (`eth_chainId`, `eth_blockNumber`)
```go
// Example RPC proxy with routing logic
package proxy

import (
	"net/http"

	"github.com/redis/go-redis/v9"
)

type RPCRouter struct {
	archivePool  NodePool
	fullNodePool NodePool
	cacheClient  *redis.Client
	billing      BillingRecorder // records per-key usage for invoicing
}

// Methods that may need historical state; routed to archive nodes
// when the request pins a block other than "latest".
var archiveMethods = map[string]bool{
	"eth_getBalance":          true, // with block parameter != "latest"
	"eth_call":                true,
	"eth_getStorageAt":        true,
	"trace_call":              true,
	"trace_replayTransaction": true,
}

func (r *RPCRouter) Route(req *RPCRequest) NodePool {
	if archiveMethods[req.Method] && req.RequiresHistoricalBlock() {
		return r.archivePool
	}
	return r.fullNodePool
}

func (r *RPCRouter) Handle(w http.ResponseWriter, req *RPCRequest, apiKey string) {
	ctx := req.Context()

	// Serve from cache when possible
	cacheKey := req.CacheKey()
	if cached, err := r.cacheClient.Get(ctx, cacheKey).Bytes(); err == nil {
		w.Write(cached)
		return
	}

	// Route to a healthy upstream node and forward
	pool := r.Route(req)
	node, err := pool.GetHealthyNode()
	if err != nil {
		http.Error(w, "no healthy upstream", http.StatusServiceUnavailable)
		return
	}
	resp := node.Forward(req)

	if req.IsCacheable() {
		r.cacheClient.Set(ctx, cacheKey, resp.Body, req.CacheTTL())
	}

	// Billing: record usage
	r.billing.RecordRequest(apiKey, req.Method, resp.ComputeUnits())
	w.Write(resp.Body)
}
```
Health Checking and Failover
A blockchain node can be technically "alive" (it responds to ping) but practically useless: lagging 100 blocks behind the network, or still in initial sync.
```go
package health

import (
	"context"
	"fmt"
	"time"

	"github.com/ethereum/go-ethereum/ethclient"
)

type NodeHealthChecker struct {
	client *ethclient.Client
}

func (h *NodeHealthChecker) IsHealthy(ctx context.Context) (bool, error) {
	// Check that the node is not in initial sync
	syncing, err := h.client.SyncProgress(ctx)
	if err != nil {
		return false, err
	}
	if syncing != nil {
		return false, fmt.Errorf("node is syncing: %d/%d",
			syncing.CurrentBlock, syncing.HighestBlock)
	}

	// Check block freshness against the latest header's timestamp
	header, err := h.client.HeaderByNumber(ctx, nil)
	if err != nil {
		return false, err
	}
	blockAge := time.Since(time.Unix(int64(header.Time), 0))
	if blockAge > 2*time.Minute {
		return false, fmt.Errorf("block too old: %v", blockAge)
	}
	return true, nil
}
```
Health checks should run every 10–30 seconds. A node is excluded from the pool after two consecutive failures and returned after three consecutive successful checks.
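The exclude-after-two-failures, readmit-after-three-successes policy reduces to a small per-node state tracker. A sketch with the thresholds as fields (type and method names are illustrative):

```go
package main

import "fmt"

// HealthTracker flips a node in and out of the pool based on
// consecutive health-check results.
type HealthTracker struct {
	failThreshold int // consecutive failures before eviction
	passThreshold int // consecutive successes before readmission
	fails, passes int
	inPool        bool
}

func NewHealthTracker() *HealthTracker {
	return &HealthTracker{failThreshold: 2, passThreshold: 3, inPool: true}
}

// Observe records one health-check result and reports whether
// the node should currently receive traffic.
func (t *HealthTracker) Observe(healthy bool) bool {
	if healthy {
		t.fails = 0
		t.passes++
		if !t.inPool && t.passes >= t.passThreshold {
			t.inPool = true
		}
	} else {
		t.passes = 0
		t.fails++
		if t.inPool && t.fails >= t.failThreshold {
			t.inPool = false
		}
	}
	return t.inPool
}

func main() {
	t := NewHealthTracker()
	// One failure keeps the node in the pool; a second evicts it;
	// three successes in a row bring it back.
	for _, ok := range []bool{false, false, true, true, true} {
		fmt.Println(t.Observe(ok))
	}
}
```

A single transient failure (a timed-out check during a GC pause, say) therefore never drops a node, while a flapping node has to prove itself three times before rejoining.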
Multi-tenancy and Isolation
Resource Sharing
Three models:
Shared nodes — multiple clients use one node. Cheap, but no performance guarantees. Suitable for free tier and small projects.
Dedicated nodes — one node per client. Guaranteed resources, isolation. For enterprise.
Node clusters — multiple replicas behind load balancer for one client. For high-availability requirements.
```sql
-- Billing schema
CREATE TABLE api_keys (
    id UUID PRIMARY KEY,
    customer_id UUID NOT NULL,
    key_hash BYTEA NOT NULL,        -- never store the key in plaintext
    tier VARCHAR(20) NOT NULL,      -- free, starter, pro, enterprise
    rate_limit_rps INTEGER NOT NULL,
    monthly_cu_limit BIGINT,        -- compute units
    node_type VARCHAR(20) NOT NULL, -- shared, dedicated
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE usage_records (
    id BIGSERIAL PRIMARY KEY,
    api_key_id UUID NOT NULL REFERENCES api_keys(id),
    method VARCHAR(100) NOT NULL,
    chain_id INTEGER NOT NULL,
    compute_units INTEGER NOT NULL,
    response_time_ms INTEGER,
    recorded_at TIMESTAMPTZ DEFAULT NOW()
);

-- Index for billing queries by period
CREATE INDEX idx_usage_billing ON usage_records (api_key_id, recorded_at);
```
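Monthly invoicing then reduces to an aggregation over `usage_records`. A sketch against the schema above (PostgreSQL syntax assumed):

```sql
-- Total CU per customer for the current billing month
SELECT k.customer_id,
       k.tier,
       SUM(u.compute_units) AS total_cu
FROM usage_records u
JOIN api_keys k ON k.id = u.api_key_id
WHERE u.recorded_at >= date_trunc('month', NOW())
GROUP BY k.customer_id, k.tier;
```

The `(api_key_id, recorded_at)` index above keeps this query cheap; at high request volumes, usage is typically pre-aggregated into hourly rollups rather than queried raw.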
Compute Units (CU) are the standard NaaS billing unit. Each RPC method has a weight:

- `eth_blockNumber` → 10 CU
- `eth_getTransactionReceipt` → 15 CU
- `eth_call` → 26 CU
- `trace_replayTransaction` → 75 CU
- `eth_getLogs` → 75 CU + 1 CU per log returned
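A per-request CU calculation with the per-log surcharge for `eth_getLogs` might look like the following (weights taken from the list above; the fallback weight for unlisted methods is an assumption):

```go
package main

import "fmt"

// cuWeights holds the base compute-unit cost per RPC method.
var cuWeights = map[string]int{
	"eth_blockNumber":           10,
	"eth_getTransactionReceipt": 15,
	"eth_call":                  26,
	"trace_replayTransaction":   75,
	"eth_getLogs":               75,
}

const defaultCU = 20 // fallback for unlisted methods (platform-specific assumption)

// computeUnits returns the billable CU for one request;
// logsReturned only matters for eth_getLogs.
func computeUnits(method string, logsReturned int) int {
	cu, ok := cuWeights[method]
	if !ok {
		cu = defaultCU
	}
	if method == "eth_getLogs" {
		cu += logsReturned // +1 CU per log in the response
	}
	return cu
}

func main() {
	fmt.Println(computeUnits("eth_call", 0))      // 26
	fmt.Println(computeUnits("eth_getLogs", 120)) // 195
}
```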
Rate Limiting
A Redis-backed sliding window handles bursty RPC traffic better than a plain token bucket:

```go
package ratelimit

import (
	"context"
	"strconv"
	"time"

	"github.com/redis/go-redis/v9"
)

type RateLimiter struct {
	redis *redis.Client
}

// Allow implements a 1-second sliding window over a Redis sorted set.
func (rl *RateLimiter) Allow(ctx context.Context, apiKey string, rps int) (bool, error) {
	now := time.Now().UnixNano()
	window := int64(time.Second)
	pipe := rl.redis.Pipeline()
	// Drop entries that fell out of the window
	pipe.ZRemRangeByScore(ctx, apiKey, "0", strconv.FormatInt(now-window, 10))
	countCmd := pipe.ZCard(ctx, apiKey)
	// Nanosecond timestamps as members so concurrent requests rarely collide
	pipe.ZAdd(ctx, apiKey, redis.Z{Score: float64(now), Member: now})
	pipe.Expire(ctx, apiKey, 2*time.Second)
	if _, err := pipe.Exec(ctx); err != nil {
		return false, err
	}
	return countCmd.Val() < int64(rps), nil
}
```
Monitoring and Alerting
Critical Metrics
```
# Prometheus metrics for NaaS
node_sync_lag_blocks{chain, node_id}          # lag behind chain head
node_peer_count{chain, node_id}               # peer count
rpc_request_duration_seconds{method, status}  # p50, p95, p99
rpc_requests_total{method, chain, tier}       # for billing
node_restart_total{chain, node_id, reason}    # restart frequency
compute_units_consumed{api_key, chain}        # billing data
```
Alert rules for on-call:
| Metric | Threshold | Severity |
|---|---|---|
| sync_lag_blocks | > 10 blocks | Warning |
| sync_lag_blocks | > 50 blocks | Critical |
| peer_count | < 5 | Warning |
| rpc_error_rate | > 5% in 5 min | Warning |
| node_restart_total | > 3 per hour | Critical |
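The first two rows of the table translate into Prometheus alerting rules roughly as follows (metric and group names follow the list above; the `for:` duration is an assumption to suppress flapping):

```yaml
groups:
  - name: naas-node-alerts
    rules:
      - alert: NodeSyncLagWarning
        expr: node_sync_lag_blocks > 10
        for: 2m
        labels:
          severity: warning
      - alert: NodeSyncLagCritical
        expr: node_sync_lag_blocks > 50
        for: 2m
        labels:
          severity: critical
```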
Supported Clients and Specifics
| Chain | Client | Data Size | Features |
|---|---|---|---|
| Ethereum (full) | Geth / Reth | ~1.2 TB | snap sync available |
| Ethereum (archive) | Erigon | ~2.5 TB | trace API differs from Geth |
| Solana | Agave (Solana Labs) | ~50 TB (full ledger) | geyser plugin for streaming |
| BNB Chain | BSC Geth fork | ~800 GB | faster block time (3s) |
| Polygon | Bor + Heimdall | ~600 GB | two processes per node |
| Arbitrum | Nitro | ~1 TB | sequencer feed, not P2P |
| Base | op-geth | ~800 GB | OP Stack, op-node nearby |
Reth is a newer Ethereum client written in Rust by Paradigm. It syncs significantly faster (roughly half the time) and uses resources more efficiently; for new deployments it is the first choice for Ethereum full nodes.
Development Stages
Phase 1 — Core infrastructure (4–6 weeks): K8s setup, StatefulSet templates for 2–3 chains, snapshot bootstrap pipeline, basic health checker.
Phase 2 — API Gateway (3–4 weeks): RPC proxy, API key management, rate limiting, compute units counting.
Phase 3 — Multi-tenancy & billing (3–4 weeks): tenant isolation, usage tracking, billing integration (Stripe), usage dashboard.
Phase 4 — Observability (2–3 weeks): Prometheus + Grafana, alerting, log aggregation (Loki), on-call runbooks.
Phase 5 — Self-service portal (4–6 weeks): web interface for creating nodes, viewing metrics, managing API keys.
Total: 16–23 weeks to a production-ready platform. Each new chain added post-launch takes 1–2 weeks (template + snapshot pipeline + testing).







