Integration with io.net (GPU Network)
A standard ML-infrastructure problem in the blockchain context: centralized GPU providers (AWS, GCP) offer predictable latency and SLAs, but completely violate the principle of permissionless access to compute. io.net addresses this with a DePIN model: a decentralized network of roughly 200,000 GPUs aggregated from data centers, mining farms, and gaming PCs. The integration task is not just calling a REST API, but building a reliable pipeline that accounts for the specifics of decentralized compute: variable latency, worker failures, and stochastic task distribution.
Architecture of io.net Integration
io.net provides two main ways to interact: the IO Cloud API for managed clusters, and IOG (IO Compute) for direct access to individual GPU workers. For production systems, use the first option, with clusters.
Cluster Lifecycle
A typical flow looks like this:
POST /clusters → create cluster with GPU requirements
GET /clusters/{id} → poll status (PROVISIONING → READY)
POST /clusters/{id}/jobs → run tasks
GET /jobs/{job_id} → monitor execution
DELETE /clusters/{id} → release resources
The resource-provisioning strategy is critical: io.net does not guarantee an allocation time. Depending on network load and GPU requirements, provisioning can take anywhere from 2 minutes to 30+ minutes. Any integration should therefore be built on an async model with webhook notifications or polling with exponential backoff, not on synchronous calls with a timeout.
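The backoff pattern can be sketched as follows. `poll_until_ready` and its parameters are illustrative, not part of an io.net SDK; the status-fetching call (e.g. a wrapper around GET /clusters/{id}) is injected so the control flow stays self-contained:

```python
import time

def poll_until_ready(fetch_status, max_wait_s: float = 3600,
                     base_delay_s: float = 5.0, max_delay_s: float = 120.0) -> str:
    """Poll cluster status with exponential backoff until it leaves PROVISIONING.

    fetch_status: callable returning the current status string, e.g. a thin
    wrapper around GET /clusters/{id}.
    """
    delay = base_delay_s
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status != "PROVISIONING":
            return status  # READY, FAILED, etc.
        time.sleep(delay)
        delay = min(delay * 2, max_delay_s)  # 5s, 10s, 20s, ... capped
    raise TimeoutError("cluster did not leave PROVISIONING in time")
```

Injecting `fetch_status` also makes the backoff logic trivially testable without a live cluster.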
Cluster Specification
When creating a cluster, you specify the requirements:
```json
{
  "cluster_name": "inference-cluster-prod",
  "num_gpus": 8,
  "gpu_model": "NVIDIA_3090",
  "min_vcpus": 16,
  "min_ram": 64,
  "locations": ["US", "EU"],
  "compliance": ["GDPR"],
  "duration_hours": 4
}
```
The gpu_model field is one of the most important. For LLM inference (LLaMA 3, Mistral), an RTX 3090/4090 with 24 GB VRAM is sufficient. For training or fine-tuning, you need A100/H100 GPUs with NVLink. Mismatching the GPU model to the task is the main source of wasted spend on io.net.
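A back-of-envelope VRAM estimate helps match gpu_model to the workload before paying for the wrong hardware. The multiplier and overhead below are rule-of-thumb assumptions, not io.net figures:

```python
def inference_vram_gb(params_billion: float, bytes_per_param: float = 2.0,
                      overhead: float = 1.2) -> float:
    """Rough VRAM estimate for LLM inference.

    bytes_per_param: 2.0 for fp16/bf16, 1.0 for int8, 0.5 for 4-bit quantization.
    overhead: rough multiplier for KV cache, activations, and CUDA context.
    """
    return params_billion * bytes_per_param * overhead

# An 8B model in fp16 comes out around 19 GB, so a 24 GB RTX 3090 fits;
# a 70B model in fp16 far exceeds a single consumer card and needs
# multiple A100/H100-class GPUs.
```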
Managing Failures and Reliability
A decentralized network is by definition less predictable than a managed cloud. In practice this means:
- A worker can go offline mid-task (the node lost connectivity, or the operator pulled the machine)
- GPUs can be in different states: one slot may run faster than another
- Network latency between workers in a cluster is not guaranteed, which is critical for allreduce workloads (distributed training)
Retry and Checkpointing Pattern
For long-running tasks, a checkpoint mechanism is mandatory. If a 6-hour training job fails at hour 5, without checkpoints everything starts over:
```python
import time

# IONetClient, CheckpointStorage, and WorkerFailureError are assumed to be
# defined elsewhere (API wrapper, S3/IPFS-backed storage, and the exception
# raised when a worker drops mid-job).
class IONetJobManager:
    def __init__(self, api_key: str, checkpoint_storage: str):
        self.client = IONetClient(api_key)
        self.storage = CheckpointStorage(checkpoint_storage)  # S3/IPFS

    def submit_with_retry(self, job_config: dict, max_retries: int = 3):
        # Resume from the latest checkpoint if one exists
        last_checkpoint = self.storage.get_latest_checkpoint(job_config["job_id"])
        if last_checkpoint:
            job_config["resume_from"] = last_checkpoint

        for attempt in range(max_retries):
            try:
                job = self.client.submit_job(job_config)
                return self._monitor_with_checkpointing(job)
            except WorkerFailureError:
                if attempt == max_retries - 1:
                    raise
                wait_time = 2 ** attempt * 30  # 30s, 60s, 120s
                time.sleep(wait_time)
```
Monitoring via On-Chain Events
io.net uses Solana for settlements and verification, which makes it possible to build monitoring on top of on-chain events rather than only the REST API. Worker accounts are updated on status changes, and a WebSocket subscription via @solana/web3.js (connection.onAccountChange) delivers notifications with lower latency than API polling.
Payment via $IO Token
Settlements in io.net are made in the $IO token (an SPL token on Solana). For automated systems this means managing an on-chain balance:
| Aspect | Solution |
|---|---|
| Balance replenishment | Programmatic swap via Jupiter Aggregator or direct purchase |
| Cost control | Set max_spend limit on cluster creation |
| Refunds | Automatic on DELETE /clusters/{id} |
| Currency risk | Hedging via perpetual on Drift Protocol |
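The max_spend control from the table can also be enforced client-side before the cluster request is even sent. A sketch, with the function name and the per-GPU hourly rate being hypothetical (actual pricing comes from the io.net quote):

```python
def check_budget(num_gpus: int, hourly_rate_io: float, duration_hours: float,
                 max_spend_io: float) -> float:
    """Estimate a cluster's cost in $IO and fail fast if it exceeds the budget.

    hourly_rate_io: illustrative per-GPU hourly price in $IO.
    """
    cost = num_gpus * hourly_rate_io * duration_hours
    if cost > max_spend_io:
        raise ValueError(
            f"estimated cost {cost:.2f} $IO exceeds budget {max_spend_io:.2f} $IO"
        )
    return cost
```

Failing before submission avoids both a wasted provisioning cycle and an on-chain spend you did not intend.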
For enterprise clients, io.net offers stablecoin settlements under a separate enterprise plan, which removes the $IO volatility concern.
Typical Use Cases
Inference-as-a-service: deploy a model on an io.net cluster and expose your own API on top. Savings versus AWS SageMaker are 60–80% at comparable throughput.
Federated learning: io.net supports isolated clusters with geographic compliance restrictions, enabling federated-learning pipelines where data never leaves its jurisdiction.
Burst computing for Web3 projects: on-chain games, AI content generation for NFTs, ZK-proof generation and verification are tasks that require GPUs only periodically. io.net lets you pay only for the time used, without capacity reservation.