Replicate integration for running open-source AI models
Replicate is a cloud platform for running open-source AI models via an API, without managing GPU infrastructure yourself. It hosts thousands of models, including Stable Diffusion, LLaMA, Whisper, CodeLlama, and others. Billing is based on GPU time consumed.
Installation and basic use
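The examples below assume the Python client is installed and an API token is exported in the environment (the token value here is a placeholder; get a real one from your Replicate account page):

```shell
# Install the official Python client
pip install replicate

# The client reads the token from this environment variable
export REPLICATE_API_TOKEN=r8_your_token_here
```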
import replicate

# Generate an image with Stable Diffusion XL
output = replicate.run(
    "stability-ai/sdxl:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b",
    input={
        "prompt": "A photorealistic cat wearing a space suit",
        "width": 1024,
        "height": 1024,
        "num_outputs": 1,
    },
)
print(output[0])  # URL of the generated image
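Replicate's delivery URLs are temporary, so persist any output you want to keep. A minimal sketch of downloading a result, assuming the output item is a plain URL string (`save_output` is a hypothetical helper, not part of the replicate SDK):

```python
from pathlib import Path
import urllib.request

def save_output(url: str, dest: str) -> Path:
    # Hypothetical helper: fetch a Replicate output URL and write it
    # to dest. Delivery URLs expire, so save outputs promptly.
    path = Path(dest)
    with urllib.request.urlopen(url) as resp:
        path.write_bytes(resp.read())
    return path

# Usage: save_output(output[0], "cat_in_space_suit.png")
```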
Running an LLM via Replicate
# Stream LLaMA 2 70B chat output token by token via Replicate
for event in replicate.stream(
    "meta/llama-2-70b-chat",
    input={
        "prompt": "Explain transformer architecture",
        "max_new_tokens": 512,
        "temperature": 0.7,
        "system_prompt": "You are a helpful ML engineer.",
    },
):
    print(str(event), end="")
Async and batch requests
import asyncio
import replicate

async def run_batch_inference(prompts: list[str]) -> list:
    # Start one prediction per prompt and await them all concurrently
    tasks = [
        replicate.async_run(
            "meta/llama-2-70b-chat",
            input={"prompt": p, "max_new_tokens": 256},
        )
        for p in prompts
    ]
    results = await asyncio.gather(*tasks)
    return results
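Firing every prompt at once can hit API rate limits on large batches. A minimal sketch of capping in-flight requests with a semaphore (`gather_bounded` is a hypothetical helper; pass it any async callable, such as a wrapper around `replicate.async_run`):

```python
import asyncio

async def gather_bounded(coro_fn, items, limit: int = 8):
    # Run coro_fn over items concurrently, but with at most `limit`
    # calls in flight at any moment. Results keep input order.
    sem = asyncio.Semaphore(limit)

    async def one(item):
        async with sem:
            return await coro_fn(item)

    return await asyncio.gather(*(one(i) for i in items))

# Usage sketch:
# results = await gather_bounded(
#     lambda p: replicate.async_run(
#         "meta/llama-2-70b-chat",
#         input={"prompt": p, "max_new_tokens": 256},
#     ),
#     prompts,
#     limit=4,
# )
```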
When to use Replicate
Replicate is optimal for: prototyping (no dedicated GPU required), irregular workloads (no point in keeping a GPU running 24/7), and access to models that are hard to deploy yourself (large diffusion models, for example). For steady, high-volume workloads, self-hosting via Hugging Face or vLLM is typically 5-10 times cheaper.
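The trade-off comes down to a break-even calculation between per-second billing and a flat monthly GPU cost. A sketch with purely hypothetical prices (check current pricing; `cheaper_option` and all rates here are illustrative assumptions):

```python
def cheaper_option(requests_per_month: int,
                   seconds_per_request: float,
                   per_second_rate: float,
                   dedicated_monthly: float) -> str:
    # Hypothetical rates for illustration only; check current pricing.
    # Pay-per-use cost: billed GPU seconds times the per-second rate.
    pay_per_use = requests_per_month * seconds_per_request * per_second_rate
    return "replicate" if pay_per_use < dedicated_monthly else "dedicated"
```

At low utilization pay-per-use wins; once monthly GPU seconds grow large enough, a dedicated deployment becomes cheaper.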