# Fireworks AI Integration for LLM Inference
Fireworks AI specializes in optimized inference for open-source models with LoRA adapter support. Its distinctive feature is serverless deployment that can serve hundreds of concurrent LoRA adapters on top of a single base model — efficient for SaaS products that need per-customer customization.
## Basic Integration
```python
import os

from openai import OpenAI

# Fireworks exposes an OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.environ["FIREWORKS_API_KEY"],
    base_url="https://api.fireworks.ai/inference/v1",
)

# Text request
response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[{"role": "user", "content": "Explain transformers"}],
    temperature=0.1,
    max_tokens=2048,
)
print(response.choices[0].message.content)
```
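Calls to the API can fail transiently (rate limits, timeouts), so a small retry helper is worth adding early. This is a minimal sketch that retries on any exception for illustration; in practice you would narrow the `except` clause to your SDK's rate-limit and timeout error types.

```python
import time


def with_retries(fn, retries: int = 3, backoff: float = 1.0):
    """Call fn(), retrying with exponential backoff on failure.

    Catching bare Exception is for illustration only — narrow this to the
    transient error classes raised by your client library.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts, surface the error
            time.sleep(backoff * 2 ** attempt)
```

Usage: `with_retries(lambda: client.chat.completions.create(...))`.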
```python
# Function calling
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # model tuned for function calling
    messages=[{"role": "user", "content": "Weather in Moscow?"}],
    tools=tools,
    tool_choice="auto",
)
```
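When the model decides to call a tool, the response carries `tool_calls`, and each call's arguments arrive as a JSON string that your code must parse and execute. A minimal dispatch sketch — the local `get_weather` implementation and the registry are hypothetical stand-ins:

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical local tool — a real app would call a weather service."""
    return {"city": city, "temp_c": 21}


TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the requested tool and return its result as a JSON string."""
    args = json.loads(arguments_json)  # arguments come back as a JSON string
    result = TOOL_REGISTRY[name](**args)
    return json.dumps(result)
```

In a full loop you would read `response.choices[0].message.tool_calls`, append each result as a `{"role": "tool", "tool_call_id": ..., "content": ...}` message, and call the API again so the model can compose its final answer.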
## Serverless LoRA
```python
# Distinctive Fireworks feature: deploy a LoRA adapter without a dedicated GPU —
# well suited to multi-tenant applications.

# Upload the LoRA adapter via the Fireworks SDK/CLI
import fireworks.client as fw
fw.api_key = os.environ["FIREWORKS_API_KEY"]

# After fine-tuning, the adapter is available through the regular
# OpenAI-compatible API
response = client.chat.completions.create(
    model="accounts/your-account/models/your-lora-adapter",  # your LoRA
    messages=[{"role": "user", "content": "Request"}],
)
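For the multi-tenant SaaS case, a common pattern is to route each customer's requests to their own adapter while falling back to the shared base model. A minimal sketch — the account and adapter names below are hypothetical:

```python
# Shared base model used when a tenant has no custom adapter
BASE_MODEL = "accounts/fireworks/models/llama-v3p1-8b-instruct"

# Hypothetical tenant-id -> LoRA adapter mapping
TENANT_ADAPTERS = {
    "acme": "accounts/your-account/models/acme-support-lora",
    "globex": "accounts/your-account/models/globex-support-lora",
}


def model_for_tenant(tenant_id: str) -> str:
    """Pick the tenant's adapter, falling back to the shared base model."""
    return TENANT_ADAPTERS.get(tenant_id, BASE_MODEL)
```

At request time you pass `model=model_for_tenant(tenant_id)` into `client.chat.completions.create(...)`; because the adapters are serverless, each tenant gets customized behavior without a dedicated GPU per customer.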
## Streaming and JSON Mode
```python
# JSON mode
response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[{"role": "user", "content": "Return user data in JSON"}],
    response_format={"type": "json_object"},
)

# Streaming: pass stream=True and iterate over the chunks
stream = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[{"role": "user", "content": "Long answer"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")
```
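JSON mode constrains the model to emit valid JSON, but it does not guarantee your schema, so the reply still needs parsing and validation. A small sketch — the required keys here are illustrative, not part of any Fireworks contract:

```python
import json


def parse_json_reply(content: str, required_keys=("name", "email")) -> dict:
    """Parse a JSON-mode reply and check that expected keys are present.

    The default required_keys are illustrative; adjust them to your schema.
    """
    data = json.loads(content)
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise ValueError(f"model reply missing keys: {missing}")
    return data
```

Usage: `user = parse_json_reply(response.choices[0].message.content)`.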
## Popular Fireworks AI Models
| Model | Specialization |
|---|---|
| llama-v3p1-405b-instruct | Maximum quality |
| llama-v3p1-70b-instruct | Balance |
| llama-v3p1-8b-instruct | Fast |
| firefunction-v2 | Function calling |
| mixtral-8x22b-instruct | Long context |
## Timeline
- Basic integration: 0.5 day
- LoRA fine-tuning + deployment: 3–5 days
- Multi-tenant architecture with LoRA: 2 weeks