# Together AI Integration for Running Open LLMs
Together AI provides cloud inference for 200+ open models, including Llama 3.1, Mistral, Qwen, DeepSeek, and Yi. Its OpenAI-compatible API lets you migrate existing code without rewriting it. Key advantages: running any open-source model without your own GPU infrastructure, and fine-tuning models on your own data.
## Basic Integration
```python
import os

from openai import OpenAI

# Together exposes an OpenAI-compatible API, so the OpenAI SDK works as-is
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)

# Model selection by use case
MODELS = {
    "quality": "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    "balanced": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    "fast": "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    "code": "Qwen/Qwen2.5-Coder-32B-Instruct",
    "reasoning": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
}

response = client.chat.completions.create(
    model=MODELS["balanced"],
    messages=[{"role": "user", "content": "Task"}],
    temperature=0.1,
    max_tokens=2048,
)
print(response.choices[0].message.content)
```
## Fine-tuning Your Own Models
```python
# Together supports fine-tuning open models on your own data
# (module-level API of the legacy `together` SDK)
import together

together.api_key = "TOGETHER_API_KEY"

# Upload dataset (JSONL format: {"prompt": "...", "completion": "..."})
file_response = together.Files.upload(file="training_data.jsonl")
file_id = file_response["id"]

# Start fine-tuning
ft_response = together.Finetune.create(
    training_file=file_id,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
    n_epochs=3,
    batch_size=16,
    learning_rate=1e-5,
    suffix="my-custom-model",
)
ft_job_id = ft_response["id"]

# Check status
status = together.Finetune.retrieve(ft_job_id)
print(status["status"])  # "running" | "completed" | "failed"
## Embeddings
```python
response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",  # strong open embedding model for search
    input=["First text", "Second text"],
)
embeddings = [item.embedding for item in response.data]
```
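For search, the returned vectors are typically compared by cosine similarity. A dependency-free sketch (the `top_k` helper is illustrative, not part of any SDK):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_emb, doc_embs, k=3):
    """Indices of the k document embeddings most similar to the query."""
    scored = sorted(
        range(len(doc_embs)),
        key=lambda i: cosine_similarity(query_emb, doc_embs[i]),
        reverse=True,
    )
    return scored[:k]
```

In practice you would embed the query with the same `BAAI/bge-large-en-v1.5` model and rank the stored document embeddings with `top_k`; for large corpora a vector index replaces the linear scan.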
## Model Comparison on Together AI
| Model | Quality | Speed (tokens/s) | Cost ($ / 1M tokens) |
|---|---|---|---|
| Llama 3.1 405B | Excellent | ~50 | $3.50 |
| Llama 3.1 70B | Very Good | ~150 | $0.88 |
| Llama 3.1 8B | Good | ~400 | $0.18 |
| Qwen2.5-Coder 32B | Code-specific | ~120 | $0.80 |
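The table translates into a quick back-of-envelope cost estimator. A sketch that assumes a single blended price per token (real billing may price input and output tokens separately; figures are copied from the table above):

```python
# $ per 1M tokens, from the comparison table
PRICE_PER_M = {
    "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo": 3.50,
    "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo": 0.88,
    "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo": 0.18,
    "Qwen/Qwen2.5-Coder-32B-Instruct": 0.80,
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Rough dollar cost, assuming one blended rate for input and output."""
    return (input_tokens + output_tokens) / 1_000_000 * PRICE_PER_M[model]
```

This makes it easy to sanity-check a model choice: e.g. one million total tokens on the 70B model costs roughly $0.88 versus $3.50 on the 405B model.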
## Timeline
- Basic integration: 0.5 day
- Fine-tuning pipeline: 3–5 days (+ training time)
- A/B testing models: 1–2 days