Together AI is a cloud platform for inference and fine-tuning of open LLMs. It provides access to 200+ models via an OpenAI-compatible API, allowing quick switching without code changes.

Which models are available on Together AI?

All popular open models are available: Llama 3.1, Mistral, Qwen, DeepSeek, Yi, as well as specialized models for code and reasoning. You can select a model for your specific task.

How fast can Together AI be integrated into an existing project?

Thanks to full compatibility with the OpenAI SDK, integration takes 0.5 days. Simply replace the base URL and API key — the code remains unchanged.

How does model fine-tuning work on Together AI?

You upload a dataset in JSONL format and launch a job via the Python SDK. The platform automatically selects hyperparameters. We assist with data preparation and quality evaluation.

What are the advantages over other cloud services?

Together AI offers lower inference costs due to GPU optimization and flexible routing between models. Additionally, fine-tuning and A/B testing are available.

Together AI is a cloud platform for inference and fine-tuning of open LLMs. It provides access to 200+ models via an OpenAI-compatible API, allowing quick switching without code changes.

Which models are available on Together AI?

All popular open models are available: Llama 3.1, Mistral, Qwen, DeepSeek, Yi, as well as specialized models for code and reasoning. You can select a model for your specific task.

How fast can Together AI be integrated into an existing project?

Thanks to full compatibility with the OpenAI SDK, integration takes 0.5 days. Simply replace the base URL and API key — the code remains unchanged.

How does model fine-tuning work on Together AI?

You upload a dataset in JSONL format and launch a job via the Python SDK. The platform automatically selects hyperparameters. We assist with data preparation and quality evaluation.

What are the advantages over other cloud services?

Together AI offers lower inference costs due to GPU optimization and flexible routing between models. Additionally, fine-tuning and A/B testing are available.

Together AI Integration: Turnkey Open LLM Deployment

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

8+Years of workmore info 900+Completed projectsmore info 100+In house employeesmore info 19+Partnersmore info

Services we offer

Showing 1 of 1All 1564 services

Together AI Integration: Turnkey Open LLM Deployment

Simple

~1 day

Frequently Asked Questions

AI Development Areas

Discuss your AI project

Free consultation — we'll show you how AI can solve your challenge

Get a quote

We'll estimate the budget and timeline for your AI project

AI Solution Development Stages

Latest works

B2B ADVANCE company website development
1358
Development of a web application for FEEDME
1251
Website development for BELFINGROUP
957
Development of an online store for the company FURNORO
1188
B2B Advance company logo design
646
Development of a web application for Enviok
929

Show more works

Deploying your own GPU infrastructure for inference of modern LLMs is an expensive endeavor. A100/H100 clusters require budgets starting from millions of rubles and a team of MLOps engineers. Together AI removes this barrier: we take over the integration of your services with the platform. We connect an OpenAI-compatible API, configure routing between models, optimize p99 latency — and you get a ready-made endpoint for your tasks. With over 5 years of experience, we have completed more than 30 LLM integration projects in production — we will evaluate your project within a day.

Case study: for a fintech startup, we deployed a RAG system on Together AI in 3 days. Previously, they paid for dedicated GPU instances on AWS — Together reduced costs by 5x, p99 latency did not exceed 200 ms, and answer accuracy reached 94%. This is not an isolated example: according to the platform documentation, its use can reduce inference costs by up to 80%. Compared to renting GPU instances on AWS, Together AI provides up to 80% savings with comparable performance.

What problems does Together AI solve?

High GPU cost. Renting dedicated A100/H100 instances in the cloud results in tens of thousands of dollars monthly. Together AI uses a shared GPU pool, reducing inference cost by 5–10x compared to cloud instances. Deployment complexity: selecting CUDA version, optimizing via vLLM or TensorRT-LLM — engineering hours. The platform handles all low-level optimizations. Lack of flexibility: balancing quality and speed — we implement a router that automatically switches the model depending on the context.

Basic integration

from openai import OpenAI, AsyncOpenAI

# Together uses OpenAI SDK
client = OpenAI(
    api_key="TOGETHER_API_KEY",
    base_url="https://api.together.xyz/v1",
)

# Model selection
MODELS = {
    "quality": "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    "balanced": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    "fast": "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    "code": "Qwen/Qwen2.5-Coder-32B-Instruct",
    "reasoning": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
}

response = client.chat.completions.create(
    model=MODELS["balanced"],
    messages=[{"role": "user", "content": "Task"}],
    temperature=0.1,
    max_tokens=2048,
)
print(response.choices[0].message.content)

How does model routing work?

The router analyzes the request: if the task requires deep reasoning — it routes to DeepSeek-R1, if speed is needed — to Llama 3.1 8B. This reduces average token cost by 30–40% without quality loss. Under the hood, we use dynamic thresholds based on latency and confidence.

More about routing configuration

We set threshold values through A/B testing on a representative sample of queries. The router then automatically adapts to load, switching models based on current latency and required quality.

How does fine-tuning work?

Fine-tuning open models on your own data improves accuracy and reduces hallucinations. Together AI provides infrastructure for LoRA and full fine-tuning. We help prepare the dataset, select hyperparameters, and conduct A/B testing. Example launch:

# Together allows fine-tuning open models on your own data
import together

together.api_key = "TOGETHER_API_KEY"

# Upload dataset (JSONL format: {"prompt": "...", "completion": "..."})
file_response = together.Files.upload(file="training_data.jsonl")
file_id = file_response["id"]

# Start fine-tuning
ft_response = together.Finetune.create(
    training_file=file_id,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
    n_epochs=3,
    batch_size=16,
    learning_rate=1e-5,
    suffix="my-custom-model",
)
ft_job_id = ft_response["id"]

# Check status
status = together.Finetune.retrieve(ft_job_id)
print(status["status"])  # "running" | "completed" | "failed"

Advantages of fine-tuning on Together AI

The platform manages all compute resources — no need to think about GPUs. Pre-trained LoRA weights are available, which can be customized to your data. We help with hyperparameter selection and metric evaluation on a validation dataset. Fine-tuning pays off by increasing accuracy and reducing the number of expensive tokens.

Embeddings and RAG

For Retrieval-Augmented Generation, we use embedding models such as BAAI/bge-large-en-v1.5. Together AI supports asynchronous embedding generation, which is important under high load:

response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",  # One of the best for search
    input=["First text", "Second text"],
)
embeddings = [item.embedding for item in response.data]

Model comparison on Together AI

Model	Quality	Speed (tokens/s)	Resource intensity
Llama 3.1 405B	Excellent	~50	Requires H100
Llama 3.1 70B	Very Good	~150	Medium
Llama 3.1 8B	Good	~400	Low
Qwen2.5-Coder 32B	Code-specific	~120	Medium
DeepSeek-R1 70B	Reasoning	~100	Medium

How to integrate Together AI in 3 steps?

Get an API key from the Together AI console.
Replace the base URL in your OpenAI client with https://api.together.xyz/v1.
Configure routing — we select models for your scenarios. Get your endpoint in half a day.

How long does integration take?

Basic API integration takes half a day: replacing the base URL and key. A full project with fine-tuning and RAG takes from 5 working days. The cost is calculated individually based on traffic volume and model complexity. Request integration — get a ready endpoint in 1 day.

Process overview

Stage	Duration
Scenario analysis	1–2 days
Model selection and testing	2–3 days
API integration and routing	1–2 days
Fine-tuning and A/B test	3–5 days
Deployment and optimization	2–3 days
Team training	1 day

What's included

Connect Together AI to your project (0.5 days)
Fine-tuning pipeline with A/B testing (3–5 days)
Embedding integration for RAG (1–2 days)
Cost optimization (routing, caching, batch processing)
Documentation and team training
Post-launch support (monitoring, tuning)

Contact us — we will evaluate your project and propose the optimal solution. We guarantee a 2–5x reduction in inference costs compared to renting GPUs. Additionally, read about LLMs — the foundation for all modern AI solutions.

LLM Development: Fine-Tuning, RAG, Agents, and Production Deployment

Using GPT‑4 or Claude 3.5 Sonnet through a public API is not a solution — it's just a tool. When the requirement is to "make it like ChatGPT, but on our data," there is a real engineering challenge behind it: from prompt engineering to training a 70B model on your own infrastructure. End-to-end LLM solution development is a complex stack, and we have been doing it for over 5 years. During this time, we have completed over 20 projects in generative AI: from RAG systems for legal departments to custom support agents. Where exactly your task falls depends on data, latency requirements, budget, and how critical confidentiality is.

A typical situation: the client has already tried ChatGPT, but results are unstable — sometimes accurate, sometimes hallucinating. Or they need integration into a corporate portal while complying with security policies. Let's break down each layer of the stack in detail — from RAG to production deployment.

Why Do RAG Systems Break and How to Fix It?

RAG (Retrieval-Augmented Generation) looks simple: find relevant documents, put them in context, get an answer. In practice, it fails in several places.

Chunking without overlap. Classic mistake: chunk_size=512, overlap=0. If the answer lies across two chunks, retrieval won't find either with sufficient confidence. Solution: overlap 15–25% of chunk_size, or better yet, sentence-aware splitting with spaCy or NLTK instead of naive character splitting.

Poor embedder. text-embedding-ada-002 is good for general use, but on legal or medical texts, specialized models like E5-large-v2, BGE-M3, or fine-tuned sentence-transformers on domain data outperform it. Recall@5 differences can be 15–25%.

No re-ranking. Vector search optimizes for speed, not relevance. A cross-encoder re-ranker (ms-marco-MiniLM-L-6-v2, bge-reranker-large) after initial retrieval improves top-3 accuracy with acceptable latency (+50–150ms). This is often more impactful than improving the embedding model.

Hybrid search. Dense vectors alone work poorly on exact queries: names, SKUs, codes. BM25 (sparse) finds exact matches but misses semantics. Hybrid via RRF (Reciprocal Rank Fusion) is the optimal compromise. Qdrant, Weaviate, and pgvector 0.7+ support hybrid search natively.

Typical production architecture for a corporate knowledge base

Documents → preprocessing (PyMuPDF, Unstructured)
Chunking → embedding (BGE-M3)
Qdrant (hybrid dense+sparse)
Cross-encoder re-ranking
Context → LLM (vLLM or OpenAI API)
Answer with sources (RAGAS for quality evaluation)

When to Fine-Tune Instead of Prompt Engineering?

Prompt engineering solves ~70% of LLM adaptation tasks for a domain. The remaining 30% require fine-tuning. Three indicators: the model ignores a specific output format even with detailed prompting; the task requires deep knowledge of specialized vocabulary (medicine, law); you need to significantly reduce token costs by replacing a large model with a smaller specialized one.

LoRA and QLoRA are the standard for SFT. LoRA adds trainable low-rank matrices to attention layers. A typical configuration for Llama-3 8B: r=64, lora_alpha=128, target_modules=["q_proj","v_proj","k_proj","o_proj"] yields ~0.8% trainable parameters, training on one A100 40GB. QLoRA adds 4-bit quantization (NF4) and allows fine-tuning 70B models on two A100 40GB, though speed drops by half compared to bf16.

DPO instead of RLHF. Direct Preference Optimization requires only (chosen, rejected) pairs, not scalar reward signals. DPOTrainer from the trl library (Hugging Face) implements it in a few dozen lines.

Common mistake. A dataset of 500 examples, 5 epochs, validation loss 0.8 — seems fine. But on test, the model degrades on general instructions. Cause: catastrophic forgetting. Solution: add 10–20% general instruction-following examples (Alpaca, FLAN) to the training set to preserve original capabilities.

How to Choose a Base Model: 8B or 70B?

Model	Parameters	Strengths	Context
Llama-3.1 8B	8B	Quality/speed balance	128k
Llama-3.1 70B	70B	Complex reasoning	128k
Mistral 7B / Mixtral 8x7B	7B / 47B	Efficiency for size	32k
Qwen2.5 72B	72B	Code, multilingual	128k
Gemma 2 27B	27B	Open license	8k

For most tasks, fine-tuning an 8B model is sufficient. 70B is needed when deep reasoning is required or the 8B baseline does not reach the required quality even after fine-tuning. Inference cost for Llama-3 8B via vLLM on A100 is efficient; the exact cost depends on volume.

What Does PagedAttention Bring to Production?

vLLM is the first choice for serving open-source models. PagedAttention is the key technical innovation: KV-cache is managed like virtual memory in an OS, without fragmentation. This yields 2–4x higher throughput compared to naive HuggingFace Transformers inference. The vLLM documentation confirms that continuous batching and PagedAttention are the standard for high-load LLM services.

Typical numbers on A100 80GB for Llama-3 8B (bf16): 400–600 req/s, P50 latency 200–400ms, P99 latency 600–900ms at concurrency 64. For 70B on two A100 with tensor parallelism: 80–120 req/s, P99 latency 1.5–2.5s. AWQ or GPTQ quantization reduces memory consumption by 2x with quality loss within 1–3%.

Multi-Agent Systems

Agents are LLMs with access to tools: search, code execution, API calls, database interaction. Common patterns:

ReAct (Reason + Act): the model reasons → chooses a tool → observes the result → reasons again. LangChain and LlamaIndex implement it out of the box.
Multi-agent orchestration: multiple specialized agents with a coordinator on top. Example: coordinator → researcher (search + summarization) → coder (code generation and execution) → critic (verification). Tools: AutoGen (Microsoft), CrewAI, custom implementation on LangGraph.

In production, agent systems are non-deterministic. Essential: guardrails, step limits, logging of each step, human-in-the-loop for critical actions.

How We Work: Stages, Timeline, Deliverables

Stage	Duration	What You Get
Audit and data collection	1–2 weeks	Eval dataset of 100+ examples, task formalization
Baseline (prompt + RAG)	1–2 weeks	Working prototype, quality metrics
Fine-tuning (if needed)	2–4 weeks	Trained model, LoRA weights, model card
Deployment and monitoring	1–2 weeks	vLLM server, Grafana + Prometheus
Documentation and training	1 week	API documentation, team training

What Is Included

We deliver:

Technical documentation (model card, configs, deployment instructions)
Access to infrastructure (code repository, trained weights)
1 month of post-deployment support (consultations, bug fixes)
Customer team training (2–3 sessions on system operation)

Timeline: basic RAG prototype — 1–2 weeks. Fine-tuning with customer data — 3–6 weeks (including data preparation). Production system with monitoring and retraining — 2–4 months. Cost is calculated individually based on data volume, model complexity, and infrastructure requirements.

We guarantee the quality of the final model with performance benchmarks and ongoing monitoring. Our engineers have hands‑on experience with dozens of production LLM systems.

Want to evaluate your project? Leave a request — we will prepare a preliminary summary within 1–2 business days. Or get a consultation on choosing the approach: RAG, fine-tuning, or hybrid — we will tell you what works best for you. Contact us to discuss your LLM development needs. Schedule a free consultation today.