Multi-Agent AI System Development
A multi-agent system (MAS) is an architecture in which multiple specialized AI agents cooperate to solve complex tasks that a single agent handles poorly. Dividing responsibility reduces the complexity of each individual agent, improves quality on specialized subtasks, and enables horizontal scaling.
Multi-Agent System Architectures
Supervisor (Orchestrator): a central agent distributes tasks among specialized agents and aggregates results. Easy to manage, but the orchestrator is a bottleneck and a single point of failure.
Peer-to-peer (P2P): agents communicate directly without central coordinator. More resilient to failures, harder to debug.
Hierarchical: multi-level structure — upper-level agents manage lower-level agents. Used for very complex workflows.
Pipeline: each agent executes its part and passes result to next. Simple, predictable, limited by linearity.
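The pipeline pattern above is the simplest to sketch: each agent is just a function over a shared state, applied in order. The agent names and state keys below are illustrative stubs, not part of any framework:

```python
from typing import Callable

# Minimal pipeline pattern: each agent transforms the shared state dict.
State = dict

def research(state: State) -> State:
    return {**state, "facts": f"facts about {state['task']}"}

def write(state: State) -> State:
    return {**state, "draft": f"report using {state['facts']}"}

def review(state: State) -> State:
    return {**state, "approved": "facts" in state["draft"]}

def run_pipeline(state: State, agents: list[Callable[[State], State]]) -> State:
    # Linearity: each agent sees the accumulated state of all previous agents.
    for agent in agents:
        state = agent(state)
    return state

result = run_pipeline({"task": "Q1 revenue"}, [research, write, review])
```

The limitation is visible in the loop itself: there is no branching or retry, which is exactly what the supervisor pattern below adds.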
Supervisor Pattern Implementation with LangGraph
from typing import TypedDict
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END

class MultiAgentState(TypedDict):
    messages: list
    current_task: str
    task_result: str
    next_agent: str

# Specialized agents
def researcher_agent(state: MultiAgentState) -> MultiAgentState:
    """Agent for information retrieval."""
    llm = ChatOpenAI(model="gpt-4o")
    task = state["current_task"]
    # Retrieval + analysis (a configured `retriever` is assumed to exist)
    docs = retriever.invoke(task)
    context = "\n".join(d.page_content for d in docs)
    result = llm.invoke([
        HumanMessage(content=f"Research task: {task}\n\nContext:\n{context}\n\nProvide key facts:")
    ]).content
    return {**state, "task_result": result, "next_agent": "writer"}

def writer_agent(state: MultiAgentState) -> MultiAgentState:
    """Agent for text writing."""
    llm = ChatOpenAI(model="gpt-4o")
    research = state["task_result"]
    original_task = state["current_task"]
    result = llm.invoke([
        HumanMessage(content=f"Write an answer to the task: {original_task}\n\nMaterials: {research}")
    ]).content
    return {**state, "task_result": result, "next_agent": "reviewer"}

def reviewer_agent(state: MultiAgentState) -> MultiAgentState:
    """Agent for quality checking."""
    llm = ChatOpenAI(model="gpt-4o")
    draft = state["task_result"]
    review = llm.invoke([
        HumanMessage(content=f"""Check the following text for:
1. Factual errors
2. Completeness
3. Structure and clarity

Text: {draft}

If everything is OK, reply "APPROVED". Otherwise, specify the edits.""")
    ]).content
    if "APPROVED" in review:
        return {**state, "next_agent": "complete"}
    # Send the review comments back to the writer for revision
    return {**state, "task_result": review, "next_agent": "writer"}

def supervisor_agent(state: MultiAgentState) -> MultiAgentState:
    """Orchestrator: determines the first agent for the task."""
    return {**state, "next_agent": "researcher"}

def route_agent(state: MultiAgentState) -> str:
    return state["next_agent"]

# Build the graph: routing is driven by the next_agent field
graph = StateGraph(MultiAgentState)
graph.add_node("supervisor", supervisor_agent)
graph.add_node("researcher", researcher_agent)
graph.add_node("writer", writer_agent)
graph.add_node("reviewer", reviewer_agent)
graph.set_entry_point("supervisor")
graph.add_conditional_edges("supervisor", route_agent)
graph.add_conditional_edges("researcher", route_agent)
graph.add_conditional_edges("writer", route_agent)
graph.add_conditional_edges("reviewer", lambda s: END if s["next_agent"] == "complete" else s["next_agent"])
mas = graph.compile()
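The routing mechanics of this graph can be exercised without any LLM calls. Below is a stand-alone sketch with stub agents: only the `next_agent`-driven control flow mirrors the graph above, and the revision loop (reviewer sends the draft back to the writer once before approving) is simulated with a counter:

```python
# Stub version of the supervisor graph's routing: next_agent drives control flow.
def make_reviewer(max_revisions: int):
    calls = {"n": 0}
    def reviewer(state: dict) -> dict:
        calls["n"] += 1
        # Approve only after max_revisions rounds, to exercise the writer loop.
        if calls["n"] > max_revisions:
            return {**state, "next_agent": "complete"}
        return {**state, "task_result": "needs edits", "next_agent": "writer"}
    return reviewer

def run(state: dict, nodes: dict) -> dict:
    node = "supervisor"
    while node != "complete":
        state = nodes[node](state)
        node = state["next_agent"]
    return state

nodes = {
    "supervisor": lambda s: {**s, "next_agent": "researcher"},
    "researcher": lambda s: {**s, "task_result": "facts", "next_agent": "writer"},
    "writer": lambda s: {**s, "task_result": "draft", "next_agent": "reviewer"},
    "reviewer": make_reviewer(max_revisions=1),
}

final = run({"current_task": "demo", "task_result": "", "next_agent": ""}, nodes)
```

A bounded revision counter like this is worth keeping even in the real graph: without it, a reviewer that never emits "APPROVED" loops forever.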
CrewAI: High-Level Framework
from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI

# search_tool, calculator_tool, db_query_tool and document_writer_tool
# are assumed to be defined elsewhere

# Define agents with roles
analyst = Agent(
    role="Financial Analyst",
    goal="Analyze financial data and identify trends",
    backstory="Experienced financial analyst with 10 years in investment banking",
    tools=[search_tool, calculator_tool, db_query_tool],
    llm=ChatOpenAI(model="gpt-4o"),
    verbose=True,
)

report_writer = Agent(
    role="Report Writer",
    goal="Create professional financial reports",
    backstory="Business communication specialist with finance experience",
    tools=[document_writer_tool],
    llm=ChatOpenAI(model="gpt-4o"),
)

fact_checker = Agent(
    role="Fact Checker",
    goal="Verify all numbers and statements in the report",
    backstory="Meticulous auditor focused on numerical accuracy",  # backstory is required
    tools=[search_tool, calculator_tool],
    llm=ChatOpenAI(model="gpt-4o"),
)

# Tasks ({company} and {period} are interpolated from kickoff inputs)
analysis_task = Task(
    description="Analyze financial metrics for {company} in {period}",
    expected_output="JSON with KPIs: revenue, EBITDA, net_profit, growth_rates",
    agent=analyst,
)

report_task = Task(
    description="Create an investment memorandum based on the analysis",
    expected_output="PDF-ready investment memorandum text",
    agent=report_writer,
    context=[analysis_task],  # receives the output of analysis_task
)

# Crew
crew = Crew(
    agents=[analyst, report_writer, fact_checker],
    tasks=[analysis_task, report_task],
    process=Process.sequential,
    verbose=True,
)

result = crew.kickoff(inputs={"company": "Example Inc", "period": "Q1 2026"})
Practical Case: Due Diligence Automation System
Task: comprehensive verification of a target company in an M&A transaction, covering legal, financial, operational, and HR analysis.
Agent composition:
- Financial Analyst Agent: IFRS/GAAP reporting analysis
- Legal Agent: contract verification, litigation checks
- HR Agent: personnel structure analysis, turnover, key staff
- Risk Agent: consolidated risk analysis
- Report Agent: final Due Diligence report
Infrastructure: LangGraph orchestration, with each agent given access to its own specialized RAG index.
Results:
- DD timeline reduction: from 4 weeks to 3 days (standard cases)
- Coverage (aspect verification ratio): 78% → 94%
- Human-in-the-loop: final validation of each section
- Cost per DD: 67% reduction
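Wiring each agent to its own RAG index can be as simple as a mapping from agent to retriever. The sketch below is illustrative: `StubRetriever` stands in for a real vector-store retriever, and the corpus contents and agent names are invented:

```python
# Illustrative mapping of each DD agent to its own retrieval index.
class StubRetriever:
    def __init__(self, corpus: list[str]):
        self.corpus = corpus

    def invoke(self, query: str) -> list[str]:
        # Naive keyword match in place of vector similarity search.
        return [doc for doc in self.corpus if query.lower() in doc.lower()]

AGENT_INDEXES = {
    "financial": StubRetriever(["IFRS statements 2025", "GAAP reconciliation"]),
    "legal": StubRetriever(["Supply contract v3", "Litigation docket"]),
    "hr": StubRetriever(["Org chart", "Turnover report"]),
}

def retrieve_for(agent: str, query: str) -> list[str]:
    # Each agent searches only its own specialized index.
    return AGENT_INDEXES[agent].invoke(query)

docs = retrieve_for("legal", "contract")
```

Keeping indexes separate per agent both narrows retrieval noise and makes access control (e.g. HR data visible only to the HR agent) straightforward.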
Agent Communication
# Pattern: agents pass structured messages through shared state
import asyncio
from asyncio import Queue
from dataclasses import dataclass

@dataclass
class AgentMessage:
    source_agent: str
    target_agent: str
    message_type: str  # "request", "result", "error"
    content: dict
    priority: int = 0

# Message queue for async communication: one inbox queue per agent
class AgentCommunicationBus:
    def __init__(self):
        self.queues: dict[str, Queue] = {}

    def register_agent(self, agent_id: str):
        self.queues[agent_id] = Queue()

    async def send(self, msg: AgentMessage):
        await self.queues[msg.target_agent].put(msg)

    async def receive(self, agent_id: str) -> AgentMessage:
        return await self.queues[agent_id].get()
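A minimal end-to-end use of the bus looks as follows; the bus and message classes are restated compactly so the demo runs stand-alone, and the agent names are illustrative:

```python
import asyncio
from asyncio import Queue
from dataclasses import dataclass

@dataclass
class AgentMessage:
    source_agent: str
    target_agent: str
    message_type: str
    content: dict
    priority: int = 0

class AgentCommunicationBus:
    def __init__(self):
        self.queues: dict[str, Queue] = {}

    def register_agent(self, agent_id: str):
        self.queues[agent_id] = Queue()

    async def send(self, msg: AgentMessage):
        await self.queues[msg.target_agent].put(msg)

    async def receive(self, agent_id: str) -> AgentMessage:
        return await self.queues[agent_id].get()

async def demo() -> dict:
    bus = AgentCommunicationBus()
    bus.register_agent("writer")
    # A researcher pushes its result into the writer's inbox queue.
    await bus.send(AgentMessage("researcher", "writer", "result", {"facts": ["x"]}))
    msg = await bus.receive("writer")
    return msg.content

content = asyncio.run(demo())
```

Because `receive` awaits on the target's queue, agents block only on their own inbox, which keeps producers and consumers decoupled.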
Timeline
- MAS architecture design: 1–2 weeks
- Basic agent development (3–5): 3–5 weeks
- Communication integration and testing: 2–3 weeks
- Monitoring and production hardening: 1–2 weeks
- Total: 7–12 weeks