Development of Autonomous AI Request Processing System
An autonomous request processing system is an AI orchestrator that accepts incoming requests from multiple channels (email, forms, API, messengers), classifies them, extracts data, executes processing logic, and returns a response or creates tasks in business systems, with no operator involvement for standard cases.
Unlike a simple chatbot or single-tool agent, an autonomous system includes a complete cycle: intake → understanding → data enrichment → execution → notification → monitoring.
System Architecture
- Input channels: webhook (email parser), REST API, Telegram/WhatsApp bot, web form.
- Processing core: LangGraph state graph, classifier, executors, aggregator.
- Output channels: REST APIs of external systems (CRM, ERP, Service Desk), email/push notifications, task queue (Celery/Redis).
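Each channel delivers a different payload shape, so the first step is to normalize everything into the shared state before the graph runs. A minimal sketch; the per-channel payload keys shown here are assumptions, not a fixed contract:

```python
from datetime import datetime, timezone

def normalize_incoming(channel: str, payload: dict) -> dict:
    """Map a channel-specific payload onto the shared RequestState fields."""
    extractors = {
        # Each channel carries the text and the sender under different keys
        "email": lambda p: (p["body"], p["from"]),
        "telegram": lambda p: (p["message"]["text"], str(p["message"]["chat"]["id"])),
        "form": lambda p: (p["message"], p["email"]),
        "api": lambda p: (p["content"], p["client_id"]),
    }
    content, sender = extractors[channel](payload)
    return {
        "raw_content": content,
        "channel": channel,
        "sender_id": sender,
        "received_at": datetime.now(timezone.utc),
        "requires_human": False,
        "executed_actions": [],
    }
```

The graph then receives an identical state regardless of whether the request arrived by email or through the API.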
```python
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver
from typing import TypedDict, Annotated, Optional
from datetime import datetime
import operator

class RequestState(TypedDict):
    # Incoming request
    raw_content: str
    channel: str                      # "email", "api", "telegram", "form"
    sender_id: str
    received_at: datetime

    # Classification
    request_type: Optional[str]       # One of the RequestClassification labels, e.g. "order_status"
    urgency: Optional[str]            # "critical", "high", "normal", "low"
    confidence: Optional[float]

    # Enrichment
    user_profile: Optional[dict]
    related_entities: Optional[list]  # Related orders, contracts, tickets

    # Processing
    action_plan: Optional[list[dict]]
    executed_actions: Annotated[list, operator.add]  # Add-reducer: nodes append, never overwrite
    requires_human: bool
    human_reason: Optional[str]

    # Result
    response_draft: Optional[str]
    outcome: Optional[str]
    processing_time_ms: Optional[int]
```
Request Classifier
```python
from langchain_openai import ChatOpenAI
from pydantic import BaseModel
from typing import Literal, Optional

class RequestClassification(BaseModel):
    request_type: Literal[
        "support_technical", "support_billing", "order_new",
        "order_status", "complaint", "refund_request", "general_inquiry",
    ]
    urgency: Literal["critical", "high", "normal", "low"]
    confidence: float
    extracted_entities: dict  # Order number, email, amount, etc.
    requires_human: bool
    human_reason: Optional[str] = None
    summary: str

llm = ChatOpenAI(model="gpt-4o", temperature=0)

def classify_request(state: RequestState) -> dict:
    result = llm.with_structured_output(RequestClassification).invoke(
        f"""Classify the incoming request.
Channel: {state['channel']}
Request: {state['raw_content']}

Escalate to a human if:
- Legal threats or mention of litigation
- Refund request for an amount over 50,000 rubles
- Mention of physical damage
- Emotionally charged review with public threats"""
    )
    # Return only the updated keys: executed_actions uses an add-reducer,
    # so spreading the whole state back would duplicate its entries.
    return {
        "request_type": result.request_type,
        "urgency": result.urgency,
        "confidence": result.confidence,
        "requires_human": result.requires_human,
        "human_reason": result.human_reason,
    }
```
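Structured-output calls can fail (timeouts, schema validation errors), and the safe default for an unclassified request is escalation, not a guess. A possible wrapper around the node above; the injectable `classifier` parameter is an addition of this sketch, mainly so the fallback path can be exercised offline:

```python
def classify_with_fallback(state: dict, classifier=None) -> dict:
    """Run the classifier node; on any failure, escalate instead of guessing.

    `classifier` defaults to the classify_request node defined above.
    """
    try:
        return (classifier or classify_request)(state)
    except Exception as e:
        # Conservative defaults: never auto-process an unclassified request
        return {
            "request_type": "general_inquiry",
            "urgency": "normal",
            "confidence": 0.0,
            "requires_human": True,
            "human_reason": f"Classifier failure: {type(e).__name__}: {e}",
        }
```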
Data Enrichment
```python
import asyncio

async def enrich_request(state: RequestState) -> dict:
    """Loads user context and related entities (crm, order_service and
    helpdesk are the company's async API clients)."""
    # Parallel data loading; return_exceptions keeps one failed source
    # from breaking the whole enrichment step
    user_profile, orders, tickets = await asyncio.gather(
        crm.get_user_profile(state["sender_id"]),
        order_service.get_recent_orders(state["sender_id"], limit=5),
        helpdesk.get_open_tickets(state["sender_id"]),
        return_exceptions=True,
    )
    related_entities = []
    if not isinstance(orders, Exception):
        related_entities.extend([{"type": "order", **o} for o in orders])
    if not isinstance(tickets, Exception):
        related_entities.extend([{"type": "ticket", **t} for t in tickets])
    return {
        "user_profile": user_profile if not isinstance(user_profile, Exception) else {},
        "related_entities": related_entities,
    }
```
Action Planning and Execution
```python
def plan_actions(state: RequestState) -> dict:
    """Creates an action plan based on the request type"""
    action_templates = {
        "order_status": [
            {"action": "query_order_db", "params": {"order_id": "{extracted_order_id}"}},
            {"action": "generate_status_response", "params": {}},
            {"action": "send_response", "params": {}},
        ],
        "refund_request": [
            {"action": "verify_refund_eligibility", "params": {}},
            {"action": "create_refund_ticket", "params": {}},
            {"action": "notify_finance_team", "params": {}},
            {"action": "send_confirmation", "params": {}},
        ],
        "support_technical": [
            {"action": "search_knowledge_base", "params": {}},
            {"action": "generate_solution", "params": {}},
            {"action": "create_ticket_if_unsolved", "params": {}},
            {"action": "send_response", "params": {}},
        ],
    }
    # Fallback for types without a template: answer generically and queue a review
    base_plan = action_templates.get(state["request_type"], [
        {"action": "generate_generic_response", "params": {}},
        {"action": "create_manual_review_task", "params": {}},
    ])
    return {"action_plan": base_plan}
```
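The `"{extracted_order_id}"` placeholder in the templates still has to be filled from the entities the classifier extracted. One way to resolve it before execution; the `{key}` placeholder convention is an assumption of this sketch:

```python
def resolve_params(plan: list[dict], entities: dict) -> list[dict]:
    """Replace "{key}" placeholders in action params with extracted entity values."""
    resolved = []
    for action in plan:
        params = {}
        for name, value in action.get("params", {}).items():
            if isinstance(value, str) and value.startswith("{") and value.endswith("}"):
                params[name] = entities.get(value[1:-1])  # None if the entity is missing
            else:
                params[name] = value
        resolved.append({**action, "params": params})
    return resolved
```

Resolving eagerly, before execution, also makes a missing entity visible as an explicit `None` parameter rather than a failure deep inside a handler.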
```python
async def execute_actions(state: RequestState) -> dict:
    """Sequential execution of planned actions.
    action_registry maps action names to async handlers."""
    executed = []
    for action in state["action_plan"]:
        action_name = action["action"]
        params = action.get("params", {})
        try:
            result = await action_registry[action_name](state, **params)
            executed.append({"action": action_name, "status": "success", "result": result})
        except Exception as e:
            executed.append({"action": action_name, "status": "failed", "error": str(e)})
            # If a critical action fails, stop and escalate to a human
            if action.get("critical", False):
                return {"executed_actions": executed, "requires_human": True,
                        "human_reason": f"Critical action error: {e}"}
    return {"executed_actions": executed}
```
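Most action failures come from transient faults in the external systems behind the handlers, so a retry layer in front of the `action_registry` calls can cut down false escalations. A sketch with exponential backoff; the attempt count and delays are illustrative:

```python
import asyncio

async def with_retries(handler, *args, attempts: int = 3, base_delay: float = 0.5, **kwargs):
    """Run an async action handler, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await handler(*args, **kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of retries: let execute_actions record the failure
            await asyncio.sleep(base_delay * 2 ** attempt)
```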
Processing Graph
```python
def route_after_classification(state: RequestState) -> str:
    if state["requires_human"]:
        return "escalate_to_human"
    if state["confidence"] < 0.6:
        return "escalate_to_human"  # Low classification confidence
    return "enrich"

def route_after_enrichment(state: RequestState) -> str:
    # VIP users with urgent requests get the premium plan
    if state.get("user_profile", {}).get("tier") == "vip" and state["urgency"] in ("high", "critical"):
        return "plan_premium"
    return "plan"
```
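The graph registers `plan_premium_actions` as the `plan_premium` node, but its body is not shown; one plausible shape is the standard plan with an account-manager notification prepended and every step flagged as priority. The extra action name and the `priority` flag are hypothetical:

```python
def plan_premium_actions(state: dict) -> dict:
    """Hypothetical sketch of the plan_premium node."""
    standard = {  # subset of the action_templates used by plan_actions
        "refund_request": [
            {"action": "verify_refund_eligibility", "params": {}},
            {"action": "create_refund_ticket", "params": {}},
            {"action": "notify_finance_team", "params": {}},
            {"action": "send_confirmation", "params": {}},
        ],
    }
    base = standard.get(state["request_type"],
                        [{"action": "generate_generic_response", "params": {}}])
    # Notify the account manager first, then run the standard steps at priority
    plan = [{"action": "notify_account_manager", "params": {}}]
    plan += [{**a, "params": {**a["params"], "priority": True}} for a in base]
    return {"action_plan": plan}
```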
```python
graph = StateGraph(RequestState)
graph.add_node("classify", classify_request)
graph.add_node("enrich", enrich_request)
graph.add_node("plan", plan_actions)
graph.add_node("plan_premium", plan_premium_actions)
graph.add_node("execute", execute_actions)
graph.add_node("generate_response", generate_final_response)
graph.add_node("escalate_to_human", create_human_task)
graph.add_node("send_response", send_response_to_channel)

graph.set_entry_point("classify")
graph.add_conditional_edges("classify", route_after_classification)
graph.add_conditional_edges("enrich", route_after_enrichment)
graph.add_edge("plan", "execute")
graph.add_edge("plan_premium", "execute")
graph.add_edge("execute", "generate_response")
graph.add_edge("generate_response", "send_response")
graph.add_edge("send_response", END)
graph.add_edge("escalate_to_human", END)

# conn: a Postgres connection for checkpoint persistence
processor = graph.compile(checkpointer=PostgresSaver(conn))
```
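Once compiled, each request runs under its own `thread_id`, which is the key the Postgres checkpointer stores state against, and the wall-clock latency can be written back into `processing_time_ms`. A sketch of the calling side; the surrounding service code is assumed:

```python
import time

def run_request(processor, initial_state: dict, request_id: str) -> dict:
    """Invoke the compiled graph for one request and record wall-clock latency."""
    start = time.monotonic()
    final_state = processor.invoke(
        initial_state,
        config={"configurable": {"thread_id": request_id}},  # checkpoint key
    )
    final_state["processing_time_ms"] = int((time.monotonic() - start) * 1000)
    return final_state
```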
Practical Case: E-commerce, 2500 requests/day
Company: an online retailer with 2500 incoming requests daily and 12 operators.
Before the system: average first response time was 4.2 hours, night shifts stayed fully staffed despite low request volume, and operators spent 60% of their time on routine status requests.
Types of requests in the stream:
- Order status / tracking — 41%
- Return inquiries — 19%
- Technical issues — 14%
- General product questions — 17%
- Complaints and claims — 9%
After system implementation:
- Autonomous processing without operator involvement: 74%
- Average first response time: 4.2h → 2.1 minutes
- Night shift: reduced from 4 to 1 operator (monitoring escalations)
- Response accuracy (sample of 500 requests): 94.1%
- False escalations (unnecessary human transfers): 8.3%
- Incorrect auto-closure (required human): 2.1%
Launch challenges: the first 2 weeks went into retraining the classifier on real company data; initial classification accuracy rose from 81% to 94% after 500 corrections.
Monitoring and SLA Metrics
```python
class RequestMetrics:
    """Metrics for system monitoring (request_counter, processing_time and
    escalation_reason_counter are Prometheus collectors defined at module level)."""

    def track_request(self, state: RequestState):
        labels = {
            "channel": state["channel"],
            "request_type": state["request_type"],
            "outcome": "escalated" if state["requires_human"] else "automated",
        }
        request_counter.labels(**labels).inc()
        processing_time.labels(**labels).observe(state["processing_time_ms"] / 1000)
        if state["requires_human"]:
            escalation_reason_counter.labels(reason=state["human_reason"]).inc()
```
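The counters above feed live dashboards, but the case-study headline numbers (automation rate, escalation rate) can also be recomputed offline from stored final states. A minimal aggregation, independent of any metrics backend:

```python
def sla_summary(states: list[dict]) -> dict:
    """Aggregate processed requests into headline SLA figures."""
    total = len(states)
    escalated = sum(1 for s in states if s["requires_human"])
    times = [s["processing_time_ms"] for s in states if s.get("processing_time_ms")]
    return {
        "total": total,
        "automation_rate": round((total - escalated) / total, 3) if total else 0.0,
        "escalation_rate": round(escalated / total, 3) if total else 0.0,
        "avg_processing_ms": sum(times) / len(times) if times else None,
    }
```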
Timeline
- System architecture and graph: 1–2 weeks
- Classifier + data enrichment: 2–3 weeks
- Executors for each request type: 2–4 weeks
- Integration with channels (email, messengers): 1–2 weeks
- Calibration and production launch: 2 weeks
- Total: 8–13 weeks