OpenAI API Integration: GPT-4, GPT-4o, o1, o3
OpenAI provides several model families with different characteristics. GPT-4o is the optimal choice for most tasks: multimodal, good quality-to-cost ratio. o1/o3 are for tasks requiring deep reasoning (mathematics, code, logic). GPT-4o-mini is for high-load scenarios with simple tasks.
Basic Integration
from openai import OpenAI, AsyncOpenAI
from pydantic import BaseModel
# Clients read OPENAI_API_KEY (and optional OPENAI_BASE_URL) from the environment.
client = OpenAI() # Uses OPENAI_API_KEY from environment
async_client = AsyncOpenAI()
# Synchronous call
def chat(prompt: str, model: str = "gpt-4o") -> str:
    """Send a single user prompt and return the assistant's reply text.

    Low temperature (0.1) keeps answers close to deterministic.
    """
    conversation = [{"role": "user", "content": prompt}]
    completion = client.chat.completions.create(
        model=model,
        messages=conversation,
        temperature=0.1,
    )
    return completion.choices[0].message.content
# Structured output
class Extraction(BaseModel):
    """Schema for structured extraction: a named monetary amount.

    Used as ``response_format`` so the API returns validated JSON.
    """

    name: str
    amount: float
    currency: str
def extract_structured(text: str) -> Extraction:
    """Extract a validated ``Extraction`` record from free-form text.

    Uses the SDK's ``parse`` helper, which enforces the Pydantic schema
    on the model's output.
    """
    prompt = f"Extract data from: {text}"
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format=Extraction,
    )
    return completion.choices[0].message.parsed
# Streaming
def stream_response(prompt: str):
    """Yield the assistant's reply incrementally as text fragments.

    NOTE: the original used ``client.chat.completions.stream(...)`` with a
    ``text_stream`` attribute — that is the Anthropic SDK's streaming helper,
    not OpenAI's. The OpenAI Python SDK streams by passing ``stream=True`` to
    ``create``, which returns an iterator of chunks whose deltas carry the
    text.
    """
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks (e.g. the final usage chunk) may have no choices,
        # and a delta's content can be None (role-only deltas).
        if chunk.choices:
            delta = chunk.choices[0].delta.content
            if delta is not None:
                yield delta
# Vision (GPT-4o)
def analyze_image(image_url: str, question: str) -> str:
    """Ask a question about an image URL using GPT-4o's vision input.

    The user message carries two content parts: the image, then the text.
    """
    image_part = {"type": "image_url", "image_url": {"url": image_url}}
    text_part = {"type": "text", "text": question}
    message = {"role": "user", "content": [image_part, text_part]}
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[message],
    )
    return completion.choices[0].message.content
o1/o3 for Reasoning Tasks
# Reasoning models (o1 family) do not support system prompts, temperature,
# or streaming — send a single user message only.
def reason_with_o1(problem: str) -> str:
    """Solve a reasoning-heavy problem with a reasoning model (o3-mini here)."""
    completion = client.chat.completions.create(
        model="o3-mini",
        messages=[{"role": "user", "content": problem}],
        # o3-mini accepts reasoning_effort: "low" | "medium" | "high"
        reasoning_effort="high",
    )
    return completion.choices[0].message.content
# Reasoning models shine at: mathematical proofs, algorithmic problems,
# multi-step code with constraints, logic puzzles.
Cost Management
# Approximate costs (per 1M tokens, 2025):
# gpt-4o: $2.50 input / $10.00 output
# gpt-4o-mini: $0.15 input / $0.60 output
# o3-mini: $1.10 input / $4.40 output
def estimate_cost(text: str, model: str = "gpt-4o") -> float:
    """Return a rough input-cost estimate in USD before making an API call.

    Uses the ~1.3-tokens-per-whitespace-word rule of thumb; unknown models
    fall back to the gpt-4o rate.
    """
    approx_tokens = len(text.split()) * 1.3
    # USD per 1M input tokens (approximate, 2025)
    per_million = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15, "o3-mini": 1.10}
    rate = per_million.get(model, 2.50)
    return approx_tokens / 1_000_000 * rate
Embeddings and Semantic Search
def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts and return one vector per input, in order.

    text-embedding-3-small: 1536 dimensions, ~$0.02 per 1M tokens.
    """
    result = client.embeddings.create(
        model="text-embedding-3-small", # 1536 dims, $0.02/1M tokens
        input=texts,
    )
    vectors = []
    for item in result.data:
        vectors.append(item.embedding)
    return vectors
Timeline
- Basic integration with chat completions: 0.5–1 day
- Structured outputs + tools: 2–3 days
- Retry logic + cost management: 1–2 days







