AI Agent Development with Code Execution Capability (Code Interpreter)
An AI agent with Code Interpreter capability executes arbitrary code within an isolated environment, obtains real computation results, and uses them for answering. This fundamentally differs from code generation without execution: the agent can iteratively write, run, and fix code until obtaining correct results.
Code Interpreter Architecture
Request → LLM generates code → Sandbox executes → Result/error
              ↑                                            │
              └──────────── Iteration on error ────────────┘
Key requirement: secure isolated execution environment. Without sandboxing, the agent could execute arbitrary system code.
Implementation via Docker Sandbox
import docker
import tempfile
import os
from pathlib import Path
class DockerCodeExecutor:
    """Execute untrusted Python code inside a locked-down Docker container.

    Each call to :meth:`execute` runs in a fresh temporary directory that is
    bind-mounted into the container as ``/workspace``.  The container gets no
    network access, a memory cap, and a CPU quota, and is force-removed after
    every run, so a misbehaving script cannot touch the host.
    """

    def __init__(self, image: str = "python:3.11-slim", timeout: int = 30):
        """
        Args:
            image: Docker image to run code in.  For data-analysis workloads
                use a pre-built image with numpy/pandas/matplotlib, e.g.
                ``docker build -t code-executor-sandbox -f Dockerfile.sandbox .``
            timeout: wall-clock limit (seconds) for a single execution.
        """
        self.client = docker.from_env()
        self.image = image
        self.timeout = timeout

    def execute(self, code: str, files: dict = None) -> dict:
        """Run ``code`` in an isolated container and return a result dict.

        Args:
            code: Python source to execute.
            files: optional ``{filename: bytes}`` input data files, made
                available in the container's working directory.

        Returns:
            On success: ``{"status": "success", "output": <stdout>, "files": [...]}``.
            On failure: ``{"status": "error", "output": <message>,
            "error_type": "runtime" | "timeout" | "system"}``.
        """
        with tempfile.TemporaryDirectory() as tmpdir:
            workdir = Path(tmpdir)
            # Write input data files, then the script itself.
            if files:
                for fname, content in files.items():
                    (workdir / fname).write_bytes(content)
            (workdir / "script.py").write_text(code, encoding="utf-8")

            container = None
            try:
                # Detach so we can enforce the timeout ourselves:
                # docker-py's containers.run() accepts no `timeout` kwarg —
                # passing one is forwarded as an unknown create parameter
                # and fails at runtime instead of limiting execution.
                container = self.client.containers.run(
                    self.image,
                    command=["python", "/workspace/script.py"],
                    volumes={tmpdir: {"bind": "/workspace", "mode": "rw"}},
                    detach=True,
                    mem_limit="512m",
                    cpu_quota=50000,  # 50% of one CPU
                    network_disabled=True,  # No network!
                )
                try:
                    exit_info = container.wait(timeout=self.timeout)
                except Exception:
                    # wait() raised — most likely the read timeout fired.
                    # Kill the runaway script and report a timeout.
                    container.kill()
                    return {
                        "status": "error",
                        "output": f"Execution exceeded {self.timeout}s",
                        "error_type": "timeout",
                    }
                stdout = container.logs(stdout=True, stderr=False).decode("utf-8", "replace")
                stderr = container.logs(stdout=False, stderr=True).decode("utf-8", "replace")
                if exit_info.get("StatusCode", 1) != 0:
                    return {
                        "status": "error",
                        "output": stderr or stdout,
                        "error_type": "runtime",
                    }
                return {
                    "status": "success",
                    "output": stdout,
                    "files": self._list_output_files(tmpdir),
                }
            except Exception as e:
                # Daemon/image/volume errors and anything else unexpected.
                return {"status": "error", "output": str(e), "error_type": "system"}
            finally:
                # Best-effort cleanup; the container may already be gone.
                if container is not None:
                    try:
                        container.remove(force=True)
                    except Exception:
                        pass

    def _list_output_files(self, tmpdir: str) -> list:
        """Names of artifacts the script produced (plots, data exports)."""
        keep = {".png", ".csv", ".json", ".txt"}
        return [f.name for f in Path(tmpdir).iterdir() if f.suffix in keep]
Agent with Code Interpreter
from openai import OpenAI
import json

client = OpenAI()
executor = DockerCodeExecutor()

# JSON-schema for the single tool the model may call.
_EXECUTE_PYTHON_PARAMS = {
    "type": "object",
    "properties": {
        "code": {"type": "string", "description": "Python code to execute"},
        "description": {"type": "string", "description": "What this code does (for logging)"},
    },
    "required": ["code"],
}

# Tool list passed to the chat-completions API: one `execute_python` function.
code_tools = [
    {
        "type": "function",
        "function": {
            "name": "execute_python",
            "description": "Execute Python code and return result. Use for computations, data analysis, visualization.",
            "parameters": _EXECUTE_PYTHON_PARAMS,
        },
    }
]
def code_interpreter_agent(user_request: str, data_files: dict = None,
                           max_iterations: int = 8) -> str:
    """Answer ``user_request`` by iteratively generating and executing code.

    The model is offered the ``execute_python`` tool; every tool call is run
    in the Docker sandbox and the (possibly failing) result is fed back so
    the model can correct its own code on the next round-trip.

    Args:
        user_request: the analysis task, in natural language.
        data_files: optional ``{filename: bytes}`` inputs made available to
            every code execution.
        max_iterations: safety cap on model round-trips before giving up
            (previously a hard-coded constant of 8).

    Returns:
        The model's final text answer, or a notice that the cap was hit.
    """
    messages = [
        {
            "role": "system",
            "content": """You are a data analyst with Python access.
Always write and execute code for calculations, not approximate.
Available libraries: pandas, numpy, matplotlib, scipy, sklearn, json, csv.
On error — analyze traceback and fix code."""
        },
        {"role": "user", "content": user_request},
    ]
    for _ in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=code_tools,
            tool_choice="auto",
        )
        message = response.choices[0].message
        messages.append(message)
        # No tool calls means the model has produced its final answer.
        if not message.tool_calls:
            return message.content
        for tool_call in message.tool_calls:
            try:
                code = json.loads(tool_call.function.arguments)["code"]
            except (json.JSONDecodeError, KeyError) as e:
                # Malformed tool arguments: report back instead of crashing,
                # so the model can retry with valid JSON.
                result = {
                    "status": "error",
                    "output": f"Invalid tool arguments: {e}",
                    "error_type": "arguments",
                }
            else:
                result = executor.execute(code, files=data_files)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result, ensure_ascii=False),
            })
    return "Max iterations reached"
OpenAI Built-in Code Interpreter
OpenAI Assistants API provides built-in code interpreter (no need for own Docker):
from openai import OpenAI

client = OpenAI()

# Upload the data file first so it can be attached to the message below.
with open("sales_data.csv", "rb") as f:
    file = client.files.create(file=f, purpose="assistants")

# Assistant with the hosted Code Interpreter tool enabled — OpenAI runs the
# generated code server-side, so no local sandbox is required.
assistant = client.beta.assistants.create(
    name="Data Analyst",
    instructions="Analyze data using Python. Create visualizations.",
    tools=[{"type": "code_interpreter"}],
    model="gpt-4o",
)

# A thread holds the conversation; the message attaches the uploaded file
# for the Code Interpreter tool to read.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Analyze sales data and build a monthly chart",
    attachments=[{"file_id": file.id, "tools": [{"type": "code_interpreter"}]}],
)

# Start the run and block until it completes.
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
Practical Case: Financial Analyst with Code Interpreter
Task: automatic financial report building — agent receives CSV with transactions, independently writes code for analysis, builds charts, generates Excel report.
Request: "Analyze the attached Q1 2026 sales data. Calculate monthly dynamics, top 10 products, conversion funnel. Create PDF report with visualizations."
Agent Iterations:
- Load and verify CSV structure (5 columns, 45K rows)
- Clean data (duplicates, null values)
- Calculate monthly dynamics + bar chart
- ABC product analysis + Pareto chart
- Conversion funnel + funnel visualization
- Generate the PDF report via reportlab
Results:
- Report creation time: 3–4 hours (manual analyst) → 8 minutes
- Coverage of metrics: identical
- Requires review: interpretations and conclusions (agent formulates, human validates)
E2B Sandbox as Docker Alternative
E2B — managed sandbox for Code Interpreter without DevOps:
import e2b_code_interpreter as e2b

# E2B provisions and manages the sandbox for you — no local Docker setup.
sandbox = e2b.CodeInterpreter()
# Code execution: exec_cell runs the snippet in a notebook-style session
# and returns an object carrying the captured output.
execution = sandbox.notebook.exec_cell("""
import pandas as pd
df = pd.read_csv('/data/sales.csv')
print(df.describe())
""")
print(execution.stdout)
# Shut down the remote sandbox when finished.
sandbox.close()
Timeline
- Setup Docker sandbox + basic agent: 1–2 weeks
- Specialized analytical agent: 2–4 weeks
- Data sources integration: 1–2 weeks
- Total: 4–8 weeks







