Which languages does the generator support?

The current version works with Python. We use AST for source code analysis, so any Python 3.7+ constructs are supported. Support for Java and JavaScript is planned.

What percentage of tests generate without modifications?

In our practice, 94% of tests pass without changes. The remaining 6% require manual adjustment of mocks or complex dependencies.

How does AI determine edge cases?

The model analyzes the AST: branch conditions, boundary values, and raise statements. Based on this, it generates parameterized tests covering boundary values, None arguments, empty lists, and similar cases.

Can it be integrated into an existing CI?

Yes, the generator exports pytest tests that run with standard tools. We provide a ready CI script for GitHub Actions or GitLab CI.

How is the quality of generated tests evaluated?

We use mutation testing (mutmut) to compute the mutation score. The target is >80%. If the score is lower, we generate additional tests for weak spots.

AI-Generated Unit Tests: Automating Coverage and Regression

As the codebase grows, coverage drops and refactoring becomes risky. Teams spend up to 40% of each sprint writing tests and still miss edge cases. We automated this process with AI generation: the system parses the source into an AST, extracts functions, arguments, exceptions, and return values, and then an LLM (Claude Sonnet 4.5) generates pytest tests. Unlike manual test writing, the generator does not forget boundary conditions: None arguments, empty collections, invalid combinations. Result: about 80% time savings for QA, with a corresponding reduction in manual testing costs. Under typical team load, the investment pays for itself in less than 3 months.

How Does AI Handle Legacy Code Without Types?

Even if the code is written without type annotations, the AST parser extracts signatures and return values. We additionally analyze docstrings, if-conditions, and raise expressions. This information is passed to the model together with the context of existing tests (if any) to keep the style consistent. The model returns a ready test file.

import ast  # ast.unparse, used below, requires Python 3.9+
from pathlib import Path
from typing import Optional

from anthropic import Anthropic

# The client reads the API key from the ANTHROPIC_API_KEY environment variable
client = Anthropic()

class TestGenerator:
    def __init__(self, project_root: str):
        self.project_root = project_root

    def extract_function_info(self, source_code: str, function_name: str) -> dict:
        """Extracts function metadata via AST"""
        tree = ast.parse(source_code)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                if node.name == function_name:
                    return {
                        "name": node.name,
                        "args": [arg.arg for arg in node.args.args],
                        "decorators": [ast.unparse(d) for d in node.decorator_list],
                        "is_async": isinstance(node, ast.AsyncFunctionDef),
                        "has_return": any(
                            isinstance(n, ast.Return) and n.value
                            for n in ast.walk(node)
                        ),
                        "raises": [
                            ast.unparse(n.exc) for n in ast.walk(node)
                            if isinstance(n, ast.Raise) and n.exc
                        ],
                        "source": ast.unparse(node),
                    }
        return {}

    def find_related_tests(self, source_file: str) -> str:
        """Finds existing tests to understand style"""
        source_path = Path(source_file)
        test_candidates = [
            source_path.parent / f"test_{source_path.name}",
            source_path.parent.parent / "tests" / f"test_{source_path.name}",
            source_path.parent / "tests" / f"test_{source_path.name}",
        ]
        for test_file in test_candidates:
            if test_file.exists():
                return test_file.read_text()[:2000]
        return ""

    def generate_tests(
        self,
        source_file: str,
        function_name: Optional[str] = None,
    ) -> str:
        """Generates tests for a file or specific function"""
        source_code = Path(source_file).read_text()
        existing_tests = self.find_related_tests(source_file)
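        if function_name:
            func_info = self.extract_function_info(source_code, function_name)
            if not func_info:
                # Fail early instead of sending an empty prompt to the model
                raise ValueError(f"Function {function_name!r} not found in {source_file}")
            context = f"Function to test:\n```python\n{func_info['source']}\n```"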
        else:
            context = f"File to test:\n```python\n{source_code[:4000]}\n```"
        existing_context = ""
        if existing_tests:
            existing_context = f"\nExisting test style (follow this pattern):\n```python\n{existing_tests}\n```"
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            system="""You are a senior developer writing pytest tests.
Rules:
- Test behavior, not implementation
- One test = one assertion (AAA: Arrange, Act, Assert)
- Name tests as: test_<function>_<scenario>_<expectation>
- Cover: happy path, edge cases, errors/exceptions, boundary values
- Use pytest.mark.parametrize for similar tests
- For async functions — pytest-asyncio
- Mock external dependencies via pytest-mock""",
            messages=[{
                "role": "user",
                "content": f"""{context}{existing_context}\n\nGenerate a complete test file with pytest. Return only code, no explanations."""
            }]
        )
        return response.content[0].text
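
The text above also mentions docstrings and if-conditions, which `extract_function_info` does not collect. As a minimal sketch of how those hints could be gathered with the standard `ast` module: the helper name `extract_hints`, its output keys, and the `discount` sample function are hypothetical and are not part of the generator shown above. It assumes Python 3.9+ for `ast.unparse`.

```python
import ast


def extract_hints(source_code: str, function_name: str) -> dict:
    """Collect the docstring, branch conditions and raised exceptions of one function."""
    tree = ast.parse(source_code)
    for node in ast.walk(tree):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.name == function_name:
            return {
                "docstring": ast.get_docstring(node) or "",
                # Conditions of if/while statements and ternary expressions
                "conditions": [
                    ast.unparse(n.test)
                    for n in ast.walk(node)
                    if isinstance(n, (ast.If, ast.While, ast.IfExp))
                ],
                "raises": [
                    ast.unparse(n.exc)
                    for n in ast.walk(node)
                    if isinstance(n, ast.Raise) and n.exc
                ],
            }
    return {}


# Hypothetical untyped legacy function used only for illustration
LEGACY = '''
def discount(price, percent):
    """Apply a percentage discount."""
    if percent < 0 or percent > 100:
        raise ValueError("percent out of range")
    if price == 0:
        return 0
    return price * (100 - percent) / 100
'''

hints = extract_hints(LEGACY, "discount")
print(hints)
```

Each extracted condition points to a boundary worth testing (here `percent` at 0 and 100, `price` at 0), so passing these hints into the prompt is one plausible way to steer the model toward edge cases.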

Why Is Mutation Testing the Only Objective Criterion?

Generated tests need to be verified: do they catch real bugs? Mutation testing introduces mutations into the source code—changes > to <, True to False, removes calls. If a test doesn't fail, the mutation survives. The higher the mutation score (ratio of killed mutants), the more reliable the tests. Our target is 80% and above.

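import subprocess

def evaluate_test_quality(source_file: str, test_file: str) -> dict:
    """Runs mutation testing to evaluate test quality"""
    # The flags below follow the mutmut 2.x command line; other versions
    # may expect these settings in the project configuration instead.
    result = subprocess.run(
        ["mutmut", "run", f"--paths-to-mutate={source_file}", f"--tests-dir={test_file}"],
        capture_output=True, text=True, timeout=300
    )
    # Assumes the output reports one line per mutant containing "survived"
    # or "killed"; adapt this parser to the output of your mutmut version.
    survived = 0
    killed = 0
    for line in result.stdout.splitlines():
        lowered = line.lower()
        if "survived" in lowered:
            survived += 1
        elif "killed" in lowered:
            killed += 1
    total = survived + killed
    # Ratio of killed mutants; 0.0 when no mutants were reported
    mutation_score = killed / total if total > 0 else 0.0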
    return {
        "mutation_score": mutation_score,
        "killed_mutants": killed,
        "survived_mutants": survived,
        "verdict": "excellent" if mutation_score > 0.8 else "good" if mutation_score > 0.6 else "needs_improvement"
    }
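
To make the idea of a surviving mutant concrete, here is a toy illustration built only on the standard library. It is not how mutmut works internally; the `is_adult` function, the `FlipGtE` transformer, and both test helpers are hypothetical names chosen for this sketch. One mutation (`>=` changed to `>`) is applied, and two tests of different strength are run against it.

```python
import ast

SOURCE = """
def is_adult(age):
    return age >= 18
"""


class FlipGtE(ast.NodeTransformer):
    """Mutation operator: replace every `>=` comparison with `>`."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree):
    """Compile a module AST and return the is_adult function defined in it."""
    namespace = {}
    exec(compile(tree, "<toy-mutant>", "exec"), namespace)
    return namespace["is_adult"]


def weak_test(fn):
    # Only checks values far from the boundary
    return fn(30) is True and fn(5) is False


def boundary_test(fn):
    # Checks the boundary value itself
    return fn(18) is True


original = load(ast.parse(SOURCE))
mutant_tree = FlipGtE().visit(ast.parse(SOURCE))
ast.fix_missing_locations(mutant_tree)
mutant = load(mutant_tree)

print("weak test kills mutant:", not weak_test(mutant))
print("boundary test kills mutant:", not boundary_test(mutant))
```

The weak test passes on both the original and the mutant, so the mutant survives; the boundary test fails on the mutant and kills it. Line coverage is identical in both cases, which is why the mutation score says more about test quality than coverage alone.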

Comparison of Approaches: Manual vs AI Generation

Parameter	Manual Writing	AI Generation	AI with Auto-Fix
Time for 100 tests	8–16 hours	2–3 hours	3–5 hours
Edge case coverage	Developer-dependent	Automatic (90%+)	95%+ after fix cycle
Mutation score	0.6–0.8	0.7–0.8	0.8–0.85
Need for adjustments	—	6–10%	<5%

AI generation is roughly 80% faster than manual test writing (2–3 hours versus 8–16 hours per 100 tests) with comparable coverage. With the auto-fix cycle, the mutation score reaches 0.8 or higher, a level rarely achieved manually.

Work Stages

Codebase analysis. AST traversal of all files: extract function signatures, decorators, raise expressions, docstrings. Estimate volume: an average project has 50-100 functions per module.
Test generation. Each file gets a separate test file with parameterized tests covering happy path, edge cases, and exceptions.
Auto-run and fix cycle. Up to 3 iterations: run pytest, parse errors, refine tests via LLM. Tests that fail due to external dependencies are flagged for manual tuning.
Mutation score evaluation. Run mutmut, analyze surviving mutants. If score <0.8, generate additional tests for weak spots.
CI integration. Ready script for GitHub Actions or GitLab CI with coverage gate and automatic report.
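
The auto-run and fix stage described above can be sketched as a small loop. This is an illustration under assumptions, not the production pipeline: `fix_cycle`, `run_tests`, and `refine` are hypothetical names, and both callables are injected so the loop can be exercised without pytest or an LLM.

```python
from typing import Callable, Tuple


def fix_cycle(
    test_code: str,
    run_tests: Callable[[str], Tuple[bool, str]],
    refine: Callable[[str, str], str],
    max_iterations: int = 3,
) -> dict:
    """Run the tests, feed the error log to a refiner, repeat up to max_iterations runs."""
    log = ""
    for attempt in range(1, max_iterations + 1):
        passed, log = run_tests(test_code)
        if passed:
            return {"status": "passed", "attempts": attempt, "test_code": test_code}
        if attempt < max_iterations:
            # Ask the refiner (an LLM call in a real pipeline) for a corrected version
            test_code = refine(test_code, log)
    # Still failing after the last run: flag for manual tuning
    return {
        "status": "needs_manual_review",
        "attempts": max_iterations,
        "test_code": test_code,
        "last_log": log,
    }


# Stand-ins for pytest and the LLM, used only to demonstrate the control flow
def fake_run(code: str) -> Tuple[bool, str]:
    ok = "FIXED" in code
    return ok, "" if ok else "AssertionError: expected 2"


def fake_refine(code: str, log: str) -> str:
    return code + "  # FIXED"


result = fix_cycle("def test_x(): ...", fake_run, fake_refine)
stuck = fix_cycle("def test_y(): ...", fake_run, lambda code, log: code)
print(result["status"], result["attempts"])
print(stuck["status"], stuck["attempts"])
```

In a real setup, `run_tests` would typically write the code to a file and call `pytest` through `subprocess.run`, returning the exit status and captured output, while `refine` would send the failing test and the log back to the model.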

Practical Case: Legacy Python Service Without Tests

From our practice: a client handed over a Python service with 8000 lines of code and zero coverage. Refactoring was impossible without tests.

Process:

Automatic analysis of all .py files via AST.
Test generation by files (batch, 5 files in parallel).
Auto-run and fix cycle (up to 3 iterations).
Manual review of files where coverage stayed below 60%.
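
The batch step (5 files in parallel) could look roughly like the following. It is a sketch that assumes generation is I/O-bound (waiting on an API), so threads are sufficient; `generate_in_batches` and `fake_generate` are hypothetical names, and the real generator call is replaced by a stub.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def generate_in_batches(
    files: Iterable[str],
    generate: Callable[[str], str],
    max_workers: int = 5,
) -> Dict[str, str]:
    """Generate tests for several files concurrently, at most max_workers at a time."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(generate, path): path for path in files}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One failing file should not stop the whole batch
                results[path] = f"# generation failed: {exc}"
    return results


# Stub standing in for the LLM-backed generator
def fake_generate(path: str) -> str:
    if path.endswith("broken.py"):
        raise RuntimeError("parse error")
    return f"# tests for {path}"


batch = generate_in_batches(["a.py", "b.py", "broken.py"], fake_generate)
print(batch)
```

Failed files are recorded rather than raised, so they can be routed to the manual review step at the end of the process.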

Results in 2 weeks:

847 test functions generated.
Coverage: 0% → 71%.
12 real bugs found during generation (AI noticed behavioral mismatches and type inconsistencies).
94% of generated tests passed without modifications.
6% required manual rework (complex mock dependencies).

Mutation score of final tests: 0.74. This is good but below the 0.8 target, because some edge cases were not covered by the AI.

What's Included

Codebase analysis: extract all functions, their signatures, and dependencies.
Test generation: each file gets a separate test file with parameterized tests.
Auto-run and fix: up to 3 iterations for error correction.
Coverage and mutation score report.
CI integration: configure test execution on every commit.
Output documentation: description of all generated tests and instructions for adjustments.
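
A coverage gate for CI can be a few lines of Python. The sketch below assumes the report layout written by the `coverage json` command of coverage.py (`totals.percent_covered` and a per-file `summary`); the function name `coverage_gate`, the thresholds, and the file names in the sample are illustrative assumptions.

```python
def coverage_gate(report: dict, minimum: float = 70.0, review_below: float = 60.0) -> dict:
    """Check total coverage against a minimum and list files that need manual review."""
    total = report["totals"]["percent_covered"]
    needs_review = sorted(
        path
        for path, data in report.get("files", {}).items()
        if data["summary"]["percent_covered"] < review_below
    )
    return {"passed": total >= minimum, "total": total, "needs_review": needs_review}


# Illustrative report fragment with hypothetical file names
sample_report = {
    "totals": {"percent_covered": 71.0},
    "files": {
        "billing/invoice.py": {"summary": {"percent_covered": 88.0}},
        "billing/legacy_export.py": {"summary": {"percent_covered": 42.0}},
    },
}

gate = coverage_gate(sample_report)
print(gate)
```

In a CI job, the report would be loaded with `json.load` from the generated `coverage.json`, and the job would exit with a non-zero status when `passed` is false.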

Timelines

Scope	Duration
Basic generator (one file, code extraction)	1–2 days
Auto-run and fix cycle	2–3 days
CI/CD integration with coverage gate	1 week
Full pipeline for legacy codebase	2–3 weeks

Pricing is determined individually after analyzing your codebase. Contact us for a project assessment—we guarantee raising coverage to 70%+ in 2 weeks. Our expertise in AI testing is backed by certifications and successful cases (10+ years on the market, 50+ projects). Order a test run for one module—see for yourself.

LLM Development: Fine-Tuning, RAG, Agents, and Production Deployment

Using GPT‑4 or Claude 3.5 Sonnet through a public API is not a solution — it's just a tool. When the requirement is to "make it like ChatGPT, but on our data," there is a real engineering challenge behind it: from prompt engineering to training a 70B model on your own infrastructure. End-to-end LLM solution development is a complex stack, and we have been doing it for over 5 years. During this time, we have completed over 20 projects in generative AI: from RAG systems for legal departments to custom support agents. Where exactly your task falls depends on data, latency requirements, budget, and how critical confidentiality is.

A typical situation: the client has already tried ChatGPT, but results are unstable — sometimes accurate, sometimes hallucinating. Or they need integration into a corporate portal while complying with security policies. Let's break down each layer of the stack in detail — from RAG to production deployment.

Why Do RAG Systems Break and How to Fix It?

RAG (Retrieval-Augmented Generation) looks simple: find relevant documents, put them in context, get an answer. In practice, it fails in several places.

Chunking without overlap. Classic mistake: chunk_size=512, overlap=0. If the answer lies across two chunks, retrieval won't find either with sufficient confidence. Solution: overlap 15–25% of chunk_size, or better yet, sentence-aware splitting with spaCy or NLTK instead of naive character splitting.
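
A minimal sketch of fixed-size chunking with overlap, assuming the text has already been tokenized into a list. The function name `chunk_with_overlap` and the 20% default are illustrative; sentence-aware splitting would replace the fixed window with sentence boundaries.

```python
from typing import List, Sequence


def chunk_with_overlap(
    tokens: Sequence, chunk_size: int = 512, overlap_ratio: float = 0.2
) -> List[Sequence]:
    """Split tokens into windows of chunk_size that share overlap_ratio of their length."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    # Distance between the starts of two consecutive chunks
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


tokens = list(range(1000))  # stand-in for real tokens
chunks = chunk_with_overlap(tokens, chunk_size=512, overlap_ratio=0.2)
print([(c[0], c[-1]) for c in chunks])
```

With these settings consecutive chunks share 103 tokens, so an answer that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.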

Poor embedder. text-embedding-ada-002 is good for general use, but on legal or medical texts, specialized models like E5-large-v2, BGE-M3, or fine-tuned sentence-transformers on domain data outperform it. Recall@5 differences can be 15–25%.

No re-ranking. Vector search optimizes for speed, not relevance. A cross-encoder re-ranker (ms-marco-MiniLM-L-6-v2, bge-reranker-large) after initial retrieval improves top-3 accuracy with acceptable latency (+50–150ms). This is often more impactful than improving the embedding model.

No hybrid search. Dense vectors alone work poorly on exact queries: names, SKUs, codes. BM25 (sparse) finds exact matches but misses semantics. Hybrid search via RRF (Reciprocal Rank Fusion) is a practical compromise. Qdrant and Weaviate support hybrid search natively; with pgvector 0.7+ it can be assembled from sparse vectors or PostgreSQL full-text search.
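
Reciprocal Rank Fusion itself is small enough to show in full. The sketch below scores each document as the sum of 1 / (k + rank) over all rankings, with the commonly used constant k = 60; the document IDs are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one list using RRF."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # from vector search
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Because RRF uses only ranks, it needs no calibration between cosine similarities and BM25 scores, which is the main reason it is a popular default for hybrid retrieval.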

Typical production architecture for a corporate knowledge base

Documents → preprocessing (PyMuPDF, Unstructured)
Chunking → embedding (BGE-M3)
Qdrant (hybrid dense+sparse)
Cross-encoder re-ranking
Context → LLM (vLLM or OpenAI API)
Answer with sources (RAGAS for quality evaluation)

When to Fine-Tune Instead of Prompt Engineering?

Prompt engineering solves ~70% of LLM adaptation tasks for a domain. The remaining 30% require fine-tuning. Three indicators: the model ignores a specific output format even with detailed prompting; the task requires deep knowledge of specialized vocabulary (medicine, law); you need to significantly reduce token costs by replacing a large model with a smaller specialized one.

LoRA and QLoRA are the standard for SFT. LoRA adds trainable low-rank matrices to attention layers. A typical configuration for Llama-3 8B: r=64, lora_alpha=128, target_modules=["q_proj","v_proj","k_proj","o_proj"]. It yields roughly 0.7% trainable parameters (about 54M of 8B) and trains on a single A100 40GB. QLoRA adds 4-bit quantization (NF4) and allows fine-tuning 70B models on two A100 40GB, though speed drops by half compared to bf16.
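
The share of trainable parameters can be checked with simple arithmetic. The sketch below assumes the published Llama-3 8B shape (hidden size 4096, 32 layers, 8 key/value heads of dimension 128, about 8.03B parameters in total); these numbers are assumptions brought in for the calculation, not taken from the text above.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A LoRA adapter adds two matrices: (in_features x r) and (r x out_features)."""
    return r * (in_features + out_features)


# Assumed Llama-3 8B dimensions
hidden = 4096
kv_dim = 8 * 128          # grouped-query attention: 8 KV heads x head_dim 128
layers = 32
total_model_params = 8.03e9
r = 64

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv_dim, r)  # k_proj
    + lora_params(hidden, kv_dim, r)  # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
trainable = per_layer * layers
fraction = trainable / total_model_params

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of the model: {fraction:.2%}")
```

Under these assumptions the adapters hold about 54.5M parameters, a little under 0.7% of the model, which is in the same range as the figure quoted above. Note that `lora_alpha` scales the update but does not change the parameter count.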

DPO instead of RLHF. Direct Preference Optimization requires only (chosen, rejected) pairs, not scalar reward signals. DPOTrainer from the trl library (Hugging Face) implements it in a few dozen lines.
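
For intuition, the DPO objective for a single preference pair can be written in plain Python. This is a sketch of the loss formula only, not of `DPOTrainer`; the inputs are summed log-probabilities of the chosen and rejected answers under the trained policy and a frozen reference model, and `beta=0.1` is an assumed, commonly used value.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers "chosen" over "rejected" than the reference does
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    x = beta * margin
    # Numerically stable form of -log(sigmoid(x))
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)   # policy identical to the reference
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)  # policy shifted toward the chosen answer
print(round(neutral, 4), round(improved, 4))
```

When the policy matches the reference, the loss equals log 2; it falls as the policy assigns relatively more probability to the chosen answer, with no separate reward model involved.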

Common mistake. A dataset of 500 examples, 5 epochs, validation loss 0.8 — seems fine. But on test, the model degrades on general instructions. Cause: catastrophic forgetting. Solution: add 10–20% general instruction-following examples (Alpaca, FLAN) to the training set to preserve original capabilities.
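
Mixing general examples into a domain dataset is mostly a matter of getting the proportion right. The sketch below targets a given share of general examples in the final set; `mix_in_general`, the 15% default, and the placeholder records are illustrative assumptions.

```python
import random
from typing import List, Sequence


def mix_in_general(
    domain: Sequence[dict],
    general: Sequence[dict],
    general_share: float = 0.15,
    seed: int = 0,
) -> List[dict]:
    """Return domain examples plus enough general ones to make up general_share of the result."""
    if not 0 <= general_share < 1:
        raise ValueError("general_share must be in [0, 1)")
    # Solve n_general / (n_domain + n_general) = general_share for n_general
    n_general = round(len(domain) * general_share / (1 - general_share))
    n_general = min(n_general, len(general))
    rng = random.Random(seed)
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed


# Placeholder records standing in for real training examples
domain = [{"source": "domain", "id": i} for i in range(500)]
general = [{"source": "general", "id": i} for i in range(2000)]
mixed = mix_in_general(domain, general)
print(len(mixed), sum(1 for ex in mixed if ex["source"] == "general"))
```

For 500 domain examples and a 15% target this adds 88 general examples, for 588 in total.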

How to Choose a Base Model: 8B or 70B?

Model	Parameters	Strengths	Context
Llama-3.1 8B	8B	Quality/speed balance	128k
Llama-3.1 70B	70B	Complex reasoning	128k
Mistral 7B / Mixtral 8x7B	7B / 47B	Efficiency for size	32k
Qwen2.5 72B	72B	Code, multilingual	128k
Gemma 2 27B	27B	Open license	8k

For most tasks, fine-tuning an 8B model is sufficient. 70B is needed when deep reasoning is required or the 8B baseline does not reach the required quality even after fine-tuning. Serving an 8B model via vLLM on a single A100 is comparatively inexpensive; the exact cost per request depends on traffic volume and sequence lengths.

What Does PagedAttention Bring to Production?

vLLM is the first choice for serving open-source models. PagedAttention is the key technical innovation: KV-cache is managed like virtual memory in an OS, without fragmentation. This yields 2–4x higher throughput compared to naive HuggingFace Transformers inference. The vLLM documentation confirms that continuous batching and PagedAttention are the standard for high-load LLM services.
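
A back-of-the-envelope model shows why block-based KV-cache allocation helps. This is a simplified illustration, not vLLM's allocator: it assumes Llama-3 8B dimensions (32 layers, 8 KV heads, head dimension 128, 2 bytes per value), a block size of 16 tokens, and made-up sequence lengths.

```python
import math
from typing import Sequence


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def contiguous_reserved(seq_lens: Sequence[int], max_len: int) -> int:
    """Naive scheme: reserve max_len token slots per sequence up front."""
    return len(seq_lens) * max_len


def paged_reserved(seq_lens: Sequence[int], block_size: int = 16) -> int:
    """Paged scheme: reserve whole blocks only for tokens that actually exist."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
seq_lens = [100, 350, 1200]  # illustrative current lengths of three requests
naive_slots = contiguous_reserved(seq_lens, max_len=2048)
paged_slots = paged_reserved(seq_lens)

print(f"KV cache per token: {per_token // 1024} KiB")
print(f"naive: {naive_slots * per_token / 2**20:.0f} MiB, paged: {paged_slots * per_token / 2**20:.0f} MiB")
```

In this toy case the paged scheme reserves 1,664 token slots instead of 6,144, and the waste per sequence is bounded by one partially filled block. The memory freed this way is what allows more requests to be batched together.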

Typical numbers on A100 80GB for Llama-3 8B (bf16): 400–600 req/s, P50 latency 200–400ms, P99 latency 600–900ms at concurrency 64. For 70B on two A100 with tensor parallelism: 80–120 req/s, P99 latency 1.5–2.5s. These figures depend strongly on prompt and output length, so benchmark on your own workload. AWQ or GPTQ quantization reduces memory consumption by 2x with quality loss within 1–3%.

Multi-Agent Systems

Agents are LLMs with access to tools: search, code execution, API calls, database interaction. Common patterns:

ReAct (Reason + Act): the model reasons → chooses a tool → observes the result → reasons again. LangChain and LlamaIndex implement it out of the box.
Multi-agent orchestration: multiple specialized agents with a coordinator on top. Example: coordinator → researcher (search + summarization) → coder (code generation and execution) → critic (verification). Tools: AutoGen (Microsoft), CrewAI, custom implementation on LangGraph.

In production, agent systems are non-deterministic. Essential: guardrails, step limits, logging of each step, human-in-the-loop for critical actions.
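
The guardrails listed above fit into a compact ReAct-style loop. The sketch below is framework-free and uses a scripted policy in place of a real LLM; `run_agent`, the `(tool_name, tool_input)` action format, and the `"finish"` convention are assumptions made for this example.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Action = Tuple[str, str]  # (tool_name, tool_input); "finish" ends the run


def run_agent(
    policy: Callable[[str, List[dict]], Action],
    tools: Dict[str, Callable[[str], str]],
    critical: FrozenSet[str] = frozenset(),
    approve: Callable[[str, str], bool] = lambda name, arg: False,
    max_steps: int = 5,
) -> dict:
    """Reason/act loop with a step limit, a per-step trace and human approval for critical tools."""
    trace: List[dict] = []
    observation = ""
    for step in range(1, max_steps + 1):
        tool_name, tool_input = policy(observation, trace)  # the "reason" step
        if tool_name == "finish":
            trace.append({"step": step, "tool": "finish", "output": tool_input})
            return {"answer": tool_input, "trace": trace}
        if tool_name not in tools:
            observation = f"unknown tool: {tool_name}"
        elif tool_name in critical and not approve(tool_name, tool_input):
            observation = f"{tool_name} blocked: human approval required"
        else:
            observation = tools[tool_name](tool_input)  # the "act" step
        trace.append({"step": step, "tool": tool_name, "input": tool_input, "output": observation})
    return {"answer": None, "trace": trace}  # step limit reached


# Scripted stand-in for an LLM: search once, then answer with the observation
def scripted_policy(observation: str, trace: List[dict]) -> Action:
    if not trace:
        return ("search", "refund policy")
    return ("finish", observation)


tools = {"search": lambda q: f"3 documents found for '{q}'"}
result = run_agent(scripted_policy, tools)
looping = run_agent(lambda obs, trace: ("search", "x"), tools)
print(result["answer"], len(looping["trace"]))
```

The trace gives a per-step log for debugging, the step limit stops a policy that never finishes, and tools listed in `critical` run only when the `approve` callback (a human decision in practice) returns true.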

How We Work: Stages, Timeline, Deliverables

Stage	Duration	What You Get
Audit and data collection	1–2 weeks	Eval dataset of 100+ examples, task formalization
Baseline (prompt + RAG)	1–2 weeks	Working prototype, quality metrics
Fine-tuning (if needed)	2–4 weeks	Trained model, LoRA weights, model card
Deployment and monitoring	1–2 weeks	vLLM server, Grafana + Prometheus
Documentation and training	1 week	API documentation, team training

What Is Included

We deliver:

Technical documentation (model card, configs, deployment instructions)
Access to infrastructure (code repository, trained weights)
1 month of post-deployment support (consultations, bug fixes)
Customer team training (2–3 sessions on system operation)

Timeline: basic RAG prototype — 1–2 weeks. Fine-tuning with customer data — 3–6 weeks (including data preparation). Production system with monitoring and retraining — 2–4 months. Cost is calculated individually based on data volume, model complexity, and infrastructure requirements.

We guarantee the quality of the final model with performance benchmarks and ongoing monitoring. Our engineers have hands‑on experience with dozens of production LLM systems.

Want to evaluate your project? Leave a request — we will prepare a preliminary summary within 1–2 business days. Or get a consultation on choosing the approach: RAG, fine-tuning, or hybrid — we will tell you what works best for you. Contact us to discuss your LLM development needs. Schedule a free consultation today.