## AI Auto-Generation of E2E Tests
E2E tests are the most expensive to maintain: they break on any UI change, run slowly, and are unstable (flaky). A main cause of instability is brittle locators like `div.container > ul > li:nth-child(3) > a`. An AI generator creates Playwright tests with semantic locators (aria-label, data-testid, role) that survive cosmetic layout changes.
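The brittleness the generator is told to avoid can also be detected mechanically. Below is a minimal sketch (the `is_brittle` helper and its pattern list are hypothetical, not any Playwright API) that flags position- and class-chain-dependent CSS selectors of the kind shown above:

```python
import re

# Heuristic patterns for brittle CSS selectors (hypothetical helper,
# illustrating what the generator is instructed to avoid).
BRITTLE_PATTERNS = [
    r":nth-child\(",        # position-dependent: breaks when siblings move
    r":nth-of-type\(",
    r"[.#][\w-]+\s*>",      # class/id chains like div.container > ul
    r"^\s*[.#][\w-]+$",     # bare .class or #id
]

def is_brittle(selector: str) -> bool:
    """True if the CSS selector relies on layout details that break on restyling."""
    return any(re.search(pattern, selector) for pattern in BRITTLE_PATTERNS)
```

A check like this can run in CI over hand-written tests to find candidates for migration to semantic locators.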
### Generate Playwright Tests from Scenario Description
````python
from langchain_openai import ChatOpenAI
import json


class E2ETestGenerator:
    PLAYWRIGHT_PROMPT = """Create a Playwright E2E test in TypeScript.

Scenario: {scenario}
App URL: {base_url}
Test data: {test_data}

Test requirements:
1. Use semantic locators: getByRole, getByLabel, getByText, getByTestId
2. DON'T use CSS selectors like .class or #id (except data-testid)
3. Add explicit waits: await expect(locator).toBeVisible()
4. For forms: fill via getByLabel(), not via selectors
5. Check after each significant action (not only at the end)
6. Use page.waitForResponse() for AJAX operations
7. Structure: test.describe > test.beforeEach > test

Example of a good locator:
✅ page.getByRole('button', {{ name: 'Create order' }})
✅ page.getByTestId('checkout-submit-btn')
❌ page.locator('button.btn-primary:nth-child(2)')

Return the TypeScript test code."""

    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0.1)

    def generate_from_scenario(
        self,
        scenario: str,
        base_url: str,
        test_data: dict,
    ) -> str:
        result = self.llm.invoke(
            self.PLAYWRIGHT_PROMPT.format(
                scenario=scenario,
                base_url=base_url,
                test_data=json.dumps(test_data),
            )
        )
        return result.content

    def generate_from_recording(self, playwright_trace: str) -> str:
        """Improves an automatically recorded Playwright Codegen test."""
        prompt = f"""Improve this automatically recorded Playwright test.

Original test (from Codegen):
```typescript
{playwright_trace}
```

Fix:
- Replace CSS selectors with semantic locators
- Remove unnecessary clicks
- Add explicit waits
- Add assertions for data verification

Return the improved TypeScript code."""
        return self.llm.invoke(prompt).content
````
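Since the model can still slip a forbidden selector into its output, it is worth gating the generated code before it lands in the repo. Here is a minimal sketch (the `selector_violations` helper is an assumption, not part of the class above) that scans generated TypeScript for the `page.locator('.class')` / `page.locator('#id')` calls the prompt forbids:

```python
import re

# Hypothetical post-generation gate: flag generated lines that use
# class/id CSS selectors. data-testid attribute selectors start with
# '[' after the quote and therefore pass this check.
FORBIDDEN_LOCATOR = re.compile(r"page\.locator\(\s*['\"][.#]")

def selector_violations(generated_ts: str) -> list[str]:
    """Return offending lines; an empty list means the test passes the gate."""
    return [line.strip()
            for line in generated_ts.splitlines()
            if FORBIDDEN_LOCATOR.search(line)]
```

On a non-empty result, the caller can re-prompt the model with the offending lines appended to the original requirements.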
### Stability Optimization and Self-Healing
```python
class TestStabilityOptimizer:
    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0.1)

    async def analyze_flakiness(self, test_result: dict) -> str:
        """Analyzes why a test failed and suggests a fix."""
        prompt = f"""Analyze this flaky test failure.

Test: {test_result['test_name']}
Error: {test_result['error']}
Screenshot: {test_result['screenshot_path']}

Likely causes:
- Race condition: element not yet visible
- Network delay: wait for the network response
- Async update: element content changed
- Dynamic ID/class: selector needs updating

Suggest a fix (semantic locator approach)."""
        result = await self.llm.ainvoke(prompt)
        return result.content
```
Case study: a React SPA with 80 critical user journeys under E2E coverage. Initial state: 35% of tests were flaky, breaking two to three times a week on style changes. After regenerating the suite with semantic locators, flakiness dropped to zero post-stabilization, and maintenance per layout change went from ~3 hours of manual per-test fixes to effectively none (the tests adapt).
Timeframe: E2E test generation + stability improvements: 4–6 weeks; full framework with cross-browser testing: 8–10 weeks.