AI QA Engineer — Digital Worker for Testing
AI QA Engineer automates test case design, autotest authoring, test result analysis, failed-test investigation, and report generation. It supplements a human QA team, accelerating coverage growth and reducing routine workload.
Test Case Generation from Requirements
import json

from openai import AsyncOpenAI
from pydantic import BaseModel
from typing import Literal

client = AsyncOpenAI()

class TestCase(BaseModel):
    id: str
    title: str
    category: Literal["positive", "negative", "edge_case", "security", "performance"]
    preconditions: list[str]
    steps: list[str]
    expected_result: str
    priority: Literal["critical", "high", "medium", "low"]
    test_data: dict

class TestSuite(BaseModel):
    # Wrapper model: the parse API expects a BaseModel, not a bare list type
    cases: list[TestCase]

async def generate_test_cases(
    feature_description: str,
    acceptance_criteria: list[str],
    existing_test_cases: list[str] | None = None,
) -> list[TestCase]:
    existing_context = (
        f"\nAlready existing test cases (do not duplicate):\n{chr(10).join(existing_test_cases[:10])}"
        if existing_test_cases else ""
    )
    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": f"""You are a QA engineer with 8 years of experience.
Create test cases according to the IEEE 829 standard.
Must include: happy path, boundary values, negative scenarios, security.
Test data must be specific (not 'test data').{existing_context}"""
        }, {
            "role": "user",
            "content": f"""Feature: {feature_description}
Acceptance criteria:
{chr(10).join(f'- {ac}' for ac in acceptance_criteria)}""",
        }],
        response_format=TestSuite,
        temperature=0.3,
    )
    return response.choices[0].message.parsed.cases
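The user prompt is assembled with `chr(10)` joins because, before Python 3.12, f-string expressions could not contain a backslash such as `"\n"`. The same assembly can be seen standalone (the feature and criteria below are hypothetical example inputs):

```python
feature_description = "Password reset via email"  # hypothetical example input
acceptance_criteria = [
    "User receives a reset link within 5 minutes",
    "Link expires after 24 hours",
]

# chr(10) is a newline; usable inside f-string expressions on all Python 3.x
prompt = f"""Feature: {feature_description}
Acceptance criteria:
{chr(10).join(f'- {ac}' for ac in acceptance_criteria)}"""

print(prompt)
```

Each acceptance criterion becomes its own `- ` bullet line, which gives the model a clear one-criterion-per-line structure to cover.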
Auto-Test Generation
class AutotestGenerator:
    async def generate_pytest_tests(
        self,
        test_cases: list[TestCase],
        api_schema: dict,
        existing_fixtures: str = "",
    ) -> str:
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": f"""Generate Python pytest tests.
Rules:
- Use parametrize for similar test cases
- Use existing fixtures: {existing_fixtures[:200] if existing_fixtures else 'none'}
- Descriptive function names: test_should_X_when_Y
- Assertions with clear error messages
- Isolated tests (each test is independent)
API schema: {json.dumps(api_schema, indent=2)[:1000]}"""
            }, {
                "role": "user",
                "content": f"Generate pytest tests for:\n{json.dumps([tc.model_dump() for tc in test_cases], ensure_ascii=False, indent=2)}",
            }],
            temperature=0.2,
        )
        return response.choices[0].message.content

    async def generate_playwright_tests(
        self,
        test_cases: list[TestCase],
        page_object_models: str = "",
    ) -> str:
        """Generates Playwright E2E tests."""
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": f"""Generate Playwright tests in TypeScript.
Use the Page Object Model. Available POMs: {page_object_models[:300] if page_object_models else 'none'}
Each test is independent. Test data comes via test.use({{}}) or constants."""
            }, {
                "role": "user",
                "content": json.dumps([tc.model_dump() for tc in test_cases], ensure_ascii=False),
            }],
        )
        return response.choices[0].message.content
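The system-prompt rules target output in the style sketched below. This is a hand-written illustration of that style, not actual generator output; `validate_transfer_amount` is a hypothetical function under test:

```python
import pytest

def validate_transfer_amount(amount: float, balance: float) -> bool:
    # Hypothetical function under test: a transfer is valid if the amount
    # is positive and does not exceed the available balance
    return 0 < amount <= balance

# One parametrized test covers positive, edge, and negative cases in one place
@pytest.mark.parametrize("amount, balance, expected", [
    (50.0, 100.0, True),    # positive: normal transfer
    (100.0, 100.0, True),   # edge case: exact balance
    (0.0, 100.0, False),    # negative: zero amount
    (150.0, 100.0, False),  # negative: overdraft
])
def test_should_validate_amount_when_within_balance(amount, balance, expected):
    result = validate_transfer_amount(amount, balance)
    assert result == expected, (
        f"amount={amount}, balance={balance}: expected {expected}, got {result}"
    )
```

The `test_should_X_when_Y` naming and the per-assertion failure message follow the rules the generator is prompted with, so generated and hand-written tests stay stylistically consistent.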
Failed Test Analysis
class FailedTestAnalyzer:
    async def analyze_failure(
        self,
        test_name: str,
        error_message: str,
        stacktrace: str,
        recent_commits: list[dict],
        test_history: list[dict],
    ) -> dict:
        """Analyzes the cause of a test failure and suggests a fix."""
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "You are a Senior QA Engineer. Analyze failed tests: identify the root cause, distinguish flaky tests from real errors, and propose specific fixes."
            }, {
                "role": "user",
                "content": f"""Failed test: {test_name}
Error: {error_message}
Stacktrace: {stacktrace[:1000]}
Recent commits: {json.dumps(recent_commits[:5], ensure_ascii=False)}
Test history (last 10 runs): {[r['status'] for r in test_history[-10:]]}
Determine: 1) problem type (flaky/regression/env), 2) likely cause, 3) a proposed fix.""",
            }],
        )
        return {
            "analysis": response.choices[0].message.content,
            "is_flaky": self.detect_flaky_pattern(test_history),
            "likely_cause": self.extract_root_cause(error_message, stacktrace),
        }

    def detect_flaky_pattern(self, history: list[dict]) -> bool:
        """A test is flaky if it alternates pass/fail without an obvious pattern."""
        statuses = [r["status"] for r in history[-10:]]
        passes = statuses.count("passed")
        fails = statuses.count("failed")
        # Flaky: both outcomes present, and the last three runs are not all
        # failures (three straight failures look like a real regression)
        return passes >= 2 and fails >= 2 and statuses[-3:] != ["failed"] * 3

    def extract_root_cause(self, error_message: str, stacktrace: str) -> str:
        """Coarse keyword heuristic; the LLM analysis supplies the detail."""
        text = f"{error_message}\n{stacktrace}".lower()
        if "timeout" in text or "connection" in text:
            return "environment/network"
        if "assert" in text:
            return "assertion failure (possible regression)"
        return "unknown"
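A standalone sketch of the flaky-detection heuristic on two hypothetical run histories: a test is flagged flaky when both outcomes appear in recent runs but the last three runs are not all failures (three consecutive failures suggest a real regression):

```python
def detect_flaky_pattern(history: list[dict]) -> bool:
    # Look at the last 10 runs only; older history is less relevant
    statuses = [r["status"] for r in history[-10:]]
    passes = statuses.count("passed")
    fails = statuses.count("failed")
    # Flaky: both outcomes present, and not ending in three straight failures
    return passes >= 2 and fails >= 2 and statuses[-3:] != ["failed"] * 3

flaky_history = [{"status": s} for s in
                 ["passed", "failed", "passed", "passed", "failed", "passed"]]
regression_history = [{"status": s} for s in
                      ["passed", "passed", "passed", "failed", "failed", "failed"]]

print(detect_flaky_pattern(flaky_history))       # True: alternating pass/fail
print(detect_flaky_pattern(regression_history))  # False: three straight failures
```

The two-of-each threshold keeps a single spurious failure in an otherwise green history from being flagged as flaky.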
Coverage Report
class CoverageReporter:
    async def generate_coverage_report(
        self,
        coverage_data: dict,
        test_cases: list[dict],
        code_diff: str = "",
    ) -> str:
        report = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Create a test coverage report for the team. Highlight uncovered critical paths and recommend which tests to write first."
            }, {
                "role": "user",
                "content": f"""Coverage: {json.dumps(coverage_data, indent=2)[:1000]}
Test case count: {len(test_cases)}, of which automated: {sum(1 for t in test_cases if t.get('automated'))}
Code changes (diff): {code_diff[:500] if code_diff else 'not provided'}"""
            }],
        )
        return report.choices[0].message.content
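Since the coverage payload is truncated to 1000 characters, it can help to pre-rank modules by coverage before prompting so the weakest spots survive the cut. A minimal sketch with hypothetical per-module percentages:

```python
# Hypothetical per-module line-coverage percentages, as a coverage tool
# might report them
coverage_data = {
    "app/payments.py": 42.0,
    "app/auth.py": 88.5,
    "app/reports.py": 67.3,
}

# Rank modules from least to most covered so the report leads with the gaps
ranked = sorted(coverage_data.items(), key=lambda kv: kv[1])
worst_module, worst_pct = ranked[0]
print(f"Lowest coverage: {worst_module} at {worst_pct}%")
```

Feeding the ranked list (rather than the raw dict) into the user prompt puts the uncovered critical paths first, where truncation cannot drop them.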
Practical Case Study: Fintech, 3 QA Engineers for 8 Developers
Situation: the QA team could not keep up with covering new code. Coverage stood at 51%, and test debt kept piling up.
AI QA in the Process:
- On PR opening: automatic test case generation from diff
- Pytest test generation for new API endpoints
- Analysis of failed tests in CI with fix suggestions
- Weekly coverage report with priorities
Results:
- Test coverage: 51% → 79% in 3 months
- Time to write tests: -55%
- Regression detection before production: +34%
- Flaky tests identified and flagged: 23 tests
Timeline
- Test case generator from requirements: 1–2 weeks
- Auto-generation of pytest/Playwright tests: 2–3 weeks
- Failed test analyzer + CI integration: 1–2 weeks
- Coverage reporting: 1 week
- Total: 5–8 weeks