Developing AI-Based Automated Test and Exam Generation System
AI-based generation of test questions produces unique items from educational materials at different difficulty levels (Bloom's taxonomy), with multiple exam variants per question — preventing cheating in mass testing.
Multi-Level Question Generation
import asyncio
import json
import random
from enum import Enum

from openai import AsyncOpenAI
client = AsyncOpenAI()
class BloomLevel(Enum):
REMEMBER = "remember" # memorize facts
UNDERSTAND = "understand" # explain concepts
APPLY = "apply" # apply in new situation
ANALYZE = "analyze" # break down into parts
EVALUATE = "evaluate" # critically evaluate
CREATE = "create" # create something new
BLOOM_PROMPTS = {
BloomLevel.REMEMBER: "Create a question testing fact memorization, dates, definitions",
BloomLevel.UNDERSTAND: "Create a comprehension question: explanation, paraphrasing, examples",
BloomLevel.APPLY: "Create a practical task: applying knowledge in new situation",
BloomLevel.ANALYZE: "Create an analysis question: comparison, identifying causes, structuring",
BloomLevel.EVALUATE: "Create an evaluation question: justifying judgment, critiquing approach",
BloomLevel.CREATE: "Create a synthesis task: developing solution, creating product",
}
async def generate_question(
topic: str,
source_text: str,
question_type: str, # multiple_choice, true_false, open_answer, case_study
bloom_level: BloomLevel = BloomLevel.UNDERSTAND,
difficulty: str = "medium"
) -> dict:
response = await client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "system",
"content": f"""Create a test question.
Type: {question_type}.
Bloom's taxonomy level: {bloom_level.value}. {BLOOM_PROMPTS[bloom_level]}.
Difficulty: {difficulty}.
For multiple_choice: 4 options, 1 correct, 3 plausible distractors.
For open_answer: reference answer + assessment criteria.
For case_study: scenario + 3-5 questions at different levels.
Return JSON: {{
question: "question text",
type: "{question_type}",
bloom_level: "{bloom_level.value}",
options: ["A...", "B...", ...], // only for MC
correct_answer: "...",
explanation: "why this answer is correct",
scoring_rubric: {{...}} // for open_answer
}}"""
}, {
"role": "user",
"content": f"Topic: {topic}\n\nMaterial:\n{source_text[:2000]}"
}],
response_format={"type": "json_object"}
)
return json.loads(response.choices[0].message.content)
Generating Complete Exam Variant
async def generate_exam_variant(
course_topics: list[str],
total_questions: int = 30,
time_limit_min: int = 60,
bloom_distribution: dict = None
) -> dict:
"""
bloom_distribution: {BloomLevel.REMEMBER: 0.3, BloomLevel.APPLY: 0.4, ...}
Default: 20% Remember, 30% Understand, 30% Apply, 20% Analyze
"""
if not bloom_distribution:
bloom_distribution = {
BloomLevel.REMEMBER: 0.2,
BloomLevel.UNDERSTAND: 0.3,
BloomLevel.APPLY: 0.3,
BloomLevel.ANALYZE: 0.2
}
questions_by_level = {
level: int(total_questions * fraction)
for level, fraction in bloom_distribution.items()
}
all_questions = []
tasks = []
for level, count in questions_by_level.items():
for i in range(count):
topic = course_topics[i % len(course_topics)]
q_type = "multiple_choice" if level in [BloomLevel.REMEMBER, BloomLevel.UNDERSTAND] else "open_answer"
tasks.append(generate_question(
topic=topic,
source_text="",
question_type=q_type,
bloom_level=level
))
all_questions = await asyncio.gather(*tasks)
return {
"variant_id": f"V{random.randint(1000, 9999)}",
"time_limit_min": time_limit_min,
"total_points": sum(q.get("points", 1) for q in all_questions),
"questions": list(all_questions),
"bloom_distribution": {l.value: c for l, c in questions_by_level.items()}
}
Automatic Grading of Open Answers
async def auto_grade_open_answer(
question: str,
correct_answer: str,
rubric: dict,
student_answer: str
) -> dict:
response = await client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "system",
"content": f"""Grade student answer using rubric.
Question: {question}
Reference answer: {correct_answer}
Assessment criteria: {json.dumps(rubric, ensure_ascii=False)}
Grade the answer and return JSON:
{{score: 0-100, feedback: "detailed feedback", strengths: [], weaknesses: []}}"""
}, {
"role": "user",
"content": f"Student answer: {student_answer}"
}],
response_format={"type": "json_object"}
)
return json.loads(response.choices[0].message.content)
Anti-Plagiarism — Unique Variants
async def generate_unique_variants(
base_question: str,
n_variants: int = 30,
maintain_difficulty: bool = True
) -> list[dict]:
"""Generate N unique versions of a question"""
response = await client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "system",
"content": f"""Create {n_variants} unique versions of the question.
Vary: numbers, names, context, order of answer options.
Difficulty {'should remain the same' if maintain_difficulty else 'can vary'}.
Return JSON array."""
}, {
"role": "user",
"content": f"Original question: {base_question}"
}],
response_format={"type": "json_object"}
)
return json.loads(response.choices[0].message.content)["variants"]
Timeframe: a test generator working from text material takes 1-2 weeks; a full-fledged platform with auto-grading, analytics, and LMS integration (Moodle/iSpring) takes 2-3 months.







