Development of a quality control system for AI Workforce
AI Workforce Quality Assurance (QA) is a systematic process of checking the performance of AI agents through sampling, automated evaluation, and human review. Without QA, the system degrades unnoticed: prompts become outdated, LLMs are updated, and data drifts.
Sampling strategy
Reviewing every task is unrealistic at scale. A sound sampling strategy combines several approaches:
Random sample: 2–5% of all tasks for basic monitoring. Statistically representative.
Stratified sampling: separate samples by task type, priority, and client, so issues in rare categories are not missed.
Risk-based sampling: enhanced control for tasks with low confidence scores, new types of tasks, and high-value clients.
Triggered sampling: on an anomaly (error spike, drop in confidence), the sampling rate is increased automatically.
import random

class QualitySampler:
    def should_sample(self, task: CompletedTask) -> tuple[bool, str]:
        # Risk-based priorities
        if task.confidence_score < 0.6:
            return True, "low_confidence"
        if task.task_type in self.high_risk_types:
            return random.random() < 0.20, "high_risk_type"  # 20% sampling
        if task.customer_tier == "enterprise":
            return random.random() < 0.10, "enterprise_customer"
        # Base random sampling at 3%
        return random.random() < 0.03, "random"
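The triggered-sampling rule described above can be sketched as a small rate controller that watches a sliding window of recent outcomes. The window size, thresholds, and boosted rate here are illustrative assumptions, not values from the original system:

```python
import random
from collections import deque

class TriggeredSampler:
    """Boosts the sampling rate when the recent error rate spikes."""

    def __init__(self, base_rate: float = 0.03, boosted_rate: float = 0.25,
                 error_threshold: float = 0.10, window: int = 200):
        self.base_rate = base_rate
        self.boosted_rate = boosted_rate
        self.error_threshold = error_threshold
        self.outcomes = deque(maxlen=window)  # True = task failed

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    @property
    def error_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def current_rate(self) -> float:
        # Anomaly detected: recent error rate exceeds the threshold,
        # so sample more aggressively until the window recovers
        if len(self.outcomes) >= 50 and self.error_rate > self.error_threshold:
            return self.boosted_rate
        return self.base_rate

    def should_sample(self) -> bool:
        return random.random() < self.current_rate()
```

Because the window is bounded, the rate automatically decays back to the base level once healthy tasks push the failures out of the deque.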
Automatic assessment by an LLM judge
import json

from openai import OpenAI

class LLMQualityJudge:
    def __init__(self, judge_model: str = "gpt-4o"):
        self.client = OpenAI()
        self.judge_model = judge_model

    def evaluate(self, task: AgentTask, result: AgentResult, rubric: EvalRubric) -> QualityScore:
        prompt = f"""You are a quality judge for an AI agent. Evaluate the agent's work against the rubric.
TASK: {task.description}
CONTEXT: {task.context}
EXPECTED OUTCOME: {task.expected_outcome}
ACTUAL RESULT: {result.output}
AGENT ACTIONS: {format_agent_trace(result.trace)}
EVALUATION RUBRIC:
{rubric.to_text()}
Score each criterion from 0 to 5 and give an overall score. Respond in JSON with the keys "criteria", "overall", "reasoning", and "issues"."""
        response = self.client.chat.completions.create(
            model=self.judge_model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        scores = json.loads(response.choices[0].message.content)
        return QualityScore(
            criteria_scores=scores["criteria"],
            overall=scores["overall"],
            reasoning=scores["reasoning"],
            flagged_issues=scores.get("issues", []),
        )
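The judge above consumes an `EvalRubric` via `rubric.to_text()`. The original does not show that type, so here is one plausible shape as a minimal sketch; the criterion names and weights are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    name: str
    description: str
    weight: float = 1.0

@dataclass
class EvalRubric:
    """Hypothetical rubric type consumed by the judge prompt."""
    criteria: list[RubricCriterion] = field(default_factory=list)

    def to_text(self) -> str:
        # Rendered into the judge prompt, one line per criterion
        return "\n".join(
            f"- {c.name} (weight {c.weight}): {c.description}"
            for c in self.criteria
        )

rubric = EvalRubric([
    RubricCriterion("correctness", "The result matches the expected outcome", 2.0),
    RubricCriterion("completeness", "All parts of the task were addressed"),
    RubricCriterion("safety", "No policy-violating actions in the trace", 1.5),
])
```

Keeping the rubric as data rather than free text makes it easy to version it per task type and to aggregate per-criterion scores later.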
Calibration: LLM Judge vs. Human
An LLM judge is prone to bias: it tends to favor longer answers and penalize certain styles. Regular calibration against human labels keeps it honest:
import numpy as np
from sklearn.metrics import cohen_kappa_score

def calibrate_judge(judge: LLMQualityJudge, human_labels: list[HumanLabel]) -> CalibrationReport:
    judge_scores = [judge.evaluate(l.task, l.result, rubric).overall for l in human_labels]
    human_scores = [l.human_score for l in human_labels]
    # Cohen's kappa for inter-rater agreement
    kappa = cohen_kappa_score(
        [round(s) for s in human_scores],
        [round(s) for s in judge_scores],
    )
    # Systematic bias of the judge relative to humans
    bias = np.mean(np.array(judge_scores) - np.array(human_scores))
    return CalibrationReport(
        kappa=kappa,  # target > 0.6
        bias=bias,    # target ≈ 0
        correlation=np.corrcoef(human_scores, judge_scores)[0, 1],
        needs_recalibration=kappa < 0.5 or abs(bias) > 0.3,
    )
Human review workflow
Flagged tasks are queued for manual review, prioritized so that low confidence combined with high impact goes first. The reviewer interface shows the task, the agent's response, the AI judge's rating, and fields for a score and a comment. SLA: enterprise tasks are reviewed within 4 hours, standard tasks within 24 hours.
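The "low confidence + high impact first" ordering can be sketched as a priority heap; the scoring formula and task IDs below are illustrative assumptions:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class ReviewItem:
    priority: float           # lower value = reviewed sooner
    task_id: str = field(compare=False)

def review_priority(confidence: float, impact: float) -> float:
    # Low confidence and high impact both push the value toward zero,
    # so such tasks surface at the top of the heap
    return confidence * (1.0 - impact)

queue: list[ReviewItem] = []
heapq.heappush(queue, ReviewItem(review_priority(0.9, 0.2), "t-routine"))
heapq.heappush(queue, ReviewItem(review_priority(0.4, 0.9), "t-critical"))
heapq.heappush(queue, ReviewItem(review_priority(0.5, 0.3), "t-medium"))

first = heapq.heappop(queue)  # the low-confidence, high-impact task
```

A real queue would also fold the SLA deadline into the priority, e.g. by tightening it for enterprise tasks.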
Reporting and trend tracking
Weekly QC report: sampling statistics, score distribution, top 10 issues, and a comparison with the previous week. If the average quality score has dropped by more than 0.1 over the week, an automatic alert is sent to the team.
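The alert rule reduces to a comparison of weekly mean scores. A minimal sketch, assuming scores arrive as plain lists per week:

```python
def weekly_quality_alert(current_scores: list[float],
                         previous_scores: list[float],
                         drop_threshold: float = 0.1) -> bool:
    """True when the mean score dropped by more than the threshold week-over-week."""
    cur = sum(current_scores) / len(current_scores)
    prev = sum(previous_scores) / len(previous_scores)
    return (prev - cur) > drop_threshold
```

For noisy, low-volume weeks it may be worth comparing against a trailing multi-week average instead of a single previous week to avoid spurious alerts.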