Human-in-the-Loop implementation for validating AI results
Human-in-the-Loop (HITL) is a pattern in which humans are involved in the AI's decision-making process: they review low-confidence results, correct errors, and provide feedback for retraining. This isn't an admission of AI weakness, but a rational approach to risk management in tasks with a high cost of error.
When is HITL needed?
- Model confidence is below the threshold (e.g., confidence < 0.7)
- The prediction entails irreversible consequences (medical diagnosis, legal document, major transaction)
- An abnormal input query that is outside the training distribution
- Regulatory requirements (e.g., the GDPR right to an explanation of automated decisions)
- Accumulation of data for retraining (active learning)
HITL system architecture
from enum import Enum
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

class ReviewOutcome(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    CORRECT = "correct"

@dataclass
class ReviewTask:
    task_id: str
    input_data: dict
    ai_prediction: dict
    confidence: float
    reason: str                         # Why the task was sent for review
    priority: str                       # high/medium/low
    created_at: datetime
    deadline: Optional[datetime] = None
class HumanInTheLoopOrchestrator:
    def __init__(self, confidence_threshold: float = 0.85):
        self.threshold = confidence_threshold
        self.review_queue = ReviewQueue()

    def process(self, input_data: dict, ai_result: dict) -> dict:
        confidence = ai_result.get('confidence', 1.0)
        needs_review, reason = self._should_review(ai_result, confidence)
        if needs_review:
            task = self.review_queue.submit(
                input_data=input_data,
                ai_prediction=ai_result,
                confidence=confidence,
                reason=reason,
                priority=self._compute_priority(confidence, input_data)
            )
            return {
                'status': 'pending_review',
                'task_id': task.task_id,
                'estimated_wait_minutes': self.review_queue.estimated_wait()
            }
        else:
            return {
                'status': 'auto_approved',
                'prediction': ai_result,
                'confidence': confidence
            }

    def _should_review(self, result: dict, confidence: float) -> tuple:
        if confidence < self.threshold:
            return True, f"Low confidence: {confidence:.2f}"
        if result.get('is_anomalous'):
            return True, "Anomalous input detected"
        if result.get('high_value_transaction'):
            return True, "High-value transaction requires approval"
        return False, None
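The orchestrator above references a `ReviewQueue` and a `_compute_priority` method that are not shown. A minimal, self-contained sketch with stub implementations illustrates the routing behavior; every name beyond the original (`StubReviewQueue`, the priority policy, the wait estimate) is an assumption for illustration, not the article's production code:

```python
import uuid
from dataclasses import dataclass

@dataclass
class StubTask:
    task_id: str

class StubReviewQueue:
    """In-memory stand-in for the real ReviewQueue (hypothetical)."""
    def __init__(self):
        self.tasks = []

    def submit(self, **kwargs) -> StubTask:
        task = StubTask(task_id=str(uuid.uuid4()))
        self.tasks.append((task, kwargs))
        return task

    def estimated_wait(self) -> int:
        # Naive estimate: five minutes of reviewer time per queued task
        return 5 * len(self.tasks)

class Orchestrator:
    def __init__(self, confidence_threshold: float = 0.85):
        self.threshold = confidence_threshold
        self.review_queue = StubReviewQueue()

    def _compute_priority(self, confidence: float, input_data: dict) -> str:
        # Hypothetical policy: very low confidence or flagged inputs go first
        if confidence < 0.5 or input_data.get("high_value"):
            return "high"
        return "medium" if confidence < 0.7 else "low"

    def process(self, input_data: dict, ai_result: dict) -> dict:
        confidence = ai_result.get("confidence", 1.0)
        if confidence < self.threshold:
            task = self.review_queue.submit(
                input_data=input_data,
                ai_prediction=ai_result,
                confidence=confidence,
                reason=f"Low confidence: {confidence:.2f}",
                priority=self._compute_priority(confidence, input_data),
            )
            return {"status": "pending_review", "task_id": task.task_id,
                    "estimated_wait_minutes": self.review_queue.estimated_wait()}
        return {"status": "auto_approved", "prediction": ai_result,
                "confidence": confidence}

orch = Orchestrator()
print(orch.process({"amount": 10}, {"confidence": 0.95})["status"])  # auto_approved
print(orch.process({"amount": 10}, {"confidence": 0.40})["status"])  # pending_review
```

The key design point the stub preserves: the caller gets an immediate answer either way, and a pending-review response carries enough metadata (task id, wait estimate) for the client to poll or subscribe for the outcome.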
UI for reviewers
# FastAPI endpoints for the review interface
@app.get("/review/queue")
async def get_review_queue(reviewer: Reviewer = Depends(get_reviewer)):
    tasks = await review_queue.get_pending(
        reviewer_expertise=reviewer.expertise_areas,
        limit=20
    )
    return [ReviewTaskResponse.from_task(t) for t in tasks]

@app.post("/review/{task_id}/submit")
async def submit_review(
    task_id: str,
    outcome: ReviewOutcome,
    correction: dict = None,
    comment: str = None,
    reviewer: Reviewer = Depends(get_reviewer)
):
    await review_store.save_outcome(
        task_id=task_id,
        reviewer_id=reviewer.id,
        outcome=outcome,
        correction=correction,
        comment=comment
    )
    # Feed corrections and rejections into active learning
    if outcome in [ReviewOutcome.CORRECT, ReviewOutcome.REJECT]:
        task = await review_store.get_task(task_id)  # load the original task (accessor assumed)
        await active_learning_buffer.add(
            input_data=task.input_data,
            ground_truth=correction or {"label": "rejected"},
            source="human_review"
        )
    # Unblock the request that is waiting for this outcome
    await pending_requests.resolve(task_id, outcome, correction)
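The `pending_requests.resolve(...)` call above implies a registry where the original API request waits until a reviewer submits an outcome. A minimal sketch of such a registry built on asyncio futures; the class and method names are assumptions for illustration, not the article's actual implementation:

```python
import asyncio
from typing import Optional

class PendingRequests:
    """Maps task_id -> future that the original caller awaits (hypothetical sketch)."""
    def __init__(self):
        self._futures: dict = {}

    def register(self, task_id: str) -> asyncio.Future:
        # The original request handler calls this, then awaits the future
        fut = asyncio.get_running_loop().create_future()
        self._futures[task_id] = fut
        return fut

    async def resolve(self, task_id: str, outcome: str,
                      correction: Optional[dict] = None):
        # Called from the review endpoint; wakes up the waiting request
        fut = self._futures.pop(task_id, None)
        if fut is not None and not fut.done():
            fut.set_result({"outcome": outcome, "correction": correction})

async def demo():
    pending = PendingRequests()
    fut = pending.register("task-1")
    # In production a reviewer resolves this later; here we resolve immediately
    await pending.resolve("task-1", "approve")
    return await fut

result = asyncio.run(demo())
print(result["outcome"])  # approve
```

In a multi-process deployment this in-memory map would not survive; the same pattern is usually implemented with a message broker or a client-side poll on the task id instead of an in-process future.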
Active Learning from HITL Data
The outcomes of manual review are a valuable training signal:
class ActiveLearningPipeline:
    def __init__(self, min_samples_for_retrain: int = 500):
        self.buffer = []
        self.min_samples = min_samples_for_retrain

    def add_reviewed_sample(self, features: dict, ground_truth, confidence: float):
        # Uncertainty sampling: prioritize hard examples
        self.buffer.append({
            'features': features,
            'label': ground_truth,
            'weight': 1 / (confidence + 0.01)  # Higher weight for uncertain examples
        })
        if len(self.buffer) >= self.min_samples:
            self._trigger_retraining()
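Under this weighting scheme, a sample reviewed at confidence 0.2 carries roughly four times the training weight of one at 0.9. A quick check of the formula used above:

```python
def sample_weight(confidence: float) -> float:
    # Same formula as the pipeline above: inverse confidence,
    # smoothed by 0.01 to avoid division by zero at confidence == 0
    return 1 / (confidence + 0.01)

for c in (0.2, 0.5, 0.9):
    print(f"confidence={c:.1f} -> weight={sample_weight(c):.2f}")
```

The smoothing constant caps the maximum weight at 100, so a handful of zero-confidence samples cannot dominate the retraining batch.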
HITL doesn't have to slow down business processes: with the right architecture, 90%+ of requests are handled automatically, and human review is focused on the genuinely hard cases. Moreover, every human label becomes training signal that improves the model.
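That 90%+ automation target is worth tracking explicitly in production. A minimal counter sketch (the class and method names are hypothetical) that computes the automation rate from orchestrator outcomes:

```python
from collections import Counter

class HITLMetrics:
    """Tracks what fraction of requests bypass human review (illustrative sketch)."""
    def __init__(self):
        self.outcomes = Counter()

    def record(self, status: str):
        # status is the orchestrator's result, e.g. 'auto_approved' or 'pending_review'
        self.outcomes[status] += 1

    def automation_rate(self) -> float:
        total = sum(self.outcomes.values())
        return self.outcomes["auto_approved"] / total if total else 0.0

metrics = HITLMetrics()
for status in ["auto_approved"] * 9 + ["pending_review"]:
    metrics.record(status)
print(f"{metrics.automation_rate():.0%}")  # 90%
```

A sustained drop in this rate is an early warning: either input drift is pushing confidence down, or the threshold needs retuning.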