Development of Intelligent RPA Bots with Computer Vision for UI Automation
Classic RPA bots (UiPath, Automation Anywhere, Blue Prism) interact with UIs through selectors and coordinates. They are fragile: if a button moves by 5 pixels, the bot breaks. Computer Vision eliminates this dependency: the bot "sees" the screen as an image and finds the required element by its visual characteristics, regardless of its position, the internal DOM structure, or the window hierarchy.
When the CV Approach is Necessary
CV extension for RPA is justified in specific scenarios:
- Working with legacy systems without API access and closed window hierarchy (old COBOL/AS400 terminals, Citrix Virtual Desktop)
- Web applications with dynamically generated classes (React/Angular with CSS Modules) where XPath is unstable
- Working with PDF documents and scanned images within the RPA flow
- Automating third-party desktop applications without SDK
Architecture of CV-RPA Bot
```python
import cv2
import numpy as np
import pyautogui
from ultralytics import YOLO


class CVRPAAgent:
    def __init__(self, ui_detector_model: str):
        # YOLOv8 fine-tuned on UI elements (buttons, inputs, checkboxes, ...)
        self.detector = YOLO(ui_detector_model)
        # Project-specific screen-capture wrapper
        self.screenshot_engine = ScreenshotEngine()

    def find_element(self, element_type: str,
                     text_hint: str = None) -> tuple[int, int]:
        screenshot = self.screenshot_engine.capture()
        detections = self.detector.predict(screenshot, conf=0.7)
        candidates = [d for d in detections if d.class_name == element_type]
        if text_hint:
            candidates = self._filter_by_ocr_text(candidates, screenshot, text_hint)
        if not candidates:
            raise ElementNotFoundError(f"Cannot find {element_type}")
        # Pick the detection the model is most confident about
        best = max(candidates, key=lambda d: d.confidence)
        return best.center_x, best.center_y

    def click(self, element_type: str, text_hint: str = None):
        x, y = self.find_element(element_type, text_hint)
        pyautogui.click(x, y)
```
For UI element detection, we use YOLOv8 fine-tuned on a dataset of UI components (buttons, input fields, checkboxes, dropdowns). Base training data: the Rico dataset (66k Android UI screens) plus custom annotation for the client's specific interface.
OCR Integration for Text Extraction
To extract text data from the screen, we use PaddleOCR (the best balance of speed and accuracy for Cyrillic) or EasyOCR. Integration into the pipeline: find the element → extract text from its ROI (Region of Interest) → pass the text to the processing logic.
```python
import paddleocr

# Angle classification handles rotated text; 'ru' covers Cyrillic
ocr = paddleocr.PaddleOCR(use_angle_cls=True, lang='ru')

def extract_text_from_region(image, bbox):
    x1, y1, x2, y2 = bbox
    region = image[y1:y2, x1:x2]
    result = ocr.ocr(region, cls=True)
    if not result or result[0] is None:
        return ''  # PaddleOCR returns None for regions with no text
    return ' '.join(line[1][0] for line in result[0])
```
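Raw OCR output usually needs post-processing before it reaches business logic: numeric fields in Cyrillic UIs suffer from confusable characters ('О' vs '0', 'З' vs '3'). A hedged sketch of such a normalizer (the mapping and `parse_amount` helper are illustrative assumptions, not part of PaddleOCR):

```python
import re
from typing import Optional

# Common OCR confusions when reading numeric fields from Cyrillic text
# (illustrative mapping, not exhaustive)
_CONFUSABLES = str.maketrans({'О': '0', 'о': '0', 'З': '3', 'з': '3', 'б': '6'})

def parse_amount(ocr_text: str) -> Optional[float]:
    """Extract a monetary amount from raw OCR text, e.g. '1 250,50 руб.'"""
    text = ocr_text.translate(_CONFUSABLES)
    match = re.search(r'\d[\d\s]*(?:[.,]\d{1,2})?', text)
    if not match:
        return None
    # Drop thousands spaces, normalize the decimal comma
    number = match.group(0).replace(' ', '').replace(',', '.')
    return float(number)
```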
Working in Citrix and RDP Environments
In a Citrix environment, the bot has no access to the window hierarchy of the remote desktop. The solution: capture screenshots through a Citrix Virtual Channel or plain screen capture, analyze them with the CV model, and send clicks via virtual mouse/keyboard input. An additional complication: Citrix video stream compression reduces image quality, so we train the model on low-quality screenshots.
Reliability Metrics
| Metric | Classic RPA | CV-RPA |
|---|---|---|
| Resistance to element position change | Low | High |
| Resistance to UI framework change | Medium | High |
| Execution speed | Fast | 15–25% slower |
| Element finding accuracy | 99% (with correct XPath) | 91–96% |
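The table suggests a hybrid strategy: use the fast, highly accurate selector path while it works, and fall back to CV detection when it fails. A hypothetical combinator sketch (`locate_with_fallback` and both callables are assumptions, not an API of any RPA platform):

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]

def locate_with_fallback(by_selector: Callable[[], Optional[Point]],
                         by_vision: Callable[[], Optional[Point]]) -> Point:
    """Prefer the fast selector lookup; fall back to the CV detector."""
    point = by_selector()
    if point is not None:
        return point
    point = by_vision()  # slower, but survives UI changes
    if point is None:
        raise LookupError("element not found by selector or vision")
    return point
```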
Implementation Timeline
| Automation Complexity | Timeline |
|---|---|
| 1–3 processes, ready interfaces | 2–4 weeks |
| 5–10 processes, Citrix/RDP | 5–8 weeks |
| Complex automation with model training | 8–14 weeks |