Intelligent RPA Bots with Computer Vision for UI Interaction


Development of Intelligent RPA Bots with Computer Vision for UI Automation

Classic RPA bots (UiPath, Automation Anywhere, Blue Prism) interact with UIs through selectors and screen coordinates. Both are fragile: a coordinate-based click misses if a button moves by 5 pixels, and a selector fails when the underlying markup changes. Computer vision removes this dependency: the bot "sees" the screen as an image and finds the required element by its visual characteristics, regardless of its position, DOM structure, or window hierarchy.
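
To make the contrast concrete, here is a toy sketch in pure Python, assuming a grayscale screen represented as a list of rows. It uses exhaustive template matching (sum of absolute differences), which is a much simpler stand-in for a trained detector. `find_template`, `make_screen`, and the sample data are illustrative names, not part of any RPA product.

```python
def find_template(screen, template):
    """Return (x, y) of the template's center at its best match on the screen.

    screen and template are grayscale images given as lists of rows.
    The score is the sum of absolute differences (lower is better).
    """
    sh, sw = len(screen), len(screen[0])
    th, tw = len(template), len(template[0])
    best_score, best_pos = None, None
    for top in range(sh - th + 1):
        for left in range(sw - tw + 1):
            score = sum(
                abs(screen[top + r][left + c] - template[r][c])
                for r in range(th)
                for c in range(tw)
            )
            if best_score is None or score < best_score:
                best_score, best_pos = score, (left + tw // 2, top + th // 2)
    return best_pos


def make_screen(button_left, button_top, width=40, height=20):
    """Build a blank screen with a 4x3 'button' drawn at the given position."""
    screen = [[0] * width for _ in range(height)]
    for r in range(3):
        for c in range(4):
            screen[button_top + r][button_left + c] = 200 + 10 * r + c
    return screen


# The template is the button's appearance, independent of where it sits
button = [[200 + 10 * r + c for c in range(4)] for r in range(3)]

before = find_template(make_screen(10, 5), button)
after = find_template(make_screen(15, 5), button)  # button moved 5 px right
```

A click hard-coded to the `before` coordinates would miss after the move, while the visual search returns the new center.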

When the CV Approach is Necessary

A CV extension for RPA is justified in specific scenarios:

  • Working with legacy systems that have no API access and a closed window hierarchy (old COBOL/AS400 terminals, Citrix Virtual Desktop)
  • Web applications with dynamically generated class names (React/Angular with CSS Modules), where XPath selectors are unstable
  • Working with PDF documents and scanned images within the RPA flow
  • Automating third-party desktop applications that ship without an SDK

Architecture of a CV-RPA Bot

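from typing import Optional

import cv2
import numpy as np
import pyautogui
from ultralytics import YOLO

class ElementNotFoundError(Exception):
    """Raised when no detected element matches the request."""

class ScreenshotEngine:
    # Minimal stand-in; the original class is not shown in the source
    def capture(self) -> np.ndarray:
        # pyautogui returns an RGB PIL image; Ultralytics expects BGR arrays
        return cv2.cvtColor(np.array(pyautogui.screenshot()), cv2.COLOR_RGB2BGR)

class CVRPAAgent:
    def __init__(self, ui_detector_model: str):
        # YOLOv8 fine-tuned on UI elements
        self.detector = YOLO(ui_detector_model)
        self.screenshot_engine = ScreenshotEngine()

    def find_element(self, element_type: str,
                     text_hint: Optional[str] = None) -> tuple[int, int]:
        screenshot = self.screenshot_engine.capture()
        # predict() returns one Results object per input image
        result = self.detector.predict(screenshot, conf=0.7)[0]

        # Keep (confidence, [x1, y1, x2, y2]) for boxes of the requested class
        candidates = [
            (float(box.conf[0]), [int(v) for v in box.xyxy[0].tolist()])
            for box in result.boxes
            if result.names[int(box.cls[0])] == element_type
        ]
        if text_hint:
            candidates = self._filter_by_ocr_text(candidates, screenshot, text_hint)

        if not candidates:
            raise ElementNotFoundError(f"Cannot find {element_type}")

        # The click target is the center of the most confident box
        _, (x1, y1, x2, y2) = max(candidates, key=lambda c: c[0])
        return (x1 + x2) // 2, (y1 + y2) // 2

    def _filter_by_ocr_text(self, candidates, screenshot, text_hint):
        # Not shown in the source: should OCR each box and keep text matches
        raise NotImplementedError

    def click(self, element_type: str, text_hint: Optional[str] = None):
        x, y = self.find_element(element_type, text_hint)
        pyautogui.click(x, y)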
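
The `_filter_by_ocr_text` step is referenced above but not shown. One possible shape for it is sketched below as a standalone function, under several assumptions: the `Candidate` record, the injected `read_text` callable (standing in for a real OCR call on a box), and the `min_similarity` threshold are all hypothetical. Fuzzy matching via `difflib` is our choice here, not necessarily what the production bot does.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Sequence


@dataclass
class Candidate:
    """Hypothetical detection record: class label, confidence, pixel box."""
    class_name: str
    confidence: float
    bbox: tuple[int, int, int, int]  # x1, y1, x2, y2


def filter_by_ocr_text(
    candidates: Sequence[Candidate],
    read_text: Callable[[tuple[int, int, int, int]], str],
    text_hint: str,
    min_similarity: float = 0.8,
) -> list[Candidate]:
    """Keep candidates whose OCR text contains or closely matches text_hint."""
    hint = text_hint.strip().lower()
    kept = []
    for cand in candidates:
        text = read_text(cand.bbox).strip().lower()
        # The fuzzy ratio tolerates small OCR slips such as 'Subm1t' for 'Submit'
        similarity = SequenceMatcher(None, hint, text).ratio()
        if hint in text or similarity >= min_similarity:
            kept.append(cand)
    return kept


# Usage with a fake OCR lookup standing in for a real engine
fake_ocr = {(10, 10, 90, 40): "Cancel", (110, 10, 190, 40): "Subm1t"}
buttons = [Candidate("button", 0.93, box) for box in fake_ocr]
matches = filter_by_ocr_text(buttons, lambda box: fake_ocr[box], "Submit")
```

Injecting `read_text` keeps the filter independent of the OCR engine, so it can be exercised without a screen or a model.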

For UI element detection, we use YOLOv8 fine-tuned on a dataset of UI components (buttons, input fields, checkboxes, dropdowns). Training data: the Rico dataset (66k Android UI screens) plus custom annotation of the client's specific interface.
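
Custom annotations have to reach the detector in YOLO label format: one text line per box, holding the class index followed by the box center and size, all normalized to the image dimensions. The helper below is a minimal sketch of that conversion. The class list and its order are assumptions for illustration and would have to match the dataset config actually used for training.

```python
# Hypothetical class order; it must match the dataset config used for training
UI_CLASSES = ["button", "input_field", "checkbox", "dropdown"]


def to_yolo_label(class_name, bbox, image_width, image_height):
    """Convert a pixel box (x1, y1, x2, y2) to a YOLO label line.

    YOLO detection labels store: class_id x_center y_center width height,
    with all four geometry values normalized to the 0..1 range.
    """
    x1, y1, x2, y2 = bbox
    if not (0 <= x1 < x2 <= image_width and 0 <= y1 < y2 <= image_height):
        raise ValueError(f"Box {bbox} is outside a {image_width}x{image_height} image")
    class_id = UI_CLASSES.index(class_name)
    x_center = (x1 + x2) / 2 / image_width
    y_center = (y1 + y2) / 2 / image_height
    width = (x2 - x1) / image_width
    height = (y2 - y1) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A 200x50 px button at (860, 500) on a 1920x1080 screenshot
label = to_yolo_label("button", (860, 500, 1060, 550), 1920, 1080)
```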

OCR Integration for Text Extraction

To extract text from the screen, we use PaddleOCR (in our experience, the best balance of speed and accuracy for Cyrillic) or EasyOCR. The pipeline is: find the element → extract text from its ROI (Region of Interest) → pass the text to the processing logic.

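from paddleocr import PaddleOCR

# Written against the PaddleOCR 2.x API; lang='ru' loads the Cyrillic model
ocr = PaddleOCR(use_angle_cls=True, lang='ru')

def extract_text_from_region(image, bbox):
    """Return the text found inside bbox = (x1, y1, x2, y2) of a NumPy image."""
    x1, y1, x2, y2 = bbox
    region = image[y1:y2, x1:x2]  # NumPy indexing: rows (y) first, then columns (x)
    result = ocr.ocr(region, cls=True)
    # result[0] is None when no text is detected in the region
    if not result or result[0] is None:
        return ''
    return ' '.join(line[1][0] for line in result[0])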
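
Before the text is passed to the processing logic, it usually helps to drop low-confidence fragments and put lines in reading order. The sketch below does that on plain Python data, assuming each OCR line has the `[quad, (text, confidence)]` shape produced by PaddleOCR 2.x. The function name `join_ocr_lines` and the 0.6 threshold are illustrative choices, not values from the project.

```python
def join_ocr_lines(lines, min_confidence=0.6):
    """Join OCR lines in reading order, skipping low-confidence ones.

    Each line is assumed to look like [quad, (text, confidence)], where quad
    is four [x, y] corner points, as in PaddleOCR 2.x results.
    """
    if not lines:  # covers both None and an empty list
        return ""
    kept = [line for line in lines if line[1][1] >= min_confidence]
    # Sort by the top-left corner: top to bottom, then left to right
    kept.sort(key=lambda line: (line[0][0][1], line[0][0][0]))
    return " ".join(line[1][0] for line in kept)


sample = [
    [[[10, 40], [90, 40], [90, 60], [10, 60]], ("12 500,00", 0.97)],
    [[[10, 10], [90, 10], [90, 30], [10, 30]], ("Итого:", 0.97)],
    [[[95, 12], [99, 12], [99, 16], [95, 16]], ("~", 0.21)],  # noise, dropped
]
text = join_ocr_lines(sample)
```

Sorting on the exact top-left corner is a simplification: words on the same visual row with slightly different y values would need row grouping first.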

Working in Citrix and RDP Environments

In a Citrix environment, the bot has no access to the window hierarchy of the remote desktop. The solution is to capture screenshots through a Citrix Virtual Channel or plain screen capture, analyze them with the CV model, and perform clicks through virtual mouse and keyboard input. An additional complication is that Citrix video stream compression reduces image quality, so we train the model on low-quality screenshots.
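
The capture, analyze, click sequence can be sketched as a small loop with the three stages injected as callables. Everything here is hypothetical scaffolding: `click_when_visible`, its parameters, and the retry policy are assumptions of this sketch. The retry is added because a remote frame may be captured while it is still rendering.

```python
import time
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]


def click_when_visible(
    capture: Callable[[], object],
    locate: Callable[[object], Optional[Point]],
    click: Callable[[int, int], None],
    attempts: int = 3,
    delay_s: float = 0.5,
) -> Point:
    """Capture the remote screen, locate the target, then click it.

    Retries a few times because a remote frame may still be rendering
    or heavily compressed when it is captured (an assumption of this sketch).
    """
    for attempt in range(attempts):
        frame = capture()
        point = locate(frame)
        if point is not None:
            click(*point)
            return point
        if attempt < attempts - 1:
            time.sleep(delay_s)
    raise TimeoutError(f"Target not found after {attempts} attempts")


# Usage with stand-ins: the target only appears on the second frame
frames = iter(["loading", "ready"])
clicked = []
result = click_when_visible(
    capture=lambda: next(frames),
    locate=lambda frame: (640, 360) if frame == "ready" else None,
    click=lambda x, y: clicked.append((x, y)),
    delay_s=0.0,
)
```

In a real deployment, `capture` would wrap the screen grab, `locate` the CV model, and `click` the virtual mouse input.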

Reliability Metrics

| Metric | Classic RPA | CV-RPA |
|---|---|---|
| Resistance to element position change | Low | High |
| Resistance to UI framework change | Medium | High |
| Execution speed | Fast | 15–25% slower |
| Element finding accuracy | 99% (with correct XPath) | 91–96% |
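
Per-element accuracy compounds over a multi-step flow, which is worth keeping in mind when reading the accuracy row. The rough calculation below assumes that steps fail independently and that a retry succeeds with the same probability as the first attempt. Both are simplifications, and `flow_success_rate` is an illustrative helper rather than a measured result.

```python
def flow_success_rate(step_accuracy, steps, attempts_per_step=1):
    """Probability that every step of a flow succeeds.

    Assumes steps fail independently and that a retry has the same
    success chance as the first attempt (both are simplifications).
    """
    step_success = 1 - (1 - step_accuracy) ** attempts_per_step
    return step_success ** steps


for accuracy in (0.91, 0.96, 0.99):
    single = flow_success_rate(accuracy, steps=10)
    retried = flow_success_rate(accuracy, steps=10, attempts_per_step=3)
    print(f"{accuracy:.2f} per element -> {single:.1%} per flow, "
          f"{retried:.1%} with 3 attempts")
```

Under these assumptions, a ten-step flow at 96% per element completes roughly two times out of three without retries, which is why retry logic matters more for CV-RPA than the per-element figure suggests.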

| Automation Complexity | Timeline |
|---|---|
| 1–3 processes, ready interfaces | 2–4 weeks |
| 5–10 processes, Citrix/RDP | 5–8 weeks |
| Complex automation with model training | 8–14 weeks |