AI Path Planning System for Autonomous Vehicles
Path planning in autonomous vehicles is a three-level task: global (route A→B), local (obstacle avoidance), and reactive (emergency braking within ~100 ms). RL is best suited to the local and reactive levels, where classical algorithms (A*, RRT, MPC) require hand-coding hundreds of edge cases.
Autonomous Driving Stack
Perception: LiDAR (Velodyne, Ouster, Livox) + Camera (RGB, stereo) + Radar + GPS/IMU. Fusion via Kalman filtering or learned (deep) fusion.
Localization: NDT matching, LOAM/LIO-SAM, HD Map matching. Accuracy: <10 cm in cities.
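The Kalman-style fusion step mentioned above reduces, in the simplest scalar case, to blending a prediction (e.g., IMU integration) with a noisy measurement (e.g., GPS position) weighted by their variances. A minimal 1-D sketch with illustrative numbers:

```python
def kalman_update(x_pred, P_pred, z, R):
    """Fuse a predicted state (variance P_pred) with a measurement z
    (variance R). Scalar 1-D case of the Kalman correction step."""
    K = P_pred / (P_pred + R)          # Kalman gain: trust ratio
    x = K * (z - x_pred) + x_pred      # corrected estimate
    P = (1 - K) * P_pred               # reduced covariance
    return x, P

# IMU integration predicts 10.0 m (variance 4.0); GPS reads 10.8 m (variance 1.0)
x, P = kalman_update(10.0, 4.0, 10.8, 1.0)
# the estimate moves most of the way toward the more certain GPS reading
```

The same structure generalizes to the matrix form used for full-state LiDAR/GPS/IMU fusion.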
Planning:
- Global: A* / Dijkstra on HD Map (OpenStreetMap + Lanelet2)
- Local/reactive: RL, MPC, lattice planner
Control: Pure Pursuit, Stanley, MPC → steering/throttle/brake signals.
Frameworks: Autoware (ROS2-based, open source), Apollo (Baidu), CARLA (simulator).
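The global level can be illustrated with a toy A* search on a 4-connected occupancy grid (Manhattan heuristic). A real stack runs this on a Lanelet2 road graph rather than a grid; the sketch only shows the algorithm:

```python
import heapq

def astar(grid, start, goal):
    """A* on a 0/1 occupancy grid, 4-connected, Manhattan-distance heuristic."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_set = [(h(start), 0, start, [start])]   # (f, g, node, path)
    seen = set()
    while open_set:
        _, g, node, path = heapq.heappop(open_set)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = node[0] + dr, node[1] + dc
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and not grid[r][c]:
                heapq.heappush(open_set,
                               (g + 1 + h((r, c)), g + 1, (r, c), path + [(r, c)]))
    return None  # no route exists

grid = [[0, 0, 0],
        [1, 1, 0],   # wall across most of row 1
        [0, 0, 0]]
path = astar(grid, (0, 0), (2, 0))  # detours around the wall
```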
RL for Local Planning
CARLA simulator: Python/C++ API, photorealistic rendering, vehicle physics, pedestrians, and other traffic participants. Well suited to training RL agents.
State space:
# Bird's-eye view (BEV) representation
state = {
    'bev_map': np.zeros((256, 256, 7)),   # occupancy, lanes, route, vehicles...
    'ego_state': np.array([vx, vy, heading, steer, throttle]),
    'route_waypoints': waypoints[:20],    # next 20 waypoints
    'traffic_light': light_state          # red / yellow / green
}
Action space (continuous):
from gymnasium import spaces

action_space = spaces.Box(
    low=np.array([-1.0, 0.0, 0.0]),    # steer, throttle, brake
    high=np.array([1.0, 1.0, 1.0]),
    dtype=np.float32
)
# or discrete: {turn_left, go_straight, turn_right, stop}
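Note one practical detail: the actor network later in this section ends in Tanh, producing values in [-1, 1]³, while the Box above bounds throttle and brake to [0, 1]. A common fix is a per-dimension affine rescale; a minimal sketch:

```python
import numpy as np

LOW  = np.array([-1.0, 0.0, 0.0])   # steer, throttle, brake lower bounds
HIGH = np.array([ 1.0, 1.0, 1.0])   # upper bounds

def squash_to_box(tanh_action):
    """Map a tanh output in [-1, 1]^n onto [LOW, HIGH] element-wise."""
    a = np.asarray(tanh_action, dtype=np.float64)
    return LOW + (a + 1.0) * 0.5 * (HIGH - LOW)

out = squash_to_box([0.0, 0.0, -1.0])  # steer 0.0, throttle 0.5, brake 0.0
```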
Reward shaping for safe driving:
def compute_reward(self, action, info):
    reward = 0.0
    # route progress
    reward += info['route_completion'] * 5.0
    # speed (not too slow, not too fast)
    target_speed = 30 / 3.6            # 30 km/h in m/s
    reward -= abs(info['speed'] - target_speed) * 0.1
    # violations
    if info['collision']:
        reward -= 100.0
    if info['lane_invasion']:
        reward -= 10.0
    if info['red_light_violation']:
        reward -= 50.0
    # comfort: penalize abrupt control changes (jerk proxy)
    jerk = abs(action[1] - self.prev_throttle) + abs(action[0] - self.prev_steer)
    reward -= jerk * 0.5
    self.prev_steer, self.prev_throttle = action[0], action[1]
    return reward
Neural Network Architecture
CNN + LSTM for BEV input:
import torch
import torch.nn as nn

class ADPlanningNet(nn.Module):
    def __init__(self):
        super().__init__()
        # BEV encoder: (7, 256, 256) input → 2048-d feature
        self.bev_encoder = nn.Sequential(
            nn.Conv2d(7, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten()
        )  # → 128 * 4 * 4 = 2048
        # waypoints encoder: (x, y) sequence → 64-d final hidden state
        self.waypoint_encoder = nn.GRU(2, 64, batch_first=True)
        # actor: BEV + waypoints + 5-d ego state → steer, throttle, brake
        self.actor = nn.Sequential(
            nn.Linear(2048 + 64 + 5, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Tanh()
        )

    def forward(self, bev, waypoints, ego_state):
        bev_feat = self.bev_encoder(bev)          # (B, 2048)
        _, h = self.waypoint_encoder(waypoints)   # h: (1, B, 64)
        x = torch.cat([bev_feat, h.squeeze(0), ego_state], dim=-1)
        return self.actor(x)                      # (B, 3) in [-1, 1]
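The GRU waypoint encoder returns both per-step outputs and a final hidden state; only the final hidden state is concatenated with the BEV features. A self-contained sketch of that pattern (batch-first tensors, same shapes as `waypoint_encoder` above):

```python
import torch
import torch.nn as nn

gru = nn.GRU(2, 64, batch_first=True)   # (x, y) input → 64-d hidden state
wps = torch.randn(8, 20, 2)             # batch of 8 routes, 20 waypoints each
out, h = gru(wps)                       # out: (8, 20, 64), h: (1, 8, 64)
feat = h.squeeze(0)                     # (8, 64): one summary vector per route
```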
Transformer for multi-agent scenarios: attention over other traffic participants (pedestrians, vehicles); scales to a variable number of objects.
Hierarchical Planning
Level 3: Mission Planner (A* on HD Map) → sequence of waypoints
Level 2: Behavioral layer (RL agent) → tactic choice: overtake / follow / yield
Level 1: Motion Planner (MPC / lattice) → trajectory for 3–5 s
Level 0: PID Controller → actuator commands @ 100 Hz
RL at Level 2 makes high-level decisions (overtake/don't overtake) at ~10 Hz. MPC at Level 1 ensures smooth trajectory following.
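The rate decoupling between levels can be sketched as holding the last behavioral decision for 10 control ticks before querying the RL agent again (function and tactic names here are illustrative, not Autoware/Apollo APIs):

```python
def run_stack(behavioral_policy, motion_planner, controller, get_state, n_ticks=100):
    """Level 2 at 10 Hz, Levels 1/0 at 100 Hz: reuse the last tactic
    for 10 control ticks, replan the trajectory and control every tick."""
    tactic = None
    for tick in range(n_ticks):
        state = get_state()
        if tick % 10 == 0:                          # 100 Hz / 10 = 10 Hz
            tactic = behavioral_policy(state)       # overtake / follow / yield
        trajectory = motion_planner(state, tactic)  # 3-5 s horizon
        yield controller(state, trajectory)         # actuator command @ 100 Hz

# usage with stub components, counting how often the RL agent is queried
calls = {'behavioral': 0}
def policy(state):
    calls['behavioral'] += 1
    return 'follow'

cmds = list(run_stack(policy, lambda s, t: t, lambda s, tr: tr, lambda: None))
```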
Safety Constraints
For autonomous vehicles, a safety layer above RL is mandatory:
RSS (Responsibility-Sensitive Safety): Intel's formal safety model. Computes safe distances in real time and overrides the RL action when a violation is imminent.
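The RSS longitudinal safe distance has a closed form: with rear-vehicle speed v_r, front-vehicle speed v_f, response time ρ, maximum acceleration a_max during the response, guaranteed rear braking b_min, and strongest front braking b_max, the rear car must keep d_min = v_r·ρ + ½·a_max·ρ² + (v_r + ρ·a_max)² / (2·b_min) − v_f² / (2·b_max), clamped at 0. A direct transcription, with illustrative parameter values:

```python
def rss_safe_distance(v_rear, v_front, rho=0.5, a_max=3.0, b_min=4.0, b_max=8.0):
    """RSS minimum longitudinal gap (m). Speeds in m/s, rho = response
    time (s), a_max = worst-case accel during response, b_min = rear car's
    guaranteed braking, b_max = front car's strongest possible braking."""
    v_resp = v_rear + rho * a_max            # rear speed after response time
    d = (v_rear * rho
         + 0.5 * a_max * rho**2
         + v_resp**2 / (2 * b_min)
         - v_front**2 / (2 * b_max))
    return max(d, 0.0)

d = rss_safe_distance(20.0, 20.0)  # both cars at ~72 km/h
```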
Control Barrier Functions (CBF): Mathematically guaranteed collision avoidance through RL action modification.
# illustrative API: cbf_safety / CBFSafetyLayer are placeholder names
from cbf_safety import CBFSafetyLayer

safety_layer = CBFSafetyLayer(safety_margin=1.5)  # 1.5 m buffer
raw_action = rl_policy.predict(state)
safe_action = safety_layer.project(raw_action, obstacles)
# safe_action stays close to raw_action but satisfies the barrier constraints
Testing
Scenario-based testing: CARLA's scenario library (ScenarioRunner) covers corner cases: a pedestrian suddenly darting into the road, a wrong-way ("ghost") vehicle, a slippery road surface.
Adversarial testing: RL adversarial agent intentionally creates difficult situations to test the policy.
Metrics:
- Route completion rate (RCR) — % of successfully completed routes
- Infraction rate — infractions per 1 km
- Comfort: acceleration/jerk metrics
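The two headline metrics above can be computed directly from per-episode logs; the field names here are illustrative, not a CARLA leaderboard schema:

```python
def evaluate(episodes):
    """episodes: list of dicts with 'completed' (bool), 'km' (float driven),
    'infractions' (int). Returns (route completion rate, infractions per km)."""
    rcr = sum(e['completed'] for e in episodes) / len(episodes)
    total_km = sum(e['km'] for e in episodes)
    infractions_per_km = sum(e['infractions'] for e in episodes) / total_km
    return rcr, infractions_per_km

eps = [{'completed': True,  'km': 2.0, 'infractions': 1},
       {'completed': False, 'km': 1.0, 'infractions': 2},
       {'completed': True,  'km': 3.0, 'infractions': 0}]
rcr, ipk = evaluate(eps)  # 2 of 3 routes completed; 3 infractions over 6 km
```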
Timeline: 24–48 weeks
Basic RL agent in CARLA on straight urban routes — 12 weeks. Full system with hierarchy, safety layer, and complex scenarios — 24–32 weeks. Real hardware (road tests) is beyond the scope of a typical software project.