Reinforcement Learning: PPO, SAC, DQN and Industrial Application
Most RL projects don't die from picking the wrong algorithm; they die from wrong reward design. An engineer writes reward = +1 for the correct action, runs training, and after 10M steps the agent has found the reward without solving the task. This is reward hacking, the main pain of industrial RL.
Why RL Is Harder Than Supervised Learning
Supervised learning has a dataset with correct answers. RL has only a scalar "better/worse" signal, often delayed by hundreds of steps. The agent must explore the space on its own and discover a strategy.
The consequences: unstable training, high sensitivity to hyperparameters, slow convergence. PPO on Atari converges in about 10M steps, i.e. hours. Robotic tasks with realistic physics take days or weeks in a simulator.
Algorithm Selection by Task
| Task | Algorithm | Reason |
|---|---|---|
| Continuous control (robotics, process) | SAC, TD3 | Sample efficiency, stability |
| Discrete actions, game-playing | PPO, DQN + Rainbow | Simplicity, well-studied |
| Multi-agent | MAPPO, QMIX | Cooperation/competition |
| Offline RL (static dataset, no environment) | CQL, IQL, TD3+BC | Conservatism against out-of-distribution actions |
| RLHF (LLM alignment) | PPO, GRPO | Reward model integration |
PPO: The De-Facto Standard
PPO (Proximal Policy Optimization) is the workhorse of RL, used everywhere from games to RLHF. The main idea: limit each policy update by clipping the probability ratio (clip_range=0.2), which gives the stability that vanilla policy gradient lacks.
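The clipped surrogate objective behind that stability can be sketched in a few lines of NumPy (illustrative, not a library API):

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, clip_range=0.2):
    """Mean clipped surrogate objective over a batch of transitions."""
    ratio = np.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_range, 1 + clip_range) * advantages
    # Taking the elementwise minimum removes the incentive to push the ratio
    # outside [1 - clip_range, 1 + clip_range]; that is what caps update size.
    return np.minimum(unclipped, clipped).mean()
```

With a positive advantage, doubling an action's probability only earns credit up to the 1.2 clip boundary; the rest of the gradient signal is cut off.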
Common PPO tuning problems:
Entropy collapse. The agent becomes too deterministic too fast and stops exploring. Symptom: policy entropy drops to zero and the agent gets stuck in a local optimum. Solution: ent_coef=0.01–0.05, and don't let entropy fall below 0.001 during training.
Value function diverges. With vf_coef too high, the critic overfits to the current policy. Symptom: negative explained_variance. Treatment: vf_coef=0.5 plus gradient clipping with max_grad_norm=0.5.
Wrong n_steps. The Stable-Baselines3 default is n_steps=2048. For long-horizon tasks (>500 steps per episode), increase it; for quick tasks (10–50 steps), decrease it to 256–512.
The main library for a quick start is stable-baselines3 plus sb3-contrib. For research, use Tianshou or CleanRL (single-file implementations, easier to read and modify).
SAC for Continuous Control
SAC (Soft Actor-Critic) adds entropy maximization to the objective: the agent learns to be both good and diverse. The result is excellent sample efficiency and robustness to noise.
On process-control tasks SAC usually beats PPO on sample efficiency: fewer environment interactions for the same quality. The key parameter is target_entropy; it is usually set automatically, but manual tuning helps on specific tasks.
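What target_entropy controls can be sketched with SAC's automatic temperature tuning. This is a NumPy illustration of the mechanism, not a library API; the heuristic target of -dim(A) for continuous actions is the common default:

```python
import numpy as np

def target_entropy_heuristic(action_dim):
    """Common default for continuous control: target_entropy = -dim(A)."""
    return -float(action_dim)

def alpha_loss(log_alpha, log_probs, target_entropy):
    """Loss for the temperature parameter alpha.

    Minimizing this with gradient descent raises alpha when the policy's
    entropy (-mean log_probs) falls below target_entropy, forcing more
    exploration, and lowers alpha when entropy is above target.
    """
    alpha = np.exp(log_alpha)
    return -(alpha * (log_probs + target_entropy)).mean()
```

Setting target_entropy higher than the heuristic keeps the policy stochastic for longer; lower values let it commit earlier.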
Sim-to-Real: Simulator to Real Hardware
Training on a real robot is costly and dangerous. The standard approach: train in a simulator, then transfer to hardware. The main problem is the reality gap: the simulator doesn't reproduce real physics, friction, or sensor noise.
Domain randomization is the main tool. During training, randomly vary environment parameters: object mass ±30%, friction ±50%, action delay 0–100 ms, observation noise σ=0.01–0.1. The agent learns robustness, and the real world becomes just another variation.
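As a sketch, the ranges above can live in an environment wrapper that resamples physics parameters on every reset. The `env` interface here (nominal_mass, nominal_friction, reset/step) is a hypothetical stand-in for illustration:

```python
import numpy as np

class DomainRandomization:
    """Resamples physics parameters each episode and adds observation noise."""

    def __init__(self, env, rng=None):
        self.env = env
        self.rng = rng if rng is not None else np.random.default_rng()
        self.obs_noise_sigma = 0.05  # overwritten on reset

    def reset(self):
        e, r = self.env, self.rng
        e.mass = e.nominal_mass * r.uniform(0.7, 1.3)         # mass +/-30%
        e.friction = e.nominal_friction * r.uniform(0.5, 1.5)  # friction +/-50%
        e.action_delay_ms = r.uniform(0.0, 100.0)              # delay 0-100 ms
        self.obs_noise_sigma = r.uniform(0.01, 0.1)            # noise sigma
        return self._noisy(e.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._noisy(obs), reward, done, info

    def _noisy(self, obs):
        return obs + self.rng.normal(0.0, self.obs_noise_sigma, size=np.shape(obs))
```

The key design choice is resampling per episode, not per step, so the agent sees a consistent but unpredictable world each rollout.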
Simulators: MuJoCo (robotics standard), Isaac Gym / Isaac Lab from NVIDIA (GPU-accelerated, 10,000+ parallel environments single GPU), PyBullet (free, slower), Gazebo (ROS integration).
Case study. A manipulator sorting PCB components. Isaac Gym with 4096 parallel environments on an A100, PPO with domain randomization (random mass, lighting, camera position). 500M steps in 18 hours. Transfer to a real UR5: 78% success without fine-tuning; after 2 hours of real-world fine-tuning (10k steps), 94%.
RLHF: Training LLMs from Human Feedback
RLHF became the standard for LLM alignment after InstructGPT. The classic pipeline: supervised fine-tuning → reward model training → PPO.
Classic RLHF via PPO has well-known problems: instability (KL-divergence explosions), slow convergence, and tuning complexity. Alternatives are gaining popularity:
- DPO (Direct Preference Optimization) bypasses the reward model and trains directly on preference pairs. Simpler and more stable, but less flexible.
- GRPO (Group Relative Policy Optimization), used in DeepSeek-R1; works well for reasoning.
- ORPO combines SFT and alignment in a single stage.
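The DPO idea fits in one formula: per preference pair, maximize the implicit reward margin between chosen and rejected responses relative to a frozen reference model. A NumPy sketch of the per-pair loss:

```python
import numpy as np

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probs.

    The logits measure how much more the policy prefers the chosen response
    over the rejected one, relative to the reference model; beta scales the
    implicit KL penalty.
    """
    logits = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-logits)))  # -log(sigmoid(logits))
```

No reward model and no rollouts are needed, which is why DPO is both simpler and more stable than the PPO pipeline, at the cost of being tied to static preference data.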
Library: trl from Hugging Face, the standard for RLHF/DPO. It supports PPO, DPO, ORPO, and GRPO out of the box, and works with PEFT/LoRA for memory-efficient training.
Workflow
Good reward design is 70% of success. Start with reward engineering: detail the desired behavior, formalize it in a reward function, and check it for hacking scenarios. Only then select the algorithm and environment.
Next: simulator setup, baseline experiments, a systematic hyperparameter sweep via Optuna or Ray Tune, and analysis of learning curves.
Timelines: a proof of concept on a standard task takes 2–4 weeks. A production system with a custom environment and sim-to-real, 3–8 months. RLHF for an LLM, 4–10 weeks depending on the volume of preference data.