Training RL Agent (PPO/SAC/DQN) for Trading Strategy

The three RL algorithms most commonly used in algorithmic trading have different strengths. The choice depends on the strategy's architecture: discrete or continuous action space, and on-policy or off-policy learning.

DQN (Deep Q-Network)

Suitable for: discrete actions (buy/hold/sell), simple strategies, sufficient stability.

DQN learns a Q-function, Q(state, action): the expected discounted return from taking a given action in a given state.

import torch
import torch.nn as nn
import numpy as np  # used by the replay buffer below
from collections import deque
import random

class DQNNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden_dim=256):
        super().__init__()
        # Dueling architecture: separate Value and Advantage streams
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.value_stream = nn.Linear(hidden_dim, 1)
        self.advantage_stream = nn.Linear(hidden_dim, n_actions)
    
    def forward(self, x):
        shared = self.shared(x)
        value = self.value_stream(shared)
        advantage = self.advantage_stream(shared)
        # Dueling: Q = V + (A - mean(A))
        q_values = value + (advantage - advantage.mean(dim=1, keepdim=True))
        return q_values

class PrioritizedReplayBuffer:
    """Prioritized Experience Replay — sample important transitions more often"""
    def __init__(self, capacity=50000, alpha=0.6):
        self.buffer = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)
        self.alpha = alpha
    
    def push(self, state, action, reward, next_state, done, td_error=1.0):
        priority = (abs(td_error) + 1e-5) ** self.alpha
        self.buffer.append((state, action, reward, next_state, done))
        self.priorities.append(priority)
    
    def sample(self, batch_size, beta=0.4):
        probs = np.array(self.priorities) / sum(self.priorities)
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        
        # Importance sampling weights
        weights = (len(self.buffer) * probs[indices]) ** (-beta)
        weights /= weights.max()
        
        batch = [self.buffer[i] for i in indices]
        return batch, indices, weights
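During training the agent typically acts epsilon-greedily over the Q-values. A minimal sketch (the 0=sell / 1=hold / 2=buy encoding is an assumption for illustration):

```python
import random
import torch

def select_action(q_net, state, epsilon, n_actions=3):
    """Epsilon-greedy action selection; 0=sell, 1=hold, 2=buy (illustrative)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)    # explore: random action
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))  # batch of one -> (1, n_actions)
        return q_values.argmax(dim=1).item()  # exploit: greedy action
```

Epsilon is usually annealed from around 1.0 down to 0.05 over the first part of training.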

Double DQN reduces Q-value overestimation: the online network selects the action, the target network evaluates it.

# Double DQN target calculation
with torch.no_grad():
    next_actions = online_net(next_states).argmax(dim=1)  # online net selects
    next_q = target_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)  # target net evaluates
    targets = rewards + gamma * next_q * (1 - dones)
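Putting the buffer and the Double DQN target together, one update step might look like this. A sketch assuming the DQNNetwork and PrioritizedReplayBuffer defined above; the hyperparameters are illustrative:

```python
import numpy as np
import torch
import torch.nn as nn

def dqn_train_step(online_net, target_net, buffer, optimizer,
                   batch_size=32, gamma=0.99, beta=0.4):
    """One Double-DQN update with prioritized replay (sketch)."""
    batch, indices, weights = buffer.sample(batch_size, beta)
    states, actions, rewards, next_states, dones = map(
        lambda x: torch.as_tensor(np.array(x), dtype=torch.float32), zip(*batch))
    actions = actions.long()

    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1)
        next_q = target_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)
        targets = rewards + gamma * next_q * (1 - dones)

    q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    td_errors = targets - q
    # Importance-sampling weights correct the bias from non-uniform sampling
    loss = (torch.as_tensor(weights, dtype=torch.float32) * td_errors.pow(2)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Refresh priorities with the new TD errors
    for i, err in zip(indices, td_errors.detach().abs()):
        buffer.priorities[i] = (err.item() + 1e-5) ** buffer.alpha
    return loss.item()
```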

PPO (Proximal Policy Optimization)

Suitable for: discrete and continuous actions, on-policy, stable learning.

PPO limits policy update size through clipping:

class PPOActorCritic(nn.Module):
    # Shared trunk with separate policy and value heads
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh()
        )
        self.policy_head = nn.Linear(hidden_dim, action_dim)
        self.value_head = nn.Linear(hidden_dim, 1)
    
    def forward(self, x):
        features = self.network(x)
        logits = self.policy_head(features)
        value = self.value_head(features)
        return logits, value

def ppo_update(model, optimizer, states, actions, old_log_probs, 
               advantages, returns, clip_eps=0.2, n_epochs=4):
    for _ in range(n_epochs):
        logits, values = model(states)
        dist = torch.distributions.Categorical(logits=logits)
        new_log_probs = dist.log_prob(actions)
        entropy = dist.entropy()
        
        # PPO clipped objective
        ratio = (new_log_probs - old_log_probs).exp()
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        
        actor_loss = -torch.min(surr1, surr2).mean()
        critic_loss = (returns - values.squeeze()).pow(2).mean()
        entropy_loss = -entropy.mean()
        
        total_loss = actor_loss + 0.5 * critic_loss + 0.01 * entropy_loss
        
        optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()
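The advantages and returns passed into ppo_update are typically computed with Generalized Advantage Estimation (GAE); a minimal sketch:

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (sketch).
    rewards, dones: 1-D tensors of length T; values has one extra
    trailing element for the bootstrap value V(s_T)."""
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages[t] = gae
    returns = advantages + values[:-1]
    return advantages, returns
```

Advantages are usually normalized (zero mean, unit variance) per batch before the PPO update.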

SAC (Soft Actor-Critic)

Suitable for: continuous action space (e.g. position sizing from 0% to 100% of capital), off-policy, the best sample efficiency of the three.

SAC maximizes: J(π) = E[Σ γ^t (r_t + α H(π(·|s_t)))]

The additional entropy term H encourages exploration and prevents premature convergence to a deterministic policy.

class SACActorContinuous(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU()
        )
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        self.log_std_head = nn.Linear(hidden_dim, action_dim)
    
    def forward(self, x):
        features = self.network(x)
        mean = self.mean_head(features)
        log_std = self.log_std_head(features).clamp(-20, 2)
        std = log_std.exp()
        
        dist = torch.distributions.Normal(mean, std)
        action = dist.rsample()  # reparameterization trick
        # Squash to [-1, 1]
        action_tanh = torch.tanh(action)
        log_prob = dist.log_prob(action) - torch.log(1 - action_tanh.pow(2) + 1e-6)
        
        return action_tanh, log_prob.sum(-1, keepdim=True)
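The actor above is trained against twin Q-critics. The soft Bellman target adds the entropy bonus; a sketch, with a fixed alpha rather than the auto-tuned temperature, and critics assumed to take concatenated (state, action) input:

```python
import torch
import torch.nn as nn

def soft_q_target(actor, target_q1, target_q2, next_states, rewards, dones,
                  gamma=0.99, alpha=0.2):
    """Soft Bellman target: r + gamma * (min(Q1', Q2') - alpha * log_pi).
    rewards and dones are expected with shape (batch, 1)."""
    with torch.no_grad():
        next_actions, next_log_probs = actor(next_states)
        q_in = torch.cat([next_states, next_actions], dim=-1)
        # Twin critics: take the minimum to curb overestimation
        next_q = torch.min(target_q1(q_in), target_q2(q_in))
        # Entropy bonus: subtracting alpha * log_pi rewards stochastic policies
        target = rewards + gamma * (1 - dones) * (next_q - alpha * next_log_probs)
    return target
```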

Algorithm Comparison for Crypto Trading

Algorithm | Action space          | Sample efficiency | Stability | Best application
----------|-----------------------|-------------------|-----------|---------------------------
DQN       | Discrete              | Medium            | Medium    | Simple buy/sell strategies
PPO       | Discrete & continuous | Low (on-policy)   | High      | General purpose, reliable
SAC       | Continuous            | High              | High      | Position sizing as action

Multi-agent trading

Multiple RL agents on different timeframes:

  • Macro agent (1D): determines the overall direction
  • Micro agent (1H): times entries and exits
  • Execution agent (15M): optimizes order execution

The macro agent's signal is passed in as part of the micro agent's state.
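Wiring the hierarchy together is mostly state construction: the macro agent's output becomes extra features in the micro agent's observation. A sketch; the feature layout and signal encoding are assumptions:

```python
import numpy as np

def build_micro_state(micro_features: np.ndarray,
                      macro_direction: int,
                      macro_confidence: float) -> np.ndarray:
    """Append the macro agent's signal to the micro agent's feature vector.
    macro_direction: -1 (short), 0 (flat), +1 (long); macro_confidence in [0, 1]."""
    return np.concatenate(
        [micro_features, [float(macro_direction), float(macro_confidence)]])
```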

Key challenges

Market non-stationarity: an agent trained on 2020–2021 data may perform poorly in 2022–2023. Continuous learning or periodic retraining is mandatory.

Reward hacking: the agent can find ways to earn reward that do not correspond to actual trading profit. Careful reward design is critical.
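A common mitigation is to reward risk-adjusted PnL rather than raw PnL, with explicit cost terms so the agent cannot profit from behavior that loses money in practice. A sketch; the function and its coefficients are illustrative:

```python
def shaped_reward(pnl, position_change, drawdown,
                  fee_rate=0.001, dd_penalty=0.1):
    """Reward = PnL minus transaction costs and a drawdown penalty.
    Charging a fee on every position change removes the 'free' churning
    that a raw-PnL reward would permit."""
    transaction_cost = fee_rate * abs(position_change)
    return pnl - transaction_cost - dd_penalty * max(drawdown, 0.0)
```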

Overfitting to training data: the agent can memorize specific patterns from the training period. Evaluate on completely held-out test data.
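Walk-forward evaluation addresses both non-stationarity and overfitting: train on a rolling window, test on the period immediately after it, then roll forward. A minimal index-splitting sketch:

```python
def walk_forward_splits(n_samples, train_size, test_size, step=None):
    """Yield (train_range, test_range) index pairs for walk-forward evaluation.
    By default the window advances by test_size, so test periods never overlap."""
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step
```

Each split trains a fresh agent (or fine-tunes the previous one) and reports metrics only on its unseen test window.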

We develop RL trading agents with the algorithm (DQN/PPO/SAC) chosen to fit the task, a custom trading environment, reward shaping, walk-forward evaluation, and MLflow tracking.