Training an RL Agent (PPO/SAC/DQN) for a Trading Strategy
The three RL algorithms most commonly used in algorithmic trading have different strengths. The choice depends on the strategy's architecture: discrete vs. continuous action space, and on-policy vs. off-policy learning.
DQN (Deep Q-Network)
Suitable for: discrete actions (buy/hold/sell), simple strategies; offers adequate stability.
DQN learns a Q-function Q(state, action): the expected discounted return from taking an action in a given state. Training targets follow the Bellman equation y = r + γ · max_a' Q_target(s', a').
```python
import random
from collections import deque

import numpy as np  # needed by PrioritizedReplayBuffer below
import torch
import torch.nn as nn


class DQNNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden_dim=256):
        super().__init__()
        # Dueling architecture: separate Value and Advantage streams
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.value_stream = nn.Linear(hidden_dim, 1)
        self.advantage_stream = nn.Linear(hidden_dim, n_actions)

    def forward(self, x):
        shared = self.shared(x)
        value = self.value_stream(shared)
        advantage = self.advantage_stream(shared)
        # Dueling: Q = V + (A - mean(A))
        q_values = value + (advantage - advantage.mean(dim=1, keepdim=True))
        return q_values


class PrioritizedReplayBuffer:
    """Prioritized Experience Replay: sample important transitions more often."""

    def __init__(self, capacity=50000, alpha=0.6):
        self.buffer = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)
        self.alpha = alpha

    def push(self, state, action, reward, next_state, done, td_error=1.0):
        priority = (abs(td_error) + 1e-5) ** self.alpha
        self.buffer.append((state, action, reward, next_state, done))
        self.priorities.append(priority)

    def sample(self, batch_size, beta=0.4):
        probs = np.array(self.priorities) / sum(self.priorities)
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        # Importance-sampling weights correct the bias of non-uniform sampling
        weights = (len(self.buffer) * probs[indices]) ** (-beta)
        weights /= weights.max()
        batch = [self.buffer[i] for i in indices]
        return batch, indices, weights

    def update_priorities(self, indices, td_errors):
        # Refresh priorities with fresh TD errors after a learning step
        for i, td in zip(indices, td_errors):
            self.priorities[i] = (abs(td) + 1e-5) ** self.alpha
```
Double DQN mitigates Q-value overestimation: the online network selects the action, while the target network evaluates it.
```python
# Double DQN target calculation
with torch.no_grad():
    next_actions = online_net(next_states).argmax(dim=1)  # online net selects
    # target net evaluates the selected actions
    next_q = target_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)
    targets = rewards + gamma * next_q * (1 - dones)
```
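Putting the pieces together: a minimal sketch of one full learning step, assuming an `online_net`/`target_net` pair of `DQNNetwork`s, an optimizer, and a filled `PrioritizedReplayBuffer` from above. `dqn_train_step` and all hyperparameter values are illustrative, not tuned:

```python
import numpy as np
import torch

def dqn_train_step(online_net, target_net, buffer, optimizer,
                   batch_size=64, gamma=0.99, beta=0.4):
    """One learning step: PER sampling, Double DQN targets, IS-weighted loss."""
    batch, indices, weights = buffer.sample(batch_size, beta=beta)
    states, actions, rewards, next_states, dones = (
        torch.as_tensor(np.array(x), dtype=torch.float32) for x in zip(*batch)
    )
    actions = actions.long()

    # Double DQN target: online net selects, target net evaluates
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1)
        next_q = target_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)
        targets = rewards + gamma * next_q * (1 - dones)

    q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    td_errors = targets - q

    # Importance-sampling weights compensate for the prioritized sampling bias
    loss = (torch.as_tensor(weights, dtype=torch.float32) * td_errors.pow(2)).mean()

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(online_net.parameters(), 10.0)
    optimizer.step()

    # Feed fresh TD errors back into the buffer's priorities
    buffer.update_priorities(indices, td_errors.detach().abs().numpy())
    return loss.item()
```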
PPO (Proximal Policy Optimization)
Suitable for: both discrete and continuous actions; on-policy; stable training.
PPO limits the size of each policy update by clipping the probability ratio r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) to the interval [1 − ε, 1 + ε]:
```python
class PPOActor(nn.Module):
    """Shared-trunk actor-critic: one backbone, separate policy and value heads."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh()
        )
        self.policy_head = nn.Linear(hidden_dim, action_dim)
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        features = self.network(x)
        logits = self.policy_head(features)
        value = self.value_head(features)
        return logits, value


def ppo_update(model, optimizer, states, actions, old_log_probs,
               advantages, returns, clip_eps=0.2, n_epochs=4):
    for _ in range(n_epochs):
        logits, values = model(states)
        dist = torch.distributions.Categorical(logits=logits)
        new_log_probs = dist.log_prob(actions)
        entropy = dist.entropy()

        # PPO clipped objective
        ratio = (new_log_probs - old_log_probs).exp()
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages

        actor_loss = -torch.min(surr1, surr2).mean()
        critic_loss = (returns - values.squeeze(-1)).pow(2).mean()
        entropy_loss = -entropy.mean()
        total_loss = actor_loss + 0.5 * critic_loss + 0.01 * entropy_loss

        optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()
```
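`ppo_update` expects precomputed `advantages` and `returns`; a common way to obtain them is Generalized Advantage Estimation (GAE). A minimal sketch, where `compute_gae` and the λ value are illustrative choices:

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """GAE over one collected trajectory.

    `values` must hold one extra bootstrap value for the state after
    the last step, i.e. len(values) == len(rewards) + 1.
    """
    advantages = torch.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual; (1 - dones[t]) cuts the bootstrap at episode ends
        delta = rewards[t] + gamma * values[t + 1] * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages[t] = gae
    returns = advantages + values[:-1]
    # Normalizing advantages stabilizes the clipped objective
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns
```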
SAC (Soft Actor-Critic)
Suitable for: continuous action spaces (e.g., position size from 0% to 100% of capital), off-policy, maximum sample efficiency.
SAC maximizes: J(π) = E[Σ_t γ^t (r_t + α·H(π(·|s_t)))]
The additional entropy term H encourages exploration and prevents premature collapse to a deterministic policy.
```python
class SACActorContinuous(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU()
        )
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        self.log_std_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        features = self.network(x)
        mean = self.mean_head(features)
        log_std = self.log_std_head(features).clamp(-20, 2)
        std = log_std.exp()
        dist = torch.distributions.Normal(mean, std)
        action = dist.rsample()  # reparameterization trick
        # Squash to [-1, 1] and correct the log-prob for the tanh change of variables
        action_tanh = torch.tanh(action)
        log_prob = dist.log_prob(action) - torch.log(1 - action_tanh.pow(2) + 1e-6)
        return action_tanh, log_prob.sum(-1, keepdim=True)
```
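The actor above is only half of SAC. A hedged sketch of the matching critic step, assuming twin Q-networks that take a (state, action) pair, and `rewards`/`dones` shaped `(batch, 1)` to match the actor's log-probs; the function name and hyperparameters are illustrative:

```python
import torch

def sac_critic_update(q1, q2, q1_target, q2_target, actor, q_optimizer,
                      batch, alpha=0.2, gamma=0.99, tau=0.005):
    """One critic step: entropy-regularized targets with twin Q-networks."""
    states, actions, rewards, next_states, dones = batch

    with torch.no_grad():
        next_actions, next_log_probs = actor(next_states)
        # Min over twin critics counters overestimation; alpha adds the entropy bonus
        next_q = torch.min(
            q1_target(next_states, next_actions),
            q2_target(next_states, next_actions),
        ) - alpha * next_log_probs
        targets = rewards + gamma * (1 - dones) * next_q

    q_loss = ((q1(states, actions) - targets).pow(2).mean()
              + (q2(states, actions) - targets).pow(2).mean())

    q_optimizer.zero_grad()
    q_loss.backward()
    q_optimizer.step()

    # Polyak (soft) update of the target networks
    for net, target in ((q1, q1_target), (q2, q2_target)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```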
Algorithm Comparison for Crypto Trading
| Algorithm | Action Space | Sample Efficiency | Stability | Best Application |
|---|---|---|---|---|
| DQN | Discrete | Medium | Medium | Simple buy/sell strategies |
| PPO | Both | Low (on-policy) | High | General purpose, reliable |
| SAC | Continuous | High | High | Position sizing as action |
Multi-Agent Trading
Multiple RL agents can operate on different timeframes:
- Macro agent (1D): determines the overall direction
- Micro agent (1H): times entries and exits
- Execution agent (15M): optimizes order execution
The macro agent's signal is passed in as part of the micro agent's state, as sketched below.
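A minimal sketch of how the hierarchy is wired; only the state composition is the point, and `build_micro_state` plus the agent objects are hypothetical placeholders:

```python
import numpy as np

def build_micro_state(micro_features: np.ndarray, macro_signal: float) -> np.ndarray:
    """Append the macro agent's directional signal to the micro agent's state.

    macro_signal is assumed to lie in [-1, 1] (short bias .. long bias),
    produced once per day and held constant across the 1H micro steps.
    """
    return np.concatenate([micro_features, [macro_signal]])

# Hypothetical usage inside the 1H loop:
# macro_signal = macro_agent.act(daily_state)   # updated once per day
# state = build_micro_state(hourly_features, macro_signal)
# action = micro_agent.act(state)
```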
Key Challenges
Market non-stationarity: an agent trained on 2020–2021 data may perform poorly in 2022–2023. Continuous learning or periodic retraining is mandatory.
Reward hacking: the agent may find ways to collect reward that do not correspond to actual trading profit. Careful reward design is critical.
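For example, rewarding raw PnL invites churn that looks profitable before fees. A hedged sketch of a shaped reward that nets out costs and penalizes drawdown; `shaped_reward` and its coefficients are illustrative, not tuned:

```python
def shaped_reward(pnl: float, position_change: float, drawdown: float,
                  fee_rate: float = 0.001, dd_penalty: float = 0.1) -> float:
    """Net PnL after transaction costs, with an explicit drawdown penalty.

    Rewarding raw PnL alone lets the agent collect reward from churning
    or from risk that never shows up in the reward signal.
    """
    transaction_cost = fee_rate * abs(position_change)
    return pnl - transaction_cost - dd_penalty * max(drawdown, 0.0)
```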
Overfitting to training data: the agent can memorize patterns specific to the training period. Always evaluate on completely held-out test data.
A complete development workflow therefore covers: selecting the algorithm best suited to the task (DQN/PPO/SAC), building a custom trading environment, reward shaping, walk-forward evaluation, and MLflow experiment tracking.
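A hedged sketch of that walk-forward loop: `train_agent` and `evaluate` are hypothetical placeholders, while the `mlflow` calls use the standard tracking API:

```python
import mlflow

def walk_forward(data, train_window=180, test_window=30):
    """Rolling retrain/evaluate splits over a time-indexed dataset (in days)."""
    results = []
    for start in range(0, len(data) - train_window - test_window, test_window):
        train = data[start : start + train_window]
        test = data[start + train_window : start + train_window + test_window]

        with mlflow.start_run(nested=True):
            agent = train_agent(train)       # hypothetical trainer
            metrics = evaluate(agent, test)  # hypothetical backtest, returns a dict
            mlflow.log_params({"fold_start": start, "train_window": train_window})
            mlflow.log_metrics(metrics)      # e.g. sharpe, max_drawdown
            results.append(metrics)
    return results
```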







