Trading Agent with A2C/A3C
A3C (Asynchronous Advantage Actor-Critic) and its synchronous variant A2C are parallel actor-critic algorithms introduced by DeepMind (Mnih et al., 2016). Multiple workers explore different parts of the state space simultaneously. For trading this means parallel learning across assets and time periods, and fast wall-clock convergence.
A3C vs A2C: Key Difference
A3C: asynchronous. N worker threads collect experience in parallel and push gradient updates to a shared global network, with no synchronization between threads. Typically CPU-based, since the lock-free updates gain little from a GPU.
A2C: synchronous. N parallel environments → wait for all → single batch update. More deterministic, easier to debug, better GPU utilization.
For most trading tasks A2C is preferable, thanks to its better GPU utilization and reproducibility.
Advantage Function
Key idea: update the policy not on the raw reward, but on the advantage A(s,a) = Q(s,a) - V(s). The advantage measures how much better or worse an action is than the average action in that state.
GAE (Generalized Advantage Estimation):
```python
def compute_gae(rewards, values, next_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout."""
    advantages = []
    gae = 0.0
    for step in reversed(range(len(rewards))):
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t), zeroed at episode ends
        delta = rewards[step] + gamma * next_value * (1 - dones[step]) - values[step]
        # Exponentially weighted sum of future TD residuals
        gae = delta + gamma * lam * (1 - dones[step]) * gae
        advantages.insert(0, gae)
        next_value = values[step]
    return advantages
```
λ=0.95 — balance between bias (λ=0, pure TD) and variance (λ=1, pure MC).
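The two extremes can be checked numerically. The sketch below uses toy reward/value numbers (assumptions, not from the source) and repeats the `compute_gae` definition so it is self-contained: λ=0 reproduces the one-step TD residuals, while λ=1 reproduces the discounted Monte Carlo return minus the value baseline.

```python
import math

def compute_gae(rewards, values, next_value, dones, gamma=0.99, lam=0.95):
    advantages, gae = [], 0.0
    for step in reversed(range(len(rewards))):
        delta = rewards[step] + gamma * next_value * (1 - dones[step]) - values[step]
        gae = delta + gamma * lam * (1 - dones[step]) * gae
        advantages.insert(0, gae)
        next_value = values[step]
    return advantages

rewards, values, dones = [1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [0, 0, 1]

# lam=0: advantage collapses to the one-step TD residual (low variance in the
# estimator's horizon, but biased by the learned V)
td = compute_gae(rewards, values, 0.0, dones, gamma=0.99, lam=0.0)

# lam=1: advantage equals the discounted return-to-go minus the baseline V(s)
mc = compute_gae(rewards, values, 0.0, dones, gamma=0.99, lam=1.0)
returns = [1.0 + 0.99 * (1.0 + 0.99 * 1.0), 1.0 + 0.99 * 1.0, 1.0]
assert all(math.isclose(a, g - v) for a, g, v in zip(mc, returns, values))
```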
Architecture for Trading
```python
import torch.nn as nn

class A2CTradingNet(nn.Module):
    """Shared trunk with separate actor (policy logits) and critic (V(s)) heads."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.actor = nn.Linear(128, action_dim)  # action logits
        self.critic = nn.Linear(128, 1)          # state value V(s)

    def forward(self, x):
        f = self.shared(x)
        logits = self.actor(f)
        value = self.critic(f)
        return logits, value
```
```python
import torch.nn.functional as F
from torch.distributions import Categorical

def a2c_loss(logits, actions, advantages, values, returns, ent_coef=0.01):
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    # Policy gradient term: advantages are treated as constants (detach)
    actor_loss = -(log_probs * advantages.detach()).mean()
    # Value regression towards empirical returns
    critic_loss = F.mse_loss(values.squeeze(-1), returns)
    # Negative entropy: minimizing it encourages exploration
    entropy_loss = -dist.entropy().mean()
    return actor_loss + 0.5 * critic_loss + ent_coef * entropy_loss
```
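Putting the pieces together, one synchronous update step might look like the sketch below. The rollout batch is random placeholder data, and `state_dim=10` / `action_dim=3` are illustrative assumptions, not values from the source; the network and loss definitions are repeated so the block runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class A2CTradingNet(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.actor = nn.Linear(128, action_dim)
        self.critic = nn.Linear(128, 1)

    def forward(self, x):
        f = self.shared(x)
        return self.actor(f), self.critic(f)

def a2c_loss(logits, actions, advantages, values, returns, ent_coef=0.01):
    dist = Categorical(logits=logits)
    actor_loss = -(dist.log_prob(actions) * advantages.detach()).mean()
    critic_loss = F.mse_loss(values.squeeze(-1), returns)
    entropy_loss = -dist.entropy().mean()
    return actor_loss + 0.5 * critic_loss + ent_coef * entropy_loss

torch.manual_seed(0)
net = A2CTradingNet(state_dim=10, action_dim=3)  # e.g. 3 actions: hold/buy/sell
opt = torch.optim.Adam(net.parameters(), lr=7e-4)

# Fake rollout batch: 40 transitions (8 envs x 5 steps in the A2C setup)
states = torch.randn(40, 10)
actions = torch.randint(0, 3, (40,))
returns = torch.randn(40)                   # would come from GAE/bootstrapping

logits, values = net(states)
advantages = returns - values.squeeze(-1)   # simple advantage estimate
loss = a2c_loss(logits, actions, advantages, values, returns)

opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(net.parameters(), 0.5)  # gradient clipping
opt.step()
```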
Parallelism for Trading
A2C/A3C are especially useful when:
Multiple assets: 8 parallel environments, each with different asset (AAPL, MSFT, TSLA, ...). Agent learns on diverse market conditions simultaneously. Common policy generalizes better.
Multiple time periods: Parallel environments with different history periods. Learning on bull/bear/sideways markets simultaneously.
Walk-forward parallelism: Each worker processes its own time window. Accelerated cross-validation.
```python
from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import SubprocVecEnv

def make_env(ticker, start, end):
    # Closure so each subprocess constructs its own environment instance
    return lambda: TradingEnv(ticker, start, end)

if __name__ == "__main__":  # required for SubprocVecEnv on spawn-based platforms
    # 8 parallel environments, one asset each
    envs = SubprocVecEnv([make_env(t, '2015', '2022') for t in tickers[:8]])

    model = A2C(
        "MlpPolicy",
        envs,
        learning_rate=7e-4,
        n_steps=5,          # short rollouts, fast updates
        gamma=0.99,
        gae_lambda=1.0,
        ent_coef=0.01,
        vf_coef=0.25,
        max_grad_norm=0.5,
        verbose=1,
    )
    model.learn(total_timesteps=1_000_000)
```
n_steps=5: A2C classically uses very short rollouts (5–20 steps). This speeds up updates but increases variance.
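These settings imply a small effective batch: each gradient update sees n_envs × n_steps transitions. A quick sanity check of the arithmetic (not SB3 API code):

```python
n_envs, n_steps = 8, 5
batch_size = n_envs * n_steps              # transitions per gradient update
total_timesteps = 1_000_000
n_updates = total_timesteps // batch_size  # gradient updates over the whole run
print(batch_size, n_updates)               # 40 25000
```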
Algorithm Comparison for Trading
| Algorithm | Sample Eff. | Stability | Parallelism | GPU |
|---|---|---|---|---|
| DQN | High | Medium | No | Yes |
| A2C | Medium | High | Excellent | Yes |
| PPO | Medium | High | Good | Yes |
| SAC | High | High | Medium | Yes |
A2C occupies a niche: simpler than SAC, more parallel than PPO. Good for quick experiments with many configurations.
Timeline: 4–8 weeks
A2C baseline with parallel environments — 3 weeks. LSTM actor, multi-asset with correlations, custom reward shaping — 6–8 weeks.