# Training pipeline

This page gives the end-to-end view of how an RL agent calls one PowerZoo env, receives a reward and a cost, and updates its weights. It ties together the moving parts that the Environment stack, Data pipeline and Training pages document individually.
## The full pipeline

```mermaid
flowchart LR
    Cfg["RLConfig\nor task name / dict / YAML"] --> Make["powerzoo.rl.make_env"]
    Make --> Adapter["Task adapter\n(OPF / UC / Resource / EV / DC / DSO)"]
    Adapter --> PE["PowerEnv\n(grid + resources + reward + clock)"]
    PE --> Grid["GridEnv\n(power flow, info)"]
    PE --> Res["ResourceEnv\n(SOC / queue / thermal)"]
    Grid --> Data["DataLoader cache\n(load / renewable / workload)"]
    Adapter --> Wrap["Wrapper stack\n(Forecast · Normalize · SafeRL · Flatten)"]
    Wrap --> Algo["Trainer\n(SB3: SAC / PPO / TD3, IL / simultaneous MARL)"]
    Algo --> Ckpt["save / load checkpoints"]
    Algo --> Eval["evaluate(split='test')\n→ normalized score, cost, IQM"]
```
Read it left-to-right when designing a new experiment, right-to-left when debugging a finished one.
## Single-agent flow (Gymnasium)

```python
from powerzoo.rl import Trainer

t = Trainer("battery_arbitrage", algorithm="SAC", total_timesteps=200_000)
t.train()
results = t.evaluate(split="test")
t.save("./results/")
```
What happens internally:
- `Trainer.__init__` resolves the input (task name / dict / YAML / `RLConfig`) into a single `RLConfig` and calls `cfg.validate()`.
- SB3 is imported lazily; `ALGORITHMS = {SAC, PPO, TD3}` is populated.
- `t.train()` calls `self.get_env()` → `make_env(...)` → task adapter → `PowerEnv` → grid + resources, then wraps the result with the requested wrappers.
- The SB3 model is constructed with `cfg.policy` (`'MlpPolicy'` by default), `cfg.hyperparams` and `seed`, then `model.learn(total_timesteps, progress_bar, callback)` runs.
- `t.evaluate(split='test')` rebuilds the env on the test split and calls `powerzoo.benchmarks.evaluate`, returning mean reward, normalized score, mean episode cost and cost-violation rate.
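Because of that first resolution step, the constructor accepts any of the four input forms interchangeably. A minimal sketch; the dict keys and the YAML path here are illustrative guesses, not the full `RLConfig` schema:

```python
from powerzoo.rl import Trainer

# All of these resolve to one validated RLConfig internally.
t1 = Trainer("battery_arbitrage", algorithm="SAC")               # task name
t2 = Trainer({"task": "battery_arbitrage", "algorithm": "SAC"})  # dict (keys illustrative)
t3 = Trainer("configs/battery_arbitrage.yaml")                   # path to a YAML preset (path illustrative)
```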
## Multi-agent flow (PettingZoo Parallel)
```python
from powerzoo.rl import Trainer

# Independent learning: one SB3 learner per agent, trained sequentially.
t = Trainer("marl_opf", framework="pettingzoo")
t.train_il(total_timesteps=50_000)

# Simultaneous MARL: all agents updated on every env step (SAC only).
t = Trainer("marl_opf", framework="pettingzoo", algorithm="SAC")
t.train_marl_simultaneous(total_timesteps=200_000)
```
- `train_il` runs sequential SB3 `.learn()` per agent (others act with their default policy). Requires homogeneous agent spaces.
- `train_marl_simultaneous` performs one PettingZoo step per env step and updates all agents at once (SAC only). Implementation lives in `powerzoo/rl/marl_simultaneous_sb3.py`.
- For other frameworks (EPyMARL, MAPPO, custom loops), call `t.get_env()` and plug it in directly, as in the sketch below.
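A minimal custom-loop sketch, assuming the env returned by `t.get_env()` follows the standard PettingZoo Parallel API; the random actions stand in for whatever EPyMARL / MAPPO / custom policies you plug in:

```python
from powerzoo.rl import Trainer

t = Trainer("marl_opf", framework="pettingzoo")
env = t.get_env()  # PettingZoo ParallelEnv

obs, infos = env.reset(seed=0)
while env.agents:  # empty once every agent is terminated or truncated
    # Hypothetical placeholder policy: sample uniformly from each agent's space.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    obs, rewards, terminations, truncations, infos = env.step(actions)
```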
## Wrapper stack

`make_env(...)` accepts a small set of keyword arguments that map to wrappers (applied to single-agent envs only; silently ignored for MARL):

| Argument | Effect |
|---|---|
| `reward=...` | `RewardWrapper` replaces the reward (callable or reward-type dict). |
| `forecast_horizon=N` | `ForecastWrapper` appends N future demand values. |
| `normalize=True` | `NormalizationWrapper` rescales obs (and optionally action) to [-1, 1]. |
| `safe_rl=True` | `GymnasiumSafeWrapper` injects `info['cost']` from `info['cost_sum']`. |
| `cost_threshold=...` | Forwarded to `GymnasiumSafeWrapper`. |
| `seed=...` | Calls `env.reset(seed=...)` immediately. |
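For example, a single-agent env combining several of these wrappers might be built as below; a sketch where only the keyword names come from the table above and the values are arbitrary:

```python
from powerzoo.rl import make_env

env = make_env(
    "battery_arbitrage",
    forecast_horizon=4,    # ForecastWrapper: append 4 future demand values
    normalize=True,        # NormalizationWrapper: rescale obs to [-1, 1]
    safe_rl=True,          # GymnasiumSafeWrapper: expose info['cost'] per step
    cost_threshold=10.0,   # forwarded to GymnasiumSafeWrapper
    seed=0,                # env.reset(seed=0) is called immediately
)
```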
The full per-wrapper reference (including stacking order, the `SafeRLWrapper` 6-tuple vs `GymnasiumSafeWrapper` 5-tuple, and `ForecastWrapper`'s perfect/noisy/none modes) is in Training · Wrappers.
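As a quick orientation before that page: with `safe_rl=True` the env keeps the standard Gymnasium 5-tuple, so a CMDP-style rollout only needs to read `info['cost']`. A minimal sketch, with a random policy standing in for a real agent:

```python
from powerzoo.rl import make_env

env = make_env("battery_arbitrage", safe_rl=True)
obs, info = env.reset()
episode_cost = 0.0
while True:
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    episode_cost += info["cost"]  # injected by GymnasiumSafeWrapper
    if terminated or truncated:
        break
```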
## Where each piece is documented

- Environment stack — `BaseEnv`, `GridEnv`, `ResourceEnv`, `PowerEnv` semantics.
- Data pipeline — `DataLoader`, signals, parquet, alignment.
- Python contract — what `step()` returns, the 5 observation modes, `framework='auto' / 'pettingzoo' / 'rllib'`.
- Reward and cost split — reward vs CMDP cost, the `cost_*` prefix rule.
- Training · Trainers — `Trainer.train`, `train_il`, `train_marl_simultaneous`, `evaluate`, `save`, `load`.
- Training · Wrappers — every wrapper signature.
- Training · Presets — ready-to-use YAML configs.
- Training · Custom loops — bypass `Trainer` and write your own loop.
## Performance notes

- The hot path (env + wrappers) is pure CPU. PowerZoo does not vectorise envs by itself — use SB3's `make_vec_env` or RLlib's worker pool for batched rollouts (see the sketch after this list).
- The data pipeline runs once at construction and once per reset; nothing in the inner loop touches parquet or disk.
- The OPF LP backend (`solver_type ∈ {auto, gurobi, scipy, cvxpy}`) dominates step time on large cases. Prefer `gurobi` for `Case118` and above when available; `scipy` (HiGHS) is the next best free option.
- For GPU-vectorised rollouts and `lax.scan`-based pipelines, use the sibling PowerZooJax project, which reimplements the same five benchmark suites in pure JAX.
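For example, batching rollouts with SB3's `make_vec_env` only needs a factory around `make_env`; a sketch assuming the single-agent task from above:

```python
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env

from powerzoo.rl import make_env

# Four in-process workers; each builds its env (and DataLoader cache) once at construction.
vec_env = make_vec_env(lambda: make_env("battery_arbitrage", normalize=True), n_envs=4)
model = SAC("MlpPolicy", vec_env)
model.learn(total_timesteps=200_000)
```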