Getting Started

This page introduces the five benchmark suites that define PowerZoo's research scope, then walks you from a clean Python environment to a trained RL agent in four short steps.

1. Install

PowerZoo requires Python 3.11+ and uses uv for dependency management.

conda create -n powerzoo python=3.11 -y
conda activate powerzoo

pip install uv
git clone https://github.com/powerzoojax/PowerZooPy.git
cd PowerZooPy

uv sync --python "$(which python)"             # core install

uv sync --python "$(which python)" --extra rl  # optional: adds the RL extras
pip install h5py                               # optional: offline dataset recording

The rl extra brings in Stable-Baselines3, PettingZoo, RLlib, PyTorch, and Gymnasium. h5py is only needed if you plan to record offline datasets.

Verify:

uv run python -c "import powerzoo; print(powerzoo.__version__)"

2. The Five Benchmark Suites

PowerZoo organises its public benchmark set into five agent-centric task suites. Each suite targets a different RL research question and uses a different underlying environment, agent structure, action space and constraint regime; any two suites therefore differ on at least four of these dimensions. The five suites are the recommended starting point for publication-quality experiments.

flowchart LR
    GC["GenCos\nMarket bidding\n(competitive MARL)"]
    TSO["TSO\nSecurity dispatch\n(safe RL + mixed action)"]
    DSO["DSO\nDistribution operations\n(non-stationary RL)"]
    DERs["DERs\nVoltage / DER coordination\n(scalable safe MARL)"]
    DC["DC microgrid\nData-center operation\n(multi-objective robust RL)"]
    GC --- TSO --- DSO --- DERs --- DC
| Suite | Underlying env | RL question | Public PowerZoo task(s) |
|---|---|---|---|
| GenCos — Market bidding | Transmission Case5 + BidBasedMarketEnv | Competitive MARL with private info and ramp-coupled offers | gencos_bidding — see Benchmarks · GenCos |
| TSO — Security dispatch | Transmission Case5 / Case118 + DC/AC OPF | Safe RL with mixed discrete-continuous actions | marl_uc (UC, Case5), opf_118 / opf_118_7d (large-scale ED, Case118) — see Benchmarks · TSO |
| DSO — Distribution operations | Distribution Case33bw + 6× FlexLoad + Ausgrid traces | Non-stationary single-agent RL with operational quality reward | make_dso_env(...) factory — see Benchmarks · DSO |
| DERs — Voltage / DER coordination | Distribution Case33bw / Case118zh + heterogeneous DERs | Scalable safe MARL on Dec-POMDP with hard voltage limits | marl_der_arbitrage (Case33bw, 3 batteries), marl_ders_benchmark (Case118zh, 12 heterogeneous DERs) — see Benchmarks · DERs |
| DC microgrid — Data center | Self-contained DC microgrid (DCMicrogridEnv, no external grid) | Multi-objective robust RL with workload + thermal + carbon trade-off | dc_microgrid, dc_microgrid_safe (CMDP variant) — see Benchmarks · DC microgrid |

How to read this table. A suite groups tasks that share the same RL research question. The right-hand column lists concrete env names you can pass to make_task_env(...) (or make_dso_env(...) for DSO); the next sections use exactly these names. Per-suite env design, observation / action / reward / cost contracts and OOD splits live in Benchmarks; the underlying physics in Physics; the env API contract that every benchmark obeys in Concepts · Python contract.

Smaller starter envs sit outside the five main suites but are still public benchmark tasks: battery_arbitrage (single-battery arbitrage), marl_opf (5-bus MARL ED), marl_ev_v2g (EV fleet V2G), dc_scheduling (data-center scheduling under a distribution grid). They are the recommended first targets for unit testing and quick iteration; see Examples for runnable cards.

3. First Task — Run a Benchmark Episode

make_task_env is PowerZoo's preferred entry point: it builds a benchmark task with a fixed train/val/test split and the right multi-agent or single-agent interface.

from powerzoo.tasks import make_task_env, list_public_tasks

print(list_public_tasks())

Run one episode of the multi-agent OPF task on IEEE 5-bus, using the PettingZoo Parallel API:

env = make_task_env("marl_opf", split="train", framework="pettingzoo")
obs, info = env.reset(seed=42)

while env.agents:
    actions = {a: env.action_space(a).sample() for a in env.agents}
    obs, rewards, terminations, truncations, info = env.step(actions)

print("episode done")

Single-agent tasks (such as battery_arbitrage, dc_scheduling, dc_microgrid) return a standard Gymnasium env — use the usual five-tuple loop:

env = make_task_env("battery_arbitrage", split="train")
obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())

The framework='pettingzoo' path clears env.agents when the episode ends, which is convenient for the while env.agents idiom. The default framework='auto' returns the underlying RLlib-style adapter, where you instead check terminated.get("__all__") and truncated.get("__all__"). Both paths share identical reward, cost and observation semantics — see Concepts · Python contract.
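For the default framework='auto' path described above, the episode-end check is a one-liner; the helper below is an illustrative sketch (not part of the PowerZoo API), and the commented loop mirrors the PettingZoo one, assuming the adapter follows the same step signature:

```python
def episode_done(terminations: dict, truncations: dict) -> bool:
    """True once the RLlib-style adapter flags the whole episode as over."""
    return terminations.get("__all__", False) or truncations.get("__all__", False)

# Sketch of the framework='auto' loop (same reward/cost/obs semantics):
# env = make_task_env("marl_opf", split="train")  # framework defaults to 'auto'
# obs, info = env.reset(seed=42)
# done = False
# while not done:
#     actions = {a: env.action_space(a).sample() for a in obs}
#     obs, rewards, terminations, truncations, info = env.step(actions)
#     done = episode_done(terminations, truncations)
```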

4. First Evaluation — Get a Normalized Score

Once you have any policy, evaluate produces reproducible benchmark numbers — mean episode return, normalized score (where 0 = random baseline, 1 = oracle baseline) and CMDP cost statistics if cost signals are present.

from powerzoo.wrappers import GymnasiumWrapper
from powerzoo.benchmarks.policies import RandomPolicy
from powerzoo.benchmarks import evaluate

gym_env = GymnasiumWrapper(make_task_env("marl_opf", split="test"))
result = evaluate(
    RandomPolicy(gym_env.action_space),
    gym_env,
    n_episodes=10,
    task_id="marl_opf",
)

print(f"mean reward       : {result['mean_reward']:.2f}")
print(f"normalized score  : {result['normalized_score']}")
print(f"mean episode cost : {result['mean_episode_cost']:.4f}")
print(f"cost violation %  : {result['cost_violation_rate']}")

The normalized score lets you compare across tasks without worrying about the raw reward scale of each problem. See the Examples Overview for the formula.
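The usual form of such a normalization (check the Examples Overview for PowerZoo's exact definition) linearly rescales the mean return so that the random baseline maps to 0 and the oracle baseline to 1:

```python
def normalized_score(mean_reward: float,
                     random_reward: float,
                     oracle_reward: float) -> float:
    """Linear rescaling: 0 at the random baseline, 1 at the oracle baseline."""
    return (mean_reward - random_reward) / (oracle_reward - random_reward)

# A policy halfway between the two baselines scores 0.5:
normalized_score(-50.0, random_reward=-100.0, oracle_reward=0.0)  # -> 0.5
```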

5. First Training — One-Liner with powerzoo.rl

powerzoo.rl is the unified RL entry point. make_env produces a ready-to-train env (with optional normalization, forecast window, safe-RL wrapping); Trainer wraps Stable-Baselines3 with task-aware defaults.

from powerzoo.rl import make_env, Trainer

env = make_env("battery_arbitrage", split="train", normalize=True, seed=0)
print(env.observation_space, env.action_space)

trainer = Trainer("battery_arbitrage", algorithm="SAC", total_timesteps=100_000)
trainer.train()
results = trainer.evaluate(split="test")
print(results)

For full training options — including YAML configs, MARL training and reward overrides — see Training · Trainers and Training · Presets.

Next Steps

| Topic | Link |
|---|---|
| The three pillars and the Python API contract | Concepts · Overview, Python contract |
| Why power grids are physically distinct from typical RL benchmarks | Concepts · Power systems primer |
| Layered architecture (envs / resources / tasks / wrappers) | Architecture · Environment stack, Repository map |
| Underlying physics (transmission / distribution / resources / markets / microgrid) | Physics |
| Per-suite benchmark cards (TSO, DSO, DERs, DC microgrid, GenCos) | Benchmarks |
| Full RL training reference (wrappers, trainers, YAML presets, custom loops) | Training |
| Low-level grid + resource API (no task wrapping) | Examples 01–03 |

Glossary

A short reference for terms used throughout the docs. Each definition is one sentence; deeper treatment lives in Concepts · Power systems primer for physics and in Concepts · Reward and cost split plus Python contract for the env API.

| Term | One-sentence meaning |
|---|---|
| PF (Power Flow) | Solve voltages and line flows given fixed injections — the grid's physics step. |
| OPF (Optimal Power Flow) | Solve PF and dispatch generators to minimise cost subject to limits. |
| DCPF / DCOPF | Linearised PF / OPF (active power only, voltage assumed 1 pu) — fast, convex. |
| ACPF / ACOPF | Full nonlinear PF / OPF including reactive power and voltage magnitude. |
| BFS (Backward-Forward Sweep) | Iterative PF solver for radial distribution feeders. |
| PTDF (Power Transfer Distribution Factor) | Sensitivity matrix (line_flow ≈ PTDF · injection) used by DCPF. |
| LMP (Locational Marginal Price) | Dual variable of the nodal power balance — the marginal cost of 1 extra MW at a bus. |
| UC (Unit Commitment) | OPF + binary on/off decisions + min up/down-time and ramp constraints. |
| SCED / SCUC | Security-Constrained ED / UC: OPF / UC with line and N-1 constraints. |
| SOC (State of Charge) | Battery fill fraction, 0–1; an integrator state coupling adjacent steps. |
| G2V / V2G | Grid-to-Vehicle (charge) / Vehicle-to-Grid (discharge back to the grid). |
| DER (Distributed Energy Resource) | Small generator, battery or controllable load on a distribution feeder. |
| DSO / TSO | Distribution / Transmission System Operator — owns the feeder / backbone. |
| DR (Demand Response) | Curtailing or shifting load in response to grid signals or prices. |
| PUE (Power Usage Effectiveness) | Data-center metric: total facility power / IT equipment power (lower is better). |
| COP (Coefficient of Performance) | Cooling efficiency: heat removed per unit electrical input. |
| CMDP (Constrained MDP) | Maximise reward subject to expected cost ≤ budget. |
| MARL (Multi-Agent RL) | Multiple policies acting on a shared environment. |
| MDP / Dec-POMDP | Markov Decision Process / its decentralised, partially observable variant. |
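
As a worked example of the PTDF relation from the glossary (the matrix values below are made up for illustration and not taken from any PowerZoo case):

```python
import numpy as np

# 2 lines x 3 buses sensitivity matrix (illustrative values only)
PTDF = np.array([[0.50, -0.50,  0.00],
                 [0.25,  0.25, -0.50]])

# Net injections in MW (generation positive, load negative; sums to zero)
injection = np.array([100.0, -40.0, -60.0])

# DCPF approximation: line_flow ≈ PTDF · injection
line_flow = PTDF @ injection
print(line_flow)  # [70. 45.]
```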