Power Systems Primer for ML Researchers
This page introduces the essential power-system concepts that underpin PowerZoo's environments. No electrical-engineering background is assumed — every concept is motivated by its impact on the RL problem.
The intended reading order is: first build intuition for the network and power flow, then look at dispatch problems, then move to stateful devices such as batteries and EVs, and finally return to safety constraints and real time-series drivers.
The map below previews how each physical object in the next sections maps to a familiar RL concept. Use it as a quick reference while reading.
```mermaid
flowchart LR
    subgraph Phys ["Physical object"]
        P1["Bus / line / PF solve"]
        P2["Generator + cost curve"]
        P3["Battery / EV SOC"]
        P4["Voltage / thermal limits"]
        P5["Real load · solar · wind traces"]
    end
    subgraph RL ["RL concept"]
        R1["Coupled state transition<br/>(physics-mediated agent coupling)"]
        R2["Per-agent action + shared reward"]
        R3["Hidden integrator state<br/>(long-horizon credit assignment)"]
        R4["CMDP cost channel<br/>(safety constraint)"]
        R5["Non-stationary exogenous process<br/>(distribution shift)"]
    end
    P1 --> R1
    P2 --> R2
    P3 --> R3
    P4 --> R4
    P5 --> R5
```
If you are already comfortable with the physics, jump straight to Python contract for the env API and to Reward and cost split for the CMDP framing. The deeper physics derivations live under Physics.
1. Network and Physics
Start by building intuition for what the grid is and what the environment computes at each step. This section moves from static structure to the physical solve that ties all injections together.
1.1 The Power Grid as a Graph
A power grid can be represented as a graph where:
- Buses (nodes, labelled B1–B4 in the diagram) are connection points — every resource injects or withdraws power at a bus.
- Lines (edges, labelled L1–L4) carry power between buses. Each line has a thermal rating (maximum MW) and an impedance (resistance + reactance).
- Generators (labelled G, drawn as a circle with a sine-wave symbol) inject power into a bus — the arrow points toward the bus bar.
- Loads / demand (labelled D) withdraw power from a bus — the arrow points away from the bus bar.
The diagram shows a 4-bus ring: two generators (G1 at B1, G2 at B4), four demand points (D1–D4, one per bus), and four lines (L1–L4) connecting adjacent buses. Every bus has at least one resource attached; every line is a physical path through which power can flow.
Why this matters for RL: No bus operates independently. Increasing generator G1's output raises flows on both L1 (B1 → B2) and L2 (B1 → B3) at the same time, changing conditions at B2, B3, and B4 — without any direct signal passing between those buses. This physics-mediated coupling is what makes multi-agent grid control fundamentally different from typical MARL benchmarks like StarCraft or traffic control, where agents interact only through explicit game mechanics.
1.2 Power Flow: The Physics Engine
Power flow (load flow) is the grid's physics step. Given all injections (generation, load, storage), it computes:
- Voltage at every bus (magnitude and angle)
- Current / power flow on every line
- Losses (real power dissipated in line impedance)
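DC Power Flow (Linear Approximation)
Assumes all voltage magnitudes are 1.0 pu and ignores reactive power and losses, so line flows are linear in the bus injections:
\[ \mathbf{F} = \text{PTDF} \cdot \mathbf{P}_{\text{inj}} \]
where \(\mathbf{F}\) is the vector of line flows, \(\mathbf{P}_{\text{inj}}\) is the vector of net bus injections, and PTDF (Power Transfer Distribution Factor) is a constant matrix derived from line impedances. This is a single matrix-vector multiply: fast and differentiable, but only approximate.
RL implication: Under DC power flow, every line limit is a linear inequality in the injections, so the feasible action set is a polytope and constraint satisfaction can be verified analytically.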
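As a minimal sketch of the DC model, the pure-Python snippet below solves for bus angles and line flows on a 4-bus ring like the one in Section 1.1. The endpoints assumed for L3 and L4, the equal reactances of 0.1 pu, and the helper names `solve_linear` and `dc_line_flows` are illustrative assumptions, not part of PowerZoo.

```python
def solve_linear(a, rhs):
    """Solve a @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def dc_line_flows(n_bus, lines, injections, slack=0):
    """Return per-line flows (pu) for net bus injections under the DC model.

    lines: list of (from_bus, to_bus, reactance_pu).
    The slack bus absorbs any imbalance; its entry in `injections` is ignored.
    """
    b = [[0.0] * n_bus for _ in range(n_bus)]  # bus susceptance matrix
    for i, j, x in lines:
        y = 1.0 / x
        b[i][i] += y
        b[j][j] += y
        b[i][j] -= y
        b[j][i] -= y
    keep = [k for k in range(n_bus) if k != slack]
    b_red = [[b[r][c] for c in keep] for r in keep]
    theta_red = solve_linear(b_red, [injections[k] for k in keep])
    theta = [0.0] * n_bus  # slack angle is the reference (0 rad)
    for k, t in zip(keep, theta_red):
        theta[k] = t
    return [(theta[i] - theta[j]) / x for i, j, x in lines]


# Assumed ring: L1 = B1-B2, L2 = B1-B3, L3 = B2-B4, L4 = B3-B4 (0-indexed buses).
RING = [(0, 1, 0.1), (0, 2, 0.1), (1, 3, 0.1), (2, 3, 0.1)]

# A 1 pu load at B2, supplied by the slack generator at B1.
flows = dc_line_flows(4, RING, [0.0, -1.0, 0.0, 0.0])
# flows is approximately [0.75, 0.25, -0.25, 0.25]
```

A single load at B2 moves power on all four lines, which is the physics-mediated coupling described in Section 1.1. Because the model is linear, doubling the load doubles every flow.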
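AC Power Flow (Nonlinear, Full Physics)
Solves the nonlinear power-balance equations at each bus:
\[ P_i = V_i \sum_j V_j \left( G_{ij} \cos\theta_{ij} + B_{ij} \sin\theta_{ij} \right), \qquad Q_i = V_i \sum_j V_j \left( G_{ij} \sin\theta_{ij} - B_{ij} \cos\theta_{ij} \right) \]
where \(P_i\) and \(Q_i\) are the net real and reactive power injections at bus \(i\), \(V_i\) is voltage magnitude, \(\theta_{ij} = \theta_i - \theta_j\) is the voltage angle difference, and \(G_{ij}\), \(B_{ij}\) are conductance and susceptance from the admittance matrix.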
RL implication: AC power flow makes the transition function nonlinear and non-convex. Small action changes can cause large voltage swings. The solver may fail to converge (infeasible dispatch) — the environment must handle this gracefully.
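To make the AC equations concrete, here is a small sketch that evaluates the standard polar-form bus injections for a two-bus system. The line impedance (0.01 + 0.1j pu), the operating points, and the function name `bus_injections` are assumptions chosen for illustration. A full solver would iterate (for example with Newton-Raphson) until these injections match the scheduled ones.

```python
import math


def bus_injections(v, theta, g, b):
    """Evaluate net real and reactive injections (P_i, Q_i) at every bus."""
    n = len(v)
    p, q = [], []
    for i in range(n):
        p_i = q_i = 0.0
        for j in range(n):
            d = theta[i] - theta[j]
            p_i += v[i] * v[j] * (g[i][j] * math.cos(d) + b[i][j] * math.sin(d))
            q_i += v[i] * v[j] * (g[i][j] * math.sin(d) - b[i][j] * math.cos(d))
        p.append(p_i)
        q.append(q_i)
    return p, q


# Two buses joined by one line; admittance matrix Y = G + jB (no shunts).
series_y = 1 / complex(0.01, 0.1)
g = [[series_y.real, -series_y.real], [-series_y.real, series_y.real]]
b = [[series_y.imag, -series_y.imag], [-series_y.imag, series_y.imag]]

# Flat start: equal voltages and angles, so nothing flows.
p_flat, q_flat = bus_injections([1.0, 1.0], [0.0, 0.0], g, b)

# Bus 2 lags by 0.1 rad: bus 1 exports, bus 2 imports, the line dissipates the rest.
p, q = bus_injections([1.0, 1.0], [0.0, -0.1], g, b)
losses = p[0] + p[1]
```

Unlike the DC model, the injections no longer sum to zero: the difference is the real power lost in the line resistance.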
Distribution vs Transmission
| Property | Transmission (HV) | Distribution (MV/LV) |
|---|---|---|
| Topology | Meshed (loops) | Radial (tree) |
| Voltage level | 110–765 kV | 0.4–33 kV |
| Solver | PTDF / Newton–Raphson | Forward-backward sweep (BFS) |
| Key constraint | Line thermal limits | Voltage magnitude limits |
| R/X ratio | Low (≈0.1) — reactance-dominant | High (≈1.0) — resistance and reactance comparable |
| DER penetration | Low (large generators) | High (solar, batteries, EVs) |
RL implication: Transmission tasks emphasize global coordination among large generators. Distribution tasks emphasize local voltage regulation with many small resources. PowerZoo provides both.
2. Dispatch Problems: How the System Chooses Output
Once the network and physics are clear, the benchmark's main decision problems become easier to place: single-step OPF and inter-temporal UC.
2.1 Optimal Power Flow (OPF)
OPF answers: given demand, what is the cheapest feasible generation dispatch?
\[ \min_{P_g} \sum_i C_i(P_{g,i}) \quad \text{s.t. power balance, generation limits, and network limits} \]
where \(C_i(P_{g,i}) = mc\_a_i P_{g,i}^2 + mc\_b_i P_{g,i} + mc\_c_i\) is the quadratic total-cost curve of generator \(i\). In PowerZoo, mc_a, mc_b, mc_c are the coefficients of this polynomial total-cost function. Cases with mc_a = mc_b = 0 (e.g. Case5) are read differently: there mc_c is the flat marginal cost in $/MWh, so the cost is \(C(P) = mc\_c \cdot P\) rather than the constant that the polynomial alone would give.
The OPF solution is the oracle baseline in PowerZoo's marl_opf and opf_118 tasks. A perfect RL policy would replicate OPF dispatch without access to the solver.
Why not just run OPF? In practice:
- OPF requires full system knowledge (all costs, all limits), whereas RL agents may have only partial observability.
- OPF is a static single-step optimization: it does not account for inter-temporal constraints (battery SOC, ramping, startup costs).
- Large-scale AC-OPF is NP-hard; RL offers a real-time approximation path.
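For intuition about what the oracle computes, the sketch below solves the simplest special case: flat marginal costs and no network limits, where the cheapest dispatch is a merit-order stack. The generator data and the function name `merit_order_dispatch` are hypothetical. This is not PowerZoo's OPF solver, which also has to respect the power-flow constraints.

```python
def merit_order_dispatch(generators, demand_mw):
    """Dispatch cheapest units first.

    generators: list of (name, p_max_mw, marginal_cost_per_mwh).
    Returns (dispatch_by_name, total_cost, marginal_price).
    """
    remaining = demand_mw
    dispatch, total_cost, price = {}, 0.0, 0.0
    for name, p_max, mc in sorted(generators, key=lambda gen: gen[2]):
        p = min(p_max, remaining)
        dispatch[name] = p
        total_cost += mc * p
        if p > 0:
            price = mc  # the last unit used sets the marginal price
        remaining -= p
    if remaining > 1e-9:
        raise ValueError("demand exceeds total generation capacity")
    return dispatch, total_cost, price


# Hypothetical fleet: a cheap unit and an expensive unit, 100 MW each.
fleet = [("G1", 100.0, 20.0), ("G2", 100.0, 50.0)]
dispatch, cost, price = merit_order_dispatch(fleet, 150.0)
# dispatch == {"G1": 100.0, "G2": 50.0}, cost == 4500.0, price == 50.0
```

Once line limits or quadratic costs enter, the greedy stack is no longer optimal in general, which is why the full problem needs an optimization solver.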
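Locational Marginal Price (LMP)
LMP is the dual variable (shadow price) of the power balance constraint at each bus. It answers: how much would total system cost increase if demand at bus \(i\) increased by 1 MW?
\[ \text{LMP}_i = \lambda - \sum_k \mu_k \, \text{PTDF}_{k,i} \]
where \(\lambda\) is the system energy price, \(\mu_k\) are congestion multipliers for binding line constraints (signed by the direction in which line \(k\) is binding), and \(\text{PTDF}_{k,i}\) is the sensitivity of the flow on line \(k\) to an injection at bus \(i\). When no line limit binds, every \(\mu_k = 0\) and all buses share the price \(\lambda\).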
RL implication: LMPs are the standard price signal for storage and DER agents. In marl_der_arbitrage, CostBasedMarketEnv and BidBasedMarketEnv, agents observe LMP-derived signals and must learn to buy low / sell high — a temporal credit assignment problem.
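The "1 MW more demand" definition of LMP can be checked numerically with a finite difference. The sketch below uses a hypothetical two-bus system: a cheap generator at bus A, an expensive one at bus B, and a single line with an 80 MW limit. All numbers and function names are assumptions for illustration, and both generators are assumed to have ample capacity.

```python
def two_bus_cost(load_a, load_b, line_limit=80.0, mc_a=20.0, mc_b=50.0):
    """Least-cost dispatch cost when the cheap unit sits at A (mc_a < mc_b)."""
    imported = min(load_b, line_limit)  # bus B imports as much cheap power as fits
    gen_a = load_a + imported
    gen_b = load_b - imported           # local expensive unit covers the rest
    return mc_a * gen_a + mc_b * gen_b


def lmp(load_a, load_b, bus, eps=1.0):
    """Cost increase per MW of extra demand at `bus` ("A" or "B")."""
    base = two_bus_cost(load_a, load_b)
    if bus == "A":
        bumped = two_bus_cost(load_a + eps, load_b)
    else:
        bumped = two_bus_cost(load_a, load_b + eps)
    return (bumped - base) / eps


# Uncongested (60 MW needed at B, limit 80 MW): one price everywhere.
uncongested = (lmp(50.0, 60.0, "A"), lmp(50.0, 60.0, "B"))   # (20.0, 20.0)

# Congested (120 MW needed at B): prices separate across the binding line.
congested = (lmp(50.0, 120.0, "A"), lmp(50.0, 120.0, "B"))   # (20.0, 50.0)
```

The 30 $/MWh gap in the congested case is the congestion component of the price at bus B.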
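2.2 Unit Commitment (UC)
UC extends OPF with binary on/off decisions \(u_{i,t}\) and inter-temporal constraints. It minimizes generation cost plus startup and shutdown costs, summed over all units \(i\) and time steps \(t\), subject to: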
- Power balance at each time step
- Generation limits: \(P_{\min} \cdot u_{i,t} \leq P_{g,i,t} \leq P_{\max} \cdot u_{i,t}\)
- Minimum up/down time: once started, a unit must stay on for \(T_{\text{up}}\) steps; once shut down, it must stay off for \(T_{\text{dn}}\) steps
- Ramp rate: \(|P_{g,i,t} - P_{g,i,t-1}| \leq R_i\)
- Startup / shutdown indicators: \(z_{i,t} \geq u_{i,t} - u_{i,t-1}\), \(w_{i,t} \geq u_{i,t-1} - u_{i,t}\)
RL implication: The marl_uc task requires mixed discrete-continuous actions — each agent outputs [score, on_off]. Minimum up/down time creates temporal coupling across steps. A greedy policy that ignores future demand may incur large startup costs. This makes marl_uc a useful testbed for algorithms that handle hybrid action spaces and long-horizon planning.
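The inter-temporal constraints above are easy to check on a candidate on/off schedule. The sketch below derives the startup and shutdown indicators from \(u\) and tests the minimum up/down times. The function names, and the choice to skip the first and last runs because the horizon truncates them, are assumptions for illustration rather than PowerZoo's rules.

```python
def startup_shutdown_flags(u, u_initial=0):
    """Return (z, w): z[t] = 1 on a startup, w[t] = 1 on a shutdown."""
    z, w = [], []
    prev = u_initial
    for cur in u:
        z.append(max(cur - prev, 0))
        w.append(max(prev - cur, 0))
        prev = cur
    return z, w


def min_up_down_ok(u, t_up, t_dn):
    """Check that every complete on-run lasts >= t_up and off-run >= t_dn."""
    runs = []  # list of [state, length]
    for cur in u:
        if runs and runs[-1][0] == cur:
            runs[-1][1] += 1
        else:
            runs.append([cur, 1])
    # The first and last runs are cut off by the horizon, so only interior runs count.
    for state, length in runs[1:-1]:
        if length < (t_up if state == 1 else t_dn):
            return False
    return True


schedule = [0, 1, 1, 1, 0, 0, 1]
z, w = startup_shutdown_flags(schedule)
# z == [0, 1, 0, 0, 0, 0, 1], w == [0, 0, 0, 0, 1, 0, 0]
feasible = min_up_down_ok(schedule, t_up=3, t_dn=2)      # True
infeasible = min_up_down_ok(schedule, t_up=4, t_dn=2)    # False
```

`sum(z)` counts startups, so a schedule that toggles often pays the startup cost repeatedly. That is the penalty a greedy policy tends to miss.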
3. Flexible Devices and Inter-Temporal State
This layer focuses on resources that carry state across time. Their actions are not just about current feasibility; they reshape what is possible later in the episode.
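3.1 Battery Storage and SOC Dynamics
Battery state-of-charge (SOC) evolves by integrating the charge/discharge power \(P_t\) over each time step, scaled by conversion efficiency and energy capacity, subject to \(\text{SOC}_{\min} \leq \text{SOC}_{t+1} \leq \text{SOC}_{\max}\) and \(P_{\min} \leq P_t \leq P_{\max}\).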
The sketch below keeps the encoding compact: bars are power actions, and the three lines are SOC, SOC_min, and SOC_max. The point is not exact units, but the structure of “charge first, discharge later” and how current actions reshape later feasibility.
```mermaid
---
config:
  xyChart:
    width: 480
    height: 220
    showTitle: false
    xAxis:
      titleFontSize: 12
      labelFontSize: 11
    yAxis:
      titleFontSize: 12
      labelFontSize: 11
---
xychart
    x-axis [t0, t1, t2, t3, t4, t5]
    y-axis "" -1 --> 1
    bar [-0.60, -0.45, 0.00, 0.35, 0.75, 0.00]
    line [0.32, 0.50, 0.64, 0.64, 0.36, 0.36]
    line [0.15, 0.15, 0.15, 0.15, 0.15, 0.15]
    line [0.90, 0.90, 0.90, 0.90, 0.90, 0.90]
```
RL implication: SOC is a hidden integrator state — current actions constrain future feasibility. An agent that fully discharges at time \(t\) cannot respond to a price spike at \(t+1\). This requires long-horizon credit assignment, similar to inventory management but with efficiency losses, power limits, and grid coupling.
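One common way to write the SOC update is sketched below. The sign convention (negative power charges, positive power discharges, as the bars in the chart suggest), the separate charge/discharge efficiencies, the half-hour step, and the function name `soc_step` are all assumptions; PowerZoo's battery model may differ in each of these.

```python
def soc_step(soc, power_mw, dt_h=0.5, capacity_mwh=1.0,
             eta_c=0.95, eta_d=0.95, soc_min=0.15, soc_max=0.90):
    """Advance SOC by one step; return (new_soc, was_clipped).

    Assumed convention: power_mw < 0 charges, power_mw > 0 discharges.
    """
    if power_mw < 0:
        # Charging: only a fraction eta_c of the drawn energy is stored.
        delta = -power_mw * eta_c * dt_h / capacity_mwh
    else:
        # Discharging: delivering power_mw drains power_mw / eta_d from the cell.
        delta = -(power_mw / eta_d) * dt_h / capacity_mwh
    unclipped = soc + delta
    new_soc = min(max(unclipped, soc_min), soc_max)
    return new_soc, new_soc != unclipped


charged, _ = soc_step(0.50, -0.40)          # about 0.69
discharged, _ = soc_step(0.50, 0.38)        # about 0.30
empty, clipped = soc_step(0.15, 0.50)       # stays at 0.15, clipped is True
```

The last call shows the integrator effect: a battery already at its lower bound cannot respond to the next price spike, whatever the policy outputs.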
3.2 Electric Vehicles (G2V / V2G)
EVs add scheduling constraints on top of battery dynamics:
- Availability: The EV can only charge/discharge when parked at home. During commute hours, the agent must output zero.
- Departure SOC: The EV must reach \(\text{SOC} \geq \text{SOC}_{\text{departure}}\) before leaving. Missing this deadline is a hard constraint violation.
- Stochastic schedule: Departure/arrival times may vary across episodes.
This diagram separates the two EV-specific constraints: availability windows determine when action is allowed, while the departure deadline and departure SOC requirement determine when the battery must already be ready.
```mermaid
gantt
    title EV timeline: availability constraint and departure deadline
    dateFormat HH:mm
    axisFormat %H:%M
    section Available windows
    Home / available :home1, 08:00, 2h
    Home / available :home2, 11:00, 2h
    section Unavailable
    Commute / unavailable :away1, 10:00, 1h
    section Constraints
    Departure SOC req. :milestone, soc1, 09:50, 10m
    EV departure deadline :vert, dep1, 10:00, 1m
```
RL implication: The marl_ev_v2g task combines temporal credit assignment (charge now for departure later), hard deadline constraints (departure SOC), and availability masking (zero-action periods). Agents cannot simply learn a static charge profile — they must adapt to varying schedules within each episode.
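The two EV-specific constraints can be sketched as a mask plus a deadline check. Everything here is hypothetical: the function names, the per-step SOC gain, and the idea of flagging the last moment at which charging must start are illustrations, not the marl_ev_v2g implementation.

```python
import math


def masked_action(raw_power, available):
    """Availability mask: force zero power while the EV is away."""
    return raw_power if available else 0.0


def min_charge_steps(soc, soc_departure, max_soc_gain_per_step):
    """Fewest full-power charging steps needed to reach the departure SOC."""
    shortfall = max(soc_departure - soc, 0.0)
    # The small tolerance keeps float round-off from adding a spurious step.
    return math.ceil(shortfall / max_soc_gain_per_step - 1e-9)


def must_charge_now(soc, soc_departure, max_soc_gain_per_step, steps_left):
    """True once any further delay would make the departure SOC unreachable."""
    return min_charge_steps(soc, soc_departure, max_soc_gain_per_step) >= steps_left


needed = min_charge_steps(0.5, 0.8, 0.125)          # 3 steps
urgent = must_charge_now(0.5, 0.8, 0.125, 3)        # True: no slack left
relaxed = must_charge_now(0.5, 0.8, 0.125, 5)       # False: two spare steps
away = masked_action(-0.4, available=False)         # 0.0
```

A policy can use the spare steps for arbitrage (or V2G discharge), but the deadline check bounds how long it may wait.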
4. Safety Constraints and Exogenous Drivers
The network, dispatch, and devices above all operate under two higher-level conditions: hard safety boundaries and real time-series inputs that make the environment non-stationary.
4.1 Voltage and Thermal Limits: Safety Constraints
Power systems enforce hard physical constraints:
| Constraint | Physical meaning | Consequence of violation |
|---|---|---|
| Voltage (\(V_{\min} \leq V_i \leq V_{\max}\)) | Bus voltage must stay within ±5% of nominal | Equipment damage, cascading outage |
| Thermal (\(\lvert S_k \rvert \leq S_k^{\max}\)) | Line flow must stay within its thermal rating | Conductor overheating, line trip |
| SOC bounds | Battery cannot exceed physical capacity | Cell degradation, safety hazard |
| Generation limits (\(P_{\min} \leq P_g \leq P_{\max}\)) | Generator output within nameplate range | Mechanical stress, turbine damage |
In PowerZoo these violations are not penalised through reward. They flow into separate CMDP cost channels (constraint_costs plus named cost_* components), with scalar info['cost'] reserved for compatibility wrappers. The full rules — which cost components each task uses, how resources expose them, how wrappers convert them — live in Reward and cost split.
RL implication: Standard reward shaping (adding a penalty term to reward) conflates economic objectives with safety requirements. CMDP separation lets researchers experiment with Lagrangian methods, constrained policy optimisation (CPO) or primal-dual approaches. The SafeRLWrapper exposes the cost signal in OmniSafe-compatible format.
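As an illustration of why a separate cost channel is useful, the sketch below computes a voltage-violation cost and applies one step of a Lagrangian (primal-dual) multiplier update. The function names, the ±5% band defaults, and the learning rate are assumptions; none of this is the SafeRLWrapper or OmniSafe API.

```python
def voltage_violation_cost(voltages_pu, v_min=0.95, v_max=1.05):
    """Total per-unit distance by which bus voltages leave the allowed band."""
    return sum(max(v_min - v, 0.0) + max(v - v_max, 0.0) for v in voltages_pu)


def lagrangian_reward(reward, cost, lam):
    """Scalar objective the policy is trained on for a fixed multiplier."""
    return reward - lam * cost


def dual_ascent(lam, mean_episode_cost, budget, lr=0.1):
    """Raise lam while the cost exceeds the budget, lower it otherwise (never < 0)."""
    return max(0.0, lam + lr * (mean_episode_cost - budget))


cost = voltage_violation_cost([1.00, 0.93, 1.07])   # about 0.04
lam = dual_ascent(0.0, mean_episode_cost=0.5, budget=0.1)   # about 0.04
shaped = lagrangian_reward(10.0, cost, lam)
```

With a fixed penalty weight in the reward, the trade-off is chosen by hand. Here the multiplier adapts until the measured cost meets the budget, which is only possible when cost is reported separately from reward.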
4.2 Time-Series Data and Non-Stationarity
PowerZoo bundles real half-hourly time series from the GB (Great Britain) electricity system:
- System demand: total national load (MW), showing daily/weekly/seasonal patterns
- Solar capacity factor: fraction of installed PV actually generating (0–1)
- Wind capacity factor: fraction of installed wind actually generating (0–1)
These drive the exogenous dynamics in every task. Key properties:
| Property | Impact on RL |
|---|---|
| Diurnal cycle (peak at 18:00, trough at 04:00) | Policies must learn time-of-day patterns |
| Weekly seasonality (weekday vs weekend) | Generalization across week structure |
| Seasonal variation (winter peak ≈ 1.5× summer) | Train/val/test splits span different seasons — distribution shift |
| Weather correlation (solar ↔ cloud, wind ↔ storm) | Multivariate exogenous state, imperfect forecasts |
| Trend (increasing renewables, decreasing baseload) | Year-over-year concept drift |
RL implication: Policies trained on summer data may fail in winter. The fixed train/val/test date splits (non-overlapping, spanning 2023–2025) test whether agents learn robust strategies rather than memorizing specific load patterns.
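Because the drivers are periodic, observations often include cyclic time features so that 23:30 and 00:00 look adjacent to the policy. The sketch below encodes a half-hourly timestamp this way. The function name and the choice of sine/cosine features are assumptions, not a description of PowerZoo's observation space.

```python
import math
from datetime import datetime


def time_features(ts):
    """Encode time of day (48 half-hour slots) and day of week as sin/cos pairs."""
    slot = ts.hour * 2 + ts.minute // 30          # 0..47
    day_angle = 2 * math.pi * slot / 48
    week_angle = 2 * math.pi * ts.weekday() / 7   # Monday = 0
    return (math.sin(day_angle), math.cos(day_angle),
            math.sin(week_angle), math.cos(week_angle))


midnight = time_features(datetime(2024, 1, 1, 0, 0))    # a Monday
noon = time_features(datetime(2024, 1, 1, 12, 0))
late = time_features(datetime(2024, 1, 1, 23, 30))
```

These features capture the diurnal and weekly cycles only. Seasonal variation and year-over-year drift still appear as distribution shift between the train, validation, and test periods.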
5. Why This Becomes a Distinct RL Problem
| Power property | RL challenge | PowerZoo task |
|---|---|---|
| Power flow couples all buses | Implicit agent coupling | All grid tasks |
| Nonlinear AC equations | Non-convex transition | AC-mode tasks |
| Hard voltage/thermal limits | CMDP, safe RL | All tasks via SafeRLWrapper |
| Generator cost curves | Multi-agent credit assignment | marl_opf, opf_118 |
| SOC integrator dynamics | Long-horizon planning | marl_der_arbitrage, marl_ev_v2g |
| On/off + continuous dispatch | Hybrid action spaces | marl_uc |
| Real time-series driving loads | Distribution shift, generalization | All tasks (train/val/test splits) |
| EV departure deadlines | Hard deadline constraints | marl_ev_v2g |
| Competing cost vs safety | Pareto trade-offs, multi-objective | EV, safe RL tasks |
| Grid topology (mesh vs radial) | Different constraint structures | Trans vs Dist environments |
These properties arise from physics, not from artificial benchmark design. PowerZoo's role is to expose them through clean RL interfaces while preserving their physical realism.
See also
- Python contract and Reward and cost split — env API and CMDP framing.
- Physics · Transmission, Distribution, Resources — implementation-level physics for each layer.
- Benchmarks · Overview — how each task instantiates these physics into an MDP.
- A short Glossary of recurring acronyms (PF, OPF, BFS, LMP, SOC, V2G, …).