Skip to content

Power Systems Primer for ML Researchers

This page introduces the essential power-system concepts that underpin PowerZoo's environments. No electrical-engineering background is assumed — every concept is motivated by its impact on the RL problem.

The intended reading order is: first build intuition for the network and power flow, then look at dispatch problems, then move to stateful devices such as batteries and EVs, and finally return to safety constraints and real time-series drivers.

The map below previews how each physical object in the next sections maps to a familiar RL concept. Use it as a quick reference while reading.

flowchart LR
    subgraph Phys ["Physical object"]
      P1[Bus / line / PF solve]
      P2[Generator + cost curve]
      P3[Battery / EV SOC]
      P4[Voltage / thermal limits]
      P5[Real load · solar · wind traces]
    end
    subgraph RL ["RL concept"]
      R1[Coupled state transition\n(physics-mediated agent coupling)]
      R2[Per-agent action + shared reward]
      R3[Hidden integrator state\n(long-horizon credit assignment)]
      R4[CMDP cost channel\n(safety constraint)]
      R5[Non-stationary exogenous process\n(distribution shift)]
    end
    P1 --> R1
    P2 --> R2
    P3 --> R3
    P4 --> R4
    P5 --> R5

If you are already comfortable with the physics, jump straight to Python contract for the env API and to Reward and cost split for the CMDP framing. The deeper physics derivations live under Physics.


1. Network and Physics

Start by building intuition for what the grid is and what the environment computes at each step. This section moves from static structure to the physical solve that ties all injections together.

1.1 The Power Grid as a Graph

A power grid can be represented as a graph where:

  • Buses (nodes, labelled B1–B4 in the diagram) are connection points — every resource injects or withdraws power at a bus.
  • Lines (edges, labelled L1–L4) carry power between buses. Each line has a thermal rating (maximum MW) and an impedance (resistance + reactance).
  • Generators (labelled G, drawn as a circle with a sine-wave symbol) inject power into a bus — the arrow points toward the bus bar.
  • Loads / demand (labelled D) withdraw power from a bus — the arrow points away from the bus bar.
4-bus power grid example A 4-bus ring network: generator G1 and load D1 at Bus 1 (top-left), load D2 at Bus 2 (top-right), load D3 at Bus 3 (bottom-left), generator G2 and load D4 at Bus 4 (bottom-right). Lines L1 (B1–B2), L2 (B1–B3), L3 (B2–B4), and L4 (B3–B4) form the ring. B1 B2 B3 B4 G1 D1 D2 D3 D4 G2 L1 L2 L3 L4

The diagram shows a 4-bus ring: two generators (G1 at B1, G2 at B4), four demand points (D1–D4, one per bus), and four lines (L1–L4) connecting adjacent buses. Every bus has at least one resource attached; every line is a physical path through which power can flow.

Why this matters for RL: No bus operates independently. Increasing generator G1's output raises flows on both L1 (B1 → B2) and L2 (B1 → B3) at the same time, changing conditions at B2, B3, and B4 — without any direct signal passing between those buses. This physics-mediated coupling is what makes multi-agent grid control fundamentally different from typical MARL benchmarks like StarCraft or traffic control, where agents interact only through explicit game mechanics.


1.2 Power Flow: The Physics Engine

Power flow (load flow) is the grid's physics step. Given all injections (generation, load, storage), it computes:

  • Voltage at every bus (magnitude and angle)
  • Current / power flow on every line
  • Losses (real power dissipated in line impedance)

DC Power Flow (Linear Approximation)

Assumes voltages are all 1.0 pu, ignores reactive power and losses:

\[P_{\text{line}} = \text{PTDF} \times P_{\text{injection}}\]

where PTDF (Power Transfer Distribution Factor) is a constant matrix derived from line impedances. This is a single matrix-vector multiply — fast and differentiable, but only approximate.

RL implication: DC power flow is a linear constraint. The feasible action set is a polytope. Constraint satisfaction can be verified analytically.

AC Power Flow (Nonlinear, Full Physics)

Solves the nonlinear power-balance equations at each bus:

\[P_i = V_i \sum_j V_j (G_{ij} \cos\theta_{ij} + B_{ij} \sin\theta_{ij})\]
\[Q_i = V_i \sum_j V_j (G_{ij} \sin\theta_{ij} - B_{ij} \cos\theta_{ij})\]

where \(V_i\) is voltage magnitude, \(\theta_{ij}\) is voltage angle difference, and \(G_{ij}\), \(B_{ij}\) are conductance and susceptance from the admittance matrix.

RL implication: AC power flow makes the transition function nonlinear and non-convex. Small action changes can cause large voltage swings. The solver may fail to converge (infeasible dispatch) — the environment must handle this gracefully.

Distribution vs Transmission

Property Transmission (HV) Distribution (MV/LV)
Topology Meshed (loops) Radial (tree)
Voltage level 110–765 kV 0.4–33 kV
Solver PTDF / Newton–Raphson Forward-backward sweep (BFS)
Key constraint Line thermal limits Voltage magnitude limits
R/X ratio Low (≈0.1) — reactance-dominant High (≈1.0) — resistance and reactance comparable
DER penetration Low (large generators) High (solar, batteries, EVs)

RL implication: Transmission tasks emphasize global coordination among large generators. Distribution tasks emphasize local voltage regulation with many small resources. PowerZoo provides both.


2. Dispatch Problems: How the System Chooses Output

Once the network and physics are clear, the benchmark's main decision problems become easier to place: single-step OPF and inter-temporal UC.

2.1 Optimal Power Flow (OPF)

OPF answers: Given demand, what is the cheapest feasible generation dispatch?

\[\min_{P_g} \sum_i C_i(P_{g,i}) \quad \text{s.t.} \quad \text{power balance, line limits, voltage limits}\]

where \(C_i(P_{g,i}) = mc\_a_i P_{g,i}^2 + mc\_b_i P_{g,i} + mc\_c_i\) is the quadratic total-cost curve of generator \(i\). In PowerZoo, mc_a, mc_b, mc_c are the coefficients of this polynomial total-cost function. For cases where mc_a = mc_b = 0 (e.g. Case5), mc_c is the flat marginal cost in $/MWh and the cost simplifies to \(C(P) = mc\_c \cdot P\).

The OPF solution is the oracle baseline in PowerZoo's marl_opf and opf_118 tasks. A perfect RL policy would replicate OPF dispatch without access to the solver.

Why not just run OPF? In practice: - OPF requires full system knowledge (all costs, all limits) — RL agents may have partial observability. - OPF is a static single-step optimization — it does not account for inter-temporal constraints (battery SOC, ramping, startup costs). - Large-scale AC-OPF is NP-hard; RL offers a real-time approximation path.

Locational Marginal Price (LMP)

LMP is the dual variable (shadow price) of the power balance constraint at each bus. It tells you: How much would total system cost increase if demand at bus \(i\) increased by 1 MW?

\[\text{LMP}_i = \lambda + \sum_k \mu_k \cdot \text{PTDF}_{k,i}\]

where \(\lambda\) is the system energy price and \(\mu_k\) are congestion multipliers for binding line constraints.

RL implication: LMPs are the standard price signal for storage and DER agents. In marl_der_arbitrage, CostBasedMarketEnv and BidBasedMarketEnv, agents observe LMP-derived signals and must learn to buy low / sell high — a temporal credit assignment problem.


2.2 Unit Commitment (UC)

UC extends OPF with binary on/off decisions and inter-temporal constraints:

\[\min \sum_{t} \sum_{i} \bigl[ C_i(P_{g,i,t}) \cdot u_{i,t} + S_i^{\text{up}} \cdot z_{i,t} + S_i^{\text{dn}} \cdot w_{i,t} \bigr]\]

subject to:

  • Power balance at each time step
  • Generation limits: \(P_{\min} \cdot u_{i,t} \leq P_{g,i,t} \leq P_{\max} \cdot u_{i,t}\)
  • Minimum up/down time: once started, a unit must stay on for \(T_{\text{up}}\) steps; once shut down, it must stay off for \(T_{\text{dn}}\) steps
  • Ramp rate: \(|P_{g,i,t} - P_{g,i,t-1}| \leq R_i\)
  • Startup / shutdown indicators: \(z_{i,t} \geq u_{i,t} - u_{i,t-1}\), \(w_{i,t} \geq u_{i,t-1} - u_{i,t}\)

RL implication: The marl_uc task requires mixed discrete-continuous actions — each agent outputs [score, on_off]. Minimum up/down time creates temporal coupling across steps. A greedy policy that ignores future demand may incur large startup costs. This makes marl_uc a useful testbed for algorithms that handle hybrid action spaces and long-horizon planning.


3. Flexible Devices and Inter-Temporal State

This layer focuses on resources that carry state across time. Their actions are not just about current feasibility; they reshape what is possible later in the episode.

3.1 Battery Storage and SOC Dynamics

Battery state-of-charge (SOC) evolves as:

\[\text{SOC}_{t+1} = \text{SOC}_t + \frac{\Delta t}{E_{\text{cap}}} \begin{cases} -P_t \cdot \eta_{\text{charge}} & \text{if charging } (P_t < 0) \\ -P_t / \eta_{\text{discharge}} & \text{if discharging } (P_t > 0) \end{cases}\]

subject to \(\text{SOC}_{\min} \leq \text{SOC}_{t+1} \leq \text{SOC}_{\max}\) and \(P_{\min} \leq P_t \leq P_{\max}\).

The sketch below keeps the encoding compact: bars are power actions, and the three lines are SOC, SOC_min, and SOC_max. The point is not exact units, but the structure of “charge first, discharge later” and how current actions reshape later feasibility.

---
config:
  xyChart:
    width: 480
    height: 220
    showTitle: false
    xAxis:
      titleFontSize: 12
      labelFontSize: 11
    yAxis:
      titleFontSize: 12
      labelFontSize: 11
---
xychart
    x-axis [t0, t1, t2, t3, t4, t5]
    y-axis "" -1 --> 1
    bar [-0.60, -0.45, 0.00, 0.35, 0.75, 0.00]
    line [0.32, 0.50, 0.64, 0.64, 0.36, 0.36]
    line [0.15, 0.15, 0.15, 0.15, 0.15, 0.15]
    line [0.90, 0.90, 0.90, 0.90, 0.90, 0.90]

RL implication: SOC is a hidden integrator state — current actions constrain future feasibility. An agent that fully discharges at time \(t\) cannot respond to a price spike at \(t+1\). This requires long-horizon credit assignment, similar to inventory management but with efficiency losses, power limits, and grid coupling.


3.2 Electric Vehicles (G2V / V2G)

EVs add scheduling constraints on top of battery dynamics:

  • Availability: The EV can only charge/discharge when parked at home. During commute hours, the agent must output zero.
  • Departure SOC: The EV must reach \(\text{SOC} \geq \text{SOC}_{\text{departure}}\) before leaving. Missing this deadline is a hard constraint violation.
  • Stochastic schedule: Departure/arrival times may vary across episodes.

This diagram separates the two EV-specific constraints: availability windows determine when action is allowed, while the departure deadline and departure SOC requirement determine when the battery must already be ready.

gantt
    title EV timeline: availability constraint and departure deadline
    dateFormat HH:mm
    axisFormat %H:%M
    section Available windows
    Home / available     :home1, 08:00, 2h
    Home / available     :home2, 11:00, 2h
    section Unavailable
    Commute / unavailable :away1, 10:00, 1h
    section Constraints
    Departure SOC req.    :milestone, soc1, 09:50, 10m
    EV departure deadline :vert, dep1, 10:00, 1m

RL implication: The marl_ev_v2g task combines temporal credit assignment (charge now for departure later), hard deadline constraints (departure SOC), and availability masking (zero-action periods). Agents cannot simply learn a static charge profile — they must adapt to varying schedules within each episode.


4. Safety Constraints and Exogenous Drivers

The network, dispatch, and devices above all operate under two higher-level conditions: hard safety boundaries and real time-series inputs that make the environment non-stationary.

4.1 Voltage and Thermal Limits — Safety Constraints

Power systems enforce hard physical constraints:

Constraint Physical meaning Consequence of violation
Voltage (\(V_{\min} \leq V_i \leq V_{\max}\)) Bus voltage must stay within ±5% of nominal Equipment damage, cascading outage
Thermal ($ S_k \leq S_k^{\max}$)
SOC bounds Battery cannot exceed physical capacity Cell degradation, safety hazard
Generation limits (\(P_{\min} \leq P_g \leq P_{\max}\)) Generator output within nameplate range Mechanical stress, turbine damage

In PowerZoo these violations are not penalised through reward. They flow into separate CMDP cost channels (constraint_costs plus named cost_* components), with scalar info['cost'] reserved for compatibility wrappers. The full rules — which cost components each task uses, how resources expose them, how wrappers convert them — live in Reward and cost split.

RL implication: Standard reward shaping (adding a penalty term to reward) conflates economic objectives with safety requirements. CMDP separation lets researchers experiment with Lagrangian methods, constrained policy optimisation (CPO) or primal-dual approaches. The SafeRLWrapper exposes the cost signal in OmniSafe-compatible format.


4.2 Time-Series Data and Non-Stationarity

PowerZoo bundles real half-hourly time series from the GB (Great Britain) electricity system:

  • System demand: total national load (MW), showing daily/weekly/seasonal patterns
  • Solar capacity factor: fraction of installed PV actually generating (0–1)
  • Wind capacity factor: fraction of installed wind actually generating (0–1)

These drive the exogenous dynamics in every task. Key properties:

Property Impact on RL
Diurnal cycle (peak at 18:00, trough at 04:00) Policies must learn time-of-day patterns
Weekly seasonality (weekday vs weekend) Generalization across week structure
Seasonal variation (winter peak ≈ 1.5× summer) Train/val/test splits span different seasons — distribution shift
Weather correlation (solar ↔ cloud, wind ↔ storm) Multivariate exogenous state, imperfect forecasts
Trend (increasing renewables, decreasing baseload) Year-over-year concept drift

RL implication: Policies trained on summer data may fail in winter. The fixed train/val/test date splits (non-overlapping, spanning 2023–2025) test whether agents learn robust strategies rather than memorizing specific load patterns.


5. Why This Becomes a Distinct RL Problem

Power property RL challenge PowerZoo task
Power flow couples all buses Implicit agent coupling All grid tasks
Nonlinear AC equations Non-convex transition AC-mode tasks
Hard voltage/thermal limits CMDP, safe RL All tasks via SafeRLWrapper
Generator cost curves Multi-agent credit assignment marl_opf, opf_118
SOC integrator dynamics Long-horizon planning marl_der_arbitrage, marl_ev_v2g
On/off + continuous dispatch Hybrid action spaces marl_uc
Real time-series driving loads Distribution shift, generalization All tasks (train/val/test splits)
EV departure deadlines Hard deadline constraints marl_ev_v2g
Competing cost vs safety Pareto trade-offs, multi-objective EV, safe RL tasks
Grid topology (mesh vs radial) Different constraint structures Trans vs Dist environments

These properties arise from physics, not from artificial benchmark design. PowerZoo's role is to expose them through clean RL interfaces while preserving their physical realism.


See also