Reward and Cost Split

PowerZoo treats every task as a Constrained MDP (CMDP): maximise the discounted reward subject to a discounted cost being below a budget.

\[ \max_\pi \; \mathbb{E}_\pi\!\left[\sum_{t=0}^{T} \gamma^t r_t\right] \quad \text{s.t.} \quad \mathbb{E}_\pi\!\left[\sum_{t=0}^{T} \gamma^t c_{k,t}\right] \leq d_k \]

This page explains how that split appears in the codebase and in the info dict at every step. The full env API is in Python contract; the underlying physics is in Power systems primer.

Why a separate cost channel

Reward shaping cannot guarantee constraint satisfaction. A reward like economic_value − λ · violation only biases the policy; under-tuned, it ignores the constraint, and over-tuned, it sacrifices reward to stay safe. By keeping safety in a separate channel, you can plug in Lagrangian, primal-dual, CPO or any other constrained-RL algorithm without rewriting the env. A standard reward-only RL agent that ignores the vector cost (or its scalar compatibility alias) still trains — it just produces an unsafe policy.
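To make the contrast concrete, here is a minimal primal-dual sketch (illustrative, not a PowerZoo API): instead of baking λ into the reward, the multiplier is updated from the separate cost channel, so it grows exactly while the constraint is violated. The function name and learning rate are assumptions for illustration.

```python
def dual_update(lmbda: float, episode_cost: float, budget: float,
                lr: float = 0.01) -> float:
    """Gradient ascent on the Lagrange multiplier: increases while the
    episode's cumulative cost exceeds the budget, decreases otherwise,
    clipped at zero so the penalty never turns into a bonus."""
    return max(0.0, lmbda + lr * (episode_cost - budget))

# Multiplier rises when the cost budget is exceeded...
lmbda = dual_update(0.5, episode_cost=30.0, budget=25.0)   # -> 0.55
# ...and stays pinned at zero while the policy is feasible.
lmbda_safe = dual_update(0.0, episode_cost=10.0, budget=25.0)   # -> 0.0
```

The policy step then optimises reward − λ · cost with λ held fixed; because λ is adapted rather than hand-tuned, the under/over-tuning failure modes above disappear.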

What goes into reward

The reward channel carries only the economic / task objective:

  • OPF / UC (marl_opf, opf_118, opf_118_7d, marl_uc): negative generation cost (plus startup / shutdown cost for UC).
  • Battery / DER / EV arbitrage (battery_arbitrage, marl_der_arbitrage, marl_ev_v2g): trading profit (plus departure-readiness bonus for EV).
  • Data center (dc_scheduling): weighted -(energy + SLA + PUE).
  • DC microgrid (dc_microgrid, dc_microgrid_safe): scalarised r_energy + w_cost · r_cost + w_carbon · r_carbon, with the per-component vector also exposed in info["reward_vector"].
  • Markets (gencos_bidding, CostBasedMarketEnv, BidBasedMarketEnv): per-step LMP-driven settlement profit.
  • DSO (make_dso_env): network loss (-loss_penalty_weight * p_loss_MW).

What goes into cost

The cost channel carries physical safety violations in physical units:

  • Line thermal overload (MW), bus voltage violation (pu).
  • Battery / EV SOC bound violations, EV departure SOC missed, EV away-but-acted.
  • Data-center zone over-temperature (°C above critical).
  • Microgrid SLA / power-deficit violations.
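Because these costs are in physical units, the budget d_k from the CMDP formula above is also physical. A worked check (with made-up numbers) of one channel against its budget:

```python
def discounted_cost(costs, gamma=0.99):
    """Discounted sum of a per-step cost sequence, as in the CMDP constraint."""
    return sum(gamma**t * c for t, c in enumerate(costs))

# Per-step thermal overload in MW over a short episode (illustrative values).
overloads_mw = [0.0, 0.3, 1.2, 0.0, 0.5]
total = discounted_cost(overloads_mw)
feasible = total <= 2.0   # budget d_k = 2.0 MW, chosen for illustration
```

Here `total` is roughly 1.95 MW, so the episode satisfies the (illustrative) budget.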

How costs flow from a resource to the agent

PowerZoo uses a simple prefix convention: any key returned by resource.status() whose name starts with cost_ is collected automatically.

flowchart LR
    R["Resource.status()\n{ ..., cost_clipped_power: 0.3 }"]
    --> P["PowerEnv._augment_info()\nsums all cost_* fields per resource"]
    --> S["info['cost_resource']"]
    --> C["info['constraint_costs']\n(full vector)"]
    --> T["TaskCMDPWrapper\nselected_constraint_costs"]
    --> W["SafeRLWrapper / GymnasiumSafeWrapper\nscalar compatibility alias"]

To add a new cost signal in a custom resource:

def status(self) -> dict:
    return {
        # ... existing status fields ...
        'cost_my_new_violation': max(0.0, value),   # non-negative, physical units
    }

No registration call is needed.
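To make the prefix convention concrete, here is an illustrative reimplementation of the collection step (a sketch of the behaviour described above, not the actual PowerEnv._augment_info code):

```python
def collect_costs(status: dict) -> dict:
    """Pick out every 'cost_'-prefixed field from a resource status dict,
    strip the prefix, and sum the values into a per-resource scalar."""
    costs = {k[len('cost_'):]: v for k, v in status.items()
             if k.startswith('cost_')}
    return {'constraint_costs': costs,          # named vector
            'cost_resource': sum(costs.values())}  # per-resource sum

info = collect_costs({'p_MW': 4.2,
                      'cost_clipped_power': 0.3,
                      'cost_voltage_violation': 0.05})
# info['constraint_costs'] keeps the two violation components;
# info['cost_resource'] is their sum (~0.35); 'p_MW' is ignored.
```

This is why no registration is needed: any new `cost_*` key is swept up by the same pass.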

Per-task cost components

Different tasks use different subsets:

Task | Cost components in info | Benchmark CMDP selection
--- | --- | ---
marl_opf, marl_uc, opf_118, opf_118_7d | cost_thermal_overload, cost_voltage_violation | legacy scalar projection only
marl_der_arbitrage | cost_voltage_violation, cost_clipped_power (battery SOC clip) | legacy starter task; scalar projection only when wrapped
marl_ev_v2g | cost_voltage_violation, cost_clipped_power, EV departure violation, home availability violation | legacy starter task; scalar projection only when wrapped
dc_scheduling | cost_overtemp, grid cost_* from PowerEnv | legacy starter task; cost_sum diagnostic
dc_microgrid, dc_microgrid_safe | cost_sla, cost_overtemp, cost_power_deficit | selected_constraint_costs = ['sla', 'overtemp', 'power_deficit']
make_dso_env(...) | full vector + task selection | selected_constraint_costs = ['voltage_violation']
comparison_tso_centralized | cost_thermal_overload, cost_reserve_shortfall | selected_constraint_costs = ['thermal_overload', 'reserve_shortfall']
marl_ders_benchmark | per-agent voltage_violation, thermal_overload, resource current | MARL training = CMDP env + MDP fallback
gencos_bidding | per-agent thermal_overload | MARL training = CMDP env + MDP fallback

Where useful, adapters also expose the breakdown through info['costs'] (a dict of named components).
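A sketch of consuming that breakdown over a rollout (illustrative helper, not a PowerZoo API; it only assumes the `info['costs']` shape described above):

```python
from collections import defaultdict

def accumulate(infos):
    """Sum each named cost component over a list of per-step info dicts."""
    totals = defaultdict(float)
    for info in infos:
        for name, value in info.get('costs', {}).items():
            totals[name] += value
    return dict(totals)

# Per-component episode totals from three steps of (made-up) info dicts.
episode = accumulate([{'costs': {'sla': 1.0}},
                      {'costs': {'sla': 0.5, 'overtemp': 2.0}},
                      {}])                      # steps without costs are fine
```

Keeping the components named like this is what lets you report, say, SLA violations separately from over-temperature instead of a single opaque scalar.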

Wiring the scalar cost into a Safe-RL algorithm

Two wrappers cover the common interface styles:

Wrapper | Returns | Use when
--- | --- | ---
SafeRLWrapper | 6-tuple (obs, reward, cost, terminated, truncated, info) | Algorithm reads cost as a separate return value (OmniSafe, Safety-Gymnasium).
GymnasiumSafeWrapper | Standard 5-tuple, with the selected scalar projection injected into info['cost'] | Algorithm expects the standard 5-tuple and reads cost from info.

Stacking example:

from powerzoo.envs.grid.trans import TransGridEnv
from powerzoo.wrappers import GymnasiumWrapper, SafeRLWrapper

env = SafeRLWrapper(GymnasiumWrapper(TransGridEnv()), cost_threshold=25.0)
obs, info = env.reset(seed=0)
obs, reward, cost, terminated, truncated, info = env.step(env.action_space.sample())

powerzoo.rl.make_env(...) exposes the same wrappers behind keyword arguments — see Training · Trainers.
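On top of the 6-tuple interface, an evaluation loop tracks the two channels separately. A minimal sketch (assumes an env built as in the stacking example above and some `policy` callable; neither is a fixed PowerZoo name):

```python
def run_episode(env, policy, seed=0):
    """Roll out one episode over the SafeRLWrapper 6-tuple interface,
    accumulating reward and cost in separate counters."""
    obs, info = env.reset(seed=seed)
    ep_reward, ep_cost = 0.0, 0.0
    terminated = truncated = False
    while not (terminated or truncated):
        obs, reward, cost, terminated, truncated, info = env.step(policy(obs))
        ep_reward += reward
        ep_cost += cost
    return ep_reward, ep_cost
```

`ep_cost` is what you compare against `cost_threshold` (or feed to a dual update); never fold it into `ep_reward` when reporting benchmark results.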

Anti-patterns

  • Do not move safety into reward. If you find yourself writing reward -= w * thermal_violation, you are pushing the cost channel back into the objective. The right place is info['constraint_costs'] plus a wrapper-level projection if needed.
  • Do not silently widen voltage / SOC bounds. The bounds are part of the benchmark contract; if your algorithm cannot satisfy them, that inability is the experimental result.
  • Do not use the soft-penalty path on DistGridEnv for benchmark runs. It exists for compatibility but breaks the CMDP separation when enabled — use the default loss-only reward and read violations from info.