Reward and Cost Split¶
PowerZoo treats every task as a Constrained MDP (CMDP): maximise the discounted reward subject to the discounted cumulative cost staying within a budget.
This page explains how that split appears in the codebase and in the info dict at every step. The full env API is in Python contract; the underlying physics is in Power systems primer.
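For reference, the CMDP objective described above can be written as follows. This is the standard textbook form, not a formula taken from the PowerZoo source; the symbols $\pi$ (policy), $\gamma$ (discount factor), $T$ (episode horizon), $r_t$ (reward), $c_t^{(i)}$ (the $i$-th cost component) and $d_i$ (its budget) are notation assumed here for illustration.

$$
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t} c_t^{(i)}\right] \le d_i
\quad \text{for each selected cost component } i .
$$

With a single scalar cost, the constraint reduces to one inequality with one budget.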
Why a separate cost channel¶
Reward shaping cannot guarantee constraint satisfaction. A reward like economic_value − λ · violation only biases the policy: with λ too small the policy ignores the constraint, and with λ too large it sacrifices reward to stay safe. Keeping safety in a separate channel lets you plug in Lagrangian, primal-dual, CPO, or any other constrained-RL algorithm without rewriting the env. A standard reward-only RL agent that ignores the vector cost (or its scalar compatibility alias) still trains; it just produces an unsafe policy.
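To make the difference from a fixed penalty weight concrete, here is a minimal sketch of the multiplier update used by Lagrangian-style methods. It is not PowerZoo code: the function names, learning rate, and example numbers are hypothetical, and a real algorithm would alternate this step with policy updates.

```python
def update_multiplier(lam: float, episode_cost: float, budget: float, lr: float = 0.05) -> float:
    """Dual ascent step: raise lam when cost exceeds the budget, lower it otherwise.

    The multiplier is clipped at zero so a satisfied constraint is never rewarded.
    """
    return max(0.0, lam + lr * (episode_cost - budget))


def penalised_return(episode_reward: float, episode_cost: float, lam: float) -> float:
    """Lagrangian objective the policy step would maximise for a fixed lam."""
    return episode_reward - lam * episode_cost


# Hypothetical episode costs measured against a hypothetical budget of 25.0.
lam = 0.0
budget = 25.0
for episode_cost in [40.0, 35.0, 20.0]:
    lam = update_multiplier(lam, episode_cost, budget)

print(f"multiplier after three episodes: {lam:.3f}")
```

Unlike a hand-tuned λ inside the reward, the multiplier here is driven by the measured cost, which is only possible because the env reports cost separately.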
What goes into reward¶
The reward channel carries only the economic / task objective:
- OPF / UC (marl_opf, opf_118, opf_118_7d, marl_uc): negative generation cost (plus startup / shutdown cost for UC).
- Battery / DER / EV arbitrage (battery_arbitrage, marl_der_arbitrage, marl_ev_v2g): trading profit (plus departure-readiness bonus for EV).
- Data center (dc_scheduling): weighted-(energy + SLA + PUE).
- DC microgrid (dc_microgrid, dc_microgrid_safe): scalarised r_energy + w_cost · r_cost + w_carbon · r_carbon, with the per-component vector also exposed in info["reward_vector"].
- Markets (gencos_bidding, CostBasedMarketEnv, BidBasedMarketEnv): per-step LMP-driven settlement profit.
- DSO (make_dso_env): network loss (-loss_penalty_weight * p_loss_MW).
What goes into cost¶
The cost channel carries physical safety violations in physical units:
- Line thermal overload (MW), bus voltage violation (pu).
- Battery / EV SOC bound violations, EV departure SOC missed, EV away-but-acted.
- Data-center zone over-temperature (°C above critical).
- Microgrid SLA / power-deficit violations.
How costs flow from a resource to the agent¶
PowerZoo uses a simple prefix convention: any key returned by resource.status() whose name starts with cost_ is collected automatically.
flowchart LR
    R["Resource.status()\n{ ..., cost_clipped_power: 0.3 }"]
    --> P["PowerEnv._augment_info()\nsums all cost_* fields per resource"]
    --> S["info['cost_resource']"]
    --> C["info['constraint_costs']\n(full vector)"]
    --> T["TaskCMDPWrapper\nselected_constraint_costs"]
    --> W["SafeRLWrapper / GymnasiumSafeWrapper\nscalar compatibility alias"]
To add a new cost signal in a custom resource:
def status(self) -> dict:
    return {
        # ... existing status fields ...
        'cost_my_new_violation': max(0.0, value),  # non-negative, physical units
    }
No registration call is needed.
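The prefix convention can be illustrated with a small stand-alone sketch. It does not use PowerZoo itself: ToyBattery and sum_cost_fields are hypothetical stand-ins for a resource and for the summing step that the flowchart attributes to PowerEnv._augment_info(), whose actual implementation may differ.

```python
def sum_cost_fields(status: dict) -> float:
    """Add up every status entry whose key starts with 'cost_'."""
    return sum(float(value) for key, value in status.items() if key.startswith("cost_"))


class ToyBattery:
    """Hypothetical resource that reports how much requested power was clipped."""

    def __init__(self, requested_mw: float, limit_mw: float):
        self.requested_mw = requested_mw
        self.limit_mw = limit_mw

    def status(self) -> dict:
        clipped = max(0.0, abs(self.requested_mw) - self.limit_mw)
        return {
            "p_mw": min(self.requested_mw, self.limit_mw),  # not a cost: ignored by the sum
            "cost_clipped_power": clipped,                   # picked up via the prefix
        }


battery = ToyBattery(requested_mw=1.3, limit_mw=1.0)
info = {"cost_resource": sum_cost_fields(battery.status())}
print(info)
```

Only the key prefix matters in this sketch, which mirrors why a new cost_* field needs no extra registration.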
Per-task cost components¶
Different tasks use different subsets:
| Task | Cost components in info | Benchmark CMDP selection |
|---|---|---|
| marl_opf, marl_uc, opf_118, opf_118_7d | cost_thermal_overload, cost_voltage_violation | legacy scalar projection only |
| marl_der_arbitrage | cost_voltage_violation, cost_clipped_power (battery SOC clip) | legacy starter task; scalar projection only when wrapped |
| marl_ev_v2g | cost_voltage_violation, cost_clipped_power, EV departure violation, home availability violation | legacy starter task; scalar projection only when wrapped |
| dc_scheduling | cost_overtemp, grid cost_* from PowerEnv | legacy starter task; cost_sum diagnostic |
| dc_microgrid, dc_microgrid_safe | cost_sla, cost_overtemp, cost_power_deficit | selected_constraint_costs = ['sla', 'overtemp', 'power_deficit'] |
| make_dso_env(...) | full vector + task selection | selected_constraint_costs = ['voltage_violation'] |
| comparison_tso_centralized | cost_thermal_overload, cost_reserve_shortfall | selected_constraint_costs = ['thermal_overload', 'reserve_shortfall'] |
| marl_ders_benchmark | per-agent voltage_violation, thermal_overload, resource | current MARL training = CMDP env + MDP fallback |
| gencos_bidding | per-agent thermal_overload | current MARL training = CMDP env + MDP fallback |
Where useful, adapters also expose the breakdown through info['costs'] (a dict of named components).
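As a rough illustration of what a task-level selection does, the sketch below picks a subset of named components and collapses it to a scalar. It rests on two assumptions that the text above does not confirm: that info['constraint_costs'] behaves like a mapping from component name to value, and that the scalar projection is a plain sum. The helper select_constraint_costs and the numbers are hypothetical.

```python
def select_constraint_costs(constraint_costs: dict, selected: list) -> list:
    """Return the selected components as floats, in the order given by `selected`."""
    missing = [name for name in selected if name not in constraint_costs]
    if missing:
        raise KeyError(f"unknown cost components: {missing}")
    return [float(constraint_costs[name]) for name in selected]


# Hypothetical per-step info for a microgrid-style task (assumed mapping layout).
info = {
    "constraint_costs": {
        "sla": 0.0,
        "overtemp": 1.5,
        "power_deficit": 0.25,
        "voltage_violation": 0.02,  # present in the full vector, not selected below
    }
}

selected = ["sla", "overtemp", "power_deficit"]
cost_vector = select_constraint_costs(info["constraint_costs"], selected)
scalar_cost = sum(cost_vector)  # assumed projection: plain sum of the selection
print(cost_vector, scalar_cost)
```

A vector-aware algorithm would consume cost_vector directly; the scalar is only needed for algorithms that accept a single cost signal.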
Wiring the scalar cost into a Safe-RL algorithm¶
Two wrappers cover the common interface styles:
| Wrapper | Returns | Use when |
|---|---|---|
| SafeRLWrapper | 6-tuple (obs, reward, cost, terminated, truncated, info) | Algorithm reads cost as a separate return value (OmniSafe, Safety-Gymnasium). |
| GymnasiumSafeWrapper | Standard 5-tuple, but injects the selected scalar projection into info['cost'] | Algorithm expects the standard 5-tuple and reads cost from info. |
Stacking example:
from powerzoo.envs.grid.trans import TransGridEnv
from powerzoo.wrappers import GymnasiumWrapper, SafeRLWrapper

env = SafeRLWrapper(GymnasiumWrapper(TransGridEnv()), cost_threshold=25.0)
obs, info = env.reset(seed=0)
# step() returns the 6-tuple, with cost as a separate value
obs, reward, cost, terminated, truncated, info = env.step(env.action_space.sample())
powerzoo.rl.make_env(...) exposes the same wrappers behind keyword arguments — see Training · Trainers.
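A constrained-RL algorithm typically accumulates reward and cost side by side over an episode. The sketch below shows that bookkeeping against the 6-tuple step signature. To stay runnable without PowerZoo installed it uses StubSafeEnv, a hypothetical stand-in that is not a PowerZoo class; the rollout helper and the dynamics are illustrative assumptions as well.

```python
class StubSafeEnv:
    """Hypothetical env with a 6-tuple step(); stands in for a wrapped PowerZoo env."""

    def __init__(self, horizon: int = 5):
        self.horizon = horizon
        self._t = 0

    def reset(self, seed=None):
        self._t = 0
        return 0.0, {}

    def step(self, action):
        self._t += 1
        reward = -1.0                       # toy economic objective
        cost = 0.5 if action > 0 else 0.0   # toy safety violation
        truncated = self._t >= self.horizon
        return 0.0, reward, cost, False, truncated, {}


def rollout(env, policy, gamma: float = 0.99, seed: int = 0):
    """Run one episode and return (discounted reward, discounted cost)."""
    obs, info = env.reset(seed=seed)
    reward_return, cost_return, discount = 0.0, 0.0, 1.0
    done = False
    while not done:
        obs, reward, cost, terminated, truncated, info = env.step(policy(obs))
        reward_return += discount * reward
        cost_return += discount * cost
        discount *= gamma
        done = terminated or truncated
    return reward_return, cost_return


ret, cost_ret = rollout(StubSafeEnv(horizon=5), policy=lambda obs: 1, gamma=1.0)
print(ret, cost_ret)
```

With a 5-tuple wrapper the loop would be the same, except that cost is read from info['cost'] instead of being unpacked from the step result.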
Anti-patterns¶
- Do not move safety into reward. If you find yourself writing reward -= w * thermal_violation, you are pushing the cost channel back into the objective. The right place is info['constraint_costs'] plus a wrapper-level projection if needed.
- Do not silently widen voltage / SOC bounds. The bounds are part of the benchmark contract; if your algorithm cannot satisfy them, that inability is the experimental result.
- Do not use the soft-penalty path on DistGridEnv for benchmark runs. It exists for compatibility but breaks the CMDP separation when enabled; use the default loss-only reward and read violations from info.