A CMDP-within-online framework for Meta-Safe Reinforcement Learning
We study meta-safe reinforcement learning (Meta-SRL) through a CMDP-within-online framework and establish the first provable guarantees for this setting. Using gradient-based meta-learning, we derive task-averaged regret bounds for reward optimality and constraint violations that improve with task similarity in static environments or task relatedness in dynamic environments. We then propose a practical meta-algorithm that performs inexact online learning on upper bounds of the within-task optimality gap and constraint violations, estimated via off-policy stationary distribution corrections, with learning rates adapted for every task and an extension that competes against a dynamic oracle. Experiments demonstrate the effectiveness of the approach.
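To make the meta-algorithm's structure concrete, the following is a minimal, self-contained sketch of the outer "inexact online learning" loop under strong simplifying assumptions: the per-task upper bound is replaced by a toy quadratic surrogate (the paper estimates it via off-policy stationary distribution corrections), and the within-task CMDP learner is abstracted away. All names here (`surrogate_upper_bound`, `SimpleCMDP`-style setup, step-size rule) are illustrative, not the paper's implementation.

```python
# Hedged sketch: gradient-based meta-learning over a meta-initialization,
# updated by online gradient steps on (inexact) per-task upper bounds,
# with an AdaGrad-style per-task adaptive learning rate.
import numpy as np

rng = np.random.default_rng(0)
dim = 4            # dimension of the (toy) policy parameter
num_tasks = 20     # number of sequentially arriving CMDP tasks

def surrogate_upper_bound(theta_init, task_optimum):
    """Toy stand-in for the per-task regret/violation upper bound:
    squared distance from the initialization to the task's within-task solution."""
    return 0.5 * np.sum((theta_init - task_optimum) ** 2)

def surrogate_gradient(theta_init, task_optimum):
    """Gradient of the toy upper bound with respect to the initialization."""
    return theta_init - task_optimum

theta_meta = np.zeros(dim)   # meta-initialization maintained by the outer loop
grad_sq_sum = 1e-8           # accumulator for the adaptive step size
avg_bound = 0.0              # running task-averaged upper bound

for t in range(1, num_tasks + 1):
    # Similar tasks: within-task optima are clustered around a common point,
    # so the task-averaged bound should shrink as the meta-initialization adapts.
    task_optimum = np.ones(dim) + 0.1 * rng.standard_normal(dim)

    # Within-task phase (abstracted): the base learner starts from theta_meta
    # and pays regret/violation roughly proportional to this estimated bound.
    bound = surrogate_upper_bound(theta_meta, task_optimum)
    avg_bound += (bound - avg_bound) / t

    # Outer (meta) update: online gradient descent on the estimated bound
    # with a per-task adaptive learning rate.
    g = surrogate_gradient(theta_meta, task_optimum)
    grad_sq_sum += np.sum(g ** 2)
    eta_t = 1.0 / np.sqrt(grad_sq_sum)
    theta_meta -= eta_t * g

    print(f"task {t:2d}  estimated bound {bound:.4f}  task-average {avg_bound:.4f}")
```

Running this toy loop, the task-averaged bound decreases as the meta-initialization moves toward the cluster of within-task optima, mirroring (in spirit only) the claim that task-averaged regret and constraint violations improve with task similarity.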