Non-stationary Risk-sensitive Reinforcement Learning: Near-optimal Dynamic Regret, Adaptive Detection, and Separation Design
We study risk-sensitive reinforcement learning (RL) under the entropic risk measure in episodic non-stationary Markov decision processes (MDPs), where both the rewards and the transition kernels vary over time subject to a variation budget. We propose two restart-based algorithms, Restart-RSMB and Restart-RSQ, each with a dynamic regret guarantee, together with a meta-algorithm that adaptively detects non-stationarity without prior knowledge of the variation budget. We also establish a dynamic regret lower bound, showing that the proposed algorithms are near-optimal. Our results reveal that risk control and the handling of non-stationarity can be designed separately when the variation budget is known, whereas adaptively detecting non-stationarity inherently depends on the risk parameter.
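For context, the entropic risk measure underlying the objective is standard: for a random cumulative reward $X$ and risk parameter $\beta \neq 0$ (the notation $\rho_\beta$, $X$, $\beta$ is illustrative, not necessarily the paper's),
\[
  \rho_\beta(X) \;=\; \frac{1}{\beta}\,\log \mathbb{E}\!\left[e^{\beta X}\right],
\]
which admits the Taylor expansion $\rho_\beta(X) \approx \mathbb{E}[X] + \frac{\beta}{2}\,\mathrm{Var}(X)$ for small $|\beta|$; thus $\beta < 0$ yields risk-averse behavior, $\beta > 0$ risk-seeking behavior, and $\beta \to 0$ recovers the risk-neutral objective $\mathbb{E}[X]$.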