Optimization Solution Functions as Deterministic Policies for Offline Reinforcement Learning
Offline reinforcement learning (RL) faces challenges such as limited data coverage and value overestimation. We propose an implicit actor-critic (iAC) framework that uses an optimization solution function as the deterministic policy (actor) and a monotone function of the corresponding optimal value as the critic. Because optimality is encoded directly in the actor policy, the learned policy is robust to suboptimality in the actor parameters via an exponentially decaying sensitivity (EDS) property. We provide performance guarantees for the proposed framework and validate it on two real-world applications, showing notable gains over state-of-the-art offline RL methods.
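As a rough sketch of the formulation (the symbols $f_\theta$, $g_\phi$, and $\mathcal{A}$ below are assumptions for illustration, not notation taken from the paper), the actor and critic can be written as

\[
\pi_\theta(s) \;=\; \operatorname*{arg\,min}_{a \in \mathcal{A}} f_\theta(s, a),
\qquad
V_{\theta,\phi}(s) \;=\; g_\phi\!\Big(\min_{a \in \mathcal{A}} f_\theta(s, a)\Big),
\]

i.e., the deterministic policy is the solution function of a parametric optimization problem, and the critic applies a learned monotone map $g_\phi$ to that problem's optimal value.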