Adversarial Unlearning of Backdoors via Implicit Hypergradient
We propose a minimax formulation for removing backdoors from a poisoned model using only a small set of clean data. Our Implicit Backdoor Adversarial Unlearning (I-BAU) algorithm solves the minimax through the implicit hypergradient, which accounts for the interdependence between the inner and outer optimization, unlike prior methods that treat the two levels separately. We prove the convergence of I-BAU and show that the robustness obtained by solving the minimax on clean data generalizes to unseen test data. Evaluated against seven backdoor attacks on two datasets, I-BAU matches or outperforms six state-of-the-art defenses. Its performance is more robust to variations in triggers, attack settings, and poison ratios; it requires less computation to take effect, running notably faster than baselines in the single-target attack setting; and it remains effective even when only 100 clean samples are available.
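As a sketch of the problem setup (the notation here is ours, not taken from the abstract; we assume the unknown trigger is modeled as an additive perturbation $\delta$ constrained to a norm ball of radius $\epsilon$, with $\theta$ the model parameters and $D_c = \{(x_i, y_i)\}_{i=1}^{n}$ the small clean set), the minimax objective can be written as
$$
\min_{\theta} \; \max_{\|\delta\| \le \epsilon} \; \frac{1}{n} \sum_{i=1}^{n} L\big(f_\theta(x_i + \delta),\, y_i\big),
$$
where the inner maximization searches for a worst-case trigger on the clean set and the outer minimization unlearns its effect; the implicit hypergradient propagates the dependence of the inner solution $\delta^{*}(\theta)$ on $\theta$ into the outer update rather than treating $\delta^{*}$ as fixed.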