TeaMs-RL: Teaching LLMs to Teach Themselves Better Instructions via Reinforcement Learning

The development of Large Language Models (LLMs) often confronts challenges stemming from either the heavy reliance on human annotators in reinforcement learning from human feedback (RLHF) or the frequent and costly external queries required by self-instruct methods. We pivot to Reinforcement Learning (RL) with a twist: instead of refining LLMs after instruction tuning, we use RL to directly generate the foundational instruction dataset that alone suffices for fine-tuning. TeaMs-RL uses textual operations and rules to diversify training data, enabling high-quality data generation without heavy reliance on external advanced models. This paves the way for a single fine-tuning step and removes the need for subsequent RLHF stages. Results show reduced human involvement, far fewer model queries (roughly 5.73% of a strong baseline's), an improved ability to craft and comprehend complex instructions, and stronger privacy protection.
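To make the core idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation): a small set of textual operations is treated as the RL action space, an epsilon-greedy bandit selects which operation to apply to a seed instruction, and rewrites that increase dataset diversity receive higher reward. The operation names, the `apply_operation` stub (which stands in for a query to a teacher model), and the token-overlap reward are all illustrative assumptions; the paper's actual policy, action set, and reward design are specified in the full method.

```python
import random
from collections import defaultdict

# Hypothetical textual operations (actions); the paper's actual action set may differ.
ACTIONS = ["add_constraint", "deepen_reasoning", "concretize", "broaden_topic"]

def apply_operation(instruction: str, action: str) -> str:
    """Stand-in for querying a teacher LLM to rewrite the instruction.
    In practice this would prompt an external model; here we just tag
    the instruction so the loop runs end to end."""
    return f"{instruction} [{action}]"

def diversity_reward(candidate: str, dataset: list) -> float:
    """Toy reward: token-level novelty of the candidate versus instructions
    already collected. A real reward would use a reviewer model or richer
    quality/diversity metrics."""
    cand_tokens = set(candidate.split())
    if not dataset or not cand_tokens:
        return 1.0
    overlaps = [len(cand_tokens & set(d.split())) / len(cand_tokens) for d in dataset]
    return 1.0 - sum(overlaps) / len(overlaps)

def build_instruction_dataset(seeds, steps=200, epsilon=0.2, seed=0):
    """Epsilon-greedy bandit over textual operations: pick an operation,
    rewrite a random seed instruction, score the result, and keep it.
    The collected instructions form the dataset used for a single
    fine-tuning pass, with no subsequent RLHF stage."""
    rng = random.Random(seed)
    value = defaultdict(float)   # running mean reward per action
    counts = defaultdict(int)
    dataset = []
    for _ in range(steps):
        # Explore a random operation, or exploit the best-scoring one so far.
        if rng.random() < epsilon or not value:
            action = rng.choice(ACTIONS)
        else:
            action = max(value, key=value.get)
        candidate = apply_operation(rng.choice(seeds), action)
        r = diversity_reward(candidate, dataset)
        counts[action] += 1
        value[action] += (r - value[action]) / counts[action]  # incremental mean update
        dataset.append(candidate)
    return dataset, dict(value)

if __name__ == "__main__":
    seeds = ["Summarize the article.", "Write a sorting function.", "Explain photosynthesis."]
    data, action_values = build_instruction_dataset(seeds)
    print(f"Collected {len(data)} instructions")
    print("Estimated value per operation:", action_values)
```

The key design point this sketch illustrates is the ordering: RL is spent on constructing the instruction data itself, so the downstream model only needs one supervised fine-tuning step on the resulting dataset.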

Authors

Shangding Gu

Alois Knoll

Ming Jin

Published

January 1, 2024