Fine-tuning improves a model's understanding of prompts and helps it generate more natural responses, but because models learn from diverse Internet data, they can still produce toxic language or incorrect answers.
For instance, a model might respond in a funny but unhelpful way, provide misleading information, or produce harmful completions such as offensive language or criminal suggestions.
These challenges highlight the importance of the principles of helpfulness, honesty, and harmlessness (HHH), values that guide AI developers toward responsible use.
To address these concerns, models are fine-tuned with human feedback, which aligns them more closely with human preferences, improving helpfulness, honesty, and harmlessness while reducing toxicity and misinformation.
In 2020, researchers at OpenAI published a paper on using fine-tuning with human feedback to teach a model to write concise text summaries.
The fine-tuned model's summaries were preferred over those of the pretrained model, an instruct fine-tuned version, and even the human-written reference summaries.
RLHF, or reinforcement learning from human feedback, uses reinforcement learning to fine-tune LLMs with human input.
This improves model alignment with human preferences, ensuring useful and relevant outputs.
RLHF also reduces harm potential by encouraging responsible language and avoiding toxic content.
RLHF has a promising use case in personalizing LLMs. Through ongoing feedback, models can learn user preferences, enabling personalized learning plans and AI assistants tailored to individuals.
Reinforcement learning is a machine learning approach where an agent learns to achieve a goal by taking actions in an environment.
The aim is to maximize cumulative rewards.
The agent learns by making decisions, observing outcomes, and receiving rewards or penalties.
Over time, the agent refines its strategy to make better choices and improve its performance.
Consider the example of training a model to play Tic-Tac-Toe. Here, the agent is the model, and its aim is to win the game. The environment is the game board, and the current board configuration is the state. The agent's choices are its actions, which are positions on the board, and the strategy it follows in choosing them is called the RL policy. As the agent makes decisions, it earns rewards based on how effective its actions are in working toward a win. The goal of reinforcement learning is for the agent to learn the best strategy for the environment, the one that maximizes its cumulative reward. This involves trial and error. Initially, the agent takes random actions that lead to new states, and it explores subsequent states through further actions; this series of actions and states is called a rollout. As the agent gains experience, it discovers the actions that bring the highest rewards, leading to success in the game.
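The loop described above can be sketched in a few lines of code. The environment interface (`reset`, `legal_actions`, `step`) and the `TicTacToeEnv` name are hypothetical stand-ins used only to illustrate the agent-environment interaction, not a specific library.

```python
import random

def run_rollout(env, policy):
    """Play one episode (a rollout): a sequence of states, actions, and rewards."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state, env.legal_actions(state))  # the agent acts
        state, reward, done = env.step(action)            # the environment responds
        total_reward += reward                            # rewards accumulate over the episode
    return total_reward

def random_policy(state, legal_actions):
    # Early on, the agent explores by acting randomly; with experience,
    # it shifts toward the actions that have earned the highest rewards.
    return random.choice(legal_actions)

# Usage (assuming a hypothetical TicTacToeEnv implementing the interface above):
# total = run_rollout(TicTacToeEnv(), random_policy)
```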
Expanding on the Tic-Tac-Toe analogy for fine-tuning large language models with RLHF: the LLM acts as the agent's policy, striving to generate text that aligns with human preferences such as helpfulness and accuracy. The context window, into which text is entered via prompts, serves as the environment. The current context forms the state before the LLM takes an action by generating text. That action, whether a word, a sentence, or a longer passage, is chosen from the token vocabulary, and the choice of the next token depends on the model's representation of language, the current context, and its probability distribution over the vocabulary. Rewards are assigned based on how well the generated text aligns with human preferences. In contrast to Tic-Tac-Toe, determining the reward is complex because human responses vary; human assessment or a reward model can guide reward determination, and the resulting rewards are used to update the LLM's weights iteratively to enhance alignment with human preferences.
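Put together, a single RLHF update can be pictured as the loop below. This is only a conceptual sketch: `policy_llm`, `reward_model`, and `rl_update` are hypothetical placeholders for the fine-tuned model, the trained reward model, and a reinforcement learning update (for example, a PPO step), not a real API.

```python
def rlhf_step(policy_llm, reward_model, prompt):
    # State: the prompt/context currently in the context window.
    completion = policy_llm.generate(prompt)            # Action: tokens chosen from the vocabulary.
    reward = reward_model.score(prompt, completion)     # Reward: alignment with human preferences.
    rl_update(policy_llm, prompt, completion, reward)   # Update the LLM weights iteratively.
    return completion, reward
```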
To start fine-tuning an LLM with RLHF, select a model suited to your task and prepare a dataset of prompts.
Use the LLM and the prompt dataset to generate several completions for each prompt.
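A minimal sketch of this generation step, assuming the Hugging Face `transformers` library and an instruct model such as `google/flan-t5-base` (any model suited to your task could be substituted):

```python
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

prompt_dataset = ["My house is too hot.", "Explain why the sky is blue."]
completions_per_prompt = 3

dataset_for_labeling = []
for prompt in prompt_dataset:
    # Sample several completions per prompt so human labelers can rank them.
    completions = [
        generator(prompt, do_sample=True, max_new_tokens=64)[0]["generated_text"]
        for _ in range(completions_per_prompt)
    ]
    dataset_for_labeling.append({"prompt": prompt, "completions": completions})
```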
Human labelers then assess these completions, constituting the human feedback phase of RLHF.
Clear instructions are essential for obtaining quality human feedback, with labelers representing diverse perspectives.
Start by selecting a specific assessment criterion, like helpfulness or toxicity, and have human labelers rank completions based on that criterion. For example, if the prompt is "my house is too hot," labelers rank completions by helpfulness. This is repeated for different prompt-completion sets to create a dataset for training the reward model, which will eventually replace human assessors. Multiple labelers assess the same sets to establish consensus and mitigate misunderstandings.
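One simple way to establish that consensus, sketched below purely for illustration (the aggregation method is an assumption, not prescribed here), is to average each completion's rank position across labelers:

```python
from collections import defaultdict
from statistics import mean

# Rankings from three labelers for the same prompt-completion set (1 = best).
labeler_rankings = [
    {"completion_a": 1, "completion_b": 3, "completion_c": 2},
    {"completion_a": 1, "completion_b": 2, "completion_c": 3},
    {"completion_a": 2, "completion_b": 3, "completion_c": 1},
]

ranks_by_completion = defaultdict(list)
for ranking in labeler_rankings:
    for completion, rank in ranking.items():
        ranks_by_completion[completion].append(rank)

# Lower average rank = preferred by the group; an outlier labeler has less influence.
consensus = {c: mean(r) for c, r in ranks_by_completion.items()}
print(sorted(consensus, key=consensus.get))  # ['completion_a', 'completion_c', 'completion_b']
```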
For instance, when labelers rank three completions for a prompt, each ranking yields three prompt-completion pairs for training the reward model.
After human labelers assess the prompt-completion sets, the collected data is used to train a reward model, which replaces the humans in classifying model completions during reinforcement learning fine-tuning. Convert the ranking data into pairwise comparisons, assigning a label of 1 to the preferred response and 0 to the less preferred one, and reorder each pair so the preferred completion comes first; this prepares the data for the reward model. Because ranked feedback expands into multiple pairs per prompt, it yields more prompt-completion data for effective reward model training.
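The conversion can be sketched as follows; the helper name and data layout are illustrative, not taken from the text. Each ranked set of N completions expands into N(N-1)/2 ordered pairs with the preferred completion (y_j) first:

```python
from itertools import combinations

def rankings_to_pairs(prompt, completions, ranks):
    """Expand one labeler's ranking (1 = best) into ordered pairwise comparisons."""
    pairs = []
    for i, j in combinations(range(len(completions)), 2):
        # Reorder so the preferred completion (label 1) comes first
        # and the less preferred one (label 0) comes second.
        if ranks[i] < ranks[j]:
            pairs.append({"prompt": prompt, "y_j": completions[i], "y_k": completions[j]})
        else:
            pairs.append({"prompt": prompt, "y_j": completions[j], "y_k": completions[i]})
    return pairs

# Three ranked completions yield three pairwise comparisons.
pairs = rankings_to_pairs(
    "My house is too hot.",
    ["Open a window or turn on a fan.", "Buy a new house.", "That's unfortunate."],
    ranks=[1, 3, 2],
)
print(len(pairs))  # 3
```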
Once the reward model is trained, you can remove the need for further human involvement.
The reward model, often another language model, automatically selects the preferred completion during reinforcement learning fine-tuning.
Trained with supervised learning on the pairwise comparison data, the reward model learns to assign a higher reward to the human-preferred completion than to the less preferred one; the training objective maximizes the difference between these two rewards.
By convention, the human-preferred option is always listed first and labeled y_j, with the less preferred completion labeled y_k.
During RLHF, the logit the reward model produces for the positive class is then used as the reward value.
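A minimal sketch of that training objective, assuming PyTorch and that the reward model has already produced scalar rewards r_j (preferred) and r_k (less preferred) for a batch of comparison pairs; minimizing -log σ(r_j - r_k) maximizes the reward difference:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_j: torch.Tensor, r_k: torch.Tensor) -> torch.Tensor:
    # Minimizing -log(sigmoid(r_j - r_k)) pushes the reward of the human-preferred
    # completion (y_j) above that of the less preferred one (y_k).
    return -F.logsigmoid(r_j - r_k).mean()

# Toy rewards for a batch of three comparison pairs (values are made up).
r_preferred = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.5, -1.0])
print(pairwise_reward_loss(r_preferred, r_rejected))
```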
With this reward model, you now possess a strong tool for aligning your LLM with desired behavior.
Once the reward model has been trained on the human-ranked prompt-completion pairs, you can use it as a binary classifier that produces scores for the positive and negative classes.
These scores are called logits and serve as the model's raw outputs before further processing.
For example, if you want your LLM to avoid generating hate speech, you'll classify completions into non-toxic (positive class) and toxic (negative class) categories.
When you apply a softmax function to the logits, you get probabilities: a non-toxic completion receives a high reward, while a toxic one receives a low reward.
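As a small numeric sketch (the logit values are invented for illustration), the two class logits can be converted to probabilities with a softmax, while the positive-class logit itself serves as the reward passed to the RLHF update:

```python
import torch

# Hypothetical reward-model logits for one completion:
# index 0 = "not hate" (positive class), index 1 = "hate" (negative class).
logits = torch.tensor([3.2, -1.1])

probs = torch.softmax(logits, dim=-1)  # roughly [0.99, 0.01] for these values
reward = logits[0]                     # the positive-class logit is used as the RLHF reward
print(probs, reward)
```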