Fine-tuning improves a model's understanding of prompts and helps it generate more natural responses, but because models learn from diverse Internet data, they can still produce toxic language or incorrect answers.
For instance, a model might respond in a funny but unhelpful way, provide misleading information, or produce harmful completions such as offensive language or criminal suggestions.
These challenges highlight the importance of the principles of helpfulness, honesty, and harmlessness (HHH), values that guide AI developers toward responsible use.
To address these concerns, models are fine-tuned with human feedback, which aligns them more closely with human preferences, improving helpfulness, honesty, and harmlessness while reducing toxicity and misinformation.
In 2020, researchers at OpenAI published a paper on using fine-tuning with human feedback to teach a model to write concise text summaries.
The fine-tuned model's summaries were preferred over those of the pretrained model, an instruct fine-tuned version, and even the human-written reference summaries.
RLHF, or reinforcement learning from human feedback, uses reinforcement learning to fine-tune LLMs with human input.
This improves model alignment with human preferences, ensuring useful and relevant outputs.
RLHF also reduces harm potential by encouraging responsible language and avoiding toxic content.
RLHF has a promising use case in personalizing LLMs. Through ongoing feedback, models can learn user preferences, enabling personalized learning plans and AI assistants tailored to individuals.
Reinforcement learning is a machine learning approach where an agent learns to achieve a goal by taking actions in an environment.
The aim is to maximize cumulative rewards.
The agent learns by making decisions, observing outcomes, and receiving rewards or penalties.
Over time, the agent refines its strategy to make better choices and improve its performance.
Consider the example of training a model to play Tic-Tac-Toe. Here, the agent is the model, and its aim is to win the game. The environment is the game board, and the current board configuration is the state. The agent's choices are its actions, which are positions on the board, and the strategy it follows in choosing them is called the RL policy. As the agent makes decisions, it earns rewards based on how effective its actions are in working toward a win. The goal of reinforcement learning is for the agent to learn the best strategy for the environment, the one that maximizes its cumulative reward. This involves trial and error. Initially, the agent takes random actions that lead to new states, and it explores subsequent states through further actions; this series of actions and states is called a rollout. As the agent gains experience, it discovers the actions that bring the highest rewards, leading to success in the game.
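The loop described above can be sketched in a few lines of code. The environment interface (`reset`, `legal_actions`, `step`) and the `TicTacToeEnv` name are hypothetical stand-ins used only to illustrate the agent-environment interaction, not a specific library.

```python
import random

def run_rollout(env, policy):
    """Play one episode (a rollout): a sequence of states, actions, and rewards."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state, env.legal_actions(state))  # the agent acts
        state, reward, done = env.step(action)            # the environment responds
        total_reward += reward                            # rewards accumulate over the episode
    return total_reward

def random_policy(state, legal_actions):
    # Early on, the agent explores by acting randomly; with experience,
    # it shifts toward the actions that have earned the highest rewards.
    return random.choice(legal_actions)

# Usage (assuming a hypothetical TicTacToeEnv implementing the interface above):
# total = run_rollout(TicTacToeEnv(), random_policy)
```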
Expanding on the Tic-Tac-Toe analogy for fine-tuning large language models with RLHF: the LLM acts as the agent's policy, striving to generate text that aligns with human preferences such as helpfulness and accuracy. The context window, into which text is entered via prompts, serves as the environment. The current context forms the state before the LLM takes an action by generating text. That action, whether a word, a sentence, or a longer passage, is chosen from the token vocabulary, and the choice of the next token depends on the model's representation of language, the current context, and its probability distribution over the vocabulary. Rewards are assigned based on how well the generated text aligns with human preferences. In contrast to Tic-Tac-Toe, determining the reward is complex because human responses vary; human assessment or a reward model can guide reward determination, and the resulting rewards are used to update the LLM's weights iteratively to enhance alignment with human preferences.
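Put together, a single RLHF update can be pictured as the loop below. This is only a conceptual sketch: `policy_llm`, `reward_model`, and `rl_update` are hypothetical placeholders for the fine-tuned model, the trained reward model, and a reinforcement learning update (for example, a PPO step), not a real API.

```python
def rlhf_step(policy_llm, reward_model, prompt):
    # State: the prompt/context currently in the context window.
    completion = policy_llm.generate(prompt)            # Action: tokens chosen from the vocabulary.
    reward = reward_model.score(prompt, completion)     # Reward: alignment with human preferences.
    rl_update(policy_llm, prompt, completion, reward)   # Update the LLM weights iteratively.
    return completion, reward
```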
To start fine-tuning an LLM with RLHF, select a model suited to your task and prepare a dataset of prompts.
Use the LLM and the prompt dataset to generate several completions for each prompt.
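A minimal sketch of this generation step, assuming the Hugging Face `transformers` library and an instruct model such as `google/flan-t5-base` (any model suited to your task could be substituted):

```python
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

prompt_dataset = ["My house is too hot.", "Explain why the sky is blue."]
completions_per_prompt = 3

dataset_for_labeling = []
for prompt in prompt_dataset:
    # Sample several completions per prompt so human labelers can rank them.
    completions = [
        generator(prompt, do_sample=True, max_new_tokens=64)[0]["generated_text"]
        for _ in range(completions_per_prompt)
    ]
    dataset_for_labeling.append({"prompt": prompt, "completions": completions})
```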
Human labelers then assess these completions, constituting the human feedback phase of RLHF.
Clear instructions are essential for obtaining quality human feedback, with labelers representing diverse perspectives.
Start by selecting a specific assessment criterion, like helpfulness or toxicity, and have human labelers rank completions based on that criterion. For example, if the prompt is "my house is too hot," labelers rank completions by helpfulness. This is repeated for different prompt-completion sets to create a dataset for training the reward model, which will eventually replace human assessors. Multiple labelers assess the same sets to establish consensus and mitigate misunderstandings.
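One simple way to establish that consensus, sketched below purely for illustration (the aggregation method is an assumption, not prescribed here), is to average each completion's rank position across labelers:

```python
from collections import defaultdict
from statistics import mean

# Rankings from three labelers for the same prompt-completion set (1 = best).
labeler_rankings = [
    {"completion_a": 1, "completion_b": 3, "completion_c": 2},
    {"completion_a": 1, "completion_b": 2, "completion_c": 3},
    {"completion_a": 2, "completion_b": 3, "completion_c": 1},
]

ranks_by_completion = defaultdict(list)
for ranking in labeler_rankings:
    for completion, rank in ranking.items():
        ranks_by_completion[completion].append(rank)

# Lower average rank = preferred by the group; an outlier labeler has less influence.
consensus = {c: mean(r) for c, r in ranks_by_completion.items()}
print(sorted(consensus, key=consensus.get))  # ['completion_a', 'completion_c', 'completion_b']
```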
For instance, when labelers rank three completions for a prompt, each ranking yields three prompt-completion pairs for training the reward model.
After human labelers assess the prompt-completion sets, the collected data is used to train a reward model, which replaces the humans in classifying model completions during reinforcement learning fine-tuning. Convert the ranking data into pairwise comparisons, assigning a label of 1 to the preferred response and 0 to the less preferred one, and reorder each pair so the preferred completion comes first; this prepares the data for the reward model. Because ranked feedback expands into multiple pairs per prompt, it yields more prompt-completion data for effective reward model training.
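The conversion can be sketched as follows; the helper name and data layout are illustrative, not taken from the text. Each ranked set of N completions expands into N(N-1)/2 ordered pairs with the preferred completion (y_j) first:

```python
from itertools import combinations

def rankings_to_pairs(prompt, completions, ranks):
    """Expand one labeler's ranking (1 = best) into ordered pairwise comparisons."""
    pairs = []
    for i, j in combinations(range(len(completions)), 2):
        # Reorder so the preferred completion (label 1) comes first
        # and the less preferred one (label 0) comes second.
        if ranks[i] < ranks[j]:
            pairs.append({"prompt": prompt, "y_j": completions[i], "y_k": completions[j]})
        else:
            pairs.append({"prompt": prompt, "y_j": completions[j], "y_k": completions[i]})
    return pairs

# Three ranked completions yield three pairwise comparisons.
pairs = rankings_to_pairs(
    "My house is too hot.",
    ["Open a window or turn on a fan.", "Buy a new house.", "That's unfortunate."],
    ranks=[1, 3, 2],
)
print(len(pairs))  # 3
```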
Once the reward model is trained, you can remove the need for further human involvement.
The reward model, often another language model, automatically selects the preferred completion during reinforcement learning fine-tuning.
Trained with supervised learning on the pairwise comparison data, the reward model learns to assign a higher reward to the human-preferred completion than to the less preferred one; the training objective maximizes the difference between these two rewards.
By convention, the human-preferred option is always listed first and labeled y_j, with the less preferred completion labeled y_k.
During RLHF, the logit the reward model produces for the positive class is then used as the reward value.
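A minimal sketch of that training objective, assuming PyTorch and that the reward model has already produced scalar rewards r_j (preferred) and r_k (less preferred) for a batch of comparison pairs; minimizing -log σ(r_j - r_k) maximizes the reward difference:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_j: torch.Tensor, r_k: torch.Tensor) -> torch.Tensor:
    # Minimizing -log(sigmoid(r_j - r_k)) pushes the reward of the human-preferred
    # completion (y_j) above that of the less preferred one (y_k).
    return -F.logsigmoid(r_j - r_k).mean()

# Toy rewards for a batch of three comparison pairs (values are made up).
r_preferred = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.5, -1.0])
print(pairwise_reward_loss(r_preferred, r_rejected))
```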
With this reward model, you now possess a strong tool for aligning your LLM with desired behavior.
Once the reward model has been trained on the human-ranked prompt-completion pairs, you can use it as a binary classifier that produces scores for the positive and negative classes.
These scores are called logits and serve as the model's raw outputs before further processing.
For example, if you want your LLM to avoid generating hate speech, you'll classify completions into non-toxic (positive class) and toxic (negative class) categories.
When you apply a softmax function to the logits, you get probabilities: a non-toxic completion receives a high reward, while a toxic one receives a low reward.
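As a small numeric sketch (the logit values are invented for illustration), the two class logits can be converted to probabilities with a softmax, while the positive-class logit itself serves as the reward passed to the RLHF update:

```python
import torch

# Hypothetical reward-model logits for one completion:
# index 0 = "not hate" (positive class), index 1 = "hate" (negative class).
logits = torch.tensor([3.2, -1.1])

probs = torch.softmax(logits, dim=-1)  # roughly [0.99, 0.01] for these values
reward = logits[0]                     # the positive-class logit is used as the RLHF reward
print(probs, reward)
```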