This paper introduces ReAct, an approach that integrates verbal reasoning and interactive decision making in large language models (LLMs). Although LLMs excel at both language understanding and decision making, their abilities to reason and to act have mostly been studied as separate topics. ReAct prompts LLMs to generate reasoning traces interleaved with task-specific actions, letting the two inform each other: reasoning helps the model plan and handle exceptions, while actions let it gather information from external sources. The approach outperforms baselines across a variety of tasks and mitigates issues such as hallucination and error propagation. In interactive decision making, ReAct beats imitation and reinforcement learning methods while being prompted with only one or two in-context examples. Beyond raw performance, it also improves interpretability, trustworthiness, and diagnosability, because humans can inspect the reasoning trace and distinguish the model's internal knowledge from information retrieved from the environment.

In summary, ReAct bridges the gap between reasoning and acting in LLMs. By interleaving reasoning traces with actions, it outperforms prompting baselines across language reasoning and decision making tasks, while also making the model's behavior more interpretable and trustworthy: users can follow the trace to understand how each decision was reached.
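To make the interleaving concrete, here is a minimal sketch of a ReAct-style loop. This is an illustration of the pattern, not the paper's implementation: `call_llm` and `run_tool` are hypothetical stand-ins for a real LLM API and a task environment (such as a Wikipedia search tool), and the `Finish[...]` convention follows the trajectory format shown in the figure below.

```python
# Minimal sketch of a ReAct-style loop (illustrative only).
# `call_llm` and `run_tool` are hypothetical stand-ins for a real
# LLM API and a task environment such as a Wikipedia search tool.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call: returns the next 'Thought: ... / Action: ...' step."""
    raise NotImplementedError

def run_tool(action: str) -> str:
    """Hypothetical environment step: executes an action, returns an observation."""
    raise NotImplementedError

def react(question: str, max_steps: int = 8) -> str:
    # The transcript interleaves free-form reasoning (Thought),
    # task-specific actions (Action), and environment feedback (Observation).
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)          # model emits a Thought and an Action
        transcript += step + "\n"
        if "Action: Finish[" in step:        # model decides it has the answer
            return step.split("Finish[", 1)[1].rstrip("]\n")
        observation = run_tool(step)         # e.g., a Search[...] or Lookup[...] action
        transcript += f"Observation: {observation}\n"
    return "No answer found within the step budget."
```

Because the model's thoughts appear verbatim in the transcript, a reviewer can see which claims came from the model's internal knowledge and which came from observations returned by the environment, which is the source of the interpretability benefit described above.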

Image: The figure compares prompting methods in two domains. Part (1a) contrasts four methods on a HotpotQA question: Standard, Chain-of-Thought (CoT, Reason Only), Act-Only, and ReAct (Reason + Act). Part (1b) contrasts Act-Only and ReAct on an ALFWorld game. In both domains, the in-context examples are omitted from the prompt; only the task-solving trajectories are shown, consisting of the actions and thoughts generated by the model and the observations (Obs) returned by the environment. The comparison makes clear how ReAct's interleaved reasoning and acting differs from the other prompting paradigms across diverse task-solving scenarios.

Below you'll find links to the research papers discussed in this week's videos. You don't need to understand all of the technical details discussed in these papers; the lecture videos have already covered the most important points you'll need to answer the quizzes.

However, if you'd like to take a closer look at the original research, you can read the papers and articles via the links below.

Reinforcement Learning from Human Feedback (RLHF)

Proximal Policy Optimization (PPO)

Scaling human feedback

Advanced Prompting Techniques