RLHF

In machine learning, reinforcement learning from human feedback (RLHF) is a technique to align an intelligent agent with human preferences. It involves training a reward model to represent preferences, which can then be used to train other models through reinforcement learning.

In classical reinforcement learning, an intelligent agent's goal is to learn a function that guides its behavior, called a policy. This function is iteratively updated to maximize rewards based on the agent's task performance. However, explicitly defining a reward function that accurately approximates human preferences is challenging. Therefore, RLHF seeks to train a "reward model" directly from human feedback. The reward model is first trained in a supervised manner to predict if a response to a given prompt is good (high reward) or bad (low reward) based on ranking data collected from human annotators. This model then serves as a reward function to improve an agent's policy through an optimization algorithm like proximal policy optimization.

RLHF has applications in various domains in machine learning, including natural language processing tasks such as text summarization and conversational agents, computer vision tasks like text-to-image models, and the development of video game bots. While RLHF is an effective method of training models to act in accordance with human preferences, it also faces challenges due to the way the human preference data is collected. Though RLHF does not require massive amounts of data to improve performance, sourcing high-quality preference data is still an expensive process. Furthermore, if the data is not carefully collected from a representative sample, the resulting model may exhibit unwanted biases.


Background and motivation

Optimizing a model based on human feedback is desirable when a task is difficult to specify yet easy to judge. For example, one may want to train a model to generate safe text that is both helpful and harmless (such as lacking bias, toxicity, or otherwise harmful content). Asking humans to manually create examples of harmless and harmful text would be difficult and time-consuming. However, humans are adept at swiftly assessing and comparing the harmfulness of different AI-generated text. Therefore, a more practical objective is to let the model use this type of human feedback to improve its text generation.

Despite the clear benefits of incorporating human feedback in training models, prior efforts, including some that leverage reinforcement learning, have encountered significant challenges. Most attempts were either narrow and difficult to generalize, breaking down on more complex tasks, or they faced difficulties learning from sparse (lacking specific information and relating to large amounts of text at a time) or noisy (inconsistently rewarding similar outputs) reward functions.

RLHF was not the first successful method of using human feedback for reinforcement learning, but it is one of the most widely used. The foundation for RLHF was introduced as an attempt to create a general algorithm for learning from a practical amount of human feedback. The algorithm as used today was introduced by OpenAI in a paper on enhancing text continuation or summarization based on human feedback, and it began to gain popularity when the same method was reused in their paper on InstructGPT. RLHF has also been shown to improve the robustness of RL agents and their capacity for exploration, which results in an optimization process more adept at handling uncertainty and efficiently exploring its environment in search of the highest reward.


Collecting human feedback

Human feedback is commonly collected by prompting humans to rank instances of the agent's behavior. These rankings can then be used to score outputs, for example, with the Elo rating system, an algorithm for calculating the relative skill levels of players in a game based only on the outcome of each game. While ranking outputs is the most widely adopted form of feedback, recent research has explored other forms, such as numerical feedback, natural language feedback, and prompting for direct edits to the model's output.

One initial motivation of RLHF was that it requires relatively small amounts of comparison data to be effective. It has been shown that a small amount of data can lead to results comparable to those obtained from a larger amount. In addition, increasing the amount of data tends to be less effective than proportionally increasing the size of the reward model. Nevertheless, a larger and more diverse amount of data can be crucial for tasks where it is important to avoid bias from a partially representative group of annotators.

When learning from human feedback through pairwise comparison under the Bradley–Terry–Luce model (or the Plackett–Luce model for K-wise comparisons over more than two options), the maximum likelihood estimator (MLE) for linear reward functions has been shown to converge if the comparison data is generated under a well-specified linear model. This implies that, under certain conditions, if a model is trained to decide which choices people would prefer between pairs (or groups) of choices, it will necessarily improve at predicting future preferences. This improvement is expected as long as the comparisons it learns from are based on a consistent and simple rule.

Both offline data collection models, where the model learns by interacting with a static dataset and updating its policy in batches, and online data collection models, where the model directly interacts with the dynamic environment and updates its policy immediately, have been mathematically studied, proving sample complexity bounds for RLHF under different feedback models. In the offline data collection model, when the objective is policy training, a pessimistic MLE that incorporates a lower confidence bound as the reward estimate is most effective. Moreover, when applicable, it has been shown that considering K-wise comparisons directly is asymptotically more efficient than converting them into pairwise comparisons for prediction purposes. In the online scenario, when human feedback is collected through pairwise comparisons under the Bradley–Terry–Luce model and the objective is to minimize the algorithm's regret (the difference in performance compared to an optimal agent), it has been shown that an optimistic MLE that incorporates an upper confidence bound as the reward estimate can be used to design sample-efficient algorithms (meaning that they require relatively little training data).

A key challenge in RLHF when learning from pairwise (or dueling) comparisons is the non-Markovian nature of its optimal policies. Unlike simpler scenarios where the optimal strategy does not require memory of past actions, in RLHF the best course of action often depends on previous events and decisions, making the strategy inherently memory-dependent.
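
The maximum likelihood estimation described above can be made concrete with a small sketch. The following Python example, which is an illustration rather than code from the cited literature, fits a linear Bradley–Terry reward model to synthetic pairwise comparisons by gradient ascent on the log-likelihood; the feature dimension, learning rate, and synthetic data are assumptions made for illustration.

    import numpy as np

    # Minimal sketch: fit a linear Bradley-Terry reward r(x, y) = w . phi(x, y)
    # from pairwise comparisons by maximizing the log-likelihood
    #   sum_i log sigmoid(r(winner_i) - r(loser_i)).

    rng = np.random.default_rng(0)
    d = 8                                  # feature dimension (assumed)
    w_true = rng.normal(size=d)            # hidden "true" preference weights

    # Synthetic comparison data: each item is (features of winner, features of loser).
    def sample_comparison():
        a, b = rng.normal(size=d), rng.normal(size=d)
        p_a_wins = 1.0 / (1.0 + np.exp(-(a - b) @ w_true))   # Bradley-Terry probability
        return (a, b) if rng.random() < p_a_wins else (b, a)

    data = [sample_comparison() for _ in range(2000)]

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Gradient ascent on the log-likelihood (the MLE for the linear reward model).
    w = np.zeros(d)
    lr = 0.05
    for epoch in range(200):
        grad = np.zeros(d)
        for winner, loser in data:
            diff = winner - loser
            grad += (1.0 - sigmoid(diff @ w)) * diff   # d/dw log sigmoid(w . diff)
        w += lr * grad / len(data)

    # The recovered direction should correlate with the hidden preference weights.
    cos = w @ w_true / (np.linalg.norm(w) * np.linalg.norm(w_true))
    print(f"cosine similarity to true weights: {cos:.3f}")

Because Bradley–Terry rewards are only identified up to an additive constant, it is the direction of the recovered weights, not their scale, that matters in this sketch.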


Applications

RLHF has been applied to various domains of natural language processing (NLP), such as conversational agents, text summarization, and natural language understanding. Ordinary reinforcement learning, in which agents learn from their actions based on a predefined "reward function", is difficult to apply to NLP tasks because the rewards tend to be difficult to define or measure, especially when dealing with complex tasks that involve human values or preferences. RLHF can steer NLP models, in particular language models, to provide answers that align with human preferences on such tasks by capturing their preferences beforehand in the reward model. This results in a model capable of generating more relevant responses and rejecting inappropriate or irrelevant queries. Some notable examples of RLHF-trained language models are OpenAI's ChatGPT (and its predecessor InstructGPT), DeepMind's Sparrow, Google's Gemini, and Anthropic's Claude.

In computer vision, RLHF has also been used to align text-to-image models. Studies that successfully used RLHF for this goal have noted that the use of KL regularization in RLHF, which aims to prevent the learned policy from straying too far from the unaligned model, helped to stabilize the training process by reducing overfitting to the reward model. The final image outputs from models trained with KL regularization were noted to be of significantly higher quality than those trained without. Other methods tried to incorporate the feedback through more direct training, based on maximizing the reward without the use of reinforcement learning, but conceded that an RLHF-based approach would likely perform better due to the online sample generation used in RLHF during updates, as well as the aforementioned KL regularization over the prior model, which mitigates overfitting to the reward function.

RLHF was initially applied to other areas, such as the development of video game bots and tasks in simulated robotics. For example, OpenAI and DeepMind trained agents to play Atari games based on human preferences. In classical RL-based training of such bots, the reward function is simply correlated to how well the agent is performing in the game, usually using metrics like the in-game score. In comparison, in RLHF, a human is periodically presented with two clips of the agent's behavior in the game and must decide which one ''looks'' better. This approach can teach agents to perform at a competitive level without ever having access to their score. In fact, it was shown that RLHF can sometimes lead to superior performance over RL with score metrics because the human's preferences can contain more useful information than performance-based metrics. The agents achieved strong performance in many of the environments tested, often surpassing human performance.


Training

In RLHF, two different models are trained: a reward model and a reinforcement learning (RL) policy. The reward model learns to determine what behavior is desirable based on human feedback, while the policy is guided by the reward model to determine the agent's actions. Both models are commonly initialized using a pre-trained autoregressive language model. This model is then customarily trained in a supervised manner on a relatively small dataset of pairs of prompts to an assistant and their accompanying responses, written by human annotators.


Reward model

The reward model is usually initialized with a pre-trained model, as this initializes it with an understanding of language and focuses training explicitly on learning human preferences. In addition to being used to initialize the reward model and the RL policy, the model is then also used to sample data to be compared by annotators.

The reward model is then trained by replacing the final layer of the previous model with a randomly initialized regression head. This change shifts the model from its original classification task over its vocabulary to simply outputting a number corresponding to the score of any given prompt and response. This model is trained on the human preference comparison data collected earlier from the supervised model. In particular, it is trained to minimize the following cross-entropy loss function:

\mathcal{L}(\theta) = -\frac{1}{\binom{K}{2}} E_{(x, y_w, y_l)\sim D}\left[\log\big(\sigma(r_\theta(x, y_w) - r_\theta(x, y_l))\big)\right] = -\frac{1}{\binom{K}{2}} E_{(x, y_w, y_l)\sim D}\left[\log\frac{e^{r_\theta(x, y_w)}}{e^{r_\theta(x, y_w)} + e^{r_\theta(x, y_l)}}\right]

where K is the number of responses the labelers ranked, r_\theta(x, y) is the output of the reward model for prompt x and completion y, y_w is the preferred completion over y_l, \sigma denotes the sigmoid function, and E[\cdot] denotes the expected value. This can be thought of as a form of logistic regression, where the model predicts the probability that a response y_w is preferred over y_l.

This loss function essentially measures the difference between the reward model's predictions and the decisions made by humans. The goal is to make the model's guesses as close as possible to the humans' preferences by minimizing the difference measured by this equation. In the case of only pairwise comparisons, K = 2, so the factor 1/\binom{K}{2} = 1. In general, all \binom{K}{2} comparisons from each prompt are used for training as a single batch. After training, the outputs of the model are normalized such that the reference completions have a mean score of 0, that is, E_{(x, y)\sim D}\left[r_\theta(x, y)\right] = 0, by calculating the mean reward across the training dataset and setting it as the bias in the reward head.


Policy

Similarly to the reward model, the human feedback policy is also initialized from a pre-trained model. The key is to understand language generation as a game to be learned by RL. In RL, a policy is a function that maps a game state to a game action. In RLHF, the "game" is the game of replying to prompts: a prompt is a game state, and a response is a game action. This is a fairly trivial kind of game, since every game lasts for exactly one step. Nevertheless, it is a game, and so RL algorithms can be applied to it.

The first step in its training is supervised fine-tuning (SFT). This step does not require the reward model. Instead, the pre-trained model is trained on a dataset D_{\text{SFT}} that contains prompt-response pairs (x, y). During SFT, the model is trained to auto-regressively generate the corresponding response y when given a random prompt x. The original paper recommends running SFT for only one epoch, since more than that causes overfitting. The dataset D_{\text{SFT}} is usually written by human contractors, who write both the prompts and responses.

The second step uses a policy gradient method against the reward model. It uses a dataset D_{\text{RL}}, which contains prompts, but not responses. Like most policy gradient methods, this algorithm has an outer loop and two inner loops:

* Initialize the policy \pi^{\text{RL}}_\phi to \pi^{\text{SFT}}, the policy output from SFT.
* Loop for many steps:
** Initialize a new empty dataset D_{\pi_\phi}.
** Loop for many steps:
*** Sample a random prompt x from D_{\text{RL}}.
*** Generate a response y from the policy \pi^{\text{RL}}_\phi.
*** Calculate the reward signal r_\theta(x, y) from the reward model r_\theta.
*** Add the triple (x, y, r_\theta(x, y)) to D_{\pi_\phi}.
** Update \phi by a policy gradient method to increase the objective function

\text{objective}(\phi) = E_{(x, y)\sim D_{\pi_\phi}}\left[r_\theta(x, y) - \beta \log\left(\frac{\pi^{\text{RL}}_\phi(y \mid x)}{\pi^{\text{SFT}}(y \mid x)}\right)\right]

Note that (x, y) \sim D_{\pi_\phi} is equivalent to x \sim D_{\text{RL}}, y \sim \pi_\phi^{\text{RL}}(\cdot \mid x), which means "sample a prompt from D_{\text{RL}}, then sample a response from the policy".

The objective function has two parts. The first part is simply the expected reward E[r_\theta(x, y)], and is standard for any RL algorithm. The second part is a "penalty term" involving the KL divergence. The strength of the penalty term is determined by the hyperparameter \beta. This KL term works by penalizing the KL divergence (a measure of statistical distance between distributions) between the model being fine-tuned and the initial supervised model. By choosing an appropriate \beta, the training can balance learning from new data while retaining useful information from the initial model, increasing generalization by avoiding fitting too closely to the new data. Aside from preventing the new model from producing outputs too dissimilar to those of the initial model, a second motivation of including the KL term is to encourage the model to output high-entropy text, so as to prevent the model from collapsing to a small number of canned responses.

In simpler terms, the objective function calculates how well the policy's responses are expected to align with human feedback. The policy generates responses to prompts, and each response is evaluated both on how well it matches human preferences (as measured by the reward model) and how similar it is to responses the model would naturally generate. The goal is to balance improving alignment with human preferences while ensuring the model's responses remain diverse and not too far removed from what it has learned during its initial training. This helps the model not only to provide answers that people find useful or agreeable but also to maintain a broad understanding and avoid overly narrow or repetitive responses.
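
To make the KL-penalized objective concrete, the following sketch computes the per-response training signal r_\theta(x, y) - \beta \log(\pi_\phi(y \mid x) / \pi^{\text{SFT}}(y \mid x)) from per-token log-probabilities. The tensor shapes, the value of \beta, and the function name are illustrative assumptions, not the exact implementation of any particular system.

    import torch

    def kl_penalized_reward(reward: torch.Tensor,
                            policy_logprobs: torch.Tensor,
                            sft_logprobs: torch.Tensor,
                            beta: float = 0.1) -> torch.Tensor:
        """Per-response RLHF training signal.

        reward:          r_theta(x, y) from the reward model, shape (B,)
        policy_logprobs: per-token log pi_phi(y_t | x, y_<t), shape (B, T)
        sft_logprobs:    per-token log pi_SFT(y_t | x, y_<t), shape (B, T)
        beta:            strength of the KL penalty (assumed value).
        """
        # log(pi_phi(y|x) / pi_SFT(y|x)) summed over the tokens of the response
        log_ratio = (policy_logprobs - sft_logprobs).sum(dim=-1)
        return reward - beta * log_ratio

    # Toy usage with random numbers standing in for model outputs.
    B, T = 4, 12
    signal = kl_penalized_reward(torch.randn(B),
                                 torch.randn(B, T),
                                 torch.randn(B, T))
    print(signal.shape)   # torch.Size([4]); one scalar objective term per sampled response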


Proximal policy optimization

The policy function is usually trained by the proximal policy optimization (PPO) algorithm. That is, the parameter \phi is trained by gradient ascent on the clipped surrogate function.

Classically, the PPO algorithm employs generalized advantage estimation, which means that there is an extra ''value estimator'' V_\xi(x) that updates concurrently with the policy \pi^{\text{RL}}_\phi during PPO training: \pi^{\text{RL}}_{\phi_1}, V_{\xi_1}, \pi^{\text{RL}}_{\phi_2}, V_{\xi_2}, \dots. The value estimator is used only during training, and not outside of training. PPO uses gradient ascent on the following ''clipped surrogate advantage'':

L_{\text{clip}}(\phi) := E_{(x, y)\sim D_{\pi_{\phi_{\text{old}}}}}\left[\min\left(\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi_{\phi_{\text{old}}}^{\text{RL}}(y \mid x)}\, A(x, y),\ \mathrm{clip}\left(\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi_{\phi_{\text{old}}}^{\text{RL}}(y \mid x)},\ 1-\epsilon,\ 1+\epsilon\right) A(x, y)\right)\right]

where the advantage term A(x, y) is defined as r_\theta(x, y) - V_\xi(x), that is, the difference between the reward (the expected return) and the value estimate (the expected return from the policy). This is used to train the policy by gradient ''ascent'' on it, usually using a standard momentum-gradient optimizer, like the Adam optimizer.

The original paper initialized the value estimator from the trained reward model. Since PPO is an actor-critic algorithm, the value estimator is updated concurrently with the policy, via minimizing the squared TD-error, which in this case equals the squared advantage term:

L_{\text{value}}(\xi) = E_{(x, y)\sim D_{\pi_\phi}}\left[\left(r_\theta(x, y) - \beta \log\left(\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)}\right) - V_\xi(x)\right)^2\right]

which is minimized by gradient ''descent'' on it. Other methods than squared TD-error might be used. See the actor-critic algorithm page for details.
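
The clipped surrogate and the value loss above can be written compactly in code. The sketch below is an illustration rather than a reference implementation; epsilon, beta, and the tensor layout are assumptions, and the KL-penalized reward is used as the return of the one-step "game".

    import torch

    def ppo_losses(logp_new: torch.Tensor,      # log pi_phi(y|x) under the current policy, (B,)
                   logp_old: torch.Tensor,      # log pi_phi_old(y|x) at sampling time, (B,)
                   logp_sft: torch.Tensor,      # log pi_SFT(y|x) under the frozen SFT model, (B,)
                   reward: torch.Tensor,        # r_theta(x, y) from the reward model, (B,)
                   value: torch.Tensor,         # V_xi(x) from the value estimator, (B,)
                   beta: float = 0.1,
                   eps: float = 0.2):
        # KL-penalized reward used as the return of the one-step episode
        penalized = reward - beta * (logp_new.detach() - logp_sft)
        advantage = (penalized - value).detach()          # A(x, y), no gradient to the policy

        ratio = torch.exp(logp_new - logp_old)            # pi_phi / pi_phi_old
        unclipped = ratio * advantage
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
        surrogate = torch.min(unclipped, clipped).mean()  # maximized by gradient ascent

        value_loss = ((penalized.detach() - value) ** 2).mean()  # minimized by gradient descent
        return -surrogate, value_loss   # negate the surrogate so both terms can be minimized

    # Toy usage with random stand-ins for model outputs.
    B = 8
    pol_loss, val_loss = ppo_losses(torch.randn(B, requires_grad=True),
                                    torch.randn(B), torch.randn(B),
                                    torch.randn(B), torch.randn(B, requires_grad=True))
    (pol_loss + 0.5 * val_loss).backward()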


Mixing pretraining gradients

A third term is commonly added to the objective function to prevent the model from catastrophic forgetting. For example, if the model is only trained on customer service tasks, then it might forget general knowledge in geography. To prevent this, the RLHF process incorporates the original language modeling objective. That is, some random texts x are sampled from the original pretraining dataset D_{\text{pretrain}}, and the model is trained to maximize the log-likelihood of the text \log(\pi^{\text{RL}}_\phi(x)). The final objective function is written as:

L(\phi) = E_{(x, y)\sim D_{\pi_\phi}}\left[r_\theta(x, y) - \beta \log\left(\frac{\pi^{\text{RL}}_\phi(y \mid x)}{\pi^{\text{SFT}}(y \mid x)}\right)\right] + \gamma\, E_{x\sim D_{\text{pretrain}}}\left[\log\big(\pi_\phi^{\text{RL}}(x)\big)\right]

where \gamma controls the strength of this pretraining term. This combined objective function is called PPO-ptx, where "ptx" means "Mixing Pretraining Gradients". It was first used in the InstructGPT paper.

In total, this objective function defines the method for adjusting the RL policy, blending the aim of aligning with human feedback and maintaining the model's original language understanding. So, written out fully explicitly, the PPO-ptx objective function is:

L_{\text{ppo-ptx}}(\phi) := E_{(x, y)\sim D_{\pi_{\phi_{\text{old}}}}}\left[\min\left(\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi_{\phi_{\text{old}}}^{\text{RL}}(y \mid x)}\, A(x, y),\ \mathrm{clip}\left(\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi_{\phi_{\text{old}}}^{\text{RL}}(y \mid x)},\ 1-\epsilon,\ 1+\epsilon\right) A(x, y)\right) - \beta \log\left(\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)}\right)\right] + \gamma\, E_{x\sim D_{\text{pretrain}}}\left[\log\big(\pi_\phi^{\text{RL}}(x)\big)\right]

which is optimized by gradient ''ascent'' on it.
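
A minimal sketch of mixing the pretraining term into the objective is shown below. It only illustrates how the two losses are combined; the coefficient gamma and the function name are assumed placeholders, and the PPO loss is taken as already computed elsewhere.

    import torch

    def ppo_ptx_loss(ppo_loss: torch.Tensor,
                     pretrain_logprobs: torch.Tensor,
                     gamma: float = 1.0) -> torch.Tensor:
        """Combine the RLHF (PPO) loss with the original language-modeling objective.

        ppo_loss:          scalar loss whose minimization performs the PPO update
        pretrain_logprobs: per-token log pi_phi(x_t | x_<t) on texts sampled from the
                           pretraining dataset, shape (B, T)
        gamma:             strength of the pretraining term (assumed value; treated
                           as a tunable hyperparameter).
        """
        # Maximizing the log-likelihood of pretraining text = minimizing its negative
        lm_loss = -pretrain_logprobs.sum(dim=-1).mean()
        return ppo_loss + gamma * lm_loss

    # Toy usage: random stand-ins for the PPO loss and pretraining log-probabilities.
    loss = ppo_ptx_loss(torch.tensor(1.3, requires_grad=True),
                        torch.randn(4, 32, requires_grad=True))
    loss.backward()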


Limitations

RLHF suffers from challenges with collecting human feedback, learning a reward model, and optimizing the policy. Compared to data collection for techniques like unsupervised or self-supervised learning, collecting data for RLHF is less scalable and more expensive. Its quality and consistency may vary depending on the task, the interface, and the preferences and biases of individual humans.

The effectiveness of RLHF depends on the quality of human feedback. For instance, the model may become biased, favoring certain groups over others, if the feedback lacks impartiality, is inconsistent, or is incorrect. There is a risk of overfitting, where the model memorizes specific feedback examples instead of learning to generalize. For instance, feedback predominantly from a specific demographic might lead the model to learn peculiarities or noise, along with the intended alignment. Excessive alignment to the specific feedback it received (that is, to the bias therein) can lead to the model performing sub-optimally in new contexts or when used by different groups. A single reward function cannot always represent the opinions of diverse groups of people. Even with a representative sample, conflicting views and preferences may result in the reward model favoring the majority's opinion, potentially disadvantaging underrepresented groups.

In some cases, as is possible in regular reinforcement learning, there may be a risk of the model learning to manipulate the feedback process or game the system to achieve higher rewards rather than genuinely improving its performance. In the case of RLHF, a model may learn to exploit the fact that it is rewarded for what is evaluated positively and not necessarily for what is actually good, which can lead to it learning to persuade and manipulate. For example, models might learn that apparent confidence, even if inaccurate, garners higher rewards. Such behavior, if unchecked, is not just incentivized but can cause significant deployment issues due to the model's potential to mislead. Studies have found that humans are not skilled at identifying mistakes in LLM outputs on complex tasks; therefore, models learning to generate confident-sounding yet incorrect text can lead to significant issues when deployed.


Alternatives


Reinforcement learning from AI feedback

Similarly to RLHF, ''reinforcement learning from AI feedback'' (RLAIF) relies on training a preference model, except that the feedback is automatically generated. This is notably used in Anthropic's constitutional AI, where the AI feedback is based on conformance to the principles of a constitution.


Direct alignment algorithms

Direct alignment algorithms (DAA) have been proposed as a new class of algorithms that seek to directly optimize large language models (LLMs) on human feedback data in a supervised manner instead of through the traditional policy-gradient methods. These algorithms aim to align models with human intent more transparently by removing the intermediate step of training a separate reward model. Instead of first predicting human preferences and then optimizing against those predictions, direct alignment methods train models end-to-end on human-labeled or curated outputs. This reduces potential misalignment risks introduced by proxy objectives or reward hacking. By directly optimizing for the behavior preferred by humans, these approaches often enable tighter alignment with human values, improved interpretability, and simpler training pipelines compared to RLHF.


Direct preference optimization

Direct preference optimization (DPO) is a technique to learn human preferences. Like RLHF, it has been applied to align pre-trained large language models using human-generated preference data. Unlike RLHF, however, which first trains a separate intermediate model to understand what good outcomes look like and then teaches the main model how to achieve those outcomes, DPO simplifies the process by directly adjusting the main model according to people's preferences. It uses a change of variables to define the "preference loss" directly as a function of the policy and uses this loss to fine-tune the model, helping it understand and prioritize human preferences without needing a separate step. Essentially, this approach directly shapes the model's decisions based on positive or negative human feedback.

Recall that the pipeline of RLHF is as follows:

* We begin by gathering a human preference dataset D.
* We then fit a reward model r^* to the data by maximum likelihood estimation using the Plackett–Luce model, in which the responses y_1 \succ y_2 \succ \cdots \succ y_N for each prompt x are indexed in order of preference:

r^* = \arg\max_{r} E_{(x, y_1, \ldots, y_N)\sim D}\left[\ln\prod_{k=1}^{N}\frac{\exp\big(r(x, y_k)\big)}{\sum_{j=k}^{N}\exp\big(r(x, y_j)\big)}\right]

* We finally train an optimal policy \pi^* that maximizes the objective function:

\pi^* = \arg\max_{\pi} E_{x\sim D,\, y\sim\pi(\cdot \mid x)}\left[r^*(x, y) - \beta \log\left(\frac{\pi(y \mid x)}{\pi^{\text{SFT}}(y \mid x)}\right)\right]

However, instead of performing the intermediate step of fitting the reward model, DPO directly optimizes for the final policy. First, solve directly for the optimal policy, which can be done by Lagrange multipliers, as usual in statistical mechanics:

\pi^*(y \mid x) = \frac{\pi^{\text{SFT}}(y \mid x)\exp\big(r^*(x, y)/\beta\big)}{Z(x)},

where Z(x) is the partition function. This is unfortunately not tractable, since it requires summing over ''all possible responses'':

Z(x) = \sum_y \pi^{\text{SFT}}(y \mid x)\exp\big(r^*(x, y)/\beta\big) = E_{y\sim\pi^{\text{SFT}}(\cdot \mid x)}\left[\exp\big(r^*(x, y)/\beta\big)\right]

Next, invert this relationship to express the reward implicitly in terms of the optimal policy:

r^*(x, y) = \beta \log\frac{\pi^*(y \mid x)}{\pi^{\text{SFT}}(y \mid x)} + \beta \log Z(x).

Finally, plugging this back into the maximum likelihood estimator, we obtain

\pi^* = \arg\max_{\pi} E_{(x, y_1, \ldots, y_N)\sim D}\left[\ln\prod_{k=1}^{N}\frac{\exp\left(\beta\log\frac{\pi(y_k \mid x)}{\pi^{\text{SFT}}(y_k \mid x)}\right)}{\sum_{j=k}^{N}\exp\left(\beta\log\frac{\pi(y_j \mid x)}{\pi^{\text{SFT}}(y_j \mid x)}\right)}\right]

where the intractable \log Z(x) terms cancel inside each ratio. Usually, DPO is used for modeling human preference in pairwise comparisons, so that N = 2. In that case, we have

\pi^* = \arg\max_{\pi} E_{(x, y_w, y_l)\sim D}\left[\log \sigma\left(\beta \log\frac{\pi(y_w \mid x)}{\pi^{\text{SFT}}(y_w \mid x)} - \beta \log\frac{\pi(y_l \mid x)}{\pi^{\text{SFT}}(y_l \mid x)}\right)\right]

DPO eliminates the need for a separate reward model or reinforcement learning loop, treating alignment as a supervised learning problem over preference data. This is simpler to implement and train than RLHF and has been shown to produce comparable and sometimes superior results. Nevertheless, RLHF has also been shown to beat DPO on some datasets, for example, on benchmarks that attempt to measure truthfulness. Therefore, the choice of method may vary depending on the features of the human preference data and the nature of the task.
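
For the pairwise case, the DPO objective above reduces to a simple supervised loss over log-probability ratios. The sketch below is an illustrative implementation of that loss from summed per-response log-probabilities under the trained policy and the frozen reference (SFT) model; the tensor layout and beta are assumptions.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_logp_w: torch.Tensor,   # log pi(y_w | x) under the trained policy, (B,)
                 policy_logp_l: torch.Tensor,   # log pi(y_l | x) under the trained policy, (B,)
                 ref_logp_w: torch.Tensor,      # log pi_SFT(y_w | x) under the frozen reference, (B,)
                 ref_logp_l: torch.Tensor,      # log pi_SFT(y_l | x) under the frozen reference, (B,)
                 beta: float = 0.1) -> torch.Tensor:
        # Implicit rewards: beta * log(pi / pi_SFT) for the chosen and rejected responses
        chosen_ratio = beta * (policy_logp_w - ref_logp_w)
        rejected_ratio = beta * (policy_logp_l - ref_logp_l)
        # -E[log sigmoid(difference)]: push the chosen response's ratio above the rejected one's
        return -F.logsigmoid(chosen_ratio - rejected_ratio).mean()

    # Toy usage with random stand-ins for summed log-probabilities.
    B = 8
    loss = dpo_loss(torch.randn(B, requires_grad=True), torch.randn(B, requires_grad=True),
                    torch.randn(B), torch.randn(B))
    loss.backward()
    print(float(loss))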


Identity preference optimization

Identity preference optimization (IPO) is a modification of the original DPO objective that introduces a regularization term to reduce the chance of overfitting. It remains robust to overtraining by assuming noise in the preference data.

IPO is derived from a more general framework that applies a mapping \Psi over the preference probabilities: choosing the non-linear mapping \Psi(q) = \log(q / (1-q)) recovers the Bradley–Terry assumption underlying DPO, whereas softer choices of \Psi smooth the preference labels. Here, \Psi denotes the preference objective, separate from the policy objective. This helps avoid the overfitting issue caused by assuming that pairwise preferences can be substituted for point-wise rewards, an assumption that weakens the KL regularization by heavily skewing the preference distribution. As with DPO, IPO is formulated as an offline learning objective learned over a human preference dataset D. In particular, the framework introduces a new objective by applying the mapping \Psi over the preference probability distribution; taking \Psi to be the identity mapping results in IPO. Hence, IPO also directly optimizes for the final policy from the preference dataset and bypasses the reward modeling stage through the following objective:

\max_{\pi_\theta}\ E\left[\Psi\big(p^*(y_w \succ y_l \mid x)\big)\right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)

where p^*(y_w \succ y_l \mid x) is the preference distribution of the chosen responses y_w over the rejected responses y_l. However, since p^* is not observed directly, it is estimated from Bernoulli samples drawn from the offline preference dataset:

p^*(y \succ y' \mid x) = E\left[I(y, y')\right]

where I(y, y') is 1 if y is preferred to y', which happens with probability p^*(y \succ y'), and 0 otherwise. To solve this objective, IPO minimizes the quadratic loss function

E_{(x, y_w, y_l)\sim D}\left[\left(h_\pi(x, y_w, y_l) - \frac{1}{2\beta}\right)^2\right]

where

h_\pi(x, y_w, y_l) = \log\left(\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}\right) - \log\left(\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right).

The simplification to this form follows from the antisymmetry of h_\pi in its last two arguments together with the symmetry of the Bernoulli samples, so that each datapoint (y_w, y_l) \sim D can equivalently be represented as (y, y', I(y, y')) = (y_w, y_l, 1) or (y_l, y_w, 0).

In summary, IPO can control the gap between the log-likelihood ratios of the policy model and the reference model by always regularizing the solution towards the reference model. It allows learning directly from preferences without a reward modeling stage and without relying on the Bradley–Terry modeling assumption that pairwise preferences can be substituted with pointwise rewards. Thus, it avoids overfitting to the preference dataset, especially when preferences are near deterministic and the KL term fails.
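
The quadratic IPO loss can be sketched in a few lines. The example below is illustrative and assumes the per-response log-probabilities have already been summed over tokens; the parameter beta and the function name are placeholders.

    import torch

    def ipo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
        # h_pi(x, y_w, y_l): gap between the policy/reference log-ratios of the two responses
        h = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
        # Regress the gap towards the constant target 1 / (2 * beta)
        return ((h - 1.0 / (2.0 * beta)) ** 2).mean()

    # Toy usage with random stand-ins for summed log-probabilities.
    B = 8
    loss = ipo_loss(torch.randn(B, requires_grad=True), torch.randn(B, requires_grad=True),
                    torch.randn(B), torch.randn(B))
    loss.backward()
    print(float(loss))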


Kahneman-Tversky optimization

Kahneman-Tversky optimization (KTO) is another direct alignment algorithm drawing from prospect theory to model uncertainty in human decisions that may not maximize the expected value.

In general, KTO seeks to optimize a class of new loss functions proposed as "human-aware losses" (HALO), formulated under prospect theory to model the "human value" of a query-response pair (x, y) as v\big(r_\theta(x, y) - E_{Q}[r_\theta(x, y')]\big). A function is defined as a human-aware loss if it can be written in the general HALO form:

f(\pi_\theta, \pi_{\mathrm{ref}}) = E_{(x, y)\sim D}\left[a_{x,y}\, v\Big(r_\theta(x, y) - \underbrace{E_{y'\sim Q}\big[r_\theta(x, y')\big]}_{\text{reference point}}\Big)\right] + C_D

where D is the preference data, C_D is some constant relevant to the dataset, and Q is some distribution representing the baseline or "reference". Each training example is attached a label a_{x,y} \in \{+1, -1\} that indicates whether the example is desirable (+1, so its reward should be pushed up) or undesirable (-1, so its reward should be pushed down). Unlike previous definitions of the reward, KTO defines r_\theta(x, y) as the "implied reward" given by the log-likelihood ratio between the policy model and the reference model, \log\left(\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right). Here, the value function v is a non-linear (typically concave) function that mimics human loss aversion and risk aversion.

As opposed to previous preference optimization algorithms, the motivation of KTO lies in maximizing the utility of model outputs from a human perspective rather than maximizing the likelihood of a "better" label (chosen vs. rejected responses). Hence, it constructs a more relaxed generalization of preference distributions by requiring only a binary feedback signal a_{x,y} instead of explicit preference pairs. For each example (x, y) in the dataset D, KTO explicitly optimizes the HALO objective by minimizing the loss

E_{(x, y)\sim D}\Big[\lambda_y - v(x, y)\Big]

where \lambda_y is a class-specific constant (\lambda_D for desirable and \lambda_U for undesirable examples) controlling how strongly the model should push up good outputs versus push down bad ones. The value function v(x, y) is defined piecewise depending on whether y is desirable or undesirable:

v(x, y) = \begin{cases} \lambda_D\, \sigma\big(\beta\,(r_\theta(x, y) - z_0)\big) & \text{if } y \text{ is desirable} \\ \lambda_U\, \sigma\big(\beta\,(z_0 - r_\theta(x, y))\big) & \text{if } y \text{ is undesirable} \end{cases}

and

z_0 = D_{\mathrm{KL}}\big(\pi_\theta(y' \mid x) \,\big\Vert\, \pi_{\mathrm{ref}}(y' \mid x)\big)

is a baseline given by the Kullback–Leibler divergence. Here, \beta controls how "risk-averse" the value function is (larger \beta means faster saturation in the logistic function \sigma). Intuitively, desirable outputs push the model to increase r_\theta so that r_\theta - z_0 becomes more positive, while undesirable ones push it in the opposite direction, so that the reward falls below the reference. Since many real-world feedback pipelines yield "like/dislike" data more easily than pairwise comparisons, KTO is designed to be data-cheap and to reflect loss aversion more directly by using a straightforward notion of "good vs. bad" at the example level.
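
A simplified sketch of the KTO value function and loss is given below. It treats the KL baseline z_0 as a precomputed batch estimate and uses placeholder values for beta, lambda_D, and lambda_U; it is an illustration of the piecewise value function rather than the reference implementation.

    import torch

    def kto_loss(policy_logp: torch.Tensor,    # log pi_theta(y | x) per example, (B,)
                 ref_logp: torch.Tensor,       # log pi_ref(y | x) per example, (B,)
                 desirable: torch.Tensor,      # boolean mask: True if y is a "good" example, (B,)
                 z0: torch.Tensor,             # KL(pi_theta || pi_ref) baseline estimate, scalar
                 beta: float = 0.1,
                 lambda_d: float = 1.0,
                 lambda_u: float = 1.0) -> torch.Tensor:
        r = policy_logp - ref_logp                       # implied reward log(pi_theta / pi_ref)
        v_desirable = lambda_d * torch.sigmoid(beta * (r - z0))
        v_undesirable = lambda_u * torch.sigmoid(beta * (z0 - r))
        value = torch.where(desirable, v_desirable, v_undesirable)
        lam = torch.where(desirable, torch.tensor(lambda_d), torch.tensor(lambda_u))
        # Minimizing (lambda_y - v) pushes r above z0 for desirable examples
        # and below z0 for undesirable ones.
        return (lam - value).mean()

    # Toy usage with random stand-ins.
    B = 8
    loss = kto_loss(torch.randn(B, requires_grad=True), torch.randn(B),
                    torch.rand(B) > 0.5, torch.tensor(0.2))
    loss.backward()
    print(float(loss))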


See also

* Human-in-the-loop
* Reward-based selection

