The exploration–exploitation dilemma, also known as the explore–exploit tradeoff, is a fundamental concept in

decision-making In psychology, decision-making (also spelled decision making and decisionmaking) is regarded as the Cognition, cognitive process resulting in the selection of a belief or a course of action among several possible alternative options. It could be ...

that arises in many domains. It is depicted as the balancing act between two opposing strategies. Exploitation involves choosing the best option based on current knowledge of the system (which may be incomplete or misleading), while exploration involves trying out new options that may lead to better outcomes in the future at the expense of an exploitation opportunity. Finding the optimal balance between these two strategies is a crucial challenge in many decision-making problems whose goal is to maximize long-term benefits.

Application in machine learning

In the context of machine learning, the exploration–exploitation tradeoff is fundamental in

reinforcement learning Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learnin ...

(RL), a type of machine learning that involves training agents to make decisions based on feedback from the environment. Crucially, this feedback may be incomplete or delayed. The agent must decide whether to exploit the current best-known policy or explore new policies to improve its performance.

Multi-armed bandit methods

The multi-armed bandit (MAB) problem was a classic example of the tradeoff, and many methods were developed for it, such as epsilon-greedy, Thompson sampling, and the upper confidence bound (UCB). See the page on MAB for details. In more complex RL situations than the MAB problem, the agent can treat each choice as a MAB, where the payoff is the expected future reward. For example, if the agent performs an epsilon-greedy method, then the agent will often "pull the best lever" by picking the action that had the best predicted expected reward (exploit). However, it would pick a random action with probability epsilon (explore).

Monte Carlo tree search In computer science, Monte Carlo tree search (MCTS) is a heuristic search algorithm for some kinds of decision processes, most notably those employed in software that plays board games. In that context MCTS is used to solve the game tree. MCTS ...

, for example, uses a variant of the UCB method.

Exploration problems

There are some problems that make exploration difficult. * Sparse reward. If rewards occur only once a long while, then the agent might not persist in exploring. Furthermore, if the space of actions is large, then the sparse reward would mean the agent would not be guided by the reward to find a good direction for deeper exploration. A standard example is '' Montezuma's Revenge''. * Deceptive reward. If some early actions give immediate small reward, but other actions give later large reward, then the agent might be lured away from exploring the other actions. * Noisy TV problem. If certain observations are irreducibly noisy (such as a television showing random images), then the agent might be trapped exploring those observations (watching the television).

Exploration reward

This section based on. The exploration reward (also called exploration bonus) methods convert the exploration-exploitation dilemma into a balance of exploitations. That is, instead of trying to get the agent to balance exploration and exploitation, exploration is simply treated as another form of exploitation, and the agent simply attempts to maximize the sum of rewards from exploration and exploitation. The exploration reward can be treated as a form of intrinsic reward. We write these as

r_t^i, r_t^e

, meaning the intrinsic and extrinsic rewards at time step

t

. However, exploration reward is different from exploitation in two regards: * The reward of exploitation is not freely chosen, but given by the environment, but the reward of exploration may be picked freely. Indeed, there are many different ways to design

r_t^i

described below. * The reward of exploitation is usually stationary (i.e. the same action in the same state gives the same reward), but the reward of exploration is non-stationary (i.e. the same action in the same state should give less and less reward). Count-based exploration uses

N_n(s)

, the number of visits to a state

s

during the time-steps

1:n

, to calculate the exploration reward. This is only possible in small and discrete state space. Density-based exploration extends count-based exploration by using a density model

\rho_n(s)

. The idea is that, if a state has been visited, then nearby states are also partly-visited. In maximum entropy exploration, the entropy of the agent's policy

\pi

is included as a term in the intrinsic reward. That is,

r_t^i = -\sum_a \pi(a ,  s_t) \ln \pi(a ,  s_t) + \cdots

Prediction-based

This section based on. The forward dynamics model is a function for predicting the next state based on the current state and the current action:

f: (s_t, a_t) \mapsto s_

. The forward dynamics model is trained as the agent plays. The model becomes better at predicting state transition for state-action pairs that had been done many times. A forward dynamics model can define an exploration reward by

r_t^i = \, f(s_t, a_t) - s_\, _2^2

. That is, the reward is the squared-error of the prediction compared to reality. This rewards the agent to perform state-action pairs that had not been done many times. This is however susceptible to the noisy TV problem. Dynamics model can be run in latent space. That is,

r_t^i = \, f(s_t, a_t) - \phi(s_)\, _2^2

for some featurizer

\phi

. The featurizer can be the identity function (i.e.

\phi(x) = x

), randomly generated, the encoder-half of a

variational autoencoder In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling. It is part of the families of probabilistic graphical models and variational Bayesian metho ...

, etc. A good featurizer improves forward dynamics exploration. The Intrinsic Curiosity Module (ICM) method trains simultaneously a forward dynamics model and a featurizer. The featurizer is trained by an inverse dynamics model, which is a function for predicting the current action based on the features of the current and the next state:

g: (\phi(s_t), \phi(s_)) \mapsto a_t

. By optimizing the inverse dynamics, both the inverse dynamics model and the featurizer are improved. Then, the improved featurizer improves the forward dynamics model, which improves the exploration of the agent. Random Network Distillation (RND) method attempts to solve this problem by teacher–student distillation. Instead of a forward dynamics model, it has two models

f, f'

. The

f'

teacher model is fixed, and the

f

student model is trained to minimize

\, f(s) - f'(s)\, _2^2

on states

s

. As a state is visited more and more, the student network becomes better at predicting the teacher. Meanwhile, the prediction error is also an exploration reward for the agent, and so the agent learns to perform actions that result in higher prediction error. Thus, we have a student network attempting to minimize the prediction error, while the agent attempting to maximize it, resulting in exploration. The states are normalized by subtracting a running average and dividing a running variance, which is necessary since the teacher model is frozen. The rewards are normalized by dividing with a running variance. Exploration by disagreement trains an ensemble of forward dynamics models, each on a random subset of all

(s_t, a_t, s_)

tuples. The exploration reward is the

variance In probability theory and statistics, variance is the expected value of the squared deviation from the mean of a random variable. The standard deviation (SD) is obtained as the square root of the variance. Variance is a measure of dispersion ...

of the models' predictions.

Noise

For

neural network A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either biological cells or signal pathways. While individual neurons are simple, many of them together in a network can perfor ...

–based agents, the NoisyNet method changes some of its neural network modules by noisy versions. That is, some network parameters are random variables from a probability distribution. The parameters of the distribution are themselves learnable. For example, in a linear layer

y = Wx + b

, both

W, b

are sampled from

Gaussian distribution In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real number, real-valued random variable. The general form of its probability density function is f(x ...

\mathcal N(\mu_W, \Sigma_W), \mathcal N(\mu_b, \Sigma_b)

at every step, and the parameters

\mu_W, \Sigma_W, \mu_b, \Sigma_b

are learned via the

reparameterization trick The reparameterization trick (aka "reparameterization gradient estimator") is a technique used in statistical machine learning, particularly in variational inference, variational autoencoders, and stochastic optimization. It allows for the effici ...

References

* {{Decision theory Machine learning Strategy Cognition Decision theory Optimal decisions