Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
Reinforcement learning differs from supervised learning in not needing labelled input-output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge) with the goal of maximizing the cumulative reward (the feedback of which might be incomplete or delayed).
The search for this balance is known as the
exploration–exploitation dilemma.
The environment is typically stated in the form of a
Markov decision process (MDP), as many reinforcement learning algorithms use
dynamic programming techniques. The main difference between classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the Markov decision process, and they target large MDPs where exact methods become infeasible.
Principles
Due to its generality, reinforcement learning is studied in many disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, and statistics. In the operations research and control literature, RL is called ''approximate dynamic programming'', or ''neuro-dynamic programming''. The problems of interest in RL have also been studied in the
theory of optimal control, which is concerned mostly with the existence and characterization of optimal solutions, and algorithms for their exact computation, and less with learning or approximation (particularly in the absence of a mathematical model of the environment).
Basic reinforcement learning is modeled as a Markov decision process:
* A set of environment and agent states (the state space), $\mathcal{S}$;
* A set of actions (the action space), $\mathcal{A}$, of the agent;
* $P_a(s, s') = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$, the transition probability (at time $t$) from state $s$ to state $s'$ under action $a$.
* $R_a(s, s')$, the immediate reward after transition from $s$ to $s'$ under action $a$.
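As an illustration, the following sketch encodes a small, hypothetical two-state MDP as plain Python dictionaries; the state names, action names, probabilities, and rewards are invented purely for this example.

```python
# A minimal, hypothetical two-state MDP written as plain Python dictionaries.
states = ["s0", "s1"]            # state space S
actions = ["stay", "move"]       # action space A

# P[(s, a)] maps each successor state s' to Pr(S_{t+1} = s' | S_t = s, A_t = a).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R[(s, a, s')] is the immediate reward for the transition s -> s' under a.
R = {
    ("s0", "stay", "s0"): 0.0,  ("s0", "stay", "s1"): 1.0,
    ("s0", "move", "s0"): 0.0,  ("s0", "move", "s1"): 1.0,
    ("s1", "stay", "s0"): -1.0, ("s1", "stay", "s1"): 0.5,
    ("s1", "move", "s0"): -1.0, ("s1", "move", "s1"): 0.5,
}

# Every transition distribution must sum to one.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in P.values())
```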
The purpose of reinforcement learning is for the agent to learn an optimal (or near-optimal) policy that maximizes the reward function or other user-provided reinforcement signal that accumulates from immediate rewards. This is similar to
processes that appear to occur in animal psychology. For example, biological brains are hardwired to interpret signals such as pain and hunger as negative reinforcements, and interpret pleasure and food intake as positive reinforcements. In some circumstances, animals learn to adopt behaviors that optimize these rewards. This suggests that animals are capable of reinforcement learning.
A basic reinforcement learning agent interacts with its environment in discrete time steps. At each time step $t$, the agent receives the current state $S_t$ and reward $R_t$. It then chooses an action $A_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $S_{t+1}$ and the reward $R_{t+1}$ associated with the ''transition'' $(S_t, A_t, S_{t+1})$ is determined. The goal of a reinforcement learning agent is to learn a ''policy'':
:$\pi: \mathcal{S} \times \mathcal{A} \to [0, 1], \quad \pi(s, a) = \Pr(A_t = a \mid S_t = s),$
that maximizes the expected cumulative reward.
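To make this interaction loop concrete, here is a minimal sketch of an agent acting for a few time steps in the toy MDP defined above; the uniform-random placeholder policy and the `step` helper are assumptions made only for this illustration.

```python
import random

def step(state, action):
    """Sample the next state and reward from the toy MDP defined earlier."""
    dist = P[(state, action)]
    next_state = random.choices(list(dist), weights=list(dist.values()))[0]
    return next_state, R[(state, action, next_state)]

def policy(state):
    """Placeholder policy: pick an available action uniformly at random."""
    return random.choice(actions)

state, total_reward = "s0", 0.0
for t in range(10):                              # ten discrete time steps
    action = policy(state)                       # agent picks A_t
    next_state, reward = step(state, action)     # environment returns S_{t+1}, R_{t+1}
    total_reward += reward
    state = next_state
print("cumulative reward:", total_reward)
```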
Formulating the problem as a Markov decision process assumes the agent directly observes the current environmental state; in this case, the problem is said to have ''full observability''. If the agent only has access to a subset of states, or if the observed states are corrupted by noise, the agent is said to have ''partial observability'', and formally the problem must be formulated as a
partially observable Markov decision process. In both cases, the set of actions available to the agent can be restricted. For example, the state of an account balance could be restricted to be positive; if the current value of the state is 3 and the state transition attempts to reduce the value by 4, the transition will not be allowed.
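A hedged sketch of the account-balance restriction described above: the environment simply refuses any transition that would make the balance negative. The function name and values are hypothetical.

```python
def withdraw(balance, amount):
    """Disallow any transition that would make the balance negative."""
    if balance - amount < 0:
        return balance       # transition not allowed; state unchanged
    return balance - amount

print(withdraw(3, 4))        # -> 3 (reducing the value by 4 is not allowed)
print(withdraw(3, 2))        # -> 1
```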
When the agent's performance is compared to that of an agent that acts optimally, the difference in performance yields the notion of
regret. In order to act near optimally, the agent must reason about long-term consequences of its actions (i.e., maximize future rewards), although the immediate reward associated with this might be negative.
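One common way to make this precise (stated here for the simplest, bandit-style setting as an illustrative assumption, not as the article's formal definition) is the regret after $T$ steps,
:$\mathrm{Regret}(T) = T\,\mu^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right],$
where $\mu^{*}$ is the expected reward of the best action and $r_t$ is the reward actually received at step $t$.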
Thus, reinforcement learning is particularly well-suited to problems that include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including
energy storage, robot control, photovoltaic generators, backgammon, checkers, Go (AlphaGo), and autonomous driving systems.
Two elements make reinforcement learning powerful: the use of samples to optimize performance, and the use of function approximation to deal with large environments. Thanks to these two key components, RL can be used in large environments in the following situations:
* A model of the environment is known, but an
analytic solution is not available;
* Only a simulation model of the environment is given (the subject of
simulation-based optimization);
* The only way to collect information about the environment is to interact with it.
The first two of these problems could be considered planning problems (since some form of model is available), while the last one could be considered to be a genuine learning problem. However, reinforcement learning converts both planning problems to machine learning problems.
Exploration
The exploration vs. exploitation trade-off has been most thoroughly studied through the
multi-armed bandit problem and for finite state space Markov decision processes in Burnetas and Katehakis (1997).
Reinforcement learning requires clever exploration mechanisms; randomly selecting actions, without reference to an estimated probability distribution, shows poor performance. The case of (small) finite Markov decision processes is relatively well understood. However, due to the lack of algorithms that scale well with the number of states (or scale to problems with infinite state spaces), simple exploration methods are the most practical.
One such method is $\varepsilon$-greedy, where $0 < \varepsilon < 1$ is a parameter controlling the amount of exploration vs. exploitation. With probability $1 - \varepsilon$, exploitation is chosen, and the agent chooses the action that it believes has the best long-term effect (ties between actions are broken uniformly at random). Alternatively, with probability $\varepsilon$, exploration is chosen, and the action is chosen uniformly at random. $\varepsilon$ is usually a fixed parameter, but can be adjusted either according to a schedule (making the agent explore progressively less), or adaptively based on heuristics.
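As a minimal sketch (assuming a table `Q[(s, a)]` of estimated action values; the function name and default value of `epsilon` are illustrative only), $\varepsilon$-greedy selection can be written as:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore; otherwise act greedily on Q."""
    if random.random() < epsilon:                     # explore
        return random.choice(actions)
    best = max(Q[(state, a)] for a in actions)        # exploit
    # Break ties between equally valued actions uniformly at random.
    return random.choice([a for a in actions if Q[(state, a)] == best])
```

A schedule can then decay `epsilon` across episodes so that the agent explores progressively less.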
Algorithms for control learning
Even if the issue of exploration is disregarded and even if the state were observable (assumed hereafter), the problem remains of using past experience to find out which actions lead to higher cumulative rewards.
Criterion of optimality
Policy
The agent's action selection is modeled as a map called ''policy'':
:$\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$
:$\pi(s, a) = \Pr(A_t = a \mid S_t = s)$
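As an illustration (reusing the toy MDP sketched earlier; the probabilities are arbitrary), a stochastic policy can be stored as an explicit table and sampled from:

```python
import random

# pi[s][a] = Pr(A_t = a | S_t = s); the numbers are made up for illustration.
pi = {
    "s0": {"stay": 0.3, "move": 0.7},
    "s1": {"stay": 0.6, "move": 0.4},
}

def sample_action(pi, state):
    """Draw an action according to the policy's distribution at this state."""
    dist = pi[state]
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(sample_action(pi, "s0"))
```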