Overview
Deep learning
Reinforcement learning
Deep reinforcement learning
In many practical decision-making problems, the states of the MDP are high-dimensional (e.g., images from a camera or the raw sensor stream from a robot), and such problems cannot be solved by traditional RL algorithms. Deep reinforcement learning algorithms incorporate deep learning to solve such MDPs, often representing the policy or other learned functions as a neural network and developing specialized algorithms that perform well in this setting.
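A minimal sketch of such a neural-network policy is given below, assuming PyTorch and a discrete action space; the ConvPolicy class, its layer sizes, and the 84x84 RGB observation shape are illustrative choices rather than a specific published architecture.

import torch
import torch.nn as nn

class ConvPolicy(nn.Module):
    """Maps a raw image observation to a distribution over discrete actions."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # Output layer produces one logit per action; input size inferred lazily.
        self.head = nn.LazyLinear(num_actions)

    def forward(self, image: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.head(self.features(image))
        return torch.distributions.Categorical(logits=logits)

# Example: sample an action for a single 84x84 RGB observation.
policy = ConvPolicy(num_actions=4)
obs = torch.rand(1, 3, 84, 84)
action = policy(obs).sample()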
History
Along with rising interest in neural networks beginning in the mid 1980s, interest grew in deep reinforcement learning, where a neural network is used in reinforcement learning to represent policies or value functions. Because in such a system the entire decision-making process, from sensors to motors in a robot or agent, involves a single neural network, it is also sometimes called end-to-end reinforcement learning.

One of the first successful applications of reinforcement learning with neural networks was TD-Gammon, a computer program developed in 1992 for playing backgammon. Four inputs were used for the number of pieces of a given color at a given location on the board, totaling 198 input signals. With zero knowledge built in, the network learned to play the game at an intermediate level by self-play and TD(λ). Seminal textbooks by Sutton and Barto on reinforcement learning, Bertsekas and Tsitsiklis on neuro-dynamic programming, and others advanced knowledge and interest in the field. Katsunari Shibata's group showed that various functions emerge in this framework, including image recognition, color constancy, sensor motion (active recognition), hand-eye coordination and hand reaching movement, explanation of brain activities, knowledge transfer, memory, selective attention, prediction, and exploration.

Starting around 2012, the so-called deep learning revolution led to increased interest in using deep neural networks as function approximators across a variety of domains. This led to renewed interest in using deep neural networks to learn the policy, value, and/or Q functions present in existing reinforcement learning algorithms. Beginning around 2013, DeepMind showed impressive learning results using deep RL to play Atari video games. The computer player was a neural network trained using a deep RL algorithm, a deep version of Q-learning they termed deep Q-networks (DQN), with the game score as the reward. They used a deep convolutional neural network to process the raw pixel input.
Algorithms
Various techniques exist to train policies to solve tasks with deep reinforcement learning algorithms, each having its own benefits. At the highest level, there is a distinction between model-based and model-free reinforcement learning, which refers to whether the algorithm attempts to learn a forward model of the environment dynamics.

In model-based deep reinforcement learning algorithms, a forward model of the environment dynamics is estimated, usually by supervised learning using a neural network. Actions are then obtained by model predictive control using the learned model. Since the true environment dynamics will usually diverge from the learned dynamics, the agent re-plans often when carrying out actions in the environment. The selected actions may be optimized using Monte Carlo methods such as the cross-entropy method, or by a combination of model learning with model-free methods.

In model-free deep reinforcement learning algorithms, a policy is learned without explicitly modeling the forward dynamics. A policy can be optimized to maximize returns by directly estimating the policy gradient, but this estimate suffers from high variance, making it impractical for use with function approximation in deep RL. Subsequent algorithms have been developed for more stable learning and have been widely applied. Another class of model-free deep reinforcement learning algorithms relies on dynamic programming, inspired by temporal difference learning and Q-learning. In discrete action spaces, these algorithms usually learn a neural network Q-function Q(s, a) that estimates the future returns of taking action a from state s. In continuous spaces, these algorithms often learn both a value estimate and a policy.
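The value-function approach just described can be illustrated with a short sketch of a deep Q-learning update, assuming PyTorch, a small vector state space, and a discrete action space; the network sizes, the target network, and the td_update helper are illustrative assumptions rather than the exact procedure of any particular paper.

import torch
import torch.nn as nn

# Illustrative Q-network over a 4-dimensional state and 2 discrete actions.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def td_update(state, action, reward, next_state, done):
    """One Q-learning step: regress Q(s, a) toward r + gamma * max_a' Q_target(s', a').

    state, next_state: float tensors of shape (batch, 4); action: long tensor (batch,);
    reward, done: float tensors of shape (batch,).
    """
    q_sa = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = reward + gamma * (1 - done) * target_net(next_state).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()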
Research
Deep reinforcement learning is an active area of research, with several lines of inquiry.
Exploration
An RL agent must balance the exploration/exploitation tradeoff: the problem of deciding whether to pursue actions that are already known to yield high rewards or to explore other actions in order to discover higher rewards. RL agents usually collect data with some type of stochastic policy, such as a Boltzmann distribution in discrete action spaces or a Gaussian distribution in continuous action spaces, inducing basic exploration behavior. The idea behind novelty-based, or curiosity-driven, exploration is to give the agent a motive to explore unknown outcomes in order to find the best solutions. This is done by "modifying the loss function (or even the network architecture) by adding terms to incentivize exploration". An agent may also be aided in exploration by utilizing demonstrations of successful trajectories, or by reward shaping, giving the agent intermediate rewards that are customized to fit the task it is attempting to complete.
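The following sketch illustrates one way such an exploration bonus can be added, assuming PyTorch; here the prediction error of a hypothetical learned forward model serves as an intrinsic reward, and all names and sizes (forward_model, bonus_scale, a 4-dimensional state, a 1-dimensional action) are illustrative.

import torch
import torch.nn as nn

# Curiosity sketch: the prediction error of a learned forward model is added to
# the task reward as an exploration bonus. Sizes are illustrative assumptions.
forward_model = nn.Sequential(nn.Linear(4 + 1, 64), nn.ReLU(), nn.Linear(64, 4))
fm_optimizer = torch.optim.Adam(forward_model.parameters(), lr=1e-3)
bonus_scale = 0.1

def shaped_reward(state, action, next_state, extrinsic_reward):
    """Return task reward plus a novelty bonus; poorly predicted transitions score higher."""
    pred_next = forward_model(torch.cat([state, action], dim=-1))
    error = nn.functional.mse_loss(pred_next, next_state)
    fm_optimizer.zero_grad()
    error.backward()  # the model improves over time, so the bonus fades for familiar states
    fm_optimizer.step()
    return extrinsic_reward + bonus_scale * error.item()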
Off-policy reinforcement learning
An important distinction in RL is the difference between on-policy algorithms, which require evaluating or improving the policy that collects data, and off-policy algorithms, which can learn a policy from data generated by an arbitrary policy. Generally, value-function-based methods such as Q-learning are better suited for off-policy learning and have better sample efficiency: the amount of data required to learn a task is reduced because data is re-used for learning. At the extreme, offline (or "batch") RL considers learning a policy from a fixed dataset without additional interaction with the environment.
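Data re-use in off-policy learning is commonly implemented with an experience replay buffer; a minimal sketch follows, with the ReplayBuffer class and its methods being illustrative rather than taken from a specific library.

import random
from collections import deque

# Minimal replay buffer: transitions collected by any behavior policy are stored
# and repeatedly re-sampled for off-policy updates (e.g., the Q-learning step above).
class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.storage, batch_size)
        # Returns tuples of states, actions, rewards, next_states, dones.
        return list(zip(*batch))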
Inverse reinforcement learning
Inverse RL refers to inferring the reward function of an agent given the agent's behavior. Inverse reinforcement learning can be used for learning from demonstrations (or apprenticeship learning) by inferring the demonstrator's reward function and then optimizing a policy to maximize returns with RL.
Goal-conditioned reinforcement learning
Another active area of research is learning goal-conditioned policies, also called contextual or universal policies, which take in an additional goal as input to communicate a desired aim to the agent. Hindsight experience replay is a method for goal-conditioned RL that involves storing and learning from previous failed attempts to complete a task. While a failed attempt may not have reached the intended goal, it can serve as a lesson for how to achieve the unintended result through hindsight relabeling.
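A simplified sketch of hindsight relabeling is shown below; the trajectory format, field names, and compute_reward argument are illustrative assumptions about how goal-conditioned transitions might be stored.

# Hindsight relabeling sketch: a failed trajectory toward `desired_goal` is stored a
# second time as if the goal had been the state actually achieved, turning failures
# into useful goal-conditioned training data.
def hindsight_relabel(trajectory, compute_reward):
    """trajectory: list of dicts with 'state', 'action', 'next_state', 'desired_goal'."""
    achieved_goal = trajectory[-1]["next_state"]  # treat the final state as the new goal
    relabeled = []
    for step in trajectory:
        new_step = dict(step, desired_goal=achieved_goal)
        new_step["reward"] = compute_reward(step["next_state"], achieved_goal)
        relabeled.append(new_step)
    return relabeled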
Multi-agent reinforcement learning
Many applications of reinforcement learning do not involve just a single agent, but rather a collection of agents that learn together and co-adapt. These agents may be competitive, as in many games, or cooperative, as in many real-world multi-agent systems.
Generalization
The promise of using deep learning tools in reinforcement learning is generalization: the ability to operate correctly on previously unseen inputs. For instance, neural networks trained for image recognition can recognize that a picture contains a bird even if they have never seen that particular image or even that particular bird. Since deep RL allows raw data (e.g., pixels) as input, there is a reduced need to predefine the environment, allowing the model to be generalized to multiple applications. With this layer of abstraction, deep reinforcement learning algorithms can be designed in a way that allows them to be general, and the same model can be used for different tasks. One method of increasing the ability of policies trained with deep RL to generalize is to incorporate representation learning.