Model-free (reinforcement learning)

In reinforcement learning (RL), a model-free algorithm is an algorithm which does not estimate the transition probability distribution (and the reward function) associated with the Markov decision process (MDP), which, in RL, represents the problem to be solved. The transition probability distribution (or transition model) and the reward function are often collectively called the "model" of the environment (or MDP), hence the name "model-free". A model-free RL algorithm can be thought of as an "explicit" trial-and-error algorithm. Typical examples of model-free algorithms include Monte Carlo (MC) RL, SARSA, and Q-learning.
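To make the contrast concrete, the sketch below shows the tabular SARSA and Q-learning update rules. The dictionary Q, the step size alpha, and the discount factor gamma are illustrative names chosen for this sketch, not notation taken from the article.

```python
# Tabular SARSA and Q-learning updates (illustrative sketch).
# Q maps (state, action) pairs to value estimates; alpha is the step size
# and gamma the discount factor. Neither rule needs a transition model:
# both learn from a single sampled transition.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action the agent actually takes next.
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy action in the next state.
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```

Both updates use only the sampled transition (s, a, r, s'); the environment's transition probabilities never appear, which is what makes them model-free.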
Monte Carlo estimation is a central component of many model-free RL algorithms. MC learning is essentially a form of generalized policy iteration, which alternates between two steps: policy evaluation (PEV) and policy improvement (PIM). In this framework, each policy is first evaluated by estimating its value function; then, based on that evaluation, a greedy improvement step produces a better policy. MC estimation is mainly applied to the policy evaluation step. The simplest way to judge the effectiveness of the current policy is to average the returns of all collected samples; as more experience is accumulated, the estimate converges to the true value by the law of large numbers. Hence, MC policy evaluation requires no prior knowledge of the environment dynamics. All it needs is experience, i.e., samples of states, actions, and rewards generated by interacting with an environment (which may be real or simulated).

Value function estimation is crucial for model-free RL algorithms. Unlike MC methods, temporal difference (TD) methods learn the value function by reusing existing value estimates. TD learning can learn from an incomplete sequence of events without waiting for the final outcome, and it can approximate the future return as a function of the current state. Like MC, TD uses only experience to estimate the value function, without any prior knowledge of the environment dynamics. The advantage of TD is that it can update the value function based on its current estimate. Therefore, TD learning algorithms can learn from incomplete episodes or continuing tasks in a step-by-step manner, while MC must proceed episode by episode.
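The following sketch contrasts the two estimation styles described above: first-visit Monte Carlo evaluation, which averages complete returns collected episode by episode, and TD(0), which updates after every step by bootstrapping from the current estimate. The episode representation (a list of (state, reward) pairs) and all names are illustrative assumptions, not anything prescribed by a particular library.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.99):
    """First-visit Monte Carlo evaluation: V(s) is the average of the
    complete returns observed after the first visit to s in each episode.
    Each episode is a list of (state, reward) pairs, where the reward is
    the one received after leaving that state."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Accumulate discounted returns by walking the episode backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()  # restore forward (time) order
        seen = set()
        for state, G in returns:
            if state not in seen:  # first visit only
                seen.add(state)
                returns_sum[state] += G
                returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """TD(0): move V(s) toward r + gamma * V(s') after a single step,
    bootstrapping from the current estimate instead of waiting for the
    end of the episode."""
    target = r if done else r + gamma * V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
```

Monte Carlo needs the full episode before any state's estimate can change, while the TD update can be applied online at every step, which is why it also suits continuing (non-episodic) tasks.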


Model-free reinforcement learning algorithms

Model-free RL algorithms can start from a blank policy candidate and achieve superhuman performance in many complex tasks, including Atari games, StarCraft and Go. Deep neural networks are responsible for recent artificial intelligence breakthroughs, and they can be combined with RL to create superhuman agents such as Google DeepMind's AlphaGo.
Mainstream model-free RL algorithms include Deep Q-Network (DQN), Dueling DQN, Double DQN (DDQN), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Asynchronous Advantage Actor-Critic (A3C), Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), Soft Actor-Critic (SAC), and Distributional Soft Actor-Critic (DSAC). Some model-free (deep) RL algorithms are listed as follows:

| Algorithm | Description | Policy | Action space | State space | Operator |
|-----------|-------------|--------|--------------|-------------|----------|
| DQN | Deep Q-Network | Off-policy | Discrete | Typically discrete or continuous | Q-value |
| DDPG | Deep Deterministic Policy Gradient | Off-policy | Continuous | Discrete or continuous | Q-value |
| A3C | Asynchronous Advantage Actor-Critic | On-policy | Continuous | Discrete or continuous | Advantage |
| TRPO | Trust Region Policy Optimization | On-policy | Continuous or discrete | Discrete or continuous | Advantage |
| PPO | Proximal Policy Optimization | On-policy | Continuous or discrete | Discrete or continuous | Advantage |
| TD3 | Twin Delayed Deep Deterministic Policy Gradient | Off-policy | Continuous | Continuous | Q-value |
| SAC | Soft Actor-Critic | Off-policy | Continuous | Discrete or continuous | Advantage |
| DSAC (Duan et al., 2021) | Distributional Soft Actor-Critic | Off-policy | Continuous | Continuous | Value distribution |
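As a schematic illustration of the "Operator" column, the fragment below shows the off-policy Q-value target used by DQN-style methods next to the advantage estimate used by actor-critic methods such as A3C, TRPO and PPO. The function names and arguments are illustrative, not taken from any library implementing these algorithms.

```python
def q_value_target(reward, next_q_values, gamma=0.99, done=False):
    # Off-policy Q-value target used by DQN-style methods:
    # r + gamma * max_a' Q(s', a'), with no bootstrap at terminal states.
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def advantage(q_value, state_value):
    # Advantage A(s, a) = Q(s, a) - V(s), used by actor-critic methods
    # (A3C, TRPO, PPO) to weight the policy-gradient update.
    return q_value - state_value
```

DSAC's operator is listed as a value distribution because it learns a distribution over returns rather than a single expected Q-value.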


References

Duan, J.; Guan, Y.; Li, S. (2021). "Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors". IEEE Transactions on Neural Networks and Learning Systems. 33 (11): 6584–6598. doi:10.1109/TNNLS.2021.3082568. arXiv:2001.02811. PMID 34101599.
