The actor-critic algorithm (AC) is a family of
reinforcement learning
(RL) algorithms that combine policy-based RL algorithms such as
policy gradient methods, and value-based RL algorithms such as value iteration,
Q-learning,
SARSA, and
TD learning.
An AC algorithm consists of two main components: an "actor" that determines which actions to take according to a policy function, and a "critic" that evaluates those actions according to a value function. Some AC algorithms are on-policy and some are off-policy; some apply to discrete action spaces, some to continuous ones, and some to both.
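As a concrete illustration, the sketch below pairs a tabular softmax actor with a tabular critic on a toy problem. The problem sizes, variable names, and parameterization here are illustrative assumptions, not part of any standard implementation.

```python
import numpy as np

n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

theta = rng.normal(size=(n_states, n_actions))  # actor parameters
w = np.zeros(n_states)                          # critic parameters

def actor(state):
    """Policy: a probability distribution over actions in this state."""
    logits = theta[state]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def critic(state):
    """Estimated value of this state."""
    return w[state]

state = 0
action = rng.choice(n_actions, p=actor(state))  # the actor chooses the action
value = critic(state)                           # the critic evaluates the state
```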
Overview
Actor-critic methods can be understood as an improvement over pure policy gradient methods such as REINFORCE, obtained by introducing a baseline that reduces the variance of the gradient estimate.
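A quick Monte Carlo check of the baseline idea, on an assumed one-step, two-action toy problem with a softmax policy: subtracting a baseline from the reward leaves the mean of the score-function gradient estimate unchanged while shrinking its variance.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.2, -0.1])                  # softmax policy parameters
rewards = np.array([1.0, 0.0])                 # fixed reward for each action
probs = np.exp(theta) / np.exp(theta).sum()

def grad_log_pi(a):
    return np.eye(2)[a] - probs                # score function of a softmax policy

b = probs @ rewards                            # baseline: the expected reward
plain, with_baseline = [], []
for _ in range(10_000):
    a = rng.choice(2, p=probs)
    plain.append(grad_log_pi(a) * rewards[a])
    with_baseline.append(grad_log_pi(a) * (rewards[a] - b))

print(np.mean(plain, axis=0), np.var(plain, axis=0))
print(np.mean(with_baseline, axis=0), np.var(with_baseline, axis=0))
```

Both estimators average to the same gradient, but the baseline-subtracted version shows markedly lower variance.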
Actor
The actor uses a policy function $\pi(a|s)$, while the critic estimates either the value function $V(s)$, the action-value function $Q(s,a)$, the advantage function $A(s,a)$, or any combination thereof.
The actor is a parameterized function $\pi_\theta$, where $\theta$ are the parameters of the actor. The actor takes as argument the state of the environment $s$ and produces a probability distribution $\pi_\theta(\cdot \mid s)$ over actions.
If the action space is discrete, then $\sum_{a} \pi_\theta(a \mid s) = 1$. If the action space is continuous, then $\int_{a} \pi_\theta(a \mid s) \, \mathrm{d}a = 1$.
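For instance, the sketch below checks both normalization conditions numerically, using an assumed softmax parameterization for the discrete case and an assumed Gaussian policy for the continuous case.

```python
import numpy as np

# Discrete case: a softmax policy assigns each action a probability; they sum to 1.
logits = np.array([1.0, -0.5, 0.3])
pi = np.exp(logits) / np.exp(logits).sum()
print(pi.sum())                                # 1.0

# Continuous case: a Gaussian policy is a density over actions; it integrates to 1.
mu, sigma = 0.5, 1.2
a = np.linspace(-10.0, 10.0, 100_001)
density = np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
print(density.sum() * (a[1] - a[0]))           # numerically close to 1
```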
The goal of policy optimization is to improve the actor. That is, to find some $\theta$ that maximizes the expected episodic reward $J(\theta)$:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{i=0}^{T} \gamma^i r_i\right]$$
where $\gamma$ is the discount factor, $r_i$ is the reward at step $i$, and $T$ is the time-horizon (which can be infinite).
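As a minimal sketch, the helper below computes this discounted sum for a single sampled episode (the reward values are made up for illustration); $J(\theta)$ is then the expectation of this quantity over episodes generated by $\pi_\theta$.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**i * r_i over one episode's reward sequence."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

episode_rewards = [0.0, 0.0, 1.0, 0.5]      # r_0, ..., r_T from one sampled episode
print(discounted_return(episode_rewards))   # 0.99**2 * 1.0 + 0.99**3 * 0.5
```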
The goal of the policy gradient method is to optimize $J(\theta)$ by gradient ascent on the policy gradient $\nabla_\theta J(\theta)$.
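A minimal sketch of one ascent step, assuming some estimate of $\nabla_\theta J(\theta)$ has already been computed from sampled trajectories; the name `grad_estimate` and the learning rate are illustrative stand-ins.

```python
import numpy as np

def gradient_ascent_step(theta, grad_estimate, lr=0.01):
    # Ascent rather than descent: step *along* the gradient to increase J(theta).
    return theta + lr * grad_estimate

theta = np.zeros(3)
grad_estimate = np.array([0.5, -0.2, 0.1])  # stand-in for a sampled policy gradient
theta = gradient_ascent_step(theta, grad_estimate)
print(theta)
```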
As detailed on the
policy gradient method page, there are many
unbiased estimators of the policy gradient: