
Imitation learning is a paradigm in reinforcement learning, where an agent learns to perform a task by supervised learning from expert demonstrations. It is also called learning from demonstration and apprenticeship learning. It has been applied to underactuated robotics, self-driving cars, quadcopter navigation, helicopter aerobatics, and locomotion.


Approaches

Expert demonstrations are recordings of an expert performing the desired task, often collected as observation-action pairs (o_t^*, a_t^*).


Behavior Cloning

Behavior Cloning (BC) is the most basic form of imitation learning. It uses supervised learning to train a policy \pi_\theta such that, given an observation o_t, the policy outputs an action distribution \pi_\theta(\cdot \mid o_t) that is approximately the same as the expert's action distribution (CS 285 at UC Berkeley: Deep Reinforcement Learning, Lecture 2: Supervised Learning of Behaviors). BC is susceptible to distribution shift: if the trained policy differs from the expert policy, it can stray from the expert trajectory into observations that never occurred in any expert trajectory. This was already noted in ALVINN, a project that trained a neural network to drive a van from human demonstrations. Because a human driver never strays far from the path, the network was never trained on what action to take if it found itself far off the path.
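A minimal behavior-cloning sketch, assuming a small discrete-action policy trained with PyTorch (the network shape, data, and hyperparameters here are illustrative assumptions, not from the cited lecture):

import torch
import torch.nn as nn

# Hypothetical demonstration data:
# obs: (N, obs_dim) observations, acts: (N,) discrete expert actions
obs = torch.randn(1000, 8)
acts = torch.randint(0, 4, (1000,))

# A small policy network mapping observations to action logits
policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # negative log-likelihood of expert actions

for epoch in range(100):
    logits = policy(obs)          # pi_theta(. | o_t) as logits
    loss = loss_fn(logits, acts)  # penalize disagreement with a_t^*
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Minimizing the negative log-likelihood pushes \pi_\theta(\cdot \mid o_t) toward the expert's action distribution on the demonstrated observations, which is exactly where the distribution-shift problem arises: nothing constrains the policy on observations outside the demonstrations.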


DAgger

DAgger (Dataset Aggregation) improves on behavior cloning by iteratively growing a dataset of expert-labeled data. In each iteration, the algorithm first collects data by rolling out the learned policy \pi_\theta. Then, it queries the expert for the optimal action a_t^* on each observation o_t encountered during the rollout. Finally, it aggregates the new data into the dataset, D \leftarrow D \cup \{(o_t, a_t^*)\}, and trains a new policy on the aggregated dataset.
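A schematic sketch of the loop, with trivial stand-ins for the environment, the expert oracle, and the supervised training step (all hypothetical, for illustration only):

import random

def expert_action(obs):        # stand-in for the queried expert oracle
    return 0 if obs < 0.5 else 1

def train_policy(dataset):     # stand-in for a supervised step, e.g. the
    lookup = dict(dataset)     # BC loop above; here simple memorization
    return lambda obs: lookup.get(obs, random.randint(0, 1))

dataset = [(0.1, 0), (0.9, 1)]  # initial expert demonstrations
policy = train_policy(dataset)

for iteration in range(5):
    # Roll out the *learned* policy so the dataset covers the
    # observations it actually visits...
    for step in range(20):
        obs = random.random()   # stand-in for an environment observation
        action = policy(obs)    # act with the learned policy
        # ...but label every visited observation with the expert's action:
        dataset.append((obs, expert_action(obs)))  # D <- D ∪ {(o_t, a_t^*)}
    policy = train_policy(dataset)  # retrain on the aggregated dataset

The key design choice is that data is collected under the learner's own state distribution while labels come from the expert, which is what corrects the distribution shift that plain behavior cloning suffers from.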


Decision transformer

The Decision Transformer approach models reinforcement learning as a sequence modelling problem. Similar to Behavior Cloning, it trains a sequence model, such as a Transformer, on rollout sequences (R_1, o_1, a_1), (R_2, o_2, a_2), \dots, (R_t, o_t, a_t), where R_t = r_t + r_{t+1} + \dots + r_T is the return-to-go, the sum of future rewards in the rollout. At training time, the sequence model is trained to predict each action a_t given the preceding rollout as context: (R_1, o_1, a_1), (R_2, o_2, a_2), \dots, (R_t, o_t). At inference time, to use the sequence model as a controller, it is conditioned on a very high target return R, and it generalizes by predicting actions that would achieve that high return. This was shown to scale predictably to a Transformer with 1 billion parameters that is superhuman on 41 Atari games.
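A small sketch of the return-to-go computation and the conditioning idea, assuming NumPy and a hypothetical `model` interface (neither the tokenization nor the actual Decision Transformer API is shown):

import numpy as np

def returns_to_go(rewards):
    # R_t = r_t + r_{t+1} + ... + r_T, via a reverse cumulative sum
    return np.cumsum(rewards[::-1])[::-1]

# Example: rewards (1, 0, 2) give returns-to-go (3, 2, 2).
print(returns_to_go(np.array([1.0, 0.0, 2.0])))

# Schematic inference loop (comments only; `model` is hypothetical):
# target_return is set to a very high value to elicit high-reward behavior.
#   context = [(target_return, o_1, a_1), ..., (R_t, o_t)]
#   a_t = model.predict_action(context)
# After acting and receiving reward r_t, the target is decremented:
#   target_return -= r_t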




Related approaches

Inverse Reinforcement Learning (IRL) learns a reward function that explains the expert's behavior and then uses reinforcement learning to find a policy that maximizes this reward. Recent works have also explored multi-agent extensions of IRL in networked systems. Generative Adversarial Imitation Learning (GAIL) uses generative adversarial networks (GANs) to match the distribution of agent behavior to the distribution of expert demonstrations. It extends an earlier approach based on game theory.
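A minimal sketch of the adversarial idea, assuming a PyTorch discriminator over concatenated observation-action vectors (dimensions, loss form, and reward shaping here are illustrative choices, not the exact GAIL objective):

import torch
import torch.nn as nn

obs_dim, act_dim = 8, 4  # hypothetical dimensions
disc = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                     nn.Linear(64, 1))
opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def disc_step(expert_sa, agent_sa):
    # Train the discriminator to output 1 on expert (o, a) pairs
    # and 0 on pairs generated by the current policy.
    logits_e = disc(expert_sa)
    logits_a = disc(agent_sa)
    loss = (bce(logits_e, torch.ones_like(logits_e)) +
            bce(logits_a, torch.zeros_like(logits_a)))
    opt.zero_grad()
    loss.backward()
    opt.step()

def surrogate_reward(agent_sa):
    # The policy is then trained with RL to fool the discriminator,
    # e.g. using -log(1 - D(o, a)) as a surrogate reward signal.
    with torch.no_grad():
        return -torch.log(1 - torch.sigmoid(disc(agent_sa)) + 1e-8)

As the policy improves, its (o, a) distribution becomes harder to distinguish from the expert's, which is the distribution-matching objective GAIL optimizes.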


See also

* Reinforcement learning
* Supervised learning
* Inverse reinforcement learning



