Multi-agent Reinforcement Learning

Multi-agent reinforcement learning (MARL) is a sub-field of reinforcement learning. It focuses on studying the behavior of multiple learning agents that coexist in a shared environment. Each agent is motivated by its own rewards, and takes actions to advance its own interests; in some environments these interests are opposed to the interests of other agents, resulting in complex group dynamics. Multi-agent reinforcement learning is closely related to game theory, especially repeated games, as well as to multi-agent systems. Its study combines the pursuit of finding ideal algorithms that maximize rewards with a more sociological set of concepts. While research in single-agent reinforcement learning is concerned with finding the algorithm that obtains the highest reward for a single agent, research in multi-agent reinforcement learning evaluates and quantifies social metrics such as cooperation, reciprocity, equity, social influence, language and discrimination.


Definition

Similarly to single-agent reinforcement learning, multi-agent reinforcement learning is modeled as some form of a Markov decision process (MDP). Fix a set of agents I = \{1, \dots, N\}. We then define:
* A set S of environment states.
* One set \mathcal{A}_i of actions for each of the agents i \in I.
* P_{\overrightarrow{a}}(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, \overrightarrow{a}_t = \overrightarrow{a}), the probability of transition (at time t) from state s to state s' under joint action \overrightarrow{a}.
* \overrightarrow{R}_{\overrightarrow{a}}(s, s'), the immediate joint reward after the transition from s to s' with joint action \overrightarrow{a}.
In settings with perfect information, such as the games of chess and Go, the MDP would be fully observable. In settings with imperfect information, especially in real-world applications like self-driving cars, each agent would access an observation that only has part of the information about the current state. In the partially observable setting, the core model is the partially observable stochastic game in the general case, and the decentralized POMDP in the cooperative case.
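
As a concrete illustration of this formalism, here is a minimal sketch of a two-agent stochastic game written in Python. The states, actions, transition probabilities and rewards are all invented for the example (it is not a standard benchmark); the code simply makes the objects S, \mathcal{A}_i, P and \overrightarrow{R} explicit.

```python
import random

# A toy two-agent stochastic game; every name and number here is invented for
# illustration. S = {"s0", "s1"}, A_i = {"a", "b"}, P and R are given below.
STATES = ["s0", "s1"]
AGENTS = [0, 1]
ACTIONS = {0: ["a", "b"], 1: ["a", "b"]}

def transition(state, joint_action):
    """P_{a}(s, s'): return a dict mapping next states to probabilities."""
    if state == "s0" and joint_action == ("a", "a"):
        return {"s1": 0.9, "s0": 0.1}      # coordinated action usually advances
    return {"s0": 1.0}                      # otherwise the game resets to s0

def joint_reward(state, joint_action, next_state):
    """R_{a}(s, s'): immediate reward vector, one component per agent."""
    bonus = 1.0 if next_state == "s1" else 0.0
    return (bonus, bonus)                   # identical rewards (cooperative case)

def step(state, joint_action):
    dist = transition(state, joint_action)
    next_state = random.choices(list(dist), weights=list(dist.values()))[0]
    return next_state, joint_reward(state, joint_action, next_state)

state = "s0"
for t in range(3):                          # three steps of uniformly random play
    joint_action = tuple(random.choice(ACTIONS[i]) for i in AGENTS)
    state, rewards = step(state, joint_action)
    print(t, joint_action, state, rewards)
```

In this toy game both agents receive identical rewards, which corresponds to the pure cooperation case discussed below.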


Cooperation vs. competition

When multiple agents are acting in a shared environment their interests might be aligned or misaligned. MARL allows exploring all the different alignments and how they affect the agents' behavior (the payoff-matrix sketch below illustrates the three cases):
* In pure competition settings, the agents' rewards are exactly opposite to each other, and therefore they are playing ''against'' each other.
* Pure cooperation settings are the other extreme, in which agents get the exact same rewards, and therefore they are playing ''with'' each other.
* Mixed-sum settings cover all the games that combine elements of both cooperation and competition.
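
A small sketch of the three reward structures as 2x2 payoff matrices; the numeric payoffs are conventional textbook values (matching pennies, a coordination game, and the prisoner's dilemma), not taken from this article.

```python
import numpy as np

# Three toy 2x2 games; each entry is (reward to agent 1, reward to agent 2).
# The numeric payoffs are conventional textbook values chosen for illustration.

# Pure competition (zero-sum): matching pennies -- rewards always sum to zero.
matching_pennies = np.array([[(+1, -1), (-1, +1)],
                             [(-1, +1), (+1, -1)]])

# Pure cooperation (identical rewards): a simple coordination game.
coordination = np.array([[(+1, +1), (0, 0)],
                         [(0, 0), (+1, +1)]])

# Mixed-sum (general-sum): the prisoner's dilemma.
prisoners_dilemma = np.array([[(3, 3), (0, 5)],
                              [(5, 0), (1, 1)]])

for name, game in [("pure competition", matching_pennies),
                   ("pure cooperation", coordination),
                   ("mixed-sum", prisoners_dilemma)]:
    print(name, "- reward sums per joint action:")
    print(game.sum(axis=-1))   # identically zero only in the zero-sum case
```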


Pure competition settings

When two agents are playing a zero-sum game, they are in pure competition with each other. Many traditional games such as chess and Go fall under this category, as do two-player variants of video games like ''StarCraft''. Because each agent can only win at the expense of the other agent, many complexities are stripped away. There is no prospect of communication or social dilemmas, as neither agent is incentivized to take actions that benefit its opponent. The Deep Blue and AlphaGo projects demonstrate how to optimize the performance of agents in pure competition settings. One complexity that is not stripped away in pure competition settings is autocurricula. As the agents' policy is improved using self-play, multiple layers of learning may occur.
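
The large-scale self-play used by systems like AlphaGo is beyond the scope of a short example, but the following sketch shows the underlying idea of improving through repeated play in a zero-sum matrix game. It uses multiplicative-weights (Hedge) updates as a deliberately simple stand-in for the actual training methods of Deep Blue or AlphaGo; the game, learning rate and horizon are arbitrary choices.

```python
import numpy as np

# Two learners repeatedly play matching pennies (a zero-sum matrix game) and
# update mixed strategies with multiplicative-weights (Hedge) updates.
payoff = np.array([[+1, -1],
                   [-1, +1]])   # row player's payoff; column player gets the negative

eta, T = 0.1, 5000
w_row, w_col = np.ones(2), np.ones(2)
avg_row = np.zeros(2)

for t in range(T):
    p_row = w_row / w_row.sum()
    p_col = w_col / w_col.sum()
    avg_row += p_row
    u_row = payoff @ p_col             # expected payoff of each row action
    u_col = -(p_row @ payoff)          # expected payoff of each column action
    w_row *= np.exp(eta * u_row)       # reweight actions by how well they did
    w_col *= np.exp(eta * u_col)

print("time-averaged row strategy:", avg_row / T)
```

In zero-sum games, the time-averaged strategies of such no-regret learners approach a mixed Nash equilibrium, which is why the printed average tends toward an even coin flip.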


Pure cooperation settings

MARL is used to explore how separate agents with identical interests can communicate and work together. Pure cooperation settings are explored in recreational cooperative games such as ''Overcooked'', as well as in real-world scenarios in robotics. In pure cooperation settings all the agents get identical rewards, which means that social dilemmas do not occur. In pure cooperation settings there are often many equally good coordination strategies, and agents converge to specific "conventions" when coordinating with each other. The notion of conventions has been studied in language and also alluded to in more general multi-agent collaborative tasks.
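
A minimal sketch of convention formation, assuming two independent tabular learners in an invented stateless coordination game (the game, hyperparameters and seed are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent tabular learners in a stateless 2-action coordination game:
# both get reward 1 only when they pick the same action, so (0, 0) and (1, 1)
# are equally good conventions. All hyperparameters are arbitrary.
n_actions, episodes, eps, lr = 2, 2000, 0.1, 0.1
q = [rng.normal(scale=0.01, size=n_actions) for _ in range(2)]  # per-agent values

for _ in range(episodes):
    acts = [int(rng.integers(n_actions)) if rng.random() < eps
            else int(np.argmax(q[i])) for i in range(2)]
    r = 1.0 if acts[0] == acts[1] else 0.0      # identical reward for both agents
    for i in range(2):
        q[i][acts[i]] += lr * (r - q[i][acts[i]])

# Which convention emerges depends only on the random seed and exploration noise.
print("learned convention:", [int(np.argmax(qi)) for qi in q])
```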


Mixed-sum settings

Most real-world scenarios involving multiple agents have elements of both cooperation and competition. For example, when multiple self-driving cars are planning their respective paths, each of them has interests that are diverging but not exclusive: each car wants to minimize the time it takes to reach its destination, but all cars share an interest in avoiding a traffic collision. Zero-sum settings with three or more agents often exhibit similar properties to mixed-sum settings, since each pair of agents might have a non-zero utility sum between them. Mixed-sum settings can be explored using classic matrix games such as prisoner's dilemma, more complex sequential social dilemmas, and recreational games such as ''Among Us'', ''Diplomacy'' and ''StarCraft II''. Mixed-sum settings can give rise to communication and social dilemmas.
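
To illustrate how repetition changes a mixed-sum game, here is a small simulation of the iterated prisoner's dilemma with two hand-coded strategies, tit-for-tat and always-defect. The strategies are fixed rather than learned, and the payoffs are the usual textbook values.

```python
# Iterated prisoner's dilemma with two fixed strategies (no learning), just to
# show how repetition lets cooperation pay off. "C" = cooperate, "D" = defect.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(history):
    return "C" if not history else history[-1][1]   # copy opponent's last move

def always_defect(history):
    return "D"

def play(strategy_a, strategy_b, rounds=10):
    history_a, history_b, score = [], [], [0, 0]
    for _ in range(rounds):
        a = strategy_a(history_a)        # each strategy sees (own, opponent) pairs
        b = strategy_b(history_b)
        ra, rb = PAYOFF[(a, b)]
        score[0] += ra
        score[1] += rb
        history_a.append((a, b))
        history_b.append((b, a))
    return score

print("TFT vs TFT:        ", play(tit_for_tat, tit_for_tat))      # mutual cooperation
print("TFT vs AlwaysD:    ", play(tit_for_tat, always_defect))    # exploited once, then defects
print("AlwaysD vs AlwaysD:", play(always_defect, always_defect))  # mutual defection
```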


Social dilemmas

As in game theory, much of the research in MARL revolves around social dilemmas, such as prisoner's dilemma, chicken and stag hunt. While game theory research might focus on Nash equilibria and what an ideal policy for an agent would be, MARL research focuses on how the agents would learn these ideal policies using a trial-and-error process. The reinforcement learning algorithms used to train the agents maximize each agent's own reward; the conflict between the needs of the individual agents and the needs of the group is a subject of active research. Various techniques have been explored in order to induce cooperation in agents: modifying the environment rules, adding intrinsic rewards, and more.
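
One of the intrinsic-reward techniques mentioned above can be sketched as simple reward shaping. The example below, loosely modeled on inequity-aversion-style shaping, subtracts from each agent's extrinsic reward a penalty proportional to the reward gap between the two agents; the function name and the weights alpha and beta are invented for this illustration rather than taken from a specific published implementation.

```python
# Sketch of intrinsic-reward shaping for two agents: each agent is penalized
# both for earning less than the other (weight alpha) and for earning more
# (weight beta). The weights are arbitrary illustration values.
def shaped_rewards(extrinsic, alpha=0.5, beta=0.05):
    """extrinsic: (r1, r2) from the environment; returns shaped (r1', r2')."""
    r1, r2 = extrinsic
    shaped1 = r1 - alpha * max(r2 - r1, 0) - beta * max(r1 - r2, 0)
    shaped2 = r2 - alpha * max(r1 - r2, 0) - beta * max(r2 - r1, 0)
    return shaped1, shaped2

print(shaped_rewards((5.0, 0.0)))  # both the disadvantaged and the advantaged agent are penalized
```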


Sequential social dilemmas

Social dilemmas like prisoner's dilemma, chicken and stag hunt are "matrix games". Each agent takes only one action from a choice of two possible actions, and a simple 2x2 matrix is used to describe the reward that each agent will get, given the actions that each agent took. In humans and other living creatures, social dilemmas tend to be more complex. Agents take multiple actions over time, and the distinction between cooperating and defecting is not as clear-cut as in matrix games. The concept of a sequential social dilemma (SSD) was introduced in 2017 as an attempt to model that complexity. There is ongoing research into defining different kinds of SSDs and showing cooperative behavior in the agents that act in them.


Autocurricula

An autocurriculum (plural: autocurricula) is a reinforcement learning concept that is salient in multi-agent experiments. As agents improve their performance, they change their environment; this change in the environment affects themselves and the other agents. The feedback loop results in several distinct phases of learning, each depending on the previous one. The stacked layers of learning are called an autocurriculum. Autocurricula are especially apparent in adversarial settings, where each group of agents is racing to counter the current strategy of the opposing group.

The hide-and-seek game is an accessible example of an autocurriculum occurring in an adversarial setting. In this experiment, a team of seekers is competing against a team of hiders. Whenever one of the teams learns a new strategy, the opposing team adapts its strategy to give the best possible counter. When the hiders learn to use boxes to build a shelter, the seekers respond by learning to use a ramp to break into that shelter. The hiders respond by locking the ramps, making them unavailable for the seekers to use. The seekers then respond by "box surfing", exploiting a glitch in the game to penetrate the shelter. Each "level" of learning is an emergent phenomenon, with the previous level as its premise. This results in a stack of behaviors, each dependent on its predecessor.

Autocurricula in reinforcement learning experiments are compared to the stages of the evolution of life on Earth and the development of human culture. A major stage in evolution happened 2-3 billion years ago, when photosynthesizing life forms started to produce massive amounts of oxygen, changing the balance of gases in the atmosphere. In the next stages of evolution, oxygen-breathing life forms evolved, eventually leading up to land mammals and human beings. These later stages could only happen after the photosynthesis stage made oxygen widely available. Similarly, human culture could not have gone through the Industrial Revolution in the 18th century without the resources and insights gained by the agricultural revolution at around 10,000 BC.


Applications

Multi-agent reinforcement learning has been applied to a variety of use cases in science and industry:


AI alignment

Multi-agent reinforcement learning has been used in research into AI alignment. The relationship between the different agents in a MARL setting can be compared to the relationship between a human and an AI agent. Research efforts in the intersection of these two fields attempt to simulate possible conflicts between a human's intentions and an AI agent's actions, and then explore which variables could be changed to prevent these conflicts.


Limitations

There are some inherent difficulties in multi-agent deep reinforcement learning. From the perspective of any single agent, the environment is no longer stationary, so the Markov property is violated: transitions and rewards depend not only on the agent's current state and action, but also on the other agents, whose policies keep changing as they learn.
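
To state this concretely in the notation of the Definition section (writing \pi^{-i}_t for the other agents' joint policy at time t, a symbol introduced here only for illustration): the transition kernel that a single agent i effectively faces is P^i_t(s' \mid s, a_i) = \sum_{\overrightarrow{a}_{-i}} \pi^{-i}_t(\overrightarrow{a}_{-i} \mid s) \, P_{(a_i, \overrightarrow{a}_{-i})}(s, s'). Because \pi^{-i}_t changes as the other agents learn, this kernel depends on t; from agent i's point of view the environment is non-stationary even though the underlying stochastic game itself is not.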


Further reading

* Stefano V. Albrecht, Filippos Christianos, Lukas Schäfer. ''Multi-Agent Reinforcement Learning: Foundations and Modern Approaches''. MIT Press, 2024. https://www.marl-book.com
* Kaiqing Zhang, Zhuoran Yang, Tamer Basar. ''Multi-agent reinforcement learning: A selective overview of theories and algorithms''. Studies in Systems, Decision and Control, Handbook on RL and Control, 2021.


