AIXI is a theoretical mathematical formalism for artificial general intelligence. It combines Solomonoff induction with sequential decision theory.
AIXI was first proposed by
Marcus Hutter in 2000 and several results regarding AIXI are proved in Hutter's 2005 book ''Universal Artificial Intelligence''.
AIXI is a reinforcement learning (RL) agent. It maximizes the expected total rewards received from the environment. Intuitively, it simultaneously considers every computable hypothesis (or environment). In each time step, it looks at every possible program and evaluates how many rewards that program generates depending on the next action taken. The promised rewards are then weighted by the subjective belief that this program constitutes the true environment. This belief is computed from the length of the program: longer programs are considered less likely, in line with Occam's razor. AIXI then selects the action that has the highest expected total reward in the weighted sum of all these programs.
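The weighting scheme described above can be illustrated with a toy sketch (a drastic simplification, not AIXI itself): a hand-picked, finite set of hypothetical environment "programs", each assigned the weight 2^(-length), with the agent choosing the action whose weighted one-step reward is largest. All program names, lengths and reward tables here are invented for illustration.

```python
# Toy sketch (not Hutter's algorithm): weight a hand-enumerated set of
# hypothetical "environment programs" by 2^(-length) and pick the action
# with the highest weighted one-step reward.

# Each hypothesis: (description, program length in bits, reward it
# predicts for each action). All values are invented for illustration.
hypotheses = [
    ("always-reward-left", 5, {"left": 1.0, "right": 0.0}),
    ("always-reward-right", 7, {"left": 0.0, "right": 1.0}),
    ("reward-both", 12, {"left": 1.0, "right": 1.0}),
]

def expected_reward(action):
    # Shorter programs get exponentially larger weight (Occam's razor).
    return sum(2.0 ** -length * rewards[action]
               for _, length, rewards in hypotheses)

best = max(["left", "right"], key=expected_reward)
print(best)  # "left": the shortest program predicts reward for "left"
```

Note how the 5-bit program dominates the mixture even though two of the three hypotheses reward "right": each extra bit of program length halves a hypothesis's weight.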
Etymology
According to Hutter, the word "AIXI" can have several interpretations. AIXI can stand for AI based on Solomonoff's distribution, denoted by ξ (which is the Greek letter xi), or e.g. it can stand for AI "crossed" (X) with induction (I). There are other interpretations.
Definition
AIXI is a reinforcement learning agent that interacts with some stochastic and unknown but computable environment μ. The interaction proceeds in time steps, from t = 1 to t = m, where m is the lifespan of the AIXI agent. At time step ''t'', the agent chooses an action a_t (e.g. a limb movement) and executes it in the environment, and the environment responds with a "percept" e_t, which consists of an "observation" o_t (e.g., a camera image) and a reward r_t, distributed according to the conditional probability μ(o_t r_t | a_1 o_1 r_1 ... a_{t-1} o_{t-1} r_{t-1} a_t), where a_1 o_1 r_1 ... a_{t-1} o_{t-1} r_{t-1} a_t is the "history" of actions, observations and rewards. The environment μ is thus mathematically represented as a probability distribution over "percepts" (observations and rewards) which depends on the ''full'' history, so there is no Markov assumption (as opposed to other RL algorithms). Note again that this probability distribution is ''unknown'' to the AIXI agent. Furthermore, note again that μ is computable, that is, the observations and rewards received by the agent from the environment μ can be computed by some program (which runs on a Turing machine), given the past actions of the AIXI agent.
The ''only'' goal of the AIXI agent is to maximize the sum r_1 + ... + r_m, that is, the sum of rewards from time step 1 to m.
The AIXI agent is associated with a stochastic policy π : (A × E)* → A, which is the function it uses to choose actions at every time step, where A is the space of all possible actions that AIXI can take and E is the space of all possible "percepts" that can be produced by the environment. The environment (or probability distribution) μ can also be thought of as a stochastic policy (which is a function): μ : (A × E)* × A → E, where the * is the Kleene star operation.
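As a minimal illustration of these two signatures, assuming string-valued actions and deterministic maps (real AIXI policies and environments are stochastic), the types can be sketched in Python:

```python
# Sketch of the two function signatures as Python type aliases. Actions
# and percepts are simplified to plain values, and the stochastic maps
# of AIXI are simplified to deterministic functions.
from typing import Callable, List, Tuple

Action = str
Percept = Tuple[str, float]             # (observation, reward)
History = List[Tuple[Action, Percept]]  # an element of (A x E)*

Policy = Callable[[History], Action]                # pi : (A x E)* -> A
Environment = Callable[[History, Action], Percept]  # mu : (A x E)* x A -> E

# A trivial (invented) environment that rewards repeating the previous action.
def toy_env(history: History, action: Action) -> Percept:
    prev = history[-1][0] if history else None
    return ("ok", 1.0 if action == prev else 0.0)

print(toy_env([], "left"))                       # ('ok', 0.0)
print(toy_env([("left", ("ok", 0.0))], "left"))  # ('ok', 1.0)
```

The `History` alias makes the Kleene star concrete: a finite-length list of (action, percept) pairs of any length, including the empty list.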
In general, at time step t (which ranges from 1 to m), AIXI, having previously executed actions a_1 a_2 ... a_{t-1} (which is often abbreviated in the literature as a_{<t}) and having observed the history of percepts o_1 r_1 ... o_{t-1} r_{t-1} (which can be abbreviated as e_{<t}), chooses and executes in the environment the action a_t, defined as follows:
:a_t := arg max_{a_t} Σ_{o_t r_t} ... max_{a_m} Σ_{o_m r_m} [r_t + ... + r_m] Σ_{q : U(q, a_1...a_m) = o_1 r_1 ... o_m r_m} 2^(-length(q))
or, using parentheses, to disambiguate the precedences:
:a_t := arg max_{a_t} ( Σ_{o_t r_t} ( ... max_{a_m} ( Σ_{o_m r_m} ( [r_t + ... + r_m] Σ_{q : U(q, a_1...a_m) = o_1 r_1 ... o_m r_m} 2^(-length(q)) ) ) ... ) )
Intuitively, in the definition above, AIXI considers the sum of the total reward over all possible "futures" up to m − t time steps ahead (that is, from t to m), weighs each of them by the complexity of programs q (that is, by 2^(-length(q))) consistent with the agent's past (that is, the previously executed actions a_{<t} and received percepts e_{<t}) that can generate that future, and then picks the action that maximizes expected future rewards.
Let us break this definition down in order to attempt to fully understand it.
e_t = o_t r_t is the "percept" (which consists of the observation o_t and reward r_t) received by the AIXI agent at time step t from the environment (which is unknown and stochastic). Similarly, e_m is the percept received by AIXI at time step m (the last time step where AIXI is active).
r_t + ... + r_m is the sum of rewards from time step t to time step m, so AIXI needs to look into the future to choose its action at time step t.
U denotes a monotone universal Turing machine, and q ranges over all (deterministic) programs on the universal machine U, which receives as input the program q and the sequence of actions a_1...a_m (that is, all actions), and produces the sequence of percepts e_1 e_2 ... e_m. The universal Turing machine U is thus used to "simulate" or compute the environment responses or percepts, given the program q (which "models" the environment) and all actions of the AIXI agent: in this sense, the environment is "computable" (as stated above). Note that, in general, the program which "models" the ''current'' and actual environment (where AIXI needs to act) is unknown because the current environment is also unknown.
length(q) is the length of the program q (which is encoded as a string of bits). Note that 2^(-length(q)) ∈ [0, 1]. Hence, in the definition above, Σ_{q : U(q, a_1...a_m) = o_1 r_1 ... o_m r_m} 2^(-length(q)) should be interpreted as a mixture (in this case, a sum) over all computable environments (which are consistent with the agent's past), each weighted by its complexity 2^(-length(q)). Note that a_1...a_m can also be written as a_1...a_{t-1} a_t...a_m, and a_1...a_{t-1} is the sequence of actions already executed in the environment by the AIXI agent. Similarly, o_1 r_1 ... o_m r_m can also be written as o_1 r_1 ... o_{t-1} r_{t-1} o_t r_t ... o_m r_m, and o_1 r_1 ... o_{t-1} r_{t-1} is the sequence of percepts produced by the environment so far.
Let us now put all these components together in order to understand this equation or definition. At time step t, AIXI chooses the action a_t at which the expectimax expression defined above attains its maximum.
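The definition can be illustrated with a toy, computable expectimax sketch. The assumptions are loud: only two hand-written deterministic "programs" with invented bit lengths, a lifespan of two steps, and no stochasticity (so the inner sums over percepts collapse to single terms); real AIXI ranges over all programs of a universal Turing machine and is incomputable.

```python
# Toy expectimax sketch of the AIXI definition over a hand-enumerated
# set of deterministic "programs". Everything here is invented for
# illustration; real AIXI is incomputable.
from itertools import product

ACTIONS = ["a", "b"]
M = 2  # lifespan: time steps 1..M

# Each "program" maps a full action sequence to a reward sequence;
# its mixture weight is 2^(-length).
def prog_short(actions):   # length 3 bits: rewards action "a" every step
    return [1.0 if a == "a" else 0.0 for a in actions]

def prog_long(actions):    # length 8 bits: rewards action "b" every step
    return [1.0 if a == "b" else 0.0 for a in actions]

programs = [(prog_short, 3), (prog_long, 8)]

def value_of(first_action):
    # With deterministic programs the sums over percepts collapse, so we
    # only maximise total mixture-weighted reward over future actions.
    best = float("-inf")
    for rest in product(ACTIONS, repeat=M - 1):
        actions = [first_action, *rest]
        total = sum(2.0 ** -length * sum(prog(actions))
                    for prog, length in programs)
        best = max(best, total)
    return best

a1 = max(ACTIONS, key=value_of)
print(a1)  # "a": the shorter program dominates the mixture
```

The structure mirrors the definition: an outer arg max over the current action, an inner max over the remaining action sequence, and a complexity-weighted sum over the candidate environment programs.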
Parameters
The parameters to AIXI are the universal Turing machine ''U'' and the agent's lifetime ''m'', which need to be chosen. The latter parameter can be removed by the use of discounting.
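As a sketch of why discounting removes the lifetime parameter: with a geometric discount factor γ ∈ (0, 1), even an infinite reward stream has a finite discounted sum, so no horizon m needs to be fixed. The constant reward of 1 per step below is purely illustrative.

```python
# Sketch: geometric discounting bounds the value of an infinite reward
# stream. With discount factor gamma in (0, 1) and constant reward 1,
# the discounted sum converges to 1 / (1 - gamma).
gamma = 0.9
discounted = sum(gamma ** t for t in range(1000))  # ~ 1 / (1 - gamma)
print(round(discounted, 6))  # 10.0 (truncation error is negligible)
```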
Optimality
AIXI's performance is measured by the expected total number of rewards it receives.
AIXI has been proven to be optimal in the following ways.
* Pareto optimality: there is no other agent that performs at least as well as AIXI in all environments while performing strictly better in at least one environment.
* Balanced Pareto optimality: like Pareto optimality, but considering a weighted sum of environments.
* Self-optimizing: a policy ''p'' is called self-optimizing for an environment μ if the performance of ''p'' approaches the theoretical maximum for μ when the length of the agent's lifetime (not time) goes to infinity. For environment classes where self-optimizing policies exist, AIXI is self-optimizing.
It was later shown by Hutter and
Jan Leike that balanced Pareto optimality is subjective and that any policy can be considered Pareto optimal, which they describe as undermining all previous optimality claims for AIXI.
However, AIXI does have limitations. It is restricted to maximizing rewards based on percepts as opposed to external states. It also assumes it interacts with the environment solely through action and percept channels, preventing it from considering the possibility of being damaged or modified. Colloquially, this means that it doesn't consider itself to be contained by the environment it interacts with. It also assumes the environment is computable.
Computational aspects
Like
Solomonoff induction, AIXI is
incomputable. However, there are computable approximations of it. One such approximation is AIXI''tl'', which performs at least as well as the provably best time ''t'' and space ''l'' limited agent.
Another approximation to AIXI with a restricted environment class is MC-AIXI (FAC-CTW) (which stands for Monte Carlo AIXI FAC-Context-Tree Weighting), which has had some success playing simple games such as partially observable Pac-Man.
See also
* Gödel machine
References
* "Universal Algorithmic Intelligence: A mathematical top→down approach", Marcus Hutter; also in ''Artificial General Intelligence'', eds. B. Goertzel and C. Pennachin, Springer, 2007, pp. 227–290, {{doi|10.1007/978-3-540-68677-4_8}}.
Optimal decisions
Decision theory
Machine learning