AIXI is a theoretical mathematical formalism for artificial general intelligence. It combines Solomonoff induction with sequential decision theory.
AIXI was first proposed by
Marcus Hutter in 2000 and several results regarding AIXI are proved in Hutter's 2005 book ''Universal Artificial Intelligence''.
AIXI is a reinforcement learning (RL) agent. It maximizes the expected total rewards received from the environment. Intuitively, it simultaneously considers every computable hypothesis (or environment). In each time step, it looks at every possible program and evaluates how many rewards that program generates depending on the next action taken. The promised rewards are then weighted by the subjective belief that this program constitutes the true environment. This belief is computed from the length of the program: longer programs are considered less likely, in line with Occam's razor. AIXI then selects the action that has the highest expected total reward in the weighted sum of all these programs.
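The weighting scheme described above can be illustrated with a toy sketch (a drastic simplification, not AIXI itself): a hand-picked, finite set of hypothetical environment "programs", each assigned the weight 2^(-length), with the agent choosing the action whose weighted one-step reward is largest. All program names, lengths and reward tables here are invented for illustration.

```python
# Toy sketch (not Hutter's algorithm): weight a hand-enumerated set of
# hypothetical "environment programs" by 2^(-length) and pick the action
# with the highest weighted one-step reward.

# Each hypothesis: (description, program length in bits, reward it
# predicts for each action). All values are invented for illustration.
hypotheses = [
    ("always-reward-left", 5, {"left": 1.0, "right": 0.0}),
    ("always-reward-right", 7, {"left": 0.0, "right": 1.0}),
    ("reward-both", 12, {"left": 1.0, "right": 1.0}),
]

def expected_reward(action):
    # Shorter programs get exponentially larger weight (Occam's razor).
    return sum(2.0 ** -length * rewards[action]
               for _, length, rewards in hypotheses)

best = max(["left", "right"], key=expected_reward)
print(best)  # "left": the shortest program predicts reward for "left"
```

Note how the 5-bit program dominates the mixture even though two of the three hypotheses reward "right": each extra bit of program length halves a hypothesis's weight.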
Etymology
According to Hutter, the word "AIXI" can have several interpretations. AIXI can stand for AI based on Solomonoff's distribution, denoted by ξ (which is the Greek letter xi), or e.g. it can stand for AI "crossed" (X) with induction (I). There are other interpretations.
Definition
AIXI is a reinforcement learning agent that interacts with some stochastic and unknown but computable environment μ. The interaction proceeds in time steps, from t = 1 to t = m, where m is the lifespan of the AIXI agent. At time step ''t'', the agent chooses an action a_t (e.g. a limb movement) and executes it in the environment, and the environment responds with a "percept" e_t, which consists of an "observation" o_t (e.g., a camera image) and a reward r_t, distributed according to the conditional probability μ(o_t r_t | a_1 o_1 r_1 ... a_{t-1} o_{t-1} r_{t-1} a_t), where a_1 o_1 r_1 ... a_{t-1} o_{t-1} r_{t-1} a_t is the "history" of actions, observations and rewards. The environment μ is thus mathematically represented as a probability distribution over "percepts" (observations and rewards) which depends on the ''full'' history, so there is no Markov assumption (as opposed to other RL algorithms). Note again that this probability distribution is ''unknown'' to the AIXI agent. Furthermore, note again that μ is computable, that is, the observations and rewards received by the agent from the environment μ can be computed by some program (which runs on a Turing machine), given the past actions of the AIXI agent.
The ''only'' goal of the AIXI agent is to maximize the sum r_1 + ... + r_m, that is, the sum of rewards from time step 1 to m.
The AIXI agent is associated with a stochastic policy π : (A × E)* → A, which is the function it uses to choose actions at every time step, where A is the space of all possible actions that AIXI can take and E is the space of all possible "percepts" that can be produced by the environment. The environment (or probability distribution) μ can also be thought of as a stochastic policy (which is a function): μ : (A × E)* × A → E, where the * is the Kleene star operation.
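As a minimal illustration of these two signatures, assuming string-valued actions and deterministic maps (real AIXI policies and environments are stochastic), the types can be sketched in Python:

```python
# Sketch of the two function signatures as Python type aliases. Actions
# and percepts are simplified to plain values, and the stochastic maps
# of AIXI are simplified to deterministic functions.
from typing import Callable, List, Tuple

Action = str
Percept = Tuple[str, float]             # (observation, reward)
History = List[Tuple[Action, Percept]]  # an element of (A x E)*

Policy = Callable[[History], Action]                # pi : (A x E)* -> A
Environment = Callable[[History, Action], Percept]  # mu : (A x E)* x A -> E

# A trivial (invented) environment that rewards repeating the previous action.
def toy_env(history: History, action: Action) -> Percept:
    prev = history[-1][0] if history else None
    return ("ok", 1.0 if action == prev else 0.0)

print(toy_env([], "left"))                       # ('ok', 0.0)
print(toy_env([("left", ("ok", 0.0))], "left"))  # ('ok', 1.0)
```

The `History` alias makes the Kleene star concrete: a finite-length list of (action, percept) pairs of any length, including the empty list.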
In general, at time step t (which ranges from 1 to m), AIXI, having previously executed actions a_1 a_2 ... a_{t-1} (which is often abbreviated in the literature as a_{<t}) and having observed the history of percepts o_1 r_1 ... o_{t-1} r_{t-1} (which can be abbreviated as e_{<t}), chooses and executes in the environment the action a_t, defined as follows:
:a_t := arg max_{a_t} Σ_{o_t r_t} ... max_{a_m} Σ_{o_m r_m} [r_t + ... + r_m] Σ_{q : U(q, a_1...a_m) = o_1 r_1 ... o_m r_m} 2^(-length(q))
or, using parentheses, to disambiguate the precedences:
:a_t := arg max_{a_t} ( Σ_{o_t r_t} ( ... max_{a_m} ( Σ_{o_m r_m} ( [r_t + ... + r_m] Σ_{q : U(q, a_1...a_m) = o_1 r_1 ... o_m r_m} 2^(-length(q)) ) ) ... ) )
Intuitively, in the definition above, AIXI considers the sum of the total reward over all possible "futures" up to m − t time steps ahead (that is, from t to m), weighs each of them by the complexity of programs q (that is, by 2^(-length(q))) consistent with the agent's past (that is, the previously executed actions a_{<t} and received percepts e_{<t}) that can generate that future, and then picks the action that maximizes expected future rewards.
Let us break this definition down in order to attempt to fully understand it.
e_t = o_t r_t is the "percept" (which consists of the observation o_t and reward r_t) received by the AIXI agent at time step t from the environment (which is unknown and stochastic). Similarly, e_m is the percept received by AIXI at time step m (the last time step where AIXI is active).
r_t + ... + r_m is the sum of rewards from time step t to time step m, so AIXI needs to look into the future to choose its action at time step t.
U denotes a monotone universal Turing machine, and q ranges over all (deterministic) programs on the universal machine U, which receives as input the program q and the sequence of actions a_1...a_m (that is, all actions), and produces the sequence of percepts e_1 e_2 ... e_m. The universal Turing machine U is thus used to "simulate" or compute the environment responses or percepts, given the program q (which "models" the environment) and all actions of the AIXI agent: in this sense, the environment is "computable" (as stated above). Note that, in general, the program which "models" the ''current'' and actual environment (where AIXI needs to act) is unknown because the current environment is also unknown.
length(q) is the length of the program q (which is encoded as a string of bits). Note that 2^(-length(q)) ∈ [0, 1]. Hence, in the definition above, Σ_{q : U(q, a_1...a_m) = o_1 r_1 ... o_m r_m} 2^(-length(q)) should be interpreted as a mixture (in this case, a sum) over all computable environments (which are consistent with the agent's past), each weighted by its complexity 2^(-length(q)). Note that a_1...a_m can also be written as a_1...a_{t-1} a_t...a_m, and a_1...a_{t-1} is the sequence of actions already executed in the environment by the AIXI agent. Similarly, o_1 r_1 ... o_m r_m can also be written as o_1 r_1 ... o_{t-1} r_{t-1} o_t r_t ... o_m r_m, and o_1 r_1 ... o_{t-1} r_{t-1} is the sequence of percepts produced by the environment so far.
Let us now put all these components together in order to understand this equation or definition. At time step t, AIXI chooses the action a_t at which the expectimax expression defined above attains its maximum.
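The definition can be illustrated with a toy, computable expectimax sketch. The assumptions are loud: only two hand-written deterministic "programs" with invented bit lengths, a lifespan of two steps, and no stochasticity (so the inner sums over percepts collapse to single terms); real AIXI ranges over all programs of a universal Turing machine and is incomputable.

```python
# Toy expectimax sketch of the AIXI definition over a hand-enumerated
# set of deterministic "programs". Everything here is invented for
# illustration; real AIXI is incomputable.
from itertools import product

ACTIONS = ["a", "b"]
M = 2  # lifespan: time steps 1..M

# Each "program" maps a full action sequence to a reward sequence;
# its mixture weight is 2^(-length).
def prog_short(actions):   # length 3 bits: rewards action "a" every step
    return [1.0 if a == "a" else 0.0 for a in actions]

def prog_long(actions):    # length 8 bits: rewards action "b" every step
    return [1.0 if a == "b" else 0.0 for a in actions]

programs = [(prog_short, 3), (prog_long, 8)]

def value_of(first_action):
    # With deterministic programs the sums over percepts collapse, so we
    # only maximise total mixture-weighted reward over future actions.
    best = float("-inf")
    for rest in product(ACTIONS, repeat=M - 1):
        actions = [first_action, *rest]
        total = sum(2.0 ** -length * sum(prog(actions))
                    for prog, length in programs)
        best = max(best, total)
    return best

a1 = max(ACTIONS, key=value_of)
print(a1)  # "a": the shorter program dominates the mixture
```

The structure mirrors the definition: an outer arg max over the current action, an inner max over the remaining action sequence, and a complexity-weighted sum over the candidate environment programs.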
Parameters
The parameters to AIXI are the universal Turing machine ''U'' and the agent's lifetime ''m'', which need to be chosen. The latter parameter can be removed by the use of discounting.
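As a sketch of why discounting removes the lifetime parameter: with a geometric discount factor γ ∈ (0, 1), even an infinite reward stream has a finite discounted sum, so no horizon m needs to be fixed. The constant reward of 1 per step below is purely illustrative.

```python
# Sketch: geometric discounting bounds the value of an infinite reward
# stream. With discount factor gamma in (0, 1) and constant reward 1,
# the discounted sum converges to 1 / (1 - gamma).
gamma = 0.9
discounted = sum(gamma ** t for t in range(1000))  # ~ 1 / (1 - gamma)
print(round(discounted, 6))  # 10.0 (truncation error is negligible)
```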
Optimality
AIXI's performance is measured by the expected total number of rewards it receives.
AIXI has been proven to be optimal in the following ways.
* Pareto optimality: there is no other agent that performs at least as well as AIXI in all environments while performing strictly better in at least one environment.
* Balanced Pareto optimality: like Pareto optimality, but considering a weighted sum of environments.
* Self-optimizing: a policy ''p'' is called self-optimizing for an environment μ if the performance of ''p'' approaches the theoretical maximum for μ when the length of the agent's lifetime (not time) goes to infinity. For environment classes where self-optimizing policies exist, AIXI is self-optimizing.
It was later shown by Hutter and
Jan Leike that balanced Pareto optimality is subjective and that any policy can be considered Pareto optimal, which they describe as undermining all previous optimality claims for AIXI.
However, AIXI does have limitations. It is restricted to maximizing rewards based on percepts as opposed to external states. It also assumes it interacts with the environment solely through action and percept channels, preventing it from considering the possibility of being damaged or modified. Colloquially, this means that it doesn't consider itself to be contained by the environment it interacts with. It also assumes the environment is computable.
Computational aspects
Like
Solomonoff induction, AIXI is
incomputable. However, there are computable approximations of it. One such approximation is AIXI''tl'', which performs at least as well as the provably best time ''t'' and space ''l'' limited agent.
Another approximation to AIXI with a restricted environment class is MC-AIXI (FAC-CTW) (which stands for Monte Carlo AIXI FAC-Context-Tree Weighting), which has had some success playing simple games such as partially observable Pac-Man.
See also
* Gödel machine
References
* "Universal Algorithmic Intelligence: A mathematical top→down approach", Marcus Hutter; also in ''Artificial General Intelligence'', eds. B. Goertzel and C. Pennachin, Springer, 2007, pp. 227–290, {{doi|10.1007/978-3-540-68677-4_8}}.
Optimal decisions
Decision theory
Machine learning