TD-Gammon

TD-Gammon is a computer backgammon program developed in the 1990s by Gerald Tesauro at IBM's Thomas J. Watson Research Center. Its name comes from the fact that it is an artificial neural net trained by a form of temporal-difference learning, specifically TD-Lambda. It explored strategies that humans had not pursued and led to advances in the theory of correct backgammon play. In 1993, TD-Gammon (version 2.1) was trained with 1.5 million games of self-play, and achieved a level of play just slightly below that of the top human backgammon players of the time. In 1998, during a 100-game series, it was defeated by the world champion by a margin of only 8 points. Its unconventional assessment of some opening strategies has been accepted and adopted by expert players. TD-Gammon is commonly cited as an early success of reinforcement learning and neural networks, and was cited in, for example, the papers introducing deep Q-learning and AlphaGo.


Algorithm for play and learning

During play, TD-Gammon examines on each turn all possible legal moves and all their possible responses (lookahead search), feeds each resulting board position into its evaluation function, and chooses the move that leads to the board position with the highest score. In this respect, TD-Gammon is no different from almost any other computer board-game program. TD-Gammon's innovation was in how it learned its evaluation function. Its learning algorithm consists of updating the weights in its neural net after each turn to reduce the difference between its evaluation of previous turns' board positions and its evaluation of the present turn's board position, hence "temporal-difference learning".

The score of any board position is a set of four numbers reflecting the program's estimate of the likelihood of each possible game result: White wins normally, Black wins normally, White wins a gammon, Black wins a gammon. For the final board position of the game, the algorithm compares with the actual result of the game rather than its own evaluation of that position.

The core of TD-Gammon is a neural network with three layers.
* The input layer has two types of neurons.
** One type codes for the board position. These are non-negative integers ranging from 0 to 15, indicating the number of White or Black checkers at each board location. There are 99 input neurons for each colour, totaling 198 neurons.
** Another type codes for hand-crafted features previously used in ''Neurogammon''. These features encode standard concepts used by human experts, such as "advanced anchor", "blockade strength", "home board strength" and the probability of a "blot" (single checker) being hit.
* The hidden layer contains hidden neurons; later versions used more of them.
* The output layer contains 4 neurons, representing the network's estimate of the probability ("equity") of each outcome from the current board position. The 4 neurons code for: White normal win, White gammon win, Black normal win, Black gammon win. A backgammon (triple-game win) is so rare that Tesauro opted not to represent it.

After each turn, the learning algorithm updates each weight in the neural net according to the following rule:
: w_{t+1} - w_t = \alpha (Y_{t+1} - Y_t) \sum_{k=1}^{t} \lambda^{t-k} \nabla_w Y_k
where:
* \alpha is the learning rate,
* Y_t is the network's output (its evaluation of the board position) at turn t,
* \lambda is the trace-decay parameter, controlling how far back in time credit for the current temporal difference extends,
* \nabla_w Y_k is the gradient of the output at turn k with respect to the weights w.

It was found that small values of \lambda performed roughly as well, while large values degraded performance. Because of this, after 1992, TD-Gammon was trained with \lambda = 0, which reduces the rule to standard one-step TD learning and roughly halved the required computation.
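The update rule can be made concrete with a short sketch. The following Python code is not Tesauro's implementation; it is a minimal illustration under simplifying assumptions: a single sigmoid output (an estimated win probability) instead of four, illustrative layer sizes and hyperparameters, and the sum over past gradients maintained incrementally as per-weight eligibility traces.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TDNet:
    """Minimal TD(lambda) network in the spirit of TD-Gammon's update rule.
    A sketch only: one sigmoid output (estimated win probability) instead of
    four, and illustrative layer sizes and hyperparameters."""

    def __init__(self, n_inputs=198, n_hidden=40, alpha=0.1, lam=0.7, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
        self.W2 = rng.normal(scale=0.1, size=(1, n_hidden))
        self.alpha, self.lam = alpha, lam
        self.reset_traces()

    def reset_traces(self):
        # One eligibility trace per weight, cleared at the start of each game.
        self.e1 = np.zeros_like(self.W1)
        self.e2 = np.zeros_like(self.W2)

    def value(self, x):
        """Plain forward pass, used to score candidate board positions."""
        return float(sigmoid(self.W2 @ sigmoid(self.W1 @ x))[0])

    def evaluate(self, x):
        """Forward pass that caches the activations needed for the gradient of Y_t."""
        self.x = x
        self.h = sigmoid(self.W1 @ x)
        self.y = float(sigmoid(self.W2 @ self.h)[0])
        return self.y

    def td_update(self, target):
        """Apply w_{t+1} - w_t = alpha * (Y_{t+1} - Y_t) * sum_k lambda^(t-k) * grad_w Y_k,
        where `target` is the next evaluation Y_{t+1} (or the actual result at game end)
        and the sum over past gradients is kept incrementally in the traces."""
        # Gradient of the cached output Y_t with respect to the weights.
        dy = self.y * (1.0 - self.y)                      # sigmoid derivative at the output
        grad_W2 = dy * self.h[np.newaxis, :]
        dh = dy * self.W2[0] * self.h * (1.0 - self.h)    # backpropagate to the hidden layer
        grad_W1 = dh[:, np.newaxis] * self.x[np.newaxis, :]
        # Decay the old traces and add the new gradients (the k = t term of the sum).
        self.e1 = self.lam * self.e1 + grad_W1
        self.e2 = self.lam * self.e2 + grad_W2
        # Temporal-difference error between successive evaluations.
        delta = target - self.y
        self.W1 += self.alpha * delta * self.e1
        self.W2 += self.alpha * delta * self.e2
```

In such a setup, evaluate would be called each turn on the position actually reached, then td_update with the following turn's evaluation, and once more at the end of the game with the actual result. Setting lam=0 keeps only the most recent gradient, corresponding to the one-step TD variant used after 1992.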


Development history

Version 1.0 used simple 1-ply search: every possible next move is scored by the neural net, and the highest-scoring move is selected.

Versions 2.0 and 2.1 used 2-ply search:
* Make a 1-ply analysis to remove unlikely moves ("forward pruning").
* Make a 2-ply minimax analysis of only the likely moves: pick the move whose evaluation, probability-weighted over the opponent's 21 possible dice rolls (weighting non-doubles twice as much as doubles), is best. A sketch of this roll weighting appears below.

Versions 3.0 and 3.1 used 3-ply search, using 21^2 = 441 possible dice-roll combinations instead of 21.

The last version, 3.1, was trained specifically for an exhibition match against Malcolm Davis at the 1998 AAAI Hall of Champions. It lost by 8 points, mainly due to one blunder in which TD-Gammon opted to double and got gammoned, costing 32 points.
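The roll weighting and forward pruning above can be sketched as follows. This is not TD-Gammon's code: legal_moves and evaluate are hypothetical placeholders (evaluate is assumed to score a position from the point of view of the player moving at the top level), and keeping five candidates is an arbitrary illustration of the pruning step.

```python
from itertools import combinations_with_replacement

# The 21 distinct backgammon rolls: a non-double such as (3, 1) occurs twice among
# the 36 equally likely ordered outcomes, a double such as (4, 4) only once.
DISTINCT_ROLLS = list(combinations_with_replacement(range(1, 7), 2))

def roll_weight(roll):
    return 1 if roll[0] == roll[1] else 2          # the 21 weights sum to 36

def expected_reply_value(position, opponent, legal_moves, evaluate):
    """Average the value after the opponent's best reply over every possible roll,
    weighting non-doubles twice as heavily as doubles."""
    total = 0.0
    for roll in DISTINCT_ROLLS:
        replies = legal_moves(position, opponent, roll)
        # The opponent picks the reply that is worst for the side that moved first.
        value = min(map(evaluate, replies)) if replies else evaluate(position)
        total += roll_weight(roll) * value
    return total / 36.0

def two_ply_choice(position, player, opponent, roll, legal_moves, evaluate,
                   n_candidates=5):
    """Forward pruning followed by the deeper analysis: keep the moves with the best
    1-ply scores, then re-rank them by the expectation over the opponent's replies."""
    candidates = sorted(legal_moves(position, player, roll), key=evaluate, reverse=True)
    return max(candidates[:n_candidates],
               key=lambda p: expected_reply_value(p, opponent, legal_moves, evaluate))
```

With five candidates kept, the deeper analysis in this sketch evaluates replies for 5 × 21 = 105 roll scenarios per decision, which is why the cheap 1-ply pruning precedes it.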


Experiments and stages of training

Unlike previous neural-net backgammon programs such as Neurogammon (also written by Tesauro), in which an expert trained the program by supplying the "correct" evaluation of each position, TD-Gammon was at first programmed "knowledge-free". In early experimentation, using only a raw board encoding with no human-designed features, TD-Gammon reached a level of play comparable to Neurogammon's: that of an intermediate-level human backgammon player. Even though TD-Gammon discovered insightful features on its own, Tesauro wondered whether its play could be improved by adding hand-designed features like Neurogammon's. Indeed, the self-trained TD-Gammon with expert-designed features soon surpassed all previous computer backgammon programs. It stopped improving after about 1,500,000 games of self-play, using a three-layer neural network with 198 input units encoding expert-designed features, 80 hidden units, and one output unit representing the predicted probability of winning.
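The self-play regimen can be outlined in a few lines. The sketch below is only an illustration under assumptions, not Tesauro's code: it reuses the hypothetical TDNet class sketched earlier, the helpers encode, new_game, legal_moves, roll_dice and game_result are placeholders, and the single network output is read as White's winning probability, so White maximises and Black minimises it.

```python
def encode(position):
    """Hypothetical placeholder: map a board position to the network's input vector."""
    raise NotImplementedError

def self_play_training(net, new_game, legal_moves, roll_dice, game_result,
                       n_games=1_500_000):
    """Outline of TD-Gammon-style self-play: one network evaluates positions for
    both sides, and a TD update follows every move. game_result returns None while
    the game is in progress, 1.0 for a White win and 0.0 for a Black win."""
    for _ in range(n_games):
        net.reset_traces()
        position, player = new_game(), 0          # player 0 = White, 1 = Black
        net.evaluate(encode(position))            # cache Y_0
        while game_result(position) is None:
            moves = legal_moves(position, player, roll_dice())
            if moves:
                # Greedy 1-ply choice: White maximises, Black minimises the
                # network's estimate of White's winning probability.
                pick = max if player == 0 else min
                position = pick(moves, key=lambda p: net.value(encode(p)))
            y_next = net.value(encode(position))
            net.td_update(y_next)                 # move Y_t toward Y_{t+1}
            net.evaluate(encode(position))        # the new position becomes Y_t
            player = 1 - player
        net.td_update(game_result(position))      # final target: the actual result
```

Learning in this loop comes entirely from the evolving evaluations and the final game result; no human-labelled positions enter it, which is what distinguishes the approach from Neurogammon's supervised training.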


Advances in backgammon theory

TD-Gammon's exclusive training through self-play (rather than imitation learning) enabled it to explore strategies that humans previously had not considered or had ruled out erroneously. Its success with unorthodox strategies had a significant impact on the backgammon community.

In late 1991, Bill Robertie, Paul Magriel, and Malcolm Davis were invited to play against TD-Gammon (version 1.0). A total of 51 games were played, with TD-Gammon losing at an average of 0.25 points per game. Robertie found TD-Gammon to be at the level of a competent advanced player, and better than any previous backgammon program. Robertie subsequently wrote about the use of TD-Gammon for backgammon study.

For example, on the opening play, the conventional wisdom was that given a roll of 2-1, 4-1, or 5-1, White should move a single checker from point 6 to point 5. Known as "slotting", this technique trades the risk of a hit for the opportunity to develop an aggressive position. TD-Gammon found that the more conservative play of splitting 24-23 was superior. Tournament players began experimenting with TD-Gammon's move and found success. Within a few years, slotting had disappeared from tournament play, replaced by splitting, though in 2006 it made a reappearance for 2-1.

Backgammon expert Kit Woolsey found that TD-Gammon's positional judgement, especially its weighing of risk against safety, was superior to his own or any human's. TD-Gammon's excellent positional play was undercut by occasional poor endgame play. The endgame requires a more analytical approach, sometimes with extensive lookahead, and TD-Gammon's limitation to two-ply lookahead put a ceiling on what it could achieve in this part of the game. TD-Gammon's strengths and weaknesses were the opposite of those of symbolic artificial intelligence programs and most computer software in general: it was good at matters that require an intuitive "feel" but bad at systematic analysis.

It was also poor at doubling strategy. This is likely because the neural network was trained without the doubling cube; doubling decisions were added by feeding the network's cubeless equity estimates into theoretically based heuristic formulae. This weakness showed in the 1998 exhibition match, in which it played 100 games against Malcolm Davis and a single doubling blunder lost the match.

TD-Gammon was never commercialized or released to the public in any other form, but it inspired commercial backgammon programs based on neural networks, such as ''JellyFish'' (1994) and ''Snowie'' (1998).


See also

*
World Backgammon Federation




Works by Tesauro

* G. Tesauro, ''Optimal doubling in multi-outcome probabilistic games''. IBM Research, unpublished manuscript (1990).


External links


* TD-Gammon at IBM
* TD-Gammon, a draft by Gerald Tesauro on Scholarpedia