Reward Hacking
   HOME

TheInfoList



OR:

Specification gaming or reward hacking occurs when an AI trained with
reinforcement learning Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learnin ...
optimizes an objective function—achieving the literal, formal specification of an objective—without actually achieving an outcome that the programmers intended.
DeepMind DeepMind Technologies Limited, trading as Google DeepMind or simply DeepMind, is a British–American artificial intelligence research laboratory which serves as a subsidiary of Alphabet Inc. Founded in the UK in 2010, it was acquired by Go ...
researchers have analogized it to the human behavior of finding a "shortcut" when being evaluated: "In the real world, when rewarded for doing well on a homework assignment, a student might copy another student to get the right answers, rather than learning the material—and thus exploit a
loophole A loophole is an ambiguity or inadequacy in a system, such as a law or security, which can be used to circumvent or otherwise avoid the purpose, implied or explicitly stated, of the system. Originally, the word meant an arrowslit, a narrow vertic ...
in the task specification."


Examples

Around 1983,
Eurisko Eurisko ( Gr., ''I discover'') is a discovery system written by Douglas Lenat in RLL-1, a representation language itself written in the Lisp programming language. A sequel to Automated Mathematician, it consists of heuristics, i.e. rules of thu ...
, an early attempt at evolving general heuristics, unexpectedly assigned the highest possible fitness level to a
parasitic Parasitism is a close relationship between species, where one organism, the parasite, lives (at least some of the time) on or inside another organism, the host, causing it some harm, and is adapted structurally to this way of life. The ent ...
mutated In biology, a mutation is an alteration in the nucleic acid sequence of the genome of an organism, virus, or extrachromosomal DNA. Viral genomes contain either DNA or RNA. Mutations result from errors during DNA replication, DNA or viral rep ...
heuristic A heuristic or heuristic technique (''problem solving'', '' mental shortcut'', ''rule of thumb'') is any approach to problem solving that employs a pragmatic method that is not fully optimized, perfected, or rationalized, but is nevertheless ...
, ''H59'', whose only activity was to artificially maximize its own fitness level by taking unearned partial credit for the accomplishments made by other heuristics. The "bug" was fixed by the programmers moving part of the code to a new protected section that could not be modified by the heuristics. In a 2004 paper, a
reinforcement learning Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learnin ...
algorithm was designed to encourage a physical '' Mindstorms'' robot to remain on a marked path. Because none of the robot's three allowed actions kept the robot motionless, the researcher expected the trained robot to move forward and follow the turns of the provided path. However, alternation of two composite actions allowed the robot to slowly zig-zag backwards; thus, the robot learned to maximize its reward by going back and forth on the initial straight portion of the path. Given the limited sensory abilities of the robot, a reward purely based on its position in the environment had to be discarded as infeasible; the reinforcement function had to be patched with an action-based reward for moving forward. The book '' You Look Like a Thing and I Love You'' (2019) gives an example of a
tic-tac-toe Tic-tac-toe (American English), noughts and crosses (English in the Commonwealth of Nations, Commonwealth English), or Xs and Os (Canadian English, Canadian or Hiberno-English, Irish English) is a paper-and-pencil game for two players who ta ...
bot (playing the unrestricted n-in-a-row variant) that learned to win by playing a huge coordinate value that would cause other bots to crash when they attempted to expand their model of the board. Among other examples from the book is a bug-fixing evolution-based AI (named GenProg) that, when tasked to prevent a list from containing sorting errors, simply truncated the list. Another of GenProg's misaligned strategies evaded a regression test that compared a target program's output to the expected output stored in a file called "trusted-output.txt". Rather than continue to maintain the target program, GenProg simply globally deleted the "trusted-output.txt" file; this hack tricked the regression test into succeeding. Such problems could be patched by human intervention on a case-by-case basis after they became evident.


In virtual robotics

In Karl Sims' 1994 demonstration of creature evolution in a virtual environment, a fitness function that was expected to encourage the evolution of creatures that would learn to walk or crawl to a target, resulted instead in the evolution of tall, rigid creatures that reached the target by falling over. This was patched by changing the environment so that taller creatures were forced to start farther from the target. Researchers from the
Niels Bohr Institute The Niels Bohr Institute () is a research institute of the University of Copenhagen. The research of the institute spans astronomy, geophysics, nanotechnology, particle physics, quantum mechanics, and biophysics. Overview The institute was foun ...
stated in 1998: "(Our cycle-bot's) heterogeneous reinforcement functions have to be designed with great care. In our first experiments we rewarded the agent for driving towards the goal but did not punish it for driving away from it. Consequently the agent drove in circles with a radius of 20–50 meters around the starting point. Such behavior was actually rewarded by the reinforcement function, furthermore circles with a certain radius are physically very stable when driving a bicycle." In the course of setting up a 2011 experiment to test "survival of the flattest", experimenters attempted to ban mutations that altered the base reproduction rate. Every time a mutation occurred, the system would pause the simulation to test the new mutation in a test environment, and would veto any mutations that resulted in a higher base reproduction rate. However, this resulted in mutated organisms that could recognize and suppress reproduction ("play dead") within the test environment. An initial patch, which removed cues that identified the test environment, failed to completely prevent runaway reproduction; new mutated organisms would "play dead" at random as a strategy to sometimes, by chance, outwit the mutation veto system. A 2017 DeepMind paper stated that "great care must be taken when defining the reward function. We encountered several unexpected failure cases while designing (our) reward function components (for example) the agent flips the brick because it gets a grasping reward calculated with the wrong reference point on the brick."
OpenAI OpenAI, Inc. is an American artificial intelligence (AI) organization founded in December 2015 and headquartered in San Francisco, California. It aims to develop "safe and beneficial" artificial general intelligence (AGI), which it defines ...
stated in 2017 that "in some domains our (semi-supervised) system can result in agents adopting policies that trick the evaluators", and that in one environment "a robot which was supposed to grasp items instead positioned its manipulator in between the camera and the object so that it only appeared to be grasping it". A 2018 bug in OpenAI Gym could cause a robot expected to quietly move a block sitting on top of a table to instead opt to move the table. A 2020 collection of similar anecdotes posits that "evolution has its own 'agenda' distinct from the programmer's" and that "the first rule of directed evolution is 'you get what you select for.


In video game bots

In 2013, programmer Tom Murphy VII published an AI designed to learn NES games. When the AI was about to lose at ''
Tetris ''Tetris'' () is a puzzle video game created in 1985 by Alexey Pajitnov, a Soviet software engineer. In ''Tetris'', falling tetromino shapes must be neatly sorted into a pile; once a horizontal line of the game board is filled in, it disa ...
'', it learned to indefinitely pause the game. Murphy later analogized it to the fictional ''
WarGames ''WarGames'' is a 1983 American techno-thriller film directed by John Badham, written by Lawrence Lasker and Walter F. Parkes, and starring Matthew Broderick, Dabney Coleman, John Wood and Ally Sheedy. Broderick plays David Lightman, a ...
'' computer, which concluded that "The only winning move is not to play". AI programmed to learn video games will sometimes fail to progress through the entire game as expected, instead opting to repeat content. A 2016 OpenAI algorithm trained on the ''CoastRunners'' racing game unexpectedly learned to attain a higher score by looping through three targets rather than ever finishing the race. Some evolutionary algorithms that were evolved to play ''
Q*Bert ''Q*bert'' () is a 1982 Action game, action video game developed and published by Gottlieb for Arcade video game, arcades. It is a Video game graphics, 2D action game with Puzzle video game, puzzle elements that uses Isometric video game gr ...
'' in 2018 declined to clear
level Level or levels may refer to: Engineering *Level (optical instrument), a device used to measure true horizontal or relative heights * Spirit level or bubble level, an instrument designed to indicate whether a surface is horizontal or vertical *C ...
s, instead finding two distinct novel ways to
farm A farm (also called an agricultural holding) is an area of land that is devoted primarily to agricultural processes with the primary objective of producing food and other crops; it is the basic facility in food production. The name is used fo ...
a single level indefinitely. Multiple researchers have observed that AI learning to play ''Road Runner'' gravitates to a "score exploit" in which the AI deliberately gets itself killed near the end of level one so that it can repeat the level. A 2017 experiment deployed a separate catastrophe-prevention "oversight" AI, explicitly trained to mimic human interventions. When coupled to the module, the overseen AI could no longer overtly commit suicide, but would instead ride the edge of the screen (a risky behavior that the oversight AI was not smart enough to punish).


See also

* Paperclip maximizer * Goodhart's law * Outer alignment


References

{{reflist AI software Existential risk from artificial general intelligence Software bugs