Using a method called Upper Confidence Trees (UCT), an extension of an earlier method known as Upper Confidence Bounds (UCB), the different moves within the game tree are treated like different one-armed bandit machines. Each time a one-armed bandit is played it returns a random result, but on average some machines pay out better than others. In exactly the same way, each time a certain move is tried it might lead to a different outcome (depending on all the other moves), but on average some moves are better than others. By striking an appropriate balance between exploring to find out which bandits (moves) are best and exploiting the good ones already found, the Monte Carlo method can be biased towards the more useful moves in the game tree. In this way UCT is able to search the game tree of Go selectively yet randomly, and to discover a good move to make each time. (Strictly speaking, a “multi-armed bandit” with one arm per move is a more accurate analogy for Go than a row of one-armed bandits.)
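To make the idea concrete, here is a minimal sketch in Python of UCB1-style selection driving a game-tree search, which is the essence of UCT. It is an illustration only, not MoGo's code: it assumes a hypothetical game-state object offering legal_moves(), play(move), is_over() and result(player) methods, and it glosses over details (such as switching perspective between the two players) that a real Go program must handle.

import math
import random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state = state                      # game position at this node
        self.parent = parent
        self.move = move                        # move that led to this node
        self.children = []
        self.visits = 0
        self.total_reward = 0.0
        self.untried = list(state.legal_moves())

    def ucb1_child(self, c=1.4):
        # Each child is a bandit arm: exploit its average reward, but add an
        # exploration bonus that grows for rarely visited moves.
        return max(self.children,
                   key=lambda ch: ch.total_reward / ch.visits
                   + c * math.sqrt(math.log(self.visits) / ch.visits))

def uct_search(root_state, player, iterations=10000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend the tree, choosing children by the UCB1 rule.
        while not node.untried and node.children:
            node = node.ucb1_child()
        # 2. Expansion: add one child for a move not yet tried from this node.
        if node.untried:
            move = node.untried.pop(random.randrange(len(node.untried)))
            child = Node(node.state.play(move), parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: finish the game with random moves (the Monte Carlo part).
        state = node.state
        while not state.is_over():
            state = state.play(random.choice(list(state.legal_moves())))
        reward = state.result(player)           # e.g. 1 for a win, 0 for a loss
        # 4. Backpropagation: update the statistics on the path back to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    # Recommend the most-visited move at the root.
    return max(root.children, key=lambda ch: ch.visits).move

The more simulations are run, the more the visit counts concentrate on the genuinely promising moves, which is what allows the search to remain selective even on a board as large as 19x19.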

In 2006 a group of researchers took exactly this approach when developing their Computer Go playing program. Prof Rémi Munos and his student Yizao Wang (a keen Go player) at the Ecole Polytechnique were inspired by the recent work of Rémi Coulom, who had created a hugely successful Monte Carlo Go player called Crazy Stone. They joined forces with another student, Pierre-Arnaud Coquelin, and with Olivier Teytaud and his PhD student Sylvain Gelly at the University of Paris-Sud. This team applied the UCB algorithm to the game tree of Go and, without realising it, independently invented the UCT algorithm at almost the same time as Kocsis and Szepesvári.

By March 2007 MoGo had won a 19x19 Computer Go tournament against strong classical 19x19 programs, and had achieved a respectable ranking in 19x19 games against humans, which attracted interest in these new techniques from outside the Computer Go community. MoGo continued to defy expectations, as Gelly explained: “Interestingly, the improvements made for the 19x19 board turned out to be useful for the 9x9 board as well (while people were claiming that the two games were unrelated), and MoGo in the meantime achieved "master" level (amateur) in 9x9, which also attracted interest outside Computer Go.”

Before long, MoGo was beating all the other programs it played in the Kiseido Go Server tournaments, for several months in a row. Since then, MoGo has had many successes, including winning the Gold Medal for Go at the 2007 Computer Game Olympiad and becoming the first program to defeat a professional player at 9x9 Go. In March 2008 MoGo won a 9x9 game against the winner of the "Tournoi de Paris", the most important tournament in Europe, and Catalin Taranu (5th dan professional) claimed that MoGo was close to dan level at 19x19.

According to Olivier Teytaud, one of the current developers of MoGo, “MoGo does not contain a lot of Go expertise and is however quite strong; this shows that the approach is quite general... We are now applying UCT and related tools to other games and to some combinatorial optimisation problems.”

MoGo used UCT to bias its Monte Carlo search – a strategy that enabled the program to win the CGOS tournament, in which games were played on the smaller 9x9 grid, in August 2006. MoGo was first introduced at the On-line Trading of Exploration and Exploitation Workshop 2006 (sponsored by PASCAL, the Network of Excellence on Pattern Analysis, Statistical Modelling and Computational Learning). The same team went on to use a Monte Carlo, bandit-based approach to win the 2006 “PASCAL Exploration vs. Exploitation Challenge” set by TouchClarity (now Omniture TouchClarity) – see box.

Despite the successes of the MoGo team, other researchers remained sceptical about their approach. According to Sylvain Gelly, “at that time, the ideas we were using in MoGo were considered ‘interesting’ in the Computer Go community only for the 9x9 Go, but were considered doomed by the fact that they were not successful on the "real" game. 9x9 Go was considered to be another game unrelated to 19x19 Go.” But the MoGo team did not lose hope and Sylvain Gelly spent the next year improving MoGo (with methods such as Rapid Action Value Estimation).

From game-playing to webpage optimisation, Monte Carlo bandit methods seem to provide a powerful new technique. In the words of Rémi Munos, “I think the idea of bandit algorithms used recursively for performing efficient tree search (such as UCT) is a new research direction that is definitely worth investigating both from theoretical and practical perspectives.”

TouchClarity was recently acquired by Omniture, specialists in online business optimisation. The development of MoGo has also been a fruitful avenue of research for all involved. In 2007 Yizao Wang and Sylvain Gelly won the prize for best student paper at the IEEE Symposium on Computational Intelligence and Games, and Wang also received a “prix d'option” honour from Ecole Polytechnique for his work on the project. Gelly was awarded his PhD on “A Contribution to Reinforcement Learning; Application to Computer Go” in September 2007 and now works at Google, Zurich. Coulom continues to improve Crazy Stone, which, after several recent wins against MoGo, is now perhaps its main rival in Computer Go.

Rémi Munos summed up the MoGo work with enthusiasm: “From my point of view, this project (or adventure...) has been really a wonderful collaborative work which has led to some advances not only in the field of Go, but also which has opened new perspectives in other fields as well, such as in optimal control, large scale optimisation problems, and complex sequential decision problems, in particular when several agents interact, such as in games.”

Resources:
INRIA article and links for MoGo: http://www.inria.fr/bordeaux/ressources-1/computer-culture/mogo-champion-program-for-go-games/
PASCAL: http://www.pascal-network.org/
PASCAL Exploration-Exploitation Challenge & Workshop: http://www.pascal-network.org/Challenges/EEC/
On-line Trading of Exploration and Exploitation Workshop: http://www.homepages.ucl.ac.uk/~ucabzhu/OTEE.htm
Omniture TouchClarity: http://www.omniture.com/products/optimization/touchclarity

Omniture TouchClarity focuses on optimising the online experience for users – it decides what should and should not be shown within web pages, based on the interactions of past users. The technology is being used by most UK banks (and many other clients) to provide tailored promotions to visitors based on their clicks within the pages. The problem can be regarded as similar to a game: the human players (users) make their moves by clicking on certain parts of web pages, and the computer responds with its move of changing the content, making it more likely that users will choose options that interest them (and are profitable for the company). So the same kind of technology used for playing Go can also be applied to this problem. Chief Scientist Leonard Newnham explained, “the PASCAL challenge allowed us to revisit the problem from the ground up… discussions with PASCAL members helped us to refine our algorithms.”
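As an illustration of this framing (and not of TouchClarity's actual algorithm), the short Python sketch below treats "which promotion to show" as a multi-armed bandit and picks content with the same UCB1 rule described earlier; the promotion names and click-through rates are invented for the example.

import math
import random

promotions = ["savings_offer", "mortgage_offer", "credit_card_offer"]
# True click-through rates, unknown to the algorithm (invented for this sketch).
true_click_rate = {"savings_offer": 0.04,
                   "mortgage_offer": 0.02,
                   "credit_card_offer": 0.07}

shows = {p: 0 for p in promotions}
clicks = {p: 0.0 for p in promotions}

def choose_promotion(t):
    # Show each promotion once, then pick by UCB1: observed click rate plus an
    # exploration bonus for promotions that have been shown only rarely.
    for p in promotions:
        if shows[p] == 0:
            return p
    return max(promotions,
               key=lambda p: clicks[p] / shows[p]
               + math.sqrt(2 * math.log(t) / shows[p]))

for t in range(1, 10001):                            # simulate 10,000 page visits
    p = choose_promotion(t)
    clicked = random.random() < true_click_rate[p]   # the user's "move"
    shows[p] += 1
    clicks[p] += clicked

print({p: round(clicks[p] / shows[p], 3) for p in promotions})

Over the simulated visits the algorithm shifts most of the traffic towards whichever promotion turns out to attract the most clicks, while still occasionally re-testing the others.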