AlphaGo is born

Google DeepMind is making the front page of Nature (again) with a new AI for Go, named AlphaGo (see also this Nature YouTube video). Computer Go is a notoriously difficult problem, and until now AIs fared very badly against good human players. In their paper the DeepMind team reports that AlphaGo won 5-0 against the best European player, Fan Hui! This is truly a jump in performance: the previous best AI, Crazy Stone, needed several handicap stones to compete with pro players. Congratulations to the DeepMind team for this breakthrough!

How did they do it? From a very high-level point of view, they simply combined the previous state of the art (Monte Carlo Tree Search) with the new deep learning techniques. Recall that MCTS is a technique inspired by multi-armed bandits to efficiently explore the tree of possible action sequences in a game. For more details see this very nice survey by my PhD advisor Rémi Munos: From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning. Now in MCTS there are two key elements beyond the bandit part (i.e., how to deal with exploration vs. exploitation). First, one needs a way to combine all the information collected to produce a value function for each state (this value is the key term in the upper confidence bound used by the bandit algorithm). Second, one needs a reasonable random policy to carry out the random rollouts once the bandit strategy has gone deep enough in the tree. In AlphaGo the initial random rollout strategy is learned via supervised deep learning on a dataset of human expert games, and the value function is learned (online, with MCTS guiding the search) via convolutional neural networks (in some sense this corresponds to a very natural inductive bias for this game).
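To make the two ingredients concrete, here is a minimal sketch of plain MCTS with a UCB1 bandit rule and uniformly random rollouts. It is not AlphaGo: the game is a toy version of Nim (take 1 to 3 stones, whoever takes the last stone wins), and the names `Node`, `ucb1`, `rollout`, `mcts_best_move` and the exploration constant `c=1.4` are all assumptions made for illustration. In AlphaGo the uniform rollout policy and the empirical win rate used below would be replaced by learned networks.

```python
import math
import random


class Node:
    """One state of toy Nim: `pile` stones remain for the player to move."""

    def __init__(self, pile, parent=None, move=None):
        self.pile = pile
        self.parent = parent
        self.move = move                  # move that led from parent to this node
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= pile]
        self.visits = 0
        self.wins = 0.0                   # wins for the player who moved INTO this node


def ucb1(child, parent_visits, c=1.4):
    """Empirical value plus an exploration bonus (upper confidence bound)."""
    mean = child.wins / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def rollout(pile, rng):
    """Play uniformly random moves; True if the player to move at `pile` wins."""
    turn = 0                              # 0 = the player to move at the start
    while pile > 0:
        pile -= rng.randint(1, min(3, pile))
        if pile == 0:
            return turn == 0              # whoever took the last stone wins
        turn ^= 1
    return False                          # empty pile: the player to move already lost


def mcts_best_move(root_pile, iterations=1000, seed=0):
    rng = random.Random(seed)
    root = Node(root_pile)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend with the bandit rule while fully expanded.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ucb1(ch, node.visits))
        # 2. Expansion: add one child for a move not tried yet.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.pile - move, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Rollout: random playout from the new node.
        mover_wins = rollout(node.pile, rng)
        # 4. Backpropagation: flip the result at each level (two-player game).
        result = 0.0 if mover_wins else 1.0
        while node is not None:
            node.visits += 1
            node.wins += result
            result = 1.0 - result
            node = node.parent
    # Recommend the most visited move at the root.
    return max(root.children, key=lambda ch: ch.visits).move


if __name__ == "__main__":
    print(mcts_best_move(2, iterations=200))
```

With a pile of 2 the search settles on taking both stones, since that child always wins while the alternative always loses. For larger piles the recommendation depends on the random rollouts and on the number of iterations.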

Of course there is much more to AlphaGo than what I described above and you are invited to take a look at the paper (see this reddit thread to find the paper)!

3 Responses to "AlphaGo is born"

By 192.168.l.0 May 30, 2017 - 10:51 am

The value function is learnt on the self-play games, and both networks (policy network and value function network) are used in the MCTS, one for expansions, and the other for biasing the MCTS.

By 192.168 10.1 May 30, 2017 - 10:50 am

There is a neural network for learning the value function, but also a neural network as a policy – it is trained both by imitation (on expert games) and by self-play.

By Teytaud January 27, 2016 - 2:26 pm

There is a neural network for learning the value function, but also a neural network as a policy – it is trained both by imitation (on expert games) and by self-play.

The value function is learnt on the self-play games, and both networks (policy network and value function network) are used in the MCTS, one for expansions, and the other for biasing the MCTS.
