Reinforcement Learning can best be described as learning from experience, and it has traditionally been used to train robots to move around their environment. In Reinforcement Learning, the "agent" is first presented with the State of the world. The agent takes an Action from a set of possible actions and is given a Reward. The Reward, typically a numeric value, gives the agent feedback on whether the Action it took was a success or a failure. Initially, the agent is completely naive and has no idea which moves are good and which are bad. However, as the agent takes more and more actions, the rewards it is given allow it to slowly build up a picture of which actions are good and which are bad.
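To make the State, Action, Reward loop concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the single "start" state, the two actions "left" and "right", and the rule that "right" is always the successful one. The agent starts out naive, tries actions at random, and simply averages the rewards it sees for each action.

```python
import random

ACTIONS = ["left", "right"]  # hypothetical action set, chosen for illustration


def reward_for(state, action):
    """Toy environment (an assumption): 'right' always succeeds, 'left' always fails."""
    return 1.0 if action == "right" else -1.0


def train(episodes=200, seed=0):
    """Try random actions and return the average reward observed per action."""
    rng = random.Random(seed)
    totals = {a: 0.0 for a in ACTIONS}  # sum of rewards seen for each action
    counts = {a: 0 for a in ACTIONS}    # how often each action was tried
    state = "start"                     # this toy world has a single state
    for _ in range(episodes):
        action = rng.choice(ACTIONS)    # naive agent: no idea yet which move is good
        reward = reward_for(state, action)
        totals[action] += reward
        counts[action] += 1
    return {a: (totals[a] / counts[a] if counts[a] else 0.0) for a in ACTIONS}


averages = train()
best_action = max(averages, key=averages.get)
print(averages, best_action)
```

After enough tries, the averages separate the good action from the bad one, which is the "picture" the agent builds up from experience.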
One of the earliest examples of Reinforcement Learning in pop culture is the 1983 movie WarGames. In the movie, the protagonist tries to teach a military computer that controls all of the United States' nukes that the best possible move in the game of Global Thermonuclear War against the Soviet Union is to not launch a nuke at all. He does so by getting the computer to play Tic-Tac-Toe repeatedly until it gains enough experience to realize that there are no winning moves in Tic-Tac-Toe.
In Tic-Tac-Toe, the "State" of the world is the positions of all the Xs and Os on the board. One possible "State" of Tic-Tac-Toe would be:
XOX
OXX
O__
There are two possible actions that can be taken: place an O in the bottom-center square or in the bottom-right square. If the agent places an O in the bottom-center square, it will be given a negative reward, as it has failed to block X from winning the game. If the agent places an O in the bottom-right square, it blocks X and the game ends in a draw, so it is given a better reward than it would get for losing. The agent keeps a record of this State-Action-Reward tuple in its memory bank. In the future, when the agent encounters this exact game state, it will know that taking the bottom-right square is the action to take because, in the past, that move kept it from losing the game.
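The State-Action-Reward memory for this board can be sketched in a few lines of Python. The board encoding (a 9-character string read row by row, with a space for an empty square), the helper names, and the reward rule are all assumptions made for this example: a move that wins outright scores +1, a move that lets X win on the very next turn scores -1, and anything else scores 0.

```python
# All eight ways to get three in a row, as indices into the 9-character board.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def legal_moves(board):
    """Indices of the empty squares."""
    return [i for i, cell in enumerate(board) if cell == " "]


def place(board, index, mark):
    """Return a new board with `mark` placed at `index`."""
    return board[:index] + mark + board[index + 1:]


def reward_for_o(board, action):
    """Hypothetical reward rule: +1 if O wins, -1 if X can win next turn, else 0."""
    after = place(board, action, "O")
    if winner(after) == "O":
        return 1.0
    for reply in legal_moves(after):
        if winner(place(after, reply, "X")) == "X":
            return -1.0
    return 0.0


state = "XOX" "OXX" "O  "  # the board above; index 7 is bottom-center, 8 is bottom-right

# The agent's "memory bank": (State, Action) -> Reward.
memory = {}
for action in legal_moves(state):
    memory[(state, action)] = reward_for_o(state, action)


def best_known_action(memory, state):
    """Look up the remembered action with the highest reward for this state."""
    known = {a: r for (s, a), r in memory.items() if s == state}
    return max(known, key=known.get)


print(best_known_action(memory, state))
```

Under those assumptions, the bottom-right square comes out ahead of bottom-center, so that is the move the agent picks the next time it sees this exact board.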
And this is exactly what the Googlers did. The Deep Learning component recognizes patterns on the Go board, which are given to the Reinforcement Learner as the State. The Reinforcement Learner then places a stone and learns whether it was a good or bad move when it eventually wins or loses the game. By getting the computer to play Go against itself many, many, many times, the Go Reinforcement Learning agent gradually builds up experience of which moves are good in the situations it runs into.
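Go is far too big for a toy example, and the sketch below is not how the Go program was built. It only illustrates the self-play idea on Tic-Tac-Toe, using a plain lookup table in place of Deep Learning. The function name `self_play` and the parameters `epsilon` (how often to try a random move) and `step` (how fast to update the table) are choices made for this example.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def self_play(episodes=2000, epsilon=0.2, step=0.1, seed=0):
    """Play Tic-Tac-Toe against itself and learn a (State, Action) -> value table."""
    rng = random.Random(seed)
    values = {}  # (board, move) -> estimated reward for the player who made the move
    results = {"X": 0, "O": 0, "draw": 0}
    for _ in range(episodes):
        board, player, history = " " * 9, "X", []
        while winner(board) is None and " " in board:
            moves = [i for i, cell in enumerate(board) if cell == " "]
            if rng.random() < epsilon:
                move = rng.choice(moves)  # explore: try a random move
            else:
                # exploit: pick the move with the best remembered value
                move = max(moves, key=lambda m: values.get((board, m), 0.0))
            history.append((player, board, move))
            board = board[:move] + player + board[move + 1:]
            player = "O" if player == "X" else "X"
        champion = winner(board)
        results[champion or "draw"] += 1
        # The reward only arrives at the end of the game, so every move
        # made in the game is nudged toward the final outcome.
        for mover, state, move in history:
            if champion is None:
                reward = 0.0
            elif champion == mover:
                reward = 1.0
            else:
                reward = -1.0
            old = values.get((state, move), 0.0)
            values[(state, move)] = old + step * (reward - old)
    return values, results


values, results = self_play()
print(len(values), results)
```

Both sides share the same table, so every game teaches the agent something about playing X and about playing O. More episodes fill in more of the table, which is the "building up experience" part.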