by Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy Lillicrap, David Silver & Koray Kavukcuoglu
The paper uses asynchronous gradient descent to perform deep reinforcement learning. The authors execute multiple agents in parallel on multiple instances of the environment, which decorrelates the agents' learning processes and makes training more stationary. They observe that parallel agents tend to explore different parts of the environment, so each agent learns a policy quite different from the others. This suggests that a proper asynchronous combination of the parallel agents' updates can produce a better shared global policy. The authors implement this asynchronous reinforcement learning framework on a single multi-core CPU machine, rather than in a distributed setting or on GPUs as in the earlier Gorila framework. This dramatically reduces the computational cost and improves training speed.
Moreover, the changes (gradients) toward the global policy made by multiple actor-learners are less correlated in time than the online updates of a single agent, so the multi-agent strategy also stabilizes learning. The authors use multiple agents to update the controller without the experience replay used in the DQN algorithm, resulting in 1) a more efficient algorithm with lower computational cost, and 2) an algorithm that can also use on-policy methods such as Sarsa and actor-critic.
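As a concrete illustration of the parallel actor-learner idea, here is a minimal Hogwild!-style sketch: several threads run epsilon-greedy one-step Q-learning on a toy two-armed bandit and apply lock-free updates to a shared Q-table. All names and hyperparameters (`toy_env_step`, `N_WORKERS`, and so on) are our own illustrative choices, not the paper's.

```python
import threading
import numpy as np

N_ACTIONS = 2
ALPHA = 0.1            # learning rate for the tabular Q update
STEPS_PER_WORKER = 2000
N_WORKERS = 4

# Shared global Q-table, updated lock-free by all actor-learners.
Q = np.zeros(N_ACTIONS)

def toy_env_step(action, rng):
    # Toy one-step (bandit) environment: action 1 pays an average
    # reward of 1.0, action 0 pays an average reward of 0.0.
    return rng.normal(loc=float(action), scale=0.1)

def actor_learner(seed):
    rng = np.random.default_rng(seed)
    for _ in range(STEPS_PER_WORKER):
        # Epsilon-greedy exploration; each worker has its own RNG,
        # so workers tend to visit different parts of the space.
        if rng.random() < 0.1:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(np.argmax(Q))
        r = toy_env_step(a, rng)
        # One-step Q-learning target for a terminal transition.
        td_error = r - Q[a]
        Q[a] += ALPHA * td_error  # asynchronous, lock-free update

threads = [threading.Thread(target=actor_learner, args=(s,))
           for s in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After training, the shared table should prefer the higher-paying action even though no thread ever takes a lock, which is the essence of the lock-free parallel updates the paper builds on.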
The authors then describe four asynchronous algorithms for reinforcement learning, namely asynchronous one-step Q-learning, asynchronous one-step Sarsa, asynchronous n-step Q-learning, and asynchronous advantage actor-critic (A3C), and report their performance on a range of games. The new algorithms also succeed on a wide variety of continuous motor control problems. Asynchronous training on a 16-core CPU takes only about half the time of the previous GPU implementation, and the algorithms obtain almost linear speedups in learning when using multiple CPU threads compared to a single thread. Lastly, the authors state that their asynchronous methods are complementary to experience replay, the dueling architecture, and other techniques for making the learned agent perform its tasks more accurately.
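The n-step variants above propagate rewards faster by using multi-step targets. A minimal sketch of the n-step return computation, the backward recursion R = r + γR seeded with the bootstrap value V(s_{t+n}) as in the paper's pseudocode (the numeric inputs below are made up for illustration):

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Compute R_t = r_t + gamma * R_{t+1} backwards over a rollout,
    seeded with the bootstrap value V(s_{t+n}) at the final state."""
    R = bootstrap_value
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return list(reversed(returns))

# Illustrative 3-step rollout with made-up rewards and bootstrap value.
rs = n_step_returns([1.0, 0.0, 2.0], bootstrap_value=0.5, gamma=0.9)
```

Each state in the rollout thus receives a target that blends all subsequent observed rewards with a single bootstrapped value estimate, instead of bootstrapping after every step as the one-step methods do.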
This is a well-done study of the authors' new asynchronous parallel reinforcement learning algorithms. The work makes the observation that multiple agents typically explore different parts of the environment. Leveraging the effect of this phenomenon on a global policy, the authors improve on the earlier Gorila DQN framework (Nair et al., 2015) by implementing asynchronous updates from multiple parallel agents. This dramatically sped up training in various games, as the parallel agents share their decorrelated experience to improve learning speed.
Further, because the policies learned by these parallel agents are decorrelated, the new system no longer needs the experience replay memory that Gorila uses to stabilize learning. This simplifies the model and reduces memory use, and a simpler model tends to generalize better.
The authors also report that asynchronous one-step Q-learning and one-step Sarsa achieve superlinear speedups. It would be interesting to see an analysis of why the parallel one-step methods require less data to reach a given level of convergence than their single-threaded counterparts.
The authors not only evaluate convergence speed on various games (including 3D games) against previous synchronous algorithms, they also compare asynchronous versions of two optimization methods, momentum SGD (Recht et al., 2011) and RMSProp (Tieleman and Hinton, 2012), to investigate which performs better. Shared RMSProp proves the most robust and stable, compared to momentum SGD. It would also have been interesting to compare the robustness of the new methods against DQN under these two optimizers, though that may be out of the scope of this paper.
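For reference, the shared RMSProp variant keeps the second-moment statistics g shared across all actor-learner threads rather than per-thread. A single-threaded sketch of the update rule itself (in a real multi-threaded setup, `g` would live in shared memory; the hyperparameter values here are our own illustrative choices):

```python
import numpy as np

def shared_rmsprop_update(theta, grad, g, lr=7e-4, alpha=0.99, eps=1e-8):
    """One RMSProp step: g tracks a moving average of squared gradients.

    In the shared variant, g is common to all threads, so every
    actor-learner's gradients contribute to the same statistics.
    """
    g *= alpha
    g += (1.0 - alpha) * grad * grad          # update shared statistics
    theta -= lr * grad / np.sqrt(g + eps)     # scaled parameter update
    return theta, g

# Illustrative single step on a 2-parameter problem.
theta = np.array([1.0, -2.0])
g = np.zeros(2)
grad = np.array([0.5, -0.5])
theta, g = shared_rmsprop_update(theta, grad, g)
```

Because the statistics are pooled, each thread's step sizes adapt to the gradient magnitudes seen by all threads, which plausibly contributes to the robustness the paper observed.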
The paper also mentions a few future directions for improving the new algorithms. We find the direction of better estimating the state-action Q-value particularly interesting. Methods such as reducing overestimation bias (Van Hasselt et al., 2015; Bellemare et al., 2016) and more accurate Q-value estimation using the dueling architecture (Wang et al., 2015) should provide further improvements and would be complementary to the current paper. A graph comparing the learned Q-value estimates against the true Q-values over the state-action space would have supported this claim nicely; on the other hand, the true Q-values could be expensive and hard to compute.