by **Ziyu Wang, Nando de Freitas & Marc Lanctot**

arXiv, 2016

This paper is motivated by recent successes in deep reinforcement learning that use the advantage function, a measure of the relative importance of each action from a finite discrete set at a given state. The authors propose a dueling neural network architecture for model-free reinforcement learning that separates the estimation of the state value function from that of the state-dependent action advantage function, combining the two to approximate the Q-function.

The proposed framework is particularly useful in situations where, in some states, the choice of action has little or no effect on the environment. This happens when the action space is large or contains redundancy. In such situations, the dueling network approximates the Q-function more accurately than other state-of-the-art Q-learning approaches. Another benefit of the architecture is that it can be easily combined with other RL algorithms.

The architecture splits the network into two streams. The first, which estimates the state value function, handles situations where estimating the value of each individual action is unnecessary. The second, which estimates the advantage function, is useful when the network needs to express a preference over actions in a given state. The outputs of the two streams are combined by an aggregating layer to produce the Q-function. This aggregator leads to automatic estimation of both the value and advantage functions through back-propagation, without any need for extra supervision.
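The aggregation described above can be sketched in a few lines. A naive sum Q = V + A is unidentifiable (a constant can be shifted between the two terms without changing Q), so the paper subtracts the mean advantage, forcing the advantages to average to zero. The function name and example numbers below are illustrative, not from the paper:

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine a scalar state value V(s) and per-action advantages A(s, a)
    into Q-values via the mean-subtracted aggregator from the dueling paper.

    Subtracting the mean advantage makes the decomposition identifiable:
    shifting all advantages by a constant no longer changes the Q-values.
    """
    return value + (advantages - advantages.mean())

# Hypothetical stream outputs for one state with four actions.
v = 2.0
a = np.array([1.0, -1.0, 0.5, -0.5])
q = dueling_q(v, a)
# The Q-values average to V(s), and the greedy action under Q matches
# the greedy action under A.
```

Because the whole expression is differentiable, gradients from the Q-learning loss flow back through the aggregator into both streams, which is how the two estimates emerge without separate supervision.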

The performance of the proposed dueling network is compared against baselines in two settings. First, the authors evaluate the learned Q-values in a simple corridor environment, where the true Q-values can be computed exactly and used as a reference. In this environment, they compare the dueling network to a single-stream architecture with the same number of parameters on a policy-evaluation task. Increasing the number of possible actions widens the gap between the two methods, in favor of the dueling network. Second, in a series of Atari games, the dueling network is trained with the DDQN algorithm, so the authors compare it against state-of-the-art DDQN results, again with the same number of parameters (assuming equal capacity). The experiments show consistent improvements from the dueling network.

Although it is a novelty of the paper that the value and advantage functions are estimated "automatically" by back-propagation through the proposed aggregator, it is not entirely clear how this back-propagation yields estimates of the two separate functions. Additional explanation of the aggregator's properties, or a mathematical proof, would be useful. Furthermore, the paper claims that the proposed network is complementary to other Q-network algorithms such as DQN and DDQN, and that any modification to those methods (such as replay memories) can be applied here as well. It would be interesting to see whether this claim holds in practice. In addition, some detailed commentary on the games where the dueling network did particularly well or poorly against DDQN or human players would be useful. Why is it so good at Atlantis and Breakout, but so poor at Freeway and Asteroids? Any common features or patterns among these games might highlight the kinds of situations in which the dueling network excels. It would also be useful to know in which other situations (beyond large action spaces) this dueling architecture can be applied.