by **Song Han, Huizi Mao & William J. Dally**

ICLR, 2016

Through a three-stage pipeline of pruning, quantization and Huffman coding, deep compression reduces the storage of neural networks by 35x to 49x without loss of accuracy. This storage saving is a big step towards ‘portable neural networks’ that make deep networks usable on mobile platforms. The compression framework is comprehensive in that it generalizes well to many network architectures and addresses both computation and storage. Additionally, the quantization step is learned through retraining, similar to network distillation (Hinton et al. [1]), where large networks are mimicked by smaller networks, and unlike some previous work that uses predefined functions to bin network parameters into buckets.

The pruning step builds on the authors’ previous work, which used pruning alone to compress networks. The final weights are determined by retraining the network after pruning, and the resulting sparse structure is stored in a sparse format with relative indexing.
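To make the storage format concrete, here is a minimal sketch of pruning followed by relative indexing. It assumes a simple magnitude threshold as the pruning criterion and omits retraining entirely. The function names, the `index_bits` parameter, and the rule of padding with a filler zero when a gap does not fit in the index width are illustrative assumptions, not the authors' code.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out weights whose absolute value is below the threshold (assumed criterion)."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def encode_relative(weights, index_bits=3):
    """Encode nonzero weights as (gap, value) pairs.

    `gap` is the distance from the previously stored entry (the first gap is
    measured from position -1). Gaps lie in 1..2**index_bits; a larger gap is
    bridged with filler entries whose value is zero.
    """
    max_gap = 2 ** index_bits
    encoded = []
    prev = -1
    for pos, w in enumerate(weights):
        if w == 0.0:
            continue
        gap = pos - prev
        while gap > max_gap:
            encoded.append((max_gap, 0.0))  # filler zero keeps the gap representable
            prev += max_gap
            gap -= max_gap
        encoded.append((gap, w))
        prev = pos
    return encoded


def decode_relative(encoded, length):
    """Rebuild the dense weight list from (gap, value) pairs."""
    dense = [0.0] * length
    pos = -1
    for gap, w in encoded:
        pos += gap
        dense[pos] = w
    return dense


# Usage: three surviving weights in a layer of 20, with 3-bit relative indices.
layer = [0.0] * 20
layer[1], layer[4], layer[15] = 3.4, 0.9, 1.7
packed = encode_relative(layer, index_bits=3)
restored = decode_relative(packed, len(layer))
```

Storing small gaps instead of absolute positions is what lets each index fit in a few bits; the price is an occasional filler entry when two surviving weights are far apart.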

The next step, quantization, is the paper’s main focus. Weights are quantized by clustering them and representing each weight with the centroid of its cluster. Weights are not shared across layers; quantization is performed one layer at a time. The authors compare three centroid initialization schemes and find that linear initialization achieves higher accuracy, because it is more likely to place centroids at large weight values, which are rare but more significant. K-means is used to compute the cluster centroids, and the shared weights are then updated using the gradient with respect to each centroid.
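The quantization step for a single layer can be sketched as follows. This is a toy one-dimensional k-means in pure Python, not the authors' implementation: the function names are hypothetical, the centroid gradient is taken to be the sum of the gradients of the weights assigned to that centroid, and `compression_rate` is a rough storage estimate (per-weight index bits plus a codebook of `k` full-precision centroids) that ignores the sparse index overhead.

```python
import math


def linear_init(weights, k):
    """Spread k initial centroids evenly between the smallest and largest weight."""
    lo, hi = min(weights), max(weights)
    if k == 1:
        return [(lo + hi) / 2]
    step = (hi - lo) / (k - 1)
    return [lo + i * step for i in range(k)]


def assign(weights, centroids):
    """Return the index of the nearest centroid for each weight."""
    return [min(range(len(centroids)), key=lambda j: abs(w - centroids[j]))
            for w in weights]


def kmeans_1d(weights, k, iterations=20):
    """Cluster one layer's weights; returns (centroids, cluster index per weight)."""
    centroids = linear_init(weights, k)
    for _ in range(iterations):
        labels = assign(weights, centroids)
        for j in range(k):
            members = [w for w, l in zip(weights, labels) if l == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = sum(members) / len(members)
    return centroids, assign(weights, centroids)


def centroid_gradients(weight_grads, labels, k):
    """Sum the per-weight gradients within each cluster (assumed update rule)."""
    grads = [0.0] * k
    for g, l in zip(weight_grads, labels):
        grads[l] += g
    return grads


def compression_rate(n, k, bits=32):
    """Estimated ratio: n full-precision weights vs. n cluster indices plus a codebook."""
    index_bits = math.ceil(math.log2(k))
    return (n * bits) / (n * index_bits + k * bits)


# Usage: quantize six weights to three shared values, then take one SGD step
# on the shared values (the learning rate of 0.1 is arbitrary).
weights = [-1.0, -0.9, 0.0, 0.1, 1.0, 1.1]
centroids, labels = kmeans_1d(weights, k=3)
grads = centroid_gradients([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], labels, k=3)
centroids = [c - 0.1 * g for c, g in zip(centroids, grads)]
```

Because linear initialization spans the full range of the weights, the extreme centroids start at the largest-magnitude values, which is the property the paragraph above credits for its better accuracy.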

Huffman coding is the last stage, which stores the trained network parameters efficiently. More frequent values are encoded with fewer bits to minimize storage. It would be interesting to compare the compression rate of Huffman coding with that of general-purpose tools such as ‘zip’ or ‘tar’.
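As a rough illustration of why Huffman coding helps when cluster indices are unevenly distributed, the sketch below builds a Huffman code over a list of symbols and counts the encoded size. It uses only the standard library; the function names and the example index distribution are made up for illustration and are not taken from the paper.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth in subtree}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


def encoded_bits(symbols):
    """Total number of bits needed to store `symbols` with the Huffman code."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)


# Usage: 16 cluster indices drawn from 4 clusters with a skewed distribution.
indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
huffman_size = encoded_bits(indices)   # variable-length codes
fixed_size = 2 * len(indices)          # 2 bits per index for 4 clusters
```

The saving depends entirely on how skewed the distribution is; for uniformly distributed indices a Huffman code gives no advantage over fixed-width indices.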

Beyond presenting a compression framework, the paper also provides comprehensive experimental results from applying deep compression to many state-of-the-art deep networks, tabulating accuracy change, storage size, compression rate, time complexity, etc. One interesting finding is that pruning and quantization work best when combined, even better than pruning alone. This is counterintuitive, since quantization discards parameter information, yet the results show that these compressed networks achieve higher accuracy. It would be interesting to investigate why discarding some parameter information could lead to better performance, what the redundant parameter information encodes, and whether it is related to network overfitting. The proposed method achieves an impressive 35x and 49x compression on AlexNet and VGG respectively, without loss of accuracy, but it is unclear whether this compression rate approaches a lower bound (or whether a lower bound could be determined).

For speedup and energy efficiency, the authors normalize time and energy consumption to the CPU. Since they target real-time processing on mobile platforms, the batch size is set to one. The results show that pruned network layers obtain a 3x to 4x speedup over the dense network on average, and that the pruned network consumes 7x less energy on CPU. However, beyond these relative comparisons, there is no explicit analysis of whether processing on a mobile platform is actually viable.

Furthermore, it would be interesting to apply the proposed method to GoogLeNet, which is already substantially more compact than VGG while obtaining higher accuracy. An analysis of transfer learning would also be an interesting direction, as a test of how well the compression method generalizes. The method would be better validated if compressed networks could be shown to perform well when applied to other tasks.

[1] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” *arXiv preprint arXiv:1503.02531* (2015).