Deep Fried Convnets

By Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song & Ziyu Wang
arXiv, 2015

Fully connected layers can constitute more than 90% of all parameters in a convolutional neural network. As a result, many networks grow to a few hundred megabytes of storage. In many storage-sensitive scenarios, such as distributed training and deployment on mobile devices, the network must be compressed to be practical.

There are several network compression methods in the literature. In network distillation [1], a smaller network is trained to mimic the performance of a larger one. Post-processing is an economical and fast way to compress the network after training, for example with SVD or a low-rank factorization. Alternatively, one could use sparse methods, either through regularization or post-processing. Finally, random projections, a group that includes hashing techniques and the method proposed in this paper, attempt to preserve pairwise distances in a lower-dimensional space. Many of these methods succeed in compressing the network, but most of them sacrifice classification accuracy. In contrast, the Deep Fried Convnet marginally improves accuracy.
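
To make the post-processing route concrete, here is a minimal numpy sketch of low-rank compression of a trained fully connected weight matrix with a truncated SVD. The matrix size and target rank are arbitrary illustrative choices, not values taken from any of the cited papers.

```python
import numpy as np

# Post-training low-rank compression: replace a trained FC weight matrix
# W (m x n) by two thin factors from a rank-k truncated SVD.
rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096))   # stand-in for a trained fc7 matrix

k = 256                                  # target rank (hypothetical choice)
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]                     # shape (4096, k)
B = Vt[:k, :]                            # shape (k, 4096)
W_approx = A @ B                         # low-rank approximation of W

print(W.size)                            # ~16.8M parameters originally
print(A.size + B.size)                   # ~2.1M parameters after compression
```

The factors A and B are what gets stored and used at test time; how much accuracy survives depends on how quickly the singular values of the trained matrix decay.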

The proposed “Deep Fried Convnet” uses the Fastfood transform [4] to replace the two fully connected layers at the end of the network. Theoretically, the Fastfood transform reduces storage from O(nd) to O(n) and computation from O(nd) to O(n log d). With extensive experiments, the authors show that top-1 accuracy can be preserved or even improved with only ⅓ to ½ of the original number of parameters. Moreover, they make the continuous parameters of the Fastfood transform end-to-end learnable. They call this the “Adaptive Fastfood transform” and show that it improves top-1 accuracy on the ILSVRC-2012 task by around 4% compared to the non-adaptive version. The authors motivate their method with connections to structured random projections and explicit feature maps for kernels. These connections help demystify the seemingly unnecessarily complex projection structure.
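
To make the structure concrete, below is a minimal numpy sketch of a single square d×d Fastfood block, S H G Π H B x, assuming d is a power of two; the kernel normalisation constants and the stacking of blocks used for non-square layers are omitted. In the adaptive variant, the diagonals S, G and B would be trained by backpropagation rather than drawn at random.

```python
import numpy as np

def fwht(x):
    """Unnormalised fast Walsh-Hadamard transform of a length-2^k vector, O(d log d)."""
    x = x.copy()
    h, d = 1, len(x)
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def fastfood(x, S, G, Pi, B):
    """One d -> d Fastfood block S H G Pi H B x, stored with only O(d) parameters."""
    v = fwht(B * x)      # H B x
    v = v[Pi]            # random permutation Pi
    v = fwht(G * v)      # H G Pi H B x
    return S * v         # final diagonal scaling

d = 1024                                     # block size, must be a power of two
rng = np.random.default_rng(0)
B = rng.choice([-1.0, 1.0], size=d)          # random sign diagonal
Pi = rng.permutation(d)                      # random permutation
G = rng.standard_normal(d)                   # Gaussian diagonal
S = np.full(d, 1.0 / np.sqrt(d))             # scaling diagonal (simplified)

y = fastfood(rng.standard_normal(d), S, G, Pi, B)
```

A dense d×d matrix would need d² parameters and O(d²) multiply-adds; here only four length-d vectors are stored, and the two Hadamard transforms dominate the cost at O(d log d).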

The Adaptive Fastfood Transform reduces storage requirements on certain nets, such as AlexNet or VGG16, but it is not as compact as networks designed from the ground up with storage in mind. For example, GoogLeNet (v1) achieves significantly higher accuracy with roughly one tenth the parameters of AlexNet, whereas the Adaptive Fastfood Transform removes about ⅔ of AlexNet’s parameters at similar accuracy. Furthermore, among network compression methods it is inferior to the 42x compression rate reported in the recent deep compression paper [3]. The modest storage reduction is mainly due to the high output dimension of the Adaptive Fastfood transform: with a 32K-dimensional output feeding the 1000-way classifier, the final softmax layer alone requires around 1000×32K parameters. Although the authors could have applied the same technique to fc8, it is not clear whether doing so would hurt accuracy. A follow-up work, ACDC [2], claims to have solved this problem.
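
A back-of-the-envelope calculation, assuming AlexNet-like layer sizes and the 32K Fastfood output mentioned above (illustrative numbers, not the paper’s exact configurations), shows why the softmax term dominates:

```python
# Dense AlexNet-style fully connected stack (approximate dimensions).
d_conv, fc, n_classes = 9216, 4096, 1000
dense = d_conv * fc + fc * fc + fc * n_classes   # fc6 + fc7 + fc8 ~= 58.6M

# Replacing fc6/fc7 with an Adaptive Fastfood layer of output size n stores
# only a handful of length-n diagonals, but the softmax after it does not shrink.
n = 32768
fastfood = 3 * n                                  # S, G, B diagonals ~= 0.1M
softmax = n * n_classes                           # ~= 32.8M, the dominant term

print(dense, fastfood + softmax)                  # ~58.6M vs ~32.9M
```

The Fastfood layer itself is almost free; nearly all of the remaining parameters sit in the softmax layer that follows it, which is exactly the part the transform leaves untouched.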

It is also not clear from the paper whether the compressed model can be transferred to other tasks by fine-tuning. When tackling a new problem, practitioners usually start from a pretrained network and fine-tune it on the target domain. If the compressed network is “specialized” to the ILSVRC image classification task, the learned filters may no longer transfer, which would largely limit the usefulness of the compressed network.

In conclusion, the Adaptive Fastfood Transform layer compresses the network by about two thirds while maintaining similar performance. Computationally, the fully connected layers are not the bottleneck, so the overall runtime does not change much. The complexity of the transform and the difficulty of implementing it may hinder widespread adoption. Theoretically, the paper is interesting and could pave the way for more practical methods. The random-projection technique used in the paper is quite general and could potentially be applied to compress the footprint of other networks, which would enable researchers to explore a richer set of models.

[1] G. E. Hinton, O. Vinyals, and J. Dean. “Distilling the Knowledge in a Neural Network.” arXiv preprint arXiv:1503.02531 (2015).
[2] M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas. “ACDC: A Structured Efficient Linear Layer.” In International Conference on Learning Representations (ICLR), 2016.
[3] S. Han, H. Mao, and W. J. Dally. “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.” arXiv preprint arXiv:1510.00149 (2015).
[4] Q. V. Le, T. Sarlós, and A. J. Smola. “Fastfood: Approximate Kernel Expansions in Loglinear Time.” arXiv preprint arXiv:1408.3060 (2014).
