Watch-Bot: Unsupervised Learning for Reminding Humans of Forgotten Actions

by Chenxia Wu, Jiemi Zhang, Bart Selman, Silvio Savarese & Ashutosh Saxena
Arxiv, 2015

Advances in computer vision and human activity recognition have the potential to greatly improve assistive robots.  One problem that spans settings ranging from manufacturing to surgery to cooking is that humans frequently forget to complete parts of a task, sometimes with disastrous consequences.  To notify humans when they’ve skipped part of a process, Wu et al. created a robot called Watch-Bot, which is among the first robots to tackle this type of problem.

The approach of Wu et al. uses a graphical model to model how activity data are generated.  Their idea is to focus on the co-occurrence and temporal relations between human actions and the objects humans are interacting with.  The input is RGB-D data from a Kinect sensor, which represents a human’s pose as a list of joint angles and positions.  By clustering user trajectories using k-means, the authors generate a dictionary of movements called “action-topics”.  Similarly, objects that users interacted with were tracked and simple visual features of these objects were used to form an object-dictionary, containing object-topics, for each action cluster.  Their insight is to model user actions as a process similar to latent Dirichlet allocation (LDA) in document modeling.  In lDA, each word in a document is generated by first picking a topic from the topic distribution for that document and then picking a word from that topic.  Critically, this model allows the authors to calculate the probability of a given sequence of actions, which is the key to identifying forgotten actions.

To identify forgotten actions, the algorithm attempts to insert an additional action or object from the dictionary whenever there is a transition between actions, and then checks to see if that addition increases the probability of that action sequence.  Sequences that most closely match the training data will have higher probability than sequences that contain missing or superfluous actions.  Because only one additional action is proposed between each action transition, it is unclear whether this system would work if more than one action in a row were forgotten.  For example, what would happen if the milk was left out and the faucet was also left on?  Calculating the probability of a sequence with multiple forgotten actions should not require any changes to the model, and a simple greedy algorithm that repeatedly adds the most probable action might work.  The problem of searching through all sequences of multiple missed actions, however, would grow exponentially with the number of actions that might conceivably have been missed.

They tested their algorithm on 450 videos. In their tests, the authors directed volunteers to act out different sequences of either complete action sequences, or “forgotten” sequences in which  some necessary actions were skipped. They reported performance on both activity recognition (~ 35-40% accuracy for 21 action and 23 object types) as well as forgotten action or object recognition (46% accuracy on forgotten actions, 36% on forgotten objects).  The reported accuracies seem fairly impressive given that the training was completely unsupervised.  The data consisted of scenes from both a kitchen and an office.  Activity recognition performance was better in the office than in the kitchen, and unsurprisingly, forgotten activity and object recognition performance were also better in the office.  The ability to recognize actions is critical to recognizing when one has been skipped.

Comparing activity recognition performance among datasets is difficult because datasets differ drastically in the number of actions, the similarity among different actions, and scene complexity.  The authors did not mention any attempts to train their model on other activity recognition datasets such as ActivityNet [1], although the ActivityNet data may only have been available after Wu. et al. submitted their manuscript.  They do report that performance is higher than two other models they implemented.  The first is an LDA model, which does not account for the joint distribution of activities and objects, and the second is a hidden Markov model, which does not account for long-range action relations.  It also would have been interesting if the authors had compared their performance or approach with the unusual or suspicious activity detection literature, which focuses on essentially the same goal.

In the real world, kitchen and office settings vary wildly in their geometry and complexity.  Although better action recognition performance may be achievable using supervised training in very complex settings, it is unknown to what degree a system trained in one setting will translate to another.  Given that the geometry of the room might be tightly coupled with the trajectories corresponding to particular activities, differences in geometry among settings might severely degrade performance.  Wu et al. demonstrate that the forgotten activity recognition problem is approachable using unsupervised training, allowing the system to be trained with minimal human effort in the exact setting in which it will be deployed.

One nice feature of this study is that the training set consisted of 90% full action sequences and 10% forgotten action sequences.  The fact that performance was still fairly high even when the the training set contained a large number of forgotten actions (10% seems quite high for a real situation) shows that the algorithm is robust to potential human errors during the generation of training videos.  On the other hand, the dataset may be somewhat different from reality.  It is unclear whether the prescribed actions performed by the actors capture the extent of variation among repetitions of the same action performed in the wild, and how far apart action clusters are relative to this variation.  To address this issue, future work could make use of footage from surveillance cameras in the wild.

In conclusion, Wu et al. have identified and tackled a very important and societally relevant problem.  Accurately recognizing when humans forget to complete intended actions could greatly enhance the performance and safety of both professionals and home users.  Wu et al. train their model in an unsupervised setting, and despite using techniques that have existed for several years, they achieve good performance using off the shelf hardware (a Kinect sensor, a pan-tilt RGB camera, a laptop, and a laser pointer).  When Wu et al. deployed their system into a simulated field setting using a webcam-guided laser pointer, users rated the robot’s helpfulness as 3.9/5.  Other interfaces such as texts or audible alerts might be more practical for real use.  In addition to predicting forgotten actions, the co-occurrence and temporal relations learned from the model might be helpful for addressing the more general problem of task anticipation.  It will be interesting to see how other approaches, such as adversarial nets [2], perform on human forgotten activity recognition.

[1] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In CVPR 2015.

[2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS 2014.

Convergent Learning: Do Different Neural Networks Learn the Same Representations?

by Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson & John Hopcraft
ICLR, 2016

It is well known within the deep learning community that networks with different random initializations learn different weights. In Convergent Learning: Do Different Neural Networks Learn the Same Representations? Li et. al. aim to learn properties of the learned weights across neural networks trained with different initializations. Though there are no immediate applications of their studies, a better understanding of what neural networks learn has myriad applications such as model compression and training better model ensembles.

The basic analysis presented by the authors relies on aligning filters learned in different networks. To match filters, the authors propose computing the correlation between activations of different filters. Though correlation is a simple metric, a small study demonstrates that correlation yields similar results to mutual information, a more statistically powerful metric, and is considerably faster to compute. For all experiments the authors use CaffeNet networks.

After defining a way to measure the similarity between filters, the authors conduct a series of experiments to better understand how filters align in various networks. A first set of experiments aims to understand if there exists a direct mapping from filters learned in one network to filters learned in a second network. Filters are aligned across networks by matching each filter in the first network (Net1) to its most highly correlated filter in the second network (Net2); the paper explores matching filters in Net2 to multiple filters in Net1, as well as a traditional bipartite mapping. Both mappings reveal similar conclusions; many filters in Net1 have highly similar filters in Net2, though some filters do not match well. Furthermore, filters align best in Conv1 and Conv5, and worse in the layers in between. It is unsurprising that filters do not align well after Conv1 since differences in filters can propagate from layer to layer, but less intuitive why filters align better in Conv5 than Conv4.  One possibility is that networks are trained using the same label sets.

Though some filters do not match well, it is still possible that a group of neurons in one network can interact together to extract features similar to a single filter or a group of filters in another network. In a second set of experiments, the authors demonstrate that filters in Net1 can be mapped to filters in Net2 by finding a linear combination of filters in Net2 which can represent a filter from Net1 based on minimizing a sum of squared prediction loss with L1 regularization. An average of 4.7 conv1 weights from Net2 is needed to accurately represent each conv1 weight in Net1 with a small loss in accuracy. The authors then use agglomerative clustering to match groups of filters between networks. Clustering results demonstrate that networks learn groups of filters which represent similar subspaces, but the filters which span the same subspace across networks are different.  Another method to explore clusters learned by different networks could be representational dissimilarity analysis (RDMs) [1], which is used in the neuroscience community to measure if different neural codes represent the same information.

The experiments in this paper provide insight into what kinds of weights neural networks learn. None of the results are particularly surprising, but the experiments are thorough and provide a solid foundation for future applications. Though it is unlikely that results change substantially between separate networks, it would be worthwhile to conduct similar experiments on other networks to confirm that the patterns observed in this paper are consistent across network architectures.  The authors also concentrate their analysis on convolutional filters, but fully connected layers could be analyzed in a similar way.  Additionally, network design is increasingly moving towards deeper networks. The authors note that matching is best between Conv1 and Conv5 for CaffeNet, but it is possible that matching ceases to occur at very deep layers in current models such as VGG, GoogleNet and ResNet. If this is the case, it is unclear how the analysis presented can lead to applications suggested by the authors.

[1] Kriegeskorte, Nikolaus, Marieke Mur, and Peter A. Bandettini. “Representational similarity analysis-connecting the branches of systems neuroscience.” Frontiers in systems neuroscience 2 (2008): 4.

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

by Song Han, Huizi Mao & William J. Dally
ICLR, 2016

Through a three-stage pipeline that includes pruning, quantization and Huffman coding, deep compression is able to reduce the storage of neural networks by an average of 42x without a loss in accuracy. This significant storage saving is a big step towards ‘portable neural networks’ that facilitate the use of deep networks on mobile platforms. This compression framework is comprehensive in that it generalizes to many network architectures well, and considers both computation and storage memory reduction. Additionally, the quantization step is dynamically learned through re-training, similar with network distillation ([1] Hinton et al) where large networks are mimicked by smaller networks, but different from some previous works that use predefined functions for binning network parameters into buckets.

The pruning step is built upon the author’s previous work that only uses pruning to compress network. The final weights are determined by retraining the network after pruning, and the sparse structure is encoded in a sparse format with relative indexing.

The next step, quantization, is the paper’s main focus. Values are quantized by first clustering weights, and representing each weight value with its corresponding centroid. Quantization is not performed across layers, but only for one layer at a time. They experimented and compared among three centroid initialization schemes and found linear quantization to achieve a higher accuracy, due to its larger probability to initialize with larger weight values, which are more significant but rare. K-means is used to compute cluster centroids and shared weights are updated by calculating the gradient with respect to the centroid.

Huffman coding is the last stage that efficiently stores the trained network parameters. More frequent values are encoded with fewer bits to minimize the amount of storage memory. It could be interesting to compare the compression rate between applying huffman coding and commands of ‘zip’ or ‘tar’.

Other than the contribution of presenting a compression framework, the paper also provides comprehensive experimental results applying deep compression to many state-of-art deep networks, and tabled accuracy change, storage size, compression rates, time complexity, etc. One interesting fact from their experiments is that pruning and quantization work the best when combined, even better than pruning only. This seems less intuitive as quantization leads to loss of network parameters, but results show that these compressed networks get higher accuracy. It’d be an interesting direction to look into to understanding why getting rid of some parameter information could lead to better performances, and what the redundant parameter information encodes; whether it is related to network overfitting. The proposed method achieves an impressive 35x and 49x compression on AlexNet and VGG respectively, without loss of accuracy, but it is unclear if this compression rate approaches a lower bound (or if a lower bound could be determined).

In terms of speedup and energy efficiency, they normalized time complexity and energy consumption to CPU. And since they are targeting real time processing on mobile platform, batch size is set to one. Results show that pruned network layer to obtain 3× to 4× speedup over the dense network on average, and a 7x less energy on CPU is obtained for pruned network. However, other than the relative comparisons, there is no explicit analysis on whether mobile platform processing is viable or not.

Furthermore, it would be interesting to apply the proposed method to GoogleNet which is already substantially more compact than VGG, while obtaining higher accuracy. An analysis on transfer learning would also be an interesting direction to pursue, to test the generalization of the proposed compression method. The proposed method would be better validated if applying compressed network to other tasks could produce good performance.

[1] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).



Deep Fried Convnets

By ZiChao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song & Ziyu Wang
Arxiv, 2015

Fully connected layers can constitute more than 90% of all parameters in a convolutional neural network. This results in many networks growing to a few hundred megabytes in storage. In many storage sensitive scenarios, such as distributed training and deployment on mobile devices, the network must be compressed to be practical.

There are several network compression methods in the literature. In network destillation [1] a smaller network is trained to mimic the performance of a larger one. Post-processing is an economic and fast method to compress the method after training, using SVD or low rank factorization. Alternatively one could use sparse methods, either with regularization or post-processing. Finally, random projections, including the method proposed in this paper and hashing techniques, are a group of methods that attempt to preserve pairwise distance with lower dimensions. Many methods have succeeded in compressing the network, but most of them suffer from lower classification power. In contrast, Deep Fried Convnet marginally improves accuracy.

The proposed “Deep Fried Convnet” use the Fastfood transform [4] to replace the two fully connected layers in the end of the network. Theoretically, the Fastfood transform reduces storage and computation from O(nd) to O(n) and O(nlogd) respectively. With extensive experiments, the authors show that the top 1 accuracy could be preserved or even improved with only ⅓ ~ ½ number of the original number of parameters. Moreover, they make the continuous parameters in the Fastfood transform end-to-end learnable. They denote this method “Adaptive Fastfood transform” and show that it improves the top-1 accuracy on the ILSVRC-2012 task by around 4%, compared to the non-adaptive version. The authors motivate their method with connections to structured random projection and the explicit feature mapping kernels. These connections help demystify the seemingly unnecessarily complex projection structure.

The Adaptive Fastfood Transform reduces storage requirements on certains nets, such as alexnet of VGG16, but it’s not as compact as nets designed ground-up with storage in mind. For example, GoogLeNet (v1) has significantly higher accuracy while having ~1/10x the parameters of AlexNet. In comparison, the Adaptive Fastfood Transform reduces ⅔ of Alexnet’s parameters with similar accuracy. Further, amont network compression methods, it inferior to the 42x compression rate in the recent deep compression paper [3]. The modest storage reduction is mainly due to high parameter dimension after the Adaptive Fastfood transform. This requires around 1000×32K parameters in the final softmax layer. Although the authors could have applied the same technique to fc8, it’s not clear whether such approach will hinder the accuracy. A follow up work, ACDC[2] claims to have solved this problem.

It’s also not clear from the paper whether the compressed model could be transferred to other tasks by fine tuning. When trying to solve a new problem, people usually start from the pretrained network and finetune it on the target domain. If the compressed network is “specialized” on the ILSVRC image classification task, we probably could not transfer the learnt filter anymore. That could largely limit the usage of the compressed network.

In conclusion, the Adaptive Fastfood Transform layer compress the network by two thirds while maintaining similar performance. Computationally, the fully connected layer is not the bottleneck, so the overall timing would not vary a lot. The complexity of the transform and its implementation trouble may hinder widespread adoption. Theoretically, this paper is interesting and could pave the way for more practical methods. The “random projection” techniques used in the paper is quite general and could potentially be applied to compress the footprints of other networks. This will enable researchers to explore a richer set of models.

[1] G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. ArXiv, 1503.02531, 2015. 
[2] Marcin Moczulski, Misha Denil, Jeremy Appleyard, and Nando de Freitas. ACDC: A Structured Efficient Linear Layer. In International Conference on Learning Representations (ICLR). 2016.
[3] Han, Song, Huizi Mao, and William J. Dally. “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.” arXiv preprint arXiv:1510.00149 (2015). 
[4] Le, Quoc Viet, Tamas Sarlos, and Alexander Johannes Smola. “Fastfood: Approximate Kernel Expansions in Loglinear Time.” arXiv preprint arXiv:1408.3060 (2014). 

Human-level concept learning through probabilistic program induction

by Brenden M. Lake, Ruslan Salakhutdinov & Joshua B. Tenenbaum
Science, 2015

Using a novel Hierarchical Bayesian Program Learning (HBPL) algorithm, this Science paper, and it’s NIPS predecessor, “One-shot learning by inverting a compositional causal process”, achieved human-level accuracy on several tasks related to classifying and generating written characters. Viewed from a computer vision perspective, HBPL is a drastic departure from the deep convolutional neural networks that dominate the field, and while HBPLs may not be ready for production systems, the remarkable accuracy and careful composition of concept learning provide insights into how humans and machines learn.

The two papers (NIPS and Science versions) build on a long tradition of probabilistic programming efforts, and also on advances in parsing and modelling hand-written symbols. The HBPL model can also be viewed as an extension of other structural description models. The two papers discussed here make a threefold contribution. First, they provide “Omniglot”, a dataset containing over 1600 written characters with 20 examples each; this can be thought of as the “transpose” of MNIST which has few classes with high number of samples per class. Second, they propose the HBPL algorithm, which is compositional, generative and causal. Third, extensive “visual-turing tests” compare HBPL to humans, demonstrating the HBPL is very-nearly as good as human on classifying, and generating characters, both within and outside known alphabets.

The HBPL model itself is rather sophisticated, with a plethora of parameters. Indeed, the authors could have made good use of the richness of Omniglot in denoting the various parameters, relations, and hyper-parameters. At the core of HBPL lies the concept of character types (e.g. “a”, “b”) and tokens (e.g. “a” as written by a particular person). A character type is modelled by a 3-tuple of (stroke-count, strokes, and relations). Strokes are in-turn broken down into sub-strokes, with intuitive sub-stroke splits occurring when the pen rests or drastically changes direction. Sub-strokes are in-turn defined by a tuple of: an index into a set of prototypes; noise perturbation; and scale. The third component of the stroke, the Relation, define relations between sub-strokes, e.g. independence, along, or start and stop. The Tokens are conceptually broken down into pen trajectories and an ink model, each with its own generative model and hyper-parameters.

Learning is done in two stages. The first stage, which the authors refer to as “learning to learn” requires learning model hyper-parameters on a “background” set of 30 alphabets; these alphabets are only used for this purpose. The remaining 20 alphabets are used for one-shot learning and sample synthesis experiments. On a high level, inference uses bayes classification rule to find the character type that best explains the given input image. In practise, this requires a sequence of steps including thinning the input image; running a search algorithm to find the top 5 parses; approximate type-level variability around each parse; and re-optimize the token-level variables. Extensive experimentation verify the capacity of the model and show that both classification and synthesis capabilities are on par with human performance.

In class discussion, there were some concerns regarding the validity of the synthesis experiments. It seems that the characters in the data-set were generated by non-native writers, and entered using a computer mouse rather than a pen. Therefore, the general character appearances and stroke structure may be quite different from, and exhibit less variation than, characters generated by native writers using a standard writing device. Further, all human drawers are treated equally, when there may in fact be significant personal variation, in particular among native writers. This all affects the validity of the proposed “visual Turing test”. Humans may not be particularly good at judging handwriting they are not familiar with, in particular since the human characters in the data-set, as mentioned above, may be exhibiting less variation than actual human hand writing. This could contribute to making it easier for the machine generated characters to pass the Turing test. It would be interesting to know whether a neural network, trained on this task, would be more suitable than humans to serve as judges in this situation. The classification experiments were more straight-forward, and showed impressive performance, but it’s worth noting that a generic siamese network achieved 8% accuracy [1], which is close to 5% of the proposed method. Another concern was with regards to method generality. Although a modified HBPL framework has been used for speech regocnition [2], several aspects of the method, e.g. stroke with, ink-model, as presented in the reviewed papers, seem tuned specifically to hand-written character recognition. The original claim of method generality as stated by the authors, therefore isn’t very convincingly supported.

In conclusion, the HBPL and related probabilistic programming methods offer a compelling alternative to the current domiant recognition paradigm of deep neural networks.

[1] Gregory Koch, Richard Zemel, Ruslan Salakhutdinov. “Siamese Neural Networks for One-shot Image Recognition”. ICML workshop on deep learning 2015
[2] Brenden M. Lake, Chia-ying Lee, James R. Glass, Joshua B. Tenenbaum. “One-shot learning of generative speech concepts”. Proceedings of the 36th Annual Meeting of the Cognitive Science Society, 2014.