by Brenden M. Lake, Ruslan Salakhutdinov & Joshua B. Tenenbaum
Using a novel Hierarchical Bayesian Program Learning (HBPL) algorithm, this Science paper and its NIPS predecessor, “One-shot learning by inverting a compositional causal process”, achieved human-level accuracy on several tasks related to classifying and generating written characters. Viewed from a computer vision perspective, HBPL is a drastic departure from the deep convolutional neural networks that dominate the field. While HBPL may not be ready for production systems, its remarkable accuracy and careful compositional treatment of concept learning provide insights into how humans and machines learn.
The two papers (NIPS and Science versions) build on a long tradition of probabilistic programming efforts, and also on advances in parsing and modelling hand-written symbols. The HBPL model can also be viewed as an extension of other structural description models. The two papers discussed here make a threefold contribution. First, they provide “Omniglot”, a dataset containing over 1600 written characters with 20 examples each; this can be thought of as the “transpose” of MNIST, which has few classes with a large number of samples per class. Second, they propose the HBPL algorithm, which is compositional, generative and causal. Third, extensive “visual Turing tests” compare HBPL to humans, demonstrating that HBPL is very nearly as good as humans at classifying and generating characters, both within and outside known alphabets.
The HBPL model itself is rather sophisticated, with a plethora of parameters. Indeed, the authors could have made good use of the richness of Omniglot in denoting the various parameters, relations, and hyper-parameters. At the core of HBPL lies the distinction between character types (e.g. “a”, “b”) and tokens (e.g. “a” as written by a particular person). A character type is modelled by a 3-tuple of stroke count, strokes, and relations. Strokes are in turn broken down into sub-strokes, with intuitive sub-stroke splits occurring when the pen rests or drastically changes direction. Each sub-stroke is defined by a tuple of an index into a set of prototypes, a noise perturbation, and a scale. The third component of the character type, the relations, defines how strokes relate to one another, e.g. independent, along, or start and end. Tokens are conceptually broken down into pen trajectories and an ink model, each with its own generative model and hyper-parameters.
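To make the type/token hierarchy concrete, here is a minimal Python sketch of the data structures described above. All class and field names (`SubStroke`, `attached_to`, `writer_id`, and so on) are illustrative assumptions and not the authors' notation or code; the ink model and all probability distributions are omitted, so this shows only the shape of the representation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class RelationKind(Enum):
    # Relation kinds named in the text; the labels are our own.
    INDEPENDENT = "independent"
    ALONG = "along"
    START = "start"
    END = "end"


@dataclass
class SubStroke:
    prototype_index: int                      # index into a set of prototypes
    perturbation: List[Tuple[float, float]]   # noise applied to the prototype
    scale: float


@dataclass
class Stroke:
    sub_strokes: List[SubStroke]


@dataclass
class Relation:
    kind: RelationKind
    attached_to: Optional[int] = None         # hypothetical: index of an earlier stroke


@dataclass
class CharacterType:
    strokes: List[Stroke]
    relations: List[Relation]

    @property
    def stroke_count(self) -> int:
        # The first element of the (stroke count, strokes, relations) tuple.
        return len(self.strokes)


@dataclass
class Token:
    # One writer's rendering of a character type (pen trajectories only).
    character: CharacterType
    writer_id: str
    trajectories: List[List[Tuple[float, float]]] = field(default_factory=list)


def example_character() -> CharacterType:
    """Build a made-up two-stroke character for illustration."""
    vertical = Stroke([SubStroke(3, [(0.0, 0.1)], 1.0)])
    hook = Stroke([
        SubStroke(7, [(0.05, 0.0)], 0.6),
        SubStroke(1, [(0.0, 0.0)], 0.4),
    ])
    return CharacterType(
        strokes=[vertical, hook],
        relations=[
            Relation(RelationKind.INDEPENDENT),
            Relation(RelationKind.ALONG, attached_to=0),
        ],
    )
```

Calling `example_character()` yields a type with two strokes, the second of which has two sub-strokes and is attached along the first. In the actual model each of these fields would be a random variable with its own learned prior.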
Learning is done in two stages. The first stage, which the authors refer to as “learning to learn”, requires learning model hyper-parameters on a “background” set of 30 alphabets; these alphabets are only used for this purpose. The remaining 20 alphabets are used for one-shot learning and sample synthesis experiments. At a high level, inference uses Bayes' classification rule to find the character type that best explains the given input image. In practice, this requires a sequence of steps: thinning the input image; running a search algorithm to find the top 5 parses; approximating type-level variability around each parse; and re-optimizing the token-level variables. Extensive experiments verify the capacity of the model and show that both classification and synthesis capabilities are on par with human performance.
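The classification step can be sketched as follows, under heavy simplification. We assume each candidate class has already been reduced to a handful of parses (five in the papers), and that each parse comes with two numbers: a log weight for the parse and the log-likelihood of the test image after refitting that parse. How those numbers are computed is the hard part of HBPL and is not shown; the function names and the toy scores below are our own assumptions, not the authors' implementation.

```python
import math
from typing import Dict, List, Tuple

# (log weight of the parse, log-likelihood of the test image under the parse)
Parse = Tuple[float, float]


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def class_score(parses: List[Parse]) -> float:
    """Approximate log P(test image | class) as a weighted mixture over parses."""
    log_norm = logsumexp([w for w, _ in parses])
    return logsumexp([w - log_norm + ll for w, ll in parses])


def classify(candidates: Dict[str, List[Parse]]) -> str:
    """Bayes rule with a uniform class prior: pick the best-scoring class."""
    return max(candidates, key=lambda c: class_score(candidates[c]))


# Toy numbers, invented purely for illustration.
toy = {
    "a": [(-1.0, -10.0), (-2.0, -12.0)],
    "b": [(-1.0, -30.0), (-1.5, -28.0)],
}
best = classify(toy)
```

With a uniform prior over the candidate classes, the argmax of the likelihood is also the argmax of the posterior, which is why no explicit prior term appears in `classify`.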
In class discussion, there were some concerns regarding the validity of the synthesis experiments. It seems that the characters in the dataset were generated by non-native writers and entered using a computer mouse rather than a pen. Therefore, the general character appearance and stroke structure may be quite different from, and exhibit less variation than, characters generated by native writers using a standard writing device. Further, all human drawers are treated equally, when there may in fact be significant personal variation, in particular among native writers. This all affects the validity of the proposed “visual Turing test”. Humans may not be particularly good at judging handwriting they are not familiar with, in particular since the human characters in the dataset, as mentioned above, may exhibit less variation than actual human handwriting. This could make it easier for the machine-generated characters to pass the Turing test. It would be interesting to know whether a neural network, trained on this task, would be more suitable than humans to serve as a judge in this situation. The classification experiments were more straightforward and showed impressive performance, but it is worth noting that a generic Siamese network (Koch et al., 2015) achieved an 8% error rate, which is close to the 5% error rate of the proposed method. Another concern was method generality. Although a modified HBPL framework has been used for speech recognition (Lake et al., 2014), several aspects of the method as presented in the reviewed papers, e.g. stroke width and the ink model, seem tuned specifically to hand-written character recognition. The authors' original claim of method generality is therefore not very convincingly supported.
In conclusion, HBPL and related probabilistic programming methods offer a compelling alternative to the currently dominant recognition paradigm of deep neural networks.
Gregory Koch, Richard Zemel, Ruslan Salakhutdinov. “Siamese Neural Networks for One-shot Image Recognition”. ICML Deep Learning Workshop, 2015.
Brenden M. Lake, Chia-ying Lee, James R. Glass, Joshua B. Tenenbaum. “One-shot learning of generative speech concepts”. Proceedings of the 36th Annual Meeting of the Cognitive Science Society, 2014.