by Chenxia Wu, Jiemi Zhang, Bart Selman, Silvio Savarese & Ashutosh Saxena
Advances in computer vision and human activity recognition have the potential to greatly improve assistive robots. One problem that spans settings ranging from manufacturing to surgery to cooking is that humans frequently forget to complete parts of a task, sometimes with disastrous consequences. To notify humans when they’ve skipped part of a process, Wu et al. created a robot called Watch-Bot, which is among the first robots to tackle this type of problem.
Wu et al. use a graphical model of how activity data are generated. Their idea is to focus on the co-occurrence and temporal relations between human actions and the objects humans interact with. The input is RGB-D data from a Kinect sensor, which represents a human’s pose as a list of joint angles and positions. By clustering user trajectories with k-means, the authors generate a dictionary of movements called “action-topics”. Similarly, objects that users interact with are tracked, and simple visual features of these objects are used to form an object dictionary, containing “object-topics”, for each action cluster. Their insight is to model user actions as a process similar to latent Dirichlet allocation (LDA) in document modeling. In LDA, each word in a document is generated by first picking a topic from that document’s topic distribution and then picking a word from that topic. Critically, this model allows the authors to calculate the probability of a given sequence of actions, which is the key to identifying forgotten actions.
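The analogy to document modeling can be made concrete with a toy sketch of the LDA-style generative process (the topic names, mixtures, and distributions below are invented for illustration, not taken from the paper): each observed movement is drawn by first sampling an action-topic from the sequence’s topic mixture, then sampling a movement from that topic’s distribution.

```python
import random

# Hypothetical per-sequence mixture over action-topics.
topic_mixture = {"make-coffee": 0.7, "clean-up": 0.3}

# Hypothetical per-topic distributions over concrete movements.
topic_words = {
    "make-coffee": {"pour-water": 0.5, "press-button": 0.5},
    "clean-up":    {"wipe-counter": 0.6, "rinse-cup": 0.4},
}

def sample(dist):
    """Draw one key from a {key: probability} dict."""
    r, acc = random.random(), 0.0
    for key, p in dist.items():
        acc += p
        if r < acc:
            return key
    return key  # guard against floating-point round-off

def generate_sequence(length):
    """LDA-style generation: pick a topic, then a movement from it."""
    seq = []
    for _ in range(length):
        topic = sample(topic_mixture)           # first pick an action-topic
        seq.append(sample(topic_words[topic]))  # then pick a movement from it
    return seq

print(generate_sequence(4))
```

Inverting this process — inferring which topics best explain an observed sequence — is what lets the model assign a probability to any candidate action sequence.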
To identify forgotten actions, the algorithm attempts to insert an additional action or object from the dictionary at each transition between actions, and then checks whether that addition increases the probability of the action sequence. Sequences that closely match the training data have higher probability than sequences containing missing or superfluous actions. Because only one additional action is proposed at each transition, it is unclear whether this system would work if more than one action in a row were forgotten. For example, what would happen if the milk were left out and the faucet were also left on? Calculating the probability of a sequence with multiple forgotten actions should not require any changes to the model, and a simple greedy algorithm that repeatedly adds the most probable action might work. The problem of searching through all sequences of multiple missed actions, however, would grow exponentially with the number of actions that might conceivably have been missed.
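The insertion check can be sketched as follows. This is not the authors’ implementation: it substitutes a toy bigram transition model for their full graphical model, and the action names and probabilities are invented for illustration. The idea is the same — try each dictionary action at each transition and keep the insertion that most increases the sequence’s probability.

```python
import math

# Hypothetical action dictionary and transition probabilities, as if
# learned from "complete" training sequences.
ACTIONS = ["open-fridge", "take-milk", "pour-milk", "put-back-milk", "close-fridge"]
BIGRAM = {
    ("open-fridge", "take-milk"): 0.9,
    ("take-milk", "pour-milk"): 0.9,
    ("pour-milk", "put-back-milk"): 0.8,
    ("put-back-milk", "close-fridge"): 0.9,
}
SMOOTH = 0.01  # small probability assigned to unseen transitions

def log_prob(seq):
    """Log-probability of a sequence under the toy bigram model."""
    return sum(math.log(BIGRAM.get(pair, SMOOTH))
               for pair in zip(seq, seq[1:]))

def best_insertion(seq):
    """Try every dictionary action at every transition; return the
    (action, position, log-probability gain) that improves the most."""
    base = log_prob(seq)
    best = (None, None, 0.0)
    for i in range(1, len(seq)):        # each transition between actions
        for a in ACTIONS:
            cand = seq[:i] + [a] + seq[i:]
            gain = log_prob(cand) - base
            if gain > best[2]:
                best = (a, i, gain)
    return best

observed = ["open-fridge", "take-milk", "pour-milk", "close-fridge"]
action, pos, gain = best_insertion(observed)
print(action, pos)  # prints: put-back-milk 3
```

A greedy multi-action extension would simply repeat `best_insertion` until no insertion yields a positive gain, though, as noted above, exhaustive search over multiple missed actions grows exponentially.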
They tested their algorithm on 450 videos, in which the authors directed volunteers to act out either complete action sequences or “forgotten” sequences in which some necessary actions were skipped. They report performance on both activity recognition (~35–40% accuracy across 21 action and 23 object types) and forgotten action or object recognition (46% accuracy on forgotten actions, 36% on forgotten objects). These accuracies seem fairly impressive given that training was completely unsupervised. The data consisted of scenes from both a kitchen and an office. Activity recognition performance was better in the office than in the kitchen, and, unsurprisingly, so was forgotten activity and object recognition: the ability to recognize actions is critical to recognizing when one has been skipped.
Comparing activity recognition performance among datasets is difficult because datasets differ drastically in the number of actions, the similarity among different actions, and scene complexity. The authors did not mention any attempts to train their model on other activity recognition datasets such as ActivityNet, although the ActivityNet data may only have become available after Wu et al. submitted their manuscript. They do report that performance is higher than that of two other models they implemented: an LDA model, which does not account for the joint distribution of activities and objects, and a hidden Markov model, which does not account for long-range action relations. It would also have been interesting if the authors had compared their performance or approach with the unusual or suspicious activity detection literature, which focuses on essentially the same goal.
In the real world, kitchen and office settings vary wildly in their geometry and complexity. Although better action recognition performance may be achievable using supervised training in very complex settings, it is unknown to what degree a system trained in one setting will translate to another. Given that the geometry of the room might be tightly coupled with the trajectories corresponding to particular activities, differences in geometry among settings might severely degrade performance. Wu et al. demonstrate that the forgotten activity recognition problem is approachable using unsupervised training, allowing the system to be trained with minimal human effort in the exact setting in which it will be deployed.
One nice feature of this study is that the training set consisted of 90% full action sequences and 10% forgotten action sequences. The fact that performance remained fairly high even when the training set contained a large number of forgotten actions (10% seems quite high for a real situation) shows that the algorithm is robust to potential human errors during the generation of training videos. On the other hand, the dataset may be somewhat different from reality. It is unclear whether the prescribed actions performed by the actors capture the extent of variation among repetitions of the same action performed in the wild, or how far apart action clusters are relative to this variation. To address this issue, future work could make use of footage from surveillance cameras in the wild.
In conclusion, Wu et al. have identified and tackled a very important and societally relevant problem. Accurately recognizing when humans forget to complete intended actions could greatly enhance the performance and safety of both professionals and home users. Wu et al. train their model in an unsupervised setting, and despite using techniques that have existed for several years, they achieve good performance using off-the-shelf hardware (a Kinect sensor, a pan-tilt RGB camera, a laptop, and a laser pointer). When Wu et al. deployed their system in a simulated field setting using a webcam-guided laser pointer, users rated the robot’s helpfulness at 3.9/5. Other interfaces, such as text messages or audible alerts, might be more practical for real use. In addition to predicting forgotten actions, the co-occurrence and temporal relations learned by the model might be helpful for addressing the more general problem of task anticipation. It will be interesting to see how other approaches, such as adversarial nets, perform on human forgotten activity recognition.
F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In CVPR, 2015.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Nets. In NIPS, 2014.