by: Ankur Handa, Viorica Patraucean, Vijay Badrinarayanan, Simon Stent & Roberto Cipolla
This paper addresses the shortage of training data for deep indoor scene understanding models. At present, the only indoor depth datasets with per-pixel labels are NYUv2 and SUN RGB-D, which contain only 795 and 5,285 training images, respectively. Because both datasets were labeled by humans, creating them is expensive and time-consuming, and they suffer from human error in the form of missing or incorrect labels. To overcome these challenges, the authors compile a set of fully annotated large 3D (basis) scenes from the internet and generate new scenes by adding objects from shape repositories, which lets them render many videos of the scenes.
To add variation to the basis scenes, objects can be removed, added, or perturbed from their original positions. To keep the scenes physically realistic, several constraints govern an object's potential location: a realistic scene is generated by solving an optimization over bounding-box intersection, pairwise distance, visibility, distance to wall, and angle to wall. Since objects are retrieved from a shape repository, the label of each object is already known, so additions to a scene preserve the fully labeled nature of the basis scenes. The authors recognize that the noise distribution of these rendered scenes may not match the real world, so they apply a simulated Kinect noise model. The paper is not concerned with correctly texturing the objects, since only depth maps are used for training.
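The placement optimization can be pictured as minimizing a weighted sum of penalty terms over candidate object poses. The sketch below is a minimal 2-D floor-plan version covering only three of the constraints (bounding-box intersection, pairwise distance, and distance to wall); the function names, weights, and exact terms are illustrative assumptions, not the authors' implementation.

```python
import math

def iou_overlap(a, b):
    """Overlap area of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    dx = min(a[2], b[2]) - max(a[0], b[0])
    dy = min(a[3], b[3]) - max(a[1], b[1])
    return max(dx, 0.0) * max(dy, 0.0)

def placement_cost(box, others, wall_x=0.0,
                   w_inter=10.0, w_pair=1.0, w_wall=0.5):
    """Weighted sum of penalty terms for a candidate box (weights are made up)."""
    cost = 0.0
    cx, cy = 0.5 * (box[0] + box[2]), 0.5 * (box[1] + box[3])
    for o in others:
        # heavily penalise interpenetration with already-placed objects
        cost += w_inter * iou_overlap(box, o)
        # encourage related objects to stay reasonably close together
        ox, oy = 0.5 * (o[0] + o[2]), 0.5 * (o[1] + o[3])
        cost += w_pair * math.hypot(cx - ox, cy - oy)
    # pull furniture toward the wall assumed to lie at x = wall_x
    cost += w_wall * abs(box[0] - wall_x)
    return cost
```

A scene generator would then sample candidate positions and keep the lowest-cost ones, for example via random restarts or simulated annealing.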
To measure the improvement synthetic data brings to semantic segmentation, the authors use a state-of-the-art semantic segmentation network built on VGG. Since such networks expect RGB images, they are modified to take a three-channel depth-based input (DHA: depth, height above ground, and angle with gravity), trained on synthetic data, and then fine-tuned on the existing datasets. Fine-tuning is required to make the results compelling; on NYUv2 it yields a 5-point advantage over training without the synthetic data, and a similar trend holds when fine-tuning on SUN RGB-D. While the results are not always as good as those of Eigen et al. or Hermans et al., they are comparable on a fair number of classes. The main failure classes are paintings, televisions, and windows, which is understandable given that the model relies only on depth: these objects are distinguished by appearance rather than geometry.
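For intuition, a depth map can be converted into the three DHA channels roughly as sketched below: back-project pixels to 3D, estimate surface normals by finite differences, and measure each normal's angle to gravity. The focal length, the assumption that gravity is the camera's −y axis, and the height reference (lowest back-projected point) are simplifications for illustration, not the paper's calibration pipeline.

```python
import numpy as np

def depth_to_dha(depth, fx=580.0, fy=580.0):
    """Convert a metric depth map (H, W) into stacked DHA channels:
    Depth, Height above a crude ground reference, and Angle between
    the local surface normal and an assumed gravity direction."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # back-project pixels to camera coordinates
    x = (u - w / 2.0) * depth / fx
    y = (v - h / 2.0) * depth / fy
    pts = np.stack([x, y, depth], axis=-1)
    # finite-difference surface normals from neighbouring 3D points
    du = np.gradient(pts, axis=1)
    dv = np.gradient(pts, axis=0)
    n = np.cross(du, dv)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8
    gravity = np.array([0.0, -1.0, 0.0])  # assumed: camera -y is "up"
    angle = np.degrees(np.arccos(np.clip(n @ gravity, -1.0, 1.0)))
    height = y.max() - y                  # crude height above lowest point
    return np.stack([depth, height, angle], axis=-1)
```

For a flat fronto-parallel surface, the normal points along the optical axis and the angle channel is 90°, which hints at why appearance-defined classes like paintings carry so little DHA signal.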
It would be interesting to see the effect of adding object textures to these models. Since the models perform well from depth alone, adding color data should further improve performance. The authors note that applying textures from OpenSurfaces did not mimic the real world and that ray tracing was too time-consuming. Still, it would be worth testing whether models trained with these imperfect textures help performance; at a minimum, they should improve detection of the "flat" objects the technique previously struggled with.
In summary, this paper proposes an interesting solution to the problem of insufficient data for training deep indoor scene understanding models. The method performs reasonably well even without RGB data and can be trained faster than models that use only real-world data.
D. Eigen and R. Fergus, "Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture."
A. Hermans, G. Floros, and B. Leibe, "Dense 3D Semantic Mapping of Indoor Scenes from RGB-D Images."