by Xiaolong Wang & Abhinav Gupta
Generative Adversarial Networks (GANs) have recently been proposed to generate realistic-looking images from random noise vectors. While images generated by GANs achieve state-of-the-art perceptual quality, the standard GAN ignores the basic principle that image formation is the product of multiple factors (e.g. geometry, lighting, texture), and the lack of explicit control or guidance over the generation process makes the sampled images hard to interpret. This paper shows that by explicitly factorizing the generation process into two components, structure (3D geometry) and style (texture and lighting), it is possible to obtain higher-quality image samples, stronger features for representation learning, and better interpretability of the generative process than with the standard GAN.
The proposed approach is simple and intuitive: first train the Structure-GAN and Style-GAN independently, and then merge the two networks for joint training. At the independent training stage, the Structure-GAN is trained to generate surface normal maps from random noise vectors, while the Style-GAN is trained to generate images conditioned on ground-truth surface normal maps, and the output images are optimized to not only look realistic but also satisfy pixel-wise surface normal constraints. At the joint training stage, the output of the Structure-GAN replaces ground-truth surface normal maps as input to the Style-GAN, and both networks are fine-tuned end-to-end to generate realistic images.
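The two-stage pipeline can be sketched as a simple composition of two generators. The numpy stand-ins below are purely illustrative (every function name, dimension, and mapping here is a hypothetical placeholder, not the paper's convolutional architecture); what they show is the data flow: the Style-GAN first consumes ground-truth normals, and at joint training the Structure-GAN's output takes their place.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8  # toy resolution (placeholder; the paper uses full-size images)

def structure_generator(z_struct):
    """Stand-in for the Structure-GAN generator: noise -> surface-normal map."""
    raw = np.tensordot(z_struct, rng.standard_normal((z_struct.size, H, W, 3)), axes=1)
    return raw / (np.linalg.norm(raw, axis=-1, keepdims=True) + 1e-8)  # unit normals

def style_generator(normals, z_style):
    """Stand-in for the Style-GAN generator: normal map + style noise -> image."""
    color_shift = rng.standard_normal((3, z_style.size)) @ z_style  # toy "style"
    return np.tanh(normals + color_shift)  # RGB values in (-1, 1)

# Stage 1 (independent training): the Style-GAN is conditioned on ground truth.
gt_normals = rng.standard_normal((H, W, 3))
img_stage1 = style_generator(gt_normals, rng.standard_normal(8))

# Stage 2 (joint training): the Structure-GAN's output replaces ground truth,
# and the composed model maps two noise vectors to an image end-to-end.
img_stage2 = style_generator(structure_generator(rng.standard_normal(16)),
                             rng.standard_normal(8))
```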
This paper is one of the first attempts at learning factorized GANs. The explicit factorization of style and structure allows one to probe the generative process of the network by varying one factor at a time while fixing the other. However, the effect of factorization appears more evident when varying structure while fixing style than the other way around (style does not seem to change much while walking the latent space in Fig.9). One potential explanation is that, unlike the Structure-GAN, the Style-GAN is trained in a one-to-one conditional setting that limits its capability to model a multi-modal distribution. In other words, the Style-generator may learn to ignore its input noise vector, since the supervision is defined on deterministic normal-image pairs (i.e. no randomness).
The use of a multi-task FCN loss for training the Style-GAN is also interesting. In theory, the per-pixel normal reconstruction loss is not needed to obtain pixel-level correspondence between the input normal maps and the generated images, since the Style-GAN is conditional and treats each normal-image pair as a training example. The fact that adding a per-pixel loss yields better alignment highlights the difficulty of optimizing the conditional GAN alone. The optimization instability is also implied by several engineering tricks (e.g. BatchNorm is used in the Structure-Generator but not the Structure-Discriminator, the FCN is not fine-tuned initially while the generated images are still poor, gradients from the Style-GAN must be down-weighted during joint training to prevent over-fitting, etc.).
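The multi-task objective amounts to an adversarial term plus a per-pixel normal term. In the sketch below, MSE is a simplified stand-in for the paper's pixel-wise normal loss, and `lam` is a hypothetical weighting, not the authors' value:

```python
import numpy as np

def bce(pred, target, eps=1e-8):
    """Binary cross-entropy on discriminator probabilities."""
    return -np.mean(target * np.log(pred + eps)
                    + (1 - target) * np.log(1 - pred + eps))

def style_generator_loss(d_fake, fcn_normals, gt_normals, lam=1.0):
    """Adversarial term (generator wants its images judged real) plus a
    per-pixel term tying the FCN's predicted normals back to the input
    normals; MSE here stands in for the paper's pixel-wise loss."""
    adv = bce(d_fake, np.ones_like(d_fake))
    pix = np.mean((fcn_normals - gt_normals) ** 2)
    return adv + lam * pix
```

With `lam = 0` this reduces to the plain conditional-GAN objective, which is exactly the baseline the paper finds harder to align.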
Given that quantitative evaluation of generative models is infamously challenging, this paper proposes to use the response statistics of pre-trained classifiers on the generated images as a pseudo-metric. The assumption is that if the images are realistic enough, classifiers will fire on them with high scores. However, off-the-shelf classifiers are not perfect, and exhibit different error modes that might be particularly sensitive to artifacts unique to a certain method. Therefore, the numbers in Fig.11 (a) and (b) should be taken with a grain of salt.
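The pseudo-metric amounts to feeding generated images through a pre-trained classifier and aggregating its confidence. A minimal sketch of one such aggregate (mean top-class probability), with toy logits standing in for a real network's outputs:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mean_max_confidence(logits):
    """Mean top-class probability over a batch of generated images; `logits`
    would come from a pre-trained classifier (toy values stand in here)."""
    return softmax(logits).max(axis=-1).mean()

# Realistic images should push the classifier toward confident predictions,
# but a classifier's idiosyncratic error modes can inflate or deflate this.
flat = mean_max_confidence(np.zeros((10, 100)))       # uniform: 1/100 per class
peaked = mean_max_confidence(5.0 * np.eye(100)[:10])  # one dominant class each
```

The caveat raised above is visible even in this sketch: the score depends entirely on the classifier producing the logits, so method-specific artifacts that happen to excite (or suppress) certain classes shift the metric without any change in perceptual quality.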
The strongest quantitative result comes from the representation learning experiments, especially scene classification, where the proposed method outperforms DCGAN by a large margin and falls only 3.7% short of the Places-AlexNet. However, it is unclear whether the improvement comes from the factorization itself or simply from extra supervision (ground-truth surface normals) that is not available to DCGAN (or Places-AlexNet). In other words, it remains to be seen whether factorized generation is indeed better for representation learning than direct generation when both are given the same amount of supervision.
Overall, this paper presents a novel method for factorizing image generation into structure and style using generative adversarial networks, and demonstrates its advantages over the standard GAN in the quality of image samples, the interpretability/control of the generative process, and the generalization of learned features to recognition tasks. Nonetheless, it would be nice to have a more in-depth analysis of the effect of factorization (e.g. what the Style-GAN distills that remains conflated in the standard GAN) and an ablation study on whether the improvement stems from factorization or from the extra surface-normal supervision unavailable to the standard GAN.