by Justin Johnson, Alexandre Alahi & Fei-Fei Li
This paper proposes the use of “perceptual losses” for training a feed-forward network for the applications of style transfer and superresolution. The perceptual losses used in these applications are raw features and second-order statistics computed from a pre-trained CNN, namely the VGG network. This work builds primarily on the work by Gatys et al. on style transfer. Gatys et al. proposed a method for transferring the style of one image to the content of another. The method synthesizes a new image which matches the Gram matrix statistics of the style image at multiple layers and the raw features of the content image at a single higher layer. This was performed in an iterative optimization framework, which can take on the order of seconds. In this paper, a feed-forward network is trained to perform this task. The observation that an optimization framework can be replaced by a feed-forward network is similar to the work of Dosovitskiy and Brox, who trained a feed-forward network to invert features, a problem previously studied by Mahendran and Vedaldi.
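The two kinds of perceptual loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, random arrays stand in for VGG activations, and the normalization by C*H*W is an assumption made for this sketch.

```python
import numpy as np

def gram_matrix(features):
    """Second-order statistics of a (C, H, W) feature map as a (C, C) matrix.

    The division by C*H*W is an assumed normalization for this sketch.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def feature_loss(feat_out, feat_target):
    """Mean squared difference between raw feature maps (content term)."""
    return float(np.mean((feat_out - feat_target) ** 2))

def style_loss(feat_out, feat_style):
    """Squared Frobenius norm of the Gram matrix difference (style term)."""
    diff = gram_matrix(feat_out) - gram_matrix(feat_style)
    return float(np.sum(diff ** 2))

# Random activations stand in for the features of one CNN layer (assumption).
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16, 16))

# Shuffle the spatial positions while keeping each channel vector intact.
perm = rng.permutation(16 * 16)
shuffled = feats.reshape(8, -1)[:, perm].reshape(8, 16, 16)

print("feature loss:", feature_loss(shuffled, feats))
print("style loss:  ", style_loss(shuffled, feats))
```

Shuffling spatial positions changes the raw-feature loss but leaves the Gram matrix unchanged up to floating-point error, which is why the Gram term captures texture-like style while the raw-feature term preserves spatial layout.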
One central claim of the paper is that the method produces “similar qualitative results but…three orders of magnitude faster”, as highlighted in the abstract. Table 1 compares the running time of the proposed feed-forward method with that of the iterative optimization framework of Gatys et al., along with the relative speed-up factor. However, the results from the feed-forward method are of lower quality, as indicated quantitatively by the loss values in Figure 5 and qualitatively by the examples in Figure 6. Table 1 would be more meaningful if it highlighted the time the Gatys et al. method takes to reach the same quality as the feed-forward network, rather than the time it takes to meet the convergence criteria. From Figure 5, this seems to happen at less than 100 iterations, which means the speedup factor is closer to 150x than to the 1000x highlighted in the table and claimed in the abstract. This would be a fairer representation of the results and is still an impressive result.
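The speedup argument above can be made explicit with a small calculation. This is a rough sketch under two assumptions that do not come from the review: that the cost of the optimization method grows linearly with the number of iterations, and that convergence takes 500 iterations (a placeholder value). The function name is hypothetical.

```python
def effective_speedup(speedup_at_convergence, iters_to_match, iters_to_converge):
    """Speedup when the baseline stops once it matches the fast method's quality.

    Assumes the baseline's running time is proportional to its iteration count.
    """
    if iters_to_converge <= 0 or iters_to_match < 0:
        raise ValueError("iteration counts must be positive")
    return speedup_at_convergence * iters_to_match / iters_to_converge

# Placeholder value, not taken from the review.
ITERS_TO_CONVERGE = 500

for iters in (50, 75, 100):
    print(iters, effective_speedup(1000.0, iters, ITERS_TO_CONVERGE))
```

Under these assumptions, a 150x speedup corresponds to the optimization method matching the feed-forward quality after about 75 iterations, which is consistent with the "less than 100 iterations" read off Figure 5.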
In addition, a human study, such as a 2AFC test, would help quantify the dropoff in quality between the feed-forward network and the iterative optimization framework, and would substantiate the claim that the feed-forward network is of similar quality to the Gatys et al. method. An unmentioned application of this work is as an initialization for the Gatys et al. method, should the user desire higher-quality results. In this case, the speedup factor of the overall system becomes less impressive, as more iterations of the original, slower optimization method have to be run to achieve the desired quality. Thus, a curve of speed-up factor versus quality would provide a good reference for a potential user.
The paper also trains a similar framework for the application of superresolution, with some results shown in Figure 8. The paper mentions that automated metrics such as the standard PSNR and SSIM do not correlate well with perceived image quality. This is indeed a common issue in image synthesis problems. However, the paper does not offer any alternatives, such as a human study. Though the results are undoubtedly sharper, as pointed out in the paper, there is a very apparent stippling pattern in the results, which may be displeasing to a human evaluator. This is possibly due to the use of only a first-order perceptual loss.
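The weakness of per-pixel metrics noted above can be illustrated with PSNR, which depends only on the mean squared error. The sketch below uses the standard PSNR definition; the test images are synthetic and hypothetical, chosen so that a flat brightness shift and a high-frequency checkerboard pattern have the same per-pixel error magnitude.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_value]."""
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Synthetic mid-gray reference image (illustrative only).
reference = np.full((8, 8), 0.5)

# Distortion 1: uniform brightness shift of 0.1.
offset = reference + 0.1

# Distortion 2: checkerboard of +0.1 / -0.1, a high-frequency pattern.
rows, cols = np.indices(reference.shape)
checker = reference + 0.1 * (-1.0) ** (rows + cols)

print("PSNR, uniform shift:", psnr(reference, offset))
print("PSNR, checkerboard: ", psnr(reference, checker))
```

Both distortions have a mean squared error of 0.01 and therefore score about 20 dB, although they look very different. A metric of this kind cannot distinguish a smooth error from a visible pattern, which is one reason a human study would be a useful complement.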
In summary, the paper proposes the novel use of perceptual losses, extracted from pre-trained networks, to train feed-forward networks, and currently applies these losses to train networks for the tasks of style transfer and superresolution. The use of the proposed perceptual losses as a general framework for other structured output tasks, such as colorization, semantic segmentation, and surface normal prediction, as mentioned in the paper, certainly seems like a plausible and worthy direction of further research and exploration for the community.