Objective Evaluation:
Most of the metrics below are surveyed in an empirical study on evaluation metrics of generative adversarial networks [1], which also provides code.
Inception Score (IS): classification-based score using an InceptionNet pretrained on ImageNet, $\mathrm{IS} = \exp\big(\mathbb{E}_{x}\,\mathrm{KL}(p_M(y|x)\,\|\,p_M(y))\big)$, where $p_M(y)$ is the marginal of $p_M(y|x)$ over generated samples. A good generator should give a low-entropy (confident) $p_M(y|x)$ for each sample and a high-entropy (diverse) marginal $p_M(y)$. The higher, the better.
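Given the classifier's softmax outputs, the score reduces to a few lines. A minimal numpy sketch (the function name and toy inputs are illustrative; in practice `probs` would be InceptionNet softmax outputs on generated images):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from class probabilities p(y|x).

    probs: (N, C) array; each row is the classifier's softmax output
    for one generated sample. IS = exp( E_x KL(p(y|x) || p(y)) ),
    where p(y) is the mean of p(y|x) over samples.
    """
    p_y = probs.mean(axis=0, keepdims=True)  # marginal p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident per-sample predictions spread evenly over classes -> high IS.
sharp = np.eye(4).repeat(25, axis=0)   # 100 one-hot rows over 4 classes
print(inception_score(sharp))          # close to 4, the number of classes
```

With perfectly confident and perfectly diverse predictions the score approaches the number of classes; with a uniform $p_M(y|x)$ it collapses to 1.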
Mode score: an extension of the Inception Score that additionally compares the predicted label marginal of generated samples against that of real data, penalizing missing modes. The higher, the better.
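Under the common formulation $\mathrm{MS} = \exp\big(\mathbb{E}_x\,\mathrm{KL}(p_M(y|x)\,\|\,p^*(y)) - \mathrm{KL}(p_M(y)\,\|\,p^*(y))\big)$, where $p^*(y)$ is the label marginal on real data, a numpy sketch (function names are illustrative) is:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence along the last axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def mode_score(gen_probs, real_probs):
    """Mode Score from classifier softmax outputs on generated and
    real samples. Like IS, but measured against the real label
    marginal p*(y), so dropping modes of the real data is penalized."""
    p_y = gen_probs.mean(axis=0)       # generated label marginal p(y)
    p_star = real_probs.mean(axis=0)   # real label marginal p*(y)
    return float(np.exp(kl(gen_probs, p_star).mean() - kl(p_y, p_star)))
```

When the generated label marginal matches the real one, the second KL term vanishes and the score behaves like the Inception Score.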
Kernel MMD: maximum mean discrepancy (MMD) between the real and generated data distributions, estimated with a fixed kernel (e.g., Gaussian). The lower, the better.
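A minimal numpy sketch of the (biased) squared-MMD estimator with a Gaussian kernel; the bandwidth `sigma` is a free parameter here, chosen by hand for illustration:

```python
import numpy as np

def mmd2_gaussian(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X and Y with
    Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    def k(A, B):
        # pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    return float(k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean())
```

Identical sample sets give an estimate of 0, and it grows as the two distributions separate.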
Wasserstein distance: Wasserstein-1 (earth mover's) distance between the real and generated data distributions. The lower, the better.
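For one-dimensional samples the distance has a closed-form estimator available in SciPy; a small sketch (the toy Gaussians are illustrative, and for images the distance is usually computed between feature distributions):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 1000)   # stand-in for real-data features
fake = rng.normal(0.5, 1.0, 1000)   # same shape, shifted mean

# 1-D earth mover's distance between the two empirical samples;
# for two equal-variance Gaussians it estimates the mean shift.
print(wasserstein_distance(real, fake))
```

For higher-dimensional data the distance is typically approximated, e.g. via its dual form as in the WGAN critic.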
Fréchet Inception Distance (FID): extract InceptionNet features for real and generated samples, fit a Gaussian to each feature set, and measure the Fréchet distance between the two Gaussians. The lower, the better.
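With features in hand, FID is $\|\mu_1-\mu_2\|^2 + \mathrm{Tr}\big(C_1 + C_2 - 2(C_1 C_2)^{1/2}\big)$. A numpy-only sketch (the eigenvalue trick for the trace of the matrix square root is one of several ways to compute it; standard implementations use a matrix square root directly):

```python
import numpy as np

def fid(feat_real, feat_gen):
    """FID between two feature sets (in practice, InceptionNet
    features), modeling each set as a Gaussian."""
    mu1, mu2 = feat_real.mean(0), feat_gen.mean(0)
    c1 = np.cov(feat_real, rowvar=False)
    c2 = np.cov(feat_gen, rowvar=False)
    # Tr((C1 C2)^{1/2}) = sum of sqrt of eigenvalues of C1 @ C2;
    # for PSD C1, C2 these are real and nonnegative up to round-off.
    eig = np.linalg.eigvals(c1 @ c2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(c1) + np.trace(c2) - 2.0 * tr_sqrt)
```

Identical feature sets give 0; a pure mean shift of 1 in each of $d$ dimensions gives exactly $d$.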
KNN score: treat real data as positive and generated data as negative, then compute the leave-one-out (LOO) accuracy of a 1-NN classifier; accuracy near 50% indicates that the two distributions are hard to tell apart.
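A brute-force numpy sketch of the LOO 1-NN accuracy (function name illustrative; real implementations would use a proper nearest-neighbor index for large sets):

```python
import numpy as np

def one_nn_loo_accuracy(real, fake):
    """Leave-one-out accuracy of a 1-NN classifier on real (label 1)
    vs. generated (label 0) samples. ~0.5 means the two sets are
    indistinguishable; ~1.0 means they are easily separated."""
    X = np.vstack([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    # all pairwise squared distances
    d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d, np.inf)            # leave the point itself out
    pred = y[np.argmin(d, axis=1)]         # label of nearest neighbor
    return float((pred == y).mean())
```

Note the two failure directions: well-separated sets score near 1.0, while a generator that memorizes the training set drives the accuracy below 0.5.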
Learned Perceptual Image Patch Similarity (LPIPS): distance between deep features of two images, calibrated against human perceptual judgments [3]. The lower, the better.
Subjective Evaluation:
- Each user sees two randomly selected results at a time and is asked to choose the one that looks more realistic. After all pairwise comparisons are collected, the Bradley-Terry (B-T) model is used to compute a global ranking score for each method. [2]
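Turning pairwise preferences into a global ranking can be sketched with the standard minorization-maximization (MM) update for the Bradley-Terry model; this is a minimal illustration, not the exact fitting procedure used in [2]:

```python
import numpy as np

def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry scores from a pairwise win-count matrix.
    wins[i, j] = number of times method i was preferred over method j.
    Uses the standard MM update p_i <- W_i / sum_j n_ij / (p_i + p_j);
    returns scores normalized to sum to 1."""
    n = wins.shape[0]
    p = np.ones(n)
    games = wins + wins.T                  # total comparisons per pair
    for _ in range(n_iter):
        for i in range(n):
            num = wins[i].sum()            # total wins of method i
            den = sum(games[i, j] / (p[i] + p[j])
                      for j in range(n) if j != i)
            p[i] = num / den if den > 0 else p[i]
        p /= p.sum()
    return p
```

Methods preferred more often receive proportionally larger scores, and the normalized scores define the global ranking.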
References
- Xu, Qiantong, et al. “An empirical study on evaluation metrics of generative adversarial networks.” arXiv preprint arXiv:1806.07755 (2018).
- Tsai, Yi-Hsuan, et al. “Deep image harmonization.” CVPR, 2017.
- Zhang, Richard, et al. “The unreasonable effectiveness of deep features as a perceptual metric.” CVPR, 2018.