Objective Evaluation:
Most of the metrics below are surveyed in an empirical study on evaluation metrics of generative adversarial networks [1], which also provides code.
Inception Score (IS): classification-based score using an InceptionNet pretrained on ImageNet, $\mathrm{IS} = \exp\big(\mathbb{E}_{x}\,\mathrm{KL}(p_M(y|x)\,\|\,p_M(y))\big)$, where $p_M(y)$ is the marginal of $p_M(y|x)$ over generated samples. A good generator should give a low-entropy (confident) $p_M(y|x)$ for each sample and a high-entropy (diverse) marginal $p_M(y)$. The higher, the better.
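Given the classifier's softmax outputs, the score reduces to a few lines. A minimal numpy sketch (the function name and toy inputs are illustrative; in practice `probs` would be InceptionNet softmax outputs on generated images):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from class probabilities p(y|x).

    probs: (N, C) array; each row is the classifier's softmax output
    for one generated sample. IS = exp( E_x KL(p(y|x) || p(y)) ),
    where p(y) is the mean of p(y|x) over samples.
    """
    p_y = probs.mean(axis=0, keepdims=True)  # marginal p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident per-sample predictions spread evenly over classes -> high IS.
sharp = np.eye(4).repeat(25, axis=0)   # 100 one-hot rows over 4 classes
print(inception_score(sharp))          # close to 4, the number of classes
```

With perfectly confident and perfectly diverse predictions the score approaches the number of classes; with a uniform $p_M(y|x)$ it collapses to 1.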
Mode score: an extension of the Inception Score that additionally compares the predicted label marginal of generated samples against that of real data, penalizing missing modes. The higher, the better.
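Under the common formulation $\mathrm{MS} = \exp\big(\mathbb{E}_x\,\mathrm{KL}(p_M(y|x)\,\|\,p^*(y)) - \mathrm{KL}(p_M(y)\,\|\,p^*(y))\big)$, where $p^*(y)$ is the label marginal on real data, a numpy sketch (function names are illustrative) is:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence along the last axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def mode_score(gen_probs, real_probs):
    """Mode Score from classifier softmax outputs on generated and
    real samples. Like IS, but measured against the real label
    marginal p*(y), so dropping modes of the real data is penalized."""
    p_y = gen_probs.mean(axis=0)       # generated label marginal p(y)
    p_star = real_probs.mean(axis=0)   # real label marginal p*(y)
    return float(np.exp(kl(gen_probs, p_star).mean() - kl(p_y, p_star)))
```

When the generated label marginal matches the real one, the second KL term vanishes and the score behaves like the Inception Score.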
Kernel MMD: maximum mean discrepancy (MMD) between the real and generated data distributions, estimated with a fixed kernel (e.g., Gaussian). The lower, the better.
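A minimal numpy sketch of the (biased) squared-MMD estimator with a Gaussian kernel; the bandwidth `sigma` is a free parameter here, chosen by hand for illustration:

```python
import numpy as np

def mmd2_gaussian(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X and Y with
    Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    def k(A, B):
        # pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    return float(k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean())
```

Identical sample sets give an estimate of 0, and it grows as the two distributions separate.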
Wasserstein distance: Wasserstein-1 (earth mover's) distance between the real and generated data distributions. The lower, the better.
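For one-dimensional samples the distance has a closed-form estimator available in SciPy; a small sketch (the toy Gaussians are illustrative, and for images the distance is usually computed between feature distributions):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 1000)   # stand-in for real-data features
fake = rng.normal(0.5, 1.0, 1000)   # same shape, shifted mean

# 1-D earth mover's distance between the two empirical samples;
# for two equal-variance Gaussians it estimates the mean shift.
print(wasserstein_distance(real, fake))
```

For higher-dimensional data the distance is typically approximated, e.g. via its dual form as in the WGAN critic.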
Fréchet Inception Distance (FID): extract InceptionNet features for real and generated samples, fit a Gaussian to each feature set, and measure the Fréchet distance between the two Gaussians. The lower, the better.
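With features in hand, FID is $\|\mu_1-\mu_2\|^2 + \mathrm{Tr}\big(C_1 + C_2 - 2(C_1 C_2)^{1/2}\big)$. A numpy-only sketch (the eigenvalue trick for the trace of the matrix square root is one of several ways to compute it; standard implementations use a matrix square root directly):

```python
import numpy as np

def fid(feat_real, feat_gen):
    """FID between two feature sets (in practice, InceptionNet
    features), modeling each set as a Gaussian."""
    mu1, mu2 = feat_real.mean(0), feat_gen.mean(0)
    c1 = np.cov(feat_real, rowvar=False)
    c2 = np.cov(feat_gen, rowvar=False)
    # Tr((C1 C2)^{1/2}) = sum of sqrt of eigenvalues of C1 @ C2;
    # for PSD C1, C2 these are real and nonnegative up to round-off.
    eig = np.linalg.eigvals(c1 @ c2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(c1) + np.trace(c2) - 2.0 * tr_sqrt)
```

Identical feature sets give 0; a pure mean shift of 1 in each of $d$ dimensions gives exactly $d$.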
KNN score: treat real data as positive and generated data as negative, then compute the leave-one-out (LOO) accuracy of a 1-NN classifier; accuracy near 50% indicates that the two distributions are hard to tell apart.
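A brute-force numpy sketch of the LOO 1-NN accuracy (function name illustrative; real implementations would use a proper nearest-neighbor index for large sets):

```python
import numpy as np

def one_nn_loo_accuracy(real, fake):
    """Leave-one-out accuracy of a 1-NN classifier on real (label 1)
    vs. generated (label 0) samples. ~0.5 means the two sets are
    indistinguishable; ~1.0 means they are easily separated."""
    X = np.vstack([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    # all pairwise squared distances
    d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d, np.inf)            # leave the point itself out
    pred = y[np.argmin(d, axis=1)]         # label of nearest neighbor
    return float((pred == y).mean())
```

Note the two failure directions: well-separated sets score near 1.0, while a generator that memorizes the training set drives the accuracy below 0.5.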
Learned Perceptual Image Patch Similarity (LPIPS): distance between deep features of two images, calibrated against human perceptual judgments [3]. The lower, the better.
Subjective Evaluation:
- Each user sees two randomly selected results at a time and is asked to choose the one that looks more realistic. After all pairwise comparisons are collected, the Bradley-Terry (B-T) model is used to compute a global ranking score for each method. [2]
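Turning pairwise preferences into a global ranking can be sketched with the standard minorization-maximization (MM) update for the Bradley-Terry model; this is a minimal illustration, not the exact fitting procedure used in [2]:

```python
import numpy as np

def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry scores from a pairwise win-count matrix.
    wins[i, j] = number of times method i was preferred over method j.
    Uses the standard MM update p_i <- W_i / sum_j n_ij / (p_i + p_j);
    returns scores normalized to sum to 1."""
    n = wins.shape[0]
    p = np.ones(n)
    games = wins + wins.T                  # total comparisons per pair
    for _ in range(n_iter):
        for i in range(n):
            num = wins[i].sum()            # total wins of method i
            den = sum(games[i, j] / (p[i] + p[j])
                      for j in range(n) if j != i)
            p[i] = num / den if den > 0 else p[i]
        p /= p.sum()
    return p
```

Methods preferred more often receive proportionally larger scores, and the normalized scores define the global ranking.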
References
- Xu, Qiantong, et al. “An empirical study on evaluation metrics of generative adversarial networks.” arXiv preprint arXiv:1806.07755 (2018).
- Tsai, Yi-Hsuan, et al. “Deep image harmonization.” CVPR, 2017.
- Zhang, Richard, et al. “The unreasonable effectiveness of deep features as a perceptual metric.” CVPR, 2018.