On the Jensen-Shannon divergence and the variation distance for categorical probability distributions

Corander, Jukka; Remes, Ulpu; Koski, Timo

On the Jensen-Shannon divergence and the variation distance for categorical probability distributions. (English). Kybernetika, vol. 57 (2021), issue 6, pp. 879-907

MSC: 62B10, 62H05, 94A17 | MR 4376866 | Zbl 07478645 | DOI: 10.14736/kyb-2021-6-0879

Keywords:
blended divergences; Chan-Darwiche metric; likelihood-free inference; implicit maximum likelihood; reverse Pinsker inequality; simulator-based inference

Summary:
We establish a decomposition of the Jensen-Shannon divergence into a linear combination of a scaled Jeffreys' divergence and a reversed Jensen-Shannon divergence. Upper and lower bounds for the Jensen-Shannon divergence are then found in terms of the squared (total) variation distance. The derivations rely upon the Pinsker inequality and the reverse Pinsker inequality. We use these bounds to prove the asymptotic equivalence of the maximum likelihood estimate and minimum Jensen-Shannon divergence estimate as well as the asymptotic consistency of the minimum Jensen-Shannon divergence estimate. These are key properties for likelihood-free simulator-based inference.
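The relationship between the Jensen-Shannon divergence and the variation distance can be checked numerically for small categorical distributions. The sketch below is illustrative only and does not reproduce the constants derived in the article. It assumes the equal-weight Jensen-Shannon divergence with natural logarithms and the total variation distance defined as half the L1 distance; the function names and example distributions are hypothetical. The lower bound JS >= V^2/2 follows from applying the Pinsker inequality to each of the two Kullback-Leibler terms against the mixture, and the upper bound JS <= (ln 2) V is a standard linear bound, not the squared-variation bound obtained in the article via the reverse Pinsker inequality.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats.

    Terms with p_i = 0 contribute zero by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Equal-weight Jensen-Shannon divergence in nats (an assumed convention)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def total_variation(p, q):
    """Total variation distance, taken here as half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical example distributions on three categories.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

jsd = jensen_shannon(p, q)
tv = total_variation(p, q)
lower = tv ** 2 / 2        # Pinsker-based lower bound
upper = math.log(2) * tv   # standard linear upper bound

print(f"V = {tv:.4f}, JS = {jsd:.4f}, bounds = [{lower:.4f}, {upper:.4f}]")
```

With these conventions the divergence is zero for identical distributions and reaches its maximum, ln 2, for distributions with disjoint supports, where V = 1.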
