from: stackexchange.com
The Kullback-Leibler Divergence is not a proper metric, since it is not symmetric and it does not satisfy the triangle inequality. So the "roles" played by the two distributions are different, and it is important to assign these roles according to the real-world phenomenon under study.
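As a quick illustration of the asymmetry, here is a minimal sketch that takes the usual definition $K(P||Q)=\sum_i p_i\log_2(p_i/q_i)$ and evaluates it in both directions. The two distributions are made up for illustration and are not the ones from the question; the helper name `kl_divergence` is likewise just a choice for this example.

```python
import math

def kl_divergence(p, q):
    """K(P||Q) in bits. Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions, chosen only for illustration.
p = [0.5, 0.5]
q = [0.25, 0.75]

kl_pq = kl_divergence(p, q)  # P is the "true" distribution, Q approximates it
kl_qp = kl_divergence(q, p)  # roles swapped
print(kl_pq, kl_qp)          # the two values differ
```

Swapping the arguments changes which distribution supplies the weights in the sum, which is why the two numbers do not agree.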
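When we write (the OP has calculated the expression using base-2 logarithms)
$$K(P||Q) = \sum_i p_i \log_2\left(\frac{p_i}{q_i}\right)$$
we consider the $P$ distribution to be the "target" distribution (usually taken to be the true distribution), which we approximate by using the $Q$ distribution.
Now,
$$\sum_i p_i \log_2\left(\frac{p_i}{q_i}\right) = \sum_i p_i \log_2 p_i - \sum_i p_i \log_2 q_i = -H(P) - E_P[\log_2 Q]$$
where $H(P)$ is the Shannon entropy of distribution $P$ and $-E_P[\log_2 Q]$ is the cross-entropy of $P$ and $Q$, written $H(P,Q)$, which is also non-symmetric.
Writing
$$K(P||Q) = H(P,Q) - H(P)$$
(here too, the order in which we write the distributions in the expression of the cross-entropy matters, since it too is not symmetric), permits us to see that KL-Divergence reflects an increase in entropy over the unavoidable entropy of distribution $P$.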
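The identity "KL-divergence equals cross-entropy minus entropy", i.e. $K(P||Q) = H(P,Q) - H(P)$, can be checked numerically. The following is only a sketch: the distributions are hypothetical (not those of the question) and the function names are chosen for this example.

```python
import math

def entropy(p):
    """Shannon entropy H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(P,Q) = -E_P[log2 Q] in bits; note the argument order."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """K(P||Q) in bits, computed directly from its definition."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions, chosen only for illustration.
p = [0.4, 0.35, 0.25]
q = [0.2, 0.3, 0.5]

direct = kl_divergence(p, q)
via_entropies = cross_entropy(p, q) - entropy(p)
print(math.isclose(direct, via_entropies))  # True
```

Since $K(P||Q) \ge 0$, the cross-entropy can never fall below $H(P)$, which is the sense in which $H(P)$ is the "unavoidable" part.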
So, no, KL-divergence is better not interpreted as a "distance measure" between distributions, but rather as a measure of the entropy increase caused by using an approximation to the true distribution rather than the true distribution itself.
So we are in Information Theory land. To hear it from the masters (Cover & Thomas):
"...if we knew the true distribution $P$ of the random variable, we could construct a code with average description length $H(P)$. If, instead, we used the code for a distribution $Q$, we would need $H(P)+K(P||Q)$ bits on the average to describe the random variable."
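To make the coding statement concrete, here is a small sketch with a made-up three-symbol source (not the distributions of the question). The probabilities are dyadic, so prefix codes with codeword lengths exactly $-\log_2$ of the probabilities exist; the symbols, codewords and names below are all assumptions for illustration.

```python
# Hypothetical three-symbol source; probabilities are dyadic so that
# the ideal code lengths -log2(prob) are whole numbers of bits.
p = {"a": 0.5, "b": 0.25, "c": 0.25}   # true distribution P
q = {"a": 0.25, "b": 0.25, "c": 0.5}   # assumed (wrong) distribution Q

# Prefix codes whose codeword lengths equal -log2 of the probabilities.
code_for_p = {"a": "0", "b": "10", "c": "11"}
code_for_q = {"a": "00", "b": "01", "c": "1"}

def average_length(code, dist):
    """Expected codeword length in bits when symbols follow `dist`."""
    return sum(dist[s] * len(code[s]) for s in dist)

matched = average_length(code_for_p, p)      # equals H(P)
mismatched = average_length(code_for_q, p)   # equals H(P) + K(P||Q)
extra_bits = mismatched - matched            # equals K(P||Q)
print(matched, mismatched, extra_bits)       # 1.5 1.75 0.25
```

In this toy case the code built for $Q$ costs 0.25 extra bits per symbol on average when the data actually follow $P$, which is exactly $K(P||Q)$ for these two distributions.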
The same wise people say:
"...it is not a true distance between distributions since it is not symmetric and does not satisfy the triangle inequality. Nonetheless, it is often useful to think of relative entropy as a “distance” between distributions."
But this latter approach is useful mainly when one attempts to minimize KL-divergence in order to optimize some estimation procedure. For the interpretation of its numerical value per se, it is not useful, and one should prefer the "entropy increase" approach.
For the specific distributions of the question (always using base-2 logarithms)
In other words, you need 25% more bits (that is, $K(P||Q)/H(P) \approx 0.25$) to describe the situation if you are going to use $Q$ while the true distribution is $P$.