Feeds:
Posts

## A Tale of Two Entropies

For those of us who work with degree-of-plausibility (“Bayesian”) probabilities, two situations regularly arise. The first is the need to update probabilities to take into account new information. This is usually done using Bayes’ Rule, when the information comes in the form of a proposition that is known to be true. An example of such a proposition is “The data are 3.444, 7.634, 1.227”.

More generally, information is any justified constraint on our probabilities. For example, “P(x > 3) should be 0.75” is information. If our current probability distribution $q(x)$ doesn’t satisfy the constraint, then we better change to a new distribution $p(x)$ that does. This doesn’t mean that any old $p(x)$ will do – our $q(x)$ contained hard-won information and we want to preserve that. To proceed, we choose the $p(x)$ that is as close as possible to $q(x)$, but satisfies the constraint. Various quite persuasive arguments (see here) suggest that the correct notion of closeness that we should maximise is the relative entropy:

$H(p; q) = -\int p(x) \log \frac{p(x)}{q(x)} dx$

With no constraints, the best possible $p(x)$ is equal to $q(x)$.

Another situation that arises often is the need to simplify complex problems. For example, we might have some probability distribution $q(x)$ that is non-Gaussian, but for some reason we only want to use Gaussians for the rest of the calculation, perhaps for presentation or computational reasons. Which Gaussian should we choose to become our $p(x)$? Many people recommend maximising the relative entropy for this also: in the literature, this is known as a variational approximation, variational Bayes, or the Bogoliubov approximation (there are also variations (pun not intended) on this theme).

There are known problems with this technique. For instance, as David MacKay notes, the resulting probability distribution $p(x)$ is usually narrower than the original $q(x)$. This makes sense, since the variational approximation basically amounts to pretending you have information that you don’t actually have. This issue raises the question of whether there is something better that we could do.

I suggest that the correct functional to maximise in the case of approximating one distribution by another is actually the relative entropy, but with the two distributions reversed:

$H(q; p) = -\int q(x) \log \frac{q(x)}{p(x)} dx$

Why? Well, for one, it just works better in extreme examples I’ve concocted to magnify (a la Ed Jaynes) the differences between using $H(p; q)$ and $H(q; p)$. See the figure below:

If the blue distribution represented your actual state of knowledge, but out of necessity you could only use the red or the green distribution, which would you prefer? I find it very hard to imagine an argument that would make me choose the red distribution over the green. Another argument supporting the use of this ‘reversed’ entropy is that it is equivalent to generating a large number of samples from q, and then doing a maximum likelihood fit of p to these samples. I know maximum likelihood isn’t the best, most principled thing in the world, but in the limit of a large number of samples it’s pretty hard to argue with.

A further example supporting the ‘reversed’ entropy is what happens if $q(x)$ is zero at some points. According to the regular entropy, any distribution $p(x)$ that is nonzero where $q(x)$ is zero, is infinitely bad. I don’t think that’s true, in the case of approximations – some leakage of probability to values we know are impossible is no catastrophe. This is manifestly different to the case where we have legitimate information – if $q(x)$ is zero somewhere then of course we want to have $p(x)$ zero there as well. If we’re updating probabilities, we’re trying to narrow down the possibilities, and resurrecting some is certaintly unwarranted – but the goal in doing an approximation is different.

Maximising the reversed entropy also has some pretty neat properties. If the approximating distribution is a Gaussian, then the first and second moments should be chosen to match the moments of $q(x)$. If the original distribution is over many variables, but you want to approximate it by a distribution where the variables are all independent, just take all of the marginal distributions and product them together, and there’s your optimal approximation.

If $H(p; q)$ isn’t the best thing to use for approximations, that means that something in the derivation of $H(p; q)$ applies to legitimate information but does not apply to approximations. Most of the axioms (coordinate independence, consistency for independent systems, etc) make sense, and both entropies discussed in this post satisfy those. It is only at the very end of the derivation that the reversed entropy is ruled out, and by some pretty esoteric arguments that I admit I don’t fully understand. I think the examples I’ve presented in this post are suggestive enough that there is room here for a proof that the reversed entropy $H(q; p)$ is the thing to use for approximations. This means that maximum relative entropy is a little less than universal, but that’s okay – the optimal solutions to different problems are allowed to be different!

### 4 Responses

1. How are you deciding between the red and the green line? Intuitively, the green one is preferred because it takes into account the two side bumps, whereas the red one completely ignores them.

But there is something like confirmation bias here (en.wikipedia.org/wiki/Confirmation_bias). The troughs between the bumps shouldn’t be ignored, either. We could just as easily prefer the red one because it takes into account the troughs, whereas the green one ignores them.

What happens if you break out Kolmogorov–Smirnov? (Or is that considered a statistical faux pas?)

2. “How are you deciding between the red and the green line? Intuitively, the green one is preferred because it takes into account the two side bumps, whereas the red one completely ignores them.”

Yep, that’s my claim.

“The troughs between the bumps shouldn’t be ignored, either.”

Ideally, of course they shouldn’t – we could just use the blue curve then. But if we’re forced to, what are we better of ignoring? The bumps or the troughs? My intuition stronly prefers ignoring the troughs.

3. on September 24, 2015 at 12:29 pm | Reply quantum information geometry guy

I feel there is something mixed up here, and not according to the way it’s usually presented. I believe, your conclusions are right, but with somewhat faulty reasoning.

By various reasons that Jaynes was first pointing out, one is interested in maximising the entropy of a distribution, given constraints, to obtain a distribution that fullfills the constraints and having the least additional information. As an example, we use Gaussians to describe distribution where only mean and variance is known.

When going from a prior (original) distribution Q to posterior (updated) distribution P, that is, updating Q to obtain P, one does quite similar.

Entropy, Relative entropy (also called Kullback Leibler divergence) are commonly defined as below.

S(P) = – \sum p_i log p_i
D(P||Q) = \sum p_i log (p_i / q_i)

The relationship between entropy and and the relative entropy is as follows:

D(P || uniform) = log(N) – S(P)

where N is the support of uniform, that is, the number of events considered (e.g. 2^n for repeated coin flips.)

So in order to maximise the entropy, one has to minimises the relative entropy.

In your definition, you added a minus sign in front of the relative entropy. Of course, one then has to maximise instead of minimise. This is however a bit unusual, as one wants the relative entropy to be always positive, as a measure of uncertainty. Without minus sign, it is, and the relative entropy is only zero if P equals Q.

(Aside: Maximising the relative entropy D(P || Q) by varing P is easy. Just find a P with supp(P) > supp(Q). Then, D(P||Q) = inf, because the term P log (P/Q) will be infinite when P>0, Q=0)

You’re quite right in saying that a “reversed relative entropy” is better for approximating a distribution with another one. This however is different scenario then the one described above. Before, you wanted to update a distribution, going from the Q prior to the posterior P. Now you look for a best approximation, that is, looking for the best prior distribution Q (fullfilling a reduced set of constraints), that needs the minimal amount of update to obtain the posterior P. For this case one minimises D(P,Q), where Q is the best approximation, fulfilling some but not all constraints put on P.

To keep these things apart, on has only to remember that the posterior (updated distribution) is in the first slot, and the prior in the second slot of the relative entropy.

D( Posterior || Prior)

I think this article covers most points: Maximum Entropy and Bayesian inference: Where do we stand and where do we go?

4. on September 24, 2015 at 12:35 pm | Reply quantum information geometry guy

“When going from a prior (original) distribution Q to posterior (updated) distribution P, that is, updating Q to obtain P, one does quite similar.”

I forgot to add: it is similar, because in the case where the prior is uniform (no previous knowledge of the distribution), minimising the relative entropy corresponds exactly to maximising the entropy. One can also perform iteratively updates, and when done right (one has to always stay on a so called mixed geodesic), the relative entropies obtained by each update add up to give the entropy of the final posterior.