My title is taken from a similarly titled article by the physicist Ed Jaynes, whose work influenced me greatly. It refers to a controversial idea in epistemological probability theory: the method of maximum entropy, which was popularised and (arguably) invented by Jaynes. This principle states that, when choosing probabilities on a discrete hypothesis space, subject to constraints on the probabilities (e.g. a certain expectation value is specified), you should distribute the probability as uniformly as possible by the criterion of Shannon entropy.
It was soon realised that this is not, as Jaynes hoped, a method of assigning probabilities from scratch. With no constraints apart from normalisation, you get a uniform distribution, which is lurking in the background as a “prior” that is assumed by MaxEnt. The uniform distribution might be justified by another argument (e.g. invariance under relabelling of the hypotheses), but the point remains: Maximum Entropy updates probabilities from a previous distribution, it doesn’t generate them from scratch (I will use the term “ME” to refer to MaxEnt applied in this updating fashion). This puts the principle on the same turf as another, much more widely accepted method for updating probabilities: Bayes’s theorem. The main difference seems to be that ME updates given constraints on the probabilities, and Bayes updates on new data.
Of course, when there are two methods that claim to do the same, or similar, things, disagreements can occur. There is a large, confusing literature on the relationship and possible conflicts between Bayes’s theorem and Maximum Entropy. I don’t recommend reading it. Actually, it gets worse: there are at least three different but vaguely related ideas that are called maximum entropy in the literature! The most common conflict is demonstrated in this short discussion by David MacKay. The problem posed is the classic one, along these lines (although MacKay presents it slightly differently): given that a biased die averaged 4.5 on a large number of tosses, assign probabilities for the next toss, x. This problem can seemingly be solved by Bayesian Inference, or by MaxEnt with a constraint on the expected value of x: E(x) = 4.5. These two approaches give different answers!
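To make the MaxEnt side of this problem concrete, here is a minimal sketch in Python. It assumes the standard result that maximising Shannon entropy under a mean constraint gives probabilities proportional to exp(lam * x), and it finds the multiplier lam by bisection. The function name `maxent_die` and the symbol `lam` are my own labels for illustration, not anything from MacKay or Jaynes.

```python
import math

def maxent_die(target_mean, faces=(1, 2, 3, 4, 5, 6), tol=1e-12):
    """Maximum-entropy distribution on `faces` with a given expected value.

    Assumes the exponential form p(x) proportional to exp(lam * x);
    lam is found by bisection, since the mean increases with lam.
    """
    def mean_for(lam):
        weights = [math.exp(lam * x) for x in faces]
        z = sum(weights)
        return sum(x * w for x, w in zip(faces, weights)) / z

    lo, hi = -50.0, 50.0  # bracket chosen wide enough for any interior mean
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid

    lam = 0.5 * (lo + hi)
    weights = [math.exp(lam * x) for x in faces]
    z = sum(weights)
    return lam, [w / z for w in weights]

lam, probs = maxent_die(4.5)
for face, p in zip(range(1, 7), probs):
    print(face, round(p, 4))
```

The resulting probabilities rise steadily from about 0.05 for a 1 to about 0.35 for a 6. With a target mean of 3.5 the same routine returns the uniform distribution, which is the hidden "prior" mentioned earlier.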
Given the success of Bayes, I was confused and frustrated that nobody could clearly explain this old MaxEnt business, and whether it was still worth studying. All of this was on my mind when I attended the ISBA conference earlier this year. So, aided by free champagne, I sought out some opinions. John Skilling, an elder statesman (sorry John!) of the MaxEnt crowd, seems to have all but given up on the idea. Iain Murray, a recent PhD graduate in machine learning, dismissed MaxEnt’s claim to fundamental status, saying that it was just a curious way of deriving the exponential families. He also reminded me that Radford Neal rejects MaxEnt. These are all people whose opinions I respect highly. But by the end of this story, I end up disagreeing with them.
How did this come about? At ISBA, I tracked down the only person who mentioned maximum entropy on their poster – Adom Giffin, from the USA, and had a long discussion/debate, essentially boiling down to the same issues raised by MacKay in the previous link: MaxEnt and Bayes can both be used for this problem, and are quite capable of giving different answers. I was still confused, and after returning to Sydney I thought about it some more and looked up some articles by Giffin and his colleague, Ariel Caticha. These can be found here and here. After devouring these I came to agree with their compatibilist position. The rationale given for ME is quite simple: prior information is valuable and we shouldn’t arbitrarily discard it. Suppose we start with some probability distribution q(x), and then learn that, actually, our probabilities should satisfy some constraint that q(x) doesn’t satisfy. We need to choose a new distribution p(x) that satisfies the new constraint – but we also want to keep the valuable information contained in q(x). If you seek a general method for doing this that satisfies a few obvious axioms, ME is it – you choose your p(x) such that it is as close to q(x) as possible (i.e. maximum relative entropy, minimum Kullback-Leibler distance) while satisfying the new constraint.
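For a single expectation constraint, the ME update described above can be written out explicitly. The following is a sketch under my own notation, which does not appear in the Giffin and Caticha articles as quoted here: f(x) is the constrained quantity, F its required expectation, lambda a Lagrange multiplier and Z a normalising constant.

```latex
\begin{aligned}
&\text{minimise}\quad D(p\,\|\,q) = \sum_x p(x)\,\log\frac{p(x)}{q(x)}
\quad\text{subject to}\quad \sum_x p(x) = 1,\qquad \sum_x p(x)\,f(x) = F, \\[4pt]
&\Rightarrow\quad p(x) = \frac{q(x)\,e^{\lambda f(x)}}{Z(\lambda)},\qquad
Z(\lambda) = \sum_x q(x)\,e^{\lambda f(x)},\qquad
\frac{\partial \log Z}{\partial \lambda} = F .
\end{aligned}
```

When q(x) is uniform this reduces to ordinary MaxEnt, and when q(x) already satisfies the constraint the solution is lambda = 0, so p(x) = q(x) and nothing is changed.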
This seems to present a philosophical problem: where do we get constraints on probabilities from, if not from data? It is easy to imagine updating your probabilities using ME if some deity (or exam question writer) provides a command “thou shalt have an expectation value of 4.5”, but in real research problems, information never comes in this form. An experimental average is not an expectation value. I emailed Ariel Caticha with a suggested analogy for understanding this situation, which he agreed with (apparently a rare phenomenon in this field). The analogy is that, in classical mechanics, all systems can be specified by a Hamiltonian, and the equations of motion are obtained by differentiating the Hamiltonian in various ways. But hang on a second – what about a damped pendulum? What about a forced pendulum? I remember studying those in physics, and they are not Hamiltonian systems! But we understand why. Our model was designed to include only the coordinates and momenta that we are interested in – the ones about the pendulum – and not those describing the rest of the universe; these are dubbed “external to the system”, and their effects summarised by the damping coefficient or the driving force f(t). However, our use of a model of this kind does not stop us from believing that energy is actually conserved, if only our model included these extra coordinates and momenta. It also doesn’t mean we can be arbitrary in choosing the damping coefficient and the driving force f(t) – these ought to be true summaries of the relevant information about the environment.
Similarly, in inference, one should write down every possibility imaginable, and delete the ones that are inconsistent with all of our experiences (data). This would correspond to coming up with a really big model and then using Bayes’s theorem given all the data you can think of. However, this is impossible in practice, so for pragmatic reasons we summarise some of the data by a constraint on the probabilities we should use on a smaller hypothesis space, in much the same way that, in physics, we reduce the whole rest of the universe to just a single damping coefficient, or a driving force term. That is where constraints on probabilities come from – summaries of relevant data that we deem to be “external to the system” of interest. Once we have them, we need to process them to update our probabilities, and ME is the right tool for this job.
The simplest way to reduce some data to a constraint on probabilities is if the statement “I got data D” is in your hypothesis space, as it is in the normal Bayesian setup. Applying the syllogism “I got data D ==> my probabilities should satisfy P(D)=1”, then applying ME, leads directly to the conventional Bayesian result – as demonstrated by Giffin and Caticha. Thus, Bayesian Inference isn’t about accumulating data in order to overwhelm the prior information, as it is often presented. It is just the opposite – we are really trying to preserve as much prior information as possible!
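A tiny numerical check of this claim, as a sketch rather than a reproduction of the Giffin and Caticha derivation. The two-hypothesis coin model and all the numbers in it are invented for illustration. The code imposes the constraint P(D) = 1 by only considering distributions supported on the observed outcome, minimises the Kullback-Leibler divergence from the joint prior by brute-force grid search, and compares the minimiser with the usual Bayesian posterior.

```python
import math

# Hypothetical model: a coin is either "fair" or "biased"; the data D is "heads".
prior = {"fair": 0.5, "biased": 0.5}
p_heads = {"fair": 0.5, "biased": 0.9}

# Joint prior q(theta, D) restricted to the observed outcome D = heads.
q_fair_H = prior["fair"] * p_heads["fair"]
q_biased_H = prior["biased"] * p_heads["biased"]

def kl_on_support(t):
    """KL(p || q) for p with mass t on (fair, heads) and 1 - t on (biased, heads).

    The constraint P(D) = 1 means p puts zero mass on the tails cells,
    which contribute nothing to the sum (0 * log 0 is taken as 0).
    """
    total = 0.0
    for mass, q in ((t, q_fair_H), (1.0 - t, q_biased_H)):
        if mass > 0:
            total += mass * math.log(mass / q)
    return total

# ME update: search over all distributions that satisfy the constraint.
grid = [i / 10000 for i in range(10001)]
me_posterior_fair = min(grid, key=kl_on_support)

# Conventional Bayes: likelihood times prior, divided by the evidence.
bayes_posterior_fair = q_fair_H / (q_fair_H + q_biased_H)

print(me_posterior_fair, bayes_posterior_fair)
```

The two numbers agree to within the grid spacing. The grid search is deliberately naive so that the ME side does not quietly reuse the Bayes formula.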
This leaves one remaining loose end – that pesky biased die problem, or the analogous one discussed by MacKay. Which answer is correct? In my opinion, both are correct but deal with different prior states of knowledge (the ultimate Bayesian’s cop-out ;-)). If we actually knew that it was a repeated experiment, the Bayesian set-up of a “uniform prior over unknown probabilities”, and then conditioning on the observed mean, is correct. The hypothesis space here is the Cartesian product of the space of possible values for the 6 “true probabilities” of the die and the space of possible sequences of rolls, {1,1,1,…,1}, {1,1,1,…,2}, …, {6,6,6,…,6}. Note that not all of these sequences of tosses are equally likely. If we condition on a 1 for the first toss, this raises our probability for the 2nd toss being a 1 as well. This is relevant prior information that should, and does, affect the result. This is the source of the disagreement between MaxEnt and Bayes.
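The dependence between tosses can be seen with exact arithmetic. This sketch assumes the "uniform prior over unknown probabilities" is a flat Dirichlet prior on the six face probabilities, in which case the predictive probability of a face is the standard rule-of-succession formula (count + 1) / (tosses + 6). The function name `predictive` is mine.

```python
from fractions import Fraction

def predictive(counts):
    """Predictive probabilities for the next toss under a flat Dirichlet prior.

    counts[k] is how many times face k+1 has been seen so far.
    Uses the rule of succession: (count + 1) / (total + number of faces).
    """
    total = sum(counts)
    n_faces = len(counts)
    return [Fraction(c + 1, total + n_faces) for c in counts]

# Before any tosses, every face has probability 1/6.
before = predictive([0, 0, 0, 0, 0, 0])[0]

# After seeing a single 1, the probability that the next toss is also a 1.
after = predictive([1, 0, 0, 0, 0, 0])[0]

print(before, after)  # 1/6 2/7
```

Seeing a 1 raises the probability of another 1 from 1/6 to 2/7, so the sequences of tosses are not equally likely under this model.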
If we didn’t know this whole setup about the die, merely that there were 6^N possibilities, with N large, this model would be inappropriate. We would have uniform probabilities for the sequence of tosses (this corresponds to the “poorly informed robot” from Jaynes’s book). In this scenario MaxEnt with E(x) = 4.5 completely agrees with Bayesian Inference. It is this case where MaxEnt is appropriate because we really do possess no information other than the specified average value.
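One way to check this agreement numerically is sketched below. It rests on an assumption of mine that the post does not spell out: the toss in question is taken to be one of the N tosses whose average was reported. With uniform probabilities over all 6^N sequences, the code counts the sequences with the required total exactly and reads off the probabilities for a single toss. The choices N = 200 and a total of 900 (an average of 4.5) are arbitrary.

```python
def count_sequences(n_tosses, max_sum):
    """counts[s] = number of sequences of n_tosses faces (1..6) that sum to s."""
    counts = [0] * (max_sum + 1)
    counts[0] = 1  # the empty sequence has sum 0
    for _ in range(n_tosses):
        new = [0] * (max_sum + 1)
        for s, c in enumerate(counts):
            if c:
                for face in range(1, 7):
                    if s + face <= max_sum:
                        new[s + face] += c
        counts = new
    return counts

N = 200   # number of tosses (illustrative choice)
S = 900   # required total, i.e. an observed average of 4.5

# Fix one toss to show `face`; the other N - 1 tosses must sum to S - face.
rest = count_sequences(N - 1, S)
weights = [rest[S - face] for face in range(1, 7)]
total = sum(weights)  # equals the number of length-N sequences with sum S
marginal = [w / total for w in weights]

for face, p in zip(range(1, 7), marginal):
    print(face, round(p, 4))
```

For large N these probabilities approach the exponential-form MaxEnt distribution with E(x) = 4.5, rising from roughly 0.05 for a 1 to roughly 0.35 for a 6.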
This concludes my narrative of my journey from confusion to some level of understanding of this issue. At the moment, I am working on some ideas related to ME that can help clear up some difficulties in conventional Bayesian Inference. Particularly, there’s been a flare-up of controversy, basically over Lindley’s Paradox in cosmology, that I believe ME can go some way to resolving.
I’d like to leave you with a quote from the neuroscientist V. S. Ramachandran, which gave me the confidence to reveal my heretical thoughts on this matter.
“I tell my students, when you go to these meetings, see what direction everyone is headed, so you can go in the opposite direction. Don’t polish the brass on the bandwagon.” – V. S. Ramachandran