My title is taken from a similarly titled article by the physicist Ed Jaynes, whose work influenced me greatly. It refers to a controversial idea in epistemological probability theory: the method of maximum entropy, which was popularised and (arguably) invented by Jaynes. This principle states that, when choosing probabilities on a discrete hypothesis space, subject to constraints on the probabilities (e.g. a certain expectation value is specified), you should distribute the probability as uniformly as possible, where “as uniformly as possible” is measured by the Shannon entropy.
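In symbols (my notation, not Jaynes’s): for outcomes x_i with probabilities p_i, and a constraint fixing the expectation of some function f, MaxEnt says to solve

$$\max_{p} \; H(p) = -\sum_i p_i \log p_i \quad \text{subject to} \quad \sum_i p_i = 1, \quad \sum_i p_i f(x_i) = F.$$

The solution always has the exponential-family form $p_i \propto \exp(\lambda f(x_i))$, with the Lagrange multiplier $\lambda$ chosen so that the constraint holds.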
It was soon realised that this is not, as Jaynes hoped, a method for assigning probabilities from scratch. With no constraints apart from normalisation, you get a uniform distribution, which is lurking in the background as a “prior” assumed by MaxEnt. The uniform distribution might be justified by another argument (e.g. invariance under relabelling of the hypotheses), but the point remains: Maximum Entropy updates probabilities from a previous distribution; it doesn’t generate them from scratch (I will use the term ‘ME’ to refer to MaxEnt applied in this updating fashion). This puts the principle on the same turf as another, much more widely accepted method for updating probabilities: Bayes’s theorem. The main difference seems to be that ME updates given constraints on the probabilities, while Bayes updates on new data.
Of course, when there are two methods that claim to do the same, or similar, things, disagreements can occur. There is a large, confusing literature on the relationship and possible conflicts between Bayes’s theorem and Maximum Entropy. I don’t recommend reading it. Actually, it gets worse: there are at least three different but vaguely related ideas that are called maximum entropy in the literature! The most common conflict is demonstrated in this short discussion by David MacKay. The problem posed is the classic one, along these lines (although MacKay presents it slightly differently): given that a biased die averaged 4.5 over a large number of tosses, assign probabilities for the next toss, x. This problem can seemingly be solved by Bayesian Inference, or by MaxEnt with a constraint on the expected value of x: E(x) = 4.5. These two approaches give different answers!
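To make the MaxEnt side of this concrete, here is a minimal numerical sketch (Python, with names of my own invention). The solution has the form $p_i \propto e^{\lambda i}$, and we simply tune the multiplier until the mean comes out at 4.5:

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)

def maxent_mean(lam):
    """Mean of the MaxEnt solution p_i proportional to exp(lam * i)."""
    p = np.exp(lam * faces)
    p /= p.sum()
    return np.sum(faces * p)

# Tune the Lagrange multiplier until the constraint E(x) = 4.5 holds.
lam = brentq(lambda l: maxent_mean(l) - 4.5, -5.0, 5.0)
p = np.exp(lam * faces)
p /= p.sum()
print(p.round(4))  # approximately [0.0543 0.0788 0.1142 0.1654 0.2398 0.3475]
```

The Bayesian route, conditioning on the observed average within a model of repeated tosses, yields a different distribution – hence the conflict.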
Given the success of Bayes, I was confused and frustrated that nobody could clearly explain this old MaxEnt business, and whether it was still worth studying. All of this was on my mind when I attended the ISBA conference earlier this year. So, aided by free champagne, I sought out some opinions. John Skilling, an elder statesman (sorry John!) of the MaxEnt crowd, seems to have all but given up on the idea. Iain Murray, a recent PhD graduate in machine learning, dismissed MaxEnt’s claim to fundamental status, saying that it was just a curious way of deriving the exponential families. He also reminded me that Radford Neal rejects MaxEnt. These are all people whose opinions I respect highly. But by the end of this story, I end up disagreeing with them.
How did this come about? At ISBA, I tracked down the only person who mentioned maximum entropy on their poster – Adom Giffin, from the USA – and had a long discussion/debate, essentially boiling down to the same issues raised by MacKay in the previous link: MaxEnt and Bayes can both be used for this problem, and are quite capable of giving different answers. I was still confused, and after returning to Sydney I thought about it some more and looked up some articles by Giffin and his colleague, Ariel Caticha. These can be found here and here. After devouring these I came to agree with their compatibilist position. The rationale given for ME is quite simple: prior information is valuable and we shouldn’t arbitrarily discard it. Suppose we start with some probability distribution q(x), and then learn that, actually, our probabilities should satisfy some constraint that q(x) doesn’t satisfy. We need to choose a new distribution p(x) that satisfies the new constraint – but we also want to keep the valuable information contained in q(x). If you seek a general method for doing this that satisfies a few obvious axioms, ME is it: you choose your p(x) such that it is as close to q(x) as possible (i.e. maximum relative entropy, minimum Kullback-Leibler divergence) while satisfying the new constraint.
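Here is a small sketch of that recipe (Python again; the setup and example prior are mine): start from a prior q, and numerically find the p that minimises the Kullback-Leibler divergence from q subject to the constraint.

```python
import numpy as np
from scipy.optimize import minimize

def me_update(q, f, F):
    """The ME update: the distribution p closest to q in KL divergence
    (equivalently, maximum relative entropy) with sum(p * f) = F."""
    kl = lambda p: np.sum(p * np.log(p / q))
    constraints = [{'type': 'eq', 'fun': lambda p: p.sum() - 1.0},
                   {'type': 'eq', 'fun': lambda p: np.dot(p, f) - F}]
    result = minimize(kl, q, method='SLSQP',
                      bounds=[(1e-12, 1.0)] * len(q),
                      constraints=constraints)
    return result.x

faces = np.arange(1, 7, dtype=float)

# A uniform prior recovers the plain MaxEnt answer...
print(me_update(np.ones(6) / 6, faces, 4.5).round(4))

# ...but a non-uniform prior gives a different update:
# the prior information is preserved, not discarded.
q = np.array([0.3, 0.2, 0.15, 0.15, 0.1, 0.1])
print(me_update(q, faces, 4.5).round(4))
```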
This seems to present a philosophical problem: where do we get constraints on probabilities from, if not from data? It is easy to imagine updating your probabilities using ME if some deity (or exam question writer) provides a command “thou shalt have an expectation value of 4.5”, but in real research problems, information never comes in this form. An experimental average is not an expectation value. I emailed Ariel Caticha with a suggested analogy for understanding this situation, which he agreed with (apparently a rare phenomenon in this field). The analogy is that, in classical mechanics, all systems can be specified by a Hamiltonian, and the equations of motion are obtained by differentiating the Hamiltonian in various ways. But hang on a second – what about a damped pendulum? What about a forced pendulum? I remember studying those in physics, and they are not Hamiltonian systems! But we understand why. Our model was designed to include only the coordinates and momenta that we are interested in – the ones describing the pendulum – and not those describing the rest of the universe; these are dubbed “external to the system”, and their effects are summarised by the damping coefficient or the driving force f(t). However, our use of a model of this kind does not stop us from believing that energy would actually be conserved if only our model included these extra coordinates and momenta. Nor does it mean we can be arbitrary in choosing the damping coefficient and the driving force f(t) – these ought to be true summaries of the relevant information about the environment.
Similarly, in inference, one should ideally write down every possibility imaginable and delete the ones that are inconsistent with our experiences (data). This would correspond to coming up with a really big model and then using Bayes’s theorem given all the data you can think of. However, this is impossible in practice, so for pragmatic reasons we summarise some of the data by a constraint on the probabilities we should use on a smaller hypothesis space – in much the same way that, in physics, we reduce the whole rest of the universe to just a single damping coefficient or a driving force term. That is where constraints on probabilities come from: summaries of relevant data that we deem to be “external to the system” of interest. Once we have them, we need to process them to update our probabilities, and ME is the right tool for this job.
The simplest way to reduce some data to a constraint on probabilities arises when the statement “I got data D” is in your hypothesis space, as it is in the normal Bayesian setup. Applying the syllogism “I got data D ==> my probabilities should satisfy P(D) = 1”, then applying ME, leads directly to the conventional Bayesian result – as demonstrated by Giffin and Caticha. Thus, Bayesian Inference isn’t about accumulating data in order to overwhelm the prior information, as it is often presented. It is just the opposite: we are really trying to preserve as much prior information as possible!
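In symbols (my paraphrase of their result): work on the joint space of the quantity of interest $x$ and the data $D$, start from a joint prior $q(x, D)$, and impose the constraint $P(D = D_{\rm obs}) = 1$. The ME update then gives

$$p(x, D) = \delta_{D, D_{\rm obs}} \, q(x | D_{\rm obs}),$$

so the new marginal distribution for $x$ is exactly the Bayesian posterior $q(x | D_{\rm obs}) \propto q(x) \, q(D_{\rm obs} | x)$.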
This leaves one remaining loose end – that pesky biased die problem, or the analogous one discussed by MacKay. Which answer is correct? In my opinion, both are correct but deal with different prior states of knowledge (the ultimate Bayesian’s cop-out ;-)). If we actually knew that it was a repeated experiment, the Bayesian set-up of a “uniform prior over unknown probabilities”, followed by conditioning on the observed mean, is correct. The hypothesis space here is the product of the space of possible values for the six “true probabilities” of the die with the space of possible sequences of rolls: {1,1,1,…,1}, {1,1,1,…,2}, …, {6,6,6,…,6}. Note that not all of these sequences of tosses are equally likely. If we condition on a 1 for the first toss, this raises our probability for the second toss being a 1 as well. This is relevant prior information that should, and does, affect the result, and it is the source of the disagreement between MaxEnt and Bayes.
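That claim is easy to check numerically (a sketch in Python; the setup is mine). The “uniform prior over unknown probabilities” is a Dirichlet(1, …, 1) distribution over the six-vector of true probabilities, and under it, seeing a 1 on the first toss lifts the probability of a 1 on the second toss from 1/6 up to 2/7:

```python
import numpy as np
rng = np.random.default_rng(42)

# Uniform prior over the die's six "true probabilities".
theta = rng.dirichlet(np.ones(6), size=1_000_000)
p1 = theta[:, 0]   # P(a toss shows 1 | theta), one value per prior sample

print(p1.mean())                     # P(1st toss is 1): approx 1/6
print((p1**2).mean() / p1.mean())    # P(2nd is 1 | 1st is 1): approx 2/7
```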
If we didn’t know this whole setup about the die, and knew merely that there were 6^N possibilities, with N large, that model would be inappropriate. We would have uniform probabilities over the sequences of tosses (this corresponds to the “poorly informed robot” from Jaynes’s book). In this scenario, MaxEnt with E(x) = 4.5 completely agrees with Bayesian Inference. This is the case where MaxEnt is appropriate, because we really do possess no information other than the specified average value.
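For the curious, this agreement can be checked directly at finite N (again a sketch of my own construction): put equal probability on all 6^N sequences, condition on the tosses summing to 4.5N, and read off the implied distribution for a single toss. Counting the sequences by repeated convolution:

```python
import numpy as np

N = 200
S = int(4.5 * N)   # condition on the N tosses summing to 4.5 per toss

# counts[s] = number of toss sequences summing to s, built up one
# toss at a time by convolution; start with the empty sequence.
one_toss = np.zeros(7); one_toss[1:] = 1.0
counts = np.array([1.0])
for _ in range(N - 1):
    counts = np.convolve(counts, one_toss)

# With all sequences equally likely, P(first toss = k | sum = S) is
# proportional to the number of ways the other N-1 tosses sum to S - k.
p = np.array([counts[S - k] for k in range(1, 7)])
p /= p.sum()
print(p.round(4))  # approaches the MaxEnt answer as N grows
```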
This concludes the narrative of my journey from confusion to some level of understanding of this issue. At the moment, I am working on some ideas related to ME that can help clear up some difficulties in conventional Bayesian Inference. In particular, there has been a flare-up of controversy in cosmology, basically over Lindley’s paradox, which I believe ME can go some way towards resolving.
I’d like to leave you with a quote from neuroscientist V. S. Ramachandran, that gave me the confidence to reveal my heretical thoughts on this matter.
“I tell my students, when you go to these meetings, see what direction everyone is headed, so you can go in the opposite direction. Don’t polish the brass on the bandwagon.” – V. S. Ramachandran
I think this all sounds sensible – funny, I didn’t realise there was any controversy! Anyone who has actually used the maxent code for image reconstruction can see the influence of the prior information (or enforced lack thereof). These days I use the maxent principle to justify various pragmatically simple priors, but always feel a little weedy while doing so… Strange how wedded people are to the ideal of an uninformative prior – I think it must be because we never really understand our experimental setups, and (at least in astronomy) are often blundering around in the dark.
As I recall, I did say that MaxEnt didn’t have any fundamental status. Then as an after-thought I mentioned that exponential family distributions are clearly pretty significant. I shouldn’t have claimed that that is the only reason MaxEnt could be interesting though (if I did).
Radford’s comment that I was referring to was this:
http://groups.google.com/group/sci.stat.consult/msg/2cf57ceb8ec46e0f
There is also an older similar post
http://groups.google.com/group/sci.stat.math/msg/6bacf6af89d60808
which was followed by a heated debate.
—
I like your analogy with physical systems. Any arbitrary predictor that gives a proper conditional probability distribution given some observations is (trivially) compatible with a Bayesian interpretation. Just pick any joint prior distribution over observations and new outcomes that has that conditional. So any simple updating procedure, not just MaxEnt, could be seen as a compact description of the effect of some, potentially complicated, beliefs. Some predictors’ implicit priors have suspect properties, such as depending on the number of observations that will be made. I would treat these with care.
Whenever a seemingly non-Bayesian approach beats “the” Bayesian approach, it’s often due to a misunderstanding that has limited the flexibility of the prior too much. This should be a signal that too many assumptions have been made. Three examples: 1) Example 11.9, p. 186 of Wasserman’s “All of Statistics” puts a uniform prior on a high-dimensional quantity, rather than using a hierarchical prior that only makes each component uniform marginally. This leads to an apparent paradox, although the book spuriously blames Bayesian inference as the cause (with no real explanation). 2) This paper: http://books.nips.cc/papers/files/nips20/NIPS2007_0756.pdf also shows Bayesian methods apparently failing; again, the reason is that there isn’t hierarchical structure in the priors. 3) This (non-Bayesian) paper http://www.machinelearning.org/proceedings/icml2007/papers/303.pdf seems to use information that should be irrelevant and that couldn’t be used in a Bayesian approach. The reason the procedure helps is that it adds flexibility to the models. If this had been realized, it could have been motivation to improve the models directly instead.
I hope your MaxEnt journey does give you insight into your problem and leads to better models of it.
Statistics is one of those rare fields of study which is far more interesting in its fundamentals than in its applications …
Debating the superiority of Bayesianism to sliced bread? Fine!
Actual calculations? No thanks!
I disagree with your broad point, Luke: granting the false dichotomy between fundamentals and applications, I challenge – no, defy – you to name three disciplines in which applications are a key attraction. The only one I can think of is philosophy.
Computer modelling?
Fundamentals are translations, scaling, rotations (matrix multiplications), etc.
Applications: http://www.magnusviri.com/lego/gallery/at_stp.html for instance.
Hey, I am having to read this several times, Brendon, but it’s interesting so that’s okay. I was going to ask you to explain a little more about the basic confusion between Bayes and MaxEnt with regard to the biased die, but I realise you linked several articles, so I’ll do some more reading myself before coming back with requests.
Opening up Caticha’s lectures, I find in the preface: “Science consists in using information about the world for the purpose of predicting, explaining, understanding, and/or controlling phenomena of interest. The basic difficulty is that the available information is usually insufficient to attain any of those goals with certainty.”
So, I disagree with that second sentiment. I know it’s just the preface & at 170 pages he can spare some effort at the beginning, but surely the basic difficulty [in science] is the inability to account for hypotheses or experiments we haven’t thought of yet. If we had access to every concretely testable idea and merely lacked the complete account of the Universe with regard to their predictions, probability would make pretty short work of natural science.
Here’s the paper.
http://arxiv.org/abs/0906.5609
I’ll let you know how the MaxEnt people like or dislike this idea next week, and whether anything interesting happens when we apply it to dark energy.
[…] justified constraint on our probabilities. For example, “P(x > 3) should be 0.75” is information. If our current probability distribution doesn’t satisfy the constraint, then we better […]
Sorry to bring up a 2-year-old thread, but the MacKay link doesn’t work. Could you remember a couple of keywords so I can try searching for it directly? Many thanks.
I believe this is the working link to MacKay’s explanation:
http://beta.metafaq.com/action/answer?id=QCVH1D35ARGFRC5VCSDR6JKU0E&ref=http%3a%2f%2fapi.transversal.com%2fmfapi%2fobjectref%2fEntryStore%2fEntry%2fhttp%3a%2f%2fwww.metafaq.com%2fmfapi%2fMetafaq%2fClients%2fmackay%2fModules%2fbook%2fTopics%2finference%3a83538%3a0
[…] with the marginal . The right way to update probabilities given this kind of information is to use ME, which implements “minimal updating”: you should stay as close as possible to your […]
[…] this updating procedure is equivalent to a certain way of using MaxEnt, which means that Bayesian updating is a way of staying as close to the prior as possible: a fun […]
I know this is quite an old thread, but the MacKay link doesn’t seem to be working again. I would love an updated link if possible.