As background for some future posts, I need to catalogue a few facts about Bayes’ theorem. This is all standard probability theory. I’ll be roughly following the discussion and notation of Jaynes.
We start with propositions, represented by A, B, etc. These propositions are in fact true or false, though we may not know which. We assume that we can do Boolean things with these propositions, in particular:
Conjunction: $AB$ means “both A and B are true”
Disjunction: $A + B$ means “at least one of the propositions A, B is true”
Negation: $\bar{A}$ means “not A” (i.e. A is false)
We want to assign probabilities to propositions to represent how likely they are to be true, given the information in other propositions. The function $p(A|B)$ assigns a probability to the proposition $A$, using only the information in $B$. Read “p” as “the probability of” and the vertical bar “ | ” as “given”, so $p(A|B)$ reads “the probability of A given B”. There are no “raw” probabilities, $p(A)$.
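What is p? The older approach to probability of Kolmogorov et al. requires that p satisfy certain axioms. Roughly,
A1. $p(A|B)$ is a non-negative number.
A2. One means certain: $p(A|B) = 1$ if A is certain, given B.
A3. Probabilities of mutually exclusive propositions add: if A and C cannot both be true given B, then $p(A + C|B) = p(A|B) + p(C|B)$.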
Since the publication of Cox’s Theorem, an alternative approach to probability has gained popularity. Cox (again I’m following Jaynes) proposes the following desiderata of rationality, not as arbitrary axioms but as expectations of any rational approach to reasoning in the face of uncertainty.
D1. Probabilities are represented by real numbers.
D2. Probabilities change in common sense ways. For example, if learning C makes B more likely, but doesn’t change how likely A is, then learning C should make AB more likely.
D3. If a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result.
D4. Information must not be arbitrarily ignored. All given evidence must be taken into account.
D5. Identical states of knowledge (except perhaps for the labeling of the propositions) should result in identical assigned probabilities.
The great advance of Cox’s theorem was to show that one can start with these desiderata and arrive at Kolmogorov’s axioms (at least, for the finite version of A3). Thus, the traditional laws of probability can be applied to more than just frequencies.
Conjunction and Total Probability
Beginning with the desiderata, we can derive more than Kolmogorov’s axioms. There are a number of useful probability identities. Remember that an identity is a formula that holds for any propositions we may substitute in. Here are two preliminary identities, before we get to Bayes’ theorem.
Conjunction: for any propositions A, B and C,

$$p(AC|B) = p(A|CB)\,p(C|B) \qquad (1)$$
This is the full version of the “and means multiply” rule. Apply recursively for larger conjunctions. Note that B is “along for the ride”, and appears as a given in each term.
Law of Total Probability: We can expand probabilities in terms of some extra information T. We can’t just assume that this extra information is true, obviously, so there is a term that assumes $T$ and a term that assumes $\bar{T}$, weighted by the probability of each.

$$p(A|B) = p(A|TB)\,p(T|B) + p(A|\bar{T}B)\,p(\bar{T}|B) \qquad (2)$$
Again, B is along for the ride.
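As a quick sanity check, here is a minimal Python sketch that verifies the conjunction rule and the law of total probability on a small joint distribution. The weights are made up purely for illustration, and the helper names (`p`, `both`, `neg`) are my own, not Jaynes’.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution over three propositions (A, T, B).
# The weights are invented for illustration; any positive weights would do.
weights = [3, 1, 4, 1, 5, 9, 2, 6]
worlds = list(product([True, False], repeat=3))
total = sum(weights)
joint = {w: Fraction(k, total) for w, k in zip(worlds, weights)}

def p(event, given):
    """p(event | given); both arguments are predicates on a world (a, t, b)."""
    num = sum(pr for w, pr in joint.items() if event(w) and given(w))
    den = sum(pr for w, pr in joint.items() if given(w))
    return num / den

A = lambda w: w[0]
T = lambda w: w[1]
B = lambda w: w[2]
both = lambda x, y: (lambda w: x(w) and y(w))
neg = lambda x: (lambda w: not x(w))

# Conjunction: p(AT|B) = p(A|TB) p(T|B)
lhs_conj = p(both(A, T), B)
rhs_conj = p(A, both(T, B)) * p(T, B)

# Total probability: p(A|B) = p(A|TB) p(T|B) + p(A|not-T B) p(not-T|B)
lhs_total = p(A, B)
rhs_total = (p(A, both(T, B)) * p(T, B)
             + p(A, both(neg(T), B)) * p(neg(T), B))

print(lhs_conj == rhs_conj, lhs_total == rhs_total)  # prints: True True
```

Exact fractions are used so that the two sides can be compared with `==` rather than a floating-point tolerance.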
Here is the usual story about Bayes’ theorem. We have some background information about the world at large, B. We also have some data D about a specific scenario. We want to know: what is the probability that some theory T is true, given the background information and the data? We write this as $p(T|DB)$. Using Bayes’ theorem, we can expand this as,

$$p(T|DB) = \frac{p(D|TB)\,p(T|B)}{p(D|B)} \qquad (3)$$

The posterior $p(T|DB)$ is the probability of T given D and B.
The likelihood $p(D|TB)$ is the probability that D is true given B and T.
The prior $p(T|B)$ is the probability that T is true given only the background information B.
The marginal likelihood $p(D|B)$ is the probability that D is true given only the background information B.
What to do with identities
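The purpose of the identities (1) – (3) is to take the probability we want and write it in terms of probabilities we have. For example, we can use equation (2) to write the marginal likelihood in terms of $p(D|TB)$ and $p(D|\bar{T}B)$,

$$p(T|DB) = \frac{p(D|TB)\,p(T|B)}{p(D|TB)\,p(T|B) + p(D|\bar{T}B)\,p(\bar{T}|B)} \qquad (4)$$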
More generally, we can expand the marginal likelihood using a partition, wherein we have a set of theories $\{T_1, T_2, \ldots, T_n\}$ (I’ll simplify things by considering only finite cases) which are mutually exclusive and exhaustive. In other words, exactly one of the theories is true. We use a more general form of the Law of Total Probability to write the marginal likelihood as,

$$p(D|B) = \sum_{i=1}^{n} p(D|T_i B)\,p(T_i|B) \qquad (5)$$
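The partition form translates directly into a few lines of code. This is only a sketch: the function name `posteriors` and the example numbers are my own, chosen for illustration.

```python
def posteriors(priors, likelihoods):
    """Posterior p(T_i | D B) for each theory in a finite partition.

    priors[i]      = p(T_i | B), assumed to sum to 1
    likelihoods[i] = p(D | T_i B)
    """
    joint = [like * prior for like, prior in zip(likelihoods, priors)]
    marginal = sum(joint)  # p(D | B), via the law of total probability
    if marginal == 0:
        raise ValueError("the data are impossible under every theory")
    return [j / marginal for j in joint]

# Made-up numbers for three mutually exclusive, exhaustive theories.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.4, 0.7]
post = posteriors(priors, likelihoods)
print(post)
```

Note that the third theory starts with the smallest prior but ends with the largest posterior, because it makes the data most probable.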
Nice properties of Bayes’ theorem
We can incorporate data piece by piece
Theories are rewarded for making data more probable
Good theories beat their rivals
Sherlock Holmes’ principle: the best theory need not be perfect
Ambiguous information changes nothing
Nothing new, nothing changes
Extraordinary claims require extraordinary evidence
Certainty is serious
We can compare two theories independently of all others
A theory divided against itself will not stand.
1. We can incorporate data piece by piece
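Suppose the data D can be split into two: $D = D_1 D_2$. Then,

$$\begin{aligned}
p(T|D_1 D_2 B) &= \frac{p(D_2|T D_1 B)\,p(D_1|TB)\,p(T|B)}{p(D_2|D_1 B)\,p(D_1|B)} \\
&= \frac{p(D_2|T D_1 B)\,p(T|D_1 B)}{p(D_2|D_1 B)} \qquad (6)
\end{aligned}$$

The second line has the same form as if we assumed that $D_1$ was part of our background, i.e. if the background were $D_1 B$ rather than $B$. Equation (6) shows why Bayes’ theorem is presented as a method for updating our probabilities given new data. $B$ could represent what I knew before I did any experiments, $D_1$ is what I learned from the experiments on day 1, $D_2$ on day 2, and so on. Today’s posterior becomes tomorrow’s prior. Hence, Bayesian updating.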
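To see “today’s posterior becomes tomorrow’s prior” in action, here is a small sketch comparing a two-step update with a single update on all the data at once. The `update` helper and all the numbers are hypothetical.

```python
def update(priors, likelihoods):
    """One application of Bayes' theorem over the partition {T, not-T}."""
    joint = [like * prior for like, prior in zip(likelihoods, priors)]
    marginal = sum(joint)
    return [j / marginal for j in joint]

# Made-up inputs, listed as [value for T, value for not-T].
prior = [0.7, 0.3]      # p(T|B), p(not-T|B)
like_d1 = [0.2, 0.6]    # p(D1|T B), p(D1|not-T B)
like_d2 = [0.9, 0.5]    # p(D2|T D1 B), p(D2|not-T D1 B)

# Piece by piece: the day-1 posterior is used as the day-2 prior.
after_day1 = update(prior, like_d1)
after_day2 = update(after_day1, like_d2)

# All at once: p(D1 D2|T B) = p(D2|T D1 B) p(D1|T B) by the conjunction rule.
like_both = [a * b for a, b in zip(like_d1, like_d2)]
all_at_once = update(prior, like_both)

print(after_day2, all_at_once)
```

Both routes give the same posterior (up to floating-point rounding), as desideratum D3 demands.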
2. Theories are rewarded for making data more likely
Under what circumstances does the probability of T increase when D is added? From Bayes’ theorem, provided $p(T|B) > 0$,

$$p(T|DB) > p(T|B) \quad \text{if and only if} \quad p(D|TB) > p(D|B)$$
Similarly for “less than”. This is a nice result. If assuming that T is true makes it more likely that D is true, then the addition of D to our knowledge increases the probability that T is true. (It doesn’t follow that T is likely; only that it is more likely with D and B than on B alone.)
3. Good theories beat their rivals
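When is a theory more likely than not to be true? Using equation (4),

$$p(T|DB) > \tfrac{1}{2} \quad \text{if and only if} \quad p(D|TB)\,p(T|B) > p(D|\bar{T}B)\,p(\bar{T}|B) \qquad (7)$$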
The important quantity for a theory is the product of its likelihood and its prior. If this product is larger for T than for not-T, that theory is more probable than not. Thus, in evaluating the probability of a theory, we compare it to the set of all of its rivals, not-T.
4. Sherlock Holmes’ principle: the best theory need not be perfect
Note that, for inequality (7) above to hold, neither the likelihood nor the prior nor their product needs to be large. The product just has to be greater than that of its rivals (put together, not just better than each individual rival). The fact that some data is unlikely on T (i.e. $p(D|TB)$ is small) is not sufficient to rule out that theory. In short, you win the Bayesian game by being better, not necessarily by being perfect. This is the Bayesian, more qualified version of Sherlock Holmes’ saying, “When you have eliminated the impossible, whatever remains, however improbable, must be the truth”.
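A toy calculation makes the point concrete. The numbers below are invented: the data is very improbable on T, yet T still ends up more probable than not, because its rivals (taken together) do even worse.

```python
# Hypothetical numbers: T has a modest prior and a tiny likelihood.
prior_t = 0.2       # p(T|B)
like_t = 1e-3       # p(D|T B): the data is unlikely even if T is true

# Two rivals making up not-T, as (prior, likelihood) pairs.
# Their priors sum to p(not-T|B) = 0.8.
rivals = [(0.4, 1e-5), (0.4, 1e-6)]

score_t = like_t * prior_t
score_rivals = sum(prior * like for prior, like in rivals)

# Posterior of T: its score divided by the sum of all scores.
post_t = score_t / (score_t + score_rivals)
print(post_t)
```

Here `score_t` is only 0.0002, but that is still far larger than the combined score of the rivals, so the posterior of T comes out close to one.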
5. Ambiguous information doesn’t change anything
If certain data D is equally probable on $T_1$ and $T_2$, then it doesn’t help us decide between these two theories. This is true even if the data appears problematic for $T_1$, i.e. $p(D|T_1 B)$ is small.
Brendon has made this point quite nicely in a previous post, using the example of a lottery. Note, in particular, that we don’t need to throw away ambiguous information by, say, replacing “John Smith won the lottery” with “someone won the lottery”. Bayes’ theorem automatically filters out what is irrelevant.
6. Nothing new, nothing changes
If knowing B is enough to know D (i.e. if D follows from B) then $p(D|B) = 1$. It also follows that the likelihood $p(D|TB) = 1$, since if D follows from B then adding T won’t change this. Thus, from Bayes’ theorem if $p(D|B) = 1$ then $p(T|DB) = p(T|B)$. This makes sense – if D doesn’t tell us anything new, then the probability of T shouldn’t change.
7. Extraordinary claims require extraordinary evidence
This is another corollary of (4), which we can formulate as follows,
If $p(T|B) \ll 1$, then $p(T|DB) > \tfrac{1}{2}$ requires $p(D|TB) \gg p(D|\bar{T}B)$.
An extraordinary claim is a priori unlikely, hence $p(T|B) \ll 1$. The extraordinary evidence, in this case, is not that the likelihood $p(D|TB)$ is large. This is neither necessary nor sufficient, as we saw above. We need the evidence to dramatically favour our theory over all others, hence $p(D|TB) \gg p(D|\bar{T}B)$.
Note that an extraordinary claim can be the most probable a posteriori. “Extraordinary claims require extraordinary evidence” is sometimes interpreted as “extraordinary claims can go jump”. Bayes’ theorem won’t allow this. We cannot dismiss theories on their prior alone (unless that prior is exactly zero).
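How extraordinary does the evidence need to be? Rearranging the condition for a theory to be more probable than not, the likelihood ratio must exceed the prior odds against the theory. A minimal sketch, with hypothetical function names and a made-up one-in-a-million prior:

```python
def required_likelihood_ratio(prior):
    """Ratio p(D|T B) / p(D|not-T B) above which p(T|D B) exceeds 1/2."""
    if not 0 < prior < 1:
        raise ValueError("prior must be strictly between 0 and 1")
    return (1 - prior) / prior

def posterior(prior, ratio):
    """p(T|D B) written in terms of the likelihood ratio."""
    return ratio * prior / (ratio * prior + (1 - prior))

prior = 1e-6  # an "extraordinary" claim (illustrative value)
threshold = required_likelihood_ratio(prior)

print(threshold)                         # roughly a million
print(posterior(prior, threshold / 10))  # evidence too weak: below 1/2
print(posterior(prior, threshold * 10))  # strong enough: above 1/2
```

The threshold is finite for any non-zero prior, which is why a small prior alone can never rule a theory out.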
8. Certainty is serious
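Bayes’ theorem takes certainty seriously. Probabilities get stuck at zero and one. Given some theory T, if there is some background information B which causes you to assign $p(T|B) = 0$ or $p(T|B) = 1$, then no “new” information D can change your mind.
If $p(T|B) = 1$, then $p(\bar{T}|B) = 0$ and by Bayes’ theorem,

$$p(T|DB) = \frac{p(D|TB) \times 1}{p(D|TB) \times 1 + p(D|\bar{T}B) \times 0} = 1$$

Similarly if $p(T|B) = 0$, then $p(T|DB) = 0$. (Exercise for the reader).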
This doesn’t hold approximately. Probabilities that are nearly one or zero don’t necessarily remain near one or zero when other data is added.
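The contrast between exact and approximate certainty is easy to check numerically. In this sketch (the function name and numbers are my own), the same strongly contrary data is fed to a prior of exactly one and to a prior of 0.999.

```python
def posterior(prior, like_t, like_not_t):
    """p(T|D B) from p(T|B), p(D|T B) and p(D|not-T B)."""
    marginal = like_t * prior + like_not_t * (1 - prior)
    return like_t * prior / marginal

# Data that strongly favours not-T (illustrative likelihoods).
like_t, like_not_t = 0.001, 0.999

stuck_at_one = posterior(1.0, like_t, like_not_t)     # stays at 1
stuck_at_zero = posterior(0.0, like_not_t, like_t)    # stays at 0
nearly_certain = posterior(0.999, like_t, like_not_t) # drops to about 0.5

print(stuck_at_one, stuck_at_zero, nearly_certain)
```

A prior of exactly one shrugs off the data entirely, while a prior of 0.999 is dragged all the way down to roughly even odds.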
9. We can compare two theories independently of all others
We can consider just two theories as follows,

$$\frac{p(T_1|DB)}{p(T_2|DB)} = \frac{p(D|T_1 B)}{p(D|T_2 B)} \times \frac{p(T_1|B)}{p(T_2|B)}$$
This is independent of all the other possible theories considered above, since the marginal likelihood cancels out. Restricting our focus to just two theories can obviously be very useful in practice.
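The following sketch illustrates the cancellation: the posterior ratio of two theories is computed from full posteriors over a three-theory partition, while the likelihood of the third theory is varied. All names and numbers are hypothetical.

```python
def posteriors(priors, likelihoods):
    """Posterior for each theory in a finite partition."""
    joint = [like * prior for like, prior in zip(likelihoods, priors)]
    marginal = sum(joint)
    return [j / marginal for j in joint]

priors = [0.3, 0.2, 0.5]   # p(T_i|B) for three theories (made up)
like_1, like_2 = 0.4, 0.1  # p(D|T_1 B), p(D|T_2 B)

# Odds computed directly, without reference to the third theory.
direct_odds = (like_1 / like_2) * (priors[0] / priors[1])

# Odds computed from full posteriors, for two different third-theory likelihoods.
odds_from_posteriors = []
for like_3 in (0.05, 0.9):
    post = posteriors(priors, [like_1, like_2, like_3])
    odds_from_posteriors.append(post[0] / post[1])

print(direct_odds, odds_from_posteriors)
```

Changing the third theory’s likelihood changes every individual posterior, but the ratio between the first two stays fixed.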
10. A theory divided against itself will not stand.
Suppose my theory can be broken up into a number of propositions, say $T = T_1 T_2$. The likelihood $p(D|TB)$ depends on the whole theory T, and the posterior probability of that theory is,

$$p(T|DB) = \frac{p(T_1|T_2 B)\,p(T_2|B)\,p(D|T_1 T_2 B)}{p(D|B)}$$

The first term on the right hand side is the one I’m after. $p(T_1|T_2 B)$ is the probability that the first part of my theory is true, given the second part and what we know, $B$. If the theory has some internal tension then this will penalize the theory as a whole. For example, consider:
$D$ = a murder scene. A locked third storey room. An open window. A boiled sweet is lying on the floor.
$B$ = information about other murders, people in general etc.
$T_1$ = the killer climbed up a drainpipe into the third storey room.
$T_2$ = the killer is over 80 years old.
While an 80+ year old, drainpipe-climbing killer explains the evidence ($p(D|T_1 T_2 B)$ is high), the probability that the killer climbed a drainpipe given that they are 80+ years old is rather low ($p(T_1|T_2 B) \ll 1$). This counts against, though doesn’t definitively rule out, the theory.