I laid down the basics of probability theory and Bayes’ theorem in a previous post. Here’s the story, as a reminder. We have some background information B and some data D. We want to know: what is the probability that some theory T is true, given the background information and the data? We write this as $p(T|DB)$, and expand it using Bayes’ theorem:
$$p(T|DB) = \frac{p(D|TB)\, p(T|B)}{p(D|B)}$$
However, this “background information” is a little vague. What puts something in the background? How much background do I need to dig up? How do we divide our knowledge into background information B and data D? Does it matter? Is it that background information is being assumed as known, while for the data, as with all measurements, we must acknowledge a degree of uncertainty?
Here are a few things about background information.
Tell me everything
The posterior views data and background equally
Calculate probabilities of data, with background information
Both background and data are taken as given
You should divide K cleanly
1. Tell me everything
The question is this: given everything I know, K, what is the probability that some statement T is true? The idea is that a rational thinker, in evaluating some statement T, will take into account everything they know. Remember that one of the desiderata of probability theory, taken as a rational approach to reasoning with uncertainty, is that information must not be arbitrarily ignored. In principle, everything we know should be in K somewhere. So tell me everything.
In practice, thankfully, irrelevant information can be ignored as it will factor out anyway (Point 5). That gives us the definition of “relevant” in probability theory: a statement is relevant if including it as given changes our probabilities.
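As a minimal numerical sketch of this definition of relevance, the Python snippet below builds a made-up joint distribution over three binary statements T, X and Z. The numbers and the helper names `prob` and `cond` are illustrative assumptions of mine, not taken from any real problem. Z is constructed to be independent of T and X, so including it as given should leave the probability of T unchanged, while including X should change it.

```python
from itertools import product

# Hypothetical joint distribution over binary statements T, X, Z.
# p(T, X) is an arbitrary illustrative table; Z is an independent coin
# with p(Z) = 0.3, so by construction it carries no information about T.
p_tx = {(1, 1): 0.32, (1, 0): 0.08, (0, 1): 0.18, (0, 0): 0.42}
p_z = {1: 0.3, 0: 0.7}
joint = {(t, x, z): p_tx[(t, x)] * p_z[z]
         for t, x, z in product((0, 1), repeat=3)}

NAMES = ("T", "X", "Z")


def prob(**fixed):
    """Probability that the named statements take the given truth values."""
    return sum(p for world, p in joint.items()
               if all(world[NAMES.index(k)] == v for k, v in fixed.items()))


def cond(target, given):
    """Conditional probability p(target | given); both are name -> value dicts."""
    return prob(**target, **given) / prob(**given)


p_T = prob(T=1)                                   # nothing given
p_T_given_X = cond({"T": 1}, {"X": 1})            # X is relevant: this differs from p_T
p_T_given_XZ = cond({"T": 1}, {"X": 1, "Z": 1})   # Z is irrelevant: same as p_T_given_X

print(p_T, p_T_given_X, p_T_given_XZ)
```

In this toy setup, adding X moves the probability of T, so X is relevant; adding Z on top of X does not, so Z can be dropped without changing the answer.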
2. The posterior views data and background equally
Why, then, have we decided to break up everything we know, K, into “data” D and “background” B, so that K = DB? (Remember: DB means “both D and B are true”.) The reason is that probabilities don’t grow on trees. If we had a black box that handed out posteriors $p(T|K)$ for any information K and theory T we care to think of, then we wouldn’t need to worry about Bayes’ theorem or background and data. Remember: the whole point of probability identities is to take the probability we want and write it in terms of probabilities we have.
If K contains a great many statements (as it always does) then there will be multiple ways to divide it into “background” and “data”. We could have $K = D_1 B_1$, or $K = D_2 B_2$, or … . It is one of the desiderata of probability theory that identical states of knowledge should result in identical assigned probabilities. Since knowing $D_1$ and $B_1$ is the same state of knowledge as knowing $D_2$ and $B_2$, the posterior must be the same. If you like, $p(T|D_1 B_1) = p(T|D_2 B_2) = p(T|K)$. In particular, $p(T|DB) = p(T|BD)$. So take heart: there is no right way or wrong way to divide K. There are, however, easier and harder ways.
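A small numerical check can make this concrete. The sketch below assumes a made-up joint distribution over three binary statements T, X and Y (the numbers and the function names `prob` and `bayes_posterior` are illustrative assumptions). It evaluates likelihood × prior / marginal likelihood twice: once with X labelled as data and Y as background, and once with the labels reversed. Both should agree with the posterior read directly off the joint distribution.

```python
# Hypothetical joint probabilities p(T, X, Y) over binary statements.
# The numbers are arbitrary; they only need to be positive and sum to 1.
joint = {
    (1, 1, 1): 0.20, (1, 1, 0): 0.10, (1, 0, 1): 0.05, (1, 0, 0): 0.05,
    (0, 1, 1): 0.10, (0, 1, 0): 0.15, (0, 0, 1): 0.15, (0, 0, 0): 0.20,
}
NAMES = ("T", "X", "Y")


def prob(**fixed):
    """Probability that the named statements take the given truth values."""
    return sum(p for world, p in joint.items()
               if all(world[NAMES.index(k)] == v for k, v in fixed.items()))


def bayes_posterior(data, background):
    """p(T | data, background) via likelihood * prior / marginal likelihood."""
    likelihood = prob(T=1, **data, **background) / prob(T=1, **background)  # p(D|TB)
    prior = prob(T=1, **background) / prob(**background)                    # p(T|B)
    marginal = prob(**data, **background) / prob(**background)              # p(D|B)
    return likelihood * prior / marginal


x_as_data = bayes_posterior(data={"X": 1}, background={"Y": 1})
y_as_data = bayes_posterior(data={"Y": 1}, background={"X": 1})
direct = prob(T=1, X=1, Y=1) / prob(X=1, Y=1)  # p(T|K) straight from the joint

print(x_as_data, y_as_data, direct)
```

Here every term is easy to compute either way because the full joint distribution is written down. In a real problem that is exactly what we lack, which is why one labelling is usually much easier than the other.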
3. Calculate probabilities of data, with background information.
There is a unique posterior probability of T, given K. Thus, any division of K into D and B that allows us to calculate the likelihood, prior and marginal likelihood (i.e. the right-hand side of Bayes’ theorem) will do. The question to ask is: how can I divide K into parts such that I can calculate the probability of one part given the other part (and T)? That’s all there is to it – call the first part the “data” and the second part the “background”.
These names – background and data – are simply the way it usually works out in practice. I am in principle free to reverse the labels of the background and data. In practice, however, I will usually be unable to calculate the terms of Bayes’ theorem in this case.
4. Both background and data are taken as given.
Looking again at the terms of Bayes’ theorem, notice that the background information appears as a given in each term, while two of the terms consider the probability of the data. Does this mean that, while the uncertainty of the data is taken into account, the background is treated as certain for the sake of this calculation?
The discussion above shows that the background and data are simply labels, and thus they cannot have a different status in the calculation. In particular, nothing would change in principle if I reversed the labels.
The posterior $p(T|DB)$ shows that both the data and the background are taken as given. The posterior is the question of a rational enquiry – what follows from what I know, and with what degree of certainty? Calculating the probability of D given TB (the likelihood) doesn’t imply that we are treating D as uncertain. Rather, it is really T we are probing. We are testing the strength of the connection between T and D by asking: “how probable would D be if all I knew were T and B?”. I’ll have more to say about this in a future post.
5. You should divide K cleanly
I’ve been assuming that, when apportioning everything you know K into B and D, there is no overlap between the pieces. This is a good idea. Not because an overlap would be wrong – as long as everything is in there somewhere, you’re OK. But you’re wasting your time, because this will happen …
$$D = D'C, \qquad B = B'C,$$
where C represents the common statement which you’ve unwisely left in both B and D. Recall that, in a Boolean algebra, AA = A, i.e. A is true if and only if “A and A” is true. Now, here’s what happens in Bayes’ theorem, for any theory T:
$$\begin{aligned}
p(T|DB) &= \frac{p(D'C|TB'C)\, p(T|B'C)}{p(D'C|B'C)} && \text{(substituting from above)} \\
&= \frac{p(D'|TB'C)\, p(C|TB'C)\, p(T|B'C)}{p(D'|B'C)\, p(C|B'C)} && \text{(product rule, } CC = C\text{)} \\
&= \frac{p(D'|TB)\, p(T|B)}{p(D'|B)} = p(T|D'B) && \text{(} p(C|C) = 1\text{)}
\end{aligned}$$
Since D and B are just labels, I could swap them and prove that $p(T|DB) = p(T|DB')$. In other words, if there is an overlap between B and D, you’ll just end up doing the calculations as if you’d put the overlap C into one or the other. So you might as well divide K cleanly to start with.
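To see this numerically, here is a minimal sketch under made-up assumptions: a toy joint distribution over four binary statements T, D_rest, B_rest and C, where D_rest and B_rest stand for the non-shared parts of the data and background (the names, the weights and the helper functions are all hypothetical). It compares Bayes’ theorem with C left in both pieces against the two clean divisions.

```python
from itertools import product

# Hypothetical statements: theory T, the non-shared part of the data (D_rest),
# the non-shared part of the background (B_rest), and the shared statement C.
NAMES = ("T", "D_rest", "B_rest", "C")

# Arbitrary positive weights 1, 2, ..., 16 over the 16 possible worlds,
# normalised to sum to 1. Any positive weights would do.
worlds = list(product((0, 1), repeat=len(NAMES)))
weights = [i + 1 for i in range(len(worlds))]
total = sum(weights)
joint = {w: wt / total for w, wt in zip(worlds, weights)}


def prob(statements):
    """Probability that every named statement has its given truth value."""
    return sum(p for w, p in joint.items()
               if all(w[NAMES.index(k)] == v for k, v in statements.items()))


def bayes_posterior(data, background):
    """p(T | data, background) via likelihood * prior / marginal likelihood."""
    # Merging dicts mirrors the Boolean identity CC = C: a statement that
    # appears in both data and background is still a single condition.
    tb = {"T": 1, **background}
    likelihood = prob({**data, **tb}) / prob(tb)                  # p(D|TB)
    prior = prob(tb) / prob(background)                           # p(T|B)
    marginal = prob({**data, **background}) / prob(background)    # p(D|B)
    return likelihood * prior / marginal


overlapping = bayes_posterior({"D_rest": 1, "C": 1}, {"B_rest": 1, "C": 1})
c_in_background = bayes_posterior({"D_rest": 1}, {"B_rest": 1, "C": 1})
c_in_data = bayes_posterior({"D_rest": 1, "C": 1}, {"B_rest": 1})

print(overlapping, c_in_background, c_in_data)
```

All three calls should print the same posterior (up to floating-point rounding), which is the point of the derivation: the overlap costs extra bookkeeping and buys nothing.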