As background for some future posts, I need to catalogue a few facts about Bayes’ theorem. This is all standard probability theory. I’ll be roughly following the discussion and notation of Jaynes.
Probability
We start with propositions, represented by A, B, etc. These propositions are in fact true or false, though we may not know which. We assume that we can do Boolean things with these propositions, in particular:
- Conjunction: $AB$ means “both A and B are true”
- Disjunction: $A + B$ means “at least one of the propositions A, B is true”
- Negation: $\bar{A}$ means “not A” (i.e. A is false)
We want to assign probabilities to propositions to represent how likely they are to be true, given the information in other propositions. The function $p(A|B)$ assigns a probability to the proposition $A$, using only the information in $B$. Read “p” as “the probability of” and the vertical bar “|” as “given”, so $p(A|B)$ reads “the probability of A given B”. There are no “raw” probabilities, $p(A)$.
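To make the notation concrete, here is a minimal sketch in Python. It rests on assumptions that are not part of the discussion above: a hypothetical toy model of two coin flips, with every possible outcome (“world”) given equal weight. The helper names (`conj`, `disj`, `neg`, `p`) are mine, chosen for illustration. Propositions are functions that are true or false in each world, and `p` always takes a second argument, mirroring the point that there are no “raw” probabilities.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy model: two coin flips; each "world" is one possible outcome.
# Equal weighting of worlds is an assumption made purely for illustration.
WORLDS = list(product("HT", repeat=2))

# A proposition is a function from a world to True/False.
def first_is_heads(w):
    return w[0] == "H"

def second_is_heads(w):
    return w[1] == "H"

def conj(a, b):
    """AB: both a and b are true."""
    return lambda w: a(w) and b(w)

def disj(a, b):
    """A + B: at least one of a, b is true."""
    return lambda w: a(w) or b(w)

def neg(a):
    """Not a."""
    return lambda w: not a(w)

def p(a, given):
    """p(A|B): the fraction of worlds allowed by `given` in which `a` holds."""
    allowed = [w for w in WORLDS if given(w)]
    if not allowed:
        raise ValueError("the conditioning proposition is false in every world")
    return Fraction(sum(1 for w in allowed if a(w)), len(allowed))

A, B = first_is_heads, second_is_heads
at_least_one = disj(A, B)

print(p(conj(A, B), at_least_one))  # 1/3
print(p(neg(A), at_least_one))      # 1/3
```

Counting equally weighted worlds is only one way to arrive at the numbers; the notation itself does not depend on it.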
What is p? The older approach to probability, due to Kolmogorov et al., requires that p satisfy certain axioms. Roughly,
A1. $p(A|B)$ is a non-negative number.
A2. One means certain: $p(A|B) = 1$ if A is certain, given B.
A3. Or means add: if at most one of $A_1, A_2, \ldots$ (countable) is true (i.e. they are disjoint), then $p(A_1 + A_2 + \cdots | B) = \sum_i p(A_i|B)$.
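As a sanity check, the three axioms can be verified numerically on a small example. This is a sketch under stated assumptions, not a proof: the model (one roll of a six-sided die with equally weighted faces) and all function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical toy model: one roll of a six-sided die, faces equally weighted.
WORLDS = range(1, 7)

def p(a, given):
    """p(A|B): the fraction of worlds allowed by `given` in which `a` holds."""
    allowed = [w for w in WORLDS if given(w)]
    if not allowed:
        raise ValueError("the conditioning proposition is false in every world")
    return Fraction(sum(1 for w in allowed if a(w)), len(allowed))

def is_even(w):
    return w % 2 == 0

def is_two(w):
    return w == 2

def is_four(w):
    return w == 4

def is_six(w):
    return w == 6

def two_or_four_or_six(w):
    # Disjunction of three propositions, at most one of which can be true.
    return is_two(w) or is_four(w) or is_six(w)

def at_least_three(w):
    # Plays the role of the given information B.
    return w >= 3

# A1: probabilities are non-negative.
a1 = p(is_even, at_least_three) >= 0

# A2: given that the roll is even, "two or four or six" is certain.
a2 = p(two_or_four_or_six, is_even) == 1

# A3: for mutually exclusive propositions, "or" means add.
lhs = p(two_or_four_or_six, at_least_three)
rhs = sum(p(a, at_least_three) for a in (is_two, is_four, is_six))
a3 = lhs == rhs

print(a1, a2, a3, lhs)  # True True True 1/2
```

Only the finite version of A3 is exercised here, since the toy model has finitely many worlds.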
Since the publication of Cox’s Theorem, an alternative approach to probability has gained popularity. Cox (again I’m following Jaynes) proposes the following desiderata of rationality, not as arbitrary axioms but as expectations of any rational approach to reasoning in the face of uncertainty.
D1. Probabilities are represented by real numbers.
D2. Probabilities change in common sense ways. For example, if learning C makes B more likely, but doesn’t change how likely A is, then learning C should make AB more likely.
D3. If a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result.
D4. Information must not be arbitrarily ignored. All given evidence must be taken into account.
D5. Identical states of knowledge (except perhaps for the labeling of the propositions) should result in identical assigned probabilities.
The great advance of Cox’s theorem was to show that one can start with these desiderata and arrive at Kolmogorov’s axioms (at least, for the finite version of A3). Thus, the traditional laws of probability can be applied to more than just frequencies.
Conjunction and Total Probability
Beginning with the desiderata, we can derive more than Kolmogorov’s axioms. There are a number of useful probability identities. Remember that an identity is a formula that holds for any propositions we may substitute in. Here are two preliminary identities, before we get to Bayes’ theorem.
Conjunction: If then, (more…)