
## How to win at the races

I’ve rambled about this before, but with the Melbourne Cup – “the race that stops a nation” – a few days away and Tom Waterhouse’s annoying face on TV too often, it’s worth repeating.

Don’t bet on the horse you think will win!

More precisely, don’t necessarily bet on the horse you think will win. Here is the only betting system that works:

1. For each horse in the race, and before you look at the price offered by the bookmaker, write what you think the probability (as a percentage) is that the horse will win. I.e. if the race was run 100 times, how many times would this horse win? You’ll have to do your homework on the field.
2. For each horse, take your probability and multiply it by the bookmaker’s price. Call that the magic number.
3. If any of the horses have a magic number greater than 100, bet on the horse with the highest magic number.
4. If none of the horses have a magic number greater than 100, don’t bet. Go home.
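
The four steps above can be sketched in a few lines of Python. This is only an illustration: it assumes the bookmaker’s price is quoted as decimal odds (the total returned per $1 staked, stake included), and the function names are made up for the example.

```python
def magic_numbers(probabilities_pct, prices):
    """Step 2: your probability (as a percentage) times the bookmaker's price.

    Assumes prices are decimal odds, i.e. dollars returned per $1 staked.
    """
    return {horse: probabilities_pct[horse] * prices[horse]
            for horse in probabilities_pct}


def pick_bet(probabilities_pct, prices):
    """Steps 3 and 4: the horse with the highest magic number, or None."""
    numbers = magic_numbers(probabilities_pct, prices)
    best = max(numbers, key=numbers.get)
    # Only bet if the best magic number clears 100; otherwise go home.
    return best if numbers[best] > 100 else None
```

With hypothetical probabilities of 50, 30 and 20 per cent for three horses priced at 1.8, 3.0 and 6.0, the magic numbers are roughly 90, 90 and 120, so the only bet worth making is on the outsider, not on the horse you think will win.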

The magic number is how much (on average) you would get back if you bet $1 on the horse 100 times, so it had better be more than 100. The way that the bookmaker guarantees that they will make a profit in the long run is to ensure that no magic numbers are greater than 100. Because of the bookmaker’s slice (the overround), the odds are stacked against the average punter. You will only end up with a magic number greater than 100 if either you have made a mistake on step 1, or the bookmaker has made a mistake on his price.

This leads to the following advice. You should only bet on a horse if a) you know more than the bookmaker, and b) the bookmaker has significantly underestimated one of the horses. Thus, the better the bookmaker, the more reason not to bet.

And so, we come to Tom Waterhouse’s online betting business: “I’ve got four generations of betting knowledge in my blood. … Bet with me, and that knowledge can be yours.” This is exactly the information you need to conclude that you should never bet with Tom Waterhouse. The ad might as well say “bet with me; I know how to take your money”. You don’t want a bookmaker who knows horse racing inside-and-out, from horse racing stock, armed with all the facts, knowing all the right people. You don’t want a professional in a sharp suit surrounded by analysts at computer screens. You want an idiot. You want someone who doesn’t know which end of the horse is the front, armed with a broken abacus and basing his prices on a combination of tea-leaf-reading, a lucky 8-ball and “the vibe”. You want a bookmaker that is going out of business.

The more successful the bookmaker, the further you should stay away. The TAB was established in 1964, has over a million customers, 2,500 retail outlets, and made a profit of $534.8 million in 2011, up 14%. Translation: never bet with the TAB. Betfair’s profits were $600 million; SportingBet made $2 billion in 2009. With those resources, they’ll always know more than you.
If you’ve heard of them, don’t bet with them. Go home.
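
To see how the overround stacks the odds, here is a small sketch. It assumes decimal prices, and it supposes that the bookmaker’s prices are right about the relative chances of the horses. The function names are invented for the example.

```python
def overround_pct(prices):
    """The bookmaker's slice: implied probabilities (in %) summed, minus 100.

    Assumes decimal prices, so a price of 2.0 implies a 50% chance.
    """
    return sum(100.0 / price for price in prices) - 100.0


def magic_number_if_book_is_right(prices):
    """Magic number of every horse when the true probabilities are the
    bookmaker's implied probabilities rescaled to sum to 100%."""
    return 100.0 / sum(1.0 / price for price in prices)
```

For hypothetical prices of 1.8, 3.0 and 4.5 the implied probabilities add up to about 111 per cent. If the bookmaker has the relative chances right, every horse in the race has a magic number of 90.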

Hopefully you’re getting my point. Don’t bet on sports. If you go to the races, put on a nice outfit, drink a few beers and give the money to charity. If you must bet, have a random sweepstakes with your friends. You’ll get much better odds that way.

## Coincidences and the Lottery

Coincidences happen surprisingly often. Yet they are often not meaningful, i.e. they are “just a coincidence” and do not imply that we should change our worldview. For example, suppose there are a million people in contention for a lottery, and John Smith is found to win. Before knowing this, our probability for it is $10^{-6}$:

$P(\textnormal{John Smith wins} | \textnormal{fair lottery}) = 10^{-6}$

People often get afraid of this tiny probability, and proclaim something like “it’s not the probability of John Smith winning the lottery that is relevant, but the probability that someone wins”. However, this is anti-Bayesian nonsense. This tiny probability is, by Bayes’ rule, relevant for getting a posterior probability for $\textnormal{fair lottery}$. So how is it that we often still believe in the fair lottery (or that a coincidence is not meaningful)?

The answer is quite simple: the likelihood for the alternative, $\textnormal{unfair lottery}$ hypothesis, is just as small:
$P(\textnormal{John Smith wins} | \textnormal{unfair lottery}) = 10^{-6}$.
The reason is that before we knew who won, we had no reason to single out John Smith, and had to spread the total probability (1) over a million minus one alternatives (that the lottery was rigged in favor of one of the other entrants). Using analogous reasoning, yes, coincidences have tiny probability, but they also have tiny probability given the hypothesis of a mysterious force operating, because before the coincidence happened we didn’t know which of the multitude of coincidences were going to occur.
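
Written out with Bayes’ rule in odds form, using the two likelihoods above, the tiny numbers cancel and the posterior odds equal the prior odds:

$\frac{P(\textnormal{fair lottery} | \textnormal{John Smith wins})}{P(\textnormal{unfair lottery} | \textnormal{John Smith wins})} = \frac{P(\textnormal{fair lottery})}{P(\textnormal{unfair lottery})} \times \frac{10^{-6}}{10^{-6}} = \frac{P(\textnormal{fair lottery})}{P(\textnormal{unfair lottery})}$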

For more on this topic, you may be interested in this paper (by myself and Matt).

## Conjecture of the evening

Especially for Cusp, I note the following (proof left for undergraduates):

(Convex h-index conjecture) For n chronologically distinct papers, each of which cites all previous papers, the corresponding h-index is the number of non-congruent diagonals in a regular polygon with number of sides 2 greater than n.

As a corollary, academics engaging in such cheeky behaviour may be indexed with the dimension of their corresponding polygon.
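
For anyone who would rather check than prove, here is a brute-force sketch. It is a numerical check for small n, not a proof. It relies on the standard fact that a regular polygon with m sides has floor(m/2) − 1 distinct diagonal lengths, and the function names are invented for the example.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count < rank:
            break
        h = rank
    return h


def citations_if_each_cites_all_previous(n):
    """Paper i (0-indexed, chronological) is cited by the n - 1 - i later papers."""
    return [n - 1 - i for i in range(n)]


def non_congruent_diagonals(sides):
    """Distinct diagonal lengths in a regular polygon: diagonals join
    vertices k steps apart for k = 2 .. floor(sides / 2)."""
    return max(sides // 2 - 1, 0)


def conjecture_holds(n):
    """Compare the h-index for n papers with the polygon of n + 2 sides."""
    papers = citations_if_each_cites_all_previous(n)
    return h_index(papers) == non_congruent_diagonals(n + 2)
```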

## Surprising Statistic of the Day

From the Sydney Morning Herald:

Alcohol plays a role in 50 to 60 per cent of the nearly 300,000 criminal cases that come before the state’s Local Courts each year, [New South Wales] Chief Magistrate Graeme Henson said.

That’s about twice as high as I’d have guessed. I tried to track down the source of this statistic, but the closest I could find was a report called “Alcohol related crime for each NSW Local Government Area: Numbers, proportions, rates, trends and ratios” from the NSW Bureau of Crime Statistics and Research. The report gives the percentage of “incidents of non-domestic violence related assault recorded by NSW Police” that are alcohol related as 45%.

I’d love to know what that number is for the United Kingdom, as well as European countries like France or Germany, which seem to have an alcohol culture without having as much of a binge drinking culture. I’d expect the percentage of alcohol related crime to be lower for the UK and even lower for most of Europe. I’ll try to track those down.

As to what should be done about the problem, I have no idea. Perhaps nothing – it may be a correlation without causation. Perhaps it’s an alpha male thing: put too many young men in a nightclub with available women and testosterone will cause friction. The alcohol just happened to be there as well. On the other hand, the anecdotal evidence that certain people are more likely to “kick off after having a few” is well known.

## A Tale of Two Entropies

For those of us who work with degree-of-plausibility (“Bayesian”) probabilities, two situations regularly arise. The first is the need to update probabilities to take into account new information. This is usually done using Bayes’ Rule, when the information comes in the form of a proposition that is known to be true. An example of such a proposition is “The data are 3.444, 7.634, 1.227”.

More generally, information is any justified constraint on our probabilities. For example, “P(x > 3) should be 0.75” is information. If our current probability distribution $q(x)$ doesn’t satisfy the constraint, then we had better change to a new distribution $p(x)$ that does. This doesn’t mean that any old $p(x)$ will do – our $q(x)$ contained hard-won information and we want to preserve that. To proceed, we choose the $p(x)$ that is as close as possible to $q(x)$, but satisfies the constraint. Various quite persuasive arguments (see here) suggest that the correct notion of closeness that we should maximise is the relative entropy:

$H(p; q) = -\int p(x) \log \frac{p(x)}{q(x)} dx$

With no constraints, the best possible $p(x)$ is equal to $q(x)$.
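
As a worked example, take the constraint “P(x > 3) should be 0.75” from above. Write $Q = \int_{x > 3} q(x) dx$ for the probability that $q(x)$ currently assigns to $x > 3$ (the symbol $Q$ is introduced here just for this sketch, and we assume $0 < Q < 1$). Maximising $H(p; q)$ subject to the constraint and to normalisation rescales $q(x)$ on each side of $x = 3$ and leaves its shape within each region untouched:

$p(x) = \frac{0.75}{Q} q(x) \textnormal{ for } x > 3, \quad p(x) = \frac{0.25}{1 - Q} q(x) \textnormal{ for } x \leq 3$

If $q(x)$ already satisfies the constraint ($Q = 0.75$), both factors equal one and $p(x) = q(x)$.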

Another situation that arises often is the need to simplify complex problems. For example, we might have some probability distribution $q(x)$ that is non-Gaussian, but for some reason we only want to use Gaussians for the rest of the calculation, perhaps for presentation or computational reasons. Which Gaussian should we choose to become our $p(x)$? Many people recommend maximising the relative entropy for this also: in the literature, this is known as a variational approximation, variational Bayes, or the Bogoliubov approximation (there are also variations (pun not intended) on this theme).

There are known problems with this technique. For instance, as David MacKay notes, the resulting probability distribution $p(x)$ is usually narrower than the original $q(x)$. This makes sense, since the variational approximation basically amounts to pretending you have information that you don’t actually have. This issue raises the question of whether there is something better that we could do.

I suggest that the correct functional to maximise in the case of approximating one distribution by another is actually the relative entropy, but with the two distributions reversed:

$H(q; p) = -\int q(x) \log \frac{q(x)}{p(x)} dx$

Why? Well, for one, it just works better in extreme examples I’ve concocted to magnify (a la Ed Jaynes) the differences between using $H(p; q)$ and $H(q; p)$. See the figure below:

If the blue distribution represented your actual state of knowledge, but out of necessity you could only use the red or the green distribution, which would you prefer? I find it very hard to imagine an argument that would make me choose the red distribution over the green. Another argument supporting the use of this ‘reversed’ entropy is that it is equivalent to generating a large number of samples from q, and then doing a maximum likelihood fit of p to these samples. I know maximum likelihood isn’t the best, most principled thing in the world, but in the limit of a large number of samples it’s pretty hard to argue with.
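
The maximum likelihood equivalence can be seen by splitting the reversed entropy into two terms:

$H(q; p) = -\int q(x) \log q(x) dx + \int q(x) \log p(x) dx$

The first term does not depend on $p$. For a large number $N$ of samples $x_1, \ldots, x_N$ drawn from $q$ (notation introduced here for the sketch), the second term is approximately the average log-likelihood of $p$:

$\int q(x) \log p(x) dx \approx \frac{1}{N} \sum_{i=1}^{N} \log p(x_i)$

So choosing $p$ to maximise $H(q; p)$ is, in the large-$N$ limit, the same as choosing $p$ to maximise the likelihood of the samples.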

A further example supporting the ‘reversed’ entropy is what happens if $q(x)$ is zero at some points. According to the regular entropy, any distribution $p(x)$ that is nonzero where $q(x)$ is zero, is infinitely bad. I don’t think that’s true, in the case of approximations – some leakage of probability to values we know are impossible is no catastrophe. This is manifestly different to the case where we have legitimate information – if $q(x)$ is zero somewhere then of course we want to have $p(x)$ zero there as well. If we’re updating probabilities, we’re trying to narrow down the possibilities, and resurrecting some is certainly unwarranted – but the goal in doing an approximation is different.

Maximising the reversed entropy also has some pretty neat properties. If the approximating distribution is a Gaussian, then its first and second moments should be chosen to match the moments of $q(x)$. If the original distribution is over many variables, but you want to approximate it by a distribution where the variables are all independent, just take all of the marginal distributions and multiply them together, and there’s your optimal approximation.
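
The Gaussian moment-matching property can be checked numerically with a rough sketch like the one below. Everything specific in it is an assumption chosen for illustration: the bimodal $q(x)$, the integration grid, and the function names. It evaluates $H(q; p)$ by a simple Riemann sum and compares the moment-matched Gaussian with a few perturbed ones.

```python
import math


def gaussian_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def gaussian_logpdf(x, mean, sd):
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))


def q_pdf(x):
    # Hypothetical non-Gaussian q: an equal mixture of two narrow Gaussians.
    return 0.5 * gaussian_pdf(x, -2.0, 0.5) + 0.5 * gaussian_pdf(x, 2.0, 0.5)


DX = 0.01
GRID = [-10.0 + i * DX for i in range(2001)]  # covers [-10, 10]


def moments_of_q():
    """Mean and standard deviation of q, by a Riemann sum on the grid."""
    mean = sum(x * q_pdf(x) for x in GRID) * DX
    var = sum((x - mean) ** 2 * q_pdf(x) for x in GRID) * DX
    return mean, math.sqrt(var)


def reversed_entropy(mean, sd):
    """H(q; p) = -integral of q log(q / p), with p a Gaussian(mean, sd)."""
    total = 0.0
    for x in GRID:
        q = q_pdf(x)
        if q > 0.0:  # terms with q = 0 contribute nothing
            total -= q * (math.log(q) - gaussian_logpdf(x, mean, sd)) * DX
    return total
```

Calling `reversed_entropy(*moments_of_q())` and then nudging the mean or the width in either direction should only ever lower the value.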

If $H(p; q)$ isn’t the best thing to use for approximations, that means that something in the derivation of $H(p; q)$ applies to legitimate information but does not apply to approximations. Most of the axioms (coordinate independence, consistency for independent systems, etc) make sense, and both entropies discussed in this post satisfy those. It is only at the very end of the derivation that the reversed entropy is ruled out, and by some pretty esoteric arguments that I admit I don’t fully understand. I think the examples I’ve presented in this post are suggestive enough that there is room here for a proof that the reversed entropy $H(q; p)$ is the thing to use for approximations. This means that maximum relative entropy is a little less than universal, but that’s okay – the optimal solutions to different problems are allowed to be different!

## Homogeneity, features

Yesterday I read a few of the recent papers of Francesco Sylos Labini, who has pursued a distinction between the common or garden type statistical homogeneity in the Universe that one reads about in textbooks, and a stronger form (‘super-homogeneity’) in which the mass fluctuations follow a behaviour that is sub-Poisson as a function of scale. This implies a sort of anti-correlation—a lattice of points is, for instance, sub-Poisson, as the points are deliberately avoiding one another—and has consequences for the form of the two-point correlation function:

$\int \xi(r) d^3 r = 0$

that look remarkably similar to those imposed by the integral constraint, but which are, in fact, quite different—the super-homogeneity condition affects the actual correlation function, while the correction usually referred to as the integral constraint affects estimators of the correlation function. I started writing a summary document on this topic for the reference of myself and others.

After DARK’s infamous $\gamma\Lambda$ session, I hit a sweet spot in coding productivity and wrote a bunch of scripts to extract spatial features from galaxy images, along lines suggested to me a week or so ago by Andrew Zirm. These features are extracted from a matrix that encodes the frequency of adjacency between threshold intensity levels in the image. It’s the sort of thing best shown with pictures, which perhaps I can post once Andrew has decided which direction to pursue next.
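
Until the pictures arrive, here is a minimal sketch of this kind of adjacency matrix. It is not the actual scripts. The thresholds, the choice of right and lower neighbours, the symmetric counting, and the function names are all assumptions made for illustration.

```python
from bisect import bisect_right


def quantise(image, thresholds):
    """Map each pixel to the index of the intensity band it falls in.

    `thresholds` must be sorted; a pixel below the first threshold is level 0.
    """
    return [[bisect_right(thresholds, value) for value in row] for row in image]


def adjacency_matrix(levels, n_levels):
    """Count how often level a sits next to level b (horizontally or vertically).

    Each neighbouring pair is counted in both directions, so the matrix is
    symmetric and a pair of equal levels adds 2 to the diagonal.
    """
    counts = [[0] * n_levels for _ in range(n_levels)]
    for r, row in enumerate(levels):
        for c, a in enumerate(row):
            if c + 1 < len(row):          # right-hand neighbour
                b = row[c + 1]
                counts[a][b] += 1
                counts[b][a] += 1
            if r + 1 < len(levels):       # neighbour below
                b = levels[r + 1][c]
                counts[a][b] += 1
                counts[b][a] += 1
    return counts
```

Spatial features would then be summary statistics of this matrix, though which ones to use is a separate choice.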

## WMAP 7 cosmological parameter set

Your Universe ca. 2010, per the WMAP+BAO+H0[1] maximum likelihood parameter set:

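| Parameter | Symbol | WMAP+BAO+H0 ML |
| --- | --- | --- |
| Hubble parameter | $h$ | 0.702 |
| | $H_0$ | 70.2 km/s/Mpc |
| Dark matter density | $\Omega_c h^2$ | 0.1120 |
| | $\Omega_c$ | 0.227 |
| Baryonic matter density | $\Omega_b h^2$ | 0.02246 |
| | $\Omega_b$ | 0.0455 |
| Total matter density | $\Omega_m h^2$ | 0.1344 |
| | $\Omega_m$ | 0.272 |
| Vacuum tension[2] | $\Omega_\Lambda$ | 0.728 |
| Amplitude of curvature perturbation at k = 0.002/Mpc | $\Delta^2_R$ | $2.45 \times 10^{-9}$ |
| Spectral index of density perturbations | $n_s$ | 0.961 |
| Size of linear density fluctuation at 8 Mpc/h | $\sigma_8$ | 0.807 |
| Redshift of matter–radiation equality | $z_{eq}$ | 3196 |
| Age of the Universe | $t_0$ | 13.78 Gyr |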

Parameters fit directly from the data are shown in a slightly different colour; all the others have been derived from the fit parameters using the usual definitions. The determination of zeq is carried out using the WMAP 7-year data on its own. The two papers in which these figures are given are:

Larson et al. (2010), arXiv:1001.4635
Komatsu et al. (2010), arXiv:1001.4538
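
As a sketch of what “the usual definitions” amount to for a few of the derived numbers: the Hubble constant is 100h km/s/Mpc, each density parameter is the corresponding physical density divided by h², and the total matter density is the sum of the dark and baryonic parts. The function and key names below are invented for the example.

```python
def derived_parameters(h, omega_c_h2, omega_b_h2):
    """Derive a few quantities from the fitted ones, using the usual definitions."""
    h2 = h * h
    omega_m_h2 = omega_c_h2 + omega_b_h2  # total matter = dark + baryonic
    return {
        "H0": 100.0 * h,                  # km/s/Mpc
        "Omega_c": omega_c_h2 / h2,
        "Omega_b": omega_b_h2 / h2,
        "Omega_m_h2": omega_m_h2,
        "Omega_m": omega_m_h2 / h2,
    }


# Inputs taken from the WMAP+BAO+H0 maximum likelihood values quoted above.
wmap7 = derived_parameters(h=0.702, omega_c_h2=0.1120, omega_b_h2=0.02246)
```

The results agree with the quoted values to within about one unit in the last digit. Small differences are to be expected when starting from rounded inputs.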

These papers contain many other numbers: in particular, for extensions to ΛCDM cosmology, such as neutrino species, non-zero spatial curvature and dark energy that is not the cosmological constant. I expect some of the parameters mentioned there and not here, particularly the $f_{NL}$ statistics of non-Gaussianity, to gain more public attention in the next decade as observations begin to determine the properties of the cosmological inflation that occurred in the very early Universe.

A final note: I’ve written this post only because these numbers are not written on an actual webpage—they are all in pdf or postscript files. But, it also gives me a chance to congratulate the WMAP team on their ongoing achievement.

Footnotes
1. Riess, A. et al. (2009), ApJ 699 539, arXiv:0905.0695
2. Dark energy, or, as assumed here, the cosmological constant.