I’m (Berian, that is, rather than Luke) attending the SF Data Science Summit today and tomorrow. I’m taking some rough notes as I go and want to publish them in digestible bits. One of the speakers I most enjoyed today was Carlos Guestrin (@guestrin), who gave a keynote and then a little 25-minute appendix later in the day. Here’s what I wrote down.
Don’t bet on the horse you think will win!
More precisely, don’t necessarily bet on the horse you think will win. Here is the only betting system that works:
- For each horse in the race, and before you look at the price offered by the bookmaker, write what you think the probability (as a percentage) is that the horse will win. I.e. if the race was run 100 times, how many times would this horse win? You’ll have to do your homework on the field.
- For each horse, take your probability and multiply it by the bookmaker’s price. Call that the magic number.
- If any of the horses has a magic number greater than 100, bet on the horse with the highest magic number.
- If none of the horses has a magic number greater than 100, don’t bet. Go home.
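The steps above can be sketched in a few lines of Python. This is only an illustration: the horse names, probabilities and prices are made up, and I'm assuming the bookmaker's price is quoted as decimal odds (the total returned per $1 staked, including the stake).

```python
def magic_numbers(probabilities_pct, prices):
    """Magic number for each horse: your probability (in %) times the price."""
    return {horse: probabilities_pct[horse] * prices[horse]
            for horse in probabilities_pct}


def choose_bet(probabilities_pct, prices):
    """Return the horse to back, or None if no magic number exceeds 100."""
    magic = magic_numbers(probabilities_pct, prices)
    best = max(magic, key=magic.get)
    return best if magic[best] > 100 else None


def overround_pct(prices):
    """Bookmaker's margin: how far the implied probabilities exceed 100%."""
    return sum(100.0 / price for price in prices.values()) - 100.0


# Hypothetical three-horse field (names and numbers are invented).
my_probabilities = {"Alpha": 50, "Bravo": 30, "Charlie": 20}    # step 1, in %
bookmaker_prices = {"Alpha": 1.8, "Bravo": 3.0, "Charlie": 6.0}  # decimal odds

print(magic_numbers(my_probabilities, bookmaker_prices))
print(choose_bet(my_probabilities, bookmaker_prices))
print(round(overround_pct(bookmaker_prices), 2))
```

In this made-up field only Charlie clears 100, so that is the bet, even though Alpha is the horse I think most likely to win. Shorten Charlie's price to 4.0 and the answer becomes "go home".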
The magic number is how much (on average) you would get back if you bet $1 on the horse 100 times, so it had better be more than 100. The way that the bookmaker guarantees a profit in the long run is to ensure that no magic numbers are greater than 100. Because of the bookmaker’s slice (the overround), the odds are stacked against the average punter. You will only end up with a magic number greater than 100 if either you have made a mistake in the first step, or the bookmaker has made a mistake in his price. This leads to the following advice.
You should only bet on a horse if
a) You know more than the bookmaker, and
b) The bookmaker has significantly underestimated one of the horses.
Thus, the better the bookmaker, the more reason not to bet. And so, we come to Tom Waterhouse’s online betting business:
“I’ve got four generations of betting knowledge in my blood. … Bet with me, and that knowledge can be yours.”
This is exactly the information you need to conclude that you should never bet with Tom Waterhouse. The ad might as well say “bet with me; I know how to take your money”. You don’t want a bookmaker who knows horse racing inside-and-out, from horse racing stock, armed with all the facts, knowing all the right people. You don’t want a professional in a sharp suit surrounded by analysts at computer screens. You want an idiot. You want someone who doesn’t know which end of the horse is the front, armed with a broken abacus and basing his prices on a combination of tea-leaf-reading, a lucky 8-ball and “the vibe”. You want a bookmaker that is going out of business.
The more successful the bookmaker, the further you should stay away. The TAB was established in 1964, has over a million customers, 2,500 retail outlets, and made a profit of $534.8 million in 2011, up 14%. Translation: never bet with the TAB. Betfair’s profits were $600 million; SportingBet made $2 billion in 2009. With those resources, they’ll always know more than you. If you’ve heard of them, don’t bet with them. Go home.
Hopefully you’re getting my point. Don’t bet on sports. If you go to the races, put on a nice outfit, drink a few beers and give the money to charity. If you must bet, have a random sweepstakes with your friends. You’ll get much better odds that way.
Coincidences happen surprisingly often. Yet they are often not meaningful, i.e. they are “just a coincidence” and do not imply that we should change our worldview. For example, suppose there are a million people in contention for a lottery, and John Smith is found to win. Before knowing this, our probability for it is:
$$P(\text{John Smith wins} \mid \text{fair lottery}) = 10^{-6}$$
People often get afraid of this tiny probability, and proclaim something like “it’s not the probability of John Smith winning the lottery that is relevant, but the probability that someone wins”. However, this is anti-Bayesian nonsense. This tiny probability is, by Bayes’ rule, relevant for getting a posterior probability for the fair-lottery hypothesis. So how is it that we often still believe in the fair lottery (or that a coincidence is not meaningful)?
The answer is quite simple: the likelihood for the alternative hypothesis, that the lottery was rigged, is just as small:
$$P(\text{John Smith wins} \mid \text{rigged lottery}) \approx 10^{-6}$$
The reason is that before we knew who won, we had no reason to single out John Smith, and had to spread the total probability (1) over a million minus one alternatives (that the lottery was rigged in favor of one of the other entrants). Using analogous reasoning, yes, coincidences have tiny probability, but they also have tiny probability given the hypothesis of a mysterious force operating, because before the coincidence happened we didn’t know which of the multitude of coincidences were going to occur.
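To make the arithmetic concrete, here is a minimal sketch of the Bayes' rule calculation. The prior of 0.9 for a fair lottery is an arbitrary choice for illustration, and I'm assuming that under the rigged hypothesis every entrant was, beforehand, equally likely to be the beneficiary.

```python
from fractions import Fraction

N_ENTRANTS = 1_000_000


def posterior_fair(prior_fair):
    """Posterior probability of a fair lottery after learning who won."""
    # Likelihood of this particular winner if the lottery is fair.
    like_fair = Fraction(1, N_ENTRANTS)
    # Assumption: if rigged, we had no idea in whose favour, so the
    # likelihood of this particular winner is spread just as thinly.
    like_rigged = Fraction(1, N_ENTRANTS)
    evidence = prior_fair * like_fair + (1 - prior_fair) * like_rigged
    return prior_fair * like_fair / evidence


prior = Fraction(9, 10)  # hypothetical prior belief in a fair lottery
print(posterior_fair(prior))
```

Both likelihoods are tiny, but they are equally tiny, so they cancel and the posterior equals the prior. The result would only shift if we had some reason, before the draw, to single out John Smith.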
For more on this topic, you may be interested in this paper (by myself and Matt).
Especially for Cusp, I note the following (proof left for undergraduates):
(Convex h-index conjecture) For n chronologically distinct papers, each of which cites all previous papers, the corresponding h-index is the number of non-congruent diagonals in a regular polygon with number of sides 2 greater than n.
As a corollary, academics engaging in such cheeky behaviour may be indexed with the dimension of their corresponding polygon.
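For those who would rather check than prove, here is a quick numerical sanity check of the conjecture (not a proof). It assumes the usual definition of the h-index, and counts diagonals of a regular polygon by the number of sides they span, which is what determines their length.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def citation_counts(n):
    """Citations for n papers in chronological order, each citing all earlier ones."""
    # The i-th paper (0-based) is cited by every one of the n - 1 - i later papers.
    return [n - 1 - i for i in range(n)]


def noncongruent_diagonals(sides):
    """Distinct diagonal lengths in a regular polygon with the given number of sides."""
    # A diagonal can span 2, 3, ..., floor(sides / 2) sides.
    return sides // 2 - 1


for n in range(1, 11):
    print(n, h_index(citation_counts(n)), noncongruent_diagonals(n + 2))
```

The two columns agree for every n I have tried; both come out as n divided by 2, rounded down.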
From the Sydney Morning Herald:
Alcohol plays a role in 50 to 60 per cent of the nearly 300,000 criminal cases that come before the state’s Local Courts each year, [New South Wales] Chief Magistrate Graeme Henson said.
That’s about twice as high as I’d have guessed. I tried to track down the source of this statistic, but the closest I could find was a report called “Alcohol related crime for each NSW Local Government Area: Numbers, proportions, rates, trends and ratios” from the NSW Bureau of Crime Statistics and Research. The report gives the percentage of “incidents of non-domestic violence related assault recorded by NSW Police” that are alcohol related as 45%.
I’d love to know what that number is for the United Kingdom, as well as European countries like France or Germany, which seem to have an alcohol culture without having as much of a binge drinking culture. I’d expect that the percentage of alcohol related crime was lower for the UK and even lower for most of Europe. I’ll try to track those down.
As to what should be done about the problem, I have no idea. Perhaps nothing – it may be a correlation without causation. Perhaps it’s an alpha male thing: put too many young men in a nightclub with available women and testosterone will cause friction. The alcohol just happened to be there as well. On the other hand, the anecdotal evidence that certain people are more likely to “kick off after having a few” is well known.
For those of us who work with degree-of-plausibility (“Bayesian”) probabilities, two situations regularly arise. The first is the need to update probabilities to take into account new information. This is usually done using Bayes’ Rule, when the information comes in the form of a proposition that is known to be true. An example of such a proposition is “The data are 3.444, 7.634, 1.227”.
More generally, information is any justified constraint on our probabilities. For example, “P(x > 3) should be 0.75” is information. If our current probability distribution $q(x)$ doesn’t satisfy the constraint, then we had better change to a new distribution $p(x)$ that does. This doesn’t mean that any old $p$ will do – our $q$ contained hard-won information and we want to preserve that. To proceed, we choose the $p$ that is as close as possible to $q$, but satisfies the constraint. Various quite persuasive arguments (see here) suggest that the correct notion of closeness that we should maximise is the relative entropy:
$$H(p; q) = -\int p(x) \log\left[\frac{p(x)}{q(x)}\right] dx$$
With no constraints, the best possible $p$ is equal to $q$.
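As a small discrete illustration of the “P(x > 3) should be 0.75” example: when a constraint fixes only the total probability of a region, the maximum relative entropy solution is (as far as I know, this is a standard result) to rescale the prior within each region and leave its shape there untouched. The grid of x values and the prior below are made up for the sketch.

```python
import math


def relative_entropy(p, q):
    """H(p; q) = -sum p log(p / q) for discrete distributions on the same grid."""
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def update(q, xs, threshold=3.0, target=0.75):
    """Rescale q so that P(x > threshold) = target, keeping its shape in each region."""
    mass_above = sum(qi for x, qi in zip(xs, q) if x > threshold)
    return [qi * target / mass_above if x > threshold
            else qi * (1.0 - target) / (1.0 - mass_above)
            for x, qi in zip(xs, q)]


# Hypothetical prior on a six-point grid; it puts only 0.2 on x > 3.
xs = [0, 1, 2, 3, 4, 5]
q = [0.1, 0.2, 0.3, 0.2, 0.1, 0.1]
p = update(q, xs)

# Another distribution that also satisfies the constraint, for comparison.
other = [0.0625, 0.0625, 0.0625, 0.0625, 0.375, 0.375]

print(p)
print(relative_entropy(p, q), relative_entropy(other, q))
```

Both `p` and `other` satisfy the constraint, but `p` has the higher relative entropy because it throws away none of the structure that the prior had below and above the threshold.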
Another situation that arises often is the need to simplify complex problems. For example, we might have some probability distribution $q$ that is non-Gaussian, but for some reason we only want to use Gaussians for the rest of the calculation, perhaps for presentation or computational reasons. Which Gaussian should we choose to become our $p$? Many people recommend maximising the relative entropy $H(p; q)$ for this also: in the literature, this is known as a variational approximation, variational Bayes, or the Bogoliubov approximation (there are also variations (pun not intended) on this theme).
There are known problems with this technique. For instance, as David MacKay notes, the resulting probability distribution $p$ is usually narrower than the original $q$. This makes sense, since the variational approximation basically amounts to pretending you have information that you don’t actually have. This issue raises the question of whether there is something better that we could do.
I suggest that the correct functional to maximise in the case of approximating one distribution ($q$) by another ($p$) is actually the relative entropy, but with the two distributions reversed:
$$H(q; p) = -\int q(x) \log\left[\frac{q(x)}{p(x)}\right] dx$$
Why? Well, for one, it just works better in extreme examples I’ve concocted to magnify (a la Ed Jaynes) the differences between using $H(p; q)$ and $H(q; p)$. See the figure below:
If the blue distribution represented your actual state of knowledge, but out of necessity you could only use the red or the green distribution, which would you prefer? I find it very hard to imagine an argument that would make me choose the red distribution over the green. Another argument supporting the use of this ‘reversed’ entropy is that it is equivalent to generating a large number of samples from q, and then doing a maximum likelihood fit of p to these samples. I know maximum likelihood isn’t the best, most principled thing in the world, but in the limit of a large number of samples it’s pretty hard to argue with.
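The maximum likelihood argument is easy to try out. In the sketch below the original distribution q is a made-up bimodal mixture (equal parts N(-2, 1) and N(2, 1), so its exact mean is 0 and its exact variance is 5), and the approximating family is a single Gaussian, whose maximum likelihood fit is just the sample mean and variance.

```python
import random


def sample_q(rng):
    """One draw from a hypothetical bimodal q: equal mix of N(-2, 1) and N(2, 1)."""
    centre = -2.0 if rng.random() < 0.5 else 2.0
    return rng.gauss(centre, 1.0)


def ml_gaussian_fit(samples):
    """Maximum likelihood Gaussian fit: sample mean and (1/n) sample variance."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, var


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_q(rng) for _ in range(50_000)]
mu, var = ml_gaussian_fit(samples)
print(mu, var)  # should land close to the exact moments of q, 0 and 5
```

The fitted Gaussian is broad enough to cover both modes, rather than locking on to one of them and ignoring the other.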
A further example supporting the ‘reversed’ entropy is what happens if $q$ is zero at some points. According to the regular entropy, any distribution $p$ that is nonzero where $q$ is zero is infinitely bad. I don’t think that’s true, in the case of approximations – some leakage of probability to values we know are impossible is no catastrophe. This is manifestly different to the case where we have legitimate information – if $q$ is zero somewhere then of course we want $p$ to be zero there as well. If we’re updating probabilities, we’re trying to narrow down the possibilities, and resurrecting some is certainly unwarranted – but the goal in doing an approximation is different.
Maximising the reversed entropy also has some pretty neat properties. If the approximating distribution is a Gaussian, then the first and second moments should be chosen to match the moments of $q$. If the original distribution is over many variables, but you want to approximate it by a distribution where the variables are all independent, just take all of the marginal distributions and multiply them together, and there’s your optimal approximation.
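The product-of-marginals property can be checked on a toy example. The 2×2 joint distribution below is invented for the purpose, and the comparison is against just one alternative independent distribution, so this is a spot check rather than a demonstration of optimality.

```python
import math


def reversed_entropy(q, p):
    """H(q; p) = -sum q log(q / p) over a discrete joint table."""
    return -sum(qij * math.log(qij / pij)
                for q_row, p_row in zip(q, p)
                for qij, pij in zip(q_row, p_row) if qij > 0)


def independent(px, py):
    """Joint table for independent variables with the given marginals."""
    return [[a * b for b in py] for a in px]


def product_of_marginals(q):
    """Independent approximation built from the marginals of q."""
    px = [sum(row) for row in q]
    py = [sum(col) for col in zip(*q)]
    return independent(px, py)


# Hypothetical correlated joint distribution over two binary variables.
q = [[0.4, 0.1],
     [0.1, 0.4]]

best = product_of_marginals(q)                # every entry is 0.25
other = independent([0.6, 0.4], [0.5, 0.5])   # some other independent choice

print(reversed_entropy(q, best), reversed_entropy(q, other))
```

The product of the marginals scores higher than the alternative, as it should if it is the maximiser among independent distributions.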
If $H(p; q)$ isn’t the best thing to use for approximations, that means that something in the derivation of $H(p; q)$ applies to legitimate information but does not apply to approximations. Most of the axioms (coordinate independence, consistency for independent systems, etc) make sense, and both entropies discussed in this post satisfy those. It is only at the very end of the derivation that the reversed entropy is ruled out, and by some pretty esoteric arguments that I admit I don’t fully understand. I think the examples I’ve presented in this post are suggestive enough that there is room here for a proof that the reversed entropy is the thing to use for approximations. This means that maximum relative entropy is a little less than universal, but that’s okay – the optimal solutions to different problems are allowed to be different!
Yesterday I read a few of the recent papers of Francesco Sylos Labini, who has pursued a distinction between the common or garden type statistical homogeneity in the Universe that one reads about in textbooks, and a stronger form (‘super-homogeneity’) in which the mass fluctuations follow a behaviour that is sub-Poisson as a function of scale. This implies a sort of anti-correlation (a lattice of points is, for instance, sub-Poisson, as the points are deliberately avoiding one another) and has consequences for the form of the two-point correlation function:
$$\int \xi(r)\, d^3r = 0$$
that look remarkably similar to those imposed by the integral constraint, but which are, in fact, quite different: the super-homogeneity condition affects the actual correlation function, while the correction usually referred to as the integral constraint affects estimators of the correlation function. I started writing a summary document on this topic for the reference of myself and others.
After DARK’s infamous session, I hit a sweet spot in coding productivity and wrote a bunch of scripts to extract spatial features from galaxy images, along lines suggested to me a week or so ago by Andrew Zirm. These features are extracted from a matrix that encodes the frequency of adjacency between threshold intensity levels in the image. It’s the sort of thing best shown with pictures, which perhaps I can post once Andrew has decided which direction to pursue next.