It’s always useful to know a statistics junkie or two. Brendon is our resident Bayesian. Another colleague of mine from Zurich, Ewan Cameron, has recently started Another Astrostatistics Blog. It’s well worth a look.
I’m not a statistics expert, but I’ve had this rant in mind for a while. I’m currently at the “Feeding, Feedback, and Fireworks” conference on Hamilton Island (thanks Astropixie!). There has been some discussion of the problem of reification. In particular, Ray Norris warned that, once a phenomenon is named, we have put it in a box and it is difficult to think outside that box. For example, what was discovered in 1998 was the acceleration of the expansion of the universe. We often call it the discovery of dark energy, but this is perhaps a premature leap from observation to explanation – the acceleration could be caused by something other than some exotic new form of matter.
There is a broader message here, which I’ll motivate with this very interesting passage from Alfred North Whitehead’s book “Science and the Modern World” (1925):
In a sense, Plato and Pythagoras stand nearer to modern physical science than does Aristotle. The former two were mathematicians, whereas Aristotle was the son of a doctor, though of course he was not thereby ignorant of mathematics. The practical counsel to be derived from Pythagoras is to measure, and thus to express quality in terms of numerically determined quantity. But the biological sciences, then and till our own time, have been overwhelmingly classificatory. Accordingly, Aristotle by his Logic throws the emphasis on classification. The popularity of Aristotelian Logic retarded the advance of physical science throughout the Middle Ages. If only the schoolmen had measured instead of classifying, how much they might have learnt!
… Classification is necessary. But unless you can progress from classification to mathematics, your reasoning will not take you very far.
A similar idea is championed by the biologist and palaeontologist Stephen Jay Gould in the essay “Why We Should Not Name Human Races – A Biological View”, which can be found in his book “Ever Since Darwin” (highly recommended). Gould first makes the point that “species” is a good classification in the animal kingdom. It represents a clear division in nature: same species = able to breed fertile offspring. However, the temptation to further divide into subspecies – or races, when the species is humans – should be resisted, since it involves classification where we should be measuring. Species have a (mostly) continuous geographic variability, and so Gould asks:
Shall we artificially partition such a dynamic and continuous pattern into distinct units with formal names? Would it not be better to map this variation objectively without imposing upon it the subjective criteria for formal subdivision that any taxonomist must use in naming subspecies?
Gould gives the example of the English sparrow, introduced to North America in the 1850s. The plot below shows the distribution of the size of male sparrows – dark regions show larger sparrows. Gould notes:
The strong relationship between large size and cold winter climates is obvious. But would we have seen it so clearly if variation had been expressed instead by a set of formal Latin names artificially dividing the continuum?
Note, however, that measurement doesn’t tell the full story. Ultimately, we do want to classify nature. There are not just things in nature but kinds of things, and we would like to know what kinds there are. We don’t just want to measure the properties of a whole bunch of individual particles. We want to be able to conclude that there is a kind of particle called the electron, whose members are all identical to each other and distinct from other kinds of particles. Electrons, quarks, galaxies, stars, silicon, DNA, and magnetism are kinds of things.
So we don’t just want classification -> mathematics. We also want mathematics -> classification. We want the mathematics to reveal the correct (simplest? most natural?) classification of things in the world. We want to carve nature at the joints, as Plato said. Otherwise, science would just be a pile of numbers.
In particular, classification must be justified mathematically. We must look carefully before we leap from measurement to classification, before we conclude that we have separate types of things rather than just one population with variation. For example, if one plots the distribution of human heights, one sees one mountain, one bump [1]. On the basis of this plot alone, there is no justification for supposing that there are two types of humans, short ones and tall ones. There is simply a range of human heights. If you furrow your brow and try to decide which height separates tall people from short people, you are asking the wrong question. Don’t make a binary distinction where none is given to you. What nature hath joined together let none tear asunder.
Contrast this with the distribution of testosterone, below. If this were the only information you had about a population, you would quite naturally divide it into two populations. The data suggest that there are two distinct types of people, and that what determines testosterone levels in one is different to what determines them in the other. There is justification here for a classification.
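The contrast between one bump and two can be made roughly quantitative. The sketch below uses Sarle's bimodality coefficient, (skewness² + 1) / kurtosis, which sits near 1/3 for a Gaussian and is conventionally read as hinting at bimodality above 5/9. The "height" and "testosterone" samples are made-up illustrative numbers, not the data in the plots, and the coefficient is only a heuristic: a strongly skewed single population can also score high.

```python
import random
import statistics


def bimodality_coefficient(values):
    """Sarle's bimodality coefficient using simple (uncorrected) moment estimates."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis: about 3 for a Gaussian
    return (skewness ** 2 + 1) / kurtosis


rng = random.Random(0)

# Hypothetical one-bump sample: a single Gaussian population.
one_bump = [rng.gauss(170, 10) for _ in range(5000)]

# Hypothetical two-bump sample: two well-separated populations of equal size.
two_bumps = ([rng.gauss(1.5, 0.5) for _ in range(2500)]
             + [rng.gauss(20, 5) for _ in range(2500)])

THRESHOLD = 5 / 9  # conventional rule-of-thumb cutoff
bc_one = bimodality_coefficient(one_bump)
bc_two = bimodality_coefficient(two_bumps)
print(f"one bump:  {bc_one:.2f}")
print(f"two bumps: {bc_two:.2f}")
```

The point is not the particular statistic but that the case for two populations comes from the shape of the distribution, not from the mere fact that some values are high and some are low.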
The fundamental distinction underlying the testosterone bimodality is, of course, biological gender. This is a very bimodal, discrete distinction: XX vs XY. Even though there is a correlation between height and gender, the height distribution alone does not warrant such a classification. This point flows on to correlations between measurements. If one puts humans into a “short” bin and a “tall” bin, and then observes that the average 100m sprint time for the tall bin is significantly quicker than for the short bin, one should conclude that taller people are on average quicker than shorter people. One should not conclude that there are two types of people – tall, fast ones and short, slow ones.
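A toy simulation illustrates why the binned comparison proves so little. Here heights come from a single Gaussian population and sprint times depend smoothly on height; the slope, scatter and sample size are invented for illustration. Wherever the short/tall cut is placed, the tall bin comes out faster on average, so the gap between bins reflects a continuous trend rather than two kinds of people.

```python
import random
import statistics

rng = random.Random(1)

# One population with continuous variation (all numbers are assumptions).
heights = [rng.gauss(175, 8) for _ in range(4000)]  # cm
# Assumed toy relation: 0.02 s quicker per extra cm, plus random scatter.
times = [14.0 - 0.02 * (h - 175) + rng.gauss(0, 0.6) for h in heights]


def bin_gap(cut):
    """Mean sprint time of the 'short' bin minus that of the 'tall' bin."""
    short = [t for h, t in zip(heights, times) if h < cut]
    tall = [t for h, t in zip(heights, times) if h >= cut]
    return statistics.fmean(short) - statistics.fmean(tall)


# Any arbitrary cut produces a "significant" gap between the bins.
gaps = {cut: bin_gap(cut) for cut in (165, 170, 175, 180, 185)}
for cut, gap in gaps.items():
    print(f"cut at {cut} cm: tall bin quicker by {gap:.2f} s")

# Measuring instead of classifying: least-squares slope of time against height.
mean_h = statistics.fmean(heights)
mean_t = statistics.fmean(times)
slope = (sum((h - mean_h) * (t - mean_t) for h, t in zip(heights, times))
         / sum((h - mean_h) ** 2 for h in heights))
print(f"fitted slope: {slope:.3f} s per cm")
```

The fitted slope recovers the underlying relation directly, with no dividing line needed.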
This is important for theorists. We want to know the causes of the variation in height and testosterone levels. Bimodality suggests that we should go looking for two different processes (or causal histories, or populations), rather than just one stochastic process.
I see this point in astronomy. There is a temptation for observers, faced as they are with either too much data or too little, to analyse data by classification. This is sometimes not a bad idea, but one must be careful about the interpretation of the results. The reason why we think that there are two types of galaxies – red ones and blue ones – is not simply that there are red ones and there are blue ones and that we can correlate redness and blueness with other properties like morphology and star-formation rate. The reason is that there are few green ones. It’s the dip in the middle that suggests that there are two types of galaxies.
The plot above shows the distribution of galaxies in terms of the total mass of stars and the colour of the galaxy (red at the top, blue at the bottom). The darker the shading, the more galaxies have that particular combination of stellar mass and colour. If it were a topographic map, one would see two peaks in the mountain range.
Even in this case, one shouldn’t be too hasty in drawing a dividing line through the middle of the dip. It’s a ridge, not a trench. Don’t classify first, analyse second. There are better ways – mixture models, for example – to model the two populations. (Carry on, Ned.) This is especially true if your conclusions are sensitive to the overlap of the populations.
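As a rough illustration of the mixture-model alternative, the sketch below fits two overlapping Gaussians to a one-dimensional colour distribution using a plain expectation-maximisation loop. The "blue" and "red" colour values are made up, and the fitting routine is a minimal textbook version, not the method used for the plot above. Instead of a hard cut, each galaxy gets a probability of belonging to the red population.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def fit_two_gaussians(data, steps=100):
    """Fit a two-component 1D Gaussian mixture with a basic EM loop."""
    ordered = sorted(data)
    n = len(ordered)
    # Crude starting guess: means at the quartiles, one shared broad width.
    means = [ordered[n // 4], ordered[3 * n // 4]]
    sds = [(ordered[-1] - ordered[0]) / 4] * 2
    weights = [0.5, 0.5]
    for _ in range(steps):
        # E-step: probability that each point belongs to each component.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate each component from its weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)  # guard against a collapsed width
    return weights, means, sds


def red_probability(x, weights, means, sds):
    """Posterior probability that colour x belongs to the redder component."""
    red = 0 if means[0] > means[1] else 1
    blue = 1 - red
    p_red = weights[red] * normal_pdf(x, means[red], sds[red])
    p_blue = weights[blue] * normal_pdf(x, means[blue], sds[blue])
    return p_red / (p_red + p_blue)


rng = random.Random(2)
# Hypothetical colours: a broad blue population and a narrower red one.
colours = ([rng.gauss(1.4, 0.3) for _ in range(1200)]
           + [rng.gauss(2.3, 0.2) for _ in range(800)])

weights, means, sds = fit_two_gaussians(colours)
for colour in (1.0, 1.9, 2.6):
    p = red_probability(colour, weights, means, sds)
    print(f"colour {colour}: P(red) = {p:.2f}")
```

A galaxy sitting in the dip gets an intermediate probability, which is exactly the information a dividing line throws away and which matters most when conclusions depend on the overlap.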
1. There is a caveat here. If one is convinced that a population should have a certain type of distribution, e.g. Gaussian, then one can conclude that a single population doesn’t fit the data and that two or more populations are required. One can then conclude that there are two populations, even in the absence of a strong bimodality in the raw data. The question, of course, is how strongly one believes that the population has the particular type of distribution.