[Ed.: I started writing this and it turned into gargantupost, so I’ve broken it into chunks. This is Chunk the First. The others will appear whenever I’m short of ideas.]
So – some languages are more difficult to learn than others. Not intrinsically, mind you – the difficulty is the result of not being familiar with the language family (Teutonic for, say, German or Romance for, say, French). The U.S. State Department has a system for categorising the difficulty of a new language for a native English speaker. It goes:
- French, German, Indonesian, Italian, Portuguese, Romanian, Spanish, Swahili
- Bulgarian, Burmese, Greek, Hindi, Persian, Urdu
- Amharic, Cambodian, Czech, Finnish, Hebrew, Hungarian, Lao, Polish, Russian, Serbo-Croatian, Thai, Turkish, Vietnamese
- Arabic, Chinese, Japanese, Korean
The direction of the ordering is evident as soon as one picks up a book in French and a book in Japanese, and decides which is easier; I have no idea how the classification was arrived at. In primary schools in Australia during the late 1980s, there was a rush to study Japanese to prepare for the nation’s impending engagement with Asia, but after the Keating government was foreclosed the wedding was called off; I think everyone learns French and Indonesian now. The positive: we learnt about a culture well outside the continental European orthodoxy of languages taught throughout the 20th century. The negative: after studying Japanese throughout primary school, high school and a semester of University, I don’t speak a foreign language.
(Prof. Friedman says: “Try talking French with someone who studied it in public school”)
This is a bit unacceptable – being monolingual in Europe is like not being toilet-trained. Seriously. Britons are pretty bad offenders here too, so any Australians seeking asylum in the U.K. are sheltered from having to face their inadequacy fully, but this has the downside of allowing ignorance to fester. I worry that whenever I travel to the mainland the Dutch air stewards are making abominably witty multilingual puns at my expense.
So just how distinct are all these languages? I’ve never tried to learn French seriously, though I have read half of a Teach Yourself Esperanto book and would be prepared to say that, as a Romance language creole, it was less taxing than Japanese. The crude divisions of language above have the order property, but no scale – so let’s try to make up a metric. To do this, let us break languages down into components of vocabulary, syntax and pronunciation. This isn’t perfect, but the three do give a fairly full description of most languages in the world. We can now tackle them separately.
Vocabulary
We’ll consider only the written language here. Say we compile two lists, A and B, of the majority of commonly used words in our two languages – n entries per list. This could be achieved by recording the voices of everyone on the planet for a day or so – I leave it to Google to develop this technology. Now compile two co-lists of words, call them A’ and B’, being the translation of A into the language of B and vice versa. When comparing, say, English and Italian, which both use the Latin alphabet, this is easy; moderately challenging would be, say, English and Vietnamese, which can nonetheless be handled using the extended Latin alphabet; English and Chinese, where the latter uses a logographic writing system, requires some sleight-of-hand – one possibility is to use a system of romanisation, and make sure our metric gives no weight to this. I will also introduce two other ‘languages’, to function as extreme cases: i) a written language made by the simple permutation a>b, b>c, … , z>a (or ž>a in the extended Latin alphabet); and ii) a language where words are composed of random letters from the original alphabet, obeying no cypher whatsoever.
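For the curious, the two faux languages are easy to manufacture. What follows is only a rough sketch: the function names are mine, it sticks to the basic 26-letter alphabet, and it assumes a random "translation" keeps the length of the original word, which the description above doesn't actually require.

```python
import random
import string

ALPHABET = string.ascii_lowercase  # assumption: basic Latin alphabet only

def permute_word(word, shift=1):
    """Faux language (i): a>b, b>c, ..., z>a. Non-letters pass through."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + shift) % len(ALPHABET)]
        if ch in ALPHABET else ch
        for ch in word
    )

def random_word(word, rng):
    """Faux language (ii): random letters, same length as the original
    (the equal length is an assumption made for this sketch)."""
    return "".join(rng.choice(ALPHABET) for _ in word)

rng = random.Random(0)  # fixed seed so the random 'dictionary' is repeatable
words = ["dog", "bat", "zebra"]
permuted = [permute_word(w) for w in words]    # ['eph', 'cbu', 'afcsb']
scrambled = [random_word(w, rng) for w in words]
```

One rule (the shift) regenerates the whole permuted list, whereas the scrambled list can only be stored word by word.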
This already gives us excellent discriminating power, though we have to do a bit of work to see how. If we place our faux languages in the classification above, the permutation would be a ‘0’, and the random switch a (near) ‘infinity’, i.e. I can explain how to read the permutation language using one rule (there can be no simpler case for non-identical languages), but for the random script I need one rule per word (which is pathological). Real languages lie somewhere between these. Now take your lists A and A’ and, for each pair, calculate the numerical difference between the words, proceeding on a letter-by-letter basis using the minimum possible distance through the alphabet in modulo-26 (or -112 in extended Latin) arithmetic. For example, the words dog and bat have a difference of 2 + 12 + 13 = 27. If the words are of differing length, compare only the number of letters in the shorter word, sliding the longer word along to give the minimum possible difference. Of course, we want the difference not to depend on word length, so divide the result for each pair by the number of letters compared.
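The recipe is fiddly enough in words that a sketch may help. This is one possible reading of it, restricted to the basic 26-letter alphabet; `letter_distance` and `word_distance` are names I've made up for the purpose.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: modulo-26 case only

def letter_distance(a, b):
    """Shortest way round the alphabet between two letters."""
    d = abs(ALPHABET.index(a) - ALPHABET.index(b))
    return min(d, len(ALPHABET) - d)

def word_distance(w1, w2):
    """Per-letter difference between two words.

    The shorter word is slid along the longer one, the smallest total
    is kept, and the result is divided by the number of letters compared.
    """
    short, long_ = sorted((w1, w2), key=len)
    if not short:
        raise ValueError("cannot compare an empty word")
    best = min(
        sum(letter_distance(a, b) for a, b in zip(short, long_[offset:]))
        for offset in range(len(long_) - len(short) + 1)
    )
    return best / len(short)

print(word_distance("dog", "bat"))       # (2 + 12 + 13) / 3 = 9.0
print(word_distance("group", "groupe"))  # 0.0, thanks to the sliding rule
```

Note that the sliding rule makes any word identical to its own prefix or suffix, which is generous to languages that mostly add endings.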
So our two big lists of words have been turned into a single list of n numbers. Now what? There are two important quantities – the mean of the n numbers, and their variance. I claim that the sum, call it X, of these two quantities is a good measure of the difference in vocabulary between the languages. For evidence, I’m afraid I can only demonstrate that it works really well (i.e. my metric is ad hoc, but you knew that already, right?):
- Consider English and English. The numerical differences are always zero, so their mean and variance are zero too. X is 0, just as it should be for any metric.
- Consider English and the simple permutation. The numerical differences will be a constant throughout the list, equal, in fact, to 1 – this is all we have shifted the letters by. So the mean is 1, but the variance is 0 – so X is now 1.
- Consider English and French. There are many words with a common root, so the numerical difference for these words will be small, in many cases less than 1 (e.g. ‘example’ and ‘exemple’, which score 4/7, though a transposition such as ‘chamber’ and ‘chambre’ scores a less flattering 26/7), but many that are quite different. I won’t try to estimate the values of the numbers, but the gist of it would be that the mean is smallish, though perhaps more than 1, while the variance is greater than zero.
- Consider English and romanised Japanese. Now the mean is substantial, but the words still retain some structural relationship (e.g. ‘yunyuu suru’, to import, and ‘yushutsu suru’, to export), so the variance is kept in check.
- Consider English and the random script. Again the mean is substantial, though not necessarily much larger than for a real language, and certainly not as large as it can be (which is the length of the alphabet divided by two), but the variance is maximised. I don’t think it’s possible to make a more different language – if we start to try to maximise the mean we bring the variance down.
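The first two cases in the list can be checked mechanically. The sketch below assumes the population variance is the one intended (the text doesn't say which), uses a toy word list of my own, and repeats a simplified word difference for equal-length words so that it runs by itself.

```python
import string
from statistics import mean, pvariance

ALPHABET = string.ascii_lowercase  # assumption: basic Latin alphabet only

def letter_distance(a, b):
    d = abs(ALPHABET.index(a) - ALPHABET.index(b))
    return min(d, len(ALPHABET) - d)

def word_distance(w1, w2):
    # Simplified: equal-length words only, so no sliding is needed here.
    return sum(letter_distance(a, b) for a, b in zip(w1, w2)) / len(w1)

def vocabulary_distance(pairs):
    """X = mean + variance of the per-pair differences (population variance assumed)."""
    diffs = [word_distance(a, b) for a, b in pairs]
    return mean(diffs) + pvariance(diffs)

def shift_word(word):
    """The simple permutation a>b, b>c, ..., z>a."""
    return "".join(ALPHABET[(ALPHABET.index(c) + 1) % 26] for c in word)

english = ["dog", "bat", "house", "water", "zebra"]  # toy list, not real data
shifted = [shift_word(w) for w in english]

x_identity = vocabulary_distance(list(zip(english, english)))  # 0.0
x_shifted = vocabulary_distance(list(zip(english, shifted)))   # 1.0
```

The real-language cases need real word lists, which I don't have, so I make no claims about their numbers here.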
One final example is worth considering – French and the permuted English script. The mean is now one larger than for the difference between French and English, and the variance is unchanged. A moment’s thought (not more) shows that we’ve satisfied the triangle inequality, and, because the choice of which language goes in which list is symmetric, we’ve satisfied all the requirements for a metric. Victory!
And now, some qualifications. While the choice of calculating the numerical difference along the alphabet works well on average, it’s completely absurd. The alphabet isn’t ordered so that ‘similar’ letters are grouped together – in other words, the difference between ‘l’ and ‘d’ shouldn’t really be larger than that between ‘l’ and ‘n’. One could use a binary system, that checked only whether the letters were exactly the same or not, or a compromise where letters acknowledged to be similar (such as soft mutation pairs) count as the same, and others do not. If the list of words is sufficiently large, though, this shouldn’t be decisive.
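A sketch of the binary alternative, with the compromise bolted on as an optional list of letter groups to be treated as the same. The groups shown are illustrative only (p/b, t/d and c/g, borrowed from Welsh soft mutation), and the function names are my own.

```python
def binary_letter_distance(a, b, similar=()):
    """0 if the letters match or share a 'similar' group, otherwise 1."""
    if a == b or any(a in group and b in group for group in similar):
        return 0
    return 1

def binary_word_distance(w1, w2, similar=()):
    """Fraction of mismatched letters, sliding the shorter word along the longer."""
    short, long_ = sorted((w1, w2), key=len)
    if not short:
        raise ValueError("cannot compare an empty word")
    best = min(
        sum(binary_letter_distance(a, b, similar)
            for a, b in zip(short, long_[offset:]))
        for offset in range(len(long_) - len(short) + 1)
    )
    return best / len(short)

# Illustrative groups only; a serious list would need a linguist.
SIMILAR = [set("pb"), set("td"), set("cg")]

strict = binary_word_distance("pat", "bad")            # 2 of 3 letters differ
lenient = binary_word_distance("pat", "bad", SIMILAR)  # 0.0
```

Under this scheme every per-pair difference lies between 0 and 1, so the simple permutation now scores a mean of 1 rather than sitting near the bottom of the scale.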
Also, the symmetry property of the metric relies on there being a unique translation for each word in a language. Practice shows that this isn’t the case. Nonetheless, there’s no reason not to simply agree on a set of unique pairs and study these pared-down languages, and to treat words for which no simple translation exists (like ‘zeitgeist’) as loan words.
Phew! That was quite a jaunt… but there’s still syntax and pronunciation to come. Stay tuned!
An intriguing idea … I have almost no experience with learning other languages, but I was reminded of the following sentence:
“Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.”
This suggests that the metric should reward the preservation of the first and last letters. However, this article has some useful comments about the above sentence.
In the end, you’ve set yourself a challenging task and made good progress toward a solution. I was taught Indonesian in year 7. I remember that the Indonesian word for “book” is buku – same starting letter, similar sound etc. However, I just as easily remember that Indonesian for “highlighter” is “stabilo” (i.e. after the brand name). It is easy to remember, even though the words have effectively no similarities, because of the association of an object and the name printed on the side. Examples like this, where words are connected by ideas rather than character similarities, would be almost impossible to quantify. It is likely, however, that they are rare enough in most languages to be ignored.