Tuesday 28 May 2013

ORIGINAL PAPER: A high-quality replication of Galton’s study one century later: Wilkinson & Allison (1989)

Michael A. Woodley, Jan te Nijenhuis & Raegan Murphy

In Woodley, te Nijenhuis, and Murphy (2013, in press) we argue that intelligence has declined substantially since Victorian times, based on a meta-analysis of simple reaction time. An exchange of ideas started at several blogs. We hereby reply to the blogposts of Scott Alexander and HBD Chick, reacting to an earlier post made by us.

A paper has come to our attention that provides strong evidence against the supposed representativeness problem across cohorts (e.g. Alexander, 2013). The study in question is that of Wilkinson and Allison (1989) using a sample of 5,324 visitors to the London Science Museum, which is situated at the exact site of Galton’s 19th century Anthropometric Laboratory in South Kensington.  All visitors undertook psychophysical testing on a simple reaction time-measuring apparatus, just as the people in Galton’s study did. Of these mixed-sex participants 1,189 were aged between 20 and 29, and are thus highly similar to the age range employed in our own study. Their simple RT mean was substantially slower than the weighted 1889 RT mean (245 ms vs. 194.06 ms), and furthermore the mean of this sample falls very close to the meta-regression-estimated mean across studies for the late 1980s (approximately 250 ms, see: Figure 1 in Woodley, te Nijenhuis & Murphy, 2013). The remarkable features of this study are the ways in which it replicates virtually every significant demographic aspect of Galton’s study.
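
As a quick arithmetic check on the figures quoted above, the short Python sketch below computes the gap between the two means and the distance from the meta-regression estimate. The variable names are our own, and the 250 ms value is only the approximate figure read from the text, so the last number is indicative at best.

```python
# Means quoted in the paragraph above (milliseconds).
galton_mean_ms = 194.06             # weighted 1889 simple RT mean
wilkinson_allison_mean_ms = 245.0   # 20-29 year olds, Wilkinson & Allison (1989)
meta_regression_estimate_ms = 250.0 # approximate late-1980s estimate (assumed exact here)

# Absolute and relative slowing between the two samples.
difference_ms = wilkinson_allison_mean_ms - galton_mean_ms
relative_increase = difference_ms / galton_mean_ms

# Distance of the 1989 sample from the meta-regression estimate.
gap_to_estimate_ms = meta_regression_estimate_ms - wilkinson_allison_mean_ms

print(f"{difference_ms:.2f} ms slower ({relative_increase:.1%})")
print(f"{gap_to_estimate_ms:.0f} ms below the meta-regression estimate")
```

On these inputs the 1989 sample is about 51 ms (roughly 26%) slower than the 1889 mean and about 5 ms below the approximate regression estimate.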

There is the issue of a participation fee. Galton is known to have requested a participation fee of 3 pennies (approximately £5 in modern UK currency). The London Science Museum required the payment of an admissions fee right up until December 2001. Furthermore it still requires the payment of fees of £6 to £10 for access to some special exhibitions (London Science Museum, 2013a). The Wilkinson and Allison (1989) study was in fact conducted as part of a special exhibition entitled Medicines for Man, which was hosted by the Museum from the early 1980s (Medicines for Man Organizing Committee, 1980). Therefore participation fees were employed in the case of both studies.

There is strong evidence for the demographic convergence between the two studies. Johnson et al. (1985) indicate that whilst Galton’s sample included persons from all occupational and socioeconomic groups in Victorian London, it was nonetheless skewed towards students and professionals, and both groups could fairly be described as solidly White and middle class. In the last decades of the 20th century, museum attendance in the UK exhibited precisely the same skew in terms of sociodemography. Eckstein and Feist (1992) for example noted that most UK museum visitors are drawn from White and upper-middle-class populations. Furthermore Hooper-Greenhill (1994) observed that the largest minority ethnic groups in the UK (i.e. Asians and Afro-Caribbeans) are underrepresented amongst museum visitors. In acknowledging this issue, a House of Commons report in 2002 stated that free admission to museums would be unlikely to ‘… be effective in attracting significant numbers of new visitors from the widest range of socio-economic and ethnic groups’ (House of Commons report, 2002, p. 23).

The presence of this self-selection amongst visitors strongly harmonizes the studies of Galton and Wilkinson and Allison. Add to this the fact that participation fees were employed in both cases, the fact that the geographical locations were exactly the same and finally the fact that the age demographic of interest (i.e. twenty-somethings) was intensively sampled in both cases (i.e. 3,410 in the case of Silverman’s subset of Galton’s sample and 1,189 in the case of Wilkinson and Allison). The net result is that the studies become even more strongly convergent in terms of comparing like with like. Thus the argument of more heterogeneous samples visiting museums in the 1980s compared to more restricted samples visiting museums in the 1880s is critically weakened. The principal objections that can be leveled against this are as follows.

Firstly there is the issue of tourism. Most tourists to the UK are from the US and Europe (Tourism 3B), meaning that they are likely to be both ethnically and socioeconomically matched to the majority of the participants in this study (i.e. UK citizens). In fact, international arrivals in the United Kingdom in 1990 show that of the 439 million inbound tourists, 60% were European in origin and 21% emanated from the Americas. Hence, 81% of the tourist population came from groups which are highly ethnically similar to the British. Only 12% came from Asia and the Pacific with a meager 3% coming from the Middle East and 2% from Africa (Tourism 3B). In sum, it is unlikely that tourists being tested in the 1989 study were substantially ethnically different from the typical UK museum visitor. Based on current statistics from the Science Museum, the preponderance of visitors hail from the UK (69%) and the preponderance of those are from Greater London (44%; London Science Museum, 2013b). Historically, especially prior to the 1990s this figure would have been much higher, owing to far lower levels of tourism to the UK (in 1990 international tourism levels were less than half the current levels,  >940 million per year, BBC, 2013). This means that in all likelihood well over 70% of the participants in Wilkinson and Allison’s study would have been British, and the overwhelming majority of these would have been White, upper middle-class and from London. The overwhelming majority of the international visitors would have been ethnically and broadly socioeconomically matched to the British visitors.

Secondly, there is the issue of instrumentation. Galton utilized a pendulum chronoscope with a temporal resolution of around a centisecond (i.e. 1/100th of a second, or 0.01 seconds). The electronic apparatus employed by Wilkinson and Allison in all likelihood had a higher resolution (post-1908 chronoscopy at least had the potential to be accurate to a single millisecond; Haupt, 2001). However, a resolution of only a centisecond in Galton’s apparatus cannot account for the substantial discrepancy between these two studies, because the difference between the two means (roughly 51 ms) is about five times that resolution.

Thirdly, Galton’s data comprise a single trial per person, whereas Wilkinson and Allison’s study employed two practice trials followed by 10 trials per person for the purposes of averaging. This protocol would almost certainly have enhanced the reliability of Wilkinson and Allison’s data relative to Galton’s (Jensen, 1980); however, in both cases we are dealing with aggregates. Strong biases (i.e. jumping the gun vs. slow to start) have the potential to cancel each other out when employing these sorts of very large datasets, as these sources of error are distributed in a Gaussian fashion. This means that aggregate-level comparisons of means are appropriate even between datasets with different coefficients of reliability, provided that the Ns are very large.
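
The cancellation argument can be illustrated with a small simulation. The Python sketch below is only an illustration under stated assumptions: the true mean, the between-person SD and the trial-to-trial noise SD are hypothetical values chosen for the example, and the error is assumed to be zero-mean and Gaussian, as the paragraph above supposes. Only the two sample sizes (3,410 and 1,189) come from the studies. It addresses random error only and says nothing about any systematic offset in an apparatus.

```python
import random
import statistics


def simulate_sample_mean(n_people, true_mean_ms, between_sd_ms,
                         trial_noise_sd_ms, trials_per_person, rng):
    """Mean RT of a sample in which each person's score is the
    average of `trials_per_person` noisy trials."""
    scores = []
    for _ in range(n_people):
        # Each person has their own underlying RT.
        person_rt = rng.gauss(true_mean_ms, between_sd_ms)
        # Each trial adds zero-mean Gaussian error (assumption).
        trials = [rng.gauss(person_rt, trial_noise_sd_ms)
                  for _ in range(trials_per_person)]
        scores.append(statistics.fmean(trials))
    return statistics.fmean(scores)


rng = random.Random(0)    # fixed seed so the run is repeatable
TRUE_MEAN_MS = 200.0      # hypothetical population mean
BETWEEN_SD_MS = 30.0      # hypothetical between-person SD
TRIAL_NOISE_SD_MS = 40.0  # hypothetical trial-to-trial noise SD

# One trial per person, N = 3,410 (Galton-style protocol).
single_trial_mean = simulate_sample_mean(
    3410, TRUE_MEAN_MS, BETWEEN_SD_MS, TRIAL_NOISE_SD_MS, 1, rng)

# Ten averaged trials per person, N = 1,189 (Wilkinson & Allison-style).
ten_trial_mean = simulate_sample_mean(
    1189, TRUE_MEAN_MS, BETWEEN_SD_MS, TRIAL_NOISE_SD_MS, 10, rng)

print(f"single-trial sample mean: {single_trial_mean:.1f} ms")
print(f"ten-trial sample mean:    {ten_trial_mean:.1f} ms")
```

Under these assumptions the standard error of either sample mean is below 1 ms, so both protocols recover the population mean closely even though individual single-trial scores are far less reliable than ten-trial averages.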

On this basis Wilkinson and Allison’s (1989) study must be considered an excellent replication of Galton’s study. Its mean reaction time for the relevant age cohort is almost precisely where our meta-regression predicts it should be. This is clearly strong supporting evidence for the robustness of the increase in simple RT latency produced to date and so puts even more nails in the coffin of those who argue that the trend can be accounted for by lack of representativeness across cohorts.

Alexander, S. S. (2013). The wisdom of the ancients. Slate Star Codex. URL: http://slatestarcodex.com/2013/05/22/the-wisdom-of-the-ancients/ [retrieved on 24/05/13]
BBC. (2013). GCSE Bitesize. Geography tourism trends. http://www.bbc.co.uk/schools/gcsebitesize/geography/tourism/tourism_trends_rev1.shtml
Eckstein, J. & Feist, A. (1992). Cultural Trends, 1991. London, Policy Studies Institute.
Haupt, E. J. (2001). Laboratories for experimental psychology: Göttingen’s ascendancy over Leipzig in the 1890s. In: Rieber, R. W., & Robinson, D. K. (Eds.), Wilhelm Wundt in History. The Making of a Scientific Psychology. (pp. 205-250). New York: Kluwer Academic.
Hooper-Greenhill, E. (1994). Museums and their Visitors. London, Routledge.  
House of Commons, Culture, Media and Sport Committee (2002). National Museums and Galleries: Funding and free admission. House of Commons, United Kingdom.
Jensen, A. R. (1980). Bias in Mental Testing. New York: Free Press.
Johnson, R. C., McClearn, G., Yuen, S., Nagoshi, C. T., Ahern, F. M., & Cole, R. E. (1985). Galton's data a century later. American Psychologist, 40, 875–892.
Medicines for Man Organizing Committee. (1980). Medicines for Man: A Booklet Based on an Exhibition at the Science Museum about Medicines - how They are Discovered and how They Work, how They are Made and Tested, how They are Prescribed and Dispensed, and how Laws Control Their Use. London, Science Museum.
No author (no date). Tourism 3 SB. Oxford University Press.
London Science Museum. (2013a). http://www.sciencemuseum.org.uk/visitmuseum/prices.aspx [retrieved on 27/05/2013]
London Science Museum. (2013b). http://www.sciencemuseum.org.uk/about_us/history/facts_and_figures.aspx [retrieved on 27/05/2013]
Wilkinson, R. T., & Allison, S. (1989). Age and simple reaction time: Decade differences for 5,324 subjects. Journal of Gerontology, 44, 29–35.
Woodley, M. A., te Nijenhuis, J., & Murphy, R. (2013). Were the Victorians cleverer than us? The decline in general intelligence estimated from a meta-analysis of the slowing of simple reaction time. Intelligence. doi:10.1016/j.intell.2013.04.006


  1. Nice refutation of the self-selection bias explanation; I think this proves that it is not the explanation. On the other hand, I have a hard time believing in a genetic explanation for such a large effect: there are too few generations to explain such a huge increase in RT purely by relaxation of selective pressure on IQ and/or RT.

    So imho there should be another effect at play, probably bigger than the possible dysgenic effect.

    What about:
    -the experimental device. I saw the comment on accuracy, and find it convincing. But how is the zero RT reference set for the pendulum? For an electronic timer, there is no transmission delay, so what is reported as RT is visual signal appearance time (non-negligible for filament lamps, for example) + RT + button depression time.
    For a mechanical pendulum device, one can suspect more inertia, which would invite a careful experimenter to calibrate the device to remove the delay. The calibration step could be done by physically linking the visual cue trigger with the RT trigger and measuring the pendulum fall, if any. This time would be the inherent delay, which could be subtracted from the measurement to compute the actual RT. If this was done, it could lead to a systematic bias when no such procedure is done for the electronic timer. I would be extremely careful about this, especially as Victorian RT seems very low. A simple Android app allowed me to test my personal RT, and after this experience I have trouble believing average people could consistently reach below 0.2 s RT...without the calibration effect...

    Non-genetic environmental effects:
    -general size/height: people have grown a lot since the Victorian era, maybe 1 SD, which is similar to the RT change...coincidence?
    - germs/parasites, in particular toxoplasmosis, which was shown to increase RT very significantly. Toxoplasmosis may be much more prevalent in modern England than in Victorian England; after all, it varies a lot depending on the country, so...
    -other environmental effects

  2. The commenter is on the right path. The instrument invented by Galton and built for him seems to have serious calibration problems, including the elastic thread (it varies with temperature, etc., and we don't know what it was made of, and so on).

  3. Silverman (2010) is correct with "first suggestion is that an ambient, population-wide increase in neurotoxic load..." Numerous studies show that radiation affects the autonomic system. The present generation, with air conditioners in their windows, iPhones on their ears, and ultrasound on their fetuses, should continue to drive reaction times to an all time low.

  4. Addenda to my prior message:

    The paraphrase of Silverman (2010) is from