Tuesday, 22 October 2013

On best understanding Nisbett and co.


Occasionally, curiosity leads one into byways, and thus disinters shards of memory lost in the grass of our neglect, letting ghosts speak. Here is the chain: a few days ago Greg Cochran posted a link to a 2007 opinion piece by Eric Turkheimer in which the latter said that “The important questions about the role of genetics in the explanation of racial differences in ability are not empirical, but theoretical and philosophical”. This suggested that neither a positive nor a negative empirical result would have an important effect. On the contrary, if no genetic investigation can explain intellectual differences between racial groups then that destroys the genetic hypothesis. It will be wrong, end of story. I also noted that his essay was written in 2007 so it was possible he had changed his mind.

Later that evening I realised I ought to check up on this possibility, and looked at Turkheimer’s personal website, on which he lists a recent publication in the September 2012 American Psychologist (DOI: 10.1037/a0029772): Nisbett, Aronson, Blair, Dickens, Flynn, Halpern, Turkheimer, “Group Differences in IQ Are Best Understood as Environmental in Origin”.

It is a minor gripe, but “best understood” strikes me as a curious phrase. What is wrong with “Group differences are environmental in origin”? That might be true, or partly true, but at least determining it depends on fact. Even “An environmental explanation is the best fit with all the data” would have been clearer.

Anyway, there is much to comment on in this paper, which is written by distinguished authors in the field, but my attention was drawn to one line of text: “Gains in sub-Saharan African countries of 0.50 to 0.70 SD in response to a few months of Western-style education have been reported for heavily g-loaded fluid intelligence tests (McFie, 1961).” This dramatic claim rests on what is by far the oldest reference in the paper. Apart from one other reference from 1993, all the rest are from the present century. There is nothing wrong with a reference being old, but I decided to look it up. Here is the abstract:

SUMMARY Twenty-six African boys entering technical school were given a series of intellectual tests involving verbal, numerical, pictorial and constructional material, and also an 'abstraction' test (Weigl) and 'memory for designs' (Terman-Merrill). None showed qualitative differences from English subjects in their performance on verbal, numerical and abstraction tests. On the non-language tests their performance was slower than would be expected of English subjects, and they showed differences in their inaccurate orientation of drawn and constructed designs. At the conclusion of two years' technical training, the subjects were retested on the same material. Significant changes were increases in scores on the non-language tests. These were associated with increased speed and accuracy of orientation, and also apparently with a more 'synthetic' approach toward visual material. It is suggested that an ability which may be poorly developed under these cultural conditions, and which may be increased by appropriate educational methods, is that of perceiving visual material as a whole (or 'Gestalt' perception).

From this abstract we see: 1) the sample size is small; 2) the study was carried out in one country; 3) the sample is specific, in that it is students entering a technical school who may well be above the local norm; and 4) the technical training lasted two years (two years and about four months according to the full paper), not “a few months”.

When one reads the paper, it is apparent that the boys are between 16 and 19 and are entering what in Uganda would be tertiary training prior to taking up technical occupations like carpentry, motor maintenance and machine fitting. Retesting took place after 2 years and about 4 months. The 7 tests were in fact McFie’s adaption of standard tests, including putting in some new material and utilising some new scoring methods, and significantly increasing the time taken to complete items. This makes interpretation difficult. One can talk about change after education, but not locate the result properly in conventional intelligence testing. The paper is an indication of an effect, no more, carried out by a thoughtful researcher exploring some possible effects. Despite the very small sample, he includes some factor analytic results. These will not be stable when n=26 and tests=7, but they are a welcome addition as a statement of intent for later work.

Here are his descriptions of the tests:

The main group of tests corresponded to subtests of Wechsler's (1944) scale, with questions omitted which were obviously related to European cultural experience, and some added from other scales (e.g., the Terman-Merrill 1937):

A. Comprehension: Six questions; maximum two points each.

B. Similarities: Seven pairs; maximum two points each.

C. Arithmetic: Six questions, one point each; two questions, two points each.

D. Picture Description: Four photographs of scenes from African life; 1 point for enumeration or description, 2 for synthetic interpretation.

E. Picture Arrangement: One example and four test stories, taken from a popular African cartoon strip; maximum three points each.

F. Block Designs: Wechsler's designs, 1-6, with time limits extended to 2' for designs 1-3, and 3' for designs 4-6.

G. Memory for Designs: As Terman-Merrill IX, 3, but scoring two, four or six points according to quality of reproduction of each design.

H. Weigl's Sorting Test: As described by Weigl in 1927; pass or fail recorded according to ability to sort both ways (cf. McFie and Piercy, 1952).

Criticisms are often made about the testing of intelligence in Africa. McFie was clearly giving his subjects every chance to score well by making the material culturally appropriate, which is good. He severely cut down the number of test items, which reduces reliability and range significantly; he made his own decisions on cultural bias which may or may not have been correct; and he extended the time limits, which damages the interpretation of results. The better procedure would have been to have recorded times up to some generous limit, and then shown how the results were affected by using the generous as opposed to the official time limits. McFie also changed the marking system. This is regrettable. Better to have kept the original and then compared it with his more generous system. The tests have lost their integrity, and can only serve as being broadly indicative.

By the way, in the heated atmosphere of debate about African intelligence, if such a paper were being put forward as proof of low African intelligence it would rightly be rejected as flawed. The tests are not complete, include other material, have been altered by the examiner, and have had their scoring system and timings altered. This still allows before-and-after comparisons, but the results only weakly relate to the g-loaded originals.

Here are the main results in a screen grab (image not reproduced here):


It is hard to make a judgment about what the results signify in terms of overall intelligence. Even if one looks only at the tests which most resemble full Wechsler subtests (Comprehension, Similarities, Arithmetic and Block Designs), the original test results are probably equivalent to Full Scale IQ 85, and they rise after over two years of technical education to FSIQ 89. This change is within the usual 4-point retest difference, but it is certainly suggestive of a training effect. If we knew more about the selection of the sample we could judge how this compared to the local norm. For example, if they are the brighter students, chosen for tertiary training, one might expect them to be one standard deviation above the national mean. That would correspond to being drawn from a population with a mean of IQ 70. If they are run-of-the-mill students then the local average is IQ 85, which is very much better than average results for sub-Saharan Africans. As always, the representativeness of samples is crucial when trying to estimate the mean of a bell curve.
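For readers who like the arithmetic spelt out, here is a minimal Python sketch of the two calculations in the paragraph above. It assumes the conventional IQ scale with a standard deviation of 15; the figures of 85 and 89 are my rough estimates from the altered tests, not numbers reported by McFie, and the function names are simply made up for illustration.

```python
# Sketch of the IQ arithmetic discussed above.
# Assumption: conventional IQ scale with a standard deviation of 15.
# The 85 and 89 values are the post's rough estimates, not McFie's data.
IQ_SD = 15


def gain_in_sd(before: float, after: float, sd: float = IQ_SD) -> float:
    """Express a change in IQ points as a fraction of a standard deviation."""
    return (after - before) / sd


def implied_population_mean(sample_mean: float, sds_above: float,
                            sd: float = IQ_SD) -> float:
    """Population mean implied if the sample sits `sds_above` SDs above it."""
    return sample_mean - sds_above * sd


retest_gain = gain_in_sd(85, 89)              # a 4-point rise, in SD units
selected = implied_population_mean(85, 1.0)   # sample one SD above the norm
typical = implied_population_mean(85, 0.0)    # sample at the local norm

print(round(retest_gain, 2), selected, typical)
```

The point of the sketch is only that the same sample mean implies very different population means depending on how selected the sample was, which is exactly the thing we do not know here.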

Once these IQ results have been spelt out, which McFie did not do, his comment: “None showed qualitative differences from English subjects in their performance on verbal, numerical and abstraction tests” becomes particularly interesting and informative. Right from the start, the students knew what was expected of them, and understood the concept of the tests. They did not have an operating system incompatibility, although there was a power difference. (I make this point because some ill-advised commentators try to suggest that there is an African way of thinking which is fundamentally different from the European way of thinking, as profound as the difference between Microsoft and Apple operating systems). These students may have had a problem with the unfamiliar blocks, but they did not have a problem in realising that they had to make a copy of the block design.

Although the tests have been much altered, in terms of the original standard deviations the statistically significant gains are Picture Description 0.67, Block Designs 0.65 and Memory for Designs 0.63, and the gain on the total of all tests is 0.73. If you look at the narrow standard deviation for Picture Description, this made-up test lacks discriminative ability, as do Arithmetic and Comprehension. Either there were too few items and not enough hard ones, or the group was already highly selected and homogeneous in ability on these particular skills. Fuller testing might have shown different results. My own reading is that this small study suggests that 2 years of technical education probably improves Block Design and Memory for Designs, but without control groups we cannot be sure.
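To show why a narrow standard deviation matters, here is a small Python sketch of a gain expressed in baseline standard deviation units. The numbers are purely hypothetical placeholders chosen for illustration; they are not McFie's means or standard deviations, and the function name is my own.

```python
def standardized_gain(pre_mean: float, post_mean: float, pre_sd: float) -> float:
    """Gain from first to second testing, in units of the first-testing SD."""
    if pre_sd <= 0:
        raise ValueError("pre_sd must be positive")
    return (post_mean - pre_mean) / pre_sd


# HYPOTHETICAL values, not taken from McFie (1961):
# the same raw gain of 1 point on a test with a narrow or a wide spread.
narrow_spread = standardized_gain(pre_mean=6.0, post_mean=7.0, pre_sd=1.5)
wide_spread = standardized_gain(pre_mean=6.0, post_mean=7.0, pre_sd=3.0)

print(round(narrow_spread, 2), round(wide_spread, 2))
```

The same raw improvement looks twice as large in SD units when the baseline spread is half as wide, which is one reason to be cautious about gains on short tests given to a homogeneous group.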

In my view, the Nisbett et al. account is not a good representation of the paper. Here it is again:

Gains in sub-Saharan African countries of 0.50 to 0.70 SD in response to a few months of Western-style education have been reported for heavily g-loaded fluid intelligence tests (McFie, 1961).

It makes the results sound extensive, broadly effective across countries, and quickly and easily attained. How should they have reported the results? There are various options, but here is one suggestion:

McFie (1961) tested 26 Ugandan adolescent boys on 7 adapted Wechsler type tests after over two years of technical education, finding gains of 0.6 sd on 3 non-verbal tests.

It is 29 words to their 30 (and 143 versus 163 characters) and I think it captures the main results more accurately. What do you think?
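If you want to check my counting, here is a minimal Python sketch. It assumes words are whatever is separated by spaces, and that the character count excludes spaces; the two sentences are copied from above, and the helper name is just for illustration.

```python
nisbett = (
    "Gains in sub-Saharan African countries of 0.50 to 0.70 SD in response "
    "to a few months of Western-style education have been reported for "
    "heavily g-loaded fluid intelligence tests (McFie, 1961)."
)
suggested = (
    "McFie (1961) tested 26 Ugandan adolescent boys on 7 adapted Wechsler "
    "type tests after over two years of technical education, finding gains "
    "of 0.6 sd on 3 non-verbal tests."
)


def words_and_chars(text: str) -> tuple[int, int]:
    """Return (number of space-separated words, characters excluding spaces)."""
    words = text.split()
    return len(words), sum(len(word) for word in words)


print("Nisbett et al.:", words_and_chars(nisbett))
print("Suggested:     ", words_and_chars(suggested))
```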


So, where is the ghost to whom I referred earlier? John McFie was my first boss, a kind man who left a promising career at The National Hospital, Queen Square, London to work as a rural doctor in Africa. When he eventually returned to England with his African wife, and to brain research at Guy’s Hospital, he hired me in 1968 and mentored me as we studied the cognitive effects of cortical injuries sustained in childhood. When Arthur Jensen’s paper came out in 1969 we both decided we would attack what we thought was Jensen’s suggestion of a genetic cause of African deficits on Block Designs by extending the work John had done in his 1961 paper. He was the senior author of the first paper I published, broadly arguing that Jensen’s findings were an artefact of cultural restrictions regarding constructional toys. Did we prove our case? Was Jensen convinced when I presented the results to him at a conference? More of that later.


  1. "I think it captures the main results more accurately. What do you think?" Indeed it does. For a start the misleading description of the number of months is corrected.

    I grinned at "The 7 tests were in fact McFie’s adaption of standard tests, including putting in some new material and utilising some new scoring methods, and significantly increasing the time taken to complete items." That's the sort of stunt that "climate scientists" get up to, fiddling about with instrumentation until it makes the point they wish for. I hope your old chum isn't “best understood” as having been an early adopter of such a technique.

  2. Nice work. John Loehlin and colleagues wrote in their 1975 book "Race Differences in Intelligence" that many of the studies referenced in the race and IQ debate are curiously old and have never been replicated. Unfortunately, almost forty years later we are still seeing references to such studies, often the very same ones whose antiquity Loehlin et al. bemoaned in 1975. For example, why has the famous Eyferth study of white and mixed-race children of American GIs in post-WW2 Germany never been replicated?

    In his 2010 book on intelligence, Nisbett devotes much space to research conducted in the first half of the 20th century, often putting a similarly misleading gloss on it as in that 2012 paper. For example, he discusses a 1936 study where high-IQ African American students were found to have no more (parentally reported) white ancestry than what was found in a representative sample of African Americans.

    What he does not mention is that the "representative" comparison sample was in fact highly elite, largely comprising college students and professionals. See here for more.

  3. I think that all parties become enamored of particular studies. The Eyferth study, for example, is important, but never seems to have generated followups or comparable replications. Sometimes one just has to accept that a particular researcher finds it hard to repeat an investigation because the opportunity falls into his lap once. As a rule of thumb, however, if a study cannot be replicated then it has less to contribute. As to racial admixture, that needs to be done again with a full genome study. Razib Khan has some good suggestions on this. Visible racial characteristics are not always related to underlying genetics, so one has a potentially intrinsic genetic measure of racial prejudice effects: does intelligence follow the "hidden" genetic code or the part of it which is visible, and might have generated negative racial attitudes? Worth testing. Ideally, we would do a full meta-analysis of all the data. Jason Malloy is doing that on parts of the story. I certainly get tired of partisan reviews and special pleading. We ought to be able to agree what constitutes a good quality study, and rank the evidence by quality. I don't mean getting rid of everything because it is not up to the best standards. Simply showing the variable quality, and trying to derive the best possible summary of the findings, would be a good way forwards.

  4. The fourth doorman of the apocalypse, 15 December 2013 at 18:44

    I notice one of the test sets included is "abstraction tests."

    I am interested in this subject but have not been able to find much with cursory searches, i.e., what constitutes abstraction, at what age does it manifest, is it correlated with IQ, etc.

    Where should I look for papers etc on this?

  5. http://onlinelibrary.wiley.com/doi/10.1002/1097-4679(198007)36:3%3C778::AID-JCLP2270360333%3E3.0.CO;2-6/abstract
    These are tests in which subjects have to deduce the general rule behind a set of exemplars presented to them, that is, the abstract principle which determines whether the example is "right" or "wrong". It may be an abstract concept like "always choose the bigger item if it is on the left" or something of that sort. There are quite a few variants of this, including the Wisconsin card sorting task, in which the rules are changed without warning, and the test is to see how quickly subjects look for the new abstract rule.