Sunday 15 December 2013

Correction on test bias

In my rush to post the Expert Opinion findings presented yesterday by Rindermann et al., I realise I gave the old 1984 figures for test bias. The newer figures are somewhat lower: an overall score of 1.86 was found, where 1 is an insignificant amount and 4 is a large amount. Sorry about that. It seems that Jensen (1980), “Bias in Mental Testing”, has had an educational effect after all. To those of you who queried it, fear not. The experts are responding to rational arguments.




Similarly, in the relevant comparisons the contemporary experts see less bias against groups, about 2 and a bit out of 9, though they certainly feel that all immigrants might be at a disadvantage. Jensen had a rule of thumb that immigrants should have 5 years to learn the local ways before it was valid to test them.

So, in summary, there is acknowledgment that the issue of test bias has been worked on, and has been reduced. Test producers now have to meet legal standards for item analysis, and there is double-sampling of minorities so as to achieve better confidence levels.


  1. Sorry if this long comment shuts down this thread (!) but I must chime in on (lack of) test bias - & “expert” subjective “ratings.” I bet statistically whatever tests they looked at, the data would actually say there’s NO bias or if there is, it’s against white males or Asians… (the experts are probably over-predicting bias! Yes, the bias experts are probably biased)

    A little bias history that ties in eventually:
    We always threw out items biased against females (sometimes an odd vocab word would be biased against females, occasionally an action oriented word) - & would leave in items biased against males (!) but the most fun was sending items out to “experts” (aka someone from out of town with a briefcase) & then saying in the manual “hey, we sent ‘em out to a bias review panel of experts” - & listing them (& paying them), but never mentioning some experts were so crazy they’d say “any mental test item is inappropriate for any native American” – that didn’t matter, b/c no matter what the experts said, we just reported that experts had looked at it & offered their opinions (which we might’ve considered if they weren’t so CRAZY!:)
    -- so this current research really just shows us that “experts” are getting less crazy!!

    Even by the late 1980s wealthier test companies did “minority oversampling” b/c the gold standard for (lack of) bias is equal predictive validity/fairness in prediction (if one has a low score - does it predict the same low outcome no matter what group one is in, etc.), & the same approximate rank order of item difficulties across groups, & the same internal “factor structure”, which is silly – e.g., for 2nd graders you may get 2 factors in an achievement test – math & reading – but for native American 2nd graders, they don’t score high enough yet to be high on one but low on the other, so they would have just 1 general factor - which didn’t mean the test was measuring something “different” for them – it just meant they hadn’t developed enough to differentiate into 2 (related) factors. But you’d try to show the same “internal factor structure” knowing full well one group might be one factor shy of a full load, simply b/c they score lower in general!

    Items were more apt to be biased regionally (e.g., a picture of a cactus for 5 or 6 year olds will be harder for people from the NE US than for those from the SW US where cacti exist in large #’s. Or a lighthouse beacon might be more familiar (& thus easier) for those from the NE US than those from the SW US).

    Any visual item now comes preloaded with people of color & girls for every math item – which makes “experts” happy! But how many of these experts could interpret equal predictive validity through multiple regression? If they can’t – they’re not experts:)

    In fact, nowadays we should be looking at reducing the amount of low “g” subtests that go into an ability test (even tho minority groups score closer to equal on those), b/c the low “g” tasks are what dyslexics tend to do poorly on – so ability tests with a lot of low “g” tasks underestimate the ability of those with dyslexia. Sorry, I’ve enjoyed all Dr. Thompson’s posts, but this is the only one I know enough to comment on lately!
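A minimal sketch of the "equal predictive validity through multiple regression" check mentioned in the comment above, in the form often called the Cleary model: regress the outcome on test score, a group indicator, and their interaction, then look at whether the group and interaction coefficients are near zero. Everything here is an assumption made for illustration: the function names are hypothetical, the scores are taken to be standardised, the groups are coded 0/1, and the data are synthetic rather than taken from any real test.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augment the matrix so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def predictive_bias_coefficients(scores, groups, outcomes):
    """Least-squares fit of outcome = b0 + b1*score + b2*group + b3*score*group.

    b2 is the intercept difference between groups and b3 the slope
    difference. Both near zero means a given score predicts the same
    outcome regardless of group.
    """
    rows = [[1.0, s, float(g), s * g] for s, g in zip(scores, groups)]
    k = 4
    # Normal equations: (X'X) b = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outcomes)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic illustration: group 1 scores lower on average, but the same
# regression line links score to outcome in both groups (no predictive bias).
rng = random.Random(0)
scores, groups, outcomes = [], [], []
for g in (0, 1):
    for _ in range(500):
        s = rng.gauss(-0.5 * g, 1.0)
        scores.append(s)
        groups.append(g)
        outcomes.append(2.0 + 0.6 * s + rng.gauss(0.0, 0.5))

b0, b1, b2, b3 = predictive_bias_coefficients(scores, groups, outcomes)
print(f"intercept={b0:.3f} slope={b1:.3f} "
      f"group_shift={b2:.3f} slope_shift={b3:.3f}")
```

In this setup a mean score difference between groups does not by itself show up as bias; only a difference in intercept or slope would. A real analysis would also need standard errors for `b2` and `b3`, which this sketch leaves out.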

  2. This is a great contribution! I always imagined the test constructors must be consulting people with far greater ability and experience than me. As a clinician (and my wife, who gives more tests, is the better judge of this), I get the impression that the Wechsler is now less good than before. The range of items is poorer, and some of the newer tests simply don't contribute much. What you have drawn more expert attention to is that a differential response to a test item does not have to be an indication of bias. It may indicate a truly different level of ability. Perhaps you would like to do a longer critique of the current Wechsler tests?

  3. aw, thank you sir. i try to stay away from the wechslers, b/c i use mainly the Differential Ability Scales (DAS-II - a revision of the venerable old British Ability Scales). The DAS-II kindly lumps all the high-"g" subtests into a composite & kicks out the lower-"g" (aka "relatively independent measures of ability") tasks - to be interpreted separately -- which it did as a lucky accidental marketing move to separate it from the wechslers – which had the same publisher. I shall try to damn the wechslers with faint praise… ideally, I come to bury them not to praise them:)

    the old wisc-r & wisc-III were terrible b/c they allowed so much verbal/cultural info into their nonverbal "performance" (putting pictures together in the right order to tell a story – uh not exactly “spatial” ability!) & for making the verbal so auditory-memory-dependent (& sad subtests like “information” – an easily scored quick proxy for “g” – but unfair to rural types who thought the 4 seasons were “deer, duck, elk & antelope” - “information” begged to be slapped for being “culturally biased”) + the older wechslers were awful for having so many low "g" (aka "processing":) tasks glommed into the overall composites. & that’s bad because…

    first I must jump over to the WJ-III to make a point - the wonderful (& venerable) gentleman, Dr. Richard Woodcock argued (faultily in my view, but right from other points of view) that it’s good to put the "processing" tasks in the mix with the high-"g" tasks – b/c people who process poorly usually score poorly on achievement, SO including processing in the “IQ” (aka GIA) gives you a stronger correlation with academic achievement – a correlation one can point to in the manual as a shining beacon of validity. So for “group” reasons it might be good,

    BUT, for testing INDIVIDUALS - especially bright (high "g") individuals who have low processing – it’s NO help to lump high-”g” & low-“g” all together & call the person “average” – it’s FAR better to separate the high-“g” from the low “g” – so you can tell that person to their face “hey, you’re smart – especially at things school doesn’t use much (e.g., spatial ability) – but see here how your processing deficits drag down certain parts of reading & writing to their same level…

  4. Yet another reason to be mad at the wechsler – they don’t use processing that would actually be relevant, viz., rapid naming/rapid retrieval tasks – aka language processing & retrieval speed – relevant to reading fluency – it’s the same task – rapidly retrieving exact rote verbal sound info from the filing cabinets.

    The Wechsler is so dumb, they use nonverbal symbol processing speed (quick to give & score! Yet utterly irrelevant – but hey, it came out of neuropsychology - & that’s fancy! & impressive to anyone who doesn’t actually have to use tests to try to help people & get actual RELEVANT info from them:) it’d be nice if the wechsler used language processing-retrieval speed instead of the fancily irrelevant non-language processing speed.

    The newer Wechslers are slightly kinder & gentler than their predecessors (“descent with modification” ☺ they don’t change ‘em much at any one time, b/c then people won’t buy them – nobody wants to relearn to give the darn things:) at least now the wechslers give you the option of making a high “g” composite – the GAI – which kicks out the 4 lower-“g” processing tasks (which form Processing Speed & (auditory) Working Memory)

    Still the Wechsler is a moron (I’m being nice - that’s higher than idiot & imbecile) for having a single nonverbal composite (Perceptual Reasoning Index - even more pretentious when it’s called “perceptuo-reasoning”) the problem is it’s underfactored – there are not enough 3D hands-on spatial tasks in the mix (if any – block design seems like one, but to get all the 2D line & angle aspects right is more 2D than 3D) if they had actual spatial tasks, block design would probably run off to join them in a SPATIAL factor – when I was at these test publishing places – they’re like American academia – NO FORMER MILITARY & especially no former enlisteds – like myself☺ – SO they know nothing of the ASVAB – which would be wonderful to give along with these tests – b/c you could easily see what wants to correlate with mechanical comprehension – the mainly spatial component of the ASVAB (required to be good in all 3 of the ASVAB formulas to audition to be a SEAL! The commonality to all 3 methods of proving one’s bright enough)

    many bright people have a Spatial > 2D Nonverbal > Verbal – that pattern is more frequent than its reverse (!) humbling to me, since I have the reverse - & the spatial > verbal pattern is more frequent b/c evolution selected more for spatial b/c that’s building shelters & keeping animals penned in & fixing things, etc. it usually helps people survive more than talking! the Wechsler won’t allow you to find that evolutionarily frequent & valuable pattern, but the DAS-II will.

    So that’s why I like the DAS-II:
    b/c evolution/descent with modification – rocks!

  5. Thanks for all this interesting information. I agree that researchers and test givers tend to work in silos. When I started clinical work Wechsler was seen as far better than the old Binet (too verbal) but all the selection type tests passed me by. I think there is a practical reason, to which you have already referred. Giving a test requires lots of practice. Once you really know the material, then you can concentrate on how the testee is handling things. Before that you are just making sure that you have presented everything correctly, and are ready for the next subtest.

  6. thank you. the irony is i went from test publishing to academia to private practice where i actually have to give these dang things all the time - & see their little flaws & idiosyncrasies & unequal item gradient jumps (& where they're normed too hard or too easy but only for certain ability levels at certain ages on certain subtests, etc.) & now could make them better than before! but one learns to work around their flaws & knows which parts to take with a grain of salt, etc. but many people give them unquestioningly as a whole - & in doing so, do a disservice to the examinee. thank you for letting me rant about this picky little part about measurement of intelligence:)

  7. On the contrary, this is important, and is often news to theoreticians in the field, many of whom confess "I have never given an IQ test". I think that testers who have given 500+ face to face intelligence tests are in a minority and have to explain "unequal gradient jumps" and also what needs to be taken with a grain of salt, because from interviews you can understand a misunderstanding which was ignored by the test constructors. Further rant: testers who use a new "special abilities" test with many subtests, and give these greater emphasis than established intelligence tests, simply because the new tests generate far more apparent deficits which can be used as heads of claim in medico-legal reports.