Sunday 23 February 2014

Multiple strengths, multiple weaknesses


Test construction used to be a sober business. Every ten or fifteen years a new version of an established test would come out, and test-giving psychologists bought the new version. They often complained about it, on the grounds of cost, and on the grounds of having to learn new material when they had grown familiar with the old version, and knew all its characteristics intimately. They also noticed the occasional improvement.

The benefits of sticking to what you know were explained to me by an anaesthetist years ago. He told me that out of the many anaesthetics on the market he restricted himself to the best three, and mostly used just one. In this way he learned everything he could about how it worked and how his patients reacted to it. By paying great attention to the patient’s medical history, and also the family history of allergies, he learned how to anaesthetise his patients. In special cases he would sometimes use the other two anaesthetics, singly or in combination. In that way he got good results, and fewer nasty surprises. For that reason, I would rather have a skilled surgeon with a somewhat blunt knife, than expose myself to a very sharp knife in the hands of a dull surgeon.

Even the old Stanford-Binet, which bundled together verbal and non-verbal items, and which the Wechsler tests rendered obsolete, could, in the hands of a skilled practitioner, give you the child’s intellectual level very quickly, and pretty accurately. Testers knew which items to skip. They were the pioneers of dynamic testing, now the domain of computer-administered tests.

Of course, new tests kept coming out, and they often caused some excitement. There were tests which tested things which had never been tested before. Tests which tested old things in new ways. Tests which tested new things in a new way and displayed the results in new ways. (The last were particularly popular.) Also, many, many tests which were not tests. They were tests of learning “styles”, creativity, and sundry other abilities, but they were not intelligence tests. Or so the description of the test asserted.

Test publishers quickly realised that they made profits every time they published a new test (particularly if you had to go on a training course to give the test). Once that new test became even slightly popular, new test enthusiasts could silence any critic by saying “Have you done the training?”. In that way anyone with critical faculties was either rendered mute, or had to pay for both test and training, after which some conceded that one of the subtest items was passable. Of course, you may rightly wonder what sort of psychologist needs to be trained to give a test, when every single examination of a person’s mental capabilities has to follow the same inevitable steps: item construction, piloting, item selection, test explanation, scoring systems, standardisation, item analysis, construction of scaled scores, construction of group ability scores, construction of overall ability scores, and so on. By all means give psychologists a very good grounding in psychometrics, even if it takes six months to a year, but if you have to have a training course for each test, then both the test and the psychologist are in difficulties.

Naturally, most of these tests fell by the wayside. There is a positive manifold in human skills, so whatever the label on the test, the test-taker’s brain has to solve the items, and general intelligence comes into play. Once you have a measure of general intelligence and, say, two group factors, you get diminishing returns when you push further into special skills, and you also tend to pick up more error variance.

Now a few words about the grand-daddy of individually administered “clinical” intelligence tests, the Wechsler tests. They are the gold standard for intelligence testing. First of all, they are pretty good, which is why they got their dominant position. The material was sensibly selected and well organised. It provided a full scale IQ based on 10 subtests and two well known and well understood sub-scales, Verbal and Performance, each composed of 5 of the subtests. Sales were brisk, and most clinicians could call on roughly 40 years of data on a common basis, with only minor updating at each re-standardisation. For once psychology was moving towards replicability of results, and you could even begin to do comparisons across generations, all of whom had done roughly the same test, and whose full scale IQs were directly comparable.

So, Wechsler decided to mess everything up. They broke the 10 subtests into 4 subscales, composed of 3 or 2 subtests each. You do not have to know much about sampling theory to realise that the error term for each will be wider than for a subscale of 5 subtests. Two subtests do not a factor make. The search for apparent detail in factors came at a cost in terms of accuracy. When you make allowance for the reality that many psychologists do not give the full test, you have a prescription for psychologists writing long reports about many subscale results, with fragile support for their interpretations. There are now 4 subscales about which one can speculate, where formerly there were 2. Good for business. Your chance of finding a special strength (my client’s genius has been under-estimated) or a special weakness (my client’s genius has been damaged by your negligence) has been doubled at a stroke. Good for business.
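
The sampling point can be made concrete with a rough sketch. It assumes, purely for illustration, that every subtest is a parallel measure with the same reliability (0.70 here, an invented figure, not a published Wechsler value), and it uses the Spearman-Brown formula to show how the standard error of measurement widens as a scale shrinks from 10 or 5 subtests to 3 or 2.

```python
import math

def composite_reliability(k, subtest_reliability):
    """Spearman-Brown: reliability of a composite of k parallel subtests."""
    r = subtest_reliability
    return k * r / (1 + (k - 1) * r)

def sem(reliability, sd=15.0):
    """Standard error of measurement on an IQ-style scale (SD 15)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical value chosen for illustration only.
ASSUMED_SUBTEST_RELIABILITY = 0.70

for k in (10, 5, 3, 2):
    rel = composite_reliability(k, ASSUMED_SUBTEST_RELIABILITY)
    print(f"{k:2d} subtests: reliability {rel:.2f}, SEM {sem(rel):.1f} points")
```

Under these assumptions the error band grows steadily as subtests are removed. The exact figures depend entirely on the assumed reliability, but the direction does not.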

Then, add in a few other tests which are g loaded, but not as g loaded as intelligence tests. For example, tests of memory. Even here, Wechsler has designed memory tests with known correlations with intelligence, so you can calculate how big a discrepancy has to be before one can argue that a client’s memory is poorer than their intelligence would predict. There is wiggle room, but not much. However, there are other tests of memory, so those can be used in addition with more chance of finding apparent discrepancies.
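
One common way of doing that calculation is simple regression, sketched below. This is not necessarily the method used in any test manual. It assumes both scores are on a mean-100, SD-15 scale, and the correlation of 0.6 is an invented figure for illustration.

```python
import math

def predicted_score(iq, r, mean=100.0):
    """Memory score predicted from IQ by simple regression (same mean and SD on both scales)."""
    return mean + r * (iq - mean)

def critical_discrepancy(r, sd=15.0, z=1.96):
    """Gap from the predicted score needed to fall outside the prediction error band."""
    return z * sd * math.sqrt(1 - r ** 2)

# Hypothetical memory-IQ correlation, for illustration only.
ASSUMED_R = 0.6

iq = 120
expected = predicted_score(iq, ASSUMED_R)
margin = critical_discrepancy(ASSUMED_R)
print(f"IQ {iq}: predicted memory score {expected:.0f}, "
      f"discrepancy worth noting below {expected - margin:.0f}")
```

The higher the correlation between the two tests, the narrower the margin, which is why a memory test with a known relationship to intelligence leaves less room for argument.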

Then add in several other tests with variable g loadings. Tests of executive function are the most popular. With each additional test your chance of finding a significant discrepancy rises. By giving the percentile rank for each test result you can convince almost every reader of the report that the person’s abilities are highly variable. You can then use this variability (based on improper sampling) to argue that calculating a full scale IQ would be “meaningless”. Like adolescent poets, such clinicians adore meaninglessness. Of course, no attainment result is meaningless. Each contributes information about the person’s abilities, and also contains an error term. The trick is to maximise the former and decrease the latter. Pooling the result of several well-sampled tests helps achieve that.
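
The arithmetic behind the rising chance of a discrepancy is easy to sketch. The version below assumes the comparisons are independent and each uses a 5% significance level. Both are simplifications: real test scores are positively correlated, so the true rate climbs more slowly, but it still climbs.

```python
def chance_of_false_discrepancy(n_tests, alpha=0.05):
    """Probability that at least one of n independent comparisons looks 'significant' by chance."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 2, 4, 8, 12):
    p = chance_of_false_discrepancy(n)
    print(f"{n:2d} comparisons: {p:.0%} chance of at least one spurious discrepancy")
```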

Now, the Wechsler team are not responsible for the plethora of special tests, but their foray into “factors” based on 3 or 2 subtests was not a good precedent. It has led to a confusion among some clinical psychologists about the factorial structure of intelligence. Wechsler must have gambled that producing a large number of factor scores on the basis of a small number of subtests was what the market wanted, and they relied on the professionalism of testers to give the full 10 tests, and then give precedence to the best founded score, which is Full Scale intelligence.

The current situation is like inviting every Olympic athlete to compete in the decathlon, but then allowing them to drop some of the events and to ask for prizes on the basis of a quasi-random selection of their best 2 or 3 events. The decathlon is what the Wechsler test required: 10 core tests for a full result. We should return to that simple standard if, like trying to find the best all-round athletes, we want to find the best all-round minds.


  1. Dr. Thompson, do you want to take an actual general intelligence test?

    Something that is quite probably a better, bias free test of intelligence (pattern recognition, hypothesis generation and testing, logical and mathematical reasoning etc...) for at least adult able bodied humans than anything you have ever seen or used before? I seriously promise you'd be interested, quite likely you haven't heard of this before or aren't usually knee-deep in computer science research.

    No other strings attached, completely free, though it would take a couple of hours of your time. I'd get right back to you with the details and setup as long as you have the time to set aside, with like a computer, desk and paper. To be clear this isn't something I have devised but true expert work that is simply entirely overlooked by the wrong academic subdisciplines.

    I appreciate and sympathize with a lot of what you write about here on this blog, frustration with the media and bad science, discussing how to actually look at all sorts of things, subfactor analysis and g. I'd honestly say I would rather care about what correlates to performance on this battery than a lot of what is in psychometric literature out there.

    I suggest this not just because I do think you are a far more than typically open-minded and passionate researcher in the field with the right background experience and understanding. It's because you also seem like the type of person who would be boundlessly fascinated by the prospect, have a true intellectual desire to see what is going on and have the ability to move forward and make something of that. This isn't just about arguing over the social sciences' lack of rigor, inability to reproduce results or clashing with colleagues and so on. I can't think of a better way to communicate or demonstrate a ton of points than to simply ask you try this out.

    I wasn't exactly extending this to other random blog followers, nor do I have an idea what norms are appropriate, though that all could be played out if you get back on it. The whole thing while simple would be on an honor system, but thanks for taking consideration.

  2. i'm not a wechsler fan - but it got out there & saturated the area & with the reluctance of highly trained test givers to learn new tests, it remains the (fool's) gold standard by default :( side note - after the company makes $ selling test kits for a year or 2, then all the $ is in record forms from then on (aka "test protocols":)
    some of the "factors" of these tests are false factors - as we discovered back when the wechsler had the silly freedom from distractibility factor - which pulled apart after more subtests were introduced & coding found it could run off with symbol search to form its own factor, free of digit span & arithmetic. don't get me started:) vocab tests (altho high in g, & a good proxy for intelligence) are not fair to older smart dyslexics (they don't grow in vocab at the same rate as others). tests with math(s) in them are not so fair to people who are bad at it (one's measuring their difficulty rather than their intelligence then), etc. yet, even the wechsler, as far as group statistics & g go - works well for large data sets -- yet for an individual it can be rough to ferret out their relevant strengths & weaknesses, etc. good point on the knowing experienced examiner getting what they need out of it, while not being fooled by parts of it that don't work as well, or blaming those parts on the test-ee instead of the test:)

  3. I don't know this "executive function" test, being an economist, but I can guess at why it might sell well. Suppose you are thinking about how to train one of your employees. You know that IQ is useful information, a magic one-hour way to find out a lot about an employee. You can't change the IQ, but it does tell you a strength or weakness of the person and helps you know how to train them. You'd like other magical tests, especially if each only takes ten minutes and you can take six and get IQ for free, because you'd like to have more of that quick and scientific data to base your training on. Of course, if all six turned out to be highly correlated, they'd be redundant, and you wouldn't want to buy them. You are delighted, though, if they vary a lot for a person with a given IQ. You'd know to work on one person's organizing ability and another person's memory. And notice how the company can get lots of variance in subscores. All the company has to do is have *lots* of subscores. If each subscore only gets ten minutes of testing, you get high variance in subscores, and as a bonus you save on employee time!