Test construction used to be a sober business. Every ten or fifteen years a new version of an established test would come out, and test-giving psychologists bought the new version. They often complained about it, on the grounds of cost, and on the grounds of having to learn new material when they had grown familiar with the old version, and knew all its characteristics intimately. They also noticed the occasional improvement.
The benefits of sticking to what you know were explained to me by an anaesthetist years ago. He told me that out of the many anaesthetics on the market he restricted himself to the best three, and mostly used just one. In this way he learnt everything he could about how it worked and how his patients reacted to it. By paying great attention to the patient’s medical history, and also the family history of allergies, he learned how to anaesthetise his patients. In special cases he would sometimes use the other two anaesthetics, singly or in combination. In that way he got good results, and fewer nasty surprises. For that reason, I would rather have a skilled surgeon with a somewhat blunt knife, than expose myself to a very sharp knife in the hands of a dull surgeon.
Even the old Stanford-Binet, which bundled together verbal and non-verbal items, and which the Wechsler tests rendered obsolete, could, in the hands of a skilled practitioner, give you the child’s intellectual level very quickly, and pretty accurately. Testers knew which items to skip. They were the pioneers of dynamic testing, now the domain of computer-administered tests.
Of course, new tests kept coming out, and they often caused some excitement. There were tests which tested things which had never been tested before. Tests which tested old things in new ways. Tests which tested new things in a new way and displayed the results in new ways. (The latter were particularly popular). Also, many, many tests which were not tests. They were assessments (tests, really) of learning “styles”, creativity, and sundry other abilities, but they were not intelligence tests. Or so the description of the test asserted.
Test publishers quickly realised that they made profits every time they published a new test (particularly if psychologists had to go on a training course to give it). Once that new test became even slightly popular, new-test enthusiasts could silence any critic by asking “Have you done the training?”. In that way anyone with critical faculties was either rendered mute, or had to pay for both test and training, after which some conceded that one of the subtest items was passable. Of course, you may rightly wonder what sort of psychologist needs to be trained to give a test, when every single examination of a person’s mental capabilities has to follow the same inevitable steps: item construction, piloting, item selection, test explanation, scoring systems, standardisation, item analysis, construction of scaled scores, construction of group ability scores, construction of overall ability scores, and so on. By all means give psychologists a very good grounding in psychometrics, even if it takes six months to a year, but if you have to have a training course for each test, then both the test and the psychologist are in difficulties.
Naturally, most of these tests fell by the wayside. There is a positive manifold in human skills, so whatever the label on the test, the test-taker’s brain has to solve the items, and general intelligence comes into play. Once you have a measure of general intelligence and, say, two group factors, you get diminishing returns when you push further into special skills, and you also tend to pick up more error variance.
Now a few words about the grand-daddy of individually administered “clinical” intelligence tests, the Wechsler tests. They are the gold standard for intelligence testing. First of all, they are pretty good, which is why they got their dominant position. The material was sensibly selected and well organised. It provided a full scale IQ based on 10 subtests and two well-known and well-understood subscales, Verbal and Performance, each composed of 5 of the subtests. Sales were brisk, and most clinicians could call on roughly 40 years of data on a common basis, with only minor updating each re-standardisation. For once psychology was moving towards replicability of results, and you could even begin to do comparisons across generations, all of whom had done roughly the same test, and whose full scale IQs were directly comparable.
So, Wechsler decided to mess everything up. They broke the 10 subtests into 4 subscales, composed of 3 or 2 subtests each. You do not have to know much about sampling theory to realise that the error term for each will be wider than for a 5-subtest scale. Two subtests do not a factor make. The search for apparent detail in factors came at a cost in terms of accuracy. When you make allowance for the reality that many psychologists do not give the full test, you have a prescription for psychologists writing long reports about many subscale results, with fragile support for their interpretations. There are now 4 subscales about which one can speculate, where formerly there were 2. Good for business. Your chance of finding a special strength (my client’s genius has been under-estimated) or a special weakness (my client’s genius has been damaged by your negligence) has been doubled at a stroke. Good for business.
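The sampling-theory point can be made concrete with the Spearman-Brown formula. The 0.80 subtest reliability below is an assumed round figure for illustration, not a published Wechsler value:

```python
def spearman_brown(r_single, k):
    """Reliability of a composite of k parallel subtests,
    each of reliability r_single (Spearman-Brown prophecy formula)."""
    return k * r_single / (1 + (k - 1) * r_single)

def sem(reliability, sd=15.0):
    """Standard error of measurement on an IQ-style scale (SD = 15)."""
    return sd * (1 - reliability) ** 0.5

r = 0.80  # assumed reliability of a single subtest (hypothetical figure)
for k in (2, 3, 5, 10):
    rel = spearman_brown(r, k)
    print(f"{k:2d} subtests: reliability {rel:.3f}, SEM {sem(rel):.1f} IQ points")
```

On these assumed figures a 2-subtest index carries an SEM of about 5 IQ points, against roughly 3.3 for a 5-subtest scale and about 2.3 for the full 10: exactly the widening error band that fewer subtests per scale buys you.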
Then, add in a few other tests which are g loaded, but not as g loaded as intelligence tests. For example, tests of memory. Even here, Wechsler has designed memory tests with known correlations with intelligence, so you can calculate how big a discrepancy has to be before one can argue that a client’s memory is poorer than their intelligence would predict. There is wiggle room, but not much. However, there are other tests of memory, so those can be used in addition with more chance of finding apparent discrepancies.
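The logic of that discrepancy calculation can be sketched with simple regression. The correlation of 0.6 between IQ and the memory index below is illustrative, not a published figure:

```python
def predicted_score(iq, r, mean=100.0):
    """Score on a second test predicted from IQ, with regression to the mean."""
    return mean + r * (iq - mean)

def discrepancy_z(iq, observed, r, mean=100.0, sd=15.0):
    """How unusual the observed score is, given the IQ-based prediction.
    The standard error of estimate is sd * sqrt(1 - r**2)."""
    see = sd * (1 - r ** 2) ** 0.5
    return (observed - predicted_score(iq, r, mean)) / see

r = 0.6                 # assumed IQ-memory correlation (illustrative)
iq, memory = 120, 95
print(predicted_score(iq, r))                 # 112.0: prediction regresses toward the mean
print(round(discrepancy_z(iq, memory, r), 2))  # -1.42
```

A memory score of 95 against an IQ of 120 looks dramatic, but sits only about 1.4 standard errors below prediction: wiggle room, but not much.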
Then add in several other tests with variable g loadings. Tests of executive function are the most popular. With each additional test your chance of finding a significant discrepancy rises. By giving the percentile rank for each test result you can convince almost every reader of the report that the person’s abilities are highly variable. You can then use this variability (based on improper sampling) to argue that calculating a full scale IQ would be “meaningless”. Like adolescent poets, such clinicians adore meaninglessness. Of course, no attainment result is meaningless. Each contributes information about the person’s abilities, and also contains an error term. The trick is to maximise the former and decrease the latter. Pooling the result of several well-sampled tests helps achieve that.
Now, the Wechsler team are not responsible for the plethora of special tests, but their foray into “factors” based on 3 or 2 subtests was not a good precedent. It has led to confusion among some clinical psychologists about the factorial structure of intelligence. The Wechsler team must have gambled that producing a large number of factor scores on the basis of a small number of subtests was what the market wanted, and they relied on the professionalism of testers to give the full 10 tests, and then give precedence to the best-founded score, which is Full Scale intelligence.
The current situation is like inviting every Olympic athlete to compete in the decathlon, but then allowing them to drop some of the events and to ask for prizes on the basis of a quasi-random selection of their best 2 or 3 events. The decathlon is what the Wechsler test required: 10 core tests for a full result. We should return to that simple standard if, like trying to find the best all-round athletes, we want to find the best all-round minds.