Tuesday 8 September 2015

The Wechsler factor factory


Nobody gets round sampling theory, not even the Spanish Inquisition. If you take 10 different samples of problem-solving you can be fairly sure that this intellectual decathlon will reveal the best mental athletes. If you separate those 10 subtests into 5 Verbal versus 5 Non-verbal performance skills, you get 3 summary scores: Full scale IQ based on 10, and Verbal IQ and Performance IQ based on 5 each. This is the structure which reigned supreme from 1955 till about 2008, when a four factor structure was launched: Verbal Comprehension Index (Similarities, Vocabulary, Information); Perceptual Reasoning Index (Block Design, Matrix Reasoning, Visual Puzzles); Working Memory Index (Digit Span, Arithmetic); and Processing Speed Index (Symbol Search, Coding).

Did you see what they did there? The same 10 subtests now produce 6 summary scores for psychologists to play with. These were the 4 factor scores described above, a Full Scale score, and a General Ability Index comprised of the verbal comprehension and perceptual reasoning factors added together.

The official explanation for trashing 53 years of comparative data was that many publications suggested that the 4 factor solution was better than the old verbal-performance distinction.

The unofficial explanation given to me by those more closely involved with the process is that whether a 3 or 4 factor solution emerged depended on which of the supplementary subtests were included (there are 5 of them, but I have ignored them in the hope you will keep reading), and that the new structure was a marketing ploy which had the commercial benefit of forcing practising psychologists to buy the new test: minimum cost £1,306, with £125 extra for 25 test score forms.

Seen from a sampling theory point of view, whereas formerly the verbal and performance factors were based on 5 subtests, the new factors were based on 2 or 3 subtests. There is no way round the fact that the sampling of ability per factor is inadequate. Psychologists have had a field day, generating factor scores which then show discrepancies with other factor scores, thus revealing syndromes of learning disorders, and a menagerie of specific deficits which require remediation through extensive training. Writing a psychometric report is now easier (in fact, there is software which will write it for you) because there appears to be so much more to talk about, still based on 10 tests. Since psychologists, like everyone else, tend to laziness, they often give only 7 or 8 of the required 10 subtests, pro-rate the results, and then have even more to talk about, at even lower levels of reliability.
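To put rough numbers on the sampling argument, here is a minimal sketch using the Spearman-Brown formula (in its standardised-alpha form) for the reliability of a score summed over k subtests. The average inter-subtest correlation of 0.5 is an illustrative assumption, not a figure from the Wechsler manuals, and the function name is my own.

```python
def composite_reliability(k: int, mean_r: float) -> float:
    """Spearman-Brown (standardised alpha) reliability of a sum of k subtests
    whose average inter-correlation is mean_r."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * mean_r / (1 + (k - 1) * mean_r)


MEAN_R = 0.5  # hypothetical average correlation between subtests

if __name__ == "__main__":
    # Compare a 2- or 3-subtest index with the old 5-subtest scales
    # and the 10-subtest Full Scale score.
    for k in (2, 3, 5, 10):
        rel = composite_reliability(k, MEAN_R)
        print(f"{k:2d} subtests: reliability ~ {rel:.2f}")
```

Under that assumption a 2-subtest index comes out at about 0.67, a 3-subtest index at 0.75, a 5-subtest scale at about 0.83 and the full 10 at about 0.91. Real values depend on the particular subtests and the sample, so treat this as the shape of the argument rather than a measurement.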

There is nothing much that can be done about this, other than patiently explaining that measures of human ability depend on administering a broad range of well-validated tests, and paying due regard to error terms in measurement, and pointing out the pitfalls of multiple comparisons between not particularly reliable factor scores.
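The two pitfalls named here can be sketched numerically. The first function is the classical formula for the reliability of a difference between two scores on the same scale; the second gives the chance of at least one false alarm across several comparisons, assuming (unrealistically) that the comparisons are independent. All input values are hypothetical and chosen only for illustration.

```python
from math import comb


def difference_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference X - Y for two equally scaled scores.

    r_xx and r_yy are the reliabilities of the two scores and
    r_xy is the correlation between them.
    """
    if r_xy >= 1:
        raise ValueError("r_xy must be below 1")
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)


def familywise_error(n_comparisons: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_comparisons


# Four factor scores allow 6 pairwise discrepancy comparisons.
N_PAIRS = comb(4, 2)

if __name__ == "__main__":
    print(f"difference reliability: {difference_reliability(0.90, 0.90, 0.60):.2f}")
    print(f"pairwise comparisons:   {N_PAIRS}")
    print(f"familywise error:       {familywise_error(N_PAIRS):.2f}")
```

With two indexes each reliable at 0.90 and correlated 0.60, the difference between them has a reliability of only 0.75. Six comparisons at the 5% level give roughly a 26% chance of at least one spurious "discrepancy" under the independence assumption.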

The Spanish standardisation of the Wechsler in 2012 has allowed a team of Madrid researchers to take a detailed look at the factor structure of the test. I should warn you that work of this nature tends to be an intelligence test in its own right, so I will simply give you the highlights as I understand them. They also look at how the test relates to educational attainment. For example, Ritchie et al. (2015) found that years of education directly predicted performance on Block Design, Matrix Reasoning, Digit Symbol Coding and Symbol Search, but not on Letter-Number Sequencing or Digit Span Backwards.

Francisco J. Abad, Miguel A. Sorrel, Francisco J. Román, and Roberto Colom (2015). “The Relationships Between WAIS-IV Factor Index Scores and Educational Level: A Bifactor Model Approach.” Psychological Assessment, Vol. 27, No. 3. http://dx.doi.org/10.1037/pas0000228


These were the main findings: (a) the bi-factor model provides the best fit; (b) there is partial invariance, and therefore it is concluded that the battery is a proper measure of the constructs of interest for the educational levels analyzed (nevertheless, the relevance of g decreases at high educational levels); (c) at the latent level, g and, to a lesser extent, Verbal Comprehension and Processing Speed, are positively related to educational level/attainment; (d) despite the previous finding, we find that Verbal Comprehension and Processing Speed factor index scores have reduced incremental validity beyond FSIQ; and (e) FSIQ is a slightly biased measure of g.

A few words of explanation:

Bi-factor model. Assumes that general intelligence affects all subtest scores, and in addition those scores are each affected by different group factors. The bi-factor model (one general factor and four specific factors) showed the best fit, which is consistent with previous investigations of the factor structure of WAIS-IV (Canivez & Watkins, 2010a, 2010b; Gignac & Watkins, 2013; Kranzler et al., 2015).
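As a minimal sketch of what the bi-factor model claims, assume standardised subtests and mutually uncorrelated factors. The model-implied correlation between two subtests is then the product of their g loadings, plus the product of their group-factor loadings if they belong to the same group. The loadings below are made up for illustration and are not the values reported by Abad et al.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtest:
    name: str
    group: str            # label of the group (specific) factor
    g_loading: float      # loading on the general factor
    group_loading: float  # loading on its own group factor


def implied_correlation(a: Subtest, b: Subtest) -> float:
    """Correlation implied by an orthogonal bi-factor model."""
    r = a.g_loading * b.g_loading  # every pair shares g
    if a.group == b.group:         # only same-group pairs share a group factor
        r += a.group_loading * b.group_loading
    return r


# Hypothetical loadings, for illustration only.
vocabulary = Subtest("Vocabulary", "VC", 0.6, 0.5)
similarities = Subtest("Similarities", "VC", 0.7, 0.3)
coding = Subtest("Coding", "PS", 0.4, 0.6)

if __name__ == "__main__":
    print(f"{implied_correlation(vocabulary, similarities):.2f}")  # same group
    print(f"{implied_correlation(vocabulary, coding):.2f}")        # g only
```

With these invented numbers the two verbal subtests correlate 0.57, of which 0.42 comes through g, while Vocabulary and Coding correlate 0.24 through g alone.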

Measurement invariance. You know all this, but the clumsy phrase means “are the measures well founded, reliable and reproducible in other large representative samples?” (Colom later suggested to me that measurement invariance should be described as ‘comparable across groups’). If not, they are of limited value. We do not want measures that vary all over the place. Establishing measurement invariance is a demanding procedure, usually involving confirmatory factor analyses on large data sets.

Regarding the bi-factor model, we have seen that the g explains a large part of the variance. Most of the subtests tapping Perceptual Reasoning and Working Memory had greater loadings on the general factor than on their specific factors, and this is also the case for Similarities and Information subtests.
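One way to express "greater loading on g than on the specific factor" is the share of a subtest's common variance that is attributable to g. Under an orthogonal bi-factor model this is the squared g loading divided by the sum of the two squared loadings. The sketch below uses hypothetical loadings, not those in the paper.

```python
def g_share(g_loading: float, group_loading: float) -> float:
    """Proportion of a subtest's common variance that is due to g."""
    g_var = g_loading ** 2
    group_var = group_loading ** 2
    if g_var + group_var == 0:
        raise ValueError("subtest has no common variance")
    return g_var / (g_var + group_var)


# Hypothetical (g loading, group loading) pairs, for illustration only.
EXAMPLES = {
    "mostly g": (0.7, 0.3),
    "mostly group factor": (0.4, 0.6),
}

if __name__ == "__main__":
    for label, (g, s) in EXAMPLES.items():
        print(f"{label}: {g_share(g, s):.2f} of common variance from g")
```

A subtest loading 0.7 on g and 0.3 on its group factor owes about 84% of its common variance to g; one loading 0.4 and 0.6 owes only about 31%.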

We have verified that the factor structure of WAIS-IV is almost invariant across educational levels (equal loadings, intercepts, and residual variances). Figure Weights and Arithmetic are the best indicators of g (with large common variance related to g and small common variance related to specific factors), and this is true across all educational levels. Only three tests showed lower loadings as educational level increased. Schooling might reduce the effect of some abilities required in the Matrix Reasoning Task. In the higher educational levels, Matrix Reasoning would be a purer indicator of the general factor. On the other hand, for Coding and Letter-Number Sequencing, the loading on g was reduced by half. One possible explanation is that schooling improves the efficiency of sequential processing of letter and/or numbers, reducing the complexity of the tasks.

With regard to the relationship between intelligence and educational attainment, both observed and latent regression analysis have shown that FSIQ/g explained most of the variance in the prediction model.

When educational attainment was modelled as predictor, we found an average increase of 2.1 IQ points per year in the g latent factor. This latter finding is largely consistent with Ceci (1991) and Gustafsson (2001). Regarding the relationship between FSIQ and educational attainment, we conclude that it is mainly explained by the latent general factor (64%).

The authors kindly reach out to their hard pressed (or lazy) clinical and educational psychologists:

There might be time constraints requiring the selection of a reduced number of tests for measuring g. In this latter case, those interested in choosing the purest intelligence measure (less influenced by educational attainment) might consider skipping the administration of the Information, Similarities, Comprehension, and Coding subtests. Matrix Reasoning and Figure Weights are the measures less influenced by formal education.

In sum, this paper helps us to understand the underpinnings of the test which is regarded as the gold standard for individual psychometric assessment. All tests have to evolve, but none can escape the need to sample widely and carefully, and to restrict the scores to those which are reliable and valid. In my view the Wechsler test producers should use more subtests and offer fewer factor scores.


  1. For more on the shenanigans of test publishers, see this thorough review of the WISC-V by Canivez and Watkins.

    1. thank you. An illuminating attempt at replication

  2. there's a continuum - on one end we can measure many different-ish things in a cursory manner, on the other end we can measure very few things way too many times - no sense in measuring the same thing 10 times:)

    a composite based on only 2 or 3 subtests may have high reliability - or not - it depends on those pesky subtests (reliability/intercorrelations/& the sample)

    factors definitely depend upon what tests are in the mix. factor stability of traditional ability tasks doesn't vary much unless given to a group without much variability - say to a very low ability (or very young) group - they will have fewer factors. but g is always there:)

    the WJ-IV replaced many of their WJ-III GIA subtests (relegating old ones to the supplemental bin). some folks question why we were giving them all those years if they didn't work well enough to be included in the new GIA. some call it progress.

    all IQ tests are works in progress (as is science!) yet marketing departments never want to call them "works in progress" - they want to change them as little as necessary to appear up to date, but not so much they are unrecognizable & hence too much work to (re)learn!

    marketing would be happy with "now more scores from fewer tests! less work - more scores! buy now!" :)

    1. Thanks for your comments, which I agree with, and always welcome.

  3. Regarding the Spanish Inquisition: https://m.youtube.com/watch?v=d3N0uEGVv48

    1. It was the early part of that film which I had in mind. Thanks!

  4. A small point: "Ritchie et al. (2015) found that years of education directly predicted performance on ... Matrix Reasoning .... " doesn't seem entirely consistent with "Matrix Reasoning and Figure Weights are the measures less influenced by formal education".

    A second point, and meant seriously not sarcastically: what is the point nowadays of research into measuring IQ? I have the impression that IQ is such an anti-PC concept that many organisations are reluctant to use it - particularly many organisations in the US. Or am I simply wrong in that impression?

    1. I summarised a more lengthy explanation thus: Only three tests showed lower loadings as educational level increased. Schooling might reduce the effect of some abilities required in the Matrix Reasoning Task. In the higher educational levels, Matrix Reasoning would be a purer indicator of the general factor.
      So, it is about the diminishing effects at higher levels of education.
      On your second point, organisations are still interested in testing intelligence, though the tests usually have other names. Government bodies are usually more evasive (the US military can test intelligence so long as they don't say so too loudly) but private companies test "ability" "readiness" "thinking skills" and so on, avidly.

  5. My reply seems to have vanished into the ether so I repeat my thanks for this answer.