Wednesday 26 February 2014

Intelligence tests test intelligence testers

 

Testing intelligence used to be a simple business. The patients used their wits, and the psychologists used their instruction manuals. Some instruction and practice were required, because psychologists had to learn the instructions to be given for each test at every stage, including the prompts; learn how to record the answers and also time them with an old mechanical stopwatch; do all this when the material, facing the patient, appeared upside down and left-right inverted from your side of the table; record any response which was out of the ordinary; and keep the patient cheerful and engaged throughout. To help you, the presentation booklets had discreet little numbers for you to read, if only to check that you were presenting the right problem. There were also recording forms on which to jot down the results, and prompts about how many failures patients were allowed before you moved briskly on to the next subtest.

Block design, object assembly, picture completion and picture arrangement all required some kit, which had a tendency to get battered or lost. Coding required a form on the back of the test record booklet, and a cardboard overlay to mark the results quickly.

A mechanical stopwatch, I should explain, was a large, heavy, metallic chronometer which never ran out of batteries, and was easy to use. Multiple lap time analysis was not an option, nor were nanoseconds, so error rates in recording times were low. More sophisticated testers were provided with a chronometer wrist watch, so that timing could be done discreetly, without the person noticing it and getting too anxious. I was taken on a special journey by my boss to a specialist watch shop in the City of London in order to get the numbered chronometer placed on my wrist. It was a Swiss-made Moeris Antimagnetic watch, and it still works well.

A psychologist of modest intellect could be trained to use all these materials in a matter of weeks, and then they were tested on a patient or two by a senior psychologist, after which they were considered competent to begin their testing careers.

In the old days of testing, psychologists tested lots of people, so they started taking short cuts. They boiled the instructions down to the sensible minimum, having found out that the basic short form generally makes more sense than the elaborate longer one. Then they started cutting out tests, on a “bang for your buck” basis. Bluntly, how long does it take to get a result out of each subtest? Some are easy to give and score. Others require setting out lots of material, gathering it back again, and working through complicated scoring systems. Those tests tended to be left in the bottom drawer of the desk. Psychologists may be dull but they are not stupid.

Eventually researchers worked out statistically derived short forms in which 4 key subtests gave roughly the same result as the full 10. Roughly. Any psychologist who was in a hurry plumped for those. Of course, the error term was larger, but pragmatism ruled. As a consequence, a very large number of intelligence test results are not done properly, in that they are not based on the full test. It is hardly surprising that scores on later re-testing may differ, particularly when psychologists pick and choose which tests to include out of the 10, according to their own interests and theories. Short form testing also increases the apparent variability of the results, leading some gullible psychologists into thinking that it was wrong to calculate an overall result, when in fact that overall result had higher validity. Nobody gets round sampling errors, not even the Spanish Inquisition.
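The trade-off behind short forms can be made concrete with the Spearman-Brown formula, which predicts how the reliability of a composite falls as subtests are dropped, and hence how the error band around a score widens. The per-subtest reliability of 0.65 below is purely an illustrative assumption, not a WAIS figure:

```python
import math

def spearman_brown(r_one, k):
    """Predicted reliability of a composite of k parallel subtests,
    given the reliability r_one of a single subtest."""
    return k * r_one / (1 + (k - 1) * r_one)

def sem(sd, reliability):
    """Standard error of measurement on an IQ scale (SD = 15)."""
    return sd * math.sqrt(1 - reliability)

full = spearman_brown(0.65, 10)   # ten subtests: reliability about 0.95
short = spearman_brown(0.65, 4)   # four-subtest short form: about 0.88
print(round(sem(15, full), 1), round(sem(15, short), 1))
```

On these assumed numbers the short form's standard error of measurement comes out roughly half again as large as the full battery's, which is exactly the inflated variability on re-testing described above.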

When new tests came on the market they usually provided extremely interesting approaches with extremely complicated and bulky equipment. Take Digit Span, for example, which tests short-term memory: it now comes in a more complicated form, which might be useful. Then, in the Wechsler memory tests, someone decided to have a sequence-tapping test of “spatial memory”. You were required to set out the provided array of blue blocks welded onto a plastic tray, and then tap out a sequence of moves with your finger, which the patient had to copy. No problem when one or two moves were required. However, when the tapping sequence was 7 different positions long, it was difficult to be sure one had tapped out the sequence correctly, and then baffling when the patient tapped the sequence back so quickly that you could not be sure you had recorded it correctly. That test has been quietly dropped. One cannot have the punters showing up the psychologists.

However, the search for the quick test that gives a valid result continues. The task is not a trivial one. Here are the g loadings of the Wechsler Adult Intelligence Scale subtests, simply as a guideline on the competitive psychometric landscape which confronts any developer of a new intelligence subtest. These are taken from Table 2.12 of the WAIS-IV manual.

Vocabulary .78, Similarities .76, Comprehension .75, Arithmetic .75, Information .73, Figure Weights .73 and Digit Span .71 are all good measures of g.

Block Design .70, Matrix Reasoning .70, Visual Puzzles .67, Letter-Number Sequencing .67, Coding .62, Picture Completion .59 and Symbol Search .58 are all fair measures of g.

Cancellation .44 is a poor measure of g but remains one of the optional subtests.

All of the subtests, particularly the top 7, are good measures of g; none takes more than 10 minutes, and most take less. They provide plenty of psychometric bang for your testing-time buck. With a bit of practice in memorising the scoring criteria, you can almost mark up the vocabulary score as you go along.
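Those loadings are enough to illustrate why a handful of subtests gets “roughly” the same answer as the full battery. A minimal sketch, assuming a deliberately simplified single-factor model in which each pair of subtests correlates as the product of their g loadings (an idealisation, not the WAIS-IV's actual intercorrelation matrix):

```python
import math

# g loadings from Table 2.12 of the WAIS-IV manual, as listed above
g = {
    "Vocabulary": .78, "Similarities": .76, "Comprehension": .75,
    "Arithmetic": .75, "Information": .73, "Figure Weights": .73,
    "Digit Span": .71, "Block Design": .70, "Matrix Reasoning": .70,
    "Visual Puzzles": .67, "Letter-Number Sequencing": .67,
    "Coding": .62, "Picture Completion": .59, "Symbol Search": .58,
    "Cancellation": .44,
}

def composite_g_correlation(loadings):
    """Correlation of an unweighted subtest composite with g,
    assuming a one-factor model: r_ij = g_i * g_j, unit variances."""
    s = sum(loadings)
    ss = sum(x * x for x in loadings)
    variance = len(loadings) + (s * s - ss)  # diagonal + off-diagonal
    return s / math.sqrt(variance)

top4 = sorted(g.values(), reverse=True)[:4]
print(round(composite_g_correlation(top4), 2))
print(round(composite_g_correlation(list(g.values())), 2))
```

Under these assumptions the best four subtests already correlate about .92 with g, against about .96 for all fifteen: roughly the same, with exactly the extra error term that pragmatists accepted.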

So, here is the ultimate intelligence test item for intelligence testers. Can you think of a task which is quicker and easier than the best Wechsler subtests, but has higher predictive utility?

While thinking about that, would you like to take a non-Wechsler vocabulary test, just for private amusement and to provide a quick general intelligence measure that you can keep to yourself?

http://drjamesthompson.blogspot.co.uk/2013/06/vocabulary-humanitys-greatest.html

http://drjamesthompson.blogspot.co.uk/2013/06/shibboleth-test-your-vocabulary-and.html

http://drjamesthompson.blogspot.co.uk/2013/06/test-your-vocabulary-part-2.html

16 comments:

  1. If anyone else wondered what cancellation is:

    "Working within a specified time limit, the examinee scans a structured arrangement of shapes and marks target shapes. This subtest measures processing speed, visual selective attention, vigilance, perceptual speed, and visual-motor ability. The examinee completes this subtest using a response booklet, and not on his or her digital device."

    See: http://www.helloq.com/overview/the-q-interactive-library/wais-iv.html

    ReplyDelete
  2. nice g-loadings! the wechsler is overkill - vocab & sims together would themselves yield a "mini-verbal composite" with a reliability of about .96 or .97 -- but testers who don't really understand math would say "how can we get a good verbal measurement with only 2 subtests?" Block design & matrix reasoning end up on the same nonverbal factor simply b/c there isn't a more spatial measure in the mix for block design to run off with & form a spatial-ish factor. but those 4 subtests together would yield an IQ that would have a reliability of .98 or so, so the other high-g tasks are superfluous:) that's a pretty high g for digit span, but then i remembered the wechsler does digits forward AND digits backward, with the latter of course using more 'g.' one gets better diagnostic info by separating the high g from the low g stuff...

    some achievement tests predict achievement/performance as well or better than the wechsler :) so maybe a reading comprehension test? a quick little achievement test? or a well-normed Matrices type test with good (fairly equal) item gradients? might depend on what we wanted to predict:) the ASVAB predicts pretty darn well (tho it underestimates bright people who have deficits in processing speed [verbal or symbol/nonverbal] b/c it's a timed paper-pencil [or computer] test.)

    PS - the publishers want there to be some "pieces & parts" to buy - so they can justify selling a kit of stuff, but not so many that it gets unmanageable. the WJ-III was brave to just do it as an easel (but many decried the lack of hands-on stuff)

    ReplyDelete
    Replies
    1. Thanks. Problem with short form is that some clinicians still put together subscales on the basis of two subtests or a selection of optional subtests, and then argue that they have found high abilities in one area and suspiciously low abilities (deficits caused by the presumed index event) in other areas. Neurological therapy will of course be required for many months to cure the problem thus identified.

      Delete
  3. The General Social Survey includes a very short intelligence test, something like 8 questions long. Have you heard of that? How good is it compared with a "real" intelligence test?

    ReplyDelete
  4. Wordsum. Useful results, though one of the items, in my view, is a bit questionable. Improved versions should be used in all psychological research.

    ReplyDelete
  5. A very short vocabulary test, which correlates 0.71 with IQ, is the ten word test in the General Social Survey (US). Can something so crude yield interesting results? Yes. Razib Khan has a very informative post on this. In my view, no survey should be conducted without including a test like this, which provides a very good estimate of intelligence.

    http://blogs.discovermagazine.com/gnxp/2012/04/verbal-intelligence-by-demographic/#.Ubcj6fnR2So

    ReplyDelete
    Replies
    1. Good link. A nice exercise would be to see how well just the first five of the Wordsum items correlate with IQ. But it seems the data is from a 1980 paper and probably not available.

      Delete
  6. What are your thoughts on the 12 minute Wonderlic?

    ReplyDelete
    Replies
    1. Pretty good. Only came across it recently, so don't have personal experience of using it, but it certainly goes in the right direction, and is the sort of test which should be used routinely with subjects in psychology experiments.

      Delete
  7. Do those g loadings come from a first-principal factor? The g loadings look higher...Gc type tests will have higher loadings when it is a first-principal factor. Matrix Reasoning or Arithmetic typically have the highest g loadings (Vocab is close typically though).
    Analysis-Synthesis from the WJ III has a high g loading. I don't know about the predictive validity.

    ReplyDelete
    Replies
    1. Haven't got the manual with me this weekend, but the text says "the factor loading on the first unrotated factor provide information about g, or general intelligence".

      Delete
    2. And they go for a 4 factor model (which may or may not be right). I think I prefer principal components analysis. Fewer arguments.

      Delete
  8. Again, have you ever looked at any publications about the CRT? It meets the criteria you are looking for in being extremely fast (like 5 minutes) and correlates highly with various other psychometric tests and specific constructed manifolds labeled as g.

    As for a quick measure of actual general intelligence, well it's not like Wechsler and everything else used in the field don't have problems there too.

    ReplyDelete
    Replies
    1. Don't like it at all. Doubt it has acceptable psychometric properties across the intelligence range. Too tricky. For real results every item should have a 50% pass rate throughout the intelligence range. For other reasons see http://drjamesthompson.blogspot.co.uk/2014/02/the-many-headed-hydra-of-alternate.html

      Delete
    2. Yes, but you can note that correlations to g in peer reviewed papers are as high as you see in anything else out there (e.g. WORDSUM)

      I'd say that both are terrible proxies for general intelligence. So the point is more that you can't pick and choose blindly.

      Delete
  9. Hello Dr. Thompson,

Are you familiar with the tests produced by Xavier Jouve and the Cerebrals organization? They can be found at http://www.cerebrals.org/wp/?page_id=27. I've taken some of the tests there like the Cerebrals Cognitive Ability Tests (CCAT) and the Jouve-Cerebrals Crystallized-Educational Scale (JCCES), and the test creator has documentation showing the g-loadings, reliability, and validity of each test. They seem accurate to me based on my scores on other standardized tests, but I was wondering if you thought they were any good as far as Internet tests go.

    ReplyDelete