Thursday 27 February 2014

A few points of feedback

Can I just check a few issues about the blog?

Does the “Follow by email” function work? I assumed it would be helpful, but as far as I can see from the summary statistics, only 22 people are registered. If you have tried to register and encountered a problem, I will try to fix it.

Does the search function work? It had a phase a month or two back when there seemed to be a problem, but it seems to have fixed itself.

Last, can you find the topics you are looking for? I don’t particularly want to move to another platform, but if topic search is difficult, I might have to consider it.

And, of course, any other feedback you might like to give, particularly about topics which need comment.

Wednesday 26 February 2014

Intelligence tests test intelligence testers


Testing intelligence used to be a simple business. The patients used their wits, and the psychologists used their instruction manuals. Some instruction and practice was required, because psychologists had to learn the instructions to be given for each test at every stage, including the prompts; learn how to record the answers and also time them with an old mechanical stopwatch; do all this when the material in front of you on the patient’s side was upside down and left right inverted; record any response which was out of the ordinary, and keep the patient cheerful and engaged throughout. To help you, the presentation booklets had discreet little numbers for you to read, if only to check that you were presenting the right problem. There were also recording forms to jot down the results, and prompts about how many failures patients were allowed before you moved briskly to the next subtest.

Block design, object assembly, picture completion and picture arrangement all required some kit, which had a tendency to get battered or lost. Coding required a form on the back of the test record booklet, and a cardboard overlay to mark the results quickly.

A mechanical stopwatch, I should explain, was a large, heavy, metallic chronometer which never ran out of batteries, and was easy to use. Multiple lap time analysis was not an option, nor were nano-seconds, so error rates in recording times were low. More sophisticated testers were provided with a chronometer wrist watch, so that timing could be done discreetly, without the person noticing it and getting too anxious. I was taken on a special journey by my boss to a specialist watch shop in the City of London in order to get the numbered chronometer placed on my wrist. It was a Moeris Antimagnetic, Swiss made watch, and it still works well.

A psychologist of modest intellect could be trained to use all these materials in a matter of weeks, and then they were tested on a patient or two by a senior psychologist, after which they were considered competent to begin their testing careers.

In the old days of testing, psychologists tested lots of people, so they started taking short cuts. They boiled the instructions down to the sensible minimum, having found out that the basic short form generally made more sense than the elaborate longer one. Then they started cutting out tests, on the “bang for your buck” basis. Bluntly, how long does it take to get a result out of each subtest? Some are easy to give and score. Others require setting out lots of material, gathering it back again, and require you to work through complicated scoring systems. Those tests tended to be left in the bottom drawer of the desk. Psychologists may be dull but they are not stupid.

Eventually researchers worked out statistically derived short forms in which 4 key subtests gave roughly the same result as the full 10. Roughly. Any psychologist who was in a hurry plumped for those. Of course, the error term was larger, but pragmatism ruled. As a consequence, a very large number of intelligence test results are not done properly in that they are not based on the full test. It is hardly surprising that scores on later re-testing may differ, particularly when psychologists pick and choose which tests to include out of the 10, according to their own interests and theories. Short form testing also increases the apparent variability of the results, leading some gullible psychologists into thinking that it was wrong to calculate an overall result, when in fact that overall result had higher validity. Nobody gets round sampling errors, not even the Spanish Inquisition.

When new tests came on the market they usually  provided extremely interesting approaches with extremely complicated and bulky equipment. Take Digit Span, for example, which tests short term memory. This now comes in a more complicated form, but might be useful. Then, in the Wechsler memory tests, someone decided to have a sequence tapping test of “spatial memory”. You were required to set out the provided array of blue blocks welded onto a plastic tray, and then tap out a sequence of moves with your finger, which the patient had to copy. No problem when one or two moves were required. However, when the tapping sequence was 7 different positions long, it was difficult to be sure one had tapped out the sequence correctly, and then baffling when the patient tapped out the sequence back again so quickly that you could not be sure you had recorded it correctly. That test has been quietly dropped. One cannot have the punters showing up the psychologists.

However, the search for the quick test that gives a valid result continues. The task is not a trivial one. Here are the g loadings of the Wechsler Adult Intelligence Scale subtests, simply as a guideline on the competitive psychometric landscape which confronts any developer of a new intelligence subtest. These are taken from Table 2.12 of the WAIS-IV manual.

Vocabulary .78  Similarities .76   Comprehension .75   Arithmetic .75  Information .73  Figure Weights .73   Digit Span .71 are all good measures of g.

Block Design .70 Matrix Reasoning .70 Visual Puzzles .67 Letter-number sequencing .67 Coding .62 Picture Completion .59 Symbol Search .58 are all fair measures of g.

Cancellation .44 is a poor measure of g but remains one of the optional subtests.

These subtests, particularly the top 7, are good measures of g, and none of them takes more than 10 minutes, most of them less. They provide plenty of psychometric bang for your testing-time buck. With a bit of practice in memorising the scoring criteria, you can almost mark up the vocabulary score as you go along.

So, here is the ultimate intelligence test item for intelligence testers. Can you think of a task which is quicker and easier than the best Wechsler subtests, but has higher predictive utility?

While thinking about that, would you like to take a non-Wechsler vocabulary test, just for private amusement and to provide a quick general intelligence measure that you can keep to yourself?

Sunday 23 February 2014

Multiple strengths, multiple weaknesses


Test construction used to be a sober business. Every ten or fifteen years a new version of an established test would come out, and test-giving psychologists bought the new version. They often complained about it, on the grounds of cost, and on the grounds of having to learn new material when they had grown familiar with the old version, and knew all its characteristics intimately. They also noticed the occasional improvement.

The benefits of sticking to what you know were explained to me by an anaesthetist years ago. He told me that out of the many anaesthetics on the market he restricted himself to the best three, and mostly used just one. In this way he learnt everything he could about how it worked and how his patients reacted to it. By paying great attention to the patient’s medical history, and also the family history of allergies, he learned how to anaesthetise his patients. In special cases he would sometimes use the other two anaesthetics, singly or in combination. In that way he got good results, and fewer nasty surprises.  For that reason, I would rather have a skilled surgeon with a somewhat blunt knife, than expose myself to a very sharp knife in the hands of a dull surgeon.

Even the old Stanford Binet, which bundled together verbal and non-verbal items and which the Wechsler tests rendered obsolete, could, in the hands of a skilled practitioner, give you the child’s intellectual level very quickly, and pretty accurately. Testers knew which items to skip. They were the pioneers of dynamic testing, now the domain of computer administered tests.

Of course, new tests kept coming out, and they often caused some excitement. There were tests which tested things which had never been tested before. Tests which tested old things in new ways. Tests which tested new things in a new way and displayed the results in new ways. (The latter were particularly popular.) Also, many, many tests which were not intelligence tests. They were tests of learning “styles”, creativity, and sundry other abilities, but they were not intelligence tests. Or so the description of the test asserted.

Test publishers quickly realised that they made profits every time they published a new test (particularly if you had to go on a training course to give the test). Once that new test became even slightly popular, new test enthusiasts could silence any critic by saying “Have you done the training?”. In that way anyone with critical faculties was either rendered mute, or had to pay for both test and training, after which some conceded that one of the subtest items was passable. Of course, you may rightly wonder what sort of psychologist needs to be trained to give a test, when every single examination of a person’s mental capabilities has to follow the same inevitable steps: item construction, piloting, item selection, test explanation, scoring systems, standardisation, item analysis, construction of scaled scores, construction of group ability scores, construction of overall ability scores, and so on. By all means give psychologists a very good grounding in psychometrics, even if it takes six months to a year, but if you have to have a training course for each test, then both the test and the psychologist are in difficulties.

Naturally, most of these tests fell by the wayside. There is a positive manifold in human skills, so whatever the label on the test, the test-taker’s brain has to solve the items, and general intelligence comes into play. Once you have a measure of general intelligence and, say, two group factors, you get diminishing returns when you push further into special skills, and you also tend to pick up more error variance.

Now a few words about the grand-daddy of individually administered “clinical” intelligence tests, the Wechsler tests. They are the gold standard for intelligence testing. First of all, they are pretty good, which is why they got their dominant position. The material was sensibly selected and well organised. It provided a full scale IQ based on 10 subtests and two well known and well understood sub-scales, Verbal and Performance, each composed of 5 of the subtests. Sales were brisk, and most clinicians could call on roughly 40 years of data on a common basis, with only minor up-dating each re-standardisation. For once psychology was moving towards replicability of results, and you could even begin to do comparisons across generations, all of whom had done roughly the same test, and whose full scale IQs were directly comparable.

So, Wechsler decided to mess everything up. They broke the 10 subtests into 4 subscales, composed of 3 or 2 subtests each. You do not have to know much about sampling theory to realise that the error term for each will be wider than for a subscale of 5 subtests. Two subtests do not a factor make. The search for apparent detail in factors came at a cost in terms of accuracy. When you make allowance for the reality that many psychologists do not give the full test, you have a prescription for psychologists writing long reports about many subscale results, with fragile support for their interpretations. There are now 4 subscales about which one can speculate, where formerly there were 2. Good for business. Your chance of finding a special strength (my client’s genius has been under-estimated) or a special weakness (my client’s genius has been damaged by your negligence) has been doubled at a stroke. Good for business.
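For readers who like to see the sampling argument in numbers, here is a rough sketch using the textbook Spearman-Brown formula. It assumes the subtests are parallel and that each has a reliability of .80; that figure is purely illustrative and is not taken from any Wechsler manual, and the function names are my own.

```python
import math

# Hypothetical single-subtest reliability, chosen only for illustration.
ASSUMED_SUBTEST_RELIABILITY = 0.80


def spearman_brown(single_reliability: float, k: int) -> float:
    """Reliability of a composite of k parallel subtests (Spearman-Brown)."""
    r = single_reliability
    return k * r / (1 + (k - 1) * r)


def standard_error(reliability: float, sd: float = 15.0) -> float:
    """Standard error of measurement on a scale with the given SD."""
    return sd * math.sqrt(1 - reliability)


for k in (2, 3, 5, 10):
    rel = spearman_brown(ASSUMED_SUBTEST_RELIABILITY, k)
    sem = standard_error(rel)
    print(f"{k:2d} subtests: reliability {rel:.2f}, SEM about {sem:.1f} IQ points")
```

Under these assumptions the error band for a 2-subtest "factor" is roughly twice as wide as for the full 10-subtest score. Real subtests are not strictly parallel, so treat the exact figures as indicative only.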

Then, add in a few other tests which are g loaded, but not as g loaded as intelligence tests. For example, tests of memory. Even here, Wechsler has designed memory tests with known correlations with intelligence, so you can calculate how big a discrepancy has to be before one can argue that a client’s memory is poorer than their intelligence would predict. There is wiggle room, but not much. However, there are other tests of memory, so those can be used in addition with more chance of finding apparent discrepancies.

Then add in several other tests with variable g loadings. Tests of executive function are the most popular. With each additional test your chance of finding a significant discrepancy rises. By giving the percentile rank for each test result you can convince almost every reader of the report that the person’s abilities are highly variable. You can then use this variability (based on improper sampling) to argue that calculating a full scale IQ would be “meaningless”. Like adolescent poets, such clinicians adore meaninglessness. Of course, no attainment result is meaningless. Each contributes information about the person’s abilities, and also contains an error term. The trick is to maximise the former and decrease the latter. Pooling the result of several well-sampled tests helps achieve that.
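The arithmetic behind "your chance of finding a significant discrepancy rises" can be sketched in a few lines. This assumes each comparison is independent and is judged at the conventional 5% level; real test scores are correlated through g, so the true figures will differ somewhat, but the direction of travel is the same. The function name is my own invention.

```python
def chance_of_false_discrepancy(n_comparisons: int, alpha: float = 0.05) -> float:
    """Probability of at least one 'significant' discrepancy arising by
    chance alone, assuming independent comparisons each tested at alpha."""
    return 1 - (1 - alpha) ** n_comparisons


for n in (1, 2, 4, 10, 20):
    p = chance_of_false_discrepancy(n)
    print(f"{n:2d} comparisons: {p:.0%} chance of at least one spurious discrepancy")
```

On these assumptions, a report that makes 10 comparisons has about a 40% chance of containing at least one discrepancy that is nothing more than noise.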

Now, the Wechsler team are not responsible for the plethora of special tests, but their foray into “factors” based on 3 or 2 subtests was not a good precedent. It has led to a confusion among some clinical psychologists about the factorial structure of intelligence. Wechsler must have gambled that producing a large number of factor scores on the basis of a small number of subtests was what the market wanted, and they relied on the professionalism of testers to give the full 10 tests, and then give precedence to the best founded score, which is Full Scale intelligence.

The current situation is like inviting every Olympic athlete to compete in the decathlon, but then allowing them to drop some of the events and to ask for prizes on the basis of a quasi-random selection of their best 2 or 3 events. The decathlon is what the Wechsler test required: 10 core tests for a full result. We should return to that simple standard if, like trying to find the best all-round athletes, we want to find the best all-round minds.

Saturday 22 February 2014

Sorry I can’t talk at the moment


At intelligence conferences there are two sorts of participants: pure researchers in intelligence (the vast majority) and researchers who continue to actually test people’s intelligence in face to face assessments (a very small number of individuals). Aristocrats of intellect and peasants of psychometry. I am in the latter category, but I cannot tell you anything about it at the moment, because I am busy giving some tests.

Except that…… it never ceases to amaze me how an individual responds to test problems, and exactly how, and at what stage, and with what sorts of materials, they run into difficulties. You watch a person sail through the easy items (separate post needed on what makes an item easy) and then suddenly you find them pausing, struggling, and with any luck overcoming the difficulties they encounter with more difficult items (separate but related post needed on what makes an item difficult). Watched closely, you can see when people run out of old solutions and then, perhaps a little later, you can see when they run out of the capacity to generate new solutions (intelligence). Encountering more problem than intelligence is a humbling moment for all of us.

Kohs blocks are the best (in their modern incarnation of Block Design), because you can sometimes see subjects battling with a schema which is wrong in scale, or orientation, or in internal logic, and then having to take it down and try again.

With every item you see fascinating issues about the level of complexity, the types of answer required in that minority of subtests which require scoring guidelines, the types of errors generated (subtle or not so subtle misunderstandings or ambiguities) and when the person was raised in another culture, speaking another language, there is also a host of interesting questions about the comparability of translation and standardisation. Ceiling effects on skills like mechanical reading, and much higher ceiling effects on reading comprehension raise the tantalising question as to where harder and harder comprehension tests morph into tests of verbal intelligence.

Of course, giving a face to face intelligence test also tests the intelligence of the tester. Sorry I can’t talk at the moment.

Tuesday 18 February 2014

A nice bunch of flowers


Grasp a bunch of flowers in your hand, making sure you hold them towards the bottom of the bunch so that it splays out in a pleasing fashion, and you are well on your way to winning a lady’s heart, and to understanding Spearman’s law of diminishing returns.

The general factor of intelligence is strongest at lower levels of intelligence. It may be a case of “All neurones to the pump”. When abilities are low, most problems are difficult. In such cases, all resources have to be thrown at the problem. When abilities are higher there is more spare capacity for differentiation of abilities. Brighter persons have a lower proportion of their abilities accounted for by a common factor, even though they have higher absolute abilities.

So, if we stick to the flowers analogy in this post-Valentine’s day phase, the flowers of intellect of less able persons are tightly held together. The vector of “flowerness in common” runs from the bottom of the bunch of flowers to about two thirds up the bunch. In bright persons “flowerness in common” runs from the bottom to about one third up the bunch.

So, if you confine your studies of human beings to university students, not only will you misrepresent average mental abilities, but you will also diminish your measures of g, and be ever more likely to find apparently new, disparate mental abilities.

So, psychologists should study people who are not at university. I suppose that, more self-servingly, they might argue that everyone must go to university, simply to provide them with more representative study samples. However, the main effect of being at university is inebriation and delusions of adequacy, so it would probably be better to avoid university students altogether.

Flowers generally work, though.

Can young Woodley make a comeback?


You will recall that his opponents had left young Woodley (he of the Edwardian frock coat and the gold pince-nez glasses) in a dazed condition after four rounds of gruelling counter argument.

I now hear from his training camp that the young challenger is about to enter the ring again, claiming he will be able to give “a big response”. Is this the bravado of youth, or has he found the new punches with which to vanquish his fearsome opponents? I will be wagering a whole five guineas on Woodley. By the way, the young challenger has a new slogan he is likely to chant at the weighing in ceremony: “The Victorians were still cleverer than us”.

More news exclusively from this blog, as soon as the reply paper gets the final finishing touches.

Thursday 13 February 2014

Types of Psychologist


I refer here to empirical psychologists who lecture and publish, which is how we learn about their work and opinions. Historically, there have been 6 types of psychologists:

1 Inventors. They find a new process.

2 Masters. They combine a number of such processes, and use them as well as or better than the inventors.

3 The Diluters. They came after the first two kinds, and couldn’t do the job quite as well.

4 Researchers without salient qualities, who operate at a time when the research process generally is in good working order, and so add to the body of knowledge, without outstanding achievements.

5 Fancy researchers. They don’t really invent anything, but specialise in research which is on the light side, but is done with a flourish, in a fine, well-written, somewhat fancy but limited way.

6 Starters of crazes. They re-label other research in a temporarily appealing way.

I would like to hear what you think of this characterization. I would be fascinated if you could provide some names of psychologists to place in each of the categories. Candidly, where would you place yourself? Finally, I would commend those of my readers who can tell me from which author and book I cribbed the list, which I have altered and re-labelled somewhat.

Tuesday 11 February 2014

The many-headed Hydra of alternate intelligences


Some stories never die. They serve a purpose: to distract, explain away, assuage a fear or, in this particular case, to make us feel better about ourselves. It is a variant of the seductive story that the examiners did not mark your papers correctly, and that other examiners would have rated you more highly. This is always true, to some extent, because we can all shop around for an assessment which gives us more flattering results, like choosing the best photo and discarding the unfavourable depictions.

Last March I posted an item about “tests of rationality” being championed in a science magazine, which tried to generate interest by talking about “popular stupidity”.

Now in “The Psychologist”, a magazine published by The British Psychological Society, Keith Stanovich and Richard West have written an article “What intelligence tests miss” suggesting that intelligence tests neglect to measure “rationality”. They are trying to create a test of rationality using Kahneman and Tversky’s problems, together with others collected by the late lamented Robyn Dawes and subsequently brilliantly dissected by Gerd Gigerenzer. This latest escapade strikes me as the recycling of Gardner’s Multiple Intelligences, in the form of: Alternative Intelligences (Seriously and Rationally).

The hidden implication is that if you are smarting at a disappointing result on an intelligence test you might be better off taking a rationality test, which could give you a more accurate, or at the very least broader, assessment of your wide ranging mental skills, not to say your fundamental wisdom.

IQ has gained a bad reputation. In marketing terms it is a toxic brand: it immediately turns off half the population, who are brutally told that they are below average. That is a bad policy if you are trying to win friends and influence people. There are several attacks on intelligence testing, but the frontal attack is that the tests are no good and best ignored, while the flanking attack is that the tests are too narrow, and leave out too much of the full panoply of human abilities.

The latter attack is always true, to some extent, because a one hour test cannot be expected to generate the complete picture which could be obtained over a week of testing on the full range of mental tasks.  However, the surprising finding is that, hour for hour, intelligence testing is extraordinarily effective at predicting human futures, more so than any other assessment available so far. This is not entirely surprising when one realises that psychologists tried out at least 23 different mental tasks in the 1920s (including many we would find quaint today) and came to the conclusion that each additional test produced rapidly diminishing returns, such that 10 sub-tests were a reasonable cut-off point for an accurate measure of ability, and a key 4 sub-tests suffice for a reasonable estimate.

So, when a purveyor of an alternative intelligence test makes claims for their new assessment, they have something of a mountain to climb. After a century of development, intelligence testers have an armoury of approaches, methods and material they can bring to bear on the evaluation of abilities. New tests have to show that they can offer something over and above TAU (Testing As Usual). Years ago, this looked like being easy. There was still so much unexplained variance in ability that there was great confidence in the 60s that personality testing would add considerable explanatory power. Not so. Then tests of creativity were touted as the obvious route to a better understanding of ability. Not so. Then multiple intelligences, which psychology text books enthusiastically continue touting despite the paucity of supportive evidence. Not so. Then learning styles. Not so. More recently, emotional intelligence produced partial results, but far less than anticipated. Same story for Sternberg’s practical intelligence. The list will continue, like types of diets. The Hydra of alternative, more sympathetic, more attuned to your special abilities, sparkling new tests keeps raising its many heads.

What all these innovators have to face is that about 50% of all mental skills can be accounted for by a common latent factor. This shows up again and again. For once psychology has found something which replicates!

The other hurdle is that nowadays there are very demanding legal requirements placed upon any test of intelligence. You have to have a proper representative sample of the nation, or nations, in which you wish to give the test. Nationally drawn up samples of 2000 to 2500 are required. Not only that, but you generally have to double sample minorities. You also have to show that the items are not biased against any group. This is difficult, because any large difference between the sexes or races is considered prima facie evidence of bias. Indeed, if there are pronounced, very specific differences between the mental abilities of the sexes or of racial groups, such findings have been discarded for the last 50 years, at least as far as intelligence testing is concerned.

The conceit of the new proposal is that rationality is a different mental attribute to problem solving in the broad sense. The argument is that IQ and rationality results are poorly correlated (.20 to .35) in university students. To my eye, given the restriction of range (even at American universities which take in a broad range of intellects in the first year) this is not a bad finding. I say this because the authors do not yet have a rationality test. They seem to be correlating scores on a many-item IQ test with the scores on a few pass-fail rationality problems. This lumpiness in the rationality measure needs to be sorted out before we can say that the two concepts are independent.
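To give a feel for what restriction of range does to a correlation, here is a small sketch using the standard correction for direct range restriction (often called Thorndike's Case 2). The ratio of population SD to student SD used here, 1.5, is an assumption picked for illustration; the article being discussed does not report one, and the function name is mine.

```python
import math


def correct_for_range_restriction(r_observed: float, sd_ratio: float) -> float:
    """Estimate the correlation in the unrestricted population.

    sd_ratio is the unrestricted SD divided by the restricted SD on the
    selection variable (assumed to be 1.5 below, purely for illustration).
    """
    r, u = r_observed, sd_ratio
    return r * u / math.sqrt(1 + r * r * (u * u - 1))


ASSUMED_SD_RATIO = 1.5  # hypothetical: general population SD vs student SD

for r in (0.20, 0.35):
    corrected = correct_for_range_restriction(r, ASSUMED_SD_RATIO)
    print(f"observed r = {r:.2f} -> estimated unrestricted r = {corrected:.2f}")
```

This deals only with range restriction. The lumpiness of pass-fail items would be expected to depress the observed correlations further, and the sketch makes no attempt to correct for that.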

In fact, when you read their 2009 paper it turns out that they did not give their subjects intelligence tests. They simply recorded what the students told them were their Scholastic Ability Test totals. I don’t wish to be too hard, since of course scholastic ability tests are largely determined by intelligence, but since the authors go on to talk about “what intelligence tests miss” I think they ought to say “what self-reported scholastic achievement test scores miss”. In fact, even that is wrong, because the word “miss” implies a fault in the original aim. So, what they should have called their later book is “some tasks don’t correlate very strongly with what university students self-report about their scholastic achievement test scores”. As you will note, I am in favour of catchy titles.

That aside, the authors note that if the “rationality” task allows you to guide your choices by doing a calculation (deciding which of two trays of marbles has the highest probability of producing a black marble which gets a reward) then the correct choice is made by brighter students (SAT scores of 1174 versus 1137). This test provides only a pass/fail result, like so many of these “rationality” puzzles, so does not easily fit into psychometric analysis.

By now, dear readers, you will have worked out the main difference between intelligence test items and rationality puzzles. The former are worked upon again and again so that they are as straightforward and unambiguous as possible. If a putative intelligence item is misleading in any way it gets dropped. Misleading items introduce error variance and obscure the underlying results. Also, if particular groups are more likely to be misled, then their lawyers can argue that the item is unfair to them. All those contested items do not make it to the final published test.

Rationality puzzles, on the other hand, can be as tricky as possible. They are not “upon oath”. If a particular symbol or word misleads, so much the better. If the construction draws the reader down the wrong path, or sets up an incorrect focus of attention, that is all part of the fun. Gigerenzer did some of the best work on this. He looked at the base rate problem beloved of previous investigators, and at all the difficulties caused by percentages with decimal points and all the rest of it, and then proposed a solution (this is unusual for psychologists). He tested his proposed solution (which was to show the problem in terms of natural frequencies, usually on a base of 1000 persons) and found that it got rid of virtually all of the “irrationality” problem. Much of the “irrationality” effect is due to the problem form not being unpackaged properly. This is not a trivial matter, but it is not an insuperable one. For example, consider the question which Stanovich and West give as an example of irrationality.

A bat and a ball cost $1.10 in total. The bat costs $1 more than the ball. How much does the ball cost?

Most people say 10 cents. This makes sense, because this is the usual way you calculate, in that if you spend $1.10 on a bat and a ball, and the bat costs $1.00 then the ball costs $0.10. It is unusual and somewhat bizarre to put in the concept “a certain amount more than another amount”. The usual answer of $0.10 would be right in most circumstances. This is a special circumstance, and very unusual, in that the concept of “$1 more than” is being used in what appears to be a simple calculation. Respondents use the usual format, without noticing the subtle format change. This change means that you have to work out a sum for the bat and ball, so that when you take the cost of the ball from the bat you are left with exactly $1. It cannot be 10 cents, because if you take 10c from $1 you are left with 90c. So, in this case the ball must cost 5c so that when you take 5c from $1.05 you are left with exactly $1. It may strike you as a bit odd, and somewhat tricky and pedantic, and you would not be wrong in making this judgment.
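For anyone who wants to check the sums, the puzzle reduces to two conditions: bat plus ball equals the total, and bat minus ball equals the stated difference. A minimal sketch, working in whole cents to avoid rounding trouble (the function name is simply my own label):

```python
def solve_bat_and_ball(total_cents: int, difference_cents: int) -> tuple[int, int]:
    """Return (bat, ball) in cents, given bat + ball = total
    and bat - ball = difference."""
    ball = (total_cents - difference_cents) // 2
    bat = ball + difference_cents
    return bat, ball


bat, ball = solve_bat_and_ball(110, 100)
print(f"bat = {bat}c, ball = {ball}c")  # bat = 105c, ball = 5c

# The intuitive answer: ball = 10c, so the bat must be 100c to total 110c.
intuitive_ball = 10
intuitive_bat = 110 - intuitive_ball
print(f"intuitive difference = {intuitive_bat - intuitive_ball}c")  # 90c, not 100c
```

The intuitive answer satisfies the first condition and fails the second, which is the whole trick.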

In this particular case the question might be recast as follows.

A bat and a ball cost $1.10 in total. The bat costs $1 more than the ball, meaning that when you take the cost of the ball away from the cost of the bat you are left with exactly $1. How much does the ball cost?

Even the extra explanation might not do the trick, because the usual subtraction sum is uppermost in people’s minds, but they are not being irrational when they make the mistake. They fall for a trick, but they can learn the trick if they have to, or if it seems likely to be useful in the future. In my view the real world implications of this finding are almost zero, other than to highlight how some subtleties and ambiguities lead us astray (and are best avoided in standard examinations). As a sideline, if an aircraft cockpit contains similar ambiguities, they can be lethal, and must be removed for safety reasons.

Similarly, as already discussed, Dawes’ base rate problem disappears when you use natural frequencies. Gigerenzer likened it to being confused about the colour of a car seen under sodium floodlights at night in a car park. In the daytime the usual colours were visible again. Strange problem formats (mathematical notation, symbolic logic notation, percentages which include decimal points, decimal points with many zeros, relative versus absolute risks, complicated visual displays in aeroplane cockpits, poorly set out controls in cars) impose an additional load on understanding. Most respondents take a short cut. As a rule of thumb, if you need lots of special training to operate a system, it is badly designed for humans.
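A small sketch may help show why the natural frequency format is easier. The figures below (a 1% base rate, an 80% hit rate and a 10% false alarm rate) are made up for illustration and are not taken from Dawes, Gigerenzer, or Stanovich and West. The same problem is worked first as percentages and then as head counts on a base of 1000 persons.

```python
# Hypothetical figures, chosen only to illustrate the two formats.
BASE_RATE = 0.01         # 1% of people have the condition
HIT_RATE = 0.80          # 80% of those with it test positive
FALSE_ALARM_RATE = 0.10  # 10% of those without it also test positive


def posterior_from_percentages(base_rate, hit_rate, false_alarm_rate):
    """Probability format: Bayes' rule applied to rates."""
    true_pos = base_rate * hit_rate
    false_pos = (1 - base_rate) * false_alarm_rate
    return true_pos / (true_pos + false_pos)


def natural_frequencies(population, base_rate, hit_rate, false_alarm_rate):
    """Natural frequency format: count whole persons in each group."""
    affected = round(population * base_rate)
    affected_positive = round(affected * hit_rate)
    unaffected_positive = round((population - affected) * false_alarm_rate)
    return affected_positive, unaffected_positive


p = posterior_from_percentages(BASE_RATE, HIT_RATE, FALSE_ALARM_RATE)
hits, false_alarms = natural_frequencies(1000, BASE_RATE, HIT_RATE, FALSE_ALARM_RATE)

print(f"Percentages: P(condition | positive) = {p:.3f}")
print(f"Out of 1000 persons, {hits} true positives and {false_alarms} false positives,")
print(f"so {hits} out of {hits + false_alarms} positives really have the condition.")
```

Both routes give the same answer, but the second can be read off as a simple head count, which is roughly what Gigerenzer's reformulation achieves.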

The Stanovich and West test of rationality has yet to be constructed, let alone tested on the general population. To show that the test was worth giving it would be necessary to measure what additional benefits it provides over and above Testing As Usual. If the resultant Rationality Quotient proves to be very powerful in predicting human futures, then it can take over the lead position from intelligence testing. What is interesting to me is how much mileage they are getting out of attacking intelligence testing for “what it misses”. All they have done is compare SAT scores with replications of some rationality tests. Described more modestly, I would be on their side, and interested in the results of their replications. They distinguish between the results on different tests, which provides a version of an item analysis. However, they do not show that some tests are better predictors of real life achievements than the SAT scores reported by their university students. And, once again, university students are not the only people in the world, nor are they representative of the mental abilities of the general population. Stanovich and West’s rationality test seems to be a case of premature self-congratulation.

What can one say about a test which has yet to be created, tested, published and compared with established measures of mental ability? Frankly, it would be premature to say anything except: Good luck.