It was usual, in times of old, when Knights came courting a Princess, for the King to set suitors a challenge, usually involving them attacking troublesome neighbours or retrieving stolen property. This was an “all or nothing” measure of the suitor’s capabilities. In psychometric circles no proper clinician would be so crude. We know that tests vary in quality, and that test takers vary in ability, so it is more subtle and more indicative to see at precisely what point a string of successes breaks down into occasional errors and then eventual defeat. Humbling, this experience, to see the chatty and confident applicant, who has sailed through so many items, suddenly confronted with precisely the level of difficulty which fractionally exceeds his ability. Times lengthen, expressions change, and the tester must now encourage and reassure lest the applicant falter. Tests, and life, reveal this highly personal matrix, in which persons are plotted against problems, and the series of ticks gives way to a terminal cross.
Far from each item being just like the others, each has its own peculiarities: a mixture of inherent difficulty and bits of fluff that stick to the pure item, contributing nothing but confusion. Experienced testers quickly spot items which they consider problematical. Most ambiguous items are rejected at an early stage of standardisation, but some marginal ones remain. They are not bad in themselves, but cannot be relied upon so strongly as other, better, items. For example, an item may trigger a misunderstanding which inadvertently causes errors. Also, in a short test the jumps in difficulty level may be too big, so that it would have been better to have more intermediate items. The items are not “wrong” but the overall progression of items is too lumpy (and often results in above-average performers getting less fine-grained results).
Item response theory delves deeply into these matters, looking at how each person responds to each item, thus teaching us much about the person and the individual task. At heart the concept is simple (though the maths, as usual, can get complicated): a general trait is tested at a number of points, and the probability of getting an item right (the pass rate) is a mathematical function of the person characteristics (in this case, the latent factor of problem-solving ability) and the item characteristics (difficulty level, discriminative power). In this theoretical approach the focus of attention is on the item while in classical test theory the focus is on the test as a whole, the total of all the items.
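As a sketch of what such a function can look like, here is one common form, the two-parameter logistic model. The post does not say which model is intended, so this choice is an assumption; the symbols $\theta_i$ (ability of person $i$), $b_j$ (difficulty of item $j$) and $a_j$ (discriminative power of item $j$) are labels chosen for illustration.

```latex
P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + e^{-a_j(\theta_i - b_j)}}
```

When ability equals difficulty ($\theta_i = b_j$) the exponent is zero and the probability of a pass is exactly one half; a larger $a_j$ makes the curve steeper around that point.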
I like studying individual items in cases where there are few of them and much reliance is to be placed on them. Like legal evidence, if much depends on a single witness or a single forensic result, then critical enquiry must be severe. When we have many test items, across a broad range of behaviour, it is still worth studying items just as a check, but they become a matter of indifference. The “indifference of the indicator” in Spearman’s phrase means that the precise content of intelligence tests is unimportant for the purposes of identifying general intelligence, because it enters into the performance of all kinds of tests. If tests (and individual items) do the job of discriminating between persons in a way which has predictive utility then, whatever they are, they are useful.
By the way, far from being an obscure aspect of intelligence testing, IRT applies to all test items, and is mostly used in educational settings to design and evaluate scholastic tests. It helps compare exams in different years, and to build up a bank of test items of known difficulty. It is a far more detailed and subtle approach than just looking at the total test scores, though it requires much more computational work. Any putative new test, of “emotional” intelligence or “rational” intelligence or “multiple” intelligence has to jump through the same hoops. New tests aren’t breakthroughs just because they haven’t been standardised yet.
In crude terms (which is the way I think most of the time) items are most efficient when they have a 50% pass rate. This is the most powerful way to sort out the sheep from the goats. Items with higher pass rates are useful as starter items, encouraging participation in the test. Harder items will have lower pass rates, though these risk discouraging most test takers. So, a test should begin with easy items (pass rates of 100% if possible) then gradually go on to lots of middle ranking items (pass rates of 40-60%) which build up the reliability of the test, and then the final harder items (pass rates of 5-20%) to pick out rarer intellects. Each test taker leaves a pattern of scores. If students who do very well overall have had difficulty with an item in the middle range of difficulty, then that item needs a closer look. Conversely, if a moderately difficult item is a particularly good predictor of later success on harder items, it is a gem worth keeping, and understanding better.
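The crude checks described above can be sketched in a few lines of Python. Everything here is illustrative: the response matrix is made-up data, and the helper names (`pass_rates`, `item_variance`, `rest_score_gap`) are hypothetical, not part of any testing package. The "rest score gap" is just one simple way of spotting an item that strong test takers fail; it is not the only or the standard statistic.

```python
from statistics import mean

# Hypothetical data: rows are test takers, columns are items (1 = pass, 0 = fail).
responses = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 1, 1],  # strongest test taker, yet fails the easy second item
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
]


def pass_rates(matrix):
    """Proportion of test takers passing each item."""
    n_people = len(matrix)
    n_items = len(matrix[0])
    return [sum(row[j] for row in matrix) / n_people for j in range(n_items)]


def item_variance(p):
    """Variance of a pass/fail item with pass rate p; largest when p = 0.5."""
    return p * (1 - p)


def rest_score_gap(matrix, j):
    """Mean score on the *other* items for those who passed item j,
    minus the same mean for those who failed it.
    Returns None when everyone passed or everyone failed the item."""
    passers = [sum(row) - row[j] for row in matrix if row[j] == 1]
    failers = [sum(row) - row[j] for row in matrix if row[j] == 0]
    if not passers or not failers:
        return None
    return mean(passers) - mean(failers)


for j, p in enumerate(pass_rates(responses)):
    gap = rest_score_gap(responses, j)
    # A negative gap means failers outscore passers elsewhere: take a closer look.
    flag = "closer look" if gap is not None and gap < 0 else ""
    print(j, round(p, 2), round(item_variance(p), 2), gap, flag)
```

An item that everyone passes has zero variance and tells us nothing about differences between people, which is why the 100% starter items encourage rather than discriminate. In this toy data the second item gets flagged because the one person who failed it did better on the rest of the test than those who passed it.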
My preference is to have the items plotted out for me so that I can see pass-rates per item. If you work out the item response function you can calculate the probability that a person with a given level of ability will answer correctly.
Here is a slightly more complicated illustration, showing responses to a dichotomous item, usually a multiple choice test item:
The X axis shows item difficulty, and the midpoint b=0.0 corresponds to the average obtained by a representative sample of test takers. Each item can be located on that axis of difficulty, which corresponds to the pass rate of a representative sample. The average corresponds to a pass rate of 50% and it is also the point of maximum discrimination.
The Y axis is the probability of a correct response, and the depicted curvy line is the shape of the actual response function. At the midpoint the probability of a correct response is around 0.6. Simply guessing in this case would give 0.25.
a=1.0 gives you the slope of the function (its capacity to discriminate between test takers).
Of course, this is an idealised function, and all this depends on assumptions being met, that the test-takers don’t confer and that the items are truly independent.
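The numbers quoted for the illustration (a=1.0, b=0.0, guessing at 0.25) are consistent with a three-parameter logistic curve, so here is a minimal sketch on that assumption. The post does not name the model, and the function name `prob_correct` and the parameter name `c` for the guessing level are hypothetical labels.

```python
import math


def prob_correct(theta, a=1.0, b=0.0, c=0.25):
    """Probability of a correct answer under an assumed three-parameter logistic model.

    theta: ability of the test taker (0.0 = average of the reference sample)
    a:     slope, the item's capacity to discriminate between test takers
    b:     difficulty, the item's location on the ability axis
    c:     floor reached by guessing alone (0.25 for four answer options)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Tabulate the curve at a few ability levels.
for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(theta, round(prob_correct(theta), 3))
```

With these values a test taker of exactly average ability (theta equal to b) passes with probability 0.25 + 0.75/2 = 0.625, very low ability levels out at the guessing floor of 0.25, and very high ability approaches 1. Setting c to 0 recovers the simpler case in which the midpoint pass rate is exactly 50%.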
Raven’s Matrices test has response function curves drawn out for every item, and they are broadly similar, though one difficult item is placed too low in the sequence (where it is found to be hard by people of all cultures and genetic groups).
In summary, every problem we ever face is an item on the Matrix, and we get a score on every item, as do all other test-takers. Items vary in their power to discriminate intellects. The task for us is to go as high up the matrix as we can, and stay there for as long as possible, practising the tasks which are possible at that difficulty level. The task for society is to put the best test-takers onto the hardest problems, so that they obtain solutions we can implement.
Jensen argued that success equals ability times effort times opportunity. If any of those three is zero, then zero is the final result. Open societies offer the best opportunities in the world (which is why so many people want to move to them). Given that, effort must be applied to obtain the best out of ability. Effort, or practice, may account for only one third of the variance among those who make the effort, but without effort the final result is always zero.
Find your place on the matrix, and get busy.