Psychological comments: Emil visualises

Friday, 12 June 2015

The statistician A. E. Maxwell used to ask, as I put my head cautiously past his open door and then sat in front of his desk: “Have you plotted the data?” His doctoral thesis consisted of one factor analysis, done by hand, which reportedly took him almost three years. By that time, he had got to know his data.

Brian Everitt, in the room next to Maxwell in the Biometrics Department at the Institute of Psychiatry, used to add: “It is a big limitation of statistics that when you ask a question, you are given a number in reply. You should be given an answer to your question.”

With these paragons in mind it is a delight to be guided to Emil Kirkegaard’s site, where he plots the data and answers questions. Yes, there are some numbers, but they are closely linked to the plotted data, which aids understanding.

http://emilkirkegaard.dk/understanding_statistics/

I know that my esteemed readers might regard all this as old hat, but I think it has great utility.

Restriction of Range

Psychology samples tend to be drawn from college students and, although it may be hard to believe sometimes, they are of above-average intelligence. Even if one excludes only those of below-average intelligence (try it with the slider set at a Z value of zero), that restriction reduces the variance by 63%. In standard present-day university samples, where IQ 115 is the minimum required, variance will be reduced by 80%. In proper, old-style universities, where IQ 130 is the entry requirement, the reduction in variance is 88%. I think this is very important, particularly when some researchers make claims about multiple intelligences based on Ivy League and Oxbridge students. They argue that some particular skill, say gastro-intestinal intelligence, is unrelated to g because the correlation is only 0.18, when in so restricted a sample a correlation of 0.18 in fact means that the general population correlation is very probably a much larger 0.50.
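For readers who want to check these figures without the slider, here is a minimal sketch in Python. It assumes IQ is normally distributed with mean 100 and SD 15, that selection is a hard cut-off on IQ itself, and that the standard Thorndike Case 2 formula is an acceptable way to correct a range-restricted correlation. The function names are mine, not taken from Emil's app. The closed-form results land within a percentage point of the reductions quoted above, and the corrected correlation comes out a little under 0.50.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1


def variance_after_cutoff(z_cut: float) -> float:
    """Variance of a standard normal variable after excluding everyone below z_cut."""
    # Inverse Mills ratio: density at the cut-off divided by the proportion kept.
    lam = STD_NORMAL.pdf(z_cut) / (1.0 - STD_NORMAL.cdf(z_cut))
    return 1.0 + z_cut * lam - lam ** 2


def correct_for_restriction(r_restricted: float, variance_ratio: float) -> float:
    """Thorndike Case 2 correction (assumed applicable here).

    variance_ratio is restricted variance divided by unrestricted variance.
    """
    u = 1.0 / sqrt(variance_ratio)  # unrestricted SD divided by restricted SD
    return r_restricted * u / sqrt(1.0 - r_restricted ** 2 + (r_restricted * u) ** 2)


for iq_cut in (100, 115, 130):
    z = (iq_cut - 100) / 15
    kept = variance_after_cutoff(z)
    print(f"IQ >= {iq_cut}: variance reduced by {100 * (1 - kept):.1f}%")

corrected = correct_for_restriction(0.18, variance_after_cutoff(2.0))
print(f"r = 0.18 among IQ >= 130 corrects to about {corrected:.2f}")
```

The correction only holds if the restricted sample really was selected on the variable in question and the relationship is linear throughout the range, so treat the corrected value as an estimate rather than a measurement.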

Tail effects

“Small differences in means are great at the extremes”

Having repeated the quip, I should have added to it: “and small differences in standard deviations cause large perturbations”. Here it is again, ready for a tweet:

“Small differences in means are great at the extremes and small differences in standard deviations cause large perturbations.”

In this example Emil introduces us to the Blues and the Reds. These two tribes, taken to be of equal size, differ by one standard deviation on a score which is very similar to intelligence, with the Blues ahead. That means that at a threshold of IQ 130 (old-style good university) the ratio of Blue to Red students will be about 17 to 1. That is to say, if entry to such a university is based only on ability, that will be the ratio. If in addition the standard deviation of Red intelligence is a bit narrower (say only 14, and not the usual Blue sd of 15) then the ratio of Blue to Red will be 35 to 1 on intelligence alone. Please stick to Blue and Red, because that makes the concept easier for many people to understand.
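The two ratios can be reproduced with a few lines of Python. This is a sketch under stated assumptions: both tribes are normally distributed and equally numerous, the Blue mean is 100 and the Red mean is 85 (one 15-point standard deviation lower), and the function names are illustrative only.

```python
from statistics import NormalDist


def share_above(threshold: float, mean: float, sd: float) -> float:
    """Proportion of a normally distributed group scoring above the threshold."""
    return 1.0 - NormalDist(mean, sd).cdf(threshold)


def blue_to_red_ratio(threshold: float, red_sd: float = 15.0) -> float:
    """Ratio of Blue to Red above the threshold, for equally sized groups.

    Assumed parameters: Blue mean 100 and SD 15, Red mean 85.
    """
    blue = share_above(threshold, mean=100.0, sd=15.0)
    red = share_above(threshold, mean=85.0, sd=red_sd)
    return blue / red


print(f"Equal SDs of 15: about {blue_to_red_ratio(130):.0f} to 1")
print(f"Red SD of 14:    about {blue_to_red_ratio(130, red_sd=14.0):.0f} to 1")
```

Moving the threshold up or down shows how quickly the ratio changes as one goes further into the tail.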

Regression towards the mean

This has been explained many times, but plotting the data helps. “Regression” implies a process which takes time: some magical shrinking or reversion to a primitive ancestral state. This impression is partly due to psychoanalytic notions about childhood, and partly due to an analogy with the loss of function which is part of ageing. Engaging ideas, but not what is being discussed here. I think I am in favour of the more general title of “errors in repeated measurements”. The simplest verbal explanation is to say that the more often you test someone the less their overall results will be affected by flukes, and if you select people on the basis of extreme scores at first testing, those individuals are unlikely to be so extreme at second testing, just because of the unreliability of testing. Flukes get lost, because they are flukes.

I put in a test-retest reliability figure of 0.8, which corresponds to that observed for Wechsler intelligence subtests. Even in those subtests there will be an apparent regression caused by measurement error. As Emil notes, this may falsely create the impression that a group with low scores has been raised to a higher standard by some educational intervention carried out before they are re-tested. Ideally, one would re-test half the group who had obtained low scores first time round without giving them any educational intervention, in order to find out how much of the “improvement” was mere measurement error.

Even when you set test-retest reliability at 0.93 (true of Wechsler Full Scale IQ with 6 months between test sessions), there is still a small regression slope of –0.07, and there will be quite a few outliers with large apparent changes in ability levels.
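A small simulation makes the same point. The sketch below is not Emil's code: it assumes a simple model in which each observed score is a stable true score plus independent normal measurement error, scaled so that observed scores have mean 100 and SD 15 and the two sessions correlate at the chosen reliability. Under that model the slope of change (second score minus first) on the first score should sit near the reliability minus one, and the people who scored low at first testing "improve" with no intervention at all.

```python
import random


def simulate_retest(reliability: float, n: int = 50_000, seed: int = 1):
    """Return (first, second) observed IQ scores for n simulated people tested twice."""
    rng = random.Random(seed)
    true_sd = 15 * reliability ** 0.5         # spread of stable ability
    error_sd = 15 * (1 - reliability) ** 0.5  # spread of test-day flukes
    first, second = [], []
    for _ in range(n):
        true_score = rng.gauss(100, true_sd)
        first.append(true_score + rng.gauss(0, error_sd))
        second.append(true_score + rng.gauss(0, error_sd))
    return first, second


def change_slope(first, second):
    """Least-squares slope of (second - first) regressed on first."""
    changes = [b - a for a, b in zip(first, second)]
    mean_x = sum(first) / len(first)
    mean_c = sum(changes) / len(changes)
    sxy = sum((a - mean_x) * (c - mean_c) for a, c in zip(first, changes))
    sxx = sum((a - mean_x) ** 2 for a in first)
    return sxy / sxx


def mean_gain_of_low_scorers(first, second, cutoff: float = 85.0) -> float:
    """Average change among people who scored below the cutoff at first testing."""
    gains = [b - a for a, b in zip(first, second) if a < cutoff]
    return sum(gains) / len(gains)


for reliability in (0.8, 0.93):
    first, second = simulate_retest(reliability)
    print(f"reliability {reliability}: "
          f"slope of change on first score {change_slope(first, second):.2f}, "
          f"untreated 'gain' of low scorers {mean_gain_of_low_scorers(first, second):.1f} points")
```

The untreated gain is exactly what the control half of the low-scoring group would reveal in a real study.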

In conclusion, having these interactive visualising tools handy could help you make critical comments when reading 98% of psychology papers.

17 comments:

dearieme, 12 June 2015 at 15:05
In the version of the 11+ that I took, we had our IQs measured on two occasions, a year apart. If a pupil's two scores didn't agree well enough, he would be interviewed by a psychologist. That seems to me to be quite painstaking. Our "attainment" test - i.e. a simple exam in English and arithmetic - was (if I remember rightly) done only once.

And, in the case of many of us, all this effort was put in to assign us to different streams in the same Secondary School. Because in the lower-population parts of the county it would have been daft to have two different Secondary Schools. Just as it would have been daft not to stream us.
Emil OW Kirkegaard, 12 June 2015 at 18:03
I think being able to simulate statistical phenomena with different parameters has greatly increased my own understanding, so I am trying to make this accessible to others who are not able to simulate it themselves.

The scientific publication of the near future will only involve interactive figures. Static figures are a side-effect of sticking to the paper-like PDF format. Publications will be moving to some kind of enhanced format in the future, such as HTML with CSS and JS.

Those who are curious can download the source code for my figures and try playing around with the settings themselves on their own computer (Shiny apps can run locally too).

If anyone has comments/suggestions, they can email me at emil@emilkirkegaard.dk.
EH, 13 June 2015 at 00:59
If regression to the mean is a measurement error effect, given that it is easier to underperform on test questions than to over-perform (more possible wrong answers than right answers) and many extraneous factors such as sleep deprivation and low blood sugar can reduce test performance, while no intervention -- even extensive coaching and stimulants -- will raise scores much, then shouldn't we expect initially high scores to regress to the mean less often than initially low scores "progress" to the mean? In other words, because things that depress scores are more common than things that increase scores, isn't an individual score closer to a minimum estimate than a best-estimate?
Unknown, 13 June 2015 at 23:41
Some of the high scores result from guessing, particularly on multiple choice exams. Some of the low scores arise from trivial misunderstandings. The match between a particular test and the syllabus may be poor for one test and good for another. Unreliable means just that. Against a True Score, actual results can over-estimate as much as under-estimate. Thorndike is good on this topic. See R. L. Thorndike. The concepts of over- and underachievement. Columbia University, 1963.
Unknown, 14 June 2015 at 23:20
Thanks for your comments. I think that clinicians typically see all scores as minimal estimates, and the best scores as true measures of the underlying potential. Who can argue with potential? IRT is a great help in understanding how a test behaves, but the questions you raise go beyond that. Guessing is a bit more powerful than you calculate, because it is usually penalty free, yet some testees won't guess, for personality or cultural reasons. I will find a link to the Flynn Effect paper on guessing.
Unknown, 14 June 2015 at 23:22
Olev Must and Aasa Must continue the Estonian story by looking at guessing behaviour, and find that in some subtests of the Estonian National Intelligence test over the same period 1934 to 2006, adjustments for false-positive answers reduced the rise in test scores. Rapid guessing has risen over time and influenced test scores more strongly over the years. The FE is partly explained by changes in test-taking behaviour over time.
Unknown, 15 June 2015 at 15:10
Brief note on Tail Effects: for universal applicability the figures Emil starts with are given for equal population sizes. To calculate the actual differences between groups in any country the proportions must be adjusted. For example, if the Blue group accounts for about 70% of a population and the Red group 15% of that population, then, as an approximation, the Blue group is 5 times as numerous as the Red group. In that instance the Blue group will still have 2.275013% of their population above the IQ 130 threshold and the Red group 0.1349898% above the threshold, but because of the very much larger Blue population the ratio will be 84 to 1.