Statistician A. E. Maxwell used to say, as I put my head cautiously past his open door and then sat in front of his desk: “Have you plotted the data?” His doctoral thesis consisted of one factor analysis, done by hand, which reportedly took him almost three years. By that time, he had got to know his data.
Brian Everitt, in the room next to Maxwell in the Biometrics Department at the Institute of Psychiatry, used to add: “It is a big limitation of statistics that when you ask a question, you are given a number in reply. You should be given an answer to your question.”
With these paragons in mind it is a delight to be guided to Emil Kirkegaard’s site, where he plots the data and answers questions. Yes, there are some numbers, but they are closely linked to the plotted data, which aids understanding.
http://emilkirkegaard.dk/understanding_statistics/
I know that my esteemed readers might regard all this as old hat, but I think it has great utility.
Restriction of Range
Psychology samples tend to be drawn from college students, and although it may be hard to believe sometimes, they are of above average intelligence. Even if one excludes only those of below average intelligence (try it with the slider set at a Z value of zero), that restriction reduces the variance by 63%. In standard present-day university samples, where IQ 115 is the minimum required, variance will be reduced by 80%. In proper, old-style universities, where IQ 130 is the entry requirement, the reduction in variance is 88%. I think this is very important, particularly when some researchers make claims about multiple intelligences based on Ivy League and Oxbridge students: showing that some particular skill, say gastro-intestinal intelligence, correlates with g at only 0.18 in such a restricted sample in fact means that the general population correlation is very probably a much larger 0.50.
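A rough R sketch (my own illustration, not the code behind Emil's app) reproduces these figures, assuming a normal IQ distribution with mean 100 and SD 15 and using Thorndike's Case II correction for the range-restricted correlation:

```r
# Simulate a population, truncate it at various cutoffs, and see how much variance is lost.
set.seed(1)
iq <- rnorm(1e6, mean = 100, sd = 15)

var_lost <- function(cutoff_z) {
  kept <- iq[iq > 100 + 15 * cutoff_z]        # keep only people above the cutoff
  1 - var(kept) / var(iq)                     # proportion of variance removed
}
round(sapply(c(0, 1, 2), var_lost), 2)        # roughly 0.64, 0.80, 0.89 - close to the figures above

# Thorndike Case II: correct an observed correlation r for direct range restriction,
# where u is the ratio of restricted to unrestricted SD.
correct_r <- function(r, u) r / sqrt(r^2 + u^2 * (1 - r^2))
u <- sd(iq[iq > 130]) / sd(iq)                # selection at IQ 130
round(correct_r(0.18, u), 2)                  # about 0.48, roughly the 0.50 quoted above
```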
Tail effects
“Small differences in means are great at the extremes”
Having repeated the quip, I should have added to it: “and small differences in standard deviations cause large perturbations”. Here it is again, ready for a tweet:
“Small differences in means are great at the extremes and small differences in standard deviations cause large perturbations.”
In this example Emil introduces us to the Blues and the Reds. These two tribes differ by one standard deviation on a score which is very similar to intelligence. That means that at a threshold of IQ 130 (old-style good university) the proportion of Blue to Red students will be about 17 to 1. That is to say, if entry to such a university is based only on ability, that will be the ratio. If, in addition, the standard deviation of Red intelligence is a bit narrower (say only 14, and not the usual Blue sd of 15), then the ratio of Blue to Red will be 35 to 1 on intelligence alone. Please stick to Blue and Red, because that makes the concept easier for many people to understand.
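For the arithmetic, a minimal R sketch (assuming Blue at mean 100 and Red at mean 85, both on the usual IQ scale):

```r
# Proportion of each group above an IQ 130 cutoff, and the resulting selection ratio.
above <- function(cutoff, mean, sd) pnorm(cutoff, mean, sd, lower.tail = FALSE)

blue <- above(130, 100, 15)      # ~0.0228
red  <- above(130,  85, 15)      # ~0.0013
round(blue / red)                # ~17 to 1 with equal SDs

red_narrow <- above(130, 85, 14)
round(blue / red_narrow)         # ~35 to 1 when the Red SD is only 14
```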
Regression towards the mean
This has been explained many times, but plotting the data helps. “Regression” implies a process which takes time: some magical shrinking or reversion to a primitive ancestral state. Partly this is due to psychoanalytic notions about childhood, partly to an analogy with the loss of function which is part of ageing. Engaging ideas, but not what is being discussed here. I think I am in favour of the more general title of “errors in repeated measurements”. The simplest verbal explanation is that the more often you test someone, the less their overall results will be affected by flukes; and if you select people on the basis of extreme scores at first testing, those individuals are unlikely to be so extreme at second testing, just because of testing unreliability. Flukes get lost, because they are flukes.
I put in a test-retest reliability figure of 0.8, which corresponds to that observed for Wechsler intelligence subtests. Even in those subtests there will be an apparent regression caused by measurement error. As Emil notes, this may falsely create the impression that a group with low scores has been raised to a higher standard by some educational intervention carried out before they are re-tested. Ideally, one would re-test half of the group who had obtained low scores first time round without giving them any educational intervention, in order to find out how much of the “improvement” was mere measurement error.
Even when you set test-retest reliability at 0.93 (true of Wechsler Full Scale IQ with 6 months between test sessions), there is still a small regression slope of –0.07, and there will be quite a few outliers with large apparent changes in ability levels.
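A small simulation of the classical true-score model shows the effect. This is only an illustrative sketch, not Emil's app code, and it assumes the selected group is everyone scoring below IQ 85 on the first test:

```r
# Observed score = stable true score + random error; reliability fixes the split of variance.
set.seed(42)
simulate_retest <- function(reliability, n = 1e5) {
  true <- rnorm(n, 100, 15 * sqrt(reliability))            # stable component
  obs1 <- true + rnorm(n, 0, 15 * sqrt(1 - reliability))   # first testing
  obs2 <- true + rnorm(n, 0, 15 * sqrt(1 - reliability))   # second testing, no intervention
  low  <- obs1 < 85                                        # group selected on low first scores
  round(c(first = mean(obs1[low]), retest = mean(obs2[low])), 1)
}
simulate_retest(0.80)   # the low group "gains" several IQ points from measurement error alone
simulate_retest(0.93)   # a smaller, but still visible, spurious improvement
```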
In conclusion, having these interactive visualising tools handy could help you make critical comments when reading 98% of psychology papers.
In the version of the 11+ that I took, we had our IQs measured on two occasions, a year apart. If a pupil's two scores didn't agree well enough, he would be interviewed by a psychologist. That seems to me to be quite painstaking. Our "attainment" test - i.e. a simple exam in English and arithmetic - was (if I remember rightly) done only once.
And, in the case of many of us, all this effort was put in to assign us to different streams in the same Secondary School. Because in the lower-population parts of the county it would have been daft to have two different Secondary Schools. Just as it would have been daft not to stream us.
I think being able to simulate statistical phenomena with different parameters has greatly increased my own understanding, so I am trying to make this accessible to others who are not able to simulate it themselves.
The scientific publication of the near future will only involve interactive figures. Static figures are a side-effect of sticking to the paper-like PDF format. Publications will be moving to some kind of enhanced format in the future, such as HTML with CSS and JS.
Those who are curious can download the source code for my figures and try playing around with the settings themselves on their own computer (Shiny apps can run locally too).
If anyone has comments/suggestions, they can email me at emil@emilkirkegaard.dk.
Thanks for this kind offer. Hope the post directs traffic to your website.
If regression to the mean is a measurement error effect, given that it is easier to underperform on test questions than to over-perform (more possible wrong answers than right answers) and many extraneous factors such as sleep deprivation and low blood sugar can reduce test performance, while no intervention -- even extensive coaching and stimulants -- will raise scores much, then shouldn't we expect initially high scores to regress to the mean less often than initially low scores "progress" to the mean? In other words, because things that depress scores are more common than things that increase scores, isn't an individual score closer to a minimum estimate than a best-estimate?
Well, the idea is testable. See if you can dig up a fairly large dataset with a good measurement of cognitive ability that has data from two testings. Then we can see if there is more regression towards the mean on either side. I will do the analysis if you can find the data for me.
Thanks. Maybe SAT would be a good set? But the problem is that people who get low scores are much more likely to try again and to get coaching and practice for the second try than are people who get high scores. It's hard to think of a highly g-loaded test that large numbers of people take repeatedly. Perhaps very similar tests such as the PSAT, SAT, and GRE would work. The ETS (maker of those tests) isn't known for releasing raw data, though. I'll try to see what I can find.
Emil and EH:
NLSY97 has a fair number of observations for the same individuals taking the ASVAB/AFQT, PIAT math, SAT, ACT, PSAT, and some other more content-laden tests (e.g., SAT II, AP, etc.). I know this because I did some brief (casual) analysis of this very recently, albeit not to assess that particular question per se. One thing I did pick up in something I started working on earlier (and some other analysis), though, is that even controlling for earlier tests, high SES groups tend to outperform low SES groups (depending on test design and age of administration). I interpret that phenomenon as being less the result of measurement error per se than the fact that their underlying group IQs are different and heritability of intelligence increases with age....
- FDM
Thanks for pointing it out, FDM - the NLSY is the mother lode of data.
The transcript SAT data didn't seem to have any repeat ID codes, but I'll keep looking.
Some of the high scores result from guessing, particularly on multiple choice exams. Some of the low scores arise from trivial misunderstandings. The match between a particular test and the syllabus may be poor for one test and good for another. Unreliable means just that. Against a True Score, actual results can over-estimate as much as under-estimate. Thorndike is good on this topic: see R. L. Thorndike, The Concepts of Over- and Underachievement, Columbia University, 1963.
Well, yes, some scores can be a bit over due to guessing, but the likelihood of getting several multiple-choice questions right by pure guessing is quite low, ~0.2^n. The likelihood of getting a string of questions wrong when the test-taker doesn't care and answers randomly would be much higher, ~0.8^n.
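For illustration, the arithmetic with five-option items and purely random answering:

```r
# Chance of a run of n multiple-choice items all right vs. all wrong under random guessing.
n <- 1:5
round(rbind(all_right = 0.2^n, all_wrong = 0.8^n), 4)
# all_right falls off fast (0.2, 0.04, 0.008, ...); all_wrong stays high (0.8, 0.64, 0.51, ...)
```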
Item response theory (IRT) gets into this in more detail, ranking questions' difficulty on the same measure as the test-takers' ability scores, but also rating questions by how sharply they distinguish between levels of ability, as well as how easy it is to guess their answers with no knowledge. Verifying that questions measure what they purport to measure is a difficult thing, and separate from the other qualities of questions, but IRT gives some tools that help with that, too.
Another subject related to regression to the mean is high-range testing.
If we make a scatter plot of two administrations of tests measuring the same thing with a horizontal and a vertical cutoff line at the same high z-score for each test, and the distributions for both tests are normal, it is certain that fewer points will be above both lines than are above one line or the other. This has been used to argue that multiple near-ceiling scores on different tests can be combined to effectively make a new test that has a higher ceiling than either test alone. Yet it seems to me that neither test has questions of the difficulty needed to yield the higher score, and all that can be inferred from multiple near-ceiling scores is that the subject's true score is at least the indicated scores on the individual tests.
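A quick simulation illustrates the first point, assuming (for the sake of the example) that the two administrations correlate at 0.8 and that the cutoff is z = 2 on both:

```r
# Two correlated test scores per person; count who clears a high cutoff on either vs. both.
set.seed(7)
n <- 1e6; r <- 0.8
true  <- rnorm(n, 0, sqrt(r))
test1 <- true + rnorm(n, 0, sqrt(1 - r))
test2 <- true + rnorm(n, 0, sqrt(1 - r))
cut <- 2
c(above_either = mean(test1 > cut | test2 > cut),
  above_both   = mean(test1 > cut & test2 > cut))   # above_both is always the smaller share
```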
I found Concepts of Over and Underachievement online. It's over 100 pages, so I haven't gotten all the way through it.
One early quote with which I disagree:
p.7: "Since the error of measurement is conceived of as a random, chance variate associated only with a particular testing, it is also necessarily unrelated to any other measure, such as a measure of achievement."
What it is "conceived of as" isn't what it is. Ceiling effects are errors of measurement that depend on scores, being negligable at the mean and high near the hard ceiling of any subtest. Errors are not always random. A test repeatedly administered in a sweltering classroom or first thing in the morning to a non-morning person will have similar errors every time.
It is a matter of definition. What is normally called measurement error (e) in the models is defined as random error. Random means that it is not correlated with anything.
What you are talking about is systematic error, which is error that is correlated with something, e.g. language bias on a test.
Thanks for your comments. I think that clinicians typically see all scores as minimal estimates, and the best scores as true measures of the underlying potential. Who can argue with potential? IRT is a great help in understanding how a test behaves, but the questions you raise go beyond that. Guessing is a bit more powerful than you calculate, because it is usually penalty free, yet some testees won't guess, for personality or cultural reasons. I will find a link to the Flynn Effect paper on guessing.
I think you're right about clinicians, whereas those who only work with data and statistics tend to have a more idealized and abstract point of view.
IRT has an anti-guessing technique based on (roughly) not counting correct answers on harder questions after a number of easier questions have been missed. You would have to tolerate a longer run of missed questions if their ability to discriminate is poor (the probability of a correct answer vs. ability is a soft logistic transition instead of a crisp step function). If I remember correctly, the chance of correct guessing shows up as an offset in the question's transition graph, so the floor is not zero, but the probability of randomly guessing correctly.
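A sketch of the three-parameter logistic (3PL) item curve makes that offset concrete; the parameter values here are arbitrary examples:

```r
# 3PL item characteristic curve: a = discrimination, b = difficulty, c = guessing floor.
icc_3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

theta <- seq(-3, 3, by = 1)
round(icc_3pl(theta, a = 1.5, b = 1, c = 0.2), 2)
# even far below the item's difficulty, the probability of a correct answer never drops below c = 0.2
```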
Olev Must and Aasa Must continue the Estonian story by looking at guessing behaviour, and find that in some subtests of the Estonian National Intelligence test over the same period 1934 to 2006, adjustments for false-positive answers reduced the rise in test scores. Rapid guessing has risen over time and influenced test scores more strongly over the years. The FE is partly explained by changes in test-taking behaviour over time.
Brief note on Tail Effects: for universal applicability, the figures Emil starts with are given for equal population sizes. To calculate the actual differences between groups in any country, the proportions must be adjusted. For example, if the Blue group accounts for about 70% of a population and the Red group 15% of that population, then an approximation will be that the Blue group is 5 times as numerous as the Red group. In that instance the Blue group will still have 2.275013% of their population above the IQ 130 threshold and the Red group 0.1349898% above the threshold, but because of the very much larger Blue population the ratio will be about 84 to 1.
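In R, the weighting is a one-liner (a sketch, using the same normal assumptions as before):

```r
# Weight each group's tail proportion by its share of the whole population.
p_blue <- pnorm(130, 100, 15, lower.tail = FALSE)   # 2.275% of Blues above IQ 130
p_red  <- pnorm(130,  85, 15, lower.tail = FALSE)   # 0.135% of Reds above IQ 130
(0.70 * p_blue) / (0.15 * p_red)    # ~79 exactly; the rounded 5-to-1 size ratio gives the ~84 quoted above
```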
Actually, the app has been updated to allow for unequal population sizes. :)