Can you remember back in ancient history when school exam questions said: “Compare and contrast”? I found this philosophically interesting, in that I was tempted to compare and contrast the epistemological foundations of comparing and contrasting. More to the point, can you remember back to your undergraduate days when you learnt that each contrast and comparison used up some of your luck? I have put this in a dramatic and personal form to capture the dismay I felt when I understood that at least one of the positive t test results I had so painfully calculated was probably a fluke. I decided it was always the twentieth one which had led me astray, the early ones having first-mover advantage in capturing the explanatory narrative, and becoming cherished for ever after, the first-born causes.
The problems of multiple contrasts arise in any even mildly complicated data set. Consider a test with 100 items in which you choose to compare each item with every other item in a t test: that is 4,950 comparisons, and at the conventional 5% significance level roughly 250 of them can be expected to come up significant by chance alone. Doing multiple comparisons will throw up many spurious results, and you won’t know which is a false positive and which is true.
Now consider a test with 1000 items: nearly half a million pairwise comparisons, and a correspondingly large number of errors of identification. There are ways of correcting for these multiple comparisons and contrasts (Bonferroni adjustment being the simplest), but they are always something of a patch and fix. The better strategy is to increase sample size.
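To put numbers on this, here is a minimal sketch in plain Python (the parameters are illustrative) of how quickly pairwise t tests multiply, and how punishing the Bonferroni correction becomes:

```python
from math import comb

alpha = 0.05  # conventional per-test significance level

for n_items in (100, 1000):
    n_tests = comb(n_items, 2)       # all pairwise comparisons
    expected_false = alpha * n_tests  # expected false positives if no real effects exist
    bonferroni = alpha / n_tests      # Bonferroni-corrected per-test threshold
    print(f"{n_items} items: {n_tests} tests, "
          f"~{expected_false:.0f} chance 'positives' at alpha={alpha}, "
          f"Bonferroni threshold {bonferroni:.2e}")
```

With 1000 items, each individual test must clear a threshold of about one in ten million before the correction lets it through, which is why such corrections feel like a patch and fix.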
The genome has a very large number of “scores” of interest, some more obvious to identify and measure than others. Deciding what is score and what is junk is not a trivial matter. Finding false positives is easy; finding true positives which replicate is much harder. James Lee from the University of Minnesota told me in 2009 that his preliminary estimate of the required sample sizes suggested that 100,000 was a likely starting point for dependable results, but that it could be higher. A few years is a long time in genomic analysis, but now Steve Hsu has been thinking about this and has published his conclusions, naming James Lee as one of the researchers whose work has influenced him.
I describe some unpublished results concerning the genetic architecture of height and cognitive ability, which suggest that roughly 10k moderately rare causal variants of mostly negative effect are responsible for normal population variation. Using results from Compressed Sensing (L1-penalized regression), I estimate the statistical power required to characterize both linear and nonlinear models for quantitative traits. The main unknown parameter s (sparsity) is the number of loci which account for the bulk of the genetic variation. The required sample size is of order 100s [i.e. 100 × s], or roughly a million in the case of cognitive ability.
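Compressed sensing here means recovering a sparse vector of effect sizes from far fewer measurements than there are candidate variables, using L1-penalized (Lasso) regression. A toy sketch of the idea, with all sizes illustrative and scikit-learn’s Lasso standing in for the machinery in the paper:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

p, s = 5000, 20          # candidate loci; truly causal loci (the sparsity s)
n = 100 * s              # sample size of order 100 * s, as in the paper

beta = np.zeros(p)
beta[rng.choice(p, s, replace=False)] = rng.normal(size=s)  # sparse true effects

X = rng.normal(size=(n, p))              # genotype-like design matrix
y = X @ beta + 0.5 * rng.normal(size=n)  # phenotype = linear model + noise

fit = Lasso(alpha=0.05, max_iter=5000).fit(X, y)
recovered = np.flatnonzero(fit.coef_)
print(f"true causal loci: {s}, loci recovered by Lasso: {recovered.size}")
```

Note that n = 2000 is well below p = 5000, yet the L1 penalty can still pick out the handful of truly causal loci; that is the sense in which the required sample size scales with the sparsity rather than with the number of candidates.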
The paper is attractive for covering the background to the genetics of intelligence in a clear and succinct format. Steve Hsu talks about the reduced cost of sequencing the genome, which is speeding up research; the heritability of intelligence; the Flynn effect; exceptional intelligence; and additive genetic models.
One might say that to first approximation, Biology = linear combinations of nonlinear gadgets, and most of the variation between individuals is in the (linear) way gadgets are combined, rather than in the realization of different gadgets in different individuals.
I like the word gadgets. That is the sort of genetics I understand. Alleles be damned.
Pairs of individuals who were both below average in stature or cognitive ability tended to have more SNP changes between them than pairs who were both above average. This result supports the assumption that the minor allele (–) tends to reduce the phenotype value. In a toy model with, e.g., p = 0.1, N = 10k, an individual with average phenotype would have 9k (+) variants and 1k (–) variants. A below average (−3 SD) person might instead have 1100 (–) variants, and an above average individual (+3 SD) 900 (–) variants. The typical SNP distance between genotypes with 1100 (–) variants is larger than that for genotypes with 900 (–) variants, as there are many places to place the (–) alleles in a list of 10k total causal variants. Two randomly chosen individuals will generally not overlap much in the positions of their (–) variants, so each additional (–) variant tends to increase the distance between them.
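The distance argument is easy to check by simulation. A rough sketch, assuming the toy model above (counts of (–) variants fixed at 1100 versus 900, positions drawn at random among 10k causal loci):

```python
import numpy as np

rng = np.random.default_rng(0)
n_loci = 10_000  # total causal variants in the toy model

def random_genotype(n_minus):
    """Genotype as a 0/1 vector: 1 marks a (-) variant at that locus."""
    g = np.zeros(n_loci, dtype=int)
    g[rng.choice(n_loci, n_minus, replace=False)] = 1
    return g

def mean_distance(n_minus, trials=200):
    """Average SNP (Hamming) distance between random pairs carrying n_minus (-) variants."""
    return np.mean([np.sum(random_genotype(n_minus) != random_genotype(n_minus))
                    for _ in range(trials)])

print("below-average pairs (1100 minus variants):", mean_distance(1100))
print("above-average pairs (900 minus variants): ", mean_distance(900))
```

Because two random carriers of 1100 (–) variants rarely place them at the same loci, the below-average pairs come out measurably farther apart, just as the quoted passage argues.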
The basic calculation of how much any species can be improved underlies the work of animal and plant breeders. As the leading population geneticist James Crow of Wisconsin wrote:
The most extensive selection experiment, at least the one that has continued for the longest time, is the selection for oil and protein content in maize (Dudley 2007). These experiments began near the end of the nineteenth century and still continue; there are now more than 100 generations of selection. Remarkably, selection for high oil content and similarly, but less strikingly, selection for high protein, continue to make progress. There seems to be no diminishing of selectable variance in the population. The effect of selection is enormous: the difference in oil content between the high and low selected strains is some 32 times the original standard deviation.
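The calculation behind such experiments is the breeder’s equation, R = h² × S: the response per generation equals heritability times the selection differential. A toy illustration, with heritability and selection intensity chosen purely for illustration rather than taken from the maize experiment:

```python
# Cumulative response under the breeder's equation R = h^2 * S.
# Parameters are illustrative, not estimates from Dudley (2007).
h2 = 0.3          # assumed narrow-sense heritability of the trait
intensity = 1.0   # assumed selection differential, in phenotypic SDs per generation
generations = 100

gain_per_gen = h2 * intensity           # expected response per generation (SDs)
total_gain = gain_per_gen * generations
print(f"expected shift after {generations} generations: {total_gain:.0f} SD")
# With these assumed values the shift is ~30 SD, the same order as the
# ~32 SD high-low divergence Crow reports for oil content in maize.
```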
Hsu’s point is to show that as regards intelligence, humans have not reached their upper limit.
His section on compressed sensing is interesting, but I cannot judge it, so I leave that to you, dear reader. However, Hsu is clear that a sample size of a million persons will be required. On the upside, that should lead to genetic predictions of IQ accurate to about 8 IQ points. It would also lead to parents being able to choose the brightest of their fertilized eggs. Interesting times.
From the purely scientific perspective, the elucidation of the genetic architecture of intelligence is a first step towards unlocking the secrets of the brain and, indeed, of what makes humans unique among all life on earth.