Hi all,
I was studying genome-wide association studies, and I was having trouble with a problem the professor discussed. He said that if you genotype 1 million SNPs (to determine whether any are associated with a disease phenotype) and choose a significance level of 0.05, you would expect about 50,000 false positives by random chance alone.
I've been told that the significance level equals the false-positive rate (i.e., # false positives / (# false positives + # true negatives)). So it would make sense that if none of the 1 million SNPs were linked to the disease, a false-positive rate of 5% means that 5% of those 1 million SNPs, or 50,000, would be identified as positive anyway.
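To check the arithmetic for myself, here is a minimal Python sketch. It assumes every SNP is truly unassociated and that each test's p-value is then uniformly distributed between 0 and 1 (the standard assumption under the null, not something stated in the lecture). The function name and the seed are just my own choices.

```python
import random

def count_false_positives(n_snps, alpha, seed=0):
    """Count null SNPs whose simulated p-value falls below alpha."""
    rng = random.Random(seed)
    # Assumption: under the null hypothesis, each p-value is Uniform(0, 1).
    return sum(1 for _ in range(n_snps) if rng.random() < alpha)

n_snps = 1_000_000
alpha = 0.05

expected = n_snps * alpha                        # 5% of 1 million
observed = count_false_positives(n_snps, alpha)  # one simulated study

print(f"expected false positives: {expected:.0f}")
print(f"simulated false positives: {observed}")
```

The simulated count should land close to the expected count, since every one of the 1 million tests has a 5% chance of crossing the threshold by chance.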
The issue is that I'm not sure why the significance level equals the false-positive rate. My interpretation was: a SNP either is or is not associated with the disease, but all SNPs have the same probability of appearing associated with the disease by chance (assuming the null hypothesis of no linkage is true). If a SNP appears associated with the disease AND the probability of that association occurring by chance is less than 5%, the SNP is identified as positive. However, there is still a chance of up to 5% that this positive SNP is actually a false positive; therefore, 5% of all positively identified SNPs are false positives. But my interpretation doesn't work, because it would mean that 5% of positive SNPs are false, not that 5% of all SNPs (whether positively or negatively identified by this test) are false positives.
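To pin down the two quantities I keep mixing up, here is a small Python sketch using expected counts. The 100 truly associated SNPs and the 80% detection rate are made-up numbers purely for illustration, not anything from the lecture. It computes the false-positive rate, FP / (FP + TN), and the share of positive calls that are false, FP / (FP + TP).

```python
n_snps = 1_000_000
alpha = 0.05   # significance level
n_true = 100   # hypothetical: number of truly associated SNPs
power = 0.8    # hypothetical: fraction of true associations detected

n_null = n_snps - n_true   # SNPs with no real association
fp = alpha * n_null        # expected false positives
tn = n_null - fp           # expected true negatives
tp = power * n_true        # expected true positives

# False-positive rate: false positives among all truly null SNPs.
fpr = fp / (fp + tn)
# Share of positive calls that are false: false positives among all positives.
false_share = fp / (fp + tp)

print(f"false-positive rate:           {fpr:.4f}")
print(f"share of positives that are false: {false_share:.4f}")
```

With these made-up inputs the first ratio is 5% by construction, while the second comes out far higher, so the two are clearly not the same quantity.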
I'm really confused by this, so any help is much appreciated. Thank you for taking the time to read through this.