- When it comes to genetic association studies such as GWAS, correlation really is causation. If a genetic variant is associated with a trait, it or a variant close to it causes the trait, assuming it's not a spurious correlation. Because DNA is read-only, it's not possible that the trait causes the variant. This makes genetics different from e.g. gene expression analysis, where the arrow of causation can point both ways. A differentially expressed gene can cause a disease, but the disease can also cause genes to be up- or downregulated. Typically, you have to do follow-up experiments to determine what's going on. Not so in genetics.
- Genotyping isn't sequencing. Frequently, people will say something like, "we have sequenced those samples" when the samples were really genotyped on a chip. The difference is that genotyping chips are cheap ($200 or less per sample) and typically produce data on several hundred thousand known genetic variants. Sequencing is more expensive (more than $1,000 per sample) and produces data on almost all the variants in the genome, including those that haven't been observed before. Unlike genotyping chips, sequencing also delivers data on structural variants such as insertions, deletions and copy number variation.
- Knowing the causal variant isn't the same as knowing the causal gene. Most of the human genome doesn't code for proteins, and it's not clear what it actually does, or if it does anything important at all. The majority of variants that have been associated with traits and diseases are not located in the coding parts of genes either. For those variants it's difficult to tell how they exert their effect. Some that are known to change gene expression are called expression quantitative trait loci or eQTLs. For those variants that aren't eQTLs, people often assume that one of the genes encoded in their vicinity is the causal one.
- Knowing the gene isn't knowing the effect direction. Even if you know through which gene a variant exerts its effect, you still don't know in which direction the effect goes. Take the example of a genetic variant that has two alleles, G and T. Assume the G allele is the risk allele for a disease, and it's located in the intron of a gene. This does not immediately tell you if decreased gene function is associated with higher or lower disease risk. Again, eQTLs come to the rescue, as they will tell you if the risk allele is associated with higher or lower gene expression, which are reasonable proxies for increased and decreased gene function, respectively.
- Genotypes are discrete, phenotypes often aren't. A genetic variant typically has several genotypes. The example variant from the previous paragraph with the two alleles G and T will, in a diploid organism like humans, have three genotypes: G/G, G/T and T/T. It may therefore be tempting to assume that genetic variants are great biomarkers, as they will unambiguously show if the trait associated with the variant is present or not, maybe with heterozygotes being something in between. Unfortunately, this is often not the case, especially for complex diseases that have many variants associated with them. Each of these variants contributes to disease risk only a little bit, and as a result, individual variants aren't very informative.
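
To put rough numbers on that last point, here's a minimal sketch of how little a single variant can separate the three genotypes. Everything in it is an assumption made up for illustration: the 10% baseline risk for T/T carriers, the per-allele odds ratio of 1.2 for the G allele, the log-additive risk model, and the helper name `genotype_risks`. None of it comes from a real study.

```python
def genotype_risks(baseline_risk, odds_ratio_per_allele):
    """Disease risk for carriers of 0, 1 and 2 copies of the risk allele G.

    Assumes a log-additive model: each copy of G multiplies the odds of
    disease by the same (hypothetical) per-allele odds ratio.
    """
    baseline_odds = baseline_risk / (1 - baseline_risk)
    risks = {}
    for genotype, n_risk_alleles in (("T/T", 0), ("G/T", 1), ("G/G", 2)):
        odds = baseline_odds * odds_ratio_per_allele ** n_risk_alleles
        risks[genotype] = odds / (1 + odds)  # convert odds back to probability
    return risks


# Illustrative values only, not taken from any real association study.
risks = genotype_risks(baseline_risk=0.10, odds_ratio_per_allele=1.2)
for genotype, risk in risks.items():
    print(f"{genotype}: {risk:.1%}")
```

With these made-up inputs, risk only moves from 10% for T/T to just under 14% for G/G. The genotype is perfectly discrete, but most carriers of the risk genotype still never get the disease, and plenty of non-carriers do.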
There's more, but this post is already too long, so I'll save it for another time.