Genotype Less, Impute More, Save Money – Part 1

Modern molecular breeding methods such as Marker Assisted Selection (MAS), Trait Introgression (TI), and Genomic Selection (GS), as well as discovery research methods such as quantitative trait locus (QTL) mapping and genome-wide association studies (GWAS), require routine genotyping of many samples. GS and GWAS may benefit from high-density genotyping but at the risk of high genotyping costs. Nevertheless, over the past 20 years GS has proven to be a very effective means for accelerating plant and animal improvement. While the costs of genotyping to enable GS may seem prohibitive, a breeding program without GS may not achieve the desired progress and thus lag behind programs that have adopted GS. Hence, the costs of genotyping must be confronted and tamed.

In general, a breeder has three options to reduce the costs of genotyping their breeding populations: (1) reduce the number of lines genotyped, (2) reduce the number of loci genotyped, and (3) seek a lower-cost genotyping technology platform or service provider. Option 1 is self-defeating for a successful breeding program. In fact, GS allows for a tremendous increase in the overall number of lines evaluated in each cycle. The second option may appear risky and raise fears of undermining the entire approach, yet it is here that real improvements are possible. As for the third, the ongoing rapid pace of technology evolution and intense competition among service providers are lowering costs. More important than price, however, is whether the platform and provider suit the desired marker density and meet requirements for data quality, total throughput, and turnaround time.

Consider option 2, the problem of reducing the number of loci genotyped. I am old enough to have worked with DNA markers from RFLPs through SNPs and spent many days in the lab performing Southern blots and reading autoradiograms for RFLPs. It was a difficult and time-consuming process to generate a picture of a genome produced by meiotic recombination, and the final picture was incredibly incomplete, often only a few markers per chromosome. And what about the extent of genotyping error! As marker systems evolved and improved, we sought more marker observations on each genome. When multiplexed marker technologies became widespread, eventually permitting 10,000 to 100,000 accurate SNP observations per genome, the picture became crystal-clear. Haplotype blocks and recombination breakpoints were visible as they would be in a textbook. But we were over-genotyping and over-spending to reach our objectives. Whereas we had always asked, “How many marker observations can I get?”, we now began to ask, “How few marker observations do I need?”

The challenge is to optimize the SNP loci genotyped for a given objective. Specifically, the goal is to minimize the number of marker loci genotyped for each sample and yet, from these limited genotypic data, obtain an accurate description of each sample genome. The informativeness of an SNP locus is determined by several parameters. It is desirable to target SNP loci that are uniformly distributed over the genome, so the genomic location of each SNP is important. The allele frequency and its distribution across the breeding germplasm are also important. For one breeding population, it only matters that an SNP be polymorphic between the parents. However, assembling a minimal set of SNP loci for several or many breeding populations is a difficult optimization problem. Ideally, one desires a set of SNP loci such that in every population at least a minimum number are polymorphic and uniformly distributed across the genome, while the number of redundant loci genotyped within any one population is kept as small as possible.
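
To make the optimization problem concrete, here is a minimal Python sketch that treats it as a set-cover problem and solves it with a simple greedy heuristic. This is only an illustration under stated assumptions, not the method used in any particular breeding program: the genome is assumed to be divided into bins, a bin counts as covered in a population once one polymorphic SNP falls in it, and the names greedy_snp_panel, snp_bin, and polymorphic_in are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_snp_panel(
    snp_bin: Dict[str, int],
    polymorphic_in: Dict[str, Set[str]],
) -> List[str]:
    """Pick SNPs until every coverable (population, bin) pair has a polymorphic SNP.

    snp_bin maps each candidate SNP to its genome bin (assumed binning).
    polymorphic_in maps each candidate SNP to the populations where it segregates.
    """
    # What each SNP can cover: its own bin, in each population where it is polymorphic.
    covers: Dict[str, Set[Tuple[str, int]]] = {
        snp: {(pop, snp_bin[snp]) for pop in pops}
        for snp, pops in polymorphic_in.items()
    }
    uncovered: Set[Tuple[str, int]] = set().union(*covers.values())
    panel: List[str] = []
    while uncovered:
        # Greedy step: take the SNP covering the most still-uncovered pairs.
        # Iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(covers), key=lambda s: len(covers[s] & uncovered))
        panel.append(best)
        uncovered -= covers[best]
    return panel


# Toy data: four candidate SNPs, two genome bins, two populations "A" and "B".
snp_bin = {"s1": 0, "s2": 0, "s3": 1, "s4": 1}
polymorphic_in = {"s1": {"A", "B"}, "s2": {"A"}, "s3": {"A"}, "s4": {"B"}}
print(greedy_snp_panel(snp_bin, polymorphic_in))  # ['s1', 's3', 's4']
```

In the toy data, s2 is never selected because s1 already covers the same bin in population A, which is the kind of redundancy the optimization tries to avoid. A greedy heuristic does not guarantee the smallest possible panel, and a real design would also weigh allele frequency and marker quality.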

Another consideration is the limited amount of recombination found within breeding population progeny. Consider the case of a maize bi-parental breeding population of doubled haploids. The maize genome comprises 10 chromosomes and the genetic map length is approximately 2,000 centiMorgans. Because 100 centiMorgans corresponds to one expected crossover, this implies that a single progeny genome will contain about 20 crossovers distributed over 10 chromosomes, an average of 2 per chromosome. These crossovers will define 30 blocks of alternating parental haplotypes distributed over the same 10 chromosomes, an average of 3 per chromosome. Some 100 well-distributed and polymorphic markers will serve to find the parental haplotype blocks and locate the crossovers. Some parental haplotype blocks may be missed and some crossovers displaced, yet the parental origin of most of the genome will be inferred correctly. Admittedly, maize is a simple example. Crops selfed to homozygosity, or parents that are not inbred, will pose complications, but the basic argument remains unchanged: few crossovers and large blocks of parental haplotypes, for which a set of well-distributed and polymorphic SNP loci numbering in the low hundreds will be sufficient to infer the progeny genome.
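
The arithmetic behind the maize example is easy to check. The short Python sketch below reproduces it using the figures quoted above; the function name expected_blocks is hypothetical, and the calculation deals only in expected values, ignoring crossover interference and chromosome-to-chromosome variation.

```python
from typing import Tuple


def expected_blocks(map_length_cm: float, n_chromosomes: int) -> Tuple[float, float]:
    """Expected crossovers and parental haplotype blocks in one doubled-haploid genome."""
    # 100 cM corresponds to one expected crossover per meiotic product.
    crossovers = map_length_cm / 100.0
    # Each chromosome starts as a single block; every crossover adds one more.
    blocks = crossovers + n_chromosomes
    return crossovers, blocks


crossovers, blocks = expected_blocks(2000, 10)
print(crossovers, blocks)  # 20.0 30.0

# With about 100 well-distributed markers, each block holds a few markers on average.
markers_per_block = 100 / blocks
print(round(markers_per_block, 1))  # 3.3
```

An average of roughly three markers per block is why a panel in the low hundreds can locate most blocks, while short blocks that fall between markers may still be missed.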

Studies examining the number of genotyped loci necessary for accurate GS in crops are reaching general agreement. Fewer than 5,000 loci are usually sufficient, with 1,000 – 3,000 most common. Although these numbers are much larger than asserted above, the gap may be easily bridged by imputation. Imputation is a method that uses observed alleles from a small number of SNP loci to infer the allele states of a much larger set of unobserved loci. While this may seem like creating data out of thin air, imputation of genotypic data goes back at least 20 years. It is often used in human genetic research and animal improvement programs, but less so in crop breeding. This should not be the case. The factor that allows unobserved loci to be imputed from observed loci is the linkage disequilibrium between loci within a relevant population, which in the case of crops is typically the breeding germplasm.
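
For intuition only, here is a deliberately simplified Python sketch of imputation on one chromosome of a bi-parental doubled-haploid line. It is not a population-based method driven by linkage disequilibrium estimates. It assumes the parental origin ("A" or "B") is known at a few genotyped markers, and assigns each ungenotyped locus the origin of its flanking markers, falling back to the nearer marker when the flanks disagree. The function name and all positions are hypothetical.

```python
from bisect import bisect_left
from typing import List


def impute_parent_of_origin(
    marker_pos: List[float],
    marker_parent: List[str],
    target_pos: List[float],
) -> List[str]:
    """Infer parental origin at target positions from sparse markers on one chromosome.

    marker_pos must be sorted ascending (for example in cM);
    marker_parent holds "A" or "B" for each genotyped marker.
    """
    imputed: List[str] = []
    for pos in target_pos:
        i = bisect_left(marker_pos, pos)
        if i == 0:
            # Target lies at or before the first marker: copy that marker.
            imputed.append(marker_parent[0])
        elif i == len(marker_pos):
            # Target lies beyond the last marker: copy that marker.
            imputed.append(marker_parent[-1])
        else:
            left, right = i - 1, i
            if marker_parent[left] == marker_parent[right]:
                # Flanking markers agree, so assume no crossover between them.
                imputed.append(marker_parent[left])
            else:
                # A crossover lies somewhere between the flanks: use the nearer marker.
                d_left = pos - marker_pos[left]
                d_right = marker_pos[right] - pos
                imputed.append(marker_parent[left if d_left <= d_right else right])
    return imputed


# Four genotyped markers with one crossover between 50 and 100 cM.
origins = impute_parent_of_origin(
    [0.0, 50.0, 100.0, 150.0], ["A", "A", "B", "B"], [10.0, 60.0, 90.0, 120.0, 160.0]
)
print(origins)  # ['A', 'A', 'B', 'B', 'B']
```

Once the parental origin at a locus is inferred, the allele follows from the known genotype of that parent at the same locus. The loci at 60 and 90 cM illustrate where errors arise: the true crossover position within the interval is unknown, so a denser panel narrows the uncertain region.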

Implementing imputation in a breeding program is usually a three-step process. First, the breeding germplasm is genotyped at a large number of SNP loci, or sequenced to obtain SNP locus data, and the linkage disequilibrium between pairs of loci is determined. Second, the targets for SNP genotyping are selected by solving the optimization problem described above, and the breeding populations are genotyped at these target SNP loci by a genotyping service provider. Third, data at the unobserved (ungenotyped) loci are imputed using the data from the observed (genotyped) loci and the previously determined linkage disequilibrium between loci. Although the optimization problem as described above does not mention it, the optimization should also take into account the desired number of imputed loci and the desired accuracy of the imputations.
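
The first step depends on a measure of linkage disequilibrium between pairs of loci. The text above does not specify which statistic is used, so the sketch below assumes one common choice, r-squared, computed as the squared correlation between two loci across a panel of inbred lines whose alleles are coded 0 or 1. The function name r_squared and the toy genotypes are hypothetical.

```python
from typing import Sequence


def r_squared(x: Sequence[int], y: Sequence[int]) -> float:
    """Squared correlation between two loci scored 0/1 across the same inbred lines."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        # A monomorphic locus carries no information about its neighbors.
        return 0.0
    return cov * cov / (var_x * var_y)


locus_1 = [0, 0, 1, 1]
locus_2 = [0, 0, 1, 1]  # always inherited together with locus_1
locus_3 = [0, 1, 0, 1]  # independent of locus_1

print(r_squared(locus_1, locus_2))  # 1.0
print(r_squared(locus_1, locus_3))  # 0.0
```

A value near 1 means one locus predicts the other almost perfectly, so only one of the pair needs to be genotyped and the other can be imputed. A value near 0 means the observed locus says little about the unobserved one.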

This concludes Part 1 of this blog, which highlighted the high cost of genotyping to enable GS, and outlined a potential solution using SNP optimization and data imputation. In Part 2, to follow, we explore the quality of imputed data and its impact on the accuracy of predictions produced by GS.