Genotype Less, Impute More, Save Money – Part 2

In Part 1 of this two-part blog we discussed a challenge that can limit the implementation of genomic selection (GS) in plant and animal improvement programs: the high cost of genotyping. We also outlined a potential solution, SNP marker minimization and optimization combined with imputation of ungenotyped SNP loci. Here we tackle the quality of imputed data and address the factors that limit the accuracy of predictions in genomic selection. Imputed data quality is only one of those factors, and it need not be a significant one compared with the others.

Imputed data will never achieve 100% accuracy, but neither is SNP genotyping 100% accurate. All genotyping technologies produce a low but non-zero error frequency. There are several methods available to estimate the accuracy of imputed data, and the inaccuracy of imputation may or may not be significant against the background of genotyping error. Regardless of whether the data are observed or imputed, one cannot assess the significance of an error frequency without taking into account the objective for which the data were obtained. For example, in a Trait Integration (TI) program, which transfers a novel trait into a new genetic background by backcrossing, there might be a very low tolerance for erroneous data near the trait locus. In GS, on the other hand, the tolerance for errors may be much higher. Why?

The answer is that in crop breeding, GS is used primarily as a means of enrichment. In a conventional program, a breeder might make 40 parent combinations and generate 50 progeny per combination, and then field-test all 2,000 progeny before making selections. In contrast, with GS a breeder might generate 500 progeny per combination (20,000 in total), genotype and rank the progeny based on predicted performance, and then field-test the top 10%, or 2,000. The prediction accuracy is not perfect, but the 2,000 progeny chosen by GS are likely to be better overall than a random sample of 2,000. The field test thus contains 2,000 progeny enriched from 20,000, a number that could never be field-tested under the typical economic constraints of a commercial breeding program.
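The enrichment argument can be illustrated with a toy simulation. This is only a sketch under stated assumptions: true genetic values are drawn from a standard normal distribution, and the prediction accuracy of 0.5 (the correlation between predicted and true value) is a hypothetical figure chosen for illustration, not one taken from any breeding program. The function name and parameters are likewise made up for this example.

```python
import random
import statistics

def simulate_enrichment(n_candidates=20_000, n_field=2_000, accuracy=0.5, seed=1):
    """Compare a GS-enriched field test with a random sample of the same size.

    accuracy is the assumed correlation between predicted and true genetic
    value (hypothetical). Returns (mean true value of the GS-selected set,
    mean true value of a random set).
    """
    rng = random.Random(seed)
    noise_sd = (1 - accuracy ** 2) ** 0.5

    # True genetic values, unknown to the breeder at selection time.
    true_vals = [rng.gauss(0, 1) for _ in range(n_candidates)]
    # Imperfect predictions whose correlation with the truth equals `accuracy`.
    predicted = [accuracy * t + noise_sd * rng.gauss(0, 1) for t in true_vals]

    # GS: rank all candidates by prediction and keep the top n_field.
    ranked = sorted(range(n_candidates), key=lambda i: predicted[i], reverse=True)
    selected = ranked[:n_field]
    # Baseline: field-test a random sample of the same size.
    random_pick = rng.sample(range(n_candidates), n_field)

    mean_selected = statistics.fmean([true_vals[i] for i in selected])
    mean_random = statistics.fmean([true_vals[i] for i in random_pick])
    return mean_selected, mean_random

enriched_mean, random_mean = simulate_enrichment()
print(f"GS-enriched 2,000: mean true value = {enriched_mean:.2f}")
print(f"Random 2,000:      mean true value = {random_mean:.2f}")
```

Even with a deliberately modest assumed accuracy, the selected set should average well above the random set, which is the sense in which the field test is "enriched". Lowering `accuracy` shrinks the gap but does not remove it until the predictions carry no information at all.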

Furthermore, inaccurate genotypic data, whether observed or imputed, is only one factor that limits the accuracy of predictions. As mentioned earlier, most of the progeny genome is inferred correctly. Far more important for prediction accuracy are the quality of the phenotypic data associated with the training set, the extent to which the training environments of the collected phenotypic data replicate the field-testing environments, and the extent to which the training set is genetically related to the breeding populations under prediction. These three factors exert enormous influence on the accuracy of predictions. Doubling or tripling the observed SNP density will not improve the final predictions by a proportionate amount, nor will it produce a significant change in rankings or in membership of the top 10%.
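To get a feel for how a small genotype error rate affects membership in the top 10%, here is a toy sketch. Everything in it is an assumption made for illustration: the marker effects are random numbers, genotypes are coded 0/1/2 and drawn uniformly, the 2% error rate is hypothetical, and the "prediction" is a simple sum of effect times genotype rather than a trained GS model. It says nothing about the phenotypic and environmental factors described above, which it holds fixed.

```python
import random

def top_fraction(scores, fraction=0.10):
    """Return the indices of the highest-scoring fraction of candidates."""
    k = int(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def membership_overlap(n_progeny=2_000, n_markers=200, error_rate=0.02, seed=7):
    """Share of the error-free top 10% that is still in the top 10%
    when each genotype call is wrong with probability error_rate."""
    rng = random.Random(seed)
    effects = [rng.gauss(0, 1) for _ in range(n_markers)]  # assumed marker effects
    clean_scores, noisy_scores = [], []
    for _ in range(n_progeny):
        clean = [rng.choice((0, 1, 2)) for _ in range(n_markers)]
        # With probability error_rate, swap a call for a different genotype.
        noisy = [
            rng.choice([g for g in (0, 1, 2) if g != call])
            if rng.random() < error_rate else call
            for call in clean
        ]
        clean_scores.append(sum(e * g for e, g in zip(effects, clean)))
        noisy_scores.append(sum(e * g for e, g in zip(effects, noisy)))
    top_clean = top_fraction(clean_scores)
    top_noisy = top_fraction(noisy_scores)
    return len(top_clean & top_noisy) / len(top_clean)

print(f"Top-10% membership retained: {membership_overlap():.0%}")
```

Varying `error_rate` shows how quickly, or slowly, the selected set changes as genotype quality degrades under these simplified assumptions.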

Let’s step back at this point and reflect. Genotyping for GS is expensive, yet it must be done. Most implementations of GS have not fully optimized the SNPs targeted for genotyping. Optimization alone can reduce genotyping costs by minimizing the number of SNP loci actually genotyped. Further cost reductions can be achieved by data imputation, which generates data for a much larger number of SNP loci than are actually genotyped. It is important to estimate the accuracy of the imputed data (and of the observed data) as well as the accuracy of the inferred progeny genomes. Keep in mind that in GS the objective is to rank progeny and advance the top-ranked ones. Several factors may have a large impact on the accuracy of predictions and rankings, but after a judicious choice of genotyping and imputation strategies, an imperfect description of the progeny genomes should not be a major one.
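One widely used way to estimate imputation accuracy is to hide a portion of the genotype calls that were actually observed, impute them, and count how often the imputed call matches the hidden one. The sketch below shows the idea only. Genotypes are assumed to be coded 0, 1, or 2, and `fill_with_mode` is a deliberately naive, hypothetical stand-in for a real imputation method, which would typically draw on parental haplotypes and linkage information instead.

```python
import random

def masked_concordance(observed, impute, mask_fraction=0.1, seed=0):
    """Hide a fraction of known genotype calls, impute them, and score agreement.

    observed: list of genotype calls coded 0, 1, or 2
    impute:   callable that takes the masked list (None marks a hidden call)
              and returns a complete list of calls (hypothetical interface)
    Returns the fraction of hidden calls that were imputed correctly.
    """
    rng = random.Random(seed)
    n_hidden = max(1, int(len(observed) * mask_fraction))
    hidden = rng.sample(range(len(observed)), n_hidden)

    masked = list(observed)
    for i in hidden:
        masked[i] = None

    imputed = impute(masked)
    matches = sum(imputed[i] == observed[i] for i in hidden)
    return matches / n_hidden

def fill_with_mode(calls):
    """Naive stand-in imputer: fill each gap with the most common known call."""
    known = [c for c in calls if c is not None]
    mode = max(set(known), key=known.count)
    return [mode if c is None else c for c in calls]

calls = [0] * 70 + [1] * 20 + [2] * 10
print(f"Concordance at masked loci: {masked_concordance(calls, fill_with_mode):.0%}")
```

The same masking test can be applied to any imputation method by passing it in place of `fill_with_mode`, which makes it easy to compare strategies on the same set of hidden calls.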

Finally, I would like to comment on the value of data and the allocation of limited resources. When a breeding population is genotyped for GS, the majority of the genotypic data is used once and discarded. Few progeny are advanced after selection and most are discarded, rendering the genotypic data on these lines disposable. Money spent genotyping these progeny is a cost that should be minimized, because the unproven progeny lines individually represent no value to the program until they have advanced through (usually) several rounds of selection.

On the other hand, spending money to genotype at high density or fully sequence the breeding germplasm is an investment that will deliver returns year after year. The breeding germplasm is generally a set of lines of proven value, some of which may be commercial products, and some of which may be used as parents of future breeding populations. The breeding germplasm has permanence and the data associated with it have long-lasting value. These data may be used to guide the choice of parents to cross and are invaluable if marker optimization and data imputation are to be realized. While it may be obvious that over-spending on unselected breeding population progeny should be avoided, it may be less obvious that some of the money saved would be better invested in the breeding germplasm up front.

In sum, the key to controlling genotyping costs in GS is to obtain less data by genotyping and more data by imputation. The success of GS lies in recognizing how few genotypic data are required. Shift some resources to obtain more genotypic data on the most valuable lines rather than on the discarded progeny, just as is done for phenotypic data. Then sit back, reap the benefits... and smile.
