Detecting Relationships among Genotypes in a Rapidly Growing Collection
To correct pedigree errors and discover genotype misassignments, the Council on Dairy Cattle Breeding in the United States compares each new genotype with existing genotypes. With over 6 million genotypes as of May 2022, this is a computationally demanding task. The process was recently revised to maintain a table of genotype pairs that are similar enough to qualify as having a parent-progeny relationship or as being identical. Those genotype pairs are identified by a unique genotype identifier and thus are unaffected by changes in the assignment of genotypes to animals. Having those pairs substantially reduces processing time when propagating the effects of pedigree or assignment changes on the usability of genotypes. A set of 3,552 SNPs selected based on call rate and Mendelian consistency is used for the comparisons. Determination of the percentage of conflicts stops early, after 96 or 1,000 SNPs, if the members of a genotype pair are unlikely to be related. The memory required to store the set of genotypes being searched is minimized by using just 2 bits per SNP. The time to access those genotypes is minimized by using memory mapping, which effectively makes the disk where the genotypes are stored an extension of memory. New or updated genotypes are compared with a restricted set of genotypes (one per animal) to reduce processing time. All animals with genotyped progeny are checked. Remaining genotypes are compared in birth date order so that no genotypes from animals born more than 12 years earlier are checked. This limit is reduced to 5 years if both parents of the animal are confirmed. Non-AI bulls with no progeny born in the last 5 years are skipped. Initial determination of unlikely grandsires is done using SNP-at-a-time comparisons and the genotype of the other parent (if available) based on the same 3,552 SNPs. During weekly and monthly evaluations, grandsires are validated using imputed haplotype comparisons.
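The storage and comparison ideas described above (2 bits per SNP, memory mapping, and stopping early after 96 or 1,000 SNPs) can be illustrated with a minimal Python sketch. It is not the Council's implementation. The genotype coding (0, 1, 2 allele copies, 3 for a missing call), the definition of a conflict as opposite homozygotes, the rejection limits of 20% and 10%, and every function name are assumptions made for illustration only.

```python
import numpy as np

MISSING = 3  # assumed code for a no-call; 0, 1, 2 count copies of one allele


def pack(genotype):
    """Pack genotype codes (0-3) into 2 bits each, four SNPs per byte."""
    g = np.asarray(genotype, dtype=np.uint8)
    pad = (-len(g)) % 4  # pad to a multiple of 4 with missing calls
    g = np.concatenate([g, np.full(pad, MISSING, dtype=np.uint8)]).reshape(-1, 4)
    packed = g[:, 0] | (g[:, 1] << 2) | (g[:, 2] << 4) | (g[:, 3] << 6)
    return packed.astype(np.uint8)


def unpack(packed, n_snps):
    """Recover the first n_snps genotype codes from a packed byte array."""
    p = np.asarray(packed, dtype=np.uint8)
    codes = np.stack([(p >> s) & 3 for s in (0, 2, 4, 6)], axis=1).reshape(-1)
    return codes[:n_snps]


def conflict_percent(a, b, checkpoints=(96, 1000), limits=(20.0, 10.0)):
    """Percentage of opposite homozygotes among SNPs called in both genotypes.

    Returns None if the pair is rejected at a checkpoint (hypothetical limits)
    or if no SNP is called in both genotypes.
    """
    a = np.asarray(a, dtype=np.uint8)
    b = np.asarray(b, dtype=np.uint8)
    stages = [(k, lim) for k, lim in zip(checkpoints, limits) if k < len(a)]
    stages.append((len(a), None))  # final stage: all remaining SNPs, no limit
    conflicts = usable = start = 0
    for stop, limit in stages:
        x, y = a[start:stop], b[start:stop]
        usable += int(np.count_nonzero((x != MISSING) & (y != MISSING)))
        conflicts += int(
            np.count_nonzero(((x == 0) & (y == 2)) | ((x == 2) & (y == 0)))
        )
        start = stop
        if limit is not None and usable and 100.0 * conflicts / usable > limit:
            return None  # unlikely to be related: skip the remaining SNPs
    return 100.0 * conflicts / usable if usable else None


def write_store(path, genotypes):
    """Write equal-length genotypes, packed, to a flat binary file."""
    packed = np.stack([pack(g) for g in genotypes])
    packed.tofile(path)
    return packed.shape


def open_store(path, shape):
    """Memory-map the packed file so rows are paged in from disk on demand."""
    return np.memmap(path, dtype=np.uint8, mode="r", shape=shape)
```

Under these assumptions, a 3,552-SNP genotype occupies 888 bytes, and a row read from the mapped file can be passed directly to `unpack` and then to `conflict_percent`. A production version would more likely compare the packed bytes directly rather than unpacking each genotype first.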
Reliance on the new procedure for discovering close relatives eliminates the need to access the full genotypes of all animals. Previously, to minimize database access, all genotypes were loaded into memory from a file. Now, only the full genotypes needed to confirm pedigree relationships are retrieved from the database. The genotypes in the database are compressed, which reduces storage by 75%. These modifications allow comprehensive genotype checking while keeping processing time within acceptable limits.