Genotype imputation based on discriminant and cluster analysis

Authors

  • Medhat Mahmoud, Norwegian University of Life Sciences, Ås, Norway
  • Theo Meuwissen, Norwegian University of Life Sciences, Ås, Norway
  • Thore Egeland, Norwegian University of Life Sciences, Ås, Norway

Keywords:

SNP Imputation, Clustering, Linear discrimination

Abstract

The recent development of high-throughput systems for genotyping SNPs in eukaryotes has led to an extraordinary amount of research activity, particularly in areas such as whole-genome selection of livestock and genome-wide association studies for the detection of quantitative trait loci. Recent technological advances allow us to rapidly genotype more than 10 million SNPs in an individual, accounting for 10% of the estimated number of common SNPs (minor allele frequency above 1%) across the population. Because the remaining SNPs are missing, true associations might be missed if the causal SNP is not genotyped or if the causal variant is unknown. SNP imputation is important for reducing the cost of re-sequencing, and for cases where genotyping all animals under consideration is too costly or not feasible because DNA is not available for every animal. Computational algorithms and statistical methods have been developed to account for some of the unobserved variants. The main idea behind these methods is the observation that SNPs in close proximity to one another in the genome tend to be correlated, that is, in non-random association (linkage disequilibrium). Several articles have described comparisons of imputation methods with respect to computational efficiency and the accuracy of results. Consequently, we perceived a substantial need to propose a new technique for SNP imputation that applies linear discriminant analysis and cluster analysis algorithms. To evaluate the factors that potentially affect imputation accuracy rates (ARs), we used simulated data sets to investigate the effects of linkage disequilibrium (LD), minor allele frequency (MAF) of untyped SNPs, marker density (MD), reference sample size (n), and the number of SNPs in each haplotype block on the AR and on the performance of linear discriminant analysis and cluster analysis as SNP imputation methods.
Under optimal conditions of the genotype data (high LD, low MAF, and high-density haplotype blocks), both methods (clustering and discrimination) worked efficiently, and the accuracy reached up to 89%.
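
To make the idea of LD-based imputation concrete, the following is a minimal sketch and not the authors' implementation. It simulates correlated 0/1/2 genotypes, hides one SNP in a study sample, and predicts it from the flanking SNPs with a nearest-centroid rule. That rule corresponds to linear discriminant analysis only under the simplifying assumptions of equal class priors and a shared spherical covariance. All function names, parameters, and the founder-haplotype simulation design are assumptions introduced for illustration.

```python
import random


def simulate_genotypes(n_individuals, n_snps, n_founders=4, noise=0.02, seed=0):
    """Simulate 0/1/2 genotypes whose SNPs are correlated (in LD).

    Assumption: every haplotype is a noisy copy of one of a few founder
    haplotypes, so alleles at neighbouring SNPs are inherited together.
    """
    rng = random.Random(seed)
    founders = [[rng.randint(0, 1) for _ in range(n_snps)]
                for _ in range(n_founders)]
    genotypes = []
    for _ in range(n_individuals):
        geno = [0] * n_snps
        for _ in range(2):  # two haplotypes per diploid individual
            hap = rng.choice(founders)
            for j, allele in enumerate(hap):
                if rng.random() < noise:  # rare allele flip weakens LD
                    allele = 1 - allele
                geno[j] += allele
        genotypes.append(geno)
    return genotypes


def fit_centroids(features, labels):
    """Return the mean flanking-SNP vector for each genotype class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Assign x to the genotype class with the closest centroid."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda y: squared_distance(centroids[y]))


def imputation_accuracy(n_reference=200, n_study=100, n_snps=11, target=5, seed=1):
    """Fraction of masked target genotypes that are imputed correctly."""
    data = simulate_genotypes(n_reference + n_study, n_snps, seed=seed)

    def split(geno):
        # Flanking SNPs are the predictors; the target SNP is the label.
        return geno[:target] + geno[target + 1:], geno[target]

    reference = [split(g) for g in data[:n_reference]]
    study = [split(g) for g in data[n_reference:]]
    centroids = fit_centroids([x for x, _ in reference],
                              [y for _, y in reference])
    hits = sum(predict(centroids, x) == y for x, y in study)
    return hits / len(study)


if __name__ == "__main__":
    print("accuracy rate:", imputation_accuracy())
```

Varying `noise` (a rough proxy for LD), `n_snps` (marker density within the block), and `n_reference` (reference sample size) in this toy setting mirrors the kind of factors examined in the abstract, although the numbers it produces say nothing about the study's actual results.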

Published

2014-06-22