Incorporating Gene Annotation into Genomic Prediction of Complex Phenotypes.
ABSTRACT: Today, genomic prediction (GP) is an established technology in plant and animal breeding programs. Current standard methods are purely based on statistical considerations but do not make use of the abundant biological knowledge, which is easily available from public databases. Major questions that have to be answered before biological prior information can be used routinely in GP approaches are which types of information can be used, and at which points they can be incorporated into prediction methods. In this study, we propose a novel strategy to incorporate gene annotation into GP of complex phenotypes by defining haploblocks according to gene positions. Haplotype effects are then modeled as categorical or as numerical allele dosage variables. The underlying concept of this approach is to build the statistical model on variables representing the biologically functional units. We evaluate the new methods with data from a heterogeneous stock mouse population, the Drosophila Genetic Reference Panel (DGRP), and a rice breeding population from the Rice Diversity Panel. Our results show that using gene annotation to define haploblocks often leads to a comparable, but for some traits to a higher, predictive ability compared to SNP-based models or to haplotype models that do not use gene annotation information. Modeling gene interaction effects can further improve predictive ability. We also illustrate that the additional use of markers that have not been mapped to any gene in a second separate relatedness matrix does in many cases not lead to a relevant additional increase in predictive ability when the first matrix is based on haploblocks defined with gene annotation data, suggesting that intergenic markers only provide redundant information on the considered data sets. Therefore, gene annotation information seems to be appropriate to perceive the importance of DNA segments. Finally, we discuss the effects of gene annotation quality, marker density, and linkage disequilibrium on the performance of the new methods. To our knowledge, this is the first work that incorporates epistatic interaction or gene annotation into haplotype-based prediction approaches.
Project description:In the last years, a series of methods for genomic prediction (GP) have been established, and the advantages of GP over pedigree best linear unbiased prediction (BLUP) have been reported. However, the majority of previously proposed GP models are purely based on mathematical considerations while seldom take the abundant biological knowledge into account. Prediction ability of those models largely depends on the consistency between the statistical assumptions and the underlying genetic architectures of traits of interest. In this study, gene annotation information was incorporated into GP models by constructing haplotypes with SNPs mapped to genic regions. Haplotype allele similarity between pairs of individuals was measured through different approaches at single gene level and then converted into whole genome level, which was then treated as a special kernel and used in kernel based GP models. Results shown that the gene annotation guided methods gave higher or at least comparable predictive ability in some traits, especially in the Arabidopsis dataset and the rice breeding population. Compared to SNP models and haplotype models without gene annotation, the gene annotation based models improved the predictive ability by 0.56~26.67% in the Arabidopsis and 1.62~16.53% in the rice breeding population, respectively. However, incorporating gene annotation slightly improved the predictive ability for several traits but did not show any extra gain for the rest traits in a chicken population. In conclusion, integrating gene annotation into GP models could be beneficial for some traits, species, and populations compared to SNP models and haplotype models without gene annotation. However, more studies are yet to be conducted to implicitly investigate the characteristics of these gene annotation guided models.
Project description:Various methods have been proposed for genomic prediction (GP) in livestock. These methods have mainly focused on statistical considerations and did not include genome annotation information. In this study, to improve the predictive performance of carcass traits in Chinese Simmental beef cattle, we incorporated the genome annotation information into GP. Single nucleotide polymorphisms (SNPs) were annotated to five genomic classes: intergenic, gene, exon, protein coding sequences, and 3'/5' untranslated region. Haploblocks were constructed for all markers and these five genomic classes by defining a biologically functional unit, and haplotype effects were modeled in both numerical dosage and categorical coding strategies. The first-order epistatic effects among SNPs and haplotypes were modeled using a categorical epistasis model. For all makers, the extension from the SNP-based model to a haplotype-based model improved the accuracy by 5.4-9.8% for carcass weight (CW), live weight (LW), and striploin (SI). For the five genomic classes using the haplotype-based prediction model, the incorporation of gene class information into the model improved the accuracies by an average of 1.4, 2.1, and 1.3% for CW, LW, and SI, respectively, compared with their corresponding results for all markers. Including the first-order epistatic effects into the prediction models improved the accuracies in some traits and genomic classes. Therefore, for traits with moderate-to-high heritability, incorporating genome annotation information of gene class into haplotype-based prediction models could be considered as a promising tool for GP in Chinese Simmental beef cattle, and modeling epistasis in prediction can further increase the accuracy to some degree.
Project description:A haplotype is defined as a combination of alleles at adjacent loci belonging to the same chromosome that can be transmitted as a unit. In this study, we used both the Illumina BovineHD chip (HD chip) and imputed whole-genome sequence (WGS) data to explore haploblocks and assess haplotype effects, and the haploblocks were defined based on the different LD thresholds. The accuracies of genomic prediction (GP) for dressing percentage (DP), meat percentage (MP), and rib eye roll weight (RERW) based on haplotype were investigated and compared for both data sets in Chinese Simmental beef cattle. The accuracies of GP using the entire imputed WGS data were lower than those using the HD chip data in all cases. For DP and MP, the accuracy of GP using haploblock approaches outperformed the individual single nucleotide polymorphism (SNP) approach (GBLUP_In_Block) at specific LD levels. Hotelling's test confirmed that GP using LD-based haplotypes from WGS data can significantly increase the accuracies of GP for RERW, compared with the individual SNP approach (∼1.4 and 1.9% for G<sub>H</sub>BLUP and G<sub>H</sub>BLUP+GBLUP, respectively). We found that the accuracies using haploblock approach varied with different LD thresholds. The LD thresholds (<i>r</i> <sup>2</sup> ≥ 0.5) were optimal for most scenarios. Our results suggested that LD-based haploblock approach can improve accuracy of genomic prediction for carcass traits using both HD chip and imputed WGS data under the optimal LD thresholds in Chinese Simmental beef cattle.
Project description:Multi-trait (MT) genomic prediction models enable breeders to save phenotyping resources and increase the prediction accuracy of unobserved target traits by exploiting available information from non-target or auxiliary traits. Our study evaluated different MT models using 250 rice accessions from Asian countries genotyped and phenotyped for grain content of zinc (Zn), iron (Fe), copper (Cu), manganese (Mn), and cadmium (Cd). The predictive performance of MT models compared to a traditional single trait (ST) model was assessed by 1) applying different cross-validation strategies (CV1, CV2, and CV3) inferring varied phenotyping patterns and budgets; 2) accounting for local epistatic effects along with the main additive effect in MT models; and 3) using a selective marker panel composed of trait-associated SNPs in MT models. MT models were not statistically significantly (<i>p <</i> 0.05) superior to ST model under CV1, where no phenotypic information was available for the accessions in the test set. After including phenotypes from auxiliary traits in both training and test sets (MT-CV2) or simply in the test set (MT-CV3), MT models significantly (<i>p</i> < 0.05) outperformed ST model for all the traits. The highest increases in the predictive ability of MT models relative to ST models were 11.1% (Mn), 11.5 (Cd), 33.3% (Fe), 95.2% (Cu) and 126% (Zn). Accounting for the local epistatic effects using a haplotype-based model further improved the predictive ability of MT models by 4.6% (Cu), 3.8% (Zn), and 3.5% (Cd) relative to MT models with only additive effects. The predictive ability of the haplotype-based model was not improved after optimizing the marker panel by only considering the markers associated with the traits. This study first assessed the local epistatic effects and marker optimization strategies in the MT genomic prediction framework and then illustrated the power of the MT model in predicting trace element traits in rice for the effective use of genetic resources to improve the nutritional quality of rice grain.
Project description:Hybrid crops have contributed greatly to improvements in global food and fodder production over the past several decades. Nevertheless, the growing population and changing climate have produced food crises and energy shortages. Breeding new elite hybrid varieties is currently an urgent task, but present breeding procedures are time-consuming and labour-intensive. In this study, parental metabolic information was utilized to predict three polygenic traits in hybrid rice. A complete diallel cross population consisting of eighteen rice inbred lines was constructed, and the hybrids' plant height, heading date and grain yield per plant were predicted using 525 metabolites. Metabolic prediction models were built using the partial least square regression method, with predictive abilities ranging from 0.858 to 0.977 for the hybrid phenotypes, relative heterosis, and specific combining ability. Only slight changes in predictive ability were observed between hybrid populations, and nearly no changes were detected between reciprocal hybrids. The outcomes of prediction of the three highly polygenic traits demonstrated that metabolic prediction was an accurate (high predictive abilities) and efficient (unaffected by population genetic structures) strategy for screening promising superior hybrid rice. Exploitation of this pre-hybridization strategy may contribute to rice production improvement and accelerate breeding programs.
Project description:The objective of this study was to explore the potential of genomic prediction (GP) for soybean resistance against Sclerotinia sclerotiorum (Lib.) de Bary, the causal agent of white mold (WM). A diverse panel of 465 soybean plant introduction accessions was phenotyped for WM resistance in replicated field and greenhouse tests. All plant accessions were previously genotyped using the SoySNP50K BeadChip. The predictive ability of six GP models were compared, and the impact of marker density and training population size on the predictive ability was investigated. Cross-prediction among environments was tested to determine the effectiveness of the prediction models. GP models had similar prediction accuracies for all experiments. Predictive ability did not improve significantly by using more than 5k SNPs, or by increasing the training population size (from 50% to 90% of the total of individuals). The GP model effectively predicted WM resistance across field and greenhouse experiments when each was used as either the training or validation population. The GP model was able to identify WM-resistant accessions in the USDA soybean germplasm collection that had previously been reported and were not included in the study panel. This study demonstrated the applicability of GP to identify useful genetic sources of WM resistance for soybean breeding. Further research will confirm the applicability of the proposed approach to other complex disease resistance traits and in other crops.
Project description:Glume hairiness or pubescence is an important morphological trait with high heritability to distinguish/characterize wheat and is related to the resistance to biotic and abiotic stresses. Hg1 (formerly named Hg) on chromosome arm 1AS controlled glume hairiness in wheat. Its genetic analysis and mapping have been widely studied, yet more useful and accurate information for fine mapping of Hg1 and identification of its candidate gene is lacking. The cloning of this gene has not yet been reported for the large complex wheat genome. Here, we performed a GWAS between SNP markers and glume pubescence (Gp) in a wheat population with 352 lines and further demonstrated the gene expression and haplotype analysis approach for isolating the Hg1 gene. One gene, TraesCSU02G143200 (TaELD1-1A), encoding glycosyltransferase-like ELD1/KOBITO 1, was identified as the most promising candidate gene of Hg1. The gene annotation, expression pattern, function SNP variation, haplotype analysis, and co-expression analysis in floral organ (spike) development indicated that it is likely to be involved in the regulation of glume pubescence. Our study demonstrates the importance of high-quality reference genomes and annotation information, as well as bioinformatics analysis, for gene cloning in wheat.
Project description:The lowering genotyping cost is ushering in a wider interest and adoption of genomic prediction and selection in plant breeding programs worldwide. However, improper conflation of historical and recent linkage disequilibrium between markers and genes restricts high accuracy of genomic prediction (GP). Multiple ancestors may share a common haplotype surrounding a gene, without sharing the same allele of that gene. This prevents parsing out genetic effects associated with the underlying allele of that gene among the set of ancestral haplotypes. We present “Parental Allele Tracing, Recombination Identification, and Optimal predicTion” (i.e., PATRIOT) approach that utilizes marker data to allow for a rapid identification of lines carrying specific alleles, increases the accuracy of genomic relatedness and diversity estimates, and improves genomic prediction. Leveraging identity-by-descent relationships, PATRIOT showed an improvement in GP accuracy by 16.6% relative to the traditional rrBLUP method. This approach will help to increase the rate of genetic gain and allow available information to be more effectively utilized within breeding programs.
Project description:Genomic prediction (GP) aims to construct a statistical model for predicting phenotypes using genome-wide markers and is a promising strategy for accelerating molecular plant breeding. However, current progress of phenotype prediction using genomic data alone has reached a bottleneck, and previous studies on transcriptomic and metabolomic predictions ignored genomic information. Here, we designed a novel strategy of GP called multilayered least absolute shrinkage and selection operator (MLLASSO) by integrating multiple omic data into a single model that iteratively learns three layers of genetic features (GFs) supervised by observed transcriptome and metabolome. Significantly, MLLASSO learns higher order information of gene interactions, which enables us to achieve a significant improvement of predictability of yield in rice from 0.1588 (GP alone) to 0.2451 (MLLASSO). In the prediction of the first two layers, some genes were found to be genetically predictable genes (GPGs) as their expressions were accurately predicted with genetic markers. Interestingly, we made three dramatic discoveries for the GPGs: (i) GPGs are good predictors for highly complex traits like yield; (ii) GPGs are mostly eQTL genes (cis or trans); and (iii) trait-related transcriptional factor families are enriched in GPGs. These findings support the notion that learned GFs not only are good predictors for traits but also have specific biological implications regarding regulation of gene expressions. To differentiate the new method from conventional GP models, we called MLLASSO a directed learning strategy supervised by intermediate omic data. This new prediction model appears to be more reliable and more robust than conventional GP models.
Project description:<p><strong>BACKGROUND:</strong> Genomic prediction (GP) based on single nucleotide polymorphisms (SNP) has become a broadly used tool to increase the gain of selection in plant breeding. However, using predictors that are biologically closer to the phenotypes such as transcriptome and metabolome may increase the prediction ability in GP. The objectives of this study were to (i) assess the prediction ability for three phenotypic traits using different omic datasets including sequence variants (SV), deleterious SV (dSV), tolerant SV (tSV), expression presence/absence variation (ePAV), gene expression (GE), transcript expression (TE), and metabolites (M) as single predictors in comparison to those using a SNP array; (ii) investigate the improvement in prediction ability when combining multiple omic datasets information to predict phenotypic variation in barley breeding programs; (iii) explore the relationship between genes and metabolites to unravel the metabolic pathway of the three above mentioned phenotypic traits.</p><p><strong>RESULTS:</strong> The prediction ability from genomic best linear unbiased prediction (GBLUP) for the three traits using dSV information was higher than when using tSV, all SV information, or the SNP array. Any predictors from the transcriptome (GE, TE, as well as ePAV) and metabolome provided higher prediction abilities compared to the SNP array and SV on average across the three traits. In addition, some (di)-similarity existed between different omic datasets, and therefore provided complementary biological perspectives to phenotypic variation. Optimal combining the information of dSV, TE, ePAV, as well as metabolites into GP models could improve the prediction ability over that of the single predictors alone.</p><p><strong>CONCLUSIONS:</strong> The use of integrated omic datasets in GP model is highly recommended. Furthermore, we evaluated a cost-effective approach generating 3’end mRNA sequencing with transcriptome data extracted from seedling without losing prediction ability in comparison to the full-length mRNA sequencing, paving the path for the use of such prediction methods in commercial breeding programs.</p>