Project description:Feature selection (FS, i.e., selection of a subset of predictor variables) is essential in high-dimensional datasets to prevent overfitting of prediction/classification models and reduce computation time and resources. In genomics, FS makes it possible to identify relevant markers and to design low-density SNP chips to evaluate selection candidates. In this research, several univariate and multivariate FS algorithms combined with various parametric and non-parametric learners were applied to the prediction of feed efficiency in growing pigs from high-dimensional genomic data. The objective was to find the best combination of feature selector, SNP subset size, and learner leading to accurate and stable (i.e., less sensitive to changes in the training data) prediction models. Genomic best linear unbiased prediction (GBLUP) without SNP pre-selection was the benchmark. Three types of FS methods were implemented: (i) filter methods: univariate (univ.dtree, spearcor) or multivariate (cforest, mrmr), with random selection as benchmark; (ii) embedded methods: elastic net and least absolute shrinkage and selection operator (LASSO) regression; (iii) combinations of filter and embedded methods. Ridge regression, support vector machine (SVM), and gradient boosting (GB) were applied after pre-selection performed with the filter methods. Data consisted of 5,708 individual records of residual feed intake to be predicted from the animal's own genotype. Accuracy (stability of results) was measured as the median (interquartile range) of the Spearman correlation between observed and predicted data in a 10-fold cross-validation. The best prediction in terms of accuracy and stability was obtained with SVM and GB using 500 or more SNPs [0.28 (0.02) and 0.27 (0.04) for SVM and GB with 1,000 SNPs, respectively]. With larger subset sizes (1,000-1,500 SNPs), the filter method had no influence on prediction quality, which was similar to that attained with a random selection. With 50-250 SNPs, the FS method had a strong impact on prediction quality: it was very poor for tree-based methods combined with any learner, but good and similar to what was obtained with larger SNP subsets when spearcor or mrmr were implemented with or without embedded methods. These filters also led to very stable results, suggesting their potential use for designing low-density SNP chips for genome-based evaluation of feed efficiency.
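A minimal sketch of the kind of filter-plus-learner pipeline described above, assuming a Spearman-correlation filter and an SVM learner scored by the Spearman correlation between observed and predicted phenotypes in 10-fold cross-validation; the data, subset size, and hyperparameters are placeholders, not the study's actual code.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import KFold
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = rng.integers(0, 3, size=(500, 2000)).astype(float)   # simulated genotypes coded 0/1/2
y = rng.standard_normal(500)                              # simulated residual feed intake

def spearcor_filter(X_train, y_train, k):
    """Rank SNPs by |Spearman correlation| with the phenotype and keep the top k."""
    scores = np.array([abs(spearmanr(X_train[:, j], y_train)[0])
                       for j in range(X_train.shape[1])])
    return np.argsort(scores)[::-1][:k]

accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=1).split(X):
    top = spearcor_filter(X[train_idx], y[train_idx], k=250)   # filter refit inside each fold (no leakage)
    model = SVR(kernel="rbf").fit(X[train_idx][:, top], y[train_idx])
    pred = model.predict(X[test_idx][:, top])
    accuracies.append(spearmanr(y[test_idx], pred)[0])

print(f"median accuracy: {np.median(accuracies):.3f}, "
      f"IQR: {np.subtract(*np.percentile(accuracies, [75, 25])):.3f}")
```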
Project description:Current cardiovascular risk assessment tools use a small number of predictors. Here, we study how machine learning might: (1) enable principled selection from a large multimodal set of candidate variables and (2) improve prediction of incident coronary artery disease (CAD) events. An elastic net-based Cox model (ML4HEN-COX) trained and evaluated in 173,274 UK Biobank participants selected 51 predictors from 13,782 candidates. Beyond most traditional risk factors, ML4HEN-COX selected a polygenic score, waist circumference, socioeconomic deprivation, and several hematologic indices. A more than 30-fold gradient in 10-year risk estimates was noted across ML4HEN-COX quintiles, ranging from 0.25% to 7.8%. ML4HEN-COX improved discrimination of incident CAD (C-statistic = 0.796) compared with the Framingham risk score, pooled cohort equations, and QRISK3 (range 0.754-0.761). This approach to variable selection and model assessment is readily generalizable to a broad range of complex datasets and disease endpoints.
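As a rough illustration of elastic-net-penalized Cox regression for variable selection (not the authors' ML4HEN-COX implementation), the sketch below uses the lifelines library with placeholder data; the penalty settings, column names, and selection threshold are assumptions.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n, p = 1000, 50                               # placeholder cohort size and candidate predictors
df = pd.DataFrame(rng.standard_normal((n, p)),
                  columns=[f"x{j}" for j in range(p)])
df["time"] = rng.exponential(10, n)           # follow-up time (years)
df["event"] = rng.integers(0, 2, n)           # 1 = incident CAD event, 0 = censored

# penalizer controls overall shrinkage; l1_ratio mixes the L1 (sparsity) and L2 penalties.
cph = CoxPHFitter(penalizer=0.1, l1_ratio=0.5)
cph.fit(df, duration_col="time", event_col="event")

# Predictors whose coefficients survive the L1 penalty (soft threshold for near-zero values).
selected = cph.params_[cph.params_.abs() > 1e-4]
print(f"{len(selected)} predictors retained")
print("C-statistic:", round(cph.concordance_index_, 3))
```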
Project description:Genetic correlations between quantitative traits measured in many breeding programs are pervasive. These correlations indicate that measurements of one trait carry information on other traits. Current single-trait (univariate) genomic selection does not take advantage of this information. Multivariate genomic selection on multiple traits could accomplish this but has been little explored and tested in practical breeding programs. In this study, three multivariate linear models (i.e., GBLUP, BayesA, and BayesCπ) were presented and compared to univariate models using simulated and real quantitative traits controlled by different genetic architectures. We also extended BayesA with fixed hyperparameters to a full hierarchical model that estimated hyperparameters, and BayesCπ to impute missing phenotypes. We found that optimal marker-effect variance priors depended on the genetic architecture of the trait, so that estimating them was beneficial. We showed that the prediction accuracy for a low-heritability trait could be significantly increased by multivariate genomic selection when a correlated high-heritability trait was available. Further, multiple-trait genomic selection had higher prediction accuracy than single-trait genomic selection when phenotypes were not available on all individuals and traits. Additional factors affecting the performance of multiple-trait genomic selection were explored.
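For readers unfamiliar with how a correlated trait enters the prediction, the following numpy sketch sets up two-trait GBLUP mixed-model equations with assumed (co)variance components; the genotypes, phenotypes, and variance parameters are simulated placeholders, not the models or data of this study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
M = rng.integers(0, 3, size=(n, 1000)).astype(float)        # SNP genotypes coded 0/1/2
p = M.mean(axis=0) / 2.0
Z = M - 2.0 * p                                              # centered genotype covariates
G = Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))                  # VanRaden genomic relationship matrix
G += np.eye(n) * 1e-3                                        # small ridge so G is invertible

Y = rng.standard_normal((n, 2))                              # placeholder phenotypes for two traits
G0 = np.array([[0.30, 0.15], [0.15, 0.50]])                  # assumed genetic (co)variances
R0 = np.array([[0.70, 0.10], [0.10, 0.50]])                  # assumed residual (co)variances

# Multi-trait GBLUP mixed-model equations, records stacked trait by trait:
# [X'R^-1X          X'R^-1Z          ] [b]   [X'R^-1y]
# [Z'R^-1X   Z'R^-1Z + G0^-1 (x) G^-1] [u] = [Z'R^-1y]
X = np.kron(np.eye(2), np.ones((n, 1)))                      # trait-specific means (fixed effects)
Zd = np.eye(2 * n)                                           # one breeding value per animal and trait
Rinv = np.kron(np.linalg.inv(R0), np.eye(n))
Ginv = np.kron(np.linalg.inv(G0), np.linalg.inv(G))
y = Y.T.reshape(-1)                                          # [trait 1 records, trait 2 records]

lhs = np.block([[X.T @ Rinv @ X,  X.T @ Rinv @ Zd],
                [Zd.T @ Rinv @ X, Zd.T @ Rinv @ Zd + Ginv]])
rhs = np.concatenate([X.T @ Rinv @ y, Zd.T @ Rinv @ y])
sol = np.linalg.solve(lhs, rhs)
ebv = sol[2:].reshape(2, n).T                                # estimated breeding values, one column per trait
print(ebv[:5])
```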
Project description:Grapevine (Vitis vinifera) breeding is reaching a critical point. New cultivars are released every year with resistance to powdery and downy mildews. However, the traditional process remains time-consuming, taking 20-25 years, and demands the evaluation of new traits to enhance grapevine adaptation to climate change. Until now, the selection process has relied on phenotypic data and a limited number of molecular markers for simple genetic traits such as resistance to pathogens, without a clearly defined ideotype, and has been carried out on a large scale. To accelerate the breeding process and address these challenges, we investigated the use of genomic prediction, a methodology that uses molecular markers to predict genotypic values. In our study, we focused on 2 existing grapevine breeding programs: Rosé wine and Cognac production. In these programs, several families were created through crosses between emblematic varieties and interspecific varieties resistant to powdery and downy mildews. Thirty traits were evaluated for each program, using 2 genomic prediction methods: Genomic Best Linear Unbiased Predictor and Least Absolute Shrinkage and Selection Operator. The results revealed substantial variability in predictive abilities across traits, ranging from 0 to 0.9. These discrepancies could be attributed to factors such as trait heritability and trait characteristics. Moreover, we explored the potential of across-population genomic prediction by leveraging other grapevine populations as training sets. Integrating genomic prediction allowed us to identify superior individuals for each program, using a multivariate selection index method. The ideotype for each breeding program was defined collaboratively with representatives from the wine-growing sector.
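A small sketch of how per-trait genomic predictions can be combined into a multivariate selection index to rank candidates; the trait names, weights, and predicted values are purely illustrative assumptions and do not reflect the ideotypes defined for these programs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
candidates = [f"geno_{i:03d}" for i in range(200)]
# Predicted genotypic values (e.g., from GBLUP or LASSO), one column per trait (placeholders).
preds = pd.DataFrame(rng.standard_normal((200, 3)),
                     index=candidates,
                     columns=["mildew_resistance", "yield", "acidity"])

# Ideotype/economic weights agreed with the sector representatives (placeholders).
weights = {"mildew_resistance": 0.5, "yield": 0.3, "acidity": 0.2}

# Standardize each trait, compute the weighted index, and rank candidates.
z = (preds - preds.mean()) / preds.std()
index = sum(w * z[t] for t, w in weights.items())
print(index.sort_values(ascending=False).head(10))
```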
Project description:In single-step analyses, missing genotypes are explicitly or implicitly imputed, and this requires centering the observed genotypes using the means of the unselected founders. If genotypes are only available for selected individuals, centering on the unselected founder mean is not straightforward. Here, computer simulation is used to study an alternative analysis that does not require centering the genotypes but instead fits the mean of unselected individuals as a fixed effect. Starting with observed diplotypes from 721 cattle, a five-generation population was simulated with sire selection to produce 40,000 individuals with phenotypes, of which the 1,000 sires had genotypes. The next generation of 8,000 genotyped individuals was used for validation. Evaluations were undertaken with (J) or without (N) this fixed mean when marker covariates were not centered, and with (JC) or without (C) it when all observed and imputed marker covariates were centered. Centering did not influence the accuracy of genomic prediction, but fitting the mean did. Accuracies were improved when the panel comprised only quantitative trait loci (QTL); models JC and J had accuracies of 99.4%, whereas models C and N had accuracies of 90.2%. When only markers were in the panel, the 4 models had accuracies of 80.4%. In panels that included QTL, fitting the mean in the model improved accuracy, but it had little impact when the panel contained only markers. In populations undergoing selection, fitting this mean in the model is recommended to avoid bias and a reduction in prediction accuracy due to selection.
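The following toy ridge/SNP-BLUP sketch illustrates the centering argument only: with an unpenalized mean fitted as a fixed effect, the fitted values are the same whether or not the marker covariates are centered, whereas omitting the mean with uncentered covariates changes the fit. It is not the paper's single-step simulation, and the J/JC/N labels merely echo the model names above; data and the shrinkage parameter are placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 300, 800, 50.0
W = rng.integers(0, 3, size=(n, p)).astype(float)            # genotypes coded 0/1/2 (uncentered)
beta_true = rng.standard_normal(p) * 0.05
y = 2.0 + W @ beta_true + rng.standard_normal(n)

def snp_blup_fit(W, y, lam, fit_mean):
    """Ridge (SNP-BLUP) fitted values, optionally with an unpenalized mean as a fixed effect."""
    if fit_mean:
        X = np.column_stack([np.ones(W.shape[0]), W])
        P = np.diag([0.0] + [lam] * W.shape[1])               # the mean is not penalized
    else:
        X, P = W, lam * np.eye(W.shape[1])
    sol = np.linalg.solve(X.T @ X + P, X.T @ y)
    return X @ sol

fit_J  = snp_blup_fit(W, y, lam, fit_mean=True)                    # uncentered + mean ("J"-like)
fit_JC = snp_blup_fit(W - W.mean(axis=0), y, lam, fit_mean=True)   # centered + mean ("JC"-like)
fit_N  = snp_blup_fit(W, y, lam, fit_mean=False)                   # uncentered, no mean ("N"-like)

# The mean absorbs the centering, so the first difference is numerically negligible.
print("max |fit_J - fit_JC|:", np.max(np.abs(fit_J - fit_JC)))
print("corr(fit_J, fit_N)  :", round(np.corrcoef(fit_J, fit_N)[0, 1], 3))
```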
Project description:The Holstein breed is the mainstay of dairy production in Korea. In this study, we evaluated the genomic prediction accuracy for body conformation traits in Korean Holstein cattle, using a range of π levels (0.75, 0.90, 0.99, and 0.995) in Bayesian methods (BayesB and BayesC). Focusing on 24 traits, we analyzed the impact of different π levels on prediction accuracy. We observed a general increase in accuracy at higher π levels for specific traits, with variations depending on the Bayesian method applied. Notably, the highest accuracy was achieved for rear teat angle when using deregressed estimated breeding values that included the parent average as the response variable. We further demonstrated that incorporating the parent average into deregressed estimated breeding values enhances genomic prediction accuracy, showcasing the effectiveness of the model in integrating both offspring and parental genetic information. Additionally, we identified 18 significant window regions through genome-wide association studies, which are crucial for future fine mapping and discovery of causal mutations. These findings provide valuable insights into the efficiency of genomic selection for body conformation traits in Korean Holstein cattle and highlight the potential for further gains in prediction accuracy with larger datasets and more sophisticated genomic models.
Project description:Genomic selection is revolutionizing plant breeding. However, its practical implementation is still very challenging, since predicted values do not necessarily correspond closely to the observed phenotypic values. When the goal is to predict within-family, it is not always possible to obtain reasonable accuracies, which is of paramount importance to improve the selection process. For this reason, in this research we propose the Adversaria-Boruta (AB) method, which combines the virtues of the adversarial validation (AV) method and the Boruta feature selection method. The AB method operates primarily by minimizing the disparity between the training and testing distributions. This is accomplished by reducing the weight assigned to markers that display the most significant differences between the training and testing sets. The AB method therefore builds a weighted genomic relationship matrix that is used with the genomic best linear unbiased predictor (GBLUP) model. The proposed AB method is compared, on 12 real data sets, with a GBLUP model that uses an unweighted genomic relationship matrix. Our results show that the proposed AB method outperforms GBLUP by 8.6, 19.7, and 9.8% in terms of Pearson's correlation, mean square error, and normalized root mean square error, respectively. Our results support the proposed AB method as a useful tool to improve the prediction accuracy of a complete family; however, we encourage other investigators to evaluate the AB method to strengthen the empirical evidence of its potential.
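A loose sketch of the adversarial-validation idea behind a weighted genomic relationship matrix: a classifier is trained to distinguish training from testing genotypes, and the markers that most separate the two sets are down-weighted before building a VanRaden-style matrix. The weighting transformation shown is an assumption for illustration and is not the authors' AB weighting scheme.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
M_train = rng.integers(0, 3, size=(300, 1000)).astype(float)   # training-set genotypes
M_test  = rng.integers(0, 3, size=(100, 1000)).astype(float)   # target-family (testing) genotypes

# Adversarial validation: label the origin of each genotype and learn to separate the sets.
M_all = np.vstack([M_train, M_test])
origin = np.r_[np.zeros(len(M_train)), np.ones(len(M_test))]
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(M_all, origin)

# Down-weight markers with high importance for separating the two sets
# (illustrative transformation only; the actual AB weighting differs).
imp = rf.feature_importances_
w = 1.0 - imp / (imp.max() + 1e-12)

# Weighted VanRaden-style genomic relationship matrix for use with GBLUP.
p = M_all.mean(axis=0) / 2.0
Z = M_all - 2.0 * p
Gw = (Z * w) @ Z.T / (2.0 * np.sum(w * p * (1.0 - p)))
print(Gw.shape)
```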
Project description:In this work, we integrated the finite element (FE) method and machine learning (ML) to predict stent expansion in a calcified coronary artery. The stenting procedure was captured in a patient-specific artery model reconstructed from optical coherence tomography images. Following FE simulation, eight geometrical features in each of 120 cross sections of the pre-stenting artery model, as well as the corresponding post-stenting lumen area, were extracted for training and testing the ML models. A linear regression model and a support vector regression (SVR) model with three different kernels (linear, polynomial, and radial basis function) were adopted in this work. Two subgroups of the eight features, i.e., stretch features and calcification features, were further assessed for their prediction capacity. The influence of neighboring cross sections on prediction accuracy was also investigated by averaging each feature over eight neighboring cross sections. Results showed that the SVR models provided better predictions than the linear regression model in terms of bias. In addition, including the stretch features, which are based on mechanistic understanding, provided better predictions than the calcification features alone. However, there were no statistically significant differences between averaged neighboring cross sections and individual ones in terms of prediction bias and range of error. The simulation-driven machine learning framework in this work could enhance the mechanistic understanding of stenting in calcified coronary artery lesions and pave the way toward precise prediction of stent expansion.
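The sketch below mirrors the model comparison at a high level, fitting a linear regression baseline and SVR with linear, polynomial, and RBF kernels to placeholder cross-section features and reporting cross-validated bias and RMSE; the features, data, and hyperparameters are assumptions, not the study's inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(5)
X = rng.standard_normal((120, 8))          # 8 geometric features for each of 120 cross sections
y = 5.0 + 0.8 * X[:, 0] - 0.5 * X[:, 3] + 0.3 * rng.standard_normal(120)  # post-stent lumen area

models = {
    "linear regression": LinearRegression(),
    "SVR (linear)": make_pipeline(StandardScaler(), SVR(kernel="linear")),
    "SVR (poly)":   make_pipeline(StandardScaler(), SVR(kernel="poly", degree=2)),
    "SVR (rbf)":    make_pipeline(StandardScaler(), SVR(kernel="rbf")),
}
for name, model in models.items():
    pred = cross_val_predict(model, X, y, cv=5)
    bias = np.mean(pred - y)                                # mean prediction bias
    rmse = np.sqrt(np.mean((pred - y) ** 2))
    print(f"{name:18s} bias = {bias:+.3f}, RMSE = {rmse:.3f}")
```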
Project description:Background: The ever-increasing availability of high-density genomic markers in the form of single nucleotide polymorphisms (SNPs) enables genomic prediction, i.e. the inference of phenotypes based solely on genomic data, in the field of animal and plant breeding, where it has become an important tool. However, given the limited number of individuals, the abundance of variables (SNPs) can reduce the accuracy of prediction models due to overfitting or irrelevant SNPs. Feature selection can help to reduce the number of irrelevant SNPs and increase model performance. In this study, we investigated an incremental feature selection approach based on ranking the SNPs according to the results of a genome-wide association study, combined with random forest as the prediction model, and we applied it to several animal and plant datasets. Results: Applying our approach to different datasets yielded a wide range of outcomes, i.e. from a substantial increase in prediction accuracy in a few cases to minor improvements when only a fraction of the available SNPs were used. Compared with models using all available SNPs, our approach achieved comparable performance with a considerably reduced number of SNPs in several cases. Our approach showcased state-of-the-art efficiency and performance while requiring less computation time. Conclusions: The results of our study suggest that our incremental feature selection approach has the potential to improve prediction accuracy substantially. However, this gain seems to depend on the genomic data used. Even for datasets where the number of markers is smaller than the number of individuals, feature selection may still increase the performance of the genomic prediction. Our approach is implemented in R and is available at https://github.com/FelixHeinrich/GP_with_IFS/.
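A loose Python sketch of the incremental idea: SNPs are ranked by a simple univariate association test (standing in for a GWAS ranking) and a random forest is evaluated on growing subsets. The authors' actual implementation is the R code at the repository linked above; the subset sizes and data here are placeholders.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(11)
X = rng.integers(0, 3, size=(400, 3000)).astype(float)     # genotypes coded 0/1/2
y = X[:, :20] @ rng.standard_normal(20) * 0.2 + rng.standard_normal(400)

# Per-SNP association ranking. For brevity it is computed once on the full data;
# in practice the ranking should be derived within each training fold to avoid selection bias.
pvals = np.array([pearsonr(X[:, j], y)[1] for j in range(X.shape[1])])
ranking = np.argsort(pvals)

for k in [50, 100, 250, 500, 1000]:                         # incrementally larger SNP subsets
    rf = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
    acc = cross_val_score(rf, X[:, ranking[:k]], y, cv=5, scoring="r2")
    print(f"top {k:4d} SNPs: mean CV R^2 = {acc.mean():.3f}")
```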
Project description:Multivariate analysis using mixed models allows for the exploration of genetic correlations between traits. Additionally, the transition to a genomic-based approach is simplified by substituting classic pedigrees with a marker-based relationship matrix. It also enables the investigation of correlated responses to selection, trait integration, and modularity in different kinds of populations. This study investigated a strategy for the construction of a marker-based relationship matrix that prioritizes markers using partial least squares (PLS). The efficiency of this strategy was found to depend on the correlation structure between the investigated traits. In terms of accuracy, we found no benefit of this strategy compared with the all-marker-based multivariate model for the primary trait of diameter at breast height (DBH) in a radiata pine (Pinus radiata) population, possibly due to the presence of strong and well-estimated correlations with other highly heritable traits. Conversely, we did see a benefit in a shining gum (Eucalyptus nitens) population, where the primary trait had low or only moderate genetic correlations with other low- to moderately heritable traits. Marker selection in multivariate analysis can therefore be an efficient strategy to improve prediction accuracy for low-heritability traits, owing to improved precision in otherwise poorly estimated low-to-moderate genetic correlations. Additionally, our study identified genetic diversity as a factor contributing to the efficiency of marker selection in multivariate approaches, owing to the higher precision of genetic correlation estimates.
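A hedged sketch of PLS-based marker prioritization: markers are ranked by their contribution to a multi-trait PLS fit and a relationship matrix is then built from the top-ranked subset. The ranking criterion (norm of the PLS weights) and all data are illustrative assumptions and may differ from the criterion used in the study.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(13)
M = rng.integers(0, 3, size=(500, 4000)).astype(float)   # genotypes coded 0/1/2
Y = rng.standard_normal((500, 3))                         # DBH plus two correlated traits (placeholders)

# Multi-trait PLS fit; rank markers by their contribution across the latent components.
pls = PLSRegression(n_components=5).fit(M, Y)
score = np.linalg.norm(pls.x_weights_, axis=1)            # one score per marker
top = np.argsort(score)[::-1][:1000]                      # keep the 1,000 highest-ranked markers

# VanRaden-style relationship matrix built only from the prioritized markers.
p = M[:, top].mean(axis=0) / 2.0
Z = M[:, top] - 2.0 * p
G_sel = Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))
print(G_sel.shape)
```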