Project description: Background: Existing feature selection methods typically do not consider prior knowledge in the form of structural relationships among features. In this study, the features are structured into groups based on prior knowledge. The problem addressed in this article is how to select one representative feature from each group such that the selected features jointly discriminate the classes. The problem is formulated as a binary constrained optimization, and the combinatorial optimization is relaxed as a convex-concave problem, which is then transformed into a sequence of convex optimization problems that can be solved by any standard optimization algorithm. Moreover, a block coordinate gradient descent optimization algorithm is proposed for high-dimensional feature selection, which in our experiments was four times faster than a standard optimization algorithm. Results: To test the effectiveness of the proposed formulation, we used microarray analysis as a case study, where genes with similar expression or similar molecular functions were grouped together. In particular, the proposed block coordinate gradient descent feature selection method is evaluated on five benchmark microarray gene expression datasets, and evidence is provided that the proposed method gives more accurate results than state-of-the-art gene selection methods. Out of 25 experiments, the proposed method achieved the highest average AUC in 13 experiments, while the other methods achieved a higher average AUC in no more than 6 experiments. Conclusion: A method is developed to select one feature from each group. When the features are grouped by similarity in gene expression, we showed that the proposed algorithm is more accurate than state-of-the-art gene selection methods specifically developed to select highly discriminative and less redundant genes. In addition, the proposed method can exploit any grouping structure among features, while alternative methods are restricted to similarity-based grouping.
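The "one representative per group" idea can be pictured with a simple wrapper search. The sketch below is a combinatorial block-coordinate analogue that cycles over groups and keeps any swap that improves cross-validated AUC; it does not reproduce the paper's convex-concave relaxation, and the `groups` mapping, classifier, and scoring choices are illustrative assumptions.

```python
# Hedged sketch: a combinatorial block-coordinate analogue of "pick one feature
# per group so the selected set is jointly discriminative". It does not
# reproduce the paper's convex-concave relaxation; it is a plain wrapper
# search. `groups` maps a group id to the column indices of its features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def select_one_per_group(X, y, groups, n_sweeps=3, seed=0):
    rng = np.random.default_rng(seed)
    # start from a random representative in each group
    chosen = {g: rng.choice(idx) for g, idx in groups.items()}

    def score(selection):
        cols = sorted(selection.values())
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X[:, cols], y, cv=5, scoring="roc_auc").mean()

    best = score(chosen)
    for _ in range(n_sweeps):              # cycle over groups (the "blocks")
        for g, idx in groups.items():
            for j in idx:                  # try every candidate in this block
                trial = dict(chosen)
                trial[g] = j
                s = score(trial)
                if s > best:
                    best, chosen = s, trial
    return chosen, best
```

For instance, with `groups = {0: [0, 1, 2], 1: [3, 4]}` the search returns one column index from each of the two groups together with the best cross-validated AUC found.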
Project description: We introduce a novel data reduction technique whereby we select a subset of tiles that maximally "cover" events of interest in large-scale biological datasets (e.g., genetic mutations) while minimizing the number of tiles. A tile is a genomic unit capturing one or more biological events, such as a sequence of base pairs that can be sequenced and observed simultaneously. The goal is to significantly reduce the number of tiles considered, retaining those covering areas of dense events in a cohort, thus saving on cost and enhancing interpretability. However, the reduction should not come at the cost of too much information, so that sensible statistical analysis remains possible after its application. We envisage application of our methods to a variety of high-throughput data types, particularly those produced by next-generation sequencing (NGS) experiments. The procedure is cast as a convex optimization problem, which we present along with methods for its solution. The method is demonstrated on a large dataset of somatic mutations spanning 5000+ patients, each having one of 29 cancer types. Applied to these data, our method dramatically reduces the number of gene locations required for broad coverage of patients and their mutations, giving subject specialists a more easily interpretable snapshot of recurrent mutational profiles in these cancers. The locations identified coincide with previously identified cancer genes. Finally, despite considerable data reduction, we show that our covering designs preserve the cancer discrimination ability of multinomial logistic regression models trained on all of the locations (> 1M).
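As a rough illustration of casting tile selection as a convex program, the sketch below relaxes a set-cover-style formulation in cvxpy: binary tile indicators are relaxed to [0, 1], each patient must be covered by at least one selected tile, and the number of selected tiles is minimized. The coverage matrix, constraint, and objective are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a relaxed covering formulation (not the paper's exact
# objective or constraints). A[i, j] = 1 if tile j contains at least one
# event for patient i. We minimize the number of selected tiles while
# requiring every patient to be covered, relaxing 0/1 selection to [0, 1].
import cvxpy as cp
import numpy as np

def relaxed_tile_cover(A):
    n_patients, n_tiles = A.shape
    x = cp.Variable(n_tiles)                     # relaxed tile indicators
    objective = cp.Minimize(cp.sum(x))
    constraints = [A @ x >= 1, x >= 0, x <= 1]   # cover every patient
    cp.Problem(objective, constraints).solve()
    return x.value                               # threshold or round afterwards

# toy example: 3 patients, 3 candidate tiles
A = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]])
print(relaxed_tile_cover(A))
```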
Project description: Introduction: Imaging of tumors is a standard step in diagnosing cancer and making subsequent treatment decisions. The field of radiomics aims to develop imaging-based biomarkers using methods rooted in artificial intelligence applied to medical imaging. However, a challenging aspect of developing predictive models for clinical use is that many quantitative features derived from image data exhibit instability or lack of reproducibility across different imaging systems or image-processing pipelines. Methods: To address this challenge, we propose a Bayesian sparse modeling approach for image classification based on radiomic features, where the inclusion of more reliable features is favored via a probit prior formulation. Results: We verify through simulation studies that this approach can improve feature selection and prediction given correct prior information. Finally, we illustrate the method with an application to the classification of head and neck cancer patients by human papillomavirus status, using as our prior information a reliability metric quantifying feature stability across different imaging systems.
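One way to picture a reliability-informed prior is to map each feature's reliability score through a probit (Gaussian CDF) link into a prior inclusion probability. The sketch below does that and then uses the probabilities as crude penalty weights in an L1 logistic regression, which is only a loose frequentist analogue of the Bayesian probit-prior model; the hyperparameters `a` and `b` are hypothetical.

```python
# Hedged sketch: turn a per-feature reliability score into a probit-style
# prior inclusion probability, then use it as a rough penalty weight in an
# L1 logistic regression (scaling a column by a larger factor makes its
# coefficient cheaper under the L1 penalty). This is only a loose analogue
# of the Bayesian probit-prior model; `a` and `b` are hypothetical.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

def reliability_weighted_lasso(X, y, reliability, a=0.0, b=2.0, C=1.0):
    z = (reliability - reliability.mean()) / reliability.std()
    prior_prob = norm.cdf(a + b * z)       # probit link: more reliable -> higher prior
    Xw = X * prior_prob                    # reliable columns scaled up, so less penalized
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(Xw, y)
    selected = np.flatnonzero(clf.coef_.ravel() != 0)
    return selected, prior_prob
```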
Project description: Motivation: Network marker selection on genome-scale networks plays an important role in the understanding of biological mechanisms and disease pathologies. Recently, a Bayesian nonparametric mixture model has been developed and successfully applied for selecting genes and gene sub-networks. Hence, extending this method into a unified approach for network-based feature selection on general large-scale networks, and creating an easy-to-use software package, is in demand. Results: We extended the method and developed an R package, the Bayesian network feature finder (BANFF), providing posterior inference, model comparison, and graphical illustration of model fitting. The model was extended to a more general form, and a parallel computing algorithm for Markov chain Monte Carlo-based posterior inference and an expectation-maximization-based algorithm for posterior approximation were added. Based on simulation studies, we demonstrate the use of BANFF for analyzing gene expression on a protein-protein interaction network. Availability: https://cran.r-project.org/web/packages/BANFF/index.html Contact: jiankang@umich.edu, tianwei.yu@emory.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Project description: Background: The versatility of DNA copy number amplifications for profiling and categorization of various tissue samples has been widely acknowledged in the biomedical literature. For instance, this type of measurement technique provides possibilities for exploring sets of cancerous tissues to identify novel subtypes. The statistical approaches previously utilized for various kinds of analyses include traditional algorithmic techniques for clustering and dimension reduction, such as independent and principal component analyses and hierarchical clustering, as well as model-based clustering using maximum likelihood estimation for latent class models. Results: While purely algorithmic methods are usually easily applicable, their suboptimal performance and limitations in making formal inference have been thoroughly discussed in the statistical literature. Here we introduce a Bayesian model-based approach to the simultaneous identification of underlying tissue groups and the informative amplifications. The model-based approach provides the possibility of using formal inference to determine the number of groups from the data, in contrast to the ad hoc methods often exploited for similar purposes. The model also automatically recognizes the chromosomal areas that are relevant for the clustering. Conclusion: Validation analyses of simulated data and a large database of DNA copy number amplifications in human neoplasms are used to illustrate the potential of our approach. Our software implementation BASTA for performing Bayesian statistical tissue profiling is freely available for academic purposes at http://web.abo.fi/fak/mnf/mate/jc/software/basta.html.
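A minimal stand-in for model-based clustering of binary amplification profiles is a Bernoulli mixture fitted by EM, sketched below. It assumes a samples-by-regions 0/1 matrix, fixes the number of groups rather than inferring it, and uses maximum likelihood instead of the full Bayesian treatment, so it only illustrates the mixture structure, not the paper's formal inference or automatic selection of informative regions.

```python
# Hedged sketch: an EM-fitted Bernoulli mixture as a simple maximum-likelihood
# stand-in for the Bayesian tissue-profiling model in the abstract. X is a
# binary matrix of samples x chromosomal regions, 1 = amplification observed.
import numpy as np

def bernoulli_mixture_em(X, k, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                      # mixing weights
    theta = rng.uniform(0.25, 0.75, size=(k, d))  # per-cluster amplification probs
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability
        log_p = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights and per-cluster probabilities
        nk = resp.sum(axis=0)
        pi = nk / n
        theta = np.clip((resp.T @ X) / nk[:, None], 1e-4, 1 - 1e-4)
    return pi, theta, resp.argmax(axis=1)

# Regions whose estimated probability varies most across clusters are a rough
# proxy for the "informative" chromosomal areas the Bayesian model identifies.
```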
Project description: Text classification tasks, particularly those involving a large number of features, pose significant challenges for effective feature selection. This research introduces a novel methodology, MBO-NB, which integrates the Migrating Birds Optimization (MBO) approach with naïve Bayes as an internal classifier to address these challenges. The motivation behind this study stems from the recognized limitations of existing techniques in efficiently handling extensive feature sets. Traditional approaches often fail to adequately streamline the feature selection process, resulting in suboptimal classification accuracy and increased computational overhead. In response to this need, our primary objective is to propose a scalable and effective solution that enhances both computational efficiency and classification accuracy in text classification systems. To achieve this objective, we preprocess raw data using the Information Gain algorithm, strategically reducing the feature count from an average of 62,221 to 2,089. Through extensive experiments, we demonstrate the superior effectiveness of MBO-NB in feature reduction compared to other existing techniques, resulting in significantly improved classification accuracy. Furthermore, the successful integration of naïve Bayes within MBO offers a comprehensive and well-rounded solution to the feature selection problem. In individual comparisons with Particle Swarm Optimization (PSO), MBO-NB consistently outperforms it by an average of 6.9% across four setups. This research provides valuable insights into enhancing feature selection methods and thereby contributes to the advancement of text classification techniques. By offering a scalable and effective solution, MBO-NB addresses the pressing need for improved feature selection in text classification, facilitating the development of more robust and efficient classification systems.
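The two-stage pipeline (an Information Gain filter followed by a wrapper search with naïve Bayes as the internal classifier) can be pictured roughly as below. The wrapper here is a plain stochastic hill climb, a deliberate simplification standing in for the MBO neighborhood-sharing scheme; the filter size, flip count, and use of mutual information as an Information-Gain proxy are illustrative choices.

```python
# Hedged sketch: an Information-Gain-style filter followed by a simple
# stochastic local search with naive Bayes as the internal classifier.
# This is a hill-climbing stand-in, not the Migrating Birds Optimization
# (MBO) scheme used by MBO-NB. X is assumed to hold nonnegative term counts.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

def filter_then_wrapper(X, y, k_filter=2000, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = SelectKBest(mutual_info_classif,
                    k=min(k_filter, X.shape[1])).fit_transform(X, y)
    d = X.shape[1]
    mask = rng.random(d) < 0.5                       # initial feature subset

    def fitness(m):
        if not m.any():
            return 0.0
        return cross_val_score(MultinomialNB(), X[:, m], y, cv=3).mean()

    best = fitness(mask)
    for _ in range(n_iter):                          # flip a few bits per step
        cand = mask.copy()
        flip = rng.choice(d, size=3, replace=False)
        cand[flip] = ~cand[flip]
        f = fitness(cand)
        if f >= best:
            best, mask = f, cand
    return mask, best                                # mask indexes the filtered columns
```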
Project description: The microbiome plays a critical role in human health and disease, and there is a strong scientific interest in linking specific features of the microbiome to clinical outcomes. There are key aspects of microbiome data, however, that limit the applicability of standard variable selection methods. In particular, the observed data are compositional, as the counts within each sample have a fixed-sum constraint. In addition, microbiome features, typically quantified as operational taxonomic units, often reflect microorganisms that are similar in function, and may therefore have a similar influence on the response variable. To address the challenges posed by these aspects of the data structure, we propose a variable selection technique with the following novel features: a generalized transformation and z-prior to handle the compositional constraint, and an Ising prior that encourages the joint selection of microbiome features that are closely related in terms of their genetic sequence similarity. We demonstrate that our proposed method outperforms existing penalized approaches for microbiome variable selection in both simulation and the analysis of real data exploring the relationship of the gut microbiome to body mass index.
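To make the two data-structure ideas concrete, the sketch below shows a centered log-ratio transform for the fixed-sum count constraint and an Ising-style log-prior over inclusion indicators that rewards co-selecting taxa with similar genetic sequences. Neither the paper's generalized transformation and z-prior nor its exact prior parameterization is reproduced; `a` and `b` are hypothetical hyperparameters.

```python
# Hedged sketch: a centered log-ratio (CLR) transform for compositional
# counts and an Ising-style prior over inclusion indicators that rewards
# selecting taxa whose sequences are similar. Not the paper's exact
# transformation or prior; `a` and `b` are hypothetical hyperparameters.
import numpy as np

def clr_transform(counts, pseudo=0.5):
    comp = counts + pseudo                           # avoid log(0)
    comp = comp / comp.sum(axis=1, keepdims=True)    # enforce the fixed-sum constraint
    logc = np.log(comp)
    return logc - logc.mean(axis=1, keepdims=True)   # center within each sample

def ising_log_prior(gamma, similarity, a=-2.0, b=1.0):
    # gamma: 0/1 inclusion vector; similarity: symmetric taxon-similarity matrix
    sparsity = a * gamma.sum()                       # overall preference for few taxa
    agreement = b * gamma @ similarity @ gamma       # reward co-selecting similar taxa
    return sparsity + agreement
```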
Project description: An efficient search for optimal solutions in Bayesian optimization (BO) entails providing appropriate initial samples when building a Gaussian process regression model. For general experimental designs without compounds or molecular descriptors in the explanatory variables x, selecting initial samples with larger D-optimality keeps the correlation among x in the selected samples low, which leads to effective regression model building. However, in the case of experimental designs with compounds, a high correlation always exists between molecular descriptors calculated from chemical structures, and compounds with similar structures form clusters in the chemical space. Therefore, selecting the initial samples uniformly from each cluster is desirable for obtaining initial samples with maximum information on experimental conditions. As D-optimality does not work well with highly correlated molecular descriptors and does not consider cluster information in sample selection, we propose an initial sample selection method based on clustering and apply it to the optimization of coupling reaction conditions with BO. We confirm that the proposed method reaches the optimal solution with up to 5% fewer experiments than random sampling or sampling based on D-optimality. This study contributes an initial sample selection method for BO, and we are convinced that the proposed method improves the search performance of BO in various fields of science and technology when initial samples can be determined using cluster information appropriately formed by utilizing domain knowledge.
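A minimal sketch of cluster-based initial sampling is given below: standardize the descriptor matrix, cluster the candidate compounds, and take the candidate nearest each cluster center as one initial experiment. The use of k-means, the number of clusters, and the nearest-to-center rule are illustrative assumptions rather than the paper's exact procedure.

```python
# Hedged sketch: pick Bayesian-optimization initial samples by clustering the
# compound descriptors and taking the candidate closest to each cluster
# center, so each structural cluster contributes one initial experiment.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def initial_samples_by_clustering(descriptors, n_initial=10, seed=0):
    Xs = StandardScaler().fit_transform(descriptors)
    km = KMeans(n_clusters=n_initial, n_init=10, random_state=seed).fit(Xs)
    picks = []
    for c in range(n_initial):                 # nearest candidate to each center
        members = np.flatnonzero(km.labels_ == c)
        dist = np.linalg.norm(Xs[members] - km.cluster_centers_[c], axis=1)
        picks.append(members[dist.argmin()])
    return np.array(picks)                     # row indices of the initial samples
```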
Project description: The increased popularity of the web has led to the inclusion of a huge amount of information on the web, and as a result of this explosive information growth, automated web page classification systems are needed to improve search engines' performance. Web pages have a large number of features, such as HTML/XML tags, URLs, hyperlinks, and text content, that should be considered during an automated classification process. The aim of this study is to reduce the number of features used, improving both the runtime and accuracy of web page classification. In this study, we used an ant colony optimization (ACO) algorithm to select the best features, and then we applied the well-known C4.5, naive Bayes, and k-nearest neighbor classifiers to assign class labels to web pages. We used the WebKB and Conference datasets in our experiments, and we showed that using ACO for feature selection improves both the accuracy and runtime performance of classification. We also showed that the proposed ACO-based algorithm can select better features than the well-known information gain and chi-square feature selection methods.
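The pheromone-guided subset search can be pictured with the bare-bones sketch below, which samples feature subsets from per-feature pheromone values, scores them with naive Bayes, and reinforces the best subset found. Real ACO variants add heuristic desirability terms and per-ant path construction; the evaluator, update rule, and parameter values here are simplified illustrative choices.

```python
# Hedged sketch: a bare-bones ant-colony-style feature selection loop with a
# naive Bayes evaluator. Only the pheromone-guided sampling idea is kept.
# X is assumed to hold nonnegative term counts from the web pages.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

def aco_like_feature_selection(X, y, n_ants=20, n_iter=30, evaporation=0.1, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pheromone = np.full(d, 0.5)
    best_mask, best_score = None, -np.inf
    for _ in range(n_iter):
        for _ in range(n_ants):
            mask = rng.random(d) < pheromone          # sample a feature subset
            if not mask.any():
                continue
            score = cross_val_score(MultinomialNB(), X[:, mask], y, cv=3).mean()
            if score > best_score:
                best_score, best_mask = score, mask
        if best_mask is not None:
            pheromone = (1 - evaporation) * pheromone # evaporation
            pheromone[best_mask] += evaporation       # reinforce the best subset
            pheromone = np.clip(pheromone, 0.05, 0.95)
    return best_mask, best_score
```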
Project description: Electron ptychography provides new opportunities to resolve atomic structures with deep sub-angstrom spatial resolution and to study electron-beam sensitive materials with high dose efficiency. In practice, obtaining accurate ptychography images requires simultaneously optimizing multiple parameters that are often selected based on trial-and-error, resulting in low-throughput experiments and preventing wider adoption. Here, we develop an automatic parameter selection framework to circumvent this problem using Bayesian optimization with Gaussian processes. With minimal prior knowledge, the workflow efficiently produces ptychographic reconstructions that are superior to those processed by experienced experts. The method also facilitates better experimental designs by exploring optimized experimental parameters from simulated data.
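A generic Gaussian-process BO loop over a few reconstruction parameters could look like the sketch below, written here with scikit-optimize. The parameter names and the `run_reconstruction` / `reconstruction_error` helpers are hypothetical placeholders for a ptychography reconstruction engine and its image-quality metric; they are not the parameters or workflow used in the paper.

```python
# Hedged sketch: Gaussian-process Bayesian optimization over a few
# reconstruction parameters with scikit-optimize. `run_reconstruction` and
# `reconstruction_error` are hypothetical stand-ins for a ptychography
# engine and its quality metric; the parameter names are illustrative.
from skopt import gp_minimize
from skopt.space import Real, Integer

space = [
    Real(1e-4, 1e-1, prior="log-uniform", name="step_size"),
    Real(0.0, 50.0, name="defocus_nm"),
    Integer(50, 500, name="n_iterations"),
]

def objective(params):
    step_size, defocus_nm, n_iterations = params
    recon = run_reconstruction(step_size=step_size,   # hypothetical engine call
                               defocus_nm=defocus_nm,
                               n_iterations=int(n_iterations))
    return reconstruction_error(recon)                # lower is better

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("best parameters:", result.x)
```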