ClearF: a supervised feature scoring method to find biomarkers using class-wise embedding and reconstruction.
ABSTRACT: BACKGROUND:Feature selection or scoring methods for the detection of biomarkers are essential in bioinformatics. Various feature selection methods have been developed for the detection of biomarkers, and several studies have employed information-theoretic approaches. However, most of these methods generally require a long processing time. In addition, information-theoretic methods discretize continuous features, which is a drawback that can lead to the loss of information. RESULTS:In this paper, a novel supervised feature scoring method named ClearF is proposed. The proposed method is suitable for continuous-valued data, which is similar to the principle of feature selection using mutual information, with the added advantage of a reduced computation time. The proposed score calculation is motivated by the association between the reconstruction error and the information-theoretic measurement. Our method is based on class-wise low-dimensional embedding and the resulting reconstruction error. Given multi-class datasets such as a case-control study dataset, low-dimensional embedding is first applied to each class to obtain a compressed representation of the class, and also for the entire dataset. Reconstruction is then performed to calculate the error of each feature and the final score for each feature is defined in terms of the reconstruction errors. The correlation between the information theoretic measurement and the proposed method is demonstrated using a simulation. For performance validation, we compared the classification performance of the proposed method with those of various algorithms on benchmark datasets. CONCLUSIONS:The proposed method showed higher accuracy and lower execution time than the other established methods. Moreover, an experiment was conducted on the TCGA breast cancer dataset, and it was confirmed that the genes with the highest scores were highly associated with subtypes of breast cancer.
Project description:Accurate predictions of remaining useful life (RUL) of important components play a crucial role in system reliability, which is the basis of prognostics and health management (PHM). This paper proposed an integrated deep learning approach for RUL prediction of a turbofan engine by integrating an autoencoder (AE) with a deep convolutional generative adversarial network (DCGAN). In the pretraining stage, the reconstructed data of the AE not only participate in its error reconstruction but also take part in the DCGAN parameter training as the generated data of the DCGAN. Through double-error reconstructions, the capability of feature extraction is enhanced, and high-level abstract information is obtained. In the fine-tuning stage, a long short-term memory (LSTM) network is used to extract the sequential information from the features to predict the RUL. The effectiveness of the proposed scheme is verified on the NASA commercial modular aero-propulsion system simulation (C-MAPSS) dataset. The superiority of the proposed method is demonstrated via excellent prediction performance and comparisons with other existing state-of-the-art prognostics. The results of this study suggest that the proposed data-driven prognostic method offers a new and promising prediction approach and an efficient feature extraction scheme.
Project description:It is important to cluster heterogeneous information networks. A fast clustering algorithm based on an approximate commute time embedding for heterogeneous information networks with a star network schema is proposed in this paper by utilizing the sparsity of heterogeneous information networks. First, a heterogeneous information network is transformed into multiple compatible bipartite graphs from the compatible point of view. Second, the approximate commute time embedding of each bipartite graph is computed using random mapping and a linear time solver. All of the indicator subsets in each embedding simultaneously determine the target dataset. Finally, a general model is formulated by these indicator subsets, and a fast algorithm is derived by simultaneously clustering all of the indicator subsets using the sum of the weighted distances for all indicators for an identical target object. The proposed fast algorithm, FctClus, is shown to be efficient and generalizable and exhibits high clustering accuracy and fast computation speed based on a theoretic analysis and experimental verification.
Project description:With great potential for assisting radiological image interpretation and decision making, content-based image retrieval in the medical domain has become a hot topic in recent years. Many methods to enhance the performance of content-based medical image retrieval have been proposed, among which the relevance feedback (RF) scheme is one of the most promising. Given user feedback information, RF algorithms interactively learn a user's preferences to bridge the "semantic gap" between low-level computerized visual features and high-level human semantic perception and thus improve retrieval performance. However, most existing RF algorithms perform in the original high-dimensional feature space and ignore the manifold structure of the low-level visual features of images. In this paper, we propose a new method, termed dual-force ISOMAP (DFISOMAP), for content-based medical image retrieval. Under the assumption that medical images lie on a low-dimensional manifold embedded in a high-dimensional ambient space, DFISOMAP operates in the following three stages. First, the geometric structure of positive examples in the learned low-dimensional embedding is preserved according to the isometric feature mapping (ISOMAP) criterion. To precisely model the geometric structure, a reconstruction error constraint is also added. Second, the average distance between positive and negative examples is maximized to separate them; this margin maximization acts as a force that pushes negative examples far away from positive examples. Finally, the similarity propagation technique is utilized to provide negative examples with another force that will pull them back into the negative sample set. We evaluate the proposed method on a subset of the IRMA medical image dataset with a RF-based medical image retrieval framework. Experimental results show that DFISOMAP outperforms popular approaches for content-based medical image retrieval in terms of accuracy and stability.
Project description:We propose a novel end-to-end approach, namely, the semantic-containing double-level embedding Bi-LSTM model (SCDE-Bi-LSTM), to solve the three key problems of Q&A matching in the Chinese medical field. In the similarity calculation of the Q&A core module, we propose a text similarity calculation method that contains semantic information, to solve the problem that previous Q&A methods do not incorporate the deep information of a sentence into the similarity calculations. For the sentence vector representation module, we present a double-level embedding sentence representation method to reduce the error caused by Chinese medical word segmentation. In addition, due to the problem of the attention mechanism tending to cause backward deviation of the features, we propose an improved algorithm based on Bi-LSTM in the feature extraction stage. The Q&A framework proposed in this paper not only retains important timing features but also loses low-frequency features and noise. Additionally, it is applicable to different domains. To verify the framework, extensive Chinese medical Q&A corpora are created. We run several state-of-the-art Q&A methods as contrastive experiments on the medical corpora and the current popular insuranceQA dataset under different performance measures. The experimental results on the medical corpora show that our framework significantly outperforms several strong baselines and achieves an improvement of top-1 accuracy of up to 14%, reaching 79.15%.
Project description:Hyperspectral image (HSI) consists of hundreds of narrow spectral band components with rich spectral and spatial information. Extreme Learning Machine (ELM) has been widely used for HSI analysis. However, the classical ELM is difficult to use for sparse feature leaning due to its randomly generated hidden layer. In this paper, we propose a novel unsupervised sparse feature learning approach, called Evolutionary Multiobjective-based ELM (EMO-ELM), and apply it to HSI feature extraction. Specifically, we represent the task of constructing the ELM Autoencoder (ELM-AE) as a multiobjective optimization problem that takes the sparsity of hidden layer outputs and the reconstruction error as two conflicting objectives. Then, we adopt an Evolutionary Multiobjective Optimization (EMO) method to solve the two objectives, simultaneously. To find the best solution from the Pareto solution set and construct the best trade-off feature extractor, a curvature-based method is proposed to focus on the knee area of the Pareto solutions. Benefited from the EMO, the proposed EMO-ELM is less prone to fall into a local minimum and has fewer trainable parameters than gradient-based AEs. Experiments on two real HSIs demonstrate that the features learned by EMO-ELM not only preserve better sparsity but also achieve superior separability than many existing feature learning methods.
Project description:BACKGROUND: Knowledge of structural class is used by numerous methods for identification of structural/functional characteristics of proteins and could be used for the detection of remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and which are designed for high identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences. RESULTS: The proposed MODular Approach to Structural class prediction (MODAS) method is unique as it allows for selection of any subset of the classes. MODAS is also the first to utilize a novel, custom-built feature-based sequence representation that combines evolutionary profiles and predicted secondary structure. The features quantify information relevant to the definition of the classes including conservation of residues and arrangement and number of helix/strand segments. Our comprehensive design considers 8 feature selection methods and 4 classifiers to develop Support Vector Machine-based classifiers that are tailored for each of the seven classes. Tests on 5 twilight-zone and 1 high-similarity benchmark datasets and comparison with over two dozens of modern competing predictors show that MODAS provides the best overall accuracy that ranges between 80% and 96.7% (83.5% for the twilight-zone datasets), depending on the dataset. This translates into 19% and 8% error rate reduction when compared against the best performing competing method on two largest datasets. The proposed predictor provides accurate predictions at 58% accuracy for membrane proteins class, which is not considered by majority of existing methods, in spite that this class accounts for only 2% of the data. Our predictive model is analyzed to demonstrate how and why the input features are associated with the corresponding classes. CONCLUSIONS: The improved predictions stem from the novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. Our work demonstrates that conservation and arrangement of the secondary structure segments predicted along the protein chain can successfully predict structural classes which are defined based on the spatial arrangement of the secondary structures. A web server is available at http://biomine.ece.ualberta.ca/MODAS/.
Project description:Detecting drug-drug interaction (DDI) has become a vital part of public health safety. Therefore, using text mining techniques to extract DDIs from biomedical literature has received great attentions. However, this research is still at an early stage and its performance has much room to improve.In this article, we present a syntax convolutional neural network (SCNN) based DDI extraction method. In this method, a novel word embedding, syntax word embedding, is proposed to employ the syntactic information of a sentence. Then the position and part of speech features are introduced to extend the embedding of each word. Later, auto-encoder is introduced to encode the traditional bag-of-words feature (sparse 0-1 vector) as the dense real value vector. Finally, a combination of embedding-based convolutional features and traditional features are fed to the softmax classifier to extract DDIs from biomedical literature. Experimental results on the DDIExtraction 2013 corpus show that SCNN obtains a better performance (an F-score of 0.686) than other state-of-the-art methods.The source code is available for academic use at http://126.96.36.199:8080/DDI/SCNN-DDI.zip CONTACT: email@example.comSupplementary information: Supplementary data are available at Bioinformatics online.
Project description:<h4>Background</h4>Elucidating gene regulatory networks is crucial for understanding normal cell physiology and complex pathologic phenotypes. Existing computational methods for the genome-wide "reverse engineering" of such networks have been successful only for lower eukaryotes with simple genomes. Here we present ARACNE, a novel algorithm, using microarray expression profiles, specifically designed to scale up to the complexity of regulatory networks in mammalian cells, yet general enough to address a wider range of network deconvolution problems. This method uses an information theoretic approach to eliminate the majority of indirect interactions inferred by co-expression methods.<h4>Results</h4>We prove that ARACNE reconstructs the network exactly (asymptotically) if the effect of loops in the network topology is negligible, and we show that the algorithm works well in practice, even in the presence of numerous loops and complex topologies. We assess ARACNE's ability to reconstruct transcriptional regulatory networks using both a realistic synthetic dataset and a microarray dataset from human B cells. On synthetic datasets ARACNE achieves very low error rates and outperforms established methods, such as Relevance Networks and Bayesian Networks. Application to the deconvolution of genetic networks in human B cells demonstrates ARACNE's ability to infer validated transcriptional targets of the cMYC proto-oncogene. We also study the effects of misestimation of mutual information on network reconstruction, and show that algorithms based on mutual information ranking are more resilient to estimation errors.<h4>Conclusion</h4>ARACNE shows promise in identifying direct transcriptional interactions in mammalian cellular networks, a problem that has challenged existing reverse engineering algorithms. This approach should enhance our ability to use microarray data to elucidate functional mechanisms that underlie cellular processes and to identify molecular targets of pharmacological compounds in mammalian cellular networks.
Project description:In this paper, we propose a new framework for tackling face recognition problem. The face recognition problem is formulated as groupwise deformable image registration and feature matching problem. The main contributions of the proposed method lie in the following aspects: (1) Each pixel in a facial image is represented by an anatomical signature obtained from its corresponding most salient scale local region determined by the survival exponential entropy (SEE) information theoretic measure. (2) Based on the anatomical signature calculated from each pixel, a novel Markov random field based groupwise registration framework is proposed to formulate the face recognition problem as a feature guided deformable image registration problem. The similarity between different facial images are measured on the nonlinear Riemannian manifold based on the deformable transformations. (3) The proposed method does not suffer from the generalizability problem which exists commonly in learning based algorithms. The proposed method has been extensively evaluated on four publicly available databases: FERET, CAS-PEAL-R1, FRGC ver 2.0, and the LFW. It is also compared with several state-of-the-art face recognition approaches, and experimental results demonstrate that the proposed method consistently achieves the highest recognition rates among all the methods under comparison.
Project description:Three-dimensional (3D) reconstruction using RGB-D camera with simultaneous color image and depth information is attractive as it can significantly reduce the cost of equipment and time for data collection. Point feature is commonly used for aligning two RGB-D frames. Due to lacking reliable point features, RGB-D simultaneous localization and mapping (SLAM) is easy to fail in low textured scenes. To overcome the problem, this paper proposes a robust RGB-D SLAM system fusing both points and lines, because lines can provide robust geometry constraints when points are insufficient. To comprehensively fuse line constraints, we combine 2D and 3D line reprojection error with point reprojection error in a novel cost function. To solve the cost function and filter out wrong feature matches, we build a robust pose solver using the Gauss-Newton method and Chi-Square test. To correct the drift of camera poses, we maintain a sliding-window framework to update the keyframe poses and related features. We evaluate the proposed system on both public datasets and real-world experiments. It is demonstrated that it is comparable to or better than state-of-the-art methods in consideration with both accuracy and robustness.