Protein sequences classification by means of feature extraction with substitution matrices.
ABSTRACT: BACKGROUND: This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address that task. It has been widely used to encode biological sequences into feature vectors so that well-known machine-learning classifiers, which require this format, can be applied. However, designing a suitable feature space for a set of proteins is not a trivial task. For this purpose, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step. RESULTS: In order to demonstrate the efficiency of such an approach, we compare several encoding methods using several machine learning classifiers. The experimental results showed that our encoding method outperforms the others in terms of classification accuracy and number of generated attributes. We also compared the classifiers in terms of accuracy. Results indicated that SVM generally outperforms the other classifiers with any encoding method. We showed that SVM, coupled with our encoding method, can be an efficient protein classification system. In addition, we studied the effect of varying the substitution matrix on the quality of our method and hence on the classification quality. We noticed that our method enables good classification accuracies with all the substitution matrices and that the accuracies obtained with different substitution matrices vary only slightly. However, the number of generated features varies from one substitution matrix to another. Furthermore, the use of already published datasets allowed us to carry out a comparison with several related works. CONCLUSIONS: The outcomes of our comparative experiments confirm the efficiency of our encoding method to represent protein sequences in classification tasks.
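The idea of using a substitution matrix to define similarity between motifs during extraction can be illustrated with a minimal sketch. This is not the authors' algorithm: the toy scores, the greedy merging rule, the threshold, and all function names below are assumptions made for illustration only. Motifs are taken to be fixed-length n-grams, and two motifs are treated as equivalent when their summed residue-pair score reaches the threshold.

```python
from collections import Counter

# Toy similarity scores between amino acids. These are illustrative values
# only, NOT an official substitution matrix such as BLOSUM or PAM.
TOY_SCORES = {
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
    ("K", "R"): 2, ("D", "E"): 2,
}
MATCH_SCORE = 4      # assumed score for identical residues
MISMATCH_SCORE = -2  # assumed default for pairs not listed above


def score(a, b):
    """Similarity score of two residues under the toy matrix."""
    if a == b:
        return MATCH_SCORE
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), MISMATCH_SCORE))


def motif_similarity(m1, m2):
    """Sum of position-wise residue scores for two equal-length motifs."""
    return sum(score(a, b) for a, b in zip(m1, m2))


def extract_motifs(sequences, n=3):
    """Count every n-gram occurring in the sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def merge_similar_motifs(counts, threshold):
    """Greedily map each motif to the first representative it resembles.

    Motifs are visited by decreasing count (ties broken alphabetically).
    A motif becomes a new representative if no existing one is similar.
    """
    representatives, mapping = [], {}
    for motif, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for rep in representatives:
            if motif_similarity(motif, rep) >= threshold:
                mapping[motif] = rep
                break
        else:
            representatives.append(motif)
            mapping[motif] = motif
    return representatives, mapping


def encode(seq, representatives, mapping, n=3):
    """Binary feature vector: one attribute per representative motif."""
    present = {mapping.get(seq[i:i + n]) for i in range(len(seq) - n + 1)}
    return [1 if rep in present else 0 for rep in representatives]
```

With the two toy sequences "AKLV" and "ARIV" and a threshold of 8, the four distinct 3-grams collapse onto two representatives, which mirrors the reduction in the number of attributes that the abstract reports.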
Project description:This paper proposes a new quantum-like method for binary classification applied to classical datasets. Inspired by the quantum Helstrom measurement, this approach has enabled us to define a new classifier, called Helstrom Quantum Centroid (HQC). This binary classifier (inspired by the concept of distinguishability between quantum states) acts on density matrices, called density patterns, that are the quantum encoding of classical patterns of a dataset. In this paper we compare the performance of HQC with that of twelve standard (linear and non-linear) classifiers over fourteen different datasets. The experimental results show that HQC outperforms the other classifiers in terms of Balanced Accuracy and other statistical measures. Finally, we show that the performance of our classifier is positively correlated with the increase in the number of "quantum copies" of a pattern and the resulting tensor product thereof.
Project description:This paper describes a general kernel regression approach to predict experimental conditions from activity patterns acquired with functional magnetic resonance imaging (fMRI). The standard approach is to use classifiers that predict conditions from activity patterns. Our approach involves training different regression machines for each experimental condition, so that a predicted temporal profile is computed for each condition. A decision function is then used to classify the responses from the testing volumes into the corresponding category, by comparing the predicted temporal profile elicited by each event against a canonical hemodynamic response function. This approach utilizes the temporal information in the fMRI signal and maintains more training samples in order to improve the classification accuracy over an existing strategy. This paper also introduces efficient techniques of temporal compaction, which operate directly on kernel matrices for kernel classification algorithms such as the support vector machine (SVM). Temporal compaction can convert the kernel computed from each fMRI volume directly into the kernel computed from beta-maps, the average of volumes, or a spatial-temporal kernel. The proposed method was applied to three different datasets. The first one is a block-design experiment with three conditions of image stimuli. The method outperformed the SVM classifiers of three different types of temporal compaction in single-subject leave-one-block-out cross-validation. Our method achieved 100% classification accuracy for six of the subjects and an average of 94% accuracy across all 16 subjects, exceeding the best SVM classification result, which was 83% accuracy (p=0.008). The second dataset is also a block-design experiment with two conditions of visual attention (left or right). Our method yielded 96% accuracy and SVM yielded 92% (p=0.005). The third dataset is from a fast event-related experiment with two categories of visual objects. Our method achieved 77% accuracy, compared with 72% using SVM (p=0.0006).
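One case of temporal compaction, going from a per-volume kernel to the kernel of averaged volumes, can be sketched as follows. The sketch assumes a linear kernel, for which the identity holds by bilinearity; the function names and the block layout are hypothetical, and the beta-map and spatial-temporal variants mentioned in the abstract are not covered.

```python
def linear_kernel(X):
    """Gram matrix of a linear kernel: K[i][j] = <X[i], X[j]>."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def compact_kernel(K, blocks):
    """Average the per-volume kernel over every pair of blocks.

    `blocks` is a list of lists of volume indices (one list per block).
    For a linear kernel, the result equals the kernel computed on the
    block-averaged volumes, without touching the voxel data again.
    """
    compacted = []
    for bi in blocks:
        row = []
        for bj in blocks:
            total = sum(K[i][j] for i in bi for j in bj)
            row.append(total / (len(bi) * len(bj)))
        compacted.append(row)
    return compacted


def average_volumes(X, blocks):
    """Reference path: average the raw volumes within each block."""
    dim = len(X[0])
    return [[sum(X[i][d] for i in b) / len(b) for d in range(dim)]
            for b in blocks]


# Four toy "volumes" with two voxels each, grouped into two blocks.
volumes = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
blocks = [[0, 1], [2, 3]]

via_kernel = compact_kernel(linear_kernel(volumes), blocks)
via_data = linear_kernel(average_volumes(volumes, blocks))
```

Both paths give the same 2-by-2 matrix here. The practical point is that the compacted kernel is obtained from the small volume-by-volume kernel matrix rather than from the high-dimensional voxel data.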
Project description:Urban areas feature complex and heterogeneous land covers which create challenging issues for tree species classification. The increased availability of high spatial resolution multispectral satellite imagery and LiDAR datasets, combined with the recent evolution of deep learning within remote sensing for object detection and scene classification, provides promising opportunities to map individual tree species with greater accuracy and resolution. However, there are knowledge gaps related to the contribution of WorldView-3 SWIR bands, the very high resolution PAN band and LiDAR data in detailed tree species mapping. Additionally, contemporary deep learning methods are hampered by a lack of training samples and the difficulty of preparing training data. The objective of this study was to examine the potential of a novel deep learning method, Dense Convolutional Network (DenseNet), to identify dominant individual tree species in a complex urban environment within a fused image of WorldView-2 VNIR, WorldView-3 SWIR and LiDAR datasets. DenseNet results were compared against two popular machine learning classifiers in remote sensing image analysis, Random Forest (RF) and Support Vector Machine (SVM). Our results demonstrated that: (1) utilizing a data fusion approach beginning with VNIR and adding SWIR, LiDAR, and panchromatic (PAN) bands increased the overall accuracy of the DenseNet classifier from 75.9% to 76.8%, 81.1% and 82.6%, respectively. (2) DenseNet significantly outperformed RF and SVM for the classification of eight dominant tree species with an overall accuracy of 82.6%, compared to 51.8% and 52% for SVM and RF classifiers, respectively. (3) DenseNet maintained superior performance over RF and SVM classifiers under restricted training sample quantities, which is a major limiting factor for deep learning techniques. Overall, the study reveals that DenseNet is more effective for urban tree species classification as it outperforms the popular RF and SVM techniques when working with highly complex image scenes regardless of training sample size.
Project description:This research proposes an intelligent decision support system for acute lymphoblastic leukaemia diagnosis from microscopic blood images. A novel clustering algorithm with stimulating discriminant measures (SDM) of both within- and between-cluster scatter variances is proposed to produce robust segmentation of nucleus and cytoplasm of lymphocytes/lymphoblasts. Specifically, the proposed between-cluster evaluation is formulated based on the trade-off of several between-cluster measures of well-known feature extraction methods. The SDM measures are used in conjunction with a Genetic Algorithm for clustering nucleus, cytoplasm, and background regions. Subsequently, a total of eighty features consisting of shape, texture, and colour information of the nucleus and cytoplasm sub-images are extracted. A number of classifiers (multi-layer perceptron, Support Vector Machine (SVM) and Dempster-Shafer ensemble) are employed for lymphocyte/lymphoblast classification. Evaluated with the ALL-IDB2 database, the proposed SDM-based clustering overcomes the shortcomings of Fuzzy C-means, which focuses purely on within-cluster scatter variance. It also outperforms Linear Discriminant Analysis and Fuzzy Compactness and Separation for nucleus-cytoplasm separation. The overall system achieves superior recognition rates of 96.72% and 96.67% accuracy using bootstrapping and 10-fold cross validation with Dempster-Shafer and SVM, respectively. The results also compare favourably with those reported in the literature, indicating the usefulness of the proposed SDM-based clustering method.
Project description:BACKGROUND: PAM, a nearest shrunken centroid method (NSC), is a popular classification method for high-dimensional data. ALP and AHP are NSC algorithms that were proposed to improve upon PAM. The NSC methods base their classification rules on shrunken centroids; in practice the amount of shrinkage is estimated by minimizing the overall cross-validated (CV) error rate. RESULTS: We show that when data are class-imbalanced the three NSC classifiers are biased towards the majority class. The bias is larger when the number of variables or class-imbalance is larger and/or the differences between classes are smaller. To diminish the class-imbalance problem of the NSC classifiers we propose to estimate the amount of shrinkage by maximizing the CV geometric mean of the class-specific predictive accuracies (g-means). CONCLUSIONS: The results obtained on simulated and real high-dimensional class-imbalanced data show that our approach outperforms the currently used strategy based on the minimization of the overall error rate when NSC classifiers are biased towards the majority class. The number of variables included in the NSC classifiers when using our approach is much smaller than with the original approach. This result is supported by experiments on simulated and real high-dimensional class-imbalanced data.
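The difference between the two selection criteria can be made concrete with a short sketch. It computes g-means as the geometric mean of the class-specific accuracies and picks the shrinkage value that maximizes it. The function names, the dictionary of cross-validated predictions, and the toy labels are assumptions for illustration; the NSC classifiers themselves are not implemented here.

```python
import math


def class_accuracies(y_true, y_pred):
    """Fraction of correctly predicted samples within each true class."""
    accuracies = {}
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        accuracies[c] = sum(1 for i in idx if y_pred[i] == c) / len(idx)
    return accuracies


def g_means(y_true, y_pred):
    """Geometric mean of the class-specific predictive accuracies."""
    accs = list(class_accuracies(y_true, y_pred).values())
    return math.prod(accs) ** (1 / len(accs))


def overall_error(y_true, y_pred):
    """Overall misclassification rate."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)


def select_shrinkage(y_true, cv_predictions):
    """Return the shrinkage value whose CV predictions maximize g-means.

    `cv_predictions` maps a candidate shrinkage value to the list of
    cross-validated predicted labels obtained with that value.
    """
    return max(cv_predictions, key=lambda t: g_means(y_true, cv_predictions[t]))


# Toy imbalanced problem: 8 majority samples (0) and 2 minority samples (1).
y_true = [0] * 8 + [1] * 2
cv_predictions = {
    2.0: [0] * 10,                        # everything assigned to the majority
    0.5: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],  # minority recovered, some majority lost
}
best = select_shrinkage(y_true, cv_predictions)
```

In this toy case the overall error rate favours the shrinkage value that assigns every sample to the majority class, whereas g-means is zero for that value and therefore selects the other one.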
Project description:BACKGROUND: Prediction of bacterial virulent protein sequences has implications for identification and characterization of novel virulence-associated factors, finding novel drug/vaccine targets against proteins indispensable to pathogenicity, and understanding the complex virulence mechanism in pathogens. RESULTS: In the present study we propose a bacterial virulent protein prediction method based on a bi-layer cascade Support Vector Machine (SVM). The first layer SVM classifiers were trained and optimized with different individual protein sequence features like amino acid composition, dipeptide composition (occurrences of the possible pairs of ith and i+1th amino acid residues), higher order dipeptide composition (pairs of ith and i+2nd residues) and Position Specific Iterated BLAST (PSI-BLAST) generated Position Specific Scoring Matrices (PSSM). In addition, a similarity-search based module was also developed using a dataset of virulent and non-virulent proteins as a BLAST database. A five-fold cross-validation technique was used for the evaluation of various prediction strategies in this study. The results from the first layer (SVM scores and PSI-BLAST result) were cascaded to the second layer SVM classifier to train and generate the final classifier. The cascade SVM classifier was able to accomplish an accuracy of 81.8%, covering 86% area in the Receiver Operator Characteristic (ROC) plot, better than that of any of the layer one SVM classifiers based on single or multiple sequence features. CONCLUSION: VirulentPred is an SVM-based method to predict bacterial virulent protein sequences, which can be used to screen virulent proteins in proteomes. Together with experimentally verified virulent proteins, several putative, non-annotated and hypothetical protein sequences have been predicted to be high scoring virulent proteins by the prediction method. VirulentPred is available as a freely accessible World Wide Web server at http://bioinfo.icgeb.res.in/virulent/.
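The composition features named in the abstract are simple to compute, and a minimal sketch may help fix the definitions. This is not VirulentPred's code: the function names, the `gap` parameter, and the normalization by the number of residue pairs are assumptions. Here `gap=1` pairs residues i and i+1 (dipeptide composition) and `gap=2` pairs residues i and i+2 (higher order dipeptide composition).

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order shared by every vector.
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]
_DIPEPTIDE_INDEX = {pair: k for k, pair in enumerate(DIPEPTIDES)}


def amino_acid_composition(seq):
    """20-dimensional vector of residue frequencies."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(seq, gap=1):
    """400-dimensional vector of frequencies of residue pairs (i, i + gap).

    Pairs containing a non-standard residue are skipped but still count
    towards the denominator (an assumption of this sketch).
    """
    total = len(seq) - gap
    if total <= 0:
        return [0.0] * len(DIPEPTIDES)
    counts = [0] * len(DIPEPTIDES)
    for i in range(total):
        k = _DIPEPTIDE_INDEX.get(seq[i] + seq[i + gap])
        if k is not None:
            counts[k] += 1
    return [c / total for c in counts]


# Toy sequence: adjacent pairs are AC, CA, AC; pairs at distance 2 are AA, CC.
adjacent = dipeptide_composition("ACAC", gap=1)
skipped = dipeptide_composition("ACAC", gap=2)
```

Each vector has a fixed length regardless of sequence length, which is what allows sequences of different lengths to be fed to the same SVM.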
Project description:MOTIVATION: At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a cumbersome shortcoming of SVMs is that their learned decision rules are very hard for humans to understand and cannot easily be related to biological facts. RESULTS: To make SVM-based sequence classifiers more accessible and profitable, we introduce the concept of positional oligomer importance matrices (POIMs) and propose an efficient algorithm for their computation. In contrast to the raw SVM feature weighting, POIMs take into account the underlying correlation structure of k-mer features induced by overlaps of related k-mers. POIMs can be seen as a powerful generalization of sequence logos: they make it possible to capture and visualize sequence patterns that are relevant for the investigated biological phenomena. AVAILABILITY: All source code, datasets, tables and figures are available at http://www.fml.tuebingen.mpg.de/raetsch/projects/POIM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Project description:Due to the existence of Lingzhi adulteration, there is a growing demand for species classification of medicinal mushrooms by various techniques. The objective of this study was to explore a rapid and reliable way to distinguish between different Lingzhi species and compare the influence of data pretreatment methods on the recognition results. To this end, 120 fresh fruiting bodies of Lingzhi were collected, and all of them were analyzed by attenuated total reflection-Fourier transform infrared spectroscopy (ATR-FTIR). Random forest (RF), support vector machine (SVM) and partial least squares discriminant analysis (PLS-DA) classification models were established for raw and pretreated second derivative (SD) spectral matrices to authenticate different Lingzhi species. The results of multivariate statistical analysis indicated that the SD preprocessing method displayed a higher classification ability, which may be attributed to the analysis of powder samples that requires removal of overlapping peaks and baseline shifts. Compared with RF, the results of the SVM and PLS-DA methods were more satisfying, and their accuracies for the test set were both 100%. Among SVM and PLS-DA, the training set and test set accuracy of PLS-DA were both 100%. In conclusion, ATR-FTIR spectroscopy data pretreated by SD combined with PLS-DA is a simple, rapid, non-destructive and relatively inexpensive method to discriminate between mushroom species and provide a good reference to quality assessment.
Project description:BACKGROUND: With millisecond-level resolution, electroencephalographic (EEG) recording provides a sensitive tool to assay neural dynamics of human cognition. However, selection of EEG features used to answer experimental questions is typically determined a priori. The utility of machine learning was investigated as a computational framework for extracting the most relevant features from EEG data empirically. METHODS: Schizophrenia (SZ; n = 40) and healthy community (HC; n = 12) subjects completed a Sternberg Working Memory Task (SWMT) during EEG recording. EEG was analyzed to extract 5 frequency components (theta1, theta2, alpha, beta, gamma) at 4 processing stages (baseline, encoding, retention, retrieval) and 3 scalp sites (frontal-Fz, central-Cz, occipital-Oz) separately for correctly and incorrectly answered trials. The 1-norm support vector machine (SVM) method was used to build EEG classifiers of SWMT trial accuracy (correct vs. incorrect; Model 1) and diagnosis (HC vs. SZ; Model 2). External validity of SVM models was examined in relation to neuropsychological test performance and diagnostic classification using conventional regression-based analyses. RESULTS: SWMT performance was significantly reduced in SZ (p < .001). Model 1 correctly classified trial accuracy at 84% in HC, and at 74% when cross-validated in SZ data. Frontal gamma at encoding and central theta at retention provided highest weightings, accounting for 76% of variance in SWMT scores and 42% variance in neuropsychological test performance across samples. Model 2 identified frontal theta at baseline and frontal alpha during retrieval as primary classifiers of diagnosis, providing 87% classification accuracy as a discriminant function. CONCLUSIONS: EEG features derived by SVM are consistent with literature reports of gamma's role in memory encoding, engagement of theta during memory retention, and elevated resting low-frequency activity in schizophrenia. Tests of model performance and cross-validation support the stability and generalizability of results, and utility of SVM as an analytic approach for EEG feature selection.
Project description:BACKGROUND: Support vector machine (SVM) has been widely used as an accurate and reliable method to decipher brain patterns from functional MRI (fMRI) data. Previous studies have not found a clear benefit for non-linear (polynomial kernel) SVM versus linear SVM. Here, a more effective non-linear SVM using a radial basis function (RBF) kernel is compared with linear SVM. Unlike traditional studies, which focused merely on either the evaluation of different types of SVM or the voxel selection methods, we aimed to investigate the overall performance of linear and RBF SVM for fMRI classification together with voxel selection schemes, in terms of classification accuracy and computation time. METHODOLOGY/PRINCIPAL FINDINGS: Six different voxel selection methods were employed to decide which voxels of fMRI data would be included in SVM classifiers with linear and RBF kernels in classifying 4-category objects. Then the overall performances of voxel selection and classification methods were compared. Results showed that: (1) Voxel selection had an important impact on the classification accuracy of the classifiers: in a relatively low-dimensional feature space, RBF SVM significantly outperformed linear SVM; in a relatively high-dimensional space, linear SVM performed better than its counterpart; (2) Considering classification accuracy and computation time together, linear SVM with relatively more voxels as features and RBF SVM with a small set of voxels (after PCA) achieved better accuracy at lower time cost. CONCLUSIONS/SIGNIFICANCE: The present work provides the first empirical result of linear and RBF SVM in classification of fMRI data, combined with voxel selection methods. Based on the findings, if only classification accuracy is of concern, RBF SVM with an appropriately small set of voxels and linear SVM with relatively more voxels are the two suggested solutions; if users are more concerned about computational time, RBF SVM with a relatively small set of voxels, with part of the principal components kept as features, is the better choice.