Analyzing kernel matrices for the identification of differentially expressed genes.
ABSTRACT: One of the most important applications of microarray data is the class prediction of biological samples. For this purpose, statistical tests have often been applied to identify the differentially expressed genes (DEGs), followed by the use of state-of-the-art learning machines, the Support Vector Machine (SVM) in particular. The SVM is a typical sample-based classifier whose performance comes down to how discriminant the samples are. However, DEGs identified by statistical tests are not guaranteed to result in a training dataset composed of discriminant samples. To tackle this problem, a novel gene ranking method, Kernel Matrix Gene Selection (KMGS), is proposed. The rationale of the method, which is rooted in the fundamental ideas of the SVM algorithm, is described. The notion of "the separability of a sample", which is estimated by performing [Formula: see text]-like statistics on each column of the kernel matrix, is first introduced. The separability of a classification problem is then measured, from which the significance of a specific gene is deduced. Also described is a method of Kernel Matrix Sequential Forward Selection (KMSFS), which shares the KMGS method's essential ideas but proceeds in a greedy manner. On three public microarray datasets, our proposed algorithms achieved competitive performance in terms of the B.632+ error rate.
Project description:This paper describes a general kernel regression approach to predict experimental conditions from activity patterns acquired with functional magnetic resonance imaging (fMRI). The standard approach is to use classifiers that predict conditions from activity patterns. Our approach involves training different regression machines for each experimental condition, so that a predicted temporal profile is computed for each condition. A decision function is then used to classify the responses from the testing volumes into the corresponding category, by comparing the predicted temporal profile elicited by each event against a canonical hemodynamic response function. This approach utilizes the temporal information in the fMRI signal and maintains more training samples in order to improve the classification accuracy over an existing strategy. This paper also introduces efficient techniques of temporal compaction, which operate directly on kernel matrices for kernel classification algorithms such as the support vector machine (SVM). Temporal compaction can convert the kernel computed from each fMRI volume directly into the kernel computed from beta-maps, averages of volumes, or a spatial-temporal kernel. The proposed method was applied to three different datasets. The first one is a block-design experiment with three conditions of image stimuli. The method outperformed SVM classifiers with three different types of temporal compaction in single-subject leave-one-block-out cross-validation. Our method achieved 100% classification accuracy for six of the subjects and an average of 94% accuracy across all 16 subjects, exceeding the best SVM classification result, which was 83% accuracy (p=0.008). The second dataset is also a block-design experiment with two conditions of visual attention (left or right). Our method yielded 96% accuracy and SVM yielded 92% (p=0.005). The third dataset is from a fast event-related experiment with two categories of visual objects.
Our method achieved 77% accuracy, compared with 72% using SVM (p=0.0006).
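The idea of temporal compaction operating directly on a kernel matrix can be illustrated for the simplest case, averaging volumes under a linear kernel. The sketch below is not the paper's implementation: the function names, the toy data, and the restriction to a linear kernel are assumptions made for illustration. It relies only on the identity that, if A is the matrix that averages the volumes of each block, the kernel between block averages equals A K Aᵀ, where K is the per-volume kernel.

```python
import numpy as np

def averaging_matrix(block_ids):
    """Return A (n_blocks x n_volumes); each row averages the volumes of one block."""
    block_ids = np.asarray(block_ids)
    blocks = np.unique(block_ids)
    A = np.zeros((blocks.size, block_ids.size))
    for row, b in enumerate(blocks):
        members = block_ids == b
        A[row, members] = 1.0 / members.sum()
    return A

def compact_linear_kernel(K_volumes, block_ids):
    """Convert a per-volume linear kernel into the kernel between block averages."""
    A = averaging_matrix(block_ids)
    return A @ K_volumes @ A.T

# Toy data (hypothetical): 6 volumes of 50 voxels, grouped into 2 blocks.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))
block_ids = [0, 0, 0, 1, 1, 1]

K_volumes = X @ X.T                                # kernel computed once per volume pair
K_compact = compact_linear_kernel(K_volumes, block_ids)

# Reference: average the volumes first, then compute the kernel.
X_avg = averaging_matrix(block_ids) @ X
K_direct = X_avg @ X_avg.T
identity_holds = np.allclose(K_compact, K_direct)
```

The practical appeal of working on the kernel is that the per-volume kernel is computed once and can then be compacted in several ways without returning to the voxel data. Nonlinear kernels do not satisfy this identity in general.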
Project description:Genomic prediction benefits hybrid rice breeding by increasing selection intensity and accelerating breeding cycles. With the rapid advancement of technology, other omic data, such as metabolomic data and transcriptomic data, are readily available for predicting breeding values for agronomically important traits. In this study, the best prediction strategies were determined for yield, 1000 grain weight, number of grains per panicle, and number of tillers per plant of hybrid rice (derived from recombinant inbred lines) by comprehensively evaluating all possible combinations of omic datasets with different prediction methods. It was demonstrated that, in rice, the predictions using a combination of genomic and metabolomic data generally produce better results than single-omics predictions or predictions based on other combined omic data. Best linear unbiased prediction (BLUP) appears to be the most efficient prediction method compared to the other commonly used approaches, including least absolute shrinkage and selection operator (LASSO), stochastic search variable selection (SSVS), support vector machines with radial basis function and epsilon regression (SVM-R(EPS)), support vector machines with radial basis function and nu regression (SVM-R(NU)), support vector machines with polynomial kernel and epsilon regression (SVM-P(EPS)), support vector machines with polynomial kernel and nu regression (SVM-P(NU)) and partial least squares regression (PLS). This study has provided guidelines for selection of hybrid rice in terms of which types of omic datasets and which method should be used to achieve higher trait predictability. The answer to these questions will benefit academic research and will also greatly reduce the operative cost for the industry which specializes in breeding and selection.
Project description:Despite the intrinsic elemental analysis capability and lack of sample preparation requirements, laser-induced breakdown spectroscopy (LIBS) has not been extensively used for real-world applications, e.g., quality assurance and process monitoring. Specifically, variability in sample, system, and experimental parameters in LIBS studies presents a substantive hurdle for robust classification, even when standard multivariate chemometric techniques are used for analysis. Considering pharmaceutical sample investigation as an example, we propose the use of support vector machines (SVM) as a nonlinear classification method over conventional linear techniques such as soft independent modeling of class analogy (SIMCA) and partial least-squares discriminant analysis (PLS-DA) for discrimination based on LIBS measurements. Using over-the-counter pharmaceutical samples, we demonstrate that the application of SVM enables statistically significant improvements in prospective classification accuracy (sensitivity), because of its ability to address variability in LIBS sample ablation and plasma self-absorption behavior. Furthermore, our results reveal that SVM provides nearly 10% improvement in correct allocation rate and a concomitant reduction in misclassification rates of 75% (cf. PLS-DA) and 80% (cf. SIMCA) when measurements from samples not included in the training set are incorporated in the test data, highlighting its robustness. While further studies on a wider matrix of sample types performed using different LIBS systems are needed to fully characterize the capability of SVM to provide superior predictions, we anticipate that the improved sensitivity and robustness observed here will facilitate application of the proposed LIBS-SVM toolbox for screening drugs and detecting counterfeit samples, as well as in related areas of forensic and biological sample analysis.
Project description:Alzheimer's disease (AD) is the form of dementia that affects the most people worldwide. Therefore, early identification that supports effective treatment is required to improve the quality of life of a large number of patients. Recently, computer-aided diagnosis tools for dementia using Magnetic Resonance Imaging scans have been successfully proposed to discriminate between patients with AD, mild cognitive impairment (MCI), and healthy controls (HC). Most of the attention has been given to the clinical data provided by initiatives such as the ADNI, which support reliable research on intervention, prevention, and treatment of AD. Therefore, there is a need to improve the performance of classification machines. In this paper, we propose a kernel framework for learning metrics that enhances conventional machines and supports the diagnosis of dementia. Our framework builds discriminative spaces by maximizing the centered kernel alignment function, with the aim of improving the discrimination of the three considered neurological classes. The performance of the proposed metric learning is evaluated on the widely known ADNI database using three supervised classification machines (k-nn, SVM and NNs) for multi-class and bi-class scenarios from structural MRIs. Specifically, 286 AD patients, 379 MCI patients and 231 healthy controls from the ADNI collection are used for development and validation of our proposed metric learning framework. For the experimental validation, we split the data into two subsets: 30% of subjects are held out for blind assessment and 70% are employed for parameter tuning. Then, in the preprocessing stage, a total of 310 morphological measurements are automatically extracted from each structural MRI scan by the FreeSurfer software package and concatenated to build an input feature matrix. The test results show that including supervised metric learning improves the compared baseline classifiers in both scenarios.
In the multi-class scenario, we achieve the best performance (accuracy 60.1%) with a pretrained 1-layered NN, and we obtain measures over 90% on average for the HC vs. AD task. From the machine learning point of view, our proposal enhances classifier performance by building spaces with better class separability. From the clinical point of view, our enhancement results in a more balanced performance across classes than the compared approaches from the CADDementia challenge, by increasing the sensitivity for pathological groups and the specificity for healthy controls.
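The quantity that the framework maximizes, centered kernel alignment, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' metric learning procedure: it only evaluates the alignment between a feature kernel and an ideal label kernel (1 where two subjects share a class, 0 otherwise), and the function names and toy features are hypothetical. A learned metric would be chosen so that this score increases.

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix in feature space: H K H with H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def centered_kernel_alignment(K, L):
    """Frobenius inner product of the centered kernels, normalized by their norms."""
    Kc, Lc = center_kernel(K), center_kernel(L)
    return np.sum(Kc * Lc) / (np.linalg.norm(Kc) * np.linalg.norm(Lc))

def label_kernel(y):
    """Ideal target kernel: 1 where two samples share a label, 0 otherwise."""
    y = np.asarray(y)
    return (y[:, None] == y[None, :]).astype(float)

# Toy example (hypothetical): three balanced classes, one feature each.
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
informative = y[:, None].astype(float)   # feature varies only between classes
uninformative = np.array([1.0, 0.0, -1.0] * 3)[:, None]  # varies only within classes

L_target = label_kernel(y)
cka_informative = centered_kernel_alignment(informative @ informative.T, L_target)
cka_uninformative = centered_kernel_alignment(uninformative @ uninformative.T, L_target)
```

In this toy case the between-class feature scores about 0.707 and the within-class feature scores 0, which is the behavior that makes the alignment usable as a class separability objective.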
Project description:Protein post-translational modification (PTM) is an important mechanism that is involved in the regulation of protein function. Because experimental identification is costly and labor-intensive, many computational methods are currently available for predicting PTM sites from protein local sequence information in the context of conserved motifs. Here we propose a novel computational method that combines multiple kernels in support vector machines (SVM) to predict PTM sites including phosphorylation, O-linked glycosylation, acetylation, sulfation and nitration. To make the most of local sequence information and site-modification relationships, we developed a local sequence kernel and a Gaussian interaction profile kernel, respectively. Multiple kernels were further combined to train the SVM, leveraging kernel information efficiently to boost predictive performance. We compared the proposed method with existing PTM prediction methods. The experimental results revealed that the proposed method achieved comparable or better performance than the existing prediction methods, suggesting the feasibility of the developed kernels and the usefulness of the proposed method in PTM site prediction.
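How two such kernels might be built and combined can be sketched as below. Everything specific here is an assumption, since the abstract does not give the definitions: the Gaussian interaction profile kernel uses the commonly cited form exp(-γ‖p_i − p_j‖²) with γ scaled by the mean squared profile norm, the local sequence kernel is replaced by a plain linear kernel on toy features, and the combination is a normalized non-negative weighted sum. All names are hypothetical.

```python
import numpy as np

def gaussian_profile_kernel(profiles, gamma_scale=1.0):
    """Gaussian kernel on binary interaction profiles (assumed form, see lead-in)."""
    P = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(P ** 2, axis=1)
    gamma = gamma_scale / sq_norms.mean()      # bandwidth normalized by mean profile norm
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * P @ P.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def combine_kernels(kernels, weights):
    """Non-negative weighted sum of kernels; the result stays positive semidefinite."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("kernel weights must be non-negative")
    weights = weights / weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))

# Toy data (hypothetical): 3 sites, binary site-modification profiles
# and stand-in sequence features.
profiles = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
seq_features = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])

K_profile = gaussian_profile_kernel(profiles)
K_sequence = seq_features @ seq_features.T     # stand-in for a local sequence kernel
K_combined = combine_kernels([K_sequence, K_profile], weights=[0.5, 0.5])

min_eigenvalue = np.linalg.eigvalsh(K_combined).min()
```

A combined matrix of this kind can be passed to any SVM implementation that accepts a precomputed kernel. The weights are a free design choice and would normally be tuned by cross-validation.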
Project description:This paper describes a new method based on a voltammetric electronic tongue (ET) for the recognition of distinctive features in coffee samples. An ET was directly applied to different samples from the main Mexican coffee regions without any pretreatment before the analysis. The resulting electrochemical information was modeled with two different mathematical tools, namely Linear Discriminant Analysis (LDA) and Support Vector Machines (SVM). Growing conditions (i.e., organic or non-organic practices and altitude of crops) were considered for a first classification. LDA results showed an average discrimination rate of 88% ± 6.53% while SVM achieved an overall accuracy of 96.4% ± 3.50% for the same task. A second classification based on geographical origin of samples was carried out. Results showed an overall accuracy of 87.5% ± 7.79% for LDA and a superior performance of 97.5% ± 3.22% for SVM. Given the complexity of coffee samples, the high accuracy achieved by ET coupled with SVM in both classification problems suggests a potential applicability of ET in the assessment of selected coffee features with a simpler and faster methodology that requires no sample pretreatment. In addition, the proposed method can be applied to authentication assessment while improving cost, time and accuracy of the general procedure.
Project description:Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. The selection of those genes that are important for distinguishing the different sample classes being compared poses a challenging problem in high dimensional data analysis. We describe a new procedure for selecting significant genes as recursive cluster elimination (RCE) rather than recursive feature elimination (RFE). We have tested this algorithm on six datasets and compared its performance with that of two related classification procedures with RFE. We have developed a novel method for selecting significant genes in comparative gene expression studies. This method, which we refer to as SVM-RCE, combines K-means, a clustering method, to identify correlated gene clusters, and Support Vector Machines (SVMs), a supervised machine learning classification method, to identify and score (rank) those gene clusters for the purpose of classification. K-means is used initially to group genes into clusters. Recursive cluster elimination (RCE) is then applied to iteratively remove those clusters of genes that contribute the least to the classification performance. SVM-RCE identifies the clusters of correlated genes that are most significantly differentially expressed between the sample classes. Utilization of gene clusters, rather than individual genes, enhances the supervised classification accuracy of the same data as compared to the accuracy when either SVM or Penalized Discriminant Analysis (PDA) with recursive feature elimination (SVM-RFE and PDA-RFE) are used to remove genes based on their individual discriminant weights. SVM-RCE provides improved classification accuracy with complex microarray data sets when it is compared to the classification accuracy of the same datasets using either SVM-RFE or PDA-RFE.
SVM-RCE identifies clusters of correlated genes that when considered together provide greater insight into the structure of the microarray data. Clustering genes for classification appears to result in some concomitant clustering of samples into subgroups. Our present implementation of SVM-RCE groups genes using the correlation metric. The success of the SVM-RCE method in classification suggests that gene interaction networks or other biologically relevant metrics that group genes based on functional parameters might also be useful.
Project description:With the surge of translational medicine and computational omics research, complex disease diagnosis relies more and more on massive omics data-driven molecular signature detection. However, how to detect and prevent possible diagnostic biases in translational bioinformatics remains an unsolved problem despite its importance in the coming era of personalized medicine. In this study, we comprehensively investigate the diagnostic bias problem by analyzing benchmark gene array, protein array, RNA-Seq and miRNA-Seq data under the framework of support vector machines for different model selection methods. We further categorize the diagnostic biases into different types by conducting rigorous kernel matrix analysis and provide effective machine learning methods to conquer the diagnostic biases. We have found that the diagnostic biases occur for data with different distributions and for SVMs with different kernels. Moreover, we identify a total of three types of diagnostic biases in SVM diagnostics: overfitting bias, label skewness bias, and underfitting bias, and present the corresponding reasons through rigorous analysis. Compared with the overfitting and underfitting biases, the label skewness bias is more challenging to detect and conquer because its deceptive accuracy makes it easy to mistake for a normal diagnostic case. To tackle this problem, we propose derivative component analysis based support vector machines (DCA-SVM), which conquer the label skewness bias while achieving rivaling clinical diagnostic results. Our studies demonstrate that the diagnostic biases are mainly caused by three major factors, i.e. kernel selection, the signal amplification mechanism in high-throughput profiling, and the training data label distribution.
Moreover, the proposed DCA-SVM diagnosis provides a generic solution for overcoming the label skewness bias, owing to the powerful feature extraction capability of derivative component analysis. Our work identifies and solves an important but less addressed problem in translational research. It also has a positive impact on machine learning by adding new results to kernel-based learning for omics data.
Project description:Filtered selection coupled with support vector machines generates a functionally relevant prediction model for colorectal cancer (CRC). In this study, we built a model that uses a Support Vector Machine (SVM) to classify cancer and normal samples using Affymetrix exon microarray data obtained from 90 samples of 48 patients diagnosed with CRC. From the 22,011 genes, we selected the 20, 30, 50, 100, 200, 300 and 500 genes most relevant to CRC using the Minimum-Redundancy–Maximum-Relevance (mRMR) technique. With these gene sets, an SVM model was designed using four different kernel types (linear, polynomial, radial basis function and sigmoid). Overall design: We conducted a pair-wise comparison of Tumor vs Normal samples obtained from cancer patients. Array data was processed using Expression Console. Patient details for samples 052311 and 082812 are missing.
Project description:BACKGROUND: Breast cancer is one of the leading causes of death for women. It is necessary to develop effective methods for breast cancer detection and diagnosis. Recent studies have focused on gene-based signatures for outcome predictions. Kernel SVM has attracted a lot of attention for its discriminative power in small-sample pattern recognition problems. But how to select or construct an appropriate kernel for a specified problem still needs further investigation. RESULTS: Here we propose a novel kernel (Hadamard Kernel) in conjunction with Support Vector Machines (SVMs) to address the problem of breast cancer outcome prediction using gene expression data. The Hadamard Kernel outperforms the classical kernels and the correlation kernel in terms of Area under the ROC Curve (AUC) values on a number of real-world data sets used to test the performance of the different methods. CONCLUSIONS: The Hadamard Kernel SVM is effective for breast cancer predictions, in terms of either prognosis or diagnosis. It may benefit patients by guiding therapeutic options. Apart from that, it would be a valuable addition to the current SVM kernel families. We hope it will contribute to the wider biology and related communities.