Project description:MotivationMass spectrometry imaging (MSI) provides rich biochemical information in a label-free manner and therefore holds promise to substantially impact current practice in disease diagnosis. However, the complex nature of MSI data poses computational challenges in its analysis. The complexity of the data arises from its large size, high-dimensionality and spectral nonlinearity. Preprocessing, including peak picking, has been used to reduce raw data complexity; however, peak picking is sensitive to parameter selection that, perhaps prematurely, shapes the downstream analysis for tissue classification and ensuing biological interpretation.ResultsWe propose a deep learning model, massNet, that provides the desired qualities of scalability, nonlinearity and speed in MSI data analysis. This deep learning model was used, without prior preprocessing and peak picking, to classify MSI data from a mouse brain harboring a patient-derived tumor. The massNet architecture established automatically learning of predictive features, and automated methods were incorporated to identify peaks with potential for tumor delineation. The model's performance was assessed using cross-validation, and the results demonstrate higher accuracy and a substantial gain in speed compared to the established classical machine learning method, support vector machine.Availability and implementationhttps://github.com/wabdelmoula/massNet. The data underlying this article are available in the NIH Common Fund's National Metabolomics Data Repository (NMDR) Metabolomics Workbench under project id (PR001292) with http://dx.doi.org/10.21228/M8Q70T.Supplementary informationSupplementary data are available at Bioinformatics online.
Project description:SummaryIn recent years, SWATH-MS has become the proteomic method of choice for data-independent-acquisition, as it enables high proteome coverage, accuracy and reproducibility. However, data analysis is convoluted and requires prior information and expert curation. Furthermore, as quantification is limited to a small set of peptides, potentially important biological information may be discarded. Here we demonstrate that deep learning can be used to learn discriminative features directly from raw MS data, eliminating hence the need of elaborate data processing pipelines. Using transfer learning to overcome sample sparsity, we exploit a collection of publicly available deep learning models already trained for the task of natural image classification. These models are used to produce feature vectors from each mass spectrometry (MS) raw image, which are later used as input for a classifier trained to distinguish tumor from normal prostate biopsies. Although the deep learning models were originally trained for a completely different classification task and no additional fine-tuning is performed on them, we achieve a highly remarkable classification performance of 0.876 AUC. We investigate different types of image preprocessing and encoding. We also investigate whether the inclusion of the secondary MS2 spectra improves the classification performance. Throughout all tested models, we use standard protein expression vectors as gold standards. Even with our naïve implementation, our results suggest that the application of deep learning and transfer learning techniques might pave the way to the broader usage of raw mass spectrometry data in real-time diagnosis.Availability and implementationThe open source code used to generate the results from MS images is available on GitHub: https://ibm.biz/mstransc. The raw MS data underlying this article cannot be shared publicly for the privacy of individuals that participated in the study. Processed data including the MS images, their encodings, classification labels and results can be accessed at the following link: https://ibm.box.com/v/mstc-supplementary.Supplementary informationSupplementary data are available at Bioinformatics online.
Project description:Computational analysis is crucial to capitalize on the wealth of spatio-molecular information generated by mass spectrometry imaging (MSI) experiments. Currently, the spatial information available in MSI data is often under-utilized, due to the challenges of in-depth spatial pattern extraction. The advent of deep learning has greatly facilitated such complex spatial analysis. In this work, we use a pre-trained neural network to extract high-level features from ion images in MSI data, and test whether this improves downstream data analysis. The resulting neural network interpretation of ion images, coined neural ion images, is used to cluster ion images based on spatial expressions. We evaluate the impact of neural ion images on two ion image clustering pipelines, namely DBSCAN clustering, combined with UMAP-based dimensionality reduction, and k-means clustering. In both pipelines, we compare regular and neural ion images from two different MSI datasets. All tested pipelines could extract underlying spatial patterns, but the neural network-based pipelines provided better assignment of ion images, with more fine-grained clusters, and greater consistency in the spatial structures assigned to individual clusters. Additionally, we introduce the relative isotope ratio metric to quantitatively evaluate clustering quality. The resulting scores show that isotopical m/z values are more often clustered together in the neural network-based pipeline, indicating improved clustering outcomes. The usefulness of neural ion images extends beyond clustering towards a generic framework to incorporate spatial information into any MSI-focused machine learning pipeline, both supervised and unsupervised.
Project description:Imputation techniques provide means to replace missing measurements with a value and are used in almost all downstream analysis of mass spectrometry (MS) based proteomics data using label-free quantification (LFQ). Here we demonstrate how collaborative filtering, denoising autoencoders, and variational autoencoders can impute missing values in the context of LFQ at different levels. We applied our method, proteomics imputation modeling mass spectrometry (PIMMS), to an alcohol-related liver disease (ALD) cohort with blood plasma proteomics data available for 358 individuals. Removing 20 percent of the intensities we were able to recover 15 out of 17 significant abundant protein groups using PIMMS-VAE imputations. When analyzing the full dataset we identified 30 additional proteins (+13.2%) that were significantly differentially abundant across disease stages compared to no imputation and found that some of these were predictive of ALD progression in machine learning models. We, therefore, suggest the use of deep learning approaches for imputing missing values in MS-based proteomics on larger datasets and provide workflows for these.
Project description:Methodologies that facilitate high-throughput proteomic analysis are a key step toward moving proteome investigations into clinical translation. Data independent acquisition (DIA) has potential as a high-throughput analytical method due to the reduced time needed for sample analysis, as well as its highly quantitative accuracy. However, a limiting feature of DIA methods is the sensitivity of detection of low abundant proteins and depth of coverage, which other mass spectrometry approaches address by two-dimensional fractionation (2D) to reduce sample complexity during data acquisition. In this study, we developed a 2D-DIA method intended for rapid- and deeper-proteome analysis compared to conventional 1D-DIA analysis. First, we characterized 96 individual fractions obtained from the protein standard, NCI-7, using a data-dependent approach (DDA), identifying a total of 151,366 unique peptides from 11,273 protein groups. We observed that the majority of the proteins can be identified from just a few selected fractions. By performing an optimization analysis, we identified six fractions with high peptide number and uniqueness that can account for 80% of the proteins identified in the entire experiment. These selected fractions were combined into a single sample which was then subjected to DIA (referred to as 2D-DIA) quantitative analysis. Furthermore, improved DIA quantification was achieved using a hybrid spectral library, obtained by combining peptides identified from DDA data with peptides identified directly from the DIA runs with the help of DIA-Umpire. The optimized 2D-DIA method allowed for improved identification and quantification of low abundant proteins compared to conventional unfractionated DIA analysis (1D-DIA). We then applied the 2D-DIA method to profile the proteomes of two breast cancer patient-derived xenograft (PDX) models, quantifying 6,217 and 6,167 unique proteins in basal- and luminal- tumors, respectively. Overall, this study demonstrates the potential of high-throughput quantitative proteomics using a novel 2D-DIA method.
Project description:Mass spectrometry imaging (MSI) is widely used for the label-free molecular mapping of biological samples. The identification of co-localized molecules in MSI data is crucial to the understanding of biochemical pathways. One of key challenges in molecular colocalization is that complex MSI data are too large for manual annotation but too small for training deep neural networks. Herein, we introduce a self-supervised clustering approach based on contrastive learning, which shows an excellent performance in clustering of MSI data. We train a deep convolutional neural network (CNN) using MSI data from a single experiment without manual annotations to effectively learn high-level spatial features from ion images and classify them based on molecular colocalizations. We demonstrate that contrastive learning generates ion image representations that form well-resolved clusters. Subsequent self-labeling is used to fine-tune both the CNN encoder and linear classifier based on confidently classified ion images. This new approach enables autonomous and high-throughput identification of co-localized species in MSI data, which will dramatically expand the application of spatial lipidomics, metabolomics, and proteomics in biological research.
Project description:Mass spectrometry imaging (MSI) is an emerging technology that holds potential for improving, biomarker discovery, metabolomics research, pharmaceutical applications and clinical diagnosis. Despite many solutions being developed, the large data size and high dimensional nature of MSI, especially 3D datasets, still pose computational and memory complexities that hinder accurate identification of biologically relevant molecular patterns. Moreover, the subjectivity in the selection of parameters for conventional pre-processing approaches can lead to bias. Therefore, we assess if a probabilistic generative model based on a fully connected variational autoencoder can be used for unsupervised analysis and peak learning of MSI data to uncover hidden structures. The resulting msiPL method learns and visualizes the underlying non-linear spectral manifold, revealing biologically relevant clusters of tissue anatomy in a mouse kidney and tumor heterogeneity in human prostatectomy tissue, colorectal carcinoma, and glioblastoma mouse model, with identification of underlying m/z peaks. The method is applied for the analysis of MSI datasets ranging from 3.3 to 78.9 GB, without prior pre-processing and peak picking, and acquired using different mass spectrometers at different centers.
Project description:BackgroundMass spectrometry (MS) has become a promising analytical technique to acquire proteomics information for the characterization of biological samples. Nevertheless, most studies focus on the final proteins identified through a suite of algorithms by using partial MS spectra to compare with the sequence database, while the pattern recognition and classification of raw mass-spectrometric data remain unresolved.ResultsWe developed an open-source and comprehensive platform, named MSpectraAI, for analyzing large-scale MS data through deep neural networks (DNNs); this system involves spectral-feature swath extraction, classification, and visualization. Moreover, this platform allows users to create their own DNN model by using Keras. To evaluate this tool, we collected the publicly available proteomics datasets of six tumor types (a total of 7,997,805 mass spectra) from the ProteomeXchange consortium and classified the samples based on the spectra profiling. The results suggest that MSpectraAI can distinguish different types of samples based on the fingerprint spectrum and achieve better prediction accuracy in MS1 level (average 0.967).ConclusionThis study deciphers proteome profiling of raw mass spectrometry data and broadens the promising application of the classification and prediction of proteomics data from multi-tumor samples using deep learning methods. MSpectraAI also shows a better performance compared to the other classical machine learning approaches.
Project description:A lipidome comprises thousands of lipid species, many of which are isomers and isobars. Liquid chromatography-tandem mass spectrometry (LC-MS/MS), although widely used for lipidomic profiling, faces challenges in differentiating lipid isomers. Herein, we address this issue by leveraging the orthogonal separation capabilities of hydrophilic interaction liquid chromatography (HILIC) and trapped ion mobility spectrometry (TIMS). We further integrate isomer-resolved MS/MS methods onto HILIC-TIMS, which enable pinpointing double bond locations in phospholipids and sn-positions in phosphatidylcholine. This system profiles phospholipids at multiple structural levels with short analysis time (<10 min per LC run), high sensitivity (nM detection limit), and wide coverage, while data analysis is streamlined using a home-developed software, LipidNovelist. Notably, compared to our previous report, the system doubles the coverage of phospholipids in bovine liver and reveals uncanonical desaturation pathways in RAW 264.7 macrophages. Relative quantitation of the double bond location isomers of phospholipids and the sn-position isomers of phosphatidylcholine enables the phenotyping of human bladder cancer tissue relative to normal control, which would be otherwise indistinguishable by traditional profiling methods. Our research offers a comprehensive solution for lipidomic profiling and highlights the critical role of isomer analysis in studying lipid metabolism in both healthy and diseased states.
Project description:Characterizing the human leukocyte antigen (HLA) bound ligandome by mass spectrometry (MS) holds great promise for developing vaccines and drugs for immune-oncology. Still, the identification of non-tryptic peptides presents substantial computational challenges. To address these, we synthesized and analyzed >300,000 peptides by multi-modal LC-MS/MS within the ProteomeTools project representing HLA class I & II ligands and products of the proteases AspN and LysN. The resulting data enabled training of a single model using the deep learning framework Prosit, allowing the accurate prediction of fragment ion spectra for tryptic and non-tryptic peptides. Applying Prosit demonstrates that the identification of HLA peptides can be improved up to 7-fold, that 87% of the proposed proteasomally spliced HLA peptides may be incorrect and that dozens of additional immunogenic neo-epitopes can be identified from patient tumors in published data. Together, the provided peptides, spectra and computational tools substantially expand the analytical depth of immunopeptidomics workflows.