Project description:Motivation: Multiple sequence alignment (MSA) is a basic step in many bioinformatics pipelines. However, achieving highly accurate alignments on large datasets, especially those with sequence length heterogeneity, is a challenging task. Ultra-large multiple sequence alignment using Phylogeny-aware Profiles (UPP) is a method for MSA estimation that builds an ensemble of Hidden Markov Models (eHMM) to represent an estimated alignment on the full-length sequences in the input, and then adds the remaining sequences into the alignment using selected HMMs in the ensemble. Although UPP provides good accuracy, it is computationally intensive on large datasets. Results: We present UPP2, a direct improvement on UPP. The main advance is a fast technique for selecting HMMs in the ensemble that allows us to achieve the same accuracy as UPP but with greatly reduced runtime. We show that UPP2 produces more accurate alignments compared to leading MSA methods on datasets exhibiting substantial sequence length heterogeneity and is among the most accurate otherwise. Availability and implementation: https://github.com/gillichu/sepp. Supplementary information: Supplementary data are available at Bioinformatics online.
Project description:Mass spectrometry (MS) is the main technology used in proteomics approaches. However, on average 75% of spectra analysed in an MS experiment remain unidentified. We propose to use spectrum clustering at large scale to shed light on these unidentified spectra. The PRoteomics IDEntifications database (PRIDE) Archive is one of the largest public MS proteomics data repositories worldwide. By clustering all tandem MS spectra publicly available in PRIDE Archive, coming from hundreds of datasets, we were able to consistently characterize three distinct groups of spectra: 1) incorrectly identified spectra, 2) spectra correctly identified but below the set scoring threshold, and 3) truly unidentified spectra. Using a multitude of complementary analysis approaches, we were able to identify less than 20% of the consistently unidentified spectra. The complete spectrum clustering results are available through the new version of the PRIDE Cluster resource (http://www.ebi.ac.uk/pride/cluster). This resource is intended, among other aims, to encourage and simplify further investigation into these unidentified spectra.
Project description:Motivation: In bioinformatics, multiple sequence alignment (MSA) is a crucial task. However, conventional methods often struggle with aligning ultralong sequences. To address this issue, researchers have designed MSA methods rooted in a vertical division strategy, which segments sequence data for parallel alignment. A prime example of this approach is FMAlign, which utilizes the FM-index to extract common seeds and segment the sequences accordingly. Results: FMAlign2 leverages the suffix array to identify maximal exact matches, redefining the approach of FMAlign from searching for global chains to partial chains. By employing a vertical division strategy, a large-scale problem is decomposed into manageable tasks, enabling parallel execution of subMSA. Furthermore, sequence-profile alignment and refinement are incorporated to concatenate subsets, yielding the final result. Compared to FMAlign, FMAlign2 markedly improves the segmentation of sequences and significantly reduces the running time while maintaining accuracy, especially on ultralong datasets. Importantly, FMAlign2 enhances existing MSA methods by conferring the capability to handle sequences reaching billions in length within an acceptable time frame. Availability: Source code and datasets are available at https://github.com/malabz/FMAlign2 and https://zenodo.org/records/10435770. Contact: pingluzhang@outlook.com. Supplementary information: Supplementary data are available at Bioinformatics online.
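The vertical division idea described above can be illustrated with a minimal sketch. It is not FMAlign2's implementation: instead of a suffix array and maximal exact matches, it uses a naive search for a shared k-mer as the anchor, and the function names `find_anchor` and `vertical_division` are hypothetical. It only shows how shared anchors cut a set of sequences into blocks that could then be aligned independently.

```python
def find_anchor(seqs, k):
    """Return the first k-mer of seqs[0] found exactly once in every sequence, or None.

    Assumption: str.count (non-overlapping occurrences) is precise enough
    for this sketch; a real tool would use an index such as a suffix array.
    """
    first = seqs[0]
    for start in range(len(first) - k + 1):
        kmer = first[start:start + k]
        if all(s.count(kmer) == 1 for s in seqs):
            return kmer
    return None


def vertical_division(seqs, k):
    """Split all sequences into blocks separated by shared anchors.

    Each block is a list with one segment per input sequence. Anchor blocks
    are already aligned; the other blocks could be aligned in parallel.
    """
    blocks = []
    remaining = list(seqs)
    while True:
        anchor = find_anchor(remaining, k)
        if anchor is None:
            # No further shared anchor: the rest forms the last block.
            blocks.append(remaining)
            return blocks
        positions = [s.index(anchor) for s in remaining]
        blocks.append([s[:p] for s, p in zip(remaining, positions)])
        blocks.append([anchor] * len(remaining))
        remaining = [s[p + k:] for s, p in zip(remaining, positions)]


seqs = ["ACGTTTGACCA", "AGGTTTGACTA", "ACTTTGACGCA"]
blocks = vertical_division(seqs, k=5)
```

Concatenating the segments of each sequence across all blocks reproduces the input, so the division loses no data; only the non-anchor blocks still need an alignment method.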
Project description:Summary: AliView is an alignment viewer and editor designed to meet the requirements of next-generation sequencing era phylogenetic datasets. AliView handles alignments of unlimited size in the formats most commonly used, i.e. FASTA, Phylip, Nexus, Clustal and MSF. The intuitive graphical interface makes it easy to inspect, sort, delete, merge and realign sequences as part of the manual filtering process of large datasets. AliView also works as an easy-to-use alignment editor for small as well as large datasets. Availability and implementation: AliView is released as open-source software under the GNU General Public License, version 3.0 (GPLv3), and is available at GitHub (www.github.com/AliView). The program is cross-platform and extensively tested on Linux, Mac OS X and Windows systems. Downloads and help are available at http://ormbunkar.se/aliview Contact: anders.larsson@ebc.uu.se Supplementary information: Supplementary data are available at Bioinformatics online.
Project description:We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment (FSA) program is based on pair hidden Markov models, which approximate an insertion/deletion process on a tree, and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment (previously available only with alignment programs that use computationally expensive Markov Chain Monte Carlo approaches), yet it can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation, which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/.
Project description:Histone post-translational modifications contribute to chromatin function through their chemical properties, which influence chromatin structure, and through their ability to recruit chromatin-interacting proteins. Nanoflow liquid chromatography coupled with high resolution tandem mass spectrometry (nanoLC-MS/MS) has emerged as the most suitable technology for global histone modification analysis because its high sensitivity and high mass accuracy provide confident identification. However, analysis of histones with this method is especially challenging because of the large number and variety of isobaric histone peptides and the high dynamic range of histone peptide abundances. Here, we introduce EpiProfile, a software tool that discriminates isobaric histone peptides using the distinguishing fragment ions in their tandem mass spectra and extracts the chromatographic area under the curve using prior knowledge about peptide retention time. The accuracy of EpiProfile was evaluated by analysis of mixtures containing different ratios of synthetic histone peptides. In addition to label-free quantification of histone peptides, EpiProfile is flexible and can quantify different types of isotopically labeled histone peptides. EpiProfile is unique in generating layouts (i.e. relative retention time) of histone peptides when compared with manual quantification of the data and other programs (such as Skyline), filling the need for an automatic and freely available tool to quantify labeled and non-labeled modified histone peptides. In summary, EpiProfile is a valuable nanoLC-MS/MS-based quantification tool for histone peptides, which can also be adapted to analyze nonhistone protein samples.
Project description:Neoantigen-based immunotherapy has yielded promising results in clinical trials. However, it is limited to tumor-specific mutations, and is often tailored to individual patients. Identifying suitable tumor-specific antigens is still a major challenge. Previous proteogenomics studies have identified peptides encoded by predicted non-coding sequences in the human genome. To investigate whether tumors express specific peptides encoded by non-coding genes, we analyzed published proteomics data from five cancer types, including 933 tumor samples and 275 matched normal samples, and compared these to data from 31 different healthy human tissues. Our results reveal that many predicted non-coding genes such as DGCR9 and RHOXF1P3 encode peptides that are overexpressed in tumors compared to normal controls. Furthermore, from the peptides encoded by non-coding genes and specifically detected in cancers, we predict a large number of "dark antigens" (neoantigens from non-coding genomic regions), which may provide an alternative source of neoantigens beyond standard tumor-specific mutations.
Project description:Next generation sequencing (NGS) of metagenomic samples is becoming a standard approach to detect individual species or pathogenic strains of microorganisms. Computer programs used in the NGS community have to balance speed against sensitivity; as a result, species or strain level identification is often inaccurate and low abundance pathogens can sometimes be missed. We have developed Taxoner, an open-source taxon assignment pipeline that includes a fast aligner (e.g. Bowtie2) and a comprehensive DNA sequence database. We tested the program on simulated datasets as well as experimental data from Illumina, IonTorrent, and Roche 454 sequencing platforms. We found that Taxoner performs as well as, and often better than, BLAST, but requires two orders of magnitude less running time, meaning that it can be run on desktop or laptop computers. Taxoner is slower than the approaches that use small marker databases but is more sensitive due to the comprehensive reference database. In addition, it can be easily tuned to specific applications using small tailored databases. When applied to metagenomic datasets, Taxoner can provide a functional summary of the genes mapped and can provide strain level identification. Taxoner is written in C for Linux operating systems. The code and documentation are available for research applications at http://code.google.com/p/taxoner.
Project description:The emergence of efficient fragmentation methods such as electron capture dissociation (ECD) and electron transfer dissociation (ETD) provides the opportunity for detailed structural characterization of heavily covalently modified large peptides and small proteins such as intact histones. Even with effective gas phase ion isolation, so that a single molecular precursor ion is selected, the MS/MS spectrum of a heavily modified peptide may reveal the presence of a mixture of peptides with the same amino acid sequence and the same total number of posttranslational modification (PTM) moieties (same PTM composition) but with different PTM configurations or site-specific occupancy isoforms. Currently available data analysis methods depend on a deisotoping procedure, which becomes less effective when spectra (fragmentation patterns) contain many overlapping isotopic distributions. Peptide database search engines can only identify the most abundant PTM configuration (PTM arrangement on different residues) in such mixtures. To identify all the PTM configurations present in these mixtures and to estimate their relative abundances, we extended our fragment assignment by visual assistance program to search for ions representing all possible configurations, subject to the total PTM composition constraint. This resulted in the identification of PTM configurations supported by unique fragment ions, and their relative abundances were estimated by use of a non-negative least squares procedure.
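The non-negative least squares step can be sketched as follows. This is an illustration under stated assumptions, not the authors' program: each column of the hypothetical matrix `A` holds the expected fragment-ion signal of one PTM configuration, `b` holds observed intensities, and the problem is solved with plain projected gradient descent rather than a dedicated NNLS solver such as Lawson-Hanson. The function names and the toy numbers are invented for the example.

```python
def nnls_projected_gradient(A, b, iterations=2000):
    """Approximately minimise ||A x - b||^2 subject to x >= 0.

    Assumes A is a non-empty list of equal-length rows with at least one
    non-zero entry. The squared Frobenius norm of A bounds the largest
    eigenvalue of A^T A, so its inverse is a safe step size.
    """
    n_rows, n_cols = len(A), len(A[0])
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * n_cols
    for _ in range(iterations):
        residual = [sum(A[i][j] * x[j] for j in range(n_cols)) - b[i]
                    for i in range(n_rows)]
        grad = [sum(A[i][j] * residual[i] for i in range(n_rows))
                for j in range(n_cols)]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, x[j] - step * grad[j]) for j in range(n_cols)]
    return x


def relative_abundances(A, b):
    """Normalise the non-negative coefficients so that they sum to 1."""
    x = nnls_projected_gradient(A, b)
    total = sum(x)
    return [v / total for v in x] if total > 0 else x


# Toy data: two configurations; rows 1 and 2 are fragment ions unique to
# one configuration each, row 3 is an ion shared by both.
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [3.0, 1.0, 4.0]
abundances = relative_abundances(A, b)
```

In this toy case the observed intensities are explained exactly by a 3:1 mixture, so the estimated relative abundances come out close to 0.75 and 0.25.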
Project description:Mass spectrometry is a valued method to evaluate the metabolomic content of a biological sample. The recent advent of rapid ionization technologies such as Laser Diode Thermal Desorption (LDTD) and Direct Analysis in Real Time (DART) has rendered high-throughput mass spectrometry possible. It is used for large-scale comparative analysis of populations of samples. In practice, many factors resulting from the environment, the protocol, and even the instrument itself can lead to minor discrepancies between spectra, rendering automated comparative analysis difficult. In this work, a pipeline of algorithms to correct variations between spectra is proposed. The algorithms correct multiple spectra by identifying peaks that are common to all and, from those, compute a spectrum-specific correction. We show that these algorithms increase comparability within large datasets of spectra, facilitating comparative analysis, such as machine learning.
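A minimal sketch of the idea of correcting spectra from their common peaks is given below. The abstract does not specify the form of the correction, so this example assumes the simplest one, a single additive m/z shift per spectrum; the matching tolerance `tol`, the function names, and the toy peak lists are likewise assumptions made for illustration.

```python
from statistics import mean


def find_common_peaks(spectra, tol=0.01):
    """For each peak of the first spectrum that has a match within `tol`
    in every other spectrum, return the matched m/z values (one per spectrum)."""
    common = []
    for mz in spectra[0]:
        matches = [mz]
        for other in spectra[1:]:
            nearest = min(other, key=lambda x: abs(x - mz), default=None)
            if nearest is None or abs(nearest - mz) > tol:
                break
            matches.append(nearest)
        else:
            common.append(matches)
    return common


def correct_spectra(spectra, tol=0.01):
    """Shift every spectrum so that its common peaks sit on the consensus positions."""
    common = find_common_peaks(spectra, tol)
    if not common:
        return [list(s) for s in spectra]
    # Consensus position of each common peak: the mean over all spectra.
    consensus = [mean(m) for m in common]
    corrected = []
    for i, spectrum in enumerate(spectra):
        # Spectrum-specific additive shift: mean deviation from the consensus.
        shift = mean(c - m[i] for c, m in zip(consensus, common))
        corrected.append([mz + shift for mz in spectrum])
    return corrected


spectra = [
    [100.000, 200.000, 300.000],
    [100.004, 200.004, 250.000, 300.004],
    [99.998, 199.998, 299.998],
]
corrected = correct_spectra(spectra)
```

After correction the three common peaks coincide across spectra, while the peak at 250 that appears in only one spectrum is shifted along with the rest of that spectrum.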