ANDES: Statistical tools for the ANalyses of DEep Sequencing.
ABSTRACT: BACKGROUND:The advancements in DNA sequencing technologies have allowed researchers to progress from the analyses of a single organism towards the deep sequencing of a sample of organisms. With sufficient sequencing depth, it is now possible to detect subtle variations between members of the same species, or between mixed species with shared biomarkers, such as the 16S rRNA gene. However, traditional sequencing analyses of samples from largely homogeneous populations are often still based on multiple sequence alignments (MSA), where each sequence is placed along a separate row and similarities between aligned bases can be followed down each column. While this visual format is intuitive for a small set of aligned sequences, the representation quickly becomes cumbersome as sequencing depths cover loci hundreds or thousands of reads deep. FINDINGS:We have developed ANDES, a software library and a suite of applications, written in Perl and R, for the statistical ANalyses of DEep Sequencing. The fundamental data structure underlying ANDES is the position profile, which contains the nucleotide distributions for each genomic position resultant from a multiple sequence alignment (MSA). Tools include the root mean square deviation (RMSD) plot, which allows for the visual comparison of multiple samples on a position-by-position basis, and the computation of base conversion frequencies (transition/transversion rates), variation (Shannon entropy), inter-sample clustering and visualization (dendrogram and multidimensional scaling (MDS) plot), threshold-driven consensus sequence generation and polymorphism detection, and the estimation of empirically determined sequencing quality values. CONCLUSIONS:As new sequencing technologies evolve, deep sequencing will become increasingly cost-efficient and the inter and intra-sample comparisons of largely homogeneous sequences will become more common. We have provided a software package and demonstrated its application on various empirically-derived datasets. Investigators may download the software from Sourceforge at https://sourceforge.net/projects/andestools.
Project description:Profile-based analysis of multiple sequence alignments (MSA) allows for accurate comparison of protein families. Here, we address the problems of detecting statistically confident dissimilarities between (1) MSA position and a set of predicted residue frequencies, and (2) between two MSA positions. These problems are important for (i) evaluation and optimization of methods predicting residue occurrence at protein positions; (ii) detection of potentially misaligned regions in automatically produced alignments and their further refinement; and (iii) detection of sites that determine functional or structural specificity in two related families.For problems (1) and (2), we propose analytical estimates of P-value and apply them to the detection of significant positional dissimilarities in various experimental situations. (a) We compare structure-based predictions of residue propensities at a protein position to the actual residue frequencies in the MSA of homologs. (b) We evaluate our method by the ability to detect erroneous position matches produced by an automatic sequence aligner. (c) We compare MSA positions that correspond to residues aligned by automatic structure aligners. (d) We compare MSA positions that are aligned by high-quality manual superposition of structures. Detected dissimilarities reveal shortcomings of the automatic methods for residue frequency prediction and alignment construction. For the high-quality structural alignments, the dissimilarities suggest sites of potential functional or structural importance.The proposed computational method is of significant potential value for the analysis of protein families.
Project description:Both multiple sequence alignment and phylogenetic analysis are problematic in the "twilight zone" of sequence similarity (? 25% amino acid identity). Herein we explore the accuracy of phylogenetic inference at extreme sequence divergence using a variety of simulated data sets. We evaluate four leading multiple sequence alignment (MSA) methods (MAFFT, T-COFFEE, CLUSTAL, and MUSCLE) and six commonly used programs of tree estimation (Distance-based: Neighbor-Joining; Character-based: PhyML, RAxML, GARLI, Maximum Parsimony, and Bayesian) against a novel MSA-independent method (PHYRN) described here. Strikingly, at "midnight zone" genetic distances (~7% pairwise identity and 4.0 gaps per position), PHYRN returns high-resolution phylogenies that outperform traditional approaches. We reason this is due to PHRYN's capability to amplify informative positions, even at the most extreme levels of sequence divergence. We also assess the applicability of the PHYRN algorithm for inferring deep evolutionary relationships in the divergent DANGER protein superfamily, for which PHYRN infers a more robust tree compared to MSA-based approaches. Taken together, these results demonstrate that PHYRN represents a powerful mechanism for mapping uncharted frontiers in highly divergent protein sequence data sets.
Project description:BACKGROUND:Multiple Sequence Alignments (MSAs) are the starting point of molecular evolutionary analyses. Errors in MSAs generate a non-historical signal that can lead to incorrect inferences. Therefore, numerous efforts have been made to reduce the impact of alignment errors, by improving alignment algorithms and by developing methods to filter out poorly aligned regions. However, MSAs do not only contain alignment errors, but also primary sequence errors. Such errors may originate from sequencing errors, from assembly errors, or from erroneous structural annotations (such as incorrect intron/exon boundaries). Even though their existence is acknowledged, the impact of primary sequence errors on evolutionary inference is poorly characterized. RESULTS:In a first step to fill this gap, we have developed a program called HmmCleaner, which detects and eliminates these errors from MSAs. It uses profile hidden Markov models (pHMM) to identify sequence segments that poorly fit their MSA and selectively removes them. We assessed its performances using >?700 amino-acid MSAs from prokaryotes and eukaryotes, in which we introduced several types of simulated primary sequence errors. The sensitivity of HmmCleaner towards simulated primary sequence errors was >?95%. In a second step, we compared the impact of segment filtering software (HmmCleaner and PREQUAL) relative to commonly used block-filtering software (BMGE and TrimAI) on evolutionary analyses. Using real data from vertebrates, we observed that segment-filtering methods improve the quality of evolutionary inference more than the currently used block-filtering methods. The formers were especially effective at improving branch length inferences, and at reducing false positive rate during detection of positive selection. CONCLUSIONS:Segment filtering methods such as HmmCleaner accurately detect simulated primary sequence errors. Our results suggest that these errors are more detrimental than alignment errors. However, they also show that stochastic (sampling) error is predominant in single-gene evolutionary inferences. Therefore, we argue that MSA filtering should focus on segment instead of block removal and that more studies are required to find the optimal balance between accuracy improvement and stochastic error increase brought by data removal.
Project description:Using deep sequencing technologies such as Illumina's platform, it is possible to obtain reads from the viral RNA population revealing the viral genome diversity within a single host. A range of software tools and pipelines can transform raw deep sequencing reads into Sequence Alignment Mapping (SAM) files. We propose that interpretation tools should process these SAM files, directly translating individual reads to amino acids in order to extract statistics of interest such as the proportion of different amino acid residues at specific sites. This preserves per-read linkage between nucleotide variants at different positions within a codon location. The samReporter is a subsystem of the GLUE software toolkit which follows this direct read translation approach in its processing of SAM files. We test samReporter on a deep sequencing dataset obtained from a cohort of 241 UK HCV patients for whom prior treatment with direct-acting antivirals has failed; deep sequencing and resistance testing have been suggested to be of clinical use in this context. We compared the polymorphism interpretation results of the samReporter against an approach that does not preserve per-read linkage. We found that the samReporter was able to properly interpret the sequence data at resistance-associated locations in nine patients where the alternative approach was equivocal. In three cases, the samReporter confirmed that resistance or an atypical substitution was present at NS5A position 30. In three further cases, it confirmed that the sofosbuvir-resistant NS5B substitution S282T was absent. This suggests the direct read translation approach implemented is of value for interpreting viral deep sequencing data.
Project description:Multiple sequence alignment (MSA) is an extremely useful tool for molecular and evolutionary biology and there are several programs and algorithms available for this purpose. Although previous studies have compared the alignment accuracy of different MSA programs, their computational time and memory usage have not been systematically evaluated. Given the unprecedented amount of data produced by next generation deep sequencing platforms, and increasing demand for large-scale data analysis, it is imperative to optimize the application of software. Therefore, a balance between alignment accuracy and computational cost has become a critical indicator of the most suitable MSA program. We compared both accuracy and cost of nine popular MSA programs, namely CLUSTALW, CLUSTAL OMEGA, DIALIGN-TX, MAFFT, MUSCLE, POA, Probalign, Probcons and T-Coffee, against the benchmark alignment dataset BAliBASE and discuss the relevance of some implementations embedded in each program's algorithm. Accuracy of alignment was calculated with the two standard scoring functions provided by BAliBASE, the sum-of-pairs and total-column scores, and computational costs were determined by collecting peak memory usage and time of execution.Our results indicate that mostly the consistency-based programs Probcons, T-Coffee, Probalign and MAFFT outperformed the other programs in accuracy. Whenever sequences with large N/C terminal extensions were present in the BAliBASE suite, Probalign, MAFFT and also CLUSTAL OMEGA outperformed Probcons and T-Coffee. The drawback of these programs is that they are more memory-greedy and slower than POA, CLUSTALW, DIALIGN-TX, and MUSCLE. CLUSTALW and MUSCLE were the fastest programs, being CLUSTALW the least RAM memory demanding program.Based on the results presented herein, all four programs Probcons, T-Coffee, Probalign and MAFFT are well recommended for better accuracy of multiple sequence alignments. T-Coffee and recent versions of MAFFT can deliver faster and reliable alignments, which are specially suited for larger datasets than those encountered in the BAliBASE suite, if multi-core computers are available. In fact, parallelization of alignments for multi-core computers should probably be addressed by more programs in a near future, which will certainly improve performance significantly.
Project description:BACKGROUND:Recent advances in genome sequencing technologies and the cost drop in high-throughput sequencing continue to give rise to a deluge of data available for downstream analyses. Among others, evolutionary biologists often make use of genomic data to uncover phenotypic diversity and adaptive evolution in protein-coding genes. Therefore, multiple sequence alignments (MSA) and phylogenetic trees (PT) need to be estimated with optimal results. However, the preparation of an initial dataset of multiple sequence file(s) (MSF) and the steps involved can be challenging when considering extensive amount of data. Thus, it becomes necessary the development of a tool that removes the potential source of error and automates the time-consuming steps of a typical workflow with high-throughput and optimal MSA and PT estimations. RESULTS:We introduce LMAP_S (Lightweight Multigene Alignment and Phylogeny eStimation), a user-friendly command-line and interactive package, designed to handle an improved alignment and phylogeny estimation workflow: MSF preparation, MSA estimation, outlier detection, refinement, consensus, phylogeny estimation, comparison and editing, among which file and directory organization, execution, manipulation of information are automated, with minimal manual user intervention. LMAP_S was developed for the workstation multi-core environment and provides a unique advantage for processing multiple datasets. Our software, proved to be efficient throughout the workflow, including, the (unlimited) handling of more than 20 datasets. CONCLUSIONS:We have developed a simple and versatile LMAP_S package enabling researchers to effectively estimate multiple datasets MSAs and PTs in a high-throughput fashion. LMAP_S integrates more than 25 software providing overall more than 65 algorithm choices distributed in five stages. At minimum, one FASTA file is required within a single input directory. To our knowledge, no other software combines MSA and phylogeny estimation with as many alternatives and provides means to find optimal MSAs and phylogenies. Moreover, we used a case study comparing methodologies that highlighted the usefulness of our software. LMAP_S has been developed as an open-source package, allowing its integration into more complex open-source bioinformatics pipelines. LMAP_S package is released under GPLv3 license and is freely available at https://lmap-s.sourceforge.io/.
Project description:The widespread and rapid adoption of high-throughput sequencing technologies has afforded researchers the opportunity to gain a deep understanding of genome level processes that underlie evolutionary change, and perhaps more importantly, the links between genotype and phenotype. In particular, researchers interested in functional biology and adaptation have used these technologies to sequence mRNA transcriptomes of specific tissues, which in turn are often compared to other tissues, or other individuals with different phenotypes. While these techniques are extremely powerful, careful attention to data quality is required. In particular, because high-throughput sequencing is more error-prone than traditional Sanger sequencing, quality trimming of sequence reads should be an important step in all data processing pipelines. While several software packages for quality trimming exist, no general guidelines for the specifics of trimming have been developed. Here, using empirically derived sequence data, I provide general recommendations regarding the optimal strength of trimming, specifically in mRNA-Seq studies. Although very aggressive quality trimming is common, this study suggests that a more gentle trimming, specifically of those nucleotides whose Phred score <2 or <5, is optimal for most studies across a wide variety of metrics.
Project description:We have developed a semiempirical algorithm called Family Values (FamVal), which identifies residues that encode functional specificity in a protein sequence. Given a multiple sequence alignment (MSA) grouped into functionally distinct subfamilies, FamVal calculates a specificity score for each subfamily at every amino acid position of an MSA. This algorithm was used to predict specificity-encoding positions within the tetramerization assembly (T1) domain of voltage-gated potassium (Kv) channel subfamilies Kv3 and Kv4. The importance of one such position (Arg to Ala at MSA position 93) was confirmed by in vitro pull-down assays. The structural basis of this assembly discrimination was elucidated by determining the crystal structure of the Kv4 T1 domain and comparing it to the Kv3 T1 domain.
Project description:We propose a multiple sequence alignment (MSA) algorithm and compare the alignment-quality and execution-time of the proposed algorithm with that of existing algorithms. The proposed progressive alignment algorithm uses a grammar-based distance metric to determine the order in which biological sequences are to be pairwise aligned. The progressive alignment occurs via pairwise aligning new sequences with an ensemble of the sequences previously aligned.The performance of the proposed algorithm is validated via comparison to popular progressive multiple alignment approaches, ClustalW and T-Coffee, and to the more recently developed algorithms MAFFT, MUSCLE, Kalign, and PSAlign using the BAliBASE 3.0 database of amino acid alignment files and a set of longer sequences generated by Rose software. The proposed algorithm has successfully built multiple alignments comparable to other programs with significant improvements in running time. The results are especially striking for large datasets.We introduce a computationally efficient progressive alignment algorithm using a grammar based sequence distance particularly useful in aligning large datasets.
Project description:In the design of whole-genome sequencing (WGS) studies, sequencing depth is a crucial parameter to define variant calling accuracy and study cost, with no standard recommendations having been established. We empirically evaluated the variant calling accuracy of the WGS pipeline using ultra-deep WGS data (approximately 410×). We randomly sampled sequence reads and constructed a series of simulation WGS datasets with a variety of gradual depths (n?=?54; from 0.05× to 410×). Next, we evaluated the genotype concordances of the WGS data with those in the SNP microarray data or the WGS data using all the sequence reads. In addition, we assessed the accuracy of HLA allele genotyping using the WGS data with multiple software tools (PHLAT, HLA-VBseq, HLA-HD, and SNP2HLA). The WGS data with higher depths showed higher concordance rates, and >13.7× depth achieved as high as >99% of concordance. Comparisons with the WGS data using all the sequence reads showed that SNVs achieved >95% of concordance at 17.6× depth, whereas indels showed only 60% concordance. For the accuracy of HLA allele genotyping using the WGS data, 13.7× depth showed sufficient accuracy while performance heterogeneity among the software tools was observed (the highest concordance of 96.9% was observed with HLA-HD). Improvement in HLA genotyping accuracy by further increasing the depths was limited. These results suggest a medium degree of the WGS depth setting (approximately 15×) to achieve both accurate SNV calling and cost-effectiveness, whereas relatively higher depths are required for accurate indel calling.