Efficient alignment of pyrosequencing reads for re-sequencing applications.
ABSTRACT: BACKGROUND: Over the past few years, new massively parallel DNA sequencing technologies have emerged. These platforms generate massive amounts of data per run, greatly reducing the cost of DNA sequencing. However, these techniques also raise important computational difficulties mostly due to the huge volume of data produced, but also because of some of their specific characteristics such as read length and sequencing errors. Among the most critical problems is that of efficiently and accurately mapping reads to a reference genome in the context of re-sequencing projects. RESULTS: We present an efficient method for the local alignment of pyrosequencing reads produced by the GS FLX (454) system against a reference sequence. Our approach explores the characteristics of the data in these re-sequencing applications and uses state of the art indexing techniques combined with a flexible seed-based approach, leading to a fast and accurate algorithm which needs very little user parameterization. An evaluation performed using real and simulated data shows that our proposed method outperforms a number of mainstream tools on the quantity and quality of successful alignments, as well as on the execution time. CONCLUSIONS: The proposed methodology was implemented in a software tool called TAPyR--Tool for the Alignment of Pyrosequencing Reads--which is publicly available from http://www.tapyr.net.
Project description:BACKGROUND: With a whole genome duplication event and wealth of biological data, salmonids are excellent model organisms for studying evolutionary processes, fates of duplicated genes and genetic and physiological processes associated with complex behavioral phenotypes. It is surprising therefore, that no salmonid genome has been sequenced. Atlantic salmon (Salmo salar) is a good representative salmonid for sequencing given its importance in aquaculture and the genomic resources available. However, the size and complexity of the genome combined with the lack of a sequenced reference genome from a closely related fish makes assembly challenging. Given the cost and time limitations of Sanger sequencing as well as recent improvements to next generation sequencing technologies, we examined the feasibility of using the Genome Sequencer (GS) FLX pyrosequencing system to obtain the sequence of a salmonid genome. Eight pooled BACs belonging to a minimum tiling path covering approximately 1 Mb of the Atlantic salmon genome were sequenced by GS FLX shotgun and Long Paired End sequencing and compared with a ninth BAC sequenced by Sanger sequencing of a shotgun library. RESULTS: An initial assembly using only GS FLX shotgun sequences (average read length 248.5 bp) with approximately 30x coverage allowed gene identification, but was incomplete even when 126 Sanger-generated BAC-end sequences (approximately 0.09x coverage) were incorporated. The addition of paired end sequencing reads (additional approximately 26x coverage) produced a final assembly comprising 175 contigs assembled into four scaffolds with 171 gaps. Sanger sequencing of the ninth BAC (approximately 10.5x coverage) produced nine contigs and two scaffolds. The number of scaffolds produced by the GS FLX assembly was comparable to Sanger-generated sequencing; however, the number of gaps was much higher in the GS FLX assembly. CONCLUSION: These results represent the first use of GS FLX paired end reads for de novo sequence assembly. Our data demonstrated that this improved the GS FLX assemblies; however, with respect to de novo sequencing of complex genomes, the GS FLX technology is limited to gene mining and establishing a set of ordered sequence contigs. Currently, for a salmonid reference sequence, it appears that a substantial portion of sequencing should be done using Sanger technology.
Project description:DNA sequence reads from Sanger and pyrosequencing platforms differ in cost, accuracy, typical coverage, average read length and the variety of available paired-end protocols. Both read types can complement one another in a 'hybrid' approach to whole-genome shotgun sequencing projects, but assembly software must be modified to accommodate their different characteristics. This is true even of pyrosequencing mated and unmated read combinations. Without special modifications, assemblers tuned for homogeneous sequence data may perform poorly on hybrid data.Celera Assembler was modified for combinations of ABI 3730 and 454 FLX reads. The revised pipeline called CABOG (Celera Assembler with the Best Overlap Graph) is robust to homopolymer run length uncertainty, high read coverage and heterogeneous read lengths. In tests on four genomes, it generated the longest contigs among all assemblers tested. It exploited the mate constraints provided by paired-end reads from either platform to build larger contigs and scaffolds, which were validated by comparison to a finished reference sequence. A low rate of contig mis-assembly was detected in some CABOG assemblies, but this was reduced in the presence of sufficient mate pair data.The software is freely available as open-source from http://wgs-assembler.sf.net under the GNU Public License.
Project description:BACKGROUND: Sequencing-by-synthesis technologies significantly improve over the Sanger method in terms of speed and cost per base. However, they still usually fail to compete in terms of read length and quality. Current high-throughput implementations of the pyrosequencing technique yield reads whose length approach those of the capillary electrophoresis method. A less obvious question is whether their quality is affected by platform-specific sequencing errors. RESULTS: We present an empirical study aimed at assessing the quality and characterising sequencing errors for high throughput pyrosequencing data. We have developed a procedure for extracting sequencing error data from genome assemblies and study their characteristics, in particular the length distribution of indel gaps and their relation to the sequence contexts where they occur. We used this procedure to analyse data from three prokaryotic genomes sequenced with the GS FLX technology. We also compared two models previously employed with success for peptide sequence alignment. CONCLUSIONS: We observed an overall very low error rate in the analysed data, with indel errors being much more abundant than substitutions. We also observed a dependence between the length of the gaps and that of the homopolymer context where they occur. As with protein alignments, a power-law model seems to approximate the indel errors more accurately, although the results are not so conclusive as to justify a depart from the commonly used affine gap penalty scheme. In whichever case, however, our procedure can be used to estimate more realistic error model parameters.
Project description:BACKGROUND: Dengue is the world's most common mosquito-borne viral disease. Poor proofreading by RNA polymerase during its replication results in the accumulation of mutations in its genome. This leads to a diversity of genotypes in the viral population termed quasispecies. Quasispecies play an important role in disease severity. The study of quasispecies in dengue has been hindered because of the requirement for large amounts of cloning and sequencing, which could be overcome by 454 pyrosequencing. In this study, we attempted to demonstrate the feasibility of using 454 pyrosequencing to study genome diversity of dengue virus quasispecies by sequencing a pool of known dengue viral strains. RESULTS: Two sets of dengue DNA templates were sequenced by 454/Roche GS FLX. The total number of reads for data 1 and data 2 were 54,440 and 134,441, with average lengths of 221 and 232 bp, respectively. Reads containing ambiguous base Ns were excluded (6.00% in data 1, 7.05% in data 2). More than 99% of reads could be aligned back to the correct serotypes by BLAST. The reads covered the whole genome without any gaps, and the minimum coverage depth was 50×. Frequencies of known strains detected from each data set were highly correlated with the input ratios. We also explored criteria for filtering error reads and artifacts from true variations. CONCLUSIONS: This study showed that 454 pyrosequencing, coupled with our analysis procedure, could sequence the whole genome of dengue virus with good coverage. The ratio of detected variants in the sequencing data reflected the starting ratio, proving that the proposed technique could be used to study the frequencies of variants in quasispecies.
Project description:The diversity of virus populations within single infected hosts presents a major difficulty for the natural immune response as well as for vaccine design and antiviral drug therapy. Recently developed pyrophosphate-based sequencing technologies (pyrosequencing) can be used for quantifying this diversity by ultra-deep sequencing of virus samples. We present computational methods for the analysis of such sequence data and apply these techniques to pyrosequencing data obtained from HIV populations within patients harboring drug-resistant virus strains. Our main result is the estimation of the population structure of the sample from the pyrosequencing reads. This inference is based on a statistical approach to error correction, followed by a combinatorial algorithm for constructing a minimal set of haplotypes that explain the data. Using this set of explaining haplotypes, we apply a statistical model to infer the frequencies of the haplotypes in the population via an expectation-maximization (EM) algorithm. We demonstrate that pyrosequencing reads allow for effective population reconstruction by extensive simulations and by comparison to 165 sequences obtained directly from clonal sequencing of four independent, diverse HIV populations. Thus, pyrosequencing can be used for cost-effective estimation of the structure of virus populations, promising new insights into viral evolutionary dynamics and disease control strategies.
Project description:BACKGROUND: Surveillance for HIV transmitted drug resistance (TDR) is performed using HIV genotype results from individual specimens. Pyrosequencing, through its massive parallel sequencing ability, can analyze large numbers of specimens simultaneously. Instead of using pyrosequencing conventionally, to sequence a population of viruses within an individual, we interrogated a single combined pool of surveillance specimens to demonstrate that it is possible to determine TDR rates in HIV protease from a population of individuals. METHODOLOGY/PRINCIPAL FINDINGS: The protease region from 96 treatment naïve, HIV+ serum specimens was genotyped using standard Sanger sequencing method. The 462 bp protease amplicons from these specimens were pooled in equimolar concentrations and re-sequenced using the GS FLX Titanium system. The nucleotide (NT) and amino acid (AA) differences from the reference sequence, along with TDR mutations, detected by each method were compared. In the protease sequence, there were 212 nucleotide and 81 AA differences found using conventional sequencing and 345 nucleotide and 168 AA differences using pyrosequencing. All nucleotide and amino acid polymorphisms found at frequencies >/=5% in pyrosequencing were detected using both methods with the rates of variation highly correlated. Using Sanger sequencing, two TDR mutations, M46L and I84V, were each detected as mixtures at a frequency of 1.04% (1/96). These same TDR mutations were detected by pyrosequencing with a prevalence of 0.29% and 0.34% respectively. Phylogenetic analysis established that the detected low frequency mutations arose from the same single specimens that were found to contain TDR mutations by Sanger sequencing. Multiple clinical protease DR mutations present at higher frequencies were concordantly identified using both methods. CONCLUSIONS/SIGNIFICANCE: We show that pyrosequencing pooled surveillance specimens can cost-competitively detect protease TDR mutations when compared with conventional methods. With few modifications, the method described here can be used to determine population rates of TDR in both protease and reverse transcriptase. Furthermore, this pooled pyrosequencing technique may be generalizable to other infectious agents where a survey of DR rates is required.
Project description:We have investigated the reliability and reproducibility of HCV viral quasispecies quantification by ultra-deep pyrosequencing (UDPS) methods. Our study has been divided in two parts. First of all, by UDPS sequencing of clone mixes samples we have established the global noise level of UDPS and fine tuned a data treatment workflow previously optimized for HBV sequence analysis. Secondly, we have studied the reproducibility of the methodology by comparing 5 amplicons from two patient samples on three massive sequencing platforms (FLX+, FLX and Junior) after applying the error filters developed from the clonal/control study. After noise filtering the UDPS results, the three replicates showed the same 12 polymorphic sites above 0.7%, with a mean CV of 4.86%. Two polymorphic sites below 0.6% were identified by two replicates and one replicate respectively. A total of 25, 23 and 26 haplotypes were detected by GS-Junior, GS-FLX and GS-FLX+. The observed CVs for the normalized Shannon entropy (Sn), the mutation frequency (Mf), and the nucleotidic diversity (Pi) were 1.46%, 3.96% and 3.78%. The mean absolute difference in the two patients (5 amplicons each), in the GS-FLX and GS-FLX+, were 1.46%, 3.96% and 3.78% for Sn, Mf and Pi. No false polymorphic site was observed above 0.5%. Our results indicate that UDPS is an optimal alternative to molecular cloning for quantitative study of HCV viral quasispecies populations, both in complexity and composition. We propose an UDPS data treatment workflow for amplicons from the RNA viral quasispecies which, at a sequencing depth of at least 10,000 reads per strand, enables to obtain sequences and frequencies of consensus haplotypes above 0.5% abundance with no erroneous mutations, with high confidence, resistant mutants as minor variants at the level of 1%, with high confidence that variants are not missed, and highly confident measures of quasispecies complexity.
Project description:BACKGROUND: As more and more reference genome sequences are assembled, it becomes practical to assemble individual genomes from large amount of raw read data based on a reference sequence. However, most available assembly tools are designed for de-novo genome assembly. There is one commercial tool box (Newbler) developed for re-sequencing projects based on the Roche 454 sequencing platform. However, the genome with large repeat regions cannot be well assembled in Newbler. FINDINGS: We developed a new sequence assembly tool (BIGrat, Beijing Institute of Genomics Re-Assembly Tool) for pyrosequencing-based re-sequencing projects, such as data generated from Roche 454 and IonTorrent platforms. BIGrat improves the output of Newbler when evaluated on genome assemblies including chloroplast, mitochondrial, bacterial, and plant nuclear genomes. CONCLUSION: We presented a novel sequence assembly tool BIGrat for pyrosequencing-based re-sequencing projects, which can easily be integrated into Newbler pipelines for next-generation sequencing assembly and analysis.
Project description:Next-generation sequencing technologies enable the rapid cost-effective production of sequence data. To evaluate the performance of these sequencing technologies, investigation of the quality of sequence reads obtained from these methods is important. In this study, we analyzed the quality of sequence reads and SNP detection performance using three commercially available next-generation sequencers, i.e., Roche Genome Sequencer FLX System (FLX), Illumina Genome Analyzer (GA), and Applied Biosystems SOLiD system (SOLiD). A common genomic DNA sample obtained from Escherichia coli strain DH1 was applied to these sequencers. The obtained sequence reads were aligned to the complete genome sequence of E. coli DH1, to evaluate the accuracy and sequence bias of these sequence methods. We found that the fraction of "junk" data, which could not be aligned to the reference genome, was largest in the data set of SOLiD, in which about half of reads could not be aligned. Among data sets after alignment to the reference, sequence accuracy was poorest in GA data sets, suggesting relatively low fidelity of the elongation reaction in the GA method. Furthermore, by aligning the sequence reads to the E. coli strain W3110, we screened sequence differences between two E. coli strains using data sets of three different next-generation platforms. The results revealed that the detected sequence differences were similar among these three methods, while the sequence coverage required for the detection was significantly small in the FLX data set. These results provided valuable information on the quality of short sequence reads and the performance of SNP detection in three next-generation sequencing platforms.
Project description:BACKGROUND: In molecular microbial ecology, massive sequencing is gradually replacing classical fingerprinting techniques such as terminal-restriction fragment length polymorphism (T-RFLP) combined with cloning-sequencing for the characterization of microbiomes. Here, a bioinformatics methodology for pyrosequencing-based T-RF identification (PyroTRF-ID) was developed to combine pyrosequencing and T-RFLP approaches for the description of microbial communities. The strength of this methodology relies on the identification of T-RFs by comparison of experimental and digital T-RFLP profiles obtained from the same samples. DNA extracts were subjected to amplification of the 16S rRNA gene pool, T-RFLP with the HaeIII restriction enzyme, 454 tag encoded FLX amplicon pyrosequencing, and PyroTRF-ID analysis. Digital T-RFLP profiles were generated from the denoised full pyrosequencing datasets, and the sequences contributing to each digital T-RF were classified to taxonomic bins using the Greengenes reference database. The method was tested both on bacterial communities found in chloroethene-contaminated groundwater samples and in aerobic granular sludge biofilms originating from wastewater treatment systems. RESULTS: PyroTRF-ID was efficient for high-throughput mapping and digital T-RFLP profiling of pyrosequencing datasets. After denoising, a dataset comprising ca. 10'000 reads of 300 to 500 bp was typically processed within ca. 20 minutes on a high-performance computing cluster, running on a Linux-related CentOS 5.5 operating system, enabling parallel processing of multiple samples. Both digital and experimental T-RFLP profiles were aligned with maximum cross-correlation coefficients of 0.71 and 0.92 for high- and low-complexity environments, respectively. On average, 63±18% of all experimental T-RFs (30 to 93 peaks per sample) were affiliated to phylotypes. CONCLUSIONS: PyroTRF-ID profits from complementary advantages of pyrosequencing and T-RFLP and is particularly adapted for optimizing laboratory and computational efforts to describe microbial communities and their dynamics in any biological system. The high resolution of the microbial community composition is provided by pyrosequencing, which can be performed on a restricted set of selected samples, whereas T-RFLP enables simultaneous fingerprinting of numerous samples at relatively low cost and is especially adapted for routine analysis and follow-up of microbial communities on the long run.