Effects of sample age on data quality from targeted sequencing of museum specimens: what are we capturing in time?
ABSTRACT: BACKGROUND: Next generation sequencing (NGS) can recover DNA data from valuable extant and extinct museum specimens. However, archived or preserved DNA is difficult to sequence because of its fragmented, damaged nature, such that the most successful NGS methods for preserved specimens remain sub-optimal. Improving wet-lab protocols and comprehensively determining the effects of sample age on NGS library quality are therefore of vital importance. Here, I examine the relationship between sample age and several indicators of library quality following targeted NGS sequencing of ~1300 loci using 271 samples of pinned moth specimens (Helicoverpa armigera) ranging in age from 5 to 117 years. RESULTS: I find that older samples have lower DNA concentrations following extraction and thus require a higher number of indexing PCR cycles during library preparation. When sequenced reads are aligned to a reference genome or to only the targeted region, older samples have a lower number of sequenced and mapped reads, lower mean coverage, and lower estimated library sizes, while the percentage of adapters in sequenced reads increases significantly as samples become older. Older samples also show the poorest capture success, with lower enrichment and a higher improved coverage anticipated from further sequencing. CONCLUSIONS: Sample age has significant, measurable impacts on the quality of NGS data following targeted enrichment. However, incorporating a uracil-removing enzyme into the blunt end-repair step during library preparation could help to repair DNA damage, and using a method that prevents adapter-dimer formation may result in improved data yields.
Project description:Most ancient specimens contain very low levels of endogenous DNA, precluding the shotgun sequencing of many interesting samples because of cost. Ancient DNA (aDNA) libraries often contain <1% endogenous DNA, with the majority of sequencing capacity taken up by environmental DNA. Here we present a capture-based method for enriching the endogenous component of aDNA sequencing libraries. By using biotinylated RNA baits transcribed from genomic DNA libraries, we are able to capture DNA fragments from across the human genome. We demonstrate this method on libraries created from four Iron Age and Bronze Age human teeth from Bulgaria, as well as bone samples from seven Peruvian mummies and a Bronze Age hair sample from Denmark. Prior to capture, shotgun sequencing of these libraries yielded an average of 1.2% of reads mapping to the human genome (including duplicates). After capture, this fraction increased substantially, with up to 59% of reads mapped to human and enrichment ranging from 6- to 159-fold. Furthermore, we maintained coverage of the majority of regions sequenced in the precapture library. Intersection with the 1000 Genomes Project reference panel yielded an average of 50,723 SNPs (range 3,062-147,243) for the postcapture libraries sequenced with 1 million reads, compared with 13,280 SNPs (range 217-73,266) for the precapture libraries, increasing resolution in population genetic analyses. Our whole-genome capture approach makes it less costly to sequence aDNA from specimens containing very low levels of endogenous DNA, enabling the analysis of larger numbers of samples.
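The fold-enrichment figures quoted above are ratios of the mapped-read fraction after capture to the fraction before capture. The following is a minimal Python sketch of that arithmetic, not the authors' pipeline; the function name `fold_enrichment` and the read counts in the example are hypothetical and chosen only for illustration.

```python
def fold_enrichment(pre_mapped: int, pre_total: int,
                    post_mapped: int, post_total: int) -> float:
    """Return the post-capture mapped fraction divided by the pre-capture one.

    All arguments are read counts. The names are illustrative assumptions,
    not taken from the study.
    """
    if pre_total <= 0 or post_total <= 0:
        raise ValueError("total read counts must be positive")
    if pre_mapped <= 0:
        raise ValueError("pre-capture mapped reads must be positive")
    pre_fraction = pre_mapped / pre_total
    post_fraction = post_mapped / post_total
    return post_fraction / pre_fraction


# Hypothetical library: 0.5% of reads map before capture, 30% after.
example = fold_enrichment(5_000, 1_000_000, 300_000, 1_000_000)
print(f"{example:.1f}-fold enrichment")
```

Because the ratio is taken per library, the average pre-capture fraction and the maximum post-capture fraction reported in the abstract should not be combined into a single fold value.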
Project description:Cell-free DNA (cfDNA) extracted from diverse specimen types has emerged as a high quality substrate for molecular tumor profiling. Analytical and pre-analytical challenges in the utilization of cfDNA extracted from pleural effusion supernatant (PES) are herein characterized in patients with metastatic non-small cell lung carcinoma (NSCLC). Pleural effusion specimens containing metastatic NSCLC were collected prospectively. After ThinPrep® (TP) and cell block (CB) preparation, DNA was extracted from residual PES and analyzed by gel electrophoresis for quality and quantity. Libraries were prepared and sequenced with a targeted next-generation sequencing (NGS) platform and panel clinically validated for plasma specimens. Results were compared with DNA extracted from corresponding FFPE samples that were sequenced using institutional targeted NGS assays clinically validated for solid tumor FFPE samples. Tumor (TC) and overall cellularity (OC) were evaluated. Fourteen specimens were collected from 13 patients. Median specimen volume was 180 mL (range, 35-1,400 mL). Median TC and OC on TP slides and CB sections were comparable. Median extracted DNA concentration was 7.4 ng/μL (range, 0.1-58.0 ng/μL), with >5 ng/μL DNA extracted from 10/14 specimens (71%). Mutations were identified in 10/14 specimens, including 1/3 specimens with median molecular coverage <1,000 reads. The minimal detected allelic fraction was 0.6%. NGS was falsely negative for the presence of one driver mutation. No correlation was identified between sample volume or OC, quality or quantity of extracted DNA, or mutation detection. Despite analytical and pre-analytical challenges, PES represents a robust source of DNA for NGS.
Project description:Obtaining sequence data from historical museum specimens has been a growing research interest, invigorated by next-generation sequencing methods that allow inputs of highly degraded DNA. We applied a target enrichment and next-generation sequencing protocol to generate ultraconserved elements (UCEs) from 51 large carpenter bee specimens (genus Xylocopa), representing 25 species with specimen ages ranging from 2-121 years. We measured the correlation between specimen age and DNA yield (pre- and post-library preparation DNA concentration) and several UCE sequence capture statistics (raw read count, UCE reads on target, UCE mean contig length and UCE locus count) with linear regression models. We performed piecewise regression to test for specific breakpoints in the relationship of specimen age and DNA yield and sequence capture variables. Additionally, we compared UCE data from newer and older specimens of the same species and reconstructed their phylogeny in order to confirm the validity of our data. We recovered 6-972 UCE loci from samples with pre-library DNA concentrations ranging from 0.06-9.8 ng/μL. All investigated DNA yield and sequence capture variables were significantly but only moderately negatively correlated with specimen age. Specimens of age 20 years or less had significantly higher pre- and post-library concentrations, UCE contig lengths, and locus counts compared to specimens older than 20 years. We found breakpoints in our data indicating a decrease of the initial detrimental effect of specimen age on pre- and post-library DNA concentration and UCE contig length starting around 21-39 years after preservation. Our phylogenetic results confirmed the integrity of our data, giving preliminary insights into relationships within Xylocopa.
We consider the effect of additional factors not measured in this study on our age-related sequence capture results, such as DNA fragmentation and preservation method, and discuss the promise of the UCE approach for large-scale projects in insect phylogenomics using museum specimens.
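The breakpoint idea in the carpenter bee abstract can be illustrated with a simple two-segment fit. The Python sketch below is not the authors' piecewise regression: it fits an independent least-squares line on each side of every candidate split and keeps the split with the lowest total squared error, which is a cruder, discontinuous stand-in for a proper segmented model. The data are synthetic and all names (`fit_line`, `best_split`, `synthetic_yield`) are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares on one segment; returns (slope, intercept, sse)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, sse


def best_split(ages, values, min_points=3):
    """Try every split of the age-sorted data and keep the lowest total SSE."""
    pairs = sorted(zip(ages, values))
    xs = [a for a, _ in pairs]
    ys = [v for _, v in pairs]
    best = None
    for i in range(min_points, len(xs) - min_points + 1):
        left = fit_line(xs[:i], ys[:i])
        right = fit_line(xs[i:], ys[i:])
        total_sse = left[2] + right[2]
        if best is None or total_sse < best["sse"]:
            best = {
                "sse": total_sse,
                "between": (xs[i - 1], xs[i]),  # breakpoint lies in this age gap
                "left_slope": left[0],
                "right_slope": right[0],
            }
    return best


def synthetic_yield(age):
    """Synthetic DNA yield: steep decline before 30 years, nearly flat after."""
    return 10.0 - 0.3 * age if age < 30 else 1.0 - 0.005 * (age - 30)


# Synthetic specimen ages in years (assumed values, not from the study).
SYNTHETIC_AGES = [2, 5, 10, 15, 20, 25, 40, 60, 80, 100, 120]
SYNTHETIC_YIELDS = [synthetic_yield(a) for a in SYNTHETIC_AGES]

result = best_split(SYNTHETIC_AGES, SYNTHETIC_YIELDS)
print(result["between"], result["left_slope"], result["right_slope"])
```

On real specimen data a continuous segmented model with confidence intervals for the breakpoint would be the more defensible choice. The brute-force search only shows why a steep early decline followed by a flatter tail produces a breakpoint.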
Project description:Avoiding biases in next generation sequencing (NGS) library preparation is crucial for obtaining reliable sequencing data. Recently, a new library preparation method has been introduced which has eliminated the need for the ligation step. This method, termed SMART (switching mechanism at the 5' end of the RNA transcript), is based on template switching reverse transcription. To date, there has been no systematic analysis of the additional biases introduced by this method. We analysed the genomic distribution of sequenced reads prepared from genomic DNA using the SMART methodology and found a strong bias toward long (≥12 bp) poly dA/dT containing genomic loci. This bias is unique to the SMART-based library preparation and does not appear when libraries are prepared with conventional ligation based methods. Although this bias is obvious only when performing paired end sequencing, it affects single end sequenced samples as well. Our analysis demonstrates that sequenced reads originating from SMART-DNA libraries are heavily skewed toward genomic poly dA/dT tracts. This bias needs to be considered when deciding to use SMART based technology for library preparation.
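To make the notion of a long poly dA/dT tract concrete, here is a minimal Python sketch that locates homopolymer runs of A or T in a sequence. It assumes the threshold in the abstract means a minimum tract length of 12 bp; the function name `find_poly_at_tracts` and the test sequence are hypothetical, and this is not the analysis code used in the study.

```python
import re


def find_poly_at_tracts(seq: str, min_len: int = 12):
    """Return (start, end, base) for each run of A or T at least min_len long.

    Coordinates are 0-based and half-open, as in Python slicing.
    """
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_len, min_len), re.IGNORECASE)
    return [(m.start(), m.end(), m.group()[0].upper())
            for m in pattern.finditer(seq)]


# Made-up sequence: a 12 bp A tract, a 15 bp T tract, and a short A run
# that falls below the threshold.
demo_seq = "GC" + "A" * 12 + "GCGC" + "T" * 15 + "AAAA"
tracts = find_poly_at_tracts(demo_seq)
print(tracts)
```

Counting how many read alignments overlap such intervals, compared with a ligation-based library, would be one way to quantify the kind of bias described above.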
Project description:Next-generation sequencing (NGS) has been applied in the field of infectious diseases. Bronchoalveolar lavage fluid (BALF) is considered a sterile type of specimen that is suitable for detecting pathogens of respiratory infections. The aim of this study was to comprehensively identify causative pathogens using NGS in BALF samples from immunocompetent pediatric patients with respiratory failure. Ten patients hospitalized with respiratory failure were included. BALF samples obtained in the acute phase were used to prepare DNA- and RNA-sequencing libraries. The libraries were sequenced on MiSeq, and the sequence data were analyzed using metagenome analysis tools. A mean of 2,041,216 total reads were sequenced for each library. Significant bacterial or viral sequencing reads were detected in eight of the 10 patients. Furthermore, candidate pathogens were detected in three patients in whom etiologic agents were not identified by conventional methods. The complete genome of enterovirus D68 was identified in two patients, and phylogenetic analysis suggested that both strains belong to subclade B3, which is an epidemic strain that has spread worldwide in recent years. Our results suggest that NGS can be applied for comprehensive molecular diagnostics as well as surveillance of pathogens in BALF from patients with respiratory infection.
Project description:Metagenomic sequencing of clinical diagnostic specimens has potential for unbiased detection of infectious agents, diagnosis of polymicrobial infections and discovery of emerging pathogens. Herein, a next generation sequencing (NGS)-based metagenomic approach was used to investigate the cause of illness in a subset of horses recruited for a tick-borne disease surveillance study during 2017-2019. Blood samples collected from 10 horses with suspected tick-borne infection and five apparently healthy horses were subjected to metagenomic analysis. Total genomic DNA extracted from the blood samples was enriched for microbial DNA and subjected to shotgun next generation sequencing using the Nextera DNA Flex library preparation kit and the V2 chemistry sequencing kit on the Illumina MiSeq sequencing platform. Overall, 0.4-0.6 million reads per sample were analyzed using the Kraken metagenomic sequence classification program. The taxonomic classification of the reads indicated that bacterial genomes were overrepresented (0.5 to 1%) among the total microbial reads. Most of the bacterial reads (~91%) belonged to phyla Firmicutes, Proteobacteria, Bacteroidetes, Actinobacteria, Cyanobacteria and Tenericutes in both groups. Importantly, 10-42.5% of Alphaproteobacterial reads in 5 of 10 animals with suspected tick-borne infection were identified as Anaplasma phagocytophilum. Of the 5 animals positive for A. phagocytophilum sequence reads, four animals tested A. phagocytophilum positive by PCR. Two animals with suspected tick-borne infection that were A. phagocytophilum positive by PCR were found negative for any tick-borne microbial reads by metagenomic analysis. The present study demonstrates the usefulness of the NGS-based metagenomic analysis approach for the detection of blood-borne microbes.
Project description:Rapid and accurate identification of an influenza outbreak is essential for patient care and treatment. We describe a next-generation sequencing (NGS)-based, unbiased deep sequencing method in clinical specimens to investigate an influenza outbreak. Nasopharyngeal swabs from patients were collected for molecular epidemiological analysis. Total RNA was sequenced by using NGS technology as paired-end 250 bp reads. A total of 7 to 12 million reads were obtained. After mapping to the human reference genome, we analyzed the 3-4% of reads that originated from a non-human source. A BLAST search of the contigs reconstructed de novo revealed high sequence similarity with that of the pandemic H1N1 virus. In the phylogenetic analysis, the HA gene of our samples clustered closely with that of A/Senegal/VR785/2010(H1N1), A/Wisconsin/11/2013(H1N1), and A/Korea/01/2009(H1N1), and the NA gene of our samples clustered closely with A/Wisconsin/11/2013(H1N1). This study suggests that NGS-based unbiased sequencing can be effectively applied to investigate the molecular characteristics of a nosocomial influenza outbreak by using clinical specimens such as nasopharyngeal swabs.
Project description:Despite widespread interest in next-generation sequencing (NGS), the adoption of personalized clinical genomics and mutation profiling of cancer specimens is lagging, in part because of technical limitations. Tumors are genetically heterogeneous and often contain normal/stromal cells, features that lead to low-abundance somatic mutations that generate ambiguous results or reside below NGS detection limits, thus hindering the clinical sensitivity/specificity standards of mutation calling. We applied COLD-PCR (coamplification at lower denaturation temperature PCR), a PCR methodology that selectively enriches variants, to improve the detection of unknown mutations before NGS-based amplicon resequencing. We used both COLD-PCR and conventional PCR (for comparison) to amplify serially diluted mutation-containing cell-line DNA diluted into wild-type DNA, as well as DNA from lung adenocarcinoma and colorectal cancer samples. After amplification of TP53 (tumor protein p53), KRAS (v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog), IDH1 [isocitrate dehydrogenase 1 (NADP(+)), soluble], and EGFR (epidermal growth factor receptor) gene regions, PCR products were pooled for library preparation, bar-coded, and sequenced on the Illumina HiSeq 2000. In agreement with recent findings, sequencing errors by conventional targeted-amplicon approaches dictated a mutation-detection limit of approximately 1%-2%. Conversely, COLD-PCR amplicons enriched mutations above the error-related noise, enabling reliable identification of mutation abundances of approximately 0.04%. Sequencing depth was not a large factor in the identification of COLD-PCR-enriched mutations. For the clinical samples, several missense mutations were not called with conventional amplicons, yet they were clearly detectable with COLD-PCR amplicons.
Tumor heterogeneity for the TP53 gene was apparent. As cancer care shifts toward personalized intervention based on each patient's unique genetic abnormalities and tumor genome, we anticipate that COLD-PCR combined with NGS will elucidate the role of mutations in tumor progression, enabling NGS-based analysis of diverse clinical specimens within clinical practice.
Project description:A next-generation sequencing (NGS) dataset of nanobody (Nb) clones in a phage display library (PDL) is of immense value because it serves many purposes, such as: (i) estimating the library size, (ii) improving selection and identification of Nbs, (iii) informing about the frequency of V gene families and the diversity and length of CDRs, and (iv) high-resolution analysis of natural and synthetic libraries. We used a fraction of our previously constructed PDL of Nbs derived from an E. coli lipopolysaccharide-immunized Indian desert camel in order to obtain the dataset of NGS reads of Nbs. The cryo-preserved transformants library was revived to extract the Nb-encoding VHH (inserts)-pHEN4 (vector) DNA pool. The DNA sample was used for amplifying the VHH pool by PCR. The VHH amplicon band was gel-purified and subjected to NGS using the Illumina MiSeq™ platform. The 'Nextra XT micro V2 Index' kit was used for sequencing the Nb library DNA sample, with the adaptors 'i7' (N706: TAGGCATG) and 'i5' (S517: GCGTAAGA). The raw data comprised a total read count of 182146 (matched = 179591; unmatched = 2555), with an average read length of 130.33 bases and a total of 23.74 Mb. Of the 179591 matched reads, 142004 were paired reads and 37587 were broken paired reads. The raw NGS reads were submitted to the NCBI Sequence Read Archive, accessible at https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA516512, and after analysis were deposited in the Mendeley Datasets repository, accessible at https://data.mendeley.com/datasets/4rsz3snvk5/3. The sequence reads were analyzed with bioinformatics tools. The assembled consensus contigs revealed Nb orthologs of diverse Ag-specificities, including those isolated by conventional panning and Sanger-sequenced functional Nbs.
Contig 1 CDR1-3 matched those of anti-Trypanosoma evansi RoTat1.2 variant surface glycoprotein (VSG), while Contig 2 CDR1-3 matched those of anti-LPS Nb clones isolated from the library. Contig 3, however, was incomplete and lacked CDR3. Despite lacking depth, the NGS data is a useful guide for selection of antigen-specific Nbs from the library, as demonstrated by anti-T. evansi VSG Nbs, and provides templates for Nb-based diagnostic reagents and therapeutic agents.
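The read counts reported for the nanobody library can be cross-checked with simple arithmetic. The short Python sketch below uses only the figures given in the abstract; the variable names are illustrative, and it assumes that "Mb" means 10^6 bases.

```python
# Figures quoted in the abstract.
total_reads = 182146
matched, unmatched = 179591, 2555
paired, broken_paired = 142004, 37587
mean_read_length = 130.33  # bases

# Matched plus unmatched reads should add up to the total read count.
counts_consistent = matched + unmatched == total_reads

# Paired plus broken paired reads should add up to the matched reads.
pairs_consistent = paired + broken_paired == matched

# Total yield in megabases, assuming 1 Mb = 1,000,000 bases.
total_mb = round(total_reads * mean_read_length / 1_000_000, 2)

print(counts_consistent, pairs_consistent, total_mb)
```

All three checks agree with the reported values, including the 23.74 Mb total.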
Project description:Next-generation sequencing (NGS) has emerged as a powerful technique for the detection of genetic variants in the clinical laboratory. NGS can be performed using DNA from FFPE tissue, but it is unknown whether such specimens are truly equivalent to unfixed tissue for NGS applications. To address this question, we performed hybridization-capture enrichment and multiplexed Illumina NGS for 27 cancer-related genes using DNA from 16 paired fresh-frozen and routine FFPE lung adenocarcinoma specimens and conducted extensive comparisons between the sequence data from each sample type. This analysis revealed small but detectable differences between FFPE and frozen samples. Compared with frozen samples, NGS data from FFPE samples had smaller library insert sizes, greater coverage variability, and an increase in C to T transitions that was most pronounced at CpG dinucleotides, suggesting interplay between DNA methylation and formalin-induced changes; however, the error rate, library complexity, enrichment performance, and coverage statistics were not significantly different. Comparison of base calls between paired samples demonstrated concordances of >99.99%, with 96.8% agreement in the single-nucleotide variants detected and >98% accuracy of NGS data when compared with genotypes from an orthogonal single-nucleotide polymorphism array platform. This study demonstrates that routine processing of FFPE samples has a detectable but negligible effect on NGS data and that these samples can be a reliable substrate for clinical NGS testing.