ACDC, a global database of amphibian cytochrome-b sequences using reproducible curation for GenBank records.
ABSTRACT: Genetic data are a crucial and exponentially growing resource across all biological sciences, yet curated databases are scarce. The widespread occurrence of sequence and (meta)data errors in public repositories calls for comprehensive improvements of curation protocols leading to robust research and downstream analyses. We collated and curated all available GenBank cytochrome-b sequences for amphibians, a benchmark marker in this globally declining vertebrate clade. The Amphibia's Curated Database of Cytochrome-b (ACDC) consists of 36,514 sequences representing 2,309 species from 398 genera (median = 2 with 50% interquartile ranges of 1-7 species/genus). We updated the taxonomic identity of >4,800 sequences (ca. 13%) and found 2,359 (6%) conflicting sequences with 84% of the errors originating from taxonomic misidentifications. The database (accessible at https://doi.org/10.6084/m9.figshare.9944759 ) also includes an R script to replicate our study for other loci and taxonomic groups. We provide recommendations to improve genetic-data quality in public repositories and flag species for which there is a need for taxonomic refinement in the face of an increased rate of amphibian extinctions in the Anthropocene.
Project description:ACDC (arterial calcification due to deficiency of CD73) is an autosomal recessive disease resulting from loss-of-function mutations in NT5E, which encodes CD73, a 5'-ectonucleotidase that converts extracellular adenosine monophosphate to adenosine. ACDC patients display progressive calcification of lower extremity arteries, causing limb ischemia. Tissue-nonspecific alkaline phosphatase (TNAP), which converts pyrophosphate (PPi) to inorganic phosphate (Pi), and extracellular purine metabolism play important roles in other inherited forms of vascular calcification. Compared to cells from healthy subjects, induced pluripotent stem cell-derived mesenchymal stromal cells (iMSCs) from ACDC patients displayed accelerated calcification and increased TNAP activity when cultured under conditions that promote osteogenesis. TNAP activity generated adenosine in iMSCs derived from ACDC patients but not in iMSCs from control subjects, which have CD73. In response to osteogenic stimulation, ACDC patient-derived iMSCs had decreased amounts of the TNAP substrate PPi, an inhibitor of extracellular matrix calcification, and exhibited increased activation of AKT, mechanistic target of rapamycin (mTOR), and the 70-kDa ribosomal protein S6 kinase (p70S6K), a pathway that promotes calcification. In vivo, teratomas derived from ACDC patient cells showed extensive calcification and increased TNAP activity. Treating mice bearing these teratomas with an A2b adenosine receptor agonist, the mTOR inhibitor rapamycin, or the bisphosphonate etidronate reduced calcification. These results show that an increase of TNAP activity in ACDC contributes to ectopic calcification by disrupting the extracellular balance of PPi and Pi and identify potential therapeutic targets for ACDC.
Project description:<h4>Background</h4>Primary cilia frequency and length are key metrics in studies of ciliogenesis and ciliopathies. Typically, quantitative cilia analysis is done manually, which is very time-consuming. While some open-source and commercial image analysis software applications can segment input data, they still require the user to optimize many parameters, suffer from user bias, and often lack rigorous performance quality assessment (e.g., false positives and false negatives). Further, optimal parameter combinations vary in detection accuracy depending on cilia reporter, cell type, and imaging modality. A good automated solution would analyze images quickly, robustly, and adaptably across different experimental data sets, without significantly compromising the accuracy of manual analysis.<h4>Methods</h4>To solve this problem, we developed new software for automated cilia detection in cells (ACDC). The software operates through four main steps: image importation, pre-processing, detection auto-optimization, and analysis. From a data set, a representative image with manually selected cilia (i.e., Ground Truth) is used for detection auto-optimization based on four parameters: signal-to-noise ratio, length, directional score, and intensity standard deviation. Millions of parameter combinations are automatically evaluated and optimized according to an accuracy 'F1' score, based on the numbers of false positives and false negatives. Afterwards, the optimized parameter combination is used for automated detection and analysis of the entire data set.<h4>Results</h4>The ACDC software accurately and adaptably detected nuclei and primary cilia across different cell types (NIH3T3, RPE1), cilia reporters (AcTub, Smo-GFP, Arl13b), and image magnifications (60×, 40×). We found that false-positive and false-negative rates for Arl13b-stained cilia were 1-6%, yielding high F1 scores of 0.96-0.97 (max. = 1.00).
The software detected significant differences in mean cilia length between control and cytochalasin D-treated cell populations and could monitor dynamic changes in cilia length from movie recordings. Automated analysis offered up to a 96-fold speed enhancement compared to manual analysis, requiring around 5 s/image, or nearly 18,000 cilia analyzed/hour.<h4>Conclusion</h4>The ACDC software is a solution for robust automated analysis of microscopic images of ciliated cells. The software is extremely adaptable, accurate, and offers immense time-savings compared to traditional manual analysis.
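The F1 score mentioned in the Methods can be illustrated with a minimal Python sketch. It uses the standard definition of F1 from true positives, false positives, and false negatives; the function name and the counts below are hypothetical and are not taken from the ACDC software or its results.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Standard F1: 2*TP / (2*TP + FP + FN). Returns 0.0 if nothing was scored."""
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0


# Hypothetical counts for one image: 96 detections matching the Ground Truth,
# 4 spurious detections, and 4 missed cilia.
score = f1_score(true_pos=96, false_pos=4, false_neg=4)
print(round(score, 2))  # 0.96
```

A parameter search of the kind described would presumably compute such a score for each candidate parameter combination and keep the highest-scoring one; how the software actually does this is not stated in the abstract.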
Project description:Taxonomic identification of biological materials can be achieved through DNA barcoding, where an unknown "barcode" sequence is compared to a reference database. In many disciplines, obtaining accurate taxonomic identifications can be imperative (e.g., evolutionary biology, food regulatory compliance, forensics). The Barcode of Life DataSystems (BOLD) and GenBank are the main public repositories of DNA barcode sequences. In this study, an assessment of the accuracy and reliability of sequences in these databases was performed. To achieve this, 1) curated reference materials for plants, macro-fungi and insects were obtained from national collections, 2) relevant barcode sequences (rbcL, matK, trnH-psbA, ITS and COI) from these reference samples were generated and used for searching against both databases, and 3) optimal search parameters were determined that ensure the best match to the known species in either database. While GenBank outperformed BOLD for species-level identification of insect taxa (53% and 35%, respectively), both databases performed comparably for plants and macro-fungi (~81% and ~57%, respectively). Results illustrated that using a multi-locus barcode approach increased identification success. This study outlines the utility of the BLAST search tool in GenBank and the BOLD identification engine for taxonomic identifications and identifies some precautions needed when using public sequence repositories in applied scientific disciplines.
Project description:Next-generation sequencing has provided powerful tools to conduct microbial ecology studies. Analysis of community composition relies on annotated databases of curated sequences to provide taxonomic assignments; however, these databases occasionally have errors with implications for downstream analyses. Systemic taxonomic errors were discovered in the Greengenes database (v13_5 and 13_8) related to the orders Vibrionales and Alteromonadales. These orders have family-level annotations that were erroneous at one or more taxonomic levels, e.g., 100% of sequences assigned to the Pseudoalteromonadaceae family were placed improperly in Vibrionales (rather than Alteromonadales) and >20% of these sequences were indeed Vibrio spp. but were improperly assigned to the Pseudoalteromonadaceae family (rather than to Vibrionaceae). Use of this database is common; we identified 68 peer-reviewed papers since 2013 that likely included erroneous annotations specifically associated with Vibrionales and Pseudoalteromonadaceae, with 20 explicitly stating the incorrect taxonomy. Erroneous assignments using these specific versions of Greengenes can lead to incorrect conclusions, especially in marine systems where these taxa are commonly encountered as conditionally rare organisms and potential pathogens.
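A check for the kind of order/family mismatch described above can be sketched in Python. The sketch assumes Greengenes-style lineage strings with rank prefixes (`o__` for order, `f__` for family); the lookup table holds only the two family-to-order pairings named in the abstract, and the function names are hypothetical.

```python
# Family -> expected order, limited to the two pairings stated in the abstract.
EXPECTED_ORDER = {
    "Pseudoalteromonadaceae": "Alteromonadales",
    "Vibrionaceae": "Vibrionales",
}


def parse_lineage(lineage: str) -> dict:
    """Split 'k__X; p__Y; ...' into a {rank_prefix: name} dictionary."""
    ranks = {}
    for field in lineage.split(";"):
        field = field.strip()
        if "__" in field:
            prefix, name = field.split("__", 1)
            ranks[prefix] = name
    return ranks


def order_family_conflict(lineage: str) -> bool:
    """True if the family is in the lookup table but sits under another order."""
    ranks = parse_lineage(lineage)
    expected = EXPECTED_ORDER.get(ranks.get("f", ""))
    return expected is not None and ranks.get("o", "") != expected


# Illustrative lineage string showing the misplacement described in the abstract.
bad = ("k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
       "o__Vibrionales; f__Pseudoalteromonadaceae; g__Pseudoalteromonas; s__")
print(order_family_conflict(bad))  # True
```

This only catches families placed under the wrong order. Sequences assigned to the wrong family, such as the Vibrio spp. mentioned above, would need a sequence-based check instead.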
Project description:The reliable taxonomic identification of organisms through DNA sequence data requires a well parameterized library of curated reference sequences. However, it is estimated that just 15% of described animal species are represented in public sequence repositories. To begin to address this deficiency, we provide DNA barcodes for 1,500,003 animal specimens collected from 23 terrestrial and aquatic ecozones at sites across Canada, a nation that comprises 7% of the planet's land surface. In total, 14 phyla, 43 classes, 163 orders, 1123 families, 6186 genera, and 64,264 Barcode Index Numbers (BINs; a proxy for species) are represented. Species-level taxonomy was available for 38% of the specimens, but higher proportions were assigned to a genus (69.5%) and a family (99.9%). Voucher specimens and DNA extracts are archived at the Centre for Biodiversity Genomics where they are available for further research. The corresponding sequence and taxonomic data can be accessed through the Barcode of Life Data System, GenBank, the Global Biodiversity Information Facility, and the Global Genome Biodiversity Network Data Portal.
Project description:The Gastropoda is one of the best-studied classes of marine invertebrates. Yet most species have been delimited based on morphology alone. DNA barcodes have proven very useful for delimiting species. Therefore, sequences of the cytochrome c oxidase I gene from 108 specimens of 34 morpho-species were used to investigate the molecular diversity of gastropods from the Portuguese coast. To this dataset, we added available COI-5P sequences of taxonomically close species, for a total of 58 morpho-species examined. Our sequences matched well with those from independent studies in public repositories. We found 32 concordant (91.4%) out of the 35 Barcode Index Numbers (BINs) generated from our sequences. Applying a ranking system to the barcodes yielded over 70% with top taxonomic congruence, while 14.2% of the species barcodes had insufficient data. In the majority of cases, there was good concordance between morphological identification and DNA barcodes. Nonetheless, the discordance between morphological and molecular data is a reminder that even the comparatively well-known European marine gastropods can benefit from being probed using the DNA barcode approach. Discordant cases should be reviewed with more integrative studies.
Project description:In recent years, the number of sequences of diverse species submitted to GenBank has grown explosively, and not infrequently the data contain errors. This problem is widely recognized, but not for invalid or incorrectly identified species, sample mix-ups, and contamination. DNA barcoding is a powerful tool for identifying and confirming species, and one very important application involves forensics. In this study, we use DNA barcoding to detect erroneous sequences in GenBank by evaluating deep intraspecific and shallow interspecific divergences to discover possible taxonomic problems and other sources of error. We use the mitochondrial DNA gene encoding cytochrome b (Cytb) from turtles to test the utility of barcoding for pinpointing potential errors. This gene is widely used in phylogenetic studies of this speciose group. Intraspecific variation is usually less than 2.0% and in most cases it is less than 1.0%. In comparison, most species differ by more than 10.0% in our dataset. Overlapping intra- and interspecific percentages of variation mainly involve problematic identifications of species and outdated taxonomies. Further, we detect identical problems in Cytb from Insectivora and Chiroptera. Upon applying this strategy to 47,524 mammalian CoxI sequences, we resolve a suite of potentially problematic sequences. Our study reveals that erroneous sequences are not rare in GenBank and that DNA barcoding can serve to confirm sequencing accuracy and discover problems such as misidentified species, inaccurate taxonomies, contamination, and potential errors in sequencing.
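Screening for deep intraspecific and shallow interspecific divergences can be sketched in Python as follows. This is not the authors' pipeline. It assumes pre-aligned sequences of equal length, uses uncorrected p-distance, and borrows the 2% and 10% figures reported above as illustrative cutoffs; all names and the toy records are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites with a gap ('-') or ambiguous base ('N') in either sequence are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        diffs += x != y
    return diffs / compared if compared else 0.0


def flag_pairs(records, intra_max=0.02, inter_min=0.10):
    """Flag suspicious pairs among (accession, species, aligned_sequence) records."""
    flagged = []
    for (id1, sp1, s1), (id2, sp2, s2) in combinations(records, 2):
        d = p_distance(s1, s2)
        if sp1 == sp2 and d > intra_max:
            flagged.append((id1, id2, "deep intraspecific", d))
        elif sp1 != sp2 and d < inter_min:
            flagged.append((id1, id2, "shallow interspecific", d))
    return flagged


# Toy records with made-up accessions and species labels.
records = [
    ("acc1", "Species x", "ACGTACGTACGTACGTACGT"),
    ("acc2", "Species x", "TTGTACGTACGTACGTACGT"),  # 2 of 20 sites differ from acc1
    ("acc3", "Species y", "ACGTACGTACGTACGTACGT"),  # identical to acc1, other label
]
for hit in flag_pairs(records):
    print(hit)
```

A flagged pair is only a candidate for review. Whether it reflects misidentification, outdated taxonomy, contamination, or genuine biological variation still has to be judged case by case.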
Project description:The interrogation of genetic markers in environmental meta-barcoding studies is currently seriously hindered by the lack of taxonomically curated reference data sets for the targeted genes. The Protist Ribosomal Reference database (PR(2), http://ssu-rrna.org/) provides unique access to eukaryotic small sub-unit (SSU) ribosomal RNA and DNA sequences, with curated taxonomy. The database mainly consists of nuclear-encoded protistan sequences. However, metazoans, land plants, macrosporic fungi and eukaryotic organelles (mitochondrion, plastid and others) are also included because they are useful for the analysis of high-throughput sequencing data sets. Introns and putative chimeric sequences have also been carefully checked. Taxonomic assignation of sequences consists of eight unique taxonomic fields. In total, 136 866 sequences are nuclear encoded, 45 708 (36 501 mitochondrial and 9657 chloroplastic) are from organelles, the remaining being putative chimeric sequences. The website allows the users to download sequences from the entire and partial databases (including representative sequences after clustering at a given level of similarity). Different web tools also allow searches by sequence similarity. The presence of both rRNA and rDNA sequences, taking into account introns (crucial for eukaryotic sequences), a normalized eight-term ranked taxonomy and updates of new GenBank releases were made possible by a long-term collaboration between experts in taxonomy and computer scientists.
Project description:Clubroot, caused by Plasmodiophora brassicae, is an important disease of Brassica crops worldwide. F<sub>1</sub> progeny from the Brassica rapa lines T19 (resistant) × ACDC (susceptible) were backcrossed with ACDC, then self-pollinated to produce BC<sub>1</sub>S<sub>1</sub> lines. From genotyping-by-sequencing (GBS) of the parental lines and BC<sub>1</sub> plants, about 1.32 M sequences from T19 were aligned to the reference genome of B. rapa with 0.4-fold coverage, and 1.77 M sequences with 0.5-fold coverage in ACDC. The number of aligned short reads per plant in the BC<sub>1</sub> ranged from 0.07 to 1.41 M sequences with 0.1-fold coverage. A total of 1584 high-quality SNP loci were obtained, distributed on 10 chromosomes. A single co-localized QTL, designated as Rcr4 on chromosome A03, conferred resistance to pathotypes 2, 3, 5, 6 and 8. The peak was at SNP locus A03_23710236, where LOD values were 30.3 to 38.8, with phenotypic variation explained (PVE) of 85-95%. Two QTLs for resistance to a novel P. brassicae pathotype 5x, designated Rcr8 on chromosome A02 and Rcr9 on A08, were detected with 15.0 LOD and 15.8 LOD, and PVE of 36% and 39%, respectively. Bulked segregant analysis was performed to examine TIR-NBS-LRR proteins in the regions harboring the QTL.
Project description:The freely available eutherian genomic sequence data sets have advanced the scientific field of genomics. Of note, future revisions of gene data sets were expected, due to the incompleteness of public eutherian genomic sequence assemblies and potential genomic sequence errors. The eutherian comparative genomic analysis protocol was proposed as guidance in protection against potential genomic sequence errors in public eutherian genomic sequences. The protocol was applicable in updates of 7 major eutherian gene data sets, including 812 complete coding sequences deposited in the European Nucleotide Archive as curated third party data gene data sets.