A method for automatically extracting infectious disease-related primers and probes from the literature.
ABSTRACT: BACKGROUND: Primer and probe sequences are the main components of nucleic acid-based detection systems. Biologists use primers and probes for different tasks, some related to the diagnosis and treatment of infectious diseases. The biological literature is the main source of empirically validated primer and probe sequences, so it is becoming increasingly important for researchers to be able to navigate this information. In this paper, we present a four-phase method for extracting and annotating primer/probe sequences from the literature. These phases are: (1) convert each document into a tree of paper sections, (2) detect candidate sequences using a set of finite state machine-based recognizers, (3) refine problematic sequences using a rule-based expert system, and (4) annotate the extracted sequences with their related organism/gene information. RESULTS: We tested our approach on a set of 297 manuscripts. The extracted sequences and their organism/gene annotations were manually evaluated by a panel of molecular biologists. The evaluation shows that our approach is suitable for automatically extracting DNA sequences, achieving precision and recall of 97.98% and 95.77%, respectively. In addition, 76.66% of the detected sequences were correctly annotated with their organism name. The system also provided correct gene-related information for 46.18% of the sequences assigned a correct organism name. CONCLUSIONS: We believe the proposed method can facilitate routine tasks for biomedical researchers who use molecular methods to diagnose and treat infectious diseases. The method can also be extended to detect and extract other biological sequences from the literature, and the extracted information can be used to update existing primer/probe databases or to create new databases from scratch.
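As an illustration only (not the authors' actual recognizers), the phase-2 detection step can be sketched as a regular expression, which is a compact encoding of a finite state machine, over IUPAC nucleotide codes. The 15-base minimum length and the whitespace/hyphen cleanup are assumptions for the sketch:

```python
import re

# IUPAC nucleotide codes, including ambiguity codes such as R, Y and N.
IUPAC = "ACGTURYSWKMBDHVN"

# A run of at least 15 IUPAC letters bounded by non-word characters.
# The length threshold is an assumed parameter, not the paper's value.
CANDIDATE = re.compile(r"\b[%s]{15,}\b" % IUPAC)

def find_candidate_sequences(text):
    """Return candidate primer/probe sequences found in a text fragment."""
    # Join runs broken by spaces or hyphens (e.g. "GGT TAC" or "ACT-T"),
    # a crude stand-in for the paper's rule-based refinement phase.
    cleaned = re.sub(r"(?<=[A-Z])[\s\-](?=[A-Z])", "", text.upper())
    return CANDIDATE.findall(cleaned)

print(find_candidate_sequences("forward primer 5'-GGT TAC CTT GTT ACG ACT T-3'"))
# -> ['GGTTACCTTGTTACGACTT']
```

A real system would additionally reject English words that happen to use only IUPAC letters and handle sequences split across line breaks.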
Project description:<h4>Background</h4>Extracting and visualizing protein-protein interactions (PPIs) from the text literature is a meaningful topic in protein science, as it assists the identification of interactions among proteins. However, tools that extract PPIs and visualize and classify the results are lacking.<h4>Results</h4>We developed a PPI search system, termed PPLook, which automatically extracts and visualizes protein-protein interactions (PPIs) from text. Given a query protein name, PPLook can search a dataset for other proteins interacting with it using a keyword-dictionary pattern-matching algorithm, and display topological parameters such as the number of nodes, edges, and connected components. The visualization component of PPLook enables users to view the interaction relationships among proteins in three-dimensional space based on the OpenGL graphics interface. PPLook can also select a protein semantic class, count the number of proteins of a semantic class that interact with the query protein, and count the number of articles in which an interaction involving the query protein appears. Moreover, PPLook provides heterogeneous search and a user-friendly graphical interface.<h4>Conclusions</h4>PPLook is an effective tool for biologists and biosystem developers who need to access PPI information from the literature. PPLook is freely available for non-commercial users at http://meta.usc.edu/softs/PPLook.
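A keyword-dictionary pattern-matching pass of the kind PPLook describes can be sketched as a sentence-level co-occurrence scan. The protein dictionary and interaction verbs below are illustrative placeholders, not PPLook's actual resources:

```python
# Illustrative dictionaries; a real system would load curated lists.
PROTEINS = {"p53", "mdm2", "bax", "akt1"}
INTERACTION_WORDS = {"binds", "interacts", "phosphorylates", "inhibits"}

def find_partners(query, sentences):
    """Return proteins co-mentioned with the query near an interaction verb."""
    partners = set()
    for s in sentences:
        tokens = {t.strip(".,;:()").lower() for t in s.split()}
        # Require the query protein and at least one interaction keyword
        # in the same sentence before accepting any partner protein.
        if query.lower() in tokens and tokens & INTERACTION_WORDS:
            partners |= (tokens & PROTEINS) - {query.lower()}
    return partners

print(find_partners("p53", ["MDM2 binds p53 and inhibits its activity."]))
# -> {'mdm2'}
```

The extracted pairs would then feed the graph statistics the abstract mentions (nodes, edges, connected components).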
Project description:Large quantities of information describing the mechanisms of biological pathways continue to be collected in publicly available databases. At the same time, experiments have increased in scale, and biologists increasingly use pathways defined in online databases to interpret the results of experiments and generate hypotheses. Emerging computational techniques that exploit the rich biological information captured in reaction systems require formal, standardized descriptions of pathways to extract these reaction networks and avoid the alternative: time-consuming and largely manual literature-based network reconstruction. Here, we systematically evaluate the effects of commonly used knowledge representations on the seemingly simple task of extracting a reaction network describing signal transduction from a pathway database. We show that this process is in fact surprisingly difficult, and that the pathway representations adopted by various knowledge bases have dramatic consequences for reaction network extraction, connectivity, capture of pathway crosstalk, and the modelling of cell-cell interactions. Researchers constructing computational models from automatically extracted reaction networks must therefore consider the issues we outline in this review to maximize the value of existing pathway knowledge.
Project description:The RTPrimerDB (http://medgen.ugent.be/rtprimerdb) project provides a freely accessible data retrieval system and an in silico assay evaluation pipeline for real-time quantitative PCR assays. Over the last year, the number of user-submitted assays has grown to 3500. Data conveyance from Entrez Gene, by establishing an assay-to-gene relationship, enables the addition of new primer assays for any of the 1.5 million genes from 2300 species stored in the system. Easy access to the primer and probe data is possible using multiple search criteria. Assay reports contain gene information, assay details (such as oligonucleotide sequences, detection chemistry and reaction conditions), publication information, users' experimental evaluation feedback and the submitter's contact details. Gene expression assays are extended with a scalable assay viewer that provides detailed information on the alignment of primer and probe sequences against the known transcript variants of a gene, along with single nucleotide polymorphism (SNP) positions and peptide domain information. Furthermore, an mfold module is implemented to predict the secondary structure of the amplicon sequence, as this has been reported to affect PCR efficiency. RTPrimerDB is also extended with an in silico analysis pipeline to streamline the evaluation of custom-designed primer and probe sequences prior to ordering and experimental evaluation. In a secured environment, the pipeline performs automated BLAST specificity searches, mfold secondary structure prediction, SNP or plain sequence error identification, and graphical visualization of the aligned primer and probe sequences on the target gene.
Project description:MOTIVATION: Mirtrons arise from short introns that are atypically cleaved by the splicing machinery. There is currently no public repository centralizing and organizing the available mirtron data. To fill this gap, we developed mirtronDB, the first knowledge database dedicated to mirtrons, available at http://mirtrondb.cp.utfpr.edu.br/. MirtronDB currently contains a total of 1407 mirtron precursors and 2426 mirtron mature sequences in 18 species. RESULTS: Through a user-friendly interface, users can browse and search mirtrons by organism, organism group, type and name. MirtronDB is a specialized resource that provides free and user-friendly access to knowledge on mirtron data. AVAILABILITY AND IMPLEMENTATION: MirtronDB is available at http://mirtrondb.cp.utfpr.edu.br/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Project description:BACKGROUND: Superoxide reductases (SOR) catalyse the reduction of superoxide anions to hydrogen peroxide and are involved in the oxidative stress defences of anaerobic and facultatively anaerobic organisms. Genes encoding SOR were discovered recently and suffer from annotation problems. These genes, named sor, are short, and the transfer of annotations from the previously characterized neelaredoxin, desulfoferrodoxin, superoxide reductase and rubredoxin oxidase genes has been heterogeneous. Consequently, many sor genes remain unannotated or mis-annotated. DESCRIPTION: SORGOdb is an exhaustive database of SOR that proposes a new classification based on domain architecture. SORGOdb supplies a simple, user-friendly web-based database for retrieving and exploring relevant information about the proposed SOR families. The database can be queried using an organism name, a locus tag or phylogenetic criteria, and also offers sequence similarity searches using BlastP. Genes encoding SOR have been re-annotated in all available genome sequences (prokaryotic and eukaryotic genomes, complete and in draft, updated in May 2010). CONCLUSIONS: SORGOdb contains 325 non-redundant, curated SOR from 274 organisms. It proposes a new classification of SOR into seven classes and allows biologists to explore and analyse sor genes in order to establish correlations between SOR class and organism phenotype. SORGOdb is freely available at http://sorgo.genouest.org/index.php.
Project description:The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and the publication of biomedical articles. Although genome sequence data and publications are two of the most heavily relied-upon sources of information for many biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms, dedicated teams manually curate publications about genes; however, for species with no such dedicated staff, many thousands of articles are never mapped to genes or genomic regions. To overcome this lack of integration between genomic data and the biological literature, we have developed pubmed2ensembl (http://www.pubmed2ensembl.org), an extension to the BioMart system that links over 2,000,000 articles in PubMed to nearly 150,000 genes in Ensembl from 50 species. We use several sources of curated (e.g., Entrez Gene) and automatically generated (e.g., gene names extracted through text mining on MEDLINE records) gene-publication links, allowing users to filter and combine different data sources to suit their individual needs for information extraction and biological discovery. In addition to extending the Ensembl BioMart database to include published information on genes, we implemented a scripting language for automated BioMart construction and a novel BioMart interface that allows text-based queries to be performed against PubMed and PubMed Central documents in conjunction with constraints on genomic features.
Finally, we illustrate the potential of pubmed2ensembl through typical use cases that involve integrated queries across the biomedical literature and genomic data. By allowing biologists to find the relevant literature on specific genomic regions or sets of functionally related genes more easily, pubmed2ensembl offers a much-needed genome-informatics-inspired solution to accessing the ever-increasing biomedical literature.
Project description:With genome sequencing efforts increasing exponentially, valuable information accumulates on genomic content of the various organisms sequenced. Projector 2 uses (un)finished genomic sequences of an organism as a template to infer linkage information for a genome sequence assembly of a related organism being sequenced. The remaining gaps between contigs for which no linkage information is present can subsequently be closed with direct PCR strategies. Compared with other implementations, Projector 2 has several distinctive features: a user-friendly web interface, automatic removal of repetitive elements (repeat-masking) and automated primer design for gap-closure purposes. Moreover, when using multiple fragments of a template genome, primers for multiplex PCR strategies can also be designed. Primer design takes into account that, in many cases, contig ends contain unreliable DNA sequences and repetitive sequences. Closing the remaining gaps in prokaryotic genome sequence assemblies is thereby made very efficient and virtually effortless. We demonstrate that the use of single or multiple fragments of a template genome (i.e. unfinished genome sequences) in combination with repeat-masking results in mapping success rates close to 100%. The web interface is freely accessible at http://molgen.biol.rug.nl/websoftware/projector2.
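One criterion automated primer design for gap closure typically relies on is the primer melting temperature; the classic Wallace rule, Tm = 2(A+T) + 4(G+C), is a common first-pass estimate. This sketch shows only that filter as an assumption; Projector 2's actual pipeline additionally handles repeat-masking and unreliable contig ends:

```python
def wallace_tm(primer):
    """Wallace-rule melting temperature: 2*(A+T) + 4*(G+C) degrees C."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))

def acceptable(primer, lo=55, hi=65):
    """Accept a candidate primer whose estimated Tm falls in an assumed window."""
    return lo <= wallace_tm(primer) <= hi

print(wallace_tm("ATGCATGCATGCATGCATGC"))  # 10 A/T + 10 G/C -> 60
```

In practice the Wallace rule only suits short oligos (roughly 14-20 nt); longer primers call for nearest-neighbour thermodynamic models.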
Project description:The rapid increase in the flow rate of published digital information in all disciplines has resulted in a pressing need for techniques that can simplify the use of this information. The chemistry literature is very rich in information about chemical entities. Extracting molecules and their related properties and activities from the scientific literature to "text mine" these extracted data and determine contextual relationships helps research scientists, particularly those in drug development. One of the most important challenges in chemical text mining is the recognition of chemical entities mentioned in the texts. In this review, we briefly introduce the fundamental concepts of chemical literature mining, the textual content of chemical documents, and the methods used to name chemicals in documents. We sketch out dictionary-based, rule-based, machine learning-based, and hybrid chemical named entity recognition approaches together with their applied solutions. We end with an outlook on the pros and cons of these approaches and the types of chemical entities they extract.
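The dictionary-based family of approaches surveyed above can be illustrated with a minimal longest-match lookup; the dictionary and input text are illustrative placeholders, not a real chemical lexicon:

```python
# Toy chemical dictionary; real systems use resources with millions of names.
CHEM_DICT = {"aspirin", "acetylsalicylic acid", "ibuprofen", "caffeine"}

def tag_chemicals(text):
    """Return (start, end, surface form) spans via longest-match lookup."""
    low = text.lower()
    spans = []
    # Try longer names first so "acetylsalicylic acid" beats any substring.
    for name in sorted(CHEM_DICT, key=len, reverse=True):
        start = 0
        while (i := low.find(name, start)) != -1:
            if not any(a <= i < b for a, b, _ in spans):  # skip overlaps
                spans.append((i, i + len(name), text[i:i + len(name)]))
            start = i + len(name)
    return sorted(spans)

print(tag_chemicals("Aspirin (acetylsalicylic acid) reduces fever."))
```

Dictionary methods give high precision on known names but miss novel compounds, which is why the review also covers rule-based, machine-learning and hybrid recognizers.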
Project description:The presence of methanogenic bacteria was assessed in peat and soil cores taken from upland moors. The sampling area was largely covered by blanket bog peat together with small areas of red-brown limestone and peaty gley. A 30-cm-deep core of each soil type was taken, and DNA was extracted from 5-cm transverse sections. Purified DNA was subjected to PCR amplification with primers IAf and 1100Ar, which specifically amplify 1.1 kb of the archaeal 16S rRNA gene, and ME1 and ME2, which were designed to amplify a 0.75-kb region of the alpha-subunit gene for methyl coenzyme M reductase (MCR). Amplification with both primer pairs was obtained only with DNA extracted from the two deepest sections of the blanket bog peat core. This is consistent with the notion that anaerobiosis is required for activity and survival of the methanogen population. PCR products from both amplifications were cloned, and the resulting transformants were screened with specific oligonucleotide probes internal to the MCR or archaeal 16S rRNA PCR product. Plasmid DNA was extracted from probe-positive clones of both types and the insert was sequenced. The DNA sequences of 8 MCR clones were identical, as were those of 16 of the 17 16S rRNA clones. One clone showed marked variation from the remainder in specific regions of the sequence. From a comparison of these two different 16S rRNA sequences, an oligonucleotide was synthesized that was 100% homologous to a sequence region of the first 16 clones but had six mismatches with the variant. This probe was used to screen primary populations of PCR clones, and all of those that were probe negative were checked for the presence of inserts, which were then sequenced. By using this strategy, further novel methanogen 16S rRNA variants were identified and analyzed. The sequences recovered from the peat formed two clusters on the end of long branches within the methanogen radiation that are distinct from each other. 
These cannot be placed directly with sequences from any cultured taxa for which sequence information is available.
Project description:BACKGROUND: Today, there are more than 18 million articles related to biomedical research indexed in MEDLINE, and information derived from them could be used effectively to save the great amount of time and resources spent by government agencies in understanding the scientific landscape, including key opinion leaders and centers of excellence. Associating biomedical articles with organization names could significantly benefit the pharmaceutical marketing industry, health care funding agencies and public health officials, and be useful to other scientists for normalizing author names, automatically creating citations, indexing articles and identifying potential resources or collaborators. A large amount of extracted information also helps in disambiguating organization names using machine-learning algorithms. RESULTS: We propose NEMO, a system for extracting organization names from the affiliation field and normalizing them to a canonical organization name. Our parsing process involves multi-layered rule matching with multiple dictionaries, and the system achieves more than 98% f-score in extracting organization names. Our normalization process involves clustering based on local sequence alignment metrics and local learning based on finding connected components. High precision was also observed in normalization. CONCLUSION: NEMO is the missing link in associating each biomedical paper and its authors with an organization name in its canonical form and with the geopolitical location of the organization. This research could potentially help in analyzing large social networks of organizations to landscape a particular topic, improving the performance of author disambiguation, adding weak links in the co-author network, augmenting NLM's MARS system for correcting errors in the OCR output of the affiliation field, and automatically indexing PubMed citations with the normalized organization name and country.
Our system, with a graphical user interface, is available for download along with this paper.
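The normalization idea NEMO describes, which is scoring string similarity between extracted names, linking sufficiently similar pairs, and merging connected components into one canonical form, can be sketched as follows. Here `difflib.SequenceMatcher` stands in for the local sequence alignment metric, and the 0.75 threshold and example names are assumptions:

```python
from difflib import SequenceMatcher

def normalize(names, threshold=0.75):
    """Cluster similar names via union-find and pick a canonical variant."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Link every pair whose similarity clears the (assumed) threshold.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            sim = SequenceMatcher(None, names[i].lower(), names[j].lower()).ratio()
            if sim >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, n in enumerate(names):
        clusters.setdefault(find(i), []).append(n)
    # Take the longest variant in each connected component as canonical.
    return [max(c, key=len) for c in clusters.values()]

print(normalize(["Harvard Univ.", "Harvard University", "MIT"]))
```

A production system would replace the similarity function with a true local alignment score and learn the threshold per cluster, as the abstract's "local learning" suggests.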