Fast and Accurate Approaches for Large-Scale, Automated Mapping of Food Diaries on Food Composition Tables.
ABSTRACT: Aim of Study: The use of weighed food diaries in nutritional studies provides a powerful method to quantify food and nutrient intakes. Yet, mapping these records onto food composition tables (FCTs) is a challenging, time-consuming and error-prone process. This work is typically carried out manually by experts, and no automated approach has previously been proposed. Our study aimed to assess automated approaches for mapping food items onto FCTs. Methods: We used food diaries (~170,000 records pertaining to 4,200 unique food items) from the DiOGenes randomized clinical trial. We attempted to map these items onto six FCTs available from the EuroFIR resource. Two approaches were tested: the first was based solely on food name similarity (fuzzy matching); the second used a machine learning approach (C5.0 classifier) combining both fuzzy matching and food energy. We tested mapping food items using both their original names and an English translation. Top-matching pairs were reviewed manually to derive performance metrics: precision (the percentage of correctly mapped items) and recall (the percentage of mapped items). Results: The simpler approach, fuzzy matching, performed very well. Under a relaxed threshold (score > 50%), it mapped 99.49% of the items with a precision of 88.75%. With a slightly more stringent threshold (score > 63%), precision improved significantly to 96.81% while keeping a recall rate above 95% (i.e., only 5% of the queried items would remain unmapped). The machine learning approach did not improve on fuzzy matching overall. However, it substantially increased the recall rate for food items without any clear equivalent in the FCTs (+7% and +20% when mapping items by their original or English-translated names, respectively). Our approaches have been implemented as R packages and are freely available from GitHub. Conclusion: This study is the first to provide automated approaches for large-scale mapping of food items onto FCTs. We demonstrate that both high precision and high recall can be achieved. Our solutions can be used with any FCT and require no programming background. These methodologies and findings are useful to nutritional studies of any size, whether observational or interventional.
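Below is a minimal sketch of the fuzzy-matching step, written in Python for illustration (the authors distribute their tools as R packages, and their exact scoring function is not reproduced here). The food and FCT names are invented examples, and the 63% cut-off mirrors the stringent threshold reported above.

```python
# Illustrative Python sketch of threshold-based fuzzy name matching
# (the authors' tools are R packages; names and scoring here are invented).
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, drop commas and sort tokens so word order is ignored."""
    return " ".join(sorted(name.lower().replace(",", " ").split()))

def similarity(a: str, b: str) -> float:
    """Token-sorted string similarity on a 0-100 scale."""
    return 100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def map_items(diary_items, fct_entries, threshold=63.0):
    """Map each diary item to its best-scoring FCT entry above the threshold."""
    mapping = {}
    for item in diary_items:
        entry, score = max(((e, similarity(item, e)) for e in fct_entries),
                           key=lambda pair: pair[1])
        if score > threshold:
            mapping[item] = (entry, score)
    return mapping

if __name__ == "__main__":
    diary = ["semi-skimmed milk", "wholemeal bread", "cheddar"]
    fct = ["Milk, semi-skimmed", "Bread, wholemeal", "Cheese, Cheddar"]
    for item, (entry, score) in map_items(diary, fct).items():
        print(f"{item!r} -> {entry!r} ({score:.0f}%)")
```

In practice the manual review of the top-scoring pairs is what yields the precision and recall figures quoted above; the threshold simply trades unmapped items for fewer wrong mappings.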
Project description: BACKGROUND: A quantitative trait locus (QTL) is a genomic region that correlates with a phenotype. Most of the experimental information from QTL mapping studies is reported in tables of scientific publications. Traditional text mining techniques aim to extract information from unstructured text rather than from tables. We present QTLTableMiner++ (QTM), a table mining tool that extracts and semantically annotates QTL information buried in (heterogeneous) tables of the plant science literature. QTM is a command-line tool written in the Java programming language. It takes scientific articles from the Europe PMC repository as input and extracts QTL tables using keyword matching and ontology-based concept identification. The tables are further normalized using rules derived from table properties such as captions, column headers and table footers. Furthermore, table columns are classified into three categories, namely column descriptors, properties and values, based on column headers and the data types of cell entries. Abbreviations found in the tables are expanded using the Schwartz and Hearst algorithm. Finally, the content of QTL tables is semantically enriched with domain-specific ontologies (e.g. Crop Ontology, Plant Ontology and Trait Ontology) using the Apache Solr search platform, and the results are stored in a relational database and a text file. RESULTS: The performance of the QTM tool was assessed by precision and recall based on the information retrieved from two manually annotated corpora of open access articles, i.e. QTL mapping studies in tomato (Solanum lycopersicum) and in potato (S. tuberosum). In summary, QTM detected QTL statements in tomato with 74.53% precision and 92.56% recall, and in potato with 82.82% precision and 98.94% recall. CONCLUSION: QTM is a unique tool that aids in providing QTL information in machine-readable and semantically interoperable formats.
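The abbreviation-expansion step mentioned above can be illustrated with a simplified sketch of the long-form search at the heart of the Schwartz and Hearst algorithm. This is not the QTM implementation, and it omits the full algorithm's candidate extraction and validity checks; it only shows the right-to-left character alignment idea.

```python
def best_long_form(short_form: str, candidate: str):
    """Simplified Schwartz-Hearst long-form search: every character of the
    short form must appear, in order, in the candidate long form, and the
    first character of the short form must start a word."""
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        c = short_form[s].lower()
        if not c.isalnum():          # skip punctuation in the short form
            s -= 1
            continue
        while l >= 0 and (candidate[l].lower() != c or
                          (s == 0 and l > 0 and candidate[l - 1].isalnum())):
            l -= 1
        if l < 0:                    # no valid alignment found
            return None
        s -= 1
        l -= 1
    return candidate[l + 1:]

# e.g. expanding "QTL" against the words preceding the parentheses:
print(best_long_form("QTL", "quantitative trait locus"))  # quantitative trait locus
```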
Project description: Unraveling functional and ancestral relationships between proteins, as well as structure-prediction procedures, requires powerful protein-alignment methods. A structure-alignment method is presented in which the problem is mapped onto a cost function containing both fuzzy (Potts) assignment variables and atomic coordinates. The cost function is minimized using an iterative scheme in which, at each step, mean field theory methods at finite "temperatures" are used to determine the fuzzy assignment variables, followed by exact translation and rotation of the atomic coordinates weighted by their corresponding fuzzy assignment variables. The approach performs very well when compared with other methods, requires modest CPU time, and is robust with respect to the choice of iteration parameters for a wide range of proteins.
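A rough sketch of one alternation of such an iterative scheme, under simplifying assumptions: the soft correspondences are taken as a temperature-controlled softmax of pairwise distances (a stand-in for the mean-field Potts update described in the paper), and the coordinates are then superposed with a weighted Kabsch step. The array shapes (N x 3 and M x 3 coordinate matrices) and the temperature value are assumptions for illustration, not the published algorithm.

```python
import numpy as np

def soft_assignments(X, Y, temperature=2.0):
    """Fuzzy correspondences between the N points of X and the M points of Y:
    a softmax of negative pairwise distances at a finite 'temperature'
    (an illustrative stand-in for the mean-field Potts update)."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)   # (N, M) distances
    w = np.exp(-d / temperature)
    return w / w.sum(axis=1, keepdims=True)                      # rows sum to 1

def weighted_superposition(X, Y, W):
    """Weighted Kabsch step: rigidly move Y onto X, weighting every
    candidate pair (i, j) by the fuzzy assignment W[i, j]."""
    n, m = X.shape[0], Y.shape[0]
    w = W.ravel()
    xs = np.repeat(X, m, axis=0)                 # pair (i, j) -> X[i]
    ys = np.tile(Y, (n, 1))                      # pair (i, j) -> Y[j]
    cx = (w[:, None] * xs).sum(0) / w.sum()      # weighted centroids
    cy = (w[:, None] * ys).sum(0) / w.sum()
    H = (w[:, None, None] * (ys - cy)[:, :, None] * (xs - cx)[:, None, :]).sum(0)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return (Y - cy) @ R.T + cx                   # transformed copy of Y

# one alternation of the scheme: update assignments, then re-superpose, e.g.
# W = soft_assignments(X, Y); Y = weighted_superposition(X, Y, W)
```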
Project description: While sensory processes are tuned to particular features, such as an object's specific location, color or orientation, visual working memory (vWM) is assumed to store information using representations that generalize over a feature dimension. Additionally, current vWM models presume that different features or objects are stored independently. Configurational effects, when observed, are instead usually attributed to encoding strategies. We show that the location of the target relative to the display center and boundaries, together with overall memory load, influenced recall precision, indicating that, like sensory processes, capacity-limited vWM resources are spatially tuned. When one of three memory items was recalled, the target's distance from the display center was overestimated, similar to the error when only one item was memorized, but its distance from the memory items' average position was underestimated, showing that not only the individual memory items' positions, but also the global configuration of the memory array, may be stored. Finally, presenting the non-target items at recall, thereby providing landmarks and configurational information, improved the precision and accuracy of target recall. Similarly, when the non-target items were translated at recall relative to their positions in the initial display, a parallel displacement of the recalled target was observed. These findings suggest that fine-grained spatial information in vWM is represented in local maps whose resolution varies with distance from landmarks, such as the display center, while coarse representations are used to store the memory array configuration. Both representations are updated at the time of recall.
Project description: BACKGROUND: Electronic health records are invaluable for medical research, but much of the information is recorded as unstructured free text, which is time-consuming to review manually. AIM: To develop an algorithm that identifies relevant free texts automatically based on labelled examples. METHODS: We developed a novel machine learning algorithm, the 'Semi-supervised Set Covering Machine' (S3CM), and tested its ability to detect the presence of coronary angiogram results and ovarian cancer diagnoses in free text in the General Practice Research Database. For training the algorithm, we used texts classified as positive and negative according to their associated Read diagnostic codes, rather than by manual annotation. We evaluated the precision (positive predictive value) and recall (sensitivity) of S3CM in classifying unlabelled texts against the gold standard of manual review. We compared the performance of S3CM with the Transductive Support Vector Machine (TSVM), the original fully supervised Set Covering Machine (SCM) and our 'Freetext Matching Algorithm' natural language processor. RESULTS: Only 60% of texts with Read codes for angiogram actually contained angiogram results. However, the S3CM algorithm achieved 87% recall with 64% precision in detecting coronary angiogram results, outperforming the fully supervised SCM (recall 78%, precision 60%) and the TSVM (recall 2%, precision 3%). For ovarian cancer diagnoses, S3CM had higher recall than the other algorithms tested (86%). The Freetext Matching Algorithm had better precision than S3CM (85% versus 74%) but lower recall (62%). CONCLUSIONS: Our novel S3CM machine learning algorithm effectively detected free texts in primary care records associated with angiogram results and ovarian cancer diagnoses after training on pre-classified sets of texts. It should be easy to adapt to other disease areas, as it does not rely on linguistic rules, but it needs further testing in other electronic health record datasets.
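The evaluation setup, though not the S3CM algorithm itself, can be sketched as follows: a generic linear text classifier is trained on texts labelled only by their Read codes and is then scored against manually reviewed gold-standard labels. The scikit-learn pipeline below is purely illustrative of that design, not the published method.

```python
# Illustrative of the experimental design only (not the S3CM algorithm):
# train on noisy, Read-code-derived labels; evaluate against manual review.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_score, recall_score
from sklearn.pipeline import make_pipeline

def train_and_evaluate(train_texts, code_derived_labels,
                       test_texts, manually_reviewed_labels):
    """Labels are 0/1; precision is the positive predictive value and
    recall the sensitivity, both measured against manual review."""
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    model.fit(train_texts, code_derived_labels)
    predicted = model.predict(test_texts)
    return (precision_score(manually_reviewed_labels, predicted),
            recall_score(manually_reviewed_labels, predicted))
```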
Project description: Poor recall during investigations of foodborne outbreaks may lead to misclassification in exposure ascertainment. We conducted a simulation study to assess the frequency and determinants of recall errors. Lunch visitors to a cafeteria using exclusively cashless payment reported, via a self-administered paper questionnaire, their consumption of 13 food servings available daily during the three preceding weeks. We validated this information against electronic payment records. We assessed factors associated with misclassification of recall, including recall delay, age, sex, education level, dietary habits and type of serving. We included 145/226 (64%) respondents, who reported 27,095 consumed food items. Sensitivity of recall was 73% and specificity 96%. In multivariable analysis, for each additional day of recall period, the adjusted odds of false-negative recall increased by 8% (OR 1.1; 95% CI 1.06-1.1), of false-positive recall by 3% (OR 1.03; 95% CI 1.02-1.05), and of indecisive recall by 12% (OR 1.1; 95% CI 1.08-1.15). Sex and education level had minor effects. Forgetting to report consumed foods is more frequent than reporting food items that were not actually consumed. Recall errors increase strongly with the delay of interviews and may make hypothesis generation and testing very challenging. Side dishes are more easily missed than main courses. Where available, electronic payment data can improve food-history information.
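As a back-of-the-envelope illustration of the reported effect size: if the adjusted odds of false-negative recall rise by 8% per additional day of delay and the effect compounds multiplicatively (an assumption made here for illustration), the odds at the end of the three-week recall window are roughly five times those on the first day.

```python
per_day_increase = 1.08   # reported +8% adjusted odds per additional day of recall
days = 21                 # three-week recall window
print(per_day_increase ** days)   # ~5.0, i.e. roughly five-fold higher odds
```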
Project description: To develop an automated system to extract medications and related information from discharge summaries as part of the 2009 i2b2 natural language processing (NLP) challenge. This task required accurate recognition of medication name, dosage, mode, frequency, duration, and reason for drug administration. We developed an integrated system using several existing NLP components developed at Vanderbilt University Medical Center, which included MedEx (to extract medication information), SecTag (a section identification system for clinical notes), a sentence splitter, and a spell checker for drug names. Our goal was to achieve good performance with minimal to no specific training for this document corpus, thereby evaluating the portability of these NLP tools beyond their home institution. The integrated system was developed using 17 notes annotated by the organizers and evaluated using 251 notes annotated by participating teams. The i2b2 challenge used standard measures, including precision, recall, and F-measure, to evaluate the performance of participating systems. There were two ways to determine whether an extracted textual finding was correct: exact matching or inexact matching. The overall performance for all six types of medication-related findings across the 251 annotated notes was the primary metric in the challenge. Our system achieved an overall F-measure of 0.821 for exact matching (0.839 precision; 0.803 recall) and 0.822 for inexact matching (0.866 precision; 0.782 recall). The system ranked second out of 20 participating teams on overall performance at extracting medications and related information. The results show that the existing MedEx system, together with other NLP components, can extract medication information with reasonable performance from clinical text produced at institutions other than the site of algorithm development.
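The overall scores follow from the usual balanced F-measure, the harmonic mean of precision and recall; the snippet below simply reproduces the two figures quoted above.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.839, 0.803), 3))   # 0.821 (exact matching)
print(round(f_measure(0.866, 0.782), 3))   # 0.822 (inexact matching)
```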
Project description: Nerve tissue contains a high density of chemical synapses, about 1 per µm³ in the mammalian cerebral cortex. Thus, even for small blocks of nerve tissue, dense connectomic mapping requires the identification of millions to billions of synapses. While the focus of connectomic data analysis has been on neurite reconstruction, synapse detection becomes limiting when datasets grow in size and dense mapping is required. Here, we report SynEM, a method for automated detection of synapses from conventionally en bloc stained 3D electron microscopy image stacks. The approach is based on a segmentation of the image data and focuses on classifying borders between neuronal processes as synaptic or non-synaptic. SynEM yields 97% precision and recall in binary cortical connectomes with no user interaction. It scales to large volumes of cortical neuropil, plausibly even whole-brain datasets. SynEM removes the burden of manual synapse annotation for large, densely mapped connectomes.
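The interface-centred formulation can be illustrated with a small sketch that enumerates candidate borders between touching segments in a 3D label volume. SynEM's actual features and classifier are not reproduced here; the function below is only an assumed pre-processing step for illustration.

```python
import numpy as np

def touching_segment_pairs(labels: np.ndarray):
    """Enumerate pairs of distinct, non-zero segment IDs that share a face in a
    3D label volume. These are the candidate interfaces that a border classifier
    (synaptic vs. non-synaptic) would subsequently score."""
    pairs = set()
    for axis in range(labels.ndim):
        lo = [slice(None)] * labels.ndim
        hi = [slice(None)] * labels.ndim
        lo[axis] = slice(None, -1)               # each voxel
        hi[axis] = slice(1, None)                # its neighbour along this axis
        a, b = labels[tuple(lo)], labels[tuple(hi)]
        mask = (a != b) & (a != 0) & (b != 0)    # face between two real segments
        for x, y in zip(a[mask], b[mask]):
            pairs.add((int(min(x, y)), int(max(x, y))))
    return pairs
```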
Project description: BACKGROUND: Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools that annotate genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data. RESULTS: In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products from article abstracts in PubMed and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns" and a semantic matching framework to locate phrases matching a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN reached a precision of 78% at a recall of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general. CONCLUSION: GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns constructed from the existing annotations achieves high precision. The semantic pattern matching framework provides a more flexible matching scheme than "exact matching", with the advantage of locating approximate pattern occurrences with similar semantics. The relatively low recall of our pattern-based approach may be enhanced either by employing a probabilistic annotation framework based on annotation neighbourhoods in textual data, or, alternatively, by lowering the statistical enrichment threshold for applications that place more value on achieving higher recall.
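The semantic relaxation of exact pattern matching can be illustrated with a small sketch based on WordNet similarity via NLTK. The similarity measure and the threshold below are assumptions for illustration; this is in the spirit of, but not identical to, GEANN's semantic matching framework.

```python
# Illustrative only: relax exact token matching to a WordNet-based similarity
# test (via NLTK); the threshold and measure are assumptions, not GEANN's.
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def semantically_matches(pattern_word: str, text_word: str, threshold: float = 0.5):
    """True if the words are identical or any pair of their WordNet senses
    exceeds an (assumed) path-similarity threshold."""
    if pattern_word.lower() == text_word.lower():
        return True
    for s1 in wn.synsets(pattern_word):
        for s2 in wn.synsets(text_word):
            similarity = s1.path_similarity(s2)
            if similarity is not None and similarity >= threshold:
                return True
    return False
```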
Project description: Large-scale siRNA screens allow the function of poorly characterized genes to be linked to phenotypic readouts. According to this strategy, genes are associated with a function of interest if altering their expression perturbs the phenotypic readouts. However, given the intricacy of the cell regulatory network, the mapping procedure has low resolution and the resulting models provide little mechanistic insight. We have developed a new strategy that combines multiparametric analysis of cell perturbation with logic modeling to achieve a more detailed functional mapping of human genes onto complex pathways. A literature-derived, optimized model is used to infer the cell activation state following upregulation or downregulation of the model entities. By matching this signature with the experimental profile obtained in the high-throughput siRNA screen, it is possible to infer the target of each protein, thus defining its 'entry point' in the network. With this novel approach, 41 phosphatases that affect key growth pathways were identified and mapped onto a human epithelial cell-specific growth model, providing insights into the mechanisms underlying their function.
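The signature-matching idea can be illustrated with a toy example: each model node carries a predicted signed effect on a set of readouts, and an siRNA's multiparametric profile is assigned to the node whose signature it agrees with most. All node names, readouts and values below are invented for illustration and do not come from the study.

```python
# Toy illustration (invented data) of signature matching: each hypothetical model
# node has a predicted signed effect on phenotypic readouts, and the experimental
# knock-down profile is assigned to the node it agrees with most.
predicted_signatures = {
    "EGFR_branch": {"proliferation": -1, "apoptosis": +1, "migration": -1},
    "AKT_branch":  {"proliferation": -1, "apoptosis": +1, "migration":  0},
    "JNK_branch":  {"proliferation":  0, "apoptosis": -1, "migration": +1},
}

def entry_point(experimental_profile):
    """Return the model node whose predicted signature best matches the profile."""
    def agreement(signature):
        return sum(signature[k] * experimental_profile.get(k, 0) for k in signature)
    return max(predicted_signatures,
               key=lambda node: agreement(predicted_signatures[node]))

# e.g. an siRNA that lowers proliferation and migration while raising apoptosis:
print(entry_point({"proliferation": -1, "apoptosis": +1, "migration": -1}))
```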
Project description: MOTIVATION: Bacterial haplotype reconstruction is critical for selecting proper treatments for diseases caused by unknown haplotypes. Existing methods and tools do not work well on this task because they were usually developed for viral rather than bacterial populations. RESULTS: In this study, we developed BHap, a novel algorithm based on fuzzy flow networks, for reconstructing bacterial haplotypes from next-generation sequencing data. Tested on simulated and experimental datasets, BHap was capable of reconstructing haplotypes of bacterial populations with an average F1 score of 0.87, an average precision of 0.87 and an average recall of 0.88. We also demonstrated that BHap has low susceptibility to sequencing errors, is capable of reconstructing haplotypes from low-coverage data and can handle a wide range of mutation rates. Compared with existing approaches, BHap outperformed them in terms of higher F1 scores, better precision, better recall and more accurate estimation of the number of haplotypes. AVAILABILITY AND IMPLEMENTATION: The BHap tool is available at http://www.cs.ucf.edu/~xiaoman/BHap/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.