SDRF2GRAPH: a visualization tool of a spreadsheet-based description of experimental processes.
ABSTRACT: BACKGROUND: As larger datasets are produced with the development of genome-scale experimental techniques, it has become essential to explicitly describe the meta-data (information describing the data) generated by an experiment. The experimental process is a part of the meta-data required to interpret the produced data, and SDRF (Sample and Data Relationship Format) supports its description in a spreadsheet or tab-delimited file. This format was primarily developed to describe microarray studies in MAGE-tab, and it is being applied in a broader context in ISA-tab. While the format provides an explicit framework to describe experiments, increase of experimental steps makes it less obvious to understand the content of the SDRF files. RESULTS: Here, we describe a new tool, SDRF2GRAPH, for displaying experimental steps described in an SDRF file as an investigation design graph, a directed acyclic graph representing experimental steps. A spreadsheet, in Microsoft Excel for example, which is used to edit and inspect the descriptions, can be directly input via a web-based interface without converting to tab-delimited text. This makes it much easier to organize large contents of SDRF described in multiple spreadsheets. CONCLUSION: SDRF2GRAPH is applicable for a wide range of SDRF files for not only microarray-based analysis but also other genome-scale technologies, such as next generation sequencers. Visualization of the Investigation Design Graph (IDG) structure leads to an easy understanding of the experimental process described in the SDRF files even if the experiment is complicated, and such visualization also encourages the creation of SDRF files by providing prompt visual feedback.
Project description:Sharing of microarray data within the research community has been greatly facilitated by the development of the disclosure and communication standards MIAME and MAGE-ML by the MGED Society. However, the complexity of the MAGE-ML format has made its use impractical for laboratories lacking dedicated bioinformatics support.We propose a simple tab-delimited, spreadsheet-based format, MAGE-TAB, which will become a part of the MAGE microarray data standard and can be used for annotating and communicating microarray data in a MIAME compliant fashion.MAGE-TAB will enable laboratories without bioinformatics experience or support to manage, exchange and submit well-annotated microarray data in a standard format using a spreadsheet. The MAGE-TAB format is self-contained, and does not require an understanding of MAGE-ML or XML.
Project description:Here we describe the Genome Variation Format (GVF) and the 10Gen dataset. GVF, an extension of Generic Feature Format version 3 (GFF3), is a simple tab-delimited format for DNA variant files, which uses Sequence Ontology to describe genome variation data. The 10Gen dataset, ten human genomes in GVF format, is freely available for community analysis from the Sequence Ontology website and from an Amazon elastic block storage (EBS) snapshot for use in Amazon's EC2 cloud computing environment.
Project description:The Gene Ontology (GO) is the most widely used ontology for creating biomedical annotations. GO annotations are statements associating a biological entity with a GO term. These statements comprise a large dataset of biological knowledge that is used widely in biomedical research. GO Annotations are available as "gene association files" from the GO website in a tab-delimited file format (GO Annotation File Format) composed of rows of 15 tab-delimited fields. This simple format lacks the knowledge representation (KR) capabilities to represent unambiguously semantic relationships between each field. This paper demonstrates that this KR shortcoming leads users to interpret the files in ways that can be erroneous. We propose a complementary format to represent GO annotation files as knowledge bases using the W3C recommended Web Ontology Language (OWL).
Project description:The high-throughput genomics communities have been successfully using standardized spreadsheet-based formats to capture and share data within labs and among public repositories. The nanomedicine community has yet to adopt similar standards to share the diverse and multi-dimensional types of data (including metadata) pertaining to the description and characterization of nanomaterials. Owing to the lack of standardization in representing and sharing nanomaterial data, most of the data currently shared via publications and data resources are incomplete, poorly-integrated, and not suitable for meaningful interpretation and re-use of the data. Specifically, in its current state, data cannot be effectively utilized for the development of predictive models that will inform the rational design of nanomaterials.We have developed a specification called ISA-TAB-Nano, which comprises four spreadsheet-based file formats for representing and integrating various types of nanomaterial data. Three file formats (Investigation, Study, and Assay files) have been adapted from the established ISA-TAB specification; while the Material file format was developed de novo to more readily describe the complexity of nanomaterials and associated small molecules. In this paper, we have discussed the main features of each file format and how to use them for sharing nanomaterial descriptions and assay metadata.The ISA-TAB-Nano file formats provide a general and flexible framework to record and integrate nanomaterial descriptions, assay data (metadata and endpoint measurements) and protocol information. Like ISA-TAB, ISA-TAB-Nano supports the use of ontology terms to promote standardized descriptions and to facilitate search and integration of the data. The ISA-TAB-Nano specification has been submitted as an ASTM work item to obtain community feedback and to provide a nanotechnology data-sharing standard for public development and adoption.
Project description:Sequence annotation is essential for genomics-based research. Investigators of a specific genomic region who have developed abundant local discoveries such as genes and genetic markers, or have collected annotations from multiple resources, can be overwhelmed by the difficulty in creating local annotation and the complexity of integrating all the annotations. Presenting such integrated data in a form suitable for data mining and high-throughput experimental design is even more daunting. DNannotator, a web application, was designed to perform batch annotation on a sizeable genomic region. It takes annotation source data, such as SNPs, genes, primers, and so on, prepared by the end-user and/or a specified target of genomic DNA, and performs de novo annotation. DNannotator can also robustly migrate existing annotations in GenBank format from one sequence to another. Annotation results are provided in GenBank format and in tab-delimited text, which can be imported and managed in a database or spreadsheet and combined with existing annotation as desired. Graphic viewers, such as Genome Browser or Artemis, can display the annotation results. Reference data (reports on the process) facilitating the user's evaluation of annotation quality are optionally provided. DNannotator can be accessed at http://sky.bsd.uchicago.edu/DNannotator.htm.
Project description:This software repository provides a pipeline for converting raw ClinVar data files into analysis-friendly tab-delimited tables, and also provides these tables for the most recent ClinVar release. Separate tables are generated for genome builds GRCh37 and GRCh38 as well as for mono-allelic variants and complex multi-allelic variants. Additionally, the tables are augmented with allele frequencies from the ExAC and gnomAD datasets as these are often consulted when analyzing ClinVar variants. Overall, this work provides ClinVar data in a format that is easier to work with and can be directly loaded into a variety of popular analysis tools such as R, python pandas, and SQL databases.
Project description:This experiment contains a subset of data from the BLUEPRINT Epigenome project ( http://www.blueprint-epigenome.eu ), which aims at producing a reference haemopoetic epigenomes for the research community. 74 samples of primary cells or cultured primary cells of different haemopoeitc lineages from cord blood, venous blood, bone marrow and thymus are included in this experiment. This ArrayExpress record contains only meta-data. Raw data files have been archived at the European Genome-Phenome Archive (EGA, www.ebi.ac.uk/ega) by the consortium, with restricted access to protect sample donors' identity. There are 32 EGA data set accessions, which can be found under the Comment[EGA_DATA_SET] column in the 'Sample Data Relationship Format' (SDRF) file of this ArrayExpress record (http://www.ebi.ac.uk/arrayexpress/files/E-MTAB-3827/E-MTAB-3827.sdrf.txt). Details on how to apply for data access via the BLUEPRINT data access committee are on the EGA data set pages. Likewise, mapping of samples to these EGA accessions can be found in the SDRF file. Please note that the raw data files for 11 sequencing runs have yet been deposited at EGA, so they are marked with \\ot available\\ under the Comment[SUBMITTED_FILE_NAME] field in the SDRF file, and were included for the sake of completeness. Further iInformation on individual samples and sequencing libraries can also be found on the BLUEPRINT data coordination centre (DCC) website: http://dcc.blueprint-epigenome.eu\
Project description:One major task in the post-genome era is to reconstruct proteomic and genomic interacting networks using high-throughput experiment data. To identify essential nodes/hubs in these interactomes is a way to decipher the critical keys inside biochemical pathways or complex networks. These essential nodes/hubs may serve as potential drug-targets for developing novel therapy of human diseases, such as cancer or infectious disease caused by emerging pathogens. Hub Objects Analyzer (Hubba) is a web-based service for exploring important nodes in an interactome network generated from specific small- or large-scale experimental methods based on graph theory. Two characteristic analysis algorithms, Maximum Neighborhood Component (MNC) and Density of Maximum Neighborhood Component (DMNC) are developed for exploring and identifying hubs/essential nodes from interactome networks. Users can submit their own interaction data in PSI format (Proteomics Standards Initiative, version 2.5 and 1.0), tab format and tab with weight values. User will get an email notification of the calculation complete in minutes or hours, depending on the size of submitted dataset. Hubba result includes a rank given by a composite index, a manifest graph of network to show the relationship amid these hubs, and links for retrieving output files. This proposed method (DMNC || MNC) can be applied to discover some unrecognized hubs from previous dataset. For example, most of the Hubba high-ranked hubs (80% in top 10 hub list, and >70% in top 40 hub list) from the yeast protein interactome data (Y2H experiment) are reported as essential proteins. Since the analysis methods of Hubba are based on topology, it can also be used on other kinds of networks to explore the essential nodes, like networks in yeast, rat, mouse and human. The website of Hubba is freely available at http://hub.iis.sinica.edu.tw/Hubba.
Project description:JCVI Metagenomics Reports (METAREP) is a Web 2.0 application designed to help scientists analyze and compare annotated metagenomics datasets. It utilizes Solr/Lucene, a high-performance scalable search engine, to quickly query large data collections. Furthermore, users can use its SQL-like query syntax to filter and refine datasets. METAREP provides graphical summaries for top taxonomic and functional classifications as well as a GO, NCBI Taxonomy and KEGG Pathway Browser. Users can compare absolute and relative counts of multiple datasets at various functional and taxonomic levels. Advanced comparative features comprise statistical tests as well as multidimensional scaling, heatmap and hierarchical clustering plots. Summaries can be exported as tab-delimited files, publication quality plots in PDF format. A data management layer allows collaborative data analysis and result sharing.Web site http://www.jcvi.org/metarep; source code http://github.com/jcvi/METAREP CONTACT: firstname.lastname@example.orgSupplementary data are available at Bioinformatics online.
Project description:BACKGROUND:Systems biologists study interaction data to understand the behaviour of whole cell systems, and their environment, at a molecular level. In order to effectively achieve this goal, it is critical that researchers have high quality interaction datasets available to them, in a standard data format, and also a suite of tools with which to analyse such data and form experimentally testable hypotheses from them. The PSI-MI XML standard interchange format was initially published in 2004, and expanded in 2007 to enable the download and interchange of molecular interaction data. PSI-XML2.5 was designed to describe experimental data and to date has fulfilled this basic requirement. However, new use cases have arisen that the format cannot properly accommodate. These include data abstracted from more than one publication such as allosteric/cooperative interactions and protein complexes, dynamic interactions and the need to link kinetic and affinity data to specific mutational changes. RESULTS:The Molecular Interaction workgroup of the HUPO-PSI has extended the existing, well-used XML interchange format for molecular interaction data to meet new use cases and enable the capture of new data types, following extensive community consultation. PSI-MI XML3.0 expands the capabilities of the format beyond simple experimental data, with a concomitant update of the tool suite which serves this format. The format has been implemented by key data producers such as the International Molecular Exchange (IMEx) Consortium of protein interaction databases and the Complex Portal. CONCLUSIONS:PSI-MI XML3.0 has been developed by the data producers, data users, tool developers and database providers who constitute the PSI-MI workgroup. This group now actively supports PSI-MI XML2.5 as the main interchange format for experimental data, PSI-MI XML3.0 which additionally handles more complex data types, and the simpler, tab-delimited MITAB2.5, 2.6 and 2.7 for rapid parsing and download.