Applications of the pipeline environment for visual informatics and genomics computations.
ABSTRACT: Contemporary informatics and genomics research require efficient, flexible and robust management of large heterogeneous data, advanced computational tools, powerful visualization, reliable hardware infrastructure, interoperability of computational resources, and detailed data and analysis-protocol provenance. The Pipeline is a client-server distributed computational environment that facilitates the visual graphical construction, execution, monitoring, validation and dissemination of advanced data analysis protocols.This paper reports on the applications of the LONI Pipeline environment to address two informatics challenges - graphical management of diverse genomics tools, and the interoperability of informatics software. Specifically, this manuscript presents the concrete details of deploying general informatics suites and individual software tools to new hardware infrastructures, the design, validation and execution of new visual analysis protocols via the Pipeline graphical interface, and integration of diverse informatics tools via the Pipeline eXtensible Markup Language syntax. We demonstrate each of these processes using several established informatics packages (e.g., miBLAST, EMBOSS, mrFAST, GWASS, MAQ, SAMtools, Bowtie) for basic local sequence alignment and search, molecular biology data analysis, and genome-wide association studies. These examples demonstrate the power of the Pipeline graphical workflow environment to enable integration of bioinformatics resources which provide a well-defined syntax for dynamic specification of the input/output parameters and the run-time execution controls.The LONI Pipeline environment http://pipeline.loni.ucla.edu provides a flexible graphical infrastructure for efficient biomedical computing and distributed informatics research. The interactive Pipeline resource manager enables the utilization and interoperability of diverse types of informatics resources. The Pipeline client-server model provides computational power to a broad spectrum of informatics investigators--experienced developers and novice users, user with or without access to advanced computational-resources (e.g., Grid, data), as well as basic and translational scientists. The open development, validation and dissemination of computational networks (pipeline workflows) facilitates the sharing of knowledge, tools, protocols and best practices, and enables the unbiased validation and replication of scientific findings by the entire community.
Project description:The volume, diversity and velocity of biomedical data are exponentially increasing providing petabytes of new neuroimaging and genetics data every year. At the same time, tens-of-thousands of computational algorithms are developed and reported in the literature along with thousands of software tools and services. Users demand intuitive, quick and platform-agnostic access to data, software tools, and infrastructure from millions of hardware devices. This explosion of information, scientific techniques, computational models, and technological advances leads to enormous challenges in data analysis, evidence-based biomedical inference and reproducibility of findings. The Pipeline workflow environment provides a crowd-based distributed solution for consistent management of these heterogeneous resources. The Pipeline allows multiple (local) clients and (remote) servers to connect, exchange protocols, control the execution, monitor the states of different tools or hardware, and share complete protocols as portable XML workflows. In this paper, we demonstrate several advanced computational neuroimaging and genetics case-studies, and end-to-end pipeline solutions. These are implemented as graphical workflow protocols in the context of analyzing imaging (sMRI, fMRI, DTI), phenotypic (demographic, clinical), and genetic (SNP) data.
Project description:<h4>Background</h4>Modern omics research involves the application of high-throughput technologies that generate vast volumes of data. These data need to be pre-processed, analyzed and integrated with existing knowledge through the use of diverse sets of software tools, models and databases. The analyses are often interdependent and chained together to form complex workflows or pipelines. Given the volume of the data used and the multitude of computational resources available, specialized pipeline software is required to make high-throughput analysis of large-scale omics datasets feasible.<h4>Results</h4>We have developed a generic pipeline system called Cyrille2. The system is modular in design and consists of three functionally distinct parts: 1) a web based, graphical user interface (GUI) that enables a pipeline operator to manage the system; 2) the Scheduler, which forms the functional core of the system and which tracks what data enters the system and determines what jobs must be scheduled for execution, and; 3) the Executor, which searches for scheduled jobs and executes these on a compute cluster.<h4>Conclusion</h4>The Cyrille2 system is an extensible, modular system, implementing the stated requirements. Cyrille2 enables easy creation and execution of high throughput, flexible bioinformatics pipelines.
Project description:Informatics has become an essential component of research in the past few decades, capitalizing on the efficiency and power of computation to improve the knowledge gained from increasing quantities and types of data. While other fields of research such as genomics are well represented in informatics resources, nutrition remains underrepresented. Nutrition is one of the most integral components of human life, and it impacts individuals far beyond just nutrient provisions. For example, nutrition plays a role in cultural practices, interpersonal relationships and body image. Despite this, integrated computational investigations have been limited due to challenges within nutrition informatics (nutri-informatics) and nutrition data. The purpose of this review is to describe the landscape of nutri-informatics resources available for use in computational nutrition research and clinical utilization. In particular, we will focus on the application of biomedical ontologies and their potential to improve the standardization and interoperability of nutrition terminologies and relationships between nutrition and other biomedical disciplines such as disease and phenomics. Additionally, we will highlight challenges currently faced by the nutri-informatics community including experimental design, data aggregation and the roles scientific journals and primary nutrition researchers play in facilitating data reuse and successful computational research. Finally, we will conclude with a call to action to create and follow community standards regarding standardization of language, documentation specifications and requirements for data reuse. With the continued movement toward community standards of this kind, the entire nutrition research community can transition toward greater usage of Findability, Accessibility, Interoperability and Reusability principles and in turn more transparent science.
Project description:BACKGROUND: High-throughput Next-Generation Sequencing (NGS) techniques are advancing genomics and molecular biology research. This technology generates substantially large data which puts up a major challenge to the scientists for an efficient, cost and time effective solution to analyse such data. Further, for the different types of NGS data, there are certain common challenging steps involved in analysing those data. Spliced alignment is one such fundamental step in NGS data analysis which is extremely computational intensive as well as time consuming. There exists serious problem even with the most widely used spliced alignment tools. TopHat is one such widely used spliced alignment tools which although supports multithreading, does not efficiently utilize computational resources in terms of CPU utilization and memory. Here we have introduced PVT (Pipelined Version of TopHat) where we take up a modular approach by breaking TopHat's serial execution into a pipeline of multiple stages, thereby increasing the degree of parallelization and computational resource utilization. Thus we address the discrepancies in TopHat so as to analyze large NGS data efficiently. RESULTS: We analysed the SRA dataset (SRX026839 and SRX026838) consisting of single end reads and SRA data SRR1027730 consisting of paired-end reads. We used TopHat v2.0.8 to analyse these datasets and noted the CPU usage, memory footprint and execution time during spliced alignment. With this basic information, we designed PVT, a pipelined version of TopHat that removes the redundant computational steps during 'spliced alignment' and breaks the job into a pipeline of multiple stages (each comprising of different step(s)) to improve its resource utilization, thus reducing the execution time. CONCLUSIONS: PVT provides an improvement over TopHat for spliced alignment of NGS data analysis. PVT thus resulted in the reduction of the execution time to ~23% for the single end read dataset. Further, PVT designed for paired end reads showed an improved performance of ~41% over TopHat (for the chosen data) with respect to execution time. Moreover we propose PVT-Cloud which implements PVT pipeline in cloud computing system.
Project description:PURPOSE:We present a platform, GRAphical Pipeline Environment (GRAPE), to facilitate the development of patient-adaptive magnetic resonance imaging (MRI) protocols. METHODS:GRAPE is an open-source project implemented in the Qt C++ framework to enable graphical creation, execution, and debugging of real-time image analysis algorithms integrated with the MRI scanner. The platform provides the tools and infrastructure to design new algorithms, and build and execute an array of image analysis routines, and provides a mechanism to include existing analysis libraries, all within a graphical environment. The application of GRAPE is demonstrated in multiple MRI applications, and the software is described in detail for both the user and the developer. RESULTS:GRAPE was successfully used to implement and execute three applications in MRI of the brain, performed on a 3.0-T MRI scanner: (i) a multi-parametric pipeline for segmenting the brain tissue and detecting lesions in multiple sclerosis (MS), (ii) patient-specific optimization of the 3D fluid-attenuated inversion recovery MRI scan parameters to enhance the contrast of brain lesions in MS, and (iii) an algebraic image method for combining two MR images for improved lesion contrast. CONCLUSIONS:GRAPE allows graphical development and execution of image analysis algorithms for inline, real-time, and adaptive MRI applications.
Project description:Access to streamlined computational resources remains a significant bottleneck for new users of cryo-electron microscopy (cryo-EM). To address this, we have developed tools that will submit cryo-EM analysis routines and atomic model building jobs directly to Amazon Web Services (AWS) from a local computer or laptop. These new software tools ("cryoem-cloud-tools") have incorporated optimal data movement, security, and cost-saving strategies, giving novice users access to complex cryo-EM data processing pipelines. Integrating these tools into the RELION processing pipeline and graphical user interface we determined a 2.2?Å structure of ß-galactosidase in ?55?h on AWS. We implemented a similar strategy to submit Rosetta atomic model building and refinement to AWS. These software tools dramatically reduce the barrier for entry of new users to cloud computing for cryo-EM and are freely available at cryoem-tools.cloud.
Project description:HL7 Fast Healthcare Interoperability Resources (FHIR) is an emerging open standard for the exchange of electronic healthcare information. FHIR resources are defined in a specialized modeling language. FHIR instances can currently be represented in either XML or JSON. The FHIR and Semantic Web communities are developing a third FHIR instance representation format in Resource Description Framework (RDF). Shape Expressions (ShEx), a formal RDF data constraint language, is a candidate for describing and validating the FHIR RDF representation.Create a FHIR to ShEx model transformation and assess its ability to describe and validate FHIR RDF data.We created the methods and tools that generate the ShEx schemas modeling the FHIR to RDF specification being developed by HL7 ITS/W3C RDF Task Force, and evaluated the applicability of ShEx in the description and validation of FHIR to RDF transformations.The ShEx models contributed significantly to workgroup consensus. Algorithmic transformations from the FHIR model to ShEx schemas and FHIR example data to RDF transformations were incorporated into the FHIR build process. ShEx schemas representing 109 FHIR resources were used to validate 511 FHIR RDF data examples from the Standards for Trial Use (STU 3) Ballot version. We were able to uncover unresolved issues in the FHIR to RDF specification and detect 10 types of errors and root causes in the actual implementation. The FHIR ShEx representations have been included in the official FHIR web pages for the STU 3 Ballot version since September 2016.ShEx can be used to define and validate the syntax of a FHIR resource, which is complementary to the use of RDF Schema (RDFS) and Web Ontology Language (OWL) for semantic validation.ShEx proved useful for describing a standard model of FHIR RDF data. The combination of a formal model and a succinct format enabled comprehensive review and automated validation.
Project description:The Drug Disease Model Resources (DDMoRe) Interoperability Framework (IOF) enables pharmacometric model encoding and execution via Model Description Language (MDL) and R language, through the ddmore package. Through its components and converter plugins, the IOF can execute pharmacometric tasks using different target tools, starting from a single MDL-encoded model. In this article, we present the WinBUGS plugin and show how its integration in the IOF enables an easy implementation of complex Bayesian workflows. We selected a published diabetes-linked study as a real-world example, in which two inter-related models are used to estimate insulin secretion rate in response to a glucose stimulus from intravenous glucose tolerance test (IVGTT) data. This model was implemented following different approaches to propagate uncertainty, via diverse IOF target tools (NONMEM, WinBUGS, PsN, and Xpose). The developed software supports a plethora of pharmacokinetic/pharmacodynamic (PK/PD) modeling features. It provides solutions to reproducibility and interoperability issues in Bayesian modeling, and facilitates the difficult encoding of complex PK/PD models in WinBUGS.
Project description:<h4>Background</h4>Current high-throughput technologies-i.e. whole genome sequencing, RNA-Seq, ChIP-Seq, etc.-generate huge amounts of data and their usage gets more widespread with each passing year. Complex analysis pipelines involving several computationally-intensive steps have to be applied on an increasing number of samples. Workflow management systems allow parallelization and a more efficient usage of computational power. Nevertheless, this mostly happens by assigning the available cores to a single or few samples' pipeline at a time. We refer to this approach as naive parallel strategy (NPS). Here, we discuss an alternative approach, which we refer to as concurrent execution strategy (CES), which equally distributes the available processors across every sample's pipeline.<h4>Results</h4>Theoretically, we show that the CES results, under loose conditions, in a substantial speedup, with an ideal gain range spanning from 1 to the number of samples. Also, we observe that the CES yields even faster executions since parallelly computable tasks scale sub-linearly. Practically, we tested both strategies on a whole exome sequencing pipeline applied to three publicly available matched tumour-normal sample pairs of gastrointestinal stromal tumour. The CES achieved speedups in latency up to 2-2.4 compared to the NPS.<h4>Conclusions</h4>Our results hint that if resources distribution is further tailored to fit specific situations, an even greater gain in performance of multiple samples pipelines execution could be achieved. For this to be feasible, a benchmarking of the tools included in the pipeline would be necessary. It is our opinion these benchmarks should be consistently performed by the tools' developers. Finally, these results suggest that concurrent strategies might also lead to energy and cost savings by making feasible the usage of low power machine clusters.
Project description:<h4>Unlabelled</h4>Researchers are perpetually amassing biological sequence data. The computational approaches employed by ecologists for organizing this data (e.g. alignment, phylogeny, etc.) typically scale nonlinearly in execution time with the size of the dataset. This often serves as a bottleneck for processing experimental data since many molecular studies are characterized by massive datasets. To keep up with experimental data demands, ecologists are forced to choose between continually upgrading expensive in-house computer hardware or outsourcing the most demanding computations to the cloud. Outsourcing is attractive since it is the least expensive option, but does not necessarily allow direct user interaction with the data for exploratory analysis. Desktop analytical tools such as ARB are indispensable for this purpose, but they do not necessarily offer a convenient solution for the coordination and integration of datasets between local and outsourced destinations. Therefore, researchers are currently left with an undesirable tradeoff between computational throughput and analytical capability. To mitigate this tradeoff we introduce a software package to leverage the utility of the interactive exploratory tools offered by ARB with the computational throughput of cloud-based resources. Our pipeline serves as middleware between the desktop and the cloud allowing researchers to form local custom databases containing sequences and metadata from multiple resources and a method for linking data outsourced for computation back to the local database. A tutorial implementation of the toolkit is provided in the supporting information, S1 Tutorial.<h4>Availability</h4>http://www.ece.drexel.edu/gailr/EESI/tutorial.php.