HPC-CLUST: distributed hierarchical clustering for large sets of nucleotide sequences.
ABSTRACT: Nucleotide sequence data are being produced at an ever increasing rate. Clustering such sequences by similarity is often an essential first step in their analysis, intended to reduce redundancy, define gene families or suggest taxonomic units. Exact clustering algorithms, such as hierarchical clustering, scale relatively poorly in terms of run time and memory usage, yet they are desirable because heuristic shortcuts taken during clustering might have unintended consequences in later analysis steps. Here we present HPC-CLUST, a highly optimized software pipeline that can cluster large numbers of pre-aligned DNA sequences by running on distributed computing hardware. It allocates both memory and computing resources efficiently, and can process more than a million sequences in a few hours on a small cluster. Source code and binaries are freely available at http://meringlab.org/software/hpc-clust/; the pipeline is implemented in C++ and uses the Message Passing Interface (MPI) standard for distributed computing.
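To make the clustering step concrete, the following is a minimal single-threaded sketch of single-linkage hierarchical clustering of pre-aligned sequences by mismatch distance; HPC-CLUST distributes this computation over MPI, and the function names and cutoff here are illustrative, not HPC-CLUST's actual API.

```python
# Minimal sketch of single-linkage hierarchical clustering of pre-aligned
# sequences (illustrative only; HPC-CLUST parallelizes this over MPI).

def dist(a, b):
    """Fraction of mismatching positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def single_linkage(seqs, cutoff):
    """Merge clusters while the closest pair is within `cutoff` distance."""
    clusters = [[i] for i in range(len(seqs))]
    while len(clusters) > 1:
        # closest pair of clusters (single linkage: minimum pairwise
        # distance between any two members)
        d, a, b = min(
            (dist(seqs[i], seqs[j]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
            for i in clusters[a] for j in clusters[b]
        )
        if d > cutoff:
            break
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

seqs = ["ACGTACGT", "ACGTACGA", "TTTTACGT", "TTTTACGA"]
print(single_linkage(seqs, cutoff=0.2))  # → [[0, 1], [2, 3]]
```

The quadratic pairwise-distance step is what dominates run time and memory, which is why distributing it across nodes pays off at the million-sequence scale.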
Project description:SUMMARY:Here we introduce a Fiji plugin utilizing the HPC-as-a-Service concept, significantly mitigating the challenges life scientists face when delegating complex data-intensive processing workflows to HPC clusters. We demonstrate on a common Selective Plane Illumination Microscopy image processing task that execution of a Fiji workflow on a remote supercomputer leads to improved turnaround time despite the data transfer overhead. The plugin allows the end users to conveniently transfer image data to remote HPC resources, manage pipeline jobs and visualize processed results directly from the Fiji graphical user interface. AVAILABILITY AND IMPLEMENTATION:The code is distributed free and open source under the MIT license. Source code: https://github.com/fiji-hpc/hpc-workflow-manager/, documentation: https://imagej.net/SPIM_Workflow_Manager_For_HPC. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.
Project description:The upcoming exascale era will push the changes in computing architecture from classical CPU-based systems towards hybrid GPU-heavy systems with much higher levels of complexity. While such clusters are expected to improve the performance of certain optimized HPC applications, they will also increase the difficulties for those users who have yet to adapt their codes or are starting from scratch with new programming paradigms. Since there are still no comprehensive automatic assistance mechanisms to enhance application performance on such systems, we propose a support framework for future HPC architectures, called EASEY (Enable exAScale for EverYone). Our solution builds on a layered software architecture, which offers different mechanisms on each layer for different tasks of tuning, including a workflow management system. This enables users to adjust the parameters on each of the layers, thereby enhancing specific characteristics of their codes. We introduce the framework with a Charliecloud-based solution, showcasing the LULESH benchmark on the upper layers of our framework. Our approach can automatically deploy optimized container computations with negligible overhead and at the same time reduce the time a scientist needs to spend on manual job submission configurations.
Project description:Biomedical data are quickly growing in volume and in variety, providing clinicians an opportunity for better clinical decision support. Here, we demonstrate a robust platform that uses software automation and high performance computing (HPC) resources to achieve real-time analytics of clinical data, specifically magnetic resonance imaging (MRI) data. We used the Agave application programming interface to facilitate communication, data transfer, and job control between an MRI scanner and an off-site HPC resource. In this use case, Agave executed the graphical pipeline tool GRAphical Pipeline Environment (GRAPE) to perform automated, real-time, quantitative analysis of MRI scans. Same-session image processing will open the door for adaptive scanning and real-time quality control, potentially accelerating the discovery of pathologies and minimizing patient callbacks. We envision this platform can be adapted to other medical instruments, HPC resources, and analytics tools.
Project description:BACKGROUND:The advent of Next Generation Sequencing (NGS) technologies and the concomitant reduction in sequencing costs allows unprecedented high throughput profiling of biological systems in a cost-efficient manner. Modern biological experiments are increasingly becoming both data and computationally intensive, and the wealth of publicly available biological data is introducing bioinformatics into the "Big Data" era. For these reasons, the effective application of High Performance Computing (HPC) architectures is becoming progressively more widely recognized by bioinformaticians. Here we describe HPC resources provisioning pilot programs dedicated to bioinformaticians, run by the Italian Node of ELIXIR (ELIXIR-IT) in collaboration with CINECA, the main Italian supercomputing center. RESULTS:Starting from April 2016, CINECA and ELIXIR-IT launched the pilot Call "ELIXIR-IT HPC@CINECA", offering streamlined access to HPC resources for bioinformatics. Resources are made available either through web front-ends to dedicated workflows developed at CINECA or by providing direct access to the High Performance Computing systems through a standard command-line interface tailored for bioinformatics data analysis. This allows us to offer the biomedical research community a production-scale environment, continuously updated with the latest available versions of publicly available reference datasets and bioinformatic tools. Currently, 63 research projects have gained access to the HPC@CINECA program, for a total handout of ~8 million CPU hours and, for data storage, ~100 TB of permanent and ~300 TB of temporary space. CONCLUSIONS:Three years after the beginning of the ELIXIR-IT HPC@CINECA program, we can appreciate its impact on the Italian bioinformatics community and draw some considerations. Several Italian researchers who applied to the program have gained access to one of the top-ranking public scientific supercomputing facilities in Europe.
Those investigators had the opportunity to substantially reduce computational turnaround times in their research projects and to process massive amounts of data, pursuing research approaches that would otherwise have been difficult or impossible to undertake. Moreover, by taking advantage of the wealth of documentation and training material provided by CINECA, participants had the opportunity to improve their skills in the usage of HPC systems and be better positioned to apply to similar EU programs of greater scale, such as PRACE. To illustrate the effective usage and impact of the resources awarded by the program - in different research applications - we report five successful use cases, which have already published their findings in peer-reviewed journals.
Project description:Transposable elements (TEs) are non-static genomic units capable of moving indiscriminately from one chromosomal location to another. Their insertion polymorphisms may cause beneficial mutations, such as the creation of new gene functions, or deleterious ones in eukaryotes, e.g., different types of cancer in humans. A particular type of TE called LTR retrotransposons comprises almost 8% of the human genome. Among LTR retrotransposons, human endogenous retroviruses (HERVs) bear structural and functional similarities to retroviruses. Several tools allow the detection of transposon insertion polymorphisms (TIPs) but fail to efficiently analyze large genomes or large datasets. Here, we developed a computational tool, named TIP_finder, able to detect mobile element insertions in very large genomes through high-performance computing (HPC) and parallel programming, using the inference of discordant read-pair analysis. TIP_finder inputs are (i) short paired reads such as those obtained by Illumina, (ii) a chromosome-level reference genome sequence, and (iii) a database of consensus TE sequences. The HPC strategy we propose adds scalability and provides a useful tool to analyze huge genomic datasets in reasonable running time. TIP_finder accelerates the detection of TIPs by up to 55 times in breast cancer datasets and 46 times in cancer-free datasets compared to the fastest available algorithms. TIP_finder applies a validated strategy to find TIPs, accelerates the process through HPC, and addresses the issue of runtime for large-scale analyses in the post-genomic era. TIP_finder version 1.0 is available at https://github.com/simonorozcoarias/TIP_finder.
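The core of discordant read-pair inference can be sketched in a few lines: a pair whose mates map to different references (or implausibly far apart) is discordant, and a cluster of genomic anchors whose mates hit a TE consensus suggests an insertion site. This toy version uses illustrative names and thresholds, not TIP_finder's actual code or parameters.

```python
# Toy sketch of discordant-read-pair TE-insertion inference, the idea
# TIP_finder parallelizes over HPC. Names and thresholds are illustrative.
from collections import namedtuple

Pair = namedtuple("Pair", "chrom1 pos1 chrom2 pos2")

def is_discordant(pair, max_insert=1000):
    """A pair is discordant if its mates map to different references
    or the apparent insert size exceeds the library expectation."""
    return pair.chrom1 != pair.chrom2 or abs(pair.pos2 - pair.pos1) > max_insert

def candidate_insertions(pairs, te_ref="TE_consensus", window=500):
    """Cluster genomic anchors of pairs whose mate hits the TE consensus;
    each anchor cluster suggests one candidate TE insertion site."""
    anchors = sorted(p.pos1 for p in pairs
                     if p.chrom2 == te_ref and is_discordant(p))
    sites, current = [], []
    for pos in anchors:
        if current and pos - current[-1] > window:
            sites.append(sum(current) // len(current))  # cluster midpoint
            current = []
        current.append(pos)
    if current:
        sites.append(sum(current) // len(current))
    return sites
```

In practice the anchors come from aligned BAM records rather than tuples, and scanning billions of pairs per genome is what motivates the parallel HPC strategy.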
Project description:BACKGROUND:RNA editing is a widespread co-/post-transcriptional mechanism that alters primary RNA sequences through the modification of specific nucleotides, and it can increase both transcriptome and proteome diversity. The automatic detection of RNA editing from RNA-seq data is computationally intensive and limited to small data sets, thus preventing a reliable genome-wide characterisation of such a process. RESULTS:In this work we introduce HPC-REDItools, an upgraded tool for accurate discovery of RNA-editing events from large dataset repositories. AVAILABILITY:https://github.com/BioinfoUNIBA/REDItools2. CONCLUSIONS:HPC-REDItools is dramatically faster than the previous version, REDItools, enabling big-data analysis by means of an MPI-based implementation and scaling almost linearly with the number of available cores.
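The per-site test behind RNA-editing detection is simple even though running it genome-wide is expensive: at each genomic adenosine, count RNA-seq base calls and flag sites with a sufficient A-to-G mismatch frequency (the hallmark of ADAR editing). The field layout and thresholds below are assumptions for illustration, not REDItools' actual output format.

```python
# Illustrative sketch of RNA-editing candidate detection from per-site
# base counts; coverage/frequency thresholds are assumed values.

def editing_candidates(sites, min_cov=10, min_freq=0.1):
    """Flag positions where RNA reads show A->G mismatches against a
    genomic A. `sites` is a list of (position, ref_base, base_counts)."""
    out = []
    for pos, ref, counts in sites:  # counts: dict base -> read count
        cov = sum(counts.values())
        if ref == "A" and cov >= min_cov:
            freq = counts.get("G", 0) / cov
            if freq >= min_freq:
                out.append((pos, round(freq, 2)))
    return out
```

Because every site is tested independently, the genome partitions naturally across MPI ranks, which is what lets HPC-REDItools scale almost linearly with core count.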
Project description:BACKGROUND:With the advent of high-throughput sequencing, microbiology is becoming increasingly data-intensive. Because of its low cost, robust databases, and established bioinformatic workflows, sequencing of 16S/18S/ITS ribosomal RNA (rRNA) gene amplicons, which provides a marker of choice for phylogenetic studies, has become ubiquitous. Many established end-to-end bioinformatic pipelines are available to perform short amplicon sequence data analysis. These pipelines suit a general audience, but few options exist for more specialized users who are experienced in code scripting, Linux-based systems, and high-performance computing (HPC) environments. For such an audience, existing pipelines can be limiting, preventing users from fully leveraging modern HPC capabilities and from tweaking and optimizing their workflows. Moreover, a wealth of stand-alone software packages that perform specific targeted bioinformatic tasks are increasingly accessible, and finding a way to easily integrate these applications in a pipeline is critical to the evolution of bioinformatic methodologies. RESULTS:Here we describe AmpliconTagger, a short rRNA marker gene amplicon pipeline coded in a Python framework that enables fine tuning and integration of virtually any potential rRNA gene amplicon bioinformatic procedure. It is designed to work within an HPC environment, supporting a complex network of job dependencies with a smart-restart mechanism in case of job failure or parameter modifications. As proof of concept, we present end results obtained with AmpliconTagger using 16S, 18S, ITS rRNA short gene amplicons and Pacific Biosciences long-read amplicon data types as input. CONCLUSIONS:Using a selection of published algorithms for generating operational taxonomic units and amplicon sequence variants and for computing downstream taxonomic summaries and diversity metrics, we demonstrate the performance and versatility of our pipeline for systematic analyses of amplicon sequence data.
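A job-dependency network with smart restart can be sketched generically: run steps in dependency order, skipping those already recorded as complete from a previous run, so a failure (or a parameter change that invalidates only some steps) does not force redoing finished work. This is an illustration of the concept, not AmpliconTagger's actual scheduler.

```python
# Generic sketch of dependency-aware execution with a smart-restart
# mechanism: steps in `done` (e.g. loaded from a previous run's state
# file) are skipped; everything else runs once its dependencies finish.

def run_pipeline(steps, deps, done, runner):
    """Execute `steps` respecting `deps` (step -> list of prerequisites),
    skipping completed steps. Returns the steps actually executed."""
    executed = []
    remaining = [s for s in steps if s not in done]
    while remaining:
        for step in remaining:
            if all(d in done for d in deps.get(step, [])):
                runner(step)          # e.g. submit to the HPC scheduler
                done.add(step)
                executed.append(step)
                remaining.remove(step)
                break
        else:
            raise RuntimeError("circular or unsatisfiable dependencies")
    return executed
```

In a real HPC setting `runner` would submit jobs to the batch system and `done` would persist on disk between invocations, so restarting the pipeline after a node failure resumes exactly where it stopped.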
Project description:Machine learning algorithms are becoming more and more useful in many fields of science, including many areas where computational methods are rarely used. High-performance Computing (HPC) is the most powerful solution to get the best results using these algorithms. HPC requires various skills to use. Acquiring this knowledge might be intimidating and take a long time for a researcher with little or no background in information and communications technologies (ICTs), even if the benefits of such knowledge are evident for the researcher. In this work, we aim to assess how a specific method of introducing HPC to such researchers enables them to start using HPC. We gave talks to two groups of non-ICT researchers that introduced basic concepts focusing on the necessary practical steps needed to use HPC on a specific cluster. We also offered hands-on trainings for one of the groups, which aimed to guide participants through the first steps of using HPC. Participants filled out questionnaires partly based on Kirkpatrick’s training evaluation model before and after the talk, and after the hands-on training. We found that the talk increased participants’ self-reported likelihood of using HPC in their future research, but this was not significant for the group where participation was voluntary. In contrast, very few researchers participated in the hands-on training, and for these participants neither the talk nor the hands-on training changed their self-reported likelihood of using HPC in their future research. We argue that our findings show that academia and researchers would benefit from an environment that not only expects researchers to train themselves, but provides structural support for acquiring new skills. Electronic supplementary material: The online version of this article (10.1007/s11227-020-03438-0) contains supplementary material, which is available to authorised users.
Project description:With the maturation of next-generation DNA sequencing (NGS) technologies, the throughput of DNA sequencing reads has soared to over 600 gigabases from a single instrument run. General-purpose computing on graphics processing units (GPGPU) extracts the computing power from hundreds of parallel stream processors within graphics processing cores and provides a cost-effective and energy-efficient alternative to traditional high-performance computing (HPC) clusters. In this article, we describe the implementation of BarraCUDA, a GPGPU sequence alignment software based on BWA, to accelerate the alignment of sequencing reads generated by these instruments to a reference DNA sequence. Using the NVIDIA Compute Unified Device Architecture (CUDA) software development environment, we ported the most computationally intensive alignment component of BWA to the GPU to take advantage of its massive parallelism. As a result, BarraCUDA offers an order-of-magnitude performance boost in alignment throughput when compared to a CPU core while delivering the same level of alignment fidelity. The software is also capable of supporting multiple CUDA devices in parallel to further accelerate the alignment throughput. BarraCUDA is designed to take advantage of the parallelism of the GPU to accelerate the alignment of millions of sequencing reads generated by NGS instruments. By doing this, we could, at least in part, streamline the current bioinformatics pipeline so that the wider scientific community can benefit from the sequencing technology. BarraCUDA is currently available from http://seqbarracuda.sf.net.
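The reason read alignment maps so well onto GPUs is that each read is aligned independently, so thousands of reads can occupy thousands of stream processors at once. As a point of reference, here is a plain CPU Smith-Waterman local-alignment scorer for a single read; note this is a generic dynamic-programming illustration with made-up scoring values, not BWA's actual FM-index backward-search algorithm.

```python
# Plain single-read Smith-Waterman local alignment (score only), using a
# rolling one-row DP table. On a GPU, one such computation would run per
# read across thousands of threads. Scoring parameters are illustrative.

def smith_waterman(read, ref, match=2, mismatch=-1, gap=-2):
    """Return the best local-alignment score of `read` against `ref`."""
    cols = len(ref) + 1
    prev = [0] * cols   # DP row for the previous read position
    best = 0
    for i in range(1, len(read) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            s = match if read[i - 1] == ref[j - 1] else mismatch
            # local alignment: scores never drop below zero
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

For example, a 4-base read that occurs exactly inside the reference scores 4 matches at +2 each, i.e. 8, while a read with no similarity scores 0.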
Project description:Identifying co-expressed gene clusters can provide evidence for genetic or physical interactions. Thus, co-expression clustering is a routine step in large-scale analyses of gene expression data. We show that commonly used clustering methods produce results that substantially disagree and that do not match the biological expectations of co-expressed gene clusters. We present clust, a method that solves these problems by extracting clusters matching the biological expectations of co-expressed genes, and that outperforms widely used methods. Additionally, clust can simultaneously cluster multiple datasets, enabling users to leverage the large quantity of public expression data for novel comparative analysis. Clust is available at https://github.com/BaselAbujamous/clust.
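The underlying notion of a co-expressed cluster can be illustrated with a toy greedy grouping of genes by expression-profile correlation. clust's actual method (cluster extraction with outlier rejection, applied jointly across datasets) is considerably more sophisticated; everything below, including the correlation threshold, is a hypothetical simplification.

```python
# Toy co-expression clustering: greedily group genes whose expression
# profiles correlate strongly with a seed gene. Illustrative only.
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def greedy_coexpression(profiles, r_min=0.9):
    """Seed a cluster with each unassigned gene and absorb every gene
    whose profile correlates with the seed at r >= r_min."""
    genes = list(profiles)
    clusters, assigned = [], set()
    for g in genes:
        if g in assigned:
            continue
        members = [h for h in genes if h not in assigned
                   and pearson(profiles[g], profiles[h]) >= r_min]
        assigned.update(members)
        clusters.append(sorted(members))
    return clusters
```

Greedy seeding like this is order-dependent and admits no outliers, two of the weaknesses of simple methods that motivate clust's cluster-extraction approach.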