Project description:MotivationGenome-wide association study (GWAS) has been a great success in the past decade. However, significant challenges still remain in both identifying new risk loci and interpreting results. Bonferroni-corrected significance level is known to be conservative, leading to insufficient statistical power when the effect size is moderate at risk locus. Complex structure of linkage disequilibrium also makes it challenging to separate causal variants from nonfunctional ones in large haplotype blocks. Under such circumstances, a computational approach that may increase signal replication rate and identify potential functional sites among correlated markers is urgently needed.ResultsWe describe GenoWAP, a GWAS signal prioritization method that integrates genomic functional annotation and GWAS test statistics. The effectiveness of GenoWAP is demonstrated through its applications to Crohn's disease and schizophrenia using the largest studies available, where highly ranked loci show substantially stronger signals in the whole dataset after prioritization based on a subset of samples. At the single nucleotide polymorphism (SNP) level, top ranked SNPs after prioritization have both higher replication rates and consistently stronger enrichment of eQTLs. Within each risk locus, GenoWAP may be able to distinguish functional sites from groups of correlated SNPs.Availability and implementationGenoWAP is freely available on the web at http://genocanyon.med.yale.edu/GenoWAP.
Project description:SummaryAs the importance of microbiome research continues to become more prevalent and essential to understanding a wide variety of ecosystems (e.g. marine, built, host associated, etc.), there is a need for researchers to be able to perform highly reproducible and quality analysis of microbial genomes. MetaSanity incorporates analyses from 11 existing and widely used genome evaluation and annotation suites into a single, distributable workflow, thereby decreasing the workload of microbiologists by allowing for a flexible, expansive data analysis pipeline. MetaSanity has been designed to provide separate, reproducible workflows that (i) can determine the overall quality of a microbial genome, while providing a putative phylogenetic assignment, and (ii) can assign structural and functional gene annotations with varying degrees of specificity to suit the needs of the researcher. The software suite combines the results from several tools to provide broad insights into overall metabolic function. Importantly, this software provides built-in optimization for 'big data' analysis by storing all relevant outputs in an SQL database, allowing users to query all the results for the elements that will most impact their research.Availability and implementationMetaSanity is provided under the GNU General Public License v.3.0 and is available for download at https://github.com/cjneely10/MetaSanity. This application is distributed as a Docker image. MetaSanity is implemented in Python3/Cython and C++. Instructions for its installation and use are available within the GitHub wiki page at https://github.com/cjneely10/MetaSanity/wiki, and additional instructions are available at https://cjneely10.github.io/year-archive/. MetaSanity is optimized for users with limited programing experience.Supplementary informationSupplementary data are available at Bioinformatics online.
Project description:BackgroundThe biological interpretation of large-scale gene expression data is one of the paramount challenges in current bioinformatics. In particular, placing the results in the context of other available functional genomics data, such as existing bio-ontologies, has already provided substantial improvement for detecting and categorizing genes of interest. One common approach is to look for functional annotations that are significantly enriched within a group or cluster of genes, as compared to a reference group.ResultsIn this work, we suggest the information-theoretic concept of mutual information to investigate the relationship between groups of genes, as given by data-driven clustering, and their respective functional categories. Drawing upon related approaches (Gibbons and Roth, Genome Research 12:1574-1581, 2002), we seek to quantify to what extent individual attributes are sufficient to characterize a given group or cluster of genes.ConclusionWe show that the mutual information provides a systematic framework to assess the relationship between groups or clusters of genes and their functional annotations in a quantitative way. Within this framework, the mutual information allows us to address and incorporate several important issues, such as the interdependence of functional annotations and combinatorial combinations of attributes. It thus supplements and extends the conventional search for overrepresented attributes within a group or cluster of genes. In particular taking combinations of attributes into account, the mutual information opens the way to uncover specific functional descriptions of a group of genes or clustering result. All datasets and functional annotations used in this study are publicly available. All scripts used in the analysis are provided as additional files.
Project description:Here, we describe MetaErg, a standalone and fully automated metagenome and metaproteome annotation pipeline. Annotation of metagenomes is challenging. First, metagenomes contain sequence data of many organisms from all domains of life. Second, many of these are from understudied lineages, encoding genes with low similarity to experimentally validated reference genes. Third, assembly and binning are not perfect, sometimes resulting in artifactual hybrid contigs or genomes. To address these challenges, MetaErg provides graphical summaries of annotation outcomes, both for the complete metagenome and for individual metagenome-assembled genomes (MAGs). It performs a comprehensive annotation of each gene, including taxonomic classification, enabling functional inferences despite low similarity to reference genes, as well as detection of potential assembly or binning artifacts. When provided with metaproteome information, it visualizes gene and pathway activity using sequencing coverage and proteomic spectral counts, respectively. For visualization, MetaErg provides an HTML interface, bringing all annotation results together, and producing sortable and searchable tables, collapsible trees, and other graphic representations enabling intuitive navigation of complex data. MetaErg, implemented in Perl, HTML, and JavaScript, is a fully open source application, distributed under Academic Free License at https://github.com/xiaoli-dong/metaerg. MetaErg is also available as a docker image at https://hub.docker.com/r/xiaolidong/docker-metaerg.
Project description:SummaryTo address the need for improved phage annotation tools that scale, we created an automated throughput annotation pipeline: multiple-genome Phage Annotation Toolkit and Evaluator (multiPhATE). multiPhATE is a throughput pipeline driver that invokes an annotation pipeline (PhATE) across a user-specified set of phage genomes. This tool incorporates a de novo phage gene calling algorithm and assigns putative functions to gene calls using protein-, virus- and phage-centric databases. multiPhATE's modular construction allows the user to implement all or any portion of the analyses by acquiring local instances of the desired databases and specifying the desired analyses in a configuration file. We demonstrate multiPhATE by annotating two newly sequenced Yersinia pestis phage genomes. Within multiPhATE, the PhATE processing pipeline can be readily implemented across multiple processors, making it adaptable for throughput sequencing projects. Software documentation assists the user in configuring the system.Availability and implementationmultiPhATE was implemented in Python 3.7, and runs as a command-line code under Linux or Unix. multiPhATE is freely available under an open-source BSD3 license from https://github.com/carolzhou/multiPhATE. Instructions for acquiring the databases and third-party codes used by multiPhATE are included in the distribution README file. Users may report bugs by submitting to the github issues page associated with the multiPhATE distribution.Supplementary informationSupplementary data are available at Bioinformatics online.
Project description:Germline Variants (GVs) are effective in predicting cancer risk and may be relevant in predicting patient outcomes. Here we provide a bioinformatic pipeline to identify GVs from the TCGA lower grade glioma cohort in Genomics Data Commons. We integrate paired whole exome sequences from normal and tumor samples and RNA sequences from tumor samples to determine a patient's GV status. We then identify the subset of GVs that are predictive of patient outcomes by Cox regression. For complete details on the use and execution of this protocol, please refer to Chatrath et al. (2019) and Chatrath et al. (2020).
Project description:Despite the recent surge of viral metagenomic studies, it remains a significant challenge to recover complete virus genomes from metagenomic data. The majority of viral contigs generated from de novo assembly programs are highly fragmented, presenting significant challenges to downstream analysis and inference. To address this issue, we have developed Virseqimprover, a computational pipeline that can extend assembled contigs to complete or nearly complete genomes while maintaining extension quality. Virseqimprover first examines whether there is any chimeric sequence based on read coverage, breaks the sequence into segments if there is, then extends the longest segment with uniform depth of coverage, and repeats these procedures until the sequence cannot be extended. Finally, Virseqimprover annotates the gene content of the resulting sequence. Results show that Virseqimprover has good performances on correcting and extending viral contigs to their full lengths, hence can be a useful tool to improve the completeness and minimize the assembly errors of viral contigs. Both a web server and a conda package for Virseqimprover are provided to the research community free of charge.
Project description:Mitochondrial (mt) genomics represents an understudied but important field of molecular biology. Increasingly, mt dysfunction is being linked to a range of human diseases, including neurodegenerative disorders, diabetes and impairment of childhood development. In addition, mt genomes provide important markers for systematic, evolutionary and population genetic studies. Some technological limitations have prevented the expanded generation and utilization of mt genomic data for some groups of organisms. These obstacles most acutely impede, but are not limited to, studies requiring the determination of complete mt genomic data from minute amounts of material (e.g. biopsy samples or microscopic organisms). Furthermore, post-sequencing bioinformatic annotation and analyses of mt genomes are time consuming and inefficient. Herein, we describe a high-throughput sequencing and bioinformatic pipeline for mt genomics, which will have implications for the annotation and analysis of other organellar (e.g. plastid or apicoplast genomes) and virus genomes as well as long, contiguous regions in nuclear genomes. We utilize this pipeline to sequence and annotate the complete mt genomes of 12 species of parasitic nematode (order Strongylida) simultaneously, each from an individual organism. These mt genomic data provide a rich source of markers for studies of the systematics and population genetics of a group of socioeconomically important pathogens of humans and other animals.
Project description:To identify sets of genes that exhibit similar expression characteristics, co-expression networks were constructed from transcriptome datasets that were obtained from plant samples at various stages of growth and development or treated with diverse biotic, abiotic, and other environmental stresses. In addition, co-expression network analysis can provide deeper insights into gene regulation when combined with transcriptomics. The coordination and integration of all these complex networks to deduce gene regulation are major challenges for plant biologists. Python and R have emerged as major tools for managing complex scientific data over the past decade. In this study, we describe a reproducible protocol POTFUL (pant co-expression transcription factor regulators), implemented in Python 3, for integrating co-expression and transcription factor target protein networks to infer gene regulation.
Project description:The advances of large-scale genomics studies have enabled compilation of cell type-specific, genome-wide DNA functional elements at high resolution. With the growing volume of functional annotation data and sequencing variants, existing variant annotation algorithms lack the efficiency and scalability to process big genomic data, particularly when annotating whole-genome sequencing variants against a huge database with billions of genomic features. Here, we develop VarNote to rapidly annotate genome-scale variants in large and complex functional annotation resources. Equipped with a novel index system and a parallel random-sweep searching algorithm, VarNote shows substantial performance improvements (two to three orders of magnitude) over existing algorithms at different scales. It supports both region-based and allele-specific annotations and introduces advanced functions for the flexible extraction of annotations. By integrating massive base-wise and context-dependent annotations in the VarNote framework, we introduce three efficient and accurate pipelines to prioritize the causal regulatory variants for common diseases, Mendelian disorders, and cancers.