SeqGene: a comprehensive software solution for mining exome- and transcriptome- sequencing data.
ABSTRACT: BACKGROUND: The popularity of massively parallel exome and transcriptome sequencing projects demands new data mining tools with a comprehensive set of features to support a wide range of analysis tasks. RESULTS: SeqGene, a new data mining tool, supports mutation detection and annotation, dbSNP and 1000 Genomes data integration, RNA-Seq expression quantification, mutation and coverage visualization, allele specific expression (ASE), identification of differentially expressed genes (DEGs), copy number variation (CNV) analysis, and detection of gene expression quantitative trait loci (eQTLs). We also developed novel methods for testing the association between SNP and expression and for identifying genotype-controlled DEGs. We showed that the results generated by SeqGene compare favourably to those of other existing methods in our case studies. CONCLUSION: SeqGene is designed as a general-purpose software package. It supports both paired-end reads and single reads generated on most sequencing platforms; it runs on all major types of computers; it supports arbitrary genome assemblies for arbitrary organisms; and it scales well to support both large and small scale sequencing projects. The software homepage is http://seqgene.sourceforge.net.
Project description:Quantitative and systems biology approaches benefit from the unprecedented depth of next-generation sequencing. A typical experiment yields millions of short reads, which often carry particular sequence tags. These tags may (a) be specific to the sequencing platform and library construction method (e.g., adapter sequences); (b) have been introduced by experimental design (e.g., sample barcodes); or (c) constitute some biological signal (e.g., splice leader sequences in nematodes). Our software FLEXBAR enables accurate recognition, sorting and trimming of sequence tags with maximal flexibility, based on exact overlap sequence alignment. The software supports data formats from all current sequencing platforms, including color-space reads. FLEXBAR maintains read pairings and processes separate barcode reads on demand. Our software facilitates the fine-grained adjustment of sequence tag detection parameters and search regions. FLEXBAR is multi-threaded and combines speed with precision. Even complex read processing scenarios can be executed with a single command line call. We demonstrate the utility of the software in terms of read mapping applications, library demultiplexing and splice leader detection. FLEXBAR and additional information are available for academic use from the website: http://sourceforge.net/projects/flexbar/.
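To illustrate what trimming a sequence tag by overlap involves, the following is a minimal Python sketch, not FLEXBAR's implementation. It replaces the overlap alignment named in the abstract with a simpler ungapped left-to-right scan, and the function name `trim_adapter` and the parameters `min_overlap` and `max_error_rate` are hypothetical choices made for this example.

```python
def trim_adapter(read, adapter, min_overlap=3, max_error_rate=0.1):
    """Remove a 3' adapter, or a prefix of it, from the end of a read.

    Assumption: an ungapped scan (mismatches only, no indels) stands in
    for a real overlap alignment. All names here are illustrative.
    """
    n = len(read)
    # Try every position where the adapter could begin, leftmost first.
    for start in range(n - min_overlap + 1):
        overlap = min(len(adapter), n - start)
        window = read[start:start + overlap]
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        # Accept the first placement whose error rate is within the limit.
        if mismatches <= max_error_rate * overlap:
            return read[:start]
    return read  # no acceptable adapter placement found


if __name__ == "__main__":
    insert = "ACGTACGTTTGACC"
    adapter = "AGATCGGAAGAGC"
    # Only the first 7 adapter bases are present at the read end.
    print(trim_adapter(insert + adapter[:7], adapter))
```

The scan returns the leftmost acceptable placement, so a stricter `min_overlap` reduces the chance of clipping genuine sequence that happens to resemble the start of the adapter.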
Project description:Despite their economic, ecological, and experimental importance, genomic resources remain scarce for crustaceans. In lieu of genomes, many researchers have taken advantage of technological advancements to instead sequence and assemble crustacean transcriptomes de novo. However, there is little consensus on what standard operating procedures are, or should be, for the field. Here, we systematically reviewed 53 studies published during 2014-2015 that utilized transcriptomic resources from this taxonomic group in an effort to identify commonalities as well as potential weaknesses that have applicability beyond just crustaceans. In general, these studies utilized RNA-Seq data, both novel and publicly available, to characterize transcriptomes and/or identify differentially expressed genes (DEGs) between treatments. Although the software suite Trinity was popular in assembly pipelines and other programs were also commonly employed, many studies failed to report crucial details regarding bioinformatic methodologies, including read mappers and the utilized parameters in identifying and characterizing DEGs. Annotation percentages for assembled transcriptomic contigs were low, averaging 32% overall. While other metrics, such as numbers of contigs and DEGs reported, correlated with the number of sequence reads utilized per sample, these did reach apparent saturation with increasing sequencing depth. Most disturbingly, a number of studies (55%) reported DEGs based on non-replicated experimental designs and single biological replicates for each treatment. Given this, we suggest future RNA-Seq experiments targeting transcriptome characterization conduct deeper (i.e., 50-100 M reads) sequencing while those examining differential expression instead focus more on increased biological replicates at shallower (i.e., ~10-20 M reads/sample) sequencing depths.
Moreover, the community must avoid submitting for review, or accepting for publication, non-replicated differential expression studies. Finally, mining the ever growing publicly available transcriptomic data from crustaceans will allow future studies to focus on hypothesis-driven research instead of continuing to simply characterize transcriptomes. As an example of this, we utilized neurotoxin sequences from the recently described remipede venom gland transcriptome in conjunction with publicly available crustacean transcriptomic data to derive preliminary results and hypotheses regarding the evolution of venom in crustaceans.
Project description:BACKGROUND:Microbial genetics has formed a foundation for understanding many aspects of biology. Systematic annotation that supports computational data mining should reveal further insights for microbes, microbiomes, and conserved functions beyond microbes. The Ontology of Microbial Phenotypes (OMP) was created to support such annotation. RESULTS:We define standards for an OMP-based annotation framework that supports the capture of a variety of phenotypes and provides flexibility for different levels of detail based on a combination of pre- and post-composition using OMP and other Open Biomedical Ontology (OBO) projects. A system for entering and viewing OMP annotations has been added to our online, public, web-based data portal. CONCLUSIONS:The annotation framework described here is ready to support projects to capture phenotypes from the experimental literature for a variety of microbes. Defining the OMP annotation standard should support the development of new software tools for data mining and analysis in comparative phenomics.
Project description:Information extraction from biomedical literature is continuously growing in scope and importance. Many tools exist that perform named entity recognition, e.g. of proteins, chemical compounds, and diseases. Furthermore, several approaches deal with the extraction of relations between identified entities. The BioCreative community supports these developments with yearly open challenges, which led to a standardised XML text annotation format called BioC. PubMed provides access to the largest open biomedical literature repository, but there is no unified way of connecting its data to natural language processing tools. Therefore, an appropriate data environment is needed as a basis to combine different software solutions and to develop customised text mining applications. PubMedPortable builds a relational database and a full text index on PubMed citations. It can be applied either to the complete PubMed data set or an arbitrary subset of downloaded PubMed XML files. The software provides the infrastructure to combine stand-alone applications by exporting different data formats, e.g. BioC. The presented workflows show how to use PubMedPortable to retrieve, store, and analyse a disease-specific data set. The provided use cases are well documented in the PubMedPortable wiki. The open-source software library is small, easy to use, and scalable to the user's system requirements. It is freely available for Linux on the web at https://github.com/KerstenDoering/PubMedPortable and for other operating systems as a virtual container. The approach was tested extensively and applied successfully in several projects.
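The idea of loading PubMed XML into a relational table and querying it can be sketched as follows. This is a hedged illustration and not PubMedPortable's code or schema: `sqlite3` stands in for whatever database the tool actually uses, the `citation` table and the function `load_citations` are hypothetical, and only the `PubmedArticle`, `MedlineCitation`, `PMID`, `ArticleTitle` and `AbstractText` elements of the PubMed XML format are read.

```python
import sqlite3
import xml.etree.ElementTree as ET


def load_citations(xml_text, conn):
    """Store PMID, title and abstract of each citation in a SQLite table.

    Assumption: a three-column table is enough for the illustration;
    the real tool's schema is not described in the abstract.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation ("
        "pmid INTEGER PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title = article.findtext("MedlineCitation/Article/ArticleTitle", default="")
        parts = article.findall("MedlineCitation/Article/Abstract/AbstractText")
        # Structured abstracts have several AbstractText parts; join them.
        abstract = " ".join("".join(p.itertext()) for p in parts)
        conn.execute(
            "INSERT OR REPLACE INTO citation VALUES (?, ?, ?)",
            (int(pmid), title, abstract),
        )
    conn.commit()


if __name__ == "__main__":
    # Toy records invented for this example, not real PubMed entries.
    toy_xml = (
        "<PubmedArticleSet>"
        "<PubmedArticle><MedlineCitation><PMID>1</PMID><Article>"
        "<ArticleTitle>Toy title about pancreatic cancer</ArticleTitle>"
        "<Abstract><AbstractText>Toy abstract text.</AbstractText></Abstract>"
        "</Article></MedlineCitation></PubmedArticle>"
        "</PubmedArticleSet>"
    )
    conn = sqlite3.connect(":memory:")
    load_citations(toy_xml, conn)
    query = "SELECT pmid, title FROM citation WHERE title LIKE ?"
    print(conn.execute(query, ("%pancreatic%",)).fetchall())
```

A disease-specific subset, as in the workflows the abstract mentions, would then correspond to a `WHERE` clause or a full text search over the stored titles and abstracts.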
Project description:The present study aimed to explore the underlying molecular mechanisms of hepatocellular carcinoma (HCC). RNA-sequencing profiles GSM629264 and GSM629265, from the GSE25599 data set, were downloaded from the Gene Expression Omnibus database and processed by quality evaluation. GSM629264 and GSM629265 were from HCC and adjacent non-cancerous tissues, respectively. TopHat software was used for alignment analysis, followed by the detection of novel splicing sites. In addition, the Cufflinks software package was used to analyze gene expression, and the Cuffdiff program was used to screen for differentially expressed genes (DEGs) and differentially expressed splicing variants. Gene ontology functional enrichment and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analyses of DEGs were also performed. Transcription factors (TFs) and microRNAs (miRNAs) that regulate DEGs were identified, and a protein-protein interaction (PPI) network was constructed. The hub node in the PPI network was obtained, and the TFs and miRNAs that regulated the hub node were further predicted. The quality of the sequencing data met the standards for analysis, and the clean reads were ~65%. Most sequencing reads mapped into coding sequence exons (CDS_exons), whereas other reads mapped into exon 3' untranslated regions (UTR_Exons), 5'UTR_Exons and Introns. Upregulated and downregulated DEGs between HCC and adjacent non-cancerous tissues were screened. Genes of differentially expressed splicing variants were identified, including vesicle-associated membrane protein 4, phosphatidylinositol glycan anchor biosynthesis class C, protein disulfide isomerase family A member 4 and growth arrest specific 5. Screened DEGs were enriched in the complement pathway. In the PPI network, ubiquitin C (UBC) was the hub node.
UBC was predicted to be regulated by several TFs, including specificity protein 1 (SP1), FBJ murine osteosarcoma viral oncogene homolog (FOS), proto-oncogene c-JUN (JUN), FOS-like antigen 2 (FOSL2) and SWI/SNF-related, matrix-associated, actin-dependent regulator of chromatin, subfamily A, member 4 (SMARCA4), and several miRNAs, including miR-30 and miR-181. Results from the present study demonstrated that UBC, SP1, FOS, JUN, FOSL2, SMARCA4, miR-30 and miR-181 may participate in the development of HCC.
Project description:BACKGROUND:In genotyping-by-sequencing (GBS) and restriction site-associated DNA sequencing (RAD-seq), read depth is important for assessing the quality of genotype calls and estimating allele dosage in polyploids. However, existing pipelines for GBS and RAD-seq do not provide read counts in formats that are both accurate and easy to access. Additionally, although existing pipelines allow previously-mined SNPs to be genotyped on new samples, they do not allow the user to manually specify a subset of loci to examine. Pipelines that do not use a reference genome assign arbitrary names to SNPs, making meta-analysis across projects difficult. RESULTS:We created the software TagDigger, which includes three programs for analyzing GBS and RAD-seq data. The first script, tagdigger_interactive.py, rapidly extracts read counts and genotypes from FASTQ files using user-supplied sets of barcodes and tags. Input and output are in CSV format so that they can be opened by spreadsheet software. Tag sequences can also be imported from the Stacks, TASSEL-GBSv2, TASSEL-UNEAK, or pyRAD pipelines, and a separate file can be imported listing the names of markers to retain. A second script, tag_manager.py, consolidates marker names and sequences across multiple projects. A third script, barcode_splitter.py, assists with preparing FASTQ data for deposit in a public archive by splitting FASTQ files by barcode and generating MD5 checksums for the resulting files. CONCLUSIONS:TagDigger is open-source and freely available software written in Python 3. It uses a scalable, rapid search algorithm that can process over 100 million FASTQ reads per hour. TagDigger will run on a laptop with any operating system, does not consume hard drive space with intermediate files, and does not require programming skill to use.
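The core operation described above, counting reads per sample and marker from user-supplied barcodes and tags, can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions, not TagDigger's search algorithm: matching is exact and prefix-based, and the names `count_tags` and `counts_to_csv` as well as the toy barcodes and tag sequences are hypothetical.

```python
import csv
import io
from collections import Counter


def count_tags(fastq_lines, barcodes, tags):
    """Count reads per (sample, marker) pair.

    barcodes maps barcode sequence -> sample name; tags maps tag
    sequence -> marker name. Assumption: the barcode sits at the very
    start of the read and the tag follows it directly, with no errors.
    """
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:  # the sequence is the 2nd line of each 4-line record
            continue
        seq = line.strip()
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                insert = seq[len(barcode):]
                for tag_seq, marker in tags.items():
                    if insert.startswith(tag_seq):
                        counts[(sample, marker)] += 1
                        break
                break  # first matching barcode wins
    return counts


def counts_to_csv(counts):
    """Render the counts as CSV text with one row per sample and marker."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample", "marker", "reads"])
    for (sample, marker), n in sorted(counts.items()):
        writer.writerow([sample, marker, n])
    return buffer.getvalue()


if __name__ == "__main__":
    fastq = [
        "@r1", "ACGTTGCAGGA", "+", "IIIIIIIIIII",
        "@r2", "TTAATGCAGCT", "+", "IIIIIIIIIII",
    ]
    barcodes = {"ACGT": "sample1", "TTAA": "sample2"}
    tags = {"TGCAGG": "locus1_A", "TGCAGC": "locus1_B"}
    print(counts_to_csv(count_tags(fastq, barcodes, tags)))
```

Because the reads are streamed one record at a time and only the counters are kept, a sketch like this needs no intermediate files, which mirrors the property the abstract highlights.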
Project description:BACKGROUND: The new research field of metagenomics is providing exciting insights into various, previously unclassified ecological systems. Next-generation sequencing technologies are producing a rapid increase of environmental data in public databases. There is great need for specialized software solutions and statistical methods for dealing with complex metagenome data sets. METHODOLOGY/PRINCIPAL FINDINGS: To facilitate the development and improvement of metagenomic tools and the planning of metagenomic projects, we introduce a sequencing simulator called MetaSim. Our software can be used to generate collections of synthetic reads that reflect the diverse taxonomical composition of typical metagenome data sets. Based on a database of given genomes, the program allows the user to design a metagenome by specifying the number of genomes present at different levels of the NCBI taxonomy, and then to collect reads from the metagenome using a simulation of a number of different sequencing technologies. A population sampler optionally produces evolved sequences based on source genomes and a given evolutionary tree. CONCLUSIONS/SIGNIFICANCE: MetaSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software.
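As a rough illustration of what sampling synthetic reads from a designed metagenome means, here is a toy Python sketch. It is not MetaSim's algorithm: it draws error-free, fixed-length reads, ignores genome length when choosing the source, and omits the sequencing error models and the population sampler described above. The function name `simulate_reads` and all parameters are hypothetical.

```python
import random


def simulate_reads(genomes, abundances, n_reads, read_len, seed=0):
    """Draw error-free reads from genomes in proportion to abundance.

    genomes maps name -> sequence; abundances maps name -> relative
    weight. Assumption: each read picks its source genome by weight
    alone and then a uniform start position within that genome.
    """
    for name, seq in genomes.items():
        if len(seq) < read_len:
            raise ValueError(f"genome {name!r} is shorter than read_len")
    rng = random.Random(seed)  # fixed seed makes the test set reproducible
    names = sorted(genomes)
    weights = [abundances[name] for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        seq = genomes[name]
        start = rng.randrange(len(seq) - read_len + 1)
        reads.append((name, start, seq[start:start + read_len]))
    return reads


if __name__ == "__main__":
    toy_genomes = {"taxonA": "ACGT" * 50, "taxonB": "GGCATTCA" * 25}
    toy_abundances = {"taxonA": 3, "taxonB": 1}
    for name, start, read in simulate_reads(toy_genomes, toy_abundances, 5, 20):
        print(name, start, read)
```

Recording the source genome and start position with each read is what makes such data usable as a benchmark, since a downstream tool's assignments can be checked against the known origin.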
Project description:The Gene Ontology (GO) initiative is a collaborative effort that uses controlled vocabularies for annotating genetic information. We here present AGENDA (Application for mining Gene Ontology Data), a novel web-based tool for accessing the GO database. AGENDA allows the user to simultaneously retrieve and compare gene lists linked to different GO terms in diverse species using batch queries, facilitating comparative approaches to genetic information. The web-based application offers diverse search options and allows the user to bookmark, visualize, and download the results. AGENDA is an open source web-based application that is freely available for non-commercial use at the project homepage. URL: http://sourceforge.net/projects/bioagenda.
Project description:BACKGROUND: Neutrophil antigens are involved in a variety of clinical conditions including transfusion-related acute lung injury (TRALI) and other transfusion-related diseases. Five human neutrophil antigen (HNA) systems, HNA1 to HNA5, have been characterized to date. Characterizing all neutrophil antigens from whole genome sequencing (WGS) data may reveal the complete genotyping formats of neutrophil antigens collectively at the genome level, including the molecular variations that conventional genotyping techniques for neutrophil antigens reveal only individually. RESULTS: We developed a computing method for the genotyping of human neutrophil antigens. Six samples from two families, available from the 1000 Genomes Project, were used for an HNA typing test. Between 500 and 3000 reads per sample were filtered from the adopted human WGS datasets in order to identify single nucleotide polymorphisms (SNPs) of neutrophil antigens. The visualization of read alignment shows that the reads obtained from the WGS datasets are sufficient to cover all of the SNP loci for the antigen systems HNA1, HNA3, HNA4 and HNA5. Consequently, our bioinformatics tool successfully revealed HNA types for all six samples, including sequence-based typing (SBT) as well as PCR sequence-specific oligonucleotide probes (SSOP), PCR sequence-specific primers (SSP) and PCR restriction fragment length polymorphism (RFLP), along with parentage possibility. CONCLUSIONS: Next-generation sequencing technology strives to deliver affordable and non-biased sequencing results, hence the complete genotyping formats of HNA may be reported collectively by mining the output data of WGS. The study shows the feasibility of HNA genotyping through new WGS technologies. Our proposed algorithmic methodology is implemented in the HNATyping software package, with a user's guide available to the public at http://sourceforge.net/projects/hnatyping/.
Project description:BACKGROUND: As resequencing projects become more prevalent across a larger number of species, accurate variant identification will further elucidate the nature of genetic diversity and become increasingly relevant in genomic studies. However, the identification of larger genomic variants via DNA sequencing is limited by both the incomplete information provided by sequencing reads and the nature of the genome itself. Long-read sequencing technologies provide high-resolution access to structural variants often inaccessible to shorter reads. RESULTS: We present PBHoney, software that considers both intra-read discordance and soft-clipped tails of long reads (>10,000 bp) to identify structural variants. As a proof of concept, we identify four structural variants and two genomic features in a strain of Escherichia coli with PBHoney and validate them via de novo assembly. PBHoney is available for download at http://sourceforge.net/projects/pb-jelly/. CONCLUSIONS: By implementing two variant-identification approaches that exploit the high mappability of long reads, PBHoney is shown to be effective at detecting larger structural variants using whole-genome Pacific Biosciences RS II Continuous Long Reads. Furthermore, PBHoney is able to discover two genomic features: the existence of Rac-Phage in the isolate and evidence of E. coli's circular genome.
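One of the two signals mentioned above, soft-clipped tails, can be read directly from the CIGAR string of an alignment in SAM format, where the `S` operation marks read bases that were not aligned. The following minimal Python sketch shows only that extraction step; it is not PBHoney's implementation, and the function names and the `min_tail` threshold of 200 bases are assumptions made for illustration.

```python
import re

# CIGAR operations as defined in the SAM format specification.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clip lengths of an alignment's CIGAR."""
    ops = [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]
    # Hard clips, if present, sit outside the soft clips; ignore them.
    ops = [item for item in ops if item[1] != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


def has_long_tail(cigar, min_tail=200):
    """Flag alignments with a clipped tail of at least min_tail bases.

    Assumption: min_tail is a hypothetical cutoff for this example.
    """
    return max(soft_clip_lengths(cigar)) >= min_tail


if __name__ == "__main__":
    for cigar in ["250S9000M40I700M300S", "12000M", "5H100S50M"]:
        print(cigar, soft_clip_lengths(cigar), has_long_tail(cigar))
```

A tail flagged this way is only a candidate: a structural variant caller would still need to realign the clipped bases and look for other reads whose tails point at the same breakpoint.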