Project description:Motivation:Whole-genome sequencing (WGS) data are being generated at an unprecedented rate. Analysis of WGS data requires a flexible data format to store the different types of DNA variation. Variant call format (VCF) is a general text-based format developed to store variant genotypes and their annotations. However, VCF files are large and data retrieval is relatively slow. Here we introduce a new WGS variant data format, implemented in the R/Bioconductor package 'SeqArray', for storing variant calls in an array-oriented manner. It provides the same capabilities as VCF, but with multiple high-compression options and data access using high-performance parallel computing. Results:Benchmarks using 1000 Genomes Phase 3 data show file sizes of 14.0 Gb (VCF), 12.3 Gb (BCF, binary VCF), 3.5 Gb (BGT) and 2.6 Gb (SeqArray), respectively. Reading genotypes with the SeqArray package is two to three times faster than the htslib C library reading BCF files. For allele frequency calculation, the implementation in the SeqArray package is over 5 times faster than PLINK v1.9 with VCF and BCF files, and over 16 times faster than vcftools. When used in conjunction with R/Bioconductor packages, the SeqArray package provides users with a flexible, feature-rich, high-performance programming environment for analysis of WGS variant data. Availability and Implementation:http://www.bioconductor.org/packages/SeqArray. Contact:email@example.com. Supplementary information:Supplementary data are available at Bioinformatics online.
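The array-oriented layout is what makes whole-data summaries such as allele frequencies fast: genotypes can be reduced along the sample axis in one vectorized pass instead of parsing text records. A minimal Python/NumPy sketch of the idea (the array shape and 0/1/-1 encoding here are illustrative assumptions, not SeqArray's actual storage format):

```python
import numpy as np

# Hypothetical genotype array: variants x samples x ploidy,
# values 0 = reference allele, 1 = alternate allele, -1 = missing call.
geno = np.array([
    [[0, 0], [0, 1], [1, 1]],    # variant 1
    [[0, 1], [0, 0], [-1, -1]],  # variant 2 (one sample missing)
])

called = geno >= 0                               # mask out missing calls
alt_count = np.where(called, geno, 0).sum(axis=(1, 2))
n_called = called.sum(axis=(1, 2))
alt_freq = alt_count / n_called                  # per-variant alternate allele frequency

print(alt_freq)  # variant 1: 3/6 = 0.5, variant 2: 1/4 = 0.25
```

The same reduction applies unchanged to a block of millions of variants, which is where the speedup over record-by-record text parsing comes from.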
Project description:Human whole-genome sequencing reveals about 4,000,000 genomic variants per individual. These data are mostly stored as VCF-format files. Although many variant analysis methods accept VCF as input, many other tools require DNA or protein sequences, particularly for splicing prediction, sequence alignment, phylogenetic analysis, and structure prediction. However, no existing webserver could extract DNA/protein sequences for genomic variants from VCF files in a user-friendly and efficient manner. We developed the SeqTailor webserver to bridge this gap, by enabling rapid extraction of (i) DNA sequences around genomic variants, with customizable window sizes and options to annotate the splice sites closest to the variants and to consider neighboring variants within the window; and (ii) protein sequences encoded by the DNA sequences around genomic variants, with a built-in SnpEff annotator and customizable window sizes. SeqTailor supports 11 species: human (GRCh37/GRCh38), chimpanzee, mouse, rat, cow, chicken, lizard, zebrafish, fruit fly, Arabidopsis and rice. Standalone programs are provided for command-line use. SeqTailor streamlines the sequence extraction process and accelerates the analysis of genomic variants with software requiring DNA/protein sequences. It will facilitate the study of genomic variation by increasing the feasibility of sequence-based analysis and prediction. The SeqTailor webserver is freely available at http://shiva.rockefeller.edu/SeqTailor/.
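Extracting a DNA window around a variant amounts to slicing the reference sequence around the variant position while clipping at sequence boundaries. A minimal Python sketch assuming 1-based variant coordinates (the function name and default window are hypothetical, not SeqTailor's API):

```python
def window_around_variant(ref_seq, pos, window=10):
    """Return the reference sequence within +/- `window` bases of a
    1-based variant position, clipped at the sequence boundaries."""
    start = max(0, pos - 1 - window)              # convert to 0-based, clip left
    end = min(len(ref_seq), pos - 1 + window + 1)  # clip right
    return ref_seq[start:end]

seq = "ACGTACGTACGTACGTACGT"
print(window_around_variant(seq, 10, window=3))  # GTACGTA
```

A full implementation would additionally apply the alternate allele and, as SeqTailor optionally does, incorporate neighboring variants that fall inside the window.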
Project description:This experiment contains a subset of data from the BLUEPRINT Epigenome project ( http://www.blueprint-epigenome.eu ), which aims at producing reference haemopoietic epigenomes for the research community. 74 samples of primary cells or cultured primary cells of different haemopoietic lineages from cord blood, venous blood, bone marrow and thymus are included in this experiment. This ArrayExpress record contains only meta-data. Raw data files have been archived at the European Genome-Phenome Archive (EGA, www.ebi.ac.uk/ega) by the consortium, with restricted access to protect sample donors' identities. There are 32 EGA data set accessions, which can be found under the Comment[EGA_DATA_SET] column in the 'Sample Data Relationship Format' (SDRF) file of this ArrayExpress record (http://www.ebi.ac.uk/arrayexpress/files/E-MTAB-3827/E-MTAB-3827.sdrf.txt). Details on how to apply for data access via the BLUEPRINT data access committee are on the EGA data set pages. Likewise, mapping of samples to these EGA accessions can be found in the SDRF file. Please note that the raw data files for 11 sequencing runs have not yet been deposited at EGA, so they are marked 'not available' under the Comment[SUBMITTED_FILE_NAME] field in the SDRF file, and were included for the sake of completeness. Further information on individual samples and sequencing libraries can also be found on the BLUEPRINT data coordination centre (DCC) website: http://dcc.blueprint-epigenome.eu
Project description:BACKGROUND:Next-generation sequencing (NGS) has been widely used in both clinics and research. It has become the most powerful tool for diagnosing genetic disorders and investigating disease etiology through the discovery of genetic variants. Variants identified by NGS are stored in variant call format (VCF) files. However, querying and filtering VCF files are extremely difficult for researchers without programming skills. Furthermore, as the mutation data are increasing exponentially, there is an urgent need to develop tools to manage these variant data in a centralized way. METHODS:The VCF-Server was developed as a web-based visualization tool to support the interactive analysis of genetic variant data. It allows researchers and medical geneticists to manage, annotate, filter, query, and export variants in a fast and effective way. RESULTS:In this study, we developed the VCF-Server, a powerful and easily accessible tool for researchers and medical geneticists to perform variant analysis. Users can query VCFs, annotate, and filter variants without knowing programming code. Once the VCF file is uploaded, VCF-Server allows users to annotate the VCF with commonly used databases or user-defined variant annotations (including variant blacklist and whitelist). Variant information in the VCF is shown visually via the interactive graphical interface. Users can filter the variants with flexible filtering rules, and the prioritized variants can be exported locally for further analysis. As VCF-Server adopts a web file system, files in the VCF-Server can be stored and managed in a centralized way. Moreover, VCF-Server allows direct web-based analysis (accessible through either desktop computers or mobile devices) as well as local deployment. 
CONCLUSIONS:With an easy-to-use graphical interface, VCF-Server allows researchers with little bioinformatics background to explore and mine mutation data, which may broaden the application of NGS technology in clinics and research. The tool is freely available for use at https://www.diseasegps.org/VCF-Server?lan=eng.
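Conceptually, the filtering that VCF-Server exposes through its interface amounts to evaluating a rule against each variant record. A minimal Python sketch of QUAL-based filtering on raw VCF lines (illustrative only; VCF-Server performs such filtering server-side through its graphical interface):

```python
def filter_vcf_lines(lines, min_qual=30.0):
    """Yield header lines unchanged and variant records whose QUAL
    field (column 6) is present and meets the threshold."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]
        if qual != "." and float(qual) >= min_qual:
            yield line

vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "1\t200\t.\tC\tT\t10\tPASS\t.\n",
]
kept = list(filter_vcf_lines(vcf, min_qual=30))
print(len(kept))  # 3: two header lines plus the one passing record
```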
Project description:Evaluation, optimisation and benchmarking of next generation sequencing (NGS) variant calling performance are essential requirements for clinical, commercial and academic NGS pipelines. Such assessments should be performed in a consistent, transparent and reproducible fashion, using independently, orthogonally generated data. Here we present ICR142 Benchmarker, a tool to generate outputs for assessing germline base substitution and indel calling performance using the ICR142 NGS validation series, a dataset of Illumina platform-based exome sequence data from 142 samples together with Sanger sequence data at 704 sites. ICR142 Benchmarker provides summary and detailed information on the sensitivity, specificity and false detection rates of variant callers. It also automatically generates a single-page report highlighting key performance metrics and how performance compares to widely used open-source tools. We used ICR142 Benchmarker with VCF files output by GATK, OpEx and DeepVariant to create a benchmark for variant calling performance. This evaluation revealed pipeline-specific differences and shared challenges in variant calling, for example in detecting indels in short repeating sequence motifs. We next used ICR142 Benchmarker to perform regression testing with DeepVariant versions 0.5.2 and 0.6.1. This showed that v0.6.1 improves variant calling performance, but there was evidence of minor changes in indel calling behaviour that may benefit from attention. The data also allowed us to evaluate filters to optimise DeepVariant calling, and we recommend using 30 as the QUAL threshold for base substitution calls with DeepVariant v0.6.1. Finally, we used ICR142 Benchmarker with VCF files from two commercial variant calling providers to facilitate optimisation of their in-house pipelines and to provide transparent benchmarking of their performance.
ICR142 Benchmarker consistently and transparently analyses variant calling performance based on the ICR142 NGS validation series, using the standard VCF input and outputting informative metrics to enable user understanding of pipeline performance. ICR142 Benchmarker is freely available at https://github.com/RahmanTeamDevelopment/ICR142_Benchmarker/releases.
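The headline metrics reported above, sensitivity, specificity and false detection rate, derive directly from counts of true/false positive and negative calls against a validated truth set. A Python sketch with the standard definitions (the counts below are invented for illustration, not ICR142 results):

```python
def calling_metrics(tp, fp, fn, tn):
    """Compute standard variant-calling metrics from validation counts.
    Argument names are generic, not ICR142 Benchmarker's API."""
    sensitivity = tp / (tp + fn)  # fraction of true variants detected
    specificity = tn / (tn + fp)  # fraction of non-variant sites correctly left uncalled
    fdr = fp / (tp + fp)          # fraction of made calls that are false
    return sensitivity, specificity, fdr

# Made-up counts for a hypothetical caller:
sens, spec, fdr = calling_metrics(tp=680, fp=8, fn=24, tn=120)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} FDR={fdr:.3f}")
```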
Project description:Background:Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine based methods become increasingly inefficient when processing large numbers of files due to the excessive computation time and Input/Output bottleneck. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance. Findings:In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrative examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, benchmarked against the traditional single/parallel multiway-merge methods, a message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools. Conclusions:Our experiments suggest all three schemas either deliver a significant improvement in efficiency or render much better strong and weak scalabilities over traditional methods. Our findings provide generalized scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems.
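At its core, sorted merging of VCF files is a k-way merge of position-sorted record streams; the distributed schemas described above partition this same operation across workers. A single-machine Python sketch using the standard library's heapq.merge (records are simplified to tuples here; real VCF merging must also reconcile sample columns and INFO fields):

```python
import heapq

# Each "file" is a position-sorted stream of (chrom_index, pos, payload) records.
file_a = [(1, 100, "a1"), (1, 300, "a2")]
file_b = [(1, 200, "b1"), (2, 50, "b2")]
file_c = [(1, 100, "c1")]

# heapq.merge lazily interleaves any number of sorted inputs by genomic location,
# keeping only one record per input in memory at a time.
merged = list(heapq.merge(file_a, file_b, file_c, key=lambda r: r[:2]))
print([r[2] for r in merged])  # ['a1', 'c1', 'b1', 'a2', 'b2']
```

Because the merge is streaming, memory use stays proportional to the number of inputs rather than their total size, which is exactly the property the divide-and-conquer schemas exploit at larger scale.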
Project description:The expression profiles and sequence variants of 476 early-stage urothelial carcinomas were studied using whole transcriptome sequencing. RNA-Seq libraries were prepared by Ribo-Zero treatment of total RNA (to reduce the rRNA content) followed by library preparation using ScriptSeq. RNA-Seq libraries were paired-end sequenced (2x 101 bp) on an Illumina HiSeq 2000, and the resulting fastq files were processed using tools from the Genome Analysis Toolkit (GATK) and the Tuxedo suite. Access to the sequence data (bam and vcf files), which contain person-identifying information, requires a signed controlled-access form; the data can be accessed at the European Genome-phenome Archive (EGA) under study ID EGAS00001001236 upon request. An expression matrix of FPKM values is available without restriction at ArrayExpress.
Project description:Cystic fibrosis (CF) is one of the most common genetic diseases worldwide, with high carrier frequencies across different ethnicities. Next generation sequencing of the cystic fibrosis transmembrane conductance regulator (CFTR) gene has proven to be an effective screening tool to determine carrier status with high detection rates. Here, we evaluate the performance of the Swift Biosciences Accel-Amplicon CFTR Capture Panel using CFTR-positive DNA samples. This assay is a one-day protocol that allows a one-tube reaction of 87 amplicons spanning all coding regions, the 5' and 3' UTRs, and four intronic regions. In this study, we provide the FASTQ, BAM, and VCF files for seven unique CFTR-positive samples and one normal control sample (14 samples processed in total, including repeats). This method generated sequencing data with high coverage and near 100% on-target reads. We found that coverage depth was correlated with the GC content of each exon. This dataset is instrumental for clinical laboratories that are evaluating this technology as part of their carrier screening program.
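The reported correlation between coverage depth and GC content rests on the standard per-region GC fraction. A minimal Python sketch, with region sequences taken as plain strings (illustrative only, not the study's analysis code):

```python
def gc_content(seq):
    """Fraction of G and C bases in a region's sequence (e.g. an exon
    or amplicon), ignoring case."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

print(gc_content("ATGCGC"))  # 4 of 6 bases are G/C
```

In practice one would compute this per exon and correlate it against mean coverage depth, e.g. with a rank correlation across regions.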