Project description:Bioinformatic analysis of genomic sequencing data to identify somatic mutations in cancer samples is far from achieving the required robustness and standardisation. In this study we generated a whole exome sequencing benchmark dataset using the platinum genome sample NA12878 and developed an intersect-then-combine (ITC) approach to increase the accuracy in calling single nucleotide variants (SNVs) and indels in tumour-normal pairs. We evaluated the effect of alignment, base quality recalibration, mutation caller and filtering on sensitivity and false positive rate. The ITC approach increased the sensitivity up to 17.1%, without increasing the false positive rate per megabase (FPR/Mb) and its validity was confirmed in a set of clinical samples.
Project description:Next generation sequencing is extensively applied to catalogue somatic mutations in cancer, in research settings and increasingly in clinical settings for molecular diagnostics, guiding therapy decisions. Somatic variant callers perform paired comparisons of sequencing data from cancer tissue and matched normal tissue in order to detect somatic mutations. The advent of many new somatic variant callers creates a need for comparison and validation of the tools, as no de facto standard for detection of somatic mutations exists and only limited comparisons have been reported. We have performed a comprehensive evaluation using exome sequencing and targeted deep sequencing data of paired tumor-normal samples from five breast cancer patients to evaluate the performance of nine publicly available somatic variant callers: EBCall, Mutect, Seurat, Shimmer, Indelocator, Somatic Sniper, Strelka, VarScan 2 and Virmid for the detection of single nucleotide mutations and small deletions and insertions. We report a large variation in the number of calls from the nine somatic variant callers on the same sequencing data and highly variable agreement. Sequencing depth had markedly diverse impact on individual callers, as for some callers, increased sequencing depth highly improved sensitivity. For SNV calling, we report EBCall, Mutect, Virmid and Strelka to be the most reliable somatic variant callers for both exome sequencing and targeted deep sequencing. For indel calling, EBCall is superior due to high sensitivity and robustness to changes in sequencing depths.
Project description:Motivation: The development of cost-effective next-generation sequencing methods has spurred the development of high-throughput bioinformatics tools for detection of sequence variation. With many disparate variant-calling algorithms available, investigators must ask, 'Which method is best for my data?' Machine learning research has shown that so-called ensemble methods that combine the output of multiple models can dramatically improve classifier performance. Here we describe a novel variant-calling approach based on an ensemble of variant-calling algorithms, which we term the Consensus Genotyper for Exome Sequencing (CGES). CGES uses a two-stage voting scheme among four algorithm implementations. While our ensemble method can accept variants generated by any variant-calling algorithm, we used GATK2.8, SAMtools, FreeBayes and Atlas-SNP2 in building CGES because of their performance, widespread adoption and diverse but complementary algorithms. Results: We apply CGES to 132 samples sequenced at the Hudson Alpha Institute for Biotechnology (HAIB, Huntsville, AL) using the Nimblegen Exome Capture and Illumina sequencing technology. Our sample set consisted of 40 complete trios, two families of four, one parent-child duo and two unrelated individuals. CGES yielded the fewest total variant calls (N(CGES) = 139,897), the highest Ts/Tv ratio (3.02), the lowest Mendelian error rate across all genotypes (0.028%), the highest rediscovery rate from the Exome Variant Server (EVS; 89.3%) and 1000 Genomes (1KG; 84.1%) and the highest positive predictive value (PPV; 96.1%) for a random sample of previously validated de novo variants.
We describe these and other quality control (QC) metrics from consensus data and explain how the CGES pipeline can be used to generate call sets of varying quality stringency, including consensus calls present across all four algorithms, calls that are consistent across any three out of four algorithms, calls that are consistent across any two out of four algorithms or a more liberal set of all calls made by any algorithm. Availability and implementation: To enable accessible, efficient and reproducible analysis, we implement CGES both as a stand-alone command line tool available for download in GitHub and as a set of Galaxy tools and workflows configured to execute on parallel computers. Supplementary information: Supplementary data are available at Bioinformatics online.
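The stringency tiers described above (all four algorithms, any three, any two, any one) can be illustrated with a simple vote count. The following is a minimal sketch, not the CGES implementation: it ignores the two-stage voting scheme, treats "any three out of four" as "at least three", and uses hypothetical toy variants and a hypothetical helper name, `consensus_calls`.

```python
from collections import Counter


def consensus_calls(call_sets, min_callers):
    """Return the variants reported by at least `min_callers` call sets.

    `call_sets` maps a caller name to a set of variant keys. This is a
    plain site-level vote, assumed here for illustration only.
    """
    votes = Counter()
    for calls in call_sets.values():
        votes.update(set(calls))  # each caller votes at most once per variant
    return {variant for variant, n in votes.items() if n >= min_callers}


# Hypothetical toy variants keyed as (chromosome, position, ref, alt).
a = ("1", 100, "A", "G")
b = ("1", 200, "C", "T")
c = ("2", 300, "G", "A")
d = ("2", 400, "T", "C")

call_sets = {
    "GATK": {a, b, c},
    "SAMtools": {a, b},
    "FreeBayes": {a, b, d},
    "Atlas-SNP2": {a, c},
}

# One call set per stringency tier, from strictest (4) to most liberal (1).
tiers = {k: consensus_calls(call_sets, k) for k in (4, 3, 2, 1)}

for k, variants in tiers.items():
    print(f"called by >= {k} algorithms: {len(variants)} variants")
```

Because each tier is a superset of the stricter one, a downstream analysis can choose its tier according to how it trades sensitivity against false positives.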
Project description:Diverticulitis is a chronic disease of the colon in which diverticula, or outpouchings of the colonic wall, become inflamed. Although recent observations suggest that genetic factors may play a significant role in diverticulitis, few genes have yet been implicated in disease pathogenesis and familial cases are uncommon. Here, we report results of whole exome sequencing performed on members of a single multi-generational family with early onset diverticulitis in order to identify a genetic component of the disease. We identified a rare single nucleotide variant in the laminin β4 gene (LAMB4) that segregated with disease in a dominant pattern and causes a damaging missense substitution (D435N). Targeted sequencing of LAMB4 in 148 non-familial and unrelated sporadic diverticulitis patients identified two additional rare variants in the gene. Immunohistochemistry indicated that LAMB4 localizes to the myenteric plexus of colonic tissue and patients harboring LAMB4 variants exhibited reduced LAMB4 protein levels relative to controls. Laminins are constituents of the extracellular matrix and play a major role in regulating the development and function of the enteric nervous system. Reduced LAMB4 levels may therefore alter innervation and morphology of the enteric nervous system, which may contribute to colonic dysmotility associated with diverticulitis.
Project description:Accurate detection of somatic mutations in DNA sequencing data is a fundamental prerequisite for cancer research. Previous analytical challenges were overcome by consensus mutation calling across four to five popular callers. This, however, adds to the already nontrivial computing time of the individual callers. Here, we launch MuSE2.0, powered by multi-step parallelization and efficient memory allocation, to resolve the computing time bottleneck. MuSE2.0 is 50 times faster than MuSE1.0 and 8-80 times faster than other popular callers. Our benchmark study suggests that combining MuSE2.0 with the recently expedited Strelka2 can achieve high efficiency and accuracy in analyzing large cancer genomic datasets.
Project description:Motivation: The sequencing of tumors and their matched normals is frequently used to study the genetic composition of cancer. Despite this fact, there remains a dearth of available software tools designed to compare sequences in pairs of samples and identify sites that are likely to be unique to one sample. Results: In this article, we describe the mathematical basis of our SomaticSniper software for comparing tumor and normal pairs. We estimate its sensitivity and precision, and present several common sources of error resulting in miscalls. Availability and implementation: Binaries are freely available for download at http://gmt.genome.wustl.edu/somatic-sniper/current/, implemented in C and supported on Linux and Mac OS X. Contact: delarson@wustl.edu; lding@wustl.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Project description:Objective: The availability of complete and accurate crash injury data is critical to prevention and intervention efforts. Relying solely on hospital discharge data or police crash reports may result in a biased undercount of injuries. Linking hospital data with crash reports may allow for a more robust identification of injuries and an understanding of which populations may be missed in an analysis of one source. We used the New Jersey Safety and Health Outcomes (NJ-SHO) data warehouse to examine the share of the entire crash-injured population identified in each of the two data sources, overall and by age, race/ethnicity, sex, injury severity, and road user type. Methods: We utilized 2016-2017 data from the NJ-SHO warehouse. We identified crash-involved individuals in hospital discharge data by applying the ICD-10-CM external cause of injury matrix. Among crash-involved individuals, we identified those with injury- or pain-related diagnosis codes as being injured. We also identified crash-involved individuals via crash report data and identified injuries using the KABCO scale. We jointly examined the two sources; injuries in the hospital discharge data were documented as being related to the same crash as injuries found in the crash report data if the date of the crash report preceded the date of hospital admission by no more than two days. Results: In total, there were 262,338 crash-involved individuals with a documented injury in the hospital discharge data or on the crash report during the study period; 168,874 had an injury according to hospital discharge data, and 164,158 had an injury in crash report data. Only 70,694 (26.9%) had an injury in both sources.
We observed differences by age, race/ethnicity, injury severity, and road user type: hospital discharge data captured a larger share of those ages 65+, those who were Black or Hispanic, those with higher severity injuries, and those who were bicyclists or motorcyclists. Conclusions: Each data source in isolation captures approximately two-thirds of the entire crash-injured population; one source alone misses approximately one-third of injured individuals. Each source undercounts people in certain groups, so relying on one source alone may not allow for tailored prevention and intervention efforts.
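The reported counts can be checked with inclusion-exclusion, and the linkage rule can be written as a date comparison. The following is a minimal sketch using only the figures quoted above; the function name `same_crash`, its `max_days` parameter, and the choice to count a same-day admission as a match are assumptions made for illustration, not details taken from the study.

```python
from datetime import date

# Counts quoted in the Results above.
hospital = 168_874       # injured according to hospital discharge data
crash_report = 164_158   # injured according to crash report data
both = 70_694            # injured according to both sources

# Inclusion-exclusion: everyone with an injury in at least one source.
total = hospital + crash_report - both

share_both = both / total
share_hospital = hospital / total
share_crash_report = crash_report / total

print(f"total injured: {total:,}")
print(f"in both sources: {share_both:.1%}")
print(f"hospital data alone captures: {share_hospital:.1%}")
print(f"crash reports alone capture: {share_crash_report:.1%}")


def same_crash(crash_date, admission_date, max_days=2):
    """Hypothetical linkage rule: the crash report date precedes the
    hospital admission date by no more than `max_days` days.
    A same-day admission is assumed to count as a match."""
    delta = (admission_date - crash_date).days
    return 0 <= delta <= max_days


print(same_crash(date(2016, 5, 1), date(2016, 5, 3)))  # two days later
print(same_crash(date(2016, 5, 1), date(2016, 5, 4)))  # three days later
```

The computed total reproduces the 262,338 figure, and each single-source share falls between 62% and 65%, which is the basis for the "approximately two-thirds" statement in the Conclusions.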
Project description:Background: Smith-Magenis syndrome (SMS) is a developmental disability/multiple congenital anomaly disorder resulting from haploinsufficiency of RAI1. It is characterized by distinctive facial features, brachydactyly, sleep disturbances, and stereotypic behaviors. Methods: We investigated a cohort of 15 individuals with a clinical suspicion of SMS who showed neither deletion in the SMS critical region nor damaging variants in RAI1 using whole exome sequencing. A combination of network analysis (co-expression and biomedical text mining), transcriptomics, and circularized chromatin conformation capture (4C-seq) was applied to verify whether modified genes are part of the same disease network as known SMS-causing genes. Results: Potentially deleterious variants were identified in nine of these individuals using whole-exome sequencing. Eight of these changes affect KMT2D, ZEB2, MAP2K2, GLDC, CASK, MECP2, KDM5C, and POGZ, known to be associated with Kabuki syndrome 1, Mowat-Wilson syndrome, cardiofaciocutaneous syndrome, glycine encephalopathy, mental retardation and microcephaly with pontine and cerebellar hypoplasia, X-linked mental retardation 13, X-linked mental retardation Claes-Jensen type, and White-Sutton syndrome, respectively. The ninth individual carries a de novo variant in JAKMIP1, a regulator of neuronal translation that was recently found deleted in a patient with autism spectrum disorder. Analyses of co-expression and biomedical text mining suggest that these pathologies and SMS are part of the same disease network. Further support for this hypothesis was obtained from transcriptome profiling that showed that the expression levels of both Zeb2 and Map2k2 are perturbed in Rai1 -/- mice.
As an orthogonal approach to potentially contributory disease gene variants, we used chromatin conformation capture to reveal chromatin contacts between RAI1 and the loci flanking ZEB2 and GLDC, as well as between RAI1 and human orthologs of the genes that show perturbed expression in our Rai1 -/- mouse model. Conclusions: These holistic studies of RAI1 and its interactions allow insights into SMS and other disorders associated with intellectual disability and behavioral abnormalities. Our findings support a pan-genomic approach to the molecular diagnosis of a distinctive disorder.
Project description:Lipomas are benign fatty tumors with a high prevalence; they occur mostly in adults and have a good prognosis. Until now, the cause of lipoma occurrence has not been identified. We performed whole exome sequencing to define the mutational spectrum in ten lipoma patients along with their matching control samples. We present genomic insight into the development of lipomas, the most common benign tumor of soft tissue. Our analysis identified 412 somatic variants, including missense mutations, splice site variants, frameshift indels, and stop-gain/stop-loss variants. Copy number variation analysis highlighted minor aberrations in patients. Kinase genes and transcription factors critical for cell proliferation and survival were among the validated mutated genes. Pathway analysis revealed enrichment of calcium, Wnt and phospholipase D signaling in patients. In conclusion, whole exome sequencing in lipomas identified mutations in genes with a possible role in the development and progression of lipomas.