Dataset Information

Improving somatic variant identification through integration of genome and exome data.

ABSTRACT:

SUBMITTER: Vijayan V

PROVIDER: S-EPMC5657037 | biostudies-literature | 2017 Oct

REPOSITORIES: biostudies-literature

Publications

Improving somatic variant identification through integration of genome and exome data.

Vinaya Vijayan, Siu-Ming Yiu, Liqing Zhang

BMC Genomics, 2017 Oct 16, Suppl 7

PMID: 29513195

Similar Datasets

Project description:

Motivation: The development of cost-effective next-generation sequencing methods has spurred the development of high-throughput bioinformatics tools for detection of sequence variation. With many disparate variant-calling algorithms available, investigators must ask, 'Which method is best for my data?' Machine learning research has shown that so-called ensemble methods that combine the output of multiple models can dramatically improve classifier performance. Here we describe a novel variant-calling approach based on an ensemble of variant-calling algorithms, which we term the Consensus Genotyper for Exome Sequencing (CGES). CGES uses a two-stage voting scheme among four algorithm implementations. While our ensemble method can accept variants generated by any variant-calling algorithm, we used GATK2.8, SAMtools, FreeBayes and Atlas-SNP2 in building CGES because of their performance, widespread adoption and diverse but complementary algorithms.

Results: We apply CGES to 132 samples sequenced at the Hudson Alpha Institute for Biotechnology (HAIB, Huntsville, AL) using the Nimblegen Exome Capture and Illumina sequencing technology. Our sample set consisted of 40 complete trios, two families of four, one parent-child duo and two unrelated individuals. CGES yielded the fewest total variant calls (N(CGES) = 139 897), the highest Ts/Tv ratio (3.02), the lowest Mendelian error rate across all genotypes (0.028%), the highest rediscovery rate from the Exome Variant Server (EVS; 89.3%) and 1000 Genomes (1KG; 84.1%) and the highest positive predictive value (PPV; 96.1%) for a random sample of previously validated de novo variants. We describe these and other quality control (QC) metrics from consensus data and explain how the CGES pipeline can be used to generate call sets of varying quality stringency, including consensus calls present across all four algorithms, calls that are consistent across any three out of four algorithms, calls that are consistent across any two out of four algorithms or a more liberal set of all calls made by any algorithm.

Availability and implementation: To enable accessible, efficient and reproducible analysis, we implement CGES both as a stand-alone command line tool available for download in GitHub and as a set of Galaxy tools and workflows configured to execute on parallel computers.

Supplementary information: Supplementary data are available at Bioinformatics online.
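The stringency tiers described above (all four callers, any three, any two, any one) can be illustrated with a simple vote count. This is a minimal sketch, not the CGES implementation: the abstract does not spell out the two-stage voting scheme, so a single-stage count stands in for it here. The function name `consensus_calls`, the `(chrom, pos, ref, alt)` variant key and the toy call sets are all assumptions made for illustration.

```python
from collections import Counter

def consensus_calls(calls_by_caller, min_callers):
    """Return the variants reported by at least `min_callers` callers.

    Simplified single-stage vote; NOT the two-stage CGES scheme.
    """
    support = Counter()
    for variants in calls_by_caller.values():
        # set() guards against a caller listing the same variant twice
        support.update(set(variants))
    return {v for v, n in support.items() if n >= min_callers}

# Hypothetical variants keyed by (chrom, pos, ref, alt)
a = ("1", 100, "A", "G")
b = ("1", 200, "C", "T")
c = ("2", 50, "G", "A")
d = ("3", 75, "T", "C")

# Hypothetical call sets; the caller names follow the abstract
toy_calls = {
    "GATK": {a, b, c},
    "SAMtools": {a, b},
    "FreeBayes": {a, b, d},
    "Atlas-SNP2": {a, c},
}

strict = consensus_calls(toy_calls, 4)    # called by all four
liberal = consensus_calls(toy_calls, 1)   # called by any algorithm
```

Raising `min_callers` shrinks the call set monotonically, which matches the trade-off the abstract describes between stringency and the number of calls.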

Project description:

Objective: The availability of complete and accurate crash injury data is critical to prevention and intervention efforts. Relying solely on hospital discharge data or police crash reports may result in a biased undercount of injuries. Linking hospital data with crash reports may allow for a more robust identification of injuries and an understanding of which populations may be missed in an analysis of one source. We used the New Jersey Safety and Health Outcomes (NJ-SHO) data warehouse to examine the share of the entire crash-injured population identified in each of the two data sources, overall and by age, race/ethnicity, sex, injury severity, and road user type.

Methods: We utilized 2016-2017 data from the NJ-SHO warehouse. We identified crash-involved individuals in hospital discharge data by applying the ICD-10-CM external cause of injury matrix. Among crash-involved individuals, we identified those with injury- or pain-related diagnosis codes as being injured. We also identified crash-involved individuals via crash report data and identified injuries using the KABCO scale. We jointly examined the two sources; injuries in the hospital discharge data were documented as being related to the same crash as injuries found in the crash report data if the date of the crash report preceded the date of hospital admission by no more than two days.

Results: In total, there were 262,338 crash-involved individuals with a documented injury in the hospital discharge data or on the crash report during the study period; 168,874 had an injury according to hospital discharge data, and 164,158 had an injury in crash report data. Only 70,694 (26.9%) had an injury in both sources. We observed differences by age, race/ethnicity, injury severity, and road user type: hospital discharge data captured a larger share of those ages 65+, those who were Black or Hispanic, those with higher severity injuries, and those who were bicyclists or motorcyclists.

Conclusions: Each data source in isolation captures approximately two-thirds of the entire crash-injured population; one source alone misses approximately one-third of injured individuals. Each source undercounts people in certain groups, so relying on one source alone may not allow for tailored prevention and intervention efforts.
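The linkage rule and the reported counts can be checked with a short sketch. The function `same_crash` is a hypothetical helper, not code from the study, and it assumes that a same-day admission counts as a match (the abstract only says the crash report precedes admission by no more than two days). The counts are those stated in the Results.

```python
from datetime import date

def same_crash(crash_date, admission_date, max_days=2):
    """True if admission falls 0..max_days days after the crash report.

    Assumption: a same-day admission (0 days) counts as a match.
    """
    delta = (admission_date - crash_date).days
    return 0 <= delta <= max_days

# Counts reported in the abstract
hospital = 168_874      # injured according to hospital discharge data
crash_report = 164_158  # injured according to crash report data
both = 70_694           # injured in both sources

# Inclusion-exclusion: people found in either source
total = hospital + crash_report - both  # 262,338, matching the abstract

share_hospital = hospital / total       # about 0.64
share_crash = crash_report / total      # about 0.63
share_both = both / total               # about 0.27
```

The inclusion-exclusion total reproduces the 262,338 individuals reported, and each single-source share lands near two-thirds, which is the basis for the conclusion that one source alone misses roughly a third of injured individuals.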

Project description:

Background: Smith-Magenis syndrome (SMS) is a developmental disability/multiple congenital anomaly disorder resulting from haploinsufficiency of RAI1. It is characterized by distinctive facial features, brachydactyly, sleep disturbances, and stereotypic behaviors.

Methods: We investigated a cohort of 15 individuals with a clinical suspicion of SMS who showed neither deletion in the SMS critical region nor damaging variants in RAI1 using whole exome sequencing. A combination of network analysis (co-expression and biomedical text mining), transcriptomics, and circularized chromatin conformation capture (4C-seq) was applied to verify whether modified genes are part of the same disease network as known SMS-causing genes.

Results: Potentially deleterious variants were identified in nine of these individuals using whole-exome sequencing. Eight of these changes affect KMT2D, ZEB2, MAP2K2, GLDC, CASK, MECP2, KDM5C, and POGZ, known to be associated with Kabuki syndrome 1, Mowat-Wilson syndrome, cardiofaciocutaneous syndrome, glycine encephalopathy, mental retardation and microcephaly with pontine and cerebellar hypoplasia, X-linked mental retardation 13, X-linked mental retardation Claes-Jensen type, and White-Sutton syndrome, respectively. The ninth individual carries a de novo variant in JAKMIP1, a regulator of neuronal translation that was recently found deleted in a patient with autism spectrum disorder. Analyses of co-expression and biomedical text mining suggest that these pathologies and SMS are part of the same disease network. Further support for this hypothesis was obtained from transcriptome profiling that showed that the expression levels of both Zeb2 and Map2k2 are perturbed in Rai1 -/- mice. As an orthogonal approach to potentially contributory disease gene variants, we used chromatin conformation capture to reveal chromatin contacts between RAI1 and the loci flanking ZEB2 and GLDC, as well as between RAI1 and human orthologs of the genes that show perturbed expression in our Rai1 -/- mouse model.

Conclusions: These holistic studies of RAI1 and its interactions allow insights into SMS and other disorders associated with intellectual disability and behavioral abnormalities. Our findings support a pan-genomic approach to the molecular diagnosis of a distinctive disorder.
