Dataset Information

Tradeoffs in alignment and assembly-based methods for structural variant detection with long-read sequencing data

ABSTRACT:

SUBMITTER: Liu Y

PROVIDER: S-EPMC10951360 | biostudies-literature | 2024 Jan

REPOSITORIES: biostudies-literature


Similar Datasets

Project description:

Background: Structural variant (SV) calling from DNA sequencing data has been challenging due to several factors, including the ambiguity of short-read alignments, multiple complex SVs in the same genomic region, and the lack of "truth" datasets for benchmarking. Additionally, caller choice, parameter settings, and alignment method are known to affect SV calling. However, the impact of FASTQ read order on SV calling has not been explored for long-read data.

Results: Here, we used PacBio DNA sequencing data from 15 Caenorhabditis elegans strains and four Arabidopsis thaliana ecotypes to evaluate the sensitivity of different SV callers to FASTQ read order. Comparisons of variant call format files generated from the original and permuted FASTQ files demonstrated that the order of input data affected the SVs predicted by each caller. In particular, pbsv was highly sensitive to the order of the input data, especially at the highest depths, where over 70% of the SV calls generated from pairs of differently ordered FASTQ files were in disagreement. These results demonstrate that read order sensitivity is a complex, multifactorial process, as the differences observed both within and between species varied considerably according to the specific combination of aligner, SV caller, and sequencing depth. In addition to the SV callers being sensitive to the input data order, the SAMtools alignment sorting algorithm was identified as a source of variability following read order randomization.

Conclusion: The results of this study highlight the sensitivity of SV calling to the order of reads encoded in FASTQ files, which has not been recognized in long-read approaches. These findings have implications for the replication of SV studies and the development of consistent SV calling protocols. Our study suggests that researchers should pay attention to the input order sensitivity of read alignment sorting methods when analyzing long-read sequencing data for SV calling, as mitigating a source of variability could facilitate future replication work. These results also raise important questions surrounding the relationship between SV caller read order sensitivity and tool performance. Therefore, tool developers should also consider input order sensitivity as a potential source of variability during the development and benchmarking of new and improved methods for SV calling.

Project description:

Background: Parkin RBR E3 ubiquitin-protein ligase (PRKN) mutations are the most common cause of young onset and autosomal recessive Parkinson's disease (PD). PRKN is located in FRA6E, which is one of the common fragile sites in the human genome, making this region prone to structural variants. However, complex structural variants such as inversions of PRKN are seldom reported, suggesting that there are potentially unrevealed complex pathogenic PRKN structural variants.

Objectives: To identify complex structural variants in PRKN using long-read sequencing.

Methods: We investigated the genetic cause of monozygotic twins presenting with a young onset dystonia-parkinsonism using targeted sequencing, whole exome sequencing, multiple ligation probe amplification, and long-read sequencing. We assessed the presence and frequency of complex inversions overlapping PRKN using whole-genome sequencing data of Accelerating Medicines Partnership Parkinson's disease (AMP-PD) and United Kingdom (UK)-Biobank datasets.

Results: Multiple ligation probe amplification identified a heterozygous exon three deletion in PRKN and long-read sequencing identified a large novel inversion spanning over 7 Mb, including a large part of the coding DNA sequence of PRKN. We could diagnose the affected subjects as compound heterozygous carriers of PRKN. We analyzed whole-genome sequencing data of 43,538 participants of the UK-Biobank and 4941 participants of the AMP-PD datasets. Nine inversions in the UK-Biobank and two in AMP-PD were identified and were considered potentially damaging and likely to affect PRKN expression.

Conclusions: This is the first report describing a large 7 Mb inversion involving breakpoints outside of PRKN. This study highlights the importance of using long-read sequencing for structural variant analysis in unresolved young-onset PD cases.

© 2023 The Authors. Movement Disorders published by Wiley Periodicals LLC on behalf of International Parkinson and Movement Disorder Society. This article has been contributed to by U.S. Government employees and their work is in the public domain in the USA.

