Dataset Information

RecoverY: k-mer-based read classification for Y-chromosome-specific sequencing and assembly.

ABSTRACT:

Motivation

The haploid mammalian Y chromosome is usually under-represented in genome assemblies because of its high repeat content and the low sequencing depth that results from its haploid state. One strategy to ameliorate the low coverage of Y sequences is to experimentally enrich Y-specific material before assembly. As the enrichment process is imperfect, algorithms are needed to identify putative Y-specific reads prior to downstream assembly. A strategy that uses k-mer abundances to identify such reads was used to assemble the gorilla Y. However, the strategy required the manual setting of key parameters, a time-consuming process leading to sub-optimal assemblies.

Results

We develop a method, RecoverY, that selects Y-specific reads by automatically choosing the abundance level at which a k-mer is deemed to originate from the Y. This algorithm uses prior knowledge about the Y chromosome of a related species or known Y transcript sequences. We evaluate RecoverY on both simulated and real data, for human and gorilla, and investigate its robustness to important parameters. We show that RecoverY leads to a vastly superior assembly compared to alternative strategies of filtering the reads or contigs. Compared to the preliminary strategy used by Tomaszkiewicz et al., we achieve a 33% improvement in assembly size and a 20% improvement in the NG50, demonstrating the power of automatic parameter selection.
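
The general idea of abundance-based read selection can be illustrated with a minimal sketch. This is not the RecoverY implementation: the function names, the quantile rule for picking the cutoff, and the `min_fraction` parameter are all assumptions made for illustration. The sketch assumes that, after enrichment, k-mers from the Y are more abundant in the reads than k-mers from contaminating sequence, and that a set of trusted Y sequences (for example, known Y transcripts) is available to calibrate the cutoff.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for simplicity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

def choose_threshold(counts, trusted_seqs, k, quantile=0.05):
    """Pick an abundance cutoff from k-mers of trusted Y sequences.

    Hypothetical rule: take a low quantile of the read abundances of
    trusted k-mers that actually occur in the reads.
    """
    abundances = sorted(
        counts[km] for s in trusted_seqs for km in kmers(s, k) if km in counts
    )
    if not abundances:
        raise ValueError("no trusted k-mers found in the reads")
    return abundances[int(quantile * (len(abundances) - 1))]

def select_reads(reads, counts, k, threshold, min_fraction=0.5):
    """Keep reads in which enough k-mers reach the abundance cutoff."""
    kept = []
    for read in reads:
        kms = list(kmers(read, k))
        if not kms:
            continue  # read shorter than k
        hits = sum(counts[km] >= threshold for km in kms)
        if hits / len(kms) >= min_fraction:
            kept.append(read)
    return kept

# Toy data: five copies of an "enriched" read and one contaminant read.
reads = ["ACGTACGTAC"] * 5 + ["TTTTGGGGCC"]
k = 4
counts = count_kmers(reads, k)
threshold = choose_threshold(counts, ["ACGTACGTAC"], k)
kept = select_reads(reads, counts, k, threshold)
```

In this toy run the cutoff is derived from the trusted sequence alone, so no abundance level has to be set by hand; the low-abundance contaminant read falls below it and is discarded.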

Availability and implementation

Our tool RecoverY is freely available at https://github.com/makovalab-psu/RecoverY.

Contact

kmakova@bx.psu.edu or pashadag@cse.psu.edu.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Rangavittal S

PROVIDER: S-EPMC6030959 | biostudies-literature | 2018 Apr

REPOSITORIES: biostudies-literature

Publications

RecoverY: k-mer-based read classification for Y-chromosome-specific sequencing and assembly.

Samarth Rangavittal, Robert S. Harris, Monika Cechova, Marta Tomaszkiewicz, Rayan Chikhi, Kateryna D. Makova, Paul Medvedev

Bioinformatics (Oxford, England), 2018-04-01, issue 7

PMID: 29194476

Similar Datasets

Ultraplexing: increasing the efficiency of long-read sequencing for hybrid assembly with k-mer-based multiplexing.
| S-EPMC7071681 | biostudies-literature

Compact representation of k-mer de Bruijn graphs for genome read assembly.
| S-EPMC4015147 | biostudies-literature

Near-chromosome level genome assembly of the fruit pest Drosophila suzukii using long-read sequencing.
| S-EPMC7343843 | biostudies-literature

Chromosome-scale assembly of the Sparassis latifolia genome obtained using long-read and Hi-C sequencing.
| S-EPMC8496284 | biostudies-literature

MinION-based long-read sequencing and assembly extends the Caenorhabditis elegans reference genome.
| S-EPMC5793790 | biostudies-literature

Chromosome-level de novo assembly of the pig-tailed macaque genome using linked-read sequencing and HiC proximity scaffolding.
| S-EPMC7350979 | biostudies-literature

Classification of salivary bacteriome in asymptomatic COVID-19 cases based on long-read nanopore sequencing.
| S-EPMC9742750 | biostudies-literature

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
| S-EPMC5411767 | biostudies-literature

MetaTransformer: deep metagenomic sequencing read classification using self-attention models.
| S-EPMC10495543 | biostudies-literature

FQSqueezer: k-mer-based compression of sequencing data.
| S-EPMC6969201 | biostudies-literature