Dataset Information

THiCweed: fast, sensitive detection of sequence features by clustering big datasets.


ABSTRACT: We present THiCweed, a new approach to analyzing transcription factor binding data from high-throughput chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. THiCweed clusters bound regions using a divisive hierarchical approach based on sequence similarity within sliding windows, while exploring both strands. THiCweed is specially geared toward data containing mixtures of motifs, which present a challenge to traditional motif-finders. Our implementation is significantly faster than standard motif-finding programs: it can process 30 000 peaks in 1-2 h on a single CPU core of a desktop computer. On synthetic data containing mixtures of motifs it is as accurate as, or more accurate than, all other tested programs. THiCweed performs best with large 'window' sizes (≥50 bp), much longer than typical binding sites (7-15 bp). On real data it not only recovers literature motifs but also uncovers complex sequence characteristics in flanking DNA, variant motifs and secondary motifs even when they occur in <5% of the input, all of which appear biologically relevant. We also find recurring sequence patterns across diverse ChIP-seq datasets, possibly related to chromatin architecture and looping. THiCweed thus goes beyond traditional motif finding to give new insights into genomic transcription factor-binding complexity.
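
The phrase "sequence similarity within sliding windows, while exploring both strands" can be illustrated with a minimal Python sketch. It is not the THiCweed implementation: the function names (reverse_complement, candidate_windows, match_count, best_window) are hypothetical, and the position-by-position match count is a toy stand-in for whatever similarity score the program actually uses. The sketch only shows how every window of a fixed width can be enumerated on both strands and compared against a reference string.

```python
from typing import Iterator, Tuple

# Map each base to its complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def candidate_windows(seq: str, width: int) -> Iterator[Tuple[int, str, str]]:
    """Yield (offset, strand, window) for every window of `width` on both strands."""
    for offset in range(len(seq) - width + 1):
        window = seq[offset:offset + width]
        yield offset, "+", window
        yield offset, "-", reverse_complement(window)


def match_count(a: str, b: str) -> int:
    """Toy similarity (an assumption, not the paper's score): matching positions."""
    return sum(x == y for x, y in zip(a, b))


def best_window(seq: str, reference: str) -> Tuple[int, str, str]:
    """Return the window, on either strand, that best matches `reference`.

    Raises ValueError if `seq` is shorter than `reference`.
    """
    return max(
        candidate_windows(seq, len(reference)),
        key=lambda item: match_count(item[2], reference),
    )
```

For example, best_window("TTTGCTTAA", "AAGC") returns (3, "-", "AAGC"): the best match lies on the reverse strand, which a forward-only scan would miss.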

SUBMITTER: Agrawal A 

PROVIDER: S-EPMC5861420 | biostudies-literature | 2018 Mar

REPOSITORIES: biostudies-literature

Publications

THiCweed: fast, sensitive detection of sequence features by clustering big datasets.

Ankit Agrawal, Snehal V. Sambare, Leelavati Narlikar, Rahul Siddharthan

Nucleic Acids Research, 2018-03-01, issue 5

Similar Datasets

S-EPMC3843501 | biostudies-literature
S-EPMC9881668 | biostudies-literature
S-EPMC4117525 | biostudies-literature
S-EPMC9141109 | biostudies-literature
S-EPMC4138177 | biostudies-literature
S-EPMC4481955 | biostudies-literature
S-EPMC9621593 | biostudies-literature
S-EPMC3052304 | biostudies-literature
S-EPMC10809904 | biostudies-literature
S-EPMC8956524 | biostudies-literature