A highly efficient and effective motif discovery method for ChIP-seq/ChIP-chip data using positional information.
ABSTRACT: Identification of DNA motifs from ChIP-seq/ChIP-chip [chromatin immunoprecipitation (ChIP)] data is a powerful method for understanding the transcriptional regulatory network. However, most established methods are designed for small sample sizes and are inefficient for ChIP data. Here we propose a new k-mer occurrence model to reflect the fact that functional DNA k-mers often cluster around ChIP peak summits. With this model, we introduce a new measure to discover functional k-mers. Using simulation, we demonstrate that our method is more robust against noise in ChIP data than available methods. A novel word clustering method is also implemented to group similar k-mers into position weight matrices (PWMs). Our method was applied to a diverse set of ChIP experiments to demonstrate its high sensitivity and specificity. Importantly, our method is much faster than several other methods for large sample sizes. Thus, we have developed an efficient and effective motif discovery method for ChIP experiments.
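The idea that functional k-mers cluster around peak summits can be illustrated with a minimal sketch. It is not the measure proposed in the abstract, which is not specified here. It assumes each input sequence is centered on its peak summit, and the function name `kmer_center_fraction` and the parameters `k` and `window` are hypothetical. For each k-mer it reports the fraction of occurrences whose center lies within `window` bp of the sequence midpoint.

```python
from collections import Counter


def kmer_center_fraction(seqs, k=6, window=3):
    """Fraction of each k-mer's occurrences that fall near the sequence midpoint.

    Assumption: every sequence in `seqs` is centered on its ChIP peak summit,
    so the midpoint of the string stands in for the summit position.
    """
    near_summit = Counter()  # occurrences within `window` bp of the midpoint
    total = Counter()        # all occurrences
    for seq in seqs:
        summit = len(seq) // 2
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            kmer_center = start + k // 2
            total[kmer] += 1
            if abs(kmer_center - summit) <= window:
                near_summit[kmer] += 1
    return {kmer: near_summit[kmer] / total[kmer] for kmer in total}


# Toy input: a hypothetical motif "GATACA" placed at the center of each sequence.
toy_seqs = ["TTTTTTTTTT" + "GATACA" + "TTTTTTTTTT"] * 3
fractions = kmer_center_fraction(toy_seqs, k=6, window=3)
```

In the toy input, every occurrence of the central k-mer falls inside the window, while the flanking poly-T k-mer never does. A real analysis would also need to correct for the different number of positions inside and outside the window.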
Project description:ChIP-seq reveals genomic regions where proteins, e.g. transcription factors (TFs), interact with DNA. A substantial fraction of these regions, however, do not contain the cognate binding site for the TF of interest. This phenomenon might be explained by protein-protein interactions and co-precipitation of interacting gene regulatory elements. We uniformly processed 3727 human ChIP-seq data sets and determined the cistrome of 292 TFs, as well as the distances between the TF binding motif centers and the ChIP-seq peak summits. ChIPSummitDB enables the analysis of ChIP-seq data using multiple approaches. The 292 cistromes and corresponding ChIP-seq peak sets can be browsed in GenomeView. Overlapping SNPs can be inspected in dbSNPView. Most importantly, the MotifView and PairShiftView pages show, respectively, the average distance between motif centers and overlapping ChIP-seq peak summits and the distributions of these distances. In addition to providing a comprehensive human TF binding site collection, the ChIPSummitDB database and web interface allow for the examination of the topological arrangement of TF complexes genome-wide. ChIPSummitDB is freely accessible at http://summit.med.unideb.hu/summitdb/. The database will be regularly updated and extended with the newly available human and mouse ChIP-seq data sets.
Project description:Companion diagnostics for EGFR mutations have proved crucial for the efficacy of tyrosine kinase inhibitor targeted cancer therapies. Uncovering multiple mutations that occur in a minority of EGFR-mutated cells, which may be masked by noise from the majority of unmutated cells, is becoming an urgent clinical requirement. Here we present the validation of a microfluidic-chip-based method for detecting EGFR multi-mutations at the single-cell level. By trapping and immunofluorescently imaging single cells in specifically designed silicon microwells, the EGFR-expressing cells were easily identified. By lysing single cells in situ, the lysates of EGFR-expressing cells were retrieved without cross-contamination. Because noise from cells without EGFR expression was excluded, simple and cost-effective Sanger sequencing, rather than expensive deep sequencing of the whole cell population, could be used to discover multi-mutations. We verified the new method by precisely discovering the three most important EGFR drug-related mutations in a sample in which EGFR-mutated cells accounted for only a small percentage of the whole cell population. The microfluidic chip can reveal not only the existence of specific EGFR multi-mutations but also other valuable single-cell-level information: in which specific cells the mutations occurred, and whether different mutations coexist in the same cells. This microfluidic chip is a promising means of making simple and cost-effective Sanger sequencing a routine test before targeted cancer therapy.
Project description:A single chromatin immunoprecipitation (ChIP) sample does not provide enough DNA for hybridization to a genomic tiling array. A commonly used technique for amplifying the DNA obtained from ChIP assays is ligation-mediated PCR (LM-PCR). However, using this amplification method, we could not identify Oct4 binding sites on genomic tiling arrays representing 1% of the human genome (ENCODE arrays). In contrast, hybridization of a pool of 10 ChIP samples to the arrays produced reproducible binding patterns and low background signals. However, the pooling method would greatly increase the number of ChIP reactions needed to analyze the entire human genome. Therefore, we have adapted the GenomePlex whole genome amplification (WGA) method for use in ChIP-chip assays; detailed ChIP and amplification protocols used for these analyses are provided as supplementary material. When applied to ENCODE arrays, the products prepared using this new method resulted in an Oct4 binding pattern similar to that from the pooled Oct4 ChIP samples. Importantly, the signal-to-noise ratio obtained with the GenomePlex WGA method is superior to that obtained with the LM-PCR amplification method.
Project description:The representation and discovery of transcription factor (TF) sequence binding specificities is critical for understanding gene regulatory networks and interpreting the impact of disease-associated noncoding genetic variants. We present a novel TF binding motif representation, the k-mer set memory (KSM), which consists of a set of aligned k-mers that are overrepresented at TF binding sites, and a new method called KMAC for de novo discovery of KSMs. We find that KSMs more accurately predict in vivo binding sites than position weight matrix (PWM) models and other more complex motif models across a large set of ChIP-seq experiments. Furthermore, KSMs outperform PWMs and more complex motif models in predicting in vitro binding sites. KMAC also identifies correct motifs in more experiments than five state-of-the-art motif discovery methods. In addition, KSM-derived features outperform both PWM and deep learning model derived sequence features in predicting differential regulatory activities of expression quantitative trait loci (eQTL) alleles. Finally, we have applied KMAC to 1600 ENCODE TF ChIP-seq data sets and created a public resource of KSM and PWM motifs. We expect that the KSM representation and KMAC method will be valuable in characterizing TF binding specificities and in interpreting the effects of noncoding genetic variations.
Project description:In this paper, we present a new multi-focus microscope (MFM) system based on a phase mask and the HiLo algorithm, achieving high-speed (20 volumes per second), high-resolution, low-noise 3-D fluorescent imaging. During imaging, the emissions from the specimen at nine different depths are simultaneously modulated and focused to different regions on a single CCD chip, i.e., the CCD chip is subdivided into nine regions to record images from the different selected depths. Next, the HiLo algorithm is applied to remove the background noise and to form clean 3-D images. To visualize larger volumes, the nine layers are scanned axially, realizing fast 3-D imaging. In the imaging experiments, a mouse kidney sample of ~60 × 60 × 16 μm³ is visualized with only 10 raw images, demonstrating substantially enhanced resolution and contrast as well as suppressed background noise. The new method will find important applications in 3-D fluorescent imaging, e.g., recording fast dynamic events at multiple depths in vivo.
Project description:The technique of chromatin immunoprecipitation (ChIP) is a powerful method for identifying in vivo DNA binding sites of transcription factors and for studying chromatin modifications. Unfortunately, the large number of cells needed for the standard ChIP protocol has hindered the analysis of many biologically interesting cell populations that are difficult to obtain in large numbers. New ChIP methods involving the use of carrier chromatin have been developed that allow the one-gene-at-a-time analysis of very small numbers of cells. However, such methods are not useful if the resultant sample will be applied to genomic microarrays or used in ChIP-sequencing assays. Therefore, we have miniaturized the ChIP protocol such that as few as 10,000 cells (without the addition of carrier reagents) can be used to obtain enough sample material to analyze the entire human genome. We demonstrate the reproducibility of this MicroChIP technique using 2.1 million feature high-density oligonucleotide arrays and antibodies to RNA polymerase II and to histone H3 trimethylated on lysine 27 or lysine 9.
Project description:Background: Transcription factor knockout microarrays (TFKMs) provide useful information about gene regulation. By using statistical methods to detect genes that are differentially expressed between the microarray data of the mutant and wild-type strains, the knockout targets of the knocked-out TF can be identified. However, the identified TF knockout targets may contain a certain number of false positives due to the experimental noise inherent in high-throughput microarray technology. Even if the identified TF knockout targets are true, the molecular mechanisms by which a TF regulates its knockout targets remain unknown with this kind of statistical approach. Results: To solve these two problems, we developed a method to filter out the false positives in the original TF knockout targets (identified by statistical approaches) so that the biologically interpretable TF knockout targets can be extracted. Our method can further generate experimentally testable hypotheses about the molecular mechanisms by which a TF regulates its biologically interpretable knockout targets. The details of our method are as follows. First, a TF binding network was constructed using the ChIP-chip data deposited in the YEASTRACT database. Then, each original TF knockout target is said to be biologically interpretable if a path (in the TF binding network) from the knocked-out TF to this target can be identified by our path search algorithm. The identified path explains how the TF may regulate this target, either directly by binding to its promoter or indirectly through intermediate TFs. After all the original TF knockout targets have been checked, the biologically interpretable ones are extracted and the false positives are filtered out. We validated the biological significance of our refined (i.e., biologically interpretable) TF knockout targets by assessing their functional enrichment, expression coherence, and the prevalence of protein-protein interactions. Our refined TF knockout targets outperform the original TF knockout targets across all measures. Conclusions: By jointly analyzing the TFKM and ChIP-chip data, our method can extract the biologically interpretable TF knockout targets by identifying paths (in the TF binding network) from the knocked-out TF to these targets. The identified paths form experimentally testable hypotheses regarding the molecular mechanisms by which a TF may regulate its knockout targets. About seven hundred hypotheses generated by our method have been experimentally validated in the literature. Our work demonstrates that integrating different data sources is a powerful approach to studying complex biological systems.
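The path-based filtering described in this abstract can be sketched as a graph search. The abstract does not say which path search algorithm the authors used, so breadth-first search is an assumption here, and the network layout (a dictionary mapping each TF to the genes whose promoters it binds) and all names are hypothetical.

```python
from collections import deque


def find_regulatory_path(binding_network, knocked_out_tf, target):
    """Return a shortest path from the knocked-out TF to the target, or None.

    Assumption: `binding_network` maps each TF name to an iterable of genes
    (which may themselves be TFs) whose promoters it binds.
    """
    queue = deque([[knocked_out_tf]])
    visited = {knocked_out_tf}
    while queue:
        path = queue.popleft()
        # sorted() only makes the traversal order reproducible
        for bound_gene in sorted(binding_network.get(path[-1], ())):
            if bound_gene == target:
                return path + [bound_gene]
            if bound_gene not in visited:
                visited.add(bound_gene)
                queue.append(path + [bound_gene])
    return None


def interpretable_targets(binding_network, knocked_out_tf, knockout_targets):
    """Keep only the knockout targets reachable from the knocked-out TF."""
    return [
        gene for gene in knockout_targets
        if find_regulatory_path(binding_network, knocked_out_tf, gene) is not None
    ]


# Toy network with made-up names: TF_A binds TF_B and gene1; TF_B binds gene2.
toy_network = {"TF_A": {"TF_B", "gene1"}, "TF_B": {"gene2"}}
indirect_path = find_regulatory_path(toy_network, "TF_A", "gene2")
kept = interpretable_targets(toy_network, "TF_A", ["gene1", "gene2", "gene3"])
```

In the toy network, `gene1` is a direct target, `gene2` is reached through the intermediate `TF_B`, and `gene3` has no path and would be treated as a likely false positive.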
Project description:Efficient and reliable sample delivery has remained one of the bottlenecks for serial crystallography experiments. Compared with other methods, fixed-target sample delivery offers the advantage of significantly reduced sample consumption and shorter data collection times owing to higher hit rates. Here, a new method of on-chip crystallization is reported which allows the efficient and reproducible growth of large numbers of protein crystals directly on micro-patterned silicon chips for in-situ serial crystallography experiments. Crystals are grown by sitting-drop vapor diffusion and previously established crystallization conditions can be directly applied. By reducing the number of crystal-handling steps, the method is particularly well suited for sensitive crystal systems. Excessive mother liquor can be efficiently removed from the crystals by blotting, and no sealing of the fixed-target sample holders is required to prevent the crystals from dehydrating. As a consequence, 'naked' crystals are obtained on the chip, resulting in very low background scattering levels and making the crystals highly accessible for external manipulation such as the application of ligand solutions. Serial diffraction experiments carried out at cryogenic temperatures at a synchrotron and at room temperature at an X-ray free-electron laser yielded high-quality X-ray structures of the human membrane protein aquaporin 2 and two new ligand-bound structures of thermolysin and the human kinase DRAK2. The results highlight the applicability of the method for future high-throughput on-chip screening of pharmaceutical compounds.
Project description:ChIP-seq is a powerful technology for detecting genomic regions where a protein of interest interacts with DNA. ChIP-seq data for mapping transcription factor binding sites (TFBSs) have a characteristic pattern: around each binding site, sequence reads aligned to the forward and reverse strands of the reference genome form two separate peaks shifted away from each other, and the true binding site is located between these two peaks. While it has been shown previously that the accuracy and resolution of binding site detection can be improved by modeling this pattern, efficient methods that fully utilize the information in the TFBS detection procedure are unavailable. We present PolyaPeak, a new method to improve TFBS detection by incorporating the peak shape information. PolyaPeak describes peak shapes using a flexible Pólya model. The shapes are automatically learnt from the data using a Minorization-Maximization (MM) algorithm, then integrated with the read count information via a hierarchical model to distinguish true binding sites from background noise. Extensive real data analyses show that PolyaPeak is capable of robustly improving TFBS detection compared with existing methods. An R package is freely available.
Project description:Genome-wide association studies are revolutionizing the search for the genes underlying human complex diseases. The main decisions to be made at the design stage of these studies are the choice of the commercial genotyping chip to be used and the numbers of case and control samples to be genotyped. The most common method of comparing different chips is using a measure of coverage, but this fails to properly account for the effects of sample size, the genetic model of the disease, and linkage disequilibrium between SNPs. In this paper, we argue that the statistical power to detect a causative variant should be the major criterion in study design. Because of the complicated pattern of linkage disequilibrium (LD) in the human genome, power cannot be calculated analytically and must instead be assessed by simulation. We describe in detail a method of simulating case-control samples at a set of linked SNPs that replicates the patterns of LD in human populations, and we used it to assess power for a comprehensive set of available genotyping chips. Our results allow us to compare the performance of the chips to detect variants with different effect sizes and allele frequencies, look at how power changes with sample size in different populations or when using multi-marker tags and genotype imputation approaches, and how performance compares to a hypothetical chip that contains every SNP in HapMap. A main conclusion of this study is that marked differences in genome coverage may not translate into appreciable differences in power and that, when taking budgetary considerations into account, the most powerful design may not always correspond to the chip with the highest coverage. We also show that genotype imputation can be used to boost the power of many chips up to the level obtained from a hypothetical "complete" chip containing all the SNPs in HapMap. 
Our results have been encapsulated into an R software package that allows users to design future association studies and our methods provide a framework with which new chip sets can be evaluated.
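The simulation-based approach to power can be illustrated with a much simplified sketch. Unlike the method in the abstract, it ignores linkage disequilibrium and tests a single causal SNP directly. It assumes a multiplicative risk model and a rare disease, so that controls carry the population allele frequency. All function and parameter names are hypothetical, and this is not the authors' R package.

```python
import random

CHI2_CRIT_1DF = 3.841  # upper 5% point of the chi-square distribution, 1 df


def allelic_chi2(case_alt, case_alleles, ctrl_alt, ctrl_alleles):
    """Pearson chi-square statistic for a 2x2 table of allele counts."""
    a, b = case_alt, case_alleles - case_alt
    c, d = ctrl_alt, ctrl_alleles - ctrl_alt
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a zero margin carries no evidence of association
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom


def estimate_power(freq, relative_risk, n_cases, n_controls, n_reps=200, seed=0):
    """Fraction of simulated studies in which the allelic test exceeds the cutoff.

    Assumptions: multiplicative allelic risk, rare disease, single tested SNP.
    """
    rng = random.Random(seed)
    # Risk-allele frequency among cases under the multiplicative model
    case_freq = freq * relative_risk / (freq * relative_risk + (1 - freq))
    significant = 0
    for _ in range(n_reps):
        case_alt = sum(rng.random() < case_freq for _ in range(2 * n_cases))
        ctrl_alt = sum(rng.random() < freq for _ in range(2 * n_controls))
        stat = allelic_chi2(case_alt, 2 * n_cases, ctrl_alt, 2 * n_controls)
        if stat > CHI2_CRIT_1DF:
            significant += 1
    return significant / n_reps


power_strong_effect = estimate_power(0.3, 2.0, n_cases=500, n_controls=500)
power_null = estimate_power(0.3, 1.0, n_cases=500, n_controls=500)
```

With a relative risk of 1 the estimate should sit near the nominal 5% false-positive rate, which is a quick sanity check on the simulation. The 5% cutoff is used only for illustration; a genome-wide study would apply a far stricter significance threshold.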