A Hybrid Approach for CpG Island Detection in the Human Genome.
ABSTRACT: BACKGROUND: CpG islands have been demonstrated to influence local chromatin structures and simplify the regulation of gene activity. However, the accurate and rapid determination of CpG islands for whole DNA sequences remains experimentally and computationally challenging. METHODOLOGY/PRINCIPAL FINDINGS: A novel procedure is proposed to detect CpG islands by combining clustering technology with a sliding-window method based on particle swarm optimization (PSO). Clustering technology is used to detect the locations of all possible CpG islands and process the data, which obviates the extensive and unnecessary processing of DNA fragments and thereby improves the efficiency of the sliding-window based PSO search. This proposed approach, named ClusterPSO, provides versatile and highly sensitive detection of CpG islands in the human genome. In addition, the detection efficiency of ClusterPSO is compared with that of eight CpG island detection methods in the human genome. In this comparison, which covered sensitivity, specificity, accuracy, performance coefficient (PC), and correlation coefficient (CC), ClusterPSO showed superior detection ability among all of the tested methods. Moreover, the combination of clustering technology and the PSO method can successfully overcome their respective drawbacks while maintaining their advantages. Thus, clustering technology could be hybridized with the optimization algorithm to optimize CpG island detection. CONCLUSION/SIGNIFICANCE: The prediction accuracy of ClusterPSO was quite high, indicating that the combination of CpGcluster and PSO has several advantages over CpGcluster and PSO alone. In addition, ClusterPSO significantly reduced implementation time.
Project description:<h4>Background</h4>Regions with abundant GC nucleotides, a high CpG number, and a length greater than 200 bp in a genome are often referred to as CpG islands. These islands are usually located in the 5' end of genes. Recently, several algorithms for the prediction of CpG islands have been proposed.<h4>Methodology/principal findings</h4>We propose here a new method called CPSORL, which combines a complement particle swarm optimization algorithm with reinforcement learning to predict CpG islands more reliably. Several CpG island prediction tools equipped with the sliding window technique have been developed previously. However, the quality of their results seems to depend heavily on the chosen window sizes, and thus these methods leave room for improvement.<h4>Conclusions/significance</h4>Experimental results indicate that CPSORL provides a higher sensitivity and a higher correlation coefficient in all selected experimental contigs than the methods it was compared to (CpGIS, CpGcluster, CpGProd and CpGPlot). More CpG islands were identified in chromosomes 21 and 22 of the human genome than with the other methods from the literature. CPSORL also achieved the highest coverage rate (3.4%). CPSORL is an application for identifying promoter and TSS regions associated with CpG islands in the entire human genome. When compared to CpGcluster, the islands predicted by CPSORL covered a larger region in the TSS (12.2%) and promoter (26.1%) regions. If Alu sequences are considered, the islands predicted by CPSORL (Alu) covered a larger TSS (40.5%) and promoter (67.8%) region than CpGIS. Furthermore, CPSORL was used to verify that the average methylation density was 5.33% for CpG islands in the entire human genome.
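To make the sliding-window dependence concrete, here is a minimal sketch of a criteria-based window scan. It is not CPSORL or any of the compared tools. The thresholds (window of 200 bp, G+C content above 50%, observed/expected CpG ratio above 0.6) are the commonly cited Gardiner-Garden and Frommer values and are an assumption here; the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def candidate_windows(seq, window=200, step=1, min_gc=0.5, min_oe=0.6):
    """Return (start, end) of every window that passes both thresholds.

    The window size and step are free parameters; changing them changes
    which regions are reported, which is the sensitivity to window
    choice described in the text.
    """
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        if gc_content(w) > min_gc and cpg_obs_exp(w) > min_oe:
            hits.append((start, start + window))
    return hits
```

A practical tool would merge overlapping hits into islands and extend their boundaries; that step is omitted to keep the sketch short.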
Project description:<h4>Background</h4>Despite their involvement in the regulation of gene expression and their importance as genomic markers for promoter prediction, no objective standard exists for defining CpG islands (CGIs), since all current approaches rely on a large parameter space formed by the thresholds of length, CpG fraction and G+C content.<h4>Results</h4>Given the higher frequency of CpG dinucleotides at CGIs, as compared to bulk DNA, the distance distributions between neighboring CpGs should differ for bulk and island CpGs. A new algorithm (CpGcluster) is presented, based on the physical distance between neighboring CpGs on the chromosome and able to predict directly clusters of CpGs, while not depending on the subjective criteria mentioned above. By assigning a p-value to each of these clusters, the most statistically significant ones can be predicted as CGIs. CpGcluster was benchmarked against five other CGI finders by using a test sequence set assembled from an experimental CGI library. CpGcluster reached the highest overall accuracy values, while showing the lowest rate of false-positive predictions. Since a minimum-length threshold is not required, CpGcluster can find short but fully functional CGIs usually missed by other algorithms. The CGIs predicted by CpGcluster present the lowest degree of overlap with Alu retrotransposons and, simultaneously, the highest overlap with vertebrate Phylogenetic Conserved Elements (PhastCons). CpGcluster's CGIs overlapping with the Transcription Start Site (TSS) show the highest statistical significance, as compared to the islands in other genome locations, thus qualifying CpGcluster as a valuable tool in discriminating functional CGIs from the remaining islands in the bulk genome.<h4>Conclusion</h4>CpGcluster uses only integer arithmetic, thus being a fast and computationally efficient algorithm able to predict statistically significant clusters of CpG dinucleotides. 
Another outstanding feature is that all predicted CGIs start and end with a CpG dinucleotide, which should be appropriate for a genomic feature whose functionality is based precisely on CpG dinucleotides. The only search parameter in CpGcluster is the distance between two consecutive CpGs, in contrast to previous algorithms. Therefore, none of the main statistical properties of CpG islands (neither G+C content, CpG fraction nor length threshold) are needed as search parameters, which may lead to the high specificity and low overlap with spurious Alu elements observed for CpGcluster predictions.
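The distance-based idea can be illustrated with a short sketch: collect CpG positions, then chain neighbors whose spacing does not exceed a threshold. This is only an illustration under stated assumptions, not the CpGcluster implementation. How the threshold is chosen and the p-value assignment are not reproduced, and `max_gap`, `min_cpgs`, and the function names are hypothetical.

```python
def cpg_positions(seq):
    """0-based start positions of every CG dinucleotide in seq."""
    seq = seq.upper()
    positions = []
    i = seq.find("CG")
    while i != -1:
        positions.append(i)
        i = seq.find("CG", i + 1)
    return positions


def cluster_cpgs(seq, max_gap, min_cpgs=2):
    """Group consecutive CpGs whose start-to-start distance is <= max_gap.

    Returns half-open (start, end) intervals. Each interval begins at the
    C of its first CpG and ends just after the G of its last CpG, so every
    cluster starts and ends with a CpG dinucleotide.
    """
    clusters = []
    current = []
    for pos in cpg_positions(seq):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_cpgs:
                clusters.append((current[0], current[-1] + 2))
            current = []
        current.append(pos)
    if len(current) >= min_cpgs:
        clusters.append((current[0], current[-1] + 2))
    return clusters


# Toy sequence: two CpG pairs separated by a run of ten T bases.
toy = "CGACG" + "T" * 10 + "CGCG"
clusters = cluster_cpgs(toy, max_gap=5)
```

Note that no length, G+C content, or CpG fraction threshold appears anywhere in the sketch, which mirrors the point made in the text.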
Project description:<h4>Background</h4>CpG islands (CGIs), clusters of CpG dinucleotides in GC-rich regions, are often located in the 5' end of genes and considered gene markers. Hackenberg et al. (2006) recently developed a new algorithm, CpGcluster, which uses a completely different mathematical approach from previous traditional algorithms. Their evaluation suggests that CpGcluster provides a much more efficient approach to detecting functional clusters or islands of CpGs.<h4>Results</h4>We systematically compared CpGcluster with the traditional algorithm by Takai and Jones (2002). Our comparisons of (1) the number of islands versus the number of genes in a genome, (2) the distribution of islands in different genomic regions, (3) island length, (4) the distance between two neighboring islands, and (5) methylation status suggest that Takai and Jones' algorithm is overall more appropriate for identifying promoter-associated islands of CpGs in vertebrate genomes.<h4>Conclusion</h4>The generation of genome sequence and DNA methylation data is expected to accelerate greatly. The information in this study is important for its extensive utility in gene feature analysis and epigenomics including gene prediction and methylation chip design in different genomes.
Project description:CpG islands are genome subsequences with an unexpectedly high number of CG di-nucleotides. They are typically identified using filtering criteria (e.g., G+C%, observed vs. expected CpG ratio, and length) and are computed using sliding window methods. Most such studies mistakenly assume that an exhaustive search for CpG islands is achieved on the genome sequence of interest. We devise a Lexis diagram and explicitly show that filtering criteria-based definitions of CpG islands are mathematically incomplete and non-operational. These facts imply that the sliding window methods frequently fail to identify a large percentage of subsequences that meet the filtering criteria. We also demonstrate that an exhaustive search is computationally expensive. We develop the Hierarchical Factor Segmentation (HFS) algorithm, a pattern recognition technique with an adaptive model selection device, to overcome the incompleteness and non-operational drawbacks and to achieve effective computations for identifying CpG islands. The concept of a CpG island "core" is introduced and computed using the HFS algorithm, which is independent of any specific filtering criteria. Upon such a CpG island "core," a CpG island is constructed using a Lexis diagram. This two-step computational approach provides a nearly exhaustive search for CpG islands that can be practically implemented on whole chromosomes. In a simulation study realistically mimicking CpG island dynamics through a Hidden Markov Model, we demonstrate that this approach retains very high sensitivity and specificity, that is, very low rates of false negatives and false positives. Finally, we apply the HFS algorithm to identify CpG island cores on human chromosome 21.
Project description:BACKGROUND: Genomic islands play an important role in medical, methylation and biological studies. To explore these regions, we propose a CpG islands prediction analysis platform for genome sequence exploration (CpGPAP). RESULTS: CpGPAP is a web-based application that provides a user-friendly interface for predicting CpG islands in genome sequences or in user input sequences. The prediction algorithms supported in CpGPAP include complementary particle swarm optimization (CPSO), a complementary genetic algorithm (CGA) and other methods (CpGPlot, CpGProD and CpGIS) found in the literature. The CpGPAP platform is easy to use and has three main features: (1) selection of the prediction algorithm; (2) graphic visualization of results; and (3) application of related tools and dataset downloads. These features allow the user to easily view CpG island results and download the relevant island data. CpGPAP is freely available at http://bio.kuas.edu.tw/CpGPAP/. CONCLUSIONS: The platform's supported algorithms (CPSO and CGA) provide a higher sensitivity and a higher correlation coefficient when compared to CpGPlot, CpGProD, CpGIS, and CpGcluster over an entire chromosome.
Project description:Colorectal cancer (CRC) is one of the most common cancer types globally, with a 5-year survival rate of < 50% in China. Aberrant DNA methylation is one of the hallmarks of tumor initiation, progression, and metastasis. Here, we investigated the clinical performance of two differentially methylated regions (DMRs) in SDC2 CpG islands for the detection of CRC. A sliding window technique was used to identify the DMRs, and a methylation-specific PCR assay was used to assess the DMRs in 198 CRC samples and 54 normal controls. Two DMRs (DMR2 and DMR5) were identified using The Cancer Genome Atlas (TCGA) data, and the hypermethylation of DMR2 and DMR5 was detected in 90.91% (180/198) and 89.90% (178/198) of CRC samples, respectively. When DMR2 and DMR5 were combined, the sensitivity for CRC detection was 94.4%, higher than that of DMR2 or DMR5 alone. Based on the above results, we propose using DMR2 and DMR5 as sensitive biomarkers to detect CRC.
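The two single-marker detection rates follow directly from the sample counts given above. A small sketch of the arithmetic is shown here; the helper name `detection_rate` is hypothetical. The combined figure is not recomputed because its underlying count is not given in the text.

```python
def detection_rate(positives, total):
    """Percentage of samples in which hypermethylation was detected."""
    return 100.0 * positives / total


dmr2 = detection_rate(180, 198)  # DMR2: 180 of 198 CRC samples
dmr5 = detection_rate(178, 198)  # DMR5: 178 of 198 CRC samples

print(f"DMR2: {dmr2:.2f}%")  # prints "DMR2: 90.91%"
print(f"DMR5: {dmr5:.2f}%")  # prints "DMR5: 89.90%"
```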
Project description:BACKGROUND: Complete mitochondrial (mt) genome sequencing is becoming increasingly common for phylogenetic reconstruction and as a model for genome evolution. For long-template sequencing, such as that of the entire mtDNA, it is essential to design primers for polymerase chain reaction (PCR) amplicons that partly overlap each other. The presented chromosome walking strategy provides the overlapping design needed to solve the problem of unreliable sequencing data at the 5' end and enables effective sequencing. However, current algorithms and tools mostly focus on primer design for a local region of the genomic sequence. Accordingly, it is still challenging to provide primer sets for the entire mtDNA. METHODOLOGY/PRINCIPAL FINDINGS: The purpose of this study is to develop an integrated primer design algorithm for the entire mt genome in general, and for common primer sets for closely related species in particular. We use ClustalW to generate the multiple sequence alignment needed to find the conserved sequences in closely related species. These conserved sequences are suitable for designing the common primers for the entire mtDNA. Using the heuristic particle swarm optimization (PSO) algorithm, all the designed primers were computationally validated to fit the common primer design constraints, such as melting temperature, primer length and GC content, PCR product length, secondary structure, specificity, and terminal limitation. The overlap requirement for PCR amplicons in the entire mtDNA is satisfied by defining the overlapping region with the sliding window technique. Finally, primer sets were designed within the overlapping region. The primer sets for the entire mtDNA sequences were successfully demonstrated using two closely related fish species as an example. The pseudo code for the primer design algorithm is provided. 
CONCLUSIONS/SIGNIFICANCE: Our proposed sliding window-based PSO algorithm provides the necessary primer sets for amplification and sequencing of the entire mt genome.
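The overlap requirement described above can be pictured as tiling the template with fixed-length amplicons that share a margin with their neighbors. The sketch below only illustrates that tiling; it is not the authors' pseudo code, which is not reproduced in this abstract. It assumes a linear coordinate system (wrap-around on a circular mtDNA is ignored), and the function names and the example lengths are hypothetical.

```python
def overlapping_windows(genome_length, amplicon_length, overlap):
    """Tile [0, genome_length) with amplicons that overlap their neighbors.

    Returns half-open (start, end) intervals; the last one is truncated
    at the end of the sequence.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    if not 0 <= overlap < amplicon_length:
        raise ValueError("overlap must be in [0, amplicon_length)")
    step = amplicon_length - overlap
    windows = []
    start = 0
    while True:
        end = min(start + amplicon_length, genome_length)
        windows.append((start, end))
        if end == genome_length:
            break
        start += step
    return windows


def overlap_regions(windows):
    """Regions shared by each pair of adjacent amplicons.

    In the scheme described in the text, primers would be searched for
    within regions like these.
    """
    return [(nxt[0], cur[1]) for cur, nxt in zip(windows, windows[1:])]


# Hypothetical lengths, for illustration only.
tiles = overlapping_windows(genome_length=1000, amplicon_length=400, overlap=100)
shared = overlap_regions(tiles)
```

An optimizer such as PSO would then score candidate primers inside each shared region against the listed constraints; that scoring is outside the scope of this sketch.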
Project description:We used the 4C-Seq technique to characterize the genome-wide patterns of spatial contacts of several CpG islands located on chromosome 14 in cultured chicken lymphoid and erythroid cells. We observed a clear tendency for the spatial clustering of CpG islands present on the same and different chromosomes, regardless of the presence or absence of promoters within these CpG islands. Accordingly, we observed preferential spatial contacts between Sp1 binding motifs and other GC-rich genomic elements, including the DNA sequence motifs capable of forming G-quadruplexes. However, an anchor placed in a gene/CpG island-poor area formed spatial contacts with other gene/CpG island-poor areas on chromosome 14 and other chromosomes. These results corroborate the two-compartment model of the spatial organization of interphase chromosomes and suggest that the clustering of CpG islands constitutes an important determinant of the 3D organization of the eukaryotic genome in the cell nucleus. Using the ChIP-Seq technique, we mapped the genome-wide CTCF deposition sites in the chicken lymphoid and erythroid cells that were used for the 4C analysis. We observed a good correlation between the density of CTCF deposition sites and the level of 4C signals for the anchors located in CpG islands but not for an anchor located in a gene desert. It is thus possible that CTCF contributes to the clustering of CpG islands observed in our experiments.
Project description:Using the 4C-Seq experimental procedure, we have characterized, in cultured chicken lymphoid and erythroid cells, genome-wide patterns of spatial contacts of several CpG islands scattered along chromosome 14. A clear tendency for interaction of CpG islands present within the same and different chromosomes has been observed. Accordingly, preferential spatial contacts between Sp1 binding motifs and other GC-rich genomic elements, including DNA sequence motifs capable of forming G-quadruplexes, were demonstrated. On the other hand, an anchor placed in a gene/CpG island-poor area was found to form spatial contacts with other gene/CpG island-poor areas within chromosome 14 and other chromosomes. These results corroborate the two-compartment model of interphase chromosome spatial organization and suggest that clustering of CpG islands harboring promoters and origins of DNA replication constitutes an important determinant of the 3D organization of the eukaryotic genome in the cell nucleus. Using the ChIP-Seq experimental procedure, we have mapped genome-wide the CTCF deposition sites in chicken lymphoid and erythroid cells subjected to the 4C analysis. A good correlation between the density of these sites and the level of 4C signals was observed for the anchors located in CpG islands. It is thus possible that CTCF contributes to the clustering of CpG islands revealed in our experiments. CTCF deposition sites in chicken lymphoid and erythroid (induced and non-induced) cells.
Project description:Barrett's esophagus (BE) is a precursor of esophageal adenocarcinoma (EAC). To identify novel tumor suppressors involved in esophageal carcinogenesis and potential biomarkers for the malignant progression of BE, we performed a genome-wide methylation profiling of BE and EAC tissues. Using Illumina's Infinium HumanMethylation27 BeadChip microarray, we examined the methylation status of 27 578 CpG sites in 94 normal esophageal (NE), 77 BE and 117 EAC tissue samples. The overall methylation of CpG sites within the CpG islands was higher, but outside of the CpG islands was lower in BE and EAC tissues than in NE tissues. Hierarchical clustering analysis showed an excellent separation of NE tissues from BE and EAC tissues; however, the clustering of BE and EAC tissues was less clear, suggesting that methylation occurs early during the progression of EAC. We confirmed many previously reported hypermethylated genes and identified a large number of novel hypermethylated genes in BE and EAC tissues, particularly genes encoding ADAM (A Disintegrin And Metalloproteinase) peptidase proteins, cadherins and protocadherins, and potassium voltage-gated channels. Pathway analysis showed that a number of channel and transporter activities were enriched for hypermethylated genes. We used pyrosequencing to validate selected candidate genes and found high correlations between the array and pyrosequencing data (rho > 0.8 for each validated gene). The differentially methylated genes and pathways may provide biological insights into the development and progression of BE and become potential biomarkers for the prediction and early detection of EAC.