ScHiCNorm: a software package to eliminate systematic biases in single-cell Hi-C data.
ABSTRACT: Summary: We built a software package, scHiCNorm, that uses zero-inflated and hurdle models to remove biases from single-cell Hi-C data. Our evaluations show that these models effectively eliminate systematic biases in single-cell Hi-C data, which better reveals cell-to-cell variance in chromosomal structure. Availability and implementation: scHiCNorm is available at http://dna.cs.miami.edu/scHiCNorm/. Perl scripts are provided to generate bias features. Pre-built bias features for human (hg19 and hg38) and mouse (mm9 and mm10) are available for download. R scripts can be downloaded to remove biases. Contact: email@example.com. Supplementary information: Supplementary data are available at Bioinformatics online.
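To make the distinction between the two model families concrete, here is a minimal sketch of a zero-inflated Poisson and a hurdle Poisson probability mass function. It is an illustration only: the function names and the parameters `lam`, `pi`, and `p0` are hypothetical, and the abstract does not say which count distributions or bias covariates scHiCNorm actually uses, so this does not reproduce the package's R scripts.

```python
import math


def poisson_pmf(k, lam):
    """Ordinary Poisson probability of observing k contacts."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson (illustrative).

    With probability pi the count is a structural zero; otherwise it is
    drawn from Poisson(lam), which can itself produce a zero.
    """
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def hurdle_pmf(k, lam, p0):
    """Hurdle Poisson (illustrative).

    Zeros occur with probability p0; every positive count comes from a
    zero-truncated Poisson(lam), so all zeros are explained by p0 alone.
    """
    if k == 0:
        return p0
    return (1.0 - p0) * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))
```

In a regression setting, `lam` (and possibly the zero probability) would be linked to bias features; the sketch above only shows how the two families treat zeros differently.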
Project description: Summary: We propose a parametric model, HiCNorm, to remove systematic biases in the raw Hi-C contact maps, resulting in a simple, fast, yet accurate normalization procedure. Compared with the existing Hi-C normalization method developed by Yaffe and Tanay, HiCNorm has fewer parameters, runs >1000 times faster and achieves higher reproducibility. Availability: Freely available on the web at: http://www.people.fas.harvard.edu/~junliu/HiCNorm/. Contact: firstname.lastname@example.org. Supplementary information: Supplementary data are available at Bioinformatics online.
Project description: The FANTOM5 consortium described the promoter-level expression atlas of human and mouse by using CAGE (Cap Analysis of Gene Expression) with single-molecule sequencing. In the original publications, the GRCh37/hg19 and NCBI37/mm9 assemblies were used as the reference genomes of human and mouse, respectively; later, the Genome Reference Consortium released the newer genome assemblies GRCh38/hg38 and GRCm38/mm10. To increase the utility of the atlas in future research, we reprocessed the data to make them available on the recent genome assemblies. The data include observed frequencies of transcription start sites (TSSs) based on the realignment of CAGE reads, and TSS peaks that are converted from those based on the previous reference. Annotations of the peak names were also updated based on the latest public databases. The reprocessed results enable us to examine frequencies of transcription initiation on the recent genome assemblies and to refer to promoters with updated information consistently across the genome assemblies.
Project description:Short-read sequencing enables assessment of genetic and biochemical traits of individual genomic regions, such as the location of genetic variation, protein binding and chemical modifications. Every region in a genome assembly has a property called 'mappability', which measures the extent to which it can be uniquely mapped by sequence reads. In regions of lower mappability, estimates of genomic and epigenomic characteristics from sequencing assays are less reliable. These regions have increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation. Bisulfite sequencing approaches used to identify DNA methylation exacerbate these problems by introducing large numbers of reads that map to multiple regions. Both to correct assumptions of uniformity in downstream analysis and to identify regions where the analysis is less reliable, it is necessary to know the mappability of both ordinary and bisulfite-converted genomes. We introduce the Umap software for identifying uniquely mappable regions of any genome. Its Bismap extension identifies mappability of the bisulfite-converted genome. A Umap and Bismap track hub for human genome assemblies GRCh37/hg19 and GRCh38/hg38, and mouse assemblies GRCm37/mm9 and GRCm38/mm10 is available at https://bismap.hoffmanlab.org for use with genome browsers.
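The notion of mappability can be illustrated with a toy calculation. The sketch below is a simplification under stated assumptions, not Umap's implementation: it treats a k-mer as "uniquely mappable" when its exact sequence occurs once on the forward strand of a single string, ignores the reverse strand and mismatches, and uses hypothetical function names. It assumes the sequence is at least `k` bases long.

```python
from collections import Counter


def unique_kmer_starts(genome, k):
    """For each start position, report whether its k-mer occurs exactly once.

    Assumption: exact forward-strand matching only (no aligner, no
    reverse complement), which a real mappability tool would not assume.
    """
    n_starts = len(genome) - k + 1
    counts = Counter(genome[i:i + k] for i in range(n_starts))
    return [counts[genome[i:i + k]] == 1 for i in range(n_starts)]


def mappability(genome, k):
    """Per-position score: fraction of overlapping k-mers that are unique."""
    uniq = unique_kmer_starts(genome, k)
    n = len(genome)
    scores = []
    for pos in range(n):
        lo = max(0, pos - k + 1)   # first k-mer start that covers pos
        hi = min(pos, n - k)       # last k-mer start that covers pos
        window = uniq[lo:hi + 1]
        scores.append(sum(window) / len(window))
    return scores


# Toy example: the leading poly-A run is covered only by repeated k-mers.
scores = mappability("AAAAACGT", 3)
```

Positions inside the repeated run score low, while positions covered only by unique k-mers score 1.0; a bisulfite-aware variant would apply the same idea to the converted sequence.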
Project description: BACKGROUND: The genome architecture mapping (GAM) technique can capture genome-wide chromatin interactions. However, besides the known systematic biases in the raw GAM data, we have found a new type of systematic bias. It is necessary to develop and evaluate effective normalization methods to remove all systematic biases in the raw GAM data. RESULTS: We have detected a new type of systematic bias, the fragment length bias, in the genome architecture mapping (GAM) data, which is significantly different from the bias of window detection frequency previously mentioned in the paper introducing the GAM method but is similar to the bias of distances between restriction sites existing in raw Hi-C data. We have found that the normalization method (a normalized variant of the linkage disequilibrium) used in the GAM paper is not able to effectively eliminate the new fragment length bias at 1 Mb resolution (slightly better at 30 kb resolution). We have developed an R package named normGAM for eliminating the new fragment length bias together with the other three biases existing in raw GAM data, which are the biases related to window detection frequency, mappability, and GC content. Five normalization methods have been implemented and included in the R package including Knight-Ruiz 2-norm (KR2, newly designed by us), normalized linkage disequilibrium (NLD), vanilla coverage (VC), sequential component normalization (SCN), and iterative correction and eigenvector decomposition (ICE). CONCLUSIONS: Based on our evaluations, the five normalization methods can eliminate the four biases existing in raw GAM data, with VC and KR2 performing better than the others. We have observed that the KR2-normalized GAM data have a higher correlation with the KR-normalized Hi-C data on the same cell samples indicating that the KR-related methods are better than the others for keeping the consistency between the GAM and Hi-C experiments.
Compared with the raw GAM data, the normalized GAM data are more consistent with the normalized distances from the fluorescence in situ hybridization (FISH) experiments. The source code of normGAM can be freely downloaded from http://dna.cs.miami.edu/normGAM/.
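Of the five methods listed, vanilla coverage is the simplest to show in a few lines. The following is a minimal sketch of the general idea (dividing each entry by the coverage of its row and its column); it is not taken from the normGAM source, the function name is hypothetical, and published variants differ in how they rescale the result.

```python
def vanilla_coverage(matrix):
    """Divide each contact count by its row sum and its column sum.

    Assumption: `matrix` is a square list of lists of non-negative
    counts. Rows or columns with zero coverage are left as zeros rather
    than dividing by zero.
    """
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    normalized = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if row_sums[i] > 0 and col_sums[j] > 0:
                normalized[i][j] = matrix[i][j] / (row_sums[i] * col_sums[j])
    return normalized


# Toy symmetric matrix in which the second window has twice the coverage.
raw = [[2, 2],
       [2, 6]]
norm = vanilla_coverage(raw)
```

Because a single pass does not fully equalize coverage, iterative schemes such as ICE or Knight-Ruiz repeat a correction of this kind until the row sums converge.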
Project description: MOTIVATION: In contrast to population-based Hi-C data, single-cell Hi-C data are zero-inflated and do not indicate the frequency of proximate DNA segments. There are a limited number of computational tools that can model the 3D structures of chromosomes based on single-cell Hi-C data. RESULTS: We developed single-cell lattice (SCL), a computational method to reconstruct 3D structures of chromosomes based on single-cell Hi-C data. We designed a loss function and a 2D Gaussian function specifically for the characteristics of single-cell Hi-C data. A chromosome is represented as beads-on-a-string and stored in a 3D cubic lattice. Metropolis-Hastings simulation and simulated annealing are used to simulate the structure and minimize the loss function. We evaluated the SCL-inferred 3D structures (at both 500 and 50 kb resolutions) using multiple criteria and compared them with the ones generated by another modeling software program. The results indicate that the 3D structures generated by SCL closely fit single-cell Hi-C data. We also found similar patterns of trans-chromosomal contact beads, Lamin-B1 enriched topologically associating domains (TADs), and H3K4me3 enriched TADs by mapping data from previous studies onto the SCL-inferred 3D structures. AVAILABILITY AND IMPLEMENTATION: The C++ source code of SCL is freely available at http://dna.cs.miami.edu/SCL/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Project description: MOTIVATION: High-resolution Hi-C data are indispensable for the studies of three-dimensional (3D) genome organization at kilobase level. However, generating high-resolution Hi-C data (e.g. 5 kb) by conducting Hi-C experiments needs millions of mammalian cells, which may eventually generate billions of paired-end reads with a high sequencing cost. Therefore, it will be important and helpful if we can enhance the resolutions of Hi-C data by computational methods. RESULTS: We developed a new computational method named HiCNN that used a 54-layer very deep convolutional neural network to enhance the resolutions of Hi-C data. The network contains both global and local residual learning with multiple speedup techniques included, resulting in fast convergence. We used mean squared errors and Pearson's correlation coefficients between real high-resolution and computationally predicted high-resolution Hi-C data to evaluate the method. The evaluation results show that HiCNN consistently outperforms HiCPlus, the only existing tool in the literature, when training and testing data are extracted from the same cell type (i.e. GM12878) and from two different cell types in the same or different species (i.e. GM12878 as training with K562 as testing, and GM12878 as training with CH12-LX as testing). We further found that the HiCNN-enhanced high-resolution Hi-C data are more consistent with real experimental high-resolution Hi-C data than HiCPlus-enhanced data in terms of indicating statistically significant interactions. Moreover, HiCNN can efficiently enhance low-resolution Hi-C data, which eventually helps recover two chromatin loops that were confirmed by 3D-FISH. AVAILABILITY AND IMPLEMENTATION: HiCNN is freely available at http://dna.cs.miami.edu/HiCNN/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
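The two evaluation metrics named above are standard and can be sketched directly. The code below is a generic illustration, not HiCNN's evaluation script: the function names are hypothetical, and it assumes the real and predicted contact values have already been flattened into equal-length lists (for example, all entries at a given genomic distance).

```python
import math


def mean_squared_error(real, predicted):
    """Average squared difference between paired contact values."""
    return sum((r - p) ** 2 for r, p in zip(real, predicted)) / len(real)


def pearson_correlation(real, predicted):
    """Pearson's r between paired contact values.

    Assumption: neither input is constant, otherwise the denominator
    would be zero.
    """
    n = len(real)
    mean_r = sum(real) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(real, predicted))
    var_r = sum((r - mean_r) ** 2 for r in real)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_r * var_p)


# Toy example: predictions are exactly twice the real values.
real = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
mse = mean_squared_error(real, predicted)
r = pearson_correlation(real, predicted)
```

The toy example shows why both metrics are reported together: a prediction can be perfectly correlated with the real data while still carrying a large squared error from a scale mismatch.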
Project description:MOTIVATION:With the development of chromatin conformation capture technology and its high-throughput derivative Hi-C sequencing, studies of the three-dimensional interactome of the genome that involve multiple Hi-C datasets are becoming available. To account for the technology-driven biases unique to each dataset, there is a distinct need for methods to jointly normalize multiple Hi-C datasets. Previous attempts at removing biases from Hi-C data have made use of techniques which normalize individual Hi-C datasets, or, at best, jointly normalize two datasets. RESULTS:Here, we present multiHiCcompare, a cyclic loess regression-based joint normalization technique for removing biases across multiple Hi-C datasets. In contrast to other normalization techniques, it properly handles the Hi-C-specific decay of chromatin interaction frequencies with the increasing distance between interacting regions. multiHiCcompare uses the general linear model framework for comparative analysis of multiple Hi-C datasets, adapted for the Hi-C-specific decay of chromatin interaction frequencies. multiHiCcompare outperforms other methods when detecting a priori known chromatin interaction differences from jointly normalized datasets. Applied to the analysis of auxin-treated versus untreated experiments, and CTCF depletion experiments, multiHiCcompare was able to recover the expected epigenetic and gene expression signatures of loss of chromatin interactions and reveal novel insights. AVAILABILITY AND IMPLEMENTATION:multiHiCcompare is freely available on GitHub and as a Bioconductor R package https://bioconductor.org/packages/multiHiCcompare. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.
Project description: Background: Changes in spatial chromatin interactions are now emerging as a unifying mechanism orchestrating the regulation of gene expression. Hi-C sequencing technology allows insight into chromatin interactions on a genome-wide scale. However, Hi-C data contains many DNA sequence- and technology-driven biases. These biases prevent effective comparison of chromatin interactions aimed at identifying genomic regions differentially interacting between, e.g., disease-normal states or different cell types. Several methods have been developed for normalizing individual Hi-C datasets. However, they fail to account for biases between two or more Hi-C datasets, hindering comparative analysis of chromatin interactions. Results: We developed a simple and effective method, HiCcompare, for the joint normalization and differential analysis of multiple Hi-C datasets. The method introduces a distance-centric analysis and visualization of the differences between two Hi-C datasets on a single plot that allows for a data-driven normalization of biases using locally weighted linear regression (loess). HiCcompare outperforms methods for normalizing individual Hi-C datasets and methods for differential analysis (diffHiC, FIND) in detecting a priori known chromatin interaction differences while preserving the detection of genomic structures, such as A/B compartments. Conclusions: HiCcompare is able to remove between-dataset bias present in Hi-C matrices. It also provides a user-friendly tool to allow the scientific community to perform direct comparisons between the growing number of pre-processed Hi-C datasets available at online repositories. HiCcompare is freely available as a Bioconductor R package https://bioconductor.org/packages/HiCcompare/ .
Project description: The Hi-C technology was designed to decode the three-dimensional conformation of the genome. Despite progress towards more and more accurate contact maps, several systematic biases have been demonstrated to affect the resulting data matrix. Here we report a new source of bias that can arise in tumor Hi-C data, which is related to the copy number of genomic DNA. To address this bias, we designed a chromosome-adjusted iterative correction method called caICB. Our caICB correction method leads to significant improvements when compared with the original iterative correction in terms of eliminating copy number bias. The method is available at https://bitbucket.org/mthjwu/hicapp. Contact: email@example.com. Supplementary information: Supplementary data are available at Bioinformatics online.
Project description: Capture Hi-C (CHi-C) is a state-of-the-art method for profiling chromosomal interactions involving targeted regions of interest (such as gene promoters) globally and at high resolution. Signal detection in CHi-C data involves a number of statistical challenges that are not observed when using other Hi-C-like techniques. We present a background model, and algorithms for normalisation and multiple testing that are specifically adapted to CHi-C experiments, in which many spatially dispersed regions are captured, such as in Promoter CHi-C. We implement these procedures in CHiCAGO (http://regulatorygenomicsgroup.org/chicago), an open-source package for robust interaction detection in CHi-C. We validate CHiCAGO by showing that promoter-interacting regions detected with this method are enriched for regulatory features and disease-associated SNPs. Three human CHi-C biological replicates were generated (comprising 1, 2 and 3 technical replicates). Two mouse CHi-C biological replicates (both comprising three technical replicates) and a mouse Hi-C dataset were also generated. The publicly available HiCUP pipeline (doi: 10.12688/f1000research.7334.1) was used to process the raw sequencing reads. This pipeline was used to map the read pairs against the mouse (mm9) and human (hg19) genomes, to filter experimental artefacts (such as circularized reads and re-ligations), and to remove duplicate reads. For the CHi-C data, the resulting BAM files were processed into CHiCAGO input files, retaining only those read pairs that mapped, at least on one end, to a captured bait. CHiCAGO then identified Hi-C restriction fragments that interact with captured baits at statistical significance.