ABSTRACT: SUMMARY: Human alpha satellite and satellite 2/3 make up several percent of the human genome. However, identifying these sequences with traditional algorithms is computationally intensive. Here we develop dna-brnn, a recurrent neural network that learns the sequences of the two classes of centromeric repeats. It achieves high similarity to RepeatMasker and is times faster. Dna-brnn explores a novel application of deep learning and may accelerate the study of the evolution of the two repeat classes. AVAILABILITY AND IMPLEMENTATION: https://github.com/lh3/dna-nn.
Project description:Begomoviruses (family Geminiviridae) are whitefly-transmitted, plant-infecting single-stranded DNA viruses that cause crop losses throughout the warmer parts of the World. Sweepoviruses are a phylogenetically distinct group of begomoviruses that infect plants of the family Convolvulaceae, including sweet potato (Ipomoea batatas). Two classes of subviral molecules are often associated with begomoviruses, particularly in the Old World; the betasatellites and the alphasatellites. An analysis of sweet potato and Ipomoea indica samples from Spain and Merremia dissecta samples from Venezuela identified small non-coding subviral molecules in association with several distinct sweepoviruses. The sequences of 18 clones were obtained and found to be structurally similar to tomato leaf curl virus-satellite (ToLCV-sat, the first DNA satellite identified in association with a begomovirus), with a region with significant sequence identity to the conserved region of betasatellites, an A-rich sequence, a predicted stem-loop structure containing the nonanucleotide TAATATTAC, and a second predicted stem-loop. These sweepovirus-associated satellites join an increasing number of ToLCV-sat-like non-coding satellites identified recently. Although sharing some features with betasatellites, evidence is provided to suggest that the ToLCV-sat-like satellites are distinct from betasatellites and should be considered a separate class of satellites, for which the collective name deltasatellites is proposed.
Project description:In higher eukaryotes, the DNA composition of centromeres displays a high degree of variation, even between chromosomes of a single species. However, the long-range organization of centromeric DNA apparently follows similar structural rules. In our study, a comparative analysis of the DNA at centromeric regions of Beta species, including cultivated and wild beets, was performed using a set of repetitive DNA sequences. Our results show that these regions in Beta genomes have a complex structure and consist of variable repetitive sequences, including satellite DNA, Ty3-gypsy-like retrotransposons, and microsatellites. Based on their molecular characterization and chromosomal distribution determined by fluorescent in situ hybridization (FISH), centromeric repeated DNA sequences were grouped into three classes. By high-resolution multicolor-FISH on pachytene chromosomes and extended DNA fibers we analyzed the long-range organization of centromeric DNA sequences, leading to a structural model of a centromeric region of the wild beet species Beta procumbens. The chromosomal mutants PRO1 and PAT2 contain a single wild beet minichromosome with centromere activity and provide, together with cloned centromeric DNA sequences, an experimental system toward the molecular isolation of individual plant centromeres. In particular, FISH to extended DNA fibers of the PRO1 minichromosome and pulsed-field gel electrophoresis of large restriction fragments enabled estimations of the array size, interspersion patterns, and higher order organization of these centromere-associated satellite families. Regarding the overall structure, Beta centromeric regions show similarities to their counterparts in the few animal and plant species in which centromeres have been analyzed in detail.
Project description:BACKGROUND: Dispersed repeats are a major component of eukaryotic genomes and drivers of genome evolution. Annotation of DNA sequences homologous to known repetitive elements has been mainly performed with the program REPEATMASKER. Sequences annotated by REPEATMASKER often correspond to fragments of repetitive elements resulting from the insertion of younger elements or other rearrangements. Although REPEATMASKER annotation is indispensable for studying genome biology, this annotation does not contain much information on the common origin of fossil fragments that share an insertion event, especially where clusters of nested insertions of repetitive elements have occurred. RESULTS: Here I present REANNOTATE, a computational tool to process REPEATMASKER annotation for automated i) defragmentation of dispersed repetitive elements, ii) resolution of the temporal order of insertions in clusters of nested elements, and iii) estimating the age of the elements, if they have long terminal repeats. I have re-annotated the repetitive content of human chromosomes, providing evidence for a recent expansion of satellite repeats on the Y chromosome and, from the retroviral age distribution, for a higher rate of evolution on the Y relative to autosomes. CONCLUSION: REANNOTATE is ready to process existing annotation for automated evolutionary analysis of all types of complex repeats in any genome. The tool is freely available under the GPL at http://www.bioinformatics.org/reannotate.
Project description:Centromeres are the chromosomal sites of assembly for kinetochores, the protein complexes that attach to spindle fibers and mediate separation of chromosomes to daughter cells in mitosis and meiosis. In most multicellular organisms, centromeres comprise a single specific family of tandem repeats, often 100-400 bp in length, found on every chromosome, typically in one location within heterochromatin. Drosophila melanogaster is unusual in that the heterochromatin contains many families of mostly short (5-12 bp) tandem repeats, none of which appear to be present at all centromeres, and none of which are found only at centromeres. Although centromere sequences from a minichromosome have been identified and candidate centromere sequences have been proposed, the DNA sequences at native Drosophila centromeres remain unknown. Here we use native chromatin immunoprecipitation to identify the centromeric sequences bound by the foundational kinetochore protein cenH3, known in vertebrates as CENP-A. In D. melanogaster, these sequences include a few families of 5- and 10-bp repeats, but in closely related D. simulans, the centromeres comprise more complex repeats. The results suggest that a recent expansion of short repeats has replaced more complex centromeric repeats in D. melanogaster.
Project description:Satellite DNA sequences consist of tandemly arranged repetitive units up to thousands of nucleotides long in head-to-tail orientation. The evolutionary processes by which satellites arise and evolve include unequal crossing over, gene conversion, transposition and extrachromosomal circular DNA formation. Large blocks of satellite DNA are often observed in heterochromatic regions of chromosomes and are a typical component of centromeric and telomeric regions. Satellite-rich loci may show specific banding patterns and facilitate chromosome identification and analysis of structural chromosome changes. Unlike many other genomes, nuclear genomes of banana (Musa spp.) are poor in satellite DNA and the information on this class of DNA remains limited. The banana cultivars are seed-sterile clones originating mostly from natural intra-specific crosses within M. acuminata (A genome) and inter-specific crosses between M. acuminata and M. balbisiana (B genome). Previous studies revealed the closely related nature of the A and B genomes, including similarities in repetitive DNA. In this study we focused on two main banana DNA satellites, which were previously identified in silico. Their genomic organization and molecular diversity were analyzed in a set of nineteen Musa accessions, including representatives of A, B and S (M. schizocarpa) genomes and their inter-specific hybrids. The two DNA satellites showed a high level of sequence conservation within, and high homology between, Musa species. FISH with probes for the satellite DNA sequences, rRNA genes and a single-copy BAC clone 2G17 resulted in characteristic chromosome banding patterns in M. acuminata and M. balbisiana, which may aid in determining genomic constitution in interspecific hybrids. In addition to improving the knowledge on Musa satellite DNA, our study increases the number of cytogenetic markers and the number of individual chromosomes that can be identified in Musa.
Project description:Repetitive DNA consists of sequences that are repeated multiple times in the genome and are normally considered nonfunctional. Several studies predict that the rapid evolution of chromosome-specific satellites led to hybrid incompatibilities and speciation. Interestingly, in Drosophila, the X and dot chromosomes share a unique and noteworthy property: they are identified by chromosome-specific binding proteins and they are particularly involved in genetic incompatibilities between closely related species. Here, I show that the X and dot chromosomes are overpopulated by certain repetitive elements that undergo recurrent turnover in Drosophila species. The portion of the X and dot chromosomes covered by such satellites is up to 52 times and 44 times higher than in other chromosomes, respectively. In addition, the newly evolved X chromosome in D. pseudoobscura (the chromosomal arm XR) has been invaded by the same satellite that colonized the ancestral X chromosome (chromosomal arm XL), whereas the autosomal homologs in other species remain mostly devoid of satellites. In contrast, the Müller element F in D. ananassae, homologous to the dot chromosome in D. melanogaster, has no overrepresented DNA sequences compared with any other chromosome. The biology and evolutionary patterns of the characterized satellites suggest that they provide both chromosomes with some kind of structural identity and are exposed to natural selection. The rapid satellite turnover fits some speciation models and may explain why these two chromosomes are typically involved in hybrid incompatibilities.
Project description:Centromeres are the chromosomal sites of assembly for kinetochores, the protein complexes that attach to spindle fibers and mediate separation of chromosomes to daughter cells in mitosis and meiosis. In most multicellular organisms, centromeres comprise a single specific family of tandem repeats, often 100-400 bp in length, found on every chromosome, typically in one location within heterochromatin. Drosophila melanogaster is unusual in that the heterochromatin contains many families of mostly short (5-12 bp) tandem repeats, none of which appears to be present at all centromeres, and none of which is found only at centromeres. Although centromere sequences from a minichromosome have been identified and candidate centromere sequences have been proposed, the DNA sequences at native Drosophila centromeres remain unknown. Here we use native chromatin immunoprecipitation to identify the centromeric sequences bound by the foundational kinetochore protein cenH3, known in vertebrates as CENP-A. In D. melanogaster, these sequences include a few families of 5-bp and 10-bp repeats, but in closely related D. simulans, a partially overlapping set of short repeats and more complex repeats comprise the centromeres. The results suggest that a recent expansion of short repeats is replacing more complex centromeric repeats in the melanogaster subgroup of Drosophila. Overall design: We used native chromatin immunoprecipitation with anti-CENP-A antibodies to enrich for and sequence centromeric satellites bound by CENP-A in two sibling Drosophila species. We counted 71 candidate repeated sequences and their reverse complements to determine which were enriched in the immunoprecipitates.
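The counting step in the overall design above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the overlapping-match counting, and the pseudocount-based enrichment ratio are all assumptions made for the example, and the record does not state how enrichment was actually computed.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_occurrences(read, motif):
    """Count occurrences of motif in read, allowing overlapping matches."""
    count = 0
    start = read.find(motif)
    while start != -1:
        count += 1
        start = read.find(motif, start + 1)
    return count


def count_repeats(reads, candidates):
    """Tally each candidate motif together with its reverse complement.

    A set is used so that a motif equal to its own reverse complement
    is not counted twice.
    """
    totals = Counter()
    for read in reads:
        for motif in candidates:
            forms = {motif, reverse_complement(motif)}
            totals[motif] += sum(count_occurrences(read, f) for f in forms)
    return totals


def enrichment(ip_count, input_count, ip_total, input_total, pseudo=1.0):
    """Hypothetical enrichment score: ratio of library-normalized counts.

    The pseudocount avoids division by zero; this formula is an
    assumption for illustration, not taken from the study.
    """
    return ((ip_count + pseudo) / ip_total) / ((input_count + pseudo) / input_total)
```

Calling `count_repeats(ip_reads, candidates)` and `count_repeats(input_reads, candidates)` on the immunoprecipitated and input libraries, then passing the per-motif counts to `enrichment`, would rank candidate repeats under these assumptions.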
Project description:The centromere/kinetochore interaction is responsible for the pairing and segregation of replicated chromosomes in eukaryotes. Centromere DNA is portrayed as scarcely conserved, repetitive in nature, quickly evolving and protein-binding competent. Among primates, the major class of centromeric DNA is the pancentromeric α-satellite, made of arrays of 171 bp monomers, repeated in a head-to-tail pattern. α-satellite sequences can either form tandem heterogeneous monomeric arrays or assemble in higher-order repeats (HORs). Gorilla centromere DNA has barely been characterized, and data are mainly based on hybridizations of human alphoid sequences. We isolated and finely characterized gorilla α-satellite sequences and revealed relevant structure and chromosomal distribution similarities with other great apes as well as gorilla-specific features, such as the uniquely octameric structure of the suprachromosomal family-2 (SF2). We demonstrated for the first time the orthologous localization of alphoid suprachromosomal families-1 and -2 (SF1 and SF2) between human and gorilla in contrast to chimpanzee centromeres. Finally, the discovery of a new 189 bp monomer type in gorilla centromeres unravels clues to the role of the centromere protein B, paving the way to solve the significance of the centromere DNA's essential repetitive nature in association with its function and the peculiar evolution of the α-satellite sequence.
Project description:Neural networks (NNs) have emerged as a new tool for genomic selection (GS) in animal breeding. However, the properties of NNs used in GS for the prediction of phenotypic outcomes are not well characterized, owing to the over-parameterization of NNs and the difficulty of using whole-genome marker sets as high-dimensional NN input. In this note, we have developed an R package called snnR that finds an optimal sparse structure of a NN by minimizing the squared error subject to a penalty on the L1-norm of the parameters (weights and biases), thereby solving the problem of over-parameterization in NNs. We have also tested some models fitted with the snnR package on several example cases to demonstrate their feasibility and effectiveness. Compared with the R package brnn (Bayesian regularized single-layer NNs), with both using the entries of a genotype matrix or a genomic relationship matrix as inputs, snnR greatly improves computational efficiency and prediction ability for GS in animal breeding because snnR implements a sparse NN with many hidden layers.
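The objective described above, squared error plus an L1 penalty on weights and biases, can be written down as a short sketch. This is not the snnR implementation (snnR is an R package whose internals are not given here); the function name `penalized_loss`, the use of a summed rather than averaged error, and the penalty weight `lam` are assumptions for illustration.

```python
def penalized_loss(y_true, y_pred, params, lam):
    """Squared error plus an L1 penalty on all parameters.

    y_true, y_pred: sequences of observed and predicted phenotypes.
    params: flat sequence of network weights and biases.
    lam: non-negative penalty weight (hypothetical name); larger values
         push more parameters toward exactly zero, giving a sparser net.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    l1_norm = sum(abs(w) for w in params)
    return squared_error + lam * l1_norm
```

With `lam = 0` the function reduces to the plain squared error, which makes the role of the penalty easy to check in isolation.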
Project description:The functional centromeres of rice (Oryza sativa, AA genome) chromosomes contain two key DNA components: the CRR centromeric retrotransposons and a 155-bp satellite repeat, CentO. However, several wild Oryza species lack the CentO repeat. We developed a chromatin immunoprecipitation-based technique to clone DNA fragments derived from chromatin containing the centromeric histone H3 variant CenH3. Chromatin immunoprecipitation cloning was carried out in the CentO-less species Oryza rhizomatis (CC genome) and Oryza brachyantha (FF genome). Three previously uncharacterized genome-specific satellite repeats, CentO-C1, CentO-C2, and CentO-F, were discovered in the centromeres of these two species. An 80-bp DNA region was found to be conserved in CentO-C1, CentO, and centromeric satellite repeats from maize and pearl millet, species which diverged from rice many millions of years ago. In contrast, the CentO-F repeat shows no sequence similarity to other centromeric repeats but has almost completely replaced other centromeric sequences in O. brachyantha, including the CRR-related sequences that normally constitute a significant fraction of the centromeric DNA in grass species.