Whole genome detection of sequences of six horses from diverse breeds.
ABSTRACT: Completed in 2009, the reference genome assembly of the domesticated horse (EquCab 2.0) produced the majority of publically available annotations of genetic variations in this species. Following that effort, a few other projects that focused at variant discovery of a particular breed or two. In this project we aim to identify and annotate single nucleotide polymorphisms (SNPs), insertions and deletions (INDELs), copy number variations (CNVs) and structural variations (SVs) in the genomes six horses of diverse genetic background using next generation sequencing. Analysis ERZ780739 uses EquCab 2.0 as the reference sequence, while analysis ERZ1195829 uses EquCab 3.0.
Project description:The domesticated horse has played a unique role in human history, serving not just as a source of animal protein, but also as a catalyst for long-distance migration and military conquest. As a result, the horse developed unique physiological adaptations to meet the demands of both their climatic environment and their relationship with man. Completed in 2009, the first domesticated horse reference genome assembly (EquCab 2.0) produced most of the publicly available genetic variations annotations in this species. Yet, there are around 400 geographically and physiologically diverse breeds of horse. To enrich the current collection of genetic variants in the horse, we sequenced whole genomes from six horses of six different breeds: an American Miniature, a Percheron, an Arabian, a Mangalarga Marchador, a Native Mongolian Chakouyi, and a Tennessee Walking Horse, and mapped them to EquCab3.0 genome. Aside from extreme contrasts in body size, these breeds originate from diverse global locations and each possess unique adaptive physiology. A total of 1.3 billion reads were generated for the six horses with coverage between 15x to 24x per horse. After applying rigorous filtration, we identified and functionally annotated 17,514,723 Single Nucleotide Polymorphisms (SNPs), and 1,923,693 Insertions/Deletions (INDELs), as well as an average of 1,540 Copy Number Variations (CNVs) and 3,321 Structural Variations (SVs) per horse. Our results revealed putative functional variants including genes associated with size variation like LCORL gene (found in all horses), ZFAT in the Arabian, American Miniature and Percheron horses and ANKRD1 in the Native Mongolian Chakouyi horse. We detected a copy number variation in the Latherin gene that may be the result of evolutionary selection impacting thermoregulation by sweating, an important component of athleticism and heat tolerance. The newly discovered variants were formatted into user-friendly browser tracks and will provide a foundational database for future studies of the genetic underpinnings of diverse phenotypes within the horse.
Project description:Next generation sequencing (NGS) of PCR amplicons is a standard approach to detect genetic variations in personalized medicine such as cancer diagnostics. Computer programs used in the NGS community often miss insertions and deletions (indels) that constitute a large part of known human mutations. We have developed HeurAA, an open source, heuristic amplicon aligner program. We tested the program on simulated datasets as well as experimental data from multiplex sequencing of 40 amplicons in 12 oncogenes collected on a 454 Genome Sequencer from lung cancer cell lines. We found that HeurAA can accurately detect all indels, and is more than an order of magnitude faster than previous programs. HeurAA can compare reads and reference sequences up to several thousand base pairs in length, and it can evaluate data from complex mixtures containing reads of different gene-segments from different samples. HeurAA is written in C and Perl for Linux operating systems, the code and the documentation are available for research applications at http://sourceforge.net/projects/heuraa/
Project description:In recent genome analyses, population-specific reference panels have indicated important. However, reference panels based on short-read sequencing data do not sufficiently cover long insertions. Therefore, the nature of long insertions has not been well documented. Here, we assembled a Japanese genome using single-molecule real-time sequencing data and characterized insertions found in the assembled genome. We identified 3691 insertions ranging from 100 bps to ~10,000 bps in the assembled genome relative to the international reference sequence (GRCh38). To validate and characterize these insertions, we mapped short-reads from 1070 Japanese individuals and 728 individuals from eight other populations to insertions integrated into GRCh38. With this result, we constructed JRGv1 (Japanese Reference Genome version 1) by integrating the 903 verified insertions, totaling 1,086,173 bases, shared by at least two Japanese individuals into GRCh38. We also constructed decoyJRGv1 by concatenating 3559 verified insertions, totaling 2,536,870 bases, shared by at least two Japanese individuals or by six other assemblies. This assembly improved the alignment ratio by 0.4% on average. These results demonstrate the importance of refining the reference assembly and creating a population-specific reference genome. JRGv1 and decoyJRGv1 are available at the JRG website.
Project description:Insertions and deletions (indels) represent the second most common type of genetic variations in human genomes. Indels can be deleterious and contribute to disease susceptibility as recent genome sequencing projects revealed a large number of indels in various cancer types. In this study, we investigated the possible effects of small coding indels on protein structure and function, and the baseline characteristics of indels in 2504 individuals of 26 populations from the 1000 Genomes Project. We found that each population has a distinct pattern in genes with small indels. Frameshift (FS) indels are enriched in olfactory receptor activity while non-frameshift (NFS) indels are enriched in transcription-related proteins. Structural analysis of NFS indels revealed that they predominantly adopt coil or disordered conformations, especially in proteins with transcription-related NFS indels. These results suggest that the annotated coding indels from the 1000 Genomes Project, while contributing to genetic variations and phenotypic diversity, generally do not affect the core protein structures and have no deleterious effect on essential biological processes. In addition, we found that a number of reference genome annotations might need to be updated due to the high prevalence of annotated homozygous indels in the general population.
Project description:Human genomes are routinely compared against a universal reference. However, this strategy could miss population-specific and personal genomic variations, which may be detected more efficiently using an ethnically relevant or personal reference. Here we report a hybrid assembly of a Korean reference genome (KOREF) for constructing personal and ethnic references by combining sequencing and mapping methods. We also build its consensus variome reference, providing information on millions of variants from 40 additional ethnically homogeneous genomes from the Korean Personal Genome Project. We find that the ethnically relevant consensus reference can be beneficial for efficient variant detection. Systematic comparison of human assemblies shows the importance of assembly quality, suggesting the necessity of new technologies to comprehensively map ethnic and personal genomic structure variations. In the era of large-scale population genome projects, the leveraging of ethnicity-specific genome assemblies as well as the human reference genome will accelerate mapping all human genome diversity.
Project description:We selected two genetically diverse subspecies of the Trifolium model species, subterranean clover cvs. Daliak and Yarloop. The structural variations (SVs) discovered by Bionano optical mapping (BOM) were validated using Illumina short reads. In the analysis, BOM identified 12 large-scale regions containing deletions and 19 regions containing insertions in Yarloop. The 12 large-scale regions contained 71 small deletions when validated by Illumina short reads. The results suggest that BOM could detect the total size of deletions and insertions, but it could not precisely report the location and actual quantity of SVs in the genome. Nucleotide-level validation is crucial to confirm and characterize SVs reported by optical mapping. The accuracy of SV detection by BOM is highly dependent on the quality of reference genomes and the density of selected nickases.
Project description:Long Interspersed Element-1 (LINE-1) retrotransposition contributes to inter- and intra-individual genetic variation and occasionally can lead to human genetic disorders. Various strategies have been developed to identify human-specific LINE-1 (L1Hs) insertions from short-read whole genome sequencing (WGS) data; however, they have limitations in detecting insertions in complex repetitive genomic regions. Here, we developed a computational tool (PALMER) and used it to identify 203 non-reference L1Hs insertions in the NA12878 benchmark genome. Using PacBio long-read sequencing data, we identified L1Hs insertions that were absent in previous short-read studies (90/203). Approximately 81% (73/90) of the L1Hs insertions reside within endogenous LINE-1 sequences in the reference assembly and the analysis of unique breakpoint junction sequences revealed 63% (57/90) of these L1Hs insertions could be genotyped in 1000 Genomes Project sequences. Moreover, we observed that amplification biases encountered in single-cell WGS experiments led to a wide variation in L1Hs insertion detection rates between four individual NA12878 cells; under-amplification limited detection to 32% (65/203) of insertions, whereas over-amplification increased false positive calls. In sum, these data indicate that L1Hs insertions are often missed using standard short-read sequencing approaches and long-read sequencing approaches can significantly improve the detection of L1Hs insertions present in individual genomes.
Project description:In the past few years, human genome structural variation discovery has enjoyed increased attention from the genomics research community. Many studies were published to characterize short insertions, deletions, duplications and inversions, and associate copy number variants (CNVs) with disease. Detection of new sequence insertions requires sequence data, however, the 'detectable' sequence length with read-pair analysis is limited by the insert size. Thus, longer sequence insertions that contribute to our genetic makeup are not extensively researched.We present NovelSeq: a computational framework to discover the content and location of long novel sequence insertions using paired-end sequencing data generated by the next-generation sequencing platforms. Our framework can be built as part of a general sequence analysis pipeline to discover multiple types of genetic variation (SNPs, structural variation, etc.), thus it requires significantly less-computational resources than de novo sequence assembly. We apply our methods to detect novel sequence insertions in the genome of an anonymous donor and validate our results by comparing with the insertions discovered in the same genome using various sources of sequence data.The implementation of the NovelSeq pipeline is available at http://firstname.lastname@example.org; email@example.com
Project description:BACKGROUND: Investigation of tomato genetic resources is a crucial issue for better straight evolution and genetic studies as well as tomato breeding strategies. Traditional Vesuviano and San Marzano varieties grown in Campania region (Southern Italy) are famous for their remarkable fruit quality. Owing to their economic and social importance is crucial to understand the genetic basis of their unique traits. RESULTS: Here, we present the draft genome sequences of tomato Vesuviano and San Marzano genome. A 40x genome coverage was obtained from a hybrid Illumina paired-end reads assembling that combines de novo assembly with iterative mapping to the reference S. lycopersicum genome (SL2.40). Insertions, deletions and SNP variants were carefully measured. When assessed on the basis of the reference annotation, 30% of protein-coding genes are predicted to have variants in both varieties. Copy genes number and gene location were assessed by mRNA transcripts mapping, showing a closer relationship of San Marzano with reference genome. Distinctive variations in key genes and transcription/regulation factors related to fruit quality have been revealed for both cultivars. CONCLUSIONS: The effort performed highlighted varieties relationships and important variants in fruit key processes useful to dissect the path from sequence variant to phenotype.
Project description:Numerous types of DNA variation exist, ranging from SNPs to larger structural alterations such as copy number variants (CNVs) and inversions. Alignment of DNA sequence from different sources has been used to identify SNPs and intermediate-sized variants (ISVs). However, only a small proportion of total heterogeneity is characterized, and little is known of the characteristics of most smaller-sized (<50 kb) variants. Here we show that genome assembly comparison is a robust approach for identification of all classes of genetic variation. Through comparison of two human assemblies (Celera's R27c compilation and the Build 35 reference sequence), we identified megabases of sequence (in the form of 13,534 putative non-SNP events) that were absent, inverted or polymorphic in one assembly. Database comparison and laboratory experimentation further demonstrated overlap or validation for 240 variable regions and confirmed >1.5 million SNPs. Some differences were simple insertions and deletions, but in regions containing CNVs, segmental duplication and repetitive DNA, they were more complex. Our results uncover substantial undescribed variation in humans, highlighting the need for comprehensive annotation strategies to fully interpret genome scanning and personalized sequencing projects.