Dataset Information

The Biological Significance of Multi-copy Regions and Their Impact on Variant Discovery.


ABSTRACT: Identification of genetic variants via high-throughput sequencing (HTS) technologies has been essential for both fundamental and clinical studies. However, to what extent the genome sequence composition affects variant calling remains unclear. In this study, we identified 63,897 multi-copy sequences (MCSs) with a minimum length of 300 bp, each of which occurs at least twice in the human genome. The 151,749 genomic loci (multi-copy regions, or MCRs) harboring these MCSs account for 1.98% of the genome and are distributed unevenly across chromosomes. MCRs containing the same MCS tend to be located on the same chromosome. Gene Ontology (GO) analyses revealed that 3800 genes whose UTRs or exons overlap with MCRs are enriched for Golgi-related cellular component terms and various enzymatic activities in the GO biological function category. MCRs are also enriched for loci that are sensitive to neocarzinostatin-induced double-strand breaks. Moreover, genetic variants discovered by genome-wide association studies and recorded in dbSNP are significantly underrepresented in MCRs. Using simulated HTS datasets, we show that false variant discovery rates are significantly higher in MCRs than in other genomic regions. These results suggest that extra caution must be taken when identifying genetic variants in the MCRs via HTS technologies.
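The idea of a multi-copy sequence and the genome fraction it covers can be illustrated with a minimal sketch. This is not the authors' pipeline, which the abstract does not describe. It assumes exact forward-strand matches only, uses a fixed k-mer length in place of the 300 bp minimum, and the function names `find_multicopy_regions` and `covered_fraction` are hypothetical.

```python
from collections import defaultdict


def find_multicopy_regions(seq, k):
    """Return merged (start, end) intervals, end exclusive, covered by
    k-mers that occur at least twice in seq (forward strand, exact match)."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    # Keep every start position of a k-mer seen two or more times.
    starts = sorted(
        i for hits in positions.values() if len(hits) >= 2 for i in hits
    )

    # Merge overlapping or touching intervals into regions.
    regions = []
    for s in starts:
        e = s + k
        if regions and s <= regions[-1][1]:
            regions[-1][1] = max(regions[-1][1], e)
        else:
            regions.append([s, e])
    return [tuple(r) for r in regions]


def covered_fraction(regions, genome_length):
    """Fraction of the sequence that falls inside the given regions."""
    return sum(e - s for s, e in regions) / genome_length


toy = "AAAACCCCGGGGAAAATTTT"  # toy sequence; "AAAA" occurs twice
regions = find_multicopy_regions(toy, k=4)
print(regions)                              # [(0, 4), (12, 16)]
print(covered_fraction(regions, len(toy)))  # 0.4
```

In this toy case two loci share one repeated sequence, mirroring the abstract's distinction between MCSs (the sequences) and MCRs (the loci that harbor them). A realistic analysis would also need to handle reverse-complement matches and genome-scale indexing, which this sketch omits.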

SUBMITTER: Sun J 

PROVIDER: S-EPMC8377240 | biostudies-literature | 2020 Oct

REPOSITORIES: biostudies-literature

Publications

The Biological Significance of Multi-copy Regions and Their Impact on Variant Discovery.

Jing Sun, Yanfang Zhang, Minhui Wang, Qian Guan, Xiujia Yang, Jin Xia Ou, Mingchen Yan, Chengrui Wang, Yan Zhang, Zhi-Hao Li, Chunhong Lan, Chen Mao, Hong-Wei Zhou, Bingtao Hao, Zhenhai Zhang

Genomics, Proteomics & Bioinformatics, 2020 Aug 19, issue 5


Similar Datasets

| S-EPMC3180300 | biostudies-literature
| S-EPMC10779192 | biostudies-literature
| S-EPMC3592431 | biostudies-literature
| S-EPMC8666084 | biostudies-literature
| S-EPMC4783407 | biostudies-literature
| S-EPMC4234435 | biostudies-literature
| S-EPMC11508295 | biostudies-literature
| S-EPMC3002949 | biostudies-literature
| S-EPMC3563492 | biostudies-literature
| S-EPMC9099228 | biostudies-literature