Rapidly Registering Identity-by-Descent Across Ancestral Recombination Graphs.
ABSTRACT: The genomes of remotely related individuals occasionally contain long segments that are identical by descent (IBD). Sharing of IBD segments has many applications in population and medical genetics, and it is thus desirable to study their properties in simulations. However, no current method provides a direct, efficient means to extract IBD segments from simulated genealogies. Here, we introduce computationally efficient approaches to extract ground-truth IBD segments from a sequence of genealogies, or equivalently, an ancestral recombination graph. Specifically, we use a two-step scheme, where we first identify putative shared segments by comparing the common ancestors of all pairs of individuals at some distance apart. This reduces the search space considerably, and we then proceed by determining the true IBD status of the candidate segments. Under some assumptions and when allowing a limited resolution of segment lengths, our run-time complexity is reduced from O(n^3 log n) for the naïve algorithm to O(n log n), where n is the number of individuals in the sample.
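For orientation, the naïve baseline mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes each local genealogy is given as a hypothetical `(start, end, parent)` tuple, where `parent` maps a node to its parent, and it assumes an IBD segment is a maximal run of adjacent genealogies over which a pair of samples keeps the same most recent common ancestor (MRCA). The names `ancestors`, `mrca`, and `naive_ibd` are invented for this sketch.

```python
from itertools import combinations


def ancestors(parent, node):
    """Return the path from `node` up to the root of one local tree."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def mrca(parent, a, b):
    """Most recent common ancestor of a and b in one local tree."""
    seen = set(ancestors(parent, a))
    for node in ancestors(parent, b):
        if node in seen:
            return node
    return None


def naive_ibd(trees, samples, min_length=0.0):
    """Scan every pair across every tree (assumed sorted by position).

    A segment is extended while the pair's MRCA node is unchanged and
    closed as soon as it changes. Returns (a, b, start, end, mrca) tuples.
    """
    segments = []
    for a, b in combinations(samples, 2):
        current = None  # (start, end, mrca) of the segment being extended
        for start, end, parent in trees:
            anc = mrca(parent, a, b)
            if current is not None and current[2] == anc:
                current = (current[0], end, anc)
                continue
            if current is not None and current[1] - current[0] >= min_length:
                segments.append((a, b) + current)
            current = (start, end, anc)
        if current is not None and current[1] - current[0] >= min_length:
            segments.append((a, b) + current)
    return segments


# Three toy local trees over samples 0, 1, 2 (positions in arbitrary units).
toy_trees = [
    (0, 10, {0: 4, 1: 4, 4: 5, 2: 5}),
    (10, 25, {0: 4, 1: 4, 4: 6, 2: 6}),
    (25, 40, {0: 7, 2: 7, 7: 8, 1: 8}),
]
long_segments = naive_ibd(toy_trees, [0, 1, 2], min_length=20)
```

Every pair is compared in every tree, which is the cost the two-step scheme in the abstract is designed to avoid.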
Project description:Identity-by-descent (IBD) inference is the problem of establishing a genetic connection between two individuals through a genomic segment that is inherited by both individuals from a recent common ancestor. IBD inference is an important preceding step in a variety of population genomic studies, ranging from demographic studies to linking genomic variation with phenotype and disease. The problem of accurate IBD detection has become increasingly challenging with the availability of large collections of human genotypes and genomes: Given a cohort's size, a quadratic number of pairwise genome comparisons must be performed. Therefore, computation time and the false discovery rate can also scale quadratically. To enable accurate and efficient large-scale IBD detection, we present Parente2, a novel method for detecting IBD segments. Parente2 is based on an embedded log-likelihood ratio and uses a model that accounts for linkage disequilibrium by explicitly modeling haplotype frequencies. Parente2 operates directly on genotype data without the need to phase data prior to IBD inference. We evaluate Parente2's performance through extensive simulations using real data, and we show that it provides substantially higher accuracy compared to previous state-of-the-art methods while maintaining high computational efficiency.
Project description:Analysis of genomic segments shared identical-by-descent (IBD) between individuals is fundamental to many genetic applications, from demographic inference to estimating the heritability of diseases, but IBD detection accuracy in nonsimulated data is largely unknown. In principle, it can be evaluated using known pedigrees, as IBD segments are by definition inherited without recombination down a family tree. We extracted 25,432 genotyped European individuals containing 2,952 father-mother-child trios from the 23andMe, Inc. data set. We then used GERMLINE, a widely used IBD detection method, to detect IBD segments within this cohort. Exploiting known familial relationships, we identified a false-positive rate over 67% for 2-4 centiMorgan (cM) segments, in sharp contrast with accuracies reported in simulated data at these sizes. Nearly all false positives arose from the allowance of haplotype switch errors when detecting IBD, a necessity for retrieving long (>6 cM) segments in the presence of imperfect phasing. We introduce HaploScore, a novel, computationally efficient metric that scores IBD segments proportional to the number of switch errors they contain. Applying HaploScore filtering to the IBD data at a precision of 0.8 produced a 13-fold increase in recall when compared with length-based filtering. We replicate the false IBD findings and demonstrate the generalizability of HaploScore to alternative data sources using an independent cohort of 555 European individuals from the 1000 Genomes project. HaploScore can improve the accuracy of segments reported by any IBD detection method, provided that estimates of the genotyping error rate and switch error rate are available.
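The trio-based evaluation idea can be illustrated with a small sketch. It rests on an assumption not spelled out in the abstract: a segment that a child shares with some third individual should be covered by segments that at least one parent shares with the same individual, so a low covered fraction flags a likely false positive. The function name `covered_fraction` and the interval representation (start and end positions in cM) are hypothetical.

```python
def covered_fraction(child_segment, parent_segments):
    """Fraction of the child's segment overlapped by the union of the
    parents' segments with the same third individual.

    child_segment: (start, end); parent_segments: list of (start, end).
    """
    start, end = child_segment
    covered = 0.0
    cursor = start  # everything left of the cursor is already accounted for
    for p_start, p_end in sorted(parent_segments):
        lo = max(p_start, cursor)
        hi = min(p_end, end)
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return covered / (end - start)


# A child segment spanning 10-20 cM, partly supported by parental segments.
example_fraction = covered_fraction((10, 20), [(0, 12), (11, 15), (18, 30)])
```

A segment with no parental support at all returns 0.0 and would count as a false positive under this assumption.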
Project description:Segments of the genome inherited from a common ancestor by related individuals are said to be identical by descent (IBD). Modern genetic marker data provide information to infer such segments among multiple related members of a population, even when pedigree relationships are unknown. Previous methods have been proposed for the detection of pairwise IBD, but the computation of probabilities of trait data under many trait models requires an IBD estimate jointly consistent among individuals and slowly varying across genome locations; we refer to such an estimate as an 'IBD graph'. In this paper, we develop a novel method that builds IBD graphs sequentially among related individuals from a population sample using either phased or unphased genetic marker data. We show how IBD graphs realized conditionally on marker data provide a form of linkage mapping score, analogous to a LOD score, and propose a permutation approach to normalize this mapping score. Using a simulated quantitative trait dependent on the (unobserved) genotype at a major locus, we apply the approach to two samples containing both closely and remotely related individuals, among whom there are complex patterns of IBD. We compare the results of our approach with an alternate approach based on the estimation of local kinship. We show that pairwise estimates derived from a joint IBD graph give significant improvements in LOD score estimation over estimates derived from an intrinsically pairwise approach.
Project description:Identifying and characterizing genomic regions that are shared identical by descent (IBD) among individuals can yield insight into population history, facilitate the identification of adaptively evolving loci, and be an important tool in disease gene mapping. Although increasingly large collections of exome sequences have been generated, it is challenging to detect IBD segments in exomes, precluding many potentially informative downstream analyses. Here, we describe an approach, ExIBD, to robustly detect IBD segments in exome-sequencing data, rigorously evaluate its performance, and apply this method to high-coverage exomes from 6,515 European and African Americans. Furthermore, we show how IBD networks, constructed from patterns of pairwise IBD between individuals, and principles from graph theory provide insight into recent population history and reveal cryptic population structure in European Americans. Our results enable IBD analyses to be performed on exome data, which will expand the scope of inferences that can be made from existing massively large exome-sequencing datasets.
Project description:Identity by descent (IBD) can be reliably detected for long shared DNA segments, which are found in related individuals. However, many studies contain cohorts of unrelated individuals that share only short IBD segments. New sequencing technologies facilitate identification of short IBD segments through rare variants, which convey more information on IBD than common variants. Current IBD detection methods, however, are not designed to use rare variants for the detection of short IBD segments. Short IBD segments reveal genetic structures at high resolution. Therefore, they can help to improve imputation and phasing, to increase genotyping accuracy for low-coverage sequencing and to increase the power of association studies. Since short IBD segments are further assumed to be old, they can shed light on the evolutionary history of humans. We propose HapFABIA, a computational method that applies biclustering to identify very short IBD segments characterized by rare variants. HapFABIA is designed to detect short IBD segments in genotype data that were obtained from next-generation sequencing, but can also be applied to DNA microarray data. Especially in next-generation sequencing data, HapFABIA exploits rare variants for IBD detection. HapFABIA significantly outperformed competing algorithms at detecting short IBD segments on artificial and simulated data with rare variants. HapFABIA identified 160 588 different short IBD segments characterized by rare variants with a median length of 23 kb (mean 24 kb) in data for chromosome 1 of the 1000 Genomes Project. These short IBD segments contain 752 000 single nucleotide variants (SNVs), which account for 39% of the rare variants and 23.5% of all variants. The vast majority (152 000 IBD segments) are shared by Africans, while only 19 000 and 11 000 are shared by Europeans and Asians, respectively.
IBD segments that match the Denisova or the Neandertal genome are found significantly more often in Asians and Europeans but also, in some cases exclusively, in Africans. The lengths of IBD segments and their sharing between continental populations indicate that many short IBD segments from chromosome 1 existed before humans migrated out of Africa. Thus, rare variants that tag these short IBD segments predate human migration from Africa. The software package HapFABIA is available from Bioconductor. All data sets, result files and programs for data simulation, preprocessing and evaluation are supplied at http://www.bioinf.jku.at/research/short-IBD.
Project description:Most methods for fast detection of identity by descent (IBD) segments report identity by state segments without any quantification of the uncertainty in the endpoints and lengths of the IBD segments. We present a method for determining the posterior probability distribution of IBD segment endpoints. Our approach accounts for genotype errors, recent mutations, and gene conversions which disrupt DNA sequence identity within IBD segments, and it can be applied to large cohorts with whole-genome sequence or SNP array data. We find that our method's estimates of uncertainty are well calibrated for homogeneous samples. We quantify endpoint uncertainty for 77.7 billion IBD segments from 408,883 individuals of white British ancestry in the UK Biobank, and we use these IBD segments to find regions showing evidence of recent natural selection. We show that many spurious selection signals are eliminated by the use of unbiased estimates of IBD segment endpoints and a pedigree-based genetic map. Eleven of the twelve regions with the greatest evidence for recent selection in our scan have been identified as selected in previous analyses using different approaches. Our computationally efficient method for quantifying IBD segment endpoint uncertainty is implemented in the open source ibd-ends software package.
Project description:Identity-by-descent (IBD) segments are a useful tool for applications ranging from demographic inference to relationship classification, but most detection methods rely on phasing information and therefore require substantial computation time. As genetic datasets grow, methods for inferring IBD segments that scale well will be critical. We developed IBIS, an IBD detector that locates long regions of allele sharing between unphased individuals, and benchmarked it with Refined IBD, GERMLINE, and TRUFFLE on 3,000 simulated individuals. Phasing these with Beagle 5 takes 4.3 CPU days, followed by either Refined IBD or GERMLINE segment detection in 2.9 or 1.1 h, respectively. By comparison, IBIS finishes in 6.8 min or 7.8 min with IBD2 functionality enabled: speedups of 805-946× including phasing time. TRUFFLE takes 2.6 h, corresponding to IBIS speedups of 20.2-23.3×. IBIS is also accurate, inferring ≥7 cM IBD segments at quality comparable to Refined IBD and GERMLINE. With these segments, IBIS classifies first through third degree relatives in real Mexican American samples at rates meeting or exceeding other methods tested and identifies fourth through sixth degree pairs at rates within 0.0%-2.0% of the top method. While allele frequency-based approaches that do not detect segments can infer relationship degrees faster than IBIS, the fastest are biased in admixed samples, with KING inferring 30.8% fewer fifth degree Mexican American relatives correctly compared with IBIS. Finally, we ran IBIS on chromosome 2 of the UK Biobank dataset and estimate its runtime on the autosomes to be 3.3 days parallelized across 128 cores.
Project description:Simulations of close relatives and identical by descent (IBD) segments are common in genetic studies, yet most past efforts have utilized sex averaged genetic maps and ignored crossover interference, thus omitting features known to affect the breakpoints of IBD segments. We developed Ped-sim, a method for simulating relatives that can utilize either sex-specific or sex averaged genetic maps and also either a model of crossover interference or the traditional Poisson model for inter-crossover distances. To characterize the impact of previously ignored mechanisms, we simulated data for all four combinations of these factors. We found that modeling crossover interference decreases the standard deviation of pairwise IBD proportions by 10.4% on average in full siblings through second cousins. By contrast, sex-specific maps increase this standard deviation by 4.2% on average, and also impact the number of segments relatives share. Most notably, using sex-specific maps, the number of segments half-siblings share is bimodal; and when combined with interference modeling, the probability that sixth cousins have non-zero IBD sharing ranges from 9.0 to 13.1%, depending on the sexes of the individuals through which they are related. We present new analytical results for the distributions of IBD segments under these models and show they match results from simulations. Finally, we compared IBD sharing rates between simulated and real relatives and find that the combination of sex-specific maps and interference modeling most accurately captures IBD rates in real data. Ped-sim is open source and available from https://github.com/williamslab/ped-sim.
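The traditional Poisson model for inter-crossover distances mentioned above can be sketched in a few lines. This is a generic illustration of the no-interference model, not Ped-sim's code: it assumes positions are measured in Morgans on a single meiotic product, so gaps between successive crossovers are exponential with rate 1 per Morgan. The function name `poisson_crossovers` is hypothetical, and crossover interference and sex-specific maps are deliberately left out.

```python
import random


def poisson_crossovers(length_morgans, rng):
    """Sample crossover positions (in Morgans) along one chromosome.

    Gaps between successive crossovers are drawn independently from an
    Exponential(rate=1 per Morgan) distribution, so the number of
    crossovers is Poisson with mean `length_morgans`.
    """
    positions = []
    pos = rng.expovariate(1.0)
    while pos < length_morgans:
        positions.append(pos)
        pos += rng.expovariate(1.0)
    return positions


# One simulated meiosis on a 2-Morgan chromosome, with a fixed seed.
example_positions = poisson_crossovers(2.0, random.Random(1))
```

Replacing the exponential gaps with a distribution that disfavors short gaps is the general idea behind interference models; the specific model used by Ped-sim is not reproduced here.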
Project description:There has been much interest in detecting genomic identity by descent (IBD) segments from modern dense genetic marker data and in using them to identify human disease susceptibility loci. Here we present a novel Bayesian framework using Markov chain Monte Carlo (MCMC) realizations to jointly infer IBD states among multiple individuals not known to be related, together with the allelic typing error rate and the IBD process parameters. The data are phased single nucleotide polymorphism (SNP) haplotypes. We model changes in latent IBD state along homologous chromosomes by a continuous time Markov model having the Ewens sampling formula as its stationary distribution. We show by simulation that this model for the IBD process fits quite well with the coalescent predictions. Using simulation data sets of 40 haplotypes over regions of 1 and 10 million base pairs (Mbp), we show that the jointly estimated IBD states are very close to the true values, although the presence of linkage disequilibrium decreases the accuracy. We also present comparisons with the ibd_haplo program, which estimates IBD among sets of four haplotypes. Our new IBD detection method focuses on the scale between genome-wide methods using simple IBD models and complex coalescent-based methods that are limited to short genome segments. At the scale of a few Mbp, our approach offers potentially more power for fine-scale IBD association mapping.
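As background for the stationary distribution named above, the Ewens sampling formula assigns a probability to how n haplotypes are partitioned into IBD groups at a locus. If a_j denotes the number of groups of size j, the probability is n! / (θ(θ+1)...(θ+n-1)) multiplied by the product over j of θ^a_j / (j^a_j a_j!). The sketch below evaluates this standard formula; the function name `ewens_probability` and the dictionary representation of the partition are choices made for this illustration and are not taken from the paper's software.

```python
from math import factorial


def ewens_probability(counts, theta):
    """Ewens sampling formula for one partition configuration.

    counts maps a group size j to the number of groups a_j of that size;
    theta is the (positive) parameter of the formula.
    """
    n = sum(j * a for j, a in counts.items())
    rising = 1.0  # theta * (theta + 1) * ... * (theta + n - 1)
    for i in range(n):
        rising *= theta + i
    prob = factorial(n) / rising
    for j, a in counts.items():
        prob *= theta ** a / (j ** a * factorial(a))
    return prob


# Two haplotypes: probability that they fall in the same IBD group.
p_same_group = ewens_probability({2: 1}, theta=1.0)
```

For two haplotypes the formula reduces to 1 / (1 + θ) for the fully shared state, so smaller θ corresponds to more IBD sharing.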
Project description:Identity-by-descent (IBD) is a fundamental concept in genetics with many applications. In a common definition, two haplotypes are said to share an IBD segment if that segment is inherited from a recent shared common ancestor without intervening recombination. Segments several cM long can be efficiently detected by a number of algorithms using high-density SNP array data from a population sample, and there are currently efforts to detect shorter segments from sequencing. Here, we study a problem of identifiability: because existing approaches detect IBD based on contiguous segments of identity-by-state, inferred long segments of IBD may arise from the conflation of smaller, nearby IBD segments. We quantified this effect using coalescent simulations, finding that significant proportions of inferred segments 1-2 cM long are results of conflations of two or more shorter segments, each at least 0.2 cM or longer, under demographic scenarios typical for modern humans for all programs tested. The impact of such conflation is much smaller for longer (> 2 cM) segments. This biases the inferred IBD segment length distribution, and so can affect downstream inferences that depend on the assumption that each segment of IBD derives from a single common ancestor. As an example, we present and analyze an estimator of the de novo mutation rate using IBD segments, and demonstrate that unmodeled conflation leads to underestimates of the ages of the common ancestors on these segments, and hence a significant overestimate of the mutation rate. Understanding the conflation effect in detail will make its correction in future methods more tractable.
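A toy version of the conflation effect can be sketched as follows. The assumption, made only for illustration, is that a detector based on contiguous identity-by-state cannot resolve gaps shorter than some threshold between true IBD segments of a pair, and therefore reports them as one longer segment. The function name `conflate` and the `max_gap` parameter are hypothetical and do not correspond to any of the programs tested in the study.

```python
def conflate(segments, max_gap):
    """Merge true IBD segments (start, end in cM) of one pair whenever
    the gap between neighbors is at most `max_gap`, mimicking a detector
    that cannot resolve short breaks in allele sharing."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Two short neighboring segments are reported as a single 1.3 cM segment.
reported = conflate([(0.0, 0.6), (0.65, 1.3), (3.0, 3.4)], max_gap=0.1)
```

In this example two segments of 0.6 and 0.65 cM, each tracing to its own common ancestor, appear as a single 1.3 cM segment, which is the kind of length inflation that biases downstream estimates.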