Dataset Information


HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints.

ABSTRACT: Synthetic DNA is rapidly emerging as a durable, high-density information storage platform. A major challenge for DNA-based information encoding strategies is the high rate of errors that arise during DNA synthesis and sequencing. Here, we describe the HEDGES (Hash Encoded, Decoded by Greedy Exhaustive Search) error-correcting code that repairs all three basic types of DNA errors: insertions, deletions, and substitutions. HEDGES also converts unresolved or compound errors into substitutions, restoring synchronization for correction via a standard Reed-Solomon outer code that is interleaved across strands. Moreover, HEDGES can incorporate a broad class of user-defined sequence constraints, such as avoiding excess repeats, or too high or too low windowed guanine-cytosine (GC) content. We test our code both via in silico simulations and with synthesized DNA. From its measured performance, we develop a statistical model applicable to much larger datasets. Predicted performance indicates the possibility of error-free recovery of petabyte- and exabyte-scale data from DNA degraded with as much as 10% errors. As the cost of DNA synthesis and sequencing continues to drop, we anticipate that HEDGES will find applications in large-scale error-free information encoding.
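The sequence constraints named in the abstract (limits on repeats and on windowed GC content) can be illustrated with a small checker. This is only a sketch under stated assumptions and does not reproduce how HEDGES itself applies constraints. The function names, the window length of 12, the GC bounds of 0.35 to 0.65, and the maximum run length of 3 are all hypothetical values chosen for illustration, and "repeats" is interpreted here as homopolymer runs.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G or C bases in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def max_homopolymer_run(seq: str) -> int:
    """Return the length of the longest run of one repeated base."""
    longest = 0
    run = 0
    prev = ""
    for base in seq:
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def satisfies_constraints(
    seq: str,
    window: int = 12,      # hypothetical window length
    gc_min: float = 0.35,  # hypothetical lower GC bound
    gc_max: float = 0.65,  # hypothetical upper GC bound
    max_run: int = 3,      # hypothetical homopolymer limit
) -> bool:
    """Check a strand against illustrative run-length and windowed-GC limits."""
    if max_homopolymer_run(seq) > max_run:
        return False
    # Sequences no longer than one window are checked as a whole.
    if len(seq) <= window:
        return gc_min <= gc_fraction(seq) <= gc_max
    # Otherwise every full-length sliding window must be within bounds.
    for start in range(len(seq) - window + 1):
        if not gc_min <= gc_fraction(seq[start:start + window]) <= gc_max:
            return False
    return True
```

For example, `satisfies_constraints("ACGTACGTACGTACGT")` passes under these assumed limits, while a strand containing `AAAAA` fails the run-length check and `ATATATATATAT` fails the GC check.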


PROVIDER: S-EPMC7414044 | BioStudies | 2020-01-01

REPOSITORIES: biostudies

Similar Datasets

1000-01-01 | S-EPMC3333191 | BioStudies
| PRJNA631961 | ENA
1000-01-01 | S-EPMC4421804 | BioStudies
2020-01-01 | S-EPMC7267826 | BioStudies
1000-01-01 | S-EPMC5591216 | BioStudies
1000-01-01 | S-EPMC4421819 | BioStudies
2008-01-01 | S-EPMC2518838 | BioStudies
1000-01-01 | S-EPMC2699522 | BioStudies
1000-01-01 | S-EPMC4974574 | BioStudies
2011-01-01 | S-EPMC3725746 | BioStudies