Project description:Data storage in DNA has recently emerged as a promising archival solution, offering space-efficient and long-lasting digital storage. Recent studies suggest leveraging the inherent redundancy of synthesis and sequencing technologies by using composite DNA alphabets. A major challenge of this approach is the noisy inference process, which hinders the use of large composite alphabets. This paper introduces a novel approach for DNA-based data storage, offering, in some implementations, a 6.5-fold increase in logical density over standard DNA-based storage systems, with near-zero reconstruction error. Combinatorial DNA encoding uses a set of clearly distinguishable DNA shortmers to construct large combinatorial alphabets, where each letter consists of a subset of shortmers. We formally define various combinatorial encoding schemes and investigate their theoretical properties. These include information density and reconstruction probabilities, as well as required synthesis and sequencing multiplicities. We then propose an end-to-end design for a combinatorial DNA-based data storage system, including encoding schemes, two-dimensional (2D) error correction codes, and reconstruction algorithms, under different error regimes. We performed simulations and show, for example, that the use of 2D Reed-Solomon error correction significantly improves reconstruction rates. We validated our approach by constructing two combinatorial sequences using Gibson assembly, imitating a 4-cycle combinatorial synthesis process. We confirmed successful reconstruction and established the robustness of our approach to different error types. Subsampling experiments supported the important role of sampling rate and its effect on overall performance. Our work demonstrates the potential of combinatorial shortmer encoding for DNA-based data storage and describes some theoretical research questions and technical challenges.
Combining combinatorial principles with error-correcting strategies, and investing in the development of DNA synthesis technologies that efficiently support combinatorial synthesis, can pave the way to efficient, error-resilient DNA-based storage solutions.
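As a rough illustration of the combinatorial encoding idea described above, the following Python sketch builds an alphabet in which each letter is a k-subset of N shortmers and computes the information carried per letter as log2 C(N, k). The shortmer names, the values N = 16 and k = 5, and all function names are illustrative assumptions, not parameters or code taken from the paper.

```python
from itertools import combinations
from math import comb, log2


def bits_per_letter(n_shortmers: int, k: int) -> float:
    """Bits carried by one letter when each letter is a k-subset of n shortmers."""
    return log2(comb(n_shortmers, k))


def build_alphabet(shortmers, k):
    """Enumerate every k-subset of the shortmer set; each subset is one letter."""
    return [frozenset(subset) for subset in combinations(shortmers, k)]


def encode(values, alphabet):
    """Map integer symbols to letters (subsets of shortmers)."""
    return [alphabet[v] for v in values]


def decode(letters, alphabet):
    """Map observed shortmer subsets back to integer symbols."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    return [index[letter] for letter in letters]


if __name__ == "__main__":
    # Hypothetical shortmer labels; real shortmers would be short DNA sequences.
    shortmers = [f"X{i}" for i in range(16)]
    alphabet = build_alphabet(shortmers, 5)
    print(len(alphabet), "letters,", round(bits_per_letter(16, 5), 2), "bits per letter")

    message = [0, 42, len(alphabet) - 1]
    assert decode(encode(message, alphabet), alphabet) == message
```

The sketch ignores noise entirely: in practice a letter must be inferred from sequencing reads, which is where the sequencing multiplicity and the error correction mentioned in the abstract come in.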
Project description:Data storage costs have become an appreciable proportion of total cost in the creation and analysis of DNA sequence data. Of particular concern is that the rate of increase in DNA sequencing is significantly outstripping the rate of increase in disk storage capacity. In this paper we present a new reference-based compression method that efficiently compresses DNA sequences for storage. Our approach works for resequencing experiments that target well-studied genomes. We align new sequences to a reference genome and then encode the differences between the new sequence and the reference genome for storage. Our compression method is most efficient when we allow controlled loss of data in the saving of quality information and unaligned sequences. With this new compression method we observe exponential efficiency gains as read lengths increase, and the magnitude of this efficiency gain can be controlled by changing the amount of quality information stored. Our compression method is tunable: the storage of quality scores and unaligned sequences may be adjusted for different experiments to conserve information or to minimize storage costs. It provides one opportunity to address the threat that increasing DNA sequence volumes will overcome our ability to store the sequences.
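The core idea of storing only the differences from a reference can be sketched in a few lines of Python. This is a minimal sketch under strong assumptions: the alignment position is already known, only substitutions occur (no insertions or deletions), and quality scores and unaligned reads are ignored. The function names and record layout are hypothetical and are not the paper's actual format.

```python
def encode_read(read: str, reference: str, pos: int):
    """Store a read as (start position, length, list of substitutions)."""
    diffs = [
        (offset, base)
        for offset, base in enumerate(read)
        if reference[pos + offset] != base
    ]
    return (pos, len(read), diffs)


def decode_read(record, reference: str) -> str:
    """Rebuild the read by copying the reference span and applying substitutions."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTACGT"
    read = "GTTCGT"  # aligns at position 2 with one mismatch
    record = encode_read(read, reference, 2)
    assert decode_read(record, reference) == read
    print(record)
```

A read that matches the reference exactly costs only a position and a length regardless of how long it is, which gives some intuition for why efficiency improves as read lengths increase.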
Project description:DNA-based data storage has emerged as a promising method to satisfy the exponentially increasing demand for information storage. However, practical implementation of DNA-based data storage remains a challenge because of the high cost of data writing through DNA synthesis. Here, we propose the use of degenerate bases as encoding characters in addition to A, C, G, and T, which increases the amount of data that can be stored per length of DNA sequence designed (information capacity) and lowers the amount of DNA synthesis required per unit of stored data. Using the proposed method, we experimentally achieved an information capacity of 3.37 bits/character. The demonstrated information capacity is more than twice the highest information capacity previously achieved. The proposed method can be integrated with synthetic technologies in the future to reduce the cost of DNA-based data storage by 50%.
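To make the notion of information capacity concrete, the sketch below assumes the encoding alphabet is the full set of 15 standard IUPAC nucleotide codes (A, C, G, T plus 11 degenerate symbols). The abstract does not state which degenerate bases were used, so this alphabet, the `min_fraction` threshold, and the function names are assumptions for illustration only. Under that assumption the upper bound is log2(15), about 3.91 bits/character, compared with 2 bits/character for A, C, G, and T alone.

```python
from collections import Counter
from math import log2

# Standard IUPAC nucleotide codes: each symbol denotes a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
SYMBOL_FOR_SET = {frozenset(bases): symbol for symbol, bases in IUPAC.items()}


def max_bits_per_character(alphabet_size: int) -> float:
    """Upper bound on information capacity for an alphabet of the given size."""
    return log2(alphabet_size)


def call_symbol(observed_bases, min_fraction: float = 0.1) -> str:
    """Infer the encoded character from the bases read at one position.

    Bases seen in fewer than min_fraction of reads are treated as noise.
    """
    counts = Counter(observed_bases)
    total = sum(counts.values())
    present = frozenset(b for b, n in counts.items() if n / total >= min_fraction)
    return SYMBOL_FOR_SET[present]


if __name__ == "__main__":
    print(round(max_bits_per_character(len(IUPAC)), 2))
    print(call_symbol("AGGAAGAG"))  # a mix of A and G maps to the symbol R
```

Because a degenerate character is read back as a mixture of bases across many molecules, decoding depends on sequencing enough reads per position to tell the mixtures apart.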
Project description:Polypeptides consisting of amino acid (AA) sequences are suitable for high-density information storage. However, the lack of suitable encoding systems that accommodate the characteristics of polypeptide synthesis, storage and sequencing impedes the application of polypeptides for large-scale digital data storage. To address this, two reliable and highly efficient encoding systems, the RaptorQ-Arithmetic-Base64-Shuffle-RS (RABSR) and RaptorQ-Arithmetic-Huffman-Rotary-Shuffle-RS (RAHRSR) systems, were developed for polypeptide data storage. The two encoding systems compress data, correct for the loss of AA chains, correct errors within AA chains, eliminate homopolymers, and apply pseudo-randomized encryption. Without arithmetic compression and error correction, the coding efficiency of the RABSR system for audio, pictures and text was 3.20, 3.12 and 3.53 bits/AA, respectively, while that of the RAHRSR system reached 4.89, 4.80 and 6.84 bits/AA, respectively. When implemented with redundancy for error correction and with arithmetic compression to reduce redundancy, the coding efficiency of the RABSR system for audio, pictures and text was 4.43, 4.36 and 5.22 bits/AA, respectively. This efficiency further increased to 7.24, 7.11 and 9.82 bits/AA, respectively, with the RAHRSR system. Therefore, the developed hexadecimal polypeptide-based systems may provide a new option for highly reliable and highly efficient data storage.
Project description:Combinatorial chemistry, invented nearly 40 years ago, was welcomed with enthusiasm in the drug research community. The method offered access to a practically unlimited number of new compounds. The new compounds, however, are mixtures, and methods had to be developed for the identification of the bioactive components. This was one of the reasons why the method could not provide the expected cornucopia of new drugs. Among the different screening methods, two approaches seem to offer the best results. One of them is based on an intrinsic property of combinatorial split-and-pool solid-phase synthesis: one compound forms on each bead of the solid support. Different methods have been developed to encode the beads and identify the structure of the compounds formed on them. The most important method applies DNA oligomers for encoding. In the second screening approach, DNA-encoded combinatorial libraries are synthesized without the solid support, and the mixtures are screened in solution using affinity binding methods. Libraries containing billions and even trillions of components have been synthesized and successfully tested, leading to the identification of a significant number of new leads.
Project description:Safety and ethical issues are the primary concerns for assisted reproductive technology (ART). However, confusion and contamination of samples are common problems in embryo laboratories, preimplantation genetic testing (PGT) laboratories, and third-party medical testing laboratories due to large sample numbers and complex procedures. Once these problems occur, they are often difficult to trace, posing risks and ethical challenges to hospital reproductive centers, third-party medical testing laboratories, and patient families. Therefore, it is necessary to establish an effective and feasible tracing system to ensure sample safety. In this study, we designed an exogenous encoding sequence (EES) based on DNA data storage technology, which provides a unique identification code for each in vitro cultured embryo, effectively avoiding potential risks and ethical problems caused by sample confusion and contamination. This exogenous encoding sequence is a DNA molecule that is non-toxic and structurally stable. We verified that a small amount of exogenous encoding sequence (6 × 10^9 copies/µL) can be amplified together with embryo biopsy cells and detected by various sequencing methods without affecting the detection of copy number variants (CNVs). Furthermore, contamination from other samples at a proportion of more than 5% can also be identified through the encoding information of the exogenous encoding sequence. Our study shows that the exogenous encoding sequence designed based on DNA data storage technology is effective and reliable, and can be applied in hospital reproductive centers and third-party medical testing laboratories to improve the safety of in vitro cultured embryos and avoid potential ethical problems in the future.