## Mar 25, 2019

### Microsoft Introduces First Prototype For Storing Data In DNA Strands

The growing interconnectivity between biology, chemistry and technology may be one of the most significant new developments in all three of those fields, as well as for the economy. JL

Scientific Reports provides an abstract via MIT Technology Review:

Microsoft has been working toward a photocopier-size device that would replace data centers by storing files, movies, and documents in DNA strands, which can pack in information at mind-boggling density. All the information stored in a warehouse-size data center would fit into a set of Yahtzee dice, were it written in DNA. The National Intelligence Agency’s IARPA program is getting ready to hand out tens of millions toward radical new molecular information storage schemes.

Microsoft has helped build the first device that automatically encodes digital information into DNA and back to bits again.
DNA storage: Microsoft has been working toward a photocopier-size device that would replace data centers by storing files, movies, and documents in DNA strands, which can pack in information at mind-boggling density.
According to Microsoft, all the information stored in a warehouse-size data center would fit into a set of Yahtzee dice, were it written in DNA.
Demo device: So far, DNA data storage has been carried out by hand in the lab. But now researchers at the University of Washington who are working with the software giant say they created a machine that converts electronic bits to DNA and back without a person involved.
The gadget, made from about $10,000 in parts, uses glass bottles of chemicals to build DNA strands, and a tiny sequencing machine from Oxford Nanopore to read them out again.
Still limited: According to a publication on March 21 in the journal Nature Scientific Reports, the team was able to store and retrieve just a single word, “hello,” or five bytes of data. What’s more, the process took 21 hours, mostly because of the slow chemical reactions involved in writing DNA. While the team considered that a success for their prototype, a commercially useful DNA storage system would have to store data millions of times faster.
Why now? It’s a good time for companies involved in DNA storage to show off their stuff. The National Intelligence Agency’s IARPA program is getting ready to hand out tens of millions toward radical new molecular information storage schemes.

## Nature Scientific Reports Abstract

Synthetic DNA has emerged as a novel substrate to encode computer data with the potential to be orders of magnitude denser than contemporary cutting edge techniques. However, even with the help of automated synthesis and sequencing devices, many intermediate steps still require expert laboratory technicians to execute. We have developed an automated end-to-end DNA data storage device to explore the challenges of automation within the constraints of this unique application. Our device encodes data into a DNA sequence, which is then written to a DNA oligonucleotide using a custom DNA synthesizer, pooled for liquid storage, and read using a nanopore sequencer and a novel, minimal preparation protocol. We demonstrate an automated 5-byte write, store, and read cycle with a modular design enabling expansion as new technology becomes available.

## Introduction

Storing information in DNA is an emerging technology with considerable potential to be the next generation storage medium of choice.
Recent advances have shown storage capacity grow from hundreds of kilobytes to megabytes to hundreds of megabytes1,2,3. Although contemporary approaches are book-ended with mostly automated synthesis4 and sequencing technologies (e.g., column synthesis, array synthesis, Illumina, nanopore, etc.), significant intermediate steps remain largely manual1,2,3,5. Without complete automation in the write to store to read cycle of data storage in DNA, it is unlikely to become a viable option for applications other than extremely seldom read archival.
To demonstrate the practicality of integrating fluidics, electronics and infrastructure, and explore the challenges of full DNA storage automation, we developed the first full end-to-end automated DNA storage device. Our device is intended to act as a proof-of-concept that provides a foundation for continuous improvements, and as a first application of modules that can be used in future molecular computing research. As such, we adhered to specific design principles for the implementation: (1) maximize modularity for the sake of replication and reuse, and (2) reduce system complexity to balance cost and labor input required to setup and run the device modules.
Our resulting system has three core components that accomplish the write and read operations (Fig. 1a): an encode/decode software module, a DNA synthesis module, and a DNA preparation and sequencing module (Fig. 1b,c). It has a bench-top footprint and costs approximately \$10 k USD, though careful calibration and elimination of costly sensors and actuators could reduce its cost to approximately \$3 k–4 k USD at low volumes.
Before a file can be written to DNA, its data must first be translated from 1’s and 0’s to A’s, C’s, T’s, and G’s. The encode software module is responsible for this translation and the addition of error correction into the payload sequence (see the Methods section and work by Richard Hamming6). Once the payload sequence is generated, additional bases are added to ensure its primary and secondary structure is compatible with the read process and the DNA sequence is sent to the synthesis module for instantiation into physical DNA molecules.
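The translation step can be illustrated with a minimal Python sketch. It assumes the plain two-bits-per-base mapping A = 0, C = 1, G = 2, T = 3 given in the Methods section, and it deliberately leaves out the one-time pad, error correction and structural bases that the real encoder adds. The function names `bytes_to_bases` and `bases_to_bytes` are hypothetical and chosen only for illustration.

```python
BASES = "ACGT"  # index = 2-bit value: A=0, C=1, G=2, T=3


def bytes_to_bases(data: bytes) -> str:
    """Map each byte to four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def bases_to_bytes(seq: str) -> bytes:
    """Inverse mapping: every four bases become one byte."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    values = [BASES.index(base) for base in seq]
    return bytes(
        (values[i] << 6) | (values[i + 1] << 4) | (values[i + 2] << 2) | values[i + 3]
        for i in range(0, len(values), 4)
    )


sequence = bytes_to_bases(b"HELLO")
print(sequence)                  # CAGACACCCATACATACATT
print(bases_to_bytes(sequence))  # b'HELLO'
```

Five bytes become 20 bases under this mapping; the remaining bases of the stored strand carry the hash, parity and padding described in the Methods.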
The DNA synthesis module is built around two valved manifolds that separately deliver hydrous and anhydrous reagents to the synthesis column. Our initial designs used standard valves, but the dead volume at junction points caused unacceptable contamination between cycles. Therefore, we switched to zero dead volume valves7. The combined flow path is then monitored by a flow sensor, whose output is coupled to a standard fitting; the fitting can be coupled to arbitrary devices, such as a flow cell for array synthesis8 or, in this case, adapted to fit a standard synthesis column. Once synthesis is complete, the synthesized DNA is eluted into a storage vessel, where it is stored until retrieval.
Once sequencing begins, the decode software module aligns each read to the 1 k base extension region and the poly-T hairpin. If the intervening region of DNA is the correct length, the decoder attempts to error check/correct the payload using a Hamming code with an additional parity bit; the code corrects all single-base errors and detects all double-base errors. Once the payload is successfully decoded, it is considered correct if it matches a 6-base hash stored with the data. At this point, sequencing terminates, and the MinION flow cell may be washed and stored for later reuse.
Our system’s write-to-read latency is approximately 21 h. The majority of this time is taken by synthesis, viz., approximately 305 s per base, or 8.4 h to synthesize a 99-mer payload and 12 h to cleave and deprotect the oligonucleotides at room temperature. After synthesis, preparation takes an additional 30 min, and nanopore reading and online decoding take 6 min.
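The latency budget can be checked with a few lines of arithmetic. The sketch below only recombines the figures quoted in the paragraph above; the variable names are illustrative.

```python
# Figures quoted in the text
SECONDS_PER_BASE = 305       # synthesis time per base
PAYLOAD_LENGTH = 99          # bases in the synthesized payload
CLEAVE_DEPROTECT_H = 12.0    # cleavage and deprotection at room temperature
PREP_H = 0.5                 # sequencing preparation (30 min)
READ_DECODE_H = 6 / 60       # nanopore reading and online decoding (6 min)

synthesis_h = SECONDS_PER_BASE * PAYLOAD_LENGTH / 3600
total_h = synthesis_h + CLEAVE_DEPROTECT_H + PREP_H + READ_DECODE_H

print(f"synthesis: {synthesis_h:.1f} h")  # synthesis: 8.4 h
print(f"total:     {total_h:.1f} h")      # total:     21.0 h
```

Synthesis plus cleavage and deprotection account for more than 20 of the roughly 21 hours, which is why the chemistry, not the sequencing, dominates the latency.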
Using this prototype system, we stored and subsequently retrieved the 5-byte message “HELLO” (01001000 01000101 01001100 01001100 01001111 in bits). Synthesis yielded approximately 1 mg of DNA, with approximately 4 μg ≈ 100 pmol retained for sequencing. Nanopore sequencing yielded 3469 reads, 1973 of which aligned to our adapter sequence. Of the aligned sequences, 30 had extractable payload regions. Of those, 1 was successfully decoded with a perfect payload. The remaining 29 payloads were rejected by the decoder for being irrecoverably corrupt.
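As a quick sanity check on these numbers, the following Python sketch recomputes the bit string for "HELLO" and the stage-by-stage yield of the read pipeline. The stage labels are paraphrased from the paragraph above; nothing here goes beyond the counts reported in the text.

```python
message = b"HELLO"
bits = " ".join(format(byte, "08b") for byte in message)
print(bits)

# Read counts at each stage, as reported in the text
funnel = [
    ("total reads", 3469),
    ("aligned to adapter", 1973),
    ("extractable payload", 30),
    ("decoded correctly", 1),
]

# Fraction of reads surviving each step relative to the previous one
for (prev_name, prev_count), (name, count) in zip(funnel, funnel[1:]):
    print(f"{name}: {count} ({count / prev_count:.1%} of {prev_name})")

overall = funnel[-1][1] / funnel[0][1]
print(f"overall decode yield: {overall:.3%}")
```

The steepest drop is between aligned reads and reads with an extractable payload, consistent with the ligation problem discussed next.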
Inspecting the sequencing data indicates that the low payload yield and decode rate were largely due to two factors. The first and primary factor is low ligation efficiency. Although chemical conditions should be optimal for T4 ligase, incomplete strands from the unpurified synthesis product likely out-competed full-length strands, leading to a poor apparent ligation rate of less than 10% (Fig. 2c). The second factor is read and write fidelity. To interrogate the write error rate, we synthesized a randomly generated 100-base oligonucleotide with distinct 5′ and 3′ primer sequences. The oligonucleotide was then PCR-amplified and sequenced with an Illumina NextSeq instrument to reveal an error rate of almost zero insertions, <1% substitutions, and roughly 1% deletions (Fig. 2a).
We demonstrated the first fully automated end-to-end DNA data storage device. This device establishes a baseline from which new improvements may be made toward a device that eventually operates at a commercially viable scale and throughput. While 5 bytes in 21 hours is not yet commercially viable, there is precedent for many orders of magnitude improvement in data storage13. In fact, recent storage advances by Erlich et al.2 of 2 Mbytes and Organick et al. of 200 Mbytes3 demonstrate orders of magnitude improvements in the past two years, and the underlying physics and chemistry show impressive upper bounds for density3.
Furthermore, the modules and methods developed here are now being applied to other molecular computing projects internally. For example, by using a non-cleavable linker in the synthesis column and adding a reagent port for chip-synthesized DNA, we can use the same platform to perform a database query in DNA14. Additionally, our sequencing preparation protocol and loading hardware can be adapted for use with our digital microfluidics platform15 and used as a readout for DNA strand displacement reactions.
Near-term improvements will focus primarily on system optimizations in synthesis, cycle count, and cost. Synthesis time can be reduced by 10–12 hours with the addition of heat in the cleave step16. Multiple writes (with or without reads) can be achieved by adding further synthesis columns and a fluid multiplexer. Multiple reads can also be achieved with minor modifications (Supplemental Section 1) and by exploiting the MinION flow cell’s reusability. Additionally, a cost-optimized version could be designed by eliminating the syringe pump and flow sensor, both unnecessary if flow rates are well measured and calibrated. This could save approximately 60% of our current device’s cost at the expense of more laborious operation. Future improvements will focus on bringing storage density, coding, and sequencing yield up to parity with modern manual and semi-automated methods.

## Methods

### DNA synthesis

DNA synthesis was performed using standard phosphoramidite chemistry17 without capping. Volumes and times, described in Table 1, used reagents purchased from Glen Research Corporation. For solid support (PN: ML1-3500-5), we used a BioAutomation 50 nmole scale synthesis column containing controlled porosity glass.
DNA cleavage was performed in 32% ammonia at room temperature for 1 hour before eluting. De-protection continued for an additional 11 hours in the same ammonia solution in the storage vessel.
Our system is fluidically configured as in Fig. 1b and electrically configured as in Supplemental Section 2.

### Sequencing preparation

The extended adapter was constructed from a 1 kilobase fragment that was PCR-amplified from the lambda genome using hot start TAQ DNA polymerase (NEB M0496) with a Bsa-I restriction site added by the forward primer. The resulting fragment after digestion had a 3′ A overhang and a 5′-GCGT sticky end on the bottom strand. The fragment was then T/A ligated and prepped according to Oxford Nanopore Technology’s (ONT) LSK-108 kit protocol, yielding the extended adapter with a four base sticky end.
The extended adapter was then mixed according to Table 2 into a sequencing master mix that is used in automated sequencing prep. Thirty minutes prior to sequencing, the master mix was combined with the hairpin oligo and incubated. DTT was left out of the T4 buffer because it damages the nanopores and causes sequencing to fail.

### Nanopore sequencing

Nanopore sequencing was done with an Oxford Nanopore Technologies MinION using an MIN-107 R9.5 flowcell and MinKNOW 18.7.2.0 software. Base calling was performed in 4000 event batches using Albacore 2.3.1. The read length distribution and write-to-read quality test were loaded manually (as described in the instructions for LSK-108 sequencing kits); the end-to-end code, write, read, and decode experiment was loaded automatically from the storage vessel.

### Coding and decoding

Prior to coding, the user data (“HELLO” in ASCII bytes plus the hash consisting of the rightmost 12 bits of the SHA256 hash) was passed through a one-time pad to increase entropy, similar to previous work3. One-time pads
$X_{1}=\left(1\,3\,0\,1\,0\,1\,1\,0\,3\,2\,2\,2\,1\,1\,3\,1\,0\,2\,2\,2\,3\,2\,2\,2\,1\,1\,3\,2\,1\,3\,0\,0\right)$
and
$X_{2}=\left(3\,1\,1\,2\,2\,1\,1\,2\,3\,0\,2\,1\,1\,0\,3\,2\,2\,0\,3\,3\,0\,2\,2\,0\,3\,3\,1\,0\,1\,3\,2\,2\right)$
were used for the first and second experiment described in this paper respectively.
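The preparation of the user data can be sketched in Python as follows. Several details are assumptions, because the text does not spell them out: the hash is taken over the ASCII bytes of the message, "rightmost 12 bits" is read as the low 12 bits of the digest interpreted as a big-endian integer, the pad is applied by symbol-wise addition modulo 4, and only the first 26 pad symbols are used for the 26 data symbols. The helper names `to_symbols` and `apply_pad` are hypothetical. The authors' actual implementation is in their Supplemental Section 3 and may differ.

```python
import hashlib

# One-time pad X_1 from the text (32 base-4 symbols)
X1 = [1, 3, 0, 1, 0, 1, 1, 0, 3, 2, 2, 2, 1, 1, 3, 1,
      0, 2, 2, 2, 3, 2, 2, 2, 1, 1, 3, 2, 1, 3, 0, 0]


def to_symbols(value, n_symbols):
    """Split an integer into n base-4 symbols, most significant first."""
    return [(value >> (2 * i)) & 3 for i in reversed(range(n_symbols))]


def apply_pad(symbols, pad, sign=1):
    """Add (sign=1) or subtract (sign=-1) the pad symbol-wise modulo 4.

    ASSUMPTION: modulo-4 addition; the paper does not state the operation.
    zip() stops at the shorter input, so only the first len(symbols) pad
    symbols are used.
    """
    return [(s + sign * p) % 4 for s, p in zip(symbols, pad)]


message = b"HELLO"
digest = hashlib.sha256(message).digest()
hash12 = int.from_bytes(digest, "big") & 0xFFF  # ASSUMPTION: low 12 bits

# 20 symbols of message + 6 symbols of hash = 26 data symbols
data = to_symbols(int.from_bytes(message, "big"), 20) + to_symbols(hash12, 6)

padded = apply_pad(data, X1)
restored = apply_pad(padded, X1, sign=-1)
print(len(data), restored == data)  # 26 True
```

Whatever operation is actually used, it has to be invertible in this way so that the decoder can remove the pad before checking the hash.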
Data was coded using a two-layer scheme that stored 5 bytes over 32 dsDNA bases with an additional 13 bases of 3′ padding to compensate for lost fidelity near the read end (Fig. 2). The outer layer consisted of a (31, 26) Hamming code6 over a four-symbol alphabet with a checksum base that detects all two-base read errors and corrects all single-base errors. The following equivalences were made for the sake of algebraic simplicity: A = 0, C = 1, G = 2, T = 3. We used modulo-4 arithmetic and the canonical generator matrix
$G=\left(I\;\;-A^{T}\right),$
along with the canonical parity check matrix
$H=\left(A\;\;I\right),$
where
$A=\left(\begin{array}{cccccccccccccccccccccccccc}1& 1& 0& 1& 1& 0& 1& 0& 1& 0& 1& 1& 0& 1& 0& 1& 0& 1& 0& 1& 0& 1& 0& 1& 0& 1\\ 1& 0& 1& 1& 0& 1& 1& 0& 0& 1& 1& 0& 1& 1& 0& 0& 1& 1& 0& 0& 1& 1& 0& 0& 1& 1\\ 0& 1& 1& 1& 0& 0& 0& 1& 1& 1& 1& 0& 0& 0& 1& 1& 1& 1& 0& 0& 0& 0& 1& 1& 1& 1\\ 0& 0& 0& 0& 1& 1& 1& 1& 1& 1& 1& 0& 0& 0& 0& 0& 0& 0& 1& 1& 1& 1& 1& 1& 1& 1\\ 0& 0& 0& 0& 0& 0& 0& 0& 0& 0& 0& 1& 1& 1& 1& 1& 1& 1& 1& 1& 1& 1& 1& 1& 1& 1\end{array}\right)$
and I is the identity matrix of the appropriate dimension. To increase error detection, 6 of the 26 data bases stored a 12-bit hash of the payload, which was checked after decoding to ensure data integrity. Source code is available in Supplemental Section 3.
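To make the algebra concrete, here is a minimal Python sketch of the (31, 26) code over the four-symbol alphabet using modulo-4 arithmetic. It builds A from the observation that its columns are the binary expansions (least significant bit in the first row) of the non-power-of-two integers from 3 to 31, which reproduces the matrix printed above. The extra checksum base used for double-error detection is omitted, and the function names are illustrative rather than taken from the authors' source code.

```python
# Columns of A: binary expansions of 3, 5, 6, 7, 9, ..., 31 (26 values)
COLS = [n for n in range(3, 32) if n & (n - 1)]
A = [[(n >> row) & 1 for n in COLS] for row in range(5)]


def encode(data):
    """Systematic encoding with G = (I  -A^T): 26 data symbols, then 5 parity."""
    if len(data) != 26:
        raise ValueError("expected 26 data symbols")
    parity = [(-sum(A[r][j] * data[j] for j in range(26))) % 4 for r in range(5)]
    return list(data) + parity


def syndrome(word):
    """Compute H w (mod 4) with H = (A  I); all zeros for a valid codeword."""
    return [(sum(A[r][j] * word[j] for j in range(26)) + word[26 + r]) % 4
            for r in range(5)]


def correct_single(word):
    """Return a copy of word with at most one symbol error corrected."""
    s = syndrome(word)
    magnitudes = {v for v in s if v}
    if not magnitudes:
        return list(word)              # no error detected
    if len(magnitudes) != 1:
        raise ValueError("syndrome is not a multiple of a single column of H")
    error = magnitudes.pop()           # error value in {1, 2, 3}
    pattern = sum(1 << r for r in range(5) if s[r])
    if pattern & (pattern - 1) == 0:   # single bit set: a parity position
        position = 26 + pattern.bit_length() - 1
    else:                              # otherwise a data position
        position = COLS.index(pattern)
    fixed = list(word)
    fixed[position] = (fixed[position] - error) % 4
    return fixed


data = [i % 4 for i in range(26)]
codeword = encode(data)
corrupted = list(codeword)
corrupted[7] = (corrupted[7] + 3) % 4  # inject one substitution error
print(correct_single(corrupted) == codeword)  # True
```

A single error of value e at position j produces the syndrome e times column j of H, so the nonzero value gives the error magnitude and the pattern of nonzero rows identifies the position.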
For decoding, groups of 4000 reads were collected and base-called using ONT’s Albacore software on 12 CPU cores. Reads that passed QC in Albacore were then aligned to the extended adapter and sequenced for further filtering. Only reads that appeared to have a correctly sized payload region between the adapter sequence and the poly-T hairpin were sent for error checking and decoding.

### DNA alignment

All DNA alignment was done using the parasail parasail_aligner command line tool18 with arguments -d -t 1 -O SSW -a sg_trace_striped_16 -o 8 -m NUC.4.4 -e 4. Alignments to the adapter sequence for decoding used the additional flag -c 20, while payload error analysis used flag -c 8.