By Raguvir Kunani
Data is the dominant player in today's society. Nearly every piece of information we consume is determined by the data collected about us, from the ads we see to the restaurants we are recommended. Our data even decides what news is shown to us. But in order for all these decisions to be made from data, the data has to be stored somewhere. For years, we've stored data on disk drives and tape. But as the demand for data storage increases exponentially, current data storage technologies are struggling to keep pace. While companies and researchers continue to optimize current data storage technologies, others have proposed fundamentally new methods of storing data. One of the newest such methods is encoding information into DNA, a process called DNA data storage. Compared to current storage systems, DNA data storage boasts significantly higher information density (how much data can be stored per unit space) as well as longevity (how long the data can be stored without decaying).

Professor Kannan Ramchandran in the Department of Electrical Engineering and Computer Sciences at UC Berkeley, along with his postdoctoral researcher Reinhard Heckel, recently researched the potential of DNA data storage, specifically aiming to determine its maximum storage rate (bits per nucleotide). Heckel has studied DNA data storage for a few years and made a significant breakthrough in the field in 2015, when he demonstrated error-free reading of information encoded in DNA.

Ramchandran and Heckel broke the task of determining the maximum storage rate into two parts. First, in order to quantify certain aspects of DNA data storage, they devised a mathematical model that represents the process of DNA data storage. Second, they found an optimal encoding/decoding strategy that maximizes the storage rate (under their model). To understand their model and strategy, we need to know some details of the DNA data storage process.
In DNA data storage, the data is converted from its representation as a string of binary bits — a sequence of 0's and 1's — into a sequence of nucleotides (the building blocks of DNA). The sequence of nucleotides is then physically made — or synthesized — into a DNA molecule, where the data is stored. To retrieve the data, the sequence of nucleotides is converted back into 0's and 1's. The encoding and decoding process is summarized in the following diagram (the amplification stage will be explained in the next paragraph).

As one would expect, there are some complications in the process outlined above. In practice, the data being encoded has to be broken into chunks, and each chunk is individually encoded and synthesized into a DNA molecule. Thus, in order to get the data back, each DNA molecule has to be decoded separately. For reasons including dealing with the possibility that some DNA molecules could get destroyed during storage, each DNA molecule is copied many times (this is the amplification stage in the figure above). Additionally, the DNA molecules cannot be ordered spatially during storage. In other words, there is no way to put one DNA molecule to the right of or on top of another DNA molecule. The molecules float around like letters in alphabet soup. Due to this lack of spatial order, when decoding a particular DNA molecule, there is no certainty which DNA molecule is being decoded.

Ramchandran and Heckel not only captured these non-idealities in their model, but also determined what encoding/decoding strategy would best handle the non-idealities and result in a maximally efficient DNA data storage system. Their strategy treats recovering the encoded data like blindly drawing colored marbles from a bag. Imagine someone blindfolds you, hands you a bag of colored marbles, and asks you to draw all of the green marbles from the bag.
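To make the encoding step concrete, here is a minimal sketch of the bits-to-nucleotides conversion and the chunking described above. The two-bits-per-base mapping and the chunk size are illustrative assumptions, not the scheme from the paper; real DNA storage codes add redundancy and avoid error-prone sequences such as long runs of the same base.

```python
# Minimal sketch: 2 bits per nucleotide, with data split into chunks
# that would each be synthesized as a separate DNA molecule.
# The specific mapping below is an illustrative assumption.

BIT_PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BIT_PAIR = {base: pair for pair, base in BIT_PAIR_TO_BASE.items()}

def encode(bits: str) -> str:
    """Convert a binary string (even length) into a nucleotide sequence."""
    return "".join(BIT_PAIR_TO_BASE[bits[i:i + 2]]
                   for i in range(0, len(bits), 2))

def decode(seq: str) -> str:
    """Convert a nucleotide sequence back into its binary string."""
    return "".join(BASE_TO_BIT_PAIR[base] for base in seq)

def chunk(bits: str, size: int) -> list:
    """Break the data into fixed-size chunks, one per DNA molecule."""
    return [bits[i:i + size] for i in range(0, len(bits), size)]

data = "0100101101110010"
molecules = [encode(c) for c in chunk(data, 8)]   # ["CAGT", "CTAG"]
recovered = "".join(decode(m) for m in molecules)
assert recovered == data
```

Note that joining the decoded chunks back together in the right order is exactly what the lack of spatial order makes hard in practice: the code above knows which molecule came first, but a real storage system does not.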
If you know exactly how many green marbles are in the bag and the total number of marbles at the start of your blind draws, then you can calculate how many draws it would take on average to ensure you get all the green marbles. If you do not know the exact number of green marbles — but rather an estimate — then you would have to draw more marbles to ensure you get all the green marbles (in order to account for the uncertainty of how many green marbles there are).

This idea of drawing more marbles is at the core of Ramchandran and Heckel's strategy. In order to deal with the uncertainty presented by the floating DNA molecules (the marbles in the analogy), their strategy involves decoding more DNA molecules than the data was encoded into. Using this strategy of decoding extra molecules, Ramchandran and Heckel calculate the maximum storage rate of a DNA data storage system. More importantly, they prove that no strategy can do better given the assumptions that the data have to be broken into chunks and that the DNA molecules have no notion of spatial order.

Their result is important because it provides a reliable benchmark of the maximum potential of DNA data storage. Whereas we previously could not quantify how powerful DNA data storage can be at its best, Ramchandran and Heckel's work proves that an ideal DNA data storage system achieves a storage rate capable of storing all of Facebook and Wikipedia in just a few drops of liquid!

Ramchandran decided to explore DNA data storage with Heckel because he was "surprised that [they] formulated a problem nobody had studied before." Despite the shortage of academic research on the topic, companies like Microsoft and Intel have been allocating considerable amounts of resources to researching and developing DNA data storage systems.
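The marble analogy is essentially the classic coupon-collector problem: if molecules are sampled blindly and with replacement, you need noticeably more reads than there are distinct molecules to see them all. The quick simulation below is an illustration of that effect, not the authors' model; the molecule count and trial count are arbitrary choices.

```python
import random

def draws_to_collect_all(n_distinct: int, rng: random.Random) -> int:
    """Draw uniformly at random, with replacement, until every one of
    the n_distinct molecule types has been seen at least once."""
    seen = set()
    draws = 0
    while len(seen) < n_distinct:
        seen.add(rng.randrange(n_distinct))
        draws += 1
    return draws

def expected_draws(n_distinct: int) -> float:
    """Coupon-collector expectation: n * (1 + 1/2 + ... + 1/n)."""
    return n_distinct * sum(1.0 / k for k in range(1, n_distinct + 1))

rng = random.Random(0)
n = 100  # distinct molecules (arbitrary for illustration)
trials = [draws_to_collect_all(n, rng) for _ in range(2000)]
average = sum(trials) / len(trials)
print(f"simulated average: {average:.1f} draws, "
      f"theory: {expected_draws(n):.1f} draws")
```

For 100 distinct molecules, the expected number of blind draws is about 519 — more than five times the number of molecules — which is the intuition behind decoding extra molecules to cover the uncertainty.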
Microsoft aims to have an operational DNA storage system in its data centers by 2020, with the ultimate goal of "totally replacing tape drives." Their aspirations are indicative of a broader application of DNA data storage to drastically reduce the storage space used by enterprise data centers.

In the age of big data, increased data storage abilities — especially increases as large as those possible with DNA data storage — might not be something society is ready for. With large tech companies like Google and Facebook facing consequences for their collection and use of data, appropriate regulations must be in place before any more data is transferred and analyzed. Thankfully, we do not have to worry about larger-scale data abuse right now, since DNA data storage is currently too expensive to be realistically used at industrial scale. However, due to recent advancements in DNA synthesis made at UC Berkeley, the cost of DNA synthesis — and therefore DNA data storage — could decrease in the near future.

The work engineers do shapes the world around us. But given the technical nature of that work, non-engineers may not always realize the impact and reach of engineering research. In E185: The Art of STEM Communication, students learn about and practice written and verbal communication skills that can bring the world of engineering to a broader audience. They spend the semester researching projects within the College of Engineering, interviewing professors and graduate students, and ultimately writing about and presenting that work for a general audience. This piece is one of the outcomes of the E185 course.

DNA data storage: A 1 million year old hard drive was originally published in Berkeley Master of Engineering on Medium.