Approximate DNA Storage with High Robustness and Density for Images
Presented by Bingzhe Li, Assistant Professor, University of Texas at Dallas
Virtual Conference, September 28-29, 2021
2023 SNIA. All Rights Reserved.

Big Data Era
- Data doubles almost every 2 years
- 44 Zettabytes in 2020
- 175 Zettabytes in 2025
- Image from: https:/

Why DNA Storage?
- Large gap between generated data and installed storage capacity
- 1 EB data center: Fort Worth, TX, 750,000 sq ft
- 25,000 x 8 TB HDDs
- 5-10 years of warranty
- 1 gram DNA [1]
- Several centuries [2]
- Photo: Tara Brown/UW
[1] Allentoft et al. The half-life of DNA in bone: measuring decay kinetics in 158 dated fossils. Proceedings of the Royal Society B: Biological Sciences, 279(1748):4724-4733, 2012.
[2] Grass et al. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angewandte Chemie International Edition, 54(8):2552-2555, 2015.
[3] Figure source: IDC

What is DNA Storage?
- Nucleotides/bases: A, T, C, G
- Simple encoding (bits to base): 00-A, 01-T, 10-G, 11-C
- Example data: 1001001100110110 encodes to GTACACTG
- Strand layout (150-300 bases): primer, metadata, payload, primer
- Write: Encoding, Assembling, Synthesis
- Read: Sequencing, Disassembling, Decoding

Issues of DNA Storage
- Errors of DNA storage (original sequence: GTACA)
  - Substitution error: GTGCA
  - Deletion error: GTCA
  - Insertion error
- Some patterns may increase error rates:
  - Consecutive identical nucleotides (e.g., "AAAA")
  - Hairpin structure/secondary structure, etc.
- DNA storage is:
  - Error-prone
  - Expensive (e.g., $1 million/GB)
  - Slow (e.g., hours/GB)
  - In need of special preservation
  - Low in encoding density (the ideal is 2 bits/nt, e.g., 00-A, 01-T, 10-C, 11-G)

Error Propagation in DNA Storage
- Mapping (bits to base): 00-A, 01-T, 10-G, 11-C
- Binary: ... 01100001 01110000 01110000 01101100 01100101 ...
- Original DNA sequence (after encoding): ... TGAT TCAA TCAA TGCA TGTT ...
- Sequencing result (after synthesis and sequencing, with one deletion error): ... TGAT TAAT CAAT GCAT GTT ...
- Decoded binary: ... 01100001 01000001 11000001 10110001 100101 ...
- Conclusion: one nucleotide error causes a series of errors in the nucleotides that follow it

Error Propagation (EP) in DNA Storage (cont.)
- EP in sequencing: voting across millions of DNA strands
- Conclusion: error propagation in the DNA sequence; one nucleotide error causes a series of errors in the nucleotides that follow it
Lin, Dehui, Yasamin Tabatabaee, Yash Pote, and Djordje Jevdjic. Managing reliability skew in DNA storage. In Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 482-494. 2022.

Issues of DNA Storage (cont.)
- DP-DNA (MASCOTS23)
- IMG-DNA (Systor21)
- HL-DNA (ICCD22)

Increase Density of DNA Storage
DP-DNA: A Digital Pattern-Aware DNA Encoding Scheme to Improve Encoding Density of DNA Storage [1]
[1] Bingzhe Li, Li Ou, Bo Yuan, and David Du, "DP-DNA: A Digital Pattern-Aware DNA Encoding Scheme to Improve Encoding Density of DNA Storage", The 31st International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (2023).

A typical encoding scheme: rotation code
- Rotating encoding [1]
- Avoids long homopolymers
- GC content is roughly maintained
[1] James Bornholt, Randolph Lopez, Douglas M Carmean, Luis Ceze, Georg Seelig, and Karin Strauss. A DNA-based archival storage system. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, pages 637-649, 2016.

Issues of previous work
- Low encoding density
- Mapping 8 bits to 5 or 6 trits (base 3): 1.57 bits/nt
- Theoretically, encoding density is 2 bits/nt, or 1.98 bits/nt

Encoding scheme: 2bit-code and unbalance code
- Issue: how about 111111 for 2bit-code? Long homopolymers

Issue of 11-code
- On average, encoding density is 1.6 bits/nt
- But, an extreme case: a sequence of 1111,1111 with an A at the beginning
- Then, the DNA sequence will be: A ACAC,ACAC
- Encoding density is 1 bit/nt

Observation: to solve the issue
- Four patterns (i.e., 00, 01, 10, and 11) have different distributions among sequences
- 1 nt/bit is used for the pattern with the lowest percentage
- The lower-bound case is 25% for all patterns

Digital Pattern aware code (DP-DNA)
- Find the lowest-frequency pattern
- Use the corresponding code
- For example, if 11 has the lowest frequency in a binary sequence, use 11-code
- Worst case: all patterns appear evenly in a sequence
- Encoding density is 1.60 bits/nt (versus 1.57 bits/nt)

Adding 2bit-code and Using Variable Length
- Adding 2bit-code
  - Ideal encoding density (2 bits/nt)
  - If some sequences encoded with 2bit-code have no bio-constraint violations, we can encode those sequences with 2bit-code
- Variable length
  - Ideal encoding density (2 bits/nt)
  - Figure labels: a sequence encoded with 2 bits/nt; bio-constraint violation; encoding with 2bit-code
- Encoding density (equation shown on the slide), where the symbols with subscripts 1 and 2 indicate the code densities of the low-density and high-density codes, respectively. L is the default length of the binary sequence to be encoded. M indicates how many bits are excluded for the high-density code. Lmeta refers to the number of nucleotides used for metadata such as primer pairs and internal index in DNA strands.

DP-DNA overall design

Experimental results
- Dataset: Web, Database, Text, Image, Video

Increase Robustness of DNA Storage for Images
IMG-DNA: approximate DNA storage for images [1]
[1] Bingzhe Li, Li Ou, and David Du. IMG-DNA: approximate DNA storage for images. Proceedings of the 14th ACM International Conference on Systems and Storage. 2021.

High Demand for Storing Images

Observations of DNA Storage Encoding
- Small practical tube capacity: about 230 GB per tube for random-access based DNA storage [1]
- Error prone: propagation errors [2]; one nucleotide error causes a series of errors in the nucleotides that follow it
[1] Y. Wei, B. Li, and D. H. Du, "DNA storage: A promising large scale archival storage?" arXiv preprint arXiv:2204.01870, 2022.
[2] B. Li, L. Ou, and D. Du, "IMG-DNA: approximate DNA storage for images," in Proceedings of the 14th ACM International Conference on Systems and Storage, 2021, pp. 1-9.
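
The propagation effect can be reproduced in a few lines. The sketch below is an illustration, not code from the talk: it uses the simple 2-bit mapping shown earlier (00-A, 01-T, 10-G, 11-C) and the strand from the "Error Propagation in DNA Storage" slide, and the function names are hypothetical. It compares a substitution and a deletion at the same position.

```python
# Simple 2-bit mapping from the talk: 00-A, 01-T, 10-G, 11-C.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "G", "11": "C"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(bits: str) -> str:
    """Map each pair of bits to one nucleotide."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> str:
    """Map each nucleotide back to two bits, position by position."""
    return "".join(BASE_TO_BITS[base] for base in strand)


def bit_mismatches(expected: str, received: str) -> int:
    """Count differing positions; bits missing from a shorter read count too."""
    differing = sum(a != b for a, b in zip(expected, received))
    return differing + abs(len(expected) - len(received))


original_bits = "0110000101110000011100000110110001100101"
strand = encode(original_bits)  # TGATTCAATCAATGCATGTT

# Substitution at index 5 (C replaced by A): the strand keeps its length.
substituted = strand[:5] + "A" + strand[6:]
# Deletion at index 5: every later nucleotide shifts left by one position.
deleted = strand[:5] + strand[6:]

print("substitution:", bit_mismatches(original_bits, decode(substituted)), "bit errors")
print("deletion:    ", bit_mismatches(original_bits, decode(deleted)), "bit errors")
```

The substitution damages only the two bits carried by that nucleotide, while the deletion misaligns everything after it, which is the behaviour the barrier schemes later in the talk are meant to contain.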

Background of JPEG-based Image
The 14th ACM International Systems and Storage Conference (Systor21)
- DCT: Discrete Cosine Transform
- DPCM: Differential Pulse Code Modulation
- JFIF: JPEG File Interchange Format
- Two observations [1, 2]:
  - Fault tolerance
  - DC and AC coefficients have different influence on the quality of images
[1] Yu-Chun Kuo, Ruei-Fong Chiu, and Ren-Shuo Liu. Long-term JPEG data protection and recovery for NAND flash-based solid-state storage. In 2019 35th Symposium on Mass Storage Systems and Technologies (MSST), pages 141-147. IEEE, 2019.
[2] Qianqian Fan, David J Lilja, and Sachin S Sapatnekar. Adaptive-length coding of image data for low-cost approximate storage. IEEE Transactions on Computers, 69(2):239-252, 2019.

Our Contributions
- Image-based DNA Storage Architecture
- AC/DC Coefficient Separation at DNA Level
- Adding Barriers
- Asymmetric Barriers for AC/DC Coefficients

Image-based DNA Storage Architecture
1. AC/DC separation
2. Encoding
3. Adding barrier
4. Chunking & assembling
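
Steps 2 and 3 of the architecture (encoding and adding barriers) can be sketched roughly as follows. This is not the IMG-DNA implementation. It assumes a fixed six-trit-per-byte conversion, a rotation code in which each trit picks one of the three nucleotides that differ from the previous one, a virtual leading "A" so that no partition starts with "A", and an "AA" barrier between partitions. AC/DC separation, barrier windows, asymmetric barriers and chunking are left out, and all function names are hypothetical.

```python
import re
from typing import List, Optional

ORDER = "ACGT"
BARRIER = "AA"


def byte_to_trits(value: int) -> List[int]:
    """Fixed six base-3 digits per byte (a simplification; 3**6 = 729 covers 0..255)."""
    digits = []
    for _ in range(6):
        value, digit = divmod(value, 3)
        digits.append(digit)
    return digits[::-1]


def rotation_encode(data: bytes) -> str:
    """Each trit selects one of the three nucleotides that differ from the previous one."""
    prev = "A"  # virtual predecessor, so a partition never starts with 'A'
    out = []
    for byte in data:
        for trit in byte_to_trits(byte):
            prev = ORDER[(ORDER.index(prev) + 1 + trit) % 4]
            out.append(prev)
    return "".join(out)


def rotation_decode(partition: str) -> Optional[bytes]:
    """Return the decoded bytes, or None if the partition is visibly corrupted."""
    if len(partition) % 6:
        return None
    prev, trits = "A", []
    for base in partition:
        trit = (ORDER.index(base) - ORDER.index(prev) - 1) % 4
        if trit == 3:  # repeated nucleotide, impossible in a clean rotation code
            return None
        trits.append(trit)
        prev = base
    values = []
    for i in range(0, len(trits), 6):
        value = 0
        for trit in trits[i:i + 6]:
            value = value * 3 + trit
        if value > 255:
            return None
        values.append(value)
    return bytes(values)


def assemble(partitions: List[bytes]) -> str:
    """Encode each partition separately and join them with the barrier."""
    return BARRIER.join(rotation_encode(p) for p in partitions)


def disassemble(strand: str) -> List[Optional[bytes]]:
    """Split at each barrier and decode every partition independently."""
    # A partition may end with 'A' but never starts with one, so the barrier
    # is the "AA" that is not followed by another 'A'.
    return [rotation_decode(p) for p in re.split(r"AA(?!A)", strand)]


partitions = [b"DC", b"ac", b"xy"]
strand = assemble(partitions)
damaged = strand[:16] + strand[17:]  # delete one nucleotide in the middle partition

print(disassemble(strand))
print(disassemble(damaged))
```

In this example the deletion makes the middle partition undecodable, but the partitions on either side still decode correctly, because decoding restarts at every barrier.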

Adding Barriers and Asymmetric Barriers
- "AA" as a barrier keeps the error propagation within a partition
  - No two consecutive identical "A" in the rotation encoding scheme
  - The probability of generating "AA" caused by errors is low
  - Barrier window is used for preventing the errors of insertion and deletion
- Asymmetric barriers for AC/DC coefficients
  - Quality: AC and DC coefficients have different influence on the quality of images
  - Overhead: the number of AC coefficients is much larger than that of DC coefficients

Experimental Results
- Dataset: ImageNet
- Baselines: 1) Raw-DNA; 2) Approx-IMG; 3) IMG-DNA
- Metric: SSIM (structural similarity index metric)
- DNA strand length: 250 bp
- Environment: a system with Intel i-7-47900 CPU 3.6GHz and 8GB memory; MATLAB 2020a

Robustness of Image-based DNA System
- The higher the SSIM, the better the quality of the images
- More results are shown in the paper
- Figure: a graphic view of an image with different encoding schemes (0.1% error rate)
- Figure: overall comparison

Increase Robustness and Density of DNA Storage for Images
HL-DNA: A Hybrid Lossy/Lossless Encoding Scheme to Enhance DNA Storage Density and Robustness for Images [1]
[1] Yi Li, David HC Du, Li Ou, and Bingzhe Li. HL-DNA: A Hybrid Lossy/Lossless Encoding Scheme to Enhance DNA Storage Density and Robustness for Images. In 2022 IEEE 40th International Conference on Computer Design (ICCD), pp. 434-442. IEEE, 2022.

Motivation
- Images are error tolerant
- DNA storage is error-prone
- Consider them together

Lossless code design
- DNA strands need to follow some bio-constraints to avoid high errors
- Rotation code helps avoid homopolymers (e.g., AAAA)
- Lossless code design
  - High density area: 2 bits/nt
  - Low density area: 1 bit/nt
  - Common first nucleotide

Lossy code design
- Combine two low density rows together
- Using four different codes (C10, C11, C00, and C01)
- Four codes have different error preferences
- 1X(0) indicates 11 and 10 are both encoded into the same nucleotides but will be decoded back to 10

Partition Scheme: Adding Barrier
- "A" as a barrier indicator
- Improve the robustness of DNA storage like [1]
- Restricts the error propagation in a partition
- Enables multiple encodings in the same DNA strand to improve the encoding density and reduce error rates induced by the lossy encoding
[1] B. Li, L. Ou, and D. Du, "IMG-DNA: approximate DNA storage for images," in Proceedings of the 14th ACM International Conference on Systems and Storage, 2021, pp. 1-9.

Overall Design of HL-DNA
1. Encode binary to nucleotides based on the encoding scheme
   - Which encoding is used is selected based on density and lossy/lossless choice
2. Insert "barrier" into the DNA sequence
3. Add the corresponding metadata such as primers, index, ECC, etc.
4. Coding format to indicate multiple encodings in the DNA strand

Experimental Setup
- Dataset: ImageNet
- Four schemes: Church et al. [1], Organick et al. [2], Blawat et al. [3], and HL-DNA
- Metrics: encoding density (bits/nt); SSIM (structural similarity index metric)
- DNA strand length: 300 bp
[1] G. M. Church, Y. Gao, and S. Kosuri, "Next-generation digital information storage in DNA," Science, vol. 337, no. 6102, pp. 1628-1628, 2012.
[2] L. Organick, S. D. Ang, Y.-J. Chen, R. Lopez, S. Yekhanin, K. Makarychev, M. Z. Racz, G. Kamath, P. Gopalan, B. Nguyen et al., "Random access in large-scale DNA data storage," Nature Biotechnology, vol. 36, no. 3, p. 242, 2018.
[3] M. Blawat, K. Gaedke, I. Huetter, X.-M. Chen, B. Turczyk, S. Inverso, B. W. Pruitt, and G. M. Church, "Forward error correction for DNA data storage," Procedia Computer Science, vol. 80, pp. 1011-1022, 2016.

Overall encoding density comparison
- HL-DNA increases the average encoding density of the previous studies by about 20.2%-89.4%
- HL-DNA achieves the highest SSIM, which indicates the best robustness among the different schemes

Robustness of Image-based DNA System
- The higher the SSIM, the better the quality of the images
- Figure: a graphic view of an image with different encoding schemes (0.5% error rate)

Potential DNA storage research
- Scalability
- Capability
- Encoding/ECC
- Microfluidic system
- More issues:
  - DNA storage preservation
  - Issue of limited read number
  - Performance of sequencing/synthesis
  - API to users

Conclusions
- DP-DNA for increasing areal density
- IMG-DNA is a robust architecture of DNA storage for images
- A hybrid lossy/lossless encoding based DNA storage architecture called HL-DNA
- Potential DNA storage research directions

Thanks! Q&A

Acknowledgement
- Prof. David Du
- Dr. Li Ou
- Yixun Wei
- Alex Sensintaiffar
- Yi Li