Pantheon DNA Data Storage CODEC
Experiences, Challenges, and Innovations

André Guilherme da Costa Martins, PhD Biomed. Sci.
Bioinformatics researcher
Institute for Technological Research - IPT, Brazil
andremartins@ipt.br

Virtual Conference, September 28-29, 2021
2023 SNIA. All Rights Reserved.

Who are we?
The Institute for Technological Research - IPT has contributed actively to scientific and technical advances for 124 years. We provide technical solutions for industry, governments, and society, enabling them to overcome the challenges of our time.
2021 Storage Developer Conference. IPT-Lenovo. All Rights Reserved.

[Figure: end-to-end workflow. Labels: original binary data, ENCODING, encoded DNA (example: ATGCTAGCAAGGCGTGCA), DNA synthesis, stored DNA molecules, DNA sequencing, sequenced DNA, DECODING, recovered binary data.]

The Pantheon CODEC
A versatile CODEC:
- Robust DNA data architecture
- Binary data pre-processing
- Multiple choices for mapping algorithms
- Multi-layer ECC strategy
- Supports SNIA's sectors (S0 & S1)
- Includes NGS processing algorithms
- Compatible with multiple sequencing and storage strategies

[Figure: the Pantheon modules: Apollo, Artemis, Chiron, Hermes, Hephaestus, Gaia.]

Apollo, the encoding module
Apollo and the Muses - Michel Dorigny, early 1640s

Apollo: oligonucleotide structure
[Figure: oligonucleotide layout, 5' to 3': Start primer | Address label | Payload | Inner ECC | End primer. Color legend: Adenine, Guanine, Thymine, Cytosine.]

Apollo: encoding process
[Figure: analysis and processing of the input yields oligonucleotides (example bases: ATTGCATGGATAACG) with the same 5'-to-3' layout. Inputs shown: metadata from the file system, CODEC parameters from data blocks, metadata from data blocks.]
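
As a rough illustration of the oligonucleotide layout above, the sketch below assembles one oligo from a start primer, an address label, a payload, an inner ECC field, and an end primer. The 2-bit-per-base mapping, the primer and address strings, and the single parity base standing in for the inner ECC are all assumptions made for this example; the slides do not say which mapping or ECC Apollo actually uses.

```python
# Hypothetical 2-bit mapping. Pantheon offers several mapping algorithms,
# and the slides do not state which bits map to which base.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}
BASES = "ACGT"


def bytes_to_dna(data: bytes) -> str:
    """Map every pair of bits to one base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bytes(seq: str) -> bytes:
    """Inverse of bytes_to_dna; expects a length that is a multiple of 4."""
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def parity_base(seq: str) -> str:
    """Toy stand-in for the inner ECC: sum of base values modulo 4."""
    return BASES[sum(BASES.index(base) for base in seq) % 4]


def build_oligo(start_primer: str, address: str, payload: bytes, end_primer: str) -> str:
    """Concatenate the fields in 5'-to-3' order."""
    body = address + bytes_to_dna(payload)
    return start_primer + body + parity_base(body) + end_primer


if __name__ == "__main__":
    print(build_oligo("ACGT", "ATCG", b"Hi", "TGCA"))
```

A single parity base can only flag some errors; it is used here purely to show where an inner ECC field would sit in the oligo.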

Apollo: metadata & Outer ECC
[Figure: an Archive Metadata Block (AMB) followed by a series of data blocks (...), shown together with their Outer ECC segments.]

Apollo: primers architecture
[Figure: each oligonucleotide is flanked by a start primer and an end primer. Two kinds of primer pair are shown: a universal primer pair and block-specific primer pairs.]

Apollo: DNA blocks architecture
- System directory tree
- File checksums
- File coordinates within blocks
- CODEC parameters for blocks
- Primer data
- CODEC manual
- Other types of data to assist data recovery
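
One way to picture the metadata items listed above is as a small serializable record. The sketch below is only an illustration: the class and field names, the use of SHA-256 as the checksum, and the JSON layout are assumptions, not the format Pantheon defines.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FileEntry:
    path: str      # position in the directory tree
    sha256: str    # checksum algorithm is an assumption
    block: int     # data block that holds the file
    offset: int    # byte offset of the file inside that block
    length: int    # file size in bytes


@dataclass
class ArchiveMetadata:
    codec_parameters: dict = field(default_factory=dict)
    primers: dict = field(default_factory=dict)
    files: list = field(default_factory=list)

    def add_file(self, path: str, content: bytes, block: int, offset: int) -> None:
        digest = hashlib.sha256(content).hexdigest()
        self.files.append(FileEntry(path, digest, block, offset, len(content)))

    def to_json(self) -> str:
        # asdict() also converts the nested FileEntry objects to dicts.
        return json.dumps(asdict(self), sort_keys=True)


if __name__ == "__main__":
    amb = ArchiveMetadata(codec_parameters={"mapping": "example"})
    amb.add_file("docs/readme.txt", b"hello", block=1, offset=0)
    print(amb.to_json())
```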

Apollo: integration with SNIA's specifications for sectors S0 & S1
[Figure: archive layout: S0 & S1 | Archive Metadata Block (AMB) | Data Blocks.]
S1 - CODEC parameters:
- General CODEC & ECC parameters
- AMB-specific ECC & CODEC parameters
- AMB file checksums

Chiron, the NGS pre-processing module
The Education of Achilles - Bénigne Gagneraux, 1785

Chiron: Pre-processing NGS reads
Pre-processing steps:
- Merge read pairs (paired-end strategy)
- Adapter/primer trimming
- Demultiplexing coding blocks
- Reorient DNA sequences
- Discard low-quality reads
[Figure: paired-end reads R1 and R2 over an oligonucleotide. Color legend: Adenine, Guanine, Thymine, Cytosine.]
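
Some of the pre-processing steps above can be sketched in a few lines. The example below covers quality filtering, reorientation, primer trimming, and demultiplexing by block-specific primers; it does not merge read pairs. Exact primer matching, the mean-Phred cutoff of 20, the Phred+33 quality encoding, and all function names are assumptions chosen for illustration, not Chiron's actual behavior.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def preprocess(seq, quality, start_primer, end_primer, min_quality=20.0):
    """Return the primer-trimmed, forward-oriented insert, or None if discarded."""
    if not quality or mean_phred(quality) < min_quality:
        return None  # discard low-quality reads
    # Try the read as sequenced, then its reverse complement (reorientation).
    for candidate in (seq, reverse_complement(seq)):
        if candidate.startswith(start_primer) and candidate.endswith(end_primer):
            return candidate[len(start_primer):len(candidate) - len(end_primer)]
    return None  # primers not found in either orientation


def demultiplex(seq, quality, block_primers, min_quality=20.0):
    """block_primers maps a block id to its (start_primer, end_primer) pair."""
    for block_id, (start_primer, end_primer) in block_primers.items():
        insert = preprocess(seq, quality, start_primer, end_primer, min_quality)
        if insert is not None:
            return block_id, insert
    return None


if __name__ == "__main__":
    primers = {"block1": ("ACGT", "GGCC"), "block2": ("TTAA", "CCGG")}
    read = reverse_complement("TTAA" + "GATTACA" + "CCGG")
    print(demultiplex(read, "I" * len(read), primers))
```

Real reads carry sequencing errors inside the primers too, so a practical tool would tolerate a few mismatches instead of requiring exact matches.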

[Figure: demultiplexing. Labels: planned data blocks; sequenced DNA reads; reads sequenced and demultiplexed into data blocks; block-specific primer pairs.]

Hermes, address-oriented module
Mercury (Hermes) - Buti Lodovico, 16th century

Hermes: address library
[Figure: the 5'-Address label-3' region of reads from a DNA block is compared with the address database by edit distance. Example address shown: ATCGGCATACTAC. Color legend: Adenine, Guanine, Thymine, Cytosine.]
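
The address lookup by edit distance can be sketched as follows. This is a minimal illustration using the standard Levenshtein distance; the `max_distance` cutoff, the rule of rejecting ties, and the function names are assumptions, since the slides only state that edit distance against an address database is used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def closest_address(label: str, address_db, max_distance: int = 2):
    """Return (address, distance), or None if nothing is close enough or the best match is tied."""
    scored = sorted((edit_distance(label, address), address) for address in address_db)
    if not scored:
        return None
    best_distance, best_address = scored[0]
    if best_distance > max_distance:
        return None
    if len(scored) > 1 and scored[1][0] == best_distance:
        return None  # ambiguous: two addresses are equally close
    return best_address, best_distance


if __name__ == "__main__":
    database = ["ATCGGCATACTAC", "GGGTTTAAACCCG"]
    # The label below has lost one base relative to the first address.
    print(closest_address("ATCGGCATCTAC", database))
```

Scanning the whole database per read is quadratic overall; it keeps the sketch short but would need indexing for a library with many addresses.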

Hermes: parity check
[Figure: data block oligonucleotides, 5'-Payload | Inner ECC-3', with columns Addr., Mapped data, and Inner ECC. Screenshot of the first 35 DNA sequences from a planned data block in multi-FASTA format. The payload was mapped using a G & A rich scheme. Color legend: Adenine, Guanine, Thymine, Cytosine.]

Hermes: basic algorithm, part 1
[Figure: flowchart. Hermes inputs: Addr. DB and demultiplexed block.fastq. Nodes shown for each DNA read: length & parity check, address check, closest address, and an "Addr. DB error!" exit, each with TRUE/FALSE branches. Reads are collected in an address table (columns: Addr. #1, #2, #3, ... #n; Status: Open, Close, Filled; Seq. pile). Further nodes: MSA & consensus, Hephaestus box, Artemis box.]

Hermes: basic algorithm, part 2
[Figure: flowchart for one read pile (e.g. the pile for addr. #2). Nodes shown: minimum coverage, multiple sequence alignment (MSA), consensus sequence, and length & parity check, with TRUE/FALSE branches leading to the Hephaestus box or the Artemis box.]
Example alignment of a read pile (checked for length and parity):
5'-(...)ATCGGCAT-ACTAC(...)-3'
5'-(...)ATCGGCAT-AC-AC(...)-3'
5'-(...)ATC-GCAG-ACTAC(...)-3'
5'-(...)ACCGCCATTACTAC(...)-3'
5'-(...)AACGGCAT-ACTAC(...)-3'
5'-(...)ATCGCCATTACTAC(...)-3'

Hephaestus, full-length clustering module
Thetis Receiving the Weapons of Achilles from Hephaestus - Anthony van Dyck, 1632

Hephaestus: pair-wise clustering and consensus
[Figure: demultiplexed block.fastq is split into Cluster 1, Cluster 2, ... Cluster n, which yield Consensus 1, Consensus 2, ... Consensus n.]
- Demultiplexing speeds up clustering
- A shared k-mer strategy quickly identifies highly similar DNA sequences so that they are aligned first
- Strategy adapted from metagenomics studies
Example with k = 4: 5'-ATCGGCAT-3' yields the k-mers ATCG, TCGG, CGGC, GGCA, GCAT.

Hephaestus: basic clustering algorithm
[Figure: flowchart. Input: demultiplexed block.fastq or the Hephaestus box. Three rounds of k-mer processing, pairwise alignment, clustering, and consensus, using the 1st, 2nd, and 3rd thresholds; the third round produces the final consensus, followed by chimeric detection. PASS goes to the Artemis box; FAIL is discarded.]
[Figure: centroid similarity threshold (%) for the 1st, 2nd, and 3rd rounds.]

Artemis, the decoding module
Artemis returning from the hunt - Colombel, 1697
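
The shared k-mer screening and the consensus step described for Hephaestus can be sketched as below. This is a simplified illustration under stated assumptions: similarity is the fraction of shared k-mers with a cluster's first read (its centroid), the threshold of 0.6 is arbitrary, a single pass replaces the three threshold rounds, and the consensus is a column-wise majority vote over reads that are assumed to be aligned already (no alignment is performed here).

```python
from collections import Counter


def kmers(seq: str, k: int = 4) -> set:
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(a: str, b: str, k: int = 4) -> float:
    kmers_a, kmers_b = kmers(a, k), kmers(b, k)
    if not kmers_a or not kmers_b:
        return 0.0
    return len(kmers_a & kmers_b) / min(len(kmers_a), len(kmers_b))


def greedy_cluster(reads, threshold: float = 0.6, k: int = 4):
    """Assign each read to the first cluster whose centroid is similar enough."""
    clusters = []  # each cluster is a list; its first read acts as the centroid
    for read in reads:
        for cluster in clusters:
            if shared_kmer_fraction(read, cluster[0], k) >= threshold:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


def column_consensus(aligned_reads) -> str:
    """Majority vote per column of equal-length, gap-padded reads; gaps are dropped."""
    voted = (Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads))
    return "".join(base for base in voted if base != "-")


if __name__ == "__main__":
    print(sorted(kmers("ATCGGCAT")))
    print(greedy_cluster(["ATCGGCATACTAC", "ATCGGCATACTAA", "GGGTTTAAACCCG"]))
```

Applied to the six aligned reads shown on the Hermes part 2 slide, `column_consensus` returns ATCGGCATACTAC, the address that also appears in the address library figure.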

Artemis: decoding the archive metadata block
[Figure: archive layout S0 & S1 | AMB | Data Blocks. AMB.fasta from the Artemis box (example bases: ATCGCCA) enters the DECODER, which uses S0 and S1 and contains Inner ECC, Bit stream Handler, and Outer ECC stages. Output: the archive metadata, i.e. Codec parameters.json, Directory tree.json, Files checksum.json, (...).]

Artemis: applying ECC to correct errors
[Figure: inside the DECODER, the bases of AMB.fasta are turned into a bit stream (example: 0100100110011000...) made of payload and Inner ECC. Stages shown: Inner ECC, Bit stream Handler, Outer ECC. Output: archive metadata.]

Artemis: Decoding the data blocks
[Figure: Data Blocks.fasta enter the DECODER (Inner ECC, Bit stream Handler, Outer ECC), which uses S0, S1, and the archive metadata recovered from the AMB.]
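
To illustrate why an outer ECC layer across oligos lets a decoder tolerate missing sequences, the sketch below uses a single XOR parity payload per group. This is a stand-in technique chosen for brevity, not Pantheon's outer ECC, which the slides do not name; DNA storage work commonly uses stronger erasure codes such as Reed-Solomon. The function names are made up for this example.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(payloads) -> bytes:
    """XOR of all payloads in a group; the payloads must share one length."""
    if len({len(payload) for payload in payloads}) != 1:
        raise ValueError("payloads must be non-empty and of equal length")
    return reduce(xor_bytes, payloads)


def recover_missing(received, parity: bytes):
    """Fill in the single payload marked as None (an oligo with zero coverage)."""
    missing = [index for index, payload in enumerate(received) if payload is None]
    if len(missing) != 1:
        raise ValueError("a single XOR parity recovers exactly one missing payload")
    present = [payload for payload in received if payload is not None]
    restored = reduce(xor_bytes, present, parity)
    repaired = list(received)
    repaired[missing[0]] = restored
    return repaired


if __name__ == "__main__":
    group = [b"DNA!", b"data", b"ECC."]
    parity = make_parity(group)
    print(recover_missing([group[0], None, group[2]], parity))
```

One parity payload per group repairs only one erasure in that group; recovering several missing oligos per group needs a code with more redundancy.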

Gaia, the development module
Tellus Mater (Mother Gaia) panel, Ara Pacis Augustae - Rome, Italy

Gaia: a sandbox module to support development
What Gaia does:
- Simulate different sequencing strategies
  - Single or paired-end reads
  - Library preparation
  - Coverage variation
  - Sequencing platforms
- Simulate different synthesis strategies and biases
- Pandora's box of bias models
[Figure: Medusa, Manticore.]
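
A sandbox of this kind can be approximated with a small read simulator. The sketch below draws an uneven number of copies per oligo and applies substitution, insertion, and deletion errors at fixed rates. The uniform coverage model, the error rates, and the function names are assumptions for illustration and are not Gaia's bias models.

```python
import random

BASES = "ACGT"


def mutate(seq, rng, sub_rate=0.01, ins_rate=0.005, del_rate=0.005):
    """Apply independent per-base deletion, substitution, and insertion errors."""
    out = []
    for base in seq:
        draw = rng.random()
        if draw < del_rate:
            continue  # base deleted
        if draw < del_rate + sub_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
        if rng.random() < ins_rate:
            out.append(rng.choice(BASES))  # random base inserted after this one
    return "".join(out)


def simulate_reads(oligos, mean_coverage=10, seed=0, **error_rates):
    """Return shuffled (oligo_index, read) pairs with uneven per-oligo coverage."""
    rng = random.Random(seed)
    reads = []
    for index, oligo in enumerate(oligos):
        # Coverage variation: between 0 and 2 * mean_coverage copies per oligo,
        # so some oligos may be missing entirely.
        copies = rng.randint(0, 2 * mean_coverage)
        reads.extend((index, mutate(oligo, rng, **error_rates)) for _ in range(copies))
    rng.shuffle(reads)
    return reads


if __name__ == "__main__":
    for index, read in simulate_reads(["ACGTACGTAC", "TTGGCCAATT"], mean_coverage=2, seed=1):
        print(index, read)
```

Keeping the source index next to each read makes it easy to measure how many oligos a decoder recovers under a given error and coverage setting.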

Testing Pantheon CODEC with real data

Experiment workflow
[Figure: stages shown: 1.6 MB data; Encoding (AMB and 15 data blocks); DNA synthesis; single pool in a microtube; R1 & R2 FASTQ files; Processing; Clustering; FASTA files; Decoding.]

Sequencing results
- 67,854 oligos
- 22 oligos were missing (coverage = 0), 0.03% of the total
- 162 oligos had low coverage (5)
- Files were fully recovered using the Outer ECC

The pit of Tartarus for DNA data storage
- Avoiding unwanted DNA patterns in designed oligos

CODEC performance
- 4 fastq files with 9.3M reads in total
- A single CPU core was used
- Times are given as HH:MM:SS.SSS
[Figure: timing diagram over the modules Apollo, Chiron, Hermes, Hephaestus, and Artemis.
AMB (Archive metadata block) slide, times shown: 00:00:31.490 (AMB+DB), 00:00:01.700 AMB, 00:01:30.730 AMB, 00:00:07.800 AMB, 00:00:00.200 AMB.
DB (Data Blocks) slide, times shown: 00:00:31.490 (AMB+DB), 00:01:00.000 DB, 00:15:42.000 DB, 00:00:58.350 DB (fast mode)*, 00:00:21.040 DB, 00:00:00.590 DB, 09:11:28.000 DB*.]
* Increases the sequence loss
* No prior demultiplexing step
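
The percentages in the sequencing results and the HH:MM:SS.SSS timings can be checked with a few lines of arithmetic. The helper names below are made up for this example; the numbers come from the slides.

```python
def percent(part: int, total: int) -> float:
    return 100.0 * part / total


def to_seconds(timestamp: str) -> float:
    """Convert an HH:MM:SS.SSS string to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


if __name__ == "__main__":
    total_oligos = 67_854
    print(f"missing: {percent(22, total_oligos):.2f}%")        # missing: 0.03%
    print(f"low coverage: {percent(162, total_oligos):.2f}%")  # low coverage: 0.24%
    print(to_seconds("09:11:28.000"))                          # 33088.0
```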