AI Hardware & Systems (aiandsystems)

Efficiency Challenges in Genomics
Tom Sheffler

Preface
- Goal is to give insights into the characteristics of genomics computations
- Explain AI/ML on the Edge for DNA processing
- Challenges in AI/ML for genomics (from the real world)

Genomics Applications: why does it matter?
- Cancer Screening
  - identify DNA changes that increase a person's risk
  - guide selection of therapies
- Whole Genome Sequencing for newborns (Wash Post 2018)
  - 6 days old, severe seizures
  - 39 hours to sequence whole genome
  - simple treatment identified
https:/

Rapidly decreasing cost, increasing data and computation
- Cost for WGS (Whole Genome Sequencing)
  - $300K in 2006
  - $1000 in 2020
  - $100 with Ultima UG 100 (Jan 2024)
https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost

Introduction to Sequencing

Sequencing Workflow and Analysis
[Figure: workflow from Extraction through Library Prep, Pooling, and Sequencing to the Analysis Pipeline (Primary Analysis, demux, then consensus and variant calling, xN)]

Extraction
[Figure: DNA sequence ATGCTACG]

Library Preparation
[Figure: Template, Adapter Ligation, Fragmented DNA, Fragment, Sequencing Library; example sequence "ACAC"]

Pooling
[Figure: Pool Libraries]

Sequencing
[Figure: Pool, Sequencer, Data]
- 100 GB to 1 TB+
- 12 to 48 hours

Data Pipeline
Inherent data parallelism, potential streaming parallelism
[Figure: sensor, Primary Analysis, demux, then consensus and variant calling (xN). Data between stages: Base Calls (500 GB+), Demuxed Base Calls (500 GB+), Consensus Reads, Variants (100 MB+)]

Primary Analysis
[Figure: analysis pipeline: Primary Analysis, demux, consensus, variant calling (xN)]

Nanopore Sequencing
Primary: Raw signal to Base-Calls
[Figure: Many Nanopores, Raw Data, Analyze (Neural Net or DSP), base calls such as aacgtcgtactagtctactctaggtacctagtactaa]
https:/

Challenge: ML for Basecalling
- Keeping up with real-time constraints
- Runs are expensive
- Data changes with chemistry
- Noise in Data
  - Sequencing is inherently inexact

Demultiplex
[Figure: analysis pipeline: Primary Analysis, demux, consensus, variant calling (xN)]

Demultiplexing
[Figure: pooled reads before and after demultiplexing]

Demultiplexing using Machine Learning
- Challenges applying ML to this process
- Error rates from previous step stack up
- Consider a basecalling accuracy of 99% (Q20) and an adapter sequence length of 40
- The probability that the entire barcode sequence is correct is 0.99^40 ≈ 0.67
  A(.99) C(.99) C(.99) T(.99) G(.99) T(.99) C(.99) A(.99) ... = 0.99^40 ≈ 0.67
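
The barcode arithmetic above can be checked with a short sketch. It assumes every base is called independently with the same accuracy, which real basecallers do not guarantee, and the function names are hypothetical, chosen only for illustration. Under those assumptions the chance of an error-free 40-base barcode at Q20 comes out at roughly two in three.

```python
def phred_to_accuracy(q: float) -> float:
    """Per-base accuracy implied by a Phred quality score (Q20 -> 0.99)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def prob_all_correct(accuracy: float, length: int) -> float:
    """Probability that `length` bases are all correct.

    Assumption: errors are independent and every base has the same accuracy.
    """
    return accuracy ** length


if __name__ == "__main__":
    # Compare a few quality scores for a 40-base barcode/adapter sequence.
    for q in (20, 30, 40):
        acc = phred_to_accuracy(q)
        p = prob_all_correct(acc, 40)
        print(f"Q{q}: per-base accuracy {acc:.4f}, error-free barcode {p:.3f}")
```

Running the loop shows how quickly per-base errors stack up over the barcode length: a small gain in basecalling accuracy gives a large gain in the fraction of reads whose barcode can be matched exactly.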

ONT: Demultiplexing using Machine Learning
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10173771/

Consensus
[Figure: analysis pipeline: Primary Analysis, demux, consensus, variant calling (xN)]

Machine Learning for Consensus
"Deep Consensus" Acceleration
- One 8M SMRT Cell can take 500 hours to run; 500-way parallelization is 1 hour per shard (GPU V100 3.3x faster) [4]
- On PacBio Revio: 5.5 hours per SMRT cell [3]
- "Being able to achieve higher accuracy using DeepConsensus now allows us to deliver accurate HiFi reads to customers in a shorter amount of time; whereas the Sequel IIe has a standard sequencing time of 30 hours, for our new platform, we can now reduce that to 24 hours." [3]

To the Cloud, and Back
[Figure: the pipeline (Primary Analysis, demux, alignment, consensus, variant calling xN, then Interpretation and Therapy) shown with two splits: Instrument + AI/ML with GPUs, then Cloud; and Instrument, then Cloud/Datacenter]

Challenge: Heat and Noise
- GPUs generate a lot of heat, more than some labs can cool.
- Noise of the fans cooling the GPUs can exceed sound allowances for operators.
- Solution: separate acceleration unit
- Problem: IT department headaches

Challenge: deploying updates
- Updates in a clinical setting
  - Validation of software updates
  - Updates on the order of 6 months may be tolerated
- Connectivity and size
  - Bandwidth to labs is continually improving
  - But may not be what other industries are used to

Challenge: Obtaining Datasets for Training
- Problems:
  - Privacy
  - Consent
  - Population bias
- Noise in Data
  - Sequencing is inherently inexact

Summary
- Genomics computations are inherently parallel!
- Challenges exist in
  - Thermals, noise and power
  - Deployment, updates
  - Obtaining training data
- ML and GPUs have enabled strides in
  - Turnaround time
  - Processing more data