Source: SNIA-SDC23-Manno-Leveraging-Computational-Storage-Simulation-Science.pdf (13 pages)
1 | 2023 SNIA. All Rights Reserved.

Leveraging Computational Storage for Simulation Science Storage System Design
Dominic Manno
Virtual Conference, September 28-29, 2021
LA-UR-23-29917

2 | Computational Storage Why

- Need flexibility in where computation is done (host, network, device), as economics will change over time.
- It's not just energy and time to insight: some of these analytics require the same size footprint as the simulation (a petabyte of RAM), which makes analytics not always feasible.
- Data Agnostic Offloads
  - Server memory bandwidth does not allow many passes over streaming data.
- Data Aware Offloads
  - Analytics is often multiple orders of magnitude less reading than writing.
  - You just have a hard time finding what you are looking for (filter/index/histogram/etc.).
  - Can we add metadata/indexing/ordering to data as it is written, with almost no overhead, and reap huge wins on read (time, hardware resources, energy)?
- For Science
  - Particle methods: “ordered” row-based analytics (KV)
  - Grid methods: columnar-based analytics
  - Large complex grid methods: THE KITCHEN SINK

3 | Data Agnostic Offload (ABOF)

- Offload erasure, encoding, and compression, and make it possible to run each anywhere you like (host, network/DPU, storage/CSA/CSP/CSD).
- Consumable:
  - Open ZFS
  - Kernel module for offloads to register into
  - ZFS mods upstreamed
  - Tested under Lustre
  - Data written is the same; upon failure, just fall back to host
  - No app changes, other than running faster on a less capable host
- Not a block interface, a programming interface
  - Remote malloc, copy, operate on
- Great partnership

4 | Data Aware Row and Col Based Analytics Offload Trials

[Figure label: Analytics Under Erasure CSHDD]

Row (point methods in science)
- Early KV-CSD work was hash based, but many KV workloads require “order from chaos”.
- Leverage LANL/CMU DeltaFS (Best Student Paper, SC19): trace 1K particles in 1T moving in 1M cells, based on LSM/Rocks.
- Why KV-CS: performance/energy win. Few sstables have filter matches, but also very few records are needed per sstable.
- Consumable (pretty consumable):
  - User-facing Rocks API
  - Extension to SNIA NVME-KV interface
  - Could be accelerator under Rocks

Columnar (grid methods in science)
- Leverage Apache ecosystem columnar technology (Parquet/etc.), LANL ZFS knowledge of what blocks are part of a Parquet file, and DuckDB.
- Why CSHDD: performance/energy win. Many row groups selected (due to AMR), but a big variety in the number of records needed from those row groups.
- Consumable (not very consumable, due to inside-filesystem implementation):
  - Special knowledge of ZFS file/erasure; something of a layer violation
  - Tiny proc/mem performed a simple reduction slightly faster than host
  - With many drives behind host, scaling beats host by itself (frees host)

5 | CS data agnostic/data aware learnings

- CSA/CSPs: multi-device/big resource ops (erasure, compression, pipelined functions, etc.)
- CSDs: single-device ops, wins on reductions (requires analytics-friendly stripe/erasure)
- Consumption models and broader use cases matter
  - Accelerate under popular server apps: Rocks (rows) / Object Servers (cols) (select on cols/rows)
  - Leveraging ecosystems seems wise
- Consumers of large-scale data live in one of two basic camps:
  - Don't know what you are looking for, want a compact representation (ML training/AI)
  - Know what you are looking for, but finding it is difficult (due to data size and/or complexity). This is the most obvious place where CS helps, and it can be implemented in mostly north-south communication patterns.
- File systems can be a bit heavy for devices and blocks are useless; perhaps Object is the compromise, with erasure analytics chunks intact on single devices.
- At-rest compression/erasure seems doable, but encryption will be interesting.

[Figure label: Substrait]

6 | Why Columnar and why Offload to near Storage?

- 1 PB file per time step contains all the state (for restart) (think 1 PB) (and thousands of time steps).
- Each cell has 10-100 state variables (64-bit float): temp, pressure, energy, momentum, differentials (for gradients).
- In 3D applications, view this as 10-100 distributed arrays (COLUMNS), serialized in Hilbert space-filling curve order.

[Figure: time steps 1 to 6; cells of a single process in Hilbert order]

- Processes have roughly the same number of cells for comp/mem balance, but must shuffle cells for AMR.
- Find the outer edge of eddies (light blue and yellow). Can lightweight indexing near storage yield nnnX less data?
- If you need all row groups, or all rows of groups, or all columns of groups, why offload? We don't.
  - Reduce trivially by excluding entire row groups
  - Need entire row group (all rows, subset of columns)
  - Need only some rows of the row group
  - It's never that simple
- Columnar? Multi-dimensional unstructured adaptive meshes (grid methods) use distributed arrays/columns.
- Offload? Many selected row groups will yield only some rows, and most queries will exclude many columns.

7 | A Columnar end-to-end demo with Object CS Offloads

[Diagram: DEMO Stack. Labels: S3 plus RPC pushdown service (more complex SQL, and query over more than one object (bucket/objects list)); Open Src object server with S3+RPC pushdown, use OCS backend; OCS api/OCS on NVME; OCSA; OCSD; SSD; ODBC; JDBC; ADBC; Custom Connector; Courtesy Voltron Data]

- Object enables the same view of an “analytic chunk” everywhere (app, obj server, CSA, and CSD). (Object, not block!)
- Object future-proofs CS: if CXL becomes how to address smart storage, offload will be based on memory objects, and if file everywhere ever wins, files and objects are close cousins.
- Assists with the reality that economics will insist on north/south bw devices.

8 | LANL use case (high level)

- LANL grid-based AMR columnar to Parquet
- Overall data in Hilbert order
- One to multiple variable query across many objects or parts

[Diagram labels: S3 and New Pushdown Service; Versity Open Src object server that can write to OCSs; With/without Analytics Friendly Erasure or Compression (future work)?; OCS api/OCS on NVME; OCSA; OCSD; SSD; LANL custom analytics app; Write; Query; Potential Erasure/compression; ODBC; JDBC; ADBC; Custom Connector]

- LANL simulation app writes either individual Parquet row group objects into a bucket, or a multi-part object where each part is a Parquet row group (applications must obey this rule).
- Application can write any of the supported types (Parquet, JSON, CSV, SST) (likely Parquet and maybe SST).

9 | OCS Initiators/Targets Open Ecosystem Demonstration(s)

[Diagram labels:
- S3 Access Method: S3 client; Presto; Object Server (Get, Put, List) (uses REST protocol up and OCS SPDK down); Versity
- File Access Method: Hive map schema; Modified Hive Connector; Pushdown GRPC test client; select; Simulation files via NFS
- Pushdown GRPC services (uses Pushdown GRPC up and OCS SPDK down); Pushdown GRPC client; Query results; AirMettle Analytics
- SPDK initiator (Get, Put, List); Initiator test vehicle (uses OCS SPDK down); SPDK target; OCS(A/D) target test vehicle; NB CSA device; SK Hynix OCSV1 + push down target; SK hynix/LANL
- Object files (in a file system on the OCSA/OCSD system, available to all via NFS)
- Transports: TCP/IP; NVME/NVMEoF
- Directions: Toward Pushdown GRPC accepted method; Toward OCS NVME
- Partners: AirMettle; NeuroBlade
- Legend: Open Source; Proprietary; Value add]

10 | Partnering has been the key to this exploration!

11 | What does this notionally look like?

- Support both row and columnar (multiple types of support: parquet, csv, json, sst)
- Leverage S3 for inspiration
- Leverage Apache Analytics Ecosystem
- Leverage as much of SNIA NVME KV interface as possible
- Open ecosystem for CS

[Diagram labels:
- Compute host: PFS (lustre) client; Parallel KV Client; Parallel Object Client
- High Perf Interconnect
- Data Services Host: PFS (lustre) Server; ZFS; ZFS offload; Kernel offload; KV server; Rocks; Object server; Versity/Other; Offload; NVME CS; NVME
- CSA: Compression/Erasure/encoding; Object compression and erasure; Peers
- NVME Array; NVME/NVMEoF
- Device (CSD): File System or similar; Object Columnar query; KV Sort/Merge; KV filter/query; NVME]

12 | Thanks for your time!

13 | Please take a moment to rate this session. Your feedback is important to us.
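
Note on the row-group cases from the "Why Columnar and why Offload to near Storage?" slide: the idea of adding lightweight indexing as data is written, then excluding entire row groups at read time, can be illustrated with a small sketch. This is not the LANL, OCS, or Parquet implementation. It is a toy model in plain Python that assumes a single column, fixed-size row groups, and a min/max pair stored per row group. The names `RowGroup`, `write_row_groups`, and `query_range` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for one columnar row group plus its min/max index."""
    values: List[float]  # one column, e.g. a per-cell state variable
    lo: float            # minimum of `values`, recorded at write time
    hi: float            # maximum of `values`, recorded at write time


def write_row_groups(column: List[float], group_size: int) -> List[RowGroup]:
    """Split a column into fixed-size row groups, recording min/max as it is written."""
    groups = []
    for start in range(0, len(column), group_size):
        chunk = column[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def query_range(groups: List[RowGroup], lo: float, hi: float) -> Tuple[List[float], int]:
    """Return the values in [lo, hi] and the number of row groups actually scanned."""
    hits: List[float] = []
    scanned = 0
    for g in groups:
        if g.hi < lo or g.lo > hi:
            continue  # entire row group excluded using only the index
        scanned += 1
        # Only some rows of a selected row group may match the predicate.
        hits.extend(v for v in g.values if lo <= v <= hi)
    return hits, scanned


# Toy data: a sorted column stands in for data with good locality.
column = [float(i) for i in range(1000)]
groups = write_row_groups(column, group_size=100)
hits, scanned = query_range(groups, 250.0, 260.0)
print(len(hits), "matching values from", scanned, "of", len(groups), "row groups")
```

With this toy column, the query touches one of the ten row groups and returns only part of it, which mirrors two of the cases on the slide: excluding entire row groups, and needing only some rows of a selected row group. Real data will rarely separate this cleanly, which is the slide's "It's never that simple" point.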