OCP Global Summit | October 18, 2023 | San Jose, CA
Pankaj Mehra, Elephance Memory, Inc.

XPI: Acceleration Programming Interface

Scope of XPI Work
Define and evaluate, through implementation and integration, new, developer-friendly programming interfaces for Near Data Processing (NDP) infrastructure now emerging at the intersection of:
- Domain-Specific Architecture (a major trend), and
- Computational Memory/Storage (major opportunities, not yet a trend).

Application areas: 1. Data Infra  2. Scalable AI/ML  3. Media Processing  4. Memory Nodes
Device classes: 1. DPUs  2. Computational Storage  3. Computational Memory
(Storage device/array with exposed/private RAM, with pre-/post-CXL NDP in front, integrated behind a switch, or accelerator-first.)

Targeted Developer Communities
- Adopters: scale, standardize, accelerate
- Providers: mutual and external consistency

Modern Storage Workloads
- EB-scale storage, PB-scale memory, open-format data
- Data gravity is a key consideration for power and performance
- Hence the need for end-to-end acceleration

Compute-Memory Hierarchy
[Figure: compute and memory hierarchies spanning package, server, rack, and fabric tiers; memory tiers include V$, HBM, DDR, CXL 1/2/3, and NVMe, with UPI and RDMA interconnects. A CXL Type 2 accelerator with memory is ideal for near-memory processing.]

Command-Heavy and Communication-Poor Interfaces of Existing Standards
Consider offloading sorts or group-bys on sharded tables (DB), convolution for graph neural networks (DL), or subgraph building against embeddings (DLRM):
- The unit of offload is tiny; the curse of Amdahl!
- Hosts coordinate all host:device (North-South) data movement using eager data-movement semantics, even though CXL enables lazy semantics.
- device:device (East-West) data movement is not contemplated, even though computational memories will naturally move data peer-to-peer in CXL 3.

Principal Gaps in Existing Work
Existing interfaces are fundamentally lacking for CXL-enabled accelerators and memory:
- Lack of application-specific optimization over function/data placement and movement
- Functions to be offloaded depend on other functions
- Open data formats work differently than application-specific formats
- Certain functions require certain runtimes or a specific accelerator
- Function placement and data placement are mutually constraining
We need a language for applications to express intent
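The "curse of Amdahl" above can be made concrete: when each offload command covers only a small fraction of end-to-end work, overall speedup is capped no matter how fast the accelerator is. A minimal illustration of Amdahl's law (the numbers are hypothetical, chosen only to show the contrast):

```python
def amdahl_speedup(offloaded_fraction: float, accel_speedup: float) -> float:
    """Overall speedup when only a fraction of total work runs on the accelerator."""
    return 1.0 / ((1.0 - offloaded_fraction) + offloaded_fraction / accel_speedup)

# Tiny unit of offload: 5% of end-to-end time, even with a 100x accelerator.
small = amdahl_speedup(0.05, 100.0)   # ~1.05x overall

# End-to-end acceleration: 95% of the work on the same 100x accelerator.
large = amdahl_speedup(0.95, 100.0)   # ~16.8x overall
```

This is why the deck argues for end-to-end acceleration rather than command-at-a-time offload.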
on function and data co-placement without resorting to low-level constructs such as blocks and cachelines. We need simple notions of data and function for a deeply tiered hierarchy of memory and compute.

XPI: Core Concepts; Theory of Operations
- Computational SubSystem (CSS): heavy leverage of Redfish. CSS instances are created by Infra Ops for tenants who request them through normal IaaS workflows.
- Service Mesh: a traditional DAG of functions with new, set-valued dependencies.
- Data Set: a new idea for tiered and computational memory and storage.
XPI clients can create and control the execution of a SERVICE MESH on a DATA SET using a COMPUTATIONAL SUBSYSTEM that they have requested.

XPI Data Sets
An XPI DATA SET is a set of device-computable objects that can be created, read, updated, or deleted by service meshes deployed on a Computational Memory node.
- The distance between memory resources and compute resources is a critical consideration in NDP.
- CXL introduces multiple new tiers of latency at the interface between memory and compute.
- CXL 3 enables peer-to-peer visibility and access of exposed memory ranges.
XPI considers data computable by a service if it is in promoted state and reachable
when accessed from the service end point at which the service has been instantiated.

Data Sets
State | Tier | Characteristics
AllocatedObject | HostNearMemory | OnHost; InPromotedState
PackedObject | HostFarMemory | OnDisaggregatedMemoryNode; AsExposedMemory; InDemotedState; FileCache; FileMap; ZeroEffortUnpack
UnpackedObject | DeviceMemory | OnCSD; InPromotedState; AsExposedMemory
File (named manifest of PackedObjects and their associated extents) | HostNearMemory; DeviceMemory (mappable, coherently cacheable) | OnDevice; InDemotedState; AsFileStorage
ArchivedObject | Nearline | Archived; AsObjectStorage

Objects can continue to be Host-computable following North-South movement in a CSS (through promotion/demotion), or East-West movement in a CSS (through application of XPI's copying, communication, and sharing functions), provided they additionally use a self-describing or open data format.

Objects can become Device-computable provided they additionally meet one of the following requirements:
- Located in Unpacked state and accessible from a service end point on the same device, or
- A cached (possibly coherent) copy of a remote object in Unpacked state, located on exposed memory of a remote device or a host.

Computing with Data Sets
Actions required to make these data sets computable
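The state/tier table above implies a small object lifecycle: promotion/demotion moves data across tiers, and Device-computability requires the Unpacked state plus reachability from a service end point on the same device. The slides do not publish a client API, so the sketch below is purely illustrative; every class, method, and device name is invented:

```python
from dataclasses import dataclass

@dataclass
class XpiObject:
    """Hypothetical model of one object from the state/tier table above."""
    name: str
    state: str            # e.g. "PackedObject", "UnpackedObject"
    device: str           # device or node currently holding the object
    open_format: bool     # self-describing/open data format?

    def promote(self, device: str) -> None:
        """Unpack onto a device's memory (North-South move, demoted -> promoted)."""
        if self.state != "PackedObject":
            raise ValueError(f"cannot promote from {self.state}")
        self.state, self.device = "UnpackedObject", device

    def device_computable_from(self, sep_device: str) -> bool:
        """Device-computable: Unpacked AND reachable from an SEP on the same device."""
        return self.state == "UnpackedObject" and self.device == sep_device

obj = XpiObject("embeddings", "PackedObject", "mem-node-0", open_format=True)
assert not obj.device_computable_from("csd-1")   # still packed, wrong device
obj.promote("csd-1")                             # unpack onto the CSD
assert obj.device_computable_from("csd-1")
```

The cached-copy path for remote Unpacked objects is omitted to keep the sketch small.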
Vector- and Zero-Copy-Friendly Arrow
There is unprecedented adoption of the Apache Arrow ecosystem of vector-friendly formats and operations, especially for accelerator-to-accelerator communication, host-to-accelerator function offload, and accelerator-friendly memory. Computational Memory apps will benefit greatly, and CXL NDP will see rapid adoption.
- Customers: a standard memory format enables reuse of libraries of algorithms, even across languages and frameworks.
- Vendors: avoid having to define your own data format and drive its adoption by software and algorithms.
- Customers: avoid the performance penalty, namely serialization/deserialization overheads, when moving data.
- Vendors: ease adoption of on-device acceleration without having to worry about protocol-conversion overheads.
- Vendors: leverage readily available transpilers.

Acceleration Programming Evolution
Coroutine -> Pipelined -> Graph Launch

A Sea of Accelerators?
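The serialization/deserialization penalty noted above is the core motivation for a standard in-memory format: two parties that agree on layout can share one buffer with no copies. A stdlib-only illustration of the difference (Arrow itself is not needed to make the point):

```python
import array
import pickle

data = array.array("d", range(1_000))

# Copy-based exchange: serialize, move bytes, deserialize (two full copies).
wire = pickle.dumps(data)
received = pickle.loads(wire)

# Zero-copy exchange: both sides agree on the layout and share one buffer.
view = memoryview(data)     # no bytes are copied
data[0] = 42.0
assert view[0] == 42.0      # the "consumer" sees the update immediately
assert received[0] != 42.0  # the deserialized copy does not
```

With an agreed columnar format such as Arrow, the same zero-copy sharing works across languages, processes, and (with CXL exposed memory) across devices.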
Google ISCA'23 paper (Profiling Hyperscale Big Data Workloads); see slide notes.

Chained, Asynchronous Accelerators (Google ISCA'23 paper, Profiling Hyperscale Big Data Workloads)
End-to-end acceleration of hyperscale workloads requires:
- A sea of accelerators
- Chained and balanced
- Low set-up time
- A HW-SW co-design approach
- Working on service meshes

Parallelism Exposed
Grossman, et al. SHMEM-ML: Leveraging OpenSHMEM and Apache Arrow for Scalable, Composable Machine Learning. Workshop on OpenSHMEM and Related Technologies, 2021.
[Figure (log scale): evolution from the client-server model (pre-disaggregation, parallelism hidden) through the SPMD model to parallelism exposed; the DSM abstraction fails to scale beyond 32.]

Service Mesh
DAGs (Directed Acyclic Graphs) of programs and installed/configured* services/functions
Key notion: Service; Parent(s); Subservice
- *Configured services can be accessed without installation, e.g., CUDA, an MPI communication library, a Java VM, a Docker service, SQLite. Not all configured services are available at every service end point.
- A service specified in a service mesh may be installed at an SEP.
- SEPs on a host/device/DPU can be enumerated for their currently installed/configured services.
- An installed service in turn presents its own SEP; subservices that require it as a parent can in turn be deployed there.
- Co-location of services and data objects is not required, but preferences can be expressed by specifying a File or an Unpacked Object from a Data Set as one of the parents.
- All Data Sets and, respectively, Unpacked Objects are services; by default they support a basic set of storage operations and, respectively, memory accesses (including remote accesses if InExposedMemory).
SEP: service end point
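A service mesh as described above is a DAG in which each service lists its parents (configured runtimes, other services, or data objects), and a service can be deployed at an SEP only after its parents are present. A minimal, hypothetical sketch of resolving a deployment order with the standard library; the mesh contents are invented for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical mesh: keys are services, values are their parent sets.
# Parents may be configured runtimes (CUDA), data objects, or other services.
mesh = {
    "arrow-compute": {"cuda"},                               # runtime parent
    "group-by-offload": {"arrow-compute", "sharded-table"},  # service + data parents
    "sort-offload": {"arrow-compute", "sharded-table"},
    "merge": {"group-by-offload", "sort-offload"},
}

order = list(TopologicalSorter(mesh).static_order())
# Parents always precede their subservices in the deployment order.
assert order.index("cuda") < order.index("arrow-compute")
assert order.index("arrow-compute") < order.index("merge")
```

Note that the data object ("sharded-table") appears as an ordinary parent, which is how the slides express a co-location preference.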
Service Mesh: Function Tags based on the Pipeline Execution Model
Tag | Characteristics
PipeFriend | Operates in streaming fashion on chunks of input, producing optional streaming output. Does not build up memory in proportion to the number of inputs processed. Non-restartable due to high side effects.
PipeBreakAccumulator | Reductive operator that does not terminate until it has seen all, or indeterminately many, inputs; does not build up memory in proportion to the number of inputs processed. Optionally exposes accumulator state AsExposedMemory. Potentially restartable.
PipeBreakCollator | Operator that processes inputs and builds up substantial memory state in proportion to the number of inputs processed or the number of outputs produced. High cost of restarting.

Parameter | Characteristics
ChunkSize | Enum(H, M, L)

Geyer, et al. Pipeline Group Optimization on Disaggregated Systems. 13th Conference on Innovative Data Systems Research, CIDR
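The three tags can be read operationally: a PipeFriend operator streams chunk by chunk in constant memory, a PipeBreakAccumulator cannot emit until it has seen every input but keeps only constant state, and a PipeBreakCollator's state grows with the input. A generator-based sketch; the function names are illustrative, not part of XPI:

```python
from typing import Iterable, Iterator

def scale(chunks: Iterable[list[float]], k: float) -> Iterator[list[float]]:
    """PipeFriend: streams chunk by chunk; memory use is independent of input count."""
    for chunk in chunks:
        yield [k * x for x in chunk]

def total(chunks: Iterable[list[float]]) -> float:
    """PipeBreakAccumulator: reductive; emits only after all inputs are seen,
    but its state (one running sum) does not grow with the input."""
    acc = 0.0
    for chunk in chunks:
        acc += sum(chunk)
    return acc

def sort_all(chunks: Iterable[list[float]]) -> list[float]:
    """PipeBreakCollator: must collate everything; state grows with the input."""
    return sorted(x for chunk in chunks for x in chunk)

chunks = [[3.0, 1.0], [2.0]]
assert total(scale(iter(chunks), 2.0)) == 12.0   # a pipeline: PipeFriend -> PipeBreak
assert sort_all(iter(chunks)) == [1.0, 2.0, 3.0]
```

Chaining scale into total never materializes the whole input, which is exactly the property the ChunkSize parameter tunes.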
2023 (link).

New Learnings, Resulting XPI Refinements
1. Nvidia discussions: GPU clusters (a); formalize Data Sets (b)
2. DB analytics community: Pipeline Execution Model (a)
3. AI/ML learnings: DB+AI a trend (a); indexes (b)
4. FMS'23: File abstraction (a); NVMe compatibility (b)
5. Academia: Virtual Objects (a); kNN search distance functions (b); tiering, memory-cost considerations (c)
6. Google Sea of Accelerators architecture: asynchronous interface for pipelining (a) across the fabric connecting on-chip and off-chip acceleration

Req | XPI part* (CSS/DS/SM): Implication
1a | No change needed. CSS/Host includes a private Redfish fabric.
1b | Continue current direction; formalize DS/Tags; align to industry work.
2a | DS/Tags and SM/FnTags are NEW for inference and insertion offloads, respectively, into S3 and Aurora tiers for instance. Arrow format SM/ServiceParent.
3a | DS/Tags: ONNX (ML model). SM/ServiceParent: Arrow compute (op).
3b | See paper: high-performance offload of pointer-rich data structures.
4a | DS/FILE compatible with ADLS, Colossus, etc. (tiering, partial mapping).
4b | In progress with DCI.
5a | TBD whether the Parquet and Delta open formats are sufficiently rich for Virtual Objects.
5b | See paper: AI/ML offloads involve pointer chasing.
5c | SM/ServiceMesh/Function: chunkSizes parameter, in preference order.
6a | SM/FnTags: PipeFriend, PipeBreakAccumulator, PipeBreakCollator.
*CSS: Computational SubSystem; DS: Data Set; SM: Service Mesh

Reviewed and Blessed by Developers
Who are building: hyperscale DB; accelerator cores; very-high-performance accelerator; computational storage; data-rich service; big data and DL frameworks; acceleration library; computational memory device; high-performance filesystem.

Representative Usage of XPI
- API technical work on parallelism, progress visibility, and steering
- Adopter considerations, use cases
- Provider considerations
- Testing a range of optimizations: above the line/manual; below the line/automatic

Call for Participation
Examples of WIP: Cloud Sort benchmark; E-W communication using libfabric; Arrow Compute (Acero); collective communication over fabric memory.