XPI: Acceleration Programming Interface.pdf

OCP Global Summit | October 18, 2023 | San Jose, CA
Pankaj Mehra, Elephance Memory, Inc.
XPI: Acceleration Programming Interface

Scope of XPI Work
Define and evaluate, through implementation and integration, new, dev-friendly programming interfaces for Near Data Processing (NDP) infrastructure now emerging at the intersection of:
  Domain-Specific Architecture (major trend), and
  Computational Memory/Storage (major opportunities, not yet a trend).
1. Data Infra  2. Scalable AI/ML  3. Media Processing  4. Memory Nodes
Storage device/array with exposed/private RAM, with pre-/post-CXL NDP in front, integrated behind switch, or accelerator-first
1. DPUs  2. Comp. Storage  3. Comp. Memory
Targeted Developer Communities:
  Adopters: Scale, Standardize, Accelerate
  Providers: Mutual and External Consistency

Modern Storage Workloads
EB-scale storage, PB-scale memory, open format data
Data Gravity is a key consideration for power and performance

The need for End-to-End acceleration
[Figure: Compute-Memory Hierarchy. Scope labels: Fabric, Server, Rack, Package. Memory hierarchy labels: V$, HBM, RDMA, DDR, CXL 2, CXL 1, CXL 3, NVMe, UPI. Compute hierarchy callout: CXL Type 2 accelerator with memory, ideal for near-memory processing.]

Command-Heavy & Communication-Poor interfaces of existing standards
Consider offloading sorts or group-bys on sharded tables (DB), or convolution offload for graph neural networks (DL), or offloading subgraph building against embeddings (DLRM).
The unit of offload is tiny; curse of Amdahl!
Hosts coordinate all data movement between host:device (North-South) using eager data movement semantics, even though CXL enables lazy semantics.

device:device (East-West) data movement is not contemplated, even though computational memories will naturally move data peer-to-peer in CXL 3.
Existing interfaces are fundamentally lacking for CXL-enabled accelerators and memory.

Lack of application-specific optimization over function/data placement and movement
Functions to be offloaded depend on other functions.
Open data formats work differently than application-specific formats.
Certain functions require certain runtimes or a specific accelerator.
Function placement and data placement are mutually constraining.
We need a language for applications to express intent on function and data coplacement without resorting to low-level constructs such as blocks and cachelines.
Need simple notions of data and function for a deeply tiered hierarchy of memory and compute.

Principal Gaps in Existing Work

XPI: Core Concepts; Theory of Operations
CSS: heavy leverage of Redfish
Service Mesh: traditional DAG of functions with new set-valued dependencies
Data Set: new idea for tiered and computational memory & storage
XPI clients can create and control the execution of a SERVICE MESH on a DATA SET using a COMPUTATIONAL SUBSYSTEM that they have requested.

Computational SubSystem (CSS)
CSS instances are created by Infra Ops for Tenants who request them through normal IaaS workflows.

XPI Data Sets
An XPI DATA SET is a set of device-computable objects that can be created, read, updated, or deleted by service meshes deployed on a Computational Memory node.
The distance between memory resources and compute resources is a critical consideration in NDP.
CXL introduces multiple new tiers of latency at the interface between memory and compute.
CXL 3 enables peer-to-peer visibility and access of exposed memory ranges.
XPI considers data as computable by a service if it is in promoted state and reachable when accessed from the service end point at which the service has been instantiated.

Data Sets
State | Tier | Characteristics
AllocatedObject | HostNearMemory | OnHost; InPromotedState
PackedObject | HostFarMemory | OnDisaggregatedMemoryNode AsExposedMemory; InDemotedState; FileCache; FileMap; ZeroEffortUnpack
UnpackedObject | DeviceMemory | OnCSD; InPromotedState; AsExposedMemory
File (Named manifest of PackedObjects and their associated extents) | HostNearMemory; DeviceMemory; (Mappable, Coherently Cacheable) | OnDevice; InDemotedState; AsFileStorage
ArchivedObject | Nearline | Archived; AsObjectStorage

Computing with Data Sets
Actions required to make these data sets computable
Objects can continue to be Host-computable following N-S movement in a CSS through promotion/demotion, or East/West movement in a CSS through application of XPI's copying, communication, and sharing functions, provided they meet additional requirements:
  Self-describing data format, or
  Open data format
Objects can become Device-computable provided they meet the following additional requirements:
  Located in Unpacked state and accessible from a service end point on the same device.
  Cached (possibly coherently) copy of a remote object in Unpacked state located on exposed memory of a remote device or a host.

Vector and Zero-Copy Friendly Arrow
There is unprecedented adoption of the Apache Arrow ecosystem of vector-friendly formats and operations,
especially for Accelerator-Accelerator communication, Host-to-accelerator function offload, and Accelerator-friendly memory.
Computational Memory apps will benefit greatly and CXL NDP will see rapid adoption.
Customers: Standard memory format enables reuse of libraries of algorithms, even across languages and frameworks
Vendors: Avoid having to define own data format and drive its adoption by software and algorithms
Customers: Avoid the performance penalty, namely serialization-deserialization overheads when moving data
Vendors: Ease adoption of on-device acceleration without having to worry about protocol conversion overheads
Vendors: Leverage readily-available transpilers

Acceleration Programming Evolution
Coroutine
Pipelined
Graph Launch
A Sea of Accelerators? Google ISCA23 paper (Profiling Hyperscale Big Data Workloads). See Slide Notes.

Chained, Asynch Accelerators
Google ISCA23 paper (Profiling Hyperscale Big Data Workloads)
End-to-End Acceleration of Hyperscale Workloads requires:
  - A sea of accelerators
  - Chained and Balanced
  - Low set-up time
  - HW-SW Co-design approach
  - Working on Service Meshes

Parallelism Exposed
Grossman, et al. SHMEM-ML: Leveraging OpenSHMEM and Apache Arrow for Scalable, Composable Machine Learning. Workshop on OpenSHMEM and Related Technologies. 2021.
[Figure labels: log scale; parallelism exposed, SPMD model; parallelism hidden, client-server model; Pre-Disaggregation: DSM abstraction fails to scale beyond 32.]

Service Mesh
DAGs (Directed Acyclic Graph) of programs, installed/configured* services/functions
Key notion: Service, Parent(s): Subservice
*Configured services can be accessed without installation. E.g., CUDA, MPI communication library, Java VM, Docker service, SQLite.
Not all configured services are available at every service end point (SEP).
A service specified in a service mesh may be installed at an SEP.
SEPs on host/device/DPU can be enumerated for their currently installed/configured services.
An installed service in turn presents its own SEP; subservices that require it as a parent can in turn be deployed there.
Co-location of services and data objects is not required, but preferences can be expressed by specifying a File or an Unpacked Object from a Data Set as one of the parents.
All Data Sets and, resp., Unpacked Objects are services; by default, they support a basic set of storage operations and, resp., memory accesses (incl. remote accesses if InExposedMemory).

Service Mesh: Function Tags based on Pipeline Execution Model
Tag | Characteristics
PipeFriend | Operates in streaming fashion on chunks of input, producing optional streaming output; does not build up memory in proportion to the number of inputs processed; non-restartable due to high side effect
PipeBreakAccumulator | Reductive operator that does not terminate until it has seen all or indeterminately many inputs; does not build up memory in proportion to the number of inputs processed; optionally exposes Accumulator state AsExposedMemory; potentially restartable
PipeBreakCollator | Operator that processes inputs and builds up substantial memory state in proportion to the number of inputs processed or in proportion to the number of outputs produced; high cost of restarting

Parameter | Characteristics
ChunkSize | Enum(H, M, L)

Geyer, et al. Pipeline Group Optimization on Disaggregated Systems. In 13th Conference on Innovative Data Systems Research, CIDR 2023.

New Learnings, Resulting XPI Refinements
1. Nvidia disc.: GPU clusters (1a). Formalize Data Sets (1b)
2. DB Analytics Community: Pipeline Execution Model (2a)
3. AI ML Learnings: DB+AI (3a) trend, Indexes (3b)
4. FMS23: File (4a) abstraction, NVMe compatibility (4b)
5. Academia: Virtual Objects (5a), kNN search distance functions (5b), Tiering (Memory cost (5c) considerations)
6. Google Sea of Accelerators architecture: Asynch interface for pipelining (6a) across fabric connecting on-chip and off-chip acceleration

Req | XPI part* (CSS/DS/SM): Implication
1a | No change needed. CSS/Host includes a private Redfish fabric
1b | Continue current direction, formalize DS/Tags, align to industry work
2a | DS/Tags and SM/FnTags NEW for inference and insertion offloads, resp., into S3 and Aurora tiers for instance. Arrow format SM/ServiceParent.
3a | DS/Tags: ONNX (ML model); SM/ServiceParent: Arrow compute (op)
3b | See paper: High performance offload of pointer-rich data structures
4a | DS/FILE compatible with ADLS, Colossus, etc. (tiering, partial mapping)
4b | In progress with DCI
5a | TBD if Parquet, Delta open formats sufficiently rich for Virtual Objects
5b | See paper: AI/ML offloads involve pointer chasing
5c | SM/ServiceMesh/Function: chunkSizes parameter, in preference order
6a | SM/FnTags: PipeFriend, PipeBreakAccumulator, PipeBreakCollator
*CSS: Computational SubSystem; DS: Data Set; SM: Service Mesh

Reviewed and Blessed by Developers who are building:
  Hyperscale DB
  Accelerator Cores
  Very-High-Performance Accelerator Computational Storage
  Data-rich Service
  Big Data and DL frameworks
  Acceleration Library
  Computational Memory Device
  High Perf Filesystem

Representative Usage of XPI
API technical work on parallelism, progress visibility, and steering
Adopter Considerations, Use Cases
Provider Considerations
Testing a range of optimizations
Above the line/manual
Below the line/automatic

Call for Participation
Cloud Sort benchmark
E-W Comm using libFabric
Arrow Compute
Acero
Examples of WIP: Collective Comm over Fabric Mem
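
The Data Sets table and the two device-computability conditions from the "Computing with Data Sets" slide can be illustrated with a small Python sketch. This is only a reading of the slide text under stated assumptions, not an XPI API: the names `State`, `DataObject`, `is_device_computable`, and the `cached_on` argument are hypothetical, and the deck does not define any concrete signatures.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class State(Enum):
    """Object states named in the Data Sets table."""
    ALLOCATED = "AllocatedObject"
    PACKED = "PackedObject"
    UNPACKED = "UnpackedObject"
    FILE = "File"
    ARCHIVED = "ArchivedObject"


@dataclass(frozen=True)
class DataObject:
    name: str
    state: State
    location: str                    # hypothetical id of the host/device holding the object
    in_exposed_memory: bool = False  # stands in for the AsExposedMemory characteristic


def is_device_computable(obj: DataObject, sep_device: str,
                         cached_on: FrozenSet[str] = frozenset()) -> bool:
    """Assumed reading of the slide's two conditions for a service end point on sep_device."""
    if obj.state is not State.UNPACKED:
        return False
    # Condition 1: Unpacked and accessible from a service end point on the same device.
    if obj.location == sep_device:
        return True
    # Condition 2: a cached copy on this device of a remote Unpacked object
    # that lives in exposed memory of a remote device or host.
    return obj.in_exposed_memory and sep_device in cached_on


local = DataObject("orders", State.UNPACKED, location="csd0")
remote = DataObject("lineitem", State.UNPACKED, location="csd1", in_exposed_memory=True)

print(is_device_computable(local, "csd0"))                                   # True
print(is_device_computable(remote, "csd0", cached_on=frozenset({"csd0"})))   # True
print(is_device_computable(remote, "csd0"))                                  # False
```

Whether the cache is coherent is deliberately left out of the check, since the slide says only "possibly coherently".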

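
The Service Mesh slides describe a DAG of services in which each service names its parent(s), a Data Set File can itself appear as a parent, and each function carries a PipeFriend, PipeBreakAccumulator, or PipeBreakCollator tag. A minimal Python sketch of those ideas follows. The `Service` class, `deployment_order`, `pipeline_groups`, and the example service names are all hypothetical; the grouping rule (a pipeline ends at each pipe breaker) is an assumption borrowed from the usual pipeline execution model, not something the deck specifies.

```python
from dataclasses import dataclass
from enum import Enum
from graphlib import TopologicalSorter
from typing import List, Tuple


class FnTag(Enum):
    PIPE_FRIEND = "PipeFriend"
    PIPE_BREAK_ACCUMULATOR = "PipeBreakAccumulator"
    PIPE_BREAK_COLLATOR = "PipeBreakCollator"


@dataclass(frozen=True)
class Service:
    name: str
    tag: FnTag = FnTag.PIPE_FRIEND
    parents: Tuple[str, ...] = ()  # parent services, or a File / Unpacked Object from a Data Set


def deployment_order(mesh: List[Service]) -> List[str]:
    """Order in which parents are always deployed before their subservices."""
    graph = {svc.name: set(svc.parents) for svc in mesh}
    return list(TopologicalSorter(graph).static_order())


def pipeline_groups(stages: List[Service]) -> List[List[str]]:
    """Cut a linear chain of stages after every pipe breaker (assumed rule)."""
    groups: List[List[str]] = []
    current: List[str] = []
    for svc in stages:
        current.append(svc.name)
        if svc.tag is not FnTag.PIPE_FRIEND:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


# Hypothetical mesh: a File from a Data Set is the parent of the first service.
mesh = [
    Service("scan", FnTag.PIPE_FRIEND, parents=("orders.file",)),
    Service("filter", FnTag.PIPE_FRIEND, parents=("scan",)),
    Service("sort", FnTag.PIPE_BREAK_COLLATOR, parents=("filter",)),
    Service("project", FnTag.PIPE_FRIEND, parents=("sort",)),
]

print(deployment_order(mesh))  # ['orders.file', 'scan', 'filter', 'sort', 'project']
print(pipeline_groups(mesh))   # [['scan', 'filter', 'sort'], ['project']]
```

Naming the File as a parent mirrors the slide's point that co-location is expressed as a preference through the parent list rather than through low-level placement constructs.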