OCP Global Summit | October 18, 2023 | San Jose, CA
Pankaj Mehra, Elephance Memory, Inc.

XPI: Acceleration Programming Interface

Scope of XPI Work
Define and evaluate, through implementation and integration, new, developer-friendly programming interfaces for Near Data Processing (NDP) infrastructure now emerging at the intersection of:
- Domain-Specific Architecture (a major trend), and
- Computational Memory/Storage (major opportunities, not yet a trend).

Application areas: 1. Data Infra  2. Scalable AI/ML  3. Media Processing  4. Memory Nodes
Device classes: 1. DPUs  2. Computational Storage  3. Computational Memory
(Storage device/array with exposed/private RAM, with pre-/post-CXL NDP in front, integrated behind a switch, or accelerator-first.)

Targeted Developer Communities
- Adopters: scale, standardize, accelerate
- Providers: mutual and external consistency

Modern Storage Workloads
- EB-scale storage, PB-scale memory, open-format data
- Data gravity is a key consideration for power and performance
- Hence the need for end-to-end acceleration

Compute-Memory Hierarchy
[Figure: compute and memory hierarchies spanning package, server, rack, and fabric tiers; memory tiers include V$, HBM, DDR, CXL 1/2/3, and NVMe, with UPI and RDMA interconnects. A CXL Type 2 accelerator with memory is ideal for near-memory processing.]

Command-Heavy and Communication-Poor Interfaces of Existing Standards
Consider offloading sorts or group-bys on sharded tables (DB), convolution for graph neural networks (DL), or subgraph building against embeddings (DLRM):
- The unit of offload is tiny; the curse of Amdahl!
- Hosts coordinate all host:device (North-South) data movement using eager data-movement semantics, even though CXL enables lazy semantics.
- device:device (East-West) data movement is not contemplated, even though computational memories will naturally move data peer-to-peer in CXL 3.

Principal Gaps in Existing Work
Existing interfaces are fundamentally lacking for CXL-enabled accelerators and memory:
- Lack of application-specific optimization over function/data placement and movement
- Functions to be offloaded depend on other functions
- Open data formats work differently than application-specific formats
- Certain functions require certain runtimes or a specific accelerator
- Function placement and data placement are mutually constraining
We need a language for applications to express intent
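The "curse of Amdahl" above can be made concrete: when each offload command covers only a small fraction of end-to-end work, overall speedup is capped no matter how fast the accelerator is. A minimal illustration of Amdahl's law (the numbers are hypothetical, chosen only to show the contrast):

```python
def amdahl_speedup(offloaded_fraction: float, accel_speedup: float) -> float:
    """Overall speedup when only a fraction of total work runs on the accelerator."""
    return 1.0 / ((1.0 - offloaded_fraction) + offloaded_fraction / accel_speedup)

# Tiny unit of offload: 5% of end-to-end time, even with a 100x accelerator.
small = amdahl_speedup(0.05, 100.0)   # ~1.05x overall

# End-to-end acceleration: 95% of the work on the same 100x accelerator.
large = amdahl_speedup(0.95, 100.0)   # ~16.8x overall
```

This is why the deck argues for end-to-end acceleration rather than command-at-a-time offload.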
on function and data co-placement without resorting to low-level constructs such as blocks and cachelines. We need simple notions of data and function for a deeply tiered hierarchy of memory and compute.

XPI: Core Concepts; Theory of Operations
- Computational SubSystem (CSS): heavy leverage of Redfish. CSS instances are created by Infra Ops for tenants who request them through normal IaaS workflows.
- Service Mesh: a traditional DAG of functions with new, set-valued dependencies.
- Data Set: a new idea for tiered and computational memory and storage.
XPI clients can create and control the execution of a SERVICE MESH on a DATA SET using a COMPUTATIONAL SUBSYSTEM that they have requested.

XPI Data Sets
An XPI DATA SET is a set of device-computable objects that can be created, read, updated, or deleted by service meshes deployed on a Computational Memory node.
- The distance between memory resources and compute resources is a critical consideration in NDP.
- CXL introduces multiple new tiers of latency at the interface between memory and compute.
- CXL 3 enables peer-to-peer visibility and access of exposed memory ranges.
XPI considers data computable by a service if it is in promoted state and reachable
when accessed from the service end point at which the service has been instantiated.

Data Sets
State | Tier | Characteristics
AllocatedObject | HostNearMemory | OnHost; InPromotedState
PackedObject | HostFarMemory | OnDisaggregatedMemoryNode; AsExposedMemory; InDemotedState; FileCache; FileMap; ZeroEffortUnpack
UnpackedObject | DeviceMemory | OnCSD; InPromotedState; AsExposedMemory
File (named manifest of PackedObjects and their associated extents) | HostNearMemory; DeviceMemory (mappable, coherently cacheable) | OnDevice; InDemotedState; AsFileStorage
ArchivedObject | Nearline | Archived; AsObjectStorage

Objects can continue to be Host-computable following North-South movement in a CSS (through promotion/demotion), or East-West movement in a CSS (through application of XPI's copying, communication, and sharing functions), provided they additionally use a self-describing or open data format.

Objects can become Device-computable provided they additionally meet one of the following requirements:
- Located in Unpacked state and accessible from a service end point on the same device, or
- A cached (possibly coherent) copy of a remote object in Unpacked state, located on exposed memory of a remote device or a host.

Computing with Data Sets
Actions required to make these data sets computable
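The state/tier table above implies a small object lifecycle: promotion/demotion moves data across tiers, and Device-computability requires the Unpacked state plus reachability from a service end point on the same device. The slides do not publish a client API, so the sketch below is purely illustrative; every class, method, and device name is invented:

```python
from dataclasses import dataclass

@dataclass
class XpiObject:
    """Hypothetical model of one object from the state/tier table above."""
    name: str
    state: str            # e.g. "PackedObject", "UnpackedObject"
    device: str           # device or node currently holding the object
    open_format: bool     # self-describing/open data format?

    def promote(self, device: str) -> None:
        """Unpack onto a device's memory (North-South move, demoted -> promoted)."""
        if self.state != "PackedObject":
            raise ValueError(f"cannot promote from {self.state}")
        self.state, self.device = "UnpackedObject", device

    def device_computable_from(self, sep_device: str) -> bool:
        """Device-computable: Unpacked AND reachable from an SEP on the same device."""
        return self.state == "UnpackedObject" and self.device == sep_device

obj = XpiObject("embeddings", "PackedObject", "mem-node-0", open_format=True)
assert not obj.device_computable_from("csd-1")   # still packed, wrong device
obj.promote("csd-1")                             # unpack onto the CSD
assert obj.device_computable_from("csd-1")
```

The cached-copy path for remote Unpacked objects is omitted to keep the sketch small.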
Vector- and Zero-Copy-Friendly Arrow
There is unprecedented adoption of the Apache Arrow ecosystem of vector-friendly formats and operations, especially for accelerator-to-accelerator communication, host-to-accelerator function offload, and accelerator-friendly memory. Computational Memory apps will benefit greatly, and CXL NDP will see rapid adoption.
- Customers: a standard memory format enables reuse of libraries of algorithms, even across languages and frameworks.
- Vendors: avoid having to define your own data format and drive its adoption by software and algorithms.
- Customers: avoid the performance penalty, namely serialization/deserialization overheads, when moving data.
- Vendors: ease adoption of on-device acceleration without having to worry about protocol-conversion overheads.
- Vendors: leverage readily available transpilers.

Acceleration Programming Evolution
Coroutine -> Pipelined -> Graph Launch

A Sea of Accelerators?
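The serialization/deserialization penalty noted above is the core motivation for a standard in-memory format: two parties that agree on layout can share one buffer with no copies. A stdlib-only illustration of the difference (Arrow itself is not needed to make the point):

```python
import array
import pickle

data = array.array("d", range(1_000))

# Copy-based exchange: serialize, move bytes, deserialize (two full copies).
wire = pickle.dumps(data)
received = pickle.loads(wire)

# Zero-copy exchange: both sides agree on the layout and share one buffer.
view = memoryview(data)     # no bytes are copied
data[0] = 42.0
assert view[0] == 42.0      # the "consumer" sees the update immediately
assert received[0] != 42.0  # the deserialized copy does not
```

With an agreed columnar format such as Arrow, the same zero-copy sharing works across languages, processes, and (with CXL exposed memory) across devices.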
Google ISCA'23 paper (Profiling Hyperscale Big Data Workloads); see slide notes.

Chained, Asynchronous Accelerators (Google ISCA'23 paper, Profiling Hyperscale Big Data Workloads)
End-to-end acceleration of hyperscale workloads requires:
- A sea of accelerators
- Chained and balanced
- Low set-up time
- A HW-SW co-design approach
- Working on service meshes

Parallelism Exposed
Grossman, et al. SHMEM-ML: Leveraging OpenSHMEM and Apache Arrow for Scalable, Composable Machine Learning. Workshop on OpenSHMEM and Related Technologies, 2021.
[Figure (log scale): evolution from the client-server model (pre-disaggregation, parallelism hidden) through the SPMD model to parallelism exposed; the DSM abstraction fails to scale beyond 32.]

Service Mesh
DAGs (Directed Acyclic Graphs) of programs and installed/configured* services/functions
Key notion: Service; Parent(s); Subservice
- *Configured services can be accessed without installation, e.g., CUDA, an MPI communication library, a Java VM, a Docker service, SQLite. Not all configured services are available at every service end point.
- A service specified in a service mesh may be installed at an SEP.
- SEPs on a host/device/DPU can be enumerated for their currently installed/configured services.
- An installed service in turn presents its own SEP; subservices that require it as a parent can in turn be deployed there.
- Co-location of services and data objects is not required, but preferences can be expressed by specifying a File or an Unpacked Object from a Data Set as one of the parents.
- All Data Sets and, respectively, Unpacked Objects are services; by default they support a basic set of storage operations and, respectively, memory accesses (including remote accesses if InExposedMemory).
SEP: service end point
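A service mesh as described above is a DAG in which each service lists its parents (configured runtimes, other services, or data objects), and a service can be deployed at an SEP only after its parents are present. A minimal, hypothetical sketch of resolving a deployment order with the standard library; the mesh contents are invented for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical mesh: keys are services, values are their parent sets.
# Parents may be configured runtimes (CUDA), data objects, or other services.
mesh = {
    "arrow-compute": {"cuda"},                               # runtime parent
    "group-by-offload": {"arrow-compute", "sharded-table"},  # service + data parents
    "sort-offload": {"arrow-compute", "sharded-table"},
    "merge": {"group-by-offload", "sort-offload"},
}

order = list(TopologicalSorter(mesh).static_order())
# Parents always precede their subservices in the deployment order.
assert order.index("cuda") < order.index("arrow-compute")
assert order.index("arrow-compute") < order.index("merge")
```

Note that the data object ("sharded-table") appears as an ordinary parent, which is how the slides express a co-location preference.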
Service Mesh: Function Tags based on the Pipeline Execution Model
Tag | Characteristics
PipeFriend | Operates in streaming fashion on chunks of input, producing optional streaming output. Does not build up memory in proportion to the number of inputs processed. Non-restartable due to high side effects.
PipeBreakAccumulator | Reductive operator that does not terminate until it has seen all, or indeterminately many, inputs; does not build up memory in proportion to the number of inputs processed. Optionally exposes accumulator state AsExposedMemory. Potentially restartable.
PipeBreakCollator | Operator that processes inputs and builds up substantial memory state in proportion to the number of inputs processed or the number of outputs produced. High cost of restarting.

Parameter | Characteristics
ChunkSize | Enum(H, M, L)

Geyer, et al. Pipeline Group Optimization on Disaggregated Systems. 13th Conference on Innovative Data Systems Research, CIDR
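The three tags can be read operationally: a PipeFriend operator streams chunk by chunk in constant memory, a PipeBreakAccumulator cannot emit until it has seen every input but keeps only constant state, and a PipeBreakCollator's state grows with the input. A generator-based sketch; the function names are illustrative, not part of XPI:

```python
from typing import Iterable, Iterator

def scale(chunks: Iterable[list[float]], k: float) -> Iterator[list[float]]:
    """PipeFriend: streams chunk by chunk; memory use is independent of input count."""
    for chunk in chunks:
        yield [k * x for x in chunk]

def total(chunks: Iterable[list[float]]) -> float:
    """PipeBreakAccumulator: reductive; emits only after all inputs are seen,
    but its state (one running sum) does not grow with the input."""
    acc = 0.0
    for chunk in chunks:
        acc += sum(chunk)
    return acc

def sort_all(chunks: Iterable[list[float]]) -> list[float]:
    """PipeBreakCollator: must collate everything; state grows with the input."""
    return sorted(x for chunk in chunks for x in chunk)

chunks = [[3.0, 1.0], [2.0]]
assert total(scale(iter(chunks), 2.0)) == 12.0   # a pipeline: PipeFriend -> PipeBreak
assert sort_all(iter(chunks)) == [1.0, 2.0, 3.0]
```

Chaining scale into total never materializes the whole input, which is exactly the property the ChunkSize parameter tunes.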
2023 (link).

New Learnings, Resulting XPI Refinements
1. Nvidia discussions: GPU clusters (a); formalize Data Sets (b)
2. DB analytics community: Pipeline Execution Model (a)
3. AI/ML learnings: DB+AI a trend (a); indexes (b)
4. FMS'23: File abstraction (a); NVMe compatibility (b)
5. Academia: Virtual Objects (a); kNN search distance functions (b); tiering, memory-cost considerations (c)
6. Google Sea of Accelerators architecture: asynchronous interface for pipelining (a) across the fabric connecting on-chip and off-chip acceleration

Req | XPI part* (CSS/DS/SM): Implication
1a | No change needed. CSS/Host includes a private Redfish fabric.
1b | Continue current direction; formalize DS/Tags; align to industry work.
2a | DS/Tags and SM/FnTags are NEW for inference and insertion offloads, respectively, into S3 and Aurora tiers for instance. Arrow format SM/ServiceParent.
3a | DS/Tags: ONNX (ML model). SM/ServiceParent: Arrow compute (op).
3b | See paper: high-performance offload of pointer-rich data structures.
4a | DS/FILE compatible with ADLS, Colossus, etc. (tiering, partial mapping).
4b | In progress with DCI.
5a | TBD whether the Parquet and Delta open formats are sufficiently rich for Virtual Objects.
5b | See paper: AI/ML offloads involve pointer chasing.
5c | SM/ServiceMesh/Function: chunkSizes parameter, in preference order.
6a | SM/FnTags: PipeFriend, PipeBreakAccumulator, PipeBreakCollator.
*CSS: Computational SubSystem; DS: Data Set; SM: Service Mesh

Reviewed and Blessed by Developers
Who are building: hyperscale DB; accelerator cores; very-high-performance accelerator; computational storage; data-rich service; big data and DL frameworks; acceleration library; computational memory device; high-performance filesystem.

Representative Usage of XPI
- API technical work on parallelism, progress visibility, and steering
- Adopter considerations, use cases
- Provider considerations
- Testing a range of optimizations: above the line/manual; below the line/automatic

Call for Participation
Examples of WIP: Cloud Sort benchmark; E-W communication using libfabric; Arrow Compute (Acero); collective communication over fabric memory.