DESIGNING IN-NETWORK COMPUTING AWARE REDUCTION COLLECTIVES IN MPI
2024 OFA Virtual Workshop
Bharath Ramesh and Dhabaleswar K. (DK) Panda
Network-Based Computing Laboratory, The Ohio State University
http://nowlab.cse.ohio-state.edu/

Outline
- Introduction
- Background
- Motivation
- Problem Statement and Contributions
- Design
  - Overview
  - Registration cache design
  - Proposed Allreduce design
- Results
- Conclusion and Future Work

Introduction: Drivers of Modern HPC Cluster Architectures
- Multi-core/many-core technologies
- Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand, RoCE, Slingshot)
- Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
- Accelerators (NVIDIA GPGPUs): high compute density, high performance/watt, 9.7 TFlop DP on a chip
- High-performance interconnects: InfiniBand
- Multi-/many-core processors; SSD, NVMe-SSD, NVRAM
- Example systems: Frontier, Summit, Lumi, Fugaku

MPI Reduction Collectives and In-network Computing
- Reduction collectives (such as MPI_Allreduce) are important for HPC and AI
  - They involve both compute and communication
  - Using CPUs everywhere leads to sub-optimal scale-up and scale-out efficiency
  - Motivates offloading common operations away from the CPU so the CPU can perform other useful work
- In-network computing allows operations to be offloaded to network devices
  - Switches are a good candidate due to their high bandwidth and their ability to reduce data on the fly, eliminating redundancy
  - High scale-out efficiency and network-topology awareness
  - Frees up CPU cycles for other operations
SHARP Reduction Trees and Streaming Aggregation (SAT)
- Aggregation tree with switch-level reduction (radix 16)
- Images taken from Graham, Richard et al., "Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ Streaming-Aggregation Hardware Design and Evaluation", DOI: 10.1007/978-3-030-50743-5_3

Hierarchical Design for Small Message MPI_Allreduce
The design proceeds in five phases (a sketch follows below):
- Phase 1: Intra-socket reduction
- Phase 2: Inter-socket reduction
- Phase 3: Inter-node allreduce, using SHARP for scale-out performance
- Phase 4: Inter-socket broadcast
- Phase 5: Intra-socket broadcast
[Figures: per-phase data movement across Socket 1 and Socket 2 on Node 1 and Node 2; the legend distinguishes socket-leader shared memory, intra-socket shared memory, writes to/reads and reductions from shared memory, intra-socket leader processes, and regular processes]
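To make the phase structure concrete, here is a minimal sketch assuming the socket-level, per-node socket-leader, and inter-node node-leader sub-communicators (socket_comm, node_comm, leader_comm — hypothetical names) have already been built. Plain MPI sub-communicator collectives stand in for the shared-memory transfers and the SHARP-backed inter-node step used in the actual design.

```c
#include <mpi.h>

/* Hedged sketch of the five-phase small-message allreduce.
 * socket_comm: all ranks sharing a socket
 * node_comm:   the socket leaders on a node
 * leader_comm: the node leaders across nodes (where SHARP is used) */
void hierarchical_allreduce_small(const void *sendbuf, void *recvbuf, int count,
                                  MPI_Datatype dt, MPI_Op op,
                                  MPI_Comm socket_comm, MPI_Comm node_comm,
                                  MPI_Comm leader_comm)
{
    int socket_rank, node_rank;
    MPI_Comm_rank(socket_comm, &socket_rank);

    /* Phase 1: intra-socket reduction to the socket leader */
    MPI_Reduce(sendbuf, recvbuf, count, dt, op, 0, socket_comm);

    if (socket_rank == 0) {
        MPI_Comm_rank(node_comm, &node_rank);

        /* Phase 2: inter-socket reduction to the node leader */
        MPI_Reduce(node_rank == 0 ? MPI_IN_PLACE : recvbuf, recvbuf,
                   count, dt, op, 0, node_comm);

        if (node_rank == 0)
            /* Phase 3: inter-node allreduce; SHARP is used here in the real design */
            MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, dt, op, leader_comm);

        /* Phase 4: inter-socket broadcast of the result */
        MPI_Bcast(recvbuf, count, dt, 0, node_comm);
    }

    /* Phase 5: intra-socket broadcast to all remaining processes */
    MPI_Bcast(recvbuf, count, dt, 0, socket_comm);
}
```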
Overview of the MVAPICH Project
- High-performance open-source MPI library
- Support for multiple interconnects: InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), AWS EFA, OPX, Broadcom RoCE, Intel Ethernet, Rockport Networks, Slingshot 10/11
- Support for multiple platforms: x86, OpenPOWER, ARM, Xeon Phi, GPGPUs (NVIDIA and AMD)
- Started in 2001; first open-source version demonstrated at SC '02
- Supports the latest MPI-3.1 standard
- http://mvapich.cse.ohio-state.edu
- Additional optimized versions for different systems/environments:
  - MVAPICH2-X (Advanced MPI + PGAS), since 2011
  - MVAPICH2-GDR with support for NVIDIA (since 2014) and AMD (since 2020) GPUs
  - MVAPICH2-MIC with support for Intel Xeon Phi, since 2014
  - MVAPICH2-Virt with virtualization support, since 2015
  - MVAPICH2-EA with support for energy awareness, since 2015
  - MVAPICH2-Azure for Azure HPC IB instances, since 2019
  - MVAPICH2-X-AWS for AWS HPC + EFA instances, since 2019
- Tools:
  - OSU MPI Micro-Benchmarks (OMB), since 2003
  - OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
- Used by more than 3,375 organizations in 91 countries
- More than 1.77 million downloads from the OSU site directly
- Empowering many TOP500 clusters (Nov '23 ranking):
  - 11th: 10,649,600-core Sunway TaihuLight at NSC, Wuxi, China
  - 29th: 448,448-core Frontera at TACC
  - 46th: 288,288-core Lassen at LLNL
  - 61st: 570,020-core Nurion in South Korea, and many others
- Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
- Partner in the 29th-ranked TACC Frontera system
- Empowering Top500 systems for more than 18 years

Results for Small MPI_Allreduce: Varying Message Sizes
- Scaling with message size, average latency
- Close to a flat curve across message sizes up to 2K
- Configurations: 1 ppn on 7,861 nodes; 16 ppn on 1,024 nodes
- Available in the MVAPICH 3.0 release
Results for Small MPI_Allreduce: Varying Node Counts
- Scaling with increasing node counts, 16 bytes, average latency
- Same trends as with Reduce (the implementations are almost identical except for the intra-node broadcast phases)
- Configurations: 1 ppn on 7,861 nodes; 16 ppn on 1,024 nodes
- More information in: B. Ramesh, K. Suresh, N. Sarkauskas, M. Bayatpour, J. Hashmi, H. Subramoni, and DK Panda, "Scalable MPI Collectives using SHARP: Large Scale Performance Evaluation on the TACC Frontera System", ExaMPI '20
Limitations of State-of-the-art Schemes for Large Message Reduction Collectives
- Two-copy reduction collectives with SHARP
  - Used leader-based schemes consisting of a reduction, followed by a SHARP operation, and finally a broadcast
  - Not suitable for large message sizes (≥ 128 K)
- Single-copy schemes are very efficient for large-message data movement
  - XPMEM allows a remote process to have load/store access through address-space mapping (see the sketch below)
- Using SHARP SAT in MPI has a few limitations and bottlenecks that need to be addressed to achieve good scale-out performance
- Motivates the need for large-message reduction designs that combine the advantages of SHARP with those of single-copy schemes like XPMEM
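As a rough illustration of the single-copy path, the sketch below shows how an XPMEM mapping is typically established. The function names and the way the segment id reaches the peer are assumptions (the actual design exchanges this information through shared memory); once attached, the peer's buffer can be read and written with plain loads and stores, so a reduction needs no intermediate copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <xpmem.h>

/* Exporting side: make this process's address range sharable once at startup.
 * The returned segment id must be communicated to the peers out of band
 * (e.g. via a shared-memory region). */
xpmem_segid_t export_address_space(void)
{
    /* Share the whole address space starting at address 0. */
    return xpmem_make((void *)0, XPMEM_MAXADDR_SIZE, XPMEM_PERMIT_MODE, (void *)0666);
}

/* Importing side: map `len` bytes of the peer's buffer (located at virtual
 * address `peer_vaddr` in the peer's address space) into our own address
 * space. In practice the offset and length should be page-aligned. */
void *attach_peer_buffer(xpmem_segid_t segid, void *peer_vaddr, size_t len)
{
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, NULL);
    struct xpmem_addr addr = { .apid = apid, .offset = (off_t)(uintptr_t)peer_vaddr };

    /* The returned pointer supports direct load/store access to the peer's
     * memory; it is later torn down with xpmem_detach()/xpmem_release(). */
    return xpmem_attach(addr, len, NULL);
}
```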
Motivation
[Figure: Allreduce runtime registration overhead, latency (us) vs. message size (1M-32M bytes), comparing SHARP-allreduce-without-registration and SHARP-registration]
- SHARP SAT provides excellent bandwidth with close to point-to-point latency
- Registration involves pinning pages in memory (like InfiniBand registration)
  - The overhead increases significantly as the message size grows
  - This requires a cache that avoids expensive calls to sharp_coll_reg_mr
- Switch resources are limited
  - This causes bottlenecks when scaling up on modern CPUs with hundreds of cores
  - The SHARP runtime places limits to manage these resources
- Motivates the need for designs that are aware of SHARP runtime capabilities, overcome these bottlenecks, and scale up efficiently for many processes per node
[Figure: Comparison of SHARP protocols on 8 nodes, latency (us) vs. message size (16K-1M bytes), SHARP-LLT vs. SHARP-SAT]
Problem Statement and Contributions
- Problem statement: Can we propose an algorithm for large-message Allreduce that overcomes bottlenecks and resource constraints in the SHARP runtime by making efficient use of node- and network-level resources?
- Contributions:
  - Identify the registration overheads involved in using SHARP streaming aggregation for large messages and propose solutions to address them
  - Analyze the impact of chunking reductions when using streaming aggregation for different message sizes, to empirically determine how to overlap intra-node reductions with SHARP-based reductions
  - Propose an algorithm for large Allreduce that utilizes SAT and CPUs efficiently
  - Evaluate the proposed design against state-of-the-art MPI libraries
Proposed Design Overview
- Use a registration cache to amortize registration costs in the SHARP runtime
- Designate a "leader" process on each node to interact with SHARP
- Chunk the buffer into PPN (processes per node) chunks and reduce into a single buffer belonging to the leader process
  - Uses XPMEM for load/store access
  - All processes perform local reductions, but only the leader process calls the SHARP runtime
- Once local reductions are complete, the leader calls a non-blocking MPI_Allreduce
  - Perfect overlap of intra-node and inter-node steps
  - Local reduction happens in batches for good network utilization
- The final result is broadcast within the node
Registration Cache Design
[Figure: Impact of registration cache designs, latency (us) vs. message size (16K-8M bytes), SAT-with-registration-cache vs. SAT-without-registration-cache]
- Use an AVL tree (or similar) to store buffer addresses, giving O(log n) insertion/query time
- If an entry for the buffer already exists, the registration information is taken directly from the cache (see the sketch below)
- Up to 5.6X reduction in latency
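A minimal sketch of such a cache, using the POSIX tsearch()/tfind() tree from <search.h> as a stand-in for the AVL tree. register_with_sharp() is a hypothetical wrapper around the sharp_coll_reg_mr call named above, and only exact (address, length) matches are handled; invalidation when buffers are freed is omitted for brevity.

```c
#include <search.h>
#include <stdint.h>
#include <stdlib.h>

struct reg_entry {
    void  *addr;   /* start of the user buffer */
    size_t len;    /* registered length */
    void  *mr;     /* opaque SHARP memory-registration handle */
};

/* Hypothetical wrapper; the real code would call sharp_coll_reg_mr() here. */
void *register_with_sharp(void *buf, size_t len);

static void *cache_root = NULL;   /* root of the search tree */

static int reg_cmp(const void *a, const void *b)
{
    const struct reg_entry *x = a, *y = b;
    if (x->addr != y->addr)
        return ((uintptr_t)x->addr < (uintptr_t)y->addr) ? -1 : 1;
    if (x->len != y->len)
        return (x->len < y->len) ? -1 : 1;
    return 0;
}

/* Return a registration handle for (buf, len), registering only on a cache miss. */
void *get_registration(void *buf, size_t len)
{
    struct reg_entry key = { buf, len, NULL };

    void *node = tfind(&key, &cache_root, reg_cmp);   /* O(log n) lookup */
    if (node)
        return (*(struct reg_entry **)node)->mr;      /* cache hit: reuse registration */

    struct reg_entry *e = malloc(sizeof *e);          /* cache miss: register once */
    e->addr = buf;
    e->len  = len;
    e->mr   = register_with_sharp(buf, len);
    tsearch(e, &cache_root, reg_cmp);                 /* O(log n) insert */
    return e->mr;
}
```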
Analyzing the Impact of Chunking Iallreduce Operations
[Figure: Iallreduce + waitall latency using SHARP SAT, latency (us) vs. message size (16384-1048576 bytes), for 1, 2, 4, 8, 16, and 32 chunks]
- Measures the impact of sending a message with one call to the SHARP library vs. multiple calls (a benchmark sketch follows below)
- Given a message size M and a number of chunks C, call the non-blocking SHARP allreduce C times (each of size M/C), followed by a waitall
- This is an indirect measure of overlap at the network level
- Splitting into chunks of size ≥ 16384 gives the same latency, independent of the number of chunks
- These chunks can therefore be overlapped with reductions within the node
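A sketch of this measurement, with MPI_Iallreduce standing in for the non-blocking SHARP SAT allreduce used in the actual experiment. The function and buffer names are illustrative, and M is assumed to be divisible by C.

```c
#include <mpi.h>
#include <stdlib.h>

/* Reduce M floats either as one non-blocking allreduce (C = 1) or as C calls
 * of M/C elements each, completed by a single waitall; returns the latency. */
double timed_chunked_allreduce(const float *sendbuf, float *recvbuf,
                               int M, int C, MPI_Comm comm)
{
    MPI_Request *reqs = malloc(C * sizeof(MPI_Request));
    int chunk = M / C;

    double t0 = MPI_Wtime();
    for (int i = 0; i < C; i++)
        MPI_Iallreduce(sendbuf + (size_t)i * chunk, recvbuf + (size_t)i * chunk,
                       chunk, MPI_FLOAT, MPI_SUM, comm, &reqs[i]);
    MPI_Waitall(C, reqs, MPI_STATUSES_IGNORE);
    double t1 = MPI_Wtime();

    free(reqs);
    return t1 - t0;   /* latency for this message size / chunk count */
}
```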
Proposed Allreduce Design
[Figures: per-node buffers (sendbuf, tmpbuf, recvbuf) for processes P0/P1 on node N0 and P3/P4 on node N1, showing the initial state, the reduce-to-leader step, and the initiation of the non-blocking SHARP-based inter-node allreduce]
- The first process on each node is designated as the leader
- Before the reduction, buffer information is exchanged using shared memory (for XPMEM loads/stores)
- Process i reduces the i-th chunk from every process and stores it in the leader's tmpbuf (see the sketch below)
- At the end of this step, the leader on every node holds the reduced result for the current phase
- The leader process then initiates a non-blocking inter-node SHARP allreduce
- "Request" objects are used to track the progress of the SHARP allreduce operations
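A minimal sketch of this reduce-to-leader step for one range of the message. peer_sendbufs[] holds XPMEM-attached pointers to all PPN send buffers on the node (including this process's own) and leader_tmpbuf is an XPMEM-attached pointer to the leader's temporary buffer; both names are hypothetical, and a float/sum reduction is assumed for brevity.

```c
#include <stddef.h>

/* Node-local rank `node_rank` owns the node_rank-th slice of [off, off+len)
 * and reduces that slice from every peer's send buffer into the leader's
 * tmpbuf, using the XPMEM mappings established earlier. */
void reduce_chunk_to_leader(float *const *peer_sendbufs, float *leader_tmpbuf,
                            size_t off, size_t len, int ppn, int node_rank)
{
    size_t slice = len / ppn;                        /* assume len divisible by ppn */
    size_t base  = off + (size_t)node_rank * slice;  /* this rank's slice */

    for (size_t j = 0; j < slice; j++) {
        float acc = peer_sendbufs[0][base + j];      /* start from process 0's data */
        for (int p = 1; p < ppn; p++)
            acc += peer_sendbufs[p][base + j];       /* accumulate the other peers */
        leader_tmpbuf[base + j] = acc;
    }
}
```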
Proposed Allreduce Design (Continued)
- For large buffers, the intra-node reduction and inter-node phases are run multiple times (sketched below)
  - Reducing a large buffer is time consuming, so it is done in multiple phases for good network utilization
  - The chunk size is tuned to achieve perfect overlap of intra-node and inter-node operations
- The leader waits for the non-blocking allreduces to complete after all rounds of the first two phases are done
- An intra-node broadcast then delivers the final result
[Figures: buffer states after the waitall and after the broadcast]
- More information in: B. Ramesh, G. Kuncham, K. Suresh, R. Vaidya, N. Alnaasan, M. Abduljabbar, A. Shafi, D. Panda, "Designing In-network Computing Aware Reduction Collectives in MPI", Hot Interconnects 2023, Aug 2023
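Putting the two design slides together, the sketch below shows the overall pipeline under the same assumptions as the previous sketches: MPI_Iallreduce on a node-leader communicator stands in for the non-blocking SHARP SAT call, an intra-node barrier stands in for the shared-memory synchronization of the real design, and reduce_chunk_to_leader is the helper from the earlier sketch.

```c
#include <mpi.h>
#include <stdlib.h>

/* From the previous sketch: each rank reduces its slice of [off, off+len)
 * into the leader's XPMEM-mapped tmpbuf. */
void reduce_chunk_to_leader(float *const *peer_sendbufs, float *leader_tmpbuf,
                            size_t off, size_t len, int ppn, int node_rank);

/* Hedged sketch of the leader-based, pipelined large-message allreduce:
 * the message is processed in rounds of `batch` elements; each round does an
 * intra-node chunked reduction and the node leader immediately posts a
 * non-blocking inter-node allreduce for that batch, keeping one request per
 * round. A single waitall and an intra-node broadcast finish the operation. */
void leader_based_large_allreduce(float *const *peer_sendbufs, float *leader_tmpbuf,
                                  float *recvbuf, size_t count, size_t batch,
                                  int ppn, int node_rank,
                                  MPI_Comm node_comm, MPI_Comm leader_comm)
{
    size_t nrounds = (count + batch - 1) / batch;
    MPI_Request *reqs = malloc(nrounds * sizeof(MPI_Request));
    size_t r = 0;

    for (size_t off = 0; off < count; off += batch, r++) {
        size_t len = (off + batch <= count) ? batch : count - off;

        /* Phase 1: intra-node reduction of this batch into the leader's tmpbuf */
        reduce_chunk_to_leader(peer_sendbufs, leader_tmpbuf, off, len, ppn, node_rank);
        MPI_Barrier(node_comm);   /* ensure the whole batch has been reduced */

        /* Phase 2: the leader posts the non-blocking inter-node allreduce for
         * this batch and moves on, so the next intra-node reduction can run
         * while the network operation progresses */
        if (node_rank == 0)
            MPI_Iallreduce(leader_tmpbuf + off, recvbuf + off, (int)len,
                           MPI_FLOAT, MPI_SUM, leader_comm, &reqs[r]);
    }

    /* Phase 3: the leader waits for all outstanding inter-node allreduces */
    if (node_rank == 0)
        MPI_Waitall((int)nrounds, reqs, MPI_STATUSES_IGNORE);

    /* Phase 4: broadcast the final result within the node */
    MPI_Bcast(recvbuf, (int)count, MPI_FLOAT, 0, node_comm);
    free(reqs);
}
```

How much the posted operations actually progress in the background depends on asynchronous progress in the MPI/SHARP runtime; the point of the sketch is the structure, one request per batch, a single waitall, and a final intra-node broadcast, matching the slides above.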
Experimental Setup

                    MRI                                   HPCAC
Processor model     AMD EPYC 7713                         Intel Xeon Gold 6138
Max clock speed     3.72 GHz                              2 GHz
Number of sockets   2                                     2
Cores per socket    64                                    20
RAM                 256 GB                                196 GB
Interconnect        NVIDIA HDR-200 with Quantum-2 switches  NVIDIA HDR-200 with Quantum-2 switches
MPI libraries       MVAPICH2-X, HPC-X                     MVAPICH2-X, HPC-X

Results for Large MPI_Allreduce: 2 Nodes
- Increased parallelism by using multiple processes and SHARP for the reduction
- Up to 81.43% improvement over the state of the art for 32 PPN and 86.4% for 64 PPN on MRI
- Up to 33.67% over the state of the art for 32 PPN and 60% for 64 PPN on HPCAC
- An increased number of page faults reduces the benefits at 1M (needs further investigation)
[Figures: MRI-32PPN, MRI-64PPN, HPCAC-32PPN, HPCAC-64PPN]
Results for Large MPI_Allreduce: 4 Nodes
- Increased parallelism by using multiple processes and SHARP for the reduction
- Up to 83.05% improvement over the state of the art for 32 PPN and 88.52% for 64 PPN on MRI
- Up to 32.62% over the state of the art for 32 PPN and 46.91% for 64 PPN on HPCAC
[Figures: MRI-32PPN, MRI-64PPN, HPCAC-32PPN, HPCAC-64PPN]

Results for Large MPI_Allreduce: 8 Nodes
- Increased parallelism by using multiple processes and SHARP for the reduction
- Up to 79.44% improvement over the state of the art for 32 PPN and 78.36% for 64 PPN on MRI
- Up to 58.08% over the state of the art for 32 PPN and 52.13% for 64 PPN on HPCAC
[Figures: MRI-32PPN, MRI-64PPN, HPCAC-32PPN, HPCAC-64PPN]
Conclusion and Future Work
- The SHARP runtime enables in-network offload with excellent bandwidth utilization
- The proposed designs overcome various bottlenecks by using a leader-based algorithm and streaming aggregation for large-message reductions
- They outperform the state of the art by up to 86%
- The designs will be available in a future release of MVAPICH-Plus
- Future work:
  - Comprehensive application evaluation
  - Evaluating performance at larger scales
  - Exploring NUMA awareness

THANK YOU!

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/