HIGH PERFORMANCE & SCALABLE MPI LIBRARY OVER BROADCOM ROCE
2024 OFA Virtual Workshop
Mustafa Abduljabbar, Shulei Xu, Noah Pavuk, and Hemal Shah*
Core Switching Group (CSG), Broadcom Inc
Network Based Computing Laboratory (NOWLAB), The Ohio State University

Why RoCE for HPC?
- Enhanced Performance: RoCE delivers significantly lower latency and higher throughput compared to traditional Ethernet, empowering HPC applications to achieve peak performance.
- Optimized Efficiency: RoCE offloads RDMA operations to specialized hardware, reducing CPU overhead and freeing up valuable processing resources for critical computational tasks.
- Cost-Effective Solution: RoCE leverages existing Ethernet infrastructure.
- Scalable and Flexible: RoCE supports a range of Ethernet speeds and Layer 3 routing.
Goal: Highly Optimized MPI for Broadcom RoCEv2
1. MVAPICH-CPU release: optimizing MPI communication operations on new-generation Broadcom adapters
- We provide support for newer-generation Broadcom network adapters (Thor, 200 Gbps) in MVAPICH2 and optimize the communication protocols (RC, UD, Hybrid).
- Focus is on point-to-point (two-sided) operations and frequently used collective operations (such as Allreduce and Alltoall).
- Benefits of these designs will be studied at the application level.
- These design changes will be incorporated into a future MVAPICH release.
2. MVAPICH-GPU release: exploring the use of Peer Direct capabilities in new Broadcom adapters for high-performance data transfers to/from GPU device memory
- Broadcom has introduced support for Peer Direct RDMA to enable high-performance communication operations from device memory.
- We study and evaluate the performance of Broadcom's Peer Direct with Thor adapters.
- We explore designs in MVAPICH2-GDR for accelerating relevant portions of device-based communication operations using Peer Direct technology with Thor adapters.
- The focus is on point-to-point intra-node and inter-node operations and commonly used collectives (Allreduce and Alltoall).
- The designs will be incorporated into a future MVAPICH2-GDR release.
Overview of the MVAPICH Project
- High-performance, open-source MPI library
- Support for multiple interconnects: InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), AWS EFA, OPX, Broadcom RoCE, Intel Ethernet, Rockport Networks, Slingshot 10/11
- Support for multiple platforms: x86, OpenPOWER, ARM, Xeon Phi, GPGPUs (NVIDIA and AMD)
- Started in 2001; first open-source version demonstrated at SC '02
- Supports the latest MPI-3.1 standard
- http://mvapich.cse.ohio-state.edu
- Additional optimized versions for different systems/environments:
  - MVAPICH2-X (Advanced MPI + PGAS), since 2011
  - MVAPICH2-GDR with support for NVIDIA (since 2014) and AMD (since 2020) GPUs
  - MVAPICH2-MIC with support for Intel Xeon Phi, since 2014
  - MVAPICH2-Virt with virtualization support, since 2015
  - MVAPICH2-EA with support for Energy-Awareness, since 2015
  - MVAPICH2-Azure for Azure HPC IB instances, since 2019
  - MVAPICH2-X-AWS for AWS HPC + EFA instances, since 2019
- Tools:
  - OSU MPI Micro-Benchmarks (OMB), since 2003
  - OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
- Used by more than 3,375 organizations in 91 countries
- More than 1.77 million downloads from the OSU site directly
- Empowering many TOP500 clusters (Nov '23 ranking): 11th, 10,649,600-core (Sunway TaihuLight) at NSC, Wuxi, China; 29th, 448,448 cores (Frontera) at TACC; 46th, 288,288 cores (Lassen) at LLNL; 61st, 570,020 cores (Nurion) in South Korea; and many others
- Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
- Partner in the 29th-ranked TACC Frontera system
- Empowering Top500 systems for more than 18 years
Overview
- Introduction
- Performance Characterization: MPI performance overheads vs. IB level
- Latency and Message Rate Optimization
- Performance Evaluation: micro-benchmark level, application level
- MVAPICH 3.0 Performance Evaluation

Configuration & Runtime
- MVAPICH2 runtime:
  - RC: MV2_USE_UD_HYBRID=0 MV2_USE_ONLY_UD=0
  - UD: MV2_USE_UD_HYBRID=0 MV2_USE_ONLY_UD=1
- UCX 1.12.1: ./configure --prefix=<install-path>
- OpenMPI 4.1.4 (w/ UCX 1.12.1): ./configure --prefix=<install-path> --with-ucx=<ucx-install-path>
- OpenMPI runtime: mpirun -np <procs> -npernode <ppn> -hostfile hosts -mca pml ucx -x UCX_TLS=self,sm,rc_v /path/to/cp2k.popt -i /path/to/inputfile (see the sketch below)
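As a point of reference for how these runtime settings are exercised, the following is a minimal C ping-pong in the spirit of osu_latency. It is our own sketch, not the OMB source; the binary name, 8-byte message size, and iteration count are illustrative choices.

    /* Minimal ping-pong latency sketch (not the OSU benchmark itself). It can
     * be launched with either stack, e.g.
     * "mpirun -np 2 -npernode 1 -hostfile hosts ./pingpong". */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, iters = 10000, size = 8;   /* 8-byte messages */
        char *buf;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(size);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

        free(buf);
        MPI_Finalize();
        return 0;
    }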
Cluster Setup
[Cluster configuration slide; *Courtesy of Dell Technology*]

MVAPICH2.3.x-Broadcom RDMA RC Latency on RoCE 100 GbE
- IBPerf latency performance on the left side, MVAPICH overhead on the right side. MVAPICH is not shown on the left because the difference is not visible at that scale.
- The overhead is on the order of nanoseconds to a few microseconds, thanks to our optimized point-to-point parameter tuning.
[Charts: Pt2pt latency at the IB level and MV2 latency overhead vs. message size (2 B - 2 MB); callouts: 3.93 us, 16 us, 140 ns]

MVAPICH2.3.x-Broadcom RDMA UD Latency on RoCE 100 GbE
- IBPerf latency performance on the left side, MVAPICH overhead on the right side.
- A UD overhead of 1.5-2.4 us is added on top of the IB-level latency.
- This has some impact at the application level when switching to UD at large scale.
- Based on our experience, the issue is unique to the Broadcom RoCE + 2.3 series combination.
[Charts: Pt2pt latency at the IB level (callouts: 3.82 us, 5.34 us) and MV2 overhead (callouts: 1520 ns, 2420 ns) vs. message size (2 B - 4 KB)]
Latency and Message Rate Optimization
- Add corresponding point-to-point & collective tuning tables
  - For up to 64 nodes x 128 PPN = 8192 processes
  - Based on the Dell Bluebonnet (CPU) system and the Rattler2 (GPU) system
- Enhanced UD+RC hybrid transport mode tuned for the Broadcom adapter
- Optimized default CPU mapping policy
- Support for asynchronous threading progress
- UD startup optimization
- Point-to-point message coalescing
- SGL packetized eager communication
UD/RC Hybrid Transport Protocol Analysis
- RC has better performance than UD in most cases.
- UD/hybrid becomes the exclusive choice at large scale (e.g. alltoall with >= 16 nodes), where it continues to scale.
- Tuned hybrid transport mode: use RC for small scales and message sizes, and UD for the other cases (sketched below).
[Charts: osu_alltoall, 16 nodes x 128 PPN, 256 B - 16 KB ("continued scaling"); Allreduce, 16 nodes x 128 PPN, 4 B - 1 MB (up to 33% and 15% lower latency)]
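As a rough illustration of that threshold decision (and not MVAPICH2's actual internals, which are driven by tuning tables and the MV2_USE_UD_HYBRID / MV2_USE_ONLY_UD parameters), a hypothetical selection function might look like the following; the cutoffs, names, and main() are invented for the sketch.

    /* Hypothetical sketch of the hybrid RC/UD decision: prefer RC at small
     * scale and for small messages, switch to UD otherwise. Thresholds are
     * illustrative only. */
    #include <stdio.h>
    #include <stddef.h>

    typedef enum { TRANSPORT_RC, TRANSPORT_UD } transport_t;

    static transport_t select_transport(int num_procs, size_t msg_bytes)
    {
        const int    small_scale_procs = 16 * 128;  /* e.g. below 16 nodes x 128 PPN */
        const size_t small_msg_bytes   = 8192;      /* illustrative cutoff */

        if (num_procs < small_scale_procs && msg_bytes <= small_msg_bytes)
            return TRANSPORT_RC;   /* RC performs better in most small-scale cases */
        return TRANSPORT_UD;       /* UD/hybrid wins at scale, e.g. alltoall at >= 16 nodes */
    }

    int main(void)
    {
        printf("512 procs, 4 KB   -> %s\n",
               select_transport(512, 4096) == TRANSPORT_RC ? "RC" : "UD");
        printf("8192 procs, 16 KB -> %s\n",
               select_transport(8192, 16384) == TRANSPORT_RC ? "RC" : "UD");
        return 0;
    }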
Latency and Message Rate Optimization (Cont'd)
- Optimized default CPU mapping policy
  - Make the hybrid-spread CPU mapping policy the default.
  [Table: Hybrid-Spread Affinity Policy example]
- Support for enabling affinity together with an asynchronous progress thread
  - Allows communication to be managed in parallel with application computation, for better computation/communication overlap (see the sketch below).
  - Controlled by the runtime parameter MV2_OPTIMIZED_ASYNC_PROGRESS=1.
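The kind of computation/communication overlap this enables can be pictured with a standard nonblocking collective. The sketch below is a generic MPI example (the array size and the dummy compute loop are illustrative); with MVAPICH2 it would be run with MV2_OPTIMIZED_ASYNC_PROGRESS=1 so the progress thread can advance the Allreduce in the background rather than only inside MPI_Wait.

    /* Overlap sketch: issue a nonblocking Allreduce, do independent work
     * while it is in flight, then wait for completion. */
    #include <mpi.h>
    #include <stdio.h>

    #define N (1 << 20)

    static double in[N], out[N];

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < N; i++) in[i] = rank + 1.0;

        /* Start the reduction without blocking. */
        MPI_Iallreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

        /* Independent computation proceeds while the Allreduce is progressed
         * (ideally by the asynchronous progress thread). */
        double local = 0.0;
        for (int i = 0; i < N; i++) local += in[i] * 0.5;

        MPI_Wait(&req, MPI_STATUS_IGNORE);

        if (rank == 0) printf("out[0] = %f, local = %f\n", out[0], local);
        MPI_Finalize();
        return 0;
    }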
Latency and Message Rate Optimization (Cont'd)
- UD startup optimization
  - Optimize the specific function calls with the highest overhead, identified by analyzing UD startup profiling data.
- Point-to-point message coalescing
  - Combine small messages to reduce the number of send/recv calls (illustrated in the sketch below).
- SGL packetized eager communication
  - Use a scatter-gather list (SGL) to packetize eager send requests.
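The coalescing idea can be illustrated at the application level: many tiny messages are packed into one buffer and shipped with a single send. MVAPICH2 performs the equivalent transparently inside its eager path; the sketch below is only an analogy, with arbitrary record counts and sizes.

    /* Application-level analogy for message coalescing: one MPI_Send instead
     * of NRECORDS tiny sends. */
    #include <mpi.h>
    #include <string.h>
    #include <stdio.h>

    #define NRECORDS 16
    #define RECORD_BYTES 64   /* small messages, well under the ~1 KB coalescing limit */

    int main(int argc, char **argv)
    {
        int rank;
        char records[NRECORDS][RECORD_BYTES];
        char packed[NRECORDS * RECORD_BYTES];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < NRECORDS; i++) {
                memset(records[i], 'a' + i, RECORD_BYTES);
                memcpy(packed + i * RECORD_BYTES, records[i], RECORD_BYTES);
            }
            /* One send instead of NRECORDS sends. */
            MPI_Send(packed, sizeof packed, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(packed, sizeof packed, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %zu coalesced bytes\n", sizeof packed);
        }

        MPI_Finalize();
        return 0;
    }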
UD Startup Optimization
- Provides up to 2.3x faster UD startup at a small 4-node scale.
- Provides up to 2.1x faster UD startup at a large 64-node scale.
[Charts: UD startup time (ms) vs. PPN (1-128) for 4 nodes and 64 nodes, ud-head vs. ud-fix; 2.1x faster callouts]
Pt-to-Pt Message Coalescing Performance
- Enabling/disabling coalescing has an impact on bandwidth (BW) and bidirectional bandwidth (BiBW).
- Coalescing is effective up to 1 KB message size.
- Up to 1.6x higher bandwidth and 2.7x higher bidirectional bandwidth for medium-sized messages.
- The hybrid policy takes advantage of message coalescing below 1 KB and disables it for larger sizes.
[Charts: osu_bw and osu_bibw bandwidth (MB/s) vs. message size (1 B - 64 KB), enable_coalesce vs. disable; 1.6x and 2.7x callouts]
Pt2pt Message Coalescing - Single-Pair Message Rate (osu_mbw_mr)
- Test name: Single-Pair Bandwidth and Message Rate Test
- Evaluation focus: aggregate uni-directional bandwidth and message rate
- Participants: 1 process per node
- Sending process behavior: sends a fixed number of messages (the window size) back-to-back to the paired receiving process, then waits for a reply from the receiver (see the sketch below)
- Iterations: repeated for 1000 iterations
Findings:
1. Enabling/disabling coalescing has an impact on the message rate.
2. Up to 1.5x higher bandwidth, within limits.
[Chart: Pt2pt message rate (x1000 messages/s) vs. message size (1 B - 64 KB), 1 PPN, enable_coalesce vs. disable; 1.5x and 4x callouts]
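The communication pattern described above can be sketched as follows. This is a simplified rendition of the windowed send/reply loop, not the osu_mbw_mr source; WINDOW, ITERS, and the 8-byte message size are illustrative choices.

    /* Windowed message-rate pattern: the sender posts a window of back-to-back
     * nonblocking sends, waits for them and for a short reply, and repeats. */
    #include <mpi.h>
    #include <stdio.h>

    #define WINDOW 64
    #define ITERS  1000
    #define MSG    8

    int main(int argc, char **argv)
    {
        int rank;
        char buf[WINDOW][MSG], ack;
        MPI_Request reqs[WINDOW];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int it = 0; it < ITERS; it++) {
            if (rank == 0) {
                for (int w = 0; w < WINDOW; w++)
                    MPI_Isend(buf[w], MSG, MPI_CHAR, 1, w, MPI_COMM_WORLD, &reqs[w]);
                MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
                MPI_Recv(&ack, 1, MPI_CHAR, 1, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                for (int w = 0; w < WINDOW; w++)
                    MPI_Irecv(buf[w], MSG, MPI_CHAR, 0, w, MPI_COMM_WORLD, &reqs[w]);
                MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
                MPI_Send(&ack, 1, MPI_CHAR, 0, 99, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("message rate: %.0f msgs/s\n", (double)WINDOW * ITERS / (t1 - t0));

        MPI_Finalize();
        return 0;
    }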
SGL Packetized Eager Communication - 100 GbE
- Reduces alltoall latency by up to 16% for 4-byte messages.
- Enabled by adding the MV2_USE_EAGER_SGL=1 runtime parameter.
- Enabled by default for message sizes up to 1 KB (MV2_USE_EAGER_SGL_LIMIT=1k); a verbs-level sketch of the scatter-gather idea follows below.
[Charts: osu_alltoall latency, 2 nodes x 128 procs, 1 B - 1 KB, linear and log scale, Disable vs. Enable; 16% lower]
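At the verbs level, the scatter-gather idea amounts to posting one send work request whose SGL references multiple registered buffers (for example, an eager protocol header plus the user payload), so no copy into a single contiguous eager buffer is needed. The fragment below is a generic ibverbs sketch assuming an already-created QP and registered memory regions; the function name and argument list are our own, not MVAPICH2 internals.

    /* Post a single send whose scatter-gather list covers two separate
     * registered buffers; the HCA gathers both pieces into one message. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int post_sgl_eager_send(struct ibv_qp *qp,
                            void *hdr, size_t hdr_len, uint32_t hdr_lkey,
                            void *payload, size_t payload_len, uint32_t payload_lkey)
    {
        struct ibv_sge sge[2];
        struct ibv_send_wr wr, *bad_wr = NULL;

        sge[0].addr   = (uintptr_t)hdr;       /* eager protocol header */
        sge[0].length = (uint32_t)hdr_len;
        sge[0].lkey   = hdr_lkey;

        sge[1].addr   = (uintptr_t)payload;   /* user payload, sent in place */
        sge[1].length = (uint32_t)payload_len;
        sge[1].lkey   = payload_lkey;

        memset(&wr, 0, sizeof wr);
        wr.sg_list    = sge;
        wr.num_sge    = 2;                    /* hardware gathers both pieces */
        wr.opcode     = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;

        return ibv_post_send(qp, &wr, &bad_wr);
    }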
Performance Evaluation - CPU Microbenchmarks
Experiment results from Dell Bluebonnet:
- Up to 20% reduction in small-message point-to-point latency.
- From 0.1x to 2x increase in bandwidth.
- Up to 12.4x lower MPI_Allreduce latency.
- Up to 5x lower MPI_Scatter latency.
[Charts: osu_latency (small messages) and osu_bw (large messages); Alltoall, Allreduce, and Scatter latency at 64 nodes x 128 PPN; 2.3.x-broadcom vs. OpenMPI vs. 2.3.7]
Performance Evaluation - GPU Microbenchmarks
Experiment results from the Rattler2 Dell cluster (A100 GPUs):
- Compared to the non-optimized version, up to 2-3x reduction in medium-to-large message point-to-point latency.
- Up to 2.6x increase in bandwidth.
- Up to 35% reduction in alltoall latency (a device-buffer transfer sketch follows below).
[Charts: osu_latency (small messages) and osu_bw (large messages); Alltoall, Bcast, and Allreduce latency at 2 nodes x 4 PPN; MV2GDR Opt vs. OpenMPI vs. MV2GDR]
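For context on what these GPU numbers exercise: with a CUDA-aware build such as MVAPICH2-GDR, device pointers are passed directly to MPI calls and the library moves the data over the fastest available path (Peer Direct on the Broadcom adapters, where supported). The sketch below assumes such a build and one GPU per process, omits error handling, and uses an arbitrary 1 MB buffer.

    /* Device-to-device point-to-point sketch with a CUDA-aware MPI library. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        size_t bytes = 1 << 20;     /* 1 MB device buffer */
        void *d_buf = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Assumes the default CUDA device; real codes pick a device per local rank. */
        cudaMalloc(&d_buf, bytes);
        cudaMemset(d_buf, rank, bytes);

        if (rank == 0) {
            MPI_Send(d_buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %zu bytes into device memory\n", bytes);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }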
Performance Evaluation - Applications
- Reduces execution time of OpenFOAM Motorbike by up to 45% at 16 nodes x 128 PPN.
- Reduces execution time of GROMACS benchPEP by up to 51% at 64 nodes x 128 PPN.
[Charts: GROMACS benchPEP execution time (s), 128 PPN, 1-64 nodes; OpenFOAM Motorbike 90x36x36 (15.5M cells) execution time (s), 128 PPN, 1-16 nodes; 2.3.x-broadcom vs. 2.3.7 vs. OpenMPI; 45% and 51% lower]
Performance Evaluation - Applications
- Reduces execution time of CP2K H2O-dft-ls (NREP4) by up to 15%.
- Reduces execution time of WRF CONUS 3KM by up to 7%.
[Charts: CP2K H2O-dft-ls (NREP4) execution time (s), 128 PPN, 2-32 nodes (MVAPICH2 vs. OpenMPI); WRF CONUS 12KM, 128 PPN, 1-8 nodes, and WRF CONUS 3KM, 128 PPN, 8-64 nodes (2.3.x-broadcom vs. OpenMPI vs. 2.3.7)]
MVAPICH-3.0 Pt-to-Pt Latency (RC) on FW 227 (RHEL 8.8)
- MVAPICH 3.0 provides competitive point-to-point performance.
- 9% lower latency at 16 KB message size.
[Charts: small-message (1 B - 16 KB) and large-message (8 KB - 4 MB) latency; MV3.0 vs. MV2.3.X vs. OMPI; 9% lower]

MVAPICH-3.0 Pt-to-Pt Latency (UD) on FW 227 (RHEL 8.8)
- MVAPICH 3.0 provides competitive point-to-point UD performance.
- 28% lower latency at 16 KB message size.
[Charts: small-message (1 B - 8 KB) and large-message (8 KB - 4 MB) latency; MV3.0 vs. MV2.3.X; 21% lower callout]
Conclusion & Future Work
Conclusion:
- We have analyzed MPI overheads vs. IB-level performance on the Broadcom adapter.
- We have tuned MVAPICH2 for the Broadcom Thor family: 1) coalescing, 2) SGL eager usage, 3) binding policies, 4) UD start-up, 5) UD/RC thresholds, and 6) collective algorithms.
- The bottom-up approach targeting microbenchmark latency and message rate resulted in significant microbenchmark- and application-level gains.
Future Work:
- Optimize additional applications.
- Integrate the existing optimizations with MVAPICH-3.0 on Broadcom systems.
- In progress: MVAPICH-2.3.8 (with enhanced RoCEv2 support).
- Optimize MVAPICH for Ultra Ethernet.

Ultra Ethernet Consortium - Modernizing RDMA
Classic RDMA -> Ultra Ethernet:
- In-order packet delivery -> Out-of-order placement, in-order message completion
- Go-back-N (inefficient) -> Selective ACK and retransmit
- No multipathing -> Packet-level multipathing
- DCQCN (hard to tune) -> Scalable and simplified congestion control
- Higher fabric utilization at ultra-high scale, with automated configuration and tuning

THANK YOU!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/