HIGH PERFORMANCE & SCALABLE MPI LIBRARY OVER BROADCOM ROCE
2024 OFA Virtual Workshop
Mustafa Abduljabbar, Shulei Xu, Noah Pavuk, and Hemal Shah*
Core Switching Group (CSG), Broadcom Inc
Network Based Computing Laboratory (NOWLAB), The Ohio State University

Why RoCE for HPC?
- Enhanced Performance: RoCE delivers significantly lower latency and higher throughput compared to traditional Ethernet, empowering HPC applications to achieve peak performance.
- Optimized Efficiency: RoCE offloads RDMA operations to specialized hardware, reducing CPU overhead and freeing up valuable processing resources for critical computational tasks.
- Cost-Effective Solution: RoCE leverages existing Ethernet infrastructure.
- Scalable and Flexible: RoCE supports a range of Ethernet speeds and Layer 3 routing.
Goal: Highly Optimized MPI for Broadcom RoCEv2
1. MVAPICH-CPU release: optimizing MPI communication operations on new-generation Broadcom adapters
- We provide support for newer-generation Broadcom network adapters (Thor, 200 Gbps) in MVAPICH2 and optimize the communication protocols (RC, UD, Hybrid).
- Focus is on point-to-point (two-sided) operations and frequently used collective operations (such as Allreduce and Alltoall).
- Benefits of these designs will be studied at the application level.
- These design changes will be incorporated into a future MVAPICH release.
2. MVAPICH-GPU release: exploring the use of Peer Direct capabilities in new Broadcom adapters for high-performance data transfers to/from GPU device memory
- Broadcom has introduced support for Peer Direct RDMA to enable high-performance communication operations from device memory.
- We study and evaluate the performance of Broadcom's Peer Direct with Thor adapters.
- We explore designs in MVAPICH2-GDR for accelerating relevant portions of device-based communication operations using Peer Direct technology with Thor adapters.
- The focus is on point-to-point intra-node and inter-node operations and commonly used collectives (Allreduce and Alltoall).
- The designs will be incorporated into a future MVAPICH2-GDR release.
Overview of the MVAPICH Project
- High-performance, open-source MPI library
- Support for multiple interconnects: InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), AWS EFA, OPX, Broadcom RoCE, Intel Ethernet, Rockport Networks, Slingshot 10/11
- Support for multiple platforms: x86, OpenPOWER, ARM, Xeon Phi, GPGPUs (NVIDIA and AMD)
- Started in 2001; first open-source version demonstrated at SC '02
- Supports the latest MPI-3.1 standard
- http://mvapich.cse.ohio-state.edu
- Additional optimized versions for different systems/environments:
  - MVAPICH2-X (Advanced MPI + PGAS), since 2011
  - MVAPICH2-GDR with support for NVIDIA (since 2014) and AMD (since 2020) GPUs
  - MVAPICH2-MIC with support for Intel Xeon Phi, since 2014
  - MVAPICH2-Virt with virtualization support, since 2015
  - MVAPICH2-EA with support for Energy-Awareness, since 2015
  - MVAPICH2-Azure for Azure HPC IB instances, since 2019
  - MVAPICH2-X-AWS for AWS HPC + EFA instances, since 2019
- Tools:
  - OSU MPI Micro-Benchmarks (OMB), since 2003
  - OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
- Used by more than 3,375 organizations in 91 countries
- More than 1.77 million downloads from the OSU site directly
- Empowering many TOP500 clusters (Nov '23 ranking): 11th, 10,649,600-core (Sunway TaihuLight) at NSC, Wuxi, China; 29th, 448,448 cores (Frontera) at TACC; 46th, 288,288 cores (Lassen) at LLNL; 61st, 570,020 cores (Nurion) in South Korea; and many others
- Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
- Partner in the 29th-ranked TACC Frontera system
- Empowering Top500 systems for more than 18 years
Overview
- Introduction
- Performance Characterization: MPI performance overheads vs. IB level
- Latency and Message Rate Optimization
- Performance Evaluation: micro-benchmark level, application level
- MVAPICH 3.0 Performance Evaluation

Configuration & Runtime
- MVAPICH2 runtime:
  - RC: MV2_USE_UD_HYBRID=0 MV2_USE_ONLY_UD=0
  - UD: MV2_USE_UD_HYBRID=0 MV2_USE_ONLY_UD=1
- UCX 1.12.1: ./configure --prefix=<install-path>
- OpenMPI 4.1.4 (w/ UCX 1.12.1): ./configure --prefix=<install-path> --with-ucx=<ucx-install-path>
- OpenMPI runtime: mpirun -np <procs> -npernode <ppn> -hostfile hosts -mca pml ucx -x UCX_TLS=self,sm,rc_v /path/to/cp2k.popt -i /path/to/inputfile (see the sketch below)
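As a point of reference for how these runtime settings are exercised, the following is a minimal C ping-pong in the spirit of osu_latency. It is our own sketch, not the OMB source; the binary name, 8-byte message size, and iteration count are illustrative choices.

    /* Minimal ping-pong latency sketch (not the OSU benchmark itself). It can
     * be launched with either stack, e.g.
     * "mpirun -np 2 -npernode 1 -hostfile hosts ./pingpong". */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, iters = 10000, size = 8;   /* 8-byte messages */
        char *buf;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(size);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

        free(buf);
        MPI_Finalize();
        return 0;
    }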
Cluster Setup
[Cluster configuration slide; *Courtesy of Dell Technology*]

MVAPICH2.3.x-Broadcom RDMA RC Latency on RoCE 100 GbE
- IBPerf latency performance on the left side, MVAPICH overhead on the right side. MVAPICH is not shown on the left because the difference is not visible at that scale.
- The overhead is on the order of nanoseconds to a few microseconds, thanks to our optimized point-to-point parameter tuning.
[Charts: Pt2pt latency at the IB level and MV2 latency overhead vs. message size (2 B - 2 MB); callouts: 3.93 us, 16 us, 140 ns]

MVAPICH2.3.x-Broadcom RDMA UD Latency on RoCE 100 GbE
- IBPerf latency performance on the left side, MVAPICH overhead on the right side.
- A UD overhead of 1.5-2.4 us is added on top of the IB-level latency.
- This has some impact at the application level when switching to UD at large scale.
- Based on our experience, the issue is unique to the Broadcom RoCE + 2.3 series combination.
[Charts: Pt2pt latency at the IB level (callouts: 3.82 us, 5.34 us) and MV2 overhead (callouts: 1520 ns, 2420 ns) vs. message size (2 B - 4 KB)]
Latency and Message Rate Optimization
- Add corresponding point-to-point & collective tuning tables
  - For up to 64 nodes x 128 PPN = 8192 processes
  - Based on the Dell Bluebonnet (CPU) system and the Rattler2 (GPU) system
- Enhanced UD+RC hybrid transport mode tuned for the Broadcom adapter
- Optimized default CPU mapping policy
- Support for asynchronous threading progress
- UD startup optimization
- Point-to-point message coalescing
- SGL packetized eager communication
UD/RC Hybrid Transport Protocol Analysis
- RC has better performance than UD in most cases.
- UD/hybrid becomes the exclusive choice at large scale (e.g. alltoall with >= 16 nodes), where it continues to scale.
- Tuned hybrid transport mode: use RC for small scales and message sizes, and UD for the other cases (sketched below).
[Charts: osu_alltoall, 16 nodes x 128 PPN, 256 B - 16 KB ("continued scaling"); Allreduce, 16 nodes x 128 PPN, 4 B - 1 MB (up to 33% and 15% lower latency)]
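As a rough illustration of that threshold decision (and not MVAPICH2's actual internals, which are driven by tuning tables and the MV2_USE_UD_HYBRID / MV2_USE_ONLY_UD parameters), a hypothetical selection function might look like the following; the cutoffs, names, and main() are invented for the sketch.

    /* Hypothetical sketch of the hybrid RC/UD decision: prefer RC at small
     * scale and for small messages, switch to UD otherwise. Thresholds are
     * illustrative only. */
    #include <stdio.h>
    #include <stddef.h>

    typedef enum { TRANSPORT_RC, TRANSPORT_UD } transport_t;

    static transport_t select_transport(int num_procs, size_t msg_bytes)
    {
        const int    small_scale_procs = 16 * 128;  /* e.g. below 16 nodes x 128 PPN */
        const size_t small_msg_bytes   = 8192;      /* illustrative cutoff */

        if (num_procs < small_scale_procs && msg_bytes <= small_msg_bytes)
            return TRANSPORT_RC;   /* RC performs better in most small-scale cases */
        return TRANSPORT_UD;       /* UD/hybrid wins at scale, e.g. alltoall at >= 16 nodes */
    }

    int main(void)
    {
        printf("512 procs, 4 KB   -> %s\n",
               select_transport(512, 4096) == TRANSPORT_RC ? "RC" : "UD");
        printf("8192 procs, 16 KB -> %s\n",
               select_transport(8192, 16384) == TRANSPORT_RC ? "RC" : "UD");
        return 0;
    }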
Latency and Message Rate Optimization (Cont'd)
- Optimized default CPU mapping policy
  - Make the hybrid-spread CPU mapping policy the default.
  [Table: Hybrid-Spread Affinity Policy example]
- Support for enabling affinity together with an asynchronous progress thread
  - Allows communication to be managed in parallel with application computation, for better computation/communication overlap (see the sketch below).
  - Controlled by the runtime parameter MV2_OPTIMIZED_ASYNC_PROGRESS=1.
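The kind of computation/communication overlap this enables can be pictured with a standard nonblocking collective. The sketch below is a generic MPI example (the array size and the dummy compute loop are illustrative); with MVAPICH2 it would be run with MV2_OPTIMIZED_ASYNC_PROGRESS=1 so the progress thread can advance the Allreduce in the background rather than only inside MPI_Wait.

    /* Overlap sketch: issue a nonblocking Allreduce, do independent work
     * while it is in flight, then wait for completion. */
    #include <mpi.h>
    #include <stdio.h>

    #define N (1 << 20)

    static double in[N], out[N];

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < N; i++) in[i] = rank + 1.0;

        /* Start the reduction without blocking. */
        MPI_Iallreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

        /* Independent computation proceeds while the Allreduce is progressed
         * (ideally by the asynchronous progress thread). */
        double local = 0.0;
        for (int i = 0; i < N; i++) local += in[i] * 0.5;

        MPI_Wait(&req, MPI_STATUS_IGNORE);

        if (rank == 0) printf("out[0] = %f, local = %f\n", out[0], local);
        MPI_Finalize();
        return 0;
    }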
Latency and Message Rate Optimization (Cont'd)
- UD startup optimization
  - Optimize the specific function calls with the highest overhead, identified by analyzing UD startup profiling data.
- Point-to-point message coalescing
  - Combine small messages to reduce the number of send/recv calls (illustrated in the sketch below).
- SGL packetized eager communication
  - Use a scatter-gather list (SGL) to packetize eager send requests.
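The coalescing idea can be illustrated at the application level: many tiny messages are packed into one buffer and shipped with a single send. MVAPICH2 performs the equivalent transparently inside its eager path; the sketch below is only an analogy, with arbitrary record counts and sizes.

    /* Application-level analogy for message coalescing: one MPI_Send instead
     * of NRECORDS tiny sends. */
    #include <mpi.h>
    #include <string.h>
    #include <stdio.h>

    #define NRECORDS 16
    #define RECORD_BYTES 64   /* small messages, well under the ~1 KB coalescing limit */

    int main(int argc, char **argv)
    {
        int rank;
        char records[NRECORDS][RECORD_BYTES];
        char packed[NRECORDS * RECORD_BYTES];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < NRECORDS; i++) {
                memset(records[i], 'a' + i, RECORD_BYTES);
                memcpy(packed + i * RECORD_BYTES, records[i], RECORD_BYTES);
            }
            /* One send instead of NRECORDS sends. */
            MPI_Send(packed, sizeof packed, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(packed, sizeof packed, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %zu coalesced bytes\n", sizeof packed);
        }

        MPI_Finalize();
        return 0;
    }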
UD Startup Optimization
- Provides up to 2.3x faster UD startup at a small 4-node scale.
- Provides up to 2.1x faster UD startup at a large 64-node scale.
[Charts: UD startup time (ms) vs. PPN (1-128) for 4 nodes and 64 nodes, ud-head vs. ud-fix; 2.1x faster callouts]
Pt-to-Pt Message Coalescing Performance
- Enabling/disabling coalescing has an impact on bandwidth (BW) and bidirectional bandwidth (BiBW).
- Coalescing is effective up to 1 KB message size.
- Up to 1.6x higher bandwidth and 2.7x higher bidirectional bandwidth for medium-sized messages.
- The hybrid policy takes advantage of message coalescing below 1 KB and disables it for larger sizes.
[Charts: osu_bw and osu_bibw bandwidth (MB/s) vs. message size (1 B - 64 KB), enable_coalesce vs. disable; 1.6x and 2.7x callouts]
Pt2pt Message Coalescing - Single-Pair Message Rate (osu_mbw_mr)
- Test name: Single-Pair Bandwidth and Message Rate Test
- Evaluation focus: aggregate uni-directional bandwidth and message rate
- Participants: 1 process per node
- Sending process behavior: sends a fixed number of messages (the window size) back-to-back to the paired receiving process, then waits for a reply from the receiver (see the sketch below)
- Iterations: repeated for 1000 iterations
Findings:
1. Enabling/disabling coalescing has an impact on the message rate.
2. Up to 1.5x higher bandwidth, within limits.
[Chart: Pt2pt message rate (x1000 messages/s) vs. message size (1 B - 64 KB), 1 PPN, enable_coalesce vs. disable; 1.5x and 4x callouts]
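The communication pattern described above can be sketched as follows. This is a simplified rendition of the windowed send/reply loop, not the osu_mbw_mr source; WINDOW, ITERS, and the 8-byte message size are illustrative choices.

    /* Windowed message-rate pattern: the sender posts a window of back-to-back
     * nonblocking sends, waits for them and for a short reply, and repeats. */
    #include <mpi.h>
    #include <stdio.h>

    #define WINDOW 64
    #define ITERS  1000
    #define MSG    8

    int main(int argc, char **argv)
    {
        int rank;
        char buf[WINDOW][MSG], ack;
        MPI_Request reqs[WINDOW];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int it = 0; it < ITERS; it++) {
            if (rank == 0) {
                for (int w = 0; w < WINDOW; w++)
                    MPI_Isend(buf[w], MSG, MPI_CHAR, 1, w, MPI_COMM_WORLD, &reqs[w]);
                MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
                MPI_Recv(&ack, 1, MPI_CHAR, 1, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                for (int w = 0; w < WINDOW; w++)
                    MPI_Irecv(buf[w], MSG, MPI_CHAR, 0, w, MPI_COMM_WORLD, &reqs[w]);
                MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
                MPI_Send(&ack, 1, MPI_CHAR, 0, 99, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("message rate: %.0f msgs/s\n", (double)WINDOW * ITERS / (t1 - t0));

        MPI_Finalize();
        return 0;
    }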
SGL Packetized Eager Communication - 100 GbE
- Reduces alltoall latency by up to 16% for 4-byte messages.
- Enabled by adding the MV2_USE_EAGER_SGL=1 runtime parameter.
- Enabled by default for message sizes up to 1 KB (MV2_USE_EAGER_SGL_LIMIT=1k); a verbs-level sketch of the scatter-gather idea follows below.
[Charts: osu_alltoall latency, 2 nodes x 128 procs, 1 B - 1 KB, linear and log scale, Disable vs. Enable; 16% lower]
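At the verbs level, the scatter-gather idea amounts to posting one send work request whose SGL references multiple registered buffers (for example, an eager protocol header plus the user payload), so no copy into a single contiguous eager buffer is needed. The fragment below is a generic ibverbs sketch assuming an already-created QP and registered memory regions; the function name and argument list are our own, not MVAPICH2 internals.

    /* Post a single send whose scatter-gather list covers two separate
     * registered buffers; the HCA gathers both pieces into one message. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int post_sgl_eager_send(struct ibv_qp *qp,
                            void *hdr, size_t hdr_len, uint32_t hdr_lkey,
                            void *payload, size_t payload_len, uint32_t payload_lkey)
    {
        struct ibv_sge sge[2];
        struct ibv_send_wr wr, *bad_wr = NULL;

        sge[0].addr   = (uintptr_t)hdr;       /* eager protocol header */
        sge[0].length = (uint32_t)hdr_len;
        sge[0].lkey   = hdr_lkey;

        sge[1].addr   = (uintptr_t)payload;   /* user payload, sent in place */
        sge[1].length = (uint32_t)payload_len;
        sge[1].lkey   = payload_lkey;

        memset(&wr, 0, sizeof wr);
        wr.sg_list    = sge;
        wr.num_sge    = 2;                    /* hardware gathers both pieces */
        wr.opcode     = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;

        return ibv_post_send(qp, &wr, &bad_wr);
    }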
Performance Evaluation - CPU Microbenchmarks
Experiment results from Dell Bluebonnet:
- Up to 20% reduction in small-message point-to-point latency.
- From 0.1x to 2x increase in bandwidth.
- Up to 12.4x lower MPI_Allreduce latency.
- Up to 5x lower MPI_Scatter latency.
[Charts: osu_latency (small messages) and osu_bw (large messages); Alltoall, Allreduce, and Scatter latency at 64 nodes x 128 PPN; 2.3.x-broadcom vs. OpenMPI vs. 2.3.7]
Performance Evaluation - GPU Microbenchmarks
Experiment results from the Rattler2 Dell cluster (A100 GPUs):
- Compared to the non-optimized version, up to 2-3x reduction in medium-to-large message point-to-point latency.
- Up to 2.6x increase in bandwidth.
- Up to 35% reduction in alltoall latency (a device-buffer transfer sketch follows below).
[Charts: osu_latency (small messages) and osu_bw (large messages); Alltoall, Bcast, and Allreduce latency at 2 nodes x 4 PPN; MV2GDR Opt vs. OpenMPI vs. MV2GDR]
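For context on what these GPU numbers exercise: with a CUDA-aware build such as MVAPICH2-GDR, device pointers are passed directly to MPI calls and the library moves the data over the fastest available path (Peer Direct on the Broadcom adapters, where supported). The sketch below assumes such a build and one GPU per process, omits error handling, and uses an arbitrary 1 MB buffer.

    /* Device-to-device point-to-point sketch with a CUDA-aware MPI library. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        size_t bytes = 1 << 20;     /* 1 MB device buffer */
        void *d_buf = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Assumes the default CUDA device; real codes pick a device per local rank. */
        cudaMalloc(&d_buf, bytes);
        cudaMemset(d_buf, rank, bytes);

        if (rank == 0) {
            MPI_Send(d_buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %zu bytes into device memory\n", bytes);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }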
Performance Evaluation - Applications
- Reduces execution time of OpenFOAM Motorbike by up to 45% at 16 nodes x 128 PPN.
- Reduces execution time of GROMACS benchPEP by up to 51% at 64 nodes x 128 PPN.
[Charts: GROMACS benchPEP execution time (s), 128 PPN, 1-64 nodes; OpenFOAM Motorbike 90x36x36 (15.5M cells) execution time (s), 128 PPN, 1-16 nodes; 2.3.x-broadcom vs. 2.3.7 vs. OpenMPI; 45% and 51% lower]
Performance Evaluation - Applications
- Reduces execution time of CP2K H2O-dft-ls (NREP4) by up to 15%.
- Reduces execution time of WRF CONUS 3KM by up to 7%.
[Charts: CP2K H2O-dft-ls (NREP4) execution time (s), 128 PPN, 2-32 nodes (MVAPICH2 vs. OpenMPI); WRF CONUS 12KM, 128 PPN, 1-8 nodes, and WRF CONUS 3KM, 128 PPN, 8-64 nodes (2.3.x-broadcom vs. OpenMPI vs. 2.3.7)]
MVAPICH-3.0 Pt-to-Pt Latency (RC) on FW 227 (RHEL 8.8)
- MVAPICH 3.0 provides competitive point-to-point performance.
- 9% lower latency at 16 KB message size.
[Charts: small-message (1 B - 16 KB) and large-message (8 KB - 4 MB) latency; MV3.0 vs. MV2.3.X vs. OMPI; 9% lower]

MVAPICH-3.0 Pt-to-Pt Latency (UD) on FW 227 (RHEL 8.8)
- MVAPICH 3.0 provides competitive point-to-point UD performance.
- 28% lower latency at 16 KB message size.
[Charts: small-message (1 B - 8 KB) and large-message (8 KB - 4 MB) latency; MV3.0 vs. MV2.3.X; 21% lower callout]
Conclusion & Future Work
Conclusion:
- We have analyzed MPI overheads vs. IB-level performance on the Broadcom adapter.
- We have tuned MVAPICH2 for the Broadcom Thor family: 1) coalescing, 2) SGL eager usage, 3) binding policies, 4) UD start-up, 5) UD/RC thresholds, and 6) collective algorithms.
- The bottom-up approach targeting microbenchmark latency and message rate resulted in significant microbenchmark- and application-level gains.
Future Work:
- Optimize additional applications.
- Integrate the existing optimizations with MVAPICH-3.0 on Broadcom systems.
- In progress: MVAPICH-2.3.8 (with enhanced RoCEv2 support).
- Optimize MVAPICH for Ultra Ethernet.

Ultra Ethernet Consortium - Modernizing RDMA
Classic RDMA -> Ultra Ethernet:
- In-order packet delivery -> Out-of-order placement, in-order message completion
- Go-back-N (inefficient) -> Selective ACK and retransmit
- No multipathing -> Packet-level multipathing
- DCQCN (hard to tune) -> Scalable and simplified congestion control
- Higher fabric utilization at ultra-high scale, with automated configuration and tuning

THANK YOU!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/