Accelerating HPC Applications with SmartNICs
Donglai Dai, Chief Engineer, X-ScaleSolutions
contactusx-
San Jose, CA, April 26-28, 2022

Outline
- Motivation
- Basic Idea for MVAPICH2-DPU Library Design
- Main Features of MVAPICH2-DPU Library
- Performance Benefits for Benchmarks and Applications
- Conclusion

Requirements for Next-Generation Communication Libraries
- SmartNICs have the potential to take over a wide range of overhead tasks from the host CPUs in a variety of applications
- Message Passing Interface (MPI) libraries are widely used for parallel and distributed HPC and AI applications in HPC/data centers and clouds
- Requirements for a high-performance and scalable MPI library:
  - Low-latency communication
  - High-bandwidth communication
  - Minimum contention for host CPU resources to progress non-blocking collectives
  - High overlap of computation with communication
- CPU-based non-blocking communication progress can lead to sub-par performance because the main application is left with fewer CPU resources for useful application-level computation (see the sketch below)
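To make the last bullet concrete, the sketch below (illustrative C, not taken from the talk; compute_chunk and the chunk count are placeholders) shows the usual workaround when only CPU-based progress is available: the application interleaves MPI_Test calls with its computation so the non-blocking collective keeps advancing, spending host CPU cycles on communication progress instead of useful work.

#include <mpi.h>

/* Placeholder for the application's real computation on one chunk of data. */
static void compute_chunk(double *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] = buf[i] * 1.0001 + 1.0;
}

/* CPU-driven progress: many MPI libraries only advance a non-blocking
 * collective when the application re-enters MPI, so MPI_Test calls are
 * sprinkled through the compute loop and consume host CPU cycles. */
void compute_with_manual_progress(double *work, int n, MPI_Request *req)
{
    int done = 0;
    for (int chunk = 0; chunk < 64; chunk++) {
        compute_chunk(work, n);                      /* useful application work */
        if (!done)
            MPI_Test(req, &done, MPI_STATUS_IGNORE); /* drive communication     */
    }
    if (!done)
        MPI_Wait(req, MPI_STATUS_IGNORE);            /* complete the collective */
}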
Can MPI Functions be Offloaded?
- The area of network offloading of MPI primitives is still nascent
- State-of-the-art BlueField DPUs bring more compute power into the network
- Exploit the additional compute capabilities of modern BlueField DPUs within existing MPI middleware to extract:
  - Peak pure communication performance
  - Overlap of communication and computation

Overview of the BlueField-2 DPU
- ConnectX-6 network adapter with 200 Gbps InfiniBand
- System-on-chip containing eight 64-bit ARMv8 A72 cores running at 2.7 GHz
- 16 GB of memory for the ARM cores
- The MVAPICH2-DPU MPI library is designed to take advantage of DPUs and accelerate scientific applications
Basic Idea for MPI Offloading to the DPU
- Use of generic and optimized asynchronous progress threads on the ARM cores for:
  - Point-to-point operations
  - Collectives
  - RMA operations
- [Figure: host ranks P0-P3 post non-blocking P2P/collective/RMA operations, perform computation, and then call MPI_Wait/MPI_Waitall; control messages are exchanged with communication processes/threads on the BlueField, which carry out the communication.]
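In code, the host-side pattern in the figure looks roughly like the sketch below (illustrative only; the do_computation kernel and buffer sizes are placeholders): each rank posts the non-blocking collective, computes without any intervening progress calls, and then waits, relying on the asynchronous progress threads on the DPU's ARM cores to move the data in the meantime.

#include <mpi.h>

/* Placeholder for the application's computation that is overlapped with
 * the collective. */
static void do_computation(double *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] += 1.0;
}

/* Host-side pattern from the figure: post a non-blocking collective, compute,
 * then wait; communication progress is expected to happen asynchronously. */
void alltoall_with_overlap(const double *sendbuf, double *recvbuf, int count,
                           double *work, int n, MPI_Comm comm)
{
    MPI_Request req;

    MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                  recvbuf, count, MPI_DOUBLE, comm, &req);

    do_computation(work, n);            /* overlapped application computation */

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* the collective completes here      */
}

Compared with the MPI_Test-based loop shown earlier, no host cycles are spent polling: the control messages in the figure hand the operation to the communication processes/threads on the BlueField.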
High-Level Design for MPI Offloading to the DPU
- Better support for critical collective communication operations
- Enable offloading to the BlueField ARM SoC
- Performance-enhancing algorithm selection based on the communication characteristics of the application
- [Figure: the MVAPICH2-DPU library's offload decision logic routes each collective either to an offloaded path (software kernel-based collective offload on the programmable ARM cores of the high-performance, RDMA-capable BlueField HCA, with separate designs for data resident on the CPU and on the DPU, plus hardware (ASIC-based) collective offload and in-network collective communication on high-performance SHARP-enabled switches) or to generic non-offloaded collective operations.]
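The "Offload Decision Logic" box can be read as a per-collective dispatch among the three outcomes on the slide. The sketch below is purely hypothetical: the enum names, availability flags, and thresholds are invented for illustration and are not MVAPICH2-DPU internals; it only shows the shape of such a decision.

#include <stddef.h>

/* The three outcomes shown on the slide. */
typedef enum {
    COLL_PATH_DPU_ARM,      /* software kernel-based offload on BlueField ARM cores */
    COLL_PATH_SHARP_SWITCH, /* hardware (ASIC) offload on SHARP-enabled switches    */
    COLL_PATH_GENERIC       /* generic non-offloaded collective                     */
} coll_path_t;

/* Hypothetical heuristic: the real library selects algorithms based on the
 * communication characteristics of the application; the cutoffs here are
 * made up purely to illustrate the dispatch structure. */
coll_path_t choose_collective_path(size_t msg_bytes, int is_nonblocking,
                                   int sharp_available, int dpu_available)
{
    if (sharp_available && msg_bytes <= 4096)
        return COLL_PATH_SHARP_SWITCH;          /* small messages: in-network   */
    if (dpu_available && is_nonblocking && msg_bytes >= 65536)
        return COLL_PATH_DPU_ARM;               /* large non-blocking: DPU path */
    return COLL_PATH_GENERIC;                   /* otherwise: host-based path   */
}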
MVAPICH2-DPU Library 2022.02 Release
- Implemented by X-ScaleSolutions
- Based on MVAPICH2 2.3.6, compliant with the MPI 3.1 standard
- Supports all features available with the MVAPICH2 2.3.6 release (http://mvapich.cse.ohio-state.edu)
- Novel framework to offload non-blocking collectives to the DPU
- Offloads non-blocking collectives (MPI_Ialltoall, MPI_Iallgather, MPI_Ibcast, etc.) to the DPU
- Up to 100% overlap of computation with non-blocking collectives
- Accelerates scientific applications that use non-blocking collectives
Total Execution Time with osu_Ialltoall (32 nodes)
- [Charts: total execution time of osu_ialltoall on BlueField-2, 32 nodes at 32 PPN and at 16 PPN, message sizes 64K-512K, MVAPICH2 vs. MVAPICH2-DPU; annotated improvements for MVAPICH2-DPU range from 17% to 23%.]

Overlap Between Computation & Communication with osu_Ialltoall (32 nodes)
- [Charts: overlap (%) of osu_ialltoall, 32 nodes at 32 PPN and at 16 PPN, message sizes 1K-512K, MVAPICH2 vs. MVAPICH2-DPU; MVAPICH2-DPU delivers peak overlap, reaching 98-100%.]
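For reference on how the overlap numbers in this and the following overlap plots are obtained: the OSU micro-benchmarks (osu_ialltoall, osu_iallgather, osu_ibcast) report how much of the pure communication time is hidden behind an injected computation phase. The helper below restates that definition as understood from the benchmarks' output format; treat it as an approximation, not the benchmarks' exact source code.

/* Overlap metric in the style of the OSU non-blocking collective benchmarks:
 * the communication time left exposed after subtracting the compute phase,
 * normalized by the pure (non-overlapped) communication time, clamped at 0. */
double overlap_percent(double t_overall, double t_compute, double t_pure_comm)
{
    double exposed = (t_overall - t_compute) / t_pure_comm;
    double overlap = 100.0 * (1.0 - exposed);
    return overlap > 0.0 ? overlap : 0.0;
}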
Total Execution Time with osu_Iallgather (16 nodes)
- [Charts: total execution time of osu_iallgather on BlueField-2, 16 nodes at 32 PPN (128K-1M messages) and at 1 PPN (2K-16K messages), MVAPICH2 vs. MVAPICH2-DPU; annotated improvements for MVAPICH2-DPU range from 29% to 84%.]

Overlap Between Computation & Communication with osu_Iallgather (16 nodes)
- [Chart: overlap (%) of osu_iallgather, 16 nodes at 1 PPN, message sizes 128K-1M, MVAPICH2 vs. MVAPICH2-DPU; MVAPICH2-DPU delivers peak overlap, up to 97%.]

Total Execution Time with osu_Ibcast (32 nodes)
- [Charts: total execution time of osu_ibcast on BlueField-2, 32 nodes at 16 PPN and at 1 PPN, message sizes 2M-16M, MVAPICH2 vs. MVAPICH2-DPU; annotated improvements for MVAPICH2-DPU range from 8% to 58%.]

Overlap Between Computation & Communication with osu_Ibcast (32 nodes)
- [Charts: overlap (%) of osu_ibcast, 32 nodes, message sizes 2M-16M, MVAPICH2 vs. MVAPICH2-DPU; MVAPICH2-DPU delivers peak overlap, improving overlap by about 30% at 16 PPN and 38% at 1 PPN.]

P3DFFT Application Execution Time (32 nodes)
- Benefits in application-level execution time
- [Charts: P3DFFT execution time (s) versus grid size, 32 nodes, MVAPICH2 vs. MVAPICH2-DPU; MVAPICH2-DPU reduces execution time by 16-21% at 32 PPN and by 12-14% at 16 PPN.]
Conclusion
- The efficient MVAPICH2-DPU MPI library utilizes the BlueField DPU to progress MPI non-blocking collective operations
- Provides up to 100% overlap of communication and computation for non-blocking Alltoall, Allgather, Bcast, etc.
- Reduces the total execution time of the P3DFFT application by up to 21% on 1,024 processes
- Work is in progress for the MVAPICH2-DPU library to efficiently offload more types of non-blocking collective operations to DPUs

Exhibition and Live Demo
- If you are interested in more details, please come and visit our exhibit booth #8 next door
- Live demo of the MVAPICH2-DPU library at our booth: 6-7 pm today and 1-2 pm tomorrow

Thank You!
Donglai Dai
contactusx-
http:/x-