DESIGNING IN-NETWORK COMPUTING AWARE REDUCTION COLLECTIVES IN MPI
2024 OFA Virtual Workshop
Bharath Ramesh and Dhabaleswar K. (DK) Panda
Network-Based Computing Laboratory, The Ohio State University
http://nowlab.cse.ohio-state.edu/

Outline
- Introduction
- Background
- Motivation
- Problem Statement and Contributions
- Design
  - Overview
  - Registration cache design
  - Proposed Allreduce design
- Results
- Conclusion and Future Work

Introduction: Drivers of Modern HPC Cluster Architectures
- Multi-core/many-core technologies
- Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand, RoCE, Slingshot)
- Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
- Accelerators (NVIDIA GPGPUs): high compute density, high performance/watt, 9.7 TFlop DP on a chip
- High-performance interconnects: InfiniBand
- Multi-/many-core processors; SSD, NVMe-SSD, NVRAM
- Example systems: Frontier, Summit, Lumi, Fugaku

MPI Reduction Collectives and In-network Computing
- Reduction collectives (such as MPI_Allreduce) are important for HPC and AI
  - They involve both compute and communication
  - Using CPUs everywhere leads to sub-optimal scale-up and scale-out efficiency
  - Motivates offloading common operations away from the CPU so the CPU can perform other useful work
- In-network computing allows operations to be offloaded to network devices
  - Switches are a good candidate due to their high bandwidth and their ability to reduce data on the fly, eliminating redundancy
  - High scale-out efficiency and network-topology awareness
  - Frees up CPU cycles for other operations
SHARP Reduction Trees and Streaming Aggregation (SAT)
- Aggregation tree with switch-level reduction (radix 16)
- Images taken from Graham, Richard et al., "Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ Streaming-Aggregation Hardware Design and Evaluation", DOI: 10.1007/978-3-030-50743-5_3

Hierarchical Design for Small Message MPI_Allreduce
The design proceeds in five phases (a sketch follows below):
- Phase 1: Intra-socket reduction
- Phase 2: Inter-socket reduction
- Phase 3: Inter-node allreduce, using SHARP for scale-out performance
- Phase 4: Inter-socket broadcast
- Phase 5: Intra-socket broadcast
[Figures: per-phase data movement across Socket 1 and Socket 2 on Node 1 and Node 2; the legend distinguishes socket-leader shared memory, intra-socket shared memory, writes to/reads and reductions from shared memory, intra-socket leader processes, and regular processes]
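To make the phase structure concrete, here is a minimal sketch assuming the socket-level, per-node socket-leader, and inter-node node-leader sub-communicators (socket_comm, node_comm, leader_comm — hypothetical names) have already been built. Plain MPI sub-communicator collectives stand in for the shared-memory transfers and the SHARP-backed inter-node step used in the actual design.

```c
#include <mpi.h>

/* Hedged sketch of the five-phase small-message allreduce.
 * socket_comm: all ranks sharing a socket
 * node_comm:   the socket leaders on a node
 * leader_comm: the node leaders across nodes (where SHARP is used) */
void hierarchical_allreduce_small(const void *sendbuf, void *recvbuf, int count,
                                  MPI_Datatype dt, MPI_Op op,
                                  MPI_Comm socket_comm, MPI_Comm node_comm,
                                  MPI_Comm leader_comm)
{
    int socket_rank, node_rank;
    MPI_Comm_rank(socket_comm, &socket_rank);

    /* Phase 1: intra-socket reduction to the socket leader */
    MPI_Reduce(sendbuf, recvbuf, count, dt, op, 0, socket_comm);

    if (socket_rank == 0) {
        MPI_Comm_rank(node_comm, &node_rank);

        /* Phase 2: inter-socket reduction to the node leader */
        MPI_Reduce(node_rank == 0 ? MPI_IN_PLACE : recvbuf, recvbuf,
                   count, dt, op, 0, node_comm);

        if (node_rank == 0)
            /* Phase 3: inter-node allreduce; SHARP is used here in the real design */
            MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, dt, op, leader_comm);

        /* Phase 4: inter-socket broadcast of the result */
        MPI_Bcast(recvbuf, count, dt, 0, node_comm);
    }

    /* Phase 5: intra-socket broadcast to all remaining processes */
    MPI_Bcast(recvbuf, count, dt, 0, socket_comm);
}
```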
Overview of the MVAPICH Project
- High-performance open-source MPI library
- Support for multiple interconnects: InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), AWS EFA, OPX, Broadcom RoCE, Intel Ethernet, Rockport Networks, Slingshot 10/11
- Support for multiple platforms: x86, OpenPOWER, ARM, Xeon Phi, GPGPUs (NVIDIA and AMD)
- Started in 2001; first open-source version demonstrated at SC '02
- Supports the latest MPI-3.1 standard
- http://mvapich.cse.ohio-state.edu
- Additional optimized versions for different systems/environments:
  - MVAPICH2-X (Advanced MPI + PGAS), since 2011
  - MVAPICH2-GDR with support for NVIDIA (since 2014) and AMD (since 2020) GPUs
  - MVAPICH2-MIC with support for Intel Xeon Phi, since 2014
  - MVAPICH2-Virt with virtualization support, since 2015
  - MVAPICH2-EA with support for energy awareness, since 2015
  - MVAPICH2-Azure for Azure HPC IB instances, since 2019
  - MVAPICH2-X-AWS for AWS HPC + EFA instances, since 2019
- Tools:
  - OSU MPI Micro-Benchmarks (OMB), since 2003
  - OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
- Used by more than 3,375 organizations in 91 countries
- More than 1.77 million downloads from the OSU site directly
- Empowering many TOP500 clusters (Nov '23 ranking):
  - 11th: 10,649,600-core Sunway TaihuLight at NSC, Wuxi, China
  - 29th: 448,448-core Frontera at TACC
  - 46th: 288,288-core Lassen at LLNL
  - 61st: 570,020-core Nurion in South Korea, and many others
- Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
- Partner in the 29th-ranked TACC Frontera system
- Empowering Top500 systems for more than 18 years

Results for Small MPI_Allreduce: Varying Message Sizes
- Scaling with message size, average latency
- Close to a flat curve across message sizes up to 2K
- Configurations: 1 ppn on 7,861 nodes; 16 ppn on 1,024 nodes
- Available in the MVAPICH 3.0 release
Results for Small MPI_Allreduce: Varying Node Counts
- Scaling with increasing node counts, 16 bytes, average latency
- Same trends as with Reduce (the implementations are almost identical except for the intra-node broadcast phases)
- Configurations: 1 ppn on 7,861 nodes; 16 ppn on 1,024 nodes
- More information in: B. Ramesh, K. Suresh, N. Sarkauskas, M. Bayatpour, J. Hashmi, H. Subramoni, and DK Panda, "Scalable MPI Collectives using SHARP: Large Scale Performance Evaluation on the TACC Frontera System", ExaMPI '20
Limitations of State-of-the-art Schemes for Large Message Reduction Collectives
- Two-copy reduction collectives with SHARP
  - Used leader-based schemes consisting of a reduction, followed by a SHARP operation, and finally a broadcast
  - Not suitable for large message sizes (≥ 128 K)
- Single-copy schemes are very efficient for large-message data movement
  - XPMEM allows a remote process to have load/store access through address-space mapping (see the sketch below)
- Using SHARP SAT in MPI has a few limitations and bottlenecks that need to be addressed to achieve good scale-out performance
- Motivates the need for large-message reduction designs that combine the advantages of SHARP with those of single-copy schemes like XPMEM
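As a rough illustration of the single-copy path, the sketch below shows how an XPMEM mapping is typically established. The function names and the way the segment id reaches the peer are assumptions (the actual design exchanges this information through shared memory); once attached, the peer's buffer can be read and written with plain loads and stores, so a reduction needs no intermediate copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <xpmem.h>

/* Exporting side: make this process's address range sharable once at startup.
 * The returned segment id must be communicated to the peers out of band
 * (e.g. via a shared-memory region). */
xpmem_segid_t export_address_space(void)
{
    /* Share the whole address space starting at address 0. */
    return xpmem_make((void *)0, XPMEM_MAXADDR_SIZE, XPMEM_PERMIT_MODE, (void *)0666);
}

/* Importing side: map `len` bytes of the peer's buffer (located at virtual
 * address `peer_vaddr` in the peer's address space) into our own address
 * space. In practice the offset and length should be page-aligned. */
void *attach_peer_buffer(xpmem_segid_t segid, void *peer_vaddr, size_t len)
{
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, NULL);
    struct xpmem_addr addr = { .apid = apid, .offset = (off_t)(uintptr_t)peer_vaddr };

    /* The returned pointer supports direct load/store access to the peer's
     * memory; it is later torn down with xpmem_detach()/xpmem_release(). */
    return xpmem_attach(addr, len, NULL);
}
```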
Motivation
[Figure: Allreduce runtime registration overhead, latency (us) vs. message size (1M-32M bytes), comparing SHARP-allreduce-without-registration and SHARP-registration]
- SHARP SAT provides excellent bandwidth with close to point-to-point latency
- Registration involves pinning pages in memory (like InfiniBand registration)
  - The overhead increases significantly as the message size grows
  - This requires a cache that avoids expensive calls to sharp_coll_reg_mr
- Switch resources are limited
  - This causes bottlenecks when scaling up on modern CPUs with hundreds of cores
  - The SHARP runtime places limits to manage these resources
- Motivates the need for designs that are aware of SHARP runtime capabilities, overcome these bottlenecks, and scale up efficiently for many processes per node
[Figure: Comparison of SHARP protocols on 8 nodes, latency (us) vs. message size (16K-1M bytes), SHARP-LLT vs. SHARP-SAT]
Problem Statement and Contributions
- Problem statement: Can we propose an algorithm for large-message Allreduce that overcomes bottlenecks and resource constraints in the SHARP runtime by making efficient use of node- and network-level resources?
- Contributions:
  - Identify the registration overheads involved in using SHARP streaming aggregation for large messages and propose solutions to address them
  - Analyze the impact of chunking reductions when using streaming aggregation for different message sizes, to empirically determine how to overlap intra-node reductions with SHARP-based reductions
  - Propose an algorithm for large Allreduce that utilizes SAT and CPUs efficiently
  - Evaluate the proposed design against state-of-the-art MPI libraries
Proposed Design Overview
- Use a registration cache to amortize registration costs in the SHARP runtime
- Designate a "leader" process on each node to interact with SHARP
- Chunk the buffer into PPN (processes per node) chunks and reduce into a single buffer belonging to the leader process
  - Uses XPMEM for load/store access
  - All processes perform local reductions, but only the leader process calls the SHARP runtime
- Once local reductions are complete, the leader calls a non-blocking MPI_Allreduce
  - Perfect overlap of intra-node and inter-node steps
  - Local reduction happens in batches for good network utilization
- The final result is broadcast within the node
Registration Cache Design
[Figure: Impact of registration cache designs, latency (us) vs. message size (16K-8M bytes), SAT-with-registration-cache vs. SAT-without-registration-cache]
- Use an AVL tree (or similar) to store buffer addresses, giving O(log n) insertion/query time
- If an entry for the buffer already exists, the registration information is taken directly from the cache (see the sketch below)
- Up to 5.6X reduction in latency
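A minimal sketch of such a cache, using the POSIX tsearch()/tfind() tree from <search.h> as a stand-in for the AVL tree. register_with_sharp() is a hypothetical wrapper around the sharp_coll_reg_mr call named above, and only exact (address, length) matches are handled; invalidation when buffers are freed is omitted for brevity.

```c
#include <search.h>
#include <stdint.h>
#include <stdlib.h>

struct reg_entry {
    void  *addr;   /* start of the user buffer */
    size_t len;    /* registered length */
    void  *mr;     /* opaque SHARP memory-registration handle */
};

/* Hypothetical wrapper; the real code would call sharp_coll_reg_mr() here. */
void *register_with_sharp(void *buf, size_t len);

static void *cache_root = NULL;   /* root of the search tree */

static int reg_cmp(const void *a, const void *b)
{
    const struct reg_entry *x = a, *y = b;
    if (x->addr != y->addr)
        return ((uintptr_t)x->addr < (uintptr_t)y->addr) ? -1 : 1;
    if (x->len != y->len)
        return (x->len < y->len) ? -1 : 1;
    return 0;
}

/* Return a registration handle for (buf, len), registering only on a cache miss. */
void *get_registration(void *buf, size_t len)
{
    struct reg_entry key = { buf, len, NULL };

    void *node = tfind(&key, &cache_root, reg_cmp);   /* O(log n) lookup */
    if (node)
        return (*(struct reg_entry **)node)->mr;      /* cache hit: reuse registration */

    struct reg_entry *e = malloc(sizeof *e);          /* cache miss: register once */
    e->addr = buf;
    e->len  = len;
    e->mr   = register_with_sharp(buf, len);
    tsearch(e, &cache_root, reg_cmp);                 /* O(log n) insert */
    return e->mr;
}
```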
Analyzing the Impact of Chunking Iallreduce Operations
[Figure: Iallreduce + waitall latency using SHARP SAT, latency (us) vs. message size (16384-1048576 bytes), for 1, 2, 4, 8, 16, and 32 chunks]
- Measures the impact of sending a message with one call to the SHARP library vs. multiple calls (a benchmark sketch follows below)
- Given a message size M and a number of chunks C, call the non-blocking SHARP allreduce C times (each of size M/C), followed by a waitall
- This is an indirect measure of overlap at the network level
- Splitting into chunks of size ≥ 16384 gives the same latency, independent of the number of chunks
- These chunks can therefore be overlapped with reductions within the node
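A sketch of this measurement, with MPI_Iallreduce standing in for the non-blocking SHARP SAT allreduce used in the actual experiment. The function and buffer names are illustrative, and M is assumed to be divisible by C.

```c
#include <mpi.h>
#include <stdlib.h>

/* Reduce M floats either as one non-blocking allreduce (C = 1) or as C calls
 * of M/C elements each, completed by a single waitall; returns the latency. */
double timed_chunked_allreduce(const float *sendbuf, float *recvbuf,
                               int M, int C, MPI_Comm comm)
{
    MPI_Request *reqs = malloc(C * sizeof(MPI_Request));
    int chunk = M / C;

    double t0 = MPI_Wtime();
    for (int i = 0; i < C; i++)
        MPI_Iallreduce(sendbuf + (size_t)i * chunk, recvbuf + (size_t)i * chunk,
                       chunk, MPI_FLOAT, MPI_SUM, comm, &reqs[i]);
    MPI_Waitall(C, reqs, MPI_STATUSES_IGNORE);
    double t1 = MPI_Wtime();

    free(reqs);
    return t1 - t0;   /* latency for this message size / chunk count */
}
```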
Proposed Allreduce Design
[Figures: per-node buffers (sendbuf, tmpbuf, recvbuf) for processes P0/P1 on node N0 and P3/P4 on node N1, showing the initial state, the reduce-to-leader step, and the initiation of the non-blocking SHARP-based inter-node allreduce]
- The first process on each node is designated as the leader
- Before the reduction, buffer information is exchanged using shared memory (for XPMEM loads/stores)
- Process i reduces the i-th chunk from every process and stores it in the leader's tmpbuf (see the sketch below)
- At the end of this step, the leader on every node holds the reduced result for the current phase
- The leader process then initiates a non-blocking inter-node SHARP allreduce
- "Request" objects are used to track the progress of the SHARP allreduce operations
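A minimal sketch of this reduce-to-leader step for one range of the message. peer_sendbufs[] holds XPMEM-attached pointers to all PPN send buffers on the node (including this process's own) and leader_tmpbuf is an XPMEM-attached pointer to the leader's temporary buffer; both names are hypothetical, and a float/sum reduction is assumed for brevity.

```c
#include <stddef.h>

/* Node-local rank `node_rank` owns the node_rank-th slice of [off, off+len)
 * and reduces that slice from every peer's send buffer into the leader's
 * tmpbuf, using the XPMEM mappings established earlier. */
void reduce_chunk_to_leader(float *const *peer_sendbufs, float *leader_tmpbuf,
                            size_t off, size_t len, int ppn, int node_rank)
{
    size_t slice = len / ppn;                        /* assume len divisible by ppn */
    size_t base  = off + (size_t)node_rank * slice;  /* this rank's slice */

    for (size_t j = 0; j < slice; j++) {
        float acc = peer_sendbufs[0][base + j];      /* start from process 0's data */
        for (int p = 1; p < ppn; p++)
            acc += peer_sendbufs[p][base + j];       /* accumulate the other peers */
        leader_tmpbuf[base + j] = acc;
    }
}
```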
Proposed Allreduce Design (Continued)
- For large buffers, the intra-node reduction and inter-node phases are run multiple times (sketched below)
  - Reducing a large buffer is time consuming, so it is done in multiple phases for good network utilization
  - The chunk size is tuned to achieve perfect overlap of intra-node and inter-node operations
- The leader waits for the non-blocking allreduces to complete after all rounds of the first two phases are done
- An intra-node broadcast then delivers the final result
[Figures: buffer states after the waitall and after the broadcast]
- More information in: B. Ramesh, G. Kuncham, K. Suresh, R. Vaidya, N. Alnaasan, M. Abduljabbar, A. Shafi, D. Panda, "Designing In-network Computing Aware Reduction Collectives in MPI", Hot Interconnects 2023, Aug 2023
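Putting the two design slides together, the sketch below shows the overall pipeline under the same assumptions as the previous sketches: MPI_Iallreduce on a node-leader communicator stands in for the non-blocking SHARP SAT call, an intra-node barrier stands in for the shared-memory synchronization of the real design, and reduce_chunk_to_leader is the helper from the earlier sketch.

```c
#include <mpi.h>
#include <stdlib.h>

/* From the previous sketch: each rank reduces its slice of [off, off+len)
 * into the leader's XPMEM-mapped tmpbuf. */
void reduce_chunk_to_leader(float *const *peer_sendbufs, float *leader_tmpbuf,
                            size_t off, size_t len, int ppn, int node_rank);

/* Hedged sketch of the leader-based, pipelined large-message allreduce:
 * the message is processed in rounds of `batch` elements; each round does an
 * intra-node chunked reduction and the node leader immediately posts a
 * non-blocking inter-node allreduce for that batch, keeping one request per
 * round. A single waitall and an intra-node broadcast finish the operation. */
void leader_based_large_allreduce(float *const *peer_sendbufs, float *leader_tmpbuf,
                                  float *recvbuf, size_t count, size_t batch,
                                  int ppn, int node_rank,
                                  MPI_Comm node_comm, MPI_Comm leader_comm)
{
    size_t nrounds = (count + batch - 1) / batch;
    MPI_Request *reqs = malloc(nrounds * sizeof(MPI_Request));
    size_t r = 0;

    for (size_t off = 0; off < count; off += batch, r++) {
        size_t len = (off + batch <= count) ? batch : count - off;

        /* Phase 1: intra-node reduction of this batch into the leader's tmpbuf */
        reduce_chunk_to_leader(peer_sendbufs, leader_tmpbuf, off, len, ppn, node_rank);
        MPI_Barrier(node_comm);   /* ensure the whole batch has been reduced */

        /* Phase 2: the leader posts the non-blocking inter-node allreduce for
         * this batch and moves on, so the next intra-node reduction can run
         * while the network operation progresses */
        if (node_rank == 0)
            MPI_Iallreduce(leader_tmpbuf + off, recvbuf + off, (int)len,
                           MPI_FLOAT, MPI_SUM, leader_comm, &reqs[r]);
    }

    /* Phase 3: the leader waits for all outstanding inter-node allreduces */
    if (node_rank == 0)
        MPI_Waitall((int)nrounds, reqs, MPI_STATUSES_IGNORE);

    /* Phase 4: broadcast the final result within the node */
    MPI_Bcast(recvbuf, (int)count, MPI_FLOAT, 0, node_comm);
    free(reqs);
}
```

How much the posted operations actually progress in the background depends on asynchronous progress in the MPI/SHARP runtime; the point of the sketch is the structure, one request per batch, a single waitall, and a final intra-node broadcast, matching the slides above.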
Experimental Setup

                    MRI                                   HPCAC
Processor model     AMD EPYC 7713                         Intel Xeon Gold 6138
Max clock speed     3.72 GHz                              2 GHz
Number of sockets   2                                     2
Cores per socket    64                                    20
RAM                 256 GB                                196 GB
Interconnect        NVIDIA HDR-200 with Quantum-2 switches  NVIDIA HDR-200 with Quantum-2 switches
MPI libraries       MVAPICH2-X, HPC-X                     MVAPICH2-X, HPC-X

Results for Large MPI_Allreduce: 2 Nodes
- Increased parallelism by using multiple processes and SHARP for the reduction
- Up to 81.43% improvement over the state of the art for 32 PPN and 86.4% for 64 PPN on MRI
- Up to 33.67% over the state of the art for 32 PPN and 60% for 64 PPN on HPCAC
- An increased number of page faults reduces the benefits at 1M (needs further investigation)
[Figures: MRI-32PPN, MRI-64PPN, HPCAC-32PPN, HPCAC-64PPN]
Results for Large MPI_Allreduce: 4 Nodes
- Increased parallelism by using multiple processes and SHARP for the reduction
- Up to 83.05% improvement over the state of the art for 32 PPN and 88.52% for 64 PPN on MRI
- Up to 32.62% over the state of the art for 32 PPN and 46.91% for 64 PPN on HPCAC
[Figures: MRI-32PPN, MRI-64PPN, HPCAC-32PPN, HPCAC-64PPN]

Results for Large MPI_Allreduce: 8 Nodes
- Increased parallelism by using multiple processes and SHARP for the reduction
- Up to 79.44% improvement over the state of the art for 32 PPN and 78.36% for 64 PPN on MRI
- Up to 58.08% over the state of the art for 32 PPN and 52.13% for 64 PPN on HPCAC
[Figures: MRI-32PPN, MRI-64PPN, HPCAC-32PPN, HPCAC-64PPN]
Conclusion and Future Work
- The SHARP runtime enables in-network offload with excellent bandwidth utilization
- The proposed designs overcome various bottlenecks by using a leader-based algorithm and streaming aggregation for large-message reductions
- They outperform the state of the art by up to 86%
- The designs will be available in a future release of MVAPICH-Plus
- Future work:
  - Comprehensive application evaluation
  - Evaluating performance at larger scales
  - Exploring NUMA awareness

THANK YOU!

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/