NVIDIA IN-NETWORK COMPUTING PROGRAMMING FOR GPU APPLICATIONS
GTC CHINA 2020
Qingchun Song, December 2020
#page#
AGENDA
- Remote Direct Memory Access (RDMA)
- GPUDirect RDMA (GDR)
- GPUDirect Storage (GDS)
- Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
- NCCL SHARP
#page#
REMOTE DIRECT MEMORY ACCESS (RDMA)
#page#
INFINIBAND'S LAYERED ARCHITECTURE
[Figure: the InfiniBand layer stack. Upper-level protocols (client IBA operations) sit on the Transport Layer (messages, QPs, SAR), above the Network Layer (inter-subnet routing, IPv6-style addressing), the Link Layer (packet relay, encoding, media access control, flow control) and the Physical Layer. End nodes implement all layers; switches operate at the Link Layer and routers at the Network Layer.]
#page#
RDMA DATA TRANSFER MODEL
- Queue Pair (QP): QPs come in pairs of Send and Receive queues; a local QP communicates with a remote QP.
- Work Request (WR): a work item that the hardware should perform, posted as a Work Queue Element (WQE).
- Work Completion: when a WR completes, it may generate a Work Completion (CQE) that provides information about the finished WR.
- Work Queue (WQ): a queue that contains WRs and is scheduled by the hardware; it can be either a Send or a Receive Queue.
- Completion Queue (CQ): the queue from which Work Completions are polled.
#page#
RDMA OPERATION: SEND
- The responder posts Receive Requests (before data is received).
- The requester posts a Send Request.
- Only the data is sent over the wire.
- An ACK is sent only in reliable transport types.
[Figure: the requester posts an SR and polls its CQ; the responder posts an RR, receives the data, and polls its CQ.]
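The post/poll flow above maps directly onto the libibverbs API. The sketch below is a minimal illustration (not from the slides) of how a requester might post a SEND work request and reap its completion; it assumes a QP that has already been created and connected, and a memory region mr registered over buf. The helper name post_send_and_wait is illustrative only.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Post one SEND WR on an already-connected QP and busy-poll its CQ.
     * 'mr' must cover 'buf'; error handling is kept minimal for brevity. */
    static int post_send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                                  struct ibv_mr *mr, void *buf, size_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,          /* local key from ibv_reg_mr() */
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,   /* two-sided: consumes a Receive Request */
            .send_flags = IBV_SEND_SIGNALED,
        };
        struct ibv_send_wr *bad_wr;
        if (ibv_post_send(qp, &wr, &bad_wr))
            return -1;

        struct ibv_wc wc;                 /* Work Completion (CQE) */
        int n;
        do {
            n = ibv_poll_cq(cq, 1, &wc);  /* poll until the WR completes */
        } while (n == 0);
        return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
    }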
#page#
RDMA OPERATION: RDMA WRITE
- The requester posts a Send Request.
- The data and the remote memory attributes (address + rkey) are sent.
- The responder is passive.
- Immediate data can be used to consume RRs at the responder side.
- An ACK is sent only in reliable transport types.
#page#
RDMA OPERATION: RDMA READ
- The requester posts a Send Request.
- The remote memory attributes (address + rkey) are sent.
- The responder is passive.
- The data is sent from the responder.
- Available only in reliable transport types.
#page#
RDMA OPERATION: ATOMIC
- The requester posts a Send Request.
- The data and the remote memory attributes (address + rkey) are sent.
- The responder is passive.
- The original data is sent back from the responder.
- A read-modify-write is performed in the responder's memory.
- Available only in reliable transport types.
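For the one-sided operations above (RDMA WRITE in particular), the remote address and rkey appear directly in the work request. The hedged sketch below is illustrative and not from the deck; post_rdma_write is a made-up helper, and the remote attributes are assumed to have been exchanged out of band.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Post a one-sided RDMA WRITE: the responder's CPU is not involved.
     * 'remote_addr' and 'rkey' were obtained out of band from the responder
     * (e.g. exchanged over a SEND/RECV or a socket during connection setup). */
    static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                               void *buf, size_t len,
                               uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,   /* one-sided operation */
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.rdma.remote_addr = remote_addr;  /* responder-side virtual address */
        wr.wr.rdma.rkey        = rkey;         /* remote key from the responder's ibv_reg_mr() */

        struct ibv_send_wr *bad_wr;
        return ibv_post_send(qp, &wr, &bad_wr) ? -1 : 0;
    }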
#page#
RDMA OPERATIONS SUMMARY
- Send/Receive: two-sided communication.
- RDMA Write / RDMA Read / Atomic: one-sided communication; the responder's CPU is not involved.
- Atomic operations: one-sided read-modify-write transactions.

                        UD       UC       RC
    Send / Receive      yes      yes      yes
    RDMA Write          no       yes      yes
    RDMA Read / Atomic  no       no       yes
    Max send size       MTU      2GB      2GB
    Reliability         no       no       yes
    Scalability         1        N        N     (QPs per process, for N processes)
#page#
GPU DIRECT RDMA (GDR)
#page#
WHAT'S GPUDIRECT RDMA
- GPUDirect RDMA is used as an API between IB_CORE and a GPU peer-memory client.
- The HCA/NIC reads/writes directly to/from GPU peer-memory data buffers.
- GPUDirect RDMA works seamlessly with InfiniBand and RoCE.
- nv_peer_mem is a kernel module that translates peer memory addresses:
  - Translates virtual addresses to GPU memory (the basic address translation for the peer-memory client).
- The application calls ibv_reg_mr to register the memory region; the NIC then does RDMA directly from/to GPU memory.
- GPUDirect Async: allows handling of ring-doorbell and poll-CQ operations from a GPU kernel.
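As a concrete illustration of the ibv_reg_mr path described above, here is a minimal sketch (not from the slides): register_gpu_buffer is an illustrative helper, and the protection domain and QP setup are assumed to exist already. With the nv_peer_mem (nvidia-peermem) module loaded, ibv_reg_mr accepts a cudaMalloc'ed pointer.

    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Register a GPU buffer for RDMA. With nv_peer_mem loaded, the NIC can
     * then DMA directly to/from GPU memory. 'pd' is an existing protection
     * domain from ibv_alloc_pd(). */
    static struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len,
                                              void **gpu_buf_out)
    {
        void *gpu_buf = NULL;
        if (cudaMalloc(&gpu_buf, len) != cudaSuccess)
            return NULL;

        struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            fprintf(stderr, "ibv_reg_mr on GPU memory failed (is nv_peer_mem loaded?)\n");
            cudaFree(gpu_buf);
            return NULL;
        }
        *gpu_buf_out = gpu_buf;   /* post WRs that reference this buffer and mr->lkey/rkey */
        return mr;
    }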
#page#
GPUDIRECT RDMA EVOLUTION
Without GPUDirect:
- Two copies: the GPU copies data from GPU internal memory to the GPU driver's system pinned memory (1), and user space then copies the data from the GPU driver's pinned memory into the RDMA system pinned memory (2); the RDMA device sends the data to the network.
GPUDirect:
- One copy: the GPU copies data from GPU internal memory to shared system pinned memory (1); the RDMA device sends the data to the network from that memory.
GPUDirect RDMA:
- The CPU is involved in the control path: WQE preparation, ringing the doorbell, and handling completions. The RDMA device sends data to the network directly from GPU memory and receives data from the network directly into GPU memory.
GPUDirect RDMA Async:
- The CPU is involved only in WQE preparation and releasing completed WQEs; the GPU rings the doorbell and handles completions for incoming packets. The data path is the same as GPUDirect RDMA.
#page#
GPUDIRECT RDMA
- GPU memory is exposed to the RDMA NIC.
- Direct data path from GPU to network; the data path is zero copy.
  - The RDMA device sends data to the network from GPU memory.
  - The RDMA device receives data from the network into GPU memory.
- The CPU rings the doorbell and handles completions for incoming packets to the GPU.
- The CPU still synchronizes between GPU tasks and data transfers: ~100% CPU utilization.

    while (!fin) {
        gpu_kernel<<<..., stream>>>(buf);
        cudaStreamSynchronize(stream);
        ibv_post_send(buf);
        ibv_poll_cq(cqe);
    }
#page#
GPUDIRECT RDMA ASYNC
- GPU memory is exposed to the RDMA NIC.
- Direct data path from GPU to network; the data path is zero copy.
  - The RDMA device sends data to the network from GPU memory.
  - The RDMA device receives data from the network into GPU memory.
- The CPU is involved only in WQE preparation and releasing completed WQEs; there is no CPU in the critical path.
- The GPU rings the doorbell and handles completions for incoming packets to the GPU.
- Reduced CPU utilization.

    while (!fin) {
        gpu_kernel<<<..., stream>>>(buf);
        gds_stream_queue_send(stream, qp, buf);
        gds_stream_wait_cq(stream, cqe);
    }
#page#
GPUDIRECT ASYNC ARCHITECTURE
- Kernel side: provides PeerDirect through nv_peer_mem (alongside the ofa-kernel and nvidia modules).
- User space: the application sits on libgdsync, CUDA, libibverbs and libmlx5 (the mlx5 provider).
- libibverbs: standard verbs are used to create resources (QP, CQ).
- libmlx5: direct verbs expose the underlying resources.
- libgdsync: uses the abstraction layers internally to support different NICs, and instructs the CUDA stream to peek at CQs directly.
#page#
10X HIGHER PERFORMANCE WITH GPUDIRECT RDMA
- Accelerates HPC and deep learning performance.
- Lowest communication latency for GPUs.
[Figure: MVAPICH over InfiniBand, GPU-to-GPU bi-directional throughput and latency, showing roughly a 10x improvement with GPUDirect RDMA.]
#page#
GPU DIRECT STORAGE (GDS)
#page#
WHAT IS GPUDIRECT STORAGE?
[Figure: without GPUDirect Storage, data moves from storage through CPU system memory before reaching the GPU; with GPUDirect Storage, the DMA path goes directly between storage and GPU memory over PCIe. Scaling up with GPUDirect Storage gives higher bandwidth and lower latency.]
#page#
MAGNUM IO
NVIDIA's multi-GPU, multi-node networking and storage IO optimization stack:
- Point-to-point: MPI, UCX, NVSHMEM.
- Collectives: MPI, NCCL.
- Storage: cuFile.
- Transports: NVLink/NVSwitch for GPUDirect P2P, InfiniBand/RoCE NICs for GPUDirect RDMA, and PCIe to storage for GPUDirect Storage.
#page#
USER-LEVEL CUFILE LIBRARY
libcufile.so
- cuFile driver APIs: cuFileDriverOpen, cuFileDriverClose, cuFileDriverGetProperties, cuFileDriverSet*.
- cuFile IO APIs: cuFileHandleRegister/Deregister, cuFileBufRegister/Deregister, cuFileRead, cuFileWrite.
- Stream subset of the cuFile APIs: cuFileReadAsync, cuFileWriteAsync.
- cuFile batch APIs: cuFileBatchIOSubmit, cuFileBatchIOGetStatus, cuFileBatchIOCancel, cuFileBatchIODestroy.
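To make the API list concrete, here is a minimal, hedged sketch (not from the slides) of a typical cuFile call sequence for a direct storage-to-GPU read; the function name, file path and sizes are illustrative.

    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <cufile.h>
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Read 'size' bytes from 'path' directly into GPU memory with GDS. */
    static int gds_read_example(const char *path, size_t size)
    {
        if (cuFileDriverOpen().err != CU_FILE_SUCCESS) return -1;

        int fd = open(path, O_RDONLY | O_DIRECT);     /* GDS expects O_DIRECT */
        if (fd < 0) return -1;

        CUfileDescr_t descr = {0};
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

        CUfileHandle_t fh;
        if (cuFileHandleRegister(&fh, &descr).err != CU_FILE_SUCCESS) return -1;

        void *gpu_buf = NULL;
        cudaMalloc(&gpu_buf, size);
        cuFileBufRegister(gpu_buf, size, 0);          /* pin/register the GPU buffer */

        ssize_t n = cuFileRead(fh, gpu_buf, size, /*file_offset=*/0, /*devPtr_offset=*/0);

        cuFileBufDeregister(gpu_buf);
        cuFileHandleDeregister(fh);
        close(fd);
        cuFileDriverClose();
        return n < 0 ? -1 : 0;
    }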
#page#
KERNEL-LEVEL DRIVER
nvidia-fs.ko
- Implements callbacks from modified Linux kernel modules or from proprietary filesystems that enable direct DMA to GPU memory.
- Licensed under GPLv2; likewise, any third-party kernel components that call the nvidia-fs APIs should expect to be subject to GPLv2.
- The APIs are used to complete the following tasks:
  - Check whether the DMA target address is on the GPU (nvfs_is_gpu_page) and needs to be handled specially.
  - nvfs_blk_rq_map_sg returns a nonzero size for a set of target addresses if they contain GPU addresses.
#page#
SCALABLE HIERARCHICAL AGGREGATION AND REDUCTION PROTOCOL (SHARP)
#page#
DATA AGGREGATION PROTOCOL
[Figure: a 3-level fat-tree network - the switch network above the compute nodes.]
#page#
TRADITIONAL DATA AGGREGATION
[Figures, phases 1-3: the aggregation is performed by the end nodes, and the data crosses the switch network repeatedly in each phase.]
#page#
SCALABLE HIERARCHICAL AGGREGATION AND REDUCTION PROTOCOL (SHARP)
- In-network, tree-based aggregation mechanism.
- Multiple simultaneous outstanding operations.
- For HPC (MPI / SHMEM) and distributed machine learning applications.
- Scalable, high-performance collective offload:
  - Barrier, Reduce, All-Reduce, Broadcast and more.
  - Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND.
  - Integer and floating-point, 16/32/64 bits.
[Figure: hosts send data up the SHARP tree; the switches aggregate the data and distribute the aggregated result back down.]
#page#
SHARP IN-NETWORK COMPUTING DATA AGGREGATION
[Figure: with SHARP, the aggregation happens inside the switch network in a single pass across the nodes.]
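The collectives SHARP offloads are the standard MPI ones. For instance, an MPI_Allreduce such as the hedged sketch below is what the switch tree executes on behalf of the hosts (illustrative code, not from the deck).

    #include <mpi.h>
    #include <stdio.h>

    /* Each rank contributes a local gradient; with SHARP enabled (e.g. through
     * HCOLL), the sum is computed inside the switch tree instead of on the hosts. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        float local_grad[4] = {1.0f * rank, 2.0f, 3.0f, 4.0f};
        float global_grad[4];

        /* All ranks end up with the element-wise sum of local_grad. */
        MPI_Allreduce(local_grad, global_grad, 4, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global_grad[0] = %f\n", global_grad[0]);

        MPI_Finalize();
        return 0;
    }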
#page#
DATA AGGREGATION - COMPARISON
- Traditional: high latency, a high amount of transferred data, CPU overhead.
- In-Network Computing: low latency, optimized data motion, no added CPU latency.
#page#
SHARP SOFTWARE ARCHITECTURE
- MPI (all major MPI libraries) and DL frameworks on top.
- HCOLL: optimized collective library.
- NCCL: optimized GPU collective library.
- SHARP (libsharp / libsharp_coll) over the InfiniBand network.
  - libsharp_coll.so: implementation of the high-level SHARP API; uses the low-level libsharp.so API; easy to integrate with multiple MPIs; integrated into NCCL.
  - libsharp.so: implementation of the low-level SHARP API.
#page#
SHARP SOFTWARE COMPONENTS
- Libs: public API (sharp.h), libsharp.so (low-level API), libsharp_coll.so (high-level API).
- Daemons:
  - sharpd: SHARP daemon on the compute nodes; a lightweight process (almost 0% CPU usage) that sits only in the control path.
  - sharp_am: Aggregation Manager daemon; runs on the same node as the Subnet Manager and acts as the resource manager.
- Scripts: sharp_benchmark.sh, sharp_daemons_setup.sh.
- Utilities: sharp_coll_dump_config, sharp_hello, sharp_mpi_test.
#page#
SHARP: CONFIGURING AGGREGATION MANAGER
- Using OpenSM 4.9 or later does not require any special configuration in the AM.
- Configure the AM with OpenSM v4.7-4.8:
  - Create a configuration directory for the future SHARP configuration file:
      % mkdir $HPCX_SHARP_DIR/conf
  - Create the root GUIDs file: copy the root_guids.conf file used for the Subnet Manager configuration to $HPCX_SHARP_DIR/conf/root_guid.cfg, or identify the root switches of the fabric and create a file with their node GUIDs. For example, if there are two root switches, the file contains:
      0x0002c90000000001
      0x0002c90000000008
  - Create the sharp_am.conf file:
      % cat > $HPCX_SHARP_DIR/conf/sharp_am.conf << EOF
      root_guids_file $HPCX_SHARP_DIR/conf/root_guid.cfg
      ib_port_guid <Port GUID of the relevant HCA port, or 0x0>
      EOF
#page#
SHARP: RUNNING SHARP DAEMONS
Set up the daemons with $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh:
    Usage: sharp_daemons_setup.sh [-p <SHARP location dir>] [-d <daemon>] [-m] (-s | -r)
      -s  Setup SHARP daemon
      -r  Remove SHARP daemon
      -p  Path to an alternative SHARP location dir
      -d  Daemon name (sharpd or sharp_am)
      -m  Add monitor capability for daemon control
sharp_am:
    % $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s $HPCX_SHARP_DIR -d sharp_am
    % service sharp_am start
    Log: /var/log/sharp_am.log
sharpd:
    Conf file $HPCX_SHARP_DIR/conf/sharpd.conf:
      ib_dev <device>
      sharpd_log_level 2
    % pdsh -w <hosts> $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s $HPCX_SHARP_DIR -d sharpd
    % pdsh -w jupiter[001-032] service sharpd start
    Log: /var/log/sharpd.log
#page#
HCOLL COLLECTIVES LIBRARY
- Deployed in production in the world's leading supercomputers.
- SHARP-enabled collectives on GPU and host memory (Allreduce, Reduce, and Barrier).
- Support for collectives over GPUDirect RDMA.
- Scatter-Reduce / Allgather reduction algorithm for large vectors.
- Reliable-Multicast-enabled collectives (VMC): Bcast, large and small.
- Support for FP16 reductions (requires Open MPI 5.x).
- Highly optimized shared-memory operations.
- UCX-accelerated collectives (Allreduce large vector, Alltoall, Alltoallv, Allgather, Allgatherv).
- Non-blocking MPI collectives (HCA-based collective offloads, CrossChannel).
- High-performance multithreaded implementation.
#page#
ENABLING HCOLL
- Open MPI shipped with MOFED/HPC-X enables HCOLL by default, and its priority is set to highest.
- To explicitly enable/disable HCOLL:
    $ mpirun -mca coll_hcoll_enable 1 -mca coll_hcoll_np 0 ./app.exe
  - Enable/disable HCOLL: -mca coll_hcoll_enable 0/1
  - -mca coll_hcoll_np <comm_size>: communicator size threshold to use HCOLL collectives.
- Selecting the HCA device: -x HCOLL_MAIN_IB=<dev>:<port>
#page#
HCOLL
HCOLL supported collectives:
- MPI_Barrier / MPI_Ibarrier (SHARP supported)
- MPI_Allreduce / MPI_Iallreduce (SHARP supported)
- MPI_Reduce / MPI_Ireduce
- MPI_Bcast / MPI_Ibcast
- MPI_Allgather / MPI_Iallgather
- MPI_Alltoall / MPI_Ialltoall
- MPI_Alltoallv / MPI_Ialltoallv
- For non-supported collectives, HCOLL falls back to the next highest-priority collective module (tuned/basic).
- Disable a specific collective in HCOLL: -x HCOLL_ML_DISABLE_<COLL>=1
  Example - disable Bcast: -x HCOLL_ML_DISABLE_BCAST=1
#page#
HCOLL
- HCOLL comes with its own set of communication primitive components, known as "BCOLs":
  - ucx_p2p (UCX), basesmuma (SHM), cc (Cross-Channel).
- And its own set of subgroup components, known as "SBGPs":
  - basesmsocket: socket subgroup.
  - basesmuma: UMA subgroup.
  - p2p: network subgroup (the SHARP group).
- Choosing BCOLs and SBGPs: the two common environment variables to set are HCOLL_SBGP and HCOLL_BCOL.
  - They have a 1:1 mapping and must have the same number of entries; SBGP specifies the subgroup level, and BCOL specifies the transport to use at that level.
#page#
HCOLL: TOPOLOGY AWARE SUB GROUPING (SBGP)
-x HCOLL_SBGP=basesmsocket,basesmuma,p2p
[Figure: MPI communicator rank layout across hosts 0..N, two sockets per host.
 "SOCKET" subgroups: the ranks sharing a socket.
 "UMA" subgroups: the socket-leader ranks within each host.
 "P2P" subgroup (the SHARP group): one leader rank per host.]
#page#
DGX A100
[Figure: DGX A100 topology - CPU0/CPU1, PCIe switches, the mlx5 NICs (NIC0-NIC7, mlx5_0..mlx5_9, plus storage NICs), and NVSwitch connecting the GPUs over NVLink.]
#page#
HCOLL: TOPOLOGY AWARE COLLECTIVES: ALLREDUCE
-x HCOLL_SBGP=basesmsocket,basesmuma,p2p
-x HCOLL_BCOL=basesmuma,basesmuma,ucx_p2p
- Reduce within each socket: basesmuma (shared memory).
- Reduce within each host: basesmuma (shared memory).
- Allreduce across hosts: ucx_p2p / SHARP.
- Bcast within each host: basesmuma (shared memory).
- Bcast within each socket: basesmuma (shared memory).
#page#
HCOLL: TOPOLOGY AWARE SUB GROUPING (SBGP)
-x HCOLL_SBGP=basesmsocket,p2p
[Figure: "SOCKET" subgroups contain the ranks of each socket; the "P2P" subgroup contains the socket leaders from all hosts.]
#page#
HCOLL: ENABLE SHARP
HCOLL_ENABLE_SHARP
- 0: Disable SHARP (default).
- 1: Probe SHARP availability and use it (recommended if SHARP is not a must).
- 2: Force the use of SHARP (recommended if SHARP is a must).
- 3: Force the use of SHARP for all MPI communicators (benchmarking).
- 4: Force the use of SHARP for all MPI communicators and for all supported collectives.
SHARP_COLL_LOG_LEVEL (SHARP debug log level)
- 0: fatal
- 1: error
- 2: warn (default)
- 3: info (minimal progress log; recommended for tracking whether SHARP is being used or not)
- 4: debug
- 5: trace
#page#
HCOLL: ENABLE SHARP
Disable SHARP:
    $ mpirun -map-by node -np 8 -x HCOLL_MAIN_IB=mlx5_1:1 -x HCOLL_ENABLE_SHARP=0 -x SHARP_COLL_LOG_LEVEL=3 ./osu_barrier
    # OSU MPI Barrier Latency Test v5.3.2
    # Avg Latency(us)   3.72
HCOLL_ENABLE_SHARP=1, setup status: correct:
    $ mpirun -map-by node -np 8 -x HCOLL_MAIN_IB=mlx5_1:1 -x HCOLL_ENABLE_SHARP=1 -x SHARP_COLL_LOG_LEVEL=3 ./osu_barrier
    nemo01:0:21559 - context.c:485 INFO job (ID: 327680001) resource request quota: (osts:0 user_data_per_ost:128 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
    nemo01:0:21559 - context.c:628 INFO tree_info: tree idx:0 quota: (osts:51 user_data_per_ost:128 max_groups:51 max_qps:1 max_group_channels:1)
    nemo01:0:21559 - comm.c:417 INFO group#:0 group id:7 tree idx:0 rail idx:0 group size:8 quota: (osts:2 user_data_per_ost:128) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x4c590e000007) mlid:c005
    # OSU MPI Barrier Latency Test v5.3.2
    # Avg Latency(us)   0.8
#page#
HCOLL: ENABLE SHARP (CONT.)
HCOLL_ENABLE_SHARP=1, setup status: incorrect (SHARPD is down):
    $ mpirun -map-by node -np 8 -x HCOLL_MAIN_IB=mlx5_1:1 -x HCOLL_ENABLE_SHARP=1 -x SHARP_COLL_LOG_LEVEL=3 ./osu_barrier -i 10000 1000
    nemo01:0:26916 - context.c:195 ERROR failed to open sharp session with SHARPD
    nemo01:26916:0 [common_sharp.c:288] comm_sharp_coll_init SHARP: Failed to initialize SHARP collectives: Cannot connect to SHARPD(-8) job ID: 342491137
    # OSU MPI Barrier Latency Test v5.3.2
    # Avg Latency(us)   3.73
HCOLL_ENABLE_SHARP=2 (force), setup status: incorrect (SHARPD is down):
    $ mpirun -map-by node -np 2 -x HCOLL_MAIN_IB=mlx5_1:1 -x HCOLL_ENABLE_SHARP=2 -x SHARP_COLL_LOG_LEVEL=3 ./osu_barrier -i 10000 1000
    nemo01:0:27007 - context.c:195 ERROR failed to open sharp session with SHARPD
    nemo01:27007:0 [common_sharp.c:288] comm_sharp_coll_init SHARP: Failed to initialize SHARP collectives: Cannot connect to SHARPD(-8) job ID: 353107969
    nemo01:27007:0 [common_sharp.c:292] comm_sharp_coll_init SHARP: Fallback is disabled, exiting
    nemo04:24392:3 [common_sharp.c:292] comm_sharp_coll_init SHARP: Fallback is disabled, exiting
    Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
#page#
SHARP MULTI-CHANNEL
Without multichannel:
- Only one rank per node participates in the network communicator.
With multichannel:
- Best used with the HCOLL subgrouping mechanism: -x HCOLL_SBGP=basesmsocket,p2p -x HCOLL_BCOL=basesmuma,ucx_p2p
- The socket leaders form the P2P subgroup.
- Useful with "full PPN" or with ranks on both sockets; avoids inter-socket communication.
- Best for small messages. For large messages, the shared-memory channel has less overhead than the network.
[Figure: "SOCKET" subgroups per socket on each host; the "P2P" subgroup is formed by the socket leaders.]
#page#
SHARP MULTI-CHANNEL, SINGLE PORT
[Figure: the socket NUMA-0 and NUMA-1 leader buffers on each host feed one SHARP tree through a single HCA port.]
#page#
SHARP MULTI-CHANNEL, TWO PORTS, SINGLE TREE
[Figure: the socket leaders on each host use different HCA ports (port 1 and port 2), but both ports feed the same SHARP tree.]
#page#
SHARP MULTI-RAIL
- Multiple SHARP trees, one tree per port: more resources.
- One SHARP group per tree; multiple SHARP groups per MPI communicator.
- Useful for large messages: stripe the message across both groups.
- Small messages (< OST payload size): sent on a single group, round-robin across groups.
- SHARP_COLL_GROUPS_PER_COMM (default: 1, max: number of trees).
#page#
MULTI-RAIL, 2 PORTS, 2 TREES, 2 GROUPS
[Figure: each host splits the message between port 1 and port 2; each port feeds its own SHARP tree (tree 1 and tree 2), i.e. its own SHARP group.]
#page#
NVIDIA DGX: COMPUTE RAILS (PLANES)
- NCCL creates logical "rings" among the same HCA_i on multiple DGX systems, effectively creating a sub-network (rail/plane) per HCA_i.
[Figure: rails 1..n across DGX 1..m; dotted lines show the ring traffic flow.]
#page#
MAPPING RAILS TO INFINIBAND TREES
[Figure: the per-HCA rails of the DGX systems are mapped onto the InfiniBand fat-tree so that each rail stays within its own part of the tree.]
#page#
SHARP V2 - STREAMING AGGREGATION
- 0B-2GB vector size aggregation.
- Up to 200G of aggregated data.
- Up to 2 concurrent aggregation flows, one per tree.
- Up to 64 transactions in flight per flow.
- Reliable multicast support.
- Result spreading using RC transport (UD is not allowed).
- Includes a single-target mode for optimized Reduce operations.
- The streaming aggregation tree is reserved with a "lock semaphore" operation; the lock can cover more than one aggregation operation, or be held indefinitely.
- The semaphore can be released by hardware when the credits are used up, or manually using an "unlock semaphore" operation.
#page#
ENABLE STREAMING AGGREGATION
SHARP_COLL_ENABLE_SAT: enables SAT (Streaming Aggregation Tree) capabilities.
    $ mpirun -map-by node -np 16 -x HCOLL_MAIN_IB=$HCA:1 -x HCOLL_ENABLE_SHARP=3 -x SHARP_COLL_ENABLE_SAT=1 sharp_coll_test -d $HCA -mode perf -collectives allreduce -s $size -i $iter
    clx-mld-061:0:49532 - context.c:577 INFO job (ID: 35748427966255972) resource request quota: (osts:0 user_data_per_ost:128 max_groups:0 max_qps:1 max_group_channels:1 num_trees:1)
    clx-mld-061:0:49532 - context.c:748 INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: (osts:230 user_data_per_ost:128 max_groups:230 max_qps:1 max_group_channels:1)
    clx-mld-061:0:49532 - context.c:752 INFO tree_info: type:SAT tree idx:1 treeID:0x3f caps:0x16
    clx-mld-061:0:49532 - comm.c:404 INFO group#:0 group id:9e tree idx:0 tree_type:LLT rail idx:0 group size:16 quota: (osts:2 user_data_per_ost:128) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
    clx-mld-061:0:49532 - comm.c:404 INFO group#:1 group id:9e tree idx:1 tree_type:SAT rail idx:0 group size:16 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
#page#
STREAMING AGGREGATION
- SHARP_COLL_SAT_THRESHOLD (default: 8096): message size threshold that triggers SAT (Streaming Aggregation Tree) mode.
- SHARP_COLL_SAT_LOCK_BATCH_SIZE (default: 0xffff (65535), i.e. infinity): SAT lock batch size.
- SHARP_COLL_SAT_LOCK_RETRY_COUNT (default: 1): SAT lock retry count.
- SHARP_COLL_SAT_LOCK_RETRY_TIMEOUT (default: 100 ms): SAT lock retry timeout in milliseconds.
#page#
NCCL SHARP
#page#
NCCL OVERVIEW
- NCCL: NVIDIA Collective Communication Library; enables multi-GPU computing.
- Data-parallel multi-GPU training; NCCL Allreduce aggregates gradients across GPUs.
- Used by DL frameworks (TensorFlow/Horovod, PyTorch, MXNet, Chainer, ...).
- NCCL 1.0: single node, ring.
- NCCL 2.0: rings across multiple nodes, RDMA.
- NCCL 2.4: hierarchical tree algorithm.
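For reference, the gradient aggregation mentioned above boils down to a single NCCL call per training step. The sketch below is illustrative and not from the deck; it assumes a communicator comm already created with ncclCommInitRank and an existing CUDA stream.

    #include <nccl.h>
    #include <cuda_runtime.h>

    /* Sum the local gradients of all ranks in place; with the NCCL SHARP
     * (collective network) plugin, the inter-node part of this reduction is
     * offloaded to the switch tree. */
    static ncclResult_t allreduce_gradients(float *d_grads, size_t count,
                                            ncclComm_t comm, cudaStream_t stream)
    {
        /* In-place allreduce: sendbuff == recvbuff is allowed by NCCL. */
        ncclResult_t rc = ncclAllReduce(d_grads, d_grads, count, ncclFloat,
                                        ncclSum, comm, stream);
        if (rc != ncclSuccess)
            return rc;

        /* Wait for the collective to finish before using the gradients. */
        return cudaStreamSynchronize(stream) == cudaSuccess ? ncclSuccess
                                                            : ncclUnhandledCudaError;
    }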
#page#
NCCL RING
- Simple, full bandwidth, linear latency.
[Figure: a single ring threaded through the GPUs of all nodes via their HCAs.]
#page#
NCCL TREE
- Keeps the intra-node chain; the node leaders participate in a binary double tree; multiple rings map to multiple trees.
[Figure: intra-node chains connected by an inter-node tree through the HCAs.]
#page#
NCCL SHARP
[Figure: the inter-node reduction is performed by the SHARP switches, which aggregate the data and return the aggregated result to the node leaders.]
#page#
NCCL SHARP
- Collective network plugin: replaces the inter-node tree with a SHARP tree and keeps the intra-node ring.
- Aggregation happens in the network switch.
- Streams from GPU memory with GPUDirect RDMA.
- 2.5x the bandwidth.
- Support for the SHARP plugin (Oct. 2019); support for the UCX plugin (Feb. 2020).
#page#
SHARP DELIVERS HIGHEST PERFORMANCE FOR AI
SHARP enables nearly 20% higher application performance.
- VAE benchmark (Facebook Variable Auto-Encoder): 24x DGX-1V + 4x Mellanox ConnectX-6; settings: Model=3, BS=512.
- GNMT MLPerf benchmark (Neural Machine Translation): 32x DGX-1V + 4x Mellanox ConnectX-6; GNMT MLPerf 0.6 settings: batch size=32, overlap=0.15.
[Figure: bar charts comparing SHARP vs. no-SHARP throughput for both benchmarks, with roughly an 18% gain.]
#page#
NATURAL LANGUAGE PROCESSING WITH SHARP
SHARP enables nearly 17% higher performance.
- NVIDIA BERT Megatron, PyTorch.
- Image classification and object detection were the mainstream workloads of the past few years.
- Natural language processing (NLP) significantly improves customer experience and reduces call-center staffing.
- Deployments with NLP are latency sensitive.
[Figure: bar chart, NCCL vs. NCCL + SHARP, showing about a 17% improvement.]
#page#
INFINIBAND SHARP AI PERFORMANCE ADVANTAGE
- 2.5x higher performance.
- 128 NVIDIA DGX A100 systems (1,024 GPUs, 1,024 InfiniBand adapters).
[Figure: performance vs. message size (B), with and without SHARP.]
SHARP DEMO
#page#
NVIDIA