In-Network Computing Technology for Accelerating GPU Applications (網絡計算技術加速 GPU 應用.pdf)


NVIDIA IN-NETWORK COMPUTING PROGRAMMING FOR GPU APPLICATIONS
GTC CHINA 2020
Qingchun Song, December 2020

AGENDA
- Remote Direct Memory Access (RDMA)
- GPUDirect RDMA (GDR)
- GPUDirect Storage (GDS)
- Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
- NCCL SHARP

REMOTE DIRECT MEMORY ACCESS (RDMA)

INFINIBAND'S LAYERED ARCHITECTURE
(Diagram: upper-level protocols and client IBA operations; transport layer with messages, QPs and SAR; network layer with inter-subnet routing and IPv6-style addressing; link layer with packet encoding, media access control and flow control; physical layer with media and signaling. End nodes implement all layers, switches operate at the link layer, routers at the network layer.)

RDMA DATA TRANSFER MODEL
- Queue Pair (QP): QPs come in pairs (Send Queue / Receive Queue), connecting a local QP to a remote QP.
- Work Request (WR): a work item that the hardware should perform.
- Work Completion: when a WR is completed, it may create a Work Completion which provides information about the ended WR.
- Work Queue (WQ): a queue which contains WRs, scheduled by the hardware; can be either a Send or a Receive Queue.
- Completion Queue (CQ): a queue of completion entries (CQEs) that the application polls.

RDMA OPERATION: SEND
- The responder posts Receive Requests (before the data is received).
- The requester posts a Send Request.
- Only the data is sent over the wire.
- An ACK is sent only in reliable transport types.
- Both sides poll their CQ for completion.
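In Verbs terms, the SEND flow above looks roughly like the sketch below (plain C; it assumes a QP that has already been created and connected, and a buffer already registered with ibv_reg_mr; resource setup and error handling are omitted). The responder symmetrically calls ibv_post_recv() before the data arrives.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Sketch only: 'qp' is already connected (RTR/RTS) and 'mr' was obtained
     * from ibv_reg_mr() on 'buf'. */
    static int post_send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                                  struct ibv_mr *mr, void *buf, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,        /* local buffer */
            .length = len,
            .lkey   = mr->lkey,              /* local key from registration */
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,       /* two-sided SEND */
            .send_flags = IBV_SEND_SIGNALED, /* ask for a Work Completion */
        };
        struct ibv_send_wr *bad_wr = NULL;

        if (ibv_post_send(qp, &wr, &bad_wr)) /* Work Request -> Send Queue */
            return -1;

        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0) /* wait for the Work Completion */
            ;
        return wc.status == IBV_WC_SUCCESS ? 0 : -1;
    }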

RDMA OPERATION: RDMA WRITE
- The requester posts a Send Request.
- The data and the remote memory attributes (address + rkey) are sent.
- The responder is passive.
- Immediate data can be used to consume Receive Requests at the responder side.
- An ACK is sent only in reliable transport types.

RDMA OPERATION: RDMA READ
- The requester posts a Send Request.
- The remote memory attributes (address + rkey) are sent.
- The responder is passive; the data is sent back from the responder.
- Available only in reliable transport types.

RDMA OPERATION: ATOMIC
- The requester posts a Send Request.
- The data and the remote memory attributes (address + rkey) are sent.
- The responder is passive; the original data is sent back from the responder.
- The read-modify-write is performed in the responder's memory.
- Available only in reliable transport types.

RDMA OPERATIONS SUMMARY
- Send / Receive: two-sided communication.
- RDMA Write: one-sided communication, the responder is not involved.
- RDMA Read / Atomic: one-sided communication, the responder is not involved; Atomic is a read-modify-write transaction.

Transport capabilities:
                       UD      UC      RC
    Send / Receive     yes     yes     yes
    RDMA Write         no      yes     yes
    RDMA Read          no      no      yes
    Atomic operations  no      no      yes
    Max send size      MTU     2GB     2GB
    Reliability        no      no      yes
    Scalability        1 QP    N QPs   N QPs   (per process, for N processes)
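The one-sided operations carry the remote memory attributes in the Work Request itself; a minimal sketch, assuming the peer's address and rkey were exchanged out of band:

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Sketch only: 'raddr' and 'rkey' are the remote memory attributes the
     * peer advertised out of band; RDMA Read/Atomic additionally require an
     * RC QP. */
    static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                               void *buf, uint32_t len,
                               uint64_t raddr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,   /* one-sided: responder CPU not involved */
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.rdma.remote_addr = raddr;        /* remote memory attributes */
        wr.wr.rdma.rkey        = rkey;

        struct ibv_send_wr *bad_wr = NULL;
        return ibv_post_send(qp, &wr, &bad_wr);
    }

For an Atomic Fetch-and-Add the opcode would be IBV_WR_ATOMIC_FETCH_AND_ADD and the remote attributes would go into wr.wr.atomic instead.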

GPU DIRECT RDMA (GDR)

WHAT'S GPUDIRECT RDMA
- GPUDirect RDMA is used as an API between ib_core and a GPU peer-memory client.
- The HCA/NIC reads and writes directly to/from GPU peer-memory data buffers.
- GPUDirect RDMA works seamlessly with both InfiniBand and RoCE.
- nv_peer_mem is a kernel module that translates peer memory addresses: it hooks the basic memory address translation into ib_core and translates virtual addresses to GPU memory.
- The application calls ibv_reg_mr to register the memory region; the NIC then does RDMA directly from/to GPU memory.
- GPUDirect Async allows handling of the ring-doorbell and CQ-polling steps from a GPU kernel.
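A minimal sketch of the registration step described above, assuming the nv_peer_mem module is loaded; the protection domain comes from the usual Verbs setup:

    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    /* Sketch only: with nv_peer_mem loaded, ibv_reg_mr() can register a
     * cudaMalloc'ed buffer just like host memory; the NIC then DMAs straight
     * to/from GPU memory. 'pd' is an already-allocated protection domain. */
    static struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t bytes,
                                              void **gpu_buf_out)
    {
        void *gpu_buf = NULL;
        if (cudaMalloc(&gpu_buf, bytes) != cudaSuccess)   /* GPU memory */
            return NULL;

        struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, bytes,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE |
                                       IBV_ACCESS_REMOTE_READ);
        if (!mr) {
            cudaFree(gpu_buf);
            return NULL;
        }
        *gpu_buf_out = gpu_buf;   /* use mr->lkey / mr->rkey in Work Requests */
        return mr;
    }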

GPUDIRECT RDMA EVOLUTION
- Without GPUDirect: two copies. The CPU is involved in WQE preparation and in releasing completed WQEs. The GPU copies data from GPU internal memory to the GPU driver's system pinned memory (1), user space copies the data from there into the RDMA system pinned memory (2), and the RDMA device sends the data to the network.
- GPUDirect: one copy. The GPU copies data from GPU internal memory to shared system pinned memory (1), and the RDMA device sends the data to the network directly from that pinned memory.
- GPUDirect RDMA: zero copy. The CPU is involved only in the control path (WQE preparation, ring doorbell, handling completions); the RDMA device sends data to the network from GPU memory and receives data from the network into GPU memory.
- GPUDirect RDMA Async: the GPU rings the doorbell and handles completions for incoming packets; the data path is the same as GDR.

GPUDIRECT RDMA
- GPU memory is exposed to the RDMA NIC.
- Direct data path from the GPU to the network; the data path is zero copy.
- The RDMA device sends data to the network from GPU memory and receives data from the network into GPU memory.
- The CPU still synchronizes between GPU tasks and data transfers:
      while (!done) {
          gpu_kernel<<<..., stream>>>(buf);
          cudaStreamSynchronize(stream);
          ibv_post_send(buf);
          ibv_poll_cq(cqe);
      }
- This keeps roughly 100% utilization of the CPU core driving the communication.
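A slightly more complete version of that loop, as a sketch only: launch_produce() is a hypothetical wrapper around a CUDA kernel launch, and post_send_and_wait() is the helper sketched in the RDMA section above.

    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    /* Hypothetical kernel launcher (e.g. compiled separately with nvcc). */
    extern void launch_produce(float *buf, int n, cudaStream_t stream);
    /* Helper sketched earlier: posts a SEND WQE and polls the CQ. */
    extern int post_send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                                  struct ibv_mr *mr, void *buf, uint32_t len);

    /* Synchronous GPUDirect RDMA loop: the CPU waits for the GPU, then
     * drives the NIC. 'buf' is GPU memory registered via nv_peer_mem. */
    void gdr_loop(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
                  float *buf, int n, cudaStream_t stream, volatile int *done)
    {
        while (!*done) {
            launch_produce(buf, n, stream);
            cudaStreamSynchronize(stream);          /* CPU waits for the GPU  */
            post_send_and_wait(qp, cq, mr, buf,     /* CPU posts the WQE,     */
                               n * sizeof(float));  /* rings the doorbell,    */
        }                                           /* and polls the CQ       */
    }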

GPUDIRECT RDMA ASYNC
- GPU memory is exposed to the RDMA NIC.
- Direct data path from the GPU to the network; the data path is zero copy.
- The CPU is involved only in WQE preparation and in releasing completed WQEs; there is no CPU in the critical path.
- The GPU rings the doorbell and handles completions for incoming packets:
      while (!done) {
          gpu_kernel<<<..., stream>>>(buf);
          gds_stream_queue_send(stream, qp, buf);
          gds_stream_wait_cq(stream, cqe);
      }
- The RDMA device sends data to the network from GPU memory and receives data from the network into GPU memory.
- Reduced CPU utilization.

GPUDIRECT ASYNC ARCHITECTURE
- Kernel side: PeerDirect is provided through nv_peer_mem (NVIDIA driver plus the OFA kernel stack).
- User space: the application uses libgdsync together with libibverbs, the libmlx5 provider and CUDA.
- Standard verbs are used to create the resources (QP, CQ).
- Direct verbs are used to expose the underlying resources.
- libgdsync uses abstraction layers internally to support different NICs.
- libgdsync instructs the CUDA stream to peek the CQs directly.

10X HIGHER PERFORMANCE WITH GPUDIRECT RDMA
- Accelerates HPC and Deep Learning performance; lowest communication latency for GPUs.
- (Chart: MVAPICH2 over InfiniBand, GPU bi-directional throughput, roughly a 10X improvement with GPUDirect RDMA.)

GPU DIRECT STORAGE (GDS)

WHAT IS GPUDIRECT STORAGE?
- (Diagram: without GPUDirect Storage, data moves from storage through a bounce buffer in CPU system memory before reaching the GPU; with GPUDirect Storage, the DMA goes over PCIe directly between storage and GPU memory.)
- Scaling up with GPUDirect Storage: higher bandwidth, lower latency.

MAGNUM IO
NVIDIA's multi-GPU, multi-node networking and storage IO optimization stack:
- Point-to-point: MPI, UCX.
- Collectives: MPI, NCCL, NVSHMEM.
- Storage: cuFile.
- GPU to GPU: GPUDirect P2P over NVLink/NVSwitch and PCIe.
- GPU to NIC: GPUDirect RDMA over InfiniBand/RoCE.
- GPU to storage: GPUDirect Storage over PCIe.

USER-LEVEL CUFILE LIBRARY
libcufile.so
- cuFile driver APIs: cuFileDriverOpen, cuFileDriverClose, cuFileDriverGetProperties, cuFileDriverSet*.
- cuFile IO APIs: cuFileHandleRegister/Deregister, cuFileBufRegister/Deregister, cuFileRead/Write.
- Stream subset of the cuFile APIs: cuFileReadAsync, cuFileWriteAsync.
- cuFile batch APIs: cuFileBatchIO Submit, GetStatus, Cancel, Destroy.
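A minimal sketch of that flow, assuming a filesystem supported by nvidia-fs; the path and transfer size are placeholders:

    #define _GNU_SOURCE
    #include <cufile.h>
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    /* Sketch of the basic cuFile flow: open the driver, register a file
     * handle and a GPU buffer, then read file data straight into GPU memory. */
    int read_file_to_gpu(void)
    {
        const size_t len = 16 << 20;          /* placeholder: 16 MiB */
        void *gpu_buf = NULL;

        if (cuFileDriverOpen().err != CU_FILE_SUCCESS) return -1;

        int fd = open("/path/to/data", O_RDONLY | O_DIRECT);
        if (fd < 0) return -1;

        CUfileDescr_t descr = {0};
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

        CUfileHandle_t fh;
        if (cuFileHandleRegister(&fh, &descr).err != CU_FILE_SUCCESS) return -1;

        cudaMalloc(&gpu_buf, len);
        cuFileBufRegister(gpu_buf, len, 0);   /* register the GPU buffer */

        /* DMA from storage directly into GPU memory, no CPU bounce buffer. */
        ssize_t n = cuFileRead(fh, gpu_buf, len, 0 /*file off*/, 0 /*buf off*/);
        printf("read %zd bytes into GPU memory\n", n);

        cuFileBufDeregister(gpu_buf);
        cuFileHandleDeregister(fh);
        cudaFree(gpu_buf);
        close(fd);
        cuFileDriverClose();
        return n < 0 ? -1 : 0;
    }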

KERNEL-LEVEL DRIVER
nvidia-fs.ko
- Implements callbacks from modified Linux kernel modules or from proprietary filesystems that enable direct DMA to GPU memory.
- Licensed under GPLv2; likewise, any third-party kernel components that call the nvidia-fs APIs should expect to be subject to GPLv2.
- The APIs are used to complete the following tasks:
  - Check whether the DMA target address is on the GPU (nvfs_is_gpu_page) and needs to be handled specially.
  - nvfs_blk_rq_map_sg returns a nonzero size for a set of target addresses if they contain GPU addresses.

SCALABLE HIERARCHICAL AGGREGATION AND REDUCTION PROTOCOL (SHARP)

DATA AGGREGATION PROTOCOL
- (Diagram: a 3-level fat-tree switch network with the compute nodes at the leaves.)

TRADITIONAL DATA AGGREGATION
- (Diagrams, phases 1-3: the reduction is performed by the nodes themselves, with the data crossing the switch network in several phases.)

SCALABLE HIERARCHICAL AGGREGATION AND REDUCTION PROTOCOL (SHARP)
- In-network, tree-based aggregation mechanism: the switches aggregate the data and return the aggregated result.
- Multiple simultaneous outstanding operations.
- For HPC (MPI / SHMEM) and distributed machine learning applications.
- Scalable, high-performance collective offload: Barrier, Reduce, All-Reduce, Broadcast and more.
- Reduction operations: Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND.
- Integer and floating-point data, 16/32/64 bits.

SHARP IN-NETWORK COMPUTING DATA AGGREGATION
- (Diagram: the same fat-tree network, with the aggregation performed inside the switches instead of on the nodes.)

DATA AGGREGATION - COMPARISON
- Traditional: high latency, high amount of transferred data, CPU overhead.
- In-Network Computing: low latency, optimized data motion, no added CPU latency.
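For reference, the collective that SHARP offloads is the ordinary MPI call; a minimal sketch - nothing SHARP-specific appears in the application code, the offload is selected through HCOLL at run time:

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal MPI_Allreduce: every rank contributes a vector, every rank
     * receives the element-wise sum. With HCOLL/SHARP enabled the same call
     * is offloaded to the switch tree without any code change. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        float local[4]  = {1.0f * rank, 2.0f, 3.0f, 4.0f};
        float global[4] = {0};

        MPI_Allreduce(local, global, 4, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of first element across ranks: %f\n", global[0]);

        MPI_Finalize();
        return 0;
    }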

SHARP SOFTWARE ARCHITECTURE
- MPI (all major MPI libraries) and the DL frameworks sit on top of HCOLL and NCCL.
- HCOLL: optimized collective library.
- NCCL: optimized GPU collective library.
- SHARP (libsharp / libsharp_coll) sits underneath, on top of the InfiniBand network.
- libsharp_coll.so: implementation of the high-level SHARP API; uses the low-level libsharp.so API; easy to integrate with multiple MPIs; integrated into NCCL.
- libsharp.so: implementation of the low-level SHARP API.

SHARP SOFTWARE COMPONENTS
- Libs: the public API (sharp.h), libsharp.so (low-level API), libsharp_coll.so (high-level API).
- Daemons:
  - sharpd: SHARP daemon on the compute nodes; a lightweight process, almost 0% CPU usage, control path only.
  - sharp_am: Aggregation Manager daemon; runs on the same node as the Subnet Manager; acts as the resource manager.
- Scripts: sharp_benchmark.sh, sharp_daemons_setup.sh.
- Utilities: sharp_coll_dump_config, sharp_hello, sharp_mpi_test.

SHARP: CONFIGURING THE AGGREGATION MANAGER
- With OpenSM 4.9 or later, no special configuration is required in the AM.
- To configure the AM with OpenSM v4.7-4.8:
  - Create a configuration directory for the SHARP configuration file:
        % mkdir $HPCX_SHARP_DIR/conf
  - Create the root GUIDs file: copy the root_guids.conf file used for the Subnet Manager configuration to $HPCX_SHARP_DIR/conf/root_guid.cfg, or identify the root switches of the fabric and create a file with their node GUIDs. For example, with two root switches the file contains:
        0x0002c90000000001
        0x0002c90000000008
  - Create the sharp_am.conf file:
        % cat > $HPCX_SHARP_DIR/conf/sharp_am.conf << EOF
        root_guids_file $HPCX_SHARP_DIR/conf/root_guid.cfg
        ib_port_guid <Port GUID of the relevant HCA port, or 0x0>
        EOF

SHARP: RUNNING THE SHARP DAEMONS
- Set up the daemons with $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh:
        Usage: sharp_daemons_setup.sh (-s | -r) [-p SHARP location dir] -d <daemon> [-m]
          -s   Set up the SHARP daemon
          -r   Remove the SHARP daemon
          -p   Path to an alternative SHARP location dir
          -d   Daemon name (sharpd or sharp_am)
          -m   Add monitor capability for daemon control
- sharp_am (on the Subnet Manager node):
        % $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s $HPCX_SHARP_DIR -d sharp_am
        % service sharp_am start
        Log: /var/log/sharp_am.log
- sharpd (conf file $HPCX_SHARP_DIR/conf/sharpd.conf, e.g. ib_dev <device>, sharpd_log_level 2), on the compute nodes:
        % pdsh -w <nodes> $HPCX_SHARP_DIR/sbin/sharp_daemons_setup.sh -s $HPCX_SHARP_DIR -d sharpd
        % pdsh -w jupiter[001-032] service sharpd start
        Log: /var/log/sharpd.log

HCOLL - COLLECTIVES LIBRARY
- Deployed in production in the world's leading supercomputers.
- SHARP-enabled collectives on GPU and host memory (Allreduce, Reduce and Barrier).
- Support for collectives over GPUDirect RDMA.
- Scatter-Reduce / Allgather reduction algorithm for large vectors.
- Reliable-multicast-enabled collectives (VMC): Bcast, large and small.
- Support for FP16 reductions (requires Open MPI 5.x).
- Highly optimized shared-memory operations.
- UCX-accelerated collectives (Allreduce for large vectors, Alltoall, Alltoallv, Allgather, Allgatherv).
- Non-blocking MPI collectives (HCA-based collective offloads, CrossChannel).
- High-performance multithreaded implementation.
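With a CUDA-aware MPI (as shipped in HPC-X), the same collectives can be issued directly on GPU buffers, which is what lets HCOLL use the GPUDirect RDMA path; a sketch, with the buffer size as a placeholder:

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Sketch only: assumes a CUDA-aware MPI (e.g. HPC-X Open MPI with HCOLL),
     * so device pointers can be passed directly to the collective. */
    void allreduce_gpu_gradients(MPI_Comm comm)
    {
        const int count = 1 << 20;
        float *d_grad = NULL, *d_out = NULL;
        cudaMalloc((void **)&d_grad, count * sizeof(float));
        cudaMalloc((void **)&d_out,  count * sizeof(float));

        /* Non-blocking allreduce on GPU buffers: post it, overlap other
         * work, then wait for completion. */
        MPI_Request req;
        MPI_Iallreduce(d_grad, d_out, count, MPI_FLOAT, MPI_SUM, comm, &req);

        /* ... launch independent GPU work or prepare the next batch here ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);

        cudaFree(d_grad);
        cudaFree(d_out);
    }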

ENABLING HCOLL
- The Open MPI that ships with MOFED/HPC-X enables HCOLL by default, and its priority is set to highest.
- To explicitly enable or disable HCOLL:
        $ mpirun -mca coll_hcoll_enable 1 -mca coll_hcoll_np 0 ./app.exe
- -mca coll_hcoll_enable 0/1: disable/enable HCOLL.
- -mca coll_hcoll_np <comm_size>: communicator size threshold for using the HCOLL collectives.
- Selecting the HCA device: -x HCOLL_MAIN_IB=<dev>:<port>.

HCOLL: SUPPORTED COLLECTIVES
- MPI_Barrier / MPI_Ibarrier, MPI_Allreduce / MPI_Iallreduce, MPI_Reduce / MPI_Ireduce, MPI_Bcast / MPI_Ibcast (SHARP supported).
- MPI_Allgather / MPI_Iallgather, MPI_Alltoall / MPI_Ialltoall, MPI_Alltoallv / MPI_Ialltoallv.
- For non-supported collectives, HCOLL falls back to the next-highest-priority collective module (tuned/basic).
- Disable a specific collective in HCOLL with HCOLL_ML_DISABLE_<COLLECTIVE>=1; for example, disable Bcast with -x HCOLL_ML_DISABLE_BCAST=1.

HCOLL: BCOLS AND SBGPS
- HCOLL comes with its own set of communication primitive components, known as "BCOLs": ucx_p2p (UCX), basesmuma (shared memory), cc (Cross-Channel).
- It also has its own set of subgroup components, known as "SBGPs": basesmsocket (socket subgroup), basesmuma (UMA subgroup), p2p (network subgroup, the SHARP group).
- Choosing BCOLs and SBGPs: the two common environment variables to set are HCOLL_SBGP and HCOLL_BCOL.
- The two variables have a 1:1 mapping and must contain the same number of entries: SBGP specifies the subgroup level, BCOL specifies the transport to use at that level.

HCOLL: TOPOLOGY-AWARE SUB-GROUPING (SBGP)
- HCOLL_SBGP=basesmsocket,basesmuma,p2p
- (Diagram: MPI communicator rank layout across sockets and hosts. The "SOCKET" subgroups contain the ranks of each socket, the "UMA" subgroups contain the socket leaders of each host, and the "P2P" subgroup, the SHARP group, contains one leader per host.)

DGX A100
- (Diagram: DGX A100 topology - two CPUs, PCIe switches, NICs mlx5_0 through mlx5_9 (NIC0-NIC7 plus two more), the GPUs behind the PCIe switches, and NVLink/NVSwitch between the GPUs.)

HCOLL: TOPOLOGY-AWARE COLLECTIVES: ALLREDUCE
- -x HCOLL_SBGP=basesmsocket,basesmuma,p2p -x HCOLL_BCOL=basesmuma,basesmuma,ucx_p2p
- Reduce within each socket: basesmuma (shared memory).
- Reduce across sockets within each host: basesmuma (shared memory).
- Allreduce across hosts: ucx_p2p or SHARP.
- Bcast within each host: basesmuma (shared memory).
- Bcast within each socket: basesmuma (shared memory).

HCOLL: TOPOLOGY-AWARE SUB-GROUPING (SBGP)
- HCOLL_SBGP=basesmsocket,p2p
- (Diagram: "SOCKET" subgroups per socket; the "P2P" subgroup contains the socket leaders.)

HCOLL: ENABLE SHARP
- HCOLL_ENABLE_SHARP:
  - 0 - Disable SHARP (default).
  - 1 - Probe SHARP availability and use it (recommended if SHARP is not a must).
  - 2 - Force the use of SHARP (recommended if SHARP is a must).
  - 3 - Force the use of SHARP for all MPI communicators (benchmarking).
  - 4 - Force the use of SHARP for all MPI communicators and for all supported collectives.
- SHARP_COLL_LOG_LEVEL (SHARP debug log level):
  - 0 - fatal, 1 - error, 2 - warn (default), 3 - info (minimal progress log; recommended for tracking whether SHARP is actually being used), 4 - debug, 5 - trace.

HCOLL: ENABLE SHARP - EXAMPLES
- SHARP disabled:
        $ mpirun -map-by node -np 8 -x HCOLL_MAIN_IB=mlx5_1:1 -x HCOLL_ENABLE_SHARP=0 -x SHARP_COLL_LOG_LEVEL=3 ./osu_barrier
        # OSU MPI Barrier Latency Test v5.3.2
        # Avg Latency(us)   3.72
- HCOLL_ENABLE_SHARP=1, setup status correct:
        $ mpirun -map-by node -np 8 -x HCOLL_MAIN_IB=mlx5_1:1 -x HCOLL_ENABLE_SHARP=1 -x SHARP_COLL_LOG_LEVEL=3 ./osu_barrier
        nemo01:0:21559 - context.c:485 INFO job (ID: 327680001) resource request quota: (osts:0 user_data_per_ost:128 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
        nemo01:0:21559 - context.c:628 INFO tree_info: tree idx:0 quota: (osts:51 user_data_per_ost:128 max_groups:51 max_qps:1 max_group_channels:1)
        nemo01:0:21559 - comm.c:417 INFO group#:0 group id:7 tree idx:0 rail idx:0 group size:8 quota: (osts:2 user_data_per_ost:128) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x4c590e000007) mlid:c005
        # OSU MPI Barrier Latency Test v5.3.2
        # Avg Latency(us)   0.8

HCOLL: ENABLE SHARP (CONT.)
- HCOLL_ENABLE_SHARP=1, setup status incorrect (sharpd is down); HCOLL falls back to non-SHARP collectives:
        $ mpirun -map-by node -np 8 -x HCOLL_MAIN_IB=mlx5_1:1 -x HCOLL_ENABLE_SHARP=1 -x SHARP_COLL_LOG_LEVEL=3 ./osu_barrier -i 10000 1000
        nemo01:0:26916 - context.c:195 ERROR failed to open sharp session with SHARPD
        nemo01:26916:0 [common_sharp.c:288] comm_sharp_coll_init SHARP: Failed to initialize SHARP collectives: Cannot connect to SHARPD(-8) job ID: 342491137
        # OSU MPI Barrier Latency Test v5.3.2
        # Avg Latency(us)   3.73
- HCOLL_ENABLE_SHARP=2 (force), setup status incorrect (sharpd is down); fallback is disabled and the job aborts:
        $ mpirun -map-by node -np 2 -x HCOLL_MAIN_IB=mlx5_1:1 -x HCOLL_ENABLE_SHARP=2 -x SHARP_COLL_LOG_LEVEL=3 ./osu_barrier -i 10000 1000
        nemo01:0:27007 - context.c:195 ERROR failed to open sharp session with SHARPD
        nemo01:27007:0 [common_sharp.c:288] comm_sharp_coll_init SHARP: Failed to initialize SHARP collectives: Cannot connect to SHARPD(-8) job ID: 353107969
        nemo01:27007:0 [common_sharp.c:292] comm_sharp_coll_init SHARP: Fallback is disabled. exiting
        nemo04:24392:3 [common_sharp.c:292] comm_sharp_coll_init SHARP: Fallback is disabled. exiting
        Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

SHARP MULTI-CHANNEL
- Without multichannel: only one rank per node participates in the network communicator.
- With multichannel: works best with the HCOLL subgrouping mechanism:
        -x HCOLL_SBGP=basesmsocket,p2p -x HCOLL_BCOL=basesmuma,ucx_p2p
- The socket leaders form the P2P subgroup.
- Useful with "full PPN" or with ranks on both sockets; avoids inter-socket communication.
- Best for small messages; for large messages the shared-memory channel costs less than the extra network overhead.
- (Diagram: "SOCKET" subgroups per socket; the "P2P" subgroup contains the socket leaders of every host.)

SHARP MULTI-CHANNEL, SINGLE PORT
- (Diagram: one SHARP tree; the socket/NUMA leader buffers of each host are sent through the single port.)

SHARP MULTI-CHANNEL, TWO PORTS, SINGLE TREE
- (Diagram: one SHARP tree reached through two ports; the NUMA-0 and NUMA-1 leader buffers of each host go out through port 1 and port 2 respectively.)

SHARP MULTI-RAIL
- Multiple SHARP trees, one tree per port; more resources.
- One SHARP group per tree; multiple SHARP groups per MPI communicator.
- Useful for large messages: messages are striped across both groups.
- Small messages (below the OST payload size) go to a single group, round-robin.
- SHARP_COLL_GROUPS_PER_COMM (default: 1, max: number of trees).

MULTI-RAIL, 2 PORTS, 2 TREES, 2 GROUPS
- (Diagram: SHARP Tree-1 and Tree-2, each reached through its own port; each host's leader splits the message between port 1 and port 2.)

NVIDIA DGX: COMPUTE RAILS (PLANES)
- NCCL creates logical "rings" among the same HCA_i on multiple DGXs, effectively creating a sub-network (rail) per HCA_i.
- (Diagram: Rail 1, Rail 2, ..., Rail n across DGX1..DGXm; dotted lines show the ring traffic flow.)

MAPPING RAILS TO INFINIBAND TREES
- (Diagram: each rail is mapped onto its own InfiniBand tree across the DGX systems.)

SHARP V2 - STREAMING AGGREGATION
- 0B to 2GB vector size aggregation.
- Up to 200Gb/s of aggregated data.
- Up to 2 concurrent aggregation flows, one per tree.
- Up to 64 transactions in flight per flow.
- Reliable multicast support.
- Result spreading using the RC transport (UD is not allowed).
- Includes a single-target mode for optimized Reduce operations.
- The streaming aggregation tree is reserved with a semaphore "lock" operation; the lock can cover more than one aggregation operation, or be held indefinitely.
- The semaphore can be released by hardware when the credits are used up, or manually with an "unlock semaphore" operation.

ENABLE STREAMING AGGREGATION
- SHARP_COLL_ENABLE_SAT: enables the SAT (Streaming Aggregation Tree) capabilities.
- Example (run through mpirun with SHARP enabled):
        ... -x SHARP_COLL_ENABLE_SAT=1 sharp_coll_test -d $HCA --mode perf --collectives allreduce -s $size -i $iter
        clx-mld-061:0:49532 - context.c:577 INFO job (ID: 35748427966255972) resource request quota: (osts:0 user_data_per_ost:128 max_groups:0 max_qps:1 max_group_channels:1 num_trees:1)
        clx-mld-061:0:49532 - context.c:748 INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: (osts:230 user_data_per_ost:128 max_groups:230 max_qps:1 max_group_channels:1)
        clx-mld-061:0:49532 - context.c:752 INFO tree_info: type:SAT tree idx:1 treeID:0x3f caps:0x16
        clx-mld-061:0:49532 - comm.c:404 INFO group#:0 group id:9e tree idx:0 tree_type:LLT rail idx:0 group size:16 quota: (osts:2 user_data_per_ost:128) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
        clx-mld-061:0:49532 - comm.c:404 INFO group#:1 group id:9e tree idx:1 tree_type:SAT rail idx:0 group size:16 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0

STREAMING AGGREGATION
- SHARP_COLL_SAT_THRESHOLD (default: 8096): message size threshold that triggers SAT (Streaming Aggregation Tree) mode.
- SHARP_COLL_SAT_LOCK_BATCH_SIZE (default: 0xffff (65535), i.e. infinity): SAT lock batch size.
- SHARP_COLL_SAT_LOCK_RETRY_COUNT (default: 1): SAT lock retry count.
- SHARP_COLL_SAT_LOCK_RETRY_TIMEOUT (default: 100ms): SAT lock retry timeout in milliseconds.

NCCL SHARP

NCCL OVERVIEW
- NCCL: NVIDIA Collective Communication Library; enables multi-GPU computing.
- Data-parallel multi-GPU training: NCCL Allreduce aggregates the gradients across GPUs.
- Used by the DL frameworks (TensorFlow/Horovod, PyTorch, MXNet, Chainer, ...).
- NCCL 1.0: single node, ring.
- NCCL 2.0: ring across multiple nodes, RDMA.
- NCCL 2.4: hierarchical tree algorithm.
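The call the frameworks issue is ncclAllReduce; a minimal single-process sketch across the local GPUs (sizes are placeholders, error handling omitted):

    #include <nccl.h>
    #include <cuda_runtime.h>

    /* Sketch: single-process allreduce across the local GPUs, the same
     * operation the DL frameworks use for gradient aggregation.
     * Assumes ndev <= 8. */
    void allreduce_local_gpus(int ndev)
    {
        const size_t count = 1 << 20;
        ncclComm_t   comms[8];
        cudaStream_t streams[8];
        float *sendbuf[8], *recvbuf[8];

        ncclCommInitAll(comms, ndev, NULL);       /* one comm per local GPU */

        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            cudaStreamCreate(&streams[i]);
            cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
            cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
        }

        ncclGroupStart();                          /* fuse the per-GPU calls */
        for (int i = 0; i < ndev; ++i)
            ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                          comms[i], streams[i]);
        ncclGroupEnd();

        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            cudaStreamSynchronize(streams[i]);
            ncclCommDestroy(comms[i]);
        }
    }

Across nodes, the inter-node part of this collective is what the SHARP plugin described below offloads to the switches.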

NCCL RING
- Simple, full bandwidth, linear latency.
- (Diagram: the GPUs of all nodes are chained into one ring through the HCAs and the network.)

NCCL TREE
- Keeps the intra-node chain; the node leaders participate in a binary double tree; multiple rings map to multiple trees.
- (Diagram: intra-node chains connected by an inter-node tree through the HCAs.)

NCCL SHARP
- (Diagram: the inter-node aggregation is performed by the SHARP switches, which return the aggregated result to the node leaders.)
- Collective network plugin: replaces the inter-node tree with a SHARP tree, keeps the intra-node ring.
- Aggregation happens in the network switch.
- Streaming from GPU memory with GPUDirect RDMA.
- 2.5x the bandwidth.
- Support for the SHARP plugin (Oct. 2019); support for the UCX plugin (Feb. 2020).

SHARP DELIVERS HIGHEST PERFORMANCE FOR AI
- SHARP enables nearly 20% higher application performance.
- VAE benchmark (Facebook Variational Auto-Encoder) and GNMT MLPerf benchmark (Neural Machine Translation): roughly 18% improvement with SHARP versus no SHARP.
- Systems: 24x DGX-1V + 4x Mellanox ConnectX-6 and 32x DGX-1V + 4x Mellanox ConnectX-6.
- GNMT MLPerf 0.6 benchmark: batch size = 32, overlap = 0.15; VAE benchmark: model = 3, BS = 512.

NATURAL LANGUAGE PROCESSING WITH SHARP
- SHARP enables nearly 17% higher performance on NVIDIA BERT Megatron (PyTorch).
- Image classification and object detection were the mainstream workloads of the past few years; natural language processing (NLP) significantly improves customer experience and reduces call-center staffing.
- Deployments with NLP are latency sensitive.
- (Chart: NCCL vs. NCCL + SHARP, about 17% better with SHARP.)

INFINIBAND SHARP AI PERFORMANCE ADVANTAGE
- 2.5X higher performance on 128 NVIDIA DGX A100 systems (1024 GPUs, 1024 InfiniBand adapters).
- (Chart: performance versus message size, with and without SHARP; shown as a SHARP demo.)

NVIDIA
