#page# DATA SCIENCE IN PRACTICE
More than 70% of compute is spent on data processing: more time goes to processing data than to any other stage of the machine learning pipeline. The benefit to developers is high, and the NVIDIA opportunity is sizable.
[Chart: compute spent on data processing vs. training vs. inference.]

#page# Accelerating the Machine Learning Data Pipeline
[Diagram: streaming inputs feed storage (Redis), an ML train-and-optimize stage, and a production ML service for inference.]

#page# RDMA - Remote DMA
InfiniBand and RoCE (RDMA over Converged Ethernet)
- Messages: send/receive, remote DMA, and remote atomics
- Hardware transport
- Kernel and hypervisor bypass
Advantages:
- Lower latency: from ~10 µs down to 700 ns
- Higher message rate: 215 M messages/s
- Lower CPU utilization: ~0% with an RDMA NIC

#page# RDMA Write Operation
- The receiver allocates a receive buffer and sends its address and memory key to the remote side.
- The sender allocates a send buffer, registers it with the HCA, posts a send WQE containing the remote virtual address, and "rings the doorbell".
- The HCA consumes the WQE, reads the buffer, and sends it to the remote side, generating a send-completion event.
- When the packet arrives at the remote HCA, the HCA checks the address and memory key and writes the data directly into memory.
[Diagram: Host A and Host B memory; each HCA has a send queue, a receive queue, and a completion queue.]

#page# Enabling Kernel Bypass
Separation of control and data paths (application -> middleware -> verbs -> RDMA library):
- Data path, straight to the NIC hardware: send, receive, RDMA, completion and request events
- Control path, through the kernel RDMA stack and device driver: connection establishment, resource setup, memory management
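The write flow above can be modeled with plain Python data structures. This is a hedged toy sketch, not a real verbs program: the class and function names (`Hca`, `Region`, `rdma_write`) are invented for illustration. The point it shows is that the remote "HCA" validates the virtual address and rkey carried in the request and copies bytes directly into the target host's registered memory, with no involvement from the target's CPU.

```python
# Toy model of an RDMA write: illustrative only, not the real verbs API.
from dataclasses import dataclass

@dataclass
class Region:
    base: int          # remote virtual address handed out at registration
    rkey: int          # remote key proving the region is registered
    data: bytearray    # the registered memory itself

class Hca:
    def __init__(self):
        self.regions = {}          # rkey -> Region
        self.completion_queue = []
        self._next_rkey = 1

    def register(self, base, size):
        """Receiver side: register a buffer, get back (addr, rkey) to share."""
        rkey = self._next_rkey
        self._next_rkey += 1
        self.regions[rkey] = Region(base, rkey, bytearray(size))
        return base, rkey

    def handle_write(self, remote_addr, rkey, payload):
        """Inbound packet: check address and key, then write directly to memory."""
        region = self.regions[rkey]            # key check
        off = remote_addr - region.base        # address check
        assert 0 <= off and off + len(payload) <= len(region.data)
        region.data[off:off + len(payload)] = payload

def rdma_write(local, remote, remote_addr, rkey, payload):
    """Sender side: post a WQE with the remote address/key and ring the doorbell."""
    remote.handle_write(remote_addr, rkey, payload)   # HCA-to-HCA transfer
    local.completion_queue.append(("send_complete", len(payload)))

host_a, host_b = Hca(), Hca()
addr, rkey = host_b.register(base=0x1000, size=16)    # receiver shares addr + key
rdma_write(host_a, host_b, addr, rkey, b"hello")
print(host_b.regions[rkey].data[:5])   # data landed without Host B's CPU running
print(host_a.completion_queue)         # sender sees a send-completion event
```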
5、entNIC Hardware#page#RDMA over Converged Ethernet (RoCE)BTH+CRCCSEUhL2UDPRoCEv2 PacketPayloadIPInfiniBand PacketWCRCLRHGRHBTHICRCPayloadRDMARDMAO InfiniBand is a centrally managed, lossless high performanceTransportTransportnetwork architectureG Provides specification for the RDMA transportUDPIB Net
6、workLayerIPO RoCE adapts the efficient RDMA transport to run over EthernetnetworksInfiniBandEthernetLinklayerLink layerInfiniBandROCEv2O Standard Ethernet managementInfiniBandEthernet/ipManagementManagement#page#Verbs programming1.Get the device list2. Open the requested device3. Query the device ca
7、pabilitiesHC4.Alocate a Protection Domain to containyour resourcesCQ5.Register a memory region6.Create a Completion Queue (CQ)7.Create a Queue Pair (QP)8.Bring up a QP9.Post work requests and poll for completion10.Cleanup#page#RDMA 8 InfiniBand software StackSeamless application integration with Upp
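The ten steps above can be sketched as a toy Python model of the setup sequence and the post/poll loop. All names here are invented for illustration; a real program uses the C verbs library (`ibv_get_device_list`, `ibv_open_device`, `ibv_alloc_pd`, `ibv_reg_mr`, `ibv_create_cq`, `ibv_create_qp`, `ibv_post_send`, `ibv_poll_cq`) and tears resources down in the reverse of the order shown.

```python
# Toy model of the verbs flow: open a device, allocate a PD, register memory,
# create a CQ and QP, post a work request, then poll for its completion.
# Every name is invented for illustration; real programs use libibverbs.

class CompletionQueue:
    def __init__(self):
        self.entries = []

    def poll(self):
        # Drain completed work entries, like ibv_poll_cq.
        done, self.entries = self.entries, []
        return done

class QueuePair:
    def __init__(self, cq):
        self.cq = cq
        self.state = "INIT"

    def bring_up(self):
        # Real QPs walk INIT -> RTR -> RTS via modify-QP calls.
        self.state = "RTS"

    def post_send(self, wr_id, payload):
        assert self.state == "RTS", "QP must be brought up first"
        # The 'hardware' completes the request immediately in this model.
        self.cq.entries.append((wr_id, len(payload)))

def run_verbs_sequence():
    devices = ["mlx5_0"]                    # 1. get the device list
    dev = devices[0]                        # 2. open the requested device
    caps = {"max_qp": 1 << 18}              # 3. query its capabilities
    pd = object()                           # 4. allocate a Protection Domain
    mr = bytearray(4096)                    # 5. register a memory region
    cq = CompletionQueue()                  # 6. create a Completion Queue
    qp = QueuePair(cq)                      # 7. create a Queue Pair
    qp.bring_up()                           # 8. bring up the QP
    qp.post_send(wr_id=1, payload=b"ping")  # 9. post work requests...
    completions = cq.poll()                 #    ...and poll for completion
    return completions                      # 10. cleanup runs in reverse order

print(run_verbs_sequence())
```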
[Diagram: applications (MPI, SHMEM, NCCL, DPDK, sockets, video, TensorFlow, Spark) run over user-space libraries such as UCX, VMA, Rivermax, DPDK PMDs, and SHARP, on top of the RDMA core services library (verbs, CM) and the device-driver library; the kernel hosts its own RDMA core services (verbs, CM) and device driver; the hardware provides the RDMA infrastructure.]

#page# UCX Library
Unified Communication X (http://openucx.org)
Mission: collaboration between industry, laboratories, and academia to create production-grade communication frameworks and open standards for data-centric and high-performance applications.
[Logos: AMD, Los Alamos National Laboratory, Argonne, NVIDIA, Arm, and university partners.]
#page# UCX LAYERS
- Applications: HPC (MPI, SHMEM), storage, RPC, AI; Web 2.0 (Spark, Hadoop); UCX-Py
- UCP - high-level API (protocols): transport selection, multi-rail, fragmentation; HPC API (tag matching, active messages); I/O API (stream, RPC, remote memory access, atomics); connection establishment (client/server, external)
- UCT - low-level API (transports): RDMA (RC, DCT, UD); GPU/accelerators (CUDA, AMD/ROCm); others (shared memory, TCP, OmniPath, iWarp, Cray)
- Drivers and hardware: OFA verbs, CUDA, ROCm

#page# JUCX - JAVA BINDINGS FOR UCX
- Transport abstraction, implemented on top of the UCP layer
- Can run over different types of transports (shared memory, InfiniBand/RoCE, CUDA)
- Ease of use: API wrapper over the high-level UCP layer
- Supported operations: non-blocking send/recv/put/get

#page# SPARK SHUFFLE BASICS
[Diagram: map tasks write map-output files and publish map status to the Spark master; reduce tasks then fetch their blocks from the map outputs.]
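The map/fetch pattern in the diagram can be sketched in a few lines of Python. This is a toy hash shuffle, not Spark's implementation: each map task partitions its records by key hash into one block per reducer, and each reduce task then "fetches" its block from every map output and merges them.

```python
# Toy shuffle: hash-partition map outputs, then let reducers fetch their blocks.
from collections import defaultdict

def map_task(records, num_reducers):
    """Partition one map task's output into one block per reducer."""
    blocks = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in records:
        blocks[hash(key) % num_reducers][key].append(value)
    return blocks

def reduce_task(reducer_id, all_map_outputs):
    """Fetch this reducer's block from every map output and merge them."""
    merged = defaultdict(list)
    for blocks in all_map_outputs:
        for key, values in blocks[reducer_id].items():
            merged[key].extend(values)
    return dict(merged)

inputs = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]   # two map partitions
num_reducers = 2
map_outputs = [map_task(part, num_reducers) for part in inputs]
results = [reduce_task(r, map_outputs) for r in range(num_reducers)]
print(results)  # every occurrence of a key lands in exactly one reducer
```

Every stage of this exchange costs CPU (hashing, serialization), memory and disk (the blocks), and network (the fetches), which is why the next slide calls shuffling expensive.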
#page# THE COST OF SHUFFLING
- Shuffling is very expensive in terms of CPU, RAM, disk, and network I/O.
- Spark users therefore try to avoid shuffles as much as they can.
- Speedy shuffles can relieve developers of such concerns and simplify applications.

#page# ShuffleManager Plugin
- Spark allows external ShuffleManager implementations to be plugged in, configurable per job via "spark.shuffle.manager".
- The interface allows proprietary implementations of shuffle writers and readers, and essentially defers the entire shuffle process to the new component.
- SparkUCX utilizes this interface to introduce RDMA into the shuffle process.
[Diagram: Spark with the default SortShuffleManager vs. Spark with UcxShuffleManager.]

#page# SparkUCX Performance
[Charts: PageRank and TeraSort runtimes over TCP vs. UCX; UCX reduces both total time and reducer time, with a 27% improvement highlighted.]
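Swapping in an external shuffle manager is a per-job configuration change, as the ShuffleManager Plugin slide describes. A hedged sketch of how SparkUCX might be enabled at submit time (the fully qualified class name and jar path are illustrative assumptions; check the SparkUCX project's documentation for the values matching your Spark version):

```shell
# Submit a job with the SparkUCX shuffle manager instead of the default
# sort-based one. Class name and jar location are illustrative assumptions.
spark-submit \
  --jars /path/to/spark-ucx.jar \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.UcxShuffleManager \
  my_job.py
```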
#page# SPARK 3.0 IS GPU ACCELERATED
Meeting the requirements of data processing today and tomorrow
- Spark 3.0 scale-out clusters are accelerated using RAPIDS software and NVIDIA GPUs.
- GPU year-over-year performance increases meet and exceed data-processing requirements.
"These contributions lead to faster data pipelines, model training and scoring for more breakthroughs and insights with Apache Spark 3.0 and Databricks." - Matei Zaharia, original creator of Apache Spark
[Chart: data to analyze vs. data-processing capability across the Hadoop era, Spark era, and Spark+GPU era; GPU performance growth outpaces CPU.]

#page# Bringing GPU Acceleration To Apache Spark
Seamless integration with Spark 3.0
Features:
- Use existing (unmodified) customer code.
- Spark features that are not GPU-enabled run transparently on the CPU.
Initial release - GPU acceleration of:
- Spark DataFrames
- Spark SQL
- ML/DL training frameworks
[Diagram: distributed scale-out applications and ML/DL frameworks on the Apache Spark 3.0 core; the RAPIDS Accelerator for Spark sits on RAPIDS (cuDF, cuML, cuGraph), CUDA-X AI, UCX, and Magnum IO, over infrastructure with NVIDIA GPUs.]

#page# 10X Higher Performance with GPUDirect RDMA
- Accelerates HPC and deep learning performance
- Lowest communication latency for GPUs
[Charts: GPU-InfiniBand-GPU latency drops to 1.88 µs with GPUDirect, about 10X lower than without; uni- and bi-directional GPU-InfiniBand-GPU throughput improves up to 11X (MVAPICH).]
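Enabling the RAPIDS Accelerator is likewise a configuration-only change, matching the "unmodified customer code" claim above. A hedged sketch of a submit command (the plugin class `com.nvidia.spark.SQLPlugin` and these property names follow the RAPIDS Accelerator documentation; the jar path and resource amount are illustrative):

```shell
# Run an unmodified Spark job with GPU acceleration via the RAPIDS Accelerator.
# Jar path and GPU amount are illustrative; property names follow the RAPIDS docs.
spark-submit \
  --jars /path/to/rapids-4-spark.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  my_job.py
```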
#page# GPUDirect Storage
[Diagram: without GPUDirect Storage, data moves from storage through the CPU and system memory before reaching the GPU; with GPUDirect Storage, NICs and storage reach GPU memory directly through the PCIe switch at ~100 GB/s.]
Scaling up with GPUDirect Storage: higher bandwidth, lower latency, shorter data path.

#page# Accelerated Shuffle Results
[Chart: inventory pricing query speedups; runtimes in seconds for CPU, GPU, and GPU+UCX configurations, with GPU+UCX fastest.]

#page# SPARK 3.0 SOLUTION ADVANTAGES
A100 GPU, Mellanox networking, and a vertically integrated solution
NVIDIA A100 GPUs:
- Fastest at processing Spark SQL, DataFrame, and ML/DL operations (largest GPU memory, superior memory bandwidth, most operations/s)
- TCO advantages: reduced cost per job while saving on data center footprint
- Ability to partition/share GPUs to optimize for effective utilization
- Paired with best-of-breed CPUs, enables a converged data-processing and ML/DL infrastructure
Mellanox ConnectX NICs and Switches:
- Best-of-breed performance (latency and bandwidth) with minimal CPU overhead
- Advanced storage capabilities, including NVMe-over-Fabrics offloads
Software stack:
- Zero code changes to run Spark workloads on A100: Spark API calls that can run on GPUs are automatically intercepted and accelerated
- Can be delivered and configured out of the box by HPE
Vertical integration:
- Optimal overall performance and TCO, especially when compared to cloud-based solutions