Accelerating the Spark Architecture with Network RDMA Technology.pdf

PDF, 23 pages, 36.59 MB

DATA SCIENCE IN PRACTICE
More than 70% of Compute is Spent on Data Processing
- More time is spent processing data than any other stage
- Compute required is greater
- Benefit to developers is high
- NVIDIA opportunity is sizable
[Figure labels: Inference, Training, Data Processing]

Accelerating the Machine Learning Data Pipeline
[Diagram labels: Streams; ML Train and Optimize; redis; Production ML Service (inference)]

RDMA - Remote DMA
- InfiniBand and RoCE (RDMA over Converged Ethernet)
- Messages send/receive, remote DMA and remote atomics
- Hardware transport
- Kernel and hypervisor bypass
Advantages
- Lower latency: 10 us down to 700 ns
- Higher message rate: 215M messages/s
- Lower CPU utilization: 0%
[Figure label: RDMA NIC]

RDMA Write Operation Flow
- A receive buffer is allocated, and its address and key are passed to the remote side.
- The sender allocates a send buffer and registers it with the HCA, posts a send WQE containing the remote virtual address, and "rings the doorbell".
- The HCA consumes a WQE, reads the buffer and sends it to the remote side, and generates a send completion.
- When the packet arrives at the HCA, the HCA checks the address and memory key and writes directly into memory.
[Diagram labels: Host A memory, Host B memory, HCA, send queue, receive queue, completion queue]

Enabling Kernel Bypass
Separation of Control and Data paths
- Data path: Send, Receive, RDMA, Completion, Request event
- Control path: Resource setup, Memory management, Connection establishment
[Diagram layers: Application; Middleware; Verbs; RDMA Library; RDMA Stack; Device Driver; NIC Hardware]

RDMA over Converged Ethernet (RoCE)
- InfiniBand is a centrally managed, lossless, high-performance network architecture
  - Provides specification for the RDMA transport
- RoCE adapts the efficient RDMA transport to run over Ethernet networks
- Standard Ethernet management
Packet formats:
  RoCEv2 packet:     L2 | IP | UDP | BTH+ | Payload | ICRC | FCS
  InfiniBand packet: LRH | GRH | BTH+ | Payload | ICRC | VCRC
Protocol stacks:
  InfiniBand: RDMA Transport / IB Network Layer / InfiniBand Link Layer / InfiniBand Management
  RoCEv2:     RDMA Transport / UDP / IP / Ethernet Link Layer / Ethernet/IP Management

Verbs programming
1. Get the device list
2. Open the requested device
3. Query the device capabilities
4. Allocate a Protection Domain to contain your resources
5. Register a memory region
6. Create a Completion Queue (CQ)
7. Create a Queue Pair (QP)
8. Bring up a QP
9. Post work requests and poll for completion
10. Cleanup

RDMA & InfiniBand Software Stack
Seamless application integration with Upper Layer Protocols (ULPs)
[Diagram labels: Packet Interface; RDMA Transport Interface; applications (MPI, SHMEM, NCCL, DPDK, Socket, Video, TensorFlow, Spark); libraries (RDMA, DPDK, SHARP, UCX, VMA, Rivermax, PMD); user space: RDMA Core Services library (verbs, CM), device driver library; kernel: RDMA Core Services (verbs, CM), device driver; hardware; RDMA infrastructure; applications; operating system]

UCX Library
Unified Communication X (http://openucx.org)
Mission: Collaboration between industry, laboratories, and academia to create production grade communication frameworks and open standards for data centric and high-performance applications
[Member logos, partly illegible: AMD, Los Alamos National Laboratory, Argonne, NVIDIA, Arm]

UCX LAYERS
Applications: HPC (MPI, SHMEM), Storage, RPC, AI, Web 2.0 (Spark, Hadoop); UCX-Py
UCP - High Level API (Protocols)
- Transport selection, multi-rail, fragmentation
- HPC API: tag matching, active messages
- I/O API: Stream, RPC, remote memory access, atomics
- Connection establishment: client/server, external
UCT - Low Level API (Transports)
- RDMA: RC, DCT, UD, iWarp
- GPU / Accelerators: CUDA, AMD/ROCM
- Others: Shared memory, TCP, OmniPath, Cray
Drivers: OFA Verbs Driver, Cuda, ROCM
Hardware

JUCX - JAVA BINDINGS FOR UCX
- Transport abstraction: implemented on top of UCP layer
- Can run over different types of transports (Shared memory, InfiniBand/RoCE, CUDA)
- Ease of use API wrapper over high level UCP layer
- Supported operations: non-blocking send/recv/put/get

SPARK SHUFFLE BASICS
[Diagram labels: Input; File; Map; Map output; Publish Map Status; Spark Master; Reduce task; Fetch blocks]

THE COST OF SHUFFLING
- Shuffling is very expensive in terms of CPU, RAM, disk and network IOs
- Spark users try to avoid shuffles as much as they can
- Speedy shuffles can relieve developers of such concerns, and simplify applications

ShuffleManager Plugin
- Spark allows for external implementations of ShuffleManagers to be plugged in
- Configurable per-job using: "spark.shuffle.manager"
- Interface allows proprietary implementations of Shuffle Writers and Readers, and essentially defers the entire Shuffle process to the new component
- SparkUCX utilizes this interface to introduce RDMA in the Shuffle process
[Diagram labels: Spark; SortShuffleManager; UcxShuffleManager]

SparkUCX Performance
[Charts: PageRank and TeraSort; time in seconds; total time and reducer time; UCX vs. TCP. One label reads "27%"; the other values are not recoverable.]

SPARK 3.0 IS GPU ACCELERATED
Meeting the requirements of data processing today and tomorrow
- Spark 3.0 scale-out clusters are accelerated using RAPIDS software and NVIDIA GPUs
- GPU year-over-year performance increases to meet and exceed data processing requirements
"These contributions lead to faster data pipelines, model training and scoring for more breakthroughs and insights with Apache Spark 3.0 and Databricks."
- Matei Zaharia, original creator of Apache Spark
[Chart labels: Data to Analyze; Data Processing Requirements; GPU; CPU; 2000, 2010, 2020, 2030; Hadoop Era; Spark Era; Spark + GPU Era]

Bringing GPU Acceleration To Apache Spark
Seamless Integration with Spark 3.0
Features
- Use existing (unmodified) customer code
- Spark features that are not GPU enabled run transparently on the CPU
- Initial Release - GPU Acceleration of: Spark Data Frames, Spark SQL, ML/DL training frameworks
[Diagram labels: Distributed scale-out Spark applications; Spark components; ML/DL frameworks; Apache Spark 3.0 Core; RAPIDS Accelerator for Spark; RAPIDS (cuDF, cuML, cuGraph); CUDA-X AI; UCX; Magnum IO; Infrastructure with NVIDIA GPUs; legend: GPU-accelerated]

10X Higher Performance with GPUDirect RDMA
- Accelerates HPC and Deep Learning performance
- Lowest communication latency for GPUs
[Chart labels: GPU-InfiniBand-GPU latency, 1.88 usec, 10X; GPU-InfiniBand-GPU throughput (bi-dir) and (uni-dir), 11X; No GPUDirect vs. GPUDirect; MVAPICH]

GPUDirect Storage
[Diagram: data paths without and with GPUDirect Storage; labels: System Memory, CPU, PCIe Switch, NICs, Storage, GPU; bandwidth labels not legible]
Scaling up with GPUDirect Storage
- Higher Bandwidth
- Lower Latency
- Data Pa...
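
The slide on plugging in an external ShuffleManager names `spark.shuffle.manager` as a per-job setting. As a minimal sketch of what that could look like in practice, the snippet below only assembles a `spark-submit` command line. The fully qualified class name, the jar path, and the job file are assumptions for illustration and are not taken from the slides; check the SparkUCX documentation for the real values.

```python
# Sketch: assemble a spark-submit command that swaps in an external ShuffleManager.
# Assumptions (not from the slides): the fully qualified class name, the jar path,
# and the application file are placeholders to adapt to a real deployment.
from typing import List


def spark_submit_args(shuffle_manager: str, plugin_jar: str, app: str) -> List[str]:
    """Return spark-submit arguments that select a ShuffleManager for one job."""
    return [
        "spark-submit",
        # spark.shuffle.manager is the per-job setting named on the slide.
        "--conf", f"spark.shuffle.manager={shuffle_manager}",
        # The plugin classes must be visible to both the driver and the executors.
        "--conf", f"spark.driver.extraClassPath={plugin_jar}",
        "--conf", f"spark.executor.extraClassPath={plugin_jar}",
        app,
    ]


args = spark_submit_args(
    shuffle_manager="org.apache.spark.shuffle.UcxShuffleManager",  # assumed package name
    plugin_jar="/opt/sparkucx/sparkucx.jar",                       # hypothetical path
    app="my_job.py",                                               # hypothetical job
)
print(" ".join(args))
```

Because the manager is chosen through configuration alone, the job code itself does not change; dropping the three `--conf` pairs returns the job to Spark's default shuffle.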

Accelerated Shuffle Results
Inventory Pricing Query Speedups
[Bar chart: time in seconds for CPU, GPU, and GPU+UCX; legible value labels include 228 and 45]

SPARK 3.0 SOLUTION ADVANTAGES
A100 GPU, Mellanox Networking, and Vertically-Integrated Solution
NVIDIA A100 GPUs
- Fastest at processing Spark SQL, DataFrame, and ML/DL operations (largest GPU memory, superior memory B/W, most operations/s)
- TCO advantages reduce cost per job while saving on data center footprint
- Ability to partition/share GPUs to optimize for effective utilization
- Paired with best-of-breed CPUs, enables converged data processing and ML/DL infrastructure
Mellanox ConnectX NICs and Switches
- Best-of-breed performance (latency and bandwidth), minimal CPU overhead
- Advanced storage capabilities include NVMe over Fabric offloads
Software Stack
- Zero code changes to run Spark workloads on A100 (Spark API calls that can run on GPUs are automatically intercepted and accelerated)
- Can be delivered and configured out-of-the-box by HPE
Vertical Integration
- Optimal overall performance and TCO, especially when compared to cloud-based solutions

NVIDIA
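
The RDMA write flow described earlier in the deck (the receiver shares a buffer address and key, the sender posts a work request, and the remote adapter checks the address and key before writing into memory) can be illustrated with a toy, in-process model. This is only a sketch: it uses no RDMA hardware and no verbs calls, and every class and method name in it is hypothetical.

```python
# Toy, in-process model of the RDMA write flow; no real RDMA hardware or verbs calls.
# All class and method names here are hypothetical and exist only for illustration.


class ToyHCA:
    """Models one host's adapter: registered memory regions guarded by keys."""

    def __init__(self, name):
        self.name = name
        self.regions = {}       # key -> (base address, bytearray)
        self.completions = []   # stand-in for a completion queue
        self._next_key = 1
        self._next_addr = 0x1000

    def register(self, size):
        """Register a buffer; return (virtual address, key) to share with the peer."""
        key, addr = self._next_key, self._next_addr
        self.regions[key] = (addr, bytearray(size))
        self._next_key += 1
        self._next_addr += size
        return addr, key

    def receive_write(self, addr, key, payload):
        """Remote side: check the key and address range, then write into memory."""
        if key not in self.regions:
            raise PermissionError("unknown memory key")
        base, buf = self.regions[key]
        offset = addr - base
        if offset < 0 or offset + len(payload) > len(buf):
            raise ValueError("write outside registered region")
        buf[offset:offset + len(payload)] = payload

    def post_write(self, remote, remote_addr, remote_key, local_key, length):
        """Local side: read the send buffer, send it, then record a completion."""
        _, buf = self.regions[local_key]
        remote.receive_write(remote_addr, remote_key, bytes(buf[:length]))
        self.completions.append("write completed")


host_a, host_b = ToyHCA("A"), ToyHCA("B")
recv_addr, recv_key = host_b.register(16)    # receiver shares its address and key
_, send_key = host_a.register(16)            # sender registers its send buffer
host_a.regions[send_key][1][:5] = b"hello"   # fill the send buffer
host_a.post_write(host_b, recv_addr, recv_key, send_key, 5)
print(bytes(host_b.regions[recv_key][1][:5]))  # b'hello'
```

The point of the model is the division of labor: once the address and key have been exchanged, the write lands in Host B's memory without any code running on Host B's side beyond the adapter's own check.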
