
Optcast: Open-Source Aggregation Offloading for Distributed Deep Learning
Nariaki Tateiwa, Researcher, NTT (NIPPON TELEGRAPH AND TELEPHONE CORPORATION)

Overview

- Allreduce communication bottleneck in distributed ML workloads.
- Benefits of aggregation offloading: offloading the aggregation operation to other resources can cut Allreduce data transfer roughly in half compared to Ring-Allreduce.
- Limitations of existing offloading tools: they lack support for Ethernet-based protocols such as RoCE, and they require specific hardware to operate.
- Our proposal, Optcast: it can speed up Allreduce by up to 2x over Ring-Allreduce and supports the RoCE, InfiniBand, AWS EFA, and Socket protocols on commodity hardware. Optcast is an open-source prototype. Give it a try!

Distributed Deep Learning Traffic

- Data parallelism, the most common parallel strategy, synchronizes all processes in a communication phase after each compute phase.
- The Allreduce collective is executed in the communication phase, and its data size grows with the number of model parameters.
- AI models have expanded 1000x in three years.

[Figure: data-parallel workload timeline alternating Compute, Comm., and Sync. phases]

Model | #parameters | Total Allreduce size per comm. phase*
GPT-3 | 1.75B | 6.5 GB
PaLM  | 540B  | 2012.7 GB

* In practice, Allreduce is executed in parallel by distributed worker groups.
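These per-phase sizes match simple arithmetic, assuming 4-byte (fp32) gradients (our assumption; the slide does not state the datatype):

```latex
% Allreduce payload per communication phase: S = 4P bytes for P parameters.
S_{\mathrm{GPT\text{-}3}} = 1.75 \times 10^{9} \times 4\,\mathrm{B} \approx 6.5\,\mathrm{GiB},
\qquad
S_{\mathrm{PaLM}} = 540 \times 10^{9} \times 4\,\mathrm{B} \approx 2012\,\mathrm{GiB}.
```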

Aggregation Offloading

[Figure: Allreduce among nodes 0-3, each node ending with the aggregate 0+1+2+3]

- Aggregation is performed on other computing resources (switch ASIC, server, FPGA, etc.) with well-pipelined transmission.
- This reduces Allreduce data transfer by about half compared to the conventional Ring-Allreduce.
- Existing technologies:
  - SHARP (offloading to a switch ASIC)
  - Reduction Server of Google Vertex AI (offloading to servers)

[Figures: SHARP aggregates processes 0-3 in a switch; Reduction Server aggregates processes 0-3 across servers]
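The "about half" figure follows from the standard traffic analysis (our derivation; the slide states only the conclusion). For N nodes, each contributing an S-byte buffer, each node sends (and symmetrically receives):

```latex
% Per-node bytes sent in each scheme:
\text{Ring-Allreduce: } 2\,\frac{N-1}{N}\,S \;\approx\; 2S
\qquad
\text{Offloaded aggregation: } S \text{ (send gradients, receive the result)}
```

So per direction, offloaded aggregation moves roughly half the bytes of Ring-Allreduce.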

Limitations of Existing Technologies

- They support only certain transport protocols, such as InfiniBand, not RoCE.
- They require specific hardware.
- Major existing tools are proprietary, which can cause compatibility and applicability issues across current and future environments.

Tool | Transport protocol | Hardware requirement | Code availability
SHARP | InfiniBand | NVIDIA Quantum & ConnectX-6 and above (supporting low-latency operation and streaming aggregation operations) | Closed
Reduction Server (Google Vertex AI) | InfiniBand | Cloud service | Closed

Goal: Aggregation Offloading in Any Environment

- Pluggable aggregation fabric: an abstraction of the network components used for aggregation (a minimal sketch follows this list).
- Seamless network integration: it must be easily deployable with minimal additions or replacements; openly developed implementations will help. It should also be protocol-transparent, working over RoCE, InfiniBand, and coexisting networks; optical direct links naturally provide protocol transparency.
- Well-pipelined transmission: CXL may be useful for this.
- Scalable to collective workloads: it should support 100-1000 workers for ML workloads.
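To make the "pluggable aggregation fabric" idea concrete, here is a minimal Rust sketch of a transport-agnostic interface; the names (`AggregationFabric`, `LoopbackFabric`, `send_chunk`, `recv_chunk`) are our own illustration, not Optcast's actual API.

```rust
use std::io;

/// Hypothetical pluggable fabric: each transport (RoCE, InfiniBand, AWS EFA,
/// plain sockets) would implement this same narrow interface, keeping the
/// aggregation logic protocol-transparent.
pub trait AggregationFabric {
    /// Send one gradient chunk toward the aggregation resource.
    fn send_chunk(&mut self, chunk: &[f32]) -> io::Result<()>;
    /// Receive the corresponding aggregated chunk back.
    fn recv_chunk(&mut self, out: &mut [f32]) -> io::Result<()>;
}

/// Loopback fabric for testing: "aggregating" one worker's data returns it unchanged.
pub struct LoopbackFabric {
    inflight: Vec<f32>,
}

impl AggregationFabric for LoopbackFabric {
    fn send_chunk(&mut self, chunk: &[f32]) -> io::Result<()> {
        self.inflight = chunk.to_vec();
        Ok(())
    }
    fn recv_chunk(&mut self, out: &mut [f32]) -> io::Result<()> {
        out.copy_from_slice(&self.inflight);
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut fabric = LoopbackFabric { inflight: Vec::new() };
    let grads = [0.5f32, 1.5, 2.5];
    let mut result = [0.0f32; 3];
    fabric.send_chunk(&grads)?;
    fabric.recv_chunk(&mut result)?;
    assert_eq!(result, grads);
    Ok(())
}
```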

Prototype Implementation

- Optcast: an open-source aggregation offload tool, reduction-server style.
- Uses CPU servers for the aggregation computation.
- Implemented as an NCCL Collnet plugin. NCCL is the de facto GPU communication library, and Collnet is its module for aggregation offload, so no application code changes are needed.
- Implemented in Rust.
- Supports the RoCE, InfiniBand, Socket, and AWS EFA protocols; other cloud protocols will be supported.
- The GitHub repository is here!
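The heart of a reduction-server-style aggregator is an elementwise sum over chunks streamed in from each worker; below is a minimal Rust sketch of that reduction (ours, not Optcast's code; a real server pipelines receive, reduce, and send over RDMA rather than gathering whole chunks first). On the NCCL side, Collnet plugins are typically activated with the `NCCL_COLLNET_ENABLE=1` environment variable.

```rust
/// Elementwise sum of equal-length gradient chunks, one per worker.
/// A production aggregator would overlap this with network I/O.
fn aggregate_chunks(worker_chunks: &[Vec<f32>]) -> Vec<f32> {
    let len = worker_chunks[0].len();
    let mut acc = vec![0.0f32; len];
    for chunk in worker_chunks {
        assert_eq!(chunk.len(), len);
        for (a, x) in acc.iter_mut().zip(chunk) {
            *a += *x;
        }
    }
    acc
}

fn main() {
    // Three workers, each contributing one 4-element gradient chunk.
    let chunks = vec![
        vec![1.0, 2.0, 3.0, 4.0],
        vec![10.0, 20.0, 30.0, 40.0],
        vec![100.0, 200.0, 300.0, 400.0],
    ];
    let reduced = aggregate_chunks(&chunks);
    assert_eq!(reduced, vec![111.0, 222.0, 333.0, 444.0]);
    // The server would now broadcast `reduced` back to every worker.
}
```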

Performance Evaluation

- 8-node Allreduce throughput of Optcast, SHARP, and NCCL over InfiniBand.
- Optcast is competitive with SHARP at Allreduce sizes large enough for distributed ML workloads; more tuning is required for small Allreduce sizes.

Settings:
- 4 GPU servers, each with an Intel(R) Xeon(R) E5-2660 v4 CPU @ 2.00 GHz and 2 NVIDIA Tesla V100 GPUs
- NCCL v2.19.4-1 with a patch for adjustable chunk size; GPUDirect RDMA enabled
- 4 CPU aggregation servers for Optcast, each with an Intel(R) Xeon(R) E5-2660 v4 CPU @ 2.00 GHz
- OS: Ubuntu 20.04.6 LTS
- 1 NVIDIA InfiniBand switch (QM8700) for SHARP

Call to Action

- We showed the efficiency of aggregation offloading using the Optcast prototype.
- Problems to handle:
  - Performance enhancement: integration with CXL and NIC offloading
  - Environment expansion: supporting other communication libraries such as Open MPI and RCCL
  - Scalability improvement: enhanced flexibility of the aggregation fabric for large-scale executions of 1000 workers
- Try it and give feedback: https:/ to find additional information.
- Blog post for further details: https:/
- GitHub repository

Thank you!
