Optcast: Open-Source Aggregation Offloading for Distributed Deep Learning
Nariaki Tateiwa, Researcher, NTT (NIPPON TELEGRAPH AND TELEPHONE CORPORATION)
ARTIFICIAL INTELLIGENCE (AI)

Overview
- Allreduce Communication Bottleneck in Distributed ML Workloads
- Benefits of Aggregation Offloading: offloading the aggregation operation to other resources can cut Allreduce data transfer roughly in half compared to Ring-Allreduce.
- Limitations of Existing Offloading Tools: these tools lack support for Ethernet-based protocols such as RoCE and require specific hardware to operate.
- Our Proposal, Optcast: our tool can speed up Allreduce by 2x over Ring-Allreduce and supports the RoCE/InfiniBand/AWS-EFA/Socket protocols on commodity hardware. Optcast is an open-source prototype. Give it a try!
Distributed Deep Learning Traffic
- Data parallelism, the most common parallel strategy, synchronizes all processes after the communication phase.
- The Allreduce collective is executed in the communication phase.
- Allreduce data size grows with the number of model parameters, and AI models have expanded 1000x in 3 years (a back-of-the-envelope check follows the table).

  Model | #parameters | Total Allreduce size per comm. phase*
  GPT-3 | 175B        | 651.9 GB
  PaLM  | 540B        | 2012.7 GB
  * Actually, Allreduce is executed in parallel by distributed worker groups.

  [Figure: data-parallel workload timeline, with compute, communication, and synchronization phases for nodes 0-3 running Allreduce]
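As a sanity check on the table, the per-phase Allreduce payload is just #parameters x bytes per gradient value. A minimal sketch, assuming fp32 gradients (4 bytes per parameter); the helper name allreduce_bytes is mine, not from the slides:

```rust
/// Per-phase Allreduce payload, assuming one fp32 gradient (4 bytes) per parameter.
fn allreduce_bytes(num_params: u64) -> u64 {
    num_params * 4
}

fn main() {
    let gib = (1u64 << 30) as f64;
    // GPT-3 (175B params) -> ~652 GiB; PaLM (540B params) -> ~2012 GiB per phase
    for (name, params) in [("GPT-3", 175_000_000_000_u64), ("PaLM", 540_000_000_000_u64)] {
        println!("{name}: {:.1} GiB per phase", allreduce_bytes(params) as f64 / gib);
    }
}
```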
Aggregation Offloading
- Aggregation is performed on other computing resources (switch ASIC, server, FPGA, etc.) with well-pipelined transmission.
- It reduces Allreduce data transfer by about half compared to the conventional Ring-Allreduce (see the traffic sketch below).
- Existing technologies:
  - SHARP (offloading to the switch ASIC)
  - Reduction Server of Google Vertex AI (offloading to servers)

  [Figure: SHARP aggregates the contributions of processes 0-3 inside the switch; the Reduction Server approach aggregates them on dedicated servers]
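The "about half" claim follows from standard per-worker traffic accounting: Ring-Allreduce sends 2(n-1)/n x S bytes per worker for an S-byte buffer across n workers, whereas reduction-server-style offloading sends only the worker's own S bytes (and receives S back). A rough sketch of that accounting; the function names are illustrative, not Optcast's API:

```rust
/// Bytes sent per worker by Ring-Allreduce (reduce-scatter + all-gather):
/// 2 * (n - 1) / n * s, which approaches 2 * s as n grows.
fn ring_allreduce_tx_bytes(s: f64, n: u32) -> f64 {
    2.0 * (n as f64 - 1.0) / n as f64 * s
}

/// Bytes sent per worker with reduction-server style offloading:
/// each worker sends its s-byte contribution once and receives the s-byte result.
fn offloaded_tx_bytes(s: f64) -> f64 {
    s
}

fn main() {
    let s = 1e9; // 1 GB gradient buffer
    let n = 8;   // 8 workers
    println!("ring:      {:.2} GB per worker", ring_allreduce_tx_bytes(s, n) / 1e9); // 1.75 GB
    println!("offloaded: {:.2} GB per worker", offloaded_tx_bytes(s) / 1e9);         // 1.00 GB
}
```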
Limitations of Existing Technologies
- They support only certain transport protocols, such as InfiniBand, not RoCE.
- They require specific hardware.
- Major existing tools are proprietary, which can cause compatibility and applicability issues in various current and future environments.

  Tool                                | Transport Protocol | Hardware Requirement                                                                                           | Code Availability
  SHARP                               | InfiniBand         | NVIDIA Quantum & ConnectX-6 and above (supporting low-latency operation and streaming aggregation operations) | Closed
  Reduction Server (Google Vertex AI) | InfiniBand         | Cloud service                                                                                                  | Closed
Goal: Aggregation Offloading in Any Environment
- Pluggable Aggregation Fabric: an abstraction of the network component used for aggregation (a hypothetical interface sketch follows this list).
- Seamless Network Integration
  - It must be easily deployable with minimal additions or replacements; openly developed implementations will help with this.
  - It should provide protocol transparency, working over RoCE, InfiniBand, and networks where both coexist. Optical direct links naturally provide protocol transparency.
- Well-Pipelined Transmission
  - CXL may be useful for this.
- Scalable to Collective Workloads
  - It should support 100-1000 workers for ML workloads.
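The "pluggable aggregation fabric" idea can be pictured as a narrow interface that hides whether aggregation happens in a switch ASIC, on a CPU server, or on an FPGA. Below is a minimal, purely hypothetical sketch in Rust; the trait name AggregationFabric and the CpuFabric backend are illustrative, not Optcast's actual API:

```rust
/// Hypothetical "pluggable aggregation fabric" abstraction (illustrative only,
/// not Optcast's real interface): anything that can element-wise sum the
/// equal-sized chunks contributed by a group of workers.
trait AggregationFabric {
    /// Reduce one chunk per worker into a single summed chunk.
    fn reduce(&self, chunks: &[Vec<f32>]) -> Vec<f32>;
}

/// A trivial in-process backend. Real backends would differ only in where the
/// sum happens (CPU server, switch ASIC, FPGA) and in the transport they speak
/// (RoCE, InfiniBand, AWS-EFA, plain sockets).
struct CpuFabric;

impl AggregationFabric for CpuFabric {
    fn reduce(&self, chunks: &[Vec<f32>]) -> Vec<f32> {
        let len = chunks.first().map_or(0, |c| c.len());
        let mut out = vec![0.0f32; len];
        for chunk in chunks {
            for (o, x) in out.iter_mut().zip(chunk) {
                *o += x;
            }
        }
        out
    }
}

fn main() {
    let fabric = CpuFabric;
    // Three workers contribute two gradient values each; the fabric returns their sum.
    let reduced = fabric.reduce(&[vec![1.0, 2.0], vec![3.0, 4.0], vec![5.0, 6.0]]);
    assert_eq!(reduced, vec![9.0, 12.0]);
    println!("reduced = {:?}", reduced);
}
```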
Prototype Implementation
- Optcast: an open-source aggregation offload tool.
- Uses CPU servers for the aggregation computation, in the Reduction Server style (a conceptual sketch of the serving loop follows this list).
- Implemented as an NCCL Collnet plugin.
  - NCCL is the de facto GPU communication library; Collnet is its module for aggregation offload.
  - No application code changes are needed.
- Implemented in Rust.
- Supports the RoCE/InfiniBand/Socket/AWS-EFA protocols; other cloud protocols will be supported.
- The Github repository is here!
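To illustrate "CPU servers as aggregation computation" in the Reduction Server style, here is a conceptual sketch of the serving loop, with simple chunking standing in for the real pipelining; the function and constant names (serve_allreduce, CHUNK) are made up for the example and are not Optcast's code:

```rust
// Conceptual reduction-server serving loop (illustrative only, not Optcast's
// code): the gradient buffer is split into chunks; for each chunk the server
// gathers one piece from every worker, sums them, and sends the result back.
// In a real, well-pipelined implementation, receiving chunk k+1, reducing
// chunk k, and sending chunk k-1 all overlap.

const CHUNK: usize = 4; // tiny chunk size, just for the example

fn serve_allreduce(worker_buffers: &[Vec<f32>]) -> Vec<f32> {
    let len = worker_buffers[0].len();
    let mut result = vec![0.0f32; len];
    for start in (0..len).step_by(CHUNK) {
        let end = (start + CHUNK).min(len);
        // "Receive" this chunk from every worker and accumulate it.
        for buf in worker_buffers {
            for (r, x) in result[start..end].iter_mut().zip(&buf[start..end]) {
                *r += x;
            }
        }
        // "Send" result[start..end] back to every worker here.
    }
    result
}

fn main() {
    let workers = vec![vec![1.0f32; 10], vec![2.0f32; 10], vec![3.0f32; 10]];
    assert_eq!(serve_allreduce(&workers), vec![6.0f32; 10]);
    println!("ok");
}
```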
Performance Evaluation
- 8-node Allreduce throughputs of Optcast, SHARP, and NCCL over InfiniBand (a note on how such throughput is commonly computed follows the settings).
- Optcast is competitive with SHARP at Allreduce sizes large enough for distributed ML workloads.
- More tuning is required for small Allreduce sizes.
- Settings:
  - 4 GPU servers: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00 GHz, 2 NVIDIA Tesla V100 GPUs each
  - NCCL v2.19.4-1 with a patch for an adjustable chunk size; GPU-Direct RDMA enabled
  - 4 CPU aggregation servers for Optcast: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00 GHz, Ubuntu 20.04.6 LTS
  - 1 NVIDIA InfiniBand switch (QM8700) for SHARP
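For context on how Allreduce throughput is commonly reported: the nccl-tests convention distinguishes algorithm bandwidth (bytes moved per rank divided by elapsed time) from bus bandwidth (algorithm bandwidth scaled by 2(n-1)/n so it is comparable to link speed). The sketch below assumes that convention; the slide does not state which metric its plot uses:

```rust
/// nccl-tests style bandwidth accounting for Allreduce (an assumption here;
/// the evaluation slide does not state which metric its throughput plot uses).
///   algbw = bytes moved per rank / elapsed time
///   busbw = algbw * 2 * (n - 1) / n, comparable to raw link speed
fn allreduce_bandwidths(bytes: f64, seconds: f64, n_ranks: f64) -> (f64, f64) {
    let algbw = bytes / seconds;
    let busbw = algbw * 2.0 * (n_ranks - 1.0) / n_ranks;
    (algbw, busbw)
}

fn main() {
    // Example: a 1 GB Allreduce across 8 ranks finishing in 100 ms.
    let (algbw, busbw) = allreduce_bandwidths(1e9, 0.1, 8.0);
    println!("algbw = {:.1} GB/s, busbw = {:.1} GB/s", algbw / 1e9, busbw / 1e9); // 10.0 and 17.5
}
```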
Call to Action
- We showed the efficiency of aggregation offloading using the Optcast prototype.
- Problems to handle:
  - Performance Enhancement: integration with CXL and NIC offloading
  - Environment Expansion: supporting other communication libraries such as OpenMPI and RCCL
  - Scalability Improvement: enhanced flexibility of the aggregation fabric for large-scale executions with 1000 workers
- Try it and give us feedback (https:/ to find additional information)
- Blog post for further details: https:/
- Github repository
Thank you!