Optcast: Open-Source Aggregation Offloading for Distributed Deep Learning
Nariaki Tateiwa, Researcher, NTT (NIPPON TELEGRAPH AND TELEPHONE CORPORATION)
ARTIFICIAL INTELLIGENCE (AI)

Overview
- Allreduce Communication Bottleneck in Distributed ML Workloads
- Benefits of Aggregation Offloading: offloading the aggregation operation to other resources can cut Allreduce data transfer roughly in half compared to Ring-Allreduce.
- Limitations of Existing Offloading Tools: these tools lack support for Ethernet-based protocols such as RoCE and require specific hardware to operate.
- Our Proposal, Optcast: our tool can speed up Allreduce by 2x over Ring-Allreduce and supports the RoCE/InfiniBand/AWS-EFA/Socket protocols on commodity hardware. Optcast is an open-source prototype. Give it a try!
Distributed Deep Learning Traffic
- Data parallelism, the most common parallel strategy, synchronizes all processes after the communication phase.
- The Allreduce collective is executed in the communication phase.
- Allreduce data size grows with the number of model parameters, and AI models have expanded 1000x in 3 years (a back-of-the-envelope check follows the table).

  Model | #parameters | Total Allreduce size per comm. phase*
  GPT-3 | 175B        | 651.9 GB
  PaLM  | 540B        | 2012.7 GB
  * Actually, Allreduce is executed in parallel by distributed worker groups.

  [Figure: data-parallel workload timeline, with compute, communication, and synchronization phases for nodes 0-3 running Allreduce]
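As a sanity check on the table, the per-phase Allreduce payload is just #parameters x bytes per gradient value. A minimal sketch, assuming fp32 gradients (4 bytes per parameter); the helper name allreduce_bytes is mine, not from the slides:

```rust
/// Per-phase Allreduce payload, assuming one fp32 gradient (4 bytes) per parameter.
fn allreduce_bytes(num_params: u64) -> u64 {
    num_params * 4
}

fn main() {
    let gib = (1u64 << 30) as f64;
    // GPT-3 (175B params) -> ~652 GiB; PaLM (540B params) -> ~2012 GiB per phase
    for (name, params) in [("GPT-3", 175_000_000_000_u64), ("PaLM", 540_000_000_000_u64)] {
        println!("{name}: {:.1} GiB per phase", allreduce_bytes(params) as f64 / gib);
    }
}
```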
Aggregation Offloading
- Aggregation is performed on other computing resources (switch ASIC, server, FPGA, etc.) with well-pipelined transmission.
- It reduces Allreduce data transfer by about half compared to the conventional Ring-Allreduce (see the traffic sketch below).
- Existing technologies:
  - SHARP (offloading to the switch ASIC)
  - Reduction Server of Google Vertex AI (offloading to servers)

  [Figure: SHARP aggregates the contributions of processes 0-3 inside the switch; the Reduction Server approach aggregates them on dedicated servers]
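The "about half" claim follows from standard per-worker traffic accounting: Ring-Allreduce sends 2(n-1)/n x S bytes per worker for an S-byte buffer across n workers, whereas reduction-server-style offloading sends only the worker's own S bytes (and receives S back). A rough sketch of that accounting; the function names are illustrative, not Optcast's API:

```rust
/// Bytes sent per worker by Ring-Allreduce (reduce-scatter + all-gather):
/// 2 * (n - 1) / n * s, which approaches 2 * s as n grows.
fn ring_allreduce_tx_bytes(s: f64, n: u32) -> f64 {
    2.0 * (n as f64 - 1.0) / n as f64 * s
}

/// Bytes sent per worker with reduction-server style offloading:
/// each worker sends its s-byte contribution once and receives the s-byte result.
fn offloaded_tx_bytes(s: f64) -> f64 {
    s
}

fn main() {
    let s = 1e9; // 1 GB gradient buffer
    let n = 8;   // 8 workers
    println!("ring:      {:.2} GB per worker", ring_allreduce_tx_bytes(s, n) / 1e9); // 1.75 GB
    println!("offloaded: {:.2} GB per worker", offloaded_tx_bytes(s) / 1e9);         // 1.00 GB
}
```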
Limitations of Existing Technologies
- They support only certain transport protocols, such as InfiniBand, not RoCE.
- They require specific hardware.
- Major existing tools are proprietary, which can cause compatibility and applicability issues in various current and future environments.

  Tool                                | Transport Protocol | Hardware Requirement                                                                                           | Code Availability
  SHARP                               | InfiniBand         | NVIDIA Quantum & ConnectX-6 and above (supporting low-latency operation and streaming aggregation operations) | Closed
  Reduction Server (Google Vertex AI) | InfiniBand         | Cloud service                                                                                                  | Closed
Goal: Aggregation Offloading in Any Environment
- Pluggable Aggregation Fabric: an abstraction of the network component used for aggregation (a hypothetical interface sketch follows this list).
- Seamless Network Integration
  - It must be easily deployable with minimal additions or replacements; openly developed implementations will help with this.
  - It should provide protocol transparency, working over RoCE, InfiniBand, and networks where both coexist. Optical direct links naturally provide protocol transparency.
- Well-Pipelined Transmission
  - CXL may be useful for this.
- Scalable to Collective Workloads
  - It should support 100-1000 workers for ML workloads.
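The "pluggable aggregation fabric" idea can be pictured as a narrow interface that hides whether aggregation happens in a switch ASIC, on a CPU server, or on an FPGA. Below is a minimal, purely hypothetical sketch in Rust; the trait name AggregationFabric and the CpuFabric backend are illustrative, not Optcast's actual API:

```rust
/// Hypothetical "pluggable aggregation fabric" abstraction (illustrative only,
/// not Optcast's real interface): anything that can element-wise sum the
/// equal-sized chunks contributed by a group of workers.
trait AggregationFabric {
    /// Reduce one chunk per worker into a single summed chunk.
    fn reduce(&self, chunks: &[Vec<f32>]) -> Vec<f32>;
}

/// A trivial in-process backend. Real backends would differ only in where the
/// sum happens (CPU server, switch ASIC, FPGA) and in the transport they speak
/// (RoCE, InfiniBand, AWS-EFA, plain sockets).
struct CpuFabric;

impl AggregationFabric for CpuFabric {
    fn reduce(&self, chunks: &[Vec<f32>]) -> Vec<f32> {
        let len = chunks.first().map_or(0, |c| c.len());
        let mut out = vec![0.0f32; len];
        for chunk in chunks {
            for (o, x) in out.iter_mut().zip(chunk) {
                *o += x;
            }
        }
        out
    }
}

fn main() {
    let fabric = CpuFabric;
    // Three workers contribute two gradient values each; the fabric returns their sum.
    let reduced = fabric.reduce(&[vec![1.0, 2.0], vec![3.0, 4.0], vec![5.0, 6.0]]);
    assert_eq!(reduced, vec![9.0, 12.0]);
    println!("reduced = {:?}", reduced);
}
```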
Prototype Implementation
- Optcast: an open-source aggregation offload tool.
- Uses CPU servers for the aggregation computation, in the Reduction Server style (a conceptual sketch of the serving loop follows this list).
- Implemented as an NCCL Collnet plugin.
  - NCCL is the de facto GPU communication library; Collnet is its module for aggregation offload.
  - No application code changes are needed.
- Implemented in Rust.
- Supports the RoCE/InfiniBand/Socket/AWS-EFA protocols; other cloud protocols will be supported.
- The Github repository is here!
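To illustrate "CPU servers as aggregation computation" in the Reduction Server style, here is a conceptual sketch of the serving loop, with simple chunking standing in for the real pipelining; the function and constant names (serve_allreduce, CHUNK) are made up for the example and are not Optcast's code:

```rust
// Conceptual reduction-server serving loop (illustrative only, not Optcast's
// code): the gradient buffer is split into chunks; for each chunk the server
// gathers one piece from every worker, sums them, and sends the result back.
// In a real, well-pipelined implementation, receiving chunk k+1, reducing
// chunk k, and sending chunk k-1 all overlap.

const CHUNK: usize = 4; // tiny chunk size, just for the example

fn serve_allreduce(worker_buffers: &[Vec<f32>]) -> Vec<f32> {
    let len = worker_buffers[0].len();
    let mut result = vec![0.0f32; len];
    for start in (0..len).step_by(CHUNK) {
        let end = (start + CHUNK).min(len);
        // "Receive" this chunk from every worker and accumulate it.
        for buf in worker_buffers {
            for (r, x) in result[start..end].iter_mut().zip(&buf[start..end]) {
                *r += x;
            }
        }
        // "Send" result[start..end] back to every worker here.
    }
    result
}

fn main() {
    let workers = vec![vec![1.0f32; 10], vec![2.0f32; 10], vec![3.0f32; 10]];
    assert_eq!(serve_allreduce(&workers), vec![6.0f32; 10]);
    println!("ok");
}
```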
Performance Evaluation
- 8-node Allreduce throughputs of Optcast, SHARP, and NCCL over InfiniBand (a note on how such throughput is commonly computed follows the settings).
- Optcast is competitive with SHARP at Allreduce sizes large enough for distributed ML workloads.
- More tuning is required for small Allreduce sizes.
- Settings:
  - 4 GPU servers: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00 GHz, 2 NVIDIA Tesla V100 GPUs each
  - NCCL v2.19.4-1 with a patch for an adjustable chunk size; GPU-Direct RDMA enabled
  - 4 CPU aggregation servers for Optcast: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00 GHz, Ubuntu 20.04.6 LTS
  - 1 NVIDIA InfiniBand switch (QM8700) for SHARP
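For context on how Allreduce throughput is commonly reported: the nccl-tests convention distinguishes algorithm bandwidth (bytes moved per rank divided by elapsed time) from bus bandwidth (algorithm bandwidth scaled by 2(n-1)/n so it is comparable to link speed). The sketch below assumes that convention; the slide does not state which metric its plot uses:

```rust
/// nccl-tests style bandwidth accounting for Allreduce (an assumption here;
/// the evaluation slide does not state which metric its throughput plot uses).
///   algbw = bytes moved per rank / elapsed time
///   busbw = algbw * 2 * (n - 1) / n, comparable to raw link speed
fn allreduce_bandwidths(bytes: f64, seconds: f64, n_ranks: f64) -> (f64, f64) {
    let algbw = bytes / seconds;
    let busbw = algbw * 2.0 * (n_ranks - 1.0) / n_ranks;
    (algbw, busbw)
}

fn main() {
    // Example: a 1 GB Allreduce across 8 ranks finishing in 100 ms.
    let (algbw, busbw) = allreduce_bandwidths(1e9, 0.1, 8.0);
    println!("algbw = {:.1} GB/s, busbw = {:.1} GB/s", algbw / 1e9, busbw / 1e9); // 10.0 and 17.5
}
```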
Call to Action
- We showed the efficiency of aggregation offloading using the Optcast prototype.
- Problems to handle:
  - Performance Enhancement: integration with CXL and NIC offloading
  - Environment Expansion: supporting other communication libraries such as OpenMPI and RCCL
  - Scalability Improvement: enhanced flexibility of the aggregation fabric for large-scale executions with 1000 workers
- Try it and give us feedback (https:/ to find additional information)
- Blog post for further details: https:/
- Github repository
Thank you!