Bagua Parallel Communication Library

BAGUA: A Modular & Efficient Communication Layer for Distributed Learning
Xiangru Lian, Rui Wang, Hongmei Shi, Shengzhuo Zhang, Yan Tian, Ji Liu (Kuaishou)
Shaoduo Gan, Jiawei Jiang, Ce Zhang (ETH Zürich)

Deep Learning Eating the World
o Fascinating progress over the last decade, fueled by increasingly large datasets and computation power. The availability of computation changes our view on what we can do.
o EC2 instance g2.8xlarge (2015): 4x GRID K520, 4.89 TFLOPS, $2.60/hour. EC2 instance p4d.24xlarge: $32.77/hour.
o What used to take weeks now takes hours (think about ImageNet); but users get "greedy" with new tasks that take weeks on a beefy machine (think about BERT).
o At Kuaishou: recommendation systems (300+M daily users' behaviors), multimedia understanding (dozens of thousands of years of daily accumulated watching time), and video preprocessing (~200 videos uploaded per second).

Existing Systems and Our Vision
o Question: how to accommodate the ever-growing demand of ML training over the ever-growing scale of data?
o Could be algorithmic solutions: better models, algorithms, optimizers.
o Could be system solutions: SCALE, SCALE, SCALE.
o GAP: current amazing systems (ByteDance BytePS, NVIDIA Apex, Microsoft DeepSpeed) don't support recently developed amazing techniques: decentralized training, asynchronous training, communication quantization, communication sparsification.
o We focus on scaling with data parallelism: each worker holds a partition of data, and they jointly train a single ML model.
o OUR GOAL (BAGUA): distributed learning with SOTA communication optimization techniques.

Bagua: Modular Communication ("Gossip")
o Primitive communication patterns: Centralized, Asynchronous, Decentralized.
o Primitive logical channels: Lossless, Sparsification, Quantization.
o Modular composition: patterns and channels can be stacked (e.g., D+A+L: Decentralized + Asynchronous + Lossless).
o Readily applicable to the deep learning training stack: optimization algorithms (SGD, ADAM), deep learning libraries (PyTorch, TensorFlow), and physical communication channels (NCCL, MPI, RDMA, TCP/IP).

Bagua: Challenges
o How to combine all communication patterns: new algorithms and theoretical analysis are needed for many combinations.
o How to provide efficient implementations for all communication patterns: careful system optimizations to unleash the full potential of each pattern.
o How to depict the tradeoff of all communication patterns: careful empirical evaluations and analytical cost modeling.

Overview
o Theoretical foundation: correctness.
o System design: performance.
o Empirical evaluation: guideline for practitioners.

Bagua: Theoretical Foundation
Based on "Distributed Learning Systems with First-Order Methods" by Ji Liu and Ce Zhang, covering the primitive communication patterns (centralized, asynchronous, decentralized) and logical channels (lossless, sparsification, quantization).

Data Parallel SGD in a Nutshell
o Objective: min_x f(x, a). Each worker holds the model x and a partition of the data a_i.
o A data source (sensor, database) feeds a storage device (DRAM, CPU cache); a computation device (GPU, CPU, FPGA) computes the gradient g.
o With two workers, the models are combined by averaging: (x1 + x2) / 2.

Different algorithms have different convergence and system profiles. Let's walk through some of them.

Baseline: Centralized, Synchronous, Lossless SGD
Mathematical formulation (synchronous average over n workers):
    x_{t+1} = x_t - γ · (1/n) · Σ_{i=1..n} g(x_t; a_i)
Convergence: O(1/√(nT)). Goal 1: keep this similar. Goal 2: make the system profile faster.
System profile: lossless data movement; computation followed by communication.
Idea: distribute batch gradient calculation to multiple workers; synchronize the workers with a central server (or AllReduce).
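To make the baseline concrete, here is a minimal single-process simulation of the update above (an illustrative sketch with a made-up least-squares objective, not Bagua's implementation): each of n workers computes a gradient on its own data partition, the gradients are averaged losslessly, and every worker applies the same step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, steps, gamma = 4, 10, 200, 0.1

# Synthetic least-squares problem, with the samples partitioned across workers.
A = rng.normal(size=(400, dim))
b = A @ rng.normal(size=dim) + 0.01 * rng.normal(size=400)
shards = np.array_split(np.arange(400), n_workers)

def grad(x, idx):
    """Gradient of the least-squares loss on one worker's data shard."""
    Ai, bi = A[idx], b[idx]
    return 2.0 * Ai.T @ (Ai @ x - bi) / len(idx)

x = np.zeros(dim)                          # the single, shared model
for t in range(steps):
    # Each worker i computes g(x_t; a_i) on its own partition.
    grads = [grad(x, shard) for shard in shards]
    # Centralized, synchronous, lossless aggregation (what AllReduce computes).
    avg_grad = np.mean(grads, axis=0)
    # x_{t+1} = x_t - gamma * (1/n) * sum_i g(x_t; a_i)
    x = x - gamma * avg_grad

print("final loss:", np.mean((A @ x - b) ** 2))
```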

Centralized, Synchronous, Quantized SGD
Mathematical formulation (synchronous average of quantized gradients):
    x_{t+1} = x_t - γ · (1/n) · Σ_{i=1..n} Q(g(x_t; a_i))
Convergence: O(1/√(nT) + ε/√T), where ε is the quantization error.
System profile: lossy data movement; centralized and quantized.
Idea: centralized synchronous SGD + quantized gradients (smaller communication payload).
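The only change relative to the baseline is that each worker sends Q(g) instead of g. Below is a minimal sketch of one possible quantizer, a simple 8-bit min-max scheme chosen purely for illustration (it is not Bagua's quantization kernel):

```python
import numpy as np

def quantize_8bit(g):
    """Illustrative min-max quantization of a gradient tensor to uint8."""
    lo, hi = g.min(), g.max()
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    q = np.round((g - lo) / scale).astype(np.uint8)   # payload: 1 byte/entry + 2 floats
    return q, lo, scale

def dequantize_8bit(q, lo, scale):
    """Reconstruct an approximate gradient from the quantized payload."""
    return q.astype(np.float64) * scale + lo

rng = np.random.default_rng(0)
g = rng.normal(size=1000)

q, lo, scale = quantize_8bit(g)
g_hat = dequantize_8bit(q, lo, scale)

# The payload shrinks several-fold versus full precision, at the price of a
# bounded quantization error that shows up in the convergence rate.
print("payload bytes:", q.nbytes + 16, "vs", g.nbytes)
print("quantization error:", np.linalg.norm(g_hat - g))
```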

Centralized, Asynchronous, Lossless SGD
Mathematical formulation (the update uses a stale model, with staleness τ caused by asynchrony):
    x_{t+1} = x_t - γ · g(x_{t-τ}; a_i)
Convergence: O(1/√T).
System profile: lossless data movement; asynchronous.
Idea: distribute batch gradient calculation to multiple workers; workers update the global model concurrently (less synchronization time).
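A minimal single-process simulation of the asynchronous update, with the staleness τ drawn at random to mimic workers pushing gradients without waiting for each other (the least-squares objective and the staleness bound are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, steps, gamma, max_staleness = 10, 400, 0.05, 3

A = rng.normal(size=(400, dim))
b = A @ rng.normal(size=dim)

def grad(x, idx):
    Ai, bi = A[idx], b[idx]
    return 2.0 * Ai.T @ (Ai @ x - bi) / len(idx)

history = [np.zeros(dim)]                  # history[t] = x_t, so stale models can be read
for t in range(steps):
    x_t = history[-1]
    # An asynchronous worker pushes a gradient computed on a stale model x_{t - tau}.
    tau = int(rng.integers(0, max_staleness + 1))
    x_stale = history[max(0, len(history) - 1 - tau)]
    idx = rng.integers(0, 400, size=32)    # the worker's current mini-batch
    # x_{t+1} = x_t - gamma * g(x_{t - tau}; a_i)  -- no synchronization barrier
    history.append(x_t - gamma * grad(x_stale, idx))

x = history[-1]
print("final loss:", np.mean((A @ x - b) ** 2))
```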

Decentralized, Synchronous, Lossless SGD
Mathematical formulation (decentralized average; general form: a weighted average of x_{t,1}, ..., x_{t,n}):
    x_{t+1,i} = WeightedAvg(x_{t,1}, ..., x_{t,n}) - γ · g(x_{t,i}; a_i)
Convergence: O(1/√(nT)); ρ: network topology constant.
System profile: lossless data movement; decentralized.
Idea: every worker updates gradients on its local model; workers average the local model with their neighbors (partial average).
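A minimal single-process simulation of decentralized synchronous SGD on a ring topology (the ring, the uniform 1/3 mixing weights, and the least-squares objective are illustrative assumptions): each worker averages only with its two neighbors and then takes a local gradient step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, steps, gamma = 4, 10, 300, 0.1

A = rng.normal(size=(400, dim))
b = A @ rng.normal(size=dim)
shards = np.array_split(np.arange(400), n)

def grad(x, idx):
    Ai, bi = A[idx], b[idx]
    return 2.0 * Ai.T @ (Ai @ x - bi) / len(idx)

# Each worker keeps its own local model; there is no central server.
X = [np.zeros(dim) for _ in range(n)]

for t in range(steps):
    # Partial average on a ring: worker i mixes only with its two neighbors.
    mixed = [(X[i - 1] + X[i] + X[(i + 1) % n]) / 3.0 for i in range(n)]
    # x_{t+1,i} = WeightedAvg(neighbors) - gamma * g(x_{t,i}; a_i)
    X = [mixed[i] - gamma * grad(X[i], shards[i]) for i in range(n)]

consensus = np.mean(X, axis=0)
print("final loss at the averaged model:", np.mean((A @ consensus - b) ** 2))
```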

Stacking them together is possible, but far from trivial. Let me show you one example.

Decentralized, Synchronous, Quantized SGD
Mathematical formulation (decentralized average with quantized updates):
    z = WeightedAvg(x_{t,1}, ..., x_{t,n}) - γ · g(x_{t,i}; a_i) - x_{t,i}
    x_{t+1,i} = x_{t,i} + Q(z)
Convergence: O(1/√(nT)) plus a higher-order term of order 1/T^{2/3}; ρ: network topology constant, ε: quantization error.
System profile: lossy data movement; decentralized and quantized.
Idea: decentralized communication pattern as before; quantized information sharing between neighbors (smaller communication payload).
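Here is the same ring simulation with the quantized update rule above, where each worker applies Q to the difference z it would otherwise communicate (the 8-bit min-max quantizer and the problem setup are illustrative assumptions, not Bagua's kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, steps, gamma = 4, 10, 300, 0.1

A = rng.normal(size=(400, dim))
b = A @ rng.normal(size=dim)
shards = np.array_split(np.arange(400), n)

def grad(x, idx):
    Ai, bi = A[idx], b[idx]
    return 2.0 * Ai.T @ (Ai @ x - bi) / len(idx)

def Q(z, bits=8):
    """Illustrative min-max quantizer, applied to the update z rather than the model."""
    lo, hi = z.min(), z.max()
    if hi == lo:
        return z
    scale = (hi - lo) / (2 ** bits - 1)
    return np.round((z - lo) / scale) * scale + lo

X = [np.zeros(dim) for _ in range(n)]
for t in range(steps):
    mixed = [(X[i - 1] + X[i] + X[(i + 1) % n]) / 3.0 for i in range(n)]
    new_X = []
    for i in range(n):
        # z is the move worker i wants to make; in a distributed run only Q(z)
        # would need to be communicated to its neighbors.
        z = mixed[i] - gamma * grad(X[i], shards[i]) - X[i]
        # x_{t+1,i} = x_{t,i} + Q(z)
        new_X.append(X[i] + Q(z))
    X = new_X

consensus = np.mean(X, axis=0)
print("final loss at the averaged model:", np.mean((A @ consensus - b) ** 2))
```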

Overview
o Theoretical foundation: correctness.
o System design: performance.
o Empirical evaluation: guideline for practitioners.

System Design
o A naive implementation might not give you any observable speed-ups. To fully take advantage of the flexibilities provided by system relaxations, we need to carefully optimize the system implementations.
o It has been an interesting journey; some things we learned:
  - Scatter-Gather-style design
  - Communication/computation overlapping
  - Optimizing for different communication channels: local vs. remote
  - A carefully implemented quantization kernel
o None of these is specifically hard, but getting the details right is crucial for end-to-end performance.

Bagua: System Design - Centralized Communication
o For centralized communication, we use a Scatter-Gather-style implementation: each worker is responsible for a partition of the model; all other workers send their updates (gradients) of a partition to the responsible worker and pull the latest version of this partition.
o Benefits: takes advantage of the aggregated network bandwidth (like AllReduce); easy to support both synchronous and asynchronous communication (unlike AllReduce).
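A minimal sketch of the Scatter-Gather pattern using standard torch.distributed collectives (an illustration of the idea, not Bagua's code; it assumes the process group has already been initialized, e.g. via torchrun): each worker reduces the model partition it owns and then gathers the latest version of every partition. In the synchronous lossless case this computes exactly what AllReduce computes, but the owner of each partition becomes a natural place to plug in asynchronous or quantized updates.

```python
import torch
import torch.distributed as dist

def scatter_gather_allreduce(flat_grad: torch.Tensor) -> torch.Tensor:
    """Average a flattened gradient across workers in Scatter-Gather style."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # Pad so the tensor splits evenly into one chunk per worker.
    chunk = (flat_grad.numel() + world_size - 1) // world_size
    padded = torch.zeros(chunk * world_size, dtype=flat_grad.dtype, device=flat_grad.device)
    padded[: flat_grad.numel()] = flat_grad
    chunks = list(padded.chunk(world_size))

    # Scatter phase: every worker sends chunk i to worker i, which sums its partition.
    my_part = torch.empty_like(chunks[rank])
    dist.reduce_scatter(my_part, chunks, op=dist.ReduceOp.SUM)

    # Gather phase: everyone pulls the latest version of every partition.
    gathered = [torch.empty_like(my_part) for _ in range(world_size)]
    dist.all_gather(gathered, my_part)

    return torch.cat(gathered)[: flat_grad.numel()] / world_size
```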

Bagua: System Design - Decentralized Communication
o For decentralized communication, simply pick a random worker and send it the current version of the local model.
o Benefits: less expensive in terms of latency (compared with Ring-AllReduce / Scatter-Gather); saves 2x in terms of bandwidth (compared with AllReduce / Scatter-Gather); easy to support both synchronous and asynchronous communication.

Bagua: System Design - Overlapping & Batching
o Overlap communication and computation: communicate a layer's gradients while the backward pass is still running on the remaining layers.
o Reduce the number of communications by batching small messages.
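A minimal sketch of both ideas with PyTorch tensor hooks (illustrative only, not Bagua's scheduler; the bucket size and helper names are assumptions): gradients are pushed into a bucket as soon as backward produces them, each full bucket is flattened and all-reduced asynchronously while backward keeps running, and the averaged values are written back once backward has finished.

```python
import torch
import torch.distributed as dist

BUCKET_BYTES = 1 << 20          # flush a bucket once it holds ~1 MB of gradients
_bucket, _bucket_bytes, _pending = [], 0, []

def _flush_bucket():
    """Flatten the bucketed gradients and all-reduce them in one message."""
    global _bucket, _bucket_bytes
    if not _bucket:
        return
    flat = torch.cat([g.reshape(-1) for _, g in _bucket])
    handle = dist.all_reduce(flat, op=dist.ReduceOp.SUM, async_op=True)
    _pending.append((handle, flat, _bucket))
    _bucket, _bucket_bytes = [], 0

def _make_hook(param):
    def hook(grad):
        global _bucket_bytes
        _bucket.append((param, grad))
        _bucket_bytes += grad.numel() * grad.element_size()
        if _bucket_bytes >= BUCKET_BYTES:
            _flush_bucket()      # communication starts while backward continues
        return grad
    return hook

def install(model):
    """Register a hook on every parameter so gradients are bucketed during backward."""
    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(_make_hook(p))

def finish(world_size):
    """Call after loss.backward(): flush the last bucket and write averages back."""
    _flush_bucket()
    for handle, flat, bucket in _pending:
        handle.wait()
        offset = 0
        for param, grad in bucket:
            n = grad.numel()
            param.grad = (flat[offset:offset + n] / world_size).view_as(grad)
            offset += n
    _pending.clear()
```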

Bagua: System Design - Heterogeneity
o Support different algorithms within a machine and between machines, each machine keeping a machine-level local model.
o Within a machine: computation cost dominates communication cost.
o Between machines: communication cost dominates computation cost.
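One way to sketch this hierarchy with torch.distributed process groups (an illustration assuming a torchrun-style launch where LOCAL_WORLD_SIZE ranks share each machine, not Bagua's implementation): use a cheap lossless all-reduce inside every machine and keep the cross-machine step separate, which is where a quantized or decentralized primitive would be substituted.

```python
import os
import torch
import torch.distributed as dist

def build_groups():
    """Create one group per machine plus one group of machine 'leaders'."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    local_size = int(os.environ.get("LOCAL_WORLD_SIZE", "1"))
    n_nodes = world_size // local_size

    intra_groups = [dist.new_group(list(range(n * local_size, (n + 1) * local_size)))
                    for n in range(n_nodes)]
    inter_group = dist.new_group([n * local_size for n in range(n_nodes)])
    return intra_groups[rank // local_size], inter_group, local_size

def hierarchical_average(tensor, intra_group, inter_group, local_size):
    # Machine level: cheap, lossless synchronous average within the machine.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=intra_group)
    tensor /= local_size
    # Between machines: only the machine leaders talk to each other; this is
    # where a quantized or decentralized algorithm would replace the all-reduce.
    if dist.get_rank() % local_size == 0:
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=inter_group)
        tensor /= dist.get_world_size() // local_size
    # Share the cross-machine result with the rest of the machine.
    leader = (dist.get_rank() // local_size) * local_size
    dist.broadcast(tensor, src=leader, group=intra_group)
    return tensor
```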

Overview
o Theoretical foundation: correctness.
o System design: performance.
o Empirical evaluation: guideline for practitioners.

Experiments: End-to-end Performance
o BERT-base: all algorithms reach 0.93 test accuracy in 10 epochs.
o Setup: V100 GPUs, 100Gb TCP network; single GPU = 22455 seconds/epoch.
o [Bar chart: speedup over a single GPU for AllReduce (Horovod), AllReduce (BytePS), QSGD (Bagua), Decentralized SGD (Bagua), and Async SGD (Bagua).]
o Takeaway: QSGD, Decentralized SGD, and Async SGD are 50% faster.

Experiments: Tradeoff (High Latency Network)
o BERT-base: all algorithms reach 0.93 test accuracy in 10 epochs.
o Setup: V100 GPUs, 100Gb TCP network with 5ms latency; single GPU = 22455 seconds/epoch.
o [Bar chart: speedup over a single GPU for AllReduce (Horovod), AllReduce (BytePS), QSGD (Bagua), and Decentralized SGD (Bagua).]
o Takeaway: decentralized algorithms are much faster under high latency.

Experiments: Tradeoff (Low Bandwidth Network)
o BERT-base: all algorithms reach 0.93 test accuracy in 10 epochs.
o Setup: V100 GPUs, 5Gb TCP network; single GPU = 22455 seconds/epoch.
o [Bar chart: speedup over a single GPU for AllReduce (Horovod), AllReduce (BytePS), QSGD (Bagua), and Decentralized SGD (Bagua).]
o Takeaway: quantized algorithms are much faster under low bandwidth.

Experiments: Tradeoff (With Straggler)
o BERT-base: all algorithms reach 0.93 test accuracy in 10 epochs.
o Setup: V100 GPUs, 100Gb TCP network, with one GPU's clock frequency limited to 877MHz (memory) / 585MHz (graphics); single GPU = 22455 seconds/epoch.
o [Bar chart: speedup over a single GPU for AllReduce (Horovod), QSGD (Bagua), Decentralized SGD (Bagua), and Async SGD (Bagua).]
o Takeaway: async algorithms are much faster with straggler workers.

Thank You
BAGUA
