BAGUA: A Modular & Efficient Communication Layer for Distributed Learning
Xiangru Lian, Rui Wang, Hongmei Shi, Shengzhuo Zhang, Yan Tian, Ji Liu (Kuaishou)
Shaoduo Gan, Jiawei Jiang, Ce Zhang (ETH Zürich)

Deep Learning Eating the World
o Fascinating progress over the last decade, fueled by increasingly large datasets and computation power.
o The availability of computation changes our view on what we can do: EC2 instance g2.8xlarge (2015, 4x GRID K520, 4.89 TFLOPS, $2.60/hour) vs. EC2 instance p4d.24xlarge ($32.77/hour).
o At Kuaishou: recommendation systems (behaviors of 300+M daily users), multimedia understanding (tens of thousands of years of watch time accumulated daily), video preprocessing (200+ videos uploaded per second).
o What used to take weeks now takes hours (think ImageNet), but users get greedy with new tasks that take weeks even on a beefy machine (think BERT).

Existing Systems and Our Vision
o Question: how to accommodate the ever-growing demand of ML training on an ever-growing scale of data?
o Could be algorithmic solutions: better models, algorithms, optimizers.
o Could be system solutions: SCALE, SCALE, SCALE.
o We focus on scaling with data parallelism: each worker holds a partition of the data, and the workers jointly train a single ML model.
o GAP: current amazing systems (ByteDance BytePS, NVIDIA Apex, Microsoft DeepSpeed) do not support recently developed amazing techniques: decentralized training, asynchronous training, communication quantization, communication sparsification.
o OUR GOAL (BAGUA): distributed learning with SOTA communication optimization techniques.
Bagua: Modular Communication ("Gossip")
o Primitive communication patterns: Centralized, Asynchronous, Decentralized.
o Primitive logical channels: Lossless, Sparsification, Quantization.
o Modular composition across the deep learning training stack:
  o optimization algorithms (SGD, ADAM),
  o deep learning libraries, readily applicable (PyTorch, TensorFlow) - a usage sketch follows below,
  o physical communication channels (NCCL, MPI, RDMA, TCP/IP).
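To make "readily applicable" concrete, here is a minimal sketch of how a PyTorch script is typically wired up with Bagua. The module paths, the with_bagua call, and the algorithm class name follow my reading of the open-source bagua package and may differ across versions; treat this as an assumption-laden sketch rather than official usage.

```python
# Hypothetical minimal Bagua + PyTorch wiring (names assumed from the
# open-source `bagua` package; check the version you install).
import torch
import torch.nn as nn
import bagua.torch_api as bagua
from bagua.torch_api.algorithms import gradient_allreduce

bagua.init_process_group()                 # one process per GPU (launched e.g. via bagua's launcher)
torch.cuda.set_device(bagua.get_local_rank())

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Swapping the algorithm object is how the communication pattern is changed
# (e.g. a decentralized or quantized variant) without touching the training loop.
model = model.with_bagua([optimizer], gradient_allreduce.GradientAllReduceAlgorithm())

for _ in range(10):
    x = torch.randn(32, 1024).cuda()
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```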
Bagua: Challenges
o How to combine all communication patterns: new algorithms and theoretical analysis are needed for many of the combinations.
o How to provide efficient implementations for all communication patterns: careful system optimizations are needed to unleash the full potential of each pattern.
o How to depict the tradeoff between communication patterns: careful empirical evaluations and analytical cost modeling.

Overview
o Theoretical foundation: correctness.
o System design: performance.
o Empirical evaluation: guidelines for practitioners.

Bagua: Theoretical Foundation
o Built on "Distributed Learning Systems with First-Order Methods" (Ji Liu and Ce Zhang).
o Covers the communication primitive patterns (Centralized, Asynchronous, Decentralized) and logical channels (Lossless, Sparsification, Quantization).
Data Parallel SGD in a Nutshell
o Objective: min_x f(x, a), with stochastic gradient g computed on data samples a_i.
o A data source (sensor, database) feeds a storage device (DRAM, CPU cache); a computation device (GPU, CPU, FPGA) holds the model x and computes gradients on the data a_i.
o With multiple workers, each worker computes gradients on its own partition of the data, and the workers' models are averaged, e.g. (x_1 + x_2) / 2 for two workers. (A sketch follows below.)
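A minimal single-process sketch of the picture above, simulating two data-parallel workers on a toy least-squares problem; all names are illustrative and unrelated to Bagua's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: f(x, (A, b)) = ||A x - b||^2 / (2 m)
def grad(x, A, b):
    return A.T @ (A @ x - b) / len(b)

# Two workers, each holding its own shard of the data.
A = rng.normal(size=(200, 10)); b = rng.normal(size=200)
shards = [(A[:100], b[:100]), (A[100:], b[100:])]

x = np.zeros(10)          # the single model all workers jointly train
lr = 0.1
for step in range(100):
    # Each worker starts from the synchronized model and takes a local gradient step.
    local_models = [x - lr * grad(x, A_i, b_i) for A_i, b_i in shards]
    # Synchronization: average the workers' models, e.g. (x1 + x2) / 2.
    x = sum(local_models) / len(local_models)
```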
Different algorithms have different convergence and system profiles. Let's walk through some of them.

Baseline: Centralized, Synchronous, Lossless SGD
o Per-worker loop: x = sync_model(); a = get_data(); compute a local gradient; update the model.
o Mathematical formulation (synchronous average):
  x_{t+1} = x_t - γ · (1/n) · Σ_{i=1..n} g(x_t; a_i)
o Convergence: O(1/√(nT)). (Goal 1: keep this similar.)
o System profile: centralized, lossless data movement. (Goal 2: make this faster.)
o Idea: distribute the batch-gradient computation across multiple workers; synchronize the workers through a central server (or AllReduce).
o [Figure: per-iteration timeline of computation and communication.]
(An AllReduce-based sketch follows below.)
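A sketch of this baseline with plain torch.distributed (assuming the process group is already initialized, one process per worker); this is generic PyTorch code to illustrate the update rule, not Bagua's implementation. The loss_fn signature is illustrative.

```python
import torch
import torch.distributed as dist

def centralized_sync_step(model, data_batch, loss_fn, lr):
    """One iteration of centralized, synchronous, lossless SGD."""
    loss = loss_fn(model, data_batch)   # local forward pass on this worker's shard
    model.zero_grad()
    loss.backward()                     # local gradient g(x_t; a_i)
    world_size = dist.get_world_size()
    with torch.no_grad():
        for p in model.parameters():
            # Lossless synchronization: sum the gradients across workers, then average.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
            p -= lr * p.grad            # x_{t+1} = x_t - γ * (1/n) Σ_i g(x_t; a_i)
```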
13、ed,Synchronous,Quantized,Mathematical FormulationSynchronous AverageX=sync model C)Q(xa)中品xt+1=x-目=get_data()Convergence國pexb15-中O(1/nTe/VT)Centralized QuantizedQuantization error:e姐oacpdn=區Lossy Data MovementSystem ProfileIdea學Centralized synchronous SGD + Quantized gradientsCsmaller communication
Centralized, Asynchronous, Lossless SGD
o Mathematical formulation (asynchronous average):
  x_{t+1} = x_t - γ · g(x_{t-τ}; a_i), where τ is the staleness caused by asynchrony.
o Convergence: O(1/√(nT)) when the staleness is bounded.
o System profile: asynchronous, lossless data movement.
o Idea: distribute the batch-gradient computation across multiple workers; workers update the global model concurrently (less synchronization time).
o [Figure: per-iteration timeline with communication overlapped with computation, no global barrier.]
(A toy asynchronous-update sketch follows below.)
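A toy illustration of the stale-gradient update rule using Python threads as "workers" that share one model; a real asynchronous system runs across machines with a parameter-server-style shared model, so this is only a sketch of the semantics.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 10)); b = rng.normal(size=1000)
x = np.zeros(10)                  # shared "global" model
lock = threading.Lock()
lr = 0.05

def worker(shard):
    global x
    A_i, b_i = shard
    for _ in range(200):
        x_stale = x.copy()        # read the model without waiting for others (may be stale)
        g = A_i.T @ (A_i @ x_stale - b_i) / len(b_i)
        with lock:                # apply the update as soon as it is ready
            x = x - lr * g        # x_{t+1} = x_t - γ g(x_{t-τ}; a_i)

shards = [(A[i::4], b[i::4]) for i in range(4)]
threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
for t in threads: t.start()
for t in threads: t.join()
```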
Decentralized, Synchronous, Lossless SGD
o Mathematical formulation (decentralized average): each worker i updates
  x_{t+1,i} = WeightedAvg(x_{t,1}, ..., x_{t,n}) - γ · g(x_{t,i}; a_i)
  General form: a weighted average of x_{t,1}, ..., x_{t,n}; in practice only a worker's neighbors get nonzero weight.
o Convergence: O(1/√(nT)); ρ: network topology constant.
o System profile: decentralized, lossless data movement.
o Idea: every worker applies gradients to its own local model; workers average their local models only with their neighbors (partial average).
o [Figure: per-iteration timeline with neighbor communication instead of global synchronization.]
(A ring-topology sketch follows below.)
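A single-process simulation of one decentralized step on a ring topology; the mixing weights (1/2 for self, 1/4 per neighbor) are a common illustrative choice, not Bagua's specific configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lr = 8, 10, 0.1
A = rng.normal(size=(n, 100, d)); b = rng.normal(size=(n, 100))   # each worker's data shard
x = np.zeros((n, d))                                               # each worker's local model

def grad(x_i, A_i, b_i):
    return A_i.T @ (A_i @ x_i - b_i) / len(b_i)

for step in range(100):
    # Partial average: each worker mixes only with its two ring neighbors.
    mixed = 0.5 * x + 0.25 * np.roll(x, 1, axis=0) + 0.25 * np.roll(x, -1, axis=0)
    # Local gradient step, computed on each worker's own (pre-mixing) model and shard.
    x = np.array([mixed[i] - lr * grad(x[i], A[i], b[i]) for i in range(n)])
```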
Stacking these building blocks together is possible, but far from trivial. Let me show you one example.
Decentralized, Synchronous, Quantized SGD
o Mathematical formulation (decentralized average with quantized communication): each worker i computes
  z = WeightedAvg(x_{t,1}, ..., x_{t,n}) - γ · g(x_{t,i}; a_i) - x_{t,i}
  x_{t+1,i} = x_{t,i} + Q(z)
o Convergence: O(1/√(nT) + ε/T^(2/3)); ρ: network topology constant, ε: quantization error.
o System profile: decentralized, quantized (lossy) data movement.
o Idea: the same decentralized communication pattern as before, but with quantized information shared between neighbors (smaller communication payload).
o [Figure: per-iteration timeline with a shorter neighbor-communication phase.]
(A sketch combining the two previous sketches follows below.)
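Combining the two previous sketches: each worker applies, and would only need to communicate, a quantized correction z rather than a full-precision model. Again a simulation of the update rule above, not a production kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lr = 8, 10, 0.1
A = rng.normal(size=(n, 100, d)); b = rng.normal(size=(n, 100))
x = np.zeros((n, d))

def grad(x_i, A_i, b_i):
    return A_i.T @ (A_i @ x_i - b_i) / len(b_i)

def Q(z, levels=255):
    """Toy min-max quantizer standing in for the Q(.) operator."""
    lo, hi = z.min(), z.max()
    scale = (hi - lo) / levels or 1.0
    return np.round((z - lo) / scale) * scale + lo

for step in range(100):
    mixed = 0.5 * x + 0.25 * np.roll(x, 1, axis=0) + 0.25 * np.roll(x, -1, axis=0)
    new_x = np.empty_like(x)
    for i in range(n):
        # z is the change of worker i's model this round (what would be communicated).
        z = mixed[i] - lr * grad(x[i], A[i], b[i]) - x[i]
        new_x[i] = x[i] + Q(z)       # only the quantized correction is applied/shared
    x = new_x
```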
Overview
o Theoretical foundation: correctness.
o System design: performance.
o Empirical evaluation: guidelines for practitioners.
System Design
o A naive implementation might not give you any observable speed-ups.
o To fully take advantage of the flexibility provided by system relaxations, we need to carefully optimize the system implementation.
o It has been an interesting journey; some things we learned:
  o Scatter-gather-style design.
  o Communication/computation overlapping.
  o Optimizing for different communication channels: local vs. remote.
  o Carefully implemented quantization kernels.
o None of these is especially hard on its own, but getting the details right is crucial for end-to-end performance.
Bagua: System Design - Centralized Communication
o For centralized communication, we use a scatter-gather-style implementation: each worker is responsible for a partition of the model; all other workers send their updates (gradients) for a partition to the responsible worker and pull the latest version of that partition back.
o [Figure: Worker 1/2/3, each with a local model and a partition of the global model; arrows for updates (gradients) out and the latest version of the partition back.]
o Benefits:
  o Takes advantage of the aggregated network bandwidth (like AllReduce).
  o Easy to support both synchronous and asynchronous communication (unlike AllReduce).
(See the sketch below for how the synchronous scatter-gather flow maps onto collective primitives.)
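For the synchronous case, the same scatter-gather flow can be written with standard collectives: a reduce-scatter aggregates each partition on its responsible worker, and an all-gather sends the updated partitions back. The torch.distributed sketch below only illustrates the pattern; Bagua's backend implements this itself (and also supports the asynchronous variant, which plain collectives do not).

```python
import torch
import torch.distributed as dist

def scatter_gather_average(flat_grad: torch.Tensor):
    """Average a flattened gradient across workers, partition by partition.

    Assumes the NCCL backend is initialized and flat_grad's length is divisible
    by the world size.
    """
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    chunks = list(flat_grad.chunk(world_size))          # one partition per worker
    my_partition = torch.empty_like(chunks[rank])

    # Scatter phase: every worker sends each partition to its responsible worker,
    # which reduces (sums) the contributions for the partition it owns.
    dist.reduce_scatter(my_partition, chunks, op=dist.ReduceOp.SUM)
    my_partition /= world_size

    # Gather phase: every worker pulls the latest version of all partitions back.
    dist.all_gather(chunks, my_partition)
    return torch.cat(chunks)
```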
Bagua: System Design - Decentralized Communication
o For decentralized communication, each worker simply picks a random peer and sends it the current version of its local model.
o [Figure: Worker 1/2/3, each with a local model; "send local model to a randomly chosen machine".]
o Benefits:
  o Less expensive in terms of latency (compared with Ring-AllReduce / scatter-gather).
  o Saves about 2x in bandwidth (compared with AllReduce / scatter-gather).
  o Easy to support both synchronous and asynchronous communication.
(A back-of-the-envelope comparison of the per-worker traffic follows below.)
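A back-of-the-envelope comparison of per-worker traffic behind the bandwidth claim, using the standard cost models for these patterns; the 1 GB model size is an assumption for illustration, not from the slides.

```python
# Per-worker bytes sent per iteration (standard cost models, illustrative sizes).
model_bytes = 1_000_000_000        # e.g. a ~1 GB model (assumption for illustration)
n = 8                              # number of workers

# Ring-AllReduce / scatter-gather: reduce-scatter + all-gather,
# each phase moves (n-1)/n of the model out of every worker.
allreduce_sent = 2 * (n - 1) / n * model_bytes

# Decentralized peer exchange: send the local model to one randomly chosen peer.
decentralized_sent = 1 * model_bytes

print(f"AllReduce-style: {allreduce_sent / 1e9:.2f} GB sent per worker")
print(f"Decentralized:   {decentralized_sent / 1e9:.2f} GB sent per worker")
# Roughly 2x less traffic as n grows, and a single communication round
# instead of 2(n-1) latency-bound ring steps.
```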
Bagua: System Design - Overlapping & Batching
o Overlap communication with computation: start communicating gradients during the backward pass, while earlier layers are still being computed.
o Reduce the number of communications by batching small messages.
o [Figure: input/output, forward and backward passes over the layers, with small messages batched together.]
(A sketch of both ideas using gradient hooks follows below.)
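A minimal sketch of both ideas with plain torch.distributed: per-parameter gradient hooks append gradients to a bucket, and once the bucket is large enough an asynchronous all-reduce is launched so communication overlaps with the rest of the backward pass. The bucket size, the flat copies, and the assumption that gradients are not accumulated across backward calls are all simplifications compared with a real system such as Bagua.

```python
import torch
import torch.distributed as dist

BUCKET_BYTES = 1 << 20                      # flush after ~1 MB of gradients (illustrative)

class BucketedAllReduce:
    """Batch small gradients and all-reduce them asynchronously during backward."""

    def __init__(self, model):
        self.bucket = []                    # (param, grad) pairs waiting to be sent
        self.bucket_bytes = 0
        self.pending = []                   # (work_handle, flat_buffer, params) per launched bucket
        for p in model.parameters():
            if p.requires_grad:
                p.register_hook(self._make_hook(p))

    def _make_hook(self, p):
        def hook(grad):
            self.bucket.append((p, grad))
            self.bucket_bytes += grad.numel() * grad.element_size()
            if self.bucket_bytes >= BUCKET_BYTES:
                self._flush()
            return grad
        return hook

    def _flush(self):
        if not self.bucket:
            return
        params = [p for p, _ in self.bucket]
        flat = torch.cat([g.reshape(-1) for _, g in self.bucket])
        # async_op=True: communication overlaps with the remaining backward computation.
        work = dist.all_reduce(flat, async_op=True)
        self.pending.append((work, flat, params))
        self.bucket, self.bucket_bytes = [], 0

    def synchronize(self):
        """Call after loss.backward(): wait for all buckets, write averaged grads back."""
        self._flush()
        world_size = dist.get_world_size()
        for work, flat, params in self.pending:
            work.wait()
            flat /= world_size
            offset = 0
            for p in params:
                n = p.grad.numel()
                p.grad.copy_(flat[offset:offset + n].view_as(p.grad))
                offset += n
        self.pending = []
```

Usage: wrap the model once (comm = BucketedAllReduce(model)), then after each loss.backward() call comm.synchronize() before optimizer.step().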
Bagua: System Design - Heterogeneity
o Support different algorithms within a machine and between machines.
o Within a machine (fast local links): computation cost > communication cost.
o Between machines: communication cost > computation cost.
o [Figure: machine-level local models, with intra-machine and inter-machine communication.]
(A sketch of such a hierarchical setup follows below.)
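One way to act on this asymmetry, sketched with torch.distributed process groups (a generic hierarchical scheme, not Bagua's exact algorithm): cheap lossless all-reduce inside each machine, and a separate, more communication-efficient exchange between machines.

```python
import torch
import torch.distributed as dist

def make_hierarchical_groups(gpus_per_node: int):
    """Create intra-node and inter-node process groups.

    Assumes ranks are laid out node by node (node id = rank // gpus_per_node).
    new_group() is collective, so every rank creates every group in the same order.
    """
    world = dist.get_world_size()
    num_nodes = world // gpus_per_node
    intra = [dist.new_group([nd * gpus_per_node + i for i in range(gpus_per_node)])
             for nd in range(num_nodes)]
    inter = dist.new_group(list(range(0, world, gpus_per_node)))  # one leader per node
    return intra, inter

def hierarchical_allreduce(tensor, gpus_per_node, intra_groups, inter_group):
    """Average `tensor` over all workers in two stages that match the cost profile:
    a lossless all-reduce inside the machine, then only one rank per machine talks
    across machines (this is where a quantized/decentralized pattern would fit)."""
    rank, world = dist.get_rank(), dist.get_world_size()
    node = rank // gpus_per_node
    leader = node * gpus_per_node

    dist.all_reduce(tensor, group=intra_groups[node])      # intra-node: fast links
    tensor /= gpus_per_node
    if rank == leader:                                      # inter-node: slow network
        dist.all_reduce(tensor, group=inter_group)
        tensor /= world // gpus_per_node
    dist.broadcast(tensor, src=leader, group=intra_groups[node])
    return tensor
```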
Overview
o Theoretical foundation: correctness.
o System design: performance.
o Empirical evaluation: guidelines for practitioners.

Experiments: End-to-end Performance
o BERT-base: all algorithms get 0.93 test accuracy in 10 epochs.
o Setup: V100 GPUs, 100Gb TCP network; single GPU = 22455 seconds/epoch.
o [Bar chart: speedup over a single GPU for Single GPU, AllReduce (Horovod), AllReduce (BytePS), QSGD (Bagua), Decentralized (Bagua), Async (Bagua).]
o QSGD, Decentralized SGD, and Async SGD are ~50% faster.

Experiments: Tradeoff (High Latency Network)
o BERT-base: all algorithms get 0.93 test accuracy in 10 epochs.
o Setup: V100 GPUs, 100Gb TCP network with added latency; single GPU = 22455 seconds/epoch.
o [Bar chart: speedup over a single GPU for Single GPU, AllReduce (Horovod), AllReduce (BytePS), QSGD (Bagua), Decentralized (Bagua).]
o Decentralized algorithms are much faster under high latency.

Experiments: Tradeoff (Low Bandwidth Network)
o BERT-base: all algorithms get 0.93 test accuracy in 10 epochs.
o Setup: V100 GPUs, 5Gb TCP network; single GPU = 22455 seconds/epoch.
o [Bar chart: speedup over a single GPU for Single GPU, AllReduce (Horovod), AllReduce (BytePS), QSGD (Bagua), Decentralized (Bagua).]
o Quantized algorithms are much faster under low bandwidth.

Experiments: Tradeoff (With Straggler)
o BERT-base: all algorithms get 0.93 test accuracy in 10 epochs.
o Setup: V100 GPUs, 100Gb TCP network, with one GPU's clock frequency limited to 877 MHz (memory) / 585 MHz (graphics); single GPU = 22455 seconds/epoch.
o [Bar chart: speedup over a single GPU for Single GPU, AllReduce (Horovod), QSGD (Bagua), Decentralized (Bagua), Async (Bagua).]
o Asynchronous algorithms are much faster with straggler workers.

Thank You
BAGUA