BAGUA: A Modular & Efficient Communication Layer for Distributed Learning
Xiangru Lian, Rui Wang, Hongmei Shi, Shengzhuo Zhang, Yan Tian, Ji Liu (Kuaishou)
Shaoduo Gan, Jiawei Jiang, Ce Zhang (ETH Zürich)

Deep Learning Eating the World
o Fascinating progress over the last decade, fueled by increasingly large datasets and computation power.
o The availability of computation changes our view on what we can do: EC2 instance g2.8xlarge (2015, 4x GRID K520, 4.89 TFLOPS, $2.60/hour) vs. EC2 instance p4d.24xlarge ($32.77/hour).
o At Kuaishou: recommendation systems (behaviors of 300+M daily users), multimedia understanding (tens of thousands of years of watch time accumulated daily), video preprocessing (200+ videos uploaded per second).
o What used to take weeks now takes hours (think ImageNet), but users get greedy with new tasks that take weeks even on a beefy machine (think BERT).

Existing Systems and Our Vision
o Question: how to accommodate the ever-growing demand of ML training on an ever-growing scale of data?
o Could be algorithmic solutions: better models, algorithms, optimizers.
o Could be system solutions: SCALE, SCALE, SCALE.
o We focus on scaling with data parallelism: each worker holds a partition of the data, and the workers jointly train a single ML model.
o GAP: current amazing systems (ByteDance BytePS, NVIDIA Apex, Microsoft DeepSpeed) do not support recently developed amazing techniques: decentralized training, asynchronous training, communication quantization, communication sparsification.
o OUR GOAL (BAGUA): distributed learning with SOTA communication optimization techniques.
Bagua: Modular Communication ("Gossip")
o Primitive communication patterns: Centralized, Asynchronous, Decentralized.
o Primitive logical channels: Lossless, Sparsification, Quantization.
o Modular composition across the deep learning training stack:
  o optimization algorithms (SGD, ADAM),
  o deep learning libraries, readily applicable (PyTorch, TensorFlow) - a usage sketch follows below,
  o physical communication channels (NCCL, MPI, RDMA, TCP/IP).
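To make "readily applicable" concrete, here is a minimal sketch of how a PyTorch script is typically wired up with Bagua. The module paths, the with_bagua call, and the algorithm class name follow my reading of the open-source bagua package and may differ across versions; treat this as an assumption-laden sketch rather than official usage.

```python
# Hypothetical minimal Bagua + PyTorch wiring (names assumed from the
# open-source `bagua` package; check the version you install).
import torch
import torch.nn as nn
import bagua.torch_api as bagua
from bagua.torch_api.algorithms import gradient_allreduce

bagua.init_process_group()                 # one process per GPU (launched e.g. via bagua's launcher)
torch.cuda.set_device(bagua.get_local_rank())

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Swapping the algorithm object is how the communication pattern is changed
# (e.g. a decentralized or quantized variant) without touching the training loop.
model = model.with_bagua([optimizer], gradient_allreduce.GradientAllReduceAlgorithm())

for _ in range(10):
    x = torch.randn(32, 1024).cuda()
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```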
Bagua: Challenges
o How to combine all communication patterns: new algorithms and theoretical analysis are needed for many of the combinations.
o How to provide efficient implementations for all communication patterns: careful system optimizations are needed to unleash the full potential of each pattern.
o How to depict the tradeoff between communication patterns: careful empirical evaluations and analytical cost modeling.

Overview
o Theoretical foundation: correctness.
o System design: performance.
o Empirical evaluation: guidelines for practitioners.

Bagua: Theoretical Foundation
o Built on "Distributed Learning Systems with First-Order Methods" (Ji Liu and Ce Zhang).
o Covers the communication primitive patterns (Centralized, Asynchronous, Decentralized) and logical channels (Lossless, Sparsification, Quantization).
Data Parallel SGD in a Nutshell
o Objective: min_x f(x, a), with stochastic gradient g computed on data samples a_i.
o A data source (sensor, database) feeds a storage device (DRAM, CPU cache); a computation device (GPU, CPU, FPGA) holds the model x and computes gradients on the data a_i.
o With multiple workers, each worker computes gradients on its own partition of the data, and the workers' models are averaged, e.g. (x_1 + x_2) / 2 for two workers. (A sketch follows below.)
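A minimal single-process sketch of the picture above, simulating two data-parallel workers on a toy least-squares problem; all names are illustrative and unrelated to Bagua's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: f(x, (A, b)) = ||A x - b||^2 / (2 m)
def grad(x, A, b):
    return A.T @ (A @ x - b) / len(b)

# Two workers, each holding its own shard of the data.
A = rng.normal(size=(200, 10)); b = rng.normal(size=200)
shards = [(A[:100], b[:100]), (A[100:], b[100:])]

x = np.zeros(10)          # the single model all workers jointly train
lr = 0.1
for step in range(100):
    # Each worker starts from the synchronized model and takes a local gradient step.
    local_models = [x - lr * grad(x, A_i, b_i) for A_i, b_i in shards]
    # Synchronization: average the workers' models, e.g. (x1 + x2) / 2.
    x = sum(local_models) / len(local_models)
```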
Different algorithms have different convergence and system profiles. Let's walk through some of them.

Baseline: Centralized, Synchronous, Lossless SGD
o Per-worker loop: x = sync_model(); a = get_data(); compute a local gradient; update the model.
o Mathematical formulation (synchronous average):
  x_{t+1} = x_t - γ · (1/n) · Σ_{i=1..n} g(x_t; a_i)
o Convergence: O(1/√(nT)). (Goal 1: keep this similar.)
o System profile: centralized, lossless data movement. (Goal 2: make this faster.)
o Idea: distribute the batch-gradient computation across multiple workers; synchronize the workers through a central server (or AllReduce).
o [Figure: per-iteration timeline of computation and communication.]
(An AllReduce-based sketch follows below.)
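A sketch of this baseline with plain torch.distributed (assuming the process group is already initialized, one process per worker); this is generic PyTorch code to illustrate the update rule, not Bagua's implementation. The loss_fn signature is illustrative.

```python
import torch
import torch.distributed as dist

def centralized_sync_step(model, data_batch, loss_fn, lr):
    """One iteration of centralized, synchronous, lossless SGD."""
    loss = loss_fn(model, data_batch)   # local forward pass on this worker's shard
    model.zero_grad()
    loss.backward()                     # local gradient g(x_t; a_i)
    world_size = dist.get_world_size()
    with torch.no_grad():
        for p in model.parameters():
            # Lossless synchronization: sum the gradients across workers, then average.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
            p -= lr * p.grad            # x_{t+1} = x_t - γ * (1/n) Σ_i g(x_t; a_i)
```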
13、ed,Synchronous,Quantized,Mathematical FormulationSynchronous AverageX=sync model C)Q(xa)中品xt+1=x-目=get_data()Convergence國pexb15-中O(1/nTe/VT)Centralized QuantizedQuantization error:e姐oacpdn=區Lossy Data MovementSystem ProfileIdea學Centralized synchronous SGD + Quantized gradientsCsmaller communication
Centralized, Asynchronous, Lossless SGD
o Mathematical formulation (asynchronous average):
  x_{t+1} = x_t - γ · g(x_{t-τ}; a_i), where τ is the staleness caused by asynchrony.
o Convergence: O(1/√(nT)) when the staleness is bounded.
o System profile: asynchronous, lossless data movement.
o Idea: distribute the batch-gradient computation across multiple workers; workers update the global model concurrently (less synchronization time).
o [Figure: per-iteration timeline with communication overlapped with computation, no global barrier.]
(A toy asynchronous-update sketch follows below.)
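A toy illustration of the stale-gradient update rule using Python threads as "workers" that share one model; a real asynchronous system runs across machines with a parameter-server-style shared model, so this is only a sketch of the semantics.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 10)); b = rng.normal(size=1000)
x = np.zeros(10)                  # shared "global" model
lock = threading.Lock()
lr = 0.05

def worker(shard):
    global x
    A_i, b_i = shard
    for _ in range(200):
        x_stale = x.copy()        # read the model without waiting for others (may be stale)
        g = A_i.T @ (A_i @ x_stale - b_i) / len(b_i)
        with lock:                # apply the update as soon as it is ready
            x = x - lr * g        # x_{t+1} = x_t - γ g(x_{t-τ}; a_i)

shards = [(A[i::4], b[i::4]) for i in range(4)]
threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
for t in threads: t.start()
for t in threads: t.join()
```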
Decentralized, Synchronous, Lossless SGD
o Mathematical formulation (decentralized average): each worker i updates
  x_{t+1,i} = WeightedAvg(x_{t,1}, ..., x_{t,n}) - γ · g(x_{t,i}; a_i)
  General form: a weighted average of x_{t,1}, ..., x_{t,n}; in practice only a worker's neighbors get nonzero weight.
o Convergence: O(1/√(nT)); ρ: network topology constant.
o System profile: decentralized, lossless data movement.
o Idea: every worker applies gradients to its own local model; workers average their local models only with their neighbors (partial average).
o [Figure: per-iteration timeline with neighbor communication instead of global synchronization.]
(A ring-topology sketch follows below.)
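A single-process simulation of one decentralized step on a ring topology; the mixing weights (1/2 for self, 1/4 per neighbor) are a common illustrative choice, not Bagua's specific configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lr = 8, 10, 0.1
A = rng.normal(size=(n, 100, d)); b = rng.normal(size=(n, 100))   # each worker's data shard
x = np.zeros((n, d))                                               # each worker's local model

def grad(x_i, A_i, b_i):
    return A_i.T @ (A_i @ x_i - b_i) / len(b_i)

for step in range(100):
    # Partial average: each worker mixes only with its two ring neighbors.
    mixed = 0.5 * x + 0.25 * np.roll(x, 1, axis=0) + 0.25 * np.roll(x, -1, axis=0)
    # Local gradient step, computed on each worker's own (pre-mixing) model and shard.
    x = np.array([mixed[i] - lr * grad(x[i], A[i], b[i]) for i in range(n)])
```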
Stacking these building blocks together is possible, but far from trivial. Let me show you one example.
Decentralized, Synchronous, Quantized SGD
o Mathematical formulation (decentralized average with quantized communication): each worker i computes
  z = WeightedAvg(x_{t,1}, ..., x_{t,n}) - γ · g(x_{t,i}; a_i) - x_{t,i}
  x_{t+1,i} = x_{t,i} + Q(z)
o Convergence: O(1/√(nT) + ε/T^(2/3)); ρ: network topology constant, ε: quantization error.
o System profile: decentralized, quantized (lossy) data movement.
o Idea: the same decentralized communication pattern as before, but with quantized information shared between neighbors (smaller communication payload).
o [Figure: per-iteration timeline with a shorter neighbor-communication phase.]
(A sketch combining the two previous sketches follows below.)
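Combining the two previous sketches: each worker applies, and would only need to communicate, a quantized correction z rather than a full-precision model. Again a simulation of the update rule above, not a production kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lr = 8, 10, 0.1
A = rng.normal(size=(n, 100, d)); b = rng.normal(size=(n, 100))
x = np.zeros((n, d))

def grad(x_i, A_i, b_i):
    return A_i.T @ (A_i @ x_i - b_i) / len(b_i)

def Q(z, levels=255):
    """Toy min-max quantizer standing in for the Q(.) operator."""
    lo, hi = z.min(), z.max()
    scale = (hi - lo) / levels or 1.0
    return np.round((z - lo) / scale) * scale + lo

for step in range(100):
    mixed = 0.5 * x + 0.25 * np.roll(x, 1, axis=0) + 0.25 * np.roll(x, -1, axis=0)
    new_x = np.empty_like(x)
    for i in range(n):
        # z is the change of worker i's model this round (what would be communicated).
        z = mixed[i] - lr * grad(x[i], A[i], b[i]) - x[i]
        new_x[i] = x[i] + Q(z)       # only the quantized correction is applied/shared
    x = new_x
```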
Overview
o Theoretical foundation: correctness.
o System design: performance.
o Empirical evaluation: guidelines for practitioners.
System Design
o A naive implementation might not give you any observable speed-ups.
o To fully take advantage of the flexibility provided by system relaxations, we need to carefully optimize the system implementation.
o It has been an interesting journey; some things we learned:
  o Scatter-gather-style design.
  o Communication/computation overlapping.
  o Optimizing for different communication channels: local vs. remote.
  o Carefully implemented quantization kernels.
o None of these is especially hard on its own, but getting the details right is crucial for end-to-end performance.
Bagua: System Design - Centralized Communication
o For centralized communication, we use a scatter-gather-style implementation: each worker is responsible for a partition of the model; all other workers send their updates (gradients) for a partition to the responsible worker and pull the latest version of that partition back.
o [Figure: Worker 1/2/3, each with a local model and a partition of the global model; arrows for updates (gradients) out and the latest version of the partition back.]
o Benefits:
  o Takes advantage of the aggregated network bandwidth (like AllReduce).
  o Easy to support both synchronous and asynchronous communication (unlike AllReduce).
(See the sketch below for how the synchronous scatter-gather flow maps onto collective primitives.)
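For the synchronous case, the same scatter-gather flow can be written with standard collectives: a reduce-scatter aggregates each partition on its responsible worker, and an all-gather sends the updated partitions back. The torch.distributed sketch below only illustrates the pattern; Bagua's backend implements this itself (and also supports the asynchronous variant, which plain collectives do not).

```python
import torch
import torch.distributed as dist

def scatter_gather_average(flat_grad: torch.Tensor):
    """Average a flattened gradient across workers, partition by partition.

    Assumes the NCCL backend is initialized and flat_grad's length is divisible
    by the world size.
    """
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    chunks = list(flat_grad.chunk(world_size))          # one partition per worker
    my_partition = torch.empty_like(chunks[rank])

    # Scatter phase: every worker sends each partition to its responsible worker,
    # which reduces (sums) the contributions for the partition it owns.
    dist.reduce_scatter(my_partition, chunks, op=dist.ReduceOp.SUM)
    my_partition /= world_size

    # Gather phase: every worker pulls the latest version of all partitions back.
    dist.all_gather(chunks, my_partition)
    return torch.cat(chunks)
```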
Bagua: System Design - Decentralized Communication
o For decentralized communication, each worker simply picks a random peer and sends it the current version of its local model.
o [Figure: Worker 1/2/3, each with a local model; "send local model to a randomly chosen machine".]
o Benefits:
  o Less expensive in terms of latency (compared with Ring-AllReduce / scatter-gather).
  o Saves about 2x in bandwidth (compared with AllReduce / scatter-gather).
  o Easy to support both synchronous and asynchronous communication.
(A back-of-the-envelope comparison of the per-worker traffic follows below.)
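A back-of-the-envelope comparison of per-worker traffic behind the bandwidth claim, using the standard cost models for these patterns; the 1 GB model size is an assumption for illustration, not from the slides.

```python
# Per-worker bytes sent per iteration (standard cost models, illustrative sizes).
model_bytes = 1_000_000_000        # e.g. a ~1 GB model (assumption for illustration)
n = 8                              # number of workers

# Ring-AllReduce / scatter-gather: reduce-scatter + all-gather,
# each phase moves (n-1)/n of the model out of every worker.
allreduce_sent = 2 * (n - 1) / n * model_bytes

# Decentralized peer exchange: send the local model to one randomly chosen peer.
decentralized_sent = 1 * model_bytes

print(f"AllReduce-style: {allreduce_sent / 1e9:.2f} GB sent per worker")
print(f"Decentralized:   {decentralized_sent / 1e9:.2f} GB sent per worker")
# Roughly 2x less traffic as n grows, and a single communication round
# instead of 2(n-1) latency-bound ring steps.
```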
Bagua: System Design - Overlapping & Batching
o Overlap communication with computation: start communicating gradients during the backward pass, while earlier layers are still being computed.
o Reduce the number of communications by batching small messages.
o [Figure: input/output, forward and backward passes over the layers, with small messages batched together.]
(A sketch of both ideas using gradient hooks follows below.)
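A minimal sketch of both ideas with plain torch.distributed: per-parameter gradient hooks append gradients to a bucket, and once the bucket is large enough an asynchronous all-reduce is launched so communication overlaps with the rest of the backward pass. The bucket size, the flat copies, and the assumption that gradients are not accumulated across backward calls are all simplifications compared with a real system such as Bagua.

```python
import torch
import torch.distributed as dist

BUCKET_BYTES = 1 << 20                      # flush after ~1 MB of gradients (illustrative)

class BucketedAllReduce:
    """Batch small gradients and all-reduce them asynchronously during backward."""

    def __init__(self, model):
        self.bucket = []                    # (param, grad) pairs waiting to be sent
        self.bucket_bytes = 0
        self.pending = []                   # (work_handle, flat_buffer, params) per launched bucket
        for p in model.parameters():
            if p.requires_grad:
                p.register_hook(self._make_hook(p))

    def _make_hook(self, p):
        def hook(grad):
            self.bucket.append((p, grad))
            self.bucket_bytes += grad.numel() * grad.element_size()
            if self.bucket_bytes >= BUCKET_BYTES:
                self._flush()
            return grad
        return hook

    def _flush(self):
        if not self.bucket:
            return
        params = [p for p, _ in self.bucket]
        flat = torch.cat([g.reshape(-1) for _, g in self.bucket])
        # async_op=True: communication overlaps with the remaining backward computation.
        work = dist.all_reduce(flat, async_op=True)
        self.pending.append((work, flat, params))
        self.bucket, self.bucket_bytes = [], 0

    def synchronize(self):
        """Call after loss.backward(): wait for all buckets, write averaged grads back."""
        self._flush()
        world_size = dist.get_world_size()
        for work, flat, params in self.pending:
            work.wait()
            flat /= world_size
            offset = 0
            for p in params:
                n = p.grad.numel()
                p.grad.copy_(flat[offset:offset + n].view_as(p.grad))
                offset += n
        self.pending = []
```

Usage: wrap the model once (comm = BucketedAllReduce(model)), then after each loss.backward() call comm.synchronize() before optimizer.step().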
Bagua: System Design - Heterogeneity
o Support different algorithms within a machine and between machines.
o Within a machine (fast local links): computation cost > communication cost.
o Between machines: communication cost > computation cost.
o [Figure: machine-level local models, with intra-machine and inter-machine communication.]
(A sketch of such a hierarchical setup follows below.)
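One way to act on this asymmetry, sketched with torch.distributed process groups (a generic hierarchical scheme, not Bagua's exact algorithm): cheap lossless all-reduce inside each machine, and a separate, more communication-efficient exchange between machines.

```python
import torch
import torch.distributed as dist

def make_hierarchical_groups(gpus_per_node: int):
    """Create intra-node and inter-node process groups.

    Assumes ranks are laid out node by node (node id = rank // gpus_per_node).
    new_group() is collective, so every rank creates every group in the same order.
    """
    world = dist.get_world_size()
    num_nodes = world // gpus_per_node
    intra = [dist.new_group([nd * gpus_per_node + i for i in range(gpus_per_node)])
             for nd in range(num_nodes)]
    inter = dist.new_group(list(range(0, world, gpus_per_node)))  # one leader per node
    return intra, inter

def hierarchical_allreduce(tensor, gpus_per_node, intra_groups, inter_group):
    """Average `tensor` over all workers in two stages that match the cost profile:
    a lossless all-reduce inside the machine, then only one rank per machine talks
    across machines (this is where a quantized/decentralized pattern would fit)."""
    rank, world = dist.get_rank(), dist.get_world_size()
    node = rank // gpus_per_node
    leader = node * gpus_per_node

    dist.all_reduce(tensor, group=intra_groups[node])      # intra-node: fast links
    tensor /= gpus_per_node
    if rank == leader:                                      # inter-node: slow network
        dist.all_reduce(tensor, group=inter_group)
        tensor /= world // gpus_per_node
    dist.broadcast(tensor, src=leader, group=intra_groups[node])
    return tensor
```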
Overview
o Theoretical foundation: correctness.
o System design: performance.
o Empirical evaluation: guidelines for practitioners.

Experiments: End-to-end Performance
o BERT-base: all algorithms get 0.93 test accuracy in 10 epochs.
o Setup: V100 GPUs, 100Gb TCP network; single GPU = 22455 seconds/epoch.
o [Bar chart: speedup over a single GPU for Single GPU, AllReduce (Horovod), AllReduce (BytePS), QSGD (Bagua), Decentralized (Bagua), Async (Bagua).]
o QSGD, Decentralized SGD, and Async SGD are ~50% faster.

Experiments: Tradeoff (High Latency Network)
o BERT-base: all algorithms get 0.93 test accuracy in 10 epochs.
o Setup: V100 GPUs, 100Gb TCP network with added latency; single GPU = 22455 seconds/epoch.
o [Bar chart: speedup over a single GPU for Single GPU, AllReduce (Horovod), AllReduce (BytePS), QSGD (Bagua), Decentralized (Bagua).]
o Decentralized algorithms are much faster under high latency.

Experiments: Tradeoff (Low Bandwidth Network)
o BERT-base: all algorithms get 0.93 test accuracy in 10 epochs.
o Setup: V100 GPUs, 5Gb TCP network; single GPU = 22455 seconds/epoch.
o [Bar chart: speedup over a single GPU for Single GPU, AllReduce (Horovod), AllReduce (BytePS), QSGD (Bagua), Decentralized (Bagua).]
o Quantized algorithms are much faster under low bandwidth.

Experiments: Tradeoff (With Straggler)
o BERT-base: all algorithms get 0.93 test accuracy in 10 epochs.
o Setup: V100 GPUs, 100Gb TCP network, with one GPU's clock frequency limited to 877 MHz (memory) / 585 MHz (graphics); single GPU = 22455 seconds/epoch.
o [Bar chart: speedup over a single GPU for Single GPU, AllReduce (Horovod), QSGD (Bagua), Decentralized (Bagua), Async (Bagua).]
o Asynchronous algorithms are much faster with straggler workers.

Thank You
BAGUA