Alibaba Cloud | Worldwide Cloud Services Partner

Whale: A Unified Distributed Training Framework
Ang Wang, Alibaba PAI, Alibaba Cloud
15/12/2020
WWW.ALIBABACLOUD.COM

#page#
Motivation: Models are getting larger

- Models are getting larger and more complex.
- Larger models lead to better results, i.e. lower validation perplexities.
- Model size grows far beyond the pace of hardware upgrades.

[Charts: WebText validation perplexity vs. epoch for models with 345M, 775M, 2.5B and 8.3B parameters; GPU memory (GB) of P4, P100, V100 and A100.]
#page#
Motivation: Data parallelism becomes less optimal for lots of distributed workloads

- Data Parallelism (DP) is widely used in distributed training, as it is simple and easy to implement: the training workload is distributed across devices and gradients are synchronized with AllReduce.
- DP is not always optimal for every distributed training workload.
- It is necessary to find an efficient parallel strategy that can make full use of the resources and speed up the training.

[Figure: distributing the training workload with data parallelism; gradients AllReduce between GPU0 and GPU1.]

#page#
Motivation: Data parallelism becomes less optimal for lots of distributed workloads

- E.g. VGG16: some layers contribute most of the parameters but only a small proportion of the computation, such as the FC layers in VGG16. The large weight size and long communication time lead to poor scalability.
- E.g. T5, GPT-3: the model size is far larger than the memory capacity of a single GPU device. The model cannot be trained unless model parallelism is adopted.
- E.g. BertLarge: it is difficult to increase the batch size on a single GPU device due to the limitation of GPU memory capacity. Computation is hard to overlap with communication, which leads to poor scalability.

#page#
Distributed Model Training Approach

- Data Parallelism (gradients AllReduce across replicas; a generic sketch follows below)
- Pipeline Parallelism
- Operator Sharding
- Hybrid Parallelism
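As a generic illustration of the data parallelism approach listed above (this is not Whale code; the helper and the gradient shapes are hypothetical), each replica computes gradients on its own shard of the batch and an AllReduce averages them so every replica applies the same update:

    # Conceptual illustration of data parallelism (not Whale code).
    # Each replica computes gradients on its own shard of the batch,
    # then an AllReduce averages them so every replica applies an
    # identical update.
    import numpy as np

    def allreduce_mean(per_replica_grads):
        # Hypothetical helper standing in for a NCCL/Horovod AllReduce.
        return [np.mean(np.stack(g), axis=0) for g in zip(*per_replica_grads)]

    # Two replicas, each holding gradients for two tensors of the model.
    grads_gpu0 = [np.array([1.0, 2.0]), np.array([[0.5]])]
    grads_gpu1 = [np.array([3.0, 4.0]), np.array([[1.5]])]

    averaged = allreduce_mean([grads_gpu0, grads_gpu1])
    print(averaged)  # [array([2., 3.]), array([[1.]])]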
#page#
Whale: An industrial distributed training framework

- Supports various distributed strategies within one framework: data parallelism, model parallelism, operator sharding, pipeline parallelism, hybrid parallelism, and automatic parallelism.
- Supports both customized model parallelism and automatic parallel distribution strategies.
- Easy to use and compatible with various sets of DSL APIs.

#page#
Key Design

- Whale IR is introduced as the basic unit to explore and implement distribution strategies.
- Three key abstractions are introduced: Subgraph, Multi-Dimension and Virtual Devices, which are flexible enough to express sophisticated parallel execution plans (a hypothetical sketch follows below).
- Model structures are grouped into OperatorGroups by clustering and pruning in order to speed up the exploration of distributed strategies.
- A meta-driven cost model is used to explore distribution strategies.

[Architecture stack: TF/Keras -> Whale API -> Model Representation (Graph, Operation, Tensor) -> Whale IR -> Parallel Abstraction (Subgraph, Multi-Dimension, Virtual Devices) -> Whale Engine (Clustering, Pruning, Strategy Sharding, Cost Model, Distributed Strategy Search, Execution Plan Generator) -> TF Graph -> TensorFlow Runtime Engine -> Hardware.]
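A minimal sketch, assuming hypothetical data structures (Whale's actual IR and engine internals are not shown on the slides), of how a parallel plan could tie the Subgraph and Virtual Device abstractions together:

    # Hypothetical sketch of the Subgraph / Virtual Device abstractions.
    # The classes and fields below are illustrative assumptions, not
    # Whale's real data structures.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class VirtualDevice:
        gpus: List[str]              # physical GPUs grouped into one virtual device

    @dataclass
    class Subgraph:
        name: str                    # a slice of the model, e.g. a range of layers
        operations: List[str] = field(default_factory=list)

    @dataclass
    class ParallelPlan:
        # Each entry maps a subgraph to a virtual device together with the
        # parallel primitive applied to it (replica / stage / split / pipeline).
        placements: List[Tuple[Subgraph, VirtualDevice, str]]

    plan = ParallelPlan(placements=[
        (Subgraph("backbone"), VirtualDevice(["GPU0", "GPU1"]), "replica"),
        (Subgraph("fc_head"), VirtualDevice(["GPU0", "GPU1"]), "split"),
    ])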
10、cutionPlanGeneratorA meta-driven cost model for exploringTFGraphTensorflow Runtime EngineRuntimedistribution strategies.Hardware#page#Parallel Flow西會全GPUOGPU6Virutal Devico1Device 0GPU1GPU7SPUGPU4 GPU5GPU2GPU8GPU3GPU9GPU4GPU5GPU6GPUOGPU8Virtualize devices as virtual devicesGPU4GPU2GPU10PU7Pipeline p
11、arallelGenerate final parallel planPartition model to subgraphs#page#APIsTwo Sets ofAPIS are defined to contruct hybrid parallelism flexibly and easilyclusterscopeVirtual Device DefinationPrimitive StrategiesGroup physical devices toPrimitives to describe basic parallel types:virtual devices by pred
12、efined replica (data parallelism)layouts: stage (model parallelism)Row Layout split (operator sharding) Column Layout pipeline (pipeline parallelism)Average Layoutauto-parallel(auto parallelism) Spcified Layout#page#Auto ParallelWhale EngineCost ModelMemoryFLOPssubgraph1subgraphoPartition models to
13、subgrahs automaticallyLarger batch size- memory balance of subgraphs-benefit models with batch norm Higher throughput-expplore strategies based on FLOPs/memory/communication#page#ExamplesOData ParallelCDP) DP + Op Sharding Model Parallelwith wh.cluster()with wh.cluster():withwh.stage()with wh.cluste
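As a conceptual illustration only (Whale's actual cost model is meta-driven and its formula is not given on the slides), a strategy search can score candidate subgraph partitions by how balanced their memory and FLOPs are; the candidate names, numbers and scoring below are assumptions:

    # Conceptual sketch of strategy exploration with a simple cost model.
    # The weights, scoring formula and candidate format are illustrative
    # assumptions, not Whale's real cost model.
    def partition_cost(subgraph_memories, subgraph_flops,
                       mem_weight=1.0, flops_weight=1.0):
        """Lower is better: penalize memory and FLOPs imbalance across subgraphs."""
        mem_imbalance = max(subgraph_memories) - min(subgraph_memories)
        flops_imbalance = max(subgraph_flops) - min(subgraph_flops)
        return mem_weight * mem_imbalance + flops_weight * flops_imbalance

    # Two candidate 2-way partitions of the same model, described by the
    # per-subgraph memory (GB) and compute (GFLOPs) they place on a device.
    candidates = {
        "split_after_layer_12": ([9.0, 7.0], [120.0, 110.0]),
        "split_after_layer_16": ([12.0, 4.0], [160.0, 70.0]),
    }

    best = min(candidates, key=lambda k: partition_cost(*candidates[k]))
    print(best)  # "split_after_layer_12": the more balanced partition wins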
#page#
Examples

Data Parallel (DP):

    with wh.cluster():
      with wh.replica():
        out = Model()

DP + Op Sharding:

    with wh.cluster():
      with wh.replica():
        out = ModelPart1()
      with wh.split():
        out = ModelPart2(out)

Model Parallel:

    with wh.cluster():
      with wh.stage():
        out = ModelPart1()
      with wh.stage():
        out = ModelPart2(out)

Pipeline + DP:

    with wh.cluster():
      with wh.replica():
        with wh.pipeline(micro_batch=4):
          with wh.stage():
            out = ModelPart1()
          with wh.stage():
            out = ModelPart2(out)

Auto Parallel:

    wh.auto_parallel()
    out = Model()

#page#
Best practice - Large Classification Task

ResNet-50 with 100,000 classes.

- The model is a ResNet50 backbone (bottleneck blocks, 90M) followed by an FC layer (782M) and Softmax.
- FC parameters = 8.7x ResNet50 parameters (a back-of-the-envelope check follows below).
- With pure data parallelism, the gradients AllReduce of the FC layer becomes the bottleneck.
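A rough back-of-the-envelope check of the 782M / 90M / 8.7x figures above, assuming they denote fp32 parameter bytes, a 2048-dimensional ResNet-50 feature vector, and roughly 23.5M backbone parameters (none of these assumptions are stated on the slide):

    # Back-of-the-envelope check of the parameter sizes quoted on the slide.
    # Assumptions (not from the slide): fp32 weights (4 bytes), ResNet-50
    # feature dimension 2048, ~23.5M parameters in the convolutional backbone.
    num_classes = 100_000
    feature_dim = 2048
    bytes_per_param = 4

    fc_params = feature_dim * num_classes               # ~204.8M parameters
    fc_bytes = fc_params * bytes_per_param               # ~819 MB

    backbone_params = 23.5e6                             # ResNet-50 without the FC head
    backbone_bytes = backbone_params * bytes_per_param   # ~94 MB

    print(fc_bytes / 2**20, backbone_bytes / 2**20)      # ~781 MiB vs ~90 MiB
    print(fc_params / backbone_params)                   # ~8.7x, matching the slide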
#page#
Best practice - Large Classification Task

Op Sharding:

- Shard the FC layer into FC Part0 - FC Part3 with operator sharding.
- Map the subgraphs to virtual devices.
- Hybrid parallelism: DP for the ResNet50 backbone + Op Sharding for the FC layer.

#page#
Best practice - Large Classification Task

ResNet50 with 100,000 classes:

    import whale as wh

    cluster = wh.cluster(layout="all")
    with cluster:
      with wh.replica():
        features = ResNet50(inputs)
      with wh.split():
        logits = FC(features)
        predictions = Softmax(logits)

- Easy to program a hybrid parallelism training workload.
- Could scale up to 100 million classes.
- Using 64 GPUs, Whale Op Sharding achieves 14.8x the throughput of Whale DP.

[Chart: throughput of Whale DP vs. Whale Op Sharding as the number of GPUs scales (32/48/64); Alibaba Cloud ECS, 8x 16G V100 GPUs.]
#page#
Best practice - BertLarge

Pipeline + DP:

    with wh.cluster():
      with wh.replica():
        with wh.pipeline(micro_batch=4):
          with wh.stage():
            output = encoder_layer_1_8(output)
          with wh.stage():
            output = encoder_layer_8_16(output)
          with wh.stage():
            output = encoder_layer_16_24(output)

[Figure: each replica splits BertLarge (embedding, 24 encoder layers, pooler) into three pipeline stages across GPUs; three data-parallel replicas (Replica 0, Replica 1, Replica 2).]

- Gradient AllReduce within a worker: 1.2G.
- Activations sent among workers: 27M (a rough sanity check follows below).
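A rough sanity check of the communication volumes above, assuming fp32 gradients and roughly 340M BertLarge parameters (the slide does not state the precision, sequence length or exact parameter count):

    # Rough sanity check of the communication volumes on the slide.
    # Assumptions (not from the slide): fp32 gradients and ~340M
    # BertLarge parameters.
    bert_large_params = 340e6
    grad_bytes = bert_large_params * 4      # bytes of gradients per replica
    print(grad_bytes / 2**30)               # ~1.27 GiB, close to the quoted 1.2G

    # Activations crossing a pipeline-stage boundary (hidden size 1024
    # x sequence length x micro-batch, in fp32) amount to tens of MB,
    # so inter-stage traffic is far cheaper than the gradient AllReduce.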
#page#
Best practice - BertLarge

- Using 64 GPUs with batch size = 6:
  - Whale Pipeline = 1.34x Whale DP
  - Whale Pipeline = 2.32x Horovod DP

[Chart: throughput of Horovod DP, Whale DP and Whale Pipeline as the number of GPUs scales; Alibaba Cloud ECS, 8x 16G V100 GPUs.]

#page#
Whale in Alibaba Cloud

- Whale has been integrated into both PAI DSW (Data Science Workshop) and PAI DLC.
- The services are now available on Alibaba Cloud.

#page#
Welcome to join the PAI DingTalk group, where more in-depth materials are available. For cooperation inquiries, search for "機器學習PAI" (Machine Learning PAI).