Alibaba Cloud | Worldwide Cloud Services Partner

Whale: A Unified Distributed Training Framework
Ang Wang, Alibaba PAI, Alibaba Cloud
15/12/2020
WWW.ALIBABACLOUD.COM

#page#
Motivation: Models are getting larger

- Models are getting larger and more complex.
- Larger models lead to better results, i.e. lower validation perplexities.
- Model size grows far beyond the pace of hardware upgrades.

[Charts: WebText validation perplexity vs. epoch for models with 345M, 775M, 2.5B and 8.3B parameters; GPU memory (GB) of P4, P100, V100 and A100.]
#page#
Motivation: Data parallelism becomes less optimal for lots of distributed workloads

- Data Parallelism (DP) is widely used in distributed training, as it is simple and easy to implement: the training workload is distributed across devices and gradients are synchronized with AllReduce.
- DP is not always optimal for every distributed training workload.
- It is necessary to find an efficient parallel strategy that can make full use of the resources and speed up the training.

[Figure: distributing the training workload with data parallelism; gradients AllReduce between GPU0 and GPU1.]

#page#
Motivation: Data parallelism becomes less optimal for lots of distributed workloads

- E.g. VGG16: some layers contribute most of the parameters but only a small proportion of the computation, such as the FC layers in VGG16. The large weight size and long communication time lead to poor scalability.
- E.g. T5, GPT-3: the model size is far larger than the memory capacity of a single GPU device. The model cannot be trained unless model parallelism is adopted.
- E.g. BertLarge: it is difficult to increase the batch size on a single GPU device due to the limitation of GPU memory capacity. Computation is hard to overlap with communication, which leads to poor scalability.

#page#
Distributed Model Training Approach

- Data Parallelism (gradients AllReduce across replicas; a generic sketch follows below)
- Pipeline Parallelism
- Operator Sharding
- Hybrid Parallelism
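As a generic illustration of the data parallelism approach listed above (this is not Whale code; the helper and the gradient shapes are hypothetical), each replica computes gradients on its own shard of the batch and an AllReduce averages them so every replica applies the same update:

    # Conceptual illustration of data parallelism (not Whale code).
    # Each replica computes gradients on its own shard of the batch,
    # then an AllReduce averages them so every replica applies an
    # identical update.
    import numpy as np

    def allreduce_mean(per_replica_grads):
        # Hypothetical helper standing in for a NCCL/Horovod AllReduce.
        return [np.mean(np.stack(g), axis=0) for g in zip(*per_replica_grads)]

    # Two replicas, each holding gradients for two tensors of the model.
    grads_gpu0 = [np.array([1.0, 2.0]), np.array([[0.5]])]
    grads_gpu1 = [np.array([3.0, 4.0]), np.array([[1.5]])]

    averaged = allreduce_mean([grads_gpu0, grads_gpu1])
    print(averaged)  # [array([2., 3.]), array([[1.]])]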
#page#
Whale: An industrial distributed training framework

- Supports various distributed strategies within one framework: data parallelism, model parallelism, operator sharding, pipeline parallelism, hybrid parallelism, and automatic parallelism.
- Supports both customized model parallelism and automatic parallel distribution strategies.
- Easy to use and compatible with various sets of DSL APIs.

#page#
Key Design

- Whale IR is introduced as the basic unit to explore and implement distribution strategies.
- Three key abstractions are introduced: Subgraph, Multi-Dimension and Virtual Devices, which are flexible enough to express sophisticated parallel execution plans (a hypothetical sketch follows below).
- Model structures are grouped into OperatorGroups by clustering and pruning in order to speed up the exploration of distributed strategies.
- A meta-driven cost model is used to explore distribution strategies.

[Architecture stack: TF/Keras -> Whale API -> Model Representation (Graph, Operation, Tensor) -> Whale IR -> Parallel Abstraction (Subgraph, Multi-Dimension, Virtual Devices) -> Whale Engine (Clustering, Pruning, Strategy Sharding, Cost Model, Distributed Strategy Search, Execution Plan Generator) -> TF Graph -> TensorFlow Runtime Engine -> Hardware.]
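A minimal sketch, assuming hypothetical data structures (Whale's actual IR and engine internals are not shown on the slides), of how a parallel plan could tie the Subgraph and Virtual Device abstractions together:

    # Hypothetical sketch of the Subgraph / Virtual Device abstractions.
    # The classes and fields below are illustrative assumptions, not
    # Whale's real data structures.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class VirtualDevice:
        gpus: List[str]              # physical GPUs grouped into one virtual device

    @dataclass
    class Subgraph:
        name: str                    # a slice of the model, e.g. a range of layers
        operations: List[str] = field(default_factory=list)

    @dataclass
    class ParallelPlan:
        # Each entry maps a subgraph to a virtual device together with the
        # parallel primitive applied to it (replica / stage / split / pipeline).
        placements: List[Tuple[Subgraph, VirtualDevice, str]]

    plan = ParallelPlan(placements=[
        (Subgraph("backbone"), VirtualDevice(["GPU0", "GPU1"]), "replica"),
        (Subgraph("fc_head"), VirtualDevice(["GPU0", "GPU1"]), "split"),
    ])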
10、cutionPlanGeneratorA meta-driven cost model for exploringTFGraphTensorflow Runtime EngineRuntimedistribution strategies.Hardware#page#Parallel Flow西會全GPUOGPU6Virutal Devico1Device 0GPU1GPU7SPUGPU4 GPU5GPU2GPU8GPU3GPU9GPU4GPU5GPU6GPUOGPU8Virtualize devices as virtual devicesGPU4GPU2GPU10PU7Pipeline p
11、arallelGenerate final parallel planPartition model to subgraphs#page#APIsTwo Sets ofAPIS are defined to contruct hybrid parallelism flexibly and easilyclusterscopeVirtual Device DefinationPrimitive StrategiesGroup physical devices toPrimitives to describe basic parallel types:virtual devices by pred
12、efined replica (data parallelism)layouts: stage (model parallelism)Row Layout split (operator sharding) Column Layout pipeline (pipeline parallelism)Average Layoutauto-parallel(auto parallelism) Spcified Layout#page#Auto ParallelWhale EngineCost ModelMemoryFLOPssubgraph1subgraphoPartition models to
13、subgrahs automaticallyLarger batch size- memory balance of subgraphs-benefit models with batch norm Higher throughput-expplore strategies based on FLOPs/memory/communication#page#ExamplesOData ParallelCDP) DP + Op Sharding Model Parallelwith wh.cluster()with wh.cluster():withwh.stage()with wh.cluste
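As a conceptual illustration only (Whale's actual cost model is meta-driven and its formula is not given on the slides), a strategy search can score candidate subgraph partitions by how balanced their memory and FLOPs are; the candidate names, numbers and scoring below are assumptions:

    # Conceptual sketch of strategy exploration with a simple cost model.
    # The weights, scoring formula and candidate format are illustrative
    # assumptions, not Whale's real cost model.
    def partition_cost(subgraph_memories, subgraph_flops,
                       mem_weight=1.0, flops_weight=1.0):
        """Lower is better: penalize memory and FLOPs imbalance across subgraphs."""
        mem_imbalance = max(subgraph_memories) - min(subgraph_memories)
        flops_imbalance = max(subgraph_flops) - min(subgraph_flops)
        return mem_weight * mem_imbalance + flops_weight * flops_imbalance

    # Two candidate 2-way partitions of the same model, described by the
    # per-subgraph memory (GB) and compute (GFLOPs) they place on a device.
    candidates = {
        "split_after_layer_12": ([9.0, 7.0], [120.0, 110.0]),
        "split_after_layer_16": ([12.0, 4.0], [160.0, 70.0]),
    }

    best = min(candidates, key=lambda k: partition_cost(*candidates[k]))
    print(best)  # "split_after_layer_12": the more balanced partition wins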
#page#
Examples

Data Parallel (DP):

    with wh.cluster():
      with wh.replica():
        out = Model()

DP + Op Sharding:

    with wh.cluster():
      with wh.replica():
        out = ModelPart1()
      with wh.split():
        out = ModelPart2(out)

Model Parallel:

    with wh.cluster():
      with wh.stage():
        out = ModelPart1()
      with wh.stage():
        out = ModelPart2(out)

Pipeline + DP:

    with wh.cluster():
      with wh.replica():
        with wh.pipeline(micro_batch=4):
          with wh.stage():
            out = ModelPart1()
          with wh.stage():
            out = ModelPart2(out)

Auto Parallel:

    wh.auto_parallel()
    out = Model()

#page#
Best practice - Large Classification Task

ResNet-50 with 100,000 classes.

- The model is a ResNet50 backbone (bottleneck blocks, 90M) followed by an FC layer (782M) and Softmax.
- FC parameters = 8.7x ResNet50 parameters (a back-of-the-envelope check follows below).
- With pure data parallelism, the gradients AllReduce of the FC layer becomes the bottleneck.
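A rough back-of-the-envelope check of the 782M / 90M / 8.7x figures above, assuming they denote fp32 parameter bytes, a 2048-dimensional ResNet-50 feature vector, and roughly 23.5M backbone parameters (none of these assumptions are stated on the slide):

    # Back-of-the-envelope check of the parameter sizes quoted on the slide.
    # Assumptions (not from the slide): fp32 weights (4 bytes), ResNet-50
    # feature dimension 2048, ~23.5M parameters in the convolutional backbone.
    num_classes = 100_000
    feature_dim = 2048
    bytes_per_param = 4

    fc_params = feature_dim * num_classes               # ~204.8M parameters
    fc_bytes = fc_params * bytes_per_param               # ~819 MB

    backbone_params = 23.5e6                             # ResNet-50 without the FC head
    backbone_bytes = backbone_params * bytes_per_param   # ~94 MB

    print(fc_bytes / 2**20, backbone_bytes / 2**20)      # ~781 MiB vs ~90 MiB
    print(fc_params / backbone_params)                   # ~8.7x, matching the slide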
#page#
Best practice - Large Classification Task

Op Sharding:

- Shard the FC layer into FC Part0 - FC Part3 with operator sharding.
- Map the subgraphs to virtual devices.
- Hybrid parallelism: DP for the ResNet50 backbone + Op Sharding for the FC layer.

#page#
Best practice - Large Classification Task

ResNet50 with 100,000 classes:

    import whale as wh

    cluster = wh.cluster(layout="all")
    with cluster:
      with wh.replica():
        features = ResNet50(inputs)
      with wh.split():
        logits = FC(features)
        predictions = Softmax(logits)

- Easy to program a hybrid parallelism training workload.
- Could scale up to 100 million classes.
- Using 64 GPUs, Whale Op Sharding achieves 14.8x the throughput of Whale DP.

[Chart: throughput of Whale DP vs. Whale Op Sharding as the number of GPUs scales (32/48/64); Alibaba Cloud ECS, 8x 16G V100 GPUs.]
#page#
Best practice - BertLarge

Pipeline + DP:

    with wh.cluster():
      with wh.replica():
        with wh.pipeline(micro_batch=4):
          with wh.stage():
            output = encoder_layer_1_8(output)
          with wh.stage():
            output = encoder_layer_8_16(output)
          with wh.stage():
            output = encoder_layer_16_24(output)

[Figure: each replica splits BertLarge (embedding, 24 encoder layers, pooler) into three pipeline stages across GPUs; three data-parallel replicas (Replica 0, Replica 1, Replica 2).]

- Gradient AllReduce within a worker: 1.2G.
- Activations sent among workers: 27M (a rough sanity check follows below).
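A rough sanity check of the communication volumes above, assuming fp32 gradients and roughly 340M BertLarge parameters (the slide does not state the precision, sequence length or exact parameter count):

    # Rough sanity check of the communication volumes on the slide.
    # Assumptions (not from the slide): fp32 gradients and ~340M
    # BertLarge parameters.
    bert_large_params = 340e6
    grad_bytes = bert_large_params * 4      # bytes of gradients per replica
    print(grad_bytes / 2**30)               # ~1.27 GiB, close to the quoted 1.2G

    # Activations crossing a pipeline-stage boundary (hidden size 1024
    # x sequence length x micro-batch, in fp32) amount to tens of MB,
    # so inter-stage traffic is far cheaper than the gradient AllReduce.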
#page#
Best practice - BertLarge

- Using 64 GPUs with batch size = 6:
  - Whale Pipeline = 1.34x Whale DP
  - Whale Pipeline = 2.32x Horovod DP

[Chart: throughput of Horovod DP, Whale DP and Whale Pipeline as the number of GPUs scales; Alibaba Cloud ECS, 8x 16G V100 GPUs.]

#page#
Whale in Alibaba Cloud

- Whale has been integrated into both PAI DSW (Data Science Workshop) and PAI DLC.
- The services are now available on Alibaba Cloud.

#page#
Welcome to join the PAI DingTalk group, where more in-depth materials are available. For cooperation inquiries, search for "機器學習PAI" (Machine Learning PAI).