SCALING LARGE LANGUAGE MODEL TRAINING USING HYBRID GPU-BASED COMPRESSION IN MVAPICH
Aamir Shafi, Research Scientist
Lang Xu, Ph.D. Student
Network Based Computing Laboratory, The Ohio State University
http://nowlab.cse.ohio-state.edu/
2024 OFA Virtual Workshop

Presentation Outline
- Introduction & Background
- Motivation & Challenges
- Hybrid Compression Design
- Performance Evaluation
- Conclusion

Training Large Language Models
- Large language models (LLaMA 2, GPT-4, Claude 3) are powerful across many areas (dialogue systems, knowledge bases, ...).
- Model capability scales with the number of parameters (from 100-million-parameter BERT to 500-billion-parameter Megatron-Turing NLG).
- Training billion-parameter models requires:
  - Parallelism strategies (scaling up to thousands of GPUs)
  - Memory optimization (fitting models within GPU memory)
  - Efficient communication (reducing interconnect bandwidth pressure)

Parallelism Strategies
- Data Parallelism (DP): maintains a full model replica on each DP rank and feeds it a mini-batch.
  - Data-intensive gradient synchronization using Allreduce.
- Pipeline Parallelism (PP): shards model layers across devices and executes them in pipeline order.
  - Point-to-point communication passes activations and gradients.
- Tensor Parallelism (TP): distributes matrix multiplications over different devices.
  - Frequent Allreduce and Allgather communication ensures correctness.
- 3D parallelism combines DP + PP + TP (Megatron-LM); a communicator-level sketch follows below.
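As a concrete illustration of how these communication patterns map onto process groups, here is a minimal sketch (not the Megatron-LM implementation) that splits MPI_COMM_WORLD into DP/TP/PP communicators with mpi4py and runs the data-parallel gradient Allreduce. The group widths and buffer size are illustrative.

```python
# Minimal sketch: build DP/TP/PP communicators and run the data-parallel
# gradient Allreduce. Assumes the world size is a multiple of TP * PP.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

TP, PP = 4, 6                      # illustrative tensor/pipeline widths
DP = world // (TP * PP)            # remaining dimension is data parallel

# Map each rank to (dp, pp, tp) coordinates, innermost = tensor parallel.
tp_idx = rank % TP
pp_idx = (rank // TP) % PP
dp_idx = rank // (TP * PP)

# Ranks that share (pp, tp) coordinates form one data-parallel group, etc.
dp_comm = comm.Split(color=pp_idx * TP + tp_idx, key=dp_idx)
tp_comm = comm.Split(color=dp_idx * PP + pp_idx, key=tp_idx)
pp_comm = comm.Split(color=dp_idx * TP + tp_idx, key=pp_idx)

# Data-parallel gradient synchronization: the bandwidth-heavy Allreduce.
grads = np.random.randn(1 << 20).astype(np.float32)   # stand-in for gradients
dp_comm.Allreduce(MPI.IN_PLACE, grads, op=MPI.SUM)
grads /= dp_comm.Get_size()                            # average across replicas
```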
Memory Optimization
- DeepSpeed ZeRO optimizer: a memory optimization technology for large-scale distributed deep learning.
  - Enables training models with billions of parameters across GPUs.
  - Each GPU updates only its own partition of the training state (optimizer states, gradients, model parameters), reducing the memory footprint.
  - Requires Allgather and Reduce-Scatter to synchronize between processes.
- ZeRO-1: partitions optimizer states (momentum and variances) across GPUs.
- ZeRO-2: additionally partitions gradients.
- ZeRO-3: additionally partitions model parameters.
- DeepSpeed ZeRO: https://arxiv.org/abs/1910.02054v3
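The evaluation later in the deck trains with ZeRO stage 1, a micro-batch of 4, and one gradient-accumulation step. A minimal DeepSpeed-style configuration sketch with those values might look as follows; the remaining fields are illustrative defaults, not the authors' exact configuration.

```python
# Sketch of a DeepSpeed config enabling ZeRO stage 1 (optimizer-state
# partitioning). Batch settings mirror the Experiment Setup slide; other
# fields are illustrative, not taken from the deck.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 1,                 # 1: optimizer states, 2: +gradients, 3: +parameters
    },
    "fp16": {"enabled": True},      # mixed precision for activations/gradients
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```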
Profiling and Optimizing Communication
- LLM training requires data-intensive collective communication under 3D parallelism + ZeRO-1.
  - Large communication overhead [1] that saturates interconnect bandwidth.
- Different sparsity across data structures [2]:
  - Gradients are generally sparse (mostly zeros).
  - Activations and optimizer states are dense.
- Co-designing MPI with GPU-based compression has been shown to greatly improve effective bandwidth and throughput [3][4]; a conceptual sketch appears after the reference list below.

References
[1] Q. Anthony et al., "MCR-DL: Mix-and-Match Communication Runtime for Deep Learning," 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), St. Petersburg, FL, USA, 2023.
[2] S. Bian et al., "Does compressing activations help model parallel training?" https://arxiv.org/abs/2301.02654
[3] Q. Zhou et al., "Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters," 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Portland, OR, USA, 2021, pp. 444-453, doi:10.1109/IPDPS49936.2021.00053.
[4] Q. Zhou et al., "Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication," 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), St. Petersburg, FL, USA, 2023, pp. 134-144, doi:10.1109/IPDPS54959.2023.00023.
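References [3] and [4] move compression inside the MPI library so it happens on the fly on the GPU. The host-side sketch below (using the zfpy bindings and mpi4py, both assumed to be installed, and run with at least two ranks) only illustrates the underlying idea of shrinking the bytes on the wire for a pipeline-style point-to-point exchange; it is not the MVAPICH implementation.

```python
# Conceptual sketch only: compress activations with ZFP before a pipeline
# point-to-point exchange and decompress on the receiver. The designs in
# [3][4] do this on the fly inside the MPI library on the GPU.
from mpi4py import MPI
import numpy as np
import zfpy   # Python bindings for the ZFP compressor

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    activations = np.random.randn(1024, 1024).astype(np.float32)
    payload = zfpy.compress_numpy(activations, rate=16)   # fixed-rate lossy mode
    print(f"raw {activations.nbytes} B -> compressed {len(payload)} B")
    comm.send(payload, dest=1, tag=0)
elif rank == 1:
    payload = comm.recv(source=0, tag=0)
    activations = zfpy.decompress_numpy(payload)          # header carries shape/dtype
```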
Motivation
- Use compression-assisted MPI collectives (Allgather, Reduce-Scatter, and point-to-point) to accelerate large language model training in a 3D parallelism + ZeRO-1 setting.

Challenges
- What are the major communication routines involved in a typical 3D parallelism + ZeRO-1 training scenario?
  - Requires understanding how each parallelism strategy is implemented.
- How can the different sparsity inherent in the messages be exploited efficiently without compromising accuracy?
  - Determine the message types transferred in each parallelism degree.
  - Combine lossless and lossy compression.
- How can over-compression in a given parallelism degree be avoided?
  - Different parallelism degrees use different compression ratios.
MZHybrid: MPC for MP & ZFP for DP
- Lossless MPC compression for model parallelism:
  - Maintains activation accuracy.
  - Also applied to inter-layer gradients to avoid over-compression.
  - Goal: preserve accuracy.
- Lossy ZFP compression for the data-parallel, data-intensive gradient Allreduce:
  - Compresses sparse gradients.
  - Goal: provide speedups.
ZHybrid: High-Rate ZFP for MP & Low-Rate ZFP for DP
- High-rate ZFP compression for model parallelism:
  - Maintains activation accuracy.
  - Also applied to inter-layer gradients to avoid over-compression.
  - Goal: preserve accuracy.
- Low-rate ZFP compression for the data-parallel, data-intensive gradient Allreduce:
  - Compresses sparse gradients.
  - Goal: provide speedups.
- More throughput oriented (no lossless components); a configuration sketch for both schemes follows below.
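The deck does not spell out how the two schemes are configured at runtime, so the mapping below is a hypothetical sketch: it simply records which compressor each parallelism degree uses under MZHybrid and ZHybrid. The ZFP rates shown are illustrative placeholders, not values taken from the slides.

```python
# Hypothetical helper: compressor choice per parallelism group for the two
# hybrid schemes. MZHybrid: lossless MPC on TP/PP traffic, lossy ZFP on the
# DP gradient Allreduce. ZHybrid: high-rate ZFP on TP/PP, low-rate ZFP on DP.
SCHEMES = {
    "mzhybrid": {
        "dp": {"compressor": "zfp", "rate": 16},   # sparse gradients tolerate lossy (rate illustrative)
        "tp": {"compressor": "mpc"},               # dense activations kept lossless
        "pp": {"compressor": "mpc"},
    },
    "zhybrid": {
        "dp": {"compressor": "zfp", "rate": 8},    # low rate -> more aggressive compression
        "tp": {"compressor": "zfp", "rate": 16},   # high rate -> less information loss
        "pp": {"compressor": "zfp", "rate": 16},
    },
}

def compression_for(scheme: str, group: str) -> dict:
    """Return the compression setting for one parallelism group ('dp', 'tp', 'pp')."""
    return SCHEMES[scheme][group]
```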
Experiment Setup (Lassen cluster configuration)
- CPU: IBM Power9, 44 cores/node
- Memory: 256 GB
- GPU: NVIDIA Tesla V100 (32 GB)
- Interconnect: InfiniBand EDR (100 Gb/s)
- Model: GPT-NeoX-20B
- Dataset: Books3
- PP degree: 6
- MP degree: 4
- Gradient accumulation steps: 1
- Micro-batch size per GPU: 4
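As a back-of-the-envelope check (not a number given on the slide), treating the MP degree above as the tensor-parallel width, the 192-GPU run reported in the Discussion slide implies the following data-parallel degree and global batch size.

```python
# Derived arithmetic only: the 192-GPU count comes from the Discussion slide,
# and reading "MP degree" as the tensor-parallel width is an assumption.
gpus        = 192   # V100s in the largest reported run
tp, pp      = 4, 6  # "MP degree" and "PP degree" from the table
micro_batch = 4     # per GPU
grad_accum  = 1

dp = gpus // (tp * pp)                      # 192 / 24 = 8 data-parallel replicas
global_batch = dp * micro_batch * grad_accum
print(dp, global_batch)                     # 8 replicas, 32 samples per step
```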
Starting from Naive Compression (ZFP)
- Enforce consistent ZFP compression across all parallelism degrees and ZeRO-1.
- ZFP-8 compresses more aggressively than ZFP-16 (and loses more information).
- ZFP-16: 15.4% increase in throughput (samples/sec), 11.14% increase in TFLOPS per GPU.
- ZFP-8: 23.6% increase in throughput (samples/sec), 22.5% increase in TFLOPS per GPU.
- However, aggressive lossy compression across all collective communication degrades model performance (higher final test loss).
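ZFP's fixed-rate mode stores a fixed number of bits per value, which is why ZFP-8 compresses harder and loses more information than ZFP-16. The sketch below (synthetic data, zfpy bindings assumed) shows the resulting size/error trade-off; actual gradients and activations will behave differently.

```python
# Quick look at ZFP fixed-rate behaviour with the zfpy bindings: rate 8 stores
# fewer bits per value than rate 16, so it compresses harder but loses more.
import numpy as np
import zfpy

x = np.random.randn(256, 256).astype(np.float32)

for rate in (16, 8):
    buf = zfpy.compress_numpy(x, rate=rate)    # fixed-rate lossy compression
    y = zfpy.decompress_numpy(buf)
    ratio = x.nbytes / len(buf)
    err = float(np.abs(x - y).max())
    print(f"ZFP-{rate}: {ratio:.1f}x smaller, max abs error {err:.2e}")
```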
Starting from Naive Compression (MPC)
- Enforce lossless MPC for all collectives.
- Accuracy stays close to the baseline, but TFLOPS and throughput drop.

Hybrid Compression
- The naive ZFP and MPC solutions pose different pros and cons:
  - Lossy ZFP provides speedups but degrades accuracy.
  - Lossless MPC maintains baseline accuracy but degrades throughput.
- DP gradients are sparse; MP activations are dense.
- This suggests a hybrid solution matched to each parallelism degree.
MZHybrid (Results)
- Lossy ZFP compression for the data-parallel gradient Allreduce + lossless MPC compression for model-parallel (TP + PP) communication.
- Good performance speedup (4.4% increase in samples/sec and 5.3% increase in TFLOPS), and the loss curves improve greatly over naive ZFP.

ZHybrid (Results)
- Low-rate ZFP compression for the data-parallel gradient Allreduce + high-rate ZFP compression for model-parallel (TP + PP) communication.
- Even better performance speedup (17.3% increase in samples/sec and 12.7% increase in TFLOPS), with loss curves that remain acceptable.
Discussion
- Comparing ZHybrid with NCCL: up to 7.6% increase in samples/sec and 12.9% in TFLOPS per GPU on 192 V100 GPUs.
- Compression-assisted MPI collectives reduce message sizes and mitigate bandwidth pressure as the training scales up.
- Higher ZFP rates yield loss closer to the baseline than lower ZFP rates.
- Users can select a ZFP rate that matches their desired accuracy/speedup trade-off.
Conclusion
- Analyzed the communication routines used under 3D parallelism and ZeRO stage 1 in a typical LLM training scenario.
- Proposed MZHybrid and ZHybrid, two hybrid compression schemes that apply GPU-based compression MPI collectives to LLM training.
- Both schemes account for the data sparsity of each communication path and use different compression libraries (MPC and ZFP) for different parallelism degrees to provide training speedups with baseline-level model performance.
- MZHybrid provides up to a 4.4% increase in samples/sec and a 5.3% increase in TFLOPS per GPU while maintaining baseline model accuracy.
- ZHybrid provides up to a 20.4% increase in samples/sec and a 20.6% increase in TFLOPS per GPU.

Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/