SCALING LARGE LANGUAGE MODEL TRAINING USING HYBRID GPU-BASED COMPRESSION IN MVAPICH
Aamir Shafi, Research Scientist
Lang Xu, Ph.D. Student
Network Based Computing Laboratory, The Ohio State University
http://nowlab.cse.ohio-state.edu/
2024 OFA Virtual Workshop

Presentation Outline
- Introduction & Background
- Motivation & Challenges
- Hybrid Compression Design
- Performance Evaluation
- Conclusion

Training Large Language Models
- Large language models (LLaMA 2, GPT-4, Claude 3) are powerful across many areas (dialogue systems, knowledge bases, ...).
- Model capability scales with the number of parameters (from 100-million-parameter BERT to 500-billion-parameter Megatron-Turing NLG).
- Training billion-parameter models requires:
  - Parallelism strategies (scaling up to thousands of GPUs)
  - Memory optimization (fitting models within GPU memory)
  - Efficient communication (reducing interconnect bandwidth pressure)

Parallelism Strategies
- Data Parallelism (DP): maintains a full model replica on each DP rank and feeds it a mini-batch.
  - Data-intensive gradient synchronization using Allreduce.
- Pipeline Parallelism (PP): shards model layers across devices and executes them in pipeline order.
  - Point-to-point communication passes activations and gradients.
- Tensor Parallelism (TP): distributes matrix multiplications over different devices.
  - Frequent Allreduce and Allgather communication ensures correctness.
- 3D parallelism combines DP + PP + TP (Megatron-LM); a communicator-level sketch follows below.
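As a concrete illustration of how these communication patterns map onto process groups, here is a minimal sketch (not the Megatron-LM implementation) that splits MPI_COMM_WORLD into DP/TP/PP communicators with mpi4py and runs the data-parallel gradient Allreduce. The group widths and buffer size are illustrative.

```python
# Minimal sketch: build DP/TP/PP communicators and run the data-parallel
# gradient Allreduce. Assumes the world size is a multiple of TP * PP.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

TP, PP = 4, 6                      # illustrative tensor/pipeline widths
DP = world // (TP * PP)            # remaining dimension is data parallel

# Map each rank to (dp, pp, tp) coordinates, innermost = tensor parallel.
tp_idx = rank % TP
pp_idx = (rank // TP) % PP
dp_idx = rank // (TP * PP)

# Ranks that share (pp, tp) coordinates form one data-parallel group, etc.
dp_comm = comm.Split(color=pp_idx * TP + tp_idx, key=dp_idx)
tp_comm = comm.Split(color=dp_idx * PP + pp_idx, key=tp_idx)
pp_comm = comm.Split(color=dp_idx * TP + tp_idx, key=pp_idx)

# Data-parallel gradient synchronization: the bandwidth-heavy Allreduce.
grads = np.random.randn(1 << 20).astype(np.float32)   # stand-in for gradients
dp_comm.Allreduce(MPI.IN_PLACE, grads, op=MPI.SUM)
grads /= dp_comm.Get_size()                            # average across replicas
```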
Memory Optimization
- DeepSpeed ZeRO optimizer: a memory optimization technology for large-scale distributed deep learning.
  - Enables training models with billions of parameters across GPUs.
  - Each GPU updates only its own partition of the training state (optimizer states, gradients, model parameters), reducing the memory footprint.
  - Requires Allgather and Reduce-Scatter to synchronize between processes.
- ZeRO-1: partitions optimizer states (momentum and variances) across GPUs.
- ZeRO-2: additionally partitions gradients.
- ZeRO-3: additionally partitions model parameters.
- DeepSpeed ZeRO: https://arxiv.org/abs/1910.02054v3
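The evaluation later in the deck trains with ZeRO stage 1, a micro-batch of 4, and one gradient-accumulation step. A minimal DeepSpeed-style configuration sketch with those values might look as follows; the remaining fields are illustrative defaults, not the authors' exact configuration.

```python
# Sketch of a DeepSpeed config enabling ZeRO stage 1 (optimizer-state
# partitioning). Batch settings mirror the Experiment Setup slide; other
# fields are illustrative, not taken from the deck.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 1,                 # 1: optimizer states, 2: +gradients, 3: +parameters
    },
    "fp16": {"enabled": True},      # mixed precision for activations/gradients
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```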
Profiling and Optimizing Communication
- LLM training requires data-intensive collective communication under 3D parallelism + ZeRO-1.
  - Large communication overhead [1] that saturates interconnect bandwidth.
- Different sparsity across data structures [2]:
  - Gradients are generally sparse (mostly zeros).
  - Activations and optimizer states are dense.
- Co-designing MPI with GPU-based compression has been shown to greatly improve effective bandwidth and throughput [3][4]; a conceptual sketch appears after the reference list below.

References
[1] Q. Anthony et al., "MCR-DL: Mix-and-Match Communication Runtime for Deep Learning," 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), St. Petersburg, FL, USA, 2023.
[2] S. Bian et al., "Does compressing activations help model parallel training?" https://arxiv.org/abs/2301.02654
[3] Q. Zhou et al., "Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters," 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Portland, OR, USA, 2021, pp. 444-453, doi:10.1109/IPDPS49936.2021.00053.
[4] Q. Zhou et al., "Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication," 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), St. Petersburg, FL, USA, 2023, pp. 134-144, doi:10.1109/IPDPS54959.2023.00023.
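References [3] and [4] move compression inside the MPI library so it happens on the fly on the GPU. The host-side sketch below (using the zfpy bindings and mpi4py, both assumed to be installed, and run with at least two ranks) only illustrates the underlying idea of shrinking the bytes on the wire for a pipeline-style point-to-point exchange; it is not the MVAPICH implementation.

```python
# Conceptual sketch only: compress activations with ZFP before a pipeline
# point-to-point exchange and decompress on the receiver. The designs in
# [3][4] do this on the fly inside the MPI library on the GPU.
from mpi4py import MPI
import numpy as np
import zfpy   # Python bindings for the ZFP compressor

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    activations = np.random.randn(1024, 1024).astype(np.float32)
    payload = zfpy.compress_numpy(activations, rate=16)   # fixed-rate lossy mode
    print(f"raw {activations.nbytes} B -> compressed {len(payload)} B")
    comm.send(payload, dest=1, tag=0)
elif rank == 1:
    payload = comm.recv(source=0, tag=0)
    activations = zfpy.decompress_numpy(payload)          # header carries shape/dtype
```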
Motivation
- Use compression-assisted MPI collectives (Allgather, Reduce-Scatter, and point-to-point) to accelerate large language model training in a 3D parallelism + ZeRO-1 setting.

Challenges
- What are the major communication routines involved in a typical 3D parallelism + ZeRO-1 training scenario?
  - Requires understanding how each parallelism strategy is implemented.
- How can the different sparsity inherent in the messages be exploited efficiently without compromising accuracy?
  - Determine the message types transferred in each parallelism degree.
  - Combine lossless and lossy compression.
- How can over-compression in a given parallelism degree be avoided?
  - Different parallelism degrees use different compression ratios.
MZHybrid: MPC for MP & ZFP for DP
- Lossless MPC compression for model parallelism:
  - Maintains activation accuracy.
  - Also applied to inter-layer gradients to avoid over-compression.
  - Goal: preserve accuracy.
- Lossy ZFP compression for the data-parallel, data-intensive gradient Allreduce:
  - Compresses sparse gradients.
  - Goal: provide speedups.
ZHybrid: High-Rate ZFP for MP & Low-Rate ZFP for DP
- High-rate ZFP compression for model parallelism:
  - Maintains activation accuracy.
  - Also applied to inter-layer gradients to avoid over-compression.
  - Goal: preserve accuracy.
- Low-rate ZFP compression for the data-parallel, data-intensive gradient Allreduce:
  - Compresses sparse gradients.
  - Goal: provide speedups.
- More throughput oriented (no lossless components); a configuration sketch for both schemes follows below.
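The deck does not spell out how the two schemes are configured at runtime, so the mapping below is a hypothetical sketch: it simply records which compressor each parallelism degree uses under MZHybrid and ZHybrid. The ZFP rates shown are illustrative placeholders, not values taken from the slides.

```python
# Hypothetical helper: compressor choice per parallelism group for the two
# hybrid schemes. MZHybrid: lossless MPC on TP/PP traffic, lossy ZFP on the
# DP gradient Allreduce. ZHybrid: high-rate ZFP on TP/PP, low-rate ZFP on DP.
SCHEMES = {
    "mzhybrid": {
        "dp": {"compressor": "zfp", "rate": 16},   # sparse gradients tolerate lossy (rate illustrative)
        "tp": {"compressor": "mpc"},               # dense activations kept lossless
        "pp": {"compressor": "mpc"},
    },
    "zhybrid": {
        "dp": {"compressor": "zfp", "rate": 8},    # low rate -> more aggressive compression
        "tp": {"compressor": "zfp", "rate": 16},   # high rate -> less information loss
        "pp": {"compressor": "zfp", "rate": 16},
    },
}

def compression_for(scheme: str, group: str) -> dict:
    """Return the compression setting for one parallelism group ('dp', 'tp', 'pp')."""
    return SCHEMES[scheme][group]
```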
Experiment Setup (Lassen cluster configuration)
- CPU: IBM Power9, 44 cores/node
- Memory: 256 GB
- GPU: NVIDIA Tesla V100 (32 GB)
- Interconnect: InfiniBand EDR (100 Gb/s)
- Model: GPT-NeoX-20B
- Dataset: Books3
- PP degree: 6
- MP degree: 4
- Gradient accumulation steps: 1
- Micro-batch size per GPU: 4
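As a back-of-the-envelope check (not a number given on the slide), treating the MP degree above as the tensor-parallel width, the 192-GPU run reported in the Discussion slide implies the following data-parallel degree and global batch size.

```python
# Derived arithmetic only: the 192-GPU count comes from the Discussion slide,
# and reading "MP degree" as the tensor-parallel width is an assumption.
gpus        = 192   # V100s in the largest reported run
tp, pp      = 4, 6  # "MP degree" and "PP degree" from the table
micro_batch = 4     # per GPU
grad_accum  = 1

dp = gpus // (tp * pp)                      # 192 / 24 = 8 data-parallel replicas
global_batch = dp * micro_batch * grad_accum
print(dp, global_batch)                     # 8 replicas, 32 samples per step
```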
Starting from Naive Compression (ZFP)
- Enforce consistent ZFP compression across all parallelism degrees and ZeRO-1.
- ZFP-8 compresses more aggressively than ZFP-16 (and loses more information).
- ZFP-16: 15.4% increase in throughput (samples/sec), 11.14% increase in TFLOPS per GPU.
- ZFP-8: 23.6% increase in throughput (samples/sec), 22.5% increase in TFLOPS per GPU.
- However, aggressive lossy compression across all collective communication degrades model performance (higher final test loss).
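ZFP's fixed-rate mode stores a fixed number of bits per value, which is why ZFP-8 compresses harder and loses more information than ZFP-16. The sketch below (synthetic data, zfpy bindings assumed) shows the resulting size/error trade-off; actual gradients and activations will behave differently.

```python
# Quick look at ZFP fixed-rate behaviour with the zfpy bindings: rate 8 stores
# fewer bits per value than rate 16, so it compresses harder but loses more.
import numpy as np
import zfpy

x = np.random.randn(256, 256).astype(np.float32)

for rate in (16, 8):
    buf = zfpy.compress_numpy(x, rate=rate)    # fixed-rate lossy compression
    y = zfpy.decompress_numpy(buf)
    ratio = x.nbytes / len(buf)
    err = float(np.abs(x - y).max())
    print(f"ZFP-{rate}: {ratio:.1f}x smaller, max abs error {err:.2e}")
```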
Starting from Naive Compression (MPC)
- Enforce lossless MPC for all collectives.
- Accuracy stays close to the baseline, but TFLOPS and throughput drop.

Hybrid Compression
- The naive ZFP and MPC solutions pose different pros and cons:
  - Lossy ZFP provides speedups but degrades accuracy.
  - Lossless MPC maintains baseline accuracy but degrades throughput.
- DP gradients are sparse; MP activations are dense.
- This suggests a hybrid solution matched to each parallelism degree.
MZHybrid (Results)
- Lossy ZFP compression for the data-parallel gradient Allreduce + lossless MPC compression for model-parallel (TP + PP) communication.
- Good performance speedup (4.4% increase in samples/sec and 5.3% increase in TFLOPS), and the loss curves improve greatly over naive ZFP.

ZHybrid (Results)
- Low-rate ZFP compression for the data-parallel gradient Allreduce + high-rate ZFP compression for model-parallel (TP + PP) communication.
- Even better performance speedup (17.3% increase in samples/sec and 12.7% increase in TFLOPS), with loss curves that remain acceptable.
Discussion
- Comparing ZHybrid with NCCL: up to 7.6% increase in samples/sec and 12.9% in TFLOPS per GPU on 192 V100 GPUs.
- Compression-assisted MPI collectives reduce message sizes and mitigate bandwidth pressure as the training scales up.
- Higher ZFP rates yield loss closer to the baseline than lower ZFP rates.
- Users can select a ZFP rate that matches their desired accuracy/speedup trade-off.
Conclusion
- Analyzed the communication routines used under 3D parallelism and ZeRO stage 1 in a typical LLM training scenario.
- Proposed MZHybrid and ZHybrid, two hybrid compression schemes that apply GPU-based compression MPI collectives to LLM training.
- Both schemes account for the data sparsity of each communication path and use different compression libraries (MPC and ZFP) for different parallelism degrees to provide training speedups with baseline-level model performance.
- MZHybrid provides up to a 4.4% increase in samples/sec and a 5.3% increase in TFLOPS per GPU while maintaining baseline model accuracy.
- ZHybrid provides up to a 20.4% increase in samples/sec and a 20.6% increase in TFLOPS per GPU.

Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/