IN-DEPTH ANALYSIS OF THE PERFORMANCE FOR GPT-3
顏子杰, NVIDIA

LLM TRAINING TECHS

LARGER MODEL IS THE TREND

CHALLENGES FOR TRAINING LARGE MODEL: High Compute Costs
Compute cost, lower bound per iteration (see https://arxiv.org/abs/2104.04473):
  FLOPs per iteration ≈ 96 · B · s · l · h² · (1 + s/(6h) + V/(16·l·h))
  B: batch size, s: sequence length, l: number of transformer layers, h: hidden size, V: vocabulary size
- Total compute ≈ 2150 ZettaFLOPs for a 175B model trained on 1.5T tokens (1 ZettaFLOP = 1000 ExaFLOPs).
- On 128 DGX A100 systems, training takes roughly 120-170 days at about 50% computing efficiency.
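As a concrete check, here is a minimal Python sketch (configuration values and the 1.5T-token count are taken from this deck) of how the lower bound above translates into total training compute:

```python
# Sketch: per-iteration FLOPs lower bound for GPT-3 175B
# (formula from https://arxiv.org/abs/2104.04473, configuration values from this deck).
B, s = 1536, 2048           # global batch size, sequence length
l, h, V = 96, 12288, 51200  # transformer layers, hidden size, vocabulary size

flops_per_iter = 96 * B * s * l * h**2 * (1 + s / (6 * h) + V / (16 * l * h))

tokens_total = 1.5e12                  # 1.5T training tokens
iterations = tokens_total / (B * s)    # iterations needed to consume all tokens
total_flops = flops_per_iter * iterations

print(f"FLOPs per iteration : {flops_per_iter:.2e}")                 # ~4.5e18
print(f"Total training FLOPs: {total_flops / 1e21:.0f} ZettaFLOPs")  # ~2150
```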
CHALLENGES FOR TRAINING LARGE MODEL: High Memory Costs
Memory costs (mixed precision, native implementation), model states total: 3.5 TB
- Parameter: 350 GB (175B × 2 bytes)
- Gradient: 350 GB
- Optimizer: 2800 GB
- Activation: ?
- The model cannot fit in a single GPU or even a single GPU server (3.5 TB of model states vs. 80 GB per A100).
- Model parallelism is a MUST across multiple nodes.

CHALLENGES FOR TRAINING LARGE MODEL
- The model cannot fit in a single GPU or even a single GPU server.
- Extremely large computing power: about 16K A100-days of compute (not considering efficiency).
What we need:
- An efficient framework with model parallelism
- Careful co-design of software and system

NeMo AND MEGATRON
NeMo and Megatron-LM are NVIDIA's frameworks for efficiently training the world's largest transformer-based models.
- Train transformer models with billions of parameters.
- Achieve high utilization and scaling to thousands of GPUs.

OVERVIEW OF LARGE TRANSFORMER TRAINING TECHNIQUES
Parallelisms:
- Pipeline Parallelism
- Tensor Parallelism
- Sequence Parallelism
- Expert Parallelism
Memory Optimizations:
- Distributed optimizer (DeepSpeed ZeRO-1)
- Checkpoint activations
- Selective activation checkpointing
Others:
- FP16/BF16 training, optimized kernels, etc.
- Communication overlapping for PP and TP
(In the original slides, blue marks Megatron v2 features and green marks Megatron v3 new features.)

THE DISTRIBUTED TRAINING OF GPT-3 MODEL: Model Parallelism
Tensor model parallelism:
- Intra-layer: splits individual layers across multiple devices
- Simple to implement
- Good performance for large matrices
- Fine-grained, high-frequency communication
Pipeline model parallelism:
- Inter-layer: splits sets of layers across multiple devices
- Coarse-grained communication
- Generalizable to almost all DNNs
- Requires a large batch size for high throughput
- Load imbalance across workers
THE DISTRIBUTED TRAINING OF GPT-3 MODEL: Tensor Model Parallelism, MLP
f and g are conjugate operators: in the forward pass, f is the identity operator and g is an all-reduce; in the backward pass, f is an all-reduce and g is the identity operator.

THE DISTRIBUTED TRAINING OF GPT-3 MODEL: Tensor Model Parallelism, Self-Attention
The same conjugate operators f and g are used: f is identity and g is all-reduce in the forward pass, while f is all-reduce and g is identity in the backward pass.
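A minimal PyTorch sketch of these conjugate operators, assuming torch.distributed has already been initialized (in a real setup the collectives would run on the tensor-parallel process group); the class and function names here are illustrative, not Megatron-LM's actual API:

```python
import torch
import torch.distributed as dist

class _CopyToTensorParallelRegion(torch.autograd.Function):
    """f: identity in the forward pass, all-reduce in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_output):
        dist.all_reduce(grad_output)   # in practice, pass group=tensor_parallel_group
        return grad_output

class _ReduceFromTensorParallelRegion(torch.autograd.Function):
    """g: all-reduce in the forward pass, identity in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        dist.all_reduce(x)             # sum the partial outputs across TP ranks
        return x
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

# Column-parallel A followed by row-parallel B, as in the Megatron MLP block:
#   Y = g( GeLU( f(X) @ A_i ) @ B_i )
def tp_mlp(x, a_shard, b_shard):
    x = _CopyToTensorParallelRegion.apply(x)
    y = torch.nn.functional.gelu(x @ a_shard)   # local GeLU, no communication
    z = y @ b_shard
    return _ReduceFromTensorParallelRegion.apply(z)
```

Splitting A by columns and B by rows is what allows the GeLU to be applied locally, so only one all-reduce per MLP block is needed in each direction.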
PARALLELISM
New in Megatron v3: Sequence Parallelism
- Expands upon tensor parallelism by splitting tensors across the sequence dimension.
- Partitioning along the sequence dimension reduces the memory required for the activations.
- Introduces all-gather/reduce-scatter operations at the boundaries between sequence-parallel and tensor-parallel regions.
- g and ḡ are conjugate: g is an all-gather in the forward pass and a reduce-scatter in the backward pass; ḡ is a reduce-scatter in the forward pass and an all-gather in the backward pass.

THE DISTRIBUTED TRAINING OF GPT-3 MODEL: Pipeline Model Parallelism (1F1B)
m: number of micro-batches, p: number of pipeline stages, t_f: forward step time, t_b: backward step time
- Ideal time = m · (t_f + t_b)
- Bubble time = (p - 1) · (t_f + t_b)
- Total time = (m + p - 1) · (t_f + t_b)
- Bubble time overhead = bubble time / ideal time = (p - 1) / m

THE DISTRIBUTED TRAINING OF GPT-3 MODEL: Interleaved Pipeline Model Parallelism
With v interleaved stages (model chunks) per device, the bubble time overhead drops from (p - 1)/m to (1/v) · (p - 1)/m.
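Plugging in this deck's configuration gives a quick feel for the bubble cost; the number of micro-batches per pipeline, m = B/(d·b), is an assumption consistent with the listed batch sizes:

```python
# Pipeline bubble overhead for the GPT-3 configuration used in this deck.
p = 16                # pipeline stages
B, d, b = 1536, 8, 1  # global batch size, data-parallel size, micro-batch size
m = B // (d * b)      # micro-batches per pipeline per iteration -> 192

# v = 1 is plain 1F1B; larger v is the interleaved schedule.
for v in (1, 2, 4):
    print(f"v={v}: bubble overhead = {(p - 1) / (v * m):.2%}")
# v=1: 7.81%, v=2: 3.91%, v=4: 1.95%
```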
THE DISTRIBUTED TRAINING OF GPT-3 MODEL: 3D Parallelism

REQUIREMENT OF STORAGE: Activations Memory Optimizations
Full checkpointing:
- Store (checkpoint) the input activations of a group of layers and recompute the other required activations with an extra forward pass during back-propagation.
- Significantly reduces the memory required for training, at roughly 36% compute overhead.
Sequence parallel + selective checkpointing:
- Only checkpoint and recompute the parts of each layer that take up a considerable amount of memory but are cheap to recompute.
- Use sequence parallelism to distribute the remaining activations.
- Reduces the recompute overhead from 36% to 4%.

ACTIVATION CHECKPOINTING
New in Megatron v3: Selective Checkpointing
- Only checkpoint and recompute the parts of each layer that take up a considerable amount of memory but are not computationally expensive to recompute (called selective activation recomputation).
- Observation: attention operations generally have large input sizes and thus large activations, yet the number of floating-point operations (FLOPs) per input element is very low.
- Exceptions and limitations: works well only in conjunction with the other parallelism techniques.
- Combining sequence parallelism and selective checkpointing reduces the recompute overhead from 36% to 4%.
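A minimal PyTorch sketch of the idea, recomputing only the attention core (the large softmax(QKᵀ)V activations) while keeping the projections' activations stored; this illustrates selective recomputation rather than Megatron-LM's actual implementation, and the module and function names are made up for the example:

```python
import torch
from torch.utils.checkpoint import checkpoint

def attention_core(q, k, v, scale):
    # softmax(QK^T)V: large s x s activations, but few FLOPs per stored element,
    # so it is cheap to recompute relative to the memory it would otherwise hold.
    scores = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return scores @ v

class SelectiveCheckpointAttention(torch.nn.Module):
    def __init__(self, hidden, heads):
        super().__init__()
        self.heads, self.head_dim = heads, hidden // heads
        self.qkv = torch.nn.Linear(hidden, 3 * hidden)
        self.proj = torch.nn.Linear(hidden, hidden)

    def forward(self, x):                       # x: [batch, seq, hidden]
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(b, s, self.heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        # Only the attention core is checkpointed (recomputed during backward);
        # the QKV and output projections keep their activations as usual.
        out = checkpoint(attention_core, q, k, v, self.head_dim ** -0.5,
                         use_reentrant=False)
        out = out.transpose(1, 2).reshape(b, s, -1)
        return self.proj(out)
```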
DISTRIBUTED OPTIMIZER
New in Megatron v3

Method | Optimizer States (16M) | Gradients (2M) | Model Weights (2M) | Memory Cost | Communication Cost
Data Parallelism | Replicated | Replicated | Replicated | 20M | all-reduce(M)
Distributed Optimizer (ZeRO Stage 1) | Partitioned | Replicated | Replicated | (4 + 16/N)M | reduce-scatter(M) + all-gather(M)
ZeRO Stage 2 | Partitioned | Partitioned | Replicated | (2 + 18/N)M | reduce-scatter(M) * num_micro_batches + all-gather(M)
ZeRO Stage 3 | Partitioned | Partitioned | Partitioned | 20M/N | 1.5 * all-reduce(M) * num_micro_batches
(M is the number of parameters, N is the number of devices.)

Megatron v3 implements a distributed optimizer that shards the optimizer states (momentum, variance, master weights and master gradients) without compromising performance. Because of the communication overhead, ZeRO-2 and ZeRO-3 are not currently implemented in Megatron; in fact, the training of GPT-3 175B and Turing-NLG 530B did not use ZeRO.
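The memory-cost column can be evaluated directly; the sketch below just plugs the table's formulas into Python (M = 175B and N = 8 are illustrative values matching this deck):

```python
def model_state_bytes(M, N, scheme):
    """Per-device bytes of model states under mixed precision, per the table:
    2M fp16 weights + 2M fp16 gradients + 16M optimizer states."""
    return {
        "data_parallel": 20 * M,            # everything replicated
        "zero1":         (4 + 16 / N) * M,  # optimizer states partitioned
        "zero2":         (2 + 18 / N) * M,  # gradients partitioned as well
        "zero3":         20 * M / N,        # weights partitioned as well
    }[scheme]

M, N = 175e9, 8   # parameters, number of data-parallel devices (illustrative)
for scheme in ("data_parallel", "zero1", "zero2", "zero3"):
    print(f"{scheme:13s}: {model_state_bytes(M, N, scheme) / 1e12:.2f} TB")
# data_parallel: 3.50 TB, zero1: 1.05 TB, zero2: 0.74 TB, zero3: 0.44 TB
```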
MEMORY ANALYSIS

STORAGE ANALYSIS: GPT-3 Storage
- Model memory: parameters, gradients, optimizer states
- Activation memory: transformer layers (self-attention, MLP), embedding layer
- Extra memory: all-reduce buffers, PyTorch memory management

GPT-3 175B configuration used throughout the following analysis:
vocabulary size v = 51200, sequence length s = 2048, attention heads a = 96, hidden size h = 12288, layers n = 96, data parallel d = 8, tensor parallel t = 8, pipeline parallel p = 16, micro-batch size b = 1, batch size B = 1536 (so v/t = 6400 and h/t = 1536).

REQUIREMENT OF STORAGE: Model Parameters
Model parameters (total) ≈ 12·n·h²·(1 + 13/(12h) + (v + s)/(12·n·h)) ≈ 175B
Model parameters per device ≈ 175B / (t·p) ≈ 1.37B
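A quick sanity check of those two numbers with the standard GPT parameter-count formula (the same form as in https://arxiv.org/abs/2104.04473); dividing by the t·p model-parallel grid for the per-device count is the assumption used here:

```python
# Parameter count for the GPT-3 175B configuration above.
v, s, h, n = 51200, 2048, 12288, 96   # vocabulary, sequence length, hidden size, layers
t, p = 8, 16                          # tensor-parallel and pipeline-parallel sizes

total = 12 * n * h**2 * (1 + 13 / (12 * h) + (v + s) / (12 * n * h))
per_device = total / (t * p)

print(f"total parameters      ≈ {total / 1e9:.1f}B")       # ≈ 174.6B
print(f"parameters per device ≈ {per_device / 1e9:.2f}B")   # ≈ 1.36B (the deck rounds to 1.37B)
```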
REQUIREMENT OF STORAGE: Model States, Cost of Model Memory (AMP)
Let ψ ≈ 1.37B be the partitioned (per-device) model size.
- Parameters (fp16): 2ψ ≈ 2.74 GB
- Gradients (fp16): 2ψ ≈ 2.74 GB
- Adam optimizer: (4 + 4 + 4 + 4)ψ ≈ 21.9 GB
Note: in the optimizer term, the first 4 is the fp32 copy of the parameters, the second 4 is the fp32 copy of the gradients, the third 4 is the fp32 momentum, and the fourth 4 is the fp32 variance.

REQUIREMENT OF STORAGE: Model States with ZeRO, Cost of Model Memory (AMP)
With the same partitioned model size ψ ≈ 1.37B and ZeRO sharding across the data-parallel group:
- Parameters (fp16): 2ψ ≈ 2.74 GB, with ZeRO-3: 0.34 GB
- Gradients (fp16): 2ψ ≈ 2.74 GB, with ZeRO-2: 0.34 GB
- Adam optimizer: (4 + 4 + 4 + 4)ψ ≈ 21.9 GB, with ZeRO-1: 2.7 GB
(The four 4s are the same fp32 parameter copy, gradient copy, momentum and variance as above.)
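Spelled out as arithmetic (ψ and d as defined above; ZeRO shards each quantity across the d = 8 data-parallel ranks):

```python
psi = 1.37e9   # per-device partitioned model size (parameters)
d = 8          # data-parallel size over which ZeRO shards

param_fp16 = 2 * psi                  # fp16 parameters
grad_fp16  = 2 * psi                  # fp16 gradients
adam_fp32  = (4 + 4 + 4 + 4) * psi    # fp32 param copy + grad copy + momentum + variance

print(f"parameters: {param_fp16 / 1e9:.2f} GB   (ZeRO-3: {param_fp16 / d / 1e9:.2f} GB)")
print(f"gradients : {grad_fp16  / 1e9:.2f} GB   (ZeRO-2: {grad_fp16  / d / 1e9:.2f} GB)")
print(f"optimizer : {adam_fp32  / 1e9:.2f} GB   (ZeRO-1: {adam_fp32  / d / 1e9:.2f} GB)")
# parameters: 2.74 GB (0.34), gradients: 2.74 GB (0.34), optimizer: 21.92 GB (2.74)
```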
REQUIREMENT OF STORAGE: Activations Memory (Per Layer)
Attention: 11·s·b·h + 5·a·s²·b
MLP: 19·s·b·h
2 × LayerNorm: 4·s·b·h
Total per layer = s·b·h·(34 + 5·a·s/h)

REQUIREMENT OF STORAGE: Full Checkpointing, Cost of Activation Memory
- Only the input activation of each transformer layer is stored.
- Sublinear memory cost, but forward re-computation is needed.
- Full checkpointing significantly reduces the required memory for training, at roughly 36% compute overhead.
- Per-layer activations drop from s·b·h·(34 + 5·a·s/h) to 2·s·b·h (the MLP and attention activations are dropped and recomputed).

REQUIREMENT OF STORAGE: Tensor Parallel
Attention: 8·s·b·h/t + 5·a·s²·b/t + 3·s·b·h
MLP: 16·s·b·h/t + 3·s·b·h
2 × LayerNorm: 4·s·b·h
Total per layer = s·b·h·(24/t + 5·a·s/(h·t) + 10), where t is the TP size.

REQUIREMENT OF STORAGE: Tensor Parallel + Sequence Parallel
Attention: 8·s·b·h/t + 5·a·s²·b/t + 3·s·b·h/t
MLP: 16·s·b·h/t + 3·s·b·h/t
2 × LayerNorm: 4·s·b·h/t
Total per layer = (s·b·h/t)·(34 + 5·a·s/h)

REQUIREMENT OF STORAGE: Tensor Parallel + Sequence Parallel + Selective Checkpointing
The 5·a·s²·b/t attention-score activations are dropped and recomputed, leaving
Total per layer = 34·s·b·h/t

REQUIREMENT OF STORAGE: Activation Summary
Method | Activation Memory Per Layer | Recomputation Overhead
None | s·b·h·(34 + 5·a·s/h) ≈ 2.86 GB | 0%
Full checkpointing | 250 MB | 36%
TP + SP + Selective | 34·s·b·h/t ≈ 106 MB | 4%
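The per-layer formulas above can be evaluated directly for this configuration (bytes, b = 1); the sketch below reproduces the "None" and "TP + SP + Selective" rows of the summary:

```python
# Per-layer activation memory (bytes) for the formulas in this section.
s, b, h, a, t = 2048, 1, 12288, 96, 8

variants = {
    "none":                s * b * h * (34 + 5 * a * s / h),
    "tensor parallel":     s * b * h * (24 / t + 5 * a * s / (h * t) + 10),
    "TP + SP":             s * b * h / t * (34 + 5 * a * s / h),
    "TP + SP + selective": 34 * s * b * h / t,
}
for name, nbytes in variants.items():
    print(f"{name:20s}: {nbytes / 1e6:7.0f} MB")
# none ≈ 2869 MB (~2.86 GB) and TP+SP+selective ≈ 107 MB, matching the summary table.
```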
COMMUNICATION COST

DATA PARALLELISM
Name | Operation | Message size | Group | Loops
Gradient accumulation | All-reduce | Φ (model gradient) | Data-parallel group | 1
With Φ the model gradient size, BW the bus bandwidth and d the data-parallel size, the all-reduce time is
  T_DP = [2(d - 1)/d] · Φ / BW

[Figure: Bus Bandwidth vs. Message Size; measured bus bandwidth (GB/s) as a function of message size (MB).]

TENSOR PARALLELISM
Name | Operation | Message size | Group
Self-attention | All-reduce for TP; all-gather and reduce-scatter for TP + SP | b·s·h activations (2·b·s·h bytes in fp16) | Tensor-parallel group
MLP | All-reduce for TP; all-gather and reduce-scatter for TP + SP | b·s·h activations (2·b·s·h bytes in fp16) | Tensor-parallel group
The communication size for TP and TP + SP is the same.
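A direct evaluation of the data-parallel gradient all-reduce cost; the bandwidth figure is a placeholder that should be interpolated from the bus-bandwidth curve for the actual message size:

```python
def allreduce_time(message_bytes, d, bus_bw_gbps):
    """Ring all-reduce: each rank moves 2*(d-1)/d of the message over the bus."""
    return 2 * (d - 1) / d * message_bytes / (bus_bw_gbps * 1e9)

grad_bytes = 2 * 1.37e9   # fp16 gradients of the per-device partitioned model
d = 8                     # data-parallel size
bw = 150                  # GB/s, placeholder read off the bandwidth curve
print(f"gradient all-reduce ≈ {allreduce_time(grad_bytes, d, bw) * 1e3:.0f} ms")  # ≈ 32 ms
```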
TENSOR PARALLELISM: Equation for Communication Cost
  T_TP ≈ (3 + 3) · n · m · [2(t - 1)/t] · (2·b·s·h) / BW
where 2·b·s·h is the message size per NCCL call (fp16 activations), 2(t - 1)/t is the correction factor for bus bandwidth (all-reduce), the 3 + 3 counts the collectives for self-attention and for the MLP in each layer, n is the number of transformer layers, m is the number of micro-batches, and BW is the bus bandwidth obtained by interpolation from the measured curve.
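A sketch of that estimate in code; interpreting the 3 + 3 collectives per layer as covering the forward, recomputation and backward passes of the attention and MLP blocks is an assumption, and so is using the per-stage layer count when pipeline parallelism is active:

```python
def tp_comm_time(m, n_layers, t, b, s, h, bus_bw_gbps):
    """Tensor-parallel communication time following the equation above:
    (3 + 3) collectives per layer (self-attention + MLP; assumed to cover the
    forward, recompute and backward passes), each moving a b*s*h fp16 tensor."""
    msg_bytes = 2 * b * s * h                                      # message size per NCCL call
    per_call = 2 * (t - 1) / t * msg_bytes / (bus_bw_gbps * 1e9)   # bus-bandwidth all-reduce model
    return (3 + 3) * n_layers * m * per_call

# Per-device estimate for one iteration: 96/16 = 6 layers per pipeline stage,
# 192 micro-batches, bandwidth placeholder to be interpolated from the curve.
print(f"{tp_comm_time(m=192, n_layers=96 // 16, t=8, b=1, s=2048, h=12288, bus_bw_gbps=200):.1f} s")
```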
PIPELINE PARALLELISM: Scatter-Gather Mechanism
- Activations are exchanged between pipeline stages with point-to-point (P2P) sends and receives.
- With the scatter-gather optimization, the all-gather happens over NVLink and its cost can be ignored.

COMPUTATION COST
TRAINING COMPUTATION ANALYSIS OF GPT-3 175B: Per-Batch FLOPs Estimation
- Model FLOPs (per batch) ≈ 96 · B · s · n · h² · (1 + s/(6h) + v/(16·n·h))
- Model FLOPS per GPU = Model FLOPs / (batch time × number of GPUs)
- A100 training efficiency = Model FLOPS per GPU / 312 TFLOPS
We can check whether training efficiency meets expectations by calculating the FLOPS achieved per GPU during training. For a large GPT model (with TP and PP), about 150 model TFLOPS per A100 is achievable; if the value is below 120, there are usually performance issues in the training setup.
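A minimal sketch of this efficiency check; the formula reuses the per-iteration estimate from earlier in this deck, the 30-second batch time is a hypothetical input, and 312 TFLOPS is the A100's peak FP16/BF16 throughput:

```python
def model_flops_per_batch(B, s, n, h, v):
    # Same lower-bound estimate used in the compute-cost section of this deck.
    return 96 * B * s * n * h**2 * (1 + s / (6 * h) + v / (16 * n * h))

def a100_efficiency(batch_time_s, num_gpus, **cfg):
    flops_per_gpu = model_flops_per_batch(**cfg) / (batch_time_s * num_gpus)
    return flops_per_gpu / 1e12, flops_per_gpu / 312e12  # (model TFLOPS/GPU, fraction of peak)

tflops, frac = a100_efficiency(batch_time_s=30.0, num_gpus=1024,
                               B=1536, s=2048, n=96, h=12288, v=51200)
print(f"{tflops:.0f} model TFLOPS per GPU ({frac:.0%} of A100 peak)")
# ~147 TFLOPS/GPU here; the deck's rule of thumb: ~150 is achievable, below ~120
# usually indicates a performance problem.
```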
CONCLUSION: TAKEAWAY POINTS
Out of the box (OOTB):
- Mixed precision training
- FlashAttention
- BF16 is recommended for larger model training (>20B).
For large-scale training (both weak scaling and strong scaling), my intuition:
If memory is an issue:
- Selective activation checkpointing
- Distributed optimizer
- Progressively increase tensor parallelism (with sequence parallel enabled, keeping hidden/tp at least 1024, ideally 2048)
- Progressively increase pipeline parallelism
- Full activation checkpointing
If memory is not an issue:
- Data parallelism
- Larger batch size

Thank you for watching.