IN-DEPTH ANALYSIS OF THE PERFORMANCE FOR GPT-3
顏子杰, NVIDIA

LLM TRAINING TECHS

LARGER MODEL IS THE TREND

CHALLENGES FOR TRAINING LARGE MODEL: High Compute Costs
Compute cost, lower bound per iteration (see https://arxiv.org/abs/2104.04473):
  FLOPs per iteration ≈ 96 · B · s · l · h² · (1 + s/(6h) + V/(16·l·h))
  B: batch size, s: sequence length, l: number of transformer layers, h: hidden size, V: vocabulary size
- Total compute ≈ 2150 ZettaFLOPs for a 175B model trained on 1.5T tokens (1 ZettaFLOP = 1000 ExaFLOPs).
- On 128 DGX A100 systems, training takes roughly 120-170 days at about 50% computing efficiency.
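As a concrete check, here is a minimal Python sketch (configuration values and the 1.5T-token count are taken from this deck) of how the lower bound above translates into total training compute:

```python
# Sketch: per-iteration FLOPs lower bound for GPT-3 175B
# (formula from https://arxiv.org/abs/2104.04473, configuration values from this deck).
B, s = 1536, 2048           # global batch size, sequence length
l, h, V = 96, 12288, 51200  # transformer layers, hidden size, vocabulary size

flops_per_iter = 96 * B * s * l * h**2 * (1 + s / (6 * h) + V / (16 * l * h))

tokens_total = 1.5e12                  # 1.5T training tokens
iterations = tokens_total / (B * s)    # iterations needed to consume all tokens
total_flops = flops_per_iter * iterations

print(f"FLOPs per iteration : {flops_per_iter:.2e}")                 # ~4.5e18
print(f"Total training FLOPs: {total_flops / 1e21:.0f} ZettaFLOPs")  # ~2150
```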
CHALLENGES FOR TRAINING LARGE MODEL: High Memory Costs
Memory costs (mixed precision, native implementation), model states total: 3.5 TB
- Parameter: 350 GB (175B × 2 bytes)
- Gradient: 350 GB
- Optimizer: 2800 GB
- Activation: ?
- The model cannot fit in a single GPU or even a single GPU server (3.5 TB of model states vs. 80 GB per A100).
- Model parallelism is a MUST across multiple nodes.

CHALLENGES FOR TRAINING LARGE MODEL
- The model cannot fit in a single GPU or even a single GPU server.
- Extremely large computing power: about 16K A100-days of compute (not considering efficiency).
What we need:
- An efficient framework with model parallelism
- Careful co-design of software and system

NeMo AND MEGATRON
NeMo and Megatron-LM are NVIDIA's frameworks for efficiently training the world's largest transformer-based models.
- Train transformer models with billions of parameters.
- Achieve high utilization and scaling to thousands of GPUs.

OVERVIEW OF LARGE TRANSFORMER TRAINING TECHNIQUES
Parallelisms:
- Pipeline Parallelism
- Tensor Parallelism
- Sequence Parallelism
- Expert Parallelism
Memory Optimizations:
- Distributed optimizer (DeepSpeed ZeRO-1)
- Checkpoint activations
- Selective activation checkpointing
Others:
- FP16/BF16 training, optimized kernels, etc.
- Communication overlapping for PP and TP
(In the original slides, blue marks Megatron v2 features and green marks Megatron v3 new features.)

THE DISTRIBUTED TRAINING OF GPT-3 MODEL: Model Parallelism
Tensor model parallelism:
- Intra-layer: splits individual layers across multiple devices
- Simple to implement
- Good performance for large matrices
- Fine-grained, high-frequency communication
Pipeline model parallelism:
- Inter-layer: splits sets of layers across multiple devices
- Coarse-grained communication
- Generalizable to almost all DNNs
- Requires a large batch size for high throughput
- Load imbalance across workers
THE DISTRIBUTED TRAINING OF GPT-3 MODEL: Tensor Model Parallelism, MLP
f and g are conjugate operators: in the forward pass, f is the identity operator and g is an all-reduce; in the backward pass, f is an all-reduce and g is the identity operator.

THE DISTRIBUTED TRAINING OF GPT-3 MODEL: Tensor Model Parallelism, Self-Attention
The same conjugate operators f and g are used: f is identity and g is all-reduce in the forward pass, while f is all-reduce and g is identity in the backward pass.
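A minimal PyTorch sketch of these conjugate operators, assuming torch.distributed has already been initialized (in a real setup the collectives would run on the tensor-parallel process group); the class and function names here are illustrative, not Megatron-LM's actual API:

```python
import torch
import torch.distributed as dist

class _CopyToTensorParallelRegion(torch.autograd.Function):
    """f: identity in the forward pass, all-reduce in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_output):
        dist.all_reduce(grad_output)   # in practice, pass group=tensor_parallel_group
        return grad_output

class _ReduceFromTensorParallelRegion(torch.autograd.Function):
    """g: all-reduce in the forward pass, identity in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        dist.all_reduce(x)             # sum the partial outputs across TP ranks
        return x
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

# Column-parallel A followed by row-parallel B, as in the Megatron MLP block:
#   Y = g( GeLU( f(X) @ A_i ) @ B_i )
def tp_mlp(x, a_shard, b_shard):
    x = _CopyToTensorParallelRegion.apply(x)
    y = torch.nn.functional.gelu(x @ a_shard)   # local GeLU, no communication
    z = y @ b_shard
    return _ReduceFromTensorParallelRegion.apply(z)
```

Splitting A by columns and B by rows is what allows the GeLU to be applied locally, so only one all-reduce per MLP block is needed in each direction.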
PARALLELISM
New in Megatron v3: Sequence Parallelism
- Expands upon tensor parallelism by splitting tensors across the sequence dimension.
- Partitioning along the sequence dimension reduces the memory required for the activations.
- Introduces all-gather/reduce-scatter operations at the boundaries between sequence-parallel and tensor-parallel regions.
- g and ḡ are conjugate: g is an all-gather in the forward pass and a reduce-scatter in the backward pass; ḡ is a reduce-scatter in the forward pass and an all-gather in the backward pass.

THE DISTRIBUTED TRAINING OF GPT-3 MODEL: Pipeline Model Parallelism (1F1B)
m: number of micro-batches, p: number of pipeline stages, t_f: forward step time, t_b: backward step time
- Ideal time = m · (t_f + t_b)
- Bubble time = (p - 1) · (t_f + t_b)
- Total time = (m + p - 1) · (t_f + t_b)
- Bubble time overhead = bubble time / ideal time = (p - 1) / m

THE DISTRIBUTED TRAINING OF GPT-3 MODEL: Interleaved Pipeline Model Parallelism
With v interleaved stages (model chunks) per device, the bubble time overhead drops from (p - 1)/m to (1/v) · (p - 1)/m.
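Plugging in this deck's configuration gives a quick feel for the bubble cost; the number of micro-batches per pipeline, m = B/(d·b), is an assumption consistent with the listed batch sizes:

```python
# Pipeline bubble overhead for the GPT-3 configuration used in this deck.
p = 16                # pipeline stages
B, d, b = 1536, 8, 1  # global batch size, data-parallel size, micro-batch size
m = B // (d * b)      # micro-batches per pipeline per iteration -> 192

# v = 1 is plain 1F1B; larger v is the interleaved schedule.
for v in (1, 2, 4):
    print(f"v={v}: bubble overhead = {(p - 1) / (v * m):.2%}")
# v=1: 7.81%, v=2: 3.91%, v=4: 1.95%
```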
THE DISTRIBUTED TRAINING OF GPT-3 MODEL: 3D Parallelism

REQUIREMENT OF STORAGE: Activations Memory Optimizations
Full checkpointing:
- Store (checkpoint) the input activations of a group of layers and recompute the other required activations with an extra forward pass during back-propagation.
- Significantly reduces the memory required for training, at roughly 36% compute overhead.
Sequence parallel + selective checkpointing:
- Only checkpoint and recompute the parts of each layer that take up a considerable amount of memory but are cheap to recompute.
- Use sequence parallelism to distribute the remaining activations.
- Reduces the recompute overhead from 36% to 4%.

ACTIVATION CHECKPOINTING
New in Megatron v3: Selective Checkpointing
- Only checkpoint and recompute the parts of each layer that take up a considerable amount of memory but are not computationally expensive to recompute (called selective activation recomputation).
- Observation: attention operations generally have large input sizes and thus large activations, yet the number of floating-point operations (FLOPs) per input element is very low.
- Exceptions and limitations: works well only in conjunction with the other parallelism techniques.
- Combining sequence parallelism and selective checkpointing reduces the recompute overhead from 36% to 4%.
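A minimal PyTorch sketch of the idea, recomputing only the attention core (the large softmax(QKᵀ)V activations) while keeping the projections' activations stored; this illustrates selective recomputation rather than Megatron-LM's actual implementation, and the module and function names are made up for the example:

```python
import torch
from torch.utils.checkpoint import checkpoint

def attention_core(q, k, v, scale):
    # softmax(QK^T)V: large s x s activations, but few FLOPs per stored element,
    # so it is cheap to recompute relative to the memory it would otherwise hold.
    scores = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return scores @ v

class SelectiveCheckpointAttention(torch.nn.Module):
    def __init__(self, hidden, heads):
        super().__init__()
        self.heads, self.head_dim = heads, hidden // heads
        self.qkv = torch.nn.Linear(hidden, 3 * hidden)
        self.proj = torch.nn.Linear(hidden, hidden)

    def forward(self, x):                       # x: [batch, seq, hidden]
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(b, s, self.heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        # Only the attention core is checkpointed (recomputed during backward);
        # the QKV and output projections keep their activations as usual.
        out = checkpoint(attention_core, q, k, v, self.head_dim ** -0.5,
                         use_reentrant=False)
        out = out.transpose(1, 2).reshape(b, s, -1)
        return self.proj(out)
```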
DISTRIBUTED OPTIMIZER
New in Megatron v3

Method | Optimizer States (16M) | Gradients (2M) | Model Weights (2M) | Memory Cost | Communication Cost
Data Parallelism | Replicated | Replicated | Replicated | 20M | all-reduce(M)
Distributed Optimizer (ZeRO Stage 1) | Partitioned | Replicated | Replicated | (4 + 16/N)M | reduce-scatter(M) + all-gather(M)
ZeRO Stage 2 | Partitioned | Partitioned | Replicated | (2 + 18/N)M | reduce-scatter(M) * num_micro_batches + all-gather(M)
ZeRO Stage 3 | Partitioned | Partitioned | Partitioned | 20M/N | 1.5 * all-reduce(M) * num_micro_batches
(M is the number of parameters, N is the number of devices.)

Megatron v3 implements a distributed optimizer that shards the optimizer states (momentum, variance, master weights and master gradients) without compromising performance. Because of the communication overhead, ZeRO-2 and ZeRO-3 are not currently implemented in Megatron; in fact, the training of GPT-3 175B and Turing-NLG 530B did not use ZeRO.
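The memory-cost column can be evaluated directly; the sketch below just plugs the table's formulas into Python (M = 175B and N = 8 are illustrative values matching this deck):

```python
def model_state_bytes(M, N, scheme):
    """Per-device bytes of model states under mixed precision, per the table:
    2M fp16 weights + 2M fp16 gradients + 16M optimizer states."""
    return {
        "data_parallel": 20 * M,            # everything replicated
        "zero1":         (4 + 16 / N) * M,  # optimizer states partitioned
        "zero2":         (2 + 18 / N) * M,  # gradients partitioned as well
        "zero3":         20 * M / N,        # weights partitioned as well
    }[scheme]

M, N = 175e9, 8   # parameters, number of data-parallel devices (illustrative)
for scheme in ("data_parallel", "zero1", "zero2", "zero3"):
    print(f"{scheme:13s}: {model_state_bytes(M, N, scheme) / 1e12:.2f} TB")
# data_parallel: 3.50 TB, zero1: 1.05 TB, zero2: 0.74 TB, zero3: 0.44 TB
```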
MEMORY ANALYSIS

STORAGE ANALYSIS: GPT-3 Storage
- Model memory: parameters, gradients, optimizer states
- Activation memory: transformer layers (self-attention, MLP), embedding layer
- Extra memory: all-reduce buffers, PyTorch memory management

GPT-3 175B configuration used throughout the following analysis:
vocabulary size v = 51200, sequence length s = 2048, attention heads a = 96, hidden size h = 12288, layers n = 96, data parallel d = 8, tensor parallel t = 8, pipeline parallel p = 16, micro-batch size b = 1, batch size B = 1536 (so v/t = 6400 and h/t = 1536).

REQUIREMENT OF STORAGE: Model Parameters
Model parameters (total) ≈ 12·n·h²·(1 + 13/(12h) + (v + s)/(12·n·h)) ≈ 175B
Model parameters per device ≈ 175B / (t·p) ≈ 1.37B
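A quick sanity check of those two numbers with the standard GPT parameter-count formula (the same form as in https://arxiv.org/abs/2104.04473); dividing by the t·p model-parallel grid for the per-device count is the assumption used here:

```python
# Parameter count for the GPT-3 175B configuration above.
v, s, h, n = 51200, 2048, 12288, 96   # vocabulary, sequence length, hidden size, layers
t, p = 8, 16                          # tensor-parallel and pipeline-parallel sizes

total = 12 * n * h**2 * (1 + 13 / (12 * h) + (v + s) / (12 * n * h))
per_device = total / (t * p)

print(f"total parameters      ≈ {total / 1e9:.1f}B")       # ≈ 174.6B
print(f"parameters per device ≈ {per_device / 1e9:.2f}B")   # ≈ 1.36B (the deck rounds to 1.37B)
```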
REQUIREMENT OF STORAGE: Model States, Cost of Model Memory (AMP)
Let ψ ≈ 1.37B be the partitioned (per-device) model size.
- Parameters (fp16): 2ψ ≈ 2.74 GB
- Gradients (fp16): 2ψ ≈ 2.74 GB
- Adam optimizer: (4 + 4 + 4 + 4)ψ ≈ 21.9 GB
Note: in the optimizer term, the first 4 is the fp32 copy of the parameters, the second 4 is the fp32 copy of the gradients, the third 4 is the fp32 momentum, and the fourth 4 is the fp32 variance.

REQUIREMENT OF STORAGE: Model States with ZeRO, Cost of Model Memory (AMP)
With the same partitioned model size ψ ≈ 1.37B and ZeRO sharding across the data-parallel group:
- Parameters (fp16): 2ψ ≈ 2.74 GB, with ZeRO-3: 0.34 GB
- Gradients (fp16): 2ψ ≈ 2.74 GB, with ZeRO-2: 0.34 GB
- Adam optimizer: (4 + 4 + 4 + 4)ψ ≈ 21.9 GB, with ZeRO-1: 2.7 GB
(The four 4s are the same fp32 parameter copy, gradient copy, momentum and variance as above.)
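Spelled out as arithmetic (ψ and d as defined above; ZeRO shards each quantity across the d = 8 data-parallel ranks):

```python
psi = 1.37e9   # per-device partitioned model size (parameters)
d = 8          # data-parallel size over which ZeRO shards

param_fp16 = 2 * psi                  # fp16 parameters
grad_fp16  = 2 * psi                  # fp16 gradients
adam_fp32  = (4 + 4 + 4 + 4) * psi    # fp32 param copy + grad copy + momentum + variance

print(f"parameters: {param_fp16 / 1e9:.2f} GB   (ZeRO-3: {param_fp16 / d / 1e9:.2f} GB)")
print(f"gradients : {grad_fp16  / 1e9:.2f} GB   (ZeRO-2: {grad_fp16  / d / 1e9:.2f} GB)")
print(f"optimizer : {adam_fp32  / 1e9:.2f} GB   (ZeRO-1: {adam_fp32  / d / 1e9:.2f} GB)")
# parameters: 2.74 GB (0.34), gradients: 2.74 GB (0.34), optimizer: 21.92 GB (2.74)
```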
REQUIREMENT OF STORAGE: Activations Memory (Per Layer)
Attention: 11·s·b·h + 5·a·s²·b
MLP: 19·s·b·h
2 × LayerNorm: 4·s·b·h
Total per layer = s·b·h·(34 + 5·a·s/h)

REQUIREMENT OF STORAGE: Full Checkpointing, Cost of Activation Memory
- Only the input activation of each transformer layer is stored.
- Sublinear memory cost, but forward re-computation is needed.
- Full checkpointing significantly reduces the required memory for training, at roughly 36% compute overhead.
- Per-layer activations drop from s·b·h·(34 + 5·a·s/h) to 2·s·b·h (the MLP and attention activations are dropped and recomputed).

REQUIREMENT OF STORAGE: Tensor Parallel
Attention: 8·s·b·h/t + 5·a·s²·b/t + 3·s·b·h
MLP: 16·s·b·h/t + 3·s·b·h
2 × LayerNorm: 4·s·b·h
Total per layer = s·b·h·(24/t + 5·a·s/(h·t) + 10), where t is the TP size.

REQUIREMENT OF STORAGE: Tensor Parallel + Sequence Parallel
Attention: 8·s·b·h/t + 5·a·s²·b/t + 3·s·b·h/t
MLP: 16·s·b·h/t + 3·s·b·h/t
2 × LayerNorm: 4·s·b·h/t
Total per layer = (s·b·h/t)·(34 + 5·a·s/h)

REQUIREMENT OF STORAGE: Tensor Parallel + Sequence Parallel + Selective Checkpointing
The 5·a·s²·b/t attention-score activations are dropped and recomputed, leaving
Total per layer = 34·s·b·h/t

REQUIREMENT OF STORAGE: Activation Summary
Method | Activation Memory Per Layer | Recomputation Overhead
None | s·b·h·(34 + 5·a·s/h) ≈ 2.86 GB | 0%
Full checkpointing | 250 MB | 36%
TP + SP + Selective | 34·s·b·h/t ≈ 106 MB | 4%
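The per-layer formulas above can be evaluated directly for this configuration (bytes, b = 1); the sketch below reproduces the "None" and "TP + SP + Selective" rows of the summary:

```python
# Per-layer activation memory (bytes) for the formulas in this section.
s, b, h, a, t = 2048, 1, 12288, 96, 8

variants = {
    "none":                s * b * h * (34 + 5 * a * s / h),
    "tensor parallel":     s * b * h * (24 / t + 5 * a * s / (h * t) + 10),
    "TP + SP":             s * b * h / t * (34 + 5 * a * s / h),
    "TP + SP + selective": 34 * s * b * h / t,
}
for name, nbytes in variants.items():
    print(f"{name:20s}: {nbytes / 1e6:7.0f} MB")
# none ≈ 2869 MB (~2.86 GB) and TP+SP+selective ≈ 107 MB, matching the summary table.
```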
COMMUNICATION COST

DATA PARALLELISM
Name | Operation | Message size | Group | Loops
Gradient accumulation | All-reduce | Φ (model gradient) | Data-parallel group | 1
With Φ the model gradient size, BW the bus bandwidth and d the data-parallel size, the all-reduce time is
  T_DP = [2(d - 1)/d] · Φ / BW

[Figure: Bus Bandwidth vs. Message Size; measured bus bandwidth (GB/s) as a function of message size (MB).]

TENSOR PARALLELISM
Name | Operation | Message size | Group
Self-attention | All-reduce for TP; all-gather and reduce-scatter for TP + SP | b·s·h activations (2·b·s·h bytes in fp16) | Tensor-parallel group
MLP | All-reduce for TP; all-gather and reduce-scatter for TP + SP | b·s·h activations (2·b·s·h bytes in fp16) | Tensor-parallel group
The communication size for TP and TP + SP is the same.
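A direct evaluation of the data-parallel gradient all-reduce cost; the bandwidth figure is a placeholder that should be interpolated from the bus-bandwidth curve for the actual message size:

```python
def allreduce_time(message_bytes, d, bus_bw_gbps):
    """Ring all-reduce: each rank moves 2*(d-1)/d of the message over the bus."""
    return 2 * (d - 1) / d * message_bytes / (bus_bw_gbps * 1e9)

grad_bytes = 2 * 1.37e9   # fp16 gradients of the per-device partitioned model
d = 8                     # data-parallel size
bw = 150                  # GB/s, placeholder read off the bandwidth curve
print(f"gradient all-reduce ≈ {allreduce_time(grad_bytes, d, bw) * 1e3:.0f} ms")  # ≈ 32 ms
```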
TENSOR PARALLELISM: Equation for Communication Cost
  T_TP ≈ (3 + 3) · n · m · [2(t - 1)/t] · (2·b·s·h) / BW
where 2·b·s·h is the message size per NCCL call (fp16 activations), 2(t - 1)/t is the correction factor for bus bandwidth (all-reduce), the 3 + 3 counts the collectives for self-attention and for the MLP in each layer, n is the number of transformer layers, m is the number of micro-batches, and BW is the bus bandwidth obtained by interpolation from the measured curve.
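A sketch of that estimate in code; interpreting the 3 + 3 collectives per layer as covering the forward, recomputation and backward passes of the attention and MLP blocks is an assumption, and so is using the per-stage layer count when pipeline parallelism is active:

```python
def tp_comm_time(m, n_layers, t, b, s, h, bus_bw_gbps):
    """Tensor-parallel communication time following the equation above:
    (3 + 3) collectives per layer (self-attention + MLP; assumed to cover the
    forward, recompute and backward passes), each moving a b*s*h fp16 tensor."""
    msg_bytes = 2 * b * s * h                                      # message size per NCCL call
    per_call = 2 * (t - 1) / t * msg_bytes / (bus_bw_gbps * 1e9)   # bus-bandwidth all-reduce model
    return (3 + 3) * n_layers * m * per_call

# Per-device estimate for one iteration: 96/16 = 6 layers per pipeline stage,
# 192 micro-batches, bandwidth placeholder to be interpolated from the curve.
print(f"{tp_comm_time(m=192, n_layers=96 // 16, t=8, b=1, s=2048, h=12288, bus_bw_gbps=200):.1f} s")
```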
PIPELINE PARALLELISM: Scatter-Gather Mechanism
- Activations are exchanged between pipeline stages with point-to-point (P2P) sends and receives.
- With the scatter-gather optimization, the all-gather happens over NVLink and its cost can be ignored.

COMPUTATION COST
TRAINING COMPUTATION ANALYSIS OF GPT-3 175B: Per-Batch FLOPs Estimation
- Model FLOPs (per batch) ≈ 96 · B · s · n · h² · (1 + s/(6h) + v/(16·n·h))
- Model FLOPS per GPU = Model FLOPs / (batch time × number of GPUs)
- A100 training efficiency = Model FLOPS per GPU / 312 TFLOPS
We can check whether training efficiency meets expectations by calculating the FLOPS achieved per GPU during training. For a large GPT model (with TP and PP), about 150 model TFLOPS per A100 is achievable; if the value is below 120, there are usually performance issues in the training setup.
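A minimal sketch of this efficiency check; the formula reuses the per-iteration estimate from earlier in this deck, the 30-second batch time is a hypothetical input, and 312 TFLOPS is the A100's peak FP16/BF16 throughput:

```python
def model_flops_per_batch(B, s, n, h, v):
    # Same lower-bound estimate used in the compute-cost section of this deck.
    return 96 * B * s * n * h**2 * (1 + s / (6 * h) + v / (16 * n * h))

def a100_efficiency(batch_time_s, num_gpus, **cfg):
    flops_per_gpu = model_flops_per_batch(**cfg) / (batch_time_s * num_gpus)
    return flops_per_gpu / 1e12, flops_per_gpu / 312e12  # (model TFLOPS/GPU, fraction of peak)

tflops, frac = a100_efficiency(batch_time_s=30.0, num_gpus=1024,
                               B=1536, s=2048, n=96, h=12288, v=51200)
print(f"{tflops:.0f} model TFLOPS per GPU ({frac:.0%} of A100 peak)")
# ~147 TFLOPS/GPU here; the deck's rule of thumb: ~150 is achievable, below ~120
# usually indicates a performance problem.
```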
CONCLUSION: TAKEAWAY POINTS
Out of the box (OOTB):
- Mixed precision training
- FlashAttention
- BF16 is recommended for larger model training (>20B).
For large-scale training (both weak scaling and strong scaling), my intuition:
If memory is an issue:
- Selective activation checkpointing
- Distributed optimizer
- Progressively increase tensor parallelism (with sequence parallel enabled, keeping hidden/tp at least 1024, ideally 2048)
- Progressively increase pipeline parallelism
- Full activation checkpointing
If memory is not an issue:
- Data parallelism
- Larger batch size

Thank you for watching.