LightSeq: High-Performance NLP Sequence Inference in Practice
王曉暉 (Wang Xiaohui), ByteDance AI Lab

Outline
- 01 CUDA
- 02 Compression and quantization
- 03 TVM

01 CUDA

Project introduction
- Built on the CUDA Toolkit, focused on inference for NLP sequence models
- Supports BERT, GPT, Transformer, and VAE
- Supports beam search, diverse beam search, and sampling
- FP32 went online in July 2019, FP16 in September 2019; open-sourced in December 2019

Project background
- The models are used widely, e.g. in machine translation, intelligent writing, and automatic question answering
- Native deep learning frameworks such as TensorFlow and PyTorch serve them with second-level latency
- Model structures have converged and are highly reusable, so the development cost of a dedicated inference engine is controllable

Optimization methods
- Fusing multiple operations into single kernels
- Dynamic GPU memory reuse
- Hierarchical decoding computation

Fusing multiple operations: Layer Normalization in TensorFlow vs. LightSeq
In TensorFlow, layer normalization launches one kernel per operation:

    mean = tf.reduce_mean(x, axis=-1, keepdims=True)
    variance = tf.reduce_mean(tf.square(x - mean), axis=-1, keepdims=True)
    norm = (x - mean) * tf.rsqrt(variance + epsilon)
    result = norm * scale + bias
In LightSeq the same computation is a single custom CUDA kernel: each thread block handles one row, accumulates the mean and variance with shared-memory reductions (s_mean, s_var guarded by __syncthreads()), and applies rsqrtf, scale, and bias in one pass. The rest of the Transformer layer is treated the same way: compute-intensive steps such as the fused Q/K/V projection Q, K, V = X · (Wq | Wk | Wv) and the attention scores S = Q·K^T / sqrt(d) run as cuBLAS GEMM calls, while the surrounding element-wise work (bias add, softmax, residual add, layer normalization) is folded into custom kernels.
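A minimal NumPy sketch of the GEMM-side fusion mentioned above, concatenating the three projection matrices so one matrix multiply replaces three (names such as qkv_fused are illustrative; the real engine does this with cuBLAS on the GPU):

    import numpy as np

    def qkv_fused(x, w_q, w_k, w_v):
        # one GEMM instead of three: Q, K, V = X . (Wq | Wk | Wv)
        qkv_w = np.concatenate([w_q, w_k, w_v], axis=1)   # [d_model, 3*d_model]
        qkv = x @ qkv_w                                   # single matrix multiply
        return np.split(qkv, 3, axis=-1)                  # Q, K, V

    x = np.random.randn(8, 512).astype(np.float32)        # [tokens, d_model]
    w_q, w_k, w_v = (np.random.randn(512, 512).astype(np.float32) for _ in range(3))
    q, k, v = qkv_fused(x, w_q, w_k, w_v)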
Dynamic GPU memory reuse
- Put an upper bound on every dynamic dimension, e.g. the sequence length
- Allocate the GPU memory for all tensors when the model is loaded; dynamic dimensions are allocated at their maximum value
- Optimize the computation graph so that tensors with no dependency on each other share the same memory
- No memory is allocated or released while the model is serving; a single T4 GPU can host 8 Transformer-big models
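A minimal sketch of the pre-allocation idea (plain NumPy standing in for GPU buffers; MAX_BATCH, MAX_SEQ, and the pool layout are illustrative assumptions, not LightSeq's actual data structures):

    import numpy as np

    MAX_BATCH, MAX_SEQ, HIDDEN = 8, 256, 1024             # upper bounds fixed at load time

    class StaticBufferPool:
        """Allocate every buffer once at its maximum size; hand out views at run time."""
        def __init__(self, n_buffers):
            self.buffers = [np.empty((MAX_BATCH, MAX_SEQ, HIDDEN), dtype=np.float16)
                            for _ in range(n_buffers)]

        def view(self, idx, batch, seq_len):
            # no allocation during serving: just a view into a pre-allocated buffer
            return self.buffers[idx][:batch, :seq_len, :]

    pool = StaticBufferPool(n_buffers=2)
    attn_out = pool.view(0, batch=2, seq_len=37)
    ffn_out = pool.view(0, batch=2, seq_len=37)   # tensors whose lifetimes do not overlap
                                                  # can be mapped to the same buffer index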
Hierarchical decoding computation: the TensorFlow baseline
One decoding step of beam search looks roughly like this:

    # 1. log probability of every sequence ending in each token
    log_token_prob = tf.nn.log_softmax(logit)             # [batch_size, beam_size, vocab_size]
    log_seq_prob += log_token_prob                         # [batch_size, beam_size, vocab_size]
    log_seq_prob = tf.reshape(log_seq_prob, [-1, beam_size * vocab_size])

    # 2. for every sequence (batch element), pick the top-k tokens
    topk_log_probs, topk_indices = tf.nn.top_k(log_seq_prob, k=K)

    # 3. refresh the key/value cache of the decoder self-attention according to the beam ids
    refresh_cache(cache, topk_indices)
(Figure: logits for a batch of 2 with beam size 2, each beam scoring the full vocabulary.)
- This step contains redundant computation, is hard to parallelize, and accounts for 30%+ of one inference pass

Hierarchical decoding computation: LightSeq
Coarse selection followed by fine re-ranking. The coarse pass scans the logit matrix twice:
1. For each beam, randomly split its logit values into k groups, take the maximum of each group, then take the minimum of those k maxima as a rough top-k threshold, denoted R_topk
2. For each beam, write every logit value larger than R_topk into the re-ranking queue
The short re-ranking queue is then sorted directly.
(Figure: a worked example showing the two coarse passes and the direct sort of the re-ranking queue.)
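A minimal NumPy sketch of this coarse-then-refine top-k for a single beam (group count, names, and the final sort are illustrative; the real implementation runs as CUDA kernels over the logit rows):

    import numpy as np

    def hierarchical_topk(logits, k):
        """Cheaply bound the top-k threshold, then rerank only a short queue."""
        # pass 1: random k-way split, max per group, min of the maxima = rough threshold
        groups = np.array_split(np.random.permutation(logits), k)
        r_topk = min(g.max() for g in groups)             # at least k values are >= r_topk

        # pass 2: only values above the rough threshold enter the re-ranking queue
        queue = logits[logits >= r_topk]

        # fine re-ranking: sort the short queue and keep the k largest values
        return np.sort(queue)[-k:][::-1]

    logits = np.random.randn(32000).astype(np.float32)    # one beam's vocabulary-sized row
    print(hierarchical_topk(logits, k=4))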
Performance analysis
1. Matrix multiplication now takes 80%+ of the runtime (vs. only 25% in TensorFlow) and has become the main bottleneck of inference efficiency.
2. Cache refresh takes 10% in FP32 and 6% in FP16; reducing the number of decoder layers or lowering the cache precision could push this further.
3. All other operations together take 8% in FP32 and 6% in FP16, so the gain from kernel fusion is clear and already close to the optimization upper bound.

Advantage analysis: rich functionality
(Table: supported models (Transformer, GPT, VAE) and decoding methods (beam search, diverse beam search, sampling) for Faster Transformer, Turbo Transformers, and LightSeq; several combinations are unsupported in Faster Transformer and Turbo Transformers, while LightSeq covers all of them.)
en-de results (Model, BLEU, Params/M, ΔBLEU):
- Transformer big: 28.58, 245
- Convolutional seq2seq, depth-wise: 27.73, 220, -0.85
- Convolutional seq2seq, light-weight: 28.37, 244, -0.21
- Convolutional seq2seq, dynamic: 28.48, 249, -0.10
- LSRA, depth-wise: 27.97, 214, -0.61
- LSRA, light-weight: 28.35, 227, -0.23
- LSRA, dynamic: 28.49, 230, -0.09
(Chart: Transformer parameter distribution in millions across Embedding, Attention, and FFN, comparing the seq2seq and LSRA configurations.)
Sequence model compression
Techniques explored on top of Transformer-big (BLEU 30.69):
- Layer parameter sharing plus a per-layer transformation: W' = W·w + b
- Embedding sharing (emb share)
- Word-embedding factorization, e.g. emb(6k)
- Configurations such as ls+ed(128)+dp01, ls+ed(256), ffn(2048), 12e3d, 12e6d+dm(768)+head(12), and ls+hierarchical emb+dp01
(Table: BLEU and parameters in millions for each variant; most configurations lose between about 0.2 and 1.3 BLEU relative to the 30.69 baseline, while the aggressive emb(6k) factorization loses 3.27 BLEU.)
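A minimal sketch of two of these ideas, word-embedding factorization and cross-layer weight sharing with the per-layer transformation W' = W·w + b (NumPy; sizes are illustrative, and whether w and b are scalars or vectors is not specified on the slide, so scalars are used here):

    import numpy as np

    vocab, d_inner, d_model, n_layers = 32000, 128, 1024, 6   # illustrative sizes

    # word-embedding factorization: one vocab x d_model matrix becomes two small ones
    emb_a = np.random.randn(vocab, d_inner).astype(np.float32)    # vocab x 128
    emb_b = np.random.randn(d_inner, d_model).astype(np.float32)  # 128 x 1024
    token_ids = np.array([5, 42, 7])
    embeddings = emb_a[token_ids] @ emb_b      # same shape as a full embedding lookup

    # layer parameter sharing + transformation: one shared weight, cheap per-layer w and b
    w_shared = np.random.randn(d_model, d_model).astype(np.float32)
    layer_w = np.random.randn(n_layers).astype(np.float32)
    layer_b = np.random.randn(n_layers).astype(np.float32)

    def layer_weight(i):
        # W' = W * w + b: each layer gets its own weight at almost no extra parameter cost
        return w_shared * layer_w[i] + layer_b[i]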
Quantization pipeline
- Pre-train, then fine-tune with quantization-aware training (QAT)

Model, BLEU, ΔBLEU:
- Transformer-big: 30.84
- QAT, fixed range clip: 30.75 (-0.09)
- QAT, percentage clip: 30.75 (-0.09)
- QAT, asymmetrical: 30.72 (-0.12)
- QAT, symmetrical: 30.75 (-0.09)
- PTQ, fixed range clip: 30.43 (-0.41)
- PTQ, percentage clip: 30.41 (-0.43)
- PTQ, asymmetrical: 30.25 (-0.59)
- PTQ, symmetrical: 30.41 (-0.43)

References:
- https://arxiv.org/abs/1511.00363
- https://arxiv.org/abs/1712.05877
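The slide does not define "fixed range clip" and "percentage clip"; a common reading, sketched here as an assumption, is a constant clipping bound versus a percentile of the absolute weight values:

    import numpy as np

    def clip_bound(w, mode="percentage", fixed=1.0, pct=99.9):
        # fixed range clip: the same constant bound for every weight matrix (assumed)
        # percentage clip: clip outliers beyond e.g. the 99.9th percentile of |w| (assumed)
        if mode == "fixed":
            return fixed
        return float(np.percentile(np.abs(w), pct))

    w = np.random.randn(1024, 1024).astype(np.float32)
    print(clip_bound(w, "fixed"), clip_bound(w, "percentage"))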
Quantization: clipping + symmetric
- Clipping bound: bound = max(|W_min|, |W_max|); weights are clipped to [-bound, bound] and mapped to n-bit integers, with scale S = bound / (2^(n-1) - 1)
- Asymmetric quantization keeps a zero-point z, so q1·q2 = (i1 - z1)·S1 · (i2 - z2)·S2 = S1·S2·(i1·i2 - i1·z2 - i2·z1 + z1·z2), which adds cross terms to the integer product
- Symmetric quantization has zero-point 0, so q1·q2 = (i1·S1)·(i2·S2) = S1·S2·(i1·i2)
(Figure: asymmetric vs. symmetric integer ranges.)

Efficient local quantization
- Training: quantize all weights
- Inference: quantize only the compute-intensive operations
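A minimal NumPy sketch combining the two slides above: symmetric int8 quantization with a clipping bound, and a quantized matrix multiply standing in for a compute-intensive op (names and sizes are illustrative):

    import numpy as np

    def quantize_sym(w, n_bits=8):
        # symmetric: bound = max(|w_min|, |w_max|), zero-point is 0
        bound = max(abs(float(w.min())), abs(float(w.max())))
        scale = bound / (2 ** (n_bits - 1) - 1)            # e.g. bound / 127 for int8
        q = np.clip(np.round(w / scale), -(2 ** (n_bits - 1) - 1), 2 ** (n_bits - 1) - 1)
        return q.astype(np.int8), scale

    x = np.random.randn(4, 64).astype(np.float32)
    w = np.random.randn(64, 64).astype(np.float32)
    qx, sx = quantize_sym(x)
    qw, sw = quantize_sym(w)
    # q1*q2 = (i1*S1)*(i2*S2) = S1*S2*(i1*i2): a pure integer product, rescaled once
    y = (qx.astype(np.int32) @ qw.astype(np.int32)) * (sx * sw)
    print(np.abs(y - x @ w).max())                         # small error vs. the float matmul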
Conclusion
- Light: model size reduced to roughly 1/20 (MB)
- Fast: FLOPs reduced to roughly 1/4
- Accurate: score within about -1% BLEU
(Chart: BLEU and model size for Transformer-big under compression and compression + quantization.)

03 TVM

Background
- Improve the performance of compute-intensive operations
- Provide deployment across multiple platforms
- https://…/apache/incubator-…

Methods
- Introduce TVM for automatic code generation and parameter search
- Dynamic shape support
- Quantized sequence models
- Graph optimizations customized for sequence models

Case study: BERT
- [batch_size, seq_len] → [batch_token_num]: remove the padding tokens
- Pad a small number of tokens back so that batch_token_num is discretized (e.g. to a multiple of 8)
- Build time: for a given runtime environment + model + quantization, search parameters over the dynamic shapes and generate execution code
- Run time: execute the code generated for the corresponding shape
- Recently open-sourced; welcome to follow

LightSeq application cases
- Volcano Translation (火山翻譯): http://…
- NeurST: https://… (paper: https://arxiv.org/abs/2010.13887)

Q&A
Thank you