Outline: 01 CUDA, 02 Compression & Quantization, 03 TVM
#page#
01 CUDA
#page#
Project Introduction
Built on the CUDA Toolkit, focused on inference for NLP sequence models
Supports BERT / GPT / Transformer and VAE
Supports beam search / diverse beam search / sampling
FP32 went live 2019.07, FP16 went live 2019.09, open-sourced 2019.12
#page#
Project Background
Sequence models are widely deployed, e.g. machine translation, intelligent writing, automatic question answering
Native deep-learning frameworks such as TensorFlow and PyTorch show second-level latency
Model structures have converged, so kernels are highly reusable and R&D cost stays controllable
CUDA offers a simple, easy-to-use multi-threading paradigm and a mature community
Professional support from the NVIDIA "Team of Four"
#page#
Optimization Methods
Multi-operation operator fusion
Dynamic GPU-memory reuse
Hierarchical decoding computation
#page#
Multi-Operation Operator Fusion
Layer Normalization in TF vs. LightSeq. TensorFlow builds it from several fine-grained ops, each launching its own kernel:

    mean = tf.reduce_mean(x, axis=-1, keepdims=True)
    variance = tf.reduce_mean(tf.square(x - mean), axis=-1, keepdims=True)
    norm = (x - mean) * tf.rsqrt(variance + epsilon)
    result = norm * scale + bias

LightSeq fuses the whole computation into one custom CUDA kernel: the mean and the variance are reduced into shared memory (s_mean, s_var) with __syncthreads() between the two steps, then rsqrtf, scale, and bias are applied in place, so the entire LayerNorm costs a single kernel launch.
Across a Transformer layer, the GEMM-shaped steps (Q,K,V = X·(Wq,Wk,Wv) + b, S = Q·K^T/√d, the FFN projections Y = X·W + b) run through cuBLAS, while the glue in between (reshape, softmax, bias, residual add X = X + X_A, layer norm) runs in fused custom kernels.
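The CUDA kernel shown alongside the TF snippet on the original slide is too garbled to restore verbatim. Below is a minimal sketch of a fused LayerNorm kernel in the same spirit: one kernel launch replaces the four TF ops, with one thread block per row and shared-memory reductions for mean and variance. The kernel name, the one-thread-per-element layout, and the power-of-two hidden_dim assumption are illustrative choices, not LightSeq's actual implementation.

    #include <cuda_runtime.h>

    // Minimal fused LayerNorm sketch: one thread block normalizes one row of x.
    // Assumes hidden_dim is a power of two <= 1024 and blockDim.x == hidden_dim.
    __global__ void fused_layer_norm(const float* x, const float* scale,
                                     const float* bias, float* out,
                                     int hidden_dim, float epsilon) {
      extern __shared__ float buf[];           // hidden_dim floats for the reductions
      __shared__ float s_mean, s_rstd;

      const float* row_in = x + blockIdx.x * hidden_dim;
      float* row_out = out + blockIdx.x * hidden_dim;
      int tid = threadIdx.x;

      // Step 1: reduce the row mean in shared memory.
      buf[tid] = row_in[tid];
      __syncthreads();
      for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
      }
      if (tid == 0) s_mean = buf[0] / hidden_dim;
      __syncthreads();

      // Step 2: reduce the variance, keep rsqrt(var + eps).
      float diff = row_in[tid] - s_mean;
      buf[tid] = diff * diff;
      __syncthreads();
      for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
      }
      if (tid == 0) s_rstd = rsqrtf(buf[0] / hidden_dim + epsilon);
      __syncthreads();

      // Step 3: normalize, scale, and shift in one pass.
      row_out[tid] = (row_in[tid] - s_mean) * s_rstd * scale[tid] + bias[tid];
    }

    // Launch example: rows = batch_size * seq_len rows of width hidden_dim.
    // fused_layer_norm<<<rows, hidden_dim, hidden_dim * sizeof(float)>>>(
    //     d_x, d_scale, d_bias, d_out, hidden_dim, 1e-6f);

Fusing the two reductions and the scale/bias into one kernel removes the intermediate tensors and the extra kernel launches that the unfused TF version pays for.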
#page#
Hierarchical Decoding Computation: TensorFlow
One beam-search step in TensorFlow:

    # 1. Compute the log probability of every sequence ending in each token
    log_token_prob = tf.nn.log_softmax(logit)       # [batch_size, beam_size, vocab_size]
    log_seq_prob += log_token_prob                  # [batch_size, beam_size, vocab_size]
    log_seq_prob = tf.reshape(log_seq_prob, [-1, beam_size * vocab_size])
    # 2. For each sequence (batch element), pick the top-k tokens
    topk_log_probs, topk_indices = tf.nn.top_k(log_seq_prob, k=K)
    # 3. Refresh the key/value caches of the decoder self-attention modules according to the beam ids
    refresh_cache(cache, topk_indices)

(Diagram: logits for batch_size = 2, beam = 2, vocabulary size n.)
This step contains redundant computation, is hard to parallelize, and accounts for 30%+ of the latency of one inference pass.
#page#
Hierarchical Decoding Computation: LightSeq
Coarse selection + fine ranking. The coarse pass sweeps the logit matrix twice:
1. For each beam, randomly split its logit values into k groups, take the maximum of each group, then take the minimum of those k maxima as a rough top-k threshold, denoted R_topk.
2. For each beam, write every logit value greater than R_topk into the fine-ranking queue.
(Diagram: a worked example showing the first sweep, the second sweep, and the final fine ranking by directly sorting the short queue.)
A CUDA sketch of the coarse pass follows.
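A minimal sketch of the coarse pass described above, with one thread block per beam. A strided split stands in for the slide's random grouping, the group count k equals blockDim.x (a power of two), and names such as coarse_topk, queue, and queue_len are assumed for illustration; the real LightSeq kernel will differ.

    #include <cuda_runtime.h>
    #include <float.h>

    // Coarse top-k selection sketch: one block per beam.
    // Each thread owns one group of logits, computes the group maximum, and the
    // block takes the minimum of those maxima as the rough threshold R_topk.
    // Logits at or above R_topk are appended to a per-beam fine-ranking queue.
    __global__ void coarse_topk(const float* logits, int vocab_size,
                                float* queue, int* queue_len, int queue_capacity) {
      extern __shared__ float group_max[];            // blockDim.x floats
      const float* beam_logits = logits + (size_t)blockIdx.x * vocab_size;
      int tid = threadIdx.x;

      // Sweep 1a: per-group maximum (group = every blockDim.x-th element).
      float m = -FLT_MAX;
      for (int i = tid; i < vocab_size; i += blockDim.x)
        m = fmaxf(m, beam_logits[i]);
      group_max[tid] = m;
      __syncthreads();

      // Sweep 1b: minimum of the group maxima -> rough threshold R_topk.
      for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
          group_max[tid] = fminf(group_max[tid], group_max[tid + stride]);
        __syncthreads();
      }
      float r_topk = group_max[0];

      // Sweep 2: append qualifying logits to this beam's fine-ranking queue.
      float* beam_queue = queue + (size_t)blockIdx.x * queue_capacity;
      int* beam_len = queue_len + blockIdx.x;
      if (tid == 0) *beam_len = 0;
      __syncthreads();
      for (int i = tid; i < vocab_size; i += blockDim.x) {
        float v = beam_logits[i];
        if (v >= r_topk) {                            // >= keeps the threshold element itself
          int pos = atomicAdd(beam_len, 1);
          if (pos < queue_capacity) beam_queue[pos] = v;   // a real kernel would also store the token id
        }
      }
      // The short queue is then sorted directly (the "fine ranking" step).
    }

    // Launch example (k = 128 groups per beam, one block per beam):
    // coarse_topk<<<batch_size * beam_size, 128, 128 * sizeof(float)>>>(
    //     d_logits, vocab_size, d_queue, d_queue_len, queue_capacity);

Because R_topk is the minimum of the k group maxima, at least k logits survive into the fine-ranking queue, which is small enough to sort directly.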
#page#
Performance Analysis
1. Matrix multiplication now accounts for 80%+ of runtime (vs. only about 25% under TF) and has become the main bottleneck of inference efficiency.
2. Cache refresh accounts for about 10% in FP32 and 6% in FP16; reducing the number of decoder layers or lowering the cache precision could push this further.
3. All other operations together account for about 8% in FP32 and 6% in FP16; the gains from kernel fusion are clear and already approach the optimization upper bound.
#page#
Advantage Analysis: High Performance
(Charts: inference performance of TensorFlow, FasterTransformer, and LightSeq across batch sizes; TurboTransformers is also listed as a comparison point.)
#page#
(Charts and table: Transformer parameter distribution (M) across Embedding, Attention, and FFN; WMT en-de BLEU and parameter counts comparing Transformer-big (28.58) with lightweight and dynamic depth-wise convolution variants, whose BLEU sits within roughly 0.1-0.6 of the baseline at much smaller parameter counts.)
#page#
Sequence Model Compression
Techniques: layer parameter sharing with a learned transform (W → w·W + b), word-embedding factorization, and hierarchical embeddings (cf. ALBERT, https://arxiv.org/abs/1909.11942).
(Table: Transformer-big baseline at 30.69 BLEU with about 258M parameters; variants built from these techniques — ls+ed(128)+dp01, ls+ed(256), emb(6k), ffn(2048), 12e3d, 12e6d, +dm(768), +head(12), ls+hierarchical emb+dp01 — shrink the model to roughly 33-50M parameters with BLEU drops between about 0.2 and 3.3.)
#page#
Quantization Pipeline
Pre-train, then fine-tune with quantization-aware training (QAT).
Model                      BLEU    Δ
Transformer-big            30.84
QAT, fixed-range clip      30.75   -0.09
QAT, percentage clip       30.75   -0.09
QAT, asymmetrical          30.72   -0.12
QAT, symmetrical           30.75   -0.09
PTQ, fixed-range clip      30.43   -0.41
PTQ, percentage clip       30.41   -0.43
PTQ, asymmetrical          30.25   -0.59
PTQ, symmetrical           30.41   -0.43
References: https://arxiv.org/abs/1511.00363, https://arxiv.org/abs/1712.05877
#page#
Quantization: Clipping + Symmetric
Asymmetric quantization (scale S, zero point z):
    q1·q2 = (i1 − z1)·S1 · (i2 − z2)·S2 = S1·S2·(i1·i2 − i1·z2 − i2·z1 + z1·z2)
Symmetric quantization:
    bound = max(|W_min|, |W_max|),  S = bound / (2^(n−1) − 1)
    q1·q2 = (i1·S1) · (i2·S2) = S1·S2·(i1·i2)
With symmetric quantization the cross terms vanish, so an integer matmul only needs a single floating-point rescale of the integer accumulator.
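To make the symmetric case concrete, here is a small sketch with assumed names (quantize_symmetric, int8_dot_rescale; not LightSeq's API): weights are quantized with S = bound / 127 for int8, the dot product is accumulated in int32, and a single S1·S2 rescale recovers the floating-point result, exactly because both zero points are zero.

    #include <cuda_runtime.h>
    #include <stdint.h>

    // Symmetric int8 quantization of one tensor: i = round(w / S), clipped to [-127, 127].
    __global__ void quantize_symmetric(const float* w, int8_t* q, int n, float scale) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
        float v = rintf(w[i] / scale);               // nearest integer level
        v = fminf(fmaxf(v, -127.f), 127.f);          // clip to the symmetric range
        q[i] = (int8_t)v;
      }
    }

    // Dot product of two symmetrically quantized vectors: since both zero points
    // are 0, the result is S1*S2 * sum(i1*i2) -- no cross terms to correct.
    // Launch with one block of power-of-two size and blockDim.x * sizeof(int32_t) shared bytes.
    __global__ void int8_dot_rescale(const int8_t* a, const int8_t* b, int n,
                                     float s1, float s2, float* out) {
      extern __shared__ int32_t partial[];
      int32_t acc = 0;                               // int32 accumulator
      for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += (int32_t)a[i] * (int32_t)b[i];
      partial[threadIdx.x] = acc;
      __syncthreads();
      for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
      }
      if (threadIdx.x == 0) *out = s1 * s2 * (float)partial[0];   // single rescale at the end
    }

    // Host-side scale computation for int8 (n = 8 bits): S = max(|w_min|, |w_max|) / 127.
    // float bound = fmaxf(fabsf(w_min), fabsf(w_max));
    // float scale = bound / 127.f;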
#page#
Efficient Local Quantization
Training: quantize all weights.
Inference: quantize only the compute-intensive operations.
#page#
Conclusion
Light, Fast, Accurate:
FLOPs: 1/4
Score: about -1% (BLEU)
Model size: 1/20 (MB)
(Chart: FLOPs, BLEU, and model size of Transformer-big vs. compression vs. compression + quantization.)
#page#
03 TVM
#page#
Background
Improve the performance of compute-intensive operations
Enable deployment across multiple platforms
#page#
[batch, seq_len] → batch_token_num: remove the padding tokens, then pad back a small number of padding tokens to discretize batch_token_num (e.g. to a multiple of 8).
Build time: for a given runtime environment, model, and quantization setting, search tuning parameters for each dynamic shape and generate execution code.
Runtime: execute the code generated for the corresponding shape.
Open-sourcing is planned for the end of 2020 — stay tuned.
#page#
LightSeq Application Case
火山翻譯 (Volcano Translation): http:/
Paper: https://arxiv.org/abs/2010.13887
CSDN talk: https:/
#page#
Thank you