Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores
Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang
ICAIS Lab, Nanjing University, China
Jan 23, 2025

Outline
01 Background & Motivation
02 Our Works
03 Experiments
04 Conclusion

01 Background & Motivation

1.1.1 Background: Quantization of LLMs
Challenges brought by the growth in size of LLMs
- More memory (storage)
- More computational power and time (inference)
[Figure: growth in size of Transformer models, 2017-2024: GPT-1 (117M), BERT (340M), GPT-2 (1.5B), GPT-3 (175B), Gopher (280B), PaLM (540B), GPT-4 (1000+B)]

One effective method: model quantization
- Reduces the storage requirement
- Reduces the computational overhead
- Representative quantization works: GPTQ (3-4 bit) [1], TSLD (2 bit) [2], OneBit (1 bit) [3]

Storage reduction brought by model quantization
(For example, a ~7B-parameter model needs about 13.5 GB at 16 bits per weight, but only about 2.5 GB at 3 bits per weight.)

Model       FP16 (GB)   GPTQ 3-bit (GB)   TSLD (GB)   OneBit (GB)
LLaMA-7B    13.5        2.5               1.7         1.3
LLaMA-13B   26.0        4.9               3.3         2.2
LLaMA-30B   65.1        12.2              8.1         4.9
LLaMA-65B   130.6       24.5              16.3        9.2

[1] Frantar, Elias, et al. "GPTQ: Accurate post-training quantization for generative pre-trained transformers." arXiv preprint arXiv:2210.17323 (2022).
[2] Kim, Minsoo, et al. "Token-scaled logit distillation for ternary weight generative language models." Advances in Neural Information Processing Systems 36 (2024).
[3] Xu, Yuzhuang, et al. "OneBit: Towards Extremely Low-bit Large Language Models." arXiv preprint arXiv:2402.11295 (2024).
1.1.2 Background: GPU and Tensor Core
GPU: Graphics Processing Unit
- Highly parallel computing architecture
- Multi-level memory hierarchy
[Figure: comparison between CPU and GPU architectures]

Tensor Core (TC): specialized processing unit
- Optimized for matrix operations
- Low-precision computing
[Figure: Tensor Core acceleration of matrix multiplication]
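As background for the kernels sketched later, the snippet below shows the standard way a warp drives a Tensor Core through CUDA's WMMA API. It is a minimal, hypothetical example (a single FP16 16x16x16 tile multiply), not code from this work; the fragment shapes and nvcuda::wmma calls are the documented CUDA API.

```cuda
// Minimal WMMA example: one warp multiplies a single 16x16x16 tile, D = A * B.
// Assumes row-major A and col-major B in half precision. Launch as <<<1, 32>>>.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_16x16x16(const half* A, const half* B, float* D) {
    // Per-warp fragments: register-resident views of the operand and accumulator tiles.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);          // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension = 16 elements
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);                  // Tensor Core MMA
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);  // write the 16x16 result
}
```

Real kernels tile the full matrices across warps and thread blocks; this only shows the per-warp building block that the Tensor Core accelerates.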
1.2.1 Motivation: Limited Data Format Support
Problem: limited data-format support in the GPU and its Tensor Cores
- Mismatch with the quantized data formats (INT2 [1, 2] / INT3 [3, 4])
[Figure: precision formats supported by modern NVIDIA GPUs]

Current approach: computation by padding to a higher-bit data format
- Extra computation and memory overhead (for example, a 3-bit weight padded into an 8-bit operand wastes five of its eight bits)
[Figure: GPU computation with limited data-format support]

[1] Kim, Minsoo, et al. "Token-scaled logit distillation for ternary weight generative language models." Advances in Neural Information Processing Systems 36 (2024).
[2] Chen, Mengzhao, et al. "EfficientQAT: Efficient quantization-aware training for large language models." arXiv preprint arXiv:2407.11062 (2024).
[3] Frantar, Elias, et al. "GPTQ: Accurate post-training quantization for generative pre-trained transformers." arXiv preprint arXiv:2210.17323 (2022).
[4] Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration." Proceedings of Machine Learning and Systems 6 (2024): 87-100.

1.2.2 Motivation: Inefficient Memory Management
Characteristics of the different levels of storage
- Larger capacity, slower speed (host and global memory)
- Smaller capacity, faster speed (shared memory)

Comparison of memory bandwidth and capacity
Level                 Bandwidth   Capacity
Host memory (CPU)     12.8 GB/s   1 TB
Global memory (GPU)   1.5 TB/s    24 GB
Shared memory (GPU)   19 TB/s     48 MB

Disadvantages of direct memory management
- Inefficient memory transfer
- Slow global-memory access
- Threads contend for shared memory
1.3 Our Contributions
[Figure: overall architecture — global memory, preprocess module, shared memory, and Tensor Cores operating on W and X]
- 2.1 Bipolar-INT: a novel data format
- 2.2 An arbitrary-precision MatMul design
- 2.3 An efficient matrix preprocessing method
- 2.4 A memory management strategy

02 Our Works

2.1 Bipolar-INT Data Format
Idea: interpret "0" as "-1" in calculation

Example (4-bit)
- Unsigned INT:  0101 = 0*8 + 1*4 + 0*2 + 1*1 = 5
- Bipolar-INT:   1010 = (+1)*8 + (-1)*4 + (+1)*2 + (-1)*1 = 5

Range (4-bit)
- Maximum: 1111 = +8 + 4 + 2 + 1 = +15
- Minimum: 0000 = -8 - 4 - 2 - 1 = -15
- Every representable value is odd (of the form 2n + 1), so the range is symmetric around zero

2.1.1 Comparison with signed INT
- No sign bit
- Easy to parallelize

2.1.2 Comparison with unsigned INT
- Symmetric range
- Redundancy reduction
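To make the bit-level convention concrete, here is a small host-side helper that decodes an n-bit Bipolar-INT value by treating every 0 bit as -1. It is an illustrative sketch (plain C++, compilable with nvcc or any host compiler), not code from this work.

```cuda
#include <cstdio>
#include <cstdint>

// Decode an n-bit Bipolar-INT: bit i contributes +2^i when set and -2^i when clear.
int bipolar_int_decode(uint32_t bits, int n) {
    int value = 0;
    for (int i = 0; i < n; ++i) {
        int sign = ((bits >> i) & 1u) ? +1 : -1;   // "0" is interpreted as -1
        value += sign * (1 << i);
    }
    return value;
}

int main() {
    printf("%d\n", bipolar_int_decode(0b1010, 4));  // prints 5   (+8 -4 +2 -1)
    printf("%d\n", bipolar_int_decode(0b1111, 4));  // prints 15  (maximum)
    printf("%d\n", bipolar_int_decode(0b0000, 4));  // prints -15 (minimum)
    return 0;
}
```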
2.2 Bit-Wise MatMul Reconstitution
Step 1: data decomposition
- Split the input data bit by bit
- Divide it into 1-bit matrices

Step 2: 1-bit MatMul
- Pairwise combine the 1-bit input matrices
- Output the intermediate matrices

Step 3: data recovery
- Shift and add the intermediate matrices
- Output the final result

In other words, writing W = sum_i 2^i * W(i) and X = sum_j 2^j * X(j), where each W(i) and X(j) is a 1-bit matrix, the product is recovered as W * X = sum_i sum_j 2^(i+j) * (W(i) * X(j)), as spelled out in the reference code below.
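The following host-side reference implementation makes the three steps explicit for small matrices: decompose W and X into bit planes, multiply the 1-bit planes, then shift-and-add the intermediate results. It is a minimal sketch under an assumed row-major layout for unsigned operands, not the paper's GPU kernel.

```cuda
#include <cstdint>
#include <vector>
#include <cstdio>

// Reference bit-wise MatMul reconstitution on the host (plain C++).
// W is M x K with wBits-bit unsigned entries, X is K x N with xBits-bit unsigned entries.
std::vector<int64_t> bitwise_matmul(const std::vector<uint32_t>& W, int wBits,
                                    const std::vector<uint32_t>& X, int xBits,
                                    int M, int K, int N) {
    std::vector<int64_t> Y(M * N, 0);
    for (int i = 0; i < wBits; ++i) {          // bit plane i of W
        for (int j = 0; j < xBits; ++j) {      // bit plane j of X
            for (int m = 0; m < M; ++m) {
                for (int n = 0; n < N; ++n) {
                    int64_t acc = 0;           // 1-bit MatMul: AND + accumulate
                    for (int k = 0; k < K; ++k) {
                        uint32_t w = (W[m * K + k] >> i) & 1u;
                        uint32_t x = (X[k * N + n] >> j) & 1u;
                        acc += w & x;
                    }
                    Y[m * N + n] += acc << (i + j);   // shift-and-add recovery
                }
            }
        }
    }
    return Y;
}

int main() {
    // 1x2 times 2x1 example, 3-bit weights and 2-bit activations: [5 3] * [2 1]^T = 13.
    std::vector<uint32_t> W = {5, 3}, X = {2, 1};
    printf("%lld\n", (long long)bitwise_matmul(W, 3, X, 2, 1, 2, 1)[0]);
    return 0;
}
```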
19、ment 1-bit Matmul using“XOR”operation2.2 Bit-Wise MatMul Reconstitution(4)19wxy0000101001111-bit INT Multiplication is Implemented as AND Logicwxy0(-1)0(-1)10(-1)10(-1)10(-1)0(-1)1111-bit Bipolar-INT Multiplication is Implemented as XOR LogicThe Necessity of Input Data PreprocessingMemory redundancy
2.3 Matrix Decomposition and Reassembly
The necessity of input-data preprocessing
- Memory redundancy caused by unsupported data formats
- Subsequent computations require bit-wise decomposition
[Figure: matrix decomposition and reassembly, taking 3-bit data as an example]

Step 1: matrix decomposition
- Break down each bit and regroup the bits into bit planes
- Eliminates the redundancy caused by unsupported data formats

Step 2: data reassembly
- Reassemble the bits into 32-bit unsigned INTs
- Aligns the data with natively supported types, which improves transfer speed

Step 3: matrix concatenation
- Concatenate the processed matrices into a single matrix
- Reduces the number of transfer instructions, further improving transfer speed
- Facilitates the subsequent computations

A packing example is sketched below.
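As a concrete illustration, the helper below decomposes a 3-bit (or generally n-bit) matrix into bit planes and packs each plane into 32-bit unsigned words, 32 elements per word. It is a simplified host-side sketch with an assumed row-major layout and plane ordering, not the paper's preprocessing module.

```cuda
#include <cstdint>
#include <vector>

// Decompose an M x K matrix of nBits-bit values into nBits bit planes and pack each
// plane row into uint32_t words (32 one-bit elements per word). K is assumed to be a
// multiple of 32. Planes are concatenated into one buffer: plane 0 first, then plane 1, ...
std::vector<uint32_t> decompose_and_pack(const std::vector<uint8_t>& A,
                                         int M, int K, int nBits) {
    const int wordsPerRow = K / 32;
    std::vector<uint32_t> packed(nBits * M * wordsPerRow, 0u);
    for (int b = 0; b < nBits; ++b) {
        for (int m = 0; m < M; ++m) {
            for (int k = 0; k < K; ++k) {
                uint32_t bit = (A[m * K + k] >> b) & 1u;
                // Position of element (m, k) of bit plane b inside the packed buffer.
                int word = (b * M + m) * wordsPerRow + k / 32;
                packed[word] |= bit << (k % 32);
            }
        }
    }
    return packed;   // ready to be copied to the GPU as native 32-bit words
}
```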
2.4 Recovery-Oriented Memory Scheduling
GPU implementation of the bit-wise MatMul (Section 2.2), general approach
- Each SM handles one pair of 1-bit W/X matrices
- Shift and add in global memory

Reasons for low efficiency
- Matrix recovery happens in global memory
- Low utilization of shared memory

Optimization goals
- Reduce the computation performed in global memory
- Move matrix recovery into shared memory

2.4.1 Matrix recovery in shared memory
- Complete the matrix recovery in the shared memory of a single SM (streaming multiprocessor)
- Obtain all intermediate matrices needed for an output tile
- Shift and add in shared memory
[Figure: data movement across global memory, shared memory, and Tensor Core fragments for the bit planes W(0), W(1), X(0), X(1) and the intermediate tiles Y(0,0), Y(0,1), Y(1,0), Y(1,1)]

2.4.2 Compute all intermediate matrices
- Pairwise combination of W and X planes with different bit widths can be achieved within one MatMul computation
- Implement the 1-bit MatMul on Tensor Cores

2.4.3 Matrix concatenation in shared memory
- Read the data from all 1-bit matrices and concatenate them
- The concatenated tile includes all data required for the output

2.4.4 Overall scheduling (recovery-oriented memory scheduling)
1. Matrix concatenation in shared memory
2. Compute all intermediate matrices
3. Matrix recovery in shared memory

A simplified kernel skeleton follows.
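The skeleton below is a much-simplified, hypothetical rendering of this scheduling: one warp per block, one 8x8 output tile, 2-bit W and 2-bit X, and a single 128-wide reduction step. It is only meant to show where each step lives in the memory hierarchy (the paper's actual kernel tiles far more aggressively), and it requires compiling for sm_80 or newer.

```cuda
#include <mma.h>
#include <cstdint>
using namespace nvcuda;
using namespace nvcuda::wmma::experimental;

constexpr int W_BITS = 2, X_BITS = 2, K = 128;   // one 128-bit reduction step for brevity
constexpr int WORDS = K / 32;                    // uint32 words per 1-bit row

// One block = one warp = one 8x8 output tile. Launch as <<<1, 32>>>.
// Wplanes: [W_BITS][8][WORDS] packed bits; Xplanes: [X_BITS][8][WORDS] packed bits
// (matrix_b is loaded column-major, so each packed "row" of X here is a column).
__global__ void recovery_in_shared(const uint32_t* Wplanes, const uint32_t* Xplanes,
                                   int32_t* Y) {
    // Step 1: concatenate all 1-bit planes in shared memory (single global-memory read).
    __shared__ uint32_t sW[W_BITS * 8 * WORDS];
    __shared__ uint32_t sX[X_BITS * 8 * WORDS];
    __shared__ int32_t  sY[8 * 8];               // recovered output tile
    __shared__ int32_t  sTmp[8 * 8];             // one intermediate tile at a time
    for (int i = threadIdx.x; i < W_BITS * 8 * WORDS; i += blockDim.x) sW[i] = Wplanes[i];
    for (int i = threadIdx.x; i < X_BITS * 8 * WORDS; i += blockDim.x) sX[i] = Xplanes[i];
    for (int i = threadIdx.x; i < 64; i += blockDim.x) sY[i] = 0;
    __syncthreads();

    wmma::fragment<wmma::matrix_a, 8, 8, 128, precision::b1, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 8, 8, 128, precision::b1, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 8, 8, 128, int> c;

    for (int i = 0; i < W_BITS; ++i) {
        for (int j = 0; j < X_BITS; ++j) {
            // Step 2: compute one intermediate matrix Y(i,j) on the Tensor Core.
            wmma::fill_fragment(c, 0);
            wmma::load_matrix_sync(a, sW + i * 8 * WORDS, K);
            wmma::load_matrix_sync(b, sX + j * 8 * WORDS, K);
            wmma::bmma_sync(c, a, b, c, bmmaBitOpAND, bmmaAccumulateOpPOPC);
            wmma::store_matrix_sync(sTmp, c, 8, wmma::mem_row_major);
            __syncthreads();
            // Step 3: shift-and-add recovery directly in shared memory.
            for (int e = threadIdx.x; e < 64; e += blockDim.x) sY[e] += sTmp[e] << (i + j);
            __syncthreads();
        }
    }
    // Only the fully recovered tile is written back to global memory.
    for (int e = threadIdx.x; e < 64; e += blockDim.x) Y[e] = sY[e];
}
```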
03 Experiments

3.1 Experimental Setup
- Computing platform: NVIDIA RTX 3090 GPU (Ampere architecture)
- Compilation environment: CUDA 11.8 and CUTLASS 2.11
- Baselines: PyTorch FP32, PyTorch FP16, CUTLASS INT4, CUTLASS INT1, APNN-TC [1], BSTC [2], BTC [3]
- LLM models: LLaMA2-7B, OPT-6.7B, BLOOM-7B
- Workloads: square-matrix MatMuls; LLM-specific MatMuls; LLM inference-speed evaluation

[1] Feng, Boyuan, et al. "APNN-TC: Accelerating arbitrary precision neural networks on Ampere GPU tensor cores." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2021.
[2] Li, Ang, et al. "BSTC: A novel binarized-soft-tensor-core design for accelerating bit-based approximated neural nets." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2019.
[3] Li, Ang, and Simon Su. "Accelerating binarized neural networks via bit-tensor-cores in Turing GPUs." IEEE Transactions on Parallel and Distributed Systems 32.7 (2020): 1878-1891.
3.2.1 Arbitrary Precision Square MatMuls
Comparison with FP32 and FP16 on large square MatMuls
- 193× speedup over FP32 (4k/4k/4k, W1A2)
- 66.7× speedup over FP16 (4k/4k/4k, W1A2)

M/N/K           1k/1k/1k            2k/2k/2k            4k/4k/4k
Scheme          Latency   Speedup   Latency   Speedup   Latency   Speedup
FP32            121 us    1.00      779 us    1.00      5690 us   1.00
FP16            44.2 us   2.73      263 us    2.96      1960 us   2.90
CUTLASS INT4    15.8 us   7.61      66.5 us   11.7      386 us    14.7
CUTLASS INT1    9.3 us    13.0      36.9 us   21.1      161 us    35.3
W3A4 (ours)     12.4 us   9.74      50.4 us   15.4      184 us    31.0
W2A2 (ours)     8.7 us    13.9      18.1 us   43.0      46.5 us   122
W1A2 (ours)     9.0 us    13.4      11.7 us   66.4      29.5 us   193

Comparison with CUTLASS INT4 and INT1 on large square MatMuls
- More than 13× speedup over CUTLASS INT4 (4k/4k/4k, W1A2)
- 5.5× faster than CUTLASS INT1 (4k/4k/4k, W1A2)
- 3.5× faster than CUTLASS INT1 (4k/4k/4k, W2A2)

Comparison of throughput with other methods
- 44 TOPS over APNN-TC (W1A2)
- 50 TOPS over APNN-TC (W2A2)
3.2.2 Arbitrary Precision LLM-specific MatMuls
MatMul shapes extracted from LLaMA2-7B
- More than 90× speedup over FP32 (W1A2)
- Significant speedup over FP16: 31.3× / 17.6× / 19.2× for the three shapes (W1A2)

M/N/K            1k/4k/4k             1k/10.5k/4k          1k/4k/10.5k
Scheme           Latency    Speedup   Latency    Speedup   Latency    Speedup
FP32             3.12 ms    1.00      8.21 ms    1.00      8.36 ms    1.00
FP16             1.07 ms    2.91      1.47 ms    5.58      1.58 ms    5.30
CUTLASS INT4     0.238 ms   13.1      0.574 ms   14.3      0.548 ms   15.3
CUTLASS INT1     0.097 ms   32.1      0.255 ms   32.2      0.188 ms   44.6
W3A4 (ours)      0.194 ms   16.1      0.523 ms   15.7      0.540 ms   15.5
W2A2 (ours)      0.059 ms   53.2      0.143 ms   57.6      0.165 ms   50.7
W1A2 (ours)      0.034 ms   91.2      0.084 ms   98.1      0.082 ms   102

Comparison with the low-precision CUTLASS kernels (W1A2)
- About 7× speedup over CUTLASS INT4 (7.0× / 6.9× / 6.7× for the three shapes)
- Over 2.2× speedup over CUTLASS INT1 (2.84× / 3.05× / 2.29× for the three shapes)

Comparison of throughput with other methods
- About 10 TOPS over APNN-TC (W1A2)
- More than 10 TOPS over APNN-TC (W2A2)
3.3 Arbitrary Precision LLM Evaluation
Single-inference latency of different LLMs (LLaMA2-7B, OPT-6.7B, BLOOM-7B)
- Up to 6× speedup compared to FP16 (W1A2)

Compared with GPTQ INT2/3/4 (which uses the CUTLASS INT4 kernel)
- The inference speed of GPTQ is almost identical across the three bit widths, since all of them fall back to the same INT4 kernel

Compared with OneBit (which uses the CUTLASS INT1 kernel)
- Our W1A2 and W2A2 configurations still achieve a speedup
04 Conclusion

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores
- Bipolar-INT data format
- Bit-wise MatMul reconstitution
- Matrix decomposition and reassembly
- Recovery-oriented memory scheduling

Results
- Achieves a 5.5× speedup compared to NVIDIA CUTLASS
- Achieves a 44× speedup compared to existing solutions
- Model inference is 3.9-6.7× faster than FP16
- Model inference is 1.2-2× faster than quantized models running on CUTLASS kernels

Thank you for listening! You are welcome to contact us by email.
ICAIS Lab, Nanjing University, China