Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores
Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang
ICAIS Lab, Nanjing University, China
Jan 23, 2025

Outline
01 Background & Motivation
02 Our Works
03 Experiments
04 Conclusion

01 Background & Motivation

1.1.1 Background: Quantization of LLMs
Challenges brought by the growth in size of LLMs
- More memory (storage)
- More computational power and time (inference)
[Figure: growth in size of Transformer models, 2017-2024: GPT-1 (117M), BERT (340M), GPT-2 (1.5B), GPT-3 (175B), Gopher (280B), PaLM (540B), GPT-4 (1000+B)]

One effective method: model quantization
- Reduces the storage requirement
- Reduces the computational overhead
- Representative quantization works: GPTQ (3-4 bit) [1], TSLD (2 bit) [2], OneBit (1 bit) [3]

Storage reduction brought by model quantization
(For example, a ~7B-parameter model needs about 13.5 GB at 16 bits per weight, but only about 2.5 GB at 3 bits per weight.)

Model       FP16 (GB)   GPTQ 3-bit (GB)   TSLD (GB)   OneBit (GB)
LLaMA-7B    13.5        2.5               1.7         1.3
LLaMA-13B   26.0        4.9               3.3         2.2
LLaMA-30B   65.1        12.2              8.1         4.9
LLaMA-65B   130.6       24.5              16.3        9.2

[1] Frantar, Elias, et al. "GPTQ: Accurate post-training quantization for generative pre-trained transformers." arXiv preprint arXiv:2210.17323 (2022).
[2] Kim, Minsoo, et al. "Token-scaled logit distillation for ternary weight generative language models." Advances in Neural Information Processing Systems 36 (2024).
[3] Xu, Yuzhuang, et al. "OneBit: Towards Extremely Low-bit Large Language Models." arXiv preprint arXiv:2402.11295 (2024).
1.1.2 Background: GPU and Tensor Core
GPU: Graphics Processing Unit
- Highly parallel computing architecture
- Multi-level memory hierarchy
[Figure: comparison between CPU and GPU architectures]

Tensor Core (TC): specialized processing unit
- Optimized for matrix operations
- Low-precision computing
[Figure: Tensor Core acceleration of matrix multiplication]
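As background for the kernels sketched later, the snippet below shows the standard way a warp drives a Tensor Core through CUDA's WMMA API. It is a minimal, hypothetical example (a single FP16 16x16x16 tile multiply), not code from this work; the fragment shapes and nvcuda::wmma calls are the documented CUDA API.

```cuda
// Minimal WMMA example: one warp multiplies a single 16x16x16 tile, D = A * B.
// Assumes row-major A and col-major B in half precision. Launch as <<<1, 32>>>.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_16x16x16(const half* A, const half* B, float* D) {
    // Per-warp fragments: register-resident views of the operand and accumulator tiles.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);          // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension = 16 elements
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);                  // Tensor Core MMA
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);  // write the 16x16 result
}
```

Real kernels tile the full matrices across warps and thread blocks; this only shows the per-warp building block that the Tensor Core accelerates.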
1.2.1 Motivation: Limited Data Format Support
Problem: limited data-format support in the GPU and its Tensor Cores
- Mismatch with the quantized data formats (INT2 [1, 2] / INT3 [3, 4])
[Figure: precision formats supported by modern NVIDIA GPUs]

Current approach: computation by padding to a higher-bit data format
- Extra computation and memory overhead (for example, a 3-bit weight padded into an 8-bit operand wastes five of its eight bits)
[Figure: GPU computation with limited data-format support]

[1] Kim, Minsoo, et al. "Token-scaled logit distillation for ternary weight generative language models." Advances in Neural Information Processing Systems 36 (2024).
[2] Chen, Mengzhao, et al. "EfficientQAT: Efficient quantization-aware training for large language models." arXiv preprint arXiv:2407.11062 (2024).
[3] Frantar, Elias, et al. "GPTQ: Accurate post-training quantization for generative pre-trained transformers." arXiv preprint arXiv:2210.17323 (2022).
[4] Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration." Proceedings of Machine Learning and Systems 6 (2024): 87-100.

1.2.2 Motivation: Inefficient Memory Management
Characteristics of the different levels of storage
- Larger capacity, slower speed (host and global memory)
- Smaller capacity, faster speed (shared memory)

Comparison of memory bandwidth and capacity
Level                 Bandwidth   Capacity
Host memory (CPU)     12.8 GB/s   1 TB
Global memory (GPU)   1.5 TB/s    24 GB
Shared memory (GPU)   19 TB/s     48 MB

Disadvantages of direct memory management
- Inefficient memory transfer
- Slow global-memory access
- Threads contend for shared memory
1.3 Our Contributions
[Figure: overall architecture — global memory, preprocess module, shared memory, and Tensor Cores operating on W and X]
- 2.1 Bipolar-INT: a novel data format
- 2.2 An arbitrary-precision MatMul design
- 2.3 An efficient matrix preprocessing method
- 2.4 A memory management strategy

02 Our Works

2.1 Bipolar-INT Data Format
Idea: interpret "0" as "-1" in calculation

Example (4-bit)
- Unsigned INT:  0101 = 0*8 + 1*4 + 0*2 + 1*1 = 5
- Bipolar-INT:   1010 = (+1)*8 + (-1)*4 + (+1)*2 + (-1)*1 = 5

Range (4-bit)
- Maximum: 1111 = +8 + 4 + 2 + 1 = +15
- Minimum: 0000 = -8 - 4 - 2 - 1 = -15
- Every representable value is odd (of the form 2n + 1), so the range is symmetric around zero

2.1.1 Comparison with signed INT
- No sign bit
- Easy to parallelize

2.1.2 Comparison with unsigned INT
- Symmetric range
- Redundancy reduction
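To make the bit-level convention concrete, here is a small host-side helper that decodes an n-bit Bipolar-INT value by treating every 0 bit as -1. It is an illustrative sketch (plain C++, compilable with nvcc or any host compiler), not code from this work.

```cuda
#include <cstdio>
#include <cstdint>

// Decode an n-bit Bipolar-INT: bit i contributes +2^i when set and -2^i when clear.
int bipolar_int_decode(uint32_t bits, int n) {
    int value = 0;
    for (int i = 0; i < n; ++i) {
        int sign = ((bits >> i) & 1u) ? +1 : -1;   // "0" is interpreted as -1
        value += sign * (1 << i);
    }
    return value;
}

int main() {
    printf("%d\n", bipolar_int_decode(0b1010, 4));  // prints 5   (+8 -4 +2 -1)
    printf("%d\n", bipolar_int_decode(0b1111, 4));  // prints 15  (maximum)
    printf("%d\n", bipolar_int_decode(0b0000, 4));  // prints -15 (minimum)
    return 0;
}
```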
2.2 Bit-Wise MatMul Reconstitution
Step 1: data decomposition
- Split the input data bit by bit
- Divide it into 1-bit matrices

Step 2: 1-bit MatMul
- Pairwise combine the 1-bit input matrices
- Output the intermediate matrices

Step 3: data recovery
- Shift and add the intermediate matrices
- Output the final result

In other words, writing W = sum_i 2^i * W(i) and X = sum_j 2^j * X(j), where each W(i) and X(j) is a 1-bit matrix, the product is recovered as W * X = sum_i sum_j 2^(i+j) * (W(i) * X(j)), as spelled out in the reference code below.
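The following host-side reference implementation makes the three steps explicit for small matrices: decompose W and X into bit planes, multiply the 1-bit planes, then shift-and-add the intermediate results. It is a minimal sketch under an assumed row-major layout for unsigned operands, not the paper's GPU kernel.

```cuda
#include <cstdint>
#include <vector>
#include <cstdio>

// Reference bit-wise MatMul reconstitution on the host (plain C++).
// W is M x K with wBits-bit unsigned entries, X is K x N with xBits-bit unsigned entries.
std::vector<int64_t> bitwise_matmul(const std::vector<uint32_t>& W, int wBits,
                                    const std::vector<uint32_t>& X, int xBits,
                                    int M, int K, int N) {
    std::vector<int64_t> Y(M * N, 0);
    for (int i = 0; i < wBits; ++i) {          // bit plane i of W
        for (int j = 0; j < xBits; ++j) {      // bit plane j of X
            for (int m = 0; m < M; ++m) {
                for (int n = 0; n < N; ++n) {
                    int64_t acc = 0;           // 1-bit MatMul: AND + accumulate
                    for (int k = 0; k < K; ++k) {
                        uint32_t w = (W[m * K + k] >> i) & 1u;
                        uint32_t x = (X[k * N + n] >> j) & 1u;
                        acc += w & x;
                    }
                    Y[m * N + n] += acc << (i + j);   // shift-and-add recovery
                }
            }
        }
    }
    return Y;
}

int main() {
    // 1x2 times 2x1 example, 3-bit weights and 2-bit activations: [5 3] * [2 1]^T = 13.
    std::vector<uint32_t> W = {5, 3}, X = {2, 1};
    printf("%lld\n", (long long)bitwise_matmul(W, 3, X, 2, 1, 2, 1)[0]);
    return 0;
}
```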
19、ment 1-bit Matmul using“XOR”operation2.2 Bit-Wise MatMul Reconstitution(4)19wxy0000101001111-bit INT Multiplication is Implemented as AND Logicwxy0(-1)0(-1)10(-1)10(-1)10(-1)0(-1)1111-bit Bipolar-INT Multiplication is Implemented as XOR LogicThe Necessity of Input Data PreprocessingMemory redundancy
2.3 Matrix Decomposition and Reassembly
The necessity of input-data preprocessing
- Memory redundancy caused by unsupported data formats
- Subsequent computations require bit-wise decomposition
[Figure: matrix decomposition and reassembly, taking 3-bit data as an example]

Step 1: matrix decomposition
- Break down each bit and regroup the bits into bit planes
- Eliminates the redundancy caused by unsupported data formats

Step 2: data reassembly
- Reassemble the bits into 32-bit unsigned INTs
- Aligns the data with natively supported types, which improves transfer speed

Step 3: matrix concatenation
- Concatenate the processed matrices into a single matrix
- Reduces the number of transfer instructions, further improving transfer speed
- Facilitates the subsequent computations

A packing example is sketched below.
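As a concrete illustration, the helper below decomposes a 3-bit (or generally n-bit) matrix into bit planes and packs each plane into 32-bit unsigned words, 32 elements per word. It is a simplified host-side sketch with an assumed row-major layout and plane ordering, not the paper's preprocessing module.

```cuda
#include <cstdint>
#include <vector>

// Decompose an M x K matrix of nBits-bit values into nBits bit planes and pack each
// plane row into uint32_t words (32 one-bit elements per word). K is assumed to be a
// multiple of 32. Planes are concatenated into one buffer: plane 0 first, then plane 1, ...
std::vector<uint32_t> decompose_and_pack(const std::vector<uint8_t>& A,
                                         int M, int K, int nBits) {
    const int wordsPerRow = K / 32;
    std::vector<uint32_t> packed(nBits * M * wordsPerRow, 0u);
    for (int b = 0; b < nBits; ++b) {
        for (int m = 0; m < M; ++m) {
            for (int k = 0; k < K; ++k) {
                uint32_t bit = (A[m * K + k] >> b) & 1u;
                // Position of element (m, k) of bit plane b inside the packed buffer.
                int word = (b * M + m) * wordsPerRow + k / 32;
                packed[word] |= bit << (k % 32);
            }
        }
    }
    return packed;   // ready to be copied to the GPU as native 32-bit words
}
```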
2.4 Recovery-Oriented Memory Scheduling
GPU implementation of the bit-wise MatMul (Section 2.2), general approach
- Each SM handles one pair of 1-bit W/X matrices
- Shift and add in global memory

Reasons for low efficiency
- Matrix recovery happens in global memory
- Low utilization of shared memory

Optimization goals
- Reduce the computation performed in global memory
- Move matrix recovery into shared memory

2.4.1 Matrix recovery in shared memory
- Complete the matrix recovery in the shared memory of a single SM (streaming multiprocessor)
- Obtain all intermediate matrices needed for an output tile
- Shift and add in shared memory
[Figure: data movement across global memory, shared memory, and Tensor Core fragments for the bit planes W(0), W(1), X(0), X(1) and the intermediate tiles Y(0,0), Y(0,1), Y(1,0), Y(1,1)]

2.4.2 Compute all intermediate matrices
- Pairwise combination of W and X planes with different bit widths can be achieved within one MatMul computation
- Implement the 1-bit MatMul on Tensor Cores

2.4.3 Matrix concatenation in shared memory
- Read the data from all 1-bit matrices and concatenate them
- The concatenated tile includes all data required for the output

2.4.4 Overall scheduling (recovery-oriented memory scheduling)
1. Matrix concatenation in shared memory
2. Compute all intermediate matrices
3. Matrix recovery in shared memory

A simplified kernel skeleton follows.
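The skeleton below is a much-simplified, hypothetical rendering of this scheduling: one warp per block, one 8x8 output tile, 2-bit W and 2-bit X, and a single 128-wide reduction step. It is only meant to show where each step lives in the memory hierarchy (the paper's actual kernel tiles far more aggressively), and it requires compiling for sm_80 or newer.

```cuda
#include <mma.h>
#include <cstdint>
using namespace nvcuda;
using namespace nvcuda::wmma::experimental;

constexpr int W_BITS = 2, X_BITS = 2, K = 128;   // one 128-bit reduction step for brevity
constexpr int WORDS = K / 32;                    // uint32 words per 1-bit row

// One block = one warp = one 8x8 output tile. Launch as <<<1, 32>>>.
// Wplanes: [W_BITS][8][WORDS] packed bits; Xplanes: [X_BITS][8][WORDS] packed bits
// (matrix_b is loaded column-major, so each packed "row" of X here is a column).
__global__ void recovery_in_shared(const uint32_t* Wplanes, const uint32_t* Xplanes,
                                   int32_t* Y) {
    // Step 1: concatenate all 1-bit planes in shared memory (single global-memory read).
    __shared__ uint32_t sW[W_BITS * 8 * WORDS];
    __shared__ uint32_t sX[X_BITS * 8 * WORDS];
    __shared__ int32_t  sY[8 * 8];               // recovered output tile
    __shared__ int32_t  sTmp[8 * 8];             // one intermediate tile at a time
    for (int i = threadIdx.x; i < W_BITS * 8 * WORDS; i += blockDim.x) sW[i] = Wplanes[i];
    for (int i = threadIdx.x; i < X_BITS * 8 * WORDS; i += blockDim.x) sX[i] = Xplanes[i];
    for (int i = threadIdx.x; i < 64; i += blockDim.x) sY[i] = 0;
    __syncthreads();

    wmma::fragment<wmma::matrix_a, 8, 8, 128, precision::b1, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 8, 8, 128, precision::b1, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 8, 8, 128, int> c;

    for (int i = 0; i < W_BITS; ++i) {
        for (int j = 0; j < X_BITS; ++j) {
            // Step 2: compute one intermediate matrix Y(i,j) on the Tensor Core.
            wmma::fill_fragment(c, 0);
            wmma::load_matrix_sync(a, sW + i * 8 * WORDS, K);
            wmma::load_matrix_sync(b, sX + j * 8 * WORDS, K);
            wmma::bmma_sync(c, a, b, c, bmmaBitOpAND, bmmaAccumulateOpPOPC);
            wmma::store_matrix_sync(sTmp, c, 8, wmma::mem_row_major);
            __syncthreads();
            // Step 3: shift-and-add recovery directly in shared memory.
            for (int e = threadIdx.x; e < 64; e += blockDim.x) sY[e] += sTmp[e] << (i + j);
            __syncthreads();
        }
    }
    // Only the fully recovered tile is written back to global memory.
    for (int e = threadIdx.x; e < 64; e += blockDim.x) Y[e] = sY[e];
}
```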
03 Experiments

3.1 Experimental Setup
- Computing platform: NVIDIA RTX 3090 GPU (Ampere architecture)
- Compilation environment: CUDA 11.8 and CUTLASS 2.11
- Baselines: PyTorch FP32, PyTorch FP16, CUTLASS INT4, CUTLASS INT1, APNN-TC [1], BSTC [2], BTC [3]
- LLM models: LLaMA2-7B, OPT-6.7B, BLOOM-7B
- Workloads: square-matrix MatMuls; LLM-specific MatMuls; LLM inference-speed evaluation

[1] Feng, Boyuan, et al. "APNN-TC: Accelerating arbitrary precision neural networks on Ampere GPU tensor cores." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2021.
[2] Li, Ang, et al. "BSTC: A novel binarized-soft-tensor-core design for accelerating bit-based approximated neural nets." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2019.
[3] Li, Ang, and Simon Su. "Accelerating binarized neural networks via bit-tensor-cores in Turing GPUs." IEEE Transactions on Parallel and Distributed Systems 32.7 (2020): 1878-1891.
3.2.1 Arbitrary Precision Square MatMuls
Comparison with FP32 and FP16 on large square MatMuls
- 193× speedup over FP32 (4k/4k/4k, W1A2)
- 66.7× speedup over FP16 (4k/4k/4k, W1A2)

M/N/K           1k/1k/1k            2k/2k/2k            4k/4k/4k
Scheme          Latency   Speedup   Latency   Speedup   Latency   Speedup
FP32            121 us    1.00      779 us    1.00      5690 us   1.00
FP16            44.2 us   2.73      263 us    2.96      1960 us   2.90
CUTLASS INT4    15.8 us   7.61      66.5 us   11.7      386 us    14.7
CUTLASS INT1    9.3 us    13.0      36.9 us   21.1      161 us    35.3
W3A4 (ours)     12.4 us   9.74      50.4 us   15.4      184 us    31.0
W2A2 (ours)     8.7 us    13.9      18.1 us   43.0      46.5 us   122
W1A2 (ours)     9.0 us    13.4      11.7 us   66.4      29.5 us   193

Comparison with CUTLASS INT4 and INT1 on large square MatMuls
- More than 13× speedup over CUTLASS INT4 (4k/4k/4k, W1A2)
- 5.5× faster than CUTLASS INT1 (4k/4k/4k, W1A2)
- 3.5× faster than CUTLASS INT1 (4k/4k/4k, W2A2)

Comparison of throughput with other methods
- 44 TOPS over APNN-TC (W1A2)
- 50 TOPS over APNN-TC (W2A2)
3.2.2 Arbitrary Precision LLM-specific MatMuls
MatMul shapes extracted from LLaMA2-7B
- More than 90× speedup over FP32 (W1A2)
- Significant speedup over FP16: 31.3× / 17.6× / 19.2× for the three shapes (W1A2)

M/N/K            1k/4k/4k             1k/10.5k/4k          1k/4k/10.5k
Scheme           Latency    Speedup   Latency    Speedup   Latency    Speedup
FP32             3.12 ms    1.00      8.21 ms    1.00      8.36 ms    1.00
FP16             1.07 ms    2.91      1.47 ms    5.58      1.58 ms    5.30
CUTLASS INT4     0.238 ms   13.1      0.574 ms   14.3      0.548 ms   15.3
CUTLASS INT1     0.097 ms   32.1      0.255 ms   32.2      0.188 ms   44.6
W3A4 (ours)      0.194 ms   16.1      0.523 ms   15.7      0.540 ms   15.5
W2A2 (ours)      0.059 ms   53.2      0.143 ms   57.6      0.165 ms   50.7
W1A2 (ours)      0.034 ms   91.2      0.084 ms   98.1      0.082 ms   102

Comparison with the low-precision CUTLASS kernels (W1A2)
- About 7× speedup over CUTLASS INT4 (7.0× / 6.9× / 6.7× for the three shapes)
- Over 2.2× speedup over CUTLASS INT1 (2.84× / 3.05× / 2.29× for the three shapes)

Comparison of throughput with other methods
- About 10 TOPS over APNN-TC (W1A2)
- More than 10 TOPS over APNN-TC (W2A2)
3.3 Arbitrary Precision LLM Evaluation
Single-inference latency of different LLMs (LLaMA2-7B, OPT-6.7B, BLOOM-7B)
- Up to 6× speedup compared to FP16 (W1A2)

Compared with GPTQ INT2/3/4 (which uses the CUTLASS INT4 kernel)
- The inference speed of GPTQ is almost identical across the three bit widths, since all of them fall back to the same INT4 kernel

Compared with OneBit (which uses the CUTLASS INT1 kernel)
- Our W1A2 and W2A2 configurations still achieve a speedup
04 Conclusion

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores
- Bipolar-INT data format
- Bit-wise MatMul reconstitution
- Matrix decomposition and reassembly
- Recovery-oriented memory scheduling

Results
- Achieves a 5.5× speedup compared to NVIDIA CUTLASS
- Achieves a 44× speedup compared to existing solutions
- Model inference is 3.9-6.7× faster than FP16
- Model inference is 1.2-2× faster than quantized models running on CUTLASS kernels

Thank you for listening! You are welcome to contact us by email.
ICAIS Lab, Nanjing University, China