He Pujiang, Intel AI Software Architect
The Large Language Model Era: Optimization Strategies to Maximize CPU Value
QCON 2023 SHANGHAI

Contents
01 Background (why?)
02 How to optimize large language models on CPU?
03 Maximizing CPU value
04 Summary

Background (why consider maximizing CPU value?)

Computing Needs in LLM
GPT-J model structure: input embedding, then 28 identical decoder layers (x 28), each with Layer Norm, QKV MatMul, Masked Multi-Head Attention (BMM, SoftMax, BMM), an output MatMul, Layer Norm and the FFN MatMuls, followed by a final MatMul + SoftMax that yields the probability of the next token. The 1st-token (prompt) phase is compute bound; the next-token (generation) phase is memory read bandwidth bound.

MatMul shapes in GPT-J (assuming prompt token size = 2048, batch size = 1, greedy search):

  Operation            | 1st token                       | Next tokens
  QKV MatMul in MHA    | A: 2048x4096,   B: 4096x12288   | A: 1x4096,   B: 4096x12288
  MHA (1st BMM)        | A: 16x2048x256, B: 16x2048x256  | A: 16x1x256, B: 16x2048x256
  Output MatMul in MHA | A: 2048x4096,   B: 4096x4096    | A: 1x4096,   B: 4096x4096
  1st MatMul in FFN    | A: 2048x4096,   B: 4096x16384   | A: 1x4096,   B: 4096x16384
  2nd MatMul in FFN    | A: 2048x16384,  B: 16384x4096   | A: 1x16384,  B: 16384x4096

GPT Series Model Analysis
Parameters visited during one inference pass: N ≈ l·(4h² + 2·4h²) + V·h = 12·l·h² + V·h, where l is the number of layers, h the hidden size and V the vocabulary size.
Memory latency ≈ N·b / memory bandwidth; compute latency ≈ 2·N·B·S / peak FLOPS (b = bytes per parameter, B = batch size, S = tokens processed per step).
Arithmetic intensity: AI = 2·N·B·S / (N·b) = 2·B·S / b FLOPS/byte.
Peak AI for SPR-SP with BF16 and AMX: 123.2 TFLOPS / 307.2 GB/s ≈ 401 FLOPS/byte.
Compute bound (BF16): B·S ≥ 401. Memory bandwidth bound (BF16): B·S < 401.
Platform characteristics: DDR5, 8 channels per CPU at 4800 MT/s (1DPC), 16 DIMMs per socket; 64 GB HBM2e (about 1 GB/core HBM capacity), memory bandwidth up to 1 TB/s; up to 112.5 MB shared LLC.
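To make the compute-bound vs. bandwidth-bound split concrete, here is a minimal roofline-style estimate in Python. The TFLOPS and GB/s figures are taken from the slide; the function name and everything else is illustrative, not from the talk's tooling.

```python
# Minimal roofline-style estimate for LLM inference on one CPU socket.
# Peak numbers are from the slide (SPR-SP, BF16 with AMX); the rest is an
# illustrative assumption.

PEAK_BF16_TFLOPS = 123.2          # AMX BF16 peak, TFLOPS
PEAK_DDR5_GBPS   = 307.2          # 8 ch x 4800 MT/s DDR5, GB/s
BYTES_PER_PARAM  = 2              # BF16

peak_ai = PEAK_BF16_TFLOPS * 1e12 / (PEAK_DDR5_GBPS * 1e9)   # ~401 FLOPS/byte

def phase_ai(batch: int, tokens_per_step: int) -> float:
    """Arithmetic intensity of one decoder pass.

    FLOPs ~ 2 * N * batch * tokens_per_step  (N = parameters visited)
    Bytes ~ N * BYTES_PER_PARAM              (weights are read once)
    so AI ~ 2 * batch * tokens_per_step / BYTES_PER_PARAM, independent of N.
    """
    return 2 * batch * tokens_per_step / BYTES_PER_PARAM

for name, bsz, steps in [("1st token (prompt=2048)", 1, 2048),
                         ("next token (greedy)",     1, 1)]:
    ai = phase_ai(bsz, steps)
    bound = "compute bound" if ai >= peak_ai else "memory bandwidth bound"
    print(f"{name}: AI = {ai:.0f} FLOPS/byte vs peak {peak_ai:.0f} -> {bound}")
```

Prompt processing (AI ≈ 2048 FLOPS/byte) sits above the ~401 FLOPS/byte ridge point and is compute bound, while token-by-token generation (AI ≈ 1) is far below it and is bound by memory read bandwidth, matching the slide's conclusion.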
CPU is NOT Fully Utilized!
[Chart: metric_CPU utilization (%) over time while serving the LLM inference pipeline; utilization stays low throughout.]

LLM Inference Pipeline
[Diagram: LLM inference pipeline with Text Emb1, Text Emb2, a Vector DB, a Context Retriever and the pre-trained/finetuned LLM model behind a REST/gRPC API; example query "Should I attend QCON?" answered with "Yes."]
Pre-processing and post-processing in LLM inference are relatively simple and do not need much CPU resource.

CPU Utilization in LLM Training (offload mode)
Even for offload LLM training, the CPU is still not fully utilized.

How to Optimize Large Language Models on CPU?

Optimization
- Leverage high-performance kernels (e.g., oneDNN)
- Avoid redundant computing: continuous batching, causal masking, prefix sharing (see the KV-cache sketch after this list)
- Lower precision & sparsity
- Graph fusion
- Minimize memory copy and reorder
- Reuse memory
- Distributed inference, with an efficient communication library (oneCCL)
- Runtime tuning
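As a minimal illustration of "avoid redundant computing", the sketch below (plain NumPy; all names are illustrative, not from xFasterTransformer) caches K/V for tokens that have already been processed, so each decode step only computes attention for the one new token; causal masking is implicit because only cached (earlier) positions are attended to.

```python
import numpy as np

# Toy single-head decoder step with a KV cache: previously seen tokens'
# K/V are never recomputed, so each generation step costs O(seq_len)
# instead of O(seq_len^2). Illustrative sketch, not a real kernel.

HEAD_DIM = 64
k_cache = np.zeros((0, HEAD_DIM), dtype=np.float32)
v_cache = np.zeros((0, HEAD_DIM), dtype=np.float32)

def decode_step(q_new, k_new, v_new):
    """Attention for one new token against all cached tokens plus itself.

    Causal masking is implicit: the cache only ever contains positions at
    or before the current one, so no future token is attended to.
    """
    global k_cache, v_cache
    k_cache = np.concatenate([k_cache, k_new[None, :]])   # append, don't recompute
    v_cache = np.concatenate([v_cache, v_new[None, :]])
    scores = (k_cache @ q_new) / np.sqrt(HEAD_DIM)         # (seq_len,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ v_cache                                  # (HEAD_DIM,)

# Usage: feed the projected q/k/v of each newly generated token, one at a time.
for _ in range(4):
    out = decode_step(*np.random.randn(3, HEAD_DIM).astype(np.float32))
```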
Optimization for Distributed Inference
- Distributed inference based on oneCCL.
- Improve scalability by minimizing synchronization: one synchronization per layer is enough for some models.
- Minimize memory copies between the compute module and the communication module by owning the full stack.
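A minimal sketch of tensor-parallel decoding with a single all-reduce per layer, using torch.distributed. The "ccl" backend assumes Intel's oneccl_bindings_for_pytorch package is installed; the layer structure, shapes, and function names are illustrative, not the xFasterTransformer implementation.

```python
import torch
import torch.distributed as dist
# Assumption: Intel's oneCCL bindings for PyTorch are installed; importing
# them registers the "ccl" backend used below.
import oneccl_bindings_for_pytorch  # noqa: F401

dist.init_process_group(backend="ccl")      # e.g., one rank per socket/sub-NUMA
world = dist.get_world_size()

HIDDEN, FFN = 4096, 16384
# Column-parallel first FFN matmul, row-parallel second matmul:
# each rank holds a 1/world slice of the weights (illustrative shapes only).
w1 = torch.randn(HIDDEN, FFN // world)
w2 = torch.randn(FFN // world, HIDDEN)

def ffn_layer(x: torch.Tensor) -> torch.Tensor:
    """One tensor-parallel FFN layer with a single all-reduce.

    The partial products of the row-parallel matmul are summed across
    ranks once per layer; no other synchronization point is needed.
    """
    partial = torch.relu(x @ w1) @ w2
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)   # the only sync point
    return partial

x = torch.randn(1, HIDDEN)
for _ in range(28):          # e.g., the 28 GPT-J layers
    x = ffn_layer(x)
```

Launched under mpirun or torchrun, each rank only exchanges one reduced activation tensor per layer, which is the "minimize synchronization" pattern described above.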
Attention Optimization
[Diagram: Q·Kᵀ → SoftMax(Q·Kᵀ) → multiply by V, computed block by block.]

SlimAttention (split the score in 1 dimension)
- Computing order: process the SoftMax(Q·Kᵀ) score one block of query rows at a time, each block against the full K and V.
- Intermediate score size: same size as one row block of the score matrix.
- Advantage: less score buffer, without redundant computing.

FlashAttention (split the score in 2 dimensions)
- Computing order: tile the score along both the query and key dimensions, rescaling the softmax online as key tiles are consumed.
- Intermediate score size: same size as one tile of the score matrix.
- Advantage: minimal intermediate buffer, with some redundant computing.
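A sketch of the 1-D split idea in plain NumPy (illustrative, not the xFasterTransformer kernel): the score matrix is computed one row block at a time against the full K/V, so the softmax of each block is exact, nothing is recomputed, and the buffer only needs to hold one block of scores.

```python
import numpy as np

def slim_attention(Q, K, V, block_rows=128):
    """Attention with the score split along one dimension (query rows).

    Each block of Q rows sees the full K/V, so the softmax is computed
    exactly per block: only a (block_rows x seq_len) score buffer is
    needed and nothing is recomputed. Single head, no masking.
    """
    seq_len, head_dim = Q.shape
    out = np.empty_like(Q)
    scale = 1.0 / np.sqrt(head_dim)
    for start in range(0, seq_len, block_rows):
        end = min(start + block_rows, seq_len)
        scores = (Q[start:end] @ K.T) * scale            # (block, seq_len)
        scores -= scores.max(axis=1, keepdims=True)      # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)
        out[start:end] = probs @ V
    return out

# Matches the unblocked computation up to floating-point error.
Q = np.random.randn(2048, 256).astype(np.float32)
K, V = np.random.randn(2, 2048, 256).astype(np.float32)
ref = slim_attention(Q, K, V, block_rows=2048)
assert np.allclose(slim_attention(Q, K, V), ref, atol=1e-4)
```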
Do We Need Paged Attention on CPU?
[Diagram: KV-cache blocks 0-15; both the 1st visit and the 2nd visit walk the blocks sequentially.]

Int8 Weight-Only Quantization
[Diagram: mapping fp32/fp16 weights onto int8 [-127, 127]. A naive scheme maps the real min/max (100% of the data) onto the int8 range; the histogram-based scheme clips the range to cover 99.99% of the data, saturating a few outliers in exchange for finer resolution of the remaining values.]
Weight (fp32/fp16) → QWeight (int8) + scale (fp32) + zero_point (fp32)

Accuracy on LLaMA2 7B:
  Method             | Accuracy
  HF + AutoGPTQ_INT8 | 73.7046%
  xFT + INT8_Convert | 73.7629%

Observation: with histogram-based quantization, we get very good accuracy in xFasterTransformer.
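A minimal sketch of the idea in NumPy (the 99.99% coverage figure comes from the slide; the percentile-based clipping, per-tensor granularity, and function names are illustrative assumptions, not the exact xFasterTransformer scheme): instead of the real min/max, the quantization range covers 99.99% of the weight values, and scale/zero_point are derived from that clipped range.

```python
import numpy as np

def quantize_int8_weight_only(w: np.ndarray, coverage: float = 99.99):
    """Asymmetric int8 weight-only quantization with percentile clipping.

    Rather than mapping [real_min, real_max] (100% of values) onto
    [-127, 127], clip to the range covering `coverage`% of values so a
    few outliers no longer stretch the scale. Per-tensor sketch.
    """
    tail = (100.0 - coverage) / 2.0
    lo, hi = np.percentile(w, [tail, 100.0 - tail])
    scale = (hi - lo) / 254.0                    # span of [-127, 127]
    zero_point = -127.0 - lo / scale             # value `lo` maps to -127
    q = np.clip(np.round(w / scale + zero_point), -127, 127).astype(np.int8)
    return q, np.float32(scale), np.float32(zero_point)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
w[0, 0] = 30.0                                   # an outlier
q, scale, zp = quantize_int8_weight_only(w)
err = np.abs(dequantize(q, scale, zp) - w).mean()
print(f"scale={scale:.5f} zero_point={zp:.2f} mean abs error={err:.5f}")
```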
Maximizing CPU Value

CPU vs. GPU
  CPU                                                               | GPU
  A smaller number of larger cores                                  | A larger number (thousands) of smaller cores
  Low latency                                                       | High throughput
  Performs fewer instructions per clock                             | Performs more instructions per clock
  Designed and optimized for complex programs w/ serial processing  | Optimized for parallel processing w/ bulk repetitive calculations
  Automatic cache management                                        | Allows for manual memory management
  Large memory capacity                                             | Limited memory capacity

Key HW Factors in LLM Inference
- Memory bandwidth
- Computing
- Memory capacity

Key Challenges in LLM Inference
- Autoregressive generation
- Attention (quadratic in sequence length)
- Model is large

Scenarios Where CPU Has Value
- Long-tail models (many models, few requests)
- Offline mode (to maximize throughput)
- Occasional demand
- Very long prompt token size and no strict latency requirement
- Very large model and not enough GPU
- Hybrid solutions (e.g., speculative sampling)
Long-tail models on CPU: Model-1, Model-2, Model-3, ..., Model-N are all loaded in memory, but not all models serve requests at the same time.

Speculative Decoding
[Diagram (image from https:/ ): a Draft Model generates proposed tokens (token generation); the Target Model then verifies the proposed tokens (token verification).]
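A minimal sketch of the propose-and-verify loop (runnable Python; the greedy acceptance rule and the draft_next/target_argmax model interfaces are illustrative assumptions, not the exact scheme from the talk):

```python
from typing import Callable, List

# Toy speculative decoding loop: a cheap draft model proposes k tokens,
# the expensive target model scores the whole proposal in one pass, and
# tokens are accepted greedily until the first disagreement.
# draft_next / target_argmax stand in for real model calls (assumed API).

def speculative_decode(prompt: List[int],
                       draft_next: Callable[[List[int]], int],
                       target_argmax: Callable[[List[int]], List[int]],
                       k: int = 4,
                       max_new_tokens: int = 32) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2) Target model scores prompt+proposal in ONE forward pass and
        #    returns its own next-token choice at every position.
        target_choices = target_argmax(tokens + proposal)[-(k + 1):-1]
        # 3) Accept proposed tokens until the first mismatch, then take
        #    the target model's token at the mismatch position.
        accepted = []
        for prop, tgt in zip(proposal, target_choices):
            if prop == tgt:
                accepted.append(prop)
            else:
                accepted.append(tgt)
                break
        tokens.extend(accepted)
    return tokens
```

With greedy verification the accepted tokens match what the target model alone would have produced, while the target model only runs one full pass per batch of k proposed tokens.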
Memory Bandwidth Matching
[Chart, memory bandwidth (GB/s): 2S CPU (5th Gen Xeon) ≈ 716.8 vs. 8 x A10 ≈ 4800, roughly a 6.7x gap.]
In practice, draft models are about 15-20x smaller than the target model, so even with ~6.7x less memory bandwidth the CPU can run the draft model (or a second draft, Draft2 Model) fast enough to feed proposed tokens to the target model.

Summary
- Why consider CPU?
- How to optimize on CPU?
- When to use CPU?
Try our solution on Xeon; build your own solution or seek ultimate performance.