He Pujiang, Intel AI Software Architect
The Large Language Model Era: Optimization Strategies to Maximize CPU Value
QCON 2023 SHANGHAI

Contents
01 Background (why?)
02 How to optimize large language models on CPU?
03 Maximizing CPU value
04 Summary

Background (why consider maximizing CPU value?)

Computing Needs in LLM
GPT-J model structure: input embedding, then 28 identical decoder layers (x 28), each with Layer Norm, QKV MatMul, Masked Multi-Head Attention (BMM, SoftMax, BMM), an output MatMul, Layer Norm and the FFN MatMuls, followed by a final MatMul + SoftMax that yields the probability of the next token. The 1st-token (prompt) phase is compute bound; the next-token (generation) phase is memory read bandwidth bound.

MatMul shapes in GPT-J (assuming prompt token size = 2048, batch size = 1, greedy search):

  Operation            | 1st token                       | Next tokens
  QKV MatMul in MHA    | A: 2048x4096,   B: 4096x12288   | A: 1x4096,   B: 4096x12288
  MHA (1st BMM)        | A: 16x2048x256, B: 16x2048x256  | A: 16x1x256, B: 16x2048x256
  Output MatMul in MHA | A: 2048x4096,   B: 4096x4096    | A: 1x4096,   B: 4096x4096
  1st MatMul in FFN    | A: 2048x4096,   B: 4096x16384   | A: 1x4096,   B: 4096x16384
  2nd MatMul in FFN    | A: 2048x16384,  B: 16384x4096   | A: 1x16384,  B: 16384x4096

GPT Series Model Analysis
Parameters visited during one inference pass: N ≈ l·(4h² + 2·4h²) + V·h = 12·l·h² + V·h, where l is the number of layers, h the hidden size and V the vocabulary size.
Memory latency ≈ N·b / memory bandwidth; compute latency ≈ 2·N·B·S / peak FLOPS (b = bytes per parameter, B = batch size, S = tokens processed per step).
Arithmetic intensity: AI = 2·N·B·S / (N·b) = 2·B·S / b FLOPS/byte.
Peak AI for SPR-SP with BF16 and AMX: 123.2 TFLOPS / 307.2 GB/s ≈ 401 FLOPS/byte.
Compute bound (BF16): B·S ≥ 401. Memory bandwidth bound (BF16): B·S < 401.
Platform characteristics: DDR5, 8 channels per CPU at 4800 MT/s (1DPC), 16 DIMMs per socket; 64 GB HBM2e (about 1 GB/core HBM capacity), memory bandwidth up to 1 TB/s; up to 112.5 MB shared LLC.
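To make the compute-bound vs. bandwidth-bound split concrete, here is a minimal roofline-style estimate in Python. The TFLOPS and GB/s figures are taken from the slide; the function name and everything else is illustrative, not from the talk's tooling.

```python
# Minimal roofline-style estimate for LLM inference on one CPU socket.
# Peak numbers are from the slide (SPR-SP, BF16 with AMX); the rest is an
# illustrative assumption.

PEAK_BF16_TFLOPS = 123.2          # AMX BF16 peak, TFLOPS
PEAK_DDR5_GBPS   = 307.2          # 8 ch x 4800 MT/s DDR5, GB/s
BYTES_PER_PARAM  = 2              # BF16

peak_ai = PEAK_BF16_TFLOPS * 1e12 / (PEAK_DDR5_GBPS * 1e9)   # ~401 FLOPS/byte

def phase_ai(batch: int, tokens_per_step: int) -> float:
    """Arithmetic intensity of one decoder pass.

    FLOPs ~ 2 * N * batch * tokens_per_step  (N = parameters visited)
    Bytes ~ N * BYTES_PER_PARAM              (weights are read once)
    so AI ~ 2 * batch * tokens_per_step / BYTES_PER_PARAM, independent of N.
    """
    return 2 * batch * tokens_per_step / BYTES_PER_PARAM

for name, bsz, steps in [("1st token (prompt=2048)", 1, 2048),
                         ("next token (greedy)",     1, 1)]:
    ai = phase_ai(bsz, steps)
    bound = "compute bound" if ai >= peak_ai else "memory bandwidth bound"
    print(f"{name}: AI = {ai:.0f} FLOPS/byte vs peak {peak_ai:.0f} -> {bound}")
```

Prompt processing (AI ≈ 2048 FLOPS/byte) sits above the ~401 FLOPS/byte ridge point and is compute bound, while token-by-token generation (AI ≈ 1) is far below it and is bound by memory read bandwidth, matching the slide's conclusion.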
CPU is NOT Fully Utilized!
[Chart: metric_CPU utilization (%) over time while serving the LLM inference pipeline; utilization stays low throughout.]

LLM Inference Pipeline
[Diagram: LLM inference pipeline with Text Emb1, Text Emb2, a Vector DB, a Context Retriever and the pre-trained/finetuned LLM model behind a REST/gRPC API; example query "Should I attend QCON?" answered with "Yes."]
Pre-processing and post-processing in LLM inference are relatively simple and do not need much CPU resource.

CPU Utilization in LLM Training (offload mode)
Even for offload LLM training, the CPU is still not fully utilized.

How to Optimize Large Language Models on CPU?

Optimization
- Leverage high-performance kernels (e.g., oneDNN)
- Avoid redundant computing: continuous batching, causal masking, prefix sharing (see the KV-cache sketch after this list)
- Lower precision & sparsity
- Graph fusion
- Minimize memory copy and reorder
- Reuse memory
- Distributed inference, with an efficient communication library (oneCCL)
- Runtime tuning
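As a minimal illustration of "avoid redundant computing", the sketch below (plain NumPy; all names are illustrative, not from xFasterTransformer) caches K/V for tokens that have already been processed, so each decode step only computes attention for the one new token; causal masking is implicit because only cached (earlier) positions are attended to.

```python
import numpy as np

# Toy single-head decoder step with a KV cache: previously seen tokens'
# K/V are never recomputed, so each generation step costs O(seq_len)
# instead of O(seq_len^2). Illustrative sketch, not a real kernel.

HEAD_DIM = 64
k_cache = np.zeros((0, HEAD_DIM), dtype=np.float32)
v_cache = np.zeros((0, HEAD_DIM), dtype=np.float32)

def decode_step(q_new, k_new, v_new):
    """Attention for one new token against all cached tokens plus itself.

    Causal masking is implicit: the cache only ever contains positions at
    or before the current one, so no future token is attended to.
    """
    global k_cache, v_cache
    k_cache = np.concatenate([k_cache, k_new[None, :]])   # append, don't recompute
    v_cache = np.concatenate([v_cache, v_new[None, :]])
    scores = (k_cache @ q_new) / np.sqrt(HEAD_DIM)         # (seq_len,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ v_cache                                  # (HEAD_DIM,)

# Usage: feed the projected q/k/v of each newly generated token, one at a time.
for _ in range(4):
    out = decode_step(*np.random.randn(3, HEAD_DIM).astype(np.float32))
```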
Optimization for Distributed Inference
- Distributed inference based on oneCCL.
- Improve scalability by minimizing synchronization: one synchronization per layer is enough for some models.
- Minimize memory copies between the compute module and the communication module by owning the full stack.
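A minimal sketch of tensor-parallel decoding with a single all-reduce per layer, using torch.distributed. The "ccl" backend assumes Intel's oneccl_bindings_for_pytorch package is installed; the layer structure, shapes, and function names are illustrative, not the xFasterTransformer implementation.

```python
import torch
import torch.distributed as dist
# Assumption: Intel's oneCCL bindings for PyTorch are installed; importing
# them registers the "ccl" backend used below.
import oneccl_bindings_for_pytorch  # noqa: F401

dist.init_process_group(backend="ccl")      # e.g., one rank per socket/sub-NUMA
world = dist.get_world_size()

HIDDEN, FFN = 4096, 16384
# Column-parallel first FFN matmul, row-parallel second matmul:
# each rank holds a 1/world slice of the weights (illustrative shapes only).
w1 = torch.randn(HIDDEN, FFN // world)
w2 = torch.randn(FFN // world, HIDDEN)

def ffn_layer(x: torch.Tensor) -> torch.Tensor:
    """One tensor-parallel FFN layer with a single all-reduce.

    The partial products of the row-parallel matmul are summed across
    ranks once per layer; no other synchronization point is needed.
    """
    partial = torch.relu(x @ w1) @ w2
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)   # the only sync point
    return partial

x = torch.randn(1, HIDDEN)
for _ in range(28):          # e.g., the 28 GPT-J layers
    x = ffn_layer(x)
```

Launched under mpirun or torchrun, each rank only exchanges one reduced activation tensor per layer, which is the "minimize synchronization" pattern described above.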
Attention Optimization
[Diagram: Q·Kᵀ → SoftMax(Q·Kᵀ) → multiply by V, computed block by block.]

SlimAttention (split the score in 1 dimension)
- Computing order: process the SoftMax(Q·Kᵀ) score one block of query rows at a time, each block against the full K and V.
- Intermediate score size: same size as one row block of the score matrix.
- Advantage: less score buffer, without redundant computing.

FlashAttention (split the score in 2 dimensions)
- Computing order: tile the score along both the query and key dimensions, rescaling the softmax online as key tiles are consumed.
- Intermediate score size: same size as one tile of the score matrix.
- Advantage: minimal intermediate buffer, with some redundant computing.
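A sketch of the 1-D split idea in plain NumPy (illustrative, not the xFasterTransformer kernel): the score matrix is computed one row block at a time against the full K/V, so the softmax of each block is exact, nothing is recomputed, and the buffer only needs to hold one block of scores.

```python
import numpy as np

def slim_attention(Q, K, V, block_rows=128):
    """Attention with the score split along one dimension (query rows).

    Each block of Q rows sees the full K/V, so the softmax is computed
    exactly per block: only a (block_rows x seq_len) score buffer is
    needed and nothing is recomputed. Single head, no masking.
    """
    seq_len, head_dim = Q.shape
    out = np.empty_like(Q)
    scale = 1.0 / np.sqrt(head_dim)
    for start in range(0, seq_len, block_rows):
        end = min(start + block_rows, seq_len)
        scores = (Q[start:end] @ K.T) * scale            # (block, seq_len)
        scores -= scores.max(axis=1, keepdims=True)      # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)
        out[start:end] = probs @ V
    return out

# Matches the unblocked computation up to floating-point error.
Q = np.random.randn(2048, 256).astype(np.float32)
K, V = np.random.randn(2, 2048, 256).astype(np.float32)
ref = slim_attention(Q, K, V, block_rows=2048)
assert np.allclose(slim_attention(Q, K, V), ref, atol=1e-4)
```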
Do We Need Paged Attention on CPU?
[Diagram: KV-cache blocks 0-15; both the 1st visit and the 2nd visit walk the blocks sequentially.]

Int8 Weight-Only Quantization
[Diagram: mapping fp32/fp16 weights onto int8 [-127, 127]. A naive scheme maps the real min/max (100% of the data) onto the int8 range; the histogram-based scheme clips the range to cover 99.99% of the data, saturating a few outliers in exchange for finer resolution of the remaining values.]
Weight (fp32/fp16) → QWeight (int8) + scale (fp32) + zero_point (fp32)

Accuracy on LLaMA2 7B:
  Method             | Accuracy
  HF + AutoGPTQ_INT8 | 73.7046%
  xFT + INT8_Convert | 73.7629%

Observation: with histogram-based quantization, we get very good accuracy in xFasterTransformer.
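A minimal sketch of the idea in NumPy (the 99.99% coverage figure comes from the slide; the percentile-based clipping, per-tensor granularity, and function names are illustrative assumptions, not the exact xFasterTransformer scheme): instead of the real min/max, the quantization range covers 99.99% of the weight values, and scale/zero_point are derived from that clipped range.

```python
import numpy as np

def quantize_int8_weight_only(w: np.ndarray, coverage: float = 99.99):
    """Asymmetric int8 weight-only quantization with percentile clipping.

    Rather than mapping [real_min, real_max] (100% of values) onto
    [-127, 127], clip to the range covering `coverage`% of values so a
    few outliers no longer stretch the scale. Per-tensor sketch.
    """
    tail = (100.0 - coverage) / 2.0
    lo, hi = np.percentile(w, [tail, 100.0 - tail])
    scale = (hi - lo) / 254.0                    # span of [-127, 127]
    zero_point = -127.0 - lo / scale             # value `lo` maps to -127
    q = np.clip(np.round(w / scale + zero_point), -127, 127).astype(np.int8)
    return q, np.float32(scale), np.float32(zero_point)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
w[0, 0] = 30.0                                   # an outlier
q, scale, zp = quantize_int8_weight_only(w)
err = np.abs(dequantize(q, scale, zp) - w).mean()
print(f"scale={scale:.5f} zero_point={zp:.2f} mean abs error={err:.5f}")
```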
Maximizing CPU Value

CPU vs. GPU
  CPU                                                               | GPU
  A smaller number of larger cores                                  | A larger number (thousands) of smaller cores
  Low latency                                                       | High throughput
  Performs fewer instructions per clock                             | Performs more instructions per clock
  Designed and optimized for complex programs w/ serial processing  | Optimized for parallel processing w/ bulk repetitive calculations
  Automatic cache management                                        | Allows for manual memory management
  Large memory capacity                                             | Limited memory capacity

Key HW Factors in LLM Inference
- Memory bandwidth
- Computing
- Memory capacity

Key Challenges in LLM Inference
- Autoregressive generation
- Attention (quadratic in sequence length)
- Model is large

Scenarios Where CPU Has Value
- Long-tail models (many models, few requests)
- Offline mode (to maximize throughput)
- Occasional demand
- Very long prompt token size and no strict latency requirement
- Very large model and not enough GPU
- Hybrid solutions (e.g., speculative sampling)
Long-tail models on CPU: Model-1, Model-2, Model-3, ..., Model-N are all loaded in memory, but not all models serve requests at the same time.

Speculative Decoding
[Diagram (image from https:/ ): a Draft Model generates proposed tokens (token generation); the Target Model then verifies the proposed tokens (token verification).]
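A minimal sketch of the propose-and-verify loop (runnable Python; the greedy acceptance rule and the draft_next/target_argmax model interfaces are illustrative assumptions, not the exact scheme from the talk):

```python
from typing import Callable, List

# Toy speculative decoding loop: a cheap draft model proposes k tokens,
# the expensive target model scores the whole proposal in one pass, and
# tokens are accepted greedily until the first disagreement.
# draft_next / target_argmax stand in for real model calls (assumed API).

def speculative_decode(prompt: List[int],
                       draft_next: Callable[[List[int]], int],
                       target_argmax: Callable[[List[int]], List[int]],
                       k: int = 4,
                       max_new_tokens: int = 32) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2) Target model scores prompt+proposal in ONE forward pass and
        #    returns its own next-token choice at every position.
        target_choices = target_argmax(tokens + proposal)[-(k + 1):-1]
        # 3) Accept proposed tokens until the first mismatch, then take
        #    the target model's token at the mismatch position.
        accepted = []
        for prop, tgt in zip(proposal, target_choices):
            if prop == tgt:
                accepted.append(prop)
            else:
                accepted.append(tgt)
                break
        tokens.extend(accepted)
    return tokens
```

With greedy verification the accepted tokens match what the target model alone would have produced, while the target model only runs one full pass per batch of k proposed tokens.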
Memory Bandwidth Matching
[Chart, memory bandwidth (GB/s): 2S CPU (5th Gen Xeon) ≈ 716.8 vs. 8 x A10 ≈ 4800, roughly a 6.7x gap.]
In practice, draft models are about 15-20x smaller than the target model, so even with ~6.7x less memory bandwidth the CPU can run the draft model (or a second draft, Draft2 Model) fast enough to feed proposed tokens to the target model.

Summary
- Why consider CPU?
- How to optimize on CPU?
- When to use CPU?
Try our solution on Xeon; build your own solution or seek ultimate performance.