Mohan J Kumar, Perpetual Intelligence
Edmund Song, Intel

LLM Inference Performance Projection

Agenda
- AI Market Trends
- AI Refresher: Training vs. Inference
- Types of Parallelism
- MESA LLM Inference Performance Projection: Overview and Examples
- Unique Attributes
- Summary and Call to Action

AI Trends
- Generative AI revenue is expected to reach $1T by 2032 [1]
- Generative AI will be 12% of technology spend by 2032 [1]
- The global AI market will reach $1.7T by 2030 [2]
- The global AI inference market is expected to grow at an 18.4% CAGR to $133B by 2034 [3]
- Inference is expected to account for 90% of AI computing by 2030

Sources: [1] Bloomberg, [2] Grand View Research, [3] Market.us
Training vs. Inference

Training
- The model trains on a large dataset: a forward pass produces a prediction ("Car?"), the error is measured, and a backward pass adjusts the parameters to minimize that error.

Inference
- A forward pass only: given new input data, the model outputs a prediction ("Car"). See the sketch below.
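The distinction is easiest to see in code. Below is a minimal sketch, not from the slides, contrasting one training step (forward pass, error, backward pass, parameter update) with inference (forward pass only) on a toy linear model; all data and sizes are hypothetical.

    # Minimal sketch (not from the slides): one training step vs. one
    # inference step for a single linear layer on hypothetical toy data.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 2))      # model parameters
    x = rng.normal(size=(8, 4))      # batch of inputs
    y = rng.normal(size=(8, 2))      # labels (training only)

    # Training: forward pass, measure error, backward pass, adjust parameters.
    pred = x @ W                     # forward pass
    err = pred - y                   # prediction error
    grad = x.T @ err / len(x)        # backward pass: gradient of squared error
    W -= 0.1 * grad                  # adjust parameters to reduce the error

    # Inference: forward pass only -- new input in, prediction out.
    x_new = rng.normal(size=(1, 4))
    prediction = x_new @ W           # no labels, no gradients, no update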
Types of Parallelism

Data Parallelism
- Multiple full copies of the model on different GPUs or AI clusters (GPU0-GPU3 in the illustration)
- Increases overall throughput by processing multiple requests in parallel

Sequence Parallelism
- A long input sequence is split across multiple GPUs or AI clusters (Seq1-Seq4 across GPU0-GPU3 in the illustration)
- More commonly used in inference
- Allows handling long sequences without running out of memory (quantified in the sketch below)
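Why long sequences run out of memory is quantitative: per-token state such as the attention KV cache grows linearly with sequence length, and sequence parallelism divides that state across devices. A minimal sketch under assumed, hypothetical model dimensions (the 80-layer / 8-KV-head / 128-dim figures are illustrative only, not from the slides):

    # Minimal sketch (hypothetical dimensions): KV-cache size grows linearly
    # with sequence length, which is the state sequence parallelism splits.
    def kv_cache_gb(seq_len, layers=80, kv_heads=8, head_dim=128,
                    bytes_per_elem=2, batch=1):
        # 2x for keys and values; one entry per token per layer.
        return (2 * batch * seq_len * layers * kv_heads * head_dim
                * bytes_per_elem / 1e9)

    for seq in (8_192, 131_072, 1_048_576):
        total = kv_cache_gb(seq)
        print(f"{seq:>9} tokens: {total:7.1f} GB total, "
              f"{total / 4:6.1f} GB/GPU split across 4 GPUs")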
Tensor Parallelism
- The model is split across multiple GPUs or AI clusters, e.g. split across 4 GPUs in the illustration
- Supports large models that do not fit within a single unit's (GPU or AI cluster) memory constraints (see the matmul sketch below)
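Tensor parallelism is easiest to see on a single matrix multiply: each device stores one slice of the weight matrix and computes the matching slice of the output. The sketch below simulates 4 GPUs with plain array slices in one process; it illustrates the idea, not MESA's mechanics:

    # Minimal sketch: column-wise tensor parallelism of Y = X @ W across
    # 4 simulated "GPUs" (array slices; a real run would use 4 devices).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(16, 1024))
    W = rng.normal(size=(1024, 4096))

    shards = np.split(W, 4, axis=1)       # each GPU stores 1/4 of the weights
    partials = [X @ w for w in shards]    # each GPU computes 1/4 of the output
    Y = np.concatenate(partials, axis=1)  # all-gather of the output slices

    assert np.allclose(Y, X @ W)          # same result, 1/4 memory per GPU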
Pipeline Parallelism
- Model layers are split across multiple GPUs or AI clusters, e.g. 4 layers split across 4 GPUs (GPU0-GPU3) in the illustration
- Better utilization of AI hardware

Expert Parallelism
- Spreads experts across multiple GPUs, AI accelerators, or AI clusters (Expert1-Expert4 across GPU0-GPU3, with a router inside the transformer layers, in the illustration)
- Activates only a subset of experts per input, avoiding redundant computation (see the routing sketch below)
- Reduces the total compute cost per inference
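The routing step is what keeps expert parallelism cheap per token: the router scores all experts, but only the top-k actually compute. A minimal sketch with illustrative sizes (8 experts, top-2; not DeepSeek's configuration):

    # Minimal sketch: top-2 expert routing. Only the selected experts run,
    # so compute per token stays low even with many experts in memory.
    import numpy as np

    rng = np.random.default_rng(0)
    n_experts, d = 8, 64                          # experts spread over GPUs
    experts = rng.normal(size=(n_experts, d, d))  # one weight matrix each
    router_w = rng.normal(size=(d, n_experts))

    def moe_layer(x, k=2):
        logits = x @ router_w                     # router score per expert
        top_k = np.argsort(logits)[-k:]           # activate only k of n
        gates = np.exp(logits[top_k])
        gates /= gates.sum()                      # softmax over the chosen k
        return sum(g * (x @ experts[e]) for g, e in zip(gates, top_k))

    token = rng.normal(size=d)
    out = moe_layer(token)                        # 2 of 8 experts did work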
MESA
- MESA (Model Evaluation and Smart Analytics) is a tool for AI inference performance projection
- Inputs: LLM model, AI hardware, sequence length, batch size
- Outputs: latency per token (the inverse of tokens/s), total GPUs required, and the parallelism scheme (a projection sketch follows)
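The slides define latency per token as the inverse of tokens/s. A common way to project it, and presumably close in spirit to what a tool like MESA computes per block, is a roofline bound: a decode step takes at least max(FLOPs / peak compute, bytes moved / memory bandwidth). The sketch below is our assumption of that general approach, not MESA's actual code, and it ignores KV-cache traffic:

    # Minimal roofline sketch (our assumption of the general approach, not
    # MESA's code): decode latency per token for a dense transformer.
    def latency_per_token_s(params_b, batch, peak_tflops, mem_bw_tbs,
                            bytes_per_param=2):
        flops = 2 * params_b * 1e9 * batch      # ~2 FLOPs per param per token
        bytes_moved = params_b * 1e9 * bytes_per_param  # read each weight once
        compute_s = flops / (peak_tflops * 1e12)
        memory_s = bytes_moved / (mem_bw_tbs * 1e12)
        return max(compute_s, memory_s)         # bound by the slower of the two

    # Example: a 70B dense model on one hypothetical 1000-TFLOPS, 3.35-TB/s GPU.
    lat = latency_per_token_s(70, batch=1, peak_tflops=1000, mem_bw_tbs=3.35)
    print(f"{lat*1e3:.1f} ms/token -> {1/lat:.0f} tokens/s")  # memory-bound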
MESA Tool UI Overview
- Configuration panel, graph panel, model topology, and analysis result panel

Model Topology using MESA
- Layer-wise breakdown: input, attention, MLP, output
- Displays the calculations for each block and the interdependencies between blocks
- Examine optimization strategies and tuning opportunities based on the topology

MESA: How It Brings Model and HW Together
- Model parameters are available as JSON from Hugging Face; the model source code is also available
- A Model Architecture Description language and an AI HW Spec language were developed as part of MESA
- Flow: model parameters / model data file / source code (Hugging Face, GitHub, TBD) feed the Model Architecture Description; together with the AI HW Spec, MESA projects performance and tunes batch size, parallelism, and sequence length.

AI HW Spec example (excerpt):

    {
      "hardware_name": "H100",
      "hardware_type": "GPU",
      "FP64_supported": true,
      "TF64_supported": true,
      "FP32_supported": true,
      "TF32_supported": true,
      ...
      "FP64_bandwidth": 34,
      "TF64_bandwidth": 67,
      "FP32_bandwidth": 67,
      ...
      "memory_type": "HBM2e",
      "memory_capacity": 80,
      "memory_bandwidth": 3.352
    }
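One of MESA's outputs is the total number of GPUs required, which at minimum falls out of weight memory versus the spec's memory_capacity. The sketch below reads a spec shaped like the excerpt above; the unit interpretations (capacity in GB, bandwidth in TB/s) and the kv_overhead factor are our assumptions:

    # Minimal sketch: read a hardware spec like the H100 excerpt above and
    # size the deployment. Units (GB, TB/s) are our assumption.
    import json, math

    spec = json.loads("""{
      "hardware_name": "H100", "hardware_type": "GPU",
      "memory_type": "HBM2e", "memory_capacity": 80, "memory_bandwidth": 3.352
    }""")

    def gpus_required(params_b, bytes_per_param=2, kv_overhead=1.2):
        # kv_overhead is a hypothetical headroom factor for KV cache etc.
        weight_gb = params_b * bytes_per_param   # 1e9 params * bytes -> GB
        return math.ceil(weight_gb * kv_overhead / spec["memory_capacity"])

    # Example: a 671B-parameter model at FP8 (1 byte per parameter).
    print(gpus_required(671, bytes_per_param=1))  # -> 11 of these GPUs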
MESA Demo

MESA Unique Attributes
- JSON to describe the model architecture and model parameters
- Produces model diagrams that illustrate data flow paths and dependencies
- Calculates the latency, computational needs, and memory requirements for each stage of the model
- Enables tuning of batch size, sequence length, and parallelism given hardware constraints
- JSON to describe various GPU and AI hardware in a generic manner
Summary and Call to Action
- The global AI market is poised to reach $2T by 2030
- AI inference is expected to account for 90% of the AI market by 2030
- MESA allows AI inference performance projection across models and hardware platforms
- OCP, as a hardware/platform organization, can play a key role in maintaining the hardware models
- Contribute hardware model JSON to OCP
- Utilize MESA and help improve and enhance it!

Thank You!
Inference Example: DeepSeek
- Model: 671B parameters; HW platform: 8x H20
- MLA: 11B parameters; MLP: 11B parameters; routed experts: ~44M per expert per layer
- MLA and MLP run tensor-parallel (TP) across the GPUs
- One shared expert per GPU; the 256 routed experts are distributed 32 per GPU
- Batch size is limited (see the worked sketch below)

[Figure: each of GPU1-GPU8 hosts a shared expert, a router, and 32 routed experts (GPU1: RE1-RE32, GPU2: RE33-RE64, ..., GPU8: RE225-RE256), with the MLA and MLP sharded across all eight GPUs.]
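The slide's own numbers make the "batch size limited" point concrete. Everything below comes straight from the slide except two labeled assumptions: FP8 weights (1 byte per parameter) and roughly 96 GB of memory per H20.

    # Worked sketch from the slide's numbers. FP8 (1 byte/param) and a
    # ~96 GB H20 are our assumptions, not stated on the slide.
    total_b, mla_b, mlp_b, gpus = 671, 11, 11, 8

    routed_b = total_b - mla_b - mlp_b     # ~649B params in routed experts
    per_gpu_b = routed_b / gpus            # experts spread 32/GPU -> ~81B
    per_gpu_gb = per_gpu_b * 1             # FP8: 1 byte per parameter

    shared_gb = (mla_b + mlp_b) / gpus     # MLA/MLP tensor-parallel shards
    free_gb = 96 - per_gpu_gb - shared_gb  # what is left for the KV cache
    print(f"{per_gpu_gb:.0f} GB experts + {shared_gb:.1f} GB MLA/MLP shards "
          f"-> ~{free_gb:.0f} GB/GPU for KV cache: batch size is limited")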
Types of Inference

Batch
- Processes multiple inputs at once; offline or scheduled
- Optimized for throughput, not latency
- E.g. recommendation systems

Interactive (online)
- Processes single or low-batch inputs in real time; interactive, often user-facing
- Optimized for relatively low latency
- E.g. chatbots

Fast
- Specialized online inference; ultra-low latency; machine-to-machine
- E.g. autonomous driving, high-frequency trading