Mohan J Kumar, Perpetual Intelligence
Edmund Song, Intel

LLM Inference Performance Projection

Agenda
- AI Market Trends
- AI Refresher: Training vs. Inference
- Types of Parallelism
- MESA LLM Inference Performance Projection: Overview and Examples
- Unique Attributes
- Summary and Call to Action

AI Trends
- Generative AI revenue is expected to reach $1T by 2032 [1]
- Generative AI will be 12% of technology spend by 2032 [1]
- The global AI market will reach $1.7T by 2030 [2]
- The global AI inference market is expected to grow at an 18.4% CAGR to $133B by 2034 [3]
- Inference is expected to account for 90% of AI computing by 2030

Sources: [1] Bloomberg, [2] Grand View Research, [3] Market.us
Training vs. Inference

Training
- The model trains on a large dataset: a forward pass produces a prediction ("Car?"), the error is measured, and a backward pass adjusts the parameters to minimize that error.

Inference
- A forward pass only: given new input data, the model outputs a prediction ("Car"). See the sketch below.
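The distinction is easiest to see in code. Below is a minimal sketch, not from the slides, contrasting one training step (forward pass, error, backward pass, parameter update) with inference (forward pass only) on a toy linear model; all data and sizes are hypothetical.

    # Minimal sketch (not from the slides): one training step vs. one
    # inference step for a single linear layer on hypothetical toy data.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 2))      # model parameters
    x = rng.normal(size=(8, 4))      # batch of inputs
    y = rng.normal(size=(8, 2))      # labels (training only)

    # Training: forward pass, measure error, backward pass, adjust parameters.
    pred = x @ W                     # forward pass
    err = pred - y                   # prediction error
    grad = x.T @ err / len(x)        # backward pass: gradient of squared error
    W -= 0.1 * grad                  # adjust parameters to reduce the error

    # Inference: forward pass only -- new input in, prediction out.
    x_new = rng.normal(size=(1, 4))
    prediction = x_new @ W           # no labels, no gradients, no update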
Types of Parallelism

Data Parallelism
- Multiple full copies of the model on different GPUs or AI clusters (GPU0-GPU3 in the illustration)
- Increases overall throughput by processing multiple requests in parallel

Sequence Parallelism
- A long input sequence is split across multiple GPUs or AI clusters (Seq1-Seq4 across GPU0-GPU3 in the illustration)
- More commonly used in inference
- Allows handling long sequences without running out of memory (quantified in the sketch below)
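Why long sequences run out of memory is quantitative: per-token state such as the attention KV cache grows linearly with sequence length, and sequence parallelism divides that state across devices. A minimal sketch under assumed, hypothetical model dimensions (the 80-layer / 8-KV-head / 128-dim figures are illustrative only, not from the slides):

    # Minimal sketch (hypothetical dimensions): KV-cache size grows linearly
    # with sequence length, which is the state sequence parallelism splits.
    def kv_cache_gb(seq_len, layers=80, kv_heads=8, head_dim=128,
                    bytes_per_elem=2, batch=1):
        # 2x for keys and values; one entry per token per layer.
        return (2 * batch * seq_len * layers * kv_heads * head_dim
                * bytes_per_elem / 1e9)

    for seq in (8_192, 131_072, 1_048_576):
        total = kv_cache_gb(seq)
        print(f"{seq:>9} tokens: {total:7.1f} GB total, "
              f"{total / 4:6.1f} GB/GPU split across 4 GPUs")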
Tensor Parallelism
- The model is split across multiple GPUs or AI clusters, e.g. split across 4 GPUs in the illustration
- Supports large models that do not fit within a single unit's (GPU or AI cluster) memory constraints (see the matmul sketch below)
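Tensor parallelism is easiest to see on a single matrix multiply: each device stores one slice of the weight matrix and computes the matching slice of the output. The sketch below simulates 4 GPUs with plain array slices in one process; it illustrates the idea, not MESA's mechanics:

    # Minimal sketch: column-wise tensor parallelism of Y = X @ W across
    # 4 simulated "GPUs" (array slices; a real run would use 4 devices).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(16, 1024))
    W = rng.normal(size=(1024, 4096))

    shards = np.split(W, 4, axis=1)       # each GPU stores 1/4 of the weights
    partials = [X @ w for w in shards]    # each GPU computes 1/4 of the output
    Y = np.concatenate(partials, axis=1)  # all-gather of the output slices

    assert np.allclose(Y, X @ W)          # same result, 1/4 memory per GPU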
Pipeline Parallelism
- Model layers are split across multiple GPUs or AI clusters, e.g. 4 layers split across 4 GPUs (GPU0-GPU3) in the illustration
- Better utilization of AI hardware

Expert Parallelism
- Spreads experts across multiple GPUs, AI accelerators, or AI clusters (Expert1-Expert4 across GPU0-GPU3, with a router inside the transformer layers, in the illustration)
- Activates only a subset of experts per input, avoiding redundant computation (see the routing sketch below)
- Reduces the total compute cost per inference
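The routing step is what keeps expert parallelism cheap per token: the router scores all experts, but only the top-k actually compute. A minimal sketch with illustrative sizes (8 experts, top-2; not DeepSeek's configuration):

    # Minimal sketch: top-2 expert routing. Only the selected experts run,
    # so compute per token stays low even with many experts in memory.
    import numpy as np

    rng = np.random.default_rng(0)
    n_experts, d = 8, 64                          # experts spread over GPUs
    experts = rng.normal(size=(n_experts, d, d))  # one weight matrix each
    router_w = rng.normal(size=(d, n_experts))

    def moe_layer(x, k=2):
        logits = x @ router_w                     # router score per expert
        top_k = np.argsort(logits)[-k:]           # activate only k of n
        gates = np.exp(logits[top_k])
        gates /= gates.sum()                      # softmax over the chosen k
        return sum(g * (x @ experts[e]) for g, e in zip(gates, top_k))

    token = rng.normal(size=d)
    out = moe_layer(token)                        # 2 of 8 experts did work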
MESA
- MESA (Model Evaluation and Smart Analytics) is a tool for AI inference performance projection
- Inputs: LLM model, AI hardware, sequence length, batch size
- Outputs: latency per token (the inverse of tokens/s), total GPUs required, and the parallelism scheme (a projection sketch follows)
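The slides define latency per token as the inverse of tokens/s. A common way to project it, and presumably close in spirit to what a tool like MESA computes per block, is a roofline bound: a decode step takes at least max(FLOPs / peak compute, bytes moved / memory bandwidth). The sketch below is our assumption of that general approach, not MESA's actual code, and it ignores KV-cache traffic:

    # Minimal roofline sketch (our assumption of the general approach, not
    # MESA's code): decode latency per token for a dense transformer.
    def latency_per_token_s(params_b, batch, peak_tflops, mem_bw_tbs,
                            bytes_per_param=2):
        flops = 2 * params_b * 1e9 * batch      # ~2 FLOPs per param per token
        bytes_moved = params_b * 1e9 * bytes_per_param  # read each weight once
        compute_s = flops / (peak_tflops * 1e12)
        memory_s = bytes_moved / (mem_bw_tbs * 1e12)
        return max(compute_s, memory_s)         # bound by the slower of the two

    # Example: a 70B dense model on one hypothetical 1000-TFLOPS, 3.35-TB/s GPU.
    lat = latency_per_token_s(70, batch=1, peak_tflops=1000, mem_bw_tbs=3.35)
    print(f"{lat*1e3:.1f} ms/token -> {1/lat:.0f} tokens/s")  # memory-bound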
MESA Tool UI Overview
- Configuration panel, graph panel, model topology, and analysis result panel

Model Topology using MESA
- Layer-wise breakdown: input, attention, MLP, output
- Displays the calculations for each block and the interdependencies between blocks
- Examine optimization strategies and tuning opportunities based on the topology

MESA: How It Brings Model and HW Together
- Model parameters are available as JSON from Hugging Face; the model source code is also available
- A Model Architecture Description language and an AI HW Spec language were developed as part of MESA
- Flow: model parameters / model data file / source code (Hugging Face, GitHub, TBD) feed the Model Architecture Description; together with the AI HW Spec, MESA projects performance and tunes batch size, parallelism, and sequence length.

AI HW Spec example (excerpt):

    {
      "hardware_name": "H100",
      "hardware_type": "GPU",
      "FP64_supported": true,
      "TF64_supported": true,
      "FP32_supported": true,
      "TF32_supported": true,
      ...
      "FP64_bandwidth": 34,
      "TF64_bandwidth": 67,
      "FP32_bandwidth": 67,
      ...
      "memory_type": "HBM2e",
      "memory_capacity": 80,
      "memory_bandwidth": 3.352
    }
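One of MESA's outputs is the total number of GPUs required, which at minimum falls out of weight memory versus the spec's memory_capacity. The sketch below reads a spec shaped like the excerpt above; the unit interpretations (capacity in GB, bandwidth in TB/s) and the kv_overhead factor are our assumptions:

    # Minimal sketch: read a hardware spec like the H100 excerpt above and
    # size the deployment. Units (GB, TB/s) are our assumption.
    import json, math

    spec = json.loads("""{
      "hardware_name": "H100", "hardware_type": "GPU",
      "memory_type": "HBM2e", "memory_capacity": 80, "memory_bandwidth": 3.352
    }""")

    def gpus_required(params_b, bytes_per_param=2, kv_overhead=1.2):
        # kv_overhead is a hypothetical headroom factor for KV cache etc.
        weight_gb = params_b * bytes_per_param   # 1e9 params * bytes -> GB
        return math.ceil(weight_gb * kv_overhead / spec["memory_capacity"])

    # Example: a 671B-parameter model at FP8 (1 byte per parameter).
    print(gpus_required(671, bytes_per_param=1))  # -> 11 of these GPUs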
MESA Demo

MESA Unique Attributes
- JSON to describe the model architecture and model parameters
- Produces model diagrams that illustrate data flow paths and dependencies
- Calculates the latency, computational needs, and memory requirements for each stage of the model
- Enables tuning of batch size, sequence length, and parallelism given hardware constraints
- JSON to describe various GPU and AI hardware in a generic manner
Summary and Call to Action
- The global AI market is poised to reach $2T by 2030
- AI inference is expected to account for 90% of the AI market by 2030
- MESA allows AI inference performance projection across models and hardware platforms
- OCP, as a hardware/platform organization, can play a key role in maintaining the hardware models
- Contribute hardware model JSON to OCP
- Utilize MESA and help improve and enhance it!

Thank You!
Inference Example: DeepSeek
- Model: 671B parameters; HW platform: 8x H20
- MLA: 11B parameters; MLP: 11B parameters; routed experts: ~44M per expert per layer
- MLA and MLP run tensor-parallel (TP) across the GPUs
- One shared expert per GPU; the 256 routed experts are distributed 32 per GPU
- Batch size is limited (see the worked sketch below)

[Figure: each of GPU1-GPU8 hosts a shared expert, a router, and 32 routed experts (GPU1: RE1-RE32, GPU2: RE33-RE64, ..., GPU8: RE225-RE256), with the MLA and MLP sharded across all eight GPUs.]
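The slide's own numbers make the "batch size limited" point concrete. Everything below comes straight from the slide except two labeled assumptions: FP8 weights (1 byte per parameter) and roughly 96 GB of memory per H20.

    # Worked sketch from the slide's numbers. FP8 (1 byte/param) and a
    # ~96 GB H20 are our assumptions, not stated on the slide.
    total_b, mla_b, mlp_b, gpus = 671, 11, 11, 8

    routed_b = total_b - mla_b - mlp_b     # ~649B params in routed experts
    per_gpu_b = routed_b / gpus            # experts spread 32/GPU -> ~81B
    per_gpu_gb = per_gpu_b * 1             # FP8: 1 byte per parameter

    shared_gb = (mla_b + mlp_b) / gpus     # MLA/MLP tensor-parallel shards
    free_gb = 96 - per_gpu_gb - shared_gb  # what is left for the KV cache
    print(f"{per_gpu_gb:.0f} GB experts + {shared_gb:.1f} GB MLA/MLP shards "
          f"-> ~{free_gb:.0f} GB/GPU for KV cache: batch size is limited")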
Types of Inference

Batch
- Processes multiple inputs at once; offline or scheduled
- Optimized for throughput, not latency
- E.g. recommendation systems

Interactive (online)
- Processes single or low-batch inputs in real time; interactive, often user-facing
- Optimized for relatively low latency
- E.g. chatbots

Fast
- Specialized online inference; ultra-low latency; machine-to-machine
- E.g. autonomous driving, high-frequency trading