AMD Instinct MI300X Generative AI Accelerator and Platform Architecture
Alan Smith, Sr. Fellow, Instinct Lead SoC Architect
Vamsi Alla, Fellow, Instinct Chief Engineer
Hot Chips 2024

Agenda
- AMD Instinct MI300X Accelerator Overview
- AMD CDNA 3 Architecture
- Memory System Overview
- Spatial Partitioning
- 4th Gen Infinity Architecture
- System Architecture
- AMD Instinct MI300X Platform
- Application Performance

The AMD Instinct Accelerator Journey
Multiple generations of architecture focused on advancing HPC and AI compute:
- MI100 (AMD CDNA, 2020): ecosystem growth. The first purpose-built GPU architecture to accelerate FP64 and FP32 HPC workloads.
- MI200 (AMD CDNA 2): driving HPC and AI to a new frontier. A denser compute architecture with leading memory capacity and bandwidth.
- MI300 (AMD CDNA 3, 2023): MI300A and MI300X, a data center APU and a discrete GPU. Focused improvements on unified memory, AI data-format performance, and in-node networking.

AMD Instinct MI300X Multi-chiplet Accelerator
- 153 billion transistors in TSMC 5nm | 6nm FinFET
- 4th Generation Infinity Fabric: 896 GB/s
- Infinity Fabric AP: 6 TB/s bisection
- Infinity Fabric Advanced Package (AP): 4.8 TB/s bisection
- Host PCIe 5.0: 128 GB/s
- 304 compute units, 1,216 matrix cores
- 192 GB HBM3 (eight stacks): 5.3 TB/s
- XCD/IOD interface: 2.1 TB/s
[Die diagram: seven Infinity Fabric links plus one PCIe Gen 5 link; eight XCDs, each with four shader engines and an L2 cache; eight HBM3 stacks.]
AMD CDNA 3 Architecture
[Compute unit diagram: scheduler, local data share, L1 cache, vector units + matrix core, vector registers.]

Computation | MI300X (FLOPS/clock/CU) | MI250X (FLOPS/clock/CU) | MI300X (peak TFLOP/s) | MI250X (peak TFLOP/s) | MI300X peak speedup
Vector FP64 | 128 | 128 | 81.7 | 47.9 | 1.7x
Vector FP32 (2) | 256 | 128 | 163.4 | 47.9 | 3.4x
Matrix FP64 | 256 | 256 | 163.4 | 95.7 | 1.7x
Matrix FP32 | 256 | 256 | 163.4 | 95.7 | 1.7x
Matrix TF32 | 1024 | N/A | 653.7 | N/A (1) | N/A (1)
Matrix FP16 (3) | 2048 | 1024 | 1307.4 | 383 | 3.4x
Matrix BF16 (3) | 2048 | 1024 | 1307.4 | 383 | 3.4x
Matrix FP8 (3) | 4096 | N/A (1) | 2614.9 | N/A (1) | N/A (1)
Matrix INT8 (3) | 4096 | 1024 | 2614.9 | 383 | 6.8x
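The peak columns follow directly from the CU count and clock. A quick worked check, using the 2,100 MHz peak boost engine clock quoted in endnote MI300-42:

```python
# Reproduce the MI300X peak-TFLOP/s column: 304 CUs at a 2,100 MHz
# peak boost engine clock (per endnote MI300-42).
CUS = 304
CLOCK_HZ = 2.1e9

def peak_tflops(flops_per_clock_per_cu: int) -> float:
    return CUS * CLOCK_HZ * flops_per_clock_per_cu / 1e12

print(peak_tflops(128))   # vector FP64 ->   81.7 TFLOP/s
print(peak_tflops(2048))  # matrix FP16 -> 1307.4 TFLOP/s
print(peak_tflops(4096))  # matrix FP8  -> 2614.9 TFLOP/s
print(8 * peak_tflops(2048) / 1e3)  # 8-GPU platform -> ~10.5, i.e. the 10.4 PF quoted later
```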
- Doubled low-precision matrix ops/clock/CU
- 2:4 structured sparsity support for INT8, FP8, FP16, and BF16, with an additional 2x performance when sparsity is enabled
- TF32 and FP8 numerical format support: a 2-bit mantissa and a 5-bit exponent for training (E5M2), and a 3-bit mantissa with a 4-bit exponent for inference (E4M3); OCP FP8 compliant (a sketch of the two encodings follows the notes below)
- Co-issue of FP16 | FP32 | INT32 with FP16 | FP32 | FP64

1. AMD Instinct MI200 Series accelerators don't support FP8 or TF32 and don't exploit structured sparsity.
2. Refers to non-packed vector instructions on AMD Instinct MI200 and MI300 Series accelerators.
3. Refers to dense compute on AMD Instinct MI300 accelerators.
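The two OCP FP8 encodings trade range for precision differently. A minimal sketch of their largest finite values, following the OCP 8-bit floating-point specification (the values here come from that spec, not from this presentation):

```python
# E5M2 (training): IEEE-754-style, the all-ones exponent field is reserved
# for Inf/NaN, so the top usable binade is 2**15 and the widest mantissa is 1.75.
e5m2_max = (2 - 2**-2) * 2.0**15      # 57344.0
# E4M3 (inference): only the all-ones-exponent, all-ones-mantissa code is NaN,
# so 1.110 * 2**8 remains a finite value.
e4m3_max = (2 - 2 * 2**-3) * 2.0**8   # 448.0
print(e5m2_max, e4m3_max)
```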
World's First 8-Stack HBM3 Memory Architecture
Max LLM size per system (FP16):

System | Memory | Training | Inference
Single Nvidia H100 HGX | 640 GB HBM3, 25.6 TB/s | 30B | 290B
Single AMD Instinct MI300X platform | 1.5 TB HBM3, 42.4 TB/s | 70B | 680B

See endnote MI300-42.
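The inference figures follow from the sizing rule in endnote MI300-42: 2 bytes per parameter for FP16 weights, plus roughly 10% for activations and other state. A short worked check:

```python
# Largest model expected to fit for FP16 inference, per endnote MI300-42:
# 2 bytes/parameter plus ~10% overhead for activations and other state.
def max_params(hbm_bytes: float) -> float:
    return hbm_bytes / (2.0 * 1.10)

print(max_params(1.5e12) / 1e9)  # 8x MI300X, 1.5 TB -> ~682B (slide: 680B)
print(max_params(640e9) / 1e9)   # 8x H100, 640 GB   -> ~291B (slide: 290B)
```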
MI300X Block Diagram
[Block diagram: eight XCDs, each with 40 CUs (304 enabled in total), a 4 MB L2 cache, and global resources, connected over Infinity Fabric to the Infinity Cache — 128 slices of 2 MB, sixteen per HBM3 stack, 256 MB in total — fronting eight HBM3 stacks. Off-package I/O: 4x Infinity Fabric-or-PCIe links and 4x Infinity Fabric links.]
MI300X Cache and Memory Hierarchy
CDNA 3 compute unit cache optimizations:
- L1 data cache: 128 B cache line, 32 KiB (vs. 16 KiB on the MI200 Series)
- L1 instruction cache: 64 KiB, shared by two CUs (vs. 32 KiB on the MI200 Series)
- XCD-private L2 cache: a write-back, write-allocate cache that increases request coalescing and reduces spill; each instance delivers 2048 bytes/clock; agent-scope coherent

[Hierarchy diagram: SIMD (1216 per GPU) with 128 KiB vector registers; CU (304) with 32 KiB L1 data cache, 64 KiB L1 instruction cache, 16 KiB L1 constant cache, 64 KiB LDS, and 3.2 KiB scalar registers; SE (32); GL2 (8) with 4 MiB L2; GPU (1) with 256 MiB Infinity Cache and 192 GiB HBM3.]

Level | Capacity | Number per GPU
VGPR | 128 KiB | 1216
LDS | 64 KiB | 304
L1 data cache | 32 KiB | 304
L2 cache | 4 MiB | 8
Infinity Cache | 256 MiB | 1
HBM | 192 GiB | 1

AMD Infinity Cache benefits:
- 256 MB at 14.7 TB/s peak bandwidth
- Bandwidth amplification
- Power and latency reduction
- Device/system-scope coherent
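Multiplying the capacity table out shows where the on-chip storage actually sits; a small worked example:

```python
# Aggregate per-GPU storage at each level of the table above.
levels = {
    "VGPR":           (128, 1216),   # KiB per instance, instances per GPU
    "LDS":            (64, 304),
    "L1 data cache":  (32, 304),
    "L2 cache":       (4 * 1024, 8),
    "Infinity Cache": (256 * 1024, 1),
}
for name, (kib, count) in levels.items():
    print(f"{name:>14}: {kib * count / 1024:8.1f} MiB total")
# Vector registers alone total 152 MiB -- more than the combined 32 MiB of L2.
```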
AMD Instinct MI300X GPU Spatial Partitioning
Flexible partitions for bare metal and virtualization:
- All XCDs operate together to present the GPU as a single processor.
- The GPU can be spatially partitioned into as many partitions as there are XCDs; XCDs can be grouped to appear as multiple GPUs.
- Instinct MI300X supports Single Root I/O Virtualization (SR-IOV), with up to 64 VFs per platform and dynamic re-partitioning. SR-IOV provides isolation of Virtual Functions (VFs) and protects a VF from accessing information or state of the Physical Function (PF) or another VF.
- Each XCD can operate on a separate stream of input queries for inference.
- NUMA partitions per socket (NPS) exploit data locality for partitions.
[Diagram: single, two, four, and eight compute partitions; NPS1 and NPS4 memory partitions. A sketch of driving these modes from software follows below.]
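In practice, partition modes are selected from host software. A hedged sketch using ROCm's amd-smi CLI from Python; the flag names and mode strings (CPX for one compute partition per XCD, NPS4 for four memory partitions) reflect recent ROCm releases but are assumptions here — check `amd-smi set --help` on your installation:

```python
# Hedged sketch: requesting MI300X partition modes via the amd-smi CLI.
# Mode names (SPX/CPX, NPS1/NPS4) and flag spellings are assumed from recent
# ROCm documentation and may differ across versions.
import subprocess

def set_partitioning(gpu: int, compute: str, memory: str) -> None:
    """Set compute partitioning (e.g. 'SPX' or 'CPX') and NUMA-per-socket
    memory partitioning (e.g. 'NPS1' or 'NPS4') on one GPU."""
    subprocess.run(
        ["amd-smi", "set", "--gpu", str(gpu), "--compute-partition", compute],
        check=True)
    subprocess.run(
        ["amd-smi", "set", "--gpu", str(gpu), "--memory-partition", memory],
        check=True)

if __name__ == "__main__":
    # One partition per XCD, with four NUMA memory partitions.
    set_partitioning(gpu=0, compute="CPX", memory="NPS4")
```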
AMD Instinct Platform
- 8x AMD Instinct MI300X accelerators
- Industry-standard design
- 4th Gen AMD Infinity Fabric technology
- Leadership memory capacity
An industry-leading generative AI platform.

The AMD Instinct System Journey
Enhancing system architecture to complement our silicon development:
- MI100 (AMD CDNA, 2020): a scale-up system to support large-model ML.
- MI200 (AMD CDNA 2): OAM to power a new frontier beyond the PCIe form factor.
- MI300 (AMD CDNA 3, 2023): an AI subsystem to serve LLMs.
AMD Instinct MI300X Platform
- 8 AMD Instinct MI300X accelerators
- 10.4 PF BF16/FP16
- 1.5 TB HBM3
- 896 GB/s Infinity Fabric bandwidth
- Industry-standard design
An industry-leading generative AI platform.

MI300X Infinity Platform
- Direct connectivity for 8 OAMs via AMD Infinity Fabric
- Seven bi-directional links at 128 GB/s each (7 x 128 GB/s = 896 GB/s of Infinity Fabric bandwidth per GPU)
- PCIe Gen 5 x16 per OAM for server connectivity and I/O
- 192 GB HBM for RDMA

MI300X Infinity Platform
- UBB 2.0: open standards, faster deployment, seamless datacenter integration
- Security: SPDM attestation and measurement
- RAS: full-chip ECC memory, page retirement, page avoidance
- Telemetry: Redfish, logs, notifications (see the sketch below)
- Firmware management: Redfish PLDM bundles, redundancy
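Because the UBB exposes standard Redfish, platform telemetry can be pulled over plain HTTP. A hedged sketch — the BMC address and credentials are placeholders, and only the top-level /redfish/v1/Chassis collection is mandated by the DMTF standard; the exact resources carrying accelerator telemetry vary by platform vendor:

```python
# Hedged sketch: enumerating Redfish chassis resources on a UBB 2.0 BMC.
# The BMC URL and credentials below are placeholders, not real endpoints.
import requests

BMC = "https://bmc.example.com"          # hypothetical BMC address
session = requests.Session()
session.auth = ("admin", "password")     # placeholder credentials
session.verify = False                   # lab only; use proper TLS in production

# /redfish/v1/Chassis is a standard DMTF Redfish collection.
resp = session.get(f"{BMC}/redfish/v1/Chassis", timeout=10)
resp.raise_for_status()
for member in resp.json().get("Members", []):
    print(member["@odata.id"])
```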
MI300X Industry Standard OCP Server Designs

Server | Baseboard, form factor | CPUs | GPUs
Dell PowerEdge XE9680 | UBB 2.0, 6U | Dual CPUs with up to 56 cores per processor | 8x AMD Instinct MI300X accelerators
GIGABYTE G593-ZX1-AAZ1 | UBB 2.0, 5U | Dual AMD EPYC 9004 Series processors (with AMD 3D V-Cache technology) | 8x AMD Instinct MI300X accelerators
HPE Cray SC XD675 | UBB 2.0, 8U | 2x AMD EPYC up to 400W | 8x AMD Instinct MI300X accelerators
Lenovo ThinkSystem SR685a V3 rack server | UBB 2.0, 8U | 2x 4th Gen AMD EPYC processors | 8x AMD Instinct MI300X accelerators
Supermicro AS-8125GS-TNMR2 server | UBB 2.0, 8U | Dual AMD EPYC 9004 Series processors | 8x AMD Instinct MI300X accelerators

Open Software Ecosystem
[Stack diagram: AI frameworks (expanded features and support); libraries, compilers and tools, and runtime (expanded GenAI optimizations); AMD GPUs; expanded developer support.]
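One practical consequence of this stack: ROCm builds of PyTorch keep the familiar torch.cuda API, so CUDA-style scripts typically run unchanged on Instinct GPUs. A minimal check, assuming a ROCm PyTorch install:

```python
# Minimal sanity check on a ROCm PyTorch build: the torch.cuda namespace
# maps onto AMD GPUs, and torch.version.hip is set instead of torch.version.cuda.
import torch

if torch.cuda.is_available():
    print("GPUs visible:", torch.cuda.device_count())   # 8 on an MI300X platform
    print("device 0:", torch.cuda.get_device_name(0))
    print("HIP runtime:", torch.version.hip)
    x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    print((x @ x).shape)                                 # matmul runs on the GPU
```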
Leadership Performance for Generative AI
[Chart: token generation throughput, AMD Instinct MI300X vs. Nvidia H100, ROCm 6 software.]
- Meta Llama-3 70B: 1.3x (8x GPU)
- Mistral-7B: 1.2x (1x GPU)
See endnotes MI300-53 and MI300-54.
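The metric behind these bars is generated tokens per second. A simplified sketch of the measurement — the `generate` callable is a placeholder for the actual model call (e.g. a vLLM or transformers generation), not AMD's benchmark harness; per the endnotes, the tests use batch size 1 with 2048 input tokens and 128 or 2048 output tokens:

```python
# Simplified token-throughput measurement; `generate` is a placeholder for
# the real model call, not AMD's harness.
import time
from typing import Callable

def token_throughput(generate: Callable[[int], None],
                     output_tokens: int = 128) -> float:
    """Return output tokens per second for one generation request."""
    start = time.perf_counter()
    generate(output_tokens)
    return output_tokens / (time.perf_counter() - start)
```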
World-Class Training Performance
[Chart: model fine-tuning throughput (tokens/sec), MPT, model size 30B, single server with 8x GPUs: the AMD Instinct MI300X platform matches the Nvidia H100 HGX (1x).]
Results may vary. See endnote MI300-40.

Summary
- Datacenter APU and accelerator architecture: optimal efficiency through unified memory and AI-focused data formats.
- 8-stack HBM3 memory system: 192 GB per MI300X and 1.5 TB per platform, enabling 680B-parameter LLM inference.
- Instinct platform architecture: a modular AI subsystem enabling fast industry adoption.
- Industry-leading AI performance.

Endnotes

MI300-53: Testing completed on 05/28/2024 by AMD performance lab measuring text-generation throughput using the Mistral-7B model. Tests were performed using batch size 1, 2048 input tokens, and 2048 output tokens for Mistral-7B. Configurations: 2P AMD EPYC 9534 64-core processor based production server with 8x AMD Instinct MI300X (192GB, 750W) GPUs, Ubuntu 22.04.1, and ROCm 6.1.1, vs. 2P Intel Xeon Platinum 8468 48-core processor based production server with 8x Nvidia Hopper H100 (80GB, 700W) GPUs, Ubuntu 22.04.3, and CUDA 12.2. Only 1 GPU on each system was used in this test. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.

MI300-54: Testing completed on 05/28/2024 by AMD performance lab measuring text-generation throughput with Llama3-70B, using batch size 1, 2048 input tokens, and 128 output tokens for each system. Configurations: 2P AMD EPYC 9534 64-core processor based production server with 8x AMD Instinct MI300X (192GB, 750W) GPUs, Ubuntu 22.04.1, and ROCm 6.1.1, vs. 2P Intel Xeon Platinum 8468 48-core processor based production server with 8x Nvidia Hopper H100 (80GB, 700W) GPUs, Ubuntu 22.04.3, and CUDA 12.2. 8 GPUs on each system were used in this test. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.

MI300-40: Testing completed 11/28/2023 by AMD performance lab using MosaicML llm-foundry to fine-tune the MPT-30B model for 2 epochs using the MosaicML instruct-v3 dataset and a max sequence length of 8192 tokens, using a custom docker container for each system. Configurations: 2P Intel Xeon Platinum 8480C CPU server with 8x AMD Instinct MI300X (192GB, 750W) GPUs, ROCm 6.0 pre-release, PyTorch 2.0.1, MosaicML llm-foundry pre-release, Ubuntu 22.04.2, vs. an Nvidia DGX H100 with 2x Intel Xeon Platinum 8480CL processors, 8x Nvidia H100 (80GB, 700W) GPUs, CUDA 11.8, PyTorch 2.0.1, MosaicML llm-foundry, Ubuntu 22.04.3. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.

MI300-34: Token generation throughput using DeepSpeed Inference with the Bloom-176B model, an input sequence length of 1948 tokens, an output sequence length of 100 tokens, and a batch size tuned to yield the highest throughput on each system; comparison based on AMD internal testing using a custom docker container for each system as of 11/17/2023. Configurations: 2P Intel Xeon Platinum 8480C CPU powered server with 8x AMD Instinct MI300X 192GB 750W GPUs, pre-release build of ROCm 6.0, Ubuntu 22.04.2, vs. an Nvidia DGX H100 with 2x Intel Xeon Platinum 8480CL processors, 8x Nvidia H100 80GB 700W GPUs, CUDA 12.0, Ubuntu 22.04.3. 8 GPUs on each system were used in this test. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.

MI300-39: Number of simultaneous text-generating copies of the Llama2-70B chat model, using vLLM; comparison using a custom docker container for each system, based on AMD internal testing as of 11/26/2023. Configurations: 2P Intel Xeon Platinum 8480C CPU server with 8x AMD Instinct MI300X (192GB, 750W) GPUs, ROCm 6.0 pre-release, PyTorch 2.2.0, vLLM for ROCm, Ubuntu 22.04.2, vs. an Nvidia DGX H100 with 2x Intel Xeon Platinum 8480CL processors, 8x Nvidia H100 (80GB, 700W) GPUs, CUDA 12.1, PyTorch 2.1.0, vLLM v0.2.2 (most recent), Ubuntu 22.04.3. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.

MI300-42: Measurements by internal AMD Performance Labs as of December 1, 2023, on current specifications and/or internal engineering calculations. Inference and training Large Language Model (LLM) run comparisons with FP16 precision to determine the largest LLM size expected to run on the 8x AMD Instinct MI300X (192GB) accelerator platform and on the Nvidia 8x H100 (80GB) DGX platform. Calculated estimates based on GPU-only memory size versus memory required by the model at defined parameters, plus 10% overhead. Calculations rely on published and sometimes preliminary model memory sizes. Multiple LLMs and parameter sizes were analyzed. Max size determined by the memory capacity of the 8x platform. Configurations: 8x AMD Instinct MI300X (192GB HBM3, OAM module) 750W accelerator at 2,100 MHz peak boost engine clock, designed with 3rd Gen AMD CDNA 3 5nm FinFET process technology, vs. 8x Nvidia HGX H100 (80GB HBM3, SXM5) platform (Nvidia memory specification at https:/ ).

Results for inference (GPUs required, calculated):
Model | MI300X GPUs | H100 GPUs
Largest for 8x H100: Gopher DeepMind (290B) | 4 | 8
Largest for 8x MI300X: PaLM-1 (680B) | 8 | 19

Results for training (GPUs required, calculated):
Model | MI300X GPUs | H100 GPUs
Largest for 8x H100: Mosaic MPT-30B | 4 | 8
Largest for 8x MI300X: Mosaic MPT-70B | 7 | 16

Assumptions: FP16 datatype, batch size 1; memory needed for the model = 2 bytes per parameter; memory needed for activations and others = +10%. Actual maximum LLM parameter size that can run on each platform may vary upon performan
ce testing with physical servers.

Disclaimer and Attribution
The information contained herein is for informational purposes only and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions, and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD products are as set forth in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale. GD-18.
© 2024 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD CDNA, AMD Instinct, AMD ROCm, EPYC, AMD Infinity Cache, AMD Infinity Fabric and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG Corporation. Other product names used in this publication are for identification purposes only and may be trademarks of their respective owners. PyTorch, the PyTorch logo and any related marks are trademarks of The Linux Foundation. TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc. Certain AMD technologies may require third-party enablement or activation. Supported features may vary by operating system. Please confirm with the system manufacturer for specific features. No technology or product can be completely secure.
58、oses only and may be trademarks of their respective owners.PyTorch,the PyTorch logo and any related marks are trademarks of The Linux Foundation.TensorFlow,the TensorFlow logo and any related marks are trademarks of Google Inc.Certain AMD technologies may require third-party enablement or activation.Supported features may vary by operating system.Please confirm with the system manufacturer for specific features.No technology or product can be completely secure.