Next Gen MTIA - Recommendation Inference Accelerator
Mahesh Maddury, Pankaj Kansal, Olivia Wu

Acknowledgement
Prahlad Venkatapuram, Linda Cheng, Willie Wei, Harikrishna Reddy, Nagesh Sreedhara, Dheepak Jayaraman, Jae Won Seo, Soheil Gharahi, Anurag Jain, Vinod Bhat, Krishna Datla, Srinu Sunkara, Surendra Rajupalem, Inder Bhasin, Krishna Noru, Adam Hutchin, Rakesh Nattoji, Brian Ko, Kevin Quinn, Himanshu Sanghavi, Yogesh Agrawal, Ushma Bharucha, Harshitha Pilla, Adrian Lewis, Eric Qin, Duckhwan Kim, Giri Prasanna M, Bob Dreyer, Amin Firoozshahian, Gurdeepak Grewal, Sameer Shripad, Krishna Thottempudi, Ramakoteswara Rao Chennamsetti, K Rajesh Jagannath, Xiao He, Tuo Wang, Keerthi Manthati, Eric Wang, Kalyan Chakravarthy Kurapati, Joyce Wu, Vaibhav Desai, Sai Ma, Tongji Wang, Aadesh Karki, Teja Kala, Bhargav Alluri, Hao Jin, Eda Sahin, Arun Karthik Sengottuvel, Satyanarayana Arvapalli, Kaushal Gandhi, Samantak Gangopadhyay, Andrew Xi, Siddharth Bhargav, Michael Wrighton, Hengsheng Chen, Bala Balasubramanian, Harish Aepala, Ram Gupta, Aarti Ramesh, Ejaz Mohammed, Sivanandh Ramadass, Anshul Bansal, Gautham Suryanarayana, Changru Liu, Michael Xiong, Pritam Chopda, Joseph Saiju Thirunilathu, Gaurav Sudheer Bhangle, Shayan Ebrahimi, Ferze Patawaran, Lei Gao, Sumanth Batchu, Brian Dodds, Pritesh Modi, Bhasker Jakka, Anmol Shridhar Poojary, Ashwin Narasimha, Mick Baikie, Bradley Hall, Greeshma Sasikumar, Lara Khamisy, Lawrence Jacob, Mrinalini Padmanabhan, Patrick Hanley, Shayan Raisi, Erum Kazi, Athul Asokan Thulasi, Krish Skandakumaran, Pavan Shetty, Sanjay Desai, Indu Kalayraman, Karthik Narayanan, Roman Levenstein, Kaustubh Gondkar, Jack Montgomery, Sameer Abu Asal, Joe Shajrawi, Kunming Ho, Saritha Dwarakapuram, Mike Tsai, Hangchen Yu, Richard Li, Xiaolong Xie, Siyan Lin, Jordan Fix, Junhan Hu, Ilia Cherniavskii, Aleksandr Malyshev, Mandeep Singh Grang, Garima Gupta, Giancarlo Colmenares, Wang Xu, Cyrus Daruwala, Site Cao, Jun Luo, Mitch Gamburg, Ryan Xia, Qiming Lu, Jay Banerjee, Raviteja Chinta, Joel Coburn, Cao Gao, Kimble Houck, Aravind Sukumaran-Rajam, Andrey Kareev, Satish Nadathur, Jiyuan Zhang, Nasibeh Nasiri, Shintaro Iwasaki, Amy Yang, Jiawei Zhang, Jongsoo Park, Summer Deng, Fangran Xu, Eric Ouyang, Gautam Patil

Motivation

Meta Inference Workload Trends
- Deep Learning Recommendation Models (DLRM) are increasing in model size (GB) and complexity (GFLOPS)
- Emergence of GenAI
  with LLMs and a wide array of models across different use cases

[Chart: inference model trends, 2020 H2 to 2024 H2: model complexity (GFLOPS/sample) and inference model size (GB), total and embeddings]

- Models evolved beyond SparseNN for better accuracy and user experience

GPU Deployment Challenges
- Peak performance is not always equal to effective performance
- Large deployments can be resource intensive
- Capacity constraints due to GenAI demand

Next Gen MTIA Development Goals
- Improve perf/TCO and perf/W compared to the previous generation
- Handle models across multiple Meta services efficiently
- Developer efficiency, to quickly reach high-volume deployments

Features

PyTorch Eager Mode Support
- New hardware: job launch time 1 us, completed-job replacement time 0.5 us

Integer Dynamic Quantization
- Hardware-based tensor quantization
- Delivers accuracy comparable to FP32

Gen-over-Gen Performance
- Increased GEMM TOPs by 3.5x, to 177 TFLOPS BF16
- Sparse matrix support with 2x TFLOPS
- ANS weight decompression with a 50% compression ratio and 20% better memory-to-compute tensor transfer performance
- Balanced compute, memory and data transfer to achieve over 80% utilization

TBE (Table Batch Embedding) Optimization
- HW optimization for download and prefetch of embedding indices
- 2-3x faster runtime compared to the previous generation
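The ANS weight compression referenced above is asymmetric numeral systems coding. As intuition for why it is lossless, here is a toy rANS variant with an unbounded-precision integer state and no renormalization; the frequencies, API, and layout are illustrative assumptions, not the hardware's actual format.

```python
from collections import Counter

def build_model(data):
    """Symbol frequencies and cumulative counts; total M equals len(data)."""
    freq = Counter(data)
    cum, running = {}, 0
    for s in sorted(freq):
        cum[s] = running
        running += freq[s]
    return freq, cum, running

def rans_encode(data, freq, cum, M):
    """Fold symbols into one big integer state (toy: no renormalization)."""
    x = 1
    for s in data:
        x = (x // freq[s]) * M + cum[s] + (x % freq[s])
    return x

def rans_decode(x, freq, cum, M, n):
    """Pop n symbols back out; they emerge in reverse encode order."""
    syms = sorted(freq)
    out = []
    for _ in range(n):
        slot = x % M
        s = next(t for t in reversed(syms) if cum[t] <= slot)
        x = freq[s] * (x // M) + slot - cum[s]
        out.append(s)
    return out[::-1]
```

A production coder keeps the state in a machine word with periodic renormalization; the point here is only that frequent symbols grow the state less, which is what yields near-entropy compression on skewed weight distributions.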
Accelerator Architecture

Specification
- Technology: TSMC 5 nm
- Frequency: 1.35 GHz
- Gate count: 2.35B gates, 103M flops
- Dimensions: 25.6 x 16.4 mm (421 mm2)
- Package: 50 mm x 40 mm
- TDP: 90 W
- GEMM TOPS: 354 (INT8), 177 (FP16), 2x with sparsity
- Memory: 128 GB LPDDR5-6400, 204.8 GB/s BW

Architecture Overview
- Control subsystem and host interface
- 8x8 grid of processing elements connected via a custom mesh network
- 256 MB of on-chip SRAM, distributed across 4 sides, with 2.7 TB/s BW
- 16 channels of LPDDR5 memory on 4 sides, up to 128 GB capacity with 204.8 GB/s BW
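A back-of-envelope machine-balance check from the specification above shows why the design pairs a large SRAM with LPDDR5: dividing dense FP16 GEMM peak by memory bandwidth gives the arithmetic intensity an operator needs to stay compute-bound (numbers from the table; this is an estimate only).

```python
peak_fp16 = 177e12    # FLOP/s, dense FP16/BF16 GEMM peak
dram_bw = 204.8e9     # B/s, 16-channel LPDDR5
sram_bw = 2.7e12      # B/s, on-chip SRAM

# FLOPs that must be performed per byte fetched to avoid being bandwidth-bound
balance_dram = peak_fp16 / dram_bw   # ~864 FLOP/byte from LPDDR5
balance_sram = peak_fp16 / sram_bw   # ~66 FLOP/byte from SRAM
print(f"DRAM balance: {balance_dram:.0f} FLOP/B, SRAM balance: {balance_sram:.0f} FLOP/B")
```

Operators below roughly 864 FLOP/byte out of DRAM are bandwidth-bound, which is why later slides emphasize blocking data in the 256 MB on-chip SRAM, where the required intensity drops to about 66 FLOP/byte.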
Host Interface & Control Core

Host Interface
- Gen5 x8, 32 GB/s
- 4 MB PCIe descriptor SRAM for fast descriptor fetch

Control Core Subsystem
- Quad-core scalar RISC-V
- 8 MB L2 cache, 4 MB context SRAM for fast workload distribution

Network on Chip (NoC)

Data NoC
- Increased PE-to-memory-subsystem BW by 3.4x, to 2.76 TB/s
- Non-blocking, with QoS support
- Multicast reads from PEs

Config NoC
- Non-blocking, with QoS support
- Subsystem-level broadcast
- Selective multicast for eager mode support
Processing Elements (PE) Overview
- Dual RISC-V cores, one scalar, one with a vector extension
- Command Processor (CP) to coordinate execution of functional blocks in the PE
- Fixed-function units to accelerate:
  - Matrix multiplication with sparsity support (DPE)
  - Non-linear functions (SE)
  - Data movement (MLU)
  - Dynamic quantization (RE)
  - Weight decompression (SDMA)
  - Eager mode (WQE)

PE Compute
- Dot Product Engine (DPE) delivers 2.77 TF/s (FP16) per PE
- Added sparse matrix support, providing 5.54 TF/s (FP16) in sparse mode
- Widened data paths in the MLU, RE and SE to match DPE performance
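The 2x sparse-mode throughput implies the DPE skips structured zeros. The deck does not state the sparsity pattern, so the sketch below assumes a 2:4 structured scheme (2 nonzeros per group of 4) purely for illustration: weights are stored at half density plus 2-bit position metadata.

```python
import numpy as np

def compress_2_4(w):
    """Keep the 2 largest-magnitude entries per group of 4 (assumed pattern)."""
    g = w.reshape(-1, 4)
    keep = np.sort(np.argsort(np.abs(g), axis=1)[:, 2:], axis=1)
    vals = np.take_along_axis(g, keep, axis=1)   # 50% of the values
    return vals, keep.astype(np.uint8)           # plus 2-bit indices as metadata

def decompress_2_4(vals, keep, shape):
    """Re-expand to dense; real hardware would instead skip the zeros."""
    g = np.zeros((vals.shape[0], 4), dtype=vals.dtype)
    np.put_along_axis(g, keep.astype(np.int64), vals, axis=1)
    return g.reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
vals, keep = compress_2_4(w)
w_sp = decompress_2_4(vals, keep, w.shape)
x = rng.standard_normal((64, 16)).astype(np.float32)
y = w_sp @ x   # a sparse-aware DPE would compute this at ~2x dense throughput
```

The software model re-expands to dense for checking; the hardware win comes from never multiplying the zeroed half.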
PE Memory Subsystem
- 384 KB of PE local memory to support larger and more complex workloads
- Index-aligned DMA support to speed up index prefetch
- Increased PE local SRAM and fabric interface BW to balance compute and memory access and provide over 80% PE utilization

Integer Dynamic Quantization
- Built-in hardware to offload the task of adjusting quantization parameters in real time
- Collects min/max per batch during run time
- Supports rowwise quantization
- Achieved over 99.95% accuracy compared to the baseline FP32 result
- Enables channel-wise symmetric dynamic quantization for FC operators
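A software model of the rowwise dynamic quantization described above, assuming an asymmetric uint8 scheme with per-row min/max collected at run time; the exact hardware number format is not specified in the deck.

```python
import numpy as np

def quantize_rowwise(x):
    """Asymmetric uint8 quantization with per-row scale/offset from min/max."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-12) / 255.0
    q = np.clip(np.round((x - lo) / scale), 0, 255).astype(np.uint8)
    return q, scale.astype(np.float32), lo.astype(np.float32)

def dequantize_rowwise(q, scale, lo):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 256)).astype(np.float32)
q, scale, lo = quantize_rowwise(x)
# rounding error is bounded by half a quantization step per element
err = np.abs(dequantize_rowwise(q, scale, lo) - x).max()
```

Because scale and offset are recomputed per batch, no calibration pass is needed, which is what makes offloading this bookkeeping to dedicated hardware attractive.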
Eager Mode Enhancements
- Added multicast write groups to allow the Control Core to broadcast Eager Mode Work Queue (WQ) descriptors to selected PEs
- Added a Work Queue Engine (WQE) in each PE to DMA WQ request descriptors from the Control Core
- More than 80% reduction in PE job launch time

Hardware Decompression
- Transfer time of large embedding indices over PCIe impacts SLS performance
- Added a Decompression Engine to alleviate PCIe and network congestion
- Supports the RFC 1952 (GZIP) standard, encapsulating RFC 1951 (DEFLATE)
- Supports static and dynamic Huffman-coded blocks
- 4 decompression cores
- Decompression rate up to 25 GB/s
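The RFC 1952/1951 path maps directly onto Python's standard gzip module, which is convenient for producing test vectors for this engine. The workload below (sorted int64 embedding indices) is illustrative; such streams compress well because neighboring values share high-order bytes.

```python
import gzip
import numpy as np

# Embedding indices as they might arrive over PCIe (illustrative workload)
rng = np.random.default_rng(0)
idx = np.sort(rng.integers(0, 10_000_000, size=100_000)).astype(np.int64)

raw = idx.tobytes()
blob = gzip.compress(raw, compresslevel=6)   # RFC 1952 framing around RFC 1951 DEFLATE
back = np.frombuffer(gzip.decompress(blob), dtype=np.int64)

print(f"{len(raw)} -> {len(blob)} bytes ({len(blob) / len(raw):.1%})")
```

Any stream this snippet produces is a valid input for a decoder that implements the two RFCs, including one with hardware static/dynamic Huffman support.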
PE Weight Decompression
- Lossless Asymmetric Numeral Systems (ANS) algorithm
- Achieved close to a 50% compression ratio, improving the on-chip memory footprint and reducing PE-to-NoC read BW
- Data processed at a 32x32 B granularity
- 32 B/cycle decompression rate

Table Batch Embedding (TBE)
- TBE combines tables from separate embedding batch ops into one single table
- Performance: improves runtime by 2-3x over the previous generation
- Hardware features:
  - Index-aligned DMA for faster index prefetch
  - Multiple context unrolling with SW prefetch for indices
  - HW decompression engine to speed up transfer of large embedding indices from the host via PCIe

System Design

Accelerator Module
- Up to 256 GB of LPDDR5 memory, 409.6 GB/s total memory BW
- PCIe CEM FHFL form factor
- 2 MTIAs per module
- Board TDP of 220 W
- 64 GB/s Gen5 PCIe interface (2x Gen5 x8)
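Functionally, the TBE optimization on the slide above fuses many per-table embedding-bag lookups into one kernel over a single stacked weight buffer. A minimal NumPy model follows; the table sizes, sum-pooling mode, and CSR-style bag layout are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
t0 = rng.standard_normal((100, D)).astype(np.float32)  # embedding table 0
t1 = rng.standard_normal((50, D)).astype(np.float32)   # embedding table 1

# TBE-style fusion: stack the tables once and rebase indices into the stack
weights = np.concatenate([t0, t1])
base = np.array([0, 100])            # first row of each table inside `weights`

def tbe_sum(weights, base, table_ids, indices, offsets):
    """One pooled-sum kernel over all tables (CSR-style bags)."""
    rebased = base[table_ids] + indices
    return np.stack([weights[rebased[s:e]].sum(axis=0)
                     for s, e in zip(offsets[:-1], offsets[1:])])

table_ids = np.array([0, 0, 0, 1, 1])   # which table each lookup hits
indices = np.array([3, 7, 9, 4, 12])    # row within that table
offsets = np.array([0, 3, 5])           # bag boundaries

pooled = tbe_sum(weights, base, table_ids, indices, offsets)
```

One fused kernel replaces a launch per table, which is what lets the index-aligned DMA and prefetch hardware stream all indices in a single pass.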
System Topology
- 72 MTIA ASICs per rack
- 12 modules per chassis
- 3 chassis per rack
- Deployed in data centers since H1 2024

Performance

Model Optimization
- Continuous improvement in model performance
- Low-complexity (LC) models have an out-of-the-box advantage with the large on-chip SRAM
- High-complexity (HC) models require more optimization to effectively block data in SRAM and realize higher effective FLOPS
- More than 2x performance improvement over 4-6 months

Model Performance

Q&A
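The "block data in SRAM" step that high-complexity models need is, at its core, tiled matrix multiplication: choose a tile size so the working blocks of A, B and C stay resident in the 256 MB on-chip SRAM and are reused many times per DRAM fetch. A schematic sketch, with an illustrative tile size:

```python
import numpy as np

def blocked_matmul(a, b, tile=128):
    """Tile the GEMM so each working block can stay resident in on-chip SRAM."""
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):          # rows of C reused across the p loop
        for j in range(0, n, tile):      # cols of C
            for p in range(0, k, tile):  # reduction dimension
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((200, 160))
b = rng.standard_normal((160, 96))
c = blocked_matmul(a, b, tile=64)
```

Each block of A and B is fetched from DRAM once per tile pass but used for `tile` multiply-accumulates per element, raising arithmetic intensity toward the compute-bound region estimated earlier.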