Next Gen MTIA - Recommendation Inference Accelerator
Mahesh Maddury, Pankaj Kansal, Olivia Wu

Acknowledgement
Prahlad Venkatapuram, Linda Cheng, Willie Wei, Harikrishna Reddy, Nagesh Sreedhara, Dheepak Jayaraman, Jae Won Seo, Soheil Gharahi, Anurag Jain, Vinod Bhat, Krishna Datla, Srinu Sunkara, Surendra Rajupalem, Inder Bhasin, Krishna Noru, Adam Hutchin, Rakesh Nattoji, Brian Ko, Kevin Quinn, Himanshu Sanghavi, Yogesh Agrawal, Ushma Bharucha, Harshitha Pilla, Adrian Lewis, Eric Qin, Duckhwan Kim, Giri Prasanna M, Bob Dreyer, Amin Firoozshahian, Gurdeepak Grewal, Sameer Shripad, Krishna Thottempudi, Ramakoteswara Rao Chennamsetti, K Rajesh Jagannath, Xiao He, Tuo Wang, Keerthi Manthati, Eric Wang, Kalyan Chakravarthy Kurapati, Joyce Wu, Vaibhav Desai, Sai Ma, Tongji Wang, Aadesh Karki, Teja Kala, Bhargav Alluri, Hao Jin, Eda Sahin, Arun Karthik Sengottuvel, Satyanarayana Arvapalli, Kaushal Gandhi, Samantak Gangopadhyay, Andrew Xi, Siddharth Bhargav, Michael Wrighton, Hengsheng Chen, Bala Balasubramanian, Harish Aepala, Ram Gupta, Aarti Ramesh, Ejaz Mohammed, Sivanandh Ramadass, Anshul Bansal, Gautham Suryanarayana, Changru Liu, Michael Xiong, Pritam Chopda, Joseph Saiju Thirunilathu, Gaurav Sudheer Bhangle, Shayan Ebrahimi, Ferze Patawaran, Lei Gao, Sumanth Batchu, Brian Dodds, Pritesh Modi, Bhasker Jakka, Anmol Shridhar Poojary, Ashwin Narasimha, Mick Baikie, Bradley Hall, Greeshma Sasikumar, Lara Khamisy, Lawrence Jacob, Mrinalini Padmanabhan, Patrick Hanley, Shayan Raisi, Erum Kazi, Athul Asokan Thulasi, Krish Skandakumaran, Pavan Shetty, Sanjay Desai, Indu Kalayraman, Karthik Narayanan, Roman Levenstein, Kaustubh Gondkar, Jack Montgomery, Sameer Abu Asal, Joe Shajrawi, Kunming Ho, Saritha Dwarakapuram, Mike Tsai, Hangchen Yu, Richard Li, Xiaolong Xie, Siyan Lin, Jordan Fix, Junhan Hu, Ilia Cherniavskii, Aleksandr Malyshev, Mandeep Singh Grang, Garima Gupta, Giancarlo Colmenares, Wang Xu, Cyrus Daruwala, Site Cao, Jun Luo, Mitch Gamburg, Ryan Xia, Qiming Lu, Jay Banerjee, Raviteja Chinta, Joel Coburn, Cao Gao, Kimble Houck, Aravind Sukumaran-Rajam, Andrey Kareev, Satish Nadathur, Jiyuan Zhang, Nasibeh Nasiri, Shintaro Iwasaki, Amy Yang, Jiawei Zhang, Jongsoo Park, Summer Deng, Fangran Xu, Eric Ouyang, Gautam Patil

Motivation

Meta Inference Workload Trends
- Deep Learning Recommendation Models (DLRM) are increasing in model size (GB) and complexity (GFLOPS)
- Emergence of GenAI
  with LLMs and a wide array of models across different use cases

[Chart: inference model trends, 2020 H2 to 2024 H2: model complexity (GFLOPS/sample) and inference model size (GB), total and embeddings]

- Models evolved beyond SparseNN for better accuracy and user experience

GPU Deployment Challenges
- Peak performance is not always equal to effective performance
- Large deployments can be resource intensive
- Capacity constraints due to GenAI demand

Next Gen MTIA Development Goals
- Improve perf/TCO and perf/W compared to the previous generation
- Handle models across multiple Meta services efficiently
- Developer efficiency, to quickly reach high-volume deployments

Features

PyTorch Eager Mode Support
- New hardware: job launch time 1 us, completed-job replacement time 0.5 us

Integer Dynamic Quantization
- Hardware-based tensor quantization
- Delivers accuracy comparable to FP32

Gen-over-Gen Performance
- Increased GEMM TOPs by 3.5x, to 177 TFLOPS BF16
- Sparse matrix support with 2x TFLOPS
- ANS weight decompression with a 50% compression ratio and 20% better memory-to-compute tensor transfer performance
- Balanced compute, memory and data transfer to achieve over 80% utilization

TBE (Table Batch Embedding) Optimization
- HW optimization for download and prefetch of embedding indices
- 2-3x faster runtime compared to the previous generation
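The ANS weight compression referenced above is asymmetric numeral systems coding. As intuition for why it is lossless, here is a toy rANS variant with an unbounded-precision integer state and no renormalization; the frequencies, API, and layout are illustrative assumptions, not the hardware's actual format.

```python
from collections import Counter

def build_model(data):
    """Symbol frequencies and cumulative counts; total M equals len(data)."""
    freq = Counter(data)
    cum, running = {}, 0
    for s in sorted(freq):
        cum[s] = running
        running += freq[s]
    return freq, cum, running

def rans_encode(data, freq, cum, M):
    """Fold symbols into one big integer state (toy: no renormalization)."""
    x = 1
    for s in data:
        x = (x // freq[s]) * M + cum[s] + (x % freq[s])
    return x

def rans_decode(x, freq, cum, M, n):
    """Pop n symbols back out; they emerge in reverse encode order."""
    syms = sorted(freq)
    out = []
    for _ in range(n):
        slot = x % M
        s = next(t for t in reversed(syms) if cum[t] <= slot)
        x = freq[s] * (x // M) + slot - cum[s]
        out.append(s)
    return out[::-1]
```

A production coder keeps the state in a machine word with periodic renormalization; the point here is only that frequent symbols grow the state less, which is what yields near-entropy compression on skewed weight distributions.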
Accelerator Architecture

Specification
- Technology: TSMC 5 nm
- Frequency: 1.35 GHz
- Gate count: 2.35B gates, 103M flops
- Dimensions: 25.6 x 16.4 mm (421 mm2)
- Package: 50 mm x 40 mm
- TDP: 90 W
- GEMM TOPS: 354 (INT8), 177 (FP16), 2x with sparsity
- Memory: 128 GB LPDDR5-6400, 204.8 GB/s BW

Architecture Overview
- Control subsystem and host interface
- 8x8 grid of processing elements connected via a custom mesh network
- 256 MB of on-chip SRAM, distributed across 4 sides, with 2.7 TB/s BW
- 16 channels of LPDDR5 memory on 4 sides, up to 128 GB capacity with 204.8 GB/s BW
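A back-of-envelope machine-balance check from the specification above shows why the design pairs a large SRAM with LPDDR5: dividing dense FP16 GEMM peak by memory bandwidth gives the arithmetic intensity an operator needs to stay compute-bound (numbers from the table; this is an estimate only).

```python
peak_fp16 = 177e12    # FLOP/s, dense FP16/BF16 GEMM peak
dram_bw = 204.8e9     # B/s, 16-channel LPDDR5
sram_bw = 2.7e12      # B/s, on-chip SRAM

# FLOPs that must be performed per byte fetched to avoid being bandwidth-bound
balance_dram = peak_fp16 / dram_bw   # ~864 FLOP/byte from LPDDR5
balance_sram = peak_fp16 / sram_bw   # ~66 FLOP/byte from SRAM
print(f"DRAM balance: {balance_dram:.0f} FLOP/B, SRAM balance: {balance_sram:.0f} FLOP/B")
```

Operators below roughly 864 FLOP/byte out of DRAM are bandwidth-bound, which is why later slides emphasize blocking data in the 256 MB on-chip SRAM, where the required intensity drops to about 66 FLOP/byte.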
Host Interface & Control Core

Host Interface
- Gen5 x8, 32 GB/s
- 4 MB PCIe descriptor SRAM for fast descriptor fetch

Control Core Subsystem
- Quad-core scalar RISC-V
- 8 MB L2 cache, 4 MB context SRAM for fast workload distribution

Network on Chip (NoC)

Data NoC
- Increased PE-to-memory-subsystem BW by 3.4x, to 2.76 TB/s
- Non-blocking, with QoS support
- Multicast reads from PEs

Config NoC
- Non-blocking, with QoS support
- Subsystem-level broadcast
- Selective multicast for eager mode support
Processing Elements (PE) Overview
- Dual RISC-V cores, one scalar, one with a vector extension
- Command Processor (CP) to coordinate execution of functional blocks in the PE
- Fixed-function units to accelerate:
  - Matrix multiplication with sparsity support (DPE)
  - Non-linear functions (SE)
  - Data movement (MLU)
  - Dynamic quantization (RE)
  - Weight decompression (SDMA)
  - Eager mode (WQE)

PE Compute
- Dot Product Engine (DPE) delivers 2.77 TF/s (FP16) per PE
- Added sparse matrix support, providing 5.54 TF/s (FP16) in sparse mode
- Widened data paths in the MLU, RE and SE to match DPE performance
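The 2x sparse-mode throughput implies the DPE skips structured zeros. The deck does not state the sparsity pattern, so the sketch below assumes a 2:4 structured scheme (2 nonzeros per group of 4) purely for illustration: weights are stored at half density plus 2-bit position metadata.

```python
import numpy as np

def compress_2_4(w):
    """Keep the 2 largest-magnitude entries per group of 4 (assumed pattern)."""
    g = w.reshape(-1, 4)
    keep = np.sort(np.argsort(np.abs(g), axis=1)[:, 2:], axis=1)
    vals = np.take_along_axis(g, keep, axis=1)   # 50% of the values
    return vals, keep.astype(np.uint8)           # plus 2-bit indices as metadata

def decompress_2_4(vals, keep, shape):
    """Re-expand to dense; real hardware would instead skip the zeros."""
    g = np.zeros((vals.shape[0], 4), dtype=vals.dtype)
    np.put_along_axis(g, keep.astype(np.int64), vals, axis=1)
    return g.reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
vals, keep = compress_2_4(w)
w_sp = decompress_2_4(vals, keep, w.shape)
x = rng.standard_normal((64, 16)).astype(np.float32)
y = w_sp @ x   # a sparse-aware DPE would compute this at ~2x dense throughput
```

The software model re-expands to dense for checking; the hardware win comes from never multiplying the zeroed half.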
PE Memory Subsystem
- 384 KB of PE local memory to support larger and more complex workloads
- Index-aligned DMA support to speed up index prefetch
- Increased PE local SRAM and fabric interface BW to balance compute and memory access and provide over 80% PE utilization

Integer Dynamic Quantization
- Built-in hardware to offload the task of adjusting quantization parameters in real time
- Collects min/max per batch during run time
- Supports rowwise quantization
- Achieved over 99.95% accuracy compared to the baseline FP32 result
- Enables channel-wise symmetric dynamic quantization for FC operators
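A software model of the rowwise dynamic quantization described above, assuming an asymmetric uint8 scheme with per-row min/max collected at run time; the exact hardware number format is not specified in the deck.

```python
import numpy as np

def quantize_rowwise(x):
    """Asymmetric uint8 quantization with per-row scale/offset from min/max."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-12) / 255.0
    q = np.clip(np.round((x - lo) / scale), 0, 255).astype(np.uint8)
    return q, scale.astype(np.float32), lo.astype(np.float32)

def dequantize_rowwise(q, scale, lo):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 256)).astype(np.float32)
q, scale, lo = quantize_rowwise(x)
# rounding error is bounded by half a quantization step per element
err = np.abs(dequantize_rowwise(q, scale, lo) - x).max()
```

Because scale and offset are recomputed per batch, no calibration pass is needed, which is what makes offloading this bookkeeping to dedicated hardware attractive.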
Eager Mode Enhancements
- Added multicast write groups to allow the Control Core to broadcast Eager Mode Work Queue (WQ) descriptors to selected PEs
- Added a Work Queue Engine (WQE) in each PE to DMA WQ request descriptors from the Control Core
- More than 80% reduction in PE job launch time

Hardware Decompression
- Transfer time of large embedding indices over PCIe impacts SLS performance
- Added a Decompression Engine to alleviate PCIe and network congestion
- Supports the RFC 1952 (GZIP) standard, encapsulating RFC 1951 (DEFLATE)
- Supports static and dynamic Huffman-coded blocks
- 4 decompression cores
- Decompression rate up to 25 GB/s
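The RFC 1952/1951 path maps directly onto Python's standard gzip module, which is convenient for producing test vectors for this engine. The workload below (sorted int64 embedding indices) is illustrative; such streams compress well because neighboring values share high-order bytes.

```python
import gzip
import numpy as np

# Embedding indices as they might arrive over PCIe (illustrative workload)
rng = np.random.default_rng(0)
idx = np.sort(rng.integers(0, 10_000_000, size=100_000)).astype(np.int64)

raw = idx.tobytes()
blob = gzip.compress(raw, compresslevel=6)   # RFC 1952 framing around RFC 1951 DEFLATE
back = np.frombuffer(gzip.decompress(blob), dtype=np.int64)

print(f"{len(raw)} -> {len(blob)} bytes ({len(blob) / len(raw):.1%})")
```

Any stream this snippet produces is a valid input for a decoder that implements the two RFCs, including one with hardware static/dynamic Huffman support.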
PE Weight Decompression
- Lossless Asymmetric Numeral Systems (ANS) algorithm
- Achieved close to a 50% compression ratio, improving the on-chip memory footprint and reducing PE-to-NoC read BW
- Data processed at a 32x32 B granularity
- 32 B/cycle decompression rate

Table Batch Embedding (TBE)
- TBE combines tables from separate embedding batch ops into one single table
- Performance: improves runtime by 2-3x over the previous generation
- Hardware features:
  - Index-aligned DMA for faster index prefetch
  - Multiple context unrolling with SW prefetch for indices
  - HW decompression engine to speed up transfer of large embedding indices from the host via PCIe

System Design

Accelerator Module
- Up to 256 GB of LPDDR5 memory, 409.6 GB/s total memory BW
- PCIe CEM FHFL form factor
- 2 MTIAs per module
- Board TDP of 220 W
- 64 GB/s Gen5 PCIe interface (2x Gen5 x8)
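Functionally, the TBE optimization on the slide above fuses many per-table embedding-bag lookups into one kernel over a single stacked weight buffer. A minimal NumPy model follows; the table sizes, sum-pooling mode, and CSR-style bag layout are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
t0 = rng.standard_normal((100, D)).astype(np.float32)  # embedding table 0
t1 = rng.standard_normal((50, D)).astype(np.float32)   # embedding table 1

# TBE-style fusion: stack the tables once and rebase indices into the stack
weights = np.concatenate([t0, t1])
base = np.array([0, 100])            # first row of each table inside `weights`

def tbe_sum(weights, base, table_ids, indices, offsets):
    """One pooled-sum kernel over all tables (CSR-style bags)."""
    rebased = base[table_ids] + indices
    return np.stack([weights[rebased[s:e]].sum(axis=0)
                     for s, e in zip(offsets[:-1], offsets[1:])])

table_ids = np.array([0, 0, 0, 1, 1])   # which table each lookup hits
indices = np.array([3, 7, 9, 4, 12])    # row within that table
offsets = np.array([0, 3, 5])           # bag boundaries

pooled = tbe_sum(weights, base, table_ids, indices, offsets)
```

One fused kernel replaces a launch per table, which is what lets the index-aligned DMA and prefetch hardware stream all indices in a single pass.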
System Topology
- 72 MTIA ASICs per rack
- 12 modules per chassis
- 3 chassis per rack
- Deployed in data centers since H1 2024

Performance

Model Optimization
- Continuous improvement in model performance
- Low-complexity (LC) models have an out-of-the-box advantage with the large on-chip SRAM
- High-complexity (HC) models require more optimization to effectively block data in SRAM and realize higher effective FLOPS
- More than 2x performance improvement over 4-6 months

Model Performance

Q&A
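The "block data in SRAM" step that high-complexity models need is, at its core, tiled matrix multiplication: choose a tile size so the working blocks of A, B and C stay resident in the 256 MB on-chip SRAM and are reused many times per DRAM fetch. A schematic sketch, with an illustrative tile size:

```python
import numpy as np

def blocked_matmul(a, b, tile=128):
    """Tile the GEMM so each working block can stay resident in on-chip SRAM."""
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):          # rows of C reused across the p loop
        for j in range(0, n, tile):      # cols of C
            for p in range(0, k, tile):  # reduction dimension
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((200, 160))
b = rng.standard_normal((160, 96))
c = blocked_matmul(a, b, tile=64)
```

Each block of A and B is fetched from DRAM once per tile pass but used for `tile` multiply-accumulates per element, raising arithmetic intensity toward the compute-bound region estimated earlier.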