Session 16 Overview: Highlighted Chip Releases: Digital and Machine Learning Processors
INVITED INDUSTRY SESSION

This session highlights recent, productized advancements in digital processors and systems for AI acceleration and high-performance networking. Key innovations include novel architectures for AI processing, sophisticated memory architectures, and advanced switch chip designs. The work presented in this session demonstrates significant improvements in AI capabilities, power efficiency, and scalability across various applications, from mobile devices to data centers. The session provides an overview of current state-of-the-art hardware solutions addressing the demands of modern computing workloads.

Session Chair: Alicia Klinefelter, NVIDIA, Durham, NC
Session Co-Chair: Vivek De, Intel, Hillsboro, OR
1:30 PM
16.1 Tomahawk5: 51.2Tb/s 5nm Monolithic Switch Chip for AI/ML Networking
Asad Khamisy, Broadcom, San Jose, CA

In Invited Paper 16.1, Broadcom presents Tomahawk5, a switch chip for data centers and AI environments offering 51.2Tb/s switching capacity. The chip includes 512 lanes of 106.25Gb/s PAM4 SerDes, supports multiple I/O configurations, has a 168MB shared-buffer architecture, and is designed as a monolithic die with 9,352 pins and six ARM processor cores, achieving 450W typical power consumption in actual deployments.

1:55 PM
16.2 RNGD: A 5nm Tensor-Contraction Processor for Power-Efficient Inference on Large Language Models
Sang Min Lee, FuriosaAI, Seoul, Korea

In Invited Paper 16.2, FuriosaAI presents their 5nm inference chip that utilizes the parallelism and data locality inherent in tensor contraction, featuring HBM3 memory, slice redundancy, and a high-bandwidth NoC. It delivers 256TFLOPS/512TOPS for BF16/INT8 with 256MB of on-chip SRAM, and achieves 12.3 queries/s at 99% accuracy for GPT-J.

2:20 PM
16.3 An On-Device Generative AI Focused Neural Processing Unit in 4nm Flagship Mobile SoC with Fan-Out Wafer-Level Package
Jun-Seok Park, Samsung Electronics, Hwaseong, Korea

In Invited Paper 16.3, Samsung introduces their Exynos 2400 SoC, used in the Galaxy S24 smartphone. It features a 4nm EUV process and supports on-device AI with a heterogeneous NPU architecture consisting of vector engines and two types of tensor engines. The 4nm SoC achieves improved AI performance through better heat dissipation using fan-out wafer-level packaging, with the NPU occupying 12mm² and operating at 0.55 to 0.83V and 533 to 1196MHz.

2:45 PM
16.4 SambaNova SN40L: A 5nm 2.5D Dataflow Accelerator with Three Memory Tiers for Trillion Parameter AI
Raghu Prabhakar, SambaNova Systems, Palo Alto, CA

In Invited Paper 16.4, SambaNova discusses their SN40L, a Reconfigurable Dataflow Unit for AI applications. It features a three-tier memory system and dual accelerator dies on a CoWoS-S interposer. Each socket delivers 640 BF16 TFLOPs and includes 520MB of on-chip SRAM. The SN40L's architecture enables improved energy efficiency and performance, capable of hosting models with up to 5 trillion parameters.
16.1 Tomahawk5: 51.2Tb/s 5nm Monolithic Switch Chip for AI/ML Networking
Asad Khamisy¹, Mohan Kalkunte¹, Peter Del Vecchio¹, Yokai Cheok¹, Greg Barsky¹, Karlheinz Muth¹, Reza Sharifi²
¹Broadcom, San Jose, CA; ²Broadcom, Irvine, CA

This paper provides an overview of Broadcom's Tomahawk5 (BCM78900 series, aka TH5) chip and the challenges of implementing a 51.2Tb/s advanced Ethernet switch in a monolithic die. We will describe several technologies that enabled TH5 realization and its advanced capabilities. TH5 is a cutting-edge switch chip designed for next-generation data center and AI networks. This chip represents a significant advancement in switch chip technology, offering 51.2Tb/s of switching capacity, which is double the bandwidth of its predecessors [1]. Integrated into the Tomahawk5 switch is Broadcom's Peregrine SerDes technology, which represents a significant advancement in high-speed data transmission for networking applications. The integrated SerDes cores provide native support for numerous physical connectivity options, including direct attach copper (DAC), front-panel pluggable optics, linear pluggable optics (LPO), and co-packaged optics (CPO). This versatility allows data center operators to choose the most appropriate and cost-effective connectivity solutions for their needs. The chip features 512 lanes of 106.25Gb/s PAM4 SerDes and supports various I/O configurations, including 100GbE, 200GbE, 400GbE, and 800GbE ports. It features a shared-buffer architecture that is dynamically shared across ports and queues. It is designed as a monolithic die with 9,352 pins and includes six ARM processor cores for additional on-chip processing. The chip achieves very low power through power-saving techniques, with a typical power of 450W in actual deployments. One of the key challenges was implementing the chip in a monolithic die, which provides lower cost, power, and latency. This required innovations in microarchitecture, memory, and SerDes design. We will elaborate on some of these innovations below.

Figure 16.1.1 shows the key components of the TH5 chip overlaid on the actual chip floor plan. TH5 implements an output-queued shared-buffer architecture, providing excellent burst absorption and quality-of-service characteristics. The key factors in realizing the chip in a monolithic die are: 1) a shared-memory traffic manager enabling efficient implementation of control structures, 2) a semi-custom packet buffer enabling higher utilization, 3) area-optimized DSP SerDes, 4) a symmetric floor plan enabling focus on optimizing the reduced number of physical blocks, and 5) extensive use of feed-throughs to reduce wiring costs.

We implemented many power-reduction techniques in TH5 to achieve the lowest power possible and enable air-cooled systems. One of these techniques uses multi-bin Adaptive Voltage Scaling (AVS), which equalizes power across different process corners. We implemented 8 AVS bins ranging from 0.7 to 0.7875V with 12.5mV granularity. The core voltage is determined during wafer probe, and the corresponding voltage is stored in an OTP (One-Time Programmable) array. During power-up, this value is output by the chip and can be connected to an external power regulator.
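As an illustration of the multi-bin AVS scheme described above, the short sketch below maps a fused bin index to its target core voltage. The bin count, range, and granularity come from the text; the function name and the OTP read-out interface are hypothetical.

```python
# Minimal sketch of the multi-bin AVS lookup described above.
# The 8 bins spanning 0.70-0.7875V in 12.5mV steps come from the text;
# the function name and OTP read-out interface are hypothetical.

AVS_VMIN_V = 0.700      # lowest AVS bin voltage
AVS_STEP_V = 0.0125     # 12.5mV granularity
AVS_NUM_BINS = 8        # 0.7000, 0.7125, ..., 0.7875V

def avs_bin_to_voltage(bin_index: int) -> float:
    """Return the core voltage requested from the external regulator
    for the AVS bin fused into the OTP array at wafer probe."""
    if not 0 <= bin_index < AVS_NUM_BINS:
        raise ValueError("invalid AVS bin")
    return AVS_VMIN_V + bin_index * AVS_STEP_V

# Example (assuming the slowest process corner is fused to the highest bin):
assert abs(avs_bin_to_voltage(7) - 0.7875) < 1e-9
```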
Figure 16.1.2 shows the typical power consumption of multiple Tomahawk generations. We were able to achieve a 30% power reduction from one generation to the next. This is better than the 15-to-20% process-technology power reduction we observe when mapping the same design through P&R; the additional savings are accomplished via microarchitecture and SerDes technology innovations. TH5 measured 450W of power in typical customer deployments.

One of the key challenges in designing the power delivery network (PDN) is meeting the core voltage specifications at the package BGA ball. The PDN must ensure sufficient current is delivered to the chip to meet the idle-to-full worst-case traffic conditions while keeping the voltage ripple within spec during these load transients. For this device, the core current transition can be as high as 430A within 550ns, which equates to a 782A/µs current slew rate. The PDN design has to account for sufficient copper PCB layers and output decoupling capacitors to meet the electrical and thermal requirements of the system. Using frequency-domain analysis, we determined the core PCB PDN impedance mask required to achieve the maximum allowable voltage droop on the core power supply, as shown in Figure 16.1.3 (top left). Measurement data from the TH5 Silicon Validation Kit (SVK) in Figure 16.1.3 (top right) captures the voltage droop behavior at the BGA ball when the device transitions from the idle traffic state to the max traffic condition. The blue line shows the current transition from idle to active, while the yellow line shows the voltage at the BGA ball. Note that the steady-state voltage drop from the power supply to the TH5 BGA is compensated for by using a dedicated sense line from the die to the voltage regulator. A load line at the voltage regulator ensures that the voltage droop stays within 3% of the supply voltage.
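To make the PDN numbers above concrete, the sketch below reproduces the slew-rate arithmetic and a first-order target-impedance estimate derived from the stated droop budget. The 0.75V nominal core voltage and the flat Z = ΔV/ΔI relation are illustrative assumptions, not the actual impedance-mask methodology.

```python
# First-order PDN arithmetic for the load step described above.
# The 430A step in 550ns and the 3% droop budget come from the text;
# the 0.75V nominal core voltage and the flat target-impedance model
# are simplifying assumptions for illustration only.

I_STEP_A   = 430.0        # idle-to-max core current step
T_STEP_S   = 550e-9       # transition time of the step
VDD_CORE_V = 0.75         # assumed nominal core voltage (mid-AVS range)
DROOP_FRAC = 0.03         # load line keeps droop within 3% of supply

slew_rate = I_STEP_A / T_STEP_S            # ~7.8e8 A/s, i.e. ~782 A/us
droop_budget = DROOP_FRAC * VDD_CORE_V     # ~22.5 mV allowed droop
z_target = droop_budget / I_STEP_A         # ~52 uOhm flat target impedance

print(f"slew rate      : {slew_rate/1e6:.0f} A/us")
print(f"droop budget   : {droop_budget*1e3:.1f} mV")
print(f"target Z (flat): {z_target*1e6:.1f} uOhm")
```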
The TH5 SVK is the primary vehicle for validating the chip functionality and performance. Figure 16.1.3 (bottom) shows the SVK board and illustrates the simplicity of the system design compared to systems with flyover cables and high power consumption. The TH5 SVK supports 64 ports of 800G using stacked, belly-to-belly OSFP connectors with PCB-routed signals. TH5 implements 100G DSP-based SerDes, which supports over 45dB of insertion loss at a pre-FEC BER of 10⁻⁶ or better. We can support 4-meter DAC cables connected between TH5 ports. The TH5 SVK is an air-cooled system with a lidless package for better thermal performance.

The TH5 organic BGA package had several challenges that required a combination of design innovation and custom package technology development. High-speed links have stringent signal integrity requirements in terms of isolation (FEXT/NEXT), insertion loss, and return loss. One of the critical factors impacting isolation is the SerDes BGA pattern, and in the case of TH5, the use of traditional patterns and ball pitch (1.0mm) would drive the package size to over 100mm, with longer links and higher insertion loss. Broadcom instead used a custom, patented compressed hex pattern with 0.9mm pitch that helped reduce the package size significantly and yielded better SI performance metrics, as shown in Fig. 16.1.4. Advanced-node CMOS technology is inherently sensitive to stresses induced by chip-packaging interaction due to the presence of fragile low-k dielectrics and other BEOL structures, such as high-density metal-insulator-metal capacitors. Broadcom used a lookahead daisy-chain test vehicle with Broadcom IP (including custom bump cell, custom design, and custom substrate & assembly BOM/process) to fully qualify and characterize TH5 chip-to-package interactions ahead of tapeout. Despite being a large monolithic die in a large package at the time of development, TH5 packaging results were: passed JEDEC JESD47K Component Level Qual on the first try, met JEDEC Room Temp Coplanarity (200µm) and High Temp Warpage (-140µm/+230µm) specs with margin, and passed IPC9702/3 Shock & Bend Tests.

Signal processing and CMOS technology advancements have enabled the integration of powerful DSP SerDes circuits into switch ASICs. Therefore, the receiver architecture has become very similar between electrical and optical interconnects. Secondly, the progress in silicon photonics and packaging technologies has opened a path for highly integrated optical interconnects of large ASICs like switch chips. As a first implementation, a "direct drive" architecture was proposed by co-packaging (CPO) an Ethernet switch ASIC with linear optical engines for a simple linear electrical-to-optical and optical-to-electrical conversion [2], and was also demonstrated [3]. The same architecture has been implemented with a TH5 co-packaged FR4 optics product named TH5-Bailly (Fig. 16.1.7, left). Figure 16.1.7 (right) shows typical performance data of the system. The green lights in the bottom section of the figure represent one 400G FR4 port each and show error-free performance (after FEC). The top-left graph shows the number of codeword errors, which is zero in this case. The center-left graph shows the average optical transmit and receive power. The graphs on the right center and top show the power consumption of the system and the optics. The total system power is 820W and the optics consume 274W, which results in an efficiency of about 5.5pJ/b. The main benefit of the "direct drive" architecture is that the core ASIC does not need a change in the SerDes to work with copper or optical interfaces. Linear pluggable optics (LPO) or linear receive optics (LRO: linear receiver, retimed transmitter) also benefit from the same topology. This way, we can develop one switch chip that works across all media types. The main advantages of the CPO architecture are a significant power reduction of the optical interface (see Fig. 16.1.5) and a major system density improvement compared to standard front-panel transceivers. Because of the consolidation and integration, the overall cost can also be reduced. LPO or LRO systems profit from the same architectural benefits; however, there is no density improvement since these are still front-panel pluggable modules. The improvements in power consumption and cost reduction are also not as high as with a CPO architecture. Another advantage of CPO and LPO architectures is the reduction of signal latency by approximately 100ns per link. The BER performance of these links is comparable to "standard" optical links, where the signal is regenerated with DSPs two times along the link before it reaches the receiver (Figure 16.1.6, top left and top right). It has been demonstrated that the performance and error statistics of LPO links can be better than standard retimed links [4]. To enable LPO/LRO module vendors to test their optical modules, Broadcom provides a module test platform (MTP), as shown in Figure 16.1.6 (bottom). It provides ports for different optical module form factors and a range of electrical channels for testing and qualification.
Figure 16.1.1: Tomahawk5 floor plan.
Figure 16.1.2: Tomahawk family power consumption.
Figure 16.1.3: Top left: Tomahawk5 PDN impedance mask (with and without loadline); top right: Tomahawk5 SVK voltage droop, 39.9mV with IDLE-to-MAX traffic without loadline enabled; bottom: Tomahawk5 SVK system.
Figure 16.1.4: Top: Tomahawk family package size evolution; bottom: Tomahawk5 ball map.
Figure 16.1.5: Typical power efficiency of the different optical engines.
Figure 16.1.6: Top left: Pre-FEC BER with PRBS31; top right: Pre-FEC BER with traffic; bottom: Tomahawk5 Module Test Platform (MTP) for LRO/LPO.
Figure 16.1.7: Left: Tomahawk5 co-packaged with 8 optical engines of 6.4T each; right: Tomahawk5 co-packaged optics link performance, IEEE 400GBase-FR4 standard.

References:
[1] Broadcom, "Tomahawk4/BCM56990 Series," Accessed on Dec. 2, 2024.
[2] K. Muth, "Multi-Chip vs Monolithic Integration: A System Perspective," Workshop OFC, March 2019.
[3] K. Muth, "Co-packaged Optical I/O with Tomahawk Switch," Workshop ECOC, Sept. 2021.
[4] H. Chaouch, "800G LPO," IPEC/LightCounting Linear Drive Webinar, Sept. 2023.
16.2 RNGD: A 5nm Tensor-Contraction Processor for Power-Efficient Inference on Large Language Models
Sang Min Lee¹, Hanjoon Kim¹, Jeseung Yeon¹, Minho Kim¹, Changjae Park¹, Byeongwook Bae¹, Yojung Cha¹, Wooyoung Choe¹, Jonguk Choi¹, Younggeun Choi¹, Ki Jin Han², Seokha Hwang¹, Kiseok Jang¹, Jaewoo Jeon¹, Hyunmin Jeong¹, Yeonsu Jung¹, Hyewon Kim¹, Sewon Kim¹, Suhyung Kim¹, Won Kim¹, Yongseung Kim¹, Youngsik Kim¹, Hyukdong Kwon¹, Jeong Ki Lee¹, Juyun Lee¹, Kyungjae Lee¹, Seokho Lee¹, Minwoo Noh¹, Junyoung Park¹, Jimin Seo¹, June Paik¹
¹FuriosaAI, Seoul, Korea; ²Dongguk University, Seoul, Korea

There is a need for an AI accelerator optimized for large language models (LLMs) that combines high memory bandwidth and dense compute power while minimizing power consumption. Traditional architectures [1-4] typically map tensor contractions, which are the core computational task in machine learning models, onto matrix multiplication units. However, this approach often falls short in fully leveraging the parallelism and data locality inherent in tensor contractions. In this work, tensor contraction is used as a primitive instead of matrix multiplication, enabling massive parallelism and time-axis pipelining similar to vector processors. Large coarse-grained PEs can be split into smaller compute units called slices, as illustrated in Fig. 16.2.1. Depending on the setup of the fetch network connecting the slices, these slices can function either as one large processing element or as small, independent compute units. Input data are continuously fetched in a pipelined manner through the fetch network, allowing high throughput and efficient data reuse. Since the operation units compute deterministically as configured, accurate cost models for performance and energy can be developed for optimization. The chip specifications are also shown in Fig. 16.2.1.
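To illustrate the difference between lowering onto a matrix-multiplication unit and treating tensor contraction as a primitive, the sketch below expresses an attention-style contraction directly as an einsum and, for comparison, as the reshape-into-matmul form a traditional architecture would execute. It is an illustrative NumPy model; the shapes and names are not the RNGD programming interface.

```python
# Illustrative contrast between contraction-as-primitive and contraction
# lowered to matrix multiplication. Shapes follow a small attention-like
# example (batch b, heads h, sequence l, head dim e); all names and
# shapes are illustrative, not the RNGD ISA.
import numpy as np

b, h, l, e = 2, 4, 256, 128
q = np.random.rand(b, h, l, e).astype(np.float32)
k = np.random.rand(b, h, l, e).astype(np.float32)

# Tensor contraction expressed directly as a primitive: the batch and
# head axes stay explicit, so parallelism across them (e.g. across
# slices) and the reduction over e are visible to the scheduler.
scores_contract = np.einsum("bhle,bhme->bhlm", q, k)

# The same computation lowered onto a matrix-multiplication unit:
# batch/head axes are folded away and recovered afterwards.
scores_matmul = np.stack(
    [q.reshape(b * h, l, e)[i] @ k.reshape(b * h, l, e)[i].T
     for i in range(b * h)]
).reshape(b, h, l, l)

assert np.allclose(scores_contract, scores_matmul, atol=1e-3)
```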
The RNGD consists of two PE clusters, each with four PEs and routers, as shown in Fig. 16.2.2. Each cluster NoC operates at 1GHz, providing 1TB/s of memory bandwidth to minimize bottlenecks during simultaneous HBM access by multiple PEs. A single PE, implemented as a common physical block, can function independently or be fused with up to three others to form a single, larger PE. The four PEs are abutted to provide minimal latency in fused mode (Fig. 16.2.7). The SoC supports single-root I/O virtualization, enabling usage of PEs by multiple virtual machines with separate address spaces. Each PE can transfer data to HBM or to another PE on the same chip at up to 256GB/s. The RNGD features two HBM3 [5] stacks, delivering up to 1.5TB/s of memory bandwidth. The memory NoC for the two HBM stacks can efficiently distribute traffic from the PEs to the 32 HBM channels by interleaving with address hashing.

Figure 16.2.1 shows the block diagram of a PE, which includes a CPU core, a tensor unit (TU) with 32MB of SRAM, and a tensor DMA (TDMA) engine for tensor transfers. The TU, capable of 64TOPS, continuously fetches input tensors from SRAM, processes them, and commits the results back to SRAM. Each PE operates independently with its own address space, managed by an address translation unit that restricts unauthorized access as determined by the secure firmware. For large models, the chip supports PCIe peer-to-peer (P2P) for inter-chip communication, with measured throughput of 32GB/s and 52GB/s without and with a PCIe switch, respectively. PEs communicate with other PEs and the host using TDMA for data and IPC messages. Synchronization of tensors and IPC message transfers enables tensor-parallel operations across multiple PEs. The CPU core operates at up to 2GHz with a nominal voltage of 0.75V, handling scalar processing, control flow, simple vector processing, and management of the TU and the TDMA engine. It includes 64KB of L1 I/D caches, 256KB of L2 cache, and a 3.5MB scratchpad memory (SPM), which are all separate from the 32MB of SRAM mentioned above. The large SPM allows all code to run from it, simplifying the programming model. Memory transfers, often the bottleneck in neural networks, especially LLMs, are handled asynchronously through the TDMA engine and can overlap with TU operations. The TDMA can index and transfer tensors in any dimension order, enabling optimal data lowering from HBM to multiple slices and supporting tensor manipulations like reshape and transpose.
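As a rough software model of the dimension-order transfers the TDMA performs, the sketch below permutes a tensor's axes while copying it and splits the result across slice-local buffers, showing how a reshape or transpose can be folded into the transfer itself. The function name, its arguments, and the split strategy are hypothetical.

```python
# Rough model of a dimension-order tensor transfer: the copy engine walks
# the source in an arbitrary axis order, so a transpose or reshape costs
# no extra pass. The function name and arguments are hypothetical.
import numpy as np

def tdma_transfer(src: np.ndarray, axis_order: tuple[int, ...],
                  num_slices: int) -> list[np.ndarray]:
    """Permute the source tensor's axes and split the permuted tensor's
    last axis across slices, mimicking data lowering from HBM into
    slice SRAMs."""
    permuted = np.transpose(src, axis_order)      # e.g. [b, l, e] -> [b, e, l]
    return np.array_split(permuted, num_slices, axis=-1)

# Example: lower a [batch, seq, emb] activation as [batch, emb, seq]
# tiles spread over 16 slices.
act = np.arange(2 * 256 * 128, dtype=np.float32).reshape(2, 256, 128)
tiles = tdma_transfer(act, axis_order=(0, 2, 1), num_slices=16)
assert len(tiles) == 16 and tiles[0].shape == (2, 128, 16)
```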
The PE NoC provides QoS control and timeout management. Each PE contains 65 slices, with one reserved as a spare to improve chip yield. While functionally identical, the slices have 8 different physical shapes due to I/O position variation based on their location, as shown in Fig. 16.2.2. The slices are interconnected through the TU NoC, consisting of 65 router nodes in a bi-directional ring topology. Data fetched from the SRAM of each slice is sequentially delivered via the fetch network, which supports multicasting to multiple slices, to the operation units, which include contraction (CE), vector (VE), and transpose (TE) engines. The CE contains 8 dot-product engines (DPEs) that perform element-wise multiplications and produce a variable number of outputs with depth-configurable reduction trees. The VE handles non-linear functions, element-wise operations, reductions, and type conversions. The TE transposes the last axis of a tensor with other axes, for example [b, l, e] into [b, e, l]. The architecture supports data-reuse configurations, including weight-stationary via the register file, input-stationary using the CE's input buffer, and output-stationary using accumulator registers [6]. A TU controller (TUC) manages and monitors the slices by accessing TU control/status registers. The CPU core can configure tensor operations either by writing directly to these registers or by sending commands via the TUC's command queue so that the TUC sets the registers on its behalf. A simple command processor in the TU enables asynchronous tensor operations and movements. The TU has control registers that describe tensor shapes and the order of computing tensor axes, and these can be set simultaneously across all slices via broadcast. An enable bit triggers execution once all control registers are configured. The CPU core queues the next command to the TU while the previous operation is still in progress, allowing the CPU core to continue other tasks in parallel. The CPU core handles dynamic control flow, checking the TUC's command queue status via polling or interrupt. In the TU, tensor operations can run in parallel with vector and memory operations thanks to two execution contexts per slice (Fig. 16.2.1): the main context for tensor operations and the sub-context for other tasks.

To ensure deterministic communication cycles, the 65 slices were implemented synchronously with a single vertical clock trunk net and two vertical nets per PE, as shown in Fig. 16.2.2, which minimizes clock skew for 32 slices. A strong driving cell was used as the trunk clock cell and the top two metal layers were used for routing, reducing the clock-tree latency between the root and the furthest partition block and the clock-tree jitter by at least 50% compared to a typical clock-tree structure, while also mitigating on-chip-variation effects.

3D full-wave electromagnetic modeling of the silicon interposer was used for HBM signal integrity (SI) and power integrity (PI), as shown in Fig. 16.2.3. Various HBM toggling patterns were considered to excite on-/off-chip resonances, so that the PDN impedance is suppressed across the data spectrum. Extensive system power transient simulations and PDN optimization achieved a Vpp of 3.65% at 0.4V VDDQL and a Vpp of 7.48% at 0.75V VDD. The high-speed interconnects between the HBM PHY and DRAM on the silicon interposer are carefully designed to minimize crosstalk and capacitive loading. Interconnects are analyzed with S-parameters as well as transient simulations, which show the eye diagram is better than 0.76UI at the RX I/O pads.

Dynamic voltage and frequency scaling (DVFS) efficiently balances temperature and peak performance for the PE and NoC. In Fig. 16.2.4, DVFS controls the total board power under 150W. Temperature sensors, as shown in Fig. 16.2.4, are monitored periodically during operation.
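A minimal sketch of the DVFS capping behavior described above is given below: each periodic decision steps the operating point down when power or temperature exceeds its limit and back up when there is headroom. The 150W budget comes from the text; the operating-point table, temperature limit, step policy, and sensor values are illustrative assumptions.

```python
# Minimal sketch of a DVFS capping policy that keeps board power under a
# budget. The 150W budget comes from the text; the operating-point table,
# limits, and simulated sensor readings are illustrative assumptions.

BOARD_POWER_BUDGET_W = 150.0
TEMP_LIMIT_C = 95.0
OPERATING_POINTS = [            # (frequency MHz, voltage V), low to high
    (500, 0.60), (750, 0.65), (1000, 0.70), (1250, 0.75),
]

def next_level(level: int, board_power_w: float, max_temp_c: float) -> int:
    """One periodic DVFS decision based on the latest sensor readings."""
    if board_power_w > BOARD_POWER_BUDGET_W or max_temp_c > TEMP_LIMIT_C:
        return max(0, level - 1)                          # throttle one step
    if board_power_w < 0.9 * BOARD_POWER_BUDGET_W and max_temp_c < TEMP_LIMIT_C - 5:
        return min(len(OPERATING_POINTS) - 1, level + 1)  # recover performance
    return level

# Simulated periodic readings: (board power in W, hottest sensor in degC).
readings = [(120, 70), (155, 80), (158, 88), (140, 84), (130, 78)]
level = len(OPERATING_POINTS) - 1
for power, temp in readings:
    level = next_level(level, power, temp)
    freq, volt = OPERATING_POINTS[level]
    print(f"P={power}W T={temp}C -> f={freq}MHz, V={volt}V")
```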
To manage high peak currents, the design includes various decoupling capacitors, such as on-die, metal-insulator-metal (MiM), and interposer deep-trench capacitors (DTCs). The amount of DTC for each cluster is 24.5µF. Fully digital, wide-bandwidth voltage-droop detectors monitor localized fast supply-voltage drops, triggering interrupts if the voltage falls below a pre-defined threshold. Timing margins are continuously monitored across different chip locations and also for long-term aging, crucial for data-center reliability. The chip supports encryption for secure booting, ensuring system integrity. The control logic for data memory and the SPM, along with the HBM controller, supports single-error correction and double-error detection. Single-bit errors are counted for statistics, and a double-bit error triggers an interrupt.

To improve the low chip yields associated with the advanced 5nm process node and the large 653mm² chip area, a binning process allows faulty slices to be replaced with a spare. During wafer testing, all slices are tested, and their statuses are recorded in on-chip fuses. During booting, these values are read, and a single faulty slice within a PE can be replaced with the spare. However, if two or more slices fail in a PE, the PE is deemed a failed PE. If any of the four PEs in a cluster fails, the entire PE cluster is considered failed.
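The yield-binning rule above is simple enough to capture in a few lines; the sketch below classifies a cluster from per-PE fault counts under exactly that policy (one spare slice per PE, at most one repair, a failed PE fails its cluster). The data layout and function name are hypothetical.

```python
# Sketch of the slice-redundancy binning rule described above: each PE has
# 65 slices (one spare); one faulty slice is repairable, two or more fail
# the PE, and any failed PE fails its 4-PE cluster. The fuse-map
# representation and function name are hypothetical.

SLICES_PER_PE = 65
PES_PER_CLUSTER = 4

def bin_cluster(faulty_slices_per_pe: list[int]) -> str:
    """faulty_slices_per_pe holds the fused fault count for each of the
    four PEs in one cluster; returns the resulting cluster bin."""
    assert len(faulty_slices_per_pe) == PES_PER_CLUSTER
    for faults in faulty_slices_per_pe:
        if faults > 1:          # the spare can absorb only one faulty slice
            return "cluster failed"
    return "cluster good (spares used where needed)"

print(bin_cluster([0, 1, 0, 0]))   # repaired with the spare -> good
print(bin_cluster([0, 2, 0, 0]))   # one PE fails -> whole cluster fails
```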
A custom heat sink for air cooling, including its assembly, was designed and validated through extensive thermal simulations and measurements, as shown using the chip thermal model in Fig. 16.2.4.

We evaluated the LLaMA-2 7B model on the RNGD. Figure 16.2.5 illustrates how tensor contraction maximizes parallelism and data reuse in the attention layer. It shows the multiplication of the query (in0 = Q[0:256, 0:128]) and key (in1 = K[0:256, 0:128]) matrices for one head, and the distribution of matrix Q across 16 slices. Figure 16.2.5 also shows the TU and TDMA activity trace for an encoder block during the prefill phase. It shows how the computation time is hidden behind weight transmission for short sequence lengths, and how the weight-transmission time is hidden behind computation for long sequence lengths, where computation dominates. Figure 16.2.6 shows that RNGD has 1.7× higher peak memory bandwidth and is 1.76× faster in throughput than the L40S with a 57% lower TDP. Although the H100 is 1.72× faster with 2.2× higher memory bandwidth, it has a 4.7× higher TDP than RNGD. This comparison underscores RNGD's superior throughput efficiency relative to its power consumption, thanks to its favorable memory bandwidth-to-TDP ratio. The measured performance per power for GPT-J 6B with 99% accuracy is 53% higher than that of the L40S.

Acknowledgement: We thank all the members of FuriosaAI who have contributed to the development of the chip. We also thank Global Unichip Corp. and proteanTecs for their contribution to this work.

Figure 16.2.1: A processing element of the tensor contraction processor (TCP) with four slices, multi-context examples, and the chip summary.
Figure 16.2.2: Block diagram, processing element floor plan, and clock routing strategy.
Figure 16.2.3: HBM integration and signal/power integrity.
Figure 16.2.4: A chip thermal model with on-chip monitors, a timing margin map, a DVFS graph, and the heat sink.
Figure 16.2.5: Tensor contraction maximizes parallelism and data reuse, while traces show how computation or transmission time is hidden.
Figure 16.2.6: Comparison of latency and large language model (LLM) performance.
Figure 16.2.7: Die micrograph.

References:
[1] E. Medina and E. Dagan, "Habana Labs purpose-built AI inference and training processor architectures: Scaling AI training systems using standard Ethernet with Gaudi processor," IEEE Micro, vol. 40, no. 2, pp. 17-24, 2020.
[2] S. Knowles, "Graphcore," Hot Chips 33 Symposium (HCS), 2021.
[3] K. Chatha, "Qualcomm Cloud AI 100: 12TOPS/W scalable, high performance and low latency deep learning inference accelerator," Hot Chips 33 Symposium (HCS), 2021.
[4] N. Jouppi, et al., "TPU v4: an optically reconfigurable supercomputer for machine learning with hardware support for embeddings," ISCA, 2023.
[5] M.-J. Park, et al., "A 192-Gb 12-high 896-GB/s HBM3 DRAM with a TSV auto-calibration scheme and machine-learning-based layout optimization," IEEE JSSC, vol. 58, no. 1, pp. 256-269, Jan. 2023.
[6] V. Sze, et al., "Efficient processing of deep neural networks: a tutorial and survey," Proc. IEEE, vol. 105, no. 12, pp. 2295-2329, Dec. 2017.
[7] NVIDIA, "NVIDIA AI inference performance." [Online]. Available: https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference.
[8] MLCommons, "v3.1 Results." [Online]. Available: https://mlcommons.org/benchmarks/inference-datacenter/.
[9] MLCommons, "v4.0 Results." [Online]. Available: https://mlcommons.org/benchmarks/inference-datacenter/.
16.3 An On-Device Generative AI Focused Neural Processing Unit in 4nm Flagship Mobile SoC with Fan-Out Wafer-Level Package
Jun-Seok Park, Taehee Lee, Heonsoo Lee, Changsoo Park, Youngsang Cho, Mookyung Kang, Heeseok Lee, Jinwon Kang, Taeho Jeon, Dongwoo Lee, Yesung Kang, Kyungmok Kum, Geunwon Lee, Hongki Lee, Minkyu Kim, Suknam Kwon, Sung-beom Park, Dongkeun Kim, Chulmin Jo, HyukJun Chung, Ilryoung Kim, Jongyoul Lee
Samsung Electronics, Hwaseong, Korea

A notable trend observed in on-device AI is the natural progression from camera-application-centric CNN-based neural networks to transformer-based generative AI (Gen AI) [1]. For instance, large language models (LLMs) such as LLaMA [2] can support natural language understanding and human-like text generation, while large visual models (LVMs) such as Stable Diffusion [3] can generate images or 3D models based on user context. However, Gen AI models exhibit different operational characteristics from traditional neural network (NN) models. LLMs require reading several GB of weight data from DRAM every time a single token is generated during decoding, resulting in memory-intensive behavior. LVMs, on the other hand, are more compute-intensive than LLMs but have distinct operational characteristics, with softmax and layernorm accounting for 40% of total computation time [4], whereas CNNs typically consist of convolutions that account for the majority (90 to 99%) of operations in these networks. We report on a neural processing unit (NPU) in the 4nm Samsung Exynos 2400 that employs a heterogeneous architecture consisting of vector engines and two types of tensor engines. The NPU integrates a memory hierarchy and tiling techniques to support a wide range of neural networks. AI performance is boosted by enhancing heat dissipation through fan-out wafer-level packaging (FOWLP).

In Fig. 16.3.1, the NPU contains two types of tensor engines (TEs), called the general TE (GTE) and the shallow TE (STE), having 8K MACs and 512 MACs, respectively. Each type of TE has a simplified memory hierarchy that maximizes data reuse by leveraging a pre-determined data flow. The four vector engines (VEs), integrating a single-instruction multiple-data (SIMD) datapath, handle non-linear operations such as softmax, complex activation functions, and normalization, with each having a 32-way execution unit. Both TEs and VEs collaborate to accelerate a given network, and they are connected to a 6MB shared scratchpad memory called NPUMEM, which holds input and output feature maps (FM), intermediate data, and weight values. Each TE has L0/L1 queuing caches (Q-caches), which are a type of simplified cache memory with reduced miss penalties, based on the fact that the operations of all modules in the TEs proceed in a predetermined order according to the nested loop structure, as shown in Fig. 16.3.2. The Q-cache can have a response sequence that is temporally decoupled while maintaining the same order as the request sequence. Therefore, the Q-cache manages these data sequences as a queue to get an intuitive understanding of the temporal and spatial locality of upcoming data, allowing for more precise eviction decisions. Consequently, most subsequent accesses hit in the L0/L1 Q-caches, thereby reducing the average latency due to locality after the initial cold miss occurs. Prefetching directly loads data into the L1 Q-cache to minimize the initial cold miss and reduce the latency further. In the case of CNNs, the pre-fetch unit can significantly increase the hit ratio in the L1 Q-cache when it operates just a few cycles ahead of the fetch unit. As a result, it allows for latency hiding in the NPU without complex scheduling or task management like single-instruction multiple-threads (SIMT) [5].
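A toy model of the queue-managed cache idea is sketched below: because the access order is known from the nested loops, the cache can be driven by that order and evict the line whose next use lies farthest in the future, rather than guessing with a reactive replacement policy. The class and its interface are hypothetical and only illustrate the ordering argument, not the actual Q-cache microarchitecture.

```python
# Toy model of a queuing cache: the request order is known in advance
# (it follows the TE's nested loops), so eviction can be based on the
# known future sequence instead of a guess. The class and interface are
# hypothetical.
from collections import deque

class QCache:
    def __init__(self, capacity_lines: int, known_request_order: list[int]):
        self.capacity = capacity_lines
        self.future = deque(known_request_order)  # addresses in loop order
        self.lines: dict[int, bool] = {}          # resident line addresses
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> None:
        assert self.future and self.future[0] == addr, "requests follow loop order"
        self.future.popleft()
        if addr in self.lines:
            self.hits += 1
            return
        self.misses += 1                          # cold or capacity miss
        if len(self.lines) >= self.capacity:
            # Evict the resident line whose next use is farthest away;
            # the known request queue makes this exact rather than guessed.
            farthest = max(self.lines, key=self._next_use)
            del self.lines[farthest]
        self.lines[addr] = True

    def _next_use(self, addr: int) -> int:
        try:
            return self.future.index(addr)
        except ValueError:
            return len(self.future) + 1           # never used again

trace = [0, 1, 2, 0, 1, 3, 0, 1, 2, 3]            # addresses from a nested loop
qc = QCache(capacity_lines=3, known_request_order=trace)
for a in trace:
    qc.access(a)
print(qc.hits, qc.misses)                         # 5 hits, 5 misses (4 cold, 1 capacity)
```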
Figure 16.3.3 shows that the matrix size and shape affect the amount of data reuse. For a given memory size, the NPU shows higher data reusability when larger matrices are stored in the memory, when the FM matrix and weight matrix are of more similar sizes, and when the matrices have smaller input-channel lengths. In this context, we define skewness as the ratio between the sizes of the larger and the smaller matrix, and the minimum reuse factor as the minimum amount of data reuse of the input data needed to fill the bandwidth (BW) gap between the input and output ports of a memory hierarchy. If the reuse factor of the input data stored in the given memory hierarchy is smaller than the minimum reuse factor, the NPU's MACs may be idle due to insufficient data, resulting in reduced HW utilization. To prevent this situation, we need to take advantage of data size, skewness, and input-channel length to increase the data reuse factor. The skewness curve in Fig. 16.3.3 summarizes the relationship between skewness and input-channel length for a given memory size (data size), and the line in the curve represents the skewness, for each input-channel length, required to satisfy the minimum data reuse for that memory size. If a tile with FM and weight matrices falls within the blue regions of the skewness curve, it becomes compute-intensive; otherwise, it remains memory-intensive. Furthermore, we can extend this property into three dimensions according to various memory sizes. Using the three-dimensional skewness curve, we can determine the size of each memory hierarchy level during HW design, or perform matrix/tensor tiling to fit the tiled matrix/tensor into a given memory size while maximizing data reuse. As an example, Fig. 16.3.3 shows how a heuristic method using the skewness curve performs the tiling within a given memory size. Suppose we initially receive a tile larger than the memory budget, and begin the tiling process using the skewness curve corresponding to the tile size. At each step, we select a tiling direction from the available candidates (e.g., width, height, and input channel) in a greedy algorithmic manner by referring to the skewness curve corresponding to the tile size. Afterward, we update the skewness curve to correspond to the new tile size. As this tiling process is repeated, the tile size gradually decreases, approaching the memory budget. This process guides tiling in a way that maximizes data reuse, enabling us to quickly achieve a tiling result that fits within the given memory budget.
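A simplified version of this greedy tiling loop is sketched below. The scoring function stands in for the skewness-curve lookup (here approximated by keeping the FM and weight tiles similar in size and the input channel short); the halving step and all names are illustrative assumptions rather than the actual compiler heuristic.

```python
# Simplified greedy tiling in the spirit of the skewness-curve method:
# repeatedly shrink the tile along the direction that best preserves data
# reuse until it fits the memory budget. The scoring function is a
# stand-in for the skewness-curve lookup and is an illustrative assumption.

def tile_bytes(h, w, c_in, c_out, bytes_per_el=1):
    fm = h * w * c_in                 # feature-map tile elements
    wt = c_in * c_out                 # weight tile elements
    return (fm + wt) * bytes_per_el

def reuse_score(h, w, c_in, c_out):
    fm, wt = h * w * c_in, c_in * c_out
    skewness = max(fm, wt) / max(1, min(fm, wt))    # similar sizes -> low skew
    return 1.0 / (skewness * c_in)                  # prefer low skew, short c_in

def greedy_tile(h, w, c_in, c_out, budget_bytes):
    tile = {"h": h, "w": w, "c_in": c_in}
    while tile_bytes(tile["h"], tile["w"], tile["c_in"], c_out) > budget_bytes:
        # Try halving each candidate direction and keep the best-scoring one.
        best_dir, best_score = None, -1.0
        for d in ("h", "w", "c_in"):
            if tile[d] == 1:
                continue
            trial = dict(tile, **{d: max(1, tile[d] // 2)})
            score = reuse_score(trial["h"], trial["w"], trial["c_in"], c_out)
            if score > best_score:
                best_dir, best_score = d, score
        tile[best_dir] = max(1, tile[best_dir] // 2)
    return tile

# Example: shrink a 64x64x512 FM tile with 256 output channels into 64KB.
print(greedy_tile(64, 64, 512, 256, budget_bytes=64 * 1024))
```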
Since transformer-based networks alternate between linear operations and non-linear operations (e.g., softmax), with the latter accounting for a significant portion of the overall computation [4], it is crucial to run the TEs and VEs in parallel to reduce computational latency. Even in cases where linear operations are performed sequentially, as in traditional CNNs, it is also necessary to utilize all TEs to quickly process the linear operations in parallel. In the NPU, the entire network is partitioned into large L2 tiles, which are sized to fit within NPUMEM at once, considering the parallel processing of TEs and VEs. Once an L2 tile is allocated in NPUMEM, the L2 tiles are further partitioned into smaller L1 tiles, taking into account the L1 Q-cache of the TE. A TE processes one L1 tile at a time in a back-to-back manner until all computations related to the L2 tile are completed. The TE and VE can perform tile-level pipelining on an L1-tile basis, exchanging the input and output tiles with NPUMEM, as shown in Fig. 16.3.4. Similarly, all TEs can also execute in parallel, exploiting tile-level parallelism to accelerate the computation of an L2 tile. While the GTE is highly optimized for convolutions and matrix-matrix multiplication with high data reuse, it can show relatively low HW utilization on operations like memory-intensive matrix-vector multiplication or depthwise convolution. However, the STE, which has significantly fewer MACs than the GTE, compensates with relatively higher memory BW per MAC, allowing for higher HW utilization in such operations. Consequently, the heterogeneous NPU architecture consisting of two types of TEs provides the functional flexibility to cover various types of linear layers, from compute-intensive CNN layers to memory-intensive matrix-vector operations, with high efficiency.

The large number of MAC units, physically packed together within a small area of silicon, results in high junction temperatures that threaten reliability. Dynamic thermal management (DTM) solutions control thermal throttling with on-chip temperature sensors to prevent excessive junction temperature. The performance of the NPU is thermally limited by dropping the clock frequency, as shown in Fig. 16.3.5. To overcome thermal-induced performance degradation, an enhanced 3rd-generation 4nm process and an FOWLP solution are selected. The 3rd-generation 4nm process provides an 11% RO AC performance gain compared with the 1st-generation 4nm process, since reductions in Ceff and Reff are achieved with source and drain engineering, middle-of-line (MOL) resistance reduction, and replacement metal gate (RMG) optimization. FOWLP has excellent thermal characteristics suitable for mobile products [6]. Interposer PoP (I-PoP) was used in the Exynos 2200; however, the Exynos 2400 adopts FOWLP, which improves thermal resistance as shown in Fig. 16.3.5. The vertical stack-up of AP and DRAM packages generally prevents the heat generated at the hot spot of the AP from escaping through the top surface of the AP. However, FOWLP can provide better thermal resistance than I-PoP due to the larger die thickness: the thin redistribution layer (RDL) in FOWLP enhances heat spreading through the large die thickness. Thermal resistance is decreased by 16%, from 16.52°C/W to 13.83°C/W, by changing from I-PoP to FOWLP. The performance gain from the 4nm process and the thermal-resistance reduction due to the package change improve the NPU maximum clock frequency at the same power by 30%.

Figure 16.3.6 shows the measurement results for the proposed NPU. The NPU occupies 12mm², and it operates at a 0.55-to-0.83V supply voltage and a 533-to-1196MHz clock frequency. Performance and power of the NPU were measured in silicon on several networks in MLPerf (mobile) [7] and on the U-Net network in Stable Diffusion. While overall inference throughput at 1196MHz for MobileNetEdgeTPU [8], MobileDet [9], and Mosaic [10] is improved by 1.81×, 2.37×, and 2.65×, respectively, compared to [11], performance for EDSR [12] and the LVM U-Net was newly measured at 140.3 and 8.3 inferences/s, respectively. While increasing the internal buffer size from 2 to 6MB, an area efficiency of 3.48TOPS/mm² is achieved due to sharing the weight buffer across the MAAs in the spatial direction and the optimized MAC design. Figure 16.3.7 shows the chip micrograph with the NPU.

Figure 16.3.1: Heterogeneous architecture of the neural processing unit.
Figure 16.3.2: L0/L1 queuing cache operation.
Figure 16.3.3: Tiling selection method using the skewness curve.
Figure 16.3.4: Neural network operations using heterogeneous engines.
Figure 16.3.5: NPU performance improvement achieved by RO AC performance gain and thermal resistance reduction.
Figure 16.3.6: Measurement results and performance comparison table.
Figure 16.3.7: Die micrograph.

References:
[1] A. Vaswani, et al., "Attention Is All You Need," NeurIPS, 2017.
[2] A. Dubey, et al., "The Llama 3 Herd of Models," ArXiv, 2024.
[3] R. Rombach, et al., "High-resolution image synthesis with latent diffusion models," ArXiv, 2021.
[4] J. R. Stevens, et al., "Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers," DAC, 2021.
[5] B. Klenk, et al., "Relaxations for High-Performance Message Passing on Massively Parallel SIMT Processors," Int. Parallel and Distributed Processing, 2017.
[6] T. Yoo, et al., "Advanced Chip Last Process Integration for Fan Out WLP," IEEE ECTC, 2022.
[7] V. J. Reddi, et al., "MLPerf Inference Benchmark," ISCA, 2020.
[8] M. Tan, et al., "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," ICML, 2019.
[9] Y. Xiong, et al., "MobileDets: Searching for Object Detection Architectures for Mobile Accelerators," CVPR, 2021.
[10] W. Wang, et al., "MOSAIC: Mobile Segmentation via decoding Aggregated Information and encoded Context," ArXiv, 2021.
[11] J.-S. Park, et al., "A Multi-Mode 8k-MAC HW-Utilization-Aware Neural Processing Unit with a Unified Multi-Precision Datapath in 4nm Flagship Mobile SoC," ISSCC, pp. 246-247, Feb. 2022.
[12] B. Lim, et al., "Enhanced Deep Residual Networks for Single Image Super-Resolution," CVPR, 2017.
[13] J.-S. Park, et al., "A 6K-MAC feature-map-sparsity-aware neural processing unit in 5nm flagship mobile SoC," ISSCC, pp. 152-153, 2021.
[14] A. Agrawal, et al., "7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-Aware Throttling," ISSCC, pp. 144-145, 2020.
[15] C.-H. Lin, et al., "A 3.4-to-13.3TOPS/W 3.6TOPS Dual-Core Deep-Learning Accelerator for Versatile AI Applications in 7nm 5G Smartphone SoC," ISSCC, pp. 134-135, 2020.
16.4 SambaNova SN40L: A 5nm 2.5D Dataflow Accelerator with Three Memory Tiers for Trillion Parameter AI
Raghu Prabhakar, Junwei Zhou, Darshan Gandhi, Youngmoon Choi, Mahmood Khayatzadeh, Kyunglok Kim, Uma Durairajan, Jeongha Park, Satyajit Sarkar, Jinuk Luke Shin
SambaNova Systems, Palo Alto, CA

The SN40L is the latest-generation Reconfigurable Dataflow Unit (RDU) from SambaNova Systems, built for modern AI training and inference applications [1,2]. The SN40L combines on-chip dataflow flexibility with a three-tier memory system: on-chip SRAM, on-package High Bandwidth Memory (HBM), and off-package DRAM (Fig. 16.4.7). The SN40L employs CoWoS-S interposer technology featuring two identical accelerator dies and HBM memory modules. To enable both high-bandwidth and high-capacity memory accesses from a single chip, along with peer-to-peer fabric and die-to-die connectivity, a dual-die approach is adopted, where the size of each die is 600mm² (Fig. 16.4.1). Furthermore, a sub-reticle-size dual-die design optimizes performance, thermal efficiency, and cost. The SN40L accelerator dies are fabricated using TSMC's 5nm process. Each socket contains a total of 102 billion transistors, delivering 640 BF16 TFLOPs and 520MB of on-chip SRAM using high-density memory cells.

Each SN40L die consists of four types of programmable units arranged in a 2D mesh interconnect: 520 Pattern Compute Units (PCUs), 520 Pattern Memory Units (PMUs), Switches, and Address Generation and Coalescing Units (AGCUs) (Fig. 16.4.2). PCUs consist of a pipeline of vectorized ALUs that deliver all the compute teraflops (Fig. 16.4.3). ALUs in the PCU can be programmed to operate either as a systolic array for matrix multiplications, or as a static dataflow pipeline for streaming operations. Each PCU also has a tail ALU stage that natively supports complex operations like exponent, sigmoid, and tanh. PMUs each contain 512KB of programmer-managed on-chip SRAM with multiple banks to allow concurrent reads and writes with arbitrary access patterns at high throughput (Fig. 16.4.3). PMUs also contain specialized ALUs to accelerate address computation, and data alignment units to accelerate operations like transpose. Switches support packet-switched scalar and vector data interconnects, and a statically programmed bit-level control interconnect. Packet-switched interconnects support point-to-point and multicast paths. Packets are routed with hardware dimension-ordered routing (DOR), with a provision for software override. AGCUs enable high-throughput off-chip address generation as well as virtualization for off-chip memory management. A graph control interface enables RDUs to load and control graph orchestration without host involvement. A hardware-managed Top-Level Network (TLN) connects AGCUs to the IO subsystem components like HBM, DDR, and PCIe links to other RDUs and the host CPU. Data can be streamed directly from one RDU to another RDU using a token-based Peer-to-Peer (P2P) protocol.

Compute graphs are executed on RDUs in a pipelined manner, where compute is overlapped with communication by default. Subtensors, or tiles, of input Tensors are loaded by AGCUs from HBM or DDR into one or more PMUs (Fig. 16.4.2). Data arrives in packets at the PMUs, where ALUs are programmed with specific read and write access patterns to access data in the SRAMs. Once a data tile is written into a PMU, a control token starts the read operation from the PMU. Data from the PMU is routed to its on-chip destination units (PCUs, PMUs, or AGCUs) via switches along either dimension-ordered or software-determined paths (Fig. 16.4.3). Each unit is programmed with input data and control dependencies. Units begin processing their inputs and producing outputs when their dependencies are met. For instance, a PCU programmed to add two Tensors begins execution when it has received at least one data packet for each input. Loading data tiles from AGCUs, reads and writes in PMUs, and compute in PCUs are all fully decoupled and overlapped with each other.
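A minimal software analogue of this decoupled, token-driven tile pipeline is sketched below using queues and threads: a loader stage, a buffer stage, and a compute stage run concurrently, so loading tile i+1 overlaps with computing tile i. The stage names and the queue-based structure are illustrative assumptions, not the RDU programming model.

```python
# Minimal analogue of the decoupled tile pipeline: an AGCU-like loader,
# a PMU-like buffer stage, and a PCU-like compute stage communicate
# through bounded queues, so loading tile i+1 overlaps with computing
# tile i. The stage names and structure are illustrative assumptions.
import threading
import queue
import numpy as np

NUM_TILES = 8
load_q, compute_q, done_q = queue.Queue(2), queue.Queue(2), queue.Queue()

def agcu_loader():                       # fetch tiles from "off-chip" memory
    for i in range(NUM_TILES):
        load_q.put((i, np.full((64, 64), float(i))))
    load_q.put(None)

def pmu_buffer():                        # stage tiles in on-chip SRAM
    while (item := load_q.get()) is not None:
        compute_q.put(item)              # acts as a "tile ready" control token
    compute_q.put(None)

def pcu_compute():                       # consume tiles as tokens arrive
    while (item := compute_q.get()) is not None:
        idx, tile = item
        done_q.put((idx, float(tile.sum())))

threads = [threading.Thread(target=f) for f in (agcu_loader, pmu_buffer, pcu_compute)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(done_q.get() for _ in range(NUM_TILES)))
```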
DDR and HBM memories are exposed to software directly as two separate memory spaces. Software allocates Tensors requiring high-bandwidth access, like weights and KV caches in generative inference, in HBM. DDR is used to store Tensors requiring high capacity and low-bandwidth access. For instance, training applications can use DDR to store activation checkpoints generated during the forward pass that are required during backpropagation. In generative inference and agentic AI applications, DDR can be used to store the parameters of hundreds of models along with persistent KV-cache values [5]. The most frequently used models are then moved to HBM dynamically by software. With advances in agentic AI and test-time inference scaling [3,4], modern AI applications perform inference on several models to collectively accomplish a complex task. With DDR and HBM, a single SN40L node with 8 sockets (SN40L-8) lowers both model hosting and switching costs [2], and is able to host and serve trillions of parameters efficiently in a cost-effective manner, enabling much higher Performance/Total Cost of Ownership (TCO) (Fig. 16.4.5).

The increasing demand for computational capability in LLMs has posed a significant challenge in meeting both thermal and electrical power demands in many conventional data centers. While both hardware and software power-capping solutions can be employed to limit power at the cost of reduced performance, the fast-moving power variation of LLM workloads makes it difficult to optimize performance for every type of workload. This often results in static changes to lower-performing states to stay within the constraint. For example, a typical inference workload consists of two main phases: the prompt phase, which is a compute-intensive phase that processes a large input matrix, requiring significant computational resources, and the generation phase, which sequentially generates tokens using relatively lighter and more stable computation, and consumes less power. The compute interval of the prompt phase can span from milliseconds to seconds, depending on the input context length and batch size. Therefore, the system needs to be dynamically adaptive to control peak power consumption so that it stays within the electrical power limit of a rack in the data center during the prompt phase, while running at full frequency during the generation phase.

To control both peak and average power under dynamically changing workloads in inference or training models, the SN40L implements an on-chip Power Estimation Unit (PEU). The PEU is capable of estimating power consumption in the sub-microsecond range and automatically adjusting operating conditions to achieve maximum performance on any workload within electrical power limits, all without software assistance. The PEU monitors the activities of all tile components, totaling approximately 3000 PCUs, PMUs, and switches, and aggregates the power consumption of each component. Within each unit, a power calculator unit tracks opcode, stage activity, data toggle rate, and zero-data detection to ensure accurate power estimation. Additionally, for PMU and switch units, it monitors SRAM operation and network port activity to estimate power. The power consumption of each unit is then collected in a hierarchical daisy-chain manner to minimize hardware costs. Each unit contains weights and biases for power calculation, enabling post-silicon adjustments. The aggregated power is then converted with the effective frequency and voltage during its calculation to obtain actual power values. The power management controller receives these power values to calculate the appropriate operating condition and initiates fast-reacting clock skipping, followed by voltage and frequency scaling. As the PEU-controlled power capping operates within microsecond ranges, it ensures the chip never exceeds the electrical power budget, staying within the rack limit.
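To illustrate the activity-weighted estimation and fast capping flow described above, the sketch below computes per-unit power from weighted counters and a bias, sums across units, rescales by effective voltage and frequency, and then models clock skipping until the estimate falls back under the cap. All counter names, weights, and the capping numbers are illustrative assumptions, not the PEU's actual model.

```python
# Sketch of activity-weighted power estimation in the spirit of the PEU:
# each unit turns activity counters into an estimate through per-unit
# weights and a bias (tunable post-silicon), the estimates are summed,
# and the total is scaled by the effective voltage and frequency.
# The counter names, weights, and capping numbers are illustrative.

UNIT_WEIGHTS = {"opcode": 0.08, "stage_active": 0.05,
                "toggle_rate": 0.12, "sram_access": 0.09}
UNIT_BIAS = 0.02                     # static/leakage term per unit

def unit_power(counters: dict, v: float, f_ghz: float,
               v_ref: float = 0.75, f_ref_ghz: float = 1.0) -> float:
    """Weighted activity model, rescaled to the effective V and f
    (dynamic power ~ activity * f * V^2)."""
    activity = sum(UNIT_WEIGHTS[k] * counters.get(k, 0.0) for k in UNIT_WEIGHTS)
    return (activity + UNIT_BIAS) * (f_ghz / f_ref_ghz) * (v / v_ref) ** 2

def chip_power(all_unit_counters: list, v: float, f_ghz: float) -> float:
    # Hierarchical daisy-chain collection is modeled as a simple sum
    # over the ~3000 PCU/PMU/switch units.
    return sum(unit_power(c, v, f_ghz) for c in all_unit_counters)

def apply_cap(estimated_w: float, cap_w: float, skip_ratio: float) -> float:
    """Fast-reacting clock skipping: drop cycles until the estimate falls
    back under the cap (voltage/frequency scaling would follow more slowly)."""
    while estimated_w > cap_w:
        estimated_w *= (1.0 - skip_ratio)
    return estimated_w

counters = [{"opcode": 0.6, "stage_active": 0.9,
             "toggle_rate": 0.4, "sram_access": 0.3}] * 3000
est = chip_power(counters, v=0.75, f_ghz=1.0)
capped = apply_cap(est, cap_w=0.5 * est, skip_ratio=0.1)   # cap at 50%, as in Fig. 16.4.4
print(round(est), "W estimated ->", round(capped), "W after capping")
```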
Figure 16.4.4 illustrates the fast transitional peak power staying close to the average power (TDP) when the power is capped by 50%. The performance improvement achieved with PEU-based power capping shows up to 39% better token-generation performance in inference workloads, as it throttles only when power exceeds the capping limit, compared to statically setting the operating frequency and voltage to stay within the power limit. Notably, even compute-intensive training workloads show an 11% performance improvement, as fast power management with the PEU operates at higher frequencies during communication-heavy or compute-light phases.

The SN40L's dataflow architecture and 3-tier memory system are designed to achieve energy efficiency in computing large language models (LLMs) compared to GPUs [6]. The streaming dataflow architecture enables two key optimizations. First, it allows massive kernel fusion, integrating the entire decoder layer into a single kernel call [1,2]. Second, it reduces kernel launch overhead by converting repeated invocations of this fused kernel into a single call. This minimizes off-chip memory accesses, resulting in power savings on both the accelerator and HBM dies. Additionally, the dataflow architecture minimizes the power overhead associated with core-based ISA accelerators. These features enable the SN40L to deliver about 3× better energy efficiency compared to the NVIDIA H100 GPU across Llama 3.1 8B, 70B, and ViT model inference (Fig. 16.4.5). Dataflow also enables better utilization of on-chip compute teraflops during training, achieving a 2× speedup over the H100 in training vision transformers (ViT) (Fig. 16.4.5). A typical 8-socket SN40L system can host models with up to 5 trillion parameters, compared to 0.27 trillion parameters on GPUs with the same socket count [7], all within millisecond model-switching latencies. This unique capability of direct DDR access from the SN40L provides power, space, and cost benefits for running a large number of expert models by breaking the memory wall of GPUs. Finally, reduced power density allows the SN40L to maximize its compute resources in air-cooled systems (Fig. 16.4.6). This eliminates the need for costly and complex cooling solutions and enables deploying the SN40L in data centers worldwide.

Figure 16.4.1: SN40L socket components and memory tiers.
Figure 16.4.2: SN40L die with PCUs, PMUs, Switches, and AGCUs in a 2D mesh interconnect.
Figure 16.4.3: PCU (top) and PMU (bottom) architecture in SN40L.
Figure 16.4.4: Unconstrained vs. constrained power profiling during Llama 2 7B training.
Figure 16.4.5: SN40L performance and energy vs. DGX H100.
Figure 16.4.6: Temperature and utilization characteristics of GPT-13B training on SN40L.
Figure 16.4.7: SN40L die and board photo showing the chip and three tiers of memory.

References:
[1] R. Prabhakar, "SambaNova SN40L RDU: Breaking the Barrier of Trillion+ Parameter Scale Gen AI Computing," IEEE Hot Chips 36 Symp., Aug. 2024.
[2] R. Prabhakar, et al., "SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts," IEEE/ACM Int. Symp. Microarchitecture, 2024.
[3] OpenAI, "Learning to Reason with LLMs." Accessed on Dec. 5, 2024.
[4] C. Snell, et al., "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters." Accessed on Dec. 5, 2024.
[5] OpenAI, "Prompt Caching." Accessed on Dec. 5, 2024.
[6] NVIDIA TensorRT-LLM. Accessed on Dec. 5, 2024.
[7] NVIDIA DGX H100 Datasheet. Accessed on Dec. 5, 2024.
183、va SN40L:Scali ng the AI Memory Wall wi th Dataflow and Composi ti on of Experts”,IEEE/ACM Int.Symp.Mi croarchi t ect ure,2024.3 OpenAI,“Learni ng to Reason wi th LLMs”.Accessed on Dec.5,2024,.4 C.Snell,et.al.,“Scali ng LLM Test-Ti me Compute Opti mally can be More Effecti ve than Scali ng Model Parameters”.Accessed on Dec.5,2024,.5 OpenAI,“Prompt Cachi ng”.Accessed on Dec.5,2024,.6 NVI DI A TensorRT-LLM.Accessed on Dec.5,2024,.7 NVI DI A DGX H100 Datasheet.Accessed on Dec.5,2024,.