Session 2 Overview: Processors
DIGITAL ARCHITECTURES AND SYSTEMS SUBCOMMITTEE
Session Chair: Jie Gu, Northwestern University, Evanston, IL
Session Co-Chair: Nathaniel Pinckney, Nvidia, Austin, TX

Processors are the driving workhorses of modern high-performance computing. This session presents the next-generation processors for both general-purpose and specialized computing. The first three papers deliver the next-generation CPU and networking processors, followed by a 20-chiplet heterogeneous 2.5D system in the fourth paper. The remaining papers present state-of-the-art super-resolution image and video processors, as well as advanced 3D rendering processors.

ISSCC 2025 / SESSION 2 / PROCESSORS / OVERVIEW 979-8-3315-4101-9/25/$31.00 ©2025 IEEE

1:30 PM
2.1 "Zen 5": The AMD High-Performance 4nm x86-64 Microprocessor Core
Spence Oliver, AMD, Austin, TX
In Paper 2.1, AMD presents Zen 5, their next-generation x86-64 microprocessor core, targeting both desktop and laptop applications. Built in TSMC's 4nm FinFET process, the 55mm² core complex integrates 8.6B transistors across 8 CPU cores. Each core delivers a generational 16% IPC improvement over Zen 4, while maintaining a maximum frequency of 5.7GHz.

1:55 PM
2.2 IBM Telum II: Next Generation 5.5GHz Microprocessor with On-Die Data Processing Unit and Improved AI Accelerator
Gerald Strevig, IBM Systems, Austin, TX
In Paper 2.2, IBM showcases their next-generation 5.5GHz Telum II microprocessor with an enhanced AI accelerator and a new Data Processing Unit for I/O offload. The 600mm² die with 43B transistors in 5nm benefits from +40% L2 cache, on-chip voltage control, +36% X-BUS bandwidth, and reliability enhancements, while staying within 5% power of the prior generation.

2:20 PM
2.3 Granite Rapids-D: Intel Xeon 6 SoC for vRAN, Edge, Networking, and Storage
Raj R. Varada, Intel, Santa Clara, CA
In Paper 2.3, Intel describes the integration of their Xeon 6 SoC, Granite Rapids-D. In exploring the I/O die built in Intel 4 technology, details of the 200Gb/s Enterprise-Grade Ethernet, 32 lanes of 32Gb/s PCIe5, and 16 lanes of PCIe4 are provided. Hardware accelerators demonstrate 1.8× improvement in system-level benchmarks over prior generations.

2:45 PM
2.4 A 300MB SRAM, 20Tb/s Bandwidth Scalable Heterogeneous 2.5D System Inferencing Simultaneous Streams Across 20 Chiplets with Workload-Dependent Configurations
Srivatsa Rangachar Srinivasa, Intel, Hillsboro, OR
In Paper 2.4, Intel presents a 2.5D package integration approach with workload-dependent configurations. The system features 20 chiplets from two different foundries assembled over a passive silicon base (22mm×19mm, 1,650,000 bumps, 120 power domains) with a power of 20.52W at 300MHz, running deep learning inference at a peak of 72TOPS.

3:00 PM
2.5 A 16nm 5.7TOPS CNN Processor Supporting Bi-Directional FPN for Small-Object Detection on High-Resolution Videos
Yu-Chun Ding, National Tsing Hua University, Hsinchu, Taiwan
In Paper 2.5, National Tsing Hua University and Taiwan Semiconductor Research Institute present a 16nm CNN processor supporting bidirectional FPN for small-object detection on high-resolution videos, achieving 5.7TOPS for 26.4% COCO mAP on 896×896 inputs at 26.6fps with 1373mW at 1.09V.

3:35 PM
2.6 1.78mJ/Frame 373fps 3D GS Processor Based on Shape-Aware Hybrid Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler
Xiaoyu Feng, Tsinghua University, Beijing, China
In Paper 2.6, Tsinghua University presents a 28nm 2.43mm² 3D Gaussian splatting processor for multi-scenario rendering on edge devices. The work exploits dynamic reconfiguration through hybrid rasterizing and interpolating to achieve 1.96× higher peak throughput, while delivering 7.9× lower energy per frame than state-of-the-art 3D rendering accelerators.

4:00 PM
2.7 IRIS: A 8.55mJ/frame Spatial Computing SoC for Interactable Rendering and Surface-Aware Modeling with 3D Gaussian Splatting
Seokchan Song, KAIST, Daejeon, Korea
In Paper 2.7, KAIST presents a 28nm 20.25mm² 3D Gaussian splatting SoC for real-time interactive rendering. The spatial computing SoC achieves 72fps real-time rendering with 34× faster performance and 300× lower energy than an edge GPU.

4:25 PM
2.8 A 210fps Image Signal Processor for 4K Ultra HD True Video Super Resolution
Ying-Sheng Lin, National Taiwan University, Taipei, Taiwan
In Paper 2.8, National Taiwan University and Google present an image signal processor for true video super-resolution (VSR) that utilizes inter-frame dependencies. The 40nm 6.8mm² chip delivers 210fps for 4K UHD with 3.5× higher frame rate and 7.3× better energy efficiency than previous works.

4:50 PM
2.9 STEP: An 8K-60fps Space-Time Resolution-Enhancement Neural-Network Processor for Next-Generation Display and Streaming
Kai-Ping Lin, National Tsing Hua University, Hsinchu, Taiwan
In Paper 2.9, National Tsing Hua University and TSMC present a 16nm 8mm² space-time resolution-enhancement neural-network processor for next-generation display and streaming, achieving 8K-UHD 60fps inference with 10.2TOPS and 1425mW at 400MHz.

5:15 PM
2.10 A 0.52mJ/Frame 107fps Super-Resolution Processor Exploiting Pseudo-FP6 Sparsity for Mobile Applications
Xuyang Duan, Fudan University, Shanghai, China
In Paper 2.10, Fudan University presents a 22nm 6mm² low-power (0.52mJ/frame), high-throughput (107fps at 1K resolution) super-resolution processor. It uses a channel-number-adaptive caching strategy to reduce on-chip memory by 90%, a workload-balance engine cutting execution cycles by 64%, and a hybrid dataflow improving hardware utilization by 75%.

ISSCC 2025 / February 17, 2025 / 1:30 PM / DIGEST OF TECHNICAL PAPERS

2.1 "Zen 5": The AMD High-Performance 4nm x86-64 Microprocessor Core

Teja Singh1, Spence Oliver1, Sundar Rangarajan1, Shane Southard1, Carson Henrion2, Alex Schaefer1, Bret
t Johnson2, Sarah Bartaszewicz Tower1, Kathy Hoover1, Deepesh John1, Ted Antoniadis1, Shravan Lakshman1, Vibhor Mittal1, Brian Kasprzyk1, Ross McCoy1, Kurt Mohlman1, Anitha Mohan1, Hon-Hin Wong3, Daryl Lieu3, Russell Schreiber1, Sahilpreet Singh4, Nick Lance2, Darryl Prudich2, Justin Coppin2, Tim Jackson2, Anita Karegar2, Ryan Miller2, Sabeesh Balagangadharan5, James Pistole3, Wilson Li3, Michael McCabe3

1AMD, Austin, TX; 2AMD, Fort Collins, CO; 3AMD, Santa Clara, CA; 4AMD, Markham, Canada; 5AMD, Bangalore, India

Codenamed "Zen 5," AMD's next-generation, energy-efficient high-performance x86 core targets a wide array of client, server, and embedded markets. Fabricated in TSMC's 4nm FinFET process, the 55mm² core complex (CCX), shown in Fig. 2.1.1, contains 8.6B transistors across eight cores, each with a 1MB private L2 cache and a shared 32MB L3 cache. The "Zen 5" implementation supports configurable FP256 and FP512 data paths. The "Zen 5" family includes a "Zen 5c" variant with increased density and power efficiency for key markets.

The main design priorities for "Zen 5" are to improve per-core performance and energy efficiency, while maintaining a similar area footprint as the prior generation [1]. The "Zen 5" core delivers a 16% generational IPC increase in desktop PC applications [2] while supporting frequencies up to 5.7GHz [3]. The "Zen 5" micro-architecture focuses on balanced throughput enhancements across the pipeline to achieve the performance uplift while maintaining power and area efficiency. Figure 2.1.2 presents key feature comparisons against the previous "Zen 4" generation. Improved branch prediction is combined with front-end parallelism, in the form of dual instruction-fetch and micro-op cache pipes, to increase the dispatch width from six to eight micro-ops per cycle. To improve integer-based workloads, the ALU count increases to six units from the four found on all previous "Zen" generations. To make use of the additional execution hardware, a wider instruction window is implemented with a larger and higher-throughput scheduler, register file, and retire queue. A fourth address-generation unit is also inserted to support additional load and store pipes. The L1 data cache capacity is increased 50% to 48KB and 12 ways, while maintaining 4-cycle load-to-use performance. L2/L1 cache fill and victim bandwidth doubles to 64B per cycle. The design moved the majority of 8T SRAMs to 6T to increase density.

The "Zen 5" floating-point (FP) data path increases to 512b from 256b on "Zen 4," doubling the bandwidth for advanced vector extension (AVX512) instructions. Other notable FP changes include a doubling of dedicated FP registers, enhanced out-of-order scheduling capability, a reduction of the minimum FPADD latency to 2 cycles, and a doubling of the bandwidth for INT8, INT16, and BFLOAT16 AI formats. Through the use of improved on-die sensors, AC capacitance (CAC) monitors, and di/dt-based adaptive clocking, "Zen 5" can achieve full AVX512 performance at the peak core frequency. The FP implementation allows the upper 256b data path to be power gated at boot time. This reverts the FP to the double-pumped AVX512 implementation used on "Zen 4," achieving similar power efficiency by also using an independent mesh clock to gate the upper 256b data path [4]. This configurable implementation also makes it simple to create discrete 256b data path FP variants for product flexibility.

"Zen 5" fits 30% more transistors into the same CCX area as "Zen 4" via numerous advancements across core implementation, 4nm process technology, and cache design. The bulk of the area improvements are in cache design and stacking technology. "Zen 4" significantly improved the stacking area efficiency relative to "Zen 3," and "Zen 5" continues this trend with a solution that requires minimal core complex die (CCD) area. The "Zen 5" L2 is a fast, private cache with 1MB per core as in "Zen 4"; however, the bandwidth between the core and L2 is doubled from 32B/cycle to 64B/cycle. Additionally, the L2 associativity doubles from 8 ways to 16 ways. Logic device area is shrinking faster than 6T SRAM in modern process technologies, so cache floorplans must be increasingly flexible to maintain density. Despite the increased performance of the "Zen 5" L2, its density is improved by 11% (process-technology neutral) relative to "Zen 4" via the floorplan changes shown in Fig. 2.1.3. The "Zen 3" and "Zen 4" L3 "ring" fabric topology [5] is replaced by a mesh topology, which reduces latency and increases bandwidth, especially for configurations with higher core counts. New L3 macros are created to achieve higher density: L3 data macro capacity is doubled, and its density improved by 10%. Circuit techniques are used to mitigate the frequency loss that typically occurs when doubling the capacity of SRAM macros. The capacity of each tag macro is also doubled, which leads to an 11% improvement in density.

Another primary "Zen 5" objective is to drive an energy-efficient, high-IPC machine that scales across a variety of workloads. Driving high IPC and new features, such as FP512, typically comes with more transistors and is challenging for frequency/power attainment. Unlocking improvements in transistor density to reduce switching and leakage power is key to delivering an efficient, high-IPC, 5.7GHz design. In addition to density improvements enabled by the geometric process shrink from a 5nm process to a 4nm process, PD methodology innovations such as array efficiency, SRAM optimizations, standard-cell optimizations, design-technology co-optimization, and utilization improvements result in higher transistor density. Significant emphasis is placed on switching power, CAC, and leakage, with detailed tracking of power for each RTL feature. Use of numerous RTL and PD power-reduction methods enables "Zen 5" to reduce CAC/IPC cost. The physical design focuses on timing and power co-optimization in the synthesis, place, and route (SAPR) tools to reduce dynamic and leakage power. The "Zen 5" CAC breakdown is shown in Fig. 2.1.4. Macro power contribution is higher relative to "Zen 4" due to the larger array structures, which enable higher performance. Contributions from flop/gater/clock are lower. High emphasis is put on multi-bit flop banking optimizations, as well as attention to the flop cell palette and design usage. Clock-gater CAC is optimized with improved clock-gating cells and clock-tree synthesis (CTS) methodology. VT-swap algorithms incorporate optimizations to better balance leakage, power, and frequency attainment.

As with past "Zen" products [1], AMD utilizes the AM5 package for "Zen 5" desktop, comprised of the CCX, a system management unit (SMU), test/debug logic, and dual Infinity Fabric On-Package (IFOP) SerDes links. One or two CCDs are once again combined with a 6nm IO die (IOD) to deliver a broad range of "Zen 5"-based client desktop products, codenamed "Granite Ridge," ranging from 6-core to 16-core. An over-arching goal for "Zen 5" is to create a family strategy that extends the dynamic range and leverages the "Zen 5" core in a wider set of use cases [3]. Setting this strategy early on allows the team to push the "Zen 5" core into the client notebook product offering as well as desktop. Using versions of the "Zen 5" core with a 256b floating-point data path width, multiple CCX variants are developed for the "Strix" product family, including a 4 "Zen 5" core plus 16MB L3 variant and an 8 "Zen 5c" plus 8MB L3 variant [6], as shown in Fig. 2.1.5. These "Zen 5" family options are instrumental in allowing AMD to deliver "Zen 5" core performance into laptop products, with initial product offerings available at the same time as our desktop platforms, thus delivering optimal performance, power, and cost targets for multiple market segments.

"Zen 5" performance per watt is improved versus "Zen 4" by delivering more performance at fixed power. The "Zen 5" architecture is wider, resulting in more IPC and total CAC. Figure 2.1.6 [7] shows that 1T Cinebench R23 (CBR23) performance per watt improved at high and medium power envelopes. As power is reduced, the performance of the current and previous generations approaches a cross-over point at lower power envelopes. Client programs leverage "Zen 5c" cores, which have lower CAC, area, and frequency, to deliver better energy efficiency at the lower power envelopes and maximize nT performance. Figure 2.1.7 [8,9] shows the performance uplift of the 12-core 9900X vs. the 24-core 14900K and of the 8-core 9700X vs. the 20-core 14700K across a variety of workloads. The uplift varies from 1% for office productivity up to 40% for Handbrake, which makes use of AVX512 instructions. Despite having half or even less of the total core count, the 9900X and 9700X have a 13% and 17% geomean uplift over the 14900K and 14700K, respectively. Usage of high-efficiency cores and aggressive power-saving physical design techniques enables a 17% improvement in CBR23 nT IPC. "Zen 5" delivers on a product strategy that improves performance and efficiency, while providing configurable solutions that can scale from embedded to servers. "Zen 5" delivers a generational 16% increase in desktop performance, supports frequencies up to 5.7GHz Fmax, and enables a full 512b FP data path.

Acknowledgement: The authors would like to acknowledge and thank our talented AMD design teams around the world who contributed to "Zen 5."

Figure 2.1.1: "Zen 5" die photo.
Figure 2.1.2: "Zen 5" micro-ar
chitectural features vs. previous-generation "Zen 4".
Figure 2.1.3: L2 and L3 implementation.
Figure 2.1.4: "Zen 5" CAC breakdown.
Figure 2.1.5: "Zen 5" CCX variants.
Figure 2.1.6: "Zen 5" single-thread performance.

ISSCC 2025 PAPER CONTINUATIONS AND REFERENCES

Figure 2.1.7: Competitive analysis comparison.

References:
[1] B. Munger et al., "Zen 4: The AMD 5nm 5.7GHz x86-64 Microprocessor Core," ISSCC, pp. 38-39, 2023.
[2] GNR-03: Testing as of May 2024 by AMD Performance Labs. "Zen 5" system configured with: Ryzen 9 9950X, GIGABYTE X670E AORUS MASTER motherboard, Balanced, DDR5-6000, Radeon RX 7900 XTX, VBS=ON, SAM=ON, KRACKENX63 vs. "Zen 4" system configured with: Ryzen 7 7700X, ASUS ROG Crosshair X670E motherboard, Balanced, DDR5-6000, Radeon RX 7900 XTX, VBS=ON, SAM=ON, KRAKENX62, FixedFrequency=4.0GHz. Applications tested include: Handbrake, League of Legends, FarCry 6, Puget Adobe Premiere Pro, 3DMark Physics, Kraken, Blender, Cinebench (n-thread), Geekbench, Octane, Speedometer, and WebXPRT. System manufacturers may vary configurations, yielding different results. GNR-03.
[3] B. Cohen, M. Subramony, M. Clark, "Next Generation "Zen 5" Core," Hot Chips, 2024.
[4] T. Singh et al., "Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core," ISSCC, pp. 42-43, 2020.
[5] T. Burd et al., "Zen 3: The AMD 2nd-Generation 7nm x86-64 Microprocessor Core," ISSCC, pp. 54-55, 2022.
[6] T. Burd et al., "Zen 4c: The AMD 5nm Area-Optimized x86-64 Microprocessor Core," ISSCC, pp. 38-40, 2024.
[7] GNR-19: Testing as of Sept. 2024 by AMD Performance Labs on test systems configured as follows: AMD Ryzen 9 7950X, Ryzen 9 9950X, AMD "Splinter" reference motherboard, 2x8GB dual-channel DDR5-5600, Samsung 970 Pro SSD, Radeon RX 6800S graphics (461.4 driver), VBS & SAM off, Windows 10 Professional (x64) 20H2, on the following application: Cinebench R23 1T. Results may vary and are based on several factors, including system configuration, application, and software. GNR-19.
[8] GNR-06: Testing as of June 2024 by AMD Performance Labs on test systems configured as follows: AMD Ryzen 9 9900X CPU system: GIGABYTE X670E AORUS MASTER, Balanced, DDR5-6000, Radeon RX 7900 XTX GPU, VBS=On, SAM=On, KRACKENX63 vs. similarly configured Intel Core i9-14900K system: MSI MEG Z790 ACE MAX (MS-7D86), Balanced, DDR5-6000, Radeon RX 7900 XTX, VBS=On, SAM=On, KRAKENX63, Profile=Intel Default, on the following applications/games: Procyon Office, Geekbench 6.2.1 Single Core, Puget Photoshop (general), Blender (classroom TTC), Handbrake TTC, Borderlands 3, DOTA2, Hitman 3, Cyberpunk 2077, F1 2023, and Horizon Zero Dawn. All games tested at 1080p resolution. System manufacturers may vary configurations, yielding different results. GNR-06.
[9] GNR-07: Testing as of June 2024 by AMD Performance Labs on test systems configured as follows: AMD Ryzen 7 9700X CPU system: GIGABYTE X670E AORUS MASTER, Balanced, DDR5-6000, Radeon RX 7900 XTX, VBS=On, SAM=On, KRACKENX63 vs. similarly configured Intel Core i7-14700K system: MSI MEG Z790 ACE MAX (MS-7D86), Balanced, DDR5-6000, Radeon RX 7900 XTX, VBS=On, SAM=On, KRAKENX63, Profile=Intel Default, on the following applications/games: 7-Zip, Procyon Office Productivity, GeekBench 5.4.6 Single Core, Puget Adobe Photoshop, Handbrake, Borderlands 3, DOTA2, Hitman 3, Cyberpunk 2077, F1 2023, and Horizon Zero Dawn. All games tested at 1080p resolution. System manufacturers may vary configurations, yielding different results. GNR-07.
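A note on the aggregation behind the 13% and 17% figures in Paper 2.1: a geomean uplift is the geometric mean of the per-workload performance ratios (new score / old score). A minimal sketch of that computation follows; the ratio values below are hypothetical placeholders chosen only to span the 1%-to-40% per-workload range the paper reports, not AMD's measured scores:

```python
import math

def geomean_uplift(ratios):
    """Geometric mean of per-workload performance ratios (new/old)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical ratios only: Fig. 2.1.7's per-workload uplifts run from
# roughly 1.01 (office productivity) to 1.40 (Handbrake with AVX512).
ratios = [1.01, 1.08, 1.12, 1.15, 1.22, 1.30, 1.40]
print(f"geomean uplift: {(geomean_uplift(ratios) - 1) * 100:.1f}%")
```

A geometric mean is the conventional way to aggregate ratio data across benchmarks, since it treats a 2× gain and a 2× loss symmetrically and is independent of which system is chosen as the baseline.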
2.2 IBM Telum II: Next Generation 5.5GHz Microprocessor with On-Die Data Processing Unit and Improved AI Accelerator

Gerald Strevig1, Chris Berry2, Rahul Rao3, Noam Jungmann4, Michael Sperling2, Michael Becht2, Eduard Herkel5, Matthias Pflanz5, Pat Meaney2, Michael Romain2, Mark Cichanowski1, Amanda Venton1, David Wolpert2, Elazar Kachir4, Luke Hopkins2, Tim Bubb2, Andreas Arp5, Daniel Kiss5, Simon Büchsenstein5, Michael Wood2, Michael Spear1, Robert Sonnelitter2, Rajiv Joshi6

1IBM Systems, Austin, TX; 2IBM Systems, Poughkeepsie, NY; 3IBM Systems, Bangalore, India; 4IBM Systems, Tel Aviv, Israel; 5IBM Systems, Böblingen, Germany; 6IBM Research, Yorktown Heights, NY

The IBM Z microprocessor, Telum II, has been redesigned for the zNext system to improve performance and system capacity over the previous z16 system [1]. The system topology consists of four Dual-Chip Modules (DCM) per drawer, each composed of two central-processor (CP) chips. The system can be configured with up to four drawers and a total of 32 CP chips in a fully coherent shared-memory system. The CP (Fig. 2.2.7) is a 600mm² die containing 43B transistors and is designed in Samsung 5nm bulk technology [2]. It contains over 24 miles of wire and 165B vias spread across 18 layers of metal: 8 narrow-width layers for local interconnect, 8 medium-width high-performance layers, and two ultra-thick layers for off-chip signal routing and power/clock distribution. The combination of die-size increase and library improvements allowed us to keep 8 cores per CP while adding a new Data Processing Unit (DPU) onto the chip. Each CP operates at 5.5GHz. Each core has a 128KB L1 instruction cache and a 128KB data cache. Each CP also has 2 PCIe Gen5 ×16 interfaces, an M-BUS interface to the other CP on the DCM, and an X-BUS interface to every CP in the other 3 DCMs per drawer. There is 1 A-BUS interface to connect 6 out of the 8 CPs in each drawer to the other drawers in the system. The clock network is designed with a 1:1 resonant mesh covering most of the chip and three small asynchronous non-resonant meshes for the memory and PCIe interfaces. In addition, Telum II adds an on-chip voltage-control loop for improved dynamic voltage management, which maintains performance without requiring higher voltages across all workloads [3].

Key system capacity and performance improvements came from enhancements to the core and the increased cache size. Telum II increases the number of L2 cache instances from 8 to 10 and uses Samsung's high-density SRAM cell to grow the L2 cache by 40%, from 32MB on Telum to 36MB per L2 cache instance. Each processor core and the DPU have a private 36MB L2 cache, along with an extra floating L2, which are fully connected by a 352GB/s ring. The on-chip shared virtual L3 cache increases from 256MB to 360MB. A fully populated drawer now contains 2.88GB of virtual L4 cache, up from 2GB on z16. In addition to the high-density SRAM cell, the cache growth was enabled by a 20% core-area shrink from Telum, achieved through microarchitecture enhancements and technology. In addition to shrinking, the core added to the overall system performance through enhancements to branch prediction, I-cache prefetching, additional rename registers, and TLB optimization. The core physical design is constructed of 7 large, fully abutted floorplanned blocks where the logic boundaries are removed and the design is restructured. This methodology removes 2 levels of physical hierarchy, leading to efficient area and metal usage in the core [4,5].

Increasing the on-chip embedded memory is key to improving performance. The L2 cache SRAM design (Fig. 2.2.1) includes a hierarchical bitline readout scheme along with a negative-bitline (NBL) write-assist scheme, which is based off metal coupling for a negative boost [6]. The metal capacitor running over the SRAM region allows for minimal overhead and maximum density. To mitigate potential concerns with metal density and defects, a unique circuit and method is introduced which enables metal-capacitor short detection and repair. The short-detection circuit works by charging the NBL_MIRROR side of the NBL capacitor to VDD, discharging the NBL_BOOST side, then floating the NBL_BOOST side, and finally sampling it periodically. Figure 2.2.2 shows three scenarios: a low-resistance short, a high-resistance short, and no short. In the case of a short, the NBL_SHORT output of the detection circuit goes high, but it remains low when no short is present. The NBL assist can be applied per SRAM cell column to support different application needs and allow maximum flexibility and optimized power-performance. The SRAM cells and periphery logic are driven by a separate power supply. To reduce system and integration complexity, an on-chip voltage regulator is introduced which generates the SRAM stand-alone supply. The regulator is realized as a dual-loop analog design similar to [7], with a high-precision slow loop to set the DC value and distributed micro-regulators to respond to fast switching transients. Voltage noise is further suppressed via a predictive activation scheme. This allows the SRAM supply to be set to an optimal level which optimizes yield, power, and performance.

Telum II has increased on-drawer data speed from 20.8Gb/s to 29.3Gb/s by increasing the PHY frequency through a 3:1 clocking design (Fig. 2.2.3). The 3:1 clock is derived from the main grid by a counting circuit that suppresses two out of every 3 clock pulses through a local clock-gate circuit. The timing is verified through a methodology that handles multiple clock speeds in one run. This increases the raw link bandwidth on the X-BUS interface by 36%, from 47GB/s to 66GB/s. The system main memory now supports DDR5, as well as DDR4, with a maximum configuration of 16TB per drawer and a total of 64TB system memory. The M-BUS is a synchronous crossing connecting the high-speed ring from each CP on a DCM. This low-latency crossing contains 2 sets of independent interfaces with 347 signals in each direction and allows for 119GB/s bandwidth per interface.

Telum II builds on the on-chip AI accelerator introduced on Telum by adding support for low-precision INT8 data types [8]. This improves compute performance by 4× to 24TOPs per CP and up to 768TOPs per system. Telum II also adds the ability to utilize the ac
celerator of other chips in the system when the AI accelerator on the local CP is busy. The total AI compute capacity per drawer increases to 192TOPs, accessible to a single thread across the drawer, or about a 30× improvement per thread over Telum.

Telum II introduces a new coherently attached on-chip DPU whose initial usage is as an IO Engine. The IO Engine supports thousands of virtual-machine instances that allow applications to share IO ports and perform layered recovery, delivering both high-performance offload and industry-leading reliability. The new IO Engine is the first generation in which the adaptation and virtualization functions for storage and networking protocols are integrated into the CP. The DPU itself is responsible for the management and execution of the storage and networking protocol adapters across two IO drawer domains. The DPU consists of four key units: four programmable-element (PE) clusters, a hardware-based queue manager and dispatcher, two interface fabrics, and a set of data-mover engines. The PE clusters are the main workhorses for generating the high IO start rates and run the channel code that manages and executes the IO protocols. The DPU builds on the design point from z14 to not only deliver improved performance and responsiveness over the previous channel-engine design, but also reduced latency while increasing start rate. The queue manager (QM) and dispatcher provide support for scheduling work amongst the PE clusters. One interface fabric provides Memory-Mapped IO (MMIO) capability, sending MMIOs both to internal locations within Telum II and to adapters in the IO drawer. The 2nd interface fabric connects the PE clusters into the Telum II coherency fabric. The data-mover engines allow for efficient and high-speed movement of large data blocks within Telum II and to the IO drawer. The DPU floorplan (Fig. 2.2.4) consists of the four PE clusters on the outer edges, with the central area dedicated to the QM, dispatcher, interface fabrics, and data-mover engines. Much like the core, the DPU logic boundaries have been removed, but the restructuring methodology was enhanced to create 5 unique floorplanned blocks, where one block (the PE cluster) is reused 4 times.

The reliable 5nm technology of Telum II was developed in close cooperation with Samsung and is adapted to IBM's high-performance requirements. The introduction of performant, energy-efficient, SER-robust latches enables an efficient design for Telum II (despite latch-count growth of 40% from z16). Key reliability features of Telum II include comprehensive cache RAID protection and fine-grained self-healing to support large, robust caches, in addition to core recovery and sparing. Possible technology degradations are excluded by extensive manufacturing and stress tests (f, T, V). Comprehensive fault detection and monitoring guarantee uninterrupted availability and service in the event of a fault (99.999999% system uptime, or continuous operation, per year) [9,10].

The Telum II power profile is within 5% of the prior generation, while increasing both content and frequency. Significant content and features drive latch counts higher (1.4×), which increases local clock buffer (LCB) loading on the high-frequency clock mesh. Multiple low-power innovations contributed to absorbing the growth: innovative low-power latch designs; LCB consolidation, with a micro-gated version consolidating multiple enables onto a single LCB; and further clock-mesh load reduction by removing test circuits (scan) connected directly to the mesh with new LCB designs. These contributed 8% to an overall chip power savings of 14% [11]. These system power savings were achieved through the dedicated power-reduction initiatives listed prior and were facilitated by the reduced Vmin required to run at zNext's target frequency. This concept is demonstrated in Fig. 2.2.5; zNext exhibits significantly reduced voltages relative to z16 across a comparably normalized technology-delay range when running a similar coverage-stressing workload for a comparable duration. The reduction in Vmin is realized by a combination of the new technology (3%) and a notably improved design efficiency (3.5%). This improved design efficiency is also evident in the performance-per-watt data shown in Fig. 2.2.6 across multiple generations. In this figure, we can see that Telum II provides at least a 1.3× improvement in Instructions-Per-Cycle (IPC) per watt relative to z16's Telum chip.

Acknowledgement: The authors would
like to thank the IBM Z, EDA, Product Engineering, and Research teams for their contributions, and Samsung for wafer fabrication.

Figure 2.2.1: L2 cache SRAM cell instance physical design.
Figure 2.2.2: Metal-cap short-detection circuit and waveform.
Figure 2.2.3: X-BUS 3:1 clocking floorplan with off-chip signal routing.
Figure 2.2.4: Physical implementation of the Data Processing Unit (DPU).
Figure 2.2.5: Chip Vmin versus process delay.
Figure 2.2.6: Power-performance improvement by generation.
Figure 2.2.7: Telum II die photo.

References:
[1] O. Geva et al., "IBM Telum: a 16-Core 5+GHz DCM," ISSCC, pp. 46-48, 2022.
[2] J. Jeong et al., "Performance-Power Management Aware State-of-the-Art 5nm FinFET Design (5LPE) with Dual CPP from Mobile to HPC Application," IEEE IEDM, pp. 20.1.1-20.1.4, 2020.
[3] T. Webel et al., "Dynamic Guard Band Features of the IBM zNext System," ISSCC, 2025.
[4] J. Surprise et al., "S2L: Small to Large Cutting of Physical Design Hierarchy," IEEE/ACM DAC, 2020.
[5] R. M. Rao et al., "POWER10: A 16-Core SMT8 Server Processor With 2TB/s Off-Chip Bandwidth in 7nm Technology," ISSCC, pp. 48-50, 2022.
[6] N. Jungmann et al., "A 1.9GHz 0.57V Vmin 576Kb Embedded Product-Ready L2 Cache in 5nm FinFET Technology," IEEE Symp. VLSI Circuits, 2023.
[7] M. E. Perez et al., "Distributed Network of LDO Microregulators Providing Submicrosecond DVFS and IR Drop Compensation for a 24-Core Microprocessor in 14-nm SOI CMOS," IEEE JSSC, vol. 55, no. 3, pp. 731-743, 2020.
[8] C. Berry, "IBM Telum II Processor and IBM Spyre Accelerator Chip for AI," Hot Chips, 2024, pp. 1-29.
[9] W. J. Clarke et al., "Advancing Reliability, Availability, and Serviceability with the IBM z13," IBM J. of Research and Development, vol. 59, no. 4/5, pp. 9:1-9:11, 2015.
[10] C. Berry et al., "IBM z14: Processor Characterization and Power Management for High-Reliability Mainframe Systems," IEEE JSSC, vol. 54, no. 1, pp. 121-132, 2019.
[11] D. Wolpert et al., "IBM Telum I
107、I Processor Design-Technology Co-Optimizations f or Power,Perf ormance,Area,And Reliability,”I SSCC,2025.48 2025 IEEE International Solid-State Circuits Conf erenceISSCC 2025/SESSION 2/PROCESSORS/2.3979-8-3315-4101-9/25/$31.00 2025 IEEE2.3 Granit e Rapids-D:Int el Xeon 6 SoC for vRAN,Edge,Net workin
108、g,and St orage Raj R.Varada1,Rohini Krishnan2,Ajith Subramonia2,Rathish Chandran2,Kalyana Chakravarthy2,Uttpal D.Desai2,Sumedha Limaye2,Puneesh Puri2,David R.Mulvihill3,Mike Bichan4,Martin Koolhaas4,Vijayalakshmi Ramachandran5,Srinivasu Kondle5 1Intel,Santa Clara,CA 2Intel,Bengaluru,India 3Intel,For
109、t Collins,CO 4Intel,Toronto,Canada 5Intel,Chandler,AZ The Intel Xeon 6 SoC(codenamed Granite Rapids-D,GNR-D)is targeted towards the virtualized Radio Access Network(vRAN),edge server,networking,storage,and security segments.It targets two classes of compute requirements:one optimized towards compute
and is aimed at scalar and data-parallel workloads, which require low-latency, high-bandwidth memory, PCIe5 capabilities, and server-grade robustness, and the other optimized towards Edge SKUs, which require rugged operating conditions from −40°C to 85°C Tcase with full dynamic temperature-range support. It offers confidential AI-enabled security and scales across different systems, supporting multiple enterprise-grade Ethernet configurations and accelerators [1]. The SoC is a disaggregated design comprising one or more compute dies (C-die) with embedded memory and an IO die. The C-die(s) and IO die are integrated inside a BGA package. It supports a new ISA Accelerator (AiA) and AVX-512 with 5G. It can support diverse, dense, small form factors with 4-channel DDR5 Gen3 with 2 DIMMs per channel in the HCC version, or 8 channels in the XCC version of the SoC, with support for RDIMM, Memory Down, and SODIMM. The SoC IO die
contains a dense integration of hardware accelerators on a single die. It features 200G look-aside crypto, 80Gb/s compression, 160Gb/s decompression, an integrated vRAN Boost accelerator/forward-error correction, a media accelerator, a Dynamic Load Balancer (DLB), and a data-streaming accelerator. It supports self-boot, with no Platform Controller Hub (PCH) inside the package. The SoC offers an over 3× increase in core count and memory bandwidth, up to a 2.5× increase in IO performance, and up to a 2× increase in Intel Enterprise-Grade Ethernet throughput over the prior generation. The integrated AI accelerator and AMX
combine to offer an 8× speedup in ResNet-50 and a 6× speedup in vision transformers vs. the prior-generation chip. The SoC has a 1.8× gen-to-gen improvement on SIR benchmarks for aggregate performance, and the accelerators offer 2× performance, better power, and higher throughput over the prior generation. Figure 2.3.1 shows a high-level diagram. The IO die contains a network acceleration complex housing the Intel Enterprise-Grade Ethernet PHY and 3×16 PCIe lanes. The IO die connects to a Complex Programmable Logic Device (CPLD) via the Serial Peripheral Interface (SPI) and to the C-die(s) via the 2.5D chiplet integration technology described in [2]. The SoC offers 32 PCIe 5.0 lanes, with CXL 2.0 on 16 of those lanes, and 16 PCIe 4.0 lanes. It supports up to 8 Intel Enterprise-Grade Ethernet lanes and a maximum throughput of 200Gb/s, configured as 2×100, 4×50, or 8×25 ports based on the Ethernet MAC configuration. It supports PXE boot and wake from the S5 state through wake-on-LAN. It also supports the IEEE 1588 time-sync protocol and Synchronous Ethernet (SyncE) and can be configured as a 1588 time/SyncE master or a 1588 time/SyncE follower. It supports packet processing, classification, traffic-steering technologies, integrated access-control lists (ACL), receiver-side scaling (RSS), and a flow detector. An advanced scheduling module allows a multiple-layer hierarchical scheduler with dynamic updates, dual-rate shaping, strict priority, and Weighted Fair Queuing (WFQ). The media accelerator supports the full offload of transcoding for live OTT, VOD, and broadcast. Encode/decode operations support a choice of AVC/HEVC/AV1 at 1080p 30fps. The decode and scale/crop offload is ideal for video-inferencing workloads. The Intel vRAN Boost feature enables cost-effective 4G and 5G next-generation virtualized Radio Access Network (vRAN) solutions. The
enhanced vRAN Boost feature in GNR-D supports 4G and 5G next-generation deployments and includes a 5G low-density parity check and a 4G Turbo encoder/decoder with rate match/de-match. The test and debug solutions provide a way to qualify each die individually or assembled, by reusing test patterns and debug hooks. Muxes and IO signals allow the test and debug fabrics to operate within a die or to utilize the chiplet interconnect to enable a Die-to-Die (D2D) datapath on assembled parts. A parallel test interface (PTI) is accessible via DDR channels, and a single die provides scan data via packetized test data to access IPs on an assembled part. The PTI is on general-purpose IO (GPIO) to allow access to debug data from any of the dies. JTAG is implemented as a single controller for the system view and as two individual controllers for high-volume manufacturing. A significant reduction in tester time was achieved by enabling completely independent die testing in parallel, with no D2D dependencies on any clocks or reset signals. To optimize power management, the SoC combines integrated voltage regulators (IVR), similar to [3], and motherboard voltage regulators (MBVR). The VR choice was based on current requirements
of 3% for DC+AC noise for 56Gb/s Ethernet and PCIe5 data rates across an extended temperature range of −40°C to 105°C. This helped in saving the platform cost of an exclusive high-fidelity MBVR. To combat the challenge of a shrinking footprint, coaxial magnetic inductors are used for the IO-domain IVR and air-core inductors for the DDR IVR, resulting in a good tradeoff between high-Q inductors with a smaller footprint and increased IVR efficiency vs. noise. Low-noise crystal-oscillator and PLL designs allow the Ethernet and PCIe reference clocks to be fully integrated into the SoC, with a commodity 25MHz quartz crystal as the only external component. Design optimizations based on [4] in AC impedance, wire engineering, and power-delivery noise were employed to achieve the jitter spec with a phase-noise mask of −130dBc/Hz at 10kHz and −146dBc/Hz from 10MHz to 1GHz. The SoC uses the Intel PCIe 5.0/CXL, UPI, and
DMI 5.0 PHY, which supports up to 32Gb/s chip-to-chip, onboard, and backplane applications with 36dB loss in ×8 and ×4 modes. The SoC uses 48 lanes and has a maximum bandwidth of 189GB/s. The PCIe transceiver, shown in Fig. 2.3.2, has a 3-tap voltage-mode transmitter, an adaptive Continuous-Time Linear Equalizer (CTLE), and a multi-tap Decision Feedback Equalizer (DFE). The PHY supports SRIS clocking, platform needs for AC- and DC-coupled links, and low-swing support for lower power and shorter reach. The PHY uses microcontrollers for flexible adaptive algorithms and superior debug capabilities. The IP complies with all PCIe5 specifications and has best-in-class KPI metrics for performance, power, and datapath latencies (for CXL performance), all proven in silicon. It has aggressive area and shoreline costs achieved with innovative architecture, design, and process capabilities. The PCIe5 compliance transmit eye diagram is shown in Fig. 2.3.3. The SerDes is a highly flexible PHY capable of NRZ and PAM4 modulation, supporting 53.125Gb/s Ethernet and up to 58Gb/s. It uses a 16-tap analog DFE architecture to achieve an optimized power-performance tradeoff in a compact silicon area of 0.27mm² per lane. The IP is
available in a 1-lane variant for maximum lane independence and a 4-lane variant with a shared 16GHz clock for minimum area and power. Two PLLs covering a 2:1 range from 8-16.22GHz provide the transmit clocking, enabling a maximum NRZ data rate of 32.44Gb/s for JESD support. The receive clocking is provided by a single pseudo-differential 4-stage ring oscillator, shown in Fig. 2.3.4, with sophisticated dual-path bias control to achieve low jitter across the full 8-to-16.22GHz frequency range. The receiver front-end can support DC or AC coupling with channel loss from VSR to LR using an adjustable 0-6dB attenuator, mid-band transfer-function control, and a 3-stage CTLE. Extensive reference-clock distribution options are supported to enable any desired SoC-level configuration of SerDes lanes with up to six distinct reference clocks. The IP shows excellent post-silicon health and meets all
specifications. The PAM4 eye diagram is shown in Fig. 2.3.5. Each chiplet instantiates the D2D IO introduced in [5]. This is an ultra-high-bandwidth, low-latency, and low-power D2D interconnect. The SoC employs a power-performance-optimized instance of this IO, with 0.33-0.48pJ/b scalable with clock frequency. The SoC IO die is designed and manufactured using 4nm-class technology [6] and measures 31.90mm × 12.06mm. It is fully functional and performs to its specifications. Figure 2.3.7 shows the IO die photograph. The SoC integration methodology used bottom-up and top-down approaches to optimize power, performance, and area. Based on the SKU, the SoC is built using one IO die and up to two C-dies connected by 3-to-6 D2D bridges, as described in [2]. The disaggregation helps with overall yield and allows the use of multiple process nodes on the same SoC. Each D2D bridge is embedded into the package substrate, consuming
two package layers. All power delivery to the shadow of the bridge is routed via package layers and pass-throughs. Instead of the traditional approach of having one common superset pinout that caters to both package configurations, the SoC has been designed on a single platform with both scalable BGA SoCs. These are built in a modular fashion, with the smaller configuration (77.5mm × 50mm) pin-compatible with the bigger configuration's pinout (77.5mm × 56.5mm), shown in Fig. 2.3.6. This involved trade-offs and technological innovations around assembly/test keep-out zones, package thermals, and optimized Integrated Heat Spreader (IHS) designs for the top-side fit, and, in parallel, pin pattern/pitch (0.94-1.0mm hex), cavity, break-out, signal-integrity, and power-delivery optimizations for the bottom-side fit to accommodate the 15% pin delta between the two designs. Extensive pin optimization with the least number of general-purpose IOs, using novel concepts of package-pin overloading without affecting the SoC's functionality, is another salient feature that enabled the modular pin map. Acknowledgement: The authors thank all the individuals and teams that have worked relentlessly for the success of this Intel Xeon 6 SoC product.
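The Weighted Fair Queuing discipline named for the IO die's hierarchical scheduling module can be illustrated with a minimal, generic virtual-finish-time sketch. This is not Intel's implementation (the silicon additionally layers strict priority and dual-rate shaping on top), and all class and variable names below are hypothetical:

```python
# Generic WFQ sketch: each packet is stamped with a virtual finish time
# advanced by size/weight past the flow's previous packet, and the packet
# with the earliest finish time is served first.
import heapq

class WFQ:
    def __init__(self, weights):
        self.weights = dict(weights)                   # flow -> weight
        self.last_finish = {f: 0.0 for f in self.weights}
        self.heap = []                                 # (finish, seq, flow, size)
        self.seq = 0                                   # FIFO tie-breaker

    def enqueue(self, flow, size):
        # Heavier flows advance their virtual clock more slowly per byte,
        # so they are served proportionally more often.
        finish = self.last_finish[flow] + size / self.weights[flow]
        self.last_finish[flow] = finish
        heapq.heappush(self.heap, (finish, self.seq, flow, size))
        self.seq += 1

    def dequeue(self):
        # Serve the packet with the smallest virtual finish time.
        if not self.heap:
            return None
        _, _, flow, size = heapq.heappop(self.heap)
        return flow, size

sched = WFQ({"a": 2, "b": 1})      # flow "a" gets twice flow "b"'s share
for _ in range(3):
    sched.enqueue("a", 100)
    sched.enqueue("b", 100)
order = [sched.dequeue()[0] for _ in range(6)]
print(order)                       # ['a', 'b', 'a', 'a', 'b', 'b']
```

With equal packet sizes and a 2:1 weight ratio, flow "a" is dequeued twice for every dequeue of flow "b" over any sufficiently long window, which is the fairness property the hierarchical scheduler composes with shaping and strict priority.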
Figure 2.3.1: High-level diagram of the Intel Xeon 6 SoC for vRAN, edge server, networking, storage, and security. Figure 2.3.2: PCIe5 transceiver diagram. Figure 2.3.3: PCIe5 compliance Tx eye scope shot. Figure 2.3.4: 56Gb/s SerDes receive-side CDR oscillator. Figure 2.3.5: 56Gb/s SerDes PAM4 Ethernet eye scope shot. Figure 2.3.6: Intel Xeon 6 SoC vRAN/edge server die-package combinations showing die stitching.

ISSCC 2025 / February 17, 2025 / 2:20 PM

Figure 2.3.7: Intel Xeon 6 SoC IO die photo.

References:
[1] P. Mosur et al., “Built for the Edge: The Intel Xeon 6 SoC,” Hot Chips, 2024.
[2] R. Mahajan et al., “Embedded Multi-die Interconnect Bridge (EMIB): A Localized, High-Density Multichip Packaging Interconnect,” IEEE Trans. on Components, Packaging and Manufacturing Tech., vol. 9, no. 10, pp. 1952-1962, 2019.
[3] K. Bharath et al., “Integrated Voltage Regulator Efficiency Improvement using Coaxial Magnetic Composite Core Inductors,” IEEE Elec. Components and Tech. Conf., pp. 1286-1292, 2021.
[4] R. Krishnan et al., “Design Techniques for Reference Clock Jitter Optimization for High Speed PHYs,” IEEE Conf. on Design, Test, Technology of Integrated Systems, 2023.
[5] N. Nassif et al., “Sapphire Rapids: The Next-Generation Intel Xeon Scalable Processor,” ISSCC, pp. 42-46, 2022.
[6] B. Sell et al., “Intel 4 CMOS Technology Featuring Advanced FinFET Transistors Optimized for High Density and High-Performance Computing,” IEEE Symp. VLSI Circuits, pp. 282-283, 2022.

2.4 A 300MB SRAM, 20Tb/s Bandwidth Scalable Heterogeneous 2.5D System Inferencing Simultaneous Streams Across 20 Chiplets with Workload-Dependent Configurations

Srivatsa Rangachar Srinivasa1, Dileep Kurian1, Paolo Aseron1, Prerna Budhkar1, Arun Radhakrishnan1, Alejandro Cardenas Lopez1, Jainaveen Sundaram1, Vinayak Honkote1, Leonid Azarenkov1, Daniel Lake1, Jaykant Timbadiya2, Mikhail Moiseev1, Brando Perez Esparza3, Ronald Kalim1, Erika Ramirez Lozano3, Mukesh Bhartiya2, Sriram Kabisthalam Muthukumar2, Satish Yada1, Suresh Kadavakollu2, Saransh Chhabra2, Kamakhya Prasad Sahu2, James Greensky1, Xavier Brun4, Esdras Juarez Hernandez3, Robert Munoz4, Thomas Thomas1, Shawna Liff4, Vivek De1, Anuradha Srinivasan1, Tanay Karnik1
1Intel, Hillsboro, OR; 2Intel, Bangalore, India; 3Intel, Guadalajara, Mexico; 4Intel, Chandler, AZ

Disaggregating large systems has shown multifold advantages, especially with current application trends prompting a
shift towards chiplet-based architectures [1-3]. To meet increasing computing demands, 2.5D systems should allow greater interoperability across advanced technology nodes from multiple foundries, higher system memory capacity, higher IO counts, and scalable interconnect pitches. To further address the escalating memory-capacity demands and mitigate the bandwidth bottleneck prevalent in AI applications, chiplet systems should be capable of workload-tailored configurations at assembly time. This adaptability enables optimal resource allocation and facilitates the processing of voluminous datasets and complex AI computations. We present a unique 2.5D packaging approach to integrate multi-foundry active chiplets over a large passive silicon base. The system prototype features an Intel 4nm memory chiplet (MD) and a TSMC 16nm compute chiplet (CD), assembled on a UMC 130nm passive silicon base (22mm × 19mm) having 20 identical and modular chiplet slots (CS), with an entire-system power of 20.52W at 300MHz (Fig. 2.4.7). Along with their respective designs, both chiplets encompass five coherent components to enable integration over any available CS, thereby supporting heterogeneity. Multiple assembly-time configurations allow
flexible and adaptable solutions for a wide range of AI applications. Each CS in the passive silicon is identical to the interconnect template (Fig. 2.4.1) of the active chiplet, with bumps for chiplet attach. Additionally, various dimensions of multiple CS can be carved out on the passive silicon for similar-sized active chiplets to be integrated. We have designed two flavors of active chiplets: MD (3945.4µm × 4049µm) and CD (3709µm × 4000.32µm). The CD is equipped with PCIe Gen4 at the high-speed interface region and hence can communicate with an external host system. Figure 2.4.2 shows the passive silicon with its dual-side carrier, and the CD and MD processed at three different foundries and technology nodes. 165,000 bump bonds (24,304 signals) and 18,000 C4 bonds from the passive silicon to the package comprise 120 distinct power domains, along with PCIe routing to both the chiplet nearest to and farthest from the package pins. Figure 2.4.3 illustrates two CDs with PCIe links to the pins at the edge of the package. Due to variations in C4, TSV, bumps, routing signals, and IO logic, we observe 22% increased insertion loss in the farthest link and 16% TX return loss (31% RX) compared to the nearest link. A custom QPS circuit (Fig. 2.4.3), with 14 modular TX/RX clusters, each with 9 data lines (transmitting at 2Gb/s) accompanied by strobe and Data-Valid (DV), extends out of a chiplet through a 128b AXI fabric. Each TX cluster generates quadrature I/Q clocks from the incoming 2GHz PLL clock and serializes 36b of AXI data into 9 streams of high-speed data. The Q-clock (strobe), along with data and DV, is sent through a trimmable analog driver across a 1.2mm passive channel. At the RX cluster, programmable delay lines convert the data back into 36b chunks in the respective clock domain. The CD (Fig. 2.4.4) integrates: (i) a Tensilica LX7 processor, (ii) an H.264 media decoder, (iii) a PCIe 4 PHY and controller for external communication with a host processor, (iv) an AI accelerator with 2 INT8 TOPS, (v) a custom debug logic engine, (vi) a 3MB SRAM subsystem and Register Files (RF) for chiplet and system configuration, and (vii) test logic and GPIOs. An AXI-based reconfigurable router achieves a data rate of 168Gb/s between the CDs through 392 QPS signals, translating to physical metal routes via bumps to the passive silicon routing. The CD operates over a voltage (frequency) range of 0.65-0.9V (100-500MHz). The Intel 4nm MD (Fig. 2.4.4) integrates: (i) a Vision P6 (VP6) for DSP tasks, (ii) interrupt controllers, (iii) a DMA engine for bulk data transfers, (iv) 3 Near-Memory Compute (NMC) accelerators interfacing to the 16MB shared memory, and (v) custom debug logic. The GPIO count is common across both chiplets and adheres to the template. The MD has a dedicated region of clustered bumps for landing probes for testing (Fig. 2.4.7), while the CD has depopulated bumps for design consistency. Debug logic is also common between the CD and MD. Designed using open-drain IOs with multi-leader capability, 1:1 and/or 1:many transactions can be initiated across the entire 20-chiplet system, facilitating debug without the need for scan-chaining the entire system and thus minimizing separate debug package IOs. An AI engine within the MD accelerates deep-learning inferencing tasks with 8 INT8 TOPS. The MD can access the shared memory space of all 20 chiplets (300MB in the 18MD-2CD configuration) for storing weights, partial outputs, and other inference results without needing to invoke PCIe transactions to an external memory system. Figure 2.4.4 summarizes all architectural components of the active chiplets. The measurement setup for the prototype system comprises a host CPU connected to PCIe through package connections (Fig. 2.4.5), along with a display device connected through UART. We developed software to program the entire system through the host to offload computations. For measurement of AI inference tasks, the power domains of three MDs (7, 17, and 19 in Fig. 2.4.5) are combined, while external power is split at the package and fed to the 17 other chiplets. QPS functionality is vital for the entire system, and Fig. 2.4.5 shows the functional region between 400MHz and 2000MHz, with a minimum operating voltage of 800mV across the CD and MD. We successfully distributed a single ResNet50 model running inference on the ImageNet [8] dataset across three MDs. Optimal mapping of 1 stream of ResNet50 across the AI engines of 3 MDs results in 72 compute layers. Each layer has a 50MB memory requirement to store weights, bias, inputs, and outputs (FP16). The layer breakdown in Fig. 2.4.5 requires 956ms to execute at a 300MHz chiplet frequency. Measured current spikes show the execution time and the AI-engine idle time during data transfer. The 18MD-2CD configuration runs six simultaneous streams of inference at a rate of 6fps. Throughputs achieved for the three configurations are shown in Fig. 2.4.7. A different AI-centric chiplet, when integrated with the presented methodology, would result in even higher AI performance. With an increase in MD clock frequency, AI-engine power crosses 450mW with a throughput of 25TOPS/W and a marginal increase in MD sense-point voltage from the package source. Measured results are captured in Fig. 2.4.5. We demonstrate the effectiveness of distributing inference compute across three MDs by mapping a total of 54 layers (CNN and GEMM) to the AI engines of 3 MDs and the remaining 18 layers (including softmax) to the VP6 DSP. Top-5 classification results and accuracies generated by the system are shown in Fig. 2.4.6. 64% of the inference time is spent reordering and moving data across the three chiplets (2-cycle latency across MD7-MD17 and MD17-MD19, and 3-cycle latency across MD7-MD19). The AI engine is active for 28.3% of the time, running at 24,576 Ops/cycle, or 12 INT8 TOPS. Softmax is the last stage of inference.

A (>5W) edge GPU [12] requires 20 minutes for SAM and 400ms for IPR due to its massive external memory access (EMA) and huge computational requirements. In general, there are three hardware challenges in realizing a mobile 3DGS accelerator. First, the parameters of 3DGS (50MB for
over 500K Gaussians) far exceed the L2 cache size of an edge GPU (4MB on the Jetson Orin Nano [12]), and storing them in an unordered manner incurs frequent irregular DRAM accesses [14, 19]. Its EMA occupies 81.3% of the system energy consumption, even after parameter-pruning schemes are applied [16]. Thus, the Single-Embedding-Multi-MLP (SEMM)-Compressed 3DGS, based on a state-of-the-art MLP-based compression approach [15], is adopted, but EMA still constitutes 52.8% of system energy. Second, spatial computing requires IPR to be accelerated to within a 15ms latency to provide real-time user interaction [23]. However, the
three sequential stages of IPR (deform, reflect, and render) require 2.33× more processing time compared to non-interactive rendering on previous 3DGS hardware [13]. Third, backpropagation (BP) consumes 55.8% of the computations in SAM due to FP16 high-precision calculation and limited reusability. In this paper, IRIS, an Interactable Real-Virtual Interface SoC, is proposed to enable real-time ISC on mobile devices by employing 3 solutions: 1) Gaussians are mapped onto specific surfaces to speed up IPR and reduce the energy consumption caused by EMA; Coplanar-Gaussian Cluster (CPGC) and Spatial-Temporal Gaussian Order Reuse (STOR) eliminate redundant Gaussian fetching and time-consuming Gaussian sorting. 2) A Reconfigurable ALP-based Multiply-Accumulator Array (RAMA) realizes floating-point (FP) multiply-accumulation (MAC) with high energy efficiency and throughput. 3) An Error Direction Cache (EDC) eliminates redundant BP computations; by storing the polar-decomposed error, the EDC secures a high hit ratio with minimal resources. Figure 2.7.2 shows the overall chip architecture of IRIS, comprising a Surface Perception Unit (SPU), 4 SEMM-Compressed 3D Gaussian Splatting (SCGS) clusters, 512KB global memory with near-memory SIMD
, and a top RISC-V controller. The surface extractor and deformer handle surfaces before and after interaction, respectively. The surface rasterization core selects the Gaussians to be loaded corresponding to each pixel tile. Gaussians loaded from external DRAM are sorted in the Gaussian Sorting Unit (GSU). The 4 RAMA cores within each SCGS cluster perform harmonic-coordinate (HC) calculation for deformation and matrix multiplications for 3D Gaussian feature generation. The EDC reuses the direction of errors generated during the modeling process and otherwise functions as a regular scratchpad. The peripheral circuits in the global memory update weight parameters with ADAM. Figure 2.7.3 describes the detailed architecture of the SPU, which mitigates the EMA bottleneck of 3DGS. At regular frame intervals during SAM, the SPU rearranges updated 3D Gaussians to enable regular DRAM access as follows. First, the MarchCube engine generates surfaces at locations where the number of Gaussians exceeds a threshold. Then, for each Gaussian, the CPGC unit calculates L2 distances to all surfaces, and each Gaussian is stored in the surface buffer Sk corresponding to the closest surface. Clustered Gaussians in Sk are denoted Sk-CPGC, and the GSU sorts and stores them in DRAM along the surface-normal direction. Surface memory stores a pointer to external DRAM for every Sk-CPGC, enabling the rasterization core to access DRAM in a regular manner during every frame's rendering and modeling process. As
a result, a 45.7% reduction in external memory latency is achieved by clustering Gaussians into a contiguous memory space. In addition, STOR reduces 3DGS latency and EMA power consumption by exploiting redundancy in the Gaussian sorting results. First, the incident angle (θ) between the surface normal and the camera view ray is calculated to determine spatial reusability. If θ is smaller than a threshold, the Sk-CPGC loaded from DRAM is spatially reused without any additional GSU operation. If θ is close to 90°, the angle change across neighboring frames is evaluated to determine whether temporal reuse can be applied or
not. If the change is below a threshold, cached sorting results from frame T are read from the Gaussian Order Buffer (GOB) instead of performing the GSU operation for frame T+1; otherwise, the GSU reorders the CPGC and updates the GOB. Leveraging these two reusability factors, STOR skips an average of 81% of the sorting tasks. Overall, CPGC and STOR decrease the end-to-end preprocessing latency by 61% and EMA power consumption by 59%. Figure 2.7.4 shows the two functional units of RAMA, which form a fused-MAC architecture: a Reconfigurable FP Multiplier (RFM) and a Hybrid Aligned-Mantissa Adder Tree (HAMAT). The RFM handles the entire set of operations in IPR using a linear-plane approximation for mantissa multiplication [22]. It maintains high efficiency even at FP16 high precision, and high accuracy with approximation-aware training [24]. The RFM is composed of a comparator, a shifter, and (k+1) full adders (FAs) for a k-bit mantissa. It
achieves 4.5× higher energy efficiency and 2.6× higher area efficiency compared to a conventional FP multiplier with (k+1)² FAs. An RFM unit supports either two ALP-FP8 operations or one ALP-FP16 operation. It can also be reconfigured as an accurate FP8 multiplier with compensation. HAMAT, on the other hand, accumulates the 12b partial products (PPs) from the RFMs. HAMAT is primarily composed of a 32-way adder tree (AT) and an OR-approximated AT (OAT) for the MSB and LSB parts (6b each), respectively. The MSB parts show repetitive columns (RC) due to the sign-extended bits, so a shifter-based RC accumulator (SRA) is proposed. The SRA replaces numerous shift-and-adds with just two shifters and a subtractor. Bits not included in the RC are accumulated through 6 ATs, and power consumption is minimized by suppressing bit-flips through RC-gating. The LSB parts exhibit error robustness; therefore, an OR approximation
is used for the first stage of the AT. The LSB parts are encoded to boost bit sparsity when it is below 50%, reducing both error and power consumption. Consequently, RAMA achieves 2.36× higher area efficiency and 2.78× higher energy efficiency compared to an accurate FP MAC, without rendering-quality degradation. Figure 2.7.5 shows the SCGS cluster, consisting of RAMA and EDC, which accelerates both IPR and SAM. The two major IPR workloads show different operational characteristics: 1) HC computation requires iterative feature aggregation with high input reusability; 2) feature generation involves large batch matrix multiplications with high weight reusability. Since they occur in a sequential manner, a heterogeneous architecture suffers a huge (39%) utilization drop. Therefore, RAMA can be configured into two modes: 1) for HC computation, node features are temporally reused through the systolic datapath
; 2) for matrix multiplication, each RFM is configured as a weight-stationary MAC. As a result, core utilization is enhanced by about 29%p by the reconfigurable homogeneous architecture. Meanwhile, to accelerate modeling, the EDC reduces BP computations by reusing the error direction. The similarity of local Gaussian features inherently suggests that the error in the feature map (FMap) during the modeling process will also exhibit similarity. However, reusing the full error vector leads to significant storage-space requirements and accuracy loss due to the high precision and accuracy sensitivity. Therefore, the EDC polar-decomposes the error and stores only the directional component for reuse. The cache key minimizes accuracy loss through dual verification, quantizing both the output magnitude and the sign pattern of the FMap. Consequently, a cache size of only 4KB per layer achieves a 79% hit ratio with less than 0.1dB of accuracy loss. As a result, the EDC reduces the overall computation energy consumption in modeling by 33% through a 73% reduction in BP computation. Figure 2.7.6 shows the measurement results of IRIS and a comparison table. It is evaluated with 3D object datasets [18, 20]. IRIS enables 34.3× faster IPR speed, on average, with only 1.6GB/s external memory BW, compared to an edge GPU [12] with 34GB/s external memory BW. Along with high rendering throughput, it achieves 300× smaller energy consumption per frame, including external memory access, compared to the edge GPU, while dissipating 616mW. IRIS also achieves 142.1fps and 554mW power consumption on static rendering, reflecting 2.55× higher energy efficiency with better rendering quality compared to [8]. For modeling, IRIS achieves 67.1 iterations/s, dissipating 584mW. IRIS is fabricated in 28nm CMOS technology and occupies 20.25mm², as shown in Fig. 2.7.7. In conclusion, we propose a graphics SoC, IRIS, that enables real-time interactive spatial computing on mobile devices through interactive photorealistic rendering and surface-aware modeling.

Figure 2.7.1: Overview of the proposed interactive spatial computing system and design challenges. Figure 2.7.2: Overall architecture. Figure 2.7.3: Surface Perception Unit with Coplanar-Gaussian clustering to speed up interactive rendering and spatial-temporal Gaussian order reuse to eliminate redundant sorting. Figure 2.7.4: Proposed fused MAC unit consisting of a reconfigurable FP multiplier and a hybrid aligned-mantissa adder tree for high area and energy efficiency. Figure 2.7.5: Proposed homogeneous architecture for high core utilization and Error Direction Cache with polar-decomposed key for back-propagation acceleration. Figure 2.7.6: Measurement results and performance comparison table.

ISSCC 2025 / February 17, 2025 / 4:00 PM

Figure 2.7.7: Chip micrograph and performance summary.

References:
[1] M. Walker et al., “Virtual, Augmented, and Mixed Reality for Human-Robot Interaction: A Survey and Virtual Design Element Taxonomy,” ACM Trans. on Human-Robot Interaction, vol. 12, no. 4, 2023.
[2] Z. Zhong, Y. Wei, L. C. Go and J. Gu, “A Sub-1µJ/class Headset-Integrated Mind Imagery and Control SoC for VR/MR Applications with Teacher-Student CNN and General-Purpose Instruction Set Architecture,” ISSCC, pp. 544-545, 2024.
[3] J. Chen et al., “A 3D Mixed Reality Interface for Human-Robot Teaming,” IEEE ICRA, 2024.
[4] D. Dai et al., “Interactive Mixed Reality Rendering in Holographic Pyramid,” IEEE Conf. on Virtual Reality and 3D User Interfaces, pp. 483-492, 2022.
[5] M. Nikita, “Create Stunning Renders Using V-Ray in 3ds Max: Guiding the Next Generation of 3D Renderers,” CRC Press, 2021.
[6] O. Sorkine and M. Alexa, “As-Rigid-as-Possible Surface Modeling,” Symposium on Geometry Processing, pp. 109-116, 2007.
[7] D. Han, J. Ryu,
S. Kim, S. Kim and H.-J. Yoo, “MetaVRain: A 133mW Real-Time Hyper-Realistic 3D-NeRF Processor with 1D-2D Hybrid-Neural Engines for Metaverse on Mobile Devices,” ISSCC, pp. 50-51, 2023.
[8] J. Ryu et al., “NeuGPU: A 18.5mJ/Iter Neural-Graphics Processing Unit for Instant-Modeling and Real-Time Rendering with Segmented-Hashing Architecture,” ISSCC, pp. 372-373, 2024.
[9] K. Gao et al., “NeRF: Neural Radiance Field in 3D Vision, A Comprehensive Review,” arXiv preprint arXiv:2210.00379, 2022.
[10] B. Kerbl et al., “3D Gaussian Splatting for Real-Time Radiance Field Rendering,” ACM Trans. Graph., vol. 42, no. 4, 2023.
[11] J. Huang and H. Yu, “GSDeformer: Direct Cage-based Deformation for 3D Gaussian Splatting,” arXiv preprint arXiv:2405.15491, 2024.
[12] NVIDIA Inc., NVIDIA Jetson Orin Nano, https:/
[13] J. Lee et al., “GSCore: Efficient Radiance Field Rendering via Architectural Support for 3D Gaussian Splatting,” ACM ASPLOS, pp. 497-511, 2024.
[14] S. Kim, J. Lee, D. Im and H.-J. Yoo, “PNNPU: A 11.9 TOPS/W High-speed 3D Point Cloud-based Neural Network Processor with Block-based Point Processing for Regular DRAM Access,” IEEE Symp. VLSI Circuits, 2021.
[15] Y. Chen et al., “HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression,” ECCV, 2024.
[16] P. Papantonakis et al., “Reducing the Memory Footprint of 3D Gaussian Splatting,” ACM Comput. Graph. Interact. Tech., vol. 7, no. 1, 2024.
[17] Micron DDR4 SDRAM Power Calculation, https:/
[18] H. Caesar et al., “nuScenes: A Multimodal Dataset for Autonomous Driving,” CVPR, 2020.
[19] N. Chatterjee et al., “Managing DRAM Latency Divergence in Irregular GPGPU Applications,” IEEE Int'l Conf. for High Performance Computing, Networking, Storage and Analysis, pp. 128-139, 2014.
[20] B. Mildenhall et al., “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” Commun. of the ACM, vol. 65, no. 1, pp. 99-106, 2020.
[21] L. Liu et al., “Neural Sparse Voxel Fields,” Advances in Neural Information Processing Systems, vol. 33, pp. 15651-15663, 2020.
[22] M. Imani et al., “ApproxLP: Approximate Multiplication with Linearization and Iterative Error Control,” ACM/IEEE DAC, 2019.
[23] M. S. Elbamby et al., “Toward Low-Latency and Ultra-Reliable Virtual Reality,” IEEE Network, vol. 32, no. 2, pp. 78-84, 2018.
[24] Z. Niu et al., “A Logarithmic Floating-Point Multiplier for the Efficient Training of Neural Networks,” IEEE GLSVLSI, pp. 65-70, 2021.
[25] S. Guo, S. Sapatnekar and J. Gu, “A 28nm Physical-Based Ray-Tracing Rendering Proc
ISSCC 2025 / SESSION 2 / PROCESSORS / 2.8

2.8 A 210fps Image Signal Processor for 4K Ultra HD True Video Super Resolution

Ying-Sheng Lin1, Jun Nishimura2, Chia-Hsiang Yang1
1National Taiwan University, Taipei, Taiwan; 2Google, Mountain View, CA

Video super-resolution (VSR) aims to convert low-resolution (LR) videos to high-resolution (HR) videos with high image quality [1]. It can be used for various video applications, such as streaming and conferencing, given low-bandwidth connectivity. Figure 2.8.1 shows a neural-network (NN)-based VSR framework that comprises alignment and refinement stages [2]. Optical flow, which indicates the motion of the pixels, is used to align the objects between two frames and to generate a coarse HR frame. The coarse HR frame is then refined to produce the final HR frame. Videos are usually encoded in a compact frame-based format for efficient storage and transmission. Frames are categorized into several types (I, P, and B): an I frame is encoded without referencing other frames, a P frame is encoded by referencing previous frames, and a B frame is encoded by referencing both previous and future frames. Within each frame, blocks are also encoded by block prediction to reference similar blocks in other frames. The location of the reference block is indicated by a motion vector, and the pixel difference is encoded as a residual. Dedicated processors have been proposed to deliver high throughput for VSR [3-5]. However, prior works treat video as a collection of independent images [3-4] or raw video [5], without considering the inter-dependency of frames. In the multi-image workflow shown in [5], extracted features of adjacent frames are reused, but the temporal and spatial redundancies embedded in video frames are neglected for VSR acceleration. This work demonstrates an image signal processor for true video super-resolution with high frame rate and energy efficiency.
Figure 2.8.2 shows the proposed VSR workflow and algorithm-architecture optimizations. By adopting frame-based VSR, motion vectors and residual blocks can be leveraged to reuse previous HR frames. Block prediction is utilized to generate the coarse HR frame without calculating the optical flow in the alignment stage. Additionally, an adaptive computation strategy is proposed to support the various frame types: full VSR NN inference is adopted for I frames, block prediction is used for alignment for P frames, and only block prediction is performed for B frames. This strategy reduces the computational complexity by 93%. Content-aware processing is also adopted to leverage video redundancy. Blocks with temporal/spatial redundancy, which can be refined more easily, are identified first, and their locations are marked by masks. To identify temporal redundancy, the magnitude of the residual blocks is used to identify the static background: a smaller residual magnitude indicates a higher similarity between the blocks generated by VSR NN inference and those predicted by block prediction. For regions with high similarity, previous HR frames can be reused without the need for block refinement. To identify spatial redundancy, this work applies the Laplacian operator [6] to the input frames; a flatter region (with a smaller variation in magnitude) exhibits a smaller Laplacian magnitude. Refinement of flatter regions can be skipped, and simple interpolation can be used to achieve comparable VSR quality. By leveraging the temporal and spatial redundancies, the computational complexity is reduced by 46%.
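The mask-generation step described above can be sketched as follows; the block size, thresholds, and the 4-neighbour Laplacian stencil here are illustrative assumptions, not the parameters used on the chip:

```python
import numpy as np

def redundancy_masks(residual, frame, block=4, t_res=2.0, t_lap=1.0):
    """Per-block skip masks: temporal redundancy from residual magnitude,
    spatial redundancy from Laplacian magnitude (thresholds illustrative)."""
    # 4-neighbour Laplacian of the input frame
    lap = np.abs(-4 * frame
                 + np.roll(frame, 1, 0) + np.roll(frame, -1, 0)
                 + np.roll(frame, 1, 1) + np.roll(frame, -1, 1))
    h, w = frame.shape
    bh, bw = h // block, w // block
    res_mag = np.abs(residual).reshape(bh, block, bw, block).mean((1, 3))
    lap_mag = lap.reshape(bh, block, bw, block).mean((1, 3))
    temporal = res_mag < t_res  # static background: reuse previous HR frame
    spatial = lap_mag < t_lap   # flat region: interpolate instead of refining
    return temporal, spatial

rng = np.random.default_rng(0)
frame = rng.standard_normal((16, 16))
residual = np.zeros((16, 16))        # fully static toy input
t, s = redundancy_masks(residual, frame)
print(t.all())  # all blocks temporally redundant -> True
```

Blocks flagged in either mask are then served by reuse or interpolation instead of full NN refinement.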
Figure 2.8.3 shows the system architecture of the proposed image signal processor for VSR, consisting of a reconstruction engine, a redundancy detector, a convolution engine, and a tensor engine. The reconstruction engine performs block prediction using previous HR frames based on the motion vectors and residual blocks. A preloading scheme is proposed to preload a set of adjacent blocks, called a macroblock, and to reduce the latency of external memory access (EMA). The reference blocks are fetched by pixel samplers and processed by prediction-unit (PU) arrays. The redundancy detector identifies redundancies in video frames and calculates the masks based on the Laplacian and residual magnitudes. The convolution engine includes a content-aware processing element (CAPE) array that employs hardware parallelism to accelerate the multiply-accumulate (MAC) operations for VSR NN inference. For each CAPE, the masks obtained from the Laplacian magnitude (from the redundancy detector) are utilized to minimize redundant computations. The multi-dimensional data are treated as tensors and divided into several tiles. Since a fixed tile size may cause higher EMA (due to the correlation between data redundancy and data reuse), dynamic tiling is adopted to adjust the tile size, thereby minimizing EMA. The tensor engine performs the non-convolution operations for VSR, such as pooling, concatenation, and upsampling.
Figure 2.8.4 shows the detailed architecture of the proposed CAPE. For VSR NN inference, the feature maps need to be upsampled, usually by pixel shuffling, in which the data in the channel dimension are rearranged along the planar dimension in an interleaved manner. A flatter input block, which exhibits a smaller variance, tends to have a smaller Laplacian magnitude; it is therefore feasible to skip the associated convolution computations along the channel dimension by applying simple interpolation. Each CAPE consists of a channel-wise interpolation (CI) unit and 8 accumulators (ACC), processing 16 input channels and 8 output channels. Each ACC includes 16 multipliers and an adder tree to perform the MAC operations involved in convolution. Each CI unit, equipped with 16 interpolation units (IU), performs either 4× or 16× channel-wise interpolation to generate feature maps: blocks with smaller Laplacian magnitudes are skipped or obtained by 16× interpolation along the channel dimension, while blocks with larger Laplacian magnitudes are obtained by 4× interpolation to achieve better VSR quality. A 4×4 block size is selected to match the dimension of the CAPE array, thereby achieving 100% hardware utilization. Channel accumulation can be performed adaptively depending on the block type to minimize the latency. By exploiting spatial redundancy in the channel direction, the latency is reduced by 37% compared to the baseline design.
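The pixel-shuffling rearrangement mentioned above can be sketched with the standard depth-to-space reshape (a generic formulation of the operation, not the CAPE datapath itself):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Depth-to-space: (C*r*r, H, W) -> (C, H*r, W*r), interleaving
    channel groups into the planar dimensions."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    return (x.reshape(c, r, r, h, w)
             .transpose(0, 3, 1, 4, 2)   # -> C, H, r, W, r
             .reshape(c, h * r, w * r))

x = np.arange(16).reshape(4, 2, 2)  # 4 channels of 2x2 -> 1 channel of 4x4
y = pixel_shuffle(x, 2)
print(y.shape)  # (1, 4, 4)
```

Because channels map directly onto output pixels, skipping channel-dimension computations for flat blocks (as the CI unit does) directly reduces the work per upsampled pixel.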
Figure 2.8.5 shows the optimizations for macroblock preloading and dynamic tiling. For block prediction, massive EMA is required, since the locations of the reference blocks pointed to by the motion vectors are not fixed. In the baseline design, the reference blocks need to be loaded from external memory individually, resulting in a high EMA latency. The correlation between the reference blocks of adjacent blocks can be exploited: a macroblock with multiple blocks can be preloaded to reduce the number of memory accesses. Preloading more macroblocks can achieve a lower latency, but this may increase EMA since more irrelevant reference blocks are fetched. According to our experiments, a maximum 65% reduction in EMA latency can be achieved by preloading 4 macroblocks. By leveraging the video redundancy, some of the blocks can be skipped entirely. Given a smaller tile size, the redundant blocks can be skipped entirely, but the data reuse rate is low for feature maps with low redundancy. A larger tile size can be selected for feature maps with low redundancy, but some redundant blocks, which should be skipped, are loaded as well. In this work, a tiling unit is designed to support various tile sizes (4×4 and 16×16 pixels), given a 4×4 block size. The redundancy mask is utilized to calculate the indexes of the blocks that need to be processed. These indexes allow the convolution engine to skip redundant blocks, enabling support of a larger tile size. The tile size can be determined dynamically based on the redundancy characteristics to minimize the EMA. EMA can be reduced by 56% by applying such a dynamic tiling strategy.
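The tile-size trade-off can be illustrated with a toy external-access model (the fetch policy and byte counts are assumptions for illustration, not the measured EMA model):

```python
def ema_bytes(mask, tile, block=4, bytes_per_block=64):
    """Estimated external accesses: a tile is fetched unless every
    block inside it is redundant (illustrative cost model)."""
    n = tile // block                      # blocks per tile edge
    h, w = len(mask), len(mask[0])
    total = 0
    for i in range(0, h, n):
        for j in range(0, w, n):
            blocks = [mask[a][b] for a in range(i, i + n)
                                 for b in range(j, j + n)]
            if not all(blocks):            # fetch unless fully redundant
                total += n * n * bytes_per_block
    return total

# 8x8 block mask with scattered (checkerboard) redundancy
mask = [[(i + j) % 2 == 0 for j in range(8)] for i in range(8)]
print(ema_bytes(mask, tile=4), ema_bytes(mask, tile=16))  # 2048 4096
```

With scattered redundancy the small tile wins (every redundant block is skipped), while for low-redundancy feature maps the large tile amortizes fetches better; choosing per feature map is the essence of dynamic tiling.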
Fabricated in 40nm CMOS, the chip integrates 4.8M logic gates in a 4.4mm² core area with 346KB of on-chip SRAM. The maximum operating frequency is 200MHz, with a power dissipation of 344mW from a 0.98V supply. Figure 2.8.6 shows clips evaluating video quality. This work supports 2× and 4× VSR upsampling rates. The processing latency is reduced by 97.6% with a 0.2dB peak SNR (PSNR) degradation compared with the baseline design. The proposed processor achieves up to 34TOPS peak performance at a clock frequency of 200MHz, delivering a throughput of 210fps (with a 4× upsampling rate) for 4K ultra-high-definition (Ultra HD) resolution. The chip achieves a maximum energy efficiency of 163.2TOPS/W and a maximum area efficiency of 7.73TOPS/mm². Compared to prior art [3-5], this work supports true VSR and leverages the inter-frame dependency of encoded video, achieving a 3.5× higher frame rate. The chip achieves 7.3× higher energy efficiency and 6.9× higher area efficiency. This work provides an energy-efficient solution for high-speed, high-resolution VSR on mobile devices. Figure 2.8.7 shows the chip micrograph and summary.
Acknowledgement: This work is supported by the Google Silicon Research Program, the NSTC of Taiwan, and the Intelligent & Sustainable Medical Electronics Research Fund at NTU. The authors also thank TSRI for technical support on chip implementation and Elsa Lin for helpful discussions.

Figure 2.8.1: Video super-resolution (VSR) and illustration of true VSR.
Figure 2.8.2: Proposed VSR workflow and algorithm-architecture co-optimizations.
Figure 2.8.3: System architecture of the proposed image signal processor for true VSR.
Figure 2.8.4: Proposed CAPE architecture with adaptive channel accumulation.
Figure 2.8.5: Design details of the macroblock preloading scheme and dynamic tiling.
Figure 2.8.6: Performance evaluation and comparison with state-of-the-art designs.

ISSCC 2025 / February 17, 2025 / 4:25 PM
Figure 2.8.7: Chip micrograph and summary.

References:
[1] H. Liu et al., “Video Super-Resolution Based on Deep Learning: A Comprehensive Survey,” Artificial Intelligence Review, pp. 5981-6035, 2022.
[2] S. M. S. Mehdi et al., “Frame-Recurrent Video Super-Resolution,” IEEE CVPR, pp. 6626-6634, 2018.
[3] J. Lee et al., “A Full HD 60fps CNN Super Resolution Processor with Selective Caching based Layer Fusion for Mobile Devices,” IEEE Symp. VLSI Circuits, pp. C302-C303, 2019.
[4] K. Goetschalckx et al., “DepFiN: A 12nm, 3.8TOPS Depth-First CNN Processor for High-Res. Image Processing,” IEEE Symp. VLSI Circuits, 2021.
[5] K.-P. Lin et al., “VISTA: A 704mW 4K-UHD CNN Processor for Video and Image Spatial/Temporal Interpolation Acceleration,” ISSCC, pp. 48-49, 2023.
[6] X. Wang, “Laplacian Operator-Based Edge Detectors,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 29, no. 5, pp. 886-890, 2007.

ISSCC 2025 / SESSION 2 / PROCESSORS / 2.9

2.9 STEP: An 8K-60fps Space-Time Resolution-Enhancement Neural-Network Processor for Next-Generation Display and Streaming
Kai-Ping Lin1, Tong Wu1, Chang-Pao Lin1, Po-Wei Chen1, Zhi-Jun Zhang1, Win-San Khwa2, Meng-Fan Chang2, Chao-Tsung Huang1
1National Tsing Hua University, Hsinchu, Taiwan; 2TSMC Corporate Research, Hsinchu, Taiwan

Next-generation display technology is driving ultra-high-definition (UHD) TVs and screens, offering users an immersive experience. However, the scarcity of 8K-UHD streams and the high cost of transmission bandwidth necessitate the use of ISP techniques on terminal displays to enhance video quality. Deep-learning algorithms, in particular, can be employed to render stable and vivid videos. The one-stage space-time video super-resolution (STVSR) algorithm [1], depicted in Fig. 2.9.1, is able to simultaneously generate high-resolution and high-frame-rate videos from low-resolution and low-frame-rate input. But rendering 8K-UHD 60fps videos on edge devices with limited computational resources still poses three main challenges. First, although deeper models typically yield better video quality, resource constraints necessitate the use of shallower CNN models, leading to a compromise in image quality. Second, the deformable convolution with modulation (DCM) [2] effectively aligns images across different time points, and multiple DCM layers further improve image quality; however, they require additional on-chip memory to store feature maps (FM) within the layer-fusion (LF) workflow. Third, a large number of PE arrays (e.g., 10K MACs) are required to achieve high-throughput computation, significantly increasing power consumption.
We propose three key features and functional architectures, as shown in Fig. 2.9.2. 1) One-dimensional 3×1 asymmetric convolution (ASC) enhances horizontal receptive fields to improve the effectiveness of FM in image-SR applications [3]. Models that adopt hybrid ASC and 3×3 symmetric convolution (SC) can improve the frame rate by 1.22× while maintaining similar quality, where a three-layer ASC replaces a single-layer SC under similar computational complexity. A reconfigurable convolution shape unit (RCSU), equipped with a global ring bus and three ASC cores (ASCCs), is devised to flexibly support both ASC and SC. 2) In-kernel-sampling DCM (IKS-DCM) is proposed to jointly consider both the kernel location (pk) and the offset value (Δpk), rather than only Δpk [4-5]. It ensures that the kernel receptive field (KRF) remains consistent with plain convolution; therefore, it can adaptively compute each FM to enhance image quality by 0.4dB of PSNR while consuming the same SRAM size, even with multiple DCM layers. An interpolation-modulation unit (IMU) further incorporates locational bounding and uses a hard sigmoid to perform pixel interpolation and mask modulation. 3) Weight-operand isolation (WI) identifies weight values that are concentrated in a 4b bit-slice representation (4-BSR) to selectively zero out the switching activity from the redundant sign-extension bits. In particular, the coarse-grained WI (CWI) groups three weights together, instead of processing each weight individually, for area and energy efficiency. It includes a coarse-grained 4-BSR detector that identifies weight values and records the isolated kernel groups in a sign-extension table (SET). A shift-and-recover operation is then performed to share the sign bit of the multiplier, achieving WI for lossless computation.
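The ASC complexity claim can be sanity-checked with back-of-the-envelope MAC counting (a sketch of the arithmetic only, not the paper's cost model):

```python
# MACs per output pixel per input/output channel pair
sc_macs  = 3 * 3        # one 3x3 symmetric convolution layer
asc_macs = 3 * (3 * 1)  # three cascaded 3x1 asymmetric layers

# 1D receptive field along the kernel direction:
# each extra 3-tap layer extends it by 2 pixels
asc_rf = 3 + 2 + 2

print(sc_macs, asc_macs, asc_rf)  # 9 9 7
```

The three-layer ASC stack costs the same 9 MACs as one SC layer but sees 7 pixels along its direction, which is consistent with replacing a single-layer SC at similar complexity.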
Figure 2.9.3 illustrates the RCSU operating in two modes during the LF workflow: block-wise LF (BLF) and tile-in-block LF (TLF). In BLF mode, an image block is processed tile-by-tile through a single SC layer, achieving a high MAC utilization of 96.2%; however, this mode results in a low MAC utilization of 32% when computing ASC layers. Conversely, in TLF mode, a tile within the image block is processed through three ASC layers in a single cycle, achieving a MAC utilization of 95.8%; nevertheless, this mode leads to a low MAC utilization of 33.3% for SC-layer computations. Therefore, to effectively support both SC and ASC, the RCSU switches between BLF and TLF modes. In BLF mode, a 4×3-tile input is divided into three 4×1-tiles, each of which is fed into an ASCC in parallel, simultaneously calculating a 2×1-tile partial sum. These partial sums are then accumulated in the SC accumulator to produce the final 2×1-tile SC result. In TLF mode, on the other hand, a 4×1-tile input is sequentially passed through three cascaded ASCCs. The 2×1-tile ASC result of the first layer is produced by the first ASCC and stored in the ASC buffer. It is then combined with the previous-cycle 2×1-tile ASC result to form a 4×1-tile input for the second ASCC to compute the second ASC layer. This process is repeated for the third ASC layer, ensuring that the overall computation is fully pipelined. Implementation results show that incorporating support for both BLF and TLF modes requires additional accumulation and pipeline buffer resources, leading to only a 1.2% increase in computational area. Notably, since the ASC operation does not require reusing pixels in the vertical direction, the combined use of SC and ASC saves external memory access by 6.1% under the same computational complexity.
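The BLF decomposition can be verified numerically: a 3×3 SC over a 4×3 tile equals the accumulation of three column-wise 3-tap partial results, one per ASCC (a functional sketch, not the RCSU hardware):

```python
import numpy as np

def conv2d_valid(x, w):
    """Plain 'valid' 2D convolution (correlation form) by direct summation."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return np.array([[np.sum(x[i:i+kh, j:j+kw] * w)
                      for j in range(ow)] for i in range(oh)])

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 3))   # one 4x3 input tile
w = rng.standard_normal((3, 3))   # 3x3 SC kernel

# BLF-style split: each 4x1 column strip is filtered by the matching
# 3-tap kernel column, and the 2x1 partial sums are accumulated.
partials = [conv2d_valid(x[:, j:j+1], w[:, j:j+1]) for j in range(3)]
blf = sum(partials)

print(np.allclose(blf, conv2d_valid(x, w)))  # True
```

This is exactly why three parallel ASCCs plus an SC accumulator can produce the 2×1-tile SC result in BLF mode.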
Figure 2.9.4 shows the details of the IMU architecture with the IKS-DCM algorithm. First, the offsets (Δpk) and masks (mk) are generated from the FMs. The memory overhead caused by multi-DCM is determined by pk and Δpk, collectively referred to as the sampled location (Sk = pk + Δpk). The distribution of Sk shows that these values are concentrated near the KRF. Therefore, to limit the memory overhead while retaining the coverage of Sk, a locational bounding box (BBox) is applied in the IMU. For instance, a BBox of [-1, 1] achieves a coverage of 98.19%, thereby maintaining nearly the same quality as that without bounding. Notably, the KRF of the DCM with BBox [-1, 1] is the same as that of plain convolution. Moreover, the original mk is generated using the sigmoid function, which involves complex operations such as the exponential function and division. Instead, we adopt a hard-sigmoid approach, relying solely on simple operations using adders and shifters. Performance results indicate that this approach causes negligible image degradation compared to the original sigmoid. Second, the IKS-DCM performs pixel interpolation and modulation (PIM) according to the Δpk and mk. The IMU contains 432 PIM operations, and each one is simplified using the associative and distributive laws and combined with the Hadamard product, involving four multiplications and seven additions. The computational complexity of the IMU is determined by the bit precision of Δpk and mk. Evaluations show that using a 3b Δpk and a 6b mk results in only a 0.1dB PSNR loss. Third, the 4×3-tile input is transformed into a 6×3-tile after interpolation and modulation, followed by a stride-3 SC to produce a 2×1-tile output. To accommodate both stride-1 SC/ASC and stride-3 SC, an on-the-fly pixel-to-tile mapping is employed to uniformly map all 4×3-tile inputs to 6×3-tiles. Implementation results show that IKS-DCM reduces SRAM usage by 33% under three DCM layers compared to the DCM with |Δpk| ≤ 1 confinement. Moreover, the IMU logic is reduced by 51.7% by adopting the 3b Δpk and the 6b mk.
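The hard-sigmoid substitution can be sketched as follows; the slope of 1/4 and offset of 1/2 are a common shift-friendly choice assumed here for illustration, not necessarily the paper's exact mapping:

```python
import numpy as np

def hard_sigmoid(x):
    """Shift-and-add friendly mask activation: x/4 + 0.5 clipped to [0, 1].
    x/4 is a 2-bit arithmetic right shift in hardware."""
    return np.clip(0.25 * x + 0.5, 0.0, 1.0)

x = np.linspace(-8, 8, 1001)
err = np.max(np.abs(hard_sigmoid(x) - 1.0 / (1.0 + np.exp(-x))))
print(err < 0.125)  # worst-case gap stays around 0.12 on this range -> True
```

The approximation replaces the exponential and division of the true sigmoid with one shift, one add, and two comparisons, at a bounded pointwise error.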
Figure 2.9.5 presents the architecture of the CWI. The cumulative distribution of the BSR reveals that 78.24% of weights can be represented as 4-BSR, where the 4b MSBs consist entirely of sign extensions, either all 0s or all 1s. This creates an opportunity to isolate the redundant partial products caused by these 4b sign extensions, a technique known as fine-grained WI (FWI). In practice, FWI first identifies whether each weight falls within the 4-BSR range and records the results in a SET. Each identified weight is left-shifted by four bits before being multiplied with the input activation; the product is then recovered, or shifted back, before being accumulated with other results. Incorporating FWI into the RCSU can reduce power consumption by 11%; however, it induces a 27% area overhead due to the shift-and-recover operations. To mitigate this overhead, CWI takes every three weights as a group. It identifies whether all three weights in a group fall within the 4-BSR range, and the CWI MAC module then shares and performs the shift-and-recover operations for the entire group. Specifically, each module contains three MSB-weight multipliers with 8b activation and 4b-MSB weight operands, and three LSB-weight multipliers with 8b activation and 4b-LSB weight operands with a 1b positive sign extension. When a group is identified as a 4-BSR group, the LSB-weight multipliers are isolated to suppress the switching activities. Implementation results show that the RCSU with CWI saves area overhead by 30% compared to FWI and reduces MAC power consumption by 18% compared to the naïve design without WI.
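The shift-and-recover idea can be checked functionally; the power saving comes from suppressed operand toggling in hardware, which this arithmetic sketch does not model:

```python
def is_4bsr(w):
    """True if the top 4 bits of an 8b two's-complement weight are pure
    sign extension, i.e. the weight fits in 4 bits (-8..7)."""
    return -8 <= w <= 7

def wi_multiply(a, w):
    """Weight-operand isolation: multiply with the weight left-shifted by
    4 bits, then shift the product back. The result is unchanged, but only
    the 4 LSBs of the shifted weight carry information."""
    if is_4bsr(w):
        return (a * (w << 4)) >> 4  # lossless: (a*w*16)/16 == a*w exactly
    return a * w

# exhaustive check over 8b activations and 4-BSR weights
for a in range(-128, 128):
    for w in range(-8, 8):
        assert wi_multiply(a, w) == a * w
print("lossless")
```

Since the product is an exact multiple of 16, the arithmetic right shift recovers it with no rounding, which is the sense in which WI is lossless.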
Figure 2.9.6 shows the measurement results of STEP and a comparison table with prior high-resolution imaging chips [5-9]. STEP achieves a peak energy efficiency of 16.1TOPS/W at 100MHz and peak computing throughput with an energy efficiency of 7.2TOPS/W at 400MHz. Moreover, under similar computational conditions, STEP delivers detailed visual results and provides exceptional performance in both spatial (PSNR, SSIM) and temporal (MOVIE [10]) assessment metrics with the proposed key features. Additionally, STEP supports multiple DCM layers, whereas prior work [5] supports a single DC layer. In terms of normalized energy consumption for enhancing both spatial and temporal resolution, STEP consumes 0.19nJ/pixel for the one-stage STVSR model, compared to 0.59 and 1.12nJ/pixel for the two-stage video super-resolution and frame-interpolation models in [5], representing an 89% reduction in energy demand. Figure 2.9.7 shows the 16nm FinFET 8mm² STEP processor, including three ASCCs and IMU components that collectively provide peak throughputs of 10.2TOPS at 400MHz. Moreover, STEP provides a rendering throughput of 8K-UHD 60fps output at 400MHz and 1.1V with 1425mW of power consumption. In conclusion, the STEP processor enables advanced spatial- and temporal-resolution rendering, facilitating next-generation display and streaming technologies.

Acknowledgement: The authors would like to thank TSMC and NSTC for manufacturing and financial support, and TSRI for EDA tool support.

Figure 2.9.1: Overview of space-time video super-resolution (STVSR) and design challenges.
Figure 2.9.2: Overall chip architecture and key features.
Figure 2.9.3: Reconfigurable convolution shape unit (RCSU) with block-wise (BLF) and tile-in-block (TLF) layer-fusion workflows.
Figure 2.9.4: Details of the in-kernel-sampling DCM (IKS-DCM) operation and interpolation-modulation unit (IMU).
Figure 2.9.5: Coarse-grained weight-operand isolation (CWI) with 4b bit-slice representation (4-BSR) detector and group shift-and-recover operators.
Figure 2.9.6: Qualitative and quantitative results, measurement results, and performance comparison table.

ISSCC 2025 / February 17, 2025 / 4:50 PM
Figure 2.9.7: Chip photograph and performance summary.

References:
[1] X. Xiaoyu et al., “Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video Super-Resolution,” CVPR, 2020.
[2] X. Zhu et al., “Deformable ConvNets V2: More Deformable, Better Results,” CVPR, 2019.
[3] C. Tian et al., “Asymmetric CNN for Image Super-Resolution,” IEEE Trans. on Systems, Man, and Cybernetics: Systems, vol. 52, no. 6, pp. 3718-3730, 2022.
[4] Q. Huang et al., “CoDeNet: Efficient Deployment of Input-Adaptive Object Detection on Embedded FPGAs,” ACM FPGA, pp. 206-216, 2021.
[5] K.-P. Lin et al., “VISTA: A 704mW 4K-UHD CNN Processor for Video and Image Spatial/Temporal Interpolation Acceleration,” ISSCC, pp. 48-49, 2023.
[6] J. Lee et al., “A Full HD 60fps CNN Super Resolution Processor with Selective Caching based Layer Fusion for Mobile Devices,” IEEE Symp. VLSI Circuits, 2019.
[7] K. Goetschalckx et al., “DepFiN: A 12nm 3.8TOPS Depth-First CNN Processor for High-Res. Image Processing,” IEEE Symp. VLSI Circuits, 2021.
[8] Y.-C. Ding et al., “A 4.6-8.3 TOPS/W 1.2-4.9 TOPS CNN-based Computational Imaging Processor with Overlapped Stripe Inference Achieving 4K Ultra-HD 30fps,” IEEE ESSCIRC, pp. 81-84, 2022.
[9] M.-E. Shih et al., “NVE: A 3nm 23.2TOPS/W 12b-Digital-CIM-Based Neural Engine for High-Resolution Visual-Quality Enhancement on Smart Devices,” ISSCC, pp. 360-361, 2024.
[10] K. Seshadrinathan et al., “Motion Tuned Spatio-Temporal Quality Assessment of Natural Videos,” IEEE Trans. on Image Processing, vol. 19, no. 2, pp. 335-350, 2010.

ISSCC 2025 / SESSION 2 / PROCESSORS / 2.10

2.10 A 0.52mJ/Frame 107fps Super-Resolution Processor Exploiting Pseudo-FP6 Sparsity for Mobile Applications

Xuyang Duan, Xinhua Shi, Zikang Zhou, Zhiyi Shu, Yitong Rong, Yufan Chen, Zhen Yang, Menghan Li, Jun Han
Fudan University, Shanghai, China
299、3315-4101-9/25/$31.00 2025 IEEE2.10 A 0.52mJ/Frame 107fps Super-Resolut ion Processor Exploit ing Pseudo-FP6 Sparsit y for Mobile Applicat ions Xuyang Duan,Xinhua Shi,Zikang Zhou,Zhiyi Shu,Yitong Rong,Yuf an Chen,Zhen Yang,Menghan Li,Jun Han Fudan University,Shanghai,China Super-resolution(SR)is ess
300、ential f or enhancing digital image quality and is increasingly employed across various domains.The ability to recover fine details is especially critical in mobile applications such as gaming,video,and photography 1.However,mobile devices are usually sensitive to latency and power consumption,so SR
301、 processors that drain a lot of battery are not practical.Designing energy-ef ficient hardware f or convolutional neural network(CNN)-based SR(SRCNN)2 to meet these constraints poses three key challenges,as shown in Fig.2.10.1.Firstly,prior works 3,4 have introduced substantial on-chip memory f or m
302、ulti-dimentional(multi-dim)overlapping regions(OR)in layer-f usion dataflow to achieve low external bandwidth.However,to f urther improve energy ef ficiency,multi-dim OR and external memory access should both be reduced.Secondly,novel low-bit-width but high-dynamic-range computation is required,sinc
303、e it is dif ficult f or a SRCNN to obtain high image quality under traditional low-bitwidth quantization methods,which are widely used f or energy reduction 1.Thirdly,deconvolution accounts f or a large proportion of the workload in a SRCNN.Reusing the convolution datapath f or deconvolution operati
304、ons can improve area ef ficiency but leads to many redundant calculations.Alternatively,additional datapaths or registers are required to support deconvolution.It is important to remove additional both hardware overhead and redundant calculations to obtain high energy ef ficiency and improve hardwar
To solve these challenges, this paper proposes a high-energy-efficiency SR processor with three features: 1) A channel-number-adaptive low-cost caching strategy (CLCS), a redundant-OR-elimination (ROE) workflow, and a two-stage layer-fusion dataflow with a compressed intermediate layer (CIL) for low external bandwidth are introduced, reducing on-chip memory by 90% compared to prior work [3] without significant external bandwidth overhead. 2) Pseudo-FP6 (PFP6) sparsity exploitation (PSE) is presented, which includes a bit-level adjustment (BLA) approach, a zero-PFP6-skipping pixel processing unit (PPU), and an input-channel rearrangement controller (ICRC) that balances the workload by generating a reordered address sequence, reducing inference latency by 64%. 3) A hybrid dataflow for deconvolution (HDD) is proposed, reusing the datapath of convolution to improve the utilization of the PPU by 75% compared to the approach of prior work [4].
Figure 2.10.2 illustrates the overall architecture of the SR processor, which consists of four pixel-processing cores and a central controller. Each pixel-processing core includes a PPU array, an out-of-order scheduler, an ICRC, and a memory-access module comprising on-chip buffers and load/store units. The PPU array comprises 120 PPUs, each of which contains a data selector and 8 multiply-accumulate (MAC) units (8b activation, 6b weight). The data storer can write the output feature map (Ofmap) or the output partial summation (Opsum) calculated by the PPU array to the data buffer according to the specific requirements of CLCS. Based on the input-channel (IC) rearrangement scheme, the ICRC can generate the reordered address sequence to load the input feature map (Ifmap) and weights for workload-balanced computation. The weight loader only sends non-zero weights to the PPU array, bypassing unnecessary calculations. The PPU array can be reconfigured to perform convolution and deconvolution, achieving high hardware utilization.
312、apping regions(OR)-elimination(ROE)workflow,and two-stage layer f usion dataflow with CIL f or low external bandwidth.Under the layer-f usion dataflow 5,prior works 3,4,6,7 adopt either the If map caching strategy(ICS)that caches OR of an adjacent tile(i.e.,the horizontal and vertical boundaries)of
313、If map,or the Opsum caching strategy(OCS)that caches ORs of Opsum.ICS leads to larger on-chip memory overhead than OCS if the number of IC is larger than output channels(OC),and vice versa.The CLCS is proposed to avoid redundant on-chip memory requirements resulting f rom using a fixed cache strateg
314、y f or the SRCNN,whose layers have varied f eature map channel numbers.CLCS can adaptively apply ICS or OCS according to the size relationship between the number of IC and OC.Compared to using either ICS or OCS alone,CLCS reduces on-chip memory by 43%.In order to f urther reduce on-chip memory by mi
315、nimizing the OR,ROE is proposed f or our SRCNN model(PFPSR,SR model with PFP6 precision),which employs horizontal asymmetric convolution(AC)with 1D horizontal kernels and vertical AC with 1D vertical kernels to process dif f erent layers in our SRCNN,eliminating the ORs perpendicular to the sliding
316、direction of kernels.Symmetric convolution(SC)is used f or the first and last layers,as channels of the input and output images of these two layers are small(i.e.the memory overhead of ORs is insignificant).For an L-layer SRCNN,horizontal AC is applied to 2nd to(N+1)th layers,and vertical AC is used
for layers from the (N+2)th through the (L-1)th. ROE reduces on-chip memory by 80% with minimal loss in algorithm performance. Since ROE introduces two different convolution directions, a two-stage layer fusion is used. Moreover, CIL is proposed to reduce the external memory access caused by the access-order switch between the last layer of horizontal AC and the first layer of vertical AC. This approach uses a convolution layer with a small OC count to encode the feature map before it is stored off-chip, reducing the external bandwidth requirement of the two layers. CIL achieves a 53% reduction in external bandwidth while maintaining the algorithm performance (PSNR 36.94dB). Figure 2.10.4 demonstrates pseudo-FP6 (PFP6) sparsity exploitation (PSE) and workload-balance enhancement (WBE). PSE adjusts weights at the bit level without significant algorithm-performance degradation. Specifically, PSE reformats the two's complement
of negative weights, which have many high-order 1s, as an unsigned binary with fewer 1s plus a sign bit, and then further reduces the number of non-zero bits of the unsigned binary through bit-level adjustment (BLA). Overall, PSE reduces the number of non-zero bits by 70% with only a 0.08dB drop in PSNR. Additionally, weights are encoded in the PFP6 (M2E3: 2b mantissa, 3b exponent) format, as this minimizes the data volume of the encoded weights while achieving 75% sparsity. However, high sparsity exacerbates the workload imbalance between MAC units in the PPU. Specifically, assuming that the Ifmap tile of a single IC is broadcast to all MACs in the PPU in each cycle, the MACs responsible for weights with fewer non-zero PFP6 data must wait for the busiest MAC, which handles more non-zero PFP6 data. The ICRC instead broadcasts the Ifmap tiles of multiple ICs and generates the reordered address sequence for the reordered ICs to load Ifmap and weights. This allows different ICs to be calculated on the MAC units of a PPU in an interleaved manner, while ensuring that as few IC tiles as possible are broadcast simultaneously. With a buffer size of 2 ICs, a 10% reduction in execution cycles is achieved. Overall, PSE and
WBE reduce execution cycles by 64%. Figure 2.10.5 presents the hybrid dataflow for deconvolution (HDD), which significantly improves the utilization of the MAC units. In convolution operations, multiple MACs within a PPU compute different OCs in parallel. For deconvolution, however, the number of OCs is typically small, leaving many MACs idle. Deconvolution also requires inserting zeros into the Ifmap to reuse the convolution datapath, introducing redundant computations that further degrade MAC utilization. Prior work [4] addresses this by splitting the deconvolution kernel into several smaller convolution kernels, converting the deconvolution into a convolution without adding redundant computations. However, this approach is unsuitable for a hardware architecture that exploits parallelism among OCs, causing low utilization of the MAC units when the number of OCs in the SRCNN's deconvolution layer is small. Instead, HDD exploits parallelism along the width dimension of the deconvolution kernel, which has a large spatial size in an SRCNN. In HDD, each column of MAC units is assigned the computation tasks corresponding to a single column of weights in the deconvolution kernel. With traditional methods [9], such a task would require either an additional datapath between the MAC units or zero-insertion in the Ifmap. HDD avoids this by splitting each column of deconvolution weights, rather than a whole kernel [4], into multiple convolution weights, reusing the convolution computation path to calculate the deconvolution without additional hardware overhead or redundant calculations. HDD achieves a 75% improvement in MAC utilization compared to the previous approach [4]. Figure 2.10.6 shows the measurement results for our SR processor, fabricated in 22nm. The chip works at 80-220MHz with
a 0.6-0.9V supply. The experiments are conducted using our SR model (PFPSR), which supports 27-71fps at 4K resolution. The best SR energy efficiency is 0.52mJ/frame at 1K resolution, a 1.9× and 1.8× improvement over prior works [8,1]. The highest throughput is 71fps at 4K resolution, which is 1.4× and 1.2× better than prior works [3,4]. Additionally, under this test configuration, the SR energy efficiency is improved by 2.8× and 3.6×, while the on-chip memory requirements are reduced by 10.1× and 10.9×. Figure 2.10.7 shows the die photo and summary.

Acknowledgement: This work was supported by the National Natural Science Foundation of China (61934002). The corresponding author of this paper is Jun Han (junhanf).

Figure 2.10.1: Challenges of designing a SR processor for mobile applications.
Figure 2.10.2: The overall architecture of the proposed SR processor.
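The per-layer caching choice (CLCS) and the convolution-direction assignment (ROE) described above can be condensed into a short functional sketch. This is a minimal model, not the hardware: the 1-based layer indexing, the parameter `n` (the assumed split point between horizontal and vertical AC), and the function names are illustrative.

```python
def clcs_strategy(num_ic: int, num_oc: int) -> str:
    """CLCS: per layer, cache whichever overlapping region is cheaper.
    Caching Ifmap ORs (ICS) costs more when IC > OC, so pick OCS there;
    the tie-breaking choice at IC == OC is an assumption of this sketch."""
    return "ICS" if num_ic <= num_oc else "OCS"


def roe_direction(layer_idx: int, num_layers: int, n: int) -> str:
    """ROE assignment for an L-layer SRCNN (1-based indexing): symmetric
    convolution (SC) for the first and last layers, horizontal AC for
    layers 2..N+1, vertical AC for layers N+2..L-1."""
    if layer_idx in (1, num_layers):
        return "SC"
    return "horizontal-AC" if layer_idx <= n + 1 else "vertical-AC"
```

For example, a layer with IC=64 and OC=3 flips the choice to OCS, mirroring the paper's point that a fixed strategy wastes memory when channel counts vary across layers.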
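The PSE reformatting step can likewise be illustrated with a small sketch, assuming 8-bit two's-complement storage for the example; the subsequent BLA step and the PFP6 (M2E3) packing are not modeled here.

```python
def popcount(x: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def pse_reformat(w: int, bits: int = 8):
    """PSE idea: a negative weight's two's-complement pattern has many
    high-order 1s, while its (sign, magnitude) form has far fewer
    non-zero bits. Returns (sign, magnitude, bits_before, bits_after)."""
    raw = w & ((1 << bits) - 1)              # two's-complement bit pattern
    sign, mag = (1, -w) if w < 0 else (0, w)
    return sign, mag, popcount(raw), sign + popcount(mag)
```

For w = -2, the stored pattern 0b11111110 contains seven 1s, whereas the sign-plus-magnitude form needs only two non-zero bits, which is the reduction PSE then pushes further with BLA.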
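A 1D functional sketch can show the sub-kernel splitting that underlies HDD: a strided transposed convolution decomposes exactly into `stride` ordinary convolutions with the sub-kernels `w[p::stride]`, so no zeros are inserted and no extra datapath is needed. The 2D kernel, the per-column mapping onto MAC columns, and all shapes here are illustrative simplifications of the paper's scheme.

```python
def conv_full(x, w):
    """Full 1D convolution: y[q] = sum_m x[m] * w[q - m]."""
    n = len(x) + len(w) - 1
    return [sum(x[m] * w[q - m]
                for m in range(len(x)) if 0 <= q - m < len(w))
            for q in range(n)]


def deconv_by_split(x, w, stride):
    """Compute a 1D transposed convolution as `stride` ordinary
    convolutions with sub-kernels w[p::stride]; phase p of the output
    is the full convolution of x with its sub-kernel, interleaved."""
    out_len = stride * (len(x) - 1) + len(w)
    y = [0] * out_len
    for p in range(stride):
        sub = w[p::stride]                   # one sub-kernel per phase
        for q, v in enumerate(conv_full(x, sub)):
            idx = stride * q + p
            if idx < out_len:
                y[idx] = v
    return y
```

Each phase's convolution reuses the ordinary convolution path unchanged, which is the property HDD exploits to keep the MAC columns busy without zero-insertion.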