ISSCC 2025
SESSION 2: Processors

2.1: Zen 5: The AMD High-Performance 4nm x86-64 Microprocessor Core
2025 IEEE International Solid-State Circuits Conference

Teja Singh¹, Spence Oliver¹, Sundar Rangarajan¹, Brett Johnson², Carson Henrion², Sarah Tower¹, Vibhor Mittal¹, Kathy Hoover¹, Shane Southard¹, Russ Schreiber¹, Shravan Lakshman¹, Deepesh John¹, Alex Schaefer¹, Brian Kasprzyk¹, Ted Antoniadis¹, Wilson Li³, Michael McCabe³, James Pistole³, Sabeesh Balagangadharan⁴, Ross McCoy¹, Anitha Mohan¹, Kurt Mohlman¹, Sahil Singh⁵, Nick Lance², Darryl Prudich², Justin Coppin², Tim Jackson², Anita Karegar², Ryan Miller², Hon-Hin Wong³, Daryl Lieu³

¹ AMD Austin, TX
² AMD Fort Collins, CO
³ AMD Santa Clara, CA
⁴ AMD Bangalore, India
⁵ AMD Markham, ON, Canada

Outline
- Market Segments and Design Goals
- Core Architecture
- Core Floorplan
- Physical Design Optimization
- Core Complex Die
- Product Configurations
- Performance Comparisons
- Conclusion

“Zen 5” Delivers
- Power-efficient and cost-optimized x86 core across various segments
- Configurable FP512/FP256 data paths and L3 Cache
- Supports 1T Fmax of 5.7GHz
- 16% generational IPC increase in desktop applications [1]
- 17% generational IPC increase in cloud/enterprise workloads [2]
- “Granite Ridge”: 5th Gen AMD Ryzen Desktop Processors
- “Strix Point”: 3rd Gen AMD Ryzen AI Mobile Processors
- “Turin”: 5th Gen AMD EPYC Server Processors

“Zen 5” Family
Zen 5 Family Members
- “Zen 5”: Max Single Core Performance
- “Zen 5c”: Optimized for Perf/Watt and Perf/Area
Zen 5 Family Members Bring
- Better optimization and targeted solutions to end use cases
- Simultaneous Multi-Threading (SMT) per Core
- Software consistency and ease of use
[Figure: “Zen 5” vs. “Zen 5c” compared on IPC, Area, Power Efficiency, and Frequency]
Zen 5 Family Enables Platform Breadth
- Zen 5 and Zen 5c provide scalability and performance options
- Floating Point Data Path Size
- L3 Cache Capacity
- AMD 3D V-Cache

“Zen 5” for the Desktop Market
“Granite Ridge” 5th Gen AMD Ryzen Desktop Processors
- Highest 1T Performance Variant
- Two 8-core CCDs (4nm) + IOD (6nm)
- Fully-populated FP512 data path
- 32 MB L3 Cache/CCD, 64 MB total
- Supports 1T Fmax of 5.7GHz
- 16% generational IPC increase in desktop applications [2]
- Support for AMD 3D V-Cache
- 64MB Extended L3 Die (L3D)
- Fully-compatible with prior-gen AM5 package
- 6nm IO Die (IOD) with dual Infinity Fabric On-Package (IFOP) SerDes links

“Zen 5” for the Mobile Market
“Strix Point” 3rd Gen AMD Ryzen AI Mobile Processors
Monolithic SoC with CCX, GPU, and NPU at 15W TDP (minimum), 4nm

| | 4 “Zen 5” Cores | 8 “Zen 5c” Cores |
|---|---|---|
| Threads | 8 | 16 |
| Role | Peak Performance | Efficient Throughput |
| Boost | Up to 5.1 GHz | Up to 3.3 GHz |
| Base | 2 GHz | 2 GHz |
| Core area | 0.87x Desktop Core Area | 0.64x Desktop Core Area |
| L3 | 16 MB | 8 MB |

- Same IPC, ISA, & SMT Support
- FP256 Data Path with AVX512 Support
- 1 MB L2/Core

“Zen 5” for the Server Market
“Turin” Platform Options, 5th Gen AMD EPYC Server Processors
Two Server CCD Options
- Chiplets enable a broad stack of core count options
- Next-Gen 6nm Server IOD
- AVX512 with full FP512 data path + 32MB L3/CCD
“Scale Up” w/ 8-Core “Zen 5” CCDs (4nm)
- Max L3 cache/core and max frequency (5 GHz)
- Up to 16 CCDs (128 cores/256 threads)
- Up to 512MB Total L3 Cache
“Scale Out” w/ 16-Core “Zen 5c” CCDs (3nm)
- For higher core counts and energy efficiency
- Up to 12 CCDs (192 cores/384 threads)
- Up to 384MB Total L3 Cache

“Zen 5” Core Architecture
1T & 2T Performance Uplift
Front End Parallelism
- Next-Gen Branch Predictor
- 2x Instruction Fetch/Decode Pipes
- Area-Optimized Op Cache with Increased Associativity and Throughput
- 6 → 8-wide Dispatch
Wider Execution
- 4 → 6-wide Integer ALU Issue
- 3 → 4-wide FP Issue
- Configurable FP256/FP512 Data Path
- 3 → 4 Load/Store Pipes
Larger Caches and Queues
- 32B → 64B L2 Fill & Victim Interfaces
[Block diagram, courtesy Cohen et al., “Next Generation Zen 5 Core,” Hot Chips, 2024 [3]. Labels:
- Branch Prediction: 2-taken, TAGE
- L1I Cache: 32KB 8-way, 32B fetch x2
- Op Cache: 6K inst/fused inst, 6-wide x2
- Decode: 4-wide x2
- UOPQ; Dispatch, 8-wide
- Integer Rename, 8-wide; General-Purpose Registers, 64b, 240-entry; Schedulers feeding ALU/Mul, ALU/Br, and AGU units; AGU: 4 addresses/cycle
- Vector Rename, 6-wide; NSQ, 96-entry, 8-wide in; three Sch, 38-entry; Vector Registers, 512b, 384-entry; FMUL/FMA, FADD, StD, and IntD units
- Load/Store Queues: 2x512b load data/cycle, 4x64b load data/cycle, 1x512b store data/cycle
- L1D Cache: 48KB 12-way; 4 read, 2 write; 64B fill, 64B victim
- L2 Cache: 1MB 16-way; 64B/cycle; 64B/cycle to L1I]

Floating Point Execution
AVX512 with a native 512-bit data path
- BIOS Support for “Double-pumped” AVX512 on 256-bit data path
- ISA-Compatible for native FP256 Core variants
- Power Gating support for Upper FP256 Data Path
- Clock Mesh Gating Support for Upper FP256 and Lower FP256
More Bandwidth, Less Latency
- 4 Execution and 2 Ld/St/Integer Pipelines
- 2 loads and 1 store per cycle, each 512b
- 2-cycle FADD
- 2X Bandwidth for AI formats (INT8/16, BF16)
Deeper Queues
- NSQ growth with support for 8-wide instruction dispatch
- 3 Larger 38-entry Schedulers (1.15x “Zen 4”)
- 384-entry Physical Register File (2x “Zen 4”)
- Retire Queue (ROB) Growth
[Figure labels: FP512 = Upper FP256 (two 128b Data Paths) + FP Control + Lower FP256 (two 128b Data Paths)]
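
The “double-pumped” AVX512 option on the Floating Point Execution slide can be pictured with a small model. The sketch below is only an illustration under stated assumptions: a 512-bit register is modeled as sixteen 32-bit lanes, and a "pass" is an abstract count of trips through the data path, not a cycle-accurate latency. The function names are hypothetical and do not come from the talk.

```python
LANES_512 = 16  # a 512-bit vector holds sixteen 32-bit lanes
LANES_256 = 8   # a 256-bit data path handles eight lanes per pass


def add_native_512(a, b):
    """Model a full FP512 data path: all 16 lanes in one pass."""
    assert len(a) == len(b) == LANES_512
    return [x + y for x, y in zip(a, b)], 1


def add_double_pumped(a, b):
    """Model an FP256 data path: the same 512-bit add as two 256-bit passes."""
    assert len(a) == len(b) == LANES_512
    result, passes = [], 0
    for start in range(0, LANES_512, LANES_256):
        stop = start + LANES_256
        result.extend(x + y for x, y in zip(a[start:stop], b[start:stop]))
        passes += 1
    return result, passes


a = [float(i) for i in range(LANES_512)]
b = [0.5] * LANES_512
native, native_passes = add_native_512(a, b)
pumped, pumped_passes = add_double_pumped(a, b)
print(native == pumped, native_passes, pumped_passes)
```

Both paths produce the same architectural result, which is the sense in which the FP256 variants stay ISA-compatible; only the number of passes differs.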

Key Design Enhancements

| Attribute | “Zen 4” | “Zen 5” |
|---|---|---|
| L1/L2 BTB | 1.5K/7K | 16K/8K |
| Return Address Stack | 32 | 52 |
| ITLB L1/L2 | 64/512 | 64/2048 |
| Fetched/Decoded Instruction Bytes/Cycle | 32 | 64 |
| Op Cache Associativity | 12-way | 16-way |
| Op Cache Bandwidth | 9 macro-ops | 12 inst or fused inst |
| Dispatch Bandwidth (Macro-Ops/Cycle) | 6 | 8 |
| ALU/AGU | 4/3 | 6/4 |
| Integer PRF (Reg/Flag) | 224/126 | 240/192 |
| Vector PRF | 192 | 384 |
| FP Pre-Scheduler Queue | 64 | 96 |
| FP Scheduler | 2x32 | 3x38 |
| FP Pipes | 3 | 4 |
| Vector Width | 256b | 256b/512b |
| ROB/Retire Queue | 320 | 448 |
| LS Mem Pipes Supporting Loads/Stores | 3/1 | 4/2 |
| DTLB L1/L2 | 72/3072 | 96/4096 |
| L1 Data Cache | 32KB/8-way | 48KB/12-way |
| L2 per Core | 1MB/8-way | 1MB/16-way |
| L2 Bandwidth | 32B/clk | 64B/clk |

These Arch enhancements have a pervasive impact throughout the design.
[Figure: “Zen 5” Uplift Breakdown, SpecInt2017 AOCC, by DE/OC/DI, IC/BP, LS/DC/L2, and EX/SC]
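
To make the attribute table easier to compare at a glance, the short sketch below computes generational growth ratios for a few rows. The values are copied from the table; the choice of rows and the helper name `growth_ratio` are illustrative assumptions, not part of the talk.

```python
# (Zen 4, Zen 5) values copied from the attribute table in this talk.
ATTRIBUTES = {
    "Dispatch bandwidth (macro-ops/cycle)": (6, 8),
    "Vector PRF entries": (192, 384),
    "ROB/retire queue entries": (320, 448),
    "L1 data cache (KB)": (32, 48),
    "L2 bandwidth (B/clk)": (32, 64),
}


def growth_ratio(zen4, zen5):
    """Return the Zen 5 / Zen 4 ratio rounded to two decimals."""
    return round(zen5 / zen4, 2)


ratios = {name: growth_ratio(*pair) for name, pair in ATTRIBUTES.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

The 2.0x vector PRF ratio agrees with the “2x Zen 4” figure quoted on the Floating Point Execution slide.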

Density Trends in High Performance IPs
[Figure: Relative Density by Function (Signal Routing, SRAM, Logic) across 16nm/14nm, 7nm/6nm, and 5nm/4nm]

Design Challenges and Solutions
Challenges
- Larger Architectural Structures → Increased Area
- Wider Architecture → Increased Route Demand
- Larger + Wider → More Gates/Cycle, High-Fanout Loads
Solutions
- Enhanced Metal Stack & Power Grid
- Wire Engineering
- Design/Tech Co-opt (DTCO)
- Improved Via Ladder Insertion
- Leveraged Machine Learning
- 8T-to-6T SRAM Conversion
- Latch Array Area Efficiency
- Optimized Standard Cells
- Dense Placement Algorithms
- 4nm Scaling Benefits
- Custom Logic Optimization
- EDA Collaboration
- Utilization of Optimized Cells
- Useful Clock Skew
- Improved Buffering Algorithms
[Figure annotations: + Rd/Wr Ports; + Longer Routes; 32-Byte → 64-Byte Wider Buses]

4nm Process

| | Zen 3 | Zen 4 | “Zen 5” |
|---|---|---|---|
| Technology | TSMC 7nm | TSMC 5nm | TSMC 4nm |
| Cores/CCX | 8 | 8 | 8 |
| Area/CCX | 68 mm² | 55 mm² | 55 mm² |
| L2/Core | 512 kB | 1 MB | 1 MB |
| L3/CCX | 32 MB | 32 MB | 32 MB |
| Transistors/CCX | 4.1B | 6.5B | 7.9B |
| Standard Cell Track Library | 6 track | 6 track | 6 track |
| Cu Metal Layers | 13 | 15 | 17 |

L3 Cache Area Efficiency
- Over 35% of “Zen 4” CCD area was the 32MB L3 Cache
- L3 SRAMs density improved 10%
- Co-optimized L3 and Core aspect ratio
- Minimized AMD 3D V-Cache area requirements
- 30% improvement in cache mm²/MB
[Figure: “Zen 4” 32MB L3 vs. “Zen 5” 32MB L3]

“Zen 5” AMD 3D V-Cache
- Thermal benefits when L3D moved below the CCD
- +500MHz Base and +200MHz Boost Frequency [4] vs. “Zen 4” 3D V-Cache
[Figure: “Zen 4” Stacking Concept (L3D, CCD) and “Zen 5” Stacking Concept (CCD, L3D), with labels Path to Heat Sink, To Package Substrate, Top Routing Layer, and Transistors. *For illustration purposes only]
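
The 4nm Process table gives transistor counts and areas per CCX, so average transistor density follows by division. The sketch below is a back-of-the-envelope calculation from those table values only; it says nothing about how density varies between logic, SRAM, and routing-limited regions.

```python
# (transistors per CCX in millions, CCX area in mm^2), from the process table.
CCX = {
    "Zen 3": (4100, 68),
    "Zen 4": (6500, 55),
    "Zen 5": (7900, 55),
}

# Average density in millions of transistors per mm^2.
density = {gen: round(mtr / area, 1) for gen, (mtr, area) in CCX.items()}

# Transistor growth from Zen 4 to Zen 5 at equal CCX area.
zen5_over_zen4 = round(CCX["Zen 5"][0] / CCX["Zen 4"][0], 2)

for gen, value in density.items():
    print(f"{gen}: {value} MTr/mm^2")
print(f"Zen 5 / Zen 4 transistors per CCX: {zen5_over_zen4}x")
```

Because the “Zen 4” and “Zen 5” CCX areas are both listed as 55 mm², the transistor ratio and the density ratio are the same number.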

“Zen 5” Core Power
- Significant design emphasis is placed on switching power (CAC) and leakage
- “Zen 5” improved CAC-per-IPC metrics
- Higher generational SRAM power is a CAC-efficient method to add performance
- Contributions from sequential overhead (flop/gater/clock) are lower
- Improved low-power cell options and design usage
- Improved multi-bit flop banking optimizations
- Improved clock mesh and clock tree synthesis (CTS) methodology
- VT swap algorithms incorporated optimizations to better balance leakage power and frequency attainment
[Figure: “Zen 5” CAC Breakdown]

“Zen 5” Core Complex Die
- 8 “Zen 5” Cores + 32MB L3 Cache
- System Management Unit (SMU): Microcontroller, Power Management, Clocks, Reset, Fuses, Thermal Monitor, and Control
- Dual Infinity Fabric On-Package (IFOP) Over Two GMI3 SerDes Links
- Each link comprises 16 TX lanes, 20 RX lanes, 2 Clock and 2 Control lanes
- Speeds up to 36 Gbps
- GMI3 wide and narrow modes supported
- 8.6B Transistors, 66 mm²

Example Product Configurations
Server: EPYC 9005 Series (Chiplet-Based Package)
- 2-16 8-Core “Zen 5” CCDs (4nm): up to 16 x 8.6 Billion Transistors, 66 mm² each
- 6nm IO Die: 12.3 Billion Transistors, 412 mm²
- 8-128 “Zen 5” Cores (4nm)
- Up to 192 “Zen 5c” Cores (3nm)
- 125-500W TDP
- Boost up to 5.0 GHz at 400W (64-core)
- Up to 149.9 B transistors per socket
Desktop: Ryzen 9000 Series (Chiplet-Based Package)
- 1 or 2 8-Core “Zen 5” CCDs (4nm): up to 2 x 8.6 Billion Transistors, 66 mm² each
- 6nm IO Die: 3.4 Billion Transistors, 111 mm²
- 6-16 Cores
- 65-170W TDP
- Boost up to 5.7 GHz
Mobile: Ryzen AI HX 375 Processor (4nm Monolithic Die)
- 4 “Zen 5” Cores w/ 16MB L3
- 8 “Zen 5c” Cores w/ 8MB L3
- Die labels: 2.8 Billion Transistors; 3.5 Billion Transistors
- 12 Heterogeneous Cores
- 15-54W TDP
- Boost up to 5.1 GHz

Prior-Generation Desktop Comparison: Performance vs. Power
[Figure]

Desktop Competitive Performance
[Figure]
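
The server and desktop totals quoted in this talk follow from per-CCD figures by simple multiplication. The sketch below recomputes them as a consistency check, using only numbers stated on the server-market and product-configuration slides; the class and function names are hypothetical, and two threads per core is inferred from the quoted core/thread pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerOption:
    name: str
    max_ccds: int
    cores_per_ccd: int
    l3_mb_per_ccd: int = 32   # "32MB L3/CCD" on the server slide
    threads_per_core: int = 2  # SMT, inferred from the cores/threads pairs

    @property
    def cores(self) -> int:
        return self.max_ccds * self.cores_per_ccd

    @property
    def threads(self) -> int:
        return self.cores * self.threads_per_core

    @property
    def l3_mb(self) -> int:
        return self.max_ccds * self.l3_mb_per_ccd


def package_transistors_b(ccds: int, iod_b: float, ccd_b: float = 8.6) -> float:
    """Total transistors (billions) for a package of Zen 5 CCDs plus one IO die."""
    return round(ccds * ccd_b + iod_b, 1)


scale_up = ServerOption("Scale Up (Zen 5)", max_ccds=16, cores_per_ccd=8)
scale_out = ServerOption("Scale Out (Zen 5c)", max_ccds=12, cores_per_ccd=16)
server_total = package_transistors_b(16, iod_b=12.3)
desktop_total = package_transistors_b(2, iod_b=3.4)

for opt in (scale_up, scale_out):
    print(opt.name, opt.cores, opt.threads, opt.l3_mb)
print(server_total, desktop_total)
```

The server result reproduces the quoted 149.9 billion transistors per socket; the desktop figure is derived here and is not stated on the slide.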

Desktop Generational Gaming Performance
[Figure]

Desktop Competitive Gaming Performance
[Figure]

“Zen 5” Generational Server Performance
[Figure]

“Zen 5” Generational HPC Performance
[Figure]

Competitive Server Performance
[Figure]

Conclusion
- Family of configurable “Zen 5” cores designed for differentiated mobile, desktop, and server markets
- 16% generational desktop performance uplift with improved Perf/Watt
- 1T Frequencies up to 5.7GHz
- “Zen 5” delivers industry-leading gaming and server performance

Endnotes

[1] GNR-03: Testing as of May 2024 by AMD Performance labs. Zen 5 system configured with: Ryzen 9 9950X GIGABYTE X670E AORUS MASTER motherboard, Balanced, DDR5-6000, Radeon RX 7900 XTX, VBS=ON, SAM=ON, KRACKENX63 vs. Zen 4 system configured with: Ryzen 7 7700X, ASUS ROG Crosshair X670E motherboard, Balanced, DDR5-6000, Radeon RX 7900 XTX, VBS=ON, SAM=ON, KRAKENX62 FixedFrequency=4.0 GHz. Applications tested include: Handbrake, League of Legends, FarCry 6, Puget Adobe Premiere Pro, 3DMark Physics, Kraken, Blender, Cinebench (n-thread), Geekbench, Octane, Speedometer, and WebXPRT. System manufacturers may vary configurations, yielding different results. GNR-03.

[2] EPYC-029C: Comparison based on thread density, performance, features, process technology and built-in security features of currently shipping servers as of 10/10/2024. EPYC 9005 series CPUs offer the highest thread density EPYC-025B, leads the industry with 500+ performance world records EPYC-023F with performance world record enterprise leadership Java ops/sec performance EPYCWR-20241010-260, top HPC leadership with floating-point throughput performance EPYCWR-2024-1010-381, AI end-to-end performance with TPCx-AI performance EPYCWR-2024-1010-525 and highest energy efficiency scores EPYCWR-20241010-326. The 5th Gen EPYC series also has 50% more DDR5 memory channels EPYC-033C with 70% more memory bandwidth EPYC-032C and supports 70% more PCIe Gen5 lanes for I/O throughput EPYC-035C, has up to 5x the L3 cache/core EPYC-043C for faster data access, uses advanced 3-4nm technology, and offers Secure Memory Encryption + Secure Encrypted Virtualization (SEV) + SEV Encrypted State + SEV-Secure Nested Paging security features. See the AMD EPYC Architecture White Paper (https:/ more information.

[3] B. Cohen, M. Subramony, M. Clark, “Next Generation “Zen 5” Core,” Hot Chips, 2024.

[4] Comparing 8-core Ryzen 7 9800X3D versus 8-core Ryzen 7 7800X3D, based on published specifications at . AMD Ryzen 7 9800X3D Specification: https:/ Ryzen 7 7800X3D Specification: https:/

[5] GNR-06: Testing as of June 2024 by AMD Performance Labs on test systems configured as follows: AMD Ryzen 9 9900X CPU system: GIGABYTE X670E AORUS MASTER, Balanced, DDR5-6000, Radeon RX 7900 XTX GPU, VBS=On, SAM=On, KRACKENX63 vs. similarly configured Intel Core i9-14900K system: MSI MEG Z790 ACE MAX (MS-7D86), Balanced, DDR5-6000, Radeon RX 7900 XTX, VBS=On, SAM=On, KRAKENX63, Profile=Intel Default on the following applications/games: Procyon Office, Geekbench 6.2.1 Single Core, Puget Photoshop (general), Blender (classroom TTC), Handbrake TTC, Borderlands 3, DOTA 2, Hitman 3, Cyberpunk 2077, F1 2023, and Horizon Zero Dawn. All games tested at 1080p resolution. System manufacturers may vary configurations, yielding different results. GNR-06.

[6] GNR-07: Testing as of June 2024 by AMD Performance Labs on test systems configured as follows: AMD Ryzen 7 9700X CPU system: GIGABYTE X670E AORUS MASTER, Balanced, DDR5-6000, Radeon RX 7900 XTX, VBS=On, SAM=On, KRACKENX63 vs. similarly configured Intel Core i7-14700K system: MSI MEG Z790 ACE MAX (MS-7D86), Balanced, DDR5-6000, Radeon RX 7900 XTX, VBS=On, SAM=On, KRAKENX63, Profile=Intel Default on the following applications/games: 7-Zip, Procyon Office Productivity, GeekBench 5.4.6 Single Core, Puget Adobe Photoshop, Handbrake, Borderlands 3, DOTA 2, Hitman 3, Cyberpunk 2077, F1 2023, and Horizon Zero Dawn. All games tested at 1080p resolution. System manufacturers may vary configurations, yielding different results. GNR-07.

[7] GNR-21: Testing as of October 2024 by AMD Performance Labs on test systems configured as follows: AMD Ryzen 7 7800X3D & 9800X3D system: GIGABYTE X670E AORUS MASTER, Balanced, 2x16GB DDR5-6000, Radeon RX 7900 XTX, VBS=On, SAM=On, KRACKENX63 (September 27, 2024); on the following games: Ashes Of The Singularity: Escalation, Assassins Creed Mirage, Assassins Creed Valhalla, Avatar: Frontiers Of Pandora, Baldurs Gate 3, Black Myth: Wukong, Borderlands 3, Counter-Strike 2, CyberPunk 2077, Deus Ex: Mankind Divided, Dirt 5, DOTA 2, F1 2023, F1 2024, Far Cry 6, Final Fantasy 14 Dawntrail, Forza Horizon 5, Ghost Recon Breakpoint, Guardians Of The Galaxy, Hitman 3, Hogwarts Legacy, Horizon Zero Dawn, League of Legends, Metro Exodus, Metro Exodus Enhanced Edition, Middle Earth Shadow of War, Rainbow 6 Siege, Riftbreaker, Shadow Of The Tomb Raider, Spider Man Remastered, Starfield, Strange Brigade, The Callisto Protocol, Tiny Tinas Wonderlands, Total War Warhammer 3, Warhammer Dawn Of War 3, Watch Dogs Legion, World of Tanks encore, Wolfenstein Youngblood. System manufacturers may vary configurations, yielding different results. GNR-21

[8] GNR-25: Testing as of October 2024 by AMD Performance Labs on test systems configured as follows: AMD Ryzen 7 9800X3D system: GIGABYTE X870E AORUS MASTER, Balanced, 2x16GB DDR5-6000, GeForce RTX 4090, VBS=On, SAM=On, KRACKENX63 (September 27, 2024); Intel Core Ultra 9 285K system: ASUS ROG STRIX Z890-E GAMING WIFI, Balanced, 2x16GB DDR5-6400, GeForce RTX 4090, VBS=On, SAM=On. KRAKENX63 (September 11, 2024) Profile=Intel Default on the following games: Black Myth: Wukong, Avatar: Frontiers of Pandora, Ashes of the Singularity: Escalation, Hogwarts Legacy, Counter-Strike 2, Starfield, HITMAN 3, Final Fantasy 14, Dawntrail, Warhammer 40,000: Space Marine 2, Call of Duty: Black Ops 6, Far Cry 6, Watch Dogs: Legion, Cyberpunk 2077. System manufacturers may vary configurations, yielding different results. GNR-25

[9] SPECrate2017_int_base comparison based on published results as of 10/10/2024. Configurations: 2P AMD EPYC 7601 (275 SPECrate2017_int_base, 64 Total Cores, https://www.spec.org/cpu2017/results/res2017q4/cpu2017-20171211-01594.html); 2P AMD EPYC 7742 (701 SPECrate2017_int_base, 128 Total Cores, https://www.spec.org/cpu2017/results/res2019q4/cpu2017-20191125-20001.html); 2P AMD EPYC 7763 (861 SPECrate2017_int_base, 128 Total Cores, https://www.spec.org/cpu2017/results/res2021q4/cpu2017-20211121-30148.html); 2P AMD EPYC 9654 (1790 SPECrate2017_int_base, 192 Total Cores, https://www.spec.org/cpu2017/results/res2022q4/cpu2017-20221024-32607.html); 2P AMD EPYC 9754 (1950 SPECrate2017_int_base, 256 Total Cores, http://www.spec.org/cpu2017/results/res2023q2/cpu2017-20230522-36617.html); 2P AMD EPYC 9755 (2720 SPECrate2017_int_base, 256 Total Cores, https://www.spec.org/cpu2017/results/res2024q4/cpu2017-20240923-44824.html); 2P AMD EPYC 9965 (3100 SPECrate2017_int_base, 384 Total Cores, https://spec.org/cpu2017/results/res2024q4/cpu2017-20241004-44979.html); SPEC and SPECrate are registered trademarks of the Standard Performance Evaluation Corporation. See www.spec.org for more information.

[10]
Based on AMD internal testing on 2P reference servers running HPL as of 10/10/2024. Results may vary due to factors including system configurations, software versions and BIOS settings. 2P AMD EPYC 7601, 1.17 TFLOPs, 64 total cores, 128 total threads, (32C/64T 180W, SMT_on, GCC 9, Performance and Power Optimization Guide 56045_1_03.pdf); 2P AMD EPYC 7742, 3.76 TFLOPs, 128 total cores, 128 total threads, (64C/64T 225W, SMT_off, GCC 9, Performance and Power Optimization Guide 56334_1.08.pdf); 2P AMD EPYC 7763, 4.01 TFLOPs, 128 total cores, 128 total threads, (64C/64T 280W, SMT_off, GCC 9, Performance and Power Optimization Guide 56882_1.13.pdf); 2P AMD EPYC 9654, 6.69 TFLOPs, 192 total cores, 384 total threads, (96C/192T 360W, SMT_on, AOCL 4.3.1, Performance and Power Optimization Guide 57863_1.50.pdf); 2P AMD EPYC 9754, 10.38 TFLOPs, 256 total cores, 512 total threads, (128C/256T 360W, SMT_on, AOCL 4.3.1, Performance and Power Optimization Guide 57863_1.50.pdf); 2P AMD EPYC 9755, 17.31 TFLOPs, 256 total cores, 256 total threads, (128C/128T 500W, SMT_off, AOCL 4.5, Performance and Power Optimization Guide 58412_0.84.pdf); 2P AMD EPYC 9965, 20.55 TFLOPs, 384 total cores, 384 total threads, (192C/192T 500W, SMT_off, AOCL 4.5, Performance and Power Optimization Guide 58412_0.84.pdf).

[11] SPECrate2017_int_base comparison based on published results as of 10/10/2024. Configurations: 1P Intel Xeon Platinum 8592+ (563 SPECrate2017_int_base, 64 Total Cores, https://www.spec.org/cpu2017/results/res2024q3/cpu2017-20240729-44257.html); 1P Intel Xeon 6980P (1230 SPECrate2017_int_base, 128 Total Cores, https://www.spec.org/cpu2017/results/res2024q4/cpu2017-20241104-45161.html); 1P AMD EPYC 9755 (1260 SPECrate2017_int_base, 128 Total Cores, https://www.spec.org/cpu2017/results/res2024q4/cpu2017-20240924-44944.html); 1P AMD EPYC 9965 (1510 SPECrate2017_int_base, 192 Total Cores, https://www.spec.org/cpu2017/results/res2024q4/cpu2017-20241118-45545.html); SPEC and SPECrate are registered trademarks of the Standard Performance Evaluation Corporation. See www.spec.org for more information.

Acknowledgments
We would like to thank our talented AMD design team across Austin, Bangalore, Boston, Fort Collins, Hudson Valley, Hyderabad, Markham, Santa Clara, and all AMDers who contributed to “Zen 5”.
Come check out “Zen 5” up-close and de-lidded at the AMD exhibition booth.

AMD Exhibition/Demo
Please join us for our exhibition/demo showing our new “Zen 5”-based Granite Ridge CPUs!

Disclaimer
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED “AS IS.” AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

© 2025 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, Infinity Fabric, Ryzen, and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

2.2: Telum II: Next Generation 5.5GHz Microprocessor with On-Die Data Processing Unit and Improved AI Accelerator
2025 IEEE International Solid-State Circuits Conference

IBM Telum II: Next Generation 5.5GHz Microprocessor with On-Die Data Processing Unit and Improved AI Accelerator

Gerald Strevig¹, Chris Berry², Rahul Rao³, Noam Jungmann⁴, Michael Sperling², Michael Becht², Eduard Herkel⁵, Matthias Pflanz⁵, Pat Meaney², Michael Romain², Mark Cichanowski¹, Amanda Venton¹, David Wolpert², Elazar Kachir⁴, Luke Hopkins², Tim Bubb², Andreas Arp⁵, Daniel Kiss⁵, Simon Büchsenstein⁵, Michael Wood², Michael Spear¹, Robert Sonnelitter², Rajiv Joshi⁶

¹ IBM Systems, Austin, TX
² IBM Systems, Poughkeepsie, NY
³ IBM Systems, Bangalore, India
⁴ IBM Systems, Tel Aviv, Israel
⁵ IBM Systems, Böblingen, Germany
⁶ IBM Research, Yorktown Heights, NY

Outline
- Introduction
- Overview
- Technology
- System Topology
- Enhancements and Innovations
- Results
- Summary

Introduction
- Telum II is the central processor chip for the next generation IBM Z and LinuxONE systems
- 8 cores running at 5.5GHz
- Increased performance
- Enhanced security
- On-die Data Processing Unit (DPU)
- Improved AI Accelerator

Telum II Chip Overview
- Samsung 5nm technology
- 600 mm²
- 43 billion transistors
- 24 miles of wire
- 165 billion vias
- 1.2 trillion shapes
[Die photo courtesy of Samsung: 24.9 mm x 24.1 mm]

Technology Overview
- Samsung 5nm technology
- FEOL: Mixed diffusion breaks
- 4 VT options: R, L, S, U
- BEOL: 18 Metal Layer Stack
- High density wiring levels: 4 x 36 to 44nm metal pitches, 4 x 80nm-pitch
- Medium width high performance levels: 4 x 128nm-pitch, 2 x 256nm-pitch, 2 x 360nm-pitch
- Ultra thick layers for off-chip signals and power/clock distributions: 2 x 2.4um-pitch layers
[Figure: metal stack cross-section showing 2.4um layers (1 shown), 360nm layers, 2 256nm layers, 4 128nm layers, 4 80nm layers, and 4 36-44nm layers]

zNext System Topology
- Same system topology as z16 (ISSCC 2022, Ofer Geva)
- 2 Telum II processors (CP) per DCM connected by M-BUS
- 4 DCMs per drawer fully interconnected by X-BUS
- 4 drawers per system connected by high-speed A-BUS cables
- 32 Telum II chips in a fully populated system
[Figure: drawer topology with CP-0 through CP-7 and drawer links D1, D2, D3]
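
The chip count on the topology slide follows from the three levels of packaging it lists. The sketch below multiplies them out and, as an added assumption, extends the count to physical cores using the "8 cores" figure from the Introduction slide; it does not model how many cores are available to customers.

```python
# Packaging hierarchy as listed on the zNext System Topology slide.
CPS_PER_DCM = 2          # Telum II chips per dual-chip module (M-BUS)
DCMS_PER_DRAWER = 4      # fully interconnected by X-BUS
DRAWERS_PER_SYSTEM = 4   # connected by A-BUS cables
CORES_PER_CHIP = 8       # from the Introduction slide

chips_per_drawer = CPS_PER_DCM * DCMS_PER_DRAWER
chips_per_system = chips_per_drawer * DRAWERS_PER_SYSTEM
physical_cores_per_system = chips_per_system * CORES_PER_CHIP

print(chips_per_drawer, chips_per_system, physical_cores_per_system)
```

The result of 32 chips matches the "32 Telum II chips in a fully populated system" bullet.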

Telum II I/Os
- Memory: 16 TB per drawer; DDR4 and DDR5 support
- PCIe Gen5 x16; 192 PCIe cards per system
- A-BUS: Drawer to drawer connection
- X-BUS: On drawer connection
- M-BUS: Synchronous connection to the high-speed ring on each chip in DCM; 238 GB/s bandwidth

A-BUS Encryption
- Drawer-to-drawer encryption
- Step 1: Public key exchange and key pair authentication using IBM signed certificate (ECDSA classical, Crystals-Dilithium quantum)
- Step 2: Asymmetric key exchange (ECIES classical, Crystals-Kyber quantum)
- Step 3: Symmetric key encryption (AES-256 CTR)
Benefits
- Keys setup during manufacturing process
- Can redo key exchange after drawer repair
- Protect from man in the middle
[Figure: Steps 1 to 3 between Drawer 0 CP-3 and Drawer 3 CP-3; Data combined with Keystream]
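
The A-BUS Encryption slide names AES-256 in CTR mode, where data is XORed with a keystream derived from a key, a nonce, and a block counter. The sketch below shows only that counter-mode structure. It is not AES: SHA-256 from the Python standard library stands in for the block cipher so the example stays self-contained, and the key, nonce, and message are hypothetical values. It should not be used as real cryptography.

```python
import hashlib

BLOCK_BYTES = 32  # SHA-256 digest size; AES itself uses 16-byte blocks


def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    """Stand-in PRF over (nonce, counter); real CTR mode would call AES-256 here."""
    return hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()


def ctr_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; the same call encrypts and decrypts."""
    out = bytearray()
    for offset in range(0, len(data), BLOCK_BYTES):
        block = keystream_block(key, nonce, offset // BLOCK_BYTES)
        chunk = data[offset:offset + BLOCK_BYTES]
        out.extend(d ^ k for d, k in zip(chunk, block))
    return bytes(out)


key = bytes(range(32))   # hypothetical 256-bit session key
nonce = b"drawer0-cp3"   # hypothetical per-link nonce
message = b"cache line payload crossing the A-BUS between drawers"

ciphertext = ctr_xor(key, nonce, message)
recovered = ctr_xor(key, nonce, ciphertext)
print(recovered == message)
```

Because encryption and decryption are the same XOR against the keystream, both ends of a link only need the shared key and an agreed nonce and counter, which is consistent with the earlier key-exchange steps on the slide.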

X-BUS Performance
- 20.8 Gbps to 29.3333 Gbps
- 36% link bandwidth increase
- 47 GB/s to 66 GB/s
- 3:1 clocking
- Simplified crossing

Telum II L2 Cache
- 36 MB
- 3.6 ns access time
- Designed with Samsung's high-density SRAM bitcell
[Figure labels: Control and Directory; Cache; Ring on/off Ramp]

L2 Cache Dense SRAM Design
- Negative bit line (NBL) assist scheme based off metal coupling
- Metal capacitor short detection and repair
[Figure labels: NBL boost circuit; Column of 256 SRAM cells on bit line; NBL column select]

L2 Cache SRAM Short Detection
[Figure]

L2 Cache SRAM Voltage Regulator
- On-chip voltage regulator for SRAM stand-alone supply
- Dual loop analog design
- High precision slow loop
- Distributed micro-regulators
- Predictive activation scheme
- SRAM supply is tuned independent of the rest of the chip to optimize yield, power, and performance
[Figure labels: High precision regulator; Micro-regulator]

Cache Hierarchy and Ring
- 40% increase
- Each core and DPU has a private L2
- 10 L2 instances form 360MB virtual L3
- Fully connected by 352 GB/s ring
- Fully populated drawer contains 2.88GB virtual L4
[Figure labels: Cross Bar; Ring]
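
The virtual L3 and L4 capacities on the Cache Hierarchy and Ring slide can be rebuilt from the per-instance L2 size. The sketch below uses the 36 MB L2 figure from the L2 Cache slide and the eight-chips-per-drawer count implied by the topology slide (2 CPs per DCM, 4 DCMs per drawer); treating 1 GB as 1000 MB is an assumption chosen because it reproduces the quoted 2.88GB.

```python
L2_MB = 36                    # per L2 instance, from the L2 Cache slide
L2_INSTANCES_PER_CHIP = 10    # "10 L2 instances form 360MB virtual L3"
CHIPS_PER_DRAWER = 2 * 4      # 2 CPs per DCM x 4 DCMs per drawer

virtual_l3_mb = L2_MB * L2_INSTANCES_PER_CHIP
virtual_l4_mb = virtual_l3_mb * CHIPS_PER_DRAWER
virtual_l4_gb = virtual_l4_mb / 1000  # decimal GB (assumption)

print(virtual_l3_mb, virtual_l4_mb, virtual_l4_gb)
```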

Accelerators
- On-chip compression and AI acceleration
- AI accelerator enhancements
- Added INT8 data type support
- 4x increase in TOPS to 24 TOPS per chip
- 192 TOPS per drawer using the ability to access AI accelerators from other chips when the local AI accelerator is busy

Telum II Core
- 20% area shrink compared to Telum
- Constructed in 7 fully abutted blocks
- System performance enhancements: Register growth, Store writeback, Branch prediction
[Figure: core floorplan, 2.8 mm x 3.4 mm]

Telum II DPU
- Initial usage for I/O Accelerator
- 70% reduced power for IO management
- 1st generation where storage and networking protocols are integrated on the central processor
[Figure: DPU floorplan, 3.1 mm x 5.6 mm]

Core Construction
- Iterate on floorplan with fast timing model
- Change physical hierarchy into 7 unique fully abutted blocks
[Figure labels: Logic; Changes; Timing contracts; Port locations]

DPU Construction
- Iterate on floorplan with fast timing model
- Change physical hierarchy into 5 unique fully abutted blocks where PE cluster is reused 4 times
[Figure labels: Logic; Changes; Timing contracts; Port locations]
107、ocessor with On-Die Data Processing Unit and Improved AI Accelerator 2025 IEEE International Solid-State Circuits Conference20 of 24Chip Vminvs Process DelayVminreduction in zNext compared to z163%from 5nm technology3.5%from improved design efficiencySee ISSCC 25 paper 8.1,by Webel et al.“Dynamic Gu
108、ard-Band Features of the IBM zNext System”2.2:Telum II:Next Generation 5.5GHz Microprocessor with On-Die Data Processing Unit and Improved AI Accelerator 2025 IEEE International Solid-State Circuits Conference21 of 24Core Power Performance1.3x generational improvementAcross comparable proprietary wo
109、rkloadsSee ISSCC 25 paper 37.1,by Wolpert et al.“IBM Telum IIprocessor design-technology co-optimizations for power,performance,area,and reliability”2.2:Telum II:Next Generation 5.5GHz Microprocessor with On-Die Data Processing Unit and Improved AI Accelerator 2025 IEEE International Solid-State Cir
Telum II Reliability
- Soft error rate robust latches
- Cache RAID protection
- Fine-grained self-healing
- Core recovery and sparing
- 99.999999% system uptime
- See ISSCC 25 paper 37.1, by Wolpert et al., “IBM Telum II processor design-technology co-optimizations for power, performance, area, and reliability”

Telum II Summary
- All cores, all the time, 5.5 GHz
- Improved performance
  - 1.3x Core IPC/watt
  - 36% more on drawer bandwidth
  - 40% larger cache
- Enhanced security
- On-die data processing unit
  - 70% reduced IO management system power
- Improved AI accelerator with 4x TOPS and up to 6x throughput per chip
[Die photo labels: 24.9 mm, 24.1 mm]
Acknowledgement
Thank you to the entire IBM Z, EDA, Product Engineering, and Research teams for their contributions to Telum II, and thank you to Samsung for wafer fabrication.

2.3: Granite Rapids-D: Intel Xeon 6 SoC for vRAN, Edge, Networking, and Storage
2025 IEEE International Solid-State Circuits Conference
Raj R Varada, Rohini Krishnan, Ajith Subramonia, Rathish Chandran, Kalyana Chakravarthy, Uttpal D Desai, Sumedha Limaye, Puneesh Puri, David R Mulvihill, Mike Bichan, Martin Koolhaas, Vijayalakshmi Ramachandran, Srinivasu Kondle

Outline
- Motivation and Overview
- Modular Product Design
- Intel Xeon 6 SoC Generational Scaling
- IO Die Features and Construction
- IO Die Circuit Technology

Motivation: Edge
- Edge is the Next Frontier in Digital Transformation!
- Secure, connected, managed networks
- Compute density for AI & scalar workloads
- Optimized for rugged environments

Motivation: vRAN
- Cloud Native
- Operational Flexibility
- Rapid Introduction of New Services
- Resource & Energy Efficiency
- Improved TCO & end-user Experience
- Faster Integration of New Innovations like AI

Intel Xeon 6 SoC
Integrated SoC with processor cores, accelerators, and high bandwidth IO, targeted toward virtualized Radio Access Network (vRAN), Edge Server, Networking, Storage, and Security Applications

Overview: Processor built for Edge and vRAN
- Compute: Scalar & data parallel workloads
- Memory: Low latency, high bandwidth
- IO: High bandwidth PCIe Gen 5
- Integration: Ethernet and accelerators
- RAS: Server grade robustness
- Scalability: Multiple edge systems based on one architecture
- Security: Confidential AI enabled
- Form Factor: Optimized for space and power constrained environments
- Compute Optimized, Edge Optimized: Consistent Architecture from Edge to Cloud

Modular Product Design
[Figure labels: IO Die (Accelerator Tiles, Network Accelerator Complex, PCIe, CXL, Ethernet Subsystem); Compute and Memory Die; Embedded Multi-die Interconnect Bridge; DDR; Ethernet; PCIe 5.0/4.0, CXL 2.0]
- Disaggregated Design: Compute Chiplet and IO Chiplet integrated by EMIB
Modular Product Design
- Integrated Ethernet: 8 ports; 2x 100G, 4x 50G, 8x 25G, 10G, 1G, 100M
- PCI Express & Compute Express Link: 32 lanes PCIe 5.0, 16 lanes PCIe 4.0, 16 lanes CXL 2.0
- Memory: 8ch/4ch DDR5; supports RDIMM, Memory Down, SODIMM
- Integrated Accelerators
  - Intel Data Streaming Accelerator
  - Intel Dynamic Load Balancer
  - Intel Quick Assist Technology
  - Intel vRAN Boost
  - Media Accelerator
- P-Core (Intel Xeon 6 Compute Die): Redwood Cove
- Compute Die and IO Die connected by EMIB
- Performance/Watt Optimized Cores, Integrated AI Acceleration, Unified Cache and Memory
Modular Product Design
- Intel Xeon 6 SoC HCC Configuration
  - One Compute Chiplet and One IO Chiplet
  - Embedded Multi-die Interconnect Bridge (EMIB) packaging enables high bandwidth, low latency cache & memory
- Intel Xeon 6 SoC XCC Configuration
  - Two Compute Chiplets and One IO Chiplet
  - Higher core count, 8CH DDR5

Modular Product Design
[Figure labels: HCC 77.5mm x 50mm (with dummy); XCC 77.5mm x 56.5mm]
- Package design supports HCC and XCC configuration on the same platform with BGA
- Modular pin design for compatibility between HCC (77.5mm x 50mm) and XCC (77.5mm x 56.5mm) pin out
- Optimized package, thermal, and integrated heat spreader design with 0.94mm 1.0mm Hex pin pattern
- Pinmap (footprint) Modularity: the smaller package (HCC) footprint is a drop-in compatible subset of the bigger package (XCC) footprint on the same board

Intel Xeon 6 SoC Generational Scaling
Configuration Scaling
Configuration | Metric | Intel Xeon 6 SoC | Intel SoC Prior Gen
Cores | Max Count | XCC: 72, HCC: 42 | HCC: 20, LCC: 10
Memory (DDR) | MT/s | up to 6400 DDR5 | up to 3200 DDR4
Memory (DDR) | Channel | XCC: 8 CH, HCC: 4 CH | HCC: 4 CH, LCC: 3 CH
CXL | IO | CXL 2.0 | NA
PCIe | Lanes | 2x16 PCIe 5.0, 1x16 PCIe 4.0 | 2x16 PCIe 4.0
Ethernet | Lanes | 8 ports, 200G Max | 8 ports, 100G Max
Significant feature and capability gains over the prior-generation processor. Refer to [1] for details.
Intel Xeon 6 SoC Generational Scaling
Refer to [1] for details.
Benchmark | Metric | Intel Xeon 6 SoC | Intel SoC Prior Gen
ResNet50-v1-5 | Score | 4198 | 511
ResNet50-v1-5 | Perf/Watt | 22.3 | 4.5
IO Multi-Tunnel IP Sec | Gbps | 180 | 55
Vision Transformer Base | Score | 794 | 119
IO Next Generation Firewall | Gbps | 96 | 35
Significant performance and performance-per-watt gains over the prior-generation processor.
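The generational gains in the benchmark table can be turned into ratios with a few lines of Python. This is a minimal sketch, not part of the presentation: it assumes the flattened figures split as Xeon 6 SoC versus prior generation in the pairs shown below, and the dictionary keys and the `speedup` helper are illustrative names.

```python
# Benchmark values as read from the slide: (Intel Xeon 6 SoC, prior generation).
# The pairing of the run-together digits is an assumption made for this sketch.
benchmarks = {
    "ResNet50-v1-5 (score)": (4198, 511),
    "ResNet50-v1-5 (perf/watt)": (22.3, 4.5),
    "IO Multi-Tunnel IPsec (Gbps)": (180, 55),
    "Vision Transformer Base (score)": (794, 119),
    "IO Next Generation Firewall (Gbps)": (96, 35),
}


def speedup(new, old):
    """Return how many times larger the new value is than the old one."""
    return new / old


# Ratio per benchmark, rounded to two decimals for display.
ratios = {name: round(speedup(new, old), 2) for name, (new, old) in benchmarks.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

Under that reading, the ratios range from roughly 2.7x for the firewall workload to roughly 8.2x for the ResNet50 score.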
IO Die Features and Construction
- IO Die Size: 31.90mm x 12.06mm
- Fabricated in Intel 4 Process Technology
- Production Readiness in 2025

IO Die Features and Construction
[Intel Xeon 6 SoC IO Die photo, labeled regions: Die-2-Die Fabric; Integrated Accelerators; Network Accelerators and 200G Ethernet SS; Ethernet SerDes; PCIe 5.0/4.0 | CXL 2.0]

IO Die Features and Construction
- Intel Data Streaming Accelerator (Intel DSA) for infrastructure processing & storage
- Intel Quick Assist Technology (Intel QAT) for storage, CDN, security
- Intel vRAN Boost for telco radio access networks
- Intel Dynamic Load Balancer (Intel DLB) for packet processing
- Media Accelerator for visual inference, media processing

IO Die Circuit Technology: Power Delivery
- Fine grain voltage and power control via on-die power management
- Power delivered via a combination of Motherboard voltage regulators and on-die Fully Integrated Voltage Regulators (FIVR)
- Isolated power delivery for sensitive analog high-speed IOs to meet tight jitter budgets
- Air Core Inductors for DDR IVR and Co-Ax Mil Inductors for other high-speed IOs to balance noise and power
[Figure labels, platform rails:
  VR13 1ph: VCCD1_HV (1.1V)
  VR13 1ph: VCCFA_EHV (1.8V)
  VR13 2ph: VCCFA_EHV_FIVRA (1.8V)
  VR14 3+4Ph: VCCIN_EHV (1.8V)
  VR14 2ph: VCCINFAON
  VR13 3+1ph: VNN_NAC
  VR13 1ph: VCCD0_HV (1.1V)
  Fixed 1V VR: VCCVNN
  Platform supplied rail: CPLD 3.3V]
IO Die Circuit Technology: On-die IO
- 4 or 8 Channel DDR5 (on C-die)
- 8 ports integrated Ethernet: 2x100G, 4x50G, 8x25G, 10G, 1G, 100M
- 32 Lanes of PCIe 5.0
- 16 Lanes of PCIe 4.0
- 16 Lanes of CXL 2.0
- Advanced Integrated clock IP: RCLK
- Die to Die IO: MDF
[Figure: SoC IO overview. Labels: DDR5 channels (RDIMM/MCR, UDIMM, SODIMM, Memory Down); two x16 PCIe 5.0 (x16, x8, x4, x2; 16 root ports); 16 PCIe lanes, up to G4 (x8/4/2; 8 root ports); Ethernet 2x100/4x50, 8x25/20/2.5/1, 200G throughput; GPIO, SMBus, UART, eSPI, SPI, USB; SATA, M.2, BMC, PCIe Dev (optional)]
- Intel Xeon 6 SoC supports a rich set of high BW, power-efficient IO
IO Die Circuit Technology: PCIe 5.0/4.0
[Figure: PCIe PHY block diagram. Labels: common block (CMN) with slow and fast PLLs, slow and fast synthesizers, differential reference receiver (ckref_p, ckref_n, ckref_bot), Rcomp, analog test port; per-lane APMA/DPMA Tx and Rx with serializer, deserializer, and CTLE/DFE/CDR (tx_p_lx, tx_n_lx, rx_p_lx, rx_n_lx); DPMA registry; PCS CMN, PCS Tx, PCS Rx; PIPE SERDES interface; APB, JTAG, boundary scan, SIIP interface]
- Power Efficient, Flexible, High-Perf PCIe 5.0 Interface
- Configurations:
  - Full bifurcation on 4 lanes (4px4), configurable as 4x1, 2x2, 1x2+2x1, 1x4
  - One port on 8 lanes (1px8), configurable as 2x4 or 1x8
  - Two ports on 16 lanes (2px16)
- Power Management:
  - Supports L1* low power states
  - Per-lane HSIO PHY and controller dynamic power gating
- Advanced Error Reporting
IO Die Circuit Technology: PCIe Rx
[Figure: PCIe receiver block diagram. Labels: pad, L1, R-term, LFEQ, CTLE (gain/peaking), iDAC (even/odd), data, edge, and aux slicers, SIPO + phase detector + retimer, timing recovery, DCO, PI, I/Q clocks, calibration, Rx adaptation, power management, DFX, squelch, BSCAN]
- Separate Reference Clock with Independent Spread Spectrum Clocking (SRIS) fully supported
- Enables higher tolerance of 5600 ppm vs. PCIe Base tolerance of 600 ppm
- Reduces EMI and eliminates the need for expensive shielding in cabling solutions

IO Die Circuit Technology: PCIe 5.0
- Fully PCIe 5.0 compliant, supporting 2.5, 5.0, 8.0, 16.0 and 32.0 GT/s
- PHY is PIPE 5.2 compliant, ensuring compatibility and interoperability with other components adhering to the same standard
- Sinusoidal Jitter (SJ) amplitude vs frequency graph shows the jitter spec is in compliance
IO Die Circuit Technology: 56G SerDes
- 1-32 Gb/s NRZ, 2-58 Gb/s PAM4 UX Multiprotocol SerDes
- Dual PLLs provide full range of transmit clocking, supporting 32.44 Gb/s JESD
- Wide range support for channel loss from VSR to LR
- Configurations: 1-lane & 4-lane with shared 16.22GHz clock for minimum area and power
- Background calibration and adaptation supports beyond-spec PVT functionality
- 16-tap analog DFE architecture with advanced power management to provide optimal power/performance

IO Die Circuit Technology: 56G SerDes PAM4
- Clean & symmetrical PAM4 eye
- Supports PCIe, Ethernet, CPRI, JESD, SDI, DP, HDMI, OIF, USB and more
- Dual-path bias control to achieve low jitter across full frequency range
IO Die Circuit Technology: RCLK
- Fully integrated on-die low noise crystal oscillator and LCPLL designs
- Commodity 25 MHz quartz crystal is the only external component
- Supports fully integrated reference clocks for 56G SerDes and PCIe 5.0
- RCLK delivers 25% better jitter than “pristine” external reference clocks in validation platforms

Beam Management Challenge at the Cell Edge
[Figure labels: Video Quality Drop; Cell Edge]

AI-assisted Beam Management Powered by Intel Xeon 6 SoC
[Figure labels: Keeps UHD quality; Cell Edge]
We acknowledge the global Intel talent who contributed to the success of Intel Xeon 6 SoC.
Thank You!

References
[1] P. Mosur et al., “Built for the Edge: The Intel Xeon 6 SoC,” Hot Chips Symp., 2024.
[2] R. Mahajan et al., “Embedded Multi-die Interconnect Bridge (EMIB) - A Localized, High-Density Multichip Packaging Interconnect,” IEEE Trans. on Components, Packaging and Manufacturing Tech., vol. 9, no. 10, pp. 1952-1962, 2019.
[3] K. Bharath et al., “Integrated Voltage Regulator Efficiency Improvement using Coaxial Magnetic Composite Core Inductors,” IEEE Elec. Components and Tech. Conf., pp. 1286-1292, 2021.
[4] R. Krishnan et al., “Design Techniques for Reference Clock Jitter Optimization for High Speed PHYs,” IEEE Conference on Design, Test, Technology of Integrated systems, 2023.
[5] N. Nassif et al., “Sapphire Rapids: The Next-Generation Intel Xeon Scalable Processor,” ISSCC, pp. 42-46, 2022.
[6] B. Sell et al., “Intel 4 CMOS Technology Featuring Advanced FinFET Transistors Optimized for High Density and High-Performance Computing,” IEEE Symp. VLSI Circuits, pp. 282-283, 2022.

Notices and Disclaimers
Performance varies by use, configuration and other factors. Learn more at www.I results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Availability of accelerators varies depending on SKU. Visit https:/ for additional product details. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

2.4: A 300MB SRAM, 20Tb/s Bandwidth Scalable Heterogenous 2.5D System Inferencing Simultaneous Streams Across 20 Chiplets with Workload-Dependent Configurations
2025 IEEE International Solid-State Circuits Conference
S. Rangachar Srinivasa, D. Kurian, P. Aseron, P. Budhkar, A. Radhakrishnan, A. Cardenas Lopez, J. Sundaram, V. Honkote, L. Azarenkov, D. Lake, J. Timbadiya, M. Moiseev, B. Perez Esparza, R. Kalim, E. Ramirez Lozano, M. Bhartiya, S. Kabisthalam Muthukumar, S. Yada, S. Kadavakollu, S. Chhabra, K. Prasad Sahu, J. Greensky, X. Brun, E. Juarez Hernandez, R. Munoz, T. Thomas, S. Liff, V. De, A. Srinivasan, T. Karnik
Intel Corporation
Chiplets are Becoming Increasingly Prevalent
- Meteor Lake [1]
- Agilex FPGA [2]
- Ponte Vecchio [3]
[1] https:/
[2] https:/
[3] https:/ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9731673

- Enable seamless alternative configs, Scale Reticle Size
- Integrate multi foundry multi process chiplets
- Host hundreds of chiplets, integrate heterogenous IPs
Example Config 1 (on Passive Base Chiplet, Foundry 4)
- 8C CPU Chiplet: 4nm, Foundry 1
- IO Chiplet: 7nm
- 128C Graphics Chiplet: 7nm, Foundry 1
- SoC: 4nm, Foundry 1
- SRAM 96MB: 4nm, Foundry 2
Example Config 2 (on Passive Base Chiplet, Foundry 4)
- 6C CPU Chiplet: 7nm, Foundry 3
- IO Chiplet: 7nm
- 512C Graphics Chiplet: 3nm, Foundry 1
- Novel Memory 1GB: Foundry 2
We Propose to “Standardize”
- Host interface
- Inter chiplet network
- Co-design of base die and top dies
- Package
- Assembly
- Testing
- Software
Standardization of Chiplet Based Designs
Standardization Enables:
- Reduced cost
- Accelerated time to market
- Distributed computing with heterogeneity
- Increased flexibility and scalability
Our focus:
- D2D reconfigurable network
- Workload adaptive configurations
- Coherent integrable components
[Figure: Base Chiplet, Bump Layer, Active Chiplet Layer. Illustrative Config-1 alternates compute and memory chiplets (C M C M ...); Illustrative Config-2 groups memory (M), compute (C), and accelerator (A) chiplets. Legend: A = Accelerators, M = Memory, C = Compute. Anchored position and routing.]

This work: Road to Template Based Design
Template adhering multi foundry chiplets
- 24k Signal Bumps
- 140k Power Bumps
- 15k C4 Bumps
- 45k TSVs
- Power rails = 120
Prototype: Multi Foundry Multi Chiplet System
 | Compute Chiplet | Memory Chiplet | Base Silicon
Technology | TSMC 16 | Intel 4 | UMC 130
Chiplet Area (mm2) | 3.70 x 4.00 | 3.94 x 4.04 | 22.38 x 19
SRAM Capacity | 3MB | 16MB | -
PCIe present | Yes | No | -
Xtensa Version | LX7 | VP6 | -
- Base config: 4 TSMC and 16 Intel chiplets over a UMC 130nm passive base die
- Assembled 4 different configurations for AI and media acceleration workloads
Active Chiplets Architecture and Features
Features | Compute chiplet | Memory chiplet
Router | 128b AXI | 128b AXI
SERDES | Quad Pump (2GHz) | Quad Pump (2GHz)
Processor/DSP | Tensilica Xtensa LX7 (32-bit RISC) | Tensilica Vision P6 (AI workloads)
On chip SRAM | 1.5MB, 4 Banks | 16 MB, 4 Data Banks
High Speed interface | PCIe 4 compatible (controller + PHY within chiplet) | Non PCIe compatible, but I/O and bump patterns matching compute chiplet
NMC accelerators | N/A | 3, Near Memory AI and networking accelerators
Debug support | Both inter and intra chiplet debug facility | Both inter and intra chiplet debug facility
Prototype: Scale-up and Scale-out Architecture
- AXI based Die-to-Die router network
- Optimal path from source to destination is
  - When all chiplets are busy
  - When nonfunctional chiplets are in the system

Memory Architecture
- Repurpose any chiplets as memory tiles, enabling contiguous addressing
- Every SRAM block of the system is accessible through a single addressing scheme

Measurement Approach: Software and Platform
- Graph compiler and custom software determine the optimal mapping across the multi-chiplet system
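The single addressing scheme on the Memory Architecture slide can be illustrated with a small address decoder. The slides do not give the actual address map, so this is only a sketch under stated assumptions: tiles are laid out back to back, per-chiplet capacities are taken from the prototype table (3MB per compute chiplet, 16MB per memory chiplet), and the tile ordering plus the names `Tile`, `build_address_map`, and `decode` are hypothetical.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class Tile:
    name: str
    base: int  # first global byte address owned by this tile
    size: int  # bytes of SRAM exposed by this tile


def build_address_map(tiles):
    """Lay tiles out back to back so the global address space has no holes."""
    layout, base = [], 0
    for name, size in tiles:
        layout.append(Tile(name, base, size))
        base += size
    return layout


def decode(layout, addr):
    """Translate a global address into (tile name, local offset)."""
    for tile in layout:
        if tile.base <= addr < tile.base + tile.size:
            return tile.name, addr - tile.base
    raise ValueError(f"address {addr:#x} is outside the mapped SRAM")


# Hypothetical ordering: 4 compute chiplets first, then 16 memory chiplets.
tiles = [(f"compute{i}", 3 * MB) for i in range(4)]
tiles += [(f"memory{i}", 16 * MB) for i in range(16)]
layout = build_address_map(tiles)

print(decode(layout, 0))            # first byte of the first compute chiplet
print(decode(layout, 12 * MB))      # first byte after the compute chiplets
print(decode(layout, 28 * MB + 5))  # an address inside the second memory chiplet
```

A linear scan keeps the sketch short; a hardware router would more plausibly decode the upper address bits directly, which is one reason fixed, power-of-two tile sizes are attractive.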
Application Enabled: ResNet50
- Image Dimension: 224 x 224
- Top 3 classifications and accuracies
  - Image 1: Persian Cat (44%), Tiger cat (15.3%), Tabby (8.6%)
  - Image 2: Beagle (92.8%), Basset (4.8%), English foxhound (1.4%)
  - Image 3: Brown Bear (99.12%), Sloth Bear, American Black Bear

Application Analysis: ResNet50
[Chart labels: Reordering Data; CNN and GEMM; ADD; Pooling and Softmax]
- 4TOPS/Chiplet hardware: measured inference rate 6FPS, 6 parallel threads
- Scaled 20TOPS/Chiplet hardware: projected inference rate 30FPS, 2.5W/Chiplet

Comparison Table
 | [2] JSSC 2020 | [3] JSSC 2020 | [4] ISSCC 2022 | [5] CICC 2021 | [6] 2023 ECTC | [7] ISSCC 2023 | Our work
Technology | 16nm | 7nm | 16nm | 16nm | - | 4nm | Intel 4nm, TSMC 16nm
Heterogenous? | No | No | No | No | Yes | No | Yes
# chiplets | 36 | 2 | 4 | 2 | 4 | 2 | 20
Chiplet connections | Organic substrate | CoWoS | 2 layer Si-IF | 180nm Passive | Redistribution Layer | Passive Silicon | 130nm Passive Silicon
System/application | Deep learning | CPU | DSP | FPGA | CPU, GPUs | Chiplet-Chiplet link | Distributed AI
Channel Reach(m): -500m3502000-30001200
Bump Pitch: -401055-5045
Thank You
Questions?

2.5: A 16nm 5.7TOPS CNN Processor Supporting Bi-Directional FPN for Small-Object Detection on High-Resolution Videos
2025 IEEE International Solid-State Circuits Conference
Yu-Chun Ding1, Chia-Yu Chang1, Chun-Yeh Lin1, Hui-Yun Tsai1, Hao-Jiun Tu1, Kai-Feng Chang1, Yu-Ching Su1, Tsung-Han Hsieh1, Yu-Kuan Jian1, Wen-Ching Chen2, Nian-Shyang Chang2, Chun-Pin Lin2, Chi-Shi Chen2, Chao-Tsung Huang1
1 National Tsing Hua University, Hsinchu, Taiwan
2 Taiwan Semiconductor Research Institute, Hsinchu, Taiwan

High-Precision Small-Object Detection for ADAS
Car Speed | SFD | FOV (degrees) | Pedestrian | Vehicle
120km/h | 100m | 25 | 96 pixels | 322 pixels
60km/h | 50m | 60 | 304 pixels | 1013 pixels
40km/h | 33m | 105 | 338 pixels | 1126 pixels
- SFD: Safe Following Distance (three-second rule)
- FOV: Field-Of-View
- 5M-Pixel Camera, Tesla Vision Hardware 4.0
- Objects occupy around or less than 1K pixels
Image Example and Detection Performance
- YOLOv2 (4.5 mAPs): No FPN; 416x416 input; 19-layer backbone
- YOLOv3 (13.8 mAPs): Top-Down Plain FPN; 416x416 input; 53-layer backbone
- EffDet-D3 (28.7 mAPs): Multi-Layer Bi-FPN; 896x896 input; 77-layer backbone
- (under 628-1270 pixels 2M-pixel image)
- Can even detect objects that are difficult for the human eye
- FPN: Feature Pyramid Network
- mAPs: mean Average Precision on objects with area less than 32x32
- EffDet: EfficientDet
Bi-FPN: Bi-Directional Feature Pyramid Network
[Figure, bottom to top: Model Input (Padded & Resized); Backbone; Advanced Bi-Directional Feature Pyramid Network (Connections & Layer); Prediction Head; Bounding Box & Class Label]
- Better use of high-resolution FMs: able to detect small objects

Major Challenge: Enormous Memory Cost
[Chart: FM EMA (MB) versus Resolution & Depth for 19-layer YOLOv2 (416x416), 53-layer YOLOv3 (416x416), 68-layer EffNet-B1 (768x768, backbone only), and 77-layer EffNet-B3 (896x896, backbone only); plotted values as extracted: 3440.66, 90.6, 522; "Supported in [3]"]
[Chart: FM EMA (MB), FPN only, versus Depth & Connection: One-Layer Plain FPN (YOLOv3) 5.1; Multi-Layer Bi-FPN (EffDet-D3) 101.5]
- System energy (on EffDet-D3): EMA 86%, Compute 14%
  - Compute: post-sim with PTPX, 500MHz, TT
  - EMA: 1.3nJ per 64-bit access via DDR4 [13]
- FM: Feature Map. EMA: External Memory Access. EffNet: EfficientNet. EffDet: EfficientDet.
This Work
For Memory Cost
- Flow-Model Co-Optimized Orientation-Interleaved Causally-Processed Bi-FPN
  - Reduce FM EMA by 83% and Buffer Size by 70%
- EMA-Optimized Scheduling with Re-Accessing and Re-Computing
  - Reduce FM EMA by 83%
For Compute Energy
- Low-Toggled Reconfigurable Tensor Engine
  - Improve Compute Utilization by 29%
  - Reduce MAC Power by 16%
[Figure legend distinguishes Bi-FPN and backbone]

Issue-1: Enormous Memory Cost for Bi-FPN
Baseline: Node-by-Node (Frame-Wise, Spatial-First)
[Figure labels: Compute Node (FM Fusion, CONV, ACT); UPx2; DNx2; Chip]
- Buffer: One Node
- FM Buffer: 0.24MB
- FM EMA: 101.5MB
Reduce EMA: Node-First Dataflow
Dataflow | Buffer holds | FM EMA | FM Buffer
Baseline: Node-by-Node (Frame-Wise, Spatial-First) | One Node | 101.5MB | 0.24MB
Proposed: Node-First (Patch-Wise, Depth-First) | Two Bi-FPN Layers | 16.9MB | 1.79MB
[Figure labels: Compute Node (FM Fusion, CONV, ACT); UPx2; DNx2; Chip]

Root Cause: Fusion Across Patches
Original: Non-Causal CONV Inference
[Figure, 1-D view along the scanning orientation: DNx2, CONV, CONV, UPx2 & Fusion]
- Pixels need to be buffered until the next patch

Root Cause: Fusion Across Patches
Original: Non-Causal CONV Inference
[Figure, 1-D view along the scanning orientation: DNx2, CONV, CONV, UPx2 & Fusion; buffered pixels as extracted: 472321118522]
- Fusion across different patches (buffer size increases)
- Lots of connections in Bi-FPN!
  - Buffered pixels increase if across more nodes
  - Buffered pixels enlarged with (DN+UP)
- 1.26MB buffer usage for fusion connections due to across-patch buffering!

Reduce Buffer: Causally-Processed Convolution
- Original: Non-Causal CONV Inference
  - Fusion across different patches (buffer size increases)
- Proposed: Offset-Free Causal CONV
  - Complete in the same patch (buffer size decreases)
  - Output does not depend on future data
[Figure, 1-D view along the scanning orientation: DNx2, CONV, CONV, UPx2 & Fusion for both schemes. Legend: Patch n, Patch n+1, Patch n+2, Patch n+3; Non-Causal; Causal]

Drawback: Precision Degradation
- Original: Non-Causal CONV Inference
  - Fusion across different patches (buffer size increases)
  - Considers pixels from both sides: balanced receptive field
- Proposed: Offset-Free Causal CONV
  - Completes in the same patch (buffer size decreases)
  - Only considers past & present pixels: skewed receptive field, precision decreases

Save Precision: Orientation-Interleaved Scheme
- Different scanning orientation for each sub-model
- Two-layer sub-model (Node-First)
- Whole six-layer Bi-FPN model: balanced receptive field, precision increases

Feature-1 Summary: Orientation-Interleaved Causal-Processed Bi-FPN Implementation
FM Memory Cost
 | Node-by-Node | Node-First | +CP | +CP+OI
EMA (MB) | 101.5 | 16.9 | 16.9 | 16.9
Buffer (MB) | 0.24 | 1.79 | 0.53 | 0.53
- EMA: -83%; Buffer: -70%
Detection Precision (on EfficientDet-D3 with 6-layer Bi-FPN, evaluated on COCO validation dataset)
- mAP: 46.6 (Node-First), 46.4 (+CP+OI), a change of -0.2
- mAPs: 28.7 (Node-First), 28.4 (+CP+OI), a change of -0.3
- Intermediate deltas shown on the chart: (-0.7), (-0.6) for +CP; (+0.4), (+0.4) for adding OI
- CP: Causal-Processed. OI: Orientation-Interleaved. mAP: mean Average Precision. mAPs: mAP on object area less than 32x32.
Issue-2: Memory Cost for Backbone with High-Resolution Input
Layer Fusion Scheduling
[Figure: EfficientNet Building Block (PWC, DWC, CA, PWC, +) over time; feature map axes: Channel, H, W (Size of Feature Map)]
1. Depth-First Layer Fusion (stop at CA)
2. Keep whole frame of FMs in external memory (due to global pooling in CA)
3. Load from external memory
- EMA occurs at the CA input, which has the wider channel (Wide)
- Demands huge EMA even with layer fusion
- PWC: Point-wise Conv. DWC: Depth-wise Conv. CA: Channel Attention.
Feature-2: EMA-Optimized Scheduling with Re-Accessing and Re-Computing
Proposed RARC Scheduling
[Figure: EfficientNet Building Block (PWC, DWC, CA, PWC, +) over time, with slim and wide channel stages]
1. Collect global information before CA (Depth-First)
   - Discard the whole FMs; avoid EMA of the wide channel
2. Re-Accessing: load from the slim channel
3. Re-Computing: to generate the wide channel
FM EMA (MB/frame)
Scheduling | FM EMA
Layer-by-Layer | 522
w/ Layer Fusion | 227
w/ RARC (Proposed) | 90
- -60% versus layer fusion; -83% versus layer-by-layer

Feature-3 for Compute Energy Reduction: Low-Toggled Reconfigurable Tensor Engine
[Figure: 4x2-tile Inference Core. Pre-Process Unit (Tile Window, Shape-Adaptive Line FIFO); Tensor Multiplication Unit (SMTC PE 00 through SMTC PE 47, signed-magnitude and two's-complement paths, X12/X24); Shape-Adaptive Accumulator Datapath; Post-Process Unit (Activation, Dynamic Quantize, Scale & Pool)]
- Optimization-1: Utilization-Aware Shape-Adaptive Datapath for depth-wise convolution
- Optimization-2: Low-Toggled SMTC Processing Element
216、TOPS CNN Processor Supporting Bi-Directional FPN for Small-Object Detection on High-Resolution Videos 2025 IEEE International Solid-State Circuits Conference18 of 23
Low-Toggled SMTC Processing Element
Sign Extension: Redundant Toggle
[Figure: SMTC PE with inputs X and W, a 7b*7b magnitude multiplier, and a separate sign-msb path.]
Example (4x2-tile FM switching): Cycle T, Input = 8; Cycle T+1, Input = -12.
Cycle T (both formats): 0 0 0 0 1 0 0 0
Cycle T+1, TC: 1 1 1 1 0 1 0 0, Toggle Bits = 6
Cycle T+1, SM: 1 0 0 0 1 1 0 0, Toggle Bits = 2 (1 for MUL)
Distribution of toggle bits per switching activity:
0b-2b: TC 49%, SM 64%
3b-5b: TC 20%, SM 36%
6b-8b: TC 31%, SM 0%
SM: Signed Magnitude; TC: Two's Complement
2.5:A 16nm 5.7TOPS CNN Processor Supporting B
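The toggle counts on the SMTC slide above (6 bits for two's complement, 2 bits for signed magnitude, 1 bit on the multiplier input) can be reproduced with a short sketch. The function names and the 8-bit word width are assumptions made for illustration; the slide only gives the bit patterns for the inputs 8 and -12.

```python
def twos_complement(value: int, bits: int = 8) -> int:
    """Bit pattern of `value` in `bits`-bit two's complement."""
    return value & ((1 << bits) - 1)


def signed_magnitude(value: int, bits: int = 8) -> int:
    """Bit pattern with the sign in the MSB and |value| in the remaining bits."""
    sign = 1 if value < 0 else 0
    return (sign << (bits - 1)) | abs(value)


def toggles(prev: int, curr: int) -> int:
    """Number of bit positions that differ between two patterns."""
    return bin(prev ^ curr).count("1")


# Inputs taken from the slide: cycle T holds 8, cycle T+1 holds -12.
prev_in, curr_in = 8, -12

tc_toggles = toggles(twos_complement(prev_in), twos_complement(curr_in))
sm_toggles = toggles(signed_magnitude(prev_in), signed_magnitude(curr_in))
# The multiplier only sees the magnitude bits; the sign is handled separately.
mag_toggles = toggles(abs(prev_in), abs(curr_in))

print(tc_toggles, sm_toggles, mag_toggles)  # 6 2 1
```

In two's complement, the sign extension of -12 flips every upper bit, which is the redundant toggling the slide points at. In signed magnitude only the sign bit and one magnitude bit change.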
218、i-Directional FPN for Small-Object Detection on High-Resolution Videos 2025 IEEE International Solid-State Circuits Conference19 of 23
Chip Implementation
Process: TSMC 16nm FinFET
Supply Voltage: 0.66-1.09V
Chip Area: Die 2.83mm x 2.83mm; Core 2.32mm x 2.32mm
SRAM Size: 2.0MB
Core Frequency: 100-480MHz
Peak Performance: 5.7TOPS
Core Power: 177mW at 133MHz, 0.69V; 1373mW at 480MHz, 1.09V
Energy Efficiency: 8.9TOPS/W at 133MHz, 0.69V; 4.1TOPS/W at 480MHz, 1.09V
Detection Precision (mAPs, mAP, fps):
EfficientDet-OICP-INT8-D3: (26.4, 44.8, 26.6)
EfficientDet-OICP-INT8-D2: (23.6, 41.7, 50.3)
mAP:mean Average Precision mAPs:mAP on object area less than 32x32(Eva
220、luated on COCO validation dataset)2.5:A 16nm 5.7TOPS CNN Processor Supporting Bi-Directional FPN for Small-Object Detection on High-Resolution Videos 2025 IEEE International Solid-State Circuits Conference20 of 23Architectural ComparisonYOLOv2/v3416x416/416x416 in FPGA Accelerator 3 EffDet-OICP-INT8
221、-D2/D3768x768/896x896 in This Work:use precision of FP32 for YOLOv2/v3 because precision of INT8 is not provided in 3.0.6690.6 6.6 EMA(MB/frame)SRAM Size(MB)125.5 64.4 2.0Precision vs Memory Cost BetterBetter10 20 30mAPs5.1 Diameter522101.59016.9125.5682.9Baseline ProposedEMA of EffDet-D382%Back-bon
222、eBi-FPNHeadMB/frameEffDet:EfficientDet2.5:A 16nm 5.7TOPS CNN Processor Supporting Bi-Directional FPN for Small-Object Detection on High-Resolution Videos 2025 IEEE International Solid-State Circuits Conference21 of 23Performance ComparisonEffDet:EfficientDetTCSVT198ISSCC23 1ISSCC24 2This WorkProcess
223、55nm40nm14nm16nmFrequency(MHz)188225996480SRAM Size450KB2MB3MB2.0MBPeak Performance(TOPS)0.1552.28.163130.5545.73Energy Efficiency(TOPS/W)2.224.0814.18223.9-28.247.8-10.254.1-8.93ApplicationsDetectionDetection,TrackingDetection,SegmentationDetection(Small Objects)Feature PyramidNetwork SupportN/ASin
224、gle-LayerTop-Down Plain FPNN/AOICP-Optimized Multi-Layer Stacked Bi-Directional FPNModel NameYOLOv2 tinyCustomYOLOv2EffDet-OICP-D27EffDet-OICP-D37Model Complexity(#of OPs/#of Params/#of Layers)1.32G/16.9M/94.7G/12.7M/2029G/50.9M/1922G/8.1M/6850G/12M/77Frame Rate(fps)627.8163.131N/A50.326.6Energy per
225、 frame(mJ)63.350.57N/A22.944.1
+19.1/21.9% mAPs (4.5 → 23.6/26.4); mAPs evaluated on COCO val.
Table markers distinguish floating-point results from fixed INT8 results.
Conclusion
A memory-efficient CNN p
229、rocessor for small-object detection
Provides 26.6-50.3 fps at 23.6-26.4% COCO mAPs inference
Memory Cost Reduction
Co-Optimized Orientation-Interleaved Causal-Processed Bi-FPN
EMA-Optimized Scheduling with Re-Accessing and Re-Computing
Compute Energy Reduction
Low-Toggled Reconfigurable Tensor Engine
Legend (marker symbols lost in extraction): Bi-FPN; backbone.
Welcome to Our Demonstration, Feb 17th, 5-7pm!
2.6:A 1.78mJ/Frame 373fps 3D GS Processor Based on Shape-Aware Hybrid Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler 2025 IEEE International Solid-State Circuits Conference1 of 41
1.78mJ/Frame 373fps 3D GS Processor B
231、ased on Shape-Aware Hybrid Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler
Xiaoyu Feng*, Hedi Wang*, Chen Tang, Tongda Wu, Huazhong Yang, Yongpan Liu
Tsinghua University, Beijing, China
*Equally Credited Authors
Outline
Introduction
Challenges of 3D GS Processor
Key Features of 3D GS Processor
Shape-aware reconfigurable computation architecture
Online Shape-aware Early Sk
233、ipping ControllerSpatio-Temporal Mixed Gaussian Management Measurement Results Conclusion2.6:A 1.78mJ/Frame 373fps 3D GS Processor Based on Shape-Aware Hybrid Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler 2025 IEEE International Solid-State Circuits Conference3 of 413D
234、 RepresentationConventional RepresentationLearning-based 3D RepresentationNeural Radiance Field(NeRF)3D Gaussian Splatting(3D GS)Mesh,Volume,Point Cloud,etcVolumePoint CloudMeshBunny High Storage Space Low Generalization Sparse Representation ISSCC24(NeuGPU)ECCV20(NeRF)High Resolution&Details Low St
235、orage Space High Computation Overhead High Resolution&Details Moderate Storage Space Low Computation Overhead SIGGRAPH23(3D GS)ForwardBackwardCameraProje-ctionDensity Control3D Gaussians2.6:A 1.78mJ/Frame 373fps 3D GS Processor Based on Shape-Aware Hybrid Architecture Using Earlier Computation Skipp
236、ing and Gaussian Cache Scheduler 2025 IEEE International Solid-State Circuits Conference4 of 413D GS Based Rendering 3D GS on Xavier NX:th thThe varying Gaussian shapes result in many invalid computations.2.6:A 1.78mJ/Frame 373fps 3D GS Processor Based on Shape-Aware Hybrid Architecture Using Earlie
237、r Computation Skipping and Gaussian Cache Scheduler 2025 IEEE International Solid-State Circuits Conference8 of 41Challenge 3:Frequent&Irregular Off-chip AccessD=0D=1D=2D=382.8%Gaussian Parameter36.5MB Spherical HarmonicsHigh Off-chipFootprint!Repeat Irregular Off-chip Access7.2 Times for Each Gauss
238、ian PointRepeat Gaussian AccessT1T2T3Order12345SHOpacityMeanCovarianceThe Gaussian parameters cause high off-chip movement overhead.2.6:A 1.78mJ/Frame 373fps 3D GS Processor Based on Shape-Aware Hybrid Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler 2025 IEEE Internation
239、al Solid-State Circuits Conference9 of 41Outline Introduction Challenges of 3D GS Processor Key Features of 3D GS ProcessorShape-aware reconfigurable computation architectureOnline Shape-aware Early Skipping ControllerSpatio-Temporal Mixed Gaussian Management Measurement Results Conclusion2.6:A 1.78
240、mJ/Frame 373fps 3D GS Processor Based on Shape-Aware Hybrid Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler 2025 IEEE International Solid-State Circuits Conference10 of 41
Overall Architecture
[Figure: chip block diagram, shown on slides 10-13 of 41 with identical labels.]
Chip-level Data Interface: 24b Write Ports, 24b Read Ports, Camera Pose, T-th Frame; PLL (Ref CLK, PLL Cfg)
Output SRAM (5.1KB): Color, Opacity
Preprocessing: Coo. Project, Cov. Project, 3D to 2D (x,y,z) to (u,v), SH Render
1. Reconfigurable Computation Architecture: Hybrid Rasterizing & Interpolating Array built from Rasterizing Elements (RE), Uni-directional Interpolating Elements (IE) and Bi-directional Interpolating Elements; Base Coo. Reg, End State Reg, BG Color Reg
2. Early Skipping Controller: Opacity TH, r TH, d TH, Skip FIFO
3. Spatio-Temporal Reusing Cache: 8-Way Cache Tag, 10b Tags #0 to #1024, 144KB FM Mem, S&T Data Scheduler
Top Controller
[Figure labels from an omitted slide: INT, Comp., Skip If 0]
2.6:A 1.78mJ/Frame 373fps 3D GS Processor Based on Shape-Aware Hybrid Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler 2025 IEEE International
252、Solid-State Circuits Conference17 of 41Hybrid Rasterizing&Interpolating ArrayRendering pipeline can be reconfigured based on the Gaussians shape.FP16EXPINTFP16EXPINTFP16EXPINTINTINTINTRE(0,0)IE(0,1)IE(1,0)IE(1,1)Pixel(0,0)Pixel(0,1)Pixel(1,0)Pixel(1,1)RE(0,0)Pixel(0,0)Gau0Gau1Gau2REIEHigh Performanc
253、e PipelineHigh Precision PipelineSmooth Gau.Spiky Gau.ReduceFP16/EXPtRE(0,0)Pixel(0,1)RE(0,0)Pixel(1,0)RE(0,0)Pixel(1,1)Cycle 1Cycle 1Cycle 2Cycle 3Cycle 4CxCzCyCyFP16EXPINTFP16EXPINTFP16EXPINTREIEtClock-GatedGau0Gau0Gau0Shape Analysis2.6:A 1.78mJ/Frame 373fps 3D GS Processor Based on Shape-Aware Hy
254、brid Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler 2025 IEEE International Solid-State Circuits Conference18 of 41
Evaluation
RE-only Array vs Hybrid Array (measured with T&T Train Dataset):
2.08x Area Efficiency Improvement
2.76x Energy Efficiency Improvement
PSNR: 21.87dB (RE-only Array) vs 21.38dB (Hybrid R&I Array), -0.49dB
Reconfigurable Prec. & Perf. (measured without SESC):
[Figure: Performance (fps) and PSNR (dB) from No Inter. to Full Inter. on Synthetic-Hotdog, Synthetic-Lego and T&T-Train.]
2.6:A 1.78mJ/Frame 373fps 3D GS Processor Based on Shape-Aware Hybrid
256、Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler 2025 IEEE International Solid-State Circuits Conference19 of 41Outline Introduction Challenges of 3D GS Processor Key Features of 3D GS ProcessorShape-aware reconfigurable computation architectureOnline Shape-aware Early Sk
257、ipping Controller(SESC)Spatio-Temporal Mixed Gaussian Management Measurement Results Conclusion2.6:A 1.78mJ/Frame 373fps 3D GS Processor Based on Shape-Aware Hybrid Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler 2025 IEEE International Solid-State Circuits Conference20
258、of 41
Gaussian-dependent Ineffective Computation
Position, shape and opacity determine whether a computation is effective.
[Figure: a spiky Gaussian and a smooth Gaussian over numbered tiles on x/y axes; near, boundary and far tiles are marked as effective or ineffective.]
Skip Ineffective Computation
Skip-I: Far Tile
Skip-II: Boundary Tile for Large Gau.
Skip-III: Gau. with Low
260、Opacity2.6:A 1.78mJ/Frame 373fps 3D GS Processor Based on Shape-Aware Hybrid Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler 2025 IEEE International Solid-State Circuits Conference22 of 41Hardware-Friendly Gaussian Shape Analysisxy1.Gaussian 2.6:A 1.78mJ/Frame 373fps 3D
261、GS Processor Based on Shape-Aware Hybrid Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler 2025 IEEE International Solid-State Circuits Conference23 of 41Hardware-Friendly Gaussian Shape Analysisxy1.Gaussian 2.Radiusxy2.6:A 1.78mJ/Frame 373fps 3D GS Processor Based on Shap
262、e-Aware Hybrid Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler 2025 IEEE International Solid-State Circuits Conference24 of 41Hardware-Friendly Gaussian Shape Analysisxy1.Gaussian 2.Radiusxy3.EigenvalueAB2.6:A 1.78mJ/Frame 373fps 3D GS Processor Based on Shape-Aware Hybr
263、id Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler 2025 IEEE International Solid-State Circuits Conference25 of 41Hardware-Friendly Gaussian Shape Analysisxy1.Gaussian 2.Radiusxy3.EigenvalueAB4.Constraints of Radius1/R121/R22MIN(Rmin)R1R2MAX(Rmin)AB2.6:A 1.78mJ/Frame 373
264、fps 3D GS Processor Based on Shape-Aware Hybrid Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler 2025 IEEE International Solid-State Circuits Conference26 of 41Hardware-Friendly Gaussian Shape AnalysisEstimate the shape of a Gaussian with just two adders and two multiplie
265、rs.xy1.Gaussian 2.Radiusxy3.EigenvalueAB4.Constraints of Radius1/R121/R22MIN(Rmin)R1R2MAX(Rmin)AB5.Shape Analysis Algo.max(Rmin)min(Rmin)Rmin2.6:A 1.78mJ/Frame 373fps 3D GS Processor Based on Shape-Aware Hybrid Architecture Using Earlier Computation Skipping and Gaussian Cache Scheduler 2025 IEEE In
266、ternational Solid-State Circuits Conference27 of 41SESC Implementation DetailsSESC estimates the shape of the Gaussians and decides which to skip.Gau.0Gau.1Gau.2Gau.3Gau.1Gau.3Gau.0Gau.2a.CornerUnitCacheFIFOSESCd.Opacity Unitb.Position Unitc.ShapeUnitORMCOpac.MCAreaPower2.2%2.6%SESCRest of GS chipOv
267、erhead of SESCa.Corner Unitb.Position Unitc.Shape Unit d.Opacity UnitThrCxCzDEDEDEDEThThThThxThy&MxPxMyPy100 MB2)Irregular Load&Store:Irregular Access by unstructured 3D GaussiansP1)Excessive External DRAM AccessII.Motivation*EMA:External Memory Access 2025 IEEE International Solid-State Circuits Co
268、nference11 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting Large&Irregular Off-chip Memory Access-State-of-the-art:MLP,codebook-based model compression-EMA energy still accounts for 52.7%P1)Excessive Exter
269、nal DRAM AccessII.Motivation*EMA:External Memory Access 2025 IEEE International Solid-State Circuits Conference12 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting 3-Stage Sequential Processes in Interactive
270、 Rendering 1)Scene Deformation 2)Reflection 3)Rendering P2)Long Interaction Latency II.Motivation 2025 IEEE International Solid-State Circuits Conference13 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting I
271、ntensive Computation for Mesh Management1)Intensive BP computation modeling for mesh extraction2)Complex Gaussian transformation calculatingP3)Compute-Intense Mesh IntegrationII.Motivation 2025 IEEE International Solid-State Circuits Conference14 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC f
272、or Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian SplattingAccelerating 3DGS by Surface&MLP 3 Key Features with SW/HW Co-OptimizationIII.IRIS ArchitectureLoad&SortGaussiansMLP Inference&TrainingFree-form Deformation&MLP TrainingSpatial-TemporalGaussian Order Reuse(STOR)
273、Reconfigurable ALP*-based MAC Array(RAMA)Homogeneous Core Cluster Architecture&Error Direction Cache(EDC)EMA Latency Power Throughput Efficiency Throughput CoPlanar-GaussianCluster(CPGC)*ALP:Approximate Multiplication based on Linear Plane 2025 IEEE International Solid-State Circuits Conference15 of
274、 44
1. Surface Perception Unit (SPU)
2. Reconfigurable ALP-based MAC Array (RAMA)
3. Homogeneous Architecture & Error-Direction Cache (EDC)
Overall Architecture of IRIS
III.Arch
275、itecture of IRIS 2025 IEEE International Solid-State Circuits Conference16 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian SplattingOverall Architecture of IRISIII.Architecture of IRIS1.Surface Perception Unit(SPU)S
276、urface-centric 3D Gaussian managing Skip redundant Gaussian sorting 2025 IEEE International Solid-State Circuits Conference17 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting2.Reconfigurable ALP*-based MAC
277、Array(RAMA)Accuracy-Precision reconfigurable floating-point multiplier Bit position-wise optimized mantissa adder treeOverall Architecture of IRISIII.Architecture of IRISALP*:Approximate Multiplication based on Linear Plane 2025 IEEE International Solid-State Circuits Conference18 of 442.7:IRIS:A 8.
278、55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting
3. Spatially-Computable Gaussian Splatting (SCGS) Cluster
Homogeneous architecture for Free-form Deform & MLP Inference
Caching polarized error for fast and efficient Back-propagatio
279、nOverall Architecture of IRISIII.Architecture of IRIS 2025 IEEE International Solid-State Circuits Conference19 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian SplattingFeature 1:Preprocess Optimization 3 Key Featur
280、es with SW/HW Co-DesignLoad&SortGaussiansMLP Inference&TrainingFree DeformMLP TrainingSpatial-TemporalGaussian Order Reuse(STOR)Reconfigurable ALP1-based MAC Array(RAMA)Homogeneous Core Cluster&Error Direction Cache(EDC)EMA Latency Power Throughput Efficiency Throughput CoPlanar-GaussianCluster(CPGC
281、)IV.Key Feature 1)Surface Perception Unit(SPU)2025 IEEE International Solid-State Circuits Conference20 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting Periodic Mesh Extraction in 3DGS Modeling 3D space pe
282、rception for Scene Deformation Pre-processing acceleration by utilizing geometric informationSpatial Surface RecognitionIV.Key Feature 1)Surface Perception Unit(SPU)2025 IEEE International Solid-State Circuits Conference21 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactabl
283、e-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting Accelerating 3DGS using Surface information1)Surface-based efficient handling of clustered Gaussians2)Minimizing Gaussian sorting latency based on incident angleSurface Perception Unit(SPU)IV.Key Feature 1)Surface Perception Unit(SPU)
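Point 2 above (minimizing Gaussian sorting latency based on incident angle) can be illustrated with a small sketch. The slides do not give the actual criterion or data layout, so everything here is an assumption: the class name `OrderReuse`, the 5-degree threshold, and the simplification that a cached near-to-far order is reused whenever the view direction has turned by less than that threshold, ignoring camera translation.

```python
import math


def depth_order(centers, cam_pos, view_dir):
    """Indices of Gaussian centers sorted near-to-far along view_dir."""
    def depth(c):
        return sum((c[i] - cam_pos[i]) * view_dir[i] for i in range(3))
    return sorted(range(len(centers)), key=lambda k: depth(centers[k]))


def angle_deg(u, v):
    """Angle between two non-zero 3D vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


class OrderReuse:
    """Hypothetical cache: re-sort only when the view direction turns enough."""

    def __init__(self, max_angle_deg=5.0):
        self.max_angle_deg = max_angle_deg  # assumed threshold, not from the slides
        self.cached_dir = None
        self.cached_order = None
        self.sorts = 0  # number of full sorts actually performed

    def order(self, centers, cam_pos, view_dir):
        if (self.cached_order is not None
                and angle_deg(view_dir, self.cached_dir) <= self.max_angle_deg):
            return self.cached_order  # reuse the previous frame's order
        self.cached_order = depth_order(centers, cam_pos, view_dir)
        self.cached_dir = view_dir
        self.sorts += 1
        return self.cached_order


centers = [(0.0, 0.0, 3.0), (0.0, 0.0, 1.0), (0.0, 0.0, 2.0)]
reuse = OrderReuse(max_angle_deg=5.0)
first = reuse.order(centers, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
second = reuse.order(centers, (0.0, 0.0, 0.0), (0.01, 0.0, 1.0))  # small turn
third = reuse.order(centers, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))    # 90-degree turn
print(first, reuse.sorts)
```

A reused order can be slightly stale, so a real design would need some bound on the resulting blending error; the sketch only shows where the sorting work is skipped.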
284、2025 IEEE International Solid-State Circuits Conference22 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting CoPlanar Gaussian Cluster(CPGC)Mapping spatial adjacency to address adjacency Fewer and continual D
285、RAM memory access on RenderingCluster-wise Management of 3D Gaussian IV.Key Feature 1)Surface Perception Unit(SPU)2025 IEEE International Solid-State Circuits Conference23 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaus
286、sian Splatting Spatial-Temporal Gaussian Order Reuse(STOR)Pre-sorting under the assumption of normal incidence Reusing sorting results between adjacent framesReuse of Gaussian Sorting ResultsIV.Key Feature 1)Surface Perception Unit(SPU)2025 IEEE International Solid-State Circuits Conference24 of 442
287、.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting 61%Gaussian Pre-processing Latency Reduction 59%EMA Energy Consumption ReductionPerformance Results in SPUIV.Key Feature 1)Surface Perception Unit(SPU)Measured Syn
288、thetic-NeRF,NuScenes Dataset 2025 IEEE International Solid-State Circuits Conference25 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian SplattingFeature 2:Low Power MAC Core 3 Key Features with SW/HW Co-DesignLoad&So
289、rtGaussiansMLP Inference&TrainingFree DeformMLP TrainingSpatial-TemporalGaussian Order Reuse(STOR)Reconfigurable ALP-based MAC Array(RAMA)Homogeneous Core Cluster&Error Direction Cache(EDC)EMA Latency Power Throughput Efficiency Throughput CoPlanar-GaussianCluster(CPGC)IV.Key Feature 2)Reconfigurabl
290、e ALP-based MAC Array 2025 IEEE International Solid-State Circuits Conference26 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting Replacing HW-expensive Mantissa Multiplier Area&energy efficiency at the cost
291、 of multiplication error Existing limitations:constrained application due to computation errorApproximate FP Multiplier IV.Key Feature 2)Reconfigurable ALP-based MAC Array 2025 IEEE International Solid-State Circuits Conference27 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Inte
292、ractable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting Comparator-based Linear Approximation method ALP-optimized model training High efficiency w/o Accuracy LossALP:Stepwise Linear Plane Approx.IV.Key Feature 2)Reconfigurable ALP-based MAC Array 2025 IEEE International Solid-State
293、 Circuits Conference28 of 44
Reconfigurable FP Multiplier (RFM)
ALP2 Unit Processor w/ High Area & Energy Efficiency
Unit Processor Combination-based precision-accuracy reconfigurability
ALP-based Multiplier Design
IV.Key Feature 2)Reconfigurable ALP-based MAC Array
2.7:IRIS:A
296、 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting 77.7%Power Consumption Reduction 60.8%Multiplier Area Reduction Performance Results of RFMIV.Key Feature 2)Reconfigurable ALP-based MAC ArrayArea m2 FP1647.5FP8Power W 10.3
297、FP16534FP8Accurate FP MultiplierRFM30.7%7.510620960.8%77.7%50.9%15.7153 2025 IEEE International Solid-State Circuits Conference31 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting Optimization based on Each
298、Characteristics of MSB and LSB MSB:Sign-bits from scale alignment process LSB :Comparatively robust to errorsAligned-Mantissa Adder TreeIV.Key Feature 2)Reconfigurable ALP-based MAC Array 2025 IEEE International Solid-State Circuits Conference32 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC fo
299、r Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting Shifter-based Column-Sum Reuse Sign bit duplication Same results across different columns Shifter-based sign bit accumulationMSB Accumulation of MantissaIV.Key Feature 2)Reconfigurable ALP-based MAC Array 2025 I
300、EEE International Solid-State Circuits Conference33 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting OR-based Reduction Tree Encoding logic&OR-Gate Area 16.8%,Power 26.2%27.7 dB(reduction tree based)28.3 dB
301、(after encoding)LSB Accumulation of MantissaIV.Key Feature 2)Reconfigurable ALP-based MAC Array 2025 IEEE International Solid-State Circuits Conference34 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting 61.
302、2% Adder Tree Power Reduction
Reducing power consumption by maximizing bit sparsity
Performance of HAMAT
IV.Key Feature 2)Reconfigurable ALP-based MAC Array
Power & Area Comparison (32×32 MAC Core)
9.8 TOPS/W, 0.9 TOPS/mm2 (FP16)
Performance of Proposed MAC Core
IV.Key Feature 2)Reconfigurable ALP-based MAC Array
2.7:IRIS:A 8.55 mJ/frame Spatial Co
304、mputing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian SplattingFeature 3:Cluster Architecture 3 Key Features with SW/HW Co-DesignLoad&SortGaussiansMLP Inference&TrainingSpatial-TemporalGaussian Order Reuse(STOR)Reconfigurable ALP1-based MAC Array(RAMA)EMA Laten
305、cy Power Throughput CoPlanar-GaussianCluster(CPGC)IV.Key Feature 3)SCGS Cluster Architecture Free-form Deformation&MLP TrainingHomogeneous Core Cluster&Error Direction Cache(EDC)Efficiency Throughput 2025 IEEE International Solid-State Circuits Conference37 of 442.7:IRIS:A 8.55 mJ/frame Spatial Comp
306、uting SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting Two Sequential Processes w/High Computational Intensity Reconfigurable MAC Core Architecture Throughput,Utilization Heterogenous&Interdependent TasksIV.Key Feature 3)SCGS Cluster Architecture Harmoni
307、c-Coordinate ComputationPoint Sampling&Conversion Eval.Visualized HC3D Gaussian Feature GenerationFeature Gen.MLPsGaussian PositionsRGB,DensityRGB,DensityRGB,DensityRGB,DensityRotationRotationRotationRotationScaleScaleScaleScaleTask 1Task 2ReflectRenderNode FeaturesWWI3I2ATW0,1/6,1/36Line-Buffer 0(1
308、6b 32)I3I2HAMAT 0Reconfigurable3232 MAC ArrayWeight StationaryInput-Broadcast&RFMRFMRFMRFMRFMRFMATLine-Buffer 1Line-Buffer 31Systolic Input Fetching&Weight-BroadcastOch1Och2Ich1WWIch2WWATATHigh Weight ReusabilityHAMAT 313D Window Buffer 2025 IEEE International Solid-State Circuits Conference38 of 44
309、2.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting Spatial Gaussian Locality leads to Gradient Similarity Lossless&Effective reuse ratio by Polarization of Error vector Polarized key utilizing Spatial Locality&Dir
310、ection PredictabilityPolarized Key Generation for EDCIV.Key Feature 3)SCGS Cluster Architecture 2025 IEEE International Solid-State Circuits Conference39 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting Ena
311、bling Elimination of Redundant Computation BP Hierarchical Key-Matching to prevent accuracy loss 73%Back-propagation computation eliminatedSkipping Similar Error Direction BPIV.Key Feature 3)SCGS Cluster Architecture 2025 IEEE International Solid-State Circuits Conference40 of 442.7:IRIS:A 8.55 mJ/f
312、rame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian Splatting
29%p Utilization Increase (vs. Heterogeneous Arch.)
32.8% Power Consumption Reduction (Modeling)
Performance of SCGS Cluster
IV.Key Feature 3)SCGS Cluster Architecture
2025 IEEE Internati
313、onal Solid-State Circuits Conference41 of 442.7:IRIS:A 8.55 mJ/frame Spatial Computing SoC for Real-time Interactable-Rendering and Surface-aware-Modeling with 3D Gaussian SplattingChip Photograph and SummaryV.Implementation ResultsRAMA#0RAMA#1RAMA#2RAMA#3SMEM&EDCSCGS Cluster#1SPUSCGS Cluster#3RISC-
314、V CoreI/F 0I/F 1EIUVRUEIUVRUSCGS Cluster#2GMEM&2D NoC1 MAC=2 OpsI/O Voltage=1.8VSpecificationsTechnology28nm Logic CMOSDie Area4.5 mm 4.5 mm(20.25 mm2)SRAM1743 KBSupply Voltage0.7 0.9 VMax.Frequency200 MHzFP 4EMBFP 8MLPFP 8HC Eval.Rendering PerformanceThroughput72.1 FPSEnergy Consumption3.9-8.55 mJ/
315、frameMLP Efficiency 23.8 TFLOPS/W 50 MHz,0.7 VModeling PerformanceThroughput67.1 Iter/sPower Consumption554 mWMLP Efficiency16.2 TFLOPS/W 50 MHz,0.7 VFP 8FP 16-FFBP,WGFP 16Else(VR,ARAP etc.)FP 16DataType 2025 IEEE International Solid-State Circuits Conference42 of 442.7:IRIS:A 8.55 mJ/frame Spatial
Comparison Table
V. Implementation Results
Row labels: Technology; Die Area (mm²); Op. Voltage [V]; Algorithm 5); Max. Freq; Max Throughput [FPS]; Rendering Energy [mJ/frame] (IPR, Static); Rendering Power 6) [mW]; PSNR 1)3); Modeling Speed [iter/s]; Modeling Efficiency 4) [mJ/iter]; Interactive Rendering
Jetson Orin Nano: 7 nm; -; 3DGS; 1400 MHz; 30 1), 2.1 2) FPS; 3571 mJ/frame; 7500 mW
This Work: 28 nm; 3.23 1.57; 0.9 V; 3DGS; 50-200 MHz; 1.79; 330.4
ISSCC24 [8]: 28 nm; 20.25; 0.68-0.9; NeRF; 200 MHz; 73.5 1) FPS; -; 728.4 mW
This Work: 28 nm; 20.25; 0.7-0.9; 3DGS (SEMM); 200 MHz; 142.1 1), 72.1 2) FPS; 8.55 mJ/frame; 554-616.5 mW
ISSCC23 [7]: 28 nm; 20.25; 0.6-0.95; NeRF; 200 MHz; 108.5 1) FPS; -; 899 mW
Rendering Energy (IPR, Static): 240, 9.94, 3.9, 8.28
PSNR 1)3): 26330.41725.8-
Modeling Speed [iter/s]: 2.8634.94-134.835.8467.1-
Modeling Efficiency 4) [mJ/iter]: 260034.94-134.818.515.07-
Interactive Rendering: O 3DGS X O X
1) Static Object Rendering
2) Real-time Interactive Rendering
3) nuScenes Dataset [18]
4) Including External Memory Access (Estimated w/ DDR4 SDRAM [17])
5) The model that achieved the highest FPS among the supportable models
6) Measured Maximum Frequency

Demonstration Video
V. Implementation Results
DS1

Conclusion
V. Conclusion
IRIS: 8.55 mJ/frame, 72.1 FPS 3DGS-based Spatial Computing SoC on Mobile AR/VR Devices
For Energy-efficient Real-time Interactable Rendering
1) Surface Perception Unit (SPU) integrating 3D Mesh on 3DGS
2) Reconfigurable ALP-based MAC Array (RAMA)
3) Homogeneous Core Cluster w/ Error Direction Cache
A Mobile Interactive Real-Virtual Interfacing System with 34.3× Rendering Speed and 300.0× Smaller Energy than Edge GPU
DS1

2.8: A 210fps Image Signal Processor for 4K Ultra HD True Video Super-Resolution
2025 IEEE International Solid-State Circuits Conference
Ying-Sheng Lin¹, Jun Nishimura², and Chia-Hsiang Yang¹
¹National Taiwan University, Taipei, Taiwan
²Google, Mountain View, CA

Outline
Introduction
Preliminaries
Algorithm-Architecture Co-Optimization
System Architecture
Experimental Verification
Summary

Video Super-Resolution (VSR)
VSR [1] aims to increase the resolution of the input video
Temporal information is leveraged to achieve better quality
Applications include video streaming and conferencing
Figure labels: Video Streaming, Video Conferencing, LR Video, HR Video, Accelerator, HR Screen, Data Center, VSR
LR: Low resolution, HR: High resolution

State-of-the-Art SR Processors
Single-image: temporal correlation is not utilized
Multi-image: features are shared between adjacent frames
Inter-dependency and redundancy of videos neglected
Figure labels, Single-Image (VLSI19, ISSCC21, VLSI21): Bit-stream, VD, LR Frame, SR, HR Frame
Figure labels, Multi-Image (ISSCC23): Bit-stream, VD, LR Frames, VSR, HR Frames
SR: Super-resolution, VD: Video decoding

This Work
First image signal processor for true VSR
Information embedded in the encoded video is utilized
Frame redundancy is exploited for better efficiency
Figure labels: Static background, Flat regions, Less SR Effort

NN-based VSR Framework
Figure labels: Alignment, Previous HR Frames, LR Frames, MC, Optical Flow, UP., Coarse HR Frame, conv, conv, Refinement, Refined HR Frame
Alignment: extracting and aligning features [2]
Main difference between VSR and single-image super-resolution
Refinement: fusing the features and refining the HR frame
MC: Motion compensation, NN: Neural network
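The alignment path of the framework (motion compensation of previous HR frames guided by optical flow, next to an upsampled coarse HR frame) can be illustrated with a small sketch. This is not the authors' implementation: it assumes integer per-pixel flow vectors, nearest-neighbor upsampling for the "UP." step, and border clamping, and it leaves out the convolutional refinement network entirely. The names `upsample_nearest` and `motion_compensate` are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[float]]
Flow = List[List[Tuple[int, int]]]  # per-pixel (dx, dy); integers for simplicity


def upsample_nearest(lr: Frame, scale: int) -> Frame:
    """Build a coarse HR frame by repeating each LR pixel scale x scale times."""
    h, w = len(lr), len(lr[0])
    return [[lr[y // scale][x // scale] for x in range(w * scale)]
            for y in range(h * scale)]


def motion_compensate(prev_hr: Frame, flow: Flow) -> Frame:
    """Backward-warp the previous HR frame: each output pixel reads the
    previous frame at (x + dx, y + dy), clamped to the frame border."""
    h, w = len(prev_hr), len(prev_hr[0])
    out: Frame = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(x + dx, 0), w - 1)
            sy = min(max(y + dy, 0), h - 1)
            row.append(prev_hr[sy][sx])
        out.append(row)
    return out


# Toy data: a 1x2 LR frame upsampled 2x, and a 3x3 previous HR frame
# warped with a uniform flow of one pixel in x.
coarse_hr = upsample_nearest([[1.0, 2.0]], 2)
prev_hr = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
flow = [[(1, 0)] * 3 for _ in range(3)]
aligned_prev = motion_compensate(prev_hr, flow)
```

In a full pipeline the aligned previous frame and the coarse HR frame would both feed the refinement stage, which fuses the features and produces the refined HR frame.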

Block Prediction
Frames are divided into blocks in video codec
Similar blocks in other frames can be referenced
Residual block is derived and embedded in bitstream
Difference between the current block and reference block
Motion vector indicates the location of the reference block
Figure labels, Block Prediction-Encoding: Previous Frames, Current Frame, Current Block, ME & MC, Reference Block, Motion Vector, Residual Frame (+ / -), Embedded in Bitstream
Figure labels, Block Prediction-Decoding: Motion Vector, Residual Frame, Previous Frames, MC, Reference Block, Current Block, Current Frame
ME: Motion estimation, MC: Motion compensation

Video Frame Type
Adaptive encoding strategy for video frame types
I frames are encoded without referencing other frames
P and B frames are encoded by refere