
Session 37 Overview: Design-Technology Optimization and Digital Accelerators

DIGITAL CIRCUITS SUBCOMMITTEE

This session presents eight papers that push the boundaries of digital circuit techniques for design-technology optimization, domain-specific computing, and digital accelerators, aiming to enhance energy efficiency, system performance, and application-specific capabilities. The first paper explores design-technology co-optimization (DTCO) for an industrial processor design, and the second paper presents a chiplet solution for networks-on-textiles with system-on-chip and networking chiplets. The third paper demonstrates a compute-in-memory (CIM)-based microprocessor utilizing embedded MRAM for neural network inference, and the fourth paper reports a reusable active TSV interposer with programmability. The fifth paper introduces a complete K-SAT solver that achieves 100% solvability or proves unsatisfiability, and the sixth paper showcases a diffusion accelerator leveraging SRAM CIM and eDRAM storage. The seventh paper highlights a GPS acquisition accelerator designed with energy-accuracy-driven optimization and computing, and the eighth paper presents a low-power keyword spotting system featuring on-chip training for accented users.

Session Chair: Jae-sun Seo, Cornell Tech, New York, NY
Session Co-Chair: Mahmut Ersin Sinangil, NVIDIA, Santa Clara, CA

1:30 PM
37.1 IBM Telum II Processor Design-Technology Co-Optimizations for Power, Performance, Area, and Reliability
David Wolpert, IBM, Poughkeepsie, NY
In Paper 37.1, IBM and the University of Bonn present design-technology co-optimization for the 8-core 5.5GHz IBM Telum II processor in 5nm, packing 1.38× more transistors into a 1.13× area footprint and 1.05× power envelope, while simultaneously meeting a reliability target of 99.999999% system uptime. Key innovations in standard-cell library architecture, low-power sequential circuits, soft-error-tolerant design, and design-for-manufacturability are demonstrated.

1:55 PM
37.2 A 2-Dimensional mm-Scale Network-on-Textiles (kNOTs) for Wearable Computing with Direct Die-to-Yarn Integration of 0.6×2.15mm² SoC and bySPI Chiplets
Anjali Agrawal, University of Virginia, Charlottesville, VA
In Paper 37.2, researchers from the University of Virginia and Nautilus Defense LLC present two chiplets in 65nm, a system-on-chip and a networking chiplet (bySPI), that are designed for direct-die attachment to an embroidered conductive-yarn in-textile bus. The 0.6mm×2.15mm chiplets enable networks-on-textiles that operate at 1.8V, with the SoC consuming 3.34mW at 80MHz and the bySPI consuming 0.27mW at 34MHz.

2:20 PM
37.3 Monolithic In-Memory Computing Microprocessor for End-to-End DNN Inferencing in MRAM-Embedded 28nm CMOS Technology with 1.1Mb Weight Storage
Soonwan Kwon, Samsung Advanced Institute of Technology, Suwon, Korea; Seoul National University, Seoul, Korea
In Paper 37.3, authors from Samsung Advanced Institute of Technology, Seoul National University, Samsung Electronics, and Harvard University present an MRAM IMC-based processor for neural network inference in 28nm technology. Featuring 1.1Mb of embedded MRAM with system-wise calibration techniques, this work achieves 20TOPS/W processor-level energy efficiency for 1b precision at 1.0V.

2:45 PM
37.4 SHINSAI: A 586mm² Reusable Active TSV Interposer with Programmable Interconnect Fabric and 512Mb 3D Underdeck Memory
Bo Jiao, Fudan University, Shanghai, China
In Paper 37.4, Fudan University and Kiwimoore Semiconductors present a 586mm² reusable active TSV interposer in 28nm CMOS TSV/microbump technology, integrating 512Mb SRAM and a programmable 2.5D and 3D network-on-active-interposer with up to 307.2GB/s bandwidth, integrated buck voltage regulators, PLLs, and a boot CPU.

3:35 PM
37.5 SKADI: A 28nm Complete K-SAT Solver Featuring Dual-Path SRAM-Based Macro and Incremental Update with 100% Solvability
Zihan Wu, Peking University, Beijing, China
In Paper 37.5, Peking University presents a complete K-SAT solver in 28nm, achieving 100% solvability and unsatisfiable-problem detection with an SRAM-based dual-path macro, an incremental updater, and a position-encoded counter. Operating at 0.9V and 200MHz while consuming 3.39mW, the chip solves 218-clause, 50-variable SAT problems in 17.1µs on average, or proves unsatisfiability in 42.1µs on average.

4:00 PM
37.6 A 22nm 60.81TFLOPS/W Diffusion Accelerator with Bandwidth-Aware Memory Partition and BL-Segmented Compute-in-Memory for Efficient Multi-Task Content Generation
Yiqi Jing, Peking University, Beijing, China
In Paper 37.6, researchers from Peking University describe an 864kb SRAM-CIM-based accelerator in 22nm with 3Mb of eDRAM storage for diffusion models that perform image, 3D, and video generation. The chip features a bandwidth-aware memory partition and a bitline-segmented CIM cluster design, achieving 60.81TFLOPS/W for INT8/BF16 hybrid precision at 180MHz and 0.6V.

4:25 PM
37.7 A 28nm 18.1µJ/Acquisition End-to-End GPS Acquisition Accelerator with Energy-Accuracy-Driven Mixed-Radix IFFT and ROM-Assisted Computing
Sangsu Jeong, Seoul National University
In Paper 37.7, researchers from Seoul National University and Columbia University present an end-to-end GPS acquisition accelerator using mixed-radix IFFT optimizations and ROM-assisted computing in 28nm technology. Achieving 18.1µJ/acquisition at 88.5MHz and 0.6V, the proposed accelerator supports acquisition of 32 satellites across all 41 Doppler bins.

4:50 PM
37.8 A 13.5µW 35-Keyword End-to-End Keyword Spotting System Featuring Personalized On-Chip Training in 28nm CMOS
Hyuk-Jin Lee, KAIST, Daejeon, Korea
In Paper 37.8, researchers from the Korea Advanced Institute of Science and Technology, ETH Zürich, and Columbia University present a keyword spotting system that recognizes 35 keywords spoken by users with accents, featuring efficient feature extraction and on-chip training. The 28nm chip achieves 92.2% accuracy on accented speech, while consuming 10.93µW for inference and 13.46µW for training at 0.75MHz under a 1V analog and 0.7V digital supply.

37.1 IBM Telum II Processor Design-Technology Co-Optimizations for Power, Performance, Area, and Reliability

David Wolpert1, Gerry Strevig2, Chris Berry1, Leon Sigal3, Bill Huott1, Mark Cichanowski2, Matthias Pflanz4, Tobias Werner4, Philipp Salz4, Nick Jing1, Michael Romain1, Iris Leefken4, Richard Serton1, Rajesh Veerabhadraiah5, Dureseti Chidambarrao3, Robert Arelt1, Matt Angyal1, Ben Trombley1, Arvind Haran2, Stefan Hougardy6, Ben Klotz6, Rahul Rao5
1IBM, Poughkeepsie, NY; 2IBM, Austin, TX; 3IBM, Yorktown Heights, NY; 4IBM, Böblingen, Germany; 5IBM, Bangalore, India; 6University of Bonn, Bonn, Germany

The IBM zNext's Telum II processor [1] is a full-stack overhaul of its predecessor [2], featuring a modified 5nm Samsung bulk technology [3], new IBM-created standard-cell and IP libraries, and a variety of further design-technology co-optimizations (DTCO). While the die size grew from 530mm² to 600mm² (1.13×), each chip in the dual-chip module became denser, featuring 1.3T shapes (1.25×), 43B transistors (1.38×), and 38km of wire (1.28×) across 18 BEOL layers. As IBM's first 5nm processor, Telum II was a concurrent hierarchical design effort, with libraries, IP blocks (array/regfile/analog/phy/custom), and all gate-level entities designed simultaneously. These data-volume and concurrent-design pressures drove a variety of abstraction, modeling, and other EDA innovations in addition to those involving power, performance, area, and reliability targets.

The Telum II power envelope grew just 5% beyond its predecessor despite significant design growth, including a new on-chip Data Processing Unit, 1.4× cache size, 1.4× latch count, and enhanced I/O and logic speeds (Fig. 37.1.1). While technology scaling provided a 7% power reduction at the 1.15× frequency target, the latch content growth in Telum II required advances in low-power sequential cell designs. Two new latch topologies were implemented: a low-power "0-state" latch and a low-power "static" latch (Fig. 37.1.2). Both exhibit a delay profile similar to the previous transmission-gate-based latches [4] while reducing area, clock loading/power, and leakage, particularly when the latch is not changing state. Eliminating the data-port transmission gate and local clock inverter reduced local clock loading to three devices per latch. The "0-state" latch facilitates delay-borrowing while reducing clock power for a 0-state by disabling discharge of node D0. The "static" latch minimizes power for all non-switching states through the feedback from L2 to the pull-down stack of node D0. These latches are inserted during synthesis depending on timing/switching characteristics, and account for over 85% of Telum II logic latches. The per-state power savings of these latches is shown in Fig. 37.1.3, providing a reduction in Telum II total power consumption of over 3%, a significant swing given that logic latches consumed 15% of the total power in the predecessor design.

The Telum II resonant global clock oscillates at 5.5GHz, driving the need to minimize the clock fanout on the mesh to save "always-on" clock-mesh power. Previous processors required a Local Clock Buffer (LCB) attached to the clock mesh [5] for every clock-gated latch domain, which encouraged larger clock-gated latch domains to reduce LCB loading and clock-mesh power at each clock gate. The introduction of a micro-gate LCB (Fig. 37.1.4) on Telum II enabled more fine-grained latch clock gating, mixing larger-latch-count clock-gated domains with "micro-gated" smaller latch domains without adding more LCBs, and reducing the total number of LCBs connected to the clock mesh by 25% (Fig. 37.1.4). Additionally, micro-gate LCBs largely removed scan-clock loading from the clock mesh. Overall, the Telum II LCB re-design reduced total chip power by over 5%, shown in a single-core power split in Fig. 37.1.1. Despite core latch counts growing by 7%, these low-power innovations generated 40% power savings in core sequential elements, without affecting design-for-test logic.

Technology scaling has increased the probability of multi-bit errors by reducing the critical energy for a single incident particle to split a Si atom into fragments and ions with sufficient energy to influence neighboring cells [6] (Fig. 37.1.5). The Telum II 99.999999% system-uptime target [7] required advances in multi-bit error detection across adjacent memory cells. The probability of multi-cell errors in array cells is significantly reduced by increasing the physical or logical distance between cells of a parity group [8]. In Telum II, this concept was extended to gate-level latch-bank structures, with logic/placement automations to increase the physical distances between cells within the same parity group. These latch groups were determined after initial clustering, then fed into placement post-processing according to their parity groups (Fig. 37.1.5). Other design optimization steps were unaffected by this process, enabling these reliability improvements with negligible power, performance, or area (PPA) penalty. Due to their density, SRAMs are more susceptible to soft errors and harder to restructure for parity without incurring significant PPA penalties. To satisfy performance and reliability targets, Telum II SRAMs exploit both circuit interleaving and logic interleaving.
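As a concrete illustration of this placement post-processing step, the following Python sketch spreads latches that share a parity group by greedy swapping (this is our toy illustration with assumed coordinates, group sizes, and a made-up spacing target, not IBM's production placement flow):

from itertools import combinations
from math import dist

def min_same_group_distance(groups):
    # Smallest distance between any two latches that share a parity group.
    return min(dist(a, b)
               for members in groups.values()
               for a, b in combinations(members, 2))

def spread_parity_groups(groups, min_spacing):
    # Greedily swap latches between parity groups until every pair of latches
    # in the same group is at least min_spacing apart, or no swap helps.
    improved = True
    while improved and min_same_group_distance(groups) < min_spacing:
        improved = False
        for ga, gb in combinations(list(groups), 2):
            for i in range(len(groups[ga])):
                for j in range(len(groups[gb])):
                    before = min_same_group_distance(groups)
                    a, b = groups[ga][i], groups[gb][j]
                    groups[ga][i], groups[gb][j] = b, a      # trial swap
                    if min_same_group_distance(groups) > before:
                        improved = True                      # keep the swap
                    else:
                        groups[ga][i], groups[gb][j] = a, b  # revert
    return groups

# Example: two 2-bit parity groups whose members start out adjacent.
groups = {"P0": [(0, 0), (1, 0)], "P1": [(0, 1), (1, 1)]}
print(spread_parity_groups(groups, min_spacing=1.4))

In this example the swap pushes the members of each group onto a diagonal, increasing the same-group spacing from 1 to roughly 1.41 grid units, which mirrors the goal of the latch-bank automation: larger physical separation between bits protected by the same parity check.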

The Telum II gate-level standard-cell image is shown in Fig. 37.1.6. The 5nm cell architecture features increased signal track count and improved cell routability without reducing fin density. In the predecessor 7nm design, M2 signal tracks were forced to align with PC, with vertical M1 signal pins limiting input-pin access to a single M2 track. In Telum II, although gate contacts over active regions were avoided, a joint DTCO effort provided a larger gate-access area between active regions, enabling the use of horizontal M1 pins in the default image. Through technology scaling, DTCO efforts, and exploiting the ratios of multi-input gates, the Telum II library enabled staggered gate contacts across two horizontal input-pin tracks in the p-n gap without reducing performance. These horizontal M1 pins provide multiple M2 track access points, as well as enabling the router to extend pins planarly into white space over gaps between active cells. M2 pitch scaling allowed for a PC:M2 ratio of 2:3, effectively increasing M2 track count by 50%, further improving both gate-level and custom IP routability. These cell architecture changes enabled design area scaling of 10-15% without reducing image height, improving overall image performance without sacrificing reliability.

To expand design optimization beyond VT/ratio/power levels, the library grew to include cells with tapered FET stacks, buffers with different internal gain, and gates that split the MOL connection between shared sources. Library-cell automation tools enabled further optimization in custom macros, with procedurally generated cells for common Boolean logic providing options such as multiple MOL-/BEOL-level pin assignment. The standard-cell library was further augmented with 250 complex library cells, spanning 20 configurations and their power and VT variants. Each configuration is a combination of logic gates in a single cell (e.g., cascaded NAND2 gates), enabling reduced cell footprints with local interconnections on metal layers not otherwise accessible by the router. This provided RC savings by grouping cells that are often placed many pitches/rows apart, as well as using fewer vias, resulting in lower resistance and better performance of the complex combination. To generate these cells, a tool was developed that uses a schematic netlist, technology constraints, and the standard-cell image as inputs and produces a fully placed-and-routed transistor layout. Key ideas used within this tool include efficient lower-bound computations for the branch-and-bound placement algorithm and a sophisticated mixed-integer programming formulation for routing [9]. Telum II contains 5.3 million complex gate instances created with this tooling.
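The branch-and-bound-with-lower-bound idea behind such cell-generation tooling can be sketched at toy scale as follows (a deliberately simplified, self-contained example with invented cells and nets; the actual BonnCell flow [9] works at transistor level with far stronger bounds and a mixed-integer-programming router):

def place(cells, nets, slots):
    # Assign each cell to a distinct slot on a row, minimizing total wirelength
    # of two-pin nets, pruning subtrees whose lower bound cannot beat the best.
    best = {"cost": float("inf"), "assign": None}

    def cost(assign):
        return sum(abs(assign[u] - assign[v]) for u, v in nets
                   if u in assign and v in assign)

    def lower_bound(assign, free):
        # Placed-pair cost plus, for each half-placed net, the distance to the
        # closest free slot; this never overestimates the final cost.
        lb = cost(assign)
        for u, v in nets:
            if (u in assign) ^ (v in assign) and free:
                placed = u if u in assign else v
                lb += min(abs(assign[placed] - s) for s in free)
        return lb

    def branch(assign, free):
        if len(assign) == len(cells):
            if cost(assign) < best["cost"]:
                best["cost"], best["assign"] = cost(assign), dict(assign)
            return
        if lower_bound(assign, free) >= best["cost"]:
            return                                   # prune this subtree
        cell = cells[len(assign)]
        for s in list(free):
            assign[cell] = s
            branch(assign, free - {s})
            del assign[cell]

    branch({}, frozenset(slots))
    return best["cost"], best["assign"]

print(place(["a", "b", "c", "d"],
            [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")],
            slots=[0, 1, 2, 3]))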

Returning to BEOL optimization, while minimum-pitch M2 wires help increase routing resources for synthesized logic, structures like latch banks with clock trunks or array topologies with structured placement can benefit from wider wires that break the default track pattern. Wide wires are more effective and have lower timing complexity than combining multiple thin wires, and staying on low metal planes can save wire resources and capacitance compared to pushing critical wires up to higher metal planes with wider pitches. As minimum-pitch wires usually require larger spacings to wider wires, Telum II includes two choices of M2 image: a "Type 1" that supports only minimum-pitch M2 (including both power staples and signal wires), and a "Type 2" that supports wider M2 for structured long trunk wires. The transition between different M2 images often occurs at physical hierarchical boundaries, but Telum II also includes cell transitions from a synthesized "Type 1" area to a "Type 2" area across a consistent library cell image, shown in Fig. 37.1.6. Mixed routing support was achieved with type-specific marker shapes and M2 routing blockages.

Another key area of Telum II DTCO is the design-for-manufacturing (DFM) tooling and methodology. Telum II utilized an internally developed DFM tool for library design validation that minimized pattern permutations while facilitating other block-level DFM enhancements in the downstream design flow. This was layered over a foundry-provided DFM tool that was augmented with additional functions to identify yield opportunities across 8 metal/via levels, including line-end extensions, minimum metal area, and redundant via insertion. These joint optimizations enhanced robust via counts by 19% compared to the predecessor design, including insertion of over 715M redundant vias across 7 metal layers, and reduced DFM omissions by 66%. Remaining omissions were largely constrained to hierarchical boundaries, intentionally prohibited to avoid cross-hierarchy DRC fails during the concurrent hierarchical design effort. Considering the growth in die size and wire length from the predecessor chip, the combined defect-driven yield value of these DFM operations was estimated to approach 10%. Large chips with complex layout variances like Telum II increase the potential for design constructs that, though clean to sign-off checks, could interact deleteriously with the process. Special design-side controls were implemented to mitigate known product-technology interactions across test sites and first-pass hardware, building systematic defect avoidance directly into the design flow and further improving yields.

Acknowledgement: The authors would like to thank the entire IBM Enterprise Systems Z, EDA, Product Engineering and IBM Research teams, as well as the Samsung fabrication team, for all their significant contributions to the success of this project.

Figure 37.1.1: Total chip power component trends, and core-specific power distribution trends.
Figure 37.1.2: "0-state" pulsed latch schematic, and "static" pulsed latch. Inputs shown as squares, outputs as circles, with scan outputs omitted for clarity.
Figure 37.1.3: Latch power comparison normalized across workloads.
Figure 37.1.4: Micro-clock-gating LCB schematic, and LCB count optimizations.
Figure 37.1.5: Particle-strike fragments affecting multiple SRAM bits, and particle-aware latch and clock buffer placement.
Figure 37.1.6: Enhanced 5nm library image and multiple M2 region types.
Figure 37.1.7: Telum II die photo.

References:

[1] G. Strevig et al., "IBM Telum II: Next Generation 5.5GHz Microprocessor with On-Die Data Processing Unit and Improved AI Accelerator," ISSCC, 2025.
[2] O. Geva et al., "IBM Telum: A 16-core 5+GHz DCM," ISSCC, pp. 46-47, 2022.
[3] J. Jeong et al., "Performance-Power Management Aware State-of-the-Art 5nm FinFET Design (5LPE) with Dual CPP From Mobile to HPC Application," IEEE IEDM, pp. 405-408, 2020.
[4] B. Curran et al., "Power-Constrained High-Frequency Circuits for the IBM POWER6 Microprocessor," IBM J. Res. & Dev., vol. 51, no. 6, pp. 715-731, 2007.
[5] S. Chan et al., "A Resonant Global Clock Distribution for the Cell Broadband Engine Processor," IEEE J. Solid-State Circuits, vol. 44, no. 1, pp. 64-72, 2009.
[6] J. A. Zoutendyk, L. D. Edmonds and L. S. Smith, "Characterization of Multiple-Bit Errors From Single-Ion Tracks in Integrated Circuits," IEEE Trans. Nuclear Science, vol. 36, no. 6, pp. 2267-2274, Dec. 1989.
[7] C. Berry, "IBM Next Generation Processor and AI Accelerator," IEEE Hot Chips, 2024.
[8] M. Maniatakos, M. Michael, Y. Makris, "Multiple-Bit Upset Protection in Microprocessor Memory Arrays Using Vulnerability-Based Parity Optimization and Interleaving," IEEE TVLSI, vol. 23, no. 11, pp. 2447-2460, 2015.
[9] P. Van Cleeff et al., "BonnCell: Automatic Cell Layout in the 7-nm Era," IEEE TCAD, vol. 39, no. 10, pp. 2872-2885, 2020.

37.2 A 2-Dimensional mm-Scale Network-on-Textiles (kNOTs) for Wearable Computing with Direct Die-to-Yarn Integration of 0.6×2.15mm² SoC and bySPI Chiplets

Anjali Agrawal1, Zhenghong Chen1, Braden E. Desman1, Jinhua Wang1, Akiyoshi Tanaka1, Fahim Foysal1, Charlie D. Hess1, Will Farrell2, Jim Owens2, Daniel S. Truesdell1, Benton H. Calhoun1
1University of Virginia, Charlottesville, VA; 2Nautilus Defense LLC, Pawtucket, RI

This paper proposes a scalable, 2-dimensional network-on-textiles (kNOT) comprised of systems-on-chip (SoCs) and "bySPI" networking chiplets that are jointly capable of supporting heterogeneous programming, multiple sensing modalities, and a distributed memory system. Emerging e-textiles must retain the flexibility and comfort of their host garment to be viable in wearable applications for healthcare, virtual reality, and sports [1-4]. Prior integration efforts have demonstrated textile computing by weaving flexible filament circuits [5-7], embroidering conductive yarns [8], and fabricating electronic fibers [9,10]. Still, these works suffer from a combination of bulky, rigid components, high cost, or 1-dimensionality (Fig. 37.2.1). To preserve the textile's look and feel, SoCs for fabric integration must also be highly miniaturized, feature minimal area-hungry I/O interfaces, be easily programmable, and be fully integrable. The e-textile system in [11] integrated a health-sensing chip onto a planar fashionable circuit board, yet the board (25×25mm²) is 40× larger than the chip itself. The System-in-Fiber from [12] is fully autonomous but has many I/O pads, is limited to 1D networks, and requires an interposer (4.7×3.7mm²) that is 3.8× larger than the die. Most recently, a battery-less e-textile system in [13] integrates cm-scale harvesting tiles and an inductor onto a shirt, but it lacks an integrated SoC. Many wearable applications also demand substantial on-garment storage, yet large memories are unsuited for comfortable textile integration due to their sizable footprint. Our kNOT solution addresses this by replacing monolithic memory units with a distributed set of smaller memories. However, networking these chiplets via existing bus standards incurs significant area and power penalties [14]. Existing Body Area Networks highlight the potential of multi-chip solutions, but they lack seamless textile integration [15-17]. Hence, to truly realize a fabric computing system, we propose a board- and interposer-free 2D kNOT built with the direct-die attachment of two miniaturized chiplets: 1) an SoC with an array of reconfigurable pads, fault-tolerant global bootup, SoC-to-SoC clock synchronization, and dynamic network configuration; and 2) a bySPI chiplet, which is a COTS-compatible, reduced-wire enhancement of the bypass-SPI (bySPI) protocol [14] capable of connecting to any number of receivers (RXs) using three wires. The chiplets are directly interconnected to braided composite yarns, which are comprised of para-aramid structural yarns and insulated 25µm-diameter conductors [26]. For each pad, a yarn is embroidered into a cotton substrate and its insulation is selectively ablated using a laser. Solder paste is deposited, the chiplet is placed to align its pads with the yarns, and the solder is reflowed. The interconnections are then encapsulated to provide mechanical and environmental protection. This textile-chiplet interconnection approach is compatible with automated high-throughput 2.5D electronics manufacturing methods, enabling the production of textile-integrated systems at scale (Fig. 37.2.1).

Figure 37.2.2 shows a system diagram of the kNOT, where SoCs are linked with other SoCs, sensors, and memories via bySPI chiplets in a 2D tree-like network that can be distributed across a garment. To provide a scalable and distributed approach towards wearable computing, our kNOT system utilizes a miniaturized SoC with a Cortex-M0+ core and 32kB SRAM (Fig. 37.2.2), optimized for kNOTs using: 1) a linear array of reconfigurable pads spaced at a pitch of 180µm (Fig. 37.2.7), which dominate the chiplet size. To meet I/O requirements despite the limited area for pads, the pads are internally MUXed to different signals at runtime. 2) A custom bySPI-compatible receiver (RX) that enables SoCs to boot up globally from the upstream network, reducing the number of off-chip components (i.e., NVMs) required in the network compared to previous SoCs [18-22]. Using this feature, different SoCs can be programmed with either the same or different programs via the bySPI protocol. 3) An on-chip clock architecture with tunable and dividable fast (50MHz) and slow (180kHz) clock sources that can be synchronized across SoCs using a high-precision clock synchronization algorithm. At nominal conditions (TT, 1.2V, 20°C), the nominal clock tuning resolution for the fast clock is 0.15%, which increases to 0.22% worst-case across temperature, while the worst-case tuning resolution for the slow clock is 0.11%.

The bySPI architecture (Fig. 37.2.2) supports kNOT routing with: 1) A one-to-many 3-wire SPI protocol that allows the transmitter (TX) to bypass signals to the intended RX. 2) A group-access feature that allows for simultaneous access to multiple chiplets within a downstream group. 3) A chip-select (CS)-free soft reset to address the lack of a conventional CS signal, and a timeout state to address dropped clock edges (Fig. 37.2.2, bottom left). 4) A CS generator to interface with COTS modules using 4-wire SPI. Both chiplets communicate at 1.8V, with the digital core running at 1.1V generated by a fully on-chip low-dropout regulator (LDO) (Fig. 37.2.3) comprising a PMOS pass transistor, a 300mV Vref, and a 1:4 feedback network of 4 pseudo-resistors to save area. An RC circuit is added for stability compensation, and the on-chip output capacitor is a well-biased MOS [23]+MOM+MIM design to achieve a high capacitor density of 9.95fF/µm² (simulated value) over existing designs [24]. In our kNOT system, both chiplets can interface with COTS sensor and memory nodes by reconfiguring their pads to use 4-wire SPI. The upstream network can be any preprogrammed SoC or TX capable of sending SPI signals and setting up the network.

Figure 37.2.3 explains the tailored algorithms we developed for this kNOT system. We propose a global bootup process based on asynchronous handshaking between upstream chiplets and SoCs, where the TX reads back and validates the written program to correct any data corruption in the yarns. The handshake and timeout features collectively ensure that the SoCs in a kNOT are programmed reliably.
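The read-back-validated handshake can be pictured with the short sketch below (our illustration only: the link object, its write/read/commit calls, and the chunk and retry parameters are hypothetical stand-ins for the actual bySPI transactions):

def program_soc(link, image, chunk_size=64, max_retries=3):
    # Write a program image to a downstream SoC and verify every chunk.
    for offset in range(0, len(image), chunk_size):
        chunk = image[offset:offset + chunk_size]
        for _ in range(max_retries):
            link.write(offset, chunk)                 # send the chunk over the yarn bus
            if link.read(offset, len(chunk)) == chunk:
                break                                 # read-back matches: move on
        else:
            raise TimeoutError(f"chunk at offset {offset} failed verification")
    link.commit()                                     # release the SoC to boot

class FakeLink:
    # In-memory stand-in for a bySPI channel, for demonstration only.
    def __init__(self):
        self.mem = bytearray(1024)
    def write(self, offset, data):
        self.mem[offset:offset + len(data)] = data
    def read(self, offset, length):
        return bytes(self.mem[offset:offset + length])
    def commit(self):
        pass

program_soc(FakeLink(), bytes(range(256)))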

We introduce a clock and timestamp synchronization protocol [25], depicted in Fig. 37.2.3 (right), that surpasses traditional schemes requiring a crystal oscillator (XO) for each chip in a network. Timestamps are appended to data packets before storage to ensure seamless reconstruction during read-out. Addressing the impracticality of using XOs for each SoC, this approach employs a periodic reference timestamp sent by the upstream network. Downstream SoCs compare this reference to a locally generated value from a timer peripheral, enabling precise tuning of on-chip clocks based on the difference in timestamp values.
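One plausible form of the tuning loop each downstream SoC runs is sketched below (the trim-register model, step size, and clamping are our assumptions; only the 0.15% fast-clock tuning resolution comes from the text above):

def tune_clock(trim_code, local_count, reference_count,
               step_ppm=1500, trim_min=0, trim_max=255):
    # One trim step is modeled as ~1500 ppm (0.15%), the fast-clock resolution.
    error_ppm = 1e6 * (local_count - reference_count) / reference_count
    correction = int(error_ppm / step_ppm)            # steps needed, toward zero
    return max(trim_min, min(trim_max, trim_code - correction))

# Example: the local timer ran 0.45% fast over the reference interval,
# so the trim code is lowered by three steps.
print(tune_clock(trim_code=128, local_count=100450, reference_count=100000))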

The custom bySPI chiplet uses an SPI-compliant protocol with 8b commands to identify the access target in the downstream network and reduce pad count vs. conventional SPI. By sending bypass instructions to set the output direction of the bySPI chiplet, the TX can build a channel in the kNOT system to communicate with the target. The Function and Direction Controller block (Fig. 37.2.2) decides whether to enter the "selected" or "unselected" state based on instructions received from the TX. To enter the "unselected" state, bySPI must receive a bypass command, after which it transmits signals to the downstream chip. To enter the "selected" state from the "unselected" state, bySPI must receive a reset-path command followed by an access command, enabling it to respond to future read and write commands. The soft-reset instruction resets the bySPI state machine upon receiving at least fifty '1's followed by a '0', allowing communication to be recovered following a partial signal loss.
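A behavioral reading of the controller and its soft reset is sketched below (command names and encodings are invented placeholders for the 8b commands; the fifty-ones-then-zero reset rule is taken from the description above):

BYPASS, RESET_PATH, ACCESS = "BYPASS", "RESET_PATH", "ACCESS"

class BySPIController:
    def __init__(self):
        self.state = "unselected"
        self.ones = 0                      # run length of consecutive '1' bits
        self.saw_reset_path = False

    def on_command(self, cmd):
        if cmd == BYPASS:
            self.state = "unselected"      # forward traffic downstream
            self.saw_reset_path = False
        elif cmd == RESET_PATH:
            self.saw_reset_path = True
        elif cmd == ACCESS and self.saw_reset_path:
            self.state = "selected"        # now responds to reads and writes

    def on_bit(self, bit):
        # Soft reset: at least fifty '1's followed by a '0' resets the FSM.
        if bit == 1:
            self.ones += 1
        else:
            if self.ones >= 50:
                self.__init__()            # recover from partial signal loss
            self.ones = 0

c = BySPIController()
c.on_command(RESET_PATH); c.on_command(ACCESS)
print(c.state)                             # selected
for _ in range(55): c.on_bit(1)
c.on_bit(0)
print(c.state)                             # back to unselected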

The SoC and bySPI chiplets are fabricated in 65nm LP CMOS. Figure 37.2.4 (top) shows a demo kNOT system of SoCs (S1-S3) and bySPI chiplets (b1-b3), with a COTS sensor (BME280) and memory (MX25U12843G). Once S1 is programmed, it sets up the bySPI link b1-b2 and performs read and write operations with the sensor and memory. The measured waveform in Fig. 37.2.4 shows the b1-b3 bySPI tree being configured to bypass signals in the desired direction and transmit to multiple leaf nodes. While programming S1, we intentionally send some corrupted data packets to validate the fault-tolerant bootup algorithm. The read and write operation with COTS components is achieved using the protocol-conversion feature of b2, and Fig. 37.2.4 (bottom right) shows in-textile measured waveforms. Figure 37.2.5 shows the measured electrical performance of the LDO and measured waveforms for example operations from the bySPI protocol. At a frequency of 80MHz, the SoC consumes 3.34mW, while the bySPI chip consumes 0.27mW at 34MHz. The performance of the LDO is characterized across five measured chips. The LDO consumes a low quiescent current of 5.8µA, and its output voltage is 1.12V when there is no load current. When the load current changes by 10mA within 1ns, the output voltage experiences a variation (ΔV) of 45mV. Additionally, the LDO achieves a power supply rejection (PSR) of -27dB at 10MHz. Figure 37.2.6 compares the kNOT solution with state-of-the-art textile-integrated systems and SoCs. The kNOT system excels in textile integration while offering capabilities for processing, networking, and data collection using the bySPI and SoC chiplets. Figure 37.2.7 shows the die micrographs of the SoC and bySPI chiplets and a comparison with miniaturized SoCs. The seamless integration of yarns, chiplets, and protocols positions the kNOT as an integratable, scalable choice for creating a distributed, garment-wide, 2D network of interconnected chiplets.

Acknowledgement: This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via N66001-23-C-4512. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

Figure 37.2.1: The kNOT system (top); prior work (bottom left); integration of a chiplet in textiles (bottom right).
Figure 37.2.2: kNOT system diagram (top); architecture of the chiplets (bottom); bySPI reset behavior (bottom right).
Figure 37.2.3: bySPI protocol (top left); bootup algorithm (middle); LDO (bottom left); clock synchronization (right).
Figure 37.2.4: Test setup (top) & measured waveforms (left and top right); in-textile measured waveforms (bottom right).
Figure 37.2.5: Measured performance for LDO (top); measured waveforms for bySPI protocol (bottom).
Figure 37.2.6: Comparison of kNOT with prior state-of-the-art textile-integrated systems.
Figure 37.2.7: Comparison of the SoC with state-of-the-art customized SoCs (top); die photos of the chiplets.

References:

[1] J. Pu et al., "Textile Electronics for Wearable Applications," Int. Jour. of Extreme Manufacturing, vol. 5, no. 4, 2023.
[2] S. Liu et al., "Textile Electronics for VR/AR Applications," Advanced Functional Materials, vol. 31, no. 39, 2021.
[3] K. Yang et al., "E-Textiles for Sports and Fitness Sensing: Current State, Challenges, and Future Opportunities," Sensors, vol. 24, no. 4, 2024.
[4] S. de Mulatier et al., "Electronic Circuits Integration in Textiles for Data Processing in Wearable Technologies," Advanced Materials Technologies, vol. 3, no. 10, 2018.
[5] I. Wicaksono et al., "A Tailored, Electronic Textile Conformable Suit for Large-Scale Spatiotemporal Physiological Sensing In Vivo," npj Flexible Electronics, vol. 4, 2020.
[6] A. Komolafe et al., "Integrating Flexible Filament Circuits for E-Textile Applications," Advanced Materials Technologies, vol. 4, no. 7, 2019.
[7] M. Li et al., "Integration and Testing of a Three-Axis Accelerometer in a Woven E-Textile Sleeve for Wearable Movement Monitoring," Sensors, vol. 20, no. 18, 2020.
[8] E. Ismar, X. Tao, F. Rault, F. Dassonville and C. Cochrane, "Towards Embroidered Circuit Board From Conductive Yarns for E-Textiles," IEEE Access, vol. 8, pp. 155329-155336, 2020.
[9] G. Loke et al., "Digital Electronics in Fibres Enable Fabric-Based Machine-Learning Inference," Nat. Commun., vol. 12, 2021.
[10] S. Hwang et al., "Integration of Multiple Electronic Components on a Microfibre Towards an Emerging Electronic Textile Platform," Nat. Commun., vol. 13, 2022.
[11] H. Kim et al., "A 1.12mW Continuous Healthcare Monitor Chip Integrated on a Planar Fashionable Circuit Board," ISSCC, pp. 150-603, 2008.
[12] X. Liu et al., "A Self-Powered SoC with Distributed Cooperative Energy Harvesting and Multi-Chip Power Management for System-in-Fiber," ISSCC, pp. 236-238, 2023.
[13] A. K. Gundu, L. Fassio, M. Alioto, "E-Textile Battery-Less Walking Step Counting System with 23pW Power, Dual-Function Harvesting from Breathing, and No High-Voltage CMOS Process," IEEE Symp. VLSI Circuits, 2024.
[14] X. Liu et al., "A 1pJ/bit Bypass-SPI Interconnect Bus with I2C Conversion Capability and 2.3nW Standby Power for Fabric Sensing Networks," IEEE Biomedical Circuits and Systems Conf., 2023.
[15] J. Yoo et al., "A 5.2mW Self-Configured Wearable Body Sensor Network Controller and a 12µW Wirelessly Powered Sensor for a Continuous Health Monitoring System," IEEE JSSC, vol. 45, no. 1, pp. 178-188, 2010.
[16] S. Lee et al., "A 75µW Real-Time Scalable Body Area Network Controller and a 25µW ExG Sensor IC for Compact Sleep Monitoring Applications," IEEE JSSC, vol. 47, no. 1, pp. 323-334, 2012.
[17] N. Desai, J. Yoo and A. P. Chandrakasan, "A Scalable, 2.9mW, 1Mb/s e-Textiles Body Area Network Transceiver With Remotely-Powered Nodes and Bi-Directional Data Communication," IEEE JSSC, vol. 49, no. 9, pp. 1995-2004, 2014.
[18] P. Schonle et al., "A Multi-Sensor and Parallel Processing SoC for Miniaturized Medical Instrumentation," IEEE JSSC, vol. 53, no. 7, pp. 2076-2087, July 2018.
[19] X. Wu et al., "A 0.04mm³ 16nW Wireless and Batteryless Sensor System with Integrated Cortex-M0+ Processor and Optical Communication for Cellular Temperature Measurement," IEEE Symp. VLSI Circuits, pp. 191-192, 2018.
[20] L. Xu et al., "A 210×340×50µm Integrated CMOS System for Micro-Robots with Energy Harvesting, Sensing, Processing, Communication and Actuation," ISSCC, pp. 194-195, Feb. 2022.
[21] S. O'Driscoll et al., "A 200µm×200µm×100µm, 63nW, 2.4GHz Injectable Fully Monolithic Wireless Bio-Sensing System," IEEE Radio Freq. Integr. Circuits Symp., pp. 256-259, 2017.
[22] J. Lim et al., "A Light-Tolerant Wireless Neural Recording IC for Motor Prediction With Near-Infrared-Based Power And Data Telemetry," IEEE J. Solid-State Circuits, vol. 57, no. 4, pp. 1061-1074, 2022.
[23] C. Jang et al., "A Reconfigurable Step-Down Switched-Capacitor Power Converter Using Optimized Partial Series-Parallel (OPSP) Topology," IEEE TCAS-II, vol. 71, no. 6, pp. 2941-2945, 2024.
[24] P. Vivet et al., "A 220GOPS 96-Core Processor with 6 Chiplets 3D-Stacked on an Active Interposer Offering 0.6ns/mm Latency, 3Tb/s/mm² Inter-Chiplet Interconnects and 156mW/mm² 82%-Peak-Efficiency DC-DC Converters," ISSCC, pp. 46-48, 2020.
[25] J.-J. Qin et al., "Design of a High-Precision Clock Distribution And Synchronization System," Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, vol. 1062, 2024.
[26] B. Chaudhary, G. A. Lyngdoh, J. Owens, S. Das, H. Matos, "An Investigation Into The Electromechanical Performance Of Textile Fabrics With Conductive Yarn Elements For Data Transfer Capabilities," Textile Research Journal, vol. 94, no. 17-18, pp. 2079-2091, 2024.

37.3 Monolithic In-Memory Computing Microprocessor for End-to-End DNN Inferencing in MRAM-Embedded 28nm CMOS Technology with 1.1Mb Weight Storage

Soonwan Kwon1,2, Sungmeen Myung1, Jangho An1, Hyunsoo Kim1, Minje Kim1, Hyungwoo Lee1, Wooseok Yi1, Seungchul Jung1, Daekun Yoon1, Shinhee Han3, Saeyoon Chung3, Kilho Lee3, Jeong-Heon Park3, Kangho Lee3, Sang Joon Kim1, Donhee Ham1,4
1Samsung Advanced Institute of Technology, Suwon, Korea; 2Seoul National University, Seoul, Korea; 3Samsung Electronics, Giheung, Korea; 4Harvard University, Cambridge, MA

Always-on AI sensor applications, based on deep neural networks (DNNs) with sparse inference, require low power consumption during both computing and idle phases. In-memory computing (IMC) with non-volatile memory crossbar arrays could meet this demand. Concretely, the co-location of memory (DNN weight storage) and computing (analog matrix multiplications (MMs)) in crossbar arrays obviates the need to shuttle weight data, thus reducing the computing power consumption, while the use of non-volatile memory minimizes the power consumption during idle states. Resistive, phase-change, and magneto-resistive random-access memory (RRAM, PRAM, and MRAM) and flash memory are well-known non-volatile memory types, with their own merits and drawbacks. Of these, MRAM boasts high endurance and low switching energy (with the drawback being 1b storage per cell) [1] and has been commercially embedded in CMOS logic technology, and thus MRAM is a good non-volatile memory candidate for IMC. However, previous MRAM IMC works [2-5] have focused on individual crossbar arrays, which correspond only to a small fraction of a DNN.

We realize a monolithic IMC microprocessor capable of end-to-end DNN computing in MRAM-embedded 28nm CMOS logic technology. It fully integrates a mixed-signal datapath consisting of many MRAM crossbar arrays for analog MMs and digital logic for non-MM processing (such as pooling and activation) to map all DNN layers (Fig. 37.3.1). Specifically, the datapath features 14 analog MM engines, comprising 126 MRAM crossbar arrays and 1,008 transimpedance amplifier (TIA) and analog-to-digital converter (ADC) pairs, for analog MMs and their digital conversion, and 1 post engine (digital logic) for non-MM processing. Computed data is routed to and from the analog and digital sectors fully on chip to handle all DNN layers, while managing analog noise.

Figure 37.3.2 illustrates the datapath backbone of our processor, consisting of the 14 MM engines and 1 post engine. Each MM engine, comprising nine 128×64 MRAM crossbar arrays and their readout electronics (8 TIA-ADC pairs per crossbar array), executes analog MMs and digital conversion. Concretely, the MM engine takes nine 1×128 input vectors and produces nine 1×64 digitized vectors, which are digitally added into one 1×64 digitized vector as the final output. Effectively, the MM engine multiplies a 1×1,152 input vector with a 1,152×64 weight matrix to yield a 1×64 digitized output vector. The post engine (digital logic) performs non-MM processing. The datapath has a total of 126 crossbar arrays with 1Mb weight-storage capacity and 1,008 TIA-ADC pairs. To compute a DNN layer, we typically use one MM engine followed by the post engine to perform MMs and subsequent non-MM processing. Its output is then fed to the next MM engine, followed by the post engine, for computing the next DNN layer. This is repeated for end-to-end DNN inference, controlled by a microcontroller unit (MCU) integrated on the same chip.
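The MM-engine arithmetic maps directly onto ordinary matrix shapes, as the following numpy sketch shows (the ±1 encoding and the crude 4b rounding are our simplifications of the analog XNOR-style multiply and the flash-ADC readout, not the circuit itself):

import numpy as np

rng = np.random.default_rng(0)

# One MM engine: nine 128x64 crossbar arrays, i.e. a 1,152x64 weight matrix.
weights = rng.choice([-1, 1], size=(9, 128, 64))      # binary weights
x = rng.choice([-1, 1], size=(9, 128))                 # nine 1x128 input slices

partial = np.einsum("kij,ki->kj", weights, x)          # nine 1x64 analog MAC results
partial_q = np.clip(np.round(partial / 16), -8, 7)     # stand-in for 4b flash ADCs

# The nine digitized vectors are summed digitally into the final 1x64 output;
# without quantization this equals a 1x1,152 by 1,152x64 matrix product.
y = partial_q.sum(axis=0)
y_ref = x.reshape(1, 9 * 128) @ weights.reshape(9 * 128, 64)
print(y.shape, y_ref.shape)                            # (64,) (1, 64)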

Figure 37.3.3 illustrates the dataflow for a 3×3 convolution in an MM engine, which consists of 3 processing elements (PEs), each containing 3 crossbar arrays. These arrays share inputs and control circuits (except enable signals) to reduce routing complexity. To enhance throughput, the MM engine takes 4 lines of inputs (e.g., X1, X5, X9, X13) through an input fetcher unit (IFU) and loads them into a shared shift register, which aligns the inputs to efficiently feed the 3 PEs. The MM engine processes these inputs over a 4-cycle period, producing 2-line outputs in a zig-zag pattern, which are sent directly to the post engine (if pooling is performed in the post engine, the output is condensed into 1 line). Although the throughput (0.5 outputs per cycle) is still lower than that of a fully connected dataflow (1 output per cycle), the benefits of this architecture, namely reduced connection complexity and lower SRAM bandwidth requirements, are significant, particularly in resource-constrained always-on AI sensing applications. Furthermore, the dataflow utilizes vertical overlap between input feature map (IFM) tiles, reducing the SRAM accesses required for IFM fetching by a factor of 0.67 compared to line-by-line processing [6].

Figure 37.3.4 presents the schematic of a 128×64 MRAM crossbar array. Each bitcell, which stacks 8 identical unit structures, with each unit featuring 2 MTJs and 2 FETs, performs a binary multiplication, which is akin to the XNOR operation in the digital domain. This switching-based analog multiplication also works for a bitcell with only 1 unit structure [3], but we use 8 unit structures per bitcell to increase the bitcell resistance for relatively low-power operation. The MRAM crossbar array with 64 columns, followed by 8 TIA-ADC pairs (each TIA-ADC pair is multiplexed across 8 columns), produces 64 digitized multiply-and-accumulate (MAC) outputs (each ADC is a 4b flash ADC), thus enabling the aforementioned MMs. All crossbar arrays have two individual calibration points and share common reference voltages.
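The equivalence between the ±1 analog multiplication and a digital XNOR can be checked in a few lines (illustration only):

# For a value in {-1, +1} encoded as the bit (a + 1) // 2, the product a*b
# maps exactly onto XNOR of the corresponding bits.
for a in (-1, +1):
    for b in (-1, +1):
        product_bit = (a * b + 1) // 2
        xnor_bit = 1 - (((a + 1) // 2) ^ ((b + 1) // 2))
        assert product_bit == xnor_bit
print("±1 multiplication behaves as XNOR on the corresponding bits")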

Figure 37.3.5 depicts system-wide calibrations aimed at mitigating prominent circuit non-idealities, in particular column-to-column variations, layer-to-layer variations, and residual noise. 1) To mitigate the column-to-column variations, each column group, consisting of 8 columns that share a TIA-ADC pair, is initially tuned over a broad range by compensating for the offset voltage of the op-amp in the TIA, and then finely tuned by controlling the resistance in the TIA while comparing the measured output with the target output. After completing the calibration of all the column groups within a chip, 95% of the errors across the 126 crossbar arrays are reduced to within 1 LSB. 2) Layer-to-layer variations arise because the number of weights generally varies from layer to layer in a DNN, and thus different MM engines use different numbers of crossbar arrays, drawing different average currents from the power supply. This causes layer-to-layer variations in the offsets of MAC output values. These variations can be experimentally estimated by using predefined patterns of inputs and can subsequently be calibrated out (this correction incurs no additional cost because the estimated variations are combined into bias-addition parameters). This calibration of the layer-dependent variations is crucial for achieving software-comparable accuracy (without the calibration, inference accuracy on the MNIST dataset decreases by nearly 10%).

3) Residual noise encompasses all remaining errors after calibrating both the column- and layer-dependent variations. To enhance the robustness of inference against this residual noise, we perform noise-aware training. ReLU activation does not work well for DNNs with many layers, due to deep back-propagation of noise. The hyperbolic tangent (tanh) activation does not back-propagate noise as deeply, but it suffers from the gradient-vanishing problem during training. To address this, we perform noise-aware training as follows. First, we train the DNN with ReLU for all layers and with no noise. Then, we convert the activation function of the first layer to tanh and retrain the DNN, now with noise added to the first layer. Subsequently, we convert the activation function of the second layer to tanh and retrain the DNN, with noise added to the first and second layers, while not updating the weights of the first layer. We repeat this process until all activation functions become tanh. Ultimately, for inference, our processor uses the tanh activation for all layers.
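The progressive ReLU-to-tanh schedule can be written down compactly; the sketch below uses a tiny random-data MLP in PyTorch purely to illustrate the ordering of conversion, noise injection, and freezing (model size, noise level, optimizer, and data are placeholders, not the authors' setup):

import torch
import torch.nn as nn

torch.manual_seed(0)
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])
head = nn.Linear(16, 10)
x_train = torch.randn(256, 16)
y_train = torch.randint(0, 10, (256,))

def forward(x, n_tanh, noise_std=0.1):
    # Layers [0, n_tanh) use tanh with injected noise; the rest stay ReLU.
    for i, layer in enumerate(layers):
        h = layer(x)
        if i < n_tanh:
            h = torch.tanh(h) + noise_std * torch.randn_like(h)
        else:
            h = torch.relu(h)
        x = h
    return head(x)

def train(n_tanh, frozen, epochs=20):
    for i, layer in enumerate(layers):
        layer.requires_grad_(i >= frozen)        # freeze already-converted layers
    params = [p for p in list(layers.parameters()) + list(head.parameters())
              if p.requires_grad]
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(forward(x_train, n_tanh), y_train)
        loss.backward()
        opt.step()

train(n_tanh=0, frozen=0)                        # step 1: all-ReLU, no noise
for k in range(1, len(layers) + 1):              # then convert one layer at a time,
    train(n_tanh=k, frozen=k - 1)                # freezing the layers converted earlier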

Figure 37.3.6 presents system-level measurements. The power efficiencies of a representative individual MRAM crossbar array and of a representative IMC microprocessor in its entirety are 59.8TOPS/W and 20TOPS/W, respectively. The power breakdown across all processor components was measured for 100 chips, and its average is also shown in the figure. Under the same clock and supply voltage used for the power measurement, we classify MNIST handwritten digits using an 8-layer convolutional neural network fully mapped onto the IMC microprocessor with dataflow fully on chip, achieving an accuracy of 97.62% (software baseline: 99.45%). In addition, we evaluated the face-detection model on the FDDB dataset, with all weights fully deployed on chip but with external feature-map tiling (due to the large image size), achieving an accuracy of 91.3% (software baseline: 92.51%). Figure 37.3.7 shows a die photo and the performance summary.

Figure 37.3.1: Edge AI system overview, motivation for an end-to-end IMC microprocessor.
Figure 37.3.2: Overall architecture of the IMC microprocessor.
Figure 37.3.3: Dataflow: example for 3×3 convolution with max-pooling.
Figure 37.3.4: MRAM crossbar array and readout electronics.
Figure 37.3.5: A system-wide calibration approach.
Figure 37.3.6: Comparison with prior work.
Figure 37.3.7: Die photo and performance summary.

References:

[1] Q. Shao, Z. Wang, and J. J. Yang, "Efficient AI with MRAM," Nature Electronics, vol. 5, pp. 67-68, 2022.
[2] P. Deaville et al., "A 22nm 128-kb MRAM Row/Column-Parallel In-Memory Computing Macro with Memory-Resistance Boosting and Multi-Column ADC Readout," IEEE Symp. VLSI Circuits, pp. 268-269, 2022.
[3] S. Jung et al., "A Crossbar Array of Magnetoresistive Memory Devices for In-Memory Computing," Nature, vol. 601, pp. 211-216, 2022.
[4] H. Cai et al., "A 28nm 2Mb STT-MRAM Computing-in-Memory Macro with a Refined Bit-Cell and 22.4-41.5TOPS/W for AI Inference," ISSCC, pp. 500-502, 2023.
[5] W. Xie et al., "A 709.3 TOPS/W Event-Driven Smart Vision SoC with High-Linearity and Reconfigurable MRAM PIM," IEEE Symp. VLSI Circuits, 2023.
[6] S. Yin et al., "PIMCA: A 3.4-Mb Programmable In-Memory Computing Accelerator in 28nm for On-Chip DNN Inference," IEEE Symp. VLSI Circuits, 2021.
[7] S. Zhang et al., "FaceBoxes: A CPU Real-Time Face Detector with High Accuracy," IEEE International Joint Conf. on Biometrics, 2017.

37.4 SHINSAI: A 586mm² Reusable Active TSV Interposer with Programmable Interconnect Fabric and 512Mb 3D Underdeck Memory

Bo Jiao*1, Haozhe Zhu*1, Yuman Zeng1, Yongjiang Li1, Jie Liao1, Siyao Jia1, Zexing Chen1, Mochen Tian2, Jundong Zhu2, Dexin Wen2, Yan Wang2, Yu Wang2, Jian Xu2, Feng Wang2, Jun Tao1, Chixiao Chen1, Qi Liu1, Ming Liu1
1Fudan University, Shanghai, China; 2Kiwimoore Semiconductors, Shanghai, China
*Equally Credited Authors (ECAs)

As high-density 3D integration technology progresses, active TSV interposers are emerging as a critical component in More-than-Moore strategies [1,2,3], enabling the transformation of traditional multicore monolithic systems into compact two-layer stacking architectures. These active interposers, often referred to as base dies, typically occupy several hundred square millimeters and provide advantages such as integrated power management, mature-technology-friendly system I/Os, and efficient inter-chiplet communication through active buffers. Developing large-area silicon with TSV/bumping technology for dedicated designs is both costly and inefficient, analogous to reinventing the wheel. Reusable 2.5D/3D components have emerged as a promising solution [4,5,6] to reduce the NRE costs and simplify the design complexity of 3D integrated systems. For example, a single active interposer can be configured in two different schemes [3]: one supports three CPU top dies, while the other operates with two domain-specific accelerators. However, the interconnect and interfaces are customized for specific chiplets in the different configurations, leading to challenges in reusing signal links on the interposer. Moreover, the expansive active area of the interposer allows for the integration of considerable on-interposer memory, known as 3D underdeck memory. However, the orchestration between the top dies and the underdeck memory poses challenges due to the diverse requirements across different tasks.

To address these issues, this paper presents a reusable active TSV interposer, known as SHINSAI, in 28nm-CMOS/TSV-middle/micro-bump (µbump) technology. The key contributions are: 1) a heterogeneous dual-layer network-on-active-interposer (NoAI) with 512Mb of underdeck SRAM; 2) a programmable horizontal (2.5D) die-to-die link (H-Link) fabric, achieving independent bump-to-bump routing for various interconnect topologies; 3) a reconfigurable vertical link (V-Link) bridging the 3D NoC between the top and base dies, adaptable to varying up/down-stream bandwidth on stacked interfaces.

Figure 37.4.1 illustrates the overall architecture of 3D integrated systems based on SHINSAI, hosting up to 16 functional top chiplets. To accommodate the diverse communication requirements among chiplets, the NoAI of SHINSAI supports two different communication mechanisms: a packet-switching 3D stacking NoC and a circuit-switching programmable H-Link fabric. To validate both approaches, an NPU chiplet, equipped with a RISC-V core [7] and a previous compute-in-memory accelerator [8], was developed to support both interfaces. The H-Link interface is compatible with 2.5D interconnect circuits [9], while the V-Link interface unifies the 3D networks of the top and base dies into a single integrated network. The 3D NoC also connects 8 banks, 512Mb, of SRAM as underdeck shared memories, boosting system performance with less external memory access. In addition, SHINSAI incorporates a CPU responsible for booting the entire system and configuring all modules, along with high-speed I/Os and serial peripherals for external data communication. Analog modules, such as integrated voltage regulators and PLLs, are also embedded.

The programmable H-Link fabric achieves bump-wise interconnect flexibility through the structure demonstrated in Fig. 37.4.2. A grid matrix is implemented across the entire SHINSAI to interconnect the top dies in various topologies. This matrix consists of multiple segments of high-speed parallel signal lanes, known as tracks, which run in both vertical and horizontal directions. At each intersection of these vertical and horizontal tracks, a Cross-Over Block (COB) is employed to handle signal direction changes, functioning similarly to a junction at a crossroads. Moreover, the matrix has 64 bump clusters connecting the tracks to the top dies' H-Link interfaces, each with 128 40µm-pitch bumps. To satisfy the unique pin definitions of each top die, the clusters must accommodate arbitrary connections from each bump to a track via Turn-Out Blocks (TOBs). For circuit implementation, each COB is equipped with bidirectional drivers on all four sides and includes a pull-down switch to prevent tracks from floating when no driving capability is present. The cross-over topology employs a Wilton structure, which demonstrates good routability and efficient hardware implementation for segmented tracks [10]. Each TOB cluster organizes 128 bumps into two separate banks. Each bank is connected through a multi-stage switching network, consisting of 8×8 and 2×2 switch matrices, enabling selective routing. To maintain signal synchronization, both the COB and TOB drivers are equipped with digitally controlled delay lines, managed by a 16b unary control. Placement and routing tools were also developed for the fabric [11].

Figure 37.4.3 shows one cluster of the V-Link interface circuit, including 5 channels of 32-lane data transmission. The top-die interface module initially converts parallel bus signals into flits per a specified protocol, then serializes these signals at a higher rate for transmission through the fully digital physical-layer circuits on the vertical stacking interface [12]. A corresponding interface on SHINSAI performs the reverse operation, completing the 3D network. To detect bonding faults, a built-in self-detector is fused into the last driver of the interface. A comparator on SHINSAI evaluates the bump's voltage by using unbalanced pull-up and pull-down resistors to assess connectivity. Upon detecting a failure, redundant bumps are employed to repair the faulty path [14].
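The fault-detection-and-repair step can be pictured as below (our illustration; the voltage window, bump names, and spare-allocation policy are invented, and only the unbalanced pull-up/pull-down sensing idea comes from the text):

def detect_and_repair(bump_voltages, redundant_bumps, v_lo=0.3, v_hi=0.6):
    # With unbalanced pull-up/pull-down resistors, a properly bonded bump
    # settles inside a known voltage window; anything outside is flagged
    # and its lane is remapped to a redundant bump.
    repair_map, spares = {}, list(redundant_bumps)
    for bump, v in bump_voltages.items():
        if not (v_lo <= v <= v_hi):                  # open or short detected
            if not spares:
                raise RuntimeError(f"bump {bump} faulty and no spare left")
            repair_map[bump] = spares.pop(0)         # reroute to a redundant bump
    return repair_map

print(detect_and_repair({"d0": 0.45, "d1": 0.02, "d2": 0.48},
                        redundant_bumps=["r0", "r1"]))   # {'d1': 'r0'}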

Unlike UCIe-3D [12], the V-Link interfaces provide a reconfigurable ratio of upstream to downstream channels. Among the five channels, there are two fixed-direction channels and three bidirectional channels. The fixed-direction channels either transmit data from the top dies or from SHINSAI, referred to as the TX and RX channels, respectively. The bidirectional channels can be reconfigured to operate in either the TX or RX direction. This flexible allocation can further boost link utilization. For instance, when coping with different layers in neural networks, communication-aware dataflows are assigned to different chiplets [13]. In bottleneck-based data concatenation, increased TX bandwidth is desired for large output feature maps. Conversely, when frequent weight updates are necessary due to the limited on-chip memory of the top die, additional RX bandwidth can reduce the latency. Experimental results demonstrate that a reconfigurable upstream-to-downstream ratio can reduce latency by 1.37×, given the same total 3D bandwidth. In contrast, 2.5D mesh interconnects, where only chiplets on the edge can access external memory, exhibit up to 3.11× higher latency.
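A simple way to think about the direction assignment of the three bidirectional channels is the greedy allocation sketched below (per-channel bandwidth and the traffic numbers are invented for illustration; only the 2-fixed-plus-3-flexible channel structure comes from the description above):

def allocate_channels(tx_demand_gbps, rx_demand_gbps,
                      fixed_tx=1, fixed_rx=1, flexible=3, ch_bw_gbps=10.0):
    # Give each flexible channel to whichever direction is more oversubscribed.
    tx_ch, rx_ch = fixed_tx, fixed_rx
    for _ in range(flexible):
        tx_over = tx_demand_gbps / (tx_ch * ch_bw_gbps)
        rx_over = rx_demand_gbps / (rx_ch * ch_bw_gbps)
        if tx_over >= rx_over:
            tx_ch += 1
        else:
            rx_ch += 1
    return tx_ch, rx_ch

# Concatenation-heavy layers with large output feature maps: favor TX.
print(allocate_channels(tx_demand_gbps=40, rx_demand_gbps=10))   # (4, 1)
# Frequent weight refills from the underdeck SRAM: favor RX.
print(allocate_channels(tx_demand_gbps=10, rx_demand_gbps=40))   # (1, 4)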

180、ages ranging from 0.6-1.2V 15,16,with the power switches and control loops of the buck converters integrated into the interposer,while passive components are mounted on the substrate.The I VRs reduce the I R drop of the power delivery network by 1.5-3 when directly supplying hundreds of amperes thro

181、ugh power/ground TSV pairs.The approach also supports multiple power domains on SHI NSAI with a single power supply.The physical design of SHI NSAI involved a task of evenly distributing a large number of SRAM banks,TSVs,and components belonging to the programmable H-Link fabric,including TOBs/COBs

182、and tracks,across an extensive CMOS logic area.This process required a dedicated manual layout to ensure precise placement,while adhering to TSV design keep-out rules.The SHI NSAI interposer is fabricated using 28nm CMOS technology,incorporating a TSV-middle process to enhance routability,whose layo

183、ut and performance are shown in Fig.37.4.7.The entire interposer consumes 36.2W under a 0.9V power supply,where the NoC and memories all operate at a clock rate of 400MHz.Figure 37.4.5 presents the voltage-frequency scaling characteristics of the 3D V-Link interface,along with a detailed power break

184、down.Figure 37.4.6 compares SHI NSAI with the base dies in SOTA 3D-integrated systems,highlighting the first active interposer focusing on fine-grained reusability,including programable 2.5D and reconfigurable 3D links.I n addition,it also features substantial underdeck memory and good energy effici

185、ency of 3D interfaces.The introduction of a prefabricated reusable active interposer enhances the modularization of heterogeneous multi-core systems 17 by consolidating the inter-chiplet network,SRAM buffers,and voltage regulators onto the interposer.This approach allows a concentrated focus on the

186、development of new functional top chiplets by reusing the interposer.Leveraging fully digital,highly efficient 3D interfaces,the resulting multi chiplet 3D system achieves similar performance to monolithic designs.Ac k nowle dge me nt:This work was supported in part by the National Natural Science F

187、oundation of China(NSFC)under Grant 62322404,in part by the National Key Research and Development Program of China under Grant 2023YFB4404402,in part by the Fundamental Research Funds for the Central Universities,and in part by the Strategic Priority Research Program of CAS under Grant XDB44000000.C

188、orresponding author:Chixiao Chen.Figure 37.4.1:The overall architecture of active TSV-interposer-based 3D integrated system.Figure 37.4.2:Circuit implementation of programmable H-Link fabric with COB and TOB.Figure 37.4.3:Circuit implementation of V-Link interface with bi-directional channels and bo

189、nding fault detector.Figure 37.4.4:Power management of 3D integrated system,and physical design of the active interposer.Figure 37.4.5:Measured interface performance,and the interposers power breakdown.Figure 37.4.6:Performance summary and compared to prior art.ISSCC 2025/February 19,2025/2:45 PM613

Figure 37.4.7: Micrograph of the interposer layout and the entire 3D integrated system.

References:
[1] P. Vivet et al., "IntAct: A 96-Core Processor With Six Chiplets 3D-Stacked on an Active Interposer With Distributed Interconnects and Integrated Power Management," IEEE JSSC, vol. 56, no. 1, pp. 79-97, 2021.
[2] W. Gomes et al., "Meteor Lake and Arrow Lake Intel Next-Gen 3D Client Architecture Platform with Foveros," IEEE Hot Chips, 2022.
[3] A. Smith et al., "AMD Instinct MI300 Series Modular Chiplet Package - HPC and AI Accelerator for Exa-Class Systems," ISSCC, pp. 490-492, 2024.
[4] W. Lu et al., "Scalable Embedded Multi-Die Active Bridge (S-EMAB) Chips with Integrated LDOs for Low-Cost Programmable 2.5D/3.5D Packaging Technology," IEEE Symp. VLSI Circuits, 2024.
[5] F. Li et al., "GIA: A Reusable General Interposer Architecture for Agile Chiplet Integration," ACM/IEEE ICCAD, 2022.
[6] J. Nasrullah, Z. Luo, and G. Taylor, "Designing Software Configurable Chips and SIPs Using Chiplets and zGlue," Int. Symp. Microelectronics, pp. 27-32, 2019.
[7] B. Zimmer et al., "A 0.32-128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator With Ground-Referenced Signaling in 16 nm," IEEE JSSC, vol. 55, no. 4, pp. 920-932, 2020.
[8] H. Zhu et al., "COMB-MCM: Computing-on-Memory-Boundary NN Processor with Bipolar Bitwise Sparsity Optimization for Scalable Multi-Chiplet-Module Edge Machine Learning," ISSCC, 2022.
[9] C. Liu, J. Botimer, and Z. Zhang, "A 256Gb/s/mm-shoreline AIB-Compatible 16nm FinFET CMOS Chiplet for 2.5D Integration with Stratix 10 FPGA on EMIB and Tiling on Silicon Interposer," IEEE CICC, 2021.
[10] M. I. Masud and S. J. E. Wilton, "A New Switch Block for Segmented FPGAs," FPL, pp. 274-281, 1999.
[11] B. Jiao et al., "FPIA: Communication-Aware Multi-Chiplet Integration With Field-Programmable Interconnect Fabric on Reusable Silicon Interposer," TCAS-I, pp. 4156-4168, 2024.
[12] D. Sharma et al., "High-Performance, Power-Efficient Three-Dimensional System-In-Package Designs with Universal Chiplet Interconnect Express," Nature Electronics, pp. 244-254, June 2024.
[13] Y. S. Shao et al., "Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture," IEEE MICRO, pp. 14-27, 2019.
[14] U. Rathore, S. S. Nagi, S. Iyer, and D. Marković, "A 16nm 785GMACs/J 784-Core Digital Signal Processor Array With a Multilayer Switch Box Interconnect, Assembled as a 2×2 Dielet with 10µm-Pitch Inter-Dielet I/O for Runtime Multi-Program Reconfiguration," ISSCC, pp. 52-54, 2022.
[15] C.-H. Huang et al., "Improving SIMO-Regulated Digital SoC Energy Efficiencies Through Adaptive Clocking and Concurrent Domain Control," IEEE JSSC, pp. 90-102, 2021.
[16] J.-H. Cho et al., "A Fully Integrated Multi-Phase Buck Converter With On-Chip Capacitor Dynamic Re-Allocation and Fine-Grained Phase-Shedding Techniques," IEEE JSSC, pp. 3840-3852, 2021.
[17] M. C. Dos Santos et al., "A 12nm Linux-SMP-Capable RISC-V SoC with 14 Accelerator Types, Distributed Hardware Power Management and Flexible NoC-Based Data Orchestration," ISSCC, pp. 262-264, 2024.

37.5 SKADI: A 28nm Complete K-SAT Solver Featuring Dual-Path SRAM-Based Macro and Incremental Update with 100% Solvability
Zihan Wu, Xiyuan Tang, Tao Zhang, Lishan Lin, Haoyang Luo, Bocheng Xu, Zhongyi Wu, Jiahao Song, Yitao Liang, Xiaochen Bo, Yuan Wang
Peking University, Beijing, China

Boolean satisfiability (K-SAT, K≥3) is an NP-complete problem with applications in various fields, including electronic design automation [1], formal verification [2], and fault diagnosis [3]. The objective of the K-SAT problem is to determine whether a truth assignment exists for n Boolean variables Xi that satisfies all clauses of a formula F(x), typically given in conjunctive normal form. Given its NP-complete nature, solving K-SAT problems on von Neumann machines consumes extensive energy and time. To address this challenge, several ASIC solvers have been proposed, employing diverse methods such as continuous-time dynamics [4], Ising machines [5], and recurrent neural networks [6]. However, all prior works [4-8] are incomplete solvers that can only resolve satisfiable (SAT) cases, without providing a proof of unsatisfiability (UNSAT) for F(x). This constraint limits their practical usage, as the satisfiability of most real-world K-SAT problems is not predetermined, requiring solvers to verify the existence of solutions for F(x). Addressing this issue necessitates a complete analysis, which encounters three key challenges: 1) unlike incomplete solvers that solely deduce clauses from assignments, complete solvers require bidirectional deduction, i.e., they deduce both clauses from assignments and assignments from clauses; 2) incomplete solvers update assignments through heuristic methods without the need to track past decisions, thereby skipping assignment management, whereas complete solvers rely on historical decisions to update assignments, incurring significant area cost due to storage requirements; 3) unlike incomplete analysis, which involves only 2 clause states, complete analysis requires multi-level clause states with 100% accuracy, boosting the computational demand. Analog approaches to SAT solving suffer from accuracy loss, while conventional digital methods face challenges with poor area and energy efficiency. This work (SKADI) presents a complete K-SAT solver with three techniques that address the challenges of complete analysis: 1) a dual-path SRAM-based macro supporting forward processing-in-memory (FW-PIM) to deduce clauses from assignments and backward indexing (BW-IDX) to update assignments from clauses; 2) an incremental updater to hierarchically store and backtrack historical assignments; and 3) a compact and accurate position-encoded counter (PEC) for multi-level clause-state deduction. Prototyped in 28nm CMOS, the proposed design offers an average solution time of 17.1µs for SAT cases and 42.2µs for UNSAT cases on test sets with 50 variables and 218 clauses, offering a 952× speedup and 3.4×10⁵× energy reduction over a complete software SAT solver (DPLL [9]) on an AMD Ryzen 5 4500U CPU. As a complete K-SAT solver, it shows great potential for applications like formal verification and fault diagnosis, where knowing the satisfiability of F(x) is critical.
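For readers unfamiliar with what a complete analysis entails, the following is a minimal Python sketch of the classical DPLL procedure (the complete software baseline [9] used for comparison above): clauses are deduced from assignments by unit propagation, and conflicting branches are undone by backtracking, so the solver either finds a satisfying assignment or proves UNSAT by exhausting both polarities of every decision. The signed-integer literal encoding and helper names are illustrative, not SKADI's hardware flow.

```python
# A formula is a list of clauses; each clause is a set of integer literals
# (+i for xi, -i for NOT xi).

def unit_propagate(clauses, assignment):
    """Repeatedly assign variables forced by unit clauses; None on conflict."""
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(l in assignment for l in clause):
                continue                      # clause already satisfied
            free = [l for l in clause if l not in assignment and -l not in assignment]
            if not free:
                return None                   # clause conflict -> caller backtracks
            if len(free) == 1:                # unit clause: literal is forced
                assignment.add(free[0])
                changed = True
    return assignment

def dpll(clauses, assignment=frozenset()):
    assignment = unit_propagate(clauses, set(assignment))
    if assignment is None:
        return None                           # UNSAT under this branch
    if all(any(l in assignment for l in c) for c in clauses):
        return assignment                     # SAT: every clause satisfied
    # pick an unassigned variable and branch on both polarities
    var = next(abs(l) for c in clauses for l in c
               if l not in assignment and -l not in assignment)
    return dpll(clauses, assignment | {var}) or dpll(clauses, assignment | {-var})

# Example: (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
print(dpll([{1, -2}, {2, 3}, {-1, -3}]))      # a satisfying assignment, or None if UNSAT
```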

Figure 37.5.2 shows the top-level architecture and operation flow. SKADI consists of an SRAM-based dual-path macro, a clause analyzer, an incremental updater, and a backtracking controller. The macro includes 50×218 PEs, where each row represents a clause with 50 PEs storing variables X0 to X49. Since each variable X has two literals (positive x and negative ¬x), each PE requires two 6T SRAM cells to represent the variable: e.g., if literal x0 appears in clause C0, SRAM0,0 (x0 in C0) is set to 1; if literal ¬x0 appears in C0, SRAM1,0 (¬x0 in C0) is set to 1. In addition, each variable is associated with 2b column-wise assignments, AX and AXB: e.g., X0 is positive (x0=1, ¬x0=0) when AX0=1 and AXB0=0, and negative (x0=0, ¬x0=1) when AX0=0 and AXB0=1; if AX0=AXB0=0, X0 is unassigned (neither literal is asserted); if AX0=AXB0=1, X0 incurs a variable conflict. After programming F(x) into the macro, the incremental updater starts assigning unassigned variables and then activates the FW-PIM unit of the macro to deduce the clause states. These states are processed by the clause analyzer, which generates the corresponding control signals to activate different data paths. Specifically, if any clause conflict is detected, the backtracking controller retrieves previous assignments from the incremental updater for subsequent processing. Otherwise, the macro's BW-IDX units are activated to index unassigned literals. Then, the incremental updater updates the assignments based on the BW-IDX results, detects variable conflicts, and starts the next FW-PIM round. This iteration loop repeats until SKADI determines the satisfiability of F(x).
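As a software-level illustration of the encoding just described, the snippet below models the per-variable assignment pair (AX, AXB) and the wired-OR SAT-line check; the helper functions are hypothetical and only restate the truth table and satisfaction condition from the text.

```python
# Software model of the 2b assignment encoding and the SAT-line check.

def decode_assignment(ax: int, axb: int) -> str:
    return {
        (0, 0): "unassigned",
        (1, 0): "positive (x=1)",
        (0, 1): "negative (x=0)",
        (1, 1): "variable conflict",
    }[(ax, axb)]

def sat_line(pos_in_clause, neg_in_clause, ax, axb):
    """Clause is satisfied if any stored positive literal sees AX=1
    or any stored negative literal sees AXB=1 (wired-OR behaviour)."""
    return any((p and a) or (n and b)
               for p, n, a, b in zip(pos_in_clause, neg_in_clause, ax, axb))

# clause C0 = (x0 OR NOT x2) over three variables, with X2 assigned negative
print(sat_line([1, 0, 0], [0, 0, 1], ax=[0, 0, 0], axb=[0, 0, 1]))   # -> True
```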

Figure 37.5.3 shows the schematic and timing diagram of the proposed PE, which includes two 6T SRAM cells, a FW-PIM unit, and a BW-IDX unit. The FW-PIM unit comprises two SAT computing cells, an unassigned-literal detector, and a position-encoded counter (PEC), and deduces clause states in two steps. Initially, the reset signal PCHF resets all SWs and the wired-OR SAT lines. Then, the column-wise assignments AX and AXB are activated to compute the clause states through the SAT line and unassigned-literal counting. The SAT line indicates whether the clause is satisfied: if a positive literal xi stored in Cj sees AXi=1, or a negative literal ¬xi stored in Cj sees AXBi=1, the SAT line for Cj is charged to VDD, indicating that clause Cj is SAT. Unassigned-literal counting is performed by the detector and the PEC. In detail, the detector's SW charges to VDD only when AXi=AXBi=0 while Cj includes either xi or ¬xi, indicating an unassigned literal. The PEC, controlled by SW, cascades in a row-wise manner: it either passes the previous PEC's 3b position code (PC) output when SW=0, or right-shifts it when SW=1. Since each row's first PEC starts with a one-hot code, the position of the '1' at the end of each row determines the number of unassigned literals in the clause. Combining the SAT line and the 3b PC, the clause analyzer can deduce the clause states accurately. The BW-IDX unit indexes the unassigned literals in two steps. Initially, the reset signal PCHB sets all column-wise matching lines (ML) to VDD. Then, the clause analyzer activates the search line (SL) of unsatisfied clauses to identify their unassigned literals: if the literal xi appears in Cj while its negation is 0, variable Xi is unassigned, causing the related ML to discharge to GND.
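The cascaded PEC behaviour can be mimicked in a few lines: a one-hot position code is passed along the row and shifted right once for every PE whose SW detects an unassigned literal, so the final position of the '1' encodes whether the clause has 0, 1, or more unassigned literals. The 3b width and the saturation at the last position are assumptions made for this sketch.

```python
def pec_row(sw_bits):
    """Return the unassigned-literal count (saturated) for one clause row."""
    pc = [1, 0, 0]                    # first PEC in the row starts with a one-hot '1'
    for sw in sw_bits:
        if sw:                        # unassigned literal detected in this PE
            pc = [0] + pc[:-1]        # right-shift the position code
            if pc == [0, 0, 0]:
                pc = [0, 0, 1]        # saturate: two or more unassigned literals
    return pc.index(1)

# e.g. a clause whose SW pattern flags two unassigned literals
print(pec_row([0, 1, 0, 0, 1]))       # -> 2
```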

Figure 37.5.4 shows the incremental updater along with the assignment storage and backtracking mechanism. The incremental updater consists of an 8×100b register file (RF), an updater, and an assignment mask (AM). The primary function of the incremental updater is to store non-propagated literals (assigned in heuristic or minor mode) and their unit-propagation results. The clause analyzer detects clause conflicts, units, heuristics, and minors in sequence (checking the next type only if none of the previous types exist). When the clause analyzer broadcasts a heuristic or minor flag, the AM selects an available '0' discharged in BW-IDX and passes it to the updater, which performs an XOR operation with the current assignment to generate a new one that includes the latest non-propagated literal. The updater then uploads the current assignment to the RF and changes the RF's address. If a unit flag is broadcast, the incremental updater merges all unit-propagation results activated in BW-IDX with the current assignment and detects variable conflicts (any AXi=AXBi=1). When SKADI detects variable or clause conflicts, it starts a two-step backtracking process. First, a depth-calculating loop searches for the latest unsearched non-propagated literal and determines the backtracking depth. Second, if the depth falls within the maximum backtracking depth, the updater retrieves the previous assignment from the RF based on the backtracking depth and replaces the searched literal in the retrieved assignment with its negation; otherwise, if the depth exceeds the maximum depth, SKADI restarts with the remaining non-propagated literals.

Figure 37.5.5 shows the measured results on 1000 hard 3-SAT problems from the SATLIB uf50-218 and uuf50-218 sets (clause-to-variable ratio of 4.36). Fabricated in 28nm CMOS, the dual-path macro delivers an energy efficiency of 47.8 to 24.7TOPS/W and an area efficiency of 117.9 to 471.1GOPS/mm² with a 0.65-to-0.9V supply. The system-level energy and area efficiencies are 24.8 to 12.9TOPS/W and 56.8 to 227.1GOPS/mm², respectively. Operating at 200MHz and 0.9V, SKADI achieves average solution times of 17.1µs for SAT cases and 42.1µs for UNSAT cases while consuming 3.39mW. Among state-of-the-art ASIC K-SAT solvers, this work is able to assert the satisfiability of F(x) with 100% solvability. To demonstrate efficiency, its energy consumption and solving time are compared with prior art. Note that complete SAT algorithms are substantially more complex than incomplete ones, leading to significantly increased energy consumption and extended solving times. For example, to solve the same SAT problems (uf50-218), DPLL [9] consumes 147× more energy and requires 39× longer solving time than WalkSAT [10] (WS) on a CPU (AMD Ryzen 5 4500U, 2.38GHz), as shown in Fig. 37.5.5 (right). Yet, the proposed complete solver achieves a faster solving time (0.9×) with comparable energy consumption (2.8×) versus a previous K-SAT solver [8] on the same SAT problems (uf50-218). This demonstrates the solving speed and energy efficiency of the proposed design. A comprehensive performance summary and comparison are presented in Fig. 37.5.6. In summary, this work implements a complete K-SAT solver with 100% solvability, featuring a short solution time (17.1µs for uf50-218) and high density (0.20mm²). Demonstrated with the uf50-218 and uuf50-218 test sets from SATLIB, this work shows potential in applications that require complete analysis to determine the satisfiability of F(x).

Acknowledgement: This work was supported by the Joint Funds of the National Natural Science Foundation of China (No. U20A20204). The corresponding authors are Xiyuan Tang and Yuan Wang.

Figure 37.5.1: K-SAT basic concept along with solver comparisons, and the challenges of applying complete analysis in an ASIC.
Figure 37.5.2: The top-level architecture, the macro architecture, and the overall working flow of SKADI.
Figure 37.5.3: The detailed PE schematic and its timing diagram.
Figure 37.5.4: The architecture of the incremental updater, the related backtrack/update algorithm, and timing diagram (an example case involving only Bank0 and Bank1).
Figure 37.5.5: Measured solution time, area & energy efficiency, power, shmoo plot, and the performance comparison with prior art and a CPU.
Figure 37.5.6: Comparison with state-of-the-art ASIC K-SAT solvers.

Figure 37.5.7: The die micrograph and performance summary of SKADI.

References:
[1] J. P. Marques-Silva and K. A. Sakallah, "GRASP: A Search Algorithm for Propositional Satisfiability," IEEE Trans. on Computers, vol. 48, no. 5, pp. 506-521, 1999.
[2] R. E. Bryant, "Graph-Based Algorithms for Boolean Function Manipulation," IEEE Trans. on Computers, vol. 100, no. 8, pp. 677-691, 1986.
[3] J. R. Burch, E. M. Clarke, K. L. McMillan, D. L. Dill, and L.-J. Hwang, "Symbolic Model Checking: 10^20 States and Beyond," Information and Computation, vol. 98, no. 2, pp. 142-170, 1992.
[4] M. Chang, X. Yin, Z. Toroczkai, X. Hu, and A. Raychowdhury, "An Analog Clock-Free Compute Fabric Based on Continuous-Time Dynamical System for Solving Combinatorial Optimization Problems," IEEE CICC, 2022.
[5] Y. Su, T. T.-H. Kim, and B. Kim, "A Reconfigurable Ising Machine for Boolean Satisfiability Problems Featuring Many-Body Spin Interactions," IEEE CICC, 2023.
[6] D. Kim, N. M. Rahman, and S. Mukhopadhyay, "A 32.5mW Mixed-Signal Processing-In-Memory-Based K-SAT Solver in 65nm CMOS With 74.0% Solvability for 30-Variable 126-Clause 3-SAT Problems," ISSCC, pp. 28-30, 2023.
[7] S. Xie, M. Yang, S. A. Lanham, Y. Wang, M. Wang, S. Oruganti, and J. P. Kulkarni, "Snap-SAT: A One-Shot Energy-Performance-Aware All-Digital Compute-In-Memory Solver for Large-Scale Hard Boolean Satisfiability Problems," ISSCC, pp. 420-421, 2023.
[8] C. Shim, J. Bae, and B. Kim, "VIP-SAT: A Boolean Satisfiability Solver Featuring 512 Variable In-Memory Processing Elements with 98% Solvability for 50-Variables 218-Clauses 3-SAT Problems," ISSCC, pp. 486-488, 2024.
[9] M. Davis, G. Logemann, and D. Loveland, "A Machine Program for Theorem-Proving," Communications of the ACM, vol. 5, no. 7, pp. 394-397, 1962.
[10] B. Selman, H. A. Kautz, B. Cohen et al., "Local Search Strategies for Satisfiability Testing," Cliques, Coloring, and Satisfiability, vol. 26, pp. 521-532, 1993.

37.6 A 22nm 60.81TFLOPS/W Diffusion Accelerator with Bandwidth-Aware Memory Partition and BL-Segmented Compute-in-Memory for Efficient Multi-Task Content Generation
Yiqi Jing1, Jiaqi Zhou1, Yiyang Sun1, Siyuan He1, Peiyu Chen2,3, Ru Huang1, Le Ye1,2, Tianyu Jia1
1Peking University, Beijing, China; 2Advanced Institute of Information Technology of Peking University, Hangzhou, China; 3Nano Core Chip Electronic Technology, Hangzhou, China

Initially applied to image synthesis [1], Diffusion Models (DMs) have rapidly expanded into many content-generation tasks, e.g., 3D scenes [2-3] and video [4], and deliver exceptional performance. Figure 37.6.1 provides an overview of the DM architecture, which typically processes a random noisy input through multiple (i.e., 20-50) denoising steps to generate the desired output content. Each denoising step incorporates a U-Net structure with a down-sampling encoder and an up-sampling decoder, which contains repetitive transformer blocks. To support diverse content generation, multi-view [5] or temporal attention blocks [6] are integrated to enhance 3D-scene or video-frame consistency. Due to the large number of denoising steps, generating a single piece of content incurs significant latency, e.g., 70s for 4 seconds of 6fps video on an A100 GPU. To improve hardware performance and efficiency, compute-in-memory (CIM) accelerators have been developed for transformers [7-9] and DMs [10]. However, for several reasons, it remains challenging to use existing CIM-based accelerators for practical image or multi-task DMs. First, a significant operational intensity (OpI) variation exists across DM layers, leading to dynamic memory bandwidth (BW) requirements. DMs also require excessive data storage, i.e., 10× larger than VAEs and GANs. Second, the impressive CIM macro efficiency is often significantly degraded at the system level due to the diminishing reuse rate for large AI models; moreover, 59% of CIM macro power is consumed by data access rather than computation. Third, emerging DM tasks for 3D or video require additional consistency operations, i.e., multi-view/frame attention, for smooth transitions across viewpoints or frames. Such consistency attention is quite computationally intensive and comprises 31% of the total operations. To address these challenges, this paper presents a digital CIM-based accelerator for multi-task DMs with the following key features: 1) a dynamic BW-aware memory partitioning scheme with dense on-chip eDRAM storage to optimize CIM utilization and reduce external memory access (EMA); 2) a bitline (BL)-segmented CIM cluster with reuse-aware weight reordering to enhance system efficiency; and 3) a hierarchical consistency optimization flow that minimizes frame/pixel-level operations to improve performance. Overall, these innovations enable our chip to achieve a 60.81TFLOPS/W system efficiency, 1.4× better than a prior image DM chip [10], while also showing promising performance for more diverse content-generation tasks.

Figure 37.6.2 shows the overall architecture of our chip. It comprises a CIM acceleration subsystem with 9 CIM clusters (each containing four 24Kb digital-CIM macros), 3Mb of on-chip eDRAM, a multi-frame consistency management (MFCM) unit, a host RISC-V CPU, and peripheral circuits. Direct connections are implemented between CIM clusters to support pipeline parallelism, and crossbar (Xbar) interfaces are used to connect the eDRAM. A reuse-aware weight-update scheduler is incorporated inside the CIM subsystem to improve the data reuse rate for all clusters. The eDRAM is designed as four splittable banks, each with 8×3 32Kb 3T gain-cell arrays. A leak-tracking reference column is added in each array to enhance Vref accuracy and extend the eDRAM retention time (Fig. 37.6.3). A dynamic BW-aware partitioning module is designed with an interconnect coupler to support flexible memory-bank partitioning, adapting to the varying OpI of different DM layers. The MFCM performs the hierarchical consistency optimizations; it contains a progressive view-extension unit for frame-level optimizations, along with a spatial-temporal compress unit and an epipolar-attention sparse unit to reduce pixel-level computations.

Figure 37.6.3 illustrates the architecture and operation of our BW-aware memory partitioning. A two-step partitioning scheme is applied for each runtime subtask, i.e., a tiled matrix workload in the DM layers. First, the computation and memory resource demands of each subtask are evaluated based on its operational intensity and recorded in a resource-aware subtask table. Second, the data arrangement within the eDRAM, i.e., the eDRAM bank-partitioning plan, is determined by the required BW and stored in a memory-partition table. To maximize utilization, a time-multiplexed BW distribution method is adopted to decouple BW from memory capacity; e.g., data for subtask 5 is stored equally across ten eDRAM columns to increase the available peak BW. Compared to a conventional fixed memory allocation (Case 1), which evenly distributes resources to subtasks, our BW-aware partitioning jointly optimizes CIM and BW utilization for subtasks with diverse OpI; e.g., subtasks 3-5 improve capacity and BW utilization by 1.98× and 2.76× in Case 2. Based on the partitioning plan, area-efficient crossbar and interconnect couplers are reconfigured for the proper topology and flow control. The coupler determines source and destination addresses through a loop decoder, while a credit noter monitors eDRAM refreshes and CIM backpressure to ensure reliable data transfer. Overall, the BW-aware memory partitioning improves CIM and BW utilization by 1.27-9.54× and 1.51-10.19× across the DM layers of a Wonder3D model, leading to a total 2.68× performance gain with only 3.7% area overhead.
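A schematic, software-only sketch of the two-step idea follows: derive each subtask's operational intensity, then spread its data over enough eDRAM columns to supply the implied bandwidth. The roofline-style proportional policy, the numbers, and the helper name are illustrative assumptions, not the chip's actual tables.

```python
def partition(subtasks, peak_flops=1.0e12, col_bw=2.0e9, total_cols=40):
    """Toy BW-aware partitioner: more columns for memory-bound subtasks."""
    plan = {}
    for name, (flops, bytes_moved) in subtasks.items():
        opi = flops / bytes_moved                   # step 1: operational intensity (FLOP/byte)
        bw_needed = peak_flops / opi                # bandwidth to keep the CIM arrays busy
        cols = min(total_cols, max(1, round(bw_needed / col_bw)))
        plan[name] = {"OpI": opi, "columns": cols}  # step 2: memory-partition table entry
    return plan

# e.g. a memory-bound attention subtask vs. a compute-bound convolution subtask
print(partition({"attention": (2e9, 4e8), "conv": (8e9, 2e7)}))
```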

Figure 37.6.4 illustrates the implementation details of the CIM cluster and the weight-update scheduler. Each cluster integrates four BL-segmented CIM macros, an aggregator unit, and a local NoC. Each macro contains six 128×32 6T SRAM MAC-arrays, pre/post-processing units, a weight-alignment unit, and I/O buffers. The CIM array is designed as 4×64 weight subarrays, each consisting of 16 SRAM cells with a local stationary unit (LSU), plus a LUT-bypass adder tree. To alleviate costly data access, a 2-stage BL-segmentation technique is incorporated, which segments the BL based on operational addresses at both the MAC-array level and a finer-grained subarray level. This approach reduces the effective BL loading and unnecessary precharging during SRAM access, enhancing CIM efficiency by 27%. The LUT-bypass adder tree leverages a 4b sparse LUT as a multiplier and first-stage adder to reduce high dynamic transitions, and a bypass adder is used as the second-stage adder to skip zeros, resulting in an 11% power reduction. During computation, the weight-update scheduler supports simultaneous computation with our reuse-aware weight reordering, in which a top-k module generates a sparse attention pattern that indicates the weights to be reused in the CIM subarray. The pattern is first reordered row-wise via a reuse-driven activation sorter based on similarity comparisons, forming the CIM's activation sequence. Then, a column-wise reordering based on weight lifespan decides the weight-update sequence using a least-recently-used queue-replacement policy, thereby reducing memory access for attention layers by 29%. Overall, the BL-S CIM macro achieves a 1.31× performance gain and 54% energy savings through the above techniques.
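To make the reuse-aware update policy concrete, here is an illustrative least-recently-used model of the column-wise weight-update decision: weights already resident in a subarray are reused when the reordered sequence touches them again, and the stalest slot is replaced otherwise. The slot count and the access trace are invented for the example, not taken from the chip.

```python
from collections import OrderedDict

def schedule_updates(weight_sequence, slots=4):
    """Count weight fetches under an LRU queue-replacement policy."""
    resident = OrderedDict()              # weight id -> None, ordered by recency
    updates = 0
    for w in weight_sequence:
        if w in resident:
            resident.move_to_end(w)       # reuse: no memory access needed
        else:
            if len(resident) >= slots:
                resident.popitem(last=False)   # evict the least-recently-used weight
            resident[w] = None
            updates += 1                  # weight fetched from eDRAM into the subarray
    return updates

reordered = ["w0", "w1", "w0", "w2", "w1", "w3", "w0", "w4"]
print(schedule_updates(reordered))         # number of weight updates for this trace
```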

Figure 37.6.5 depicts our MFCM scheme for efficient content generation. Conventional multi-view DMs denoise frames from all viewpoints and apply multi-view attention on pixels to ensure consistency, resulting in significant overhead. We develop a hierarchical consistency computation flow with both frame-level and pixel-level optimizations in the following three stages. Stage 1 adopts a progressive view-extension technique, which uses fewer viewpoint frames at the initial timesteps to reduce frame-level computations, e.g., only 2 frames at timestep 5. New frames from different viewpoints are added by duplicating the previous frame once its similarity with the reference frame (calculated by a frame-compare unit) reaches a predefined threshold. Stage 2 reduces pixel-level computation by skipping background and trivial pixels using a spatial-temporal compression unit. The target object is segmented from the background using RGBA values in a salient object detection unit (SODU) based on a spiral search pattern, and unmodified pixels are further skipped in a trivial-pixel sparsity unit (TPSU) by similarity assessment across denoising iterations. Stage 3 further leverages a pixel-level epipolar-attention mechanism to minimize irrelevant inter-frame interactions. The epipolar solver controls a 3×3 MAC array to generate the pixels and epipolar line on each view plane associated with the target light ray. Sparse attention is then performed between the epipolar line and the pixel, e.g., a pixel on the P3 frame and the epipolar line on the P0 frame, to mask out irrelevant regions and reduce computations. These hierarchical consistency optimizations bring a 3.71× speedup and 68% energy saving.

Figure 37.6.6 shows the measurement results of the 22nm CIM processor. Multiple content-generation tasks are evaluated using SOTA DMs, i.e., SD-v1.5, Wonder3D, and SVD, based on a customized hybrid BF16-W4A8 quantization. Compared to a SOTA image DM accelerator [10], our chip achieves a 1.4× performance improvement. Since there is no prior accelerator for 3D or video, we also report the execution time for 3D and video models at high system efficiency. The BW-aware partitioning, BL-S macros, and MFCM together contribute a 13.03× performance and 3.69× system-efficiency improvement, leading to a 49.74-60.81TFLOPS/W system efficiency (1.52× better than [10]). Our CIM macro FoM, which considers both energy and area efficiency, is 1.28× higher than the Booth8 CIM in [10]. The system FoM is 1.44× and 3.82× higher than the SOTA CIM [8] and DM [11] accelerators, illustrating better-optimized CIM computation and BW utilization. Figure 37.6.7 shows the die photo and further specifications.

Acknowledgement: This work was supported in part by NSFC Grant 92164301, Grant 62225401, and Grant U23A6007; the Zhejiang Provincial Key R&D Program under Grant 2021C01035; Grant QYJS-2023-2401-B; and Grant QYJS-2023-2402-B. Corresponding authors: Tianyu Jia and Le Ye.

Figure 37.6.1: Overview of diffusion models for multi-task content generation and the deployment challenges.
Figure 37.6.2: Overall chip architecture.
Figure 37.6.3: Dynamic BW-aware memory partitioning scheme with eDRAM to jointly improve CIM and BW utilization.
Figure 37.6.4: Bitline-segmented (BL-S) CIM macro and weight-update scheduler with reuse-aware reordering update.
Figure 37.6.5: Multi-frame consistency management with hierarchical optimizations for diverse content generation.
Figure 37.6.6: Measurement results and performance comparison table.
Figure 37.6.7: Chip micrograph and specifications.

References:

[1] R. Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models," IEEE CVPR, pp. 10674-10685, 2022.
[2] X.-Y. Zheng et al., "MVD2: Efficient Multiview 3D Reconstruction for Multiview Diffusion," ACM SIGGRAPH, 2024.
[3] X. Long et al., "Wonder3D: Single Image to 3D using Cross-Domain Diffusion," IEEE CVPR, pp. 9970-9980, 2024.
[4] H. Ni et al., "Conditional Image-to-Video Generation with Latent Flow Diffusion Models," IEEE CVPR, pp. 18444-18455, 2023.
[5] Z. Deng et al., "MV-Diffusion: Motion-aware Video Diffusion Model," ACM Multimedia, pp. 7255-7263, 2023.
[6] H. Jeong et al., "VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models," IEEE CVPR, pp. 9212-9221, 2024.
[7] F. Tu et al., "A 28nm 15.59µJ/Token Full-Digital Bitline-Transpose CIM-Based Sparse Transformer Accelerator with Pipeline/Parallel Reconfigurable Modes," ISSCC, pp. 466-468, 2022.
[8] F. Tu et al., "MulTCIM: A 28nm 2.24µJ/Token Attention-Token-Bit Hybrid Sparse Digital CIM-Based Accelerator for Multimodal Transformers," ISSCC, pp. 248-250, 2023.
[9] S. Liu et al., "A 28nm 53.8TOPS/W 8b Sparse Transformer Accelerator with In-Memory Butterfly Zero Skipper for Unstructured-Pruned NN and CIM-Based Local-Attention-Reusable Engine," ISSCC, pp. 250-252, 2023.
[10] R. Guo et al., "A 28nm 74.34TFLOPS/W BF16 Heterogenous CIM-Based Accelerator Exploiting Denoising-Similarity for Diffusion Models," ISSCC, pp. 362-364, 2024.
[11] Y. Qin et al., "A 52.01 TFLOPS/W Diffusion Model Processor with Inter-Time-Step Convolution-Attention-Redundancy Elimination and Bipolar Floating-Point Multiplication," IEEE VLSI Technology and Circuits, 2024.
[12] J. Yue et al., "A 28nm 16.9-300TOPS/W Computing-in-Memory Processor Supporting Floating-Point NN Inference/Training with Intensive-CIM Sparse-Digital Architecture," ISSCC, pp. 252-254, 2023.

37.7 A 28nm 18.1µJ/Acquisition End-to-End GPS Acquisition Accelerator with Energy-Accuracy-Driven Mixed-Radix IFFT and ROM-Assisted Computing
Sangsu Jeong1, Sungjin Park1, Mingoo Seok2, Dongsuk Jeon1
1Seoul National University, Seoul, Korea; 2Columbia University, New York, NY

A global positioning system (GPS) is now essential in many mobile devices, but its extensive energy consumption imposes a major design challenge. The most computationally intensive task in a GPS receiver is acquisition [1]. The intermediate frequency (IF) signal from the RF front end contains signals from multiple satellites at different frequencies due to the Doppler effect (Fig. 37.7.1, top). The acquisition step identifies which satellite is currently visible and also estimates the amount of Doppler frequency shift and time delay. The IF signal R[n] is downconverted and separated into in-phase (I[n]) and quadrature (Q[n]) components through element-wise multiplication with 41 sinusoids generated by a numerically controlled oscillator (NCO). These sinusoids account for the IF with additional Doppler shifts ranging from -10kHz to +10kHz in 500Hz intervals. The signal then undergoes correlation computation (Y[k] = Σ_n S[n]·C[n−k]) with a pseudo-random noise (PRN) code, a unique 1023b code assigned to each satellite. A peak in the results indicates that the target satellite is detected with a specific amount of Doppler shift, and the position of the peak represents the time delay from the satellite to the receiver. Obtaining the correlation with PRN codes requires a large amount of computation, which is conventionally processed in the frequency domain since the correlation can be replaced by element-wise multiplications [2-8]. Recent GPS accelerators [1,9] employed analog computing to enhance the energy efficiency of each operation. However, these designs perform the correlations in the time domain and hence require substantially more operations than the frequency-domain approach, ultimately constraining the overall system efficiency (Fig. 37.7.1, bottom left). In addition, such designs only implement the correlation computation, excluding other critical steps of the acquisition such as Doppler shift cancellation.
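As a reference point for the frequency-domain approach mentioned above, the short NumPy sketch below shows the standard identity it relies on: circular correlation with a PRN code equals the inverse FFT of an element-wise product of spectra. The sequence length, noise level, and code here are toy values rather than the chip's 4096-point configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1024
prn = rng.choice([-1.0, 1.0], size=N)          # stand-in for a satellite's PRN code
delay = 137
s = np.roll(prn, delay) + 0.5 * rng.standard_normal(N)   # received signal: delayed code + noise

# time-domain circular correlation, O(N^2)
y_time = np.array([np.sum(s * np.roll(prn, k)) for k in range(N)])

# frequency-domain equivalent, O(N log N): IFFT(FFT(s) * conj(FFT(prn)))
y_freq = np.fft.ifft(np.fft.fft(s) * np.conj(np.fft.fft(prn))).real

print(np.argmax(y_freq), np.allclose(y_time, y_freq))    # -> 137 True
```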

This paper presents an energy-efficient GPS accelerator that fully supports the entire acquisition process, including Doppler shift correction and processing the PRN codes of all satellites. The design maximizes overall system efficiency by implementing frequency-domain correlation with an energy-accuracy-optimized inverse FFT (IFFT) module. A ROM-assisted computing scheme further enhances energy efficiency while also improving computation accuracy. In addition, the Doppler effect cancellation is also processed in the frequency domain, significantly reducing the required number of FFTs. The overall architecture is shown in Fig. 37.7.2. The IF input signal R[n] consists of 2b real-valued data with a sequence length of 4096. A naïve frequency-domain implementation would first multiply R[n] by each of the 41 sinusoids at 500Hz intervals from the NCO to compensate for Doppler shifts, necessitating 41 FFTs to independently convert the resulting signals into the frequency domain. Our design dramatically reduces this overhead by performing the Doppler shift cancellation in the frequency domain. More specifically, compensating for a (k×1000)Hz Doppler shift can be achieved by simply performing a k-point shift on the FFT results of the raw input R[n]. Similarly, a k-point shift on the FFT results of R[n]·exp(jπn/N) compensates for a (k×1000+500)Hz Doppler shift. Consequently, the accelerator requires only two FFT operations, one each for R[n] and R[n]·exp(jπn/N), using an NCO that generates only one sinusoid to cancel all possible Doppler shifts.
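The bin-shift equivalence can be sanity-checked numerically: mixing the input with a complex exponential of k bins and then taking the FFT gives the same spectrum as taking one FFT of the raw input and circularly shifting it by k bins, so a single FFT covers every integer-bin Doppler hypothesis. This sketch only verifies the integer-bin case; the half-bin (500Hz) hypotheses use the second FFT of R[n]·exp(jπn/N), as described. The toy sizes are not the chip's 4096-point / 500Hz grid.

```python
import numpy as np

N, k = 4096, 7                                   # k bins of Doppler offset
n = np.arange(N)
r = np.exp(2j * np.pi * 0.01 * n) + 0.1 * np.random.default_rng(1).standard_normal(N)

# naive: mix the Doppler hypothesis in the time domain, then FFT
mixed = r * np.exp(-2j * np.pi * k * n / N)
spectrum_naive = np.fft.fft(mixed)

# proposed: FFT once, then shift the bins by k for this hypothesis
spectrum_shifted = np.roll(np.fft.fft(r), -k)

print(np.allclose(spectrum_naive, spectrum_shifted))   # -> True
```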

The design also stores the precomputed FFT results of the PRN codes for the 32 satellites in a ROM to further improve efficiency. After element-wise multiplication in the frequency domain, the results must be converted back to the time domain through an IFFT to obtain the correlations. This IFFT must be performed separately for all 32 satellites and 41 Doppler shifts, requiring 32×41 IFFTs in total and making it dominate the overall energy consumption and latency. Our design samples 1024 points near 0Hz to minimize computational cost without accuracy degradation [7,8]. In addition, the energy efficiency of the IFFT is maximized through energy-accuracy-driven optimization. Conventional FFT/IFFT accelerators often favor radix-2 or radix-4 due to their low design complexity [10-13]; however, these lower radices increase the number of stages in the FFT/IFFT and hence necessitate high-precision computing, as quantization errors accumulate in each stage. Simulations show that the precision required for a 1024-point IFFT to keep detection-accuracy degradation below 5% varies significantly, ranging from 5b for radix-2 to 2b for radix-1024 (Fig. 37.7.2, bottom left). An analysis considering both operation count and precision reveals that 3b precision with a mixed radix of 8, 8, 16 is an energy-optimal design point (Fig. 37.7.2, bottom right).
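A quick back-of-the-envelope check of the radix trade-off: lower radices need more butterfly stages for a 1024-point IFFT, so quantization error accumulates over more steps and forces wider datapaths. The stage counts below are exact; the 3b / mixed-radix-8-8-16 optimum is the paper's simulation result, not derived here.

```python
import math

stage_plans = {"radix-2": [2] * 10, "radix-4": [4] * 5, "mixed 8-8-16": [8, 8, 16]}
for name, radices in stage_plans.items():
    assert math.prod(radices) == 1024           # each plan covers all 1024 points
    print(name, "->", len(radices), "stages")   # 10, 5, and 3 stages respectively
```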

Our design employs a ROM-assisted computing strategy in the radix-8 butterfly unit for efficient processing. The butterfly unit features 128 ROM arrays, each consisting of 22×12 ROM bitcells and corresponding to either the real or imaginary part of the twiddle factors (Fig. 37.7.3, top). The butterfly unit receives 8 real inputs and 8 imaginary inputs in sign-and-magnitude format through an input router that switches between them. When the real (imaginary) inputs are fed into the real (imaginary) ROM arrays, 8 real outputs are generated; conversely, when the real (imaginary) inputs are fed into the imaginary (real) ROM arrays, 8 imaginary outputs are produced. The final radix-16 stage is divided into 4 parts, where each part processes 8 inputs and generates 8 outputs using the radix-8 butterfly unit; the results from the butterfly unit are then combined in subsequent cycles to complete the radix-16 computation. Conventional butterfly units store the twiddle factors in a ROM and later multiply them with the inputs. This approach suffers from accuracy loss in two ways: the twiddle factors are quantized before being stored in the ROM, and the quantization errors are amplified by the subsequent multiplication (Fig. 37.7.3, bottom left). To address this issue, the butterfly unit adopts ROM-assisted multiplication, where the ROM stores precomputed and quantized products of the inputs and full-precision twiddle factors instead of the quantized twiddle factors. Since the input precision is relatively low (1 sign and 2 magnitude bits) in our design, it is feasible to precompute the multiplication with each full-precision twiddle factor and store the quantized results in the ROM for all possible input values. In simulation, this technique reduces the root mean square error (RMSE) by 21.7%, even with a narrower bit-width for representing the output. In addition, energy consumption is decreased by 44.5% due to the precomputation and an adder tree with lower input precision compared to the baseline (Fig. 37.7.3, bottom right).

314、 based on the i nput magni tude,whi l e the other two remai n off.These transi stors l i mi t bi tl i nes(BLs)to ri se onl y to VDD-Vt,thereby reduci ng the BL chargi ng energy(Fi g.37.7.4,top).Thi s techni que reduces the ROM access energy by 23.7%i n si mul ati ons.After prechargi ng BL,the word l

315、 i ne(WL)i s acti vated to di scharge the sense l i ne(SL).I f the i nput i s zero,I Nmag0 i s i mmedi atel y acti vated to set SL to zero.To prevent addi ti onal toggl i ng i n the subsequent ci rcui t duri ng the read,a transmi ssi on gate i s pl aced before SL reaches the next stage.I f the i npu

316、t i s posi ti ve,the retri eved val ue i s sent to the adder tree;i f the i nput i s negati ve,the output i s i nverted.I nstead of usi ng si gn extensi on to combi ne two s compl ement products from ROM arrays,the desi gn converts the ROM outputs i nto a bi as-unsi gned format by si mpl y i nverti

317、ng the MSB(Fi g.37.7.4,bottom).For i nstance,i nverti ng MSB of 4b two s compl ement resul ts i n unsi gned data wi th a bi as of+8.The resul ti ng bi as-unsi gned adder tree removes an adder cel l for si gn extensi on i n each ri ppl e-carry adder(RCA),reduci ng energy by 15.1%i n si mul ati ons.Th
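The bias-unsigned conversion is easy to verify: flipping the MSB of a 4b two's-complement code re-expresses the value as an unsigned number offset by +8, so the adder tree can sum offset values without sign extension and subtract the accumulated bias once at the end. A quick check, with hypothetical helper names:

```python
def twos_complement_4b(value):
    return value & 0xF                      # 4b two's-complement encoding of -8..7

def to_bias_unsigned(code_4b):
    return code_4b ^ 0x8                    # invert the MSB -> (value + 8) as unsigned

for v in (-8, -3, 0, 5, 7):
    assert to_bias_unsigned(twos_complement_4b(v)) == v + 8
print("every 4b code maps to value + 8")
```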

The adder tree employs a 16T full-adder cell, which offers a smaller area and lower power consumption than a commonly used mirror adder cell [14]. However, when multiple 16T full-adders are connected in series, signals may be degraded by the consecutive transmission gates. To address this issue, additional inverters are placed diagonally within the adder tree, regenerating the output signals after passing through at most 2 full adders (i.e., 4 transmission or pass gates). The IFFT module consists of 8 processing elements (PEs), each comprising a ROM-assisted butterfly unit, an input buffer (IBUF), an output buffer (OBUF), and a quantizer (Fig. 37.7.5, top left). A butterfly unit takes inputs from its dedicated IBUF and, after quantization, stores the computation results in its own OBUF. Since the IFFT involves 3 stages, the OBUF data from the first and second stages are read and distributed back to the IBUFs for the next IFFT stage. Both IBUF and OBUF support simultaneous read and write (1R1W) operations to maximize throughput. Typical FFT/IFFT hardware implementations would employ 8 separate memory banks to provide 8 inputs concurrently to the radix-8 butterfly unit, but this results in a large overhead in the read/write peripheral circuits. Instead, our design employs a single bank for both IBUF and OBUF, where the 8 outputs from the butterfly unit are stored in OBUF as a single word. However, the next IFFT stage requires a different data access pattern, and OBUF supports this through a transposed structure with orthogonal read and write operations (Fig. 37.7.5, bottom left). Fabricated in a 28nm CMOS process, the GPS acquisition accelerator occupies a 0.5mm² core area and consumes 18.1 to 29.2µJ/acquisition with a latency of 0.958 to 1.92ms/acquisition, where each acquisition sweeps over all 32 satellites and 41 Doppler frequency shifts. Figure 37.7.6 displays the measurement results, along with examples of correlation results using synthesized IF data with a carrier-to-noise ratio (CNR) of 50dB-Hz. This design achieves 25.8× and 12.3× lower latency and energy consumption than prior designs with analog computing, where those metrics are estimated from the reported values for the correlation computation. Figure 37.7.7 shows the die photograph and chip summary.

Acknowledgement: This work was supported by the National Research Foundation of Korea (NRF-2022R1C1C1006880), the Institute of Information & Communications Technology Planning & Evaluation (IITP-2023-RS-2023-00256081), the IC Design Education Center (IDEC), and COGNISENSE, one of seven centers in JUMP 2.0, an SRC program sponsored by DARPA.

Figure 37.7.1: Challenges of GPS acquisition and the proposed end-to-end GPS acquisition accelerator.
Figure 37.7.2: Overall architecture and energy-accuracy-driven IFFT optimization.
Figure 37.7.3: ROM-assisted butterfly unit for the IFFT.
Figure 37.7.4: ROM array and bias-unsigned adder tree of the ROM-assisted butterfly unit.
Figure 37.7.5: Architecture of the IFFT module and memory access pattern.
Figure 37.7.6: Measurement results and comparisons with prior designs.
Figure 37.7.7: Die photograph and chip summary.

References:
[1] J. Li et al., "CCSA: A 394TOPS/W Mixed-Signal GPS Accelerator with Charge-Based Correlation Computing For Signal Acquisition," ISSCC, pp. 430-432, 2023.
[2] M. S. Braasch and A. J. van Dierendonck, "GPS Receiver Architectures and Measurements," Proceedings of the IEEE, vol. 87, no. 1, pp. 48-64, 1999.
[3] A. Alaqeeli et al., "Real-time Acquisition and Tracking for GPS Receivers," IEEE ISCAS, pp. 500-503, 2003.
[4] K. Wang et al., "A New Algorithm for Fine Acquisition of GPS Carrier Frequency," GPS Solutions, vol. 18, pp. 581-592, 2014.
[5] S. F. Ahamed et al., "Fast Acquisition of GPS Signal Using FFT Decomposition," Procedia Computer Science, vol. 87, pp. 190-197, 2016.
[6] S. F. Ahamed et al., "Fast Acquisition of GPS Signal Using Radix-2 and Radix-4 FFT Algorithms," IEEE International Conf. on Advanced Computing, pp. 674-678, 2016.
[7] V. Patel and P. Shukla, "Faster Methods for GPS Signal Acquisition in Frequency Domain," IEEE International Conf. on Emerging Trends in Networks and Computer Communications, pp. 84-88, 2011.
[8] Y. Zhang et al., "Low Computational Signal Acquisition for GNSS Receivers Using a Resampling Strategy and Variable Circular Correlation Time," Sensors, vol. 18, no. 2, p. 678, 2018.
[9] S. Skrzyniarz et al., "A 36.8 2b-TOPS/W Self-Calibrating GPS Accelerator Implemented Using Analog Calculation in 65nm LP CMOS," ISSCC, pp. 420-422, 2016.
[10] E. E. Swartzlander et al., "A Radix 4 Delay Commutator for
