Session 8 Overview: Digital Techniques for System Adaptation, Power Management and Clocking

DIGITAL CIRCUITS SUBCOMMITTEE

Digital architectures continue to improve energy efficiency through closed-loop hardware-software integration, on-die sensors, and real-time adaptation. The first four papers in this session exemplify power management techniques that improve computing and energy efficiency and enable efficient Dynamic Voltage and Frequency Scaling (DVFS) in computing systems such as CPUs and SoCs. In the second half of this session, papers explore advanced clocking and power management circuits, featuring a command-aware hybrid LDO for HBM, a low-spur low-jitter all-digital output frequency divider, two sensors for monitoring thermal profiles or supply and temperature simultaneously, and an on-cell battery monitoring system.

Session Chair: Heein Yoon, UNIST, Ulsan, South Korea
Session Co-Chair: Benton Calhoun, University of Virginia, Charlottesville, VA

2025 IEEE International Solid-State Circuits Conference
ISSCC 2025 / SESSION 8 / DIGITAL TECHNIQUES FOR SYSTEM ADAPTATION, POWER MANAGEMENT AND CLOCKING
979-8-3315-4101-9/25/$31.00 ©2025 IEEE
ISSCC 2025 / February 18, 2025 / 8:00 AM
DIGEST OF TECHNICAL PAPERS

8:00 AM
8.1 Dynamic Guard-Band Features of the IBM zNext System
Tobias Webel, IBM Systems, Böblingen, Germany
In Paper 8.1, IBM presents dynamic guard-band features for the 5nm 5.5GHz zNext computing system. A combination of on-die sensors, a run-time control loop, and reliability, serviceability and availability features demonstrates savings of 18% total chip power, corresponding to 10% system power savings, enabling higher frequency, higher performance, multiple system configurations and new AI features.

8:25 AM
8.2 Run-Time Power Management System by On-Die Power Sensor with Silicon Machine Learning-Based Calibration in a 3nm Octa-Core CPU
Chien-Yu Lu, MediaTek, Hsinchu, Taiwan
In Paper 8.2, MediaTek researchers present a run-time power management system featuring an on-die power sensor in a 3nm CPU, achieving 94.83% CPU power measurement accuracy with calibration by a silicon-based ML model. On-die current readouts at a 10μs sampling rate limit over-current excursions for power stability in smartphone applications.

8:50 AM
8.3 A Dynamically Reconfigurable Digital-Integrated Voltage-Regulator Fabric for Energy-Efficient DVFS in Multi-Domain SoCs
Julian Arenas, Georgia Institute of Technology, Atlanta, GA
In Paper 8.3, Georgia Institute of Technology presents a dynamically reconfigurable voltage regulator fabric for energy-efficient DVFS in SoCs in a 65nm CMOS process. Compared to conventional static VR architectures, the test chip demonstrates a peak power reduction of up to 45% over single-buck regulation and 25% over two-buck regulation.

9:15 AM
8.4 A 4GHz, 0.69%-Accuracy Voltage-Droop Detector with Multiple Remote Sensing and Under 2-Cycle Detection Latency in 2nm GAAFET
DongHoon Jung, Samsung Electronics, Hwaseong, Korea
In Paper 8.4, Samsung Electronics presents a voltage droop detector with a 4GHz frequency and sub-2-cycle detection latency in a 2nm GAAFET process. Experimental results demonstrate 0.69% accuracy and effective droop detection with a 250ps response time, which is an 8× improvement over previous droop detectors, while reducing area overhead by 15-43%.

10:05 AM
8.5 A Command-Aware Hybrid LDO for Advanced HBM Interfaces with 150μA Quiescent Current and 20pF On-Chip Capacitor Achieving Sub-10mV Voltage Droop in 400ps Settling Time
Jaeho Kim, Seoul National University, Seoul, Korea
In Paper 8.5, Seoul National University presents a command-aware hybrid LDO in a 40nm CMOS process to reduce voltage droops caused by quarter-rate data strobe buffers in HBM by proactively leveraging the known current profile of memory accesses. The measured voltage droop is less than 10mV, while consuming only 150μA of quiescent current with a 20pF output capacitor.

10:30 AM
8.6 A 0.024mm² All-Digital Fractional Output Divider with 257fs Worst-Case Jitter Using Split-DTC-Based Background Calibration
Hao Gao, Southeast University, Nanjing, China
In Paper 8.6, the National University of Defense Technology presents an all-digital fractional output frequency divider in a 28nm CMOS process integrating background calibration for both gain and INL using a split-DTC technique. The proposed design achieves spurious tone levels below -77dBc and jitter smaller than 257fs with a chip area of 0.024mm².

10:55 AM
8.7 A Dual VDD-Temperature Sensor Employing Sensor Fusion with 2.4°C, 9mV (3σ) Inaccuracy in 65nm CMOS
Hikmet Seha Ozturk, Georgia Institute of Technology, Atlanta, GA
In Paper 8.7, Georgia Institute of Technology and Intel present a coupled voltage-temperature sensor employing sensor fusion in 65nm CMOS. The proposed architecture employs pairs of low-cost sub-sensors of only 67μm² instantiated into voltage-scaled digital domains and relies on computational sensor fusion to achieve 2.4°C and 9mV 3σ inaccuracy.

11:20 AM
8.8 Fine-Grained Spatial and Temporal Thermal Profiling of a 16nm CMOS Buck Converter and SoC Load-Current Emulator Using Low-Voltage Micron-Scale Thermal Sensors
Zakir Ahmed, Intel, Hillsboro, OR
In Paper 8.8, Intel presents a 16nm digital thermal sensor based on a current-starved ring oscillator, aiming for fine-grained thermal profiling with high resiliency in noisy environments. A chiplet with a buck IVR integrates an array of 204 sensors, each demonstrating a high accuracy of 0.7°C while dissipating 18μW and occupying a 400μm² area footprint.

11:35 AM
8.9 An On-Cell Monitoring and Balancing System With Near-Field Communications for EV Batteries
Simon Foster, Dukosi, Edinburgh, United Kingdom
In Paper 8.9, Dukosi presents a single-chip solution in 55nm to monitor individual cells in electric vehicle battery packs using near-field contactless monitoring, reducing the wiring overhead of up to 216 battery cells to a single antenna. The chip consumes an average of 550μW per cell at 3.7V during networked monitoring for 0.1Hz measurements and communication.
8.1 Dynamic Guard-Band Features of the IBM zNext System

Tobias Webel¹, Phillip J. Restle², Ramon Bertran², Alper Buyuktosunoglu², Sean M. Carey³, Alejandro Cook¹, Karl Anderson³, Michael Romain³, Thomas Strach¹, Pradeep Bhadravati Parashurama⁴, Aishwarya Tadkase⁴, Rahman Abber Tahir¹, Luke Jenkins³, Kevin Low³, Eberhard Engler¹

¹IBM Systems, Böblingen, Germany
²IBM Research, Yorktown Heights, NY
³IBM Systems, Poughkeepsie, NY
⁴IBM Systems, Bangalore, India

Computer systems strive for higher performance, improved energy efficiency, reliability, fault tolerance, and sustainability. Dynamically optimizing guard bands can help achieve all these goals with small design and chip area cost, using on-chip sensors and targeted investments in test and firmware. Many chips use fixed voltage guard bands at each supported frequency to cover all variation sources threatening correct operation in the field, including VDD power-supply droops from sudden workload changes, temperature excursions, and device aging. Advances in robust error recovery and power-supply droop mitigation techniques have been used separately to reduce required guard bands and save power. In this work, we describe a full system that dynamically optimizes guard bands by simultaneously leveraging both robust droop mitigation and robust error recovery to deliver significant system power savings.

In prior work, Razor enabled robust error recovery using error-detecting latches [1-4]. Then a Voltage Control Loop (VCL) used the recovery rate to adjust VDD, saving power while limiting the performance loss from recoveries. As lower-cost alternatives, small digital circuits were developed to sense droops and other variations [5,6]. These droop sensors are used to control adaptive clocking [6-15] or instruction throttling [16-22] to mitigate droops and prevent errors, allowing a reduction in guard band. Choosing a static voltage that saves power but avoids the performance cost of droop mitigation for all workloads remained difficult, so techniques were developed to control performance loss [23-24]. In [24], a VCL limited the droop-throttling performance loss to less than 0.5%. The previous-generation IBM z16 system [22] uses a Digital Droop Sensor (DDS) with an instruction throttling control loop like that described in [24], but without a VCL, since power-supply redundancy and security requirements made fast voltage changes impossible.

For the IBM zNext, a more sophisticated VCL was developed that achieves power savings similar to those of [24], while keeping performance loss below 0.5% despite voltage changes requiring milliseconds rather than microseconds. In addition, a third control loop was added to react to core recoveries.

Figure 8.1.1 shows the integral components of a system implementing the three control loops across multiple design hierarchies, as well as the flow of information required to enable the loops. IBM Telum II, as shown in Fig. 8.1.7, is a full-stack redesign of its predecessor in a 5nm bulk technology node, with eight cores, operating at 5.5GHz [25]. In each core, seven DDS circuits monitor timing margin every cycle. A DDS contains a programmable delay feeding a latch-tapped delay line with 24 possible values. The DDS output values (Bins) are interpreted by the Droop Mitigation Units (DMU), which initiate core throttling within nanoseconds. A throttle counter tracks how many cycles are throttled to monitor the performance impact of throttling. Counters are read by the Processor Chip Power Management Engine (PME) every 16ms. For security, the PME routes voltage change requests through a Security Micro Controller (SMC) to the external Baseboard Management Controller (BMC). Due to IBM zNext's extensive Reliability, Serviceability and Availability (RAS) features, a core contains fine-grained error state detection, including timing violation checking. If a core error state occurs, the Core Recovery Unit (CRU) of the core restores the checkpointed core state, restarts execution, and reports such events to millicode. Key elements of the Processor Drawer are the Processor Power Controllers (PPC), Point of Load (POL) cards, and the BMC. Through UART, FSI, and I2C interfaces, the Processor Chip's PME can change DDS delays and request changes to the Processor Drawer's VDD supply voltages.

Figure 8.1.2 shows the three coupled control loops, realized using both hardware and firmware. The Timing Protection Control Loop, implemented in hardware, uses DDS throttling to mitigate VDD droops. This is achieved by using the DDS to monitor timing margin at the circuit level, initiating core throttling when any DDS Bin in the core drops below 11 (the throttling threshold). The Performance Protection Control Loop, or VCL, monitors the DDS-throttling rate and uses firmware to control VDD. High throttling rates cause VDD to be increased, while low throttling rates cause VDD to be decreased. DDS throttling can cause noticeable performance loss before voltage increments are executed. This effect is mitigated by using larger voltage increments in response to higher throttling, but this solution does incur the risk of oscillations, as discussed below. The third loop, the Guard-band Optimization control Loop (GOL), responds to core recovery events, incrementing the DDS delays to increase timing guard band.

Figure 8.1.3a shows that DDS throttling mitigates droops to protect timing over a wide range of VDD setpoints. For these measurements, the VCL was disabled, and VDD was reduced from right to left. The resulting maximum and minimum DDS Bins are shown with DDS throttling enabled and disabled. A workload with a square-wave power pattern was run on all cores for 5s, then execution was stopped for 5s synchronously on all cores to induce voltage droops. The DDS throttles the core when any DDS Bin value drops below 11. For the initial high VDD values, the minimum Bin remains above 10, so there is no throttling.
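The throttling rule described above can be illustrated with a minimal Python sketch. The threshold of 11 and the 24 Bin values come from the text; the 0-23 numbering of the Bins, the class name `CoreThrottleState`, and the software counter are assumptions made purely for illustration, since the real decision is taken in hardware by the Droop Mitigation Units.

```python
from dataclasses import dataclass

THROTTLE_THRESHOLD = 11  # from the text: throttle when any Bin drops below 11
NUM_BINS = 24            # from the text: 24 possible DDS values (0..23 is assumed)


@dataclass
class CoreThrottleState:
    """Hypothetical software model of one core's throttle counter."""
    throttled_cycles: int = 0
    total_cycles: int = 0

    def step(self, dds_bins):
        """Consume one cycle of DDS Bin readings; return True if the core throttles."""
        if any(not 0 <= b < NUM_BINS for b in dds_bins):
            raise ValueError("DDS Bin out of the assumed 0..23 range")
        # The core throttles as soon as the worst (lowest) Bin is below the threshold.
        throttle = min(dds_bins) < THROTTLE_THRESHOLD
        self.total_cycles += 1
        self.throttled_cycles += int(throttle)
        return throttle

    def throttle_rate(self):
        """Fraction of observed cycles that were throttled."""
        if self.total_cycles == 0:
            return 0.0
        return self.throttled_cycles / self.total_cycles
```

A minimum Bin of exactly 11 does not throttle in this sketch, which matches the statement that no throttling occurs while the minimum Bin stays above 10.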
With lower VDD settings, throttling mitigates the droops if enabled, and therefore the minimum Bin remains higher. Figure 8.1.3b shows VDD traces corresponding to VDD bias = -10%. Figure 8.1.3c zooms into the initial steepest droop. Without DDS throttling, the average core voltage drops 8% in the high-power workload phase. With DDS throttling enabled, this droop is reduced to 4%, showing 50% mitigation.

To minimize performance loss when workload power increases suddenly, the VCL is asymmetric, increasing VDD more quickly than it lowers VDD. If a high rate of throttling is observed in any 16ms period, VDD is increased by up to 4% in a single request. VDD decreases are always 0.5% and require low throttling for at least 64ms before a request is made. A steady-state workload was constructed with such small noise that a single 0.5% VDD reduction can result in more than 30% DDS throttling. The VCL then increases VDD by 4%, resulting in oscillations that cost both power and performance (Fig. 8.1.4). To prevent this, a second DDS hardware counter was configured to count the percentage of cycles in which the DDS Bin was just above the throttling threshold, for use by the VCL. For very low-noise workloads this percentage is high, indicating that even a small reduction in VDD might cause excessive throttling. The right side of Fig. 8.1.4 shows that the warning counter prevents oscillations, saving power while limiting the impact on performance to less than 0.5%.

Extensive characterization of droop mitigation and the VCL was performed for a wide variety of workloads, with examples shown in Fig. 8.1.5a. Voltage, power, throttling statistics, and DDS Bin statistics from every cycle were recorded every second to provide data on power savings, VDD setpoint, and on-chip VDD. Figure 8.1.5b shows VDD power savings for 16 chips and 10 workloads. Low-utilization and low-noise workloads allow more power savings from the VCL. The VCL relies on the robustness of droop mitigation to protect timing at all possible VDD setpoints and workload patterns. To verify this robustness, a surge test was designed in which the VCL is frozen at the lowest chip idle VDD, then a max-power stressmark is started repeatedly. No workload was found to stress timing more than the short surge test.

Figure 8.1.6 illustrates the results of a GOL control loop test. While timing-related errors in the field are extremely rare, IBM zNext has extensive RAS features including robust core recovery. The GOL control loop considers a recovery event a warning that timing protection may be insufficient. In response, the programmable delays of each DDS circuit on the chip are incremented to increase guard band. With default DDS delays, recovery events may never happen. Thus, to exercise the GOL control loop, a test was developed that exercises all three tightly coupled control loops. With all control loops enabled, a high-coverage noisy workload is run. Then all DDS programmable delays on the chip are decremented every 6 minutes, reducing timing protection as shown in Fig. 8.1.6. Each DDS delay decrement reduces the DDS timing protection and the DDS throttling rate. As a result, the VCL responds by reducing VDD. Eventually a recovery event occurs, triggering the GOL to increment DDS delays for 6 minutes. In the field, this GOL guard-band increment would be retained longer or permanently.

The DDS is designed using gates and devices common to many critical paths. Chip characterization confirms that the DDS sensitivity to VDD, temperature and accelerated End-Of-Life (EOL) stressing is like that of the processor's critical paths. This correlation, although imperfect, is used to improve system reliability. For example, if threshold voltages increase slightly towards EOL, both critical paths and the DDS will slow. The control loops will increase VDD slightly at EOL to maintain timing margin and performance. If an anomalous path or event results in a recovery event, the GOL increases guard band.

In summary, three coupled control loops on the IBM zNext provide timing protection (by DDS droop-throttling), performance protection (by dynamically controlling VDD with the VCL), and recovery-event protection (by controlling timing margin with the GOL). These dynamic guard-band features save an average of 18% of the total chip power, resulting in a 10% total system power savings. These power savings enabled significantly higher-performance system configurations with more cores at higher frequency, more system memory, and the addition of new AI features without violating the system power constraints.

Figure 8.1.1: Diagram of the IBM zNext system drawer: A Drawer consists of two BMCs, two PPCs, five POL Cards and eight Processor Chips. Each Processor Chip contains eight cores.

Figure 8.1.2: Three-way dynamic guard-band control loop design: the DDS throttling to protect Timing (top), the Voltage Control Loop (VCL) to protect Performance (center) and the Guard-band Optimization control Loop (GOL) to provide additional assurance.

Figure 8.1.3: a) VCL disabled: Minimum and maximum DDS bin and throttling % as VDD is reduced from a base value with DDS throttling enabled and disabled. b, c) VCL disabled: VDD traces from DDS throttling at VDD bias = -10%.

Figure 8.1.4: An oscillation problem was fixed using an additional counter input to the VCL. The solution avoids unnecessary voltage decrease requests when voltage is already close to the throttling point.

Figure 8.1.5: a) Time series of key parameters when the VCL is enabled (straight line) and disabled (dotted line). b) Chip VDD power savings for 16 chips and 10 workloads vs. workload power.

Figure 8.1.6: To demonstrate the GOL in action, DDS delays were ramped down with VCL enabled until core recoveries occur, triggering the GOL to increase DDS delays.
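The asymmetric VCL policy described in the body of Paper 8.1 can be sketched in Python as follows. This is a rough illustration under stated assumptions, not IBM's firmware. The 16ms period, the 4% maximum increase, the fixed 0.5% decrease, and the 64ms quiet time come from the text. The thresholds `HIGH_THROTTLE`, `LOW_THROTTLE` and `WARN_LIMIT`, the proportional `GAIN`, and the class name `VoltageControlLoop` are hypothetical values chosen for the example.

```python
PERIOD_MS = 16          # text: counters are read every 16 ms
MAX_STEP_UP = 4.0       # text: VDD is increased by up to 4% in a single request
STEP_DOWN = 0.5         # text: VDD decreases are always 0.5%
QUIET_MS_REQUIRED = 64  # text: low throttling for at least 64 ms before a decrease

# Hypothetical tuning values, not taken from the paper.
HIGH_THROTTLE = 0.005   # throttle rate above which VDD is raised
LOW_THROTTLE = 0.001    # throttle rate at or below which a period counts as quiet
WARN_LIMIT = 0.20       # warning-counter rate that blocks a decrease
GAIN = 100.0            # percent VDD per unit throttle rate


class VoltageControlLoop:
    """Illustrative model: returns a VDD change request (in percent) per period."""

    def __init__(self):
        self.quiet_ms = 0

    def update(self, throttle_rate, warning_rate):
        # Fast path: larger increments for higher throttling, capped at 4%.
        if throttle_rate > HIGH_THROTTLE:
            self.quiet_ms = 0
            return min(MAX_STEP_UP, max(STEP_DOWN, GAIN * throttle_rate))
        # Slow path: decrease only after enough quiet time, and only if the
        # warning counter says the Bins are not sitting just above the threshold.
        if throttle_rate <= LOW_THROTTLE and warning_rate < WARN_LIMIT:
            self.quiet_ms += PERIOD_MS
            if self.quiet_ms >= QUIET_MS_REQUIRED:
                self.quiet_ms = 0
                return -STEP_DOWN
        else:
            self.quiet_ms = 0
        return 0.0
```

With these assumed values, a low-noise workload that keeps the warning rate high never triggers a decrease, which is the behavior the warning counter is meant to produce in Fig. 8.1.4.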
ISSCC 2025 PAPER CONTINUATIONS AND REFERENCES

Figure 8.1.7: IBM Telum II floorplan: The Processor Chip (left side) consists of eight cores, three on the left side of the chip and five cores on the right side of the chip. The core contains seven Digital Droop Sensors (DDS), five Droop Mitigation Units (DMU), and one Core Recovery Unit (CRU).

References:
[1] Dan Ernst et al., “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation,” IEEE/ACM MICRO, pp. 7-18, 2003.
[2] Shidhartha Das et al., “A Self-Tuning DVS Processor Using Delay-Error Detection and Correction,” IEEE JSSC, vol. 41, no. 4, pp. 792-804, 2006.
[3] Shidhartha Das et al., “Razor II: In Situ Error Detection and Correction for PVT and SER Tolerance,” IEEE JSSC, vol. 44, no. 1, pp. 32-48, 2009.
[4] Jaydeep P. Kulkarni et al., “A 409GOPS/W Adaptive and Resilient Domino Register File in 22nm Tri-Gate CMOS Featuring In-Situ Timing Margin And Error Detection For Tolerance To Within-Die Variation, Voltage Droop, Temperature And Aging,” ISSCC, pp. 82-83, 2015.
[5] Alan J. Drake et al., “A Distributed Critical-Path Timing Monitor for a 65nm High-Performance Microprocessor,” ISSCC, pp. 398-399, 2007.
[6] James Tschanz et al., “Tunable Replica Circuits and Adaptive Voltage-Frequency Techniques for Dynamic Voltage, Temperature, and Aging Variation Tolerance,” IEEE Symp. VLSI Circuits, pp. 112-113, 2009.
[7] Keng L. Wong et al., “Enhancing Microprocessor Immunity to Power Supply Noise with Clock-Data Compensation,” IEEE JSSC, vol. 41, no. 4, pp. 749-758, 2006.
[8] Michael S. Floyd et al., “Introducing the Adaptive Energy Management Features of the POWER7 Chip,” IEEE Micro, vol. 31, no. 2, pp. 60-75, 2011.
[9] Charles Lefurgy et al., “Active Management of Timing Guardband to Save Energy in POWER7,” IEEE/ACM MICRO, 2011.
[10] Charles Lefurgy et al., “Active Guardband Management in POWER7+ to Save Energy and Maintain Reliability,” IEEE Micro, vol. 33, no. 4, pp. 35-45, 2013.
[11] Keith A. Bowman et al., “A 22 nm All-Digital Dynamically Adaptive Clock Distribution for Supply Voltage Droop Tolerance,” IEEE JSSC, vol. 48, no. 4, pp. 907-916, April 2013.
[12] Aaron Grenat et al., “Adaptive Clocking System for Improved Power Efficiency in a 28nm x86-64 Microprocessor,” ISSCC, pp. 106-107, 2014.
[13] Keith A. Bowman et al., “A 16 nm All-Digital Auto-Calibrating Adaptive Clock Distribution for Supply Voltage Droop Tolerance Across a Wide Operating Range,” IEEE JSSC, vol. 51, no. 1, pp. 8-17, 2016.
[14] Michael S. Floyd et al., “Adaptive Clocking In The POWER9 Processor for Voltage Droop Protection,” ISSCC, pp. 444-445, 2017.
[15] Tetsutaro Hashimoto et al., “An Adaptive-Clocking-Control Circuit With 7.5% Frequency Gain for SPARC Processors,” IEEE JSSC, vol. 53, no. 4, pp. 1028-1037, 2018.
[16] Michael S. Floyd et al., “Voltage Droop Reduction Using Throttling Controlled by Timing Margin Feedback,” IEEE Symp. VLSI Circuits, pp. 96-97, 2012.
[17] Pierce I-Jen Chuang et al., “Power Supply Noise in a 22nm z13 Microprocessor,” ISSCC, pp. 438-439, 2017.
[18] Christos Vezyrtzis et al., “Droop Mitigation Using Critical-Path Sensors and an On-Chip Distributed Power Supply Estimation Engine in the z14 Enterprise Processor,” ISSCC, pp. 300-302, 2018.
[19] Christopher J. Berry et al., “IBM z14: Processor Characterization and Power Management for High-Reliability Mainframe Systems,” IEEE JSSC, vol. 54, no. 1, pp. 121-132, 2019.
[20] Tobias Webel et al., “Proactive Power Management in IBM z15,” IBM J. Res. Dev., vol. 64, no. 5/6, pp. 15:1-15:12, 2020.
[21] Christopher J. Berry et al., “IBM z15: A 12-Core 5.2GHz Microprocessor,” ISSCC, pp. 54-56, 2020.
[22] Ofer Geva et al., “IBM Telum: a 16-Core 5+GHz DCM,” ISSCC, pp. 46-48, 2022.
[23] Xun Sun et al., “A Combined All-Digital PLL-Buck Slack Regulation System with Autonomous CCM/DCM Transition Control and 82% Average Voltage-Margin Reduction In A 0.6-To-1.0V Cortex-M0 Processor,” ISSCC, pp. 302-304, 2018.
[24] Brian T. Vanderpool et al., “Deterministic Frequency Boost and Voltage Enhancements on the POWER10 Processor,” ISSCC, pp. 218-219, 2022.
[25] Chris Berry, “IBM Telum II Processor and IBM Spyre Accelerator Chip for AI,” Hot Chips, 2024.
8.2 Run-Time Power Management System by On-Die Power Sensor with Silicon Machine Learning-Based Calibration in a 3nm Octa-Core CPU

Chien-Yu Lu¹, Bo-Jr Huang¹, Min-Chieh Chen¹, Alfred Tsai¹, Eric Jia-Wei Fang¹, Yuju Cho¹, Rex Che-Yuan Liu¹, Ericbill Wang¹, You-Ming Tsao¹, Hugh Mair², Shih-Arn Hwang¹

¹MediaTek, Hsinchu, Taiwan
²MediaTek, Austin, TX

For flagship smartphones, the gaming experience has become one of the essential demands, requiring a high frame rate (frames per second, fps), display quality and durability. These demands drive the evolution of the CPU to enhance performance and computing efficiency. The CPU offers multiple cores to support various computing tasks in an energy-efficient manner, relying on the scheduler [1] to allocate tasks to each core based on available power budgets. Power budgets are calculated at intervals of each computing thread. As games run at higher fps, the intervals of power budgeting become shorter. The average interval of power budgeting in two top gaming applications is shown in Fig. 8.2.1. With fps rising to 90, the average interval is shortened to less than 50μs. Meanwhile, the CPU clock speeds increase to enhance performance for gaming, resulting in ever-higher current. Peak current grows faster than average current when high CPU performance is needed. The growing peak current causes over-current excursions that exceed the PMIC tolerance, resulting in system instability that degrades CPU performance. Therefore, a hardware-based power sensing system is essential to enable fast power/current management, for the purpose of achieving gaming performance, computing efficiency and sustained power delivery for the CPU.

In this work, a run-time power/current management system built around an on-die power sensor (ODPS) is demonstrated in an octa-core CPU. The CPU is part of a mobile 5G flagship SoC, implemented in 3nm FinFET technology. The ODPS calculates the power using a current sensor (I-sensor) and a synchronous voltage sensor (V-sensor). To enhance power sensing accuracy, a silicon-based AI/Machine Learning (ML) model is proposed for calibration. The ODPS achieves 94.83% CPU power accuracy for the Geekbench v6 CPU benchmark, accounting for run-time voltage and temperature fluctuations. The ODPS occupies 0.0313% of the whole CPU area. Moreover, the ODPS is capable of monitoring CPU on-time current in 10μs sampling windows. When applied to over-current control (OCC), it limits the over-current excursion on the PMIC to ensure power stability without observable performance loss. When running a game, the ODPS profiles per-core CPU power consumption, providing dynamic run-time conditions to optimize power for a better user experience and battery life.

Figure 8.2.2 depicts the CPU comprising eight single-thread cores in the SoC, supporting the ARMv9.2 instruction set. The cores are categorized into three performance groups. The first gear includes an ARM Cortex-X925 High-Performance (HP) core clocked at up to 3.63GHz. The second gear features three ARM Cortex-X4 Balance-Performance (BP) cores at 3.3GHz, and the third gear comprises four 2.4GHz Cortex-A720 cores for High-Efficiency (HE) computing.

To implement a power sensor, current and voltage readings are essential for power calculation. As shown in Fig. 8.2.3, the fully digital I-sensor [2] is implemented in each core to capture the run-time current. The I-sensor senses the current by measuring the voltage drop as delta speed counts (ΔRO) using two identical ring oscillators (ROSC). It converts ΔRO to current using a calibrated slope (counts vs. voltage) and Direct-Current Resistance (DCR). Since both slope and DCR vary with voltage and temperature, a V-sensor and a Temperature sensor (T-sensor) are necessary for accurate current calculation. The V-sensor provides voltage readings used not only for I-sensor hardware calibration, but also for power computation.

The designated ROSC of the I-sensor (ROSCI) can detect μV-level voltage drops, and its speed counts (RO) are calibrated against a trim voltage step with small nonlinearity error, making ROSCI a reliable indicator of supply voltage. In Fig. 8.2.3, the proposed V-sensor synchronously records the RO of ROSCI to quantify the run-time voltage after each sampling window. It uses the RO look-up table of calibrated counts vs. voltage to find the closest ROn and ROn+1 and their corresponding trim step voltages Vn and Vn+1. The run-time voltage (V in Fig. 8.2.3) is then interpolated between Vn and Vn+1 based on where the RO reading falls between ROn and ROn+1. The voltage V is used for both I-sensor slope/DCR calibration and ODPS power calculation. Leveraging the same trimming voltage step and calibration table as the I-sensor, the V-sensor achieves 96.8% accuracy over a range of 65% to 120% of the CPU supply voltage (VDD).

Using the T-sensor, a two-point temperature characterization is done [2]. Then, the I-sensor outputs ΔRO and converts it to current I. As shown in Fig. 8.2.3, this current conversion follows the formula of the Counts-to-Ampere Multiplier (MCA) based on ΔRO.
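The V-sensor look-up and interpolation step described above can be sketched as follows. The text does not specify the interpolation order, so linear interpolation between the two bracketing table entries is an assumption, as are the function name `interpolate_voltage` and the table layout (a list of `(RO count, trim step voltage)` pairs sorted by strictly increasing count).

```python
from bisect import bisect_right


def interpolate_voltage(ro, lut):
    """Estimate the run-time voltage from a ROSC count.

    lut: list of (ro_count, trim_step_voltage) pairs, at least two entries,
         sorted by strictly increasing ro_count (hypothetical layout).
    """
    counts = [count for count, _ in lut]
    if not counts[0] <= ro <= counts[-1]:
        raise ValueError("RO reading outside the calibrated table")
    # Index of the upper bracketing entry (ROn+1, Vn+1); clamp at the last row.
    upper = min(bisect_right(counts, ro), len(lut) - 1)
    ro_n, v_n = lut[upper - 1]
    ro_n1, v_n1 = lut[upper]
    # Linear interpolation between (ROn, Vn) and (ROn+1, Vn+1): an assumption.
    return v_n + (v_n1 - v_n) * (ro - ro_n) / (ro_n1 - ro_n)
```

For example, with a made-up table `[(1000, 0.60), (1200, 0.70), (1400, 0.80)]`, a reading of 1100 counts falls halfway between the first two entries and maps to roughly 0.65V.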
The MCA is the multiplicative inverse of slope × DCR, which is a polynomial function of different process-sensor (P-sensor, i.e., ROSCI and ROSCDCR*) counts with varying coefficients (i.e., CI and CDCR*) that align with the hardware design at specified step voltages (i.e., Vn and Vn-1). Both P-sensor counts and coefficients can vary with voltage to determine the MCA. In addition to the original terms of the MCA, several potential P-sensor terms are considered to enhance accuracy by identifying more process-related correlations beyond the designated model. For example, wire variations between true and virtual VDD in Fig. 8.2.3 can affect the I-sensor accuracy, and N/PMOS imbalances may also distort the ROSC counts at the worst corner. Thus, the MCA model is extended to include extra P-sensor coefficient terms for regression based on AI/ML calibration, as illustrated in Fig. 8.2.4.

The AI/ML calibration model leverages extensive silicon data to refine the MCA for accuracy convergence. Taking the current measured by the oscilloscope as the golden reference, the current I converted by the MCA and the ΔRO (RO) readings from the I-sensor (P-sensors) are trained across various scenarios, including generic benchmarks such as Geekbench v6, Antutu, and SpecInt2k6. This training features good tolerance against voltage and temperature fluctuations. AI algorithms are used to select the most relevant RO terms (i.e., P-sensors), thereby simplifying the coefficient terms of the model. The simplified MCA model then uses an ML method to efficiently train the coefficients for each selected RO term. Finally, the selected RO terms and their corresponding coefficients are precisely determined for each trim step voltage in a one-shot effort. Figure 8.2.4 shows a per-die calibration example for the MCA of the HP core, whose VDD ranges from 60% to 120%.

Following this, the ODPS calculates the power from the obtained voltage and current (ΔRO × MCA). Figure 8.2.5 shows the ODPS-measured power and voltage when the CPU runs the Geekbench v6/SpecInt2k6 benchmarks. In the HP core, the run-time power measured by the ODPS agrees well (R² = 0.9602/0.952) with the oscilloscope measurements, achieving low inaccuracy within 5.17%/4.76%. The ODPS also profiles the run-time power of the three BP and four HE cores. It shows on-time core-to-core power variations of up to 25%/16.7% at 102.5%/95% VDD between the BP cores, and 28% at 98.75% VDD between the HE cores.

Figure 8.2.6 illustrates the architecture of the ODPS-oriented power management system. When the CPU runs games, the ODPS provides power readings for profiling and budgeting in each core. These power readings can be transmitted in sequence from the CPU to the SoC/OS via a microprocessor (μP). Smartphone applications can ascertain the per-core breakdown of run-time power. This benefits power optimization, yielding a better user experience and battery life. Additionally, the ODPS monitors run-time peak current in 10μs sampling windows. Once the current exceeds a threshold, the OCC throttles the CPU clock speed to lower the current for protection. Silicon validation shows that the OCC limits the current according to the PMIC tolerance and thus stabilizes CPU power delivery.

The die photo in Fig. 8.2.7 shows the multicore CPU with an area of 21.42mm². This work demonstrates a run-time power management system in a 3nm CPU complex. The ODPS includes the fully digital I-sensor and V-sensor operating over the full range of the CPU supply voltages. The V-sensor leverages the same calibration procedure as the I-sensor to save characterization effort, achieving 96.8% voltage accuracy. For the I-sensor, an AI/ML calibration model is adopted to improve accuracy. Combined with the V-sensor, the ODPS achieves a power sensing accuracy of 94.83% while occupying 0.0313% of the total CPU area. In addition to offering per-core power for budgeting, the ODPS also monitors and controls the peak current in 10μs sampling windows as the CPU power/current management system, which stabilizes CPU power without observable performance penalties. The sequential CPU run-time power can be displayed and profiled for each core in smartphone applications, benefiting power optimization for a better user experience and battery life.

Acknowledgement: The authors thank Tran Trong Hieu, Ling Yeh, Yu-Chen Chou, TC Tsai, Ted Hu, Joyce Chu, Ian-Y Chen, Guan-wun Chen, Michael Jiang, Cheng-Yuh Wu, Sean Fan, Yuwen Tsai, and Alex Chiou, MediaTek, Hsinchu, Taiwan, for their support of this work.

Figure 8.2.1: Ever-faster power-budgeting calculation and over-current excursion demands in 3nm gaming-centric CPU.

Figure 8.2.2: The 3nm CPU block diagram and architecture in SoC.

Figure 8.2.3: The proposed fully digital ODPS/V-sensor with calibration, and silicon measurements of the V-sensor.

Figure 8.2.4: Silicon-based AI/ML model for I-sensor calibration.

Figure 8.2.5: The ODPS measurement results of the HP core, BP cores and HE cores.

Figure 8.2.6: The ODPS-oriented power/current management/monitor system for the CPU with silicon results and on-phone demonstration.
108、CC 2025 PAPER CONTINUATIONS AND REFERENCES979-8-3315-4101-9/25/$31.00 2025 IEEEFigure 8.2.7:The 3nm CPU die-photo and ODPS/digital V-sensor feature summary.Re fe r e nce s:1 B.-J.Huang et al.,“A 5G Mobile Gaming-Centric SoC with High-Perf ormance Thermal Management in 4nm FinFET,”I SSCC,pp.40-42,202
109、3.2 C.-Y.Lu et al.,“A Fully Digital Current Sensor Of f ering Per-Core Runtime Power f or System Budgeting in a 4nm-Plus Octa-Core CPU,”I SSCC,pp.260-262,2024.3 V.K.Kalyanam et al.,“Thread-Level Power Management f or a Current and Temperature Limiting System in a 7nm Hexagon Processor,”I SSCC,pp.494
110、-495,2021.4 M.Kar et al.,“A Sof tware-Assisted Peak Current Regulation Scheme to Improve Power-Limited Inf erence Perf ormance in a 5nm AI SoC,”I SSCC,pp.254-256,2024.5 M.Cochet et al.,“BlitzCoin:Fully Decentralized Hardware Power Management f or Accelerator-Rich SoCs,”I EEE/ACM I SCA,pp.801-817,202
111、4.162 2025 IEEE International Solid-State Circuits Conf erenceISSCC 2025/SESSION 8/DIGITAL TECHNIQUES FOR SYSTEM ADAPTATION,POWER MANAGEMENT AND CLOCKING/8.3979-8-3315-4101-9/25/$31.00 2025 IEEE8.3 A Dynamically Reconfigurable Digital-Integrated Voltage-Regulator Fabric for Energy-Efficient DVFS in
112、Multi-Domain SoCs Julian Arenas,Chi-Hsiang Huang,Kevin Patino-Sosa,Jung-Jin Park,Hikmet Seha Ozturk,Visvesh Sathe Georgia Institute of Technology,Atlanta,GA Energy ef ficient SoCs f requently employ a hierarchical Voltage Regulator(VR)architecture(Fig.8.3.1),using Buck regulators to each drive a gro
113、up of Low Dropout(LDO)regulated domains.The domain with the highest Vdd requirement within each group the critical domain sets the voltage of its corresponding Buck rail.Such an architecture combines the complementary advantages of Buck and LDO regulators to enable ef ficient fine-grained Dynamic Vo
114、ltage and Frequency Scaling(DVFS).Buck regulators 1-7 of f er ef ficient supply-voltage(Vdd)scaling,but they require bulky passive components which hinder integration and restrict their use to only limited numbers.In contrast,LDOs 8-13 are highly compact but incur significant losses due to the resis
115、tive voltage drop across their header devices.Domain grouping is perf ormed at design time to maximize aggregate VR ef ficiency across the SoC,maximizing dynamic voltage scaling by the Buck regulators,while minimizing dropout losses associated with LDO Vdd conversion.Some domain groups may require c
116、ombining multiple Buck converters to f orm multi-phase regulators f or improved transient response and current compliance,ensuring regulator current limits exceed worst-case domain-group loading.SoC domains continuously undergo Power State(PState)transitions,with changing Vdd and current loading.How
117、ever,domain groups and Buck topologies in conventional VR architectures are immutable at run time.Restricting run-time reconfiguration precludes regrouping domains more ef ficiently and disaggregating multi-phase regulators(when possible)to provide more Buck regulator rails f or enhanced Vdd scaling
efficiency. This paper describes DRIVR, a Dynamically Reconfigurable Integrated Voltage Regulator fabric that provides run-time configurability of Buck converter “tiles” and their connectivity to load domains. The resulting partial power crossbar allows for efficient domain grouping. As required, Buck tiles can either be combined into multi-phase regulators or disaggregated to provide additional Buck regulators for more efficient Vdd scaling. Test-chip measurements of a 2-tile 8-domain SoC across various PState transitions indicate that, compared to its static counterpart, DRIVR achieves an 18% and 25% reduction in average and worst-case power draw from the system, respectively.

Figure 8.3.2 presents a simplified overview of the DRIVR test-chip architecture. The power gating header, used in conventional designs, is replaced with two switches (M0 and M1) for selectable Buck-rail connectivity. Using
a digital LDO that repurposes the headers mitigates this overhead by regulating Vdd through drive-strength control. Switches M0 and M1 are configured to connect the domain to either rail, or to both rails to realize a two-phase Buck connection. The harsh trade-off between current efficiency and transient response associated with synchronous digital LDOs is overcome using a frequency-locked adaptive clocking architecture [2,5]. A Vdd-powered Tunable Replica Oscillator (TRO) clocks the digital domain, relying on LDO control of Vdd to frequency-lock to a reference. The LDO Controller performs loop compensation and short-circuit-free domain hand-off between Buck rails during reconfiguration. DRIVR relies on the power management firmware that initiates PState transitions to also provide optimal and current-compliant tile and domain configurations.

Buck regulators using voltage-feedback to power
across-chip domains face two major drawbacks: (1) additional voltage margins are needed to address voltage offsets in comparators and voltage references; (2) chip-wide distribution of Buck-rail voltages incurs PState-dependent voltage gradients in Vb0 and Vb1, which require either additional voltage margins or a complex design using multiple sense points. To overcome these limitations, DRIVR avoids voltage-feedback. Instead, it controls each Buck using error signals and state variables, namely the frequency error (Δf[7:0]) and the header strength (KLDO[7:0]), obtained from all domains that can be connected to its rail. Selector Control [14] is used to identify the critical domain (ncrit) whose signals are selected to control the Buck regulator. For each domain i driven by the Buck regulator, a Feedback Selector module determines the domain's header slack (Kslack_i), the number of unused LDO headers. The domain with the lowest header slack is identified as the critical domain, ncrit. PID compensation is performed by the Buck regulator using the frequency error of the critical domain (Δf_ncrit). The regulator samples this error at a rate 8× lower than the LDO to prevent instability arising from interactions between
the two loops. Negative feedback provided by Kslack_ncrit plays a key role in ensuring that the Buck voltage is at the lowest value needed to meet frequency targets across all its load domains. Non-zero Kslack_ncrit causes the Buck output voltage to reduce. In response, the much faster domain LDOs rapidly increase header strength (reducing header slack) to maintain their respective domain-Vdd values. The Buck continues to lower its rail voltage until Kslack_ncrit converges to 0, indicating that the critical domain has used up any unused headroom. Thus, by actively using feedback from each of its load domains, the proposed Buck regulator architecture avoids margins associated with conventional regulation.

Short-circuit-current-free domain hand-off between Buck rails across PVT variations is critical to DRIVR operation and is performed by the LDO Controller using closed-loop feedback. The
governing rationale behind ensuring proper domain transition from one rail to another is to prevent a short-circuit by ensuring that M0 and M1 are not simultaneously conducting if Vdd exceeds either rail voltage. Figure 8.3.3 illustrates an example of domain hand-off from a low (Tile1) to a high (Tile0) voltage rail. Feedback is provided by a clocked comparator with a user-programmable offset voltage (Voffset). The LDO initially regulates the domain by controlling M1, with M0 disabled. The domain enters a Halt state to avoid load transients before the LDO Controller suspends feedback control to gradually increment KLDO0, turning on M0 and thus increasing Vdd. The reduced source-drain voltage across M1 causes I1 to decline. KLDO0 is incremented, causing Vdd to increase, until the comparator triggers when Vb1 − Vdd, the source-drain voltage across M1, reduces to Voffset. At this instant, I1 is negligible, with domain current provided effectively by I0. KLDO1 is finally transitioned to 0, turning M1 off completely. Next, KLDO0 is loaded into the PI controller to ensure seamless header control takeover, and the controller regulates Vdd according to its new PState. The domain finally exits the Halt state, completing the rail transition.

A 2-tile 8-domain SoC test chip was implemented in 65nm CMOS to quantify the power benefits of DRIVR. The SoC domains consist of one RISC-V processor, two CORDICs, one FFT module, and four programmable dynamic power generators to limit test-chip area. All domains include a synthetic load current to readily evaluate regulator load-step response. The additional header-switch in each domain represents a 5% area overhead to the RISC-V processor domain.

Figure 8.3.4 shows the measured voltage response of the Buck and LDO regulators to a step-up and step-down load current transient in two domains. To demonstrate tile-reconfigurability, the Buck was programmed to operate in a 2-phase configuration. Also shown are measured regulator output waveforms that demonstrate Buck regulator operation during changes in the critical domain. An increase in the target frequency (ftarg) of Load-0 prompts the LDO to increase Vdd0 until no more headers can be enabled, making Load-0 the critical domain. Correcting the frequency error in Load-0 requires Vb0 to increase until Load-0 achieves ftarg. Similarly, an ftarg reduction of a critical Load-0 causes LDO headers to be disabled to lower Vdd0. Meanwhile, as the Buck lowers Vb0 to follow the critical domain, the Load-1 LDO enables more headers to maintain Vdd1 until it has a lower header slack and becomes the critical domain which influences the Buck regulator.

Figure 8.3.5 shows measured waveforms of Vb0, Vb1 and Vdd, demonstrating domain hand-off between Buck0 and Buck1 in both directions. The waveforms reflect the intended sequencing of header switches controlled by the comparator. The absence of a surge in inductor current is indicative of short-circuit-free hand-off.

To quantify DRIVR power benefits, we emulated two baseline hierarchical VR
configurations by disabling domain and tile re-grouping in DRIVR. In the single-Buck VR configuration, worst-case domain operation constrains both Buck tiles to form a single two-phase regulator. In the two-Buck VR configuration, domains are statically assigned to one of two voltage rails. SoC power was measured for 1000 random PStates for DRIVR and both baselines. Each PState constitutes a random assignment of ftarg and synthetic load current for every SoC domain. SoC power was measured under optimal PState-dependent VR configurations for each architecture. The histogram of measured power (Fig. 8.3.6) summarizes our findings. DRIVR was found to provide 18% and 25% reduction in average and peak power, respectively, over its two-Buck counterpart. Compared to single-Buck, DRIVR achieves reductions of 38% and 45% in average and peak power, respectively. Figure 8.3.7 summarizes key DRIVR test-chip metrics
relative to emulated baseline configurations.

As the trend towards chiplet integration intensifies, enhancing the efficiency of scalable VR architectures will become increasingly critical. By allowing run-time reconfigurability of domain and tile grouping, DRIVR maximizes opportunities for energy-efficient Buck regulator scaling to reduce overall system power dissipation.

Acknowledgment: This work was funded by SRC through the Texas Analog Center of Excellence under GRC contract No. 2810.032. The authors would like to thank the following people for their support and feedback during this project: Arindam Mandal (Georgia Institute of Technology), Edevaldo Pereira (NXP Semiconductors), Alvin Loke (Intel Corporation), Stefano Pietri (NXP Semiconductors), Ravi Geetla (NXP Semiconductors), Xin Zhang (IBM), Raveesh Magod (Texas Instruments), Shidhartha Das (AMD), Carlos Tokunaga (Intel Corporation), and Nasser Kurd (Intel Corporation).

Figure 8.3.1: Simplified 3-Buck, 5-domain hierarchical VR configuration across PStates. (Top) Conventional static allocation causes excessive dropout loss. (Bottom) Proposed DRIVR architecture maximizes efficiency through domain and tile reconfigurability.
Figure 8.3.2: Simplified DRIVR architecture. Tiles can be configured for single- and two-phase buck regulation. M0 and M1 offer configurable rail connectivity to adaptive clocked domains. A feedback selector uses LDO state variables for reference-free buck control.
Figure 8.3.3: Domain hand-off from low- to high-voltage rails. Feedback control using a programmable offset comparator is used to sequence M0 and M1 drive strengths to ensure short-circuit-free domain transition across PVT.
Figure 8.3.4: Measured DRIVR transient response under: (Top) a load current step in two domains with tiles configured as a 2-phase buck regulator. (Bottom) a target frequency transition in Load-0 changes the critical domain tracked by the Buck regulator.
Figure 8.3.5: Measured domain hand-off sequence from (Left) Tile1 to Tile0 and (Right) Tile0 to Tile1. The rail voltage of Tile0 (Vb0) exceeds the rail voltage of Tile1 (Vb1). Absence of a Buck1 inductor
current (Iind1) spike demonstrates a short-circuit-free domain transition.
Figure 8.3.6: Measured test-chip total power over 1000 random PState assignments for conventional static configurations (Single-Buck and Two-Buck) and DRIVR. Target frequencies and load current values were randomized independently.
ISSCC 2025 / February 18, 2025 / 8:50 AM
Figure 8.3.7: Die micrograph and summary table. Two static configurations (Single-Buck and Two-Buck) are emulated for comparison with DRIVR.

References:
[1] Y. Zhang et al., “Very High Frequency PWM Buck Converters Using Monolithic GaN Half-Bridge Power Stages With Integrated Gate Drivers,” IEEE Trans. on Power Electronics, vol. 31, no. 11, pp. 7926-7942, 2016.
[2] C.-H. Huang et al., “A Single-Inductor 4-Output SoC with Dynamic Droop Allocation and Adaptive Clocking for Enhanced Performance and Energy Efficiency in 65nm CMOS,” ISSCC, pp. 416-418, 2021.
[3] Y.-H. Kao et al., “A 48V-to-5V Buck Converter with Triple EMI Suppression Circuit Meeting CISPR 25 Automotive Standards,” ISSCC, pp. 164-166, 2024.
[4] H. K. Krishnamurthy et al., “A Digitally Controlled Fully Integrated Voltage Regulator with On-Die Solenoid Inductor with Planar Magnetic Core in 14nm Tri-Gate CMOS,” ISSCC, pp. 336-337, 2017.
[5] X. Sun et al., “An All-Digital Fused PLL-Buck Architecture for 82% Average Vdd-Margin Reduction in a 0.6-to-1.0-V Cortex-M0 Processor,” IEEE JSSC, vol. 54, no. 11, pp. 3215-3225, 2019.
[6] X. Wang, X. Liu and W.-H. Ki, “A Self-Clocked and Variation-Tolerant Unified Voltage-and-Frequency Regulator for In-Order Executed Digital Loads,” IEEE TCAS-I, vol. 70, no. 11, pp. 4627-4640, 2023.
[7] H. Ren et al., “Heterogeneous Power Delivery for Large Chiplet-based Systems using Integrated GaN/Si-Interconnect Fabric with sub-10μm Bond Pitch,” IEDM, 2023.
[8] X. Sun et al., “A 0.6-to-1.1V Computationally Regulated Digital LDO with 2.79-Cycle Mean Settling Time and Autonomous Runtime Gain Tracking in 65nm CMOS,” ISSCC, pp. 230-231, 2019.
[9] D.-H. Jung et al., “A Distributed Digital LDO with Time-Multiplexing Calibration Loop Achieving 40A/mm² Current Density and 1mA-to-6.4A Ultra-Wide Load Range in 5nm FinFET CMOS,” ISSCC, pp. 414-416, 2021.
[10] J. Oh et al., “An Output-Capacitor-Free Synthesizable Digital LDO Using CMP-Triggered Oscillator and Droop Detector,” IEEE JSSC, vol. 58, no. 6, pp. 1769-1781, June 2023.
[11] S. Kundu et al., “A Fully Integrated Digital LDO With Built-In Adaptive Sampling and Active Voltage Positioning Using a Beat-Frequency Quantizer,” IEEE JSSC, vol. 54, no. 1, pp. 109-120, 2019.
[12] D. Kim and M. Seok, “A Fully Integrated Digital Low-Dropout Regulator Based on Event-Driven Explicit Time-Coding Architecture,” IEEE JSSC, vol. 52, no. 11, pp. 3071-3080, 2017.
[13] J. Kim et al., “An Output-Capacitor-Free NMOS Digital LDO Using Gate Driving Strength Modulation and Droop Detector,” IEEE TCAS-I, vol. 70, no. 12, pp. 4975-4985, 2023.
[14] K. Johan Astrom and R. M. Murray, Feedback Systems: An Introduction for Scientists and Engineers, 2008.

8.4 A 4GHz, 0.69%-Accuracy Voltage-Droop Detector with Multiple Remote Sensing and Under 2-Cycle Detection Latency in 2nm GAAFET
DongHoon Jung*, Dongha Lee*, Seki Kim, Susie Kim, Min Young Kang, Takahiro Nomiyama, Dongsu Kim, Jongwoo Lee
Samsung Electronics, Hwaseong, Korea
*Equally Credited Authors (ECAs)

Power management in the cluster or at the per-core level is becoming increasingly crucial in modern SoCs to minimize the power consumption and maximize the performance. Voltage scaling each CPU cluster to its lowest possible voltage with minimal margin is essential to achieve this goal. The main design challenge in lowering the supply voltage is the voltage droop caused by instantaneous changes in load current. Therefore, voltage droop detection techniques have recently gained attention [1-6]. Previous droop detectors can be categorized as digital or analog. Digital droop detectors are mainly based on a TDC and are composed of
standard cells [1,2]. As a result, they are logic friendly, easy to migrate to a different process node, and synthesizable. However, TDC-based indirect droop voltage sensing is vulnerable to PVT variation, and the limited voltage-to-delay conversion gain of the delay cell can restrict the operating frequency. Conversely, analog droop detectors [3-6] are relatively robust to PVT variation and compare the supply voltage directly with a reference voltage. However, the relatively bulky analog blocks and the need for an analog supply voltage limit the integration of analog droop detectors. Additionally, the routing between the sensing point and the main comparator limits the maximum operating frequency. In this paper, a droop detector capable of remote sensing is proposed. A main droop detection controller (DDC) ensures PVT-tolerant and accurate detection, while a compact remote droop detector (RDD) directly senses
voltage droop with a sampling frequency of up to 4GHz.

Figure 8.4.1 shows the system-level integration details of the proposed droop detector. The CPU cluster includes several cores, and power gating switches (PGSs) control the supply of the respective cores. The DDC is located inside the CPU cluster along with other analog blocks, such as a PLL, temperature sensor, etc., while RDDs exist inside the cores. The number of RDDs can easily be scaled according to the number of cores and the desired sensing points in the cluster. The RDDs operate with a supply voltage from the PMIC (VPMIC) and monitor the supply voltage of each core, from VCORE1 to VCORE#. By separating the DDC and RDD, multiple droop sensors with minimal area overhead are achieved, which is critical in a high-cost wafer process. The clock controller adjusts the clock frequency based on the FLAG from the RDDs. To detect high-speed droop caused
by PCB and PKG parasitics, whose resonance frequency is a few hundred MHz [7], the proposed droop detector achieves a detection frequency of up to 4GHz with a response time less than 2 CPU clock cycles. In addition, the separated analog DDC and RDD configuration ensures variation-tolerant operation across a wide range of core voltage.

Figure 8.4.2(a) shows the structure of the proposed droop detector, including the DDC and RDD. The DDC includes a BGR, an R2R DAC, a REF buffer for driving the reference voltage (VBGR) to the R2R DAC, and a VDET buffer for driving the output of the R2R DAC. The BGR generates PVT-tolerant VBGR and the REF buffer drives the R2R DAC after trimming VBGR. The R2R DAC adjusts the target level of VDET_DDC, which defines the target droop detection voltage, based on the VDET_CTRL[n:0] from the SoC controller. The VDET buffer drives VDET_DDC to the RDDs in the core. The distance between the DDC and the RDDs can be several thousand μm depending on core placement in the SoC. The routing resistance of VDET_DDC (RROUTING) and the gate leakage of the RDDs can lead to detection errors. Therefore, the routing between the DDC and the RDDs should be carefully designed and RROUTING is constrained to be below 100Ω. The number of RDDs connected to the DDC can be scaled as needed, as long as the RROUTING requirement is met. The RDD consists of a SCALER, a pre-amplifier (PREAMP), a comparator and a calibration logic. The PREAMP and comparator compare VDET_DDC and VC_SCALE, which is the scaled voltage
of VCORE, and generate a FLAG signal. The SCALER scales down VCORE to the input range of the PREAMP. The scaling ratio can be adjusted based on the VCORE level to support a wide range of VCORE. The calibration logic in the RDD self-calibrates the offset in the PREAMP and comparator.

Figure 8.4.2(b) shows the configuration of the calibration logic and its operational waveform. The self-calibration of the RDD begins when CALSTART is applied. During calibration mode, the RDD operates at one-fourth of the CLK frequency. CAL_DONE shorts both inputs of the PREAMP to VDET_DDC. The shift register then decreases CALN[k:0] to calibrate the negative input of the PREAMP until FLAG becomes HIGH and holds the output. k+1 cycles after the rising edge of CALSTART, DONEN goes high and the positive input calibration begins. Similar to the negative input calibration, CALP[k:0] decreases until FLAG becomes low and calibration completes k+1 cycles after the rising edge of DONEN.

Figure 8.4.3(a) shows the structure of the PREAMP and comparator within the RDD. Since the routing of VDET_DDC could extend over several thousand μm, and a large decoupling capacitor cannot be added inside the RDD, VDET_DDC is susceptible to noise. The use of multiple RDDs also increases the vulnerability of VDET_DDC to noise. The PREAMP is placed before the comparator not only to provide gain but also to prevent clock kickback noise on VDET_DDC and VC_SCALE through MN1 and MN2 in the first stage of the comparator. A double-tail latch-type comparator is used in the proposed RDD. The primary bottleneck in the operating speed of the proposed comparator is signal-slope degradation caused by wire resistance between the first and second stages, which is particularly significant in a 2nm process. To mitigate the effect of wire resistance, MP4 and MP5, duplicates of MP1 and MP2, are added to assist with the pre-charging of OUTN and OUTP in the second stage. Additionally, MN8 to MN9 are added to facilitate pre-discharging of VPCHP and VPCHN. Meanwhile, the SCALER shown in Fig. 8.4.3(b) should maintain a uniform scaling ratio across a wide frequency range to scale VCORE without distortion in a high frequency range. The scaling ratio should be adjustable to support a wide VCORE range. Depending on the VCORE and SEL input, the proposed SCALER adjusts the scaling ratio to prevent VSCALE from becoming too low for the NMOS input of the PREAMP. The
connection of C2 is also modified according to the SEL input to maintain a uniform scaling ratio up to 1GHz, as shown in Fig. 8.4.3(b).

The proposed droop detector is fabricated in a 2nm GAAFET process. Figs. 8.4.4(a) and (b) show the measured waveforms of the proposed droop detector when external AC noise of 1MHz and 300MHz is applied to VCORE. In the result shown in Fig. 8.4.4(a), the target detection voltage (VDET_TARGET) was set to 1V and the clock frequency was 4GHz. The FLAG shows accurate droop detection results when VCORE crosses VDET_TARGET. In Fig. 8.4.4(b), the conditions are the same but the full swing of the FLAG signal could not be measured because of IO interface limitations. However, both the VCORE and FLAG signals exhibit a frequency of 300MHz, which demonstrates that the proposed droop detector successfully detects a 300MHz droop in real time. Figure 8.4.4(c) shows the measured detection voltage (VDET_MEAS) in three operating modes for 75 DUTs. VDET_TARGET and operating frequency change according to the operating mode. As shown in the figure, VDET_MEAS exhibits less than 11mV peak-to-peak variation, which corresponds to -0.56% to 0.69% accuracy, including all three modes at typical SoC operating temperatures above 25°C.

Figure 8.4.5 shows measurement results for the VDET_MEAS difference between RDDs (ΔVDET_MEAS) and the droop detection latency with its test circuitry. First, to measure the effect of the distance between the DDC and the RDDs, the VDET_DDC routing between the DDC and the RDDs was
intentionally modulated as shown in Fig. 8.4.5(a). Four RDDs are implemented in the test vehicle with the routing distances between the DDC and RDDs set to 200μm, 1500μm, 3000μm, and 4500μm, respectively. The corresponding RROUTING values are 8.3Ω, 20.8Ω, 41.8Ω, and 62.6Ω. As shown in Fig. 8.4.5(b), ΔVDET_MEAS shows no significant difference or trend depending on the routing distance. ΔVDET_MEAS was measured in 4GHz clock frequency mode and across five corners at 125°C, where the impact of transistor leakage is highest. The peak-to-peak ΔVDET_MEAS is less than 5mV. Figure 8.4.5(c) shows the droop-latency measurement test circuitry embedded in the RDD, which accurately measures the latency. When ENDROOP is applied, the clock-synchronized ENSYNC turns on the droop injection NMOS, and VC_SCALE is pulled down to ground. The test circuitry captures the timing when the FLAG is generated at the CLK frequency and outputs the
latency as an LTNC[2:0] thermometer code. The measurement result shows that the droop detection is completed within one cycle in high-performance operating mode. Only 3% of DUTs exhibit a two-cycle latency in low-performance mode.

Figure 8.4.6 provides a performance summary of the proposed droop detector and a comparison with previous designs. The proposed structure exhibits the fastest detection frequency and superior accuracy of 0.69% across the entire temperature range. Unlike other analog droop detectors, the proposed structure uses VPMIC as the supply voltage of the RDD, allowing it to be implemented inside CORE, similar to other digital detectors. The measured lowest detection voltage is 0.4V. The proposed droop detector achieved lower power consumption compared to other digital detectors considering operating frequency. Assuming multiple-point sensing, which is common in SoC design, the area of
the proposed detector is 15-43% smaller (based on 4-point sensing) compared to previous analog structures.

Figure 8.4.1: Previous droop detectors and proposed droop detector with remote sensing.
Figure 8.4.2: The structure of (a) the proposed droop detector and (b) the calibration logic of RDD with its operational waveform.
Figure 8.4.3: (a) The structure of the PREAMP and comparator, and (b) the SCALER within the RDD.
Figure 8.4.4: The measured waveform of the proposed droop detector with (a) 1MHz droop and (b) 300MHz droop on VCORE, and (c) the VDET_MEAS at various VDET_TARGET, frequencies and temperatures.
Figure 8.4.5: (a) The layout of the test vehicle to monitor the effect of distance between DDC and RDDs and (b) measured VDET difference between RDDs, and (c) measured droop detection latency.
Figure 8.4.6: Performance summary of the proposed droop detector and comparison with previous designs.
ISSCC 2025 / February 18, 2025 / 9:15 AM
Figure 8.4.7: Chip microphotograph and layout of the RDD and DDC.

References:
[1] S. Bang et al., “An All-Digital, VMAX-Compliant, and Stable Distributed Charge Injection Scheme for Fast Mitigation of Voltage Droop,” IEEE Solid-State Circuits Letters, vol. 2, no. 9, pp. 143-146, 2019.
[2] K. A. Bowman et al., “A 16 nm All-Digital Auto-Calibrating Adaptive Clock Distribution for Supply Voltage Droop Tolerance Across a Wide Operating Range,” IEEE JSSC, vol. 51, no. 1, pp. 8-17, 2016.
[3] Y. Shifman et al., “A Method for Mitigation of Droop Timing Errors Including a 500 MHz Droop Detector and Dual Mode Logic,” IEEE JSSC, vol. 57, no. 2, pp. 596-608, 2022.
[4] H. Mair et al., “A 7nm FinFET 2.5GHz/2.0GHz Dual-Gear Octa-Core CPU Subsystem with Power/Performance Enhancements for a Fully Integrated 5G Smartphone SoC,” ISSCC, pp. 50-52, 2020.
[5] A. Nayak et al., “A 5nm 3.4GHz Tri-Gear ARMv9 CPU Subsystem in a Fully Integrated 5G Flagship Mobile SoC,” ISSCC, pp. 50-52, 2022.
[6] M. S. Floyd et al., “Adaptive clocking in the POWER9 processor for voltage droop protection,” ISSCC, pp. 444-445, 2017.
[7] P. N. Whatmough et al., “Power Integrity Analysis of a 28 nm Dual-Core ARM Cortex-A57 Cluster Using an All-Digital Power Delivery Monitor,” IEEE JSSC, vol. 52, no. 6, pp. 1643-1654, 2017.

8.5 A Command-Aware Hybrid LDO for Advanced HBM Interfaces with 150μA Quiescent Current and 20pF On-Chip Capacitor Achieving Sub-10mV Voltage Droop in 400ps Settling Time
Jaeho Kim*1, Myeongho Han*1, Jooeun Bang1,2, Younghyun Lim3, Jaehyouk Choi1
1Seoul National University, Seoul, Korea
2KAIST, Daejeon, Korea
3Kyung Hee University, Yongin, Korea
*Equally Credited Authors (ECAs)

With the advent of the generative AI era, high-bandwidth memory (HBM) has emerged as an irreplaceable solution that can provide ultra-high memory bandwidth (BW) of more than 1TB/s to AI processors. To enable such a high BW, HBM3E accommodates 16 channels (CHs) with two pseudo CHs (pCHs) each, and HBM4 increases them to 32 CHs to double the BW. Each pCH receives a dedicated differential write data strobe (WDQS) from the host. Then, the quadrature clocks (WDQS/2INs) at half frequency are passed to the WDQS buffer, which generates the output clocks, SOUT,Xs (X = I, Q, IB, QB), to sample the DQ data in parallel (top left of Fig. 8.5.1). To ensure error-free sampling for all DQs, low jitter is necessary in the SOUT,Xs, but it is difficult to achieve due to the power-supply-induced jitter (PSIJ) issues of the WDQS buffer. A sudden toggle of WDQS/2IN in response to the command signal (ACTCMD) from the host causes a surge in the instantaneous load current (IL) of the WDQS buffer (e.g., edge time (TEDGE) of 100ps) [1] (top right of Fig. 8.5.1). This results in a significant voltage droop in the supply voltage (VDD) and a substantial increase in PSIJ [2]. This problem is particularly severe when the VDD is unstable and noisy due to complex power grids and limited capacitor availability.

An LDO is a common solution to the PSIJ problem, but its requirements for the WDQS buffer are challenging. First, the amount of voltage droop (VDROOP) caused by the WDQS toggle must be reduced to a sufficiently small level (e.g., 10mV) in an extremely short settling time (TS). TS should be reduced to less than 400ps when WDQS/2IN toggles at 2.5GHz since the write-preamble time (tWPRE1) before the first DQ data is only 4UI (or 1 cycle of WDQS/2IN) [1]. Second, power-supply rejection (PSR) must be provided across all frequencies to protect the WDQS buffer from external noise. Third, these goals must be achieved with minimal power, area, and capacitors. The right of Fig. 8.5.1 shows
211、 that,to reduce both VDROOP and TS with a small output capacitor(COUT),analog LDOs(ALDOs)2-6 require a wide-BW amplifier and thus a huge quiescent current(IQ),but achieving such an extremely short TS within the given power budget is very dif ficult.As an alternative to conventional digital LDOs 7-9,
212、which require higher clock f requencies f or smaller TS,event-driven LDOs(ED-LDOs)with digital 10-18 or hybrid 19-21 architectures can eliminate this tradeof f by using an adaptive clock that increases f requency only when an event occurs.However,responding immediately to a random event is challengi
213、ng and unreliable,leading to an extended response time(TR)and inevitably increasing VDROOP.This work presents a command-aware hybrid LDO(CA-HLDO)that can minimize VDROOP and TS while using minimal IQ,area,and capacitors.To achieve a breakthrough that can overcome the limitations of existing approach
214、es,we propose a method that exploits the unique property of memory:it always perf orms predefined operations according to commands f rom the host.The proposed CA-HLDO leverages this property to supply the WDQS buf f er with the optimal IL at the precise timing derived f rom ACTCMD and WDQS/2IN,there
215、by minimizing the PSIJ of SOUT,Xs.The bottom lef t of Fig.8.5.1 shows that the CA-HLDO consists of (1)a DLDO that instantaneously injects an optimally adapted digital current(ID)and(2)a low-power ALDO that compensates f or the residual IL with an analog current(IA).Conventional ED-LDOs wait f or unp
216、redictable random events,so they inherently experience long TR.However,the proposed CA-HLDO prepares the optimal ID in advance and provides it immediately when ACTCMD and WDQS/2IN toggle,minimizing both TR and TS(bottom right of Fig.8.5.1).The CA-HLDO also can provide at least 10dB PSR across all f
217、requencies while consuming only 150A IQ and 20pF COUT.Figure 8.5.2 shows the overall architecture of the proposed CA-HLDO.The DLDO has two separate PMOS pass-transistor switch arrays,i.e.,MP,ACT and MP,TOG,that supply two independent IDs,i.e.,ID,ACT and ID,TOG,respectively,which are adjusted continu
218、ously in the background by the ID-adaptor to accurately compensate f or the IL of the WDQS buf f er.The control-signal generator(CSG)provides signals to operate the ID-adaptor and MP,ACT/MP,TOG.The bottom lef t of Fig.8.5.2 shows that the WDQS buf f er consists of a series of sub-components,i.e.,a C
219、ML buf f er,a CML-to-CMOS(C2C)buf f er,and an inverter chain.The profile of IL f rom this WDQS buf f er is shown at the top of Fig.8.5.3,and it can be divided into f our sections(S1S4)based on the sequential operations of the sub-components.In the first section(S1)when ACTCMD is low,the WDQS buf f e
220、r is in standby mode with nearly zero IL.Af ter ACTCMD goes high,in S2,the CML buf f er is activated and begins drawing a static current(IL,ACT)of approximately 2mA f rom the input voltage(VIN),causing the first VDROOP at the LDO output voltage,VOUT.The CSG generates SACT synchronously at the rising
221、 edge of ACTCMD,which instantaneously turns on MP,ACT,generating ID,ACT to compensate f or IL,ACT.A f ew nanoseconds later,the quadrature WDQS/2IN starts toggling at 2.5GHz in S3.This sudden toggle momentarily generates IL,TOG of approximately 12mA through the C2C buf f er(0.2mA)and the f ollowing i
222、nverter chain(11.8mA),causing an even larger VDROOP.The CSG instantly generates STOG,which goes high during every toggle of WDQS/2IN,to turn on MP,TOG,generating ID,TOG to compensate f or IL,TOG.Since STOG is generated by the two output signals of the C2C buf f er(C2CI and C2CQ),the timings of IL,TO
223、G and ID,TOG are synchronized,thereby making TR nearly zero.As soon as WDQS/2IN stops toggling in S4,IL,TOG and ID,TOG cease simultaneously,preventing an overshoot at VOUT.Since most of IL is supplied by the DLDO,the role of IA of the ALDO is limited to compensating f or the residual IL,allowing the
224、 ALDO to have a small BW with minimal IQ.The error amplifier was designed based on a cross-coupled amplifier 6,22 with a local f eedf orward path to improve stability 23.Due to a small COUT,the ALDO has the dominant pole at the gate of its pass transistor(MPA),resulting in an inevitable high-f reque
225、ncy PSR hump(bottom right of Fig.8.5.2).Generally,the DLDO has poor PSR,so a native NMOS transistor with an RC filter(R1 and C1)was used at the bottom of each MP,ACT and MP,TOG to filter the high-f requency supply noise 24-26.The PSRs of the ALDO and the DLDO complement each other,enhancing the low-
226、f requency PSR of the DLDO and the high-f requency PSR of the ALDO.As a result,the overall PSR is maintained below 10dB at all f requencies.The switch MS1 is used in parallel with R1 to pull the gate voltage of the native NMOSs,VRC,up to VIN briefly at each beginning of S2 or S3(SRC),preventing a su
227、dden voltage drop at VRC when IL occurs momentarily.The switch MS2 is used to pull VOUT down to ground briefly af ter the end of each S3(SOVST),ef f ectively mitigating a sudden overshoot at VOUT when IL decreases momentarily.The bottom of Fig.8.5.3 shows the implementation of the ID-adaptor,which i
228、ncludes two adaptation circuits f or ID,ACT and ID,TOG.Each circuit has a dead-zone(DZ)-comparator and an accumulator(ACC).To detect VDROOP,each DZ-comparator samples VOUT at every rising edge(RE)of the sampling clocks,i.e.,SSP,ACT or SSP,TOG,which are generated by delaying SACT or STOG.According to
229、 the outputs of the DZ-comparators(DDZ,A and DDZ,T),the codes of the ACCs(DACT and DTOG)are updated until the sampled VOUT is within VREF 10mV(VREF is the ref erence voltage).The top of Fig.8.5.3 shows that VOUT initially exhibits significant VDROOP since IL cannot be compensated only with an ALDO.H
230、owever,af ter adaptation,IL is compensated accurately by optimal ID,ACT and ID,TOG,resulting in minimal VDROOP and a short TS.For real products,DACT and DTOG are initially adapted during the training time and then updated in the background.The initial adaptation takes less than 200ns.The proposed CA
231、-HLDO occupies 0.014mm2 area in 40nm CMOS(Fig.8.5.7).The top of Fig.8.5.4 depicts the measurement scenario:S2 is maintained f or 4.8ns,f ollowed by three sequential S3 events,where WDQS/2IN toggles f or 20ns each with a 2ns gap of S4.The bottom of Fig.8.5.4 shows the load transient response and the
232、resultant jitter histogram of the output clock(SMEAS)at 2.5GHz.When DACT and DTOG were set to zero,the limited BW of the ALDO caused a massive VDROOP of 185.0mV and a substantial jitterP-P of 52.0ps(lef t).(No AM modulation was observed in SMEAS due to the measurement buf f er using a constant VDD.)
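The code adaptation described for the ID-adaptor (a dead-zone comparator driving an accumulator until the sampled VOUT sits within VREF ± 10mV) can be illustrated with a small behavioral sketch. This is only a toy model under stated assumptions, not the authors' circuit: the droop is modeled as a plain resistive drop, and the reference voltage, effective resistance, LSB current, and code range are hypothetical values chosen for illustration.

```python
def dz_comparator(v_sample, v_ref, dead_zone):
    """Dead-zone comparator: +1 on droop, -1 on overshoot, 0 inside the window."""
    if v_sample < v_ref - dead_zone:
        return 1
    if v_sample > v_ref + dead_zone:
        return -1
    return 0


def adapt_code(i_load, i_lsb, r_eff, v_ref=0.75, dead_zone=0.010,
               max_code=63, max_events=200):
    """Step an accumulator code once per load event until VOUT is in the dead zone.

    Assumptions (hypothetical, not from the paper):
      - droop is modeled as (i_load - i_digital) * r_eff
      - i_digital = code * i_lsb, with the code clamped to [0, max_code]
      - v_ref = 0.75 V is an arbitrary illustrative reference
    """
    code = 0
    history = []
    for _ in range(max_events):
        v_sample = v_ref - (i_load - code * i_lsb) * r_eff
        history.append((code, v_sample))
        step = dz_comparator(v_sample, v_ref, dead_zone)
        if step == 0:
            break  # sampled VOUT is within v_ref +/- dead_zone
        code = min(max(code + step, 0), max_code)
    return code, history


# Example: a 12 mA toggle current, 0.25 mA per code step, 15 ohm effective resistance.
final_code, trace = adapt_code(i_load=12e-3, i_lsb=0.25e-3, r_eff=15.0)
print(final_code, len(trace), round(0.75 - trace[-1][1], 4))
```

In this model each load event moves the code by one step, so the number of events needed grows with the initial error. The sketch makes no attempt to reproduce the measured adaptation time.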
With optimized adaptation of DACT and DTOG, however, ID quickly and accurately compensated for IL, reducing VDROOP and jitterP-P to 8.7mV and 10.8ps, respectively (right). The overshoot in every S4 was reduced sufficiently before the next S3 began. The top of Fig. 8.5.5 shows that the proposed CA-HLDO remains effective even with variations in VIN and VOUT, maintaining VDROOP below 10mV in 400ps. This validates the ID adaptation, which continuously calibrates ID in response to changes in IL. When VOUT was increased to 0.85V, resulting in a larger IL, the maximum VDROOP slightly exceeded 10mV but was reduced to below 10mV within 400ps. The bottom of Fig. 8.5.5 shows that VDROOP and PSIJ remained minimal across a wide range of SMEAS frequencies.

Figure 8.5.6 shows the performance comparison of the proposed CA-HLDO with the prior-art LDOs. Employing the proposed command-aware scheme with the ID-adaptation, it achieved significantly smaller VDROOP and TS even with a shorter TEDGE. Consequently, it achieved the best FoMs while using 150µA IQ (VIN = 1.05V) and 20pF COUT. It also provides a PSR of at least 10dB across all frequencies.

Acknowledgement: This work was supported in part by Samsung Electronics Co. Ltd., in part by the National R&D Program through the NRF funded by the Ministry of Science and ICT under Grant 2020R1A2C2004260, and in part by the IITP through the Artificial Intelligence Semiconductor Support Program to Nurture the Best Talents Grant funded by the MSIT under Grant IITP-2023-RS-2023-00256081.

Figure 8.5.1: PSIJ issues of WDQS buffer in HBM, LDO requirements, and proposed command-aware hybrid LDO (CA-HLDO) (left); IL profile and different magnitudes of voltage droop (VDROOP) and PSIJ according to types of LDOs (right).
Figure 8.5.2: Overall architecture of the proposed CA-HLDO (top); configuration of the WDQS buffer (bottom left); and overall PSR of the proposed CA-HLDO (bottom right).
Figure 8.5.3: IL profile and its compensation by ID & IA before and after ID-adaptation (top); implementation of ID-adaptor (bottom).
Figure 8.5.4: Measured load transient response and jitter histogram of SMEAS at 2.5GHz when DACT and DTOG were set to zero (left); and when DACT and DTOG were optimized (right).
Figure 8.5.5: Measured load transient response and jitter histogram of SMEAS at 2.5GHz when VOUT is 0.65 and 0.85V (top); measured load transient response and jitter histogram of SMEAS with frequencies of 1.5 and 1.0GHz (bottom).
Figure 8.5.6: Performance comparison with prior-art LDOs for small VDROOP.
Figure 8.5.7: Die micrograph.

ISSCC 2025 / February 18, 2025 / 10:05 AM

References:
[1] JESD238A: JEDEC Standard High Bandwidth Memory (HBM3) DRAM Specification, Jan. 2023.
[2] J. Jung et al., “A 4ns Settling Time FVF-Based Fast LDO Using Bandwidth Extension Techniques for HBM3,” IEEE ASSCC, 2023.
[3] J.-G
. Lee et al., “A Fully Integrated, Domino-Like-Buffered Analog LDO Achieving −28dB Worst-Case Power-Supply Rejection Across the Frequency Spectrum from 10Hz to 1GHz with 50pF On-Chip Capacitance,” ISSCC, pp. 456-458, 2024.
[4] G. Cai et al., “A Fully Integrated FVF LDO With Enhanced Full-Spectrum Power Supply Rejection,” IEEE Trans. on Power Electronics, vol. 36, no. 4, pp. 4326-4337, 2021.
[5] M. Huang, H. Feng and Y. Lu, “A Fully Integrated FVF-Based Low-Dropout Regulator With Wide Load Capacitance and Current Ranges,” IEEE Trans. on Power Electronics, vol. 34, no. 12, pp. 11880-11888, 2019.
[6] D. Xu et al., “A 0.96–0.9-V Fully Integrated FVF LDO With Two-Stage Cross-Coupled Error Amplifier,” IEEE TCAS-II, vol. 70, no. 10, pp. 3757-3761, 2023.
[7] X. Sun et al., “A 0.6-to-1.1V Computationally Regulated Digital LDO with 2.79-Cycle Mean Settling Time and Autonomous Runtime Gain Tracking in 65nm CMOS,” ISSCC, pp. 230-232, 2019.
[8] W.-J. Tsou et al., “Digital Low-Dropout Regulator with Anti PVT-Variation Technique for Dynamic Voltage Scaling and Adaptive Voltage Scaling Multicore Processor,” ISSCC, pp. 338-339, 2017.
[9] T. Mahajan et al., “Digitally Controlled Voltage Regulator Using Oscillator-Based ADC with Fast-Transient-Response and Wide Dropout Range in 14nm CMOS,” IEEE CICC, 2017.
[10] Y. Song et al., “A Fast Droop-Recovery Event-Driven Digital LDO With Adaptive Linear/Binary Two-Step Search for Voltage Regulation in Advanced Memory,” IEEE Trans. Power Electronics, vol. 37, no. 2, pp. 1189-1194, 2022.
[11] D. Kim and M. Seok, “A Fully Integrated Digital Low-Dropout Regulator Based on Event-Driven Explicit Time-Coding Architecture,” IEEE JSSC, vol. 52, no. 11, pp. 3071-3080, 2017.
[12] J.-E. Park and D.-K. Jeong, “A Fully Integrated 700mA Event-Driven Digital Low-Dropout Regulator with Residue-Tracking Loop for Fine-Grained Power Management Unit,” IEEE Symp. VLSI Circuits, pp. C231-C232, 2018.
[13] F. Yang and P. K. T. Mok, “A Nanosecond-Transient Fine-Grained Digital LDO With Multi-Step Switching Scheme and Asynchronous Adaptive Pipeline Control,” IEEE JSSC, vol. 52, no. 9, pp. 2463-2474, 2017.
[14] S. Kundu et al., “A Fully Integrated Digital LDO With Built-In Adaptive Sampling and Active Voltage Positioning Using a Beat-Frequency Quantizer,” IEEE JSSC, vol. 54, no. 1, pp. 109-120, 2019.
[15] F. Yang and P. K. T. Mok, “Fast-Transient Asynchronous Digital LDO with Load Regulation Enhancement by Soft Multi-Step Switching and Adaptive Timing Techniques in 65-nm CMOS,” IEEE CICC, 2015.
[16] D. Kim et al., “0.5V-VIN, 165-mA/mm2 Fully-Integrated Digital LDO Based on Event-Driven Self-Triggering Control,” IEEE Symp. VLSI Circuits, pp. C109-C110, 2018.
[17] J. Oh et al., “A Capacitorless External-Clock-Free Fully Synthesizable Digital LDO With Time-Based Load-State Decision and Asynchronous Recovery,” IEEE Trans. Power Electronics, vol. 39, no. 1, pp. 985-997, 2024.
[18] K. Z. Ahmed et al., “A Variation-Adaptive Integrated Computational Digital LDO in 22-nm CMOS With Fast Transient Response,” IEEE JSSC, vol. 55, no. 4, pp. 977-987, 2020.
[19] Y.-H. Hwang et al., “A Residue-Current-Locked Hybrid Low-Dropout Regulator Supporting Ultralow Dropout of Sub-50 mV With Fast Settling Time Below 10 ns,” IEEE JSSC, vol. 57, no. 7, pp. 2236-2249, 2022.
[20] X. Liu et al., “A Universal Modular Hybrid LDO With Fast Load Transient Response and Programmable PSRR in 14-nm CMOS Featuring Dynamic Clamp Strength Tuning,” IEEE JSSC, vol. 56, no. 8, pp. 2402-2415, 2021.
[21] Y. Lin, X. Liu and K. N. Leung, “A Bandwidth-Boosted Hybrid LDO With Spike-To-Time Converter for Near-Threshold Regulation,” IEEE Trans. Power Electronics, vol. 39, no. 3, pp. 3452-3463, 2024.
[22] X. Ma, Y. Lu and Q. Li, “A Fully Integrated LDO With 50-mV Dropout for Power Efficiency Optimization,” IEEE TCAS-II, vol. 67, no. 4, pp. 725-729, 2020.
[23] J.-H. Jang et al., “A 0.5–1-V Time-Voltage Hybrid Domain Dual-Loop Analog LDO With Wide-Bandwidth High PSR in 28 nm,” IEEE JSSC, early access, 2024.
[24] J. M. Ingino and V. R. von Kaenel, “A 4-GHz clock system for a high-performance system-on-a-chip design,” IEEE JSSC, vol. 36, no. 11, pp. 1693-1698, 2001.
[25] V. Gupta and G. A. Rincon-Mora, “A 5mA 0.6µm CMOS Miller-Compensated LDO Regulator with -27dB Worst-Case Power-Supply Rejection Using 60pF of On-Chip Capacitance,” ISSCC, pp. 520-521, 2007.
[26] C. Zhan and W.-H. Ki, “Analysis and Design of Output-Capacitor-Free Low-Dropout Regulators With Low Quiescent Current and High Power Supply Rejection,” IEEE TCAS-I, vol. 61, no. 2, pp. 625-636, 2014.

8.6 A 0.024mm² All-Digital Fractional Output Divider with 257fs Worst-Case Jitter Using Split-DTC-Based Background Calibration
Yan Yu1,2, Deng Luo1, Jianjun Chen1, Yang Guo1, Bin Liang1, Yaqing Chi1, Hanhan Sun1, Jing Xiao2, Hao Gao3, Ming Tao2, Kai Tang2
1National Un
iversity of Defense Technology, Changsha, China
2Hunan University, Changsha, China
3Southeast University, Nanjing, China

In modern system-on-chips (SoCs), compact, low-jitter and low-power on-chip clock generators are essential for delivering multiple output frequencies to various modules, including microprocessors, IO interfaces, and power-management systems. Conventional clock generators using multiple phase-locked loops (PLLs) often face challenges such as high power consumption and increased chip area. In contrast, employing multiple fractional output dividers (FODs) driven by a single PLL provides a more compact and power-efficient alternative. FODs can be implemented using either digital-to-time converters (DTCs) [1-3] or digital-to-phase converters (DPCs) [4]. Among these, DTC-based FODs are commonly adopted due to their superior power efficiency and reduced area requirements. However, DTC-based FODs are prone to variations in process, voltage, and temperature (PVT), and any gain error or integral nonlinearity (INL) in the DTC can significantly degrade jitter performance. To calibrate the gain and INL of the DTC, a reference clock (Fref) is required to detect these errors, as depicted in Fig. 8.6.1 (top left). Previous methods have utilized additional delay-complementary DTCs [1] or pulse-to-voltage converters [2] to generate Fref for DTC gain calibration. However, these techniques often neglect DTC INL, achieving only limited spur performance of approximately -55dBc [1,2]. Although background gain and INL calibration based on an auxiliary PLL can reduce the spurs effectively, the additional PLL increases the chip area and circuit complexity significantly [3].

This paper presents an FOD using split-DTC-based background calibration for both DTC gain and INL, achieving 257fs worst-case integrated jitter and less than −77dBc spurious tones. Owing to the split-DTC-based background calibration, all circuits are designed with standard cells except for the DTCs, making it possible to greatly reduce the area with advanced processes.

Figure 8.6.1 (top right) illustrates the overall architecture of the designed FOD. Unlike traditional FOD designs, this approach features a split DTC divided into two identical parts, DTCA and DTCB. These two DTCs use each other as references, enabling simultaneous background calibration of gain and INL. Each DTC consists of a main DTC (mDTC) and a calibration DTC (cDTC). The cDTC corrects the mDTC INL and the offset between DTCA and DTCB. The mDTC has a 9b resolution, while the cDTC has a 7b resolution, with the adjustment step size of the cDTC being half that of the mDTC. Figure 8.6.1 (bottom) shows the timing diagram of the calibration arrangement of the FOD. The quantization error (qe) of the sigma-delta modulator (SDM) determines the calibration mode of the FOD: gain calibration or INL calibration. Although two DTCs are used for calibration, both DTC outputs remain valid. This configuration reduces output phase noise by averaging the outputs of the two DTCs using an inverter-based phase interpolator (PI), which is implemented with standard cells. The noise performance of the DTC depends on area and power consumption. This approach allows for background calibration of DTC gain and INL with reduced power and area requirements while keeping the phase noise unchanged.

Figure 8.6.2 (top) details the background DTC gain calibration. Two flip-flops (DFF1 and DFF2) introduce a delay of one input clock cycle (Tin) in the FA edge relative to the FB edge. The delay control words (dcws) of DTCA and DTCB are Kdtc·qe and Kdtc·(1+qe), respectively. The bang-bang phase detector (BBPD) measures the phase errors between the outputs of the two DTCs, and the sign-LMS algorithm adjusts the gain calibration factor Kdtc to minimize these phase errors. In the steady state, FAO and FBO are in phase (i.e., Kdtc·qe·TDTC + Tin = Kdtc·(1+qe)·TDTC), where TDTC represents the step size of DTCA and DTCB. This condition reduces to Kdtc·TDTC = Tin, so the phase error caused by qe is perfectly canceled (Kdtc·qe·TDTC = qe·Tin). During gain calibration, dcwB = Kdtc·(1+qe). To reduce the DTC's required delay range, gain calibration is performed when qe is smaller than 0.25, while the INL calibration is performed when qe exceeds this threshold, as shown in the timing diagram in Fig. 8.6.1 (bottom).

The mDTC, with 9b resolution, is implemented using a cascade of 8 identical inverter-based DTC cells (DTCCs), as illustrated in Fig. 8.6.2 (bottom). Different from the traditional binary-weighted DTC, a binary-scaled recombination weighting method is adopted in the 3 most significant bits (MSBs) to introduce 1b redundancy for INL calibration, while the 6 least significant bits (LSBs) remain unchanged. The original MSB segment with 4 DTCCs is divided into two groups: 3 DTCCs and 1 DTCC, with the latter serving as a redundancy. This configuration results in a new weighting ratio of the MSBs of 3, 2, 1, and 1, providing 1b redundancy.

Figure 8.6.3 illustrates the details of DTC INL calibration. It should be noted that the calibration implementations of DTCA and DTCB are completely symmetrical, except for the pseudo-random binary sequence (PRBS). During INL calibration, the dcws of DTCA and DTCB are identical, both equal to Kdtc·qe, so the phase errors detected by the BBPD exclude input-related parts and reflect only uncalibrated delay errors. These delay errors are compensated by the cDTC based on error codes generated by the sign-LMS algorithm, which reflect the delay errors of each DTCC, as illustrated in Fig. 8.6.3 (top right). To ensure the convergence of the sign-LMS algorithm, the two mDTCs need to operate differently under the same dcw. This guarantees that all DTCCs in both mDTCs have the opportunity to compare against the reference or against DTCCs that have already been calibrated to the reference. The redundancy in the weighting allows each dcw[8:6] to provide two possible values for D[9:6], as shown in the table in Fig. 8.6.3. D[9:6] for the two mDTCs are selected randomly from these two possible values using PRBS, ensuring that the mDTCs operate independently under the same dcw. It is noteworthy that the DTCC controlled by D[5:0] for both mDTCs remains uncalibrated, leading to residual errors. To address this issue, another PRBS is employed to randomly swap it with a DTCC controlled by D[6].

The designed FOD was fabricated in a 28nm CMOS process. Figure 8.6.7 presents the die micrograph. Spur and jitter measurements were performed with N = 20, where the outputs are divided by 2 before being analyzed with a signal analyzer. Figure 8.6.4 illustrates spur measurement results for two fractional settings, 0.24414 (random case) and 0.25 (normal case). For these values, spurs decrease from -55.1dBc to -66.9dBc and from -54.5dBc to -73.5dBc, respectively, after enabling gain calibration. When both gain and INL calibration are active, spurs further reduce to -77.6dBc and -81.1dBc. Figure 8.6.5 (bottom left) shows the spur measurement results across a wide range of fractional settings. It indicates that more than 22dBc spur reduction can be achieved, due to the use of split-DTC-based background calibration for both gain and INL.

Figure 8.6.5 (top) displays phase noise and integrated jitter measurement results. At the 0.24414 setting, with only gain calibration enabled, the 10kHz to 20MHz integrated jitter decreases from 1.2ps to 648fs, and the jitter further decreases to 203fs with both gain and INL calibration applied. At the 0.25 setting, the jitter can be even larger with only gain calibration than with no calibration at all. This is because, although gain calibration suppresses spurs to some extent, as shown in Fig. 8.6.4 (bottom), the DTC INL prevents Kdtc from converging to its optimal value, causing it to fluctuate within a narrow range. As a result, the integrated jitter with only gain calibration can sometimes be worse than without any calibration, as also observed in Fig. 8.6.5 (bottom right). When both gain and INL calibrations are applied, the integrated jitter consistently remains within a narrow range (257fs worst case) across all values.

Figure 8.6.6 compares the designed FOD with previous works. The work in [4] is not included in the comparison table, because it uses a DPC-based structure and requires prior information for calibration. Among FODs that employ background calibration for both gain and INL, as reported in [3], the proposed FOD occupies 0.024mm². This area is approximately 3.5× smaller than that of [3]. Thanks to the split-DTC topology, no custom-designed circuits are required except for the DTCs themselves, allowing the power consumption and core area to be further optimized using advanced processes.

Acknowledgement: This work was supported in part by the National Natural Science Foundation of China under Grant 62304258 and 62174180, and in part by Natural Science Foundation of Hunan Province under Grant 2023JJ40176. The corresponding authors a
re Deng Luo and Ming Tao.

Figure 8.6.1: Overall architecture of the FOD based on split DTC and the timing diagram of the calibration.
Figure 8.6.2: Details of the DTC gain calibration and the architecture of the 9b mDTC.
Figure 8.6.3: Details of the DTC INL calibration based on the split DTC and the sign-LMS algorithm.
Figure 8.6.4: Spur measurement results with and without calibration.
Figure 8.6.5: Phase noise measurement results with and without calibration; spur and jitter measurement results vs. fractional setting.
Figure 8.6.6: Performance summary and comparison with prior work.
Figure 8.6.7: Die micrograph.

ISSCC 2025 / February 18, 2025 / 10:30 AM

References:
[1] A. Elkholy et al., “Low-Jitter Multi-Output All-Digital Clock Generator Using DTC-Based Open Loop Fractional Dividers,” IEEE JSSC, vol. 53, no. 6, pp. 1806-1817, 2018.
[2] C.-Y. Lin et al., “An Area-Efficient Low-Jitter Fractional Output Divider with Replica-DTC-Free Background Calibration,” IEEE JSSC, early access, 2024.
[3] Y. Yang et al., “A 10-to-300MHz Fractional Output Divider with -80dBc Worst-Case Fractional Spurs Using Auxiliary-PLL-Based Background 0th/1st/2nd-Order DTC INL Calibration,” ISSCC, pp. 228-229, 2023.
[4] M. Zeinali et al., “Design and Analysis of a Fractional Frequency Synthesizer With 90-fs Jitter and −103-dBc Spurious Tones Using Digital Spur Cancellation,” IEEE JSSC, early access, 2024.
[5] Skyworks Solutions Inc., “Web-Customizable, Any-Frequency, Any-Output Quad Clock Generator/Buffer,” Si5335 Data Sheet, Rev. 1.4, Aug. 2021. Accessed on Sept. 4, 2024.

8.7 A Dual VDD-Temperature Sensor Employing Sensor Fusion with 2.4°C, 9mV (3σ) Inaccuracy in 65nm CMOS
Hikmet Seha Ozturk1, Julian Arenas1, Carlos Tokunaga2, Nasser Kurd2, Visvesh Sathe1
1Georgia Institute of Technology, Atlanta, GA
2Intel, Hillsboro, OR

Trends in
295、 heterogeneous and 3D chiplet integration have escalated thermal and power delivery challenges in modern Systems-in-Package(SiPs)1.Increased power density and degraded thermal conductance intensif y thermal hotspots.Meanwhile,workload-dependent supply-voltage(Vdd)gradients in increasingly constraine
296、d Power Distribution Networks(PDNs)are becoming more pronounced,requiring additional Vdd guardbands.Run-time thermal and power management,required to address these challenges,relies on a system-wide network of temperature sensors(T-Sensor s)and DC supply voltage sensors(Vdd-Se ns or s)with fine spat
297、iotemporal resolution.These sensors need to be compact and readily integrated within their host modules without external voltage or current ref erences.They must also provide measurements of adequate accuracy with suf ficiently low latency.Finally,sensors must operate to specification,despite run-ti
298、me Vdd scaling of SoC domains and Vdd noise.Significant advances have been made in thermal sensors(T-Se ns or s)2-11.BJT-based sensors have traditionally been ubiquitous 2-3,but their design f or large-scale sensing in advanced process nodes has become increasingly challenging 4.While recent work ha
299、s yielded several CMOS-based T-Sensor s 5-10,they are ill-suited to SoC thermal management applications based on either area 6,digital Vdd scalability 7,8,or latency 9.Notably,compact sensors proposed by 8,10 meet most requirements.However,8 requires a voltage regulator(VR)to tolerate moderate host-
300、domain Vdd-scaling at the cost of VR area and external current ref erencing,while the VR-f ree design in 10 limits Vdd scalability.These designs exemplif y the key challenge f acing compact sensor design(Fig.8.7.1):unlike ideal sensors,readings f rom actual Vdd-Se ns or a nd T-Se ns or implementatio
301、ns depend on both Vdd and T(Fig.8.7.1).Theref ore,Tsense(Vsense)does not uniquely determine T(Vdd),due to the influence of Vdd(T)on the sensor.This paper describes a scalable network of compact and distributed sensors that are well-suited f or power and thermal management.Prior design ef f orts have
302、 f ocused on circuit techniques to eliminate sensor sensitivity to extraneous parameters,incurring area,latency,or scalability costs in the process.In contrast,the proposed approach involves judicious design of a pair of compact,non-ideal Vdd,and T sub-sensors relying on computational techniques to
303、synergize them f or enhanced perf ormance.A key enabling observation is that together,the two sub-sensors have adequate inf ormation to accurately determine both Vdd and T.A shared Se ns or F us ion Pr oce s s or (SF P)perf orms joint processing using separate regression models each of which depends
304、 on readings f rom both sub-sensors to extract Vdd and T.We developed a methodology to derive sensor-specific regression models using the limited number of calibration points available at production testing.Our approach is validated on a 65nm CMOS test-chip.The proposed sensor network connects multi
305、ple distributed sensor Hubs to the SF P using a SoC shared bus,with each Hub polling multiple sensor nodes(Fig.8.7.2).Each sensor node contains two oscillator-based sub-sensors.These sub-sensors are chosen because(1)integrating phase over time to measure f requency ef f ectively filters Vdd noise;(2
306、)the ref erence clock required f or this measurement is readily available in each clock domain;and(3)digital oscillator clock signals are easily routed f rom each sensor to the Hub.T-s e ns or relies on an integrated 3T voltage ref erence 12 to produce Vref,biasing the f ooter device of a current-st
307、arved oscillator in weak inversion.The exponential oscillator f requency dependence on MOS threshold voltage(Vth)provides high T-sensitivity,while the current starved architecture limits sensitivity to Vdd.Crucially,non-idealities stemming f rom Vref variation across Vdd and T are absorbed by the re
308、gression model.The Vdd-s e ns or is a CMOS ring oscillator made with low-Vth devices so that its f requency is largely sensitive to Vdd.The Hub selects a clock-pair f rom a target sensor node i and uses counters to measure the pair of sub-sensor f requencies,fvdd,i,and fT,i.The SF P acquires fvdd,i,
309、and fT,i f rom the Hub and loads pre-determined coef ficients f or two sensor-specific regression polynomials,Gvdd,i(fvdd,fT)and GT,i(fvdd,fT),which are third-order(10 coef ficients)and f ourth-order(15 coef ficients),respectively.Both polynomials are evaluated at fvdd,i,and fT,i to produce measurem
310、ents of Vdd and T.Inter-sensor process variation necessitates distinct regression polynomials.Deriving sensor-specific models with the limited number test-points available at production testing is f oundational to the viability of the proposed approach.Perf orming a least-squares(LS)fit to directly
311、identif y polynomial coef ficients f or Gvdd,and GT is impractical,requiring at least 15 calibration points per sensor.Instead,we observe that the number of underlying process parameters(Vth,ef f ective channel-length,mobility,wire resistance)governing sub-sensor circuit operation,and theref ore mod
312、el coef ficients,is significantly smaller.Accordingly,the vector of best-fit polynomial coef ficients across multiple sensors varies within a lower-dimensional subspace spanned by a limited set of principal components(PCs)13.The sensor model can be accurately described by the PCs using f ar f ewer c
313、alibration points while avoiding overfitting.The test inf rastructure perf orms the computations required by this process.Figure 8.7.3 outlines the model construction methodology employed f or T-Sensor (the companion Vdd-Sensor model construction is omitted f or brevity).fT is measured f or 10T10Vdd
points on an initial Training Set of Ntrain parts with the objective of finding the PCs that describe the subspace within which the coefficients of PT(T,Vdd) vary. A least-squares regression is first performed on each part i to derive [a0,i ... a14,i]^T, the coefficients of PT(T,Vdd). The resulting vector-set of PT(T,Vdd) coefficients collectively represents the spectrum of T-Sensor model variations across different process conditions. Principal Component Analysis (PCA) is then performed on this vector-set to identify the 4 most significant PCs (i=1 to 4). Recall that each PC is itself a fourth-degree bivariate polynomial in Vdd and T. Once identified, the PCs are used to construct sensor-specific models for each part during production testing. To do so, a second LS regression finds the 4 PC coefficients needed to describe fT measurements at the two temperature test points (2T×3Vdd) that the part is validated at. However, to simplify run-time computation, the functions PT(T,Vdd) and the similarly derived Pvdd(T,Vdd) are inverted for each sensor. Coefficients of the polynomials GT,i(fT,fvdd) and Gvdd,i(fT,fvdd) that describe the inverse mapping of PT(T,Vdd) and Pvdd(T,Vdd) for sensor i are stored in SFP memory. Thus, presented with fT,i and fvdd,i measurements for a sensor node i, the SFP is tasked with run-time evaluation of GT,i(fT,i,fvdd,i) and Gvdd,i(fT,i,fvdd,i) to produce the measurements of T and Vdd. This approach moves the computational burden from the SFP to the test infrastructure, where it can be readily performed.

A prototype sensor network was constructed within a 65nm SoC to validate the proposed approach (die photograph in Fig. 8.7.7). Eight sensor nodes, each occupying 67μm², were placed across four SoC voltage domains. All characterization and measurement were performed over T (0-100°C) and
Vdd (0.6-1.0V). Due to the limited availability of packaged parts, the Training Set was restricted to 12 randomly chosen sensors, leaving 12 sensors on which to perform accuracy measurements. The measurement outputs produced by the SFP as T and Vdd were swept across the measurement range are compared against bench-equipment measurements to determine sensor error. Measurement time depends heavily on T due to its exponential effect on fT. These times were measured to range up to 30s at 0°C.

Vdd-Sensor measurements are shown in Fig. 8.7.4. The 2D color map shows the measured 6σ error in DC Vdd measurement across both T and Vdd. The worst-case 6σ Vdd error of 9mV is observed at Vdd=0.78V, T=11°C. The error profile of each part along the Vdd-axis for the worst (Vdd,T) pair is shown. The measurement error deviation resulting from variation in T scaling is also shown to be below 6mV. T-Sensor measurements across T
and Vdd are similarly reported using a 2D color map (Fig. 8.7.5). The measured 6σ error was found to be highest at Vdd=0.78V, T=11°C. The error profile per part along the T-axis containing this worst-case point is shown along with the minimal error deviation (10°C) across a few 100μm [5]. Therefore, fine-grained thermal profiling as well as runtime thermal monitoring and management are essential. BJT-based thermal sensors (TS) are used in microprocessors due to their high accuracy of 0.2°C and wide operating range of -55°C to 125°C [6-7]. However, they are larger in size and incompatible with a 1V digital supply, which prevents them from being sprinkled at 100's of μm granularity. Fully digital, ring-oscillator (RO) based TS can operate at a sub-1V digital supply while consuming μWs of power with an area overhead of 100μm², making them suitable candidates for fine-grained spatial distribution. However, traditional RO-based TS are susceptible to both DC and AC supply voltage variations, requiring local voltage regulation.

In this paper, we propose an accurate (0.7°C), area-efficient (20μm×20μm), low-power (18μW), digital and low-voltage friendly (0.7-1V), current-starved ring-oscillator-based thermal sensor (CSRO-TS) that can be distributed across the die for fine-grained thermal profiling of complex compute SoCs and IVR chiplets. An array of 204 sensors is implemented to demonstrate the fine-grained thermal profiling capability that can be used to detect large temperature gradients (15°C) within 100μm. An additional 12 CSRO-TS have been placed across the active area of a high-power-density package-integrated buck IVR chiplet to validate its resiliency in noisy environments and demonstrate its usage for thermal and reliability monitoring of the IVR power train.

Thermal management is one of the major challenges in complex SoC design, requiring the prediction of hotspots across the SoC for numerous workloads, while minimizing latency and maximizing performance. Modeling the locations of hotspots is computationally intensive given the loss density variations across various sub-blocks (cores/graphics/system agents etc.) for numerous workloads. Additionally, considering thermal conductivities across multiple packaging options (including 3D-IC die stacking) makes it even harder and can often create blind spots. A missed thermal hotspot can result in silicon temperature exceeding the safety margin limits, decrease the chip's lifetime and/or cause permanent failure.

To appropriately mimic an SoC running a variety of workloads with different power density profiles, 204 programmable load modules have been distributed over an area of 3.4mm×0.6mm (Fig. 8.8.1), with each unit module including a CSRO-TS placed every 100μm×100μm. On-die hotspots are dynamic and can further be influenced by the presence of IVR dies in a 3D-IC stack as shown in Fig. 8.8.1 (bottom). Measured results from a concentrated workload (3W) across the 204 modules show a discrepancy of up to 12°C when placing sensors in a sparse 600μm grid as opposed to a dense 100μm
grid. The load array is designed with 34 columns and 6 rows, with a global controller managing its selection. Every cell is connected to the common digital bus and activates when selected, feeding temperature data back to the global controller, and provides buffering for neighboring cells' signals when not selected. Each 100μm×100μm module can be programmed to consume up to 0.8W at 1V (maximum power densi