Session 7 Overview: Ultra-High-Speed Wireline
WIRELINE SUBCOMMITTEE

The advent of artificial intelligence (AI) and high-performance computing (HPC) stretches the data-communication bandwidth limit between chips, modules, and data centers, leveraging ultra-high-speed wireline transceivers. The papers in this session describe transmitter and receiver architectures that enable ultra-high-speed operation while consuming low power. The first two papers (7.1 and 7.2) present 212.5Gb/s DSP-based PAM-4 transceivers that compensate channels beyond long reach (40dB channel loss). The third paper proposes an extra-short-reach 212.5Gb/s receiver employing a slice-based CTLE and a phase-interpolator-based clock generator, achieving an energy efficiency of 1.11pJ/b. The fourth paper describes a 112Gb/s DSP-based PAM-4 receiver with an LC-resonator-based CTLE compensating up to 52dB of channel loss. The fifth paper demonstrates a 112Gb/s discrete-multitone receiver with a time-based ADC consuming 353mW. The sixth paper proposes a 106.25Gb/s PAM-4 receiver that achieves an energy efficiency of 2.06pJ/b with the help of a 3-tap FFE and a 1-tap speculative DFE. Papers 7.7 and 7.9 present advances in 50 and 60Gb/s NRZ burst-mode CDR
solutions, respectively, for passive optical network (PON) applications. Paper 7.8 describes a reference-less CDR exploiting a SAR-based frequency-acquisition technique to achieve a 63.64Gb/s/µs acquisition speed. Paper 7.10, the last paper of this session, demonstrates an 8-phase clock generator that covers a wide range of clock speeds, from 8 to 28GHz, with the help of a dual-feedback ring oscillator.

Session Chair: Ben (Hyo Gyuem) Rhew, Samsung Electronics, Hwaseong, Korea
Session Co-Chair: Jay Im, AMD, San Jose, CA

2025 IEEE International Solid-State Circuits Conference
ISSCC 2025 / SESSION 7 / ULTRA-HIGH-SPEED WIRELINE / OVERVIEW
979-8-3315-4101-9/25/$31.00 ©2025 IEEE

1:30 PM
7.1 A 212.5Gb/s DSP-Based PAM-4 Transceiver with 50dB Loss Compensation for Large AI System Interconnects in 4nm FinFET
E-Hung Chen, MediaTek, Hsinchu, Taiwan
In Paper 7.1, MediaTek presents a 212.5Gb/s DSP-based PAM-4 transceiver in which each lane incorporates TX/RX PLLs. Designed in 4nm FinFET technology, the transceiver achieves MLSD BERs of 2.5e-6 and 1e-8 over channels with 50.5dB and 46dB loss, respectively. The RLM and SNDR of the transmitter are measured at 98.5 and 35.5dB, respectively. The receiver achieves an SNDR of 35dB with a 1GHz sinusoidal input. The transceiver consumes 5.3pJ/b per lane at a 212.5Gb/s data rate, of which 1.8pJ/b comes from the DSP.

1:55 PM
7.2 A 2.2pJ/b 212.5Gb/s PAM-4 Transceiver with 46dB Reach in 5nm FinFET
A. Mostafa, Marvell, Santa Clara, CA
In Paper 7.2, Marvell presents a 4-lane 212.5Gb/s PAM-4 transceiver implemented in 5nm FinFET technology. The transceiver achieves a BER of 6e-9 on a 46dB channel and an analog energy efficiency of 2.2pJ/b. The current-mode transmitter shows an SNDR of 36.1dB and an RMS jitter of 73fs at a 212.5Gb/s data rate.

2:20 PM
7.3 A 1.11pJ/b 224Gb/s XSR Receiver with Slice-Based CTLE and PI-Based Clock Generator in 12nm CMOS
Bingyi Ye, Peking University, Beijing, China; East China Normal University, Shanghai, China
In Paper 7.3, Peking University presents a 224Gb/s XSR receiver in 12nm CMOS technology. The receiver features a slice-based CTLE with optimized group delay, achieving a -3dB bandwidth of 59GHz and a maximum step size of 1.9dB. The PI-based clock generator with pre-distortion enables an 8-way interleaved architecture. The receiver demonstrates a BER of 8.1e-7 at 224Gb/s over a 13.6dB-loss channel with only a 3-tap TX FFE, at an efficiency of 1.11pJ/b.

4:25 PM
7.7 A 50Gb/s Burst-Mode NRZ Receiver with 5-Tap FFE, 7-Tap DFE and 15ns Lock Time in 28nm CMOS for Symmetric 50G-PON
Boyang Zhang, Peking University, Beijing, China; Beijing Advanced Innovation Center for Integrated Circuits, Beijing, China
In Paper 7.7, Peking University describes an analog 50Gb/s burst-mode NRZ receiver in 28nm CMOS technology for symmetric 50G-PON. A 5-tap FFE and a 7-tap DFE, including 4 floating taps, are implemented to achieve a sensitivity of -24dBm while consuming less power than DSP-based works. Edge equalization in this work realizes a maximum lock time of 15ns.

4:50 PM
7.8 A Reference-less CDR Using SAR-Based Frequency-Acquisition Technique Achieving 55ns Constant Band-Searching Time and up to 63.64Gb/s/µs Acquisition Speed
Yuhao Zhang, Xidian University, Xi'an, China
In Paper 7.8, Xidian University proposes a reference-less CDR using a SAR-based frequency-acquisition (FA) technique. The CDR achieves a constant band-switching time of 55ns by utilizing the frequency-error polarity information. With the help of the proposed charge pump, the total FA time of [...]

[...] 52dB Loss Compensation in 4nm FinFET
Henry Park, MediaTek, Irvine, CA
In Paper 7.4, MediaTek describes a 112Gb/s DSP-based PAM-4
receiver implemented in 4nm FinFET technology. An LC-resonator-based CTLE is proposed to compensate for 52dB of channel loss, along with an analog data-path latency-reduction technique. A single-lane receiver with CDR consumes 210mW, and the analog transceiver occupies 0.367mm².

7.1 A 212.5Gb/s DSP-Based PAM-4 Transceiver with 50dB Loss Compensation for Large AI System Interconnects in 4nm FinFET

E-Hung Chen*1, Henry Park*2, Mohammed Abdullatif*2, Miguel Gandara2, Ahmed ElShater2, Amr Khashaba2, Shih-Hao Huang1, Tsz-Bin Liu1, Atharav Atharav2, Joonyeong Lee2, Qaiser Nehal2, Mohamed Megahed2, Yusang Chun2, Cheng-En Shieh1, Vidhan Jolly2, SoonWon Kwon2, Hsin-Ta Chien1, Ke-Chung Wu1, Cheng-En Liu1, Peng Yan2, Po-Jui Li1, Chun-Han Chen1, Tzu-Shun Lin1, Pei-Chieh Liu1, Tamer Ali2
1MediaTek, Hsinchu, Taiwan
2MediaTek, Irvine, CA
*Equally Credited Authors (ECAs)

With the escalating demand of data-intensive applications such as artificial intelligence (AI) and high-performance computing (HPC), off-chip data transfer has become a critical bottleneck. Consequently, the data rate of wireline serial links has been propelled beyond 200Gb/s [1-2]. The large and complex packaging used in these links presents significant signal-integrity challenges, primarily substantial channel loss and reflections. To mitigate these issues, sophisticated digital equalization techniques are required. For instance, extended lengths of feed-forward equalization (FFE), floating taps for reflection cancellation, and maximum-likelihood sequence detection (MLSD) within the RX DSP are essential to address severe signal-integrity impairments. Furthermore, the analog frontend must provide additional bandwidth and reduced noise impairments as the baud rate of transmitted and received signals is doubled. Aggressive design innovations in both the analog frontend and the DSP are imperative to achieve competitive performance, area, and power efficiency in a 200Gb/s link. This paper presents a 212.5Gb/s DSP-based PAM-4 transceiver, fabricated in 4nm FinFET technology, achieving 2.5e-6 BER over a channel with loss exceeding 50dB.

The TX block diagram is illustrated in Fig. 7.1.1. The TX DSP transmits 64b parallel PAM-4 MSB and LSB data to a TXFIR digital block, which converts the 2b PAM-4 data into 7b codes with a 6-tap pre-emphasis for the following 64-to-1 data-path serializer. The driver is a CML 7b DAC, where each DAC slice incorporates a 4-to-1 MUX that serializes data to 112GBaud using 4-phase clocks with 1UI pulse width. A 2-tap FFE is integrated into this high-speed MUX to maximize the bandwidth and mitigate output data-dependent jitter. This function is activated during high-speed operation (56GBaud and above), where data are delayed by a small inverter, sampled by the subsequent phase clock, and fed to the MUX output node to perform post-cursor subtraction. The TX output network includes four pairs of inductors that distribute the loading from the termination resistor, driver, ESD diodes, and C4 bump [3]. The L-C network is designed to minimize reflections while preserving a broad bandwidth to optimize the link performance. The size of the ESD diodes is selected to meet a 200V CDM requirement in a 75mm × 75mm package.

The clocking system
consists of a jitter-cleanup PLL (JCPLL), which generates a clean, low-frequency reference clock that is routed to the per-lane TX/RX PLLs. A JCPLL is crucial for larger chips, where the quality of the reference clock cannot be guaranteed, and it allows wide-bandwidth operation of the TX/RX PLLs. A dedicated, compact digital PLL (Fig. 7.1.2) is integrated into each TX and RX lane to facilitate flexible per-lane TX/RX speed programming [4]. A 20-to-28GHz LC VCO with a high-Q inductor and a second-harmonic LC filter are employed to achieve low phase noise. An optimized-threshold time-to-digital converter (TDC) incorporates a time-amplifier circuit to enhance the time-domain gain, thereby reducing the input-referred jitter and offset. Quadrature clocks are generated by an IQ generation circuit at the VCO output. In the Q-path, the required delay is generated by a current-starved inverter, interpolated with a small inverter, with minimal impact on jitter. The bias for the current-starved inverter is generated through a background calibration loop to maintain a precise 90-degree phase shift.

Figure 7.1.3 presents the RX analog and digital block diagram. The RX input network comprises a compact T-coil and a shunt-peaking inductor tuned with 200V-CDM-compliant ESD diodes. The RX termination resistance is normally 50Ω, but it can be programmed to as low as 20Ω if the incoming signal's swing is excessively large. A source-follower buffer with small input loading is placed as the first stage of the RX frontend to minimize signal reflections. The buffer's output signal is conditioned by a CTLE, and the equalized signals are distributed to four T/H buffers. Each T/H buffer is sampled at a 28GS/s rate by four T/H switches, and each interleaved path is subsequently buffered by a flipped source follower (FSF). Each FSF buffer drives eight 7b 875MS/s SAR ADCs, resulting in a total of 128 ADC units in each RX lane.

A major practical concern with integrating 128 ADCs is ensuring a proper analog-to-digital interface that includes a level converter with ESD protection, as the analog and digital domains utilize different supply and ground networks. Each ADC unit has 7b output data and digital control signals for offset and gain calibration. The ADC output interface complexity can be halved by employing 2-to-1 serializers to upconvert the data from 128×875MS/s to 64×1.75GS/s. Additionally, the N-bit offset controls for each ADC comparator are reduced to a 1b interface in the DSP by using delta-sigma modulators instead of an N-bit interface.

The RX DSP equalizer path features a 32-tap FFE for ISI and near-end reflections, a 32-tap floating DFE for far-end reflections, and a 1-tap DFE or an MLSD for the final data decision [2]. The baud-rate CDR processes the outputs from the FFE, ensuring a satisfactory equalized waveform and latency for effective timing recovery (TR). The TR output controls the T/H sampling phase via an RX PLL, and the entire clocking path, from the PLL operating at 28GHz to the 7GHz blocks, is shielded from supply noise by an N-type LDO. The LDO power-supply rejection ratio (PSRR) is primarily determined by the ratio of the parasitic capacitor CGD to the de-coupling capacitor CL at the load device's gate (VG). This presents a direct trade-off between the device area and the level of coupled noise. Without significantly increasing the de-coupling capacitance, the LDO performance can be substantially enhanced by employing a noise-cancelling path to VG [5]. This path includes an inverting amplifier, whose output is connected to VG through a small coupling capacitor
CC. The amplifier gain is programmable to ensure that the noise coupled from the supply VDDH to VG is matched by the supply noise injected through the cancellation path.

Figure 7.1.4 shows the system schematic for one group of T/H. The first-rank T/H system has a source follower with series peaking that drives four P-type T/H switches controlled by CK7G. Each switch pair is equipped with data-feedthrough cancellation capacitors and clock-feedthrough cancellation capacitors (controlled by CK7Gb), along with FSF-based ADC buffers. The FSF-based ADC buffer is heavily loaded with eight SAR ADCs, and its output bandwidth is limited to below 20GHz. When the T/H switch is enabled, the ADC buffer output cannot track high-frequency input signals beyond 20GHz. When the switch is disabled in the hold state, the buffer output settles to the stored input voltage level (VTHP/VTHN). The output swing during the hold state couples back to VTHP/VTHN through the coupling capacitor CC, creating a positive feedback loop with a loop gain of less than 1. With the aid of this positive feedback, the net T/H response shows a peaking gain above 20GHz. The peaking gain is proportional to the size of CC; however, an excessively large CC can lead to an incomplete reset of the previously sampled signal. As depicted in Fig. 7.1.4, the simulated T/H pulse response with a large CC shows a clear memory effect after a 16UI delay. This memory effect becomes more pronounced when the buffer output bandwidth is limited in the slower corners, leading to periodic AC gain dips at integer multiples of 7GHz if the main tap and the 16th post-cursor have opposite polarity. The size of CC is determined by simulation to strike a balance between gain and memory effect. Additionally, the FSF buffer incorporates another cross-coupled capacitor pair, CGD, to decouple the FSF's internal feedback loop from the high-impedance sampling node VTHP/VTHN, as the relatively slow internal loop can also induce a memory effect.

The chip was fabricated in 4nm FinFET technology and features a bump design compatible with both FCBGA and CoWoS packaging. Figure 7.1.5 displays the TX and RX measurement data. The 212.5Gb/s QPRBS13 TX eye diagram after a 9dB-loss channel has RLM and SNDR equal to 98.5 and 35.5dB, respectively. Notably, the SNDR deteriorates to below 30dB when the TX 4-to-1 MUX FFE is deactivated. At the maximum peaking setting, the measured RX analog-frontend frequency response, with channel loss de-embedded, shows around 14dB of gain at 56GHz. The measured RX SNDR is 35dB at low frequency (1GHz), while at high frequency it is constrained by the random jitter in the RX clocking path. The 212.5Gb/s RX JTOL measurement demonstrates a 130mUI
margin across a channel with 38dB loss. Single-lane testing across various trace lengths indicates BER [...]

7.2 A 2.2pJ/b 212.5Gb/s PAM-4 Transceiver with 46dB Reach in 5nm FinFET

A. Mostafa1, A. Hassan1, A. Hsu2, A. K. Singh3, C.-H. Wu4, C.-R. Yang1, D. Prabakaran3, D. Storaska5, D. Zhou1, D. Visani1, E. Hsiao1, F. Chu1, F. Khan1, F. Lu1, G. Cui1, G. Wang1, J. Natonio5, J. Deng1, J. Ding1, J. Guo1, J. Gu1, J. Zang1, L. Jiang1, K.-M. Lu4, M. Hasan1, M. Kelly6, M. H. Kashani2, M. Gambhir1, M. R. Patoju3, M. Singh1, M. Shannon5, M. Yang1, P. Liu1, P. Ramakrishna3, R. Chen4, R. Ho7, S. N. Shahi8, S. Sivakumar1, S. Xu7, X. Yang1, X. Han1, Y.-P. Su4, Z. Adal1, Z. Guo1, Z. Li1, Z. Yu1, Z. Yan1, H. Wang1, K. Chang1
1Marvell, Santa Clara, CA
2Marvell, Toronto, Canada
3Marvell, Bangalore, India
4Marvell, Zhubei, Taiwan
5Marvell, Fishkill, NY
6Marvell, Boston, MA
7Marvell, Burlington, VT
8Marvell, Kanata, Canada

Given the explosive demand for bandwidth to satisfy new AI requirements, 100G/lane connectivity is transitioning to 200G/lane [1-6] to enable 1.6Tb/s Ethernet. Robust operation at this rate requires resolving challenges such as component bandwidth and signal-integrity improvements in cables, connectors, and packages. Achieving the 200G/lane objective requires power- and area-efficient approaches. The critical component enabling this connectivity is the 224Gb/s transceiver. This work presents a long-reach, low-power 224Gb/s PAM-4 SerDes transceiver capable of compensating for more than 46dB of loss at 212.5Gb/s with an optimized analog power of 2.2pJ/b, implemented in 5nm FinFET process technology.

The SerDes transceiver is configured as a
4-lane module, as shown in Fig. 7.2.1. Each lane can operate completely independently of the others, or different receivers (RXs) and transmitters (TXs) across the 4 lanes can be paired as a single port to improve package breakout and alignment with PCB routes in high-lane-density applications. An adaptable, reconfigurable architecture enables data-rate flexibility along with power-saving modes. Each transmitter is implemented with an adjacent PLL, with the option to share a PLL across 2 lanes, while each receiver operates independently with clock generation through a lane-dedicated multi-mode digitally controlled oscillator (DCO). A common block distributes voltage and current references across the lanes; it also contains a temperature-sensing module that supports the temperature-tracking functions needed to reduce performance variation over a wide temperature range. The analog front-end (AFE) of the RX includes a highly configurable wide-band CTLE that equalizes a wide range of loss across various data rates. The CTLE output is quantized by a time-interleaved ADC (TI-ADC) to feed the DSP, which provides the digital equalization and performance monitoring for the receiver. A low-jitter analog PLL generates the clock for each transmitter. The transmitter data path consists of an FIR filter followed by the data serializer and high-speed mux stages. The following 8b DAC converts the data to the analog domain, where it is finally buffered by a high-bandwidth current-mode driver.

Figure 7.2.2 shows the architecture of the analog receiver. The RX analog subsystem is architected with a focus on power- and noise-efficient front-end equalization as well as the generation of low-aperture-jitter clocking. The AFE consists of an input network with T-coils and a separate termination path to distribute the parasitics and provide a wideband, well-terminated response. The CTLE gain stages consist of resonant gain-boosted GM equalization stages. The first stage is RC-degenerated with a T-coil load, and the second stage is an asymmetric push-pull GM with a TIA driver with inductive-resistive feedback. An optimized network of resonance peaking, as well as broadband gain controls, provides a wide range of equalization capability and optimizes the ADC dynamic-range utilization in mission mode. The TIA-driver stage is coupled to a 16-way rank-1 sampling network optimized for a tracking bandwidth upward of 60GHz. The
second-rank sampling consists of super-follower-based buffers, each driving 7 ADC slices. Offset and gain errors for the interleaved ADC slices are background-corrected in the digital domain. The TI-ADC architecture, with 112 ADC slices, is adaptable and scalable to achieve power efficiency across process corners, data rates, and channel losses: ADC resolution, parallelism, supply, and signal range provide this programmability with minimal analog cost.

Clock recovery is closed around a digitally controlled oscillator (DCO), which carries out the functions of frequency synthesis as well as wideband CDR. The recovered clock is routed to a multiphase generator through a resonant distribution network. The multiphase generator combines functionalities such as 2-16 phase generation, shaping for the sampling clock, and timing-skew correction. Efficient jitter optimization is achieved by minimizing the number of accumulative CMOS stages across these major blocks and building a cohesive multifunctional strategy for frequency synthesis, CDR, phase generation, and skew control. The RX AFE, including the DCO, achieves a power efficiency of 1.1pJ/b when configured to operate at 212.5Gb/s.

The receiver DSP engine is depicted in Fig. 7.2.1. The analog-equalized signal is sampled by the ADC and equalized by a 29-tap feed-forward equalizer (FFE) and a 1-tap speculative decision-feedback equalizer (DFE). The DFE generates a recovered symbol utilized by the CDR block. It is also used for 3 banks of 4-tap floating filters to cover reflections up to 100UI. The final equalized signal is optionally propagated into a maximum-likelihood sequence detector (MLSD) for symbol detection of the partial response, specifically to improve the RX performance margin for long-reach channels. Programmable adaptation logic is in place to continuously
tune the FFE, DFE, and floating-bank coefficients to maintain optimum performance after initial training. The digital TX equalization is provided by a 10-tap FIR filter, with 6 taps positioned around the main tap and two 2-tap banks that can cover a range of up to 30UI.

Figure 7.2.3 shows the architectural block diagram of the TX. An 8b DAC-based transmitter, with 6 binary bits and 2 thermometer-coded bits, uses a differential input clock of up to 28GHz generated from an LC PLL. Resonant CML stages distribute the clock to the TX, filtering broadband noise and hence improving clock jitter. An IQ generation circuit converts the incoming clock to quadrature phases and is assisted by a phase-detection and control loop for calibration and tracking. The quadrature clocks drive the final 4-to-1 mux, as well as the clock-generation circuitry for the 64-to-4 serializer. The serializer consists of a 4-to-1 shift register and 2 stages of 2-to-1 muxes. A timing-detection circuit is implemented to maximize the timing margin at the serializer's last 2-to-1 mux interface.

Figure 7.2.3 also shows the 2-stage current-mode driver. A pre-driver converts the 4-phase interleaved DAC current into voltage, and a push-pull buffer stage mirrors the current to the output. An optimized T-coil and inductive network, along with a tuned resistive termination, provides impedance matching as well as a bandwidth boost by isolating the driver from ESD and bump loading. This high-performance, power-efficient transmitter with a current-mode driver achieves 0.86pJ/b power efficiency at 212.5Gb/s.

With the data rate doubling, clocking jitter must improve relative to that achieved at 100Gb/s to maintain the sampling error relative to the data-eye margins. This transceiver utilizes an ultra-low-jitter clocking PLL designed to enable
the jitter-reduction target. The PLL utilizes a dual-inductor, dual-tail second-harmonic-resonance LCVCO. The VCO resonant tank is coupled with CMOS and CML buffers employing resonant clocking to achieve wide-band filtering and jitter reduction. Additionally, a narrow-band jitter-attenuator buffer is incorporated to minimize jitter amplification. The analog-based PLL features an 8b phase rotator supporting fractional-N and SSC functionalities. The dual-inductor LCVCO, spanning a wide frequency range from 24.2GHz to 33GHz, is optimized for low noise using flicker-noise up-conversion rejection techniques.

Measurement results of the PLL at the transmitter output show low jitter performance, with 48fs RMS jitter integrated from 3MHz to the Nyquist frequency at 212.5Gb/s, assuming an ideal CDR filter. Furthermore, the random (Rj), deterministic (Dj), and total jitter (Tj) measured with a 2T clock pattern are reported
as 73fs, 157fs, and 1.18ps, respectively, as shown in Fig. 7.2.4. Figure 7.2.4 also shows the raw eye diagram of the TX, measured at 212.5Gb/s, as well as the eye diagram when scope FFE is applied. In addition, Fig. 7.2.4 shows that the measured TX SNDR with a PRBS-13 pattern at 212.5Gb/s is 36.1dB. The TX output SNDR at 106.25Gb/s is greater than 39dB.

Figure 7.2.5 shows the RX equalized eye after compensating for a loss of 46dB and achieving a pre-FEC BER of 6e-9. The test setup for that measurement is a board-to-board channel through a custom test board providing a trace loss of 33.3dB. The TX-side connectivity loss was measured at 7.5dB and the RX-side loss at 5.5dB, for a total loss of 46dB. Figure 7.2.5 also shows the JTOL margin, with a BER threshold of 1e-4, for the RX with 24dB and 29dB loss channels. Figure 7.2.6 shows a performance comparison summary for this work relative to recently published work at 200Gb/s. Finally, Fig. 7.2.7 is a die photo of the 4-lane module of the 224Gb/s transceiver.

Acknowledgement:
The authors would like to thank the entire Marvell Central Engineering team, including analog, layout, digital, PD, DSP, packaging, hardware, test, and CAD, for their dedication and support.

Figure 7.2.1: 224Gb/s PAM-4 SerDes Transceiver 4-lane Module, RX DSP architecture.
Figure 7.2.2: Receiver architecture, 2-stage CTLE, Rank1 and Rank2 sampling.
Figure 7.2.3: TX architecture and TX driver circuit.
Figure 7.2.4: TX eye (212.5Gb/s raw), TX eye (212.5Gb/s with scope FFE), TX SNDR (36.1dB at 212.5Gb/s), PLL 2T clock jitter: Tj=1.18ps, Rj=73fs, Dj=157fs.
Figure 7.2.5: Board-to-board BER across 4 lanes at different loss. RX slicer input histogram for 46dB loss, RX slicer input samples after equalization for 46dB loss, RX pre-FEC symbol error distribution for 46dB loss. RX JTOL plot at 212.5Gb/s at 24dB, 29dB.
Figure 7.2.6: Performance
comparison table with previously published work at 200Gb/s.

ISSCC 2025 / February 17, 2025 / 1:55 PM
ISSCC 2025 PAPER CONTINUATIONS AND REFERENCES

Figure 7.2.7: Die micrograph of the 224Gb/s 4-lane module.

References:
[1] D. Pfaff et al., “A 224Gb/s 3pJ/b 40dB Insertion Loss Transceiver in 3nm FinFET CMOS,” ISSCC, pp. 128-129, Feb. 2024.
[2] M. Cusmai et al., “A 224Gb/s sub pJ/b PAM-4 and PAM-6 DAC-based Transmitter in 3nm FinFET,” ISSCC, pp. 126-127, Feb. 2024.
[3] J. Q. Wang et al., “A 2.69pJ/b 212Gb/s DSP-Based PAM-4 Transceiver for Optical Direct-Detect Application in 5nm FinFET,” ISSCC, pp. 123-125, Feb. 2024.
[4] Y. Segal et al., “A 1.41pJ/b 224Gb/s PAM-4 SerDes Receiver with 31dB Loss Compensation,” ISSCC, pp. 114-116, Feb. 2022.
[5] J. Kim et al., “A 224Gb/s DAC-Based PAM-4 Transmitter with 8-Tap FFE in 10nm CMOS,” ISSCC, pp. 126-128, Feb. 2021.
[6] M. Choi et al., “An Output-Bandwidth-Optimized 200Gb/s PAM4 100Gb/s NRZ Transmitter with 5-Tap FFE in 28nm CMOS,” ISSCC, pp. 128-130, Feb. 2021.

7.3 A 1.11pJ/b 224Gb/s XSR Receiver with Slice-Based CTLE and PI-Based Clock Generator in 12nm CMOS

Bingyi Ye1,2, Tianchen Ye1, Tianyuan Zhong1, Zhiwen Huang1, Lei Shen1, Boyang Zhang1, Dunshan Yu1, Yandong He1, Weixin Gai1,3
1Peking University, Beijing, China
2East China Normal University, Shanghai, China
3Beijing Advanced Innovation Center for Integrated Circuits, Beijing, China

Over the past few decades, process scaling and architecture advancements have led to an exponential increase in transceiver data rates. Recently, 224Gb/s DSP-based receivers (RXs) targeting long-reach electrical and optical transmission have been demonstrated [1-3]. For extra-short-reach (XSR) communication at 112Gb/s, analog RXs employing continuous-time linear equalizers (CTLEs) and quarter-rate slicers have shown superior energy efficiency [4-5]. However, doubling the CTLE bandwidth and sampling frequency is challenging, particularly when power and area are limited. This paper presents a 224Gb/s XSR RX that overcomes these challenges using a slice-based CTLE and a PI-based eight-phase clock generator (CG).

The RX clocking topology depends on the operating frequency of the slicer, which is realized with the double-tail latch [6], as shown in the top left of Fig. 7.3.1. To accelerate the reset procedure, two switches (M1, M2) short the outputs of the two stages, as they are more effective than pull-up switches. The top right shows the simulated output waveforms with a 10mV DC input. At operating frequencies of 18.7GHz and 28GHz, the output voltage fails to reach the full swing necessary for driving subsequent CMOS circuits. Consequently, the slicer has to operate at 14GHz, and the RX employs an 8-way interleaved architecture.

The bottom portion of Fig. 7.3.1 shows the RX architecture. The slice-based CTLE equalizes the
input signal (VIN) and feeds the slicer array, which is composed of three sets of data slicers and one set of error slicers for background adaptation. The DAC array generates the threshold voltage and calibrates the offset of each slicer. Thermometer-coded data (DH, DM, DL) and error information (E) are synchronized by the aligner and de-multiplexed to a 1/32 baud rate before entering the decoder and the pattern checker. The CG comprises a quadrature divider-by-two (QDIV2), phase interpolators (PIs), CML-to-CMOS (C2C) circuits, and phase-shifter (PS) circuits. The RX-driven transmitter (TX) feed-forward equalizer (FFE) adaptation and the pre-distortion lookup table (LUT) for the PI's integral nonlinearity (INL) are realized externally for debugging and optimization.

The top of Fig. 7.3.2 shows the slice-based CTLE, consisting of a 5b slice-based source-degeneration (SD) transconductance (Gm) amplifier for wideband tuning, an inversely connected 2b Gm amplifier with an RC-based 3GHz low-pass filter (LPF) for low-frequency tuning, a T-coil for impedance matching, and series-shunt inductors for bandwidth extension. In the slice-based SD Gm amplifier, neutralization capacitors (MCP, MCN) counteract the Miller effect, ensuring optimized input impedance across various configurations. The 3 LSBs and 2 MSBs control 7 unit-size slices and 3 eightfold-size slices, respectively, balancing area and mismatch. Each Gm slice can be configured as either an all-pass (AP) slice or a high-pass (HP) slice by turning the switch (MSW) on or off, respectively. The AP slice operates like a common-source differential amplifier because of the small on-resistance of the MSW, which is simulated at 62Ω. The HP slice achieves RC SD by utilizing the wiring capacitance (CS) and output resistance (RO) of the current source, resulting in a more compact layout. Noise on the control signal (EN) is tolerated because the MSW operates as a switch rather than a voltage-controlled resistor. Consequently, the frequency response of the amplifier can be tuned by setting some of the Gm slices as HP slices and the others as AP slices.

The top right and the bottom of Fig. 7.3.2 show the simulated frequency responses of the slice-based SD Gm amplifier and of the conventional one with a lumped degeneration resistor [1-2], respectively. The slice-based amplifier achieves a 3.6dB DC gain and has a maximum step size of 2.4dB. The conventional amplifier achieves the same DC gain with a unit resistor RX of 62Ω but has a maximum step size of 9.3dB, which is almost three times larger and may lead to over- or under-equalization. Increasing RX to 364Ω can alleviate the nonlinearity of the step sizes but reduces the DC gain by 4.5dB. The slice-based SD amplifier breaks this trade-off between gain and step size, which helps to increase the GBW of the CTLE. The corresponding transistors in each slice are aligned in the same row, as shown in the middle-left floorplan, to minimize the intersection between high-speed signal wires. Therefore, the loading capacitance of the slice-based amplifier is nearly as low as that of a lumped amplifier.

In addition to GBW improvement, another design focus is to reduce the group-delay variation of the CTLE, which causes ringing in the single-bit response (SBR). This issue becomes more severe when cascading multiple stages or applying quality-factor (Q) shaping [1,3,7,8]. In this work, the resonance frequency of the output network is intentionally increased to 80GHz, as simulated and shown in the top left of Fig. 7.3.3. This shifts the peak group delay outside the Nyquist bandwidth, reduces the group-delay variation between 5GHz and 56GHz by 75%, and nearly eliminates the ringing. The group-delay ripple below 5GHz is only a fraction of the signal period and causes no ringing. As a result, the eye width and eye height are improved by 37% and 12%, respectively, although the gain at 56GHz is slightly lowered.

The bottom left of Fig. 7.3.3 depicts the block diagram of the PI-based eight-phase CG. The four-phase 14GHz clock generated by QDIV2 is converted into the eight-phase sampling clock by 4 parallel PIs. To reduce jitter, the QDIV2 and PIs are implemented in the current-mode logic (CML) domain, and the PS circuits correct the skew between complementary phases. The key challenge of PI-based CGs is the phase error caused by the INL of the PIs. This may not be a problem for source-synchronous links, since a static error can be calibrated, but a code-dependent phase error can induce deterministic jitter (DJ) when the PIs rotate continuously in a clock-recovery loop. One solution to reduce DJ is
implementing a digitally controlled oscillator (DCO) per RX for clock recovery [1]. However, a more power-efficient solution is the digital pre-distortion technique [9], which allows multiple RXs to share a common phase-locked loop (PLL). The main idea is to compensate for the INL with a pre-distortion LUT. The top right of Fig. 7.3.3 shows the schematic of the 8b CML PI controlled by a fully thermometer-coded DAC to reduce the differential nonlinearity (DNL). Four always-on current sources bias each differential pair in saturation, further suppressing the DNL caused by the transitions between saturation and cut-off regions. Simulations indicate that pre-distortion reduces the peak-to-peak INL (INLpp) by 5.4LSB and only increases the peak DNL (DNLp) by 0.6LSB. The bottom right table illustrates that the INLpp remains below 2.0LSB across a wide range of process and temperature variations when using the pre-distortion LUT optimized for a nominal corner (TT, 50°C).

The RX is fabricated in a 12nm CMOS technology and tested on a probe station. The left part of Fig. 7.3.4 shows the frequency response of the CTLE and the slicer characterized using an arbitrary waveform generator (AWG). For each frequency, the amplitude is derived from the probability density function measured by sweeping the threshold voltage of the slicer. The gain boost can be adjusted from -0.6dB to 18.5dB with a maximum step size of 1.9dB. The worst-case -3dB bandwidth is 59GHz, and the -6dB bandwidth exceeds half of the AWG sample rate (65GHz). The bottom left of Fig. 7.3.4 shows the low-frequency tuning of the CTLE. The bottom middle graph compares the frequency responses for normal and 84% bias current, which indicates the robustness to operating point variation. The pre-distortion LUT is generated by slicing a synchronized 14GHz sine wave with
a variable phase. An on-chip time-to-digital converter (TDC) [3,10] can simplify this procedure. The right part of Fig. 7.3.4 shows measured PI INL and DNL without and with pre-distortion. The average INLpp is reduced by 4.1LSB, and the average DNLp is only increased by 0.8LSB. The maximum code-dependent phase error of the sampling clock is improved from 10.3° to 3.4° with pre-distortion.

The PRBS7 PAM4 signal, with a 930mVppd swing from the AWG, is fed into the RX through an 8.1dB-loss channel (CHA) and a 13.6dB-loss channel (CHB), respectively, whose responses are shown in the top left of Fig. 7.3.5. The TX FFE with up to 10 taps is adapted using the sign-sign least-mean-square (SS-LMS) algorithm, driven by the subsampled output of the RX slicers, as depicted in the middle left. The DACs update threshold voltages 20 times faster than the TX FFE to ensure stability. The top right of Fig. 7.3.5 shows the adapted FFE coefficients for CHB with different CTLE configurations. The middle right shows the measured bit error rate (BER) versus the number of FFE taps. For CHB, a 6dB boost reduces the number of taps from 6 to 3 to meet the 1e-6 BER requirement of CEI-112G-XSR+-PAM4 [11], which is promising for the forthcoming 224G XSR standard. The RX achieves a BER of 8.1e-7, 1.1e-8, and 3.0e-9 with 3-, 6-, and 10-tap FFE, respectively. For CHA, the RX achieves a BER of 6.5e-7 with a 3-tap FFE and no errors over the 5e9 measured bits with a 6-tap FFE. The bottom of Fig. 7.3.5 shows the bathtub curves for CHA and CHB.

Figure 7.3.6 compares the performance with previous RXs and AFEs. This analog 224Gb/s RX is more power efficient than the DSP-based ones. The area is comparable to other 112Gb/s XSR RXs, excluding the unoptimized DAC array. The slice-based CTLE achieves higher -3dB and -6dB bandwidth with lower power and smaller
area. The supply voltage is 1.2V for the CTLE and DACs, and 1.0V for other circuits. The total measured power is 248.7mW, and the bottom right of Fig. 7.3.6 shows the breakdown based on post-layout simulations. Figure 7.3.7 shows the die micrograph.

Acknowledgement: This work was supported in part by the Beijing Major Science and Technology Project (Grant No. Z221100007722019).

Figure 7.3.1: Receiver architecture.
Figure 7.3.2: Slice-based CTLE and simulated frequency responses of two SD Gm amplifiers.
Figure 7.3.3: CTLE group delay optimization, PI-based eight-phase clock generator, 8b CML PI, simulated INL and DNL.
Figure 7.3.4: Measured frequency response of CTLE and slicer, PI INL and DNL, code-dependent phase error.
Figure 7.3.5: Measured channel loss, RX-driven adaptation of TX FFE coefficients, BER vs. FFE taps, BER bathtubs.
Figure 7.3.6: Performance comparisons and power breakdown.
Figure 7.3.7: Die micrograph.

ISSCC 2025 / February 17, 2025 / 2:20 PM

References:
[1] A. Khairi et al., “A 1.41-pJ/b 224-Gb/s PAM4 6-bit ADC-Based SerDes Receiver With Hybrid AFE Capable of Supporting Long Reach Channels,” IEEE JSSC, vol. 58, no. 1, pp. 8-18, Jan. 2023.
[2] J. Q. Wang et al., “A 2.69pJ/b 212Gb/s DSP-Based PAM-4 Transceiver for Optical Direct-Detect Application in 5nm FinFET,” ISSCC, pp. 123-124, Feb. 2024.
[3] D. Pfaff et al., “A 224Gb/s 3pJ/b 40dB Insertion Loss Transceiver in 3nm FinFET CMOS,” ISSCC, pp. 128-129, Feb. 2024.
[4] R. Shivnaraine et al., “A 26.5625-to-106.25Gb/s XSR SerDes with 1.55pJ/b Efficiency in 7nm CMOS,” ISSCC, pp. 182-183, Feb. 2021.
[5] C. F. Poon et al., “A 1.24-pJ/b 112-Gb/s (870 Gb/s/Mm) Transceiver for In-Package Links in 7-nm FinFET,” IEEE JSSC, vol. 57, no. 4, pp. 1199-1210, Apr. 2022.
[6] D. Schinkel et al., “A double-tail latch-type voltage sense amplifier with 18 ps setup+hold time,” ISSCC, pp. 314-315, Feb. 2007.
[7] Y. Krupnik et al., “112-Gb/s PAM4 ADC-Based SERDES Receiver With Resonant AFE for Long-Reach Channels,” IEEE JSSC, vol. 55, no. 4, pp. 1077-1085, Apr. 2020.
[8] S. Kiran et al., “A 56GHz Receiver Analog Front End for 224Gb/s PAM-4 SerDes in 10nm CMOS,” VLSI, pp. 1-2, Jun. 2021.
[9] Z. Wang, H. Jiang, and P. R. Kinget, “A Digital Pre-Distortion Technique for High-Linearity, Low-Power, Compact, Phase Interpolators,” ISCAS, pp. 1-5, May 2024.
[10] M. Mansuri, B. Casper, and F. O'Mahony, “An On-Die All-Digital Delay Measurement Circuit with 250fs Accuracy,” VLSI, pp. 98-99, Jun. 2012.
[11] Implementation Agreement OIF-CEI-05.2. Accessed: Aug. 2024. Online. Available: https:/

7.4 A 112Gb/s DSP-Based PAM-4 Receiver with an LC-Resonator-Based CTLE for 52dB Loss Compensation in 4nm FinFET
Henry Park*1, Qaiser Nehal*1, Miguel Gandara1, Atharav Atharav1, Joonyeong Lee1, Jengfeng Li2, Tamer Ali1
1MediaTek, Irvine, CA
2MediaTek, Hsinchu, Taiwan
*Equally Credited Authors (ECAs)

With the rapid growth of AI accelerators and GPUs, long-reach high-speed interconnects with data rates of 100Gb/s or higher are in wide demand for many different applications, such as Ethernet/Optical standards or PCIe 7.0. Despite the latest advances in DSP-based wireline transceivers, the increasing complexity of networking systems requires a breakthrough in SerDes architecture for successful signal transmission over long channels. This work presents an energy-efficient DSP-based 112Gb/s PAM-4 receiver with a novel CTLE architecture and an analog data path latency reduction technique. The proposed ideas are also directly applicable to the upcoming PCIe 7.0 systems at 128GT/s.

A continuous-time linear equalizer (CTLE) is one of the most important components of all wireline interconnects. Historically, source-degenerated differential pairs were used for data rates up to 56Gb/s, but recent 112G/224G
designs are adopting resonator-based architectures [1-3] to target peaking frequencies at 28GHz and beyond. The major problem of the resonator architectures is that a high quality-factor (Q) design is necessary to increase the Nyquist gain, but its time-domain response may exhibit enlarged and prolonged ringing. To visualize the ringing effect, two CTLE AC responses are drawn in Fig. 7.4.1. The red line is an ideal CTLE response with only 2 taps (1 - z^-1), while the blue line includes an exponentially decaying ringing pattern. The two models have similar peaking gain (28GHz gain relative to DC gain), but a major deviation is seen at 5-20GHz. When convolved with a 36dB loss response, the high-Q CTLE leaves large post-cursors at h2-h4 that can be suppressed by a digital FFE, but with noise peaking. The large post-cursors can also lead to noisy timing error detection if they are not fully rejected in the timing recovery loop filter. To obtain an ideal CTLE response, this work adopts a parallel RLC network that has nearly no ringing with a low tank Q. From the RLC network's frequency response in Fig. 7.4.1, the DC gain is determined by the inductor series resistance, while the Nyquist gain is a strong function of the Q factor (i.e. the shunt resistor). These two design variables are independent, so a very large peaking gain can be achieved with nearly no ringing in the time-domain response.

The proposed CTLE with a parallel RLC resonator is implemented as shown in Fig. 7.4.2. CTLE_HF's Gm cells drive the RLC network, and only high-frequency input signals are amplified by this path. The Nyquist gain is controlled by enabling Gm cells, with no major impact on the Q factor. Since CTLE_HF provides a very low DC gain, a low-frequency equalizer (CTLE_LF) is added in parallel to CTLE_HF to provide adequate DC gain. CTLE_LF has conventional source-degenerated Gm cells to meet the linearity requirements of PAM-4 signaling. The currents from CTLE_LF Gm and CTLE_HF Gm are summed at the output node with a shunt-series peaking network that can be designed either for a wideband uniform impedance or for a mid-band peaking response. Both approaches preserve CTLE_HF's 2-tap response after convolution, as shown by case 2 and case 3 in Fig. 7.4.2, but high-frequency peaking at the output network can lead to ringing (case 1). As the proposed CTLE can produce a 28GHz gain higher than 17dB, a flipped source follower (FSF) is placed at the input network to
minimize signal reflections.

The LC-resonator-based CTLE features three more major advantages compared to the conventional CTLE architecture. First, the peaking frequency is programmable by the capacitor bank control (CPEAK in Fig. 7.4.2), with good backward compatibility to 56Gb/s or lower data rate standards. Second, the proposed CTLE has an inherently low output noise profile thanks to the bandpass filtering by the output resonator. One design example is shown in Fig. 7.4.2, where the CTLE_LF (conventional CTLE design) and CTLE_HF noise sources are drawn in the schematic. CTLE_LF's differential output voltage noise spectral density (SV,OD) is contributed by three noise sources (in,R1; in,CS; and in,gm), as the wideband output network amplifies noise in all bands. CTLE_HF has only one noise source, in,gm, and nearly 60% of the in-band noise power is removed by the output RLC network if Q is 1. This noise power analysis indicates that the resonator-based CTLE is nearly 2 times less noisy than the conventional architecture (CTLE_LF) designed for the same Nyquist gain. In this design, CTLE_HF provides more than 85% of the Nyquist gain, but its contribution to the total CTLE noise power is only 2.7 times greater than the CTLE_LF contribution.

Lastly, the CTLE CMRR can be greatly improved by a coupled differential inductor. Various mode conversions caused by skew or by layout/device mismatch in the TX or channels can introduce differential noise (Vn,dm) or common-mode (CM) noise (Vn,cm1/Vn,cm2) at the RX front-end (RXFE) output (Fig. 7.4.3). The common-mode noise can cause nonlinear distortion by lowering the RXFE voltage headroom, or it can be converted to differential noise by the ADC SDC (CM-to-DM conversion). These effects can be suppressed by lowering the CTLE SCC (CM gain). From the CTLE_HF equivalent CM circuit in Fig. 7.4.3, the inductor
coupling factor k can push the CM resonance frequency much higher than the cut-off frequency of the next stage's SCC. Since the RLC network's low-frequency impedance, given by the sum of the inductor ESR and a CM termination, is small, the CTLE CM gain can be kept low over the entire frequency range.

Another potential source of ringing in the RXFE response is the parasitic capacitance (Cp) at the input termination node, shown in Fig. 7.4.3. In Ethernet standards, the incoming signal's common-mode level (VCM) can be anywhere between 0 and 1V, which requires large IO switches for termination control (e.g. 50Ω calibration) if the RX is DC-coupled to the TX. In some cases, on-board AC coupling caps can be used to decouple the RX CM biasing voltage from the TX. However, external discrete components can add impedance discontinuities that degrade signal integrity and link performance. If the TX common-mode voltage can be sensed and tracked by RXFE blocks, core devices can be used for the RX termination control with a minimum Cp contribution. The proposed input termination circuit with a small (core device) NMOS-based resistor array is shown in Fig. 7.4.3. The gate control voltage comes from a replica bias generator
that compares the NMOS replica device impedance with a reference resistor (RREF). The ON-state bias voltage (VG-VS) is added to VCM by a charge pump, and this voltage level (VG-VS+VCM) is used for accurate NMOS device impedance control under arbitrary VCM. The off-state voltage (VOFF) is a negative level that is generated by another charge pump. The front-end circuits endured a 150V CDM test with no secondary protection for the core devices.

Timing recovery (TR) of a DSP-based receiver is another critical design concern, as the loop's dynamic stability depends on the loop latency. Since the MM phase detector in Fig. 7.4.4 uses the full FFE/DFE slicer output and the ADC's raw data with a delay matched to the EQ path, the data path latency can seriously degrade the TR loop phase margin, although this strong equalizer path is ideal for more robust TR operation over high-loss channels or under noisy operating conditions. To minimize the jitter peaking, the analog data path delay is reduced by merging the SAR ADC's output 8b-7b encoding into the ADC conversion cycles (Fig. 7.4.4). The proposed C-DAC design in Fig. 7.4.4 guarantees redundancy for the first 5 bits (B7:3), but the last three LSBs are binary coded
as their settling speed is typically much faster than the MSBs' settling speed [4]. By choosing the capacitor weights of the MSBs as integer multiples of 4, only 3 full adders (FAs) are needed to convert the five MSBs to 4-bit outputs. After B3 is resolved, the 5b-4b encoding completes after 3 FA delays, which is quicker than the remaining three comparator cycles for B2:0. Since the encoder is merged into each ADC slice, up to one full cycle (1ns) can be saved from the full analog data path latency.

Figure 7.4.5 shows the RX frequency response with a maximum of 30dB peaking set by CTLE_HF's Gm control (EN_HF) or by CTLE_LF's degeneration controls (Rs/Cs). The CTLE peaking frequency is tunable from 28GHz down to 14GHz by enabling more capacitor units (CPEAK) at the RLC tank. A single-tone test result is shown in Fig. 7.4.5 with a -1dBFS swing at the ADC. A low-frequency SNDR of 36dB gradually drops below 30dB near 28GHz, limited
by the clocking path RJ (135fs). Using an on-chip TX that can produce a 1.1Vppd swing, TX-to-RX loopback tests were done with external ISI boards. At a 112.5Gb/s data rate, the transceiver can compensate up to 52.3dB loss with 5dB TX FIR. Figure 7.4.6 shows a KR channel TX-RX loopback test with trace skew control. Using an RF phase adjuster, a P-N or N-P trace skew was introduced to test the link performance under extreme mode conversions. The measured BER is below 1e-9 even with 0.5UI skew. The IEEE KR JTOL has a 120mUIPP margin, assisted by the ADC latency reduction technique. The table in Fig. 7.4.6 compares
architectural differences, performance, power, and area of previously published high-performance DSP-based SerDes papers with this work. A single-lane RX with CDR consumes 210mW at 112.5Gb/s, and the analog TX+RX occupies 0.367mm2. The eight-lane transceiver die photo is shown in Fig. 7.4.7.

Acknowledgement: The authors would like to thank MediaTek's layout, system design, and test teams for their support in this 4nm chip design and silicon measurements.

Figure 7.4.1: CTLE AC responses with or without ringing, CTLE post-cursor cancellations with a KR channel, and parallel RLC circuit's time/AC responses.
Figure 7.4.2: RX front-end architecture, time-domain design considerations in RX data path, and noise breakdown of CTLE_LF and CTLE_HF from a design example.
Figure 7.4.3: CTLE CM gain suppression by a coupled diff. inductor, RXFE network parasitic cap effect, and DC coupled input network with termination controls.
Figure 7.4.4: Block diagram of timing recovery and 7b 1GS/s SAR ADC with a 1b redundancy, proposed 5b-4b encoder, and ADC conversion cycle timing diagram.
Figure 7.4.5: RX AC response at 112Gb/s mode, peaking frequency tuning, RX single-tone test (-1dBFS), and 112.5Gb/s TX-RX loopback test.
Figure 7.4.6: 36.8dB KR channel TX-RX loopback test with skew, IEEE 802.3ck KR JTOL test, and comparison table with prior long-reach 112Gb/s papers.
Figure 7.4.7: Die photo (eight lane transceivers).

ISSCC 2025 / February 17, 2025 / 2:45 PM

References:
[1] Y. Krupnik et al., “112-Gb/s PAM4 ADC-Based SERDES Receiver with Resonant AFE for Long-Reach Channels,” IEEE JSSC, vol. 55, no. 4, pp. 1077-1085, Apr. 2020.
[2] Y. Segal et al., “A 1.41pJ/b 224Gb/s PAM-4 SerDes Receiver with 31dB Loss Compensation,” ISSCC, pp. 114-116, Feb. 2022.
[3] D. Pfaff et al., “A 224Gb/s 3pJ/b 40dB Insertion Loss Transceiver in 3nm FinFET CMOS,” ISSCC, pp. 128-130, Feb. 2024.
[4] L. Kull et al., “A 3.1 mW 8b 1.2 GS/s Single-Channel Asynchronous SAR ADC With Alternate Comparators for Enhanced Speed in 32 nm Digital SOI CMOS,” IEEE JSSC, vol. 48, no. 12, pp. 3049-3058, Dec. 2013.
[5] P. Mishra et al., “A 112Gb/s ADC-DSP-Based PAM-4 Transceiver for Long-Reach Applications with 40dB Channel Loss in 7nm FinFET,” ISSCC, pp. 138-140, Feb. 2021.
[6] M. LaCroix et al., “A 116Gb/s DSP-Based Wireline Transceiver in 7nm CMOS Achieving 6pJ/b at 45dB Loss in PAM-4/Duo-PAM-4 and 52dB in PAM-2,” ISSCC, pp. 132-134, Feb. 2021.
[7] Z. Guo et al., “A 112.5Gb/s ADC-DSP-Based PAM-4 Long-Reach Transceiver with 50dB Channel Loss in 5nm FinFET,” ISSCC, pp. 116-118, Feb. 2022.
[8] A. Varzaghani et al., “A 1-to-112Gb/s DSP-Based Wireline
Transceiver with a Flexible Clocking Scheme in 5nm FinFET,” IEEE Symp. VLSI Circ., pp. 26-27, June 2022.
[9] H. Park et al., “A 4.63pJ/b 112Gb/s DSP-Based PAM-4 Transceiver for a Large-Scale Switch in 5nm FinFET,” ISSCC, pp. 5-7, Feb. 2023.

7.5 A 353mW 112Gb/s Discrete Multitone Wireline Receiver Datapath with Time-Based ADC in 5nm FinFET
Jaewon Lee1,2, Pier-Andrea Francese2, Matthias Brändli2, Thomas Morf2, Marcel Kossel2, Seoyoung Jang1, Gain Kim1
1Daegu Gyeongbuk Institute of Science and Technology, Daegu, Korea
2IBM Research Europe, Rüschlikon, Switzerland

The growing demand for higher communication bandwidth between processors through wired interconnects in large-scale servers has been driving the need to increase the per-lane data rate beyond the current 112Gb/s. Recently demonstrated analog-to-digital converter (ADC)-based receiver (RX) prototypes with 100Gb/s data rates typically employ a parallel feed-forward equalizer (FFE) with a large number of taps, a 1-tap decision feedback equalizer (DFE) [1-5], and, optionally, a maximum likelihood sequence estimator (MLSE) [6-8]. As the
data rate grows exponentially, the pulse response length and the number of corresponding inter-symbol interference (ISI) cursors increase accordingly [5,8]. As the pulse response length doubles, the FFE tap count must increase accordingly, which results in substantial area and power overhead. The DFE feedback loop timing closure also becomes more stringent as the baud rate increases [9]. With an increased pulse amplitude modulation (PAM) order, the DFE and MLSE design complexity increases exponentially [6-8]. While a 100Gb/s PAM-4 transceiver (TRX) can effectively equalize smooth channels [2-5], ripples and notches in the frequency response of the channel can significantly degrade the equalization performance of current PAM-4 TRXs. Discrete multitone (DMT) modulation provides robustness against reflective channels because it can individually allocate bit information to multiple frequency-domain orthogonal subchannels depending on the subchannel signal-to-noise ratio (SNR) [10-12]. A DMT RX collects time-domain DMT samples before applying frequency-domain equalization (FDE), where the cyclic prefix (CP) protects the DMT symbol from ISI [11,13]. If the ratio between the CP length and the DMT symbol length is low, the increase in the number of ISI cursors (i.e. the increase in the required CP length) due to an increased data rate does not significantly increase the computing complexity of the DMT RX [11].

This paper presents a 112Gb/s DMT RX datapath with a time-interleaved (TI) time-based (TB) ADC. While previously reported 100Gb/s RXs typically employed SAR ADCs as the sub-ADC, this work utilizes a TB-ADC to explore a digital-friendly design with a low supply voltage, which is favorable in advanced FinFET nodes. The compactness of the TB-ADC from its digital-intensive architecture is further explored by
sharing the ring oscillator among multiple sub-ADCs, instead of converting the voltage to time in each sub-ADC with its own oscillator.

Figure 7.5.1 shows the overall block diagram of the implemented DMT RX datapath. The 8b 64-way TI-TB-ADC is placed at the RX front end. The RX receives 90-degree-spaced 4-phase quarter-rate clocks (C4), and other required clocks such as 1/8-rate clocks (C8) and 1/64-rate clocks (C64) are generated on-chip. The differential inputs are on-die terminated (ODT) with a 100Ω parallel resistor and T-coils. The digital signal processor (DSP) operates at the C64 rate, which is the same as the sampling rate of the sub-ADCs. The folding operation of the TB-ADC and the calibration for residual gain and offset mismatch among sub-ADCs are performed in the DSP front-end. The sign cross-correlator (Xcorr) detects the location of the CP within the sampled time-domain DMT symbol by receiving a known time-domain synchronization sequence (SS) [14]. This synchronization process is required only once, when the RX starts to operate, if the transmitter (TX) and the RX are locked and operate at the same rate. To support various communication channels with different numbers of ISI cursors in the pulse response, multiple CP lengths ranging from 4 to 32 are supported without the need for different clock rates between the DSP and sub-ADCs as in [15]. Demodulation of quadrature amplitude modulation (QAM) symbols from QAM-4 to QAM-256 and DSP test circuitry, including a per-subchannel constellation diagram scanner and a digital sampling oscilloscope, are supported on-chip.

Figure 7.5.2 illustrates a single DMT frame, consisting of a 128-sample SS, a CP (of variable length Ncp), and DMT symbols. A symbol synchronizer, the 64-parallel sign Xcorr, finds the location of the CP by detecting the peak position with a threshold level programmed from off-chip. The 128-tap sign-sign multiplications between the calibrated ADC outputs and the known SS are performed by an XNOR array. Then, the serial commutator rearranges DMT symbols for proper operation of the 128-tap 64-parallel single-stage multi-path delay feedback (MDF) fast Fourier transform (FFT), removing the CP with coarse location selection. The MDF structure requires fewer multiplications than a fully parallel design, since the 64-point FFT is reused twice to compute the 128-tap FFT. Compared to a small-sized FFT, the residual ISI exceeding the CP coverage is amortized across a long DMT symbol period. The coarse CP location selection significantly reduces the DSP routing congestion compared to exact selection, while still sufficiently covering the ISI. During two DSP clock (C64) cycles, 32-parallel complex multipliers in the FDE scale and rotate the received 12b constellations of the 63 subchannels with 8b coefficients. Each complex multiplier consists of three real multipliers and four real adders to minimize the number of multipliers. The coefficients are stored after scaling to minimize DSP quantization noise, as in [15]. The on-chip sign-sign least mean square (SS-LMS) engine adapts to slowly varying channel fluctuations in a decision-directed manner. The error vector is obtained directly from the equalized constellations and decision symbols. Then, the SS-LMS engine updates the coefficients by subtracting a gradient scaled by the programmed step size. The gradient computations are implemented
189、wly varying channel fluctuation in a decision-directed manner.The error vector is directly obtained from the equalized constellations and decision symbols.Then,the SS-LMS engine updates the coefficients by subtracting a gradient with the programmed step size.The gradient computations are implemented
190、 by a 4-input lookup table(LUT)(the sign bit of I/Q for FFT output and error)to improve energy efficiency.The architecture of TI-TB-ADC and the block diagram of a group of eight sub-ADCs sharing the outputs from a single eight-stage injection-locked ring oscillator(IROSC)and a single 4b integer coun
191、ter are shown in Fig.7.5.3a.The TI-TB-ADC is 64-way 2-rank time-interleaved.In the first rank of the track-and-hold(T&H),eight bootstrapped sampling switches are followed by eight class-AB buffers 16 each driving eight sub-ADCs.A time-based 8b ADC with bipolar voltage-to-time conversion(VTC)17 is em
192、ployed as an ADC slice.Figure 7.5.3b depicts the exemplary waveforms of the conversion sequence through the TB-ADC and the current copying technique used in the bipolar VTC to control the sub-ADC slice gain.The quantization steps in the time domain are obtained by 2 interpolation of ring oscillators
and do not require calibration [18]. Each group of eight sub-ADCs receives a comb of eight phases from a single shared IROSC. The IROSC phases are strobed and interpolated inside each sub-ADC slice and represent the four least significant bits (LSBs) in the cyclic thermometer code. The most significant bits (MSBs) are converted by strobing a 4b integer counter connected to a single IROSC phase. The output of the counter is also shared by each group of eight sub-ADCs. The counter is built as two cascaded divide-by-4 ripple counters. To strobe the counter output away from any rippling transition, both outputs A0:3 and B0:3 of the latches of each of the flip-flops in the counter are received by each sub-ADC slice. The arbitration between the A and B outputs to select the stable value of the counter is done locally in each sub-ADC by strobing the buffered IROSC phase connected to the counter input [19]. A sign comparator determines the folding point in the voltage domain and correctly steers the bipolar VTC. The cyclic thermometer decoding of the four LSBs, the 4b integer counter modulo, and the folding point computation are all done on-chip in the DSP to obtain a 2's complement conversion format for the digital processor.

The signal spectrum and corresponding integral nonlinearity (INL) and differential nonlinearity (DNL) plots of 64kS captured at 52GS/s with the on-chip DSP scope are reported in Fig. 7.5.4. The input signal is a sine wave of 800mVppd amplitude and 774MHz frequency generated with the Keysight M8196A arbitrary waveform generator (AWG). The measured effective number of bits (ENOB) is 5.4, and the max DNL and INL are 0.31 and 1.03 LSB, respectively.

The prototype RX fabricated in 5nm FinFET is characterized using wafer probing needles on the probe station. The areas of the clock receiver, interleaver, ADC, and DSP are 4621um2, 9231um2, 30793um2, and 594355um2, respectively, with a total RX active area of 0.242mm2. The cell density of the DSP is 36.24% with 632.1k standard cells. The 8b digital-to-analog converter (DAC) of the AWG repeatedly transmits 24kSa DMT frames created from the pseudo-random binary sequence (PRBS)-15 pattern, with 440mV common mode and 1.96Vppd output swing at a x1.5 oversampled 78GS/s conversion rate. The channel SNR in Fig. 7.5.5 is obtained from the error vector magnitude (EVM) of the equalized pilot QAM-4 symbol. The bit and power loading profiles are calculated based on the SNR with the Levin-Campello algorithm [20], targeting a 1E-3 per-dimension symbol error rate (SER) while excluding the 48th subchannel, which shows a sharp notch of 7dB. The TX peak-to-average power ratio is set to 12.4dB, and the average spectral efficiency is 4.9365b/Hz, achieving a 112.3Gb/s data rate, including 12.5% CP overhead, i.e. a 16-tap CP and a 128-tap FFT. The constellation diagram from the on-chip scanner exhibits -23.1dB and -19.9dB EVM for QAM-64 and QAM-32, respectively, and the calculated BER is 1E-4 when communicating over a channel with 12dB insertion loss at 26GHz, without an analog front-end (AFE). The measured performance is currently limited by the amplitude modulation induced by the bounce of the shared analog and digital grounds due to the fast increase of DSP activity after synchronization detection. The supply voltages of the interleaver (VDI), TI-ADC (VDA), and DSP (VDD) are sensed on-chip. At 0.77V, 0.64V, and 0.62V for VDI, VDA, and VDD, the power consumption is 64.0mW, 57.6mW, and 231.3mW, respectively.

Figure 7.5.6 summarizes the performance of the fabricated RX test chip prototype and compares it with relevant prior art. The demonstrated performance showcases the viability of DMT techniques in the presence of a sharp notch in the channel and their energy efficiency when combined with a digital-friendly TB-ADC operating at low voltage. A die photograph and the layout details are shown in Fig. 7.5.7.

Acknowledgement: This work was supported in part by the Samsung Future Technology Incubating Program
206、(No.SRFC-IT2301-01),in part by Samsung Electronics(No.2024-0-50004),in part by the National Research Foundation of Korea(NRF)grant funded by the Korea government(MSIT)(No.RS-2024-00352784),and in part by the DGIST R&D Program(No.24-KUJoint-09).Figure 7.5.1:Overal l bl ock diagram of the impl emented
207、 DMT RX prototype.Figure 7.5.2:A DMT symbol frame and rel ated DSP subbl ocks for symbol synchronization,CP removal,singl e-stage 64-paral l el 128-tap MDF FFT,FDE,and SS-LMS.Figure 7.5.3:The 64-way TI-TB-ADC architecture with shared IROSC(top)and the exempl ary waveform of the vol tage-to-time-to-d
208、igital conversion process for a sl ice TB-ADC(bottom-l eft).Figure 7.5.4:DNL(upper-l eft),INL(upper-right),and signal spectrum(bottom)of the TI-TB-ADC with 64k data points sampl ed at 52GS/s.Figure 7.5.5:Channel SNR profil e with a deep notch at bin 48(upper-l eft),bit and power l oading profil e(up
209、per-right),scanned QAM-64 constel l ations after FDE for 1st to 14th subchannel s(bottom-l eft),and scanned QAM-32 constel l ations after FDE for 24th to 38th subchannel s(bottom-right).Figure 7.5.6:Comparison tabl e with prior arts.ISSCC 2025/February 17,2025/3:35 PM145 DIGEST OF TECHNICAL PAPERS 7
210、 2025 IEEE International Solid-State Circuits ConferenceISSCC 2025 PAPER CONTINUATIONS AND REFERENCES979-8-3315-4101-9/25/$31.00 2025 IEEEFigure 7.5.7:Die photograph and l ayout detail s of impl emented DMT RX prototype.Re f e re nce s:1 J.Im et al.,“A 112Gb/s PAM-4 Long-Reach Wireline Transceiver U
211、sing a 36-Way Time-Interleaved SAR-ADC and Inverter-Based RX Analog Front-End in 7nm FinFET,” International Solid-State Circuits Conference (ISSCC), pp. 116-118, Feb. 2020. [2] M.-A. LaCroix et al., “A 116Gb/s DSP-Based Wireline Transceiver in 7nm CMOS Achieving 6pJ/b at 45dB Loss in PAM-4/Duo-PAM-4 and 52dB i
212、n PAM-2,” International Solid-State Circuits Conference (ISSCC), pp. 132-134, Feb. 2021. [3] Z. Guo et al., “A 112.5Gb/s ADC-DSP-Based PAM-4 Long-Reach Transceiver with 50dB Channel Loss in 5nm FinFET,” International Solid-State Circuits Conference (ISSCC), pp. 116-118, Feb. 2022. [4] H. Park et al., “A 4.63pJ/b 112Gb/s
213、DSP-Based PAM-4 Transceiver for a Large-Scale Switch in 5nm FinFET,” International Solid-State Circuits Conference (ISSCC), pp. 110-112, Feb. 2023. [5] Y. Segal et al., “A 1.41pJ/b 224Gb/s PAM-4 SerDes Receiver with 31dB Loss Compensation,” International Solid-State Circuits Conference (ISSCC), pp. 114-116, Feb. 202
214、2. [6] H. Yueksel et al., “Design Techniques for High-Speed Multi-Level Viterbi Detectors and Trellis-Coded-Modulation Decoders,” IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS-I), vol. 65, no. 10, pp. 3529-3542, Oct. 2018. [7] M.-E. Meybodi et al., “Design and Implementation of an On-Demand Maxim
215、um-Likelihood Sequence Estimation (MLSE),” IEEE Open Journal of Circuits and Systems (OJCAS), vol. 3, pp. 97-108, May 2022. [8] D. Pfaff et al., “A 224Gb/s 3pJ/b 40dB Insertion Loss Transceiver in 3nm FinFET CMOS,” International Solid-State Circuits Conference (ISSCC), pp. 128-130, Feb. 2024. [9] J. Bailey et al., “A 112Gb
216、/s PAM-4 Low-Power 9-Tap Sliding-Block DFE in a 7nm FinFET Wireline Receiver,” International Solid-State Circuits Conference (ISSCC), pp. 140-142, Feb. 2021. [10] B. Vatankhahghadim et al., “A Study of Discrete Multitone Modulation for Wireline Links Beyond 100 Gb/s,” IEEE Open Journal of Circuits and Systems (O
217、JCAS), vol. 2, pp. 78-90, Jan. 2021. [11] G. Kim, “Design Space Exploration of Single-Lane OFDM-Based Serial Links for High-Speed Wireline Communications,” IEEE Open Journal of Circuits and Systems (OJCAS), vol. 3, pp. 134-146, July 2022. [12] Z. Jiang et al., “High Speed DMT for 224 Gb/s and Faster Wireline Transmiss
218、ion,” IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS-I), vol. 70, no. 4, pp. 1758-1771, Apr. 2023. [13] J. Armstrong, “OFDM for Optical Communications,” IEEE Journal of Lightwave Technology (JLT), vol. 27, no. 3, pp. 189-204, Feb. 2009. [14] D. C. Chu, “Polyphase codes with good periodic correlation propertie
219、s,” IEEE Transactions on Information Theory (TIT), vol. 18, no. 4, pp. 531-532, July 1972. [15] G. Kim et al., “A 161-mW 56-Gb/s ADC-Based Discrete Multitone Wireline Receiver Data-Path in 14-nm FinFET,” IEEE Journal of Solid-State Circuits (JSSC), vol. 55, no. 1, pp. 38-48, Jan. 2020. [16] A. S. Yonar et al., “An 8-bit 56GS/s 6
220、4x Time-Interleaved ADC with Bootstrapped Sampler and Class-AB Buffer in 4nm CMOS,” IEEE Symposium on VLSI Circuits (VLSI), pp. 168-169, June 2022. [17] A. S. Yonar et al., “An 8b 1.0-to-1.25GS/s 0.7-to-0.8V Single-Stage Time-Based Gated-Ring-Oscillator ADC with 2× Interpolating Sense-Amplifier-Latches,” Interna
221、tional Solid-State Circuits Conference (ISSCC), pp. 1-3, Feb. 2023. [18] M. Zhang et al., “A 4× Interleaved 10GS/s 8b Time-Domain ADC with 16× Interpolation-Based Inter-Stage Gain Achieving 37.5dB SNDR at 18GHz Input,” International Solid-State Circuits Conference (ISSCC), pp. 252-254, Feb. 2020. [19] M. Baert et al., “A
222、5GS/s 7.2 ENOB Time-Interleaved VCO-Based ADC Achieving 30.5fJ/conv-step,” International Solid-State Circuits Conference (ISSCC), pp. 328-330, Feb. 2019. [20] J. Campello, “Optimal discrete bit loading for multicarrier modulation systems,” IEEE International Symposium on Information Theory (ISIT), p. 193, Aug
223、. 1998. 7.6 A 2.06pJ/b 106.25Gb/s PAM-4 Receiver with 3-Tap FFE and 1-Tap Speculative DFE in 28nm CMOS Yen-Po Lin, Yun-Cheng Jao, Wu-Hung Hsieh, Pen-Jui Peng
224、National Tsing Hua University, Hsinchu, Taiwan The increasing demand for I/O bandwidth in data centers pushes the data rate of serial links up to 100Gb/s. Although ADC-based receivers have powerful and flexible DSP equalization that can compensate for 20dB channel loss [1-5], the power consumption is not c
225、ompetitive when using process nodes larger than 16nm [6]. A mixed-signal receiver utilizing 1+0.5D pulse shaping can implement the speculative DFE for PAM-4 signals with fewer slicers [7]. It relieves the stringent timing constraint of the DFE with less power dissipation. However, to accomplish a 1+0.5D
226、pulse response with minimum residual ISI, a TX FFE must be included [8], which makes the adaptation loop for TRX link equalization more difficult. Another approach uses the RX FFE to fully eliminate h1 [9]. It avoids the DFE timing issue, but the AFE noise is significantly increased by canceling the lar
227、ge amount of h1, degrading the BER performance. This paper presents a 106.25Gb/s PAM-4 RX in 28nm CMOS using a 3-tap RX FFE to adaptively generate a 1+0.5D response for the 1-tap speculative DFE. It compensates for 21.2dB loss without any TX FFE at an energy efficiency of 2.06pJ/b. Figure 7.6.1 illustrate
228、s the concept for decoding the PAM-4 signal with 1+0.5D pulse shaping. Because h1=0.5h0, 4 of the 12 speculative PAM-4 DFE decision levels overlap with the others. Therefore, the 1+0.5D PAM-4 signal requires only 8 slicing levels at -3.5h0, -2.5h0, ..., and 3.5h0 to resolve the speculative PAM-4 data. The
229、 data slicing levels can be reused to extract the error information for the equalizer adaptation and the timing recovery. For example, 3.5h0 is the DFE slicing level for the PAM-4 top eye when the previous symbol D-1 is +3, while it becomes the error slicing level when D-1 is +1. Instead of directly using
230、 8 slicers to decode the 1+0.5D PAM-4 signal, which causes large clocking power and degrades AFE bandwidth in 28nm CMOS, a time-interleaved structure with 3b SAR ADCs is more efficient. It can provide 7 of the 8 slicing levels. However, the original slicing levels are not suitable for the SAR ADC. As
231、a result, a voltage pre-shift circuit is adopted to generate Vshift (=0.5h0), moving the slicing levels to -3h0, -2h0, ..., and 3h0. In addition to the 7 slicing levels from the 3b ADC, a standalone slicer regarded as a 1b ADC provides the slicing level at 4h0. Thus, the 1+0.5D PAM-4 signal is fully resolve
232、d by the time-interleaved 3b+1b ADC with the adaptive reference voltages Vref3b and Vref1b. The adaptation loops for Vref3b and Vref1b are realized by extracting the error information with the corresponding pattern filters. Figure 7.6.2 shows the RX architecture. The 106.25Gb/s PAM-4 signal is first receiv
233、ed by the on-die termination (ODT) with a T-coil to improve the return loss and input bandwidth. The CTLE adopts series-RC and parallel-RC degeneration networks for low-frequency and mid-frequency boosting, respectively [10]. Since h1 need not be eliminated in 1+0.5D pulse shaping, the design for CTL
234、E bandwidth is relaxed, which also improves the noise performance. The CTLE adaptation is implemented in the digital domain with a sign-sign LMS algorithm. It adjusts the mid-frequency and low-frequency RC networks by minimizing h3+h4+h5 and h6+h8+h10, respectively. The subsequent buffers isolate the CTL
235、E from the 4-way track-and-holds (T/Hs) and split the data into even and odd paths. Each buffer drives two T/Hs with complementary clocks to prevent overlapping access. Both the track and hold phases are 2UI for the T/H with 50% duty-cycle clocks, reducing the complexity and power. The FFE summer combines
236、the present data with the previous and next data from the adjacent T/H outputs, forming a 3-tap RX FFE. It removes residual ISI and maintains the 1+0.5D pulse shaping under different channel losses through the adaptation loop. However, using 50% duty-cycle clocks for the T/Hs makes the timing overlap for t
237、he 3-tap FFE summing very small. To deal with this issue, the FFE summer inherently incorporates three paths with different delays. By applying the post-tap data to the slow path and the pre-tap data to the fast path, the timing overlap is increased significantly. The voltage pre-shift circuit is integrated i
238、nto the FFE summer. A VGA is also included to adjust the signal swing for the ADCs. The source follower (SF) drives a group of 6 time-interleaved 3b+1b ADCs to decode the speculative PAM-4 data and error signals. Tunable delay cells are placed to ensure proper sample timing for the rank-2 switch under
239、PVT variations. All sub-ADC outputs are aligned and then sent to the digital circuit for evaluating the 1-tap speculative DFE. The DFE adopts look-ahead logic to reduce the propagation delay in the critical path [11]. The digital circuit includes adaptation loops for the CTLE/FFE/Vshift/Vref, offset can
240、cellation loops for the AFE and ADC, skew calibration for the rank-1 clock [12], and timing recovery with 1+0.5D baud-rate CDR logic. Note that four independent FFE/Vref/Vshift adaptation loops work separately for each rank-1 channel to counter the gain and bandwidth mismatches. The delay-locked loop (DLL) a
241、nd the injection-locked ring oscillator (ILRO) are adopted in the clock path to generate the 8-phase clocks for the phase interpolators (PIs) [13]. The PIs adjust the rank-1 T/H quarter-rate clocks in accordance with the digital baud-rate CDR logic [14]. The 3-tap FFE summer is illustrated in Fig. 7.6.3, which
242、adopts the trans-admittance/trans-impedance (TAS-TIS) topology for better linearity. The voltage pre-shift and variable-gain functions are merged into the summer to reduce the hardware overhead. The voltage pre-shift is achieved by adding an offset current at the TAS output. It is also used to cancel t
243、he overall DC offset from the AFE and time-interleaver. The pre-shift code is combined with the offset code in the digital domain and controls a 7b current DAC to produce Ishift accordingly. The TIS stage itself serves as a VGA by adjusting the feedback resistance. The gm-enhancement technique in the T
244、AS provides 2× the transconductance while consuming only 1.5× the power [15]. RC degeneration is included to compensate for the settling loss caused by the rank-1 T/H. The gm-enhanced structure has inner and outer TAS stages. The outer TAS stage has a small delay since it is directly connected to the output. In
245、 contrast, the inner stage brings current to the output through an active load, resulting in a larger delay. The main tap combines both fast and slow paths, leading to a moderate delay. By applying the pre-tap and post-tap units at different nodes, a fast path for the pre-tap and a slow path for the post-tap can be ac
246、hieved. Such a structure increases the timing overlap for FFE summing without any additional circuits. Moreover, the common-mode currents from the pre-tap and post-tap units partially cancel each other at the TAS output, reducing the common-mode variation and preserving good linearity when the 3-tap FFE
247、is activated. Figure 7.6.3 also shows the simulated SBRs under a 20dB-loss channel. The ISI after h3 is removed by the CTLE in advance. Once the CDR logic locks the sampling clock at h1=0.5h0, the residual ISI (i.e., h-1 and h2) can be fully eliminated by the 3-tap FFE, achieving a well-shaped 1+0.5D respon
248、se. Note that h2 can be directly canceled by the large amount of h1 (=0.5h0) through the post-tap, and therefore only a 3-tap rather than a 4-tap FFE is required. The simulation result also shows the difference between the SBRs with and without the pre-shift voltage, revealing less than 2.5mV variation. It
249、 shows that the pre-shift circuit can provide a wideband response without hurting the bandwidth. Figure 7.6.4 depicts the block diagram of the 1-tap look-ahead DFE. The outputs of the 1b and 3b ADCs are D3 and D2 to D0, respectively. The 1+0.5D PAM-4 decoder resolves the speculative PAM-4 MSB/LSB data Q+3-3
250、,MSB/Q+3-3,LSB for the 4 possible previous symbols (+3, +1, -1, and -3). The truth tables as well as the derived Boolean expressions for the MSB path are shown at the top. This method can generate speculative PAM-4 binary data for the following look-ahead DFE logic without the use of adders, relaxing the timi
251、ng criteria and decreasing the power consumption of the decoding logic. The LSB decoder is implemented in the same manner. To make the DFE look-ahead logic operate at 1/24 rate, a group of six 4-to-1 MUXs forms a look-ahead DFE unit (LA6). Therefore, 4 identical LA6 units can accomplish 24-way speculative PAM
252、-4 data selection, so that the critical timing constraint of the 1-tap DFE becomes tckq+4tmux+tsetup≤24UI. Pipelined DFFs are properly inserted in the LA6 unit to meet the timing criteria with minimum power overhead. The RX is fabricated in 28nm CMOS technology and tested on the probe station with a GSSG pro
253、be. The RX consumes a total power of 219.2mW from 1.2V, 1V, and 0.9V supplies. The measurement results are shown in Fig. 7.6.5. Two stripline channels with lengths of 11.2-inch and 20.1-inch (21.2dB and 34.6dB Nyquist loss) are used to investigate the RX performance under 106.25Gb/s PAM-4 and 56Gb/s NRZ input signals, respecti
254、vely. The input PRBS7 pattern is delivered from the Anritsu MP1900A pattern generator without any equalization. The RX bathtub curves are measured by adjusting the PI code from the nominal CDR locking point. With the 3-tap RX FFE applied, the BERs for the 106.25Gb/s PAM-4 and 56Gb/s NRZ measurements are improve
255、d from 1e-6 and 1e-8 to 50dB Channel Loss in 5nm FinFET, ISSCC, pp. 116-117, Feb. 2022. [5] H. Park et al., A 4.63pJ/b 112Gb/s DSP-Based PAM-4 Transceiver for a Large-Scale Switch in 5nm FinFET, ISSCC, pp. 110-111, Feb. 2023. [6] J. Hudner et al., A 112Gb/s PAM4 Wireline Receiver using a 64-way Time-Interleaved SAR ADC
256、 in 16nm FinFET, IEEE Symp. VLSI Circuits, pp. 47-48, June 2018. [7] A. Cevrero et al., A 100Gb/s 1.1pJ/b PAM-4 RX with Dual-Mode 1-Tap PAM-4/3-Tap NRZ Speculative DFE in 14nm CMOS FinFET, ISSCC, pp. 112-113, Feb. 2019. [8] C. Menolfi et al., A 112Gb/s 2.6pJ/b 8-Tap FFE PAM-4 SST TX in 14nm CMOS, ISSCC, pp. 104-105, Feb
257、. 2018. [9] B. Ye et al., A 2.29pJ/b 112Gb/s Wireline Transceiver with RX 4-Tap FFE for Medium-Reach Applications in 28nm CMOS, ISSCC, pp. 118-119, Feb. 2022. [10] T. Norimatsu et al., A 25Gb/s Multi-Standard Serial Link Transceiver for 50dB-loss Copper Cable in 28nm CMOS, ISSCC, pp. 60-61, Feb. 2016. [11] T. Shibasaki et a
258、l., A 56-Gb/s Receiver Front-End with a CTLE and 1-tap DFE in 20-nm CMOS, IEEE Symp. VLSI Circuits, pp. 1-2, June 2014. [12] B.-J. Yoo et al., A 56Gb/s 7.7mW/Gb/s PAM-4 Wireline Transceiver in 10nm FinFET Using MM-CDR-Based ADC Timing Skew Control and Low-Power DSP with Approximate Multiplier, ISSCC, pp. 122-1
259、23, Feb. 2020. [13] Z. Wang et al., A High-Accuracy Multi-Phase Injection-Locked 8-Phase 7GHz Clock Generator in 65nm with 7b Phase Interpolators for High-Speed Data Links, ISSCC, pp. 186-187, Feb. 2021. [14] T. Shibasaki et al., A 56Gb/s NRZ-Electrical 247mW/lane Serial-Link Transceiver in 28nm CMOS, ISSCC, pp. 64-65,
260、 Feb. 2016. [15] A. Cevrero et al., A 64Gb/s 1.4pJ/b NRZ Optical-Receiver Data-Path in 14nm CMOS FinFET, ISSCC, pp. 482-483, Feb. 2017. [16] R. Yousry et al., A 1.7pJ/b 112Gb/s XSR Transceiver for Intra-Package Communication in 7nm FinFET Technology, ISSCC, pp. 180-181, Feb. 2021. [17] R. Shivnaraine et al., A 26.5625-to-106.
261、25Gb/s XSR SerDes with 1.55pJ/b Efficiency in 7nm CMOS, ISSCC, pp. 182-183, Feb. 2021. [18] G. Gangasani et al., A 1.6Tb/s Chiplet over XSR-MCM Channels using 113Gb/s PAM-4 Transceiver with Dynamic Receiver-Driven Adaptation of TX-FFE and Programmable Roaming Taps in 5nm CMOS, ISSCC, pp. 122-123, Feb. 2022.
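As a quick numerical check of the 1+0.5D decoding concept described for paper 7.6 (Fig. 7.6.1), the short Python sketch below enumerates the speculative decision levels. It is only an illustration under stated assumptions: ideal PAM-4 symbols of ±1 and ±3, nominal eye thresholds at 0 and ±2h0, and h0 normalized to 1. The names `speculative_levels`, `SYMBOLS`, and `EYE_THRESHOLDS` are hypothetical and do not come from the paper.

```python
from fractions import Fraction

H0 = Fraction(1)              # main cursor, normalized (assumption)
H1 = Fraction(1, 2) * H0      # 1+0.5D shaping: h1 = 0.5*h0
SYMBOLS = (-3, -1, 1, 3)      # ideal PAM-4 symbol values (assumption)
EYE_THRESHOLDS = (-2, 0, 2)   # nominal PAM-4 thresholds in units of h0


def speculative_levels():
    """All speculative DFE levels: 3 thresholds for each of 4 previous symbols."""
    return [t * H0 + H1 * prev for prev in SYMBOLS for t in EYE_THRESHOLDS]


levels = speculative_levels()
unique = sorted(set(levels))             # overlapping levels collapse here
shifted = [lv + H1 for lv in unique]     # pre-shift by Vshift = 0.5*h0

print(len(levels), len(unique))          # total vs. distinct levels
print([float(lv) for lv in unique])      # distinct levels before the pre-shift
print([float(lv) for lv in shifted])     # integer-spaced levels after the pre-shift
```

Under these assumptions the distinct levels sit at odd multiples of 0.5h0, and the pre-shift moves them onto integer multiples of h0, which is what lets a 3b ADC cover seven of them and a single extra slicer cover the eighth.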
262、7.7 A 50Gb/s Burst-Mode NRZ Receiver with 5-Tap FFE, 7-Tap DFE and 15ns Lock Time in 28nm CMOS for Symmetric 50G-PON Boyang Zhang*1,2, Tianchen Ye*1, Shuaizhe Ma3, T
263、ianyuan Zhong1, Xin Liu1, Feiyang Zhang3, Bingyi Ye4, Dan Li3, Weixin Gai1,2 1Peking University, Beijing, China 2Beijing Advanced Innovation Center for Integrated Circuits, Beijing, China 3Xi'an Jiaotong University, Xi'an, China 4East China Normal University, Shanghai, China *Equally-Credited Authors (ECAs) With the
264、 growing demand for broadband services, the 50G passive optical network (PON) has become the future direction of optical access networks. As the baud rate rises to 50G, the limited bandwidth of photonic devices and chromatic dispersion degrade the signals more severely. In [1], a 41-tap feedforward equalizat
265、ion (FFE) is used to handle the complex inter-symbol interference (ISI). Meanwhile, in upstream PON, multiple optical network units (ONUs) transmit data to one optical line terminal (OLT) in burst mode, requiring OLTs to respond rapidly. In [2], a burst-mode receiver with 6.8ns CDR lock time is proposed, but it onl
266、y adopts 1-tap decision-feedback equalization (DFE), which is inadequate to achieve reliable communication in a 50G-PON system. In addition, it utilizes a p-i-n photodiode, in contrast to the avalanche photodiode (APD) preferred in ITU-T 50G-PON standards [3] due to its high conversion gain. However, the APD and
267、the low optical power in 50G-PON introduce more noise [4]. To achieve burst reception under large ISI and poor SNR in symmetric 50G-PON systems, this work proposes a 50Gb/s burst-mode NRZ receiver in 28nm CMOS with a 5-tap FFE, including 3 pre-taps and 1 post-tap, and a 7-tap DFE, including 4 floating taps ran
268、ging up to the 16th post-cursor. Edge equalization is realized for accurate fast lock, achieving a lock time of 15ns. To address severe ISI and rapid lock requirements simultaneously in 50G-PON OLTs, equalization settings are trained and stored during each ONU's registration to accelerate CDR lock and conver
269、gence in burst-mode operation [5]. Reported burst-mode receivers [2,6], as shown in the top-left of Fig. 7.7.1, achieve fast phase lock by finding signal edges during the preamble stage and assuming a lock phase at a distance of 1/2UI. However, this assumption fails because they observe the waveforms before equali
270、zation, whose edge locations can shift significantly after equalization, as depicted in the bottom-left of Fig. 7.7.1. The equalized single-bit response (SBR) is basically symmetrical because the 1st pre- and post-cursor ISI terms are eliminated. This results in similar amplitudes at the 1/2 and -1/2UI positions, r
271、endering the edge location close to 1/2UI. In contrast, the un-equalized SBR cannot assure symmetry due to ISI, leading to different amplitudes at the 1/2 and -1/2UI positions and consequently causing an offset of the edge location. In a real 50G-PON system, this offset can reach 0.19UI, as shown in the bottom-right of
272、 Fig. 7.7.1. This may result in additional time for CDR and equalizer convergence after the burst-mode CDR (BM-CDR) is initially locked, increasing the bit-error rate (BER) at the beginning of the payload under narrow-eye conditions in a 50G-PON system. Therefore, the 1st pre- and post-tap equalization is ena
273、bled during the preamble stage in this work, as shown in the top-right of Fig. 7.7.1, to realize edge equalization for better lock accuracy. An FFE rather than a DFE is employed for the 1st post-tap because the noise around the edges would induce erroneous decisions in the DFE, leading to burst errors and unstab
274、le edges. In addition, it avoids the extra hardware of a look-ahead DFE and the tight feedback timing of a direct-feedback DFE. Figure 7.7.2 shows the architecture of the proposed RX, which operates in a 4-way time-interleaved manner and utilizes a baud-rate CDR structure to simplify circuits and clock
275、distribution. A retimer [7] samples the 50Gb/s signal amplified by a variable-gain amplifier (VGA), providing input signals for the 5-tap FFE of the hybrid equalizers. In addition to the five transconductance (Gm) cells for the FFE, each hybrid equalizer also contains seven DFE cells for 3 fixed DFE taps, serving as
276、the 2nd to 4th post-taps, and 4 floating DFE taps. The output of each hybrid equalizer is resolved by a data slicer followed by a latch, and by an error slicer for adaptation. To meet the tightest feedback timing, the 2nd post-tap is directly driven by the data slicers, while the 3rd and 4th post-taps are latch-driven. A
277、 floating-tap selector, shown at the bottom of Fig. 7.7.2, generates the feedback signals of the floating taps. Four SR latches in each slice save the quarter-rate data from the slicer in turn with four-phase 1/16-rate EN[3:0] signals and delay this data for 16UI. Compared with shift registers [8], the parallel
278、structure lowers the latch speed, thereby reducing power. The position of each floating tap is determined by selecting data from the latches. A current-mode logic (CML) divider converts the off-chip 25GHz differential clocks into four-phase 12.5GHz clocks, which are then phase-adjusted by a 7b-resolution phas
279、e interpolator (PI). The CML clocks are then converted into CMOS clocks for the RX data path, as well as for the adaptation logic after being divided down to 1.5625GHz. The adaptation logic receives the 1/32-rate data and error from two 4:32 demultiplexers and the output of the signal detection (SD) circuits at the i
280、nput node of the RX, controlling the PI codes and equalization coefficients. Recovered data and clocks are sent off-chip through a 50Gb/s transmitter and clock buffers, respectively. Figure 7.7.3 shows the block diagram of the adaptation logic and the principle of the proposed baud-rate BM-CDR method, compat
281、ible with both 0101 and 0011 preamble patterns. A finite-state machine (FSM) controls the overall behavior. When the DETECTED signal from the SD goes high at the start of the preamble stage, the FSM activates and overrides the PI codes and equalization coefficients after waiting a programmable number of cycles f
282、or TIA settling. Only the main tap and the 1st pre- and post-taps of the FFE are turned on during BM-CDR operation. The coefficients use the pre-stored values and remain constant throughout the BM-CDR operation. Four-phase clocks sweep and sample the waveform from a pre-set initial phase with a 1/16UI step to pursue a s
283、hort lock time while maintaining accuracy. An edge is detected when the polarity of the samples flips. Unlike [2], which observes only a single edge, this work utilizes all edges to alleviate the impact of low SNR and enhance the lock accuracy, as shown in the top-right of Fig. 7.7.3. There are two ways to de
284、tect all edges: summing sample results with a majority voter or counting whether the polarity of half of the samples has flipped. The structure with a lower σ of the locking phase is chosen. Once the edge is detected, the sampling phase shifts 1/2UI to the center of the symbol by moving 3 extra steps due to the CDR loo
285、p latency, concluding the BM-CDR operation. Mueller-Müller CDR logic and least-mean-square (LMS) logic take over afterward for phase tracking and equalization updates. Pattern filters are used in the LMS logic so that the algorithm does not converge to a false state when the BM-CDR has completed but the preamble ha
286、s not ended. The schematic of the hybrid equalizer is shown in Fig. 7.7.4. Both FFE Gm cells and DFE tap cells employ differential pairs with digitally controlled variable tail currents to tune coefficients. FFE Gm cells utilize dual differential pairs with cross-connected outputs for dual-polarity coeffi
287、cients. In contrast, DFE tap cells use single-pole-double-throw switches to minimize capacitive loading on the summing node. Unlike the switch transistors stacked above the bias transistors in [9], which have large parasitic Cgd because the switch transistors work in the linear region, this design positions the bias transistors a
288、bove the switch transistors to reduce parasitics, as bias transistors in saturation exhibit minimal Cgd. This modification enhances the equalizer's bandwidth and decreases the delay of the equalizer output by 4ps, as illustrated on the right of Fig. 7.7.4. This work is fabricated in 28nm CMOS technology. The top o
289、f Fig. 7.7.5 presents the test optical link, including a 1270nm direct-modulation laser (DML), a commercial APD, and a TIA. An AWG repeatedly generates the burst-mode data packets and reset pulses for the TIA. Each data packet contains a 50ns 0011 preamble and 3µs of PRBS15 payload data. The first 30ns of the preamble is a
290、llocated for the TIA to stabilize, leaving 20ns for RX BM-CDR operation. The middle-left of Fig. 7.7.5 shows the waveforms of the reset pulse and TIA output. Recovered data and clocks of the RX are sent to a real-time oscilloscope for BER checking and clock phase monitoring. Clock phase records under two cas
291、es at -11dBm received optical power (ROP) are presented in the middle-right of Fig. 7.7.5. With 1st pre- and post-cursor equalization enabled during the preamble stage, the clock phase locks within 10ns and remains stable, whereas disabling preamble-stage equalization results in a prolonged settling period
292、after the phase initially locks, extending the total lock time to 51ns. The bottom-left of Fig. 7.7.5 displays the lock time results at -19dBm ROP, obtained by sweeping the initial phase of the AWG synchronized with the RX clock. The maximum lock time is 15ns. The burst-mode BER of the payload data is measured by cap
293、turing 20 random data packets with 10ppm frequency offset and calculating the average BER of the entire payload data. Results at -11, -15, and -19dBm are annotated in the bottom-right of Fig. 7.7.5, and the sensitivity curve measured in continuous mode with 0ppm frequency offset is plotted as well. The burst
294、-mode BER closely matches the continuous-mode BER at equivalent ROP, and the sensitivity of the RX reaches -24dBm at the 1e-2 BER limit of 50G-PON [10]. Figure 7.7.6 summarizes the performance of this RX and prior works. This work implements a 7-tap DFE to take advantage of the fact that a DFE does
295、not amplify noise. A sensitivity of -24dBm for a targeted BER of 1e-2 is reached with fewer taps, a DML, which is inferior to an external modulation laser (EML), and 1/10th of the power dissipation compared with DSP-based demonstrations. It also supports burst reception at 50Gb/s. Previous burst-mode optical receivers
296、use PIN photodiodes and lack equalization capabilities, and thus are not compliant with 50G-PON standards. The proposed RX with the novel baud-rate BM-CDR method realizes edge equalization to enhance lock accuracy under large ISI and poor SNR, achieving a maximum lock time of 15ns under the lowest optical power
297、with an APD and a 1270nm DML compatible with 50G-PON. Figure 7.7.7 shows the die micrograph. Acknowledgement: This work was supported in part by the National Key Research and Development Program of China under Grant 2022YFB2803301. Figure 7.7.1: Edge detection in [2,6] and in this work. The impact of th
298、e equalization on edge location. Figure 7.7.2: The architecture of the receiver; the block diagram and timing of the floating-tap selector. Figure 7.7.3: The block diagram of the adaptation logic. The principle and the transient simulation of the burst-mode CDR operation. Figure 7.7.4: The schematic o
299、f the hybrid equalizer. Figure 7.7.5: Measurement results. Figure 7.7.6: Performance comparison. ISSCC 2025 / February 17, 2025 / 4:25 PM F
300、igure 7.7.7: Die photo. References: [1] B. Li et al., “DSP enabled next generation 50G TDM-PON,” Journal of Optical Communications and Networking, vol. 12, no. 9, pp. D1-D8, Sep. 2020. [2] I. Ozkaya et al., “A 56 Gb/s burst-mode NRZ optical receiver with 6.8 ns power-on and CDR-lock time for adaptive optical links
301、in 14 nm FinFET CMOS,” ISSCC, pp. 266-268, Feb. 2018. [3] ITU-T, “G.9804.3: 50-Gigabit-capable passive optical networks (50G-PON): Physical media dependent (PMD) layer specification,” Online. Available: https://www.itu.int/rec/T-REC-G.9804.3-202109-I. [4] O. Kharraz and D. Forsyth, “Performance comparisons between PIN and A
302、PD photodetectors for use in optical communication systems,” Optik, vol. 124, no. 13, pp. 1493-1498, July 2013. [5] P. Torres-Ferrera et al., “Overview of high-speed TDM-PON beyond 50 Gbps per wavelength using digital signal processing,” Journal of Optical Communications and Networking, vol. 14, no. 12, pp. 982-996, Dec
303、. 2022. [6] A. Rylyakov et al., “A 25 Gb/s Burst-Mode Receiver for Low Latency Photonic Switch Networks,” IEEE JSSC, vol. 50, no. 12, pp. 3120-3132, Dec. 2015. [7] K. Sheng et al., “A 4.6-pJ/b 200-Gb/s Analog DP-QPSK Coherent Optical Receiver in 28-nm CMOS,” IEEE JSSC, vol. 58, no. 1, pp. 45-56, Jan. 2023. [8] P. Aziz, “Shift Regist
304、er Multi-Phase Clock Based Downsampled Floating Tap DFE for Serial Links,” ISCAS, pp. 2469-2472, June 2014. [9] B. Zhang et al., “A 112-Gb/s Serial Link Transceiver With Three-Tap FFE and 18-Tap DFE Receiver for up to 43-dB Insertion Loss Channel in 7-nm FinFET Technology,” IEEE JSSC, vol. 59, no. 1, pp. 8-18, Jan. 2
305、024. [10] “IEEE Standard for Ethernet Amendment 9: Physical Layer Specifications and Management Parameters for 25 Gb/s and 50 Gb/s Passive Optical Networks,” IEEE Std 802.3ca-2020, pp. 1-267, July 2020. [11] J. Zhang et al., “Real-Time FPGA Demonstration of PAM-4 Burst-Mode All-Digital Clock and Data Rec
306、overy for Single wavelength 50G PON Application,” Optical Fiber Communications Conference and Exposition, pp. 1-3, Mar. 2018. [12] J. Li et al., “First Real-Time Symmetric 50G TDM-PON Prototype with High Bandwidth and Low Latency,” Opto-Electronics and Communications Conference, pp. 1-4, Jul. 2023. [13] Z. Gu and X. Bi
307、, “28 Gbaud PAM-4 Burst-Mode CDR With Reconfigurable Sampling Scheme,” IEEE TCAS-I, vol. 70, no. 5, pp. 1917-1927, May 2023. [14] M. Verbeke et al., “A 25 Gb/s All-Digital Clock and Data Recovery Circuit for Burst-Mode Applications in PONs,” Journal of Lightwave Technology, vol. 36, no. 8, pp. 1503-1509, Apr. 15, 2018.
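The preamble-stage phase search described for paper 7.7 (sweep the sampling phase in 1/16UI steps, detect the edge from a polarity flip across all samples, then move 1/2UI to the symbol center) can be illustrated with a toy Python model. This is a sketch under strong assumptions, not the authors' implementation: it uses an ideal, noise-free 0101 preamble, ignores equalization and the CDR loop latency mentioned in the text, and the names `sample_polarities`, `find_lock_phase`, and `edge_step` are hypothetical.

```python
STEPS_PER_UI = 16  # sweep resolution of 1/16UI, as stated in the text


def sample_polarities(phase_step, edge_step, n_bits=8):
    """Polarity (+1/-1) of baud-rate samples of an ideal 0101 preamble.

    phase_step: sampling phase in 1/16UI steps.
    edge_step:  position of the data edges within the UI, in the same steps
                (unknown to the receiver; part of the toy model).
    """
    out = []
    for k in range(n_bits):
        pos = k * STEPS_PER_UI + phase_step - edge_step
        bit = (pos // STEPS_PER_UI) % 2      # which preamble bit is sampled
        out.append(1 if bit else -1)
    return out


def find_lock_phase(edge_step, start_step=0):
    """Sweep the phase and return the lock phase (in steps, modulo 1UI)."""
    prev = sample_polarities(start_step, edge_step)
    for n in range(1, STEPS_PER_UI + 1):
        step = start_step + n
        cur = sample_polarities(step, edge_step)
        # Majority vote over all samples: negative sum means most flipped.
        vote = sum(p * c for p, c in zip(prev, cur))
        if vote < 0:
            # Edge found: shift by 1/2UI to reach the symbol center.
            return (step + STEPS_PER_UI // 2) % STEPS_PER_UI
        prev = cur
    return None


print(find_lock_phase(edge_step=5))
```

Voting over all samples rather than a single one is what would make the detection tolerant to isolated noisy samples in a real receiver; in this noise-free model every sample flips at the same step.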
308、7.8 A Reference-less CDR Using SAR-Based Frequency-Acquisition Technique Achieving 55ns Constant Band-Searching Time and up to 63.64Gb/s/µs Acquisition Spee
309、d Xiaoteng Zhao, Yuhao Zhang, Hao Chang, Yilong Dong, Chenxi Han, Zekai Yang, Zhicheng Dong, Rong Zhou, Shubin Liu, Zhangming Zhu Xidian University, Xi'an, China Reference-less clock and data recovery (CDR) circuits are widely employed in applications that demand a continuous data rate due to their simplicit
310、y and cost-effectiveness, as they eliminate the need for an external reference [1-6]. The frequency acquisition (FA) speed of such CDRs is critical for link setup, particularly in systems with dynamic data traffic management that necessitate frequent transitions from sleep to wake-up [7,8]. According to [9], par
311、t of the links can be turned off without congesting the network, even in high-load scenarios like data centers. However, the interface wake-up delay has been identified as the most significant bottleneck in the overall network. To address this issue, this paper proposes a fast-FA reference-less CDR using
312、 a successive approximation register (SAR)-based frequency band switching technique, achieving a constant band switching time of 55ns. With the help of the proposed charge pump (CP) with positive net current, a total FA time of less than 150ns is achieved, which is at least 4× shorter than comparable work
313、. Most of the prior FA schemes rely on closed-loop operation [1-4,6], utilizing UP/DN signals from the phase detector (PD) to control the charge pump (CP) and the subsequent loop filter (LF), which limits the improvement of FA speed. In Case 1 of Fig. 7.8.1, a deliberate-mismatch-based PD is reported to achieve a lo
314、ck time of less than 6µs [3]. Unfortunately, its effectiveness is curtailed by the unidirectional frequency search method. As illustrated in Case 2, a frequency detector (FD) is introduced to produce the polarity indication signals (FAST/SLOW) of the frequency error (Δf), accelerating the FA over a wide data-rate ran
315、ge. Regrettably, the lock time shows a pronounced positive correlation with Δf, reaching up to 25µs with a substantial initial Δf [2], where Δf equals the clock frequency (FreCK) minus the data rate (fDR) in the full-rate topology. In contrast, Ref. [5] reports a constant lock time of 0.6µs using an open-loop autocorrelation
316、-based FA technique. However, the maximum reported frequency deviation is only thousands of ppm. The concise schematic of the proposed method to speed up FA is shown in Case 3 of Fig. 7.8.1. It is based on the key fact that the SLOW signal from the proposed BBPFD always remains zero when Δf>0, as shown in F
317、ig. 7.8.1 top right. This implies that SLOW can be used as the indicator of the FA direction with unlimited frequency detection range [2]. Note that the simulation results of SLOW versus Δf are obtained by sweeping the initial phase error between the clock and data at each frequency point, assuring its FA robu
318、stness against initial phase error.The characteristic of SLOW leads to the key idea of the proposed frequency band switching technique,which is the first step of FA in the proposed reference-less CDR,where the proposed SAR logic performs several iterations of band switching,guided by the SLOW signal
319、,effectively completing the majority of the FA task.In the subsequent two steps(in-band search and phase track),the CDR reduces the residual f using the proposed net-current CP.Figure 7.8.2 shows the overall architecture of the proposed fast-locking CDR,which consists of the proposed SLOW-only bang-
320、bang phase and frequency detector(BBPFD),a CP with positive net current path,a multi-phase clock generator,a loop filter(LF),and the proposed control logic.Figure 7.8.2 bottom outlines two FA processes.At the beginning,mode switch(MDSW)and band switch enable signal(BDSW_EN)are set to 0 and 1,respect
321、ively,by enabling the active-high reset signal(RSTP).Accordingly,CP is inactive and VCONT is pulled down to 0 by SW1.In the 1st FA step,the BBPFD detects the phase information between input data and clock to judge the polarity of f by SLOW.Then,the slow detector in digital logic compares the number
of SLOW in a specific period of time with the pre-set threshold word (SLOW_THRESHOLD) to confirm the slow state (SLOW_SET=1). Based on each SLOW_SET, the proposed SAR logic sequentially determines the frequency band bits (Bit<3:0>), pushing the voltage-controlled oscillator (VCO) to the target band. Afterwards, the CDR enters the in-band searching step by setting BDSW_EN to 0. As a result, the CP with positive net current charges the LF to increase VCONT. When FreCK approaches the target frequency, the lock detector sets MDSW to 1 based on the UP/DN characteristics near locking, turning off the net current path in the CP and entering the 3rd acquisition step.
The schematic of the proposed SAR logic is shown in Fig. 7.8.3, left. The tunable timer regulates the timing for each band bit switch, while the core SAR logic is determined by Q0, Q1, and SLOW_SET according to the truth table. When Q2=0, the band switching enable signal (BDSW_EN) is valid. Upon reaching the preset threshold, the tunable timer generates a pulse on BDSW_TRIGGER, updating the outputs of the SAR logic. Meanwhile, the counter increments by 1 to refresh Q0 and Q1 for the subsequent bit. When Q2=1, the 4-bit switching process is
complete. Consequently, BDSW_EN is reset to 0, deactivating the FF0-3 trigger signal and locking Bit. Two examples of the SAR band switching process are depicted in Fig. 7.8.3, right, where the initial band is Band8 and the target frequency falls in Band11 and Band10, respectively. The first SLOW_SET is 1, indicating that FreCK is below the target frequency. Therefore, Bit 3 is set to 1, causing a jump to Band12. At this point, FreCK exceeds 1/2 fDR, leading to a low-level 2nd SLOW_SET. In a similar way, all 4 bits are sequentially determined in the SAR manner, ultimately identifying the target band. For clarity, Fig. 7.8.3, right, also includes a table outlining the bit switching principles. Additionally, the proposed LSB switching logic makes it possible to return to a band that was previously passed, preventing an erroneous frequency band switch.
Figure 7.8.4 presents two complete FA processes measured by a real-time
oscilloscope under 14.2 and 7.3Gb/s 2^7-1 PRBS patterns, respectively. For the up-searching process (Case A), when RSTP=1, the CDR is under reset and the initial frequency band is Band8. When RSTP falls to 0, the CDR enters the 1st FA step. Since the target band is Band15, FreCK is always less than half of fDR. Therefore, 4 SLOW_SET signals are generated, driving FreCK upwards in the SAR manner in only 55ns (Fig. 7.8.4, top left), corresponding to an FA speed of 63.64Gb/s/μs. Subsequently, BDSW_EN goes low and the CDR enters the in-band searching step. VCONTB increases from 0 to 730mV within 65ns
by the proposed CP to reach the target frequency (7.1GHz). The entire FA process takes only 120ns. MDSW is then set to 1 by the proposed lock detector, and the CDR enters the phase-tracking mode to align the data and recovered clock like a conventional BBCDR loop. Figure 7.8.4, bottom left, shows a down-searching process (Case B). The entire FA takes 150ns, achieving an FA speed of 16.67Gb/s/μs, with the band switch time again being 55ns. Note that SLOW_SET remains zero as desired, since FreCK is always larger than the target frequency in the 1st FA step, which also verifies the SLOW characteristic in Fig. 7.8.1.
Figure 7.8.5, left, shows the recovered eye diagrams at 7.3, 10.4, and 14.2Gb/s input data, respectively. With 14.2Gb/s 2^7-1 PRBS data, the recovered clock spectrum at 7.1GHz is shown in Fig. 7.8.5, top right. The jitter tolerance (JTOL) measured under the same condition is also presented in Fig. 7.8.5, bottom right, and is superior to the requirements in PCIe 4.0 and IEEE 802.3 mask
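
The SAR band switching step described for Fig. 7.8.3 can be illustrated with a short behavioral sketch in Python. This is not the authors' logic or truth table: `slow_set` is a hypothetical, idealized stand-in for the slow detector, and the model assumes that VCONT is held at 0 during band switching (so each band sits at its lowest frequency) and that frequency rises monotonically with the band index. Under those assumptions, the clock is judged slow whenever the trial band is at or below the target band.

```python
from typing import List, Tuple

N_BITS = 4  # the text describes a 4-bit band word, Bit<3:0>


def slow_set(band: int, target_band: int) -> int:
    """Hypothetical idealized slow detector (an assumption, not from the paper).

    Returns 1 when the clock in the trial band is judged slower than the
    target, modeled here as the trial band being at or below the target band.
    """
    return 1 if band <= target_band else 0


def sar_band_search(target_band: int) -> Tuple[int, List[int]]:
    """Return (final band, bands visited) for a 4-bit SAR search from Band8."""
    band = 1 << (N_BITS - 1)            # start with only the MSB set: Band8
    visited = [band]
    for bit in range(N_BITS - 1, -1, -1):
        if not slow_set(band, target_band):
            band &= ~(1 << bit)         # clock too fast: clear the trial bit
        if bit > 0:
            band |= 1 << (bit - 1)      # set the next lower bit as the new trial
            visited.append(band)
    return band, visited


if __name__ == "__main__":
    for target in (11, 10, 15):
        final, path = sar_band_search(target)
        print(f"target Band{target}: visited {path}, final Band{final}")
```

In this model, targets in Band11 and Band10 both pass through Band8, Band12, Band10, and Band11, and the Band10 case steps back from Band11 on the last decision, which mirrors the return-to-a-passed-band behavior attributed to the LSB switching logic. A target in Band15 produces four consecutive SLOW_SET=1 decisions, consistent with the description of Case A.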