SESSION 34 - Compute-In-Memory.pdf


ISSCC 2024, SESSION 34: Compute-In-Memory

34.1: A 28nm 83.23TFLOPS/W POSIT-Based Compute-in-Memory Macro for High-Accuracy AI Applications
Yang Wang¹, Xiaolong Yang¹, Yubin Qin¹, Zhiren Zhao¹, Ruiqi Guo¹, Zhiheng Yue¹, Huiming Han¹, Shaojun Wei¹, Yang Hu¹, Shouyi Yin¹,²
¹Tsinghua University, Beijing, China; ²Shanghai AI Laboratory, Shanghai, China
2024 IEEE International Solid-State Circuits Conference

Outline
- Background and Motivation
- Challenges of POSIT-Based CIM Macro
- Proposed POSIT-CIM Macro Features
  - Bi-directional Regime Processing Codec
  - Critical-bit Pre-compute-and-store CIM Array
  - Cyclically-alternating Scheduling Adder Tree
- Measurement and Comparison
- Conclusion

FP-CIM for High-Accuracy AI Applications
Recent AI tasks (keyword spotting, face recognition, AI assistants, automated driving, AIGC) are becoming increasingly complex, and complex AI applications require FP-CIM for high accuracy.

Limitation of Conventional FP Data Format
Conventional FP formats cannot achieve high accuracy at low power: their fixed exponent/mantissa split (FP16, BF16) does not match varied data distributions.

Principle of POSIT Data Format
POSIT exploits dynamic bit-widths, with regime, exponent, and mantissa fields of variable length, to adapt to varied data distributions.

Conventional FP vs. POSIT
POSIT8 saves 27% energy with only 0.4% accuracy loss compared to FP16.
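POSIT's tapered precision can be made concrete with a small decoder. The sketch below is a generic software posit(n, es) decoder, not the paper's hardware, assuming the standard field layout: a sign bit, a run-length-coded regime, up to es exponent bits, and the remaining fraction bits with a hidden leading 1.

```python
def decode_posit(word: int, n: int = 8, es: int = 1) -> float:
    """Decode an n-bit posit with es exponent bits into a Python float."""
    mask = (1 << n) - 1
    word &= mask
    if word == 0:
        return 0.0
    if word == 1 << (n - 1):
        return float("nan")                # NaR ("not a real")
    sign = -1.0 if word >> (n - 1) else 1.0
    if sign < 0:
        word = (-word) & mask              # negative posits: 2's complement
    # Regime: run of identical bits after the sign, ended by one flipped bit.
    rbit = (word >> (n - 2)) & 1
    run, pos = 0, n - 2
    while pos >= 0 and ((word >> pos) & 1) == rbit:
        run, pos = run + 1, pos - 1
    pos -= 1                               # skip the regime terminator
    k = run - 1 if rbit else -run          # regime value
    # Exponent: up to es bits, zero-padded when the field is truncated.
    exp = 0
    for _ in range(es):
        exp <<= 1
        if pos >= 0:
            exp |= (word >> pos) & 1
            pos -= 1
    # Fraction: whatever bits remain, with an implicit leading 1.
    nfrac = pos + 1
    frac = word & ((1 << nfrac) - 1)
    mant = 1.0 + (frac / (1 << nfrac) if nfrac else 0.0)
    useed = 1 << (1 << es)                 # useed = 2**(2**es)
    return sign * float(useed) ** k * 2.0 ** exp * mant
```

With es = 1, useed = 4, so every extra regime bit scales the value by 4x: posits near 1.0 spend their bits on fraction (accuracy), while posits far from 1.0 spend them on range, which is exactly the distribution-adaptive behavior described above.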

Challenge 1: Large Power in Regime Processing
The dynamic regime field increases pre-processing energy by 2.62x.

Challenge 2: Cell Under-utilization in CIM Array
The dynamic mantissa introduces 41.3% CIM-cell under-utilization.

Challenge 3: Redundant Toggle in Adder Tree
Dynamically aligned accumulation incurs 66.8% power waste.

Feature 1: Bi-directional Regime Processing
- Step 1: regime extraction with a leading-1/0 detector.
- Step 2: regime processing with codec and addition.
- The BRPU replaces the codec-addition path with shift-OR processing.
- Same-sign addition: a logic shift increases the 1s/0s run length, and a larger shift code always costs more shift power. The BRPU dynamically decodes the smaller |RB| to shift the larger |RA|, minimizing the shift code and saving 40% of shift energy.
- Different-sign addition: a logic shift decreases the 1s/0s run length; if the shift code exceeds the regime's effective bit-width, it introduces shift error. The BRPU avoids shift overflow, reducing the shift logic by 50%.
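The shift-OR idea can be checked with thermometer-coded runs. The toy model below is my own illustration, not the BRPU circuit: for a regime encoded as a run of leading 1s, extending the run by s positions is a single shift plus OR rather than decode-add-re-encode, and larger shift amounts touch more bit positions.

```python
W = 16                                     # toy word width
MASK = (1 << W) - 1

def thermo(m: int) -> int:
    """m leading 1s in a W-bit word (a regime run of length m)."""
    return (MASK << (W - m)) & MASK if m else 0

def run_length(x: int) -> int:
    """Length of the leading-1s run (what a leading-1/0 detector extracts)."""
    m = 0
    while m < W and (x >> (W - 1 - m)) & 1:
        m += 1
    return m

def regime_add(code: int, s: int) -> int:
    """Extend a leading-1s run by s positions with one shift and one OR.

    Valid for 0 <= s <= run_length(code): same-sign regime addition."""
    return (code | (code >> s)) & MASK
```

Shifting by the smaller magnitude touches fewer bit positions than shifting by the larger one, which is the intuition behind decoding the small |RB| to shift the large |RA|.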

Feature 2: Critical-bit Pre-compute-and-store
- Dynamic mantissa bit-width introduces 48.9% cell waste.
- CPCS uses the spare bits to achieve a dual-bit MAC in each cycle.
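As a rough illustration of reclaiming spare cells (a toy model of the idea, not the CPCS circuit): when two mantissas are each no wider than half a cell row, both fit in one row, so a single array access can feed two multiplies per cycle.

```python
ROW_BITS = 8                       # one CIM cell row in this toy model

def pack_two(m0: int, m1: int, k: int) -> int:
    """Store two k-bit mantissas in one row instead of wasting the spare bits."""
    assert 2 * k <= ROW_BITS and m0 < (1 << k) and m1 < (1 << k)
    return (m1 << k) | m0

def dual_mac(row: int, x0: int, x1: int, k: int) -> int:
    """One row fetch now yields a dual MAC: m0*x0 + m1*x1."""
    m0 = row & ((1 << k) - 1)
    m1 = row >> k
    return m0 * x0 + m1 * x1
```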

Feature 3: Cyclically-alternating Scheduling
- If A and B have no overlapping set bits, A + B equals A | B.
- But even if A0/A1 overlap in a single bit position, A0·W0 + A1·W1 has to use an adder; with synchronous bit-serial computing, every cycle needs adders.
- The CASU cyclically shifts A0 for asynchronous computing with A1, eliminating the overlapping bits in the former cycles of A0·W0 + A1·W1.
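The identity behind Feature 3 is easy to verify in software: an adder's carry chain is only exercised when the operands share set bits, so for disjoint bit patterns an OR gives the exact sum.

```python
import random

def or_sum(a: int, b: int) -> int:
    """a + b computed as a | b; exact only when no bit position overlaps."""
    if a & b:
        raise ValueError("overlapping bits: a real adder (with carries) is needed")
    return a | b

random.seed(0)
for _ in range(1000):
    a = random.getrandbits(16)
    b = random.getrandbits(16) & ~a    # force disjoint bit positions
    assert or_sum(a, b) == a + b       # OR replaces addition exactly
```

Scheduling bit-serial partial products so they land in disjoint bit positions, which is what the CASU's cyclic shift aims to do, is what makes the cheap OR a legal replacement for the adder.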

Chip Photograph and Summary
- One operation (OP) represents one multiplication or addition.
- Notes: 1) highest-efficiency point: 0.65V, 78MHz, 50% input sparsity; 2) baseline 69.76%; 3) baseline 21.30; 4) baseline 80.31%.

Training and Inference Performance
- POSIT(16,2) incurs only 0.04% more training loss than FP32.
- The macro achieves 10.45TFLOPS/W average energy efficiency.

Performance Comparison
Comparison with state-of-the-art FP CIM macros.

Conclusion
An energy-efficient POSIT-based CIM macro with bi-directional regime codec, critical-bit pre-compute-and-store, and cyclically-alternating scheduling, achieving 83.23TFLOPS/W energy efficiency:
- Bi-directional Regime Processing Codec: saves pre-processing energy by replacing the codec with shift-OR.
- Critical-bit Pre-compute-and-store CIM Array: improves CIM utilization by using spare bits for dual-bit MAC.
- Cyclically-alternating Scheduling Adder Tree: reduces accumulation power by simplifying addition to OR.

34.2: A 16nm 96Kb Integer/Floating-Point Dual-Mode-Gain-Cell-Computing-in-Memory Macro Achieving 73.3-163.3TOPS/W and 33.2-91.2TFLOPS/W for AI-Edge Devices
Win-San Vince Khwa*¹, Ping-Chun Wu*², Jui-Jen Wu¹, Jian-Wei Su²,³, Ho-Yu Chen², Zhao-En Ke², Ting-Chien Chiu², Jun-Ming Hsu², Chiao-Yen Cheng², Yu-Chen Chen², Chung-Chuan Lo², Ren-Shuo Liu², Chih-Cheng Hsieh², Kea-Tiong Tang², Meng-Fan Chang¹,²
¹TSMC Corporate Research, Hsinchu, Taiwan; ²National Tsing Hua University, Hsinchu, Taiwan; ³Industrial Technology Research Institute, Hsinchu, Taiwan
2024 IEEE International Solid-State Circuits Conference

Outline
- Challenges of Dual-mode Computing-In-Memory
- Proposed Gain-cell CIM Macro
  - Overview of Integer/Floating-Point Dual-mode CIM Macro
  - Dual-mode Local-computing-cell (DM-LCC)
  - Dual-mode Zone-based Input Processing Unit Scheme (ZB-IPS)
  - Stationary-based Two-port Gain-cell Array Scheme (SB-TP-GCA)
- Performance and Measurement Results
- Conclusion

Motivations of INT/FP Dual-mode CIM
Advanced AI-edge chips desire:
- General purpose and computational flexibility
- High energy efficiency and high area efficiency
- Sufficient inference accuracy
Number formats trade off energy efficiency against inference accuracy:
- Floating-point (FP) mode. Pros: high inference accuracy. Cons: higher energy and more parameters.
- Integer (INT) mode. Pros: higher energy efficiency and fewer parameters. Cons: lower inference accuracy.
(Figure: normalized accuracy loss vs. normalized energy/OP for INT4, INT8, and BF16/FP16 on CIFAR-10, CIFAR-100, and ImageNet with ResNet-18.)
It is therefore desirable to design an INT/FP dual-mode CIM.

Challenges of INT/FP Dual-Mode CIM (1/2)
Low area utilization in integer-mode multiply-and-accumulation (INT-MAC): the floating-point MAC functions, e.g. the exponent-computation and mantissa-alignment circuits, sit idle in INT mode, both in previous FP-CIMs and in a straightforward dual-mode array.

Challenges of INT/FP Dual-Mode CIM (2/2)
- Long system-level latency: NN data updates interrupt computation when concurrent write-and-compute is not supported. This work's two-port design computes on one port while the other handles writes (SRAM: W update; gain cell: W update & refresh), reducing latency.
- High energy consumption: redundant system-to-CIM data transfer during computation. Weight-stationary operation reduces energy by roughly 50x and 100x for CIM-internal accesses and system-to-CIM transfers (ResNet-18 on ImageNet), versus non-stationary operation.

Overview of Dual-mode Gain-cell-CIM Macro
The macro comprises a zone-based input processing unit, WL driver, CTRL, normal IO, a digital shift & adder, and a dual-mode gain-cell computing array of 64 gain-cell computing blocks (GC-CB). Three features:
1. Dual-mode local-computing-cell (DM-LCC): reuses the exponent adder as an adder-tree stage for INT-MAC, improving area efficiency in INT mode.
2. Dual-mode zone-based input processing scheme (ZB-IPS): simplifies the subtraction in exponent computation and reuses the FP-mode alignment circuit for sparsity detection in INT mode, improving energy efficiency and area efficiency.
3. Stationary-based two-port gain-cell array (SB-TP-GCA): supports concurrent data updating and computation and reduces system-to-CIM and CIM-internal data accesses, improving system energy efficiency and latency.

Dual-mode Local-computing-cell (DM-LCC)
Each gain-cell computing block (GC-CB) contains stationary-based gain-cell sub-arrays (SB-GCA), a dual-mode adder (DM-ADD), and a dual-mode multiplier with a dual-mode multiplexer (DM-MUX):
- SB-GCA stores W0[7:0] and W1[7:0] in INT8 mode; S, WM[6:0], WE[7:0] in BF16 mode.
- DM-ADD computes 8b W0 + 8b W1 in INT8 mode; 8b INE + 8b WE in BF16 mode.
- DM-MUX/multiplier computes (8b IN0 x 8b W0) + (8b IN1 x 8b W1) in INT8 mode; 8b INM x 8b WM in BF16 mode.
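A behavioral sketch of the DM-LCC's mode steering may help; this is a software model inferred from the slide's signal names (the hardware shares one 8b adder and one multiplier between the two modes), not the actual circuit.

```python
def dm_lcc(mode: str, op_a: int, op_b: int, in0: int, in1: int):
    """One shared adder + multiplier, steered by mode.

    INT8: op_a/op_b are W0/W1, in0/in1 are IN0/IN1 -> (pSUM, pMACV2).
    BF16: op_a/op_b are INE/WE, in0 is INM, in1 is WM -> (PDE, PDM)."""
    dm_add = op_a + op_b                    # DM-ADD: W0+W1 or INE+WE
    if mode == "INT8":
        pmacv2 = in0 * op_a + in1 * op_b    # dual INT8 MAC per block
        return dm_add, pmacv2
    pdm = in0 * in1                         # BF16 mantissa product INM x WM
    return dm_add, pdm
```

In INT8 mode the same adder that would compute the exponent sum produces W0 + W1, which is why the exponent hardware no longer idles.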

Computation Flow (Floating-Point Mode)
BF16 mode (NACCU = 64); the floating-point MAC has five compute phases:
- Phase 1: compute PDE(n) = INE(n) + WE(n), n = 0..63.
- Phase 2: align INM(n) based on PDE(n).
- Phase 3: compute PDM(n) = INMA(n) x WM(n), n = 0..63.
- Phase 4: accumulate the 64 PDM(n): pMACVM = Σ(n=0..63) INMA(n)·WM(n).
- Phase 5: combine PDE-MAX and pMACVM, and output the FP32 MACV.
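The five phases map directly onto a software model. In the sketch below each operand is a (sign, exponent, 8b mantissa-with-hidden-1) triple rather than a packed BF16 word, and bias handling is folded into the exponents; this is my simplification of the slide's flow, not the macro's datapath.

```python
def fp_mac(ins, ws, frac_bits=7):
    """Model of the 5-phase FP MAC: returns (pMACVM, PDE_MAX).

    Each operand is (sign, exp, mant) with mant = 1.f as an integer
    carrying frac_bits fraction bits (hidden 1 included)."""
    # Phase 1: product exponents PDE(n) = INE(n) + WE(n)
    pde = [i[1] + w[1] for i, w in zip(ins, ws)]
    pde_max = max(pde)                       # PDE-MAX
    acc = 0
    for n, ((si, _, im), (sw, _, wm)) in enumerate(zip(ins, ws)):
        inma = im >> (pde_max - pde[n])      # Phase 2: align INM to PDE-MAX
        pdm = inma * wm                      # Phase 3: PDM(n) = INMA(n) x WM(n)
        acc += -pdm if si ^ sw else pdm      # Phase 4: accumulate the PDM(n)
    # Phase 5: combine PDE-MAX with pMACVM; the caller scales to a real value
    # via acc * 2**(pde_max - 2*frac_bits)
    return acc, pde_max
```

For example, with inputs 1.0 and 3.0 against weights 1.0 and 1.0, the reconstructed value acc * 2**(pde_max - 14) comes out to 4.0, matching the exact dot product.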

Computation Flow (Integer Mode)
INT8 mode (NACCU = 128), 100% utilization, 2x operations; the integer MAC has four compute phases:
- Phase 1: pre-compute pSUM(n) = W0(n) + W1(n), n = 0..63 (first adder-tree stage).
- Phase 2: process IN sparsity and decode IN0(n) & IN1(n) for INT-MAC.
- Phase 3: compute pMACV2(n) = IN0(n) x W0(n) + IN1(n) x W1(n), n = 0..63.
- Phase 4: accumulate the 64 pMACV2(n): MACV = Σ(n=0..127) IN(n)·W(n) (23b MACV).
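Likewise the four INT-mode phases can be modeled in software: 128 INT8 inputs handled as 64 operand pairs, with the pSUM pre-compute standing in for the re-used exponent adder acting as the first adder-tree stage. A toy model of the flow, not the hardware:

```python
def int_mac(ins, ws):
    """Model of the 4-phase INT8 MAC (NACCU = 128 as 64 operand pairs)."""
    assert len(ins) == len(ws) == 128
    acc = 0
    for n in range(64):
        w0, w1 = ws[2 * n], ws[2 * n + 1]
        in0, in1 = ins[2 * n], ins[2 * n + 1]
        psum = w0 + w1                  # Phase 1: pSUM(n), the 1st-stage add
        del psum                        # (shown for the flow; unused in this toy)
        pmacv2 = in0 * w0 + in1 * w1    # Phases 2-3: pMACV2(n)
        acc += pmacv2                   # Phase 4: accumulate 64 pMACV2(n)
    return acc                          # MACV = sum of IN(n)*W(n), n = 0..127
```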


Dual-mode Zone-based Input Processing Unit: Motivation
Two factors hurt (1) area and energy and (2) hardware utilization:
- Use of the full exponent bit-width in computing the number of shifting bits, NSH = PDE-MAX − PDE, with PDE = product exponent = INE + WE (9b): this needs 9b comparators and 9b subtractors per element, and in INT mode this extra exponent-handling hardware sits idle.
- Mantissa bit extension after alignment to suppress truncation data loss: area overhead and low utilization, since the MULT/ADD datapath widens. (Figure: MobileNet-v2/ImageNet accuracy loss vs. extended bit count 0b-8b for alignment, at datapath widths 8b-16b.)

Dual-mode Zone-based Input Processing Unit: This Work
- Zone-based exponent handling
  - Divides the product-exponent spectrum into zones
  - Computes NSH from the difference between PDE and its zone boundary
- Mantissa bit extension with two compute phases, re-using the sparsity-detection hardware
  - No hardware bit extension and higher hardware utilization
In FP mode this doubles the effective mantissa bits for better inference accuracy; in INT mode the same PRE-MULT-SPAR-ADD pipeline is fully used, so no stage idles.

Dual-mode Zone-based Input Processing Unit: Structure
Zone-based Input Processing Unit (ZB-IPU):
- Partial-PDE-MAX finder (pEMAXF): finds the MSB-6b (PDE-MAX[8:3]) of PDE-MAX
- Zone bias unit (ZBU): generates 3 zone references: PDE-REF1 = {PDE-MAX[8:3], 111}, PDE-REF2 = PDE-REF1 - 8, PDE-REF3 = PDE-REF1 - 16
- Zone detector (ZD): classifies each PDE(n) into one of the three zones via a zone flag ZFG(n)
- 64 dual-mode input processing blocks (DM-IPB): align INM according to the zone-shift number (NSHZ), i.e. zone-detect-based alignment (ZDBA)
A step-by-step example is provided later.


Dual-mode Zone-based Input Processing Unit: Example (two-stage FP-MAC operation)
Stage 1: pEMAXF finds the MSB-6b (PDE-MAX[8:3]) of PDE-MAX.
- Example: PDE(0) = 253 = 0_1111_1101 = PDE-MAX; PDE(1) = 243 = 0_1111_0011; PDE(63) = 236 = 0_1110_1100
- ZBU then generates the 3 zone references from PDE-MAX[8:3]: PDE-REF1 = {PDE-MAX[8:3], 111} = 255; PDE-REF2 = PDE-REF1 - 8 = 247; PDE-REF3 = PDE-REF1 - 16 = 239
Stage 2: ZD classifies each PDE(n) into one of the three zones via its zone flag (ZFG).
- Example: PDE(0) = 253 -> ZFG = 1; PDE(1) = 243 -> ZFG = 2; PDE(63) = 236 -> ZFG = 3
- Each DM-IPB computes the zone-shift number NSHZ and aligns INM accordingly, where NSHZ is the inverse of PDE[2:0] (the LSB 3b):
  - ~PDE(0)[2:0] = ~3'b101 = 3'b010 = 2 = PDE-REF1 - PDE(0)
  - ~PDE(1)[2:0] = ~3'b011 = 3'b100 = 4 = PDE-REF2 - PDE(1)
  - ~PDE(63)[2:0] = ~3'b100 = 3'b011 = 3 = PDE-REF3 - PDE(63)
- Note: the 9b subtractor is thereby simplified to a 6b comparison and a 3b inverter
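The zone mechanism can be sketched in Python. The function below reproduces the slide's worked numbers, with the 9b subtraction replaced by the 3b inversion of PDE[2:0] (a behavioral sketch, not the circuit):

```python
def zone_align(PDE):
    """Zone-based exponent handling for 9b product exponents.

    Returns (ZFG, NSHZ) per channel: the zone flag (1..3; zone 3 routes
    to the input-sparsity path) and the zone-shift number, computed as
    the 3b inversion of PDE[2:0] instead of a full 9b subtraction.
    """
    pde_max = max(PDE)
    ref1 = (pde_max >> 3 << 3) | 0b111   # PDE-REF1 = {PDE-MAX[8:3], 111}
    ref2, ref3 = ref1 - 8, ref1 - 16     # PDE-REF2, PDE-REF3
    result = []
    for p in PDE:
        if p > ref2:
            zfg = 1                      # PDE-REF2 < p <= PDE-REF1
        elif p > ref3:
            zfg = 2                      # PDE-REF3 < p <= PDE-REF2
        else:
            zfg = 3                      # sparsity-aware path
        nshz = (~p) & 0b111              # ~PDE[2:0] (= zone ref - p in-zone)
        result.append((zfg, nshz))
    return result
```

The trick works because every zone reference ends in binary 111, so the in-zone distance to the reference is exactly the inverted low 3 bits.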

Three zone cases (FP-MAC):
- PDE with ZFG = 1 (PDE-REF2 < PDE <= PDE-REF1): INM alignments are executed in Phase 1
- PDE with ZFG = 2 (PDE-REF3 < PDE <= PDE-REF2): INM alignments are executed in Phase 2
- PDE with ZFG = 3: triggers the input-sparsity-aware circuit to reduce compute energy
(Table: for each zone, PDE[2:0] determines NSHZ and the resulting Ph1/Ph2 INMA[7:0] shift patterns of IN6..IN0.)

34.3: A 22-nm 64-kb Lightning-like Hybrid Computing-in-Memory Macro with a Compressed Adder Tree and Analog-Storage Quantizers for Transformer and CNNs, 2024 IEEE International Solid-State Circuits Conference

Lightning-like Hybrid CIM Macro
- 2.08x better worst-case energy efficiency (EF) and 1.82x better peak EF, with smaller EF variation compared with DCIM and ACIM
- Better performance due to the Lightning structure: 4x RC saving compared with the vertical-cut structure; 2.99x accuracy improvement compared with the weight-wise-cut structure

- Better performance due to the 4-2 compressor: 2.73x less peripheral overhead and 1.61x less internal bandwidth; 1.71x less area and 1.43x less power cost
- Better EF due to the ASQC: 6.56x energy consumption saving compared with the benchmark
- Application performance based on double regularization: 99.994% error-point change across multiple networks; 22.11x FoM improvement (compared with JSSC'22/'23 and ISSCC'22/'23)

Demo System Setup
- Proposed hybrid CIM macro (FPGA + 64Kb Lightning CIM chip) tested on ResNet and Vision Transformer (ViT)
- ILA: vector testing; oscilloscope: access-time measurement; DC power supply: CIM_VDD supply & power measurement

Conclusion
Features of the proposed Lightning-like hybrid CIM macro:
- Lightning-like hybrid CIM architecture: 4x RC decrease and 2.99x accuracy improvement
- NOR-based 4-2 compressor for high-precision digital calculation: 1.71x less area consumption and 1.43x power saving
- Analog-Storage Quantizer (ASQC) for low-power analog calculation: 6.56x less energy cost and better mitigation of error accumulation
A 22nm 64Kb Lightning-like hybrid CIM macro is verified: 8b-IN, 8b-W, 23b-OUT; access time 4.1ns; 524.7 GOPS/mm2 at 0.9V; 60.8 TOPS/W at 0.6V

Thanks for your kind attention!

34.4: A 3nm 32.5 TOPS/W, 55.0 TOPS/mm2 and 3.78 Mb/mm2 Fully Digital Computing-in-Memory Supporting INT12 x INT12 with Parallel MAC Architecture, 2024 IEEE International Solid-State Circuits Conference

Hidehiro Fujiwara1, Haruki Mori1, Wei-Chang Zhao1, Kinshuk Khare1, Cheng-En Lee1, Xiaochen Peng2, Vineet Joshi3, Chao-Kai Chuang1, Shu-Huan Hsu1, Takeshi Hashizume4, Toshiaki Naganuma4, Chen-Hung Tien1, Yao-Yi Liu1, Yen-Chien Lai1, Chia-Fu Lee1, Tan-Li Chou1, Kerem Akarvardar2, Saman Adham3, Yih Wang1, Yu-Der Chih1, Yen-Huei Chen1, Hung-Jen Liao1, Tsung-Yung Jonathan Chang1
1TSMC, Hsinchu, Taiwan; 2TSMC, San Jose, CA; 3TSMC, Ottawa, Canada; 4TSMC, Yokohama, Japan

Outline
- Background
- Circuit Implementation
- Si Measurement Results
- Conclusion

Digital CIM
- Data movement and multiply-and-accumulate (MAC) are key
- Digital CIM (DCIM): distributed banks + customized MAC unit
- Directly leverages technology scaling; better testability
(Figure: computing area efficiency (TOPS/mm2), memory area efficiency (Mb/mm2), and energy efficiency (TOPS/W) across 22nm, 12nm, 4nm, and 3nm nodes, with this work highlighted.)

Key Features
- 72 input channels x 4 output channels with signed INT12 format
- 18 sets of weights in the sub-array with foundry 6T cells
- Dual-rail design
- Parallel MAC with look-up table (LUT)
- DFT & BIST interface + CIM BIST

Block Diagram
- Interface-dual-rail (IDR)-like design; MAC in the VDD (low-voltage) domain
- Interface + CTRL + DFT, MAC with LUT, decode/level-shift, and data storage (SRAM array + R/W)
- 196 cols (12b x 4 in-ch x 4 out-ch); 18 rows x 18 segments (18 weight sets x 18 in-ch)
- VDD domain with low-VT devices; VDDM domain with high-VT devices

Sub-Array with Flying BL
- Foundry 6T cell with flying-BL architecture: 5% macro area reduction by array layout optimization
- Simultaneous access to all banks
- 48 cols (12b x 4 in-ch) with 18+18 rows per segment; local IO for the top/bottom segments (BL_TOP on Mx+2, BL_BOT on Mx) feeds the MAC with LUT (8 inputs & 8 weights) in the VDD domain

Simple Local IO
- Read with MUX1: better throughput
- Write with MUX4: global routing track saving

Parallel MAC vs Serial MAC
- Parallel MAC: multi-bit x multi-bit through the multiplier and adder tree in one pass
- Serial MAC: multi-bit x single bit per cycle (MSB to LSB), followed by shift & add
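The two dataflows can be contrasted with a small unsigned sketch (signed INT12 would additionally negate the MSB partial sum; that handling is omitted here):

```python
def parallel_mac(xs, ws):
    """Parallel MAC: full multi-bit x multi-bit products feed the adder tree."""
    return sum(x * w for x, w in zip(xs, ws))

def serial_mac(xs, ws, nbits=12):
    """Serial MAC: one input bit per cycle (MSB first), then shift & add."""
    acc = 0
    for b in range(nbits - 1, -1, -1):           # MSB -> LSB
        bit_sum = sum(((x >> b) & 1) * w for x, w in zip(xs, ws))
        acc = (acc << 1) + bit_sum               # shift & add
    return acc
```

Both produce the same MAC value; the parallel form finishes in one pass at the cost of a wider multiplier array.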

Less Energy in Parallel MAC
- High throughput & low toggle rate in the parallel MAC
- 7% fewer devices and 5% faster speed using the LUT approach
(Figure: average toggle rate of serial vs. parallel MAC, and per-bit toggle rate of the parallel MAC (MSB to LSB), for AlexNet, ResNet-50, MobileNetV2, and Inception-v1 on ImageNet.)

Pipeline Structure
- 3 stages: SRAM array access + 2 MAC stages
- SRAM array access occurs only on a weight update for the MAC
- MAC1-MAC3: no weight update (data reuse); MAC0 & MAC4: weight update
(Timing chart: cycles 1-7 interleaving SRAM array access, MAC stage 1, and MAC stage 2 across MAC0-MAC4.)

Simulation Results
- Dedicated pipeline design analyzed with a transistor-level STA tool
- TOPS/W depends on the input toggle rate and the SRAM-array-access frequency
- Cycle time vs. VDD (0.4-0.7V) for MAC stage 1, MAC stage 2, and SRAM array access (VDDM = 0.675V)
- TOPS/W at VDD = 0.55V (weight = 1: 50%):
  - No SRAM array access: 53.8 at 10% input toggle, 32.5 at 25%
  - SRAM array access every 4 cycles: 30.3 at 10% input toggle, 22.9 at 25%

3nm Test Chip Micrograph (62.3 um x 252 um; GIO, CTRL, WLDRV/XINDRV, 8x MAC with 8 input channels)
- Bitcell: foundry 6T cell (0.026 um2)
- Macro area: 0.0157 mm2; bit capacity: 60.75 kb; bit density: 3.78 Mb/mm2
- MAC size: 72 input ch x 4 output ch x 18 weight sets
- TOPS/mm2: 10.6 at 0.4V; 55.0 at 0.9V
- TOPS/W at 0.55V: 53.8/32.5 (no SRAM array access, 10%/25% input toggle); 30.3/22.9 (SRAM array access every 4 cycles)
- CIM BIST based on a commercial MBIST engine: 99% fault coverage with a custom algorithm; test time 2ms/macro

Measurement Results (40C)
- Shmoo plots (cumulative percent vs. VDD) for VDD = VDDM and for VDDM fixed

34.6: A 28nm 72.12TFLOPS/W Hybrid-Domain Outer-Product Based Floating-Point SRAM Computing-in-Memory Macro with Logarithm Bit-Width Residual ADC, 2024 IEEE International Solid-State Circuits Conference

Challenges of SRAM CIM (1/3)
- Analog CIM: bit multiply with less in-array consumption, but larger out-array consumption (heavy ADCs perform the bit shift & multi-bit addition)
- Digital CIM: logic-gate bit multiply with larger in-array consumption, but less out-array consumption (an adder tree performs the bit shift & multi-bit addition)

Challenges of SRAM CIM (2/3)
- High-precision weights require a quite heavy, high-precision ADC: with bit-serial inputting, an N-bit ADC plus shift & add suffices, whereas an M-bit DAC front end pushes the ADC to M+N bits
- Trade-off of ADC precision, throughput, and overhead: higher bit precision increases the ADC burden
- Flash ADC: high throughput but high cost; SAR ADC: low cost but low throughput

Challenges of SRAM CIM (3/3)
- Inner-product-based CIM carries a large fan-in adder-tree burden: the transmission delay increases rapidly as the fan-in increases, and a multi-level adder tree introduces yet higher transmission delay
- Inner-product-based CIM: element-wise summation c_{i,j} = Σ_{k=1..n} a_{i,k} * b_{k,j}, with higher propagation delay
- Outer-product-based CIM: vector-wise accumulation c_i = Σ_{j=1..n} b_j * a_{j,i} into accumulator registers, with much less propagation delay
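The difference in accumulation order can be sketched in Python: both routines compute the same matrix product, but the inner-product form sums n terms per output element (the large fan-in adder tree), while the outer-product form applies n rank-1 updates into output registers:

```python
def matmul_inner(A, B):
    """Inner-product order: each output element is one full dot product,
    i.e. a single large fan-in summation."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def matmul_outer(A, B):
    """Outer-product order: accumulate one rank-1 update per step
    into the output registers (vector-wise accumulation)."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for k in range(m):                   # one vector-wise update per cycle
        for i in range(n):
            for j in range(p):
                C[i][j] += A[i][k] * B[k][j]
    return C
```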

Our Motivation
Basic observation: an N-bit x N-bit multiplication

IN x W = (Σ_{i=0..N-1} IN(i)*2^i) * (Σ_{j=0..N-1} W(j)*2^j)

can be transformed into different forms, each corresponding to a different CIM topology:
(1) Σ_{i=0..N-1} 2^i * W(i) * (Σ_{j=0..N-1} IN(j)*2^j)
(2) Σ_{i=0..N-1} 2^i * IN(i) * (Σ_{j=0..N-1} W(j)*2^j)
(3) Σ_{i=0..2N-2} 2^i * (Σ_{j=max(0,i-(N-1))..min(i,N-1)} W(i-j)*IN(j))
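The three decompositions can be checked numerically. A small Python sketch (where bit() extracts bit i of an unsigned value) confirms that all three forms reproduce the plain product:

```python
def bit(v, i):
    """Bit i of the unsigned value v."""
    return (v >> i) & 1

def mul_bit_parallel(IN, W, N=8):
    # form (1): sum_i 2^i * W(i) * (sum_j IN(j) * 2^j)
    return sum((1 << i) * bit(W, i) * sum(bit(IN, j) << j for j in range(N))
               for i in range(N))

def mul_bit_serial(IN, W, N=8):
    # form (2): sum_i 2^i * IN(i) * (sum_j W(j) * 2^j)
    return sum((1 << i) * bit(IN, i) * sum(bit(W, j) << j for j in range(N))
               for i in range(N))

def mul_outer(IN, W, N=8):
    # form (3): bit-position products grouped by total weight 2^i
    return sum((1 << i) * sum(bit(W, i - j) * bit(IN, j)
                              for j in range(max(0, i - N + 1),
                                             min(i, N - 1) + 1))
               for i in range(2 * N - 1))
```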

Our Motivation
- DAC-based bit-parallel ACIM topology, form (1): a DAC drives the multi-bit input IN[N-1:0] in parallel, giving high throughput but requiring a 2N-bit ADC, hence more accuracy loss
- Bit-serial ACIM topology, form (2): lower throughput, but only an N-bit ADC is required, with shift & add in the digital domain
- Bit-serial DCIM topology, form (2) with NOR gates and an adder tree: no ADC and no accuracy loss, but embedded logic gates are needed

Motivation of the proposed hybrid CIM scheme
- Analog part: analog bit multiplication and analog bit add
- Digital part: digital bit multiplication (AND), digital bit add (full adder), and digital bit shift (adder)
- Key idea: remove the bit-shift process, form (3)'s 2^i weighting, from the analog CIM, thus lowering the required ADC bit precision

Outline: Proposed Hybrid-domain SRAM-CIM Macro - Macro Overall Architecture

Macro Overall Architecture
- Hybrid-domain 8b macro with analog in-array & digital out-array computing
- Logarithm bit-width zero-canceling residual analog-to-digital converter
- Outer-product-based floating-point/fixed-point CIM block architecture
(Block diagram: activation buffer, CTRL, BL/WL drivers, CIM blocks built from SRAM sub-arrays and multipliers; exponent compute circuit, sign circuit, INT/FP adder, 16b registers, multi-bit accumulator, analog bit-mul & bit-add residual stages, SRAM 16b columns, and a sparsity control circuit.)

Outline: Proposed Hybrid-domain SRAM-CIM Macro - Analog In-Array & Digital Out-Array Hybrid-Domain Scheme

Analog In-Array & Digital Out-Array Hybrid-Domain Scheme
- Analog bit multiply and bit add produce the unshifted partial sums Oi(Analog) = Σ_{j=max(0,i-7)..min(i,7)} W(i-j) * A(j)
- Current-based analog CIM scheme with a current mirror and a transmission-gate array (weight and activation 6T cells on the bitlines; unit currents Iunit summed into Iout against Iref)
- Partial sums: O0 = W0*A0; O1 = W0*A1 + W1*A0; O2 = W0*A2 + W1*A1 + W2*A0; ...; O7 = W0*A7 + ... + W7*A0; O8 = W1*A7 + ... + W7*A1; ...; O14 = W7*A7
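A behavioral sketch of the hybrid dataflow: the "analog" in-array part returns the unshifted partial sums O0..O14, whose values stay small (at most 8 for 8b operands, so a low-precision ADC suffices), and the "digital" out-array part applies the 2^i shifts and the accumulation:

```python
def hybrid_outer_product_mul(A, W, N=8):
    """Hybrid-domain model of an 8b x 8b multiply (behavioral sketch)."""
    def bit(v, k):
        return (v >> k) & 1
    # "Analog" in-array part: bit multiply + bit add, no bit shifting,
    # so each Oi is a small integer (0..8 for N = 8)
    O = [sum(bit(W, i - j) * bit(A, j)
             for j in range(max(0, i - (N - 1)), min(i, N - 1) + 1))
         for i in range(2 * N - 1)]
    # "Digital" out-array part: ADC results shifted and accumulated
    return sum(o << i for i, o in enumerate(O))
```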

Analog In-Array & Digital Out-Array Hybrid-Domain Scheme
- ADC and accumulator for O(Digital) = Σ_{i=0..14} Oi * 2^i: the partial sums are converted into the digital domain by the ADCs (ADC1, ADC2), then shifted and added by an 8-cycle digital accumulator

1.86x FoM improvement
- FoM = IN-precision x W-precision x OUT-ratio x energy efficiency
(Chart: FoM comparison with ISSCC 2021 [4], ISSCC 2022 [3], and ISSCC 2023 [1]; values 5.76, 3.55, 1.86, 0.51.)
- *When calculating the figure of merit (FoM), BF16 is counted as 16 bits
- *Output ratio = real output accuracy / ideal output accuracy
- *All entries take the maximum average energy efficiency, with accuracy measured on CIFAR-100

Shmoo Plot
- Measurement conditions: BF16 IN, BF16 W, BF16 OUT
- Access time (tAC) = 4.78ns at VDD = 0.95V
(Shmoo: access time 4.0-8.5ns vs. VDD 0.60-0.95V, PASS/FAIL regions.)

Chip Summary
- Technology: 28nm; macro area: 1.94 mm2 (CIM macro 1.54mm x 1.26mm); SRAM capacity: 192Kb; supply voltage: 0.7-0.95V; number of input channels: 256
- Input/weight precision: BF16 and INT8
- Throughput: 1.98-4.28 TFLOPS (BF16); 2.89-5.31 TOPS (INT8)
- Energy efficiency (average performance): 16.55-32.78 TFLOPS/W; 22.78-50.53 TOPS/W
- Energy efficiency (peak performance): 36.41-72.12 TFLOPS/W; 50.12-111.17 TOPS/W
- Accuracy loss (CIFAR-100): -0.05% (BF16); -1.57% (INT8)
- *Peak performance at 90% input sparsity (ResNet-18); using the ResNet-18 model, the software baseline was 75.31%

Measurement System
- DC power supply, PC, logic analyzer, test board, JTAG

Outline: Conclusion

Conclusion
Features of the proposed hybrid-domain FP CIM macro:
- Analog in-array & digital out-array hybrid-domain CIM
- Logarithm bit-width residual ADC
- Outer-product-based FP/INT CIM block architecture
A 28nm FP SRAM CIM macro is verified:
- BF16-IN, BF16-W, BF16-OUT: throughput 1.98-4.28 TFLOPS; energy efficiency 16.55-72.12 TFLOPS/W
- INT8-IN, INT8-W, INT16-OUT: throughput 2.89-5.31 TOPS; energy efficiency 22.78-111.17 TOPS/W

Thank you for your kind attention!

34.7: A 28-nm 2.4-Mb/mm2 6.9-16.3-TOPS/mm2 eDRAM-LUT-Based Digital Computing-in-Memory Macro with In-Memory Encoding and Refreshing, 2024 IEEE International Solid-State Circuits Conference
Yifan He1, Shupei Fan1, Xuan Li1, Luchang Lei1, Wenbin Jia1, Chen Tang1, Yaolei Li1, Zongle Huang1, Zhike Du1, Jinshan Yue2, Xueqing Li1, Huazhong Yang1, Hongyang Jia1, Yongpan Liu1
1Tsinghua University, Beijing; 2Institute of Microelectronics of CAS, Beijing

eDRAM-based digital CIM
- Why digital CIM? Higher throughput & accuracy; however, large logic overhead & low memory density
- Why eDRAM CIM? Fewer transistors & potentially higher density, but refreshing overhead
(Figures: energy efficiency and compute/storage density of current-, charge-, time-, and digital-based CIM, and of SRAM-, eDRAM-, and NVM-based CIM, across 7-65nm technology nodes; cited designs include Y. Chih ISSCC'21 16.3, D. Wang ISSCC'22 16.1, Y. He ISSCC'23 7.3, Z. Chen ISSCC'21 15.3, S. Kim ISSCC'23 16.5, and this work.)

212、ution 1:Circuit OptimizationSolution 3:Alg.Co-optimizationSolution 2:Logic Optimization Existing digital CIM designs focus on computation circuitsDifferent types of FA designsUniversal improvementLimited design spaceEfficient MAC implementation Higher efficieny/throughputComplex logic&larger areaApp

213、roximate computing Break digital logic limit Accuracy lossSource:ISSCC21 16.4,ISSCC22 15.5,ISSCC23 7.334.7:A 28-nm 2.4-Mb/mm2 6.9-16.3-TOPS/mm2 eDRAM-LUT-Based Digital Computing-in-Memory Macro with In-Memory Encoding and Refreshing 2024 IEEE International Solid-State Circuits Conference4 of 27Motiv

214、ation to eDRAM-LUT-based digital CIMCDAC-basedSAR-ADCCharge-based CIM cellReconfig.Vref+Previous analog CIM solutionMotivated by ISSCC21 15.3&ISSCC23 15.5Analogous digital CIM solution1.eDRAM benefits DCIM:Higher density2.DCIM benefits eDRAM:Parallel refresh3T eDRAM9 Track Full AdderLUT 0LUT 15.Loca

215、l R&WLocal R&WRefresh time 1xeDRAM ArrayGlobal R&WRefresh time 16x,+.16CS-eDRAM CIM Bank0161610b LUT eLAD 151610b LUT eLAD 01610b LUT eLAD 1Hierarchical WL Decoder34.7:A 28-nm 2.4-Mb/mm2 6.9-16.3-TOPS/mm2 eDRAM-LUT-Based Digital Computing-in-Memory Macro with In-Memory Encoding and Refreshing 2024 I

eLAD implementation details
[Schematic: eLAD with 16 local rows (LWBL/LRBL, WWL/RWL), a write buffer on GWBL, a 4-16 LSB RWL decoder, and top control (ACT, GWBL_EN, Decoder_EN, CIM_EN); power gating and eLAD data mapping.]
- The eDRAM-LUT-based CIM saves 40% area compared to conventional DCIM.

Design choice of LUT-based CIM
- Area & energy comparison of different LUT cells; density trade-off between different numbers of weights per LUT.
- A four-weight eDRAM LUT is the sweet spot of the density/efficiency trade-off.

IMREP implementation details
[Schematic: IMREP (in-memory refresh and encode) circuit — REN/RENB, precharge (PRE), WB_RFS write-back for refresh and WB_SUM write-back for encoding on LRBL/LWBL, sharing a full adder from the 1st-stage adder tree between a LUT pair. Timing: conventional serial refresh walks rows one at a time (T=1..N), while the proposed in-memory refresh refreshes rows of all LUTs in parallel.]

In-memory refresh
[Schematic and example: the IMREP write-back path refreshes LUT rows in parallel across LUT0/LUT1, replacing the conventional serial row-by-row refresh.]

In-memory encode (1/2)
[Example: Stage 1 — serial loading of the base weights W0,0..W1,1 into LUT0/LUT1 (T=1..4). Stage 2 — encoding by double writing: intermediate sums such as W0,0+W0,1 are computed through the shared first-stage adder and written back into both LUTs (T=5, T=6, ...).]

In-memory encode (2/2)
[Same IMREP schematic and encode example, continued: the WB_SUM write-back path ("write back for encoding") stores each intermediate sum into the LUT rows.]

Complete in-memory encode flow
- The encoding flow guarantees no read-write conflicts.
[Table: 16-entry LUTs 0 and 1. Serial load (64 cycles) writes the base weights W00-W03 and W10-W13; weight encode (22 cycles) then fills the remaining entries with all pairwise, triple, and quadruple sums (W00+W01, ..., W10+W11+W12+W13), each read from the LUT, added, and written back as an intermediate result.]
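The LUT-based MAC and the encode flow above can be modeled in a few lines of Python (an illustrative sketch, not the macro's circuit; function names and bit widths are assumptions). Each eLAD stores the 16 subset sums of a four-weight group, so one lookup per input bit-plane replaces four multiply-adds; `encode_lut` mirrors what the in-memory weight-encode flow builds with the shared adder.

```python
def encode_lut(w4):
    """16-entry LUT holding every subset sum of a 4-weight group
    (what the in-memory weight-encode flow writes, e.g. W00+W01)."""
    assert len(w4) == 4
    return [sum(w4[k] for k in range(4) if (s >> k) & 1) for s in range(16)]

def lut_mac(weights, inputs, in_bits=8):
    """Dot product via LUT lookups: per input bit-plane, 4 input bits
    select one precomputed partial sum per weight group."""
    assert len(weights) == len(inputs) and len(weights) % 4 == 0
    luts = [encode_lut(weights[g:g + 4]) for g in range(0, len(weights), 4)]
    acc = 0
    for b in range(in_bits):                     # one input bit-plane per step
        for g, lut in enumerate(luts):
            sel = 0
            for k in range(4):                   # gather bit b of 4 inputs
                sel |= ((inputs[4 * g + k] >> b) & 1) << k
            acc += lut[sel] << b                 # shift-and-accumulate
    return acc
```

With four weights per LUT the table has 2^4 = 16 entries — the "sweet point" the design-choice slide refers to; doubling the group to eight weights would need 256 entries.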

IMREP evaluation
- Improved efficiency & throughput by reducing global data movement.
[Charts: normalized cost of near-memory vs. in-memory refresh and encode — -57% energy, -64% latency, and +2% macro area in one comparison; 9.5x energy and 16x latency reductions with +10% area in the other.]

Outline
- Introduction
- Challenge of eDRAM-LUT-Based Digital CIM
- Proposed eDRAM Digital CIM: overall architecture and basic principle; two-stage eDRAM-LUT-based adder tree; in-memory refreshing and LUT encoding
- Measurement Results
- Conclusion

Chip photograph
[Die photo: 1.62mm x 2.0mm; PLL, input/output SRAM & test modes, 2Mb 10x5 eDRAM macro array with WWL/RWL decoders & drivers; each 78µm x 221µm macro holds two 8 x 256 x 10b LUT banks + adder trees, shifter-adders, top CTRL, and WBL drivers.]

Chip summary
- Technology: 28nm HKMG
- Bit cell: 3T eDRAM; cell area 0.145 µm²
- Macro area: 0.017 mm²; macro capacity 40Kb; total capacity 2Mb
- Total throughput: 10 TOPS
- Compute circuit: LUT + adder tree
- Input channels: 64; output channels: 16
- Supply voltage: 0.9-1.2V; frequency: 400-1100MHz
- Storage density: 2.4Mb/mm² (2684 F²/b normalized)
- Energy efficiency: 19.7 TOPS/W (8b, 0.9V)
- Area efficiency: 16.7 TOPS/mm² (8b, 1.2V)

Test platform
[Photo: test IC on an FPGA board with PC, DC source, current meter, thermoelectric heat pump, and thermal imager.]

Voltage & efficiency scaling
- Shmoo: 0.85-1.2V vs. 400-1100MHz. Best efficiency at 0.9V, 400MHz; best performance at 1.2V, 1100MHz.
- 8b energy efficiency vs. voltage at 10-50% input sparsity; best efficiency 19.7 TOPS/W at 0.9V.

eDRAM retention test
- Minimal refreshing overhead with in-memory acceleration.

Power and area breakdown
- Energy breakdown: eDRAM array 42%, adder tree 32%, shift & adder 8%, CTRLs & drivers 18%.
- Area breakdown: eDRAM array 47%, adder tree 34%, shift & adder 6%, CTRLs & drivers 13%.
- Memory circuits dominate both the area and the energy breakdown.
[Layout: 12.5µm x 73µm tile with two 128x10 eDRAM arrays and a 16x10b adder tree.]

Comparison with state-of-the-art CIM macros
- Exceeding storage & compute density.
[Charts: 8b TOPS/W at 0.9-1.1V, normalized to 28nm and to output bits/(input bits + weight bits + log2(# of input channels)); 8b peak TOPS/mm²; storage density.]

Conclusion
- Digital CIM has unique advantages in performance, robustness, and scalability.
- Previous DCIM works mainly optimize the computation logic instead of treating memory and computation as a whole.
- The eDRAM-LUT-based CIM achieves 16.7 TOPS/mm² 8b compute density with 2.4Mb/mm² storage density.
- Further improvement can be achieved by combining other high-density emerging devices in the future.

34.8: A 22nm 16Mb Floating-Point ReRAM Compute-in-Memory Macro with 31.2TFLOPS/W for AI Edge Devices, 2024 IEEE International Solid-State Circuits Conference

Tai-Hao Wen*1, Hung-Hsi Hsu*1,2, Win-San Khwa*2, Wei-Hsing Huang1, Zhao-En Ke1, Yu-Hsiang Chin1, Hua-Jin Wen1, Yu-Chen Chang1, Wei-Ting Hsu1, Chung-Chuan Lo1, Ren-Shuo Liu1, Chih-Cheng Hsieh1, Kea-Tiong Tang1, Shih-Hsin Teng3, Chung-Cheng Chou3, Yu-Der Chih3, Tsung-Yung Jonathan Chang3, Meng-Fan Chang1,2
1National Tsing Hua University (NTHU), Hsinchu, Taiwan; 2TSMC Corporate Research, Hsinchu, Taiwan; 3TSMC, Hsinchu, Taiwan

Outline
- Challenges of floating-point (FP) nonvolatile compute-in-memory (nvCIM) for AI edge devices
- Proposed schemes in the nonvolatile CIM macro: overview of the floating-point ReRAM-CIM macro; FP computing flow with kernel-wise weight pre-alignment (K-WPA); rescheduled multi-bit input compression (RS-MIC); HRS-favored dual-sign-bit (HF-DSB) weight encoding
- Performance and measurement results
- Conclusion

Challenges of Nonvolatile Compute-In-Memory (1/3)
- Trade-off of the bit width for weight pre-alignment between inference accuracy and weight storage: increasing the bit width enhances inference accuracy; decreasing it reduces storage needs.
[Figure: conventional layer-wise weight pre-alignment (PA) at 8b vs. 12b — small-magnitude mantissas suffer data truncation at 8b; charts of storage cost and accuracy loss vs. PA bit width (8/10/12) for the conventional scheme and the proposed K-WPA.]

Challenges of Nonvolatile Compute-In-Memory (2/3)
- Long MAC latency/energy due to the high input bit width of the floating-point format.
- (1) Digital bit-serial multi-bit inputs suffer from long latency/energy (example: a 2b input of value 3 needs 2 cycles).
- (2) Analog WL pulse-count and (3) analog decoded WL pulse-width multi-bit inputs suffer from low signal margin (the same input needs a 3x pulse count or 3x pulse width).
- Proposed RS-MIC: 1 cycle with FLAG_MIC = 1.
[Charts: latency/energy and signal margin of schemes (1)-(3) vs. the proposed RS-MIC.]
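A toy model of the bit-serial baseline criticized above (my sketch; the slides do not describe RS-MIC's FLAG-based compression in enough detail to reproduce it): a digital bit-serial scheme spends one cycle per input bit, so latency scales directly with the input bit width.

```python
def bitserial_mac(weights, inputs, in_bits):
    """Digital bit-serial multi-bit input MAC: one input bit-plane per
    cycle, so the cycle count equals the input bit width."""
    acc, cycles = 0, 0
    for b in range(in_bits):
        cycles += 1                              # one cycle per bit-plane
        plane = sum(w * ((x >> b) & 1) for w, x in zip(weights, inputs))
        acc += plane << b                        # weight the bit-plane by 2**b
    return acc, cycles
```

A 2b input therefore costs 2 cycles, as in the slide's example, while RS-MIC claims the same lossless result in 1 cycle using a FLAG_MIC bit.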

Challenges of Nonvolatile Compute-In-Memory (3/3)
- Large cell-array current consumption due to the large number of memory cells needed for FP weight storage; an LRS cell consumes much more energy per access than an HRS cell.
- Exploiting the weight distribution after pre-alignment, the proposed HF-DSB weight encoding reduces the LRS-cell ratio from 45% to 35% (HRS: 55% to 65%).
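The HF-DSB motivation is simple arithmetic; a hedged sketch with made-up energy constants (the slides give only the LRS >> HRS relation and the 45% -> 35% LRS ratios):

```python
def array_read_energy(bits, e_lrs=10.0, e_hrs=1.0):
    """Toy access-energy model: an LRS ('1') cell draws far more current
    than an HRS ('0') cell; e_lrs/e_hrs are illustrative, not measured."""
    n_lrs = sum(bits)
    return n_lrs * e_lrs + (len(bits) - n_lrs) * e_hrs

# 100 stored bits: conventional encoding leaves 45% LRS cells,
# the HRS-favored encoding only 35%.
conv = array_read_energy([1] * 45 + [0] * 55)
prop = array_read_energy([1] * 35 + [0] * 65)
```

With this (assumed) 10:1 per-cell energy ratio, the LRS reduction alone cuts array read energy by roughly 18%.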

Overview of the floating-point ReRAM-CIM macro
[Block diagram: FP inputs -> input pre-alignment -> RS-MIC (C2) -> WL driver; HF-DSB weight encoder (C3) -> 1T1R ReRAM-CIM array with K-WPA and HF-DSB (C1 & C3); mux and ADC -> shifter and sign-aware adder (C3) -> MIC-aware accumulator (C2) -> exponent processing (C1) -> FP outputs; CTRL.]
- FP computing flow with kernel-wise weight pre-alignment (K-WPA): reduces accuracy loss from data truncation in weight pre-alignment.
- Rescheduled multi-bit input compression (RS-MIC): reduces MAC energy and latency with lossless compression.
- HRS-favored dual-sign-bit (HF-DSB) weight encoding: reduces ReRAM array current consumption.

FP computing flow with kernel-wise weight pre-alignment (1/7)
- Kernel-wise weight pre-alignment (K-WPA) is an offline alignment process that eliminates the need for on-chip weight alignment.
- Find the maximum exponent (EK) in each kernel; align each weight's sign and mantissa based on its exponent difference (EK - Ei).
[Diagram: kernel 0 weights in FP16/BF16 format (Si, Ei, Mi) become a kernel0-shared exponent EK0 plus pre-aligned mantissas PA-MK0 (Q bits, Q = 8, 10, 12, 16), each 1.Mi shifted by EK0 - Ei. Datapath: (1) normal read of EK0..KN into the exponent-processing register; (2) MAC over the pre-aligned mantissas PA-MK0..KN (stored with HF-DSB) against pre-aligned inputs PA-MIN (P bits, A inputs in FP16/BF16) from the input pre-alignment circuit and RS-MIC; (3) mux and ADC, shifter and sign-aware adder, and MIC-aware accumulator produce the FP32 output.]
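The offline K-WPA step can be sketched in plain Python (an illustrative float model with hypothetical names; on silicon the Q-bit sign-and-mantissa words go into the ReRAM array and E_K into a separate row):

```python
import math

def kwpa_align(kernel_weights, q_bits=12):
    """Kernel-wise weight pre-alignment: pick the kernel's maximum
    exponent E_K, then store each weight as a Q-bit fixed-point value
    aligned to E_K (truncation models the storage limit)."""
    exps = [math.frexp(w)[1] for w in kernel_weights]   # w = m * 2**e, 0.5 <= |m| < 1
    e_k = max(exps)                                     # kernel-shared exponent
    scale = 1 << (q_bits - 1)                           # sign + (q_bits-1) magnitude bits
    pa = [math.trunc(w / 2.0 ** e_k * scale) for w in kernel_weights]
    return e_k, pa
```

A MAC then needs only integer arithmetic on `pa` plus one exponent add per kernel: y ≈ (Σ_i pa_i·x_i) / scale · 2^E_K. Because the reference is per-kernel rather than per-layer, a kernel of small weights is not truncated against another kernel's large exponent.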

Kernel-wise weight pre-alignment (2/7)
- Before the MAC operation, the kernel-shared exponents (EK) are loaded from a ReRAM array row into a register.
- The kernel-wise pre-aligned mantissas are then ready to perform MAC operations with the inputs.

K-WPA reduces data truncation during weight alignment
- K-WPA achieves a fine-grained, kernel-wise alignment reference.
[Diagram: conventional WPA uses one layer-wise alignment reference across kernels 0..N, so weights far below the largest place value are truncated within the Q-bit window; the proposed K-WPA aligns each kernel to its own reference and preserves more weight data. 1.86x over ISSCC'23.]

Chip summary
- Technology: 22nm CMOS logic process (ultra-low leakage)
- Memory device: foundry-provided 1T1R ReRAM
- ReRAM-CIM capacity: 16Mb (16 sub-banks)
- Input/weight precision: FP16/BF16; output precision: FP32
- Macro area (incl. test mode): 8.2mm²
- Supply voltage: 0.7-0.8V
- Throughput*1*3: 0.86 TFLOPS (BF16), 0.78 TFLOPS (FP16)
- Computing density*1*3: 0.104 TFLOPS/mm² (BF16), 0.095 TFLOPS/mm² (FP16)
- Energy efficiency*3: 31.2*1-65.5*2 TFLOPS/W (BF16), 28.7*1-60.4*2 TFLOPS/W (FP16)
- Inference accuracy (CIFAR-100)*4: 69.48% (top-1), 91.59% (top-5) (BF16)
- Inference accuracy (ImageNet)*5: 71.55% (top-1), 90.17% (top-5) (BF16)

Conclusion
Features of the proposed floating-point ReRAM-nvCIM macro:
- FP computing flow with kernel-wise weight pre-alignment (K-WPA): reduces accuracy loss due to data truncation in weight pre-alignment.
- Rescheduled multi-bit input compression (RS-MIC): reduces MAC energy and latency with lossless compression.
- HRS-favored dual-sign-bit (HF-DSB) weight encoding: reduces ReRAM array current consumption.
A 16Mb floating-point ReRAM-nvCIM macro is verified:
- Supports both the BF16 and FP16 computing formats.
- Measured energy efficiency of 31.2 TFLOPS/W under BF16 precision.

Thanks for your kind attention
Acknowledgements: NSTC-Taiwan, TSRI, NTHU-TSMC JDP

34.9: A Flash-SRAM-ADC-Fused Plastic Computing-in-Memory Macro for Learning in Neural Networks in a Standard 14nm FinFET Process, 2024 IEEE International Solid-State Circuits Conference

Linfang Wang1,2, Weizeng Li1,2, Zhidao Zhou1,2, Hanghang Gao1,2, Zhi Li1,2, Wang Ye1,2, Hongyang Hu1, Jing Liu1, Jinshan Yue1, Jianguo Yang1, Qing Luo1, Chunmeng Dou1,2, Qi Liu1,3, Ming Liu1,3
1Institute of Microelectronics of the Chinese Academy of Sciences, Beijing, China; 2University of Chinese Academy of Sciences, Beijing, China; 3Fudan University, Shanghai, China

Outline
- Motivation and Challenges
- Proposed Plastic-CIM Macro: overview of the Plastic-CIM macro; plastic cell array; differential merged-into-array ADC
- Performance and Measurement Results
- Conclusion

Development of nvCIM-based accelerators
- Edge AI applications call for CIM with learning ability, to adapt to dynamic and unpredictable environments.
- W. Huang, ISSCC 2023; M. Chang, ISSCC 2023; M. Le Gallo, Nat. Electron. 2023; W. Zhang, Science 2023.

Learning by back propagation
- Conventional NN training requires high computational precision and large amounts of intermediate data, and cannot keep learning after the initial training — hard to deploy on edge devices for learning.
- Feed-forward: y_j = σ(Σ_i w_{i,j} · x_i)
- Error computation: ∂C/∂x_i = Σ_j w_{i,j} · ∂C/∂y_j
- Weight-gradient computation: ∂C/∂w_{i,j} = x_i · ∂C/∂y_j
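Written out concretely, the three backpropagation equations for one layer look as follows (a minimal sketch with the activation omitted for clarity — not the macro's datapath):

```python
def forward(W, x):
    """Feed-forward: y_j = sum_i W[i][j] * x[i]."""
    return [sum(W[i][j] * x[i] for i in range(len(x)))
            for j in range(len(W[0]))]

def input_grad(W, dC_dy):
    """Error computation: dC/dx_i = sum_j W[i][j] * dC/dy_j."""
    return [sum(W[i][j] * dC_dy[j] for j in range(len(dC_dy)))
            for i in range(len(W))]

def weight_grad(x, dC_dy):
    """Weight gradient: dC/dw_ij = x_i * dC/dy_j -- note that both x and
    dC/dy must be buffered, the intermediate data edge devices lack room for."""
    return [[xi * g for g in dC_dy] for xi in x]
```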

Learning by synaptic plasticity
- A plastic NN* learns from experience: it keeps learning in feed-forward after the initial training.
- Feed-forward: y_j = σ(Σ_i (w_{i,j} + α_{i,j} · Hebb_{i,j}) · x_i)
- Updating Hebb by Hebbian theory ("neurons that fire together, wire together"): Hebb_{i,j}(t+1) = η · x_i · y_j + (1 - η) · Hebb_{i,j}(t)
- The network keeps the traditional connections (trained W_{i,j}) and adds plastic connections (trained W_{i,j} & α_{i,j}, with Hebb_{i,j} updated online), e.g. in the fully-connected layer that maps the extracted feature vector of the convolutional layers to the output result.
*T. Miconi, ICML, 2018

- Parameters, storage, and computation: the plasticity coefficients α_{i,j} are long-term information and expect NVM; the Hebbian traces Hebb_{i,j} are short-term information and expect volatile memory (VM).
- Computation requires matrix element-wise multiplication (MEM) for the plastic connections as well as matrix-vector multiplication (MVM).

Challenge 1: NVM solution in advanced nodes
- 5-transistor logic-flash (5T-LF) provides a cost-effective non-volatile memory in advanced logic platforms: unlike emerging memories (RRAM/PRAM/MRAM) and e-FLASH, which need additional masks and specialized processes, it uses a standard logic process, down to 14nm in this work.

Challenge 2: MEM-MVM solution in CIM
- Previous works need (N + 1) cycles to perform MEM-MVM, where N is the number of columns.
- The proposed CIM array supports MEM-MVM in 1 cycle.
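A plastic layer in the style of T. Miconi (ICML 2018, as cited above) can be sketched as follows — note how the α·Hebb term makes the forward pass a fused MEM + MVM, which is what the proposed array computes in one cycle (coefficient values here are illustrative assumptions):

```python
import math

ETA = 0.1  # illustrative Hebbian learning rate

def plastic_forward(W, Alpha, Hebb, x):
    """y_j = tanh( sum_i (w_ij + alpha_ij * Hebb_ij) * x_i ), followed by
    the in-place Hebbian trace update
    Hebb(t+1) = eta * x_i * y_j + (1 - eta) * Hebb(t)."""
    n_in, n_out = len(W), len(W[0])
    y = [math.tanh(sum((W[i][j] + Alpha[i][j] * Hebb[i][j]) * x[i]
                       for i in range(n_in))) for j in range(n_out)]
    for i in range(n_in):              # short-term trace -> volatile memory
        for j in range(n_out):
            Hebb[i][j] = ETA * x[i] * y[j] + (1 - ETA) * Hebb[i][j]
    return y
```

W and Alpha are the long-term trained parameters (NVM); Hebb is the short-term trace (volatile memory), rewritten on every forward pass.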

Challenge 3: low-hardware-cost ADC solution
- The area cost of the DAC array grows exponentially with ADC resolution, degrading area efficiency.
- The proposed CIM macro saves 50.4% of the ADC area by reusing the MAC-signal sampling capacitors, removing the dedicated DAC of a conventional SAR-ADC; the remaining overhead is the comparator & SAR logic.

Overview of the Plastic-CIM macro
- 14nm fused 32Kb 5T-LF & 4Kb SRAM.
- PCA (plastic cell array) for efficient MEM-MVM.
- DMA-ADC with reduced hardware overhead.
[Block diagram: 32 blocks, each with SWL0/SWL1, FBL0:7/FBL8:15, BL/BLB pairs (BL0..63/BLB0..63), PWL0..63/WWL0..63, and 8b inputs x0..63; positive and negative columns with sign-magnitude to 2's-complement (SM-2SC) conversion; L1/L2/L3 differential shifter-adders producing L3DSAO0..31; input driver and HVS, 5T-LF read/write, SRAM WL driver/read/write, CP, timing & CTRL, 8-bit DMA-ADC, and R3L-DSA. Abbreviations: CP, charge pump; HVS, high-voltage switches; R3L-DSA, reconfigurable 3-level digital shifter-and-adder; L3DSAO, 3rd-level DSA output.]

Computation flow of the P-CIM macro
- MVM mode for the traditional connections: all SRAM cells store 1, so the array computes a plain MVM.
- MEM-MVM mode for the plastic connections: the SRAM Hebb bits gate the multiplication, fusing the element-wise product with the MVM, and the Hebb update is written back to SRAM.

PCA unit structure
- The 5T-LF* and SRAM cells are merged thanks to process compatibility; SGD/SGS are shared with BL/BLB to reduce the number of metal lines.
[Schematic: PCA unit with the 5T-LF cell (floating gate FG), SRAM (PM0/PM1, NM0-NM2, Q/QB), sampling capacitor CSAM, and signals ENAVG, PWL, WWL, BL(SGD), BLB(SGS), SWL, FBL, LSL, MV, PCH, CL.]
*The 5T-LF cell refers to M. Kim, JSSC, 2022.
- The 5T-LF cells are built from different IO devices; under-drive IO transistors are used for the read and pass gates of the 5T-LF to increase the cell read current.
[Table: device type and W/L ratio per transistor — PM0, PM1, NM0/1/2 (p18, n18, and under-drive ud12 IO devices) and the other devices (n08/p08 core devices).]

- The SRAM and the computing switches are built from core devices; the HV paths are decoupled from the SRAM, the CSAM, and the computing path.

5T-LF programming
- Bit-wise programming: apply VPRG on PWL/WWL, and inhibit the half-selected cells by applying VDD on BL.
[Bias table: programmed cell — FBL(GND), PWL(VPRG), WWL(VPRG), BL(VDD), BLB(GND), LSL floating, SWL(GND); program-inhibited cell — FBL(VDD).]
*The 5T-LF cell refers to M. Kim, JSSC, 2022.

5T-LF erasing
- Row-wise erasing of the 5T-LF by applying VERS between WWL and PWL (FBL, PWL, BL, BLB, and SWL at GND; LSL floating).

Enhanced FN tunneling in the Si fin structure
- The enhanced electric field at the Si fin corner during programming yields a 185x larger local FN-tunneling current density than the planar region (TEM cross-sectional view and TCAD electric-field simulation).
- This gives a 25% lower minimum VPRG and a 66% shorter minimum program time than prior logic-flash work (University of Michigan ISSCC 2017, TSMC VLSI-DAT 2013, University of Minnesota JSSC 2021, Virage Logic ISSCC 2004, KAIST CICC 2023).

314、lid-State Circuits Conference20 of 40Computing of PCA Unit-Truth Tablex(BL Pulse Width*1)or w(5T-LF VTH*2)Hebb(SRAM Q)Multiplication Value(VMV)00(tACT=0)-00000(VDD)-0000(VTH0)-0(GND)01(tACT=t0)0001(VTH1)1(VDD)000001(VDD-V)01(tACT=t0)0010(VTH2)1(VDD)000010(VDD-2V)01(tACT=t0)0011(VTH3)1(VDD)000011(VDD

315、-3V)10(tACT=2t0)0100(VTH4)1(VDD)000001(VDD-2V)11(tACT=3t0)1111(VTH15)1(VDD)101101(VDD-45V)*1.tACT:BL activation time;t0:the unit time for pulse width modulation.*2.Cell current linearly increases as the threshold voltage(VTH)changes from VTH0to VTH15.CLENAVGPWLWWL BLBLBSWLFBL LSL MVPCHVDDQQBFGCSAMCo

Computing of PCA Unit: Phase 0 (Pre-charging) (22 of 40)
- Pre-charge the top plates of the CSAMs and the parasitic capacitance of LSL
[Waveforms: PWL/WWL, SWL, FBL, BL/BLB, MV, and ENAVG across PH0 (LSL pre-charge), PH1 (local multiplication), and PH2 (averaging), with FBL inhibition]

Computing of PCA Unit: Phase 1 (Multiplying) (23 of 40)
- Input pulses applied to BL; multiplication through discharging the CSAMs
[Waveforms: SGD activation depends on the inputs during PH1; charge sharing between capacitors during PH2]
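The phases named in the waveforms above (pre-charge, local multiplication, averaging) can be modeled as an idealized charge-domain computation. This is a sketch under stated assumptions: the discharge constant k, the supply value, and the function name are placeholders, not numbers or names from the paper:

```python
# Idealized charge-domain model of one PCA column:
# PH0 pre-charges the CSAM top plates, PH1 discharges each CSAM in
# proportion to pulse width (x) times cell current (set by w), gated by
# the Hebb bit, and PH2 averages the voltages by charge sharing.
VDD, k = 0.8, 0.01  # assumed supply and discharge constant

def pca_column(xs, ws, hebbs):
    # Phase 0: pre-charge every CSAM top plate to VDD
    v = [VDD] * len(xs)
    # Phase 1: pulse-width * cell-current discharge, gated by Hebb
    for i, (x, w, h) in enumerate(zip(xs, ws, hebbs)):
        v[i] = VDD - k * x * w * h
    # Phase 2: charge sharing between equal CSAMs averages the voltages
    return sum(v) / len(v)

print(pca_column([1, 2, 3], [4, 5, 15], [1, 1, 1]))
```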

Computing of PCA Unit: Phase 2 (Averaging) (24 of 40)
- Averaging between capacitors through charge sharing

Outline (25 of 40)
- Motivation and Challenges
- Proposed Plastic-CIM Macro
  - Overview of Plastic-CIM Macro
  - Plastic Cell Array
  - Differential Merged-into-Array ADC
- Performance and Measurement Results
- Conclusion

DMA-ADC Structure (26 of 40)

- Modified SAR logics for grouping and quantizing
- Common-centroid capacitors grouping strategy to reduce mismatch
- Multi-Element Sparsity-Aware (MESA) control to leverage sparsity in the plastic-NN
[Schematic: differential capacitor lines CLP/CLN with references VREF_L/VREF_H, sense-amplifier outputs SAOP/SAON, ratioed capacitors C0-C6 (1x, 1x, 2x, 4x, 8x, 16x, 32x), two arrays of 64 6T-SRAM and 512 5T-LF cells, switch SW, DOUT, modified SAR logics and timing, DFFs, and MESA control with Hebb and x/w sparsity factors]

DMA-ADC Work Flow: Phase 1 (Multiplying) (27 of 40)

- Top plate of CSAM connected to MV, while the bottom plate of CSAM is connected to VREF_H, because the input common-mode voltage is near VDD (ENAVG = 0)

DMA-ADC Work Flow: Phase 2 (Averaging) (28 of 40)
- Top plate of CSAM connected to CL to enable charge sharing (ENAVG = 1)

DMA-ADC Work Flow: Phase 3 (Quantizing) (29 of 40)
- Bottom plates of CSAM grouped into ratioed capacitors (C6-C0) and switched to quantize the MAC value
[Schematic: grouped ratioed capacitors C0-C6 switching between VREF_L and VREF_H under the modified SAR logics]
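Phase 3 is a successive-approximation search: each switching step of the ratioed capacitors resolves one output bit. The sketch below is a generic textbook SAR model, not the DMA-ADC's exact switching sequence; the reference levels and bit count are assumptions:

```python
# Generic SAR search: binary-search the input voltage between two
# references, one comparator decision (SAOP/SAON) per ratioed capacitor.
def sar_quantize(v_in, v_ref_l=0.0, v_ref_h=0.8, bits=7):
    lo, hi, code = v_ref_l, v_ref_h, 0
    for _ in range(bits):        # one step per capacitor C6..C0
        mid = (lo + hi) / 2      # trial level from the capacitor DAC
        code <<= 1
        if v_in >= mid:          # comparator decision
            code |= 1
            lo = mid
        else:
            hi = mid
    return code

print(sar_quantize(0.33))  # 7-bit code of 0.33 V in [0, 0.8] V
```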

Common-Centroid Capacitors Grouping Strategy (30 of 40)
- Ensures the parasitic capacitances of C6-C0 (CPAR6-CPAR0) maintain the same ratios as the capacitors themselves
- Ratioed wire parasitic capacitances CPAR of C6 to C1, with unit cells placed at PCA row indices 0-64 symmetrically about the array center: CPAR1 ~ 32; CPAR2 ~ (33+31) = 2*CPAR1; CPAR3 ~ (35+34+30+29) = 4*CPAR1
[Layout sketch: interleaved C6/C5/C4 unit cells along the PCA rows, with comparator and SAR logic at the center]
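The ratios quoted above can be checked directly: unit capacitors for C1, C2, C3 sit at row positions symmetric about row 32, so parasitics taken proportional to the listed position sums stay in the binary ratio 1:2:4. The positions are from the slide; the sum-of-positions proportionality is an illustrative assumption:

```python
# Common-centroid placement check: wire parasitic of each grouped
# capacitor modeled as proportional to the sum of its unit-cell positions.
groups = {
    "C1": [32],               # CPAR1 ~ 32
    "C2": [33, 31],           # CPAR2 ~ (33+31) = 2*CPAR1
    "C3": [35, 34, 30, 29],   # CPAR3 ~ (35+34+30+29) = 4*CPAR1
}
cpar = {name: sum(pos) for name, pos in groups.items()}
print(cpar["C2"] / cpar["C1"], cpar["C3"] / cpar["C1"])  # 2.0 4.0
```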

Multi-Element Sparsity-Aware (MESA) Control
- Dynamically modulate the sensing steps by leveraging the sparsity in x, Hebb, or w to save power
[Chart: power reduction (a.u., 0-200) for Case I (50% AS, 50% HS), Case II (90% AS, 10% HS), and Case III (10% AS, 90% HS), comparing AS-aware-only control with AS&WS&HS-aware control]
[Logic: 1s counters on x and Hebb (NACT, NHEB), a weight-sparsity subtractor with the x/w sparsity factor, and a digital comparator producing the sensing-step counts NSEN and NLES]
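As a loose illustration of the MESA idea, sensing effort can be gated by counting nonzero elements in the operands. The step-count formula and function name below are invented for this sketch and are not the macro's actual control law:

```python
# Hypothetical MESA-style control: count ones in x, w, and Hebb
# ("1s counters" and the weight-sparsity factor in the slide) and reduce
# the number of ADC sensing steps when the operands are sparse.
def mesa_sensing_steps(x, w, hebb, full_steps=7):
    n_x = sum(1 for v in x if v)     # activation-sparsity counter
    n_w = sum(1 for v in w if v)     # weight-sparsity factor
    n_h = sum(1 for v in hebb if v)  # Hebb-sparsity counter
    if min(n_x, n_w, n_h) == 0:      # all products zero: skip sensing
        return 0
    active = min(n_x, n_w, n_h) / len(x)
    return max(1, round(full_steps * active))  # fewer steps when sparse

print(mesa_sensing_steps([1, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]))
```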
