ISSCC 2024, SESSION 34: Compute-In-Memory

34.1: A 28nm 83.23TFLOPS/W POSIT-Based Compute-in-Memory Macro for High-Accuracy AI Applications
Yang Wang1, Xiaolong Yang1, Yubin Qin1, Zhiren Zhao1, Ruiqi Guo1, Zhiheng Yue1, Huiming Han1, Shaojun Wei1, Yang Hu1, Shouyi Yin1,2
1Tsinghua University, Beijing, China; 2Shanghai AI Laboratory, Shanghai, China
2024 IEEE International Solid-State Circuits Conference

Outline
Background and Motivation
Challenges of POSIT-Based CIM Macro
Proposed POSITCIM Macro Features: Bi-directional Regime Processing Codec; Critical-bit Pre-compute-and-store CIM Array; Cyclically-alternating Scheduling Adder Tree
Measurement and Comparison
Conclusion

FP-CIM for High-Accuracy AI Applications
Recent AI tasks (keyword spotting, face recognition, AI assistants, automated driving, AIGC) are becoming increasingly complex, and complex AI applications require FP-CIM for high accuracy.

Limitation of Conventional FP Data Format
Conventional FP formats (Floating Point 16, Brain Float 16) cannot achieve high accuracy at low power across varied data distributions.

Principle of POSIT Data Format
POSIT exploits dynamic bit allocation to adapt to varied data distributions.

Conventional FP vs. POSIT
POSIT8 saves 27% energy with only 0.4% accuracy loss compared with FP16.
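The dynamic bit allocation can be made concrete with a small decoder sketch for a POSIT(8,2)-style word (sign, run-length regime, es exponent bits, fraction). This is an illustrative model of the number format only, not the paper's macro:

```python
def decode_posit(bits: int, n: int = 8, es: int = 2) -> float:
    """Decode an n-bit POSIT(n,es) word: sign | regime run | es exp bits | fraction.
    Illustrative model of the format (no rounding modes)."""
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")                  # NaR (not-a-real)
    if bits >> (n - 1):                      # negative: two's complement
        sign, bits = -1.0, (-bits) & ((1 << n) - 1)
    else:
        sign = 1.0
    body = bits & ((1 << (n - 1)) - 1)       # drop the sign bit
    # Regime: run of identical bits terminated by the opposite bit.
    # Its length decides how many bits are left for exponent/fraction,
    # which is the "dynamic bit" property.
    first, run, i = (body >> (n - 2)) & 1, 0, n - 2
    while i >= 0 and ((body >> i) & 1) == first:
        run, i = run + 1, i - 1
    regime = run - 1 if first else -run
    i -= 1                                   # skip the terminating bit
    # Exponent: next es bits, zero-padded if the word ran out.
    exp = 0
    for _ in range(es):
        exp = (exp << 1) | (((body >> i) & 1) if i >= 0 else 0)
        i -= 1
    # Fraction: remaining bits under a hidden leading 1.
    frac_bits = i + 1
    frac = (body & ((1 << frac_bits) - 1)) if frac_bits > 0 else 0
    mant = 1.0 + (frac / (1 << frac_bits) if frac_bits > 0 else 0.0)
    return sign * mant * 2.0 ** ((1 << es) * regime + exp)
```

Values near 1.0 get a short regime and many fraction bits (high precision); large or tiny values trade fraction bits for a longer regime (wide dynamic range).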
Challenge 1: Large Power in Regime Processing
The dynamic regime increases pre-processing energy by 2.62x.

Challenge 2: Cell Under-utilization in CIM Array
The dynamic mantissa introduces 41.3% CIM-cell under-utilization.

Challenge 3: Redundant Toggle in Adder Tree
Dynamically aligned accumulation incurs 66.8% power waste.

Feature 1: Bi-directional Regime Processing
Step 1: Regime extraction with a leading-1/0 detector.
Step 2: Regime processing with codec and addition.
Feature 1: Bi-directional Regime Processing
The BRPU replaces codec-plus-addition with shift-OR processing.
Same-sign addition: a logic shift increases the 1s/0s run length, and a larger shift code always costs more shift power. The BRPU dynamically decodes the smaller |RB| to shift the larger |RA|, minimizing the shift code and saving 40% of shift energy.
Different-sign addition: a logic shift decreases the 1s/0s run length; if the shift code exceeds the regime's effective bit-width, it introduces shift error. The BRPU avoids shift overflow, reducing shift logic by 50%.
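The shift-OR idea can be sketched as follows: a non-negative regime of value k is a run of k+1 identical bits, so adding a same-sign offset j amounts to shifting the run pattern by j and OR-ing in j new bits, instead of decoding both regimes to binary and adding. Function names are illustrative, not the BRPU netlist:

```python
def unary_regime(k: int) -> int:
    """Run-of-ones pattern for a non-negative regime value k: (k+1) ones."""
    return (1 << (k + 1)) - 1

def shift_or_add(run: int, j: int) -> int:
    """Add a same-sign regime offset j by shift-OR instead of decode+add:
    shift the run left by j and OR in j fresh ones."""
    return (run << j) | ((1 << j) - 1)

# Extending the run of regime 5 by 2 yields the run of regime 7.
assert shift_or_add(unary_regime(5), 2) == unary_regime(7)
```

Shifting by the smaller |R| keeps the shift distance short, which is the intuition behind decoding the smaller regime and shifting the larger one.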
Feature 2: Critical-bit Pre-compute-and-store
Dynamic mantissa bit-width introduces 48.9% cell waste.
CPCS uses the spare cell bits to achieve a dual-bit MAC in each cycle.
Feature 3: Cyclically-alternating Scheduling
If A and B have no overlapping bits, A+B is equal to A|B.
Even if A0/A1 have one overlapping bit, A0W0+A1W1 has to use an adder, and with synchronous bit-serial computing all cycles need adders.
The CASU cyclically shifts A0 for asynchronous computing with A1, eliminating the overlapping bits in the former cycles of A0W0+A1W1.
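The carry-free observation behind the adder-tree simplification can be checked directly (a minimal sketch, not the CASU hardware):

```python
def or_add(a: int, b: int) -> int:
    """When two partial sums share no set bits, OR equals ADD (no carries)."""
    assert a & b == 0, "overlapping bits would need a real adder"
    return a | b

a, b = 0b1010_0000, 0b0000_0101   # disjoint bit positions
assert or_add(a, b) == a + b

# With even one overlapping bit, a carry chain appears and a full adder
# is required: the situation CASU's cyclic shifting of A0 avoids.
c, d = 0b1010_0000, 0b0010_0101
assert (c | d) != c + d
```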
Chip Photograph and Summary
One operation (OP) represents one multiplication or addition.
1) Highest-efficiency point: 0.65V, 78MHz, 50% input sparsity.
2) The baseline is 69.76%. 3) The baseline is 21.30. 4) The baseline is 80.31%.

Training and Inference Performance
POSIT(16,2) incurs only 0.04% training loss compared with FP32, and achieves 10.45TFLOPS/W average energy efficiency.

Performance Comparison
Comparison with SOTA FP-CIM macros.

Conclusion
An energy-efficient POSIT-based CIM macro achieving 83.23TFLOPS/W, with:
Bi-directional Regime Processing Codec: saves pre-processing energy by replacing the codec with shift-OR.
Critical-bit Pre-compute-and-store CIM Array: improves CIM utilization by using the spare bits for dual-bit MAC.
Cyclically-alternating Scheduling Adder Tree: reduces accumulation power by simplifying addition to OR.

34.2: A 16nm 96Kb Integer/Floating-Point Dual-Mode-Gain-Cell-Computing-in-Memory Macro Achieving 73.3-163.3TOPS/W and 33.2-91.2TFLOPS/W for AI-Edge Devices
Win-San Vince Khwa*1, Ping-Chun Wu*2, Jui-Jen Wu1, Jian-Wei Su2,3, Ho-Yu Chen2, Zhao-En Ke2, Ting-Chien Chiu2, Jun-Ming Hsu2, Chiao-Yen Cheng2, Yu-Chen Chen2, Chung-Chuan Lo2, Ren-Shuo Liu2, Chih-Cheng Hsieh2, Kea-Tiong Tang2, Meng-Fan Chang1,2
1TSMC Corporate Research, Hsinchu, Taiwan; 2National Tsing Hua University, Hsinchu, Taiwan; 3Industrial Technology Research Institute, Hsinchu, Taiwan
2024 IEEE International Solid-State Circuits Conference

Outline
Challenges of Dual-mode Computing-In-Memory
Proposed Computation Gain-cell CIM Macro: Overview of Integer/Floating-Point Dual-mode CIM Macro; Dual-mode Local-computing-cell (DM-LCC); Dual-mode Zone-based Input Processing Unit Scheme (ZB-IPS); Stationary-based Two-port Gain-cell Array Scheme (SB-TP-GCA)
Performance and Measurement Results
Conclusion
Motivations of INT/FP Dual-mode CIM
An advanced AI-edge chip desires: general purpose and computational flexibility; high energy efficiency and high area efficiency; sufficient inference accuracy.
There is a tradeoff between number formats in energy efficiency and inference accuracy:
Floating-point (FP) mode: high inference accuracy, but higher energy and more parameters.
Integer (INT) mode: higher energy efficiency and fewer parameters, but lower inference accuracy.
[Figure: normalized accuracy loss vs. normalized energy/OP for INT4, INT8, and BF16/FP16 with ResNet-18 on CIFAR-10, CIFAR-100, and ImageNet; INT offers better energy efficiency, FP better accuracy.]
It is therefore desirable to design an INT/FP dual-mode CIM.

Challenges of INT/FP Dual-Mode CIM (1/2)
Low area utilization in integer-mode multiply-and-accumulation (INT-MAC): the floating-point MAC functions, e.g. the exponent-computation and mantissa-alignment circuits, become idle in INT mode.
[Figure: A. previous FP-CIM and B. a straightforward dual-mode array, with the alignment and exponent-adder blocks sitting idle in INT mode.]

Challenges of INT/FP Dual-Mode CIM (2/2)
Long system-level latency: NN data updates interrupt computation without support for concurrent write-and-compute.
High energy consumption: redundant system-to-CIM data transfer during computation.
[Figure: with a two-port array (write/refresh port + compute port), writes (SRAM: W update; gain-cell: W update and refresh) overlap with CIM compute, reducing latency versus the conventional single-port case; a data-stationary mapping (ResNet-18, ImageNet) cuts system-to-CIM transfer energy by about 50x and CIM-internal access energy by about 100x versus non-stationary.]

Overview of Dual-mode Gain-cell-CIM Macro
The macro comprises a zone-based input processing unit, WL driver, dual-mode gain-cell computing array (64 gain-cell computing blocks, GC-CB), digital shift-and-adder, normal IO, and control, with three key schemes:
1. Dual-mode local-computing-cell (DM-LCC): reuses the exponent addition as an adder-tree stage for INT-MAC, improving area efficiency in INT mode.
2. Dual-mode zone-based input processing scheme (ZB-IPS): simplifies the subtraction in exponent computation and reuses the FP-mode alignment circuit for sparsity detection in INT mode, improving energy efficiency and area efficiency.
3. Stationary-based two-port gain-cell array (SB-TP-GCA): supports concurrent data updating and computation and reduces system-to-CIM and CIM-internal data accesses, improving system energy efficiency and latency.

Dual-mode Local-computing-cell (DM-LCC)
Each gain-cell computing block (GC-CB) contains a stationary-based gain-cell sub-array (SB-GCA), a dual-mode adder (DM-ADD), and a dual-mode multiplier with a dual-mode multiplexer (DM-MUX), plus leading-1 and sign-extension logic.
SB-GCA stores: INT8 mode: W0[7:0], W1[7:0]; BF16 mode: S, WM[6:0], WE[7:0].
DM-ADD computes: INT8 mode: 8b W0 + 8b W1 (pSUM); BF16 mode: 8b INE + 8b WE.
DM-MUX/multiplier computes: INT8 mode: (8b IN0 x 8b W0) + (8b IN1 x 8b W1); BF16 mode: 8b INM x 8b WM.
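A hedged software model of the dual-weight idea: with pSUM = W0 + W1 pre-computed by the shared adder, each bit-serial cycle can serve two unsigned input/weight pairs with a single 4:1 select instead of two separate additions (names and the LSB-first bit-serial order are illustrative assumptions, not the exact DM-LCC timing):

```python
def bit_serial_dual_mac(in0: int, in1: int, w0: int, w1: int, nbits: int = 8) -> int:
    """Bit-serial INT MAC over two input/weight pairs, selecting the
    pre-computed pSUM = w0 + w1 when both input bits are 1, so each
    cycle issues at most one wide operand (unsigned sketch)."""
    psum = w0 + w1                      # pre-compute (the reused DM-ADD)
    acc = 0
    for b in range(nbits):              # LSB-first bit-serial loop
        b0 = (in0 >> b) & 1
        b1 = (in1 >> b) & 1
        # select among {0, w0, w1, psum}: a 4:1 mux instead of two adders
        operand = (psum if b1 else w0) if b0 else (w1 if b1 else 0)
        acc += operand << b
    return acc

assert bit_serial_dual_mac(3, 5, 10, 7) == 3 * 10 + 5 * 7
```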
Computation Flow (Floating-Point Mode)
BF16 mode (NACCU = 64). The floating-point MAC operation has five compute phases:
Phase 1: Compute PDE(n) = INE(n) + WE(n), n = 0..63.
Phase 2: Align INM(n) based on PDE(n).
Phase 3: Compute PDM(n) = INMA(n) x WM(n), n = 0..63.
Phase 4: Accumulate the 64 PDM(n): pMACVM = Σ(n=0..63) INMA(n)·WM(n).
Phase 5: Combine PDE-MAX with pMACVM and output the FP32 MACV.

Computation Flow (Integer Mode)
INT8 mode (NACCU = 128), 100% utilization and 2x operations. The integer MAC operation has four compute phases:
Phase 1: Pre-compute pSUM(n) = W0(n) + W1(n), n = 0..63 (first adder-tree stage).
Phase 2: Process input sparsity and decode IN0(n) and IN1(n) for INT-MAC.
Phase 3: Compute pMACV2(n) = IN0(n)·W0(n) + IN1(n)·W1(n), n = 0..63.
Phase 4: Accumulate the 64 pMACV2(n) into the 23b MACV = Σ(n=0..127) IN(n)·W(n).
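The five FP-mode phases can be sketched end-to-end in a few lines (integer exponents and mantissas, truncating alignment; a behavioral sketch, not the DM-GCCA datapath):

```python
def fp_mode_mac(ins, wts):
    """Behavioral sketch of the five FP-mode phases.

    ins/wts: lists of (exponent, mantissa) with integer mantissas that
    already carry the hidden leading 1. Alignment truncates, as a
    fixed-point CIM datapath would. Returns (pMACVM, PDE_MAX); the
    caller scales by 2**PDE_MAX (normalization and bias handling omitted).
    """
    n = len(ins)
    # Phase 1: product exponents PDE(n) = INE(n) + WE(n)
    pde = [ins[i][0] + wts[i][0] for i in range(n)]
    # Phase 2: align input mantissas to PDE-MAX (right-shift by the gap)
    pde_max = max(pde)
    inma = [ins[i][1] >> (pde_max - pde[i]) for i in range(n)]
    # Phase 3: mantissa products PDM(n) = INMA(n) x WM(n)
    pdm = [inma[i] * wts[i][1] for i in range(n)]
    # Phase 4: accumulate all PDM(n) into pMACVM
    pmacvm = sum(pdm)
    # Phase 5: combine PDE-MAX with pMACVM into the output MACV
    return pmacvm, pde_max
```

Because every mantissa is pre-aligned to the same exponent, Phase 4 is a plain fixed-point accumulation, which is what lets the in-memory adder tree stay simple.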
103、 for AI-Edge Devices 2024 IEEE International Solid-State Circuits Conference25 of 48Dual-mode Zone-based Input Processing Unit:Motivation Two factors that leads to(1)area and energy and(2)hardware utilization Usage of full exponent bit-width in computing the number of shifting bits Mantissa bit exte
104、nsion after alignment to suppress truncation data loss00.10.20.30.40.50.68b10b12b16bMobileNet-v2(ImageNet)Accuracy Loss(%)0b4b8b6bExtended Bit#(Alignment)Area overhead&low utilizationFP ModeEXPALIGNMULTADDNumber of shifting bits NSH=(PDE-MAX PDE)Mantissa Bit ExtensionINT ModeIDLEMULTADDIDLEIDLEExtra
105、Exponent Handling8b1b7bBF16SignExpMan9bComparators9b PDE?9b PDE9b PDE-MAX9b PDE9b NSH9bSubtractorsPDE=Product Exponent=INE+WE-Bits34.2:A 16nm 96Kb Integer/Floating-Point Dual-Mode-Gain-Cell-Computing-in-Memory Macro Achieving 73.3-163.3TOPS/W and 33.2-91.2TFLOPS/W for AI-Edge Devices 2024 IEEE Inter
106、national Solid-State Circuits Conference26 of 48Dual-mode Zone-based Input Processing Unit:This Work Zone-based exponent handling Divides the product exponent spectrum into zones Computes NSHbased on the difference between PDEand its zone boundary Mantissa bit extension with phases and re-use hardwa
107、re for sparsity No hardware bit extension and higher hardware utilizationThis Work(2 Phase and Sparsity Detection)FP ModeDouble the Mantissa BitsBetter Inference Accuracy andNo Hardware Bit ExtensionBetter Hardware UtilizationEXPMULTZDBAADDINT ModeIN7:0PREMULTSPARADDMACVPhase 1For 15:8pMACVPh1Phase
108、2pMACVPh2Mantissa Bit ExtensionExponent HandlingSparsity Detection This Work(Zone-based Exponent Handling)For 15:8Phase 134.2:A 16nm 96Kb Integer/Floating-Point Dual-Mode-Gain-Cell-Computing-in-Memory Macro Achieving 73.3-163.3TOPS/W and 33.2-91.2TFLOPS/W for AI-Edge Devices 2024 IEEE International
Dual-mode Zone-based Input Processing Unit: Structure
- Zone-based Input Processing Unit (ZB-IPU)
  - Partial-PDE-MAX finder (pEMAXF): finds the MSB-6b (PDE-MAX[8:3]) of PDE-MAX
  - Zone bias unit (ZBU): generates 3 zone references: PDE-REF1 = {PDE-MAX[8:3], 111}, PDE-REF2 = PDE-REF1 - 8, PDE-REF3 = PDE-REF1 - 16
  - Zone detector (ZD): classifies each PDE(n) into one of the three zones (zone flag ZFG(n))
  - 64 dual-mode input processing blocks (DM-IPB#0..#63): align INM according to the zone-shift number (NSHZ) via zone-detect-based alignment (ZDBA)
- A step-by-step example follows.
Dual-mode Zone-based Input Processing Unit: Example
Two-stage operation (FP-MAC)
- Stage 1: pEMAXF finds the MSB-6b (PDE-MAX[8:3]) of PDE-MAX
  - Example: PDE(0) = 253 = 0b011111101 = PDE-MAX; PDE(1) = 243 = 0b011110011; PDE(63) = 236 = 0b011101100
  - ZBU then generates the 3 zone references according to PDE-MAX[8:3]: PDE-REF1 = {PDE-MAX[8:3], 111} = 255; PDE-REF2 = PDE-REF1 - 8 = 247; PDE-REF3 = PDE-REF1 - 16 = 239
- Stage 2: ZD classifies each PDE(n) into one of the three zones via its zone flag (ZFG)
  - Example: PDE(0) = 253 (ZFG = 1); PDE(1) = 243 (ZFG = 2); PDE(63) = 236 (ZFG = 3)
  - DM-IPB computes the zone-shift number NSHZ and aligns INM accordingly, where NSHZ is the bitwise inverse of PDE[2:0] (the LSB-3b)
  - Example: NSHZ(0) = ~PDE(0)[2:0] = 0b010 = 2 = PDE-REF1 - PDE(0); NSHZ(1) = ~PDE(1)[2:0] = 0b100 = 4 = PDE-REF2 - PDE(1); NSHZ(63) = ~PDE(63)[2:0] = 0b011 = 3 = PDE-REF3 - PDE(63)
  - Note: each 9b subtractor is thereby simplified to a 6b subtractor plus a 3b inverter
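The two-stage flow above can be checked numerically. The sketch below is plain Python, not the macro's circuit; clipping ZFG to 3 for exponents below the third zone boundary is an assumption for illustration:

```python
def zone_align(pdes):
    """Sketch of the two-stage zone-based exponent handling (ZB-IPS).

    pdes: list of 9-bit product exponents PDE(n) = INE(n) + WE(n).
    Returns (ZFG, NSHZ) per input."""
    # Stage 1: pEMAXF keeps only the MSB-6b of PDE-MAX
    pde_max_msb6 = max(pdes) >> 3
    # ZBU: three zone references derived from PDE-MAX[8:3]
    ref1 = (pde_max_msb6 << 3) | 0b111     # {PDE-MAX[8:3], 111}
    ref2 = ref1 - 8
    ref3 = ref1 - 16
    refs = [ref1, ref2, ref3]
    out = []
    for pde in pdes:
        # Stage 2: ZD classifies PDE into a zone by comparing MSB-6b only
        zfg = min(1 + (pde_max_msb6 - (pde >> 3)), 3)   # 1, 2 or 3 (clipped)
        # DM-IPB: zone-shift number = bitwise inverse of PDE[2:0] (3b inverter)
        nshz = (~pde) & 0b111
        # For zones 1 and 2 this equals the distance to the zone boundary,
        # so the full 9b subtraction PDE-REF - PDE is avoided
        if zfg < 3:
            assert nshz == refs[zfg - 1] - pde
        out.append((zfg, nshz))
    return out

# Worked example from the slides: PDE(0)=253, PDE(1)=243, PDE(63)=236
assert zone_align([253, 243, 236]) == [(1, 2), (2, 4), (3, 3)]
```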
Three zone cases (FP-MAC)
- PDE with ZFG = 1: INM alignments are executed in Ph1
- PDE with ZFG = 2: INM alignments are executed in Ph2
- PDE with ZFG = 3: triggers the input-sparsity-aware circuit to reduce compute energy
[Table: PDE zone vs. PDE[2:0], NSHZ, and the aligned Ph1/Ph2 INMA[7:0]. For PDE-REF2 < PDE <= PDE-REF1 (ZFG = 1): PDE[2:0] = 111 gives NSHZ = 0 (mantissa {1, IN6..IN0} unshifted in Ph1), PDE[2:0] = 110 gives NSHZ = 1, down to PDE[2:0] = 000 giving NSHZ = 7; ZFG = 2 alignments are output in Ph2]

34.3: A 22-nm 64-kb Lightning-like Hybrid Computing-in-Memory Macro with a Compressed Adder Tree and Analog-Storage Quantizers for Transformer and CNNs (2024 IEEE International Solid-State Circuits Conference)

Lightning-like Hybrid CIM Macro
- 2.08x better worst-case EF and 1.82x better peak EF; smaller EF variation compared with DCIM and ACIM
- Better performance due to the Lightning structure: 4x RC saving compared with a vertical-cut structure; 2.99x accuracy improvement compared with a weight-wise-cut structure
- Better performance due to the 4-2 compressor: 2.73x less peripheral overhead and 1.61x less internal bandwidth; 1.71x and 1.43x less area and power cost
- Better EF due to the ASQC: 6.56x energy-consumption saving compared with the benchmark
- Application performance based on double-regularization: 99.994% error points changed on multiple networks; 22.11x FoM improvement (compared with JSSC'22 & '23, ISSCC'22 & '23)

Demo System Setup
- Proposed hybrid CIM macro tested on ResNet and Vision Transformer (ViT)
- Test platform: FPGA + 64Kb Lightning CIM chip; waveforms measured in the PC through an ILA and on an oscilloscope; DC power supply
- ILA: vector testing; oscilloscope: access-time measurement; DC power: CIM_VDD supply and power measurement

Conclusion
Features of the proposed Lightning-like hybrid CIM macro:
- Lightning-like hybrid CIM architecture: 4x RC decrease and 2.99x accuracy improvement
- NOR-based 4-2 compressor for high-precision digital calculation: 1.71x less area consumption and 1.43x power saving
- Analog-Storage Quantizer (ASQC) for low-power analog calculation: 6.56x less energy cost and better mitigation of error accumulation
A 22nm 64Kb Lightning-like hybrid CIM macro is verified: 8b IN, 8b W, 23b OUT; access time 4.1ns; 524.7GOPS/mm2 at 0.9V; 60.8TOPS/W at 0.6V

Thanks for your kind attention!

34.4: A 3nm 32.5 TOPS/W, 55.0 TOPS/mm2 and 3.78 Mb/mm2 Fully Digital Computing-in-Memory Supporting INT12 x INT12 with Parallel MAC Architecture (2024 IEEE International Solid-State Circuits Conference)
Hidehiro Fujiwara1, Haruki Mori1, Wei-Chang Zhao1, Kinshuk Khare1, Cheng-En Lee1, Xiaochen Peng2, Vineet Joshi3, Chao-Kai Chuang1, Shu-Huan Hsu1, Takeshi Hashizume4, Toshiaki Naganuma4, Chen-Hung Tien1, Yao-Yi Liu1, Yen-Chien Lai1, Chia-Fu Lee1, Tan-Li Chou1, Kerem Akarvardar2, Saman Adham3, Yih Wang1, Yu-Der Chih1, Yen-Huei Chen1, Hung-Jen Liao1, Tsung-Yung Jonathan Chang1
1TSMC, Hsinchu, Taiwan; 2TSMC, San Jose, CA; 3TSMC, Ottawa, Canada; 4TSMC, Yokohama, Japan

Outline
- Background
- Circuit Implementation
- Si Measurement Results
- Conclusion

Digital CIM
- Data movement and multiply-and-accumulate (MAC) are key
- Digital CIM (DCIM): distributed banks + customized MAC unit
- Directly leverages technology scaling; better testability
[Figures: computing area efficiency (TOPS/mm2), memory area efficiency (Mb/mm2), and energy efficiency (TOPS/W) across 22nm, 12nm, 4nm, and 3nm nodes, with this work highlighted]

Key Features
- 72 input channels x 4 output channels with signed INT12 format
- 18 sets of weights in the sub-array with foundry 6T cells
- Dual-rail design
- Parallel MAC with look-up table (LUT)
- DFT & BIST interface + CIM BIST

Block Diagram
- Interface-dual-rail (IDR)-like design; MAC in the VDD (low-voltage) domain
- Interface + CTRL + DFT, MAC with LUT, decode/LS, and data storage (SRAM array + R/W)
- 196 columns (12b x 4 In Ch x 4 Out Ch); 18 rows x 18 segments (18 weight sets x 18 In Ch)
- VDD domain with low-VT devices; VDDM domain with high-VT devices

Sub Array with Flying BL
- Foundry 6T cell with flying-BL architecture: BL_BOT/BLB_BOT routed on Mx, BL_TOP/BLB_TOP on Mx+2
- 5% macro-area reduction by array layout optimization
- Simultaneous access to all banks; 48 columns (12b x 4 In Ch); 18-row top and bottom segments each with local IO (LIO); MAC w/ LUT (8 inputs & 8 weights) in the VDD domain

Simple Local IO
- Read with MUX1: better throughput
- Write with MUX4: global routing-track saving

Parallel MAC vs Serial MAC
- Parallel MAC: multi-bit x multi-bit multipliers feeding the adder tree
- Serial MAC: multi-bit weight x single input bit (XIN applied bit by bit, MSB to LSB) + shift & add
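The two MAC styles compute the same dot product; only the dataflow differs. A minimal sketch with illustrative function names, using unsigned inputs for simplicity even though the macro computes signed INT12:

```python
def serial_mac(xins, weights, nbits=12):
    """Bit-serial MAC: multi-bit weights x one input bit per cycle,
    followed by shift & add across the bit planes (MSB first)."""
    acc = 0
    for b in range(nbits - 1, -1, -1):
        # one cycle: sum the weights whose input has bit b set
        cycle_sum = sum(w for x, w in zip(xins, weights) if (x >> b) & 1)
        acc = (acc << 1) + cycle_sum        # shift & add
    return acc

def parallel_mac(xins, weights):
    """Parallel MAC: multi-bit x multi-bit products into one adder tree."""
    return sum(x * w for x, w in zip(xins, weights))

xs = [3071, 17, 4095, 0, 1234]
ws = [5, 2047, 1, 99, 7]
assert serial_mac(xs, ws) == parallel_mac(xs, ws)
```

The serial form trades nbits cycles for narrow per-cycle hardware; the parallel form finishes in one pass, which is what enables the lower toggle rate discussed on the next slide.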
Less Energy in Parallel MAC
- High throughput & low toggle rate in the parallel MAC
- 7% fewer devices and 5% faster speed using the LUT approach
[Figures: average toggle rate on ImageNet, serial vs. parallel MAC: AlexNet 27.8% vs 19.8%, ResNet-50 21.2% vs 18.4%, MobileNetV2 27.4% vs 25.9%, Inception-v1 25.0% vs 19.5%; per-bit toggle rate in the parallel MAC from MSB to LSB]

Pipeline Structure
- 3 stages: SRAM array access + 2 MAC stages
- SRAM array access occurs only when the weight is updated for a MAC
- MAC1-MAC3: no weight update (data reuse); MAC0 & MAC4: weight update
[Timing diagram: cycles 1-7 showing SRAM array access overlapped with MAC stage 1 and MAC stage 2]

Simulation Results
- Dedicated pipeline design verified with a transistor-level STA tool
- TOPS/W depends on the input toggle rate and the SRAM array access rate
[Figures: cycle time vs. VDD (0.4-0.7V) for MAC stage 1, MAC stage 2, and SRAM array access at VDDM = 0.675V; TOPS/W vs. input toggle rate at VDD = 0.55V, weight = 1: 50%]

TOPS/W at 0.55V | Input toggle 10% | Input toggle 25%
No SRAM array access | 53.8 | 32.5
SRAM array access every 4 cycles | 30.3 | 22.9

3nm Test Chip Micrograph
- 62.3 µm x 252 µm: GIO, CTRL, WLDRV/XINDRV, and 9 MAC blocks with 8 input channels each
- Bitcell: foundry 6T cell (0.026 µm2); macro area: 0.0157 mm2; bit capacity: 60.75 kb
- MAC size: 72 In Ch x 4 Out Ch x 18 weight sets; bit density: 3.78 Mb/mm2
- TOPS/mm2: 10.6 at 0.4V, 55.0 at 0.9V
- CIM BIST based on a commercial MBIST engine: 99% fault coverage by a custom algorithm; test time 2 ms/macro

Measurement Results
- Measured at 40°C under VDD = VDDM and VDDM-fixed conditions
[Shmoo-plot residue: VDD (V) vs. cumulative percent, 0.01-99.99%]

Analog CIM vs. Digital CIM
- Analog CIM: bit multiply inside the array (less in-array consumption), but heavy ADCs plus shift & add for the multi-bit addition (larger out-array consumption)
- Digital CIM: logic-gate bit multiply (larger in-array consumption), adder tree for bit shift & multi-bit addition (less out-array consumption)

34.6: A 28nm 72.12TFLOPS/W Hybrid-Domain Outer-Product Based Floating-Point SRAM Computing-in-Memory Macro with Logarithm Bit-Width Residual ADC (2024 IEEE International Solid-State Circuits Conference)

Challenges of SRAM CIM (2/3)
- High-precision weights require a quite heavy, high-precision ADC
- Bit-serial inputting needs shift & add with an N-bit ADC; an M-bit DAC input needs an (M+N)-bit ADC
- Flash ADC: high throughput but high cost; SAR ADC (SAR logic, decode logic, DAC cap array): low cost but low throughput
- Trade-off among ADC precision, throughput, and overhead: higher bit precision increases the ADC burden
Challenges of SRAM CIM (3/3)
- Inner-product-based CIM carries a large fan-in adder-tree burden: the transmission delay increases rapidly as the fan-in grows, and a multi-level adder tree introduces higher propagation delay
- Inner-product-based CIM (element-wise summation, higher propagation delay): $c_{i,j} = \sum_{k=1}^{n} a_{i,k}\, b_{k,j}$
- Outer-product-based CIM (vector-wise accumulation into registers, much less propagation delay): $c_{i} = \sum_{j=1}^{n} b_{j}\, a_{i,j}$
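Both formulations yield the same matrix product; only the accumulation order, and therefore the adder fan-in per output, differs. A small sketch with illustrative names:

```python
def inner_product_mm(A, B):
    """Inner-product formulation: c[i][j] = sum_k a[i][k]*b[k][j].
    Each output element needs one wide fan-in summation (adder tree)."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def outer_product_mm(A, B):
    """Outer-product formulation: C = sum_k (column k of A) x (row k of B).
    Each step is a rank-1 update accumulated vector-wise, so no wide tree."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for k in range(m):                      # one rank-1 update per step
        for i in range(n):
            for j in range(p):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
assert inner_product_mm(A, B) == outer_product_mm(A, B)
```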
Our Motivation
Basic observation: an N-bit x N-bit multiplication can be transformed into different forms, corresponding to different CIM topologies:
$IN \times W = \Big(\sum_{i=0}^{N-1} IN(i)\,2^{i}\Big)\Big(\sum_{j=0}^{N-1} W(j)\,2^{j}\Big)$
(1) $= \sum_{i=0}^{N-1} 2^{i}\, W(i) \sum_{j=0}^{N-1} IN(j)\,2^{j}$
(2) $= \sum_{i=0}^{N-1} 2^{i}\, IN(i) \sum_{j=0}^{N-1} W(j)\,2^{j}$
(3) $= \sum_{i=0}^{2N-2} 2^{i} \sum_{j=\max(0,\,i-N+1)}^{\min(i,\,N-1)} W(i-j)\, IN(j)$

- Form (1): DAC-based bit-parallel ACIM topology (a DAC drives IN[N-1:0] against weight bit-planes W0..WN-1, read out by a 2N-bit ADC). High throughput, but quite high ADC bit precision and more accuracy loss.
- Form (2), analog: bit-serial ACIM topology (N-bit ADC followed by shift & add). Lower throughput but a lower ADC bit-precision requirement.
- Form (2), digital: bit-serial DCIM topology (NOR gates + adder tree + shift & add). No ADC and no accuracy loss, but embedded logic gates are needed.
- Form (3): motivation of the proposed hybrid CIM scheme. The analog part performs the bit multiplication and bit addition; the digital part performs the bit shift (an adder replaces the analog $2^{N}C$ capacitive shifting). Removing the bit-shift process from the analog CIM lowers the required ADC bit precision.
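The three forms can be verified against each other numerically. The sketch below is a plain re-statement of equations (1)-(3) for N = 8, not the macro's circuit:

```python
def bits(v, n):
    """LSB-first bit list of an n-bit unsigned value."""
    return [(v >> i) & 1 for i in range(n)]

def form1(IN, W, N):
    # Bit-parallel (DAC-based ACIM): weight bit-planes x the full multi-bit IN
    inb, wb = bits(IN, N), bits(W, N)
    return sum((2 ** i) * wb[i] * sum(inb[j] * 2 ** j for j in range(N))
               for i in range(N))

def form2(IN, W, N):
    # Bit-serial (ACIM or DCIM): input bit-planes x the full multi-bit W
    inb, wb = bits(IN, N), bits(W, N)
    return sum((2 ** i) * inb[i] * sum(wb[j] * 2 ** j for j in range(N))
               for i in range(N))

def form3(IN, W, N):
    # Outer-product/hybrid: group 1b x 1b partial products by their bit weight
    inb, wb = bits(IN, N), bits(W, N)
    return sum((2 ** i) * sum(wb[i - j] * inb[j]
                              for j in range(max(0, i - N + 1), min(i, N - 1) + 1))
               for i in range(2 * N - 1))

# Sampled exhaustive check for 8b x 8b
for IN in range(0, 256, 7):
    for W in range(0, 256, 11):
        assert form1(IN, W, 8) == form2(IN, W, 8) == form3(IN, W, 8) == IN * W
```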
Outline
- Proposed Hybrid-domain SRAM-CIM Macro: Macro Overall Architecture

Macro Overall Architecture
- Hybrid-domain 8b macro with analog in-array & digital out-array computing
- Logarithm bit-width zero-canceling residual analog-to-digital converter
- Outer-product-based floating-point/fixed-point CIM block architecture
[Block diagram: activation buffer, CTRL, BL driver, WL driver, CLK; CIM blocks with SRAM sub-array (16b columns), multiplier, analog bit-mul & bit-add, residual ADC stages, and sparsity control circuit; exponent compute circuit, sign circuit, accumulator, INT/FP adder, 16-bit registers, multi-bit accumulator]

Outline
- Proposed Hybrid-domain SRAM-CIM Macro: Analog In-Array & Digital Out-Array Hybrid-Domain Scheme

Analog In-Array & Digital Out-Array Hybrid-Domain Scheme
- Analog bit multiplier and bit adder compute $O_{i}(\text{Analog}) = \sum_{j=\max(0,\,i-7)}^{\min(i,7)} W(i-j)\, A(j)$
- Current-based analog CIM scheme with current mirrors and a transmission-gate array ($I_{out}$ mirrored from $I_{ref}$ through unit currents $I_{unit}$); weights and activations are held in 6T cells on the bitlines (BL/BLB)
- Partial sums by bit weight: $O_{0} = W_{0}A_{0}$; $O_{1} = W_{0}A_{1} + W_{1}A_{0}$; $O_{2} = W_{0}A_{2} + W_{1}A_{1} + W_{2}A_{0}$; ...; $O_{7} = W_{0}A_{7} + \dots + W_{7}A_{0}$; $O_{8} = W_{1}A_{7} + \dots + W_{7}A_{1}$; ...; $O_{14} = W_{7}A_{7}$
196、 0Oi(Digita2l)=min(i,7)14ii 0j max(0,i 7)(Wji)2Aj=DQ O1O2O6O7O8O9O13O141:11:1ADC1ADC21.86x FoM improvementFoM=IN-precision W-precision OUT-ratio Energy EfficiencyISSCC20214ISSCC20223ISSCC20231This work5.763.551.8600.51*When calculating figure of merit(FoM),BF16 is considered as 16*Output Radio=Real
197、output accuracy/Ideal output accuracy*All take maximum average energy efficiency,accuracy running on Cifar-100*34.6:A 28nm 72.12TFLOPS/W Hybrid-Domain Outer-Product Based Floating-Point SRAM Computing-in-Memory Macro with Logarithm Bit-Width Residual ADC 2024 IEEE International Solid-State Circuits
198、Conference27 of 32Shmoo Plot0.950.900.600.850.800.750.700.65VDD(V)Access time(ns)4.08.54.55.05.56.06.57.07.58.0Shmoo:BF16 IN,BF16 W,BF16 OUTMeasurement results BF16-IN,BF16-W,BF16-OUTAccess time(tAC)=4.78ns VDD=0.95VFAILPASS34.6:A 28nm 72.12TFLOPS/W Hybrid-Domain Outer-Product Based Floating-Point S
199、RAM Computing-in-Memory Macro with Logarithm Bit-Width Residual ADC 2024 IEEE International Solid-State Circuits Conference28 of 32Technology(nm)Macro Area(mm2)SRAM Capacity(Kb)Supply Voltage(V)Number of InputChannelsInput PrecisionBF16INT8Weight PrecisionBF16INT8Throughput1.98-4.28TFLOPS2.89-5.31TO
200、PSEnergy Efficiency(Averageperformance)16.55-32.78TFLOPS/W22.78-50.53TOPS/WEnergy Efficiency(Peak performance)36.41-72.12TFLOPS/W50.12-111.17TOPS/WAccuracy Loss(Cifar100)-0.05%-1.57%256Chip Summary281.94192Kb0.7-0.95Chip SummaryCIM Macro1.54mm1.26mm*Performance in 90%input sparsity(ResNet-18)*Using
201、ResNet-18 model,the software baseline was 75.31%*34.6:A 28nm 72.12TFLOPS/W Hybrid-Domain Outer-Product Based Floating-Point SRAM Computing-in-Memory Macro with Logarithm Bit-Width Residual ADC 2024 IEEE International Solid-State Circuits Conference29 of 32Measurement SystemDC PowerPCLogic AnalyzerTe
202、st BoardJTAG34.6:A 28nm 72.12TFLOPS/W Hybrid-Domain Outer-Product Based Floating-Point SRAM Computing-in-Memory Macro with Logarithm Bit-Width Residual ADC 2024 IEEE International Solid-State Circuits Conference30 of 32Outline Conclusion34.6:A 28nm 72.12TFLOPS/W Hybrid-Domain Outer-Product Based Flo
203、ating-Point SRAM Computing-in-Memory Macro with Logarithm Bit-Width Residual ADC 2024 IEEE International Solid-State Circuits Conference31 of 32Conclusion Features of proposed Hybrid-domain FP CIM macroAnalog In-Array&Digital Out-Array Hybrid-Domain CIMLogarithm Bit-Width Residual ADCOuter-product b
204、ased FP/INT CIM Block Architecture A 28nm FP SRAM CIM Macro is VerifiedBF16IN-BF16W-BF16OUT Throughput:1.98-4.28 TFLOPS Energy Efficiency:16.55-72.12 TFLOPS/WINT8IN-INT8W-INT16OUT Throughput:2.89-5.31 TOPS Energy Efficiency:22.78-111.17 TOPS/W34.6:A 28nm 72.12TFLOPS/W Hybrid-Domain Outer-Product Bas
205、ed Floating-Point SRAM Computing-in-Memory Macro with Logarithm Bit-Width Residual ADC 2024 IEEE International Solid-State Circuits Conference32 of 32Thank you for your kind attention!34.6:A 28nm 72.12TFLOPS/W Hybrid-Domain Outer-Product Based Floating-Point SRAM Computing-in-Memory Macro with Logar
206、ithm Bit-Width Residual ADC 2024 IEEE International Solid-State Circuits Conference33 of 32Please Scan to Rate Please Scan to Rate This PaperThis Paper34.7:A 28-nm 2.4-Mb/mm2 6.9-16.3-TOPS/mm2 eDRAM-LUT-Based Digital Computing-in-Memory Macro with In-Memory Encoding and Refreshing 2024 IEEE Internat
Yifan He1, Shupei Fan1, Xuan Li1, Luchang Lei1, Wenbin Jia1, Chen Tang1, Yaolei Li1, Zongle Huang1, Zhike Du1, Jinshan Yue2, Xueqing Li1, Huazhong Yang1, Hongyang Jia1, Yongpan Liu1
1Tsinghua University, Beijing; 2Institute of Microelectronics of CAS, Beijing

eDRAM-based digital CIM
- Why digital CIM? Higher throughput & accuracy; however, large logic overhead & low memory density
- Why eDRAM CIM? Fewer transistors & potentially higher density, but with refreshing overhead
[Figures: efficiency comparison of analog vs. digital CIM (normalized energy efficiency, TOPS/W, for current-, charge-, time-, and digital-based schemes vs. normalized compute density, TOPS/mm2); storage density (Kb/mm2) of SRAM-, eDRAM-, and NVM-based CIM vs. technology node (7-65nm); density comparison of SRAM CIM (digital adder tree) vs. eDRAM CIM (eDRAM cell); prior work: Y. Chih ISSCC'21 16.3, D. Wang ISSCC'22 16.1, Y. He ISSCC'23 7.3, Z. Chen ISSCC'21 15.3, S. Kim ISSCC'23 16.5]

Existing digital CIM solutions
- Existing digital CIM designs focus on the computation circuits (Source: ISSCC'21 16.4, ISSCC'22 15.5, ISSCC'23 7.3)
- Solution 1: circuit optimization. Different types of FA designs; universal improvement, but a limited design space
- Solution 2: logic optimization. Efficient MAC implementation; higher efficiency/throughput, but complex logic & larger area
- Solution 3: algorithm co-optimization. Approximate computing; breaks the digital-logic limit, but with accuracy loss

Motivation for eDRAM-LUT-based digital CIM
- Previous analog CIM solution: CDAC-based SAR-ADC with charge-based CIM cells and reconfigurable Vref+ (motivated by ISSCC'21 15.3 & ISSCC'23 15.5); this work is the analogous digital CIM solution
- 1. eDRAM benefits DCIM: higher density (3T eDRAM cells, 9-track full adder)
- 2. DCIM benefits eDRAM: parallel refresh (local R&W: refresh time 1x; global R&W over the whole eDRAM array: refresh time 16x)
- CS-eDRAM CIM bank: 16 eLADs (eLAD 0..15), each holding a 16-entry 10b LUT (LUT 0..LUT 15), under a hierarchical WL decoder
eLAD implementation details
[Schematic: 16 rows on LWBL/LRBL with WWL/RWL, a write buffer on GWBL, a 4-16 LSB RWL decoder, top control (ACT, GWBL_EN, Decoder_EN, CIM_EN), power gating, and the eLAD data mapping.]
- The eDRAM-LUT-based CIM saves 40% area compared to conventional DCIM.

Design choice of LUT-based CIM
- Area & energy comparison of different LUT cells; density trade-off between different numbers of weights per LUT.
- A four-weight eDRAM LUT is the sweet spot of the density/efficiency trade-off.
IMREP implementation details
[Schematic: IMREP between LUT 0 and LUT 1 (LRBL/LWBL, RWL/WWL rows, REN/RENB, PRE, WB_RFS, WB_SUM) with write-back paths for refresh and for encoding, reusing a shared FA from the 1st-stage adder tree. Timing: conventional serial refresh walks rows one group at a time over T=1..N, while the proposed in-memory refresh refreshes LUT rows in parallel.]
In-memory refresh
- Conventional serial refresh visits rows sequentially (rows 1-4 refreshed at T=1..4, row N last, the rest unrefreshed in the meantime); the proposed in-memory refresh writes LUT0/LUT1 rows back in parallel through the IMREP write-back path.
[Schematic: IMREP with write-back for refresh and write-back for encoding, shared FA from the 1st-stage adder tree.]
In-memory encode (1/2)
- Stage 1 (serial loading): weights such as W0,0, W0,1, W1,0, W1,1 are loaded into LUT0/LUT1 one per cycle (T=1..4).
- Stage 2 (encoding): intermediate sums are produced by double writing through the IMREP (T=5..6), encoding the remaining LUT entries in-memory.
[Schematic: IMREP write-back paths for refresh and for encoding.]
In-memory encode (2/2)
[Same IMREP schematic and in-memory encode example, continued.]
Complete in-memory encode flow
- The encoding flow guarantees no read-write conflicts.
- Serial load (64 cycles): the eight weights W00-W03 and W10-W13 are written into the single-weight entries of LUT 0 and LUT 1 (16 entries each, indexed 0-15).
- Weight encode (22 cycles): the remaining entries are filled in-memory by reading partial sums from the LUT and writing back their combinations (W00+W01, W00+W02, W01+W02, W00+W03, ..., W00+W01+W02, ..., up to W00+W01+W02+W03, and likewise W10+W11 through W10+W11+W12+W13).
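The encode flow above fills a 16-entry LUT with every subset sum of four weights, so a MAC then needs only one LUT read plus a shift-add per input bit-plane. A minimal sketch of that idea in Python (the names `build_lut` and `lut_mac` are illustrative, not from the paper):

```python
def build_lut(w):
    """16-entry LUT: entry idx holds the sum of the weights whose bit is set in idx."""
    assert len(w) == 4
    return [sum(w[i] for i in range(4) if (idx >> i) & 1) for idx in range(16)]

def lut_mac(weights, inputs, nbits=8):
    """Dot product of 4 weights with 4 unsigned nbits inputs via bit-serial LUT reads."""
    lut = build_lut(weights)
    acc = 0
    for b in range(nbits):
        # 4-bit selector: bit-plane b of the four inputs indexes the LUT
        idx = sum(((x >> b) & 1) << i for i, x in enumerate(inputs))
        acc += lut[idx] << b  # one LUT read + shift-add per bit-plane
    return acc
```

For example, `lut_mac([2, 3, 5, 7], [1, 2, 3, 4])` returns 51, matching 2*1 + 3*2 + 5*3 + 7*4; in hardware the 16 entries are produced by the 64-cycle serial load plus the 22-cycle in-memory encode rather than computed directly.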
IMREP evaluation
- Improved efficiency & throughput by reducing global data movement.
[Charts: in-memory refresh and in-memory encode improvements as normalized energy/latency/macro-area cost versus near-memory baselines, annotated -57%, -64%, +2% and 9.5x, 16x, +10%.]

Outline
- Introduction
- Challenge of eDRAM-LUT-Based Digital CIM
- Proposed eDRAM Digital CIM: overall architecture and basic principle; two-stage eDRAM-LUT-based adder tree; in-memory refreshing and LUT encoding
- Measurement Results
- Conclusion
Chip photograph
[Die photo: 1.62mm x 2.0mm; PLL, input/output SRAM & test modes, 2Mb 10x5 eDRAM macro array, WWL/RWL decoders & drivers. Each 78um x 221um eDRAM macro holds two sets of 8 x 256 x 10b LUT banks + adder trees, shifter-adders, top CTRL, and WBL drivers.]

Chip summary
- Technology: 28nm HKMG
- Bit cell: 3T eDRAM, 0.145 um2 cell area
- Macro area: 0.017 mm2; macro capacity: 40Kb; total capacity: 2Mb
- Total throughput: 10 TOPS; compute circuit: LUT + adder tree
- Channels: 64 input, 16 output
- Supply voltage: 0.9-1.2V; frequency: 400-1100MHz
- Storage density: 2.4 Mb/mm2 (2684 F2/b normalized)
- Energy efficiency: 19.7 TOPS/W (8b, 0.9V); area efficiency: 16.7 TOPS/mm2 (8b, 1.2V)

Test platform
[Photos: test IC on a thermoelectric heat pump, with thermal imager, FPGA, current meter, DC source, and host PC.]
Voltage & efficiency scaling
- Best efficiency at 0.9V, 400MHz; best performance at 1.2V, 1100MHz.
- Best 8b energy efficiency: 19.7 TOPS/W at 0.9V.
[Plots: frequency (400-1100MHz) vs. supply voltage (0.85-1.20V); 8b energy efficiency (TOPS/W) vs. voltage at 10-50% sparsity.]

eDRAM retention test
- Minimal refreshing overhead with in-memory acceleration.

Power and area breakdown
- Energy breakdown: 42%, 32%, 8%, 18%; area breakdown: 47%, 34%, 6%, 13% (eDRAM array, adder tree, shift & adder, CTRLs & drivers).
- Memory circuits dominate both the area and the energy breakdown.
[Layout detail: two 128x10 eDRAM arrays with a 16x10b adder tree, 12.5um x 73um.]
Comparison with state-of-the-art CIM macros
- Exceeds prior work in both storage density and compute density.
- Table metrics (28nm): 8b TOPS/W at 0.9-1.1V, normalized by Output Bits / (Input Bits + Weight Bits + log2(# of Input Channels)); 8b peak TOPS/mm2; storage density.
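The comparison slide quotes a precision-normalization factor, Output Bits / (Input Bits + Weight Bits + log2(# of Input Channels)). A small sketch of that expression (the function name and its use as a multiplier on raw throughput numbers are my assumptions, not the paper's):

```python
import math

def precision_norm_factor(out_bits, in_bits, w_bits, n_in_channels):
    """Normalization factor quoted on the comparison slide:
    Output Bits / (Input Bits + Weight Bits + log2(#Input Channels))."""
    return out_bits / (in_bits + w_bits + math.log2(n_in_channels))
```

With 8b inputs, 8b weights, and 64 input channels, a full-precision accumulation needs 8 + 8 + log2(64) = 22 output bits, so a macro that keeps all 22 bits gets a factor of 1.0 and truncated-output designs get less.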
Outline (repeated)

Conclusion
- Digital CIM has unique advantages in performance, robustness, and scalability.
- Previous DCIM works mainly optimized the computation logic instead of treating memory and computation as a whole.
- The eDRAM-LUT-based CIM achieves 16.7 TOPS/mm2 8b compute density with 2.4 Mb/mm2 storage density.
- Further improvement can be achieved by combining other high-density emerging devices in the future.

34.8: A 22nm 16Mb Floating-Point ReRAM Compute-in-Memory Macro with 31.2TFLOPS/W for AI Edge Devices
2024 IEEE International Solid-State Circuits Conference
Tai-Hao Wen*1, Hung-Hsi Hsu*1,2, Win-San Khwa*2, Wei-Hsing Huang1, Zhao-En Ke1, Yu-Hsiang Chin1, Hua-Jin Wen1, Yu-Chen Chang1, Wei-Ting Hsu1, Chung-Chuan Lo1, Ren-Shuo Liu1, Chih-Cheng Hsieh1, Kea-Tiong Tang1, Shih-Hsin Teng3, Chung-Cheng Chou3, Yu-Der Chih3, Tsung-Yung Jonathan Chang3, Meng-Fan Chang1,2
1National Tsing Hua University (NTHU), Hsinchu, Taiwan; 2TSMC Corporate Research, Hsinchu, Taiwan; 3TSMC, Hsinchu, Taiwan

Outline
- Challenges of floating-point (FP) nonvolatile compute-in-memory (nvCIM) for AI edge devices
- Proposed schemes in the nonvolatile CIM macro: overview of the floating-point ReRAM-CIM macro; FP computing flow with kernel-wise weight pre-alignment (K-WPA); rescheduled multi-bit input compression (RS-MIC); HRS-favored dual-sign-bit (HF-DSB) weight encoding
- Performance and measurement results
- Conclusion

Challenges of Nonvolatile Compute-In-Memory (1/3)
- Tradeoff of the bit width for weight pre-alignment (PA) between inference accuracy and weight storage: increasing the bit width enhances inference accuracy, while decreasing it reduces storage needs.
- Conventional layer-wise PA at 8b suffers data truncation of the aligned sign/mantissa words, while 12b inflates storage.
[Charts: aligned sign/mantissa layouts at PA bit widths of 8b and 12b; storage cost and accuracy loss (conventional vs. proposed K-WPA) across PA bit widths of 8, 10, 12.]
Challenges of Nonvolatile Compute-In-Memory (2/3)
- Long MAC latency/energy due to the high input bit width of the floating-point format.
- (1) Digital bit-serial multi-bit inputs suffer from long latency/energy (e.g., a 2b input IN=3 takes 2 cycles).
- (2) Analog WL pulse-count and (3) analog decoded WL pulse-width multi-bit inputs suffer from low signal margin (3x pulses or 3x pulse width for IN=3).
- The proposed RS-MIC handles the same input in 1 cycle with FLAG_MIC=1, reducing latency without sacrificing signal margin.
[Charts: latency/energy and signal margin of schemes (1)-(3) versus the proposed RS-MIC.]

Challenges of Nonvolatile Compute-In-Memory (3/3)
- Large cell-array current consumption due to the large number of memory cells needed for FP weight storage; LRS cells consume much more energy per bit-cell access than HRS cells.
- The proposed HF-DSB weight encoding reduces the LRS-cell ratio from 45% (conventional) to 35%, raising the HRS share from 55% to 65%.
[Charts: weight distribution after pre-alignment; per-bit-cell access energy of LRS vs. HRS.]
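The slides quote the LRS-ratio reduction but not the internals of HF-DSB. As a hedged illustration of the underlying principle only (bias stored bits toward the low-energy HRS '0' state), this sketch stores each word either as-is or complemented, whichever has fewer 1s, plus a one-bit flag; this is generic inversion coding, not the paper's dual-sign-bit scheme:

```python
def hrs_favored_encode(word, nbits=8):
    """Store the word or its complement, whichever has fewer LRS ('1') bits."""
    mask = (1 << nbits) - 1
    ones = bin(word & mask).count("1")
    if ones > nbits // 2:          # majority would be LRS: store the complement
        return (word ^ mask), 1    # flag = 1 marks an inverted word
    return word & mask, 0

def hrs_favored_decode(stored, flag, nbits=8):
    return stored ^ ((1 << nbits) - 1) if flag else stored
```

Every stored 8b word then carries at most four LRS cells (plus the flag), at the cost of one extra bit per word; the real HF-DSB encoding presumably exploits the post-alignment weight distribution shown on the slide rather than a per-word majority vote.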
Outline (repeated)

Overview of the floating-point ReRAM-CIM macro
[Block diagram: HF-DSB weight encoder (C3), WL driver, RS-MIC (C2), 1T1R ReRAM-CIM array with K-WPA and HF-DSB (C1 & C3) holding pre-aligned weights, input pre-alignment for FP inputs, mux and ADC, shifter and sign-aware adder (C3), MIC-aware accumulator (C2), exponent processing (C1), CTRL, FP outputs.]
- FP computing flow with kernel-wise weight pre-alignment (K-WPA): reduces accuracy loss from data truncation during weight pre-alignment.
- Rescheduled multi-bit input compression (RS-MIC): reduces MAC energy and latency with lossless compression.
- HRS-favored dual-sign-bit (HF-DSB) weight encoding: reduces ReRAM-array current consumption.
Outline (repeated)

FP computing flow with kernel-wise weight pre-alignment (1/7)
- Kernel-wise weight pre-alignment (K-WPA) is an offline alignment process that eliminates the need for on-chip weight alignment.
- Find the maximum exponent (EK) in each kernel, then align each weight's sign and mantissa by its exponent difference (EK - Ei).
- Kernel 0 weights in FP16/BF16 format (Si, Ei, Mi) become pre-aligned mantissas PA-MK0 (Q bits, Q = 8, 10, 12, 16) sharing the kernel-0 exponent EK0.
[Flow diagram: the ReRAM array with K-WPA stores PA-MK0..KN (Q bits each, with HF-DSB) and the kernel-shared exponents EK0..KN; A FP16/BF16 inputs pass through the input pre-alignment circuit (PA-MIN, P bits) and RS-MIC; mux and ADC, shifter and sign-aware adder, MIC-aware accumulator, and exponent processing produce the FP32 output.]
FP computing flow with kernel-wise weight pre-alignment (2/7)
- Before the MAC operation, the kernel-shared exponents (EK) are loaded from a ReRAM array row into a register via a normal read.
- The kernel-wise pre-aligned mantissas are then ready to conduct the MAC operation with the inputs.
[Flow diagram as in (1/7).]

K-WPA reduces data truncation during weight alignment
- K-WPA achieves a fine-grained, kernel-wise alignment reference: conventional layer-wise WPA aligns every kernel against a single layer-wide reference, so large weights in one kernel force data truncation in the others, whereas a per-kernel reference preserves more weight data within the same Q bits.
- 1.86x improvement over ISSCC'23.

Chip summary
- Technology: 22nm CMOS logic process (ultra-low leakage)
- Memory device: foundry-provided 1T1R ReRAM
- ReRAM-CIM capacity: 16Mb (16 sub-banks)
- Input/weight precision: FP16/BF16; output precision: FP32
- Macro area (incl. test mode): 8.2 mm2
- Supply voltage: 0.7-0.8V
- Throughput*1*3: 0.86 TFLOPS (BF16), 0.78 TFLOPS (FP16)
- Computing density*1*3: 0.104 TFLOPS/mm2 (BF16), 0.095 TFLOPS/mm2 (FP16)
- Energy efficiency*3: 31.2*1-65.5*2 TFLOPS/W (BF16), 28.7*1-60.4*2 TFLOPS/W (FP16)
- Inference accuracy (CIFAR-100)*4: 69.48% (Top-1), 91.59% (Top-5) (BF16)
- Inference accuracy (ImageNet)*5: 71.55% (Top-1), 90.17% (Top-5) (BF16)
Outline (repeated)

Conclusion
- Features of the proposed floating-point ReRAM-nvCIM macro:
  - FP computing flow with kernel-wise weight pre-alignment (K-WPA): reduces accuracy loss due to data truncation in weight pre-alignment.
  - Rescheduled multi-bit input compression (RS-MIC): reduces MAC energy and latency with lossless compression.
  - HRS-favored dual-sign-bit (HF-DSB) weight encoding: reduces ReRAM-array current consumption.
- A 16Mb floating-point ReRAM-nvCIM macro is verified, supporting both the BF16 and FP16 computing formats.
- Measured energy efficiency was 31.2 TFLOPS/W under BF16 precision.
Thanks for your kind attention
Acknowledgements: NSTC-Taiwan, TSRI, NTHU-TSMC JDP

34.9: A Flash-SRAM-ADC-Fused Plastic Computing-in-Memory Macro for Learning in Neural Networks in a Standard 14nm FinFET Process
2024 IEEE International Solid-State Circuits Conference
Linfang Wang1,2, Weizeng Li1,2, Zhidao Zhou1,2, Hanghang Gao1,2, Zhi Li1,2, Wang Ye1,2, Hongyang Hu1, Jing Liu1, Jinshan Yue1, Jianguo Yang1, Qing Luo1, Chunmeng Dou1,2, Qi Liu1,3, Ming Liu1,3
1Institute of Microelectronics of the Chinese Academy of Sciences, Beijing, China; 2University of Chinese Academy of Sciences, Beijing, China; 3Fudan University, Shanghai, China
Outline
- Motivation and Challenges
- Proposed Plastic-CIM Macro: overview of the Plastic-CIM macro; plastic cell array; differential merged-into-array ADC
- Performance and Measurement Results
- Conclusion

Development of nvCIM-based accelerators
- Edge AI applications call for CIM with learning abilities to adapt to dynamic and unpredictable environments.
- References: W. Huang, ISSCC, 2023; M. Chang, ISSCC, 2023; M. Le Gallo, Nat. Electron., 2023; W. Zhang, Science, 2023.

Learning by back propagation
- Conventional NN training demands high computational precision and large amounts of intermediate data, and cannot keep learning after the initial training, so it is hard to deploy on edge devices for learning.
- Feed-forward: y_j = f(sum_i w_i,j * x_i); error computation propagates dC/dy_j back to dC/dx_i; weight-gradient computation: dC/dw_i,j = x_i * dC/dy_j.

Learning by synaptic plasticity
- A plastic NN* learns from experience: it keeps learning in the feed-forward pass after the initial training.
- Feed-forward: y_j = f(sum_i (w_i,j + alpha_i,j * Hebb_i,j) * x_i), combining traditional connections (w_i,j) with plastic connections (alpha_i,j * Hebb_i,j).
- The Hebbian trace is updated per Hebbian theory ("neurons that fire together, wire together"): Hebb_i,j(t+1) = eta * x_i * y_j + (1 - eta) * Hebb_i,j(t).
- Deployment: convolutional layers extract a feature vector; a fully-connected layer with trained W_i,j & alpha_i,j and an updating Hebb_i,j produces the output for learning tasks.
*T. Miconi, ICML, 2018
290、icitycoefficient,NVM for long-term informationMatrix Element-wise Multiplication(MEM)andMatrix-Vector Multiplication(MVM)Hebbian trace,VM for short-term information Plastic NN*learns from experience:Long-/short-term information expects non-volatile/volatile memoryMatrix element-wise multiplication f
291、or the plastic connections*T.Miconi,ICML,201834.9:A Flash-SRAM-ADC-Fused Plastic Computing-in-Memory Macro for Learning in Neural Networks in a Standard 14nm FinFET Process 2024 IEEE International Solid-State Circuits Conference7 of 40Challenge 1:NVM Solution in Advanced Nodes 5 Transistors Logic-FL
292、ASH(5T-LF)provides a cost-effective solution for non-volatile memory in advanced logic platformsEmerging Memory(RRAM/PRAM/MRAM)Embedded NVMLogic Flashe-FLASH201020142018202255nm40nm28nm180/130nm28/22/14nm65/40nm130/65nm350/180nm14 nmThis WorkAdditional Masks&Specialized ProcessesStandard Logic Proce
293、ss34.9:A Flash-SRAM-ADC-Fused Plastic Computing-in-Memory Macro for Learning in Neural Networks in a Standard 14nm FinFET Process 2024 IEEE International Solid-State Circuits Conference8 of 40Challenge 2:MEM-MVM solution in CIM Previous works need (N+1)cycles to perform MEM-MVM Proposed CIM Array su
294、pporting MEM-MVM in 1 cyclexHebbxCIM for MVMHebbxProposed CIM ArrayHebbxHebbHebbHebbCIM for MEMNumber of columns,N34.9:A Flash-SRAM-ADC-Fused Plastic Computing-in-Memory Macro for Learning in Neural Networks in a Standard 14nm FinFET Process 2024 IEEE International Solid-State Circuits Conference9 o
295、f 40Challenge 3:Low hardware cost ADC solution The area cost of DAC array exponentially increases with ADC resolution,incurring area-efficiency degradation Proposed CIM macro saves 50.4%ADC area by reusing MAC signal sampling capacitors SAR LogicMemoryDACComp.Conv.CIM MacroMemoryMemoryProposed CIM M
296、acroConv.Proposed00.20.40.60.81ADC Overhead(a.u.)50.4%reductionMemoryComp.&SAR Logic 34.9:A Flash-SRAM-ADC-Fused Plastic Computing-in-Memory Macro for Learning in Neural Networks in a Standard 14nm FinFET Process 2024 IEEE International Solid-State Circuits Conference10 of 40Outline Motivation and C
297、hallenges Proposed Plastic-CIM MacroOverview of Plastic-CIM MacroPlastic Cell Array Differential Merged-into-Array ADC Performance and Measurement Results Conclusion34.9:A Flash-SRAM-ADC-Fused Plastic Computing-in-Memory Macro for Learning in Neural Networks in a Standard 14nm FinFET Process 2024 IE
298、EE International Solid-State Circuits Conference11 of 40Overview of Plastic-CIM Macro 14nm fused 32Kb 5T-LF&4Kb SRAM PCA for efficiently MEM-MVM DMA-ADC with reduced hardware overhead123FBL8:15SWL0SWL1FBL0:7BL63BLB63Block 0Block 1Block 31Input Driver and HVSR3L-DSA185T-LFSRAM185T-LFSRAM185T-LFBL0PWL
299、0WWL0BLB0185T-LFx07:0 x17:0 x637:0PWL63WWL63positive columnsnegative columnssign magnitude to 2s complement(SM-2SC)SM-2SCL1-DSAL2-DSAL3-DSAL3DSAO1L3DSAO31PCASM-2SCL1-DSAL2-DSAL3-DSA5TLF Read and WriteSRAM WL Driver,Read and WriteCPTiming&CTRL8-bit DMA-ADCL1-DSA L2-DSAL3-DSA185T-LFSRAMSRAMSRAM185T-LF
300、SRAML3DSAO0CP,charge pump;HVS,high voltage switches;R3L-DSA,reconfigurable 3-level digital shifter-and-adder;L3DSAO,3rd-level DSA output.12334.9:A Flash-SRAM-ADC-Fused Plastic Computing-in-Memory Macro for Learning in Neural Networks in a Standard 14nm FinFET Process 2024 IEEE International Solid-St
Computation flow of the P-CIM macro
- MVM mode serves the traditional connections: all SRAM cells store 1, so the PCA computes a plain MVM.
- MEM-MVM mode serves the plastic connections: the SRAM cells hold the Hebb update, so the PCA computes the element-wise product and the MVM in one pass.
[Flow diagrams: inputs X0-63[7:0] drive the PCA (5T-LF + SRAM), then the DMA-ADC, SM-2SC, and R3L-DSA (L1/L2/L3-DSA) produce the outputs (L2DSAO[19:0] / L3DSAO[27:0]).]

Outline (repeated)
PCA unit structure
- Merged 5T-LF* and SRAM cells, thanks to process compatibility.
- Shared SGD/SGS and BL/BLB reduce the number of metal lines.
[Schematic: 5T-LF (floating-gate cell) with SRAM & CSAM; signals EN, AVG, PWL, WWL, BL (SGD), BLB (SGS), SWL, FBL, LSL, MV, PCH, CL; devices PM0, PM1, NM0-NM2.]
*The 5T-LF cell refers to M. Kim, JSSC, 2022.
PCA unit structure (cont.)
- The 5T-LF cells are built from different IO devices; under-drive IO transistors for the read and pass gates in the 5T-LF increase the cell read currents.
[Device table: PM0 / PM1 are p18, NM0/1/2 are n18 or under-drive ud12, other devices n08/p08; W*L ratios 8 / 1 / 1.]
- The SRAM and computing switches are built from the core devices; the HV paths are decoupled from the SRAM, the CSAM, and the computing path.
5T-LF programming
- Bit-wise programming: apply VPRG on PWL/WWL, and inhibit the half-selected cells by applying VDD on BL.
[Bias diagram: programmed vs. program-inhibited cells, with FBL at GND vs. VDD respectively, BLB at GND, LSL floating, SWL at GND.]
*The 5T-LF cell refers to M. Kim, JSSC, 2022.

5T-LF erasing
- Row-wise erasing by applying VERS between WWL and PWL (WWL at VERS, PWL at GND; FBL/BL/BLB at GND, LSL floating, SWL at GND).
Enhanced FN tunneling in the Si fin structure
- The enhanced electric field during programming at the Si fin corner yields a 185x higher local FN-tunneling current than on the plane.
- This gives a 25% lower minimum VPRG and a 66% shorter minimum program time.
[Figures: TEM cross-section and TCAD electric-field simulation of the fin corner vs. plane (E up to 48 MV/cm); scatter plot of minimum VPRG (V) vs. minimum program time (1us-10ms) against University of Michigan ISSCC 2017, TSMC VLSI-DAT 2013, University of Minnesota JSSC 2021, Virage Logic ISSCC 2004, and KAIST CICC 2023, with This Work.]
314、lid-State Circuits Conference20 of 40Computing of PCA Unit-Truth Tablex(BL Pulse Width*1)or w(5T-LF VTH*2)Hebb(SRAM Q)Multiplication Value(VMV)00(tACT=0)-00000(VDD)-0000(VTH0)-0(GND)01(tACT=t0)0001(VTH1)1(VDD)000001(VDD-V)01(tACT=t0)0010(VTH2)1(VDD)000010(VDD-2V)01(tACT=t0)0011(VTH3)1(VDD)000011(VDD
315、-3V)10(tACT=2t0)0100(VTH4)1(VDD)000001(VDD-2V)11(tACT=3t0)1111(VTH15)1(VDD)101101(VDD-45V)*1.tACT:BL activation time;t0:the unit time for pulse width modulation.*2.Cell current linearly increases as the threshold voltage(VTH)changes from VTH0to VTH15.CLENAVGPWLWWL BLBLBSWLFBL LSL MVPCHVDDQQBFGCSAMCo
316、mpute Path34.9:A Flash-SRAM-ADC-Fused Plastic Computing-in-Memory Macro for Learning in Neural Networks in a Standard 14nm FinFET Process 2024 IEEE International Solid-State Circuits Conference21 of 40Computing of PCA Unit Phase0(Pre-charging)PWL/WWLSWLGNDVRDFBLFBLVDDVDD/2BLBBLVDDGNDVMVENAVGVDDLSL p
317、re-chargeLocal multiplicationAveragingFBL inhibitionPH1PH2PH0CLENAVGPWLWWL BLBLBSWLFBL LSL MVPCHVDDQQBFGCSAM Pre-charge top-plates of CSAMand parasitic cap of LSL34.9:A Flash-SRAM-ADC-Fused Plastic Computing-in-Memory Macro for Learning in Neural Networks in a Standard 14nm FinFET Process 2024 IEEE
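The local multiplication in the truth table above reduces to a simple rule: the CSAM voltage steps down from VDD by one ΔV per unit of the product x·w, gated by the Hebb bit. A minimal Python sketch of that mapping (the function name and the VDD/ΔV numeric values are assumptions for illustration, not from the paper):

```python
# Illustrative model of the PCA unit's local multiplication.
# x: 2b input, encoded as BL pulse width (tACT = x * t0)
# w: 4b weight, encoded as the 5T-LF threshold voltage (VTH0..VTH15)
# hebb: 1b gate stored in the 6T SRAM cell
VDD = 0.8          # assumed supply voltage, for illustration
DELTA_V = 0.01     # assumed discharge step per unit product

def multiplication_value(x: int, w: int, hebb: int):
    """Return (6b product code, CSAM voltage) following the truth table."""
    assert 0 <= x <= 3 and 0 <= w <= 15 and hebb in (0, 1)
    product = x * w * hebb          # 0..45, so six bits suffice
    v_mv = VDD - product * DELTA_V  # CSAM discharged from VDD
    return format(product, "06b"), v_mv

print(multiplication_value(3, 15, 1))  # product 45 -> code '101101'
print(multiplication_value(2, 4, 1))   # product 8  -> code '001000'
print(multiplication_value(1, 7, 0))   # Hebb = 0 gates the product to zero
```

The last call shows why a zero in any operand column of the table yields VMV = 000000 (VDD): the product collapses to zero before any discharge occurs.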
Computing of PCA Unit - Phase 1 (Multiplying)
[Waveforms: PWL/WWL, SWL, FBL, BL/BLB, VMV, ENAVG across PH0 (LSL pre-charge), PH1 (local multiplication; SGD activation depends on the inputs), PH2 (charge sharing between capacitors for averaging); FBL inhibition]
• Input pulses applied to BL; multiplication through discharging the CSAMs

Computing of PCA Unit - Phase 2 (Averaging)
[Waveforms: same compute-path signals across PH0, PH1, PH2]
• Averaging between capacitors through charge sharing

Outline
• Motivation and Challenges
• Proposed Plastic-CIM Macro
  - Overview of Plastic-CIM Macro
  - Plastic Cell Array
  - Differential Merged-into-Array ADC
• Performance and Measurement Results
• Conclusion

DMA-ADC Structure
[Block diagram: differential lines CLP/CLN, references VREF_L/VREF_H, comparator outputs SAOP/SAON, binary-ratioed capacitors C0 to C6 (1x up to 32x), two banks of 64 6T SRAM plus 512 5T-LF cells, switch SW, DOUT, modified SAR logics and timing, DFFs, MESA control with Hebb and x/w sparsity factors]
• Modified SAR logics for grouping and quantizing
• Common-centroid capacitor grouping strategy to reduce mismatch
• Multi-Element
Sparsity-Aware (MESA) control to leverage the sparsity in the plastic-NN

DMA-ADC Work Flow - Phase 1 (Multiplying)
[Block diagram as in the DMA-ADC structure; waveforms: ADC, ENAVG, CLP/CLN, SAOP/SAON]
• Top plate of CSAM connected to MV, while the bottom plate of CSAM is connected to VREF_H because the input common-mode
voltage is near VDD (ENAVG = 0)
[Waveforms: MVP on CLP, between VREF_H and VREF_L]

DMA-ADC Work Flow - Phase 2 (Averaging)
[Waveforms: ADC, ENAVG = 1, CLP/CLN, SAOP/SAON; MVP on CLP, between VREF_H and VREF_L]
• Top plate of CSAM connected to CL to enable the charge sharing

DMA-ADC Work Flow - Phase 3 (Quantizing)
[Block diagram: CLP/CLN, SAOP/SAON, ADC, SW, DOUT, modified SAR logics and timing, DFFs, MESA control; waveforms: ADC, ENAVG, CLP/CLN, SAOP/SAON]
• Bottom plates of CSAM grouped into ratioed capacitors (C6 to C0) and switched to quantize the MAC
[Figure: bottom plates grouped into ratioed capacitors C0, C1, C2, ..., C6 switching between VREF_L and VREF_H]
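Taken together, Phase 2 and Phase 3 of the DMA-ADC amount to averaging the per-cell CSAM voltages by charge sharing and then running a binary successive-approximation search between VREF_L and VREF_H, one comparison per ratioed capacitor. A simplified behavioral sketch (the ideal comparator, equal unit capacitors, and all numeric values below are assumptions for illustration, not circuit-level behavior):

```python
# Behavioral sketch of the DMA-ADC averaging and quantizing phases.

def charge_share_average(v_csam: list[float]) -> float:
    """Phase 2: connecting equal caps to a common line (CL) settles at
    the mean of their initial voltages (total charge is conserved)."""
    return sum(v_csam) / len(v_csam)

def sar_quantize(v_in: float, v_ref_l: float, v_ref_h: float,
                 bits: int = 7) -> int:
    """Phase 3: successive-approximation search between VREF_L and
    VREF_H, one comparison per ratioed capacitor (C6 down to C0)."""
    code = 0
    lo, hi = v_ref_l, v_ref_h
    for _ in range(bits):
        mid = (lo + hi) / 2
        code <<= 1
        if v_in >= mid:      # comparator decision (SAOP/SAON)
            code |= 1
            lo = mid         # keep the upper half of the range
        else:
            hi = mid         # keep the lower half of the range
    return code

v_avg = charge_share_average([0.80, 0.76, 0.72, 0.68])  # Phase 2 -> 0.74
print(sar_quantize(v_avg, v_ref_l=0.4, v_ref_h=0.8))    # Phase 3 -> 108
```

The binary search halves the remaining reference range at every step, which is exactly what switching the binary-ratioed capacitors C6 down to C0 accomplishes in the charge domain.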
Common-Centroid Capacitors Grouping Strategy
[Block diagram: DMA-ADC with CLP/CLN, VREF_L/VREF_H, SAOP/SAON, capacitors C0 to C6, SW, DOUT, modified SAR logics and timing, DFFs, MESA control]
• Ensuring the parasitic capacitors of C6 to C0 (CPAR6 to CPAR0) maintain the same ratios
• Ratioed wire parasitic capacitors CPAR of C6 to C1, placed common-centroid around PCA row 32 (row index 0 to 64): CPAR1 from row 32; CPAR2 from rows 33 + 31 = 2 CPAR1; CPAR3 from rows 35 + 34 + 30 + 29 = 4 CPAR1

Multi-Element Sparsity-Aware (MESA) Control
[Bar chart: power reduction (a.u.) for Case I (50% AS / 50% HS), Case II (90% AS / 10% HS), and Case III (10% AS / 90% HS), AS-aware only vs. AS & WS & HS aware]
[Block diagram as in the DMA-ADC structure]
• Dynamically modulate the sensing steps (NSEN) by leveraging the sparsity in x, Hebb, or w to save power
[Figure: x 1s counter (NACT), weight-sparsity subtractor (w sparsity factor), digital comparator (NLES), Hebb 1s counter (NHEB)]
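One way to read the MESA control above: the 1s counters on x and Hebb bound how many rows can actually contribute to the MAC, and the SAR then needs only enough sensing steps to resolve that bound. A Python sketch of that interpretation (the bound and the step formula are an illustration of the idea, not the paper's exact control logic):

```python
import math

# Illustrative MESA-style control: shrink the number of SAR sensing
# steps using the sparsity counts (the paper's exact logic may differ).

def mesa_sensing_steps(x_ones: int, hebb_ones: int,
                       full_steps: int = 7) -> int:
    """Fewer active rows -> smaller possible MAC range -> fewer steps.
    x_ones / hebb_ones come from the 1s counters on x and Hebb."""
    n_active = min(x_ones, hebb_ones)  # rows that can contribute
    if n_active == 0:
        return 0                       # result is known to be zero
    # Steps needed to resolve n_active + 1 distinct accumulation
    # levels, capped at the full SAR resolution.
    return min(full_steps, math.ceil(math.log2(n_active + 1)))

print(mesa_sensing_steps(x_ones=64, hebb_ones=64))  # dense case -> 7
print(mesa_sensing_steps(x_ones=3, hebb_ones=64))   # sparse x -> 2
```

This matches the qualitative trend in the bar chart: the higher the activation or Hebb sparsity, the fewer sensing steps fire, and the larger the power reduction.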