DCiROM: A Fully Digital Compute-in-ROM Design Approach to High Energy Efficiency of DNN Inference at Task Level
Tianyi Yu, Tianyu Liao, Mufeng Zhou, Xiaotian Chu, Guodong Yin, Mingyen Lee, Yongpan Liu, Huazhong Yang, and Xueqing Li
Tsinghua University, LFET/BNRist
Email:

Outline
- Background
- Motivation
- Proposed Design
- Measurement
- Conclusion

Background
- SRAM CiM improves performance on data-intensive AI tasks, but its limited capacity forces frequent weight reloads.
- A high-density ROM CiM (G. Yin et al., 2023) was recently proposed to address the capacity limits of SRAM CiM.
[Figure: Task-level energy reduction of ROM-based CiM. SRAM CiM challenge: frequent weight reloads due to limited capacity (small SRAM CiM, weight buffer, activation buffer, PU, external DRAM). ROM CiM opportunity: only activations are loaded (large ROM CiM, activation buffer, PU, external DRAM). Annotation: reduced by 80%.]

Background
- By adding SRAM CiM to hold fine-tuning weights, YOLoC and Hidden-ROM (Y. Chen et al., 2022) were proposed to relieve the flexibility bottleneck.
[Figure: Three array organizations, each with an input driver, ADCs, and LCC: SRAM-CiM (SRAM arrays, domain A empty), YOLoC (SRAM array plus ROM array, domain A pretrained), and Hidden-ROM (SRAM-ROM hybrid array, domain A random), each then serving domain B. Source: Y. Chen et al., ICCAD'22.]

Motivation
- The computing density of analog ROM CiM is limited by the ADC.
- The memory density of digital SRAM CiM is limited by the adder tree.
[Figure: Memory density versus computing density. Existing high-memory-density ROM CiM (high density, large capacity; ADC/DAC): the number of activated rows (N) is limited by low SNR in the analog MAC value distribution (level 1, level 2, ..., level 2N). Existing high-computing-density SRAM CiM: adder tree and adder bit width add area; macro area (mm2) is broken down into logic and memory (shares shown: 12%/88%, 36%/64%, 26%/74%).]

Motivation
- YOLoC demonstrates a new concept: eliminating off-chip parameter loading with a large-capacity ROM CiM while fine-tuning to various tasks.
- Further area reduction through MAC reuse has not yet been exploited.
[Figure: Trunk ROM CiM and branch SRAM CiM between input FM and output FM, with channel compression and decompression around the branch and MAC reuse between the two. Channel counts N, M/U, N/D, N+M. D: channel compression ratio. U: channel decompression ratio.]

Contributions
- How to further improve the density of ROM CiM?
- The key contributions of this work:
  - DCiROM: a ROM-logic fusion design approach that achieves both high memory density and high computing density through a greatly simplified ROM circuit and adder tree.
  - Two methods that give DCiROM flexibility across datasets at low extra circuit overhead by reusing ROM CiM resources.
  - A 65nm DCiROM chip with all weights of ResNet-56 built in, achieving a measured 2.06 TOPS/mm2 computing density and 487 Kb/mm2 memory density on an end-to-end inference task.

Proposed Design
- Proposed DCiROM: a synthesizable ROM-logic fusion CiM design approach
- Density enhancement process from general digital SRAM CiM to ROM-logic fusion CiM
[Figure: Three stages. Digital SRAM CiM: an SRAM array of weights W0,0 to W3,3 read by Addr[1:0] into a MAC unit with activations A0 to A3. Digital ROM CiM w/o opt.: the same weights stored in ROM, feeding the same MAC unit. ROM-logic fusion (this work): a merged memory-logic block in which Addr0/Addr1 select the weights inside the logic; an input with WAddr,3 = 0 drops out, and the output adder width shrinks from 3b to 2b.]

Proposed Design
- Implementation of DCiROM on ResNet-56 convolutional layers
- PPA comparison of SRAM CiM, ROM CiM, and ROM-logic fusion CiM
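
The ROM-logic fusion step on the density-enhancement slide can be illustrated with a small software analogy. This is a minimal sketch under stated assumptions, not the authors' synthesis flow: because ROM contents are fixed at design time, any activation input whose weight is zero at every address can be removed from the adder tree. The weight table, the activation values, and the names `generic_mac` and `fuse` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def generic_mac(weights: Sequence[Sequence[int]], addr: int, acts: Sequence[int]) -> int:
    """Unfused reference: read the weight row at `addr`, multiply every
    activation by its weight, and add all products."""
    return sum(w * a for w, a in zip(weights[addr], acts))


def fuse(weights: Sequence[Sequence[int]]) -> Tuple[Callable[[int, Sequence[int]], int], List[int]]:
    """Specialize the MAC for a constant weight table (software analogy only).

    An input position is "live" if at least one address stores a nonzero
    weight there. Dead positions never contribute, so the fused evaluator
    skips them, which mirrors pruning inputs from a shared adder tree.
    """
    n_inputs = len(weights[0])
    live = [i for i in range(n_inputs) if any(row[i] != 0 for row in weights)]

    def fused_mac(addr: int, acts: Sequence[int]) -> int:
        row = weights[addr]
        return sum(row[i] * acts[i] for i in live)

    return fused_mac, live


# Hypothetical 4-address x 4-input table of 1-bit weights.
# Input 3 is zero at every address, so it drops out after fusion.
WEIGHTS = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
ACTS = [3, 1, 4, 2]

fused_mac, live_inputs = fuse(WEIGHTS)
results = [fused_mac(addr, ACTS) for addr in range(len(WEIGHTS))]
```

Here `live_inputs` lists the adder-tree inputs that remain after fusion, and `results` matches `generic_mac` at every address. A hardware flow would obtain the same effect through logic synthesis of the constant table rather than a Python closure.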
[Figure: Normalized PPA of SRAM CiM, ROM CiM, and ROM-logic fusion (this work) on ResNet-56 with 16 conv layers, broken down into memory (SRAM or ROM) and adder tree; bar annotations: 41%, 51%, 32%.]

Proposed Design
- PPA improvement space of ROM-logic fusion CiM
- *Memory-compute ratio: one column is selected from a 4-, 8-, or 16-column memory block to perform the MAC operation
[Figure: Improvement space of DCiROM relative to SRAM CiM and ROM CiM w/o opt. at memory-compute ratios 4*, 8*, and 16*; annotations: 1.8x, 1.7x, 1.7x, 1.7x and 2.4x, 2.5x, 2.7x, 2.4x; marked point: average of the 16-memory-1-compute case at sparsity 0.8.]

Proposed Design
- Two methods of ROM CiM reusing: ACSEL and ACINJ
- ACSEL: select a column that contains enough 1s for accumulation
- ACINJ: add a redundant column with enough 1s to the ROM
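
One way to read the ACSEL and ACINJ bullets is sketched below. This is an interpretation, not the circuit from the talk: it assumes a ROM column stores 1-bit weights, that the column's adder tree computes the sum of bit times input, and that branch products can be routed onto rows storing a 1 so the existing adder tree acts as their accumulator. All function names and the example ROM contents are hypothetical.

```python
from typing import List, Optional, Sequence


def rom_column_mac(column_bits: Sequence[int], inputs: Sequence[int]) -> int:
    """Adder tree of one ROM column: sum of (stored bit * input) over all rows."""
    return sum(bit * x for bit, x in zip(column_bits, inputs))


def route_to_ones(column_bits: Sequence[int], values: Sequence[int]) -> List[int]:
    """Place each value on a row that stores a 1; every other row gets 0."""
    pending = list(values)
    if len(pending) > sum(column_bits):
        raise ValueError("column has too few 1s for this many values")
    it = iter(pending)
    return [next(it, 0) if bit == 1 else 0 for bit in column_bits]


def acsel(rom_columns: Sequence[Sequence[int]], needed: int) -> Optional[int]:
    """ACSEL (assumed reading): pick an existing column with at least `needed` 1s."""
    for index, column in enumerate(rom_columns):
        if sum(column) >= needed:
            return index
    return None


def acinj(rom_columns: Sequence[Sequence[int]]) -> List[List[int]]:
    """ACINJ (assumed reading): append one redundant all-ones column to the ROM."""
    rows = len(rom_columns[0])
    return [list(col) for col in rom_columns] + [[1] * rows]


def accumulate_with_reuse(column_bits: Sequence[int], products: Sequence[int]) -> int:
    """Reuse the column's adder tree to accumulate externally computed products."""
    return rom_column_mac(column_bits, route_to_ones(column_bits, products))


# Hypothetical ROM with two 9-row columns.
ROM = [
    [0, 1, 0, 0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 1, 0],
]
branch_products = [3, -2, 5, 7]

selected = acsel(ROM, needed=len(branch_products))
acsel_sum = accumulate_with_reuse(ROM[selected], branch_products)

# No existing column has seven 1s, so fall back to an injected column.
no_match = acsel(ROM, needed=7)
rom_with_injection = acinj(ROM)
acinj_sum = accumulate_with_reuse(rom_with_injection[-1], [1] * 7)
```

In this sketch ACSEL costs no extra storage but only works when a suitable column already exists, while ACINJ always works at the cost of one extra column.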
[Figure: Baseline versus proposed approach. Baseline: the trunk uses ROM CiM with accumulation and the branch uses SRAM with separate computation units; IFM (input feature map) feeds both, and trunk out plus branch out form the OFM (output feature map). Proposed approach: ROM CiM and accumulation are shared, with SRAM holding branch parameters and ROM holding trunk parameters. Inset: in the original ROM CiM, Wn and An give Prodn; ACSEL feeds Act/Prod into a selected column (example bits 111110010); ACINJ adds a column (example bits 111111000).]

Proposed Design
- ROM CiM reuse evaluation: feasibility and area reduction
- ACSEL/ACINJ reduce area overhead by at most 56%/47%
[Figure: Accuracy and normalized area (split into branch, trunk, and accumulator) for baseline (B/L), ACSEL, and ACINJ at branch compression ratios DxU = 4, 16, 64; area reduction annotations include 47%, 56%, 28%, 9%, 19%, 3%. Second panel: best reuse method (ACSEL or ACINJ) versus ratio of 1s in ROM (0 to 0.6) for DxU = 4, 16, 64. Setup: ResNet-18 conv2_x, 4 convolution layers, trunk size (64,3,3,64); branch sizes shown include (32,3,3,32) and (8,3,3,8).]

Measurement
- Structure of ResNet-56
- The precision of input/weight is 8 bit/4 bit
- All convolutional layers are mapped on chip
[Figure: Input, then L1 (3x3x16, 16 conv layers, 16 channels), L2 (3x3x32, 16 conv layers, 32 channels), L3 (3x3x64, 16 conv layers, 64 channels), then FC layers and classification. On chip: L1/L2/L3 ctrl and L1/L2/L3 buffers.]

Measurement
- DCiROM chip architecture
- High-efficiency ROM-logic fusion synthesis
[Figure: L1/L2/L3 DCiROM (quantized to 4b), channel C, MAC result.]

[...] 30x than ROM CiM), high memory density (4x than SRAM CiM) by ROM-logic fusion, breaking through the trade-off of SOTA works

                | ISSCC24 [3] | ISSCC24 [4] | CICC24 [5]   | This work   | JSSC23 [10]  | CICC24 [7]
Technology      | 65nm        | 28nm        | 22nm         | 65nm        | 28nm         | 28nm
CiM operation   | Analog SRAM | Hybrid SRAM | Digital SRAM | Digital ROM | Analog eDRAM | Analog ROM
Capacity (Kb)   | 80          | 192         | 128          | 3024        | 9600         | 22528
Voltage (V)     | 0.6-1.1     | 0.7-0.95    | 0.6-0.8      | 0.6-1.2     | 1.1          | 0.7-1.1

Remaining rows, with column boundaries lost (digits as extracted):
Energy efficiency (TOPS/W)*: 25516.937.64.211.138.09.021.4914.931.2
Memory density (Kb/mm2)*: 31181154871811656
Computing density (TOPS/mm2)*: 0.781.020.210.280.552.060.190.0300.059
FoM (TOPS/mm2 x Kb/mm2): 241085362950002681004197528565713
FoM (TOPS/mm2 x Kb/mm2)*: 241824322681004344998
*Normalized to 65nm

Conclusion
- Highlight of DCiROM
- Ultra-high density FoM (computing density x memory density)
[Figure: Computing density (TOPS/mm2) versus memory density (Kb/mm2, log scale). Points: This work, CICC24 (Analog ROM), CICC23 (Digital SRAM), ISSCC24 (Hybrid SRAM), ISSCC24 (Analog SRAM), JSSC23 (Analog eDRAM). Annotations: 10.2x and 55.7x. *Normalized to 65nm.]

Conclusion
- Proposed DCiROM design approach
  - High memory density and high computing density
  - Less area overhead to realize flexibility
- Features:
  - A fully digital ROM with local computing units achieves 10.2x-55.7x the density FoM of SOTA CiM works
  - Task evaluation shows a 9.9x system-level energy efficiency improvement over SRAM CiM
  - ROM CiM reuse reduces YOLoC branch area overhead by 53%-85%

Thank You
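
The density FoM on the conclusion slides is defined as computing density times memory density, so it can be checked from the two measured figures stated in the contributions (2.06 TOPS/mm2 and 487 Kb/mm2). The sketch below does only that arithmetic. The "implied" prior-work values assume the 10.2x and 55.7x annotations mean this work's FoM divided by a prior work's 65nm-normalized FoM; that reading is an assumption, and the function and variable names are hypothetical.

```python
def density_fom(computing_density_tops_mm2: float, memory_density_kb_mm2: float) -> float:
    """Density figure of merit: computing density x memory density."""
    return computing_density_tops_mm2 * memory_density_kb_mm2


# Measured values quoted on the contributions slide.
THIS_WORK_FOM = density_fom(2.06, 487.0)

# Assumption: each ratio is (this work's FoM) / (a prior work's normalized FoM),
# so dividing gives the prior FoM that the annotation would imply.
IMPLIED_PRIOR_FOM = {ratio: THIS_WORK_FOM / ratio for ratio in (10.2, 55.7)}

if __name__ == "__main__":
    print(f"This work: {THIS_WORK_FOM:.1f} TOPS/mm2 x Kb/mm2")
    for ratio, fom in IMPLIED_PRIOR_FOM.items():
        print(f"{ratio}x gap implies a prior FoM of about {fom:.1f}")
```

The product comes to roughly 1003 TOPS/mm2 x Kb/mm2 for this work.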