ISSCC 2025, Session 23: AI Accelerators

23.1: T-REX: A 68-567μs/token, 0.41-3.95μJ/token Transformer Accelerator with Reduced External Memory Access and Enhanced Hardware Utilization in 16nm FinFET
Seunghyun Moon1, Mao Li1, Gregory K. Chen2, Phil C. Knag2, Ram Kumar Krishnamurthy2, Mingoo Seok1
1 Columbia University, New York, NY; 2 Intel, Hillsboro, OR
2025 IEEE International Solid-State Circuits Conference

Outline
I. Introduction
II. Algorithmic Approaches: A. Factorizing Training, B. External Data Compression
III. Overall Architecture
IV. Detailed Features: A. Hardware Support for External Data Compression, B. Dynamic Batching, C. Two-Direction Accessible Register File
V. Measurement Results
VI. Conclusion

I. Introduction
Challenges in Transformer Accelerators
- Large External Memory Access (EMA): external I/O consumes significant energy, and EMA accounts for up to 81% of the total energy usage (energy breakdown across ViT, GPT2, BERT, sViT, T5, and mT5).
- Low Hardware Utilization: as low as 9% (computed as benchmark TOPS / peak TOPS), which hurts latency; ViT, GPT2, BERT, and sViT all run below 50% of their peak performance.
Proposed Solutions: Factorizing Training
- A new training model decomposes each weight into WS and WD.
- WS: dense and shared across all layers; it is preloaded once and involved in all layers.
- WD: highly sparse (15% sparsity) and distinct to each layer; it is transferred at runtime.
- EMA reduced by 8.5-10.7×.

Proposed Solutions: External Data Compression
- All parameters are compressed in external memory to further reduce EMA.
- A different compression scheme is applied to each type of data: non-uniform quantization for WS, and a sparse format with delta-encoded indices and uniform-quantized values for WD stored in DRAM.
- All quantized values are dequantized on-chip before computation.
- EMA further reduced by 2.1-2.9×.

Proposed Solutions: Dynamic Batching
- T-REX monitors the input length. If an input is 2× (4×) smaller than the maximum input length, it processes 2 (4) inputs together, so the loaded weights are used 2 (4) times instead of once while keeping utilization high.
- EMA reduced and utilization enhanced by 1.0-3.3×.

Proposed Solutions: Two-Direction Accessible Register File (TRF)
- A matrix can be loaded and stored both row-by-row (R-R) and column-by-column (C-C).
- This eliminates the latency overhead of multiple SRAM accesses and thus enhances hardware utilization.
- Utilization enhanced by 1.1-1.2×.
II. Algorithmic Approaches
Factorizing Training Model: Training Methodology
- Replace a weight matrix with the product of two submatrices: W = WS·WD (the original W is recovered as their product).
- WD is trained to be sparse by adding an L1 regularization term to the loss, so the backward pass becomes
  ∂(L + λ‖WD,train‖₁)/∂WD,train = ∂L/∂WD,train + λ·sign(WD,train).
- WD contains a column-wise fixed number of non-zeros (NZs), enforced by f in the forward pass: Z = (X·WS)·f(WD,train) during training and Z = (X·WS)·WD,infer at inference.
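To make the factorizing-training recipe concrete, here is a minimal NumPy sketch of the forward pass Z = (X WS) f(WD) and the L1-regularized gradient of WD; the sizes, the helper `keep_topk_per_column`, and the constant `NZ_PER_COL` are illustrative stand-ins, not taken from the paper.

```python
import numpy as np

# Illustrative sizes: d_in -> d_hid (shared part) -> d_out (per-layer part)
d_in, d_hid, d_out, NZ_PER_COL = 8, 4, 8, 2   # hypothetical values
lam = 1e-3                                     # L1 strength (lambda)

rng = np.random.default_rng(0)
W_S = rng.standard_normal((d_in, d_hid))       # dense, shared across layers
W_D = rng.standard_normal((d_hid, d_out))      # trained to be sparse, per layer

def keep_topk_per_column(W, k):
    """f(.): keep only the k largest-magnitude entries in each column."""
    out = np.zeros_like(W)
    for c in range(W.shape[1]):
        idx = np.argsort(-np.abs(W[:, c]))[:k]
        out[idx, c] = W[idx, c]
    return out

X = rng.standard_normal((3, d_in))             # a small batch of tokens

# Forward pass used during factorizing training: Z = (X W_S) f(W_D)
Z = (X @ W_S) @ keep_topk_per_column(W_D, NZ_PER_COL)

# Gradient of the L1 penalty added to the task loss:
# d(L + lam*|W_D|_1)/dW_D = dL/dW_D + lam*sign(W_D)
dL_dWD = np.zeros_like(W_D)                    # stand-in for the task-loss gradient
dWD = dL_dWD + lam * np.sign(W_D)
```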
Factorizing Training Model: Sparse Format
- WD is stored with only the indices and values of its NZs.
- The format is similar to compressed sparse column (CSC), but no column pointer is needed because every column holds the same fixed number of NZs.
- Example from the figure: the NZs are stored as (index, value) pairs, e.g., (2, v0), (3, v1) for column 0 and (1, v2), (2, v3) for column 1.
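A small sketch of this fixed-NZ sparse format, assuming the non-zero count per column is known ahead of time; `encode_fixed_nz` and `decode_fixed_nz` are illustrative names, not from the paper.

```python
import numpy as np

def encode_fixed_nz(W_D, nz_per_col):
    """Store W_D as (row index, value) pairs, column by column.
    Because every column holds exactly nz_per_col entries, the k-th column's
    pairs start at offset k * nz_per_col, so no CSC column-pointer array is kept."""
    idx, val = [], []
    for c in range(W_D.shape[1]):
        rows = np.flatnonzero(W_D[:, c])
        assert len(rows) == nz_per_col
        idx.extend(rows)
        val.extend(W_D[rows, c])
    return np.array(idx), np.array(val)

def decode_fixed_nz(idx, val, shape, nz_per_col):
    """Rebuild the dense W_D tile from the (index, value) stream."""
    W = np.zeros(shape)
    for k, (r, v) in enumerate(zip(idx, val)):
        W[r, k // nz_per_col] = v
    return W
```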
Factorizing Training Model: EMA Strategy
- WS (the shared parameters) is loaded only once at program time into the global buffer (GB), so it causes no runtime EMA.
- Runtime EMA is only for the distinct parameters: WD, biases, and layer-norm parameters.

Factorizing Training Model: Additional Effect, Reduction of #Ops
- The sequential matrix multiplications (MMs) allow two computation orders.
- (X·WS)·WD is more beneficial because the hidden size of WS is smaller than the hidden size of WS·WD.
- Example from the figure: Opt #1, (X·WS)·WD, is a 4×8×4 dense MM followed by a 4×4×8 sparse MM; Opt #2, X·(WS·WD), is an 8×4×8 sparse MM followed by a 4×8×8 dense MM.
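As a quick sanity check of the computation-order argument, the MAC counts of the two options for the slide's example shapes (X: 4×8, WS: 8×4, WD: 4×8) can be compared directly; this is purely illustrative and ignores the extra savings from WD's sparsity.

```python
# Dense MAC count of an (m,k) x (k,n) product is m*k*n.
def macs(m, k, n):
    return m * k * n

opt1 = macs(4, 8, 4) + macs(4, 4, 8)   # (X W_S) then (.) W_D  -> 128 + 128
opt2 = macs(8, 4, 8) + macs(4, 8, 8)   # (W_S W_D) then X (.)  -> 256 + 256
print(opt1, opt2)                      # 256 vs 512 before exploiting sparsity
```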
External Data Compression: Non-Uniform Quantization of WS
- The original INT16 parameters can be compressed to 4 bits for every model with non-uniform quantization, whereas uniform quantization needs up to 8 bits.

External Data Compression: Delta Encoding of the WD Indices
- The original 8-10b indices can be compressed to 4-6b for every model: absolute indices need the high bits, while the deltas between consecutive NZ indices are small enough for the low bits.
- Storing delta values alone, however, still needs the same bit width in the worst case, because a single large gap sets the required width.
- Solution: after training, rearrange the columns of WS and the corresponding rows of WD. The product WS·WD does not change, but the maximum delta shrinks; in the figure's example the delta sequences Δ0 = 1,3,2,3,0,5,1,4 and Δ1 = 2,1,2,1,1,3,1,4 become Δ0 = 1,3,2,1,0,3,2,2 and Δ1 = 2,3,2,3,1,3,1,2, so the maximum delta drops from 5 to 3.
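A sketch of the delta encoding and the post-training rearrangement, assuming the NZ row indices of each WD column are kept in ascending order; the search for a good permutation happens offline and is not shown here.

```python
import numpy as np

def delta_encode_column(rows):
    """Sorted NZ row indices of one W_D column -> first index plus the gaps."""
    rows = np.sort(rows)
    return np.concatenate(([rows[0]], np.diff(rows)))

def worst_case_bits(all_deltas):
    """Bits needed to store the largest delta (the worst-case width)."""
    return int(np.ceil(np.log2(max(int(d) for d in all_deltas) + 1)))

def apply_permutation(W_S, W_D, perm):
    """Reorder W_S columns and the matching W_D rows: (W_S P)(P^T W_D) = W_S W_D,
    so the result is unchanged while the index gaps can shrink."""
    return W_S[:, perm], W_D[perm, :]
```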
External Data Compression: Shifted Uniform Quantization of the WD Values
- Near-zero values are all removed by the fixed-NZ constraint, so the value magnitudes lie in [m, M] with a gap around zero.
- Normalize the distribution to save bits: shift the magnitudes by m, apply uniform quantization (compressed to 6 bits for every model) as the EMA format, and shift back on-chip.
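A minimal sketch of this shifted uniform quantization, assuming the magnitudes lie in [m, M] because the near-zeros were already removed; the bit width and rounding rule are illustrative.

```python
import numpy as np

def shifted_uniform_quant(w, bits=6):
    """Shift magnitudes by m, quantize uniformly, and keep (sign, m, M) so the
    on-chip dequantizer can shift back."""
    m, M = np.abs(w).min(), np.abs(w).max()
    q = np.round((np.abs(w) - m) / (M - m) * (2**bits - 1)).astype(np.int32)
    return q, np.sign(w), m, M

def shift_back_dequant(q, sign, m, M, bits=6):
    """On-chip reverse step: scale by (M - m), add the offset m, restore the sign."""
    return sign * (q / (2**bits - 1) * (M - m) + m)
```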
Effects of Algorithmic Approaches: Total EMA and MAC Reductions
- With factorizing training (FT), EMA is reduced by 8.5-10.7× and the number of MACs by 1-2.1×.
- With external data compression (Comp), EMA is further reduced by 2.1-2.9×.
- Per-model results (baseline: the original INT16 models, 128 tokens assumed): across ViT-B, S2T-M, RD-NMT, and BERT-L, FT alone cuts EMA by 8.53-10.7×, FT plus Comp adds another 2.13-2.91×, and FT cuts MACs by up to 2.14×.
III. Overall Architecture
T-REX Architecture
- Top level: a RISC-V-core-based top controller, a DMA engine, and a global buffer (GB) partitioned into shared parameters, distinct parameters, encoder output, and intermediate input/output, connected to the compute cores through an I/O interface.
- 4 Dense MM (DMM) cores: each has an input/parameter buffer, a 16-PE array (PE#0-#15), an accumulator, an output buffer, and a 16-level dequant LUT for non-uniform dequantization.
- 4 Sparse MM (SMM) cores: each has a controller, input/output/bias buffers, a sparse line buffer, an 8×8 MAC array, an accumulator, a post-bias adder, and a uniform dequantizer.
- 2 Auxiliary Function Units (AFUs): input/output buffers and vectors, integer arithmetic units (IAUs), BF16 arithmetic units (FAUs), LUTs, INT2BF/BF2INT converters, and shift logic.

Dense Matrix Multiplication (DMM) Cores
- Compute Q/K/V/Out and QKᵀ in the attention layer, and F1/F2 in the feed-forward layer.
- Perform tiled MM based on the outer product: within a single PE, columns of A (A00-A03) and rows of B (B00-B30) are loaded over time and the partial sum C00 is accumulated (a sketch follows at the end of this section).

Sparse Matrix Multiplication (SMM) Cores
- Compute Q/K/V/Out and PV in the attention layer, and F1/F2 in the feed-forward layer.
- Perform MM based on a row or column product (configurable): the sparse elements select the corresponding rows (or columns) of the dense operand, which are then linearly combined.

Auxiliary Function Units (AFUs)
- Perform all the non-MM functions, e.g., layer normalization, softmax, activation (GELU), and residual connections.
- Selectively use the integer arithmetic units (IAUs) and the BF16 arithmetic units (FAUs).
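The DMM cores' tile-wise outer-product dataflow can be pictured with a short NumPy sketch: one column of the left operand and one row of the right operand arrive per step, and a resident partial sum accumulates the rank-1 updates. The tile sizes here are arbitrary.

```python
import numpy as np

def tiled_outer_product_mm(A, B):
    """C = A @ B accumulated as a sum of rank-1 outer products: one column of A
    and one row of B per time step, with the partial sum kept in the PE array."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for t in range(k):                      # one "time step" per column/row pair
        C += np.outer(A[:, t], B[t, :])     # rank-1 update held in the accumulator
    return C

A = np.arange(12.).reshape(4, 3)
B = np.arange(6.).reshape(3, 2)
assert np.allclose(tiled_outer_product_mm(A, B), A @ B)
```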
IV. Detailed Features
Logic for External Data Compression: On-Chip Dequantization in the DMM Core for WS
- DMM cores mainly handle X·WS. A LUT-based non-uniform dequantizer (reconfigured when needed, with its 16 quantization levels loaded from the global buffer into the parameter path of the 4×4 PEs) decompresses the 4b WS into 8b/16b integers.
- Separate WS LUTs are defined for encoder/decoder and for attention/FFNN.
- DMM cores also handle QKᵀ; in that case the dequantizer is not activated.
Logic for External Data Compression: Relative Addressing and On-Chip Dequantization for WD
- SMM cores mainly handle (X·WS)·WD. The top controller performs relative addressing on the global buffer using the delta-encoded indices (each delta is added to the current address, with a reset per column), and the SMM core's uniform dequantizer recovers the 6b quantized WD values using the scale (M-m), the offset (+m or -m depending on the sign bit), and the sign.
- SMM cores also handle PV; then the delta indexing and the dequantizer are not activated.
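A behavioral sketch of this decompression path, assuming the (first index + deltas) stream layout from the sparse-format sketch above; the function name and the per-column reset convention are illustrative.

```python
def decode_wd_stream(delta_idx, q_val, sign, m, M, nz_per_col, bits=6):
    """Rebuild absolute row addresses from delta-encoded indices (reset at each
    column boundary) and dequantize the 6b W_D values with scale (M - m) and
    offset m, mirroring the top-controller + SMM-core path."""
    addrs, vals = [], []
    current = 0
    for k, (d, q, s) in enumerate(zip(delta_idx, q_val, sign)):
        if k % nz_per_col == 0:      # new column: first entry is an absolute index
            current = d
        else:
            current += d             # relative addressing
        addrs.append(current)
        vals.append(s * (q / (2**bits - 1) * (M - m) + m))
    return addrs, vals
```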
60、sInput/OutputValue of WDsign23.1:T-REX:A 68-567s/token,0.41-3.95J/token Transformer Accelerator with Reduced External Memory Access and Enhanced Hardware Utilization in 16nm FinFET 2025 IEEE International Solid-State Circuits Conference28 of 49Dynamic Batching Technique Single Input BatchIV.Detailed
61、 FeaturesDMMDMMDMMDMMLUT-Based ExponentialQKTVAFUMatrix AdditionSMMSMMSMMSMMQKSQueryKeyTScoreFor QKT(S),four DMM corestake the same Q and different Ks.For softmax(P),two AFUsload single S row-by-row.perform single softmax over all inputs.23.1:T-REX:A 68-567s/token,0.41-3.95J/token Transformer Accele
62、rator with Reduced External Memory Access and Enhanced Hardware Utilization in 16nm FinFET 2025 IEEE International Solid-State Circuits Conference29 of 49Dynamic Batching Technique Single Input BatchIV.Detailed FeaturesPVAProbabilityValueAttentionDMMDMMDMMDMMLUT-Based ExponentialQKTVAFUMatrix Additi
63、onSMMSMMSMMSMMFor PV(A),four SMM corestake the different Ps and Vs.For final output,two AFUsadd four SMM results into onesubmatrix of A.23.1:T-REX:A 68-567s/token,0.41-3.95J/token Transformer Accelerator with Reduced External Memory Access and Enhanced Hardware Utilization in 16nm FinFET 2025 IEEE I
64、nternational Solid-State Circuits Conference30 of 49Dynamic Batching Technique Two-Input BatchIV.Detailed FeaturesQ0K0S0K1Q1S1Query0:1KeyT0:1Score0:1DMMDMMDMMDMMLUT-Based ExponentialQ0K0TV0SMMSMMSMMSMMQ1K1TV1AFUMat Add.Mat Add.For QKT(S),four DMM corestake the same Q for two coresand different Ks.Fo
65、r softmax(P),two AFUsload two different Ss row-by-row.perform two softmax over half of elements.23.1:T-REX:A 68-567s/token,0.41-3.95J/token Transformer Accelerator with Reduced External Memory Access and Enhanced Hardware Utilization in 16nm FinFET 2025 IEEE International Solid-State Circuits Confer
66、ence31 of 49Dynamic Batching Technique Two-Input BatchIV.Detailed FeaturesP0P1V0V1A0A1Prob.0:1Value0:1Attn.0:1DMMDMMDMMDMMLUT-Based ExponentialQ0K0TV0SMMSMMSMMSMMQ1K1TV1AFUMat Add.Mat Add.For PV(A),four SMM corestake the different Ps and Vs.For final output,two AFUsadd for SMM results into twosubmat
67、rices of A.23.1:T-REX:A 68-567s/token,0.41-3.95J/token Transformer Accelerator with Reduced External Memory Access and Enhanced Hardware Utilization in 16nm FinFET 2025 IEEE International Solid-State Circuits Conference32 of 49Dynamic Batching Technique Four-Input BatchIV.Detailed FeaturesQ0K0S0K1K2
68、K3Q1Q2Q3S1S2S3Query0:3KeyT0:3Score0:3DMMDMMDMMDMMLUT-Based ExponentialQ0V0SMMSMMSMMSMMQ2V2Q1Q3V1V3K0K1K2K3AFUFor QKT(S),four DMM corestake the different Qs different Ks.For softmax(P),two AFUsload four different Ss row-by-row.perform four softmax over a fourth of elements.23.1:T-REX:A 68-567s/token,
69、0.41-3.95J/token Transformer Accelerator with Reduced External Memory Access and Enhanced Hardware Utilization in 16nm FinFET 2025 IEEE International Solid-State Circuits Conference33 of 49Dynamic Batching Technique Four-Input BatchIV.Detailed FeaturesP0P1P2P3V0V1V2V3A0A1A2A3Prob.0:3Value0:3Attn.0:3
70、DMMDMMDMMDMMLUT-Based ExponentialQ0V0SMMSMMSMMSMMQ2V2Q1Q3V1V3K0K1K2K3AFUFor PV(A),four SMM corestake the different Ps and Vs.For final output,two AFUsdo nothing for SMM results(they are all submatrices of A).23.1:T-REX:A 68-567s/token,0.41-3.95J/token Transformer Accelerator with Reduced External Me
Dynamic Batching Technique: Results
- The token-length distributions of ViT-B, S2T-M, RD-NMT, and BERT-L are bucketed into L-length, 2L-length, and 4L-length inputs.
- Against the baseline (factorizing model with compression), dynamic batching improves hardware utilization by 1.27-3.30×, normalized latency by up to 3.31×, and normalized EMA by 1.28-3.31×.
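The batching decision itself is simple; below is a sketch under the assumption that it is made purely from the measured input length against the maximum supported length, matching the rule stated in the introduction (2× or 4× shorter inputs are batched by 2 or 4).

```python
def dynamic_batch_factor(input_len, max_len):
    """If an input is at most 1/4 (1/2) of the maximum input length, process
    4 (2) inputs per pass so the loaded weights are reused; otherwise run one."""
    if input_len * 4 <= max_len:
        return 4
    if input_len * 2 <= max_len:
        return 2
    return 1

assert dynamic_batch_factor(32, 128) == 4
assert dynamic_batch_factor(60, 128) == 2
assert dynamic_batch_factor(100, 128) == 1
```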
Two-Direction Accessible Register File: Different Matrix Access Directions
- Tile-wise outer product in the DMM: X is loaded column-by-column (C-C), WS is loaded row-by-row (R-R), and Y = X·WS is stored C-C for the subsequent column product.
- Column product in the SMM: Y is loaded C-C and Z = Y·WD is stored in whatever direction the next operation needs (assume Z needs to be stored R-R).

TRF Architecture
- A square register array (precision not considered for simplicity) with one extra port (dir) that supports load and store in both the horizontal and vertical directions.
- The cells are register-based, so area efficiency is low; TRFs are therefore only placed inside the DMM/SMM cores as input/output buffers, adding about 4% area overhead at the chip level.
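A behavioral model of the TRF helps show what the extra dir port buys: the same array can be filled with SRAM words in one direction and drained in the other without re-reading SRAM. The class and method names are illustrative, not the paper's interface.

```python
import numpy as np

class TwoDirectionRF:
    """Square register array whose words can be accessed row-wise ('R') or
    column-wise ('C'), selected by the extra dir port."""
    def __init__(self, size):
        self.cells = np.zeros((size, size))

    def store(self, addr, word, direction):
        if direction == 'R':
            self.cells[addr, :] = word      # write one horizontal word
        else:
            self.cells[:, addr] = word      # write one vertical word

    def load(self, addr, direction):
        return self.cells[addr, :].copy() if direction == 'R' else self.cells[:, addr].copy()

# Fill row-by-row from SRAM words, then read back column-by-column.
rf = TwoDirectionRF(4)
for r in range(4):
    rf.store(r, np.arange(4) + 10 * r, 'R')
col0 = rf.load(0, 'C')                      # [0, 10, 20, 30] with no extra SRAM pass
```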
Two-Direction Accessible Register File: Tile-Wise Outer Product in the DMM Core
- The main SRAMs hold one tile row per word (row 0 of X00, row 0 of X01, ... and row 0 of W00, row 0 of W01, ...).
- inbufA loads one SRAM word (a row of X00) at a time and stores it vertically, effectively forming X00ᵀ; inbufB loads words and stores them horizontally (W00).
- The output tile Y00 is written as-is into outbuf, and outbuf then stores its vertical elements (columns of Y00) back to SRAM words.

Two-Direction Accessible Register File: Column Product in the SMM Core
- The NZs of each column of WD are loaded first, and then the corresponding sub-columns of Y; inbuf loads the words and stores them horizontally.
- The output tile Z00 is written as-is into outbuf, and outbuf stores its horizontal elements (rows of Z00) back to SRAM words.

Two-Direction Accessible Register File: Additional Utilization and Latency Improvement
- On top of dynamic batching, the TRF improves utilization by 1.12-1.24× across the four transformer-based models (ViT-B, S2T-M, RD-NMT, BERT-L), with a similar 1.11-1.23× latency improvement.
V. Measurement Results
Chip Photograph and Summary
- Flip-chip-packaged die containing the shared-parameter memory, two input/output memories, the distinct-parameter memory, DMM cores #0-#3, SMM cores #0-#3, AFUs #0-#1, and the RISC controller (plus unrelated circuits).
- Power breakdown: DMM cores 56.0%, AFUs 23.0%, SMM cores 13.58%, memory 4.47%, control + I/O 2.99%.
- Specifications: 16nm FinFET; 0.45-0.85 V supply; 60-450 MHz; 7.12-152.5 mW; 10.15 mm²; INT4/8/16 precision support; 1,320 kB on-chip memory.
- On-chip-level evaluation (performance and area efficiency at 0.85 V/450 MHz, energy efficiency at 0.47 V/80 MHz):
  INT16: 0.20-0.54 TOPS, 3.8-10.1 TOPS/W, 0.02-0.053 TOPS/mm²
  INT8: 0.81-2.15 TOPS, 15.2-40.3 TOPS/W, 0.08-0.21 TOPS/mm²
  INT4: 3.25-8.61 TOPS, 60.8-161.2 TOPS/W, 0.32-0.85 TOPS/mm²
- System-level evaluation (estimated with LPDDR3 SDRAM), per model: reduced EMA per input, average HW utilization, latency, energy:
  BERT-L: 10.03 MB, 72.0%, 475 μs/token, 3.946 μJ/token
  S2T-M: 3.48 MB, 75.8%, 233 μs/token, 1.645 μJ/token
  RD-NMT: 0.96 MB, 57.4%, 68 μs/token, 0.407 μJ/token
  ViT-B: 5.47 MB, 89.0%, 567 μs/token, 3.662 μJ/token
Benchmark Analysis
- Across four benchmarks: 15.9-25.6× parameter reduction from factorizing training (FT), 31-66× EMA reduction from FT, data compression (Comp), and dynamic batching (DB), and 1.2-3.7× utilization improvement from DB and the TRF.
- ViT-Base [25]: image classification, ImageNet, Top-1 76.7% (-1.3%p vs. the 78% reference), INT8/16 MM precision, parameters 170 to 6.65 MB, 31.03× EMA reduction, HW utilization 73.9-89.0%.
- RD-NMT [26]: translation, IWSLT, BLEU 29.7 (-1.02 vs. 30.72), INT8/16, parameters 63 to 3.04 MB, 65.95× EMA reduction, utilization 16.7-57.4%.
- S2T-Medium [27]: speech-to-text, LibriSpeech, WER 4.1% (+0.6%p vs. 3.5%; lower is better for WER, higher for the others), INT8/16, parameters 126 to 5.16 MB, 36.18× EMA reduction, utilization 49.1-75.8%.
- BERT-Large [28]: text classification, SST2 (GLUE), 91.0% (-2.1%p vs. 93.1%), INT8/16, parameters 604 to 37.9 MB, 60.17× EMA reduction, utilization 19.5-72.0%.
- INT16 is assumed as the original model precision.
- S2T-Medium breakdown: EMA reduction of 10.32× from FT, a further 2.67× from Comp, and 1.31× from DB; utilization improvement of 1.27× from DB and 1.21× from the TRF.
Comparison with Previous Work
(Design order in each row: ISSCC'22 [1] / ISSCC'22 [2] / VLSI'23 [4] / ISSCC'23 [19] / ISSCC'24 [21] / This Work)
- Technology: 28nm CMOS / 28nm CMOS / 28nm CMOS / 12nm FinFET / 28nm CMOS / 16nm FinFET
- Supply voltage (V): 0.56-1.1 / 0.6-1.0 / 0.68-1.0 / 0.62-1.0 / 0.7-1.1 / 0.45-0.85
- Frequency (MHz): 50-510 / 80-240 / 200-580 / 77-717 / 50-200 / 60-450
- Power (mW): 12.06-272.8 / 27.04-118.21 / 107-391 / 9-122 / 47.5-469.2 / 7.12-152.5
- Area (mm²): 6.82 / 6.83 / 6.4 / 4.6 / 20.25 / 10.15
- Precision support: INT12 / INT8/16 / INT8/16 / FP4/8 / INT8 / INT4/8/16
- On-chip memory (kB): 336 / 216 / 480 / 647 / 500 / 1,320
EMA-Excluded On-Chip-Level Comparison
- Performance (TOPS or TFLOPS) 1): 0.52-4.07 / 1.48 (INT8), 0.37 (INT16) / 1.18-14.82 (INT8), 0.59-6.67 (INT16) / 0.734 (FP4), 0.367 (FP8) / 3.41 (INT8) / 0.81-2.15 (INT8), 0.20-0.54 (INT16)
- Energy efficiency (TOPS/W or TFLOPS/W) 2): 4.25-27.56 / 12.5-20.5 (INT8), 3.1-5.1 (INT16) / 3.03-77.35 (INT8), 1.51-33.7 (INT16) / 6.61-18.1 (FP4), 3.0-8.24 (FP8) / 22.9-47.8 (INT8) / 15.2-40.3 (INT8), 3.8-10.1 (INT16)
- Area efficiency (TOPS/mm² or TFLOPS/mm²) 1): 0.076-0.596 / 0.217 (INT8), 0.054 (INT16) / 0.18-2.31 (INT8), 0.092-0.36 (INT16) / 0.16 (FP4), 0.08 (FP8) / 0.17 (INT8) / 0.08-0.21 (INT8), 0.0197-0.053 (INT16)
EMA-Included System-Level Comparison
- Benchmark latency (μs/token) 1): 584 (ViT-Base) 3) / 3,707 (GPT2-Large) 3) / 411 (BERT-Base), 1,623 (BERT-Large) / 384 (ViT-Base) 3) / 667 (BERT-Base), 466 (GPT2-Large) / 567 (ViT-Base) 3), 68 (RD-NMT) 3), 233 (S2T-Medium) 3), 475 (BERT-Large) 3)
- Benchmark energy (μJ/token) 2): 27.6 (ViT-Base) 3) / 92.19 (GPT2-Large) 3) / 16.41 (BERT-Base) 3), 58.01 (BERT-Large) 3) / 57.02 (ViT-Base) 3) / 75.19 (BERT-Base), 18.1 (GPT2-Large) / 3.662 (ViT-Base) 3), 0.407 (RD-NMT) 3), 1.645 (S2T-Medium) 3), 3.946 (BERT-Large) 3)
1) Numbers at the fastest voltage and frequency. 2) Numbers at the most energy-efficient voltage and frequency. 3) Estimated with LPDDR3 SDRAM.
23.2: A 28nm 0.22μJ/Token Memory-Compute-Intensity-Aware CNN-Transformer Accelerator with Hybrid-Attention-Based Layer-Fusion and Cascaded Pruning for Semantic-Segmentation
2025 IEEE International Solid-State Circuits Conference

Introduction
- [...] token length (>16K). The bottlenecks of ConvFormers in semantic segmentation (SS) are mainly the memory-intensive Backbone and the compute-intensive Segmentation Head (Seg. Head); high-resolution input gives clear segmentation, while low resolution does not.

Memory Bottlenecks in the Backbone
- Challenge 1: Fmap EMA worsens with the large token length of high-resolution SS tasks (>16K tokens).
- Challenge 2: Redundant EMA exists when applying layer fusion between vanilla attention and the convolution in the feed-forward network (FFN).
- Solution 1: Hybrid attention integrates memory-efficient linear attention (LA) with accurate vanilla attention (VA).
- Solution 2: Hybrid attention reduces the number of VA tiles, allowing all VA tiles to be processed together with a single KV EMA.

Compute Bottleneck in the Seg. Head
- Challenge 3: Heavy compute overhead in the Seg. Head with low Fmap sparsity (large intermediate Fmaps; the backbone side is optimized by hybrid attention).
- Solution 3: Increase the number of parameters by weight decomposition to improve Fmap sparsity and reduce computation: a 100×100 pretrained weight is decomposed via SVD into 100×80 and 80×100 weights and fine-tuned (60% more parameters), the intermediate Fmap is pruned to 90% sparsity, FLOPs drop by 84%, and the dense GEMM becomes an output SpMM plus an input SpMM.
Outline
- Introduction
- Key Features of the ConvFormer Accelerator: Hybrid Attention Processing Unit (HAPU), KV-Weight Reused Layer-Fusion Scheduler (LFS), Cascaded Feature Map Pruner (CFMP)
- Measurement Results
- Summary

Overall Architecture
- Hybrid Attention Processing Unit (HAPU): each HAPU contains 1K INT8 MACs in 16 PE lines and performs intra-layer fusion and tiling speculation for VA.
- Layer-Fusion Scheduler (LFS): the various attention operations are clustered by the ACU; KV is reused to fuse vanilla attention with convolution, and conv weights are reused to fuse linear attention with convolution.
- Cascaded Feature Map Pruner (CFMP): prunes the feature map based on the binary map in the FMS and reuses the offsets in the FMS to recover density in the DRU.
Hybrid Attention Training
- Linear attention [3] reduces the Fmap by re-ordering the computation.
- Gumbel-Softmax is used to learn the attention choice (LA vs. VA) for each token tile.
- An EMA loss term is incorporated to control the hybridization ratio.
- Result: 80% EMA reduction with a 0.1 mIoU drop.
- Reference [3]: ICML 2020, "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention".
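The memory saving of linear attention comes from re-ordering the computation so the N×N score map is never materialized, as in the cited ICML 2020 paper; a compact NumPy sketch (single head, elu(x)+1 feature map) is shown below.

```python
import numpy as np

def vanilla_attention(Q, K, V):
    """O(N^2) memory: the full N x N score map must exist before the V product."""
    scores = Q @ K.T
    S = np.exp(scores - scores.max(axis=1, keepdims=True))   # softmax numerator
    return (S @ V) / S.sum(axis=1, keepdims=True)

def linear_attention(Q, K, V):
    """Re-ordered form from 'Transformers are RNNs' (ICML 2020): compute
    phi(K)^T V (a d x d_v matrix) first, so memory no longer scales with N^2."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))       # elu(x) + 1 feature map
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                                             # d x d_v
    Z = Qp @ Kp.sum(axis=0)                                   # per-query normalizer
    return (Qp @ KV) / Z[:, None]
```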
Attention Tiling Manager
- Performs intra-layer fusion for VA to alleviate the EMA of the remaining large VA tiles.
- Adjusts the tiling size on-the-fly for the query, using buffer-overflow speculation.

Layer-Fusion Scheduler (LFS)
- Reuses KV to process all VA tiles and then overwrites KV with the FFN conv weights for VA-Conv fusion.
- Reuses the conv weights and the KᵀV prepared by the HAPU for LA-Conv fusion.
- Step 1: cluster the VA and LA tiles and combine them into two groups.
- Step 2: compute all VA tiles in parallel.
- Step 3: fetch the off-chip fused conv weights and use them to overwrite KV.
- Step 4: fuse each VA output tile with the fused conv weights in the FFN.
- Step 5: reuse the fused conv weights in the FFN to fuse with the LA input.
- The convolution is broken up for non-overlapped fusion, and the broken receptive field is compensated by the long-range dependency of attention.
Cascaded Feature Map Pruning
- Conv layers are decomposed and the masks are found in a post-training manner.
- The Fmap sparsity of the Seg. Head improves from 15% to 91.1%, turning the dense GEMM into an output SpMM plus an input SpMM.

Generalized to Sparse Attention
- The CFMP can be reconfigured to support sparse vanilla attention.
- The tile size trades off compute utilization, accuracy, and EDP.

Cascaded Feature Map Pruner
- Weight-mapping conflicts exist between the two cascaded SpMMs: the first uses tile-column mapping while the second uses tile-row mapping.
- The FMS decodes the mask with early termination and shares the offsets with the DRU; the DRU converts the decoded offsets and performs the accumulation sequentially.
Chip Photograph and Summary
- 3.60 mm × 3.87 mm die containing the LMB, HAPU#0, RMB, HAPU#1, CFMP, LFS, PLL, top controller, SIMD core, and ISA buffer.
- Specifications: TSMC 28nm 1P9M CMOS; 0.65-1.0 V supply; 200-625 MHz; 13.93 mm² die area; 3.28 MB SRAM; INT8 precision; system power 0.16 W (0.65 V, 200 MHz) to 0.97 W (1.0 V, 625 MHz); performance 2.45 TOPS *1 *2; energy efficiency 1.69-52.90 TOPS/W *1 *3; external bandwidth 6.4 GB/s.
- Benchmarks on Cityscapes: SegFormer-B0, PVTv1-Ti, PVTv2-B0; accuracy / accuracy loss (mIoU): 75.47/-0.82, 73.84/-1.27, 75.49/-0.95; latency *2 *4: 29.42, 84.18, 30.26 ms; energy consumption *3 *4: 0.22, 0.67, 0.24 μJ/token.
- *1: the off-chip DDR3 interface is excluded for chip-level evaluation; one operation (OP) represents one multiplication or one addition. *2: 1.0 V, 625 MHz. *3: 0.65 V, 200 MHz (with the 3 solutions). *4: the off-chip DDR3 interface is included for system-level evaluation.
Test Platform and Demonstration

Measurement Results
- 25.9× EMA reduction by the HAPU and LFS [...]

23.3: EdgeDiff: 418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization
2025 IEEE International Solid-State Circuits Conference

I. Introduction
Limitation of Previous Approaches
- Previous diffusion-model accelerators [1: VLSI 2024, Y. Qin; 2: ISSCC 2024, R. Guo] take about 18 s of generation time even for a simple image (64×64, unconditional) and support only unconditional generation.
- They are tailored for many-step diffusion models and exploit inter-timestep redundancy: computation reuse [1] (if the inputs of UNet_t and UNet_t-1 are similar, skip the compute and reuse the previous result) and difference computing [2] (compute the low-bit difference of the input activations IA_t-1 - IA_t and recover OA_t from OA_t-1).
148、up Quantization 2025 IEEE International Solid-State Circuits Conference8 of 54No Inter-timestep RedundancyI.Introduction(FSDM)Reduced Timestep Reduces Redundancy IA from subsequent timesteps are not similar anymore Previous approaches focusing on inter-timestep redundancy are not available3.Sensitiv
149、e to Quantization2.Increased Enc./Dec.Portion1.No Inter-step Redundancy Many-Step Diffusion ModelFew-Step Diffusion ModelDec.UNetUNetUNetT=1T=49T=50ConditioningText Enc.Img.Enc.High Inter-step RedundancyDecUNetUNetT=1T=2ConditioningText Enc.Img.Enc.No Inter-step RedundancySummary of Recent DMs23.3:E
150、dgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference9 of 54Increased Encoder/Decoder Portion I.Introduction(FSDM,MMC)UNet is No Longer Dominant Reduce portion of UNet w
151、ith small#of timesteps Multi-modal conditioning:Additional encoder network for new modalities1.No Inter-step Redundancy 3.Sensitive to Quantization2.Increased Enc./Dec.PortionDecUNetConditioningText Enc.50 1Reducing#of TimestepsDecUNetConditioningText Enc.Img.Enc.1Multi-modal Conditioning50 StepT2I0
152、12350#of MAC(1e12)4Dec(2.6%)UNet(97.4%)50 StepT2I012350442.7%Dec.(57.3%)Dec(2.6%)UNet(97.4%)Text Enc.(0.1%)1 StepT2I#of MAC(1e12)50 StepT2I012350442.7%Dec.(57.3%)Img.Enc.(22.0%)Dec(2.6%)UNet(97.4%)Dec.(44.6%)33.3%Text Enc.(0.1%)1 StepT2I1 StepI2I#of MAC(1e12)Summary of Recent DMs23.3:EdgeDiff:418.4m
153、J/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference10 of 54Sensitive to QuantizationI.Introduction(FSDM)Increased Precision Demands with Fewer Timesteps Unlike Many-step DMs,Few-ste
154、p DMs are sensitive to quantization Mixed Precision(MP)&Group Quantization(GQ)for high-quality2.Increased Enc./Dec.Portion1.No Inter-step Redundancy 3.Sensitive to Quantization50 StepW:FP16IA:FP1650 Step*W:4b-CIA:8b-TRobust to Quant.*T:Per-tensor/C:Per-channel/G:Per-group QuantizationMany-step Model
155、1 Step*W:4b-CIA:8b-T1 Step*W:4/8b-GIA:4/8b-GVulnerable to Quant.Recovered byMP&GQFew-step ModelSummary of Recent DMs23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conf
Solutions for the 3 Characteristics of Recent DMs
- Efficient and accurate quantization is simple but promising: mixed precision (MP) applies different precisions to different channels, and group quantization (GQ) shares a scale factor across a subset of channels, sitting between per-tensor and per-channel quantization.
- EdgeDiff: end-to-end acceleration of the multi-modal few-step DM, with unified optimization of the encoder/UNet/decoder while preserving quality. It accelerates the UNet without relying on inter-timestep redundancy, accelerates the encoder/decoder as much as the UNet, and maintains image quality even with 1-4 timesteps, by applying mixed precision plus group quantization to all of the encoder/UNet/decoder.
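A small sketch of per-group quantization with one scale factor per group of channels; the group size, bit width, and symmetric-max scaling rule are illustrative choices, not EdgeDiff's exact scheme.

```python
import numpy as np

def group_quantize(x, group_size=32, bits=4):
    """Split the channel dimension into groups and give each group its own scale
    factor (in between per-tensor and per-channel quantization)."""
    C = x.shape[-1]
    q = np.zeros_like(x, dtype=np.int32)
    scales = []
    for g in range(0, C, group_size):
        blk = x[..., g:g + group_size]
        s = np.abs(blk).max() / (2**(bits - 1) - 1) + 1e-12   # one SF per group
        scales.append(s)
        q[..., g:g + group_size] = np.round(blk / s)
    return q, np.array(scales)

def group_dequantize(q, scales, group_size=32):
    x = q.astype(np.float32)
    for i, s in enumerate(scales):
        x[..., i * group_size:(i + 1) * group_size] *= s
    return x
```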
Design Challenge 1: Dynamic Saliency
- The channel-wise saliency of the input activation changes dynamically.
- Previous work on LLMs assumes static saliency and handles outliers preemptively [1: "Atom," Y. Zhao; 2: "AWQ," J. Lin]: in Llama-7B with 5% outliers in the FFN, 71-93% of the outlier channels overlap across prompts, so a static reorder suffices.
- In diffusion models the saliency is condition-induced and effectively random across prompts and image references (only about 5% outlier overlap), so dynamic saliency handling is required.
Design Challenge 2: Mixed-Precision MAC
- A bit-scalable PE should support 4/8/12/16b mixed precision by dividing operands into unit MACs, and the unit MAC must support both signed and unsigned operation, since only the MSB sub-word carries the sign.
- Previous bit-enlarged MAC units [1: "BitBlade," Y. Wang; 2: "HNPU," D. Han] need sign-extension units (5b×5b instead of 4b×4b), costing 28% more area and 19% more power; sign-only MAC units [3: "Sibia," D. Im; 4: "LUTein," D. Im] reduce the effective bit width and need 1.56× more operations.

Design Challenge 3: FP Operations for Group Quantization
- Group quantization needs FP scale factors (SFs) for accuracy, so the INT output of the tensor PE must be FP-scaled (multiplied), FP-accumulated, and FP-divided in the quantization unit.
- These FP operations are power-intensive: relative to an INT4 MAC, FP scaling, FP accumulation, and FP division cost far more logic power, up to 164.7× and 284.1× for the latter two.
Proposed Processor: EdgeDiff
- EdgeDiff accelerates all of the encoder/UNet/decoder of the FSDM with mixed precision plus group quantization, using three key features:
  1. Condition-aware Reordered Group Mixed Precision (CRMP): efficient and accurate orchestration of mixed precision and group quantization.
  2. Compress-and-Add (CAA) PE with a Bit Shuffle Tree (BST): an efficient INT unit for mixed precision.
  3. Tiered Accumulation Unit (TAU) and Grid-based Quantization Unit (GQU): an efficient FP unit for group quantization.
III. EdgeDiff Architecture
Overall Architecture of EdgeDiff
- 1. Dynamic Grouping and Reordering Unit (DGRU): supports condition-aware reordered group mixed precision. It contains Channel Grouping Units (CGU: saliency buffer, bypassing logic, sorting unit) and Reorder and Quantization Units (RQU: reorder address translator, grid-based quantization unit, transpose buffer); the grid-based quantization unit enables efficient quantization with FP scaling factors.
- 2. Eight Mixed Group Precision Cores (MGPC), each with a 16×16×32 tensor PE built from PE columns (32-way 4b MACs), an IA buffer (272B), IA memory (32KB), a W buffer (264B), W-scale memory (1KB), W memory (32KB), IA-scale memory (2KB), PSUM memory (8KB), a controller, and a NoC switch. Each PE column uses a Compress-and-Add PE (bit-shuffle tree, 32-way compressor, inter-bit adder) for efficient bit-scalable computing with 4b unit MACs (a behavioral sketch of the 4b-limb idea follows after this list), and a Tiered Accumulation Unit (an INT accumulator backed by an FP accumulator on overflow or out-of-range) for efficient scaling in group quantization.
- 3. A 1,280KB global memory (two 640KB GMEM banks) with a 1D-SIMD core for nonlinear and element-wise operations.
- 4. A top RISC controller (4KB I-cache, 4KB D-cache, instruction decoder, mode registers) and a memory controller, connected through an interconnect network with local broadcasting and accumulation.
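To illustrate what the 4b unit MACs in the PE columns have to support, here is a plain-Python sketch that rebuilds a signed 8b×8b multiply from 4b×4b partial products: only the most-significant limb is treated as signed, so the unit MAC must handle both signed and unsigned operands. This shows the general bit-scalable idea only, not the actual compress-and-add / bit-shuffle-tree hardware.

```python
def split_nibbles_signed(x, bits=8):
    """Two's-complement split into 4b limbs (LSB first); only the top limb is signed."""
    limbs = []
    u = x & ((1 << bits) - 1)
    for i in range(bits // 4):
        limbs.append((u >> (4 * i)) & 0xF)
    top = limbs[-1]
    limbs[-1] = top - 16 if top >= 8 else top    # sign lives only in the MSB limb
    return limbs

def mac_from_4b_units(a, b, bits=8):
    """Rebuild a signed bits x bits multiply from shifted 4b x 4b unit products:
    lower limbs are unsigned, the top limb is signed."""
    acc = 0
    for i, ai in enumerate(split_nibbles_signed(a, bits)):
        for j, bj in enumerate(split_nibbles_signed(b, bits)):
            acc += (ai * bj) << (4 * (i + j))     # shift-and-add of unit MACs
    return acc

for a, b in [(-128, 127), (-7, -3), (100, -45)]:
    assert mac_from_4b_units(a, b) == a * b
```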
188、mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference22 of 54Mitigating Dynamic Saliency in IAIII.EdgeDiff Architecture:1)Condition-aware Reordered Group Mixed Precision(CRMP)Channel
189、 Saliency Fluctuated by Conditioning Monitoring channel-wise saliency of IA Grouping and reordering IA&W Similar scale in a group&Easy outlier handlingIA Precision ControlW Precision ControlDynamic Channel SaliencyChannelGlobal MEMO ChI Ch8b Weight(Per-Channel SF)Batch 2Batch 1Different Scale in a G
190、roup Large Group Quant ErrorChannelIA Precision ControlW Precision Control1)Saliency MonitoringChannel SaliencyChannel-wise MeanDynamic Channel SaliencyChannelGlobal MEMBatch 2Batch 1O ChI Ch8b Weight(Per-Channel SF)Different Scale in a Group Large Group Quant ErrorChannelIA Precision ControlW Preci
191、sion Control1)Saliency Monitoring2)Grouping3)Re-orderingChannel SaliencyChannel-wise MeanDynamic Channel SaliencyChannelSimilar Scale in a GroupGlobal MEMSF(FP8)INT8INT4Batch 2Batch 1 DGRUOutlierDifferent Scale in a Group Large Group Quant ErrorGroupingReorderingO ChI Ch8b Weight(Per-Channel SF)IA P
192、recision ControlW Precision Control1)Saliency Monitoring2)Grouping3)Re-ordering1)Re-ordering-onlyChannel SaliencyChannel-wise MeanDynamic Channel SaliencyChannelSimilar Scale in a GroupQuant.&Reorder by IA OrderGlobal MEMO ChSF(FP8)INT8INT4INT8INT4SF(INT4)I Ch8b Weight(Per-Channel SF)Batch 2Batch 1
193、DGRU DGRUOutlierDifferent Scale in a Group Large Group Quant ErrorGroupingReorderingReordering23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference23 of 54Dynamic
Grouping & Reordering Unit (DGRU) III.EdgeDiff Architecture: 1)Condition-aware Reordered Group Mixed Precision(CRMP)
Step 1) Channel Grouping Unit: Grouping According to Saliency. Sorting the saliency values of 320 channels with a recursive partitioning algorithm.
[Diagram: Channel Grouping Unit = Saliency BUF (320-ch saliency from the 1D SIMD in Global MEM), Recursive Addr. Ctrlr. with partition table (ID, Start/End Index, Pivot Value), Sorted Addr BUF; worked example of recursive partitioning; intra-group order has no impact on grouping]
23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference 24 of 54
Dynamic Grouping & Reordering Unit (DGRU) III.EdgeDiff Architecture: 1)Condition-aware Reordered Group Mixed Precision(CRMP)
Step 1) Channel Grouping Unit: Grouping According to Saliency. Sorting the saliency values of 320 channels with a recursive partitioning algorithm; bypassing intra-group sorting (see the Python sketch after this slide).
[Diagram: bypassing check compares a partition's 4b/5b start/end indices against the group boundaries; group-bypassed sorting skips intra-group sorting; grouping latency (cycles) chart, baseline vs. bypass: grouping latency reduced by 29.8%]
23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Fe
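The sketch below (plain Python with my own function and variable names, not the RTL; 320 channels and groups of 32 per the slide) illustrates the CGU idea: sort channel saliency into groups with a quicksort-style recursive partitioner, and bypass any recursion whose index range already lies inside one group, since intra-group order cannot change the group assignment.

import numpy as np

def group_channels(saliency, group_size=32):
    # Returns a channel order whose g-th slice of `group_size` entries holds the
    # channels with the g-th chunk of sorted saliency (group membership only).
    idx = np.arange(len(saliency))
    sal = np.asarray(saliency, dtype=float)

    def partition(lo, hi):  # hi is exclusive
        if hi - lo <= 1:
            return
        # Bypass check: the whole span lies inside one group, so its internal
        # order cannot affect any group assignment -- skip further sorting.
        if lo // group_size == (hi - 1) // group_size:
            return
        pivot = sal[idx[(lo + hi) // 2]]
        left, right = lo, hi - 1
        while left <= right:
            while sal[idx[left]] < pivot:
                left += 1
            while sal[idx[right]] > pivot:
                right -= 1
            if left <= right:
                idx[left], idx[right] = idx[right], idx[left]
                left, right = left + 1, right - 1
        partition(lo, right + 1)
        partition(left, hi)

    partition(0, len(sal))
    return idx

# Group membership matches a full sort even though groups are never sorted inside.
sal = np.random.default_rng(0).random(320)
order = group_channels(sal)
assert np.allclose(np.sort(sal).reshape(-1, 32),
                   np.sort(sal[order].reshape(-1, 32), axis=1))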
199、w-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference25 of 54Dynamic Grouping&Reordering Unit(DGRU)III.EdgeDiff Architecture:1)Condition-aware Reordered Group Mixed Precision(CRMP)Step 2)Reorder and Quant Unit
200、:Rearrange IA&W for Quant.Reorder while core loads tensor from global memory in streaming manner Address translation with a sorted address from CGUChannel Grouping Unit.RecursiveAddr Ctrlr.Partition IndexStartEndID0 Idx_S0 Idx_E01 Idx_S1 Idx_E1Pivot Value Saliency(320 Ch)SaliencyBUFSorted Addr BUF1D
201、 SIMD in Global MEMBypassing Check=BypassStart IndexEnd Index4b5bSorted Addr BUFGrid-based Quant.(Detailed in Feature 3)Transpose BUFReorder and Quant.Unit Global MEMAddress TranslationTo IMEM of MGPC10 Group32 Ch0Ch in G01Ch in G12Ch in G2Ch AddrGIDCh Addrof GroupsCh AddrToken AddrGroup IDToken Add
202、r.9Ch in G93Ch in G3DMA Address Controller(GMEM to IMEM)Sequential Addr.Reordered Addr.Reordered Activation23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference26
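As a software analogy of the RQU address translation described above (not the hardware DMA; the array layout, names, and the [group, token, channel] IMEM arrangement are my assumptions), the reorder can happen on the fly while the tensor streams sequentially out of global memory:

import numpy as np

def build_reorder_lut(sorted_channels, group_size=32):
    # channel index -> (group ID, offset inside the group), from the CGU's sorted order
    return {ch: (pos // group_size, pos % group_size)
            for pos, ch in enumerate(sorted_channels)}

def stream_reorder(gmem, sorted_channels, group_size=32):
    # gmem: [tokens, channels] activations read in plain sequential order;
    # the write address into IMEM is translated so each group's 32 channels
    # land contiguously, ready for per-group scaling and quantization.
    tokens, channels = gmem.shape
    lut = build_reorder_lut(sorted_channels, group_size)
    imem = np.empty((channels // group_size, tokens, group_size), dtype=gmem.dtype)
    for ch in range(channels):            # sequential GMEM addresses
        g, off = lut[ch]                  # translated (reordered) IMEM address
        imem[g, :, off] = gmem[:, ch]
    return imem

# e.g. 320 channels -> 10 groups of 32, as in the UNet layers described later
rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 320))
order = np.argsort(rng.random(320))       # stand-in for the CGU output
grouped = stream_reorder(acts, order)     # shape (10, 64, 32)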
of 54
Results of Applying CRMP III.EdgeDiff Architecture:1)Condition-aware Reordered Group Mixed Precision(CRMP)
Successfully Reordered IA and Yield Uniform Channel Scales; Varying precision ratio (LP/HP) according to network type and # of timesteps. Encoder(Convolution Layer)UNet(QKV Generation)(Convolution Lay
204、er)(FFN Layer)Decoder(Convolution Layer)Token256 ChannelToken256 ChannelToken320 Channel320 ChannelTokenToken320 ChannelDynamic Re-orderDynamic Re-orderHigh PrecisionLow Precision8 Group 32 chHigh PrecisionLow Precision10 Group 32 chLow PrecisionHigh Precision10 Group 32 chHigh PrecisionLow Precisio
205、n10 Group 32 chHigh PrecisionLow Precision8 Group 32 chDynamic Re-order8b(HP)4b(LP)StepPrecision Ratio1425%0%75%100%8b(HP)4b(LP)StepPrecision Ratio1420%10%80%90%8b(HP)4b(LP)StepPrecision Ratio1425%0%75%100%8b(HP)4b(LP)StepPrecision Ratio1420%10%80%90%8b(HP)4b(LP)StepPrecision Ratio1420%10%80%90%*SDX
206、L-Turbo-I2I Model 1step/4step23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference27 of 54Performance Improvement w/CRMPIII.EdgeDiff Architecture:1)Condition-aware
 Reordered Group Mixed Precision(CRMP)
1.49×-2.72× Energy Efficiency Improvement and 1.61×-2.00× Speed Up on the T2I & I2I Benchmarks (*SDXL-Turbo-T2I / Exclude EMA / 50MHz)
[Chart: energy efficiency (TOPS/W) w/ vs. w/o CRMP for Enc. / UNet / Dec.: 1.90× / 1.78× / 1.49× at 1 step and 2.72× / 1.97× / 1.79× at 4 steps]
[Chart: speed up over the many-step DM (*Baseline: 50-step (4b), Proposed: 1/4-step (Mixed 4/8b)), w/o CRMP vs. w/ CRMP: T2I 9.7 to 15.7 (1-step, 1.61×) and 4.3 to 8.3 (4-step, 1.93×); I2I 7.8 to 13.1 (1-step, 1.67×) and 3.9 to 7.8 (4-step, 2.00×)]
23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conferen
209、ce28 of 54Proposed Processor:EdgeDiffI.IntroductionCondition-aware Reordered Group Mixed Precision(CRMP)Compress-and-Add(CAA)Tiered Accumulation Unit(TAU)Grid-based Quantization Unit(GQU)Bit Shuffle Tree(BST)EdgeDiff:Accelerate All Encoder/Unet/Decoder of FSDM with Mixed Precision+Group Quantization
210、Efficient INT Unit for Mixed Precision Efficient FP Unit for Group Quantization Efficient&Accurate Orchestration of Mixed Precision&Group Quantization 12323.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE I
nternational Solid-State Circuits Conference 29 of 54
Bit-scalable Operation for Mixed Precision III.EdgeDiff Architecture: 2)Compress-and-Add(CAA)PE with Bit-Shuffle Tree(BST)
Combine 4b-Unit MACs for Higher-bit Operation: Two's-complement representation to support signed multiplication; four sign modes required (SS, SU, US, UU), which brings reconfiguration overhead.
An 8b-8b MAC is composed of four 4b-unit passes selected by the sign bits S_W and S_IA: UU on (W3:0, IA3:0), US on (W3:0, IA7:4), SU on (W7:4, IA3:0), and SS on (W7:4, IA7:4); each mode fixes a 4x4 sign matrix S_{i,j} of +1/-1 values.
Product = Σ_{i=0..3} Σ_{j=0..3} S_{i,j} · 2^(i+j) · Σ_{k=1..32} (W_k[i] & IA_k[j]), where S_{i,j} = s_W[i] · s_IA[j] and the MSB weight s[3] is -1 for a signed operand, +1 otherwise.
23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference 30 of 54
Concept of Compress-and-Add(CAA) III.EdgeDiff Architecture: 2)Compress-and-Add(CAA)PE with Bit-Shuffle
Tree(BST)
Reducing Overhead by Changing Accumulation Order
Baseline (Mult.+AT): inter-bit first (multiplication w/ sign reconfiguration in every multiplier), then inter-channel addition (adder tree after the multipliers):
Product = Σ_{k=1..32} Σ_{i=0..3} Σ_{j=0..3} S_{i,j} · 2^(i+j) · (W_k[i] & IA_k[j])
CAA: inter-channel first (compressor across the 32 input channels), then inter-bit shift-and-add w/ sign reconfiguration in the inter-bit adder, so the sign logic is applied once per bit pair instead of once per channel:
Product = Σ_{i=0..3} Σ_{j=0..3} S_{i,j} · 2^(i+j) · Σ_{k=1..32} (W_k[i] & IA_k[j])
[Figure: 32 input channels (k), 4 IA bits (j), 4 W bits (i); Multiplier w/ Sign Reconfig. + Adder Tree (Mult.+AT) vs. Compressor + Shift-and-Add w/ Sign Reconfig. (CAA)]
23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference 31 of 54
Detail CAA Micr
216、o-architectureIII.EdgeDiff Architecture:2)Compress-and-Add(CAA)PE with Bit-Shuffle Tree(BST)Bit-wise Compressor(BC)16+Inter-bit Adder(IBA)BC 16:Accumulating 1b-1b products across 32 channel IBA:Accumulating partial-sum with minimal sign reconfiguration overheadInter-bit Adder(IBA)32 Ich.(k)SIASW4 IA
217、 bits(j)CAA PE(32-way 4b 4b)Wki&IAkjk32Si,j 2i+ji4j413b9b9b9b9b6b16 PSUM from 16 BC#3#2#1Shift-Add(W#0)SW2ADD/SUB6b4ADDADDShift-Add(IA)ADDADD/SUBADD2SIA8-to-4 Comp#0.W31IA31W0IA0W1IA1W2IA2W3IA3W4IA4W5IA5W6IA6W7IA7+8-to-4#18-to-4#38-to-4#2Bit Shuffle TreeFAFAHAFAFAHAHAP0P1P2P3+6b23.3:EdgeDiff:418.4mJ
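To make the compress-and-add order concrete, here is a small NumPy model (my own code, checked against ordinary integer dot products, not the PE netlist): the 1b AND products are reduced across the 32 channels first (the compressor, i.e. a popcount), and only then shift-added across bit positions with the sign matrix S_{i,j}; the same four sign modes compose an 8b×8b dot product from four 4b passes.

import numpy as np

SIGN = {  # per-bit signs: the MSB of a signed 4b operand carries weight -2^3
    "SS": (np.array([1, 1, 1, -1]), np.array([1, 1, 1, -1])),
    "SU": (np.array([1, 1, 1, -1]), np.array([1, 1, 1, 1])),
    "US": (np.array([1, 1, 1, 1]),  np.array([1, 1, 1, -1])),
    "UU": (np.array([1, 1, 1, 1]),  np.array([1, 1, 1, 1])),
}

def caa_dot(w4, a4, mode):
    # 32-way 4b x 4b dot product in CAA order: inter-channel compress first,
    # then inter-bit shift-add with sign reconfiguration S_{i,j} = sw[i] * sa[j].
    sw, sa = SIGN[mode]
    w_bits = (w4[:, None] >> np.arange(4)) & 1          # [32, 4] weight bit planes
    a_bits = (a4[:, None] >> np.arange(4)) & 1          # [32, 4] activation bit planes
    acc = 0
    for i in range(4):
        for j in range(4):
            popcnt = int(np.sum(w_bits[:, i] & a_bits[:, j]))     # 32-way compressor
            acc += int(sw[i]) * int(sa[j]) * (popcnt << (i + j))  # inter-bit adder
    return acc

def dot8_from_4b_units(w8, a8):
    # Signed 8b x 8b dot product composed from the four 4b sign modes.
    whi, wlo = (w8 >> 4) & 0xF, w8 & 0xF        # high nibble signed, low unsigned
    ahi, alo = (a8 >> 4) & 0xF, a8 & 0xF
    return ((caa_dot(whi, ahi, "SS") << 8) + (caa_dot(whi, alo, "SU") << 4)
            + (caa_dot(wlo, ahi, "US") << 4) + caa_dot(wlo, alo, "UU"))

rng = np.random.default_rng(0)
w, a = rng.integers(0, 256, 32), rng.integers(0, 256, 32)    # raw 8b codes, 32 channels
s8 = lambda x: np.where(x > 127, x - 256, x)                 # two's-complement view
assert dot8_from_4b_units(w, a) == int(np.dot(s8(w), s8(a)))
s4 = lambda x: np.where(x > 7, x - 16, x)
assert caa_dot(w & 0xF, a & 0xF, "SS") == int(np.dot(s4(w & 0xF), s4(a & 0xF)))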
218、/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference32 of 54Detail CAA Micro-architectureIII.EdgeDiff Architecture:2)Compress-and-Add(CAA)PE with Bit-Shuffle Tree(BST)Step 1:Bit-wise
219、Product Each 1b weight multiplies each 1b input activation 32 of partial products per bit To the bit-wise compressor8-to-4 Comp#0+8-to-4#18-to-4#38-to-4#2Bit Shuffle TreeFAFAHAFAFAHAHAP0P1P2P3+6b.W31IA31W0IA0W1IA1W2IA2W3IA3W4IA4W5IA5W6IA6W7IA713b9b9b9b9b6b16 PSUM from 16 BC#3#2#1Shift-Add(W#0)SW2ADD
220、/SUB6b4ADDADDShift-Add(IA)ADDADD/SUBADD2SIAInter-bit Adder(IBA)SIASW32 Ich.(k)4 IA bits(j)k32Si,j 2i+ji4j4 Wki&IAkjCAA PE(32-way 4b 4b)23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-
221、State Circuits Conference33 of 54Detail CAA Micro-architectureIII.EdgeDiff Architecture:2)Compress-and-Add(CAA)PE with Bit-Shuffle Tree(BST)Step 2:Inter-channel Accumulation 32-to-6 compressor for summing 1b-products across 32 input channels Compact partial sum for each bit8-to-4 Comp#0.W31IA31W0IA0
222、W1IA1W2IA2W3IA3W4IA4W5IA5W6IA6W7IA7+8-to-4#18-to-4#38-to-4#2Bit Shuffle TreeFAFAHAFAFAHAHAP0P1P2P3+6b13b9b9b9b9b6b16 PSUM from 16 BC#3#2#1Shift-Add(W#0)SW2ADD/SUB6b4ADDADDShift-Add(IA)ADDADD/SUBADD2SIAInter-bit Adder(IBA)SIASW32 Ich.(k)4 IA bits(j)Si,j 2i+ji4j4k32 Wki&IAkjCAA PE(32-way 4b 4b)23.3:Ed
223、geDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference34 of 54Detail CAA Micro-architectureIII.EdgeDiff Architecture:2)Compress-and-Add(CAA)PE with Bit-Shuffle Tree(BST)St
224、ep 3:Inter-bit Accumulation(Weight)Shift-and-add for 4b weight ADD/SUB logic MSB sign reconfiguration by sign mode(SW)8-to-4 Comp#0.W31IA31W0IA0W1IA1W2IA2W3IA3W4IA4W5IA5W6IA6W7IA7+8-to-4#18-to-4#38-to-4#2Bit Shuffle TreeFAFAHAFAFAHAHAP0P1P2P3+6b13b9b9b9b9b6b16 PSUM from 16 BC#3#2#1Shift-Add(W#0)SW2A
225、DD/SUB6b4ADDADDShift-Add(IA)ADDADD/SUBADD2SIASIASWInter-bit Adder(IBA)32 Ich.(k)4 IA bits(j)i4j4k32 Wki&IAkjSi,j 2i+jCAA PE(32-way 4b 4b)23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Soli
226、d-State Circuits Conference35 of 54Detail CAA Micro-architectureIII.EdgeDiff Architecture:2)Compress-and-Add(CAA)PE with Bit-Shuffle Tree(BST)Step 4:Inter-bit Accumulation(Input Activation)Shift-and-add for 4b input activation ADD/SUB logic MSB sign reconfiguration by sign mode(SIA)13b9b9b9b9b6b16 P
227、SUM from 16 BC#3#2#1Shift-Add(W#0)SW2ADD/SUB6b4ADDADDShift-Add(IA)ADDADD/SUBADD2SIA8-to-4 Comp#0.W31IA31W0IA0W1IA1W2IA2W3IA3W4IA4W5IA5W6IA6W7IA7+8-to-4#18-to-4#38-to-4#2Bit Shuffle TreeFAFAHAFAFAHAHAP0P1P2P3+6bSIASWInter-bit Adder(IBA)32 Ich.(k)4 IA bits(j)Wki&IAkjk32Si,j 2i+ji4j4CAA PE(32-way 4b 4b
228、)23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference36 of 54Detail CAA Micro-architectureIII.EdgeDiff Architecture:2)Compress-and-Add(CAA)PE with Bit-Shuffle Tre
229、e(BST)Bit Shuffle Tree(BST)before Compressor In CAA architecture,compressor power is dominant BST logic inserted to reduce compressor power Wki&IAkjk32Si,j 2i+ji4j4Inter-bit Adder(IBA)SIASW32 Ich.(k)4 IA bits(j)13b9b9b9b9b6b16 PSUM from 16 BC#3#2#1Shift-Add(W#0)SW2ADD/SUB6b4ADDADDShift-Add(IA)ADDADD
230、/SUBADD2SIA8-to-4 Comp#0.W31IA31W0IA0W1IA1W2IA2W3IA3W4IA4W5IA5W6IA6W7IA7+8-to-4#18-to-4#38-to-4#2Bit Shuffle TreeFAFAHAFAFAHAHAP0P1P2P3+6bCompressorPowerDominant!CAA PE(32-way 4b 4b)23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Gr
231、oup Quantization 2025 IEEE International Solid-State Circuits Conference37 of 54Detail Bit Shuffle Tree(BST)ArchitectureIII.EdgeDiff Architecture:2)Compress-and-Add(CAA)PE with Bit-Shuffle Tree(BST)Realigning the Bit Patterns for Less Frequent Toggles BST:2.5-level tree structure of Unit Shuffling L
ogic (USL). USL: simple logic (1 AND + 1 OR) that aligns its 2 input bits: D = In0 OR In1 (dense output), S = In0 AND In1 (sparse output), steering any '1' to the dense side while maintaining the number of 1s (truth table In0,In1 -> D,S: 00->00, 01->10, 10->10, 11->11; see the Python sketch below). [Figure: the BST built as a tree of USLs; an arbitrary input passes through and comes out aligned, outputs labeled Sparse/Dens
233、e23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference38 of 54Ch.ID:81624Bit Distribution(Ratio of 1)32.51InputOutputOutput1000010110001011Input0001110101100100Det
234、ail Bit Shuffle Tree(BST)ArchitectureIII.EdgeDiff Architecture:2)Compress-and-Add(CAA)PE with Bit-Shuffle Tree(BST)Realigning the Bit Patterns for Less Frequent Toggles Input vector are aligned as pass through BST Toggle rate reduction:Random position 0/1 Aligned 0/1DSSSDDSDSDSDSSDDDDSSDSDSDSDSIn1In
235、0In1In0In1In0In1In0DSDSDSDSIn1In0In1In0In1In0In1In0DSDSIn1In0In1In0Bit Shuffle Tree(BST)Aligned 0/1 RandomPosition 0/1(Aligned 0/1)(Random 0/1)23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE Internationa
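A toy model of the unit shuffling logic and its effect (a sketch under my own assumptions: the real BST is the specific 2.5-level tree in the figure, whereas this simply chains a few USL layers to show that 1s are steered toward stable positions, cutting switching activity while preserving the popcount the compressor needs):

import numpy as np

def usl(in0, in1):
    # Unit Shuffling Logic: one OR + one AND per bit pair.
    # Any '1' goes to the Dense output; the popcount of the pair is unchanged.
    return in0 | in1, in0 & in1

def shuffle_layer(bits):
    d, s = usl(bits[0::2], bits[1::2])
    return np.concatenate([d, s])          # dense halves first, sparse halves last

def align_bits(bits, levels=3):
    for _ in range(levels):
        bits = shuffle_layer(bits)
    return bits

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=(2000, 32))                  # random 32b vectors per cycle
y = np.array([align_bits(v) for v in x])
assert np.array_equal(x.sum(axis=1), y.sum(axis=1))      # number of 1s preserved
toggle = lambda m: (m[1:] != m[:-1]).mean()              # per-bit switching activity
print(f"toggle rate: {toggle(x):.3f} raw -> {toggle(y):.3f} aligned")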
236、l Solid-State Circuits Conference39 of 54Performance of CAA with BSTIII.EdgeDiff Architecture:2)Compress-and-Add(CAA)PE with Bit-Shuffle Tree(BST)BST:1.72 Toggle Rate Reduction of FA/HA in CompressorCAA+BST:36.6%Power Reduction without Area Overhead Relative AreaMult+AT CAA+BST01.5AddMult.CompBSTShi
237、ft-AddNo Overhead1.0581.000MAC Op.Power mWMult+ATCAA onlyCAA+BST10015005036.6%CompShift-AddCompBSTShift-Add250MHz24.7%AddMult.Toggle Ratew/o BSTw/BST0.510.8670.5031.7223.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantizatio
238、n 2025 IEEE International Solid-State Circuits Conference40 of 54Proposed Processor:EdgeDiffI.IntroductionCondition-aware Reordered Group Mixed Precision(CRMP)Compress-and-Add(CAA)Tiered Accumulation Unit(TAU)Grid-based Quantization Unit(GQU)Bit Shuffle Tree(BST)EdgeDiff:Accelerate All Encoder/Unet/
239、Decoder of FSDM with Mixed Precision+Group QuantizationEfficient INT Unit for Mixed Precision Efficient FP Unit for Group Quantization Efficient&Accurate Orchestration of Mixed Precision&Group Quantization 12323.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed
240、-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference41 of 54Concept of Tiered Accumulation Unit(TAU)III.EdgeDiff Architecture:3)Tiered Accumulation Unit(TAU)and Grid-based Quantization Unit(GQU)High-bit(FP32)for Accumulation Due to FP Scaling Factor Not
241、 all partial-sum requires wide range Tiering into INT-Accum.&FP-Accum.32Scale FactorSFIASFWINT4FP8/FP16SFIA,ManSFWINT MACINTScaleMult.+FP Accumulator(F-AC)INT Accumulator(I-AC)MultOutlier?I-AC Accum.F-AC Accum.(I-AC PSUM)Overflow?F-AC Accum.(MAC PSUM)Accum.End?23.3:EdgeDiff:418.4mJ/Inference Multi-M
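The tiering policy can be modeled in a few lines (my own framing and parameter choices, meant only to show why a narrow INT accumulator plus a rarely-used FP path returns the same result as always accumulating in FP32): each group partial sum arrives as an integer times a power-of-two scale, inliers are accumulated as cheap fixed-point integers, and only outliers and overflow spills touch the FP accumulator.

class TieredAccumulator:
    # Toy TAU: i_acc is a narrow integer accumulator whose LSB weighs 2**lsb_exp;
    # f_acc is the wide FP accumulator used for outliers, spills, and the final sum.
    def __init__(self, iac_bits=24, lsb_exp=0):
        self.iac_bits, self.lsb_exp = iac_bits, lsb_exp
        self.i_acc, self.f_acc = 0, 0.0

    def _fits(self, v):                      # out-of-range detection (magnitude check)
        return abs(v) < (1 << (self.iac_bits - 1))

    def add(self, int_psum, exp):            # one group partial sum = int_psum * 2**exp
        shift = exp - self.lsb_exp
        shifted = int_psum << shift if shift >= 0 else None
        if shifted is None or not self._fits(shifted):
            self.f_acc += int_psum * 2.0 ** exp               # outlier: bypass to FP
        elif self._fits(self.i_acc + shifted):
            self.i_acc += shifted                             # inlier: cheap INT accumulate
        else:
            self.f_acc += self.i_acc * 2.0 ** self.lsb_exp    # overflow: spill I-AC PSUM
            self.i_acc = shifted

    def result(self):
        return self.f_acc + self.i_acc * 2.0 ** self.lsb_exp

import random
random.seed(0)
terms = [(random.randint(-2000, 2000), random.randint(-2, 12)) for _ in range(256)]
tau = TieredAccumulator()
for p, e in terms:
    tau.add(p, e)
assert abs(tau.result() - sum(p * 2.0 ** e for p, e in terms)) < 1e-9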
242、odal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference42 of 54Detail Operation of TAU Tier 1III.EdgeDiff Architecture:3)Tiered Accumulation Unit(TAU)and Grid-based Quantization Unit(GQU)(Tier 1:I-AC)Chec
243、king Range&In-range Data Accumulation ORDU checks whether the PSUM is in-range(#of active bit)+(SF exponent)I-AC bit Accumulate inlier&Bypass outlierI-ACINT MACSUMF-ACI-AC PSUMMAC PSUMINT ScaleOut-of-Range Detection Unit(ORDU)SFIA,Exp+X0000001XXX.UsedOutlier?Comp.1Comp.2Comp.15LDC#31.Arithmetic Shif
244、tOAQExp.Ctrlr.OAExpOAMan1Scale FactorGrid Value(n-7.5)for n=015Mult.0Mult.1Mult.2Mult.15.11100-6.511010-5.501111+7.511000-7.5.SFManSFExpGrid Generate CircuitSFExpSFManScale Factor Grid23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered
245、Group Quantization 2025 IEEE International Solid-State Circuits Conference47 of 542-Stage GQU Operation (2)III.EdgeDiff Architecture:3)Tiered Accumulation Unit(TAU)and Grid-based Quantization Unit(GQU)Step 2:Level Detect Circuit(LDC)Arithmetically shift OA and compare against the SFG to detect OAQSc
246、ale FactorGrid Value(n-7.5)for n=015Mult.0Mult.1Mult.2Mult.15.11100-6.511010-5.501111+7.511000-7.5.SFManSFExpGrid Generate CircuitSFExpSFManLDC#31Level Detect Circuit#0Unary Code 2s Comp Value4b.Comp.0Comp.1Comp.2Comp.15LDC#31.Scale Factor GridArithmetic ShiftOAQExp.Ctrlr.OAExpOAMan123.3:EdgeDiff:41
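A behavioral model of grid-based quantization (floats instead of the OAExp/OAMan datapath, and my own function names): the scale-factor grid (n - 7.5)·SF is generated once per group, and each output activation is binned by comparison against the 16 grid levels, which reproduces the divider-based INT4 quantizer (up to ties exactly on a decision boundary) without any division.

import numpy as np

def grid_quantize(oa, sf):
    # Grid generate circuit: 16 decision levels shared by the whole group.
    grid = (np.arange(16) - 7.5) * sf
    # Level detect circuit: a unary count of levels at or below OA, mapped to INT4 [-8, 7].
    unary = (oa[..., None] >= grid).sum(axis=-1)
    return np.clip(unary, 0, 15) - 8

def naive_quantize(oa, sf):
    # Reference quantizer that needs an FP divide per element.
    return np.clip(np.floor(oa / sf + 0.5), -8, 7).astype(int)

rng = np.random.default_rng(0)
sf = 0.37                                 # group scale factor (FP8 in the chip)
oa = rng.normal(0.0, 2.0, size=4096)
print("codes agree:", np.array_equal(grid_quantize(oa, sf), naive_quantize(oa, sf)))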
247、8.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference48 of 54Performance Improvement w/GQUIII.EdgeDiff Architecture:3)Tiered Accumulation Unit(TAU)and Grid-based Quantization Unit
(GQU)
95.4% Power & 66.7% Area Saved for 4-bit Quantization; 21.7×-5.4× Energy Saving for 4/8/12/16-bit Quantization (*Measured at 50MHz).
[Charts: quantization energy saving vs. a naive design with an FP divider: 21.7× / 10.8× / 7.2× / 5.4× for 4b / 8b / 12b / 16b; GQU peak power (mW) and relative area for the naive design, +grid-based, and +shared GGC: 95.4% power and 66.7% area saved]
23.3:EdgeDiff
:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference 49 of 54
Chip Photograph and Summary IV.Implementation Results
[Die photo, 4.5 mm × 4.5 mm: MGPC#0-#7, GMEM#0, GMEM#1, DGRU, SIMD, Top Ctrlr.]
1 MAC = 2 OPs, I/O Voltage = 1.8V. Specifications: Samsung 28nm 1P8M CMOS; 4.5 mm × 4.5 mm (20.25 mm2); 600 KB; 0.68-1.0 V (Technology / Chip Area / SRAM / Supply Voltage; Data Type and SRAM breakdown follow). 1)50MHz, 0.68V 2)SDXL-Turbo, 512×512 Image 3)Excluding EMA 4)250MHz, 1.0V 5)Including EMA (Estimated w/ DDR3 SDRAM). Core Global Other
251、s1280 KB72KBINT 4/8/12/16GEMMScale FactorAccum.IA:FP8/16W:INT4FP3250-250MHzFrequencyEnergy1,3mJText-to-Image GenerationTask2Image-to-Image Generation1 Step4 Step418.4786.3506.5846.0Generation Time4,5 s1 Step4 Step2.127.332.317.52Per-group(32)Per-channelEnergy Efficiency1TOPS/WQuantizationINT4INT834.
4 8.6 39.9 10.0 (TOPS/W: per-group INT4, per-group INT8, per-channel INT4, per-channel INT8) Diffusion Model Performance 23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference 50 of 54 41.6% Chip Power Reduction; 1.6×-2.0× Speed-up Even on few-step T2I a
nd I2I Benchmark
Measurement Results IV.Implementation Results
[Chart, chip power consumption (mW) at 50MHz, 0.68V, broken into MGPC / GQU / DGRU+GMEM: Baseline 316.7, +TAU 230.9, +CAA/BST 190.6, +GQU gives a 41.6% total reduction]
[Charts, speed up over the 50-step baseline. Image-to-Image (I2I), 512×512 image from a reference image and text prompts: few-step only 7.8 (1-step) / 3.9 (4-step); few-step w/ EdgeDiff 13.1 / 7.8, i.e. 1.7× (1-step) and 2.0× (4-step) from EdgeDiff. Text-to-Image (T2I), 512×512 image generation from a text prompt: few-step only 9.7 / 4.3; few-step w/ EdgeDiff 15.7 / 8.3, i.e. 1.6× and 1.9×]
23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-P
255、recision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference51 of 54T2I benchmark with MS-COCO text prompts 30KI2I benchmark with MS-COCO reference image+custom promptsT2I&I2I Generation Test with Public DatasetIV.Implementation Results30.62 dBI2IText PromptOutpu
256、t Image31.19 dBPrompt 1Prompt 2 31.24 dBPrompt 3InputImageReference Image from MS-COCO Dataset1-step4-stepT2IOutput Image0.310(FP:0.321)20.09(FP:19.76)0.310(FP:0.317)19.25(FP:19.00)CLIP ScoreFIDText Prompt23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Pre
cision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference 52 of 54 Performance Comparisons IV.Implementation Results 1)250MHz,1.0V 2)50MHz,0.68V 3)Measured SDXL-Turbo-T2I, 512×512 Image Size 4)EMA is excluded Energy estimated using reported maximum efficiency 5)w/inte
258、r-timestep redundancy 6)w/o inter-timestep redundancyThis WorkTechnology(nm)FunctionSupply Voltage(V)Frequency(MHz)Die Area(mm2)PrecisionPerformanceTOPSPeakDiffusion ModelSupport280.68-1.050-25020.25Few-step Diffusion ModelINT 4/8/12/16+Group Quant.31.3(INT4)/7.8(INT8)3.05(INT 12)/1.95(INT16)139.9(I
259、NT4)/34.4(INT4-GQ)10.0(INT8)/8.6(INT8-GQ)2Generation Energy mJ/Inf.3,4418.4(1 Step)/786.3(4 Step)2DenoisingFull DMModality+Few-step ModelO(End-to-End)Conditional:Text,ImageBenchmark14.2-15.5(1 Step-4 Step)2,3S.VLSI 24ISSCC 2428220.6-1.00.6-1.050-540120-5403.673.70Diffusion ModelDiffusion ModelINT+FP
260、16/BF16Approx.FP166.636(BF16)4.424(FP16)0.47-12.02(FP16)73.34(BF16)67.89(FP16)1.67-52.01(FP16)14005(50 Step,Unet Only)14766(1 Step)28475(50 Step)Multi-step OnlyX(UNet Only)UnconditionalMulti-step OnlyX(UNet Only)UnconditionalISSCC 22280.56-1.150-5106.82Vision TransformerXINT120.52-4.07(INT12)1.91-27
.56(INT 12)-58.01-67.89(50 Step)-8.2(ImageNet ViT-B)Energy Efficiency TOPS/WXX
First Support of End-to-end/Few-step/Multi-modal DM
3.3×-6.8× Lower Ene
269、rgy Consumption23.3:EdgeDiff:418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference53 of 54Demonstration System of EdgeDiffIV.Implementation Results23.3:EdgeDiff:418.4mJ/Inferenc
270、e Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization 2025 IEEE International Solid-State Circuits Conference54 of 54EdgeDiff:Highly Energy Efficient Diffusion Model AcceleratorFor Efficient Few-step Inference/Multi-modal Conditioning1)Condition-awa
re Reordered Group Mixed Precision(CRMP): 1.61×-2.00× speed-up with mixed precision & group quantization 2)Compress-and-Add(CAA)PE with Bit Shuffle Tree(BST): 36.6% power saving of the mixed-precision PE 3)Tiered Accumulation Unit(TAU)&Grid-based Quantization Unit(GQU): 80.3% higher efficiency of core with group quant
272、ization 95.4%energy saving of quantization unitConclusionsV.Conclusion418.4mJ/Inf Text-to-Image&506.5mJ/Inf.Image-to-Image Generation for Mobile Applications23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE Internationa
273、l Solid-State Circuits Conference1 of 39Nebula:A 28nm 109.8 TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise AggregationChangchun Zhou1,Tianling Huang1,Yanzhe Ma1,Yuzhe Fu1,Xiangjie Song1,Siyuan Qiu1,Jiacong Sun1,Min Liu1,Ge Li1,Yifan He2,Yuchao Yang1,3,Hailong Ji
274、ao11Shenzhen Graduate School,Peking University,Shenzhen,China2Reconova Technologies,Xiamen,China3Peking University,Beijing,China 23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Confe
275、rence2 of 39Outline Background Design Challenges of 3D PNN Acceleration Key FeaturesTree-based adaptive partitioning sampling(APS)Sampling-based multi-skipping strategy(SMS)Pipelined block-wise delayed-aggregation(BDA)Measurement Results Conclusion23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Fe
276、aturing Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference3 of 39Outline Background Design Challenges of 3D PNN Acceleration Key FeaturesTree-based adaptive partitioning sampling(APS)Sampling-based multi-skipping strategy(SMS)Pipeline
277、d block-wise delayed-aggregation(BDA)Measurement Results Conclusion23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference4 of 39Promising Trend of 3D Point CloudSemantic Segmenta
278、tionAutonomous DrivingRoboticsDrone Navigation3D GeometryMixed Reality23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference5 of 39Promising Trend of 3D Point CloudSemantic Segme
279、ntationAutonomous DrivingRoboticsDrone Navigation3D GeometryMixed RealityEasy acquisitionHigh resolutionEfficient data format23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conferenc
e 6 of 39 3D Point Cloud Neural Network(PNN): Farthest Point Sampling(FPS) is commonly used to compress and enhance representation ability (a reference sketch follows below). Solves issues: Disordered, Discrete. A mapping layer: Sampling, Grouping. Consecutive MLPs: 1×1 kernel. An aggregation layer: Max-pooling. 23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelera
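For reference, the sketch below is the standard farthest point sampling loop (plain NumPy, my variable names); the data-dependent argmax in every iteration is what makes FPS sequential and roughly O(N^2) when the sample count scales with N, which is Challenge 1 on the following slides.

import numpy as np

def farthest_point_sampling(points, m):
    # points: [N, 3]; returns indices of m retained points (RPs).
    n = len(points)
    selected = np.empty(m, dtype=int)
    dist = np.full(n, np.inf)        # distance to the nearest selected point so far
    cur = 0                          # arbitrary starting point
    for i in range(m):
        selected[i] = cur
        d = np.sum((points - points[cur]) ** 2, axis=1)
        dist = np.minimum(dist, d)
        cur = int(np.argmax(dist))   # next RP = farthest from the selected set
    return selected

pts = np.random.default_rng(0).random((1024, 3))
rps = farthest_point_sampling(pts, 256)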
281、tor Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference7 of 39Outline Background Design Challenges of 3D PNN Acceleration Key FeaturesTree-based adaptive partitioning sampling(APS)Sampling-based multi-skipping strategy(SMS)Pi
282、pelined block-wise delayed-aggregation(BDA)Measurement Results Conclusion23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference8 of 39 Sequential operation#Operation N2N:The numb
283、er of points#Cycle N2Challenge 1:High Latency Caused by FPS23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference9 of 39 Sequential operation#Operation N2N:The number of points#C
284、ycle N2 Uniform partitioning NOT suitableChallenge 1:High Latency Caused by FPS23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference10 of 39 RPs in samplingRPs:Retained points i
285、n sampling DPs in grouping,convolution,and aggregationDPs:Dropout points in samplingChallenge 2:Significant Redundant Operations23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Confer
286、ence11 of 39 Irregular EMARandom gather/scatterEMA:External memory accessChallenge 3:Frequent Irregular Memory Access23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference12 of 3
287、9 Irregular EMARandom gather/scatter#Memory access increases#Data Channel2#Iteration#Data Challenge 3:Frequent Irregular Memory Access23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits
288、Conference13 of 39Overall Architecture of Nebula 16-core sampling 8-core grouping 16162 PE array 8-core aggregation23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference14 of 39O
289、utline Background Design Challenges of 3D PNN Acceleration Key FeaturesTree-based adaptive partitioning sampling(APS)Sampling-based multi-skipping strategy(SMS)Pipelined block-wise delayed-aggregation(BDA)Measurement Results Conclusion23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adapt
290、ive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference15 of 39Feature 1:Tree-based Adaptive Partitioning Sampling Goal:Reduce and balance#Point Adaptive partitioningHierarchically partition by octreeBlock size sparsityMaintain global feature23
291、.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference16 of 39Feature 1:Tree-based Adaptive Partitioning Sampling13 S.Kim,et al.,Symp.on VLSI,2021.19 C.Zhou et al.,ICCAD,2023.Goal:
292、Reduce and balance#Point Adaptive partitioningHierarchically partition by octreeBlock size sparsityMaintain global feature23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference17
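A minimal sketch of the octree-style adaptive partition (my own parameters; the chip's APS additionally balances per-block point counts across its parallel sampling cores): recursively split the bounding box into octants and stop once a block is small enough, so block size follows local sparsity and the sampler can then run per block instead of over the full cloud.

import numpy as np

def octree_partition(points, max_points=256, depth=0, max_depth=8):
    # Returns a list of point blocks, each with at most `max_points` points
    # (unless max_depth is reached), so dense regions get finer blocks.
    if len(points) <= max_points or depth == max_depth:
        return [points]
    center = (points.min(axis=0) + points.max(axis=0)) / 2.0
    octant = ((points > center) * np.array([1, 2, 4])).sum(axis=1)   # 0..7
    blocks = []
    for o in range(8):
        sub = points[octant == o]
        if len(sub):
            blocks.extend(octree_partition(sub, max_points, depth + 1, max_depth))
    return blocks

pts = np.random.default_rng(0).random((20000, 3))
blocks = octree_partition(pts)
assert sum(len(b) for b in blocks) == len(pts)       # every point lands in exactly one block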
293、 of 39 Iteratively partitionRoot node to leaves#Point Skip grouping of DPsDelayed-aggregation-Skip convolution of DPsGrouping guidance-Skip aggregation of DPs23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE Internation
294、al Solid-State Circuits Conference22 of 39Feature 2:Sampling-Based Multi-Skipping Strategy Reversely output RPs for all sampling layers during the first sampling layer23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE In
295、ternational Solid-State Circuits Conference23 of 39Feature 2:Sampling-Based Multi-Skipping Strategy Skip neighbor point search for DPs Skip grouping centered on some RPs23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE
296、International Solid-State Circuits Conference24 of 39Feature 2:Sampling-Based Multi-Skipping Strategy Similar pattern with groupingNeighbor search for DPsAggregation centered on RPs23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregatio
297、n 2025 IEEE International Solid-State Circuits Conference25 of 39Feature 2:Sampling-Based Multi-Skipping Strategy23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference26 of 39 Ab
out 60% computation reduction: 60.1% for PointNeXt-Cls (ModelNet40) and 58.2% for PointNeXt-Seg (S3DIS), with 0.5% area overhead. Feature 2:Sampling-Based Multi-Skipping Strategy 23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International S
299、olid-State Circuits Conference27 of 39Outline Background Design Challenges of 3D PNN Acceleration Key FeaturesTree-based adaptive partitioning sampling(APS)Sampling-based multi-skipping strategy(SMS)Pipelined block-wise delayed-aggregation(BDA)Measurement Results Conclusion23.4:Nebula:A 28nm 109.8TO
300、PS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference28 of 39Feature 3:Pipelined Block-wise Delayed-Aggregation Goal:Address EMA*bottleneckIrregularity and iterative load Block-wise delay-aggregationDela
301、y aggregation after convolutionPartition&Aggregate within each blockLoad data only once&block as access unit*EMA:External memory access23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits
302、 Conference29 of 39Feature 3:Pipelined Block-wise Delayed-Aggregation Goal:Address EMA*bottleneckIrregularity and iterative load Block-wise delay-aggregation Parallel&pipeline shorten latencyGrouping and convolution in parallelAggregation in pipeline23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator
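The sketch below captures the delayed-aggregation idea in NumPy (my own data layout and a stand-in 1×1 MLP; the NIT here holds block-local neighbor indices): each block is loaded once, the convolution runs over the whole block first, and the NIT-guided channel-wise max happens afterwards on-chip, so there is no per-neighbor random gather from external memory.

import numpy as np

def block_delayed_aggregation(blocks, nit_per_block, mlp):
    # blocks: list of [points_in_block, C_in] feature tiles (block = EMA access unit)
    # nit_per_block: list of [centers, k] neighbor index tables, indices local to the block
    out = []
    for feats, nit in zip(blocks, nit_per_block):
        f = mlp(feats)                     # convolution first (aggregation delayed)
        out.append(f[nit].max(axis=1))     # NIT-guided channel-wise max per center
    return np.concatenate(out, axis=0)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))
mlp = lambda f: np.maximum(f @ W, 0.0)                      # stand-in 1x1 MLP + ReLU
blocks = [rng.normal(size=(n, 16)) for n in (100, 80)]
nit = [rng.integers(0, n, size=(n // 4, 8)) for n in (100, 80)]
agg = block_delayed_aggregation(blocks, nit, mlp)           # shape (45, 32)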
303、Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference30 of 39Feature 3:Pipelined Block-wise Delayed-Aggregation Load data to chipBlock-level access Fetch NIT and featureNIT:Neighbor Index Table Perform aggregationNIT guidedChan
nel-wise max. On-chip reuse of Feature and NIT.
23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference 31 of 39
Feature 3:Pipelined Block-wise Delayed-Aggregation: External memory access reduced by 99% for PointNeXt-Cls and 95% for PointNeXt-Seg; latency shortened by 51.3% for PointNeXt-Cls and 55% for PointNeXt-Seg.
23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference 32 of 39
Outline Background Desi
306、gn Challenges of 3D PNN Acceleration Key FeaturesTree-based adaptive partitioning sampling(APS)Sampling-based multi-skipping strategy(SMS)Pipelined block-wise delayed-aggregation(BDA)Measurement Results Conclusion23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Sk
307、ipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference33 of 39Chip Summary23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference34 of 39Voltage
308、-Frequency Scaling Frequency from 20 MHz to 200 MHz 0.57 V-0.94 V Highest energy efficiency109.8TOPS/W 0.57V,20 MHz,PointNeXt-Cls(8b)23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits C
onference35 of 39 Performance Improvements: Up to 17.3× speedup and 10.7× energy savings. Baseline: all three techniques (APS, SMS, BDA) disabled. 23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circui
310、ts Conference36 of 39MICRO21 VLSI21VLSI22 ISSCC22 ISSCC23CICC24Technology28 nm65 nm65 nm28 nm28 nm28 nmArea mm26.8(Core)16(Die)4.1(Die)13(Die)2.7(Core)13(Die)ModelNet40S3DIS52631),526.32)275.41),Area Efficiency TOPS/mm20.270.110.14N/A0.0720.115.61),0.562)3.91),0.392)Energy Efficiency TOPS/W0.8611.91
311、3.6194.138.548.11),109.82)261),59.42)16.944.2Frame Rate Frame/s687.284.47.2129.9This Work28 nm1.5 (Core)Comparisons with the SOTA Point Cloud Accelerators1)200 MHz.2)20 MHz.23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 I
312、EEE International Solid-State Circuits Conference37 of 39Outline Background Design Challenges of 3D PNN Acceleration Key FeaturesTree-based adaptive partitioning sampling(APS)Sampling-based multi-skipping strategy(SMS)Pipelined block-wise delayed-aggregation(BDA)Measurement Results Conclusion23.4:Ne
313、bula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive Partition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference38 of 39Conclusion A PNN accelerator for 3D point cloud analysisTree-based adaptive partitioning sampling technique Balance partiti
314、oning and maintain accuracySampling-based multi-skipping strategy Skip redundant operations and computation cyclesPipelined block-wise delayed-aggregation method Eliminate irregularity&iterative load of EMA and improve parallelism23.4:Nebula:A 28nm 109.8TOPS/W 3D PNN Accelerator Featuring Adaptive P
315、artition,Multi-Skipping,and Block-Wise Aggregation 2025 IEEE International Solid-State Circuits Conference39 of 39Conclusion A PNN accelerator for 3D point cloud analysisTree-based adaptive partitioning sampling technique Balance partitioning and maintain accuracySampling-based multi-skipping strate
gy Skip redundant operations and computation cyclesPipelined block-wise delayed-aggregation method Eliminate irregularity&iterative load of EMA and improve parallelism109.8 TOPS/W,5.6 TOPS/mm2,and 5.3k fps23.5.MAE:A 3nm 0.168mm 576MAC Mini AutoEncoder with Line-
317、Based Depth-First Scheduling for Generative AI in Vision on Edge Devices 2025 IEEE International Solid-State Circuits Conference1 of 39MAE:A 3nm 0.168mm 576MAC Mini AutoEncoderwith Line-Based Depth-First Scheduling for Generative AI in Vision on Edge DevicesShih-Wei Hsieh,Chia-Hung Yuan,Ming-Hung Li
318、n,Ping-Yuan Tsai,You-Yu Nian,Chia-Yuan Cheng,Hung-Wei Chih,Po-Han Chiang,Ming-Hsuan Chiang,Yuan-Jung Kuo,Yu-Wei Wu,Yi-Syuan Chen,Po-Heng Chen,Sandy Huang,Ming-En Shih,Chia-Ping Chen,Abrams Chen,ShenKai Chang,Chih-Ming Wang,Po-Yu Yeh,Jett Liu,Yung-Chang Chang,Chung-Yi Chen,Chi-Cheng Ju,CH Wang,Yucheu
319、n Kevin JouMediaTek,HsinChu,Taiwan23.5.MAE:A 3nm 0.168mm 576MAC Mini AutoEncoder with Line-Based Depth-First Scheduling for Generative AI in Vision on Edge Devices 2025 IEEE International Solid-State Circuits Conference2 of 39Outline Introduction Overall Architecture of Mini AutoEncoder(MAE)Key Feat
320、ures Line-based Depth-First Layer fusion w/Branch HandlingLocal H-Reuse and Direct A/C Unit LinkageMulti-objective Scheduler w/Optimized Iterative Search Measurement Results Summary23.5.MAE:A 3nm 0.168mm 576MAC Mini AutoEncoder with Line-Based Depth-First Scheduling for Generative AI in Vision on Ed
321、ge Devices 2025 IEEE International Solid-State Circuits Conference3 of 39Outline Introduction Overall Architecture of Mini AutoEncoder(MAE)Key Features Line-based Depth-First Layer fusion w/Branch HandlingLocal H-Reuse and Direct A/C Unit LinkageMulti-objective Scheduler w/Optimized Iterative Search
322、 Measurement Results Summary23.5.MAE:A 3nm 0.168mm 576MAC Mini AutoEncoder with Line-Based Depth-First Scheduling for Generative AI in Vision on Edge Devices 2025 IEEE International Solid-State Circuits Conference4 of 39Generative AI in Vision Diffusion models offer state-of-the-art image generation
323、 quality High computational demands pose challenges for edge devices“The scene of an academic conference with a signboard with the word ISSCC printed on it”Image restorationText-to-imageScene synthesis*R.Rombach et al.,”High-Resolution Image Synthesis with Latent Diffusion Models”,CVPR,2022.23.5.MAE
324、:A 3nm 0.168mm 576MAC Mini AutoEncoder with Line-Based Depth-First Scheduling for Generative AI in Vision on Edge Devices 2025 IEEE International Solid-State Circuits Conference5 of 39Reduced Complexity of Diffusion Models Latent Diffusion Models(LDM)Apply diffusion in autoencoder-compressed latent
325、space to reduce complexityFurther reduced by replacing VAE with Tiny AE for Stable Diffusion(TAESD)DecoderDenoising-UNetN-iter.EncoderConditioning Module(opt.)(Task-agnostic)(Task-specific)(Task-agnostic)PixelspaceLatentspaceLatentspacePixelspaceLDM architectureModel ComplexityModelGMACDenoising-Une
326、t(20 Steps)805VQ-VAE(enc)338VQ-VAE(dec)607TAESD(enc)47.5TAESD(dec)55.0-86%91%*Resolution:512x512 64x64 512x51223.5.MAE:A 3nm 0.168mm 576MAC Mini AutoEncoder with Line-Based Depth-First Scheduling for Generative AI in Vision on Edge Devices 2025 IEEE International Solid-State Circuits Conference6 of
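A quick back-of-the-envelope check of the two reduction figures quoted above (GMAC values taken from the table; the rounding is mine):

vqvae = {"enc": 338.0, "dec": 607.0}     # GMAC, VQ-VAE at 512x512
taesd = {"enc": 47.5,  "dec": 55.0}      # GMAC, TAESD at 512x512
for part in ("enc", "dec"):
    saving = 1.0 - taesd[part] / vqvae[part]
    print(f"{part}: {saving:.0%} fewer GMACs with TAESD")   # enc ~86%, dec ~91%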
 39
Challenge of deploying LDM on Edge: GAI models add to the already diverse existing AI applications. Despite the reduced complexity of LDM and TAESD, they still require significant computational resources.
[Chart "Diverse Complexity of AI Models", GMAC on a 0.1-1000 log scale: MobileNetV1 and ResNet18 (Classification), YOLOv7 (Detection), FSRCNN 4x at 4K (SR), TAESD and Denoise-U 20 steps (Gen AI)]
23.5.MAE:A 3nm 0.168mm 576MAC Mini AutoEncoder with Line-Based Depth-First Scheduling for Generative AI in Vision on Edge Devices 2025 IEEE International Solid-State Circuits Conference 7 of 39
Complementing General NPU with Tiny NPU: General NPU w
329、/large peak TOPS:High computational power,suitable for low latency tasksHigher fixed power and less spatial utilization for varying layers Specialized tiny NPU w/small peak TOPS:Efficient for smaller models in power-sensitive scenariosUses lower power and offers better spatial utilization+Large Peak
330、 TOPSLess LatencyHigher PowerGeneral NPUGood Urate Across WorkloadsLow PowerSmaller Throughput Tiny NPU23.5.MAE:A 3nm 0.168mm 576MAC Mini AutoEncoder with Line-Based Depth-First Scheduling for Generative AI in Vision on Edge Devices 2025 IEEE International Solid-State Circuits Conference8 of 39Heter
331、ogeneous NPU system Leveraging the strength of both cores:Achieving overall power and latency reduction compared to a single coreDRAMDecoderDenoising-UNetN-iter.EncoderTiny NPUGeneral NPUDRAMDecoderDenoising-UNetN-iter.EncoderEncDenoise-UDec EncDenoise-UDec EncEncDenoise-UDecEncDenoise-UDecEncDenois
332、e-UDecUse tiny APU to run TAESD more efficientlySingle CoreHeterogeneous Cores23.5.MAE:A 3nm 0.168mm 576MAC Mini AutoEncoder with Line-Based Depth-First Scheduling for Generative AI in Vision on Edge Devices 2025 IEEE International Solid-State Circuits Conference9 of 39Designing Tiny NPU for Edge De
333、vices Must conform to a limited area budget:Limits the number of MAC and the amount of on-chip memoryISSCC20 15VLSI21 12ISSCC21 13ISSCC21 11ISSCC23 10ISSCC23 9ISSCC23 8ISSCC24 7TCAS2114This work0123456789100512102415362048256030723584#of MAC(K)On-Chip Memory(KB)Tech:3nm#of MACSRAM(KB)Area(mm2)ISSCC24*7633610721.37This Work5762560.168*M.E.Shih et al.,“NVE:A 3nm 23.2 TOPS/W 12b-Digital-CIM-Based Neu