High-Parallel In-Memory NTT Engine with Hierarchical Structure and Even-Odd Data Mapping

Bing Li (1), Huaijun Liu (2), Yibo Du (3,4), Ying Wang (3,4)
(1) Institute of Microelectronics, Chinese Academy of Sciences
(2) Capital Normal University
(3) Institute of Computing Technology, Chinese Academy of Sciences
(4) University of Chinese Academy of Sciences

Outline
  Background and Motivation
  Proposed Method: Overview, Architecture & Data Mapping
  Evaluation and Results
  Conclusion

Fully Homomorphic Encryption
  Applications: Medical Treatment, Cloud Computing, Machine Learning, Fitness App
  FHE review: Viand A, et al., S&P 2021
  Data Security
  Powerful Functionality
  High Computational Overhead

Classic NTT
  Challenges & Advantages
  [Figure: 8-point butterfly network with three stages (Stage 1, Stage 2, Stage 3); inputs a0, a4, a2, a6, a1, a5, a3, a7 and outputs A0 to A7; twiddle-factor labels not recoverable]

  Algorithm: In-Place Cooley-Tukey-based NTT
  Input: a = (a_{n-1}, ..., a_0) ∈ R, n-th root of unity in [...] with bit-reversed order
  Output: A = NTT(a) in bit-reversed order
  1: [...] = [...]
  2: for ([...] = 1; [...]; [...] = 2[...]) do
  3:   [...] = [...]/2
  4:   for ([...] = 0; [...]

Mod Algorithm Optimization
  Adapt the original Barrett algorithm to an efficient implementation on CIM.

  Original Barrett algorithm
    [...] n-1
    3. t2 = t1 × mu
    4. t3 = t2 >> (n+1)
    5. r1 = c % 2^(n+1)
    6. r2 = (t3 × q) % 2^(n+1)
    7. r = r1 - r2
    Condition: r [...] q/2 ? (r - q) : r
    Return r

  Implementing in CIM (after optimization)
    Calculation: r = c mod q (q: n bit)
    1. x = c >> (n-1);
    2. a = x [...]
    [...] q/2 ? (r - q) : r
    Return r

Mod Algorithm Optimization
  [Figure: bit-level example of the optimized reduction. (a) Shift in CIM: right shift and left shift. (b) Subtraction in CIM. Operand values not recoverable.]
  Low Latency
  Low Energy

Mod Module-Data Mapping
  [Figure: two MOD PEs. SubArray0 to SubArray255, each with its own sense amplifier, share a WL decoder & driver and a read/write & comparator block. Each operand is stored as an MSB half and an LSB half (A0,msb / A0,lsb through A255,msb / A255,lsb) alongside q_msb / q_lsb, producing Result A0 through Result A255.]
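The Cooley-Tukey and Barrett listings in these slides lost most of their operators in extraction, so a minimal Python sketch of the textbook versions may help. It is not the authors' CIM implementation. The function names (`bit_reverse`, `barrett_reduce`, `find_root_of_unity`, `ntt_inplace`), the example modulus 12289, and the choice of a cyclic (not negacyclic) transform are all assumptions made for illustration. The loop skeleton (`t = n`, then `m = 1, 2, 4, ...` with `t` halved each stage) follows the usual in-place formulation with twiddles stored in bit-reversed order and output produced in bit-reversed order.

```python
def bit_reverse(x: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def barrett_reduce(c: int, q: int, n: int, mu: int) -> int:
    """Return c mod q using only shifts, multiplies, and subtractions.

    Assumes q has exactly n bits, 0 <= c < 2**(2*n), and
    mu = 2**(2*n) // q has been precomputed.
    """
    # Assumption: n+2 low bits are kept (the slide keeps n+1). The wider
    # mask is a conservative choice so the sketch holds for every n-bit q.
    mask = (1 << (n + 2)) - 1
    t1 = c >> (n - 1)
    t2 = t1 * mu
    t3 = t2 >> (n + 1)      # quotient estimate, never larger than c // q
    r1 = c & mask
    r2 = (t3 * q) & mask
    r = (r1 - r2) & mask    # equals c - t3*q, which is small and non-negative
    while r >= q:           # a few correction subtractions at most
        r -= q
    return r


def find_root_of_unity(n: int, q: int) -> int:
    """Find a primitive n-th root of unity mod prime q (n a power of two, n | q-1)."""
    for x in range(2, q):
        w = pow(x, (q - 1) // n, q)
        if pow(w, n // 2, q) == q - 1:   # w^(n/2) = -1, so w has order exactly n
            return w
    raise ValueError("no primitive root of unity found")


def ntt_inplace(a: list, q: int, w: int) -> list:
    """Cyclic NTT of `a` in place; the result is in bit-reversed order."""
    n = len(a)
    log_n = n.bit_length() - 1
    q_bits = q.bit_length()
    mu = (1 << (2 * q_bits)) // q
    # Powers of w stored in bit-reversed order, one per butterfly group.
    tw = [pow(w, bit_reverse(i, log_n - 1), q) for i in range(n // 2)]
    t, m = n, 1
    while m < n:
        t //= 2
        for i in range(m):
            s = tw[i]
            j1 = 2 * i * t
            for j in range(j1, j1 + t):
                u = a[j]
                v = barrett_reduce(a[j + t] * s, q, q_bits, mu)
                a[j] = (u + v) % q       # hardware would use a conditional subtract
                a[j + t] = (u - v) % q
        m *= 2
    return a


if __name__ == "__main__":
    q, n = 12289, 8
    w = find_root_of_unity(n, q)
    print(ntt_inplace(list(range(1, n + 1)), q, w))
```

Reading `out[bit_reverse(k, log2(n))]` recovers the k-th coefficient of the transform in natural order. The `% q` on the additions stands in for the conditional subtraction a hardware datapath would normally use.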
Mod Module-Computation
  [Figure: the MOD PE layout repeated from the data-mapping slide (SubArray0 to SubArray255, sense amplifiers, WL decoder & driver, read/write & comparator, operands A0 to A255 with q_msb / q_lsb)]

Evaluation Setup

  Design                Platform   Algorithm      NTT Parameters (n, log2 q)
  HP-CIM (Ours)         6T SRAM    MVM            (32K, 32)
  BP-NTT                6T SRAM    CT Butterfly   (1024, 16)
  MeNTT                 6T SRAM    CT Butterfly   (32K, 32)
  RM-NTT                ReRAM      MVM            (1024, 16)
  CryptoPIM (Baseline)  ReRAM      Butterfly      (32K, 32)

  HP-CIM Settings
  MVM Module   16 PEs, 32 KB/PE   256 SubArrays/PE, 64×16 SubArray
  MOD Module   2 PEs, 8 KB/PE     128 SubArrays/PE, 8×64 SubArray

Evaluation Result
  [Figure: normalized latency (%) of HP-CIM, RM-NTT, MeNTT, BP-NTT, and CryptoPIM at (n, log2 q) = (256, 14), (512, 16), and (1024, 16)]
  HP-CIM achieves a latency reduction of up to 3.08× compared to the fastest existing CIM-based NTT accelerator, RM-NTT.

Evaluation Result
  [Figure: normalized energy (%) of HP-CIM, RM-NTT, MeNTT, BP-NTT, and CryptoPIM at (n, log2 q) = (256, 14), (512, 16), and (1024, 16)]
  HP-CIM provides significant energy savings of up to 4.96× over the most energy-efficient prior solution, MeNTT.

Evaluation Result
  Under large-scale NTT parameter settings, HP-CIM outperforms other designs in terms of latency and energy.
  [Figure: latency (µs) and energy (µJ) of HP-CIM, MeNTT, and CryptoPIM at n = 32K, log2 q = 32; annotated ratios 3.7×, 13×, 1.7×, and 20×]

Evaluation Result
  [Figure: execution time (µs) versus polynomial order n = 256, 512, 1024 for CPU_16b, Ours_16b, CPU_32b, and Ours_32b]
  HP-CIM reduces execution time by over 2.4× compared to CPU.

Conclusion
  1. High Parallelism with Hierarchical SRAM Architecture
     Introduced a digital SRAM-based CIM NTT engine, utilizing a hierarchical structure to achieve high parallelism and scalability for large-scale NTT operations.
  2. Novel Even-Odd Data Mapping Strategy
     Proposed an even-odd data mapping approach to optimize memory utilization, enabling efficient reuse of intermediate computation results for better scalability.
  3. Integrated Mod Computation within CIM Arrays
     Developed efficient mod operations directly within CIM arrays using SRAM read-write capabilities, eliminating the need for extra peripheral circuits and enhancing area and energy efficiency.
  4. Significant Performance and Energy Improvements
     Achieved up to 3.08× faster execution and 4.96× energy savings compared to prior CIM-based designs, validated through extensive comparisons with state-of-the-art methods.
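The slides name an MVM-based NTT and an even-odd data mapping but do not spell out the mapping itself. As a hedged illustration only, the sketch below shows the standard decimation-in-time identity that an even/odd split makes available in a matrix-vector formulation: the even-indexed and odd-indexed coefficients are each multiplied by the same half-size twiddle matrix, and the two intermediate results are reused to form both halves of the output. The helper names (`twiddle_matrix`, `mvm_mod`, `ntt_even_odd`) and the toy parameters q = 17, w = 2, n = 8 are assumptions, and nothing here should be read as the authors' actual hardware mapping.

```python
def twiddle_matrix(n: int, w: int, q: int) -> list:
    """n x n matrix with entry (i, j) = w^(i*j) mod q."""
    return [[pow(w, i * j, q) for j in range(n)] for i in range(n)]


def mvm_mod(mat: list, vec: list, q: int) -> list:
    """Matrix-vector product with every output reduced mod q."""
    return [sum(m * v for m, v in zip(row, vec)) % q for row in mat]


def ntt_even_odd(a: list, w: int, q: int) -> list:
    """Cyclic NTT of `a` (even length) built from two half-size MVMs."""
    n = len(a)
    half = n // 2
    # One half-size matrix, built from w^2, serves both halves of the input.
    w_half = twiddle_matrix(half, (w * w) % q, q)
    even = mvm_mod(w_half, a[0::2], q)   # transform of a0, a2, a4, ...
    odd = mvm_mod(w_half, a[1::2], q)    # transform of a1, a3, a5, ...
    out = [0] * n
    for k in range(half):
        t = (pow(w, k, q) * odd[k]) % q
        out[k] = (even[k] + t) % q           # A[k]       = E[k] + w^k * O[k]
        out[k + half] = (even[k] - t) % q    # A[k + n/2] = E[k] - w^k * O[k]
    return out


if __name__ == "__main__":
    # Toy parameters: 2 is a primitive 8th root of unity mod 17.
    q, w = 17, 2
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    full = mvm_mod(twiddle_matrix(len(a), w, q), a, q)
    print(ntt_even_odd(a, w, q) == full)
```

The half-size matrix holds a quarter of the entries of the full twiddle matrix and is applied twice, which is one plausible sense in which an even-odd layout can cut storage and reuse intermediate results.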
THANK YOU

Reference
[1] Gentry C. Fully homomorphic encryption using ideal lattices [C]. Proceedings of the forty-first annual ACM symposium on Theory of computing, Bethesda, Maryland, 2009: 169-178.
[2] Fan J, Vercauteren F. Somewhat Practical Fully Homomorphic Encryption [J]. IACR Cryptology ePrint Archive, 2012, 2012(2012): 144-162.
[3] Kim S, Kim J, Kim M J, et al. Bts: An accelerator for bootstrappable fully homomorphic encryption [C]. Proceedings of the 49th Annual International Symposium on Computer Architecture, New York, 2022: 711-725.
[4] Samardzic N, Feldmann A, Krastev A, et al. F1: A fast and programmable accelerator for fully homomorphic encryption [C]. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Greece, 2021: 238-252.
[5] He Y, Qu S, Lin G, et al. Processing-in-SRAM acceleration for ultra-low power visual 3D perception [C]. Proceedings of the 59th ACM/IEEE Design Automation Conference, San Francisco, California, 2022: 295-300.
[6] Li D, Pakala A, Yang K. MeNTT: A compact and efficient processing-in-memory number theoretic transform (NTT) accelerator [J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2022, 30(5): 579-588.
[7] Albrecht M, Chase M, Chen H, et al. Homomorphic encryption standard [J]. Protecting privacy through homomorphic encryption, 2021: 31-62.