《用于全同態加密的高效密鑰交換加速器.pdf》由會員分享,可在線閱讀,更多相關《用于全同態加密的高效密鑰交換加速器.pdf(35頁珍藏版)》請在三個皮匠報告上搜索。
1、Efficient Key Switching Accelerator for Fully Homomorphic EncryptionSeoyoon Jang,Sungjin Park,Dongsuk JeonSeoul National UniversitySeoul,South KoreaASP-DAC 2025Motivation Advent of FHE Fully Homomorphic Encryption(FHE)The savior of privacy-preserving computation in cloud service The main bottleneck
2、operation,Key-Switching(KS)BottleneckMotivation What makes KS expensive?Expensive Number Theoretic Transform(NTT)and inverse-NTT operations in KS =()=01 At least log computations Irregular memory access pattern 0 1 2 3 4 5 6 701234567Stage 1Stage 2Stage 3Motivation What makes KS expensive?Expensive
3、Number Theoretic Transform(NTT)and inverse-NTT operations in KS Frequent transitions in data access patterns(Element/Ring/Coefficient-wise)Element-wiseMotivation What makes KS expensive?Element-wiseData access within each polynomial(NTT/INTT,)Ring-wise Expensive Number Theoretic Transform(NTT)and in
4、verse-NTT operations in KS Frequent transitions in data access patterns(Element/Ring/Coefficient-wise)Motivation What makes KS expensive?Ring-wiseElement-wiseRing-wiseRing-wiseCoefficient-wise Expensive Number Theoretic Transform(NTT)and inverse-NTT operations in KS Frequent transitions in data acce
5、ss patterns(Element/Ring/Coefficient-wise)Motivation What makes KS expensive?Expensive Number Theoretic Transform(NTT)and inverse-NTT operations in KS Frequent transitions in data access patterns(Element/Ring/Coefficient-wise)Overall KS dataflowMotivation What makes KS expensive?Expensive Number The
6、oretic Transform(NTT)and inverse-NTT operations in KS Frequent transitions in data access patterns(Element/Ring/Coefficient-wise)Overall KS dataflowRing-wiseRing-wiseCoefficient-wiseElement-wise Redundant external memory access Motivation What makes KS expensive?Expensive Number Theoretic Transform(
7、NTT)and inverse-NTT operations in KS Frequent transitions in data access patterns(Element/Ring/Coefficient-wise)Overall KS dataflowRing-wiseRing-wiseCoefficient-wiseElement-wise Redundant external memory access Design a dedicated KS accelerator for maximum energy efficiencyParameter selected for KS
8、acceleratorlog,=17,35,9,4:Number of coefficients for each polynomial:Maximum circuit depth level:decomposition number=+1/1 computations,required swk Constraints for parameters selectionSecurity level 128Directly related with/log large enough to guarantee=1520 for bootstrapping that enables FHE provi
9、ding a sufficient number of lightweight prime moduli for large =2 20 21 22+1sparse!enable efficient HW implementation of modular multipliersOverall Design of KS accelerator Router transferring instructions and external data to the target core LUT for moduli set and modulus-related constants required
10、 by each coreOverall Design of KS accelerator NTT unit NTT/iNTT operation(unified)Process=29 coefficients per each unit(=)Modular-Multiply-and-Accumulate(MMAC)unit Conv operation Local distributor:internal routerMMAC unitNTT unitNTT unitProposed Design Techniques for Energy EfficiencyMMMMI.Modular M
11、ultiplier for Sparse Moduli SetII.NTT UnitA.Efficient Twiddle Factor Generator(TFG)B.Conflict-free Addressing Scheme for Single-port MemoryIII.Bandwidth-efficient Behavior in CoreI.Modular Multiplier for Sparse Moduli Set=2 20 21 22+1 and in Barrett modular multiplication replaced with shift-adders
12、=22/=2 20 21 22 1Benefits of sparsity(2 1+1,)I.Modular Multiplier for Sparse Moduli Set=2 20 21 22+1 and in Barrett modular multiplication replaced with shift-adders =22/=2 20 21 22 1Benefits of sparsity(2 0+1,)41 moduli available Sufficient for bootstrapping in FHEUsing =59Take the most advantage o
13、ut of this inherent sparsity!I.Modular Multiplier for Sparse Moduli Set =22(0+1)1+1 2+1 1=2 20 21 22+1Simplified computation shift-adding removing one multiplication=22/=2 20 21 22 12=221 20 21 22 1()I.Modular Multiplier for Sparse Moduli Set=2 20 21 22+1Simplified computation shift-adding removing
14、one multiplication(V*T)=(+1)s sparsity=22/=2 20 21 22 1 1 1 1 from sign bit of 20 in I.Modular Multiplier for Sparse Moduli Set=2 20 21 22+1Simplified computation shift-adding removing one multiplication=22/=2 20 21 22 1Area(Power)vs.Non-Sparse:47.8(46.0)%vs.Sparse 1:24.6(22.5)%1 Kim et al.,“Fpga-ba
15、sed accelerators of fully pipelined modular multipliers for homomorphic encryption,”ReConFig,2019.()=01 II-A.Efficient Twiddle Factor Generator(TFG)twiddle factors required for each NTT/iNTT on coefficients Too much overhead in data loading latency and memory area!Twiddle Factor Generator Saving mem
16、ory area for the twiddle factors Generate twiddle factors during NTT/iNTT operationsII-A.Efficient Twiddle Factor Generator(TFG)TFG saving memory area for the twiddle factors from log Generate twiddle factors using geometric progression with log seed elementsII-A.Efficient Twiddle Factor Generator(T
17、FG)Additional pre-processing stage for further reduction of seed elements(1)Pre-processing(before NTT/iNTT starts)Input:Seed elements Output:Secondary seed elements(2,3)Geometric progression(run-time)Input:Secondary seed elements Output:Twiddle factorsMM:modular multiplierII-A.Efficient Twiddle Fact
18、or Generator(TFG)Reduction of twiddle factor memory area log Additional pre-processing stage for further reduction on seed elements 2 Kim et al.,“Ark:Fully homomorphic encryption accelerator with run-time data generation and inter-operation key reuse,”MICRO,2022.6 Geelen et al.,”Basalisc:Flexible as
19、ynchronous hardware accelerator for fully homomorphic encryption,”preprint,arXiv,2022.17 Kim et al.,“Hardware architecture of a number theoretic transform for a bootstrappable rns-based homomorphic encryption scheme,”FCCM,2020.II-B.Conflict-free Addressing(CFA)Scheme for Single-port MemoryDual-port/
20、2/2Single-portMemoryNetwork of butterfly units for NTT UnitPotential conflicts:Read-after-writeMemory access(multiple banks)Write tNumber of cycles for pipelining()Read(+)=16Write result of Read tWrite tRead tConflictII-B.Conflict-free Addressing(CFA)Scheme for Single-port MemoryDual-port/2/2Single-
21、portMemoryNetwork of butterfly units for NTT UnitWrite tNumber of cycles for pipelining()Read(+)=16Write result of Read tWrite tRead tNo throughput degradation,Reduce silicon area&powerCFAAccess by(,)01 0,/2)II-B.Conflict-free Addressing(CFA)Scheme for Single-port MemoryGoal:No conflict!+16=1,Networ
22、k pipelining stage()Dual-port/2/2Single-portCFAAccess by(,)01 0,/2)II-B.Conflict-free Addressing(CFA)Scheme for Single-port MemoryGoal:No conflict!+16=1,Network pipelining stage()3:2 3:2 +16=3:2 (3:2()+1)=1Then,should satisfy:However,this is not satisfied for all t II-B.Conflict-free Addressing(CFA)
23、Scheme for Single-port MemoryGoal:No conflict!+16=1,Network pipelining stage()3:2 3:2 +16=3:2 (3:2()+1)=1 3:2 3:2 +16=3:2 (3:2()+1)=1Applying Bit-reverse+GrayNow,this is satisfied for all t!II-B.Conflict-free Addressing(CFA)Scheme for Single-port Memory +16=1,=0,=12,15=2,=0,3Example of Stage 0 Stage
24、 1 in INTTMemory access conflict most likely occur during stage transitionGoal:No conflict!II-B.Conflict-free Addressing(CFA)Scheme for Single-port MemoryExample of Stage 0 Stage 1 in INTT +16=1,Memory access conflict-free,=0,=12,15=2,=0,3Same throughput Area(Power)67.87(44.39)%III.Bandwidth-efficie
25、nt Behavior in CoreFrequent data access pattern transition between NTT and MMAC unit Expensive external memory access!MMAC unitNTT unitNTT unitIII.Bandwidth-efficient Behavior in CoreDataflow in KS modified for better data utilization,using dedicated buffers(Tmp)between NTT and MMAC unitsIII.Bandwid
26、th-efficient Behavior in CoreDataflow in KS modified for better data utilization,using dedicated buffers(Tmp)between NTT and MMAC units External memory access 38.7%Chip ImplementationGD(LD):Global/Local DistributorComparison with Prior Works35Conclusion Designed a dedicated accelerator for KS that r
27、equires frequent transitions in data access patterns incurring redundant expensive external memory accesses Proposed design techniques on various levels for high energy efficiency modular multiplier,NTT unit,and data access behavior in core,thus full-stack optimization As a result,the design shows s
28、ignificant improvement in performance in energy efficiency compared with prior FHE implementations.Although designed specifically for KS,it remains highly applicable since KS operations dominate power,time,and bandwidth across the entire computation.Techniques can be also applied to other parameter sets modular multiplier as long as same moduli set used and +1 41,NTT unit down to =216(the least amount to support bootstrapping reported in literature)