Source file: 26.d2s4-3-aug23-Exploring Matrix Multiplication Techniques and Enhancements Using RISC-V Custom Extension_Andes_Chris.pdf (21 pages)
Exploring Matrix Multiplication Techniques and Enhancements Using RISC-V Custom Extension
Chun-Nan Ke
Sr. Tech. Manager
Andes Technology
Taking RISC-V Mainstream

Agenda
- Qualitative Comparison of Matrix Instructions
- Challenges of RVV on Matrix Operations
- SW Scalability/Portability Support
- Efficient Tiling using VRF Load/Stores
- Preliminary Results
- Conclusion

Qualitative Comparison for Matrix Instructions

Criteria listed under the column headers: FU utilization rate, memory bandwidth, diverse metrics, context switch overhead, data exchange efficiency, RF storage, R/W ports, scalability, VLEN agnostic.

Proposal          | Performance                                                              | Power             | Area   | Program Model
------------------|--------------------------------------------------------------------------|-------------------|--------|--------------
Integrated VRF    | Efficient IO scheme; high utilization rate; high compute-mem ratio       | Low dissipation   | Low    | Yes
Hybrid            | Seamless, scalable outer-product support; transpose/column access        | Medium            | Medium | Yes
Attached Facility | Better physical implementation; data compression support                 | Extra dissipation | High   | Yes

- Integrated VRF (reuse vector register): Andes AMM, SiFive Intelligence (not disclosed). Today's focus.
- Hybrid (additional matrix register): ARM SME. Watching.
- Attached Facility (independent matrix register): T-head, Streaming Computing. Watching.

Challenges of RVV on Matrix Operations
- Computational Capacity Challenges
  - Computing power is constrained by the vector-product/reduced-sum approach
- I/O Efficiency of Tiling VRF
  - Load/store instructions struggle to form matrix tiles
  - Permute/slide instructions add overhead for reshaping
- Data Reuse/Locality
  - Inner product has poor data reuse, which raises memory bandwidth requirements
- SW Scalability
  - Tiling/widening makes porting across diverse VLENs difficult
- Boundary/Tail Handling
  - The legacy vl CSR is inadequate for managing matrix/tensor boundaries

Andes Custom Extension (ACE) on RVV

I/O Efficiency of Tiling VRF
- Leverage standard RVV instructions:
  - load/store:
    - Linear (unit-stride) load/store: vle<eew>.v / vse<eew>.v
    - Constant-stride load/store: vlse<eew>.v / vsse<eew>.v
    - Segment load/store: vlseg<nf>e<eew>.v / vsseg<nf>e<eew>.v
  - permutation:
    - Vector slide (up/down): vslideup / vslidedown

RVV Overheads of Forming Matrix Tile (1)
- Pseudo code demonstration of linear load + slide to form an 8x8 INT8 matrix tile
  - Example: VLEN 512, feature map/weight format INT8

    ; linear vload to 8 VRFs
    vle8.v v0,(base),vl
    vle8.v v1,(base+K),vl
    vle8.v v2,(base+K*2),vl
    vle8.v v3,(base+K*3),vl
    vle8.v v4,(base+K*4),vl
    vle8.v v5,(base+K*5),vl
    vle8.v v6,(base+K*6),vl
    vle8.v v7,(base+K*7),vl
    ; vslideup/down to merge vt in multi-cycle permutations
    vslideup.vx va,v0,0,vm   ; va[i]    = v0[i]
    vslideup.vx va,v1,8,vm   ; va[i+8]  = v1[i]
    vslideup.vx va,v2,16,vm  ; va[i+16] = v2[i]
    vslideup.vx va,v3,24,vm  ; va[i+24] = v3[i]
    vslideup.vx va,v4,32,vm  ; va[i+32] = v4[i]
    vslideup.vx va,v5,40,vm  ; va[i+40] = v5[i]
    vslideup.vx va,v6,48,vm  ; va[i+48] = v6[i]
    vslideup.vx va,v7,56,vm  ; va[i+56] = v7[i]

(1) 1st va load: xT (L2$ hit) + 8*yT slide (multi-cycle perm.), so at least (x+8y)T
(2) 2nd va load: 8*yT slide (multi-cycle perm.), so 8yT for the 2nd va permutation
[Figure: Main Memory -> VRF (Linear) -> VRF (Tile)]

RVV Overheads of Forming Matrix Tile (2)
- Example latencies of forming sub-matrix tiles using RVV
  - Assume VLEN 512 and the input feature map is in INT8

Tiling Shape               | 1st vload | Seq. vload (permutation) | U-rate (GeMM 128x128)
---------------------------|-----------|--------------------------|----------------------
2x32                       | 9*T       | 6*T                      |
8x8 (Linear w/o Transpose) | 27*T      | 24*T                     |

[...] exp(seg_size)

    loopm:
      loopn:
        vand vc, 0x0
        loopk:
          asp_vload va, mem
          loop_portion:
            asp_vload vb, mem
            amm vc, vb, va, portion
          loop loop_portion
        loop loopk
        vquan vc
      loop loopn
    loop loopm

A Proposed Optimal Programming Model (1)
Enhanced MAC Utilization-rate and Compute-Mem Ratio
- VLEN 128: 32 MACs
- VLEN 256: 64 MACs
- VLEN 512: 128 MACs
- VLEN 1024: 256 MACs

A Proposed Optimal Programming Model (2)

    ; E.g. VLEN 512, INT8 IFM/Weight Tiling
    loopm:
      loopn:
        vand vd0-vd3, 0x0
        loopk:
          asp_vload va, mem
          asp_vload vb, mem
          amm vd0, vb, va, p0   ; va portion 0
          amm vd1, vb, va, p1   ; va portion 1
          amm vd2, vb, va, p2   ; va portion 2
          amm vd3, vb, va, p3   ; va portion 3
        loop loopk
        vquan vd0-vd3
      loop loopn
    loop loopm

+ Compute bound: MAC utilization-rate
+ Compute-mem: Low memory bandwidth requirement
"Baseline operations" to be encapsulated as a Macro/Builtins
- VLEN (stride/portion size)
- VSRC1/2 w/ Memory Addr
- VDST

Preliminary Results of Reduced Memory Bandwidth
[Figure: gain for tiling hybrid inner/outer product; compute-mem; memory access bandwidth requirement]

Preliminary Results for GeMM

Architecture (VLEN/DLEN/AMM 512/512/512) | Speed-up | U-rate (%)
-----------------------------------------|----------|-----------
Std. RVV (libvec)                        | 1x       | 15%
AMM (optimal w/o unroll)                 | 3.3x     | 39%
AMM (optimal w/ unroll)                  | 6.9x     | 82%

- Scenario: GeMM 128x128 * 128x128

Conclusion
- Assessed Std. RVV on matrix operations, including:
  - SW scalability/portability across diverse platforms
  - Memory bandwidth requirements
  - Load/store performance bottlenecks
  - Challenges in matrix boundary handling
- Briefly summarized integrated vector register files with a customized matrix extension ISA
- Preliminary results demonstrate significant performance improvement
- A reference architecture and preliminary results for the RISC-V M-ext. TG
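
The loop nest shown under "A Proposed Optimal Programming Model (2)" can be emulated in plain C to check the tiling arithmetic. The following is a minimal sketch under stated assumptions, not Andes's implementation: the slides do not define the semantics of asp_vload, amm, or the portion operand, so load_tile and amm_portion are hypothetical stand-ins. The sketch assumes a 512-bit register holds one 8x8 INT8 tile, that the 8x8 INT32 result occupies four 512-bit accumulators (vd0-vd3), and that each portion updates two rows of the result. That reading yields 2 x 8 x 8 = 128 multiply-accumulates per amm, which happens to agree with the 128 MACs listed for VLEN 512, but it remains a guess. The quantization step (vquan) is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed geometry for VLEN 512 with INT8 inputs (not confirmed by the slides). */
enum { TILE = 8, PORTIONS = 4, ROWS_PER_PORTION = TILE / PORTIONS };

/* Reference GEMM: C[MxN] = A[MxK] * B[KxN], INT8 inputs, INT32 accumulation. */
void gemm_ref(const int8_t *A, const int8_t *B, int32_t *C, int M, int N, int K) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            int32_t acc = 0;
            for (int k = 0; k < K; k++)
                acc += (int32_t)A[i * K + k] * (int32_t)B[k * N + j];
            C[i * N + j] = acc;
        }
}

/* Hypothetical stand-in for asp_vload: gather an 8x8 INT8 tile from a
 * row-major matrix with row stride ld into one 64-byte "register". */
static void load_tile(int8_t tile[TILE * TILE], const int8_t *src, int ld) {
    for (int r = 0; r < TILE; r++)
        memcpy(&tile[r * TILE], &src[r * ld], TILE);
}

/* Hypothetical stand-in for "amm vd, vb, va, portion": accumulate the rows of
 * the A tile selected by portion, times the whole B tile, into vd. */
static void amm_portion(int32_t vd[ROWS_PER_PORTION * TILE],
                        const int8_t vb[TILE * TILE],
                        const int8_t va[TILE * TILE], int portion) {
    for (int r = 0; r < ROWS_PER_PORTION; r++) {
        int row = portion * ROWS_PER_PORTION + r;
        for (int k = 0; k < TILE; k++)
            for (int c = 0; c < TILE; c++)
                vd[r * TILE + c] +=
                    (int32_t)va[row * TILE + k] * (int32_t)vb[k * TILE + c];
    }
}

/* Tiled GEMM following the loopm / loopn / loopk nest from the slides. */
void gemm_tiled(const int8_t *A, const int8_t *B, int32_t *C, int M, int N, int K) {
    /* Boundary/tail handling is out of scope for this sketch. */
    assert(M % TILE == 0 && N % TILE == 0 && K % TILE == 0);
    for (int m = 0; m < M; m += TILE)                 /* loopm */
        for (int n = 0; n < N; n += TILE) {           /* loopn */
            int32_t vd[PORTIONS][ROWS_PER_PORTION * TILE];
            memset(vd, 0, sizeof vd);                 /* vand vd0-vd3, 0x0 */
            for (int k = 0; k < K; k += TILE) {       /* loopk */
                int8_t va[TILE * TILE], vb[TILE * TILE];
                load_tile(va, &A[m * K + k], K);      /* asp_vload va, mem */
                load_tile(vb, &B[k * N + n], N);      /* asp_vload vb, mem */
                for (int p = 0; p < PORTIONS; p++)    /* amm vd0..vd3, p0..p3 */
                    amm_portion(vd[p], vb, va, p);
            }
            /* Write the four accumulators back as one 8x8 INT32 tile. */
            for (int p = 0; p < PORTIONS; p++)
                for (int r = 0; r < ROWS_PER_PORTION; r++)
                    memcpy(&C[(m + p * ROWS_PER_PORTION + r) * N + n],
                           &vd[p][r * TILE], TILE * sizeof(int32_t));
        }
}
```

Calling gemm_tiled and gemm_ref on the same inputs, with dimensions that are multiples of 8, should produce identical output buffers. In this emulation each pair of tile loads feeds four amm steps (512 multiply-accumulates), which illustrates why the slides list a low memory bandwidth requirement as a benefit of this loop structure.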