26.d2s4-3-aug23-Exploring Matrix Multiplication Techniques and Enhancements Using RISC-V Custom Extension_Andes_Chris.pdf


Exploring Matrix Multiplication Techniques and Enhancements Using RISC-V Custom Extension
Chun-Nan Ke, Sr. Tech. Manager, Andes Technology
Taking RISC-V Mainstream

Agenda
- Qualitative Comparison of Matrix Instructions
- Challenges of RVV on Matrix Operations
- SW Scalability/Portability Support
- Efficient Tiling using VRF Load/Stores
- Preliminary Results
- Conclusion

Qualitative Comparison for Matrix Instructions
Columns: Proposal, Performance, Power, Area, Program Model
Criteria (in slide order): FU utilization rate, memory bandwidth, diverse metrics, context-switch overhead, data-exchange efficiency, RF storage, R/W ports, scalability, VLEN agnostic

Integrated VRF (reuses the vector registers: Andes AMM, SiFive Intelligence (not disclosed)). Today's focus.
- Performance: efficient IO scheme, high utilization rate, high compute-mem ratio
- Power: low dissipation
- Area: low
- Program model (VLEN agnostic): yes

Hybrid (additional matrix registers: ARM SME). Watching.
- Performance: seamless scalable outer-product support, transpose/column access
- Power: medium
- Area: medium
- Program model (VLEN agnostic): yes

Attached Facility (independent matrix registers: T-head, Streaming Computing). Watching.
- Performance: better physical implementation, data compression support
- Power: extra dissipation
- Area: high
- Program model (VLEN agnostic): yes

Challenges of RVV on Matrix Operations
- Computational capacity challenges
  - Constrained computing power based on vector-product/reduced-sum
- I/O efficiency of tiling the VRF
  - Load/store instructions struggle to form matrix tiles
  - Permute/slide instruction overheads for reshaping
- Data reuse/locality
  - Inner product places heavy demands on memory bandwidth
- SW scalability
  - Tiling/widening present porting difficulties across diverse VLENs
- Boundary/tail handling
  - The legacy vl CSR is inadequate for managing matrix/tensor boundaries

Andes Custom Extension (ACE) on RVV

I/O Efficiency of Tiling VRF
- Leverage standard RVV instructions:
  - load/store:
    - Linear load/store: vl/se.v
    - Constant-stride load/store: vl/sse.v
    - Segment load: vl/ssege.v
  - permutation:
    - Vector slide (up/down): vslideup/down.v

RVV Overheads of Forming Matrix Tile (1)
- Pseudo code demonstration for linear load + slide to form an 8x8 INT8 matrix tile
  - Take the example of VLEN 512, feature map/weight format INT8

    ; linear vload to 8 VRFs
    vle8.v v0,(base),vl
    vle8.v v1,(base+K),vl
    vle8.v v2,(base+K*2),vl
    vle8.v v3,(base+K*3),vl
    vle8.v v4,(base+K*4),vl
    vle8.v v5,(base+K*5),vl
    vle8.v v6,(base+K*6),vl
    vle8.v v7,(base+K*7),vl
    ; vslideup/down to merge vt in multi-cycle permutations
    vslideup.vx va,v0,0,vm     ; va[i]    = v0[i]
    vslideup.vx va,v1,8,vm     ; va[i+8]  = v1[i]
    vslideup.vx va,v2,16,vm    ; va[i+16] = v2[i]
    vslideup.vx va,v3,24,vm    ; va[i+24] = v3[i]
    vslideup.vx va,v4,32,vm    ; va[i+32] = v4[i]
    vslideup.vx va,v5,40,vm    ; va[i+40] = v5[i]
    vslideup.vx va,v6,48,vm    ; va[i+48] = v6[i]
    vslideup.vx va,v7,56,vm    ; va[i+56] = v7[i]

(1) 1st va load: x T (L2$ hit) + 8*y T slide (multi-cycle perm.), so at least (x+8y) T
(2) 2nd va load: 8*y T slide (multi-cycle perm.), so 8y T for the 2nd va permutation
[Figure labels: Main Memory, VRF (Linear), VRF (Tile)]

RVV Overheads of Forming Matrix Tile (2)
- Example latencies of forming sub-matrix tiles using RVV
  - Assume VLEN 512 and the input feature map is in INT8

Tiling Shape                  1st vload   Seq. vload (permutation)   U-rate (GeMM 128x128)
2x32                          9*T         6*T
8x8 (Linear w/o Transpose)    27*T        24*T

exp(seg_size)

    loopm:
    loopn:
      vand vc,0x0
    loopk:
      asp_vload va,mem
    loop_portion:
      asp_vload vb,mem
      amm vc,vb,va,portion
      loop loop_portion
      loop loopk
      vquan vc
      loop loopn
      loop loopm
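The linear-load plus slide sequence and the latency table earlier in this deck can be checked with a small software model. The sketch below is only an illustration under stated assumptions: `vle8` and `vslideup` are hypothetical Python stand-ins for the standard RVV instructions of the same name, and the latency values `x = 3` and `y = 3` are inferred from the table entries (27*T/24*T and 9*T/6*T), not given explicitly on the slides. Treating the 2x32 shape as two slides is likewise an assumption.

```python
# Software model of forming an 8x8 INT8 tile with 8 linear loads + 8 vslideup ops.
# All helper names are hypothetical; they mimic the RVV instructions, not a real API.

VLEN_BITS = 512
VL = VLEN_BITS // 8  # 64 INT8 elements fit in one 512-bit vector register


def vle8(mem, base, vl=VL):
    """Model of vle8.v: unit-stride load of vl bytes starting at address base."""
    return list(mem[base:base + vl])


def vslideup(vd, vs2, offset, vl=VL):
    """Model of vslideup.vx: vd[i + offset] = vs2[i]; vd[0:offset] is left unchanged."""
    out = list(vd)
    for i in range(offset, vl):
        out[i] = vs2[i - offset]
    return out


def form_tile_8x8(mem, base, K):
    """Load 8 rows spaced K bytes apart, then merge their first 8 bytes into one register."""
    rows = [vle8(mem, base + K * r) for r in range(8)]   # v0..v7
    va = [0] * VL
    for r, row in enumerate(rows):
        va = vslideup(va, row, 8 * r)                     # offsets 0, 8, ..., 56
    return va


def tile_latency(n_slides, x=3, y=3):
    """Latency in units of T: (first tile, subsequent tiles).

    x = load latency on an L2$ hit, y = latency of one slide.
    The defaults are inferred from the table, not stated on the slides.
    """
    return x + n_slides * y, n_slides * y


K = 128                                        # row pitch of the source matrix in bytes
mem = [(3 * i + 1) % 128 for i in range(K * 8)]
tile = form_tile_8x8(mem, 0, K)
```

With these assumptions, `tile_latency(8)` gives (27, 24) and `tile_latency(2)` gives (9, 6), matching the two table rows, which suggests the permutation cost rather than the load dominates tile formation.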

A Proposed Optimal Programming Model (1)
Enhanced MAC utilization rate and compute-mem ratio
- VLEN 128: 32 MACs
- VLEN 256: 64 MACs
- VLEN 512: 128 MACs
- VLEN 1024: 256 MACs

A Proposed Optimal Programming Model (2)

    ; E.g. VLEN 512, INT8 IFM/Weight tiling
    loopm:
    loopn:
      vand vd0-vd3,0x0
    loopk:
      asp_vload va,mem
      asp_vload vb,mem
      amm vd0,vb,va,p0    ; va portion 0
      amm vd1,vb,va,p1    ; va portion 1
      amm vd2,vb,va,p2    ; va portion 2
      amm vd3,vb,va,p3    ; va portion 3
      loop loopk
      vquan vd0-vd3
      loop loopn
      loop loopm

+ Compute bound: MAC utilization rate
+ Compute-mem: low memory bandwidth requirement

"Baseline operations" to be encapsulated as a macro/builtin:
- VLEN (stride/portion size)
- VSRC1/2 w/ memory addr
- VDST

Preliminary Results of Reduced Memory Bandwidth
[Figure: gain for tiling hybrid inner/outer product; labels: Compute-mem, Memory Access Bandwidth Requirement]

Preliminary Results for GeMM

Architecture (VLEN/DLEN/AMM 512/512/512)   Speed-up   U-rate (%)
Std. RVV (libvec)                          1x         15%
AMM (optimal w/o unroll)                   3.3x       39%
AMM (optimal w/ unroll)                    6.9x       82%

- Scenario: GeMM 128x128 * 128x128

Conclusion
- Assessing Std. RVV on matrix operations, including:
  - SW scalability/portability across diverse platforms
  - Memory bandwidth requirements
  - Load/store performance bottlenecks
  - Challenges of matrix boundary handling
- Integrated vector register files with a customized matrix extension ISA are briefly summarized
- Preliminary results demonstrate significant performance improvement
- A reference architecture and preliminary results for the RISC-V M-ext. TG
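The memory-bandwidth argument running through the deck (inner product is bandwidth-hungry, tiled outer-product style accumulation is not) can be made concrete with simple counting. The sketch below is a generic back-of-the-envelope model, not the AMM instruction's actual data flow, which the slides do not fully specify. The function names are hypothetical, and the `VLEN // 4` relation is only an observation that fits the four MAC counts listed on the slide, not a stated formula.

```python
# Back-of-the-envelope compute-to-memory model for an m x n output tile.
# Hypothetical helper names; generic arithmetic, not the AMM micro-architecture.


def macs_per_loaded_element(m, n, k):
    """MACs performed per operand element loaded.

    For each of the k accumulation steps, an m-element slice of A and an
    n-element slice of B are loaded and combined into m*n multiply-accumulates.
    """
    macs = m * n * k
    loads = (m + n) * k
    return macs / loads


def mac_count(vlen_bits):
    """MAC count that fits the slide's data points (assumed pattern: VLEN / 4)."""
    return vlen_bits // 4


inner_product = macs_per_loaded_element(1, 1, 128)   # one output element at a time
tile_8x8 = macs_per_loaded_element(8, 8, 128)        # 8x8 output tile
```

Under this model the inner product performs 0.5 MACs per loaded element while an 8x8 tile performs 4, so the tile needs one eighth of the operand traffic for the same work. Larger, squarer tiles raise the ratio further, at the cost of more accumulator storage.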
