Model-Based GPU Performance Estimation and Its Application to Performance Optimization of Deep Learning Operations.pdf

Uploaded by: li  ID: 29588  2021-02-07  PDF  28 pages  1.10MB

This report belongs to the collection: 2020 GTC China Online Conference guest speaker slides (PPT)


Model-based Cost Estimation on GPU and Its Application in Deep Learning Operation Optimizations
Jin Yue (金躍), Ant Group (螞蟻集團), 2020
Slide footer: for internal exchange only; contact the document author before public release.

Woodpecker-AutoSearch: Introduction
In this project, instead of searching for insects, we develop a machine learning-based framework to search for the fastest possible program or hardware design, given a math operation and a hardware platform. Unlike traditional optimizing technologies, Woodpecker-AutoSearch uses machines to write fast programs, instead of tedious hand optimization by domain experts. The big question we would like to ask in this project is: can machines beat humans at coding?

System overview
[Diagram, shown in several builds across the deck]
Algorithms from various domains: deep learning, graph computing, math optimizations, data analysis.
Parameterized program: Halide, Weld, Spatial, CUDA, GraphIt.
Output: efficient program or hardware.
Hardware: CPU, GPU, FPGA, Ali-NPU, Mobile/Embed.
Feedback: measurement, perf model.
Optimization algorithms: RL, Bayesian, Genetic, MCMC, SA.
Other labels: Profiling, Program, Historical data, Hardware.
A later build of the same diagram adds a Cost Model stage.

Halide
A Domain-Specific Language (DSL) and compiler for image processing pipelines.
https:/ to use
Separating algorithm and schedule opens the door for efficient optimization.
Algorithm: what you want to compute.
Schedule: how you want to compute it, including choices about memory locality, redundant computation, and parallelism.
Blur function, algorithm:
g(x, y) = x + y
f(x, y) = (g(x, y-1) + g(x, y) + g(x, y+1)) / 3

Efficient Program for GPU
Halide schedule: f.gpu_tile(x, y, xo, yo, xi, yi, 8, 8)
[Figure: excerpt of the lowered loop nest that Halide generates for this schedule, with gpu_block loops over f.s0.y.yo.__block_id_y and f.s0.x.xo.__block_id_x and per-thread indexing through __thread_id_y and __thread_id_x.]
Search cost shown on the slide: 1.35 day for an approach whose label did not survive extraction; genetic search: 1600 choices, 12 min.

Genetic Search Algorithm
A gene is one tunable choice, a chromosome is one full set of choices, and the search space is the set of all chromosomes.
Genes shown: SHARE MEM TYPE, LOOP ORDER TYPE, THREAD X, THREAD Y, THREAD Z, TILE_X, TILE_Y, TILE_Z, TILE_RZ.
[Figure: an example chromosome of integer gene values.]

Genetic operators: Mutation, Crossover, Fitness
We replace the child candidate with the most similar candidate in the valid search space.
Roulette wheel selection.
[Figure: chromosomes with fitness values 0.4, 0.2, and 0.1, and a roulette wheel with sectors of 11%, 41%, 18%, and 28%.]

Experiment Result
Converges in 10 minutes: 2.8X faster than NVIDIA cuDNN, 1.5X faster than TVM.
[Figure: curves plotted against Trials for Woodpecker Genetic, Woodpecker Random, Woodpecker Grid, TVM, and cuDNN.]

Cost Metrics
Global load efficiency
Global store efficiency
Parallel degree
Data reuse (register)
Data reuse (shared memory)
Shared memory bank conflict
Branch efficiency

Formula
cost = parallelism_degree^w1 * memory_access_eff^w2 * register_reuse^w3 * shared_mem_reuse^w4 * shared_mem_bank_conflict^w5 * branch_eff^w6
log(cost) = w1*log(parallelism_degree) + w2*log(memory_access_eff) + w3*log(register_reuse) + w4*log(shared_mem_reuse) + w5*log(shared_mem_bank_conflict) + w6*log(branch_eff)

Cost Model - NN
[Figure: network from Cost Metrics to Predicted Cost; stages in extracted order: log, batch norm, linear, batch norm, ReLU, normalize.]

Correlation Coefficient
[Figure: two panels plotting predicted and real values across schedules, one with Pearson's r = 0.7 and one with Pearson's r = 0.88.]

Cost Model vs. Genetic Search
[Figure: bar chart comparing Search Time and Performance for Cost Model and Genetic Search.]

GPU Optimizations

Argmax Operation
[Figure labels: input of 1500,000 elements; Parallel with 1500 Threads; Value and Index arrays of 1500 entries each; No Parallel (less data).]

Matrix Multiply (T) Operation
[Figure labels: Matrix 1, Matrix 2, Output Matrix, Transpose; dimensions 64, 1500, and 1500,000; Parallel with 64x64x1500, Parallel with 64x64, Parallel with 64x64.]

Performance
Feed into Cost Model.
[Figure: bar charts of Speed for cuDNN and Ours on Argmax and on Matrix Multiply (T).]

THE END. THANK YOU!
Zhao Rui, Zhang Yao, Chen Yong, Jin Yue, Liu Yongchao
Ant Group (螞蟻集團)
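The log-linear cost formula from the slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric names follow the Formula slide, but the weights and metric values below are made-up placeholders, and the helper names (log_cost, cost, pearson_r) are hypothetical rather than part of Woodpecker-AutoSearch. The Pearson helper mirrors the kind of check reported on the Correlation Coefficient slide, here without any real data.

```python
import math

# Metric names as they appear on the Formula slide.
METRICS = (
    "parallelism_degree",
    "memory_access_eff",
    "register_reuse",
    "shared_mem_reuse",
    "shared_mem_bank_conflict",
    "branch_eff",
)

def log_cost(metrics, weights):
    """log(cost) = sum_i w_i * log(metric_i); every metric must be > 0."""
    return sum(weights[name] * math.log(metrics[name]) for name in METRICS)

def cost(metrics, weights):
    """Product form of the same model: cost = prod_i metric_i ** w_i."""
    return math.prod(metrics[name] ** weights[name] for name in METRICS)

def pearson_r(xs, ys):
    """Pearson correlation between predicted and measured values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Placeholder weights and metrics, invented for this example only.
example_weights = dict(zip(METRICS, (1.0, 0.8, 0.5, 0.5, -0.3, 0.2)))
example_metrics = dict(zip(METRICS, (4.0, 0.5, 2.0, 3.0, 0.25, 0.9)))

if __name__ == "__main__":
    print("log(cost):", log_cost(example_metrics, example_weights))
    print("cost:     ", cost(example_metrics, example_weights))
```

Because the model is linear in log space, the weights w1 to w6 could in principle be fitted with ordinary linear regression on measured schedules; the slides do not say how the authors fitted them.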

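This document presents Woodpecker-AutoSearch, a machine learning-based programming framework that searches for the best-performing program design for a given math operation and hardware platform. Unlike traditional optimization techniques, Woodpecker-AutoSearch uses machines to write efficient programs automatically instead of relying on tedious manual optimization by domain experts. The core question of the project is whether machines can outperform humans at coding. Halide, which the talk builds on, is a domain-specific language (DSL) and compiler for image processing pipelines that enables optimization by separating the algorithm from the schedule. A Halide schedule can express optimizations such as loop splitting, reordering, unrolling, and storage layout. The document also discusses applying a genetic search algorithm to hardware optimization, using convolution as an example to show how efficiently genetic search handles the optimization dimensions and choices. Finally, the experimental results report that the approach runs 2.8X faster than NVIDIA cuDNN and 1.5X faster than TVM.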
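The genetic search loop described in the slides (roulette wheel selection, crossover, mutation, and replacing an invalid child with the most similar valid candidate) can be sketched as follows. This is a toy illustration, not the authors' implementation: the gene names come from the slides, while the value ranges, the validity rule (at most 256 threads per block), and toy_fitness are assumptions invented for the example. A real search would measure the generated kernel instead of calling toy_fitness.

```python
import itertools
import random

# Gene names from the slides; value ranges are hypothetical.
GENES = {
    "TILE_X": [1, 2, 4, 8, 16],
    "TILE_Y": [1, 2, 4, 8, 16],
    "THREAD_X": [1, 2, 4, 8, 16, 32],
    "THREAD_Y": [1, 2, 4, 8, 16, 32],
}
NAMES = list(GENES)

def is_valid(chrom):
    # Assumed constraint: at most 256 threads per block.
    genes = dict(zip(NAMES, chrom))
    return genes["THREAD_X"] * genes["THREAD_Y"] <= 256

# The valid search space: every chromosome that satisfies the constraint.
VALID = [c for c in itertools.product(*GENES.values()) if is_valid(c)]

def toy_fitness(chrom):
    # Stand-in for measured speed; always positive.
    g = dict(zip(NAMES, chrom))
    miss = abs(g["TILE_X"] * g["THREAD_X"] - 64) + abs(g["TILE_Y"] * g["THREAD_Y"] - 64)
    return 1.0 / (1.0 + miss)

def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.uniform(0, sum(fitnesses))
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if pick <= running:
            return individual
    return population[-1]

def crossover(a, b, rng):
    """Single-point crossover of two parent chromosomes."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(chrom, rng, rate=0.1):
    """Resample each gene from its domain with probability `rate`."""
    return tuple(
        rng.choice(GENES[name]) if rng.random() < rate else gene
        for name, gene in zip(NAMES, chrom)
    )

def repair(chrom):
    """Replace an invalid child with the valid candidate differing in the fewest genes."""
    if is_valid(chrom):
        return chrom
    return min(VALID, key=lambda v: sum(x != y for x, y in zip(v, chrom)))

def genetic_search(generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [rng.choice(VALID) for _ in range(pop_size)]
    best = max(population, key=toy_fitness)
    for _ in range(generations):
        fitnesses = [toy_fitness(c) for c in population]
        children = []
        for _ in range(pop_size):
            parent_a = roulette_select(population, fitnesses, rng)
            parent_b = roulette_select(population, fitnesses, rng)
            children.append(repair(mutate(crossover(parent_a, parent_b, rng), rng)))
        population = children
        best = max(population + [best], key=toy_fitness)
    return best

if __name__ == "__main__":
    print(dict(zip(NAMES, genetic_search())))
```

Similarity in repair is measured here as the number of differing genes; the slides do not specify the similarity measure, so this choice is an assumption.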

