Model-based Cost Estimation on GPU and its application in deep learning operation optimizations
Jin Yue, Ant Group, 2020

Woodpecker-AutoSearch

Introduction

In this project, instead of searching for insects, we develop a machine learning-based framework to search for the fastest possible program or hardware design, given a math operation and a hardware platform.

Unlike traditional optimizing technologies, Woodpecker-AutoSearch uses machines to write fast programs, instead of tedious hand optimization by domain experts.

The big question we'd like to ask in this project is: can machines beat humans at coding?

[Figure: framework overview. Algorithms from various domains (Deep learning, Graph computing, Math optimizations, Data analysis) are expressed as a Parameterized program (Halide, Weld, Spatial, CUDA, GraphIt) and turned into an Efficient program or hardware. Optimization algorithms: RL, Bayesian, Genetic, MCMC, SA. Feedback: Measurement, Perf model (labels: Profiling, Program, Historical data, Hardware). Hardware: CPU, GPU, FPGA, Ali-NPU, Mobile/Embed.]

[Figure: the same overview reduced to Algorithms from various domains, Parameterized program, Cost Model, and Efficient program or hardware.]

Halide

A Domain-Specific Language (DSL) and compiler for image processing pipelines.
https:/
... to use
Separating algorithm and schedule opens the door for efficient optimization.
- Algorithm: what you want to compute
- Schedule: how you want to compute it, including choices about memory locality, redundant computation and parallelism

Blur function
Algorithm:
g(x, y) = x + y
f(x, y) = (g(x, y-1) + g(x, y) + g(x, y+1)) / 3

Efficient Program for GPU

Halide Schedule: f.gpu_tile(x, y, xo, yo, xi, yi, 8, 8)

[Listing: the lowered loop nest for f, with gpu_block loops over f.s0.y.yo._block_id_y and f.s0.x.xo._block_id_x and per-thread indexing through _thread_id_y and _thread_id_x.]

Search time:
- ...: 1.35 day
- Genetic search: 1600 choices, 12 min

Genetic Search Algorithm

[Figure: a chromosome is one point in the search space. Its genes are TILE_X, TILE_Y, TILE_Z, TILE_RZ, THREAD_X, THREAD_Y, THREAD_Z, SHARE MEM TYPE and LOOP ORDER TYPE.]

- Roulette wheel selection: chromosomes are selected according to their fitness (wheel sectors in the figure: 11%, 41%, 18%, 28%).
- Crossover
- Mutation
- We replace the child candidate with the most similar candidate in the valid search space.

Experiment Result

Converges in 10 minutes. 2.8X faster than NVIDIA cuDNN, 1.5X faster than TVM.

[Figure: results over search trials for Woodpecker Genetic, Woodpecker Random, Woodpecker Grid, TVM and cuDNN; x axis: Trials.]
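
The genetic search steps named on the slides (roulette wheel selection, crossover, mutation, and replacing an invalid child with the most similar valid candidate) can be sketched roughly as follows. This is a minimal illustration, not Woodpecker-AutoSearch itself: the gene value ranges, the validity rule, the `fitness` function, and all function names are assumptions made for the example, and "most similar" is interpreted here as "fewest differing genes".

```python
import itertools
import random

# Hypothetical gene ranges: the slides name the genes but not their values.
GENES = {
    "TILE_X": [1, 2, 4, 8],
    "TILE_Y": [1, 2, 4, 8],
    "THREAD_X": [8, 16, 32],
    "THREAD_Y": [1, 2, 4, 8],
}
NAMES = list(GENES)


def is_valid(c):
    # Assumed constraint for illustration: at most 128 threads per block.
    return c[2] * c[3] <= 128


# The valid search space, enumerated once (small enough in this toy setup).
VALID = [c for c in itertools.product(*GENES.values()) if is_valid(c)]


def fitness(c):
    # Toy stand-in for measured speed; a real search would time the kernel.
    tile_x, tile_y, thread_x, thread_y = c
    return 1.0 / (1.0 + abs(tile_x * tile_y - 16) + abs(thread_x * thread_y - 64) / 8)


def roulette(pop, rng):
    # Roulette wheel selection: pick with probability proportional to fitness.
    return rng.choices(pop, weights=[fitness(c) for c in pop], k=1)[0]


def crossover(a, b, rng):
    # Single-point crossover of two parent chromosomes.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(c, rng, rate=0.2):
    # Each gene is independently resampled with probability `rate`.
    return tuple(
        rng.choice(GENES[name]) if rng.random() < rate else gene
        for name, gene in zip(NAMES, c)
    )


def repair(c):
    # Replace an invalid child with the most similar valid candidate,
    # where similarity is measured by the number of differing genes.
    if is_valid(c):
        return c
    return min(VALID, key=lambda v: sum(x != y for x, y in zip(v, c)))


def genetic_search(generations=30, pop_size=16, seed=0):
    rng = random.Random(seed)
    pop = rng.sample(VALID, pop_size)
    best = max(pop, key=fitness)
    for _ in range(generations):
        pop = [
            repair(mutate(crossover(roulette(pop, rng), roulette(pop, rng), rng), rng))
            for _ in range(pop_size)
        ]
        best = max(pop + [best], key=fitness)
    return best
```

Calling `genetic_search()` returns the best chromosome seen, as a tuple in the order of `NAMES`. In a real tuner, `fitness` would be replaced by a measurement or by a cost model prediction.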

Cost Metrics

- Parallel degree
- Global load efficiency
- Global store efficiency
- Data Reuse (Register)
- Data Reuse (Shared memory)
- Shared Memory Bank Conflict
- Branch Efficiency

Formula

$$\mathrm{cost} = \mathrm{parallelism\_degree}^{w_1} \cdot \mathrm{memory\_access\_eff}^{w_2} \cdot \mathrm{register\_reuse}^{w_3} \cdot \mathrm{shared\_mem\_reuse}^{w_4} \cdot \mathrm{shared\_mem\_bank\_conflict}^{w_5} \cdot \mathrm{branch\_eff}^{w_6}$$

$$\log(\mathrm{cost}) = w_1 \log(\mathrm{parallelism\_degree}) + w_2 \log(\mathrm{memory\_access\_eff}) + w_3 \log(\mathrm{register\_reuse}) + w_4 \log(\mathrm{shared\_mem\_reuse}) + w_5 \log(\mathrm{shared\_mem\_bank\_conflict}) + w_6 \log(\mathrm{branch\_eff})$$

Cost Model - NN

[Figure: network pipeline: Cost Metrics → log → batch norm → linear → batch norm → ReLU → normalize → Predicted Cost.]
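
To make the Formula slide concrete, the sketch below evaluates the product form and the log form of the cost and shows that they agree. The metric values and weights are made-up numbers for illustration only, and the function names are not from the slides.

```python
import math

# Metric names as they appear in the formula.
METRICS = [
    "parallelism_degree",
    "memory_access_eff",
    "register_reuse",
    "shared_mem_reuse",
    "shared_mem_bank_conflict",
    "branch_eff",
]

# Hypothetical values, chosen only so the example runs; all must be positive.
EXAMPLE_METRICS = {
    "parallelism_degree": 64.0,
    "memory_access_eff": 0.8,
    "register_reuse": 4.0,
    "shared_mem_reuse": 2.0,
    "shared_mem_bank_conflict": 1.5,
    "branch_eff": 0.95,
}
EXAMPLE_WEIGHTS = [-1.0, -0.5, -0.3, -0.2, 0.4, -0.1]  # w1..w6, assumed


def cost(metrics, weights):
    # Product form: each metric raised to its weight.
    return math.prod(metrics[name] ** w for name, w in zip(METRICS, weights))


def log_cost(metrics, weights):
    # Log form: a weighted sum of log-metrics, linear in the weights.
    return sum(w * math.log(metrics[name]) for name, w in zip(METRICS, weights))


if __name__ == "__main__":
    c = cost(EXAMPLE_METRICS, EXAMPLE_WEIGHTS)
    print(math.isclose(math.log(c), log_cost(EXAMPLE_METRICS, EXAMPLE_WEIGHTS)))
```

Because the log form is linear in w1 through w6, the weights can in principle be fitted with ordinary linear regression on log-transformed metrics and log-transformed measured costs.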

Correlation Coefficient

[Figure: two panels comparing predicted ("predict") and measured ("real") cost across schedules, one with Pearson's r = 0.7 and one with Pearson's r = 0.88; x axis: Schedule.]

Cost Model vs. Genetic Search

[Figure: bar chart of Search Time and Performance for Cost Model and Genetic Search.]

GPU Optimizations

Argmax Operation

[Figure: an input of 1,500,000 elements is reduced either with no parallelism, or in parallel with 1500 threads into 1500 (Value, Index) pairs, which are then reduced with no parallelism (less data).]

Matrix Multiply(T) Operation

[Figure: Matrix 1 multiplied by the transpose of Matrix 2 to give the Output Matrix; dimension labels 64, 64, 1500 and 1,500,000; parallelization options labeled "Parallel with 64x64x1500", "Parallel with 64x64" and "Parallel with 64x64".]
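
The two-stage scheme on the Argmax Operation slide can be emulated on the CPU as below. This is a sequential sketch of the data flow only, assuming a non-empty input and assuming each "thread" handles one contiguous chunk; the function names are hypothetical, and a GPU version would run the first stage concurrently.

```python
def chunk_argmax(data, start, stop):
    # Work done by one thread: scan its chunk, return (value, index).
    best = start
    for i in range(start + 1, stop):
        if data[i] > data[best]:
            best = i
    return data[best], best


def two_stage_argmax(data, num_threads=1500):
    if not data:
        raise ValueError("data must be non-empty")
    n = len(data)
    chunk = (n + num_threads - 1) // num_threads  # ceiling division

    # Stage 1 (parallel on a GPU): one partial (value, index) pair per chunk.
    partial = [
        chunk_argmax(data, start, min(start + chunk, n))
        for start in range(0, n, chunk)
    ]

    # Stage 2 (no parallelism, far less data): reduce the partial pairs.
    value, index = partial[0]
    for v, i in partial[1:]:
        if v > value:
            value, index = v, i
    return index
```

With 1,500,000 inputs and 1500 threads, the second stage only scans 1500 pairs, which is why it can stay sequential. Ties resolve to the first maximal element in both stages.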

Performance

Feed into Cost Model

[Figure: speed of Ours relative to cuDNN (cuDNN = 1.00) for two operations. Argmax: 31.33. MatrixMultiply(T): 2.69. y axis: Speed.]

THE END
THANK YOU!
Zhao Rui, Zhang Yao, Chen Yong, Jin Yue, Liu Yongchao
Ant Group