Hammer!: A Unified Framework of Model Compression and NAS
Speakers: Xiufeng Xie, Hongmin Xu (AI Platform, Seattle AI Lab, and FeDA Lab, Kwai)
Key contributors: Hongmin Xu, Xiufeng Xie, Jianchao Tan, Yi Guo, Huan Yuan, Jixiang Li, Xiangru Lian, and Ji Liu

Contributors
- Key contributors

Why Do We Need DNN Model Compression?
- Speech recognition
- Beauty photos
- Face unlock
- Online recommendation (inference)

Goals in Model Compression
- Accuracy
- Computation efficiency
- Energy efficiency

Energy & Computing Cost a Lot
- Example: running an AI application that costs 100M FLOPs
- On server: about $1,000 per day; on phone: about $1.8M per day
- Source: [1], assuming 300M DAU and 100 videos per user

However, Model Size Is Not the Whole Story
- Model size / FLOPs, energy, and latency all matter, and energy and latency depend on the hardware
Model Compression Tools
- Dimensions compared: tool name, tool author, unified framework, model pruning, model quantization, NAS for compression, hardware awareness
- Tools compared: Hammer (Kwai), PocketFlow (Tencent), NNI (Microsoft), Distiller (Intel), PaddlePaddle (Baidu), TensorFlow Lite (Google), AIMET (Qualcomm), Condensa (Nvidia)
- Hammer: joint pruning + quantization + NAS; the others: pruning and quantization

Ubiquitous AI Optimization with an All-in-One Framework
- Maximize accuracy, subject to: latency <= latency tolerance (application requirement) and resource <= resource budget (hardware resource budget)
- Add as many constraints as the user wants

Hammer! Supported Compression Strategies
- Pruning
- Quantization
- Neural architecture search (NAS)
Hammer! Is GPU-Hardware-Aware
Search for a DNN structure that best fits the GPU hardware:
- Modular booster: the number of input channels and the number of output channels are each a multiple of N (N = 8)
- Latency: optimize the DNN's running latency on the target GPU (latency profiling -> optimizer -> model)
- Energy: optimize the DNN's energy cost on the target GPU (energy profiling -> optimizer -> model)
- Memory: optimize the DNN's memory consumption within the hardware's quota (memory profiling -> optimizer -> model)

Hammer! Supported Network Architectures
- CNN with pruned channels (e.g., an image classifier outputting Dog 0.01, Cat 0.01, Ship 0.92, Plane 0.06)
- RNN (turning an image model into a video model)
- Networks with skip connections

All-In-One Framework: DAG Representation
- A NN is a directed acyclic graph (DAG)
- Model compression is equivalent to cutting off the less important edges in the DAG, under certain constraints
- Notation: e_ij is the edge indicator for the edge from computing node i to node j; the graph also has input and output nodes, and separate edge indicators for NAS (see the sketch below)
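To make the DAG view concrete, here is a minimal sketch (not Hammer's implementation) that attaches a binary indicator to each edge of a toy graph and drops the low-importance edges; the node names, importance scores, and threshold are made up for illustration.

```python
# Minimal sketch of the DAG view of compression (illustrative only, not Hammer's API).
# Each edge (i, j) carries a binary indicator e[(i, j)]; compression keeps only edges with e == 1.

# Tiny DAG: input -> conv1 -> conv2 -> output, plus a skip edge input -> conv2
edges = {("input", "conv1"), ("conv1", "conv2"), ("input", "conv2"), ("conv2", "output")}

# Hypothetical importance scores (in practice these would come from training the indicators)
importance = {("input", "conv1"): 0.9, ("conv1", "conv2"): 0.8,
              ("input", "conv2"): 0.05, ("conv2", "output"): 1.0}

threshold = 0.1
e = {edge: int(importance[edge] >= threshold) for edge in edges}   # edge indicators in {0, 1}

pruned_dag = {edge for edge in edges if e[edge] == 1}              # cut off less important edges
print(pruned_dag)  # the low-importance skip edge ("input", "conv2") is removed
```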
All-In-One Framework: Unified Compression Strategies

    min_{w, e} C(w, e)        (w: NN weights, C: NN loss function)

subject to:
- Pruning: e_ij ∈ {0, 1} for every edge (i, j)
- NAS: Σ_i e_ik = 1 for every node k ∈ O_NAS (O_NAS: the set of nodes where NAS paths converge)
- Quantization: w ∈ M_Q (M_Q: the quantized weight space)
(a small illustrative sketch of these three constraint families follows)
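As a rough illustration of how the three constraint families act on the indicators and weights, the PyTorch snippet below sketches a binary pruning gate, a one-hot NAS choice at a converging node, and a uniform 8-bit grid standing in for the quantized weight space M_Q; none of this is Hammer's actual code.

```python
import torch

# Pruning: each edge/channel indicator is binary, e_ij in {0, 1}.
e_prune = (torch.rand(8) > 0.5).float()

# NAS: at a node where several candidate paths converge, exactly one edge survives,
# i.e. sum_i e_ik = 1 -> a one-hot choice over the candidates.
logits = torch.randn(3)                                           # 3 candidate incoming edges
e_nas = torch.nn.functional.one_hot(logits.argmax(), num_classes=3).float()
assert e_nas.sum() == 1

# Quantization: weights are restricted to a quantized set M_Q,
# here a uniform 8-bit grid as a stand-in for the real quantizer.
w = torch.randn(8)
scale = w.abs().max() / 127
w_q = torch.clamp((w / scale).round(), -128, 127) * scale          # w projected onto M_Q
```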
All-In-One Framework: Hardware-Aware Model Compression

    min_{w, e} C(w, e)

subject to the hardware-aware constraints:
- Energy: E_CPU(e) ≤ E_budget
- Latency: T_GPU(e) ≤ T_budget
- Memory: M_GPU(e) ≤ M_budget
- FLOPs: C(e) ≤ C_budget
- Parameter number: N(e) ≤ N_budget
- Sparsity: S(e) ≤ S_budget

All-In-One Framework: Hardware-Aware Model Compression (Modular Booster)

    min_{w, e} C(w, e)        (w: NN weights, C: NN loss function)

subject to:
- Modular booster: Σ_i e_ik mod N = 0 for every node k, i.e., the surviving input and output channel counts are multiples of N (N = 8); see the sketch below
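A small sketch of what the hardware-aware budgets and the modular booster amount to in practice; the metric names, numbers, and helper functions below are hypothetical, not Hammer's API.

```python
# Sketch of hardware-aware budget checks and the modular booster (illustrative only).
N = 8  # channel counts are forced to multiples of N so GPU kernels stay efficient

def boost_channels(c: int, n: int = N) -> int:
    """Round a searched channel count up to the nearest multiple of n (sum_j e_jk mod n == 0)."""
    return ((c + n - 1) // n) * n

def within_budgets(metrics: dict, budgets: dict) -> bool:
    """Accept a candidate structure only if every profiled metric is within its budget."""
    return all(metrics[k] <= budgets[k] for k in budgets)

# Hypothetical profiled numbers for one candidate structure
metrics = {"energy_mj": 35.0, "latency_ms": 8.2, "memory_mb": 410.0, "flops_m": 720.0}
budgets = {"energy_mj": 40.0, "latency_ms": 10.0, "memory_mb": 512.0, "flops_m": 800.0}

print(boost_channels(37))                # -> 40
print(within_budgets(metrics, budgets))  # -> True
```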
All-In-One Framework: Complicated Network Support

    min_{w, e} C(w, e)

subject to:
- Skip connection: e_ik = e_jk for every node k ∈ D_ADD (D_ADD: the set of nodes where outputs from multiple previous nodes add together), so the branches entering an add share one mask (see the sketch below)
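The skip-connection constraint can be pictured as the two branches feeding an add node sharing a single channel mask, as in this illustrative snippet (not Hammer's code); shapes and the keep ratio are made up.

```python
import torch

# Skip-connection constraint e_ik = e_jk: when the outputs of node i and node j are added
# together (a node in D_ADD), their surviving channels must match, so both branches
# share one channel mask.
shared_mask = (torch.rand(64) > 0.3).float()   # one indicator per channel, values in {0, 1}

branch_a = torch.randn(1, 64, 16, 16)          # output of node i
branch_b = torch.randn(1, 64, 16, 16)          # output of node j (skip branch)

# Applying the same mask to both branches keeps the element-wise add well defined.
mask = shared_mask.view(1, -1, 1, 1)
out = branch_a * mask + branch_b * mask
```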
All-In-One Framework: Summary

    min_{w, e} C(w, e)

subject to:
- Pruning: e_ij ∈ {0, 1}
- NAS: Σ_i e_ik = 1 for k ∈ O_NAS
- Quantization: w ∈ M_Q
- Skip connection: e_ik = e_jk for k ∈ D_ADD
- Hardware-aware resource constraints: E(e) ≤ E_budget, T(e) ≤ T_budget, C(e) ≤ C_budget, M(e) ≤ M_budget, N(e) ≤ N_budget, S(e) ≤ S_budget
- Modular booster: Σ_j e_jk mod N = 0 (N = 8)

How to use these constraints? All the details are already implemented in Hammer; users simply turn them on/off and set the resource budgets (a hypothetical configuration sketch follows).
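A purely hypothetical sketch of what "turn constraints on/off and set resource budgets" could look like as user-facing configuration; every key and value below is invented for illustration and is not Hammer's real interface.

```python
# Hypothetical configuration sketch (invented names, not Hammer's API):
compression_config = {
    "pruning": True,
    "quantization": True,
    "nas": False,
    "constraints": {           # enable only the budgets you care about
        "latency_ms": 10.0,
        "flops_m": 800.0,
        "memory_mb": 512.0,
    },
    "modular_booster_n": 8,    # keep channel counts as multiples of 8
}
```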
Hammer! Comparison with SOTA Algorithms

SOTA Benchmark Comparison: FLOPs & Accuracy
- Evaluation setup: ResNet56 on CIFAR10
- [Plot: Hammer compression vs. the uncompressed baseline; delta top-1 accuracy (%) vs. FLOPs compression ratio w.r.t. the original DNN model (%)]

SOTA Benchmark Comparison: FLOPs & Accuracy
- Evaluation setup: MobileNetV2 on CIFAR10
- [Plot: Hammer compression for MobileNetV2 on CIFAR10 vs. the uncompressed baseline; delta top-1 accuracy (%) vs. FLOPs compression ratio w.r.t. the original DNN model (%)]

SOTA Benchmark Comparison: BitOPs & Accuracy
- Evaluation setup: ResNet56 on CIFAR10
- [Plot: Hammer compression vs. the uncompressed baseline; delta top-1 accuracy (%) vs. BitOPs compression ratio w.r.t. the original DNN model (%)]
Hammer! Fits Various Application Scenarios
Applications to showcase:
- Frame-by-frame video enhancement: de-artifacts, video super-resolution, video cartoonization
- Real-time visual detection and recognition: face landmarks detection, hand segmentation, 3D hand pose detection
- Sequential model compression: video style transfer, 3D human mesh reconstruction
- Recommendation system: recall model, ranking model

Hammer! Frame-by-Frame Video Enhancement

Hammer! in De-Artifacts
- De-Artifacts (DeArts): video makers upload compressed video, and video compression causes artifacts; a complicated DNN removes the artifacts
- [Demo: original vs. DeArts output]
Hammer! in De-Artifacts Compared to SOTA
- Hammer! boosts the efficiency of De-Artifacts
- [Plot: DeArts compression, accuracy vs. latency (ms); methods compared: First_order, Meta_Pruning, Group_Sparsity, DMCP, NetAdapt, AMC, Hammer (FLOPs), Hammer (Latency)]

Hammer! in Low-FLOPs Video Super-Resolution (x4)
- [Demo: input video; original model (7 frames/s) vs. pruned model (40 frames/s)]

Hammer! in Low-FLOPs Video Cartoonization
- [Demo: input video; original model (4G FLOPs) vs. pruned model (755M FLOPs, runs on mid-range phones)]

Hammer! Real-Time Visual Detection and Recognition

Hammer! in Low-Latency Face Landmarks Detection
- [Plot: model pruning comparison, accuracy vs. latency (ms); baseline vs. pruned model guided by latency vs. pruned model guided by FLOPs, with model sizes ranging from 8.5M to 18.2M parameters]

Hammer! in 3D Hand Pose Estimation
- Latency reduced to 60% of the original (manually optimized) model, with no accuracy drop

Hammer! Sequential Model Compression
Single-Frame Model to Compressed Recurrent Model
- The user provides a DNN model with image input to Hammer
- Hammer turns the per-frame model Out_t = F(Input_t) into a recurrent model Out_t = F(Input_t, Out_{t-1})
- Training combines the original train loss with a temporal loss (sketched below)
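The conversion can be pictured roughly as below: a per-frame model is wrapped so that it also consumes its previous output, and training adds a temporal term that penalizes frame-to-frame jitter. This is an illustrative PyTorch sketch with made-up layer sizes and loss weights, not Hammer's actual conversion.

```python
import torch
import torch.nn as nn

# Turn a per-frame model Out_t = F(Input_t) into a recurrent one Out_t = F(Input_t, Out_{t-1}).
class RecurrentWrapper(nn.Module):
    def __init__(self, channels: int = 3):
        super().__init__()
        # frame model that also sees the previous output, concatenated on the channel axis
        self.frame_model = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, frames):                        # frames: (T, B, C, H, W)
        prev = torch.zeros_like(frames[0])
        outputs = []
        for frame in frames:
            out = self.frame_model(torch.cat([frame, prev], dim=1))
            outputs.append(out)
            prev = out
        return torch.stack(outputs)

model = RecurrentWrapper()
frames = torch.randn(4, 1, 3, 32, 32)                 # 4 frames of a toy video
targets = torch.randn(4, 1, 3, 32, 32)
outputs = model(frames)

train_loss = nn.functional.mse_loss(outputs, targets)           # per-frame task loss
temporal_loss = (outputs[1:] - outputs[:-1]).pow(2).mean()      # penalize jitter between frames
loss = train_loss + 0.1 * temporal_loss
loss.backward()
```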
Video Style Transfer (Frame-by-Frame vs. Sequential Model)
- Original model (frame-by-frame)
- Pruned model (frame-by-frame, 30% latency): severe jitters
- Hammer! pruned sequential model (30% latency): smooth tracking under the same latency

Hammer! in 3D Human Mesh Reconstruction
- Original model (frame-by-frame, 390M FLOPs)
- Pruned model (frame-by-frame, 250M FLOPs): severe jitters
- Hammer! pruned sequential model (250M FLOPs): smooth tracking under the same FLOPs
Hammer! Recommendation System

Hammer! in Recommendation System
- Online inference: prune the two-tower recall model
- User features and item features each pass through an MLP tower that outputs a 64-dimensional embedding; the dot product of the user-embedding batch and the item-embedding matrix gives a batch x item-size score matrix, the top-k indices are returned, and training uses a cross-entropy loss over the score matrix (a toy sketch follows)
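For reference, a toy two-tower recall model of this shape might look like the following; the 64-d embedding follows the slide, while the input feature dimension (48, borrowed from the next slide's table), hidden sizes, batch size, and k are illustrative assumptions, not the production model.

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """One tower of a two-tower recall model: features -> MLP -> 64-d embedding."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 64),                       # 64-d embedding per user / item
        )

    def forward(self, x):
        return self.mlp(x)

user_tower, item_tower = Tower(in_dim=48), Tower(in_dim=48)

users = torch.randn(32, 48)                           # a batch of user feature vectors
items = torch.randn(1000, 48)                         # candidate item feature vectors

scores = user_tower(users) @ item_tower(items).T      # batch x item-size score matrix (dot product)
topk = scores.topk(k=10, dim=1).indices               # top-k item indices per user (recall)

# Training would use a cross-entropy loss over the score matrix (as in the slide);
# compression then prunes/quantizes the towers to raise online inference QPS.
```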
Hammer! in Recommendation System
- Goal: increase online inference QPS without hurting performance
- The user/item embedding feature size and data type influence online inference QPS:

  Model                            | User/item feature size | Feature data type | Supported item size
  Original recall model            | 48                     | Float32           | 1.5 million
  Pruned & quantized recall model  | 32                     | Int8              | 3 million

- The pruned & quantized recall model brings roughly +300k yuan per day and more user clicks

Hammer! in Recommendation System
- Baseline vs. different model compression methods over the DLRM benchmark (DLRM: Facebook's open-source recommendation framework)
- [Plot: accuracy vs. compression ratio (CR); methods include the dense model, handcrafted compression, and MPTP]

Hammer! Awards
- The 2019 CCF Science and Technology Award (Outstanding Prize for Scientific and Technological Progress) was granted to Beijing Dajia Internet Information Technology Co., Ltd. (北京達佳互聯信息技術有限公司) for its real-time AI platform project; the main contributors include 鄭文, 劉, 王華意, 張國鑫, 師小燕, 邊紅昌, et al.

THANKS