Performance Improvement of Deep Learning Systems
Chen Junjie (陳俊潔), Tianjin University

Speaker: Chen Junjie is a recipient of the National Science Fund for Excellent Young Scholars, a distinguished research fellow and doctoral supervisor at Tianjin University, and the leader of its software engineering group. His research focuses on fundamental software testing, trustworthy artificial intelligence, and data-driven software engineering. He has received the CAST Young Elite Scientists Sponsorship, the CCF Outstanding Doctoral Dissertation Award, and the First Prize of the Natural Science Award of the Chinese Institute of Electronics, among other honors. In recent years he has published about 70 papers, including more than 50 at CCF-A venues, and has won six best paper awards (five ACM SIGSOFT Distinguished Paper Awards at CCF-A conferences and one Best Research Paper Award at the CCF-B conference ISSRE). His results have been adopted by well-known companies such as Huawei and Baidu. He served as Review Process Chair of the CCF-A conference ASE 2021, co-chair of a Dagstuhl Seminar, and program committee member of all CCF-A conferences in software engineering.

CONTENTS
1. Regression performance improvement of deep learning systems
2. Robustness improvement of deep code models
3. On-the-fly performance improvement of deployed deep code models

PART 01: Regression Performance Improvement of Deep Learning Systems

Regression in Deep Learning Systems
It is important to detect regression faults!
A DL system evolves from Ver1.0 to Ver2.0 to Ver3.0 under new requirements and fixing/improvement; for example, accuracy rises from 40% to 60%, yet inputs handled correctly by the old version may be mispredicted by the new one.
Existing Works Have Limitations
SOTA techniques cannot be directly adapted to solve this issue. For example, DL System Ver1.0 (accuracy = 91%) evolves into Ver2.0 (accuracy = 91.5%) through a neuron change rather than a code change.

Regression fuzzing in traditional software locates code changes in software evolution and utilizes them to guide the regression fuzzing. However, DL systems do not have explicit logical structures, and a neuron change affects nearly all the neurons while a code change only affects limited parts.

Fuzzing for deep learning models (e.g., DeepHunter, which fuzzes guided by fine-grained neuron coverage in a specific version, and DiffChaser, which detects disagreements in quantization by generating test cases toward the decision boundary) ignores the difference between different versions of the DL model, leading to poor fault-triggering, and overlooks important properties of testing such as fidelity and diversity.

Our Idea of DRFuzz
Challenge 1: Fault-triggering. Solution: amplifying the prediction difference between versions through effective mutation to trigger more faults.
Challenge 2: Fidelity. Solution: designing a GAN-based fidelity assurance method to ensure fidelity.
Challenge 3: Diversity. Solution: using seed maintenance to generate test inputs that trigger different regression faults.
Our Approach: DRFuzz
- Mutation: generates fault-triggering test inputs (mutation rules + mutation rule selection).
- GAN-based fidelity assurance: guarantees the fidelity of test inputs (GAN scoring & filtering).
- Seed maintenance: improves the diversity of test inputs (tree-based trimming + seed probability update).
Pipeline: a seed is drawn from the seed pool and mutated under a selected mutation rule (input mutation); the GAN discriminator scores the mutated inputs and keeps only high-fidelity ones; these potential test inputs are executed on both the original model and the regression model and evaluated to detect regression faults; seed maintenance (tree-based trimming and seed probability update) then feeds the results back into the seed pool.
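To make the pipeline concrete, here is a minimal Python sketch of such a fuzzing loop. The parameter names (fidelity_score, predict_old, predict_new, ...) are illustrative placeholders rather than DRFuzz's actual API, and seed/rule selection is simplified to uniform sampling.

```python
import random

def drfuzz_style_loop(seeds, mutation_rules, fidelity_score, predict_old, predict_new,
                      fidelity_threshold=0.5, budget=1000):
    """Minimal sketch of a regression-fuzzing loop in the spirit of DRFuzz.

    seeds:           list of (input, label) pairs predicted correctly by the original model
    mutation_rules:  list of functions mapping an input to a mutated input
    fidelity_score:  GAN-discriminator score in [0, 1] (see the fidelity-assurance sketch)
    predict_old/new: prediction functions of the original / regression model
    """
    regression_faults = []
    for _ in range(budget):
        x, label = random.choice(seeds)          # seed selection (DRFuzz uses maintained probabilities)
        rule = random.choice(mutation_rules)     # rule selection (MCMC-guided in DRFuzz)
        mutated = rule(x)
        if fidelity_score(mutated) < fidelity_threshold:
            continue                             # fidelity assurance: discard low-fidelity inputs
        if predict_old(mutated) == label and predict_new(mutated) != label:
            regression_faults.append(mutated)    # correct on the old version but wrong on the new one
            seeds.append((mutated, label))       # fault-triggering inputs become new seeds
    return regression_faults
```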
Mutation
We design 16 mutation rules, covering pixel-level and image-level mutation.
Pixel-level mutation: e.g., pixel coloring reverse, pixel shuffling.
Image-level mutation: e.g., image rotating, image translation.
MCMC-guided mutation rule selection: mutation rules that can generate test inputs with high fidelity and amplify the prediction difference towards becoming a regression fault should be selected frequently; the selection score of a rule therefore combines the number of regression-fault-triggering inputs it has produced with the fidelity of its mutated inputs.
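The MCMC-guided selection can be sketched as a Metropolis-Hastings-style sampler over mutation rules. The concrete score used below (the smoothed fraction of a rule's mutants that were both high-fidelity and regression-fault-triggering) is an assumption for illustration, not necessarily the exact formula used by DRFuzz.

```python
import random

class MCMCRuleSelector:
    """Sketch of MCMC-guided mutation rule selection (Metropolis-Hastings acceptance)."""

    def __init__(self, rules):
        self.rules = list(rules)
        self.trials = {r: 1 for r in self.rules}      # smoothed counters to avoid division by zero
        self.successes = {r: 1 for r in self.rules}
        self.current = random.choice(self.rules)

    def score(self, rule):
        # fraction of this rule's mutants that were high-fidelity and fault-triggering
        return self.successes[rule] / self.trials[rule]

    def select(self):
        candidate = random.choice(self.rules)
        accept_prob = min(1.0, self.score(candidate) / self.score(self.current))
        if random.random() < accept_prob:             # better rules always accepted, worse ones occasionally
            self.current = candidate
        return self.current

    def feedback(self, rule, high_fidelity, fault_triggering):
        self.trials[rule] += 1
        if high_fidelity and fault_triggering:
            self.successes[rule] += 1                 # rules producing useful mutants get selected more often
```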
GAN-based Fidelity Assurance
We use DCGAN (a GAN-based approach) to check whether mutated inputs preserve the semantics of their seeds, which reduces the risk of discarding high-fidelity test inputs produced by image-level mutation rules.
Training phase: the generator and the discriminator are trained on the training set.
Predicting phase: the trained discriminator scores the seed and the mutated input (e.g., 0.90 vs. 0.91), and low-scoring mutated inputs are filtered out.
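A minimal sketch of the scoring-and-filtering step, assuming a DCGAN discriminator (here a PyTorch module) already trained on the original training data that outputs a realness score in [0, 1] per image; the concrete keep/discard criterion is illustrative.

```python
import torch

def filter_by_fidelity(discriminator, seed, mutants, threshold=0.5):
    """Keep only mutated inputs whose discriminator (realness) score is high enough.

    seed:    tensor of shape C x H x W
    mutants: list of tensors of shape C x H x W derived from the seed
    """
    with torch.no_grad():
        scores = discriminator(torch.stack(mutants)).squeeze(-1)   # one realness score per mutant
        seed_score = discriminator(seed.unsqueeze(0)).item()        # e.g., 0.90 for the seed itself
    # High-fidelity mutants look about as natural as the seed (e.g., 0.91 vs. 0.90);
    # clearly less natural ones are filtered out before execution on the two model versions.
    return [m for m, s in zip(mutants, scores.tolist()) if s >= min(threshold, seed_score)]
```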
Seed Maintenance
Tree-based trimming: the trimming process aims to trigger more diverse faulty behaviors by removing redundant seeds and adjusting the seed selection probability.
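A minimal sketch of the trimming idea, with a hypothetical covers_same_fault predicate standing in for DRFuzz's tree-based redundancy check.

```python
def trim_seed_pool(seeds, covers_same_fault, max_redundant=2):
    """Drop seeds that keep exercising the same faulty behavior as already-kept seeds,
    so that seed selection probability shifts toward seeds with diverse behaviors."""
    kept = []
    for seed in seeds:
        redundant_with = sum(1 for other in kept if covers_same_fault(seed, other))
        if redundant_with < max_redundant:
            kept.append(seed)
    return kept
```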
Subjects and Regression Scenarios

Task                      Dataset        Train Set  Test Set  Model
Digit Recognition         MNIST          60k        10k       LeNet5
Object Recognition        CIFAR-10       60k        10k       VGG16
Clothes Recognition       FASHION-MNIST  60k        10k       AlexNet
Road Number Recognition   SVHN           73,257     26,032    ResNet18

Regression scenarios: supplementary training, adversarial training, model fixing, and model pruning.
The subjects are diverse, involving different tasks/models/regression scenarios.

RQ1: Effectiveness
Effectiveness on different regression scenarios: DRFuzz outperforms the compared approaches stably on all the regression scenarios in terms of various metrics.
Metrics: #RFI: regression fault-triggering test inputs; #RF: dynamic diversity of test inputs, measured by (seed, faulty behavior) pairs; #Seed: static diversity of test inputs, measured by seeds; #GF: general faults detected on the regression model.
RQ2: Ablation

Approach                            #RFI    #RF     #Seed   #GF
DRFuzz                              70,093  16,464  6,942   231,675
DRFuzz-r (no MCMC)                  53,037  14,309  6,523   185,354
DRFuzz-NG (no GAN)                  83,042  21,044  7,748   279,544
DRFuzz-NSM (no seed maintenance)    36,936  7,109   3,239   136,723

Ablation experiment results: compared with DRFuzz (left), the inputs of DRFuzz-NG (right) are blurry, noisy, or over-changed. The GAN-based fidelity assurance technique filters out more than 20% of fault-triggering inputs with low fidelity.

RQ3: Robustness Enhancement
Fine-tuning accuracy on different regression scenarios: fine-tuning DL models with the test inputs generated by DRFuzz can fix 77.72%–87.03% of the regression faults found by DRFuzz, and can defend against 52.26%–80.68% of the attacks from DiffChaser and 66.63%–79.88% of the attacks from DeepHunter.
PART 02: Robustness Improvement of Deep Code Models

Deep Code Models
DL has been widely used to process source code, e.g., code generation, clone detection, authorship attribution, functionality classification, and code completion.

Model Robustness is Critical
Adversarial examples are important to test and enhance model robustness!
(1) Testing: feeding adversarial examples to the deep code model and inspecting its prediction results yields a testing report.
(2) Enhancement: adding adversarial examples to the training set produces an augmented set for adversarial training.

Unique Characteristics of Adversarial Examples for Deep Code Models
(1) The inputs (i.e., source code) of deep code models are discrete.
(2) Source code has to strictly stick to complex grammar and semantics constraints.
Conclusion: the existing adversarial example generation techniques in other areas are hardly applicable to deep code models.
Deep Code Models are not Robust
Semantic-preserving adversarial examples can alter the prediction results!

Workflow of current techniques
(1) Designing semantic-preserving code transformation rules (identifier renaming, etc.).
(2) Searching for ingredients in the transformation space that turn an original input into a semantic-preserving adversarial example (e.g., until the model prediction changes).

Adversarial examples generated by ALERT and by CARROT look like the following. Original input:

    void main() {
        char a[101] = {0};
        gets(a);
        // Some code.
    }

Adversarial example (the identifier a is renamed to argc, and the dead statement while(0); is inserted):

    void main() {
        char argc[101] = {0};
        gets(argc);
        while(0);
        // Some code.
    }
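The search step of this workflow can be sketched as a greedy loop over identifier renamings. The helpers predict, confidence, and candidates are hypothetical, and the regex-based rename is only for illustration; real tools such as ALERT and CARROT operate on the program's AST.

```python
import re

def rename_identifier(code, old, new):
    # Naive token-level rename used only for illustration; real tools rename via the AST.
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

def greedy_rename_attack(code, identifiers, candidates, predict, confidence, label):
    """Greedily apply semantic-preserving identifier renamings that lower the model's
    confidence in the ground-truth label until its prediction changes.

    predict(code) -> predicted label; confidence(code, label) -> probability of `label`;
    candidates[ident] -> list of substitute names for `ident`."""
    current = code
    for ident in identifiers:
        # Greedy step: among the candidate substitutes, keep the most damaging rename.
        best = min((rename_identifier(current, ident, c) for c in candidates[ident]),
                   key=lambda mutated: confidence(mutated, label),
                   default=current)
        if confidence(best, label) < confidence(current, label):
            current = best
        if predict(current) != label:
            return current        # prediction changed: a semantic-preserving adversarial example
    return None                   # the search failed within this ingredient space
```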
Limitations
SOTA techniques still suffer from effectiveness and efficiency issues!
(1) The ingredient space is enormous (almost infinite). Consider the target input f1 (ground-truth label: sort; prediction: sort, 96.52%):

    void f1(int a[], int n) {
        int i;
        int j;
        int k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n - i - 1; j++)
                if (a[j] > a[j + 1]) {
                    k = a[j];
                    a[j] = a[j + 1];
                    a[j + 1] = k;
                }
    }

Its identifiers are a, n, i, j, k, and each has a long list of candidate ingredients, e.g.:
a: aa, array, at, area, au, am, alpha, ata, ad, auto, argc, ac, ar, ab, ...
n: nu, sn, nc, len, cn, m, ns, pn, nb, nn, np, x, un, nan, fn, num, nt, ...
i: it, chi, li, ui, ci, ia, ei, iii, oi, ini, ji, ai, phi, bi, gi, ie, ik, ...
j: jump, js, jit, jc, jan, jp, ji, kj, bj, oj, adj, jl, aj, jj, je, ja, ...
k: uk, ko, ku, kw, sk, key, ck, ak, mk, ky, tk, ks, kin, ke, km, rank, ...
With several identifiers and many candidate ingredients per identifier, the number of combinations grows combinatorially.
(2) The greedy search process guided by model prediction changes is likely to fall into a local optimum.
(3) Frequently invoking the target model during adversarial example generation harms test efficiency.
Novel Perspective: Code-Difference-Guided Adversarial Example Generation

Target input f1 (ground-truth label: sort; prediction: sort, 96.52%):

    void f1(int a[], int n) {
        int i;
        int j;
        int k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n - i - 1; j++)
                if (a[j] > a[j + 1]) {
                    k = a[j];
                    a[j] = a[j + 1];
                    a[j + 1] = k;
                }
    }

Reference input f2 (ground-truth label: palindrome; prediction: palindrome, 99.98%):

    int f2(int t[], int len) {
        int i;
        int j;
        i = 0;
        j = 0;
        while (len != 0) {
            t[i] = len % 10;
            len /= 10;
            i = i + 1;
        }
        while (j < i) {
            if (t[j] != t[(i - j) - 1])
                return 0;
            j = j + 1;
        }
        return 1;
    }

Adversarial example f3 (ground-truth label: sort; prediction: palindrome, 90.88%):

    void f3(int t[], int len) {
        int i;
        int j;
        int k;
        i = 0;
        while (i < len) {
            j = 0;
            while (j < len - i - 1) {
                if (t[j] > t[j + 1]) {
                    k = t[j];
                    t[j] = t[j + 1];
                    t[j + 1] = k;
                }
                j = j + 1;
            }
            i = i + 1;
        }
    }

The target input f1 and the reference input f2 have different semantics but a small code difference. The adversarial example f3 preserves the semantics of f1 while reducing the code difference brought by f2, which shrinks the search space dramatically compared with the original ingredient space.
Our Approach: CODA

Overview of CODA: given the target inputs, CODA selects reference inputs, then reduces the structure difference via equivalent structure transformation and the identifier difference via identifier renaming transformation, yielding adversarial examples for the target model. The adversarial examples are used both to test the model and to enhance it (adding them to the training set produces an augmented set).

Reference Inputs Selection
How to select reference inputs for reducing the ingredient space?
(1) The target input is fed to the code model to obtain its softmax confidence (e.g., 0.1 / 0.0 / 0.6 / 0.3); since the prediction result is more likely to be changed from the 1st class to the 2nd class, reference inputs are drawn from the training data of the 2nd class.
(2) Smaller code difference can effectively limit the number of ingredients, so candidates are ranked by masked code similarity and the top-N reference inputs are kept. A sketch of this selection step follows.
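This is a minimal sketch of the selection step. It assumes predict_probs returns the model's softmax confidences as a list, training_data is a list of (code, label) pairs, and similarity is a masked-code-similarity function; the names are illustrative, not CODA's API.

```python
def select_reference_inputs(target_code, training_data, predict_probs, similarity, top_n=5):
    """Select reference inputs from the class the prediction is most likely to flip to,
    preferring candidates with small code difference to the target input."""
    probs = predict_probs(target_code)
    ranked_classes = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    second_class = ranked_classes[1]      # the 2nd most-likely class of the target input
    # Candidate reference inputs: training data labelled with the 2nd most-likely class.
    candidates = [code for code, label in training_data if label == second_class]
    # Smaller code difference limits the number of ingredients, so keep the most similar ones.
    candidates.sort(key=lambda code: similarity(target_code, code), reverse=True)
    return candidates[:top_n]
```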
Equivalent Structure Transformation
How to reduce the structure difference between the target input and the reference inputs?
(1) Applying equivalent structure transformation rules in a probabilistic way to reduce the difference between structure occurrence distributions (an example is sketched below).
(2) Considering all common kinds of code structures (i.e., loop, branch, and sequential).
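To illustrate point (1), here is a hypothetical equivalent structure transformation shown on a small Python function: a for loop is rewritten as a semantically equivalent while loop, so that the target input's loop-structure distribution moves toward that of a reference input that uses while loops (mirroring the f1 → f3 example above).

```python
# Target input: counts the even numbers in a list using a for loop.
def count_even_for(xs):
    count = 0
    for x in xs:
        if x % 2 == 0:
            count += 1
    return count

# After an equivalent structure transformation: the same behavior expressed with a while loop,
# matching a reference input whose code uses while loops.
def count_even_while(xs):
    count = 0
    i = 0
    while i < len(xs):
        if xs[i] % 2 == 0:
            count += 1
        i += 1
    return count

assert count_even_for([1, 2, 3, 4]) == count_even_while([1, 2, 3, 4]) == 2
```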
Identifier Renaming Transformation
How to reduce the identifier difference between the target input and the reference inputs?
(1) Identifier renaming transformation replaces identifiers in the (intermediate) target input with identifiers from the reference inputs, based on identifier similarity, through an iterative transformation that finally yields the adversarial example.
(2) To ensure naturalness, we consider the semantic similarity between identifiers and design an iterative transformation process; a sketch follows.
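This is a minimal sketch of the iterative renaming step. The helpers embed (identifier embedding), confidence (model confidence in the ground-truth label), and rename (AST-based renaming) are hypothetical, and the acceptance rule is an assumption for illustration.

```python
import numpy as np

def identifier_renaming(code, target_idents, reference_idents, embed, confidence, label, rename):
    """Iteratively replace identifiers of the (intermediate) target input with semantically
    similar identifiers from the reference inputs, keeping only renames that push the model
    away from the ground-truth label."""
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

    current = code
    for ident in target_idents:
        # Reference identifiers ordered by semantic similarity, to keep the renames natural.
        ranked = sorted(reference_idents, key=lambda r: cosine(embed(ident), embed(r)), reverse=True)
        for ref in ranked:
            candidate = rename(current, ident, ref)
            if confidence(candidate, label) < confidence(current, label):
                current = candidate          # keep renames that lower confidence in the true label
                break
    return current
```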
Subjects

Task                           Train/Validate/Test       Classes  Language  Model                               Acc.
Vulnerability Prediction       21,854/2,732/2,732        2        C         CodeBERT / GraphCodeBERT / CodeT5   63.76% / 63.65% / 63.83%
Clone Detection                90,102/4,000/4,000        2        Java      CodeBERT / GraphCodeBERT / CodeT5   96.97% / 97.36% / 98.08%
Authorship Attribution         528/132                   66       Python    CodeBERT / GraphCodeBERT / CodeT5   90.35% / 89.48% / 92.30%
Functionality Classification   41,581/10,395             104      C         CodeBERT / GraphCodeBERT / CodeT5   98.18% / 98.66% / 98.79%
Defect Prediction              27,058/6,764              4        C/C++     CodeBERT / GraphCodeBERT / CodeT5   84.37% / 83.98% / 81.54%

The subjects are diverse, involving different tasks/models/classes/languages: 5 tasks, 3 pre-trained models, 2–104 classes, and 4 programming languages.
RQ1: Effectiveness and Efficiency
CODA outperforms ALERT and CARROT in terms of the rate of revealed faults (RFR), and it requires less time and fewer model invocations than ALERT and CARROT.
Metrics: rate of revealed faults; model invocations.

RQ2: Model Robustness Enhancement
CODA helps enhance model robustness more effectively than ALERT and CARROT, in terms of reducing the faults revealed by adversarial examples on the evaluation set after training on the augmented training set.
Metric: accuracy.

RQ3: Contribution of Main Components
We constructed the following variants of CODA: w/o RIS (reference inputs selection), w/o EST (equivalent structure transformation), w/o CDG (code difference guidance in EST), and w/o IRT (identifier renaming transformation).
All the main components contribute to the overall effectiveness of CODA.
Metric: rate of revealed faults.

RQ4: Naturalness of Adversarial Examples
User study (5-point Likert scale) with 4 participants and 450 code snippets.
The adversarial examples generated by CODA are natural, close to those of the naturalness-aware ALERT.
PART 03: On-the-fly Performance Improvement of Deployed Deep Code Models

Performance Issues with Deployed Deep Code Models
A deployed model produces both correct and erroneous predictions; its accuracy is below 100%. It is crucial to improve the performance of deployed deep code models!

Challenges in enhancing deployed model performance
Existing strategies:
(1) Designing more advanced networks for retraining models.
(2) Incorporating more data for fine-tuning models.
Limitations:
(1) Time-consuming, due to manual labeling and heavy computation.
(2) Largely inexplicable, due to complex parameters and datasets.
Many Mispredictions are Caused by Noise in Inputs
Denoising in the image processing field [1]: LRCnet turns a noisy image into a denoised image. Reason for the noise: complex environment, image quantization. Format: continuous pixel values.
Denoising in the speech recognition field [2]: AeGAN turns noisy speech into denoised speech. Reason for the noise: background noise, different speakers. Format: continuous signal values.

Advantages of Input Denoising
(1) Improving the model performance on-the-fly.
(2) Retraining-free, boosting efficiency.

Limitations for Denoising Code
(1) Denoising in a continuous space vs. discrete inputs.
(2) Complex syntactic and semantic constraints in code.
(3) The need to enhance the explainability of the technique for correcting mispredictions.

[1] Ren J., Zhang Z., et al. "Robust low-rank convolution network for image denoising." MM 2022.
[2] Abdulatif S., Armanious K., et al. "AeGAN: Time-frequency speech denoising via generative adversarial networks." EUSIPCO 2022.

Input Denoising for Deep Code Models
(1) Noisy code → (2) denoised code.
Challenges
This motivates the potential of improving the performance of (deployed) deep code models on-the-fly through identifier-level input denoising.
(1) How to identify mispredicted inputs among the incoming code snippets?
(2) How to localize the noise (at the identifier level) that causes the misprediction in a given code snippet?
(3) How to cleanse the noise so that the code snippet is predicted correctly?
Noisy identifiers: the identifiers that make the largest contribution to the misprediction.

Overview of CodeDenoise
An incoming code snippet flows through mispredicted input identification, noise localization, and noise cleansing, after which the deployed model returns its prediction to the user.
The usage of CodeDenoise in practice: we treat CodeDenoise as a post-processing module and integrate it with the deployed code model as a single system for making predictions in practice.

Mispredicted Input Identification
C1: How to identify mispredicted inputs among the incoming code snippets?
(1) In the field of CV, randomized smoothing is widely used to certify the classification result of a given image by checking the results of randomly perturbed images in its neighborhood.
(2) To design an adapted randomized smoothing for deep code models, we should (a) define the perturbation strategy and (b) control the perturbation degree on the input code.
Pipeline: the incoming code snippet is perturbed via identifier-renaming perturbation under a threshold (randomized smoothing); the perturbed code snippets are fed to the deep code model, and the identification result indicates whether the input is likely mispredicted. A sketch follows.
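This is a minimal sketch of the identification step. The perturbation helper, the number of perturbed variants, and the disagreement threshold are illustrative assumptions rather than CodeDenoise's exact configuration.

```python
def likely_mispredicted(code, predict, rename_random_identifier, n_perturbed=5, threshold=1):
    """Randomized-smoothing-style check for code models: perturb the input by randomly
    renaming identifiers and flag the input when too many perturbed variants disagree
    with the original prediction (an unstable prediction is likely a misprediction)."""
    original_pred = predict(code)
    disagreements = 0
    for _ in range(n_perturbed):
        perturbed = rename_random_identifier(code)   # identifier-renaming perturbation
        if predict(perturbed) != original_pred:
            disagreements += 1
    return disagreements >= threshold
```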
Noise Localization
C2: How to localize the noise that causes the misprediction in a given code snippet?
(1) The attention mechanism is widely used to analyze the contribution of each element in the code (in particular, it is the core of the state-of-the-art Transformer architecture).
(2) Insight: for mispredicted inputs, the identifiers with larger contributions to the misprediction are more likely to be noise in the code snippet.
Pipeline: the misclassified code snippet is fed to the deep code model; its attention yields a code heatmap, from which the noisy identifiers are ranked (e.g., identifier_1: 0.33, identifier_2: 0.20, ..., identifier_k: 0.09).
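A minimal sketch of this localization step, assuming the per-token attention scores have already been aggregated over heads and layers for the mispredicted input; the aggregation and ranking scheme is illustrative.

```python
def localize_noisy_identifiers(attention_weights, identifier_positions, top_k=1):
    """Rank identifiers by the attention mass of their occurrences; the ones contributing
    most to the misprediction are treated as noisy identifiers.

    attention_weights:    list of per-token attention scores for the mispredicted input
    identifier_positions: dict mapping an identifier name to its token indices
    """
    scores = {ident: sum(attention_weights[p] for p in positions)
              for ident, positions in identifier_positions.items()}
    total = sum(scores.values()) or 1.0
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # e.g., identifier_1: 0.33, identifier_2: 0.20, ..., identifier_k: 0.09 after normalization
    return [(ident, score / total) for ident, score in ranked[:top_k]]
```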
Noise Cleansing
C3: How to cleanse the noise so that the code snippet is predicted correctly?
(1) Existing masked identifier prediction (MIP) models aim to predict the tokens at the masked locations, but they only consider naturalness, not cleanliness.
(2) To predict a clean identifier to replace a noisy identifier, CodeDenoise builds a masked clean identifier prediction (MCIP) model based on clean training data.
Training phase: identifiers in the clean training data are masked and the MCIP model is trained with a loss against the masked clean identifier. Inference phase: the masked mispredicted code snippet is fed to the MCIP model, which produces the denoised code snippet. A sketch of the cleansing step follows.
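This is a minimal sketch of the cleansing step, with hypothetical mask_identifier, mcip_predict, and rename helpers standing in for the MCIP model and the code-rewriting utilities.

```python
def cleanse_noise(code, noisy_identifiers, mask_identifier, mcip_predict, rename):
    """Mask each localized noisy identifier, ask an MCIP-style model for a clean
    replacement name, and rename the identifier accordingly."""
    denoised = code
    for ident in noisy_identifiers:
        masked = mask_identifier(denoised, ident)   # e.g., replace every occurrence with <mask>
        clean_name = mcip_predict(masked)           # MCIP: predict a clean identifier for the mask
        if clean_name and clean_name != ident:
            denoised = rename(denoised, ident, clean_name)
    return denoised                                  # the denoised snippet is re-predicted by the deployed model
```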
Subjects

Task                                       Train/Validate/Test       Classes  Language  Model                               Acc.
Authorship Attribution                     528/132                   66       Python    CodeBERT / GraphCodeBERT / CodeT5   83.58% / 77.27% / 83.33%
Defect Prediction                          27,058/6,764              4        C/C++     CodeBERT / GraphCodeBERT / CodeT5   85.47% / 83.90% / 82.29%
Functionality Classification (C104)        41,581/10,395             104      C         CodeBERT / GraphCodeBERT / CodeT5   97.87% / 98.61% / 98.60%
Functionality Classification (C++1000)     320,000/80,000/100,000    1000     C++       CodeBERT / GraphCodeBERT / CodeT5   85.00% / 81.62% / 86.49%
Functionality Classification (Python800)   153,600/38,400/48,000     800      Python    CodeBERT / GraphCodeBERT / CodeT5   93.91% / 97.39% / 97.62%
Functionality Classification (Java250)     48,000/11,909/15,000      250      Java      CodeBERT / GraphCodeBERT / CodeT5   96.30% / 97.79% / 97.48%

The subjects are diverse, involving different tasks/models/classes/languages: 6 datasets, 3 pre-trained models, 4–1000 classes, and 4 programming languages.
RQ1: Effectiveness and Efficiency of CodeDenoise
CodeDenoise outperforms fine-tuning with a larger correction success rate and a smaller mis-correction rate, and it also outperforms fine-tuning in terms of overall accuracy.
Metrics: correction success rate, mis-correction rate, overall accuracy.

RQ2: Contribution of Each Main Component

Metric                    CodeDenoise_deepgini  CodeDenoise_randL  CodeDenoise_randC  CodeDenoise_MIP  CodeDenoise
Correction Success Rate   16.91%                14.65%             10.84%             15.50%           21.91%
Mis-correction Rate       0.52%                 0.41%              0.52%              0.34%            0.09%
#Identifier Changes       2.25                  3.79               3.27               2.27             1.58

We constructed four variants of CodeDenoise:
CodeDenoise_deepgini: replaces the randomized-smoothing-based strategy with a DeepGini-based strategy.
CodeDenoise_randL: replaces the attention-based strategy with a random strategy.
CodeDenoise_randC: replaces the MCIP-based strategy with a random strategy.
CodeDenoise_MIP: replaces the MCIP-based strategy with an MIP-based strategy.
All three main components contribute to the overall effectiveness of CodeDenoise.

RQ3: Influence of Hyper-parameters
We studied the influence of two hyper-parameters in CodeDenoise: the threshold that limits the perturbation degree and N, the number of perturbed code snippets. We obtained the default settings by balancing effectiveness and efficiency for practical use.

Threshold                 1       2       3       4       5
Correction Success Rate   21.91%  22.85%  23.95%  25.27%  26.08%
Mis-correction Rate       0.09%   0.14%   0.16%   0.20%   0.29%
Time (s)                  0.48    0.63    1.03    1.43    1.70

N                         1       2       3       4       5
Correction Success Rate   21.91%  23.30%  24.66%  25.25%  25.99%
Mis-correction Rate       0.09%   0.09%   0.08%   0.08%   0.08%
Time (s)                  0.48    0.71    0.87    1.13    1.63

THANKS