LLM-based Static Bug Detection
Yiling Lou, School of Computer Science and Technology, Fudan University

Static Bug Detection
- Goal: statically analyze the code of software to identify bugs/vulnerabilities.
- Static analysis tools.
- Data-driven and learning-based approaches: buggy/correct code instances are fed to ML/DL models.

Recent trend: LLM-based bug detection
- Researchers are very interested in "how effectively do LLMs detect bugs?"
- It seems quite promising that LLMs can sometimes identify bugs, but their precision and recall are still far from perfect in practice.
- Two directions: advanced prompting strategies and fine-tuning.

Recent trend: advanced prompting strategies in LLM-based bug detection
- Many studies explore how different prompting strategies can help LLMs in bug detection: crafted instructions, project information, CWE general knowledge, CoT reasoning, and AST/CFG in the prompt.
- It remains unexplored how LLMs perform compared to traditional techniques (i.e., those based on static analysis).

[1] Zhang C, Liu H, Zeng J, et al. Prompt-enhanced software vulnerability detection using ChatGPT. ICSE 2024 Poster.
[2] Purba, Moumita Das, et al. Software vulnerability detection using large language models. ISSREW 2023.
[3] Fu, Michael, et al. ChatGPT for vulnerability detection, classification, and repair: How far are we? APSEC 2023.

Limitations of traditional techniques
- Boundary of knowledge, e.g., the specifications of APIs are not comprehensively included.
- Scalability issue of the analysis mechanism, e.g., path explosion in inter-procedural analysis.
- Generality issue for specific domains, e.g., the checking rules for business-related bugs must be implemented manually.

How LLMs address the limitations of traditional techniques
- LLMs are good at summarizing the intention of code.
- LLMs can avoid diving into some procedures based on the API intention.
- LLMs can detect buggy behaviors based on natural-language descriptions.

This talk is about
1. Synergy of LLMs and static analysis: using LLMs to refine sources/sinks and reachability analysis.
2. Enhancing LLMs with a bug knowledge base: using LLMs to build and use a bug-specific knowledge base.

Case study on resource leak detection (Background)

    LockManager.acquireLock()
    if (LockManager != null)
        LockManager.releaseLock()

- RAR pair: the pair of a Resource Acquisition API method and a Resource Release API method, e.g., acquireLock() and releaseLock() above.
- Resource reachability validation: an unreachable resource would not cause leaks even without the release.

Case study on resource leak detection
- Workflow: construct control-flow paths, identify the paths related to the resource (resource acquire, resource release), check the resource reachability, and check whether the resource is released.
- This requires an accurate RAR pair pool and precise resource reachability validation.

Key challenges: existing static analysis tools
- They predefine a set of RAR pairs and perform string matching. Incomplete or incorrect RAR pairs lead to false positives/negatives.
- They predefine several rules (e.g., res != null). Missing unreachable paths lead to false positives.
- Precise context-sensitive and intuitive reasoning is challenging.

Limitation 1: Incomplete/incorrect RAR pair pool
- False negative (low recall): it is infeasible to detect resource leaks related to RAR pairs that are not defined in the initial RAR pair pool.
- Open-source projects contain a huge number of RAR pairs, e.g., 738 RAR pairs related to the Lock resource.
- Challenge 1: how to build a general resource leak detection tool that can cover a wide range of RAR pairs in diverse projects?

Limitation 2: Mechanical resource reachability validation
- False positive (low precision): an unreachable resource without a release would be reported as a resource leak.
- Potential reachability validation checks are missed, e.g., !res.Disabled().
- Challenge 2: how to build a general resource leak detection tool that can precisely identify the resource reachability validation in diverse projects?
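The two limitations can be made concrete with a toy path-based checker. This is only a minimal sketch under stated assumptions, not the implementation of any tool named in the talk: the function `find_leaks`, the step labels, and the `borrowConnection`/`returnConnection` pair are hypothetical, and a path is simplified to a flat list of API calls and branch conditions.

```python
# Toy path-based leak check. All names below are illustrative assumptions.

def find_leaks(path, rar_pairs, unreachable_guards):
    """Return acquisition APIs that are still unreleased at the end of `path`.

    path: step labels along one control-flow path (API calls, or branch
          conditions that hold on this path).
    rar_pairs: dict mapping an acquisition API to its release API.
    unreachable_guards: branch conditions meaning the resource was never
          actually obtained on this path.
    """
    unreleased = []
    for step in path:
        if step in unreachable_guards:
            # Simplification: one guard clears every resource on the path.
            return []
        if step in rar_pairs:
            unreleased.append(step)
            continue
        for acquire in unreleased:
            if rar_pairs[acquire] == step:
                unreleased.remove(acquire)  # matching release found
                break
    return unreleased


SEED_PAIRS = {"acquireLock": "releaseLock"}   # small predefined pool
SEED_GUARDS = {"res == null"}                 # small predefined rule set

leaky = ["acquireLock", "doWork"]
released = ["acquireLock", "doWork", "releaseLock"]
unknown_pair = ["borrowConnection", "doWork"]      # pair missing from the pool
unknown_guard = ["acquireLock", "res.Disabled()"]  # guard missing from the rules

# Extended knowledge, standing in for what mining or an LLM might supply.
MINED_PAIRS = {**SEED_PAIRS, "borrowConnection": "returnConnection"}
MINED_GUARDS = SEED_GUARDS | {"res.Disabled()"}

print(find_leaks(leaky, SEED_PAIRS, SEED_GUARDS))          # real leak reported
print(find_leaks(unknown_pair, SEED_PAIRS, SEED_GUARDS))   # leak missed
print(find_leaks(unknown_guard, SEED_PAIRS, SEED_GUARDS))  # false alarm
```

With the seed pool, the `borrowConnection` leak is missed (a false negative), and the path guarded by `res.Disabled()` is reported even though the resource is unreachable (a false positive). Passing `MINED_PAIRS` or `MINED_GUARDS` instead removes each error, which is the gap the two challenges describe.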
Motivation: to improve existing static analysis approaches
- Challenge 1: how to build a general resource leak detection tool that can cover a wide range of RAR pairs in diverse projects?
- Challenge 2: how to build a general resource leak detection tool that can precisely perform the resource reachability validation in diverse projects?
- Mine resource-related knowledge from the massive corpus of open-source software.
- Enhance existing analysis-based approaches with the mined knowledge for a better understanding of the code intention.
- Predefined rules -> resource management knowledge base (e.g., RAR pairs, reachability checking operations).
- [Figure labels: Domain Knowledge, Models]

Overview of MIROK
- Mining resource-related knowledge from the massive corpus of open-source software to improve resource leak detection.

Evaluation: the improvement over basic static analysis
- MIROK mines 1,313 new Abs-RAR pairs from 1,454,224 Java methods (89.2% are valid).
- MIROK instantiates 6,314 RAR pairs in 2,261 Maven libraries (93.3% are valid).
- The mined RAR pairs are released for the community and could be integrated into existing resource leak detection tools.
- Benchmark: 46,389 Java code snippets from Stack Overflow.
  - Our method: rule-based matching based on 1,197 valid Abs-RAR pairs.
  - Baseline: rule-based matching based on 26 seed Abs-RAR pairs.
  - Results: MIROK detects 761 leaks vs. 168 for the baseline; 73.4% (188) are manually checked as true positives.
- Benchmark: 10 compilable Java projects from GitHub.
  - Our method: Findbugs* = Findbugs + 6,314 RAR pairs mined by MIROK.
  - Baseline: original Findbugs.
  - Results: Findbugs* gives 15 reports, of which 7 are true bugs (PR was accepted); Findbugs gives 9 reports, of which 4 are true bugs.

Overview of InferROI
- Step 1: Use an LLM to identify resource-oriented code.
- Step 2: Provide static analysis with the identified resource-oriented code for resource leak detection.

InferROI: LLM-based intention inference
- [Figure: the prompt template in InferROI and the answer returned by GPT-4, formalized as resource-oriented intentions such as (client,167), (client,186), (client,185)]

InferROI: enhancing static analysis with identified intention
- Alternatively, the inferred intention can be represented in the format accepted by existing static analysis tools (e.g., as a source/sink specification query in CodeQL).

Evaluation: on existing resource leak detection datasets
- InferROI achieves the best trade-off between detection rate and false alarms.
- InferROI covers a wide range of resource types.

Evaluation: detecting unknown resource leaks in open-source projects
- In 100 methods sampled from open-source projects on GitHub, InferROI reports 16 resource leaks, of which 12 are annotated as true bugs (7 bugs are confirmed by developers through accepted PRs).

Evaluation: compared to basic GPT-4
- Directly applying GPT-4 without combining it with analysis techniques yields a very high number of false positives.

2. Enhancing LLMs with a bug knowledge base: using LLMs to build and use a bug-specific knowledge base.

Using existing bugs to boost LLM-based bug detection
- Similar bugs recur during software evolution or among similar software: similar code context, similar root cause, similar fixing solutions.
- Idea: provide relevant existing bugs in the input context of LLMs (using the in-context learning capabilities of models).

Using existing bugs to boost LLM-based bug detection via RAG
- A classic pipeline of RAG (Retrieval-Augmented Generation), see "A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models":
  - Step 1: Retrieving the relevant information from the knowledge base.
  - Step 2: Putting the retrieved information in the input prompt.
- RAG has shown promising effectiveness in many software engineering tasks: assertion generation, code completion, program repair.

Challenges: using existing bugs to boost LLM-based bug detection via RAG
Key components in RAG:
- Knowledge base: how to represent existing bugs in the knowledge base? Just code snippets?
- Retrieval mechanism: how to find the most relevant bugs? Code similarity?
- Inference mechanism: how to use the retrieved bugs to prompt LLMs? Directly append them to the input?

Motivating examples
- When retrieving based only on code similarity, it is very likely to get semantically or functionally different bugs. A better retrieval strategy is required.
- When only the relevant buggy code is put in the prompt, it is hard for LLMs to grasp the correlation between the retrieved bug and the given code. A better in-context prompting strategy is required.

Our insight
- Existing bugs are represented, summarized, and clustered into a knowledge-level representation: functionality, root cause, fixing solution.
- Instead of using straightforward code snippets, existing bugs are further represented as high-level knowledge in natural-language descriptions:
  1. To retrieve lexically different but semantically similar bugs.
  2. To facilitate the comprehension capabilities of LLMs for the input.

Vul-RAG Approach Pipeline: Knowledge-level RAG for vulnerability detection
- Step 1: Constructing a knowledge base of existing CVEs.
- Step 2: Retrieving the related vulnerability knowledge for the given code.
- Step 3: Reasoning whether the given code is vulnerable based on the retrieved knowledge.
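One way to picture a knowledge-level item and how it might be placed in a prompt is sketched below. This is a hedged illustration, not Vul-RAG's actual data model or prompt: the class `VulnerabilityKnowledge`, its field names, the prompt wording, and the example item (including its placeholder identifier) are all assumptions made for this sketch. Only the three dimensions (functionality, root cause, fixing solution) come from the slides.

```python
from dataclasses import dataclass


@dataclass
class VulnerabilityKnowledge:
    """One knowledge-base entry described in natural language (hypothetical schema)."""
    source_id: str        # placeholder identifier, not a real CVE
    functionality: str    # what the affected code is meant to do
    root_cause: str       # why the vulnerability arises
    fixing_solution: str  # how the patch removes it


def build_detection_prompt(code, knowledge_items):
    """Assemble a prompt that pairs the code with retrieved knowledge items."""
    lines = ["Code under analysis:", code, "", "Retrieved vulnerability knowledge:"]
    for index, item in enumerate(knowledge_items, start=1):
        lines.append(f"[{index}] {item.source_id}")
        lines.append(f"  Functionality: {item.functionality}")
        lines.append(f"  Root cause: {item.root_cause}")
        lines.append(f"  Fixing solution: {item.fixing_solution}")
    lines.append("")
    lines.append("Based on this knowledge, is the code vulnerable? Explain why.")
    return "\n".join(lines)


# Invented example item, for illustration only.
example_item = VulnerabilityKnowledge(
    source_id="EXAMPLE-0001",
    functionality="Releases a device buffer when the driver is closed.",
    root_cause="The buffer is freed while a pending callback still uses it.",
    fixing_solution="Cancel pending callbacks before freeing the buffer.",
)

prompt = build_detection_prompt("void drv_close(struct dev *d) { ... }", [example_item])
print(prompt)
```

Keeping each dimension as a separate text field means a retriever could match on functionality even when the code is lexically different, and the prompt can state the root cause and fix explicitly rather than leaving the model to infer them from raw buggy code.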
Vul-RAG Pipeline
- Step 1: Constructing the knowledge base (offline).
- Step 2: Retrieving relevant vulnerability knowledge.
  - Query generation: the abstract purpose, the detailed behavior, and the code itself.
  - Candidate knowledge retrieval: three-dimension similarity.
  - Candidate knowledge re-ranking: re-rank the candidate knowledge items with Reciprocal Rank Fusion.
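Reciprocal Rank Fusion, named in the re-ranking step, merges several ranked lists by giving each item the sum of 1/(k + rank) over the lists in which it appears. The sketch below shows the standard formula applied to three rankings, one per query dimension. The constant k = 60 is a commonly used default and is an assumption here, since the slides do not state the value used; the item names K1 to K4 are placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(item) = sum over lists of 1 / (k + rank).

    rankings: iterable of lists, each ordered from most to least relevant.
    k: smoothing constant (60 is a common default, assumed here).
    Returns items ordered by descending fused score, ties broken by name.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item: (-scores[item], item))


# Placeholder candidate lists, one per query dimension.
by_purpose = ["K2", "K1", "K3"]
by_behavior = ["K1", "K2", "K4"]
by_code = ["K1", "K3", "K2"]

fused = reciprocal_rank_fusion([by_purpose, by_behavior, by_code])
print(fused)  # ['K1', 'K2', 'K3', 'K4']
```

K1 wins because it ranks first in two lists and second in the third, while K4 appears only once at rank 3. Because the fusion uses ranks rather than raw similarity scores, the three dimensions do not need comparable score scales.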
- Step 3: Detection reasoning.
  - Prompt for LLMs: if the given code (i) has similar vulnerability causes and (ii) lacks the relevant fixing operations, it is considered vulnerable.

Evaluation: benchmark PairVul
- We first construct a new benchmark, PairVul, by mining CVEs of the Linux kernel.
- PairVul exclusively includes pairs of vulnerable code and the relevant patched code.
- [Table: detailed statistics of the benchmark PairVul]

Evaluation: compared with SOTA vulnerability detection techniques
- Compared with existing fine-tuning-based techniques.
- Compared with basic GPT-4 and code-level RAG.

Pain points in industry
- In industry, there are many customized code checking rules related to the business or to specific domains.
- It takes a lot of effort to implement them manually in static analysis checking tools.
- Most rules are described in natural language, which can be leveraged to construct the knowledge base and incorporated into our LLM-based detection framework.

Discussion and future improvements
The strengths of knowledge-level RAG for bug/vulnerability detection:
- Domain-specific: the knowledge-level RAG framework is suitable for business-related bugs or specific bug types.
- Extensibility: the knowledge base can be continuously extended with newly released bugs.
- Interpretability of the detection results: we perform a user study to confirm that the knowledge can help developers confirm the detection results quickly and precisely.
- It can be practically deployed for industrial code review (especially for domain-specific bugs).
The limitations of knowledge-level RAG for bug/vulnerability detection:
- False negatives: inaccurate vulnerability knowledge descriptions; relevant vulnerability knowledge that is not retrieved; relevant vulnerability knowledge that does not exist in the knowledge base.
- False positives: mismatched fixing solutions; retrieval of irrelevant vulnerability knowledge.

Summary
Check out our papers for more details!
- Mining Resource-Operation Knowledge to Support Resource Leak Detection. FSE 2023.
- Boosting Static Resource Leak Detection via LLM-based Resource-Oriented Intention Inference. arXiv 2024.
- Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAG. arXiv 2024.

Open challenges
- How to better extract input contexts for LLMs?
- How to better identify dynamically loaded constraints via LLMs? (Configuration/external API query)