LLSM: LLM-enhanced Logic Synthesis Model with EDA-guided CoT Prompting, Hybrid Embedding and AIG-tailored Acceleration
Shan Huang*, Jinhao Li*, Zhen Yu, Jiancai Ye, Jiaming Xu, Ningyi Xu, Guohao Dai (*equal contribution)
Shanghai Jiao Tong University; correspondence to Guohao Dai. ASP-DAC 2025.

Outline (Page 2)
- Backgrounds and Motivations
- Related Works
- Challenges and Techniques Overview: EDA-guided CoT Prompting, Text-Circuit Hybrid Embedding, EDA-Tailored Acceleration
- Experiment Results
- Extension Works

Electronic Design Automation (EDA) (Page 3)
EDA refers to the use of EDA software tools to complete the functional design, synthesis, verification, and physical design of VLSI chips. Key objective: optimize the Power, Performance, and Area (PPA) of the chip.
Flow: Spec/Architecture Design (RTL code written by engineers) -> Logic Design (logic optimization and mapping to a netlist) -> Physical Design (placement and routing) -> Sign-off (verify functionality and manufacturability) -> Tapeout.

Importance of Logic Synthesis (Page 4)
Logic synthesis is time-consuming (50%) and has a high capital cost (55%) in the EDA process [1].
(time proportion, cost proportion) per stage:
- Qualification of IP: (26%, 45%)
- Logic Synthesis (RTL -> netlist): (50%, 55%)
- Physical Design (netlist -> tapeout): (21%, 56%)
[1] https:/
Logic Synthesis (Page 5)
Logic synthesis is iterative in chip design; predicting synthesis results can reduce the iteration overhead.
Traditional logic synthesis flow (RTL code in, PPA result out; slow iteration):
1. Translation: fast; includes syntax parsing, design checking, etc. (15%)
2. Logic optimization: slow; extensive heuristic processes that reduce circuit depth (50%)
3. Process mapping: slow; further optimization after importing process library files (35%)
AI-assisted logic synthesis flow: an AI model (Graph Neural Network (GNN) or Transformer) predicts the PPA result directly from the RTL code, so each iteration is fast.

Outline (Page 6): Related Works
GNN-based Methods for Logic Synthesis (Page 7)
GNNs model circuits as graphs (e.g., a directed acyclic graph with primary inputs PIa..PId, AND nodes, and a primary output PO) and extract graph-level features for predicting PPA, but they face two inherent problems:
- Over-smoothing [2]: as GNN layers are stacked, node features become increasingly similar and accuracy drops.
- Over-squashing [1]: long-distance node pairs are only weakly connected, so information is lost along the path.
[1] Akansha S. Over-squashing in graph neural networks: A comprehensive survey. arXiv:2308.15568, 2023.
[2] Rusch T K, Bronstein M M, Mishra S. A survey on oversmoothing in graph neural networks. arXiv:2303.10993, 2023.
Transformer-based Methods for Logic Synthesis (Page 8)
Transformers flatten the circuit into a sequence (PIa, PIb, PIc, PId, AND, AND, AND, PO) and compute attention between every node pair [1], but this sequential modeling faces scalability problems and cannot be applied to large graphs:
- The attention matrix has O(N^2) compute and storage complexity (a rough sizing sketch follows below).
- Circuits are large: N = 2.08x10^12 for an NVIDIA B200 [2] and N = 2.8x10^11 for an Apple M4 [3]; even the datasets used in academia contain circuits with N ~ 10^5.
[1] Xu, Ceyu, Chris Kjellqvist, and Lisa Wu Wills. SNS's not a synthesizer: a deep-learning-based synthesis predictor. Proceedings of the 49th Annual International Symposium on Computer Architecture. 2022.
[2] https:/
[3] https:/
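To make the O(N^2) claim concrete, here is a back-of-the-envelope sketch; the fp32 and single-head/single-layer assumptions are ours, not the slides':

```python
# Rough memory for one dense attention matrix over a flattened circuit.
# Assumption (ours): fp32 scores, a single head in a single layer.
def attention_matrix_bytes(num_nodes: int, bytes_per_score: int = 4) -> int:
    return num_nodes * num_nodes * bytes_per_score

# An academic-scale circuit with N = 1e5 nodes already needs ~40 GB
# for one attention matrix, before any activations or weights.
print(attention_matrix_bytes(100_000) / 1e9)  # -> 40.0 (GB)
```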
Outline (Page 9): Challenges and Techniques Overview

Overview (Page 10)
LLSM extends the previous GNN-only flow (RTL code -> translation -> logic optimization -> process mapping -> downstream GNN) with three techniques:
- Technique 1, EDA-guided CoT prompting: an LLM converts the RTL code into a circuit summary plus a logic synthesis CoT.
- Technique 2, text-circuit hybrid embedding: an LM encoder turns the circuit summary into a text embedding, which is fused with the graph representation before the downstream GNN predicts the PPA results.
- Technique 3, EDA-tailored acceleration: an AIG-tailored SpMM kernel and a state cache accelerate inference.
Technique 1: EDA-guided CoT Prompting (Page 11)
Challenge: LLMs lack the knowledge to analyze RTL code, and it is expensive to train or fine-tune them.
- Lack of RTL code data [1].
- High training cost [2]: domain-adaptive pretraining turns foundation models (LLaMA2 7B/13B/70B) into EDA-domain foundation models at a cost of thousands of GPU hours, and model alignment then turns these into EDA-domain chat models at a cost of hundreds more.
[1] Chang, Kaiyan, et al. Data is all you need: Finetuning LLMs for chip design via an automated design-data augmentation framework. Proceedings of the 61st ACM/IEEE Design Automation Conference. 2024.
[2] Liu, Mingjie, et al. ChipNeMo: Domain-adapted LLMs for chip design. arXiv:2311.00176 (2023).
Technique 1: EDA-guided CoT Prompting (Page 12)
Approach: a training-free CoT method guides the LLM to summarize the size and gate-level information of the RTL code.
- Naive method (role, prompt, example): the output describes only the I/O and the function of each module, without gate-level information, e.g. "Inputs: inData, clk. Outputs: outData. Function of the circuit: the circuit implements ..."
- EDA-guided chain-of-thought method (role, prompt, example, plus a logic synthesis CoT): the LLM analyzes the RTL I/O, analyzes the modules and layers, estimates the gate count, and estimates the layer count, so that the information of the netlist after logic synthesis is deduced by logical analysis (a prompt sketch follows below). Output with CoT, now with gate-level information: "Scale: the overall structure involves a total of 10 delay stages due to ... Estimation: each multiplier is estimated to be composed of 32 AND gates and 31 OR gates, while each adder consists of 32 full adders; each full adder is estimated to be 3 gates. Multiplier gates: 32 AND + 31 OR = 63 gates each, 63 x 12 = 756 gates. Adder gates: 32 x 3 gates x 10 = 960 gates."
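As a concrete illustration, a minimal prompt-builder sketch; the wording, step list, and the build_prompt helper are our own illustration, not the paper's exact prompt:

```python
# Hypothetical prompt builder for the EDA-guided CoT described above.
EDA_COT_STEPS = (
    "1. List the inputs and outputs of the RTL module.\n"
    "2. Identify sub-modules and the number of pipeline/delay stages.\n"
    "3. Estimate the gate count of each sub-module (e.g., a 32-bit adder "
    "is ~32 full adders, ~3 gates each).\n"
    "4. Estimate the logic depth (number of levels) of the netlist.\n"
)

def build_prompt(rtl_code: str) -> str:
    """Compose role, CoT steps, and the RTL code into one query."""
    return (
        "You are a logic synthesis expert. Deduce the post-synthesis "
        "netlist statistics of the following RTL code step by step.\n"
        + EDA_COT_STEPS
        + "RTL code:\n" + rtl_code
    )
```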
Technique 2: Text-Circuit Hybrid Embedding (Page 13)
Challenge: a closed LLM makes it impossible to extract feature embeddings, and the circuit summary cannot be directly input into downstream models.
- The circuit summary (text modality) and the circuit graph (circuit modality: PIa..PId, AND nodes, PO) cannot both be fed to the downstream model at the same time.
- Embeddings cannot be obtained from closed-source LLMs.

Technique 2: Text-Circuit Hybrid Embedding (Page 14)
Approach: use a small Language Model (LM) as a text encoder to generate the text embedding.
- The closed LLM still produces the circuit summary from the RTL code; the LM encodes that summary, while the downstream GNN encodes the graph modality.
- The lightweight trainable model trains both the GNN and LM weights to improve prediction accuracy: a weighted sum fuses the graph embedding and the text embedding into a fused embedding, from which the predicted PPA is produced (a fusion sketch follows below).
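A minimal PyTorch sketch of the weighted-sum fusion; the sigmoid-parameterized scalar weight is our assumption, since the slides only specify a weighted sum:

```python
import torch
import torch.nn as nn

class HybridFusion(nn.Module):
    """Fuse the GNN graph embedding and LM text embedding by weighted sum."""
    def __init__(self):
        super().__init__()
        # Learnable mixing weight, squashed into (0, 1) by a sigmoid.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, graph_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.alpha)
        return a * graph_emb + (1 - a) * text_emb

fused = HybridFusion()(torch.randn(8, 256), torch.randn(8, 256))  # (batch, dim)
```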
Technique 3: EDA-Tailored Acceleration (Page 15)
Background: the bottleneck of the GNN is message propagation on edges (aggregating embeddings from source nodes), which can be abstracted as an SpMM operator: the sparse N x N adjacency matrix times the dense input node-feature matrix yields the output node features (an SpMM sketch follows below).
Common sparse formats for an adjacency matrix with non-zeros a..g:
- COO (not efficient): rowInd = [0 0 1 2 2 2 3], colInd = [1 2 0 1 2 3 2], value = [a b c d e f g]
- CSR (efficient): rowPtr = [0 2 3 6 7], colInd = [1 2 0 1 2 3 2], value = [a b c d e f g]
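To make the aggregation concrete, a minimal NumPy sketch of SpMM over the CSR example above (our illustration):

```python
import numpy as np

def spmm_csr(row_ptr, col_ind, val, x):
    """y[i] = sum over in-neighbors j of val * x[j] -- message aggregation."""
    y = np.zeros((len(row_ptr) - 1, x.shape[1]))
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[k] * x[col_ind[k]]
    return y

# The 4-node example from the slide, with a..g set to 1 (unweighted).
row_ptr = [0, 2, 3, 6, 7]
col_ind = [1, 2, 0, 1, 2, 3, 2]
val = np.ones(7)
y = spmm_csr(row_ptr, col_ind, val, np.random.rand(4, 8))
```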
Technique 3: EDA-Tailored Acceleration (Page 16)
Challenge: introducing the LM slows down inference, and circuit graphs are sparser than typical GNN workloads.
- Cost of the LM: the circuit summary is ~1000 tokens versus ~20 tokens for the synthesis flow, making the LM two orders of magnitude slower than the GNN.
- Cost of format conversion: the And-Inverter Graph (AIG) adjacency matrix is too sparse, and converting the edge index (COO) to CSR before calling cuSPARSE [1] dominates the runtime (in normalized time, the conversion costs roughly 10x the cuSPARSE computation, which is a single API call).
[1] NVIDIA sparse computing library, https:/

Technique 3: EDA-Tailored Acceleration (Page 17)
Insight: across iterations only the logic synthesis flow changes, and the AIG has structural features.
- Redundant computing of the circuit summary: the fixed ~1000-token summary is re-encoded every run even though only the variable ~20-token synthesis flow differs.
- AIG structural features: an AND node has 2 inputs, a NOT node has 1 input, and a primary input has 0 inputs; in-degree 2 is dominant (in-degree 0: 9.38%, in-degree 1: 7.80%, in-degree 2: 82.82%).
Technique 3: EDA-Tailored Acceleration (Page 18)
Approach: use ELLPACK [2] for efficient memory access, fuse conversion and computing on the GPU, and cache the LM state (an ELLPACK sketch follows below).
- ELLPACK with padding stores the AIG adjacency matrix as fixed-width value/index arrays, giving higher access efficiency and leveraging the AIG structural features (in-degree at most 2) to improve parallelism.
- Fuse format conversion and computation into one kernel, avoiding the separate COO-to-CSR pass.
- State cache: in an offline stage, the LM encodes the ~1000-token circuit summary once and stores the state/embedding in a cache; in the online stage the embedding is loaded and inference runs with only the ~20-token synthesis flow.
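A minimal NumPy sketch of a width-2 ELLPACK layout for an AIG and aggregation over it; the padding convention and helper names are our assumptions, and the real kernel runs fused on the GPU:

```python
import numpy as np

def aig_to_ellpack(num_nodes, edges, width=2):
    """Pack (dst, src) edges into fixed-width index rows.
    Width 2 suffices because AIG in-degree is at most 2 (AND=2, NOT=1, PI=0)."""
    index = np.zeros((num_nodes, width), dtype=np.int64)
    valid = np.zeros((num_nodes, width), dtype=bool)  # marks real (non-padding) slots
    fill = np.zeros(num_nodes, dtype=np.int64)
    for dst, src in edges:
        index[dst, fill[dst]] = src
        valid[dst, fill[dst]] = True
        fill[dst] += 1
    return index, valid

def ellpack_aggregate(index, valid, x):
    """Sum source-node features into each destination (regular, padded access)."""
    gathered = x[index] * valid[..., None]  # zero out padded slots
    return gathered.sum(axis=1)

# Tiny AIG: node 2 = AND(0, 1), node 3 = NOT(2)
index, valid = aig_to_ellpack(4, [(2, 0), (2, 1), (3, 2)])
out = ellpack_aggregate(index, valid, np.random.rand(4, 8))
```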
Outline (Page 19): Experiment Results

Experiment Setup (Page 20)
- GPU: A100; nvcc 11.8; PyTorch 2.0.1; PyG v2.5.3
- Dataset: OpenABC [1], 23 IPs (communication/bus protocol, controller, crypto, DSP, processor), 1500 logic synthesis flows
- Baselines: OpenABC, LOSTIN
- LM models: Mamba-130m, DeBERTa-base
- Training: 20 epochs; learning rate 0.1 for the LM, 0.01 for the GNN
[1] Chowdhury A B, Tan B, Karri R, et al. OpenABC-D: A large-scale dataset for machine learning guided integrated circuit synthesis. arXiv:2110.11292, 2021.

Evaluation Result (Page 21)
- Area prediction: 3.49% and 1.19% average MAPE reduction.
- Delay prediction: 5.76% and 6.80% average MAPE reduction.
(Bar chart: MAPE (%) reduction for area and delay, comparing OpenABC vs Ours+OpenABC and LOSTIN vs Ours+LOSTIN.)
Speedup Result (Page 22)
- The AIG-tailored SpMM kernel achieves an average 1.74x speedup over cuSPARSE.
- End-to-end, LLSM achieves an average 1.37x speedup over PyG.
(Per-benchmark chart: SpMM speedup vs cuSPARSE and end-to-end speedup vs PyG across the 23 OpenABC IPs, from ac97_ctrl to wb_dma.)

Outline (Page 23): Extension Works
Extension: AIG-based GAT Acceleration (Page 24)
Thread workload reallocation and skipping redundant computation (see the sketch below):
- The baseline Fused-GAT kernel [1] assigns a 32-thread warp per node, so with AIG in-degrees of 1 or 2 up to 93.75% of the threads are wasted and warp synchronization adds overhead.
- The AIG-GAT kernel reallocates threads so one warp aggregates multiple nodes at once (no thread waste, no sync overhead) and skips the redundant softmax for in-degree-1 nodes, whose single attention weight is always 1.
- Result: 1.54x average speedup and 46.8% memory usage reduction over PyG.
[1] Zhang, Hengrui, et al. Understanding GNN computational graph: A coordinated computation, IO, and memory perspective. Proceedings of Machine Learning and Systems 4 (2022): 467-484.
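A minimal NumPy sketch of the softmax-skipping idea; this is our illustration of the control flow, while the paper implements it inside a fused CUDA kernel:

```python
import numpy as np

def edge_softmax_aig(node_scores):
    """node_scores: one array of incoming attention logits per node
    (length 0, 1, or 2 for an AIG). The softmax of a single logit is
    exactly 1, so in-degree-1 nodes skip the exp/normalize work."""
    out = []
    for s in node_scores:
        if len(s) <= 1:
            out.append(np.ones_like(s))  # skip redundant softmax
        else:
            e = np.exp(s - s.max())      # numerically stable 2-way softmax
            out.append(e / e.sum())
    return out

# AIG example: PI (0 in-edges), NOT (1 in-edge), AND (2 in-edges)
weights = edge_softmax_aig([np.array([]), np.array([0.7]), np.array([0.2, 1.3])])
```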
(Page 25)
LLSM: LLM-enhanced Logic Synthesis Model with EDA-guided CoT Prompting, Hybrid Embedding and AIG-tailored Acceleration
Shan Huang, supervised by Prof. Guohao Dai