Learning to Prune and Low-Rank Adaptation for Compact Language Model Deployment

Authors: Asmer Hamid Ali (aali115@asu.edu), Fan Zhang, Li Yang, Deliang Fan
Efficient, Secure and Intelligent Computing (ESIC) Laboratory (https://faculty.engineering.asu.edu/dfan/)
Arizona State University
Copyright 2025 Arizona Board of Regents

Outline
1. Motivation and Problem Statement: Challenges in deploying large pre-trained models; limitations of existing methods.
2. Key Contributions: Overview of the proposed approach and its significance.
3. Parameter-Efficient Fine-Tuning and Model Pruning: Background on PEFT techniques; importance of structured pruning for efficiency.
4. Methodology Overview: Trainable pruning masks; integration with low-rank adaptation.
5. Efficient Pruning and Low-Rank Adaptation: Detailed explanation with equations and benefits.
6. Experimental Setup: Models, datasets, and evaluation metrics.
7. Results: Performance analysis and comparison with baselines.
8. Conclusion: Summary of contributions and future directions.

Motivation and Problem Statement
- Growing computational demands of large pre-trained models (LPMs).
- PEFT techniques address training overhead but fail to optimize inference efficiency.
- Need for a compact, efficient, deployment-ready solution.

Figure 1: Chart showing the growth of model sizes over time, with annotations on memory usage and hardware limits (Source: LLM: The Rise of Data).
Figure 2: Table comparing LLaMA-7B models under various PEFT methods, showing parameter reductions and accuracy trade-offs (Source: Charith Chandra Sai Balne et al., Parameter Efficient Fine Tuning: A Comprehensive Analysis Across Applications, arXiv:2404.13506, 2024).
Key Contributions
1. Trainable Pruning Methodology: Optimizes the structure of LPMs during fine-tuning; includes learnable binary masks for channel-wise pruning.
2. Low-Rank Adaptation Integration: Incorporates low-rank adaptation to reduce computational overhead while maintaining accuracy.
3. Efficiency Gains: Demonstrates up to 18% inference speed-up on real-world hardware.

Figure 3: Proposed approach.

Parameter-Efficient Fine-Tuning and Model Pruning
Figure 4: Structure of LoRA (Source: E. J. Hu et al., LoRA: Low-Rank Adaptation of Large Language Models, 2021).
Figure 5: Structure of DoRA (Source: S.-Y. Liu et al., DoRA: Weight-Decomposed Low-Rank Adaptation, 2024).
Figure 6: Pruning techniques.
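The deck presents LoRA and DoRA only through the figures above. As a concrete reference point, the sketch below shows minimal PyTorch versions of both adapter styles; it is not the authors' implementation, and the ranks, scaling, and initialization choices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """LoRA: keep W0 frozen and learn a low-rank update, W' = W0 + B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # W0 (and bias) stay frozen
        out_f, in_f = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)   # trainable down-projection
        self.B = nn.Parameter(torch.zeros(out_f, r))          # trainable up-projection, zero-init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * F.linear(F.linear(x, self.A), self.B)

class DoRALinear(nn.Module):
    """DoRA: decompose the merged weight into a magnitude vector m and a
    normalized direction, W' = m * (W0 + B A) / ||W0 + B A||."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.W0 = nn.Parameter(base.weight.detach().clone(), requires_grad=False)
        out_f, in_f = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, r))
        # one magnitude entry per output channel, initialized from W0's per-channel norms
        self.m = nn.Parameter(self.W0.norm(dim=1, keepdim=True))

    def forward(self, x):
        merged = self.W0 + self.B @ self.A                      # W0 + BA
        direction = merged / merged.norm(dim=1, keepdim=True)   # unit norm per channel
        return F.linear(x, self.m * direction)
```

The difference that the proposed method builds on is DoRA's explicit magnitude vector: it provides a natural per-channel handle for a pruning mask to act on.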
Methodology Overview
1. Trainable Pruning Masks: Introduce binary masks to prune unimportant weights in both frozen and trainable components (see the sketch after this list).
2. Integration with Low-Rank Adaptation: Decompose weights into magnitude and direction using low-rank adaptation (based on DoRA); optimize the pruning process by focusing only on the magnitude vectors, minimizing training overhead.
3. Hardware-Compatible Compact Model: The final pruned model retains a compact structure and achieves significant inference speed-up on commercial GPUs and CPUs.

Figure 7: Overview of the proposed approach.
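The masks above must stay differentiable during fine-tuning while being strictly binary at deployment time. The sketch below is one minimal way to realize such a channel-wise mask over a DoRA-style magnitude vector, using a straight-through estimator; the relaxation, threshold, and naming are assumptions (the references also list Gumbel-Softmax, which could serve the same purpose).

```python
import torch
import torch.nn as nn

class MaskedMagnitude(nn.Module):
    """Learnable channel-wise binary mask m_b applied to a magnitude vector m.

    A real-valued score per channel is trained jointly with fine-tuning; the
    forward pass uses a hard 0/1 mask, while gradients flow through a sigmoid
    surrogate (straight-through estimator).
    """
    def __init__(self, magnitude: torch.Tensor):
        super().__init__()
        self.m = nn.Parameter(magnitude.clone())              # magnitude, one entry per channel
        self.score = nn.Parameter(torch.zeros_like(self.m))   # pruning logits: > 0 keeps a channel

    def forward(self) -> torch.Tensor:
        hard = (self.score > 0).float()                       # binary mask m_b used at inference
        soft = torch.sigmoid(self.score)                      # differentiable surrogate
        m_b = hard + soft - soft.detach()                     # straight-through trick
        return m_b * self.m                                   # pruned magnitude: m_b ⊙ m

    def sparsity(self) -> float:
        """Fraction of channels currently pruned (useful for tracking a sparsity target)."""
        return float((self.score <= 0).float().mean())
```

Because the mask is channel-wise (structured), channels whose mask is zero can be physically removed after training, which is what makes the compact model run faster on standard GPU and CPU kernels.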
Efficient Pruning and Low-Rank Adaptation
1. Integration of Pruning with Low-Rank Adaptation: Use a trainable binary mask (m_b) to optimize the magnitude vector in the DoRA framework. Low-rank adaptation ensures computational efficiency while maintaining accuracy.
2. Low-Rank Adaptation: W' = W_0 + ΔW = W_0 + B A, where W_0 is frozen and B A is the trainable low-rank update; DoRA further decomposes the merged weight into magnitude and direction, W' = m · (W_0 + B A) / ||W_0 + B A||_c.
   Pruned Weight Update: with the binary mask applied to the magnitude vector, W' = (m_b ⊙ m) · (W_0 + B A) / ||W_0 + B A||_c, so channels whose mask is zero drop out of the deployed weight entirely. A combined sketch of this pruned forward pass follows below.
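Putting the previous two sketches together, the pruned weight update above can be expressed as a single layer. Everything beyond the equation itself (shapes, initialization, the straight-through relaxation) remains an illustrative assumption rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrunedDoRALinear(nn.Module):
    """Pruned weight update: W' = (m_b ⊙ m) · (W0 + B A) / ||W0 + B A||_c."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        out_f, in_f = base.weight.shape
        self.W0 = nn.Parameter(base.weight.detach().clone(), requires_grad=False)  # frozen
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)        # low-rank update (trainable)
        self.B = nn.Parameter(torch.zeros(out_f, r))
        self.m = nn.Parameter(self.W0.norm(dim=1, keepdim=True))  # magnitude per output channel
        self.score = nn.Parameter(torch.zeros(out_f, 1))          # logits for the binary mask m_b

    def forward(self, x):
        merged = self.W0 + self.B @ self.A                         # W0 + B A
        direction = merged / merged.norm(dim=1, keepdim=True)      # (W0 + B A) / ||W0 + B A||_c
        soft = torch.sigmoid(self.score)
        m_b = (self.score > 0).float() + soft - soft.detach()      # straight-through binary mask
        return F.linear(x, (m_b * self.m) * direction)             # apply W' to the input
```

Only the low-rank factors, the magnitude vector, and the mask scores receive gradients, which keeps the set of trainable parameters small, in line with the deck's emphasis on minimizing training overhead.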
Experimental Setup
1. Models: DistilBERT, RoBERTa-base (RoBbase), and LLaMA-7B.
2. Datasets:
   a. GLUE benchmark for DistilBERT and RoBERTa.
   b. Commonsense reasoning datasets (e.g., BoolQ, PIQA, ARC) for LLaMA-7B.
3. Hardware and Training Details:
   a. GPU: NVIDIA A5000.
   b. Batch size: 8, with mixed precision for efficiency.
   c. Optimizer: Adam, with the learning rate adjusted across stages (e.g., 5e-5 → 1e-5); a configuration sketch follows below.
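These training details map onto a standard PyTorch mixed-precision loop. The snippet below is a hedged sketch of that configuration; the model/dataloader interface, the epoch at which the learning rate drops, and the loss access pattern are assumptions, not details from the deck.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

def finetune(model, train_loader, epochs=3, lr_drop_epoch=2, device="cuda"):
    """Illustrative setup: batch size 8 (set in the DataLoader), Adam optimizer,
    mixed precision, staged learning rate 5e-5 -> 1e-5."""
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=5e-5)
    scaler = GradScaler()
    model.to(device).train()
    for epoch in range(epochs):
        if epoch == lr_drop_epoch:                    # later fine-tuning stage uses a lower LR
            for group in optimizer.param_groups:
                group["lr"] = 1e-5
        for input_ids, labels in train_loader:        # assumes (inputs, labels) batches
            input_ids, labels = input_ids.to(device), labels.to(device)
            optimizer.zero_grad(set_to_none=True)
            with autocast():                          # mixed-precision forward/backward
                loss = model(input_ids, labels=labels).loss   # HuggingFace-style output assumed
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
```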
Results for LLaMA-7B
1. Accuracy Gains: Pruned-LLaMA-7B (ours) achieves a competitive average accuracy (62.77%), outperforming most baselines such as LLM-Pruner and LoRAPrune.
2. Model Size Efficiency: Reduces the model to 5.09B parameters, compared with the 6.74B-parameter baseline, while maintaining comparable performance.
3. Task-Specific Highlights:
   a. Excels on PIQA (79.54%), outperforming all baselines.
   b. Strong performance on BoolQ (72.12%) and WinoGrande (67.93%) compared with LLM-Pruner and LoRAPrune.

Results for RoBbase
1. Accuracy Gains: Pruned-RoBbase (ours) achieves the highest average accuracy (87.46%), outperforming the baseline and LoRA methods.
2. Model Size Efficiency: Pruned-RoBbase reduces the model size to 429.3 MB, significantly smaller than the baseline RoBbase (476.84 MB).
3. Task-Specific Highlights: Achieves the best accuracy on SST-2 (95.3%) and RTE (83.8%) while remaining competitive on the other tasks.

Results for DistilBERT
1. Accuracy Gains: Pruned-DistilBERT (ours) achieves the highest average accuracy (82%), outperforming the other methods, and performs consistently better on tasks such as RTE (65.2%), SST-2 (91.4%), and MNLI (83.4%).
2. Model Size Efficiency: Pruned-DistilBERT reaches a model size of 222.85 MB, smaller than most other methods, including LoRA (258.32 MB) and DoRA (268.5 MB).
3. Improved Performance vs. Baseline: Outperforms the DistilBERT baseline in average accuracy (82% vs. 79.6%).

Analysis of Sparsity and Inference Gains
- Layer-wise sparsity of up to 24.5% translates into up to 18% inference speed-up on real-world hardware.
Conclusion
Summary of Contributions:
- Introduced a novel trainable pruning methodology for structured pruning.
- Integrated pruning with low-rank adaptation to reduce computational costs while maintaining accuracy.
Key Results:
- Up to 24.5% sparsity across layers.
- Up to 18% inference speed-up on real-world hardware.
Broader Impact:
- Enables practical deployment of large pre-trained models in resource-constrained environments.
- Balances performance, efficiency, and deployment feasibility.
References
- E. J. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models, 2021.
- S.-Y. Liu et al. DoRA: Weight-Decomposed Low-Rank Adaptation, 2024.
- V. Sanh et al. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter, 2020.
- Y. Liu et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019.
- H. Touvron et al. LLaMA: Open and Efficient Foundation Language Models, 2023.
- X. Ma et al. LLM-Pruner: On the Structural Pruning of Large Language Models, 2023.
- M. Zhang et al. LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Finetuning, 2023.
- M. Xia et al. Sheared LLaMA: Accelerating Language Model Pre-Training via Structured Pruning, 2024.
- E. Jang et al. Categorical Reparameterization with Gumbel-Softmax, 2017.
- T. Chen et al. LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery, 2023.
- H. Zhou et al. LoRA-Drop: Efficient LoRA Parameter Pruning Based on Output Evaluation, 2024.
- M. Valipour et al. DyLoRA: Parameter-Efficient Tuning of Pre-Trained Models Using Dynamic Search-Free Low-Rank Adaptation, 2023.

Thank You! Any Questions?

Acknowledgement: This work is supported in part by the National Science Foundation under Grant No. 2314591, No. 2505326, No. 2452573, No. 2452657, No. 2503906, and No. 2505209.

Contact information:
Asmer Hamid Ali: aali115@asu.edu
Deliang Fan: dfan@asu.edu