Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
March 28, 2024
QCOMResearch
tinyAI Forum on Generative AI and Foundation Models on the Edge

On-Device Generative AI
Fatih Porikli
Senior Director, Technology, Qualcomm Technologies, Inc.

Today's agenda
Why on-device generative AI
Full-stack AI optimizations for large vision models (Stable Diffusion)
Full-stack AI optimizations for large language models (Llama 2)
Q&A

LLM: large language model; LVM: language vision model. AIMET is a product of Qualcomm Innovation Center, Inc. Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

Leading machine learning research for on-device AI across the entire spectrum of topics
AI research spans fundamental, platform, and applied research. Topics include: generative AI; multi-task learning & MoE; self-supervised learning; reinforcement & imitation learning; embodied AI; model quantization, compression, and NAS; HW-SW co-design; distillation of generative models; power management; AI Model Efficiency Toolkit (AIMET); deep learning for 3D/geometry; audio and video perception; AI for wireless and RF sensing; energy-efficient perception; AI for chip design; on-device learning; transformers & attention; graph and kernel optimization; federated learning; AI assistant (LLM); computer vision & perception; diffusion-based generators (LVM); deep learning for graphics; voice AI.
Full-stack AI research & optimization: model, HW, and SW innovation across each layer.

On-device intelligence is paramount
Process data closest to the source and complement the cloud: privacy, reliability, low latency, cost, energy, personalization.

Generative AI will impact use cases across device categories
IoT: Gen AI can help improve customer and employee experience in retail, such as providing recommendations for inventory and store layout. ("Suggest inventory and store layout changes to increase user satisfaction in the sports section")
PC: Gen AI is transforming productivity by composing emails, creating presentations, and writing code. ("Make me a status presentation for my boss based on inputs from my team")
Phone: Gen AI can become a true digital assistant. ("Make me reservations for a weekend getaway at the place Bob recommended")
XR: Gen AI can help create immersive 3D virtual worlds based on simple prompts.
Automotive: Gen AI can be used for ADAS/AD to help improve drive policy by predicting the trajectory and behavior of various agents.

On-device AI to support a variety of Gen AI models
Compute: from GFLOPs to TFLOPs. Model size: from millions to billions of parameters.
We can run models with over 10 billion parameters on device today (*assuming INT4 parameters) and anticipate this growing substantially in the coming years.
[Chart: model size in billions of parameters (0.01 to 1000, log scale) across use cases, from perception through dialog and NLP (LLM), text-to-image (LVM), code generation, text-to-video (LVM), multi-modal (LMM), mathematical reasoning, and collaborative robotics, alongside compute growth in GFLOPs from ResNet50 to SD 1.5, 2023 vs. 2024.]

Output image
The prompt: Panoramic view of mountains of Vestrahorn and perfect reflection in shallow water, soon after sunrise, Stokksnes, South Iceland, Polar Regions
, natural lighting, cinematic wallpaper
VAE: variational autoencoder; CLIP: Contrastive Language-Image Pre-Training

What is diffusion?
Forward diffusion adds noise to an image; reverse diffusion subtracts noise (denoises) to generate an image.

Stable Diffusion architecture
UNet is the biggest component model of Stable Diffusion. Many steps, often 20 or more, are used for generating high-quality images, so significant compute is required.
Input prompt → CLIP text encoder (123M parameters) → scheduler + UNet (860M parameters) → VAE decoder (49M parameters); Stable Diffusion totals 1B+ parameters.
Example input prompt: Vase in Greek style with intricate patterns
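The forward/reverse process above can be sketched numerically. This is a minimal DDPM-style illustration (illustrative schedule and shapes, not Qualcomm's implementation; Stable Diffusion applies this in latent space):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy forward diffusion: a linear beta schedule and the closed-form q(x_t | x_0).
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # per-step noise variances
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal-retention factor

x0 = rng.standard_normal((8, 8))         # stand-in for a clean image/latent

def forward_diffuse(x0, t):
    """Sample x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x_early = forward_diffuse(x0, 10)        # still mostly signal
x_late = forward_diffuse(x0, T - 1)      # essentially pure Gaussian noise
# Reverse diffusion trains a network (the UNet) to predict the added noise so
# these steps can be undone, one scheduler step at a time.
```

By the final step the signal coefficient sqrt(alpha_bar) is tiny, which is why generation must run the learned denoiser over many steps, and why reducing step count matters so much on device.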
and design

Knowledge distillation
Training a smaller "student" model to mimic a larger "teacher" model.
Create a smaller model with fewer parameters. Run faster inference on target deployment. Maintain prediction quality close to the teacher. Less training time.
Training data feeds both teacher and student. Match the logits of the two models to transfer the teacher model representation and minimize the distillation loss (KL divergence); the student output is also trained with a cross-entropy loss against the ground truth.

Pruning & knowledge distillation
More efficient architecture design through pruning and knowledge distillation of the original Stable Diffusion UNet (convolutional and attention blocks), reducing UNet compute (FLOPs), model size, and peak memory usage: the Efficient UNet.

Step distillation for the DDIM scheduler
DDIM: denoising diffusion implicit models; MSE: mean-squared error
Teach the student model to achieve in one step what the teacher achieves in multiple steps. Teacher: 2 UNets; student: 1 UNet; trained with an MSE loss.

Results:
                    FID     CLIP    Inference latency
Baseline (SD-1.5)   17.14*  0.3037  5.05 seconds
Fast SD             20.08   0.3004  0.56 seconds
*These results are not directly comparable since baseline Stable Diffusion was trained with an over 20x larger dataset than fast Stable Diffusion. SD: Stable Diffusion

Fast Stable Diffusion
Step distillation: reduces UNet forward passes to fewer than 20.
Guidance conditioning: combines conditional and unconditional generation.
Efficient UNet: reduces compute (FLOPs), model size, and peak memory usage.
e-to-v: reparameterization from epsilon to velocity space for robust distillation.
Our full-stack AI optimization of Stable Diffusion significantly improves latency while maintaining accuracy: 9x speedup vs. baseline Stable Diffusion.

World's fastest AI text-to-image generation on a phone
Takes less than 0.6 seconds to generate 512x512 images from text prompts, using the Efficient UNet architecture, guidance conditioning, and step distillation. Full-stack AI optimization achieves this improvement.
LVM: language vision model
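The step-distillation idea above can be sketched with a toy stand-in. This is a hypothetical linear "denoiser", not the actual UNet/DDIM pipeline; the principle is the same: the distillation target is the teacher's two-step output, and the student learns to match it in one step via an MSE loss.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in: the teacher "denoiser" is a fixed linear map.
W_teacher = 0.9 * np.eye(4)

def teacher_step(x):
    return W_teacher @ x

def teacher_two_steps(x):
    return teacher_step(teacher_step(x))      # the target for ONE student step

W_student = 0.1 * rng.standard_normal((4, 4))
lr = 0.05
for _ in range(2000):
    x = rng.standard_normal(4)
    residual = W_student @ x - teacher_two_steps(x)
    W_student -= lr * np.outer(residual, x)   # SGD on the MSE loss 0.5*||residual||^2

# The student converges to W_teacher @ W_teacher = 0.81 * I:
# one student pass now does the work of two teacher passes.
```

Applied repeatedly, this halving is what drives the UNet forward-pass count below 20 while keeping output quality close to the teacher.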
Full-stack AI optimization for LVM
AI acceleration on the Qualcomm Hexagon NPU of the Snapdragon 8 Gen 3 Mobile Processor.
Runs completely on the device. Significantly reduces runtime latency and power consumption. Continuously improves the Qualcomm AI Stack: Qualcomm AI Engine Direct for improved performance and minimized memory spillage. Knowledge distillation for pruning and removing attention blocks, resulting in an accurate model with improved performance and power efficiency. Designing an efficient diffusion model through knowledge distillation for high accuracy.

LLMs are highly bandwidth limited rather than compute limited
Illustration of autoregressive language modeling: the single-token generation architecture of large language models results in high memory bandwidth. Prompt: "Recite the first law of robotics" → generated one token at a time: "A robot may not injure a human being".
Huge bandwidth: each parameter of the model must be read to generate each token (e.g., read 7B parameters for Llama 7B to generate a single token), streaming the LLM (embeddings, transformer layers 1..N, LM head) from DRAM/DDR through TCM to the NPU.
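The bandwidth pressure above can be made concrete with back-of-the-envelope arithmetic. The bandwidth figure below is an assumed illustrative number, not a measured Snapdragon specification:

```python
# Back-of-the-envelope: every decoded token streams all weights from DRAM,
# so memory bandwidth caps the achievable token rate.
params = 7e9                          # Llama 2-7B parameter count

bytes_per_token_fp16 = params * 2.0   # 14 GB of weight reads per token at FP16
bytes_per_token_int4 = params * 0.5   # 3.5 GB per token at INT4

bandwidth = 50e9                      # assumed sustained DRAM bandwidth, 50 GB/s
tps_fp16 = bandwidth / bytes_per_token_fp16   # weight-read bound on tokens/s, FP16
tps_int4 = bandwidth / bytes_per_token_int4   # 4x higher bound at INT4
```

Whatever the actual bandwidth, quantizing FP16 to INT4 shrinks the per-token weight traffic 4x, which is why the next slides focus on 4-bit quantization.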
LLM quantization motivations
A 4x smaller model (i.e., FP16 → INT4). Reduce memory bandwidth and storage. Reduce latency. Reduce power consumption. Maintain the accuracy of the FP published models.
LLM quantization challenges
Post-training quantization (PTQ) may not be accurate enough for 4-bit. The training pipeline (e.g., data or rewards) is not available for quantization-aware training (QAT).
Shrinking an LLM for increased performance while maintaining accuracy is challenging.

Quantization-aware training with knowledge distillation
Reduces memory footprint while solving the quantization challenges of maintaining model accuracy and the lack of the original training pipeline: a 1-point increase in perplexity¹ and a 1% decrease in accuracy.
¹Perplexity is averaged over several test sets, including wikitext and c4 (subset).
Construct a training loop that can run two models on the same input data: teacher Llama-2-Chat 7B (FP16) and student Llama-2-Chat 7B (INT4), on a dataset with true labels.
Loss 1: KL loss (teacher soft logits, student soft logits). Loss 2: cross-entropy loss (true labels, student hard logits).
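The two loss terms in the training loop above can be sketched as follows, using the temperature of 4 noted in the deck for the soft logits. The 50/50 mixing weight `alpha` is an assumption; the deck specifies only the two terms, not how they are weighted:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, labels, T=4.0, alpha=0.5):
    """KL(teacher || student) on temperature-softened logits, plus hard-label
    cross entropy on the student's unscaled ("hard", no temperature) logits."""
    p_t = softmax(teacher_logits / T)          # teacher soft labels
    p_s = softmax(student_logits / T)          # student soft predictions
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    p_hard = softmax(student_logits)           # hard logits: no temperature
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(alpha * kl + (1.0 - alpha) * ce))

# Toy logits standing in for FP16 teacher / INT4 student outputs on one token.
teacher = np.array([[5.0, 1.0, -2.0]])
student = np.array([[4.0, 1.5, -1.0]])
labels = np.array([0])
loss = kd_loss(teacher, student, labels)
```

The KL term pulls the quantized student toward the teacher's full distribution (which is why no original training pipeline is needed), while the CE term keeps it anchored to the true labels.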
The KD loss function combines the KL divergence loss (on the temperature-softened soft logits, temperature = 4) and the hard-label CE loss (on the hard logits, no temperature) over the class probabilities.

Llama 2 → Llama 2 draft
Speculative decoding speeds up token rate by trading off compute for bandwidth. A good draft model predicts with a high acceptance rate: the draft model generates a few speculative tokens at a time, and the target model decides which to accept in one pass.
Example, for the prompt "Recite the first law of robotics": the draft proposes "A robot should ...", the target accepts "A robot" and corrects "should" to "may"; decoding then continues from the accepted prefix, with most tokens generated by the draft and merely checked and accepted by the target, until "A robot may not injure a human being" is produced.
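The draft-then-verify loop above can be sketched for greedy decoding. The "models" here are hypothetical next-token lookup tables, not Llama 2; the point is the key property that the output is identical to decoding with the target alone:

```python
# Toy speculative decoding with deterministic (greedy) models.
TARGET = {"Recite": "A", "A": "robot", "robot": "may", "may": "not",
          "not": "injure", "injure": "a", "a": "human", "human": "being"}
DRAFT = {"Recite": "A", "A": "robot", "robot": "should", "should": "not",
         "may": "not", "not": "injure", "injure": "a", "a": "human",
         "human": "being"}

def generate(model, last, n):
    """Plain autoregressive decoding: one model call per token."""
    out = []
    for _ in range(n):
        if last not in model:
            break
        last = model[last]
        out.append(last)
    return out

def speculative_decode(last, n, k=3):
    """Draft proposes k tokens; the target verifies them in one pass,
    accepting matches and correcting the first mismatch."""
    out = []
    while len(out) < n and last in TARGET:
        proposal = generate(DRAFT, last, k)
        if not proposal:                      # draft stuck: fall back to target
            last = TARGET[last]
            out.append(last)
            continue
        for tok in proposal:
            if last not in TARGET or len(out) >= n:
                break
            expected = TARGET[last]           # what the target would emit
            if tok == expected:
                out.append(tok)               # draft token accepted
            else:
                out.append(expected)          # target corrects; discard the rest
                last = expected
                break
            last = tok
    return out[:n]

spec = speculative_decode("Recite", 8)
ref = generate(TARGET, "Recite", 8)           # target-only greedy decoding
```

Because every emitted token is either the draft's token confirmed by the target or the target's own correction, `spec` equals `ref` exactly; the speedup comes from the target verifying several tokens per pass instead of reading all its weights once per token.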
Training a significantly smaller draft LLM for speculative decoding while maintaining enough accuracy is challenging.
Small draft model motivations
A 10x smaller draft model than the target model. Fast results. Reduce memory bandwidth, storage, latency, and power consumption.
Small draft model challenges
The training pipeline (e.g., data or rewards) is not available. Cover multiple model families, e.g., 7B and 13B models. Match the distribution of the target model for a higher acceptance rate.

Speculative decoding provides speedup with no accuracy loss
Using our research techniques on Llama 2-7B Chat, we achieved up to 20 tokens per second.

World's fastest Llama 2-7B on a phone
At Snapdragon Summit 2023: up to 20 tokens per second, demonstrating both chat and application interaction
on device. World's first demonstration of speculative decoding running on a phone.

World's first large multimodal model (LMM) on an Android phone
LLMs can now see: a 7+ billion parameter LMM, LLaVA, with text, speech, and image inputs. Multi-turn intuitive conversations about an image at a responsive token rate. Full-stack AI optimization to achieve high performance at low power. Enhanced privacy, reliability, personalization, and cost with on-device processing. At MWC 2024.
LLM: Large Language Model; LLaVA: Large Language and Vision Assistant

Our first low rank adaptation (LoRA) on an Android phone
A 1+ billion parameter Stable Diffusion with a LoRA adapter for customized experiences. LoRA enables scalability and customization of on-device generative AI across use cases. Full-stack AI optimization to achieve high performance while fast-switching between adapters and minimizing memory need. Enhanced privacy, reliability, personalization, and cost with on-device processing. At MWC 2024.
import qai_hub as hub

# select device
device = hub.get_device("qualcomm-snapdragon-8gen2")

# produce model
job = hub.submit_compile_and_profile_job(
    torch_model,
    name="MyDetector",
    device=device,
    input_shapes=(3, 720, 1024),
)

# deploy to device
model = job.download_target_model()

Qualcomm AI Hub
Library of fully optimized AI models for deployment across Snapdragon and Qualcomm platforms: model selection → target platform → deployment → test and validate.
AIHUB.QUALCOMM.COM

On-device generative AI offers many benefits. Generative AI is happening now on the device.
Our on-device AI leadership is enabling generative AI.

Connect with us: QCOMResearch
Follow us on: https:/
For more information, visit us at:
Thank you

Nothing in these materials is an offer to sell any of the components or devices referenced herein. 2018-2023 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved.
Qualcomm, Snapdragon, Adreno, Hexagon, Kryo, FastConnect, and Qualcomm Spectra are trademarks or registered trademarks of Qualcomm Incorporated. Other products and brand names may be trademarks or registered trademarks of their respective owners.
References in this presentation to "Qualcomm" may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes our licensing business, QTL, and the vast majority of our patent portfolio. Qualcomm Technologies, Inc., a subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of our engineering, research and development functions, and substantially all of our products and services businesses, including our QCT semiconductor business.
Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries. Qualcomm patented technologies are licensed by Qualcomm Incorporated.