Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
March 28, 2024
QCOMResearch
tinyAI Forum on Generative AI and Foundation Models on the Edge

On-Device Generative AI
Fatih Porikli
Senior Director, Technology, Qualcomm Technologies, Inc.

Today's agenda
Why on-device generative AI
Full-stack AI optimizations for large vision models (Stable Diffusion)
Full-stack AI optimizations for large language models (Llama 2)
Q&A

LLM: large language model; LVM: language vision model. AIMET is a product of Qualcomm Innovation Center, Inc. Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

Leading machine learning research for on-device AI across the entire spectrum of topics
AI research spans fundamental, platform, and applied research. Topics include: generative AI; multi-task learning & MoE; self-supervised learning; reinforcement & imitation learning; embodied AI; model quantization, compression, and NAS; HW-SW co-design; distillation of generative models; power management; AI Model Efficiency Toolkit (AIMET); deep learning for 3D/geometry; audio and video perception; AI for wireless and RF sensing; energy-efficient perception; AI for chip design; on-device learning; transformers & attention; graph and kernel optimization; federated learning; AI assistant (LLM); computer vision & perception; diffusion-based generators (LVM); deep learning for graphics; voice AI.
Full-stack AI research & optimization: model, HW, and SW innovation across each layer.

On-device intelligence is paramount
Process data closest to the source and complement the cloud: privacy, reliability, low latency, cost, energy, personalization.

Generative AI will impact use cases across device categories
IoT: Gen AI can help improve customer and employee experience in retail, such as providing recommendations for inventory and store layout. ("Suggest inventory and store layout changes to increase user satisfaction in the sports section")
PC: Gen AI is transforming productivity by composing emails, creating presentations, and writing code. ("Make me a status presentation for my boss based on inputs from my team")
Phone: Gen AI can become a true digital assistant. ("Make me reservations for a weekend getaway at the place Bob recommended")
XR: Gen AI can help create immersive 3D virtual worlds based on simple prompts.
Automotive: Gen AI can be used for ADAS/AD to help improve drive policy by predicting the trajectory and behavior of various agents.

On-device AI to support a variety of Gen AI models
Compute: from GFLOPs to TFLOPs. Model size: from millions to billions of parameters.
We can run models with over 10 billion parameters on device today (*assuming INT4 parameters) and anticipate this growing substantially in the coming years.
[Chart: model size in billions of parameters (0.01 to 1000, log scale) across use cases, from perception through dialog and NLP (LLM), text-to-image (LVM), code generation, text-to-video (LVM), multi-modal (LMM), mathematical reasoning, and collaborative robotics, alongside compute growth in GFLOPs from ResNet50 to SD 1.5, 2023 vs. 2024.]

Output image
The prompt: Panoramic view of mountains of Vestrahorn and perfect reflection in shallow water, soon after sunrise, Stokksnes, South Iceland, Polar Regions
, natural lighting, cinematic wallpaper
VAE: variational autoencoder; CLIP: Contrastive Language-Image Pre-Training

What is diffusion?
Forward diffusion adds noise to an image; reverse diffusion subtracts noise (denoises) to generate an image.

Stable Diffusion architecture
UNet is the biggest component model of Stable Diffusion. Many steps, often 20 or more, are used for generating high-quality images, so significant compute is required.
Input prompt → CLIP text encoder (123M parameters) → scheduler + UNet (860M parameters) → VAE decoder (49M parameters); Stable Diffusion totals 1B+ parameters.
Example input prompt: Vase in Greek style with intricate patterns
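The forward/reverse process above can be sketched numerically. This is a minimal DDPM-style illustration (illustrative schedule and shapes, not Qualcomm's implementation; Stable Diffusion applies this in latent space):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy forward diffusion: a linear beta schedule and the closed-form q(x_t | x_0).
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # per-step noise variances
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal-retention factor

x0 = rng.standard_normal((8, 8))         # stand-in for a clean image/latent

def forward_diffuse(x0, t):
    """Sample x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x_early = forward_diffuse(x0, 10)        # still mostly signal
x_late = forward_diffuse(x0, T - 1)      # essentially pure Gaussian noise
# Reverse diffusion trains a network (the UNet) to predict the added noise so
# these steps can be undone, one scheduler step at a time.
```

By the final step the signal coefficient sqrt(alpha_bar) is tiny, which is why generation must run the learned denoiser over many steps, and why reducing step count matters so much on device.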
and design

Knowledge distillation
Training a smaller "student" model to mimic a larger "teacher" model.
Create a smaller model with fewer parameters. Run faster inference on target deployment. Maintain prediction quality close to the teacher. Less training time.
Training data feeds both teacher and student. Match the logits of the two models to transfer the teacher model representation and minimize the distillation loss (KL divergence); the student output is also trained with a cross-entropy loss against the ground truth.

Pruning & knowledge distillation
More efficient architecture design through pruning and knowledge distillation of the original Stable Diffusion UNet (convolutional and attention blocks), reducing UNet compute (FLOPs), model size, and peak memory usage: the Efficient UNet.

Step distillation for the DDIM scheduler
DDIM: denoising diffusion implicit models; MSE: mean-squared error
Teach the student model to achieve in one step what the teacher achieves in multiple steps. Teacher: 2 UNets; student: 1 UNet; trained with an MSE loss.

Results:
                    FID     CLIP    Inference latency
Baseline (SD-1.5)   17.14*  0.3037  5.05 seconds
Fast SD             20.08   0.3004  0.56 seconds
*These results are not directly comparable since baseline Stable Diffusion was trained with an over 20x larger dataset than fast Stable Diffusion. SD: Stable Diffusion

Fast Stable Diffusion
Step distillation: reduces UNet forward passes to fewer than 20.
Guidance conditioning: combines conditional and unconditional generation.
Efficient UNet: reduces compute (FLOPs), model size, and peak memory usage.
e-to-v: reparameterization from epsilon to velocity space for robust distillation.
Our full-stack AI optimization of Stable Diffusion significantly improves latency while maintaining accuracy: 9x speedup vs. baseline Stable Diffusion.

World's fastest AI text-to-image generation on a phone
Takes less than 0.6 seconds to generate 512x512 images from text prompts, using the Efficient UNet architecture, guidance conditioning, and step distillation. Full-stack AI optimization achieves this improvement.
LVM: language vision model
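The step-distillation idea above can be sketched with a toy stand-in. This is a hypothetical linear "denoiser", not the actual UNet/DDIM pipeline; the principle is the same: the distillation target is the teacher's two-step output, and the student learns to match it in one step via an MSE loss.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in: the teacher "denoiser" is a fixed linear map.
W_teacher = 0.9 * np.eye(4)

def teacher_step(x):
    return W_teacher @ x

def teacher_two_steps(x):
    return teacher_step(teacher_step(x))      # the target for ONE student step

W_student = 0.1 * rng.standard_normal((4, 4))
lr = 0.05
for _ in range(2000):
    x = rng.standard_normal(4)
    residual = W_student @ x - teacher_two_steps(x)
    W_student -= lr * np.outer(residual, x)   # SGD on the MSE loss 0.5*||residual||^2

# The student converges to W_teacher @ W_teacher = 0.81 * I:
# one student pass now does the work of two teacher passes.
```

Applied repeatedly, this halving is what drives the UNet forward-pass count below 20 while keeping output quality close to the teacher.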
Full-stack AI optimization for LVM
AI acceleration on the Qualcomm Hexagon NPU of the Snapdragon 8 Gen 3 Mobile Processor.
Runs completely on the device. Significantly reduces runtime latency and power consumption. Continuously improves the Qualcomm AI Stack: Qualcomm AI Engine Direct for improved performance and minimized memory spillage. Knowledge distillation for pruning and removing attention blocks, resulting in an accurate model with improved performance and power efficiency. Designing an efficient diffusion model through knowledge distillation for high accuracy.

LLMs are highly bandwidth limited rather than compute limited
Illustration of autoregressive language modeling: the single-token generation architecture of large language models results in high memory bandwidth. Prompt: "Recite the first law of robotics" → generated one token at a time: "A robot may not injure a human being".
Huge bandwidth: each parameter of the model must be read to generate each token (e.g., read 7B parameters for Llama 7B to generate a single token), streaming the LLM (embeddings, transformer layers 1..N, LM head) from DRAM/DDR through TCM to the NPU.
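The bandwidth pressure above can be made concrete with back-of-the-envelope arithmetic. The bandwidth figure below is an assumed illustrative number, not a measured Snapdragon specification:

```python
# Back-of-the-envelope: every decoded token streams all weights from DRAM,
# so memory bandwidth caps the achievable token rate.
params = 7e9                          # Llama 2-7B parameter count

bytes_per_token_fp16 = params * 2.0   # 14 GB of weight reads per token at FP16
bytes_per_token_int4 = params * 0.5   # 3.5 GB per token at INT4

bandwidth = 50e9                      # assumed sustained DRAM bandwidth, 50 GB/s
tps_fp16 = bandwidth / bytes_per_token_fp16   # weight-read bound on tokens/s, FP16
tps_int4 = bandwidth / bytes_per_token_int4   # 4x higher bound at INT4
```

Whatever the actual bandwidth, quantizing FP16 to INT4 shrinks the per-token weight traffic 4x, which is why the next slides focus on 4-bit quantization.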
LLM quantization motivations
A 4x smaller model (i.e., FP16 → INT4). Reduce memory bandwidth and storage. Reduce latency. Reduce power consumption. Maintain the accuracy of the FP published models.
LLM quantization challenges
Post-training quantization (PTQ) may not be accurate enough for 4-bit. The training pipeline (e.g., data or rewards) is not available for quantization-aware training (QAT).
Shrinking an LLM for increased performance while maintaining accuracy is challenging.

Quantization-aware training with knowledge distillation
Reduces memory footprint while solving the quantization challenges of maintaining model accuracy and the lack of the original training pipeline: a 1-point increase in perplexity¹ and a 1% decrease in accuracy.
¹Perplexity is averaged over several test sets, including wikitext and c4 (subset).
Construct a training loop that can run two models on the same input data: teacher Llama-2-Chat 7B (FP16) and student Llama-2-Chat 7B (INT4), on a dataset with true labels.
Loss 1: KL loss (teacher soft logits, student soft logits). Loss 2: cross-entropy loss (true labels, student hard logits).
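The two loss terms in the training loop above can be sketched as follows, using the temperature of 4 noted in the deck for the soft logits. The 50/50 mixing weight `alpha` is an assumption; the deck specifies only the two terms, not how they are weighted:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, labels, T=4.0, alpha=0.5):
    """KL(teacher || student) on temperature-softened logits, plus hard-label
    cross entropy on the student's unscaled ("hard", no temperature) logits."""
    p_t = softmax(teacher_logits / T)          # teacher soft labels
    p_s = softmax(student_logits / T)          # student soft predictions
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    p_hard = softmax(student_logits)           # hard logits: no temperature
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(alpha * kl + (1.0 - alpha) * ce))

# Toy logits standing in for FP16 teacher / INT4 student outputs on one token.
teacher = np.array([[5.0, 1.0, -2.0]])
student = np.array([[4.0, 1.5, -1.0]])
labels = np.array([0])
loss = kd_loss(teacher, student, labels)
```

The KL term pulls the quantized student toward the teacher's full distribution (which is why no original training pipeline is needed), while the CE term keeps it anchored to the true labels.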
The KD loss function combines the KL divergence loss (on the temperature-softened soft logits, temperature = 4) and the hard-label CE loss (on the hard logits, no temperature) over the class probabilities.

Llama 2 → Llama 2 draft
Speculative decoding speeds up token rate by trading off compute for bandwidth. A good draft model predicts with a high acceptance rate: the draft model generates a few speculative tokens at a time, and the target model decides which to accept in one pass.
Example, for the prompt "Recite the first law of robotics": the draft proposes "A robot should ...", the target accepts "A robot" and corrects "should" to "may"; decoding then continues from the accepted prefix, with most tokens generated by the draft and merely checked and accepted by the target, until "A robot may not injure a human being" is produced.
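The draft-then-verify loop above can be sketched for greedy decoding. The "models" here are hypothetical next-token lookup tables, not Llama 2; the point is the key property that the output is identical to decoding with the target alone:

```python
# Toy speculative decoding with deterministic (greedy) models.
TARGET = {"Recite": "A", "A": "robot", "robot": "may", "may": "not",
          "not": "injure", "injure": "a", "a": "human", "human": "being"}
DRAFT = {"Recite": "A", "A": "robot", "robot": "should", "should": "not",
         "may": "not", "not": "injure", "injure": "a", "a": "human",
         "human": "being"}

def generate(model, last, n):
    """Plain autoregressive decoding: one model call per token."""
    out = []
    for _ in range(n):
        if last not in model:
            break
        last = model[last]
        out.append(last)
    return out

def speculative_decode(last, n, k=3):
    """Draft proposes k tokens; the target verifies them in one pass,
    accepting matches and correcting the first mismatch."""
    out = []
    while len(out) < n and last in TARGET:
        proposal = generate(DRAFT, last, k)
        if not proposal:                      # draft stuck: fall back to target
            last = TARGET[last]
            out.append(last)
            continue
        for tok in proposal:
            if last not in TARGET or len(out) >= n:
                break
            expected = TARGET[last]           # what the target would emit
            if tok == expected:
                out.append(tok)               # draft token accepted
            else:
                out.append(expected)          # target corrects; discard the rest
                last = expected
                break
            last = tok
    return out[:n]

spec = speculative_decode("Recite", 8)
ref = generate(TARGET, "Recite", 8)           # target-only greedy decoding
```

Because every emitted token is either the draft's token confirmed by the target or the target's own correction, `spec` equals `ref` exactly; the speedup comes from the target verifying several tokens per pass instead of reading all its weights once per token.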
Training a significantly smaller draft LLM for speculative decoding while maintaining enough accuracy is challenging.
Small draft model motivations
A 10x smaller draft model than the target model. Fast results. Reduce memory bandwidth, storage, latency, and power consumption.
Small draft model challenges
The training pipeline (e.g., data or rewards) is not available. Cover multiple model families, e.g., 7B and 13B models. Match the distribution of the target model for a higher acceptance rate.

Speculative decoding provides speedup with no accuracy loss
Using our research techniques on Llama 2-7B Chat, we achieved up to 20 tokens per second.

World's fastest Llama 2-7B on a phone
At Snapdragon Summit 2023: up to 20 tokens per second, demonstrating both chat and application interaction
on device. World's first demonstration of speculative decoding running on a phone.

World's first large multimodal model (LMM) on an Android phone
LLMs can now see: a 7+ billion parameter LMM, LLaVA, with text, speech, and image inputs. Multi-turn intuitive conversations about an image at a responsive token rate. Full-stack AI optimization to achieve high performance at low power. Enhanced privacy, reliability, personalization, and cost with on-device processing. At MWC 2024.
LLM: Large Language Model; LLaVA: Large Language and Vision Assistant

Our first low rank adaptation (LoRA) on an Android phone
A 1+ billion parameter Stable Diffusion with a LoRA adapter for customized experiences. LoRA enables scalability and customization of on-device generative AI across use cases. Full-stack AI optimization to achieve high performance while fast-switching between adapters and minimizing memory need. Enhanced privacy, reliability, personalization, and cost with on-device processing. At MWC 2024.
import qai_hub as hub

# select device
device = hub.get_device("qualcomm-snapdragon-8gen2")

# produce model
job = hub.submit_compile_and_profile_job(
    torch_model,
    name="MyDetector",
    device=device,
    input_shapes=(3, 720, 1024),
)

# deploy to device
model = job.download_target_model()

Qualcomm AI Hub
Library of fully optimized AI models for deployment across Snapdragon and Qualcomm platforms: model selection → target platform → deployment → test and validate.
AIHUB.QUALCOMM.COM

On-device generative AI offers many benefits. Generative AI is happening now on the device.
Our on-device AI leadership is enabling generative AI.

Connect with us: QCOMResearch
Follow us on: https:/
For more information, visit us at:
Thank you

Nothing in these materials is an offer to sell any of the components or devices referenced herein. 2018-2023 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved.
Qualcomm, Snapdragon, Adreno, Hexagon, Kryo, FastConnect, and Qualcomm Spectra are trademarks or registered trademarks of Qualcomm Incorporated. Other products and brand names may be trademarks or registered trademarks of their respective owners.
References in this presentation to "Qualcomm" may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes our licensing business, QTL, and the vast majority of our patent portfolio. Qualcomm Technologies, Inc., a subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of our engineering, research and development functions, and substantially all of our products and services businesses, including our QCT semiconductor business.
Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries. Qualcomm patented technologies are licensed by Qualcomm Incorporated.