Qualcomm: Multimodal AI Sensory Fusion - Vision, Hearing, and Interaction Technology White Paper, 2024 (English edition, 34 pages)


Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
September 24, 2024

Integrating senses: how AI is learning to see, hear, and interact
Roland Memisevic, Senior Director of Engineering at Qualcomm AI Research
Joint work with Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Reza Pourreza, Sanjay Haresh, and others

Agenda
- Key concept: streaming architecture
- Importance of datasets for end-to-end training
- Efficient human-AI interaction and video-based reasoning
- Improving streaming video LLMs using auxiliary tasks
- Q&A

Generative AI capabilities continue to increase
Modality and use case:
- Agents: execute multi-step tasks with reasoning autonomously to achieve a goal
- Video & 3D: generating content for a richer and more realistic experience
- Voice UI: voice is a natural and intuitive interface for conversation
- Large multimodal models: utilizing more sensing input modalities to better understand the world
Capability and KPI:
- Personalization: fine-tuned models customized to consumers, enterprises, or industries (e.g., LoRA)
- Longer context window: allows in-depth conversations
- Higher resolution: process higher fidelity images for better accuracy
LoRA: low-rank adaptation

Full-stack AI optimization for LVMs (language vision models)
- AI acceleration on the Qualcomm Hexagon NPU of the Snapdragon 8 Gen 3 Mobile Processor
- Qualcomm AI Engine Direct for improved performance and minimized memory spillage
- Knowledge distillation for pruning and removing attention blocks, resulting in an accurate model with improved performance and power efficiency
- Designing an efficient diffusion model through knowledge distillation for high accuracy
- Runs completely on the device, significantly reducing runtime latency and power consumption
- Continuously improves the Qualcomm AI Stack

To scale, the center of gravity of AI processing is moving to the edge
Hybrid AI: distribute workloads among cloud and edge/devices to deliver more powerful, efficient, and highly optimized experiences
- On device: immediacy | reliability | personalization | privacy | security | cost | energy
- Central cloud: ease of development & deployment | training | very large models | aggregation | absolute performance
- Edge cloud (on-prem or nearby): immediacy | reliability | personalization | privacy | security | fine-tuning | aggregation

LLMs can now see: world's first large multimodal model (LMM) on an Android phone
- 7+ billion parameter LMM, LLaVA, with text, speech, and image inputs
- Multi-turn intuitive conversations about an image at a responsive token rate
- Full-stack AI optimization to achieve high performance at low power
- Enhanced privacy, reliability, personalization, and cost with on-device processing
LLM: large language model; LLaVA: Large Language and Vision Assistant

Goal: training AI models to see and interact with humans
Smart home | mobile | robotics

Researching visually-grounded LLMs with the ability to reason and interact with the environment
Situated vision-language models:
- Process a live video stream in real time and dynamically interact with users
- Determine what to say and when to say it
- Enable the path to humanoids
Open-ended, asynchronous interaction with situated agents is an open challenge; existing approaches are:
- Limited to turn-based interactions about offline documents or images
- Limited to capturing momentary snapshots of reality in a VQA-style dialogue
System components: a visually-grounded LLM with vision, action recognition, an orchestrator, a frontend, the LLM, and TTS
"What to Say and When to Say it: Video-Language Model and Benchmark for Situated Interactions" (2024); "OpenEQA: Embodied Question Answering in the Era of Foundation Models" (2024); VQA: visual question answering

Neural networks have replaced increasingly complex computational pipelines
- Speech to text: audio to text (2010)
- Object recognition: pixels to objects (2012)
- Language translation: English to French (2014)

End-to-end backprop for agents
- An (auto-regressive) neural network maps an input stream directly to a behavior stream

Key concept: multi-modal streaming architecture
End-to-end learning requires a multi-modal streaming architecture
- An auto-regressive LLM, trained end-to-end, consumes external input (e.g., a camera) and emits language or actions
- An auto-regressive language model is a useful component of a multi-modal agent because it is already able to perform a dialogue with a user
- Additionally, language makes it easy to encode surrogate tasks for a degree of "common sense" to emerge
- Visual foundation models that combine an image feature extractor with a language model backbone have become increasingly common
- There are multiple different ways to combine visual information with language model tokens, e.g., cross-attention (e.g., Flamingo) or dedicated vision tokens (e.g., LLaVA); these are good for applications like captioning and visual question answering
- However, a live agent that can utilize a real-time camera feed requires a system that can continuously attend to visual input; the context window freely interleaves vision frames and language tokens (FRAME TOKEN TOKEN TOKEN FRAME TOKEN TOKEN TOKEN FRAME TOKEN TOKEN ...), as in the sketch below
- Challenges: freely interleaved vision frames and language tokens; dependencies between vision frame rate and token rate; training data that allows a model to learn what to say and when
- Recent work: "VideoLLM-online: Online Video Large Language Model for Streaming Video", Chen et al., 2024, and our work, presented in the next slides
"Flamingo: a Visual Language Model for Few-Shot Learning", Alayrac et al., 2022; "Visual Instruction Tuning", Liu et al., 2023
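The interleaving described above can be made concrete with a short streaming loop. The following is a minimal, hypothetical PyTorch-style sketch, assuming a frame encoder and an LLM callable that share the model's embedding space; the names, the greedy decoding, and the special silence token are illustrative assumptions, not the architecture presented in this deck.

```python
import torch

SILENCE_ID = 0  # assumed id of a special "say nothing right now" token

@torch.no_grad()
def stream_step(frame, context, frame_encoder, lm, tok_embedding, max_new_tokens=16):
    """Consume one video frame, then let the model speak until it chooses silence."""
    frame_emb = frame_encoder(frame)                  # (n_frame_tokens, d_model)
    context = torch.cat([context, frame_emb], dim=0)  # interleave frame tokens into the context

    spoken = []
    for _ in range(max_new_tokens):
        logits = lm(context.unsqueeze(0))[0, -1]      # next-token distribution over the vocabulary
        token = int(torch.argmax(logits))             # greedy decoding, for simplicity
        if token == SILENCE_ID:                       # the model decides "when to say it"
            break
        spoken.append(token)                          # the model decides "what to say"
        context = torch.cat([context, tok_embedding(torch.tensor([token]))], dim=0)
    return spoken, context
```

A training-time variant of the same loop would compute a next-token loss over the whole interleaved sequence, including the silence decisions, which is one way a model can learn both what to say and when to say it.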

Importance of datasets for end-to-end training

Datasets for end-to-end training of visual assistants
Key requirement for end-to-end training: an aligned video feed (frames) + the assistant's comments (tokens); see the schema sketch below
- "HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World", Wang et al., 2024: 1st-person videos showing a variety of tasks (20 tasks across 16 objects)
- "Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?", Bao et al., 2023: 1st-person videos showing preparation of cupcakes
- "Live Fitness Coaching as a Testbed for Situated Interactions", Panchal et al., 2024: 3rd-person videos showing fitness exercises and their corrections
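To make the alignment requirement concrete, a training sample can be pictured as time-stamped frames plus time-stamped coach utterances. The sketch below is a hypothetical schema with invented field names, not the published format of any of the datasets above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TimedUtterance:
    t_start: float  # seconds into the video at which the assistant starts speaking
    text: str       # the assistant's comment, e.g., "Keep your back straight"

@dataclass
class AlignedSession:
    frame_paths: List[str]    # one RGB frame per timestep (e.g., sampled at a fixed fps)
    frame_times: List[float]  # timestamp of each frame, in seconds
    utterances: List[TimedUtterance] = field(default_factory=list)

    def comments_before(self, t: float) -> List[str]:
        """All assistant comments that should already have been emitted by time t."""
        return [u.text for u in self.utterances if u.t_start <= t]
```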

Live Fitness Coaching as a Testbed for Situated Interaction, Panchal, Bhattacharyya, et al., 2024
Aimed at the development of interactive multi-modal vision-language models, grounded in the controlled but challenging fitness coaching domain
FIT-Coach benchmark and dataset: a novel interactive visual coaching benchmark and dataset as a test-bed for real-time, real-world situated interaction
- Fitness questions dataset: 148 exercises | 300k short-clip videos | 470+ hours | 1,900 unique participants | 1.1M+ high-level question-answer pairs | 400k+ fine-grained question-answer pairs
- Fitness feedback dataset: 21 unique participants | 9+ hours of fitness coaching sessions | 148 exercise sessions | 3.5-minute sessions with 5 to 6 exercises each

Fitness assistant dataset and benchmark
- Long-range videos showing the user exercising, along with aligned comments by the coach (200 sessions with 5-6 exercises each)
- Short video clips showing the user performing individual exercises, along with labels for performance and common mistakes (300k clips of 5-10 seconds each)

Metric | Short clips, train | Short clips, test | Long-range, train | Long-range, test
Number of videos | 290,775 | 16,429 | 153 | 69
Unique participants | 1,800+ | 100 | 21 | 7
Average duration (s) | 5.6 ± 1.1 | 5.6 ± 1.2 | 213.4 ± 3.1 | 213.7 ± 3.3
Exercises per video | 1 | 1 | 5-6 | 5-6
Total number of exercises | 148 | 148 | 23 | 23
Total classes | 1,866 | 1,690 | - | -
Fitness questions: total high-level questions | 1,193,056 | 78,390 | - | -
Fitness questions: total fine-grained questions | 404,082 | 80,694 | - | -
Fitness feedbacks: average feedbacks per exercise | 2.0 ± 10.1 | 2.4 ± 6.9 | 5.0 ± 1.3 | 5.0 ± 1.2
Fitness feedbacks: average silence period (s) | n/a | n/a | 5.2 ± 1.4 | 5.3 ± 1.2
Fitness feedbacks: average feedback length (words) | 9.0 ± 6.1 | 9.1 ± 5.0 | 6.3 ± 3.8 | 6.6 ± 4.0

Fitness assistant dataset and benchmark
- Long fitness sessions dataset
- Short fitness clips dataset

Our dataset meets all the needs of interactive AI assistants
Comparison of datasets along six dimensions (human actions, interactivity, mistakes, corrective feedbacks, domain expertise, length in hours); existing datasets each lack one or more of these properties, while QEVD (ours) covers them all:
- Action recognition datasets: NTU RGB+D (fitness), FineGym (fitness, 708 h)
- Procedural activity datasets: YouCook2 (cooking, 176 h), Epic-Kitchens (cooking, 100 h), HowTo100M (daily life, 134k h), Ego-4D (daily life, 3,670 h), Ego-Exo4D (daily life, 1,422 h), Assembly-101 (toy assembly, 513 h)
- Interactive AI assistant datasets: WTAG (cooking, 10 h), HoloAssist (object manipulation, 166 h)
- QEVD (ours): fitness, 474 h

Efficient human-AI interaction and video-based reasoning

Detailed architecture: learning what to say and when to say it
[Architecture diagram: the visual stream is processed by stacked 3D-CNN blocks; their features are fused into the language backbone through interleaved self-attention and cross-attention layers; conditioned on a prompt and the external input (e.g., a camera), the auto-regressive LLM emits language or actions.]
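As a rough illustration of the fusion pattern in the diagram, the block below interleaves self-attention over language tokens with cross-attention into visual features produced by a 3D CNN. It is a generic PyTorch sketch of this common pattern, with assumed dimensions, not the exact layer stack used in the deck.

```python
import torch
import torch.nn as nn

class VisionLanguageFusionBlock(nn.Module):
    """Self-attention over text tokens followed by cross-attention into visual features."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, text_tokens, visual_feats, causal_mask=None):
        # text_tokens: (B, T_text, d_model); visual_feats: (B, T_vis, d_model) from the 3D CNN
        h, _ = self.self_attn(text_tokens, text_tokens, text_tokens, attn_mask=causal_mask)
        x = self.norm1(text_tokens + h)                        # residual + norm
        h, _ = self.cross_attn(x, visual_feats, visual_feats)  # queries: text, keys/values: vision
        return self.norm2(x + h)
```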

Steppable causal 3D convolutions enable efficient streaming motion perception
- Existing vision-language models use a 2D CNN or a vision transformer as the visual feature extractor; this makes them unsuitable for tasks such as fitness coaching, which involve understanding human behaviors and motion patterns
- We use a 3D CNN as the feature extractor, which we have shown to be well-suited to end-to-end learning ("Is end-to-end learning enough for fitness activity recognition?", Mercier et al., 2023)
- Efficient visual streaming at inference time can be enabled using steppable, causal convolutions: unlike a standard convolution, a causal convolution looks only at previous timesteps, and a steppable convolution caches those previous timesteps so that each new timestep can be processed incrementally (see the sketch below)
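A minimal sketch of the idea, assuming PyTorch: a causal temporal convolution that keeps a rolling cache of past features so that inference can advance one frame at a time. This illustrates the general technique rather than Qualcomm's implementation.

```python
import torch
import torch.nn as nn

class SteppableCausalConv3d(nn.Module):
    """Causal 3D convolution over (C, T, H, W) features that can be stepped one frame at a time."""

    def __init__(self, in_ch: int, out_ch: int, kernel_t: int = 3, kernel_hw: int = 3):
        super().__init__()
        self.kernel_t = kernel_t
        # No temporal padding here; causality is enforced by what we feed in.
        self.conv = nn.Conv3d(in_ch, out_ch, (kernel_t, kernel_hw, kernel_hw),
                              padding=(0, kernel_hw // 2, kernel_hw // 2))
        self.buffer = None  # cached features for the previous (kernel_t - 1) timesteps

    def step(self, frame_feat: torch.Tensor) -> torch.Tensor:
        # frame_feat: (B, C, 1, H, W), a single new timestep
        if self.buffer is None:
            # Causal "padding": zeros stand in for the timesteps before the stream started.
            self.buffer = torch.zeros_like(frame_feat).repeat(1, 1, self.kernel_t - 1, 1, 1)
        window = torch.cat([self.buffer, frame_feat], dim=2)  # (B, C, kernel_t, H, W)
        out = self.conv(window)                               # (B, C_out, 1, H, W)
        self.buffer = window[:, :, 1:]                        # slide the cache forward
        return out
```

Stacking blocks like this yields a 3D CNN whose per-frame inference cost stays constant, which is what makes continuous streaming from a live camera practical.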

Enhance your app with the ability to see & interact with humans via any RGB camera

Improving streaming video LLMs using auxiliary tasks

Language generation is not only a useful task; it also helps a model acquire a degree of "common sense"
- Using a language decoder to provide surrogate tasks to the model at training time (see the sketch below)
- Pre-training a model on a difficult captioning task (Something-Something, Goyal et al., 2017*)
- [Bar chart comparing pre-training schemes: generating complex textual descriptions, generating simple textual descriptions, classification on 178 action classes, classification on 40 action groups, baseline classification on images, and training from scratch]
- This pre-training allows us to improve prediction accuracy on a separate Home Cooking task ("On the effectiveness of task granularity for transfer learning", Mahdisoltani et al., 2018)
*"The something-something video database for learning and evaluating visual common sense" (Goyal et al., 2017)
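One way to read "using a language decoder to provide surrogate tasks" is as a multi-task objective: the same visual backbone is trained on its primary streaming objective and on captioning-style description at the same time. The sketch below is a hypothetical PyTorch-style loss combination with invented module and field names, shown only to make the idea concrete.

```python
import torch
import torch.nn.functional as F

def surrogate_task_loss(backbone, lm_head, batch, caption_weight: float = 0.5):
    """Joint loss: primary next-token feedback loss + a captioning surrogate task."""
    feats = backbone(batch["frames"])                     # shared visual features

    # Primary task: predict the coach's next feedback token at each step.
    feedback_logits = lm_head(feats, batch["feedback_in"])
    loss_feedback = F.cross_entropy(
        feedback_logits.flatten(0, 1), batch["feedback_target"].flatten()
    )

    # Surrogate task: describe the clip in language (e.g., Something-Something captions).
    caption_logits = lm_head(feats, batch["caption_in"])
    loss_caption = F.cross_entropy(
        caption_logits.flatten(0, 1), batch["caption_target"].flatten()
    )

    return loss_feedback + caption_weight * loss_caption
```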

A vision-language model can learn low-level visual skills by encoding visual information as language
- Encoding visual information as language is a natural way to teach a vision-language model low-level visual skills, such as object identification, detection, etc. (see the example below)
- The use of these visual skills at inference time is like performing chain-of-thought reasoning for visual inference tasks
- "Look, Remember and Reason: Grounded reasoning in videos with language models", Bhattacharyya et al., 2024
- Examples: Something-Else (Materzynska et al., 2020) and CATER (Girdhar et al., 2020)
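To illustrate what "encoding visual information as language" can look like in practice, the snippet below serializes hypothetical detector outputs into text that a language model can reason over as intermediate, chain-of-thought-style tokens; the format is invented for illustration and is not the prompt format of the LRR paper.

```python
from typing import List, Tuple

Box = Tuple[str, float, float, float, float]  # (label, x1, y1, x2, y2), normalized coordinates

def boxes_to_text(frame_idx: int, boxes: List[Box]) -> str:
    """Serialize per-frame detections into tokens the language model can reason over."""
    parts = [f"<frame {frame_idx}>"]
    for label, x1, y1, x2, y2 in boxes:
        parts.append(f"{label} at ({x1:.2f},{y1:.2f},{x2:.2f},{y2:.2f})")
    return " ".join(parts)

# Example: the model can then be prompted to reason over these grounded tokens,
# e.g., "hand at (...) cup at (...)" supports the inference "picking up the cup".
print(boxes_to_text(7, [("hand", 0.41, 0.32, 0.55, 0.48), ("cup", 0.44, 0.36, 0.52, 0.47)]))
```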

Something-Else (Materzynska et al., 2020) results, top-1 / top-5 accuracy (%):

Method | Base top-1 | Base top-5 | Compositional top-1 | Compositional top-5
STIN+OIE+NL (Materzynska et al., 2020, MIT) | 78.1 | 94.5 | 56.2 | 81.3
Video-ChatGPT (Maaz et al., 2023) | 52.6 | 75.8 | 38.6 | 67.8
LRR (w/o surrogate tasks) | 52.6 | 75.8 | 50.1 | 70.8
LRR (fine-tuned) | 80.2 | 96.1 | 62.0 | 86.3
LRR (joint) | - | - | 61.1 | 85.4

CATER (Girdhar et al., 2020) results, top-1 / top-5 accuracy (%):

Method | Static camera top-1 | Static camera top-5 | Moving camera top-1 | Moving camera top-5
ALOE (Ding et al.) | 74.0 | 94.0 | 59.7 | 90.1
TFC V3D (Zhang et al.) | 79.7 | 95.5 | - | -
LRR (w/o surrogate tasks) | 68.5 | 88.7 | 62.7 | 86.7
LRR (fine-tuned) | 84.1 | 97.2 | 80.4 | 96.7
LRR (joint) | 81.0 | 97.3 | 73.7 | 95.6

Stochastic probing allows us to distill visual skills into the model
- Encoding the extracted low-level information as tokens grows the context window, which can be inefficient
- Relying on explicit representations of low-level computer vision features (such as bounding box positions) may also lead to brittleness
- We therefore propose to distill low-level visual skills into the model using a process we refer to as stochastic probing: during training, prompt the model at random time-steps to perform low-level visual tasks (see the sketch below)
- Training on visual skills can boost performance over classic approaches; stochastic probing boosts efficiency at inference time
- A similar approach: "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes", Hsieh et al., 2023

ACRE results, accuracy (%); *timing on an A100 GPU:

Method | Compositional | Systematic | Inference speed* (s)
ALOE (Ding et al.) | 91.7 | 93.9 | -
LRR | 99.3 | 99.5 | 0.061
LRR (stochastic probing) | 98.2 | 99.2 | 1.415
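The stochastic probing recipe described above can be sketched as a small change to the training loop: with some probability, at a randomly chosen time-step, the model is additionally prompted with a low-level visual query and the resulting auxiliary loss is added to the main objective. The code below is a schematic under assumed helper functions (encode_video, decode, probe_bank.sample), not the implementation behind the reported numbers.

```python
import random
import torch
import torch.nn.functional as F

def training_step(model, batch, probe_bank, probe_prob: float = 0.3):
    """One optimization step with stochastic probing as an auxiliary objective."""
    feats = model.encode_video(batch["frames"])           # (B, T, D) streaming visual features

    # Main objective: next-token prediction on the target language/action stream.
    logits = model.decode(feats, batch["text_in"])
    loss = F.cross_entropy(logits.flatten(0, 1), batch["text_target"].flatten())

    # Stochastic probing: at a random time-step, ask a low-level visual question
    # (e.g., object identity or location) and supervise the answer.
    if random.random() < probe_prob:
        t = random.randrange(feats.shape[1])               # random time-step to probe
        probe_q, probe_a = probe_bank.sample(batch, t)     # hypothetical probe generator
        probe_logits = model.decode(feats[:, : t + 1], probe_q)
        loss = loss + F.cross_entropy(
            probe_logits.flatten(0, 1), probe_a.flatten()
        )
    return loss
```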

End-to-end training in conjunction with stochastic probing allows a model to provide useful and accurate feedback in real time

Qualitative results: end-to-end learning enables video LLMs to deliver accurate live feedback
Compared outputs: ground truth, Stream-VLM, LLaMA-VID, LLaVA-Next
Question: provide an appropriate feedback for the user
- Video-LLaMA: "We see a young man standing in a kitchen, wearing a red shirt and white shorts."
- Video-ChatGPT: "The user has successfully demonstrated the ability to perform a balancing act on a pair of stools."
- Coach-LLaMA: "This is awesome. Let's keep the intensity high!"

Quantitative results: end-to-end learning enables video LLMs to deliver accurate live feedback

Zero-shot prompting results:
Method | METEOR | ROUGE-L | BERT | LLM-Acc.
InstructBLIP | 0.047 | 0.040 | 0.839 | 1.64
Video-LLaVA | 0.057 | 0.025 | 0.847 | 1.82
Video-ChatGPT | 0.098 | 0.078 | 0.850 | 2.27
Video-LLaMA | 0.101 | 0.077 | 0.859 | 2.28
LLaMA-VID | 0.100 | 0.079 | 0.859 | 2.33
LLaVA-Next | 0.104 | 0.078 | 0.858 | 2.39

Fine-tuning results:
Method | METEOR | ROUGE-L | BERT | LLM-Acc. | T-F-Score
Socratic-Llama-2-7B | 0.094 | 0.071 | 0.860 | 2.39 | 0.50
Video-ChatGPT* | 0.108 | 0.093 | 0.863 | 2.42 | 0.50
LLaMA-VID* | 0.106 | 0.090 | 0.860 | 2.40 | 0.50
STREAM-VLM | 0.125 | 0.116 | 0.863 | 2.56 | 0.59
STREAM-VLM (w/o 3D CNN) | 0.090 | 0.083 | 0.857 | 2.17 | 0.51
STREAM-VLM (w/o action tokens) | 0.125 | 0.110 | 0.861 | 2.56 | 0.50
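For reference, n-gram and embedding-based metrics like those in the tables can be computed with standard open-source packages; the snippet below sketches ROUGE-L and BERTScore for a single hypothesis/reference pair using the rouge_score and bert_score packages (METEOR is similarly available in NLTK, while LLM-Acc. and T-F-Score come from benchmark-specific judging and are not reproduced here).

```python
# Assumes: pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Keep your back straight and lower your hips a little more."
hypothesis = "Try to keep the back straight and sink the hips slightly lower."

# ROUGE-L: longest-common-subsequence overlap between hypothesis and reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, hypothesis)["rougeL"].fmeasure

# BERTScore: token-level semantic similarity using contextual embeddings.
_, _, f1 = bert_score([hypothesis], [reference], lang="en")

print(f"ROUGE-L F1: {rouge_l:.3f}, BERTScore F1: {float(f1[0]):.3f}")
```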

Outlook: CLEVRskills dataset for robotics foundation models
Comparison of datasets/simulators (number of tasks, language, multimodal prompts, action granularity, compositionality, number of demonstrations):
- Real: RoboTurk (3 tasks, action deltas, 111 hrs), BridgeData (71 tasks, action deltas, 7.2k), Open-X (action deltas, 1M), RH20T (action deltas, 100k), FMB (7 tasks, action deltas, 22.5k)
- Simulated: CALVIN (34 tasks, action deltas), Behaviour-1K (1,000 tasks, action deltas), Maniskill2 (20 tasks, action deltas, 70k), VIMA (17 tasks, poses, 650k)
- ClevrSkill (ours): 36 tasks, action deltas + poses, 330k demonstrations

Running AI on device saves memory costs and reduces latency, while bringing enhanced reliability, security, and more
Our framework addresses current methods' constraints in, e.g., causal or compositional spatiotemporal reasoning over actions
Qualcomm AI Research has achieved state-of-the-art results in multimodal LMs
Qualcomm has made tools available for the ML community to develop multimodal interaction-based apps

Questions
Connect with QCOM Research: https:/
Thank you

Nothing in these materials is an offer to sell any of the components or devices referenced herein. Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved. Qualcomm and Snapdragon are trademarks or registered trademarks of Qualcomm Incorporated. Other products and brand names may be trademarks or registered trademarks of their respective owners. References in this presentation to "Qualcomm" may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes our licensing business, QTL, and the vast majority of our patent portfolio. Qualcomm Technologies, Inc., a subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of our engineering, research and development functions, and substantially all of our products and services businesses, including our QCT semiconductor business. Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries. Qualcomm patented technologies are licensed by Qualcomm Incorporated. Follow us on: For more information, visit us at &
