Integrating senses: How AI is learning to see, hear, and interact
September 24, 2024
Roland Memisevic, Senior Director of Engineering at Qualcomm AI Research
Joint work with Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Reza Pourreza, Sanjay Haresh, and others
Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

Agenda
- Key concept: streaming architecture
- Importance of datasets for end-to-end training
- Efficient human-AI interaction and video-based reasoning
- Improving streaming video LLMs using auxiliary tasks
- Q&A

Generative AI capabilities continue to increase
Across modality and use case, and capability and KPI:
- Agents: execute multi-step tasks with reasoning autonomously to achieve a goal
- Video & 3D: generating content for a richer and more realistic experience
- Voice UI: voice is a natural and intuitive interface for conversation
- Large multimodal models: utilizing more sensing input modalities to better understand the world
- Personalization: fine-tuned models customized to consumers, enterprises, or industries (e.g., LoRA)
- Longer context window: allows in-depth conversations
- Higher resolution: process higher-fidelity images for better accuracy
LoRA: low-rank adaptation

Full-stack AI optimization for language vision models (LVMs)
- AI acceleration on the Qualcomm Hexagon NPU of the Snapdragon 8 Gen 3 Mobile Processor
- Qualcomm AI Engine Direct for improved performance and minimized memory spillage
- Knowledge distillation for pruning and removal of attention blocks, resulting in an accurate model with improved performance and power efficiency
- Designing an efficient diffusion model through knowledge distillation for high accuracy
- Runs completely on the device
- Significantly reduces runtime latency and power consumption
- Continuously improves the Qualcomm AI Stack

To scale, the center of gravity of AI processing is moving to the edge
Hybrid AI: distribute workloads among cloud and edge/devices to deliver more powerful, efficient, and highly optimized experiences
- On device: immediacy | reliability | personalization | privacy | security | cost | energy
- Central cloud: ease of development and deployment | training | very large models | aggregation | absolute performance
- Edge cloud (on-prem or nearby): immediacy | reliability | personalization | privacy | security | fine-tuning | aggregation
LLM: large language model; LLaVA: Large Language and Vision Assistant

World's first large multimodal model (LMM) on an Android phone: LLMs can now see
- 7+ billion parameter LMM, LLaVA, with text, speech, and image inputs
- Multi-turn intuitive conversations about an image at a responsive token rate
- Full-stack AI optimization to achieve high performance at low power
- Enhanced privacy, reliability, personalization, and cost efficiency with on-device processing

Goal: Training AI models to see and interact with humans
Smart home | Mobile | Robotics
"What to Say and When to Say it: Video-Language Model and Benchmark for Situated Interactions" (2024); "OpenEQA: Embodied Question Answering in the Era of Foundation Models" (2024); VQA: visual question answering

Researching visually-grounded LLMs with the ability to reason and interact with the environment
Situated vision-language models:
- Process a live video stream in real time and dynamically interact with users
- Determine what to say and when to say it
- Enable the path to humanoids
Open-ended, asynchronous interaction with situated agents is an open challenge; current models are:
- Limited to turn-based interactions about offline documents or images
- Limited to capturing momentary snapshots of reality in a VQA-style dialogue
System diagram: a visually-grounded LLM with vision, action recognition, orchestrator, frontend, LLM, and TTS components.

Neural networks have replaced increasingly complex computational pipelines
- Speech to text: audio to text (2010)
- Object recognition: pixels to objects (2012)
- Language translation: English to French (2014)
In each case, increasingly complex hand-built pipelines were replaced by end-to-end neural networks.
End-to-end backprop for agents
Diagram: an (auto-regressive) neural network maps an input stream to a behavior stream.

Key concept: multi-modal streaming architecture

End-to-end learning requires a multi-modal streaming architecture
Diagram: external input (e.g., a camera) feeds an auto-regressive LLM that emits language or actions; the whole system is trained end-to-end.
- An auto-regressive language model is a useful component of a multi-modal agent because it is already able to perform a dialogue with a user
- Additionally, language makes it easy to encode surrogate tasks for a degree of "common sense" to emerge (a sketch of the streaming loop follows below)
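To make the streaming picture concrete, the loop below sketches how an input stream of camera frames can be turned into a behavior stream by an auto-regressive model that, at every time-step, either emits words or stays silent. This is a minimal illustration in Python; the model interface (initial_state, encode_frame, next_word) and the special silence token are assumptions made for the sketch, not an actual API.

    # Minimal sketch of the streaming loop: frames arrive one at a time (input stream)
    # and the auto-regressive model decides, at each step, whether to say something
    # (behavior stream) or to emit a "silence" token and wait for the next frame.
    SILENCE = "<silence>"   # assumed special token: "nothing to say right now"
    EOS = "<eos>"           # assumed end-of-utterance token

    def streaming_loop(model, camera_frames, max_words_per_frame=8):
        behavior = []                                  # the emitted behavior stream
        state = model.initial_state()                  # running auto-regressive context
        for frame in camera_frames:                    # input stream, in real time
            state = model.encode_frame(frame, state)   # append the new frame to the context
            for _ in range(max_words_per_frame):
                word, state = model.next_word(state)   # one auto-regressive decoding step
                if word in (SILENCE, EOS):             # nothing (more) to say at this step
                    break
                behavior.append(word)
        return behavior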
End-to-end learning requires a multi-modal streaming architecture
- Visual foundation models that combine an image feature extractor with a language-model backbone have become increasingly common
- There are multiple different ways to combine visual information with language-model tokens, e.g.:
  - Cross-attention (e.g., Flamingo)
  - Dedicated vision tokens (e.g., LLaVA)
- These designs are good for applications like captioning and visual question answering. However, a live agent that can utilize a real-time camera feed requires a system that can continuously attend to visual input
- Challenges:
  - Freely interleaved vision frames and language tokens
  - Dependencies between vision frame rate and token rate
  - Training data allowing a model to learn what to say and when
- Recent work: "VideoLLM-online: Online Video Large Language Model for Streaming Video", Chen et al., 2024, and our work, which I will present in the next slides
Diagram: a context window in which frame embeddings and language tokens are freely interleaved (frame, token, token, token, frame, ...); a sketch of this layout follows the citations below.
"Flamingo: a Visual Language Model for Few-Shot Learning", Alayrac et al., 2022; "Visual Instruction Tuning", Liu et al., 2023
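The interleaved context window can be pictured as follows. This is a toy illustration in Python/PyTorch of the layout only (frame embeddings followed by whatever was said after that frame); the function, tensor shapes, and token counts are illustrative assumptions, not the implementation used in the work above.

    # Toy sketch of a context window with freely interleaved vision frames and
    # language tokens: each frame contributes a block of visual embeddings, and the
    # assistant's words (if any) follow the frame they respond to.
    import torch

    def interleave(frame_embeds, utterance_ids, embed_tokens):
        """frame_embeds: list of (k, d) tensors, one per frame (k visual tokens each).
        utterance_ids: list of 1-D LongTensors; utterance_ids[t] is what the model
        said right after frame t (length 0 means it stayed silent)."""
        parts = []
        for frame, toks in zip(frame_embeds, utterance_ids):
            parts.append(frame)                        # visual tokens for this frame
            if toks.numel() > 0:
                parts.append(embed_tokens(toks))       # language tokens, only when spoken
        return torch.cat(parts, dim=0)                 # (sequence_length, d)

    # Example: 3 frames of 4 visual tokens each; the model speaks only after frame 2.
    d, vocab = 16, 100
    embed_tokens = torch.nn.Embedding(vocab, d)
    frames = [torch.randn(4, d) for _ in range(3)]
    speech = [torch.tensor([], dtype=torch.long),
              torch.tensor([7, 8, 9]),
              torch.tensor([], dtype=torch.long)]
    context = interleave(frames, speech, embed_tokens)  # shape: (15, 16)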
Importance of datasets for end-to-end training

Datasets for end-to-end training of visual assistants
Key requirement for end-to-end training: an aligned video feed (frames) + the assistant's comments (tokens)
- "HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World", Wang et al., 2024: 1st-person videos showing a variety of tasks (20 tasks across 16 objects)
- "Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?", Bao et al., 2023: 1st-person videos showing preparation of cupcakes
- "Live Fitness Coaching as a Testbed for Situated Interaction", Panchal et al., 2024: 3rd-person videos showing fitness exercises and their corrections

Live Fitness Coaching as a Testbed for Situated Interaction, Panchal, Bhattacharyya, et al., 2024
Aimed at the development of interactive multi-modal vision-language models, based in the controlled but challenging fitness coaching domain.
FIT-Coach benchmark and dataset: a novel interactive visual coaching benchmark and dataset as a test-bed for real-time, real-world situated interaction
- Fitness questions dataset: 148 exercises, 300k short-clip videos, 470+ hours, 1,900 unique participants, 1.1M+ high-level question-answer pairs, 400k+ fine-grained question-answer pairs
- Fitness feedback dataset: 21 unique participants, 9+ hours of fitness coaching sessions, 148 exercise sessions, 3.5-minute-long sessions with 5 to 6 exercises

Fitness assistant dataset and benchmark
- Long-range videos showing the user exercising, along with aligned comments by the coach (200 sessions across 5-6 exercises each)
- Short video clips showing the user performing individual exercises, along with labels for performance and common mistakes (300k clips of 5-10 seconds each)
Dataset statistics (short clips: train / test; long-range: train / test):
- Number of videos: 290,775 / 16,429; 153 / 69
- Unique participants: 1,800+ / 100; 21 / 7
- Average duration (s): 5.6±1.1 / 5.6±1.2; 213.4±3.1 / 213.7±3.3
- Exercises per video: 1 / 1; 5-6 / 5-6
- Total number of exercises: 148 / 148; 23 / 23
- Total classes: 1,866 / 1,690
Fitness questions:
- Total high-level questions: 1,193,056 / 78,390
- Total fine-grained questions: 404,082 / 80,694
Fitness feedbacks:
- Average feedbacks per exercise: 2.0±10.1 / 2.4±6.9; 5.0±1.3 / 5.0±1.2
- Average silence period (s): n/a / n/a; 5.2±1.4 / 5.3±1.2
- Average feedback length (words): 9.0±6.1 / 9.1±5.0; 6.3±3.8 / 6.6±4.0
Fitness assistant dataset and benchmark
Example data from the long fitness sessions dataset and the short fitness clips dataset.

Comparison with existing datasets (criteria: domain, human actions, interactive, mistakes, corrective feedbacks, domain expertise, length in hours):
- Action recognition datasets: NTU RGB+D (fitness), FineGym (fitness, 708 h)
- Procedural activity datasets: YouCook2 (cooking, 176 h), Epic-Kitchens (cooking, 100 h), HowTo100M (daily life, 134k h), Ego-4D (daily life, 3,670 h), Ego-Exo4D (daily life, 1,422 h), Assembly-101 (toy assembly, 513 h)
- Interactive AI assistant datasets: WTAG (cooking, 10 h), HoloAssist (object manipulation, 166 h), QEVD (ours; fitness, 474 h)
Each of the existing datasets lacks several of these criteria; our dataset meets all the needs of interactive AI assistants.
Efficient human-AI interaction and video-based reasoning

Detailed architecture: learning what to say and when to say it
Diagram: external input (e.g., a camera) is encoded by 3D CNNs into a visual stream; the auto-regressive LLM language backbone consists of self-attention blocks over the prompt and generated tokens, with interleaved cross-attention blocks that attend to the visual stream, and it outputs language or actions.
Steppable causal 3D convolutions enable efficient streaming motion perception
- Existing vision-language models use a 2D CNN or vision transformer as the visual feature extractor. This makes them unsuitable for tasks such as fitness coaching, which involve understanding of human behaviors and motion patterns
- We use a 3D CNN as the feature extractor, which we have shown to be well-suited to end-to-end learning ("Is end-to-end learning enough for fitness activity recognition?", Mercier et al., 2023)
- Efficient visual streaming at inference time can be enabled using steppable, causal convolutions: unlike a standard convolution, a causal convolution only looks at previous time-steps, and a steppable convolution reuses the buffered previous time-steps when a new time-step arrives (a sketch follows below)
Enhance your app with the ability to see and interact with humans via any RGB camera.
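The sketch below illustrates the idea of a steppable, causal convolution in PyTorch: the offline (training) path pads only the past, and the streaming step() path keeps a small buffer of the previous time-steps so each new frame is processed incrementally. It is a minimal single-layer illustration, not the actual layer used in the models above.

    # Minimal sketch of a steppable, causal 3D convolution: causal padding at training
    # time, and an incremental step() at inference time that reuses buffered frames.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SteppableCausalConv3d(nn.Module):
        def __init__(self, in_ch, out_ch, kt=3, ks=3):
            super().__init__()
            self.kt = kt
            # no temporal padding inside the conv; causality comes from padding the past only
            self.conv = nn.Conv3d(in_ch, out_ch, (kt, ks, ks), padding=(0, ks // 2, ks // 2))
            self.buffer = None                               # last (kt - 1) input frames

        def forward(self, x):                                # x: (B, C, T, H, W), offline path
            x = F.pad(x, (0, 0, 0, 0, self.kt - 1, 0))       # pad past frames only -> causal
            return self.conv(x)

        @torch.no_grad()
        def step(self, frame):                               # frame: (B, C, 1, H, W), streaming path
            if self.buffer is None:                          # cold start: an all-zero past
                b, c, _, h, w = frame.shape
                self.buffer = torch.zeros(b, c, self.kt - 1, h, w, device=frame.device)
            window = torch.cat([self.buffer, frame], dim=2)  # the last kt frames
            self.buffer = window[:, :, 1:]                   # slide the buffer forward
            return self.conv(window)                         # output for the new time-step only

    # The streaming outputs match the offline causal outputs frame by frame:
    layer = SteppableCausalConv3d(3, 8)
    video = torch.randn(1, 3, 5, 32, 32)
    offline = layer(video)
    online = torch.cat([layer.step(video[:, :, t:t + 1]) for t in range(5)], dim=2)
    assert torch.allclose(offline, online, atol=1e-5)

With this layout, each new frame costs one incremental convolution over the buffered window instead of reprocessing the whole clip.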
Improving streaming video LLMs using auxiliary tasks

Language generation is not only a useful task, but it also helps a model acquire a degree of "common sense"
- Using a language decoder to provide surrogate tasks to the model at training time (a sketch of such a surrogate loss follows below)
- Pre-training a model on a difficult captioning task (Something-Something, Goyal et al., 2017*) allows us to improve prediction accuracy on a separate Home Cooking task: accuracy increases steadily from training from scratch and baseline classification on images, through classification on 40 action groups and on 178 action classes, to generating simple and then complex textual descriptions (from 7.7 up to 62.8)
"On the effectiveness of task granularity for transfer learning", Mahdisoltani et al., 2018
*"The something-something video database for learning and evaluating visual common sense", Goyal et al., 2017
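As a rough illustration of this kind of surrogate supervision, the sketch below trains a video backbone jointly on action classification and on generating a textual description through a language decoder. The module interfaces, tensor shapes, and the 0.5 loss weight are illustrative assumptions, not the recipe used in the experiments above.

    # Sketch of a surrogate language task: the video backbone receives gradients both
    # from an action-classification head and from a captioning (language) decoder.
    import torch
    import torch.nn.functional as F

    def surrogate_task_loss(backbone, cls_head, lang_decoder,
                            video, action_label, caption_ids):
        """video: (B, C, T, H, W); action_label: (B,); caption_ids: (B, L) token ids."""
        feats = backbone(video)                                  # (B, T', D) video features
        cls_loss = F.cross_entropy(cls_head(feats.mean(dim=1)), action_label)
        # teacher-forced captioning: predict token t+1 from tokens <= t and the video
        logits = lang_decoder(feats, caption_ids[:, :-1])        # (B, L-1, vocab)
        cap_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   caption_ids[:, 1:].reshape(-1))
        return cls_loss + 0.5 * cap_loss                         # joint surrogate objective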
A vision-language model can learn low-level visual skills by encoding visual information as language
- Encoding visual information as language is a natural way to teach a vision-language model low-level visual skills, such as object identification, detection, etc.
- The use of these visual skills at inference time is like performing chain-of-thought reasoning for visual inference tasks
"Look, Remember and Reason: Grounded reasoning in videos with language models", Bhattacharyya et al., 2024

Example: Something-Else (Materzynska et al., 2020)
Method (Base Top 1 | Base Top 5 | Compositional Top 1 | Compositional Top 5):
- STIN+OIE+NL (Materzynska et al., 2020, MIT): 78.1 | 94.5 | 56.2 | 81.3
- Video-ChatGPT (Maaz et al., 2023): 52.6 | 75.8 | 38.6 | 67.8
- LRR (w/o surrogate tasks): 52.6 | 75.8 | 50.1 | 70.8
- LRR (fine-tuned): 80.2 | 96.1 | 62.0 | 86.3
- LRR (joint): - | - | 61.1 | 85.4

Example: CATER (Girdhar et al., 2020)
Method (Static camera Top 1 | Static camera Top 5 | Moving camera Top 1 | Moving camera Top 5):
- ALOE (Ding et al.): 74.0 | 94.0 | 59.7 | 90.1
- TFC V3D (Zhang et al.): 79.7 | 95.5 | - | -
- LRR (w/o surrogate tasks): 68.5 | 88.7 | 62.7 | 86.7
- LRR (fine-tuned): 84.1 | 97.2 | 80.4 | 96.7
- LRR (joint): 81.0 | 97.3 | 73.7 | 95.6
Stochastic probing allows us to distill visual skills into the model
- Encoding the extracted low-level information as tokens grows the context window, and it can be inefficient
- Relying on explicit representations of low-level computer-vision features (such as bounding-box positions) may also lead to brittleness
- We therefore propose to distill low-level visual skills into the model using a process we refer to as stochastic probing: during training, prompt the model at random time-steps to perform low-level visual tasks (a sketch follows below)

ACRE results (Compositional | Systematic | Inference speed*, s):
- ALOE (Ding et al.): 91.7 | 93.9 | -
- LRR: 99.3 | 99.5 | 0.061
- LRR (stochastic probing): 98.2 | 99.2 | 1.415
Takeaways:
- Training on visual skills can boost performance over classic approaches
- Stochastic probing boosts efficiency at inference time
A similar approach: "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes", Hsieh et al., 2023
*Timing on an A100 GPU
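A rough sketch of stochastic probing as a data-construction step is shown below: when the training sequence for the streaming model is built, a low-level probe question and its answer are inserted at randomly chosen time-steps between the frames and the assistant's regular feedback. The sequence format and names are illustrative assumptions, not the exact recipe.

    # Sketch of stochastic probing: probes (low-level visual questions + answers) are
    # inserted at random time-steps when building the interleaved training sequence.
    import random

    def build_probed_sequence(frames_with_responses, probe_bank, p_probe=0.1, seed=None):
        """frames_with_responses: list of (frame_id, response_text or None) per time-step.
        probe_bank(t): returns (question, answer) strings about a low-level visual fact at t."""
        rng = random.Random(seed)
        sequence = []                                   # interleaved stream fed to the model
        for t, (frame_id, response) in enumerate(frames_with_responses):
            sequence.append({"type": "frame", "id": frame_id})
            if rng.random() < p_probe:                  # stochastic probe at this time-step
                question, answer = probe_bank(t)
                sequence.append({"type": "probe", "text": question + " " + answer})
            if response:                                # the assistant's regular live feedback
                sequence.append({"type": "text", "text": response})
        return sequence

    # Example: probe for the person's location on roughly half of the time-steps.
    demo = build_probed_sequence(
        [(0, None), (1, "Keep your back straight."), (2, None)],
        probe_bank=lambda t: ("Where is the person?", f"Near the center of frame {t}."),
        p_probe=0.5, seed=0)

Because the probes only appear in the training sequences, the distilled model does not have to emit the low-level tokens at inference time.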
End-to-end training in conjunction with stochastic probing allows a model to provide useful and accurate feedback in real time

Qualitative results: end-to-end learning enables video LLMs to deliver accurate live feedback
Example (live fitness video; prompt: provide an appropriate feedback for the user), comparing ground truth, Stream-VLM, LLaMA-VID, and LLaVA-Next outputs:
- Video-LLaMA: "We see a young man standing in a kitchen, wearing a red shirt and white shorts."
- Video-ChatGPT: "The user has successfully demonstrated the ability to perform a balancing act on a pair of stools."
- Coach-LLaMA: "This is awesome. Let's keep the intensity high!"

Quantitative results: end-to-end learning enables video LLMs to deliver accurate live feedback
Zero-shot prompting results (METEOR | ROUGE-L | BERT | LLM-Acc.):
- InstructBLIP: 0.047 | 0.040 | 0.839 | 1.64
- Video-LLaVA: 0.057 | 0.025 | 0.847 | 1.82
- Video-ChatGPT: 0.098 | 0.078 | 0.850 | 2.27
- Video-LLaMA: 0.101 | 0.077 | 0.859 | 2.28
- LLaMA-VID: 0.100 | 0.079 | 0.859 | 2.33
- LLaVA-Next: 0.104 | 0.078 | 0.858 | 2.39
Fine-tuning results (METEOR | ROUGE-L | BERT | LLM-Acc. | T-F-Score):
- Socratic-Llama-2-7B: 0.094 | 0.071 | 0.860 | 2.39 | 0.50
- Video-ChatGPT*: 0.108 | 0.093 | 0.863 | 2.42 | 0.50
- LLaMA-VID*: 0.106 | 0.090 | 0.860 | 2.40 | 0.50
- STREAM-VLM: 0.125 | 0.116 | 0.863 | 2.56 | 0.59
- STREAM-VLM (w/o 3D CNN): 0.090 | 0.083 | 0.857 | 2.17 | 0.51
- STREAM-VLM (w/o action tokens): 0.125 | 0.110 | 0.861 | 2.56 | 0.50
Outlook: CLEVRskills dataset for robotics foundation models
Comparison of datasets/simulators (criteria: #tasks, language, multimodal prompts, action granularity, compositionality, #demonstrations):
- Real: RoboTurk (3 tasks, action deltas, 111 hrs), BridgeData (71 tasks, action deltas, 7.2k), Open-X (action deltas, 1M), RH20T (action deltas, 100k), FMB (7 tasks, action deltas, 22.5k)
- Simulated: CALVIN (34 tasks, action deltas), Behaviour-1K (1,000 tasks, action deltas), ManiSkill2 (20 tasks, action deltas, 70k), VIMA (17 tasks, poses, 650k), CLEVRskills (ours: 36 tasks, action deltas + poses, 330k)
Summary
- Running AI on device saves memory costs and reduces latency, while bringing enhanced reliability, security, and more
- Our framework addresses current method constraints in, e.g., causal or compositional spatiotemporal reasoning over actions
- Qualcomm AI Research has achieved state-of-the-art results in multimodal LMs
- Qualcomm has made tools available for the ML community to develop multimodal interaction-based apps

Questions
Connect with QCOM Research
Thank you

Nothing in these materials is an offer to sell any of the components or devices referenced herein.
Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved. Qualcomm and Snapdragon are trademarks or registered trademarks of Qualcomm Incorporated. Other products and brand names may be trademarks or registered trademarks of their respective owners.
References in this presentation to "Qualcomm" may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes our licensing business, QTL, and the vast majority of our patent portfolio. Qualcomm Technologies, Inc., a subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of our engineering, research and development functions, and substantially all of our products and services businesses, including our QCT semiconductor business. Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries. Qualcomm patented technologies are licensed by Qualcomm Incorporated.