GUI Agents Based on Multimodal Large Language Models
Speaker: Dr. Zhang Chi (張馳), Tencent QQ光影實驗室 / Westlake University AGI Lab

About the speaker
Zhang Chi was selected for the 2023 Stanford University list of the world's top 2% scientists. He is currently a research scientist at Tencent and will join Westlake University in fall 2024 as an assistant professor (PhD supervisor), where he will establish the AGI Lab. He received his PhD from the School of Computer Science and Engineering, Nanyang Technological University, Singapore, in 2021. His doctoral research focused on efficient, low-cost machine learning and computer vision algorithms; his current research centers on multimodal models and AIGC models in the era of large models. He has published at top international venues including CVPR, ICCV, NeurIPS, and TPAMI.

Contents
1. Background: large language models and agents
2. Difficulties and challenges
3. User interfaces and human-computer interaction
4. A GUI-based agent: AppAgent
Large Language Models as Agents

Prompt: "Imagine that you are a home robot, and is asked to go to the kitchen to fetch something from the fridge. The image below shows your current position. Please plan your next action."
GPT-4V: "My next action would be turn right and move forward towards the hallway, as I can see that the kitchen is likely located in that direction. I would then continue to navigate through the hallway until I reach the kitchen and locate the fridge."

Prompt: "Now you are at the kitchen, this is what you can see from there. Can you see the fridge from here? What is your next action?"
GPT-4V: "From this position, I can see the fridge on the right side of the kitchen. My next action would be to move forward and turn slightly to the right to approach the fridge."

Prompt: "Imagine that you are a robot operating a computer. You are asked to find a detailed recipe of Mapo Tofu. Below is what you see on the computer screen, predict your next move."
GPT-4V: "Next move: I would move the mouse to the Google Chrome icon located on the left side of the screen, approximately 1/3 of the way down from the top. I would then click on the icon to open the web browser."
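These examples all follow the same pattern: a screenshot plus a natural-language task is sent to a vision-capable model, and the model replies with its proposed next move in free text. Below is a minimal sketch of that call using the OpenAI Python SDK; the model name, prompt wording, and helper name are illustrative assumptions rather than the exact setup used in the talk.

```python
# Minimal sketch: ask a GPT-4V-class model to plan its next move from a screenshot.
# Assumes the OpenAI Python SDK (v1.x) and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

def propose_next_move(screenshot_path: str, task: str) -> str:
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model; the name is an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"Imagine that you are a robot operating a computer. "
                          f"Your task: {task}. Below is what you see on the screen. "
                          f"Predict your next move.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example: propose_next_move("screen.png", "find a detailed recipe of Mapo Tofu")
```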
A Simple Agent Prototype
- Perception → Decision → Action (a minimal loop sketch follows the Goal list below)

Challenges
- Limitation of text-only input in current agent designs
- Control alone is a big discipline
- Insufficient knowledge in the LLM and high learning cost

Significance of the Interface
Advantages:
- Actions have clear semantic meaning, providing a user-friendly interaction method
- The large language model can focus on decision making
- Perception of the environment and state is relatively easy

Agent on the Interface
Goal:
- Develop UI-based agent frameworks with strong generalization capabilities
- Develop a multimodal LLM that can accept multimodal inputs
- Develop an embodied agent that can interact with interfaces, such as mobile phones
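The prototype and the goals above reduce to a short control loop: perceive the current UI, let the multimodal LLM decide, execute the action, and repeat. The sketch below shows only that loop; every helper in it is a placeholder I introduced for illustration, not AppAgent's actual API.

```python
# A minimal Perception -> Decision -> Action loop for a UI agent.
# All helpers are illustrative stubs, not the AppAgent implementation.
from dataclasses import dataclass, field

@dataclass
class Observation:
    screenshot_path: str                              # what the agent "sees"
    ui_elements: list = field(default_factory=list)   # labelled interactive elements

def capture_screen() -> Observation:
    # Perception: a real system would grab a screenshot and parse the UI tree here.
    return Observation(screenshot_path="screen.png")

def query_llm(task: str, obs: Observation, history: list) -> str:
    # Decision: a real system would send screenshot + task + history to a multimodal LLM.
    return "FINISH"

def execute(action: str) -> None:
    # Action: translate the chosen action into a real input event (tap, swipe, type...).
    print(f"executing: {action}")

def run_agent(task: str, max_steps: int = 20) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        obs = capture_screen()                   # Perception
        action = query_llm(task, obs, history)   # Decision
        if action == "FINISH":
            break
        execute(action)                          # Action
        history.append(action)

run_agent("beautify this photo for me")
```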
AppAgent: Multimodal Agents as Smartphone Users
- Challenge 1: How to output control? (see the element-tagging sketch after this list)
- Challenge 2: How to learn? Complex software vs. easy software.
- Challenge 3: How to generalize?
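For Challenge 1, one practical way to let the model "output control" is to tag each interactive element on the screen with a number, so the model can answer with an element index instead of raw pixel coordinates. The sketch below extracts clickable elements from an Android UI hierarchy produced by `adb shell uiautomator dump`; the parsing details are my assumption, not the paper's exact code.

```python
# Sketch: number the clickable elements of an Android screen so the LLM can
# say Tap(3) instead of guessing coordinates.
# Assumes a hierarchy file from `adb shell uiautomator dump` (window_dump.xml).
import re
import xml.etree.ElementTree as ET

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")

def list_clickable_elements(dump_path: str) -> list:
    elements = []
    for node in ET.parse(dump_path).iter("node"):
        if node.get("clickable") != "true":
            continue
        match = BOUNDS_RE.match(node.get("bounds", ""))
        if not match:
            continue
        x1, y1, x2, y2 = map(int, match.groups())
        elements.append({
            "id": len(elements) + 1,                     # numeric tag shown to the LLM
            "center": ((x1 + x2) // 2, (y1 + y2) // 2),  # where to tap
            "resource_id": node.get("resource-id", ""),
        })
    return elements

# for e in list_clickable_elements("window_dump.xml"):
#     print(e["id"], e["resource_id"], e["center"])
```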
AppAgent: Action Space
- Tap(element: int)
- Long_press(element: int)
- Swipe(element: int, direction: str, dist: str)
- Text(text: str)
- Back()
- Exit()

Zhang, Chi, et al. AppAgent: Multimodal Agents as Smartphone Users. arXiv preprint arXiv:2312.13771, 2023.
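Each action in this space ultimately has to become a real input event on the phone. A common way to do that is through adb's input commands; the sketch below maps the slide's action names onto those commands. The long-press-as-zero-distance-swipe trick is standard adb usage, but the mapping here is my assumption, not AppAgent's released code, and it assumes the element index has already been resolved to screen coordinates (for example via the numbered elements above).

```python
# Sketch: executing the AppAgent-style action space with adb input events.
# Assumes `adb` is on PATH and a device or emulator is connected.
import subprocess

def adb_input(*args: str) -> None:
    subprocess.run(["adb", "shell", "input", *args], check=True)

def tap(x: int, y: int) -> None:
    adb_input("tap", str(x), str(y))

def long_press(x: int, y: int, duration_ms: int = 1000) -> None:
    # a swipe with zero distance and a long duration behaves as a long press
    adb_input("swipe", str(x), str(y), str(x), str(y), str(duration_ms))

def swipe(x: int, y: int, direction: str, dist: int = 300) -> None:
    dx, dy = {"up": (0, -dist), "down": (0, dist),
              "left": (-dist, 0), "right": (dist, 0)}[direction]
    adb_input("swipe", str(x), str(y), str(x + dx), str(y + dy), "400")

def text(s: str) -> None:
    adb_input("text", s.replace(" ", "%s"))  # adb encodes spaces as %s

def back() -> None:
    adb_input("keyevent", "KEYCODE_BACK")
```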
Supervised fine-tuning of the LLM? Are you sure? Example probing prompts:
- "Briefly describe the functions of the four buttons in the bottom bar of this screenshot."
- "Can you provide the bounding box coordinates for the Profile Photo button on the screen?"

Human-Like System Design
How do we humans learn?
- A quick learning process: self-exploration and watching demos
- Good generalization capability

Human-Like System Design
Overall idea: Exploration-Deployment
- Explore how to use an app (self-exploration or watching demos)
- Generate a document after exploration, used as an external database
- Refer to the document for better decision making (see the sketch after the Deployment Phase below)

Exploration Phase

Deployment Phase
Task: "beautify this photo for me"
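The document generated during exploration is what lets the agent act sensibly at deployment time: notes on what each UI element does are stored externally and pasted back into the prompt when those elements reappear. The sketch below shows one way such a store could look; the JSON format, key scheme, and prompt wording are illustrative assumptions, not the paper's format.

```python
# Sketch of the exploration -> document -> deployment idea:
# record what each explored element does, then retrieve those notes at deployment.
import json
from pathlib import Path

DOC_PATH = Path("app_docs.json")  # external "database" built during exploration

def load_docs() -> dict:
    return json.loads(DOC_PATH.read_text()) if DOC_PATH.exists() else {}

def record_element_doc(element_key: str, observed_effect: str) -> None:
    # Exploration phase: summarise the observed effect of interacting with an element.
    docs = load_docs()
    docs[element_key] = observed_effect
    DOC_PATH.write_text(json.dumps(docs, indent=2, ensure_ascii=False))

def build_deployment_prompt(task: str, visible_element_keys: list) -> str:
    # Deployment phase: inject only the notes relevant to the current screen.
    docs = load_docs()
    notes = "\n".join(f"- {k}: {docs[k]}" for k in visible_element_keys if k in docs)
    return (f"Task: {task}\n"
            f"Documentation for the elements on this screen:\n{notes}\n"
            f"Choose your next action: Tap / Long_press / Swipe / Text / Back.")

# record_element_doc("photo_editor/btn_beautify", "opens the one-tap beautify panel")
# print(build_deployment_prompt("beautify this photo for me",
#                               ["photo_editor/btn_beautify"]))
```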
Experiments

Significance and Impact
- Revolutionizing human-computer interaction: GUI, CMD, LUI
- Digital employees and super assistants
- Software testing and user studies

Application
The AppAgent framework and its ideas have been validated in the company's key products, such as image editing and 3D room layout generation, and will continue to be explored and developed in the future.

More Research on Agents
Erase the person: given the instruction "remove the left person", the agent segments the image first, then inpaints the region (a sketch of this two-step flow follows below).

Zhang, Chi, et al. AppAgent: Multimodal Agents as Smartphone Users. arXiv preprint arXiv:2312.13771, 2023.
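The erase-person example is a simple tool chain the agent can drive: a segmentation step to find the referred region, then an inpainting step to remove it. Both functions below are hypothetical stand-ins for whatever segmentation and inpainting models the product actually uses; only the two-step structure comes from the slide.

```python
# Sketch of the "remove the left person" flow: segment first, then inpaint.
# segment_region and inpaint are hypothetical placeholders, not a real product API.
def segment_region(image_path: str, referring_phrase: str) -> str:
    # e.g. a referring-segmentation model that returns a binary mask file
    return "mask.png"

def inpaint(image_path: str, mask_path: str) -> str:
    # e.g. a diffusion-based inpainting model that fills the masked region
    return "edited.png"

def erase(image_path: str, instruction: str) -> str:
    mask = segment_region(image_path, instruction)  # "segments the image first"
    return inpaint(image_path, mask)                # "inpaint the region"

# erase("photo.png", "the left person")
```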
More Research on Agents
We also designed an agent, MovieLLM, to create movies for training long-video understanding models.
Song Z, Wang C, Sheng J, Zhang C, et al. MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies. arXiv preprint, 2024.

Applications of AppAgent
The AppAgent framework and its ideas have been validated in the company's key businesses, such as image editing and the 3D room ("3D小屋") feature, and will continue to be explored and developed. An agent built on GPT-4 selects furniture and decides where to place it.

LLM for Agents: adapting the capabilities of local models
- Multi-image and long-context understanding
- OCR (multilingual)
- Visual pointing (referring to image elements via grids or numeric tags)
- Function and tool calling
- In-context ability (format following)
- Stronger priors about apps and UIs
- Judging special image content (e.g. retouched images)
- Faster LLM inference

THANKS