Chi Zhang: GUI Agents Based on Multimodal Large Language Models (張馳-基于多模態大語言模型的GUI智能體.pdf)


GUI Agents Based on Multimodal Large Language Models
Dr. Chi Zhang
Tencent QQ Guangying Lab (QQ光影實驗室) / Westlake University AGI Lab

Speaker: Chi Zhang
Named to the 2023 World's Top 2% Scientists list released by Stanford University.
Dr. Chi Zhang is currently a research scientist at Tencent. He will join Westlake University in fall 2024 as an assistant professor and doctoral advisor, and will found the AGI Lab. He received his PhD in 2021 from the School of Computer Science at Nanyang Technological University, Singapore. His doctoral research focused on efficient, low-cost machine learning and computer vision algorithms; his current research covers multimodal models and AIGC models in the era of large models. He has published multiple papers at top international conferences and journals such as CVPR, ICCV, NeurIPS, and TPAMI.

Contents
1. Background: large language models and agents
2. Difficulties and challenges
3. User interfaces and human-computer interaction
4. AppAgent: a GUI-based agent

Large language models as agents
Prompt: Imagine that you are a home robot, and is asked to go to the kitchen to fetch something from the fridge. The image below shows your current position. Please plan your next action.
GPT-4V: My next action would be turn right and move forward towards the hallway, as I can see that the kitchen is likely located in that direction. I would then continue to navigate through the hallway until I reach the kitchen and locate the fridge.

Large language models as agents
Prompt: Now you are at the kitchen, this is what you can see from there. Can you see the fridge from here? What is your next action?
GPT-4V: From this position, I can see the fridge on the right side of the kitchen. My next action would be to move forward and turn slightly to the right to approach the fridge.

Large language models as agents
Prompt: Imagine that you are a robot operating a computer. You are asked to find a detailed recipe of Mapo Tofu. Below is what you see on the computer screen, predict your next move.
GPT-4V: Next move: I would move the mouse to the Google Chrome icon located on the left side of the screen, approximately 1/3 of the way down from the top. I would then click on the icon to open the web browser.

Decision, Perception, Action

Challenges
Limitation of text-only input in current agent design
Control alone is a big discipline
Insufficient knowledge in the LLM and high learning cost

A Simple Agent Prototype

Significance of Interface
Advantages:
Actions have clear semantic meaning, providing a user-friendly interaction method
The large language model focuses on decision making
Relatively easy perception of environment and state

Agent on Interface
Goal:
Develop UI-based agent frameworks with strong generalization capabilities
Develop a multimodal LLM that can accept multimodal inputs
Develop an embodied agent that can interact with interfaces, such as mobile phones

AppAgent: Multimodal Agents as Smartphone Users

AppAgent
Challenge 1: How to output control?
Challenge 2: How to learn? (Complex software vs. easy software)
Challenge 3: How to generalize?

AppAgent
Action space:
Tap(element: int)
Long_press(element: int)
Swipe(element: int, direction: str, dist: str)
Text(text: str)
Back()
Exit()

Supervised finetuning of LLM: are you sure?
"Briefly describe the functions of the four buttons in the bottom bar of this screenshot."
"Can you provide the bounding box coordinates for the Profile Photo button on the screen?"
Zhang, Chi, et al. AppAgent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771 (2023).

Human-like system design
How do we humans learn?
Quick learning process: self-exploration, watching demos
Good generalization capability

Human-like system design
Overall idea: Exploration-Deployment
Explore how to use an app (self-exploration or watching demos)
Generate a document after exploration, used as an external database
Refer to the document for better decision making

Human-like system design
Exploration Phase

Deployment Phase
Task: "beautify this photo for me"

Experiment

Significance and Impact
Revolutionizing human-computer interaction (GUI, CMD, LUI)
Digital employees and super assistants
Software testing and user studies

Application
The AppAgent framework and ideas have been validated in the company's key products such as image editing and 3D room layout generation, and will continue to be explored and developed in the future.
Example: for "remove the left person", the agent segments the image first, then inpaints the region to erase the person.
Zhang, Chi, et al. AppAgent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771 (2023).

Application (More Research on Agents)
We also design an agent, MovieLLM, to create movies for training long-video understanding models.
Song Z, Wang C, Sheng J, Zhang C, et al. MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies. arXiv preprint, 2024.

Applications of AppAgent (More Research on Agents)
The AppAgent framework and ideas have been validated in the company's key businesses such as image editing and 3D rooms (3D小屋), and will continue to be explored and developed.
An agent designed on top of GPT-4 selects furniture and arranges its placement.

LLM for Agents: adapting local model capabilities
Multi-image and long-context understanding
OCR (multilingual)
Visual pointing (referring to image elements through grids and numbers)
Calling functions and tools
In-context ability (format following)
Stronger App/UI priors
Judging special image content (edited, "Photoshopped" images)
Faster LLM prediction

Thanks
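The action space listed on the AppAgent slide (Tap, Long_press, Swipe, Text, Back, Exit) can be illustrated with a small parser. The following is a minimal Python sketch, not the AppAgent implementation: it assumes the model replies with a single call written as plain text, which the slides do not specify. The names `ACTION_SPACE`, `Action`, and `parse_action` are hypothetical, and the example values "up" and "medium" are illustrative only, since the slides give the argument types but not the allowed values.

```python
import ast
from dataclasses import dataclass

# Signatures copied from the "Action space" slide: name -> argument types.
ACTION_SPACE = {
    "Tap": (int,),
    "Long_press": (int,),
    "Swipe": (int, str, str),
    "Text": (str,),
    "Back": (),
    "Exit": (),
}


@dataclass(frozen=True)
class Action:
    """A validated action (hypothetical container, not from the slides)."""
    name: str
    args: tuple


def parse_action(reply: str) -> Action:
    """Parse one call such as 'Tap(5)' and check it against ACTION_SPACE.

    Assumes positional, literal arguments only; anything else is rejected.
    """
    try:
        node = ast.parse(reply.strip(), mode="eval").body
    except SyntaxError as exc:
        raise ValueError(f"not a valid call: {reply!r}") from exc
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
        raise ValueError(f"not a function call: {reply!r}")
    if node.keywords:
        raise ValueError("keyword arguments are not supported in this sketch")

    name = node.func.id
    if name not in ACTION_SPACE:
        raise ValueError(f"unknown action: {name}")

    # literal_eval only accepts literals, so no model-supplied code is run.
    args = tuple(ast.literal_eval(arg) for arg in node.args)
    expected = ACTION_SPACE[name]
    if len(args) != len(expected):
        raise ValueError(f"{name} expects {len(expected)} argument(s)")
    for value, expected_type in zip(args, expected):
        if type(value) is not expected_type:
            raise ValueError(f"{name}: {value!r} is not {expected_type.__name__}")
    return Action(name, args)


if __name__ == "__main__":
    print(parse_action("Tap(5)"))
    print(parse_action('Swipe(3, "up", "medium")'))
```

Restricting the reply to a fixed set of typed calls is one way to approach "Challenge 1: How to output control?": a malformed or out-of-vocabulary reply is rejected before anything touches the device.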
