From Reinforcement Learning (Multi-)Agents to Large Language Model (Multi-)Agents
Mao Hangyu, SenseTime
RLChina 2023, "Large Models and AI Agents"

Outline (Single-agent vs. Multi-agent, across DRL / TRL / LLM-based Agent)
- Single-agent, DRL: SEIHAI (NeurIPS'20, DAI'21)
- Single-agent, TRL: TIT/PDiT (arXiv 22.12, submitted to AAMAS'24)
- Single-agent, LLM-based Agent: TPTU (NeurIPS24-FMDM), TPTU-v2 (arXiv 23.11)
- Multi-agent, DRL: Gated-ACML (AAAI'20), NCC-MARL (AAAI'20)
- Multi-agent, TRL: STEER (arXiv 23.05, submitted to AAAI'24)
- Multi-agent, LLM-based Agent: LLaMAC (arXiv 23.11)

SEIHAI: A Sample-efficient Hierarchical AI for the MineRL Competition
Motivation
- Verifying that agents can keep learning in open-ended environments has become an important direction for AI, and Minecraft is a natural "proving ground" for it.
- SEIHAI is the first fully learning-based agent to reach the "Iron Age" in the NeurIPS MineRL Competition.
- Why Minecraft is hard: item dependencies, sparse rewards combined with long episodes, and no semantic annotations of any kind.
Method
- Training the scheduler boils down to a classification task.
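Since the slide only states this classification framing, here is a minimal sketch of the idea, assuming the scheduler maps an encoded observation to one of K sub-task agents; the network sizes, feature encoding, and label source are illustrative rather than the paper's exact setup:

```python
import torch
import torch.nn as nn

class SchedulerClassifier(nn.Module):
    """Picks which of K sub-task agents should act, framed as K-way classification."""
    def __init__(self, obs_dim: int, num_subtasks: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_subtasks),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # logits over sub-tasks

# Supervised training on (observation, sub-task label) pairs, e.g. labels derived
# from which stage of the item-dependency chain a demonstration frame belongs to.
scheduler = SchedulerClassifier(obs_dim=64, num_subtasks=5)
optimizer = torch.optim.Adam(scheduler.parameters(), lr=1e-3)
obs = torch.randn(32, 64)            # dummy batch of encoded observations
labels = torch.randint(0, 5, (32,))  # dummy sub-task labels
loss = nn.CrossEntropyLoss()(scheduler(obs), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```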
Gated-ACML: Learning Agent Communication under Limited Bandwidth by Message Pruning
Motivation
- Multi-agent communication is a long-standing research topic: how, what, and to whom to communicate.
- In practice, however, communication bandwidth is limited. How should agents communicate under limited bandwidth?
Method
- Recast limited bandwidth as message pruning, and recast message pruning as a binary classification problem.
- How to set the pruning threshold T: dynamically (as illustrated in the slides) or statically?
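A minimal sketch of the message-pruning gate implied above, assuming each agent scores its own candidate message and transmits only when the gate output exceeds the threshold T; the gate network and its inputs are illustrative:

```python
import torch
import torch.nn as nn

class MessageGate(nn.Module):
    """Binary classifier deciding whether a local message is worth sending."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)  # probability that the message matters

def prune_messages(gate: MessageGate, obs: torch.Tensor,
                   messages: torch.Tensor, T: float = 0.5) -> torch.Tensor:
    """Zero out (i.e. do not transmit) messages whose gate value is below T."""
    keep = (gate(obs) > T).float().unsqueeze(-1)
    return messages * keep

gate = MessageGate(obs_dim=16)
obs = torch.randn(4, 16)   # 4 agents' local observations
msgs = torch.randn(4, 8)   # their candidate messages
pruned = prune_messages(gate, obs, msgs, T=0.5)
```

One natural way to obtain labels for this classifier is to mark a message as prunable when dropping it barely changes the receivers' action values; that is one way to instantiate the binary-classification framing and the dynamic-vs-static threshold question above.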
NCC-MARL: Neighborhood Cognition Consistent Multi-Agent Reinforcement Learning
Motivation
- How can multiple agents cooperate as well as humans do? What do humans rely on when they cooperate? Cognitive consistency!
Method
- Enforce neighborhood cognition consistency, derived through approximate variational inference.
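The slide gives only the high-level recipe (cognition consistency via approximate variational inference), so the following is a loose sketch of the consistency part: each agent encodes its observation into a latent "cognition" vector, and an auxiliary loss pulls it toward the neighborhood average. The encoder, loss, and graph here are illustrative and omit the variational treatment:

```python
import torch
import torch.nn as nn

class CognitionEncoder(nn.Module):
    """Maps an agent's observation to a latent 'cognition' vector."""
    def __init__(self, obs_dim: int, cog_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, cog_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def neighborhood_consistency_loss(cognitions: torch.Tensor, adjacency: torch.Tensor) -> torch.Tensor:
    """Penalise the distance between each agent's cognition and its neighborhood mean.

    cognitions: (N, d) latent vectors; adjacency: (N, N) 0/1 neighborhood matrix.
    """
    degree = adjacency.sum(dim=1, keepdim=True).clamp(min=1.0)
    neighborhood_mean = adjacency @ cognitions / degree
    return ((cognitions - neighborhood_mean) ** 2).mean()

encoder = CognitionEncoder(obs_dim=16)
obs = torch.randn(5, 16)                  # 5 agents
adj = (torch.rand(5, 5) > 0.5).float()    # dummy neighborhood graph
aux_loss = neighborhood_consistency_loss(encoder(obs), adj)
# aux_loss would be added to the usual actor-critic objective as a regulariser.
```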
TIT / PDiT
TIT: Transformer in Transformer as Backbone for Deep Reinforcement Learning
PDiT: Interleaving Perception and Decision-making Transformers for Deep Reinforcement Learning
Motivation
- In the DRL era, network architecture mattered a great deal: the Dueling Network and the Value Iteration Network won NIPS/ICML best-paper awards.
- Is architecture just as crucial in the TRL era? Find a Transformer architecture that fits RL's perception-decision structure!
Methods
- From DT, to vanilla-PDiT, to PDiT (the perception-decision interleaving Transformer).
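A minimal sketch of the interleaving idea, assuming perception blocks attend over the tokens inside a single timestep (e.g. observation patches or entities) while decision blocks attend over timestep-level tokens across the history; the pooling, block counts, and shapes are illustrative rather than the papers' exact architecture:

```python
import torch
import torch.nn as nn

class PDiTSketch(nn.Module):
    """Interleaves perception blocks (within a timestep) with decision blocks (across timesteps)."""
    def __init__(self, d_model: int = 64, n_layers: int = 2, n_heads: int = 4, act_dim: int = 6):
        super().__init__()
        def block():
            return nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                              batch_first=True)
        self.perception = nn.ModuleList([block() for _ in range(n_layers)])
        self.decision = nn.ModuleList([block() for _ in range(n_layers)])
        self.head = nn.Linear(d_model, act_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, timesteps, patches_per_step, d_model)
        B, T, P, D = tokens.shape
        x = tokens
        for perceive, decide in zip(self.perception, self.decision):
            # Perception: attend over the P patch tokens inside each timestep independently.
            x = perceive(x.reshape(B * T, P, D)).reshape(B, T, P, D)
            # Decision: pool each timestep to one token and attend across the T timesteps.
            step_tokens = decide(x.mean(dim=2))   # (B, T, D)
            x = x + step_tokens.unsqueeze(2)      # broadcast decision context back to patches
        return self.head(x.mean(dim=2))           # (B, T, act_dim) action logits

model = PDiTSketch()
logits = model(torch.randn(2, 8, 16, 64))  # 2 trajectories, 8 steps, 16 patch tokens per step
```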
STEER: Stackelberg Decision Transformer for Asynchronous Action Coordination in Multi-Agent Systems
Motivation
- Which decision-making paradigm should multi-agent systems use: synchronous or asynchronous?
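To make the synchronous-vs-asynchronous contrast concrete, here is a loose sketch of Stackelberg-style (leader-follower) asynchronous action selection, where each agent conditions on the actions already chosen by the agents before it; the policy network, fixed ordering, and greedy choice are illustrative and not STEER's actual method:

```python
import torch
import torch.nn as nn

class ConditionedPolicy(nn.Module):
    """Policy conditioned on the agent's observation plus the actions chosen so far."""
    def __init__(self, obs_dim: int, n_agents: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_agents * act_dim, 64), nn.ReLU(),
            nn.Linear(64, act_dim),
        )

    def forward(self, obs: torch.Tensor, prev_actions: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, prev_actions], dim=-1))

def stackelberg_rollout(policies, observations, act_dim):
    """Leaders act first; followers observe the leaders' actions before acting."""
    n_agents = len(policies)
    chosen = torch.zeros(n_agents, act_dim)   # one-hot actions, filled in sequentially
    for i, policy in enumerate(policies):
        logits = policy(observations[i], chosen.flatten())
        a = torch.argmax(logits)              # greedy choice, for the sketch only
        chosen[i, a] = 1.0
    return chosen.argmax(dim=-1)              # action index per agent

n_agents, obs_dim, act_dim = 3, 8, 4
policies = [ConditionedPolicy(obs_dim, n_agents, act_dim) for _ in range(n_agents)]
actions = stackelberg_rollout(policies, torch.randn(n_agents, obs_dim), act_dim)
```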
TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage
Motivation
- LLMs already have a basic level of general knowledge and can be regarded as general-purpose "world models".
- How can LLMs empower agents? (From an RL researcher's perspective) Where are the key points?
  - Task Planning (analogous to long-term decision-making in RL)!
  - Tool Usage (analogous to interacting with the external environment in RL)!
Experiments
- 3.2 Evaluation on Task Planning Ability
  - 3.2.1 TPTU-OA: Tool Order Planning
  - 3.2.2 TPTU-OA: Tool Order Planning and Subtask Description Generation
  - 3.2.3 TPTU-OA: The Planning of Tool-Subtask Pair
  - 3.2.4 TPTU-OA: The Planning of Tool-Subtask Pair with Unrelated Tools
  - 3.2.5 TPTU-SA: The Planning of Tool-Subtask Pair Generation
- 3.3 Evaluation on Tool Usage Ability
  - 3.3.1 The Effectiveness of Single Tool Usage
  - 3.3.2 TPTU-OA and TPTU-SA: Tool Usage for Multiple Tools
- 3.4 Insightful Observations
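A minimal sketch of the tool-subtask pair planning idea evaluated above, assuming a generic call_llm placeholder and a toy tool registry; the prompt format and the tools are illustrative, not TPTU's exact prompts:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call (hosted or internal chat model)."""
    raise NotImplementedError

# Illustrative tool registry; real systems expose SQL, Python, search, and similar tools.
TOOLS = {
    "calculator": lambda query: str(eval(query, {"__builtins__": {}})),
    "echo": lambda query: query,
}

PLAN_PROMPT = """You are a task-planning agent.
Available tools: {tools}
User task: {task}
Return a JSON list of {{"tool": ..., "subtask": ...}} pairs, in execution order."""

def plan_and_execute(task: str) -> list:
    """One-shot planning: plan all tool-subtask pairs first, then execute them in order."""
    plan_text = call_llm(PLAN_PROMPT.format(tools=list(TOOLS), task=task))
    plan = json.loads(plan_text)   # e.g. [{"tool": "calculator", "subtask": "3*7"}, ...]
    results = []
    for step in plan:
        tool = TOOLS[step["tool"]]
        results.append(tool(step["subtask"]))  # the subtask text is fed to the tool
    return results
```

Roughly, TPTU-OA corresponds to planning all pairs in one shot as above, while TPTU-SA plans sequentially, feeding each tool's result back before asking for the next pair.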
TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems
Motivation
- Same starting point as TPTU: LLMs can be regarded as general-purpose "world models", and the key points for agents remain Task Planning and Tool Usage. But in real-world systems:
  - Task Planning: real tasks are far more complex.
  - Tool Usage: there are many tools, and their functionality overlaps semantically.
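With many semantically overlapping tools, a natural ingredient is a retriever that narrows the tool set before planning. The sketch below assumes a generic embed placeholder for a text-embedding model and ranks tools by cosine similarity; it is illustrative and much simpler than TPTU-v2's trained retriever:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a text-embedding model (e.g. a sentence encoder)."""
    raise NotImplementedError

def top_k_tools(task: str, tool_descriptions: dict, k: int = 3) -> list:
    """Rank tools by cosine similarity between the task and each tool description."""
    task_vec = embed(task)
    scores = {}
    for name, desc in tool_descriptions.items():
        vec = embed(desc)
        scores[name] = float(task_vec @ vec / (np.linalg.norm(task_vec) * np.linalg.norm(vec)))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Only the retrieved tools (and their descriptions) are placed into the planning prompt,
# keeping the prompt short and reducing confusion between semantically similar tools.
```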
LLaMAC: Controlling Large Language Model-based Agents for Large-Scale Decision-Making: An Actor-Critic Approach
Motivation
- LLMs already have a basic level of general knowledge and can be regarded as general-purpose "world models".
- How can LLMs empower many (large-scale) agents? (From a MARL researcher's perspective) The key points:
  - Exploration and exploitation
  - LLM hallucination
  - Token efficiency
Key Design: TripletCritic
1. Move exploration and exploitation into a centralized critic, turning the N-agent exploration-exploitation problem into a 2-agent one; this lowers the exploration cost and reduces non-stationarity.
2. Use an accessor to balance exploration and exploitation and to repair the problems introduced by hallucinations of critics 1 and 2.
3. Memory redundant-information filtering together with internal/external feedback makes the information more concise and accurate, which reduces the number of iterations and therefore the token count.
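A minimal sketch of the TripletCritic control flow described above, assuming a generic call_llm placeholder; the two critics' prompts, the accessor arbitration, and the actor prompts are illustrative paraphrases of the bullet points rather than LLaMAC's actual prompts:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    raise NotImplementedError

def triplet_critic(state_summary: str) -> str:
    """Two critics make suggestions (one exploratory, one exploitative); an accessor arbitrates."""
    explore = call_llm(
        f"As an exploration-oriented critic, suggest assignments for all agents.\n{state_summary}")
    exploit = call_llm(
        f"As an exploitation-oriented critic, suggest assignments for all agents.\n{state_summary}")
    verdict = call_llm(
        "As the accessor, compare the two suggestions below, fix inconsistent or hallucinated "
        f"content, and output the final assignment.\nA (explore): {explore}\nB (exploit): {exploit}")
    return verdict

def llamac_step(state_summary: str, n_agents: int) -> list:
    """Centralized suggestion, decentralized execution: each actor acts on the critic's verdict."""
    suggestion = triplet_critic(state_summary)
    return [call_llm(f"You are agent {i}. Given the plan:\n{suggestion}\nOutput your action.")
            for i in range(n_agents)]
```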
Summary
- This talk introduced the paradigm shift from [reinforcement learning (multi-)agents] to [large language model (multi-)agents], along with some key technical reflections.
- Extending the outline matrix with a Large-scale row: MF-MARL (ICML'20) for DRL, and Debate / Role Playing (arXiv) for LLM-based agents.
- Across the matrix, the paradigm is becoming increasingly fixed.

Conclusion
- Research feasibility today versus future potential.
- Direction of technical development.
- The paradigm is becoming increasingly fixed.

References
- SEIHAI: A Sample-efficient Hierarchical AI for the MineRL Competition. DAI 2021.
- Learning Agent Communication under Limited Bandwidth by Message Pruning. AAAI 2020.
- Neighborhood Cognition Consistent Multi-Agent Reinforcement Learning. AAAI 2020.
- Transformer in Transformer as Backbone for Deep Reinforcement Learning. arXiv 2022.
- PDiT: Interleaving Perception and Decision-making Transformers for Deep Reinforcement Learning. arXiv 2022.
- Stackelberg Decision Transformer for Asynchronous Action Coordination in Multi-Agent Systems. arXiv 2023.
- TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage. arXiv 2023.
- TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems. arXiv 2023.
- Controlling Large Language Model-based Agents for Large-Scale Decision-Making: An Actor-Critic Approach. arXiv 2023.

SenseTime: committed to original research, letting AI lead human progress and drive economic, social, and human development.