DataFunSummit #2023
A Preliminary Exploration of Large Models for Multi-Agent Reinforcement Learning
Hao Xiaotian, PhD candidate, Tianjin University / NOAH'S ARK LAB

Contents
01 Challenges for multi-agent decision-making large models: Why does reinforcement learning need large models? What are the challenges?
02 Action semantics network: ICLR-21 Action Semantics Network: Considering the Effects of Actions in Multiagent Systems
03 Permutation invariance and permutation equivariance: ICLR-23 Boosting MARL via Permutation Invariant and Permutation Equivariant Networks
04 Cross-task automatic curriculum learning: AAMAS-23 PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning

01 Challenges for Multi-Agent Decision-Making Large Models

Basic concepts: What is a cooperative multi-agent system?
- Multi-"hero" cooperation in game AI: AlphaStar (DeepMind), Dota2 (OpenAI-5), Honor of Kings (Tencent)
- Multi-user / multi-item recommendation
- Smart warehousing with multi-vehicle coordination (cloud computing, production scheduling); multi-resource scheduling and collaborative optimization
- DiDi ride-hailing: cooperative multi-vehicle dispatching; multi-vehicle transport and delivery optimization
A large number of real-world problems can be modeled as cooperative control and optimization problems with multiple agents: several participants jointly optimize one (or several) shared objective functions.

Basic concepts: How cooperative MARL is modeled
- Multiagent Markov Decision Processes (MMDP); Decentralized Partially Observable MDP (Dec-POMDP)
- Joint policy $\pi = \langle \pi^1, \dots, \pi^n \rangle$, with $\pi^* = \arg\max_{\pi} \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, \mathbf{a}_t)\big]$
- Example (StarCraft): per-entity obs = type, distance, relative x/y coordinates, health, armor; actions = no-op, move up/down/left/right, attack a specific enemy unit
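For reference, a standard Dec-POMDP formulation consistent with the slide (the notation is the conventional one, assumed rather than copied from the deck):

```latex
% Dec-POMDP tuple (agents, states, per-agent actions, transition, per-agent
% observations, observation function, shared reward, discount) and joint objective
\big\langle \mathcal{N},\, \mathcal{S},\, \{\mathcal{A}^i\},\, P,\, \{\Omega^i\},\, O,\, r,\, \gamma \big\rangle,
\qquad
\pi = \langle \pi^1, \dots, \pi^n \rangle,
\qquad
\pi^* = \arg\max_{\pi}\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, \mathbf{a}_t)\right].
```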
- Difficulty 1: the curse of dimensionality. The state/observation space grows exponentially with the number of entities, and the joint action space explodes combinatorially with the number of agents.
- Difficulty 2: low sample efficiency.
- Difficulty 3: poor generality and generalization.

What is a multi-agent reinforcement learning large model?
Design the model so that it generalizes well: a single model can solve multiple similar tasks.
Examples: StarCraft, Dota2, and Honor of Kings, covering both the same game with different scenarios and different games with different scenarios. SMAC (StarCraft) maps include MMM2, 1c3s5z, 2m_vs_1z, 3s_vs_5z, 3s5z, 3s5z_vs_3s6z.

What can bigger models bring to reinforcement learning?
Large models have achieved breakthrough results in natural language processing, computer vision, and other fields (ChatGPT-3.5 has roughly 175 billion parameters). In reinforcement learning: BBF (Bigger, Better, Faster) [1].
[1] Bigger, Better, Faster: Human-level Atari with human-level efficiency, ICML-2023.
- Measured by the environment samples needed to reach human-level performance on Atari (over 26 games, Atari-100k), BBF reaches performance similar to model-based EfficientZero with at least a 4x reduction in runtime.
- Recipe: larger network + self-supervision + increased replay ratio + parameter resets.
What challenges do MARL large models face?
- Different entity numbers and types: the number and kinds of agents (or entities) differ across scenarios;
- Different feature inputs: entity features differ, so observations (obs) and states differ;
- Different action spaces;
- Different rewards: reward functions differ.
Example obs features: type, distance, relative x/y coordinates, health, armor. Consequences: network input dimensions and semantics differ, policy-network output dimensions and semantics differ, and value-network output scales differ across tasks.

A unified description of multi-agent systems by analogy with language models (align multiagent systems and languages)
- Language model: a vocabulary; sentences describe the objective world; a word tokenizer maps words to word vectors (the model base, e.g., word2vec); a neural network consumes them.
- Entity-factored description of a multiagent system: an attribute table, an entity table, and an action table (action semantics) together describe the system, similar to a relational database; an attribute tokenizer maps entity attributes (type, position, health, armor) to entity vectors (the model base); actions such as move up, move down, attack are described in the same way; observations and states are then composed from these entity descriptions.
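A minimal PyTorch sketch of this analogy (class names and dimensions are illustrative assumptions, not the speaker's code): a shared encoder plays the role of the attribute tokenizer, mapping each entity's raw attributes to an entity vector, so the observation becomes a set of entity vectors instead of one fixed-order concatenation.

```python
import torch
import torch.nn as nn

class EntityTokenizer(nn.Module):
    """Shared 'attribute tokenizer': maps each entity's raw attribute vector
    (type, position, health, armor, ...) to an entity embedding, analogous to
    a word embedding in a language model."""
    def __init__(self, attr_dim: int, embed_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(attr_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, entities: torch.Tensor) -> torch.Tensor:
        # entities: [batch, n_entities, attr_dim] -> [batch, n_entities, embed_dim]
        # The same weights are applied to every entity, regardless of its slot.
        return self.encoder(entities)

# toy usage: 5 entities, each described by 6 attributes
obs = torch.randn(1, 5, 6)
tokens = EntityTokenizer(attr_dim=6, embed_dim=32)(obs)
print(tokens.shape)  # torch.Size([1, 5, 32])
```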
Three important design priors:
- Action semantics network: ICLR-2021 Action Semantics Network: Considering the Effects of Actions in Multiagent Systems.
- Permutation invariance, permutation equivariance, and variable-length model inputs: ICLR-2023 Boosting MARL via Permutation Invariant and Permutation Equivariant Networks.
- Transfer learning and cross-task automatic curriculum learning: AAMAS-2023 PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning.

02 Action Semantics Network
ICLR-2021 Action Semantics Network: Considering the Effects of Actions in Multiagent Systems
ASN (Action Semantics Network)
ASN considers how different actions influence other agents and designs the neural network around these action semantics, e.g., move actions versus attack actions (figure: move actions; attack actions). A sketch of the idea follows below.
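A minimal PyTorch sketch of the idea (names, dimensions, and the exact heads are assumptions for illustration, not the paper's architecture): entity-uncorrelated actions such as moves are scored from the agent's own features, while each entity-correlated action such as "attack enemy j" is scored from the pair of agent features and the target entity's features.

```python
import torch
import torch.nn as nn

class ASNLikeQNet(nn.Module):
    """Sketch of the Action Semantics Network idea:
    - entity-uncorrelated actions (e.g. move) are scored from the agent's own features;
    - entity-correlated actions (e.g. attack enemy j) are scored from the pair
      (agent features, target-entity features), one score per target entity."""
    def __init__(self, self_dim, entity_dim, hidden, n_move_actions):
        super().__init__()
        self.self_enc = nn.Linear(self_dim, hidden)
        self.move_head = nn.Linear(hidden, n_move_actions)   # Q-values for move/no-op actions
        self.pair_enc = nn.Linear(self_dim + entity_dim, hidden)
        self.attack_head = nn.Linear(hidden, 1)               # Q-value for "attack this entity"

    def forward(self, self_feats, entity_feats):
        # self_feats: [batch, self_dim]; entity_feats: [batch, n_entities, entity_dim]
        h_self = torch.relu(self.self_enc(self_feats))
        q_move = self.move_head(h_self)                                    # [batch, n_move_actions]
        n = entity_feats.size(1)
        pair = torch.cat([self_feats.unsqueeze(1).expand(-1, n, -1), entity_feats], dim=-1)
        q_attack = self.attack_head(torch.relu(self.pair_enc(pair))).squeeze(-1)  # [batch, n_entities]
        return torch.cat([q_move, q_attack], dim=-1)                       # one Q-value per action

# toy usage: 6 move/no-op actions plus "attack enemy j" for 4 enemies
net = ASNLikeQNet(self_dim=8, entity_dim=6, hidden=32, n_move_actions=6)
q = net(torch.randn(2, 8), torch.randn(2, 4, 6))
print(q.shape)  # torch.Size([2, 10])
```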
03 Permutation Invariance and Permutation Equivariance
ICLR-2023 Boosting MARL via Permutation Invariant and Permutation Equivariant Networks
Motivation: Entity-factored modeling in MARL
- A multiagent environment typically consists of entities, including learning agents and non-player objects. Both the state s and each agent's observation o are usually composed of the features of the m entities $[x_1, \dots, x_m]$, each $x_j \in \mathcal{X}$, the feature space of a single entity in o and s.
- The curse of dimensionality: if the state s or the observation o is simply represented as a concatenation of the entities' features in a fixed order, the state/observation space grows exponentially as the entity number increases, which results in low sample efficiency and poor scalability of existing MARL methods.
- Main idea: multiagent systems containing homogeneous agents exhibit symmetry. Building $\pi$/$Q$ functions that are insensitive to the order of the entities reduces the state/observation space by a factor of $m!$, i.e., from $\mathcal{X}^{m}$ (fixed-order concatenation) to $\mathcal{X}^{m}/m!$, thus alleviating the curse of dimensionality.
- Intuition: the state describes objective information about the set of entities and should not change with the input order; with 6 homogeneous agents, a fixed-order encoding treats the 6! orderings of the same situation as distinct states.
Motivation: Two types of actions in typical MA environments, $\mathcal{A}^{\mathrm{equiv}}$ and $\mathcal{A}^{\mathrm{inv}}$
- Entity-correlated actions $\mathcal{A}^{\mathrm{equiv}}$: e.g., attack which enemy entity or heal which ally entity (StarCraft), pass the ball to which teammate (Football);
- Normal (entity-uncorrelated) actions $\mathcal{A}^{\mathrm{inv}}$: e.g., move in different directions.
(Figures: attack & heal in StarCraft; pass the ball in Football; move in different directions.)

Motivation: Designing permutation-insensitive $\pi$/$Q$ functions
To build $\pi$/$Q$ functions insensitive to the order of the entity features $[x_1, \dots, x_m]$, the type of each action must be taken into account:
- For entity-correlated actions $\mathcal{A}^{\mathrm{equiv}}$, permuting the order of the input entities should permute the order of the corresponding outputs in the same way (a one-to-one correspondence).
- For normal (entity-uncorrelated) actions $\mathcal{A}^{\mathrm{inv}}$, permuting the order of the input entities should not change the outputs.
Here $M$ is an arbitrary permutation matrix operating on $[x_1, \dots, x_m]^{\top}$, as formalized below.
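Written out (standard definitions matching the slide's wording; $f$ stands for the part of the network producing entity-uncorrelated outputs and $g$ for the entity-correlated part):

```latex
% Permutation invariance (outputs for entity-uncorrelated actions A_inv):
f\big(M\,[x_1, \dots, x_m]^{\top}\big) \;=\; f\big([x_1, \dots, x_m]^{\top}\big)
% Permutation equivariance (outputs for entity-correlated actions A_equiv):
g\big(M\,[x_1, \dots, x_m]^{\top}\big) \;=\; M\, g\big([x_1, \dots, x_m]^{\top}\big)
```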
Method: Designing permutation invariant and permutation equivariant policy networks
- The goal is to inject the PI and PE inductive biases into the policy or the value function.
- Tasks that contain only one of the two action types can be treated as special cases.

Method: Minimal Modification Principle (ease of use)
- Existing algorithms have invested a lot in handling MARL-specific problems; e.g., they usually incorporate RNNs into the backbone module B to handle partially observable inputs.
- We therefore propose to modify only the input module A and the output module D, while keeping the backbone module B and the output module C unchanged.
- Schematically, the agent network factorizes as $h = A(o)$, $z = B(h)$, $Q^{\mathrm{inv}} = C(z)$, $Q^{\mathrm{equiv}} = D(z)$, and only A and D are redesigned.

Method: Dynamic Permutation Network (DPN)
- The core idea of DPN is to always assign the same weight matrix to a given entity's features, regardless of where that entity appears in the input. A weight-selection network whose output dimension is m decides which weight matrix each entity uses.
- PI input layer A: no matter in what order the m entities appear in the observation, each entity is embedded with its own selected weight matrix and the results are aggregated, so the output of layer A is unchanged.
- PE output layer D: the j-th output of D is computed from the j-th entity with its selected weight matrix, so a change in the input order produces the same change in the output order, achieving PE. (A simplified sketch of the selection idea follows.)
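A simplified sketch of the weight-selection idea (illustrative only; the soft selection, dimensions, and names are assumptions, and the paper's DPN may differ in details such as using hard selection):

```python
import torch
import torch.nn as nn

class DPNLikeInputLayer(nn.Module):
    """Sketch of the Dynamic Permutation Network idea: a selection network
    chooses, per entity, one of m candidate weight matrices based on the
    entity's own features, so the same entity always gets the same weights
    no matter where it appears; summing the per-entity embeddings then makes
    the layer permutation invariant."""
    def __init__(self, entity_dim, hidden, m):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(m, entity_dim, hidden) * 0.05)  # m candidate matrices
        self.selector = nn.Linear(entity_dim, m)                                 # "which matrix" logits

    def forward(self, entities):
        # entities: [batch, m, entity_dim]
        sel = torch.softmax(self.selector(entities), dim=-1)                # soft selection, [b, m, m]
        per_matrix = torch.einsum('bje,meh->bjmh', entities, self.weights)  # embed with every matrix
        embedded = (sel.unsqueeze(-1) * per_matrix).sum(dim=2)              # (softly) pick one per entity
        return embedded.sum(dim=1)                                          # order-insensitive aggregation

layer = DPNLikeInputLayer(entity_dim=6, hidden=32, m=5)
x = torch.randn(1, 5, 6)
perm = x[:, torch.randperm(5)]                                # shuffle the entity order
print(torch.allclose(layer(x), layer(perm), atol=1e-5))       # True: permutation invariance holds
```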
Method: Hyper Policy Network (HPN)
- How can the weight matrices that embed the input and produce the output be generated automatically? HPN uses hypernetworks to directly generate the weight matrices of the input layer and the output layer.
- PI input layer A: no matter in what order the m entities appear in the observation, each entity is embedded by the weight matrix generated from its own features and the embeddings are aggregated, so the output of layer A is unchanged.
- PE output layer D: the j-th output of D is generated from the j-th entity's features, so a change in the input order results in the same change in the output order, achieving PE. (A minimal sketch follows.)
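A minimal sketch of the hypernetwork idea (illustrative; layer sizes, pooling, and names are assumptions rather than the paper's exact code): one hypernetwork maps each entity's features to its own input weight matrix, the weighted embeddings are summed for permutation invariance, and a second hypernetwork generates per-entity output weights so the entity-correlated Q-values permute with the input entities.

```python
import torch
import torch.nn as nn

class HPNLikeLayer(nn.Module):
    """Sketch of the Hyper Policy Network idea: hypernetworks generate the
    input-layer and output-layer weights from each entity's own features."""
    def __init__(self, entity_dim, hidden):
        super().__init__()
        self.hyper_in = nn.Linear(entity_dim, entity_dim * hidden)   # generates W_in(x_j)
        self.hyper_out = nn.Linear(entity_dim, hidden)               # generates w_out(x_j)
        self.entity_dim, self.hidden = entity_dim, hidden

    def forward(self, entities):
        # entities: [batch, m, entity_dim]
        b, m, e = entities.shape
        w_in = self.hyper_in(entities).view(b, m, e, self.hidden)    # per-entity input matrices
        h = torch.einsum('bme,bmeh->bmh', entities, w_in)            # embed each entity with its own W
        pooled = h.sum(dim=1)                                        # PI aggregation over entities
        w_out = self.hyper_out(entities)                             # per-entity output weights
        q_equiv = torch.einsum('bh,bmh->bm', pooled, w_out)          # one Q per target entity (PE)
        return pooled, q_equiv

layer = HPNLikeLayer(entity_dim=6, hidden=32)
pooled, q = layer(torch.randn(2, 4, 6))
print(pooled.shape, q.shape)  # torch.Size([2, 32]) torch.Size([2, 4])
```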
Experiments: StarCraft Multi-Agent Challenge (SMAC)
- Two teams: allies and enemies. The ally units are controlled by the agents, while the enemy units are controlled by the built-in rule-based bots.
- SOTA performance: we achieve 100% win rates on all hard and super-hard maps.
- Our design follows the Minimal Modification Principle, so it can be plugged into any MARL algorithm and boost its performance, which makes it very easy to use. Comparisons: our HPN+QPLEX vs. QPLEX (SMAC), our HPN+MAPPO vs. MAPPO (SMAC).

Experiments: StarCraft Multi-Agent Challenge V2 (SMAC-V2)
- Random start positions and random unit types.
Experiments: Multiagent Particle Environment (MPE) & Google Research Football (GRF)
- For MPE, the actions consist only of movements, so only the PI property is needed.
- For GRF, each agent has 19 discrete actions, including moving, sliding, shooting, and passing. Comparison: our HPN+QMIX vs. QMIX (Google Football).

Experiments: Transferability of HPN
- Apart from achieving PI and PE, another benefit of HPN is that it naturally handles variable numbers of inputs and outputs. HPN can therefore be used to design more efficient multitask-learning and transfer-learning algorithms (the network structure itself is generalizable and transferable).

Summary
- Simple but efficient implementations of PI and PE modules for MARL that can be trained end-to-end.
- A plug-in module for the policy function or the value function of any MARL algorithm.
- Achieves SOTA results on typical MARL benchmarks.
- Code: https:/

04 Cross-Task Automatic Curriculum Learning
AAMAS-2023 PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning

Motivation: Cross-task curriculum learning (given multiple tasks, how to adaptively schedule the training order)
- Target task T: very hard, difficult to learn from scratch.
- Starting task T1 and a candidate task set {T2, ..., Tn}, from which tasks are selected to form an "optimal learning sequence".
Two core questions must be answered:
- Which task should be selected as the next curriculum to learn?
- How can knowledge learned in earlier curricula be reused in the new one? Simply reloading the previous policy model does not work, because different tasks have different model input dimensions (agent numbers, types, states, observations).

Method: Select curricula by difficulty and task similarity; transfer and reuse policies via the HPN architecture
- Difficulty: evaluate the current policy on all candidate tasks to obtain their returns, sort the tasks by return, and keep the tasks in the middle 40%; the next curriculum is chosen among them.
- Task similarity: among the moderately difficult tasks, select the one closest to the target task (higher relevance means the curriculum helps the target task more). Similarity is measured by the difference between rollout state-visitation distributions.
- Example: in the two task sequences 5m -> 5m_vs_6m -> 8m_vs_10m and 5m -> 25m -> 8m_vs_10m, at the first curriculum-selection step 5m_vs_6m and 25m have similar difficulty, but the former is the better intermediate curriculum because it is more relevant to the target task. (A sketch of this selection rule follows.)

Method: Policy transfer and reuse via the HPN architecture (overall framework)
- The model supports variable-length inputs and outputs, i.e., different agent numbers and types.
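A minimal sketch of the two-stage selection rule described above (function names, the 30%-70% band, and the total-variation distance are illustrative assumptions; the paper may use a different similarity metric):

```python
import numpy as np

def select_next_task(policy_returns, state_dist, target_task, candidates):
    """Two-stage curriculum selection:
    1) keep candidate tasks whose evaluated return falls in the middle ~40%,
    2) among those, pick the task whose rollout state-visitation distribution
       is closest to the target task's."""
    # 1) difficulty filter: middle band of candidates ranked by current-policy return
    ranked = sorted(candidates, key=lambda t: policy_returns[t])
    lo, hi = int(0.3 * len(ranked)), int(np.ceil(0.7 * len(ranked)))
    moderate = ranked[lo:hi]
    # 2) similarity filter: smallest distance between state-visitation distributions
    def distance(t):
        p, q = state_dist[t], state_dist[target_task]
        return 0.5 * np.abs(p - q).sum()   # total-variation distance as a stand-in metric
    return min(moderate, key=distance)

# toy usage with made-up returns and state-visitation distributions
tasks = ["5m_vs_6m", "25m", "8m_vs_9m", "10m", "15m"]
returns = {t: r for t, r in zip(tasks, [3.0, 3.1, 5.0, 8.0, 1.0])}
dists = {t: np.random.dirichlet(np.ones(16)) for t in tasks + ["8m_vs_10m"]}
print(select_next_task(returns, dists, target_task="8m_vs_10m", candidates=tasks))
```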
Experiments: Can PORTAL improve the learning efficiency on all curricula?
- Marines: 5m (initial), 5m_vs_6m, 6m, 7m, 8m_vs_9m, 8m_vs_10m, 10m, 15m, 20m, 25m, 7m_vs_9m (final).
- Stalkers & Zealots (S&Z): 2s3z (initial), 2s4z, 2s5z, 3s4z, 2s3z_vs_2s4z, 2s3z_vs_2s5z, 1s4z_vs_4s1z, 1s4z_vs_5s1z, 3s5z, 1s4z_vs_6s1z, 3s5z_vs_3s6z, 3s5z_vs_3s7z, 3s5z_vs_8s2z, 3s5z_vs_4s6z, 3s5z_vs_4s7z, 3s5z_vs_4s8z (final).
- Medivac & Marauders & Marines (MMM): MMM0 (initial), MMM1, MMM2, MMM3, MMM4, MMM5, MMM6, MMM7, MMM8, MMM9, MMM10 (final).
- Necessity of the task-similarity measure: after the policy has learned on 5m_vs_6m, the tasks of moderate difficulty are 8m_vs_10m, 20m, and 25m.
Summary
- An overview of the challenges facing multi-agent decision-making large models, and a scheme for describing and modeling multi-agent systems by analogy with language models.
- Three important design priors:
  - Action semantics network: ICLR-2021 Action Semantics Network: Considering the Effects of Actions in Multiagent Systems.
  - Permutation invariance, permutation equivariance, and variable-length model inputs: ICLR-2023 Boosting MARL via Permutation Invariant and Permutation Equivariant Networks.
  - Transfer learning and cross-task automatic curriculum learning: AAMAS-2023 PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning.
- We welcome collaboration on further research into reinforcement learning large models!

The Reinforcement Learning Lab at Tianjin University welcomes you to join!
Lab homepage: http:/rl.beiyang.ren/
Thank you for watching.