Agent Learning and Decision Making with Large Language Models: Building Reinforcement Learning World Models
Xi Sheryl Zhang (張希), Nov. 26, 2023

Why Reinforcement Learning? Active Learning vs. Passive Learning Paradigms
- The main distinction from supervised learning:
- An active learner interacts with the environment at training time, say, by posing queries or performing experiments.
- A passive learner only observes the information provided by the environment (or the teacher) without influencing or directing it.

RL Nomenclature
- The Partially Observed MDP (POMDP) model is usually advocated when the agent has no access to the exact system state but only an observation of the state.
- Dynamics may be known/unknown, stationary/non-stationary.
(Sergey Levine and Chelsea Finn, Deep Reinforcement Learning, Decision Making, and Control, ICML 2017 Tutorial)

RL Algorithm: Policy Evaluation & Improvement
- MDP control via Generalized Policy Iteration (GPI).
- FQ1: How good is a specific policy? (policy evaluation)
- FQ2: How can we learn a good policy? (policy improvement)
- A minimal tabular sketch of this loop is given below.
(Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second Edition, The MIT Press, 2018)
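The GPI loop named on this slide can be written in a few lines for a tabular MDP. The sketch below is illustrative only and assumes hypothetical input arrays P (transition probabilities) and R (expected rewards); it is not code from the talk.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, tol=1e-6):
    """Tabular GPI sketch: P has shape (S, A, S), R has shape (S, A)."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)   # arbitrary initial deterministic policy
    V = np.zeros(S)
    while True:
        # FQ1 -- policy evaluation: Bellman expectation backups until convergence
        while True:
            V_new = R[np.arange(S), policy] + gamma * (P[np.arange(S), policy] * V).sum(axis=1)
            done = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if done:
                break
        # FQ2 -- policy improvement: act greedily w.r.t. one-step lookahead values
        Q = R + gamma * (P * V).sum(axis=2)          # shape (S, A)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):       # GPI fixed point reached
            return policy, V
        policy = new_policy
```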
Computational RL Anatomy
- Deep RL: the policy and value components are represented by deep neural networks with weights w.
(Sergey Levine and Chelsea Finn, Deep Reinforcement Learning, Decision Making, and Control, ICML 2017 Tutorial)

Successes obtained via DRL
- DQN (Atari), AlphaGo (Go), MuZero (Chess), AlphaStar (StarCraft).
- How about using an LM? Checkmate-in-one task (Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2023): larger models are better at finding legal chess moves, but struggle to find checkmating moves. None of the BIG-G models tested can solve the checkmate-in-one task. Not so good.

Natural Language in RL
- How might intelligent agents ground language understanding in their own embodied perception?
- Task-agnostic control: relabeled goal-conditioned behavioral cloning (GCBC) and Learning from Play (LfP). (Learning Latent Plans from Play, 2019)
- How can LLMs power autonomous agents? A hindsight-relabeling sketch for GCBC follows below.
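Relabeled GCBC can be illustrated with a short sketch: sample a window from teleoperated "play" data, treat the window's final observation as the goal, and behavioral-clone the goal-conditioned policy. The function and tensor names below are hypothetical assumptions, not code from the LfP paper.

```python
import torch

def gcbc_loss(policy, play_batch, window=32):
    """Relabeled goal-conditioned behavioral cloning on a batch of play sequences.

    play_batch: dict with 'obs' (B, T, obs_dim) and 'act' (B, T, act_dim) tensors
    (hypothetical layout). The goal is relabeled in hindsight as the window's last observation.
    """
    obs, act = play_batch["obs"], play_batch["act"]
    B, T, _ = obs.shape
    start = torch.randint(0, T - window, (B,))
    idx = start[:, None] + torch.arange(window)              # (B, window) time indices
    o = obs[torch.arange(B)[:, None], idx]                   # windowed observations
    a = act[torch.arange(B)[:, None], idx]                   # windowed actions
    goal = o[:, -1]                                          # hindsight goal = final observation
    pred = policy(o.reshape(B * window, -1),
                  goal.repeat_interleave(window, dim=0))     # condition every step on the goal
    return ((pred - a.reshape(B * window, -1)) ** 2).mean()  # simple MSE behavioral cloning
```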
A Gentle Start: PaLM-E
(PaLM-E: An Embodied Multimodal Language Model, 2023)
- Results on general language tasks; results on planning success; an overview of the transfer learning demonstrated by PaLM-E.
- Some conclusions: Dataset: the full mixture is better than a single robot; Params: with a frozen LLM, training an encoder is feasible; the PaLM-E agent can be utilized in tasks such as VQA, NLG, etc.

PaLM-E Pipeline
(PaLM-E: An Embodied Multimodal Language Model, 2023)
- Input and scene representations for different sensor modalities: state estimation vectors, Vision Transformer (ViT) features, object-centric representations, Object Scene Representation Transformer (OSRT), entity referrals; low-level skills/actions.
- Model variants: 4B ViT + PaLM-8B, 22B ViT + PaLM-62B, 22B ViT + PaLM-540B; the LLM can be kept frozen.
- Abilities: perception, visually-grounded dialogue, and planning.
- It could be posed as building the World Model. A toy sketch of injecting encoder embeddings into a frozen LM is shown after this slide.
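The core mechanism, projecting continuous sensor embeddings into the LM's token-embedding space and feeding them alongside text tokens while the LM stays frozen, can be sketched as follows. This is a schematic under assumed shapes and module names (vision_encoder, embed_tokens, inputs_embeds), not PaLM-E's actual implementation.

```python
import torch
import torch.nn as nn

class MultimodalPrefixLM(nn.Module):
    """Schematic: a trainable projector maps image features into a frozen LM's embedding space."""

    def __init__(self, lm, vision_encoder, vis_dim=1024, lm_dim=4096):
        super().__init__()
        self.lm = lm                                   # frozen language model (hypothetical interface)
        self.vision_encoder = vision_encoder           # e.g. a ViT producing patch features
        self.projector = nn.Linear(vis_dim, lm_dim)    # trainable piece mapping vision -> LM space
        for p in self.lm.parameters():
            p.requires_grad_(False)                    # "frozen LLM, training an encoder is feasible"

    def forward(self, images, text_token_ids):
        vis_tokens = self.projector(self.vision_encoder(images))     # (B, N_img, lm_dim) soft tokens
        txt_tokens = self.lm.embed_tokens(text_token_ids)            # (B, N_txt, lm_dim)
        inputs = torch.cat([vis_tokens, txt_tokens], dim=1)          # simple prefix-style interleaving
        return self.lm(inputs_embeds=inputs)                         # decode plans / answers as text
```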
Learning-Theoretic View: How can LLMs power autonomous agents?
- Understand the world.
- Plan the tasks.
- Hasten the learning efficiency.
- Improve algorithmic generalization.
- Provide an intrinsic cost.
(Mindstorms in Natural Language-Based Societies of Mind, 2023; LeCun Yann, A Path Towards Autonomous Machine Intelligence, 2022)

Building the World Model: platform overview
[Figure: a decision-making platform spanning deployable hardware control systems (autonomous vehicles, robotics; control and navigation demos), trainable computation resources, interaction interfaces (HCI: params, results, visualization, correction; dataset, decision, environment), decision foundation models, and an algorithmic platform with perception-oriented models (state abstraction, generative model, inference), planning algorithms (sub-goal discovery, policy search), large cross-modality models (LLMs, RLHF, ADPP), causality, a knowledge base, and evaluation. The overall loop: Perception, Planning, Decision, Action.]

Delve into a World Model
- How can we imitate human intelligence? Joint Embedding Predictive Architectures (JEPA) versus the MDP view, with a predictive model at the core.
(LeCun Yann, A Path Towards Autonomous Machine Intelligence, 2022)
Computational RL: More World Models
- How can we imitate human intelligence? Model-based objectives combine a variational bound for the world model with critic and actor losses for behavior learning.
(Wang et al., Denoised MDP, ICML 2022; Hafner et al., DreamerV2, ICLR 2021)

Can LLMs help in understanding the world?
- Deep RL sketch: is a latent space learned from visual observations sufficient for learning a complex world? State abstraction: not that simple! (Amy Zhang, 2019)
- Hallucination! A heuristic function determines when the trajectory is inefficient or contains hallucination and should be stopped; see the sketch after this slide.
(ReAct: Synergizing Reasoning and Acting in Language Models, 2023; Reflexion: An Autonomous Agent with Dynamic Memory and Self-Reflection, 2023)
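One simple way to instantiate such a heuristic is to stop a reasoning-acting loop that repeats the same action (a common hallucination symptom) or exceeds a step budget (inefficiency). The sketch below is an illustration of the idea, not the exact rule from ReAct or Reflexion; llm_step, env.step, and the terminal signal are hypothetical interfaces.

```python
def react_rollout(llm_step, env, max_steps=20, max_repeats=3):
    """ReAct-style loop with a simple stop heuristic (illustrative assumptions throughout)."""
    history, seen = [], {}
    for _ in range(max_steps):
        thought, action = llm_step(history)             # reason, then act
        seen[action] = seen.get(action, 0) + 1
        if seen[action] > max_repeats:                  # stuck in a loop -> likely hallucinated state
            return history, "stopped: repeated action"
        obs = env.step(action)                          # grounded feedback from the environment
        history.append((thought, action, obs))
        if obs == "task_complete":                      # hypothetical terminal signal
            return history, "success"
    return history, "stopped: step budget exceeded"
```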
More than one LLM
(Mindstorms in Natural Language-Based Societies of Mind, 2023)

Dreamer: a magnificent backbone for MBRL (Dreamer series: Past and Present)
- Planning in latent spaces (Learning Latent Dynamics for Planning from Pixels, 2019).
- Recurrent State-Space Model: the RSSM is the key contribution of PlaNet, and the structure of this dynamics model has been utilized consistently throughout the subsequent Dreamer series.
- A variational encoder infers approximate state posteriors from past observations and actions; a minimal RSSM step is sketched below.
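A single RSSM step can be sketched as follows: a deterministic recurrent path carries history, while a stochastic latent state has a prior computed from the recurrent state and a posterior that additionally sees the current observation embedding. Layer sizes and names below are assumptions, not PlaNet's released code.

```python
import torch
import torch.nn as nn

class RSSMCell(nn.Module):
    """Minimal RSSM step (illustrative sketch)."""

    def __init__(self, stoch=30, deter=200, act_dim=6, embed=1024):
        super().__init__()
        self.gru = nn.GRUCell(stoch + act_dim, deter)
        self.prior_net = nn.Linear(deter, 2 * stoch)            # mean and log-std of p(s_t | h_t)
        self.post_net = nn.Linear(deter + embed, 2 * stoch)     # mean and log-std of q(s_t | h_t, o_t)

    def forward(self, h_prev, s_prev, action, obs_embed):
        h = self.gru(torch.cat([s_prev, action], dim=-1), h_prev)            # deterministic path
        prior_mean, prior_logstd = self.prior_net(h).chunk(2, dim=-1)
        post_mean, post_logstd = self.post_net(torch.cat([h, obs_embed], dim=-1)).chunk(2, dim=-1)
        s = post_mean + post_logstd.exp() * torch.randn_like(post_mean)      # reparameterized posterior sample
        # Training adds reconstruction/reward losses plus KL(posterior || prior): the variational bound.
        return h, s, (prior_mean, prior_logstd), (post_mean, post_logstd)
```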
Dreamer v1: making decisions via analytic gradients instead of online planning (Dreamer series: Past and Present)
- Learning long-horizon behaviors by latent imagination; strong empirical performance for visual control.
- Gradient propagation through imagined latent trajectories; actions are taken with a reparameterized stochastic policy (like SAC); the value target is a multi-step return computed in imagination. A compact sketch follows.
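The actor objective can be sketched as rolling the learned dynamics forward in latent space, forming lambda-returns from predicted rewards and values, and backpropagating analytic gradients through the imagined rollout. The world_model, actor, and critic interfaces below are assumptions, not the paper's code.

```python
import torch

def imagination_actor_loss(actor, critic, world_model, start_state, horizon=15, gamma=0.99, lam=0.95):
    """Dreamer-v1-style actor objective (schematic, hypothetical interfaces)."""
    states, rewards = [], []
    s = start_state
    for _ in range(horizon):
        a = actor(s).rsample()                       # reparameterized action keeps the graph differentiable
        s, r = world_model.imagine_step(s, a)        # latent transition + reward prediction (assumed API)
        states.append(s)
        rewards.append(r)
    values = [critic(s) for s in states]
    # Backward recursion for lambda-returns: G_t = r_t + gamma * ((1 - lam) * V_{t+1} + lam * G_{t+1})
    returns = [values[-1]]
    for t in reversed(range(horizon - 1)):
        g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * returns[0])
        returns.insert(0, g)
    return -torch.stack(returns).mean()              # actor ascends the imagined lambda-returns
```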
Computational RL: Our Design
- It aims to describe how we understand the world; the architecture is by no means unique.
- State abstraction; an off-policy model with a non-stationarity assumption; latent-space learning.
- Goal: understanding the observed world for decision making; control tasks provided by the DeepMind Control Suite.
(Liu, S., Zhang, X., Li, Y., Zhang, Y., & Cheng, J. (2023). On the Data-Efficiency with Contrastive Image Transformation in Reinforcement Learning. In International Conference on Learning Representations.)

Computational RL: Our Design (continued)
- Key idea: data manipulation can make understanding easier; make the world pretty much stationary, say, by parameterization.
- Representation-invariant framework: properties of the functions (smoothness, convexity) are discussed for invariant representation learning.
- Control the observation distribution in a high-dimensional space: a Gaussian.
(Balestriero et al., Learning in High Dimension Always Amounts to Extrapolation, arXiv 2021)

Computational RL: Our Design (insights)
- Theoretical understanding of the possible roles of SSL via data augmentation in Q-learning: regularization via statistics and an inductive bias. A minimal augmentation-regularized Q-learning sketch is given below.
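The interplay between data augmentation and Q-learning referenced on this slide can be illustrated with a DrQ-style sketch (a stand-in for the general technique, not the talk's own algorithm): random image shifts are applied to observations, and the augmentation acts as a statistical regularizer by averaging targets over transformed views. Names, shapes, and the discrete-action setting are assumptions.

```python
import torch
import torch.nn.functional as F

def random_shift(obs, pad=4):
    """Random shift augmentation for image observations of shape (B, C, H, W)."""
    padded = F.pad(obs, [pad] * 4, mode="replicate")
    h, w = obs.shape[-2:]
    out = torch.empty_like(obs)
    for i in range(obs.shape[0]):
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out

def augmented_q_loss(q_net, target_q_net, batch, gamma=0.99, k=2):
    """Average the TD target over K augmented views (regularization via statistics)."""
    obs, act, rew, next_obs, done = batch
    with torch.no_grad():
        target = 0.0
        for _ in range(k):
            next_q = target_q_net(random_shift(next_obs)).max(dim=1).values
            target = target + (rew + gamma * (1.0 - done) * next_q) / k
    q = q_net(random_shift(obs)).gather(1, act.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q, target)
```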
DreamerV3: SOTA Challenger
- DreamerV3: a strong baseline with out-of-the-box usability, requiring no adjustment of hyperparameters.
- DreamerV3 is the first algorithm capable of autonomously collecting diamonds in Minecraft from scratch, without any human data or pre-training.
- What if there were a pre-training stage for DreamerV3?

Building the World Model: platform overview
[The decision-making platform diagram from earlier is repeated here as a section divider.]

Can LLMs help in agent planning?
(MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge, 2022)
- Simulation platform with thousands of diverse open-ended tasks.
- Internet-scale multimodal Minecraft knowledge base.
- Novel algorithm for embodied agents with large-scale pre-training.
Can LLMs help in agent planning?
(Voyager: An Open-Ended Embodied Agent with Large Language Models, 2023)
- Yes, it can.

Can LLMs help in agent planning? MCTS with an LLM world model
(Large Language Models as Commonsense Knowledge for Large-Scale Task Planning, 2023)
- The LLM's commonsense knowledge is used to generate the initial belief over states (LLM as a commonsense world model).
- During search, actions are chosen based on Q-values, visit counts, and the LLM policy (LLM as a heuristic policy); random rollouts and Q-value updates follow standard MCTS. A selection-step sketch is given after this slide.
- Insight: how do we choose between L-Model and L-Policy*? One idea is the minimum description length (MDL) principle: theoretical analysis suggests that a hypothesis with a shorter description length has a smaller generalization error and is preferred.
  *L-Policy: treat the LLM as a policy and query it directly for the next actions.
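The selection rule on this slide, mixing Q-values, visit counts, and an LLM policy prior, can be sketched in PUCT form. This is a schematic of the idea, not the paper's exact rule; the node structure and llm_policy_prior mapping are hypothetical.

```python
import math

def select_action(node, llm_policy_prior, c_puct=1.5):
    """PUCT-style selection mixing Q-values, visit counts, and an LLM policy prior.

    node.children maps actions to children with `q` and `visits` fields (hypothetical structure);
    llm_policy_prior[action] is the probability the LLM assigns to that action.
    """
    total_visits = sum(child.visits for child in node.children.values())
    best_action, best_score = None, -math.inf
    for action, child in node.children.items():
        prior = llm_policy_prior.get(action, 1e-6)
        # Exploitation (Q) + exploration bonus shaped by the LLM's commonsense prior
        ucb = child.q + c_puct * prior * math.sqrt(total_visits + 1) / (1 + child.visits)
        if ucb > best_score:
            best_action, best_score = action, ucb
    return best_action
```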
ADPP: Building the World Model
[The decision-making platform diagram is repeated again as a section divider: Perception, Planning, Decision, Action.]
Can LLMs help in RL generalization?
- "The study of generalisation in deep Reinforcement Learning (RL) aims to produce RL algorithms whose policies generalise well to novel unseen situations at deployment time, avoiding overfitting to their training environments." [1]
- The goal of generalization in RL is to make RL algorithms perform well in test domains that are unseen during training. [2]
- Enable agents to adapt to the real world: agents should work across many tasks and adapt to new environments.
[1] Kirk R., Zhang A., Grefenstette E., et al. A Survey of Zero-shot Generalisation in Deep Reinforcement Learning. Journal of Artificial Intelligence Research, 2023, 76: 201-264.
[2] Ni T., Eysenbach B., Salakhutdinov R. Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs. In International Conference on Machine Learning, PMLR, 2022: 16691-16723.

Generalization in RL: how to generalize
- Meta-learning for RL: fast online adaptation.
- Multi-task learning for RL: solving different tasks.
- Adversarial and robust RL: dealing with environment interferences.
- Decision transformers: large decision models.
- State abstraction and representation learning: handling environment changes.
Similarity Metric
- Robust representations of the visual scene should be insensitive to irrelevant objects or details. [1]
- Policy Similarity Metric: defines a notion of similarity between states originating from different environments by the proximity of the long-term optimal behavior from these states. [2]
- Bisimulation Metric: illustrated failure cases include similar behavior with different rewards and different behavior with similar rewards.
[1] Zhang, A., McAllister, R. T., Calandra, R., Gal, Y., & Levine, S. (2020, October). Learning Invariant Representations for Reinforcement Learning without Reconstruction. In International Conference on Learning Representations.
[2] Agarwal, R., Machado, M. C., Castro, P. S., & Bellemare, M. G. (2020, October). Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning. In International Conference on Learning Representations.

Similarity Metric (continued)
- Policy Similarity Metric vs. Bisimulation Metric: both combine a local term (local optimal-behavior difference vs. reward difference) with a time-discounted Wasserstein distance over next-state distributions capturing the long-term optimal-behavior difference; a reconstruction of the two recursions is given below.
- Contrastive learning architecture: Policy Similarity Embeddings (PSEs) align labeled states that have the same distance from the obstacle, the invariant feature that generalizes across tasks.
(Agarwal et al., 2020, as above)
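Reconstructing the two recursions referenced above in their standard forms from the cited papers (the exact notation on the slide may differ slightly):

```latex
% Policy-dependent bisimulation metric: local reward difference + discounted Wasserstein term
d^{\pi}(x, y) \;=\; \bigl|\, r^{\pi}(x) - r^{\pi}(y) \,\bigr|
  \;+\; \gamma \, W_1\!\bigl(d^{\pi}\bigr)\bigl(P^{\pi}(\cdot \mid x),\, P^{\pi}(\cdot \mid y)\bigr)

% Policy similarity metric: local optimal-behavior difference + discounted Wasserstein term
d^{\pi^\ast}(x, y) \;=\; \mathrm{DIST}\!\bigl(\pi^\ast(\cdot \mid x),\, \pi^\ast(\cdot \mid y)\bigr)
  \;+\; \gamma \, W_1\!\bigl(d^{\pi^\ast}\bigr)\bigl(P^{\pi^\ast}(\cdot \mid x),\, P^{\pi^\ast}(\cdot \mid y)\bigr)
```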
Is reward enough with LLMs?
(Reward Design with Language Models, 2023)
- Conclusion: LLMs are efficient in-context learners. They are able to provide reward signals that are consistent with a user's objectives from examples; even a single example with an explanation will suffice.
- An LLM is able to identify well-known objectives and provide objective-aligned reward signals in a zero-shot setting.
- It can train objective-aligned agents when ground-truth rewards are not available in complex, longer-horizon tasks.
- A toy sketch of an LLM-as-reward query is given below.
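The idea of querying an LLM as a proxy reward can be sketched as follows; the prompt format, the llm() call, and the yes/no parsing are illustrative assumptions rather than the paper's implementation.

```python
def llm_reward(llm, objective, examples, episode_summary):
    """Query an LLM as a proxy reward model: 1.0 if the episode satisfies the objective, else 0.0.

    llm(prompt) -> str is a hypothetical completion interface; examples is a (possibly single-item)
    list of (summary, label, explanation) in-context demonstrations.
    """
    prompt = f"Objective: {objective}\n\n"
    for summary, label, explanation in examples:
        prompt += f"Episode: {summary}\nDoes it satisfy the objective? {label}\nWhy: {explanation}\n\n"
    prompt += f"Episode: {episode_summary}\nDoes it satisfy the objective? Answer Yes or No:"
    answer = llm(prompt).strip().lower()
    return 1.0 if answer.startswith("yes") else 0.0   # binary reward fed back to the RL agent
```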
ADPP roadmap: Perception, Planning, Decision, Action; Building the World Model
More important topics:
- RLHF for language-based decision making.
- Limitations of LLMs trained without actions.
- Adopting online/offline RL with LLMs.
- Designing long-term/short-term memory.
- Long-horizon reasoning in RL.
- Hierarchical RL with high-level skill discovery.
Can AGI agents be trained with a world model? (Upcoming talk, Tuesday, December 5, 2023: When Representation Learning Meets Causal Inference.)

Social life incentivizes the evolution of intelligence
- "Because corvids and apes share these cognitive tools, we argue that complex cognitive abilities evolved multiple times in distantly related species with vastly different brain structures in order to solve similar socioecological problems." (Science, vol. 306, issue 5703, pp. 1903-1907)
- An open-ended world could produce: theory of mind, negotiation, social skills, empathy, real language understanding.
- Human-level concept learning vs. grandmaster level in StarCraft II: which one is more difficult?
- How to use self-play in a world model?
(Ilya Sutskever, 2019)

Thank You! Q&A