張希_RLChina talk1126_watermark.pdf


融合大語言模型的智能體學習與決策:構建強化學習世界模型
(Agent Learning and Decision-Making Fused with Large Language Models: Building Reinforcement-Learning World Models)
張希 Xi Sheryl Zhang, Nov. 26, 2023

Slide 2. Why Reinforcement Learning?
Active learning vs. passive learning paradigms; the main distinction from supervised learning:
- An active learner interacts with the environment at training time, say, by posing queries or performing experiments.
- A passive learner only observes the information provided by the environment (or the teacher), without influencing or directing it.

Slide 3. RL Nomenclature
- The Partially Observed MDP (POMDP) model is usually advocated when the agent has no access to the exact system state but only an observation of the state.
- The dynamics may be known or unknown, stationary or non-stationary.
[Sergey Levine and Chelsea Finn, Deep Reinforcement Learning, Decision Making, and Control, ICML 2017 Tutorial]
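For reference, the POMDP mentioned above can be written in its standard textbook form (this tuple notation is not from the slides themselves):

```latex
\text{POMDP} = (\mathcal{S}, \mathcal{A}, \mathcal{O}, T, R, \Omega, \gamma), \qquad
b_t(s) \;\propto\; \Omega(o_t \mid s, a_{t-1}) \sum_{s'} T(s \mid s', a_{t-1})\, b_{t-1}(s'),
```

so the agent never observes the state s directly and instead acts on the belief b_t.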

Slide 4. RL Algorithm: Policy Evaluation & Improvement
MDP control via Generalized Policy Iteration (GPI):
- FQ1: How good is a specific policy? (policy evaluation)
- FQ2: How can we learn a good policy? (policy improvement)
[Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second Edition, The MIT Press, 2018]
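To make FQ1/FQ2 concrete, here is a minimal sketch of generalized policy iteration on a tabular MDP; the transition and reward arrays are toy placeholders, not anything from the talk.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Generalized policy iteration on a tabular MDP.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    """
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)           # start from an arbitrary policy
    while True:
        # FQ1: policy evaluation -- solve V = R_pi + gamma * P_pi V exactly
        P_pi = P[np.arange(S), pi]        # (S, S)
        R_pi = R[np.arange(S), pi]        # (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # FQ2: policy improvement -- act greedily w.r.t. Q(s, a)
        Q = R + gamma * P @ V             # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi

# Toy 2-state, 2-action MDP (placeholder numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
print(policy_iteration(P, R))
```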

Slide 5. Computational RL Anatomy
Deep RL: the components of the RL anatomy are parameterized by deep neural networks with weights w.
[Sergey Levine and Chelsea Finn, Deep Reinforcement Learning, Decision Making, and Control, ICML 2017 Tutorial]

Slide 6. Successes obtained via DRL
DQN (Atari), AlphaGo (Go), MuZero (Chess), AlphaStar (StarCraft).
How about using a language model? On the checkmate-in-one task, larger models are better at finding legal chess moves but struggle to find checkmating moves; none of the BIG-G models tested can solve checkmate-in-one. Not so good.
[Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2023]

Slide 7. Natural Language in RL
How might intelligent agents ground language understanding in their own embodied perception?

Slides 8-9. Natural Language in RL
Task-agnostic control via relabeled goal-conditioned behavioral cloning (GCBC) and Learning from Play (LfP). How can LLMs power autonomous agents?
[Learning Latent Plans from Play, 2019]
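A rough sketch of the relabeled goal-conditioned behavioral cloning idea behind GCBC/LfP, assuming a buffer of unlabeled play trajectories; the network, dimensions, and squared-error loss are generic stand-ins rather than the architecture of the cited paper.

```python
import random
import torch
import torch.nn as nn

obs_dim, goal_dim, act_dim = 32, 32, 8   # hypothetical sizes (goal = a future observation)
policy = nn.Sequential(nn.Linear(obs_dim + goal_dim, 256), nn.ReLU(),
                       nn.Linear(256, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def gcbc_update(trajectory):
    """Hindsight-relabel a play trajectory and take one behavioral-cloning step.

    trajectory: list of (obs, action) tensors from unlabeled play data.
    """
    t = random.randrange(len(trajectory) - 1)
    k = random.randrange(t + 1, len(trajectory))      # pick a future time step
    obs, act = trajectory[t]
    goal = trajectory[k][0]                           # the reached state becomes the goal
    pred = policy(torch.cat([obs, goal], dim=-1))
    loss = ((pred - act) ** 2).mean()                 # simple BC regression loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Usage with a fake play trajectory of length 20.
traj = [(torch.randn(obs_dim), torch.randn(act_dim)) for _ in range(20)]
print(gcbc_update(traj))
```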

Slides 10-11. A Gentle Start: PaLM-E
Results on general language tasks, results on planning success, and an overview of the transfer learning demonstrated by PaLM-E. Some conclusions:
- Dataset: the full mixture is better than data from a single robot.
- Params: with a frozen LLM, training only an encoder is feasible.
- The PaLM-E agent can be utilized in tasks such as VQA, NLG, etc.
[PaLM-E: an embodied multimodal language model, 2023]

Slide 12. PaLM-E Pipeline
Input and scene representations for different sensor modalities: a state-estimation vector, a Vision Transformer (ViT), object-centric representations, the Object Scene Representation Transformer (OSRT), and entity referrals; low-level skills/actions execute the resulting plans. Model scales: 4B ViT + PaLM-8B, 22B ViT + PaLM-62B, 22B ViT + PaLM-540B, with the LLM kept frozen. Abilities: perception, visually-grounded dialogue, and planning. It could be posed as building the world model.
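The pipeline interleaves continuous sensor embeddings with text tokens in the language model's input sequence. The sketch below only illustrates that interleaving with a frozen token-embedding table; the encoder, projector, dimensions, and insertion point are illustrative assumptions, not the actual PaLM-E components.

```python
import torch
import torch.nn as nn

d_model = 512                               # hypothetical LLM embedding width
vision_encoder = nn.Linear(768, d_model)    # stand-in for a ViT/OSRT encoder + projector
text_embed = nn.Embedding(32000, d_model)   # stand-in for the LLM's token embeddings

def build_prefix(image_feats, token_ids):
    """Project visual features into the LLM embedding space and interleave them
    with text-token embeddings, forming a 'multimodal sentence' prefix."""
    vis = vision_encoder(image_feats)               # (num_patches, d_model)
    txt = text_embed(token_ids)                     # (num_tokens, d_model)
    return torch.cat([txt[:3], vis, txt[3:]], dim=0)   # insert after a short text prefix

# Only the encoder/projector would be trained; the language model stays frozen.
for p in text_embed.parameters():
    p.requires_grad = False

seq = build_prefix(torch.randn(16, 768), torch.arange(10))
print(seq.shape)    # (16 + 10, d_model): multimodal prefix fed to the frozen LLM
```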

Slide 13. Learning-Theoretic View: How can LLMs power autonomous agents?
- Understand the world.
- Plan the tasks.
- Hasten the learning efficiency.
- Improve algorithmic generalization.
- Provide an intrinsic cost.
[Mindstorms in Natural Language-Based Societies of Mind, 2023; A Path Towards Autonomous Machine Intelligence, LeCun Yann, 2022]

Slide 14. Building the World Model (system overview)
[Architecture diagram: deployable hardware control systems (autonomous vehicles and robotics, with control, navigation, and demonstration) and trainable computation resources connect through interaction interfaces (交互接口) to decision foundation models (HCI, params, results, visualization, correction, datasets) and to an algorithmic platform of perception-oriented models, state abstraction, generative models, inference, planning algorithms, large cross-modality models, sub-goal discovery, policy search, LLMs, ADPP, and RLHF, all organized around a perception-planning-decision-action loop with causality, knowledge-base (KB), and evaluation modules.]

Slide 15. Delve into a World Model
How can we imitate human intelligence? Joint embedding predictive architectures (JEPA), MDPs, and predictive models.
[LeCun Yann, A Path Towards Autonomous Machine Intelligence, 2022]

Slide 16. Computational RL: More World Models
How can we imitate human intelligence? Variational bound, critic loss, and actor loss.
[Wang et al., Denoised MDP, ICML 2022; Hafner et al., DreamerV2, ICLR 2021]

Slide 17. Can LLMs help in understanding the world?
Deep RL sketch: is a latent space based on vision alone sufficient for learning a complex world? State abstraction is not that simple! [Amy Zhang, 2019]

Slide 18. Can LLMs help in understanding the world?
Hallucination! A heuristic function determines when the trajectory is inefficient or contains hallucination and should be stopped.
[ReAct: Synergizing reasoning and acting in language models, 2023; Reflexion: an autonomous agent with dynamic memory and self-reflection, 2023]

Slide 19. More than one LLM
[Mindstorms in Natural Language-Based Societies of Mind, 2023]

Slide 20. Dreamer: a magnificent backbone for MBRL (Dreamer series: past and present)
- Planning in latent spaces.
- The Recurrent State Space Model (RSSM) is the key contribution of PlaNet, and the structure of this dynamics model has been used consistently throughout the subsequent Dreamer series.
- A variational encoder infers approximate state posteriors from past observations and actions.
[Learning Latent Dynamics for Planning from Pixels, 2019]
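A minimal sketch of one RSSM step, with the deterministic recurrent path plus a stochastic latent whose prior comes from the dynamics and whose posterior also sees the current observation embedding; sizes and distribution details are simplifications rather than the published hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RSSMStep(nn.Module):
    """One step of a Recurrent State Space Model: a deterministic GRU path plus a
    stochastic latent, with a prior (dynamics only) and a posterior (observation-aware)."""
    def __init__(self, stoch=30, deter=200, act_dim=4, embed=256):
        super().__init__()
        self.cell = nn.GRUCell(stoch + act_dim, deter)
        self.prior_net = nn.Linear(deter, 2 * stoch)          # -> mean, raw std
        self.post_net = nn.Linear(deter + embed, 2 * stoch)

    def forward(self, z, action, h, obs_embed):
        h = self.cell(torch.cat([z, action], -1), h)          # deterministic path
        prior_mean, prior_std = self.prior_net(h).chunk(2, -1)
        post_mean, post_std = self.post_net(torch.cat([h, obs_embed], -1)).chunk(2, -1)
        prior = torch.distributions.Normal(prior_mean, F.softplus(prior_std) + 0.1)
        post = torch.distributions.Normal(post_mean, F.softplus(post_std) + 0.1)
        z_next = post.rsample()                               # reparameterized sample
        kl = torch.distributions.kl_divergence(post, prior).sum(-1)  # variational-bound term
        return z_next, h, kl

step = RSSMStep()
z, h = torch.zeros(1, 30), torch.zeros(1, 200)
z, h, kl = step(z, torch.zeros(1, 4), h, torch.randn(1, 256))
print(z.shape, kl.shape)
```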

Slide 21. Dreamer v1: making decisions by gradient analysis instead of planning (Dreamer series: past and present)
- Learning long-horizon behaviors by latent imagination; empirical performance for visual control.
- Gradients are propagated through the learned dynamics; actions are taken as in SAC, against a value target estimated on imagined trajectories.
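A compressed sketch of the latent-imagination idea: roll the learned dynamics forward in latent space, estimate the return with a bootstrapped value, and backpropagate it through the differentiable dynamics into the actor. The toy modules below stand in for the real Dreamer components, and value/critic learning is omitted.

```python
import torch
import torch.nn as nn

latent, act_dim, horizon, gamma = 64, 6, 15, 0.99
dynamics = nn.Sequential(nn.Linear(latent + act_dim, latent), nn.Tanh())   # toy latent model
reward_fn = nn.Linear(latent, 1)
critic = nn.Linear(latent, 1)
actor = nn.Sequential(nn.Linear(latent, 64), nn.Tanh(), nn.Linear(64, act_dim), nn.Tanh())
actor_opt = torch.optim.Adam(actor.parameters(), lr=8e-5)

def actor_update(z0):
    """Imagine a latent trajectory and ascend the estimated return by
    backpropagating through the (differentiable) learned dynamics."""
    z, ret, discount = z0, 0.0, 1.0
    for _ in range(horizon):
        a = actor(z)                                  # action from the policy
        z = dynamics(torch.cat([z, a], -1))           # imagined next latent state
        ret = ret + discount * reward_fn(z)           # imagined reward
        discount *= gamma
    ret = ret + discount * critic(z)                  # bootstrap with the value target
    loss = -ret.mean()                                # maximize imagined return
    actor_opt.zero_grad(); loss.backward(); actor_opt.step()
    return loss.item()

print(actor_update(torch.randn(16, latent)))
```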

Slide 22. Computational RL: Our Design
It aims to describe how we understand the world; the architecture is by no means unique.

Slide 23. Computational RL: Our Design
- State abstraction: an off-policy model with a non-stationarity assumption.
- Latent-space learning. Goal: understanding the observed world for decision making.
- Control tasks are provided by the DeepMind Control Suite.
[Liu, Sc., Zhang, X., Li, Ys., Zhang, Yf., & Cheng, J. (2023, May). On the data-efficiency with contrastive image transformation in reinforcement learning. In International Conference on Learning Representations.]

Slides 24-26. Computational RL: Our Design
- Make the world behave much like a stationary one, say, by parameterization.
- Key idea: data manipulation can make understanding easier (a representation-invariant framework).
- Properties of functions (smoothness, convexity) are discussed for invariant representation learning.
- Control the observation distribution in a high-dimensional space: a Gaussian.
[Balestriero et al., Learning in High Dimension Always Amounts to Extrapolation, arXiv 2021]
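The data-manipulation idea can be illustrated with a small contrastive objective over two randomly cropped views of the same observations, in the spirit of contrastive image transformation for data-efficient RL; the crop size, encoder, and InfoNCE form are generic choices, not the specifics of the cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
                        nn.Flatten(), nn.LazyLinear(50))

def random_crop(obs, out=76):
    """Random spatial crop of a batch of (B, C, 84, 84) observations."""
    b, _, h, w = obs.shape
    top = torch.randint(0, h - out + 1, (1,)).item()
    left = torch.randint(0, w - out + 1, (1,)).item()
    return obs[:, :, top:top + out, left:left + out]

def contrastive_loss(obs):
    """InfoNCE between two augmented views of the same observations."""
    z1 = F.normalize(encoder(random_crop(obs)), dim=-1)
    z2 = F.normalize(encoder(random_crop(obs)), dim=-1)
    logits = z1 @ z2.t() / 0.1                    # similarity matrix with temperature 0.1
    labels = torch.arange(obs.shape[0])           # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

print(contrastive_loss(torch.randn(8, 3, 84, 84)))
```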

Slide 27. Computational RL: Our Design
Insights: a theoretical understanding of the possible roles of SSL via data augmentation, Q-learning, regularization via statistics, and inductive bias.

Slide 33. DreamerV3: SOTA Challenger
Dreamer v3 is a strong baseline with out-of-the-box usability, requiring no adjustment of hyperparameters. DreamerV3 is the first algorithm capable of autonomously collecting diamonds in Minecraft from scratch, without any human data or pre-training. What if there were a pre-training stage for DreamerV3?

Slide 35. Building the World Model (system overview, repeated)
[Same architecture diagram as Slide 14.]

Slide 36. Can LLMs help in agent planning?
- A simulation platform with thousands of diverse open-ended tasks.
- An internet-scale multimodal Minecraft knowledge base.
- A novel algorithm for embodied agents with large-scale pre-training.
[MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge, 2022]

Slide 37. Can LLMs help in agent planning?
Yes, it can.
[Voyager: An Open-Ended Embodied Agent with Large Language Models, 2023]

Slide 38. Can LLMs help in agent planning? MCTS with an LLM world model
- MCTS utilizes the LLM's commonsense knowledge to generate the initial belief over states.
- Random rollouts; actions are chosen based on Q-value, visit count, and the LLM policy; Q-values are then updated.
[Large Language Models as Commonsense Knowledge for Large-Scale Task Planning, 2023]
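A sketch of the selection rule outlined on this slide: a PUCT-style score that mixes a node's Q-value and visit count with a prior from the LLM policy. The llm_prior function is a hypothetical stub; in the cited work the prior is obtained by prompting the language model.

```python
import math

def llm_prior(state, actions):
    """Hypothetical stub: in practice, prompt an LLM with the task and state
    description and read off a probability for each candidate action."""
    return {a: 1.0 / len(actions) for a in actions}

def select_action(node, c_puct=1.0):
    """PUCT-style selection mixing Q-values, visit counts, and the LLM prior."""
    prior = llm_prior(node["state"], list(node["children"]))
    total_visits = sum(ch["N"] for ch in node["children"].values())

    def score(a):
        ch = node["children"][a]
        q = ch["W"] / ch["N"] if ch["N"] > 0 else 0.0          # mean return so far
        u = c_puct * prior[a] * math.sqrt(total_visits + 1) / (1 + ch["N"])
        return q + u

    return max(node["children"], key=score)

# Toy node with two candidate actions.
node = {"state": "craft a wooden pickaxe",
        "children": {"gather wood": {"N": 3, "W": 2.0},
                     "mine stone":  {"N": 1, "W": 0.2}}}
print(select_action(node))
```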

Slide 39. Can LLMs help in agent planning? MCTS
- LLM as a commonsense world model vs. LLM as a heuristic policy.
- Insight: how do we choose between L-Model and L-Policy*? One idea is the minimum description length (MDL) principle: theoretical analysis suggests that a hypothesis with a shorter description length has a smaller generalization error and is preferred.
- *L-Policy: treat the LLM as a policy and query it directly for the next actions.
[Large Language Models as Commonsense Knowledge for Large-Scale Task Planning, 2023]
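The MDL argument can be made concrete with the standard Occam/description-length generalization bound (a textbook result quoted here for intuition, not a bound from the cited paper): for a hypothesis h with description length |h| bits and m i.i.d. samples, with probability at least 1 - δ,

```latex
\operatorname{err}_{\mathcal{D}}(h) \;\le\; \operatorname{err}_{S}(h)
  \;+\; \sqrt{\frac{|h|\,\ln 2 + \ln(2/\delta)}{2m}},
```

so, between L-Model and L-Policy, the option whose hypothesis explains the data with the shorter description is expected to generalize better.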

Slide 40. ADPP: Building the World Model (system overview, repeated)
[Same architecture diagram as Slide 14.]

Slide 41. Can LLMs help in RL generalization?
"The study of generalisation in deep Reinforcement Learning (RL) aims to produce RL algorithms whose policies generalise well to novel unseen situations at deployment time, avoiding overfitting to their training environments." [1]
"The goal of generalization in RL is to make RL algorithms perform well in test domains that are unseen during training." [2]
Generalization should enable agents to adapt to the real world: agents that work across many tasks and adapt to new environments.
[1] Kirk R, Zhang A, Grefenstette E, et al. A Survey of Zero-shot Generalisation in Deep Reinforcement Learning. Journal of Artificial Intelligence Research, 2023, 76: 201-264.
[2] Ni T, Eysenbach B, Salakhutdinov R. Recurrent model-free RL can be a strong baseline for many POMDPs. In International Conference on Machine Learning. PMLR, 2022: 16691-16723.

Slide 42. Generalization in RL
How to generalize: meta-learning, robust RL, representation learning, and multi-task learning.
- Meta-learning for RL: fast online adaptation.
- Multi-task learning for RL: solving different tasks.
- Adversarial and robust RL: dealing with environment interference.
- Decision transformers: toward large decision models.
- State abstraction and representation learning: coping with environment changes.

Slide 43. Similarity Metric
Robust representations of the visual scene should be insensitive to irrelevant objects or details. [1][2]
- The Policy Similarity Metric defines a notion of similarity between states originating from different environments by the proximity of the long-term optimal behavior from those states.
- The Bisimulation Metric contrasts similar behavior with different reward against different behavior with similar reward.
[1] Zhang, A., McAllister, R.T., Calandra, R., Gal, Y., & Levine, S. (2020, October). Learning Invariant Representations for Reinforcement Learning without Reconstruction. In International Conference on Learning Representations.
[2] Agarwal, R., Machado, M.C., Castro, P.S., & Bellemare, M.G. (2020, October). Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning. In International Conference on Learning Representations.

Slide 44. Similarity Metric
Policy Similarity Metric vs. Bisimulation Metric: both combine a local behavior difference with a time-discounted Wasserstein distance that captures the long-term behavior difference. With a contrastive learning architecture, PSEs align labeled states that have the same distance from the obstacle, the invariant feature that generalizes across tasks.
[Agarwal, R., Machado, M.C., Castro, P.S., & Bellemare, M.G. (2020, October). Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning. In International Conference on Learning Representations.]
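Written out, the two metrics on these slides differ mainly in their local term (the forms below follow the usual presentation of these metrics, with coefficients simplified; they are not copied from the slides):

```latex
\text{Bisimulation:}\quad d(x, y) = |r(x) - r(y)|
  + \gamma\, W_1\!\big(P(\cdot \mid x),\, P(\cdot \mid y);\, d\big)
```

```latex
\text{PSM:}\quad d^{\pi^*}(x, y) = \mathrm{DIST}\!\big(\pi^*(\cdot \mid x),\, \pi^*(\cdot \mid y)\big)
  + \gamma\, W_1\!\big(P^{\pi^*}(\cdot \mid x),\, P^{\pi^*}(\cdot \mid y);\, d^{\pi^*}\big)
```

The local optimal-behavior difference in the PSM replaces the reward difference of the bisimulation metric, while the time-discounted Wasserstein term over next-state distributions captures the long-term behavior difference in both.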

Slide 45. Is reward enough with LLMs?
Conclusion: LLMs are efficient in-context learners. They are able to provide reward signals that are consistent with a user's objectives from examples; even a single example with an explanation will suffice. An LLM is able to identify well-known objectives and provide objective-aligned reward signals in a zero-shot setting, and it can train objective-aligned agents when ground-truth rewards are not available in complex, longer-horizon tasks.
[Reward Design with Language Models, 2023]
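A schematic of the LLM-as-proxy-reward loop described here: summarize the episode outcome as text, ask the LLM whether it satisfies the user's objective (optionally with one example and explanation in context), and use the parsed answer as the reward. query_llm is a placeholder for whichever completion API is actually used.

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a call to an actual LLM API; expected to return 'Yes' or 'No'."""
    raise NotImplementedError

def llm_reward(user_objective: str, example: str, episode_summary: str) -> float:
    """Binary reward from an LLM acting as an in-context reward model."""
    prompt = (
        f"Objective: {user_objective}\n"
        f"Example of a desirable outcome: {example}\n"
        f"Episode outcome: {episode_summary}\n"
        "Does the episode outcome satisfy the objective? Answer Yes or No."
    )
    answer = query_llm(prompt)
    return 1.0 if answer.strip().lower().startswith("yes") else 0.0

# The scalar llm_reward(...) would then replace the environment reward
# in an otherwise standard RL update.
```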

Slide 46. ADPP (Perception, Planning, Decision, Action): Building the World Model
More important topics:
- RLHF for language-based decision making.
- Limitations of LLMs trained without actions.
- Adopting online/offline RL with LLMs.
- Designing long-term/short-term memory.
- Long-horizon reasoning in RL.
- Hierarchical RL with high-level skill discovery.
Can AGI agents be trained with a world model? (Tuesday, December 5, 2023: "When Representation Learning Meets Causal Inference")

Slide 47. Social life incentivizes the evolution of intelligence
"Because corvids and apes share these cognitive tools, we argue that complex cognitive abilities evolved multiple times in distantly related species with vastly different brain structures in order to solve similar socioecological problems." (Science, vol. 306, issue 5703, pp. 1903-1907)
An open-ended world could produce: theory of mind, negotiation, social skills, empathy, and real language understanding. Human-level concept learning vs. grandmaster level in StarCraft II: which one is more difficult? How to use self-play in the world model? (Ilya Sutskever, 2019)

Thank You! Q&A
