Agent Learning and Decision Making with Large Language Models: Building Reinforcement Learning World Models
Xi Sheryl Zhang (張希), Nov. 26, 2023

Why Reinforcement Learning? Active Learning vs. Passive Learning Paradigms
- The main distinction from supervised learning:
- An active learner interacts with the environment at training time, say, by posing queries or performing experiments.
- A passive learner only observes the information provided by the environment (or the teacher) without influencing or directing it.

RL Nomenclature
- The Partially Observed MDP (POMDP) model is usually advocated when the agent has no access to the exact system state but only an observation of the state.
- Dynamics may be known/unknown, stationary/non-stationary.
(Sergey Levine and Chelsea Finn, Deep Reinforcement Learning, Decision Making, and Control, ICML 2017 Tutorial)

RL Algorithm: Policy Evaluation & Improvement
- MDP control via Generalized Policy Iteration (GPI).
- FQ1: How good is a specific policy? (policy evaluation)
- FQ2: How can we learn a good policy? (policy improvement)
- A minimal tabular sketch of this loop is given below.
(Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second Edition, The MIT Press, 2018)
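The GPI loop named on this slide can be written in a few lines for a tabular MDP. The sketch below is illustrative only and assumes hypothetical input arrays P (transition probabilities) and R (expected rewards); it is not code from the talk.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, tol=1e-6):
    """Tabular GPI sketch: P has shape (S, A, S), R has shape (S, A)."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)   # arbitrary initial deterministic policy
    V = np.zeros(S)
    while True:
        # FQ1 -- policy evaluation: Bellman expectation backups until convergence
        while True:
            V_new = R[np.arange(S), policy] + gamma * (P[np.arange(S), policy] * V).sum(axis=1)
            done = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if done:
                break
        # FQ2 -- policy improvement: act greedily w.r.t. one-step lookahead values
        Q = R + gamma * (P * V).sum(axis=2)          # shape (S, A)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):       # GPI fixed point reached
            return policy, V
        policy = new_policy
```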
Computational RL Anatomy
- Deep RL: the policy and value components are represented by deep neural networks with weights w.
(Sergey Levine and Chelsea Finn, Deep Reinforcement Learning, Decision Making, and Control, ICML 2017 Tutorial)

Successes obtained via DRL
- DQN (Atari), AlphaGo (Go), MuZero (Chess), AlphaStar (StarCraft).
- How about using an LM? Checkmate-in-one task (Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2023): larger models are better at finding legal chess moves, but struggle to find checkmating moves. None of the BIG-G models tested can solve the checkmate-in-one task. Not so good.

Natural Language in RL
- How might intelligent agents ground language understanding in their own embodied perception?
- Task-agnostic control: relabeled goal-conditioned behavioral cloning (GCBC) and Learning from Play (LfP). (Learning Latent Plans from Play, 2019)
- How can LLMs power autonomous agents? A hindsight-relabeling sketch for GCBC follows below.
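Relabeled GCBC can be illustrated with a short sketch: sample a window from teleoperated "play" data, treat the window's final observation as the goal, and behavioral-clone the goal-conditioned policy. The function and tensor names below are hypothetical assumptions, not code from the LfP paper.

```python
import torch

def gcbc_loss(policy, play_batch, window=32):
    """Relabeled goal-conditioned behavioral cloning on a batch of play sequences.

    play_batch: dict with 'obs' (B, T, obs_dim) and 'act' (B, T, act_dim) tensors
    (hypothetical layout). The goal is relabeled in hindsight as the window's last observation.
    """
    obs, act = play_batch["obs"], play_batch["act"]
    B, T, _ = obs.shape
    start = torch.randint(0, T - window, (B,))
    idx = start[:, None] + torch.arange(window)              # (B, window) time indices
    o = obs[torch.arange(B)[:, None], idx]                   # windowed observations
    a = act[torch.arange(B)[:, None], idx]                   # windowed actions
    goal = o[:, -1]                                          # hindsight goal = final observation
    pred = policy(o.reshape(B * window, -1),
                  goal.repeat_interleave(window, dim=0))     # condition every step on the goal
    return ((pred - a.reshape(B * window, -1)) ** 2).mean()  # simple MSE behavioral cloning
```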
A Gentle Start: PaLM-E
(PaLM-E: An Embodied Multimodal Language Model, 2023)
- Results on general language tasks; results on planning success; an overview of the transfer learning demonstrated by PaLM-E.
- Some conclusions: Dataset: the full mixture is better than a single robot; Params: with a frozen LLM, training an encoder is feasible; the PaLM-E agent can be utilized in tasks such as VQA, NLG, etc.

PaLM-E Pipeline
(PaLM-E: An Embodied Multimodal Language Model, 2023)
- Input and scene representations for different sensor modalities: state estimation vectors, Vision Transformer (ViT) features, object-centric representations, Object Scene Representation Transformer (OSRT), entity referrals; low-level skills/actions.
- Model variants: 4B ViT + PaLM-8B, 22B ViT + PaLM-62B, 22B ViT + PaLM-540B; the LLM can be kept frozen.
- Abilities: perception, visually-grounded dialogue, and planning.
- It could be posed as building the World Model. A toy sketch of injecting encoder embeddings into a frozen LM is shown after this slide.
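The core mechanism, projecting continuous sensor embeddings into the LM's token-embedding space and feeding them alongside text tokens while the LM stays frozen, can be sketched as follows. This is a schematic under assumed shapes and module names (vision_encoder, embed_tokens, inputs_embeds), not PaLM-E's actual implementation.

```python
import torch
import torch.nn as nn

class MultimodalPrefixLM(nn.Module):
    """Schematic: a trainable projector maps image features into a frozen LM's embedding space."""

    def __init__(self, lm, vision_encoder, vis_dim=1024, lm_dim=4096):
        super().__init__()
        self.lm = lm                                   # frozen language model (hypothetical interface)
        self.vision_encoder = vision_encoder           # e.g. a ViT producing patch features
        self.projector = nn.Linear(vis_dim, lm_dim)    # trainable piece mapping vision -> LM space
        for p in self.lm.parameters():
            p.requires_grad_(False)                    # "frozen LLM, training an encoder is feasible"

    def forward(self, images, text_token_ids):
        vis_tokens = self.projector(self.vision_encoder(images))     # (B, N_img, lm_dim) soft tokens
        txt_tokens = self.lm.embed_tokens(text_token_ids)            # (B, N_txt, lm_dim)
        inputs = torch.cat([vis_tokens, txt_tokens], dim=1)          # simple prefix-style interleaving
        return self.lm(inputs_embeds=inputs)                         # decode plans / answers as text
```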
Learning-Theoretic View: How can LLMs power autonomous agents?
- Understand the world.
- Plan the tasks.
- Hasten the learning efficiency.
- Improve algorithmic generalization.
- Provide an intrinsic cost.
(Mindstorms in Natural Language-Based Societies of Mind, 2023; LeCun Yann, A Path Towards Autonomous Machine Intelligence, 2022)

Building the World Model: platform overview
[Figure: a decision-making platform spanning deployable hardware control systems (autonomous vehicles, robotics; control and navigation demos), trainable computation resources, interaction interfaces (HCI: params, results, visualization, correction; dataset, decision, environment), decision foundation models, and an algorithmic platform with perception-oriented models (state abstraction, generative model, inference), planning algorithms (sub-goal discovery, policy search), large cross-modality models (LLMs, RLHF, ADPP), causality, a knowledge base, and evaluation. The overall loop: Perception, Planning, Decision, Action.]

Delve into a World Model
- How can we imitate human intelligence? Joint Embedding Predictive Architectures (JEPA) versus the MDP view, with a predictive model at the core.
(LeCun Yann, A Path Towards Autonomous Machine Intelligence, 2022)
Computational RL: More World Models
- How can we imitate human intelligence? Model-based objectives combine a variational bound for the world model with critic and actor losses for behavior learning.
(Wang et al., Denoised MDP, ICML 2022; Hafner et al., DreamerV2, ICLR 2021)

Can LLMs help in understanding the world?
- Deep RL sketch: is a latent space learned from visual observations sufficient for learning a complex world? State abstraction: not that simple! (Amy Zhang, 2019)
- Hallucination! A heuristic function determines when the trajectory is inefficient or contains hallucination and should be stopped; see the sketch after this slide.
(ReAct: Synergizing Reasoning and Acting in Language Models, 2023; Reflexion: An Autonomous Agent with Dynamic Memory and Self-Reflection, 2023)
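One simple way to instantiate such a heuristic is to stop a reasoning-acting loop that repeats the same action (a common hallucination symptom) or exceeds a step budget (inefficiency). The sketch below is an illustration of the idea, not the exact rule from ReAct or Reflexion; llm_step, env.step, and the terminal signal are hypothetical interfaces.

```python
def react_rollout(llm_step, env, max_steps=20, max_repeats=3):
    """ReAct-style loop with a simple stop heuristic (illustrative assumptions throughout)."""
    history, seen = [], {}
    for _ in range(max_steps):
        thought, action = llm_step(history)             # reason, then act
        seen[action] = seen.get(action, 0) + 1
        if seen[action] > max_repeats:                  # stuck in a loop -> likely hallucinated state
            return history, "stopped: repeated action"
        obs = env.step(action)                          # grounded feedback from the environment
        history.append((thought, action, obs))
        if obs == "task_complete":                      # hypothetical terminal signal
            return history, "success"
    return history, "stopped: step budget exceeded"
```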
More than one LLM
(Mindstorms in Natural Language-Based Societies of Mind, 2023)

Dreamer: a magnificent backbone for MBRL (Dreamer series: Past and Present)
- Planning in latent spaces (Learning Latent Dynamics for Planning from Pixels, 2019).
- Recurrent State-Space Model: the RSSM is the key contribution of PlaNet, and the structure of this dynamics model has been utilized consistently throughout the subsequent Dreamer series.
- A variational encoder infers approximate state posteriors from past observations and actions; a minimal RSSM step is sketched below.
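A single RSSM step can be sketched as follows: a deterministic recurrent path carries history, while a stochastic latent state has a prior computed from the recurrent state and a posterior that additionally sees the current observation embedding. Layer sizes and names below are assumptions, not PlaNet's released code.

```python
import torch
import torch.nn as nn

class RSSMCell(nn.Module):
    """Minimal RSSM step (illustrative sketch)."""

    def __init__(self, stoch=30, deter=200, act_dim=6, embed=1024):
        super().__init__()
        self.gru = nn.GRUCell(stoch + act_dim, deter)
        self.prior_net = nn.Linear(deter, 2 * stoch)            # mean and log-std of p(s_t | h_t)
        self.post_net = nn.Linear(deter + embed, 2 * stoch)     # mean and log-std of q(s_t | h_t, o_t)

    def forward(self, h_prev, s_prev, action, obs_embed):
        h = self.gru(torch.cat([s_prev, action], dim=-1), h_prev)            # deterministic path
        prior_mean, prior_logstd = self.prior_net(h).chunk(2, dim=-1)
        post_mean, post_logstd = self.post_net(torch.cat([h, obs_embed], dim=-1)).chunk(2, dim=-1)
        s = post_mean + post_logstd.exp() * torch.randn_like(post_mean)      # reparameterized posterior sample
        # Training adds reconstruction/reward losses plus KL(posterior || prior): the variational bound.
        return h, s, (prior_mean, prior_logstd), (post_mean, post_logstd)
```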
Dreamer v1: making decisions via analytic gradients instead of online planning (Dreamer series: Past and Present)
- Learning long-horizon behaviors by latent imagination; strong empirical performance for visual control.
- Gradient propagation through imagined latent trajectories; actions are taken with a reparameterized stochastic policy (like SAC); the value target is a multi-step return computed in imagination. A compact sketch follows.
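The actor objective can be sketched as rolling the learned dynamics forward in latent space, forming lambda-returns from predicted rewards and values, and backpropagating analytic gradients through the imagined rollout. The world_model, actor, and critic interfaces below are assumptions, not the paper's code.

```python
import torch

def imagination_actor_loss(actor, critic, world_model, start_state, horizon=15, gamma=0.99, lam=0.95):
    """Dreamer-v1-style actor objective (schematic, hypothetical interfaces)."""
    states, rewards = [], []
    s = start_state
    for _ in range(horizon):
        a = actor(s).rsample()                       # reparameterized action keeps the graph differentiable
        s, r = world_model.imagine_step(s, a)        # latent transition + reward prediction (assumed API)
        states.append(s)
        rewards.append(r)
    values = [critic(s) for s in states]
    # Backward recursion for lambda-returns: G_t = r_t + gamma * ((1 - lam) * V_{t+1} + lam * G_{t+1})
    returns = [values[-1]]
    for t in reversed(range(horizon - 1)):
        g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * returns[0])
        returns.insert(0, g)
    return -torch.stack(returns).mean()              # actor ascends the imagined lambda-returns
```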
Computational RL: Our Design
- It aims to describe how we understand the world; the architecture is by no means unique.
- State abstraction; an off-policy model with a non-stationarity assumption; latent-space learning.
- Goal: understanding the observed world for decision making; control tasks provided by the DeepMind Control Suite.
(Liu, S., Zhang, X., Li, Y., Zhang, Y., & Cheng, J. (2023). On the Data-Efficiency with Contrastive Image Transformation in Reinforcement Learning. In International Conference on Learning Representations.)

Computational RL: Our Design (continued)
- Key idea: data manipulation can make understanding easier; make the world pretty much stationary, say, by parameterization.
- Representation-invariant framework: properties of the functions (smoothness, convexity) are discussed for invariant representation learning.
- Control the observation distribution in a high-dimensional space: a Gaussian.
(Balestriero et al., Learning in High Dimension Always Amounts to Extrapolation, arXiv 2021)

Computational RL: Our Design (insights)
- Theoretical understanding of the possible roles of SSL via data augmentation in Q-learning: regularization via statistics and an inductive bias. A minimal augmentation-regularized Q-learning sketch is given below.
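The interplay between data augmentation and Q-learning referenced on this slide can be illustrated with a DrQ-style sketch (a stand-in for the general technique, not the talk's own algorithm): random image shifts are applied to observations, and the augmentation acts as a statistical regularizer by averaging targets over transformed views. Names, shapes, and the discrete-action setting are assumptions.

```python
import torch
import torch.nn.functional as F

def random_shift(obs, pad=4):
    """Random shift augmentation for image observations of shape (B, C, H, W)."""
    padded = F.pad(obs, [pad] * 4, mode="replicate")
    h, w = obs.shape[-2:]
    out = torch.empty_like(obs)
    for i in range(obs.shape[0]):
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out

def augmented_q_loss(q_net, target_q_net, batch, gamma=0.99, k=2):
    """Average the TD target over K augmented views (regularization via statistics)."""
    obs, act, rew, next_obs, done = batch
    with torch.no_grad():
        target = 0.0
        for _ in range(k):
            next_q = target_q_net(random_shift(next_obs)).max(dim=1).values
            target = target + (rew + gamma * (1.0 - done) * next_q) / k
    q = q_net(random_shift(obs)).gather(1, act.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q, target)
```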
DreamerV3: SOTA Challenger
- DreamerV3: a strong baseline with out-of-the-box usability, requiring no adjustment of hyperparameters.
- DreamerV3 is the first algorithm capable of autonomously collecting diamonds in Minecraft from scratch, without any human data or pre-training.
- What if there were a pre-training stage for DreamerV3?

Building the World Model: platform overview
[The decision-making platform diagram from earlier is repeated here as a section divider.]

Can LLMs help in agent planning?
(MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge, 2022)
- Simulation platform with thousands of diverse open-ended tasks.
- Internet-scale multimodal Minecraft knowledge base.
- Novel algorithm for embodied agents with large-scale pre-training.
Can LLMs help in agent planning?
(Voyager: An Open-Ended Embodied Agent with Large Language Models, 2023)
- Yes, it can.

Can LLMs help in agent planning? MCTS with an LLM world model
(Large Language Models as Commonsense Knowledge for Large-Scale Task Planning, 2023)
- The LLM's commonsense knowledge is used to generate the initial belief over states (LLM as a commonsense world model).
- During search, actions are chosen based on Q-values, visit counts, and the LLM policy (LLM as a heuristic policy); random rollouts and Q-value updates follow standard MCTS. A selection-step sketch is given after this slide.
- Insight: how do we choose between L-Model and L-Policy*? One idea is the minimum description length (MDL) principle: theoretical analysis suggests that a hypothesis with a shorter description length has a smaller generalization error and is preferred.
  *L-Policy: treat the LLM as a policy and query it directly for the next actions.
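The selection rule on this slide, mixing Q-values, visit counts, and an LLM policy prior, can be sketched in PUCT form. This is a schematic of the idea, not the paper's exact rule; the node structure and llm_policy_prior mapping are hypothetical.

```python
import math

def select_action(node, llm_policy_prior, c_puct=1.5):
    """PUCT-style selection mixing Q-values, visit counts, and an LLM policy prior.

    node.children maps actions to children with `q` and `visits` fields (hypothetical structure);
    llm_policy_prior[action] is the probability the LLM assigns to that action.
    """
    total_visits = sum(child.visits for child in node.children.values())
    best_action, best_score = None, -math.inf
    for action, child in node.children.items():
        prior = llm_policy_prior.get(action, 1e-6)
        # Exploitation (Q) + exploration bonus shaped by the LLM's commonsense prior
        ucb = child.q + c_puct * prior * math.sqrt(total_visits + 1) / (1 + child.visits)
        if ucb > best_score:
            best_action, best_score = action, ucb
    return best_action
```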
ADPP: Building the World Model
[The decision-making platform diagram is repeated again as a section divider: Perception, Planning, Decision, Action.]
Can LLMs help in RL generalization?
- "The study of generalisation in deep Reinforcement Learning (RL) aims to produce RL algorithms whose policies generalise well to novel unseen situations at deployment time, avoiding overfitting to their training environments." [1]
- The goal of generalization in RL is to make RL algorithms perform well in test domains that are unseen during training. [2]
- Enable agents to adapt to the real world: agents should work across many tasks and adapt to new environments.
[1] Kirk R., Zhang A., Grefenstette E., et al. A Survey of Zero-shot Generalisation in Deep Reinforcement Learning. Journal of Artificial Intelligence Research, 2023, 76: 201-264.
[2] Ni T., Eysenbach B., Salakhutdinov R. Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs. In International Conference on Machine Learning, PMLR, 2022: 16691-16723.

Generalization in RL: how to generalize
- Meta-learning for RL: fast online adaptation.
- Multi-task learning for RL: solving different tasks.
- Adversarial and robust RL: dealing with environment interferences.
- Decision transformers: large decision models.
- State abstraction and representation learning: handling environment changes.
Similarity Metric
- Robust representations of the visual scene should be insensitive to irrelevant objects or details. [1]
- Policy Similarity Metric: defines a notion of similarity between states originating from different environments by the proximity of the long-term optimal behavior from these states. [2]
- Bisimulation Metric: illustrated failure cases include similar behavior with different rewards and different behavior with similar rewards.
[1] Zhang, A., McAllister, R. T., Calandra, R., Gal, Y., & Levine, S. (2020, October). Learning Invariant Representations for Reinforcement Learning without Reconstruction. In International Conference on Learning Representations.
[2] Agarwal, R., Machado, M. C., Castro, P. S., & Bellemare, M. G. (2020, October). Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning. In International Conference on Learning Representations.

Similarity Metric (continued)
- Policy Similarity Metric vs. Bisimulation Metric: both combine a local term (local optimal-behavior difference vs. reward difference) with a time-discounted Wasserstein distance over next-state distributions capturing the long-term optimal-behavior difference; a reconstruction of the two recursions is given below.
- Contrastive learning architecture: Policy Similarity Embeddings (PSEs) align labeled states that have the same distance from the obstacle, the invariant feature that generalizes across tasks.
(Agarwal et al., 2020, as above)
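Reconstructing the two recursions referenced above in their standard forms from the cited papers (the exact notation on the slide may differ slightly):

```latex
% Policy-dependent bisimulation metric: local reward difference + discounted Wasserstein term
d^{\pi}(x, y) \;=\; \bigl|\, r^{\pi}(x) - r^{\pi}(y) \,\bigr|
  \;+\; \gamma \, W_1\!\bigl(d^{\pi}\bigr)\bigl(P^{\pi}(\cdot \mid x),\, P^{\pi}(\cdot \mid y)\bigr)

% Policy similarity metric: local optimal-behavior difference + discounted Wasserstein term
d^{\pi^\ast}(x, y) \;=\; \mathrm{DIST}\!\bigl(\pi^\ast(\cdot \mid x),\, \pi^\ast(\cdot \mid y)\bigr)
  \;+\; \gamma \, W_1\!\bigl(d^{\pi^\ast}\bigr)\bigl(P^{\pi^\ast}(\cdot \mid x),\, P^{\pi^\ast}(\cdot \mid y)\bigr)
```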
Is reward enough with LLMs?
(Reward Design with Language Models, 2023)
- Conclusion: LLMs are efficient in-context learners. They are able to provide reward signals that are consistent with a user's objectives from examples; even a single example with an explanation will suffice.
- An LLM is able to identify well-known objectives and provide objective-aligned reward signals in a zero-shot setting.
- It can train objective-aligned agents when ground-truth rewards are not available in complex, longer-horizon tasks.
- A toy sketch of an LLM-as-reward query is given below.
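The idea of querying an LLM as a proxy reward can be sketched as follows; the prompt format, the llm() call, and the yes/no parsing are illustrative assumptions rather than the paper's implementation.

```python
def llm_reward(llm, objective, examples, episode_summary):
    """Query an LLM as a proxy reward model: 1.0 if the episode satisfies the objective, else 0.0.

    llm(prompt) -> str is a hypothetical completion interface; examples is a (possibly single-item)
    list of (summary, label, explanation) in-context demonstrations.
    """
    prompt = f"Objective: {objective}\n\n"
    for summary, label, explanation in examples:
        prompt += f"Episode: {summary}\nDoes it satisfy the objective? {label}\nWhy: {explanation}\n\n"
    prompt += f"Episode: {episode_summary}\nDoes it satisfy the objective? Answer Yes or No:"
    answer = llm(prompt).strip().lower()
    return 1.0 if answer.startswith("yes") else 0.0   # binary reward fed back to the RL agent
```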
ADPP roadmap: Perception, Planning, Decision, Action; Building the World Model
More important topics:
- RLHF for language-based decision making.
- Limitations of LLMs trained without actions.
- Adopting online/offline RL with LLMs.
- Designing long-term/short-term memory.
- Long-horizon reasoning in RL.
- Hierarchical RL with high-level skill discovery.
Can AGI agents be trained with a world model? (Upcoming talk, Tuesday, December 5, 2023: When Representation Learning Meets Causal Inference.)

Social life incentivizes the evolution of intelligence
- "Because corvids and apes share these cognitive tools, we argue that complex cognitive abilities evolved multiple times in distantly related species with vastly different brain structures in order to solve similar socioecological problems." (Science, vol. 306, issue 5703, pp. 1903-1907)
- An open-ended world could produce: theory of mind, negotiation, social skills, empathy, real language understanding.
- Human-level concept learning vs. grandmaster level in StarCraft II: which one is more difficult?
- How to use self-play in a world model?
(Ilya Sutskever, 2019)

Thank You! Q&A