Jun Zhu, IEEE Fellow, Deputy Director of the Institute for Artificial Intelligence at Tsinghua University, Founder and Chief Scientist of Shengshu Technology.

Bosch AI Professor in the Department of Computer Science and Technology at Tsinghua University; IEEE Fellow; Deputy Director of the Institute for Artificial Intelligence at Tsinghua University; Director of the AI Lab of the CS Department; Founder and Chief Scientist of Shengshu Technology. His research focuses on the foundational theory and efficient algorithms of machine learning. He has received the CAST Qiushi Outstanding Young Scholar Award, the Xplorer Prize, the CCF Natural Science First Prize, the Wu Wenjun AI Natural Science First Prize, and an ICLR Outstanding Paper Award, and has been selected for national-level high-level talent programs, as a CCF Young Scientist, and for the MIT TR35 China list.

Talk topic: Some Progress on Diffusion Policy Learning

Diffusion Policies: Reinforcement Learning with Diffusion Models
Jun Zhu
Tsinghua-Bosch Joint Center for ML
Department of Computer Science and Technology, Tsinghua University
Offline RL: Data-Driven, Open-Loop RL

Open-loop RL leads to policy conservatism
- Online RL: estimation errors can be corrected through the feedback loop.
- Offline RL: the policy is trained open-loop from a fixed behavior (dataset) distribution, so value estimates are highly inaccurate for unseen (s, a) pairs and there is no feedback to correct them.
- Offline RL therefore requires a constrained policy optimization paradigm.

Behavior modeling for offline RL
- The constrained policy optimization problem has one optimal analytic solution (a hedged reconstruction of the slide's equations appears below).
- Solving offline RL thus requires understanding the behavior distribution, i.e., generative modeling.
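The equations on this slide did not survive extraction; the following is a hedged reconstruction using the standard KL-constrained offline RL formulation (the symbols Q, μ, β, ε are assumed here, not copied from the slide):

```latex
% Constrained policy optimization (standard form; notation assumed)
\max_{\pi}\; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[ Q(s, a) \big]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \mathcal{D}}\big[ D_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\|\, \mu(\cdot \mid s) \big) \big] \le \epsilon .

% Its Lagrangian has the well-known closed-form optimum
\pi^{*}(a \mid s) \;\propto\; \mu(a \mid s)\, \exp\big( \beta\, Q(s, a) \big),

% where \mu is the behavior (dataset) policy and \beta is the temperature
% set by the constraint. Sampling from or evaluating \pi^* requires a
% faithful generative model of \mu, which is why behavior modeling matters.
```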
Diffusion Models for High-Dimensional Data Generation: Image and 3D
- Blessing of scale: self-supervisedly learn large models from a huge amount of unlabeled (multi-modal) data.
- ProlificDreamer, NeurIPS 2023; CRM, ECCV 2024; DeepMesh, arXiv 2025; UniDiffuser, ICML 2023.

Diffusion Models for High-Dimensional Data Generation: Video
- Vidu: the first high-performance video generator after Sora, released on April 27, 2024.
- Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models, Bao et al., arXiv 2024; Vidu4D, NeurIPS 2024, PAMI 2025.
What Is a Diffusion Policy?
- Diffusion for AIGC: noise → image. Diffusion policy: noise → action.
- Leverage diffusion models to solve continuous control problems.

Advantages of Diffusion Policies
1. Powerful modeling ability for complex human behaviors.
2. Generation of diversified solutions for a given task.
3. Easy integration with visual and language foundation models.
Model expressiveness (multi-modality): if the demonstrated action is 1 half of the time and 2 the other half, MSE regression averages the two modes and outputs (1+2)/2 = 1.5, an action that never appears in the data, whereas a diffusion policy can represent both modes (see the sketch below).
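A minimal sketch of the noise → action idea, assuming a DDPM-style denoiser. The class and function names (ActionDenoiser, sample_action) and all hyperparameters are illustrative, not taken from any of the papers above.

```python
# Minimal sketch of diffusion-policy action sampling (noise -> action).
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

N_STEPS = 50                               # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, N_STEPS)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t

class ActionDenoiser(nn.Module):
    """Predicts the noise added to an action, conditioned on state and step t."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )
    def forward(self, state, noisy_action, t):
        t_emb = t.float().unsqueeze(-1) / N_STEPS
        return self.net(torch.cat([state, noisy_action, t_emb], dim=-1))

@torch.no_grad()
def sample_action(model: ActionDenoiser, state: torch.Tensor, action_dim: int):
    """DDPM-style reverse process: start from Gaussian noise, iteratively denoise."""
    a = torch.randn(state.shape[0], action_dim)      # a_T ~ N(0, I)
    for t in reversed(range(N_STEPS)):
        eps = model(state, a, torch.full((state.shape[0],), t))
        # posterior mean of a_{t-1} given a_t and the predicted noise
        a = (a - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            a = a + torch.sqrt(betas[t]) * torch.randn_like(a)  # noise except at t=0
    return a

# Usage: an untrained denoiser already runs; a trained one can keep both of
# the 50%/50% modes instead of regressing to their mean.
model = ActionDenoiser(state_dim=4, action_dim=2)
print(sample_action(model, torch.zeros(1, 4), action_dim=2))
```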
Some Progress on Diffusion Policies (THU TSAIL Group)
- Large-scale systems: RDT-1B, a Diffusion Foundation Model for Bimanual Manipulation (ICLR 2025); Vidu, a Highly Consistent, Dynamic and Skilled Text-to-Video Generator (arXiv 2024); Tianshou, a Modularized Deep Reinforcement Learning Library (JMLR 2022).
- Basic theory and algorithms: SRPO, a one-step diffusion policy that boosts action sampling speed by 50x (ICLR 2024); EDA, alignment techniques for foundational diffusion behavior models (NeurIPS 2024); SfBC, one of the earliest studies of diffusion policies (ICLR 2023); CEP, diffusion guidance methods for offline policy optimization (ICML 2023).
- Diffusion foundations: U-ViT, the first scalable diffusion-Transformer architecture (CVPR 2023); Analytic-DPM, the optimal variance estimate (ICLR 2022 Outstanding Paper; ICML 2022); DPM-Solver, the fastest inference algorithm (NeurIPS 2022, Oral).

Main Problems for Diffusion Policies
- Imitation learning from human behaviors
  - Problem 1: lack of generalizability due to data scarcity → RDT-1B
  - Problem 2: low control frequency due to iterative sampling → SRPO
- Policy optimization based on reward/preference feedback
  - Problem 3: inability to effectively calculate model likelihood → EDA/CEP/SfBC
  - Problem 4: lack of alignment methods for foundational diffusion policies → EDA
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
https://rdt-robotics.github.io/
Songming Liu*, Lingxuan Wu*, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, Jun Zhu, ICLR 2025

Previous foundational VLA models
1. Transformer + discretized actions following the LLM recipe → information loss.
2. Transformer + MSE regression loss head → no diversity (regression again collapses the 50%/50% modes at 1 and 2 into 1.5).
Continuous control should leverage Diffusion as Foundation.

RDT Overview

RDT Framework
- Heterogeneous action spaces of various robots are embedded into a unified action space for multi-robot training (see the sketch below).
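To make the unified-action-space idea concrete, here is a minimal sketch; the slot layout, the 128-dimensional size, and the mask convention are illustrative assumptions rather than the actual RDT specification.

```python
# Hedged sketch: embed heterogeneous robot action spaces into one fixed-size
# "unified" action vector for multi-robot training. Slot layout, size, and
# mask convention are illustrative assumptions, not the RDT specification.
import numpy as np

UNIFIED_DIM = 128
# Each robot declares which unified slots its native action dims occupy.
SLOT_MAP = {
    "single_arm_6dof": list(range(0, 6)) + [50],        # 6 joints + 1 gripper
    "dual_arm_aloha":  list(range(0, 14)) + [50, 51],   # 2x7 joints + 2 grippers
}

def to_unified(robot: str, action: np.ndarray):
    """Scatter a robot-specific action into the shared vector; return a mask
    marking which slots are actually meaningful for this robot."""
    slots = SLOT_MAP[robot]
    assert action.shape[-1] == len(slots)
    unified = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    unified[slots] = action
    mask[slots] = True
    return unified, mask

def from_unified(robot: str, unified: np.ndarray):
    """Gather the robot-specific action back out of the shared vector."""
    return unified[SLOT_MAP[robot]]

a = np.random.uniform(-1, 1, size=7).astype(np.float32)
u, m = to_unified("single_arm_6dof", a)
assert np.allclose(from_unified("single_arm_6dof", u), a)
```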
Pretraining on Multi-Robot Data
- A unified action space shared among various robots.

Fine-tuning on Self-Collected Dual-Arm Data
- The largest fine-tuning dataset, created using the Mobile ALOHA robot.
Largest and Best-Performing Foundational Model for Dual Arms
- The pre-trained RDT model (RDT (ours)) is compared against strong baselines, including RDT without pre-training (RDT (scratch)), OpenVLA, ACT, and Octo.
- Project page: https://rdt-robotics.github.io/rdt-robotics/
- Capabilities demonstrated: dexterous manipulation, unseen objects, instruction following, and few-shot learning.

Main Problems for Diffusion Policies (recap)
- Next: Problem 2, low control frequency due to iterative sampling → SRPO.
Boosting Diffusion Action Sampling to 1 Step: Score Regularized Policy Optimization
https:/
Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, ICLR 2024

Attacking computational efficiency for diffusion RL
- Problem: diffusion policies are too slow for real-time control. Robotics typically needs about 50 Hz (a nuclear reactor controller about 10 kHz), while iterative diffusion sampling runs at roughly 2-5 Hz.
- Diffusion models are expressive but sampling-expensive.
- Question: can we distill a deterministic inference policy from a diffusion prior without sampling from it?
- Goal: completely avoid diffusion sampling during both training and evaluation.
Policy extraction from a pretrained diffusion behavior model
- How can we estimate the behavior-regularization term (e.g., the divergence between the learned policy and the behavior distribution)?
- Traditional approach, sampling from generative models: 1. learn a generative model of the behavior data; 2. for a state s, sample multiple fake actions from it; 3. estimate the divergence between the policy and the behavior samples (e.g., KL divergence or Maximum Mean Discrepancy, MMD).
- Key insight: the term can instead be estimated effectively by a pretrained diffusion behavior model, which enables reverse-KL (deterministic) policy extraction.
- Recall that diffusion models are score-based models: the pretrained behavior model provides an estimate of the gradient of the behavior log-density, which is exactly what the reverse-KL gradient needs (a hedged sketch follows).
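A hedged sketch of the score-regularized extraction idea: a deterministic policy is pushed toward high Q-values while a frozen, pretrained noise-prediction (score) network regularizes its actions toward the behavior distribution, with no diffusion sampling anywhere. The networks, the single noise level, and the surrogate loss below are illustrative simplifications, not the exact SRPO construction.

```python
# Hedged sketch of score-regularized policy extraction (SRPO-style idea).
import torch
import torch.nn as nn

state_dim, action_dim, beta = 4, 2, 0.1

policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                       nn.Linear(64, action_dim), nn.Tanh())   # deterministic pi_theta(s)
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                      nn.Linear(64, 1))                        # critic Q(s, a)
eps_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                        nn.Linear(64, action_dim))             # pretrained noise predictor

optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def behavior_score(state, action, sigma=0.1):
    """Approximate grad_a log mu(a|s) from a noise-prediction model at one
    small noise level: score ~= -eps(a, s) / sigma (a simplifying assumption)."""
    with torch.no_grad():
        eps = eps_net(torch.cat([state, action], dim=-1))
    return -eps / sigma

def srpo_step(state):
    action = policy(state)
    q_value = q_net(torch.cat([state, action], dim=-1))
    score = behavior_score(state, action)          # treated as a constant
    # (score * action).sum() has gradient `score` w.r.t. the action, so this
    # term nudges actions up the behavior log-density without ever sampling
    # from the diffusion model.
    loss = -(q_value.mean() + beta * (score * action).sum(dim=-1).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(srpo_step(torch.randn(8, state_dim)))
```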
2D bandit visualizations: Moons and 8 Gaussians.

D4RL evaluations.

More 2D results.

Main Problems for Diffusion Policies (recap)
- Next: Problems 3 and 4, likelihood calculation and alignment for foundational diffusion policies → EDA.
Efficient Diffusion Alignment (EDA): Aligning Diffusion Behaviors with Q-functions for Efficient Continuous Control
Huayu Chen, Kaiwen Zheng, Hang Su, Jun Zhu, NeurIPS 2024

Learning Paradigm
- Language AGI road (autoregressive, discrete; well studied): LLMs (GPT-3/BERT) as self-supervised foundation models, followed by preference alignment (ChatGPT/Llama 3).
- Previous embodied AGI methods (explored): VLA models (RT-2/OpenVLA) as foundation models, followed by VLA + RL/planning (Q-Transformer).
- Our focus (diffusion, continuous; under-explored): Diffusion Transformers (RDT-1B) as foundation models, followed by diffusion alignment (?).
What is Diffusion Policy Alignment?
- Multi-robot/multi-task pre-training + domain fine-tuning.
- Supervised learning + behavior data → diverse and generalizable policies.
- Reinforcement learning + reward data → optimized and domain-specific policies.

Challenges for Diffusion Policy Alignment
- LLM alignment relies on model likelihood calculation, and most policy-based RL algorithms (PPO, REINFORCE) also require computing the policy likelihood.
- Diffusion models provide gradient fields (scores) instead of likelihoods (a short formal contrast follows).
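To make the likelihood-versus-score contrast concrete, here is a short note in standard notation (not taken from the slides):

```latex
% An autoregressive LLM policy exposes an exact log-likelihood, directly
% usable by PPO-style ratios:
\log \pi_\theta(a \mid s) \;=\; \sum_{k} \log \pi_\theta(a_k \mid s, a_{<k}),
\qquad
r_\theta(s, a) \;=\; \frac{\pi_\theta(a \mid s)}{\pi_{\mathrm{old}}(a \mid s)} .

% A diffusion policy instead exposes only noise-level-dependent scores,
s_\theta(a_t, t \mid s) \;\approx\; \nabla_{a_t} \log p_t(a_t \mid s),

% so log pi_theta(a|s) is not directly computable; EDA's Stage 1 (direct
% density estimation) is designed to close exactly this gap.
```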
Efficient Diffusion Alignment (EDA): a Two-Stage Approach
- Stage 1: a new diffusion modeling technique that replaces the standard diffusion loss with a new loss and enables direct density estimation; in the limit of unlimited model capacity, the learned model yields the behavior likelihood (an illustrative sketch follows below).
- Stage 2: aligning diffusion behaviors with Q-functions.

Result
- EDA allows adapting to downstream tasks using minimal reward-labeled data.
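One generic way to obtain densities from a diffusion-style behavior model is to parameterize the predicted noise as the gradient of a scalar network, so that the scalar doubles as an unnormalized log-density at each noise level. Whether this matches EDA's actual Stage-1 technique cannot be determined from the slide text, so treat the following as purely illustrative.

```python
# Illustrative sketch: parameterize the noise prediction as the gradient of a
# scalar "energy" network f_theta, so f_theta can serve as an unnormalized
# log-density estimate. A generic construction, NOT necessarily EDA Stage 1.
import torch
import torch.nn as nn

class EnergyDenoiser(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def energy(self, state, action, t):
        """Scalar f_theta(s, a, t): an unnormalized log-density surrogate."""
        t_emb = t.float().unsqueeze(-1)
        return self.f(torch.cat([state, action, t_emb], dim=-1)).squeeze(-1)

    def predicted_noise(self, state, action, t):
        """eps := -sigma_t * grad_a f_theta; training with the usual denoising
        loss then shapes f_theta into a density estimate (sigma_t omitted)."""
        action = action.requires_grad_(True)
        e = self.energy(state, action, t).sum()
        grad = torch.autograd.grad(e, action, create_graph=True)[0]
        return -grad

model = EnergyDenoiser(state_dim=4, action_dim=2)
s, a, t = torch.zeros(3, 4), torch.randn(3, 2), torch.zeros(3)
print(model.energy(s, a, t).shape, model.predicted_noise(s, a, t).shape)
```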
Conclusions and Future Work
- Diffusion policies leverage the expressiveness of diffusion models to fit multi-modal behavior data in RL.
- RDT-1B: the largest foundation model for bimanual manipulation.
- SRPO: a one-step diffusion policy obtained by distilling a deterministic policy.
- EDA: an efficient alignment technique for foundational diffusion behavior models.
- Future work: generate data with explicit physical principles ⇒ build a world simulator; generate simulated interactive worlds for agents.

Thanks!