Data-Driven Decision Optimization for Real-World Scenarios
Xianyuan Zhan (詹仙园), Assistant Researcher / Assistant Professor, Institute for AI Industry Research (AIR), Tsinghua University

Contents
01 Real-World Challenges for Data-Driven Decision-Making
02 Offline Reinforcement Learning (RL)
03 Hybrid Offline-and-Online RL

01 Real-World Challenges for Data-Driven Decision-Making

Decision-Making Applications in the Real World
- Gaming AI
- Robotics
- Autonomous driving
- Industrial & energy systems
- Logistics
- Scheduling

Real-World Challenges for Sequential Decision-Making Methods
Conventional decision-making tasks assume a state-action-reward interaction loop with the environment; in real-world tasks this interaction is risky, and only historical data (offline datasets) are available:
- It is not possible to interact with the real environment during training
- A perfect simulation environment may not exist
- Sim-to-real transfer issues are severe
- Only offline logged data are available
Most conventional methods fail! Is there a data-driven solution?
Overview of Data-Driven Sequential Decision Making
[Figure: a map of AI approaches over the level of system interaction (no interaction / low / medium / high) versus the amount of offline data (none / small / medium / large). Approaches range from online RL and sample-efficient online RL, through few-shot and sample-efficient IL/RL/planning, to offline RL/IL and sample-efficient offline IL/RL/planning. Application scenarios range from gaming AI (limited real-world applications) to robotics, autonomous driving, and mission-critical system optimization (lots of real-world applications). Technological maturity ranges from relatively mature areas with lots of research to areas with little research: hard problems, many unknowns, and many application scenarios.]

02 Offline Reinforcement Learning (RL)
Introduction of Reinforcement Learning
- Applications such as gaming AI and robotics
- An agent interacts with the environment in a loop of action, state, and reward

Challenges of DRL in Real-World Applications
Conventional RL tasks interact with a real system or simulator; when using RL in real-world tasks, we only have historical state, action, and reward data (offline logged data):
- It is not possible to interact with the real environment during training
- A perfect simulation environment may not exist
- Sim-to-real transfer issues are severe
- Only offline logged data are available
Most existing DRL algorithms fail!

Offline Reinforcement Learning (Offline RL)
Preliminaries and Notations
- Offline logged data: a fixed dataset of historical transitions, with no further interaction with the environment during training

Bootstrapping Error Accumulation and Distribution Shift
(Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS 2019.)
Counterfactual queries lead to distributional shift:
- Function approximators (the policy, the Q-function, or both) are trained under one distribution but evaluated on a different one, because the states visited by the new policy change
- This is worsened by the act of maximizing the expected return, as illustrated by the sketch below
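To make the failure mode concrete, the snippet below is a hypothetical minimal example (not from the talk) of naive offline Q-learning. The Bellman target takes a max over all actions, including actions the behavior policy never executed, so estimation errors on those out-of-distribution actions are copied into the targets and accumulate through bootstrapping. Network sizes and batch layout are illustrative assumptions.

```python
# Hypothetical minimal sketch of naive offline Q-learning on a fixed dataset.
# The max over a' in the Bellman target queries actions that may never appear
# in the dataset, so Q errors on those OOD actions leak into the targets.
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_q = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_q.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)

def naive_offline_q_update(batch):
    s, a, r, s_next, done = batch          # tensors sampled from the fixed offline dataset
    with torch.no_grad():
        # Max over ALL actions, not just those supported by the data:
        # this is the counterfactual query that causes distribution shift.
        target = r + gamma * (1 - done) * target_q(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = ((q_sa - target) ** 2).mean()   # target-network sync omitted in this sketch
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```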
How to Make Offline RL Work
Some ideas:
- Adding policy constraints: enforce behavior regularization (used by many model-free offline RL methods)
- Value function regularization: penalize the value/Q-function, either by modifying the Q-function training objective or by penalizing with uncertainty estimates
- Model-based methods: solve a pessimistic MDP and penalize rewards on OOD data (Kidambi, Rajeswaran, Netrapalli, Joachims. MOReL: Model-Based Offline Reinforcement Learning. NeurIPS 2020.)
- Strict in-sample learning: keep rewards/values unaltered
Common principle: conservatism and pessimism, either by making rewards/values pessimistic or by keeping them unaltered and learning strictly in-sample. A simplified sketch of the value-regularization idea follows below.
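As one concrete instance of the "penalize the value/Q-function" idea, a CQL-style regularizer pushes Q down on actions proposed by the learned policy and pulls it up on actions in the dataset. The sketch below is a simplified illustration of that principle only (the original CQL objective uses a log-sum-exp term); `policy(s)` is assumed to return a torch distribution and `q_net(s, a)` a per-sample Q-value, both hypothetical interfaces.

```python
import torch

def conservative_q_loss(q_net, policy, s, a_data, bellman_loss, alpha=1.0, n_samples=10):
    """Simplified CQL-style penalty added on top of the usual Bellman loss:
    push Q down on (possibly OOD) policy actions, pull Q up on dataset actions."""
    # Q-values of actions sampled from the current policy.
    sampled = [policy(s).sample() for _ in range(n_samples)]
    q_pi = torch.stack([q_net(s, a) for a in sampled]).mean(dim=0)
    # Q-values of actions actually present in the dataset.
    q_data = q_net(s, a_data)
    penalty = (q_pi - q_data).mean()
    return bellman_loss + alpha * penalty
```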
Over-Conservatism in Existing Methods
Policy constraints, value regularization, and in-sample learning all introduce conservatism. Over-conservatism degrades performance and generalization in unknown areas, especially when the dataset only partially covers the state-action space.
[Figure: even with full data coverage, the optimal policy distribution π(a|s) may locate in regions where the data distribution has low density, so the learned policy still needs to explore in OOD regions.]

How Well Do Q-Functions Perform in OOD Areas?
Deep Q-functions interpolate well within the convex hull of the data, but struggle to extrapolate beyond it.
Theoretical Explanation
The geometry of the dataset (the distance to data samples) matters.

How to Measure the Distance to a Dataset?
Learn a state-conditioned distance function that takes a state-action pair as input and outputs the distance from the candidate action to the training data at that state.

What Can the Distance Function Do?
It enables a distance-function-based convex-hull constraint on the learned policy.

DOGE (Distance-sensitive Offline RL with better GEneralization): a minimalist modification. A rough sketch of the distance-function idea follows below.
(Li, J., Zhan, X., et al. Distance-Sensitive Offline Reinforcement Learning. arXiv 2022.)
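A minimal sketch of the idea, with hypothetical names and a simplified training rule (the paper's actual construction and objective differ in details): a state-conditioned network d(s, a) is regressed onto the distance from random actions to the dataset action at the same state, and d(s, π(s)) is then used as a penalty that keeps the actor close to the geometric support of the data rather than to the behavior distribution.

```python
import torch
import torch.nn as nn

class DistanceNet(nn.Module):
    """State-conditioned distance function d(s, a) (hypothetical architecture)."""
    def __init__(self, s_dim, a_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus())  # distances are non-negative

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def distance_net_loss(d_net, s, a_data):
    """Regress d(s, a~noise) onto the Euclidean distance to the logged action at s."""
    a_rand = torch.rand_like(a_data) * 2 - 1          # random actions in [-1, 1]
    target = (a_rand - a_data).norm(dim=-1)           # distance to the dataset action
    pred = d_net(s, a_rand)
    # Also anchor d(s, a_data) near zero.
    return ((pred - target) ** 2).mean() + (d_net(s, a_data) ** 2).mean()

def actor_loss_with_distance_penalty(q_net, d_net, policy, s, lam=1.0):
    """Actor objective: maximize Q while penalizing distance to the data geometry.
    Assumes a deterministic actor policy(s) -> action."""
    a_pi = policy(s)
    return (-q_net(s, a_pi) + lam * d_net(s, a_pi)).mean()
```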
Theoretical Analysis of DOGE
DOGE achieves a tighter performance bound than data-support constraint methods, characterized through its concentrability coefficient and suboptimality constant.

Experiments
DOGE outperforms SOTA methods on the D4RL benchmarks.

Generalization Ability of DOGE
Compared with policy-constraint, value-regularization, and in-sample-learning baselines on the AntMaze Large datasets, DOGE enjoys better generalization performance.

Insight from DOGE
The generalization of DNNs and the geometry of datasets are largely overlooked by existing offline RL methods. It is necessary to take them into consideration when designing new, effective offline RL algorithms.
RL-Based vs. Imitation-Based Methods
- RL-based methods (policy constraints, value regularization, uncertainty penalties) enjoy out-of-distribution generalization but suffer from distribution shift.
- Imitation-based methods trained on the data avoid distribution shift but are too conservative to surpass the dataset.
How can we avoid distribution shift and still enjoy the benefit of out-of-distribution generalization?
Goal-conditioned supervised learning: DT, TT, RvS.
- Eysenbach, B., et al. Imitating Past Successes can be Very Suboptimal. arXiv:2206.03378, 2022.
- Brandfonbrener, D., et al. When does return-conditioned supervised learning work for offline reinforcement learning? arXiv:2206.01079, 2022.
- Paster, K., McIlraith, S., Ba, J. You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments. arXiv:2205.15967, 2022.

A Motivating Example
[Figure: green arrows mark transitions in the dataset; the grid color shows the state-value V(s).]
Action-stitching vs. state-stitching:
- Action-stitching: choose the action in the data that leads to the next state in the data with the highest V(s'). This has more of an imitation flavor and is very conservative.
- State-stitching: choose any action that leads to the next state in the data with the highest V(s'). This allows OOD actions, but needs some guidance. A toy contrast of the two rules is sketched below.
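The toy snippet below (a hypothetical tabular illustration, not from the talk) contrasts the two selection rules: action-stitching only reuses (s, a, s') tuples present in the dataset, while state-stitching picks the best-valued in-data next state among candidates and then asks a separate model for any action, possibly OOD, that reaches it. The `inverse_model` callable and the candidate set are illustrative assumptions.

```python
# Toy illustration of action-stitching vs. state-stitching on tabular data.
# dataset: list of (s, a, s_next) tuples; V: dict mapping state -> learned state-value.

def action_stitching(s, dataset, V):
    """Pick the logged action at s whose logged next state has the highest value."""
    candidates = [(a, s_next) for (s0, a, s_next) in dataset if s0 == s]
    if not candidates:
        return None
    a, _ = max(candidates, key=lambda item: V[item[1]])
    return a                              # always an in-dataset action: conservative

def state_stitching(s, dataset, V, inverse_model, candidate_next_states=None):
    """Pick the highest-value in-data next state (here simply all logged next states,
    or a caller-supplied reachable set), then ask an inverse model for *any* action,
    possibly OOD, that reaches it from s. This is where guidance is needed."""
    next_states = candidate_next_states or {s_next for (_, _, s_next) in dataset}
    s_target = max(next_states, key=lambda sn: V[sn])
    return inverse_model(s, s_target)
```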
Policy-Guided Offline Reinforcement Learning
Standard RL approach:
- Policy evaluation step (learn Q(s, a)): Q ← argmin_Q E_{(s,a,r,s')~D} [(r + γ Q(s', a') − Q(s, a))^2], with a' drawn from the current policy
- Policy maximization step (learn π(a|s)): π ← argmax_π E_{s~D, a~π} [Q(s, a)]
Dynamic programming is powerful (good), but the coupled Q and π updates and the tricky OOD regularizers bring instability and conservatism.
How can we preserve dynamic programming while learning the value function and the policy in a decoupled way?

A different view:
1. Learn V(s) instead of Q(s, a): learn a value function better than what is exhibited in the data.
2. Learn a guide-policy g(s'|s) to determine where to go: it outputs the optimal next state given the current state.
3. Learn a task-irrelevant execute-policy π_e(a|s, s') that determines which action can produce the given next state.
Policy-Guided Offline Reinforcement Learning (training)
1. Use expectile regression to obtain an upper-confidence estimate of the state-value V(s): fit V with an asymmetric L2 loss on the TD residual r + γ V(s') − V(s), weighting positive residuals by τ and negative ones by 1 − τ (with τ > 0.5). The expectile loss is sketched below.
2. Train the guide-policy g(s'|s) with respect to V(s), so that it proposes high-value next states supported by the data.
3. Train the execute-policy π_e(a|s, s') on the dataset transitions, so that it recovers the action that produces the given next state.
Given a state s, the final action is determined by both policies: first obtain the target next state s' from the guide-policy, then sample the action a ~ π_e(·|s, s') from the execute-policy.
(The expectile value learning is credited to Kostrikov et al., ICLR 2022. Xu, H., et al. A Policy-Guided Imitation Approach for Offline Reinforcement Learning. NeurIPS 2022.)
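The expectile loss of step 1 can be written compactly. The sketch below follows the standard expectile-regression formulation credited above, with hypothetical tensor names and a simplified target (no separate target network): for τ > 0.5, positive TD residuals are weighted more heavily, yielding an optimistic (upper-expectile) value estimate from data alone.

```python
import torch

def expectile_loss(diff, tau=0.7):
    """Asymmetric L2 loss: |tau - 1(diff < 0)| * diff^2."""
    weight = torch.where(diff < 0, 1.0 - tau, tau)
    return (weight * diff.pow(2)).mean()

def value_update(v_net, v_opt, batch, gamma=0.99, tau=0.7):
    s, r, s_next, done = batch                  # sampled from the offline dataset
    with torch.no_grad():                       # bootstrap target detached (target net omitted)
        target = r + gamma * (1 - done) * v_net(s_next).squeeze(-1)
    diff = target - v_net(s).squeeze(-1)        # TD residual r + gamma*V(s') - V(s)
    loss = expectile_loss(diff, tau)
    v_opt.zero_grad(); loss.backward(); v_opt.step()
    return loss.item()
```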
24、e guide-policy and the execute-policy,by:Credit to Kostrikov et al.ICLR 2022Xu et al.A Policy-Guided Imitation Approach for Offline Reinforcement Learning.NeurIPS 2022.(6)|29ExperimentsBenchmarks resultsMulti-task adaptationFor multi-task adaptation,POR only needs to re-train the guide-policy and us
Offline RL for Real-World Applications
Combustion Optimization for Thermal Power Generating Units (TPGUs) Using Offline RL
- Offline learning: an offline RL algorithm is trained together with a thermal combustion process simulator built from historical operational (logged) data.
- Online operation: the optimized actions are fed into the control loop of the real boiler system (coal feeder, coal mill, turbine, induced and forced draft fans, water pump, valves; coal, wind, and water as inputs; steam, smoke, electricity, and pollutants as outputs), and new operational data are logged.
Training goals: use less coal, generate more electricity, and emit less pollution.
(Zhan, X., et al. DeepThermal: Combustion Optimization for Thermal Power Generating Units Using Offline Reinforcement Learning. AAAI 2022; Spotlight at the RL4RealLife Workshop, ICML 2021. The system was developed at JD Technology and has been deployed in real power plants.)
Offline RL for Real-World Applications
[Figure: the real boiler system with coal feeder, coal mill, turbine, induced and forced draft fans, water pump, and valves; coal, wind, and water as inputs; steam, smoke, electricity, and pollutants as outputs; control variables act through the control loop.]
- System complexity: coal mill, boiler, steamer, and heater components; 10,000 sensors
- Complex dynamics: coal pulverizing, burning, and steam circulation; complex physical and chemical processes
- High-dimensional control: 100+ major control elements; continuous control
- Modeling restrictions: not possible to interact with the system during training; no high-fidelity simulator; only offline logged data
- Domain expertise: requires a large amount of domain knowledge; many safety constraints
- Long-term optimization: needs to optimize long-term combustion performance
- Multi-objective optimization: improve combustion efficiency and reduce NOx emissions
MORE: An Improved Model-Based Offline RL with Restrictive Exploration
- MORE tackles the challenge of offline policy learning under constraints with an imperfect simulator.
- Safe policy optimization: uses two types of Q-functions, one for reward maximization and one for cost evaluation.
- Policy optimization is performed on carefully combined real and simulated data.

Restrictive exploration and hybrid training:
- Intuition: only consider samples about which the data-driven simulator is certain, and then further distinguish whether those samples are in distribution or not.
- Filter out samples if the model is uncertain or lacks prediction robustness (model-sensitivity-based filtering), and use a data-density-based filter to separate in-distribution from out-of-distribution samples.
- Offline training uses a special local buffer that combines real samples with positive and negative simulated samples (hybrid training). A sketch of this filtering logic follows below.
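A minimal sketch of the filtering logic described above, with hypothetical thresholds and helper callables (the actual MORE criteria are more involved): simulated transitions are kept only when an ensemble of dynamics models agrees (low uncertainty), and the kept samples are then split into in-distribution and out-of-distribution groups using a density estimate fit on the real data.

```python
import numpy as np

def filter_simulated_samples(sim_batch, ensemble_predict, log_density,
                             unc_thresh=0.1, density_thresh=-5.0):
    """Restrictive exploration: keep only simulated transitions the learned
    simulator is confident about, then split them by data density."""
    kept_in_dist, kept_out_dist = [], []
    for (s, a, r, s_next) in sim_batch:
        preds = ensemble_predict(s, a)                # (n_models, state_dim) next-state predictions
        uncertainty = np.max(np.std(preds, axis=0))   # disagreement across the ensemble
        if uncertainty > unc_thresh:
            continue                                  # drop: the simulator is not trustworthy here
        if log_density(s, a) > density_thresh:        # density model fit on the real data
            kept_in_dist.append((s, a, r, s_next))    # treated as "positive" simulated samples
        else:
            kept_out_dist.append((s, a, r, s_next))   # reliable but OOD: handled separately
    return kept_in_dist, kept_out_dist
```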
Real-World Experiments
Real-world experiments were conducted at the CHN Energy Langfang Power Station. The optimized control strategy achieved maximum increases of 0.56%, 0.65%, and 0.51% in combustion efficiency.

03 Hybrid Offline-and-Online RL
Limitations in Both Online and Offline Approaches
- Dynamics gap: high-fidelity simulators are hard to construct.
- Limited coverage: offline data with sufficient state-action space coverage for offline RL training would be impractically large.
Is it possible to combine learning from limited real data (as in offline RL) with unrestricted exploration through imperfect simulators (as in online RL), to address the drawbacks of both approaches?

H2O: Dynamics-Aware Hybrid Offline-and-Online RL
Dynamics-aware policy evaluation:
- Minimize the dynamics-gap-weighted soft-maximum of Q-values: push down Q-values on high-dynamics-gap simulated samples.
- Maximize Q-values on data: pull up Q-values on real offline data samples.
- Learn on both offline data and online simulated samples.
- Fix the Bellman error caused by the dynamics gap: use the dynamics ratio as an importance-sampling weight, which can be interpreted as adding an adaptive adjustment to the rewards.
- A theoretical analysis is provided. A simplified sketch of this objective follows below.
(Niu, H., et al. When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning. NeurIPS 2022.)
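The sketch below is a loose, hypothetical rendering of the two ingredients described above, not the paper's exact objective: a CQL-like term pushes Q down on simulated samples in proportion to an estimated dynamics gap and pulls Q up on real data, while the Bellman error on simulated transitions is reweighted by the dynamics ratio used as an importance weight. The callables `dyn_gap_weight` and `dyn_ratio` are assumed to be supplied by a separately learned dynamics-discrepancy model.

```python
import torch

def h2o_style_critic_loss(q_net, target_q, policy, real_batch, sim_batch,
                          dyn_gap_weight, dyn_ratio, alpha=1.0, gamma=0.99):
    """Dynamics-aware policy evaluation (simplified sketch).

    dyn_gap_weight(s, a): larger where the simulator deviates more from the real
        dynamics; used to push Q down on such simulated samples.
    dyn_ratio(s, a, s_next): importance weight p_real(s'|s,a) / p_sim(s'|s,a)
        used to correct the Bellman error computed on simulated transitions.
    """
    rs, ra, rr, rs2 = real_batch    # real offline data
    ss, sa, sr, ss2 = sim_batch     # online simulated data

    def bellman_sq_error(s, a, r, s2):
        with torch.no_grad():
            a2 = policy(s2)
            target = r + gamma * target_q(s2, a2)
        return (q_net(s, a) - target) ** 2

    # Standard Bellman error on real data; importance-weighted on simulated data.
    td_loss = bellman_sq_error(rs, ra, rr, rs2).mean() \
        + (dyn_ratio(ss, sa, ss2) * bellman_sq_error(ss, sa, sr, ss2)).mean()

    # Push Q down on high-dynamics-gap simulated samples, pull it up on real data.
    gap_penalty = (dyn_gap_weight(ss, sa) * q_net(ss, sa)).mean() - q_net(rs, ra).mean()

    return td_loss + alpha * gap_penalty
```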
H2O: Dynamics-Aware Hybrid Offline-and-Online RL
Real-world validation on a wheel-footed robot (compared methods: SAC, DARC, CQL, and H2O (ours)):
- Standing still: SAC and DARC are not able to keep the robot balanced and quickly fail after initialization; the robot with the H2O policy remains steady past 11 s, while CQL bumps into the ground and goes out of control at 12 s.
- Moving straight: DARC and SAC fail at the beginning. H2O achieves good control performance, keeping the robot balanced and closely following the target velocity (v = 0.2 m/s), whereas CQL exceeds v by a fairly large margin, nearly doubling the desired target velocity. The recorded angle also shows that the robot controlled by H2O runs more smoothly than with CQL.
Promising Research Directions for Data-Driven Methods
Application directions for decision optimization, by problem category:
- Data available: complex industrial system optimization; conventional methods are mechanistic modeling and classical control theory.
- Simulator/model available (plus data): robot control and autonomous driving decision-making; the conventional method is simulation-based online RL.
- Optimization problem can be formulated analytically: logistics scheduling and production scheduling; the conventional method is operations research (OR).
Frontier research directions:
- AI for math/OR: using offline imitation/reinforcement learning to accelerate large-scale optimization problem solving
- Offline imitation learning with preferences: fusing expert and non-expert data
- Hybrid reinforcement learning: fusing simulated and real data
- Offline imitation/reinforcement learning with strong generalization under limited samples
- Offline reinforcement learning under safety constraints
References
Offline RL
- Zhan, X., et al. DeepThermal: Combustion Optimization for Thermal Power Generating Units Using Offline Reinforcement Learning. AAAI 2022.
- Xu, H., Zhan, X., and Zhu, X. Constraints Penalized Q-Learning for Safe Offline Reinforcement Learning. AAAI 2022.
- Xu, H., Jiang, L., Li, J., Zhan, X. A Policy-Guided Imitation Approach for Offline Reinforcement Learning. NeurIPS 2022.
- Li, J., Zhan, X., et al. Distance-Sensitive Offline Reinforcement Learning. arXiv preprint.
- Xu, H., Zhan, X., Li, J., Yin, H. Offline Reinforcement Learning with Soft Behavior Regularization. arXiv preprint.
Offline IL
- Xu, H., Zhan, X., Yin, H., and Qin, H. Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations. ICML 2022.
- Zhang, W., et al. Discriminator-Guided Model-Based Offline Imitation Learning. CoRL 2022.
Offline Planning
- Zhan, X., Zhu, X., and Xu, H. Model-Based Offline Planning with Trajectory Pruning. IJCAI 2022.
Offline-and-Online Hybrid RL
- Niu, H., et al. When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning. NeurIPS 2022.

Thank you very much for watching!