Data-Driven Decision Optimization for Real-World Scenarios
Xianyuan Zhan (詹仙园), Assistant Researcher / Assistant Professor, Institute for AI Industry Research (AIR), Tsinghua University

Contents
01 Real-World Challenges for Data-Driven Decision-Making
02 Offline Reinforcement Learning (RL)
03 Hybrid Offline-and-Online RL

01 Real-World Challenges for Data-Driven Decision-Making

Decision-Making Applications in the Real World
- Gaming AI
- Robotics
- Autonomous driving
- Industrial & energy systems
- Logistics
- Scheduling

Real-World Challenges for Sequential Decision-Making Methods
Conventional decision-making tasks assume a state-action-reward interaction loop with the environment; in real-world tasks this interaction is risky, and only historical data (offline datasets) are available:
- It is not possible to interact with the real environment during training
- A perfect simulation environment may not exist
- Sim-to-real transfer issues are severe
- Only offline logged data are available
Most conventional methods fail! Is there a data-driven solution?
Overview of Data-Driven Sequential Decision Making
[Figure: a map of AI approaches over the level of system interaction (no interaction / low / medium / high) versus the amount of offline data (none / small / medium / large). Approaches range from online RL and sample-efficient online RL, through few-shot and sample-efficient IL/RL/planning, to offline RL/IL and sample-efficient offline IL/RL/planning. Application scenarios range from gaming AI (limited real-world applications) to robotics, autonomous driving, and mission-critical system optimization (lots of real-world applications). Technological maturity ranges from relatively mature areas with lots of research to areas with little research: hard problems, many unknowns, and many application scenarios.]

02 Offline Reinforcement Learning (RL)
Introduction of Reinforcement Learning
- Applications such as gaming AI and robotics
- An agent interacts with the environment in a loop of action, state, and reward

Challenges of DRL in Real-World Applications
Conventional RL tasks interact with a real system or simulator; when using RL in real-world tasks, we only have historical state, action, and reward data (offline logged data):
- It is not possible to interact with the real environment during training
- A perfect simulation environment may not exist
- Sim-to-real transfer issues are severe
- Only offline logged data are available
Most existing DRL algorithms fail!

Offline Reinforcement Learning (Offline RL)
Preliminaries and Notations
- Offline logged data: a fixed dataset of historical transitions, with no further interaction with the environment during training

Bootstrapping Error Accumulation and Distribution Shift
(Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS 2019.)
Counterfactual queries lead to distributional shift:
- Function approximators (the policy, the Q-function, or both) are trained under one distribution but evaluated on a different one, because the states visited by the new policy change
- This is worsened by the act of maximizing the expected return, as illustrated by the sketch below
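To make the failure mode concrete, the snippet below is a hypothetical minimal example (not from the talk) of naive offline Q-learning. The Bellman target takes a max over all actions, including actions the behavior policy never executed, so estimation errors on those out-of-distribution actions are copied into the targets and accumulate through bootstrapping. Network sizes and batch layout are illustrative assumptions.

```python
# Hypothetical minimal sketch of naive offline Q-learning on a fixed dataset.
# The max over a' in the Bellman target queries actions that may never appear
# in the dataset, so Q errors on those OOD actions leak into the targets.
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_q = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_q.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)

def naive_offline_q_update(batch):
    s, a, r, s_next, done = batch          # tensors sampled from the fixed offline dataset
    with torch.no_grad():
        # Max over ALL actions, not just those supported by the data:
        # this is the counterfactual query that causes distribution shift.
        target = r + gamma * (1 - done) * target_q(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = ((q_sa - target) ** 2).mean()   # target-network sync omitted in this sketch
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```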
How to Make Offline RL Work
Some ideas:
- Adding policy constraints: enforce behavior regularization (used by many model-free offline RL methods)
- Value function regularization: penalize the value/Q-function, either by modifying the Q-function training objective or by penalizing with uncertainty estimates
- Model-based methods: solve a pessimistic MDP and penalize rewards on OOD data (Kidambi, Rajeswaran, Netrapalli, Joachims. MOReL: Model-Based Offline Reinforcement Learning. NeurIPS 2020.)
- Strict in-sample learning: keep rewards/values unaltered
Common principle: conservatism and pessimism, either by making rewards/values pessimistic or by keeping them unaltered and learning strictly in-sample. A simplified sketch of the value-regularization idea follows below.
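As one concrete instance of the "penalize the value/Q-function" idea, a CQL-style regularizer pushes Q down on actions proposed by the learned policy and pulls it up on actions in the dataset. The sketch below is a simplified illustration of that principle only (the original CQL objective uses a log-sum-exp term); `policy(s)` is assumed to return a torch distribution and `q_net(s, a)` a per-sample Q-value, both hypothetical interfaces.

```python
import torch

def conservative_q_loss(q_net, policy, s, a_data, bellman_loss, alpha=1.0, n_samples=10):
    """Simplified CQL-style penalty added on top of the usual Bellman loss:
    push Q down on (possibly OOD) policy actions, pull Q up on dataset actions."""
    # Q-values of actions sampled from the current policy.
    sampled = [policy(s).sample() for _ in range(n_samples)]
    q_pi = torch.stack([q_net(s, a) for a in sampled]).mean(dim=0)
    # Q-values of actions actually present in the dataset.
    q_data = q_net(s, a_data)
    penalty = (q_pi - q_data).mean()
    return bellman_loss + alpha * penalty
```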
Over-Conservatism in Existing Methods
Policy constraints, value regularization, and in-sample learning all introduce conservatism. Over-conservatism degrades performance and generalization in unknown areas, especially when the dataset only partially covers the state-action space.
[Figure: even with full data coverage, the optimal policy distribution π(a|s) may locate in regions where the data distribution has low density, so the learned policy still needs to explore in OOD regions.]

How Well Do Q-Functions Perform in OOD Areas?
Deep Q-functions interpolate well within the convex hull of the data, but struggle to extrapolate beyond it.
Theoretical Explanation
The geometry of the dataset (the distance to data samples) matters.

How to Measure the Distance to a Dataset?
Learn a state-conditioned distance function that takes a state-action pair as input and outputs the distance from the candidate action to the training data at that state.

What Can the Distance Function Do?
It enables a distance-function-based convex-hull constraint on the learned policy.

DOGE (Distance-sensitive Offline RL with better GEneralization): a minimalist modification. A rough sketch of the distance-function idea follows below.
(Li, J., Zhan, X., et al. Distance-Sensitive Offline Reinforcement Learning. arXiv 2022.)
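A minimal sketch of the idea, with hypothetical names and a simplified training rule (the paper's actual construction and objective differ in details): a state-conditioned network d(s, a) is regressed onto the distance from random actions to the dataset action at the same state, and d(s, π(s)) is then used as a penalty that keeps the actor close to the geometric support of the data rather than to the behavior distribution.

```python
import torch
import torch.nn as nn

class DistanceNet(nn.Module):
    """State-conditioned distance function d(s, a) (hypothetical architecture)."""
    def __init__(self, s_dim, a_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus())  # distances are non-negative

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def distance_net_loss(d_net, s, a_data):
    """Regress d(s, a~noise) onto the Euclidean distance to the logged action at s."""
    a_rand = torch.rand_like(a_data) * 2 - 1          # random actions in [-1, 1]
    target = (a_rand - a_data).norm(dim=-1)           # distance to the dataset action
    pred = d_net(s, a_rand)
    # Also anchor d(s, a_data) near zero.
    return ((pred - target) ** 2).mean() + (d_net(s, a_data) ** 2).mean()

def actor_loss_with_distance_penalty(q_net, d_net, policy, s, lam=1.0):
    """Actor objective: maximize Q while penalizing distance to the data geometry.
    Assumes a deterministic actor policy(s) -> action."""
    a_pi = policy(s)
    return (-q_net(s, a_pi) + lam * d_net(s, a_pi)).mean()
```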
Theoretical Analysis of DOGE
DOGE achieves a tighter performance bound than data-support constraint methods, characterized through its concentrability coefficient and suboptimality constant.

Experiments
DOGE outperforms SOTA methods on the D4RL benchmarks.

Generalization Ability of DOGE
Compared with policy-constraint, value-regularization, and in-sample-learning baselines on the AntMaze Large datasets, DOGE enjoys better generalization performance.

Insight from DOGE
The generalization of DNNs and the geometry of datasets are largely overlooked by existing offline RL methods. It is necessary to take them into consideration when designing new, effective offline RL algorithms.
RL-Based vs. Imitation-Based Methods
- RL-based methods (policy constraints, value regularization, uncertainty penalties) enjoy out-of-distribution generalization but suffer from distribution shift.
- Imitation-based methods trained on the data avoid distribution shift but are too conservative to surpass the dataset.
How can we avoid distribution shift and still enjoy the benefit of out-of-distribution generalization?
Goal-conditioned supervised learning: DT, TT, RvS.
- Eysenbach, B., et al. Imitating Past Successes can be Very Suboptimal. arXiv:2206.03378, 2022.
- Brandfonbrener, D., et al. When does return-conditioned supervised learning work for offline reinforcement learning? arXiv:2206.01079, 2022.
- Paster, K., McIlraith, S., Ba, J. You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments. arXiv:2205.15967, 2022.

A Motivating Example
[Figure: green arrows mark transitions in the dataset; the grid color shows the state-value V(s).]
Action-stitching vs. state-stitching:
- Action-stitching: choose the action in the data that leads to the next state in the data with the highest V(s'). This has more of an imitation flavor and is very conservative.
- State-stitching: choose any action that leads to the next state in the data with the highest V(s'). This allows OOD actions, but needs some guidance. A toy contrast of the two rules is sketched below.
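The toy snippet below (a hypothetical tabular illustration, not from the talk) contrasts the two selection rules: action-stitching only reuses (s, a, s') tuples present in the dataset, while state-stitching picks the best-valued in-data next state among candidates and then asks a separate model for any action, possibly OOD, that reaches it. The `inverse_model` callable and the candidate set are illustrative assumptions.

```python
# Toy illustration of action-stitching vs. state-stitching on tabular data.
# dataset: list of (s, a, s_next) tuples; V: dict mapping state -> learned state-value.

def action_stitching(s, dataset, V):
    """Pick the logged action at s whose logged next state has the highest value."""
    candidates = [(a, s_next) for (s0, a, s_next) in dataset if s0 == s]
    if not candidates:
        return None
    a, _ = max(candidates, key=lambda item: V[item[1]])
    return a                              # always an in-dataset action: conservative

def state_stitching(s, dataset, V, inverse_model, candidate_next_states=None):
    """Pick the highest-value in-data next state (here simply all logged next states,
    or a caller-supplied reachable set), then ask an inverse model for *any* action,
    possibly OOD, that reaches it from s. This is where guidance is needed."""
    next_states = candidate_next_states or {s_next for (_, _, s_next) in dataset}
    s_target = max(next_states, key=lambda sn: V[sn])
    return inverse_model(s, s_target)
```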
Policy-Guided Offline Reinforcement Learning
Standard RL approach:
- Policy evaluation step (learn Q(s, a)): Q ← argmin_Q E_{(s,a,r,s')~D} [(r + γ Q(s', a') − Q(s, a))^2], with a' drawn from the current policy
- Policy maximization step (learn π(a|s)): π ← argmax_π E_{s~D, a~π} [Q(s, a)]
Dynamic programming is powerful (good), but the coupled Q and π updates and the tricky OOD regularizers bring instability and conservatism.
How can we preserve dynamic programming while learning the value function and the policy in a decoupled way?

A different view:
1. Learn V(s) instead of Q(s, a): learn a value function better than what is exhibited in the data.
2. Learn a guide-policy g(s'|s) to determine where to go: it outputs the optimal next state given the current state.
3. Learn a task-irrelevant execute-policy π_e(a|s, s') that determines which action can produce the given next state.
Policy-Guided Offline Reinforcement Learning (training)
1. Use expectile regression to obtain an upper-confidence estimate of the state-value V(s): fit V with an asymmetric L2 loss on the TD residual r + γ V(s') − V(s), weighting positive residuals by τ and negative ones by 1 − τ (with τ > 0.5). The expectile loss is sketched below.
2. Train the guide-policy g(s'|s) with respect to V(s), so that it proposes high-value next states supported by the data.
3. Train the execute-policy π_e(a|s, s') on the dataset transitions, so that it recovers the action that produces the given next state.
Given a state s, the final action is determined by both policies: first obtain the target next state s' from the guide-policy, then sample the action a ~ π_e(·|s, s') from the execute-policy.
(The expectile value learning is credited to Kostrikov et al., ICLR 2022. Xu, H., et al. A Policy-Guided Imitation Approach for Offline Reinforcement Learning. NeurIPS 2022.)
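The expectile loss of step 1 can be written compactly. The sketch below follows the standard expectile-regression formulation credited above, with hypothetical tensor names and a simplified target (no separate target network): for τ > 0.5, positive TD residuals are weighted more heavily, yielding an optimistic (upper-expectile) value estimate from data alone.

```python
import torch

def expectile_loss(diff, tau=0.7):
    """Asymmetric L2 loss: |tau - 1(diff < 0)| * diff^2."""
    weight = torch.where(diff < 0, 1.0 - tau, tau)
    return (weight * diff.pow(2)).mean()

def value_update(v_net, v_opt, batch, gamma=0.99, tau=0.7):
    s, r, s_next, done = batch                  # sampled from the offline dataset
    with torch.no_grad():                       # bootstrap target detached (target net omitted)
        target = r + gamma * (1 - done) * v_net(s_next).squeeze(-1)
    diff = target - v_net(s).squeeze(-1)        # TD residual r + gamma*V(s') - V(s)
    loss = expectile_loss(diff, tau)
    v_opt.zero_grad(); loss.backward(); v_opt.step()
    return loss.item()
```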
24、e guide-policy and the execute-policy,by:Credit to Kostrikov et al.ICLR 2022Xu et al.A Policy-Guided Imitation Approach for Offline Reinforcement Learning.NeurIPS 2022.(6)|29ExperimentsBenchmarks resultsMulti-task adaptationFor multi-task adaptation,POR only needs to re-train the guide-policy and us
Offline RL for Real-World Applications
Combustion Optimization for Thermal Power Generating Units (TPGUs) Using Offline RL
- Offline learning: an offline RL algorithm is trained together with a thermal combustion process simulator built from historical operational (logged) data.
- Online operation: the optimized actions are fed into the control loop of the real boiler system (coal feeder, coal mill, turbine, induced and forced draft fans, water pump, valves; coal, wind, and water as inputs; steam, smoke, electricity, and pollutants as outputs), and new operational data are logged.
Training goals: use less coal, generate more electricity, and emit less pollution.
(Zhan, X., et al. DeepThermal: Combustion Optimization for Thermal Power Generating Units Using Offline Reinforcement Learning. AAAI 2022; Spotlight at the RL4RealLife Workshop, ICML 2021. The system was developed at JD Technology and has been deployed in real power plants.)
Offline RL for Real-World Applications
[Figure: the real boiler system with coal feeder, coal mill, turbine, induced and forced draft fans, water pump, and valves; coal, wind, and water as inputs; steam, smoke, electricity, and pollutants as outputs; control variables act through the control loop.]
- System complexity: coal mill, boiler, steamer, and heater components; 10,000 sensors
- Complex dynamics: coal pulverizing, burning, and steam circulation; complex physical and chemical processes
- High-dimensional control: 100+ major control elements; continuous control
- Modeling restrictions: not possible to interact with the system during training; no high-fidelity simulator; only offline logged data
- Domain expertise: requires a large amount of domain knowledge; many safety constraints
- Long-term optimization: needs to optimize long-term combustion performance
- Multi-objective optimization: improve combustion efficiency and reduce NOx emissions
MORE: An Improved Model-Based Offline RL with Restrictive Exploration
- MORE tackles the challenge of offline policy learning under constraints with an imperfect simulator.
- Safe policy optimization: uses two types of Q-functions, one for reward maximization and one for cost evaluation.
- Policy optimization is performed on carefully combined real and simulated data.

Restrictive exploration and hybrid training:
- Intuition: only consider samples about which the data-driven simulator is certain, and then further distinguish whether those samples are in distribution or not.
- Filter out samples if the model is uncertain or lacks prediction robustness (model-sensitivity-based filtering), and use a data-density-based filter to separate in-distribution from out-of-distribution samples.
- Offline training uses a special local buffer that combines real samples with positive and negative simulated samples (hybrid training). A sketch of this filtering logic follows below.
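A minimal sketch of the filtering logic described above, with hypothetical thresholds and helper callables (the actual MORE criteria are more involved): simulated transitions are kept only when an ensemble of dynamics models agrees (low uncertainty), and the kept samples are then split into in-distribution and out-of-distribution groups using a density estimate fit on the real data.

```python
import numpy as np

def filter_simulated_samples(sim_batch, ensemble_predict, log_density,
                             unc_thresh=0.1, density_thresh=-5.0):
    """Restrictive exploration: keep only simulated transitions the learned
    simulator is confident about, then split them by data density."""
    kept_in_dist, kept_out_dist = [], []
    for (s, a, r, s_next) in sim_batch:
        preds = ensemble_predict(s, a)                # (n_models, state_dim) next-state predictions
        uncertainty = np.max(np.std(preds, axis=0))   # disagreement across the ensemble
        if uncertainty > unc_thresh:
            continue                                  # drop: the simulator is not trustworthy here
        if log_density(s, a) > density_thresh:        # density model fit on the real data
            kept_in_dist.append((s, a, r, s_next))    # treated as "positive" simulated samples
        else:
            kept_out_dist.append((s, a, r, s_next))   # reliable but OOD: handled separately
    return kept_in_dist, kept_out_dist
```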
Real-World Experiments
Real-world experiments were conducted at the CHN Energy Langfang Power Station. The optimized control strategy achieved maximum increases of 0.56%, 0.65%, and 0.51% in combustion efficiency.

03 Hybrid Offline-and-Online RL
Limitations in Both Online and Offline Approaches
- Dynamics gap: high-fidelity simulators are hard to construct.
- Limited coverage: offline data with sufficient state-action space coverage for offline RL training would be impractically large.
Is it possible to combine learning from limited real data (as in offline RL) with unrestricted exploration through imperfect simulators (as in online RL), to address the drawbacks of both approaches?

H2O: Dynamics-Aware Hybrid Offline-and-Online RL
Dynamics-aware policy evaluation:
- Minimize the dynamics-gap-weighted soft-maximum of Q-values: push down Q-values on high-dynamics-gap simulated samples.
- Maximize Q-values on data: pull up Q-values on real offline data samples.
- Learn on both offline data and online simulated samples.
- Fix the Bellman error caused by the dynamics gap: use the dynamics ratio as an importance-sampling weight, which can be interpreted as adding an adaptive adjustment to the rewards.
- A theoretical analysis is provided. A simplified sketch of this objective follows below.
(Niu, H., et al. When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning. NeurIPS 2022.)
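The sketch below is a loose, hypothetical rendering of the two ingredients described above, not the paper's exact objective: a CQL-like term pushes Q down on simulated samples in proportion to an estimated dynamics gap and pulls Q up on real data, while the Bellman error on simulated transitions is reweighted by the dynamics ratio used as an importance weight. The callables `dyn_gap_weight` and `dyn_ratio` are assumed to be supplied by a separately learned dynamics-discrepancy model.

```python
import torch

def h2o_style_critic_loss(q_net, target_q, policy, real_batch, sim_batch,
                          dyn_gap_weight, dyn_ratio, alpha=1.0, gamma=0.99):
    """Dynamics-aware policy evaluation (simplified sketch).

    dyn_gap_weight(s, a): larger where the simulator deviates more from the real
        dynamics; used to push Q down on such simulated samples.
    dyn_ratio(s, a, s_next): importance weight p_real(s'|s,a) / p_sim(s'|s,a)
        used to correct the Bellman error computed on simulated transitions.
    """
    rs, ra, rr, rs2 = real_batch    # real offline data
    ss, sa, sr, ss2 = sim_batch     # online simulated data

    def bellman_sq_error(s, a, r, s2):
        with torch.no_grad():
            a2 = policy(s2)
            target = r + gamma * target_q(s2, a2)
        return (q_net(s, a) - target) ** 2

    # Standard Bellman error on real data; importance-weighted on simulated data.
    td_loss = bellman_sq_error(rs, ra, rr, rs2).mean() \
        + (dyn_ratio(ss, sa, ss2) * bellman_sq_error(ss, sa, sr, ss2)).mean()

    # Push Q down on high-dynamics-gap simulated samples, pull it up on real data.
    gap_penalty = (dyn_gap_weight(ss, sa) * q_net(ss, sa)).mean() - q_net(rs, ra).mean()

    return td_loss + alpha * gap_penalty
```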
H2O: Dynamics-Aware Hybrid Offline-and-Online RL
Real-world validation on a wheel-footed robot (compared methods: SAC, DARC, CQL, and H2O (ours)):
- Standing still: SAC and DARC are not able to keep the robot balanced and quickly fail after initialization; the robot with the H2O policy remains steady past 11 s, while CQL bumps into the ground and goes out of control at 12 s.
- Moving straight: DARC and SAC fail at the beginning. H2O achieves good control performance, keeping the robot balanced and closely following the target velocity (v = 0.2 m/s), whereas CQL exceeds v by a fairly large margin, nearly doubling the desired target velocity. The recorded angle also shows that the robot controlled by H2O runs more smoothly than with CQL.
Promising Research Directions for Data-Driven Methods
Application directions for decision optimization, by problem category:
- Data available: complex industrial system optimization; conventional methods are mechanistic modeling and classical control theory.
- Simulator/model available (plus data): robot control and autonomous driving decision-making; the conventional method is simulation-based online RL.
- Optimization problem can be formulated analytically: logistics scheduling and production scheduling; the conventional method is operations research (OR).
Frontier research directions:
- AI for math/OR: using offline imitation/reinforcement learning to accelerate large-scale optimization problem solving
- Offline imitation learning with preferences: fusing expert and non-expert data
- Hybrid reinforcement learning: fusing simulated and real data
- Offline imitation/reinforcement learning with strong generalization under limited samples
- Offline reinforcement learning under safety constraints
References
Offline RL
- Zhan, X., et al. DeepThermal: Combustion Optimization for Thermal Power Generating Units Using Offline Reinforcement Learning. AAAI 2022.
- Xu, H., Zhan, X., and Zhu, X. Constraints Penalized Q-Learning for Safe Offline Reinforcement Learning. AAAI 2022.
- Xu, H., Jiang, L., Li, J., Zhan, X. A Policy-Guided Imitation Approach for Offline Reinforcement Learning. NeurIPS 2022.
- Li, J., Zhan, X., et al. Distance-Sensitive Offline Reinforcement Learning. arXiv preprint.
- Xu, H., Zhan, X., Li, J., Yin, H. Offline Reinforcement Learning with Soft Behavior Regularization. arXiv preprint.
Offline IL
- Xu, H., Zhan, X., Yin, H., and Qin, H. Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations. ICML 2022.
- Zhang, W., et al. Discriminator-Guided Model-Based Offline Imitation Learning. CoRL 2022.
Offline Planning
- Zhan, X., Zhu, X., and Xu, H. Model-Based Offline Planning with Trajectory Pruning. IJCAI 2022.
Offline-and-Online Hybrid RL
- Niu, H., et al. When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning. NeurIPS 2022.

Thank you very much for watching!