When RL Meets Highly Free Action Game: Research and Case Study
2022/09/24
Yujing Hu (胡裕靖)

1. Overview: Intro of Fuxi & Naraka: Bladepoint
2. Navigation: How we solve the navigation problem in Naraka: Bladepoint
3. Melee Combat: How we solve the melee combat problem in Naraka: Bladepoint
4. Future: What we want to do next

Overview
Intro of Fuxi & Naraka: Bladepoint

NetEase Fuxi: Business and Research Interests
Fuxi is founded on the principle of bridging artificial intelligence and video games.
- Reinforcement Learning
- Computer Vision
- Natural Language Processing
- User Persona
- Virtual Human
- Robotics

NetEase Fuxi RL Group: Business and Research Interests
Typical applications of RL in games: game AI bots
[Figures: Card Game; Revelation Mobile; MMORPG; Justice 6-vs-6; Sports Game; Fever Basketball 3-vs-3; ACT Game; Naraka: Bladepoint]

Naraka: Bladepoint (永劫無間)
Action-adventure battle royale game developed by 24 Entertainment and published by NetEase Games Montreal
- 60-player PVP mythical action combat
- Melee combat
- Gravity-defying mobility
- Vast arsenals of melee & ranged weapons
- Legendary customizable heroes with epic abilities

Two major problems in Naraka (player-vs-bot mode) we want to solve
1. Navigation: navigation in very complex terrains
2. Melee combat: melee combat bots with high skill level

Reinforcement Learning Applications in Naraka: Navigation and Melee Combat
[Figures: Navigation Task; Melee Combat Task]

Navigation

Problems for AI in Naraka: Bladepoint
Problems for pathfinding
- Complex three-dimensional terrains: mountains, trees, rivers, temples, tall buildings (too many disconnected areas)
  [Figures: NavMesh; typical terrains in Naraka: Bladepoint]
- Dynamic environment (e.g., poison circle, bombing zone, traps)
  [Figures: Bombing Zone; Poison Circle; Traps]
- Multiple game mechanisms for moving (e.g., grappling hooks, scale rush, sliding jump, charge-to-dodge)
  [Figures: Grappling hook; Scale Rush; Sliding Jump & Charge-to-dodge]
- Demand for human-likeness

Navigation: 3D perception with DRL
Problems and methods
Problems:
- Complex three-dimensional terrains
- Dynamic environment
- Disconnected areas
- Multiple game mechanisms for moving
- Human-like moving operations
Methods:
- Three-dimensional real-time perception
- Human-like policy output design
- Deep reinforcement learning
- Techniques such as automated reward shaping and curriculum learning

3D real-time perception in the game
[Figures: Radar; Depth Map]
Inputs:
- 3D features
- Scalar features
- Time-series features
Outputs:
- W/A/S/D (forward, left, back, right)
- Hook/Crouch/Dodge/Jump
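
The feature groups and keys named on the 3D perception slide can be pictured as a small observation container plus a two-head action sampler. This is only a sketch under assumptions: the field names, the `NONE` entries, and the split into a movement head and a skill head are hypothetical and not taken from the talk. Only the key names (W/A/S/D, Hook, Crouch, Dodge, Jump) and the three feature groups come from the slides.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Key names from the slide; "NONE" is an added assumption meaning "press nothing".
MOVE_KEYS = ["NONE", "W", "A", "S", "D"]
SKILL_KEYS = ["NONE", "Hook", "Crouch", "Dodge", "Jump"]


@dataclass
class Observation:
    """Hypothetical container for the three feature groups named on the slide."""
    depth_map: List[List[float]]  # 3D features, e.g. a radar-style depth map
    scalars: List[float]          # scalar features
    history: List[List[float]]    # time-series features


def sample_action(move_probs: Sequence[float],
                  skill_probs: Sequence[float],
                  rng: random.Random) -> Tuple[str, str]:
    """Sample one key from each head of a (hypothetical) two-head policy."""
    if len(move_probs) != len(MOVE_KEYS) or len(skill_probs) != len(SKILL_KEYS):
        raise ValueError("probability vectors must match the key lists")
    move = rng.choices(MOVE_KEYS, weights=move_probs, k=1)[0]
    skill = rng.choices(SKILL_KEYS, weights=skill_probs, k=1)[0]
    return move, skill
```

Emitting key presses rather than waypoints is one way to read "human-like policy output design": the agent acts through the same controls a player has. In a real system the two probability vectors would come from the policy network rather than being passed in by hand.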
Neural Network Structure
[Figure: neural network structure]

Agent can get stuck and lacks human-likeness
- Agent gets stuck in corners
- Agent keeps jumping

Automated Reward Shaping
Reward shaping needs tedious tuning work to get appropriate weight hyperparameters.
[Figure: IRAT; optimal policy vs. suboptimal policy; true reward vs. shaping reward]
Li Wang, Yupeng Zhang, Yujing Hu, et al. Individual Reward Assisted Multi-Agent Reinforcement Learning. ICML 2022.

Automated Reward Shaping: Updating Shaping Policy
For each shaping policy and the target policy:
- When the two policies are consistent, the shaping policy should learn quickly.
- When the two policies conflict too much, the shaping policy should update carefully.
[Equations: the symbols were lost in extraction; only the surviving structure is shown, with "…" marking each gap.]
- The similarity between the shaping policy and the target policy is defined as: … = …(…|…)
- A new objective is: … = clip(…, 1 − …, 1 + …) …
- Combined with its original optimization objective: … = 1[…] max(…, …) + 1[…] min(…, …)
- An increasing-effect KL regularizer is introduced to distill target policy knowledge: … = …

Automated Reward Shaping: Updating Target Policy
- The target policy uses a learning objective corrected by importance sampling: … = …
- A decreasing-effect KL regularizer ensures an effective update.
- The total learning objective of the team (target) policy is: … = min(…, clip(…, 1 − …, 1 + …) …) …, where … is a decreasing coefficient.

Curriculum Learning
Start points are chosen in three phases: first in specific areas, then randomly from the full map, and lastly at points where the agent got stuck.

Comparison of the arrival rate between NavMesh and our method in different areas

| Area Name        | NavMesh arrival rate | Our method arrival rate | Increase ratio |
|------------------|----------------------|-------------------------|----------------|
| Full Map         | 63.40%               | 81.50%                  | 28.54%         |
| Celestra         | 32.70%               | 88.00%                  | 169.11%        |
| Stilltide Temple | 27.90%               | 74.70%                  | 167.74%        |
| Wreckage Plains  | 35.90%               | 85.50%                  | 138.16%        |
| Shadow Jade Mine | 24.80%               | 81.50%                  | 228.63%        |
| Sun wings Rest   | 41.40%               | 73.30%                  | 77.05%         |
| Average          | 37.70%               | 80.75%                  | 114.19%        |

Navigation in complex terrains
High arrival rate in complex terrains
[Demo: Shadow Jade Mine, RL Navigation Agent vs Rule-based Agent]

Melee Combat

Problems for AI in Naraka: Bladepoint
Problems for melee combat
- Rock-paper-scissors combat system: Focus Strikes, Common Attack, Counterstrikes
- Thirteen heroes (more in the future) with different hero skills
  [Figure: Skills of Different Heroes in Naraka: Bladepoint]
- Various melee weapons with different mechanisms
  [Figures: Spear; Nunchuk]
- [Demo: a playing demo of Naraka, showing rich attack modes]

Combat Bot with High Skill Level
Problems and methods
Problems:
- Various melee weapons with different mechanisms
- Thirteen heroes (more in the future) with different hero skills
- Rock-paper-scissors combat system: requires players to guess/predict and counteract the other's strategies
Methods:
- Policy distillation: knowledge transfer
- Opponent modeling: observing opponents' historical behaviors to predict opponents' next moves

An all-rounder AI to master every kind of weapon: knowledge transfer
Policy distillation can improve a student agent's performance effectively by transferring knowledge from multiple teachers.
- Stage One: train all teachers, each proficient in one weapon
- Stage Two: distill their knowledge into one student
Since weapon combos are more complicated than hero skills, we only use distillation to handle weapons.

How to predict opponents' next moves: observe and encode their historical behaviors
- Different historical behaviors: Common Attack (white), Focus Strikes (blue), Counterstrikes (red)
- Encoded features (three examples): (0.4, 0.05, 0.45), (0.4, 0.55, 0.05), (0.8, 0.15, 0.05)

[Demos: Predict opponents' next moves; An all-rounder AI to master every kind of weapon; Predict opponents' next moves (PVE)]

Future work
What we want to do next

Remaining Problems
- Navigation in rooms and high buildings with big height differences
- The timing for switching between melee and ranged weapons
- Other sub-goals in the battle royale game, e.g., resource collection and team cooperation

Q&A
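
The two-stage knowledge transfer described in the melee combat part (one teacher per weapon, then a single student) is commonly trained by minimizing a KL divergence between each teacher's action distribution and the student's. The sketch below illustrates that generic loss only. The talk does not state which loss was actually used, and the function names, the three-action distributions, and the per-weapon dictionaries are hypothetical. Only the weapon names Spear and Nunchuk come from the slides.

```python
import math
from typing import Dict, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-8) -> float:
    """KL(p || q) for two discrete action distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # eps guards against log(0); terms with p_i == 0 contribute nothing.
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_probs: Dict[str, Sequence[float]],
                      student_probs: Dict[str, Sequence[float]]) -> float:
    """Average KL(teacher || student) over weapons, one sampled state per weapon."""
    losses = [kl_divergence(teacher_probs[w], student_probs[w]) for w in teacher_probs]
    return sum(losses) / len(losses)


# Hypothetical action probabilities for one game state per weapon.
teachers = {"Spear": [0.7, 0.2, 0.1], "Nunchuk": [0.1, 0.3, 0.6]}
untrained_student = {"Spear": [1 / 3, 1 / 3, 1 / 3], "Nunchuk": [1 / 3, 1 / 3, 1 / 3]}

loss_before = distillation_loss(teachers, untrained_student)
loss_matched = distillation_loss(teachers, teachers)
```

The loss is positive for the uniform student and zero when the student reproduces every teacher exactly, so driving it down pulls the single student toward each weapon specialist on that specialist's own states.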