OpenRL: A Unified Reinforcement Learning Framework
Huang Shiyu (黃世宇), 4Paradigm

Speaker: Huang Shiyu, Reinforcement Learning Scientist at 4Paradigm and head of the open-source OpenRL Lab. He received both his bachelor's and PhD degrees from the Department of Computer Science at Tsinghua University, advised by Professors Jun Zhu and Ting Chen, and studied at CMU as an undergraduate exchange student under Professor Deva Ramanan. His main research interests are reinforcement learning, multi-agent reinforcement learning, and distributed reinforcement learning. He has published papers at conferences and in journals including ICLR, CVPR, AAAI, NeurIPS, Nature Machine Intelligence, ICML, AAMAS, and Pattern Recognition. TiZero, the Google Research Football agent developed under his leadership, ranked first on the Jidi (及第) platform. He has also worked at Tencent AI Lab, Huawei Noah's Ark Lab, SenseTime, and RealAI.

Contents
1. Background on Reinforcement Learning
2. Introduction to OpenRL
3. Future Development of OpenRL
4. Introduction to OpenPlugin

PART 01: Introduction & Motivation

What is Reinforcement Learning?
Goal of RL: Artificial General Intelligence (AGI)
Example: reinforcement learning in dog training.

What else? Robotics, Autonomous Driving (OpenAI 2019, CARLA 2017)
What else? Industrial Design, Quantitative Trading (PrefixRL 2022, FinRL 2020)
What else? Chat Bots
What else? Multi-agent RL, Competitive RL (TiZero 2023, Honor of Kings Arena 2022)

Do RL in a Unified Framework
- Various RL algorithms
- Various environments
- Multi-agent & self-play
- Offline RL
PART 02: OpenRL, An Open-Source RL Framework

Main Features of OpenRL: Friendly to beginners
Install with:
    pip install openrl
or:
    docker pull openrllab/openrl

Train CartPole from the command line:
    openrl --mode train --env CartPole-v1

Documentation (English and Chinese) and tutorials are available.
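As a sketch of the beginner-level Python API (following the quickstart in OpenRL's own documentation; the exact module paths may differ between OpenRL versions), training CartPole takes only a few lines:

    # Minimal OpenRL quickstart, a sketch based on the project's documentation.
    from openrl.envs.common import make
    from openrl.modules.common import PPONet as Net
    from openrl.runners.common import PPOAgent as Agent

    env = make("CartPole-v1", env_num=9)  # vectorized environment with 9 copies
    net = Net(env)                        # PPO network built from the env's spaces
    agent = Agent(net)                    # PPO agent wrapping the network
    agent.train(total_time_steps=20000)   # finishes in seconds on a laptop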
Main Features of OpenRL: Customizable capabilities for professionals
Configure everything via YAML:
    python train_ppo.py --config mpe_ppo.yaml
or pass hyperparameters directly on the command line:
    python train_ppo.py --seed 1 --lr 5e-4
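For illustration, a hypothetical mpe_ppo.yaml might collect the same hyperparameters shown on the command line above; the two keys below simply mirror those CLI flags, and the real file would follow OpenRL's config schema.

    # mpe_ppo.yaml, a hypothetical config sketch; keys mirror the CLI flags above
    seed: 1          # random seed (same as --seed 1)
    lr: 5.0e-4       # learning rate (same as --lr 5e-4)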
Track your experiments via Wandb.
Track your experiments via Tensorboard.
Customize the Wandb output: https:/

Abstract & modularized design:
- Reward Module
- Policy Module
- Value Module
- Algorithm
Customize the Reward Model
(Chen, Wenze, et al. "DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization." arXiv preprint arXiv:2207.05631 (2022).)

Example reward components for text generation, combined as in the sketch below:
- Intent Reward: when the text generated by the agent is close to the expected intent, the agent receives a higher reward.
- METEOR Metric Reward: METEOR is a metric for evaluating text-generation quality; it measures how similar the generated text is to the expected text. We use this metric as reward feedback to optimize the agent's text-generation performance.
- KL Divergence Reward: this reward limits how far the text generated by the agent deviates from the pre-trained model, preventing reward hacking.
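The three components above can be combined into a single scalar reward. Below is a minimal, hypothetical sketch of such a composition; the class, its interface, and the intent/METEOR scoring callables are illustrative stand-ins, not OpenRL's actual reward-module API.

    class CompositeTextReward:
        """Hypothetical composite reward for text generation; an illustrative
        sketch, not OpenRL's actual reward-module interface."""

        def __init__(self, intent_fn, meteor_fn, kl_coef=0.1):
            self.intent_fn = intent_fn  # stand-in: intent-match score in [0, 1]
            self.meteor_fn = meteor_fn  # stand-in: METEOR similarity in [0, 1]
            self.kl_coef = kl_coef      # weight of the KL penalty

        def __call__(self, text, reference, policy_logps, pretrained_logps):
            intent_r = self.intent_fn(text, reference)
            meteor_r = self.meteor_fn(text, reference)
            # The per-token log-prob gap approximates KL(policy || pretrained);
            # penalizing it keeps generations close to the pre-trained model
            # and guards against reward hacking.
            kl = sum(p - q for p, q in zip(policy_logps, pretrained_logps))
            return intent_r + meteor_r - self.kl_coef * kl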
Main Features of OpenRL: Support for Offline RL
- Learn from interaction
- Learn from expert data

Main Features of OpenRL: Customizable capabilities for professionals
- Dictionary observation space support
- Serial or parallel environment training
- Support for models such as LSTM, GRU, and Transformer
- Automatic mixed precision (AMP) training
- Data collection with a half-precision policy network (see the sketch below)
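To illustrate the last two items, here is a generic PyTorch sketch of AMP training plus a half-precision rollout copy of the policy; this follows the standard torch.cuda.amp recipe and is not OpenRL's internal implementation.

    import copy
    import torch
    import torch.nn.functional as F

    policy = torch.nn.Linear(8, 2).cuda()   # stand-in policy network
    optimizer = torch.optim.Adam(policy.parameters(), lr=5e-4)
    scaler = torch.cuda.amp.GradScaler()    # rescales the loss so fp16 gradients don't underflow

    def train_step(obs, target):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():     # run the forward pass in mixed precision
            loss = F.mse_loss(policy(obs), target)
        scaler.scale(loss).backward()       # backward on the scaled loss
        scaler.step(optimizer)
        scaler.update()

    @torch.no_grad()
    def make_rollout_policy():
        # Separate half-precision copy used only for data collection;
        # refresh it from the fp32 policy after each update.
        return copy.deepcopy(policy).half()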
Main Features of OpenRL: Build on top of others
- Datasets
- Models

Main Features of OpenRL: Gallery

Main Features of OpenRL: High performance
- Training CartPole on a laptop takes only a few seconds.
- +17% speedup for language-model training.
- Ranked 1st on Google Research Football.
- +43% performance improvement on LLM training compared with RL4LMs.

TiZero
(Lin, Fanqi, et al. "TiZero: Mastering Multi-Agent Football with Curriculum Learning and Self-Play." arXiv preprint arXiv:2302.07515 (2023).)
PART 03: Future Release

Large-Scale RL
- Large models
- Large clusters
- Large populations
(Yang, Xinyi, et al. "Learning Graph-Enhanced Commander-Executor for Multi-Agent Navigation." arXiv preprint arXiv:2302.04094 (2023).)

Open RL via Sharing
- Share models
- Share code
- Share results

Scan the QR code to try OpenRL!
Visit:

PART 04: OpenPlugin, a Plugin Store for LLMs
Why?
- Think about pip for Python packages (and apt/yum/brew/dnf/npm/...)!
- Think about the App Store.
- Standardize plugins.
- Provide a simple way to use and share LLM plugins.

Main Features of OpenPlugin: Installation
    pip install openplugin-py

Main Features of OpenPlugin: Usage
- Install a plugin: op install
- Install locally: op install ./
- Reinstall a plugin: op reinstall
- Uninstall a plugin: op uninstall
- Run a plugin: op run
- List installed plugins: op list

op is all you need!
Main Features of OpenPlugin: Usage
Provide a config API for the SageGPT/ChatGPT platforms:
- The JSON manifest can be fetched via: server_host/ai-plugin.json
- The YAML API spec can be fetched via: server_host/openapi.yaml
A sketch of such a manifest follows.
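For orientation, a ChatGPT-style ai-plugin.json manifest generally looks like the sketch below; all field values here are hypothetical placeholders, and the exact file OpenPlugin serves may differ.

    {
      "schema_version": "v1",
      "name_for_human": "QRcode Plugin",
      "name_for_model": "qrcode",
      "description_for_human": "Generate QR codes from text.",
      "description_for_model": "Generate a QR code image for a given string.",
      "auth": { "type": "none" },
      "api": {
        "type": "openapi",
        "url": "http://server_host/openapi.yaml"
      },
      "logo_url": "http://server_host/logo.png",
      "contact_email": "owner@example.com",
      "legal_info_url": "http://server_host/legal"
    }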
Main Features of OpenPlugin: Build on top of others
You can share your plugin with others! https:/

Main Features of OpenPlugin: Plugin Store
- ikun_plugin
- todo_plugin
- QRcode_plugin
- ...

Main Features of OpenPlugin: QRcode_plugin
- Support for placeholders
- Plugin structure

Main Features of OpenPlugin: How to use QRcode_plugin
- Step 0: Find a server.
- Step 1: pip install openplugin-py
- Step 2: op install QRcode_plugin
- Step 3: op run QRcode
- Step 4: Get the JSON and YAML files.
- Step 5: Register the plugin on the SageGPT or ChatGPT website.
- Step 6: Finished! Have fun!

QRcode_plugin Demo

Try OpenPlugin, Click Star! Visit: https:/

Thank you for listening!