
ding.bonus.ppof

PPOF

Overview

Class of agent for training, evaluation and deployment of the reinforcement learning algorithm Proximal Policy Optimization (PPO). For more information about the system design of RL agents, please refer to https://di-engine-docs.readthedocs.io/en/latest/03_system/agent.html.

Interface: __init__, train, deploy, collect_data, batch_evaluate, best

supported_env_list = ['LunarLander-v2', 'LunarLanderContinuous-v2', 'BipedalWalker-v3', 'Pendulum-v1', 'acrobot', 'rocket_landing', 'drone_fly', 'hybrid_moving', 'evogym_carrier', 'mario', 'di_sheep', 'procgen_bigfish', 'minigrid_fourroom', 'metadrive', 'BowlingNoFrameskip-v4', 'BreakoutNoFrameskip-v4', 'GopherNoFrameskip-v4', 'KangarooNoFrameskip-v4', 'PongNoFrameskip-v4', 'QbertNoFrameskip-v4', 'SpaceInvadersNoFrameskip-v4', 'Hopper-v3', 'HalfCheetah-v3', 'Walker2d-v3']

Overview

List of supported envs.

Examples:
>>> from ding.bonus.ppof import PPOF
>>> print(PPOF.supported_env_list)

best property

Overview

Load the best model from the checkpoint directory, which defaults to the file exp_name/ckpt/eval.pth.tar. The return value is the agent with the best model loaded.

Returns:
- (:obj:`PPOF`): The agent with the best model.

Examples:
>>> agent = PPOF(env_id='LunarLander-v2')
>>> agent.train()
>>> agent = agent.best

.. note:: The best model is the model with the highest evaluation return. If this method is called, the current model will be replaced by the best model.
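The checkpoint layout described above can be written as a small path helper (a sketch; `best_checkpoint_path` is a hypothetical name, not a ding API):

```python
import os

def best_checkpoint_path(exp_name):
    """Where ``best`` looks for the best checkpoint: <exp_name>/ckpt/eval.pth.tar.
    exp_name defaults to "<env_id>-PPO" when not specified at init time."""
    return os.path.join(exp_name, 'ckpt', 'eval.pth.tar')
```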

__init__(env_id=None, env=None, seed=0, exp_name=None, model=None, cfg=None, policy_state_dict=None)

Overview

Initialize agent for PPO algorithm.

Arguments:
- env_id (:obj:`str`): The environment id, which is a registered environment name in gym or gymnasium. If env_id is not specified, env_id in cfg must be specified. If env_id is specified, env_id in cfg will be ignored. env_id should be one of the supported envs, which can be found in PPOF.supported_env_list.
- env (:obj:`BaseEnv`): The environment instance for training and evaluation. If env is not specified, env_id or cfg.env_id must be specified, and will be used to create the environment instance. If env is specified, env_id and cfg.env_id will be ignored.
- seed (:obj:`int`): The random seed, which is set before running the program. Default to 0.
- exp_name (:obj:`str`): The name of this experiment, which will be used to create the folder to save log data. Default to None. If not specified, the folder name will be env_id-algorithm.
- model (:obj:`torch.nn.Module`): The model of PPO algorithm, which should be an instance of class ding.model.PPOFModel. If not specified, a default model will be generated according to the configuration.
- cfg (:obj:`Union[EasyDict, dict]`): The configuration of PPO algorithm, which is a dict. Default to None. If not specified, the default configuration will be used.
- policy_state_dict (:obj:`str`): The path of a policy state dict saved by PyTorch in a local file. If specified, the policy will be loaded from this file. Default to None.

.. note:: An RL Agent instance can be initialized in two basic ways. For example, suppose we have an environment with id LunarLander-v2 registered in gym, and we want to train an agent with PPO algorithm with default configuration. Then we can initialize the agent in the following ways:
>>> agent = PPOF(env_id='LunarLander-v2')
or, if we want to specify the env_id in the configuration:
>>> cfg = {'env': {'env_id': 'LunarLander-v2'}, 'policy': ...... }
>>> agent = PPOF(cfg=cfg)
There are also other arguments to specify the agent when initializing. For example, if we want to specify the environment instance:
>>> env = CustomizedEnv('LunarLander-v2')
>>> agent = PPOF(cfg=cfg, env=env)
or, if we want to specify the model:
>>> model = VAC(**cfg.policy.model)
>>> agent = PPOF(cfg=cfg, model=model)
or, if we want to reload the policy from a saved policy state dict:
>>> agent = PPOF(cfg=cfg, policy_state_dict='LunarLander-v2.pth.tar')
Make sure that the configuration is consistent with the saved policy state dict.
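The precedence between the env_id argument and cfg.env_id described above can be sketched as a small standalone helper (illustrative only; `resolve_env_id` is a hypothetical name, not a ding API):

```python
def resolve_env_id(env_id=None, cfg=None):
    """Sketch of PPOF's env_id resolution: an explicit env_id argument
    takes precedence; otherwise cfg must carry an 'env_id' entry."""
    if env_id is not None:
        # env_id in cfg is ignored when env_id is passed explicitly
        return env_id
    if cfg is None or 'env_id' not in cfg:
        raise ValueError("Please specify env_id or cfg with env_id.")
    return cfg['env_id']
```

Note that the real constructor additionally validates the result against PPOF.supported_env_list.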

train(step=int(1e7), collector_env_num=4, evaluator_env_num=4, n_iter_log_show=500, n_iter_save_ckpt=1000, context=None, reward_model=None, debug=False, wandb_sweep=False)

Overview

Train the agent with PPO algorithm for step iterations with collector_env_num collector environments and evaluator_env_num evaluator environments. Information during training will be recorded and saved by wandb.

Arguments:
- step (:obj:`int`): The total training environment steps of all collector environments. Default to 1e7.
- collector_env_num (:obj:`int`): The number of collector environments. Default to 4.
- evaluator_env_num (:obj:`int`): The number of evaluator environments. Default to 4.
- n_iter_log_show (:obj:`int`): The frequency, in training iterations, of showing logs. Default to 500.
- n_iter_save_ckpt (:obj:`int`): The frequency, in training iterations, of saving checkpoints. Default to 1000.
- context (:obj:`str`): The multi-process context of the environment manager. Default to None. It can be specified as spawn, fork or forkserver.
- reward_model (:obj:`str`): The reward model name. Default to None. This argument is not supported yet.
- debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. If set True, base environment manager will be used for easy debugging. Otherwise, subprocess environment manager will be used.
- wandb_sweep (:obj:`bool`): Whether to use wandb sweep, which is a hyper-parameter optimization process for seeking the best configurations. Default to False. If True, the wandb sweep id will be used as the experiment name.

Returns:
- (:obj:`TrainingReturn`): The training result, of which the attributions are:
  - wandb_url (:obj:`str`): The Weights & Biases (wandb) project url of the training experiment.
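Since step counts environment steps summed over all collector environments (not training iterations), the relationship between the budget and the number of collector iterations can be sketched as follows (illustrative only; `iterations_until_budget` and its arguments are hypothetical names, not part of ding):

```python
def iterations_until_budget(step_budget, collector_env_num, steps_per_env_per_iter):
    """Each collector iteration advances all collector environments, so the
    env-step counter grows by collector_env_num * steps_per_env_per_iter.
    Training stops once the accumulated env steps reach the budget."""
    env_steps, iterations = 0, 0
    while env_steps < step_budget:
        env_steps += collector_env_num * steps_per_env_per_iter
        iterations += 1
    return iterations, env_steps
```

For example, with 4 collector environments each contributing 100 steps per iteration, a budget of 1000 env steps is exhausted after 3 collector iterations.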

deploy(enable_save_replay=False, concatenate_all_replay=False, replay_save_path=None, seed=None, debug=False)

Overview

Deploy the agent with PPO algorithm by interacting with the environment, during which the replay video can be saved if enable_save_replay is True. The evaluation result will be returned.

Arguments:
- enable_save_replay (:obj:`bool`): Whether to save the replay video. Default to False.
- concatenate_all_replay (:obj:`bool`): Whether to concatenate all replay videos into one video. Default to False. If enable_save_replay is False, this argument will be ignored. If enable_save_replay is True and concatenate_all_replay is False, the replay video of each episode will be saved separately.
- replay_save_path (:obj:`str`): The path to save the replay video. Default to None. If not specified, the video will be saved in exp_name/videos.
- seed (:obj:`Union[int, List]`): The random seed, which is set before running the program. Default to None. If not specified, self.seed will be used. If seed is an integer, the agent will be deployed once. If seed is a list of integers, the agent will be deployed once for each seed in the list.
- debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. If set True, base environment manager will be used for easy debugging. Otherwise, subprocess environment manager will be used.

Returns:
- (:obj:`EvalReturn`): The evaluation result, of which the attributions are:
  - eval_value (:obj:`np.float32`): The mean of evaluation return.
  - eval_value_std (:obj:`np.float32`): The standard deviation of evaluation return.
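The seed handling described above (int → one deployment, list → one deployment per seed, None → fall back to the agent's own seed) can be sketched standalone (`normalize_seeds` is a hypothetical name for illustration):

```python
def normalize_seeds(seed, default_seed):
    """Mirror of deploy's seed normalization: always return the list of
    seeds to run, one deployment episode per entry."""
    if isinstance(seed, int):
        return [seed]          # deploy once with this seed
    if isinstance(seed, list):
        return seed            # deploy once per seed in the list
    return [default_seed]      # None: fall back to self.seed
```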

collect_data(env_num=8, save_data_path=None, n_sample=None, n_episode=None, context=None, debug=False)

Overview

Collect data with PPO algorithm for n_episode episodes with env_num collector environments. The collected data will be saved in save_data_path if specified, otherwise it will be saved in exp_name/demo_data.

Arguments:
- env_num (:obj:`int`): The number of collector environments. Default to 8.
- save_data_path (:obj:`str`): The path to save the collected data. Default to None. If not specified, the data will be saved in exp_name/demo_data.
- n_sample (:obj:`int`): The number of samples to collect. Default to None. If not specified, n_episode must be specified.
- n_episode (:obj:`int`): The number of episodes to collect. Default to None. If not specified, n_sample must be specified.
- context (:obj:`str`): The multi-process context of the environment manager. Default to None. It can be specified as spawn, fork or forkserver.
- debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. If set True, base environment manager will be used for easy debugging. Otherwise, subprocess environment manager will be used.
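The constraint between n_sample and n_episode can be sketched as a small validator; note that in the current source, passing n_episode raises NotImplementedError (episode-based collection is not supported yet). `check_collect_args` is a hypothetical helper name for illustration:

```python
def check_collect_args(n_sample=None, n_episode=None):
    """Validate collect_data's sampling arguments: at least one of
    n_sample / n_episode is required, and only n_sample is supported."""
    if n_sample is None and n_episode is None:
        raise ValueError("Please specify n_sample or n_episode.")
    if n_episode is not None:
        # episode-based collection is not implemented in PPOF yet
        raise NotImplementedError
    return n_sample
```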

batch_evaluate(env_num=4, n_evaluator_episode=4, context=None, debug=False)

Overview

Evaluate the agent with PPO algorithm for n_evaluator_episode episodes with env_num evaluator environments. The evaluation result will be returned. The difference between methods batch_evaluate and deploy is that batch_evaluate will create multiple evaluator environments to evaluate the agent to get an average performance, while deploy will only create one evaluator environment to evaluate the agent and save the replay video.

Arguments:
- env_num (:obj:`int`): The number of evaluator environments. Default to 4.
- n_evaluator_episode (:obj:`int`): The number of episodes to evaluate. Default to 4.
- context (:obj:`str`): The multi-process context of the environment manager. Default to None. It can be specified as spawn, fork or forkserver.
- debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. If set True, base environment manager will be used for easy debugging. Otherwise, subprocess environment manager will be used.

Returns:
- (:obj:`EvalReturn`): The evaluation result, of which the attributions are:
  - eval_value (:obj:`np.float32`): The mean of evaluation return.
  - eval_value_std (:obj:`np.float32`): The standard deviation of evaluation return.
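The EvalReturn fields are the mean and (population) standard deviation of the per-episode evaluation returns, computed with numpy. A minimal sketch (`summarize_returns` is a hypothetical helper name):

```python
import numpy as np

def summarize_returns(episode_returns):
    """Compute (eval_value, eval_value_std) from per-episode returns,
    matching np.mean / np.std as used in PPOF."""
    arr = np.asarray(episode_returns, dtype=np.float32)
    return float(np.mean(arr)), float(np.std(arr))
```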

Full Source Code

../ding/bonus/ppof.py

from typing import Optional, Union, List
from ditk import logging
from easydict import EasyDict
from functools import partial
import os
import gym
import gymnasium
import numpy as np
import torch
from ding.framework import task, OnlineRLContext
from ding.framework.middleware import interaction_evaluator_ttorch, PPOFStepCollector, multistep_trainer, CkptSaver, \
    wandb_online_logger, offline_data_saver, termination_checker, ppof_adv_estimator
from ding.envs import BaseEnv, BaseEnvManagerV2, SubprocessEnvManagerV2
from ding.policy import PPOFPolicy, single_env_forward_wrapper_ttorch
from ding.utils import set_pkg_seed
from ding.utils import get_env_fps, render
from ding.config import save_config_py
from .model import PPOFModel
from .config import get_instance_config, get_instance_env, get_hybrid_shape
from ding.bonus.common import TrainingReturn, EvalReturn


class PPOF:
    """
    Overview:
        Class of agent for training, evaluation and deployment of the reinforcement learning algorithm \
        Proximal Policy Optimization (PPO).
        For more information about the system design of RL agents, please refer to \
        <https://di-engine-docs.readthedocs.io/en/latest/03_system/agent.html>.
    Interface:
        ``__init__``, ``train``, ``deploy``, ``collect_data``, ``batch_evaluate``, ``best``
    """

    supported_env_list = [
        # common
        'LunarLander-v2',
        'LunarLanderContinuous-v2',
        'BipedalWalker-v3',
        'Pendulum-v1',
        'acrobot',
        # ch2: action
        'rocket_landing',
        'drone_fly',
        'hybrid_moving',
        # ch3: obs
        'evogym_carrier',
        'mario',
        'di_sheep',
        'procgen_bigfish',
        # ch4: reward
        'minigrid_fourroom',
        'metadrive',
        # atari
        'BowlingNoFrameskip-v4',
        'BreakoutNoFrameskip-v4',
        'GopherNoFrameskip-v4',
        'KangarooNoFrameskip-v4',
        'PongNoFrameskip-v4',
        'QbertNoFrameskip-v4',
        'SpaceInvadersNoFrameskip-v4',
        # mujoco
        'Hopper-v3',
        'HalfCheetah-v3',
        'Walker2d-v3',
    ]
    """
    Overview:
        List of supported envs.
    Examples:
        >>> from ding.bonus.ppof import PPOF
        >>> print(PPOF.supported_env_list)
    """

    def __init__(
            self,
            env_id: str = None,
            env: BaseEnv = None,
            seed: int = 0,
            exp_name: str = None,
            model: Optional[torch.nn.Module] = None,
            cfg: Optional[Union[EasyDict, dict]] = None,
            policy_state_dict: str = None
    ) -> None:
        """
        Overview:
            Initialize agent for PPO algorithm.
        Arguments:
            - env_id (:obj:`str`): The environment id, which is a registered environment name in gym or gymnasium. \
                If ``env_id`` is not specified, ``env_id`` in ``cfg`` must be specified. \
                If ``env_id`` is specified, ``env_id`` in ``cfg`` will be ignored. \
                ``env_id`` should be one of the supported envs, which can be found in ``PPOF.supported_env_list``.
            - env (:obj:`BaseEnv`): The environment instance for training and evaluation. \
                If ``env`` is not specified, ``env_id`` or ``cfg.env_id`` must be specified. \
                ``env_id`` or ``cfg.env_id`` will be used to create environment instance. \
                If ``env`` is specified, ``env_id`` and ``cfg.env_id`` will be ignored.
            - seed (:obj:`int`): The random seed, which is set before running the program. \
                Default to 0.
            - exp_name (:obj:`str`): The name of this experiment, which will be used to create the folder to save \
                log data. Default to None. If not specified, the folder name will be ``env_id``-``algorithm``.
            - model (:obj:`torch.nn.Module`): The model of PPO algorithm, which should be an instance of class \
                ``ding.model.PPOFModel``. \
                If not specified, a default model will be generated according to the configuration.
            - cfg (:obj:`Union[EasyDict, dict]`): The configuration of PPO algorithm, which is a dict. \
                Default to None. If not specified, the default configuration will be used.
            - policy_state_dict (:obj:`str`): The path of a policy state dict saved by PyTorch in a local file. \
                If specified, the policy will be loaded from this file. Default to None.

        .. note::
            An RL Agent instance can be initialized in two basic ways. \
            For example, we have an environment with id ``LunarLander-v2`` registered in gym, \
            and we want to train an agent with PPO algorithm with default configuration. \
            Then we can initialize the agent in the following ways:
                >>> agent = PPOF(env_id='LunarLander-v2')
            or, if we want to specify the env_id in the configuration:
                >>> cfg = {'env': {'env_id': 'LunarLander-v2'}, 'policy': ...... }
                >>> agent = PPOF(cfg=cfg)
            There are also other arguments to specify the agent when initializing.
            For example, if we want to specify the environment instance:
                >>> env = CustomizedEnv('LunarLander-v2')
                >>> agent = PPOF(cfg=cfg, env=env)
            or, if we want to specify the model:
                >>> model = VAC(**cfg.policy.model)
                >>> agent = PPOF(cfg=cfg, model=model)
            or, if we want to reload the policy from a saved policy state dict:
                >>> agent = PPOF(cfg=cfg, policy_state_dict='LunarLander-v2.pth.tar')
            Make sure that the configuration is consistent with the saved policy state dict.
        """

        assert env_id is not None or cfg is not None, "Please specify env_id or cfg."

        if cfg is not None and not isinstance(cfg, EasyDict):
            cfg = EasyDict(cfg)

        if env_id is not None:
            assert env_id in PPOF.supported_env_list, "Please use supported envs: {}".format(PPOF.supported_env_list)
            if cfg is None:
                cfg = get_instance_config(env_id, algorithm="PPOF")

            if not hasattr(cfg, "env_id"):
                cfg.env_id = env_id
            assert cfg.env_id == env_id, "env_id in cfg should be the same as env_id in args."
        else:
            assert hasattr(cfg, "env_id"), "Please specify env_id in cfg."
            assert cfg.env_id in PPOF.supported_env_list, "Please use supported envs: {}".format(
                PPOF.supported_env_list
            )

        if exp_name is not None:
            cfg.exp_name = exp_name
        elif not hasattr(cfg, "exp_name"):
            cfg.exp_name = "{}-{}".format(cfg.env_id, "PPO")
        self.cfg = cfg
        self.exp_name = self.cfg.exp_name

        if env is None:
            self.env = get_instance_env(self.cfg.env_id)
        else:
            self.env = env

        logging.getLogger().setLevel(logging.INFO)
        self.seed = seed
        set_pkg_seed(self.seed, use_cuda=self.cfg.cuda)

        if not os.path.exists(self.exp_name):
            os.makedirs(self.exp_name)
        save_config_py(self.cfg, os.path.join(self.exp_name, 'policy_config.py'))

        action_space = self.env.action_space
        if isinstance(action_space, (gym.spaces.Discrete, gymnasium.spaces.Discrete)):
            action_shape = int(action_space.n)
        elif isinstance(action_space, (gym.spaces.Tuple, gymnasium.spaces.Tuple)):
            action_shape = get_hybrid_shape(action_space)
        else:
            action_shape = action_space.shape

        # Three types of value normalization are supported currently
        assert self.cfg.value_norm in ['popart', 'value_rescale', 'symlog', 'baseline']
        if model is None:
            if self.cfg.value_norm != 'popart':
                model = PPOFModel(
                    self.env.observation_space.shape,
                    action_shape,
                    action_space=self.cfg.action_space,
                    **self.cfg.model
                )
            else:
                model = PPOFModel(
                    self.env.observation_space.shape,
                    action_shape,
                    action_space=self.cfg.action_space,
                    popart_head=True,
                    **self.cfg.model
                )
        self.policy = PPOFPolicy(self.cfg, model=model)
        if policy_state_dict is not None:
            self.policy.load_state_dict(policy_state_dict)
        self.checkpoint_save_dir = os.path.join(self.exp_name, "ckpt")

    def train(
            self,
            step: int = int(1e7),
            collector_env_num: int = 4,
            evaluator_env_num: int = 4,
            n_iter_log_show: int = 500,
            n_iter_save_ckpt: int = 1000,
            context: Optional[str] = None,
            reward_model: Optional[str] = None,
            debug: bool = False,
            wandb_sweep: bool = False,
    ) -> TrainingReturn:
        """
        Overview:
            Train the agent with PPO algorithm for ``step`` iterations with ``collector_env_num`` collector \
            environments and ``evaluator_env_num`` evaluator environments. Information during training will be \
            recorded and saved by wandb.
        Arguments:
            - step (:obj:`int`): The total training environment steps of all collector environments. Default to 1e7.
            - collector_env_num (:obj:`int`): The number of collector environments. Default to 4.
            - evaluator_env_num (:obj:`int`): The number of evaluator environments. Default to 4.
            - n_iter_log_show (:obj:`int`): The frequency, in training iterations, of showing logs. Default to 500.
            - n_iter_save_ckpt (:obj:`int`): The frequency, in training iterations, of saving checkpoints. \
                Default to 1000.
            - context (:obj:`str`): The multi-process context of the environment manager. Default to None. \
                It can be specified as ``spawn``, ``fork`` or ``forkserver``.
            - reward_model (:obj:`str`): The reward model name. Default to None. This argument is not supported yet.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
            - wandb_sweep (:obj:`bool`): Whether to use wandb sweep, \
                which is a hyper-parameter optimization process for seeking the best configurations. \
                Default to False. If True, the wandb sweep id will be used as the experiment name.
        Returns:
            - (:obj:`TrainingReturn`): The training result, of which the attributions are:
                - wandb_url (:obj:`str`): The Weights & Biases (wandb) project url of the training experiment.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
            logging.debug(self.policy._model)
        # define env and policy
        collector_env = self._setup_env_manager(collector_env_num, context, debug, 'collector')
        evaluator_env = self._setup_env_manager(evaluator_env_num, context, debug, 'evaluator')

        if reward_model is not None:
            # self.reward_model = create_reward_model(reward_model, self.cfg.reward_model)
            pass

        with task.start(ctx=OnlineRLContext()):
            task.use(interaction_evaluator_ttorch(self.seed, self.policy, evaluator_env))
            task.use(CkptSaver(self.policy, save_dir=self.checkpoint_save_dir, train_freq=n_iter_save_ckpt))
            task.use(PPOFStepCollector(self.seed, self.policy, collector_env, self.cfg.n_sample))
            task.use(ppof_adv_estimator(self.policy))
            task.use(multistep_trainer(self.policy, log_freq=n_iter_log_show))
            task.use(
                wandb_online_logger(
                    metric_list=self.policy.monitor_vars(),
                    model=self.policy._model,
                    anonymous=True,
                    project_name=self.exp_name,
                    wandb_sweep=wandb_sweep,
                )
            )
            task.use(termination_checker(max_env_step=step))
            task.run()

        return TrainingReturn(wandb_url=task.ctx.wandb_url)

    def deploy(
            self,
            enable_save_replay: bool = False,
            concatenate_all_replay: bool = False,
            replay_save_path: str = None,
            seed: Optional[Union[int, List]] = None,
            debug: bool = False
    ) -> EvalReturn:
        """
        Overview:
            Deploy the agent with PPO algorithm by interacting with the environment, during which the replay video \
            can be saved if ``enable_save_replay`` is True. The evaluation result will be returned.
        Arguments:
            - enable_save_replay (:obj:`bool`): Whether to save the replay video. Default to False.
            - concatenate_all_replay (:obj:`bool`): Whether to concatenate all replay videos into one video. \
                Default to False. If ``enable_save_replay`` is False, this argument will be ignored. \
                If ``enable_save_replay`` is True and ``concatenate_all_replay`` is False, \
                the replay video of each episode will be saved separately.
            - replay_save_path (:obj:`str`): The path to save the replay video. Default to None. \
                If not specified, the video will be saved in ``exp_name/videos``.
            - seed (:obj:`Union[int, List]`): The random seed, which is set before running the program. \
                Default to None. If not specified, ``self.seed`` will be used. \
                If ``seed`` is an integer, the agent will be deployed once. \
                If ``seed`` is a list of integers, the agent will be deployed once for each seed in the list.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
        Returns:
            - (:obj:`EvalReturn`): The evaluation result, of which the attributions are:
                - eval_value (:obj:`np.float32`): The mean of evaluation return.
                - eval_value_std (:obj:`np.float32`): The standard deviation of evaluation return.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
        # define env and policy
        env = self.env.clone(caller='evaluator')

        if seed is not None and isinstance(seed, int):
            seeds = [seed]
        elif seed is not None and isinstance(seed, list):
            seeds = seed
        else:
            seeds = [self.seed]

        returns = []
        images = []
        if enable_save_replay:
            replay_save_path = os.path.join(self.exp_name, 'videos') if replay_save_path is None else replay_save_path
            env.enable_save_replay(replay_path=replay_save_path)
        else:
            logging.warning('No video would be generated during the deploy.')
            if concatenate_all_replay:
                logging.warning('concatenate_all_replay is set to False because enable_save_replay is False.')
                concatenate_all_replay = False

        forward_fn = single_env_forward_wrapper_ttorch(self.policy.eval, self.cfg.cuda)

        # reset first to make sure the env is in the initial state
        # env will be reset again in the main loop
        env.reset()

        for seed in seeds:
            env.seed(seed, dynamic_seed=False)
            return_ = 0.
            step = 0
            obs = env.reset()
            images.append(render(env)[None]) if concatenate_all_replay else None
            while True:
                action = forward_fn(obs)
                obs, rew, done, info = env.step(action)
                images.append(render(env)[None]) if concatenate_all_replay else None
                return_ += rew
                step += 1
                if done:
                    break
            logging.info(f'PPOF deploy is finished, final episode return with {step} steps is: {return_}')
            returns.append(return_)

        env.close()

        if concatenate_all_replay:
            images = np.concatenate(images, axis=0)
            import imageio
            imageio.mimwrite(os.path.join(replay_save_path, 'deploy.mp4'), images, fps=get_env_fps(env))

        return EvalReturn(eval_value=np.mean(returns), eval_value_std=np.std(returns))

    def collect_data(
            self,
            env_num: int = 8,
            save_data_path: Optional[str] = None,
            n_sample: Optional[int] = None,
            n_episode: Optional[int] = None,
            context: Optional[str] = None,
            debug: bool = False
    ) -> None:
        """
        Overview:
            Collect data with PPO algorithm for ``n_episode`` episodes with ``env_num`` collector environments. \
            The collected data will be saved in ``save_data_path`` if specified, otherwise it will be saved in \
            ``exp_name/demo_data``.
        Arguments:
            - env_num (:obj:`int`): The number of collector environments. Default to 8.
            - save_data_path (:obj:`str`): The path to save the collected data. Default to None. \
                If not specified, the data will be saved in ``exp_name/demo_data``.
            - n_sample (:obj:`int`): The number of samples to collect. Default to None. \
                If not specified, ``n_episode`` must be specified.
            - n_episode (:obj:`int`): The number of episodes to collect. Default to None. \
                If not specified, ``n_sample`` must be specified.
            - context (:obj:`str`): The multi-process context of the environment manager. Default to None. \
                It can be specified as ``spawn``, ``fork`` or ``forkserver``.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
        if n_episode is not None:
            raise NotImplementedError
        # define env and policy
        env = self._setup_env_manager(env_num, context, debug, 'collector')
        if save_data_path is None:
            save_data_path = os.path.join(self.exp_name, 'demo_data')

        # main execution task
        with task.start(ctx=OnlineRLContext()):
            task.use(PPOFStepCollector(self.seed, self.policy, env, n_sample))
            task.use(offline_data_saver(save_data_path, data_type='hdf5'))
            task.run(max_step=1)
        logging.info(
            f'PPOF collecting is finished, more than {n_sample} samples are collected and saved in `{save_data_path}`'
        )

    def batch_evaluate(
            self,
            env_num: int = 4,
            n_evaluator_episode: int = 4,
            context: Optional[str] = None,
            debug: bool = False,
    ) -> EvalReturn:
        """
        Overview:
            Evaluate the agent with PPO algorithm for ``n_evaluator_episode`` episodes with ``env_num`` evaluator \
            environments. The evaluation result will be returned.
            The difference between methods ``batch_evaluate`` and ``deploy`` is that ``batch_evaluate`` will create \
            multiple evaluator environments to evaluate the agent to get an average performance, while ``deploy`` \
            will only create one evaluator environment to evaluate the agent and save the replay video.
        Arguments:
            - env_num (:obj:`int`): The number of evaluator environments. Default to 4.
            - n_evaluator_episode (:obj:`int`): The number of episodes to evaluate. Default to 4.
            - context (:obj:`str`): The multi-process context of the environment manager. Default to None. \
                It can be specified as ``spawn``, ``fork`` or ``forkserver``.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
        Returns:
            - (:obj:`EvalReturn`): The evaluation result, of which the attributions are:
                - eval_value (:obj:`np.float32`): The mean of evaluation return.
                - eval_value_std (:obj:`np.float32`): The standard deviation of evaluation return.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
        # define env and policy
        env = self._setup_env_manager(env_num, context, debug, 'evaluator')

        # reset first to make sure the env is in the initial state
        # env will be reset again in the main loop
        env.launch()
        env.reset()

        # main execution task
        with task.start(ctx=OnlineRLContext()):
            task.use(interaction_evaluator_ttorch(
                self.seed,
                self.policy,
                env,
                n_evaluator_episode,
            ))
            task.run(max_step=1)

        return EvalReturn(eval_value=task.ctx.eval_value, eval_value_std=task.ctx.eval_value_std)

    def _setup_env_manager(
            self,
            env_num: int,
            context: Optional[str] = None,
            debug: bool = False,
            caller: str = 'collector'
    ) -> BaseEnvManagerV2:
        """
        Overview:
            Setup the environment manager. The environment manager is used to manage multiple environments.
        Arguments:
            - env_num (:obj:`int`): The number of environments.
            - context (:obj:`str`): The multi-process context of the environment manager. Default to None. \
                It can be specified as ``spawn``, ``fork`` or ``forkserver``.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
            - caller (:obj:`str`): The caller of the environment manager. Default to 'collector'.
        Returns:
            - (:obj:`BaseEnvManagerV2`): The environment manager.
        """
        assert caller in ['evaluator', 'collector']
        if debug:
            env_cls = BaseEnvManagerV2
            manager_cfg = env_cls.default_config()
        else:
            env_cls = SubprocessEnvManagerV2
            manager_cfg = env_cls.default_config()
        if context is not None:
            manager_cfg.context = context
        return env_cls([partial(self.env.clone, caller) for _ in range(env_num)], manager_cfg)

    @property
    def best(self) -> 'PPOF':
        """
        Overview:
            Load the best model from the checkpoint directory, \
            which is by default the file ``exp_name/ckpt/eval.pth.tar``. \
            The return value is the agent with the best model.
        Returns:
            - (:obj:`PPOF`): The agent with the best model.
        Examples:
            >>> agent = PPOF(env_id='LunarLander-v2')
            >>> agent.train()
            >>> agent = agent.best

        .. note::
            The best model is the model with the highest evaluation return. If this method is called, the current \
            model will be replaced by the best model.
        """

        best_model_file_path = os.path.join(self.checkpoint_save_dir, "eval.pth.tar")
        # Load best model if it exists
        if os.path.exists(best_model_file_path):
            policy_state_dict = torch.load(best_model_file_path, map_location=torch.device("cpu"))
            self.policy.learn_mode.load_state_dict(policy_state_dict)
        return self