ding.bonus.ppo_offpolicy

PPOOffPolicyAgent

Overview

Class of agent for training, evaluation and deployment of the reinforcement learning algorithm Proximal Policy Optimization (PPO) in an off-policy style. For more information about the system design of RL agents, please refer to https://di-engine-docs.readthedocs.io/en/latest/03_system/agent.html.

Interface: __init__, train, deploy, collect_data, batch_evaluate, best

supported_env_list = list(supported_env_cfg.keys()) (class attribute)

Overview

List of supported envs.

Examples:

    >>> from ding.bonus.ppo_offpolicy import PPOOffPolicyAgent
    >>> print(PPOOffPolicyAgent.supported_env_list)

best property

Overview

Load the best model from the checkpoint directory; by default this is the file ``exp_name/ckpt/eval.pth.tar``. The return value is the agent with the best model.

Returns:
- (:obj:`PPOOffPolicyAgent`): The agent with the best model.

Examples:

    >>> agent = PPOOffPolicyAgent(env_id='LunarLander-v2')
    >>> agent.train()
    >>> agent.best

.. note:: The best model is the model with the highest evaluation return. Accessing this property replaces the current model with the best model.

__init__(env_id=None, env=None, seed=0, exp_name=None, model=None, cfg=None, policy_state_dict=None)

Overview

Initialize agent for PPO (offpolicy) algorithm.

Arguments:
- env_id (:obj:`str`): The environment id, which is a registered environment name in gym or gymnasium. If ``env_id`` is not specified, ``env_id`` in ``cfg.env`` must be specified. If ``env_id`` is specified, ``env_id`` in ``cfg.env`` will be ignored. ``env_id`` should be one of the supported envs, which can be found in ``supported_env_list``.
- env (:obj:`BaseEnv`): The environment instance for training and evaluation. If ``env`` is not specified, ``env_id`` or ``cfg.env.env_id`` must be specified and will be used to create the environment instance. If ``env`` is specified, ``env_id`` and ``cfg.env.env_id`` will be ignored.
- seed (:obj:`int`): The random seed, which is set before running the program. Default to 0.
- exp_name (:obj:`str`): The name of this experiment, which will be used to create the folder that stores log data. Default to None. If not specified, the folder name will be ``env_id``-``algorithm``.
- model (:obj:`torch.nn.Module`): The model of the PPO (offpolicy) algorithm, which should be an instance of class :class:`ding.model.VAC`. If not specified, a default model will be generated according to the configuration.
- cfg (:obj:`Union[EasyDict, dict]`): The configuration of the PPO (offpolicy) algorithm, which is a dict. Default to None. If not specified, the default configuration will be used, which can be found in ``ding/config/example/PPOOffPolicy/gym_lunarlander_v2.py``.
- policy_state_dict (:obj:`str`): The path of a policy state dict saved by PyTorch in a local file. If specified, the policy will be loaded from this file. Default to None.

.. note:: An RL agent instance can be initialized in two basic ways. For example, suppose we have an environment with id ``LunarLander-v2`` registered in gym, and we want to train an agent with the PPO (offpolicy) algorithm using the default configuration. Then we can initialize the agent in the following ways:

    >>> agent = PPOOffPolicyAgent(env_id='LunarLander-v2')

or, if we want to specify the env_id in the configuration:

    >>> cfg = {'env': {'env_id': 'LunarLander-v2'}, 'policy': ...... }
    >>> agent = PPOOffPolicyAgent(cfg=cfg)

There are also other arguments to specify the agent when initializing. For example, if we want to specify the environment instance:

    >>> env = CustomizedEnv('LunarLander-v2')
    >>> agent = PPOOffPolicyAgent(cfg=cfg, env=env)

or, if we want to specify the model:

    >>> model = VAC(**cfg.policy.model)
    >>> agent = PPOOffPolicyAgent(cfg=cfg, model=model)

or, if we want to reload the policy from a saved policy state dict:

    >>> agent = PPOOffPolicyAgent(cfg=cfg, policy_state_dict='LunarLander-v2.pth.tar')

Make sure that the configuration is consistent with the saved policy state dict.
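The precedence between the ``env_id`` argument and ``cfg.env.env_id`` described above can be sketched without DI-engine. The helper ``resolve_env_id`` below is hypothetical and for illustration only; the library performs this resolution (and a supported-env check) internally:

```python
def resolve_env_id(env_id=None, cfg=None):
    """Hypothetical helper mirroring the documented precedence:
    an explicit env_id wins; otherwise cfg must carry env.env_id."""
    if env_id is None and cfg is None:
        raise ValueError("Please specify env_id or cfg.")
    if env_id is not None:
        # env_id in cfg.env, if present, is ignored here
        # (the library additionally asserts that the two agree).
        return env_id
    env_cfg = cfg.get("env", {})
    if "env_id" not in env_cfg:
        raise ValueError("Please specify env_id in cfg.")
    return env_cfg["env_id"]
```

Either initialization path thus ends with a single concrete environment id that the agent uses from then on.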

train(step=int(1e7), collector_env_num=None, evaluator_env_num=None, n_iter_save_ckpt=1000, context=None, debug=False, wandb_sweep=False)

Overview

Train the agent with the PPO (offpolicy) algorithm for ``step`` environment steps, using ``collector_env_num`` collector environments and ``evaluator_env_num`` evaluator environments. Information during training will be recorded and saved by wandb.

Arguments:
- step (:obj:`int`): The total training environment steps of all collector environments. Default to 1e7.
- collector_env_num (:obj:`int`): The number of collector environments. Default to None. If not specified, it will be set according to the configuration.
- evaluator_env_num (:obj:`int`): The number of evaluator environments. Default to None. If not specified, it will be set according to the configuration.
- n_iter_save_ckpt (:obj:`int`): Save a checkpoint every ``n_iter_save_ckpt`` training iterations. Default to 1000.
- context (:obj:`str`): The multi-process context of the environment manager. Default to None. It can be specified as ``spawn``, ``fork`` or ``forkserver``.
- debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. If set to True, the base environment manager will be used for easy debugging; otherwise, the subprocess environment manager will be used.
- wandb_sweep (:obj:`bool`): Whether to use wandb sweep, a hyper-parameter optimization process for seeking the best configuration. Default to False. If True, the wandb sweep id will be used as the experiment name.

Returns:
- (:obj:`TrainingReturn`): The training result, of which the attributes are:
    - wandb_url (:obj:`str`): The Weights & Biases (wandb) project url of the training experiment.
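The interplay between ``step`` (a budget in environment steps) and ``n_iter_save_ckpt`` (a cadence in training iterations) can be illustrated with a toy loop. This is plain Python, not DI-engine code; the function and its ``steps_per_iter`` parameter are invented for the sketch:

```python
def toy_train_loop(max_env_step, n_iter_save_ckpt, steps_per_iter=64):
    # Toy stand-in for the middleware pipeline: accumulate env steps,
    # record a "checkpoint" every n_iter_save_ckpt iterations.
    env_step, iteration, checkpoints = 0, 0, []
    while env_step < max_env_step:            # termination_checker role
        env_step += steps_per_iter            # StepCollector role
        iteration += 1                        # OffPolicyLearner role
        if iteration % n_iter_save_ckpt == 0:  # CkptSaver role
            checkpoints.append(iteration)
    return env_step, checkpoints
```

Note that training stops on the environment-step budget, so the final iteration count (and hence the number of checkpoints) depends on how many env steps each iteration collects.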

deploy(enable_save_replay=False, concatenate_all_replay=False, replay_save_path=None, seed=None, debug=False)

Overview

Deploy the agent with PPO (offpolicy) algorithm by interacting with the environment, during which the replay video can be saved if enable_save_replay is True. The evaluation result will be returned.

Arguments:
- enable_save_replay (:obj:`bool`): Whether to save the replay video. Default to False.
- concatenate_all_replay (:obj:`bool`): Whether to concatenate all replay videos into one video. Default to False. If ``enable_save_replay`` is False, this argument will be ignored. If ``enable_save_replay`` is True and ``concatenate_all_replay`` is False, the replay video of each episode will be saved separately.
- replay_save_path (:obj:`str`): The path to save the replay video. Default to None. If not specified, the video will be saved in ``exp_name/videos``.
- seed (:obj:`Union[int, List]`): The random seed, which is set before running the program. Default to None. If not specified, ``self.seed`` will be used. If ``seed`` is an integer, the agent will be deployed once. If ``seed`` is a list of integers, the agent will be deployed once for each seed in the list.
- debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. If set to True, the base environment manager will be used for easy debugging; otherwise, the subprocess environment manager will be used.

Returns:
- (:obj:`EvalReturn`): The evaluation result, of which the attributes are:
    - eval_value (:obj:`np.float32`): The mean of evaluation return.
    - eval_value_std (:obj:`np.float32`): The standard deviation of evaluation return.
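The seed handling and return aggregation described above can be sketched without DI-engine. Both helpers below are hypothetical and only mirror the documented behavior; the real method runs full episodes in the environment:

```python
from statistics import mean, pstdev

def normalize_seeds(seed, default_seed=0):
    # deploy() accepts an int (one run) or a list of ints (one run per seed);
    # None falls back to the agent's own seed.
    if seed is None:
        return [default_seed]
    return [seed] if isinstance(seed, int) else list(seed)

def aggregate_deploy_returns(episode_returns):
    # Mirrors EvalReturn: mean and (population) standard deviation
    # of the per-episode returns collected across the seeds.
    return mean(episode_returns), pstdev(episode_returns)
```

So passing ``seed=[0, 1, 2]`` yields three episodes, and the reported ``eval_value`` / ``eval_value_std`` summarize their returns.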

collect_data(env_num=8, save_data_path=None, n_sample=None, n_episode=None, context=None, debug=False)

Overview

Collect data with the PPO (offpolicy) algorithm for ``n_sample`` samples or ``n_episode`` episodes, using ``env_num`` collector environments. The collected data will be saved in ``save_data_path`` if specified; otherwise it will be saved in ``exp_name/demo_data``.

Arguments:
- env_num (:obj:`int`): The number of collector environments. Default to 8.
- save_data_path (:obj:`str`): The path to save the collected data. Default to None. If not specified, the data will be saved in ``exp_name/demo_data``.
- n_sample (:obj:`int`): The number of samples to collect. Default to None. If not specified, ``n_episode`` must be specified.
- n_episode (:obj:`int`): The number of episodes to collect. Default to None. If not specified, ``n_sample`` must be specified.
- context (:obj:`str`): The multi-process context of the environment manager. Default to None. It can be specified as ``spawn``, ``fork`` or ``forkserver``.
- debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. If set to True, the base environment manager will be used for easy debugging; otherwise, the subprocess environment manager will be used.
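The mutual requirement on ``n_sample`` and ``n_episode`` can be expressed as a small validation sketch. The helper is hypothetical and only encodes the rule stated above; the library performs its own checks internally (and, per the source, episode-based collection currently raises ``NotImplementedError``):

```python
def check_collect_args(n_sample=None, n_episode=None):
    # At least one of n_sample / n_episode must be given, as documented.
    if n_sample is None and n_episode is None:
        raise ValueError("Either n_sample or n_episode must be specified.")
    # Report which collection mode the arguments select.
    return ("sample", n_sample) if n_sample is not None else ("episode", n_episode)
```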

batch_evaluate(env_num=4, n_evaluator_episode=4, context=None, debug=False)

Overview

Evaluate the agent with PPO (offpolicy) algorithm for n_evaluator_episode episodes with env_num evaluator environments. The evaluation result will be returned. The difference between methods batch_evaluate and deploy is that batch_evaluate will create multiple evaluator environments to evaluate the agent to get an average performance, while deploy will only create one evaluator environment to evaluate the agent and save the replay video.

Arguments:
- env_num (:obj:`int`): The number of evaluator environments. Default to 4.
- n_evaluator_episode (:obj:`int`): The number of episodes to evaluate. Default to 4.
- context (:obj:`str`): The multi-process context of the environment manager. Default to None. It can be specified as ``spawn``, ``fork`` or ``forkserver``.
- debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. If set to True, the base environment manager will be used for easy debugging; otherwise, the subprocess environment manager will be used.

Returns:
- (:obj:`EvalReturn`): The evaluation result, of which the attributes are:
    - eval_value (:obj:`np.float32`): The mean of evaluation return.
    - eval_value_std (:obj:`np.float32`): The standard deviation of evaluation return.
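To see why ``batch_evaluate`` uses several environments while ``deploy`` uses one, consider how ``n_evaluator_episode`` episodes can be spread over ``env_num`` parallel evaluator environments. The round-robin assignment below is a simplifying assumption for illustration, not DI-engine's actual scheduling:

```python
def assign_episodes(env_num, n_evaluator_episode):
    # Round-robin sketch: episode i is handled by env i % env_num,
    # so each of the env_num parallel evaluators gets a near-equal share.
    per_env = [0] * env_num
    for i in range(n_evaluator_episode):
        per_env[i % env_num] += 1
    return per_env
```

With the defaults (``env_num=4``, ``n_evaluator_episode=4``), each evaluator environment runs one episode, and the returns are averaged into ``eval_value``.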

Full Source Code

../ding/bonus/ppo_offpolicy.py

from typing import Optional, Union, List
from ditk import logging
from easydict import EasyDict
import os
import numpy as np
import torch
import treetensor.torch as ttorch
from ding.framework import task, OnlineRLContext
from ding.framework.middleware import CkptSaver, final_ctx_saver, OffPolicyLearner, StepCollector, \
    wandb_online_logger, offline_data_saver, termination_checker, interaction_evaluator, gae_estimator
from ding.envs import BaseEnv
from ding.envs import setup_ding_env_manager
from ding.policy import PPOOffPolicy
from ding.utils import set_pkg_seed
from ding.utils import get_env_fps, render
from ding.config import save_config_py, compile_config
from ding.model import VAC
from ding.model import model_wrap
from ding.data import DequeBuffer
from ding.bonus.common import TrainingReturn, EvalReturn
from ding.config.example.PPOOffPolicy import supported_env_cfg
from ding.config.example.PPOOffPolicy import supported_env


class PPOOffPolicyAgent:
    """
    Overview:
        Class of agent for training, evaluation and deployment of the reinforcement learning algorithm \
        Proximal Policy Optimization (PPO) in an off-policy style.
        For more information about the system design of RL agents, please refer to \
        <https://di-engine-docs.readthedocs.io/en/latest/03_system/agent.html>.
    Interface:
        ``__init__``, ``train``, ``deploy``, ``collect_data``, ``batch_evaluate``, ``best``
    """
    supported_env_list = list(supported_env_cfg.keys())
    """
    Overview:
        List of supported envs.
    Examples:
        >>> from ding.bonus.ppo_offpolicy import PPOOffPolicyAgent
        >>> print(PPOOffPolicyAgent.supported_env_list)
    """

    def __init__(
            self,
            env_id: str = None,
            env: BaseEnv = None,
            seed: int = 0,
            exp_name: str = None,
            model: Optional[torch.nn.Module] = None,
            cfg: Optional[Union[EasyDict, dict]] = None,
            policy_state_dict: str = None,
    ) -> None:
        """
        Overview:
            Initialize agent for PPO (offpolicy) algorithm.
        Arguments:
            - env_id (:obj:`str`): The environment id, which is a registered environment name in gym or gymnasium. \
                If ``env_id`` is not specified, ``env_id`` in ``cfg.env`` must be specified. \
                If ``env_id`` is specified, ``env_id`` in ``cfg.env`` will be ignored. \
                ``env_id`` should be one of the supported envs, which can be found in ``supported_env_list``.
            - env (:obj:`BaseEnv`): The environment instance for training and evaluation. \
                If ``env`` is not specified, ``env_id`` or ``cfg.env.env_id`` must be specified. \
                ``env_id`` or ``cfg.env.env_id`` will be used to create environment instance. \
                If ``env`` is specified, ``env_id`` and ``cfg.env.env_id`` will be ignored.
            - seed (:obj:`int`): The random seed, which is set before running the program. \
                Default to 0.
            - exp_name (:obj:`str`): The name of this experiment, which will be used to create the folder to save \
                log data. Default to None. If not specified, the folder name will be ``env_id``-``algorithm``.
            - model (:obj:`torch.nn.Module`): The model of PPO (offpolicy) algorithm, \
                which should be an instance of class :class:`ding.model.VAC`. \
                If not specified, a default model will be generated according to the configuration.
            - cfg (:obj:`Union[EasyDict, dict]`): The configuration of PPO (offpolicy) algorithm, which is a dict. \
                Default to None. If not specified, the default configuration will be used. \
                The default configuration can be found in ``ding/config/example/PPOOffPolicy/gym_lunarlander_v2.py``.
            - policy_state_dict (:obj:`str`): The path of a policy state dict saved by PyTorch in a local file. \
                If specified, the policy will be loaded from this file. Default to None.

        .. note::
            An RL agent instance can be initialized in two basic ways. \
            For example, we have an environment with id ``LunarLander-v2`` registered in gym, \
            and we want to train an agent with PPO (offpolicy) algorithm with default configuration. \
            Then we can initialize the agent in the following ways:
                >>> agent = PPOOffPolicyAgent(env_id='LunarLander-v2')
            or, if we want to specify the env_id in the configuration:
                >>> cfg = {'env': {'env_id': 'LunarLander-v2'}, 'policy': ...... }
                >>> agent = PPOOffPolicyAgent(cfg=cfg)
            There are also other arguments to specify the agent when initializing.
            For example, if we want to specify the environment instance:
                >>> env = CustomizedEnv('LunarLander-v2')
                >>> agent = PPOOffPolicyAgent(cfg=cfg, env=env)
            or, if we want to specify the model:
                >>> model = VAC(**cfg.policy.model)
                >>> agent = PPOOffPolicyAgent(cfg=cfg, model=model)
            or, if we want to reload the policy from a saved policy state dict:
                >>> agent = PPOOffPolicyAgent(cfg=cfg, policy_state_dict='LunarLander-v2.pth.tar')
            Make sure that the configuration is consistent with the saved policy state dict.
        """

        assert env_id is not None or cfg is not None, "Please specify env_id or cfg."

        if cfg is not None and not isinstance(cfg, EasyDict):
            cfg = EasyDict(cfg)

        if env_id is not None:
            assert env_id in PPOOffPolicyAgent.supported_env_list, "Please use supported envs: {}".format(
                PPOOffPolicyAgent.supported_env_list
            )
            if cfg is None:
                cfg = supported_env_cfg[env_id]
            else:
                assert cfg.env.env_id == env_id, "env_id in cfg should be the same as env_id in args."
        else:
            assert hasattr(cfg.env, "env_id"), "Please specify env_id in cfg."
            assert cfg.env.env_id in PPOOffPolicyAgent.supported_env_list, "Please use supported envs: {}".format(
                PPOOffPolicyAgent.supported_env_list
            )
        default_policy_config = EasyDict({"policy": PPOOffPolicy.default_config()})
        default_policy_config.update(cfg)
        cfg = default_policy_config

        if exp_name is not None:
            cfg.exp_name = exp_name
        self.cfg = compile_config(cfg, policy=PPOOffPolicy)
        self.exp_name = self.cfg.exp_name
        if env is None:
            self.env = supported_env[cfg.env.env_id](cfg=cfg.env)
        else:
            assert isinstance(env, BaseEnv), "Please use BaseEnv as env data type."
            self.env = env

        logging.getLogger().setLevel(logging.INFO)
        self.seed = seed
        set_pkg_seed(self.seed, use_cuda=self.cfg.policy.cuda)
        if not os.path.exists(self.exp_name):
            os.makedirs(self.exp_name)
        save_config_py(self.cfg, os.path.join(self.exp_name, 'policy_config.py'))
        if model is None:
            model = VAC(**self.cfg.policy.model)
        self.buffer_ = DequeBuffer(size=self.cfg.policy.other.replay_buffer.replay_buffer_size)
        self.policy = PPOOffPolicy(self.cfg.policy, model=model)
        if policy_state_dict is not None:
            self.policy.learn_mode.load_state_dict(policy_state_dict)
        self.checkpoint_save_dir = os.path.join(self.exp_name, "ckpt")

    def train(
            self,
            step: int = int(1e7),
            collector_env_num: int = None,
            evaluator_env_num: int = None,
            n_iter_save_ckpt: int = 1000,
            context: Optional[str] = None,
            debug: bool = False,
            wandb_sweep: bool = False,
    ) -> TrainingReturn:
        """
        Overview:
            Train the agent with PPO (offpolicy) algorithm for ``step`` environment steps with \
            ``collector_env_num`` collector environments and ``evaluator_env_num`` evaluator environments. \
            Information during training will be recorded and saved by wandb.
        Arguments:
            - step (:obj:`int`): The total training environment steps of all collector environments. Default to 1e7.
            - collector_env_num (:obj:`int`): The collector environment number. Default to None. \
                If not specified, it will be set according to the configuration.
            - evaluator_env_num (:obj:`int`): The evaluator environment number. Default to None. \
                If not specified, it will be set according to the configuration.
            - n_iter_save_ckpt (:obj:`int`): Save a checkpoint every ``n_iter_save_ckpt`` training iterations. \
                Default to 1000.
            - context (:obj:`str`): The multi-process context of the environment manager. Default to None. \
                It can be specified as ``spawn``, ``fork`` or ``forkserver``.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
            - wandb_sweep (:obj:`bool`): Whether to use wandb sweep, \
                which is a hyper-parameter optimization process for seeking the best configurations. \
                Default to False. If True, the wandb sweep id will be used as the experiment name.
        Returns:
            - (:obj:`TrainingReturn`): The training result, of which the attributes are:
                - wandb_url (:obj:`str`): The Weights & Biases (wandb) project url of the training experiment.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
            logging.debug(self.policy._model)
        # define env and policy
        collector_env_num = collector_env_num if collector_env_num else self.cfg.env.collector_env_num
        evaluator_env_num = evaluator_env_num if evaluator_env_num else self.cfg.env.evaluator_env_num
        collector_env = setup_ding_env_manager(self.env, collector_env_num, context, debug, 'collector')
        evaluator_env = setup_ding_env_manager(self.env, evaluator_env_num, context, debug, 'evaluator')

        with task.start(ctx=OnlineRLContext()):
            task.use(
                interaction_evaluator(
                    self.cfg,
                    self.policy.eval_mode,
                    evaluator_env,
                    render=self.cfg.policy.eval.render if hasattr(self.cfg.policy.eval, "render") else False
                )
            )
            task.use(CkptSaver(policy=self.policy, save_dir=self.checkpoint_save_dir, train_freq=n_iter_save_ckpt))
            task.use(
                StepCollector(
                    self.cfg,
                    self.policy.collect_mode,
                    collector_env,
                    random_collect_size=self.cfg.policy.random_collect_size
                    if hasattr(self.cfg.policy, 'random_collect_size') else 0,
                )
            )
            task.use(gae_estimator(self.cfg, self.policy.collect_mode, self.buffer_))
            task.use(OffPolicyLearner(self.cfg, self.policy.learn_mode, self.buffer_))
            task.use(
                wandb_online_logger(
                    cfg=self.cfg.wandb_logger,
                    exp_config=self.cfg,
                    metric_list=self.policy._monitor_vars_learn(),
                    model=self.policy._model,
                    anonymous=True,
                    project_name=self.exp_name,
                    wandb_sweep=wandb_sweep,
                )
            )
            task.use(termination_checker(max_env_step=step))
            task.use(final_ctx_saver(name=self.exp_name))
            task.run()

        return TrainingReturn(wandb_url=task.ctx.wandb_url)

    def deploy(
            self,
            enable_save_replay: bool = False,
            concatenate_all_replay: bool = False,
            replay_save_path: str = None,
            seed: Optional[Union[int, List]] = None,
            debug: bool = False
    ) -> EvalReturn:
        """
        Overview:
            Deploy the agent with PPO (offpolicy) algorithm by interacting with the environment, \
            during which the replay video can be saved if ``enable_save_replay`` is True. \
            The evaluation result will be returned.
        Arguments:
            - enable_save_replay (:obj:`bool`): Whether to save the replay video. Default to False.
            - concatenate_all_replay (:obj:`bool`): Whether to concatenate all replay videos into one video. \
                Default to False. If ``enable_save_replay`` is False, this argument will be ignored. \
                If ``enable_save_replay`` is True and ``concatenate_all_replay`` is False, \
                the replay video of each episode will be saved separately.
            - replay_save_path (:obj:`str`): The path to save the replay video. Default to None. \
                If not specified, the video will be saved in ``exp_name/videos``.
            - seed (:obj:`Union[int, List]`): The random seed, which is set before running the program. \
                Default to None. If not specified, ``self.seed`` will be used. \
                If ``seed`` is an integer, the agent will be deployed once. \
                If ``seed`` is a list of integers, the agent will be deployed once for each seed in the list.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
        Returns:
            - (:obj:`EvalReturn`): The evaluation result, of which the attributes are:
                - eval_value (:obj:`np.float32`): The mean of evaluation return.
                - eval_value_std (:obj:`np.float32`): The standard deviation of evaluation return.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
        # define env and policy
        env = self.env.clone(caller='evaluator')

        if seed is not None and isinstance(seed, int):
            seeds = [seed]
        elif seed is not None and isinstance(seed, list):
            seeds = seed
        else:
            seeds = [self.seed]

        returns = []
        images = []
        if enable_save_replay:
            replay_save_path = os.path.join(self.exp_name, 'videos') if replay_save_path is None else replay_save_path
            env.enable_save_replay(replay_path=replay_save_path)
        else:
            logging.warning('No video would be generated during the deploy.')
            if concatenate_all_replay:
                logging.warning('concatenate_all_replay is set to False because enable_save_replay is False.')
                concatenate_all_replay = False

        def single_env_forward_wrapper(forward_fn, cuda=True):

            if self.cfg.policy.action_space == 'discrete':
                forward_fn = model_wrap(forward_fn, wrapper_name='argmax_sample').forward
            elif self.cfg.policy.action_space == 'continuous':
                forward_fn = model_wrap(forward_fn, wrapper_name='deterministic_sample').forward
            elif self.cfg.policy.action_space == 'hybrid':
                forward_fn = model_wrap(forward_fn, wrapper_name='hybrid_deterministic_argmax_sample').forward
            elif self.cfg.policy.action_space == 'general':
                forward_fn = model_wrap(forward_fn, wrapper_name='base').forward
            else:
                raise NotImplementedError

            def _forward(obs):
                # unsqueeze means add batch dim, i.e. (O, ) -> (1, O)
                obs = ttorch.as_tensor(obs).unsqueeze(0)
                if cuda and torch.cuda.is_available():
                    obs = obs.cuda()
                action = forward_fn(obs, mode='compute_actor')["action"]
                # squeeze means delete batch dim, i.e. (1, A) -> (A, )
                action = action.squeeze(0).detach().cpu().numpy()
                return action

            return _forward

        forward_fn = single_env_forward_wrapper(self.policy._model, self.cfg.policy.cuda)

        # reset first to make sure the env is in the initial state
        # env will be reset again in the main loop
        env.reset()

        for seed in seeds:
            env.seed(seed, dynamic_seed=False)
            return_ = 0.
            step = 0
            obs = env.reset()
            images.append(render(env)[None]) if concatenate_all_replay else None
            while True:
                action = forward_fn(obs)
                obs, rew, done, info = env.step(action)
                images.append(render(env)[None]) if concatenate_all_replay else None
                return_ += rew
                step += 1
                if done:
                    break
            logging.info(f'PPO (offpolicy) deploy is finished, final episode return with {step} steps is: {return_}')
            returns.append(return_)

        env.close()

        if concatenate_all_replay:
            images = np.concatenate(images, axis=0)
            import imageio
            imageio.mimwrite(os.path.join(replay_save_path, 'deploy.mp4'), images, fps=get_env_fps(env))

        return EvalReturn(eval_value=np.mean(returns), eval_value_std=np.std(returns))

    def collect_data(
            self,
            env_num: int = 8,
            save_data_path: Optional[str] = None,
            n_sample: Optional[int] = None,
            n_episode: Optional[int] = None,
            context: Optional[str] = None,
            debug: bool = False
    ) -> None:
        """
        Overview:
            Collect data with PPO (offpolicy) algorithm for ``n_episode`` episodes \
            with ``env_num`` collector environments. \
            The collected data will be saved in ``save_data_path`` if specified, otherwise it will be saved in \
            ``exp_name/demo_data``.
        Arguments:
            - env_num (:obj:`int`): The number of collector environments. Default to 8.
            - save_data_path (:obj:`str`): The path to save the collected data. Default to None. \
                If not specified, the data will be saved in ``exp_name/demo_data``.
            - n_sample (:obj:`int`): The number of samples to collect. Default to None. \
                If not specified, ``n_episode`` must be specified.
            - n_episode (:obj:`int`): The number of episodes to collect. Default to None. \
                If not specified, ``n_sample`` must be specified.
            - context (:obj:`str`): The multi-process context of the environment manager. Default to None. \
                It can be specified as ``spawn``, ``fork`` or ``forkserver``.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
        if n_episode is not None:
            raise NotImplementedError
        # define env and policy
        env_num = env_num if env_num else self.cfg.env.collector_env_num
        env = setup_ding_env_manager(self.env, env_num, context, debug, 'collector')

        if save_data_path is None:
            save_data_path = os.path.join(self.exp_name, 'demo_data')

        # main execution task
        with task.start(ctx=OnlineRLContext()):
            task.use(
                StepCollector(
                    self.cfg, self.policy.collect_mode, env, random_collect_size=self.cfg.policy.random_collect_size
                )
            )
            task.use(offline_data_saver(save_data_path, data_type='hdf5'))
            task.run(max_step=1)
        logging.info(
            f'PPOOffPolicy collecting is finished, more than {n_sample} \
            samples are collected and saved in `{save_data_path}`'
        )

    def batch_evaluate(
            self,
            env_num: int = 4,
            n_evaluator_episode: int = 4,
            context: Optional[str] = None,
            debug: bool = False
    ) -> EvalReturn:
        """
        Overview:
            Evaluate the agent with PPO (offpolicy) algorithm for ``n_evaluator_episode`` episodes \
            with ``env_num`` evaluator environments. The evaluation result will be returned.
            The difference between methods ``batch_evaluate`` and ``deploy`` is that ``batch_evaluate`` will create \
            multiple evaluator environments to evaluate the agent to get an average performance, while ``deploy`` \
            will only create one evaluator environment to evaluate the agent and save the replay video.
        Arguments:
            - env_num (:obj:`int`): The number of evaluator environments. Default to 4.
            - n_evaluator_episode (:obj:`int`): The number of episodes to evaluate. Default to 4.
            - context (:obj:`str`): The multi-process context of the environment manager. Default to None. \
                It can be specified as ``spawn``, ``fork`` or ``forkserver``.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
        Returns:
            - (:obj:`EvalReturn`): The evaluation result, of which the attributes are:
                - eval_value (:obj:`np.float32`): The mean of evaluation return.
                - eval_value_std (:obj:`np.float32`): The standard deviation of evaluation return.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
        # define env and policy
        env_num = env_num if env_num else self.cfg.env.evaluator_env_num
        env = setup_ding_env_manager(self.env, env_num, context, debug, 'evaluator')

        # reset first to make sure the env is in the initial state
        # env will be reset again in the main loop
        env.launch()
        env.reset()

        evaluate_cfg = self.cfg
        evaluate_cfg.env.n_evaluator_episode = n_evaluator_episode

        # main execution task
        with task.start(ctx=OnlineRLContext()):
            task.use(interaction_evaluator(self.cfg, self.policy.eval_mode, env))
            task.run(max_step=1)

        return EvalReturn(eval_value=task.ctx.eval_value, eval_value_std=task.ctx.eval_value_std)

    @property
    def best(self) -> 'PPOOffPolicyAgent':
        """
        Overview:
            Load the best model from the checkpoint directory, \
            which by default is the file ``exp_name/ckpt/eval.pth.tar``. \
            The return value is the agent with the best model.
        Returns:
            - (:obj:`PPOOffPolicyAgent`): The agent with the best model.
        Examples:
            >>> agent = PPOOffPolicyAgent(env_id='LunarLander-v2')
            >>> agent.train()
            >>> agent.best

        .. note::
            The best model is the model with the highest evaluation return. Accessing this property replaces \
            the current model with the best model.
        """

        best_model_file_path = os.path.join(self.checkpoint_save_dir, "eval.pth.tar")
        # Load best model if it exists
        if os.path.exists(best_model_file_path):
            policy_state_dict = torch.load(best_model_file_path, map_location=torch.device("cpu"))
            self.policy.learn_mode.load_state_dict(policy_state_dict)
        return self