ding.bonus.pg

PGAgent

Overview

Class of agent for training, evaluation and deployment of the reinforcement learning algorithm Policy Gradient (PG). For more information about the system design of RL agents, please refer to https://di-engine-docs.readthedocs.io/en/latest/03_system/agent.html.

Interface: ``__init__``, ``train``, ``deploy``, ``collect_data``, ``batch_evaluate``, ``best``

supported_env_list = list(supported_env_cfg.keys()) (class attribute)

Overview

List of supported envs.

Examples:
    >>> from ding.bonus.pg import PGAgent
    >>> print(PGAgent.supported_env_list)

best property

Overview

Load the best model from the checkpoint directory; by default it is saved at ``exp_name/ckpt/eval.pth.tar``. The return value is the agent with the best model.

Returns:
    - (:obj:`PGAgent`): The agent with the best model.
Examples:
    >>> agent = PGAgent(env_id='LunarLanderContinuous-v2')
    >>> agent.train()
    >>> agent = agent.best

.. note:: The best model is the model with the highest evaluation return. If this property is accessed, the current model will be replaced by the best model.

__init__(env_id=None, env=None, seed=0, exp_name=None, model=None, cfg=None, policy_state_dict=None)

Overview

Initialize agent for PG algorithm.

Arguments:
    - env_id (:obj:`str`): The environment id, which is a registered environment name in gym or gymnasium. If ``env_id`` is not specified, ``env_id`` in ``cfg.env`` must be specified. If ``env_id`` is specified, ``env_id`` in ``cfg.env`` will be ignored. ``env_id`` should be one of the supported envs, which can be found in ``supported_env_list``.
    - env (:obj:`BaseEnv`): The environment instance for training and evaluation. If ``env`` is not specified, ``env_id`` or ``cfg.env.env_id`` must be specified, and will be used to create the environment instance. If ``env`` is specified, ``env_id`` and ``cfg.env.env_id`` will be ignored.
    - seed (:obj:`int`): The random seed, which is set before running the program. Default to 0.
    - exp_name (:obj:`str`): The name of this experiment, which will be used to create the folder to save log data. Default to None. If not specified, the folder name will be ``env_id``-``algorithm``.
    - model (:obj:`torch.nn.Module`): The model of PG algorithm, which should be an instance of class :class:`ding.model.PG`. If not specified, a default model will be generated according to the configuration.
    - cfg (:obj:`Union[EasyDict, dict]`): The configuration of PG algorithm, which is a dict. Default to None. If not specified, the default configuration will be used. The default configuration can be found in ``ding/config/example/PG/gym_lunarlander_v2.py``.
    - policy_state_dict (:obj:`str`): The path of a policy state dict saved by PyTorch in a local file. If specified, the policy will be loaded from this file. Default to None.

.. note:: An RL agent instance can be initialized in two basic ways. For example, suppose we have an environment with id ``LunarLanderContinuous-v2`` registered in gym, and we want to train an agent with the PG algorithm using the default configuration. Then we can initialize the agent in the following ways:
    >>> agent = PGAgent(env_id='LunarLanderContinuous-v2')
or, if we want to specify the env_id in the configuration:
    >>> cfg = {'env': {'env_id': 'LunarLanderContinuous-v2'}, 'policy': ...... }
    >>> agent = PGAgent(cfg=cfg)
There are also other arguments to specify the agent when initializing. For example, if we want to specify the environment instance:
    >>> env = CustomizedEnv('LunarLanderContinuous-v2')
    >>> agent = PGAgent(cfg=cfg, env=env)
or, if we want to specify the model:
    >>> model = PG(**cfg.policy.model)
    >>> agent = PGAgent(cfg=cfg, model=model)
or, if we want to reload the policy from a saved policy state dict:
    >>> agent = PGAgent(cfg=cfg, policy_state_dict='LunarLanderContinuous-v2.pth.tar')
Make sure that the configuration is consistent with the saved policy state dict.

train(step=int(1e7), collector_env_num=None, evaluator_env_num=None, n_iter_save_ckpt=1000, context=None, debug=False, wandb_sweep=False)

Overview

Train the agent with PG algorithm for step iterations with collector_env_num collector environments and evaluator_env_num evaluator environments. Information during training will be recorded and saved by wandb.

Arguments:
    - step (:obj:`int`): The total training environment steps of all collector environments. Default to 1e7.
    - collector_env_num (:obj:`int`): The collector environment number. Default to None. If not specified, it will be set according to the configuration.
    - evaluator_env_num (:obj:`int`): The evaluator environment number. Default to None. If not specified, it will be set according to the configuration.
    - n_iter_save_ckpt (:obj:`int`): The frequency of saving checkpoints, in training iterations. Default to 1000.
    - context (:obj:`str`): The multi-process context of the environment manager. Default to None. It can be specified as ``spawn``, ``fork`` or ``forkserver``.
    - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. If set True, base environment manager will be used for easy debugging. Otherwise, subprocess environment manager will be used.
    - wandb_sweep (:obj:`bool`): Whether to use wandb sweep, which is a hyper-parameter optimization process for seeking the best configuration. Default to False. If True, the wandb sweep id will be used as the experiment name.
Returns:
    - (:obj:`TrainingReturn`): The training result, of which the attributes are:
        - wandb_url (:obj:`str`): The Weights & Biases (wandb) project url of the training experiment.
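Putting the arguments above together, a minimal training run might look like the following sketch (assuming DI-engine and wandb are installed; the experiment name and step budget are illustrative, not defaults):

```python
from ding.bonus.pg import PGAgent

# Train PG on LunarLanderContinuous-v2 with a small step budget for illustration.
# 'lunarlander-pg' is a hypothetical experiment folder name.
agent = PGAgent(env_id='LunarLanderContinuous-v2', exp_name='lunarlander-pg')
training_return = agent.train(step=100000, n_iter_save_ckpt=500)

# The returned TrainingReturn carries the wandb run url for this experiment.
print(training_return.wandb_url)
```

Checkpoints are written under ``lunarlander-pg/ckpt`` every 500 training iterations in this sketch.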

deploy(enable_save_replay=False, concatenate_all_replay=False, replay_save_path=None, seed=None, debug=False)

Overview

Deploy the agent with PG algorithm by interacting with the environment, during which the replay video can be saved if enable_save_replay is True. The evaluation result will be returned.

Arguments:
    - enable_save_replay (:obj:`bool`): Whether to save the replay video. Default to False.
    - concatenate_all_replay (:obj:`bool`): Whether to concatenate all replay videos into one video. Default to False. If ``enable_save_replay`` is False, this argument will be ignored. If ``enable_save_replay`` is True and ``concatenate_all_replay`` is False, the replay video of each episode will be saved separately.
    - replay_save_path (:obj:`str`): The path to save the replay video. Default to None. If not specified, the video will be saved in ``exp_name/videos``.
    - seed (:obj:`Union[int, List]`): The random seed, which is set before running the program. Default to None. If not specified, ``self.seed`` will be used. If ``seed`` is an integer, the agent will be deployed once. If ``seed`` is a list of integers, the agent will be deployed once for each seed in the list.
    - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. If set True, base environment manager will be used for easy debugging. Otherwise, subprocess environment manager will be used.
Returns:
    - (:obj:`EvalReturn`): The evaluation result, of which the attributes are:
        - eval_value (:obj:`np.float32`): The mean of evaluation return.
        - eval_value_std (:obj:`np.float32`): The standard deviation of evaluation return.
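As a sketch of the seed-list behaviour described above, deploying over several seeds and saving one concatenated replay video might look like this (the experiment name is hypothetical; a trained policy is assumed to be loaded):

```python
from ding.bonus.pg import PGAgent

# 'lunarlander-pg' is a hypothetical experiment name; in practice you would
# also pass policy_state_dict to load a trained policy before deploying.
agent = PGAgent(env_id='LunarLanderContinuous-v2', exp_name='lunarlander-pg')

# Deploy once per seed in the list, saving a single concatenated video
# under exp_name/videos (the default, since replay_save_path is omitted).
eval_return = agent.deploy(
    enable_save_replay=True,
    concatenate_all_replay=True,
    seed=[0, 1, 2],
)
print(eval_return.eval_value, eval_return.eval_value_std)
```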

collect_data(env_num=8, save_data_path=None, n_sample=None, n_episode=None, context=None, debug=False)

Overview

Collect data with PG algorithm for n_episode episodes with env_num collector environments. The collected data will be saved in save_data_path if specified, otherwise it will be saved in exp_name/demo_data.

Arguments:
    - env_num (:obj:`int`): The number of collector environments. Default to 8.
    - save_data_path (:obj:`str`): The path to save the collected data. Default to None. If not specified, the data will be saved in ``exp_name/demo_data``.
    - n_sample (:obj:`int`): The number of samples to collect. Default to None. If not specified, ``n_episode`` must be specified.
    - n_episode (:obj:`int`): The number of episodes to collect. Default to None. If not specified, ``n_sample`` must be specified.
    - context (:obj:`str`): The multi-process context of the environment manager. Default to None. It can be specified as ``spawn``, ``fork`` or ``forkserver``.
    - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. If set True, base environment manager will be used for easy debugging. Otherwise, subprocess environment manager will be used.
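A minimal collection call might look like the sketch below. Note that in the source listing further down, passing ``n_episode`` currently raises ``NotImplementedError``, so only ``n_sample`` is usable for now; the experiment name and save path here are illustrative:

```python
from ding.bonus.pg import PGAgent

agent = PGAgent(env_id='LunarLanderContinuous-v2', exp_name='lunarlander-pg')

# Collect at least 1000 samples with 8 collector environments.
# Per the source, the data is saved as an HDF5 file under save_data_path
# (here a hypothetical local directory).
agent.collect_data(env_num=8, n_sample=1000, save_data_path='./lunarlander_demo_data')
```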

batch_evaluate(env_num=4, n_evaluator_episode=4, context=None, debug=False)

Overview

Evaluate the agent with PG algorithm for n_evaluator_episode episodes with env_num evaluator environments. The evaluation result will be returned. The difference between methods batch_evaluate and deploy is that batch_evaluate will create multiple evaluator environments to evaluate the agent to get an average performance, while deploy will only create one evaluator environment to evaluate the agent and save the replay video.

Arguments:
    - env_num (:obj:`int`): The number of evaluator environments. Default to 4.
    - n_evaluator_episode (:obj:`int`): The number of episodes to evaluate. Default to 4.
    - context (:obj:`str`): The multi-process context of the environment manager. Default to None. It can be specified as ``spawn``, ``fork`` or ``forkserver``.
    - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. If set True, base environment manager will be used for easy debugging. Otherwise, subprocess environment manager will be used.
Returns:
    - (:obj:`EvalReturn`): The evaluation result, of which the attributes are:
        - eval_value (:obj:`np.float32`): The mean of evaluation return.
        - eval_value_std (:obj:`np.float32`): The standard deviation of evaluation return.
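Combining this with the checkpoint layout described under ``best``, a typical evaluate-from-checkpoint sketch (the checkpoint path below assumes a prior training run named ``lunarlander-pg``; adjust to your own ``exp_name``):

```python
from ding.bonus.pg import PGAgent

# Load a trained policy from the best-model checkpoint, then evaluate
# 8 episodes spread across 4 parallel evaluator environments.
agent = PGAgent(
    env_id='LunarLanderContinuous-v2',
    policy_state_dict='./lunarlander-pg/ckpt/eval.pth.tar',  # hypothetical path
)
eval_return = agent.batch_evaluate(env_num=4, n_evaluator_episode=8)
print(eval_return.eval_value, eval_return.eval_value_std)
```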

Full Source Code

../ding/bonus/pg.py

```python
from typing import Optional, Union, List
from ditk import logging
from easydict import EasyDict
import os
import numpy as np
import torch
import treetensor.torch as ttorch
from ding.framework import task, OnlineRLContext
from ding.framework.middleware import CkptSaver, trainer, \
    wandb_online_logger, offline_data_saver, termination_checker, interaction_evaluator, StepCollector, \
    montecarlo_return_estimator, final_ctx_saver, EpisodeCollector
from ding.envs import BaseEnv
from ding.envs import setup_ding_env_manager
from ding.policy import PGPolicy
from ding.utils import set_pkg_seed
from ding.utils import get_env_fps, render
from ding.config import save_config_py, compile_config
from ding.model import PG
from ding.bonus.common import TrainingReturn, EvalReturn
from ding.config.example.PG import supported_env_cfg
from ding.config.example.PG import supported_env


class PGAgent:
    """
    Overview:
        Class of agent for training, evaluation and deployment of the reinforcement learning algorithm \
        Policy Gradient (PG). \
        For more information about the system design of RL agents, please refer to \
        <https://di-engine-docs.readthedocs.io/en/latest/03_system/agent.html>.
    Interface:
        ``__init__``, ``train``, ``deploy``, ``collect_data``, ``batch_evaluate``, ``best``
    """
    supported_env_list = list(supported_env_cfg.keys())
    """
    Overview:
        List of supported envs.
    Examples:
        >>> from ding.bonus.pg import PGAgent
        >>> print(PGAgent.supported_env_list)
    """

    def __init__(
            self,
            env_id: str = None,
            env: BaseEnv = None,
            seed: int = 0,
            exp_name: str = None,
            model: Optional[torch.nn.Module] = None,
            cfg: Optional[Union[EasyDict, dict]] = None,
            policy_state_dict: str = None,
    ) -> None:
        """
        Overview:
            Initialize agent for PG algorithm.
        Arguments:
            - env_id (:obj:`str`): The environment id, which is a registered environment name in gym or gymnasium. \
                If ``env_id`` is not specified, ``env_id`` in ``cfg.env`` must be specified. \
                If ``env_id`` is specified, ``env_id`` in ``cfg.env`` will be ignored. \
                ``env_id`` should be one of the supported envs, which can be found in ``supported_env_list``.
            - env (:obj:`BaseEnv`): The environment instance for training and evaluation. \
                If ``env`` is not specified, ``env_id`` or ``cfg.env.env_id`` must be specified. \
                ``env_id`` or ``cfg.env.env_id`` will be used to create environment instance. \
                If ``env`` is specified, ``env_id`` and ``cfg.env.env_id`` will be ignored.
            - seed (:obj:`int`): The random seed, which is set before running the program. \
                Default to 0.
            - exp_name (:obj:`str`): The name of this experiment, which will be used to create the folder to save \
                log data. Default to None. If not specified, the folder name will be ``env_id``-``algorithm``.
            - model (:obj:`torch.nn.Module`): The model of PG algorithm, which should be an instance of class \
                :class:`ding.model.PG`. \
                If not specified, a default model will be generated according to the configuration.
            - cfg (:obj:`Union[EasyDict, dict]`): The configuration of PG algorithm, which is a dict. \
                Default to None. If not specified, the default configuration will be used. \
                The default configuration can be found in ``ding/config/example/PG/gym_lunarlander_v2.py``.
            - policy_state_dict (:obj:`str`): The path of a policy state dict saved by PyTorch in a local file. \
                If specified, the policy will be loaded from this file. Default to None.

        .. note::
            An RL agent instance can be initialized in two basic ways. \
            For example, we have an environment with id ``LunarLanderContinuous-v2`` registered in gym, \
            and we want to train an agent with PG algorithm with default configuration. \
            Then we can initialize the agent in the following ways:
                >>> agent = PGAgent(env_id='LunarLanderContinuous-v2')
            or, if we want to specify the env_id in the configuration:
                >>> cfg = {'env': {'env_id': 'LunarLanderContinuous-v2'}, 'policy': ...... }
                >>> agent = PGAgent(cfg=cfg)
            There are also other arguments to specify the agent when initializing.
            For example, if we want to specify the environment instance:
                >>> env = CustomizedEnv('LunarLanderContinuous-v2')
                >>> agent = PGAgent(cfg=cfg, env=env)
            or, if we want to specify the model:
                >>> model = PG(**cfg.policy.model)
                >>> agent = PGAgent(cfg=cfg, model=model)
            or, if we want to reload the policy from a saved policy state dict:
                >>> agent = PGAgent(cfg=cfg, policy_state_dict='LunarLanderContinuous-v2.pth.tar')
            Make sure that the configuration is consistent with the saved policy state dict.
        """

        assert env_id is not None or cfg is not None, "Please specify env_id or cfg."

        if cfg is not None and not isinstance(cfg, EasyDict):
            cfg = EasyDict(cfg)

        if env_id is not None:
            assert env_id in PGAgent.supported_env_list, "Please use supported envs: {}".format(
                PGAgent.supported_env_list
            )
            if cfg is None:
                cfg = supported_env_cfg[env_id]
            else:
                assert cfg.env.env_id == env_id, "env_id in cfg should be the same as env_id in args."
        else:
            assert hasattr(cfg.env, "env_id"), "Please specify env_id in cfg."
            assert cfg.env.env_id in PGAgent.supported_env_list, "Please use supported envs: {}".format(
                PGAgent.supported_env_list
            )
        default_policy_config = EasyDict({"policy": PGPolicy.default_config()})
        default_policy_config.update(cfg)
        cfg = default_policy_config

        if exp_name is not None:
            cfg.exp_name = exp_name
        self.cfg = compile_config(cfg, policy=PGPolicy)
        self.exp_name = self.cfg.exp_name
        if env is None:
            self.env = supported_env[cfg.env.env_id](cfg=cfg.env)
        else:
            assert isinstance(env, BaseEnv), "Please use BaseEnv as env data type."
            self.env = env

        logging.getLogger().setLevel(logging.INFO)
        self.seed = seed
        set_pkg_seed(self.seed, use_cuda=self.cfg.policy.cuda)
        if not os.path.exists(self.exp_name):
            os.makedirs(self.exp_name)
        save_config_py(self.cfg, os.path.join(self.exp_name, 'policy_config.py'))
        if model is None:
            model = PG(**self.cfg.policy.model)
        self.policy = PGPolicy(self.cfg.policy, model=model)
        if policy_state_dict is not None:
            self.policy.learn_mode.load_state_dict(policy_state_dict)
        self.checkpoint_save_dir = os.path.join(self.exp_name, "ckpt")

    def train(
            self,
            step: int = int(1e7),
            collector_env_num: int = None,
            evaluator_env_num: int = None,
            n_iter_save_ckpt: int = 1000,
            context: Optional[str] = None,
            debug: bool = False,
            wandb_sweep: bool = False,
    ) -> TrainingReturn:
        """
        Overview:
            Train the agent with PG algorithm for ``step`` iterations with ``collector_env_num`` collector \
            environments and ``evaluator_env_num`` evaluator environments. Information during training will be \
            recorded and saved by wandb.
        Arguments:
            - step (:obj:`int`): The total training environment steps of all collector environments. Default to 1e7.
            - collector_env_num (:obj:`int`): The collector environment number. Default to None. \
                If not specified, it will be set according to the configuration.
            - evaluator_env_num (:obj:`int`): The evaluator environment number. Default to None. \
                If not specified, it will be set according to the configuration.
            - n_iter_save_ckpt (:obj:`int`): The frequency of saving checkpoint every training iteration. \
                Default to 1000.
            - context (:obj:`str`): The multi-process context of the environment manager. Default to None. \
                It can be specified as ``spawn``, ``fork`` or ``forkserver``.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
            - wandb_sweep (:obj:`bool`): Whether to use wandb sweep, \
                which is a hyper-parameter optimization process for seeking the best configurations. \
                Default to False. If True, the wandb sweep id will be used as the experiment name.
        Returns:
            - (:obj:`TrainingReturn`): The training result, of which the attributes are:
                - wandb_url (:obj:`str`): The Weights & Biases (wandb) project url of the training experiment.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
            logging.debug(self.policy._model)
        # define env and policy
        collector_env_num = collector_env_num if collector_env_num else self.cfg.env.collector_env_num
        evaluator_env_num = evaluator_env_num if evaluator_env_num else self.cfg.env.evaluator_env_num
        collector_env = setup_ding_env_manager(self.env, collector_env_num, context, debug, 'collector')
        evaluator_env = setup_ding_env_manager(self.env, evaluator_env_num, context, debug, 'evaluator')

        with task.start(ctx=OnlineRLContext()):
            task.use(
                interaction_evaluator(
                    self.cfg,
                    self.policy.eval_mode,
                    evaluator_env,
                    render=self.cfg.policy.eval.render if hasattr(self.cfg.policy.eval, "render") else False
                )
            )
            task.use(CkptSaver(policy=self.policy, save_dir=self.checkpoint_save_dir, train_freq=n_iter_save_ckpt))
            task.use(EpisodeCollector(self.cfg, self.policy.collect_mode, collector_env))
            task.use(montecarlo_return_estimator(self.policy))
            task.use(trainer(self.cfg, self.policy.learn_mode))
            task.use(
                wandb_online_logger(
                    metric_list=self.policy._monitor_vars_learn(),
                    model=self.policy._model,
                    anonymous=True,
                    project_name=self.exp_name,
                    wandb_sweep=wandb_sweep,
                )
            )
            task.use(termination_checker(max_env_step=step))
            task.use(final_ctx_saver(name=self.exp_name))
            task.run()

        return TrainingReturn(wandb_url=task.ctx.wandb_url)

    def deploy(
            self,
            enable_save_replay: bool = False,
            concatenate_all_replay: bool = False,
            replay_save_path: str = None,
            seed: Optional[Union[int, List]] = None,
            debug: bool = False
    ) -> EvalReturn:
        """
        Overview:
            Deploy the agent with PG algorithm by interacting with the environment, during which the replay video \
            can be saved if ``enable_save_replay`` is True. The evaluation result will be returned.
        Arguments:
            - enable_save_replay (:obj:`bool`): Whether to save the replay video. Default to False.
            - concatenate_all_replay (:obj:`bool`): Whether to concatenate all replay videos into one video. \
                Default to False. If ``enable_save_replay`` is False, this argument will be ignored. \
                If ``enable_save_replay`` is True and ``concatenate_all_replay`` is False, \
                the replay video of each episode will be saved separately.
            - replay_save_path (:obj:`str`): The path to save the replay video. Default to None. \
                If not specified, the video will be saved in ``exp_name/videos``.
            - seed (:obj:`Union[int, List]`): The random seed, which is set before running the program. \
                Default to None. If not specified, ``self.seed`` will be used. \
                If ``seed`` is an integer, the agent will be deployed once. \
                If ``seed`` is a list of integers, the agent will be deployed once for each seed in the list.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
        Returns:
            - (:obj:`EvalReturn`): The evaluation result, of which the attributes are:
                - eval_value (:obj:`np.float32`): The mean of evaluation return.
                - eval_value_std (:obj:`np.float32`): The standard deviation of evaluation return.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
        # define env and policy
        env = self.env.clone(caller='evaluator')

        if seed is not None and isinstance(seed, int):
            seeds = [seed]
        elif seed is not None and isinstance(seed, list):
            seeds = seed
        else:
            seeds = [self.seed]

        returns = []
        images = []
        if enable_save_replay:
            replay_save_path = os.path.join(self.exp_name, 'videos') if replay_save_path is None else replay_save_path
            env.enable_save_replay(replay_path=replay_save_path)
        else:
            logging.warning('No video would be generated during the deploy.')
            if concatenate_all_replay:
                logging.warning('concatenate_all_replay is set to False because enable_save_replay is False.')
                concatenate_all_replay = False

        def single_env_forward_wrapper(forward_fn, cuda=True):

            def _forward(obs):
                # unsqueeze means add batch dim, i.e. (O, ) -> (1, O)
                obs = ttorch.as_tensor(obs).unsqueeze(0)
                if cuda and torch.cuda.is_available():
                    obs = obs.cuda()
                output = forward_fn(obs)
                if self.policy._cfg.deterministic_eval:
                    if self.policy._cfg.action_space == 'discrete':
                        output['action'] = output['logit'].argmax(dim=-1)
                    elif self.policy._cfg.action_space == 'continuous':
                        output['action'] = output['logit']['mu']
                    else:
                        raise KeyError("invalid action_space: {}".format(self.policy._cfg.action_space))
                else:
                    output['action'] = output['dist'].sample()
                # squeeze means delete batch dim, i.e. (1, A) -> (A, )
                action = output['action'].squeeze(0).detach().cpu().numpy()
                return action

            return _forward

        forward_fn = single_env_forward_wrapper(self.policy._model, self.cfg.policy.cuda)

        # reset first to make sure the env is in the initial state
        # env will be reset again in the main loop
        env.reset()

        for seed in seeds:
            env.seed(seed, dynamic_seed=False)
            return_ = 0.
            step = 0
            obs = env.reset()
            images.append(render(env)[None]) if concatenate_all_replay else None
            while True:
                action = forward_fn(obs)
                obs, rew, done, info = env.step(action)
                images.append(render(env)[None]) if concatenate_all_replay else None
                return_ += rew
                step += 1
                if done:
                    break
            logging.info(f'PG deploy is finished, final episode return with {step} steps is: {return_}')
            returns.append(return_)

        env.close()

        if concatenate_all_replay:
            images = np.concatenate(images, axis=0)
            import imageio
            imageio.mimwrite(os.path.join(replay_save_path, 'deploy.mp4'), images, fps=get_env_fps(env))

        return EvalReturn(eval_value=np.mean(returns), eval_value_std=np.std(returns))

    def collect_data(
            self,
            env_num: int = 8,
            save_data_path: Optional[str] = None,
            n_sample: Optional[int] = None,
            n_episode: Optional[int] = None,
            context: Optional[str] = None,
            debug: bool = False
    ) -> None:
        """
        Overview:
            Collect data with PG algorithm for ``n_episode`` episodes with ``env_num`` collector environments. \
            The collected data will be saved in ``save_data_path`` if specified, otherwise it will be saved in \
            ``exp_name/demo_data``.
        Arguments:
            - env_num (:obj:`int`): The number of collector environments. Default to 8.
            - save_data_path (:obj:`str`): The path to save the collected data. Default to None. \
                If not specified, the data will be saved in ``exp_name/demo_data``.
            - n_sample (:obj:`int`): The number of samples to collect. Default to None. \
                If not specified, ``n_episode`` must be specified.
            - n_episode (:obj:`int`): The number of episodes to collect. Default to None. \
                If not specified, ``n_sample`` must be specified.
            - context (:obj:`str`): The multi-process context of the environment manager. Default to None. \
                It can be specified as ``spawn``, ``fork`` or ``forkserver``.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
        if n_episode is not None:
            raise NotImplementedError
        # define env and policy
        env_num = env_num if env_num else self.cfg.env.collector_env_num
        env = setup_ding_env_manager(self.env, env_num, context, debug, 'collector')

        if save_data_path is None:
            save_data_path = os.path.join(self.exp_name, 'demo_data')

        # main execution task
        with task.start(ctx=OnlineRLContext()):
            task.use(
                StepCollector(
                    self.cfg, self.policy.collect_mode, env, random_collect_size=self.cfg.policy.random_collect_size
                )
            )
            task.use(offline_data_saver(save_data_path, data_type='hdf5'))
            task.run(max_step=1)
        logging.info(
            f'PG collecting is finished, more than {n_sample} samples are collected and saved in `{save_data_path}`'
        )

    def batch_evaluate(
            self,
            env_num: int = 4,
            n_evaluator_episode: int = 4,
            context: Optional[str] = None,
            debug: bool = False
    ) -> EvalReturn:
        """
        Overview:
            Evaluate the agent with PG algorithm for ``n_evaluator_episode`` episodes with ``env_num`` evaluator \
            environments. The evaluation result will be returned.
            The difference between methods ``batch_evaluate`` and ``deploy`` is that ``batch_evaluate`` will create \
            multiple evaluator environments to evaluate the agent to get an average performance, while ``deploy`` \
            will only create one evaluator environment to evaluate the agent and save the replay video.
        Arguments:
            - env_num (:obj:`int`): The number of evaluator environments. Default to 4.
            - n_evaluator_episode (:obj:`int`): The number of episodes to evaluate. Default to 4.
            - context (:obj:`str`): The multi-process context of the environment manager. Default to None. \
                It can be specified as ``spawn``, ``fork`` or ``forkserver``.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
        Returns:
            - (:obj:`EvalReturn`): The evaluation result, of which the attributes are:
                - eval_value (:obj:`np.float32`): The mean of evaluation return.
                - eval_value_std (:obj:`np.float32`): The standard deviation of evaluation return.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
        # define env and policy
        env_num = env_num if env_num else self.cfg.env.evaluator_env_num
        env = setup_ding_env_manager(self.env, env_num, context, debug, 'evaluator')

        # reset first to make sure the env is in the initial state
        # env will be reset again in the main loop
        env.launch()
        env.reset()

        evaluate_cfg = self.cfg
        evaluate_cfg.env.n_evaluator_episode = n_evaluator_episode

        # main execution task
        with task.start(ctx=OnlineRLContext()):
            task.use(interaction_evaluator(self.cfg, self.policy.eval_mode, env))
            task.run(max_step=1)

        return EvalReturn(eval_value=task.ctx.eval_value, eval_value_std=task.ctx.eval_value_std)

    @property
    def best(self) -> 'PGAgent':
        """
        Overview:
            Load the best model from the checkpoint directory, \
            which is by default in folder ``exp_name/ckpt/eval.pth.tar``. \
            The return value is the agent with the best model.
        Returns:
            - (:obj:`PGAgent`): The agent with the best model.
        Examples:
            >>> agent = PGAgent(env_id='LunarLanderContinuous-v2')
            >>> agent.train()
            >>> agent = agent.best

        .. note::
            The best model is the model with the highest evaluation return. If this property is accessed, the \
            current model will be replaced by the best model.
        """

        best_model_file_path = os.path.join(self.checkpoint_save_dir, "eval.pth.tar")
        # Load best model if it exists
        if os.path.exists(best_model_file_path):
            policy_state_dict = torch.load(best_model_file_path, map_location=torch.device("cpu"))
            self.policy.learn_mode.load_state_dict(policy_state_dict)
        return self
```