ding.bonus.c51

C51Agent

Overview

Agent class for training, evaluation and deployment of the reinforcement learning algorithm C51. For more information about the system design of RL agents, please refer to https://di-engine-docs.readthedocs.io/en/latest/03_system/agent.html.

Interface: `__init__`, `train`, `deploy`, `collect_data`, `batch_evaluate`, `best`

`supported_env_list = list(supported_env_cfg.keys())` (class attribute)

Overview

List of supported envs.

Examples:

    >>> from ding.bonus.c51 import C51Agent
    >>> print(C51Agent.supported_env_list)

`best` (property)

Overview

Load the best model from the checkpoint directory; by default this is the file `exp_name/ckpt/eval.pth.tar`. The return value is the agent with the best model.

Returns:

- (`C51Agent`): The agent with the best model.

Examples:

    >>> agent = C51Agent(env_id='LunarLander-v2')
    >>> agent.train()
    >>> agent = agent.best

.. note:: The best model is the model with the highest evaluation return. Accessing this property replaces the current model with the best model.
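The lookup behind `best` is just a fixed-path check. The following standalone sketch mirrors it (the helper name `resolve_best_ckpt` and the throwaway directory are ours, for illustration only; the real property additionally loads the state dict into `policy.learn_mode`):

```python
import os
import tempfile


def resolve_best_ckpt(checkpoint_save_dir: str):
    """Return the best-checkpoint path if it exists, else None.

    Mirrors C51Agent.best: the evaluator writes its best model to
    ``eval.pth.tar`` inside ``exp_name/ckpt``.
    """
    best_model_file_path = os.path.join(checkpoint_save_dir, "eval.pth.tar")
    return best_model_file_path if os.path.exists(best_model_file_path) else None


# Demo with a throwaway experiment directory.
with tempfile.TemporaryDirectory() as exp_name:
    ckpt_dir = os.path.join(exp_name, "ckpt")
    os.makedirs(ckpt_dir)
    assert resolve_best_ckpt(ckpt_dir) is None  # nothing evaluated yet
    open(os.path.join(ckpt_dir, "eval.pth.tar"), "wb").close()
    assert resolve_best_ckpt(ckpt_dir).endswith("eval.pth.tar")
```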

__init__(env_id=None, env=None, seed=0, exp_name=None, model=None, cfg=None, policy_state_dict=None)

Overview

Initialize agent for C51 algorithm.

Arguments:

- `env_id` (`str`): The environment id, which is a registered environment name in gym or gymnasium. If `env_id` is not specified, `env_id` in `cfg.env` must be specified; if it is specified, `env_id` in `cfg.env` will be ignored. `env_id` should be one of the supported envs, which can be found in `supported_env_list`.
- `env` (`BaseEnv`): The environment instance for training and evaluation. If `env` is not specified, `env_id` or `cfg.env.env_id` must be specified and will be used to create the environment instance. If `env` is specified, `env_id` and `cfg.env.env_id` will be ignored.
- `seed` (`int`): The random seed, which is set before running the program. Default to 0.
- `exp_name` (`str`): The name of this experiment, used to create the folder in which log data is saved. Default to None. If not specified, the folder name will be `env_id`-`algorithm`.
- `model` (`torch.nn.Module`): The model of the C51 algorithm, which should be an instance of class `ding.model.C51DQN`. If not specified, a default model will be generated according to the config.
- `cfg` (`Union[EasyDict, dict]`): The configuration of the C51 algorithm, as a dict. Default to None. If not specified, the default configuration will be used, which can be found in `ding/config/example/C51/gym_lunarlander_v2.py`.
- `policy_state_dict` (`str`): The path of a policy state dict saved by PyTorch in a local file. If specified, the policy will be loaded from this file. Default to None.

.. note:: An RL agent instance can be initialized in two basic ways. For example, suppose we have an environment with id `LunarLander-v2` registered in gym, and we want to train an agent with the C51 algorithm using the default configuration. We can then initialize the agent as follows:

    >>> agent = C51Agent(env_id='LunarLander-v2')

or, if we want to specify the `env_id` in the configuration:

    >>> cfg = {'env': {'env_id': 'LunarLander-v2'}, 'policy': ...... }
    >>> agent = C51Agent(cfg=cfg)

There are also other arguments for specifying the agent at initialization. For example, to specify the environment instance:

    >>> env = CustomizedEnv('LunarLander-v2')
    >>> agent = C51Agent(cfg=cfg, env=env)

to specify the model:

    >>> model = C51DQN(**cfg.policy.model)
    >>> agent = C51Agent(cfg=cfg, model=model)

or to reload the policy from a saved policy state dict:

    >>> agent = C51Agent(cfg=cfg, policy_state_dict='LunarLander-v2.pth.tar')

Make sure that the configuration is consistent with the saved policy state dict.
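The precedence between `env_id` and `cfg.env.env_id` described above can be sketched with plain dicts — a simplified rendition of the assertions in `__init__` (the helper name and the `supported_env_list` value here are stand-ins, not DI-engine's real list):

```python
def resolve_env_id(env_id=None, cfg=None, supported_env_list=("LunarLander-v2",)):
    """Mirror C51Agent.__init__'s env_id / cfg precedence checks (simplified)."""
    # Either env_id or cfg must be given.
    assert env_id is not None or cfg is not None, "Please specify env_id or cfg."
    if env_id is not None:
        assert env_id in supported_env_list, "Please use supported envs."
        if cfg is not None:
            # An explicit env_id must agree with the one in cfg.
            assert cfg["env"]["env_id"] == env_id
        return env_id
    # Otherwise fall back to cfg.env.env_id.
    assert "env_id" in cfg["env"], "Please specify env_id in cfg."
    assert cfg["env"]["env_id"] in supported_env_list, "Please use supported envs."
    return cfg["env"]["env_id"]


print(resolve_env_id(env_id="LunarLander-v2"))                    # LunarLander-v2
print(resolve_env_id(cfg={"env": {"env_id": "LunarLander-v2"}}))  # LunarLander-v2
```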

train(step=int(1e7), collector_env_num=None, evaluator_env_num=None, n_iter_save_ckpt=1000, context=None, debug=False, wandb_sweep=False)

Overview

Train the agent with the C51 algorithm for `step` environment steps, using `collector_env_num` collector environments and `evaluator_env_num` evaluator environments. Information generated during training is recorded and saved by wandb.

Arguments:

- `step` (`int`): The total training environment steps across all collector environments. Default to 1e7.
- `collector_env_num` (`int`): The number of collector environments. Default to None. If not specified, it will be set according to the configuration.
- `evaluator_env_num` (`int`): The number of evaluator environments. Default to None. If not specified, it will be set according to the configuration.
- `n_iter_save_ckpt` (`int`): How often a checkpoint is saved, in training iterations. Default to 1000.
- `context` (`str`): The multi-process context of the environment manager. Default to None. It can be specified as `spawn`, `fork` or `forkserver`.
- `debug` (`bool`): Whether to use debug mode in the environment manager. Default to False. If set to True, the base environment manager will be used for easy debugging; otherwise, the subprocess environment manager will be used.
- `wandb_sweep` (`bool`): Whether to use wandb sweep, a hyper-parameter optimization process for seeking the best configuration. Default to False. If True, the wandb sweep id will be used as the experiment name.

Returns:

- (`TrainingReturn`): The training result, whose attributes are:
    - `wandb_url` (`str`): The Weights & Biases (wandb) project url of the training experiment.
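As a rough mental model of how `step` and `n_iter_save_ckpt` interact, one can sketch the loop in miniature. This assumes a hypothetical fixed number of environment steps collected per training iteration (`steps_per_iter` is our invention; the real loop is the middleware chain of `StepCollector`, `OffPolicyLearner`, `CkptSaver` and `termination_checker` shown in the full source):

```python
def training_schedule(max_env_step: int, steps_per_iter: int, n_iter_save_ckpt: int):
    """Toy model: run iterations until the env-step budget is spent.

    Returns (total_iterations, iterations_at_which_a_checkpoint_was_saved).
    """
    env_step, train_iter, ckpt_iters = 0, 0, []
    while env_step < max_env_step:          # termination_checker(max_env_step=step)
        env_step += steps_per_iter          # one StepCollector pass
        train_iter += 1
        if train_iter % n_iter_save_ckpt == 0:  # CkptSaver(train_freq=n_iter_save_ckpt)
            ckpt_iters.append(train_iter)
    return train_iter, ckpt_iters


print(training_schedule(max_env_step=10_000, steps_per_iter=64, n_iter_save_ckpt=100))
# -> (157, [100]): 157 iterations to exhaust the budget, one periodic checkpoint
```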

deploy(enable_save_replay=False, concatenate_all_replay=False, replay_save_path=None, seed=None, debug=False)

Overview

Deploy the agent with the C51 algorithm by interacting with the environment; the replay video can be saved if `enable_save_replay` is True. The evaluation result will be returned.

Arguments:

- `enable_save_replay` (`bool`): Whether to save the replay video. Default to False.
- `concatenate_all_replay` (`bool`): Whether to concatenate all replay videos into one video. Default to False. Ignored if `enable_save_replay` is False. If `enable_save_replay` is True and `concatenate_all_replay` is False, the replay video of each episode will be saved separately.
- `replay_save_path` (`str`): The path to save the replay video. Default to None. If not specified, the video will be saved in `exp_name/videos`.
- `seed` (`Union[int, List]`): The random seed, which is set before running the program. Default to None. If not specified, `self.seed` will be used. If `seed` is an integer, the agent will be deployed once; if `seed` is a list of integers, the agent will be deployed once for each seed in the list.
- `debug` (`bool`): Whether to use debug mode in the environment manager. Default to False. If set to True, the base environment manager will be used for easy debugging; otherwise, the subprocess environment manager will be used.

Returns:

- (`EvalReturn`): The evaluation result, whose attributes are:
    - `eval_value` (`np.float32`): The mean of evaluation return.
    - `eval_value_std` (`np.float32`): The standard deviation of evaluation return.
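The seed handling and result aggregation described above can be sketched in isolation (the helper name `normalize_seeds` is ours; in the source, `deploy` computes the mean and population standard deviation of the per-seed episode returns):

```python
from statistics import mean, pstdev
from typing import List, Optional, Union


def normalize_seeds(seed: Optional[Union[int, List[int]]], default_seed: int) -> List[int]:
    """Mirror deploy(): an int means one run, a list means one run per seed,
    and None falls back to the agent's own seed."""
    if isinstance(seed, int):
        return [seed]
    if isinstance(seed, list):
        return seed
    return [default_seed]


assert normalize_seeds(7, default_seed=0) == [7]
assert normalize_seeds([1, 2, 3], default_seed=0) == [1, 2, 3]
assert normalize_seeds(None, default_seed=0) == [0]

# Each seed produces one episode return; the EvalReturn fields are then
# eval_value = mean(returns) and eval_value_std = pstdev(returns)
# (pstdev matches np.std's default population standard deviation).
returns = [200.0, 180.0, 220.0]
print(mean(returns), pstdev(returns))
```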

collect_data(env_num=8, save_data_path=None, n_sample=None, n_episode=None, context=None, debug=False)

Overview

Collect data with the C51 algorithm, either `n_sample` samples or `n_episode` episodes, using `env_num` collector environments. The collected data is saved in `save_data_path` if specified; otherwise it is saved in `exp_name/demo_data`.

Arguments:

- `env_num` (`int`): The number of collector environments. Default to 8.
- `save_data_path` (`str`): The path to save the collected data. Default to None. If not specified, the data will be saved in `exp_name/demo_data`.
- `n_sample` (`int`): The number of samples to collect. Default to None. If not specified, `n_episode` must be specified.
- `n_episode` (`int`): The number of episodes to collect. Default to None. If not specified, `n_sample` must be specified. (Note: in the source below, specifying `n_episode` currently raises `NotImplementedError`.)
- `context` (`str`): The multi-process context of the environment manager. Default to None. It can be specified as `spawn`, `fork` or `forkserver`.
- `debug` (`bool`): Whether to use debug mode in the environment manager. Default to False. If set to True, the base environment manager will be used for easy debugging; otherwise, the subprocess environment manager will be used.
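The argument handling can be sketched standalone (the helper name is ours, and the explicit `n_sample` check is our rendition of the docstring's requirement): episode-based collection raises `NotImplementedError` as in the source, and the save path defaults under `exp_name`:

```python
import os


def resolve_collect_args(exp_name: str, save_data_path=None, n_sample=None, n_episode=None):
    """Simplified rendition of collect_data()'s argument checks."""
    if n_episode is not None:
        # Matches the source: episode-based collection is not implemented yet.
        raise NotImplementedError
    assert n_sample is not None, "Please specify n_sample or n_episode."
    if save_data_path is None:
        save_data_path = os.path.join(exp_name, 'demo_data')
    return n_sample, save_data_path


assert resolve_collect_args('LunarLander-v2-C51', n_sample=1024) == \
    (1024, os.path.join('LunarLander-v2-C51', 'demo_data'))
```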

batch_evaluate(env_num=4, n_evaluator_episode=4, context=None, debug=False)

Overview

Evaluate the agent with C51 algorithm for n_evaluator_episode episodes with env_num evaluator environments. The evaluation result will be returned. The difference between methods batch_evaluate and deploy is that batch_evaluate will create multiple evaluator environments to evaluate the agent to get an average performance, while deploy will only create one evaluator environment to evaluate the agent and save the replay video.

Arguments:

- `env_num` (`int`): The number of evaluator environments. Default to 4.
- `n_evaluator_episode` (`int`): The number of episodes to evaluate. Default to 4.
- `context` (`str`): The multi-process context of the environment manager. Default to None. It can be specified as `spawn`, `fork` or `forkserver`.
- `debug` (`bool`): Whether to use debug mode in the environment manager. Default to False. If set to True, the base environment manager will be used for easy debugging; otherwise, the subprocess environment manager will be used.

Returns:

- (`EvalReturn`): The evaluation result, whose attributes are:
    - `eval_value` (`np.float32`): The mean of evaluation return.
    - `eval_value_std` (`np.float32`): The standard deviation of evaluation return.
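To illustrate why batch evaluation scales better than `deploy`, here is a purely illustrative way parallel evaluator environments could split `n_evaluator_episode` episodes. The actual scheduling lives inside `interaction_evaluator` and is not shown in this file; this is a sketch of the idea only:

```python
def episodes_per_env(n_evaluator_episode: int, env_num: int):
    """Spread episodes as evenly as possible over parallel environments."""
    base, extra = divmod(n_evaluator_episode, env_num)
    return [base + (1 if i < extra else 0) for i in range(env_num)]


print(episodes_per_env(4, 4))   # -> [1, 1, 1, 1]: the defaults, one episode per env
print(episodes_per_env(10, 4))  # -> [3, 3, 2, 2]
```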

Full Source Code

../ding/bonus/c51.py

```python
from typing import Optional, Union, List
from ditk import logging
from easydict import EasyDict
import os
import numpy as np
import torch
import treetensor.torch as ttorch
from ding.framework import task, OnlineRLContext
from ding.framework.middleware import CkptSaver, \
    wandb_online_logger, offline_data_saver, termination_checker, interaction_evaluator, StepCollector, data_pusher, \
    OffPolicyLearner, final_ctx_saver, eps_greedy_handler, nstep_reward_enhancer
from ding.envs import BaseEnv
from ding.envs import setup_ding_env_manager
from ding.policy import C51Policy
from ding.utils import set_pkg_seed
from ding.utils import get_env_fps, render
from ding.config import save_config_py, compile_config
from ding.model import C51DQN
from ding.model import model_wrap
from ding.data import DequeBuffer
from ding.bonus.common import TrainingReturn, EvalReturn
from ding.config.example.C51 import supported_env_cfg
from ding.config.example.C51 import supported_env


class C51Agent:
    """
    Overview:
        Class of agent for training, evaluation and deployment of Reinforcement learning algorithm C51.
        For more information about the system design of RL agent, please refer to \
        <https://di-engine-docs.readthedocs.io/en/latest/03_system/agent.html>.
    Interface:
        ``__init__``, ``train``, ``deploy``, ``collect_data``, ``batch_evaluate``, ``best``
    """
    supported_env_list = list(supported_env_cfg.keys())
    """
    Overview:
        List of supported envs.
    Examples:
        >>> from ding.bonus.c51 import C51Agent
        >>> print(C51Agent.supported_env_list)
    """

    def __init__(
            self,
            env_id: str = None,
            env: BaseEnv = None,
            seed: int = 0,
            exp_name: str = None,
            model: Optional[torch.nn.Module] = None,
            cfg: Optional[Union[EasyDict, dict]] = None,
            policy_state_dict: str = None,
    ) -> None:
        """
        Overview:
            Initialize agent for C51 algorithm.
        Arguments:
            - env_id (:obj:`str`): The environment id, which is a registered environment name in gym or gymnasium. \
                If ``env_id`` is not specified, ``env_id`` in ``cfg.env`` must be specified. \
                If ``env_id`` is specified, ``env_id`` in ``cfg.env`` will be ignored. \
                ``env_id`` should be one of the supported envs, which can be found in ``supported_env_list``.
            - env (:obj:`BaseEnv`): The environment instance for training and evaluation. \
                If ``env`` is not specified, ``env_id`` or ``cfg.env.env_id`` must be specified. \
                ``env_id`` or ``cfg.env.env_id`` will be used to create environment instance. \
                If ``env`` is specified, ``env_id`` and ``cfg.env.env_id`` will be ignored.
            - seed (:obj:`int`): The random seed, which is set before running the program. \
                Default to 0.
            - exp_name (:obj:`str`): The name of this experiment, which will be used to create the folder to save \
                log data. Default to None. If not specified, the folder name will be ``env_id``-``algorithm``.
            - model (:obj:`torch.nn.Module`): The model of C51 algorithm, which should be an instance of class \
                :class:`ding.model.C51DQN`. If not specified, a default model will be generated according to the config.
            - cfg (:obj:`Union[EasyDict, dict]`): The configuration of C51 algorithm, which is a dict. \
                Default to None. If not specified, the default configuration will be used. \
                The default configuration can be found in ``ding/config/example/C51/gym_lunarlander_v2.py``.
            - policy_state_dict (:obj:`str`): The path of policy state dict saved by PyTorch in a local file. \
                If specified, the policy will be loaded from this file. Default to None.

        .. note::
            An RL Agent Instance can be initialized in two basic ways. \
            For example, we have an environment with id ``LunarLander-v2`` registered in gym, \
            and we want to train an agent with C51 algorithm with default configuration. \
            Then we can initialize the agent in the following ways:
                >>> agent = C51Agent(env_id='LunarLander-v2')
            or, if we want to specify the env_id in the configuration:
                >>> cfg = {'env': {'env_id': 'LunarLander-v2'}, 'policy': ...... }
                >>> agent = C51Agent(cfg=cfg)
            There are also other arguments to specify the agent when initializing.
            For example, if we want to specify the environment instance:
                >>> env = CustomizedEnv('LunarLander-v2')
                >>> agent = C51Agent(cfg=cfg, env=env)
            or, if we want to specify the model:
                >>> model = C51DQN(**cfg.policy.model)
                >>> agent = C51Agent(cfg=cfg, model=model)
            or, if we want to reload the policy from a saved policy state dict:
                >>> agent = C51Agent(cfg=cfg, policy_state_dict='LunarLander-v2.pth.tar')
            Make sure that the configuration is consistent with the saved policy state dict.
        """

        assert env_id is not None or cfg is not None, "Please specify env_id or cfg."

        if cfg is not None and not isinstance(cfg, EasyDict):
            cfg = EasyDict(cfg)

        if env_id is not None:
            assert env_id in C51Agent.supported_env_list, "Please use supported envs: {}".format(
                C51Agent.supported_env_list
            )
            if cfg is None:
                cfg = supported_env_cfg[env_id]
            else:
                assert cfg.env.env_id == env_id, "env_id in cfg should be the same as env_id in args."
        else:
            assert hasattr(cfg.env, "env_id"), "Please specify env_id in cfg."
            assert cfg.env.env_id in C51Agent.supported_env_list, "Please use supported envs: {}".format(
                C51Agent.supported_env_list
            )
        default_policy_config = EasyDict({"policy": C51Policy.default_config()})
        default_policy_config.update(cfg)
        cfg = default_policy_config

        if exp_name is not None:
            cfg.exp_name = exp_name
        self.cfg = compile_config(cfg, policy=C51Policy)
        self.exp_name = self.cfg.exp_name
        if env is None:
            self.env = supported_env[cfg.env.env_id](cfg=cfg.env)
        else:
            assert isinstance(env, BaseEnv), "Please use BaseEnv as env data type."
            self.env = env

        logging.getLogger().setLevel(logging.INFO)
        self.seed = seed
        set_pkg_seed(self.seed, use_cuda=self.cfg.policy.cuda)
        if not os.path.exists(self.exp_name):
            os.makedirs(self.exp_name)
        save_config_py(self.cfg, os.path.join(self.exp_name, 'policy_config.py'))
        if model is None:
            model = C51DQN(**self.cfg.policy.model)
        self.buffer_ = DequeBuffer(size=self.cfg.policy.other.replay_buffer.replay_buffer_size)
        self.policy = C51Policy(self.cfg.policy, model=model)
        if policy_state_dict is not None:
            self.policy.learn_mode.load_state_dict(policy_state_dict)
        self.checkpoint_save_dir = os.path.join(self.exp_name, "ckpt")

    def train(
            self,
            step: int = int(1e7),
            collector_env_num: int = None,
            evaluator_env_num: int = None,
            n_iter_save_ckpt: int = 1000,
            context: Optional[str] = None,
            debug: bool = False,
            wandb_sweep: bool = False,
    ) -> TrainingReturn:
        """
        Overview:
            Train the agent with C51 algorithm for ``step`` iterations with ``collector_env_num`` collector \
            environments and ``evaluator_env_num`` evaluator environments. Information during training will be \
            recorded and saved by wandb.
        Arguments:
            - step (:obj:`int`): The total training environment steps of all collector environments. Default to 1e7.
            - collector_env_num (:obj:`int`): The collector environment number. Default to None. \
                If not specified, it will be set according to the configuration.
            - evaluator_env_num (:obj:`int`): The evaluator environment number. Default to None. \
                If not specified, it will be set according to the configuration.
            - n_iter_save_ckpt (:obj:`int`): The frequency of saving checkpoint every training iteration. \
                Default to 1000.
            - context (:obj:`str`): The multi-process context of the environment manager. Default to None. \
                It can be specified as ``spawn``, ``fork`` or ``forkserver``.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
            - wandb_sweep (:obj:`bool`): Whether to use wandb sweep, \
                which is a hyper-parameter optimization process for seeking the best configurations. \
                Default to False. If True, the wandb sweep id will be used as the experiment name.
        Returns:
            - (:obj:`TrainingReturn`): The training result, of which the attributions are:
                - wandb_url (:obj:`str`): The Weights & Biases (wandb) project url of the training experiment.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
            logging.debug(self.policy._model)
        # define env and policy
        collector_env_num = collector_env_num if collector_env_num else self.cfg.env.collector_env_num
        evaluator_env_num = evaluator_env_num if evaluator_env_num else self.cfg.env.evaluator_env_num
        collector_env = setup_ding_env_manager(self.env, collector_env_num, context, debug, 'collector')
        evaluator_env = setup_ding_env_manager(self.env, evaluator_env_num, context, debug, 'evaluator')

        with task.start(ctx=OnlineRLContext()):
            task.use(
                interaction_evaluator(
                    self.cfg,
                    self.policy.eval_mode,
                    evaluator_env,
                    render=self.cfg.policy.eval.render if hasattr(self.cfg.policy.eval, "render") else False
                )
            )
            task.use(CkptSaver(policy=self.policy, save_dir=self.checkpoint_save_dir, train_freq=n_iter_save_ckpt))
            task.use(eps_greedy_handler(self.cfg))
            task.use(
                StepCollector(
                    self.cfg,
                    self.policy.collect_mode,
                    collector_env,
                    random_collect_size=self.cfg.policy.random_collect_size
                    if hasattr(self.cfg.policy, 'random_collect_size') else 0,
                )
            )
            task.use(nstep_reward_enhancer(self.cfg))
            task.use(data_pusher(self.cfg, self.buffer_))
            task.use(OffPolicyLearner(self.cfg, self.policy.learn_mode, self.buffer_))
            task.use(
                wandb_online_logger(
                    metric_list=self.policy._monitor_vars_learn(),
                    model=self.policy._model,
                    anonymous=True,
                    project_name=self.exp_name,
                    wandb_sweep=wandb_sweep,
                )
            )
            task.use(termination_checker(max_env_step=step))
            task.use(final_ctx_saver(name=self.exp_name))
            task.run()

        return TrainingReturn(wandb_url=task.ctx.wandb_url)

    def deploy(
            self,
            enable_save_replay: bool = False,
            concatenate_all_replay: bool = False,
            replay_save_path: str = None,
            seed: Optional[Union[int, List]] = None,
            debug: bool = False
    ) -> EvalReturn:
        """
        Overview:
            Deploy the agent with C51 algorithm by interacting with the environment, during which the replay video \
            can be saved if ``enable_save_replay`` is True. The evaluation result will be returned.
        Arguments:
            - enable_save_replay (:obj:`bool`): Whether to save the replay video. Default to False.
            - concatenate_all_replay (:obj:`bool`): Whether to concatenate all replay videos into one video. \
                Default to False. If ``enable_save_replay`` is False, this argument will be ignored. \
                If ``enable_save_replay`` is True and ``concatenate_all_replay`` is False, \
                the replay video of each episode will be saved separately.
            - replay_save_path (:obj:`str`): The path to save the replay video. Default to None. \
                If not specified, the video will be saved in ``exp_name/videos``.
            - seed (:obj:`Union[int, List]`): The random seed, which is set before running the program. \
                Default to None. If not specified, ``self.seed`` will be used. \
                If ``seed`` is an integer, the agent will be deployed once. \
                If ``seed`` is a list of integers, the agent will be deployed once for each seed in the list.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
        Returns:
            - (:obj:`EvalReturn`): The evaluation result, of which the attributions are:
                - eval_value (:obj:`np.float32`): The mean of evaluation return.
                - eval_value_std (:obj:`np.float32`): The standard deviation of evaluation return.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
        # define env and policy
        env = self.env.clone(caller='evaluator')

        if seed is not None and isinstance(seed, int):
            seeds = [seed]
        elif seed is not None and isinstance(seed, list):
            seeds = seed
        else:
            seeds = [self.seed]

        returns = []
        images = []
        if enable_save_replay:
            replay_save_path = os.path.join(self.exp_name, 'videos') if replay_save_path is None else replay_save_path
            env.enable_save_replay(replay_path=replay_save_path)
        else:
            logging.warning('No video would be generated during the deploy.')
            if concatenate_all_replay:
                logging.warning('concatenate_all_replay is set to False because enable_save_replay is False.')
                concatenate_all_replay = False

        def single_env_forward_wrapper(forward_fn, cuda=True):

            forward_fn = model_wrap(forward_fn, wrapper_name='argmax_sample').forward

            def _forward(obs):
                # unsqueeze means add batch dim, i.e. (O, ) -> (1, O)
                obs = ttorch.as_tensor(obs).unsqueeze(0)
                if cuda and torch.cuda.is_available():
                    obs = obs.cuda()
                action = forward_fn(obs)["action"]
                # squeeze means delete batch dim, i.e. (1, A) -> (A, )
                action = action.squeeze(0).detach().cpu().numpy()
                return action

            return _forward

        forward_fn = single_env_forward_wrapper(self.policy._model, self.cfg.policy.cuda)

        # reset first to make sure the env is in the initial state
        # env will be reset again in the main loop
        env.reset()

        for seed in seeds:
            env.seed(seed, dynamic_seed=False)
            return_ = 0.
            step = 0
            obs = env.reset()
            images.append(render(env)[None]) if concatenate_all_replay else None
            while True:
                action = forward_fn(obs)
                obs, rew, done, info = env.step(action)
                images.append(render(env)[None]) if concatenate_all_replay else None
                return_ += rew
                step += 1
                if done:
                    break
            logging.info(f'C51 deploy is finished, final episode return with {step} steps is: {return_}')
            returns.append(return_)

        env.close()

        if concatenate_all_replay:
            images = np.concatenate(images, axis=0)
            import imageio
            imageio.mimwrite(os.path.join(replay_save_path, 'deploy.mp4'), images, fps=get_env_fps(env))

        return EvalReturn(eval_value=np.mean(returns), eval_value_std=np.std(returns))

    def collect_data(
            self,
            env_num: int = 8,
            save_data_path: Optional[str] = None,
            n_sample: Optional[int] = None,
            n_episode: Optional[int] = None,
            context: Optional[str] = None,
            debug: bool = False
    ) -> None:
        """
        Overview:
            Collect data with C51 algorithm for ``n_episode`` episodes with ``env_num`` collector environments. \
            The collected data will be saved in ``save_data_path`` if specified, otherwise it will be saved in \
            ``exp_name/demo_data``.
        Arguments:
            - env_num (:obj:`int`): The number of collector environments. Default to 8.
            - save_data_path (:obj:`str`): The path to save the collected data. Default to None. \
                If not specified, the data will be saved in ``exp_name/demo_data``.
            - n_sample (:obj:`int`): The number of samples to collect. Default to None. \
                If not specified, ``n_episode`` must be specified.
            - n_episode (:obj:`int`): The number of episodes to collect. Default to None. \
                If not specified, ``n_sample`` must be specified.
            - context (:obj:`str`): The multi-process context of the environment manager. Default to None. \
                It can be specified as ``spawn``, ``fork`` or ``forkserver``.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
        if n_episode is not None:
            raise NotImplementedError
        # define env and policy
        env_num = env_num if env_num else self.cfg.env.collector_env_num
        env = setup_ding_env_manager(self.env, env_num, context, debug, 'collector')

        if save_data_path is None:
            save_data_path = os.path.join(self.exp_name, 'demo_data')

        # main execution task
        with task.start(ctx=OnlineRLContext()):
            task.use(
                StepCollector(
                    self.cfg, self.policy.collect_mode, env, random_collect_size=self.cfg.policy.random_collect_size
                )
            )
            task.use(offline_data_saver(save_data_path, data_type='hdf5'))
            task.run(max_step=1)
        logging.info(
            f'C51 collecting is finished, more than {n_sample} samples are collected and saved in `{save_data_path}`'
        )

    def batch_evaluate(
            self,
            env_num: int = 4,
            n_evaluator_episode: int = 4,
            context: Optional[str] = None,
            debug: bool = False
    ) -> EvalReturn:
        """
        Overview:
            Evaluate the agent with C51 algorithm for ``n_evaluator_episode`` episodes with ``env_num`` evaluator \
            environments. The evaluation result will be returned.
            The difference between methods ``batch_evaluate`` and ``deploy`` is that ``batch_evaluate`` will create \
            multiple evaluator environments to evaluate the agent to get an average performance, while ``deploy`` \
            will only create one evaluator environment to evaluate the agent and save the replay video.
        Arguments:
            - env_num (:obj:`int`): The number of evaluator environments. Default to 4.
            - n_evaluator_episode (:obj:`int`): The number of episodes to evaluate. Default to 4.
            - context (:obj:`str`): The multi-process context of the environment manager. Default to None. \
                It can be specified as ``spawn``, ``fork`` or ``forkserver``.
            - debug (:obj:`bool`): Whether to use debug mode in the environment manager. Default to False. \
                If set True, base environment manager will be used for easy debugging. Otherwise, \
                subprocess environment manager will be used.
        Returns:
            - (:obj:`EvalReturn`): The evaluation result, of which the attributions are:
                - eval_value (:obj:`np.float32`): The mean of evaluation return.
                - eval_value_std (:obj:`np.float32`): The standard deviation of evaluation return.
        """

        if debug:
            logging.getLogger().setLevel(logging.DEBUG)
        # define env and policy
        env_num = env_num if env_num else self.cfg.env.evaluator_env_num
        env = setup_ding_env_manager(self.env, env_num, context, debug, 'evaluator')

        # reset first to make sure the env is in the initial state
        # env will be reset again in the main loop
        env.launch()
        env.reset()

        evaluate_cfg = self.cfg
        evaluate_cfg.env.n_evaluator_episode = n_evaluator_episode

        # main execution task
        with task.start(ctx=OnlineRLContext()):
            task.use(interaction_evaluator(self.cfg, self.policy.eval_mode, env))
            task.run(max_step=1)

        return EvalReturn(eval_value=task.ctx.eval_value, eval_value_std=task.ctx.eval_value_std)

    @property
    def best(self) -> 'C51Agent':
        """
        Overview:
            Load the best model from the checkpoint directory, \
            which is by default in folder ``exp_name/ckpt/eval.pth.tar``. \
            The return value is the agent with the best model.
        Returns:
            - (:obj:`C51Agent`): The agent with the best model.
        Examples:
            >>> agent = C51Agent(env_id='LunarLander-v2')
            >>> agent.train()
            >>> agent = agent.best

        .. note::
            The best model is the model with the highest evaluation return. \
            If this method is called, the current model will be replaced by the best model.
        """

        best_model_file_path = os.path.join(self.checkpoint_save_dir, "eval.pth.tar")
        # Load best model if it exists
        if os.path.exists(best_model_file_path):
            policy_state_dict = torch.load(best_model_file_path, map_location=torch.device("cpu"))
            self.policy.learn_mode.load_state_dict(policy_state_dict)
        return self
```