
ding.policy.ppo

PPOPolicy

Bases: Policy

Overview

Policy class of on-policy version PPO algorithm. Paper link: https://arxiv.org/abs/1707.06347.
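At the core of this policy is PPO's clipped surrogate objective, which the error functions used later (``ppo_error`` / ``ppo_error_continuous``) compute. As a framework-free sketch (a minimal illustration with a hypothetical function name, not DI-engine's actual ``ppo_error`` signature):

```python
import math

def clipped_surrogate_loss(logp_new, logp_old, adv, clip_ratio=0.2):
    """PPO clipped surrogate objective over a batch of log-probs and advantages."""
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, adv):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(min(ratio, 1.0 + clip_ratio), 1.0 - clip_ratio)
        # Pessimistic bound: take the minimum of unclipped and clipped objectives.
        total += min(ratio * a, clipped * a)
    # Negated so that gradient descent maximizes the surrogate objective.
    return -total / len(adv)
```

For example, with ``clip_ratio=0.2``, a sample whose ratio grows to 1.8 contributes only as if the ratio were 1.2, which is what bounds the policy update per iteration.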

default_model()

Overview

Return this algorithm's default neural network model settings for demonstration. The __init__ method will automatically call this method to get the default model settings and create the model.

Returns: - model_info (:obj:`Tuple[str, List[str]]`): The registered model name and the model's import_names.

.. note:: The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For example, for PPO, the registered name is ppo and the import_names is ding.model.template.vac.

.. note:: Because PPO supports both single-agent and multi-agent usage, these functions are implemented with the same policy and two different default models, controlled by self._cfg.multi_agent.
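The single-agent/multi-agent selection can be reproduced as a standalone sketch (the real method reads ``self._cfg.multi_agent``; the model names come from the source listing below):

```python
def default_model(multi_agent: bool):
    # MAPPO uses the multi-agent VAC model; plain PPO uses the single-agent VAC model.
    if multi_agent:
        return 'mavac', ['ding.model.template.mavac']
    return 'vac', ['ding.model.template.vac']
```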

PPOPGPolicy

Bases: Policy

Overview

Policy class of on-policy version PPO algorithm (pure policy gradient without value network). Paper link: https://arxiv.org/abs/1707.06347.

default_model()

Overview

Return this algorithm's default neural network model settings for demonstration. The __init__ method will automatically call this method to get the default model settings and create the model.

Returns: - model_info (:obj:`Tuple[str, List[str]]`): The registered model name and the model's import_names.

PPOOffPolicy

Bases: Policy

Overview

Policy class of off-policy version PPO algorithm. Paper link: https://arxiv.org/abs/1707.06347. This version is more suitable for large-scale distributed training.

default_model()

Overview

Return this algorithm's default neural network model settings for demonstration. The __init__ method will automatically call this method to get the default model settings and create the model.

Returns: - model_info (:obj:`Tuple[str, List[str]]`): The registered model name and the model's import_names.

PPOSTDIMPolicy

Bases: PPOPolicy

Overview

Policy class of on-policy version PPO algorithm with ST-DIM auxiliary model. PPO paper link: https://arxiv.org/abs/1707.06347. ST-DIM paper link: https://arxiv.org/abs/1906.08226.
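ST-DIM adds a contrastive auxiliary objective that encourages representations of temporally adjacent observations to score higher together than mismatched pairs. Its core is an InfoNCE-style loss, sketched here in plain Python (a conceptual illustration only; this is not the API of DI-engine's ``ContrastiveLoss`` module):

```python
import math

def infonce_loss(scores, pos_index=0):
    """-log softmax probability of the positive pair among all candidate pairs.

    ``scores`` are similarity scores between an anchor representation and a set of
    candidates, one of which (at ``pos_index``) is the true temporal positive.
    """
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return -math.log(exps[pos_index] / sum(exps))
```

Minimizing this loss over batches drives the positive pair's score above the negatives', which is the mutual-information lower bound ST-DIM maximizes alongside the PPO objective.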

Full Source Code

../ding/policy/ppo.py

```python
from typing import List, Dict, Any, Tuple, Union
from collections import namedtuple
import torch
import copy
import numpy as np

from ding.torch_utils import Adam, to_device, to_dtype, unsqueeze, ContrastiveLoss
from ding.rl_utils import ppo_data, ppo_error, ppo_policy_error, ppo_policy_data, get_gae_with_default_last_value, \
    v_nstep_td_data, v_nstep_td_error, get_nstep_return_data, get_train_sample, gae, gae_data, ppo_error_continuous, \
    get_gae, ppo_policy_error_continuous
from ding.model import model_wrap
from ding.utils import POLICY_REGISTRY, split_data_generator, RunningMeanStd
from ding.utils.data import default_collate, default_decollate
from .base_policy import Policy
from .common_utils import default_preprocess_learn


@POLICY_REGISTRY.register('ppo')
class PPOPolicy(Policy):
    """
    Overview:
        Policy class of on-policy version PPO algorithm. Paper link: https://arxiv.org/abs/1707.06347.
    """
    config = dict(
        # (str) RL policy register name (refer to function "POLICY_REGISTRY").
        type='ppo',
        # (bool) Whether to use cuda for network.
        cuda=False,
        # (bool) Whether the RL algorithm is on-policy or off-policy. (Note: in practice PPO can be used off-policy.)
        on_policy=True,
        # (bool) Whether to use priority (priority sample, IS weight, update priority).
        priority=False,
        # (bool) Whether to use Importance Sampling Weight to correct biased update due to priority.
        # If True, priority must be True.
        priority_IS_weight=False,
        # (bool) Whether to recompute advantages in each iteration of on-policy PPO.
        recompute_adv=True,
        # (str) Which kind of action space used in PPOPolicy, ['discrete', 'continuous', 'hybrid'].
        action_space='discrete',
        # (bool) Whether to use nstep return to calculate value target, otherwise, use return = adv + value.
        nstep_return=False,
        # (bool) Whether to enable multi-agent training, i.e.: MAPPO.
        multi_agent=False,
        # (bool) Whether to need policy ``_forward_collect`` output data in process transition.
        transition_with_policy_data=True,
        # learn_mode config
        learn=dict(
            # (int) After collecting n_sample/n_episode data, how many epochs to train models.
            # Each epoch means one entire pass over the training data.
            epoch_per_collect=10,
            # (int) How many samples in a training batch.
            batch_size=64,
            # (float) The step size of gradient descent.
            learning_rate=3e-4,
            # (dict or None) The learning rate decay.
            # If not None, should contain keys 'epoch_num' and 'min_lr_lambda',
            # where 'epoch_num' is the total epoch num to decay the learning rate to min value,
            # 'min_lr_lambda' is the final decayed learning rate.
            lr_scheduler=None,
            # (float) The loss weight of value network, policy network weight is set to 1.
            value_weight=0.5,
            # (float) The loss weight of entropy regularization, policy network weight is set to 1.
            entropy_weight=0.0,
            # (float) PPO clip ratio, defaults to 0.2.
            clip_ratio=0.2,
            # (bool) Whether to use advantage norm in a whole training batch.
            adv_norm=True,
            # (bool) Whether to use value norm with running mean and std in the whole training process.
            value_norm=True,
            # (bool) Whether to enable special network parameters initialization scheme in PPO, such as orthogonal init.
            ppo_param_init=True,
            # (str) The gradient clip operation type used in PPO, ['clip_norm', 'clip_value', 'clip_momentum_norm'].
            grad_clip_type='clip_norm',
            # (float) The gradient clip target value used in PPO.
            # If ``grad_clip_type`` is 'clip_norm', then the maximum of gradient will be normalized to this value.
            grad_clip_value=0.5,
            # (bool) Whether to ignore done (usually for max step termination env).
            ignore_done=False,
            # (str) The type of KL divergence loss between current policy and pretrained policy, ['k1', 'k2', 'k3'].
            # Reference: http://joschu.net/blog/kl-approx.html
            kl_type='k1',
            # (float) The weight of KL divergence loss.
            kl_beta=0.0,
            # (Optional[str]) The path of pretrained model checkpoint.
            # If provided, KL regularizer will be calculated between current policy and pretrained policy.
            # Default to None, which means KL is not calculated.
            pretrained_model_path=None,
        ),
        # collect_mode config
        collect=dict(
            # (int) How many training samples collected in one collection procedure.
            # Only one of [n_sample, n_episode] should be set.
            # n_sample=64,
            # (int) Split episodes or trajectories into pieces with length `unroll_len`.
            unroll_len=1,
            # (float) Reward's future discount factor, aka. gamma.
            discount_factor=0.99,
            # (float) GAE lambda factor for the balance of bias and variance (1-step td and mc).
            gae_lambda=0.95,
        ),
        eval=dict(),  # for compatibility
    )

    def default_model(self) -> Tuple[str, List[str]]:
        """
        Overview:
            Return this algorithm's default neural network model settings for demonstration. ``__init__`` method \
            will automatically call this method to get the default model settings and create the model.
        Returns:
            - model_info (:obj:`Tuple[str, List[str]]`): The registered model name and model's import_names.

        .. note::
            The user can define and use a customized network model but must obey the same interface definition \
            indicated by the import_names path. For example, for PPO, its registered name is ``ppo`` and the \
            import_names is ``ding.model.template.vac``.

        .. note::
            Because PPO supports both single-agent and multi-agent usage, these functions are implemented with the \
            same policy and two different default models, controlled by ``self._cfg.multi_agent``.
        """
        if self._cfg.multi_agent:
            return 'mavac', ['ding.model.template.mavac']
        else:
            return 'vac', ['ding.model.template.vac']

    def _init_learn(self) -> None:
        """
        Overview:
            Initialize the learn mode of policy, including related attributes and modules. For PPO, it mainly \
            contains optimizer, algorithm-specific arguments such as loss weight, clip_ratio and recompute_adv. \
            This method also executes some special network initializations and prepares running mean/std monitor \
            for value.
            This method will be called in ``__init__`` method if ``learn`` field is in ``enable_field``.

        .. note::
            For the member variables that need to be saved and loaded, please refer to the ``_state_dict_learn`` \
            and ``_load_state_dict_learn`` methods.

        .. note::
            For the member variables that need to be monitored, please refer to the ``_monitor_vars_learn`` method.

        .. note::
            If you want to set some special member variables in ``_init_learn`` method, you'd better name them \
            with prefix ``_learn_`` to avoid conflict with other modes, such as ``self._learn_attr1``.
        """
        self._priority = self._cfg.priority
        self._priority_IS_weight = self._cfg.priority_IS_weight
        assert not self._priority and not self._priority_IS_weight, "Priority is not implemented in PPO"

        assert self._cfg.action_space in ["continuous", "discrete", "hybrid"]
        self._action_space = self._cfg.action_space
        if self._cfg.learn.ppo_param_init:
            for n, m in self._model.named_modules():
                if isinstance(m, torch.nn.Linear):
                    torch.nn.init.orthogonal_(m.weight)
                    torch.nn.init.zeros_(m.bias)
            if self._action_space in ['continuous', 'hybrid']:
                # init log sigma
                if self._action_space == 'continuous':
                    if hasattr(self._model.actor_head, 'log_sigma_param'):
                        torch.nn.init.constant_(self._model.actor_head.log_sigma_param, -0.5)
                elif self._action_space == 'hybrid':  # actor_head[1]: ReparameterizationHead, for action_args
                    if hasattr(self._model.actor_head[1], 'log_sigma_param'):
                        torch.nn.init.constant_(self._model.actor_head[1].log_sigma_param, -0.5)

                for m in list(self._model.critic.modules()) + list(self._model.actor.modules()):
                    if isinstance(m, torch.nn.Linear):
                        # orthogonal initialization
                        torch.nn.init.orthogonal_(m.weight, gain=np.sqrt(2))
                        torch.nn.init.zeros_(m.bias)
                # do last policy layer scaling, this will make initial actions have (close to)
                # 0 mean and std, and will help boost performances,
                # see https://arxiv.org/abs/2006.05990, Fig.24 for details
                for m in self._model.actor.modules():
                    if isinstance(m, torch.nn.Linear):
                        torch.nn.init.zeros_(m.bias)
                        m.weight.data.copy_(0.01 * m.weight.data)

        # Optimizer
        self._optimizer = Adam(
            self._model.parameters(),
            lr=self._cfg.learn.learning_rate,
            grad_clip_type=self._cfg.learn.grad_clip_type,
            clip_value=self._cfg.learn.grad_clip_value
        )

        # Define linear lr scheduler
        if self._cfg.learn.lr_scheduler is not None:
            epoch_num = self._cfg.learn.lr_scheduler['epoch_num']
            min_lr_lambda = self._cfg.learn.lr_scheduler['min_lr_lambda']

            self._lr_scheduler = torch.optim.lr_scheduler.LambdaLR(
                self._optimizer,
                lr_lambda=lambda epoch: max(1.0 - epoch * (1.0 - min_lr_lambda) / epoch_num, min_lr_lambda)
            )

        self._learn_model = model_wrap(self._model, wrapper_name='base')

        # load pretrained model
        if self._cfg.learn.pretrained_model_path is not None:
            self._pretrained_model = copy.deepcopy(self._model)
            state_dict = torch.load(self._cfg.learn.pretrained_model_path, map_location='cpu')
            self._pretrained_model.load_state_dict(state_dict)
            self._pretrained_model.eval()
        else:
            self._pretrained_model = None

        # Algorithm config
        self._value_weight = self._cfg.learn.value_weight
        self._entropy_weight = self._cfg.learn.entropy_weight
        self._clip_ratio = self._cfg.learn.clip_ratio
        self._adv_norm = self._cfg.learn.adv_norm
        self._value_norm = self._cfg.learn.value_norm
        self._kl_type = self._cfg.learn.kl_type
        self._kl_beta = self._cfg.learn.kl_beta
        if self._value_norm:
            self._running_mean_std = RunningMeanStd(epsilon=1e-4, device=self._device)
        self._gamma = self._cfg.collect.discount_factor
        self._gae_lambda = self._cfg.collect.gae_lambda
        self._recompute_adv = self._cfg.recompute_adv
        # Main model
        self._learn_model.reset()

    def _forward_learn(self, data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Overview:
            Policy forward function of learn mode (training policy and updating parameters). Forward means \
            that the policy inputs some training batch data from the replay buffer and then returns the output \
            result, including various training information such as loss, clipfrac, approx_kl.
        Arguments:
            - data (:obj:`List[Dict[int, Any]]`): The input data used for policy forward, including the latest \
              collected training samples for on-policy algorithms like PPO. For each element in the list, the key \
              of the dict is the name of a data item and the value is the corresponding data. Usually, the value is \
              torch.Tensor or np.ndarray or their dict/list combinations. In the ``_forward_learn`` method, data \
              often needs to first be stacked in the batch dimension by some utility functions such as \
              ``default_preprocess_learn``. \
              For PPO, each element in the list is a dict containing at least the following keys: ``obs``, \
              ``action``, ``reward``, ``logit``, ``value``, ``done``. Sometimes, it also contains other keys \
              such as ``weight``.
        Returns:
            - return_infos (:obj:`List[Dict[str, Any]]`): The information list that indicates the training result; \
              each training iteration appends an information dict to the final list. The list will be processed \
              and recorded in text log and tensorboard. The value of the dict must be python scalar or a list of \
              scalars. For the detailed definition of the dict, refer to the code of ``_monitor_vars_learn`` method.

        .. tip::
            The training procedure of PPO is two for loops. The outer loop trains all the collected training \
            samples with ``epoch_per_collect`` epochs. The inner loop splits all the data into different \
            mini-batches with the length of ``batch_size``.

        .. note::
            The input value can be torch.Tensor or dict/list combinations and the current policy supports all of \
            them. For data types that are not supported, the main reason is that the corresponding model does not \
            support them. You can implement your own model rather than use the default model. For more \
            information, please raise an issue in the GitHub repo and we will continue to follow up.

        .. note::
            For more detailed examples, please refer to our unittest for PPOPolicy: ``ding.policy.tests.test_ppo``.
        """
        data = default_preprocess_learn(data, ignore_done=self._cfg.learn.ignore_done, use_nstep=False)
        if self._cuda:
            data = to_device(data, self._device)
        data['obs'] = to_dtype(data['obs'], torch.float32)
        if 'next_obs' in data:
            data['next_obs'] = to_dtype(data['next_obs'], torch.float32)
        # ====================
        # PPO forward
        # ====================
        return_infos = []
        self._learn_model.train()

        for epoch in range(self._cfg.learn.epoch_per_collect):
            if self._recompute_adv:  # calculate new value using the new updated value network
                with torch.no_grad():
                    value = self._learn_model.forward(data['obs'], mode='compute_critic')['value']
                    next_value = self._learn_model.forward(data['next_obs'], mode='compute_critic')['value']
                    if self._value_norm:
                        value *= self._running_mean_std.std
                        next_value *= self._running_mean_std.std

                    traj_flag = data.get('traj_flag', None)  # traj_flag indicates termination of trajectory
                    compute_adv_data = gae_data(value, next_value, data['reward'], data['done'], traj_flag)
                    data['adv'] = gae(compute_adv_data, self._gamma, self._gae_lambda)

                    unnormalized_returns = value + data['adv']

                    if self._value_norm:
                        data['value'] = value / self._running_mean_std.std
                        data['return'] = unnormalized_returns / self._running_mean_std.std
                        self._running_mean_std.update(unnormalized_returns.cpu().numpy())
                    else:
                        data['value'] = value
                        data['return'] = unnormalized_returns

            else:  # don't recompute adv
                if self._value_norm:
                    unnormalized_return = data['adv'] + data['value'] * self._running_mean_std.std
                    data['return'] = unnormalized_return / self._running_mean_std.std
                    self._running_mean_std.update(unnormalized_return.cpu().numpy())
                else:
                    data['return'] = data['adv'] + data['value']

            for batch in split_data_generator(data, self._cfg.learn.batch_size, shuffle=True):
                output = self._learn_model.forward(batch['obs'], mode='compute_actor_critic')
                adv = batch['adv']
                if self._adv_norm:
                    # Normalize advantage in a train_batch
                    adv = (adv - adv.mean()) / (adv.std() + 1e-8)

                if self._pretrained_model is not None:
                    with torch.no_grad():
                        logit_pretrained = self._pretrained_model.forward(batch['obs'], mode='compute_actor')['logit']
                else:
                    logit_pretrained = None

                # Calculate ppo error
                if self._action_space == 'continuous':
                    ppo_batch = ppo_data(
                        output['logit'], batch['logit'], batch['action'], output['value'], batch['value'], adv,
                        batch['return'], batch['weight'], logit_pretrained
                    )
                    ppo_loss, ppo_info = ppo_error_continuous(ppo_batch, self._clip_ratio, kl_type=self._kl_type)
                elif self._action_space == 'discrete':
                    ppo_batch = ppo_data(
                        output['logit'], batch['logit'], batch['action'], output['value'], batch['value'], adv,
                        batch['return'], batch['weight'], logit_pretrained
                    )
                    ppo_loss, ppo_info = ppo_error(ppo_batch, self._clip_ratio, kl_type=self._kl_type)
                elif self._action_space == 'hybrid':
                    # discrete part (discrete policy loss and entropy loss)
                    ppo_discrete_batch = ppo_policy_data(
                        output['logit']['action_type'], batch['logit']['action_type'], batch['action']['action_type'],
                        adv, batch['weight'], logit_pretrained
                    )
                    ppo_discrete_loss, ppo_discrete_info = ppo_policy_error(
                        ppo_discrete_batch, self._clip_ratio, kl_type=self._kl_type
                    )
                    # continuous part (continuous policy loss and entropy loss, value loss)
                    ppo_continuous_batch = ppo_data(
                        output['logit']['action_args'], batch['logit']['action_args'], batch['action']['action_args'],
                        output['value'], batch['value'], adv, batch['return'], batch['weight'], None
                    )
                    ppo_continuous_loss, ppo_continuous_info = ppo_error_continuous(
                        ppo_continuous_batch, self._clip_ratio, kl_type=self._kl_type
                    )
                    # sum discrete and continuous loss
                    ppo_loss = type(ppo_continuous_loss)(
                        ppo_continuous_loss.policy_loss + ppo_discrete_loss.policy_loss, ppo_continuous_loss.value_loss,
                        ppo_continuous_loss.entropy_loss + ppo_discrete_loss.entropy_loss, ppo_continuous_loss.kl_div
                    )
                    ppo_info = type(ppo_continuous_info)(
                        max(ppo_continuous_info.approx_kl, ppo_discrete_info.approx_kl),
                        max(ppo_continuous_info.clipfrac, ppo_discrete_info.clipfrac)
                    )
                wv, we = self._value_weight, self._entropy_weight
                kl_div = ppo_loss.kl_div
                total_loss = (
                    ppo_loss.policy_loss + wv * ppo_loss.value_loss - we * ppo_loss.entropy_loss +
                    self._kl_beta * kl_div
                )

                self._optimizer.zero_grad()
                total_loss.backward()
                self._optimizer.step()

                if self._cfg.learn.lr_scheduler is not None:
                    cur_lr = sum(self._lr_scheduler.get_last_lr()) / len(self._lr_scheduler.get_last_lr())
                else:
                    cur_lr = self._optimizer.defaults['lr']

                return_info = {
                    'cur_lr': cur_lr,
                    'total_loss': total_loss.item(),
                    'policy_loss': ppo_loss.policy_loss.item(),
                    'value_loss': ppo_loss.value_loss.item(),
                    'entropy_loss': ppo_loss.entropy_loss.item(),
                    'adv_max': adv.max().item(),
                    'adv_mean': adv.mean().item(),
                    'value_mean': output['value'].mean().item(),
                    'value_max': output['value'].max().item(),
                    'approx_kl': ppo_info.approx_kl,
                    'clipfrac': ppo_info.clipfrac,
                    'kl_div': kl_div.item(),
                }
                if self._action_space == 'continuous':
                    return_info.update(
                        {
                            'act': batch['action'].float().mean().item(),
                            'mu_mean': output['logit']['mu'].mean().item(),
                            'sigma_mean': output['logit']['sigma'].mean().item(),
                        }
                    )
                return_infos.append(return_info)

        if self._cfg.learn.lr_scheduler is not None:
            self._lr_scheduler.step()

        return return_infos

    def _init_collect(self) -> None:
        """
        Overview:
            Initialize the collect mode of policy, including related attributes and modules. For PPO, it contains \
            the collect_model to balance exploration and exploitation (e.g. the multinomial sample mechanism in \
            discrete action space), and other algorithm-specific arguments such as unroll_len and gae_lambda.
            This method will be called in ``__init__`` method if ``collect`` field is in ``enable_field``.

        .. note::
            If you want to set some special member variables in ``_init_collect`` method, you'd better name them \
            with prefix ``_collect_`` to avoid conflict with other modes, such as ``self._collect_attr1``.

        .. tip::
            Some variables need to be initialized independently in different modes, such as gamma and gae_lambda \
            in PPO. This design is for the convenience of parallel execution of different policy modes.
        """
        self._unroll_len = self._cfg.collect.unroll_len
        assert self._cfg.action_space in ["continuous", "discrete", "hybrid"], self._cfg.action_space
        self._action_space = self._cfg.action_space
        if self._action_space == 'continuous':
            self._collect_model = model_wrap(self._model, wrapper_name='reparam_sample')
        elif self._action_space == 'discrete':
            self._collect_model = model_wrap(self._model, wrapper_name='multinomial_sample')
        elif self._action_space == 'hybrid':
            self._collect_model = model_wrap(self._model, wrapper_name='hybrid_reparam_multinomial_sample')
        self._collect_model.reset()
        self._gamma = self._cfg.collect.discount_factor
        self._gae_lambda = self._cfg.collect.gae_lambda
        self._recompute_adv = self._cfg.recompute_adv

    def _forward_collect(self, data: Dict[int, Any]) -> Dict[int, Any]:
        """
        Overview:
            Policy forward function of collect mode (collecting training data by interacting with envs). Forward \
            means that the policy gets some necessary data (mainly observation) from the envs and then returns \
            the output data, such as the action to interact with the envs.
        Arguments:
            - data (:obj:`Dict[int, Any]`): The input data used for policy forward, including at least the obs. \
              The key of the dict is environment id and the value is the corresponding data of the env.
        Returns:
            - output (:obj:`Dict[int, Any]`): The output data of policy forward, including at least the action and \
              other necessary data (action logit and value) for learn mode defined in ``self._process_transition`` \
              method. The key of the dict is the same as the input data, i.e. environment id.

        .. tip::
            If you want to add more tricks on this policy, like temperature factor in multinomial sample, you can \
            pass related data as extra keyword arguments of this method.

        .. note::
            The input value can be torch.Tensor or dict/list combinations and the current policy supports all of \
            them. For data types that are not supported, the main reason is that the corresponding model does not \
            support them. You can implement your own model rather than use the default model. For more \
            information, please raise an issue in the GitHub repo and we will continue to follow up.

        .. note::
            For more detailed examples, please refer to our unittest for PPOPolicy: ``ding.policy.tests.test_ppo``.
        """
        data_id = list(data.keys())
        data = default_collate(list(data.values()))
        if self._cuda:
            data = to_device(data, self._device)
        self._collect_model.eval()
        with torch.no_grad():
            output = self._collect_model.forward(data, mode='compute_actor_critic')
        if self._cuda:
            output = to_device(output, 'cpu')
        output = default_decollate(output)
        return {i: d for i, d in zip(data_id, output)}

    def _process_transition(self, obs: torch.Tensor, policy_output: Dict[str, torch.Tensor],
                            timestep: namedtuple) -> Dict[str, torch.Tensor]:
        """
        Overview:
            Process and pack one timestep transition data into a dict, which can be directly used for training \
            and saved in the replay buffer. For PPO, it contains obs, next_obs, action, reward, done, logit, value.
        Arguments:
            - obs (:obj:`torch.Tensor`): The env observation of current timestep, such as stacked 2D image in Atari.
            - policy_output (:obj:`Dict[str, torch.Tensor]`): The output of the policy network with the \
              observation as input. For PPO, it contains the state value, action and the logit of the action.
            - timestep (:obj:`namedtuple`): The execution result namedtuple returned by the environment step \
              method, except all the elements have been transformed into tensor data. Usually, it contains the \
              next obs, reward, done, info, etc.
        Returns:
            - transition (:obj:`Dict[str, torch.Tensor]`): The processed transition data of the current timestep.

        .. note::
            ``next_obs`` is used to calculate nstep return when necessary, so we place it into the transition by \
            default. You can delete this field to save memory occupancy if you do not need nstep return.
        """
        transition = {
            'obs': obs,
            'next_obs': timestep.obs,
            'action': policy_output['action'],
            'logit': policy_output['logit'],
            'value': policy_output['value'],
            'reward': timestep.reward,
            'done': timestep.done,
        }
        return transition

    def _get_train_sample(self, transitions: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Overview:
            For a given trajectory (transitions, a list of transition) data, process it into a list of samples \
            that can be used for training directly. In PPO, a train sample is a processed transition with newly \
            computed ``traj_flag`` and ``adv`` fields. This method is usually used in collectors to execute \
            necessary RL data preprocessing before training, which can help the learner amortize relevant time \
            consumption. In addition, you can also implement this method as an identity function and do the data \
            processing in ``self._forward_learn`` method.
        Arguments:
            - transitions (:obj:`List[Dict[str, Any]`): The trajectory data (a list of transition), each element \
              is in the same format as the return value of ``self._process_transition`` method.
        Returns:
            - samples (:obj:`List[Dict[str, Any]]`): The processed train samples, each element is in a similar \
              format to the input transitions, but may contain more data for training, such as GAE advantage.
        """
        data = transitions
        data = to_device(data, self._device)
        for transition in data:
            transition['traj_flag'] = copy.deepcopy(transition['done'])
        data[-1]['traj_flag'] = True

        if self._cfg.learn.ignore_done:
            data[-1]['done'] = False

        if data[-1]['done']:
            last_value = torch.zeros_like(data[-1]['value'])
        else:
            with torch.no_grad():
                last_value = self._collect_model.forward(
                    unsqueeze(data[-1]['next_obs'], 0), mode='compute_actor_critic'
                )['value']
            if len(last_value.shape) == 2:  # multi_agent case:
                last_value = last_value.squeeze(0)
        if self._value_norm:
            last_value *= self._running_mean_std.std
            for i in range(len(data)):
                data[i]['value'] *= self._running_mean_std.std
        data = get_gae(
            data,
            to_device(last_value, self._device),
            gamma=self._gamma,
            gae_lambda=self._gae_lambda,
            cuda=False,
        )
        if self._value_norm:
            for i in range(len(data)):
                data[i]['value'] /= self._running_mean_std.std

        # remove next_obs to save memory when not recomputing adv
        if not self._recompute_adv:
            for i in range(len(data)):
                data[i].pop('next_obs')
        return get_train_sample(data, self._unroll_len)

    def _init_eval(self) -> None:
        """
        Overview:
            Initialize the eval mode of policy, including related attributes and modules. For PPO, it contains \
            the eval model to select the optimal action (e.g. greedily select action with argmax mechanism in \
            discrete action space).
            This method will be called in ``__init__`` method if ``eval`` field is in ``enable_field``.

        .. note::
            If you want to set some special member variables in ``_init_eval`` method, you'd better name them \
            with prefix ``_eval_`` to avoid conflict with other modes, such as ``self._eval_attr1``.
        """
        assert self._cfg.action_space in ["continuous", "discrete", "hybrid"]
        self._action_space = self._cfg.action_space
        if self._action_space == 'continuous':
            self._eval_model = model_wrap(self._model, wrapper_name='deterministic_sample')
        elif self._action_space == 'discrete':
            self._eval_model = model_wrap(self._model, wrapper_name='argmax_sample')
        elif self._action_space == 'hybrid':
            self._eval_model = model_wrap(self._model, wrapper_name='hybrid_reparam_multinomial_sample')

        self._eval_model.reset()

    def _forward_eval(self, data: Dict[int, Any]) -> Dict[int, Any]:
        """
        Overview:
            Policy forward function of eval mode (evaluating policy performance by interacting with envs). \
            Forward means that the policy gets some necessary data (mainly observation) from the envs and then \
            returns the action to interact with the envs. ``_forward_eval`` in PPO often uses a deterministic \
            sample method to get actions while ``_forward_collect`` usually uses a stochastic sample method to \
            balance exploration and exploitation.
        Arguments:
            - data (:obj:`Dict[int, Any]`): The input data used for policy forward, including at least the obs. \
              The key of the dict is environment id and the value is the corresponding data of the env.
        Returns:
            - output (:obj:`Dict[int, Any]`): The output data of policy forward, including at least the action. \
              The key of the dict is the same as the input data, i.e. environment id.

        .. note::
            The input value can be torch.Tensor or dict/list combinations and the current policy supports all of \
            them. For data types that are not supported, the main reason is that the corresponding model does not \
            support them. You can implement your own model rather than use the default model. For more \
            information, please raise an issue in the GitHub repo and we will continue to follow up.

        .. note::
            For more detailed examples, please refer to our unittest for PPOPolicy: ``ding.policy.tests.test_ppo``.
        """
        data_id = list(data.keys())
        data = default_collate(list(data.values()))
        if self._cuda:
            data = to_device(data, self._device)
        self._eval_model.eval()
        with torch.no_grad():
            output = self._eval_model.forward(data, mode='compute_actor')
        if self._cuda:
            output = to_device(output, 'cpu')
        output = default_decollate(output)
        return {i: d for i, d in zip(data_id, output)}

    def _monitor_vars_learn(self) -> List[str]:
        """
        Overview:
            Return the necessary keys for logging the return dict of ``self._forward_learn``. The logger module, \
            such as text logger, tensorboard logger, will use these keys to save the corresponding data.
        Returns:
            - necessary_keys (:obj:`List[str]`): The list of the necessary keys to be logged.
        """
        variables = super()._monitor_vars_learn() + [
            'policy_loss',
            'value_loss',
            'entropy_loss',
            'adv_max',
            'adv_mean',
            'approx_kl',
            'clipfrac',
            'value_max',
            'value_mean',
        ]
        if self._pretrained_model is not None:
            variables += ['kl_div']
        if self._action_space == 'continuous':
            variables += ['mu_mean', 'sigma_mean', 'sigma_grad', 'act']
        return variables


@POLICY_REGISTRY.register('ppo_pg')
class PPOPGPolicy(Policy):
    """
    Overview:
        Policy class of on-policy version PPO algorithm (pure policy gradient without value network).
        Paper link: https://arxiv.org/abs/1707.06347.
    """
    config = dict(
        # (str) RL policy register name (refer to function "POLICY_REGISTRY").
        type='ppo_pg',
        # (bool) Whether to use cuda for network.
        cuda=False,
        # (bool) Whether the RL algorithm is on-policy or off-policy. (Note: in practice PPO can be used off-policy.)
        on_policy=True,
        # (str) Which kind of action space used in PPOPolicy, ['discrete', 'continuous', 'hybrid'].
        action_space='discrete',
        # (bool) Whether to enable multi-agent training, i.e.: MAPPO.
        multi_agent=False,
        # (bool) Whether to need policy data in process transition.
        transition_with_policy_data=True,
        # learn_mode config
        learn=dict(
            # (int) After collecting n_sample/n_episode data, how many epochs to train models.
            # Each epoch means one entire pass over the training data.
            epoch_per_collect=10,
            # (int) How many samples in a training batch.
            batch_size=64,
            # (float) The step size of gradient descent.
            learning_rate=3e-4,
            # (float) The loss weight of entropy regularization, policy network weight is set to 1.
            entropy_weight=0.0,
            # (float) PPO clip ratio, defaults to 0.2.
            clip_ratio=0.2,
            # (bool) Whether to enable special network parameters initialization scheme in PPO, such as orthogonal init.
            ppo_param_init=True,
            # (str) The gradient clip operation type used in PPO, ['clip_norm', 'clip_value', 'clip_momentum_norm'].
            grad_clip_type='clip_norm',
            # (float) The gradient clip target value used in PPO.
            # If ``grad_clip_type`` is 'clip_norm', then the maximum of gradient will be normalized to this value.
            grad_clip_value=0.5,
            # (bool) Whether to ignore done (usually for max step termination env).
            ignore_done=False,
        ),
        # collect_mode config
        collect=dict(
            # (int) How many training episodes collected in one collection process. Only n_episode should be set.
            # n_episode=8,
            # (int) Cut trajectories into pieces with length "unroll_len".
            unroll_len=1,
            # (float) Reward's future discount factor, aka. gamma.
            discount_factor=0.99,
        ),
        eval=dict(),  # for compatibility
    )

    def default_model(self) -> Tuple[str, List[str]]:
        """
        Overview:
            Return this algorithm's default neural network model settings for demonstration. ``__init__`` method \
            will automatically call this method to get the default model settings and create the model.
        Returns:
            - model_info (:obj:`Tuple[str, List[str]]`): The registered model name and model's import_names.
        """
        return 'pg', ['ding.model.template.pg']

    def _init_learn(self) -> None:
        """
        Overview:
            Initialize the learn mode of policy, including related attributes and modules. For PPOPG, it mainly \
            contains optimizer, algorithm-specific arguments such as loss weight and clip_ratio. This method \
            also executes some special network initializations.
            This method will be called in ``__init__`` method if ``learn`` field is in ``enable_field``.

        .. note::
            For the member variables that need to be saved and loaded, please refer to the ``_state_dict_learn`` \
            and ``_load_state_dict_learn`` methods.

        .. note::
            For the member variables that need to be monitored, please refer to the ``_monitor_vars_learn`` method.

        .. note::
            If you want to set some special member variables in ``_init_learn`` method, you'd better name them \
            with prefix ``_learn_`` to avoid conflict with other modes, such as ``self._learn_attr1``.
```
720 """ 721 assert self._cfg.action_space in ["continuous", "discrete"] 722 self._action_space = self._cfg.action_space 723 if self._cfg.learn.ppo_param_init: 724 for n, m in self._model.named_modules(): 725 if isinstance(m, torch.nn.Linear): 726 torch.nn.init.orthogonal_(m.weight) 727 torch.nn.init.zeros_(m.bias) 728 if self._action_space == 'continuous': 729 if hasattr(self._model.head, 'log_sigma_param'): 730 torch.nn.init.constant_(self._model.head.log_sigma_param, -0.5) 731 for m in self._model.modules(): 732 if isinstance(m, torch.nn.Linear): 733 torch.nn.init.zeros_(m.bias) 734 m.weight.data.copy_(0.01 * m.weight.data) 735 736 # Optimizer 737 self._optimizer = Adam( 738 self._model.parameters(), 739 lr=self._cfg.learn.learning_rate, 740 grad_clip_type=self._cfg.learn.grad_clip_type, 741 clip_value=self._cfg.learn.grad_clip_value 742 ) 743 744 self._learn_model = model_wrap(self._model, wrapper_name='base') 745 746 # Algorithm config 747 self._entropy_weight = self._cfg.learn.entropy_weight 748 self._clip_ratio = self._cfg.learn.clip_ratio 749 self._gamma = self._cfg.collect.discount_factor 750 # Main model 751 self._learn_model.reset() 752 753 def _forward_learn(self, data: List[Dict[str, Any]]) -> List[Dict[str, Any]]: 754 """ 755 Overview: 756 Policy forward function of learn mode (training policy and updating parameters). Forward means \ 757 that the policy inputs some training batch data from the replay buffer and then returns the output \ 758 result, including various training information such as loss, clipfrac, approx_kl. 759 Arguments: 760 - data (:obj:`List[Dict[int, Any]]`): The input data used for policy forward, including the latest \ 761 collected training samples for on-policy algorithms like PPO. For each element in list, the key of the \ 762 dict is the name of data items and the value is the corresponding data. Usually, the value is \ 763 torch.Tensor or np.ndarray or there dict/list combinations. 
In the ``_forward_learn`` method, data \ 764 often need to first be stacked in the batch dimension by some utility functions such as \ 765 ``default_preprocess_learn``. \ 766 For PPOPG, each element in list is a dict containing at least the following keys: ``obs``, ``action``, \ 767 ``return``, ``logit``, ``done``. Sometimes, it also contains other keys such as ``weight``. 768 Returns: 769 - return_infos (:obj:`List[Dict[str, Any]]`): The information list that indicated training result, each \ 770 training iteration contains append a information dict into the final list. The list will be precessed \ 771 and recorded in text log and tensorboard. The value of the dict must be python scalar or a list of \ 772 scalars. For the detailed definition of the dict, refer to the code of ``_monitor_vars_learn`` method. 773 774 .. tip:: 775 The training procedure of PPOPG is two for loops. The outer loop trains all the collected training samples \ 776 with ``epoch_per_collect`` epochs. The inner loop splits all the data into different mini-batch with \ 777 the length of ``batch_size``. 778 779 .. note:: 780 The input value can be torch.Tensor or dict/list combinations and current policy supports all of them. \ 781 For the data type that not supported, the main reason is that the corresponding model does not support it. \ 782 You can implement you own model rather than use the default model. For more information, please raise an \ 783 issue in GitHub repo and we will continue to follow up. 
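The two-loop procedure described in the tip above can be sketched in pure Python. The `split_data_generator` below is a simplified, hypothetical stand-in for DI-engine's utility of the same name (the real one operates on dicts of stacked tensors); it only illustrates how each collected sample is visited once per epoch, one mini-batch at a time.

```python
import random


def split_data_generator(data, batch_size, shuffle=True):
    # Yield mini-batches of length ``batch_size`` (the last one may be shorter).
    # Simplified stand-in for DI-engine's ``split_data_generator``.
    indices = list(range(len(data)))
    if shuffle:
        random.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


# Outer loop over epochs, inner loop over mini-batches: every sample is
# used exactly ``epoch_per_collect`` times.
samples = list(range(10))
epoch_per_collect, batch_size = 2, 4
seen = 0
for epoch in range(epoch_per_collect):
    for batch in split_data_generator(samples, batch_size):
        seen += len(batch)  # a real policy would run one gradient step here
assert seen == epoch_per_collect * len(samples)
```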
        """

        data = default_preprocess_learn(data)
        if self._cuda:
            data = to_device(data, self._device)
        return_infos = []
        self._learn_model.train()

        for epoch in range(self._cfg.learn.epoch_per_collect):
            for batch in split_data_generator(data, self._cfg.learn.batch_size, shuffle=True):
                output = self._learn_model.forward(batch['obs'])

                ppo_batch = ppo_policy_data(
                    output['logit'], batch['logit'], batch['action'], batch['return'], batch['weight'], None
                )
                if self._action_space == 'continuous':
                    ppo_loss, ppo_info = ppo_policy_error_continuous(ppo_batch, self._clip_ratio)
                elif self._action_space == 'discrete':
                    ppo_loss, ppo_info = ppo_policy_error(ppo_batch, self._clip_ratio)
                total_loss = ppo_loss.policy_loss - self._entropy_weight * ppo_loss.entropy_loss

                self._optimizer.zero_grad()
                total_loss.backward()
                self._optimizer.step()

                return_info = {
                    'cur_lr': self._optimizer.defaults['lr'],
                    'total_loss': total_loss.item(),
                    'policy_loss': ppo_loss.policy_loss.item(),
                    'entropy_loss': ppo_loss.entropy_loss.item(),
                    'approx_kl': ppo_info.approx_kl,
                    'clipfrac': ppo_info.clipfrac,
                }
                if self._action_space == 'continuous':
                    return_info.update(
                        {
                            'act': batch['action'].float().mean().item(),
                            'mu_mean': output['logit']['mu'].mean().item(),
                            'sigma_mean': output['logit']['sigma'].mean().item(),
                        }
                    )
                return_infos.append(return_info)
        return return_infos

    def _init_collect(self) -> None:
        """
        Overview:
            Initialize the collect mode of policy, including related attributes and modules. For PPOPG, it contains \
            the collect_model to balance exploration and exploitation (e.g. the multinomial sample mechanism in \
            discrete action space), and other algorithm-specific arguments such as unroll_len and gae_lambda.
            This method will be called in ``__init__`` method if ``collect`` field is in ``enable_field``.
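The `ppo_policy_error` call above evaluates PPO's clipped surrogate objective. As a dependency-free illustration, here is a per-sample scalar sketch of that objective (my own simplification, not DI-engine's batched implementation, which also returns the entropy term and monitoring statistics such as approx_kl and clipfrac):

```python
def clipped_surrogate(ratio, adv, clip_ratio=0.2):
    # PPO's clipped objective for one sample:
    #   L = min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv)
    # The loss to minimize is its negation.
    clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    return min(ratio * adv, clipped * adv)


# With a positive advantage, pushing the probability ratio past 1 + eps
# yields no extra objective: the incentive to move further is clipped.
assert clipped_surrogate(1.5, adv=1.0) == clipped_surrogate(10.0, adv=1.0)
```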
835 836 .. note:: 837 If you want to set some spacial member variables in ``_init_collect`` method, you'd better name them \ 838 with prefix ``_collect_`` to avoid conflict with other modes, such as ``self._collect_attr1``. 839 840 .. tip:: 841 Some variables need to initialize independently in different modes, such as gamma and gae_lambda in PPO. \ 842 This design is for the convenience of parallel execution of different policy modes. 843 """ 844 assert self._cfg.action_space in ["continuous", "discrete"], self._cfg.action_space 845 self._action_space = self._cfg.action_space 846 self._unroll_len = self._cfg.collect.unroll_len 847 if self._action_space == 'continuous': 848 self._collect_model = model_wrap(self._model, wrapper_name='reparam_sample') 849 elif self._action_space == 'discrete': 850 self._collect_model = model_wrap(self._model, wrapper_name='multinomial_sample') 851 self._collect_model.reset() 852 self._gamma = self._cfg.collect.discount_factor 853 854 def _forward_collect(self, data: Dict[int, Any]) -> Dict[int, Any]: 855 """ 856 Overview: 857 Policy forward function of collect mode (collecting training data by interacting with envs). Forward means \ 858 that the policy gets some necessary data (mainly observation) from the envs and then returns the output \ 859 data, such as the action to interact with the envs. 860 Arguments: 861 - data (:obj:`Dict[int, Any]`): The input data used for policy forward, including at least the obs. The \ 862 key of the dict is environment id and the value is the corresponding data of the env. 863 Returns: 864 - output (:obj:`Dict[int, Any]`): The output data of policy forward, including at least the action and \ 865 other necessary data (action logit) for learn mode defined in ``self._process_transition`` \ 866 method. The key of the dict is the same as the input data, i.e. environment id. 867 868 .. 
tip:: 869 If you want to add more tricks on this policy, like temperature factor in multinomial sample, you can pass \ 870 related data as extra keyword arguments of this method. 871 872 .. note:: 873 The input value can be torch.Tensor or dict/list combinations and current policy supports all of them. \ 874 For the data type that not supported, the main reason is that the corresponding model does not support it. \ 875 You can implement you own model rather than use the default model. For more information, please raise an \ 876 issue in GitHub repo and we will continue to follow up. 877 """ 878 data_id = list(data.keys()) 879 data = default_collate(list(data.values())) 880 if self._cuda: 881 data = to_device(data, self._device) 882 self._collect_model.eval() 883 with torch.no_grad(): 884 output = self._collect_model.forward(data) 885 if self._cuda: 886 output = to_device(output, 'cpu') 887 output = default_decollate(output) 888 return {i: d for i, d in zip(data_id, output)} 889 890 def _process_transition(self, obs: torch.Tensor, policy_output: Dict[str, torch.Tensor], 891 timestep: namedtuple) -> Dict[str, torch.Tensor]: 892 """ 893 Overview: 894 Process and pack one timestep transition data into a dict, which can be directly used for training and \ 895 saved in replay buffer. For PPOPG, it contains obs, action, reward, done, logit. 896 Arguments: 897 - obs (:obj:`torch.Tensor`): The env observation of current timestep, such as stacked 2D image in Atari. 898 - policy_output (:obj:`Dict[str, torch.Tensor]`): The output of the policy network with the observation \ 899 as input. For PPOPG, it contains the action and the logit of the action. 900 - timestep (:obj:`namedtuple`): The execution result namedtuple returned by the environment step method, \ 901 except all the elements have been transformed into tensor data. Usually, it contains the next obs, \ 902 reward, done, info, etc. 
903 Returns: 904 - transition (:obj:`Dict[str, torch.Tensor]`): The processed transition data of the current timestep. 905 """ 906 transition = { 907 'obs': obs, 908 'action': policy_output['action'], 909 'logit': policy_output['logit'], 910 'reward': timestep.reward, 911 'done': timestep.done, 912 } 913 return transition 914 915 def _get_train_sample(self, data: List[Dict[str, Any]]) -> List[Dict[str, Any]]: 916 """ 917 Overview: 918 For a given entire episode data (a list of transition), process it into a list of sample that \ 919 can be used for training directly. In PPOPG, a train sample is a processed transition with new computed \ 920 ``return`` field. This method is usually used in collectors to execute necessary \ 921 RL data preprocessing before training, which can help learner amortize revelant time consumption. \ 922 In addition, you can also implement this method as an identity function and do the data processing \ 923 in ``self._forward_learn`` method. 924 Arguments: 925 - data (:obj:`List[Dict[str, Any]`): The episode data (a list of transition), each element is \ 926 the same format as the return value of ``self._process_transition`` method. 927 Returns: 928 - samples (:obj:`List[Dict[str, Any]]`): The processed train samples, each element is the similar format \ 929 as input transitions, but may contain more data for training, such as discounted episode return. 930 """ 931 assert data[-1]['done'] is True, "PPO-PG needs a complete epsiode" 932 933 if self._cfg.learn.ignore_done: 934 raise NotImplementedError 935 936 R = 0. 937 for i in reversed(range(len(data))): 938 R = self._gamma * R + data[i]['reward'] 939 data[i]['return'] = R 940 941 return get_train_sample(data, self._unroll_len) 942 943 def _init_eval(self) -> None: 944 """ 945 Overview: 946 Initialize the eval mode of policy, including related attributes and modules. For PPOPG, it contains the \ 947 eval model to select optimial action (e.g. 
greedily select action with argmax mechanism in discrete action). 948 This method will be called in ``__init__`` method if ``eval`` field is in ``enable_field``. 949 950 .. note:: 951 If you want to set some spacial member variables in ``_init_eval`` method, you'd better name them \ 952 with prefix ``_eval_`` to avoid conflict with other modes, such as ``self._eval_attr1``. 953 """ 954 assert self._cfg.action_space in ["continuous", "discrete"] 955 self._action_space = self._cfg.action_space 956 if self._action_space == 'continuous': 957 self._eval_model = model_wrap(self._model, wrapper_name='deterministic_sample') 958 elif self._action_space == 'discrete': 959 self._eval_model = model_wrap(self._model, wrapper_name='argmax_sample') 960 self._eval_model.reset() 961 962 def _forward_eval(self, data: Dict[int, Any]) -> Dict[int, Any]: 963 """ 964 Overview: 965 Policy forward function of eval mode (evaluation policy performance by interacting with envs). Forward \ 966 means that the policy gets some necessary data (mainly observation) from the envs and then returns the \ 967 action to interact with the envs. ``_forward_eval`` in PPO often uses deterministic sample method to get \ 968 actions while ``_forward_collect`` usually uses stochastic sample method for balance exploration and \ 969 exploitation. 970 Arguments: 971 - data (:obj:`Dict[int, Any]`): The input data used for policy forward, including at least the obs. The \ 972 key of the dict is environment id and the value is the corresponding data of the env. 973 Returns: 974 - output (:obj:`Dict[int, Any]`): The output data of policy forward, including at least the action. The \ 975 key of the dict is the same as the input data, i.e. environment id. 976 977 .. note:: 978 The input value can be torch.Tensor or dict/list combinations and current policy supports all of them. \ 979 For the data type that not supported, the main reason is that the corresponding model does not support it. 
\ 980 You can implement you own model rather than use the default model. For more information, please raise an \ 981 issue in GitHub repo and we will continue to follow up. 982 983 .. note:: 984 For more detailed examples, please refer to our unittest for PPOPGPolicy: ``ding.policy.tests.test_ppo``. 985 """ 986 data_id = list(data.keys()) 987 data = default_collate(list(data.values())) 988 if self._cuda: 989 data = to_device(data, self._device) 990 self._eval_model.eval() 991 with torch.no_grad(): 992 output = self._eval_model.forward(data) 993 if self._cuda: 994 output = to_device(output, 'cpu') 995 output = default_decollate(output) 996 return {i: d for i, d in zip(data_id, output)} 997 998 def _monitor_vars_learn(self) -> List[str]: 999 """1000 Overview:1001 Return the necessary keys for logging the return dict of ``self._forward_learn``. The logger module, such \1002 as text logger, tensorboard logger, will use these keys to save the corresponding data.1003 Returns:1004 - necessary_keys (:obj:`List[str]`): The list of the necessary keys to be logged.1005 """1006 return super()._monitor_vars_learn() + [1007 'policy_loss',1008 'entropy_loss',1009 'approx_kl',1010 'clipfrac',1011 ]101210131014@POLICY_REGISTRY.register('ppo_offpolicy')1015class PPOOffPolicy(Policy):1016 """1017 Overview:1018 Policy class of off-policy version PPO algorithm. Paper link: https://arxiv.org/abs/1707.06347.1019 This version is more suitable for large-scale distributed training.1020 """1021 config = dict(1022 # (str) RL policy register name (refer to function "POLICY_REGISTRY").1023 type='ppo',1024 # (bool) Whether to use cuda for network.1025 cuda=False,1026 on_policy=False,1027 # (bool) Whether to use priority (priority sample, IS weight, update priority).1028 priority=False,1029 # (bool) Whether use Importance Sampling Weight to correct biased update. 
If True, priority must be True.1030 priority_IS_weight=False,1031 # (str) Which kind of action space used in PPOPolicy, ["continuous", "discrete", "hybrid"].1032 action_space='discrete',1033 # (bool) Whether to use nstep_return for value loss.1034 nstep_return=False,1035 # (int) The timestep of TD (temporal-difference) loss.1036 nstep=3,1037 # (bool) Whether to need policy data in process transition.1038 transition_with_policy_data=True,1039 # learn_mode config1040 learn=dict(1041 # (int) How many updates(iterations) to train after collector's one collection.1042 # Bigger "update_per_collect" means bigger off-policy.1043 # collect data -> update policy-> collect data -> ...1044 update_per_collect=5,1045 # (int) How many samples in a training batch.1046 batch_size=64,1047 # (float) The step size of gradient descent.1048 learning_rate=0.001,1049 # (float) The loss weight of value network, policy network weight is set to 1.1050 value_weight=0.5,1051 # (float) The loss weight of entropy regularization, policy network weight is set to 1.1052 entropy_weight=0.01,1053 # (float) PPO clip ratio, defaults to 0.2.1054 clip_ratio=0.2,1055 # (bool) Whether to use advantage norm in a whole training batch.1056 adv_norm=False,1057 # (bool) Whether to use value norm with running mean and std in the whole training process.1058 value_norm=True,1059 # (bool) Whether to enable special network parameters initialization scheme in PPO, such as orthogonal init.1060 ppo_param_init=True,1061 # (str) The gradient clip operation type used in PPO, ['clip_norm', clip_value', 'clip_momentum_norm'].1062 grad_clip_type='clip_norm',1063 # (float) The gradient clip target value used in PPO.1064 # If ``grad_clip_type`` is 'clip_norm', then the maximum of gradient will be normalized to this value.1065 grad_clip_value=0.5,1066 # (bool) Whether ignore done (usually for max step termination env).1067 ignore_done=False,1068 # (float) The weight decay (L2 regularization) loss weight, defaults to 0.0.1069 
weight_decay=0.0,1070 ),1071 # collect_mode config1072 collect=dict(1073 # (int) How many training samples collected in one collection procedure.1074 # Only one of [n_sample, n_episode] shoule be set.1075 # n_sample=64,1076 # (int) Cut trajectories into pieces with length "unroll_len".1077 unroll_len=1,1078 # (float) Reward's future discount factor, aka. gamma.1079 discount_factor=0.99,1080 # (float) GAE lambda factor for the balance of bias and variance (1-step td and mc).1081 gae_lambda=0.95,1082 ),1083 eval=dict(), # for compability1084 other=dict(1085 replay_buffer=dict(1086 # (int) Maximum size of replay buffer. Usually, larger buffer size is better.1087 replay_buffer_size=10000,1088 ),1089 ),1090 )10911092 def default_model(self) -> Tuple[str, List[str]]:1093 """1094 Overview:1095 Return this algorithm default neural network model setting for demonstration. ``__init__`` method will \1096 automatically call this method to get the default model setting and create model.1097 Returns:1098 - model_info (:obj:`Tuple[str, List[str]]`): The registered model name and model's import_names.1099 """1100 return 'vac', ['ding.model.template.vac']11011102 def _init_learn(self) -> None:1103 """1104 Overview:1105 Initialize the learn mode of policy, including related attributes and modules. For PPOOff, it mainly \1106 contains optimizer, algorithm-specific arguments such as loss weight and clip_ratio. This method \1107 also executes some special network initializations and prepares running mean/std monitor for value.1108 This method will be called in ``__init__`` method if ``learn`` field is in ``enable_field``.11091110 .. note::1111 For the member variables that need to be saved and loaded, please refer to the ``_state_dict_learn`` \1112 and ``_load_state_dict_learn`` methods.11131114 .. note::1115 For the member variables that need to be monitored, please refer to the ``_monitor_vars_learn`` method.11161117 .. 
note::1118 If you want to set some spacial member variables in ``_init_learn`` method, you'd better name them \1119 with prefix ``_learn_`` to avoid conflict with other modes, such as ``self._learn_attr1``.1120 """1121 self._priority = self._cfg.priority1122 self._priority_IS_weight = self._cfg.priority_IS_weight1123 assert not self._priority and not self._priority_IS_weight, "Priority is not implemented in PPOOff"11241125 assert self._cfg.action_space in ["continuous", "discrete", "hybrid"]1126 self._action_space = self._cfg.action_space11271128 if self._cfg.learn.ppo_param_init:1129 for n, m in self._model.named_modules():1130 if isinstance(m, torch.nn.Linear):1131 torch.nn.init.orthogonal_(m.weight)1132 torch.nn.init.zeros_(m.bias)1133 if self._action_space in ['continuous', 'hybrid']:1134 # init log sigma1135 if self._action_space == 'continuous':1136 if hasattr(self._model.actor_head, 'log_sigma_param'):1137 torch.nn.init.constant_(self._model.actor_head.log_sigma_param, -2.0)1138 elif self._action_space == 'hybrid': # actor_head[1]: ReparameterizationHead, for action_args1139 if hasattr(self._model.actor_head[1], 'log_sigma_param'):1140 torch.nn.init.constant_(self._model.actor_head[1].log_sigma_param, -0.5)11411142 for m in list(self._model.critic.modules()) + list(self._model.actor.modules()):1143 if isinstance(m, torch.nn.Linear):1144 # orthogonal initialization1145 torch.nn.init.orthogonal_(m.weight, gain=np.sqrt(2))1146 torch.nn.init.zeros_(m.bias)1147 # do last policy layer scaling, this will make initial actions have (close to)1148 # 0 mean and std, and will help boost performances,1149 # see https://arxiv.org/abs/2006.05990, Fig.24 for details1150 for m in self._model.actor.modules():1151 if isinstance(m, torch.nn.Linear):1152 torch.nn.init.zeros_(m.bias)1153 m.weight.data.copy_(0.01 * m.weight.data)11541155 # Optimizer1156 self._optimizer = Adam(1157 self._model.parameters(),1158 lr=self._cfg.learn.learning_rate,1159 
grad_clip_type=self._cfg.learn.grad_clip_type,1160 clip_value=self._cfg.learn.grad_clip_value1161 )11621163 self._learn_model = model_wrap(self._model, wrapper_name='base')11641165 # Algorithm config1166 self._value_weight = self._cfg.learn.value_weight1167 self._entropy_weight = self._cfg.learn.entropy_weight1168 self._clip_ratio = self._cfg.learn.clip_ratio1169 self._adv_norm = self._cfg.learn.adv_norm1170 self._value_norm = self._cfg.learn.value_norm1171 if self._value_norm:1172 self._running_mean_std = RunningMeanStd(epsilon=1e-4, device=self._device)1173 self._gamma = self._cfg.collect.discount_factor1174 self._gae_lambda = self._cfg.collect.gae_lambda1175 self._nstep = self._cfg.nstep1176 self._nstep_return = self._cfg.nstep_return1177 # Main model1178 self._learn_model.reset()11791180 def _forward_learn(self, data: List[Dict[str, Any]]) -> Dict[str, Any]:1181 """1182 Overview:1183 Policy forward function of learn mode (training policy and updating parameters). Forward means \1184 that the policy inputs some training batch data from the replay buffer and then returns the output \1185 result, including various training information such as loss, clipfrac and approx_kl.1186 Arguments:1187 - data (:obj:`List[Dict[int, Any]]`): The input data used for policy forward, including a batch of \1188 training samples. For each element in list, the key of the dict is the name of data items and the \1189 value is the corresponding data. Usually, the value is torch.Tensor or np.ndarray or there dict/list \1190 combinations. In the ``_forward_learn`` method, data often need to first be stacked in the batch \1191 dimension by some utility functions such as ``default_preprocess_learn``. \1192 For PPOOff, each element in list is a dict containing at least the following keys: ``obs``, ``adv``, \1193 ``action``, ``logit``, ``value``, ``done``. 
Sometimes, it also contains other keys such as ``weight`` \1194 and ``value_gamma``.1195 Returns:1196 - info_dict (:obj:`Dict[str, Any]`): The information dict that indicated training result, which will be \1197 recorded in text log and tensorboard, values must be python scalar or a list of scalars. For the \1198 detailed definition of the dict, refer to the code of ``_monitor_vars_learn`` method.11991200 .. note::1201 The input value can be torch.Tensor or dict/list combinations and current policy supports all of them. \1202 For the data type that not supported, the main reason is that the corresponding model does not support it. \1203 You can implement you own model rather than use the default model. For more information, please raise an \1204 issue in GitHub repo and we will continue to follow up.1205 """1206 data = default_preprocess_learn(data, ignore_done=self._cfg.learn.ignore_done, use_nstep=self._nstep_return)1207 if self._cuda:1208 data = to_device(data, self._device)1209 data['obs'] = to_dtype(data['obs'], torch.float32)1210 if 'next_obs' in data:1211 data['next_obs'] = to_dtype(data['next_obs'], torch.float32)1212 # ====================1213 # PPO forward1214 # ====================12151216 self._learn_model.train()12171218 with torch.no_grad():1219 if self._value_norm:1220 unnormalized_return = data['adv'] + data['value'] * self._running_mean_std.std1221 data['return'] = unnormalized_return / self._running_mean_std.std1222 self._running_mean_std.update(unnormalized_return.cpu().numpy())1223 else:1224 data['return'] = data['adv'] + data['value']12251226 # normal ppo1227 if not self._nstep_return:1228 output = self._learn_model.forward(data['obs'], mode='compute_actor_critic')1229 adv = data['adv']12301231 if self._adv_norm:1232 # Normalize advantage in a total train_batch1233 adv = (adv - adv.mean()) / (adv.std() + 1e-8)1234 # Calculate ppo loss1235 if self._action_space == 'continuous':1236 ppodata = ppo_data(1237 output['logit'], data['logit'], 
                    data['action'], output['value'], data['value'], adv, data['return'],
                    data['weight'], None
                )
                ppo_loss, ppo_info = ppo_error_continuous(ppodata, self._clip_ratio)
            elif self._action_space == 'discrete':
                ppodata = ppo_data(
                    output['logit'], data['logit'], data['action'], output['value'], data['value'], adv, data['return'],
                    data['weight'], None
                )
                ppo_loss, ppo_info = ppo_error(ppodata, self._clip_ratio)
            elif self._action_space == 'hybrid':
                # discrete part (discrete policy loss and entropy loss)
                ppo_discrete_batch = ppo_policy_data(
                    output['logit']['action_type'], data['logit']['action_type'], data['action']['action_type'], adv,
                    data['weight'], None
                )
                ppo_discrete_loss, ppo_discrete_info = ppo_policy_error(ppo_discrete_batch, self._clip_ratio)
                # continuous part (continuous policy loss and entropy loss, value loss)
                ppo_continuous_batch = ppo_data(
                    output['logit']['action_args'], data['logit']['action_args'], data['action']['action_args'],
                    output['value'], data['value'], adv, data['return'], data['weight'], None
                )
                ppo_continuous_loss, ppo_continuous_info = ppo_error_continuous(ppo_continuous_batch, self._clip_ratio)
                # sum discrete and continuous loss
                ppo_loss = type(ppo_continuous_loss)(
                    ppo_continuous_loss.policy_loss + ppo_discrete_loss.policy_loss, ppo_continuous_loss.value_loss,
                    ppo_continuous_loss.entropy_loss + ppo_discrete_loss.entropy_loss
                )
                ppo_info = type(ppo_continuous_info)(
                    max(ppo_continuous_info.approx_kl, ppo_discrete_info.approx_kl),
                    max(ppo_continuous_info.clipfrac, ppo_discrete_info.clipfrac)
                )

            wv, we = self._value_weight, self._entropy_weight
            total_loss = ppo_loss.policy_loss + wv * ppo_loss.value_loss - we * ppo_loss.entropy_loss

        else:
            output = self._learn_model.forward(data['obs'], mode='compute_actor')
            adv = data['adv']
            if self._adv_norm:
                # Normalize advantage in a total train_batch
                adv = (adv - adv.mean()) / (adv.std() + 1e-8)

            # Calculate ppo loss
            if self._action_space == 'continuous':
                ppodata = ppo_policy_data(output['logit'], data['logit'], data['action'], adv, data['weight'], None)
                ppo_policy_loss, ppo_info = ppo_policy_error_continuous(ppodata, self._clip_ratio)
            elif self._action_space == 'discrete':
                ppodata = ppo_policy_data(output['logit'], data['logit'], data['action'], adv, data['weight'], None)
                ppo_policy_loss, ppo_info = ppo_policy_error(ppodata, self._clip_ratio)
            elif self._action_space == 'hybrid':
                # discrete part (discrete policy loss and entropy loss)
                ppo_discrete_data = ppo_policy_data(
                    output['logit']['action_type'], data['logit']['action_type'], data['action']['action_type'], adv,
                    data['weight'], None
                )
                ppo_discrete_loss, ppo_discrete_info = ppo_policy_error(ppo_discrete_data, self._clip_ratio)
                # continuous part (continuous policy loss and entropy loss, value loss)
                ppo_continuous_data = ppo_policy_data(
                    output['logit']['action_args'], data['logit']['action_args'], data['action']['action_args'], adv,
                    data['weight'], None
                )
                ppo_continuous_loss, ppo_continuous_info = ppo_policy_error_continuous(
                    ppo_continuous_data, self._clip_ratio
                )
                # sum discrete and continuous loss
                ppo_policy_loss = type(ppo_continuous_loss)(
                    ppo_continuous_loss.policy_loss + ppo_discrete_loss.policy_loss,
                    ppo_continuous_loss.entropy_loss + ppo_discrete_loss.entropy_loss
                )
                ppo_info = type(ppo_continuous_info)(
                    max(ppo_continuous_info.approx_kl, ppo_discrete_info.approx_kl),
                    max(ppo_continuous_info.clipfrac, ppo_discrete_info.clipfrac)
                )

            wv, we = self._value_weight, self._entropy_weight
            next_obs = data.get('next_obs')
            value_gamma = data.get('value_gamma')
            reward = data.get('reward')
            # current value
            value = self._learn_model.forward(data['obs'], mode='compute_critic')
            # target value
            next_data = {'obs': next_obs}
            target_value = self._learn_model.forward(next_data['obs'], mode='compute_critic')
            # TODO what should we do here to keep shape
            assert self._nstep > 1
            td_data = v_nstep_td_data(
                value['value'], target_value['value'], reward, data['done'], data['weight'], value_gamma
            )
            # calculate v_nstep_td critic_loss
            critic_loss, td_error_per_sample = v_nstep_td_error(td_data, self._gamma, self._nstep)
            ppo_loss_data = namedtuple('ppo_loss', ['policy_loss', 'value_loss', 'entropy_loss'])
            ppo_loss = ppo_loss_data(ppo_policy_loss.policy_loss, critic_loss, ppo_policy_loss.entropy_loss)
            total_loss = ppo_policy_loss.policy_loss + wv * critic_loss - we * ppo_policy_loss.entropy_loss

        # ====================
        # PPO update
        # ====================
        self._optimizer.zero_grad()
        total_loss.backward()
        self._optimizer.step()
        return_info = {
            'cur_lr': self._optimizer.defaults['lr'],
            'total_loss': total_loss.item(),
            'policy_loss': ppo_loss.policy_loss.item(),
            'value': data['value'].mean().item(),
            'value_loss': ppo_loss.value_loss.item(),
            'entropy_loss': ppo_loss.entropy_loss.item(),
            'adv_abs_max': adv.abs().max().item(),
            'approx_kl': ppo_info.approx_kl,
            'clipfrac': ppo_info.clipfrac,
        }
        if self._action_space == 'continuous':
            return_info.update(
                {
                    'act': data['action'].float().mean().item(),
                    'mu_mean': output['logit']['mu'].mean().item(),
                    'sigma_mean': output['logit']['sigma'].mean().item(),
                }
            )
        return return_info

    def _init_collect(self) -> None:
        """
        Overview:
            Initialize the collect mode of policy, including related attributes and modules. For PPOOff, it contains \
            the collect_model to balance exploration and exploitation (e.g. \
            the multinomial sample mechanism in \
            discrete action space), and other algorithm-specific arguments such as unroll_len and gae_lambda.
            This method will be called in the ``__init__`` method if the ``collect`` field is in ``enable_field``.

        .. note::
            If you want to set some special member variables in the ``_init_collect`` method, you'd better name them \
            with the prefix ``_collect_`` to avoid conflict with other modes, such as ``self._collect_attr1``.

        .. tip::
            Some variables need to be initialized independently in different modes, such as gamma and gae_lambda \
            in PPOOff. This design is for the convenience of parallel execution of different policy modes.
        """
        self._unroll_len = self._cfg.collect.unroll_len
        assert self._cfg.action_space in ["continuous", "discrete", "hybrid"]
        self._action_space = self._cfg.action_space
        if self._action_space == 'continuous':
            self._collect_model = model_wrap(self._model, wrapper_name='reparam_sample')
        elif self._action_space == 'discrete':
            self._collect_model = model_wrap(self._model, wrapper_name='multinomial_sample')
        elif self._action_space == 'hybrid':
            self._collect_model = model_wrap(self._model, wrapper_name='hybrid_reparam_multinomial_sample')
        self._collect_model.reset()
        self._gamma = self._cfg.collect.discount_factor
        self._gae_lambda = self._cfg.collect.gae_lambda
        self._nstep = self._cfg.nstep
        self._nstep_return = self._cfg.nstep_return
        self._value_norm = self._cfg.learn.value_norm
        if self._value_norm:
            self._running_mean_std = RunningMeanStd(epsilon=1e-4, device=self._device)

    def _forward_collect(self, data: Dict[int, Any]) -> Dict[int, Any]:
        """
        Overview:
            Policy forward function of collect mode (collecting training data by interacting with envs). Forward means \
            that the policy gets some necessary data (mainly observation) from the envs and then returns the output \
            data, such as the action to interact with the envs.
        Arguments:
            - data (:obj:`Dict[int, Any]`): The input data used for policy forward, including at least the obs. The \
                key of the dict is environment id and the value is the corresponding data of the env.
        Returns:
            - output (:obj:`Dict[int, Any]`): The output data of policy forward, including at least the action and \
                other necessary data (action logit and value) for learn mode defined in the \
                ``self._process_transition`` method. The key of the dict is the same as the input data, \
                i.e. environment id.

        .. tip::
            If you want to add more tricks to this policy, like a temperature factor in multinomial sampling, you \
            can pass related data as extra keyword arguments of this method.

        .. note::
            The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of \
            them. For data types that are not supported, the main reason is that the corresponding model does not \
            support them. You can implement your own model rather than use the default model. For more information, \
            please raise an issue in the GitHub repo and we will continue to follow up.

        .. note::
            For more detailed examples, please refer to our unittest for PPOOffPolicy: ``ding.policy.tests.test_ppo``.
        """
        data_id = list(data.keys())
        data = default_collate(list(data.values()))
        if self._cuda:
            data = to_device(data, self._device)
        self._collect_model.eval()
        with torch.no_grad():
            output = self._collect_model.forward(data, mode='compute_actor_critic')
        if self._cuda:
            output = to_device(output, 'cpu')
        output = default_decollate(output)
        return {i: d for i, d in zip(data_id, output)}

    def _process_transition(self, obs: torch.Tensor, policy_output: Dict[str, torch.Tensor],
                            timestep: namedtuple) -> Dict[str, torch.Tensor]:
        """
        Overview:
            Process and pack one timestep of transition data into a dict, which can be directly used for training \
            and saved in the replay buffer. For PPO, it contains obs, next_obs, action, reward, done, logit, value.
        Arguments:
            - obs (:obj:`torch.Tensor`): The env observation of the current timestep, such as a stacked 2D image \
                in Atari.
            - policy_output (:obj:`Dict[str, torch.Tensor]`): The output of the policy network with the observation \
                as input. For PPO, it contains the state value, action and the logit of the action.
            - timestep (:obj:`namedtuple`): The execution result namedtuple returned by the environment step method, \
                except all the elements have been transformed into tensor data. Usually, it contains the next obs, \
                reward, done, info, etc.
        Returns:
            - transition (:obj:`Dict[str, torch.Tensor]`): The processed transition data of the current timestep.

        .. note::
            ``next_obs`` is used to calculate the nstep return when necessary, so we place it into the transition \
            by default. \
            You can delete this field to save memory occupancy if you do not need the nstep return.
        """
        transition = {
            'obs': obs,
            'next_obs': timestep.obs,
            'logit': policy_output['logit'],
            'action': policy_output['action'],
            'value': policy_output['value'],
            'reward': timestep.reward,
            'done': timestep.done,
        }
        return transition

    def _get_train_sample(self, transitions: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Overview:
            For a given trajectory (transitions, a list of transition) data, process it into a list of samples that \
            can be used for training directly. In PPO, a train sample is a processed transition with newly computed \
            ``traj_flag`` and ``adv`` fields. This method is usually used in collectors to execute necessary \
            RL data preprocessing before training, which can help the learner amortize the relevant time \
            consumption. In addition, you can also implement this method as an identity function and do the data \
            processing in the ``self._forward_learn`` method.
        Arguments:
            - transitions (:obj:`List[Dict[str, Any]`): The trajectory data (a list of transition), each element is \
                in the same format as the return value of the ``self._process_transition`` method.
        Returns:
            - samples (:obj:`List[Dict[str, Any]]`): The processed train samples, each element is in a similar \
                format to the input transitions, but may contain more data for training, such as the GAE advantage.
        """
        data = transitions
        data = to_device(data, self._device)
        for transition in data:
            transition['traj_flag'] = copy.deepcopy(transition['done'])
        data[-1]['traj_flag'] = True

        if self._cfg.learn.ignore_done:
            data[-1]['done'] = False

        if data[-1]['done']:
            last_value = torch.zeros_like(data[-1]['value'])
        else:
            with torch.no_grad():
                last_value = self._collect_model.forward(
                    unsqueeze(data[-1]['next_obs'], 0), mode='compute_actor_critic'
                )['value']
            if len(last_value.shape) == 2:  # multi_agent case
                last_value = last_value.squeeze(0)
        if self._value_norm:
            last_value *= self._running_mean_std.std
            for i in range(len(data)):
                data[i]['value'] *= self._running_mean_std.std
        data = get_gae(
            data,
            to_device(last_value, self._device),
            gamma=self._gamma,
            gae_lambda=self._gae_lambda,
            cuda=False,
        )
        if self._value_norm:
            for i in range(len(data)):
                data[i]['value'] /= self._running_mean_std.std

        if not self._nstep_return:
            return get_train_sample(data, self._unroll_len)
        else:
            return get_nstep_return_data(data, self._nstep)

    def _init_eval(self) -> None:
        """
        Overview:
            Initialize the eval mode of policy, including related attributes and modules. For PPOOff, it contains \
            the eval model to select the optimal action (e.g. greedily select the action via argmax in a discrete \
            action space). This method will be called in the ``__init__`` method if the ``eval`` field is in \
            ``enable_field``.

        .. note::
            If you want to set some special member variables in the ``_init_eval`` method, you'd better name them \
            with the prefix ``_eval_`` to avoid conflict with other modes, such as ``self._eval_attr1``.
        """
        assert self._cfg.action_space in ["continuous", "discrete", "hybrid"]
        self._action_space = self._cfg.action_space
        if self._action_space == 'continuous':
            self._eval_model = model_wrap(self._model, wrapper_name='deterministic_sample')
        elif self._action_space == 'discrete':
            self._eval_model = model_wrap(self._model, wrapper_name='argmax_sample')
        elif self._action_space == 'hybrid':
            self._eval_model = model_wrap(self._model, wrapper_name='hybrid_deterministic_argmax_sample')
        self._eval_model.reset()

    def _forward_eval(self, data: Dict[int, Any]) -> Dict[int, Any]:
        """
        Overview:
            Policy forward function of eval mode (evaluating policy performance by interacting with envs). Forward \
            means that the policy gets some necessary data (mainly observation) from the envs and then returns the \
            action to interact with the envs. ``_forward_eval`` in PPO often uses a deterministic sample method to \
            get actions, while ``_forward_collect`` usually uses a stochastic sample method to balance exploration \
            and exploitation.
        Arguments:
            - data (:obj:`Dict[int, Any]`): The input data used for policy forward, including at least the obs. The \
                key of the dict is environment id and the value is the corresponding data of the env.
        Returns:
            - output (:obj:`Dict[int, Any]`): The output data of policy forward, including at least the action. The \
                key of the dict is the same as the input data, i.e. environment id.

        .. note::
            The input value can be torch.Tensor or dict/list combinations, and the current policy supports all of \
            them. For data types that are not supported, the main reason is that the corresponding model does not \
            support them. You can implement your own model rather than use the default model. For more information, \
            please raise an issue in the GitHub repo and we will continue to follow up.

        .. note::
            For more detailed examples, please refer to our unittest for PPOOffPolicy: ``ding.policy.tests.test_ppo``.
        """
        data_id = list(data.keys())
        data = default_collate(list(data.values()))
        if self._cuda:
            data = to_device(data, self._device)
        self._eval_model.eval()
        with torch.no_grad():
            output = self._eval_model.forward(data, mode='compute_actor')
        if self._cuda:
            output = to_device(output, 'cpu')
        output = default_decollate(output)
        return {i: d for i, d in zip(data_id, output)}

    def _monitor_vars_learn(self) -> List[str]:
        """
        Overview:
            Return the necessary keys for logging the return dict of ``self._forward_learn``. \
            The logger module, such \
            as text logger, tensorboard logger, will use these keys to save the corresponding data.
        Returns:
            - necessary_keys (:obj:`List[str]`): The list of the necessary keys to be logged.
        """
        variables = super()._monitor_vars_learn() + [
            'policy_loss', 'value', 'value_loss', 'entropy_loss', 'adv_abs_max', 'approx_kl', 'clipfrac'
        ]
        if self._action_space == 'continuous':
            variables += ['mu_mean', 'sigma_mean', 'sigma_grad', 'act']
        return variables


@POLICY_REGISTRY.register('ppo_stdim')
class PPOSTDIMPolicy(PPOPolicy):
    """
    Overview:
        Policy class of on-policy version PPO algorithm with ST-DIM auxiliary model.
        PPO paper link: https://arxiv.org/abs/1707.06347.
        ST-DIM paper link: https://arxiv.org/abs/1906.08226.
    """
    config = dict(
        # (str) RL policy register name (refer to function "POLICY_REGISTRY").
        type='ppo_stdim',
        # (bool) Whether to use cuda for network.
        cuda=False,
        # (bool) Whether the RL algorithm is on-policy or off-policy.
        # (Note: in practice PPO can be used off-policy.)
        on_policy=True,
        # (bool) Whether to use priority (priority sample, IS weight, update priority).
        priority=False,
        # (bool) Whether to use Importance Sampling Weight to correct the biased update due to priority.
        # If True, priority must be True.
        priority_IS_weight=False,
        # (bool) Whether to recompute advantages in each iteration of on-policy PPO.
        recompute_adv=True,
        # (str) Which kind of action space is used in PPOPolicy, ['discrete', 'continuous'].
        action_space='discrete',
        # (bool) Whether to use nstep return to calculate the value target, otherwise, use return = adv + value.
        nstep_return=False,
        # (bool) Whether to enable multi-agent training, i.e. MAPPO.
        multi_agent=False,
        # (bool) Whether policy data is needed in process transition.
        transition_with_policy_data=True,
        # (float) The loss weight of the auxiliary model relative to the main loss.
        aux_loss_weight=0.001,
        aux_model=dict(
            # (int) The encoding size (of each head) to apply contrastive loss.
            encode_shape=64,
            # ([int, int]) The head numbers of the obs encoding and next_obs encoding respectively.
            heads=[1, 1],
            # (str) The contrastive loss type.
            loss_type='infonce',
            # (float) A parameter to adjust the polarity between positive and negative samples.
            temperature=1.0,
        ),
        # learn_mode config
        learn=dict(
            # (int) After collecting n_sample/n_episode data, how many epochs to train models.
            # Each epoch means one entire pass over the training data.
            epoch_per_collect=10,
            # (int) How many samples in a training batch.
            batch_size=64,
            # (float) The step size of gradient descent.
            learning_rate=3e-4,
            # (float) The loss weight of the value network; the policy network weight is set to 1.
            value_weight=0.5,
            # (float) The loss weight of entropy regularization; the policy network weight is set to 1.
            entropy_weight=0.0,
            # (float) PPO clip ratio, defaults to 0.2.
            clip_ratio=0.2,
            # (bool) Whether to use advantage norm in a whole training batch.
            adv_norm=True,
            # (bool) Whether to use value norm with running mean and std in the whole training process.
            value_norm=True,
            # (bool) Whether to enable the special network parameter initialization scheme in PPO, such as
            # orthogonal init.
            ppo_param_init=True,
            # (str) The gradient clip operation type used in PPO, ['clip_norm', 'clip_value', 'clip_momentum_norm'].
            grad_clip_type='clip_norm',
            # (float) The gradient clip target value used in PPO.
            # If ``grad_clip_type`` is 'clip_norm', then the maximum of gradient will be normalized to this value.
            grad_clip_value=0.5,
            # (bool) Whether to ignore done (usually for max step termination env).
            ignore_done=False,
        ),
        # collect_mode config
        collect=dict(
            # (int) How many training samples are collected in one collection procedure.
            # Only one of [n_sample, n_episode] should be set.
            # n_sample=64,
            # (int) Cut trajectories into pieces with length "unroll_len".
            unroll_len=1,
            # (float) Reward's future discount factor, aka. gamma.
            discount_factor=0.99,
            # (float) GAE lambda factor for the balance of bias and variance (1-step td and mc).
            gae_lambda=0.95,
        ),
        eval=dict(),  # for compatibility
    )

    def _init_learn(self) -> None:
        """
        Overview:
            Learn mode init method. \
            Called by ``self.__init__``.
            Init the auxiliary model, its optimizer, and the auxiliary loss weight relative to the main loss.
        """
        super()._init_learn()
        x_size, y_size = self._get_encoding_size()
        self._aux_model = ContrastiveLoss(x_size, y_size, **self._cfg.aux_model)
        if self._cuda:
            self._aux_model.cuda()
        self._aux_optimizer = Adam(self._aux_model.parameters(), lr=self._cfg.learn.learning_rate)
        self._aux_loss_weight = self._cfg.aux_loss_weight

    def _get_encoding_size(self):
        """
        Overview:
            Get the input encoding size of the ST-DIM auxiliary model.
        Returns:
            - info_dict (:obj:`[Tuple, Tuple]`): The encoding size without the first (Batch) dimension.
        """
        obs = self._cfg.model.obs_shape
        if isinstance(obs, int):
            obs = [obs]
        test_data = {
            "obs": torch.randn(1, *obs),
            "next_obs": torch.randn(1, *obs),
        }
        if self._cuda:
            test_data = to_device(test_data, self._device)
        with torch.no_grad():
            x, y = self._model_encode(test_data)
        return x.size()[1:], y.size()[1:]

    def _model_encode(self, data):
        """
        Overview:
            Get the encoding of the main model as input for the auxiliary model.
        Arguments:
            - data (:obj:`dict`): Dict type data, same as the _forward_learn input.
        Returns:
            - (:obj:`Tuple[Tensor]`): The tuple of two tensors to apply contrastive embedding learning. \
                In the ST-DIM algorithm, these two variables are the encodings of `obs` and `next_obs` respectively.
        """
        assert hasattr(self._model, "encoder")
        x = self._model.encoder(data["obs"])
        y = self._model.encoder(data["next_obs"])
        return x, y

    def _forward_learn(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """
        Overview:
            Forward and backward function of learn mode.
        Arguments:
            - data (:obj:`dict`): Dict type data.
        Returns:
            - info_dict (:obj:`Dict[str, Any]`): Including current lr, total_loss, policy_loss, value_loss, \
                entropy_loss, adv_abs_max, approx_kl, clipfrac.
        """
        data = default_preprocess_learn(data, ignore_done=self._cfg.learn.ignore_done, use_nstep=False)
        if self._cuda:
            data = to_device(data, self._device)
        # ====================
        # PPO forward
        # ====================
        return_infos = []
        self._learn_model.train()

        for epoch in range(self._cfg.learn.epoch_per_collect):
            if self._recompute_adv:  # calculate new value using the newly updated value network
                with torch.no_grad():
                    value = self._learn_model.forward(data['obs'], mode='compute_critic')['value']
                    next_value = self._learn_model.forward(data['next_obs'], mode='compute_critic')['value']
                    if self._value_norm:
                        value *= self._running_mean_std.std
                        next_value *= self._running_mean_std.std

                    traj_flag = data.get('traj_flag', None)  # traj_flag indicates termination of trajectory
                    compute_adv_data = gae_data(value, next_value, data['reward'], data['done'], traj_flag)
                    data['adv'] = gae(compute_adv_data, self._gamma, self._gae_lambda)

                    unnormalized_returns = value + data['adv']

                    if self._value_norm:
                        data['value'] = value / self._running_mean_std.std
                        data['return'] = unnormalized_returns / self._running_mean_std.std
                        self._running_mean_std.update(unnormalized_returns.cpu().numpy())
                    else:
                        data['value'] = value
                        data['return'] = unnormalized_returns

            else:  # don't recompute adv
                if self._value_norm:
                    unnormalized_return = data['adv'] + data['value'] * self._running_mean_std.std
                    data['return'] = unnormalized_return / self._running_mean_std.std
                    self._running_mean_std.update(unnormalized_return.cpu().numpy())
                else:
                    data['return'] = data['adv'] + data['value']

            for batch in split_data_generator(data, self._cfg.learn.batch_size, shuffle=True):
                # ======================
                # Auxiliary model update
                # ======================

                # RL network encoding
                # To train the auxiliary network, the
                # gradients of x, y should be 0.
                with torch.no_grad():
                    x_no_grad, y_no_grad = self._model_encode(batch)
                # the forward function of the auxiliary network
                self._aux_model.train()
                aux_loss_learn = self._aux_model.forward(x_no_grad, y_no_grad)
                # the BP process of the auxiliary network
                self._aux_optimizer.zero_grad()
                aux_loss_learn.backward()
                if self._cfg.multi_gpu:
                    self.sync_gradients(self._aux_model)
                self._aux_optimizer.step()

                output = self._learn_model.forward(batch['obs'], mode='compute_actor_critic')
                adv = batch['adv']
                if self._adv_norm:
                    # Normalize advantage in a train_batch
                    adv = (adv - adv.mean()) / (adv.std() + 1e-8)

                # Calculate ppo loss
                if self._action_space == 'continuous':
                    ppo_batch = ppo_data(
                        output['logit'], batch['logit'], batch['action'], output['value'], batch['value'], adv,
                        batch['return'], batch['weight'], None
                    )
                    ppo_loss, ppo_info = ppo_error_continuous(ppo_batch, self._clip_ratio)
                elif self._action_space == 'discrete':
                    ppo_batch = ppo_data(
                        output['logit'], batch['logit'], batch['action'], output['value'], batch['value'], adv,
                        batch['return'], batch['weight'], None
                    )
                    ppo_loss, ppo_info = ppo_error(ppo_batch, self._clip_ratio)

                # ======================
                # Compute auxiliary loss
                # ======================

                # In total_loss BP, the gradients of x, y are required to update the encoding network.
                # The auxiliary network won't be updated since self._optimizer does not contain
                # its weights.
                x, y = self._model_encode(data)
                self._aux_model.eval()
                aux_loss_eval = self._aux_model.forward(x, y) * self._aux_loss_weight

                wv, we = self._value_weight, self._entropy_weight
                total_loss = ppo_loss.policy_loss + wv * ppo_loss.value_loss - we * ppo_loss.entropy_loss \
                    + aux_loss_eval

                self._optimizer.zero_grad()
                total_loss.backward()
                self._optimizer.step()

                return_info = {
                    'cur_lr': self._optimizer.defaults['lr'],
                    'total_loss': total_loss.item(),
                    'aux_loss_learn': aux_loss_learn.item(),
                    'aux_loss_eval': aux_loss_eval.item(),
                    'policy_loss': ppo_loss.policy_loss.item(),
                    'value_loss': ppo_loss.value_loss.item(),
                    'entropy_loss': ppo_loss.entropy_loss.item(),
                    'adv_max': adv.max().item(),
                    'adv_mean': adv.mean().item(),
                    'value_mean': output['value'].mean().item(),
                    'value_max': output['value'].max().item(),
                    'approx_kl': ppo_info.approx_kl,
                    'clipfrac': ppo_info.clipfrac,
                }
                if self._action_space == 'continuous':
                    return_info.update(
                        {
                            'act': batch['action'].float().mean().item(),
                            'mu_mean': output['logit']['mu'].mean().item(),
                            'sigma_mean': output['logit']['sigma'].mean().item(),
                        }
                    )
                return_infos.append(return_info)
        return return_infos

    def _state_dict_learn(self) -> Dict[str, Any]:
        """
        Overview:
            Return the state_dict of learn mode, usually including model, optimizer and aux_optimizer for \
            representation learning.
        Returns:
            - state_dict (:obj:`Dict[str, Any]`): The dict of the current policy learn state, for saving and \
                restoring.
        """
        return {
            'model': self._learn_model.state_dict(),
            'optimizer': self._optimizer.state_dict(),
            'aux_optimizer': self._aux_optimizer.state_dict(),
        }

    def _load_state_dict_learn(self, state_dict: Dict[str, Any]) -> None:
        """
        Overview:
            Load the state_dict variable into policy learn mode.
        Arguments:
            - state_dict (:obj:`Dict[str, Any]`): The dict of policy learn state saved previously.

        .. tip::
            If you want to only load some parts of the model, you can simply set the ``strict`` argument in \
            load_state_dict to ``False``, or refer to ``ding.torch_utils.checkpoint_helper`` for more \
            complicated operations.
        """
        self._learn_model.load_state_dict(state_dict['model'])
        self._optimizer.load_state_dict(state_dict['optimizer'])
        self._aux_optimizer.load_state_dict(state_dict['aux_optimizer'])

    def _monitor_vars_learn(self) -> List[str]:
        """
        Overview:
            Return the necessary keys for logging the return dict of ``self._forward_learn``. The logger module, \
            such as text logger, tensorboard logger, will use these keys to save the corresponding data.
        Returns:
            - necessary_keys (:obj:`List[str]`): The list of the necessary keys to be logged.
        """
        return super()._monitor_vars_learn() + ["aux_loss_learn", "aux_loss_eval"]
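Throughout the listing above, `ppo_error` and its variants compute PPO's clipped surrogate objective on a batch with normalized advantages. As a standalone illustration, here is a minimal sketch of that policy term only; the function name and signature are ours for demonstration, not part of the `ding` API, and the value and entropy terms that `_forward_learn` adds to `total_loss` are omitted:

```python
import torch


def ppo_clipped_policy_loss(logp_new, logp_old, adv, clip_ratio=0.2, adv_norm=True):
    # Normalize advantages over the batch, mirroring the adv_norm branch in _forward_learn.
    if adv_norm:
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
    # Importance ratio between the current and the behavior policy.
    ratio = torch.exp(logp_new - logp_old)
    surr1 = ratio * adv
    # Clip the ratio so a single update cannot move the policy too far.
    surr2 = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
    # Pessimistic (elementwise minimum) surrogate, negated for gradient descent.
    return -torch.min(surr1, surr2).mean()
```

The elementwise minimum makes the objective a lower bound on the unclipped surrogate, which is what keeps the multi-epoch reuse of collected data in `epoch_per_collect` stable.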