
ding.model.template.ppg


PPG

Bases: Module

Overview

Phasic Policy Gradient (PPG) model from the paper Phasic Policy Gradient (https://arxiv.org/abs/2009.04416). This module contains a VAC module and an auxiliary critic module.

Interfaces: `forward`, `compute_actor`, `compute_critic`, `compute_actor_critic`

__init__(obs_shape, action_shape, action_space='discrete', share_encoder=True, encoder_hidden_size_list=[128, 128, 64], actor_head_hidden_size=64, actor_head_layer_num=2, critic_head_hidden_size=64, critic_head_layer_num=1, activation=nn.ReLU(), norm_type=None, impala_cnn_encoder=False)

Overview

Initialize the PPG model according to the input arguments.

Arguments:

- `obs_shape` (`Union[int, SequenceType]`): Observation's shape, such as 128 or (156, ).
- `action_shape` (`Union[int, SequenceType]`): Action's shape, such as 4 or (3, ).
- `action_space` (`str`): The action space type, such as 'discrete' or 'continuous'.
- `share_encoder` (`bool`): Whether the actor and critic share the encoder.
- `encoder_hidden_size_list` (`SequenceType`): The hidden size list of the encoder.
- `actor_head_hidden_size` (`int`): The `hidden_size` to pass to the actor head.
- `actor_head_layer_num` (`int`): The number of layers used in the network to compute the output for the actor head.
- `critic_head_hidden_size` (`int`): The `hidden_size` to pass to the critic head.
- `critic_head_layer_num` (`int`): The number of layers used in the network to compute the value output for the critic head.
- `activation` (`Optional[nn.Module]`): The activation function to use in the MLP after each FC layer; if `None`, defaults to `nn.ReLU()`.
- `norm_type` (`Optional[str]`): The type of normalization to apply after each network layer (FC, Conv); see `ding.torch_utils.network` for more details.
- `impala_cnn_encoder` (`bool`): Whether to use the IMPALA CNN encoder.
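The role of `encoder_hidden_size_list` can be pictured as the layer widths of an MLP encoder. The following is a minimal sketch assuming PyTorch; `build_mlp_encoder` is a hypothetical helper for illustration, not the ding implementation (the real encoder lives in the VAC module and also supports convolutional and IMPALA variants):

```python
import torch
import torch.nn as nn


def build_mlp_encoder(obs_shape: int, hidden_size_list, activation=None):
    # Hypothetical helper: stacks Linear layers whose widths follow
    # hidden_size_list, mirroring how encoder_hidden_size_list shapes the encoder.
    activation = activation or nn.ReLU()
    layers, in_size = [], obs_shape
    for h in hidden_size_list:
        layers += [nn.Linear(in_size, h), activation]
        in_size = h
    return nn.Sequential(*layers)


encoder = build_mlp_encoder(8, [128, 128, 64])
feat = encoder(torch.randn(4, 8))
print(feat.shape)  # torch.Size([4, 64])
```

The last entry of the list (here 64) is the feature size the actor and critic heads consume, which is why it matches the default `actor_head_hidden_size` and `critic_head_hidden_size` of 64.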

forward(inputs, mode)

Overview

Compute action logits or value according to `mode`, one of `compute_actor`, `compute_critic`, or `compute_actor_critic`.

Arguments:

- `inputs` (`torch.Tensor`): The input observation tensor data.
- `mode` (`str`): The forward mode; all modes are defined at the beginning of this class.

Returns:

- `outputs` (`Dict`): The output dict of PPG's forward computation graph, whose key-value pairs vary with `mode`.
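The mode dispatch in `forward` is a plain `getattr` lookup guarded by the class-level `mode` list. A stripped-down sketch of the pattern (hypothetical class with dummy methods, no ding dependency):

```python
class ModeDispatcher:
    # Allowed forward modes, mirroring PPG.mode.
    mode = ['compute_actor', 'compute_critic']

    def forward(self, inputs, mode: str):
        # Reject unknown modes early, then dispatch to the matching method by name.
        assert mode in self.mode, "not support forward mode: {}/{}".format(mode, self.mode)
        return getattr(self, mode)(inputs)

    def compute_actor(self, inputs):
        return {'logit': inputs}

    def compute_critic(self, inputs):
        return {'value': inputs}


out = ModeDispatcher().forward([1, 2], mode='compute_actor')
print(out)  # {'logit': [1, 2]}
```

Because dispatch is by method name, each mode's output dict can have different keys, which is why the returned key-value pairs vary with `mode`.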

compute_actor(x)

Overview

Use the actor to compute action logits.

Arguments:

- `x` (`torch.Tensor`): The input observation tensor data.

Returns:

- `output` (`Dict`): The output data containing action logits.

ReturnsKeys:

- `logit` (`torch.Tensor`): The predicted action logit tensor. For a discrete action space, it is a real-valued tensor with the same dimension as the number of possible action choices; for a continuous action space, it is the mu and sigma of the Gaussian distribution, with one mu and sigma per continuous action. A hybrid action space combines discrete and continuous actions, so in that case the logit is a dict with `action_type` and `action_args`.

Shapes:

- `x` (`torch.Tensor`): `(B, N)`, where B is the batch size and N is the input feature size.
- `output` (`Dict`): `logit`: `(B, A)`, where B is the batch size and A is the action space size.

compute_critic(x)

Overview

Use the auxiliary critic to compute the value.

Arguments:

- `x` (`torch.Tensor`): The input observation tensor data.

Returns:

- `output` (`Dict`): The output dict of VAC's forward computation graph for the critic, including `value`.

ReturnsKeys:

- necessary: `value`

Shapes:

- `x` (`torch.Tensor`): `(B, N)`, where B is the batch size and N is the input feature size.
- `output` (`Dict`): `value`: `(B, 1)`, where B is the batch size.
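Internally, `compute_critic` runs a separate auxiliary critic created with `copy.deepcopy` of the VAC critic, so the two start with identical weights but have independent parameters afterwards. A minimal sketch of that pattern, assuming PyTorch (the two-layer `nn.Sequential` is a stand-in for the real critic, not the ding architecture):

```python
import copy

import torch
import torch.nn as nn

# Stand-in for the VAC critic: an encoder layer followed by a value head.
critic = nn.Sequential(nn.Linear(8, 64), nn.Linear(64, 1))
aux_critic = copy.deepcopy(critic)  # identical weights, independent parameters

x = torch.randn(4, 8)
# Same output right after the copy...
assert torch.allclose(critic(x), aux_critic(x))

# ...but updating one does not touch the other.
with torch.no_grad():
    aux_critic[0].weight.add_(1.0)
assert not torch.allclose(critic(x), aux_critic(x))
```

This separation is what makes the auxiliary value phase of PPG possible: the auxiliary critic can be trained without disturbing the value head used by the policy phase.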

compute_actor_critic(x)

Overview

Use the actor and critic to compute action logits and value.

Arguments:

- `x` (`torch.Tensor`): The input observation tensor data.

Returns:

- `outputs` (`Dict`): The output dict of PPG's forward computation graph for both actor and critic, including `logit` and `value`.

ReturnsKeys:

- `logit` (`torch.Tensor`): The predicted action logit tensor. For a discrete action space, it is a real-valued tensor with the same dimension as the number of possible action choices; for a continuous action space, it is the mu and sigma of the Gaussian distribution, with one mu and sigma per continuous action. A hybrid action space combines discrete and continuous actions, so in that case the logit is a dict with `action_type` and `action_args`.
- `value` (`torch.Tensor`): The predicted state value tensor.

Shapes:

- `x` (`torch.Tensor`): `(B, N)`, where B is the batch size and N is the input feature size.
- `output` (`Dict`): `value`: `(B, 1)`, where B is the batch size.
- `output` (`Dict`): `logit`: `(B, A)`, where B is the batch size and A is the action space size.

.. note:: The `compute_actor_critic` interface aims to save computation when the encoder is shared.
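The note above can be made concrete with a counting encoder: calling the actor and critic separately encodes the observation twice, while a combined `compute_actor_critic`-style pass encodes it once and reuses the features. A hypothetical sketch in plain PyTorch (not the ding implementation):

```python
import torch
import torch.nn as nn


class CountingEncoder(nn.Module):
    """Encoder that counts how many times it runs."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 16)
        self.calls = 0

    def forward(self, x):
        self.calls += 1
        return self.fc(x)


encoder = CountingEncoder()
actor_head = nn.Linear(16, 4)
critic_head = nn.Linear(16, 1)

obs = torch.randn(2, 8)

# Separate actor and critic calls: the shared encoder runs twice.
logit = actor_head(encoder(obs))
value = critic_head(encoder(obs))
print(encoder.calls)  # 2

# Combined call: encode once, feed the same features to both heads.
encoder.calls = 0
feat = encoder(obs)
logit, value = actor_head(feat), critic_head(feat)
print(encoder.calls)  # 1
```

With a heavy convolutional or IMPALA encoder, halving the number of encoder passes per training step is a meaningful saving.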

Full Source Code

../ding/model/template/ppg.py

```python
from typing import Optional, Dict, Union
import copy
import torch
import torch.nn as nn
from ding.utils import SequenceType, MODEL_REGISTRY
from .vac import VAC


@MODEL_REGISTRY.register('ppg')
class PPG(nn.Module):
    """
    Overview:
        Phasic Policy Gradient (PPG) model from paper `Phasic Policy Gradient`
        https://arxiv.org/abs/2009.04416 \
        This module contains VAC module and an auxiliary critic module.
    Interfaces:
        ``forward``, ``compute_actor``, ``compute_critic``, ``compute_actor_critic``
    """

    mode = ['compute_actor', 'compute_critic', 'compute_actor_critic']

    def __init__(
            self,
            obs_shape: Union[int, SequenceType],
            action_shape: Union[int, SequenceType],
            action_space: str = 'discrete',
            share_encoder: bool = True,
            encoder_hidden_size_list: SequenceType = [128, 128, 64],
            actor_head_hidden_size: int = 64,
            actor_head_layer_num: int = 2,
            critic_head_hidden_size: int = 64,
            critic_head_layer_num: int = 1,
            activation: Optional[nn.Module] = nn.ReLU(),
            norm_type: Optional[str] = None,
            impala_cnn_encoder: bool = False,
    ) -> None:
        """
        Overview:
            Initialize the PPG Model according to input arguments.
        Arguments:
            - obs_shape (:obj:`Union[int, SequenceType]`): Observation's shape, such as 128, (156, ).
            - action_shape (:obj:`Union[int, SequenceType]`): Action's shape, such as 4, (3, ).
            - action_space (:obj:`str`): The action space type, such as 'discrete', 'continuous'.
            - share_encoder (:obj:`bool`): Whether to share encoder.
            - encoder_hidden_size_list (:obj:`SequenceType`): The hidden size list of encoder.
            - actor_head_hidden_size (:obj:`int`): The ``hidden_size`` to pass to actor head.
            - actor_head_layer_num (:obj:`int`): The num of layers used in the network to compute output \
                for actor head.
            - critic_head_hidden_size (:obj:`int`): The ``hidden_size`` to pass to critic head.
            - critic_head_layer_num (:obj:`int`): The num of layers used in the network to compute value output \
                for critic head.
            - activation (:obj:`Optional[nn.Module]`): The type of activation function to use in ``MLP`` \
                after each FC layer, if ``None`` then default set to ``nn.ReLU()``.
            - norm_type (:obj:`Optional[str]`): The type of normalization to apply after network layer (FC, Conv), \
                see ``ding.torch_utils.network`` for more details.
            - impala_cnn_encoder (:obj:`bool`): Whether to use impala cnn encoder.
        """
        super(PPG, self).__init__()
        self.actor_critic = VAC(
            obs_shape,
            action_shape,
            action_space,
            share_encoder,
            encoder_hidden_size_list,
            actor_head_hidden_size,
            actor_head_layer_num,
            critic_head_hidden_size,
            critic_head_layer_num,
            activation,
            norm_type,
            impala_cnn_encoder=impala_cnn_encoder
        )
        self.aux_critic = copy.deepcopy(self.actor_critic.critic)

    def forward(self, inputs: Union[torch.Tensor, Dict], mode: str) -> Dict:
        """
        Overview:
            Compute action logits or value according to mode being ``compute_actor``, ``compute_critic`` or \
            ``compute_actor_critic``.
        Arguments:
            - inputs (:obj:`torch.Tensor`): The input observation tensor data.
            - mode (:obj:`str`): The forward mode, all the modes are defined in the beginning of this class.
        Returns:
            - outputs (:obj:`Dict`): The output dict of PPG's forward computation graph, whose key-values vary from \
                different ``mode``.
        """
        assert mode in self.mode, "not support forward mode: {}/{}".format(mode, self.mode)
        return getattr(self, mode)(inputs)

    def compute_actor(self, x: torch.Tensor) -> Dict:
        """
        Overview:
            Use actor to compute action logits.
        Arguments:
            - x (:obj:`torch.Tensor`): The input observation tensor data.
        Returns:
            - output (:obj:`Dict`): The output data containing action logits.
        ReturnsKeys:
            - logit (:obj:`torch.Tensor`): The predicted action logit tensor, for discrete action space, it will be \
                the same dimension real-value ranged tensor of possible action choices, and for continuous action \
                space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is \
                the same as the number of continuous actions. Hybrid action space is a kind of combination of \
                discrete and continuous action space, so the logit will be a dict with ``action_type`` and \
                ``action_args``.
        Shapes:
            - x (:obj:`torch.Tensor`): :math:`(B, N)`, where B is batch size and N is the input feature size.
            - output (:obj:`Dict`): ``logit``: :math:`(B, A)`, where B is batch size and A is the action space size.
        """
        return self.actor_critic(x, mode='compute_actor')

    def compute_critic(self, x: torch.Tensor) -> Dict:
        """
        Overview:
            Use critic to compute value.
        Arguments:
            - x (:obj:`torch.Tensor`): The input observation tensor data.
        Returns:
            - output (:obj:`Dict`): The output dict of VAC's forward computation graph for critic, \
                including ``value``.
        ReturnsKeys:
            - necessary: ``value``
        Shapes:
            - x (:obj:`torch.Tensor`): :math:`(B, N)`, where B is batch size and N is the input feature size.
            - output (:obj:`Dict`): ``value``: :math:`(B, 1)`, where B is batch size.
        """
        x = self.aux_critic[0](x)  # encoder
        x = self.aux_critic[1](x)  # head
        return {'value': x['pred']}

    def compute_actor_critic(self, x: torch.Tensor) -> Dict:
        """
        Overview:
            Use actor and critic to compute action logits and value.
        Arguments:
            - x (:obj:`torch.Tensor`): The input observation tensor data.
        Returns:
            - outputs (:obj:`Dict`): The output dict of PPG's forward computation graph for both actor and critic, \
                including ``logit`` and ``value``.
        ReturnsKeys:
            - logit (:obj:`torch.Tensor`): The predicted action logit tensor, for discrete action space, it will be \
                the same dimension real-value ranged tensor of possible action choices, and for continuous action \
                space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is \
                the same as the number of continuous actions. Hybrid action space is a kind of combination of \
                discrete and continuous action space, so the logit will be a dict with ``action_type`` and \
                ``action_args``.
            - value (:obj:`torch.Tensor`): The predicted state value tensor.
        Shapes:
            - x (:obj:`torch.Tensor`): :math:`(B, N)`, where B is batch size and N is the input feature size.
            - output (:obj:`Dict`): ``value``: :math:`(B, 1)`, where B is batch size.
            - output (:obj:`Dict`): ``logit``: :math:`(B, A)`, where B is batch size and A is the action space size.

        .. note::
            ``compute_actor_critic`` interface aims to save computation when the encoder is shared.
        """
        return self.actor_critic(x, mode='compute_actor_critic')
```