ding.model.template.ppg¶
ding.model.template.ppg
¶
PPG
¶
Bases: Module
Overview
Phasic Policy Gradient (PPG) model from paper Phasic Policy Gradient
https://arxiv.org/abs/2009.04416 This module contains VAC module and an auxiliary critic module.
Interfaces:
forward, compute_actor, compute_critic, compute_actor_critic
__init__(obs_shape, action_shape, action_space='discrete', share_encoder=True, encoder_hidden_size_list=[128, 128, 64], actor_head_hidden_size=64, actor_head_layer_num=2, critic_head_hidden_size=64, critic_head_layer_num=1, activation=nn.ReLU(), norm_type=None, impala_cnn_encoder=False)
¶
Overview
Initailize the PPG Model according to input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation's shape, such as 128, (156, ).
- action_shape (:obj:Union[int, SequenceType]): Action's shape, such as 4, (3, ).
- action_space (:obj:str): The action space type, such as 'discrete', 'continuous'.
- share_encoder (:obj:bool): Whether to share encoder.
- encoder_hidden_size_list (:obj:SequenceType): The hidden size list of encoder.
- actor_head_hidden_size (:obj:int): The hidden_size to pass to actor head.
- actor_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for actor head.
- critic_head_hidden_size (:obj:int): The hidden_size to pass to critic head.
- critic_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for critic head.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in MLP after each FC layer, if None then default set to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to after network layer (FC, Conv), see ding.torch_utils.network for more details.
- impala_cnn_encoder (:obj:bool): Whether to use impala cnn encoder.
forward(inputs, mode)
¶
Overview
Compute action logits or value according to mode being compute_actor, compute_critic or compute_actor_critic.
Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.
- mode (:obj:str): The forward mode, all the modes are defined in the beginning of this class.
Returns:
- outputs (:obj:Dict): The output dict of PPG's forward computation graph, whose key-values vary from different mode.
compute_actor(x)
¶
Overview
Use actor to compute action logits.
Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.
Returns:
- output (:obj:Dict): The output data containing action logits.
ReturnsKeys:
- logit (:obj:torch.Tensor): The predicted action logit tensor, for discrete action space, it will be the same dimension real-value ranged tensor of possible action choices, and for continuous action space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is the same as the number of continuous actions. Hybrid action space is a kind of combination of discrete and continuous action space, so the logit will be a dict with action_type and action_args.
Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is the input feature size.
- output (:obj:Dict): logit: :math:(B, A), where B is batch size and A is the action space size.
compute_critic(x)
¶
Overview
Use critic to compute value.
Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.
Returns:
- output (:obj:Dict): The output dict of VAC's forward computation graph for critic, including value.
ReturnsKeys:
- necessary: value
Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is the input feature size.
- output (:obj:Dict): value: :math:(B, 1), where B is batch size.
compute_actor_critic(x)
¶
Overview
Use actor and critic to compute action logits and value.
Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.
Returns:
- outputs (:obj:Dict): The output dict of PPG's forward computation graph for both actor and critic, including logit and value.
ReturnsKeys:
- logit (:obj:torch.Tensor): The predicted action logit tensor, for discrete action space, it will be the same dimension real-value ranged tensor of possible action choices, and for continuous action space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is the same as the number of continuous actions. Hybrid action space is a kind of combination of discrete and continuous action space, so the logit will be a dict with action_type and action_args.
- value (:obj:torch.Tensor): The predicted state value tensor.
Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is the input feature size.
- output (:obj:Dict): value: :math:(B, 1), where B is batch size.
- output (:obj:Dict): logit: :math:(B, A), where B is batch size and A is the action space size.
.. note::
compute_actor_critic interface aims to save computation when shares encoder.
Full Source Code
../ding/model/template/ppg.py