ding.model.template.qvac

ContinuousQVAC
Bases: Module
Overview
The neural network and computation graph of Actor-Critic algorithms that have both a Q-value and a V-value critic, such as IQL. This model supports continuous and hybrid action spaces. ContinuousQVAC is composed of four parts: actor_encoder, critic_encoder, actor_head and critic_head. Encoders extract features from observations; heads predict the corresponding value or action logit.
In a high-dimensional observation space such as 2D images, a shared encoder is often used for both actor_encoder and critic_encoder. In a low-dimensional observation space such as 1D vectors, separate encoders are common.
Interfaces:
__init__, forward, compute_actor, compute_critic
__init__(obs_shape, action_shape, action_space, twin_critic=False, actor_head_hidden_size=64, actor_head_layer_num=1, critic_head_hidden_size=64, critic_head_layer_num=1, activation=nn.SiLU(), norm_type=None, encoder_hidden_size_list=None, share_encoder=False)
Overview
Initialize the ContinuousQVAC model according to the input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation's shape, such as 128, (156, ).
- action_shape (:obj:Union[int, SequenceType, EasyDict]): Action's shape, such as 4, (3, ), EasyDict({'action_type_shape': 3, 'action_args_shape': 4}).
- action_space (:obj:str): The type of action space, including [regression, reparameterization, hybrid], regression is used for DDPG/TD3, reparameterization is used for SAC and hybrid for PADDPG.
- twin_critic (:obj:bool): Whether to use twin critic, one of tricks in TD3.
- actor_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to actor head.
- actor_head_layer_num (:obj:int): The num of layers used in the actor network to compute action.
- critic_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to critic head.
- critic_head_layer_num (:obj:int): The num of layers used in the critic network to compute Q-value.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in MLP after each FC layer, if None then default set to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to use after each network layer (FC, Conv), see ding.torch_utils.network for more details.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder, the last element must match head_hidden_size, this argument is only used in image observation.
- share_encoder (:obj:Optional[bool]): Whether to share encoder between actor and critic.
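The action_space argument above changes what the actor head outputs. To illustrate what the reparameterization option (used by SAC) means, here is a minimal pure-Python sketch of the reparameterization trick that the mu/sigma output enables; the function name and the scalar, single-sample setup are illustrative assumptions, not part of the ding API.

```python
import math
import random

def reparameterized_action(mu, sigma, rng):
    """Sample a squashed action from N(mu, sigma) via the
    reparameterization trick: a = tanh(mu + sigma * eps).

    The noise eps is drawn outside the network, so gradients can
    flow through mu and sigma (the deterministic part of the sample).
    """
    eps = rng.gauss(0.0, 1.0)           # external standard-normal noise
    return math.tanh(mu + sigma * eps)  # squash into (-1, 1)

rng = random.Random(0)
actions = [reparameterized_action(0.0, 1.0, rng) for _ in range(5)]
```

In the real model, mu and sigma are per-dimension tensors predicted by the actor head, and the sampling/squashing is handled by the policy, not the model itself.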
forward(inputs, mode)
Overview
QVAC forward computation graph: input an observation tensor to predict the Q-value or action logit. Different modes forward through different network modules to produce different outputs and save computation.
Arguments:
- inputs (:obj:Union[torch.Tensor, Dict[str, torch.Tensor]]): The input data for forward computation graph, for compute_actor, it is the observation tensor, for compute_critic, it is the dict data including obs and action tensor.
- mode (:obj:str): The forward mode, all the modes are defined in the beginning of this class.
Returns:
- output (:obj:Dict[str, torch.Tensor]): The output dict of QVAC forward computation graph, whose key-values vary in different forward modes.
Examples (Actor):
>>> # Regression mode
>>> model = ContinuousQVAC(64, 6, 'regression')
>>> obs = torch.randn(4, 64)
>>> actor_outputs = model(obs, 'compute_actor')
>>> assert actor_outputs['action'].shape == torch.Size([4, 6])
>>> # Reparameterization Mode
>>> model = ContinuousQVAC(64, 6, 'reparameterization')
>>> obs = torch.randn(4, 64)
>>> actor_outputs = model(obs, 'compute_actor')
>>> assert actor_outputs['logit'][0].shape == torch.Size([4, 6]) # mu
>>> assert actor_outputs['logit'][1].shape == torch.Size([4, 6]) # sigma
Examples (Critic):
>>> inputs = {'obs': torch.randn(4, 8), 'action': torch.randn(4, 1)}
>>> model = ContinuousQVAC(obs_shape=(8, ), action_shape=1, action_space='regression')
>>> assert model(inputs, mode='compute_critic')['q_value'].shape == (4, ) # q value
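The mode-based dispatch described above can be sketched without ding or torch; the class and placeholder outputs below are illustrative assumptions, showing only how a string `mode` selects which sub-graph runs so unneeded modules are never computed.

```python
class ModeDispatcher:
    """Minimal sketch of mode-based forward dispatch: the `mode`
    string names a method, and forward() routes the inputs to it."""

    mode = ['compute_actor', 'compute_critic']

    def forward(self, inputs, mode):
        assert mode in self.mode, f"unsupported mode: {mode}"
        # Route to the sub-graph named by `mode`; only that branch runs.
        return getattr(self, mode)(inputs)

    def compute_actor(self, obs):
        # Placeholder: one action entry per observation.
        return {'action': [0.0 for _ in obs]}

    def compute_critic(self, inputs):
        # Placeholder: one Q-value per observation in the input dict.
        return {'q_value': [0.0 for _ in inputs['obs']]}
```

The real ContinuousQVAC follows the same pattern, with compute_actor and compute_critic forwarding through the corresponding encoder and head modules.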
compute_actor(obs)
Overview
QVAC forward computation graph for actor part, input observation tensor to predict action or action logit.
Arguments:
- obs (:obj:torch.Tensor): The input observation tensor data.
Returns:
- outputs (:obj:Dict[str, Union[torch.Tensor, Dict[str, torch.Tensor]]]): Actor output dict varying from action_space: regression, reparameterization, hybrid.
ReturnsKeys (regression):
- action (:obj:torch.Tensor): Continuous action with same size as action_shape, usually in DDPG/TD3.
ReturnsKeys (reparameterization):
- logit (:obj:Dict[str, torch.Tensor]): The predicted reparameterization action logit, usually in SAC. It is a list containing two tensors: mu and sigma. The former is the mean of the Gaussian distribution, the latter is its standard deviation.
ReturnsKeys (hybrid):
- logit (:obj:torch.Tensor): The predicted discrete action type logit, it will be the same dimension as action_type_shape, i.e., all the possible discrete action types.
- action_args (:obj:torch.Tensor): Continuous action arguments with same size as action_args_shape.
Shapes:
- obs (:obj:torch.Tensor): :math:(B, N0), B is batch size and N0 corresponds to obs_shape.
- action (:obj:torch.Tensor): :math:(B, N1), B is batch size and N1 corresponds to action_shape.
- logit.mu (:obj:torch.Tensor): :math:(B, N1), B is batch size and N1 corresponds to action_shape.
- logit.sigma (:obj:torch.Tensor): :math:(B, N1), B is batch size and N1 corresponds to action_shape.
- logit (:obj:torch.Tensor): :math:(B, N2), B is batch size and N2 corresponds to action_shape.action_type_shape.
- action_args (:obj:torch.Tensor): :math:(B, N3), B is batch size and N3 corresponds to action_shape.action_args_shape.
Examples:
>>> # Regression mode
>>> model = ContinuousQVAC(64, 6, 'regression')
>>> obs = torch.randn(4, 64)
>>> actor_outputs = model(obs, 'compute_actor')
>>> assert actor_outputs['action'].shape == torch.Size([4, 6])
>>> # Reparameterization Mode
>>> model = ContinuousQVAC(64, 6, 'reparameterization')
>>> obs = torch.randn(4, 64)
>>> actor_outputs = model(obs, 'compute_actor')
>>> assert actor_outputs['logit'][0].shape == torch.Size([4, 6]) # mu
>>> assert actor_outputs['logit'][1].shape == torch.Size([4, 6]) # sigma
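The hybrid return keys above pair a discrete action-type logit with continuous action arguments. A minimal pure-Python sketch of how such an output might be turned into a concrete hybrid action (greedy argmax over the logit); the function name and greedy selection are illustrative assumptions, not ding's policy logic.

```python
def hybrid_action(logit, action_args):
    """Select the discrete action type (argmax over the logit) and
    pair it with the continuous arguments, as in a hybrid action space."""
    action_type = max(range(len(logit)), key=lambda i: logit[i])
    return {'action_type': action_type, 'action_args': action_args}
```

In practice the logit has shape (B, action_type_shape) and action_args has shape (B, action_args_shape); this sketch shows a single unbatched sample.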
compute_critic(inputs)
Overview
QVAC forward computation graph for critic part, input observation and action tensor to predict Q-value.
Arguments:
- inputs (:obj:Dict[str, torch.Tensor]): The dict of input data, including obs and action tensor, also contains logit and action_args tensor in hybrid action_space.
ArgumentsKeys:
- obs (:obj:torch.Tensor): Observation tensor data; currently supports a batch of 1-dim vector data.
- action (:obj:Union[torch.Tensor, Dict]): Continuous action with same size as action_shape.
- logit (:obj:torch.Tensor): Discrete action logit, only in hybrid action_space.
- action_args (:obj:torch.Tensor): Continuous action arguments, only in hybrid action_space.
Returns:
- outputs (:obj:Dict[str, torch.Tensor]): The output of QVAC's forward computation graph for critic, including q_value.
ReturnKeys:
- q_value (:obj:torch.Tensor): Q value tensor with same size as batch size.
Shapes:
- obs (:obj:torch.Tensor): :math:(B, N1), where B is batch size and N1 is obs_shape.
- logit (:obj:torch.Tensor): :math:(B, N2), B is batch size and N2 corresponds to action_shape.action_type_shape.
- action_args (:obj:torch.Tensor): :math:(B, N3), B is batch size and N3 corresponds to action_shape.action_args_shape.
- action (:obj:torch.Tensor): :math:(B, N4), where B is batch size and N4 is action_shape.
- q_value (:obj:torch.Tensor): :math:(B, ), where B is batch size.
Examples:
>>> inputs = {'obs': torch.randn(4, 8), 'action': torch.randn(4, 1)}
>>> model = ContinuousQVAC(obs_shape=(8, ), action_shape=1, action_space='regression')
>>> assert model(inputs, mode='compute_critic')['q_value'].shape == (4, ) # q value
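When twin_critic=True, compute_critic returns two Q-value estimates, and algorithms such as TD3 take their element-wise minimum to curb overestimation bias (clipped double-Q). A minimal pure-Python sketch of that combination step; the function name and list-based tensors are illustrative assumptions.

```python
def twin_q_target(q1, q2):
    """Clipped double-Q: element-wise minimum of the two critic
    outputs, used as the (pessimistic) target value in TD3."""
    return [min(a, b) for a, b in zip(q1, q2)]
```

In the real model both q1 and q2 are (B,)-shaped tensors produced by the two critic heads.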
Full Source Code: ../ding/model/template/qvac.py