ding.model
DiscreteHead
Bases: Module
Overview
The DiscreteHead is used to generate discrete action logits or Q-value logits, and is often used in Q-learning or actor-critic algorithms with discrete action spaces.
Interfaces:
__init__, forward.
__init__(hidden_size, output_size, layer_num=1, activation=nn.ReLU(), norm_type=None, dropout=None, noise=False)
Overview
Init the DiscreteHead layers according to the provided arguments.
Arguments:
- hidden_size (:obj:int): The hidden_size of the MLP connected to DiscreteHead.
- output_size (:obj:int): The number of outputs.
- layer_num (:obj:int): The number of layers used in the network to compute the Q-value output.
- activation (:obj:nn.Module): The type of activation function to use in the MLP. If None, nn.ReLU() is used. Default is nn.ReLU().
- norm_type (:obj:str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default is None.
- dropout (:obj:float): The dropout rate. Default is None (no dropout).
- noise (:obj:bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default is False.
forward(x)
Overview
Use encoded embedding tensor to run MLP with DiscreteHead and return the prediction dictionary.
Arguments:
- x (:obj:torch.Tensor): Tensor containing input embedding.
Returns:
- outputs (:obj:Dict): Dict containing keyword logit (:obj:torch.Tensor).
Shapes:
- x: :math:(B, N), where B = batch_size and N = hidden_size.
- logit: :math:(B, M), where M = output_size.
Examples:
>>> head = DiscreteHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 64])
DuelingHead
Bases: Module
Overview
The DuelingHead is used to output discrete action logits.
This module is used in Dueling DQN.
Interfaces:
__init__, forward.
__init__(hidden_size, output_size, layer_num=1, a_layer_num=None, v_layer_num=None, activation=nn.ReLU(), norm_type=None, dropout=None, noise=False)
Overview
Init the DuelingHead layers according to the provided arguments.
Arguments:
- hidden_size (:obj:int): The hidden_size of the MLP connected to DuelingHead.
- output_size (:obj:int): The number of outputs.
- layer_num (:obj:int): The default number of layers used in the networks to compute the action and value outputs.
- a_layer_num (:obj:int): The number of layers used in the network to compute the action output. Default is layer_num.
- v_layer_num (:obj:int): The number of layers used in the network to compute the value output. Default is layer_num.
- activation (:obj:nn.Module): The type of activation function to use in the MLP. If None, nn.ReLU() is used. Default is nn.ReLU().
- norm_type (:obj:str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default is None.
- dropout (:obj:float): The dropout rate of the dropout layer. Default is None.
- noise (:obj:bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default is False.
forward(x)
Overview
Use encoded embedding tensor to run MLP with DuelingHead and return the prediction dictionary.
Arguments:
- x (:obj:torch.Tensor): Tensor containing input embedding.
Returns:
- outputs (:obj:Dict): Dict containing keyword logit (:obj:torch.Tensor).
Shapes:
- x: :math:(B, N), where B = batch_size and N = hidden_size.
- logit: :math:(B, M), where M = output_size.
Examples:
>>> head = DuelingHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
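The dueling aggregation that DuelingHead performs internally can be sketched as follows (a minimal sketch, assuming the standard mean-subtracted formulation from the Dueling DQN paper; dueling_aggregate is a hypothetical helper, not part of ding):

```python
import torch

def dueling_aggregate(a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
    # Subtracting the mean advantage keeps the V/A decomposition identifiable.
    return v + a - a.mean(dim=-1, keepdim=True)

a = torch.randn(4, 6)  # advantage logits for 6 actions
v = torch.randn(4, 1)  # state value
q = dueling_aggregate(a, v)
assert q.shape == (4, 6)
```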
DistributionHead
Bases: Module
Overview
The DistributionHead is used to generate a distribution over Q-values.
This module is used in the C51 algorithm.
Interfaces:
__init__, forward.
__init__(hidden_size, output_size, layer_num=1, n_atom=51, v_min=-10, v_max=10, activation=nn.ReLU(), norm_type=None, noise=False, eps=1e-06)
Overview
Init the DistributionHead layers according to the provided arguments.
Arguments:
- hidden_size (:obj:int): The hidden_size of the MLP connected to DistributionHead.
- output_size (:obj:int): The number of outputs.
- layer_num (:obj:int): The number of layers used in the network to compute the Q-value distribution.
- n_atom (:obj:int): The number of atoms (discrete supports). Default is 51.
- v_min (:obj:int): The minimum value of the atoms. Default is -10.
- v_max (:obj:int): The maximum value of the atoms. Default is 10.
- activation (:obj:nn.Module): The type of activation function to use in the MLP. If None, nn.ReLU() is used. Default is nn.ReLU().
- norm_type (:obj:str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default is None.
- noise (:obj:bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default is False.
- eps (:obj:float): Small constant used for numerical stability. Default is 1e-6.
forward(x)
Overview
Use encoded embedding tensor to run MLP with DistributionHead and return the prediction dictionary.
Arguments:
- x (:obj:torch.Tensor): Tensor containing input embedding.
Returns:
- outputs (:obj:Dict): Dict containing keywords logit (:obj:torch.Tensor) and distribution (:obj:torch.Tensor).
Shapes:
- x: :math:(B, N), where B = batch_size and N = hidden_size.
- logit: :math:(B, M), where M = output_size.
- distribution: :math:(B, M, n_atom).
Examples:
>>> head = DistributionHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default n_atom is 51
>>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])
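Given the distribution output above, the scalar Q-value per action is its expectation over the atom support; a minimal sketch (assuming the default v_min=-10, v_max=10 grid; expected_q is a hypothetical helper, not part of ding):

```python
import torch

def expected_q(distribution: torch.Tensor, v_min: float = -10., v_max: float = 10.) -> torch.Tensor:
    # The atom locations form an evenly spaced support over [v_min, v_max].
    support = torch.linspace(v_min, v_max, distribution.shape[-1])
    # Q(s, a) = sum_i p_i(s, a) * z_i, the mean of the categorical distribution.
    return (distribution * support).sum(-1)

dist = torch.softmax(torch.randn(4, 64, 51), dim=-1)  # (B, M, n_atom)
q = expected_q(dist)
assert q.shape == (4, 64)
```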
RainbowHead
Bases: Module
Overview
The RainbowHead is used to generate a distribution of Q-values.
This module is used in Rainbow DQN.
Interfaces:
__init__, forward.
__init__(hidden_size, output_size, layer_num=1, n_atom=51, v_min=-10, v_max=10, activation=nn.ReLU(), norm_type=None, noise=True, eps=1e-06)
Overview
Init the RainbowHead layers according to the provided arguments.
Arguments:
- hidden_size (:obj:int): The hidden_size of the MLP connected to RainbowHead.
- output_size (:obj:int): The number of outputs.
- layer_num (:obj:int): The number of layers used in the network to compute the Q-value output.
- n_atom (:obj:int): The number of atoms (discrete supports). Default is 51.
- v_min (:obj:int): The minimum value of the atoms. Default is -10.
- v_max (:obj:int): The maximum value of the atoms. Default is 10.
- activation (:obj:nn.Module): The type of activation function to use in the MLP. If None, nn.ReLU() is used. Default is nn.ReLU().
- norm_type (:obj:str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default is None.
- noise (:obj:bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default is True.
- eps (:obj:float): Small constant used for numerical stability. Default is 1e-6.
forward(x)
Overview
Use encoded embedding tensor to run MLP with RainbowHead and return the prediction dictionary.
Arguments:
- x (:obj:torch.Tensor): Tensor containing input embedding.
Returns:
- outputs (:obj:Dict): Dict containing keywords logit (:obj:torch.Tensor) and distribution (:obj:torch.Tensor).
Shapes:
- x: :math:(B, N), where B = batch_size and N = hidden_size.
- logit: :math:(B, M), where M = output_size.
- distribution: :math:(B, M, n_atom).
Examples:
>>> head = RainbowHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default n_atom is 51
>>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])
QRDQNHead
Bases: Module
Overview
The QRDQNHead (Quantile Regression DQN) is used to output action quantiles.
Interfaces:
__init__, forward.
__init__(hidden_size, output_size, layer_num=1, num_quantiles=32, activation=nn.ReLU(), norm_type=None, noise=False)
Overview
Init the QRDQNHead layers according to the provided arguments.
Arguments:
- hidden_size (:obj:int): The hidden_size of the MLP connected to QRDQNHead.
- output_size (:obj:int): The number of outputs.
- layer_num (:obj:int): The number of layers used in the network to compute the Q-value output.
- num_quantiles (:obj:int): The number of quantiles. Default is 32.
- activation (:obj:nn.Module): The type of activation function to use in the MLP. If None, nn.ReLU() is used. Default is nn.ReLU().
- norm_type (:obj:str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default is None.
- noise (:obj:bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default is False.
forward(x)
Overview
Use encoded embedding tensor to run MLP with QRDQNHead and return the prediction dictionary.
Arguments:
- x (:obj:torch.Tensor): Tensor containing input embedding.
Returns:
- outputs (:obj:Dict): Dict containing keywords logit (:obj:torch.Tensor), q (:obj:torch.Tensor), and tau (:obj:torch.Tensor).
Shapes:
- x: :math:(B, N), where B = batch_size and N = hidden_size.
- logit: :math:(B, M), where M = output_size.
- q: :math:(B, M, num_quantiles).
- tau: :math:(B, M, 1).
Examples:
>>> head = QRDQNHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles is 32
>>> assert outputs['q'].shape == torch.Size([4, 64, 32])
>>> assert outputs['tau'].shape == torch.Size([4, 32, 1])
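Since outputs['q'] holds num_quantiles equally weighted quantile estimates per action, the scalar Q-value is simply their mean over the quantile dimension; a minimal sketch of this reduction:

```python
import torch

# q has shape (B, M, num_quantiles): one return quantile per action per sample.
q = torch.randn(4, 64, 32)
# The Q-value is the mean of the equal-weight Dirac mixture, i.e. the
# average over the quantile dimension.
q_value = q.mean(dim=-1)
assert q_value.shape == (4, 64)
```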
StochasticDuelingHead
Bases: Module
Overview
The Stochastic Dueling Network was proposed in the ACER paper (arXiv:1611.01224); it adapts the dueling network architecture to continuous action spaces.
Interfaces:
__init__, forward.
__init__(hidden_size, action_shape, layer_num=1, a_layer_num=None, v_layer_num=None, activation=nn.ReLU(), norm_type=None, noise=False, last_tanh=True)
Overview
Init the Stochastic DuelingHead layers according to the provided arguments.
Arguments:
- hidden_size (:obj:int): The hidden_size of the MLP connected to StochasticDuelingHead.
- action_shape (:obj:int): The shape of the continuous action space, usually an integer value.
- layer_num (:obj:int): The default number of layers used in the networks to compute the action and value outputs.
- a_layer_num (:obj:int): The number of layers used in the network to compute the action output. Default is layer_num.
- v_layer_num (:obj:int): The number of layers used in the network to compute the value output. Default is layer_num.
- activation (:obj:nn.Module): The type of activation function to use in the MLP. If None, nn.ReLU() is used. Default is nn.ReLU().
- norm_type (:obj:str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default is None.
- noise (:obj:bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default is False.
- last_tanh (:obj:bool): If True, apply tanh to actions. Default is True.
forward(s, a, mu, sigma, sample_size=10)
Overview
Use encoded embedding tensor to run MLP with StochasticDuelingHead and return the prediction dictionary.
Arguments:
- s (:obj:torch.Tensor): Tensor containing input embedding.
- a (:obj:torch.Tensor): The original continuous behaviour action.
- mu (:obj:torch.Tensor): The mu gaussian reparameterization output of actor head at current timestep.
- sigma (:obj:torch.Tensor): The sigma gaussian reparameterization output of actor head at current timestep.
- sample_size (:obj:int): The number of samples for continuous action when computing the Q value.
Returns:
- outputs (:obj:Dict): Dict containing keywords q_value (:obj:torch.Tensor) and v_value (:obj:torch.Tensor).
Shapes:
- s: :math:(B, N), where B = batch_size and N = hidden_size.
- a: :math:(B, A), where A = action_size.
- mu: :math:(B, A).
- sigma: :math:(B, A).
- q_value: :math:(B, 1).
- v_value: :math:(B, 1).
Examples:
>>> head = StochasticDuelingHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> a = torch.randn(4, 64)
>>> mu = torch.randn(4, 64)
>>> sigma = torch.ones(4, 64)
>>> outputs = head(inputs, a, mu, sigma)
>>> assert isinstance(outputs, dict)
>>> assert outputs['q_value'].shape == torch.Size([4, 1])
>>> assert outputs['v_value'].shape == torch.Size([4, 1])
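The stochastic dueling estimate can be sketched as follows (assumed form, following the ACER paper; stochastic_dueling_q and toy_adv are hypothetical helpers for illustration, not part of ding):

```python
import torch

def stochastic_dueling_q(adv_fn, v, a, mu, sigma, sample_size=10):
    # Q(s, a) ~= V(s) + A(s, a) - (1/n) * sum_i A(s, a_i),  a_i ~ N(mu, sigma)
    noise = torch.randn(mu.shape[0], sample_size, mu.shape[1])
    sampled = mu.unsqueeze(1) + sigma.unsqueeze(1) * noise   # (B, n, A)
    baseline = adv_fn(sampled).mean(dim=1)                   # (B, 1)
    return v + adv_fn(a.unsqueeze(1)).squeeze(1) - baseline  # (B, 1)

toy_adv = lambda actions: actions.sum(dim=-1, keepdim=True)  # stand-in for A(s, a)
v = torch.randn(4, 1)
a, mu, sigma = torch.randn(4, 8), torch.randn(4, 8), torch.ones(4, 8)
q = stochastic_dueling_q(toy_adv, v, a, mu, sigma)
assert q.shape == (4, 1)
```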
QuantileHead
Bases: Module
Overview
The QuantileHead is used to output action quantiles.
This module is used in IQN.
Interfaces:
__init__, forward, quantile_net.
.. note::
The difference between QuantileHead and QRDQNHead is that QuantileHead models the state-action quantile function as a mapping from state-actions and samples from some base distribution, while QRDQNHead approximates random returns by a uniform mixture of Dirac functions.
__init__(hidden_size, output_size, layer_num=1, num_quantiles=32, quantile_embedding_size=128, beta_function_type='uniform', activation=nn.ReLU(), norm_type=None, noise=False)
Overview
Init the QuantileHead layers according to the provided arguments.
Arguments:
- hidden_size (:obj:int): The hidden_size of the MLP connected to QuantileHead.
- output_size (:obj:int): The number of outputs.
- layer_num (:obj:int): The number of layers used in the network to compute the Q-value output.
- num_quantiles (:obj:int): The number of quantiles. Default is 32.
- quantile_embedding_size (:obj:int): The embedding size of a quantile. Default is 128.
- beta_function_type (:obj:str): The type of beta function. See ding.rl_utils.beta_function.py for more details. Default is uniform.
- activation (:obj:nn.Module): The type of activation function to use in the MLP. If None, nn.ReLU() is used. Default is nn.ReLU().
- norm_type (:obj:str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default is None.
- noise (:obj:bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default is False.
quantile_net(quantiles)
Overview
Deterministic parametric function trained to reparameterize samples from a base distribution. By repeated Bellman update iterations of Q-learning, the optimal action-value function is estimated.
Arguments:
- quantiles (:obj:torch.Tensor): The encoded embedding tensor of the parametric sample.
Returns:
- quantile_net (:obj:torch.Tensor): Quantile network output tensor after reparameterization.
Shapes:
- quantile_net: :math:(quantile_embedding_size, M), where M = output_size.
Examples:
>>> head = QuantileHead(64, 64)
>>> quantiles = torch.randn(128,1)
>>> qn_output = head.quantile_net(quantiles)
>>> assert isinstance(qn_output, torch.Tensor)
>>> # default quantile_embedding_size: int = 128,
>>> assert qn_output.shape == torch.Size([128, 64])
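The reparameterization inside quantile_net typically starts from a cosine basis expansion of the sampled fractions, as in the IQN paper; a hedged sketch of that first step (assumed form, not the exact ding implementation):

```python
import math
import torch

def cosine_embedding(tau: torch.Tensor, embedding_size: int = 128) -> torch.Tensor:
    # Expand each quantile fraction tau in a cosine basis:
    # phi_j(tau) = cos(pi * j * tau), j = 1..embedding_size.
    j = torch.arange(1, embedding_size + 1, dtype=torch.float32)
    return torch.cos(math.pi * j * tau)  # broadcasts to (num_quantiles, embedding_size)

tau = torch.rand(32, 1)  # sampled quantile fractions in [0, 1)
emb = cosine_embedding(tau)
assert emb.shape == (32, 128)
```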
forward(x, num_quantiles=None)
Overview
Use encoded embedding tensor to run MLP with QuantileHead and return the prediction dictionary.
Arguments:
- x (:obj:torch.Tensor): Tensor containing input embedding.
Returns:
- outputs (:obj:Dict): Dict containing keywords logit (:obj:torch.Tensor), q (:obj:torch.Tensor), and quantiles (:obj:torch.Tensor).
Shapes:
- x: :math:(B, N), where B = batch_size and N = hidden_size.
- logit: :math:(B, M), where M = output_size.
- q: :math:(num_quantiles, B, M).
- quantiles: :math:(quantile_embedding_size, 1).
Examples:
>>> head = QuantileHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles is 32
>>> assert outputs['q'].shape == torch.Size([32, 4, 64])
>>> assert outputs['quantiles'].shape == torch.Size([128, 1])
FQFHead
Bases: Module
Overview
The FQFHead is used to output action quantiles.
This module is used in FQF.
Interfaces:
__init__, forward, quantile_net.
.. note:: The implementation of FQFHead is based on the paper https://arxiv.org/abs/1911.02140. The difference between FQFHead and QuantileHead is that, in FQF, N adjustable quantile values for N adjustable quantile fractions are estimated to approximate the quantile function, and the distribution of the return is approximated by a weighted mixture of N Dirac functions; while in IQN, the state-action quantile function is modeled as a mapping from state-actions and samples from some base distribution.
__init__(hidden_size, output_size, layer_num=1, num_quantiles=32, quantile_embedding_size=128, activation=nn.ReLU(), norm_type=None, noise=False)
Overview
Init the FQFHead layers according to the provided arguments.
Arguments:
- hidden_size (:obj:int): The hidden_size of the MLP connected to FQFHead.
- output_size (:obj:int): The number of outputs.
- layer_num (:obj:int): The number of layers used in the network to compute the Q-value output.
- num_quantiles (:obj:int): The number of quantiles. Default is 32.
- quantile_embedding_size (:obj:int): The embedding size of a quantile. Default is 128.
- activation (:obj:nn.Module): The type of activation function to use in the MLP. If None, nn.ReLU() is used. Default is nn.ReLU().
- norm_type (:obj:str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default is None.
- noise (:obj:bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default is False.
quantile_net(quantiles)
Overview
Deterministic parametric function trained to reparameterize samples from the quantiles_proposal network. By repeated Bellman update iterations of Q-learning, the optimal action-value function is estimated.
Arguments:
- quantiles (:obj:torch.Tensor): The encoded embedding tensor of the parametric sample.
Returns:
- quantile_net (:obj:torch.Tensor): Quantile network output tensor after reparameterization.
Examples:
>>> head = FQFHead(64, 64)
>>> quantiles = torch.randn(4,32)
>>> qn_output = head.quantile_net(quantiles)
>>> assert isinstance(qn_output, torch.Tensor)
>>> # default quantile_embedding_size: int = 128,
>>> assert qn_output.shape == torch.Size([4, 32, 64])
forward(x, num_quantiles=None)
Overview
Use encoded embedding tensor to run MLP with FQFHead and return the prediction dictionary.
Arguments:
- x (:obj:torch.Tensor): Tensor containing input embedding.
Returns:
- outputs (:obj:Dict): Dict containing keywords logit (:obj:torch.Tensor), q (:obj:torch.Tensor), quantiles (:obj:torch.Tensor), quantiles_hats (:obj:torch.Tensor), q_tau_i (:obj:torch.Tensor), entropies (:obj:torch.Tensor).
Shapes:
- x: :math:(B, N), where B = batch_size and N = hidden_size.
- logit: :math:(B, M), where M = output_size.
- q: :math:(B, num_quantiles, M).
- quantiles: :math:(B, num_quantiles + 1).
- quantiles_hats: :math:(B, num_quantiles).
- q_tau_i: :math:(B, num_quantiles - 1, M).
- entropies: :math:(B, 1).
Examples:
>>> head = FQFHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles is 32
>>> assert outputs['q'].shape == torch.Size([4, 32, 64])
>>> assert outputs['quantiles'].shape == torch.Size([4, 33])
>>> assert outputs['quantiles_hats'].shape == torch.Size([4, 32])
>>> assert outputs['q_tau_i'].shape == torch.Size([4, 31, 64])
>>> assert outputs['entropies'].shape == torch.Size([4, 1])
RegressionHead
Bases: Module
Overview
The RegressionHead is used to regress continuous variables.
This module is used for generating Q-value (DDPG critic) of continuous actions, or state value (A2C/PPO), or directly predicting continuous action (DDPG actor).
Interfaces:
__init__, forward.
__init__(input_size, output_size, layer_num=2, final_tanh=False, activation=nn.ReLU(), norm_type=None, hidden_size=None)
Overview
Init the RegressionHead layers according to the provided arguments.
Arguments:
- input_size (:obj:int): The input size of the MLP connected to RegressionHead.
- output_size (:obj:int): The number of outputs.
- layer_num (:obj:int): The number of layers used in the network to compute the output.
- final_tanh (:obj:bool): If True, apply tanh to the output. Default is False.
- activation (:obj:nn.Module): The type of activation function to use in the MLP. If None, nn.ReLU() is used. Default is nn.ReLU().
- norm_type (:obj:str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default is None.
forward(x)
Overview
Use encoded embedding tensor to run MLP with RegressionHead and return the prediction dictionary.
Arguments:
- x (:obj:torch.Tensor): Tensor containing input embedding.
Returns:
- outputs (:obj:Dict): Dict containing keyword pred (:obj:torch.Tensor).
Shapes:
- x: :math:(B, N), where B = batch_size and N = hidden_size.
- pred: :math:(B, M), where M = output_size.
Examples:
>>> head = RegressionHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['pred'].shape == torch.Size([4, 64])
ReparameterizationHead
Bases: Module
Overview
The ReparameterizationHead is used to generate a Gaussian distribution over a continuous variable, parameterized by mu and sigma.
This module is often used in stochastic policies, such as PPO and SAC.
Interfaces:
__init__, forward.
__init__(input_size, output_size, layer_num=2, sigma_type=None, fixed_sigma_value=1.0, activation=nn.ReLU(), norm_type=None, bound_type=None, hidden_size=None)
Overview
Init the ReparameterizationHead layers according to the provided arguments.
Arguments:
- input_size (:obj:int): The input size of the MLP connected to ReparameterizationHead.
- output_size (:obj:int): The number of outputs.
- layer_num (:obj:int): The number of layers used in the network to compute the output.
- sigma_type (:obj:str): The sigma type used. Choose among ['fixed', 'independent', 'conditioned']. Default is None.
- fixed_sigma_value (:obj:float): When sigma_type is 'fixed', the tensor output['sigma'] is filled with this value. Default is 1.0.
- activation (:obj:nn.Module): The type of activation function to use in the MLP. If None, nn.ReLU() is used. Default is nn.ReLU().
- norm_type (:obj:str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default is None.
- bound_type (:obj:str): The bound type to apply to the output mu. Choose among ['tanh', None]. Default is None.
forward(x)
Overview
Use encoded embedding tensor to run MLP with ReparameterizationHead and return the prediction dictionary.
Arguments:
- x (:obj:torch.Tensor): Tensor containing input embedding.
Returns:
- outputs (:obj:Dict): Dict containing keywords mu (:obj:torch.Tensor) and sigma (:obj:torch.Tensor).
Shapes:
- x: :math:(B, N), where B = batch_size and N = hidden_size.
- mu: :math:(B, M), where M = output_size.
- sigma: :math:(B, M).
Examples:
>>> head = ReparameterizationHead(64, 64, sigma_type='fixed')
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['mu'].shape == torch.Size([4, 64])
>>> assert outputs['sigma'].shape == torch.Size([4, 64])
MultiHead
Bases: Module
Overview
The MultiHead is used to generate multiple similar outputs.
For example, we can combine DistributionHead and MultiHead to generate logits for a multi-discrete action space.
Interfaces:
__init__, forward.
__init__(head_cls, hidden_size, output_size_list, **head_kwargs)
Overview
Init the MultiHead layers according to the provided arguments.
Arguments:
- head_cls (:obj:type): The class of the head, chosen among [DuelingHead, DistributionHead, QuantileHead, ...].
- hidden_size (:obj:int): The hidden_size of the MLP connected to the Head.
- output_size_list (:obj:SequenceType): Sequence of output_size for multi-discrete actions, e.g. [2, 3, 5].
- head_kwargs (:obj:dict): Dict containing class-specific arguments.
forward(x)
Overview
Use encoded embedding tensor to run MLP with MultiHead and return the prediction dictionary.
Arguments:
- x (:obj:torch.Tensor): Tensor containing input embedding.
Returns:
- outputs (:obj:Dict): Dict containing the keyword logit (:obj:List[torch.Tensor]), where the logit of the i-th output is accessed at outputs['logit'][i].
Shapes:
- x: :math:(B, N), where B = batch_size and N = hidden_size.
- logit: :math:(B, Mi), where Mi = output_size corresponding to output i.
Examples:
>>> head = MultiHead(DuelingHead, 64, [2, 3, 5], v_layer_num=2)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> # output_size_list is [2, 3, 5] as set
>>> # Therefore each dim of logit is as follows
>>> outputs['logit'][0].shape
torch.Size([4, 2])
>>> outputs['logit'][1].shape
torch.Size([4, 3])
>>> outputs['logit'][2].shape
torch.Size([4, 5])
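The idea behind MultiHead can be sketched with a toy version that keeps one independent sub-head per action dimension (TinyMultiHead is illustrative only; the real MultiHead wraps the head_cls passed to it):

```python
import torch
import torch.nn as nn

class TinyMultiHead(nn.Module):
    # One linear sub-head per entry of output_size_list.
    def __init__(self, hidden_size, output_size_list):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden_size, s) for s in output_size_list)

    def forward(self, x):
        # Collect one logit tensor per action dimension, as MultiHead does.
        return {'logit': [head(x) for head in self.heads]}

head = TinyMultiHead(64, [2, 3, 5])
outputs = head(torch.randn(4, 64))
assert [t.shape for t in outputs['logit']] == [(4, 2), (4, 3), (4, 5)]
```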
BranchingHead
Bases: Module
Overview
The BranchingHead is used to generate Q-value with different branches.
This module is used in Branch DQN.
Interfaces:
__init__, forward.
__init__(hidden_size, num_branches=0, action_bins_per_branch=2, layer_num=1, a_layer_num=None, v_layer_num=None, norm_type=None, activation=nn.ReLU(), noise=False)
Overview
Init the BranchingHead layers according to the provided arguments. This head achieves a linear increase in the number of network outputs with the number of degrees of freedom by allowing a level of independence for each individual action dimension.
It is therefore suitable for high-dimensional action spaces.
Arguments:
- hidden_size (:obj:int): The hidden_size of the MLP connected to BranchingHead.
- num_branches (:obj:int): The number of branches, which is equivalent to the action dimension.
- action_bins_per_branch (:obj:int): The number of action bins in each dimension.
- layer_num (:obj:int): The default number of layers used in the networks to compute the advantage and value outputs.
- a_layer_num (:obj:int): The number of layers used in the network to compute the advantage output.
- v_layer_num (:obj:int): The number of layers used in the network to compute the value output.
- norm_type (:obj:str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default is None.
- activation (:obj:nn.Module): The type of activation function to use in the MLP. If None, nn.ReLU() is used. Default is nn.ReLU().
- noise (:obj:bool): Whether to use NoiseLinearLayer as layer_fn in the Q network's MLP. Default is False.
forward(x)
Overview
Use encoded embedding tensor to run MLP with BranchingHead and return the prediction dictionary.
Arguments:
- x (:obj:torch.Tensor): Tensor containing input embedding.
Returns:
- outputs (:obj:Dict): Dict containing keyword logit (:obj:torch.Tensor).
Shapes:
- x: :math:(B, N), where B = batch_size and N = hidden_size.
- logit: :math:(B, num_branches, action_bins_per_branch).
Examples:
>>> head = BranchingHead(64, 5, 2)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 5, 2])
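Given the (B, num_branches, action_bins_per_branch) logit above, one discrete action per branch is selected with an argmax over the last dimension; a minimal sketch:

```python
import torch

# Logit for 4 samples, 5 branches (action dimensions), 2 bins per branch.
logit = torch.randn(4, 5, 2)
# Independent argmax per branch yields one bin index per action dimension.
actions = logit.argmax(dim=-1)
assert actions.shape == (4, 5)
```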
AttentionPolicyHead
Bases: Module
Overview
Cross-attention-type discrete action policy head, which is often used in variable discrete action space.
Interfaces:
__init__, forward.
forward(key, query)
Overview
Use attention-like mechanism to combine key and query tensor to output discrete action logit.
Arguments:
- key (:obj:torch.Tensor): Tensor containing key embedding.
- query (:obj:torch.Tensor): Tensor containing query embedding.
Returns:
- logit (:obj:torch.Tensor): Tensor containing output discrete action logit.
Shapes:
- key: :math:(B, N, K), where B = batch_size, N = possible discrete action choices and K = hidden_size.
- query: :math:(B, K).
- logit: :math:(B, N).
Examples:
>>> head = AttentionPolicyHead()
>>> key = torch.randn(4, 5, 64)
>>> query = torch.randn(4, 64)
>>> logit = head(key, query)
>>> assert logit.shape == torch.Size([4, 5])
.. note::
In this head, we assume that the key and query tensor are both normalized.
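The scoring step can be sketched as a batched dot product between each candidate's key and the query (assumed form; the actual head may add scaling or normalization):

```python
import torch

key = torch.randn(4, 5, 64)    # (B, N, K): one key per candidate action
query = torch.randn(4, 64)     # (B, K)
# Score each candidate by its dot product with the query:
# (B, N, K) batched-matmul (B, K, 1) -> (B, N, 1) -> (B, N).
logit = torch.bmm(key, query.unsqueeze(-1)).squeeze(-1)
assert logit.shape == (4, 5)
```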
PopArtVHead
Bases: Module
Overview
The PopArtVHead is used to generate an adaptively normalized state value. More information can be found in the paper Multi-task Deep Reinforcement Learning with PopArt (https://arxiv.org/abs/1809.04474). This module is used in PPO or IMPALA.
Interfaces:
__init__, forward.
__init__(hidden_size, output_size, layer_num=1, activation=nn.ReLU(), norm_type=None)
Overview
Init the PopArtVHead layers according to the provided arguments.
Arguments:
- hidden_size (:obj:int): The hidden_size of the MLP connected to PopArtVHead.
- output_size (:obj:int): The number of outputs.
- layer_num (:obj:int): The number of layers used in the network to compute the value output.
- activation (:obj:nn.Module): The type of activation function to use in the MLP. If None, nn.ReLU() is used. Default is nn.ReLU().
- norm_type (:obj:str): The type of normalization to use. See ding.torch_utils.network.fc_block for more details. Default None.
forward(x)
Overview
Use encoded embedding tensor to run MLP with PopArtVHead and return the normalized prediction and the unnormalized prediction dictionary.
Arguments:
- x (:obj:torch.Tensor): Tensor containing input embedding.
Returns:
- outputs (:obj:Dict): Dict containing keyword pred (:obj:torch.Tensor) and unnormalized_pred (:obj:torch.Tensor).
Shapes:
- x: :math:(B, N), where B = batch_size and N = hidden_size.
- pred: :math:(B, M), where M = output_size.
- unnormalized_pred: :math:(B, M).
Examples:
>>> head = PopArtVHead(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict) and outputs['pred'].shape == torch.Size([4, 64]) and outputs['unnormalized_pred'].shape == torch.Size([4, 64])
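The relation between the two returned predictions can be sketched as follows (assumed form, following the PopArt paper: the head maintains running statistics mu and sigma and rescales its normalized output; the values below are illustrative):

```python
import torch

pred = torch.randn(4, 64)  # normalized value prediction
mu, sigma = 1.5, 2.0       # running mean / std tracked by the head (illustrative)
# The unnormalized prediction rescales the normalized one.
unnormalized_pred = pred * sigma + mu
assert unnormalized_pred.shape == (4, 64)
```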
EnsembleHead
Bases: Module
Overview
The EnsembleHead is used to generate Q-value for Q-ensemble in model-based RL algorithms.
Interfaces:
__init__, forward.
forward(x)
Overview
Use encoded embedding tensor to run MLP with EnsembleHead and return the prediction dictionary.
Arguments:
- x (:obj:torch.Tensor): Tensor containing input embedding.
Returns:
- outputs (:obj:Dict): Dict containing keyword pred (:obj:torch.Tensor).
Shapes:
- x: :math:(B, N * ensemble_num, 1), where B = batch_size and N = hidden_size.
- pred: :math:(B, M * ensemble_num, 1), where M = output_size.
Examples:
>>> head = EnsembleHead(64 * 10, 64 * 10)
>>> inputs = torch.randn(4, 64 * 10, 1)
>>> outputs = head(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['pred'].shape == torch.Size([4, 64 * 10, 1])
ConvEncoder
Bases: Module
Overview
The Convolution Encoder is used to encode 2-dim image observations.
Interfaces:
__init__, forward.
__init__(obs_shape, hidden_size_list=[32, 64, 64, 128], activation=nn.ReLU(), kernel_size=[8, 4, 3], stride=[4, 2, 1], padding=None, layer_norm=False, norm_type=None)
Overview
Initialize the Convolution Encoder according to the provided arguments.
Arguments:
- obs_shape (:obj:SequenceType): Sequence of in_channel, plus one or more input size.
- hidden_size_list (:obj:SequenceType): Sequence of hidden_size of subsequent conv layers and the final dense layer.
- activation (:obj:nn.Module): Type of activation to use in the conv layers and ResBlock. Default is nn.ReLU().
- kernel_size (:obj:SequenceType): Sequence of kernel_size of subsequent conv layers.
- stride (:obj:SequenceType): Sequence of stride of subsequent conv layers.
- padding (:obj:SequenceType): Padding added to all four sides of the input for each conv layer. See nn.Conv2d for more details. Default is None.
- layer_norm (:obj:bool): Whether to use DreamerLayerNorm, a special normalization trick proposed in DreamerV3.
- norm_type (:obj:str): Type of normalization to use. See ding.torch_utils.network.ResBlock for more details. Default is None.
forward(x)
Overview
Return output 1D embedding tensor of the env's 2D image observation.
Arguments:
- x (:obj:torch.Tensor): Raw 2D observation of the environment.
Returns:
- outputs (:obj:torch.Tensor): Output embedding tensor.
Shapes:
- x : :math:(B, C, H, W), where B is batch size, C is channel, H is height, W is width.
- outputs: :math:(B, N), where N = hidden_size_list[-1].
Examples:
>>> conv = ConvEncoder(
>>> obs_shape=(4, 84, 84),
>>> hidden_size_list=[32, 64, 64, 128],
>>> activation=nn.ReLU(),
>>> kernel_size=[8, 4, 3],
>>> stride=[4, 2, 1],
>>> padding=None,
>>> layer_norm=False,
>>> norm_type=None
>>> )
>>> x = torch.randn(1, 4, 84, 84)
>>> output = conv(x)
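The spatial size after the three conv layers in the example can be checked with the standard no-padding convolution formula, out = floor((in - kernel) / stride) + 1:

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    # Output spatial size of a conv layer with no padding.
    return (size - kernel) // stride + 1

size = 84
for kernel, stride in zip([8, 4, 3], [4, 2, 1]):
    size = conv_out(size, kernel, stride)
# 84 -> 20 -> 9 -> 7, so the final feature map is 7 x 7 before the dense layer.
assert size == 7
```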
FCEncoder
Bases: Module
Overview
The fully connected encoder is used to encode 1-dim input variables.
Interfaces:
__init__, forward.
__init__(obs_shape, hidden_size_list, res_block=False, activation=nn.ReLU(), norm_type=None, dropout=None)
¶
Overview
Initialize the FC Encoder according to arguments.
Arguments:
- obs_shape (:obj:int): Observation shape.
- hidden_size_list (:obj:SequenceType): Sequence of hidden_size of subsequent FC layers.
- res_block (:obj:bool): Whether to use res_block. Default is False.
- activation (:obj:nn.Module): Type of activation to use in ResFCBlock. Default is nn.ReLU().
- norm_type (:obj:str): Type of normalization to use. See ding.torch_utils.network.ResFCBlock for more details. Default is None.
- dropout (:obj:float): Dropout rate of the dropout layer. If None, no dropout layer is used. Default is None.
forward(x)
¶
Overview
Return output embedding tensor of the env observation.
Arguments:
- x (:obj:torch.Tensor): Env raw observation.
Returns:
- outputs (:obj:torch.Tensor): Output embedding tensor.
Shapes:
- x : :math:(B, M), where M = obs_shape.
- outputs: :math:(B, N), where N = hidden_size_list[-1].
Examples:
>>> fc = FCEncoder(
>>> obs_shape=4,
>>> hidden_size_list=[32, 64, 64, 128],
>>> activation=nn.ReLU(),
>>> norm_type=None,
>>> dropout=None
>>> )
>>> x = torch.randn(1, 4)
>>> output = fc(x)
IMPALAConvEncoder
¶
Bases: Module
Overview
IMPALA CNN encoder, used in the IMPALA algorithm. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, https://arxiv.org/pdf/1802.01561.pdf.
Interface:
__init__, forward, output_shape.
__init__(obs_shape, channels=(16, 32, 32), outsize=256, scale_ob=255.0, nblock=2, final_relu=True, **kwargs)
¶
Overview
Initialize the IMPALA CNN encoder according to arguments.
Arguments:
- obs_shape (:obj:SequenceType): 2D image observation shape.
- channels (:obj:SequenceType): The channel numbers of a series of IMPALA CNN blocks. Each element of the sequence is the output channel number of an IMPALA CNN block.
- outsize (:obj:int): The output size of the final linear layer, i.e., the dimension of the 1D embedding vector.
- scale_ob (:obj:float): The scale of the input observation, which is used to normalize the input observation, such as dividing 255.0 for the raw image observation.
- nblock (:obj:int): The number of residual blocks in each IMPALA CNN block.
- final_relu (:obj:bool): Whether to use ReLU activation in the final output of encoder.
- kwargs (:obj:Dict[str, Any]): Other arguments for IMPALACnnDownStack.
forward(x)
¶
Overview
Return the 1D embedding vector of the input 2D observation.
Arguments:
- x (:obj:torch.Tensor): Input 2D observation tensor.
Returns:
- output (:obj:torch.Tensor): Output 1D embedding vector.
Shapes:
- x (:obj:torch.Tensor): :math:(B, C, H, W), where B is batch size, C is channel number, H is height and W is width.
- output (:obj:torch.Tensor): :math:(B, outsize), where B is batch size.
Examples:
>>> encoder = IMPALAConvEncoder(
>>> obs_shape=(4, 84, 84),
>>> channels=(16, 32, 32),
>>> outsize=256,
>>> scale_ob=255.0,
>>> nblock=2,
>>> final_relu=True,
>>> )
>>> x = torch.randn(1, 4, 84, 84)
>>> output = encoder(x)
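Each entry of channels corresponds to one down-sampling stack, so the spatial resolution is roughly halved per stack before the final flatten and linear layer. A standalone sketch of that arithmetic, assuming the classic IMPALA down-sampling (3x3 max-pool, stride 2, padding 1; check IMPALACnnDownStack for the exact settings):

```python
def pool_out_size(size, kernel_size=3, stride=2, padding=1):
    # Max-pool output-size formula; 3x3 / stride-2 / padding-1 is the
    # down-sampling used in the original IMPALA architecture (an assumption here).
    return (size + 2 * padding - kernel_size) // stride + 1

size = 84
for _ in (16, 32, 32):  # one down-sampling stack per entry of `channels`
    size = pool_out_size(size)
print(size)  # each stack roughly halves the resolution: 84 -> 42 -> 21 -> 11
```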
GaussianFourierProjectionTimeEncoder
¶
Bases: Module
Overview
Gaussian random features for encoding time steps. This module is used as the encoder of time in generative models such as diffusion model.
Interfaces:
__init__, forward.
__init__(embed_dim, scale=30.0)
¶
Overview
Initialize the Gaussian Fourier Projection Time Encoder according to arguments.
Arguments:
- embed_dim (:obj:int): The dimension of the output embedding vector.
- scale (:obj:float): The scale of the Gaussian random features.
forward(x)
¶
Overview
Return the output embedding vector of the input time step.
Arguments:
- x (:obj:torch.Tensor): Input time step tensor.
Returns:
- output (:obj:torch.Tensor): Output embedding vector.
Shapes:
- x (:obj:torch.Tensor): :math:(B,), where B is batch size.
- output (:obj:torch.Tensor): :math:(B, embed_dim), where B is batch size, embed_dim is the dimension of the output embedding vector.
Examples:
>>> encoder = GaussianFourierProjectionTimeEncoder(128)
>>> x = torch.randn(100)
>>> output = encoder(x)
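The encoder itself is easy to state: a fixed random frequency vector W ~ N(0, scale^2) maps each scalar time step t to the features [sin(2*pi*t*W), cos(2*pi*t*W)]. A minimal standalone sketch (the real module stores W as a non-trainable parameter, and whether sin/cos are concatenated or interleaved is an implementation detail):

```python
import math
import random

def gaussian_fourier_embed(t, embed_dim, scale=30.0, seed=0):
    # Fixed random frequencies, drawn once: W ~ N(0, scale^2), one per half-dimension.
    rng = random.Random(seed)
    w = [rng.gauss(0.0, scale) for _ in range(embed_dim // 2)]
    proj = [2.0 * math.pi * t * wi for wi in w]
    # We concatenate sin and cos features here (an assumption about the layout).
    return [math.sin(p) for p in proj] + [math.cos(p) for p in proj]

emb = gaussian_fourier_embed(0.5, embed_dim=128)
assert len(emb) == 128
```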
DQN
¶
Bases: Module
Overview
The neural network structure and computation graph of the Deep Q Network (DQN) algorithm, the most classic value-based RL algorithm for discrete action spaces. The DQN is composed of two parts: encoder and head. The encoder is used to extract features from various observations, and the head is used to compute the Q value of each action dimension.
Interfaces:
__init__, forward.
.. note::
Current DQN supports two types of encoder: FCEncoder and ConvEncoder, two types of head: DiscreteHead and DuelingHead. You can customize your own encoder or head by inheriting this class.
__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], dueling=True, head_hidden_size=None, head_layer_num=1, activation=nn.ReLU(), norm_type=None, dropout=None, init_bias=None, noise=False)
¶
Overview
Initialize the DQN (encoder + head) model according to the corresponding input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
- action_shape (:obj:Union[int, SequenceType]): Action space shape, such as 6 or [2, 3, 3].
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder, the last element must match head_hidden_size.
- dueling (:obj:Optional[bool]): Whether to use DuelingHead (default) or DiscreteHead.
- head_hidden_size (:obj:Optional[int]): The hidden_size of head network, defaults to None, then it will be set to the last element of encoder_hidden_size_list.
- head_layer_num (:obj:int): The number of layers used in the head network to compute Q value output.
- activation (:obj:Optional[nn.Module]): The type of activation function in networks. If None, then it is set to nn.ReLU() by default.
- norm_type (:obj:Optional[str]): The type of normalization in networks, see ding.torch_utils.fc_block for more details. You can choose one of ['BN', 'IN', 'SyncBN', 'LN'].
- dropout (:obj:Optional[float]): The dropout rate of the dropout layer. If None, the dropout layer is disabled.
- init_bias (:obj:Optional[float]): The initial value of the last layer bias in the head network.
- noise (:obj:bool): Whether to use NoiseLinearLayer as layer_fn to boost exploration in Q networks' MLP. Default to False.
forward(x)
¶
Overview
DQN forward computation graph, input observation tensor to predict q_value.
Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.
Returns:
- outputs (:obj:Dict): The output of DQN's forward, including q_value.
ReturnsKeys:
- logit (:obj:torch.Tensor): Discrete Q-value output of each possible action dimension.
Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is obs_shape
- logit (:obj:torch.Tensor): :math:(B, M), where B is batch size and M is action_shape
Examples:
>>> model = DQN(32, 6) # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 32)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 6])
.. note::
For consistency and compatibility, we name all the outputs of the network which are related to action selections as logit.
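Because the logit key holds plain Q-values for a discrete action space, greedy action selection at evaluation time reduces to an argmax over the last dimension. A minimal sketch in plain Python (the q_values list is a hypothetical stand-in for one row of outputs['logit']):

```python
def greedy_action(q_values):
    # Pick the index of the largest Q-value; ties resolve to the first maximum.
    return max(range(len(q_values)), key=lambda i: q_values[i])

q_values = [0.1, 2.5, -0.3, 1.7]  # hypothetical Q-values for 4 actions
assert greedy_action(q_values) == 1
```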
RainbowDQN
¶
Bases: Module
Overview
The neural network structure and computation graph of RainbowDQN, which combines distributional RL and DQN. You can refer to paper Rainbow: Combining Improvements in Deep Reinforcement Learning https://arxiv.org/pdf/1710.02298.pdf for more details.
Interfaces:
__init__, forward
.. note:: RainbowDQN contains dueling architecture by default.
__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], head_hidden_size=None, head_layer_num=1, activation=nn.ReLU(), norm_type=None, v_min=-10, v_max=10, n_atom=51)
¶
Overview
Init the Rainbow Model according to arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape.
- action_shape (:obj:Union[int, SequenceType]): Action space shape.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder
- head_hidden_size (:obj:Optional[int]): The hidden_size to pass to Head.
- head_layer_num (:obj:int): The num of layers used in the network to compute Q value output
- activation (:obj:Optional[nn.Module]): The type of activation function to use in the MLP after layer_fn. If None, then it is set to nn.ReLU() by default.
- norm_type (:obj:Optional[str]): The type of normalization to use, see ding.torch_utils.fc_block for more details.
- v_min (:obj:Optional[float]): The minimum value of the support of the distribution. Default to -10.
- v_max (:obj:Optional[float]): The maximum value of the support of the distribution. Default to 10.
- n_atom (:obj:Optional[int]): Number of atoms in the prediction distribution.
forward(x)
¶
Overview
Use the observation tensor to predict Rainbow's output, i.e., the logit and the value distribution.
Arguments:
- x (:obj:torch.Tensor):
The encoded embedding tensor with (B, N=hidden_size).
Returns:
- outputs (:obj:Dict):
Run MLP with RainbowHead setups and return the result prediction dictionary.
ReturnsKeys:
- logit (:obj:torch.Tensor): Logit tensor of size (B, M), where M is action_shape.
- distribution (:obj:torch.Tensor): Distribution tensor of size (B, M, n_atom).
Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is head_hidden_size.
- logit (:obj:torch.FloatTensor): :math:(B, M), where M is action_shape.
- distribution(:obj:torch.FloatTensor): :math:(B, M, P), where P is n_atom.
Examples:
>>> model = RainbowDQN(64, 64) # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default n_atom: int =51
>>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])
QRDQN
¶
Bases: Module
Overview
The neural network structure and computation graph of QRDQN, which combines distributional RL and DQN. You can refer to Distributional Reinforcement Learning with Quantile Regression https://arxiv.org/pdf/1710.10044.pdf for more details.
Interfaces:
__init__, forward
__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], head_hidden_size=None, head_layer_num=1, num_quantiles=32, activation=nn.ReLU(), norm_type=None)
¶
Overview
Initialize the QRDQN Model according to input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation's space.
- action_shape (:obj:Union[int, SequenceType]): Action's space.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder
- head_hidden_size (:obj:Optional[int]): The hidden_size to pass to Head.
- head_layer_num (:obj:int): The num of layers used in the network to compute Q value output
- num_quantiles (:obj:int): Number of quantiles in the prediction distribution.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in the MLP after layer_fn. If None, then it is set to nn.ReLU() by default.
- norm_type (:obj:Optional[str]): The type of normalization to use, see ding.torch_utils.fc_block for more details.
forward(x)
¶
Overview
Use the observation tensor to predict QRDQN's output.
Arguments:
- x (:obj:torch.Tensor): The encoded embedding tensor with (B, N=hidden_size).
Returns:
- outputs (:obj:Dict): Run with encoder and head. Return the result prediction dictionary.
ReturnsKeys:
- logit (:obj:torch.Tensor): Logit tensor with same size as input x.
- q (:obj:torch.Tensor): Q-value tensor of size (B, M, num_quantiles), where M is action_shape.
- tau (:obj:torch.Tensor): tau tensor of size (B, num_quantiles, 1).
Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is head_hidden_size.
- logit (:obj:torch.FloatTensor): :math:(B, M), where M is action_shape.
- tau (:obj:torch.Tensor): :math:(B, num_quantiles, 1).
Examples:
>>> model = QRDQN(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles : int = 32
>>> assert outputs['q'].shape == torch.Size([4, 64, 32])
>>> assert outputs['tau'].shape == torch.Size([4, 32, 1])
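In QRDQN, the scalar Q-value behind logit is simply the mean of the per-action quantile estimates. A standalone sketch of that reduction (nested lists stand in for the (B, M, num_quantiles) q tensor):

```python
def quantiles_to_q(q):
    # q: nested list of shape (B, M, num_quantiles); the Q-value of each action
    # is the uniform average of its quantile estimates.
    return [[sum(quants) / len(quants) for quants in actions] for actions in q]

q = [[[0.0, 1.0, 2.0, 3.0], [4.0, 4.0, 4.0, 4.0]]]  # B=1, M=2, num_quantiles=4
print(quantiles_to_q(q))  # [[1.5, 4.0]]
```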
IQN
¶
Bases: Module
Overview
The neural network structure and computation graph of IQN, which combines distributional RL and DQN. You can refer to paper Implicit Quantile Networks for Distributional Reinforcement Learning https://arxiv.org/pdf/1806.06923.pdf for more details.
Interfaces:
__init__, forward
__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], head_hidden_size=None, head_layer_num=1, num_quantiles=32, quantile_embedding_size=128, activation=nn.ReLU(), norm_type=None)
¶
Overview
Initialize the IQN Model according to input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape.
- action_shape (:obj:Union[int, SequenceType]): Action space shape.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder
- head_hidden_size (:obj:Optional[int]): The hidden_size to pass to Head.
- head_layer_num (:obj:int): The num of layers used in the network to compute Q value output
- num_quantiles (:obj:int): Number of quantiles in the prediction distribution.
- quantile_embedding_size (:obj:int): The embedding size of a quantile.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in the MLP after layer_fn. If None, then it is set to nn.ReLU() by default.
- norm_type (:obj:Optional[str]): The type of normalization to use, see ding.torch_utils.fc_block for more details.
forward(x)
¶
Overview
Use the encoded embedding tensor to predict IQN's output.
Arguments:
- x (:obj:torch.Tensor): The encoded embedding tensor with (B, N=hidden_size).
Returns:
- outputs (:obj:Dict): Run with encoder and head. Return the result prediction dictionary.
ReturnsKeys:
- logit (:obj:torch.Tensor): Logit tensor with same size as input x.
- q (:obj:torch.Tensor): Q-value tensor of size (num_quantiles, B, M), where M is action_shape.
- quantiles (:obj:torch.Tensor): quantiles tensor of size (quantile_embedding_size, 1)
Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is head_hidden_size.
- logit (:obj:torch.FloatTensor): :math:(B, M), where M is action_shape
- quantiles (:obj:torch.Tensor): :math:(P, 1), where P is quantile_embedding_size.
Examples:
>>> model = IQN(64, 64) # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles: int = 32
>>> assert outputs['q'].shape == torch.Size([32, 4, 64])
>>> # default quantile_embedding_size: int = 128
>>> assert outputs['quantiles'].shape == torch.Size([128, 1])
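IQN embeds each sampled quantile fraction tau with cosine features, phi_j(tau) = cos(pi * j * tau) for j = 0 .. n-1, before mixing it into the state embedding. A standalone sketch of just that embedding (the linear layer and ReLU that follow in the paper are omitted):

```python
import math

def cosine_quantile_embedding(tau, embedding_size=128):
    # cos(pi * j * tau) features for j = 0 .. embedding_size - 1, as in the IQN paper.
    return [math.cos(math.pi * j * tau) for j in range(embedding_size)]

phi = cosine_quantile_embedding(0.5)
assert len(phi) == 128
assert phi[0] == 1.0  # the j = 0 term is always cos(0) = 1
```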
FQF
¶
Bases: Module
Overview
The neural network structure and computation graph of FQF, which combines distributional RL and DQN. You can refer to paper Fully Parameterized Quantile Function for Distributional Reinforcement Learning https://arxiv.org/pdf/1911.02140.pdf for more details.
Interface:
__init__, forward
__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], head_hidden_size=None, head_layer_num=1, num_quantiles=32, quantile_embedding_size=128, activation=nn.ReLU(), norm_type=None)
¶
Overview
Initialize the FQF Model according to input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape.
- action_shape (:obj:Union[int, SequenceType]): Action space shape.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder
- head_hidden_size (:obj:Optional[int]): The hidden_size to pass to Head.
- head_layer_num (:obj:int): The num of layers used in the network to compute Q value output
- num_quantiles (:obj:int): Number of quantiles in the prediction distribution.
- activation (:obj:Optional[nn.Module]):
The type of activation function to use in MLP the after layer_fn,
if None then default set to nn.ReLU()
- norm_type (:obj:Optional[str]):
The type of normalization to use, see ding.torch_utils.fc_block for more details.
forward(x)
¶
Overview
Use the encoded embedding tensor to predict FQF's output.
Arguments:
- x (:obj:torch.Tensor): The encoded embedding tensor with (B, N=hidden_size).
Returns:
- outputs (:obj:Dict): Dict containing keywords logit (:obj:torch.Tensor), q (:obj:torch.Tensor), quantiles (:obj:torch.Tensor), quantiles_hats (:obj:torch.Tensor), q_tau_i (:obj:torch.Tensor), entropies (:obj:torch.Tensor).
Shapes:
- x: :math:(B, N), where B is batch size and N is head_hidden_size.
- logit: :math:(B, M), where M is action_shape.
- q: :math:(B, num_quantiles, M).
- quantiles: :math:(B, num_quantiles + 1).
- quantiles_hats: :math:(B, num_quantiles).
- q_tau_i: :math:(B, num_quantiles - 1, M).
- entropies: :math:(B, 1).
Examples:
>>> model = FQF(64, 64) # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles: int = 32
>>> assert outputs['q'].shape == torch.Size([4, 32, 64])
>>> assert outputs['quantiles'].shape == torch.Size([4, 33])
>>> assert outputs['quantiles_hats'].shape == torch.Size([4, 32])
>>> assert outputs['q_tau_i'].shape == torch.Size([4, 31, 64])
>>> assert outputs['entropies'].shape == torch.Size([4, 1])
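The quantiles and quantiles_hats keys come from FQF's fraction-proposal step: proposal logits are softmaxed, cumulatively summed into monotone fractions 0 = tau_0 < ... < tau_N = 1, and the midpoints tau_hat_i = (tau_i + tau_{i+1}) / 2 are where the quantile function is evaluated. A standalone sketch for one sample:

```python
import math

def propose_fractions(logits):
    # Softmax the proposal logits, then cumulative-sum into monotone fractions in [0, 1].
    exps = [math.exp(l - max(logits)) for l in logits]
    probs = [e / sum(exps) for e in exps]
    taus = [0.0]
    for p in probs:
        taus.append(taus[-1] + p)
    # Midpoints between consecutive fractions: where the quantile function is evaluated.
    tau_hats = [(a + b) / 2.0 for a, b in zip(taus[:-1], taus[1:])]
    return taus, tau_hats

taus, tau_hats = propose_fractions([0.0, 0.0, 0.0, 0.0])  # uniform logits -> uniform fractions
print(taus)      # [0.0, 0.25, 0.5, 0.75, 1.0]
print(tau_hats)  # [0.125, 0.375, 0.625, 0.875]
```

This matches the documented shapes: num_quantiles proposal logits yield num_quantiles + 1 fractions and num_quantiles midpoints.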
DRQN
¶
Bases: Module
Overview
The DRQN (Deep Recurrent Q-Network) is a neural network model combining DQN with RNN to handle sequential
data and partially observable environments. It consists of three main components: encoder, rnn,
and head.
- Encoder: Extracts features from various observation inputs.
- RNN: Processes sequential observations and other data.
- Head: Computes Q-values for each action dimension.
Interfaces
__init__, forward.
.. note::
The current implementation supports:
- Two encoder types: FCEncoder and ConvEncoder.
- Two head types: DiscreteHead and DuelingHead.
- Three RNN types: normal (LSTM with LayerNorm), pytorch (PyTorch's native LSTM), and gru.
You can extend the model by customizing your own encoder, RNN, or head by inheriting this class.
__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], dueling=True, head_hidden_size=None, head_layer_num=1, lstm_type='normal', activation=nn.ReLU(), norm_type=None, res_link=False)
¶
Overview
Initialize the DRQN model with specified parameters.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Shape of the observation space, e.g., 8 or [4, 84, 84].
- action_shape (:obj:Union[int, SequenceType]): Shape of the action space, e.g., 6 or [2, 3, 3].
- encoder_hidden_size_list (:obj:SequenceType): List of hidden sizes for the encoder. The last element must match head_hidden_size.
- dueling (:obj:Optional[bool]): Use DuelingHead if True, otherwise use DiscreteHead.
- head_hidden_size (:obj:Optional[int]): Hidden size for the head network. Defaults to the last element of encoder_hidden_size_list if None.
- head_layer_num (:obj:int): Number of layers in the head network to compute Q-value outputs.
- lstm_type (:obj:Optional[str]): Type of RNN module. Supported types are normal, pytorch, and gru.
- activation (:obj:Optional[nn.Module]): Activation function used in the network. Defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): Normalization type for the networks. Supported types are: ['BN', 'IN', 'SyncBN', 'LN']. See ding.torch_utils.fc_block for more details.
- res_link (:obj:bool): Enables residual connections between single-frame data and sequential data. Defaults to False.
forward(inputs, inference=False, saved_state_timesteps=None)
¶
Overview
Defines the forward pass of the DRQN model. Takes observation and previous RNN states as inputs and predicts Q-values.
Arguments:
- inputs (:obj:Dict): Input data dictionary containing observation and previous RNN state.
- inference (:obj:bool): If True, unrolls one timestep (used during evaluation). If False, unrolls the entire sequence (used during training).
- saved_state_timesteps (:obj:Optional[list]): When inference is False, specifies the timesteps whose hidden states are saved and returned.
ArgumentsKeys:
- obs (:obj:torch.Tensor): Raw observation tensor.
- prev_state (:obj:list): Previous RNN state tensor, structure depends on lstm_type.
Returns:
- outputs (:obj:Dict): The output of DRQN's forward, including logit (q_value) and next state.
ReturnsKeys:
- logit (:obj:torch.Tensor): Discrete Q-value output for each action dimension.
- next_state (:obj:list): Next RNN state tensor.
Shapes:
- obs (:obj:torch.Tensor): :math:(B, N) where B is batch size and N is obs_shape.
- logit (:obj:torch.Tensor): :math:(B, M) where B is batch size and M is action_shape.
Examples:
>>> # Initialize input keys
>>> prev_state = [[torch.randn(1, 1, 64) for __ in range(2)] for _ in range(4)] # B=4
>>> obs = torch.randn(4,64)
>>> model = DRQN(64, 64) # arguments: 'obs_shape' and 'action_shape'
>>> outputs = model({'obs': obs, 'prev_state': prev_state}, inference=True)
>>> # Validate output keys and shapes
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == (4, 64)
>>> assert len(outputs['next_state']) == 4
>>> assert all([len(t) == 2 for t in outputs['next_state']])
>>> assert all([t[0].shape == (1, 1, 64) for t in outputs['next_state']])
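The inference flag only changes how the recurrent state is advanced: with inference=True the model consumes one timestep and returns the state for the next call, while inference=False unrolls a whole training sequence internally. A toy standalone sketch with a scalar one-unit "RNN" (standing in for the real LSTM/GRU) shows the two paths compute the same thing:

```python
import math

def rnn_step(h, x, w=0.5, u=0.3):
    # A toy one-unit recurrent cell; hypothetical weights, not the real module.
    return math.tanh(w * h + u * x)

xs = [0.1, -0.4, 0.7]

# inference=False: the whole sequence is unrolled inside one forward call.
h = 0.0
for x in xs:
    h = rnn_step(h, x)
full_unroll = h

# inference=True: one timestep per call, carrying prev_state between calls.
state = 0.0
for x in xs:
    state = rnn_step(state, x)  # each iteration models one forward() call
step_by_step = state

assert full_unroll == step_by_step
```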
C51DQN
¶
Bases: Module
Overview
The neural network structure and computation graph of C51DQN, which combines distributional RL and DQN. You can refer to https://arxiv.org/pdf/1707.06887.pdf for more details. The C51DQN is composed of encoder and head. encoder is used to extract the feature of observation, and head is used to compute the distribution of Q-value.
Interfaces:
__init__, forward
.. note::
Current C51DQN supports two types of encoder: FCEncoder and ConvEncoder.
__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], head_hidden_size=None, head_layer_num=1, activation=nn.ReLU(), norm_type=None, v_min=-10, v_max=10, n_atom=51)
¶
Overview
Initialize the C51 model according to the corresponding input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
- action_shape (:obj:Union[int, SequenceType]): Action space shape, such as 6 or [2, 3, 3].
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder, the last element must match head_hidden_size.
- head_hidden_size (:obj:Optional[int]): The hidden_size of head network, defaults to None, then it will be set to the last element of encoder_hidden_size_list.
- head_layer_num (:obj:int): The number of layers used in the head network to compute Q value output.
- activation (:obj:Optional[nn.Module]): The type of activation function in networks. If None, then it is set to nn.ReLU() by default.
- norm_type (:obj:Optional[str]): The type of normalization in networks, see ding.torch_utils.fc_block for more details. You can choose one of ['BN', 'IN', 'SyncBN', 'LN'].
- v_min (:obj:Optional[float]): The minimum value of the support of the distribution, which is related to the value (discounted sum of reward) scale of the specific environment. Defaults to -10.
- v_max (:obj:Optional[float]): The maximum value of the support of the distribution, which is related to the value (discounted sum of reward) scale of the specific environment. Defaults to 10.
- n_atom (:obj:Optional[int]): The number of atoms in the prediction distribution, 51 is the default value in the paper, you can also try other values such as 301.
forward(x)
¶
Overview
C51DQN forward computation graph, input observation tensor to predict q_value and its distribution.
Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.
Returns:
- outputs (:obj:Dict): The output of C51DQN's forward, including q_value and distribution.
ReturnsKeys:
- logit (:obj:torch.Tensor): Discrete Q-value output of each possible action dimension.
- distribution (:obj:torch.Tensor): Q-Value discretized distribution, i.e., probability of each uniformly spaced atom Q-value, such as dividing [-10, 10] into 51 uniform spaces.
Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is head_hidden_size.
- logit (:obj:torch.Tensor): :math:(B, M), where M is action_shape.
- distribution(:obj:torch.Tensor): :math:(B, M, P), where P is n_atom.
Examples:
>>> model = C51DQN(128, 64) # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 128)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> # default head_hidden_size: int = 64,
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default n_atom: int = 51
>>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])
.. note::
For consistency and compatibility, we name all the outputs of the network which are related to action selections as logit.
.. note:: For convenience, we recommend using an odd number of atoms, so that the middle atom sits exactly at the midpoint of the support.
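The distribution key can be turned back into the scalar Q-values reported in logit by taking the expectation over the fixed support atoms z_i, uniformly spaced on [v_min, v_max]. A standalone sketch for a single action:

```python
def atom_support(v_min=-10.0, v_max=10.0, n_atom=51):
    # n_atom uniformly spaced atoms; with an odd n_atom the middle atom
    # sits exactly at the midpoint of the support.
    delta = (v_max - v_min) / (n_atom - 1)
    return [v_min + i * delta for i in range(n_atom)]

def expected_q(probs, support):
    # Q = sum_i p_i * z_i, the mean of the discretized value distribution.
    return sum(p * z for p, z in zip(probs, support))

support = atom_support()
assert abs(support[25]) < 1e-9  # middle atom of the default 51-atom support

# A distribution with all mass on the middle atom has a Q-value of (almost exactly) 0.
probs = [0.0] * 51
probs[25] = 1.0
assert abs(expected_q(probs, support)) < 1e-9
```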
BDQ
¶
Bases: Module
__init__(obs_shape, num_branches=0, action_bins_per_branch=2, layer_num=3, a_layer_num=None, v_layer_num=None, encoder_hidden_size_list=[128, 128, 64], head_hidden_size=None, norm_type=None, activation=nn.ReLU())
¶
Overview
Initialize the BDQ (encoder + head) model according to input arguments. Reference paper: Action Branching Architectures for Deep Reinforcement Learning https://arxiv.org/pdf/1711.08946.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
- num_branches (:obj:int): The number of branches, which is equivalent to the action dimension, such as 6 in mujoco's halfcheetah environment.
- action_bins_per_branch (:obj:int): The number of actions in each dimension.
- layer_num (:obj:int): The number of layers used in the network to compute Advantage and Value output.
- a_layer_num (:obj:int): The number of layers used in the network to compute Advantage output.
- v_layer_num (:obj:int): The number of layers used in the network to compute Value output.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder, the last element must match head_hidden_size.
- head_hidden_size (:obj:Optional[int]): The hidden_size of head network.
- norm_type (:obj:Optional[str]): The type of normalization in networks, see ding.torch_utils.fc_block for more details.
- activation (:obj:Optional[nn.Module]): The type of activation function in networks. If None, then it is set to nn.ReLU() by default.
forward(x)
¶
Overview
BDQ forward computation graph, input observation tensor to predict q_value.
Arguments:
- x (:obj:torch.Tensor): Observation inputs
Returns:
- outputs (:obj:Dict): BDQ forward outputs, such as q_value.
ReturnsKeys:
- logit (:obj:torch.Tensor): Discrete Q-value output of each action dimension.
Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is obs_shape
- logit (:obj:torch.FloatTensor): :math:(B, M), where B is batch size and M is num_branches * action_bins_per_branch.
Examples:
>>> model = BDQ(8, 5, 2) # arguments: 'obs_shape', 'num_branches' and 'action_bins_per_branch'.
>>> inputs = torch.randn(4, 8)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 5, 2])
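BDQ's branched head combines a shared state value with per-branch advantages, so every branch gets its own action_bins_per_branch-way Q-vector. A standalone sketch of the aggregation (the per-branch mean subtraction follows the dueling convention used in the paper; treat the exact reduction as an assumption):

```python
def branch_q_values(value, advantages):
    # value: scalar V(s); advantages: list of per-branch advantage lists.
    # Q_b(s, a) = V(s) + A_b(s, a) - mean_a' A_b(s, a') for each branch b.
    out = []
    for branch in advantages:
        mean_adv = sum(branch) / len(branch)
        out.append([value + a - mean_adv for a in branch])
    return out

value = 1.0
advantages = [[2.0, 0.0], [1.0, 1.0]]  # num_branches=2, action_bins_per_branch=2
print(branch_q_values(value, advantages))  # [[2.0, 0.0], [1.0, 1.0]]
```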
GTrXLDQN
¶
Bases: Module
Overview
The neural network structure and computation graph of the Gated Transformer-XL DQN algorithm, an enhanced version of DRQN that uses Transformer-XL to improve long-term sequential modelling ability. The GTrXL-DQN is composed of three parts: encoder, core and head. The encoder is used to extract features from various observations, the core is used to process the sequential observation and other data, and the head is used to compute the Q value of each action dimension.
Interfaces:
__init__, forward, reset_memory, get_memory .
__init__(obs_shape, action_shape, head_layer_num=1, att_head_dim=16, hidden_size=16, att_head_num=2, att_mlp_num=2, att_layer_num=3, memory_len=64, activation=nn.ReLU(), head_norm_type=None, dropout=0.0, gru_gating=True, gru_bias=2.0, dueling=True, encoder_hidden_size_list=[128, 128, 256], encoder_norm_type=None)
¶
Overview
Initialize the GTrXLDQN model according to the corresponding input arguments.
.. tip::
You can refer to GTrXl class in ding.torch_utils.network.gtrxl for more details about the input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Used by Transformer. Observation's space.
- action_shape (:obj:Union[int, SequenceType]): Used by Head. Action's space.
- head_layer_num (:obj:int): Used by Head. Number of layers.
- att_head_dim (:obj:int): Used by Transformer.
- hidden_size (:obj:int): Used by Transformer.
- att_head_num (:obj:int): Used by Transformer.
- att_mlp_num (:obj:int): Used by Transformer.
- att_layer_num (:obj:int): Used by Transformer.
- memory_len (:obj:int): Used by Transformer.
- activation (:obj:Optional[nn.Module]): Used by Transformer and Head. If None, then it is set to nn.ReLU() by default.
- head_norm_type (:obj:Optional[str]): Used by Head. The type of normalization to use, see ding.torch_utils.fc_block for more details.
- dropout (:obj:float): Dropout ratio of the attention.
- gru_gating (:obj:bool): If False, replace the GRU gates with residual connections.
- gru_bias (:obj:float): GRU gate bias.
- dueling (:obj:bool): If True, use DuelingHead, otherwise use DiscreteHead.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder.
- encoder_norm_type (:obj:Optional[str]): Used by Encoder. The type of normalization to use, see ding.torch_utils.fc_block for more details.
forward(x)
¶
Overview
Let input tensor go through GTrXl and the Head sequentially.
Arguments:
- x (:obj:torch.Tensor): input tensor of shape (seq_len, bs, obs_shape).
Returns:
- out (:obj:Dict): run GTrXL with DiscreteHead setups and return the result prediction dictionary.
ReturnKeys:
- logit (:obj:torch.Tensor): discrete Q-value output of each action dimension, shape is (B, action_space).
- memory (:obj:torch.Tensor): memory tensor of size (bs, layer_num + 1, memory_len, embedding_dim).
- transformer_out (:obj:torch.Tensor): output tensor of transformer with same size as input x.
Examples:
>>> # Init input's Keys:
>>> obs_dim, seq_len, bs, action_dim = 128, 64, 32, 4
>>> obs = torch.rand(seq_len, bs, obs_dim)
>>> model = GTrXLDQN(obs_dim, action_dim)
>>> outputs = model(obs)
>>> assert isinstance(outputs, dict)
reset_memory(batch_size=None, state=None)
¶
Overview
Clear or reset the memory of GTrXL.
Arguments:
- batch_size (:obj:Optional[int]): The number of samples in a training batch.
- state (:obj:Optional[torch.Tensor]): The input memory data, whose shape is (layer_num, memory_len, bs, embedding_dim).
get_memory()
¶
Overview
Return the memory of GTrXL.
Returns:
- memory: (:obj:Optional[torch.Tensor]): output memory or None if memory has not been initialized, whose shape is (layer_num, memory_len, bs, embedding_dim).
DiscreteQAC
¶
Bases: Module
Overview
The neural network and computation graph of algorithms related to discrete-action Q-value Actor-Critic (QAC), such as DiscreteSAC. This model only supports discrete action spaces. The DiscreteQAC is composed of four parts: actor_encoder, critic_encoder, actor_head and critic_head. Encoders are used to extract features from various observations. Heads are used to predict the corresponding Q-value or action logit. In a high-dimensional observation space like a 2D image, we often use a shared encoder for both actor_encoder and critic_encoder. In a low-dimensional observation space like a 1D vector, we often use separate encoders.
Interfaces:
__init__, forward, compute_actor, compute_critic
__init__(obs_shape, action_shape, twin_critic=False, actor_head_hidden_size=64, actor_head_layer_num=1, critic_head_hidden_size=64, critic_head_layer_num=1, activation=nn.ReLU(), norm_type=None, encoder_hidden_size_list=None, share_encoder=False)
¶
Overview
Initialize the DiscreteQAC model according to input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation's shape, such as 128, (156, ).
- action_shape (:obj:Union[int, SequenceType, EasyDict]): Action's shape, such as 4, (3, ).
- twin_critic (:obj:bool): Whether to use twin critic.
- actor_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to actor head.
- actor_head_layer_num (:obj:int): The num of layers used in the actor network to compute action.
- critic_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to critic head.
- critic_head_layer_num (:obj:int): The num of layers used in the critic network to compute Q-value.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in MLP after each FC layer; if None, it defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to use after network layers (FC, Conv), see ding.torch_utils.network for more details.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder, the last element must match head_hidden_size, this argument is only used in image observation.
- share_encoder (:obj:Optional[bool]): Whether to share encoder between actor and critic.
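To make the encoder/head split concrete, here is a minimal torch-only sketch of the DiscreteQAC structure. This is not ding's implementation; the class name `TinyDiscreteQAC` and all layer sizes are illustrative, and the real model adds options such as twin critics and configurable heads.

```python
import torch
import torch.nn as nn

class TinyDiscreteQAC(nn.Module):
    """Illustrative sketch: one shared encoder feeding an actor head and a critic head."""

    def __init__(self, obs_shape: int, action_shape: int, hidden: int = 64):
        super().__init__()
        # A single shared encoder, as with share_encoder=True.
        self.encoder = nn.Sequential(nn.Linear(obs_shape, hidden), nn.ReLU())
        self.actor_head = nn.Linear(hidden, action_shape)   # action logit
        self.critic_head = nn.Linear(hidden, action_shape)  # Q-value per discrete action

    def compute_actor(self, obs: torch.Tensor) -> dict:
        return {'logit': self.actor_head(self.encoder(obs))}

    def compute_critic(self, obs: torch.Tensor) -> dict:
        return {'q_value': self.critic_head(self.encoder(obs))}

model = TinyDiscreteQAC(64, 6)
obs = torch.randn(4, 64)
logit = model.compute_actor(obs)['logit']      # shape (4, 6)
q_value = model.compute_critic(obs)['q_value'] # shape (4, 6)
```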
forward(inputs, mode)
¶
Overview
QAC forward computation graph, input observation tensor to predict Q-value or action logit. Different mode will forward with different network modules to get different outputs and save computation.
Arguments:
- inputs (:obj:torch.Tensor): The input observation tensor data.
- mode (:obj:str): The forward mode, all the modes are defined in the beginning of this class.
Returns:
- output (:obj:Dict[str, torch.Tensor]): The output dict of QAC forward computation graph, whose key-values vary in different forward modes.
Examples (Actor):
>>> model = DiscreteQAC(64, 6)
>>> obs = torch.randn(4, 64)
>>> actor_outputs = model(obs,'compute_actor')
>>> assert actor_outputs['logit'].shape == torch.Size([4, 6])
Examples (Critic):
>>> model = DiscreteQAC(64, 6, twin_critic=False)
>>> obs = torch.randn(4, 64)
>>> critic_outputs = model(obs, 'compute_critic')
>>> assert critic_outputs['q_value'].shape == torch.Size([4, 6])
compute_actor(inputs)
¶
Overview
QAC forward computation graph for actor part, input observation tensor to predict action or action logit.
Arguments:
- inputs (:obj:torch.Tensor): The input observation tensor data.
Returns:
- outputs (:obj:Dict[str, torch.Tensor]): The output dict of QAC forward computation graph for actor, including discrete action logit.
ReturnsKeys:
- logit (:obj:torch.Tensor): The predicted discrete action type logit, it will be the same dimension as action_shape, i.e., all the possible discrete action choices.
Shapes:
- inputs (:obj:torch.Tensor): :math:(B, N0), B is batch size and N0 corresponds to obs_shape.
- logit (:obj:torch.Tensor): :math:(B, N2), B is batch size and N2 corresponds to action_shape.
Examples:
>>> model = DiscreteQAC(64, 6)
>>> obs = torch.randn(4, 64)
>>> actor_outputs = model(obs,'compute_actor')
>>> assert actor_outputs['logit'].shape == torch.Size([4, 6])
compute_critic(inputs)
¶
Overview
QAC forward computation graph for critic part, input observation to predict Q-value for each possible discrete action choices.
Arguments:
- inputs (:obj:torch.Tensor): The input observation tensor data.
Returns:
- outputs (:obj:Dict[str, torch.Tensor]): The output dict of QAC forward computation graph for critic, including q_value for each possible discrete action choices.
ReturnKeys:
- q_value (:obj:torch.Tensor): The predicted Q-value for each possible discrete action choices, it will be the same dimension as action_shape and used to calculate the loss.
Shapes:
- obs (:obj:torch.Tensor): :math:(B, N1), where B is batch size and N1 is obs_shape.
- q_value (:obj:torch.Tensor): :math:(B, N2), where B is batch size and N2 is action_shape.
Examples:
>>> model = DiscreteQAC(64, 6, twin_critic=False)
>>> obs = torch.randn(4, 64)
>>> critic_outputs = model(obs, 'compute_critic')
>>> assert critic_outputs['q_value'].shape == torch.Size([4, 6])
ContinuousQAC
¶
Bases: Module
Overview
The neural network and computation graph of algorithms related to Q-value Actor-Critic (QAC), such as DDPG/TD3/SAC. This model now supports continuous and hybrid action spaces. The ContinuousQAC is composed of four parts: actor_encoder, critic_encoder, actor_head and critic_head. Encoders are used to extract features from various observations. Heads are used to predict the corresponding Q-value or action logit. In high-dimensional observation spaces like 2D images, we often use a shared encoder for both actor_encoder and critic_encoder. In low-dimensional observation spaces like 1D vectors, we often use different encoders.
Interfaces:
__init__, forward, compute_actor, compute_critic
__init__(obs_shape, action_shape, action_space, twin_critic=False, actor_head_hidden_size=64, actor_head_layer_num=1, critic_head_hidden_size=64, critic_head_layer_num=1, activation=nn.ReLU(), norm_type=None, encoder_hidden_size_list=None, share_encoder=False)
¶
Overview
Initialize the ContinuousQAC model according to the input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation's shape, such as 128, (156, ).
- action_shape (:obj:Union[int, SequenceType, EasyDict]): Action's shape, such as 4, (3, ), EasyDict({'action_type_shape': 3, 'action_args_shape': 4}).
- action_space (:obj:str): The type of action space, including [regression, reparameterization, hybrid], regression is used for DDPG/TD3, reparameterization is used for SAC and hybrid for PADDPG.
- twin_critic (:obj:bool): Whether to use twin critic, one of tricks in TD3.
- actor_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to actor head.
- actor_head_layer_num (:obj:int): The num of layers used in the actor network to compute action.
- critic_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to critic head.
- critic_head_layer_num (:obj:int): The num of layers used in the critic network to compute Q-value.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in MLP after each FC layer; if None, it defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to use after network layers (FC, Conv), see ding.torch_utils.network for more details.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder, the last element must match head_hidden_size, this argument is only used in image observation.
- share_encoder (:obj:Optional[bool]): Whether to share encoder between actor and critic.
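The three action_space options shape the actor output differently. The sketch below illustrates the difference with plain tensors; it is not ding's head code, and the tanh bound and log-sigma parameterization are common conventions rather than the exact implementation.

```python
import torch

mu = torch.zeros(4, 6)         # stands in for the head's mean output
log_sigma = torch.zeros(4, 6)  # stands in for the head's log-std output

# 'regression' (DDPG/TD3): the head regresses the action directly,
# often bounded into [-1, 1] with tanh.
action = torch.tanh(mu)

# 'reparameterization' (SAC): the head outputs (mu, sigma) and the action
# is sampled with the reparameterization trick, so gradients flow
# through mu and sigma.
sigma = log_sigma.exp()
sampled = mu + sigma * torch.randn_like(mu)  # differentiable w.r.t. mu, sigma
```

In the 'hybrid' case (PADDPG), the actor instead produces a discrete action-type logit plus continuous action_args, matching the EasyDict action_shape above.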
forward(inputs, mode)
¶
Overview
QAC forward computation graph, input observation tensor to predict Q-value or action logit. Different mode will forward with different network modules to get different outputs and save computation.
Arguments:
- inputs (:obj:Union[torch.Tensor, Dict[str, torch.Tensor]]): The input data for forward computation graph, for compute_actor, it is the observation tensor, for compute_critic, it is the dict data including obs and action tensor.
- mode (:obj:str): The forward mode, all the modes are defined in the beginning of this class.
Returns:
- output (:obj:Dict[str, torch.Tensor]): The output dict of QAC forward computation graph, whose key-values vary in different forward modes.
Examples (Actor):
>>> # Regression mode
>>> model = ContinuousQAC(64, 6, 'regression')
>>> obs = torch.randn(4, 64)
>>> actor_outputs = model(obs,'compute_actor')
>>> assert actor_outputs['action'].shape == torch.Size([4, 6])
>>> # Reparameterization Mode
>>> model = ContinuousQAC(64, 6, 'reparameterization')
>>> obs = torch.randn(4, 64)
>>> actor_outputs = model(obs,'compute_actor')
>>> assert actor_outputs['logit'][0].shape == torch.Size([4, 6]) # mu
>>> assert actor_outputs['logit'][1].shape == torch.Size([4, 6]) # sigma
Examples (Critic):
>>> inputs = {'obs': torch.randn(4, 8), 'action': torch.randn(4, 1)}
>>> model = ContinuousQAC(obs_shape=(8, ), action_shape=1, action_space='regression')
>>> assert model(inputs, mode='compute_critic')['q_value'].shape == (4, )  # q value
compute_actor(obs)
¶
Overview
QAC forward computation graph for actor part, input observation tensor to predict action or action logit.
Arguments:
- obs (:obj:torch.Tensor): The input observation tensor data.
Returns:
- outputs (:obj:Dict[str, Union[torch.Tensor, Dict[str, torch.Tensor]]]): Actor output dict varying from action_space: regression, reparameterization, hybrid.
ReturnsKeys (regression):
- action (:obj:torch.Tensor): Continuous action with same size as action_shape, usually in DDPG/TD3.
ReturnsKeys (reparameterization):
- logit (:obj:List[torch.Tensor]): The predicted reparameterization action logit, usually in SAC. It is a list containing two tensors: mu and sigma. The former is the mean of the Gaussian distribution, the latter is the standard deviation of the Gaussian distribution.
ReturnsKeys (hybrid):
- logit (:obj:torch.Tensor): The predicted discrete action type logit, it will be the same dimension as action_type_shape, i.e., all the possible discrete action types.
- action_args (:obj:torch.Tensor): Continuous action arguments with same size as action_args_shape.
Shapes:
- obs (:obj:torch.Tensor): :math:(B, N0), B is batch size and N0 corresponds to obs_shape.
- action (:obj:torch.Tensor): :math:(B, N1), B is batch size and N1 corresponds to action_shape.
- logit.mu (:obj:torch.Tensor): :math:(B, N1), B is batch size and N1 corresponds to action_shape.
- logit.sigma (:obj:torch.Tensor): :math:(B, N1), B is batch size and N1 corresponds to action_shape.
- logit (:obj:torch.Tensor): :math:(B, N2), B is batch size and N2 corresponds to action_shape.action_type_shape.
- action_args (:obj:torch.Tensor): :math:(B, N3), B is batch size and N3 corresponds to action_shape.action_args_shape.
Examples:
>>> # Regression mode
>>> model = ContinuousQAC(64, 6, 'regression')
>>> obs = torch.randn(4, 64)
>>> actor_outputs = model(obs,'compute_actor')
>>> assert actor_outputs['action'].shape == torch.Size([4, 6])
>>> # Reparameterization Mode
>>> model = ContinuousQAC(64, 6, 'reparameterization')
>>> obs = torch.randn(4, 64)
>>> actor_outputs = model(obs,'compute_actor')
>>> assert actor_outputs['logit'][0].shape == torch.Size([4, 6]) # mu
>>> assert actor_outputs['logit'][1].shape == torch.Size([4, 6]) # sigma
compute_critic(inputs)
¶
Overview
QAC forward computation graph for critic part, input observation and action tensor to predict Q-value.
Arguments:
- inputs (:obj:Dict[str, torch.Tensor]): The dict of input data, including obs and action tensor, also contains logit and action_args tensor in hybrid action_space.
ArgumentsKeys:
- obs (:obj:torch.Tensor): Observation tensor data, now supports a batch of 1-dim vector data.
- action (:obj:Union[torch.Tensor, Dict]): Continuous action with same size as action_shape.
- logit (:obj:torch.Tensor): Discrete action logit, only in hybrid action_space.
- action_args (:obj:torch.Tensor): Continuous action arguments, only in hybrid action_space.
Returns:
- outputs (:obj:Dict[str, torch.Tensor]): The output dict of QAC's forward computation graph for critic, including q_value.
ReturnKeys:
- q_value (:obj:torch.Tensor): Q value tensor with same size as batch size.
Shapes:
- obs (:obj:torch.Tensor): :math:(B, N1), where B is batch size and N1 is obs_shape.
- logit (:obj:torch.Tensor): :math:(B, N2), B is batch size and N2 corresponds to action_shape.action_type_shape.
- action_args (:obj:torch.Tensor): :math:(B, N3), B is batch size and N3 corresponds to action_shape.action_args_shape.
- action (:obj:torch.Tensor): :math:(B, N4), where B is batch size and N4 is action_shape.
- q_value (:obj:torch.Tensor): :math:(B, ), where B is batch size.
Examples:
>>> inputs = {'obs': torch.randn(4, 8), 'action': torch.randn(4, 1)}
>>> model = ContinuousQAC(obs_shape=(8, ),action_shape=1, action_space='regression')
>>> assert model(inputs, mode='compute_critic')['q_value'].shape == (4, ) # q value
PDQN
¶
Bases: Module
Overview
The neural network and computation graph of PDQN(https://arxiv.org/abs/1810.06394v1) and MPDQN(https://arxiv.org/abs/1905.04388) algorithms for parameterized action space. This model supports parameterized action space with discrete action_type and continuous action_arg. In principle, PDQN consists of x network (continuous action parameter network) and Q network (discrete action type network). But for simplicity, the code is split into encoder and actor_head, which contain the encoder and head of the above two networks respectively.
Interface:
__init__, forward, compute_discrete, compute_continuous.
__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], dueling=True, head_hidden_size=None, head_layer_num=1, activation=nn.ReLU(), norm_type=None, multi_pass=False, action_mask=None)
¶
Overview
Init the PDQN (encoder + head) Model according to input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
- action_shape (:obj:EasyDict): Action space shape in dict type, such as EasyDict({'action_type_shape': 3, 'action_args_shape': 5}).
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder, the last element must match head_hidden_size.
- dueling (:obj:bool): Whether to use DuelingHead (True) or DiscreteHead (False).
- head_hidden_size (:obj:Optional[int]): The hidden_size of head network.
- head_layer_num (:obj:int): The number of layers used in the head network to compute Q value output.
- activation (:obj:Optional[nn.Module]): The type of activation function in networks; if None, it defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization in networks, see ding.torch_utils.fc_block for more details.
- multi_pass (:obj:Optional[bool]): Whether to use multi pass version.
- action_mask (:obj:Optional[list]): An action mask indicating how action args are associated with each discrete action. For example, if there are 3 discrete actions and 4 continuous action args, where the first discrete action uses the first arg, the second uses the second arg, and the third uses the remaining two args, the action mask will be [[1,0,0,0],[0,1,0,0],[0,0,1,1]] with shape 3*4.
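The masking described above can be sketched with plain tensors: broadcasting the mask against the predicted args keeps, for each discrete action, only the args associated with it. This is an illustrative snippet, not ding's internal masking code.

```python
import torch

# 3 discrete actions, 4 continuous action args; rows follow the mask
# layout described in the argument documentation above.
action_mask = torch.tensor([[1, 0, 0, 0],
                            [0, 1, 0, 0],
                            [0, 0, 1, 1]], dtype=torch.float32)

action_args = torch.randn(4)       # all predicted continuous args
masked = action_mask * action_args # row i keeps only action i's args (shape (3, 4))
```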
forward(inputs, mode)
¶
Overview
PDQN forward computation graph, input observation tensor to predict q_value for discrete actions and values for continuous action_args.
Arguments:
- inputs (:obj:Union[torch.Tensor, Dict, EasyDict]): Inputs including observation and other info according to mode.
- mode (:obj:str): Name of the forward mode.
Shapes:
- inputs (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is obs_shape.
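At inference time PDQN chains its two modes: first predict the continuous action args from the observation, then concatenate [obs, action_args] as input to the discrete Q network. The sketch below uses bare linear layers as stand-ins for the two networks; it is illustrative, not ding's model.

```python
import torch
import torch.nn as nn

obs = torch.randn(64, 4)       # batch of observations, obs_shape=4
x_net = nn.Linear(4, 5)        # stand-in for the continuous action-args network
q_net = nn.Linear(4 + 5, 3)    # stand-in for the discrete Q network over [obs, args]

# mode='compute_continuous': obs -> action_args
action_args = x_net(obs)                                # shape (64, 5)
# mode='compute_discrete': [obs, action_args] -> q_logit
q_logit = q_net(torch.cat([obs, action_args], dim=1))   # shape (64, 3)
```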
compute_continuous(inputs)
¶
Overview
Use observation tensor to predict continuous action args.
Arguments:
- inputs (:obj:torch.Tensor): Observation inputs.
Returns:
- outputs (:obj:Dict): A dict with key 'action_args'.
- 'action_args' (:obj:torch.Tensor): The continuous action args.
Shapes:
- inputs (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is obs_shape.
- action_args (:obj:torch.Tensor): :math:(B, M), where M is action_args_shape.
Examples:
>>> act_shape = EasyDict({'action_type_shape': (3, ), 'action_args_shape': (5, )})
>>> model = PDQN(4, act_shape)
>>> inputs = torch.randn(64, 4)
>>> outputs = model.forward(inputs, mode='compute_continuous')
>>> assert outputs['action_args'].shape == torch.Size([64, 5])
compute_discrete(inputs)
¶
Overview
Use observation tensor and continuous action args to predict discrete action types.
Arguments:
- inputs (:obj:Union[Dict, EasyDict]): A dict with keys 'state', 'action_args'.
- state (:obj:torch.Tensor): Observation inputs.
- action_args (:obj:torch.Tensor): Action parameters are used to concatenate with the observation and serve as input to the discrete action type network.
Returns:
- outputs (:obj:Dict): A dict with keys 'logit', 'action_args'.
- 'logit': The logit value for each discrete action.
- 'action_args': The continuous action args (same as inputs['action_args']) for later usage.
Examples:
>>> act_shape = EasyDict({'action_type_shape': (3, ), 'action_args_shape': (5, )})
>>> model = PDQN(4, act_shape)
>>> inputs = {'state': torch.randn(64, 4), 'action_args': torch.randn(64, 5)}
>>> outputs = model.forward(inputs, mode='compute_discrete')
>>> assert outputs['logit'].shape == torch.Size([64, 3])
>>> assert outputs['action_args'].shape == torch.Size([64, 5])
VAC
¶
Bases: Module
Overview
The neural network and computation graph of algorithms related to (state) Value Actor-Critic (VAC), such as A2C/PPO/IMPALA. This model now supports discrete, continuous and hybrid action spaces. The VAC is composed of four parts: actor_encoder, critic_encoder, actor_head and critic_head. Encoders are used to extract features from various observations. Heads are used to predict the corresponding value or action logit. In high-dimensional observation spaces like 2D images, we often use a shared encoder for both actor_encoder and critic_encoder. In low-dimensional observation spaces like 1D vectors, we often use different encoders.
Interfaces:
__init__, forward, compute_actor, compute_critic, compute_actor_critic.
__init__(obs_shape, action_shape, action_space='discrete', share_encoder=True, encoder_hidden_size_list=[128, 128, 64], actor_head_hidden_size=64, actor_head_layer_num=1, critic_head_hidden_size=64, critic_head_layer_num=1, activation=nn.ReLU(), norm_type=None, sigma_type='independent', fixed_sigma_value=0.3, bound_type=None, encoder=None, impala_cnn_encoder=False)
¶
Overview
Initialize the VAC model according to corresponding input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
- action_shape (:obj:Union[int, SequenceType]): Action space shape, such as 6 or [2, 3, 3].
- action_space (:obj:str): The type of different action spaces, including ['discrete', 'continuous', 'hybrid'], then will instantiate corresponding head, including DiscreteHead, ReparameterizationHead, and hybrid heads.
- share_encoder (:obj:bool): Whether to share observation encoders between actor and critic.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder, the last element is used as the input size of actor_head and critic_head.
- actor_head_hidden_size (:obj:Optional[int]): The hidden_size of actor_head network, defaults to 64, it is the hidden size of the last layer of the actor_head network.
- actor_head_layer_num (:obj:int): The num of layers used in the actor_head network to compute action.
- critic_head_hidden_size (:obj:Optional[int]): The hidden_size of critic_head network, defaults to 64, it is the hidden size of the last layer of the critic_head network.
- critic_head_layer_num (:obj:int): The num of layers used in the critic_head network.
- activation (:obj:Optional[nn.Module]): The type of activation function in networks; if None, it defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization in networks, see ding.torch_utils.fc_block for more details. You can choose one of ['BN', 'IN', 'SyncBN', 'LN'].
- sigma_type (:obj:Optional[str]): The type of sigma in continuous action space, see ding.torch_utils.network.dreamer.ReparameterizationHead for more details, in A2C/PPO, it defaults to independent, which means state-independent sigma parameters.
- fixed_sigma_value (:obj:Optional[int]): If sigma_type is fixed, then use this value as sigma.
- bound_type (:obj:Optional[str]): The type of action bound methods in continuous action space, defaults to None, which means no bound.
- encoder (:obj:Optional[torch.nn.Module]): The encoder module, defaults to None, you can define your own encoder module and pass it into VAC to deal with different observation space.
- impala_cnn_encoder (:obj:bool): Whether to use IMPALA CNN encoder, defaults to False.
forward(x, mode)
¶
Overview
VAC forward computation graph, input observation tensor to predict state value or action logit. Different mode will forward with different network modules to get different outputs and save computation.
Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.
- mode (:obj:str): The forward mode, all the modes are defined in the beginning of this class.
Returns:
- outputs (:obj:Dict): The output dict of VAC's forward computation graph, whose key-values vary from different mode.
Examples (Actor):
>>> model = VAC(64, 128)
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> assert actor_outputs['logit'].shape == torch.Size([4, 128])
Examples (Critic):
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> critic_outputs = model(inputs, 'compute_critic')
>>> assert critic_outputs['value'].shape == torch.Size([4])
Examples (Actor-Critic):
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs, 'compute_actor_critic')
>>> assert outputs['value'].shape == torch.Size([4])
>>> assert outputs['logit'].shape == torch.Size([4, 64])
compute_actor(x)
¶
Overview
VAC forward computation graph for actor part, input observation tensor to predict action logit.
Arguments:
- x (:obj:Union[torch.Tensor, Dict]): The input observation tensor data. If a dictionary is provided, it should contain keys 'observation' and optionally 'action_mask'.
Returns:
- outputs (:obj:Dict): The output dict of VAC's forward computation graph for actor, including logit and optionally action_mask if the input is a dictionary.
ReturnsKeys:
- logit (:obj:torch.Tensor): The predicted action logit tensor, for discrete action space, it will be the same dimension real-value ranged tensor of possible action choices, and for continuous action space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is the same as the number of continuous actions. Hybrid action space is a kind of combination of discrete and continuous action space, so the logit will be a dict with action_type and action_args.
- action_mask (:obj:Optional[torch.Tensor]): The action mask tensor, included if the input is a dictionary containing 'action_mask'.
Shapes:
- logit (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is action_shape
Examples:
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs,'compute_actor')
>>> assert actor_outputs['logit'].shape == torch.Size([4, 64])
compute_critic(x)
¶
Overview
VAC forward computation graph for critic part, input observation tensor to predict state value.
Arguments:
- x (:obj:Union[torch.Tensor, Dict]): The input observation tensor data. If a dictionary is provided, it should contain the key 'observation'.
Returns:
- outputs (:obj:Dict): The output dict of VAC's forward computation graph for critic, including value.
ReturnsKeys:
- value (:obj:torch.Tensor): The predicted state value tensor.
Shapes:
- value (:obj:torch.Tensor): :math:(B, ), where B is batch size, (B, 1) is squeezed to (B, ).
Examples:
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> critic_outputs = model(inputs,'compute_critic')
>>> assert critic_outputs['value'].shape == torch.Size([4])
compute_actor_critic(x)
¶
Overview
VAC forward computation graph for both actor and critic part, input observation tensor to predict action logit and state value.
Arguments:
- x (:obj:Union[torch.Tensor, Dict]): The input observation tensor data. If a dictionary is provided, it should contain keys 'observation' and optionally 'action_mask'.
Returns:
- outputs (:obj:Dict): The output dict of VAC's forward computation graph for both actor and critic, including logit, value, and optionally action_mask if the input is a dictionary.
ReturnsKeys:
- logit (:obj:torch.Tensor): The predicted action logit tensor, for discrete action space, it will be the same dimension real-value ranged tensor of possible action choices, and for continuous action space, it will be the mu and sigma of the Gaussian distribution, and the number of mu and sigma is the same as the number of continuous actions. Hybrid action space is a kind of combination of discrete and continuous action space, so the logit will be a dict with action_type and action_args.
- value (:obj:torch.Tensor): The predicted state value tensor.
- action_mask (:obj:torch.Tensor, optional): The action mask tensor, included if the input is a dictionary containing 'action_mask'.
Shapes:
- logit (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is action_shape
- value (:obj:torch.Tensor): :math:(B, ), where B is batch size, (B, 1) is squeezed to (B, ).
Examples:
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs,'compute_actor_critic')
>>> assert outputs['value'].shape == torch.Size([4])
>>> assert outputs['logit'].shape == torch.Size([4, 64])
.. note::
The compute_actor_critic interface aims to save computation when the encoder is shared, returning the combined dict output.
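The saving mentioned in the note can be sketched directly: with a shared encoder, the combined mode runs the encoder once and feeds both heads, instead of encoding the observation twice. The layers below are illustrative stand-ins, not ding's modules.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(64, 32)      # shared observation encoder
actor_head = nn.Linear(32, 6)    # predicts action logit
critic_head = nn.Linear(32, 1)   # predicts state value

x = torch.randn(4, 64)
feat = encoder(x)                     # single encoder pass, reused below
logit = actor_head(feat)              # shape (4, 6)
value = critic_head(feat).squeeze(1)  # shape (4, ), (B, 1) squeezed to (B, )
```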
DREAMERVAC
¶
Bases: Module
Overview
The neural network and computation graph of DreamerV3 (state) Value Actor-Critic (VAC). This model now supports discrete and continuous action spaces.
Interfaces:
__init__, forward.
__init__(action_shape, dyn_stoch=32, dyn_deter=512, dyn_discrete=32, actor_layers=2, value_layers=2, units=512, act='SiLU', norm='LayerNorm', actor_dist='normal', actor_init_std=1.0, actor_min_std=0.1, actor_max_std=1.0, actor_temp=0.1, action_unimix_ratio=0.01)
¶
Overview
Initialize the DREAMERVAC model according to arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
- action_shape (:obj:Union[int, SequenceType]): Action space shape, such as 6 or [2, 3, 3].
DiscreteBC
¶
Bases: Module
Overview
The DiscreteBC network.
Interfaces:
__init__, forward
__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], dueling=True, head_hidden_size=None, head_layer_num=1, activation=nn.ReLU(), norm_type=None, strides=None)
¶
Overview
Init the DiscreteBC (encoder + head) Model according to input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
- action_shape (:obj:Union[int, SequenceType]): Action space shape, such as 6 or [2, 3, 3].
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder, the last element must match head_hidden_size.
- dueling (:obj:bool): Whether to use DuelingHead (True) or DiscreteHead (False).
- head_hidden_size (:obj:Optional[int]): The hidden_size of head network.
- head_layer_num (:obj:int): The number of layers used in the head network to compute Q value output.
- activation (:obj:Optional[nn.Module]): The type of activation function in networks; if None, it defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization in networks, see ding.torch_utils.fc_block for more details.
- strides (:obj:Optional[list]): The strides for each convolution layers, such as [2, 2, 2]. The length of this argument should be the same as encoder_hidden_size_list.
forward(x)
¶
Overview
DiscreteBC forward computation graph, input observation tensor to predict q_value.
Arguments:
- x (:obj:torch.Tensor): Observation inputs
Returns:
- outputs (:obj:Dict): DiscreteBC forward outputs, such as q_value.
ReturnsKeys:
- logit (:obj:torch.Tensor): Discrete Q-value output of each action dimension.
Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is obs_shape
- logit (:obj:torch.FloatTensor): :math:(B, M), where B is batch size and M is action_shape
Examples:
>>> model = DiscreteBC(32, 6) # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 32)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 6])
ContinuousBC
¶
Bases: Module
Overview
The ContinuousBC network.
Interfaces:
__init__, forward
__init__(obs_shape, action_shape, action_space, actor_head_hidden_size=64, actor_head_layer_num=1, activation=nn.ReLU(), norm_type=None)
¶
Overview
Initialize the ContinuousBC Model according to input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation's shape, such as 128, (156, ).
- action_shape (:obj:Union[int, SequenceType, EasyDict]): Action's shape, such as 4, (3, ), EasyDict({'action_type_shape': 3, 'action_args_shape': 4}).
- action_space (:obj:str): The type of action space, including [regression, reparameterization].
- actor_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to actor head.
- actor_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for actor head.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in MLP after each FC layer; if None, it defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to use after network layers (FC, Conv), see ding.torch_utils.network for more details.
forward(inputs)
¶
Overview
The unique execution (forward) method of ContinuousBC.
Arguments:
- inputs (:obj:torch.Tensor): Observation tensor data.
Returns:
- output (:obj:Dict): Output dict data, including different key-values among distinct action_space.
ReturnsKeys:
- action (:obj:torch.Tensor): action output of actor network, with shape :math:(B, action_shape).
- logit (:obj:List[torch.Tensor]): reparameterized action output of actor network, with shape :math:(B, action_shape).
Shapes:
- inputs (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is obs_shape
- action (:obj:torch.FloatTensor): :math:(B, M), where B is batch size and M is action_shape
- logit (:obj:List[torch.FloatTensor]): :math:(B, M), where B is batch size and M is action_shape
Examples (Regression):
>>> model = ContinuousBC(32, 6, action_space='regression')
>>> inputs = torch.randn(4, 32)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict) and outputs['action'].shape == torch.Size([4, 6])
Examples (Reparameterization):
>>> model = ContinuousBC(32, 6, action_space='reparameterization')
>>> inputs = torch.randn(4, 32)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'][0].shape == torch.Size([4, 6])
>>> assert outputs['logit'][1].shape == torch.Size([4, 6])
LanguageTransformer
¶
Bases: Module
Overview
The LanguageTransformer network. Download a pre-trained language model and add a head on top of it. In the default case, we use a BERT model as the text encoder, whose bidirectional nature is well-suited for obtaining the embedding of a whole sentence.
Interfaces:
__init__, forward
__init__(model_name='bert-base-uncased', add_linear=False, embedding_size=128, freeze_encoder=True, hidden_dim=768, norm_embedding=False)
¶
Overview
Init the LanguageTransformer Model according to input arguments.
Arguments:
- model_name (:obj:str): The base language model name in huggingface, such as "bert-base-uncased".
- add_linear (:obj:bool): Whether to add a linear layer on top of the language model, defaults to False.
- embedding_size (:obj:int): The embedding size of the added linear layer, such as 128.
- freeze_encoder (:obj:bool): Whether to freeze the encoder language model while training, defaults to True.
- hidden_dim (:obj:int): The embedding dimension of the encoding model (e.g. BERT). This value should correspond to the model you use. For bert-base-uncased, this value is 768.
- norm_embedding (:obj:bool): Whether to normalize the embedding vectors, defaults to False.
forward(train_samples, candidate_samples=None, mode='compute_actor')
¶
Overview
LanguageTransformer forward computation graph, input two lists of strings and predict their matching scores.
Different mode will forward with different network modules to get different outputs.
Arguments:
- train_samples (:obj:List[str]): One list of strings.
- candidate_samples (:obj:Optional[List[str]]): The other list of strings to calculate matching scores.
- mode (:obj:str): The forward mode, all the modes are defined in the beginning of this class.
Returns:
- output (:obj:Dict): Output dict data, including the logit of matching scores and the corresponding torch.distributions.Categorical object.
Examples:
>>> test_pids = [1]
>>> cand_pids = [0, 2, 4]
>>> problems = [ "This is problem 0", "This is the first question", "Second problem is here", "Another problem", "This is the last problem" ]
>>> ctxt_list = [problems[pid] for pid in test_pids]
>>> cands_list = [problems[pid] for pid in cand_pids]
>>> model = LanguageTransformer(model_name="bert-base-uncased", add_linear=True, embedding_size=256)
>>> scores = model(ctxt_list, cands_list)
>>> assert scores.shape == (1, 3)
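The (1, 3) score shape in the example can be understood as a similarity matrix between one query embedding and three candidate embeddings. The sketch below uses random tensors as stand-ins for the BERT sentence embeddings that LanguageTransformer actually produces; the dot-product scoring here is illustrative, not necessarily the exact scoring ding uses.

```python
import torch

ctxt_emb = torch.randn(1, 256)    # 1 query sentence embedding (embedding_size=256)
cand_emb = torch.randn(3, 256)    # 3 candidate sentence embeddings

# One score per (query, candidate) pair via dot products.
scores = ctxt_emb @ cand_emb.t()  # shape (1, 3)
```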
PG
¶
Bases: Module
Overview
The neural network and computation graph of algorithms related to Policy Gradient (PG) (https://proceedings.neurips.cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf). The PG model is composed of two parts: encoder and head. Encoders are used to extract features from various observations. Heads are used to predict the corresponding action logit.
Interface:
__init__, forward.
__init__(obs_shape, action_shape, action_space='discrete', encoder_hidden_size_list=[128, 128, 64], head_hidden_size=None, head_layer_num=1, activation=nn.ReLU(), norm_type=None)
¶
Overview
Initialize the PG model according to corresponding input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
- action_shape (:obj:Union[int, SequenceType]): Action space shape, such as 6 or [2, 3, 3].
- action_space (:obj:str): The type of different action spaces, including ['discrete', 'continuous'], then will instantiate corresponding head, including DiscreteHead and ReparameterizationHead.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder, the last element must match head_hidden_size.
- head_hidden_size (:obj:Optional[int]): The hidden_size of head network, defaults to None, it must match the last element of encoder_hidden_size_list.
- head_layer_num (:obj:int): The num of layers used in the head network to compute action.
- activation (:obj:Optional[nn.Module]): The type of activation function in networks, if None then default set to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization in networks, see ding.torch_utils.fc_block for more details. You can choose one of ['BN', 'IN', 'SyncBN', 'LN'].
Examples:
>>> model = PG((4, 84, 84), 5)
>>> inputs = torch.randn(8, 4, 84, 84)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == (8, 5)
>>> assert outputs['dist'].sample().shape == (8, )
forward(x)
¶
Overview
PG forward computation graph, input observation tensor to predict policy distribution.
Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.
Returns:
- outputs (:obj:torch.distributions): The output policy distribution. If action space is discrete, the output is Categorical distribution; if action space is continuous, the output is Normal distribution.
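As a concrete illustration of the discrete branch, the returned Categorical distribution is built from the head's logit via a softmax. Below is a minimal pure-Python sketch of that final step (stable softmax, then categorical sampling) with made-up logit values; it does not depend on torch or ding:

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax: convert raw logits to probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_categorical(probs, rng):
    """Draw one action index according to the categorical probabilities."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off

logits = [0.1, 2.0, -1.0, 0.5]      # hypothetical head output for one sample
probs = softmax(logits)
action = sample_categorical(probs, random.Random(0))
```

In the real model, torch.distributions.Categorical(logits=logit) performs the same normalization and sampling in a batched, differentiable way.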
PPG
¶
Bases: Module
Overview
Phasic Policy Gradient (PPG) model from the paper Phasic Policy Gradient (https://arxiv.org/abs/2009.04416). This module contains a VAC module and an auxiliary critic module.
Interfaces:
forward, compute_actor, compute_critic, compute_actor_critic
__init__(obs_shape, action_shape, action_space='discrete', share_encoder=True, encoder_hidden_size_list=[128, 128, 64], actor_head_hidden_size=64, actor_head_layer_num=2, critic_head_hidden_size=64, critic_head_layer_num=1, activation=nn.ReLU(), norm_type=None, impala_cnn_encoder=False)
¶
Overview
Initialize the PPG model according to input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation's shape, such as 128, (156, ).
- action_shape (:obj:Union[int, SequenceType]): Action's shape, such as 4, (3, ).
- action_space (:obj:str): The action space type, such as 'discrete', 'continuous'.
- share_encoder (:obj:bool): Whether to share encoder.
- encoder_hidden_size_list (:obj:SequenceType): The hidden size list of encoder.
- actor_head_hidden_size (:obj:int): The hidden_size to pass to actor head.
- actor_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for actor head.
- critic_head_hidden_size (:obj:int): The hidden_size to pass to critic head.
- critic_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for critic head.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in MLP after each FC layer, if None then default set to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to use after each network layer (FC, Conv), see ding.torch_utils.network for more details.
- impala_cnn_encoder (:obj:bool): Whether to use impala cnn encoder.
forward(inputs, mode)
¶
Overview
Compute action logits or value according to mode being compute_actor, compute_critic or compute_actor_critic.
Arguments:
- inputs (:obj:torch.Tensor): The input observation tensor data.
- mode (:obj:str): The forward mode, all the modes are defined in the beginning of this class.
Returns:
- outputs (:obj:Dict): The output dict of PPG's forward computation graph, whose key-values vary from different mode.
compute_actor(x)
¶
Overview
Use actor to compute action logits.
Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.
Returns:
- output (:obj:Dict): The output data containing action logits.
ReturnsKeys:
- logit (:obj:torch.Tensor): The predicted action logit tensor. For discrete action space, it is a real-valued tensor with one dimension per possible action; for continuous action space, it is the mu and sigma of a Gaussian distribution, with one mu and sigma per continuous action dimension. A hybrid action space is a combination of discrete and continuous action spaces, so the logit is a dict with action_type and action_args.
Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is the input feature size.
- output (:obj:Dict): logit: :math:(B, A), where B is batch size and A is the action space size.
compute_critic(x)
¶
Overview
Use critic to compute value.
Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.
Returns:
- output (:obj:Dict): The output dict of VAC's forward computation graph for critic, including value.
ReturnsKeys:
- necessary: value
Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is the input feature size.
- output (:obj:Dict): value: :math:(B, 1), where B is batch size.
compute_actor_critic(x)
¶
Overview
Use actor and critic to compute action logits and value.
Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.
Returns:
- outputs (:obj:Dict): The output dict of PPG's forward computation graph for both actor and critic, including logit and value.
ReturnsKeys:
- logit (:obj:torch.Tensor): The predicted action logit tensor. For discrete action space, it is a real-valued tensor with one dimension per possible action; for continuous action space, it is the mu and sigma of a Gaussian distribution, with one mu and sigma per continuous action dimension. A hybrid action space is a combination of discrete and continuous action spaces, so the logit is a dict with action_type and action_args.
- value (:obj:torch.Tensor): The predicted state value tensor.
Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is the input feature size.
- output (:obj:Dict): value: :math:(B, 1), where B is batch size.
- output (:obj:Dict): logit: :math:(B, A), where B is batch size and A is the action space size.
.. note::
The compute_actor_critic interface saves computation when the encoder is shared.
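To make the note concrete, here is a toy stand-in (hypothetical encoder and head functions, not the real PPG modules) that counts encoder forward passes: the joint interface runs the shared encoder once, while separate actor and critic calls would run it twice:

```python
calls = {"encoder": 0}

def encoder(obs):                      # stand-in shared feature extractor
    calls["encoder"] += 1
    return [o * 0.5 for o in obs]

def actor_head(feat):                  # stand-in policy head -> logits
    return [f + 1.0 for f in feat]

def critic_head(feat):                 # stand-in value head -> scalar value
    return sum(feat)

def compute_actor(obs):
    return {"logit": actor_head(encoder(obs))}

def compute_critic(obs):
    return {"value": critic_head(encoder(obs))}

def compute_actor_critic(obs):
    feat = encoder(obs)                # the encoder forward runs only once
    return {"logit": actor_head(feat), "value": critic_head(feat)}

obs = [1.0, 2.0, 3.0]
out = compute_actor_critic(obs)        # 1 encoder call
separate = (compute_actor(obs), compute_critic(obs))  # 2 more encoder calls
```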
Mixer
¶
Bases: Module
Overview
Mixer network in QMIX, which mixes up the independent q_value of each agent into a total q_value. The weights (but not the biases) of the Mixer network are restricted to be non-negative and produced by separate hypernetworks. Each hypernetwork takes the global state s as input and generates the weights of one layer of the Mixer network.
Interface:
__init__, forward.
__init__(agent_num, state_dim, mixing_embed_dim, hypernet_embed=64, activation=nn.ReLU())
¶
Overview
Initialize mixer network proposed in QMIX according to arguments. Each hypernetwork consists of linear layers, followed by an absolute activation function, to ensure that the Mixer network weights are non-negative.
Arguments:
- agent_num (:obj:int): The number of agent, such as 8.
- state_dim (:obj:int): The dimension of global observation state, such as 16.
- mixing_embed_dim (:obj:int): The dimension of mixing state embedding, such as 128.
- hypernet_embed (:obj:int): The dimension of hypernet embedding, default to 64.
- activation (:obj:nn.Module): Activation function in network, defaults to nn.ReLU().
forward(agent_qs, states)
¶
Overview
Forward computation graph of pymarl mixer network. Mix up the input independent q_value of each agent to a total q_value with weights generated by hypernetwork according to global states.
Arguments:
- agent_qs (:obj:torch.FloatTensor): The independent q_value of each agent.
- states (:obj:torch.FloatTensor): The embedding vector of global state.
Returns:
- q_tot (:obj:torch.FloatTensor): The total mixed q_value.
Shapes:
- agent_qs (:obj:torch.FloatTensor): :math:(B, N), where B is batch size and N is agent_num.
- states (:obj:torch.FloatTensor): :math:(B, M), where M is embedding_size.
- q_tot (:obj:torch.FloatTensor): :math:(B, ).
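The monotonicity mechanism can be sketched in a few lines of pure Python: a toy linear hypernetwork (made-up coefficients, not the real ding modules) produces weights from the state, an absolute value keeps them non-negative, and q_tot is therefore non-decreasing in every agent's q_value:

```python
def hyper_weights(state, agent_num):
    """Toy hypernetwork: a linear map of the global state per agent,
    passed through abs() so every mixing weight is non-negative."""
    raw = [sum(state) * (i + 1) * 0.1 - 0.5 for i in range(agent_num)]
    return [abs(w) for w in raw]

def mix(agent_qs, state):
    """One-layer mixer: q_tot = sum_i |w_i(s)| * q_i + b(s)."""
    weights = hyper_weights(state, len(agent_qs))
    bias = sum(state) * 0.01            # the bias is not constrained
    return sum(w * q for w, q in zip(weights, agent_qs)) + bias

state = [0.2, -0.1, 0.4]
q_tot_low = mix([1.0, 1.0, 1.0], state)
q_tot_high = mix([2.0, 1.0, 1.0], state)  # raising one agent's Q cannot lower q_tot
```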
QMix
¶
Bases: Module
Overview
The neural network and computation graph of algorithms related to QMIX (https://arxiv.org/abs/1803.11485). The QMIX model is composed of two parts: agent Q network and mixer (optional). The QMIX paper mentions that all agents share local Q network parameters, so only one Q network is initialized here. Summation or the Mixer network is then used to process the local Q values, according to the mixer setting, to obtain the global Q value.
Interface:
__init__, forward.
__init__(agent_num, obs_shape, global_obs_shape, action_shape, hidden_size_list, mixer=True, lstm_type='gru', activation=nn.ReLU(), dueling=False)
¶
Overview
Initialize QMIX neural network according to arguments, i.e. agent Q network and mixer.
Arguments:
- agent_num (:obj:int): The number of agent, such as 8.
- obs_shape (:obj:int): The dimension of each agent's observation state, such as 8 or [4, 84, 84].
- global_obs_shape (:obj:int): The dimension of global observation state, such as 8 or [4, 84, 84].
- action_shape (:obj:int): The dimension of action shape, such as 6 or [2, 3, 3].
- hidden_size_list (:obj:list): The list of hidden size for q_network, the last element must match mixer's mixing_embed_dim.
- mixer (:obj:bool): Use mixer net or not, default to True. If it is false, the final local Q is added to obtain the global Q.
- lstm_type (:obj:str): The type of RNN module in q_network, now support ['normal', 'pytorch', 'gru'], default to gru.
- activation (:obj:nn.Module): The type of activation function to use in the MLP after each layer_fn, if None then default set to nn.ReLU().
- dueling (:obj:bool): Whether choose DuelingHead (True) or DiscreteHead (False), default to False.
forward(data, single_step=True)
¶
Overview
QMIX forward computation graph, input dict including time series observation and related data to predict total q_value and each agent q_value.
Arguments:
- data (:obj:dict): Input data dict with keys ['obs', 'prev_state', 'action'].
- agent_state (:obj:torch.Tensor): Time series local observation data of each agent.
- global_state (:obj:torch.Tensor): Time series global observation data.
- prev_state (:obj:list): Previous rnn state for q_network.
- action (:obj:torch.Tensor or None): The actions of each agent given outside the function. If action is None, use argmax q_value index as action to calculate agent_q_act.
- single_step (:obj:bool): Whether single_step forward, if so, add timestep dim before forward and remove it after forward.
Returns:
- ret (:obj:dict): Output data dict with keys [total_q, logit, next_state].
ReturnsKeys:
- total_q (:obj:torch.Tensor): Total q_value, which is the result of mixer network.
- agent_q (:obj:torch.Tensor): Each agent q_value.
- next_state (:obj:list): Next rnn state for q_network.
Shapes:
- agent_state (:obj:torch.Tensor): :math:(T, B, A, N), where T is timestep, B is batch_size, A is agent_num, N is obs_shape.
- global_state (:obj:torch.Tensor): :math:(T, B, M), where M is global_obs_shape.
- prev_state (:obj:list): :math:(B, A), a list of length B, and each element is a list of length A.
- action (:obj:torch.Tensor): :math:(T, B, A).
- total_q (:obj:torch.Tensor): :math:(T, B).
- agent_q (:obj:torch.Tensor): :math:(T, B, A, P), where P is action_shape.
- next_state (:obj:list): :math:(B, A), a list of length B, and each element is a list of length A.
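As described above, when action is None the greedy argmax index is used, and when mixer=False the total Q is just the sum of each agent's executed-action Q. A pure-Python sketch of that selection-and-sum step (toy Q values, not the real ding modules):

```python
def select_agent_q(agent_q, action=None):
    """Pick each agent's executed-action Q: use the given actions if provided,
    otherwise fall back to the greedy argmax index per agent."""
    if action is None:
        action = [max(range(len(q)), key=lambda a: q[a]) for q in agent_q]
    return [q[a] for q, a in zip(agent_q, action)], action

def total_q_by_sum(agent_q, action=None):
    """mixer=False branch: the global Q is the plain sum of local executed Qs."""
    executed, action = select_agent_q(agent_q, action)
    return sum(executed), action

agent_q = [[0.1, 0.9], [0.7, 0.3]]       # two agents, two actions each
q_tot, greedy = total_q_by_sum(agent_q)  # greedy selection: argmax per agent
```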
CollaQ
¶
Bases: Module
Overview
The network of the CollaQ (Collaborative Q-learning) algorithm, proposed in Multi-Agent Collaboration via Reward Attribution Decomposition (https://arxiv.org/abs/2010.08531). It includes two parts: q_network and q_alone_network. The q_network computes the q_value from the agent's observation together with the observation features of the allies the agent attends to. The q_alone_network computes the q_value from the agent's observation alone, without ally features.
Interface:
__init__, forward, _setup_global_encoder
__init__(agent_num, obs_shape, alone_obs_shape, global_obs_shape, action_shape, hidden_size_list, attention=False, self_feature_range=None, ally_feature_range=None, attention_size=32, mixer=True, lstm_type='gru', activation=nn.ReLU(), dueling=False)
¶
Overview
Initialize the CollaQ network.
Arguments:
- agent_num (:obj:int): the number of agent
- obs_shape (:obj:int): the dimension of each agent's observation state
- alone_obs_shape (:obj:int): the dimension of each agent's observation state without other agents
- global_obs_shape (:obj:int): the dimension of global observation state
- action_shape (:obj:int): the dimension of action shape
- hidden_size_list (:obj:list): the list of hidden size
- attention (:obj:bool): use attention module or not, default to False
- self_feature_range (:obj:Union[List[int], None]): the agent's feature range
- ally_feature_range (:obj:Union[List[int], None]): the agent ally's feature range
- attention_size (:obj:int): the size of attention net layer
- mixer (:obj:bool): use mixer net or not, default to True
- lstm_type (:obj:str): use lstm or gru, default to gru
- activation (:obj:nn.Module): Activation function in network, defaults to nn.ReLU().
- dueling (:obj:bool): use dueling head or not, default to False.
forward(data, single_step=True)
¶
Overview
The forward method calculates the q_value of each agent and the total q_value of all agents. The q_value of each agent is calculated by the q_network, and the total q_value is calculated by the mixer.
Arguments:
- data (:obj:dict): input data dict with keys ['obs', 'prev_state', 'action']
- agent_state (:obj:torch.Tensor): each agent local state(obs)
- agent_alone_state (:obj:torch.Tensor): each agent's local state alone, which in the smac setting is without ally features (obs_alone)
- global_state (:obj:torch.Tensor): global state(obs)
- prev_state (:obj:list): previous rnn state, should include 3 parts: one hidden state of q_network, and two hidden states of q_alone_network for the obs and obs_alone inputs
- action (:obj:torch.Tensor or None): if action is None, use argmax q_value index as action to calculate agent_q_act
- single_step (:obj:bool): whether single_step forward, if so, add timestep dim before forward and remove it after forward
Return:
- ret (:obj:dict): output data dict with keys ['total_q', 'logit', 'next_state']
- total_q (:obj:torch.Tensor): total q_value, which is the result of mixer network
- agent_q (:obj:torch.Tensor): each agent q_value
- next_state (:obj:list): next rnn state
Shapes:
- agent_state (:obj:torch.Tensor): :math:(T, B, A, N), where T is timestep, B is batch_size, A is agent_num, N is obs_shape
- global_state (:obj:torch.Tensor): :math:(T, B, M), where M is global_obs_shape
- prev_state (:obj:list): :math:(B, A), a list of length B, and each element is a list of length A
- action (:obj:torch.Tensor): :math:(T, B, A)
- total_q (:obj:torch.Tensor): :math:(T, B)
- agent_q (:obj:torch.Tensor): :math:(T, B, A, P), where P is action_shape
- next_state (:obj:list): :math:(B, A), a list of length B, and each element is a list of length A
Examples:
>>> collaQ_model = CollaQ(
>>> agent_num=4,
>>> obs_shape=32,
>>> alone_obs_shape=24,
>>> global_obs_shape=32 * 4,
>>> action_shape=9,
>>> hidden_size_list=[128, 64],
>>> self_feature_range=[8, 10],
>>> ally_feature_range=[10, 16],
>>> attention_size=64,
>>> mixer=True,
>>> activation=torch.nn.Tanh()
>>> )
>>> data={
>>> 'obs': {
>>> 'agent_state': torch.randn(8, 4, 4, 32),
>>> 'agent_alone_state': torch.randn(8, 4, 4, 24),
>>> 'agent_alone_padding_state': torch.randn(8, 4, 4, 32),
>>> 'global_state': torch.randn(8, 4, 32 * 4),
>>> 'action_mask': torch.randint(0, 2, size=(8, 4, 4, 9))
>>> },
>>> 'prev_state': [[[None for _ in range(4)] for _ in range(3)] for _ in range(4)],
>>> 'action': torch.randint(0, 9, size=(8, 4, 4))
>>> }
>>> output = collaQ_model(data, single_step=False)
WQMix
¶
Bases: Module
Overview
The WQMIX (https://arxiv.org/abs/2006.10800) network. There are two components: 1) Q_tot, which is the same as the QMIX network and composed of an agent Q network and a mixer network. 2) An unrestricted joint-action Q_star, which is composed of an agent Q network and a mixer_star network. The QMIX paper mentions that all agents share local Q network parameters, so only one Q network is initialized for each of Q_tot and Q_star.
Interface:
__init__, forward.
__init__(agent_num, obs_shape, global_obs_shape, action_shape, hidden_size_list, lstm_type='gru', dueling=False)
¶
Overview
Initialize WQMIX neural network according to arguments, i.e. agent Q network and mixer, Q_star network and mixer_star.
Arguments:
- agent_num (:obj:int): The number of agent, such as 8.
- obs_shape (:obj:int): The dimension of each agent's observation state, such as 8.
- global_obs_shape (:obj:int): The dimension of global observation state, such as 8.
- action_shape (:obj:int): The dimension of action shape, such as 6.
- hidden_size_list (:obj:list): The list of hidden size for q_network, the last element must match mixer's mixing_embed_dim.
- lstm_type (:obj:str): The type of RNN module in q_network, now support ['normal', 'pytorch', 'gru'], default to gru.
- dueling (:obj:bool): Whether choose DuelingHead (True) or DiscreteHead (False), default to False.
forward(data, single_step=True, q_star=False)
¶
Overview
Forward computation graph of the WQMIX network. Input dict including time series observation and related data to predict total q_value and each agent q_value. Determine whether to calculate Q_tot or Q_star based on the q_star parameter.
Arguments:
- data (:obj:dict): Input data dict with keys ['obs', 'prev_state', 'action'].
- agent_state (:obj:torch.Tensor): Time series local observation data of each agents.
- global_state (:obj:torch.Tensor): Time series global observation data.
- prev_state (:obj:list): Previous rnn state for q_network or _q_network_star.
- action (:obj:torch.Tensor or None): If action is None, use argmax q_value index as action to calculate agent_q_act.
- single_step (:obj:bool): Whether single_step forward, if so, add timestep dim before forward and remove it after forward.
- q_star (:obj:bool): Whether to forward the Q_star network. If True, use the Q_star network, where the agent networks have the same architecture as the Q network but do not share parameters, and the mixing network is a feedforward network with 3 hidden layers of 256 dims; if False, use the Q network, the same as the Q network in the QMIX paper.
Returns:
- ret (:obj:dict): Output data dict with keys [total_q, logit, next_state].
- total_q (:obj:torch.Tensor): Total q_value, which is the result of mixer network.
- agent_q (:obj:torch.Tensor): Each agent q_value.
- next_state (:obj:list): Next rnn state.
Shapes:
- agent_state (:obj:torch.Tensor): :math:(T, B, A, N), where T is timestep, B is batch_size, A is agent_num, N is obs_shape.
- global_state (:obj:torch.Tensor): :math:(T, B, M), where M is global_obs_shape.
- prev_state (:obj:list): :math:(B, A), a list of length B, and each element is a list of length A.
- action (:obj:torch.Tensor): :math:(T, B, A).
- total_q (:obj:torch.Tensor): :math:(T, B).
- agent_q (:obj:torch.Tensor): :math:(T, B, A, P), where P is action_shape.
- next_state (:obj:list): :math:(B, A), a list of length B, and each element is a list of length A.
COMA
¶
Bases: Module
Overview
The network of the COMA algorithm, which is a QAC-type actor-critic.
Interface:
__init__, forward
Properties:
- mode (:obj:list): The list of forward mode, including compute_actor and compute_critic
__init__(agent_num, obs_shape, action_shape, actor_hidden_size_list)
¶
Overview
initialize the COMA network
Arguments:
- agent_num (:obj:int): the number of agent
- obs_shape (:obj:Dict): the observation information, including agent_state and global_state
- action_shape (:obj:Union[int, SequenceType]): the dimension of action shape
- actor_hidden_size_list (:obj:SequenceType): the list of hidden size
forward(inputs, mode)
¶
Overview
forward computation graph of COMA network
Arguments:
- inputs (:obj:dict): input data dict with keys ['obs', 'prev_state', 'action']
- agent_state (:obj:torch.Tensor): each agent local state(obs)
- global_state (:obj:torch.Tensor): global state(obs)
- action (:obj:torch.Tensor): the masked action
ArgumentsKeys:
- necessary: obs { agent_state, global_state, action_mask }, action, prev_state
ReturnsKeys:
- necessary:
- compute_critic: q_value
- compute_actor: logit, next_state, action_mask
Shapes:
- obs (:obj:dict): agent_state: :math:(T, B, A, N, D), action_mask: :math:(T, B, A, N, A)
- prev_state (:obj:list): :math:[[[h, c] for _ in range(A)] for _ in range(B)]
- logit (:obj:torch.Tensor): :math:(T, B, A, N, A)
- next_state (:obj:list): :math:[[[h, c] for _ in range(A)] for _ in range(B)]
- action_mask (:obj:torch.Tensor): :math:(T, B, A, N, A)
- q_value (:obj:torch.Tensor): :math:(T, B, A, N, A)
Examples:
>>> agent_num, bs, T = 4, 3, 8
>>> obs_dim, global_obs_dim, action_dim = 32, 32 * 4, 9
>>> coma_model = COMA(
>>> agent_num=agent_num,
>>> obs_shape=dict(agent_state=(obs_dim, ), global_state=(global_obs_dim, )),
>>> action_shape=action_dim,
>>> actor_hidden_size_list=[128, 64],
>>> )
>>> prev_state = [[None for _ in range(agent_num)] for _ in range(bs)]
>>> data = {
>>> 'obs': {
>>> 'agent_state': torch.randn(T, bs, agent_num, obs_dim),
>>> 'action_mask': None,
>>> },
>>> 'prev_state': prev_state,
>>> }
>>> output = coma_model(data, mode='compute_actor')
>>> data= {
>>> 'obs': {
>>> 'agent_state': torch.randn(T, bs, agent_num, obs_dim),
>>> 'global_state': torch.randn(T, bs, global_obs_dim),
>>> },
>>> 'action': torch.randint(0, action_dim, size=(T, bs, agent_num)),
>>> }
>>> output = coma_model(data, mode='compute_critic')
ATOC
¶
Bases: Module
Overview
The QAC network of ATOC, an extension of DDPG for MARL, proposed in Learning Attentional Communication for Multi-Agent Cooperation (https://arxiv.org/abs/1805.07733).
Interface:
__init__, forward, compute_critic, compute_actor, optimize_actor_attention
__init__(obs_shape, action_shape, thought_size, n_agent, communication=True, agent_per_group=2, actor_1_embedding_size=None, actor_2_embedding_size=None, critic_head_hidden_size=64, critic_head_layer_num=2, activation=nn.ReLU(), norm_type=None)
¶
Overview
Initialize the ATOC QAC network.
Arguments:
- obs_shape (:obj:Union[Tuple, int]): the observation space shape
- action_shape (:obj:int): the action space shape
- thought_size (:obj:int): the size of thoughts
- n_agent (:obj:int): the num of agents
- agent_per_group (:obj:int): the num of agents in each group
compute_actor(obs, get_delta_q=False)
¶
Overview
compute the action according to inputs, calling the _compute_delta_q function to compute delta_q
Arguments:
- obs (:obj:torch.Tensor): observation
- get_delta_q (:obj:bool) : whether need to get delta_q
Returns:
- outputs (:obj:Dict): the output of actor network and delta_q
ReturnsKeys:
- necessary: action
- optional: group, initiator_prob, is_initiator, new_thoughts, old_thoughts, delta_q
Shapes:
- obs (:obj:torch.Tensor): :math:(B, A, N), where B is batch size, A is agent num, N is obs size
- action (:obj:torch.Tensor): :math:(B, A, M), where M is action size
- group (:obj:torch.Tensor): :math:(B, A, A)
- initiator_prob (:obj:torch.Tensor): :math:(B, A)
- is_initiator (:obj:torch.Tensor): :math:(B, A)
- new_thoughts (:obj:torch.Tensor): :math:(B, A, M)
- old_thoughts (:obj:torch.Tensor): :math:(B, A, M)
- delta_q (:obj:torch.Tensor): :math:(B, A)
Examples:
>>> net = ATOC(64, 64, 64, 3)
>>> obs = torch.randn(2, 3, 64)
>>> net.compute_actor(obs)
compute_critic(inputs)
¶
Overview
compute the q_value according to inputs
Arguments:
- inputs (:obj:Dict): the inputs contain the obs and action
Returns:
- outputs (:obj:Dict): the output of critic network
ArgumentsKeys:
- necessary: obs, action
ReturnsKeys:
- necessary: q_value
Shapes:
- obs (:obj:torch.Tensor): :math:(B, A, N), where B is batch size, A is agent num, N is obs size
- action (:obj:torch.Tensor): :math:(B, A, M), where M is action size
- q_value (:obj:torch.Tensor): :math:(B, A)
Examples:
>>> net = ATOC(64, 64, 64, 3)
>>> obs = torch.randn(2, 3, 64)
>>> action = torch.randn(2, 3, 64)
>>> net.compute_critic({'obs': obs, 'action': action})
optimize_actor_attention(inputs)
¶
Overview
return the actor attention loss
Arguments:
- inputs (:obj:Dict): the inputs contain the delta_q, initiator_prob, and is_initiator
Returns:
- loss (:obj:Dict): the loss of actor attention unit
ArgumentsKeys:
- necessary: delta_q, initiator_prob, is_initiator
ReturnsKeys:
- necessary: loss
Shapes:
- delta_q (:obj:torch.Tensor): :math:(B, A)
- initiator_prob (:obj:torch.Tensor): :math:(B, A)
- is_initiator (:obj:torch.Tensor): :math:(B, A)
- loss (:obj:torch.Tensor): :math:(1)
Examples:
>>> net = ATOC(64, 64, 64, 3)
>>> delta_q = torch.randn(2, 3)
>>> initiator_prob = torch.randn(2, 3)
>>> is_initiator = torch.randn(2, 3)
>>> net.optimize_actor_attention(
>>> {'delta_q': delta_q,
>>> 'initiator_prob': initiator_prob,
>>> 'is_initiator': is_initiator})
ACER
¶
Bases: Module
Overview
The model of the ACER (Actor-Critic with Experience Replay) algorithm, proposed in Sample Efficient Actor-Critic with Experience Replay (https://arxiv.org/abs/1611.01224).
Interfaces:
__init__, forward, compute_actor, compute_critic
__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], actor_head_hidden_size=64, actor_head_layer_num=1, critic_head_hidden_size=64, critic_head_layer_num=1, activation=nn.ReLU(), norm_type=None)
¶
Overview
Init the ACER Model according to arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation's space.
- action_shape (:obj:Union[int, SequenceType]): Action's space.
- actor_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to actor-nn's Head.
- actor_head_layer_num (:obj:int):
The num of layers used in the network to compute Q value output for actor's nn.
- critic_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to critic-nn's Head.
- critic_head_layer_num (:obj:int):
The num of layers used in the network to compute Q value output for critic's nn.
- activation (:obj:Optional[nn.Module]):
The type of activation function to use in the MLP after each layer_fn,
if None then default set to nn.ReLU()
- norm_type (:obj:Optional[str]):
The type of normalization to use, see ding.torch_utils.fc_block for more details.
forward(inputs, mode)
¶
Overview
Use observation to predict output, forwarding through ACER's actor or critic MLPs according to mode.
Arguments:
- mode (:obj:str): Name of the forward mode.
Returns:
- outputs (:obj:Dict): Outputs of network forward.
Shapes (Actor):
- obs (:obj:torch.Tensor): :math:(B, N1), where B is batch size and N1 is obs_shape
- logit (:obj:torch.FloatTensor): :math:(B, N2), where B is batch size and N2 is action_shape
Shapes (Critic):
- inputs (:obj:torch.Tensor): :math:(B, N1), B is batch size and N1 corresponds to obs_shape
- q_value (:obj:torch.FloatTensor): :math:(B, N2), where B is batch size and N2 is action_shape
compute_actor(inputs)
¶
Overview
Use the encoded embedding tensor to predict the action logit with the compute_actor mode.
Arguments:
- inputs (:obj:torch.Tensor): The encoded embedding tensor, determined by the given hidden_size, i.e. (B, N=hidden_size), where hidden_size = actor_head_hidden_size.
Returns:
- outputs (:obj:Dict): Outputs of forward pass encoder and head.
ReturnsKeys (either):
- logit (:obj:torch.FloatTensor): :math:(B, N1), where B is batch size and N1 is action_shape
Shapes:
- inputs (:obj:torch.Tensor): :math:(B, N0), B is batch size and N0 corresponds to hidden_size
- logit (:obj:torch.FloatTensor): :math:(B, N1), where B is batch size and N1 is action_shape
Examples:
>>> # Regression mode
>>> model = ACER(64, 64)
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> assert actor_outputs['logit'].shape == torch.Size([4, 64])
compute_critic(inputs)
¶
Overview
Use the encoded embedding tensor to predict the Q value with the compute_critic mode.
Arguments:
- inputs (:obj:torch.Tensor): The encoded observation tensor.
Returns:
- outputs (:obj:Dict): Q-value output.
ReturnsKeys:
- q_value (:obj:torch.Tensor): Q value tensor with same size as batch size.
Shapes:
- obs (:obj:torch.Tensor): :math:(B, N1), where B is batch size and N1 is obs_shape
- q_value (:obj:torch.FloatTensor): :math:(B, N2), where B is batch size and N2 is action_shape.
Examples:
>>> inputs = torch.randn(4, N)
>>> model = ACER(obs_shape=(N, ),action_shape=5)
>>> model(inputs, mode='compute_critic')['q_value']
QTran
¶
Bases: Module
Overview
QTRAN network
Interface: __init__, forward
__init__(agent_num, obs_shape, global_obs_shape, action_shape, hidden_size_list, embedding_size, lstm_type='gru', dueling=False)
¶
Overview
initialize QTRAN network
Arguments:
- agent_num (:obj:int): the number of agent
- obs_shape (:obj:int): the dimension of each agent's observation state
- global_obs_shape (:obj:int): the dimension of global observation state
- action_shape (:obj:int): the dimension of action shape
- hidden_size_list (:obj:list): the list of hidden size
- embedding_size (:obj:int): the dimension of embedding
- lstm_type (:obj:str): use lstm or gru, default to gru
- dueling (:obj:bool): use dueling head or not, default to False.
forward(data, single_step=True)
¶
Overview
forward computation graph of qtran network
Arguments:
- data (:obj:dict): input data dict with keys ['obs', 'prev_state', 'action']
- agent_state (:obj:torch.Tensor): each agent local state(obs)
- global_state (:obj:torch.Tensor): global state(obs)
- prev_state (:obj:list): previous rnn state
- action (:obj:torch.Tensor or None): if action is None, use argmax q_value index as action to calculate agent_q_act
- single_step (:obj:bool): whether single_step forward, if so, add timestep dim before forward and remove it after forward
Return:
- ret (:obj:dict): output data dict with keys ['total_q', 'logit', 'next_state']
- total_q (:obj:torch.Tensor): total q_value, which is the result of mixer network
- agent_q (:obj:torch.Tensor): each agent q_value
- next_state (:obj:list): next rnn state
Shapes:
- agent_state (:obj:torch.Tensor): :math:(T, B, A, N), where T is timestep, B is batch_size, A is agent_num, N is obs_shape
- global_state (:obj:torch.Tensor): :math:(T, B, M), where M is global_obs_shape
- prev_state (:obj:list): :math:(B, A), a list of length B, and each element is a list of length A
- action (:obj:torch.Tensor): :math:(T, B, A)
- total_q (:obj:torch.Tensor): :math:(T, B)
- agent_q (:obj:torch.Tensor): :math:(T, B, A, P), where P is action_shape
- next_state (:obj:list): :math:(B, A), a list of length B, and each element is a list of length A
MAVAC
¶
Bases: Module
Overview
The neural network and computation graph of algorithms related to (state) Value Actor-Critic (VAC) for multi-agent settings, such as MAPPO (https://arxiv.org/abs/2103.01955). This model now supports discrete and continuous action spaces. The MAVAC is composed of four parts: actor_encoder, critic_encoder, actor_head and critic_head. Encoders are used to extract features from various observations. Heads are used to predict the corresponding value or action logit.
Interfaces:
__init__, forward, compute_actor, compute_critic, compute_actor_critic.
__init__(agent_obs_shape, global_obs_shape, action_shape, agent_num, actor_head_hidden_size=256, actor_head_layer_num=2, critic_head_hidden_size=512, critic_head_layer_num=1, action_space='discrete', activation=nn.ReLU(), norm_type=None, sigma_type='independent', bound_type=None, encoder=None)
¶
Overview
Init the MAVAC Model according to arguments.
Arguments:
- agent_obs_shape (:obj:Union[int, SequenceType]): Observation's space for single agent, such as 8 or [4, 84, 84].
- global_obs_shape (:obj:Union[int, SequenceType]): Global observation's space, such as 8 or [4, 84, 84].
- action_shape (:obj:Union[int, SequenceType]): Action space shape for single agent, such as 6 or [2, 3, 3].
- agent_num (:obj:int): This argument is temporarily reserved; it may be required by future changes to the model.
- actor_head_hidden_size (:obj:Optional[int]): The hidden_size of actor_head network, defaults to 256; it must match the last element of agent_obs_shape.
- actor_head_layer_num (:obj:int): The num of layers used in the actor_head network to compute action.
- critic_head_hidden_size (:obj:Optional[int]): The hidden_size of critic_head network, defaults to 512; it must match the last element of global_obs_shape.
- critic_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for critic's nn.
- action_space (:obj:str): The type of action space, chosen from ['discrete', 'continuous']; the corresponding head, DiscreteHead or ReparameterizationHead, will be instantiated.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in MLP after each layer_fn; if None, defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization in networks, see ding.torch_utils.fc_block for more details. You can choose one of ['BN', 'IN', 'SyncBN', 'LN'].
- sigma_type (:obj:Optional[str]): The type of sigma in continuous action space, see ding.torch_utils.network.dreamer.ReparameterizationHead for more details, in MAPPO, it defaults to independent, which means state-independent sigma parameters.
- bound_type (:obj:Optional[str]): The type of action bound methods in continuous action space, defaults to None, which means no bound.
- encoder (:obj:Optional[Tuple[torch.nn.Module, torch.nn.Module]]): The encoder module list, defaults to None, you can define your own actor and critic encoder module and pass it into MAVAC to deal with different observation space.
forward(inputs, mode)
¶
Overview
MAVAC forward computation graph, input observation tensor to predict state value or action logit. mode includes compute_actor, compute_critic, compute_actor_critic.
Different mode will forward with different network modules to get different outputs and save computation.
Arguments:
- inputs (:obj:Dict): The input dict including observation and related info, whose key-values vary from different mode.
- mode (:obj:str): The forward mode, all the modes are defined in the beginning of this class.
Returns:
- outputs (:obj:Dict): The output dict of MAVAC's forward computation graph, whose key-values vary from different mode.
Examples (Actor):
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14)
>>> inputs = {
        'agent_state': torch.randn(10, 8, 64),
        'global_state': torch.randn(10, 8, 128),
        'action_mask': torch.randint(0, 2, size=(10, 8, 14))
    }
>>> actor_outputs = model(inputs, 'compute_actor')
>>> assert actor_outputs['logit'].shape == torch.Size([10, 8, 14])
Examples (Critic):
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14)
>>> inputs = {
        'agent_state': torch.randn(10, 8, 64),
        'global_state': torch.randn(10, 8, 128),
        'action_mask': torch.randint(0, 2, size=(10, 8, 14))
    }
>>> critic_outputs = model(inputs, 'compute_critic')
>>> assert critic_outputs['value'].shape == torch.Size([10, 8])
Examples (Actor-Critic):
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14)
>>> inputs = {
        'agent_state': torch.randn(10, 8, 64),
        'global_state': torch.randn(10, 8, 128),
        'action_mask': torch.randint(0, 2, size=(10, 8, 14))
    }
>>> outputs = model(inputs, 'compute_actor_critic')
>>> assert outputs['logit'].shape == torch.Size([10, 8, 14])
>>> assert outputs['value'].shape == torch.Size([10, 8])
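The mode argument simply selects which bound method runs the forward pass. This dispatch pattern can be sketched with a toy class (a simplified illustration, not the actual MAVAC implementation; the stub methods and dummy inputs are hypothetical):

```python
# Sketch: mode-based forward dispatch, as used by MAVAC.forward.
# The class attribute lists the legal modes; forward looks up the
# matching method by name and calls it, saving computation by only
# running the requested part of the network.
class ToyVAC:
    mode = ['compute_actor', 'compute_critic', 'compute_actor_critic']

    def forward(self, inputs, mode):
        assert mode in self.mode, "unknown forward mode: {}".format(mode)
        return getattr(self, mode)(inputs)

    def compute_actor(self, inputs):
        return {'logit': inputs['agent_state']}       # stub actor output

    def compute_critic(self, inputs):
        return {'value': inputs['global_state']}      # stub critic output

    def compute_actor_critic(self, inputs):
        return {**self.compute_actor(inputs), **self.compute_critic(inputs)}

model = ToyVAC()
out = model.forward({'agent_state': 1, 'global_state': 2}, 'compute_actor_critic')
print(out)  # {'logit': 1, 'value': 2}
```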
compute_actor(x)
¶
Overview
MAVAC forward computation graph for actor part, predicting action logit with agent observation tensor in x.
Arguments:
- x (:obj:Dict): Input data dict with keys ['agent_state', 'action_mask'(optional)].
- agent_state (:obj:torch.Tensor): Each agent's local state (obs).
- action_mask (optional) (:obj:torch.Tensor): When action_space is discrete, action_mask needs to be provided to mask illegal actions.
Returns:
- outputs (:obj:Dict): The output dict of the forward computation graph for actor, including logit.
ReturnsKeys:
- logit (:obj:torch.Tensor): The predicted action logit tensor. For discrete action space, it is a real-valued tensor with the same dimension as the number of possible actions; for continuous action space, it is the mu and sigma of the Gaussian distribution, each with the same dimension as the number of continuous actions.
Shapes:
- logit (:obj:torch.FloatTensor): :math:(B, M, N), where B is batch size, M is agent_num and N is action_shape.
Examples:
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14)
>>> inputs = {
'agent_state': torch.randn(10, 8, 64),
'global_state': torch.randn(10, 8, 128),
'action_mask': torch.randint(0, 2, size=(10, 8, 14))
}
>>> actor_outputs = model(inputs,'compute_actor')
>>> assert actor_outputs['logit'].shape == torch.Size([10, 8, 14])
compute_critic(x)
¶
Overview
MAVAC forward computation graph for critic part. Predict state value with global observation tensor in x.
Arguments:
- x (:obj:Dict): Input data dict with keys ['global_state'].
- global_state: (:obj:torch.Tensor): Global state(obs).
Returns:
- outputs (:obj:Dict): The output dict of MAVAC's forward computation graph for critic, including value.
ReturnsKeys:
- value (:obj:torch.Tensor): The predicted state value tensor.
Shapes:
- value (:obj:torch.FloatTensor): :math:(B, M), where B is batch size and M is agent_num.
Examples:
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14)
>>> inputs = {
'agent_state': torch.randn(10, 8, 64),
'global_state': torch.randn(10, 8, 128),
'action_mask': torch.randint(0, 2, size=(10, 8, 14))
}
>>> critic_outputs = model(inputs,'compute_critic')
>>> assert critic_outputs['value'].shape == torch.Size([10, 8])
compute_actor_critic(x)
¶
Overview
MAVAC forward computation graph for both actor and critic part, input observation to predict action logit and state value.
Arguments:
- x (:obj:Dict): The input dict contains agent_state, global_state and other related info.
Returns:
- outputs (:obj:Dict): The output dict of MAVAC's forward computation graph for both actor and critic, including logit and value.
ReturnsKeys:
- logit (:obj:torch.Tensor): The predicted action logit tensor.
- value (:obj:torch.Tensor): The predicted state value tensor.
Shapes:
- logit (:obj:torch.FloatTensor): :math:(B, M, N), where B is batch size, M is agent_num and N is action_shape.
- value (:obj:torch.FloatTensor): :math:(B, M), where B is batch size and M is agent_num.
Examples:
>>> model = MAVAC(agent_obs_shape=64, global_obs_shape=128, action_shape=14)
>>> inputs = {
'agent_state': torch.randn(10, 8, 64),
'global_state': torch.randn(10, 8, 128),
'action_mask': torch.randint(0, 2, size=(10, 8, 14))
}
>>> outputs = model(inputs,'compute_actor_critic')
>>> assert outputs['value'].shape == torch.Size([10, 8])
>>> assert outputs['logit'].shape == torch.Size([10, 8, 14])
NGU
¶
Bases: Module
Overview
The recurrent Q model for the NGU(https://arxiv.org/pdf/2002.06038.pdf) policy, modified from the class DRQN in q_learning.py. As the original paper describes, the implementation 'adapts the R2D2 agent that uses the dueling network architecture with an LSTM layer after a convolutional neural network'. The NGU network includes encoder, LSTM core (rnn) and head.
Interface:
__init__, forward.
__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], collector_env_num=1, dueling=True, head_hidden_size=None, head_layer_num=1, lstm_type='normal', activation=nn.ReLU(), norm_type=None)
¶
Overview
Init the DRQN Model for NGU according to arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation's space, such as 8 or [4, 84, 84].
- action_shape (:obj:Union[int, SequenceType]): Action's space, such as 6 or [2, 3, 3].
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder.
- collector_env_num (:obj:Optional[int]): The number of environments used to collect data simultaneously.
- dueling (:obj:bool): Whether choose DuelingHead (True) or DiscreteHead (False), default to True.
- head_hidden_size (:obj:Optional[int]): The hidden_size to pass to Head, should match the last element of encoder_hidden_size_list.
- head_layer_num (:obj:int): The number of layers in head network.
- lstm_type (:obj:Optional[str]): Version of rnn cell, now support ['normal', 'pytorch', 'hpc', 'gru'], default is 'normal'.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in MLP after each layer_fn; if None, defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to use, see ding.torch_utils.fc_block for more details.
forward(inputs, inference=False, saved_state_timesteps=None)
¶
Overview
Forward computation graph of the NGU R2D2 network. Input observation, prev_action and prev_reward_extrinsic to predict NGU Q output and the next rnn state.
Arguments:
- inputs (:obj:Dict):
- obs (:obj:torch.Tensor): Encoded observation.
- prev_state (:obj:list): Previous state's tensor of size (B, N).
- inference (:obj:bool): If inference is True, unroll only one timestep transition; otherwise, unroll the whole sequence of transitions.
- saved_state_timesteps (:obj:Optional[list]): When inference is False, while unrolling the sequence of transitions, save the rnn hidden states at the timesteps listed in saved_state_timesteps.
Returns:
- outputs (:obj:Dict):
Run MLP with DRQN setups and return the result prediction dictionary.
ReturnsKeys:
- logit (:obj:torch.Tensor): Logit tensor with same size as input obs.
- next_state (:obj:list): Next state's tensor of size (B, N).
Shapes:
- obs (:obj:torch.Tensor): :math:(B, N=obs_space), where B is batch size.
- prev_state (:obj:torch.FloatTensor list): :math:[(B, N)].
- logit (:obj:torch.FloatTensor): :math:(B, N).
- next_state (:obj:torch.FloatTensor list): :math:[(B, N)].
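The saved_state_timesteps mechanism can be sketched with a toy recurrent loop: unroll the whole sequence, and record the hidden state whenever the current timestep appears in the list. The state update below is a hypothetical stand-in (the real network runs an LSTM cell):

```python
# Sketch: unroll a sequence of transitions and save hidden states at
# the timesteps listed in `saved_state_timesteps`, as NGU does when
# inference=False.
def unroll(obs_seq, init_state, saved_state_timesteps):
    state = init_state
    outputs, saved_states = [], []
    for t, obs in enumerate(obs_seq):
        state = state + obs          # toy rnn cell; real code runs an LSTM
        outputs.append(state)
        if t in saved_state_timesteps:
            saved_states.append(state)   # snapshot hidden state at step t
    return outputs, state, saved_states

outputs, last, saved = unroll([1, 2, 3, 4], 0, saved_state_timesteps=[1, 3])
print(outputs)  # [1, 3, 6, 10]
print(saved)    # [3, 10]
```

Saving intermediate states this way lets the training loop re-initialize burn-in or bootstrap targets from exact mid-sequence states without re-running the rnn.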
QACDIST
¶
Bases: Module
Overview
The QAC model with distributional Q-value.
Interfaces:
__init__, forward, compute_actor, compute_critic
__init__(obs_shape, action_shape, action_space='regression', critic_head_type='categorical', actor_head_hidden_size=64, actor_head_layer_num=1, critic_head_hidden_size=64, critic_head_layer_num=1, activation=nn.ReLU(), norm_type=None, v_min=-10, v_max=10, n_atom=51)
¶
Overview
Init the QAC Distributional Model according to arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation's space.
- action_shape (:obj:Union[int, SequenceType]): Action's space.
- action_space (:obj:str): Whether to use regression or reparameterization.
- critic_head_type (:obj:str): Only categorical is supported.
- actor_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to actor-nn's Head.
- actor_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for actor's nn.
- critic_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to critic-nn's Head.
- critic_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for critic's nn.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in MLP after each layer_fn; if None, defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to use, see ding.torch_utils.fc_block for more details.
- v_min (:obj:int): Value of the smallest atom.
- v_max (:obj:int): Value of the largest atom.
- n_atom (:obj:int): Number of atoms in the support.
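The atoms define a fixed support spanning [v_min, v_max]; the scalar Q-value is the expectation of the predicted categorical distribution over this support, as in C51-style distributional heads. A minimal plain-Python sketch (hypothetical helper names, no torch):

```python
# Sketch: build the categorical support z_0..z_{n_atom-1} and reduce a
# probability distribution over atoms to a scalar Q-value.
def make_support(v_min, v_max, n_atom):
    step = (v_max - v_min) / (n_atom - 1)
    return [v_min + i * step for i in range(n_atom)]

def expected_q(dist, support):
    # dist: probabilities over atoms, summing to 1
    return sum(p * z for p, z in zip(dist, support))

support = make_support(-10, 10, 51)          # the defaults above
dist = [1.0 / 51] * 51                       # uniform toy distribution
print(len(support), support[0])              # 51 -10.0
print(abs(round(expected_q(dist, support), 6)))  # 0.0 (symmetric support)
```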
forward(inputs, mode)
¶
Overview
Use observation and action tensor to predict output with QACDIST's forward computation graph.
Arguments:
Forward with 'compute_actor':
- inputs (:obj:torch.Tensor): The encoded embedding tensor, determined with given hidden_size, i.e. (B, N=hidden_size). Whether actor_head_hidden_size or critic_head_hidden_size is used depends on mode.
Forward with 'compute_critic', inputs (:obj:Dict) necessary keys:
- obs, action encoded tensors.
- mode (:obj:str): Name of the forward mode.
Returns:
- outputs (:obj:Dict): Outputs of network forward.
Forward with 'compute_actor', necessary keys (either):
- action (:obj:torch.Tensor): Action tensor with same size as input x.
- logit (:obj:torch.Tensor): Logit tensor encoding mu and sigma, both with same size as input x.
Forward with 'compute_critic', necessary keys:
- q_value (:obj:torch.Tensor): Q value tensor with same size as batch size.
- distribution (:obj:torch.Tensor): Q value distribution tensor.
Actor Shapes:
- inputs (:obj:torch.Tensor): :math:(B, N0), B is batch size and N0 corresponds to hidden_size
- action (:obj:torch.Tensor): :math:(B, N0)
- q_value (:obj:torch.FloatTensor): :math:(B, ), where B is batch size.
Critic Shapes:
- obs (:obj:torch.Tensor): :math:(B, N1), where B is batch size and N1 is obs_shape.
- action (:obj:torch.Tensor): :math:(B, N2), where B is batch size and N2 is action_shape.
- q_value (:obj:torch.FloatTensor): :math:(B, N2), where B is batch size and N2 is action_shape.
- distribution (:obj:torch.FloatTensor): :math:(B, 1, N3), where B is batch size and N3 is num_atom.
Actor Examples:
>>> # Regression mode
>>> model = QACDIST(64, 64, 'regression')
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> assert actor_outputs['action'].shape == torch.Size([4, 64])
>>> # Reparameterization mode
>>> model = QACDIST(64, 64, 'reparameterization')
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> actor_outputs['logit'][0].shape  # mu
>>> torch.Size([4, 64])
>>> actor_outputs['logit'][1].shape  # sigma
>>> torch.Size([4, 64])
Critic Examples:
>>> # Categorical mode
>>> inputs = {'obs': torch.randn(4, N), 'action': torch.randn(4, 1)}
>>> model = QACDIST(obs_shape=(N, ), action_shape=1, action_space='regression',
...     critic_head_type='categorical', n_atom=51)
>>> q_value = model(inputs, mode='compute_critic')  # q value
>>> assert q_value['q_value'].shape == torch.Size([4, 1])
>>> assert q_value['distribution'].shape == torch.Size([4, 1, 51])
compute_actor(inputs)
¶
Overview
Use encoded embedding tensor to predict output in 'compute_actor' mode.
Arguments:
- inputs (:obj:torch.Tensor): The encoded embedding tensor, determined with given hidden_size, i.e. (B, N=hidden_size), where hidden_size = actor_head_hidden_size.
Returns:
- outputs (:obj:Dict): Outputs of forward pass encoder and head.
ReturnsKeys (either):
- action (:obj:torch.Tensor): Continuous action tensor with same size as action_shape.
- logit (:obj:torch.Tensor): Logit tensor encoding mu and sigma, both with same size as input x.
Shapes:
- inputs (:obj:torch.Tensor): :math:(B, N0), B is batch size and N0 corresponds to hidden_size
- action (:obj:torch.Tensor): :math:(B, N0)
- logit (:obj:list): 2 elements, mu and sigma, each is the shape of :math:(B, N0).
- q_value (:obj:torch.FloatTensor): :math:(B, ), B is batch size.
Examples:
>>> # Regression mode
>>> model = QACDIST(64, 64, 'regression')
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs,'compute_actor')
>>> assert actor_outputs['action'].shape == torch.Size([4, 64])
>>> # Reparameterization Mode
>>> model = QACDIST(64, 64, 'reparameterization')
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs,'compute_actor')
>>> actor_outputs['logit'][0].shape # mu
>>> torch.Size([4, 64])
>>> actor_outputs['logit'][1].shape # sigma
>>> torch.Size([4, 64])
compute_critic(inputs)
¶
Overview
Use encoded observation and action tensors to predict Q value output in 'compute_critic' mode.
Arguments:
- obs, action encoded tensors.
Returns:
- outputs (:obj:Dict): Q-value output and distribution.
ReturnKeys:
- q_value (:obj:torch.Tensor): Q value tensor with same size as batch size.
- distribution (:obj:torch.Tensor): Q value distribution tensor.
Shapes:
- obs (:obj:torch.Tensor): :math:(B, N1), where B is batch size and N1 is obs_shape
- action (:obj:torch.Tensor): :math:(B, N2), where B is batch size and N2 is action_shape.
- q_value (:obj:torch.FloatTensor): :math:(B, N2), where B is batch size and N2 is action_shape
- distribution (:obj:torch.FloatTensor): :math:(B, 1, N3), where B is batch size and N3 is num_atom
Examples:
>>> # Categorical mode
>>> inputs = {'obs': torch.randn(4,N), 'action': torch.randn(4,1)}
>>> model = QACDIST(obs_shape=(N, ), action_shape=1, action_space='regression',
...     critic_head_type='categorical', n_atom=51)
>>> q_value = model(inputs, mode='compute_critic') # q value
>>> assert q_value['q_value'].shape == torch.Size([4, 1])
>>> assert q_value['distribution'].shape == torch.Size([4, 1, 51])
DiscreteMAQAC
¶
Bases: Module
Overview
The neural network and computation graph of algorithms related to the discrete action Multi-Agent Q-value Actor-CritiC (MAQAC) model. The model is composed of actor and critic, both MLP networks. The actor network predicts the action probability distribution, and the critic network predicts the Q value of the state-action pair.
Interfaces:
__init__, forward, compute_actor, compute_critic
__init__(agent_obs_shape, global_obs_shape, action_shape, twin_critic=False, actor_head_hidden_size=64, actor_head_layer_num=1, critic_head_hidden_size=64, critic_head_layer_num=1, activation=nn.ReLU(), norm_type=None)
¶
Overview
Initialize the DiscreteMAQAC Model according to arguments.
Arguments:
- agent_obs_shape (:obj:Union[int, SequenceType]): Agent's observation's space.
- global_obs_shape (:obj:Union[int, SequenceType]): Global observation's space.
- action_shape (:obj:Union[int, SequenceType]): Action's space.
- twin_critic (:obj:bool): Whether include twin critic.
- actor_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to actor-nn's Head.
- actor_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for actor's nn.
- critic_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to critic-nn's Head.
- critic_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for critic's nn.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in MLP after each layer_fn; if None, defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to use, see ding.torch_utils.fc_block for more details.
forward(inputs, mode)
¶
Overview
Use observation tensor to predict output, with compute_actor or compute_critic mode.
Arguments:
- inputs (:obj:Dict[str, torch.Tensor]): The input dict tensor data, has keys:
- obs (:obj:Dict[str, torch.Tensor]): The input dict tensor data, has keys:
- agent_state (:obj:torch.Tensor): The agent's observation tensor data, with shape :math:(B, A, N0), where B is batch size and A is agent num. N0 corresponds to agent_obs_shape.
- global_state (:obj:torch.Tensor): The global observation tensor data, with shape :math:(B, A, N1), where B is batch size and A is agent num. N1 corresponds to global_obs_shape.
- action_mask (:obj:torch.Tensor): The action mask tensor data, with shape :math:(B, A, N2), where B is batch size and A is agent num. N2 corresponds to action_shape.
- mode (:obj:`str`): The forward mode, all the modes are defined in the beginning of this class.
Returns:
- output (:obj:Dict[str, torch.Tensor]): The output dict of DiscreteMAQAC forward computation graph, whose key-values vary in different forward modes.
Examples:
>>> B = 32
>>> agent_obs_shape = 216
>>> global_obs_shape = 264
>>> agent_num = 8
>>> action_shape = 14
>>> data = {
>>> 'obs': {
>>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape),
>>> 'global_state': torch.randn(B, agent_num, global_obs_shape),
>>> 'action_mask': torch.randint(0, 2, size=(B, agent_num, action_shape))
>>> }
>>> }
>>> model = DiscreteMAQAC(agent_obs_shape, global_obs_shape, action_shape, twin_critic=True)
>>> logit = model(data, mode='compute_actor')['logit']
>>> value = model(data, mode='compute_critic')['q_value']
compute_actor(inputs)
¶
Overview
Use observation tensor to predict action logits.
Arguments:
- inputs (:obj:Dict[str, torch.Tensor]): The input dict tensor data, has keys:
- obs (:obj:Dict[str, torch.Tensor]): The input dict tensor data, has keys:
- agent_state (:obj:torch.Tensor): The agent's observation tensor data, with shape :math:(B, A, N0), where B is batch size and A is agent num. N0 corresponds to agent_obs_shape.
- global_state (:obj:torch.Tensor): The global observation tensor data, with shape :math:(B, A, N1), where B is batch size and A is agent num. N1 corresponds to global_obs_shape.
- action_mask (:obj:torch.Tensor): The action mask tensor data, with shape :math:(B, A, N2), where B is batch size and A is agent num. N2 corresponds to action_shape.
Returns:
- output (:obj:Dict[str, torch.Tensor]): The output dict of DiscreteMAQAC forward computation graph, whose key-values vary in different forward modes.
- logit (:obj:torch.Tensor): Action's output logit (real value range), whose shape is :math:(B, A, N2), where N2 corresponds to action_shape.
- action_mask (:obj:torch.Tensor): Action mask tensor with same size as action_shape.
Examples:
>>> B = 32
>>> agent_obs_shape = 216
>>> global_obs_shape = 264
>>> agent_num = 8
>>> action_shape = 14
>>> data = {
>>> 'obs': {
>>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape),
>>> 'global_state': torch.randn(B, agent_num, global_obs_shape),
>>> 'action_mask': torch.randint(0, 2, size=(B, agent_num, action_shape))
>>> }
>>> }
>>> model = DiscreteMAQAC(agent_obs_shape, global_obs_shape, action_shape, twin_critic=True)
>>> logit = model.compute_actor(data)['logit']
compute_critic(inputs)
¶
Overview
Use observation tensor to predict Q value.
Arguments:
- inputs (:obj:Dict[str, torch.Tensor]): The input dict tensor data, has keys:
- obs (:obj:Dict[str, torch.Tensor]): The input dict tensor data, has keys:
- agent_state (:obj:torch.Tensor): The agent's observation tensor data, with shape :math:(B, A, N0), where B is batch size and A is agent num. N0 corresponds to agent_obs_shape.
- global_state (:obj:torch.Tensor): The global observation tensor data, with shape :math:(B, A, N1), where B is batch size and A is agent num. N1 corresponds to global_obs_shape.
- action_mask (:obj:torch.Tensor): The action mask tensor data, with shape :math:(B, A, N2), where B is batch size and A is agent num. N2 corresponds to action_shape.
Returns:
- output (:obj:Dict[str, torch.Tensor]): The output dict of DiscreteMAQAC forward computation graph, whose key-values vary in different values of twin_critic.
- q_value (:obj:list): If twin_critic=True, q_value is a list of 2 elements, each with shape :math:(B, A, N2), where B is batch size, A is agent num and N2 corresponds to action_shape. Otherwise, q_value is a torch.Tensor.
Examples:
>>> B = 32
>>> agent_obs_shape = 216
>>> global_obs_shape = 264
>>> agent_num = 8
>>> action_shape = 14
>>> data = {
>>> 'obs': {
>>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape),
>>> 'global_state': torch.randn(B, agent_num, global_obs_shape),
>>> 'action_mask': torch.randint(0, 2, size=(B, agent_num, action_shape))
>>> }
>>> }
>>> model = DiscreteMAQAC(agent_obs_shape, global_obs_shape, action_shape, twin_critic=True)
>>> value = model.compute_critic(data)['q_value']
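With twin_critic=True the head returns two independent Q estimates; downstream algorithms typically take their elementwise minimum (the clipped double-Q trick popularized by TD3/SAC) to reduce overestimation. A minimal plain-Python sketch of that reduction (hypothetical helper name; real code uses torch.min over the two tensors):

```python
# Sketch: combine twin-critic outputs with an elementwise minimum,
# the usual clipped double-Q reduction over q_value = [q1, q2].
def clipped_double_q(q_value):
    q1, q2 = q_value
    return [min(a, b) for a, b in zip(q1, q2)]

q_value = [[1.0, 2.5, 0.3],   # critic 1's estimates
           [1.2, 2.0, 0.4]]   # critic 2's estimates
print(clipped_double_q(q_value))  # [1.0, 2.0, 0.3]
```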
ContinuousMAQAC
¶
Bases: Module
Overview
The neural network and computation graph of algorithms related to the continuous action Multi-Agent Q-value Actor-CritiC (MAQAC) model. The model is composed of actor and critic, both MLP networks. The actor network predicts the action probability distribution, and the critic network predicts the Q value of the state-action pair.
Interfaces:
__init__, forward, compute_actor, compute_critic
__init__(agent_obs_shape, global_obs_shape, action_shape, action_space, twin_critic=False, actor_head_hidden_size=64, actor_head_layer_num=1, critic_head_hidden_size=64, critic_head_layer_num=1, activation=nn.ReLU(), norm_type=None)
¶
Overview
Initialize the ContinuousMAQAC Model according to arguments.
Arguments:
- agent_obs_shape (:obj:Union[int, SequenceType]): Agent's observation's space.
- global_obs_shape (:obj:Union[int, SequenceType]): Global observation's space.
- action_shape (:obj:Union[int, SequenceType, EasyDict]): Action's space, such as 4, (3, ).
- action_space (:obj:str): Whether to use regression or reparameterization.
- twin_critic (:obj:bool): Whether include twin critic.
- actor_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to actor-nn's Head.
- actor_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for actor's nn.
- critic_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to critic-nn's Head.
- critic_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for critic's nn.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in MLP after each layer_fn; if None, defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to use, see ding.torch_utils.fc_block for more details.
forward(inputs, mode)
¶
Overview
Use observation and action tensor to predict output in compute_actor or compute_critic mode.
Arguments:
- inputs (:obj:Dict[str, torch.Tensor]): The input dict tensor data, has keys:
- obs (:obj:Dict[str, torch.Tensor]): The input dict tensor data, has keys:
- agent_state (:obj:torch.Tensor): The agent's observation tensor data, with shape :math:(B, A, N0), where B is batch size and A is agent num. N0 corresponds to agent_obs_shape.
- global_state (:obj:torch.Tensor): The global observation tensor data, with shape :math:(B, A, N1), where B is batch size and A is agent num. N1 corresponds to global_obs_shape.
- action_mask (:obj:torch.Tensor): The action mask tensor data, with shape :math:(B, A, N2), where B is batch size and A is agent num. N2 corresponds to action_shape.
- action (:obj:torch.Tensor): The action tensor data, with shape :math:(B, A, N3), where B is batch size and A is agent num. N3 corresponds to action_shape.
- mode (:obj:str): Name of the forward mode.
Returns:
- outputs (:obj:Dict): Outputs of network forward, whose key-values will be different for different mode, twin_critic, action_space.
Examples:
>>> B = 32
>>> agent_obs_shape = 216
>>> global_obs_shape = 264
>>> agent_num = 8
>>> action_shape = 14
>>> act_space = 'reparameterization' # regression
>>> data = {
>>> 'obs': {
>>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape),
>>> 'global_state': torch.randn(B, agent_num, global_obs_shape),
>>> 'action_mask': torch.randint(0, 2, size=(B, agent_num, action_shape))
>>> },
>>> 'action': torch.randn(B, agent_num, squeeze(action_shape))
>>> }
>>> model = ContinuousMAQAC(agent_obs_shape, global_obs_shape, action_shape, act_space, twin_critic=False)
>>> if act_space == 'regression':
>>> action = model(data['obs'], mode='compute_actor')['action']
>>> elif act_space == 'reparameterization':
>>> (mu, sigma) = model(data['obs'], mode='compute_actor')['logit']
>>> value = model(data, mode='compute_critic')['q_value']
compute_actor(inputs)
¶
Overview
Use observation tensor to predict action logits.
Arguments:
- inputs (:obj:Dict[str, torch.Tensor]): The input dict tensor data, has keys:
- agent_state (:obj:torch.Tensor): The agent's observation tensor data, with shape :math:(B, A, N0), where B is batch size and A is agent num. N0 corresponds to agent_obs_shape.
Returns:
- outputs (:obj:Dict): Outputs of network forward, whose keys depend on action_space.
ReturnKeys (action_space == 'regression'):
- action (:obj:torch.Tensor): Action tensor with same size as action_shape.
ReturnKeys (action_space == 'reparameterization'):
- logit (:obj:list): 2 elements, each is the shape of :math:(B, A, N3), where B is batch size and A is agent num. N3 corresponds to action_shape.
Examples:
>>> B = 32
>>> agent_obs_shape = 216
>>> global_obs_shape = 264
>>> agent_num = 8
>>> action_shape = 14
>>> act_space = 'reparameterization' # 'regression'
>>> data = {
>>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape),
>>> }
>>> model = ContinuousMAQAC(agent_obs_shape, global_obs_shape, action_shape, act_space, twin_critic=False)
>>> if act_space == 'regression':
>>> action = model.compute_actor(data)['action']
>>> elif act_space == 'reparameterization':
>>> (mu, sigma) = model.compute_actor(data)['logit']
compute_critic(inputs)
¶
Overview
Use observation tensor and action tensor to predict Q value.
Arguments:
- inputs (:obj:Dict[str, torch.Tensor]): The input dict tensor data, has keys:
- obs (:obj:Dict[str, torch.Tensor]): The input dict tensor data, has keys:
- agent_state (:obj:torch.Tensor): The agent's observation tensor data, with shape :math:(B, A, N0), where B is batch size and A is agent num. N0 corresponds to agent_obs_shape.
- global_state (:obj:torch.Tensor): The global observation tensor data, with shape :math:(B, A, N1), where B is batch size and A is agent num. N1 corresponds to global_obs_shape.
- action_mask (:obj:torch.Tensor): The action mask tensor data, with shape :math:(B, A, N2), where B is batch size and A is agent num. N2 corresponds to action_shape.
- ``action`` (:obj:`torch.Tensor`): The action tensor data, with shape :math:`(B, A, N3)`, where B is batch size and A is agent num. N3 corresponds to ``action_shape``.
Returns:
- outputs (:obj:Dict): Outputs of network forward, whose keys depend on twin_critic.
ReturnKeys (twin_critic=True):
- q_value (:obj:list): 2 elements, each is the shape of :math:(B, A), where B is batch size and A is agent num.
ReturnKeys (twin_critic=False):
- q_value (:obj:torch.Tensor): :math:(B, A), where B is batch size and A is agent num.
Examples:
>>> B = 32
>>> agent_obs_shape = 216
>>> global_obs_shape = 264
>>> agent_num = 8
>>> action_shape = 14
>>> act_space = 'reparameterization' # 'regression'
>>> data = {
>>> 'obs': {
>>> 'agent_state': torch.randn(B, agent_num, agent_obs_shape),
>>> 'global_state': torch.randn(B, agent_num, global_obs_shape),
>>> 'action_mask': torch.randint(0, 2, size=(B, agent_num, action_shape))
>>> },
>>> 'action': torch.randn(B, agent_num, squeeze(action_shape))
>>> }
>>> model = ContinuousMAQAC(agent_obs_shape, global_obs_shape, action_shape, act_space, twin_critic=False)
>>> value = model.compute_critic(data)['q_value']
VanillaVAE
¶
Bases: Module
Overview
Implementation of Vanilla variational autoencoder for action reconstruction.
Interfaces:
__init__, encode, decode, decode_with_obs, reparameterize, forward, loss_function.
encode(input)
¶
Overview
Encodes the input by passing through the encoder network and returns the latent codes.
Arguments:
- input (:obj:Dict): Dict containing keywords obs (:obj:torch.Tensor) and action (:obj:torch.Tensor), representing the observation and agent's action respectively.
Returns:
- outputs (:obj:Dict): Dict containing keywords mu (:obj:torch.Tensor), log_var (:obj:torch.Tensor) and obs_encoding (:obj:torch.Tensor) representing latent codes.
Shapes:
- obs (:obj:torch.Tensor): :math:(B, O), where B is batch size and O is observation dim.
- action (:obj:torch.Tensor): :math:(B, A), where B is batch size and A is action dim.
- mu (:obj:torch.Tensor): :math:(B, L), where B is batch size and L is latent size.
- log_var (:obj:torch.Tensor): :math:(B, L), where B is batch size and L is latent size.
- obs_encoding (:obj:torch.Tensor): :math:(B, H), where B is batch size and H is hidden dim.
decode(z, obs_encoding)
¶
Overview
Maps the given latent action and obs_encoding onto the original action space.
Arguments:
- z (:obj:torch.Tensor): the sampled latent action
- obs_encoding (:obj:torch.Tensor): observation encoding
Returns:
- outputs (:obj:Dict): Dict containing the reconstructed action and the predicted observation residual.
ReturnsKeys:
- reconstruction_action (:obj:torch.Tensor): The action reconstructed by the decoder.
- prediction_residual (:obj:torch.Tensor): The predicted observation residual.
Shapes:
- z (:obj:torch.Tensor): :math:(B, L), where B is batch size and L is latent_size
- obs_encoding (:obj:torch.Tensor): :math:(B, H), where B is batch size and H is hidden dim
decode_with_obs(z, obs)
¶
Overview
Maps the given latent action and obs onto the original action space. Using the method self.encode_obs_head(obs) to get the obs_encoding.
Arguments:
- z (:obj:torch.Tensor): the sampled latent action
- obs (:obj:torch.Tensor): observation
Returns:
- outputs (:obj:Dict): Dict containing the reconstructed action and the predicted observation residual.
ReturnsKeys:
- reconstruction_action (:obj:torch.Tensor): The action reconstructed by the VAE.
- prediction_residual (:obj:torch.Tensor): The observation residual predicted by the VAE.
Shapes:
- z (:obj:torch.Tensor): :math:(B, L), where B is batch size and L is latent_size
- obs (:obj:torch.Tensor): :math:(B, O), where B is batch size and O is obs_shape
reparameterize(mu, logvar)
¶
Overview
Reparameterization trick: sample from N(mu, var) using samples drawn from N(0, 1).
Arguments:
- mu (:obj:torch.Tensor): Mean of the latent Gaussian
- logvar (:obj:torch.Tensor): Log variance of the latent Gaussian
Shapes:
- mu (:obj:torch.Tensor): :math:(B, L), where B is batch size and L is latent_size
- logvar (:obj:torch.Tensor): :math:(B, L), where B is batch size and L is latent_size
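The trick can be sketched in plain Python (an illustrative scalar version with hypothetical names; the real module operates on torch.Tensor batches of shape (B, L)):

```python
import math
import random

def reparameterize(mu, log_var):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps, with eps ~ N(0, 1).

    Sampling eps (rather than z directly) keeps mu and log_var inside a
    differentiable expression, which is the point of the trick.
    """
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

# As log_var -> -inf, sigma -> 0 and the sample collapses to the mean.
z = reparameterize([0.5, -1.0], [-100.0, -100.0])
```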
forward(input, **kwargs)
¶
Overview
Encode the input, reparameterize mu and log_var, decode obs_encoding.
Arguments:
- input (:obj:Dict): Dict containing keywords obs (:obj:torch.Tensor) and action (:obj:torch.Tensor), representing the observation and agent's action respectively.
Returns:
- outputs (:obj:Dict): Dict containing keywords recons_action (:obj:torch.Tensor), prediction_residual (:obj:torch.Tensor), input (:obj:torch.Tensor), mu (:obj:torch.Tensor), log_var (:obj:torch.Tensor) and z (:obj:torch.Tensor).
Shapes:
- recons_action (:obj:torch.Tensor): :math:(B, A), where B is batch size and A is action dim.
- prediction_residual (:obj:torch.Tensor): :math:(B, O), where B is batch size and O is observation dim.
- mu (:obj:torch.Tensor): :math:(B, L), where B is batch size and L is latent size.
- log_var (:obj:torch.Tensor): :math:(B, L), where B is batch size and L is latent size.
- z (:obj:torch.Tensor): :math:(B, L), where B is batch size and L is latent_size
loss_function(args, **kwargs)
¶
Overview
Computes the VAE loss function.
Arguments:
- args (:obj:Dict[str, Tensor]): Dict containing keywords recons_action, prediction_residual, original_action, mu, log_var and true_residual.
- kwargs (:obj:Dict): Dict containing keywords kld_weight and predict_weight.
Returns:
- outputs (:obj:Dict[str, Tensor]): Dict containing different loss results, including loss, reconstruction_loss, kld_loss, predict_loss.
Shapes:
- recons_action (:obj:torch.Tensor): :math:(B, A), where B is batch size and A is action dim.
- prediction_residual (:obj:torch.Tensor): :math:(B, O), where B is batch size and O is observation dim.
- original_action (:obj:torch.Tensor): :math:(B, A), where B is batch size and A is action dim.
- mu (:obj:torch.Tensor): :math:(B, L), where B is batch size and L is latent size.
- log_var (:obj:torch.Tensor): :math:(B, L), where B is batch size and L is latent size.
- true_residual (:obj:torch.Tensor): :math:(B, O), where B is batch size and O is observation dim.
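A minimal sketch of the loss terms named above, assuming the standard VAE objective with a closed-form Gaussian KL term (scalar lists stand in for tensors; function and argument names are illustrative, not the module's exact API):

```python
import math

def vae_loss(recons_err_sq, residual_err_sq, mu, log_var,
             kld_weight=0.5, predict_weight=0.01):
    """Reconstruction MSE + weighted KL(N(mu, sigma^2) || N(0, 1))
    + weighted prediction-residual MSE."""
    reconstruction_loss = sum(recons_err_sq) / len(recons_err_sq)
    predict_loss = sum(residual_err_sq) / len(residual_err_sq)
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, 1).
    kld_loss = -0.5 * sum(1 + lv - m * m - math.exp(lv)
                          for m, lv in zip(mu, log_var)) / len(mu)
    loss = reconstruction_loss + kld_weight * kld_loss + predict_weight * predict_loss
    return {'loss': loss, 'reconstruction_loss': reconstruction_loss,
            'kld_loss': kld_loss, 'predict_loss': predict_loss}

losses = vae_loss([0.1], [0.2], [0.0], [0.0])
```

With mu = 0 and log_var = 0 the KL term vanishes, so only the two MSE terms contribute.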
DecisionTransformer
¶
Bases: Module
Overview
The implementation of decision transformer.
Interfaces:
__init__, forward, configure_optimizers
__init__(state_dim, act_dim, n_blocks, h_dim, context_len, n_heads, drop_p, max_timestep=4096, state_encoder=None, continuous=False)
¶
Overview
Initialize the DecisionTransformer Model according to input arguments.
Arguments:
- state_dim (:obj:Union[int, SequenceType]): Dimension of state, such as 128 or (4, 84, 84).
- act_dim (:obj:int): The dimension of actions, such as 6.
- n_blocks (:obj:int): The number of transformer blocks in the decision transformer, such as 3.
- h_dim (:obj:int): The dimension of the hidden layers, such as 128.
- context_len (:obj:int): The max context length of the attention, such as 6.
- n_heads (:obj:int): The number of heads in calculating attention, such as 8.
- drop_p (:obj:float): The drop rate of the drop-out layer, such as 0.1.
- max_timestep (:obj:int): The max length of the total sequence, defaults to be 4096.
- state_encoder (:obj:Optional[nn.Module]): The encoder to pre-process the given input. If it is set to None, the raw state will be pushed into the transformer.
- continuous (:obj:bool): Whether the action space is continuous, defaults to be False.
forward(timesteps, states, actions, returns_to_go, tar=None)
¶
Overview
Forward computation graph of the decision transformer, input a sequence tensor and return a tensor with the same shape.
Arguments:
- timesteps (:obj:torch.Tensor): The timestep for input sequence.
- states (:obj:torch.Tensor): The sequence of states.
- actions (:obj:torch.Tensor): The sequence of actions.
- returns_to_go (:obj:torch.Tensor): The sequence of return-to-go.
- tar (:obj:Optional[int]): Whether to predict action, regardless of index.
Returns:
- output (:obj:Tuple[torch.Tensor, torch.Tensor, torch.Tensor]): Output contains three tensors, they are correspondingly the predicted states, predicted actions and predicted return-to-go.
Examples:
>>> B, T = 4, 6
>>> state_dim = 3
>>> act_dim = 2
>>> DT_model = DecisionTransformer( state_dim=state_dim, act_dim=act_dim, n_blocks=3, h_dim=8, context_len=T, n_heads=2, drop_p=0.1, )
>>> timesteps = torch.randint(0, 100, [B, 3 * T - 1, 1], dtype=torch.long) # B x (3T - 1) x 1
>>> states = torch.randn([B, T, state_dim]) # B x T x state_dim
>>> actions = torch.randint(0, act_dim, [B, T, 1])
>>> action_target = torch.randint(0, act_dim, [B, T, 1])
>>> returns_to_go = torch.tensor([1, 0.8, 0.6, 0.4, 0.2, 0.]).repeat([B, 1]).unsqueeze(-1).float()
>>> traj_mask = torch.ones([B, T], dtype=torch.long) # B x T
>>> actions = actions.squeeze(-1)
>>> state_preds, action_preds, return_preds = DT_model.forward( timesteps=timesteps, states=states, actions=actions, returns_to_go=returns_to_go )
>>> assert state_preds.shape == torch.Size([B, T, state_dim])
>>> assert return_preds.shape == torch.Size([B, T, 1])
>>> assert action_preds.shape == torch.Size([B, T, act_dim])
configure_optimizers(weight_decay, learning_rate, betas=(0.9, 0.95))
¶
Overview
This function returns an optimizer given the input arguments. We are separating out all parameters of the model into two buckets: those that will experience weight decay for regularization and those that won't (biases, and layernorm/embedding weights).
Arguments:
- weight_decay (:obj:float): The weight decay of the optimizer.
- learning_rate (:obj:float): The learning rate of the optimizer.
- betas (:obj:Tuple[float, float]): The betas for Adam optimizer.
Outputs:
- optimizer (:obj:torch.optim.Optimizer): The desired optimizer.
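The two-bucket split described above can be sketched as follows (the keyword list and function name are illustrative, not the module's actual implementation):

```python
def split_decay_params(named_params, no_decay_keywords=('bias', 'ln', 'emb')):
    """Split parameter names into weight-decay / no-decay buckets, mirroring
    the common GPT-style optimizer setup: weights of linear layers decay,
    while biases and layernorm/embedding weights do not."""
    decay, no_decay = [], []
    for name in named_params:
        bucket = no_decay if any(k in name for k in no_decay_keywords) else decay
        bucket.append(name)
    return decay, no_decay

decay, no_decay = split_decay_params(
    ['blocks.0.attn.weight', 'blocks.0.attn.bias', 'ln_f.weight', 'tok_emb.weight'])
```

The two buckets would then be passed to the optimizer as separate parameter groups, one with `weight_decay` set and one with it zeroed.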
ProcedureCloningMCTS
¶
Bases: Module
Overview
The neural network of algorithms related to Procedure cloning (PC).
Interfaces:
__init__, forward.
__init__(obs_shape, action_dim, cnn_hidden_list=[128, 128, 256, 256, 256], cnn_activation=nn.ReLU(), cnn_kernel_size=[3, 3, 3, 3, 3], cnn_stride=[1, 1, 1, 1, 1], cnn_padding=[1, 1, 1, 1, 1], mlp_hidden_list=[256, 256], mlp_activation=nn.ReLU(), att_heads=8, att_hidden=128, n_att=4, n_feedforward=2, feedforward_hidden=256, drop_p=0.5, max_T=17)
¶
Overview
Initialize the MCTS procedure cloning model according to corresponding input arguments.
Arguments:
- obs_shape (:obj:SequenceType): Observation space shape, such as [4, 84, 84].
- action_dim (:obj:int): Action space shape, such as 6.
- cnn_hidden_list (:obj:SequenceType): The cnn channel dims for each block, such as [128, 128, 256, 256, 256].
- cnn_activation (:obj:nn.Module): The activation function for cnn blocks, such as nn.ReLU().
- cnn_kernel_size (:obj:SequenceType): The kernel size for each cnn block, such as [3, 3, 3, 3, 3].
- cnn_stride (:obj:SequenceType): The stride for each cnn block, such as [1, 1, 1, 1, 1].
- cnn_padding (:obj:SequenceType): The padding for each cnn block, such as [1, 1, 1, 1, 1].
- mlp_hidden_list (:obj:SequenceType): The last dim for this must match the last dim of cnn_hidden_list, such as [256, 256].
- mlp_activation (:obj:nn.Module): The activation function for mlp layers, such as nn.ReLU().
- att_heads (:obj:int): The number of attention heads in transformer, such as 8.
- att_hidden (:obj:int): The number of attention dimension in transformer, such as 128.
- n_att (:obj:int): The number of attention blocks in transformer, such as 4.
- n_feedforward (:obj:int): The number of feedforward layers in transformer, such as 2.
- drop_p (:obj:float): The drop out rate of attention, such as 0.5.
- max_T (:obj:int): The sequence length of procedure cloning, such as 17.
forward(states, goals, actions)
¶
Overview
ProcedureCloningMCTS forward computation graph, input states tensor and goals tensor, calculate the predicted states and actions.
Arguments:
- states (:obj:torch.Tensor): The observation of current time.
- goals (:obj:torch.Tensor): The target observation after a period.
- actions (:obj:torch.Tensor): The actions executed during the period.
Returns:
- outputs (:obj:Tuple[torch.Tensor, torch.Tensor]): Predicted states and actions.
Examples:
>>> inputs = { 'states': torch.randn(2, 3, 64, 64), 'goals': torch.randn(2, 3, 64, 64), 'actions': torch.randn(2, 15, 9) }
>>> model = ProcedureCloningMCTS(obs_shape=(3, 64, 64), action_dim=9)
>>> goal_preds, action_preds = model(inputs['states'], inputs['goals'], inputs['actions'])
>>> assert goal_preds.shape == (2, 256)
>>> assert action_preds.shape == (2, 16, 9)
ProcedureCloningBFS
¶
Bases: Module
Overview
The neural network introduced in procedure cloning (PC) to process 3-dim observations. Given an input, this model will perform several 3x3 convolutions and output a feature map with the same height and width of input. The channel number of output will be the action_shape.
Interfaces:
__init__, forward.
__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 256, 256])
¶
Overview
Init the BFSConvolution Encoder according to the provided arguments.
Arguments:
- obs_shape (:obj:SequenceType): Sequence of in_channel, plus one or more input size, such as [4, 84, 84].
- action_shape (:obj:int): Action space shape, such as 6.
- encoder_hidden_size_list (:obj:SequenceType): The cnn channel dims for each block, such as [128, 128, 256, 256].
forward(x)
¶
Overview
The computation graph. Given a 3-dim observation, this function will return a tensor with the same height and width. The channel number of output will be the action_shape.
Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.
Returns:
- outputs (:obj:Dict): The output dict of model's forward computation graph, only contains a single key logit.
Examples:
>>> model = ProcedureCloningBFS([3, 16, 16], 4)
>>> inputs = torch.randn(16, 16, 3).unsqueeze(0)
>>> outputs = model(inputs)
>>> assert outputs['logit'].shape == torch.Size([1, 16, 16, 4])
BCQ
¶
Bases: Module
Overview
Model of BCQ (Batch-Constrained deep Q-learning). Off-Policy Deep Reinforcement Learning without Exploration. https://arxiv.org/abs/1812.02900
Interface:
forward, compute_actor, compute_critic, compute_vae, compute_eval
Property:
mode
__init__(obs_shape, action_shape, actor_head_hidden_size=[400, 300], critic_head_hidden_size=[400, 300], activation=nn.ReLU(), vae_hidden_dims=[750, 750], phi=0.05)
¶
Overview
Initialize neural network, i.e. agent Q network and actor.
Arguments:
- obs_shape (:obj:int): The dimension of the observation state.
- action_shape (:obj:int): The dimension of the action.
- actor_head_hidden_size (:obj:list): The list of hidden sizes of the actor head.
- critic_head_hidden_size (:obj:list): The list of hidden sizes of the critic head.
- activation (:obj:nn.Module): Activation function in network, defaults to nn.ReLU().
- vae_hidden_dims (:obj:list): The list of hidden sizes of the VAE.
- phi (:obj:float): The perturbation hyper-parameter bounding the actor's action correction, defaults to 0.05.
forward(inputs, mode)
¶
Overview
The unique execution (forward) method of BCQ method, and one can indicate different modes to implement different computation graph, including compute_actor and compute_critic in BCQ.
Mode compute_actor:
Arguments:
- inputs (:obj:Dict): Input dict data, including obs and action tensor.
Returns:
- output (:obj:Dict): Output dict data, including action tensor.
Mode compute_critic:
Arguments:
- inputs (:obj:Dict): Input dict data, including obs and action tensor.
Returns:
- output (:obj:Dict): Output dict data, including q_value tensor.
Mode compute_vae:
Arguments:
- inputs (:obj:Dict): Input dict data, including obs and action tensor.
Returns:
- outputs (:obj:Dict): Dict containing keywords recons_action (:obj:torch.Tensor), prediction_residual (:obj:torch.Tensor), input (:obj:torch.Tensor), mu (:obj:torch.Tensor), log_var (:obj:torch.Tensor) and z (:obj:torch.Tensor).
Mode compute_eval:
Arguments:
- inputs (:obj:Dict): Input dict data, including obs and action tensor.
Returns:
- output (:obj:Dict): Output dict data, including action tensor.
Examples:
>>> inputs = {'obs': torch.randn(4, 32), 'action': torch.randn(4, 6)}
>>> model = BCQ(32, 6)
>>> outputs = model(inputs, mode='compute_actor')
>>> outputs = model(inputs, mode='compute_critic')
>>> outputs = model(inputs, mode='compute_vae')
>>> outputs = model(inputs, mode='compute_eval')
.. note::
For specific examples, one can refer to API doc of compute_actor and compute_critic respectively.
compute_critic(inputs)
¶
Overview
Use critic network to compute q value.
Arguments:
- inputs (:obj:Dict): Input dict data, including obs and action tensor.
Returns:
- outputs (:obj:Dict): Dict containing keywords q_value (:obj:torch.Tensor).
Shapes:
- inputs (:obj:Dict): :math:(B, N, D), where B is batch size, N is sample number, D is input dimension.
- outputs (:obj:Dict): :math:(B, N).
Examples:
>>> inputs = {'obs': torch.randn(4, 32), 'action': torch.randn(4, 6)}
>>> model = BCQ(32, 6)
>>> outputs = model.compute_critic(inputs)
compute_actor(inputs)
¶
Overview
Use actor network to compute action.
Arguments:
- inputs (:obj:Dict): Input dict data, including obs and action tensor.
Returns:
- outputs (:obj:Dict): Dict containing keywords action (:obj:torch.Tensor).
Shapes:
- inputs (:obj:Dict): :math:(B, N, D), where B is batch size, N is sample number, D is input dimension.
- outputs (:obj:Dict): :math:(B, N).
Examples:
>>> inputs = {'obs': torch.randn(4, 32), 'action': torch.randn(4, 6)}
>>> model = BCQ(32, 6)
>>> outputs = model.compute_actor(inputs)
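In BCQ the actor does not output actions directly: it perturbs VAE-reconstructed actions by a small bounded amount controlled by phi. A hedged scalar sketch of that clipping logic (the real actor computes the perturbation with a neural network and operates on tensors):

```python
def perturb_action(decoded_action, perturbation, phi=0.05):
    """Add a correction clipped to [-phi, phi] to an action reconstructed
    by the VAE, then clamp the result to the action bounds [-1, 1].
    Illustrative helper, not the library's API."""
    clipped = max(-phi, min(phi, perturbation))
    return max(-1.0, min(1.0, decoded_action + clipped))

a = perturb_action(0.5, 0.2)  # perturbation clipped from 0.2 down to phi
```

Keeping the perturbation small is what constrains the policy to stay near actions the VAE has seen in the batch.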
compute_vae(inputs)
¶
Overview
Use vae network to compute action.
Arguments:
- inputs (:obj:Dict): Input dict data, including obs and action tensor.
Returns:
- outputs (:obj:Dict): Dict containing keywords recons_action (:obj:torch.Tensor), prediction_residual (:obj:torch.Tensor), input (:obj:torch.Tensor), mu (:obj:torch.Tensor), log_var (:obj:torch.Tensor) and z (:obj:torch.Tensor).
Shapes:
- inputs (:obj:Dict): :math:(B, N, D), where B is batch size, N is sample number, D is input dimension.
- outputs (:obj:Dict): :math:(B, N).
Examples:
>>> inputs = {'obs': torch.randn(4, 32), 'action': torch.randn(4, 6)}
>>> model = BCQ(32, 6)
>>> outputs = model.compute_vae(inputs)
compute_eval(inputs)
¶
Overview
Use actor network to compute action.
Arguments:
- inputs (:obj:Dict): Input dict data, including obs and action tensor.
Returns:
- outputs (:obj:Dict): Dict containing keywords action (:obj:torch.Tensor).
Shapes:
- inputs (:obj:Dict): :math:(B, N, D), where B is batch size, N is sample number, D is input dimension.
- outputs (:obj:Dict): :math:(B, N).
Examples:
>>> inputs = {'obs': torch.randn(4, 32), 'action': torch.randn(4, 6)}
>>> model = BCQ(32, 6)
>>> outputs = model.compute_eval(inputs)
EDAC
¶
Bases: Module
Overview
The Q-value Actor-Critic network with the ensemble mechanism, which is used in EDAC.
Interfaces:
__init__, forward, compute_actor, compute_critic
__init__(obs_shape, action_shape, ensemble_num=2, actor_head_hidden_size=64, actor_head_layer_num=1, critic_head_hidden_size=64, critic_head_layer_num=1, activation=nn.ReLU(), norm_type=None, **kwargs)
¶
Overview
Initialize the EDAC Model according to input arguments.
Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation's shape, such as 128, (156, ).
- action_shape (:obj:Union[int, SequenceType, EasyDict]): Action's shape, such as 4, (3, ), EasyDict({'action_type_shape': 3, 'action_args_shape': 4}).
- ensemble_num (:obj:int): Q-net number.
- actor_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to actor head.
- actor_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for actor head.
- critic_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to critic head.
- critic_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for critic head.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in MLP after each FC layer, if None then default set to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to after network layer (FC, Conv), see ding.torch_utils.network for more details.
forward(inputs, mode)
¶
Overview
The unique execution (forward) method of EDAC method, and one can indicate different modes to implement different computation graph, including compute_actor and compute_critic in EDAC.
Mode compute_actor:
Arguments:
- inputs (:obj:torch.Tensor): Observation data, defaults to tensor.
Returns:
- output (:obj:Dict): Output dict data, including different key-values among distinct action_space.
Mode compute_critic:
Arguments:
- inputs (:obj:Dict): Input dict data, including obs and action tensor.
Returns:
- output (:obj:Dict): Output dict data, including q_value tensor.
.. note::
For specific examples, one can refer to API doc of compute_actor and compute_critic respectively.
compute_actor(obs)
¶
Overview
The forward computation graph of compute_actor mode, uses observation tensor to produce actor output,
such as action, logit and so on.
Arguments:
- obs (:obj:torch.Tensor): Observation tensor data, now supports a batch of 1-dim vector data, i.e. (B, obs_shape).
Returns:
- outputs (:obj:Dict[str, Union[torch.Tensor, Dict[str, torch.Tensor]]]): Actor output varying from action_space: reparameterization.
ReturnsKeys (either):
- logit (:obj:Dict[str, torch.Tensor]): Reparameterization logit, usually in SAC.
- mu (:obj:torch.Tensor): Mean of the parameterized Gaussian distribution.
- sigma (:obj:torch.Tensor): Standard deviation of the parameterized Gaussian distribution.
Shapes:
- obs (:obj:torch.Tensor): :math:(B, N0), B is batch size and N0 corresponds to obs_shape.
- action (:obj:torch.Tensor): :math:(B, N1), B is batch size and N1 corresponds to action_shape.
- logit.mu (:obj:torch.Tensor): :math:(B, N1), B is batch size and N1 corresponds to action_shape.
- logit.sigma (:obj:torch.Tensor): :math:(B, N1), B is batch size and N1 corresponds to action_shape.
- logit (:obj:torch.Tensor): :math:(B, N2), B is batch size and N2 corresponds to action_shape.action_type_shape.
- action_args (:obj:torch.Tensor): :math:(B, N3), B is batch size and N3 corresponds to action_shape.action_args_shape.
Examples:
>>> model = EDAC(64, 64,)
>>> obs = torch.randn(4, 64)
>>> actor_outputs = model(obs,'compute_actor')
>>> assert actor_outputs['logit'][0].shape == torch.Size([4, 64]) # mu
>>> actor_outputs['logit'][1].shape == torch.Size([4, 64]) # sigma
compute_critic(inputs)
¶
Overview
The forward computation graph of compute_critic mode, uses observation and action tensor to produce critic
output, such as q_value.
Arguments:
- inputs (:obj:Dict[str, torch.Tensor]): Dict structure of input data, including obs and action tensor.
Returns:
- outputs (:obj:Dict[str, torch.Tensor]): Critic output, such as q_value.
ArgumentsKeys:
- obs: (:obj:torch.Tensor): Observation tensor data, now supports a batch of 1-dim vector data.
- action (:obj:Union[torch.Tensor, Dict]): Continuous action with same size as action_shape.
ReturnKeys:
- q_value (:obj:torch.Tensor): Q value tensor with same size as batch size.
Shapes:
- obs (:obj:torch.Tensor): :math:(B, N1) or :math:(Ensemble_num, B, N1), where B is batch size and N1 is obs_shape.
- action (:obj:torch.Tensor): :math:(B, N2) or :math:(Ensemble_num, B, N2), where B is batch size and N2 is action_shape.
- q_value (:obj:torch.Tensor): :math:(Ensemble_num, B), where B is batch size.
Examples:
>>> inputs = {'obs': torch.randn(4, 8), 'action': torch.randn(4, 1)}
>>> model = EDAC(obs_shape=(8, ),action_shape=1)
>>> model(inputs, mode='compute_critic')['q_value'] # q value
... tensor([0.0773, 0.1639, 0.0917, 0.0370], grad_fn=<...>)
HPT
¶
Bases: Module
Overview
The HPT model for reinforcement learning, which consists of a Policy Stem and a Dueling Head. The Policy Stem utilizes cross-attention to process input data, and the Dueling Head computes Q-values for discrete action spaces.
Interfaces
__init__, forward
GitHub: https://github.com/liruiw/HPT/blob/main/hpt/models/policy_stem.py
__init__(state_dim, action_dim)
¶
Overview
Initialize the HPT model, including the Policy Stem and the Dueling Head.
Arguments:
- state_dim (:obj:int): The dimension of the input state.
- action_dim (:obj:int): The dimension of the discrete action space.
.. note:: The Policy Stem is initialized with cross-attention, and the Dueling Head is set to process the resulting tokens.
forward(x)
¶
Overview
Forward pass of the HPT model. Computes latent tokens from the input state and passes them through the Dueling Head.
Arguments:
- x (:obj:torch.Tensor): The input state tensor.
Returns:
- outputs (:obj:Dict): Dict containing keyword logit (:obj:torch.Tensor), the Q-value output of the Dueling Head.
QGPO
¶
Bases: Module
Overview
Model of QGPO algorithm.
Interfaces:
__init__, calculateQ, select_actions, sample, score_model_loss_fn, q_loss_fn, qt_loss_fn
__init__(cfg)
¶
Overview
Initialization of QGPO.
Arguments:
- cfg (:obj:EasyDict): The config dict.
calculateQ(s, a)
¶
Overview
Calculate the Q value.
Arguments:
- s (:obj:torch.Tensor): The input state.
- a (:obj:torch.Tensor): The input action.
select_actions(states, diffusion_steps=15, guidance_scale=1.0)
¶
Overview
Select actions for conditional sampling.
Arguments:
- states (:obj:list): The input states.
- diffusion_steps (:obj:int): The diffusion steps.
- guidance_scale (:obj:float): The scale of guidance.
sample(states, sample_per_state=16, diffusion_steps=15, guidance_scale=1.0)
¶
Overview
Sample actions for conditional sampling.
Arguments:
- states (:obj:list): The input states.
- sample_per_state (:obj:int): The number of samples per state.
- diffusion_steps (:obj:int): The diffusion steps.
- guidance_scale (:obj:float): The scale of guidance.
score_model_loss_fn(x, s, eps=0.001)
¶
Overview
The loss function for training score-based generative models.
Arguments:
- x (:obj:torch.Tensor): A mini-batch of training data.
- s (:obj:torch.Tensor): The input state.
- eps (:obj:float): A tolerance value for numerical stability.
q_loss_fn(a, s, r, s_, d, fake_a_, discount=0.99)
¶
Overview
The loss function for training Q function.
Arguments:
- a (:obj:torch.Tensor): The input action.
- s (:obj:torch.Tensor): The input state.
- r (:obj:torch.Tensor): The input reward.
- s_ (:obj:torch.Tensor): The input next state.
- d (:obj:torch.Tensor): The input done.
- fake_a_ (:obj:torch.Tensor): The input fake action.
- discount (:obj:float): The discount factor.
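Assuming a standard Bellman target (a common formulation; the actual QGPO loss may differ in its details), the role of r, d and discount can be sketched as:

```python
def q_target(r, q_next_max, d, discount=0.99):
    """Bellman target r + gamma * (1 - done) * max_a' Q(s', a').
    Scalar sketch of the regression target for the Q function;
    in practice q_next_max comes from Q evaluated on fake actions
    sampled for the next state."""
    return r + discount * (1.0 - d) * q_next_max

t = q_target(1.0, 2.0, 0.0)  # non-terminal transition: bootstrap continues
```

When d == 1 the bootstrap term is zeroed, so terminal transitions regress Q toward the reward alone.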
qt_loss_fn(s, fake_a)
¶
Overview
The loss function for training Guidance Qt.
Arguments:
- s (:obj:torch.Tensor): The input state.
- fake_a (:obj:torch.Tensor): The input fake action.
EBM
¶
Bases: Module
Overview
Energy based model.
Interface:
__init__, forward
__init__(obs_shape, action_shape, hidden_size=512, hidden_layer_num=4, **kwargs)
¶
Overview
Initialize the EBM.
Arguments:
- obs_shape (:obj:int): Observation shape.
- action_shape (:obj:int): Action shape.
- hidden_size (:obj:int): Hidden size.
- hidden_layer_num (:obj:int): Number of hidden layers.
forward(obs, action)
¶
Overview
Forward computation graph of EBM.
Arguments:
- obs (:obj:torch.Tensor): Observation of shape (B, N, O).
- action (:obj:torch.Tensor): Action of shape (B, N, A).
Returns:
- pred (:obj:torch.Tensor): Energy of shape (B, N).
Examples:
>>> obs = torch.randn(2, 3, 4)
>>> action = torch.randn(2, 3, 5)
>>> ebm = EBM(4, 5)
>>> pred = ebm(obs, action)
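A typical use of the energy output is derivative-free inference: sample N candidate actions per state, score them with the EBM, and keep the lowest-energy one (implicit-BC style). A minimal sketch with hypothetical names, not the library's API:

```python
def select_min_energy(actions, energies):
    """Return the candidate action whose energy is lowest.
    `actions` and `energies` are parallel lists for a single state;
    the real model scores a (B, N, A) batch in one forward pass."""
    idx = min(range(len(energies)), key=energies.__getitem__)
    return actions[idx]

best = select_min_energy(['a', 'b', 'c'], [0.5, -1.2, 0.3])
```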
AutoregressiveEBM
¶
Bases: Module
Overview
Autoregressive energy based model.
Interface:
__init__, forward
__init__(obs_shape, action_shape, hidden_size=512, hidden_layer_num=4)
¶
Overview
Initialize the AutoregressiveEBM.
Arguments:
- obs_shape (:obj:int): Observation shape.
- action_shape (:obj:int): Action shape.
- hidden_size (:obj:int): Hidden size.
- hidden_layer_num (:obj:int): Number of hidden layers.
forward(obs, action)
¶
Overview
Forward computation graph of AutoregressiveEBM.
Arguments:
- obs (:obj:torch.Tensor): Observation of shape (B, N, O).
- action (:obj:torch.Tensor): Action of shape (B, N, A).
Returns:
- pred (:obj:torch.Tensor): Energy of shape (B, N, A).
Examples:
>>> obs = torch.randn(2, 3, 4)
>>> action = torch.randn(2, 3, 5)
>>> arebm = AutoregressiveEBM(4, 5)
>>> pred = arebm(obs, action)
HAVAC
¶
Bases: Module
Overview
The HAVAC model of each agent for HAPPO.
Interfaces:
__init__, forward
__init__(agent_obs_shape, global_obs_shape, action_shape, agent_num, use_lstm=False, lstm_type='gru', encoder_hidden_size_list=[128, 128, 64], actor_head_hidden_size=64, actor_head_layer_num=2, critic_head_hidden_size=64, critic_head_layer_num=1, action_space='discrete', activation=nn.ReLU(), norm_type=None, sigma_type='independent', bound_type=None, res_link=False)
¶
Overview
Init the VAC Model for HAPPO according to arguments.
Arguments:
- agent_obs_shape (:obj:Union[int, SequenceType]): Observation's space for single agent.
- global_obs_shape (:obj:Union[int, SequenceType]): Observation's space for the global agent.
- action_shape (:obj:Union[int, SequenceType]): Action's space.
- agent_num (:obj:int): Number of agents.
- use_lstm (:obj:bool): Whether to use an RNN module in the model, defaults to False.
- lstm_type (:obj:str): The RNN module type, lstm or gru, defaults to gru.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size to pass to Encoder.
- actor_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to actor-nn's Head.
- actor_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for actor's nn.
- critic_head_hidden_size (:obj:Optional[int]): The hidden_size to pass to critic-nn's Head.
- critic_head_layer_num (:obj:int): The num of layers used in the network to compute Q value output for critic's nn.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in MLP after the layer_fn, if None then default set to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to use, see ding.torch_utils.fc_block for more details.
- res_link (:obj:bool): Whether to use the residual link, defaults to False.
IModelWrapper
¶
Bases: ABC
Overview
The basic interface class of model wrappers. Model wrapper is a wrapper class of torch.nn.Module model, which is used to add some extra operations for the wrapped model, such as hidden state maintain for RNN-base model, argmax action selection for discrete action space, etc.
Interfaces:
__init__, __getattr__, info, reset, forward.
__init__(model)
¶
Overview
Initialize model and other necessary member variables in the model wrapper.
__getattr__(key)
¶
Overview
Get original attributes of torch.nn.Module model, such as variables and methods defined in model.
Arguments:
- key (:obj:str): The string key to query.
Returns:
- ret (:obj:Any): The queried attribute.
info(attr_name)
¶
Overview
Get some string information of the indicated attr_name, which is used for debug wrappers.
This method will recursively search for the indicated attr_name.
Arguments:
- attr_name (:obj:str): The string key to query information.
Returns:
- info_string (:obj:str): The information string of the indicated attr_name.
reset(data_id=None, **kwargs)
¶
Overview
Basic interface, reset some stateful variables in the model wrapper, such as the hidden state of RNN.
Here we do nothing and just implement this interface method.
Other derived model wrappers can override this method to add some extra operations.
Arguments:
- data_id (:obj:List[int]): The data id list to reset. If None, reset all data. In practice, model wrappers often need to maintain some stateful variables for each data trajectory, so we leave this data_id argument to reset the stateful variables of the indicated data.
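The data_id contract can be sketched as follows (an illustrative helper, not the actual wrapper code, with per-trajectory hidden states stored in a list):

```python
def reset_hidden_state(state, data_id=None, init=None):
    """Reset every entry when data_id is None, otherwise only the listed
    trajectory indices; other trajectories keep their state untouched."""
    ids = range(len(state)) if data_id is None else data_id
    for i in ids:
        state[i] = init
    return state

state = reset_hidden_state(['h0', 'h1', 'h2'], data_id=[1])
```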
forward(*args, **kwargs)
¶
Overview
Basic interface, call the wrapped model's forward method. Other derived model wrappers can override this method to add some extra operations.
independent_normal_dist(logits)
¶
Overview
Convert different types logit to independent normal distribution.
Arguments:
- logits (:obj:Union[List, Dict]): The logits to be converted.
Returns:
- dist (:obj:torch.distributions.Distribution): The converted normal distribution.
Examples:
>>> logits = [torch.randn(4, 5), torch.ones(4, 5)]
>>> dist = independent_normal_dist(logits)
>>> assert isinstance(dist, torch.distributions.Independent)
>>> assert isinstance(dist.base_dist, torch.distributions.Normal)
>>> assert dist.base_dist.loc.shape == torch.Size([4, 5])
>>> assert dist.base_dist.scale.shape == torch.Size([4, 5])
Raises:
- TypeError: If the type of logits is not list or dict.
create_model(cfg)
¶
Overview
Create a neural network model according to the given EasyDict-type cfg.
Arguments:
- cfg (:obj:EasyDict): User's model config. The key import_names is used to import modules, and the key type is used to indicate the model.
Returns:
- (:obj:torch.nn.Module): The created neural network model.
Examples:
>>> cfg = EasyDict({
>>> 'import_names': ['ding.model.template.q_learning'],
>>> 'type': 'dqn',
>>> 'obs_shape': 4,
>>> 'action_shape': 2,
>>> })
>>> model = create_model(cfg)
.. tip::
This method will not modify the given cfg; it deepcopies the cfg and then modifies the copy.
model_wrap(model, wrapper_name=None, **kwargs)
¶
Overview
Wrap the model with the specified wrapper and return the wrapped model.
Arguments:
- model (:obj:Any): The model to be wrapped.
- wrapper_name (:obj:str): The name of the wrapper to be used.
.. note:: The arguments of the wrapper should be passed in as kwargs.
register_wrapper(name, wrapper_type)
¶
Overview
Register a new wrapper to wrapper_name_map. When a user implements a new wrapper, they must call this function to complete the registration. Then the wrapper can be called by model_wrap.
Arguments:
- name (:obj:str): The name of the new wrapper to be registered.
- wrapper_type (:obj:type): The wrapper class needs to be added in wrapper_name_map. This argument should be the subclass of IModelWrapper.
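The register-then-lookup pattern behind register_wrapper and model_wrap can be sketched as follows (simplified; the real wrapper_name_map, base class and built-in wrappers live in ding.model.wrapper):

```python
# Global registry mapping wrapper names to wrapper classes.
wrapper_name_map = {}

class IModelWrapper:
    """Minimal stand-in for the abstract wrapper base class."""
    def __init__(self, model):
        self._model = model

def register_wrapper(name, wrapper_type):
    """Add a wrapper class to the registry; it must subclass IModelWrapper."""
    assert issubclass(wrapper_type, IModelWrapper)
    wrapper_name_map[name] = wrapper_type

def model_wrap(model, wrapper_name=None, **kwargs):
    """Look up the registered wrapper by name and wrap the model with it."""
    return wrapper_name_map[wrapper_name](model, **kwargs)

class MyWrapper(IModelWrapper):
    pass

register_wrapper('my_wrapper', MyWrapper)
wrapped = model_wrap(object(), wrapper_name='my_wrapper')
```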
Full Source Code
../ding/model/__init__.py