
ding.model.template.q_learning


DQN

Bases: Module

Overview

The neural network structure and computation graph of the Deep Q Network (DQN) algorithm, the most classic value-based RL algorithm for discrete actions. DQN is composed of two parts: encoder and head. The encoder extracts features from various observations, and the head computes the Q value of each action dimension.

Interfaces: __init__, forward.

.. note:: Current DQN supports two types of encoder: FCEncoder and ConvEncoder, two types of head: DiscreteHead and DuelingHead. You can customize your own encoder or head by inheriting this class.

__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], dueling=True, head_hidden_size=None, head_layer_num=1, activation=nn.ReLU(), norm_type=None, dropout=None, init_bias=None, noise=False)

Overview

Initialize the DQN (encoder + head) model according to the corresponding input arguments.

Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
- action_shape (:obj:Union[int, SequenceType]): Action space shape, such as 6 or [2, 3, 3].
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size values to pass to Encoder; the last element must match head_hidden_size.
- dueling (:obj:Optional[bool]): Whether to use DuelingHead or DiscreteHead (default).
- head_hidden_size (:obj:Optional[int]): The hidden_size of the head network; defaults to None, in which case it is set to the last element of encoder_hidden_size_list.
- head_layer_num (:obj:int): The number of layers used in the head network to compute the Q value output.
- activation (:obj:Optional[nn.Module]): The type of activation function in networks; if None, it defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization in networks; see ding.torch_utils.fc_block for more details. You can choose one of ['BN', 'IN', 'SyncBN', 'LN'].
- dropout (:obj:Optional[float]): The dropout rate of the dropout layer; if None, the dropout layer is disabled.
- init_bias (:obj:Optional[float]): The initial value of the last-layer bias in the head network.
- noise (:obj:bool): Whether to use NoiseLinearLayer as layer_fn to boost exploration in the Q network's MLP. Defaults to False.

forward(x)

Overview

DQN forward computation graph, input observation tensor to predict q_value.

Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.

Returns:
- outputs (:obj:Dict): The output of DQN's forward pass, including q_value.

ReturnsKeys:
- logit (:obj:torch.Tensor): Discrete Q-value output of each possible action dimension.

Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is obs_shape.
- logit (:obj:torch.Tensor): :math:(B, M), where B is batch size and M is action_shape.

Examples:
>>> model = DQN(32, 6)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 32)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 6])

.. note:: For consistency and compatibility, we name all the outputs of the network which are related to action selections as logit.
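The logit returned by forward is typically consumed by an exploration policy outside the model. As a minimal sketch (plain PyTorch, not part of this class), epsilon-greedy action selection over a (B, M) Q-value tensor might look like:

```python
import torch

def epsilon_greedy(logit: torch.Tensor, eps: float) -> torch.Tensor:
    # logit: (B, M) Q-values. With probability eps take a uniform random action,
    # otherwise take the greedy (argmax) action.
    B, M = logit.shape
    greedy = logit.argmax(dim=-1)
    random = torch.randint(M, (B, ))
    explore = torch.rand(B) < eps
    return torch.where(explore, random, greedy)

logit = torch.randn(16, 6)
actions = epsilon_greedy(logit, eps=0.1)  # (B,) action indices in [0, M)
```

With eps=0 this reduces to pure greedy selection on the logit.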

BDQ

Bases: Module

__init__(obs_shape, num_branches=0, action_bins_per_branch=2, layer_num=3, a_layer_num=None, v_layer_num=None, encoder_hidden_size_list=[128, 128, 64], head_hidden_size=None, norm_type=None, activation=nn.ReLU())

Overview

Initialize the BDQ (encoder + head) model according to input arguments. Reference paper: Action Branching Architectures for Deep Reinforcement Learning, https://arxiv.org/pdf/1711.08946

Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
- num_branches (:obj:int): The number of branches, which is equivalent to the action dimension, such as 6 in mujoco's halfcheetah environment.
- action_bins_per_branch (:obj:int): The number of actions in each dimension.
- layer_num (:obj:int): The number of layers used in the network to compute Advantage and Value output.
- a_layer_num (:obj:int): The number of layers used in the network to compute Advantage output.
- v_layer_num (:obj:int): The number of layers used in the network to compute Value output.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size values to pass to Encoder; the last element must match head_hidden_size.
- head_hidden_size (:obj:Optional[int]): The hidden_size of the head network.
- norm_type (:obj:Optional[str]): The type of normalization in networks; see ding.torch_utils.fc_block for more details.
- activation (:obj:Optional[nn.Module]): The type of activation function in networks; if None, it defaults to nn.ReLU().

forward(x)

Overview

BDQ forward computation graph, input observation tensor to predict q_value.

Arguments:
- x (:obj:torch.Tensor): Observation inputs.

Returns:
- outputs (:obj:Dict): BDQ forward outputs, such as q_value.

ReturnsKeys:
- logit (:obj:torch.Tensor): Discrete Q-value output of each action dimension.

Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is obs_shape.
- logit (:obj:torch.FloatTensor): :math:(B, M), where B is batch size and M is num_branches * action_bins_per_branch.

Examples:
>>> model = BDQ(8, 5, 2)  # arguments: 'obs_shape', 'num_branches' and 'action_bins_per_branch'
>>> inputs = torch.randn(4, 8)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 5, 2])
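A hypothetical post-processing sketch of BDQ's (B, num_branches, action_bins_per_branch) logit: pick the best bin per branch, then map bin indices onto an evenly spaced grid in [-1, 1] (a common discretization convention for continuous control, not something mandated by this class):

```python
import torch

B, num_branches, bins = 4, 5, 2
logit = torch.randn(B, num_branches, bins)  # shaped like BDQ's 'logit' output
branch_actions = logit.argmax(dim=-1)       # (B, num_branches): one bin index per branch
grid = torch.linspace(-1.0, 1.0, bins)      # assumed bin centers for each action dimension
continuous_actions = grid[branch_actions]   # (B, num_branches) continuous action values
```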

C51DQN

Bases: Module

Overview

The neural network structure and computation graph of C51DQN, which combines distributional RL and DQN. You can refer to https://arxiv.org/pdf/1707.06887.pdf for more details. The C51DQN is composed of encoder and head: the encoder extracts features from the observation, and the head computes the distribution of the Q-value.

Interfaces: __init__, forward

.. note:: Current C51DQN supports two types of encoder: FCEncoder and ConvEncoder.

__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], head_hidden_size=None, head_layer_num=1, activation=nn.ReLU(), norm_type=None, v_min=-10, v_max=10, n_atom=51)

Overview

Initialize the C51DQN model according to the corresponding input arguments.

Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
- action_shape (:obj:Union[int, SequenceType]): Action space shape, such as 6 or [2, 3, 3].
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size values to pass to Encoder; the last element must match head_hidden_size.
- head_hidden_size (:obj:Optional[int]): The hidden_size of the head network; defaults to None, in which case it is set to the last element of encoder_hidden_size_list.
- head_layer_num (:obj:int): The number of layers used in the head network to compute the Q value output.
- activation (:obj:Optional[nn.Module]): The type of activation function in networks; if None, it defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization in networks; see ding.torch_utils.fc_block for more details. You can choose one of ['BN', 'IN', 'SyncBN', 'LN'].
- v_min (:obj:Optional[float]): The minimum value of the support of the distribution, which is related to the value (discounted sum of reward) scale of the specific environment. Defaults to -10.
- v_max (:obj:Optional[float]): The maximum value of the support of the distribution, which is related to the value (discounted sum of reward) scale of the specific environment. Defaults to 10.
- n_atom (:obj:Optional[int]): The number of atoms in the prediction distribution; 51 is the default value in the paper, but you can also try other values such as 301.

forward(x)

Overview

C51DQN forward computation graph, input observation tensor to predict q_value and its distribution.

Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.

Returns:
- outputs (:obj:Dict): The output of C51DQN's forward pass, including q_value and distribution.

ReturnsKeys:
- logit (:obj:torch.Tensor): Discrete Q-value output of each possible action dimension.
- distribution (:obj:torch.Tensor): Q-value discretized distribution, i.e., the probability of each uniformly spaced atom Q-value, such as dividing [-10, 10] into 51 uniform spaces.

Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is obs_shape.
- logit (:obj:torch.Tensor): :math:(B, M), where M is action_shape.
- distribution (:obj:torch.Tensor): :math:(B, M, P), where P is n_atom.

Examples:
>>> model = C51DQN(128, 64)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 128)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> # action_shape: int = 64
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default n_atom: int = 51
>>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])

.. note:: For consistency and compatibility, we name all the outputs of the network which are related to action selections as logit.

.. note:: For convenience, we recommend that the number of atoms be odd, so that the middle atom lies exactly at the midpoint of the value support.
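The notes above can be illustrated with a short sketch of how a scalar Q-value is recovered from C51's categorical output: the distribution is a probability mass over n_atom evenly spaced support values in [v_min, v_max], and the Q-value is its expectation. With an odd n_atom the middle atom sits exactly at the midpoint of the support:

```python
import torch

v_min, v_max, n_atom = -10.0, 10.0, 51
support = torch.linspace(v_min, v_max, n_atom)               # (n_atom,) atom values
# Stand-in for C51DQN's 'distribution' output: (B, M, n_atom) probabilities per action.
distribution = torch.softmax(torch.randn(4, 6, n_atom), -1)
q_value = (distribution * support).sum(-1)                   # (B, M) expected return per action
```

Since q_value is a convex combination of the support, it always stays inside [v_min, v_max].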

QRDQN

Bases: Module

Overview

The neural network structure and computation graph of QRDQN, which combines distributional RL and DQN. You can refer to Distributional Reinforcement Learning with Quantile Regression https://arxiv.org/pdf/1710.10044.pdf for more details.

Interfaces: __init__, forward

__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], head_hidden_size=None, head_layer_num=1, num_quantiles=32, activation=nn.ReLU(), norm_type=None)

Overview

Initialize the QRDQN Model according to input arguments.

Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape.
- action_shape (:obj:Union[int, SequenceType]): Action space shape.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size values to pass to Encoder.
- head_hidden_size (:obj:Optional[int]): The hidden_size to pass to Head.
- head_layer_num (:obj:int): The number of layers used in the network to compute the Q value output.
- num_quantiles (:obj:int): Number of quantiles in the prediction distribution.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in the MLP after layer_fn; if None, it defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to use; see ding.torch_utils.fc_block for more details.

forward(x)

Overview

QRDQN forward computation graph: input an observation tensor to predict quantile Q-values.

Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.

Returns:
- outputs (:obj:Dict): The prediction dictionary computed by the encoder and head.

ReturnsKeys:
- logit (:obj:torch.Tensor): Discrete Q-value output of each action dimension.
- q (:obj:torch.Tensor): Q-value tensor of size (B, M, num_quantiles).
- tau (:obj:torch.Tensor): tau tensor of size (B, num_quantiles, 1).

Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is obs_shape.
- logit (:obj:torch.FloatTensor): :math:(B, M), where M is action_shape.
- tau (:obj:torch.Tensor): :math:(B, P, 1), where P is num_quantiles.

Examples:
>>> model = QRDQN(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles: int = 32
>>> assert outputs['q'].shape == torch.Size([4, 64, 32])
>>> assert outputs['tau'].shape == torch.Size([4, 32, 1])
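The q tensor above holds one value estimate per quantile; since QRDQN's quantile fractions are evenly spaced, the scalar Q-value used for action selection is the plain mean over the quantile dimension. A sketch using the shapes from the example (B=4, M=64, num_quantiles=32):

```python
import torch

q = torch.randn(4, 64, 32)   # stand-in for outputs['q']: (B, M, num_quantiles)
q_value = q.mean(dim=-1)     # (B, M): uniform-weight average over quantiles
action = q_value.argmax(dim=-1)  # (B,) greedy action per sample
```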

IQN

Bases: Module

Overview

The neural network structure and computation graph of IQN, which combines distributional RL and DQN. You can refer to paper Implicit Quantile Networks for Distributional Reinforcement Learning https://arxiv.org/pdf/1806.06923.pdf for more details.

Interfaces: __init__, forward

__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], head_hidden_size=None, head_layer_num=1, num_quantiles=32, quantile_embedding_size=128, activation=nn.ReLU(), norm_type=None)

Overview

Initialize the IQN Model according to input arguments.

Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape.
- action_shape (:obj:Union[int, SequenceType]): Action space shape.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size values to pass to Encoder.
- head_hidden_size (:obj:Optional[int]): The hidden_size to pass to Head.
- head_layer_num (:obj:int): The number of layers used in the network to compute the Q value output.
- num_quantiles (:obj:int): Number of quantiles in the prediction distribution.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in the MLP after layer_fn; if None, it defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to use; see ding.torch_utils.fc_block for more details.

forward(x)

Overview

IQN forward computation graph: input an observation tensor to predict IQN's output.

Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.

Returns:
- outputs (:obj:Dict): The prediction dictionary computed by the encoder and head.

ReturnsKeys:
- logit (:obj:torch.Tensor): Discrete Q-value output of each action dimension.
- q (:obj:torch.Tensor): Q-value tensor of size (num_quantiles, B, M).
- quantiles (:obj:torch.Tensor): quantiles tensor of size (quantile_embedding_size, 1).

Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is obs_shape.
- logit (:obj:torch.FloatTensor): :math:(B, M), where M is action_shape.
- quantiles (:obj:torch.Tensor): :math:(P, 1), where P is quantile_embedding_size.

Examples:
>>> model = IQN(64, 64)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles: int = 32
>>> assert outputs['q'].shape == torch.Size([32, 4, 64])
>>> # default quantile_embedding_size: int = 128
>>> assert outputs['quantiles'].shape == torch.Size([128, 1])
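Note that, unlike QRDQN, IQN's q output puts the quantile dimension first, (num_quantiles, B, M), so reducing to a scalar Q-value means averaging over dim 0 rather than the last dim. A shape-only sketch:

```python
import torch

q = torch.randn(32, 4, 64)  # stand-in for outputs['q']: (num_quantiles, B, M)
q_value = q.mean(dim=0)     # (B, M) scalar Q-values for action selection
```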

FQF

Bases: Module

Overview

The neural network structure and computation graph of FQF, which combines distributional RL and DQN. You can refer to paper Fully Parameterized Quantile Function for Distributional Reinforcement Learning https://arxiv.org/pdf/1911.02140.pdf for more details.

Interfaces: __init__, forward

__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], head_hidden_size=None, head_layer_num=1, num_quantiles=32, quantile_embedding_size=128, activation=nn.ReLU(), norm_type=None)

Overview

Initialize the FQF Model according to input arguments.

Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape.
- action_shape (:obj:Union[int, SequenceType]): Action space shape.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size values to pass to Encoder.
- head_hidden_size (:obj:Optional[int]): The hidden_size to pass to Head.
- head_layer_num (:obj:int): The number of layers used in the network to compute the Q value output.
- num_quantiles (:obj:int): Number of quantiles in the prediction distribution.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in the MLP after layer_fn; if None, it defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to use; see ding.torch_utils.fc_block for more details.

forward(x)

Overview

FQF forward computation graph: input an observation tensor to predict FQF's output.

Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.

Returns:
- outputs (:obj:Dict): Dict containing keywords logit (:obj:torch.Tensor), q (:obj:torch.Tensor), quantiles (:obj:torch.Tensor), quantiles_hats (:obj:torch.Tensor), q_tau_i (:obj:torch.Tensor), entropies (:obj:torch.Tensor).

Shapes:
- x: :math:(B, N), where B is batch size and N is obs_shape.
- logit: :math:(B, M), where M is action_shape.
- q: :math:(B, num_quantiles, M).
- quantiles: :math:(B, num_quantiles + 1).
- quantiles_hats: :math:(B, num_quantiles).
- q_tau_i: :math:(B, num_quantiles - 1, M).
- entropies: :math:(B, 1).

Examples:
>>> model = FQF(64, 64)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default num_quantiles: int = 32
>>> assert outputs['q'].shape == torch.Size([4, 32, 64])
>>> assert outputs['quantiles'].shape == torch.Size([4, 33])
>>> assert outputs['quantiles_hats'].shape == torch.Size([4, 32])
>>> assert outputs['q_tau_i'].shape == torch.Size([4, 31, 64])
>>> assert outputs['entropies'].shape == torch.Size([4, 1])
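A sketch of how the quantiles and quantiles_hats tensors relate, following the paper's construction (the fraction-proposal network is stood in for by a softmax of random scores, purely illustrative): quantiles are monotone fractions tau_0 = 0 < ... < tau_N = 1, and quantiles_hats are the midpoints of adjacent fractions:

```python
import torch

B, num_quantiles = 4, 32
scores = torch.randn(B, num_quantiles)       # stand-in for fraction-proposal outputs
deltas = torch.softmax(scores, dim=-1)       # positive increments summing to 1
quantiles = torch.cat([torch.zeros(B, 1), deltas.cumsum(-1)], dim=-1)  # (B, num_quantiles + 1)
quantiles_hats = (quantiles[:, 1:] + quantiles[:, :-1]) / 2            # (B, num_quantiles)
```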

RainbowDQN

Bases: Module

Overview

The neural network structure and computation graph of RainbowDQN, which combines distributional RL and DQN. You can refer to paper Rainbow: Combining Improvements in Deep Reinforcement Learning https://arxiv.org/pdf/1710.02298.pdf for more details.

Interfaces: __init__, forward

.. note:: RainbowDQN contains dueling architecture by default.

__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], head_hidden_size=None, head_layer_num=1, activation=nn.ReLU(), norm_type=None, v_min=-10, v_max=10, n_atom=51)

Overview

Initialize the RainbowDQN model according to the input arguments.

Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Observation space shape.
- action_shape (:obj:Union[int, SequenceType]): Action space shape.
- encoder_hidden_size_list (:obj:SequenceType): Collection of hidden_size values to pass to Encoder.
- head_hidden_size (:obj:Optional[int]): The hidden_size to pass to Head.
- head_layer_num (:obj:int): The number of layers used in the network to compute the Q value output.
- activation (:obj:Optional[nn.Module]): The type of activation function to use in the MLP after layer_fn; if None, it defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): The type of normalization to use; see ding.torch_utils.fc_block for more details.
- v_min (:obj:Optional[float]): The minimum value of the support of the distribution. Defaults to -10.
- v_max (:obj:Optional[float]): The maximum value of the support of the distribution. Defaults to 10.
- n_atom (:obj:Optional[int]): Number of atoms in the prediction distribution.

forward(x)

Overview

RainbowDQN forward computation graph: input an observation tensor to predict the Q-value distribution.

Arguments:
- x (:obj:torch.Tensor): The input observation tensor data.

Returns:
- outputs (:obj:Dict): The prediction dictionary computed with the RainbowHead setup.

ReturnsKeys:
- logit (:obj:torch.Tensor): Discrete Q-value output of each action dimension.
- distribution (:obj:torch.Tensor): Distribution tensor of size (B, M, n_atom).

Shapes:
- x (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is obs_shape.
- logit (:obj:torch.FloatTensor): :math:(B, M), where M is action_shape.
- distribution (:obj:torch.FloatTensor): :math:(B, M, P), where P is n_atom.

Examples:
>>> model = RainbowDQN(64, 64)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default n_atom: int = 51
>>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])
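The Rainbow paper replaces epsilon-greedy exploration with noisy linear layers. As a minimal sketch of that mechanism (a simplified factorized-Gaussian layer after Fortunato et al.; this is not DI-engine's NoiseLinearLayer, whose details may differ):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinearSketch(nn.Module):
    # Each weight/bias is mean + sigma * noise; the learned sigmas let the network
    # modulate its own exploration instead of relying on epsilon-greedy.
    def __init__(self, in_features: int, out_features: int, sigma0: float = 0.5):
        super().__init__()
        bound = 1 / math.sqrt(in_features)
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma0 * bound))
        self.mu_b = nn.Parameter(torch.empty(out_features).uniform_(-bound, bound))
        self.sigma_b = nn.Parameter(torch.full((out_features, ), sigma0 * bound))

    @staticmethod
    def _f(x: torch.Tensor) -> torch.Tensor:
        return x.sign() * x.abs().sqrt()  # factorized-noise transform

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        eps_in = self._f(torch.randn(self.mu_w.shape[1]))
        eps_out = self._f(torch.randn(self.mu_w.shape[0]))
        weight = self.mu_w + self.sigma_w * torch.outer(eps_out, eps_in)
        bias = self.mu_b + self.sigma_b * eps_out
        return F.linear(x, weight, bias)

layer = NoisyLinearSketch(8, 4)
out = layer(torch.randn(2, 8))  # (B, out_features)
```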

DRQN

Bases: Module

Overview

The DRQN (Deep Recurrent Q-Network) is a neural network model combining DQN with an RNN to handle sequential data and partially observable environments. It consists of three main components:
- Encoder: extracts features from various observation inputs.
- RNN: processes sequential observations and other data.
- Head: computes Q-values for each action dimension.

Interfaces

__init__, forward.

.. note:: The current implementation supports:
- Two encoder types: FCEncoder and ConvEncoder.
- Two head types: DiscreteHead and DuelingHead.
- Three RNN types: normal (LSTM with LayerNorm), pytorch (PyTorch's native LSTM), and gru.

You can extend the model by customizing your own encoder, RNN, or head by inheriting this class.

__init__(obs_shape, action_shape, encoder_hidden_size_list=[128, 128, 64], dueling=True, head_hidden_size=None, head_layer_num=1, lstm_type='normal', activation=nn.ReLU(), norm_type=None, res_link=False)

Overview

Initialize the DRQN model with specified parameters.

Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Shape of the observation space, e.g., 8 or [4, 84, 84].
- action_shape (:obj:Union[int, SequenceType]): Shape of the action space, e.g., 6 or [2, 3, 3].
- encoder_hidden_size_list (:obj:SequenceType): List of hidden sizes for the encoder. The last element must match head_hidden_size.
- dueling (:obj:Optional[bool]): Use DuelingHead if True, otherwise use DiscreteHead.
- head_hidden_size (:obj:Optional[int]): Hidden size for the head network. Defaults to the last element of encoder_hidden_size_list if None.
- head_layer_num (:obj:int): Number of layers in the head network to compute Q-value outputs.
- lstm_type (:obj:Optional[str]): Type of RNN module. Supported types are normal, pytorch, and gru.
- activation (:obj:Optional[nn.Module]): Activation function used in the network. Defaults to nn.ReLU().
- norm_type (:obj:Optional[str]): Normalization type for the networks. Supported types are ['BN', 'IN', 'SyncBN', 'LN']. See ding.torch_utils.fc_block for more details.
- res_link (:obj:bool): Enables residual connections between single-frame data and sequential data. Defaults to False.

forward(inputs, inference=False, saved_state_timesteps=None)

Overview

Defines the forward pass of the DRQN model. Takes observation and previous RNN states as inputs and predicts Q-values.

Arguments:
- inputs (:obj:Dict): Input data dictionary containing observation and previous RNN state.
- inference (:obj:bool): If True, unroll one timestep (used during evaluation). If False, unroll the entire sequence (used during training).
- saved_state_timesteps (:obj:Optional[list]): When inference is False, specifies the timesteps whose hidden states are saved and returned.

ArgumentsKeys:
- obs (:obj:torch.Tensor): Raw observation tensor.
- prev_state (:obj:list): Previous RNN state tensor, whose structure depends on lstm_type.

Returns:
- outputs (:obj:Dict): The output of DRQN's forward pass, including logit (q_value) and next_state.

ReturnsKeys:
- logit (:obj:torch.Tensor): Discrete Q-value output for each action dimension.
- next_state (:obj:list): Next RNN state tensor.

Shapes:
- obs (:obj:torch.Tensor): :math:(B, N), where B is batch size and N is obs_shape.
- logit (:obj:torch.Tensor): :math:(B, M), where B is batch size and M is action_shape.

Examples:
>>> # Initialize input keys
>>> prev_state = [[torch.randn(1, 1, 64) for __ in range(2)] for _ in range(4)]  # B=4
>>> obs = torch.randn(4, 64)
>>> model = DRQN(64, 64)  # arguments: 'obs_shape' and 'action_shape'
>>> outputs = model({'obs': obs, 'prev_state': prev_state}, inference=True)
>>> # Validate output keys and shapes
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == (4, 64)
>>> assert len(outputs['next_state']) == 4
>>> assert all([len(t) == 2 for t in outputs['next_state']])
>>> assert all([t[0].shape == (1, 1, 64) for t in outputs['next_state']])
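The inference=True mode unrolls one timestep per call and feeds next_state back in as prev_state. The same recurrent bookkeeping can be sketched with a plain nn.LSTM standing in for the DRQN model (shapes only; this is not DI-engine's LSTM wrapper):

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=64, hidden_size=64)  # stand-in for the DRQN model
prev_state = None                             # None lets the LSTM start from zero states
for _t in range(3):                           # three consecutive environment steps
    obs = torch.randn(1, 4, 64)               # (T=1, B=4, N): one timestep per call
    out, prev_state = rnn(obs, prev_state)    # carry the recurrent state across steps
```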

GTrXLDQN

Bases: Module

Overview

The neural network structure and computation graph of the Gated Transformer-XL DQN (GTrXLDQN) algorithm, an enhanced version of DRQN that uses Transformer-XL to improve long-term sequential modelling ability. GTrXLDQN is composed of three parts: encoder, core, and head. The encoder extracts features from various observations, the core processes the sequential observations and other data, and the head computes the Q value of each action dimension.

Interfaces: __init__, forward, reset_memory, get_memory.

__init__(obs_shape, action_shape, head_layer_num=1, att_head_dim=16, hidden_size=16, att_head_num=2, att_mlp_num=2, att_layer_num=3, memory_len=64, activation=nn.ReLU(), head_norm_type=None, dropout=0.0, gru_gating=True, gru_bias=2.0, dueling=True, encoder_hidden_size_list=[128, 128, 256], encoder_norm_type=None)

Overview

Initialize the GTrXLDQN model according to the corresponding input arguments.

.. tip:: You can refer to the GTrXL class in ding.torch_utils.network.gtrxl for more details about the input arguments.

Arguments:
- obs_shape (:obj:Union[int, SequenceType]): Used by Transformer. Observation space shape.
- action_shape (:obj:Union[int, SequenceType]): Used by Head. Action space shape.
- head_layer_num (:obj:int): Used by Head. Number of layers.
- att_head_dim (:obj:int): Used by Transformer.
- hidden_size (:obj:int): Used by Transformer and Head.
- att_head_num (:obj:int): Used by Transformer.
- att_mlp_num (:obj:int): Used by Transformer.
- att_layer_num (:obj:int): Used by Transformer.
- memory_len (:obj:int): Used by Transformer.
- activation (:obj:Optional[nn.Module]): Used by Transformer and Head. If None, it defaults to nn.ReLU().
- head_norm_type (:obj:Optional[str]): Used by Head. The type of normalization to use; see ding.torch_utils.fc_block for more details.
- dropout (:obj:float): Used by Transformer.
- gru_gating (:obj:bool): Used by Transformer.
- gru_bias (:obj:float): Used by Transformer.
- dueling (:obj:bool): Used by Head. Make the head dueling.
- encoder_hidden_size_list (:obj:SequenceType): Used by Encoder. The collection of hidden_size values if using a custom convolutional encoder.
- encoder_norm_type (:obj:Optional[str]): Used by Encoder. The type of normalization to use; see ding.torch_utils.fc_block for more details.

forward(x)

Overview

Let input tensor go through GTrXl and the Head sequentially.

Arguments:
- x (:obj:torch.Tensor): Input tensor of shape (seq_len, bs, obs_shape).

Returns:
- out (:obj:Dict): The prediction dictionary computed by GTrXL with the DiscreteHead setup.

ReturnKeys:
- logit (:obj:torch.Tensor): Discrete Q-value output of each action dimension, with shape (B, action_space).
- memory (:obj:torch.Tensor): Memory tensor of size (bs x layer_num+1 x memory_len x embedding_dim).
- transformer_out (:obj:torch.Tensor): Output tensor of the transformer with the same size as input x.

Examples:
>>> # Init input's Keys:
>>> obs_dim, seq_len, bs, action_dim = 128, 64, 32, 4
>>> obs = torch.rand(seq_len, bs, obs_dim)
>>> model = GTrXLDQN(obs_dim, action_dim)
>>> outputs = model(obs)
>>> assert isinstance(outputs, dict)

reset_memory(batch_size=None, state=None)

Overview

Clear or reset the memory of GTrXL.

Arguments: - batch_size (:obj:Optional[int]): The number of samples in a training batch. - state (:obj:Optional[torch.Tensor]): The input memory data, whose shape is (layer_num, memory_len, bs, embedding_dim).

get_memory()

Overview

Return the memory of GTrXL.

Returns:
- memory (:obj:Optional[torch.Tensor]): The output memory, or None if the memory has not been initialized; its shape is (layer_num, memory_len, bs, embedding_dim).

parallel_wrapper(forward_fn)

Overview

Process timestep T and batch_size B at the same time, in other words, treat different timestep data as different trajectories in a batch.

Arguments: - forward_fn (:obj:Callable): Normal nn.Module 's forward function. Returns: - wrapper (:obj:Callable): Wrapped function.
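The reshaping trick described above can be sketched as follows (a simplified version covering plain tensor outputs only; the real helper also handles dict outputs):

```python
import torch
import torch.nn as nn

def parallel_wrapper(forward_fn):
    # Fold timestep T into batch B so a per-sample module processes the whole
    # sequence in one call, then unfold back to (T, B, ...).
    def wrapper(x: torch.Tensor) -> torch.Tensor:
        T, B = x.shape[:2]
        y = forward_fn(x.reshape(T * B, *x.shape[2:]))
        return y.reshape(T, B, *y.shape[1:])
    return wrapper

fc = nn.Linear(16, 8)
out = parallel_wrapper(fc)(torch.randn(5, 3, 16))  # (T=5, B=3, 16) -> (5, 3, 8)
```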

Full Source Code

../ding/model/template/q_learning.py

1from typing import Union, Optional, Dict, Callable, List 2import torch 3import torch.nn as nn 4 5from ding.torch_utils import get_lstm 6from ding.utils import MODEL_REGISTRY, SequenceType, squeeze 7from ..common import FCEncoder, ConvEncoder, DiscreteHead, DuelingHead, MultiHead, RainbowHead, \ 8 QuantileHead, FQFHead, QRDQNHead, DistributionHead, BranchingHead 9from ding.torch_utils.network.gtrxl import GTrXL 10 11 12@MODEL_REGISTRY.register('dqn') 13class DQN(nn.Module): 14 """ 15 Overview: 16 The neural nework structure and computation graph of Deep Q Network (DQN) algorithm, which is the most classic \ 17 value-based RL algorithm for discrete action. The DQN is composed of two parts: ``encoder`` and ``head``. \ 18 The ``encoder`` is used to extract the feature from various observation, and the ``head`` is used to compute \ 19 the Q value of each action dimension. 20 Interfaces: 21 ``__init__``, ``forward``. 22 23 .. note:: 24 Current ``DQN`` supports two types of encoder: ``FCEncoder`` and ``ConvEncoder``, two types of head: \ 25 ``DiscreteHead`` and ``DuelingHead``. You can customize your own encoder or head by inheriting this class. 26 """ 27 28 def __init__( 29 self, 30 obs_shape: Union[int, SequenceType], 31 action_shape: Union[int, SequenceType], 32 encoder_hidden_size_list: SequenceType = [128, 128, 64], 33 dueling: bool = True, 34 head_hidden_size: Optional[int] = None, 35 head_layer_num: int = 1, 36 activation: Optional[nn.Module] = nn.ReLU(), 37 norm_type: Optional[str] = None, 38 dropout: Optional[float] = None, 39 init_bias: Optional[float] = None, 40 noise: bool = False, 41 ) -> None: 42 """ 43 Overview: 44 initialize the DQN (encoder + head) Model according to corresponding input arguments. 45 Arguments: 46 - obs_shape (:obj:`Union[int, SequenceType]`): Observation space shape, such as 8 or [4, 84, 84]. 47 - action_shape (:obj:`Union[int, SequenceType]`): Action space shape, such as 6 or [2, 3, 3]. 
48 - encoder_hidden_size_list (:obj:`SequenceType`): Collection of ``hidden_size`` to pass to ``Encoder``, \ 49 the last element must match ``head_hidden_size``. 50 - dueling (:obj:`Optional[bool]`): Whether choose ``DuelingHead`` or ``DiscreteHead (default)``. 51 - head_hidden_size (:obj:`Optional[int]`): The ``hidden_size`` of head network, defaults to None, \ 52 then it will be set to the last element of ``encoder_hidden_size_list``. 53 - head_layer_num (:obj:`int`): The number of layers used in the head network to compute Q value output. 54 - activation (:obj:`Optional[nn.Module]`): The type of activation function in networks \ 55 if ``None`` then default set it to ``nn.ReLU()``. 56 - norm_type (:obj:`Optional[str]`): The type of normalization in networks, see \ 57 ``ding.torch_utils.fc_block`` for more details. you can choose one of ['BN', 'IN', 'SyncBN', 'LN'] 58 - dropout (:obj:`Optional[float]`): The dropout rate of the dropout layer. \ 59 if ``None`` then default disable dropout layer. 60 - init_bias (:obj:`Optional[float]`): The initial value of the last layer bias in the head network. \ 61 - noise (:obj:`bool`): Whether to use ``NoiseLinearLayer`` as ``layer_fn`` to boost exploration in \ 62 Q networks' MLP. Default to ``False``. 63 """ 64 super(DQN, self).__init__() 65 # Squeeze data from tuple, list or dict to single object. 
For example, from (4, ) to 4 66 obs_shape, action_shape = squeeze(obs_shape), squeeze(action_shape) 67 if head_hidden_size is None: 68 head_hidden_size = encoder_hidden_size_list[-1] 69 # FC Encoder 70 if isinstance(obs_shape, int) or len(obs_shape) == 1: 71 self.encoder = FCEncoder( 72 obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type, dropout=dropout 73 ) 74 # Conv Encoder 75 elif len(obs_shape) == 3: 76 assert dropout is None, "dropout is not supported in ConvEncoder" 77 self.encoder = ConvEncoder(obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type) 78 else: 79 raise RuntimeError( 80 "not support obs_shape for pre-defined encoder: {}, please customize your own DQN".format(obs_shape) 81 ) 82 # Head Type 83 if dueling: 84 head_cls = DuelingHead 85 else: 86 head_cls = DiscreteHead 87 multi_head = not isinstance(action_shape, int) 88 if multi_head: 89 self.head = MultiHead( 90 head_cls, 91 head_hidden_size, 92 action_shape, 93 layer_num=head_layer_num, 94 activation=activation, 95 norm_type=norm_type, 96 dropout=dropout, 97 noise=noise, 98 ) 99 else: 100 self.head = head_cls( 101 head_hidden_size, 102 action_shape, 103 head_layer_num, 104 activation=activation, 105 norm_type=norm_type, 106 dropout=dropout, 107 noise=noise, 108 ) 109 if init_bias is not None and head_cls == DuelingHead: 110 # Zero the last layer bias of advantage head 111 self.head.A[-1][0].bias.data.fill_(init_bias) 112 113 def forward(self, x: torch.Tensor) -> Dict: 114 """ 115 Overview: 116 DQN forward computation graph, input observation tensor to predict q_value. 117 Arguments: 118 - x (:obj:`torch.Tensor`): The input observation tensor data. 119 Returns: 120 - outputs (:obj:`Dict`): The output of DQN's forward, including q_value. 121 ReturnsKeys: 122 - logit (:obj:`torch.Tensor`): Discrete Q-value output of each possible action dimension. 
        Shapes:
            - x (:obj:`torch.Tensor`): :math:`(B, N)`, where B is batch size and N is ``obs_shape``.
            - logit (:obj:`torch.Tensor`): :math:`(B, M)`, where B is batch size and M is ``action_shape``.
        Examples:
            >>> model = DQN(32, 6)  # arguments: 'obs_shape' and 'action_shape'
            >>> inputs = torch.randn(4, 32)
            >>> outputs = model(inputs)
            >>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 6])

        .. note::
            For consistency and compatibility, we name all the outputs of the network which are related to action \
            selections as ``logit``.
        """
        x = self.encoder(x)
        x = self.head(x)
        return x


@MODEL_REGISTRY.register('bdq')
class BDQ(nn.Module):

    def __init__(
        self,
        obs_shape: Union[int, SequenceType],
        num_branches: int = 0,
        action_bins_per_branch: int = 2,
        layer_num: int = 3,
        a_layer_num: Optional[int] = None,
        v_layer_num: Optional[int] = None,
        encoder_hidden_size_list: SequenceType = [128, 128, 64],
        head_hidden_size: Optional[int] = None,
        norm_type: Optional[str] = None,
        activation: Optional[nn.Module] = nn.ReLU(),
    ) -> None:
        """
        Overview:
            Initialize the BDQ (encoder + head) Model according to input arguments. \
            Referenced paper: Action Branching Architectures for Deep Reinforcement Learning \
            <https://arxiv.org/pdf/1711.08946>.
        Arguments:
            - obs_shape (:obj:`Union[int, SequenceType]`): Observation space shape, such as 8 or [4, 84, 84].
            - num_branches (:obj:`int`): The number of branches, which is equivalent to the action dimension, \
                such as 6 in mujoco's halfcheetah environment.
            - action_bins_per_branch (:obj:`int`): The number of actions in each dimension.
            - layer_num (:obj:`int`): The number of layers used in the network to compute Advantage and Value output.
            - a_layer_num (:obj:`int`): The number of layers used in the network to compute Advantage output.
            - v_layer_num (:obj:`int`): The number of layers used in the network to compute Value output.
            - encoder_hidden_size_list (:obj:`SequenceType`): Collection of ``hidden_size`` to pass to ``Encoder``, \
                the last element must match ``head_hidden_size``.
            - head_hidden_size (:obj:`Optional[int]`): The ``hidden_size`` of head network.
            - norm_type (:obj:`Optional[str]`): The type of normalization in networks, see \
                ``ding.torch_utils.fc_block`` for more details.
            - activation (:obj:`Optional[nn.Module]`): The type of activation function in networks, \
                if ``None`` then default set it to ``nn.ReLU()``.
        """
        super(BDQ, self).__init__()
        # For compatibility: 1, (1, ), [4, 32, 32]
        obs_shape, num_branches = squeeze(obs_shape), squeeze(num_branches)
        if head_hidden_size is None:
            head_hidden_size = encoder_hidden_size_list[-1]

        # backbone
        # FC Encoder
        if isinstance(obs_shape, int) or len(obs_shape) == 1:
            self.encoder = FCEncoder(obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type)
        # Conv Encoder
        elif len(obs_shape) == 3:
            self.encoder = ConvEncoder(obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type)
        else:
            raise RuntimeError(
                "not support obs_shape for pre-defined encoder: {}, please customize your own BDQ".format(obs_shape)
            )

        self.num_branches = num_branches
        self.action_bins_per_branch = action_bins_per_branch

        # head
        self.head = BranchingHead(
            head_hidden_size,
            num_branches=self.num_branches,
            action_bins_per_branch=self.action_bins_per_branch,
            layer_num=layer_num,
            a_layer_num=a_layer_num,
            v_layer_num=v_layer_num,
            activation=activation,
            norm_type=norm_type
        )

    def forward(self, x: torch.Tensor) -> Dict:
        """
        Overview:
            BDQ forward computation graph, input observation tensor to predict q_value.
        Arguments:
            - x (:obj:`torch.Tensor`): Observation inputs.
        Returns:
            - outputs (:obj:`Dict`): BDQ forward outputs, such as q_value.
        ReturnsKeys:
            - logit (:obj:`torch.Tensor`): Discrete Q-value output of each action dimension.
        Shapes:
            - x (:obj:`torch.Tensor`): :math:`(B, N)`, where B is batch size and N is ``obs_shape``.
            - logit (:obj:`torch.FloatTensor`): :math:`(B, D, A)`, where B is batch size, D is ``num_branches`` \
                and A is ``action_bins_per_branch``.
        Examples:
            >>> model = BDQ(8, 5, 2)  # arguments: 'obs_shape', 'num_branches' and 'action_bins_per_branch'
            >>> inputs = torch.randn(4, 8)
            >>> outputs = model(inputs)
            >>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 5, 2])
        """
        x = self.encoder(x) / (self.num_branches + 1)  # corresponds to the "Gradient Rescaling" in the paper
        x = self.head(x)
        return x


@MODEL_REGISTRY.register('c51dqn')
class C51DQN(nn.Module):
    """
    Overview:
        The neural network structure and computation graph of C51DQN, which combines distributional RL and DQN. \
        You can refer to https://arxiv.org/pdf/1707.06887.pdf for more details. The C51DQN is composed of \
        ``encoder`` and ``head``. ``encoder`` is used to extract the feature of observation, and ``head`` is \
        used to compute the distribution of Q-value.
    Interfaces:
        ``__init__``, ``forward``

    .. note::
        Current C51DQN supports two types of encoder: ``FCEncoder`` and ``ConvEncoder``.
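For every action, the C51 head predicts a probability over ``n_atom`` evenly spaced support values in ``[v_min, v_max]``, and the scalar Q-value is the expectation under that distribution. A pure-Python sketch of this collapse (``expected_q`` is an illustrative helper, not part of the class, which works on torch tensors):

```python
def expected_q(probs, v_min=-10.0, v_max=10.0):
    # Q = sum_i p_i * z_i over n_atom evenly spaced atoms z_i in [v_min, v_max].
    n_atom = len(probs)
    delta = (v_max - v_min) / (n_atom - 1)
    atoms = [v_min + i * delta for i in range(n_atom)]
    return sum(p * z for p, z in zip(probs, atoms))

# A uniform distribution over a symmetric support has zero expected value:
q = expected_q([1.0 / 51] * 51)  # -> 0.0 (up to float rounding)
```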
    """

    def __init__(
        self,
        obs_shape: Union[int, SequenceType],
        action_shape: Union[int, SequenceType],
        encoder_hidden_size_list: SequenceType = [128, 128, 64],
        head_hidden_size: Optional[int] = None,
        head_layer_num: int = 1,
        activation: Optional[nn.Module] = nn.ReLU(),
        norm_type: Optional[str] = None,
        v_min: Optional[float] = -10,
        v_max: Optional[float] = 10,
        n_atom: Optional[int] = 51,
    ) -> None:
        """
        Overview:
            Initialize the C51DQN Model according to corresponding input arguments.
        Arguments:
            - obs_shape (:obj:`Union[int, SequenceType]`): Observation space shape, such as 8 or [4, 84, 84].
            - action_shape (:obj:`Union[int, SequenceType]`): Action space shape, such as 6 or [2, 3, 3].
            - encoder_hidden_size_list (:obj:`SequenceType`): Collection of ``hidden_size`` to pass to ``Encoder``, \
                the last element must match ``head_hidden_size``.
            - head_hidden_size (:obj:`Optional[int]`): The ``hidden_size`` of head network, defaults to None, \
                then it will be set to the last element of ``encoder_hidden_size_list``.
            - head_layer_num (:obj:`int`): The number of layers used in the head network to compute Q value output.
            - activation (:obj:`Optional[nn.Module]`): The type of activation function in networks, \
                if ``None`` then default set it to ``nn.ReLU()``.
            - norm_type (:obj:`Optional[str]`): The type of normalization in networks, see \
                ``ding.torch_utils.fc_block`` for more details. You can choose one of ['BN', 'IN', 'SyncBN', 'LN'].
            - v_min (:obj:`Optional[float]`): The minimum value of the support of the distribution, which is related \
                to the value (discounted sum of reward) scale of the specific environment. Defaults to -10.
            - v_max (:obj:`Optional[float]`): The maximum value of the support of the distribution, which is related \
                to the value (discounted sum of reward) scale of the specific environment. Defaults to 10.
            - n_atom (:obj:`Optional[int]`): The number of atoms in the prediction distribution, 51 is the default \
                value in the paper, you can also try other values such as 301.
        """
        super(C51DQN, self).__init__()
        # For compatibility: 1, (1, ), [4, 32, 32]
        obs_shape, action_shape = squeeze(obs_shape), squeeze(action_shape)
        if head_hidden_size is None:
            head_hidden_size = encoder_hidden_size_list[-1]
        # FC Encoder
        if isinstance(obs_shape, int) or len(obs_shape) == 1:
            self.encoder = FCEncoder(obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type)
        # Conv Encoder
        elif len(obs_shape) == 3:
            self.encoder = ConvEncoder(obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type)
        else:
            raise RuntimeError(
                "not support obs_shape for pre-defined encoder: {}, please customize your own C51DQN".format(obs_shape)
            )
        # Head Type
        multi_head = not isinstance(action_shape, int)
        if multi_head:
            self.head = MultiHead(
                DistributionHead,
                head_hidden_size,
                action_shape,
                layer_num=head_layer_num,
                activation=activation,
                norm_type=norm_type,
                n_atom=n_atom,
                v_min=v_min,
                v_max=v_max,
            )
        else:
            self.head = DistributionHead(
                head_hidden_size,
                action_shape,
                head_layer_num,
                activation=activation,
                norm_type=norm_type,
                n_atom=n_atom,
                v_min=v_min,
                v_max=v_max,
            )

    def forward(self, x: torch.Tensor) -> Dict:
        """
        Overview:
            C51DQN forward computation graph, input observation tensor to predict q_value and its distribution.
        Arguments:
            - x (:obj:`torch.Tensor`): The input observation tensor data.
        Returns:
            - outputs (:obj:`Dict`): The output of C51DQN's forward, including q_value and distribution.
        ReturnsKeys:
            - logit (:obj:`torch.Tensor`): Discrete Q-value output of each possible action dimension.
            - distribution (:obj:`torch.Tensor`): Q-value discretized distribution, i.e., probability of each \
                uniformly spaced atom Q-value, such as dividing [-10, 10] into 51 uniform spaces.
        Shapes:
            - x (:obj:`torch.Tensor`): :math:`(B, N)`, where B is batch size and N is ``obs_shape``.
            - logit (:obj:`torch.Tensor`): :math:`(B, M)`, where M is ``action_shape``.
            - distribution (:obj:`torch.Tensor`): :math:`(B, M, P)`, where P is ``n_atom``.
        Examples:
            >>> model = C51DQN(128, 64)  # arguments: 'obs_shape' and 'action_shape'
            >>> inputs = torch.randn(4, 128)
            >>> outputs = model(inputs)
            >>> assert isinstance(outputs, dict)
            >>> # action_shape: int = 64
            >>> assert outputs['logit'].shape == torch.Size([4, 64])
            >>> # default n_atom: int = 51
            >>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])

        .. note::
            For consistency and compatibility, we name all the outputs of the network which are related to action \
            selections as ``logit``.

        .. note::
            For convenience, we recommend that the number of atoms should be odd, so that the middle atom lies \
            exactly at the center of the value support.
        """
        x = self.encoder(x)
        x = self.head(x)
        return x


@MODEL_REGISTRY.register('qrdqn')
class QRDQN(nn.Module):
    """
    Overview:
        The neural network structure and computation graph of QRDQN, which combines distributional RL and DQN. \
        You can refer to Distributional Reinforcement Learning with Quantile Regression \
        https://arxiv.org/pdf/1710.10044.pdf for more details.
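Instead of atom probabilities, QRDQN predicts ``num_quantiles`` quantile values per action, each carrying a uniform weight, so the scalar Q-value is simply their mean. A pure-Python sketch (``quantile_q`` is an illustrative name, not the library API):

```python
def quantile_q(quantile_values):
    # QR-DQN assigns each of the N quantile estimates a uniform weight 1/N,
    # so the expected return is the plain average of the quantile values.
    return sum(quantile_values) / len(quantile_values)

q = quantile_q([1.0, 2.0, 3.0, 4.0])  # -> 2.5
```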
    Interfaces:
        ``__init__``, ``forward``
    """

    def __init__(
        self,
        obs_shape: Union[int, SequenceType],
        action_shape: Union[int, SequenceType],
        encoder_hidden_size_list: SequenceType = [128, 128, 64],
        head_hidden_size: Optional[int] = None,
        head_layer_num: int = 1,
        num_quantiles: int = 32,
        activation: Optional[nn.Module] = nn.ReLU(),
        norm_type: Optional[str] = None,
    ) -> None:
        """
        Overview:
            Initialize the QRDQN Model according to input arguments.
        Arguments:
            - obs_shape (:obj:`Union[int, SequenceType]`): Observation space shape.
            - action_shape (:obj:`Union[int, SequenceType]`): Action space shape.
            - encoder_hidden_size_list (:obj:`SequenceType`): Collection of ``hidden_size`` to pass to ``Encoder``.
            - head_hidden_size (:obj:`Optional[int]`): The ``hidden_size`` to pass to ``Head``.
            - head_layer_num (:obj:`int`): The num of layers used in the network to compute Q value output.
            - num_quantiles (:obj:`int`): Number of quantiles in the prediction distribution.
            - activation (:obj:`Optional[nn.Module]`): The type of activation function to use in ``MLP`` after \
                ``layer_fn``, if ``None`` then default set to ``nn.ReLU()``.
            - norm_type (:obj:`Optional[str]`): The type of normalization to use, see ``ding.torch_utils.fc_block`` \
                for more details.
        """
        super(QRDQN, self).__init__()
        # For compatibility: 1, (1, ), [4, 32, 32]
        obs_shape, action_shape = squeeze(obs_shape), squeeze(action_shape)
        if head_hidden_size is None:
            head_hidden_size = encoder_hidden_size_list[-1]
        # FC Encoder
        if isinstance(obs_shape, int) or len(obs_shape) == 1:
            self.encoder = FCEncoder(obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type)
        # Conv Encoder
        elif len(obs_shape) == 3:
            self.encoder = ConvEncoder(obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type)
        else:
            raise RuntimeError(
                "not support obs_shape for pre-defined encoder: {}, please customize your own QRDQN".format(obs_shape)
            )
        # Head Type
        multi_head = not isinstance(action_shape, int)
        if multi_head:
            self.head = MultiHead(
                QRDQNHead,
                head_hidden_size,
                action_shape,
                layer_num=head_layer_num,
                num_quantiles=num_quantiles,
                activation=activation,
                norm_type=norm_type,
            )
        else:
            self.head = QRDQNHead(
                head_hidden_size,
                action_shape,
                head_layer_num,
                num_quantiles=num_quantiles,
                activation=activation,
                norm_type=norm_type,
            )

    def forward(self, x: torch.Tensor) -> Dict:
        """
        Overview:
            Use observation tensor to predict QRDQN's output, i.e., the quantile distribution of Q-values.
        Arguments:
            - x (:obj:`torch.Tensor`): The input observation tensor data.
        Returns:
            - outputs (:obj:`Dict`): Run with encoder and head, and return the result prediction dictionary.
        ReturnsKeys:
            - logit (:obj:`torch.Tensor`): Discrete Q-value output of each possible action dimension.
            - q (:obj:`torch.Tensor`): Q-value tensor of size ``(B, M, num_quantiles)``.
            - tau (:obj:`torch.Tensor`): tau tensor of size ``(B, num_quantiles, 1)``.
        Shapes:
            - x (:obj:`torch.Tensor`): :math:`(B, N)`, where B is batch size and N is ``obs_shape``.
            - logit (:obj:`torch.FloatTensor`): :math:`(B, M)`, where M is ``action_shape``.
            - tau (:obj:`torch.Tensor`): :math:`(B, P, 1)`, where P is ``num_quantiles``.
        Examples:
            >>> model = QRDQN(64, 64)
            >>> inputs = torch.randn(4, 64)
            >>> outputs = model(inputs)
            >>> assert isinstance(outputs, dict)
            >>> assert outputs['logit'].shape == torch.Size([4, 64])
            >>> # default num_quantiles: int = 32
            >>> assert outputs['q'].shape == torch.Size([4, 64, 32])
            >>> assert outputs['tau'].shape == torch.Size([4, 32, 1])
        """
        x = self.encoder(x)
        x = self.head(x)
        return x


@MODEL_REGISTRY.register('iqn')
class IQN(nn.Module):
    """
    Overview:
        The neural network structure and computation graph of IQN, which combines distributional RL and DQN. \
        You can refer to paper Implicit Quantile Networks for Distributional Reinforcement Learning \
        https://arxiv.org/pdf/1806.06923.pdf for more details.
    Interfaces:
        ``__init__``, ``forward``
    """

    def __init__(
        self,
        obs_shape: Union[int, SequenceType],
        action_shape: Union[int, SequenceType],
        encoder_hidden_size_list: SequenceType = [128, 128, 64],
        head_hidden_size: Optional[int] = None,
        head_layer_num: int = 1,
        num_quantiles: int = 32,
        quantile_embedding_size: int = 128,
        activation: Optional[nn.Module] = nn.ReLU(),
        norm_type: Optional[str] = None
    ) -> None:
        """
        Overview:
            Initialize the IQN Model according to input arguments.
        Arguments:
            - obs_shape (:obj:`Union[int, SequenceType]`): Observation space shape.
            - action_shape (:obj:`Union[int, SequenceType]`): Action space shape.
            - encoder_hidden_size_list (:obj:`SequenceType`): Collection of ``hidden_size`` to pass to ``Encoder``.
            - head_hidden_size (:obj:`Optional[int]`): The ``hidden_size`` to pass to ``Head``.
            - head_layer_num (:obj:`int`): The num of layers used in the network to compute Q value output.
            - num_quantiles (:obj:`int`): Number of quantiles in the prediction distribution.
            - quantile_embedding_size (:obj:`int`): The embedding size of a quantile.
            - activation (:obj:`Optional[nn.Module]`): The type of activation function to use in ``MLP`` after \
                ``layer_fn``, if ``None`` then default set to ``nn.ReLU()``.
            - norm_type (:obj:`Optional[str]`): The type of normalization to use, see ``ding.torch_utils.fc_block`` \
                for more details.
        """
        super(IQN, self).__init__()
        # For compatibility: 1, (1, ), [4, 32, 32]
        obs_shape, action_shape = squeeze(obs_shape), squeeze(action_shape)
        if head_hidden_size is None:
            head_hidden_size = encoder_hidden_size_list[-1]
        # FC Encoder
        if isinstance(obs_shape, int) or len(obs_shape) == 1:
            self.encoder = FCEncoder(obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type)
        # Conv Encoder
        elif len(obs_shape) == 3:
            self.encoder = ConvEncoder(obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type)
        else:
            raise RuntimeError(
                "not support obs_shape for pre-defined encoder: {}, please customize your own IQN".format(obs_shape)
            )
        # Head Type
        head_cls = QuantileHead
        multi_head = not isinstance(action_shape, int)
        if multi_head:
            self.head = MultiHead(
                head_cls,
                head_hidden_size,
                action_shape,
                layer_num=head_layer_num,
                num_quantiles=num_quantiles,
                quantile_embedding_size=quantile_embedding_size,
                activation=activation,
                norm_type=norm_type
            )
        else:
            self.head = head_cls(
                head_hidden_size,
                action_shape,
                head_layer_num,
                activation=activation,
                norm_type=norm_type,
                num_quantiles=num_quantiles,
                quantile_embedding_size=quantile_embedding_size,
            )

    def forward(self, x: torch.Tensor) -> Dict:
        """
        Overview:
            Use observation tensor to predict IQN's output, i.e., quantile values of the Q distribution.
        Arguments:
            - x (:obj:`torch.Tensor`): The input observation tensor data.
        Returns:
            - outputs (:obj:`Dict`): Run with encoder and head, and return the result prediction dictionary.
        ReturnsKeys:
            - logit (:obj:`torch.Tensor`): Discrete Q-value output of each possible action dimension.
            - q (:obj:`torch.Tensor`): Q-value tensor of size ``(num_quantiles, B, M)``.
            - quantiles (:obj:`torch.Tensor`): quantiles tensor of size ``(quantile_embedding_size, 1)``.
        Shapes:
            - x (:obj:`torch.Tensor`): :math:`(B, N)`, where B is batch size and N is ``obs_shape``.
            - logit (:obj:`torch.FloatTensor`): :math:`(B, M)`, where M is ``action_shape``.
            - quantiles (:obj:`torch.Tensor`): :math:`(P, 1)`, where P is ``quantile_embedding_size``.
        Examples:
            >>> model = IQN(64, 64)  # arguments: 'obs_shape' and 'action_shape'
            >>> inputs = torch.randn(4, 64)
            >>> outputs = model(inputs)
            >>> assert isinstance(outputs, dict)
            >>> assert outputs['logit'].shape == torch.Size([4, 64])
            >>> # default num_quantiles: int = 32
            >>> assert outputs['q'].shape == torch.Size([32, 4, 64])
            >>> # default quantile_embedding_size: int = 128
            >>> assert outputs['quantiles'].shape == torch.Size([128, 1])
        """
        x = self.encoder(x)
        x = self.head(x)
        return x


@MODEL_REGISTRY.register('fqf')
class FQF(nn.Module):
    """
    Overview:
        The neural network structure and computation graph of FQF, which combines distributional RL and DQN. \
        You can refer to paper Fully Parameterized Quantile Function for Distributional Reinforcement Learning \
        https://arxiv.org/pdf/1911.02140.pdf for more details.
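Unlike IQN's sampled quantiles, FQF learns the quantile fractions themselves: a proposal network outputs one logit per quantile, a softmax followed by a cumulative sum turns them into monotone fractions tau_0 = 0 < tau_1 < ... < tau_N = 1, and the interval midpoints (the "quantiles_hats" in the output dict) are where quantile values are evaluated. A pure-Python sketch of that mapping (``fqf_fractions`` is a hypothetical helper, not the ``FQFHead`` API):

```python
import math

def fqf_fractions(logits):
    # Softmax turns arbitrary logits into positive weights summing to 1 ...
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # ... and a cumulative sum makes them monotone fractions in [0, 1].
    taus = [0.0]
    for p in probs:
        taus.append(taus[-1] + p)
    # Quantile values are evaluated at the interval midpoints (tau-hat).
    tau_hats = [(taus[i] + taus[i + 1]) / 2 for i in range(len(probs))]
    return taus, tau_hats

taus, tau_hats = fqf_fractions([0.0, 0.0, 0.0, 0.0])
# taus -> [0.0, 0.25, 0.5, 0.75, 1.0]; tau_hats -> [0.125, 0.375, 0.625, 0.875]
```

Note that ``len(taus)`` is ``num_quantiles + 1`` while ``len(tau_hats)`` is ``num_quantiles``, matching the ``quantiles`` and ``quantiles_hats`` shapes documented in ``FQF.forward``.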
    Interface:
        ``__init__``, ``forward``
    """

    def __init__(
        self,
        obs_shape: Union[int, SequenceType],
        action_shape: Union[int, SequenceType],
        encoder_hidden_size_list: SequenceType = [128, 128, 64],
        head_hidden_size: Optional[int] = None,
        head_layer_num: int = 1,
        num_quantiles: int = 32,
        quantile_embedding_size: int = 128,
        activation: Optional[nn.Module] = nn.ReLU(),
        norm_type: Optional[str] = None
    ) -> None:
        """
        Overview:
            Initialize the FQF Model according to input arguments.
        Arguments:
            - obs_shape (:obj:`Union[int, SequenceType]`): Observation space shape.
            - action_shape (:obj:`Union[int, SequenceType]`): Action space shape.
            - encoder_hidden_size_list (:obj:`SequenceType`): Collection of ``hidden_size`` to pass to ``Encoder``.
            - head_hidden_size (:obj:`Optional[int]`): The ``hidden_size`` to pass to ``Head``.
            - head_layer_num (:obj:`int`): The num of layers used in the network to compute Q value output.
            - num_quantiles (:obj:`int`): Number of quantiles in the prediction distribution.
            - quantile_embedding_size (:obj:`int`): The embedding size of a quantile.
            - activation (:obj:`Optional[nn.Module]`): The type of activation function to use in ``MLP`` after \
                ``layer_fn``, if ``None`` then default set to ``nn.ReLU()``.
            - norm_type (:obj:`Optional[str]`): The type of normalization to use, see ``ding.torch_utils.fc_block`` \
                for more details.
        """
        super(FQF, self).__init__()
        # For compatibility: 1, (1, ), [4, 32, 32]
        obs_shape, action_shape = squeeze(obs_shape), squeeze(action_shape)
        if head_hidden_size is None:
            head_hidden_size = encoder_hidden_size_list[-1]
        # FC Encoder
        if isinstance(obs_shape, int) or len(obs_shape) == 1:
            self.encoder = FCEncoder(obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type)
        # Conv Encoder
        elif len(obs_shape) == 3:
            self.encoder = ConvEncoder(obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type)
        else:
            raise RuntimeError(
                "not support obs_shape for pre-defined encoder: {}, please customize your own FQF".format(obs_shape)
            )
        # Head Type
        head_cls = FQFHead
        multi_head = not isinstance(action_shape, int)
        if multi_head:
            self.head = MultiHead(
                head_cls,
                head_hidden_size,
                action_shape,
                layer_num=head_layer_num,
                num_quantiles=num_quantiles,
                quantile_embedding_size=quantile_embedding_size,
                activation=activation,
                norm_type=norm_type
            )
        else:
            self.head = head_cls(
                head_hidden_size,
                action_shape,
                head_layer_num,
                activation=activation,
                norm_type=norm_type,
                num_quantiles=num_quantiles,
                quantile_embedding_size=quantile_embedding_size,
            )

    def forward(self, x: torch.Tensor) -> Dict:
        """
        Overview:
            Use observation tensor to predict FQF's output, i.e., quantile fractions and values of the Q distribution.
        Arguments:
            - x (:obj:`torch.Tensor`): The input observation tensor data.
        Returns:
            - outputs (:obj:`Dict`): Dict containing keywords ``logit`` (:obj:`torch.Tensor`), \
                ``q`` (:obj:`torch.Tensor`), ``quantiles`` (:obj:`torch.Tensor`), \
                ``quantiles_hats`` (:obj:`torch.Tensor`), \
                ``q_tau_i`` (:obj:`torch.Tensor`), ``entropies`` (:obj:`torch.Tensor`).
        Shapes:
            - x: :math:`(B, N)`, where B is batch size and N is ``obs_shape``.
            - logit: :math:`(B, M)`, where M is ``action_shape``.
            - q: :math:`(B, num_quantiles, M)`.
            - quantiles: :math:`(B, num_quantiles + 1)`.
            - quantiles_hats: :math:`(B, num_quantiles)`.
            - q_tau_i: :math:`(B, num_quantiles - 1, M)`.
            - entropies: :math:`(B, 1)`.
        Examples:
            >>> model = FQF(64, 64)  # arguments: 'obs_shape' and 'action_shape'
            >>> inputs = torch.randn(4, 64)
            >>> outputs = model(inputs)
            >>> assert isinstance(outputs, dict)
            >>> assert outputs['logit'].shape == torch.Size([4, 64])
            >>> # default num_quantiles: int = 32
            >>> assert outputs['q'].shape == torch.Size([4, 32, 64])
            >>> assert outputs['quantiles'].shape == torch.Size([4, 33])
            >>> assert outputs['quantiles_hats'].shape == torch.Size([4, 32])
            >>> assert outputs['q_tau_i'].shape == torch.Size([4, 31, 64])
            >>> assert outputs['entropies'].shape == torch.Size([4, 1])
        """
        x = self.encoder(x)
        x = self.head(x)
        return x


@MODEL_REGISTRY.register('rainbowdqn')
class RainbowDQN(nn.Module):
    """
    Overview:
        The neural network structure and computation graph of RainbowDQN, which combines distributional RL and DQN. \
        You can refer to paper Rainbow: Combining Improvements in Deep Reinforcement Learning \
        https://arxiv.org/pdf/1710.02298.pdf for more details.
    Interfaces:
        ``__init__``, ``forward``

    .. note::
        RainbowDQN contains dueling architecture by default.
    """

    def __init__(
        self,
        obs_shape: Union[int, SequenceType],
        action_shape: Union[int, SequenceType],
        encoder_hidden_size_list: SequenceType = [128, 128, 64],
        head_hidden_size: Optional[int] = None,
        head_layer_num: int = 1,
        activation: Optional[nn.Module] = nn.ReLU(),
        norm_type: Optional[str] = None,
        v_min: Optional[float] = -10,
        v_max: Optional[float] = 10,
        n_atom: Optional[int] = 51,
    ) -> None:
        """
        Overview:
            Initialize the RainbowDQN Model according to input arguments.
        Arguments:
            - obs_shape (:obj:`Union[int, SequenceType]`): Observation space shape.
            - action_shape (:obj:`Union[int, SequenceType]`): Action space shape.
            - encoder_hidden_size_list (:obj:`SequenceType`): Collection of ``hidden_size`` to pass to ``Encoder``.
            - head_hidden_size (:obj:`Optional[int]`): The ``hidden_size`` to pass to ``Head``.
            - head_layer_num (:obj:`int`): The num of layers used in the network to compute Q value output.
            - activation (:obj:`Optional[nn.Module]`): The type of activation function to use in ``MLP`` after \
                ``layer_fn``, if ``None`` then default set to ``nn.ReLU()``.
            - norm_type (:obj:`Optional[str]`): The type of normalization to use, see ``ding.torch_utils.fc_block`` \
                for more details.
            - v_min (:obj:`Optional[float]`): The minimum value of the support of the distribution. Defaults to -10.
            - v_max (:obj:`Optional[float]`): The maximum value of the support of the distribution. Defaults to 10.
            - n_atom (:obj:`Optional[int]`): Number of atoms in the prediction distribution.
        """
        super(RainbowDQN, self).__init__()
        # For compatibility: 1, (1, ), [4, 32, 32]
        obs_shape, action_shape = squeeze(obs_shape), squeeze(action_shape)
        if head_hidden_size is None:
            head_hidden_size = encoder_hidden_size_list[-1]
        # FC Encoder
        if isinstance(obs_shape, int) or len(obs_shape) == 1:
            self.encoder = FCEncoder(obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type)
        # Conv Encoder
        elif len(obs_shape) == 3:
            self.encoder = ConvEncoder(obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type)
        else:
            raise RuntimeError(
                "not support obs_shape for pre-defined encoder: {}, please customize your own RainbowDQN".
                format(obs_shape)
            )
        # Head Type
        multi_head = not isinstance(action_shape, int)
        if multi_head:
            self.head = MultiHead(
                RainbowHead,
                head_hidden_size,
                action_shape,
                layer_num=head_layer_num,
                activation=activation,
                norm_type=norm_type,
                n_atom=n_atom,
                v_min=v_min,
                v_max=v_max,
            )
        else:
            self.head = RainbowHead(
                head_hidden_size,
                action_shape,
                head_layer_num,
                activation=activation,
                norm_type=norm_type,
                n_atom=n_atom,
                v_min=v_min,
                v_max=v_max,
            )

    def forward(self, x: torch.Tensor) -> Dict:
        """
        Overview:
            Use observation tensor to predict RainbowDQN's output, i.e., Q-value and its distribution.
        Arguments:
            - x (:obj:`torch.Tensor`): The input observation tensor data.
        Returns:
            - outputs (:obj:`Dict`): Run ``MLP`` with ``RainbowHead`` setups and return the result prediction \
                dictionary.
        ReturnsKeys:
            - logit (:obj:`torch.Tensor`): Discrete Q-value output of each possible action dimension.
            - distribution (:obj:`torch.Tensor`): Distribution tensor of size ``(B, M, n_atom)``.
        Shapes:
            - x (:obj:`torch.Tensor`): :math:`(B, N)`, where B is batch size and N is ``obs_shape``.
            - logit (:obj:`torch.FloatTensor`): :math:`(B, M)`, where M is ``action_shape``.
            - distribution (:obj:`torch.FloatTensor`): :math:`(B, M, P)`, where P is ``n_atom``.
        Examples:
            >>> model = RainbowDQN(64, 64)  # arguments: 'obs_shape' and 'action_shape'
            >>> inputs = torch.randn(4, 64)
            >>> outputs = model(inputs)
            >>> assert isinstance(outputs, dict)
            >>> assert outputs['logit'].shape == torch.Size([4, 64])
            >>> # default n_atom: int = 51
            >>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])
        """
        x = self.encoder(x)
        x = self.head(x)
        return x


def parallel_wrapper(forward_fn: Callable) -> Callable:
    """
    Overview:
        Process timestep T and batch_size B at the same time; in other words, treat different timestep data as
        different trajectories in a batch.
    Arguments:
        - forward_fn (:obj:`Callable`): Normal ``nn.Module``'s forward function.
    Returns:
        - wrapper (:obj:`Callable`): Wrapped function.
    """

    def wrapper(x: torch.Tensor) -> Union[torch.Tensor, List[torch.Tensor]]:
        T, B = x.shape[:2]

        def reshape(d):
            if isinstance(d, list):
                d = [reshape(t) for t in d]
            elif isinstance(d, dict):
                d = {k: reshape(v) for k, v in d.items()}
            else:
                d = d.reshape(T, B, *d.shape[1:])
            return d

        # NOTE(rjy): the initial input shape will be (T, B, N),
        # which means the encoder or head should process B trajectories, each with T timesteps,
        # but the T and B dimensions can both be treated as batch_size in the encoder and head,
        # i.e., independent and parallel processing,
        # so here we need such a fn to reshape for the encoder or head.
        x = x.reshape(T * B, *x.shape[2:])
        x = forward_fn(x)
        x = reshape(x)
        return x

    return wrapper


@MODEL_REGISTRY.register('drqn')
class DRQN(nn.Module):
    """
    Overview:
        The DRQN (Deep Recurrent Q-Network) is a neural network model combining DQN with RNN to handle sequential
        data and partially observable environments. It consists of three main components: ``encoder``, ``rnn``, \
        and ``head``.
        - **Encoder**: Extracts features from various observation inputs.
        - **RNN**: Processes sequential observations and other data.
        - **Head**: Computes Q-values for each action dimension.

    Interfaces:
        ``__init__``, ``forward``.

    .. note::
        The current implementation supports:
        - Two encoder types: ``FCEncoder`` and ``ConvEncoder``.
        - Two head types: ``DiscreteHead`` and ``DuelingHead``.
        - Three RNN types: ``normal`` (LSTM with LayerNorm), ``pytorch`` (PyTorch's native LSTM), and ``gru``.
        You can extend the model by customizing your own encoder, RNN, or head by inheriting this class.
    """

    def __init__(
        self,
        obs_shape: Union[int, SequenceType],
        action_shape: Union[int, SequenceType],
        encoder_hidden_size_list: SequenceType = [128, 128, 64],
        dueling: bool = True,
        head_hidden_size: Optional[int] = None,
        head_layer_num: int = 1,
        lstm_type: Optional[str] = 'normal',
        activation: Optional[nn.Module] = nn.ReLU(),
        norm_type: Optional[str] = None,
        res_link: bool = False
    ) -> None:
        """
        Overview:
            Initialize the DRQN model with specified parameters.
        Arguments:
            - obs_shape (:obj:`Union[int, SequenceType]`): Shape of the observation space, e.g., 8 or [4, 84, 84].
            - action_shape (:obj:`Union[int, SequenceType]`): Shape of the action space, e.g., 6 or [2, 3, 3].
            - encoder_hidden_size_list (:obj:`SequenceType`): List of hidden sizes for the encoder. The last element \
                must match ``head_hidden_size``.
            - dueling (:obj:`Optional[bool]`): Use ``DuelingHead`` if True, otherwise use ``DiscreteHead``.
            - head_hidden_size (:obj:`Optional[int]`): Hidden size for the head network. Defaults to the last \
                element of ``encoder_hidden_size_list`` if None.
            - head_layer_num (:obj:`int`): Number of layers in the head network to compute Q-value outputs.
            - lstm_type (:obj:`Optional[str]`): Type of RNN module. Supported types are ``normal``, ``pytorch``, \
                and ``gru``.
            - activation (:obj:`Optional[nn.Module]`): Activation function used in the network. Defaults to \
                ``nn.ReLU()``.
            - norm_type (:obj:`Optional[str]`): Normalization type for the networks. Supported types are: \
                ['BN', 'IN', 'SyncBN', 'LN']. See ``ding.torch_utils.fc_block`` for more details.
            - res_link (:obj:`bool`): Enables residual connections between single-frame data and sequential data. \
                Defaults to False.
        """
        super(DRQN, self).__init__()
        # Compatibility for obs_shape/action_shape: Handles scalar, tuple, or multi-dimensional inputs.
        obs_shape, action_shape = squeeze(obs_shape), squeeze(action_shape)
        if head_hidden_size is None:
            head_hidden_size = encoder_hidden_size_list[-1]

        # Encoder: Determines the encoder type based on the observation shape.
        if isinstance(obs_shape, int) or len(obs_shape) == 1:
            # FC Encoder
            self.encoder = FCEncoder(obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type)
        elif len(obs_shape) == 3:
            # Conv Encoder
            self.encoder = ConvEncoder(obs_shape, encoder_hidden_size_list, activation=activation, norm_type=norm_type)
        else:
            raise RuntimeError(
                f"Unsupported obs_shape for pre-defined encoder: {obs_shape}. Please customize your own DRQN."
            )

        # RNN: Initializes the RNN module based on the specified lstm_type.
        self.rnn = get_lstm(lstm_type, input_size=head_hidden_size, hidden_size=head_hidden_size)
        self.res_link = res_link

        # Head: Determines the head type (Dueling or Discrete) and its configuration.
        if dueling:
            head_cls = DuelingHead
        else:
            head_cls = DiscreteHead
        multi_head = not isinstance(action_shape, int)
        if multi_head:
            self.head = MultiHead(
                head_cls,
                head_hidden_size,
                action_shape,
                layer_num=head_layer_num,
                activation=activation,
                norm_type=norm_type
            )
        else:
            self.head = head_cls(
                head_hidden_size, action_shape, head_layer_num, activation=activation, norm_type=norm_type
            )

    def forward(self, inputs: Dict, inference: bool = False, saved_state_timesteps: Optional[list] = None) -> Dict:
        """
        Overview:
            Defines the forward pass of the DRQN model. Takes observation and previous RNN states as inputs \
            and predicts Q-values.
        Arguments:
            - inputs (:obj:`Dict`): Input data dictionary containing observation and previous RNN state.
            - inference (:obj:`bool`): If True, unrolls one timestep (used during evaluation). If False, unrolls \
                the entire sequence (used during training).
            - saved_state_timesteps (:obj:`Optional[list]`): When ``inference`` is False, specifies the timesteps \
                whose hidden states are saved and returned.
        ArgumentsKeys:
            - obs (:obj:`torch.Tensor`): Raw observation tensor.
            - prev_state (:obj:`list`): Previous RNN state tensor, whose structure depends on ``lstm_type``.
        Returns:
            - outputs (:obj:`Dict`): The output of DRQN's forward pass, including logit (q_value) and next state.
        ReturnsKeys:
            - logit (:obj:`torch.Tensor`): Discrete Q-value output for each action dimension.
            - next_state (:obj:`list`): Next RNN state tensor.
        Shapes:
            - obs (:obj:`torch.Tensor`): :math:`(B, N)`, where B is batch size and N is ``obs_shape``.
            - logit (:obj:`torch.Tensor`): :math:`(B, M)`, where B is batch size and M is ``action_shape``.
        Examples:
            >>> # Initialize input keys
            >>> prev_state = [[torch.randn(1, 1, 64) for __ in range(2)] for _ in range(4)]  # B=4
            >>> obs = torch.randn(4, 64)
            >>> model = DRQN(64, 64)  # arguments: 'obs_shape' and 'action_shape'
            >>> outputs = model({'obs': obs, 'prev_state': prev_state}, inference=True)
            >>> # Validate output keys and shapes
            >>> assert isinstance(outputs, dict)
            >>> assert outputs['logit'].shape == (4, 64)
            >>> assert len(outputs['next_state']) == 4
            >>> assert all([len(t) == 2 for t in outputs['next_state']])
            >>> assert all([t[0].shape == (1, 1, 64) for t in outputs['next_state']])
        """

        x, prev_state = inputs['obs'], inputs['prev_state']
        # Forward pass: Encoder -> RNN -> Head.
        # In most situations, set inference=True for evaluation and inference=False for training.
        # Inference mode: Processes one timestep (seq_len=1).
        if inference:
            x = self.encoder(x)
            if self.res_link:
                a = x
            x = x.unsqueeze(0)  # for rnn input, set the seq_len of x to 1 instead of None
            # prev_state: List[Tuple[torch.Tensor]]; initially, it is a list of None
            x, next_state = self.rnn(x, prev_state)
            x = x.squeeze(0)  # delete the seq_len dim to match the head network input
            if self.res_link:
                x = x + a
            x = self.head(x)
            x['next_state'] = next_state
            return x
        # Training mode: Processes the entire sequence.
        else:
            # In order to better explain why the rnn needs saved_state and which states need to be stored,
            # let's take r2d2 as an example. In r2d2:
            # 1) data['burnin_nstep_obs'] = data['obs'][:bs + self._nstep]
            # 2) data['main_obs'] = data['obs'][bs:-self._nstep]
            # 3) data['target_obs'] = data['obs'][bs + self._nstep:]
            assert len(x.shape) in [3, 5], f"Expected shape (T, B, N) or (T, B, C, H, W), got {x.shape}"
            x = parallel_wrapper(self.encoder)(x)  # (T, B, N)
            if self.res_link:
                a = x
            # lstm_embedding stores all hidden states
            lstm_embedding = []
            # TODO(nyz) how to deal with hidden_size key-value
            hidden_state_list = []

            if saved_state_timesteps is not None:
                saved_state = []
            for t in range(x.shape[0]):  # Iterate over timesteps (T).
                # use x[t:t + 1] instead of x[t] to keep the original dimension
                output, prev_state = self.rnn(x[t:t + 1], prev_state)  # RNN step output: (1, B, hidden_size)
                if saved_state_timesteps is not None and t + 1 in saved_state_timesteps:
                    saved_state.append(prev_state)
                lstm_embedding.append(output)
                hidden_state = [p['h'] for p in prev_state]
                # only keep ht, {list: x.shape[0]{Tensor: (1, batch_size, head_hidden_size)}}
                hidden_state_list.append(torch.cat(hidden_state, dim=1))
            x = torch.cat(lstm_embedding, 0)  # (T, B, head_hidden_size)
            if self.res_link:
                x = x + a
            x = parallel_wrapper(self.head)(x)  # (T, B, action_shape)
            # x['next_state'] is the hidden state of the last timestep fed to the lstm,
            # including the hidden state (h) and the cell state (c)
            # shape: {list: B{dict: 2{Tensor: (1, 1, head_hidden_size)}}}
            x['next_state'] = prev_state
            # all hidden states h, a tensor of shape (T, B, head_hidden_size)
            # This key is used in qtran, which requires retaining all h_{t} during training
            x['hidden_state'] = torch.cat(hidden_state_list, dim=0)
            if saved_state_timesteps is not None:
                # the selected saved hidden states, including the hidden state (h) and the cell state (c)
                # in r2d2, set 'saved_hidden_state_timesteps=[self._burnin_step, self._burnin_step + self._nstep]',
                # then saved_state will record the hidden states for main_obs and target_obs to
                # initialize their lstm (h, c)
                x['saved_state'] = saved_state
            return x


@MODEL_REGISTRY.register('gtrxldqn')
class GTrXLDQN(nn.Module):
    """
    Overview:
        The neural network structure and computation graph of the Gated Transformer-XL DQN algorithm, which is an \
        enhanced version of DRQN, using Transformer-XL to improve long-term sequential modelling ability. The \
        GTrXL-DQN is composed of three parts: ``encoder``, ``head`` and ``core``. The ``encoder`` is used to \
        extract the feature from various observation, the ``core`` is used to process the sequential observation \
        and other data, and the ``head`` is used to compute the Q value of each action dimension.
    Interfaces:
        ``__init__``, ``forward``, ``reset_memory``, ``get_memory``.
    """

    def __init__(
            self,
            obs_shape: Union[int, SequenceType],
            action_shape: Union[int, SequenceType],
            head_layer_num: int = 1,
            att_head_dim: int = 16,
            hidden_size: int = 16,
            att_head_num: int = 2,
            att_mlp_num: int = 2,
            att_layer_num: int = 3,
            memory_len: int = 64,
            activation: Optional[nn.Module] = nn.ReLU(),
            head_norm_type: Optional[str] = None,
            dropout: float = 0.,
            gru_gating: bool = True,
            gru_bias: float = 2.,
            dueling: bool = True,
            encoder_hidden_size_list: SequenceType = [128, 128, 256],
            encoder_norm_type: Optional[str] = None,
    ) -> None:
        """
        Overview:
            Initialize the GTrXLDQN model according to the corresponding input arguments.

        .. tip::
            You can refer to the GTrXL class in ``ding.torch_utils.network.gtrxl`` for more details about the \
            input arguments.

        Arguments:
            - obs_shape (:obj:`Union[int, SequenceType]`): Used by Transformer. Observation's space.
            - action_shape (:obj:`Union[int, SequenceType]`): Used by Head. Action's space.
            - head_layer_num (:obj:`int`): Used by Head. Number of layers.
            - att_head_dim (:obj:`int`): Used by Transformer.
            - hidden_size (:obj:`int`): Used by Transformer and Head.
            - att_head_num (:obj:`int`): Used by Transformer.
            - att_mlp_num (:obj:`int`): Used by Transformer.
            - att_layer_num (:obj:`int`): Used by Transformer.
            - memory_len (:obj:`int`): Used by Transformer.
            - activation (:obj:`Optional[nn.Module]`): Used by Transformer and Head. If ``None``, default to \
                ``nn.ReLU()``.
            - head_norm_type (:obj:`Optional[str]`): Used by Head. The type of normalization to use, see \
                ``ding.torch_utils.fc_block`` for more details.
            - dropout (:obj:`float`): Used by Transformer.
            - gru_gating (:obj:`bool`): Used by Transformer.
            - gru_bias (:obj:`float`): Used by Transformer.
            - dueling (:obj:`bool`): Used by Head. Make the head dueling.
            - encoder_hidden_size_list (:obj:`SequenceType`): Used by Encoder. The collection of ``hidden_size`` \
                if using a custom convolutional encoder.
            - encoder_norm_type (:obj:`Optional[str]`): Used by Encoder. The type of normalization to use, see \
                ``ding.torch_utils.fc_block`` for more details.
        """
        super(GTrXLDQN, self).__init__()
        self.core = GTrXL(
            input_dim=obs_shape,
            head_dim=att_head_dim,
            embedding_dim=hidden_size,
            head_num=att_head_num,
            mlp_num=att_mlp_num,
            layer_num=att_layer_num,
            memory_len=memory_len,
            activation=activation,
            dropout_ratio=dropout,
            gru_gating=gru_gating,
            gru_bias=gru_bias,
        )

        # for vector obs, use the Identity Encoder, i.e. pass
        if isinstance(obs_shape, int) or len(obs_shape) == 1:
            pass
        # replace the embedding layer of the Transformer with a Conv Encoder
        elif len(obs_shape) == 3:
            assert encoder_hidden_size_list[-1] == hidden_size
            self.obs_encoder = ConvEncoder(
                obs_shape, encoder_hidden_size_list, activation=activation, norm_type=encoder_norm_type
            )
            self.dropout = nn.Dropout(dropout)
            self.core.use_embedding_layer = False
        else:
            raise RuntimeError(
                "Unsupported obs_shape for pre-defined encoder: {}, please customize your own GTrXL.".format(obs_shape)
            )
        # Head Type
        if dueling:
            head_cls = DuelingHead
        else:
            head_cls = DiscreteHead
        multi_head = not isinstance(action_shape, int)
        if multi_head:
            self.head = MultiHead(
                head_cls,
                hidden_size,
                action_shape,
                layer_num=head_layer_num,
                activation=activation,
                norm_type=head_norm_type
            )
        else:
            self.head = head_cls(
                hidden_size, action_shape, head_layer_num, activation=activation, norm_type=head_norm_type
            )

    def forward(self, x: torch.Tensor) -> Dict:
        """
        Overview:
            Let the input tensor go through GTrXL and the Head sequentially.
        Arguments:
            - x (:obj:`torch.Tensor`): Input tensor of shape (seq_len, bs, obs_shape).
        Returns:
            - out (:obj:`Dict`): Run ``GTrXL`` with the ``DiscreteHead`` setup and return the resulting \
                prediction dictionary.
        ReturnKeys:
            - logit (:obj:`torch.Tensor`): Discrete Q-value output of each action dimension, shape is \
                (B, action_space).
            - memory (:obj:`torch.Tensor`): Memory tensor of size ``(bs x layer_num+1 x memory_len x embedding_dim)``.
            - transformer_out (:obj:`torch.Tensor`): Output tensor of the transformer with the same size as the \
                input ``x``.
        Examples:
            >>> # Init input's Keys:
            >>> obs_dim, seq_len, bs, action_dim = 128, 64, 32, 4
            >>> obs = torch.rand(seq_len, bs, obs_dim)
            >>> model = GTrXLDQN(obs_dim, action_dim)
            >>> outputs = model(obs)
            >>> assert isinstance(outputs, dict)
        """
        if len(x.shape) == 5:
            # 3d obs: cur_seq, bs, ch, h, w
            x_ = x.reshape([x.shape[0] * x.shape[1]] + list(x.shape[-3:]))
            x_ = self.dropout(self.obs_encoder(x_))
            x = x_.reshape(x.shape[0], x.shape[1], -1)
        o1 = self.core(x)
        out = self.head(o1['logit'])
        # layer_num+1 x memory_len x bs x embedding_dim -> bs x layer_num+1 x memory_len x embedding_dim
        out['memory'] = o1['memory'].permute((2, 0, 1, 3)).contiguous()
        out['transformer_out'] = o1['logit']  # output of gtrxl, out['logit'] is the final output
        return out

    def reset_memory(self, batch_size: Optional[int] = None, state: Optional[torch.Tensor] = None) -> None:
        """
        Overview:
            Clear or reset the memory of GTrXL.
        Arguments:
            - batch_size (:obj:`Optional[int]`): The number of samples in a training batch.
            - state (:obj:`Optional[torch.Tensor]`): The input memory data, whose shape is \
                (layer_num, memory_len, bs, embedding_dim).
        """
        self.core.reset_memory(batch_size, state)

    def get_memory(self) -> Optional[torch.Tensor]:
        """
        Overview:
            Return the memory of GTrXL.
        Returns:
            - memory (:obj:`Optional[torch.Tensor]`): The output memory, or None if the memory has not been \
                initialized, whose shape is (layer_num, memory_len, bs, embedding_dim).
        """
        return self.core.get_memory()
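The r2d2-style bookkeeping referenced in the comments of ``DRQN.forward`` (burn-in slicing and ``saved_state_timesteps``) can be sketched without any framework code. The snippet below is a minimal, framework-free illustration: ``burnin_step``, ``nstep``, ``T`` and the ``state_after_t*`` strings are hypothetical stand-ins chosen for this example, not values from DI-engine.

```python
# Hypothetical r2d2-style settings (illustrative only).
burnin_step, nstep = 2, 1
T = 8                               # total unroll length
obs = list(range(T))                # stand-in for data['obs'] of shape (T, B, N)

# The three slices described in the comments of DRQN.forward:
burnin_nstep_obs = obs[:burnin_step + nstep]   # [0, 1, 2]
main_obs = obs[burnin_step:-nstep]             # [2, 3, 4, 5, 6]
target_obs = obs[burnin_step + nstep:]         # [3, 4, 5, 6, 7]

# Timesteps whose post-step hidden states are recorded, mirroring
# saved_hidden_state_timesteps=[burnin_step, burnin_step + nstep] in r2d2.
saved_state_timesteps = [burnin_step, burnin_step + nstep]

prev_state = None                   # initially a list of None in DRQN
saved_state, hidden_state_list = [], []
for t in range(T):
    # stand-in for: output, prev_state = self.rnn(x[t:t + 1], prev_state)
    prev_state = f"state_after_t{t}"
    # DRQN records prev_state when (t + 1) is in saved_state_timesteps,
    # i.e. the state produced *after* consuming timestep t.
    if t + 1 in saved_state_timesteps:
        saved_state.append(prev_state)
    hidden_state_list.append(prev_state)

print(saved_state)  # ['state_after_t1', 'state_after_t2']
```

The saved states are exactly those needed to initialize the RNN for ``main_obs`` (after the burn-in) and for ``target_obs`` (``nstep`` further along), which is why r2d2 requests those two timesteps.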