
ding.torch_utils.network.transformer


Attention

Bases: Module

Overview

For each entry embedding, compute attention weights over all entries and aggregate the weighted values to produce the output attention.

Interfaces: __init__, split, forward

__init__(input_dim, head_dim, output_dim, head_num, dropout)

Overview

Initialize the Attention module with the provided dimensions and dropout layer.

Arguments:

- input_dim (int): The dimension of the input.
- head_dim (int): The dimension of each head in the multi-head attention mechanism.
- output_dim (int): The dimension of the output.
- head_num (int): The number of heads in the multi-head attention mechanism.
- dropout (nn.Module): The dropout layer used in the attention mechanism.

split(x, T=False)

Overview

Split the input to get multi-head queries, keys, and values.

Arguments:

- x (torch.Tensor): The tensor to be split, which could be a query, key, or value.
- T (bool, optional): If True, transpose the last two dimensions of the output (used for keys). Defaults to False.

Returns:

- x (torch.Tensor): The reshaped tensor with a separate head dimension, of shape (B, head_num, N, head_dim), or (B, head_num, head_dim, N) when T is True.

forward(x, mask=None)

Overview

Compute the attention from the input tensor.

Arguments:

- x (torch.Tensor): The input tensor for the forward computation.
- mask (Optional[torch.Tensor], optional): Optional mask to exclude invalid entries. Defaults to None.

Returns:

- attention (torch.Tensor): The computed attention tensor.
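The computation above can be sketched in plain PyTorch. Note this is an illustrative approximation, not the module itself: `nn.Linear` stands in for DI-engine's `fc_block`, and dropout is omitted.

```python
import math
import torch
import torch.nn.functional as F

B, N, input_dim, head_num, head_dim, output_dim = 2, 5, 16, 4, 8, 32
x = torch.randn(B, N, input_dim)

# joint projection to query/key/value, analogous to attention_pre
pre = torch.nn.Linear(input_dim, head_dim * head_num * 3)
q, k, v = torch.chunk(pre(x), 3, dim=2)

def split(t, T=False):
    # (B, N, head_num * head_dim) -> (B, head_num, N, head_dim)
    t = t.view(B, N, head_num, head_dim).permute(0, 2, 1, 3)
    return t.transpose(-2, -1) if T else t

score = torch.matmul(split(q), split(k, T=True)) / math.sqrt(head_dim)
weight = F.softmax(score, dim=-1)                 # (B, head_num, N, N)
out = torch.matmul(weight, split(v))              # (B, head_num, N, head_dim)
out = out.permute(0, 2, 1, 3).reshape(B, N, -1)   # concatenate heads
project = torch.nn.Linear(head_dim * head_num, output_dim)
attention = project(out)                          # (B, N, output_dim)
```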

TransformerLayer

Bases: Module

Overview

In a Transformer layer, attention is first computed across the entries, and a feedforward (MLP) layer is then applied.

Interfaces: __init__, forward

__init__(input_dim, head_dim, hidden_dim, output_dim, head_num, mlp_num, dropout, activation)

Overview

Initialize the TransformerLayer with the provided dimensions, dropout layer, and activation function.

Arguments:

- input_dim (int): The dimension of the input.
- head_dim (int): The dimension of each head in the multi-head attention mechanism.
- hidden_dim (int): The dimension of the hidden layer in the MLP (Multi-Layer Perceptron).
- output_dim (int): The dimension of the output.
- head_num (int): The number of heads in the multi-head attention mechanism.
- mlp_num (int): The number of layers in the MLP.
- dropout (nn.Module): The dropout layer used in the attention mechanism.
- activation (nn.Module): The activation function used in the MLP.

forward(inputs)

Overview

Compute the forward pass through the Transformer layer.

Arguments:

- inputs (Tuple[torch.Tensor, torch.Tensor]): A tuple containing the input tensor x and the mask tensor.

Returns:

- output (Tuple[torch.Tensor, torch.Tensor]): A tuple containing the output tensor and the unchanged mask tensor.
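The layer follows a post-norm residual pattern: each sublayer (attention, then MLP) is wrapped in a residual connection followed by LayerNorm. A minimal sketch with standard torch modules, where `nn.Linear` is a hypothetical stand-in for both sublayers:

```python
import torch
import torch.nn as nn

dim = 8
ln1, ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
attn = nn.Linear(dim, dim)   # stand-in for the attention sublayer
mlp = nn.Linear(dim, dim)    # stand-in for the MLP sublayer

x = torch.randn(2, 5, dim)
x = ln1(x + attn(x))   # residual + LayerNorm around attention
x = ln2(x + mlp(x))    # residual + LayerNorm around the MLP
```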

Transformer

Bases: Module

Overview

Implementation of the Transformer model.

Note: For more details, refer to "Attention is All You Need": http://arxiv.org/abs/1706.03762.

Interfaces: __init__, forward

__init__(input_dim, head_dim=128, hidden_dim=1024, output_dim=256, head_num=2, mlp_num=2, layer_num=3, dropout_ratio=0.0, activation=nn.ReLU())

Overview

Initialize the Transformer with the provided dimensions, dropout layer, activation function, and layer numbers.

Arguments:

- input_dim (int): The dimension of the input.
- head_dim (int): The dimension of each head in the multi-head attention mechanism.
- hidden_dim (int): The dimension of the hidden layer in the MLP (Multi-Layer Perceptron).
- output_dim (int): The dimension of the output.
- head_num (int): The number of heads in the multi-head attention mechanism.
- mlp_num (int): The number of layers in the MLP.
- layer_num (int): The number of Transformer layers.
- dropout_ratio (float): The dropout ratio for the dropout layer.
- activation (nn.Module): The activation function used in the MLP.

forward(x, mask=None)

Overview

Perform the forward pass through the Transformer.

Arguments:

- x (torch.Tensor): The input tensor of shape (B, N, C), where B is the batch size, N is the number of entries, and C is the feature dimension.
- mask (Optional[torch.Tensor], optional): A boolean mask of shape (B, N) used to mask out invalid entries in attention. Defaults to None.

Returns:

- x (torch.Tensor): The output tensor from the Transformer.
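Internally, a (B, N) boolean mask is expanded to shape (B, 1, N, N) before being handed to the layers, so it broadcasts across heads and applies the same per-key validity to every query position. A quick check of that reshaping (the same chain of `unsqueeze`/`repeat` calls the forward pass uses):

```python
import torch

B, N = 2, 4
mask = torch.tensor([[True, True, True, False],
                     [True, True, False, False]])
expanded = mask.unsqueeze(dim=1).repeat(1, mask.shape[1], 1).unsqueeze(dim=1)
print(expanded.shape)  # torch.Size([2, 1, 4, 4])
```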

ScaledDotProductAttention

Bases: Module

Overview

Implementation of Scaled Dot Product Attention, a key component of Transformer models. This class performs the dot product of the query, key and value tensors, scales it with the square root of the dimension of the key vector (d_k) and applies dropout for regularization.

Interfaces: __init__, forward

__init__(d_k, dropout=0.0)

Overview

Initialize the ScaledDotProductAttention module with the dimension of the key vector and the dropout rate.

Arguments:

- d_k (int): The dimension of the key vector, used to scale the dot product of the query and key.
- dropout (float, optional): The dropout rate applied after the softmax operation. Defaults to 0.0.

forward(q, k, v, mask=None)

Overview

Perform the Scaled Dot Product Attention operation on the query, key and value tensors.

Arguments:

- q (torch.Tensor): The query tensor.
- k (torch.Tensor): The key tensor.
- v (torch.Tensor): The value tensor.
- mask (Optional[torch.Tensor]): An optional mask tensor applied to the attention scores. Defaults to None.

Returns:

- output (torch.Tensor): The output tensor after the attention operation.
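The same computation can be written with plain torch ops, mirroring the module's forward pass (here with no mask and dropout disabled, so it is a sketch of the math rather than the module itself):

```python
import torch
import torch.nn.functional as F

B, H, N, d_k = 2, 4, 6, 8
q = torch.randn(B, H, N, d_k)
k = torch.randn(B, H, N, d_k)
v = torch.randn(B, H, N, d_k)

# scores scaled by sqrt(d_k), as in the module's forward
attn = torch.matmul(q / (d_k ** 0.5), k.transpose(2, 3))  # (B, H, N, N)
attn = F.softmax(attn, dim=-1)
output = torch.matmul(attn, v)                            # (B, H, N, d_k)
```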

Full Source Code

../ding/torch_utils/network/transformer.py

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from typing import List, Optional, Tuple

from .nn_module import fc_block, build_normalization


class Attention(nn.Module):
    """
    Overview:
        For each entry embedding, compute individual attention across all entries, add them up to get output attention.
    Interfaces:
        ``__init__``, ``split``, ``forward``
    """

    def __init__(self, input_dim: int, head_dim: int, output_dim: int, head_num: int, dropout: nn.Module) -> None:
        """
        Overview:
            Initialize the Attention module with the provided dimensions and dropout layer.
        Arguments:
            - input_dim (:obj:`int`): The dimension of the input.
            - head_dim (:obj:`int`): The dimension of each head in the multi-head attention mechanism.
            - output_dim (:obj:`int`): The dimension of the output.
            - head_num (:obj:`int`): The number of heads in the multi-head attention mechanism.
            - dropout (:obj:`nn.Module`): The dropout layer used in the attention mechanism.
        """
        super(Attention, self).__init__()
        self.head_num = head_num
        self.head_dim = head_dim
        self.dropout = dropout
        self.attention_pre = fc_block(input_dim, head_dim * head_num * 3)  # query, key, value
        self.project = fc_block(head_dim * head_num, output_dim)

    def split(self, x: torch.Tensor, T: bool = False) -> List[torch.Tensor]:
        """
        Overview:
            Split the input to get multi-head queries, keys, and values.
        Arguments:
            - x (:obj:`torch.Tensor`): The tensor to be split, which could be a query, key, or value.
            - T (:obj:`bool`, optional): If True, transpose the output tensors. Defaults to False.
        Returns:
            - x (:obj:`List[torch.Tensor]`): A list of output tensors for each head.
        """
        B, N = x.shape[:2]
        x = x.view(B, N, self.head_num, self.head_dim)
        x = x.permute(0, 2, 1, 3).contiguous()  # B, head_num, N, head_dim
        if T:
            x = x.permute(0, 1, 3, 2).contiguous()
        return x

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """
        Overview:
            Compute the attention from the input tensor.
        Arguments:
            - x (:obj:`torch.Tensor`): The input tensor for the forward computation.
            - mask (:obj:`Optional[torch.Tensor]`, optional): Optional mask to exclude invalid entries.
                Defaults to None.
        Returns:
            - attention (:obj:`torch.Tensor`): The computed attention tensor.
        """
        assert (len(x.shape) == 3)
        B, N = x.shape[:2]
        x = self.attention_pre(x)
        query, key, value = torch.chunk(x, 3, dim=2)
        query, key, value = self.split(query), self.split(key, T=True), self.split(value)

        score = torch.matmul(query, key)  # B, head_num, N, N
        score /= math.sqrt(self.head_dim)
        if mask is not None:
            # inplace modification for reasonable softmax
            score.masked_fill_(~mask, value=-1e9)

        score = F.softmax(score, dim=-1)
        score = self.dropout(score)
        attention = torch.matmul(score, value)  # B, head_num, N, head_dim

        attention = attention.permute(0, 2, 1, 3).contiguous()  # B, N, head_num, head_dim
        attention = self.project(attention.view(B, N, -1))  # B, N, output_dim
        return attention


class TransformerLayer(nn.Module):
    """
    Overview:
        In transformer layer, first computes entries's attention and applies a feedforward layer.
    Interfaces:
        ``__init__``, ``forward``
    """

    def __init__(
        self, input_dim: int, head_dim: int, hidden_dim: int, output_dim: int, head_num: int, mlp_num: int,
        dropout: nn.Module, activation: nn.Module
    ) -> None:
        """
        Overview:
            Initialize the TransformerLayer with the provided dimensions, dropout layer, and activation function.
        Arguments:
            - input_dim (:obj:`int`): The dimension of the input.
            - head_dim (:obj:`int`): The dimension of each head in the multi-head attention mechanism.
            - hidden_dim (:obj:`int`): The dimension of the hidden layer in the MLP (Multi-Layer Perceptron).
            - output_dim (:obj:`int`): The dimension of the output.
            - head_num (:obj:`int`): The number of heads in the multi-head attention mechanism.
            - mlp_num (:obj:`int`): The number of layers in the MLP.
            - dropout (:obj:`nn.Module`): The dropout layer used in the attention mechanism.
            - activation (:obj:`nn.Module`): The activation function used in the MLP.
        """
        super(TransformerLayer, self).__init__()
        self.attention = Attention(input_dim, head_dim, output_dim, head_num, dropout)
        self.layernorm1 = build_normalization('LN')(output_dim)
        self.dropout = dropout
        layers = []
        dims = [output_dim] + [hidden_dim] * (mlp_num - 1) + [output_dim]
        for i in range(mlp_num):
            layers.append(fc_block(dims[i], dims[i + 1], activation=activation))
            if i != mlp_num - 1:
                layers.append(self.dropout)
        layers.append(self.dropout)
        self.mlp = nn.Sequential(*layers)
        self.layernorm2 = build_normalization('LN')(output_dim)

    def forward(self, inputs: Tuple[torch.Tensor, torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Overview:
            Compute the forward pass through the Transformer layer.
        Arguments:
            - inputs (:obj:`Tuple[torch.Tensor, torch.Tensor]`): A tuple containing the input tensor `x` and
                the mask tensor.
        Returns:
            - output (:obj:`Tuple[torch.Tensor, torch.Tensor]`): A tuple containing the predicted value tensor and
                the mask tensor.
        """
        x, mask = inputs
        a = self.dropout(self.attention(x, mask))
        x = self.layernorm1(x + a)
        m = self.dropout(self.mlp(x))
        x = self.layernorm2(x + m)
        return x, mask


class Transformer(nn.Module):
    """
    Overview:
        Implementation of the Transformer model.

    .. note::
        For more details, refer to "Attention is All You Need": http://arxiv.org/abs/1706.03762.

    Interfaces:
        ``__init__``, ``forward``
    """

    def __init__(
        self,
        input_dim: int,
        head_dim: int = 128,
        hidden_dim: int = 1024,
        output_dim: int = 256,
        head_num: int = 2,
        mlp_num: int = 2,
        layer_num: int = 3,
        dropout_ratio: float = 0.,
        activation: nn.Module = nn.ReLU(),
    ):
        """
        Overview:
            Initialize the Transformer with the provided dimensions, dropout layer, activation function,
            and layer numbers.
        Arguments:
            - input_dim (:obj:`int`): The dimension of the input.
            - head_dim (:obj:`int`): The dimension of each head in the multi-head attention mechanism.
            - hidden_dim (:obj:`int`): The dimension of the hidden layer in the MLP (Multi-Layer Perceptron).
            - output_dim (:obj:`int`): The dimension of the output.
            - head_num (:obj:`int`): The number of heads in the multi-head attention mechanism.
            - mlp_num (:obj:`int`): The number of layers in the MLP.
            - layer_num (:obj:`int`): The number of Transformer layers.
            - dropout_ratio (:obj:`float`): The dropout ratio for the dropout layer.
            - activation (:obj:`nn.Module`): The activation function used in the MLP.
        """
        super(Transformer, self).__init__()
        self.embedding = fc_block(input_dim, output_dim, activation=activation)
        self.act = activation
        layers = []
        dims = [output_dim] + [output_dim] * layer_num
        self.dropout = nn.Dropout(dropout_ratio)
        for i in range(layer_num):
            layers.append(
                TransformerLayer(dims[i], head_dim, hidden_dim, dims[i + 1], head_num, mlp_num, self.dropout, self.act)
            )
        self.main = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """
        Overview:
            Perform the forward pass through the Transformer.
        Arguments:
            - x (:obj:`torch.Tensor`): The input tensor, with shape `(B, N, C)`, where `B` is batch size, \
                `N` is the number of entries, and `C` is the feature dimension.
            - mask (:obj:`Optional[torch.Tensor]`, optional): The mask tensor (bool), used to mask out invalid \
                entries in attention. It has shape `(B, N)`, where `B` is batch size and `N` is number of \
                entries. Defaults to None.
        Returns:
            - x (:obj:`torch.Tensor`): The output tensor from the Transformer.
        """
        if mask is not None:
            mask = mask.unsqueeze(dim=1).repeat(1, mask.shape[1], 1).unsqueeze(dim=1)
        x = self.embedding(x)
        x = self.dropout(x)
        x, mask = self.main((x, mask))
        return x


class ScaledDotProductAttention(nn.Module):
    """
    Overview:
        Implementation of Scaled Dot Product Attention, a key component of Transformer models.
        This class performs the dot product of the query, key and value tensors, scales it with the square root of the
        dimension of the key vector (d_k) and applies dropout for regularization.
    Interfaces:
        ``__init__``, ``forward``
    """

    def __init__(self, d_k: int, dropout: float = 0.0) -> None:
        """
        Overview:
            Initialize the ScaledDotProductAttention module with the dimension of the key vector and the dropout rate.
        Arguments:
            - d_k (:obj:`int`): The dimension of the key vector. This will be used to scale the dot product of the \
                query and key.
            - dropout (:obj:`float`, optional): The dropout rate to be applied after the softmax operation. \
                Defaults to 0.0.
        """
        super(ScaledDotProductAttention, self).__init__()
        self.d_k = d_k
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Overview:
            Perform the Scaled Dot Product Attention operation on the query, key and value tensors.
        Arguments:
            - q (:obj:`torch.Tensor`): The query tensor.
            - k (:obj:`torch.Tensor`): The key tensor.
            - v (:obj:`torch.Tensor`): The value tensor.
            - mask (:obj:`Optional[torch.Tensor]`): An optional mask tensor to be applied on the attention scores.
                Defaults to None.
        Returns:
            - output (:obj:`torch.Tensor`): The output tensor after the attention operation.
        """
        attn = torch.matmul(q / (self.d_k ** 0.5), k.transpose(2, 3))
        if mask is not None:
            # inplace modification for reasonable softmax
            attn.masked_fill_(~mask, -1e9)
        attn = self.dropout(F.softmax(attn, dim=-1))
        output = torch.matmul(attn, v)
        return output