
ding.reward_model.red_irl_model


SENet

Bases: Module

Support estimation network: a two-layer MLP with a Tanh activation after each linear layer.
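The forward pass below applies two linear layers, each followed by tanh. A minimal pure-Python sketch of the same computation (illustrative hand-picked weights; the real SENet uses torch.nn.Linear):

```python
import math

def senet_forward(x, w1, b1, w2, b2):
    """Two-layer MLP with tanh after each layer, mirroring SENet.forward."""
    # Hidden layer: linear transform followed by tanh.
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    # Output layer: another linear transform, again squashed by tanh.
    return [math.tanh(sum(wi * hi for wi, hi in zip(row, hidden)) + b)
            for row, b in zip(w2, b2)]

# Toy 2-in / 2-hidden / 1-out network with made-up weights.
w1 = [[0.5, -0.5], [1.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [[1.0, -1.0]]
b2 = [0.0]
y = senet_forward([1.0, 2.0], w1, b1, w2, b2)
```

Because the final activation is also tanh, every output component lies in (-1, 1), which bounds the squared online/target discrepancy used later for the reward.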

RedRewardModel

Bases: BaseRewardModel

Overview

The implementation of the reward model in RED, i.e. Random Expert Distillation (https://arxiv.org/abs/1905.06750).

Interface: estimate, train, load_expert_data, collect_data, clear_data, __init__, _train

Config:

==  ======================  =====  ===============  ======================================  =============================
ID  Symbol                  Type   Default Value    Description                             Other (Shape)
==  ======================  =====  ===============  ======================================  =============================
1   type                    str    red              Reward model register name; refer to
                                                    registry REWARD_MODEL_REGISTRY
2   expert_data_path        str    expert_data.pkl  Path to the expert dataset              Should be a '.pkl' file
3   sample_size             int    1000             Sample data from the expert dataset
                                                    with a fixed size
4   sigma                   float  0.5              Hyperparameter of r(s,a)                r(s,a) = exp(-sigma * L(s,a))
5   batch_size              int    64               Training batch size
6   hidden_size             int    128              Linear model hidden size
7   update_per_collect      int    100              Number of updates per collect
8   clear_buffer_per_iters  int    1                Clear the buffer every fixed number     Make sure the replay buffer's
                                                    of iterations                           data count isn't too few
                                                                                            (code works in entry)
==  ======================  =====  ===============  ======================================  =============================

Properties:

- online_net (:obj:SENet): The reward model; by default it is initialized once as training begins.
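The sigma hyperparameter controls how the support-estimation error L(s,a) is mapped to a reward via r(s,a) = exp(-sigma * L(s,a)). A minimal sketch of that transform (the error values are illustrative; sigma defaults to the 0.5 used in the config):

```python
import math

def red_reward(prediction_error: float, sigma: float = 0.5) -> float:
    """RED reward: r(s, a) = exp(-sigma * L(s, a)).

    L(s, a) is the squared discrepancy between the online and target
    support-estimation networks: in-support state-action pairs have a
    small error and get a reward near 1; out-of-support pairs get a
    reward that decays toward 0.
    """
    return math.exp(-sigma * prediction_error)
```

A larger sigma makes the reward decay faster with error, sharpening the distinction between expert-like and non-expert behavior.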

__init__(config, device, tb_logger)

Overview

Initialize self. See help(type(self)) for accurate signature.

Arguments:

- cfg (:obj:Dict): Training config.
- device (:obj:str): Device to use, i.e. "cpu" or "cuda".
- tb_logger (:obj:SummaryWriter): Logger, by default set as 'SummaryWriter', for model summary.

load_expert_data()

Overview

Get the expert data from the path stored in config['expert_data_path'], then subsample it to a fixed size (config['sample_size']).

Effects: This is a side effect function which updates the expert data attribute (e.g. self.expert_data)
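The loading logic can be sketched as follows; an in-memory pickle blob stands in for the .pkl file on disk, and the function name is a hypothetical stand-in for the method:

```python
import pickle
import random

def load_expert_data(raw_pickle: bytes, sample_size: int) -> list:
    """Deserialize expert data and subsample it to a fixed size,
    mirroring the method's pickle.load + random.sample steps."""
    data = pickle.loads(raw_pickle)
    # Never ask for more items than the dataset actually contains.
    n = min(len(data), sample_size)
    return random.sample(data, n)

# Fake expert dataset of 5 transitions; keep at most 3 of them.
blob = pickle.dumps([{'obs': i, 'action': i} for i in range(5)])
subset = load_expert_data(blob, sample_size=3)
```

The min() guard matters: random.sample raises ValueError if asked for more items than the population holds.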

train()

Overview

Train the RED reward model. By default, the RED model is trained only once; later calls issue a one-time warning and skip training.

Effects: - This is a side effect function which updates the reward model and increments the train iteration count.
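The train-once behavior reduces to a simple guard flag. A hypothetical minimal class (not the ding API) sketching that pattern:

```python
class TrainOnceModel:
    """Sketch of RED's train-once guard: the first call to train()
    runs the update loop; every later call is a no-op warning."""

    def __init__(self):
        self.train_once_flag = False
        self.train_calls = 0  # stands in for the real update counter

    def train(self):
        if self.train_once_flag:
            # Mirrors the one_time_warning emitted by the real model.
            print('RED model should be trained once, we do not train it anymore')
            return
        self.train_calls += 1  # stands in for the real update loop
        self.train_once_flag = True

model = TrainOnceModel()
model.train()
model.train()  # ignored: the model is already trained
```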

estimate(data)

Overview

Estimate reward by rewriting the reward key

Arguments:

- data (:obj:list): The list of data used for estimation, with at least obs and action keys.

Effects:

- This is a side effect function which updates the reward values; the reward part of the data is deep-copied first, so rewards already stored in the replay buffer are not modified. The rewritten data is returned.
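The reward-rewriting step can be sketched as below. A plain copy.deepcopy stands in for the real reward_deepcopy helper, and the per-item 'error' value stands in for the squared online/target network discrepancy (no networks involved in this sketch):

```python
import copy
import math

def estimate(data: list, sigma: float = 0.5) -> list:
    """Rewrite each item's 'reward' key with exp(-sigma * error),
    leaving the caller's original list untouched."""
    out = copy.deepcopy(data)  # protect the replay buffer's stored rewards
    for item in out:
        item['reward'] = math.exp(-sigma * item['error'])
    return out

batch = [{'obs': 0, 'action': 0, 'error': 0.0, 'reward': None},
         {'obs': 1, 'action': 1, 'error': 4.0, 'reward': None}]
rewritten = estimate(batch)
```

The copy is the important part: without it, rewriting rewards would silently corrupt the data still held in the replay buffer.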

collect_data(data)

Overview

Collect training data. Not implemented, since the reward model (i.e. online_net) is only trained once; if online_net were trained continuously, this method would need an implementation.

clear_data()

Overview

Clear training data. Not implemented, since the reward model (i.e. online_net) is only trained once; if online_net were trained continuously, this method would need an implementation.

Full Source Code

../ding/reward_model/red_irl_model.py

from typing import Dict, List
import pickle
import random
import torch
import torch.nn as nn
import torch.optim as optim

from ding.utils import REWARD_MODEL_REGISTRY, one_time_warning
from .base_reward_model import BaseRewardModel


class SENet(nn.Module):
    """support estimation network"""

    def __init__(self, input_size: int, hidden_size: int, output_dims: int) -> None:
        super(SENet, self).__init__()
        self.l_1 = nn.Linear(input_size, hidden_size)
        self.l_2 = nn.Linear(hidden_size, output_dims)
        self.act = nn.Tanh()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.l_1(x)
        out = self.act(out)
        out = self.l_2(out)
        out = self.act(out)
        return out


@REWARD_MODEL_REGISTRY.register('red')
class RedRewardModel(BaseRewardModel):
    """
    Overview:
        The implementation of the reward model in RED (https://arxiv.org/abs/1905.06750)
    Interface:
        ``estimate``, ``train``, ``load_expert_data``, ``collect_data``, ``clear_data``, \
        ``__init__``, ``_train``
    Config:
        == ========================== ===== ============= ======================================= =======================
        ID Symbol                     Type  Default Value Description                             Other(Shape)
        == ========================== ===== ============= ======================================= =======================
        1  ``type``                   str   red           | Reward model register name, refer    |
                                                          | to registry ``REWARD_MODEL_REGISTRY``|
        2  | ``expert_data_``         str   expert_data   | Path to the expert dataset           | Should be a '.pkl'
           | ``path``                       .pkl          |                                      | file
        3  | ``sample_size``          int   1000          | Sample data from expert dataset      |
                                                          | with fixed size                      |
        4  | ``sigma``                float 0.5           | Hyperparameter of r(s,a)             | r(s,a) = exp(
                                                          |                                      | -sigma * L(s,a))
        5  | ``batch_size``           int   64            | Training batch size                  |
        6  | ``hidden_size``          int   128           | Linear model hidden size             |
        7  | ``update_per_``          int   100           | Number of updates per collect        |
           | ``collect``                                  |                                      |
        8  | ``clear_buffer``         int   1             | Clear buffer per fixed iters         | Make sure replay
           | ``_per_iters``                               |                                      | buffer's data count
                                                          |                                      | isn't too few.
                                                          |                                      | (code works in entry)
        == ========================== ===== ============= ======================================= =======================
    Properties:
        - online_net (:obj:`SENet`): The reward model, by default initialized once as the training begins.
    """
    config = dict(
        # (str) Reward model register name, refer to registry ``REWARD_MODEL_REGISTRY``.
        type='red',
        # (int) Linear model input size.
        # input_size=4,
        # (int) Sample data from expert dataset with fixed size.
        sample_size=1000,
        # (int) Linear model hidden size.
        hidden_size=128,
        # (float) The step size of gradient descent.
        learning_rate=1e-3,
        # (int) How many updates(iterations) to train after collector's one collection.
        # Bigger "update_per_collect" means bigger off-policy.
        # collect data -> update policy -> collect data -> ...
        update_per_collect=100,
        # (str) Path to the expert dataset.
        # expert_data_path='expert_data.pkl',
        # (int) How many samples in a training batch.
        batch_size=64,
        # (float) Hyperparameter at estimated score of r(s,a).
        # r(s,a) = exp(-sigma * L(s,a))
        sigma=0.5,
        # (int) Clear buffer per fixed iters.
        clear_buffer_per_iters=1,
    )

    def __init__(self, config: Dict, device: str, tb_logger: 'SummaryWriter') -> None:  # noqa
        """
        Overview:
            Initialize ``self.`` See ``help(type(self))`` for accurate signature.
        Arguments:
            - cfg (:obj:`Dict`): Training config
            - device (:obj:`str`): Device usage, i.e. "cpu" or "cuda"
            - tb_logger (:obj:`SummaryWriter`): Logger, defaultly set as 'SummaryWriter' for model summary
        """
        super(RedRewardModel, self).__init__()
        self.cfg: Dict = config
        self.expert_data: List[tuple] = []
        self.device = device
        assert device in ["cpu", "cuda"] or "cuda" in device
        self.tb_logger = tb_logger
        self.target_net: SENet = SENet(config.input_size, config.hidden_size, 1)
        self.online_net: SENet = SENet(config.input_size, config.hidden_size, 1)
        self.target_net.to(device)
        self.online_net.to(device)
        self.opt: optim.Adam = optim.Adam(self.online_net.parameters(), config.learning_rate)
        self.train_once_flag = False

        self.load_expert_data()

    def load_expert_data(self) -> None:
        """
        Overview:
            Getting the expert data from ``config['expert_data_path']`` attribute in self.
        Effects:
            This is a side effect function which updates the expert data attribute (e.g. ``self.expert_data``)
        """
        with open(self.cfg.expert_data_path, 'rb') as f:
            self.expert_data = pickle.load(f)
        sample_size = min(len(self.expert_data), self.cfg.sample_size)
        self.expert_data = random.sample(self.expert_data, sample_size)
        print('the expert data size is:', len(self.expert_data))

    def _train(self, batch_data: torch.Tensor) -> float:
        """
        Overview:
            Helper function for ``train`` which calculates loss for train data and expert data.
        Arguments:
            - batch_data (:obj:`torch.Tensor`): Data used for training
        Returns:
            - Combined loss calculated of reward model from using ``batch_data`` in both target and reward models.
        """
        with torch.no_grad():
            target = self.target_net(batch_data)
        hat: torch.Tensor = self.online_net(batch_data)
        loss: torch.Tensor = ((hat - target) ** 2).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()

    def train(self) -> None:
        """
        Overview:
            Training the RED reward model. By default, the RED model should be trained once.
        Effects:
            - This is a side effect function which updates the reward model and increments the train iteration count.
        """
        if self.train_once_flag:
            one_time_warning('RED model should be trained once, we do not train it anymore')
        else:
            for i in range(self.cfg.update_per_collect):
                sample_batch = random.sample(self.expert_data, self.cfg.batch_size)
                states_data = []
                actions_data = []
                for item in sample_batch:
                    states_data.append(item['obs'])
                    actions_data.append(item['action'])
                states_tensor: torch.Tensor = torch.stack(states_data).float()
                actions_tensor: torch.Tensor = torch.stack(actions_data).float()
                states_actions_tensor: torch.Tensor = torch.cat([states_tensor, actions_tensor], dim=1)
                states_actions_tensor = states_actions_tensor.to(self.device)
                loss = self._train(states_actions_tensor)
                self.tb_logger.add_scalar('reward_model/red_loss', loss, i)
            self.train_once_flag = True

    def estimate(self, data: list) -> List[Dict]:
        """
        Overview:
            Estimate reward by rewriting the reward key
        Arguments:
            - data (:obj:`list`): the list of data used for estimation, \
                with at least ``obs`` and ``action`` keys.
        Effects:
            - This is a side effect function which updates the reward values in place.
        """
        # NOTE: deepcopying the reward part of the data is very important,
        # otherwise the reward of data in the replay buffer will be incorrectly modified.
        train_data_augmented = self.reward_deepcopy(data)
        states_data = []
        actions_data = []
        for item in train_data_augmented:
            states_data.append(item['obs'])
            actions_data.append(item['action'])
        states_tensor = torch.stack(states_data).float()
        actions_tensor = torch.stack(actions_data).float()
        states_actions_tensor = torch.cat([states_tensor, actions_tensor], dim=1)
        states_actions_tensor = states_actions_tensor.to(self.device)
        with torch.no_grad():
            hat_1 = self.online_net(states_actions_tensor)
            hat_2 = self.target_net(states_actions_tensor)
        c = ((hat_1 - hat_2) ** 2).mean(dim=1)
        r = torch.exp(-self.cfg.sigma * c)
        for item, rew in zip(train_data_augmented, r):
            item['reward'] = rew
        return train_data_augmented

    def collect_data(self, data) -> None:
        """
        Overview:
            Collect training data. Not implemented, since the reward model (i.e. ``online_net``) is only \
            trained once; if ``online_net`` were trained continuously, this method would need an implementation.
        """
        # if online_net is trained continuously, there should be some implementations in collect_data method
        pass

    def clear_data(self):
        """
        Overview:
            Clear training data. Not implemented, since the reward model (i.e. ``online_net``) is only \
            trained once; if ``online_net`` were trained continuously, this method would need an implementation.
        """
        # if online_net is trained continuously, there should be some implementations in clear_data method
        pass