ding.reward_model.red_irl_model
SENet
Bases: Module
Support estimation network.
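A minimal sketch of what a support estimation network can look like, following the random-expert-distillation idea of the RED paper: a trainable predictor learns to match a frozen, randomly initialized target network, and the per-sample prediction error serves as the support estimate L(s, a). The class name, layer sizes, and output dimension below are illustrative assumptions, not the actual SENet implementation:

```python
import torch
import torch.nn as nn


class SENetSketch(nn.Module):
    """Illustrative support estimation network: a trainable predictor
    is distilled toward a frozen, randomly initialized target network."""

    def __init__(self, input_size: int, hidden_size: int = 128, output_size: int = 64):
        super().__init__()
        self.target = nn.Sequential(
            nn.Linear(input_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, output_size),
        )
        self.predictor = nn.Sequential(
            nn.Linear(input_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, output_size),
        )
        # The target network stays fixed; only the predictor is trained.
        for p in self.target.parameters():
            p.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Per-sample squared prediction error, i.e. L(s, a).
        return ((self.predictor(x) - self.target(x)) ** 2).mean(dim=-1)
```

Samples close to the expert support yield small prediction errors after training, which the reward model later maps to high rewards.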
RedRewardModel
Bases: BaseRewardModel
Overview
The implementation of the reward model in RED (https://arxiv.org/abs/1905.06750).
Interface:
estimate, train, load_expert_data, collect_data, clear_data, __init__, _train
Config:
== ====================== ===== =============== ======================================== ===========================
ID Symbol                 Type  Default Value   Description                              Other (Shape)
== ====================== ===== =============== ======================================== ===========================
1  type                   str   red             Reward model register name; refer
                                                to the registry REWARD_MODEL_REGISTRY
2  expert_data_path       str   expert_data.pkl Path to the expert dataset               Should be a '.pkl' file
3  sample_size            int   1000            Number of samples drawn from the
                                                expert dataset
4  sigma                  int   5               Hyperparameter of r(s,a)                 r(s,a) = exp(-sigma*L(s,a))
5  batch_size             int   64              Training batch size
6  hidden_size            int   128             Linear model hidden size
7  update_per_collect     int   100             Number of updates per collect
8  clear_buffer_per_iters int   1               Clear the buffer every fixed             Ensures the replay buffer's
                                                number of iterations                     data count isn't too low
                                                                                         (handled in entry code)
== ====================== ===== =============== ======================================== ===========================
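The defaults in the table above can be gathered into a plain config dict; the key names follow the table and are illustrative of how such a config is typically assembled:

```python
# Illustrative RED reward model config built from the table's defaults.
red_config = dict(
    type='red',                     # register name in REWARD_MODEL_REGISTRY
    expert_data_path='expert_data.pkl',
    sample_size=1000,               # samples drawn from the expert dataset
    sigma=5,                        # r(s,a) = exp(-sigma * L(s,a))
    batch_size=64,
    hidden_size=128,
    update_per_collect=100,
    clear_buffer_per_iters=1,
)
```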
Properties:
- online_net (:obj:SENet): The reward model; by default, initialized once when training begins.
__init__(config, device, tb_logger)
Overview
Initialize the RED reward model. See help(type(self)) for the accurate signature.
Arguments:
- config (:obj:Dict): Training config
- device (:obj:str): Device to use, i.e. "cpu" or "cuda"
- tb_logger (:obj:SummaryWriter): Logger for model summaries, by default a TensorBoard SummaryWriter
load_expert_data()
Overview
Load the expert data from the path given by config['expert_data_path'].
Effects:
This is a side effect function which updates the expert data attribute (e.g. self.expert_data)
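A minimal sketch of such a loader, assuming the expert dataset is stored as a pickled list of transition dicts with at least obs and action keys; the standalone function name here is hypothetical:

```python
import pickle


def load_expert_data(path: str) -> list:
    """Illustrative loader: read a pickled list of expert transitions,
    each a dict with at least 'obs' and 'action' keys."""
    with open(path, 'rb') as f:
        return pickle.load(f)
```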
train()
Overview
Train the RED reward model. By default, the RED model is trained only once.
Effects:
- This is a side effect function which updates the reward model and increments the train iteration count.
estimate(data)
Overview
Estimate the reward by rewriting the reward key of each data item.
Arguments:
- data (:obj:list): the list of data used for estimation, with at least obs and action keys.
Effects:
- This is a side effect function which updates the reward values in place.
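The reward written back follows the shaping rule shown in the config table, r(s,a) = exp(-sigma * L(s,a)), where L is the support-estimation prediction error. A standalone sketch of that mapping (the function name is hypothetical):

```python
import math


def red_reward(loss: float, sigma: float = 5.0) -> float:
    """RED reward shaping: r(s, a) = exp(-sigma * L(s, a)).

    A zero prediction error (an on-support sample) yields the maximal
    reward 1.0; larger errors decay the reward toward 0.
    """
    return math.exp(-sigma * loss)
```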
collect_data(data)
Overview
Collect training data. This method is not implemented because the reward model (i.e. online_net) is trained only once by default; if online_net were trained continuously, collect_data would need an implementation.
clear_data()
Overview
Clear collected data. This method is not implemented because the reward model (i.e. online_net) is trained only once by default; if online_net were trained continuously, clear_data would need an implementation.
Full Source Code
../ding/reward_model/red_irl_model.py