ding.policy
R2D2CollectTrajPolicy
Bases: Policy
Overview
Policy class of R2D2 for collecting expert trajectories for R2D3.
Config
== ==================== ======== ============== ======================================== =======================
ID Symbol Type Default Value Description Other(Shape)
== ==================== ======== ============== ======================================== =======================
1 type str dqn | RL policy register name, refer to | This arg is optional,
| registry POLICY_REGISTRY | a placeholder
2 cuda bool False | Whether to use cuda for network | This arg can be diff-
| erent from modes
3 on_policy bool False | Whether the RL algorithm is on-policy
| or off-policy
4 priority bool False | Whether use priority(PER) | Priority sample,
| update priority
5 | priority_IS bool False | Whether use Importance Sampling Weight
| _weight | to correct biased update. If True,
| priority must be True.
6 | discount_ float 0.997, | Reward's future discount factor, aka. | May be 1 when sparse
| factor [0.95, 0.999] | gamma | reward env
7 nstep int 3, | N-step reward discount sum for target
[3, 5] | q_value estimation
8 burnin_step int 2 | The timestep of the burn-in operation,
| which is designed to mitigate the RNN
| hidden state difference caused by
| off-policy training
9 | learn.update int 1 | How many updates (iterations) to train | This arg can vary
| per_collect | after one collection by the collector. | across envs. Bigger
| Only valid in serial training | values mean more off-policy
10 | learn.batch_ int 64 | The number of samples of an iteration
| size
11 | learn.learning float 0.001 | Gradient step length of an iteration.
| _rate
12 | learn.value_ bool True | Whether use value_rescale function for
| rescale | predicted value
13 | learn.target_ int 100 | Frequency of target network update. | Hard(assign) update
| update_freq
14 | learn.ignore_ bool False | Whether ignore done for target value | Enable it for some
| done | calculation. | fake termination env
15 collect.n_sample int [8, 128] | The number of training samples of a | It varies from
| call of collector. | different envs
16 | collect.unroll int 1 | unroll length of an iteration | In RNN, unroll_len>1
| _len
== ==================== ======== ============== ======================================== =======================
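For orientation, the defaults above can be gathered into a single config fragment. This is a hedged sketch assembled from the table; the key names follow the table's symbols and may differ from the policy's actual default_config nesting.

```python
# Hedged sketch of a config fragment built from the table above.
# Key names mirror the table's symbols; the real default_config of
# R2D2CollectTrajPolicy may nest or name them differently.
r2d2_collect_traj_config = dict(
    type='dqn',
    cuda=False,
    on_policy=False,
    priority=False,
    priority_IS_weight=False,
    discount_factor=0.997,
    nstep=3,
    burnin_step=2,
    learn=dict(
        update_per_collect=1,
        batch_size=64,
        learning_rate=0.001,
        value_rescale=True,
        target_update_freq=100,
        ignore_done=False,
    ),
    collect=dict(unroll_len=1),
)
```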
PPOSTDIMPolicy
Bases: PPOPolicy
Overview
Policy class of the on-policy version of the PPO algorithm with an ST-DIM auxiliary model. PPO paper link: https://arxiv.org/abs/1707.06347. ST-DIM paper link: https://arxiv.org/abs/1906.08226.
OffPPOCollectTrajPolicy
MBSACPolicy
Bases: SACPolicy
Overview
Model-based SAC with value expansion (arXiv: 1803.00101) and value gradient (arXiv: 1510.09142) w.r.t. the lambda-return.
https://arxiv.org/pdf/1803.00101.pdf https://arxiv.org/pdf/1510.09142.pdf
Config
== ==================== ======== ============= ==================================
ID Symbol Type Default Value Description
== ==================== ======== ============= ==================================
1 learn._lambda float 0.8 | Lambda for TD-lambda return.
2 learn.grad_clip float 100.0 | Max norm of gradients.
3 | learn.sample bool True | Whether to sample states or
| _state | transitions from env buffer.
== ==================== ======== ============= ==================================
.. note:: For other configs, please refer to ding.policy.sac.SACPolicy.
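The role of learn._lambda can be made concrete: the TD-lambda return over a model rollout satisfies G_t = r_t + gamma * ((1 - lambda) * V(s_{t+1}) + lambda * G_{t+1}). A minimal sketch, with function name and array layout chosen for illustration (not DI-engine's API):

```python
def lambda_return(rewards, values, gamma=0.99, lam=0.8):
    """Compute TD-lambda returns over a rollout.

    rewards: [r_0, ..., r_{H-1}]
    values:  bootstrap values [V(s_1), ..., V(s_H)] (aligned one step ahead)
    """
    g = values[-1]  # at the horizon, the return is the bootstrap value
    returns = []
    for t in reversed(range(len(rewards))):
        # Blend the one-step bootstrap with the recursive lambda-return.
        g = rewards[t] + gamma * ((1 - lam) * values[t] + lam * g)
        returns.append(g)
    return list(reversed(returns))
```

With lam=1 this reduces to the Monte-Carlo return bootstrapped at the horizon; with lam=0 it collapses to the one-step TD target.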
STEVESACPolicy
Bases: SACPolicy
Overview
Model-based SAC with stochastic ensemble value expansion (STEVE, arXiv: 1807.01675). This implementation also uses value gradients w.r.t. the same STEVE target.
https://arxiv.org/pdf/1807.01675.pdf
Config
== ==================== ======== ============= =====================================
ID Symbol Type Default Value Description
== ==================== ======== ============= =====================================
1 learn.grad_clip float 100.0 | Max norm of gradients.
2 learn.ensemble_size int 1 | The number of ensemble world models.
== ==================== ======== ============= =====================================
.. note:: For other configs, please refer to ding.policy.sac.SACPolicy.
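STEVE forms candidate TD targets by expanding the learned model to several horizons and combines them with inverse-variance weights computed over the ensemble's estimates. A hedged sketch of that combination step (the paper also ensembles over reward and Q networks, omitted here):

```python
import numpy as np

def steve_target(candidate_targets):
    """Combine per-horizon TD targets by inverse-variance weighting.

    candidate_targets: array of shape (num_horizons, num_estimates),
    where each row holds one horizon's targets across the ensemble.
    """
    means = candidate_targets.mean(axis=1)
    variances = candidate_targets.var(axis=1) + 1e-8  # avoid division by zero
    inv_var = 1.0 / variances
    weights = inv_var / inv_var.sum()  # low-variance horizons get more weight
    return float((weights * means).sum())
```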
SQLPolicy
Bases: Policy
Overview
Policy class of SQL algorithm.
default_model()
Overview
Return this algorithm's default model setting for demonstration.
Returns:
- model_info (:obj:Tuple[str, List[str]]): model name and model's import_names
.. note::
The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For DQN, it is ding.model.template.q_learning.DQN.
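The (name, import_names) contract can be illustrated with a toy registry. Everything below is a hypothetical stand-in for demonstration; DI-engine's real MODEL_REGISTRY and registration decorators differ in detail:

```python
import importlib

MODEL_REGISTRY = {}  # hypothetical stand-in for the real registry

def register_model(name):
    """Decorator that records a model class under `name` at import time."""
    def deco(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return deco

@register_model('dqn')
class DQN:  # toy placeholder, not ding.model.template.q_learning.DQN
    pass

def create_model(model_info):
    """Resolve (name, import_names): importing each module path triggers
    registration as a side effect, then the name is looked up."""
    name, import_names = model_info
    for path in import_names:
        importlib.import_module(path)
    return MODEL_REGISTRY[name]
```

In the real library, default_model() returns a tuple of this shape and the model-creation helper performs the analogous import-then-lookup.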
DQFDPolicy
Bases: DQNPolicy
Overview
Policy class of DQFD algorithm, extended by Double DQN/Dueling DQN/PER/multi-step TD.
Config
== ==================== ======== ============== ======================================== =======================
ID Symbol Type Default Value Description Other(Shape)
== ==================== ======== ============== ======================================== =======================
1 type str dqn | RL policy register name, refer to | This arg is optional,
| registry POLICY_REGISTRY | a placeholder
2 cuda bool False | Whether to use cuda for network | This arg can be diff-
| erent from modes
3 on_policy bool False | Whether the RL algorithm is on-policy
| or off-policy
4 priority bool True | Whether use priority(PER) | Priority sample,
| update priority
5 | priority_IS bool True | Whether use Importance Sampling Weight
| _weight | to correct biased update. If True,
| priority must be True.
6 | discount_ float 0.97, | Reward's future discount factor, aka. | May be 1 when sparse
| factor [0.95, 0.999] | gamma | reward env
7 nstep int 10, | N-step reward discount sum for target
[3, 5] | q_value estimation
8 | lambda1 float 1 | multiplicative factor for the n-step loss
9 | lambda2 float 1 | multiplicative factor for the
| supervised margin loss
10 | lambda3 float 1e-5 | multiplicative factor for the L2 regularization loss
11 | margin_fn float 0.8 | margin function in JE, here we set
| this as a constant
12 | per_train_ int 10 | number of pretraining iterations
| iter_k
13 | learn.update int 3 | How many updates (iterations) to train | This arg can vary
| per_collect | after one collection by the collector. | across envs. Bigger
| Only valid in serial training | values mean more off-policy
14 | learn.batch_ int 64 | The number of samples of an iteration
| size
15 | learn.learning float 0.001 | Gradient step length of an iteration.
| _rate
16 | learn.target_ int 100 | Frequency of target network update. | Hard(assign) update
| update_freq
17 | learn.ignore_ bool False | Whether ignore done for target value | Enable it for some
| done | calculation. | fake termination env
18 collect.n_sample int [8, 128] | The number of training samples of a | It varies from
| call of collector. | different envs
19 | collect.unroll int 1 | unroll length of an iteration | In RNN, unroll_len>1
| _len
== ==================== ======== ============== ======================================== =======================
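lambda2 scales DQfD's supervised large-margin loss, J_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), where l(a_E, a) equals margin_fn when a != a_E and 0 otherwise. A minimal sketch of that term (function name is illustrative):

```python
import numpy as np

def dqfd_margin_loss(q_values, expert_action, margin=0.8):
    """Large-margin classification loss on one demonstration transition.

    q_values: Q(s, a) for every action, shape (num_actions,)
    expert_action: index of the demonstrated action a_E
    margin: the constant margin function l(a_E, a) for a != a_E
    """
    l = np.full_like(q_values, margin)
    l[expert_action] = 0.0  # no margin on the expert's own action
    return float(np.max(q_values + l) - q_values[expert_action])
```

The loss vanishes once the expert action's Q-value exceeds every other action's by at least the margin.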
IQLPolicy
Bases: Policy
Overview
Policy class of Implicit Q-Learning (IQL) algorithm for continuous control. Paper link: https://arxiv.org/abs/2110.06169.
Config
== ==================== ======== ============= ================================= =======================
ID Symbol Type Default Value Description Other(Shape)
== ==================== ======== ============= ================================= =======================
1 type str iql | RL policy register name, refer | this arg is optional,
| to registry POLICY_REGISTRY | a placeholder
2 cuda bool True | Whether to use cuda for network |
3 | random_ int 10000 | Number of randomly collected | Default to 10000 for
| collect_size | training samples in replay | SAC, 25000 for DDPG/
| | buffer when training starts. | TD3.
4 | model.policy_ int 256 | Linear layer size for policy |
| embedding_size | network. |
5 | model.soft_q_ int 256 | Linear layer size for soft q |
| embedding_size | network. |
6 | model.value_ int 256 | Linear layer size for value | Default to None when
| embedding_size | network. | model.value_network
| | | is False.
7 | learn.learning float 3e-4 | Learning rate for soft q | Default to 1e-3, when
| _rate_q | network. | model.value_network
| | | is True.
8 | learn.learning float 3e-4 | Learning rate for policy | Default to 1e-3, when
| _rate_policy | network. | model.value_network
| | | is True.
9 | learn.learning float 3e-4 | Learning rate for value | Default to None when
| _rate_value | network. | model.value_network
| | | is False.
10 | learn.alpha float 0.2 | Entropy regularization | alpha is the initial
| | coefficient. | value for auto alpha,
| | | when auto_alpha
| | | is True
11 | learn.repara_ bool True | Determine whether to use |
| meterization | reparameterization trick. |
12 | learn. bool False | Determine whether to use | Temperature parameter
| auto_alpha | auto temperature parameter | determines the
| | alpha. | relative importance
| | | of the entropy term
| | | against the reward.
13 | learn.ignore bool False | Determine whether to ignore | Use ignore_done only
| _done | done flag. | in halfcheetah env.
14 | learn.target float 0.005 | Used for soft update of the | aka. Interpolation
| _theta | target network. | factor in polyak
| | | averaging for target
| | | networks.
== ==================== ======== ============= ================================= =======================
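Not visible in the table is IQL's characteristic ingredient: the value network is fit to the Q-network by expectile regression, which avoids querying out-of-distribution actions. A hedged sketch of that loss, with tau = 0.7 as an illustrative value (not a documented default of this policy):

```python
import numpy as np

def expectile_loss(q, v, tau=0.7):
    """Asymmetric L2 loss: under-estimates of V relative to Q are penalized
    by tau, over-estimates by (1 - tau), so V tracks an upper expectile."""
    u = q - v
    weight = np.where(u > 0, tau, 1 - tau)
    return float(np.mean(weight * u ** 2))
```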
default_model()
Overview
Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to obtain the default model setting and create the model.
Returns:
- model_info (:obj:Tuple[str, List[str]]): The registered model name and model's import_names.
MADQNPolicy
Bases: QMIXPolicy
default_model()
Overview
Return this algorithm's default model setting for demonstration.
Returns:
- model_info (:obj:Tuple[str, List[str]]): model name and model's import_names
EpsCommandModePolicy
Bases: CommandModePolicy
DDPGCommandModePolicy
Bases: DDPGPolicy, CommandModePolicy
BCCommandModePolicy
Bases: BehaviourCloningPolicy, DummyCommandModePolicy
get_epsilon_greedy_fn(start, end, decay, type_='exp')
Overview
Generate an epsilon_greedy function with decay, which inputs current timestep and outputs current epsilon.
Arguments:
- start (:obj:float): Epsilon start value. For 'linear', it should be 1.0.
- end (:obj:float): Epsilon end value.
- decay (:obj:int): Controls the speed that epsilon decreases from start to end. We recommend epsilon decays according to env step rather than iteration.
- type_ (:obj:str): How epsilon decays; currently supports ['linear', 'exp' (exponential)].
Returns:
- eps_fn (:obj:function): The epsilon greedy function with decay.
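An illustrative re-implementation of this interface under the conventional schedules (clamping details may differ from the library's actual implementation):

```python
import math

def get_epsilon_greedy_fn(start, end, decay, type_='exp'):
    """Return eps_fn(step) that decays epsilon from `start` towards `end`."""
    assert type_ in ('linear', 'exp'), type_
    if type_ == 'linear':
        def eps_fn(step):
            # Decrease linearly over `decay` steps, then stay at `end`.
            return max(end, start - (start - end) * step / decay)
    else:
        def eps_fn(step):
            # Exponential decay towards `end` with time constant `decay`.
            return end + (start - end) * math.exp(-step / decay)
    return eps_fn

eps_fn = get_epsilon_greedy_fn(start=1.0, end=0.05, decay=10000, type_='linear')
```

Passing the current env step (rather than the training iteration) to eps_fn matches the recommendation above.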
Full Source Code: ../ding/policy/__init__.py