
ding.policy

R2D2CollectTrajPolicy

Bases: Policy

Overview

Policy class of R2D2, used to collect expert trajectories for R2D3.

Config

== ======================== ======= ============== ========================================== =======================
ID Symbol                   Type    Default Value  Description                                Other(Shape)
== ======================== ======= ============== ========================================== =======================
1  type                     str     dqn            RL policy register name, refer to          This arg is optional,
                                                   registry POLICY_REGISTRY                   a placeholder
2  cuda                     bool    False          Whether to use cuda for network            This arg can be
                                                                                              different from modes
3  on_policy                bool    False          Whether the RL algorithm is on-policy
                                                   or off-policy
4  priority                 bool    False          Whether to use priority (PER)              Priority sample,
                                                                                              update priority
5  priority_IS_weight       bool    False          Whether to use Importance Sampling
                                                   Weight to correct biased update.
                                                   If True, priority must be True.
6  discount_factor          float   0.997,         Reward's future discount factor, aka.      May be 1 when sparse
                                    [0.95, 0.999]  gamma                                      reward env
7  nstep                    int     3, [3, 5]      N-step reward discount sum for target
                                                   q_value estimation
8  burnin_step              int     2              The timestep of the burn-in operation,
                                                   which is designed to warm up the RNN
                                                   hidden state and mitigate the
                                                   difference caused by off-policy data
9  learn.update_            int     1              How many updates (iterations) to train     This arg can vary
   per_collect                                     after one collection by the collector.     across envs. A bigger
                                                   Only valid in serial training              value means more
                                                                                              off-policy
10 learn.batch_size         int     64             The number of samples of an iteration
11 learn.learning_rate      float   0.001          Gradient step length of an iteration.
12 learn.value_rescale      bool    True           Whether to use the value_rescale
                                                   function for the predicted value
13 learn.target_            int     100            Frequency of target network update.        Hard (assign) update
   update_freq
14 learn.ignore_done        bool    False          Whether to ignore done for target value    Enable it for some
                                                   calculation.                               fake termination envs
15 collect.n_sample         int     [8, 128]       The number of training samples of a        It varies across
                                                   call of collector.                         different envs
16 collect.unroll_len       int     1              Unroll length of an iteration              In RNN, unroll_len>1
== ======================== ======= ============== ========================================== =======================
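The learn.value_rescale option in the table refers to the invertible value transform commonly used by R2D2 (h(x) = sign(x)(sqrt(|x| + 1) - 1) + eps * x, from Pohlen et al., 2018). Below is a minimal re-implementation sketch, not DI-engine's actual code; the constant eps=1e-2 is the commonly cited default and is an assumption here:

```python
import math

def value_rescale(x: float, eps: float = 1e-2) -> float:
    """h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x."""
    return math.copysign(math.sqrt(abs(x) + 1.0) - 1.0, x) + eps * x

def inverse_value_rescale(x: float, eps: float = 1e-2) -> float:
    """Closed-form inverse of h, obtained by solving h(y) = x for y."""
    sign = math.copysign(1.0, x)
    tmp = (math.sqrt(1.0 + 4.0 * eps * (abs(x) + 1.0 + eps)) - 1.0) / (2.0 * eps)
    return sign * (tmp * tmp - 1.0)
```

The transform compresses large target values so that the TD loss stays well-scaled across environments with very different reward magnitudes; the inverse is applied when bootstrapping from the target network.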

PPOSTDIMPolicy

Bases: PPOPolicy

Overview

Policy class of the on-policy version of the PPO algorithm with an ST-DIM auxiliary model. PPO paper link: https://arxiv.org/abs/1707.06347. ST-DIM paper link: https://arxiv.org/abs/1906.08226.

OffPPOCollectTrajPolicy

Bases: Policy

Overview

Policy class of off-policy PPO, used to collect expert trajectories for R2D3.

MBSACPolicy

Bases: SACPolicy

Overview

Model-based SAC with value expansion (arXiv: 1803.00101) and value gradient (arXiv: 1510.09142) w.r.t. the lambda-return.

https://arxiv.org/pdf/1803.00101.pdf https://arxiv.org/pdf/1510.09142.pdf
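The lambda-return used as the value target above can be sketched with a standard backward recursion (a minimal re-implementation under the usual TD(lambda) definition, not DI-engine's actual code):

```python
from typing import List

def lambda_return(rewards: List[float], values: List[float],
                  gamma: float = 0.99, lam: float = 0.8) -> List[float]:
    """Backward recursion G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).

    `values` has len(rewards) + 1 entries: values[t + 1] bootstraps the
    return at step t, and values[-1] bootstraps the final step.
    """
    assert len(values) == len(rewards) + 1
    returns = [0.0] * len(rewards)
    next_return = values[-1]  # bootstrap from the final value estimate
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_return)
        returns[t] = next_return
    return returns
```

With lam=0 this degenerates to the one-step TD target, and with lam=1 to the full Monte-Carlo return plus bootstrap, which is why a single lambda knob (learn._lambda in the config below) interpolates between bias and variance.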

Config

== ==================== ======== ============== ==========================================
ID Symbol               Type     Default Value  Description
== ==================== ======== ============== ==========================================
1  learn._lambda        float    0.8            Lambda for the TD-lambda return.
2  learn.grad_clip      float    100.0          Max norm of gradients.
3  learn.sample_state   bool     True           Whether to sample states or transitions
                                                from the env buffer.
== ==================== ======== ============== ==========================================

.. note:: For other configs, please refer to ding.policy.sac.SACPolicy.

STEVESACPolicy

Bases: SACPolicy

Overview

Model-based SAC with stochastic ensemble value expansion (arXiv: 1807.01675). This implementation also uses the value gradient w.r.t. the same STEVE target.

https://arxiv.org/pdf/1807.01675.pdf

Config

== ==================== ======== ============== ==========================================
ID Symbol               Type     Default Value  Description
== ==================== ======== ============== ==========================================
1  learn.grad_clip      float    100.0          Max norm of gradients.
2  learn.ensemble_size  int      1              The number of ensemble world models.
== ==================== ======== ============== ==========================================

.. note:: For other configs, please refer to ding.policy.sac.SACPolicy.

DREAMERPolicy

Bases: Policy

SQLPolicy

Bases: Policy

Overview

Policy class of SQL algorithm.
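Assuming SQL here denotes soft Q-learning (Haarnoja et al., 2017), its defining ingredient is the "soft" state value that replaces the hard max over actions with a temperature-smoothed log-sum-exp. A minimal sketch (my own illustration, not DI-engine's code):

```python
import math
from typing import List

def soft_value(q_values: List[float], alpha: float = 0.1) -> float:
    """V(s) = alpha * log sum_a exp(Q(s, a) / alpha).

    Computed with a max-shift for numerical stability; as alpha -> 0 this
    approaches max_a Q(s, a), recovering the hard Q-learning backup.
    """
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))
```

The temperature alpha controls how stochastic the induced Boltzmann policy is: large alpha keeps exploration broad, small alpha concentrates on the greedy action.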

default_model()

Overview

Return this algorithm's default model setting for demonstration.

Returns: - model_info (:obj:Tuple[str, List[str]]): model name and model's import_names

.. note:: The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For DQN, it is ding.model.template.q_learning.DQN

DQFDPolicy

Bases: DQNPolicy

Overview

Policy class of the DQFD algorithm, extended with Double DQN/Dueling DQN/PER/multi-step TD.

Config

== ======================== ======= ============== ========================================== =======================
ID Symbol                   Type    Default Value  Description                                Other(Shape)
== ======================== ======= ============== ========================================== =======================
1  type                     str     dqn            RL policy register name, refer to          This arg is optional,
                                                   registry POLICY_REGISTRY                   a placeholder
2  cuda                     bool    False          Whether to use cuda for network            This arg can be
                                                                                              different from modes
3  on_policy                bool    False          Whether the RL algorithm is on-policy
                                                   or off-policy
4  priority                 bool    True           Whether to use priority (PER)              Priority sample,
                                                                                              update priority
5  priority_IS_weight       bool    True           Whether to use Importance Sampling
                                                   Weight to correct biased update.
                                                   If True, priority must be True.
6  discount_factor          float   0.97,          Reward's future discount factor, aka.      May be 1 when sparse
                                    [0.95, 0.999]  gamma                                      reward env
7  nstep                    int     10, [3, 5]     N-step reward discount sum for target
                                                   q_value estimation
8  lambda1                  float   1              Multiplicative factor for the n-step loss
9  lambda2                  float   1              Multiplicative factor for the
                                                   supervised margin loss
10 lambda3                  float   1e-5           Multiplicative factor for the L2 loss
11 margin_fn                float   0.8            Margin function in JE, here set as
                                                   a constant
12 per_train_iter_k         int     10             Number of pretraining iterations
13 learn.update_            int     3              How many updates (iterations) to train     This arg can vary
   per_collect                                     after one collection by the collector.     across envs. A bigger
                                                   Only valid in serial training              value means more
                                                                                              off-policy
14 learn.batch_size         int     64             The number of samples of an iteration
15 learn.learning_rate      float   0.001          Gradient step length of an iteration.
16 learn.target_            int     100            Frequency of target network update.        Hard (assign) update
   update_freq
17 learn.ignore_done        bool    False          Whether to ignore done for target value    Enable it for some
                                                   calculation.                               fake termination envs
18 collect.n_sample         int     [8, 128]       The number of training samples of a        It varies across
                                                   call of collector.                         different envs
19 collect.unroll_len       int     1              Unroll length of an iteration              In RNN, unroll_len>1
== ======================== ======= ============== ========================================== =======================
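The lambda1/lambda2/lambda3 and margin_fn entries above correspond to the weighted loss combination from the DQfD paper (Hester et al., 2018): J(Q) = J_DQ + lambda1 * J_n + lambda2 * J_E + lambda3 * J_L2, where J_E is the supervised large-margin loss on demonstration transitions. A minimal sketch with the defaults from the table (my own illustration, not DI-engine's code):

```python
from typing import List

def margin_loss(q_values: List[float], expert_action: int, margin: float = 0.8) -> float:
    """J_E = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), with l = margin for a != a_E else 0."""
    shifted = [q + (0.0 if a == expert_action else margin) for a, q in enumerate(q_values)]
    return max(shifted) - q_values[expert_action]

def dqfd_total_loss(one_step_td: float, n_step_td: float, margin: float, l2_reg: float,
                    lambda1: float = 1.0, lambda2: float = 1.0, lambda3: float = 1e-5) -> float:
    """Weighted sum of the four DQfD loss terms."""
    return one_step_td + lambda1 * n_step_td + lambda2 * margin + lambda3 * l2_reg
```

Note that J_E is zero whenever the expert action already has the highest Q-value by at least the margin, so the term only pushes the network when it disagrees with the demonstration.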

IQLPolicy

Bases: Policy

Overview

Policy class of Implicit Q-Learning (IQL) algorithm for continuous control. Paper link: https://arxiv.org/abs/2110.06169.
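The core of IQL is fitting the value function toward an upper expectile of the Q-distribution via an asymmetric squared loss, L_2^tau(u) = |tau - 1(u < 0)| * u^2. A minimal sketch of that loss (my own illustration based on the paper, not DI-engine's code):

```python
def expectile_loss(diff: float, tau: float = 0.7) -> float:
    """Asymmetric squared loss on diff = Q(s, a) - V(s).

    For tau > 0.5, positive errors are penalized more than negative ones,
    so minimizing this pushes V(s) toward an upper expectile of Q.
    """
    weight = tau if diff >= 0 else (1.0 - tau)
    return weight * diff * diff
```

With tau = 0.5 this reduces to the ordinary squared loss; as tau approaches 1 the fitted V(s) approaches max_a Q(s, a) over the dataset's actions, which is what lets IQL avoid querying out-of-distribution actions.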

Config

== ============================ ======= ============== ================================== =======================
ID Symbol                       Type    Default Value  Description                        Other(Shape)
== ============================ ======= ============== ================================== =======================
1  type                         str     iql            RL policy register name, refer to  This arg is optional,
                                                       registry POLICY_REGISTRY           a placeholder
2  cuda                         bool    True           Whether to use cuda for network
3  random_collect_size          int     10000          Number of randomly collected       Default to 10000 for
                                                       training samples in the replay     SAC, 25000 for DDPG/
                                                       buffer when training starts.       TD3.
4  model.policy_                int     256            Linear layer size for the policy
   embedding_size                                      network.
5  model.soft_q_                int     256            Linear layer size for the soft q
   embedding_size                                      network.
6  model.value_                 int     256            Linear layer size for the value    Default to None when
   embedding_size                                      network.                           model.value_network
                                                                                          is False.
7  learn.learning_rate_q        float   3e-4           Learning rate for the soft q       Default to 1e-3 when
                                                       network.                           model.value_network
                                                                                          is True.
8  learn.learning_rate_policy   float   3e-4           Learning rate for the policy       Default to 1e-3 when
                                                       network.                           model.value_network
                                                                                          is True.
9  learn.learning_rate_value    float   3e-4           Learning rate for the value        Default to None when
                                                       network.                           model.value_network
                                                                                          is False.
10 learn.alpha                  float   0.2            Entropy regularization             alpha is the initial-
                                                       coefficient.                       ization for auto
                                                                                          alpha, when
                                                                                          auto_alpha is True
11 learn.reparameterization     bool    True           Whether to use the
                                                       reparameterization trick.
12 learn.auto_alpha             bool    False          Whether to use the auto            The temperature param-
                                                       temperature parameter alpha.       eter determines the
                                                                                          relative importance
                                                                                          of the entropy term
                                                                                          against the reward.
13 learn.ignore_done            bool    False          Whether to ignore the done flag.   Use ignore_done only
                                                                                          in the halfcheetah env.
14 learn.target_theta           float   0.005          Used for the soft update of the    aka. the interpolation
                                                       target network.                    factor in polyak aver-
                                                                                          aging for target
                                                                                          networks.
== ============================ ======= ============== ================================== =======================
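The learn.target_theta entry above is the interpolation factor of a standard polyak (soft) target update. A minimal parameter-wise sketch on plain floats (my own illustration, not DI-engine's code, which operates on network parameters):

```python
from typing import List

def polyak_update(target: List[float], online: List[float], theta: float = 0.005) -> List[float]:
    """target <- theta * online + (1 - theta) * target, applied element-wise.

    A small theta makes the target network trail the online network slowly,
    which stabilizes the bootstrapped value targets.
    """
    return [theta * o + (1.0 - theta) * t for t, o in zip(target, online)]
```

With theta = 1.0 this degenerates to a hard (assign) update, the same scheme the DQN-family tables above call "Hard(assign) update".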

default_model()

Overview

Return this algorithm's default neural network model setting for demonstration. The __init__ method will automatically call this method to get the default model setting and create the model.

Returns: - model_info (:obj:Tuple[str, List[str]]): The registered model name and model's import_names.

MADQNPolicy

Bases: QMIXPolicy

default_model()

Overview

Return this algorithm's default model setting for demonstration.

Returns: - model_info (:obj:Tuple[str, List[str]]): model name and model's import_names

PDPolicy

Bases: Policy

Overview

Implicit Plan Diffuser. Paper link: https://arxiv.org/pdf/2205.09991.pdf

EpsCommandModePolicy

DDPGCommandModePolicy

BCCommandModePolicy

Bases: BehaviourCloningPolicy, DummyCommandModePolicy

get_epsilon_greedy_fn(start, end, decay, type_='exp')

Overview

Generate an epsilon_greedy function with decay, which inputs current timestep and outputs current epsilon.

Arguments:
- start (:obj:float): Epsilon start value. For 'linear', it should be 1.0.
- end (:obj:float): Epsilon end value.
- decay (:obj:int): Controls the speed at which epsilon decreases from start to end. We recommend decaying epsilon according to env step rather than training iteration.
- type_ (:obj:str): How epsilon decays; currently supports ['linear', 'exp' (exponential)].

Returns:
- eps_fn (:obj:function): The epsilon greedy function with decay.
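The two schedules described above can be sketched as follows. This is a re-implementation under the usual definitions of exponential and linear decay, not DI-engine's actual code, and the exact formulas in ding may differ slightly:

```python
import math

def make_epsilon_greedy_fn(start: float, end: float, decay: int, type_: str = 'exp'):
    """Return a function mapping the current step to an epsilon value.

    'exp':    epsilon = (start - end) * exp(-step / decay) + end
    'linear': epsilon decreases linearly from start to end over `decay` steps,
              then stays at end.
    """
    assert type_ in ('linear', 'exp'), type_
    if type_ == 'exp':
        return lambda step: (start - end) * math.exp(-step / decay) + end
    return lambda step: max(end, start - (start - end) * step / decay)

# Example: anneal from 1.0 to 0.05 over roughly 10000 env steps.
eps_fn = make_epsilon_greedy_fn(start=1.0, end=0.05, decay=10000, type_='exp')
```

Each call like eps_fn(env_step) then yields the epsilon to use for action selection at that step, matching the interface documented above.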

Full Source Code

../ding/policy/__init__.py

from .base_policy import Policy, CommandModePolicy, create_policy, get_policy_cls
from .common_utils import single_env_forward_wrapper, single_env_forward_wrapper_ttorch, default_preprocess_learn
from .dqn import DQNSTDIMPolicy, DQNPolicy
from .mdqn import MDQNPolicy
from .iqn import IQNPolicy
from .fqf import FQFPolicy
from .qrdqn import QRDQNPolicy
from .c51 import C51Policy
from .rainbow import RainbowDQNPolicy
from .ddpg import DDPGPolicy
from .d4pg import D4PGPolicy
from .td3 import TD3Policy
from .td3_vae import TD3VAEPolicy
from .td3_bc import TD3BCPolicy
from .dt import DTPolicy

from .pg import PGPolicy
from .a2c import A2CPolicy
from .ppo import PPOPolicy, PPOPGPolicy, PPOOffPolicy
from .vmpo import VMPOPolicy
from .sac import SACPolicy, DiscreteSACPolicy, SQILSACPolicy
from .cql import CQLPolicy, DiscreteCQLPolicy
from .edac import EDACPolicy
from .impala import IMPALAPolicy
from .ngu import NGUPolicy
from .r2d2 import R2D2Policy
from .r2d2_gtrxl import R2D2GTrXLPolicy
from .ppg import PPGPolicy, PPGOffPolicy
from .sqn import SQNPolicy
from .bdq import BDQPolicy

from .qmix import QMIXPolicy
from .wqmix import WQMIXPolicy
from .coma import COMAPolicy
from .collaq import CollaQPolicy
from .atoc import ATOCPolicy
from .acer import ACERPolicy
from .qtran import QTRANPolicy

from .il import ILPolicy

from .r2d3 import R2D3Policy

from .command_mode_policy_instance import *

from .policy_factory import PolicyFactory, get_random_policy
from .pdqn import PDQNPolicy

from .bc import BehaviourCloningPolicy
from .ibc import IBCPolicy

from .pc import ProcedureCloningBFSPolicy

from .bcq import BCQPolicy
from .qgpo import QGPOPolicy

# new-type policy
from .ppof import PPOFPolicy
from .prompt_pg import PromptPGPolicy
from .prompt_awr import PromptAWRPolicy
from .happo import HAPPOPolicy