ding.rl_utils.td
q_1step_td_error(data, gamma, criterion=nn.MSELoss(reduction='none'))
Overview
1-step TD error, supporting both the single-agent and multi-agent cases.
Arguments:
- data (:obj:q_1step_td_data): The input data, q_1step_td_data to calculate loss
- gamma (:obj:float): Discount factor
- criterion (:obj:torch.nn.modules): Loss function criterion
Returns:
- loss (:obj:torch.Tensor): 1-step TD error
Shapes:
- data (:obj:q_1step_td_data): the q_1step_td_data containing ['q', 'next_q', 'act', 'next_act', 'reward', 'done', 'weight']
- q (:obj:torch.FloatTensor): :math:(B, N) i.e. [batch_size, action_dim]
- next_q (:obj:torch.FloatTensor): :math:(B, N) i.e. [batch_size, action_dim]
- act (:obj:torch.LongTensor): :math:(B, )
- next_act (:obj:torch.LongTensor): :math:(B, )
- reward (:obj:torch.FloatTensor): :math:(B, )
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
- weight (:obj:torch.FloatTensor or None): :math:(B, ), the training sample weight
Examples:
>>> action_dim = 4
>>> data = q_1step_td_data(
>>> q=torch.randn(3, action_dim),
>>> next_q=torch.randn(3, action_dim),
>>> act=torch.randint(0, action_dim, (3,)),
>>> next_act=torch.randint(0, action_dim, (3,)),
>>> reward=torch.randn(3),
>>> done=torch.randint(0, 2, (3,)).bool(),
>>> weight=torch.ones(3),
>>> )
>>> loss = q_1step_td_error(data, 0.99)
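.. note:: A rough sketch of the computation (the full source below is authoritative): the per-sample target is target = reward + gamma * (1 - done) * next_q[next_act], the per-sample error is criterion(q[act], target), and the returned loss is the mean over the batch, weighted by weight if it is given.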
m_q_1step_td_error(data, gamma, tau, alpha, criterion=nn.MSELoss(reduction='none'))
Overview
Munchausen TD error for the DQN algorithm, supporting 1-step TD error.
Arguments:
- data (:obj:m_q_1step_td_data): The input data, m_q_1step_td_data to calculate loss
- gamma (:obj:float): Discount factor
- tau (:obj:float): Entropy factor for Munchausen DQN
- alpha (:obj:float): Discount factor for Munchausen term
- criterion (:obj:torch.nn.modules): Loss function criterion
Returns:
- loss (:obj:torch.Tensor): 1-step TD error, 0-dim tensor
Shapes:
- data (:obj:m_q_1step_td_data): the m_q_1step_td_data containing ['q', 'target_q', 'next_q', 'act', 'reward', 'done', 'weight']
- q (:obj:torch.FloatTensor): :math:(B, N) i.e. [batch_size, action_dim]
- target_q (:obj:torch.FloatTensor): :math:(B, N) i.e. [batch_size, action_dim]
- next_q (:obj:torch.FloatTensor): :math:(B, N) i.e. [batch_size, action_dim]
- act (:obj:torch.LongTensor): :math:(B, )
- reward (:obj:torch.FloatTensor): :math:(B, )
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
- weight (:obj:torch.FloatTensor or None): :math:(B, ), the training sample weight
Examples:
>>> action_dim = 4
>>> data = m_q_1step_td_data(
>>> q=torch.randn(3, action_dim),
>>> target_q=torch.randn(3, action_dim),
>>> next_q=torch.randn(3, action_dim),
>>> act=torch.randint(0, action_dim, (3,)),
>>> reward=torch.randn(3),
>>> done=torch.randint(0, 2, (3,)),
>>> weight=torch.ones(3),
>>> )
>>> loss = m_q_1step_td_error(data, 0.99, 0.01, 0.01)
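.. note:: A rough sketch of the Munchausen target (see the M-DQN paper and the full source below for details such as clipping of the log-policy term): with pi = softmax(target_q / tau) and pi' = softmax(next_q / tau), target = reward + alpha * tau * log pi(act) + gamma * (1 - done) * sum_a' pi'(a') * (next_q(a') - tau * log pi'(a')).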
q_v_1step_td_error(data, gamma, criterion=nn.MSELoss(reduction='none'))
Overview
TD error between the q and v values for the SAC algorithm, supporting 1-step TD error.
Arguments:
- data (:obj:q_v_1step_td_data): The input data, q_v_1step_td_data to calculate loss
- gamma (:obj:float): Discount factor
- criterion (:obj:torch.nn.modules): Loss function criterion
Returns:
- loss (:obj:torch.Tensor): 1-step TD error, 0-dim tensor
Shapes:
- data (:obj:q_v_1step_td_data): the q_v_1step_td_data containing ['q', 'v', 'act', 'reward', 'done', 'weight']
- q (:obj:torch.FloatTensor): :math:(B, N) i.e. [batch_size, action_dim]
- v (:obj:torch.FloatTensor): :math:(B, )
- act (:obj:torch.LongTensor): :math:(B, )
- reward (:obj:torch.FloatTensor): :math:(B, )
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
- weight (:obj:torch.FloatTensor or None): :math:(B, ), the training sample weight
Examples:
>>> action_dim = 4
>>> data = q_v_1step_td_data(
>>> q=torch.randn(3, action_dim),
>>> v=torch.randn(3),
>>> act=torch.randint(0, action_dim, (3,)),
>>> reward=torch.randn(3),
>>> done=torch.randint(0, 2, (3,)),
>>> weight=torch.ones(3),
>>> )
>>> loss = q_v_1step_td_error(data, 0.99)
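.. note:: A rough sketch of the computation (the full source below is authoritative): the per-sample target is target = reward + gamma * (1 - done) * v, and the loss is criterion(q[act], target), averaged over the batch with the optional weight.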
nstep_return(data, gamma, nstep, value_gamma=None)
Overview
Calculate the n-step return for the DQN algorithm, supporting both the single-agent and multi-agent cases.
Arguments:
- data (:obj:nstep_return_data): The input data, nstep_return_data to calculate the return
- gamma (:obj:float): Discount factor
- nstep (:obj:int): nstep num
- value_gamma (:obj:torch.Tensor): Discount factor for value
Returns:
- return (:obj:torch.Tensor): nstep return
Shapes:
- data (:obj:nstep_return_data): the nstep_return_data containing ['reward', 'next_value', 'done']
- reward (:obj:torch.FloatTensor): :math:(T, B), where T is the timestep (nstep)
- next_value (:obj:torch.FloatTensor): :math:(B, )
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
Examples:
>>> data = nstep_return_data(
>>> reward=torch.randn(3, 3),
>>> next_value=torch.randn(3),
>>> done=torch.randint(0, 2, (3,)),
>>> )
>>> return_ = nstep_return(data, 0.99, 3)
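.. note:: A rough sketch of the computed return: return = sum_{i=0}^{nstep-1} gamma^i * reward[i] + gamma^nstep * (1 - done) * next_value, where value_gamma, if given, replaces gamma^nstep as the discount applied to next_value.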
dist_1step_td_error(data, gamma, v_min, v_max, n_atom)
Overview
1-step TD error for distributional Q-learning based algorithms
Arguments:
- data (:obj:dist_1step_td_data): The input data, dist_1step_td_data to calculate loss
- gamma (:obj:float): Discount factor
- v_min (:obj:float): The min value of support
- v_max (:obj:float): The max value of support
- n_atom (:obj:int): The num of atom
Returns:
- loss (:obj:torch.Tensor): 1-step TD error, 0-dim tensor
Shapes:
- data (:obj:dist_1step_td_data): the dist_1step_td_data containing ['dist', 'next_dist', 'act', 'next_act', 'reward', 'done', 'weight']
- dist (:obj:torch.FloatTensor): :math:(B, N, n_atom) i.e. [batch_size, action_dim, n_atom]
- next_dist (:obj:torch.FloatTensor): :math:(B, N, n_atom)
- act (:obj:torch.LongTensor): :math:(B, )
- next_act (:obj:torch.LongTensor): :math:(B, )
- reward (:obj:torch.FloatTensor): :math:(B, )
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
- weight (:obj:torch.FloatTensor or None): :math:(B, ), the training sample weight
Examples:
>>> dist = torch.randn(4, 3, 51).abs().requires_grad_(True)
>>> next_dist = torch.randn(4, 3, 51).abs()
>>> act = torch.randint(0, 3, (4,))
>>> next_act = torch.randint(0, 3, (4,))
>>> reward = torch.randn(4)
>>> done = torch.randint(0, 2, (4,))
>>> data = dist_1step_td_data(dist, next_dist, act, next_act, reward, done, None)
>>> loss = dist_1step_td_error(data, 0.99, -10.0, 10.0, 51)
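.. note:: A rough sketch of the distributional update (C51-style; the full source below is authoritative): the support z of n_atom atoms evenly spaced in [v_min, v_max] is shifted to reward + gamma * (1 - done) * z, the target distribution next_dist[next_act] is projected onto the fixed support, and the loss is the cross-entropy between this projected distribution and dist[act].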
shape_fn_dntd(args, kwargs)
Overview
Return the dntd shape for HPC
Returns: shape: [T, B, N, n_atom]
dist_nstep_td_error(data, gamma, v_min, v_max, n_atom, nstep=1, value_gamma=None)
Overview
Multistep (1-step or n-step) TD error for distributional Q-learning based algorithms, supporting both the single-agent and multi-agent cases.
Arguments:
- data (:obj:dist_nstep_td_data): The input data, dist_nstep_td_data to calculate loss
- gamma (:obj:float): Discount factor
- v_min (:obj:float): The min value of support
- v_max (:obj:float): The max value of support
- n_atom (:obj:int): The num of atom
- nstep (:obj:int): nstep num, default set to 1
- value_gamma (:obj:torch.Tensor): Discount factor for value
Returns:
- loss (:obj:torch.Tensor): n-step TD error, 0-dim tensor
Shapes:
- data (:obj:dist_nstep_td_data): the dist_nstep_td_data containing ['dist', 'next_n_dist', 'act', 'reward', 'done', 'weight']
- dist (:obj:torch.FloatTensor): :math:(B, N, n_atom) i.e. [batch_size, action_dim, n_atom]
- next_n_dist (:obj:torch.FloatTensor): :math:(B, N, n_atom)
- act (:obj:torch.LongTensor): :math:(B, )
- next_n_act (:obj:torch.LongTensor): :math:(B, )
- reward (:obj:torch.FloatTensor): :math:(T, B), where T is the timestep (nstep)
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
Examples:
>>> dist = torch.randn(4, 3, 51).abs().requires_grad_(True)
>>> next_n_dist = torch.randn(4, 3, 51).abs()
>>> done = torch.randint(0, 2, (4,))
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> reward = torch.randn(5, 4)
>>> data = dist_nstep_td_data(dist, next_n_dist, action, next_action, reward, done, None)
>>> loss, _ = dist_nstep_td_error(data, 0.95, -10.0, 10.0, 51, 5)
v_1step_td_error(data, gamma, criterion=nn.MSELoss(reduction='none'))
Overview
1-step TD error for value-based algorithms
Arguments:
- data (:obj:v_1step_td_data): The input data, v_1step_td_data to calculate loss
- gamma (:obj:float): Discount factor
- criterion (:obj:torch.nn.modules): Loss function criterion
Returns:
- loss (:obj:torch.Tensor): 1-step TD error, 0-dim tensor
- td_error_per_sample (:obj:torch.Tensor): 1-step TD error, 1-dim tensor
Shapes:
- data (:obj:v_1step_td_data): the v_1step_td_data containing ['v', 'next_v', 'reward', 'done', 'weight']
- v (:obj:torch.FloatTensor): :math:(B, ) i.e. [batch_size, ]
- next_v (:obj:torch.FloatTensor): :math:(B, )
- reward (:obj:torch.FloatTensor): :math:(B, )
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
- weight (:obj:torch.FloatTensor or None): :math:(B, ), the training sample weight
Examples:
>>> v = torch.randn(5).requires_grad_(True)
>>> next_v = torch.randn(5)
>>> reward = torch.rand(5)
>>> done = torch.zeros(5)
>>> data = v_1step_td_data(v, next_v, reward, done, None)
>>> loss, td_error_per_sample = v_1step_td_error(data, 0.99)
v_nstep_td_error(data, gamma, nstep=1, criterion=nn.MSELoss(reduction='none'))
Overview
Multistep (n-step) TD error for value-based algorithms
Arguments:
- data (:obj:v_nstep_td_data): The input data, v_nstep_td_data to calculate loss
- gamma (:obj:float): Discount factor
- nstep (:obj:int): nstep num, default set to 1
Returns:
- loss (:obj:torch.Tensor): n-step TD error, 0-dim tensor
Shapes:
- data (:obj:v_nstep_td_data): The v_nstep_td_data containing ['v', 'next_n_v', 'reward', 'done', 'weight', 'value_gamma']
- v (:obj:torch.FloatTensor): :math:(B, ) i.e. [batch_size, ]
- next_n_v (:obj:torch.FloatTensor): :math:(B, )
- reward (:obj:torch.FloatTensor): :math:(T, B), where T is the timestep (nstep)
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
- weight (:obj:torch.FloatTensor or None): :math:(B, ), the training sample weight
- value_gamma (:obj:torch.Tensor): If the remaining data in the buffer is less than n_step, we use value_gamma as the discount factor for next_v rather than gamma**n_step
Examples:
>>> v = torch.randn(5).requires_grad_(True)
>>> next_v = torch.randn(5)
>>> reward = torch.rand(5, 5)
>>> done = torch.zeros(5)
>>> data = v_nstep_td_data(v, next_v, reward, done, 0.9, 0.99)
>>> loss, td_error_per_sample = v_nstep_td_error(data, 0.99, 5)
shape_fn_qntd(args, kwargs)
Overview
Return the qntd shape for HPC
Returns: shape: [T, B, N]
q_nstep_td_error(data, gamma, nstep=1, cum_reward=False, value_gamma=None, criterion=nn.MSELoss(reduction='none'))
Overview
Multistep (1-step or n-step) TD error for Q-learning based algorithms
Arguments:
- data (:obj:q_nstep_td_data): The input data, q_nstep_td_data to calculate loss
- gamma (:obj:float): Discount factor
- cum_reward (:obj:bool): Whether to use cumulative nstep reward, which is computed when collecting data
- value_gamma (:obj:torch.Tensor): Gamma discount value for target q_value
- criterion (:obj:torch.nn.modules): Loss function criterion
- nstep (:obj:int): nstep num, default set to 1
Returns:
- loss (:obj:torch.Tensor): n-step TD error, 0-dim tensor
- td_error_per_sample (:obj:torch.Tensor): n-step TD error, 1-dim tensor
Shapes:
- data (:obj:q_nstep_td_data): The q_nstep_td_data containing ['q', 'next_n_q', 'action', 'reward', 'done']
- q (:obj:torch.FloatTensor): :math:(B, N) i.e. [batch_size, action_dim]
- next_n_q (:obj:torch.FloatTensor): :math:(B, N)
- action (:obj:torch.LongTensor): :math:(B, )
- next_n_action (:obj:torch.LongTensor): :math:(B, )
- reward (:obj:torch.FloatTensor): :math:(T, B), where T is the timestep (nstep)
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
- td_error_per_sample (:obj:torch.FloatTensor): :math:(B, )
Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randint(0, 2, (4,))
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample = q_nstep_td_error(data, 0.95, nstep=nstep)
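.. note:: A rough sketch of the n-step target with cum_reward=False (the full source below is authoritative): target = sum_{i=0}^{nstep-1} gamma^i * reward[i] + gamma^nstep * (1 - done) * next_n_q[next_n_action], with value_gamma, if given, replacing gamma^nstep; the loss is the weighted mean of criterion(q[action], target), and td_error_per_sample holds the per-sample errors.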
bdq_nstep_td_error(data, gamma, nstep=1, cum_reward=False, value_gamma=None, criterion=nn.MSELoss(reduction='none'))
Overview
Multistep (1-step or n-step) TD error for the BDQ algorithm, referenced paper "Action Branching Architectures for Deep Reinforcement Learning", link: https://arxiv.org/pdf/1711.08946. The original paper only provides the 1-step TD error calculation method; here we extend it to the n-step case (a sketch of the n-step target is given in the note after the example below).
Arguments:
- data (:obj:q_nstep_td_data): The input data, q_nstep_td_data to calculate loss
- gamma (:obj:float): Discount factor
- cum_reward (:obj:bool): Whether to use cumulative nstep reward, which is computed when collecting data
- value_gamma (:obj:torch.Tensor): Gamma discount value for target q_value
- criterion (:obj:torch.nn.modules): Loss function criterion
- nstep (:obj:int): nstep num, default set to 1
Returns:
- loss (:obj:torch.Tensor): n-step TD error, 0-dim tensor
- td_error_per_sample (:obj:torch.Tensor): n-step TD error, 1-dim tensor
Shapes:
- data (:obj:q_nstep_td_data): The q_nstep_td_data containing ['q', 'next_n_q', 'action', 'reward', 'done']
- q (:obj:torch.FloatTensor): :math:(B, D, N) i.e. [batch_size, branch_num, action_bins_per_branch]
- next_n_q (:obj:torch.FloatTensor): :math:(B, D, N)
- action (:obj:torch.LongTensor): :math:(B, D)
- next_n_action (:obj:torch.LongTensor): :math:(B, D)
- reward (:obj:torch.FloatTensor): :math:(T, B), where T is the timestep (nstep)
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
- td_error_per_sample (:obj:torch.FloatTensor): :math:(B, )
Examples:
>>> action_per_branch = 3
>>> next_q = torch.randn(8, 6, action_per_branch)
>>> done = torch.randint(0, 2, (8,))
>>> action = torch.randint(0, action_per_branch, size=(8, 6))
>>> next_action = torch.randint(0, action_per_branch, size=(8, 6))
>>> nstep = 3
>>> q = torch.randn(8, 6, action_per_branch).requires_grad_(True)
>>> reward = torch.rand(nstep, 8)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample = bdq_nstep_td_error(data, 0.95, nstep=nstep)
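.. note:: A rough sketch of the n-step extension, assuming per-branch targets aggregated by averaging (the full source below is authoritative): for each branch d, y_d = sum_{i=0}^{nstep-1} gamma^i * reward[i] + gamma^nstep * (1 - done) * next_n_q[d, next_n_action[d]], and the per-branch errors criterion(q[d, action[d]], y_d) are averaged over the branch dimension D to form the loss.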
shape_fn_qntd_rescale(args, kwargs)
Overview
Return the qntd_rescale shape for HPC
Returns: shape: [T, B, N]
q_nstep_td_error_with_rescale(data, gamma, nstep=1, value_gamma=None, criterion=nn.MSELoss(reduction='none'), trans_fn=value_transform, inv_trans_fn=value_inv_transform)
Overview
Multistep (1-step or n-step) TD error with value rescaling
Arguments:
- data (:obj:q_nstep_td_data): The input data, q_nstep_td_data to calculate loss
- gamma (:obj:float): Discount factor
- nstep (:obj:int): nstep num, default set to 1
- criterion (:obj:torch.nn.modules): Loss function criterion
- trans_fn (:obj:Callable): Value transform function, default to value_transform (refer to rl_utils/value_rescale.py)
- inv_trans_fn (:obj:Callable): Value inverse transform function, default to value_inv_transform (refer to rl_utils/value_rescale.py)
Returns:
- loss (:obj:torch.Tensor): n-step TD error, 0-dim tensor
- td_error_per_sample (:obj:torch.Tensor): n-step TD error, 1-dim tensor
Shapes:
- data (:obj:q_nstep_td_data): The q_nstep_td_data containing ['q', 'next_n_q', 'action', 'reward', 'done']
- q (:obj:torch.FloatTensor): :math:(B, N) i.e. [batch_size, action_dim]
- next_n_q (:obj:torch.FloatTensor): :math:(B, N)
- action (:obj:torch.LongTensor): :math:(B, )
- next_n_action (:obj:torch.LongTensor): :math:(B, )
- reward (:obj:torch.FloatTensor): :math:(T, B), where T is the timestep (nstep)
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randint(0, 2, (4,))
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, _ = q_nstep_td_error_with_rescale(data, 0.95, nstep=nstep)
dqfd_nstep_td_error(data, gamma, lambda_n_step_td, lambda_supervised_loss, margin_function, lambda_one_step_td=1.0, nstep=1, cum_reward=False, value_gamma=None, criterion=nn.MSELoss(reduction='none'))
Overview
Multistep n-step TD error + 1-step TD error + supervised margin loss for DQfD
Arguments:
- data (:obj:dqfd_nstep_td_data): The input data, dqfd_nstep_td_data to calculate loss
- gamma (:obj:float): Discount factor
- lambda_n_step_td (:obj:float): The weight of the n-step TD loss
- lambda_supervised_loss (:obj:float): The weight of the supervised margin loss
- margin_function (:obj:float): The margin value used in the supervised margin loss
- lambda_one_step_td (:obj:float): The weight of the 1-step TD loss, default set to 1.0
- cum_reward (:obj:bool): Whether to use cumulative nstep reward, which is computed when collecting data
- value_gamma (:obj:torch.Tensor): Gamma discount value for target q_value
- criterion (:obj:torch.nn.modules): Loss function criterion
- nstep (:obj:int): nstep num, default set to 1
Returns:
- loss (:obj:torch.Tensor): n-step TD error + 1-step TD error + supervised margin loss, 0-dim tensor
- td_error_per_sample (:obj:torch.Tensor): n-step TD error + 1-step TD error + supervised margin loss, 1-dim tensor
- loss_statistics (:obj:list): The value of each loss term, returned for monitoring
Shapes:
- data (:obj:dqfd_nstep_td_data): the dqfd_nstep_td_data containing ['q', 'next_n_q', 'action', 'next_n_action', 'reward', 'done', 'weight', 'new_n_q_one_step', 'next_n_action_one_step', 'is_expert']
- q (:obj:torch.FloatTensor): :math:(B, N) i.e. [batch_size, action_dim]
- next_n_q (:obj:torch.FloatTensor): :math:(B, N)
- action (:obj:torch.LongTensor): :math:(B, )
- next_n_action (:obj:torch.LongTensor): :math:(B, )
- reward (:obj:torch.FloatTensor): :math:(T, B), where T is the timestep (nstep)
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
- td_error_per_sample (:obj:torch.FloatTensor): :math:(B, )
- new_n_q_one_step (:obj:torch.FloatTensor): :math:(B, N)
- next_n_action_one_step (:obj:torch.LongTensor): :math:(B, )
- is_expert (:obj:int): 0 or 1
Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randint(0, 2, (4,))
>>> done_1 = torch.randint(0, 2, (4,))
>>> next_q_one_step = torch.randn(4, 3)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> next_action_one_step = torch.randint(0, 3, size=(4, ))
>>> is_expert = torch.ones((4))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = dqfd_nstep_td_data(
>>> q, next_q, action, next_action, reward, done, done_1, None,
>>> next_q_one_step, next_action_one_step, is_expert
>>> )
>>> loss, td_error_per_sample, loss_statistics = dqfd_nstep_td_error(
>>> data, 0.95, lambda_n_step_td=1, lambda_supervised_loss=1,
>>> margin_function=0.8, nstep=nstep
>>> )
dqfd_nstep_td_error_with_rescale(data, gamma, lambda_n_step_td, lambda_supervised_loss, lambda_one_step_td, margin_function, nstep=1, cum_reward=False, value_gamma=None, criterion=nn.MSELoss(reduction='none'), trans_fn=value_transform, inv_trans_fn=value_inv_transform)
Overview
Multistep n-step TD error + 1-step TD error + supervised margin loss for DQfD, with value rescaling
Arguments:
- data (:obj:dqfd_nstep_td_data): The input data, dqfd_nstep_td_data to calculate loss
- gamma (:obj:float): Discount factor
- lambda_n_step_td (:obj:float): The weight of the n-step TD loss
- lambda_supervised_loss (:obj:float): The weight of the supervised margin loss
- lambda_one_step_td (:obj:float): The weight of the 1-step TD loss
- margin_function (:obj:float): The margin value used in the supervised margin loss
- cum_reward (:obj:bool): Whether to use cumulative nstep reward, which is computed when collecting data
- value_gamma (:obj:torch.Tensor): Gamma discount value for target q_value
- criterion (:obj:torch.nn.modules): Loss function criterion
- nstep (:obj:int): nstep num, default set to 1
- trans_fn (:obj:Callable): Value transform function, default to value_transform (refer to rl_utils/value_rescale.py)
- inv_trans_fn (:obj:Callable): Value inverse transform function, default to value_inv_transform (refer to rl_utils/value_rescale.py)
Returns:
- loss (:obj:torch.Tensor): n-step TD error + 1-step TD error + supervised margin loss, 0-dim tensor
- td_error_per_sample (:obj:torch.Tensor): n-step TD error + 1-step TD error + supervised margin loss, 1-dim tensor
Shapes:
- data (:obj:dqfd_nstep_td_data): The dqfd_nstep_td_data containing ['q', 'next_n_q', 'action', 'next_n_action', 'reward', 'done', 'weight', 'new_n_q_one_step', 'next_n_action_one_step', 'is_expert']
- q (:obj:torch.FloatTensor): :math:(B, N) i.e. [batch_size, action_dim]
- next_n_q (:obj:torch.FloatTensor): :math:(B, N)
- action (:obj:torch.LongTensor): :math:(B, )
- next_n_action (:obj:torch.LongTensor): :math:(B, )
- reward (:obj:torch.FloatTensor): :math:(T, B), where T is the timestep (nstep)
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
- td_error_per_sample (:obj:torch.FloatTensor): :math:(B, )
- new_n_q_one_step (:obj:torch.FloatTensor): :math:(B, N)
- next_n_action_one_step (:obj:torch.LongTensor): :math:(B, )
- is_expert (:obj:int): 0 or 1
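The following example is an illustrative sketch rather than an excerpt from the library's tests: it mirrors the dqfd_nstep_td_error example above, and the unpacking follows the Returns section of this function; the rescale variant additionally applies trans_fn/inv_trans_fn around the target value.
Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randint(0, 2, (4,))
>>> done_1 = torch.randint(0, 2, (4,))
>>> next_q_one_step = torch.randn(4, 3)
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> next_action_one_step = torch.randint(0, 3, size=(4, ))
>>> is_expert = torch.ones((4))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = dqfd_nstep_td_data(
>>> q, next_q, action, next_action, reward, done, done_1, None,
>>> next_q_one_step, next_action_one_step, is_expert
>>> )
>>> loss, td_error_per_sample = dqfd_nstep_td_error_with_rescale(
>>> data, 0.95, lambda_n_step_td=1, lambda_supervised_loss=1,
>>> lambda_one_step_td=1, margin_function=0.8, nstep=nstep
>>> )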
qrdqn_nstep_td_error(data, gamma, nstep=1, value_gamma=None)
Overview
Multistep (1-step or n-step) TD error used in QRDQN
Arguments:
- data (:obj:qrdqn_nstep_td_data): The input data, qrdqn_nstep_td_data to calculate loss
- gamma (:obj:float): Discount factor
- nstep (:obj:int): nstep num, default set to 1
Returns:
- loss (:obj:torch.Tensor): n-step TD error, 0-dim tensor
- td_error_per_sample (:obj:torch.Tensor): n-step TD error, 1-dim tensor
Shapes:
- data (:obj:q_nstep_td_data): The q_nstep_td_data containing ['q', 'next_n_q', 'action', 'reward', 'done']
- q (:obj:torch.FloatTensor): :math:(tau, B, N) i.e. [num_quantiles, batch_size, action_dim]
- next_n_q (:obj:torch.FloatTensor): :math:(tau', B, N)
- action (:obj:torch.LongTensor): :math:(B, )
- next_n_action (:obj:torch.LongTensor): :math:(B, )
- reward (:obj:torch.FloatTensor): :math:(T, B), where T is the timestep (nstep)
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
Examples:
>>> next_q = torch.randn(4, 3, 3)
>>> done = torch.randint(0, 2, (4,))
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = qrdqn_nstep_td_data(q, next_q, action, next_action, reward, done, 3, None)
>>> loss, td_error_per_sample = qrdqn_nstep_td_error(data, 0.95, nstep=nstep)
q_nstep_sql_td_error(data, gamma, alpha, nstep=1, cum_reward=False, value_gamma=None, criterion=nn.MSELoss(reduction='none'))
Overview
Multistep (1-step or n-step) TD error for the soft Q-learning (SQL) algorithm
Arguments:
- data (:obj:q_nstep_td_data): The input data, q_nstep_sql_td_data to calculate loss
- gamma (:obj:float): Discount factor
- alpha (:obj:float): A parameter to weight the entropy term in the policy equation
- cum_reward (:obj:bool): Whether to use cumulative nstep reward, which is computed when collecting data
- value_gamma (:obj:torch.Tensor): Gamma discount value for target soft_q_value
- criterion (:obj:torch.nn.modules): Loss function criterion
- nstep (:obj:int): nstep num, default set to 1
Returns:
- loss (:obj:torch.Tensor): n-step TD error, 0-dim tensor
- td_error_per_sample (:obj:torch.Tensor): n-step TD error, 1-dim tensor
- record_target_v (:obj:torch.Tensor): The target soft value, returned for recording
Shapes:
- data (:obj:q_nstep_td_data): The q_nstep_td_data containing ['q', 'next_n_q', 'action', 'reward', 'done']
- q (:obj:torch.FloatTensor): :math:(B, N) i.e. [batch_size, action_dim]
- next_n_q (:obj:torch.FloatTensor): :math:(B, N)
- action (:obj:torch.LongTensor): :math:(B, )
- next_n_action (:obj:torch.LongTensor): :math:(B, )
- reward (:obj:torch.FloatTensor): :math:(T, B), where T is the timestep (nstep)
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
- td_error_per_sample (:obj:torch.FloatTensor): :math:(B, )
Examples:
>>> next_q = torch.randn(4, 3)
>>> done = torch.randint(0, 2, (4,))
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3).requires_grad_(True)
>>> reward = torch.rand(nstep, 4)
>>> data = q_nstep_td_data(q, next_q, action, next_action, reward, done, None)
>>> loss, td_error_per_sample, record_target_v = q_nstep_sql_td_error(data, 0.95, 1.0, nstep=nstep)
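.. note:: A rough sketch of the soft Q-learning target (the full source below is authoritative): the greedy max over next_n_q is replaced by the soft state value v = alpha * logsumexp(next_n_q / alpha), so target = sum_{i=0}^{nstep-1} gamma^i * reward[i] + gamma^nstep * (1 - done) * v; record_target_v exposes this value for logging.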
iqn_nstep_td_error(data, gamma, nstep=1, kappa=1.0, value_gamma=None)
Overview
Multistep (1-step or n-step) TD error used in IQN, referenced paper "Implicit Quantile Networks for Distributional Reinforcement Learning" https://arxiv.org/pdf/1806.06923.pdf
Arguments:
- data (:obj:iqn_nstep_td_data): The input data, iqn_nstep_td_data to calculate loss
- gamma (:obj:float): Discount factor
- nstep (:obj:int): nstep num, default set to 1
- kappa (:obj:float): Threshold of the Huber loss used in quantile regression, default set to 1.0
- value_gamma (:obj:torch.Tensor): Gamma discount value for target q_value
Returns:
- loss (:obj:torch.Tensor): n-step TD error, 0-dim tensor
- td_error_per_sample (:obj:torch.Tensor): n-step TD error, 1-dim tensor
Shapes:
- data (:obj:q_nstep_td_data): The q_nstep_td_data containing ['q', 'next_n_q', 'action', 'reward', 'done']
- q (:obj:torch.FloatTensor): :math:(tau, B, N) i.e. [num_quantiles, batch_size, action_dim]
- next_n_q (:obj:torch.FloatTensor): :math:(tau', B, N)
- action (:obj:torch.LongTensor): :math:(B, )
- next_n_action (:obj:torch.LongTensor): :math:(B, )
- reward (:obj:torch.FloatTensor): :math:(T, B), where T is the timestep (nstep)
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
Examples:
>>> next_q = torch.randn(3, 4, 3)
>>> done = torch.randint(0, 2, (4,))
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(3, 4, 3).requires_grad_(True)
>>> replay_quantile = torch.randn([3, 4, 1])
>>> reward = torch.rand(nstep, 4)
>>> data = iqn_nstep_td_data(q, next_q, action, next_action, reward, done, replay_quantile, None)
>>> loss, td_error_per_sample = iqn_nstep_td_error(data, 0.95, nstep=nstep)
fqf_nstep_td_error(data, gamma, nstep=1, kappa=1.0, value_gamma=None)
Overview
Multistep (1-step or n-step) TD error used in FQF, referenced paper "Fully Parameterized Quantile Function for Distributional Reinforcement Learning" https://arxiv.org/pdf/1911.02140.pdf
Arguments:
- data (:obj:fqf_nstep_td_data): The input data, fqf_nstep_td_data to calculate loss
- gamma (:obj:float): Discount factor
- nstep (:obj:int): nstep num, default set to 1
- kappa (:obj:float): Threshold of the Huber loss used in quantile regression, default set to 1.0
- value_gamma (:obj:torch.Tensor): Gamma discount value for target q_value
Returns:
- loss (:obj:torch.Tensor): n-step TD error, 0-dim tensor
- td_error_per_sample (:obj:torch.Tensor): n-step TD error, 1-dim tensor
Shapes:
- data (:obj:q_nstep_td_data): The q_nstep_td_data containing ['q', 'next_n_q', 'action', 'reward', 'done']
- q (:obj:torch.FloatTensor): :math:(B, tau, N) i.e. [batch_size, tau, action_dim]
- next_n_q (:obj:torch.FloatTensor): :math:(B, tau', N)
- action (:obj:torch.LongTensor): :math:(B, )
- next_n_action (:obj:torch.LongTensor): :math:(B, )
- reward (:obj:torch.FloatTensor): :math:(T, B), where T is the timestep (nstep)
- done (:obj:torch.BoolTensor): :math:(B, ), whether the episode ends in the last timestep
- quantiles_hats (:obj:torch.FloatTensor): :math:(B, tau)
Examples:
>>> next_q = torch.randn(4, 3, 3)
>>> done = torch.randint(0, 2, (4,))
>>> action = torch.randint(0, 3, size=(4, ))
>>> next_action = torch.randint(0, 3, size=(4, ))
>>> nstep = 3
>>> q = torch.randn(4, 3, 3).requires_grad_(True)
>>> quantiles_hats = torch.randn([4, 3])
>>> reward = torch.rand(nstep, 4)
>>> data = fqf_nstep_td_data(q, next_q, action, next_action, reward, done, quantiles_hats, None)
>>> loss, td_error_per_sample = fqf_nstep_td_error(data, 0.95, nstep=nstep)
fqf_calculate_fraction_loss(q_tau_i, q_value, quantiles, actions)
Overview
Calculate the fraction loss in FQF, referenced paper Fully Parameterized Quantile Function for Distributional Reinforcement Learning https://arxiv.org/pdf/1911.02140.pdf
Arguments:
- q_tau_i (:obj:torch.FloatTensor): :math:(batch_size, num_quantiles-1, action_dim)
- q_value (:obj:torch.FloatTensor): :math:(batch_size, num_quantiles, action_dim)
- quantiles (:obj:torch.FloatTensor): :math:(batch_size, num_quantiles+1)
- actions (:obj:torch.LongTensor): :math:(batch_size, )
Returns:
- fraction_loss (:obj:torch.Tensor): fraction loss, 0-dim tensor
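The following example is an illustrative, shape-only sketch rather than an excerpt from the library's tests; batch_size, num_quantiles and the way quantiles is made monotonically increasing here are arbitrary choices for illustration:
Examples:
>>> batch_size, num_quantiles, action_dim = 4, 32, 3
>>> q_tau_i = torch.randn(batch_size, num_quantiles - 1, action_dim)
>>> q_value = torch.randn(batch_size, num_quantiles, action_dim)
>>> quantiles = torch.rand(batch_size, num_quantiles + 1).sort(dim=-1).values.requires_grad_(True)
>>> actions = torch.randint(0, action_dim, (batch_size, ))
>>> fraction_loss = fqf_calculate_fraction_loss(q_tau_i, q_value, quantiles, actions)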
shape_fn_td_lambda(args, kwargs)
Overview
Return the td_lambda shape for HPC
Returns: shape: [T, B]
td_lambda_error(data, gamma=0.9, lambda_=0.8)
Overview
Computing the TD(lambda) loss given a constant gamma and lambda. There is no special handling for the terminal state value; if some state has reached the terminal, just fill in zeros for values and rewards beyond the terminal (including the terminal state: values[terminal] should also be 0)
Arguments:
- data (:obj:namedtuple): td_lambda input data with fields ['value', 'reward', 'weight']
- gamma (:obj:float): Constant discount factor gamma, should be in [0, 1], defaults to 0.9
- lambda_ (:obj:float): Constant lambda, should be in [0, 1], defaults to 0.8
Returns:
- loss (:obj:torch.Tensor): Computed MSE loss, averaged over the batch
Shapes:
- value (:obj:torch.FloatTensor): :math:(T+1, B), where T is trajectory length and B is batch, which is the estimation of the state value at step 0 to T
- reward (:obj:torch.FloatTensor): :math:(T, B), the rewards from time step 0 to T-1
- weight (:obj:torch.FloatTensor or None): :math:(B, ), the training sample weight
- loss (:obj:torch.FloatTensor): :math:(), 0-dim tensor
Examples:
>>> T, B = 8, 4
>>> value = torch.randn(T + 1, B).requires_grad_(True)
>>> reward = torch.rand(T, B)
>>> loss = td_lambda_error(td_lambda_data(value, reward, None))
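.. note:: A rough sketch of the computation (the full source below is authoritative): lambda-return targets are built from value and reward with the constant gamma and lambda_ (functionally like generalized_lambda_returns below), and the loss is the MSE between value[0:T] and these targets, weighted by weight if it is given and averaged over the batch.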
generalized_lambda_returns(bootstrap_values, rewards, gammas, lambda_, done=None)
Overview
Functional equivalent of trfl.value_ops.generalized_lambda_returns (https://github.com/deepmind/trfl/blob/2c07ac22512a16715cc759f0072be43a5d12ae45/trfl/value_ops.py#L74). Passing in a number instead of a tensor makes that value constant for all samples in the batch.
Arguments:
- bootstrap_values (:obj:torch.Tensor or :obj:float): Estimation of the value at step 0 to T, of size [T_traj+1, batchsize]
- rewards (:obj:torch.Tensor): The rewards from step 0 to T-1, of size [T_traj, batchsize]
- gammas (:obj:torch.Tensor or :obj:float): Discount factor for each step (from 0 to T-1), of size [T_traj, batchsize]
- lambda_ (:obj:torch.Tensor or :obj:float): Determines the mix of bootstrapping vs further accumulation of multistep returns at each timestep, of size [T_traj, batchsize]
- done (:obj:torch.Tensor or :obj:float): Whether the episode is done at the current step (from 0 to T-1), of size [T_traj, batchsize]
Returns:
- return (:obj:torch.Tensor): Computed lambda return value for each state from 0 to T-1, of size [T_traj, batchsize]
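The following example is an illustrative sketch rather than an excerpt from the library's tests, using scalar gammas and lambda_, which are broadcast to all samples as described above:
Examples:
>>> T, B = 8, 4
>>> bootstrap_values = torch.randn(T + 1, B)
>>> rewards = torch.randn(T, B)
>>> return_ = generalized_lambda_returns(bootstrap_values, rewards, 0.99, 0.95)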
multistep_forward_view(bootstrap_values, rewards, gammas, lambda_, done=None)
Overview
Same as trfl.sequence_ops.multistep_forward_view, which implements (12.18) in Sutton & Barto. Assuming the first dim of the input tensors is the time dimension.
.. note::
   result[T-1] = rewards[T-1] + gammas[T-1] * bootstrap_values[T]
   for t in 0...T-2:
       result[t] = rewards[t] + gammas[t] * (lambda_[t] * result[t+1] + (1 - lambda_[t]) * bootstrap_values[t+1])
Arguments:
- bootstrap_values (:obj:torch.Tensor): Estimation of the value at step 0 to T, of size [T_traj+1, batchsize]
- rewards (:obj:torch.Tensor): The rewards from step 0 to T-1, of size [T_traj, batchsize]
- gammas (:obj:torch.Tensor): Discount factor for each step (from 0 to T-1), of size [T_traj, batchsize]
- lambda_ (:obj:torch.Tensor): Determines the mix of bootstrapping vs further accumulation of multistep returns at each timestep, of size [T_traj, batchsize]
- done (:obj:torch.Tensor): Whether the episode is done at the current step (from 0 to T-1), of size [T_traj, batchsize]
Returns:
- ret (:obj:torch.Tensor): Computed lambda return value for each state from 0 to T-1, of size [T_traj, batchsize]
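The following example is an illustrative sketch rather than an excerpt from the library's tests, with per-step tensors for gammas and lambda_ matching the [T_traj, batchsize] layout above:
Examples:
>>> T, B = 8, 4
>>> bootstrap_values = torch.randn(T + 1, B)
>>> rewards = torch.randn(T, B)
>>> gammas = 0.99 * torch.ones(T, B)
>>> lambda_ = 0.95 * torch.ones(T, B)
>>> ret = multistep_forward_view(bootstrap_values, rewards, gammas, lambda_)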
Full Source Code
../ding/rl_utils/td.py