Dynamic Programming
dynamic_programming
Functions:
- get_state_action_values – Calculates the value of each action for a given state. Used within the main value iteration loop.
- state_value_iterator – Core value iteration function - calculates the value function for the MDP and returns q-values for each action in each state.
Attributes:
- solve_value_iteration (Tuple[ndarray, ndarray]) – Solves an MDP using value iteration given a reward function.
solve_value_iteration (module-attribute)
solve_value_iteration: Tuple[ndarray, ndarray] = jit(solve_value_iteration, static_argnums=(0, 1))
Solves an MDP using value iteration given a reward function.
Parameters:
- n_states (int) – Number of states
- n_actions (int) – Number of actions
- reward_function (ndarray) – Reward function (i.e., reward at each state)
- max_iter (int) – Maximum number of iterations
- discount (float) – Discount factor
- sas (ndarray) – State-action-state transition probabilities
- tol (float) – Tolerance for convergence
Returns:
- Tuple[jnp.ndarray, jnp.ndarray] – Final value function and action values (Q-values)
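A minimal usage sketch is shown below. It assumes the module is importable as behavioural_modelling.planning.dynamic_programming, that positional arguments follow the order of the parameter list above, and that sas has shape (n_states, n_actions, n_states) with sas[s, a, s'] = P(s' | s, a); check the source if your version differs.

```python
import jax.numpy as jnp

from behavioural_modelling.planning.dynamic_programming import solve_value_iteration

# Toy 3-state, 2-action MDP: action 0 stays put, action 1 moves one state to the right.
# sas[s, a, s'] is assumed to give P(s' | s, a).
sas = jnp.array(
    [
        [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # transitions from state 0
        [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # transitions from state 1
        [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # state 2 is absorbing
    ]
)
reward = jnp.array([0.0, 0.0, 1.0])  # reward attached to visiting each state

# n_states and n_actions are static arguments (static_argnums=(0, 1) above),
# so they are passed as plain Python ints.
values, q_values = solve_value_iteration(3, 2, reward, 500, 0.9, sas, 1e-4)

print(values.shape)    # expected (3,): value of each state
print(q_values.shape)  # expected (3, 2): value of each action in each state
```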
get_state_action_values
get_state_action_values(s: int, n_actions: int, sas: ndarray, reward: ndarray, discount: float, values: ndarray) -> ndarray
Calculates the value of each action for a given state. Used within the main value iteration loop.
Reward is typically conceived of as resulting from taking action A in state S. Here, for the sake of simplicity, we assume that the reward results from visiting state S' – that is, taking action A in state S is not rewarding in itself; the reward received depends on the reward present in state S'.
Parameters:
- s (int) – State ID
- n_actions (int) – Number of possible actions
- sas (ndarray) – State, action, state transition function
- reward (ndarray) – Reward available at each state
- discount (float) – Discount factor
- values (ndarray) – Current estimate of value function
Returns:
- np.ndarray – Estimated value of each action in the given state
Source code in behavioural_modelling/planning/dynamic_programming.py
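Conceptually, this is the single-state Bellman backup with the reward attached to the successor state: Q(s, a) = Σ_{s'} P(s' | s, a) [R(s') + γ V(s')]. The snippet below is an illustrative reconstruction of that backup under the shape assumptions used above (sas[s] giving P(s' | s, a) as an (n_actions, n_states) array); it is a sketch, not the library's source.

```python
import jax.numpy as jnp


def state_action_values_sketch(s, sas, reward, discount, values):
    # sas[s] is assumed to have shape (n_actions, n_states), giving P(s' | s, a).
    # Each action's value is the expected successor reward plus the discounted
    # successor value: Q(s, a) = sum_s' P(s' | s, a) * (reward[s'] + discount * values[s']).
    return sas[s] @ (reward + discount * values)
```

With the toy MDP from the earlier example, state_action_values_sketch(0, sas, reward, 0.9, jnp.zeros(3)) returns a length-2 array holding the value of each action available in state 0.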
state_value_iterator
state_value_iterator(values: ndarray, reward: ndarray, discount: float, sas: ndarray, soft: bool = False) -> Tuple[ndarray, float, ndarray]
Core value iteration function - calculates value function for the MDP and returns q-values for each action in each state.
This function just runs one iteration of the value iteration algorithm.
"Soft" value iteration can optionally be performed. This essentially involves taking the softmax of action values rather than the max, and is useful for inverse reinforcement learning (see Bloem & Bambos, 2014).
Parameters:
- values (ndarray) – Current estimate of the value function
- reward (ndarray) – Reward at each state (i.e. features x reward function)
- discount (float) – Discount factor
- sas (ndarray) – State, action, state transition function
- soft (bool, default: False) – If True, this implements "soft" value iteration rather than standard value iteration. Defaults to False.
Returns:
- Tuple[np.ndarray, float, np.ndarray] – New estimate of the value function, new delta, and new q_values
Source code in behavioural_modelling/planning/dynamic_programming.py
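To make the distinction between the standard and "soft" backups concrete, the sketch below contrasts the two on a matrix of action values. It assumes q_values has shape (n_states, n_actions) and is only an illustration of the idea from Bloem & Bambos (2014), not this function's actual source.

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp


def backup_sketch(q_values, soft=False):
    # q_values: assumed shape (n_states, n_actions).
    if soft:
        # "Soft" value iteration: replace the max over actions with a
        # log-sum-exp (softmax) backup, which is smooth and therefore
        # convenient for inverse reinforcement learning.
        return logsumexp(q_values, axis=1)
    # Standard value iteration: back up the best action's value in each state.
    return jnp.max(q_values, axis=1)
```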