Hierarchical Reinforcement Learning Using Graphical Models
Victoria Manfredi [email protected]
Sridhar Mahadevan [email protected]
Computer Science Dept., 140 Governor's Drive, University of Massachusetts, Amherst, MA 01003-9264 USA
Abstract

The graphical models paradigm provides a general framework for automatically learning hierarchical models using Expectation-Maximization, enabling both abstract states and abstract policies to be learned. In this paper we describe a two-phase method for incorporating policies learned with a graphical model to bias the behaviour of an SMDP Q-learning agent. In the first, reward-free phase, a graphical model is trained from sample trajectories; in the second phase, policies are extracted from the graphical model and improved by incorporating reward information. We present results from a simulated grid-world Taxi task showing that the SMDP Q-learning agent using the learned policies quickly does as well as an SMDP Q-learning agent using hand-coded policies.
Appearing in Proceedings of the ICML'05 Workshop on Rich Representations for Reinforcement Learning, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).

1. Introduction

Abstraction is essential to scaling reinforcement learning (RL) (Barto & Mahadevan, 2003; Dietterich, 2000; Parr & Russell, 1998; Sutton et al., 1999). Temporal abstraction permits structured initial exploration by RL agents, allows reuse of learned activities, and simplifies human interpretation of the learned policy. Spatial abstraction decreases the number of states that need to be experienced, reduces the amount of memory needed, and captures regularities in the policy structure. Given predefined state and policy abstractions, for instance a task hierarchy, existing methods (Dietterich, 2000; Parr & Russell, 1998; Sutton et al., 1999) can be used to learn the corresponding hierarchical policy. One of the most difficult problems in hierarchical reinforcement learning, however, is how to automatically learn the abstractions. For instance, suppose a MAXQ hierarchy is desired: what should the tasks be? How should the termination conditions be defined?

The graphical models framework provides a powerful approach to automatically learning abstractions for hierarchical RL. For example, the abstract hidden Markov model (AHMM) proposed by Bui et al. (2002) is a hierarchical graphical model that encodes abstract policies; these policies are derived from the options framework (Sutton et al., 1999) and have associated initiation and termination states. Alternatively, the hierarchical hidden Markov model (HHMM) (Fine et al., 1998) encodes abstract states.

In this work, we describe the use of graphical models to automate hierarchical reinforcement learning using imitation. The method we propose takes advantage of a mentor who provides examples of optimal behaviour in the form of state transitions and primitive actions. We believe humans exploit similar guidance when learning complex skills, such as driving. Previous approaches to automatic abstraction in RL can be divided into two groups: methods that identify subgoals, i.e., states or clusters of states, and then learn policies to those subgoals (McGovern & Barto, 2001; Şimşek & Barto, 2004; Mannor et al., 2004), and methods that explicitly build a policy hierarchy (Hengst, 2002). Key differences between our method and previous work are that, first, by modifying the structure of the graphical model, different abstractions can be learned; second, we learn subgoals (terminations) and policies simultaneously, rather than separately; finally, our method provides a mechanism for coping with partially observable state through the use of hidden variables.

Previous work on imitation in the context of RL has focused on learning a flat policy model. In Price and Boutilier (2003), a mentor's state transitions are used to help the learner converge more quickly to a good policy. In Abbeel and Ng (2004), the observer's reward function is unknown; instead, a mentor's state transitions and feature expectations are used to identify a policy with similar feature expectations, where features are a mapping over states and feature expectations partially encode the value of the policy. In comparison, graphical models provide a mechanism for learning by imitation in which the learner does not just imitate the mentor but learns the task structure implicit in the higher-level subgoals of the mentor's policy. This inference involves computing a distribution over the mentor's higher-level policies from its state transitions and actions, with the higher-level policy variables treated as hidden variables. While not studied here, rewards could additionally be incorporated into the graphical model, as in (Samejima et al., 2004). In the rest of this paper, we define the graphical model that we use and investigate its effectiveness in automating hierarchical RL.
2. Dynamic Abstraction Networks

Previous work in graphical models has largely focused on temporal abstraction or state abstraction in isolation. Intuitively, however, abstract policies are intricately tied to abstract states. For instance, New York City is both a temporal and a spatial abstraction: its geographical location permits you both to execute such abstract policies as visit the Metropolitan Museum of Art or attend a Broadway play, and to define such spatial abstractions as the five boroughs of New York City or the state of New York. In other work (Manfredi & Mahadevan, 2005) we have proposed a new type of hierarchical graphical model, dynamic abstraction networks (DANs), that combines state and temporal abstraction. Jointly encoding state and temporal abstraction permits abstract states and policies to be learned simultaneously, unlike in the AHMM or HHMM alone. We showed empirically in (Manfredi & Mahadevan, 2005) that DANs seem to learn better policy abstractions than do AHMMs.

Figure 1 shows the dynamic Bayesian network (DBN) representation of a 2-level DAN. Informally, we can think of DANs as a merging of a state hierarchy, represented by the HHMM, and a policy hierarchy, represented by a modified version of the AHMM which we refer to as an mAHMM. Technically, we use the dynamic Bayesian network representations of the HHMM and AHMM defined in (Murphy & Paskin, 2001) and (Bui et al., 2002), respectively. We merge the mAHMM and HHMM by adding edges from state nodes at time t on level i to policy and policy-termination nodes at time t on level i, and from policy nodes at time t on level i to state nodes at time t + 1 on level i. One of the key ideas behind this model is that abstract states are useful for learning abstract policies; consequently, abstract states are fed into all policy levels, including the actions.

Figure 1. DBN representation of a dynamic abstraction network. We emphasize that this is just one realization of a dynamic abstraction network, and other configurations are possible.

We formally define a K-level DAN as comprised of an intertwined state hierarchy and policy hierarchy, defined as follows.

Policy Hierarchy. A policy hierarchy Hπ with K levels is given by the ordered tuple Hπ = (OA, Π0, Π1, . . . , ΠK).
• OA is the set of action observations. OA is only necessary if actions are continuous or partially observable.
• Π0 is the set of primitive (one-step) actions.
• Πi is the set of abstract policies at level 1 ≤ i ≤ K, defined in more detail below.

State Hierarchy. A state hierarchy Hs with K levels is given by the ordered tuple Hs = (OS, S0, S1, . . . , SK).
• OS is the set of state observations.
• Sj is the set of abstract states at level 0 ≤ j ≤ K, defined in more detail below.
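The edge structure that merges the state and policy hierarchies can be made concrete. The sketch below is our own illustration, not the authors' code; the node-naming convention ("S", "Pi", "beta") is an assumption we introduce for readability:

```python
# Sketch of the DAN coupling edges described above (illustrative only).
# A node is named (kind, level, time); this convention is ours.

def dan_coupling_edges(num_levels, horizon):
    """Edges that merge the state hierarchy (HHMM) with the policy
    hierarchy (mAHMM): state -> policy and state -> termination within a
    time slice, and policy -> next-slice state at the same level."""
    edges = []
    for t in range(horizon):
        for i in range(num_levels):
            # State at level i, time t feeds the policy choice and its
            # termination node at the same level and slice.
            edges.append((("S", i, t), ("Pi", i, t)))
            edges.append((("S", i, t), ("beta", i, t)))
            # The policy at level i, time t influences the next state
            # at the same level.
            if t + 1 < horizon:
                edges.append((("Pi", i, t), ("S", i, t + 1)))
    return edges

edges = dan_coupling_edges(num_levels=2, horizon=2)
```

A full DAN would add these edges on top of the standard HHMM and mAHMM intra-hierarchy edges; only the coupling edges from the text are shown here.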
Abstract Policies. At level i, each abstract policy πi ∈ Πi is given by the tuple πi = ⟨Sπi, Bπi, βπi, σπi⟩.
• Selection set. Sπi ⊂ Si is the set of states in which πi can be executed.
• Termination probability. Bπi ⊂ Si is the set of states in which πi can terminate, and βπi : Bπi → (0, 1] is the probability that policy πi terminates in a given state in Bπi. Note that in any termination state that is not also a selection state, policy πi terminates with probability one.
• Selection probability. σπi : Πi+1:K × Sπi → [0, 1] is the probability with which a policy πi ∈ Πi is initiated when executing abstract policies πi+1 ∈ Πi+1, . . . , πK ∈ ΠK in state sπi ∈ Sπi.

Abstract States. At level j, each abstract state sj ∈ Sj is given by the tuple sj = ⟨Πsj, αsj, σsj⟩. Note that an abstract state may be part of more than one higher-level state; for instance, states at level j may be letters while states at level j + 1 are words.
• Entry set. Πsj ⊂ Πj is the set of policies which enter state sj.
• Exit probability. αsj : Sj → (0, 1] is the probability that the agent exits state sj ∈ Sj when in states sj+1 ∈ Sj+1, . . . , sK ∈ SK.
• Transition probability. σsj : Sj+1:K × Πj × Sj → [0, 1] is the probability that the agent transitions to state sj^{t+1} ∈ Sj from state sj^t ∈ Sj when in parent states sj+1^t ∈ Sj+1, . . . , sK^t ∈ SK and executing policy πj^t ∈ Πj.

For the policy hierarchy, the Π nodes encode the selection sets Sπ and the selection probabilities σπ, while the β nodes encode the termination probabilities βπ. For the state hierarchy, the S nodes encode the entry sets Πs and the transition probabilities σs, while the α nodes encode the exit probabilities αs. At level i, choosing policy πi ∈ Πi depends in part on state si ∈ Si, but executing πi ∈ Πi, by choosing policy πi−1 ∈ Πi−1, depends in part on state si−1 ∈ Si−1. That is, choosing a policy at level i depends on the state at level i, while executing a policy at level i depends on the state at level i − 1.
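The policy and state tuples defined above map directly onto simple containers. The following Python sketch is our own illustration; the names, the string state labels, and the simplification of αs to a scalar are all assumptions, not part of the paper:

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

# Illustrative containers for the DAN tuples; names and types are ours.

@dataclass
class AbstractPolicy:
    """pi_i = <S_pi, B_pi, beta_pi, sigma_pi> at level i."""
    level: int
    selection_set: FrozenSet[str]        # S_pi: states where pi_i may execute
    termination_set: FrozenSet[str]      # B_pi: states where pi_i may terminate
    termination_prob: Dict[str, float]   # beta_pi: B_pi -> (0, 1]
    selection_prob: Dict[Tuple, float]   # sigma_pi: (parents, state) -> [0, 1]

    def beta(self, state: str) -> float:
        # Termination states that are not also selection states
        # terminate with probability one, as in the definition above.
        if state in self.termination_set and state not in self.selection_set:
            return 1.0
        return self.termination_prob.get(state, 0.0)

@dataclass
class AbstractState:
    """s_j = <Pi_s, alpha_s, sigma_s> at level j (alpha_s simplified)."""
    level: int
    entry_set: FrozenSet[str]            # Pi_s: policies that enter s_j
    exit_prob: float                     # alpha_s, scalar for illustration
    transition_prob: Dict[Tuple, float]  # sigma_s: (parents, policy, s') -> [0, 1]
```

In the DAN itself these quantities live inside the conditional probability tables of the Π, β, S, and α nodes; the dataclasses only make the tuple structure explicit.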
3. Hierarchical Reinforcement Learning

In this section, we show how DANs can be applied to the problem of automating hierarchical RL.
3.1. Approach

There are two phases to our approach. In the first phase, the DAN is constructed and trained. Like MAXQ (Dietterich, 2000), our approach requires that the number of levels, and the number of policies and states at each level, be specified. Unlike MAXQ, however, the dependencies between policies and states at different levels are unknown; instead, all policies (states) at one level are connected to all policies (states) in the adjacent levels, permitting any possible set of dependencies to be learned. Sequences of state-action pairs obtained from a mentor are then used to train the model with Expectation-Maximization (EM) (Dempster et al., 1977). Unlike other approaches for learning hierarchies, by reducing the problem to parameter estimation, all levels of the state and policy hierarchies are learned simultaneously through joint inference on the model. Learning in the first phase is on-policy; consequently, the quality of the sample trajectories used to train the model affects the quality of the policies learned.

In the second phase, the learned policies are obtained from the DAN and improved using RL. Once the DAN has been trained, the policy hierarchy is extracted. The options framework (Sutton et al., 1999) fits most naturally with the DAN policy hierarchy; changing the policy hierarchy would permit other types of policy hierarchies, such as HAMs (Parr & Russell, 1998) or MAXQ task graphs (Dietterich, 2000), to be used. An option consists of (1) a set of states in which it can be initiated, (2) a set of states in which it terminates, (3) a probability distribution over termination states for each option, and (4) a probability distribution over actions (or lower-level options) for each state. Define a 1-level option to be a policy over options. Then an i-level option is encoded within a DAN: the Π nodes encode (1), (2), and (4), while the β nodes encode (3).
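The second phase improves the extracted options with reward using SMDP Q-learning. The sketch below shows the standard SMDP Q-learning backup and an ε-greedy option choice; it is a minimal toy version with our own interfaces (a dictionary Q-table, string option names), not the paper's implementation:

```python
import random

# Minimal sketch of phase two: improving extracted options with SMDP
# Q-learning (Sutton et al., 1999). Interfaces are our own assumptions.

def smdp_q_update(Q, s, o, reward_sum, k, s_next, options,
                  alpha=0.1, gamma=0.9):
    """One SMDP Q-learning backup after option o ran for k primitive
    steps from s to s_next, accumulating the discounted reward
    reward_sum = sum_j gamma^j * r_j along the way."""
    best_next = max(Q[(s_next, o2)] for o2 in options)
    target = reward_sum + (gamma ** k) * best_next
    Q[(s, o)] += alpha * (target - Q[(s, o)])
    return Q[(s, o)]

def epsilon_greedy(Q, s, options, eps=0.01):
    """Choose an option: explore with probability eps, else greedy."""
    if random.random() < eps:
        return random.choice(options)
    return max(options, key=lambda o: Q[(s, o)])
```

The learning rate, exploration rate, and discount factor defaults mirror the values used in the experiments (α = 0.1, ε = 0.01, γ = 0.9).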
Various methods could be used, but we use reward to improve the extracted policies by applying semi-Markov decision process (SMDP) Q-learning (Sutton et al., 1999) to estimate the value function. We discuss this further in the Experiments section. We note that, like DANs, MAXQ similarly encodes state abstractions with associated policy abstractions (task decompositions); unlike DANs, however, these abstractions are hand-specified rather than learned.

3.2. Experiments

The Taxi domain (Dietterich, 2000) is used to illustrate the proposed approach. It consists of a five-by-five grid with four taxi terminals, R, G, Y, and B; see Figure 2. The goal of the agent is to pick up a passenger from one terminal and deliver her to another one (possibly the same one). There are six actions: north (N), south (S), east (E), west (W), pick up passenger (PU), and put down passenger (PD). The N, S, E, and W actions work correctly 80% of the time; 10% of the time the agent goes right, and 10% of the time the agent goes left. The agent's state consists of the taxi location (TL), the passenger location (PL), and the passenger destination (PD). Note that PL = 1 when the passenger has been picked up and PD = 1 when the passenger has been delivered.

Figure 2. Taxi grid.

We generated 1000 training sequences from a hierarchical RL mentor trained in the Taxi domain using SMDP Q-learning over hand-coded policies. Each sequence is the trajectory of states visited and actions taken in one episode as the RL mentor uses its learned policy to reach the goal; the sequences were of variable length. A learning rate of α = 0.1, an exploration rate of ε = 0.01, and a discount rate of γ = 0.9 were used. The Bayes Net Toolbox (Murphy, 2001) was used to implement and train the mAHMM and DAN models in Figure 3 using Expectation-Maximization (Dempster et al., 1977). All distributions were multinomials and, except for the β1, α0, α1, Π1, S0, and S1 distributions, were initialized randomly. For the Taxi data we set |S1| = 5, |S0| = 25, |TL| = 25, |PL| = 5, |PD| = 5, |Π1| = 6, |Π0| = 6, |α0| = 2, |α1| = 2, and |β1| = 2. For all experiments, higher-level states and policies were biased to change more slowly than lower-level states and policies; e.g., the floor you are on should not change more frequently than the room you are in. This was done by initializing the β1, α0, α1, Π1, S0, and S1 distributions as follows, where i = 0, 1 and OS^t = {TL^t, PL^t, PD^t}:

P(β1^t | Π0^t, Π1^t, S1^t) = 0.95 if β1^t = continue; 0.05 otherwise.

P(α0^t | S0^t, OS^t) = 0.95 if α0^t = continue; 0.05 otherwise.

P(α1^t | α0^t, S0^t, S1^t) = 1 if α1^t = continue and α0^t = continue; 0.5 if α0^t = end; 0 otherwise.

P(Π1^{t+1} | Π1^t, S1^t, β1^t) = 1 if β1^t = continue and Π1^{t+1} = Π1^t; 1/|Π1| if β1^t = end; 0 otherwise.

P(S0^{t+1} | Π0^t, S0^t, S1^{t+1}, α0^t, α1^t) = 1 if α0^t = continue and S0^{t+1} = S0^t; 1/|S0| if α0^t = end; 0 otherwise.

P(S1^{t+1} | Π1^t, S1^t, α1^t) = 1 if α1^t = continue and S1^{t+1} = S1^t; 1/|S1| if α1^t = end; 0 otherwise.

Figure 3. (a) 1-level mAHMM and (b) 1-level DAN for the Taxi domain; shaded nodes are observed. Note that Π0 represents actions.

Given the trained mAHMM, policies were extracted and improved using SMDP Q-learning as follows. Once an option (policy) was chosen using an ε-greedy exploration strategy, the learned probability distribution P(Π0 | Π1, TL, PL, PD) from the mAHMM was used to probabilistically select an action, π0. Given the trained DAN, policies were again extracted and improved using SMDP Q-learning. However, in order to use the learned probability distributions, we must first compute the most likely abstract state, s0, with

P(S0 | TL, PL, PD) = [ Σ_{S1} P(TL | S0) P(PL | S0) P(PD | S0) P(S1) P(S0 | S1) ] / [ Σ_{S0,S1} P(TL | S0) P(PL | S0) P(PD | S0) P(S1) P(S0 | S1) ]

Then, given the abstract policy π1 that was selected using the ε-greedy strategy, we can select an action π0 directly from the conditional probability distribution,
P(Π0 | Π1 = π1, S0 = s0). Other approaches could be used as well: for instance, the full machinery of inference could be used to ascertain how the selected action will affect the predicted distributions over future states. We note that for both the mAHMM and the DAN, we permitted options (policies) to be interrupted during SMDP Q-learning; this prevented looping behaviour due to bad policies.

3.3. Results

Figure 4 compares how well the learned mAHMM and DAN policies do against the hand-coded policies and a flat Q-learning agent. Within roughly the first fifty episodes, the SMDP Q-learning agent using the learned policies does as well as or better than the SMDP Q-learning agent using the hand-coded policies. We note that the performance of the agent is slightly noisier when it uses the mAHMM policies rather than the DAN policies.

Figure 4. (a) 1-level mAHMM results and (b) 1-level DAN results, averaged over 10 RL runs: number of time steps to goal per episode for (A) the learned policies, (B) flat Q-learning, and (C) SMDP Q-learning with hand-coded policies.

Figure 5. Level 1 policies and states for two sample sequences from the Taxi data. The TL, PL, and PD state values and Π0 actions are shown for both models; the S0 abstract states are shown for the DAN.
Figure 5 shows the probability of each level 1 policy and state for two sample Taxi sequences, for both the mAHMM and the DAN. The mAHMM has difficulty identifying a single most likely policy with high probability, while the DAN is able to identify a unique most likely policy with high probability. The mAHMM performs particularly poorly on the second sequence, never identifying a single policy as most likely for more than a couple of timesteps. Note that the mAHMM's difficulty is not a consequence of the policies themselves being poorly learned. Rather, because the mAHMM has learned every policy over the entire state space (due to the structure of the model), all policies are equally good, so any can be used; in essence, only a single global policy has been learned. The problem is that such a policy cannot be reused, unlike the more local policies learned by the DAN. In particular, the DAN graphs in Figure 5 show that Policy 1 is used for part or all of both sequences.
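The abstract-state inference used for the DAN in Section 3.2 reduces to a small Bayes computation over the learned conditional probability tables. The following plain-Python sketch is our own; the table layouts (lists indexed by state and observation values) are an assumption for illustration:

```python
# Sketch of the abstract-state posterior P(S0 | TL, PL, PD) used for the
# DAN; the most likely abstract state s0 is its argmax. Table layouts
# below are our own assumption, not the paper's data structures.

def s0_posterior(p_tl, p_pl, p_pd, p_s1, p_s0_given_s1, tl, pl, pd):
    """Return P(S0 | TL=tl, PL=pl, PD=pd) as a list over S0 values.

    p_tl[s0][tl], p_pl[s0][pl], p_pd[s0][pd] : observation CPTs given S0
    p_s1[s1]                                 : prior over S1
    p_s0_given_s1[s1][s0]                    : abstract-state CPT
    """
    n_s0 = len(p_tl)
    # Inner sum over S1 of P(S1) P(S0 | S1), as in the numerator.
    prior_s0 = [sum(p_s1[s1] * p_s0_given_s1[s1][s0]
                    for s1 in range(len(p_s1)))
                for s0 in range(n_s0)]
    unnorm = [p_tl[s0][tl] * p_pl[s0][pl] * p_pd[s0][pd] * prior_s0[s0]
              for s0 in range(n_s0)]
    z = sum(unnorm)  # denominator: sum over both S0 and S1
    return [u / z for u in unnorm]
```

With the posterior in hand, s0 = argmax and the action is drawn from P(Π0 | Π1 = π1, S0 = s0), as described in the text.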
4. Conclusions

We have presented a general method for automating hierarchical reinforcement learning. The first phase of our method trains a hierarchical graphical model; the second phase uses the learned policies in a hierarchical reinforcement learning agent. Assuming the graphical model in the first phase encodes the appropriate policy structure, other hierarchical reinforcement learning methods besides SMDP Q-learning can be used for the second phase. In future work we would like to incorporate both phases into an actor-critic architecture. The main disadvantages of our approach are the cost of Expectation-Maximization and the need to specify the number of levels and the number of parameters within each level. In future work we plan to explore methods for approximate inference and model selection as applied to dynamic abstraction networks.
Acknowledgments

This work was supported in part by the National Science Foundation under grant ECS-0218125. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Victoria Manfredi was supported by a National Science Foundation graduate research fellowship.
References

Abbeel, P., & Ng, A. (2004). Apprenticeship learning via inverse reinforcement learning. ICML'04 (pp. 506–513).

Barto, A., & Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Special Issue on Reinforcement Learning, Discrete Event Systems Journal, 13, 41–77.

Bui, H., Venkatesh, S., & West, G. (2002). Policy recognition in the abstract hidden Markov model. Journal of Artificial Intelligence Research, 17, 451–499.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.

Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227–303.

Fine, S., Singer, Y., & Tishby, N. (1998). The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32, 41–62.

Hengst, B. (2002). Discovering hierarchy in reinforcement learning with HEXQ. ICML'02 (pp. 243–250).

Manfredi, V., & Mahadevan, S. (2005). Dynamic abstraction networks. Technical Report 0533, Dept. of Computer Science, University of Massachusetts Amherst.

Mannor, S., Menache, I., Hoze, A., & Klein, U. (2004). Dynamic abstraction in reinforcement learning via clustering. ICML'04 (pp. 751–758).

McGovern, A., & Barto, A. (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. ICML'01 (pp. 361–368).

Murphy, K. (2001). The Bayes Net Toolbox for Matlab. Computing Science and Statistics, 33.

Murphy, K., & Paskin, M. (2001). Linear time inference in hierarchical HMMs. NIPS'01.

Parr, R., & Russell, S. (1998). Reinforcement learning with hierarchies of machines. NIPS'98.

Precup, D. (2000). Temporal abstraction in reinforcement learning. Doctoral dissertation, Department of Computer Science, University of Massachusetts, Amherst.

Price, B., & Boutilier, C. (2003). Accelerating reinforcement learning through implicit imitation. Journal of Artificial Intelligence Research, 19, 569–629.

Samejima, K., Doya, K., Ueda, Y., & Kimura, M. (2004). Estimating internal variables and parameters of a learning agent by a particle filter. NIPS'04.

Şimşek, Ö., & Barto, A. G. (2004). Using relative novelty to identify useful temporal abstractions in reinforcement learning. ICML'04 (pp. 751–758).

Sutton, R., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181–211.